Journal of Statistical Software

The R Package tipsae: Tools for Mapping Proportions and Indicators on the Unit Interval

Silvia De Nicolò, Aldo Gardini — Wed, 27 Mar 2024 00:00:00 +0000

The tipsae package implements a set of small area estimation tools for mapping proportions and indicators defined on the unit interval. It provides small area models defined at the area level, including the classical beta regression, zero- and/or one-inflated beta, and flexible beta models, possibly accounting for spatial and/or temporal dependency structures. The models, developed within a Bayesian framework, are estimated through the Stan language, allowing fast estimation and customized parallel computing. The additional features of the tipsae package, such as diagnostics, visualization and exporting functions as well as variance smoothing and benchmarking functions, improve the user experience through the entire process of estimation, validation and outcome presentation. A shiny application with a user-friendly interface further eases the implementation of Bayesian models for small area analysis.

The R Package markets: Estimation Methods for Markets in Equilibrium and Disequilibrium

Pantelis Karapanagiotis — Sun, 18 Feb 2024 00:00:00 +0000

Market models constitute a significant cornerstone of empirical applications in business, industrial organization, and policymaking macroeconomics. The econometric literature proposes various estimation methods for markets in equilibrium, which entail a market-clearing structural condition, and disequilibrium, which are described based on a structural short-side rule. Nonetheless, maximum likelihood estimations of such models are computationally demanding, and software providing simple, out-of-the-box methods for estimating them is scarce. Therefore, applications rely on project-specific implementations for estimating these models, which hinders research reproducibility and result comparability. This article presents the R package markets, which provides a common interface with generic functionality simplifying the estimation of models for markets in equilibrium and disequilibrium. The package specializes in estimating demanded, supplied, and aggregated market quantities and absolute, normalized, and relative market shortages. Its functionality is exemplified via an empirical application using a classic dataset of United States credit for housing starts. Moreover, the article details the scope and design of the implementation and provides statistical measurements of the computational performance of its estimation functionality gathered via large-scale benchmarking simulations. The markets package is free software distributed under the Expat license as part of the R software ecosystem. It comprises a set of estimation and analysis tools that are not directly available from either alternative R packages or other statistical software projects.
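
To make the two structural conditions concrete, the following minimal Python sketch contrasts a market-clearing outcome with the short-side rule, under which the observed quantity is the smaller of the demanded and supplied quantities. It is only an illustration and does not use the markets package; the function names and the linear demand and supply schedules are hypothetical.

```python
def short_side(demanded, supplied):
    """Observed quantity under the short-side rule: the smaller market side trades."""
    return min(demanded, supplied)


def absolute_shortage(demanded, supplied):
    """Excess demand; positive values indicate a shortage, negative a surplus."""
    return demanded - supplied


# Hypothetical linear demand and supply schedules evaluated at a few prices.
for price in (1.0, 2.0, 3.0):
    demanded = 10.0 - 2.0 * price
    supplied = 1.0 + 1.0 * price
    print(price, short_side(demanded, supplied), absolute_shortage(demanded, supplied))
```

In this toy setup the market clears only at a price of 3, where the absolute shortage is zero; at lower prices the supplied quantity is the short side.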

DoubleML: An Object-Oriented Implementation of Double Machine Learning in R

Philipp Bach, Malte S. Kurz, Victor Chernozhukov, Martin Spindler, Sven Klaassen — Sun, 18 Feb 2024 00:00:00 +0000

The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML offers high flexibility in model specification and makes the package easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.
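
The interplay of sample splitting and an orthogonal score can be sketched in a few lines of Python for a partially linear model. This is a simplified illustration under stated assumptions, not the DoubleML API: a one-regressor least-squares fit (the hypothetical `fit_line`) stands in for the machine learning learners, the data are simulated, and all names are invented for the example.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y on a single regressor x; returns a prediction function."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value


def dml_plr(x, d, y, n_folds=5, learner=fit_line):
    """Cross-fitted partialling-out estimate of theta in y = theta*d + g(x) + noise."""
    n = len(y)
    res_y = [0.0] * n
    res_d = [0.0] * n
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        # Sample splitting: nuisance functions are fitted on the training folds only.
        predict_y = learner([x[i] for i in train], [y[i] for i in train])
        predict_d = learner([x[i] for i in train], [d[i] for i in train])
        for i in test:
            res_y[i] = y[i] - predict_y(x[i])
            res_d[i] = d[i] - predict_d(x[i])
    # Orthogonal (partialling-out) score: regress outcome residuals on treatment residuals.
    return sum(rd * ry for rd, ry in zip(res_d, res_y)) / sum(rd * rd for rd in res_d)


# Simulated data with a known treatment effect (assumed values for illustration).
rng = random.Random(0)
true_theta = 0.5
x = [rng.gauss(0, 1) for _ in range(2000)]
d = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [true_theta * di + 1.5 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]
theta_hat = dml_plr(x, d, y)
print(round(theta_hat, 3))
```

Passing a different `learner` with the same interface (training inputs, training targets, returns a predictor) would swap the nuisance estimator without touching the cross-fitting logic.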

gcimpute: A Package for Missing Data Imputation

Yuxuan Zhao, Madeleine Udell — Sun, 18 Feb 2024 00:00:00 +0000

This article introduces the Python package gcimpute for missing data imputation. Package gcimpute can impute missing data with many different variable types, including continuous, binary, ordinal, count, and truncated values, by modeling data as samples from a Gaussian copula model. This semiparametric model learns the marginal distribution of each variable to match the empirical distribution, yet describes the interactions between variables with a joint Gaussian that enables fast inference, imputation with confidence intervals, and multiple imputation. The package also provides specialized extensions to handle large datasets (with complexity linear in the number of observations) and streaming datasets (with online imputation). This article describes the underlying methodology and demonstrates how to use the software package.
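
The marginal step of a Gaussian copula model can be illustrated with the Python standard library alone. The sketch below maps a continuous variable with missing entries to latent Gaussian scores through a scaled empirical distribution function. It is an assumption-laden illustration rather than gcimpute's own interface: the function name is hypothetical, and ordinal, count, or truncated variables would need a different treatment.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map observed values to latent Gaussian scores via the scaled empirical CDF.

    Missing entries (None) are skipped and stay None in the output.
    """
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    std_normal = NormalDist()
    scores = []
    for v in values:
        if v is None:
            scores.append(None)
            continue
        # Empirical CDF scaled by n / (n + 1) so it stays strictly inside (0, 1).
        rank = sum(1 for o in observed if o <= v)
        scores.append(std_normal.inv_cdf(rank / (n + 1)))
    return scores


skewed = [1.0, 2.0, None, 4.0, 100.0, 3.0]
scores = normal_scores(skewed)
print(scores)
```

Because only ranks enter the transform, the extreme value 100.0 receives the same score it would if it were 5.0, which is what lets the latent joint Gaussian describe dependence separately from the marginal shapes.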

melt: Multiple Empirical Likelihood Tests in R

Eunseop Kim, Steven N. MacEachern, Mario Peruggia — Sun, 18 Feb 2024 00:00:00 +0000

Empirical likelihood enables a nonparametric, likelihood-driven style of inference without relying on assumptions frequently made in parametric models. Empirical likelihood-based tests are asymptotically pivotal and thus avoid explicit studentization. This paper presents the R package melt that provides a unified framework for data analysis with empirical likelihood methods. A collection of functions is available to perform multiple empirical likelihood tests for linear and generalized linear models in R. The package melt offers an easy-to-use interface and flexibility in specifying hypotheses and calibration methods, extending the framework to simultaneous inferences. Hypothesis testing uses a projected gradient algorithm to solve constrained empirical likelihood optimization problems. The core computational routines are implemented in C++, with OpenMP for parallel computation.

PUMP: Estimating Power, Minimum Detectable Effect Size, and Sample Size When Adjusting for Multiple Outcomes in Multi-Level Experiments

Kristen B. Hunter, Luke Miratrix, Kristin Porter — Mon, 18 Mar 2024 00:00:00 +0000

For randomized controlled trials (RCTs) with a single intervention's impact being measured on multiple outcomes, researchers often apply a multiple testing procedure (such as Bonferroni or Benjamini-Hochberg) to adjust p values. Such an adjustment reduces the likelihood of spurious findings, but also changes the statistical power, sometimes substantially. A reduction in power means a reduction in the probability of detecting effects when they do exist. This consideration is frequently ignored in typical power analyses, as existing tools do not easily accommodate the use of multiple testing procedures. We introduce the PUMP (Power Under Multiplicity Project) R package as a tool for analysts to estimate statistical power, minimum detectable effect size, and sample size requirements for multi-level RCTs with multiple outcomes. PUMP uses a simulation-based approach to flexibly estimate power for a wide variety of experimental designs, number of outcomes, multiple testing procedures, and other user choices. By assuming linear mixed effects models, we can draw directly from the joint distribution of test statistics across outcomes and thus estimate power via simulation. One of PUMP's main innovations is accommodating multiple outcomes, which are accounted for in two ways. First, power estimates from PUMP properly account for the adjustment in p values from applying a multiple testing procedure. Second, when considering multiple outcomes rather than a single outcome, different definitions of statistical power emerge. PUMP allows researchers to consider a variety of definitions of power in order to choose the most appropriate types of power for the goals of their study. The package supports a variety of commonly used frequentist multi-level RCT designs and linear mixed effects models. 
In addition to the main functionality of estimating power, minimum detectable effect size, and sample size requirements, the package allows the user to easily explore sensitivity of these quantities to changes in underlying assumptions.
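
The two multiple testing procedures named above can be written out directly, which helps show why adjustment changes power. The following Python sketch computes Bonferroni and Benjamini-Hochberg adjusted p values for a hypothetical set of four outcomes. It is not the PUMP interface, and the raw p values are made up for illustration.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjustment, returned in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With these hypothetical values, three raw p values fall below 0.05, but only one remains below 0.05 after either adjustment, which is the loss of power that a multiplicity-aware power analysis has to account for.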

Holistic Generalized Linear Models

Benjamin Schwendinger, Florian Schwendinger, Laura Vana — Sun, 18 Feb 2024 00:00:00 +0000

Holistic linear regression extends the classical best subset selection problem by adding additional constraints designed to improve the model quality. These constraints include sparsity-inducing constraints, sign-coherence constraints and linear constraints. The R package holiglm provides functionality to model and fit holistic generalized linear models. By making use of state-of-the-art mixed-integer conic solvers, the package can reliably solve generalized linear models for Gaussian, binomial and Poisson responses with a multitude of holistic constraints. The high-level interface simplifies the constraint specification and can be used as a drop-in replacement for the stats::glm() function.

salmon: A Symbolic Linear Regression Package for Python

Alex Boyd, Dennis L. Sun — Thu, 28 Mar 2024 00:00:00 +0000

One of the most attractive features of R is its linear modeling capabilities. We describe a Python package, salmon, that brings the best of R's linear modeling functionality to Python in a Pythonic way - by providing composable objects for specifying and fitting linear models. This object-oriented design also enables other features that enhance ease of use, such as automatic visualizations and intelligent model building.

Modeling Nonstationary Financial Volatility with the R Package tvgarch

Susana Campos-Martins, Genaro Sucarrat — Mon, 08 Apr 2024 00:00:00 +0000

Certain events can cause the volatility structure of financial returns to change, making it nonstationary. Models of time-varying conditional variance such as generalized autoregressive conditional heteroscedasticity (GARCH) models usually assume stationarity. However, this assumption can be inappropriate and volatility predictions can fail in the presence of structural changes in the unconditional variance. To overcome this problem, in the time-varying (TV-)GARCH model, the GARCH parameters are allowed to vary smoothly over time by assuming not only the conditional but also the unconditional variance to be time-varying. In this paper, we show how useful the R package tvgarch (Campos-Martins and Sucarrat 2023) can be for modeling nonstationary volatility in financial empirical applications. The functions for simulating, testing and estimating TV-GARCH-X models, where additional covariates can be included, are implemented in both univariate and multivariate settings.
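
A smoothly changing unconditional variance can be illustrated by simulation. The Python sketch below assumes a multiplicative form that is common in the TV-GARCH literature, with variance g_t * h_t, where h_t follows a GARCH(1,1) recursion and g_t moves between two levels through a logistic function of rescaled time. This form, the parameter values, and all function names are assumptions made for the example; the sketch does not reproduce the tvgarch interface.

```python
import math
import random


def smooth_transition(s, gamma, c):
    """Logistic transition in rescaled time s = t / T, moving from 0 to 1 around c."""
    return 1.0 / (1.0 + math.exp(-gamma * (s - c)))


def simulate_tv_garch(n, omega, alpha, beta, delta0, delta1, gamma, c, seed=0):
    """Simulate returns with variance g_t * h_t (assumed multiplicative form)."""
    rng = random.Random(seed)
    h = omega / (1.0 - alpha - beta)  # start the GARCH part at its stationary level
    phi_prev = 0.0                    # previous return divided by sqrt(g)
    returns = []
    for t in range(n):
        if t > 0:
            # GARCH(1,1) recursion on the rescaled returns.
            h = omega + alpha * phi_prev ** 2 + beta * h
        # Smoothly time-varying unconditional component.
        g = delta0 + delta1 * smooth_transition(t / n, gamma, c)
        phi = math.sqrt(h) * rng.gauss(0.0, 1.0)
        returns.append(math.sqrt(g) * phi)
        phi_prev = phi
    return returns


y = simulate_tv_garch(4000, omega=0.1, alpha=0.1, beta=0.8,
                      delta0=1.0, delta1=3.0, gamma=25.0, c=0.5)
first = sum(v * v for v in y[:2000]) / 2000
second = sum(v * v for v in y[2000:]) / 2000
print(first < second)
```

Comparing the average squared return in the two halves of the sample shows the shift in the unconditional level that a stationary GARCH model would not capture.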

Modeling Big, Heterogeneous, Non-Gaussian Spatial and Spatio-Temporal Data Using FRK

Matthew Sainsbury-Dale, Andrew Zammit-Mangion, Noel Cressie — Mon, 08 Apr 2024 00:00:00 +0000

Non-Gaussian spatial and spatio-temporal data are becoming increasingly prevalent, and their analysis is needed in a variety of disciplines. FRK is an R package for spatial and spatio-temporal modeling and prediction with very large data sets that, to date, has only supported linear process models and Gaussian data models. In this paper, we describe a major upgrade to FRK that allows for non-Gaussian data to be analyzed in a generalized linear mixed model framework. These vastly more general spatial and spatio-temporal models are fitted using the Laplace approximation via the software TMB. The existing functionality of FRK is retained with this advance into non-Gaussian models; in particular, it allows for automatic basis-function construction, it can handle both point-referenced and areal data simultaneously, and it can predict process values at any spatial support from these data. This new version of FRK also allows for the use of a large number of basis functions when modeling the spatial process, and thus it is often able to achieve more accurate predictions than previous versions of the package in a Gaussian setting. We demonstrate innovative features in this new version of FRK, highlight its ease of use, and compare it to alternative packages using both simulated and real data sets.

CRTFASTGEEPWR: A SAS Macro for Power of Generalized Estimating Equations Analysis of Multi-Period Cluster Randomized Trials with Application to Stepped Wedge Designs

Ying Zhang, John S. Preisser, Fan Li, Elizabeth L. Turner, Paul J. Rathouz — Wed, 27 Mar 2024 00:00:00 +0000

Multi-period cluster randomized trials (CRTs) are increasingly used for the evaluation of interventions delivered at the group level. While generalized estimating equations (GEE) are commonly used to provide population-averaged inference in CRTs, there is a lack of general methods and statistical software tools for power calculation based on multi-parameter, within-cluster correlation structures suitable for multi-period CRTs that can accommodate both complete and incomplete designs. A computationally fast, nonsimulation procedure for determining statistical power is described for the GEE analysis of complete and incomplete multi-period cluster randomized trials. The procedure is implemented via a SAS macro, CRTFASTGEEPWR, which is applicable to binary, count and continuous responses and several correlation structures in multi-period CRTs. The SAS macro is illustrated in the power calculation of two complete and two incomplete stepped wedge cluster randomized trial scenarios under different specifications of marginal mean model and within-cluster correlation structure. The proposed GEE power method is quite general, as demonstrated in the SAS macro with numerous input options. The power procedure and macro can also be used in the planning of parallel and crossover CRTs in addition to cross-sectional and closed cohort stepped wedge trials.