Journal of Statistical Software 2022-07-13T02:23:48+00:00 Editorial Office Open Journal Systems The Journal of Statistical Software publishes articles on statistical software along with the source code of the software itself and replication code for all empirical results. Learning Base R (2nd Edition) 2022-07-13T02:23:48+00:00 James E. Helmreich 2022-07-13T00:00:00+00:00 Copyright (c) 2022 James E. Helmreich Python and R for the Modern Data Scientist 2022-07-13T02:15:16+00:00 Christopher J. Lortie 2022-07-13T00:00:00+00:00 Copyright (c) 2022 Christopher J. Lortie modelsummary: Data and Model Summaries in R 2021-10-16T20:58:44+00:00 Vincent Arel-Bundock <p>modelsummary is a package to summarize data and statistical models in R. It supports over one hundred types of models out-of-the-box, and allows users to report the results of those models side-by-side in a table, or in coefficient plots. It makes it easy to execute common tasks such as computing robust standard errors, adding significance stars, and manipulating coefficient and model labels. Beyond model summaries, the package also includes a suite of tools to produce highly flexible data summary tables, such as dataset overviews, correlation matrices, (multi-level) cross-tabulations, and balance tables (also known as "Table 1"). The appearance of the tables produced by modelsummary can be customized using external packages such as kableExtra, gt, flextable, or huxtable; the plots can be customized using ggplot2. Tables can be exported to many output formats, including HTML, LaTeX, Text/Markdown, Microsoft Word, Powerpoint, Excel, RTF, PDF, and image files. Tables and plots can be embedded seamlessly in rmarkdown, knitr, or Sweave dynamic documents. 
The modelsummary package is designed to be simple, robust, modular, and extensible.</p> 2022-07-11T00:00:00+00:00 Copyright (c) 2022 Vincent Arel-Bundock stringi: Fast and Portable Character String Processing in R 2021-09-23T19:56:22+00:00 Marek Gagolewski <p>Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician's or data scientist's repertoire to complement their numerical computing and data wrangling skills.</p> 2022-07-11T00:00:00+00:00 Copyright (c) 2022 Marek Gagolewski evgam: An R Package for Generalized Additive Extreme Value Models 2021-01-30T12:00:16+00:00 Benjamin D. Youngman <p>This article introduces the R package evgam. The package provides functions for fitting extreme value distributions. These include the generalized extreme value and generalized Pareto distributions. The former can also be fitted through a point process representation. Package evgam supports quantile regression via the asymmetric Laplace distribution, which can be useful for estimating high thresholds, sometimes used to discriminate between extreme and non-extreme values. The main addition of package evgam is to let extreme value distribution parameters have generalized additive model forms, the smoothness of which can be objectively estimated using Laplace's method. Illustrative examples fitting various distributions with various specifications are given. 
These include daily precipitation accumulations for part of Colorado, US, used to illustrate spatial models, and daily maximum temperatures for Fort Collins, Colorado, US, used to illustrate temporal models.</p> 2022-07-11T00:00:00+00:00 Copyright (c) 2022 Benjamin D. Youngman scikit-mobility: A Python Library for the Analysis, Generation, and Risk Assessment of Mobility Data 2021-06-03T09:08:05+00:00 Luca Pappalardo Filippo Simini Gianni Barlacchi Roberto Pellungrini <p>The last decade has witnessed the emergence of massive mobility datasets, such as tracks generated by GPS devices, call detail records, and geo-tagged posts from social media platforms. These datasets have fostered a vast scientific production on various applications of mobility analysis, ranging from computational epidemiology to urban planning and transportation engineering. A strand of literature addresses data cleaning issues related to raw spatiotemporal trajectories, while the second line of research focuses on discovering the statistical "laws" that govern human movements. A significant effort has also been put on designing algorithms to generate synthetic trajectories able to reproduce, realistically, the laws of human mobility. Last but not least, a line of research addresses the crucial problem of privacy, proposing techniques to perform the re-identification of individuals in a database. A view on state-of-the-art cannot avoid noticing that there is no statistical software that can support scientists and practitioners with all the aspects mentioned above of mobility data analysis. In this paper, we propose scikit-mobility, a Python library that has the ambition of providing an environment to reproduce existing research, analyze mobility data, and simulate human mobility habits. scikit-mobility is efficient and easy to use as it extends pandas, a popular Python library for data analysis. 
Moreover, scikit-mobility provides the user with many functionalities, from visualizing trajectories to generating synthetic data, from analyzing statistical patterns to assessing the privacy risk related to the analysis of mobility datasets.</p> 2022-07-11T00:00:00+00:00 Copyright (c) 2022 Luca Pappalardo, Filippo Simini, Gianni Barlacchi, Roberto Pellungrini spNNGP R Package for Nearest Neighbor Gaussian Process Models 2021-02-17T15:48:36+00:00 Andrew O. Finley Abhirup Datta Sudipto Banerjee <p>This paper describes and illustrates functionality of the spNNGP R package. The package provides a suite of spatial regression models for Gaussian and non-Gaussian point-referenced outcomes that are spatially indexed. The package implements several Markov chain Monte Carlo (MCMC) and MCMC-free nearest neighbor Gaussian process (NNGP) models for inference about large spatial data. Non-Gaussian outcomes are modeled using an NNGP Pólya-Gamma latent variable. OpenMP parallelization options are provided to take advantage of multiprocessor systems. Package features are illustrated using simulated and real data sets.</p> 2022-07-11T00:00:00+00:00 Copyright (c) 2022 Andrew O. Finley, Abhirup Datta, Sudipto Banerjee Feller-Pareto and Related Distributions: Numerical Implementation and Actuarial Applications 2021-05-20T03:28:02+00:00 Christophe Dutang Vincent Goulet Nicholas Langevin <p>Actuaries model insurance claim amounts using heavy tailed probability distributions. They routinely need to evaluate quantities related to these distributions such as quantiles in the far right tail, moments or limited moments. Furthermore, actuaries often resort to simulation to solve otherwise intractable risk evaluation problems. The paper discusses our implementation of support functions for the Feller-Pareto distribution for the R package actuar. 
The Feller-Pareto defines a large family of heavy tailed distributions encompassing the transformed beta family and many variants of the Pareto distribution.</p> 2022-07-16T00:00:00+00:00 Copyright (c) 2022 Christophe Dutang, Vincent Goulet, Nicholas Langevin Hierarchical Clustering with Contiguity Constraint in R 2021-08-23T21:01:53+00:00 Guillaume Guénard Pierre Legendre <p>This article presents a new implementation of hierarchical clustering for the R language that allows one to apply spatial or temporal contiguity constraints during the clustering process. The need for contiguity constraint arises, for instance, when one wants to partition a map into different domains of similar physical conditions, identify discontinuities in time series, group regional administrative units with respect to their performance, and so on. To increase computation efficiency, we programmed the core functions in plain C. The result is a new R function, constr.hclust, which is distributed in package adespatial. The program implements the general agglomerative hierarchical clustering algorithm described by Lance and Williams (1966; 1967), with the particularity of allowing only clusters that are contiguous in geographic space or along time to fuse at any given step. Contiguity can be defined with respect to space or time. Information about spatial contiguity is provided by a connection network among sites, with edges describing the links between connected sites. Clustering with a temporal contiguity constraint is also known as chronological clustering. Information on temporal contiguity can be implicitly provided as the rank positions of observations in the time series. The implementation was mirrored on that found in the hierarchical clustering function hclust of the standard R package stats (R Core Team 2022). We transcribed that function from Fortran to C and added the functionality to apply constraints when running the function. The implementation is efficient. 
It is limited mainly by input/output access as massive amounts of memory are potentially needed to store copies of the dissimilarity matrix and update its elements when analyzing large problems. We provided R computer code for plotting results for numbers of clusters.</p> 2022-09-04T00:00:00+00:00 Copyright (c) 2022 Guillaume Guénard, Pierre Legendre On the Programmatic Generation of Reproducible Documents 2021-06-13T21:10:28+00:00 Michael Kane Xun (Tony) Jiang Simon Urbanek <p>Reproducible document standards, like R Markdown, facilitate the programmatic creation of documents whose content is itself programmatically generated. While programmatic content alone may not be sufficient for a rendered document, since it does not include prose (content generated by an author to provide context, a narrative, etc.), programmatic generation can provide substantial efficiencies for structuring and constructing documents. This paper explores the programmatic generation of reproducible documents by distinguishing components that can be created by computational means from those requiring human-generation, providing guidelines for the generation of these documents, and identifying a use case in clinical trial reporting. These concepts and use case are illustrated through the listdown package for the R programming environment, which is currently available on the Comprehensive R Archive Network.</p> 2022-07-20T00:00:00+00:00 Copyright (c) 2022 Michael Kane, Xun Jiang, Simon Urbanek Automatic Identification and Forecasting of Structural Unobserved Components Models with UComp 2021-08-04T00:12:53+00:00 Diego J. Pedregal <p>UComp is a powerful library for building unobserved components models, useful for forecasting and other important operations, such as de-trending, cycle analysis, seasonal adjustment, signal extraction, etc. 
One of the most outstanding features that makes UComp unique among its class of related software implementations is that models may be built automatically by identification algorithms (three versions are available). These algorithms select the best model among many possible combinations. Another relevant feature is that it is coded in C++, opening the door to link it to different popular and widely used environments, like R, MATLAB, Octave, Python, etc. The implemented models for the components are more general than the usual ones in the field of unobserved components modeling, including different types of trend, cycle, seasonal and irregular components, input variables and outlier detection. The automatic character of the algorithms required the development of many complementary algorithms to control performance and make it applicable to as many different time series as possible. The library is open source and available in different formats in public repositories. The performance of the library is illustrated working on real data in several varied examples.</p> 2022-08-17T00:00:00+00:00 Copyright (c) 2022 Diego J. Pedregal exuber: Recursive Right-Tailed Unit Root Testing with R 2021-09-21T16:37:52+00:00 Kostas Vasilopoulos Efthymios Pavlidis Enrique Martínez-García <p>This paper introduces the R package exuber for testing and date-stamping periods of mildly explosive dynamics (exuberance) in time series. The package computes test statistics for the supremum augmented Dickey-Fuller test (SADF) of Phillips, Wu, and Yu (2011), the generalized SADF (GSADF) of Phillips, Shi, and Yu (2015a,b), and the panel GSADF proposed by Pavlidis, Yusupova, Paya, Peel, Martínez-García, Mack, and Grossman (2016); generates finite-sample critical values based on Monte Carlo and bootstrap methods; and implements the corresponding date-stamping procedures. 
The recursive least-squares algorithm that we introduce in our implementation of these techniques utilizes the matrix inversion lemma and in that way achieves significant speed improvements. We illustrate the speed gains in a simulation experiment, and provide illustrations of the package using artificial series and a panel on international house prices.</p> 2022-08-19T00:00:00+00:00 Copyright (c) 2022 Kostas Vasilopoulos, Efthymios Pavlidis, Enrique Martínez-García Blang: Bayesian Declarative Modeling of General Data Structures and Inference via Algorithms Based on Distribution Continua 2021-03-30T04:33:01+00:00 Alexandre Bouchard-Côté Kevin Chern Davor Cubranic Sahand Hosseini Justin Hume Matteo Lepur Zihui Ouyang Giorgio Sgarbi <p>Consider a Bayesian inference problem where a variable of interest does not take values in a Euclidean space. These "non-standard" data structures are in reality fairly common. They are frequently used in problems involving latent discrete factor models, networks, and domain specific problems such as sequence alignments and reconstructions, pedigrees, and phylogenies. In principle, Bayesian inference should be particularly well-suited in such scenarios, as the Bayesian paradigm provides a principled way to obtain confidence assessment for random variables of any type. However, much of the recent work on making Bayesian analysis more accessible and computationally efficient has focused on inference in Euclidean spaces. In this paper, we introduce Blang, a domain specific language and library aimed at bridging this gap. Blang allows users to perform Bayesian analysis on arbitrary data types while using a declarative syntax similar to the popular family of probabilistic programming languages, BUGS. Blang is augmented with intuitive language additions to create data types of the user's choosing. 
To perform inference at scale on such arbitrary state spaces, Blang leverages recent advances in sequential Monte Carlo and non-reversible Markov chain Monte Carlo methods.</p> 2022-08-23T00:00:00+00:00 Copyright (c) 2022 Alexandre Bouchard-Côté, Kevin Chern, Davor Cubranic, Sahand Hosseini, Justin Hume, Matteo Lepur, Zihui Ouyang, Giorgio Sgarbi [RETRACTED ARTICLE] irtplay: An R Package for Unidimensional Item Response Theory Modeling 2022-03-06T17:22:52+00:00 Hwanggyu Lim Craig S. Wells <p>The article and accompanying software package have been retracted by the authors, and hence removed from the journal, because the software violated the copyright of a proprietary software and the intellectual property of a third party.</p> 2022-08-16T00:00:00+00:00 Copyright (c) 2022 Hwanggyu Lim, Craig S. Wells Robust Mediation Analysis: The R Package robmed 2022-04-04T16:42:59+00:00 Andreas Alfons Nüfer Y. Ateş Patrick J. F. Groenen <p>Mediation analysis is one of the most widely used statistical techniques in the social, behavioral, and medical sciences. Mediation models allow one to study how an independent variable affects a dependent variable indirectly through one or more intervening variables, which are called mediators. The analysis is often carried out via a series of linear regressions, in which case the indirect effects can be computed as products of coefficients from those regressions. Statistical significance of the indirect effects is typically assessed via a bootstrap test based on ordinary least-squares estimates. However, this test is sensitive to outliers or other deviations from normality assumptions, which poses a serious threat to empirical testing of theory about mediation mechanisms. The R package robmed implements a robust procedure for mediation analysis based on the fast-and-robust bootstrap methodology for robust regression estimators, which yields reliable results even when the data deviate from the usual normality assumptions. 
Various other procedures for mediation analysis are included in package robmed as well. Moreover, robmed introduces a new formula interface that allows one to specify mediation models with a single formula, and provides various plots for diagnostics or visual representation of the results.</p> 2022-08-17T00:00:00+00:00 Copyright (c) 2022 Andreas Alfons, Nüfer Y. Ateş, Patrick J. F. Groenen HighFrequencyCovariance: A Julia Package for Estimating Covariance Matrices Using High Frequency Financial Data 2021-12-10T21:54:53+00:00 Stuart Baumann Margaryta Klymak <p>High frequency data typically exhibit asynchronous trading and microstructure noise, which can bias the covariances estimated by standard estimators. While a number of specialized estimators have been proposed, they have had limited availability in open source software. HighFrequencyCovariance is the first Julia package which implements specialized estimators for volatility, correlation and covariance using high frequency financial data. It also implements complementary algorithms for matrix regularization. This paper presents the issues associated with exploiting high frequency financial data and describes the volatility, covariance and regularization algorithms that have been implemented. We then demonstrate the use of the package using foreign exchange market tick data to estimate the covariance of the exchange rates between different currencies. We also perform a Monte Carlo experiment, which shows the accuracy gains that are possible over simpler covariance estimation techniques.</p> 2022-08-15T00:00:00+00:00 Copyright (c) 2022 Stuart Baumann, Margaryta Klymak Bambi: A Simple Interface for Fitting Bayesian Linear Models in Python 2021-09-30T16:03:50+00:00 Tomás Capretto Camen Piho Ravin Kumar Jacob Westfall Tal Yarkoni Osvaldo A. Martin <p>The popularity of Bayesian statistical methods has increased dramatically in recent years across many research areas and industrial applications. 
This is the result of a variety of methodological advances with faster and cheaper hardware as well as the development of new software tools. Here we introduce an open source Python package named Bambi (BAyesian Model Building Interface) that is built on top of the PyMC probabilistic programming framework and the ArviZ package for exploratory analysis of Bayesian models. Bambi makes it easy to specify complex generalized linear hierarchical models using a formula notation similar to those found in R. We demonstrate Bambi's versatility and ease of use with a few examples spanning a range of common statistical models including multiple regression, logistic regression, and mixed-effects modeling with crossed group specific effects. Additionally we discuss how automatic priors are constructed. Finally, we conclude with a discussion of our plans for the future development of Bambi.</p> 2022-08-15T00:00:00+00:00 Copyright (c) 2022 Tomás Capretto, Camen Piho, Ravin Kumar, Jacob Westfall, Tal Yarkoni, Osvaldo A. Martin Spbsampling: An R Package for Spatially Balanced Sampling 2021-09-07T14:07:26+00:00 Francesco Pantalone Roberto Benedetti Federica Piersimoni <p>The basic idea underpinning the theory of spatially balanced sampling is that units closer to each other provide less information about a target of inference than units farther apart. Therefore, it should be desirable to select a sample well spread over the population of interest, or a spatially balanced sample. This situation is easily understood in, among many others, environmental, geological, biological, and agricultural surveys, where usually the main feature of the population is to be geo-referenced. Since traditional sampling designs generally do not exploit the spatial features and since it is desirable to take into account the information regarding spatial dependence, several sampling designs have been developed in order to achieve this objective. 
In this paper, we present the R package Spbsampling, which provides functions to perform three specific sampling designs that pursue the aforementioned purpose. In particular, these sampling designs achieve spatially balanced samples using a summary index of the distance matrix. In this sense, the applicability of the package is much wider, as a distance matrix can be defined for units according to variables other than geographical coordinates.</p> 2022-08-24T00:00:00+00:00 Copyright (c) 2022 Francesco Pantalone, Roberto Benedetti, Federica Piersimoni plot3logit: Ternary Plots for Interpreting Trinomial Regression Models 2021-06-07T19:11:17+00:00 Flavio Santi Maria Michela Dickson Giuseppe Espa Diego Giuliani <p>This paper presents the R package plot3logit which enables the covariate effects of trinomial regression models to be represented graphically by means of a ternary plot. The aim of the plot is to aid the interpretation of regression coefficients in terms of the effects that a change in values of regressors has on the probability distribution of the dependent variable. Such changes may involve either a single regressor, or a group of them (composite changes), and the package permits both cases to be handled in a user-friendly way. Moreover, plot3logit can compute and draw confidence regions of the effects of covariate changes and enables multiple changes and profiles to be represented and compared jointly. Upstream and downstream compatibility makes the package able to work with other R packages or applications other than R.</p> 2022-07-19T00:00:00+00:00 Copyright (c) 2022 Flavio Santi, Maria Michela Dickson, Giuseppe Espa, Diego Giuliani