Implementing Reproducible Research

Victoria Stodden, Friedrich Leisch, Roger D. Peng

CRC Press, 2014

ISBN: 978-1-4665-6159-5

XML and Web Technologies for Data Sciences with R

Deborah Nolan and Duncan Temple Lang

Springer-Verlag, 2014

ISBN: 978-1-4614-7899-7

Abstract:

In this paper we describe the main features of the Bergm package for the open-source R software which provides a comprehensive framework for Bayesian analysis of exponential random graph models: tools for parameter estimation, model selection and goodness-of- fit diagnostics. We illustrate the capabilities of this package describing the algorithms through a tutorial analysis of three network datasets.

]]>Abstract:

Commonly used classification and regression tree methods like the CART algorithm are recursive partitioning methods that build the model in a forward stepwise search. Although this approach is known to be an efficient heuristic, the results of recursive tree methods are only locally optimal, as splits are chosen to maximize homogeneity at the next step only. An alternative way to search over the parameter space of trees is to use global optimization methods like evolutionary algorithms. This paper describes the evtree package, which implements an evolutionary algorithm for learning globally optimal classification and regression trees in R. Computationally intensive tasks are fully computed in C++ while the partykit package is leveraged for representing the resulting trees in R, providing unified infrastructure for summaries, visualizations, and predictions. evtree is compared to the open-source CART implementation rpart, conditional inference trees (ctree), and the open-source C4.5 implementation J48. A benchmark study of predictive accuracy and complexity is carried out in which evtree achieved at least similar and most of the time better results compared to rpart, ctree, and J48. Furthermore, the usefulness of evtree in practice is illustrated in a textbook customer classification task.

]]>Abstract:

This article surveys currently available implementations in R for continuous global optimization problems. A new R package globalOptTests is presented that provides a set of standard test problems for continuous global optimization based on C functions by Ali, Khompatraporn, and Zabinsky (2005). 48 of the objective functions contained in the package are used in empirical comparison of 18 R implementations in terms of the quality of the solutions found and speed.

]]>Abstract:

Convex optimization now plays an essential role in many facets of statistics. We briefly survey some recent developments and describe some implementations of these methods in R . Applications of linear and quadratic programming are introduced including quantile regression, the Huber M-estimator and various penalized regression methods. Applications to additively separable convex problems subject to linear equality and inequality constraints such as nonparametric density estimation and maximum likelihood estimation of general nonparametric mixture models are described, as are several cone programming problems. We focus throughout primarily on implementations in the R environment that rely on solution methods linked to R, like MOSEK by the package Rmosek. Code is provided in R to illustrate several of these problems. Other applications are available in the R package REBayes, dealing with empirical Bayes estimation of nonparametric mixture models.

]]>Abstract:

Trust region algorithms are nonlinear optimization tools that tend to be stable and reliable when the objective function is non-concave, ill-conditioned, or exhibits regions that are nearly flat. Additionally, most freely-available optimization routines do not exploit the sparsity of the Hessian when such sparsity exists, as in log posterior densities of Bayesian hierarchical models. The trustOptim package for the R programming language addresses both of these issues. It is intended to be robust, scalable and efficient for a large class of nonlinear optimization problems that are often encountered in statistics, such as finding posterior modes. The user must supply the objective function, gradient and Hessian. However, when used in conjunction with the sparseHessianFD package, the user does not need to supply the exact sparse Hessian, as long as the sparsity structure is known in advance. For models with a large number of parameters, but for which most of the cross-partial derivatives are zero (i.e., the Hessian is sparse), trustOptim offers dramatic performance improvements over existing options, in terms of computational time and memory footprint.

]]>Abstract:

Over the last two decades, it has been observed that using the gradient vector as a search direction in large-scale optimization may lead to efficient algorithms. The effectiveness relies on choosing the step lengths according to novel ideas that are related to the spectrum of the underlying local Hessian rather than related to the standard decrease in the objective function. A review of these so-called spectral projected gradient methods for convex constrained optimization is presented. To illustrate the performance of these low-cost schemes, an optimization problem on the set of positive definite matrices is described.

]]>Abstract:

R (R Core Team 2014) provides a powerful and flexible system for statistical computations. It has a default-install set of functionality that can be expanded by the use of several thousand add-in packages as well as user-written scripts. While R is itself a programming language, it has proven relatively easy to incorporate programs in other languages, particularly Fortran and C. Success, however, can lead to its own costs:

Users face a confusion of choice when trying to select packages in approaching a problem.

A need to maintain workable examples using early methods may mean some tools offered as a default may be dated.

In an open-source project like R, how to decide what tools offer "best practice" choices, and how to implement such a policy, present a serious challenge.

We discuss these issues with reference to the tools in R for nonlinear parameter estimation (NLPE) and optimization, though for the present article `optimization` will be limited to function minimization of essentially smooth functions with at most bounds constraints on the parameters. We will abbreviate this class of problems as NLPE. We believe that the concepts proposed are transferable to other classes of problems seen by R users.

]]>Abstract:

Numerical optimization is often an essential aspect of mathematical analysis in science, technology and other areas. The function optim() provides basic optimization capabilities and is among the most widely used functions in R . Additionally, there are various packages and functions for solving various types of optimization problem (the optimization task view on Comprehensive R Archive Network provides a comprehensive list of available options for solving optimization problems in R). In this special volume, four papers are presented which discuss some of the areas in numerical optimization where significant developments have been made recently to enhance the capabilities in R . This introduction provides a brief overview of the volume.

]]>Abstract:

The R package structSSI provides an accessible implementation of two recently developed simultaneous and selective inference techniques: the group Benjamini-Hochberg and hierarchical false discovery rate procedures. Unlike many multiple testing schemes, these methods specifically incorporate existing information about the grouped or hierarchical dependence between hypotheses under consideration while controlling the false discovery rate. Doing so increases statistical power and interpretability. Furthermore, these procedures provide novel approaches to the central problem of encoding complex dependency between hypotheses.

We briefly describe the group Benjamini-Hochberg and hierarchical false discovery rate procedures and then illustrate them using two examples, one a measure of ecological microbial abundances and the other a global temperature time series. For both procedures, we detail the steps associated with the analysis of these particular data sets, including establishing the dependence structures, performing the test, and interpreting the results. These steps are encapsulated by R functions, and we explain their applicability to general data sets.

Abstract:

R package mixAK originally implemented routines primarily for Bayesian estimation of finite normal mixture models for possibly interval-censored data. The functionality of the package was considerably enhanced by implementing methods for Bayesian estimation of mixtures of multivariate generalized linear mixed models proposed in Komárek and Komárková (2013). Among other things, this allows for a cluster analysis (classification) based on multivariate continuous and discrete longitudinal data that arise whenever multiple outcomes of a different nature are recorded in a longitudinal study. This package also allows for a data-driven selection of a number of clusters as methods for selecting a number of mixture components were implemented. A model and clustering methodology for multivariate continuous and discrete longitudinal data is overviewed. Further, a step-by-step cluster analysis based jointly on three longitudinal variables of different types (continuous, count, dichotomous) is given, which provides a user manual for using the package for similar problems.

]]>Abstract:

In this paper we show how complete hierarchical multinomial marginal (HMM) models for categorical variables can be defined, estimated and tested using the R package hmmm. Models involving equality and inequality constraints on marginal parameters are needed to define hypotheses of conditional independence, stochastic dominance or notions of positive dependence, or when the parameters are allowed to depend on covariates. The hmmm package also serves the need of estimating and testing HMM models under equality and inequality constraints on marginal interactions.

]]>