Abstract:

There are many different ways in which change point analysis can be performed, from purely parametric methods to those that are distribution free. The ecp package is designed to perform multiple change point analysis while making as few assumptions as possible. While many other change point methods are applicable only for univariate data, this R package is suitable for both univariate and multivariate observations. Hierarchical estimation can be based upon either a divisive or agglomerative algorithm. Divisive estimation sequentially identifies change points via a bisection algorithm. The agglomerative algorithm estimates change point locations by determining an optimal segmentation. Both approaches are able to detect any type of distributional change within the data. This provides an advantage over many existing change point algorithms which are only able to detect changes within the marginal distributions.

Multilevel Modeling Using R

William Holmes Finch, Jocelyn E. Bolin, Ken Kelley

Chapman and Hall/CRC, 2014

ISBN: 978-1-4665-1585-7

Abstract:

Sequential Monte Carlo is a family of algorithms for sampling from a sequence of distributions. Some of these algorithms, such as particle filters, are widely used in physics and signal processing research. More recent developments have established their application in more general inference problems such as Bayesian modeling.

These algorithms have attracted considerable attention in recent years not only because they have desirable statistical properties, but also because they admit natural and scalable parallelization. However, they are perceived to be difficult to implement. In addition, parallel programming, though conceptually appealing, is unfamiliar to many researchers.

A C++ template library is presented for the purpose of implementing generic sequential Monte Carlo algorithms on parallel hardware. Two examples are presented: a simple particle filter and a classic Bayesian modeling problem.
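
For readers unfamiliar with particle filters, the following is a minimal serial sketch of a bootstrap particle filter. It is not the library's C++ interface: the random-walk state model with Gaussian observation noise, the function name and all parameters are assumptions chosen only for illustration.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_sd=1.0, obs_sd=1.0, seed=0):
    """Filtered state means for an assumed model x_t = x_{t-1} + noise, y_t = x_t + noise."""
    rng = random.Random(seed)
    # Initial particles drawn from an assumed standard normal prior.
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Propagate each particle through the state transition.
        particles = [x + rng.gauss(0.0, process_sd) for x in particles]
        # Weight by the Gaussian observation likelihood (log scale for stability).
        log_w = [-0.5 * ((y - x) / obs_sd) ** 2 for x in particles]
        max_lw = max(log_w)
        weights = [math.exp(lw - max_lw) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted mean approximates the filtering expectation.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Multinomial resampling to counter weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates
```

The propagation and weighting steps act on each particle independently, which is where parallel hardware helps most; normalization and resampling need information from all particles and are the harder part to distribute.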

Abstract:

Envelope models and methods represent new constructions that can lead to substantial increases in estimation efficiency in multivariate analyses. The envlp toolbox implements a variety of envelope estimators under the framework of multivariate linear regression, including the envelope model, partial envelope model, heteroscedastic envelope model, inner envelope model, scaled envelope model, and envelope model in the predictor space. The toolbox also implements the envelope model for estimating a multivariate mean. The capabilities of this toolbox include estimation of the model parameters, as well as performing standard multivariate inference in the context of envelope models; for example, prediction and prediction errors, F test for two nested models, the standard errors for contrasts or linear combinations of coefficients, and more. Examples and datasets are contained in the toolbox to illustrate the use of each model. All functions and datasets are documented.

Abstract:

This paper discusses the software D-STEM as a statistical tool for the analysis and mapping of environmental space-time variables. The software is based on a flexible hierarchical space-time model which is able to deal with multiple variables, heterogeneous spatial supports, heterogeneous sampling networks and missing data. Model estimation is based on the expectation maximization algorithm and can be performed in a distributed computing environment to reduce computing time when dealing with large data sets. The estimated model is finally used to dynamically map the variables over the geographic region of interest. Three examples of increasing complexity illustrate the usage and capabilities of D-STEM, both in terms of modeling and implementation, starting from a univariate model and arriving at a multivariate data fusion with tapering.

Abstract:

We have developed the R package c060 with the aim of improving R software functionality for high-dimensional risk prediction modeling, e.g., for prognostic modeling of survival data using high-throughput genomic data. Penalized regression models provide a statistically appealing way of building risk prediction models from high-dimensional data. The popular CRAN package glmnet implements an efficient algorithm for fitting penalized Cox and generalized linear models. However, in practical applications the data analysis will typically not stop at the point where the model has been fitted. For example, one is often interested in the stability of selected features and in assessing the prediction performance of a model; we provide functions to deal with both of these tasks. Our R functions are computationally efficient and offer the possibility of reducing computing time through parallel computing. Another feature which can drastically reduce computing time is an efficient interval-search algorithm, which we have implemented for selecting the optimal parameter combination for elastic net penalties. These functions have been useful in our daily work at the Biostatistics department (C060) of the German Cancer Research Center, where prognostic modeling of patient survival data is of particular interest. Although we focus on a survival data application of penalized Cox models in this article, the functions in our R package are in general applicable to all types of regression models implemented in the glmnet package, with the exception of prediction error curves, which are specific to time-to-event data.

Abstract:

One major goal in clinical applications of multi-state models is the estimation of transition probabilities. The usual nonparametric estimator of the transition matrix for non-homogeneous Markov processes is the Aalen-Johansen estimator (Aalen and Johansen 1978). However, two problems may arise from using this estimator: first, its standard error may be large in heavily censored scenarios; second, the estimator may be inconsistent if the process is non-Markovian. The development of the R package TPmsm has been motivated by several recent contributions that account for these estimation problems. Estimation and statistical inference for transition probabilities can be performed using TPmsm. The TPmsm package provides seven different approaches to three-state illness-death modeling. In two of these approaches the transition probabilities are estimated conditionally on current or past covariate measures. Two real data examples are included to illustrate software usage.

Abstract:

Parameter and structural learning on continuous time Bayesian network classifiers are challenging tasks when dealing with big data. This paper describes an efficient scalable parallel algorithm for parameter and structural learning in the case of complete data using the MapReduce framework. Two popular instances of classifiers are analyzed, namely the continuous time naive Bayes and the continuous time tree augmented naive Bayes. Details of the proposed algorithm are presented using Hadoop, an open-source implementation of a distributed file system and the MapReduce framework for distributed data processing. Performance evaluation of the designed algorithm shows robust parallel scaling.
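
To make the map and reduce roles concrete, here is a small in-memory sketch, not the paper's Hadoop implementation. It assumes a single variable observed completely for each labelled trajectory and collects the usual sufficient statistics (time spent in each state and transition counts, both conditional on the class); the key layout and function names are hypothetical.

```python
from collections import defaultdict


def map_trajectory(label, trajectory):
    """Emit key/value pairs for one trajectory.

    trajectory is a list of (time, state) pairs with increasing times;
    the last entry marks the end of the observation window.
    """
    for (t0, s0), (t1, s1) in zip(trajectory, trajectory[1:]):
        yield ("T", label, s0), t1 - t0      # time spent in state s0
        if s1 != s0:
            yield ("M", label, s0, s1), 1    # one observed transition s0 -> s1


def reduce_by_key(pairs):
    """Sum the values emitted for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def intensities(totals):
    """Maximum likelihood transition rates: count divided by time in the source state."""
    rates = {}
    for key, count in totals.items():
        if key[0] == "M":
            _, label, s0, s1 = key
            rates[(label, s0, s1)] = count / totals[("T", label, s0)]
    return rates
```

Since the statistics are plain sums, the map step can run on trajectories stored on different nodes and the reduce step only has to add partial totals, which is why this kind of learning fits the MapReduce pattern.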

Abstract:

The X-12-ARIMA seasonal adjustment program of the US Census Bureau extracts the different components (mainly: seasonal component, trend component, outlier component and irregular component) of a monthly or quarterly time series. It is the state-of-the-art technology for seasonal adjustment used in many statistical offices. It can include a moving holiday effect, a trading day effect and user-defined regressors, and it additionally incorporates automatic outlier detection. The procedure makes additive or multiplicative adjustments and creates an output data set containing the adjusted time series and intermediate calculations.

The original output from X-12-ARIMA is somewhat static, and it is not always easy for users to extract the required information for further processing. The R package x12 provides wrapper functions and an abstraction layer for batch processing of X-12-ARIMA. It allows summarizing, modifying and storing the output from X-12-ARIMA within a well-defined class-oriented implementation. On top of the class-oriented (command line) implementation, the graphical user interface allows access to the R package x12 without requiring much R knowledge. Users can interactively select additive outliers, level shifts and temporary changes and see the impact immediately.

The provision of the powerful X-12-ARIMA seasonal adjustment program available directly from within R, as well as of the new facilities for marking outliers, batch processing and change tracking, makes the package a potent and functional tool.

Abstract:

Time series clustering is an active research area with applications in a wide range of fields. One key component in cluster analysis is determining a proper dissimilarity measure between two data objects, and many criteria have been proposed in the literature to assess dissimilarity between two time series. The R package TSclust aims to implement a large set of well-established peer-reviewed time series dissimilarity measures, including measures based on raw data, extracted features, underlying parametric models, complexity levels, and forecast behaviors. Computation of these measures allows the user to perform clustering by using conventional clustering algorithms. TSclust also includes a clustering procedure based on p values from checking the equality of generating models, and some utilities to evaluate cluster solutions. The implemented dissimilarity functions are accessible individually for easier extension and possible use outside the clustering context. The main features of TSclust are described and examples of its use are presented.
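
As an illustration of a feature-based dissimilarity, the sketch below compares two series through their sample autocorrelations. It is a simplified stand-in, not the TSclust interface: the function names and the default of five lags are assumptions.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelations of a series at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    var = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / var
        for k in range(1, max_lag + 1)
    ]


def acf_distance(a, b, max_lag=5):
    """Euclidean distance between the autocorrelation vectors of two series."""
    ra, rb = acf(a, max_lag), acf(b, max_lag)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ra, rb)))
```

A matrix of such pairwise distances can then be passed to any conventional clustering routine. Because autocorrelations ignore level and scale, this measure groups series by dependence structure rather than by raw values, which may or may not suit a given application.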

Abstract:

The BayesLCA package for R provides tools for performing latent class analysis within a Bayesian setting. Three methods for fitting the model are provided, incorporating an expectation-maximization algorithm, Gibbs sampling and a variational Bayes approximation. The article briefly outlines the methodology behind each of these techniques and discusses some of the technical difficulties associated with them. Methods to remedy these problems are also described. Visualization methods for each of these techniques are included, as well as criteria to aid model selection.

Abstract:

The coneproj package contains routines for cone projection and quadratic programming, plus applications in estimation and inference for constrained parametric regression and shape-restricted regression problems. A short routine check_irred is included to check the irreducibility of a matrix, whose rows are supposed to be a set of cone edges used by coneA or coneB. For the coneA and coneB functions, the vector to project is provided by the user, along with the cone specification and a weight vector. For coneA, a constraint matrix is specified to define the cone, and for coneB, the cone edges are provided. The coneA and coneB algorithms have been coded and compiled in C++, and are called by R. The qprog function transforms a quadratic programming problem into a cone projection problem and calls coneA. The constreg function does estimation and inference for parametric least-squares regression with constraints on the parameters (using coneA). A p value for the “one-sided” test is provided. The shapereg function uses coneB to provide a least-squares estimator for a regression function with several choices of constraints including isotonic and convex regression functions, as well as estimates of parametrically modeled covariate effects. Results from hypothesis tests for significance of the effects are also provided. This package is now available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=coneproj.
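
Isotonic regression is the simplest instance of a cone projection: the least-squares fit is the projection of the response vector onto the cone of nondecreasing vectors. The sketch below computes that projection with the pool-adjacent-violators algorithm, which is a different, special-purpose method and not the general cone projection routine used by coneproj; the function name `isotonic_fit` is hypothetical.

```python
def isotonic_fit(y, w=None):
    """Weighted least-squares nondecreasing fit via pool adjacent violators."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block holds [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

Other shape constraints, such as convexity, do not reduce to this pooling scheme, which is one reason a general cone projection routine is useful.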

Abstract:

Accelerated failure time (AFT) models are alternatives to relative risk models, which are used extensively to examine covariate effects on event times in censored data regression. Nevertheless, AFT models have been much less utilized in practice due to a lack of reliable computing methods and software. This paper describes an R package aftgee that implements recently developed inference procedures for AFT models with both the rank-based approach and the least squares approach. For the rank-based approach, the package allows various weight choices and uses an induced smoothing procedure that leads to much more efficient computation than the linear programming method. With the rank-based estimator as an initial value, the generalized estimating equation approach is used as an extension of the least squares approach to the multivariate case. Additional sampling weights are incorporated to handle missing data, as needed in case-cohort studies or general sampling schemes. A simulated dataset and two real life examples from biomedical research are employed to illustrate the usage of the package.

Abstract:

Random ferns is a very simple yet powerful classification method originally introduced for specific computer vision tasks. In this paper, I show that this algorithm may be considered as a constrained decision tree ensemble and use this interpretation to introduce a series of modifications which enable the use of random ferns in general machine learning problems. Moreover, I extend the method with an internal error approximation and an attribute importance measure based on corresponding features of the random forest algorithm. I also present the R package rFerns containing an efficient implementation of this modified version of random ferns.

Abstract:

Nonparametric density and regression estimation methods for circular data are included in the R package NPCirc. Specifically, a circular kernel density estimation procedure is provided, together with different alternatives for choosing the smoothing parameter. In the regression setting, nonparametric estimation for circular-linear, circular-circular and linear-circular data is also possible via the adaptation of the classical Nadaraya-Watson and local linear estimators. In order to assess the significance of the features observed in the smooth curves, both for density and regression with a circular covariate and a linear response, a SiZer technique is developed for circular data, namely CircSiZer. Some data examples are also included in the package, together with a routine that allows generating mixtures of different circular distributions.
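
The adaptation of the Nadaraya-Watson estimator to a circular covariate can be sketched in a few lines. The version below assumes a von Mises kernel whose concentration `kappa` plays the role of the smoothing parameter; the function name and the default value are hypothetical, and this is not the NPCirc interface.

```python
import math


def circular_nw(theta_grid, thetas, ys, kappa=5.0):
    """Nadaraya-Watson fit of a linear response on a circular covariate (radians)."""
    fitted = []
    for t in theta_grid:
        # Von Mises kernel weights; the normalizing constant cancels in the ratio.
        weights = [math.exp(kappa * math.cos(t - th)) for th in thetas]
        fitted.append(sum(w * y for w, y in zip(weights, ys)) / sum(weights))
    return fitted
```

Because the weights depend on the angle only through a cosine, the fitted curve is automatically periodic, so no special treatment is needed at the boundary between 0 and 2π. Larger values of `kappa` give a more local, less smooth fit.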
