Abstract:

Accelerated failure time (AFT) models are alternatives to relative risk models which are used extensively to examine the covariate effects on event times in censored data regression. Nevertheless, AFT models have been much less utilized in practice due to lack of reliable computing methods and software. This paper describes an R package aftgee that implements recently developed inference procedures for AFT models with both the rank-based approach and the least squares approach. For the rank-based approach, the package allows various weight choices and uses an induced smoothing procedure that leads to much more efficient computation than the linear programming method. With the rank-based estimator as an initial value, the generalized estimating equation approach is used as an extension of the least squares approach to the multivariate case. Additional sampling weights are incorporated to handle missing data needed as in case-cohort studies or general sampling schemes. A simulated dataset and two real life examples from biomedical research are employed to illustrate the usage of the package.

]]>Abstract:

Random ferns is a very simple yet powerful classification method originally introduced for specific computer vision tasks. In this paper, I show that this algorithm may be considered as a constrained decision tree ensemble and use this interpretation to introduce a series of modifications which enable the use of random ferns in general machine learning problems. Moreover, I extend the method with an internal error approximation and an attribute importance measure based on corresponding features of the random forest algorithm. I also present the R package rFerns containing an efficient implementation of this modified version of random ferns.

]]>Abstract:

Nonparametric density and regression estimation methods for circular data are included in the R package NPCirc. Specifically, a circular kernel density estimation procedure is provided, jointly with different alternatives for choosing the smoothing parameter. In the regression setting, nonparametric estimation for circular-linear, circular-circular and linear-circular data is also possible via the adaptation of the classical Nadaraya-Watson and local linear estimators. In order to assess the significance of the features observed in the smooth curves, both for density and regression with a circular covariate and a linear response, a SiZer technique is developed for circular data, namely CircSiZer. Some data examples are also included in the package, jointly with a routine that allows generating mixtures of different circular distributions.

]]>Abstract:

Continuous diagnostic tests are often used for discriminating between healthy and diseased populations. For the clinical application of such tests, it is useful to select a cutpoint or discrimination value c that defines positive and negative test results. In general, individuals with a diagnostic test value of c or higher are classified as diseased. Several search strategies have been proposed for choosing optimal cutpoints in diagnostic tests, depending on the underlying reason for this choice. This paper introduces an R package, known as OptimalCutpoints, for selecting optimal cutpoints in diagnostic tests. It incorporates criteria that take the costs of the different diagnostic decisions into account, as well as the prevalence of the target disease and several methods based on measures of diagnostic test accuracy. Moreover, it enables optimal levels to be calculated according to levels of given (categorical) covariates. While the numerical output includes the optimal cutpoint values and associated accuracy measures with their confidence intervals, the graphical output includes the receiver operating characteristic (ROC) and predictive ROC curves. An illustration of the use of OptimalCutpoints is provided, using a real biomedical dataset.

]]>Abstract:

A web interface, named WebBUGS, is developed to conduct Bayesian analysis online over the Internet through OpenBUGS and R. WebBUGS can be used with the minimum requirement of a web browser both remotely and locally. WebBUGS has many collaborative features such as email notification and sharing. WebBUGS also eases the use of OpenBUGS by providing built-in model templates, data management module, and other useful modules. In this paper, the use of WebBUGS is illustrated and discussed.

]]>Abstract:

Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each others than objects in different groups. Most of the clustering algorithms depend on some assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation as regards its validity.

The evaluation procedure has to tackle difficult problems such as the quality of clusters, the degree with which a clustering scheme fits a specific data set and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of indices proposed in the literature, programs are unavailable to test these indices and compare them.

The R package NbClust has been developed for that purpose. It provides 30 indices which determine the number of clusters in a data set and it offers also the best clustering scheme from different results to the user. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determining the most appropriate number of clusters for the data set of interest.

Abstract:

The statistical analysis and modeling of natural images is an important branch of statistics with applications in image signaling, image compression, computer vision, and human perception. Because the space of all possible images is too large to be sampled exhaustively, natural image models must inevitably make assumptions in order to stay tractable. Subsequent model comparison can then ﬁlter out those models that best capture the statistical regularities in natural images. Proper model comparison, however, often requires that the models and the preprocessing of the data match down to the implementation details. Here we present the Natter, a statistical software toolbox for natural images models, that can provide such consistency. The Natter includes powerful but tractable baseline model as well as standardized data preprocessing steps. It has an extensive test suite to ensure correctness of its algorithms, it interfaces to the modular toolkit for data processing toolbox MDP, and provides simple ways to log the results of numerical experiments. Most importantly, its modular structure can be extended by new models with minimal coding eﬀort, thereby providing a platform for the development and comparison of probabilistic models for natural image data.

]]>Abstract:

When designing a sampling survey, usually constraints are set on the desired precision levels regarding one or more target estimates (the Y’s). If a sampling frame is available, containing auxiliary information related to each unit (the X’s), it is possible to adopt a stratified sample design. For any given stratiﬁcation of the frame, in the multivariate case it is possible to solve the problem of the best allocation of units in strata, by minimizing a cost function sub ject to precision constraints (or, conversely, by maximizing the precision of the estimates under a given budget). The problem is to determine the best stratification in the frame, i.e., the one that ensures the overall minimal cost of the sample necessary to satisfy precision constraints. The X’s can be categorical or continuous; continuous ones can be transformed into categorical ones. The most detailed stratiﬁcation is given by the Cartesian product of the X’s (the atomic strata). A way to determine the best stratification is to explore exhaustively the set of all possible partitions derivable by the set of atomic strata, evaluating each one by calculating the corresponding cost in terms of the sample required to satisfy precision constraints. This is unaﬀordable in practical situations, where the dimension of the space of the partitions can be very high. Another possible way is to explore the space of partitions with an algorithm that is particularly suitable in such situations: the genetic algorithm. The R package SamplingStrata, based on the use of a genetic algorithm, allows to determine the best stratiﬁcation for a population frame, i.e., the one that ensures the minimum sample cost necessary to satisfy precision constraints, in a multivariate and multi-domain case.

]]>Abstract:

In regression settings, a suﬃcient dimension reduction (SDR) method seeks the core information in a p-vector predictor that completely captures its relationship with a response. The reduced predictor may reside in a lower dimension d < p, improving ability to visualize data and predict future observations, and mitigating dimensionality issues when carrying out further analysis. We introduce ldr, a new R software package that implements three recently proposed likelihood-based methods for SDR: covariance reduction, likelihood acquired directions, and principal fitted components. All three methods reduce the dimensionality of the data by pro jection into lower dimensional subspaces. The package also implements a variable screening method built upon principal ﬁtted components which makes use of ﬂexible basis functions to capture the dependencies between the predictors and the response. Examples are given to demonstrate likelihood-based SDR analyses using ldr, including estimation of the dimension of reduction subspaces and selection of basis functions. The ldr package provides a framework that we hope to grow into a comprehensive library of likelihood-based SDR methodologies.

]]>Implementing Reproducible Research

Victoria Stodden, Friedrich Leisch, Roger D. Peng

CRC Press, 2014

ISBN: 978-1-4665-6159-5

XML and Web Technologies for Data Sciences with R

Deborah Nolan and Duncan Temple Lang

Springer-Verlag, 2014

ISBN: 978-1-4614-7899-7

Abstract:

In this paper we describe the main features of the Bergm package for the open-source R software which provides a comprehensive framework for Bayesian analysis of exponential random graph models: tools for parameter estimation, model selection and goodness-of- fit diagnostics. We illustrate the capabilities of this package describing the algorithms through a tutorial analysis of three network datasets.

]]>Abstract:

Commonly used classification and regression tree methods like the CART algorithm are recursive partitioning methods that build the model in a forward stepwise search. Although this approach is known to be an efficient heuristic, the results of recursive tree methods are only locally optimal, as splits are chosen to maximize homogeneity at the next step only. An alternative way to search over the parameter space of trees is to use global optimization methods like evolutionary algorithms. This paper describes the evtree package, which implements an evolutionary algorithm for learning globally optimal classification and regression trees in R. Computationally intensive tasks are fully computed in C++ while the partykit package is leveraged for representing the resulting trees in R, providing unified infrastructure for summaries, visualizations, and predictions. evtree is compared to the open-source CART implementation rpart, conditional inference trees (ctree), and the open-source C4.5 implementation J48. A benchmark study of predictive accuracy and complexity is carried out in which evtree achieved at least similar and most of the time better results compared to rpart, ctree, and J48. Furthermore, the usefulness of evtree in practice is illustrated in a textbook customer classification task.

]]>