credsubs: Multiplicity-Adjusted Subset Identification

Subset identification methods are used to select the subset of a covariate space over which the conditional distribution of a response has certain properties - for example, identifying types of patients whose conditional treatment effect is positive. An often critical requirement of subset identification methods is multiplicity control, by which the family-wise Type I error rate is controlled, rather than the Type I error rate of each covariate-determined hypothesis separately. The credible subset (or credible subgroup) method provides a multiplicity-controlled estimate of the target subset in the form of two bounding subsets: one which entirely contains the target subset, and one which is entirely contained by it. We introduce a new R package, credsubs, which constructs credible subset estimates using a sample from the joint posterior distribution of any regression model, a description of the covariate space, and a function mapping the parameters and covariates to the subset criterion. We demonstrate parametric and nonparametric applications of the package to a clinical trial dataset and a neuroimaging dataset, respectively.


Introduction
Subset identification, the selection of the subset of an index set for which random variables have certain properties, can be applied to a wide variety of topics, but is especially useful in medicine and public health. In this context, the goal is usually to identify the types of patients, identified by their baseline covariate profiles, for whom a treatment has a positive effect. Given an index set X and an unknown, possibly parametric, real-valued function b(x; θ), the large class of subset identification problems considered in this paper can be formulated as identifying the subset B(θ), comprised of the x ∈ X for which b(x; θ) > δ. In one example presented here, an analysis of clinical trials of Alzheimer's disease treatments, b(x; θ) is a personalized (or conditional average) treatment effect, i.e., the difference in expected outcomes between subjects with identical baseline covariate profiles x, E[Y |x, t = 1; θ] − E[Y |x, t = 0; θ]. The target subset B(θ) is then the set of covariate profiles that represent patients with positive personalized treatment effects, i.e., the benefiting subgroup. In a second example, x represents coordinates of a voxel in a 2-dimensional slice of an fMRI scan of the brain, b(x; θ) is the blood oxygen level response to a stimulus, and B(θ) is the set of voxels excited by that stimulus.
The subset identification problems considered here can be seen as collections of tests of the hypotheses H x = {θ : b(x; θ) ≤ δ}. The subset B(θ) is then the x for which the H x are false, and a natural estimate B is the set of x for which the hypotheses are rejected by some test. However, in many cases the outcome of the family of tests as a whole is of interest. In particular, it may be desirable to control the probability that at least one member of B is not truly in B, or that at least one member of B is not placed in B. This article introduces the R (R Core Team 2020) package credsubs (Schnell and Carlin 2020), available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/ package=credsubs, based on the "credible subgroups" methods developed by Schnell, Tang, Offen, and Carlin (2016), Schnell, Tang, Müller, and Carlin (2017), and Schnell, Müller, Tang, and Carlin (2018). Given a sample from the joint posterior distribution of b(x; θ) or the information necessary to construct one, the method constructs an estimator (D, S) with the property that P [D ⊆ B(θ) ⊆ S|data] ≥ 1 − α, thereby controlling multiplicity in both dimensions mentioned above. In most common cases, the estimator also has the frequentist property that P [D ⊆ B(θ 0 ) ⊆ S|θ 0 ] ≥ 1 − α.
The original credible subgroups methodology focused on identification of patient populations who benefit from treatment. This subgroup identification problem is one among many related problems under the umbrella of subgroup analysis, and credible subgroups is one among many possible approaches to addressing it. The broad problem of subset identification presented here is a generalization of subgroup identification as it appears in medicine. Broadly, subgroup analysis in medicine is concerned with treatment effects that may be heterogeneous among different subsets of the patient population, and includes problems such as detecting such heterogeneity, estimating treatment effects in subgroups, identifying subgroups that are anomalous in some way, identifying subgroups with positive average treatment effects, and identifying baseline covariate profiles that correspond to positive personalized treatment effects. It is especially important for our purposes to distinguish between the final two problems. Suppose that for the purpose of prescribing treatment for a particular condition, patients are characterized by their sex and a particular biomarker status. If biomarker-positive male patients benefit from treatment, and there is no effect for other patients, then the answer to the former problem is that three subgroups of the population have positive average treatment effects: male patients, biomarker-positive patients, and biomarker-positive male patients. However, the answer to the latter problem is that the only baseline covariate profile corresponding to a positive personalized treatment effect is that of biomarker-positive male patients.
As described in Schnell et al. (2016), there exist many approaches to benefiting subgroup identification. These include direct Bayesian inference from linear models with skeptical priors on interaction effects (Simon 2002), and tree-based or "flowchart-style" approaches such as classification and regression trees (Breiman, Friedman, Stone, and Olshen 1984;Chipman, George, and McCulloch 1998). A recent survey of variations on these approaches is given in Zhang, Seibold, Vettore, Song, and François (2018), along with examples in R. In principle, the credible subsets approach may be applied on top of any modeling technique from which a sample from the joint posterior of b(x; θ) can be obtained.
The scope of this package includes only the construction of the estimator given properly formatted regression output. It does not include functionality for fitting any regression models. Packages that provide output in forms that facilitate use of this package include those which can directly produce a sample from the joint posterior of the regression surface, such as the BayesTree (Chipman and McCulloch 2016) package implementing Bayesian additive regression trees (BART; Chipman, George, and McCulloch 2010), and interfaces to hierarchical model samplers such as R2OpenBUGS (Sturtz, Ligges, and Gelman 2005), rjags (Plummer 2019), nimble (de Valpine et al. 2019de Valpine, Turek, Paciorek, Anderson-Bergman, Temple Lang, and Bodik 2017), INLA (Martins, Simpson, Lindgren, and Rue 2013), and rstan (Carpenter et al. 2017;Stan Development Team 2020), which give samples from the joint posterior of the parameters.
The rest of this article proceeds as follows. Section 2 gives a brief overview of the credible subsets estimator's definition and construction. Section 3 illustrates the basic usage of the package for a parametric example analysis of an Alzheimer's disease clinical trial dataset. Section 4 presents a nonparametric example analysis of a neuroimaging dataset. Finally, Section 5 concludes with a discussion.

Definition
Given a covariate space X consisting of points x, and a real-valued function b(x; θ) possibly parameterized by θ, the credible subsets method is concerned with estimating the set B(θ) ≡ {x : b(x; θ) > δ}, i.e., the subset of the covariate space for which b exceeds some threshold. The method constructs a credible subset pair (D, S) with the aim that D ⊆ B(θ) ⊆ S. The set D is termed the exclusive credible subset and is intended to contain only covariate points for which b(x; θ) > δ, and the set S is termed the inclusive credible subset and is intended to contain every covariate point for which b(x; θ) > δ.
The credible subset pair is constructed from a sample from the joint posterior distribution of b(x; θ), written compactly as b(X; θ). The pair may be interpreted in the Bayesian sense that P (D ⊆ B(θ) ⊆ S|data) ≥ 1 − α. However, under the construction of (D, S) from simultaneous confidence or credible bands detailed below, pairs constructed from a posterior sample often have asymptotic frequentist interpretations.
In most applications, b(x; θ) is estimated via regression of a response Y onto the covariate space. In the original credible subgroups usage, b( i.e., the conditional treatment effect (t = 1 versus t = 0). Thus, setting δ = 0, B(θ) is the subgroup of patient types having positive conditional treatment effects.

Construction
In this section we present the construction of credible subsets in the Bayesian framework. We suppress dependence on both θ and Y in our notation, and unless otherwise stated, all distributions are posterior.
1 Let M = 1, T 0 = X, and R 0 = ∅ be the starting iteration, base test set, and base rejection set, respectively; Construct a two-sided 1 − α restricted covariate space simultaneous confidence band Let R M be the set of x for which the band does not contain δ; Credible subset pairs are constructed primarily through simultaneous credible bands, i.e, upper and lower bound surfaces u(X) and l(X) constructed so that the entirety of the regression surface b(X) lies between them with posterior probability 1 − α. If the marginal distributions of the b(x) are bell-shaped (e.g., b(X) is multivariate normal or multivariate t), then a familiar form generalized from Uusipaikka (1983) may be used to construct the band: whereb(x) and VAR [b(x)] are the posterior mean and variance, respectively, of b(x) at each x, and W * α,X is the 1 − α quantile of In cases where the marginals of the b(x) are not so well-behaved (for example, if they are discrete or multimodal) a quantile-based band may be used instead. For a fixed x temporarily suppressed in notation, let We may then use the simultaneous credible band given by Schnell et al. (2018): where W * α,X is the 1 − α quantile of In either case, the distribution and quantile function of W X may be estimated from the posterior sample of b(X). Credible subsets may be constructed as D = {x : l(x) > δ} and S = {x : u(x) ≥ δ}. Under mild regularity conditions (Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin 2013;Carlin and Louis 2009), the posterior of b(X) corresponds asymptotically to the sampling distribution of its maximum likelihood estimate, thus the Bayesian simultaneous credible bands and credible subset pairs are interpretable in the frequentist sense.
When interpreting credible subset pairs in the frequentist sense, the estimate may be improved by using a step-down testing procedure similar to the Holm-Bonferroni procedure (Holm 1979). For a given covariate vector x, let H Then Algorithm 1 controls the overall Type I error rate at α. In words, the stepdown procedure reconstructs the simultaneous credible band on the smaller subset of X for which previous iterations did not reject H x . The proof of validity for this procedure relies on showing that it is a closed testing procedure (Marcus, Peritz, and Ruben 1976), in part by If H x is rejected, we may place x in D when the lower credible bound at x was greater than δ, and S (the complement of S) when the upper bound was less than δ. If H x is not rejected, we leave x in S \ D.

Package
This section describes the use of the package's three main functions: • sim.cred.band, which constructs simultaneous credible bands according to (1) or (3); • credsubs, which determines the membership of each covariate point in the exclusive and inclusive credible subsets using the step-down procedure described by Algorithm 1; • credsubs.level, which determines the maximum credible level at which each covariate point is a member of either D or S .
It is important to note that this package provides functionality only for constructing credible subsets given a sample from the posterior of b(X; θ), or of the parameters necessary to construct them. The package provides no functionality for producing a regression fit in the first place, which is intended to be handled by software of the user's choice, as long as it can provide the input necessary for the credsubs package. In this section, we analyze a dataset from four clinical trials of Alzheimer's disease treatments, using nimble as the modeling package.
The clinical trials of Alzheimer's disease treatments were carried out by AbbVie, Inc., and originally analyzed in the context of credible subsets in Schnell et al. (2018). The experimental treatment arms have been excluded, leaving the placebo (T = 0) and standard of care (T = 1) arms. Although the standard of care is palliative rather than disease-modifying, it is expected to alleviate symptoms, including worsening of the cognitive endpoint used in the trials. The primary endpoint Y is improvement in cognitive function from baseline, and the covariates making up x are sex, carrier status of a genetic biomarker, baseline severity, and rate of cognitive decline since diagnosis. The quantity of interest for defining the benefiting population is the personalized treatment effect , not to be confused with the improvement of individual subjects assigned to the treatment arm.

Formatting input
Each of the three main functions relies on the same three primary input objects: • params, a matrix whose rows are draws from the posterior of b(X; θ), or of θ; • design, a matrix whose rows represent points in the covariate space X; • FUN, a function which takes as arguments a row of design and the entirety of params and returns the corresponding sample from the posterior of b(x; θ).
By default, design is set to NULL, in which case params is taken to be a sample from the posterior of b(X; θ) and FUN is ignored. If design is not NULL, params is taken to be a sample from the posterior of θ. Also by default, FUN is set to function(x, params) { params %*% t(x) }, so that a linear model is assumed whenever design is provided and FUN not otherwise specified. The outputs of each function are aligned with the rows of design if not NULL, and with the columns of params otherwise.
Note that params and design are matrices rather than data frames, and will be used in computations as such. Each function will attempt to coerce objects supplied to the params and design arguments to matrices via data.matrix(). In particular, vectors will be interpreted as matrices with a single column, and factors will be replaced by their numerical codes (so "dummy variable" or other coding schemes must be applied manually).
To begin our example, we load the data and reform it to be used by nimble. The nimble R package allows models to be specified in a manner similar to that of the BUGS language (Lunn, Spiegelhalter, Thomas, and Best 2009), can automatically configure an MCMC sampler, and compile to C. We take use of the BUGS model specification here for clarity, and in Section 4 take advantage of another feature: new datasets can be associated with the compiled model and sampler without recompilation.
R> library("nimble") R> library("credsubs") R> set.seed(1) R> data("alzheimers", package = "credsubs") R> head(alzheimers) Here beta and gamma are the regression parameters, of which we will be primarily concerned with gamma in making inferences about the PTEs. The parameter tau is the inverse of the usual error variance σ 2 . The following code compiles the model and runs a MCMC sampler for 11,000 iterations.
When identifying the benefiting subgroup, we consider not only the covariate profiles of subjects enrolled in the trials, but those of all potential patients. Thus, we construct our design object as a design matrix representing the covariate space over which we will make inferences. We must place the design matrix on the same (here, standardized) scale as the data.

Treatment Severity
Decline Sex Carrier Sex.by.Carrier

Inference
Four arguments determine the behavior of the primary functions in ways that may affect their primary outputs: • cred.level, the credible level 1−α (default 0.95) at which simultaneous credible bands or credible subset pairs are constructed; • threshold, the threshold δ (default 0) of b(x; θ) for membership in B(θ); • method, either "asymptotic" (default, using (1)) or "quantile" (using (3)), specifying which simultaneous credible band construction method is used; • step.down, whether or not the step-down procedure should be used (default TRUE). Table 1 displays the applicability of each of the above arguments to the primary functions. Additionally, the argument sides (default "both") may be used to specify one-sided constructions by passing the values "lower" or "upper" for sim.cred.band(), and "exclusive" or "inclusive" for credsubs() and credsubs.level(). The output of credsubs() is a list of class 'credsubs' with the primary elements exclusive and inclusive, which are logical vectors of the same lengths as the primary elements of the output of sim.cred.band(). An element of exclusive equal to TRUE indicates that the corresponding covariate point is a member of the exclusive credible subset D; a similar relationship holds for inclusive and the inclusive credible subset S. The output of credsubs.level() is a list of class 'credsubs.level' and is somewhat more complicated. The primary elements are numerical vectors level, which gives the maximum credible level at which each covariate point is not a member of S \ D, and sign, which is equal to 1 if the covariate point is in D for credible levels less than or equal to that specified by level and −1 if it is in S for the same credible levels. If a covariate point is always in S \ D (i.e., the credible band is centered at δ there), its entries in level and sign are both 0.

Diagnostics
The package offers several facilities for diagnostics of simultaneous credible band and credible subsets outputs. This section describes them and presents illustrations.

Target function distributions
The material input to the credible subsets methods is the sample from the joint posterior of the b(x; θ), and thus it is an important target of diagnostics. In particular, the realized distribution of each of the b(x; θ) should be examined to determine whether the asymptotic simultaneous credible band (1) is appropriate, or if the quantile-based version (3) should be used instead.
For parametric models, a sample from the posterior of θ and the matrix representing X are passed to the package's functions and the b(x; θ) computed internally. In this case the user may request the samples of the b(x; θ) from a subset (or all) of the x using the track argument to any of the primary functions. The argument (default numeric(0)) is a vector containing the indices of the x in X for which the sample of the b(x; θ) should be returned. For example, setting track = c(1, 10, 100) would result in the output of sim.cred.band() containing a 3-column matrix trace whose columns are samples of b(x; θ) where x is the first, tenth, and hundredth row of design. The value of track is retained as the column names of trace.

Band and credible level plots
The 'sim.cred.band' and 'credsubs.level' objects are equipped with plotting methods that produce visualizations ignoring the spatial relationships between covariate points. Discarding the covariate information allows generic plots to be produced regardless of the quantity, type, or range of the covariates. In the case of the 'sim.cred.band' object, the covariate points  are placed in ascending order according to the estimates of b(x; θ), by default their posterior means (method = "asymptotic") or medians (method = "quantile"). The estimates are then plotted along with the corresponding upper and lower bounds. The plot method for 'credsubs.level' objects plots the signed maximum credible levels credsubs.level$sign * credsubs.level$level in ascending order.
R> plot(scb, main = "95% simultaneous credible band") R> plot(csl, main = "Maximum credible levels") The output is shown in Figure 3. The plot of the simultaneous credible band may be used to examine the range of widths of the band and compare its bounds to arbitrary thresholds. In the presented case the majority of the estimates, along with many lower bounds, are above zero, and no upper bounds are below zero. That of the maximum credible levels may be examined to determine the proportion of covariate points falling outside of S \ D at various credible levels. Here, roughly two-thirds of the points have maximum credible levels above 80%.

Funnel plots
The 'credsubs' objects also have a plot method inspired by the funnel plots often used to detect publication bias in meta-analysis (Richard and Pillemer 1984). Funnel plots relate estimates -study treatment effects in meta-analyses, andb(x; θ) − δ in credible subsets analyses -to their standard errors. Since constructing a funnel plot in our context requires a posterior mean and standard error, the method is only available for objects created with the method = "asymptotic" argument. If visible, the vertical and horizontal lines through the origin, as well as those through the origin with slopes ±1/W * α,X , are added. Points falling below the positively sloped line are members of the exclusive credible subset D, while those falling below the negatively sloped line are not members of the inclusive credible subset S. Points falling above both lines fall in S \ D. When step.down = TRUE, the value of W * α,X from the final iteration is used.
R> plot(cs, main = "Funnel plot of credible subsets") Figure 4 displays the result. The funnel plot may be used to understand how the distribution of points among D, S \ D, and S change as the threshold varies (points are translated horizontally) and as the credible level decreases (sloped lines approach vertical) or increases (approach horizontal).

Reporting
The package contains functions for building and running a shiny (Chang, Cheng, Allaire, Xie, and McPherson 2020) application that allows users to choose a covariate point by selecting levels from a list for each covariate, and obtain the maximum credible level and interpretation. The application is built by passing a 'credsubs.level' object along with a data frame cov.space representing the covariate space to build.shiny.calc(). The cov.space object should be formatted to be human-readable, e.g., covariate values on the original scales, and well-formatted column names and factor levels.

Credible Subsets Calculator
Select a covariate point.

R> run.shiny.calc("alzheimers")
An image of the example calculator is displayed in Figure 5. The build.shiny.calc() function creates a directory containing the shiny application, which can be moved to any location or deployed online. The argument to run.shiny.calc() need only point to the intended directory.
The credible subsets represent bounds on the benefiting subgroup with a clinically useful interpretation: With high credibility (or confidence, depending on the desired paradigm), researchers can make the claim that all types of patients in the exclusive credible subset have a positive PTE, and all types of patients with positive PTEs are contained in the inclusive credible subset. A primary utility of such a global inference is regulatory: Investigators control the probability of erroneously declaring benefit for one or more types of patients that do not in fact benefit. This family-wise Type I error control is critical when submitting results to a regulatory agency such as the United States Food and Drug Administration. Another use is in designing enrichment studies, which target a particular population thought to be more likely to benefit or have a stronger treatment effect. In this case, the calculator may be used to define eligibility criteria, e.g., in order to enroll, patients must be part of the exclusive credible subgroup at a maximum credible level of 90% or higher. The multiplicity control provided by this approach then protects against proceeding with an enrichment trial on a target population that is either overoptimistically large or entirely spurious.

A nonparametric example
The example usage presented in Section 3 is parametric both in the modeling and credible subsets senses: A subset of the regression parameters in the model are collected in θ and passed to the package's functions along with a design matrix. In this section we present a usage which is nonparametric in the sense that a posterior draw from b(C) is passed directly. Additionally, the following analysis is an example outside of the clinical trials scope for which the credible subsets method was originally developed.
We analyze a neuroimaging dataset derived from that described in the SPM12 manual (The FIL Methods Group 2014), Chapter 30, and provided in the supplementary materials. The data are from an fMRI experiment in which subjects were given alternating blocks of auditory stimulus and no stimulus. We preprocessed the data using FSL (Woolrich et al. 2009). The preprocessing steps were as follows. We applied motion correction using mcflirt (Jenkinson, Bannister, Brady, and Smith 2002; rigid body transform; cost function normalized correlation; reference volume the middle volume), and then we smoothed the images using a Gaussian kernel with FWHM = 5mm. We consider data from one patient, which consist of 96 blood oxygen level scans of 64 × 64 × 64 voxels. We restrict our attention to the 33rd horizontal slice, and at the advice of the SPM12 Manual, we ignore the first 12 scans to avoid start-up artifacts. Each scan lasts for 7 seconds, and the scans are grouped into blocks of 6. The blocks alternate between no stimulus and an auditory stimulus, beginning with a no-stimulus block. The blood oxygen level does not respond instantaneously to stimulus; instead, the canonical hemodynamic response function (HRF; Friston, Fletcher, Josephs, Holmes, Rugg, and Turner 1998) is convolved with the stimulus indicator to produce the design coefficient x(t) pattern shown in Figure 6. R> library("credsubs") R> library("neuRosim") R> library("nimble") R> load("slice32.RData") R> slice.arr <-array(as.vector(t(slice[-(1:3), ])), dim = c(64, 64, 96)) R> experiment <-rep(rep(c(0, 1), each = 6), 8) R> chrf <-canonicalHRF(1:96 * 7) R> design <-convolve(experiment, rev(chrf), type = "open")[1:96] R> plot(experiment, main = "Experiment and design coefficients", + ylim = range(experiment, design), ylab = "Coefficient", xlab = "Scan") R> lines(design) R> abline(v = seq(6, 90, by = 6) + 1 / 2, lty = 2) For the sake of simple illustration, we model each voxel individually. The blood oxygen level measurement for the voxel v at time t is denoted Y v (t), and modeled as where v (t) is a normal lag-1 autoregressive (AR (1) A large portion of the scan slice is outside the brain. As a crude method of restricting further inference to the brain alone, we select the voxels for which the observed mean blood oxygen level measurement is at least 1,000. R> brain <-rowMeans(slice.arr [, , 13:96], dims = 2) >= 1000 The activation coefficient β v is our primary inferential target. We identify the voxels at which the ratio of the posterior mean to posterior standard deviation corresponds to the 10th and 90th quantiles of the standard normal distribution, and also compute the maximum credible levels for each voxel.
The left side of Figure 8 displays the voxels for which the ratio of posterior mean to standard deviation is below the 10th or above the 90th quantile of the standard normal distribution. Note that a great many voxels are "significant" in this sense, but the result appears noisy. The right side displays the 80% credible subsets. Because the posterior distributions of the β v are uncorrelated due to our independent modeling strategy, the credible subsets procedure is very underpowered. A model in which the α v and β v are spatially correlated, and other parameters are perhaps shared across all voxels, is likely to yield a much larger exclusive credible subset and smaller inclusive credible subset. Regardless of inefficiency, several voxels are identified as positively or negatively activated. These voxels appear to fall into three clusters: The middle left and right green voxels lie in the auditory cortex, while the lower red voxels lie in the visual cortex -sensible conclusions about responses to auditory stimuli.

Discussion
This article presented the R package credsubs for implementing the credible subsets inference procedure, which provides exclusive and inclusive bounds for a target subset of the covariate space. The essential input to package functions is a sample from the joint posterior from an arbitrary regression model, which may be obtained via various other software packages, for example nimble. The credible subsets approach was initially developed in order to identify benefiting subgroups in clinical trials, as illustrated by the example in Section 3, but is also applicable to a wide range of other settings, for example the neuroimaging analysis in Section 4.
In the clinical trial context, we hope that our package can form the basis for a "lingua franca" between drug developers and regulatory authorities, the former of whom are increasingly interested in targeting therapies to relevant patient subgroups, while also satisfying the latter group with regard to overall Type I error protection. We also hope to explore the capabilities of this approach in other areas. For example, it is likely that the neuroimaging example could be greatly improved by taking into account spatial correlation among the activation levels via a conditional autoregressive (CAR; Besag 1974Besag , 1975 or other smoothing prior.
Other future work may include investigations into frequentist analogs through bootstrap sampling, and parallel computing or other methods to handle larger and higher-dimensional samples.