SIS : An R Package for Sure Independence Screening in Ultrahigh-Dimensional Statistical Models

We revisit sure independence screening procedures for variable selection in generalized linear models and the Cox proportional hazards model. Through the publicly available R package SIS , we provide a uniﬁed environment to carry out variable selection using iterative sure independence screening (ISIS) and all of its variants. For the regularization steps in the ISIS recruiting process, available penalties include the LASSO, SCAD, and MCP while the implemented variants for the screening steps are sample splitting, data-driven thresholding, and combinations thereof. Performance of these feature selection techniques is investigated by means of real and simulated data sets, where we ﬁnd considerable improvements in terms of model selection and computational time between our algorithms and traditional penalized pseudo-likelihood methods applied directly to the full set of covariates.


Introduction
With the remarkable development of modern technology, including computing power and storage, more and more high-dimensional and high-throughput data of unprecedented size and complexity are being generated for contemporary statistical studies. For instance, bioimaging technology has made it possible to collect a huge amount of predictor information such as microarray, proteomic, and SNP data while observing survival information and tumor classification on patients in clinical studies. A common feature of all these examples is that the number of variables p can be potentially much larger than the number of observations n, i.e., the number of gene expression profiles is in the order of tens of thousands while the number of patient samples is in the order of tens or hundreds. Following Fan and Lv (2008), we call this setting ultrahigh-dimensional, where the authors gave a precise mathematical formulation of the growth rate of p relative to n. In order to provide more representative and reasonable statistical models, it is typically assumed that only a small fraction of predictors are associated with the outcome. This is the notion of sparsity which emphasizes the prominent role feature selection techniques play in ultrahigh-dimensional statistical modeling.
One popular family of variable selection methods for parametric models is based on the penalized (pseudo-)likelihood approach. Examples include the LASSO (Tibshirani 1996(Tibshirani , 1997, SCAD (Fan and Li 2001), the elastic net penalty (Zou and Hastie 2005), the MCP (Zhang 2010), and related methods. Nevertheless, in ultrahigh-dimensional statistical learning problems, these methods may not perform well due to the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability (Fan, Samworth, and Wu 2009).
Motivated by these concerns, Fan and Lv (2008) introduced a new framework for variable screening via independent correlation learning that tackles the aforementioned challenges in the context of ultrahigh-dimensional linear models. Their proposed sure independence screening (SIS) is a two-stage procedure; first filtering out the features that have weak marginal correlation with the response, effectively reducing the dimensionality p to a moderate scale below the sample size n, and then performing variable selection and parameter estimation simultaneously through a lower dimensional penalized least squares method such as SCAD or LASSO. Under certain regularity conditions, Fan and Lv (2008) showed surprisingly that this fast feature selection method has a "sure screening property"; that is, with probability tending to 1, the independence screening technique retains all of the important features in the model. However, the SIS procedure in Fan and Lv (2008) only covers ordinary linear regression models and their technical arguments do not extend easily to more general models such as generalized linear models and hazard regression with right-censored times.
In order to enhance finite sample performance, an important methodological extension, iterative sure independence screening (ISIS), was also proposed by Fan and Lv (2008) to handle cases where the regularity conditions may fail, such as when some important predictors are marginally uncorrelated with the response, or the reverse situation where an unimportant predictor has higher marginal correlation than some important features. Roughly speaking, the original ISIS procedure works by iteratively performing variable selection to recruit a small number of predictors, computing residuals based on the model fitted using these recruited predictors, and then using the residuals as the working response variable to continue recruiting new predictors. With the purpose of handling more complex real data, Fan and Song (2010) extended SIS to generalized linear models; and Fan et al. (2009) improved some important steps of the original ISIS procedure, allowing variable deletion in the recruiting process through penalized pseudo-likelihood, while dealing with more general loss based models. In particular, they introduced the concept of conditional marginal regressions and, with the aim of reducing the false discovery rate, proposed two new ISIS variants based on the idea of splitting samples. Other extensions of ISIS include Fan, Feng, and Wu (2010) to the Cox proportional hazards model, and Fan, Feng, and Song (2011) to nonparametric additive models.
In this paper we build on the work of Fan et al. (2009) and  to provide a publicly available package SIS (Fan, Feng, Saldana, Samworth, and Wu 2018), implemented in the R statistical software (R Core Team 2017), extending sure independence screening and all of its variants to generalized linear models and the Cox proportional hazards model. In particular, our codes are able to perform variable selection through the proposed ISIS variants of Fan et al. (2009) and through the data-driven thresholding approach of Fan et al. (2011). Furthermore, we combine these sample splitting and data-driven thresholding ideas to provide two novel feature selection techniques. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=SIS.
Taking advantage of the fast cyclical coordinate descent algorithms developed in the packages glmnet (Friedman, Hastie, Simon, and Tibshirani 2017) and ncvreg (Breheny 2017), for convex and nonconvex penalty functions, respectively, we are able to efficiently perform the moderate scale penalized pseudo-likelihood steps from the ISIS procedure, thus yielding variable selection techniques outperforming direct use of glmnet and ncvreg in terms of both computational time and estimation error. Our procedures scale favorably in both n and p, allowing us to expeditiously and accurately solve much larger problems than with previous packages, particularly in the case of nonconvex penalties. We would like to point out that the recent package apple (Yu and Feng 2012), using a hybrid of the predictor-corrector method and coordinate descent procedures, provides an alternative for the penalized pseudo-likelihood estimation with nonconvex penalties (Yu and Feng 2014a). In the present work, we limit all numerical results to the use of ncvreg, noting that there are other available options to implement the nonconvex variable selection procedures performed by SIS. Similarly, although the package survHD (Bernau, Waldron, and Riester 2014) provides an efficient alternative for implementing Cox proportional hazards regression, in the current presentation, we only make use of the survival package (Therneau 2017) to compute conditional marginal regressions and of the glmnet package (Friedman et al. 2017) to fit high-dimensional Cox models.
The remainder of the paper is organized as follows. In Section 2, we describe the vanilla SIS and ISIS variable selection procedures in the context of generalized linear models and the Cox proportional hazards model. Section 3 discusses several ISIS variants, as well as important implementation details. Simulation results comparing model selection performance and run time trials are given in Section 4, where we also analyze four gene expression data sets and work through an example using our package with one of them. The paper is concluded with a short discussion in Section 5.

General SIS and ISIS methodological framework
Consider the usual generalized linear model (GLM) framework, where we have independent and identically distributed observations {(x i , y i ) : i = 1, . . . , n} from the population (x, y), where the predictor x = (x 0 , x 1 , . . . , x p ) is a (p + 1)-dimensional random vector with x 0 = 1 and y is the response. We further assume the conditional distribution of y given x is from an exponential family taking the canonical form where θ = x β, β = (β 0 , β 1 , . . . , β p ) is a vector of unknown regression parameters and b(·), c(·) are known functions. As we are only interested in modeling the mean regression, the dispersion parameter is assumed known. In virtue of (1), inference about the parameter β in the GLM context is made via maximization of the log-likelihood function For the survival analysis framework, the observed data . . n} is an independent and identically distributed random sample from a certain population (x, y, δ). Here, as in the context of linear models, x = (x 1 , x 2 , . . . , x p ) is a pdimensional random vector of predictors and y, the observed time, is a time of failure if δ = 1, or a right-censored time if δ = 0. Suppose that the sample comprises m distinct uncensored failure times t 1 < t 2 < · · · < t m . Let (j) denote the individual failing at time t j and R(t j ) be the risk set just prior to time t j , that is, R(t j ) = {i : y i ≥ t j }. The main problem of interest is to study the relationship between the predictor variables and the failure time, and a common approach is through the Cox proportional hazards model (Cox 1975). For a vector β = (β 1 , β 2 , . . . , β p ) of unknown regression parameters, the Cox model assumes a semiparametric form of the hazard function where h 0 (t) is an unknown arbitrary baseline hazard function giving the hazard when x i = 0. Following the argument in Cox (1975), inference about β is made via maximization of the partial likelihood function which is equivalent to maximizing the log-partial likelihood We refer interested readers to Kalbfleisch and Prentice (2002) and references therein for a comprehensive literature review on the Cox proportional hazards model.
For both statistical models, we assume all predictors x 1 , . . . , x p are standardized to have mean zero and standard deviation one. Additionally, although our variable selection procedures within the SIS package also handle the classical p < n setting, we allow the number of covariates p to be much larger than the number of observations n. What makes statistical inference possible in this "large p, small n" scenario is the sparsity assumption; only a small subset of variables among predictors x 1 , . . . , x p contribute to the response, which implies the parameter vector β is sparse. Therefore, variable selection techniques play a pivotal role in these ultrahigh-dimensional statistical models.

SIS and feature ranking by maximum marginal likelihood estimators
Let M = {1 ≤ j ≤ p : β j = 0} be the true sparse model, where β = (β 0 , β 1 , . . . , β p ) denotes the true value of the parameter vector and β 0 = 0 for the Cox model. In order to carry out the vanilla sure independence screening variable selection procedure, we initially fit marginal versions of models (2) and (3) with componentwise covariates. The maximum marginal likelihood estimator (MMLE)β M j , for j = 1, . . . , p, is defined in the GLM context as the maximizer of the componentwise regression where x i = (x i0 , x i1 , . . . , x ip ) and x i0 = 1. Similarly, for each covariate x j (1 ≤ j ≤ p), one can define the MMLE for the Cox model as the maximizer of the log-partial likelihood with a single covariatê with x i = (x i1 , . . . , x ip ) . Both componentwise estimators can be computed very rapidly and implemented in a modular way, avoiding the numerical instability associated with ultrahighdimensional estimation problems.
The vanilla SIS procedure then ranks the importance of features according to the magnitude of their marginal regression coefficients, excluding the intercept in the case of GLMs. Therefore, we select a set of variables where δ n is a threshold value chosen so that we pick the d top ranked covariates. Typically, one may take d = n/ log n , so that dimensionality is reduced from ultrahigh to below the sample size. As further discussed in Sections 3.3 and 3.4, the choice of d may also be either data-driven or model-based. Under a mild set of technical conditions, Fan and Song (2010) show the magnitude of these marginal estimators can preserve the nonsparsity information about the joint model with full covariates. In other words, for a given sequence {δ n }, the sure screening property P(M ⊂ M δn ) → 1 as n → ∞ holds for SIS, effectively reducing the dimensionality of the model from ultrahigh to below the sample size, and solving the aforementioned challenges of computational expediency, statistical accuracy, and algorithmic stability.
With features being crudely selected by the intensity of their marginal signals, the index set M δn may also include a great deal of unimportant predictors. To further improve finite sample performance of vanilla SIS, variable selection and parameter estimation can be simultaneously achieved via penalized likelihood estimation, using the joint information of the covariates in M δn . Without loss of generality, by reordering the features if necessary, we may assume that x 1 , . . . , x d are the predictors recruited by SIS. We define β d = (β 0 , β 1 , . . . , β d ) and let Thus, our original problem of estimating the sparse (p + 1)-vector β in the GLM framework (2) reduces to estimating a sparse (d + 1)-vector β d = (β 0 , β 1 , . . . , β d ) based on maximizing the moderate scale penalized likelihood Likewise, after defining β d = (β 1 , β 2 , . . . , β d ) and setting x i,d = (x i1 , x i2 , . . . , x id ) for survival data within the Cox model, the moderate scale version of the penalized log-partial likelihood problem consists in maximizing Algorithm 1 Vanilla SIS (Van-SIS) 1. Inputs: Screening model size d. Penalty p λ (·). 2. For every j ∈ {1, . . . , p}, compute the MMLEβ M j from (4) or (5). 3. Choose a threshold value δ n in (6) such that M δn consists of the d top ranked covariates. 4. Obtain the parameter estimate β d from the penalized likelihood estimation problems (8) or (9). 5. Outputs: Parameter estimate β d and the corresponding estimate of the true sparse model Here, p λ (·) is a penalty function and λ > 0 is a regularization parameter which may be selected using the AIC (Akaike 1973), BIC (Schwarz 1978) or EBIC (Chen and Chen 2008) criteria, or through ten-fold cross-validation and the modified cross-validation framework (Feng and Yu 2013;Yu and Feng 2014b), for example. Penalty functions whose resulting estimators yield sparse solutions to the maximization problems (8) and (9) include the LASSO penalty p λ (|β|) = λ|β| (Tibshirani 1996), the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001), which is a folded-concave quadratic spline with p λ (0) = 0 and for some γ > 2 and |β| > 0, and the minimax concave penalty (MCP), another folded-concave quadratic spline with p λ (0) = 0 such that p λ (|β|) = (λ − |β|/γ) + , for some γ > 0 and |β| > 0 (Zhang 2010). For the SCAD and MCP penalties, the tuning parameter γ is used to adjust the concavity of the penalty. The smaller γ is, the more concave the penalty becomes, which means finding a global minimizer is more difficult; but on the other hand, the resulting estimators overcome the bias introduced by the LASSO penalty.
Once penalized likelihood estimation is carried out in (8) and (9), a refined estimate of M can be obtained from M 1 , the index set of the nonzero components of the sparse regression parameter estimator. We summarize this initial screening procedure based on componentwise regressions through Algorithm 1.

Iterative sure independence screening
The technical conditions in Fan and Lv (2008) and Fan and Song (2010) guaranteeing the sure screening property for SIS fail to hold if there is a predictor marginally unrelated, but jointly related with the response, or if a predictor is jointly uncorrelated with the response but has higher marginal correlation with the response than some important predictors in M .
In the former case, the important predictor cannot be picked up by vanilla SIS, whereas in the latter case, unimportant predictors in M c are ranked too high, leaving out features from the true sparse model M .
To deal with these difficult scenarios in which the SIS methodology breaks down, Fan and Lv (2008) and Fan et al. (2009) proposed iterative sure independence screening based on conditional marginal regressions and feature ranking. The approach seeks to overcome the limitations of SIS, which is based on marginal models only, by making more use of the joint covariate information while retaining computational expediency and stability as in the original SIS.
In its first iteration, the vanilla ISIS variable selection procedure begins with using regular SIS to pick an index set A 1 of size k 1 , and then similarly applies a penalized likelihood estimation approach to select a subset M 1 of these indices. Note that the screening step only fits componentwise regressions, while the penalized likelihood step solves optimization problems of moderate scale k 1 , typically below the sample size n. This is an attractive feature for any variable selection technique applied to ultrahigh-dimensional statistical models. After the first iteration, we denote the resulting estimator with nonzero components and indices in M 1 byβ M 1 .
In an effort to use more fully the joint covariate information, in the second iteration of vanilla ISIS we compute the conditional marginal regression of each predictor j that is not in M 1 . That is, under the generalized linear model framework, we solve arg max whereas under the Cox model, we obtain for each j ∈ {1, . . . , p} \ M 1 , where x i, M 1 denotes the sub-vector of x i with indices in M 1 and similarly for β M 1 . These are again low-dimensional problems which can be solved quite efficiently. Similar to the componentwise regressions (4) and (5), letβ M j denote the last coordinate of the maximizer in (10) and (11). The magnitude |β M j | reflects the additional contribution of the jth predictor given that all covariates with indices in M 1 have been included in the model.
Once the conditional marginal regressions have been computed for each predictor not in M 1 , we perform conditional feature ranking by ordering {|β M j | : j ∈ M c 1 } and forming an index set A 2 of size k 2 , say, consisting of the indices with the top ranked elements. After this screening step, under the GLM framework, we maximize the moderate scale penalized likelihood Similarly, in the Cox model, we obtain a sparse estimator by maximizing the moderate scale penalized log-partial likelihood  (4) or (5). Select the k 1 top ranked covariates to form the index set A 1 . 3. Apply penalized likelihood estimation on the set A 1 to obtain a subset of indices M 1 . 4. For every j ∈ M c 1 , solve the conditional marginal regression problem (10) or (11) and sort {|β M j | : j ∈ M c 1 }. Form the index set A 2 with the k 2 top ranked covariates, and apply penalized likelihood estimation on M 1 ∪ A 2 as in (12) or (13) to obtain a new index set M 2 . 5. Iterate the process in Step 4 until we have an index set M l such that | M l | = d or M l = M j for some j < l or l = l max . 6. Outputs: M l from Step 5 along with the parameter estimate from (12) or (13).
The indices of the nonzero coefficients ofβ M 1 ∪ A 2 provide an updated estimate M 2 of the true sparse model M .
In the screening step above, an alternative approach is to substitute the fitted regression parameterβ M 1 from the first step of vanilla ISIS into problems (10) and (11), so that the optimization problems become again componentwise regressions. This approach is exactly an extension of the original ISIS proposal of Fan and Lv (2008) to generalized linear models and the Cox proportional hazards model. Even for the ordinary linear model, the conditional contributions of predictors are more relevant for variable selection in the second ISIS iteration than the original proposal of using the residualsr as the new response (see Fan et al. 2009). Furthermore, the penalized likelihood steps (12) and (13) allow the procedure to delete some features {x j : j ∈ M 1 } that were previously selected. This is also an important deviation from Fan and Lv (2008), as their original ISIS procedure does not contemplate intermediate deletion steps.
Lastly, the vanilla ISIS procedure, which iteratively recruits and deletes predictors, can be repeated until some convergence criterion is reached. We adopt the criterion of having an index set M l which either has the prescribed size d, or satisfies M l = M j for some j < l. In order to ensure that iterated SIS takes at least two iterations to terminate, in our implementation we fix k 1 = 2d/3 , and thereafter at the lth iteration we set k l = d − | M l−1 |. A step-by-step description of the vanilla ISIS procedure is provided in Algorithm 2.
We conclude this section providing a simple overview of the main features of the vanilla SIS and ISIS procedures for applied practitioners. In the ultrahigh-dimensional statistical model setting where p n, and even in the classical p < n setting with p > 30, variable screening is an essential tool in helping eliminate irrelevant predictors while reducing data gathering and storage requirements. The vanilla SIS procedure given in Algorithm 1 provides an extremely fast and efficient variable screening based on marginal regressions of each predictor with the response. While under certain independence assumptions among predictors this may prove a successful strategy in terms of estimating the true sparse model M , there are well-known issues associated with variable screening using only information from marginal regressions, such as missing important predictors from M which happen to have low marginal correlation with the response. The vanilla ISIS procedure addresses these issues by using more thoroughly the joint covariate information through the conditional marginal regressions (10) and (11), which aim at measuring the additional contribution of a predictor x j given the presence of the variables in M 1 in the current model, all while maintaining low computational costs. Finally, we would like to point out that the intermediate deletion steps of the vanilla ISIS procedure could be carried out with any other variable selection methods, such as the weight vector ranking with support vector machines (Rakotomamonjy 2003) or the greedy search strategies of forward selection and backward elimination (Guyon and Elisseeff 2003). In our implementation within the SIS package we favor the penalized likelihood criteria (12) and (13), but in principle any variable selection technique could be employed to further filter unimportant predictors.

Variants of ISIS
By nature of their marginal approach, sure independence screening procedures have large false selection rates, namely, many unimportant predictors in M c are selected after the screening steps. In order to reduce the false selection rate, Fan et al. (2009) suggested the idea of sample splitting. Without loss of generality, we assume the sample size n is even, and we randomly split the sample into two halves. Two variants of the ISIS methodology have been proposed in the literature; both of them relying on the idea of performing variable screening to the data in each partition separately, combining the results in a subsequent penalized likelihood step. We also revisit the approach of Fan et al. (2011), in which a random permutation of the observations is used to obtain a data-driven threshold for independence screening. be the two sets of indices, each of size k 1 , obtained after applying regular SIS to the data in each partition. As the first crude estimates of the true sparse model M , both of them should have large false selection rates. Yet, under the technical conditions given in Fan and Song (2010), the estimates should satisfy

First variant of ISIS
That is, important features should appear in both sets with probability tending to one. If we define as a new estimate of M , this new index set must also satisfy P(M ⊂ A 1 ) → 1 as n → ∞.
However, by construction, the number of unimportant predictors in A 1 should be much smaller, thus reducing the false selection rate. The reason is that, in order for an unimportant predictor to appear in A 1 , it has to be included in both sets A 1 and A 1 randomly. In their theoretical support for this variant based on random splitting, Fan et al. (2009) provided a non-asymptotic upper bound for the probability of the event that m unimportant covariates are included in the intersection A 1 . The probability bound, obtained under an exchangeability condition ensuring that each unimportant feature is equally likely to be recruited by SIS, is decreasing in the dimensionality, showing an apparent "blessing of dimensionality". This is only part of the full story, since, as pointed out in Fan et al. (2009), the probability of missing important predictors from the true sparse model M is expected to increase with p. However, as we show in our simulation settings of Section 4.1, and further confirm in the real data analysis of Section 4.3, the procedure is quite effective at obtaining a minimal set of features that should be included in a final model.
The remainder of the first variant of ISIS proceeds as follows. After forming the intersection 1 , we perform penalized likelihood estimation as in Algorithm 2 to obtain a refined approximation M 1 to the true sparse model. We then perform the second iteration of the vanilla ISIS procedure, computing conditional marginal regressions to each partition separately to obtain sets of indices A (1) 2 and A (2) 2 , each of size k 2 . After taking the intersection of these two sets, we carry out sparse penalized likelihood estimation as in (12) and (13), obtaining a second approximation M 2 to the true model M . As before, the iteration continues until we have an index set M l of size d, or satisfying M l = M j for some j < l.

Second variant of ISIS
The variable selection performed by the first variant of ISIS could potentially lead to a very aggressive screening of predictors, reducing the overall false selection rate, but sometimes yielding undesirably minimal model sizes. The second variant of ISIS is a more conservative variable selection procedure, where we again apply regular SIS to each partition separately, but we now choose larger sets of indices A (1) 1 and A (2) 1 to ensure that their intersection has k 1 elements. In this way, the second variant guarantees that there are at least k 1 predictors included before the penalized likelihood step, making it considerably less aggressive than the first variant.
After applying penalized likelihood to the predictors with indices in A 1 , we obtain a first estimate M 1 of the true sparse model. The second iteration computes conditional marginal regressions to each partition separately, recruiting enough features in index sets A

Data-driven thresholding
Motivated by a false discovery rate criterion in Fan et al. (2011), the following variant of ISIS determines a data-driven threshold for marginal screening. Given GLM data of the form {(x i , y i ) : i = 1, . . . , n}, a random permutation π of {1, . . . , n} is used to decouple x i and y i so that the resulting data {(x π(i) , y i ) : i = 1, . . . , n} follow a null model, i.e., a model in which the features have no prediction power over the response. For the newly permuted data, we recalculate marginal regression coefficients (β M j ) * for each predictor j, with j = 1, . . . , p. The motivation behind this approach is that whenever j is the index of an important predictor in M , the MMLE |β M j | given in (4) should be larger than most of |(β M j ) * |, as the random permutation is meant to eliminate the prediction power of features. For a given q ∈ [0, 1], let ω (q) be the qth quantile of {|(β M j ) * | : j = 1, . . . , p}. The data-driven threshold allows only a 1 − q proportion of inactive variables to enter the model when x and y are not related (the null model). Thus, the initial index set A 1 is defined as Select the variables in the index set Select the variables in the index set A 2 = {j ∈ M c 1 : |β M j | ≥ ω (q) }, and apply penalized likelihood estimation on M 1 ∪ A 2 to obtain a new subset M 2 . 5. Iterate the process in Step 4 until we have an index set M l such that | M l | ≥ d or M l = M j for some j < l or l = l max . 6. Outputs: M l from Step 5 along with the parameter estimate from (12) or (13).
way to obtain a finer approximation M 1 of the true sparse model M . The complete procedure for this variant is detailed in Algorithm 3 above.
A greedy modification of the above algorithm can be proposed to enhance variable selection performance. Specifically, we restrict the size of the sets A j in the iterative screening steps to be at most p 0 , a small positive integer. Moreover, a completely analogous algorithm can be proposed for survival data, with permutation π and data-driven threshold ω (q) defined accordingly. Details of such a procedure are omitted in the current presentation.

Implementation details
There are several important details in the implementation of the vanilla versions of SIS and ISIS, as well as of all the above mentioned variants.
• In order to speed up computations, and exclusively for the first screening step, all variable selection procedures use correlation learning (i.e., marginal Pearson correlations) between each predictor and the response, instead of the componentwise GLM or partial likelihood fits (4) and (5). We found no major differences in variable selection performance between this variant and the one using the MMLEs.
• Although the asymptotic theory of Fan and Song (2010) guarantees the sure screening property (7) for a sequence δ n ∼ n −θ * , with θ * > 0 a fixed constant, in practice Fan et al. (2009) recommend using d = n/ log n as a sensible choice since θ * is typically unknown. Furthermore, Fan et al. (2009) also advocate using model-based choices of d, favoring smaller values in models where the response provides less information.
In our numerical implementation we use d = n/(4 log n) for logistic regression and d = n/(2 log n) for Poisson regression, which are less informative than the real-valued response in a linear model, for which we select d = n/ log n . For the right-censored response in the survival analysis framework, we fix d = n/(4 log n) .
• Regardless of the statistical model at hand, we set d = n/ log n for the first variant of ISIS. Note that since the selected variables for this first variant are in the intersection of two sets of size k l ≤ d, we typically end up with far fewer than d variables selected by this method. In any case, our procedures within the SIS package allow for a user-specified value of d.
• Variable selection under the traditional p < n setting can also be carried out using our screening procedures, for which we fix d = p as default for all variants. In this regard, practicing data analysts can view sure independence screening procedures along classical criteria for variable selection such as the AIC (Akaike 1973), BIC (Schwarz 1978), C p (Mallows 1973) or the generalized information criterion GIC (Nishii 1984) applied directly to the full set of covariates.
• The intermediate penalized likelihood problems (12) and (13) are solved using the glmnet and ncvreg packages. Our code has an option allowing the regularization parameter λ > 0 to be selected through the AIC, BIC, EBIC or K-fold cross-validation criteria. The concavity parameter γ in the SCAD and MCP penalties can also be user-specified.
• In our permutation-based variant with data-driven thresholding, we use q = 1 (i.e., we take ω (q) to be the maximum absolute value of the permuted estimates) and take p 0 to be 1 in the greedy modification. Note that if none of the variables is recruited, that is, exceeding the threshold for the null model, the criterion in Step 5 stops the procedure.
• We can further combine the permutation-based approach of Section 3.3 with the sample splitting idea from the first two variants to define a new ISIS procedure. Concretely, we first select two subsets of indices A (1) 1 and A 1 , each consisting of variables whose MMLEs, or correlation with the response, exceed the data-driven thresholds ω (1) (q) and ω (2) (q) of their respective samples. If the size of their intersection is less than k 1 , we 1 ; otherwise, we reduce the size of A to ensure their intersection has at most k 1 elements. This is done to control the size of A 1 when too many variables exceed the thresholds. The rest of the iteration is carried out accordingly.
• Every variant of ISIS is coded to guarantee there will be at least two predictors at the end of the first screening step.
As the proposed ISIS variants grow more involved, the associated number of tuning parameters is bound to increase. While this may initially make some data practitioners feel uneasy, our intent here is to be as flexible as possible, providing all available tools that the powerful family of sure independence screening procedures has to offer. Additionally, it is important Parameter Description SIS package options p λ (·) Penalty function for intermediate penalized likelihood estimation. penalty = "SCAD" (default) / = "MCP" /= "lasso" λ Method for tuning regularization parameter of penalty function p λ (·). tune = "bic" (default) / = "ebic" /= "aic" /= "cv" γ Concavity parameter for SCAD and MCP penalties.

Model selection and timings
In this section we illustrate all independence screening procedures by studying their performance on simulated data and on four popular gene expression data sets. Most of the simulation settings are adapted from the work of Fan et al. (2009) and.

Model selection and statistical accuracy
We first conduct simulation studies comparing the run time of the vanilla version of SIS (Van-SIS), its iterated vanilla version (Van-ISIS), the first variant (Var1-ISIS), the second variant (Var2-ISIS), the permutation-based ISIS (Perm-ISIS), its greedy modification (Perm-g-ISIS), the permutation-based variant with sample splitting (Perm-var-ISIS) and its greedy modification (Perm-var-g-ISIS), under both generalized linear models and the Cox proportional hazards model. We also demonstrate the power of ISIS and its variants, in terms of model selection and estimation, by comparing them with traditional LASSO and SCAD penalized estimation. Our SIS code is tested against the competing R packages glmnet (Friedman et al. 2017) and ncvreg (Breheny 2017) for LASSO and SCAD penalization, respectively. All calculations were carried out on an Intel Xeon L5430 @ 2.66 GHz processor. We refer interested readers to Friedman, Hastie, and Tibshirani (2010), Breheny andHuang (2011), andTibshirani (2011) for a detailed discussion of penalized likelihood estimation algorithms under generalized linear models and the Cox proportional hazards model.
Four different statistical models are considered here: Linear regression, Logistic regression, Poisson regression, and Cox proportional hazards regression. We select the configuration (n, p) = (400, 5000) for all models, and generate covariates x 1 , . . . , x p as follows: Case 1: x 1 , . . . , x p are independent and identically distributed N (0, 1) random variables. With independent features, Case 1 is the most straightforward for variable selection. In Cases 2-4, however, we have serial correlation among predictors such that Corr(x i , x j ) does not decay as |i − j| increases. We will see below that for Cases 3 and 4, the true sparse model M is chosen such that the response is marginally independent but jointly dependent of x 4 . This type of dependence is ruled out in the asymptotic theory of SIS in Fan and Song (2010), so we should expect variable selection in these settings to be more challenging for the non-iterated version of SIS.
Case 2: The coefficients are the same as in Case 1.
For Cases 1 and 2, the coefficients were randomly generated as (4 log n/ √ n + |Z|/4)U with Z ∼ N (0, 1) and U = 1 with probability 0.5 and −1 with probability 0.5, independent of   the value of Z. For Cases 3 and 4, the selected model ensures that even though β 4 = 0, the associated predictor x 4 and the response y are marginally independent. This is designed in order to make it challenging for the vanilla sure independence screening procedure to select this important variable. Furthermore, in Case 4, we add another important predictor x 5 with a small coefficient to make it even more challenging to identify the true sparse model.
The results are given in Tables 2-5, in which the median and robust estimate of the standard deviation (over 100 repetitions) of several performance measures are reported: 1 -estimation error, squared 2 -estimation error, true positives (TP), false positives (FP), and computational time in seconds (Time). In Cases 1 and 2, under the Linear and Logistic regression setups, for any type of SIS or ISIS, we employ the SCAD penalty (γ = 3.7) at the end of the screening steps; whereas LASSO is applied for Cases 3 and 4, under the Poisson and Cox proportional hazards regression frameworks. For simplicity, we exclude the performance of MCP-based screening procedures in the current analysis. Whenever necessary, for all variable selection procedures considered here, the BIC criterion is used as a fast way to select the regularization parameter λ > 0, which is always chosen from a path of 100 candidate λ values.
As the covariates are all independent in Case 1, it is not surprising to see that Van-SIS performs reasonably well. However, this non-iterative procedure fails in terms of identifying   the true model when correlation is present, particularly in the challenging Cases 3 and 4. When predictors are dependent, the vanilla ISIS improves significantly over SIS in terms of true positives. While the number of false positive variables may be larger in some settings, Van-ISIS provides comparable estimation errors in Cases 1-3 but significant reduction in the complicated Case 4.
In terms of further reducing the false selection rate and estimation errors, while still selecting the true model M , Var1-ISIS performs much better than Var2-ISIS. Being a more conservative variable selection approach, Var2-ISIS tends to have a higher number of false positives. This is particularly true in the Poisson regression scenario, in which the second variant even misses one important predictor.
From the permutation-based variants, the ones that combine the sample splitting approach (Perm-var-ISIS and Perm-var-g-ISIS) outperform all other ISIS procedures in terms of true positives, low false selection rates, and small estimation errors, with Var1-ISIS following closely. In particular, for Perm-var-ISIS, the number of false positives is approximately zero for all examples. The only drawback seems to be their relatively large computation cost, being at least twice as large as that of Var1-ISIS. This is to be expected considering the amount of extra work these procedures have to perform: two rounds of marginal fits to obtain sample specific data-driven thresholds ω (1) (q) and ω (2) (q) , plus two additional rounds of marginal fits to compute the corresponding index sets A (1) 1 and A (2) 1 . Computational costs potentially increase further when carrying out the conditional marginal regression steps described in Section 2.2. However, the gains in statistical accuracy and model selection offset the increased timings, particularly in the equally correlated Case 2 with nonconvex penalties.  show that SCAD also enjoys the sure screening property for the relatively easy Cases 1 and 2. However, model sizes and estimation errors are significantly larger than for any of the ISIS procedures in the correlated scenario. On the other hand, for the difficult Cases 3 and 4, surprisingly, LASSO rarely includes the important predictor x 4 even though it is not a marginal screening based method. While exhibiting competitive performance with some of the ISIS variants in the Poisson regression scenario, LASSO performs poorly in the Cox model setup, having prohibitively large model sizes and estimation errors.
The ncvreg package with SCAD outperforms all ISIS variants in terms of computational cost for the uncorrelated Case 1. Still, the vanilla SIS procedure identifies the true model faster than ncvreg. For the correlated Case 2, however, only the greedy modification Perm-g-ISIS is slower than SCAD. In the Poisson and Cox model setups, while the computational cost of LASSO with the glmnet package is smaller than any of the ISIS procedures, the vanilla SIS shows better performance in terms of timings and estimation errors.
As they become more elaborate, ISIS procedures become more computationally expensive. Yet, vanilla ISIS and most of its variants presented here, particularly Var1-ISIS and Perm-var-ISIS, are clearly competitive variable selection methods in ultrahigh-dimensional statistical models, adequately using the joint covariate information, while exhibiting a very low false selection rate and competing computational cost.

Scaling in n and p with feature screening
In addition to comparing our SIS codes with glmnet and ncvreg, we would like to know how the timings of the vanilla SIS and ISIS procedures scale in n and p. We simulated data sets from Cases 1-3 above and, for a variety of n and p values, we took the median running time over 10 repetitions. Again, for each (n, p) pair, whenever necessary, the BIC criterion was used to select the best λ value among a path of 100 possible candidates. Figure 1 shows timings for fixed n as p grows (Cases 1-3), and for fixed p as n grows (Case 2).
From the plots we see that independence screening procedures perform uniformly faster than ncvreg with the SCAD penalty. For Poisson regression, vanilla SIS also outperforms glmnet with LASSO, particularly in the n = p scenario, where glmnet exhibits unusually slow performance. It is worth pointing out that iterative variable screening procedures typically do not show a strictly monotone timing as n or p increase. This is due to the varying number of iterations it takes to recruit d predictors, the random splitting of the sample in the first two ISIS variants, and the random permutation in the data-driven thresholding, among other factors.

Real data analysis
We now evaluate the performance of all variable screening procedures on four well-studied gene expression data sets: Leukemia (Golub et al. 1999), Prostate cancer (Singh et al. 2002), Lung cancer (Gordon et al. 2002) and Neuroblastoma (Oberthuer et al. 2006). The first three data sets come with predetermined, separate training and test sets of data vectors. The    Golub et al. (1999), Singh et al. (2002), Gordon et al. (2002) and Oberthuer et al. (2006).
Following the approach of Dudoit, Fridlyand, and Speed (2002), before variable screening and classification, we first standardize each sample to zero mean and unit variance. We compare the performance of all described variable screening procedures with the nearest shrunken centroids (NSC) method of Tibshirani, Hastie, Narasimhan, and Chu (2002), the independence rule (IR) in the high-dimensional setting (Bickel and Levina 2004) and the LASSO (Tibshirani 1996), which uses ten-fold cross-validation to select its tuning parameter, applied to the full set of covariates. Under a working independence assumption in the feature space, NSC selects an important subset of variables for classification by thresholding a corresponding two-sample t statistic, whereas IR makes use of the full set of predictors.
Tables 6-7 show the median and robust standard deviation of the classification error rates and model sizes for all procedures, taken over 100 random splittings into 50%-50% balanced training and test data. At each intermediate step of the (I)SIS procedures, we employ the LASSO with ten-fold cross-validation to further filter unimportant predictors for classification purposes. To determine a data-driven threshold for independence screening, we fix q = 0.95 for all permutation-based variable selection procedures. Lastly, for each data set considered, we apply all screening procedures to reduce dimensionality from the corresponding p to d = 100.
From the results in Tables 6-7, we observe that all ISIS variants perform similarly in terms of test error rates, whereas the main differences lie in the estimated model sizes. Compared with the LASSO applied to the full set of covariates, a majority of ISIS procedures select  (   smaller models while retaining competitive classification error rates. This is in agreement with our simulation results, which highlight the benefits of variable screening over a direct highdimensional regularized logistic regression approach. In particular, we observe the variants Var1-ISIS and Perm-var-g-ISIS provide the most parsimonious models across all four data sets, yielding optimal test error rates while using only 2 features in the case of the Lung cancer data set. Nonetheless, due to its robust performance in both the simulated data and these four gene expression data sets, and its reduced computational cost compared with all available ISIS variants, we select the vanilla ISIS of Algorithm 2 as the default variable selection procedure within our SIS package.
While the NSC method achieves competitive test error rates, it typically makes use of larger sets of genes which vary considerably across the different 50%-50% training and test data splittings. The independence rule exhibits poor test error performance, except for the Neuroblastoma data set, where it even outperforms some of the ISIS procedures. However, this approach uses all features without performing variable selection, thus yielding models of little practical use for researchers.
We now perform variable selection using the Var1-ISIS and Perm-var-ISIS procedures paired with the LASSO penalty and the ten-fold cross-validation method for choosing the regularization parameter.

Discussion
Sure independence screening is a powerful family of methods for performing variable selection in statistical models when the dimension is much larger than the sample size, as well as in the classical setting where p < n. The focus of the paper is on iterative sure independence screening, which iteratively applies a large scale screening by means of conditional marginal regressions, filtering out unimportant predictors, and a moderate scale variable selection through penalized pseudo-likelihood methods, which further selects the unfiltered predictors. With the goal of providing further flexibility to the iterative screening paradigm, special attention is also paid to powerful variants which reduce the number of false positives by means of sample splitting and data-driven thresholding approaches. Compared with the versions of LASSO and SCAD we used, the iterative procedures presented in this paper are much more accurate in selecting important variables and achieving small estimation errors. In addition, computational time is also reduced, particularly in the case of nonconvex penalties, thus resulting in a robust family of procedures for model selection and estimation in ultrahigh-dimensional statistical models. Extensions of the current package to more general loss-based models and nonparametric independence screening procedures, as well as the implementation of conditional marginal regressions through support vector machine methods are lines of future work.