Conditional Model Selection in Mixed-Effects Models with cAIC4

Model selection in mixed models based on the conditional distribution is appropriate for many practical applications and has been a focus of recent statistical research. In this paper we introduce the R-package cAIC4 that allows for the computation of the conditional Akaike Information Criterion (cAIC). Computation of the conditional AIC needs to take into account the uncertainty of the random effects variance and is therefore not straightforward. We introduce a fast and stable implementation for the calculation of the cAIC for linear mixed models estimated with lme4 and additive mixed models estimated with gamm4. Furthermore, cAIC4 offers a stepwise function that allows for a fully automated stepwise selection scheme for mixed models based on the conditional AIC. Examples of many possible applications are presented to illustrate the practical impact and easy handling of the package.


Introduction
The (generalized) linear mixed model is a flexible and broadly applicable statistical model. It is naturally used for analyzing longitudinal or clustered data. Furthermore, any regularized regression model incorporating a quadratic penalty can be written in terms of a mixed model. This includes smoothing spline models, spatial models and more general additive models (Wood 2017). Thus efficient and reliable estimation of such models is of major interest for applied statisticians. The package lme4 (Bates, Mächler, Bolker, and Walker 2015; Bates, Maechler, Bolker, and Walker 2021) for the statistical computing software R (R Core Team 2021) offers such an exceptionally fast and generic implementation for mixed models (see Bates et al. 2015). The package has a modular framework allowing for the profile restricted maximum likelihood (REML) criterion, as a function of the model parameters, to be optimized using any constrained optimization function in R, and uses rapid techniques for solving penalized least squares problems based on sparse matrix methods.
The fact that mixed models are widely used statistical tools makes model selection indispensable. Consequently, research regarding model choice, variable selection and hypothesis testing in mixed models has flourished in recent years.
Hypothesis testing on random effects is well established, although for likelihood ratio tests boundary issues arise (Crainiceanu and Ruppert 2004; Greven, Crainiceanu, Küchenhoff, and Peters 2008; Wood 2013). In model selection for linear mixed models using the Akaike information criterion (AIC; Akaike 1973), Vaida and Blanchard (2005) suggest using different criteria depending on the focus of the underlying research question. They make a distinction between questions with a focus on the population and on clusters, respectively. For the latter, they introduce a conditional AIC accounting for the shrinkage in the random effects. Based on this conditional AIC, Liang, Wu, and Zou (2008) propose a criterion that corrects for the estimation uncertainty of the random effects variance parameters based on a numerical approximation. Greven and Kneib (2010) show that ignoring this estimation uncertainty induces a bias and derive an analytical representation of the conditional AIC.
For certain generalized mixed models, analytical representations of the conditional AIC exist, for instance for Poisson responses (see Lian 2012). Although there is no general unbiased criterion in analytical form for all exponential family distributions as argued in Säfken, Kneib, van Waveren, and Greven (2014), bootstrap-based methods can often be applied as we will show for those presented in Efron (2004). An asymptotic criterion for a wider class of distributions is described in Wood, Pya, and Säfken (2016). The R package lmerTest (see Kuznetsova, Brockhoff, and Christensen 2017) offers some tests in linear mixed effects models.
In this paper, we describe an add-on package to lme4 that facilitates model selection based on the conditional AIC and illustrate it with several examples. For the conditional AIC proposed by Greven and Kneib (2010) for linear mixed models, the computation of the criterion is not as simple as it is for other common AIC variants. This article focuses on techniques for fast and stable computation of the conditional AIC in mixed models estimated with lme4, as they are implemented in the R package cAIC4 (Säfken, Rügamer, Baumann, and Kruse 2021), available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=cAIC4. The number of possible models increases substantially with the R package gamm4 (Wood and Scheipl 2020), which allows for the estimation of a wide class of models with quadratic penalty, such as spline smoothing and additive models. The presented conditional AIC applies to any of these models.
In addition to translating the findings of Greven and Kneib (2010) to the model formulations used in Bates et al. (2015), we present the implementation of conditional AICs proposed for non-Gaussian settings in Säfken et al. (2014) and a new version for binary responses based on Efron (2004). With these results, a new scheme for stepwise conditional variable selection in mixed models is introduced. This allows for automatic choice of fixed and random effects based on the optimal conditional AIC. All methods are accompanied by examples, mainly taken from lme4 (see Bates et al. 2015). The rest of this paper is structured as follows: In Section 2 the mixed model formulations are introduced based on one example with random intercepts and random slopes and a second example on penalized spline smoothing. The conditional AIC for Gaussian, Poisson and Bernoulli responses is introduced in Section 3. Section 4 gives a hands-on introduction to cAIC4 with specific examples for the sleepstudy and the grouseticks data from lme4. The new scheme for stepwise conditional variable selection in mixed models is presented in Section 5 and applied to the Zambia data set that is included in the package. After the conclusion in Section 6, Appendix A describes how cAIC4 automatically deals with boundary issues. Furthermore, the underlying code for the rapid computation of the conditional AIC is presented in Appendix B.

The mixed model
In a linear mixed model, the conditional distribution of the response y given the random effects u has the form

y | u ∼ N(Xβ + Zu, σ^2 I_n),    (1)

where y = (y_1, ..., y_n)^T is the n-dimensional vector of responses, β is the p-dimensional vector of fixed effects and u is the q-dimensional vector of random effects. The matrices X and Z are the (n × p) and (n × q) design matrices for fixed and random effects, respectively, and σ^2 refers to the variance of the error terms.
The unconditional distribution of the random effects u is assumed to be multivariate Gaussian with mean 0 and positive semidefinite (q × q) covariance matrix D_θ, i.e., u ∼ N(0, D_θ). The symmetric covariance matrix D_θ depends on the covariance parameters θ and may be decomposed as

D_θ = σ^2 Λ_θ Λ_θ^T,    (2)

with the lower triangular covariance factor Λ_θ and the variance parameter σ^2 of the conditional response distribution. In analogy to generalized linear models, the generalized linear mixed model extends the distributional assumption in (1) to a distribution F from the exponential family, where φ is a scale parameter and the mean has the form µ = E(y | u) = h(Xβ + Zu), with h being the response function applied componentwise and natural parameter η = h^{-1}(µ). As the hereinafter presented results are limited to the Poisson and binomial distributions, we can assume φ = 1. The symmetric covariance matrix in (2) then is the same as for Gaussian responses except that σ^2 is omitted, i.e., D_θ = Λ_θ Λ_θ^T.
The given conditional formulations of (generalized) linear mixed models imply marginal models, which can (conceptually) be obtained by integrating the random effects out of the joint distribution of y and u, i.e., f(y) = ∫ f(y | u) f(u) du. However, there is typically no closed form solution for this integral. While the marginal model formulation is usually used for estimation, an analytic representation of f(y) is only available for the linear mixed model (1). The marginal distribution f(y) for Gaussian responses y is given by

y ∼ N(Xβ, σ^2 (I_n + Z Λ_θ Λ_θ^T Z^T)).
Further extensions of linear mixed models can be obtained by, for example, relaxing the assumption Cov(y|u) = σ 2 I n .

Example I: Random intercepts and random slopes
Some special cases of mixed models are commonly used in applications, including the random intercept model and the random slope model. In the random intercept model, the responses differ in an individual- or cluster-specific intercept for m individuals or clusters. In this case the individual-specific intercept is modeled as random effect u = (u_{1,1}, u_{1,2}, ..., u_{1,m})^T, yielding the (generalized) linear mixed model with linear predictor

η_{ij} = x_{ij}^T β + u_{1,i}

for the jth observation from an individual or cluster i.
Whereas for the random intercept model all covariates modeled with fixed effects are assumed to have the same influence on the response variable across individuals, the random slope model is suitable when an independent variable x_s is assumed to have an individual-specific effect on the dependent variable. The random intercept model is extended to

η_{ij} = x_{ij}^T β + u_{1,i} + u_{2,i} x_{s,ij},

where u_{2,i} is the individual-specific slope, which can be regarded as the deviation from the population slope β_s corresponding to the sth covariate x_{s,ij} in x_{ij}. In most cases, there is no reason to suppose u_{1,i} and u_{2,i} to be uncorrelated, and the distributional assumption thus allows for a general positive semidefinite (2 × 2) covariance matrix of (u_{1,i}, u_{2,i})^T. (4)
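In lme4 formula syntax, the two models can be sketched as follows; the data frame df with response y, covariate x and grouping factor g is hypothetical and only serves as an illustration:

```r
library(lme4)

ri <- lmer(y ~ x + (1 | g), data = df)  # random intercept per group
rs <- lmer(y ~ x + (x | g), data = df)  # correlated random intercept and slope
```

The term `(x | g)` implicitly includes the random intercept and estimates the full covariance matrix of intercept and slope.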

Example II: Penalized spline smoothing
In addition to many possibilities to extend these simple random effect models, (generalized) linear mixed models can also be utilized to fit semi-parametric regression models (see, e.g., Ruppert, Wand, and Carroll 2003). For univariate smoothing, consider the model

y_i = f(x_i) + ε_i

for i = 1, ..., n, where f(·) is a deterministic function of the covariate x_i, which shall be approximated using splines. For illustrative purposes, we consider the truncated polynomial basis representation

f(x) = β_0 + β_1 x + ... + β_g x^g + Σ_{j=1}^k u_j (x − κ_j)_+^g,    (6)

where κ_1 < ... < κ_k are k ∈ N knots partitioning the domain of x, g ∈ N and (x − κ_j)_+^g = max{0, (x − κ_j)^g}. As the truncated part u_j (x − κ_j)_+^g is non-zero for x > κ_j, u_j can be seen as a gradient change of the two consecutive function segments defined on (κ_{j−1}, κ_j] and (κ_j, κ_{j+1}]. In order to estimate β_j, j = 0, ..., g, and u_j, j = 1, ..., k, the method of ordinary least squares (OLS) could in principle be applied. In most cases, however, this yields a rather rough estimate of f for suitably large k, as the gradient changes of the function segments have a large impact. Therefore estimation methods for linear mixed models can be utilized in order to obtain a smooth function. Representing the untruncated polynomial part in (6) as the fixed effects and Σ_{j=1}^k u_j (x − κ_j)_+^g as the random effects part, the well-known shrinkage effect of mixed models is transferred to the estimation of the u_j's, shrinking the changes in the gradient of the fitted polynomials. The random effects assumption corresponds to a quadratic penalty on the u_j's, with the smoothing parameter estimated from the data.
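As a sketch, the design matrices X and Z implied by the truncated polynomial representation can be set up directly in base R; the helper trunc_basis and the placement of knots at equidistant quantiles are our own illustrative choices:

```r
# Build fixed (X) and random (Z) effects design matrices for the
# truncated polynomial basis of degree g with k interior knots
trunc_basis <- function(x, g = 2, k = 10) {
  probs <- seq(0, 1, length.out = k + 2)[-c(1, k + 2)]
  kappa <- quantile(x, probs = probs)   # interior knots kappa_1 < ... < kappa_k
  X <- outer(x, 0:g, `^`)               # untruncated part: 1, x, ..., x^g
  Z <- outer(x, kappa, function(xi, kj) pmax(0, (xi - kj)^g))  # truncated part
  list(X = X, Z = Z)
}
```

Estimating the coefficients of Z as random effects then yields the quadratic penalty on the u_j's described above.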
This approach also works analogously for various other basis functions, including the frequently used B-spline basis, and for generalized linear models (see, e.g., Fahrmeir, Kneib, Lang, and Marx 2021). Moreover, a rich variety of models that can be represented as reduced rank basis smoothers with quadratic penalties allow for this kind of representation. The estimation via lme4 can be carried out using gamm4. For an overview of possible model components see Wood (2017). An example is also given in Section 5.

The Akaike information criterion
Originally proposed by Hirotugu Akaike (Akaike 1973), the AIC was one of the first model selection approaches to attract special attention among users of statistics. In some way, the AIC extends the maximum likelihood paradigm by making available a framework in which both parameter estimation and model selection can be accomplished. The principal idea of the AIC can be traced back to the Kullback-Leibler distance (KLD; Kullback and Leibler 1951), which can be used to measure the distance between a true (but usually unknown) density g(y) and a parametric model f(y | ν). The unknown parameters ν are commonly estimated by their maximum likelihood estimator ν̂(y). Minimizing the expected Kullback-Leibler distance is equivalent to minimizing the so-called Akaike information (AI)

AI = −2 E_{g(y)} E_{g(ỹ)} [log f(ỹ | ν̂(y))],    (8)

with ỹ a set of independent new observations from g. Minus twice the maximized log-likelihood, −2 log f(y | ν̂(y)), as a natural measure of goodness-of-fit is an obvious estimator of the AI. However, this approach induces a bias, as the maximized log-likelihood only depends on y, whereas (8) is defined as a predictive measure based on two independent replications ỹ and y from the same underlying distribution. Therefore the bias correction is defined as

BC = AI − E_{g(y)} [−2 log f(y | ν̂(y))].    (9)

Akaike derived this bias correction, which under certain regularity conditions can be estimated asymptotically by two times the dimension of ν. This yields the well-known estimator of the AI, AIC(y) = −2 log f(y | ν̂(y)) + 2 dim(ν).
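For a simple linear model the relationship AIC(y) = −2 log f(y | ν̂(y)) + 2 dim(ν) can be verified directly in R, here using the built-in cars data, where dim(ν) = 3 (intercept, slope and error variance):

```r
fit <- lm(dist ~ speed, data = cars)

# AIC equals minus twice the maximized log-likelihood plus twice dim(nu)
aic_manual <- -2 * as.numeric(logLik(fit)) + 2 * 3
all.equal(AIC(fit), aic_manual)  # TRUE
```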
Hence, since the statistical model f(· | ν) with the smallest AI is the one closest to the true model in terms of the KLD, the AIC can be seen as a relative measure of goodness-of-fit for different models of one model class. Notice that the bias correction is equivalent to twice the (effective) degrees of freedom and to the covariance penalty of Efron (2004).

The marginal and the conditional perspective on the AIC
Adopting this principle for the class of mixed models in order to select amongst different random effects structures is not straightforward. First of all, the question arises which likelihood this AIC should be based on. For the class of mixed models, two common criteria exist: the marginal AIC (mAIC), based on the marginal log-likelihood, and the conditional AIC (cAIC), based on the conditional log-likelihood. The justification of each criterion corresponds to the purpose of the marginal and the conditional mixed model perspective, respectively. Depending on the question of interest, the intention of the two perspectives differs, as described for example in Vaida and Blanchard (2005) or Greven and Kneib (2010).
The marginal perspective of mixed models is suitable when the main interest is to model fixed population effects with a reasonable correlation structure. The conditional perspective, by contrast, can be used to make statements based on the fit of the predicted random effects.
In longitudinal studies, for example, the latter point of view seems to be more appropriate if the focus is on subject- or cluster-specific random effects. Another crucial difference between the two approaches lies in the model's use for prediction. On the one hand, the marginal model seems more plausible if the outcome for new observations comes from new individuals or clusters, i.e., observations having new random effects. The conditional model, on the other hand, is recommended if predictions are made for the same individuals or clusters, thereby predicting on the basis of the already modeled random effects.
The corresponding AI criteria have closely related intentions. The conditional AIC estimates the optimism of the estimated log-likelihood for a new data setỹ by leaving the random effects unchanged. This can be understood as a predictive measure based on a new data set originating from the same clusters or individuals as y. On the contrary, the marginal approach evaluates the log-likelihood using a new predictive data setỹ, which is not necessarily associated with the cluster(s) or individual(s) of y.
In particular for the use of mixed models in penalized spline smoothing, the cAIC usually represents the more plausible choice. As demonstrated in Example II of Section 2, the representation of penalized spline smoothing via mixed models divides certain parts of the spline basis into fixed and random effects. Using the marginal perspective in Example II, predictions would therefore be based only on the polynomial coefficients of f. If the fitted nonlinear function is believed to represent a general relationship between x and y, predictions as well as the predictive measure in terms of the Akaike information make more sense if the truncated parts of the basis are also taken into account. Vaida and Blanchard (2005) propose the cAIC, an estimator of the conditional Akaike information, as an alternative to the mAIC, where ν includes the fixed effects and covariance parameters θ. The cAIC may be more appropriate when the AIC is used for the selection of random effects. In addition, Greven and Kneib (2010) investigate the difference between both criteria from a mathematical point of view. Since the mAIC is intended for use in settings where the observations are independent and the parameter space V can be transformed to R^dim(ν), the corresponding bias correction 2 dim(ν) is biased for mixed models, for which these conditions do not apply. In particular, Greven and Kneib (2010) show that the mAIC leads to a preference for smaller models without random effects.

Conditional AIC for Gaussian responses
Depending on the distribution of y, different bias corrections of the maximized conditional log-likelihood exist to obtain the cAIC. For the Gaussian case, Liang et al. (2008) derive a corrected version of the cAIC initially proposed by Vaida and Blanchard (2005) for known error variance, taking into account the estimation of the covariance parameters θ:

cAIC = −2 log f(y | ν̂, û) + 2 tr(∂ŷ/∂y).    (11)

Evaluating the bias correction BC = 2 tr(∂ŷ/∂y) in expression (11) via numerical approximation, or a similar formula for unknown error variance, is however computationally expensive. Greven and Kneib (2010) develop an analytic version of the corrected cAIC, making its calculation feasible. We adapt their efficient implementation originally written for 'lme'-objects (returned by the nlme package; Pinheiro, Bates, DebRoy, Sarkar, and R Core Team 2021) and reimplement their algorithm for 'lmerMod'-objects (returned by lme4). A more detailed description of the calculation of several terms in the formula proposed by Greven and Kneib (2010) is given in Appendix B. Furthermore, a partition of the parameter space is needed in order to account for potential parameters on the boundary of the parameter space, as presented in Theorem 3 of Greven and Kneib (2010). This process can be very unwieldy. Therefore, a fully automated correction algorithm is implemented in cAIC4 and presented in Appendix A.

Conditional AIC for Poisson responses
As for the Gaussian case, note that for the Poisson and the binomial distribution the bias correction (9) can be rewritten as twice the sum of the covariances between the estimated natural parameters η̂_i and the responses y_i,

BC = 2 Σ_{i=1}^n Cov(η̂_i, y_i),    (12)

with true but unobserved mean µ_i = E(y_i) and the estimator of the natural parameter η̂ = (η̂_1, ..., η̂_n) depending on y. For the Poisson distribution, a reformulation of the bias correction term (12) is needed to make it analytically accessible, as in Säfken et al. (2014). Using results from Hudson (1978) and an identity due to Chen (1975), the bias correction (12) for Poisson distributed responses can be reformulated to

BC = 2 Σ_{i=1}^n E[ y_i (η̂_i(y) − η̂_i(y − e_i)) ],    (13)

where η̂_i(y − e_i) denotes the estimate of the ith natural parameter based on the data with the ith response reduced by one.

Conditional AIC for Bernoulli responses
For binary responses there is no analytical representation of the bias correction (12), see Säfken et al. (2014). Nevertheless, a bootstrap estimate of the bias correction can be based on Efron (2004). The bias correction is equal to the sum over the covariances of the estimators of the natural parameters η̂_i and the data y_i. To estimate this quantity, we could in principle draw a parametric bootstrap sample z_i of size B for the ith data point (keeping all other observations fixed at their observed values) to estimate the ith component E(η̂_i (y_i − µ_i)) of the bias correction (12) for binary responses. The resulting estimator involves B_0, the number of zeros in the bootstrap sample, B_1, the number of ones in the bootstrap sample, the estimated logit with z_ij = 0, and the mean z̄_i· of the bootstrap sample z_i. Letting the number of bootstrap samples tend to infinity, i.e., B → ∞, the mean of the bootstrap sample z̄_i· = (1/B) Σ_{j=1}^B z_ij = B_1/B (as well as B_1/(B − 1)) converges to the estimate from the data, which corresponds to the true mean in the bootstrap, µ̂_i. Since bootstrap estimates are optimal if the number of bootstrap samples B tends to infinity, the resulting limit can be seen as the optimal bootstrap estimator. The resulting estimator of the bias correction, which we use in the following, avoids a full bootstrap but requires n model refits.
Notice that for generalized mixed models, especially for binary data and Poisson data with small counts, the Laplace approximation used in lme4 for models with more than one grouping variable is not always optimal, see Ogden (2017). However, the estimator µ̂_i(·) used for Poisson and Bernoulli responses does not need to fulfill any specific conditions for the cAIC to be applicable. Thus conclusions based on the cAIC, in the sense that the model with the smallest estimated conditional Akaike information is chosen, remain trustworthy even if the parameter estimates are biased.
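From the user's point of view, the binary case is handled by the same interface as the other distributions; a minimal sketch with a hypothetical data frame df containing a binary response y, a covariate x and a grouping factor g:

```r
library(lme4)
library(cAIC4)

# cAIC for a binary mixed model; the bias correction is estimated
# via the n model refits described above
bm <- glmer(y ~ x + (1 | g), family = "binomial", data = df)
cAIC(bm)
```

Note that the n refits make this computation noticeably more expensive than the Gaussian case.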

Example for linear mixed models
An example that is often used in connection with the R package lme4 is the sleepstudy data.
Thus we use this data set to give a brief introduction to the technical usage of the R package cAIC4. The data set stems from a study on daytime performance changes of reaction time during chronic sleep restriction, see Belenky, Wesensten, Thorne, Thomas, Sing, Redmond, Russo, and Balkin (2003). Eighteen volunteers were only allowed to spend three hours of their daily time in bed for one week. The speed (mean and fastest 10% of responses) and lapses (reaction times greater than 500 ms) on a psychomotor vigilance task were measured several times. The averages of the reaction times are saved as the response variable Reaction in the data set. Each volunteer has an identifier Subject. Additionally, the number of days of sleep restriction at each measurement is given in the covariate Days.
The conditional AIC can be used to find the model that best predicts future observations, assuming that future observations share the same random effects as the ones used for the model fitting. In case of this data set, using the cAIC for model choice corresponds to finding the model that best predicts future reaction times of the volunteers that took part in the study.
A first model that could be applied is a model with a random intercept and a random slope for Days within each volunteer (Subject):

Reaction_ij = β_0 + β_1 Days_ij + u_{1,i} + u_{2,i} Days_ij + ε_ij

for i = 1, ..., 18 and j = 1, ..., 10, where (u_{1,i}, u_{2,i})^T is bivariate Gaussian with variances τ_1^2 and τ_2^2, covariance τ_12, and ε_ij ∼ N(0, σ^2). In the preceding notation, τ_1^2/σ^2 = θ_1, τ_2^2/σ^2 = θ_2 and τ_12/σ^2 = θ_3. That τ_12 is not necessarily zero indicates that the random intercept and the random slope are allowed to be correlated. The conditional log-likelihood and the corrected degrees of freedom, i.e., one half the bias correction, are printed together with the conditional AIC as proposed in Greven and Kneib (2010). The 'cAIC' object is a list additionally containing an element called reducedModel, which is the model without the random effects covariance parameters that were estimated to lie on the boundary of the parameter space (see Appendix A and Greven and Kneib 2010), and NULL if there were none on the boundary.

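The model fit and the cAIC call can be sketched as follows; the call uses standard lme4 formula syntax, and the numeric output shown in the original article is omitted here:

```r
library(lme4)
library(cAIC4)

# Random intercept and random slope for Days within each Subject
fm1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy)
cAIC(fm1)
```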
There are several further possible models for these data. One possibility might be to reduce the model complexity by omitting the correlation between the random intercept u_{1,i} and the random slope u_{2,i} for each Subject. Since this correlation is needed in order to preserve translational invariance (see Bates et al. 2015), such a comparison can be ill-advised. Alternatively, the complexity of the model could be reduced by excluding the random slope from the model. This results in a simple random intercept model:

Reaction_ij = β_0 + β_1 Days_ij + u_{1,i} + ε_ij

for i = 1, ..., 18 and j = 1, ..., 10, with u_{1,i} ∼ N(0, τ_1^2) and ε_ij ∼ N(0, σ^2). The decrease in the conditional likelihood turns out to be much larger than the reduction of the degrees of freedom achieved by omitting the random slopes. The strong evidence for subject-specific (random) slopes found in Bates et al. (2015) is thus also reflected by the cAIC.
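The reduced random intercept model can be fitted and compared analogously; a sketch, assuming lme4 and cAIC4 are loaded:

```r
# Random intercept only; compare its cAIC to the random slope model
fm2 <- lmer(Reaction ~ 1 + Days + (1 | Subject), sleepstudy)
cAIC(fm2)
```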
The conditional AIC is also appropriate for choosing between a simple null model without any random effects and a complex model incorporating random effects, as has been noticed by Greven and Kneib (2010). Thus it is possible to compare the cAIC of the two previous mixed models with the standard AIC for a linear model, here including three parameters (intercept, linear effect for Days and error variance). This can be called via a cAIC() wrapper function:

R> cAIC(lm(Reaction ~ 1 + Days, sleepstudy))

Conditional log-likelihood: -950.15
Degrees of freedom: 3.00
Conditional Akaike information criterion: 1906.29

In this case, however, the mixed model structure is evident, reflected by the large AIC for the linear model.

Example for generalized linear mixed models
The cAIC4 package additionally offers a conditional AIC for conditionally Poisson distributed responses and an approximate conditional AIC for binary data. The Poisson cAIC uses the bias correction (13), and the bias correction term for binary data is given by (14).
Making use of the fast refit() function from the lme4 package, both cAICs can be computed moderately fast, since n − d and n model refits are required, respectively, with n being the number of observations and d the number of Poisson responses that are zero. In the following, the cAIC for Poisson responses is computed for the grouseticks data set from the lme4 package as an illustration.
The grouseticks data set was originally published in Elston, Moss, Boulinier, Arrowsmith, and Lambin (2001). It contains information about the aggregation of parasites, so-called sheep ticks, on red grouse chicks. The variables in the data set are given in Table 1. Every chick, identified by INDEX, is of a certain BROOD and every BROOD, in turn, corresponds to a specific YEAR.
The number of ticks is the response variable. Following the authors, in a first model the expected number of ticks λ_j for chick j (INDEX) is modeled depending on the year and the height as fixed effects, with a random intercept for each of the grouping variables BROOD (i), INDEX (j) and LOCATION (k).
The full model is

log(E(TICKS_j)) = log(λ_j) = β_0 + β_1 · YEAR_j + β_2 · HEIGHT_j + u_{1,i} + u_{2,j} + u_{3,k},

with three independent univariate Gaussian random effects u_{1,i}, u_{2,j} and u_{3,k}, each with its own variance. Notice that in this model INDEX is an observation-level random effect that addresses overdispersion in the conditional Poisson distribution given the random effects. Before fitting the model, the covariates HEIGHT and YEAR are centered for numerical reasons and stored in the data set grouseticks_cen.
R> f1 <- TICKS ~ YEAR + HEIGHT + (1 | BROOD) + (1 | INDEX) + (1 | LOCATION)
R> p1 <- glmer(f1, family = "poisson", data = grouseticks_cen)

A summary of the estimated model can be inspected with summary(p1). Notice that the AIC reported in the automated summary of lme4 is not appropriate for conditional model selection. The conditional log-likelihood and the degrees of freedom for the conditional AIC with conditionally Poisson distributed responses as in (13) are again computed with the cAIC() function. The output has the same form as for Gaussian linear mixed models. It becomes apparent that there is a substantial difference between the conditional and the marginal AIC: In the output of the model the marginal AIC is reported to be 1845.5. Note that the marginal AIC is biased, see Greven and Kneib (2010), and based on a different likelihood.
The cAIC() function calls the deleteZeroComponents() function, which reduces the model by random effects components that are estimated to have variances on the boundary of the parameter space; see Appendix A for the technical details and Greven and Kneib (2010) for the methodological background. In this example this is the case for the random intercepts of the LOCATION grouping variable. Thus the resulting cAIC is the same (modulo rounding) as if the random intercept associated with LOCATION were excluded from the model:

R> f2 <- TICKS ~ YEAR + HEIGHT + (1 | BROOD) + (1 | INDEX)
R> p2 <- glmer(f2, family = "poisson", data = grouseticks_cen)
R> cAIC(p2)

Conditional log-likelihood: -572.01
Degrees of freedom: 205.59
Conditional Akaike information: 1555.21

Another possible model for comparison omits the random intercepts associated with the BROOD grouping. This is equivalent to setting the associated random intercept variance to zero.
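The refit-based computation behind the Poisson bias correction can be sketched as follows; this is a simplified illustration of the idea (cAIC4 implements this internally), reusing the model p2 from above:

```r
# Sketch: bias correction (13) via n - d model refits,
# where d is the number of zero responses
y   <- grouseticks_cen$TICKS
eta <- predict(p2, type = "link")          # eta-hat from the observed data
bc  <- 0
for (i in which(y > 0)) {
  y_red    <- y
  y_red[i] <- y[i] - 1                     # reduce the ith response by one
  eta_red  <- predict(refit(p2, y_red), type = "link")
  bc       <- bc + y[i] * (eta[i] - eta_red[i])
}
bc <- 2 * bc                               # estimated bias correction
```

Observations with y_i = 0 contribute nothing, which is why only n − d refits are needed.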

A scheme for stepwise conditional variable selection
Now having the possibility to compare different (generalized) linear mixed models via the conditional AIC, we introduce in this section a model selection procedure that searches the space of possible model candidates in a stepwise manner. Inspired by commonly used step functions, such as the stepAIC function in the MASS package (Venables and Ripley 2002), our stepcAIC function provides an automatic model selection applicable to some models of the class 'merMod' (produced by [g]lmer) or objects resulting from a gamm4 call. A general strategy could be to automatically select the random effect structure in a forward fashion first and afterwards select the fixed effects structure for given random effects using the stepwise cAIC (or other common selection criteria for fixed effects). Alternative methods that also select both the fixed and random effect structure are available, e.g., the function step in the package lmerTest. Our method has the advantage of being solely based on a predictive measure, the cAIC, and models are compared by a simple comparison of this value. This circumvents problems associated with post-selection inference, where tests are conducted under the assumption of a null distribution that does not hold when the corresponding null hypothesis is the result of a preceding model selection. Although based on a predictive measure, we want to emphasize that our intention behind the stepcAIC function is not to provide a black-box algorithm for model selection, but rather a convenience function for users who would otherwise manually select the best random effect structure in the same fashion. In addition, the function facilitates the comparison of different additive models when applied to a gamm4 call, allowing the user to assess which nonlinearities are necessary for a given data set and which would instead overfit the data.
The step procedure can therefore be used to successively extend or reduce the model in order to check whether a fixed covariate has a constant, linear or nonlinear effect.
Whereas the backward procedure has a straightforward mechanism and does not need any further mandatory arguments, the stepcAIC function provides several optional and obligatory arguments for the forward and both procedures in order to limit the possibly large number of model extensions. Regarding the required parameters, the user must specify the variables that may be added as fixed or random effects, as they are named in the 'data.frame' given by the argument data. For the fixed effects, this is done by specifying the fixEf argument, which expects a character vector with the names of the covariates, e.g., fixEf = c("x1", "x2"). Variables listed in the fixEf argument are first included in the model as linear terms and, if the linear effect leads to an improvement of the cAIC, subsequently checked for nonlinearity by evaluating the cAIC of the corresponding model(s). Model extensions resulting from additional random effects are created in two different ways. A new model may, on the one hand, include a random intercept for a variable forming a grouping structure or, on the other hand, a random slope for a variable. These two types are specified using the arguments groupCandidates for grouping variable candidates and slopeCandidates for random slope candidates, again by referring to the variable names in data using strings.
Further optional arguments determine the way random effects are treated in the step procedure:
- allowUseAcross: logical value indicating whether slope variables that are already in use with one grouping variable may also be used with other grouping variables;
- maxSlopes: maximum number of slopes for one grouping variable.
Following the stepAIC function, the stepcAIC function also provides an argument for printing interim results (trace) and allows the remaining terms of the initial model to be kept unaffected by the procedure (keep: a list with entries fixed and random, each either NULL or a formula). In addition, the user may choose whether the cAIC is calculated for models for which the fitting procedure in (g)lmer could not find an optimum (calcNonOptimMod, default FALSE) and may choose the type of smoothing terms added in forward steps (bsType).
If the step function is used for large data sets or for highly complex models, the model fitting as well as the cAIC calculations can be parallelized by specifying the number of cores (numCores) to be used whenever more than one model has to be fitted and evaluated in a step (the numCores argument is passed to the mclapply function of the base package parallel).
Due to the variety of additive model definitions in gamm4, the stepcAIC function is, however, limited in its generic step functionality for GAMMs. On the one hand, extensions with nonlinear effects are restricted to the one smooth class given by bsType; on the other hand, the step procedure cannot deal with further arguments passed to smooth terms. The latter is a current limitation, since the default basis dimension of a smooth term (i.e., the number of knots and the order of the penalty) is essentially arbitrary.
An additional current limitation of the stepcAIC function when applied to GAMMs is the handling of zero variance components occurring during the function call. Since a meaningful handling of zero variance smoothing terms would depend on the exact specification of the nonlinear term, the stepwise procedure is stopped in this case and returns the result of the previous step. After manually removing the term associated with the zero variance, e.g., by specifying a linear instead of a nonlinear effect for the corresponding covariate, the user may call the step function again.

Examples for stepwise conditional variable selection
In order to demonstrate some functionalities of the stepcAIC function, we use a data set on childhood malnutrition in Zambia, which has been analyzed by Kandala, Lang, Klasen, and Fahrmeir (2000) and by Greven and Kneib (2010) to demonstrate the applicability of the cAIC for model selection in additive models. The subset of the original data set used here for illustrative purposes is available in our R package and can be loaded into the environment via data("Zambia", package = "cAIC4"). The dependent variable is the Z-score, a standardized measure of the height of a child, which is an indicator of chronic undernutrition or stunting in children. We estimate a linear mixed model using four covariates, the duration of breastfeeding (c.bf) in months, the age (c.age) of the child in months, and the height (m.ht) as well as the body mass index (m.bmi) of the mother, included in the model as fixed effects, and account for spatial heterogeneity by including a random intercept for the region (reg) and for the district within the region (dr) the child lives in.
Starting with the following random effects model, we additionally check whether the two continuous covariates m.ht and m.bmi should also be included as random slopes for either of the random intercepts.
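The initial model initmixmod referred to below is not printed explicitly here; based on the variable description above, it could be fitted along these lines (a sketch, assuming the Zambia data are loaded as described and the Z-score column is named z):

```r
library(lme4)
data("Zambia", package = "cAIC4")

# Linear mixed model with the four fixed effects and random intercepts
# for region and for district within region, as described in the text
initmixmod <- lmer(z ~ c.bf + c.age + m.ht + m.bmi + (1 | reg) + (1 | dr),
                   data = Zambia)
```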
R> mod <- stepcAIC(initmixmod, direction = "both",
+    slopeCandidates = c("m.ht", "m.bmi"), data = Zambia, trace = TRUE)

The step function starts by automatically calculating the four possible models that result from extending either one of the intercepts by one of the two slope variables. By default the scope of possible models is restricted to random effect structures that assume a non-zero correlation between the random effects.
In the first step, the model cannot be improved by extending it. The next step is a backward step, in which the random intercept for region is dropped. The resulting model is not altered up to the final step 4 and is chosen as the final model.
To illustrate the use of the stepcAIC function in the context of GAMM selection, we now investigate if all fixed effects should be kept in the model given by the previous stepcAIC call and/or whether any of the covariates may yield an improvement in cAIC if included as nonlinear effect. We therefore fit an initial gamm4 model as follows.
R> bestmixmod <- gamm4(z ~ c.age + c.bf + m.ht + m.bmi, random = ~ (1 | dr),
+    data = Zambia)

In order to allow linear effects to be dropped and/or nonlinear effects to be added to the model, we again employ the stepcAIC function with direction = "both" and supply the names of the four covariates that should be considered as nonlinear effects in fixEfCandidates.
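The corresponding stepcAIC call, whose result object admod is used below when plotting the final model, is not printed in the text; based on the description above it could look as follows (a sketch):

```r
library(cAIC4)

# Stepwise search in both directions: the four covariates are candidates
# for removal as linear terms or for inclusion as nonlinear (smooth) effects
admod <- stepcAIC(bestmixmod,
                  direction       = "both",
                  fixEfCandidates = c("c.age", "c.bf", "m.ht", "m.bmi"),
                  data            = Zambia,
                  trace           = TRUE)
```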

In the first step, c.age is changed to have a nonlinear effect using a thin-plate regression spline; the type of spline used in forward steps can be specified via the bsType argument. In a second (backward) step, the function checks whether any of the model components should be dropped from the model. As all possibilities would result in models with larger cAIC values, the current model is checked for further nonlinearities in a third (forward) step. As all options again result in larger cAIC values, the final model within the specified scope of the step function is given by the formula ~ s(c.age, bs = "tp") + c.bf + m.ht + m.bmi + (1 | dr). The estimated nonlinear effect of the final model can be visualized by plot(admod$finalModel$gam), i.e., by calling the plot function on the gam part of the final model. The result is given in Figure 1, indicating that for very young children under 17 months, age is associated with a positive Z-score, hence a "positive" value in terms of health, which however decreases continuously with increasing age. After 2 to 2.5 years, the effect of age on the Z-score is estimated to be almost constant, with a negative association. A similar effect of the children's age was also found by Greven and Kneib (2010).

Conclusion
This paper gives a hands-on introduction to the R package cAIC4 allowing for model selection in mixed models based on the conditional AIC. The package and the paper offer a possibility for users from the empirical sciences to use the conditional AIC without having to worry about lengthy and complex calculations or mathematically sophisticated boundary issues of the parameter space. The applications presented in this paper go far beyond model selection for mixed models and extend to penalized spline smoothing and other structured additive regression models. Furthermore a stepwise algorithm for these models is introduced that allows for fast model selection.
Often statistical modeling is not about finding one "true model". In such cases it is of interest to define weighted sums of plausible models. This approach, called model averaging, is presented in Zhang, Zou, and Liang (2014) for weights chosen by the cAIC; we plan to implement it in cAIC4. Another future research path is to implement an appropriate version of the Bayesian information criterion (BIC) for conditional model selection. Since the Laplace approximation is not always optimal (see Ogden 2017), preferable likelihood approximations are implemented in the glmmsr package (see Ogden 2019). Furthermore, the newly developed GLMMadaptive package (see Rizopoulos 2021) fits mixed models with one grouping variable using adaptive Gaussian quadrature. We plan to make cAIC4 usable with these packages in the future.

A. Dealing with the boundary issues
A major issue in obtaining the conditional AIC in linear mixed models is to account for components of the covariance parameter vector θ that potentially lie on the boundary of the parameter space (see Greven and Kneib 2010). This needs to be done in order to ensure positive definiteness of the covariance matrix D θ.
The restructuring of the model in order to obtain the cAIC is done automatically by cAIC4. To gain insight into the restructuring, an understanding of the mixed model formulas used in lme4 is essential. For an in-depth explanation on how the formula module of lme4 works, see Bates et al. (2015), Section 2.1.
Suppose we want to fit a mixed model with two grouping factors g1 and g2. Within the first grouping factor g1 there are three continuous variables v1, v2 and v3, and within the second grouping factor there is only one variable x. Thus there are not only random intercepts but also random slopes that are possibly correlated. Such a model with response y would be fitted in lme4 using

R> m <- lmer(y ~ (v1 + v2 + v3 | g1) + (x | g2), exampledata)

singular fit
The singular fit statement here already indicates that potentially some parameters are on the boundary of the parameter space. In mixed models fitted with lme4, the random effects covariance matrix D θ always has a block-diagonal structure. For instance, in the example from above, the Cholesky-factorized blocks of the estimated D θ associated with each random effects term are obtained by

R> getME(m, "ST")

This code is called by the function deleteZeroComponents in the cAIC4 package, which automatically deletes all zero components from the model. deleteZeroComponents is called recursively, i.e., the new model is checked again for zero components. In the example above only the random intercepts are non-zero. Hence the formula of the reduced model from which the conditional AIC is calculated is

R> formula(cAIC4:::deleteZeroComponents(m))
singular fit
y ~ (1 | g2) + (1 | g1)

With the new model the conditional AIC is computed. If there are no random effect terms left in the formula, a linear model is fitted and the conventional AIC is returned. In addition, deleteZeroComponents also accounts for several special cases that may occur.
Notice, however, that when smoothing terms from gamm4 are used, no automated check for boundary issues can be applied and zero components have to be deleted manually. Furthermore, the boundary issues are only relevant for the analytic Gaussian case. For generalized models a refitting scheme is used to calculate the cAIC, so the boundary issues do not arise.

B. Computational matters

B.1. Gaussian responses
The corrected conditional AIC proposed in Greven and Kneib (2010) accounts for the uncertainty induced by the estimation of the random effects covariance parameters θ. In order to adapt the findings of Greven and Kneib (2010), a number of quantities from the lmer model fit need to be extracted and transformed. In the following these computations are presented. They are designed to minimize the computational burden and maximize numerical stability. Parts of the calculations needed, for instance the Hessian of the ML/REML criterion, can also be found in Bates et al. (2015). Notice, however, that lme4 does not explicitly calculate these quantities but uses derivative-free optimizers for the profile likelihoods. Parts of the computations are similar to those performed in the merDeriv package, see Wang and Merkle (2018).
A core ingredient of mixed models is the covariance matrix of the marginal responses y. The inverse of the scaled covariance matrix V_0 will be used in the following calculations:

V = cov(y) = σ² I_n + σ² Z Λ_θ Λ_θ^T Z^T = σ² V_0.
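As an illustration of how such quantities can be extracted from a fitted model, the scaled covariance matrix V_0 can be assembled from the sparse factors that lme4 stores (a sketch; m denotes any fitted lmer model, as in the example of Appendix A):

```r
library(lme4)
library(Matrix)

# Sketch: assemble V_0 = I_n + Z Lambda_theta Lambda_theta' Z' from a
# fitted lmer model `m`, using the sparse matrices stored by lme4
Zt <- getME(m, "Zt")        # transposed random effects design matrix Z'
Lt <- getME(m, "Lambdat")   # transposed relative covariance factor Lambda_theta'
n  <- nrow(model.frame(m))
V0 <- Diagonal(n) + crossprod(Lt %*% Zt)   # crossprod(A) computes A'A
```

Working with the sparse representations keeps both memory use and runtime low, in line with the efficiency considerations above.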

B.3. Bernoulli
The computation of an estimator of the bias correction for Bernoulli distributed responses as in Equation 14 is similar to the implementation for Poisson distributed responses above. Therefore, consider any Bernoulli model b1 fitted by the glmer function in lme4. For the calculation of the bias correction, the model needs to be refitted for each observed response with the complementary value, i.e., 0 replaced by 1 and vice versa. This is best done using the refit() function from lme4.
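A minimal sketch of this response-flipping scheme, assuming b1 is a fitted binomial glmer model with a binary response vector, could look as follows (the object names are illustrative):

```r
library(lme4)

# Sketch: refit the Bernoulli model once per observation, with the i-th
# response flipped (0 -> 1 and 1 -> 0), as required by the bias correction
y <- getME(b1, "y")
flipped_fits <- lapply(seq_along(y), function(i) {
  y_new    <- y
  y_new[i] <- 1 - y_new[i]
  refit(b1, newresp = y_new)
})
```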