Coefficient-Wise Tree-Based Varying Coefficient Regression with vcrpart

The tree-based TVCM algorithm and its implementation in the R package vcrpart are introduced for generalized linear models. The purpose of TVCM is to learn whether and how the coefficients of a regression model vary by moderating variables. A separate partition is built for each potentially varying coefficient, allowing the user to specify coefficient-specific sets of potential moderators, and allowing the algorithm to select moderators individually by coefficient. In addition to describing the algorithm, the TVCM is evaluated using a benchmark comparison and a simulation study, and the R commands are demonstrated by means of empirical applications.


Introduction
When carrying out a regression analysis, researchers often wish to know whether and how moderating variables affect the coefficients of predictor variables. For example, medical scientists may be interested in how age or past illnesses moderate the effect of a clinical trial (e.g., Yusuf, Wittes, Probstfield, and Tyroler 1991), and social scientists may examine the gender wage gap separately for different labor sectors and countries (e.g., Arulampalam, Booth, and Bryan 2007).
Varying coefficient models (e.g., Hastie and Tibshirani 1993) provide a semiparametric approach for such moderated relations. Consider a response variable Y, where g(E(Y | ·)) = η, with g a known link function and η a predictor function of the form

M_vc: η = X_1 β_1(Z_1) + ... + X_P β_P(Z_P),

where the variables X_p, p = 1, ..., P, are the P predictors, each having a varying coefficient β_p(·) that depends on its moderator vector Z_p. A popular approximation of M_vc partitions the joint space of the moderators into M strata B_1, ..., B_M and fits one coefficient vector per stratum,

M_tree: η = Σ_{m=1}^M 1(Z ∈ B_m) X^T β_m.

Such a piecewise constant fit can be visualized as a decision tree and, therefore, the strata B_m are referred to as terminal nodes, or simply as nodes.
Here, we introduce the tree-based varying coefficient model (TVCM) algorithm of the R package vcrpart (Bürgin 2017), available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=vcrpart. The TVCM algorithm allows us to approximate M_vc in a "coefficient-wise" manner. First, assuming that K out of the P predictors have varying coefficients, we let X_0 be the P − K predictors with non-varying coefficients and X_1, ..., X_K denote the remaining K predictors with corresponding moderator vectors Z_1, ..., Z_K. Further, we denote the value space of Z_k as Z_k = Z_k1 × ... × Z_kLk and a partition of Z_k into M_k nodes as {B_k1, ..., B_kMk}. Then, the proposed approximation is

M_tvcm: η = X_0^T β_0 + Σ_{k=1}^K Σ_{m=1}^{M_k} 1(Z_k ∈ B_km) X_k β_km.

Compared with M_tree, the TVCM approximation M_tvcm assigns each varying coefficient its own partition and includes non-varying coefficients. This allows us to specify parametrically known relations (the first term) and coefficient-specific sets of moderators (the second term).
In addition, M_tvcm allows us to select moderators individually by varying coefficient. Furthermore, empirical evidence (Section 4.1) suggests that M_tvcm can produce more accurate and more parsimonious fits than M_tree. A technical difference between the two approximations is that the coefficients of M_tree are commonly obtained by fitting a local model on each of the M strata, while M_tvcm must be fitted as a single model on all observations. The remainder of this paper is organized as follows. In Section 2, we describe the basic algorithm, which we apply to generalized linear models. In Section 3, we provide more detail and extend the basic algorithm. The algorithm is evaluated in Section 4 by comparing its performance with competing algorithms on a benchmark data set and by assessing its ability to retrieve an underlying generating model through a simulation study. Then, in Section 5, we present two applications. Finally, the concluding Section 6 addresses issues for further development.

The TVCM algorithm
Similar to classification and regression trees (CART, Breiman, Friedman, Olshen, and Stone 1984), TVCM involves two stages: The first stage (Section 2.2) builds K overly fine partitions; the second stage (Section 2.3) selects the final partitions by pruning.
To provide a consistent formulation, we restrict our consideration of TVCM to generalized linear models (GLMs). Therefore, Section 2.1 summarizes GLMs and introduces an illustrative example. Extensions to other model families are discussed in Section 3.3.

Generalized linear models
GLMs cover regression models for various types of responses, such as continuous data (the Gaussian model), count data (the Poisson model), and binary data (the logistic model).
Denote the ith response of the training data D as y_i, with observations i = 1, ..., N, and the ith P × 1 predictor vector as x_i. Simple GLMs have densities of the form

f(y_i) = exp{ (y_i θ_i − b(θ_i)) / φ + c(y_i, φ) },

where θ_i is the natural parameter of the family, φ is the dispersion parameter, and b(·) and c(·) are family-specific functions. For example, the Poisson distribution has density f(y_i) = λ_i^{y_i} e^{−λ_i} / y_i!, from which it can be derived that θ_i = log λ_i, b(θ_i) = e^{θ_i} = λ_i, φ = 1, and c(y_i, φ) = −log(y_i!). The predictor vector x_i is incorporated by defining the linear predictor η_i = x_i^T β, where β is the vector of unknown coefficients. This linear predictor η_i is linked with the conditional mean µ_i = E(y_i | x_i) via g(µ_i) = η_i = x_i^T β. The choice of g(·) depends on the specific model. A mathematically motivated choice is to specify g(·) such that θ_i = η_i, usually called the canonical link. For example, for the Poisson model, the canonical link is log(µ_i) = η_i. Further details on GLMs can be found, for instance, in McCullagh and Nelder (1989).
GLMs are generally fitted using maximum likelihood estimation (MLE), in other words, by maximizing the total log-likelihood of the training data w.r.t. β and φ,

ℓ(β, φ; D) = Σ_{i=1}^N w_i [ (y_i θ_i − b(θ_i)) / φ + c(y_i, φ) ],

where w_i is the weight for observation i. The coefficients β enter into this log-likelihood via θ_i = d(µ_i) = d(g^{−1}(x_i^T β)), with d(·) a known function. To fit GLMs, we use the glm function of the stats package (see Chambers and Hastie 1992).
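As a minimal illustration (toy data of our own, not from the paper), the canonical Poisson log link and the −2 · log-likelihood training error can be obtained with glm:

```r
## toy counts: the canonical log link relates E(y | x) to the linear predictor
x <- c(0, 0, 1, 1, 2, 2, 3, 3)
y <- c(1, 2, 2, 4, 5, 8, 12, 19)

fit <- glm(y ~ x, family = poisson())   # canonical link: log(mu) = eta
coef(fit)                               # MLE of (intercept, slope)
-2 * as.numeric(logLik(fit))            # the -2 log-likelihood training error
```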
Gender gap in university admissions. To illustrate the R syntax and explanations, we consider the admission data of UC Berkeley from 1973. The data consist of 4,526 observations on the response variable Admit (0 = rejected, 1 = admitted) and the covariates Female (0 = male, 1 = female) and Dept (departments A to F). The training data UCBA are prepared by

R> UCBA <- as.data.frame(UCBAdmissions)
R> UCBA$Admit <- 1 * (UCBA$Admit == "Admitted")
R> UCBA$Female <- 1 * (UCBA$Gender == "Female")
R> head(UCBA, 3)
  Admit Gender Dept Freq Female
1     1   Male    A  512      0
2     0   Male    A  313      0
3     1 Female    A   89      1

Each row of the data UCBA represents one combination of values in Admit, Female, and Dept. The column Freq gives the frequencies of the combinations.
The UCB admission data are a popular illustration of Simpson's paradox (see Bickel, Hammel, and O'Connell 1975). The primary interest is the gender gap in the chance of being admitted. Let us first study this gap with a simple logistic regression fit, glmS.UCBA, which includes a single overall effect of Female and suggests a significant disadvantage for females. A second fit, glmL.UCBA, lets both the intercept and the gender effect vary by department. In this second fit, the disadvantage for females disappears and, in the case of department A, the gender gap is significantly positive (DeptA:Female: Estimate = 1.05, |z value| > 2). The apparent disadvantage for females in glmS.UCBA arises, as the reader may know, from the tendency of females to apply to departments where the chances of being admitted are low.
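The two model-fitting calls are garbled in this copy of the text; the following reconstruction (the exact formulas are our assumption, chosen to match the coefficient names reported above) sketches them:

```r
UCBA <- as.data.frame(UCBAdmissions)
UCBA$Admit  <- 1 * (UCBA$Admit == "Admitted")
UCBA$Female <- 1 * (UCBA$Gender == "Female")

## simple fit: one overall gender effect (suggests a disadvantage for females)
glmS.UCBA <- glm(Admit ~ Female, data = UCBA,
                 family = binomial(), weights = UCBA$Freq)

## larger fit: department-specific intercepts and gender effects
glmL.UCBA <- glm(Admit ~ -1 + Dept + Dept:Female, data = UCBA,
                 family = binomial(), weights = UCBA$Freq)
round(coef(glmL.UCBA)[["DeptA:Female"]], 2)  # positive gap in department A
```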
The model glmL.UCBA, which uncovers the problem, can be seen as a full parametric varying coefficient model that defines the intercept and the gender gap as functions of the department. We will return to this example to investigate whether and how TVCM solves this problem.

Partitioning
The first stage of fitting the approximate varying coefficient model M_tvcm (3) involves building a partition for each of the value spaces Z_k, k = 1, ..., K, corresponding to the K varying coefficients. The resulting K partitions should be overly fine, so that the "optimal" partitions can be found in the subsequent pruning stage. Algorithm 1 provides a formal summary of this partitioning algorithm.

Algorithm 1: The TVCM partitioning algorithm for generalized linear models.
Parameters: N_0, the minimum node size, e.g., N_0 = 30; D_min, the minimum −2 · log-likelihood reduction, e.g., D_min = 2.
Initialize B_k1 ← Z_k1 × ... × Z_kLk and M_k ← 1 for all partitions k = 1, ..., K.
repeat
  Fit the current model M, using all observations i = 1, ..., N.
  for partitions k = 1 to K do
    for nodes m = 1 to M_k and moderator variables l = 1 to L_k do
      foreach unique candidate split Δ_kmlj in {z_kli : z_ik ∈ B_km} do
        Using only the observations {i : z_ik ∈ B_km} of the node B_km, compute the search model M_kmlj and the training error reduction D_kmlj = −2ℓ(M) + 2ℓ(M_kmlj).
  Perform the split Δ_kmlj with the largest reduction D_kmlj and set M_k ← M_k + 1.
until one of the stopping criteria described in the text is met.
To partition the value spaces Z_1, ..., Z_K, Algorithm 1 uses a breadth-first search (e.g., Russell and Norvig 2003) that, in each iteration, fits the current model and splits one of the current terminal nodes into two. The split is selected by an exhaustive search over a set of candidate splits. These candidate splits are denoted by Δ_kmlj and refer in each case to a partition k, a node m, a moderator variable l, and a cutpoint j in the selected moderator. For each candidate split, Algorithm 1 fits a corresponding (approximate) search model M_kmlj (8) and selects the split that reduces the total −2 · log-likelihood training error the most. The search model is approximate in that it keeps all parameters fixed except those of the two newly created nodes, and it uses only the observations of the node to split. The reason for this is given below under "Computational details" on page 9. The algorithm iterates until (i) no candidate split provides daughter nodes with more than N_0 observations or (ii) the best split reduces the −2 · log-likelihood training error by less than D_min.
When searching for a split, the number of candidate splits can differ between partitions, nodes, and potential moderators. The −2 · log-likelihood reduction statistic is not "standardized" with respect to such differences and, therefore, Algorithm 1 tends to select partitions, nodes, and variables with many candidate splits. This selection bias negatively affects the interpretability and accuracy of the resulting trees; reducing it is a potential focus for further investigation.

The tvcglm function. The tvcglm function implements Algorithm 1. For illustration, we fit a logistic TVCM to the UCB admission data, specifying that both the intercept and the gender gap vary across departments. tvcglm treats the data, family and weights arguments in the same way as glm. Varying coefficients are specified with the vc operator, and a separate partition is fitted for each vc term. The by argument of the vc operator specifies the predictor, while potential moderators are listed before the by argument as a comma-separated list of variables. When no by argument is given, the vc term defines a varying intercept. Non-vc terms are treated as linear terms, as in glm. Here, vc(Dept) specifies an intercept varying with the variable Dept, and vc(Dept, by = Female) a coefficient for Female that varies with Dept. The predictors passed as the by argument must be numeric in the current implementation; this is why we defined the Female variable for the UCBA data on page 4 as UCBA$Female <- 1 * (UCBA$Gender == "Female").
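The fitting call itself is garbled in this copy. A reconstruction consistent with the arguments discussed here and in the next paragraph (the object name vcmL.UCBA follows later references) is:

```r
## reconstruction (a sketch): varying intercept vc(Dept) and varying gender
## gap vc(Dept, by = Female); mindev = 0 grows the largest possible tree and
## cv = FALSE disables cross-validation
library("vcrpart")  # assumed installed from CRAN

UCBA <- as.data.frame(UCBAdmissions)
UCBA$Admit  <- 1 * (UCBA$Admit == "Admitted")
UCBA$Female <- 1 * (UCBA$Gender == "Female")

vcmL.UCBA <- tvcglm(
  Admit ~ -1 + vc(Dept) + vc(Dept, by = Female),
  data = UCBA, family = binomial(), weights = UCBA$Freq,
  control = tvcglm_control(minsize = 30, mindev = 0, cv = FALSE))
```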
The control parameters are set by the tvcglm_control function. Above, minsize = 30 specifies N_0 = 30 and mindev = 0 specifies D_min = 0. We set D_min = 0 to obtain the largest possible tree and cv = FALSE to disable cross-validation (Section 2.3). Note that practice-oriented examples follow in Sections 4 and 5, and details on arguments and dummy examples can be found in the help pages of the functions tvcglm, vc and tvcglm_control.
The two fitted partitions are shown in Figure 1, along with the nodewise coefficients. These plots were produced by the following commands:

R> plot(vcmL.UCBA, type = "coef", part = "A")
R> plot(vcmL.UCBA, type = "coef", part = "B")

Optionally, we could display the confidence intervals extracted from the underlying 'glm' object. However, these intervals would not account for the model selection procedure, and we do not show them here. Both partitions separate the departments fully and, therefore, the coefficient values of vcmL.UCBA shown in Figure 1 are exactly those obtained for the model glmL.UCBA on page 5. The partitioning process can be backtracked using the splitpath function, e.g., with splitpath(vcmL.UCBA, steps = 1) to summarize the first iteration.

Computational details
A breadth-first search can be computationally burdensome because it cycles in each iteration through all current nodes. Even so, we do not consider a depth-first search, which is more common for recursive partitioning and which evaluates only one node per iteration, because it is unclear whether the search sequence has consequences for the resulting partitions. To speed up the search, we use the approximate search model M_kmlj (8) to compute the training error reduction of split Δ_kmlj, instead of the accurate search model M*_kmlj (9), which re-fits the full model with the node B_km replaced by its two daughter nodes. The accurate search model requires using all observations i = 1, ..., N and re-estimating all coefficients. By contrast, the approximate search model M_kmlj uses only the observations {i : z_ik ∈ B_km} of the node to split and incorporates the fitted values η̂_i of the current model M (7) as offsets. This reduces the optimization per considered split to three parameters, namely the coefficients β_1 and β_2 of the newly created nodes and the dispersion parameter φ, which cannot be fixed in glm. More specifically, the approximate model M_kmlj estimates the coefficients γ_s, s = 1, 2, of M*_kmlj as γ̂_s = β̂_km + β̂_s, with β̂_km retrieved from the current model M. Further, in M_kmlj all remaining coefficients of M*_kmlj, i.e., γ_0 and γ_k′m′ for (m′, k′) ≠ (m, k), are kept fixed at the values estimated in the current model M (7). In our experience, the approximation is reliable, although it does not necessarily result in the same partitions that the accurate search would produce. In particular, the approximation will tend to neglect splits that improve the fit through interplays with the temporarily fixed coefficients.
Eliminating split candidates and choosing the stopping parameters cleverly are two further techniques to accelerate the search. We describe these techniques in more detail here.
Splits for ordered scales. In Algorithm 1, the splits Δ_kmlj for continuous and ordinal moderators are defined as rules of the form {Is z_kli ≤ ζ_kmlj?}. The candidate cutpoints {ζ_kml1, ...} are the unique values in the set {z_kli : z_ik ∈ B_km}. Note that splits at the boundaries may be omitted to respect the minimum node size criterion. To reduce the computational burden, we allow the set of candidate cutpoints to be shrunk to a prespecified cardinality N_S, which is N_S = 9 by default. Specifically, the unique values of the (non-interpolated) quantiles of {z_kli : z_ik ∈ B_km} are extracted at the N_S equidistant probabilities (1, ..., N_S)/(N_S + 1). In cases of tied data, where this procedure potentially yields fewer than N_S splits, the number of equidistant probabilities is increased until the set of candidate splits has the desired size.
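The cutpoint reduction can be sketched in base R (a simplified illustration of the described rule, not the package's internal code):

```r
## shrink the candidate cutpoints of a numeric moderator to at most Ns values
## by taking non-interpolated quantiles at equidistant probabilities
reduce_cutpoints <- function(z, Ns = 9) {
  if (length(unique(z)) <= Ns) return(sort(unique(z)))
  k <- Ns
  repeat {
    cuts <- unique(quantile(z, probs = seq_len(k) / (k + 1), type = 1))
    if (length(cuts) >= Ns || k >= length(z)) break
    k <- k + 1  # tied data: enlarge the probability grid
  }
  unname(cuts)
}

reduce_cutpoints(1:100)  # the nine deciles 10, 20, ..., 90
```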
Splits for nominal scales. The splits Δ_kmlj for nominal moderators are rules of the form {Is z_kli ∈ ζ_kmlj?}, where ζ_kmlj is a subset of merged categories from the set {z_kli : z_ik ∈ B_km}. The number of unique candidate merges for C categories is 2^(C−1) − 1, which increases exponentially with C. An approximation that restricts the number of splits to increase only linearly with C determines a category order and then treats the moderator as ordinal. For CART, Breiman et al. (1984) propose deducing such an order from the category-wise averages in the current node. Following this idea, we propose ordering, for each node, the categories by the category-wise estimated coefficients. This reduces the computational expense to fitting the model that provides the category-wise coefficients, plus fitting the at most C − 1 models that evaluate the ordinal splits. By default, the approximation is applied for C ≥ 5. On page 7, we referred to the category ordering technique when demonstrating the splitpath function for the first iteration of partitioning. For instance, for partition B (the gender gap), we used the order F < E < C < D < B < A; in the splitpath output, the rank of a category corresponds to the first row in which a 1 appears in the corresponding column. The category-wise coefficients can be estimated with a model glmCW.UCBA that substitutes the effect of Female in the current model, which here is just the model glmS.UCBA (page 5), by an interaction term between Dept and Female. The category ordering is then obtained by ordering the estimated department-specific gender effects.
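The ordering step can be illustrated with base R (a sketch; the exact formula of glmCW.UCBA is garbled above, so the interaction formula below is our assumption consistent with the description):

```r
UCBA <- as.data.frame(UCBAdmissions)
UCBA$Admit  <- 1 * (UCBA$Admit == "Admitted")
UCBA$Female <- 1 * (UCBA$Gender == "Female")

## category-wise gender effects: one Female coefficient per department
glmCW.UCBA <- glm(Admit ~ Female:Dept, data = UCBA,
                  family = binomial(), weights = UCBA$Freq)
eff <- coef(glmCW.UCBA)[grep("Female:Dept", names(coef(glmCW.UCBA)))]

## category order for the ordinal-split approximation
names(sort(eff))
```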
Stopping criteria. Algorithm 1 applies two stopping criteria. First, to have sufficient observations for estimating the coefficients nodewise, we require a minimum node size N_0. Here, N_0 = 30 seems a reasonable rule of thumb, but the value can be modified according to the model. Second, to reduce the computational burden, we stop partitioning as soon as the maximal training error reduction falls below D_min. Large values of D_min yield rougher partitions and require less computation, and vice versa. Therefore, it is crucial to choose D_min small enough that the optimal partitions are not overlooked. The default D_min = 2 was selected based on the forward-stepwise AIC algorithm (e.g., Venables and Ripley 2002), which likewise requires the total −2 · log-likelihood training error to decrease by at least 2 to continue. In our experience, D_min = 2 is small enough to capture the optimal partition, yet reduces the computational burden considerably. In Section 4.1, we evaluate the impact of N_0 and D_min in a real data application.

Algorithm 2: The TVCM weakest-link pruning algorithm for generalized linear models.
Input: a fitted model M̂ from Algorithm 1.
Parameters: λ, the cost-complexity penalty.
The tvcglm_control function also allows us to control classic tree growth parameters. These parameters, which can include the maximum number of terminal nodes and the maximal depth of the trees, can restrict the complexity of the final model.

Pruning
The pruning stage selects the final model by collapsing inner nodes of the overly fine partitions produced by Algorithm 1; in other words, it cuts branches stemming from the same node. Here, we broadly follow the minimal cost-complexity pruning approach of Breiman et al. (1984, Chapter 8). Let M̂ be a fitted model of the form of Equation 3, where the nodes B_km result from Algorithm 1. Define the cost-complexity error criterion as

err_λ(M̂) = −2ℓ(M̂) + λ Σ_{k=1}^K (M_k − 1),

in other words, the total −2 · log-likelihood training error plus a tuning constant λ multiplied by the total number of splits. Here, λ trades off the in-sample performance against the complexity (i.e., the number of splits) of the model. When minimizing err_λ(M̂), small choices of λ yield models with many splits, and vice versa. In general, λ is unknown and must be chosen adaptively from the data.
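For illustration, the criterion can be written as a small R function (err_lambda and the nsplits argument are our own names; the caller supplies the total number of splits):

```r
## cost-complexity criterion: total -2 log-likelihood plus lambda times the
## total number of splits (a sketch)
err_lambda <- function(fit, nsplits, lambda) {
  -2 * as.numeric(logLik(fit)) + lambda * nsplits
}

fit <- glm(am ~ wt, data = mtcars, family = binomial())
err_lambda(fit, nsplits = 2, lambda = 6)  # each split adds lambda to the error
```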
Pruning algorithm. Pruning hierarchically collapses inner nodes of the initially overly fine partition to find the model that minimizes err λ ( M), given λ. A global search that collapses multiple inner nodes simultaneously would be computationally too expensive and, therefore, we adopt the weakest link pruning algorithm of Breiman et al. (1984). Algorithm 2 summarizes the implemented algorithm.
Each iteration of Algorithm 2 collapses the inner node that yields the smallest per-split increase in the total −2 · log-likelihood training error. The procedure starts with the model from the partitioning stage and continues until the smallest per-split increase is larger than λ (i.e., until all remaining collapses would increase err_λ(M̂)). The prune function implements Algorithm 2. For example, the fit vcmL.UCBA on page 7 can be pruned with λ = 6 using prune.
The pruning algorithm can be backtracked with the prunepath function.
R> prunepath(vcm.UCBA, steps = 1)

The output of prunepath provides various information about the first iteration of Algorithm 2, applied to the fit for vcmL.UCBA. The columns part and node identify the collapsed inner node, and dev shows the per-split increase of the training error. In the first iteration, the inner node 8 of partition B (the gender gap) yields the smallest D̂_kj and is therefore collapsed.
Choosing λ. The per-split penalty λ is generally unknown and, hence, must be chosen adaptively from the data. To do so, the validation-set or cross-validation methods are suitable (e.g., Breiman et al. 1984) and computationally fast. The validation-set method works as follows. First, divide the training data D randomly into a subtraining set D1 and a validation set D2, e.g., with a ratio of 3:1. Second, replicate the fit of Algorithm 1 on D1. Third, repeatedly prune the new fit with increasing λ values and compute the average validation error on D2 each time an inner node is collapsed. (We use the average to avoid having the validation error depend on the number of observations in D2.) This yields a sequence of penalties {λ_1, ..., λ_{S+1}} and a corresponding sequence of validation errors; the estimate λ̂ is the center of the interval [λ_s′, λ_s′+1) that minimizes the validation error err_D2. The estimation potentially yields λ̂ = ∞, in which case no split is necessary. Cross-validation methods repeat the validation-set method to include the entire data; in particular, cross-validation combines the obtained sequences {λ_1, ..., λ_{S+1}} into a finer grid and averages the errors err_D2,s accordingly.

The cvloss function implements the validation-set and the cross-validation methods to estimate λ̂. By default, 5-fold cross-validation is used. To estimate λ̂ for the UCBA data, we use the commands:

R> cv.UCBA <- cvloss(vcmL.UCBA, folds = folds_control(weights = "freq",
+    seed = 13))

The argument weights = "freq" indicates that the weights of vcmL.UCBA represent counts rather than unit-specific weights (the default). The seed argument controls the random generator when creating the cross-validation folds, which allows the results to be replicated. If multiple cores are available, the cvloss function processes the validation sets in parallel.
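The interval-center rule can be sketched with a hypothetical helper (choose_lambda is our own name, not a vcrpart function; lambda holds the S + 1 penalty values and err the S validation errors):

```r
## pick lambda as the center of the interval [lambda_s, lambda_{s+1}) that
## minimizes the validation error
choose_lambda <- function(lambda, err) {
  s <- which.min(err)                           # minimizing interval
  if (is.infinite(lambda[s + 1])) return(Inf)   # no split necessary
  (lambda[s] + lambda[s + 1]) / 2
}

choose_lambda(lambda = c(0, 2, 4, 8, 16), err = c(1.20, 1.15, 1.18, 1.25))
```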
The black solid line in Figure 2 shows the per-split penalty λ against the cross-validated error of fits for vcmL.UCBA; the error is minimal, at 1.148, for λ̂ = 5.1. The original fit vcmL.UCBA can then be pruned with λ̂ = 5.1 using the prune function. The varying coefficients of the model obtained from pruning with λ̂ = 5.1 are shown in Figure 3. Both the varying intercept and the varying gender gap are split into three strata; the final model collapses several departments. For example, in the right panel, we see that the departments B, C, D, and F share the same gender gap. By contrast, the large negative intercept in department F and the large gender gap in department A remain detached. Alternatives to λ̂ (as determined in Equation 11) could be considered. For example, Breiman et al. (1984, Chapter 3) propose the 1-SE rule to decrease the variance of λ̂. We prefer λ̂ for its simple form but, with cvloss and prune, we provide the tools to use such alternative rules.

Details and extensions
In Section 2, we explained the basic parts of the TVCM algorithm. This section describes the algorithm in more detail and explains how TVCM can be extended to other model classes.

Mean-centering the predictors of the search model
A useful technique to improve the split selection with the approximate search model M_kmlj (8) is to mean-center its predictors, i.e., to substitute the values x_ik in Equation 8 by the mean-centered values x_ik − x̄_k. The advantage of this is most visible in the case of a pure interaction; consider Figure 4. In both panels, the slope of a predictor x varies between two groups, A and B. In the left panel, the TVCM partitioning algorithm tries to uncover this moderation when x is not centered and the current model specifies a global intercept and a global slope for x. The search model uses the fitted values of the current model (solid line) as offsets and incorporates separate slopes for each group. This restricts the group-wise slope contributions to pass through the origin and, hence, the fits (dotted and dashed lines) do not really identify the moderation pattern. The right panel shows that, in this scenario, the moderation pattern is perfectly identified by the same search procedure when x is mean-centered.
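The effect of the centering trick can be reproduced with a small simulation in base R (our own toy data; Gaussian models via lm keep the sketch simple):

```r
set.seed(1)
n <- 200
group <- rep(c("A", "B"), each = n / 2)
x <- runif(n, 0, 10)
## pure interaction: the group-wise lines cross at x = 5, not at the origin
y <- ifelse(group == "A", 1, -1) * (x - 5) + rnorm(n, sd = 0.5)

current <- lm(y ~ x)        # current model: global intercept and slope
eta_hat <- fitted(current)  # used as offsets in the search models

## search models add only group-wise slopes on top of the offsets
raw      <- lm(y ~ -1 + x:group, offset = eta_hat)
centered <- lm(y ~ -1 + I(x - mean(x)):group, offset = eta_hat)

## the mean-centered search model captures the moderation far better
c(raw = deviance(raw), centered = deviance(centered))
```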
The centering trick is applied by default, but can be disabled with the control argument center. Note that the output model is not affected by the mean-centering technique, because it is applied only to the search model M kmlj (8). Further, the trick is not necessary when using the accurate search model M * kmlj (9) that does not use the offset values.

Additive expansion of multivariate varying coefficients
An additional feature of TVCM is the possibility to avoid the interactions between moderators by means of an additive expansion of varying coefficients. Although this restriction may prevent the algorithm from finding the very best model, coefficient functions with one moderator may be easier to understand while still performing quite well as will be seen in Section 4.1.
So far, we have implicitly assumed that the predictors X_1, ..., X_P of model M_vc (1) differ from one another. To suppress interactions between moderators, we expand the multivariate varying coefficients into additive, moderator-wise components. Consider a multivariate varying coefficient term x_ip β_p(z_ip) = x_ip β_p(z_ip1, ..., z_ipLp), possibly with x_ip = 1 for all i. The additive expansion is

x_ip β_p(z_ip1, ..., z_ipLp) ≈ x_ip ( β_p0 + β_p1(z_ip1) + ... + β_pLp(z_ipLp) ).

Here, we decompose x_ip β_p(z_ip) into the "isolated" contributions of the individual moderators, including a global term x_ip β_p0. In this expansion, the individual varying coefficients β_pl(z_ipl), l = 1, ..., L_p, act as local contributions to the global coefficient β_p0. To identify the additive expansion, we mean-center the approximations for β_p1(·), ..., β_pLp(·) using node-weighted sum contrasts that restrict the sample average of the coefficient functions to zero. That is, we approximate β_pl(·) by the piecewise constant function Σ_{m=1}^{M_pl} 1(z_ipl ∈ B_plm) β_plm and estimate the coefficients β_pl1, ..., β_plM_pl subject to the constraint that their node-size-weighted sum is zero. The node-weighted sum contrasts are computed with the contr.wsum function of vcrpart.
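The constraint can be illustrated with a small helper that mimics the idea of node-weighted sum contrasts (a sketch of the principle, not the actual contr.wsum implementation):

```r
## node-weighted sum contrasts: restrict the weighted sum of the nodewise
## coefficients to zero
contr_wsum <- function(levels, weights) {
  C <- stats::contr.sum(length(levels))
  ## overwrite the last row so that sum_m weights[m] * beta_m = 0
  C[length(levels), ] <- -weights[-length(weights)] / weights[length(weights)]
  rownames(C) <- levels
  C
}

C <- contr_wsum(c("B1", "B2", "B3"), weights = c(10, 20, 30))
colSums(c(10, 20, 30) * C)  # both columns sum to zero
```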
We also considered extending the additive expansion with second-and higher-order interactions between moderators. However, such an extension likely needs further considerations for the partitioning algorithm.

Extension to other model classes
To extend its scope, the TVCM requires functions that extract the training and validation errors from fitted models of the considered model class. The training error is required for partitioning (Algorithm 1) and pruning (Algorithm 2), and the need to extract validation errors arises from cross-validating λ. Both errors refer to loss functions. We suggest using the loss function retained for estimating the coefficients, even though different functions can be specified in vcrpart. For GLMs, we use as the training error the total −2 · log-likelihood loss on the training sample, which can be extracted from the coefficient estimation. As the validation error, we use the average −2 · log-likelihood loss of predictions on validation sets, which can be extracted using the predict and family functions.
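For a GLM, both errors can be extracted with base R alone, e.g., on a built-in data set (our own toy split; not from the paper):

```r
## training error: total -2 log-likelihood of a fitted GLM
fit <- glm(am ~ wt, data = mtcars[1:24, ], family = binomial())
train_err <- -2 * as.numeric(logLik(fit))

## validation error: average -2 log-likelihood loss on held-out observations,
## extracted via predict() and the family()$dev.resids function
newdat <- mtcars[25:32, ]
mu <- predict(fit, newdata = newdat, type = "response")
val_err <- mean(family(fit)$dev.resids(newdat$am, mu, rep(1, nrow(newdat))))

c(train = train_err, validation = val_err)
```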
Using the −2 · log-likelihood loss for both the training and validation errors synchronizes the criteria for estimating the coefficients, selecting the split, pruning, and choosing λ. The same, or a similar implementation could be considered for other likelihood-based regression models.
The vcrpart package also provides implementations for the baseline-category as well as the cumulative-link models to allow for regression with nominal and ordinal responses. Both these models are multivariate extensions of GLMs (cf., Fahrmeir and Tutz 2001, Chapter 3). Therefore, we can adopt the definitions for the training and validation errors for GLMs (Bürgin and Ritschard 2015).

Empirical evaluation
Here, we present two empirical evaluations of the TVCM algorithm. Section 4.1 evaluates the performance of TVCM with different stopping criteria by comparing it with alternative treebased algorithms on a benchmark data set. Section 4.2 presents a simulation study to assess the ability of TVCM to identify an underlying data generating varying coefficient process.
From here on and unless specifically stated otherwise, we use the package's default control parameters and a fixed seed for creating the cross-validation folds.

Benchmarking TVCM with the Pima Indians diabetes data
To evaluate the TVCM algorithm, we consider the Pima Indians diabetes data of Smith, Everhart, Dickson, Knowler, and Johannes (1988). These data are available from the UC Irvine machine learning repository (Bache and Lichman 2013) and record diabetes tests of 768 Pima Indian women, along with eight covariates. Here, we use the PimaIndiansDiabetes2 data of the R package mlbench (Leisch and Dimitriadou 2012), a version of the original data corrected for physical impossibilities, such as zero values for blood pressure. We exclude the two variables triceps and insulin and omit cases with missing values on the remaining variables by listwise deletion (also the default of tvcglm) to avoid expanding the discussion to the missing value problem. The prepared Pima data include 724 observations on the seven variables listed in Table 1. In the first specification, vcm.Pima.1, the first vc term specifies the varying intercept and the second term the varying slope for glucose. We use -1 to remove the global intercept so that the fitted varying intercepts represent local intercepts. Keeping the global intercept would produce the same fit; however, the fitted varying intercepts would then represent local contributions to the global intercept. The alternative additive expansion introduced in Section 3.2 is fitted using the command:

R> vcm.Pima.2 <- tvcglm(diabetes ~ 1 + glucose + vc(pregnant) +
+    vc(pregnant, by = glucose) + vc(pressure) + vc(pressure, by = glucose) +
+    vc(mass) + vc(mass, by = glucose) + vc(pedigree) +
+    vc(pedigree, by = glucose) + vc(age) + vc(age, by = glucose),
+    data = Pima, family = binomial(), control = control)
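For completeness, the data preparation and the first specification, vcm.Pima.1, whose commands are garbled in this copy, can be reconstructed as follows (a sketch; `control` stands for the default control object mentioned at the start of this section):

```r
## reconstruction (a sketch) of the Pima data preparation and of vcm.Pima.1
library("vcrpart")
library("mlbench")  # assumed; provides PimaIndiansDiabetes2

data("PimaIndiansDiabetes2", package = "mlbench")
Pima <- na.omit(PimaIndiansDiabetes2[, c("diabetes", "glucose", "pregnant",
                                         "pressure", "mass", "pedigree", "age")])

control <- tvcglm_control()  # default control parameters (seed handling omitted)

vcm.Pima.1 <- tvcglm(
  diabetes ~ -1 + vc(pregnant, pressure, mass, pedigree, age) +
    vc(pregnant, pressure, mass, pedigree, age, by = glucose),
  data = Pima, family = binomial(), control = control)
```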
The additive expansion includes a global intercept and a global slope for glucose, which implies that the remaining varying coefficients, which consist of moderator-wise varying intercepts and varying slopes for glucose, represent local contributions. Zeileis et al. (2006) fit the same varying coefficient model using the model-based recursive partitioning algorithm (MOB; Zeileis et al. 2008), which is based on the single-tree approximation M tree (2). First, we compare the fit for vcm.Pima.1 with the fit based on MOB to discuss the structural differences between the two approximations M tree and M tvcm .
The left panel in Figure 5 shows the fit for vcm.Pima.1, and the right panel the fit based on the MOB algorithm. Note that the partykit (Hothorn and Zeileis 2015) plot function generates by default spinograms, not coefficient plots, for the leaves of the MOB tree; for the sake of comparison, we have replaced the default with coefficient plots. The structural difference is that the TVCM fits separate partitions for the varying intercept and the varying slope for glucose, while the MOB algorithm fits a common partition for the two varying coefficients. Interestingly, the tree of the varying intercept from the TVCM is identical to the tree from the MOB algorithm. By contrast, the TVCM does not retain splits for the slope of glucose. This illustrates the flexibility of the TVCM in adapting to situations in which coefficient functions differ. If a single partition for all varying coefficients is accurate, the TVCM can fit the same partition multiple times; otherwise, it can tailor the partition individually for each varying coefficient. As a result, the TVCM potentially produces more parsimonious and/or more accurate fits than the M_tree approximation.
The performance comparison is made with the Pima data and relies on 250 bootstrap samples with replacement (as in Zeileis et al. 2006), using each algorithm to predict the observations excluded from each bootstrap sample. In the case of the TVCM, we fit five models on each bootstrap sample to compare the fits for vcm.Pima.1 and vcm.Pima.2 and to evaluate the sensitivity of the fit for vcm.Pima.1 to changes from the defaults for N 0 (the minimum node size), D min (the minimum training error reduction), and N S (the maximum number of splits). For the competitors, we employ the default control parameters. Three comparison measures are considered: misclassification, the median and mean 0-1 loss on the excluded data; complexity, the median of the number of coefficients plus the number of splits; and time, the median computation time. Furthermore, with each algorithm, we fit a model on the original data. To run the simulation, we use a computer with an Intel Xeon 3.50GHz processor.

Table 2 shows that the TVCM outperforms the competitors in terms of both accuracy and complexity. That is, it builds smaller models with slightly better predictive performance than the other algorithms. By contrast, the TVCM performs worst in terms of computational time because it evaluates far more candidate models than do the competitors. Increasing N 0 and D min reduces this burden significantly, with surprisingly little effect on performance. Apparently, in this application, it is not necessary to grow very large trees in the partitioning stage to produce an accurate fit. Furthermore, the difference between the multivariate varying coefficient specification and the additive expansion is negligibly small in this application. Figure 6 shows averages of 250 pairwise differences between the competitors and the TVCM. The confidence intervals for the averages are based on the Student's t distribution. Several average differences are significant in favor of the TVCM.
It may be that the CTree, RPART, LMT, and J4.8 algorithms perform worse because they merely use piecewise constant regression functions, whereas the TVCM and the MOB algorithm include a (prespecified) slope for glucose.

Simulation study
Here, we use simulated data to evaluate the ability of TVCM to identify an underlying data generating process with coefficients varying across coefficient-wise partitions. In particular, we consider a scenario where the varying coefficients of the data generating model depend on different moderator variables, in which case we would expect coefficient-wise partitioning to perform better than a single partitioning approach.
The study consists of generating multiple datasets by means of a simulation model, fitting a model with the coefficient-wise and a model with the single partitioning approach on each dataset, and comparing the two approaches by means of four performance measures. For each dataset, we draw N realizations of a normally distributed predictor variable X and of six binary moderator variables Z 0 , . . . , Z 5 i.i.d. ∼ Bin(1, 0.5). Using these data, we draw N responses y 1 , . . . , y N from the simulation model M sim (14), which has a varying intercept that depends on Z 0 and a varying slope for X that depends on Z 1 . The additional four variables Z 2 to Z 5 act as noise variables that we expect not to be selected. Since both the varying intercept and the varying slope are threshold functions, they can be reconstructed by recursive partitioning. Coefficient-wise partitioning can reconstruct the model with two splits: a first split for the varying intercept in Z 0 and a second split for the varying slope for X in Z 1 . The single partition approach needs three splits: either it splits the root node by Z 0 and both child nodes by Z 1 , or it splits the root node by Z 1 and both child nodes by Z 0 . The sample size N is varied in steps of 50 between 50 and 500, and 2000 runs are made for each N , which makes a total of 20,000 simulated datasets.
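As an illustration, one such dataset could be generated along these lines. This is only a sketch: the coefficient values below are placeholders, not the values of the simulation model M sim, and the variable names x, z0, . . . , z5 are chosen for this example.

```r
## Sketch of the data-generating scheme described in the text. The
## coefficient values are illustrative placeholders, not those of M_sim.
set.seed(1)
N <- 200
simdata <- data.frame(
  x  = rnorm(N),                                   # predictor X
  z0 = rbinom(N, 1, 0.5), z1 = rbinom(N, 1, 0.5),  # true moderators
  z2 = rbinom(N, 1, 0.5), z3 = rbinom(N, 1, 0.5),  # noise moderators
  z4 = rbinom(N, 1, 0.5), z5 = rbinom(N, 1, 0.5))
## varying intercept depends on z0; varying slope for x depends on z1
simdata$y <- with(simdata,
  1 * (z0 == 1) + (1 + 1 * (z1 == 1)) * x + rnorm(N))
```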
Both the model with coefficient-wise partitions and the model with a single partition are fitted with the tvcglm function. Using tvcglm for both models isolates the comparison between the coefficient-wise and single partition approaches from other differences between the algorithms. For example, glmtree from package partykit could be used to fit the single partition model, but then the performance differences would also reflect implementation-specific differences between TVCM and MOB, such as the splitting criterion. Apart from computational details, the procedure used by tvcglm to fit single partition models is equivalent to that of Wang and Hastie (2014). Here are the R commands used to fit the two models, where the data frame simdata represents a placeholder for a generated dataset. First, the coefficient-wise partition model is fitted and saved in vcm.sim.1; then the single partition model is fitted and saved in vcm.sim.2.
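Those two calls might look as follows. This is a sketch under assumptions: the variable names (y, x, z0 to z5), the gaussian family, and the use of the vc() intercept argument to request a varying intercept within the same partition are illustrative, not taken from the original commands.

```r
library("vcrpart")
## coefficient-wise partitions: one vc() term per varying coefficient
vcm.sim.1 <- tvcglm(y ~ -1 + vc(z0, z1, z2, z3, z4, z5) +
  vc(z0, z1, z2, z3, z4, z5, by = x),
  data = simdata, family = gaussian())
## single partition: one vc() term covering both the varying intercept
## and the varying slope for x
vcm.sim.2 <- tvcglm(y ~ -1 + vc(z0, z1, z2, z3, z4, z5, by = x,
  intercept = TRUE), data = simdata, family = gaussian())
```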

Performance measures.
To evaluate the fitted models, four frequency measures are considered: (i) identified underlying model, the proportion of runs for which the exact underlying model was identified; (ii) nested underlying model, the proportion of runs for which the underlying model is nested in the fitted model; (iii) true moderators selected, the proportion of runs for which all true moderators were selected; and (iv) false variable selections, the proportion of runs that selected noise moderators.
Results. Figure 7 contrasts the performance of coefficient-wise partitioning with that of the single partition approach. All four measures tend to the same asymptote for both approaches, with, however, faster convergence for coefficient-wise partitioning. For our generated data, the chance of identifying the exact generating model tends to about 85% when the sample size N becomes large (N ≥ 150 for coefficient-wise partitioning and N ≥ 300 for single partitioning), while the chance of ending up with a model that includes the generating process as a nested model tends to 1. This shows that the presence of the noise variables leads in some cases to overfitting. Similar conclusions hold for the two other indicators: when N becomes large, both approaches select at least all true moderators together with, in 20% of the cases, some noise variables. In sum, the simulation study indicates that coefficient-wise partitioning is better than single partitioning at retrieving the exact generating process with a sample of moderate size N , while the two approaches appear equivalent when N becomes large (≥ 350). It also shows that, when N ≥ 200, both methods tend to select a set of moderators that includes at least all true moderators.

Applications
We now illustrate the usage of the vcrpart package to tackle moderated regression problems in social science research. Two applications are presented: we demonstrate the general multivariate varying coefficient specification in Section 5.1 and the additive expansion in Section 5.2.

Table 3 lists the variables of the Schooling data with their labels. The instrumental variable strategy has been proposed and evaluated by Card (1993); we rely on their evaluation and do not go into detail, because the endogeneity problem is not the point of this illustration. The fit reveals that black has a significant (|t value| > 2) negative impact on lwage76.
The aim of this application is to illustrate how the TVCM can be used to study moderation of the effect of black. To do this, we consider covariates 3 to 11 of Table 3 as potential moderators. Furthermore, we want to account for the direct effects on wage of covariates 5 to 11, that is, the covariates not included in lm.School. To integrate these two extensions, we replace the constant coefficient of black with a varying coefficient and replace the global intercept with a varying intercept. However, we continue to assume the Mincer equation and, therefore, use the same specification for the direct effects of ed76.IV and exp76 as in lm.School. To fit the described extended model, we use the following formula.
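A sketch of such a specification is given below. Only the moderators named in the text (age76, smsa76) are listed, whereas the application uses covariates 3 to 11 of Table 3; the object name vcm.School and the fitting arguments are assumptions for illustration.

```r
## Sketch: varying intercept and varying coefficient for black, with the
## Mincer-type direct effects kept parametric as in lm.School. The
## moderator list here is abbreviated for illustration.
vcm.School <- tvcglm(
  lwage76 ~ -1 + ed76.IV + exp76 + I(exp76^2) +
    vc(age76, smsa76, intercept = TRUE) +  # varying intercept
    vc(age76, smsa76, by = black),         # varying racial wage gap
  data = Schooling, family = gaussian())
```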
The fitted varying intercept and varying wage gap are shown in Figure 9. The varying intercept on the left consists of 9 terminal nodes. The tree structure suggests that, in particular, age76 and smsa76 (standard metropolitan statistical area) have strong direct effects on wage. We do not study the varying intercept in detail because it was mainly implemented to allow the study of the racial wage gap while controlling for the direct effects of the considered variables.
The varying racial wage gap, shown in the right panel of Figure 9, includes three strata. It turns out that the gap is particularly negative for people who live in a Southern state and have high working experience. For people who live in the North, or who live in the South with low working experience (9 years or less), the gap is smaller; however, a negative gap remains.
The estimated non-varying coefficients for ed76.IV and exp76 (linear, squared) can be extracted using the summary or the coef functions, for example, with
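Such an extraction might look as follows; the coefficient names are assumptions derived from the formula terms and may differ in the actual fitted object.

```r
## Extract the non-varying coefficients from the fitted model.
## The exact names depend on the model terms; these are assumptions.
coef(vcm.School)[c("ed76.IV", "exp76", "I(exp76^2)")]
```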

The effect of parental leave duration on return to work
As the last application, we consider an example from the literature on the effects of welfare reforms. Specifically, we investigate the effect of the 1990 reform of the Austrian parental leave (PL) system. Before 1990, working women had the right to stay off work for up to 12 months after childbirth and, thereafter, to return to the same (or a similar) job with the same employer. The 1990 reform extended the duration of this leave from 12 to 24 months. Lalive and Zweimüller (2009) investigated the effect of the 1990 PL reform on various fertility and labor market outcomes, based on linear regression models and using the Austrian Social Security Database (ASSD). They provide background on the Austrian PL system and describe the data subset. Here, using the same data, we reanalyze the effect of the reform on whether women returned to work at least once in the 10 years after childbirth.
The subset of the ASSD data includes 6,180 women who gave birth in June or July 1990 and were eligible for the Austrian PL system. With vcrpart, the data are loaded by

R> data("PL", package = "vcrpart")

The PL reform dummy of interest is july. A '0' in july means childbirth in June 1990, the last month under the old PL system, and a '1' indicates a birth in July 1990, the first month under the new PL system. The response variable uncj10 is binary, where '1' refers to women who returned to work at least once in the 10 years after the considered childbirth. Both july and uncj10 are summarized in Table 4, along with eight further covariates used as moderators.
First, we consider a simple logistic model for uncj10 with only july in the predictor function.  The estimated effect of july is −0.23 (corresponding to an odds ratio of e −0.23 = 0.79) and is significant (|z value|> 2). This means that the 1990 PL reform decreases the logit for returning to work.
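Such a baseline fit corresponds to a standard glm call, sketched here with the variable names from Table 4 (the object name glm.PL is chosen for this example):

```r
## Baseline logistic regression: the reform dummy only.
glm.PL <- glm(uncj10 ~ july, data = PL, family = binomial())
summary(glm.PL)  # the text reports a july coefficient of about -0.23
```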
The aim of this application is to investigate whether and how the effect of the PL reform is moderated by covariates 3 to 10 of Table 4, which include, for example, age and region. Furthermore, we want to study such moderation while considering the direct effects of the moderators. To implement this, we use the additive expansion for multivariate varying coefficients introduced in Section 3.2. The additive expansion is restrictive because it ignores interactions between moderators. However, in applied regression analysis it is common to limit the scope to first-order interactions between the predictor of interest and further covariates (e.g., Cox 1984). Figure 10 shows that the 5-fold cross-validated error is smallest at λ̂ = 13.49. The error curve is relatively flat to the right of the minimum. Figure 11 renders the fitted varying coefficients with at least one split. The top row shows the varying intercepts, which are contributions to the global intercept β̂ (Intercept) = 1.93 and are interpreted as direct effects on the logits for return to work. The result suggests that women with low working experience (≤ 2.2 years) and a high wage (> 45.8 EUR per day) have increased logits for returning to work, and that there are also differences between industries. Specifically, working in an industry corresponding to Node 3, which includes service industries such as the social insurance, education, and banking industries, has a positive direct effect on return to work.
The global effect of the PL reform on the logits for return to work is estimated to be β̂ july = −0.23. The moderation effects of the two selected variables, working experience and region, are shown in the bottom row of Figure 11. From the nodewise coefficient plots, we learn that low working experience (≤ 5.3 years) increases β̂ july by 0.22, and living in Vienna (W) or Lower Austria (NOe) increases β̂ july by 0.27. These positive contributions imply that the effect of the PL reform locally surpasses zero (−0.23 + 0.22 + 0.27 = 0.26), especially for women who combine the two characteristics, low working experience and living in Vienna or Lower Austria.

Discussion and outlook
In this study, we showed how to use the TVCM algorithm for varying coefficient regression, as well as its implementation in the R package vcrpart. Unlike alternative tree-based algorithms, the TVCM can build a separate partition for each varying coefficient. Thus, it allows us to select moderators individually for each varying coefficient and to specify coefficient-specific sets of moderators. Furthermore, Section 4 provides empirical evidence showing that TVCM builds slightly more accurate and/or more parsimonious models than competing tree-based algorithms. It also assesses the ability of TVCM to identify an underlying data model. In addition to the description of the TVCM, we discussed the model specification, provided R commands and demonstrated the scope of TVCM by applying the algorithm to different data sets.
It is worth mentioning that TVCM could, in some situations, wrongly identify a moderator effect in place of a main effect of a variable. Such spurious moderator effects can appear when the moderator in truth has a main effect on the response but is specified neither as a predictor nor as a moderator of the intercept. When one of the specified predictors remains almost constant, such potential moderators may reflect their own main effect on the response through their interaction with that predictor. Therefore, before drawing any definite conclusion, it is good practice to test that each identified moderator effect does not change substantially when the moderator is also entered as a predictor.
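In practice, this check could be sketched as refitting the model with the suspect moderator added as a direct predictor and comparing the two fits. All object and variable names below are placeholders for illustration.

```r
## Sketch of the recommended robustness check: 'z' was identified as a
## moderator of x in model.0; names are placeholders.
model.0 <- tvcglm(y ~ 1 + x + vc(z, by = x), data = dat,
  family = gaussian())
## refit with z also entered as a direct predictor
model.1 <- tvcglm(y ~ 1 + x + z + vc(z, by = x), data = dat,
  family = gaussian())
## compare: does the moderation of x by z persist in model.1?
coef(model.0); coef(model.1)
```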
Further research should investigate the theoretical properties of the TVCM in more detail. This could include simulation studies and/or comparisons with smoothing splines and kernel regression techniques, in line with the comparison study of Wang and Hastie (2014). The simulation study described in Section 4.2 is a first attempt in that direction. It suggests that the performance of TVCM in identifying an underlying data model improves with increasing numbers of observations, and that TVCM is more powerful than the single-partition approach when the coefficient functions differ from each other.
There is also potential for improving the TVCM. This could include: (i) developing an unbiased selection procedure for partitioning; (ii) decreasing the computational time; (iii) refining the pruning criterion; and (iv) stabilizing the algorithm. Improvement (i) requires finding an alternative criterion that does not tend to select partitions, nodes, and moderators with many split candidates (cf., Section 2.2). At the outset, we considered implementing the score-based coefficient constancy tests of Zeileis and Hornik (2007), used in the MOB algorithm. We were particularly interested in these tests because they would have allowed us to select the partition, the node, and the moderator based on the scores of the current model M (7), without estimating search models. However, we discarded the idea because the tests work under the condition that the predictors of the model are stationary and ergodic (cf., Andrews 1993) with respect to the tested moderator, which seems difficult to control when partitioning coefficient-wise. Another adjustment would be to derive the distribution of the maximally selected likelihood ratio statistics D k′m′l′j′ of Algorithm 1. This would allow us to select the split based on p values, which eliminates the dependence of the selection on the number of splits in the moderator variable. Andrews (1993) develops the distribution of maximally selected likelihood ratio statistics, however, again under the stationarity assumption. The stationarity assumption could be relaxed by using bootstrap techniques (e.g., Jouini 2008), but such techniques are computationally demanding. Finally, F - or χ 2 -type tests, such as those proposed in Loh and Shih (1997), could be implemented. For example, Brandmaier, von Oertzen, McArdle, and Lindenberger (2012) implement such tests for building structural equation model trees, and they show that their implementation reduces the variable selection bias substantially.
With regard to point (ii), the TVCM seems more time consuming than the alternative algorithms (cf., Section 4.1), although we integrated several acceleration techniques and parallelized the cross-validation. This hindrance, which might matter for big data applications, could be partly resolved by rewriting the bottleneck functions in a low-level programming language. With regard to improvement (iii), we could consider refining the cost-complexity criterion of Equation 10, which assumes that the "optimism" of the training error increases linearly with each split. Ye (1998) showed that this assumption is violated for CART, and the same probably applies to the TVCM. Ye (1998) and Efron (2004) provide more accurate solutions using resampling techniques, though these solutions are highly time consuming. Finally, with regard to improvement (iv), to stabilize the algorithm against perturbations of the data and to improve accuracy, we provide, with the fvcglm function, an implementation of the random forest ensemble algorithm (Breiman 2001) for the TVCM. We have not described this implementation here so as to keep the focus on the original parts of our work.
Along with tvcglm, tvcglm_control, splitpath, prunepath, and plot, this study introduced the main functions for the fitting and diagnosis of coefficient-wise tree-based varying coefficient models. Additional functions provided by vcrpart, such as summary and predict, are described in the reference manual.