Most Likely Transformations: The mlt Package

The mlt package implements maximum likelihood estimation in the class of conditional transformation models. Based on a suitable explicit parameterization of the unconditional or conditional transformation function using infrastructure from package basefun, we show how one can define, estimate, and compare a cascade of increasingly complex transformation models in the maximum likelihood framework. Models for the unconditional or conditional distribution function of any univariate response variable are set up and estimated in the same computational framework simply by choosing an appropriate transformation function and parameterization thereof. As it is computationally cheap to evaluate the distribution function, models can be estimated by maximization of the exact likelihood, especially in the presence of random censoring or truncation. The relatively dense high-level implementation in the R system for statistical computing allows generalization of many established implementations of linear transformation models, such as the Cox model or other parametric models for the analysis of survival or ordered categorical data, to the more complex situations illustrated in this paper.


Introduction
The history of statistics can be told as a story of great conceptual ideas and contemporaneous computable approximations thereof. As time went by, the computationally inaccessible concept often vanished from the collective consciousness of our profession and the approximation was taught and understood as the real thing. Least squares regression emerged from Gauß' computational trick of changing Bošković's absolute to squared error, and it took 200 years for the original, and in many aspects advantageous, concept to surface again under the name "quantile regression". This most prominent example of an idea getting lost illustrates the impact computable approximations had, and still have, on our understanding of statistical methods. Consider the normal linear model with conditional distribution function P(Y ≤ y | X = x) = Φ((y − α − x⊤β)/σ). With α̃_1 = −α/σ, α̃_2 = 1/σ and β̃ = β/σ we see that the model is of the form P(Y ≤ y | X = x) = Φ(α̃_1 + α̃_2 y − x⊤β̃), i.e., a model with distribution function F_Z = Φ and linear transformation h_Y(y) = α̃_1 + α̃_2 y such that E(h_Y(Y) | X = x) = x⊤β̃. If we now change F_Z to the distribution function of the minimum extreme value distribution and allow a non-linear monotone transformation h_Y, we get P(Y ≤ y | X = x) = 1 − exp(−exp(h_Y(y) − x⊤β̃)), which is the continuous proportional hazards, or Cox, model (typically defined with a positive shift term). From this point of view, the linear and the Cox model are two instances of so-called linear transformation models (a misleading name, because the transformation h_Y of the response Y is non-linear in the latter case and only the shift term x⊤β̃ is linear in x). It is now also obvious that the Cox model has nothing to do with censoring, let alone survival times Y > 0. It is a model for the conditional distribution of a continuous response Y ∈ R when it is appropriate to assume that the conditional hazard function is scaled by exp(x⊤β̃). For both the linear and the Cox model, application of the exact likelihood allows the models to be fitted to imprecise, or "censored", observations (y̲, ȳ] ⊂ R.
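The reparameterization connecting the classical and the transformation form of the normal linear model is easy to verify numerically. The following sketch (in Python, purely as a language-neutral illustration; all numbers are made up) confirms that both ways of writing the model yield the same conditional distribution function.

```python
from scipy.stats import norm

# Normal linear model: Y = alpha + x * beta + sigma * Z, Z ~ N(0, 1)
alpha, beta, sigma = 2.0, 1.5, 0.7
x, y = 0.8, 3.1

# Classical form of the conditional distribution function
p_classical = norm.cdf((y - alpha - x * beta) / sigma)

# Transformation-model form: F_Z(h_Y(y) - x * beta_tilde) with
# h_Y(y) = alpha1 + alpha2 * y, alpha1 = -alpha/sigma, alpha2 = 1/sigma
alpha1, alpha2, beta_tilde = -alpha / sigma, 1 / sigma, beta / sigma
p_transformation = norm.cdf(alpha1 + alpha2 * y - x * beta_tilde)

assert abs(p_classical - p_transformation) < 1e-12
```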
The generality of the class of linear transformation models comes from the ability to change F_Z and h_Y in this flexible class of models. The class of linear transformation models is a subclass of conditional transformation models (Hothorn, Kneib, and Bühlmann 2014). In this latter class, the conditional distribution function is modeled by P(Y ≤ y | X = x) = F_Z(h(y | x)), (1) where the transformation function h depends on both y and x. This function is unknown and this document describes how one can estimate h. Because we are interested in analyzing, i.e., estimating and interpreting, the transformation function h, we slightly rephrase the title "An Analysis of Transformations" of the first paper on this subject (Box and Cox 1964) and refer to the methods presented in this paper as transformation analysis. Different choices of F_Z and different parameterizations of h in model (1) relate to specific transformation models, including the normal linear model, proportional hazards (Cox) and proportional odds models for continuous or ordinal responses, parametric models for survival regression, such as the Weibull or log-normal model, distribution regression models or survival models with time-varying effects and much more. We describe how members of the large class of conditional transformation models can be specified, fitted by maximizing the likelihood (using the estimator developed by Hothorn, Möst, and Bühlmann 2018), and analyzed in R (R Core Team 2019) using the mlt add-on package (Hothorn 2020b). In essence, the package is built around two important functions. ctm() specifies transformation models based on basis functions implemented in the basefun package (Hothorn 2020a) for variable descriptions from the variables package (Hothorn 2020c). These models are then fitted to data by mlt().
Package mlt is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=mlt. Throughout, we link the computational steps to the packages, functions, and methods discussed in this document. We take an example-based approach for introducing models and corresponding software infrastructure in the main document. The underlying theory for model formulation, parameter estimation, and model inference is presented along the way, and we refer the reader to Hothorn et al. (2018) for the theoretical foundations. Technical issues about the three packages involved, variables, basefun, and mlt, are explained in Appendices A, B, and C.
Before we start looking at more complex models and associated conceptual and technical details, we illustrate the workflow by means of an example from unconditional density estimation.
Density estimation: Old Faithful geyser data. The duration of eruptions and the waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park became a standard benchmark for non-parametric density estimation (the original data were given by Azzalini and Bowman 1990). An unconditional density estimate for the duration of the eruptions needs to deal with censoring because exact duration times are only available for the daytime measurements. At night, the observations were either left-censored ("short" eruption), interval-censored ("medium" eruption) or right-censored ("long" eruption), as explained by Azzalini and Bowman (1990). This fact was widely ignored in analyses of the Old Faithful data because most non-parametric density estimators cannot deal with censoring.
The key issue is the representation of the unknown transformation function by h(y) = a(y)⊤ϑ based on suitable basis functions a and parameters ϑ. We fit these parameters ϑ by maximum likelihood. After loading the data

R> data("geyser", package = "TH.data")

an unconditional transformation model ctm_d for the eruption duration, based on a Bernstein polynomial, is specified by a call to ctm() (the general procedure is described in detail in Section 2). The most likely transformation (MLT) ĥ_N(y) = a(y)⊤ϑ̂_N is then obtained from the maximum likelihood estimate ϑ̂_N computed as

R> mlt_d <- mlt(ctm_d, data = geyser)
R> logLik(mlt_d)

The model is best visualized in terms of the corresponding density φ(a(y)⊤ϑ̂_N) a′(y)⊤ϑ̂_N, which can be extracted from the fitted model on a grid nd_d of duration values using

R> nd_d$d <- predict(mlt_d, newdata = nd_d, type = "density")

The estimated density is depicted in Figure 2. The plot shows the well-known bimodal distribution in a nice smooth way. Several things are quite unusual in this short example. First, the model was specified using ctm() without reference to the actual observations, and second, although the model is fully parametric, the resulting density resembles the flexibility of a non-parametric density estimate (details in Section 2). Third, the exact likelihood, as defined by the interval-censored observations, was used to obtain the model (Section 3). Fourth, inspection of the parameter estimates ϑ̂_N is uninteresting; the model is better looked at by means of the estimated distribution, density, quantile, hazard or cumulative hazard functions (Section 4). Fifth, no regularization is necessary due to the monotonicity constraint on the estimated transformation function (implemented as linear constraints for maximum likelihood estimation) and thus standard likelihood asymptotics work for ϑ̂_N (Section 5). Sixth, because the model is a model for a full distribution, we can easily draw random samples from the model and refit its parameters using the parametric or model-based bootstrap (Section 6). Seventh, all of this is not only possible theoretically but readily implemented in package mlt.
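Drawing random samples from a fitted transformation model, the basis of the model-based bootstrap mentioned above, rests on a simple idea: draw z from F_Z and solve h(y) = z for y, which is always possible because h is monotone. A minimal sketch of this inversion (Python, purely illustrative; a toy transformation stands in for the fitted ĥ_N):

```python
import numpy as np
from scipy.stats import norm

# Assume a monotone increasing transformation h evaluated on a fine grid;
# in practice h(y) = a(y)' theta from the fitted model, here a toy choice.
grid = np.linspace(0.1, 10, 1000)
h = np.log(grid)                      # toy monotone transformation, F_Z = Phi

# Inverse-transform sampling: z ~ F_Z, then y = h^{-1}(z) by interpolation
z = norm.rvs(size = 500, random_state = 2)
y = np.interp(z, h, grid)             # h is increasing, so interp inverts it

assert y.min() >= grid[0] and y.max() <= grid[-1]
```

With h(y) = log(y) and F_Z = Φ this reproduces (support-truncated) log-normal samples; the same inversion works for any monotone h.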
The only remaining question is "Do all these nice properties carry over to the conditional case, i.e., to regression models?". The answer to this question is "yes!" and the rest of this paper describes the details following the workflow sketched in this section.

Specifying transformation models
In this section we review a cascade of increasingly complex transformation models following Hothorn et al. (2018) and discuss how one can specify such models using the ctm() function provided by the mlt package. The models are fitted via maximum likelihood by the mlt() function (details are given in Section 3) to studies from different domains. Results obtained from established implementations of the corresponding models in the R universe are used to evaluate the implementation in package mlt wherever possible. We start with the simplest case of models for unconditional distribution functions.

Unconditional transformation models
The distribution of an at least ordered response Y ∈ Ξ is defined by a transformation function h and a distribution function F_Z. The transformation function is parameterized in terms of a basis function a, i.e., h(y) = a(y)⊤ϑ. The triple (F_Z, a, ϑ) fully defines the distribution of Y and is called a transformation model. The choice of the basis function a depends on the measurement scale of Y and we can differentiate between the following situations.

Discrete models for categorical responses
For ordered categorical responses Y from a finite sample space Ξ = {y_1, …, y_K} the distribution function F_Y is a step-function with jumps at the y_k only. We therefore assign one parameter to each jump, i.e., to each element of the sample space except y_K. This corresponds to the basis function a(y_k) = e_{K−1}(k), where e_{K−1}(k) is the unit vector of length K − 1 with its kth element being one. The transformation function is h(y_k) = e_{K−1}(k)⊤ϑ = ϑ_k, k = 1, …, K − 1. Note that monotonicity of h is guaranteed by the K − 2 linear constraints ϑ_2 − ϑ_1 > 0, …, ϑ_{K−1} − ϑ_{K−2} > 0 when constrained optimization is performed. The density of a nominal variable Y can be estimated in the very same way because it is invariant with respect to the ordering of the levels (see Section 3).
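The invariance of the fitted discrete distribution with respect to F_Z is easy to see: the maximum likelihood estimator puts F̂_Y(y_k) at the empirical proportion p_k, so ϑ̂_k = F_Z^{-1}(p_k) simply adapts to whichever F_Z was chosen. A small numerical sketch (Python, illustrative only; the proportions are made up):

```python
import numpy as np
from scipy.stats import norm, logistic

# Empirical cumulative proportions for K = 4 ordered categories
p = np.array([0.1, 0.35, 0.75])   # F_Y at y_1, y_2, y_3 (F_Y(y_4) = 1)

# Parameters theta_k = F_Z^{-1}(p_k) for two different choices of F_Z
theta_logit = logistic.ppf(p)
theta_probit = norm.ppf(p)

# Both parameterizations reproduce exactly the same distribution function
assert np.allclose(logistic.cdf(theta_logit), norm.cdf(theta_probit))
# and the monotonicity constraints theta_1 < theta_2 < theta_3 hold
assert np.all(np.diff(theta_logit) > 0) and np.all(np.diff(theta_probit) > 0)
```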

Categorical data analysis: Chinese Health and Family Life survey. The Chinese
Health and Family Life survey (Parish and Laumann 2009), conducted 1999-2000 as a collaborative research project of the Universities of Chicago, Beijing, and North Carolina, sampled 60 villages and urban neighborhoods in China. Eighty-three individuals were chosen at random for each location from official registers of adults aged between 20 and 64 years to target a sample of 5000 individuals in total. Here, we restrict our attention to women with current male partners for whom no information was missing, leading to a sample of 1534 women with the following variables: R_edu (level of education of the responding woman), R_income (monthly income in Yuan of the responding woman), R_health (health status of the responding woman in the last year) and R_happy (how happy was the responding woman in the last year). We first estimate the unconditional distribution of happiness using a proportional odds model (polr() from package MASS, Ripley 2019) R> data("CHFLS", package = "HSAUR3") R> polr_CHFLS_1 <-polr(R_happy~1, data = CHFLS) The basis function introduced above corresponds to a model matrix for the ordered factor R_happy with treatment contrasts using the largest level ("very happy") as baseline group.
In addition, the parameters must satisfy the linear constraint Cϑ ≥ m, C (argument ui) being the difference matrix and m = 0 (argument ci).

R> b_happy <-as.basis(CHFLS$R_happy)
We are now ready to set up the (unconditional) transformation model by a call to ctm(), using the basis function and a character string defining the standard logistic distribution function for F_Z.
R> ctm_CHFLS_1 <- ctm(response = b_happy, todistr = "Logistic")

Note that the choice of F_Z is completely arbitrary here, as the estimated distribution function is invariant with respect to F_Z. The model is fitted by calling the mlt() function with arguments model and data.
R> mlt_CHFLS_1 <-mlt(model = ctm_CHFLS_1, data = CHFLS) The results are equivalent to the results obtained from polr(). The helper function RC() prints its arguments along with the relative change to the mlt argument.

R> logLik(polr_CHFLS_1)
log Lik. -1328.241

Of course, the above exercise is an extremely cumbersome way of estimating a discrete density whose maximum likelihood estimator is a simple proportion but, as we will see in the rest of this paper, a generalization to the conditional case strongly relies on this parameterization.
R> RC(polr = predict(polr_CHFLS_1, newdata = data.frame(1), type = "prob"),
+    mlt = c(predict(mlt_CHFLS_1, newdata = data.frame(1), type = "density")))

Continuous models for continuous responses

For a continuous response Y, the transformation function h is smooth and is parameterized as a polynomial in Bernstein form, a_{Bs,M}(y)⊤ϑ = Σ_{m=0}^{M} ϑ_m f_{Be(m+1,M−m+1)}(ỹ) / (M + 1), where ỹ is the response standardized to the unit interval and f_{Be(m,M)} is the density of the Beta distribution with parameters m and M. This choice is computationally attractive because strict monotonicity can be formulated as a set of M linear constraints on the parameters, ϑ_m < ϑ_{m+1} for all m = 0, …, M − 1 (Curtis and Ghosh 2011). Therefore, application of constrained optimization guarantees monotone estimates ĥ_N. The basis contains an intercept.
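The Bernstein parameterization in terms of scaled Beta densities can be checked numerically. The sketch below (Python, illustrative only) confirms that the basis functions form a partition of unity and that increasing coefficients yield a monotone transformation.

```python
import numpy as np
from scipy.stats import beta

M = 8
y = np.linspace(0, 1, 201)   # response standardized to the unit interval

# Bernstein basis of order M: a_m(y) = f_{Be(m+1, M-m+1)}(y) / (M + 1)
A = np.column_stack([beta.pdf(y, m + 1, M - m + 1) / (M + 1)
                     for m in range(M + 1)])

# The basis functions form a partition of unity
assert np.allclose(A.sum(axis = 1), 1.0)

# Increasing coefficients theta_0 < ... < theta_M give a monotone h
theta = np.sort(np.random.default_rng(0).normal(size = M + 1))
h = A @ theta
assert np.all(np.diff(h) > -1e-12)
```

The partition-of-unity property holds because f_{Be(m+1,M−m+1)}(y)/(M+1) is exactly the classical Bernstein basis polynomial of degree M.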
Density estimation: Geyser data (cont'd). We continue the analysis of the Old Faithful data by estimating the unconditional distribution of waiting times, a positive variable whose abstract representation can be used to generate an equidistant grid of 100 values.
R> var_w <- numeric_var("waiting", support = c(40.0, 100), add = c(-5, 15),
+    bounds = c(0, Inf))
R> c(sapply(nd_w <- mkgrid(var_w, 100), range))

Figure 3: Old Faithful geyser. Estimated distribution (left, also featuring the empirical cumulative distribution function as gray circles) and density (right) function for two transformation models parameterized in terms of Bernstein polynomials of order eight and 40. The right plot reproduces the density for M = 8 shown in Figure 1 (left panel).

[1] 35 115
A monotone increasing Bernstein polynomial of order eight for waiting in the interval [y̲, ȳ] = support(var_w) is defined as

R> B_w <- Bernstein_basis(var_w, order = 8, ui = "increasing")

The (here again unconditional) transformation model is now fully described by the parameterization h(y) = a(y)⊤ϑ and F_Z = Φ, the latter choice again being unimportant (for larger values of M).
The question arises how the degree of the polynomial affects the estimated distribution function. On the one hand, the model (Φ, a_{Bs,1}, ϑ) only allows linear transformation functions of a standard normal and F_Y is restricted to the normal family. On the other hand, (Φ, a_{Bs,N−1}, ϑ) has one parameter for each observation and F̂_{Y,N} is the non-parametric maximum likelihood estimator, the ECDF, which, by the Glivenko-Cantelli lemma, converges to F_Y. In this sense, we cannot choose M "too large". This is a consequence of the monotonicity constraint on the estimator a⊤ϑ̂_N which, in this extreme case, just interpolates the step-function F_Z^{-1} ∘ ECDF. The practical effect can be inspected in Figure 3, where two Bernstein polynomials of order M = 8 and M = 40 are compared on the scale of the distribution function (left panel) and the density function (right panel). It is hardly possible to notice the difference in probabilities; the more flexible model features a more erratic density estimate, but not overly so. The subjective recommendation of choosing M between 5 and 10 is based on the quite remarkable flexibility of distributions in this range and still manageable computing times for the corresponding maximum likelihood estimator.

Continuous models for discrete responses
Although a model for Y can assume an absolutely continuous distribution, observations from such a model will always be discrete. This fact is taken into account by the exact likelihood (Section 3). In some cases, for example for count data with a potentially large number of counts, one might use a continuous parameterization of the transformation function a_{Bs,M}(y)⊤ϑ which is evaluated at the observed counts only, as in the next example.
Analysis of count data: Deer-vehicle collisions. Hothorn, Müller, Held, Möst, and Mysterud (2015) analyze roe deer-vehicle collisions reported to the police in Bavaria, Germany, between 2002-01-01 and 2011-12-31. The daily counts range between 16 and 210. A model for the daily number of roe deer-vehicle collisions using a Bernstein polynomial of order six as basis function is fitted using

R> var_dvc <- numeric_var("dvc", support = min(dvc):max(dvc))
R> B_dvc <- Bernstein_basis(var_dvc, order = 6, ui = "increasing")
R> dvc_mlt <- mlt(ctm(B_dvc), data = data.frame(dvc = dvc))

The discrete unconditional distribution function (evaluated for all integers between 16 and 210)

R> q <- support(var_dvc)[[1]]
R> p <- predict(dvc_mlt, newdata = data.frame(1), q = q,
+    type = "distribution")

along with the unconditional distribution function of the Poisson distribution with rate estimated from the data are given in Figure 4. The empirical cumulative distribution function is smoothly approximated using only seven parameters. There is clear evidence for overdispersion as the variance of the Poisson model is much smaller than the variance estimated by the transformation model.

Discrete models for continuous responses
In some applications one might not be interested in estimating the whole distribution function F_Y for a conceptually absolutely continuous response Y but only in the probabilities F_Y(y_k) for some grid y_1, …, y_K. The discrete basis function h(y_k) = e_{K−1}(k)⊤ϑ = ϑ_k can be used in this case. For model estimation, only the discretized observations y ∈ (y_{k−1}, y_k] enter the likelihood.
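The exact likelihood for such discretized (or, more generally, interval-censored) observations is simply a difference of the modeled distribution function evaluated at the interval boundaries, log(F_Z(h(y_k)) − F_Z(h(y_{k−1}))). A toy computation (Python, illustrative; the parameter values are made up):

```python
import numpy as np
from scipy.stats import norm

# Monotone transformation evaluated at the grid points y_1 < ... < y_K
theta = np.array([-2.0, -0.8, 0.3, 1.5])      # h(y_k) = theta_k

# Exact log-likelihood contribution of an observation in (y_2, y_3]:
# log( F_Z(h(y_3)) - F_Z(h(y_2)) )
contrib = np.log(norm.cdf(theta[2]) - norm.cdf(theta[1]))
assert contrib < 0

# The interval probabilities (with h = -inf / +inf at the boundaries of
# the sample space) are non-negative and sum to one
F = np.concatenate(([0.0], norm.cdf(theta), [1.0]))
assert np.all(np.diff(F) >= 0) and np.isclose(np.diff(F).sum(), 1.0)
```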

Linear transformation models
A linear transformation model features a linear shift of a typically non-linear transformation of the response Y and is the simplest form of a regression model in the class of conditional transformation models. The conditional distribution function is modeled by P(Y ≤ y | X = x) = F_Z(h_Y(y) − x⊤β). The transformation h_Y, sometimes called the "baseline transformation", can be parameterized as discussed in the unconditional situation using an appropriate basis function a. The connection to more complex models is highlighted by the introduction of a basis function c conditional on the explanatory variables x, i.e., with c = (a⊤, −x⊤)⊤. The definition of a linear transformation model only requires the basis a, the explanatory variables x and a distribution F_Z. The latter choice is now important, as additivity of the linear predictor is assumed on the scale of F_Z. It is convenient to supply negative linear predictors as E(h_Y(Y) | X = x) = x⊤β (up to an additive constant E(Z)).
One exception is the Cox model with positive shift and E(h Y (Y ) | X = x) = −x β. Large positive values of the linear predictor correspond to low expected survival times and thus a high risk.
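The scaling of the conditional hazard by exp(x⊤β) under F_Z = F_MEV can be verified directly: with F_MEV(z) = 1 − exp(−exp(z)) and the positive-shift Cox convention, the survivor function satisfies S(y | x) = S(y | 0)^exp(x⊤β). A numerical check (Python, illustrative; coefficient values are made up):

```python
import numpy as np

F_mev = lambda z: 1 - np.exp(-np.exp(z))   # minimum extreme value distribution

beta, x = 0.7, 1.3
h_Y = np.linspace(-3, 2, 50)               # baseline transformation h_Y(y)

S0 = 1 - F_mev(h_Y)                        # baseline survivor function
S_x = 1 - F_mev(h_Y + x * beta)            # positive shift, Cox convention

# Proportional hazards: S(y | x) = S(y | 0) ** exp(x * beta)
assert np.allclose(S_x, S0 ** np.exp(x * beta))
```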

Discrete responses
Categorical data analysis: Chinese survey (cont'd). We want to study the impact of age and income on happiness in a proportional odds model, here fitted using polr() first

R> polr_CHFLS_2 <- polr(R_happy ~ R_age + R_income, data = CHFLS)

In order to fit this model using the mlt package, we need to set up the basis function for a negative linear predictor without intercept in addition to the basis function for happiness (b_happy)

R> b_R <- as.basis(~ R_age + R_income, data = CHFLS, remove_intercept = TRUE,
+    negative = TRUE)

The model is now defined by two basis functions, one for the response and one for the shift, in addition to F_Z, and is fitted using the mlt() function; the columns of the model matrix are scaled to [−1, 1] before fitting the parameters by scale = TRUE.
R> ctm_CHFLS_2 <-ctm(response = b_happy, shifting = b_R, + todistr = "Logistic") R> mlt_CHFLS_2 <-mlt(ctm_CHFLS_2, data = CHFLS, scale = TRUE) The regression coefficients β are the log-odds ratios, i.e., the odds-ratio between any two subsequent happiness categories is 0.9937 for each year of age and 1.0002 for each additional Yuan earned. Therefore, there seems to be a happiness conflict between getting older and richer.
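In the proportional odds model the shift enters on the log-odds scale, so the odds ratio between any two covariate configurations is the same at every cut-point. This constancy can be checked numerically (Python, illustrative; the cut-points and coefficient below are hypothetical, not the fitted values):

```python
import numpy as np
from scipy.stats import logistic

theta = np.array([-2.0, -0.5, 1.0])   # cut-points h(y_k) = theta_k (K = 4)
beta = 0.3                            # hypothetical shift coefficient

x0, x1 = 0.0, 1.0                     # one unit difference in the covariate
p0 = logistic.cdf(theta - x0 * beta)  # P(Y <= y_k | x = x0)
p1 = logistic.cdf(theta - x1 * beta)  # P(Y <= y_k | x = x1)

odds = lambda p: p / (1 - p)
OR = odds(p1) / odds(p0)
# The same odds ratio exp(-beta) at every cut-point
assert np.allclose(OR, np.exp(-beta))
```

The quoted values 0.9937 and 1.0002 are exactly such cut-point-free odds ratios, obtained as exp of the (negative) fitted shift coefficients.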

Continuous responses
Survival analysis: German Breast Cancer Study Group-2 (GBSG-2) trial. This prospective, controlled clinical trial on the treatment of node positive breast cancer patients was conducted by the German Breast Cancer Study Group (GBSG-2, Schumacher et al. 1994).
Patients not older than 65 years with positive regional lymph nodes but no distant metastases were included in the study. Out of 686 women, 246 received hormonal therapy whereas the control group of 440 women did not receive hormonal therapy. Additional variables include age, menopausal status, tumor size, tumor grade, number of positive lymph nodes, progesterone receptor, and estrogen receptor. The right-censored recurrence-free survival time is the response variable of interest, i.e., a positive absolutely continuous variable.
R> data("GBSG2", package = "TH.data")
R> GBSG2y <- numeric_var("y", support = c(100.0, max(GBSG2$time)),
+    bounds = c(0, Inf))
R> GBSG2$y <- with(GBSG2, Surv(time, cens))

We start with the Cox model (F_MEV, (a_{Bs,10}⊤, x⊤)⊤, (ϑ_1⊤, β⊤)⊤), in more classical notation the model P(Y ≤ y | X = x) = 1 − exp(−exp(a_{Bs,10}(y)⊤ϑ_1 + x⊤β)), where x contains the treatment indicator and all other explanatory variables, with transformation function h(y | x) = a_{Bs,10}(y)⊤ϑ_1 + x⊤β and log-cumulative baseline hazard h_Y(y) = a_{Bs,10}(y)⊤ϑ_1 (the positive shift being in line with the implementation in coxph() in package survival, Therneau and Grambsch 2000; Therneau 2019)

R> B_GBSG2y <- Bernstein_basis(var = GBSG2y, order = 10, ui = "increasing")
R> fm_GBSG2 <- Surv(time, cens) ~ horTh + age + menostat + tsize + tgrade +
+    pnodes + progrec + estrec
R> ctm_GBSG2_1 <- ctm(B_GBSG2y, shifting = fm_GBSG2[-2L], data = GBSG2,
+    todistr = "MinExtrVal")

fm_GBSG2[-2L] is the right-hand side of the model formula and defines the basis function for the shift term in the classical formula language. The distribution function F_Z is the distribution function of the minimum extreme value distribution. The specification of the right-hand side of the formula as argument shifting, along with the data (argument data), is equivalent to a basis function for the explanatory variables. Note that the model matrix is set up with the intercept term first, ensuring proper coding of contrasts. Because the intercept in this model is the log-cumulative baseline hazard function h_Y, the intercept column is removed from the model matrix prior to model estimation.
In contrast to the classical Cox model where only β is estimated by the partial likelihood, we estimate all model parameters (ϑ 1 , β ) simultaneously under ten linear constraints.
R> mlt_GBSG2_1 <-mlt(ctm_GBSG2_1, data = GBSG2, scale = TRUE) The results obtained for β from the partial and the exact log-likelihood are practically equivalent.
R> coxph_GBSG2_1 <- coxph(fm_GBSG2, data = GBSG2, ties = "breslow")
R> cf <- coef(coxph_GBSG2_1)

A practically important question is how the order M of the Bernstein polynomial affects the results. Recall that the log-cumulative baseline hazard function h_Y is not specified when the partial likelihood is maximized, so coxph() makes no assumptions regarding this function. To study the impact of M on the results, we refit the Cox model using mlt() for M = 1, …, 30 and plot the log-cumulative hazard function h_Y for different M along with the non-parametric Nelson-Aalen-Breslow estimator in the left panel of Figure 5. In the right panel of Figure 5, the change in the regression coefficients β as a function of the order M is shown. Both the log-cumulative hazard function and the regression coefficients obtained from mlt() are stable for M ≥ 10 and very similar to the results one obtains from coxph(). This result shows that a simple yet fully parametric model produces practically equivalent results when compared to the semi-parametric partial likelihood approach. There is no harm in choosing M "too large", except for longer computing times of course. In this sense, there is no need to "tune" M.
A comparison with the Royston and Parmar (2002) spline model as implemented in the flexsurv package (Jackson 2019) shows that the two spline parameterizations of the log-cumulative hazard function h_Y are also practically equivalent (see Figure 6).
R> kn <- log(support(GBSG2y)$y)
R> fss_GBSG2_1 <- flexsurvspline(fm_GBSG2, data = GBSG2, scale = "hazard",
+    k = 9, bknots = kn)
R> logLik(fss_GBSG2_1)

Accelerated failure time (AFT) models arise when one restricts the baseline transformation h_Y(y) to a possibly scaled log-transformation. With h_Y(y) = ϑ_1 log(y) + ϑ_2 (with ϑ_1 ≡ 1) and F_Z = F_MEV, the exponential AFT model arises, which can be specified as the Cox model above; only the Bernstein basis for time needs to be replaced by a log-transformation and the corresponding parameter is restricted to one in the estimation.
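That this restriction really yields the exponential AFT model is a one-line calculation: F_MEV(log(y) + x⊤β) = 1 − exp(−y exp(x⊤β)), i.e., an exponential distribution with rate exp(x⊤β). A quick numerical confirmation (Python, illustrative; coefficient values are made up):

```python
import numpy as np
from scipy.stats import expon

F_mev = lambda z: 1 - np.exp(-np.exp(z))

beta, x = 0.4, 2.0
y = np.linspace(0.1, 5, 50)

# Transformation model: F_Z(log(y) + x * beta), positive shift
p_trafo = F_mev(np.log(y) + x * beta)

# Exponential distribution with rate lambda = exp(x * beta)
p_expon = expon.cdf(y, scale = 1 / np.exp(x * beta))

assert np.allclose(p_trafo, p_expon)
```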
It is also possible to combine the log-transformation with the Bernstein polynomial. In this case, Bernstein basis functions are computed for log-transformed survival times, thus a linear Bernstein polynomial is equivalent to a Weibull model. The log_first = TRUE argument to Bernstein_basis() implements this concept; the Cox model for the GBSG2 study can be implemented as

R> log_GBSG2y <- numeric_var("y", support = c(100.0, max(GBSG2$time)),
+    bounds = c(0.1, Inf))
R> lBy <- Bernstein_basis(log_GBSG2y, order = 10, ui = "increasing",
+    log_first = TRUE)

Non-normal linear regression: Boston Housing data. The normal linear model for the median value of owner-occupied homes, as implemented in lm(), is rather restrictive.
R> data("BostonHousing2", package = "mlbench") R> lm_BH <-lm(cmedv~crim + zn + indus + chas + nox + rm + age + dis + + rad + tax + ptratio + b + lstat, data = BostonHousing2) We relax the model formulation without sacrificing the simplicity of a linear predictor of the explanatory variables in the linear transformation model and estimate all model parameters (ϑ 1 , β) simultaneously, i.e., both the regression coefficients and the baseline transformation h Y . Because it is straightforward to evaluate the conditional distribution function, the likelihood can deal with right-censored medvc observations (≥ 50). This censoring was mostly ignored in other parametric or non-parametric analyses of this data set.
We start with a suitable definition of the median value of owner-occupied homes, set up a 'Surv' object for this response and a model formula

R> BostonHousing2$medvc <- with(BostonHousing2, Surv(cmedv, cmedv < 50))
R> var_m <- numeric_var("medvc", support = c(10.0, 40.0), bounds = c(0, Inf))
R> fm_BH <- medvc ~ crim + zn + indus + chas + nox + rm + age +
+    dis + rad + tax + ptratio + b + lstat

First, we aim at fitting a normal linear model taking censored observations properly into account. With a linear baseline transformation h_Y, i.e., a Bernstein polynomial of order one or a linear function with intercept and positive slope, the model is equivalent to the model underlying lm() as explained in the introduction. Only the likelihood changes when we fit the model via

R> B_m <- polynomial_basis(var_m, coef = c(TRUE, TRUE),
+    ui = matrix(c(0, 1), nrow = 1), ci = 0)
R> ctm_BH <- ctm(B_m, shift = fm_BH[-2L], data = BostonHousing2,
+    todistr = "Normal")
R> lm_BH_2 <- mlt(ctm_BH, data = BostonHousing2, scale = TRUE)
R> logLik(lm_BH_2)

log Lik. -1496.301 (df=15)

In a second step, we are interested in a possibly non-linear transformation of the response and use a Bernstein polynomial of order six. In principle, this approach is equivalent to using a Box-Cox transformation but with a more flexible transformation function. In a sense, we do not need to worry too much about the error distribution F_Z as only the additivity assumption on our linear predictor depends on this choice (which may or may not be a strong assumption!). The conditional transformation model is now given by this transformation of the response, a linear predictor of the explanatory variables, and the normal distribution function; the mlt() function fits the model to the data.
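Taking the right-censored observations properly into account changes only the likelihood: an exact observation contributes the log-density, a right-censored one contributes the log-survivor probability 1 − Φ(h(y | x)). A sketch of the two contribution types (Python, illustrative; parameter values are made up, not the fitted ones):

```python
import numpy as np
from scipy.stats import norm

# Normal linear transformation model, h(y | x) = theta1 + theta2 * y - x * beta
theta1, theta2, beta = -3.0, 0.15, 0.5
h = lambda y, x: theta1 + theta2 * y - x * beta

# Exact observation y: log-density contribution log phi(h(y|x)) + log h'(y|x)
y_exact, x_exact = 24.0, 1.0
ll_exact = norm.logpdf(h(y_exact, x_exact)) + np.log(theta2)

# Right-censored observation (y >= 50): log-survivor contribution
y_cens, x_cens = 50.0, 1.0
ll_cens = np.log(1 - norm.cdf(h(y_cens, x_cens)))

assert ll_exact < 0 and ll_cens < 0
```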
The model can be compared with a normal linear model (fitted by lm()) on the scale of the fitted conditional distribution functions. Figure 7 shows the fitted values, i.e., the linear predictor x_i⊤β̂_N for each observation, and the observed response overlaid with the conditional distribution function for the corresponding observations. For the normal linear model featuring a linear baseline transformation h_Y, the fit seems appropriate for observations with linear predictor less than 30. For larger values, the linear predictor underestimates the observations. The conditional distribution obtained from the linear transformation model captures observations with large values of the response better. For smaller values of the response, the fit resembles the normal linear model, although with a smaller conditional variance.
R> tr_PSID1976 <-truncreg(fm_PSID1976, data = PSID1976_0) We use the R() function (see Appendix C.1) to specify left-truncation at zero, set-up a linear transformation function with positive slope for the response (we want to stay within the normal family) and a shift term.

Stratified linear transformation models
Stratification in linear transformation models refers to a stratum-specific transformation function combined with a shift term whose regression coefficients are constant across strata. The model then reads P(Y ≤ y | stratum s, X = x) = F_Z(a(y)⊤ϑ_s − x⊤β), with basis function c = ((a ⊗ b_stratum)⊤, −b_shift⊤)⊤. The basis function b_stratum is a dummy coding for the stratum variable and is defined using the interacting argument of ctm().

The constraints for the parameters of a have to be met for each single stratum, i.e., the total number of linear constraints is the number of constraints for a multiplied by the number of strata.
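The stratum-specific parameterization is literally a Kronecker product of the baseline basis with the stratum dummy coding, so each stratum carries its own copy of the parameters and constraints. A structural sketch (Python, illustrative; the toy numbers are made up):

```python
import numpy as np

K, S = 5, 2                         # length of a(y); number of strata

a_y = np.arange(1.0, K + 1)         # toy evaluation of the basis a(y)
stratum = np.array([0.0, 1.0])      # dummy coding: observation in stratum 2

c = np.kron(a_y, stratum)           # basis a(y) (x) b_stratum, length K * S

# Only the parameters belonging to stratum 2 are activated for this
# observation; the stratum-2 coefficients sit at positions 1, 3, 5, ...
theta = np.arange(1.0, K * S + 1)
h = c @ theta
h_stratum2 = a_y @ theta[1::2]
assert np.isclose(h, h_stratum2)
```

This also shows why the total number of linear constraints is the per-stratum number multiplied by the number of strata: each block of coefficients must satisfy the monotonicity constraints separately.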

Discrete responses
Categorical data analysis: Chinese survey (cont'd) . We first estimate the distribution of happiness given health without taking any other explanatory variables into account, i.e., by treating health as a stratum variable (with dummy coding) but without any regression coefficients β in the model. The conditional distribution for happiness given each health category is returned by predict() as a matrix. There is a clear tendency of people being happier with better health. We now "adjust" for age and income by including a linear shift term which is constant across strata.

R_age        R_income
0.0117386981 0.0002492629

Because the shift basis b_R is negative, the effects of both age and income on the happiness distribution are towards larger values of happiness, i.e., older and richer people are happier for all health levels (this is, of course, due to the restrictive model not allowing interactions between health and the other two variables). The "health-adjusted" odds ratios are now 1.0117 for each year of age and 1.0002 for each additional Yuan earned and, conditional on health, people are getting happier as they get older and richer.

Continuous responses
Survival analysis: GBSG-2 trial (cont'd). The Cox model presented in Section 2.2 features one baseline function for all observations, an assumption which we are now going to relax. As a first simple example, we want to estimate two separate survivor functions for the two treatment regimes in the model (F MEV , (a Bs,10 (y) ⊗ (1(hormonal therapy), 1 − 1(hormonal therapy))) , (ϑ 1 , ϑ 2 ) ).
Here, the transformation functions a Bs,10 (y) ϑ 1 and a Bs,10 (y) ϑ 2 correspond to the untreated and treated groups, respectively.

Conditional transformation models
The most complex class of models currently supported by the ctm() function combines response-varying coefficients (as defined by the basis a) corresponding to the bases (b 1 , . . . , b J ) with constant shift effects (b shift ). Such a model is set up using ctm() with arguments response (basis a), interacting (basis (b 1 , . . . , b J )) and shifting (basis −b shift ). It would be conceptually possible to fit even more complex conditional transformation models, albeit with a less restrictive user interface.
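Schematically, such a call has the following shape; the names a, b1, b2 and b_shift are placeholders for basis objects created with basefun, not runnable code.

```r
## Schematic only: a, b1, b2 and b_shift stand for basefun basis objects;
## the shifting basis is supplied as -b_shift (negative = TRUE).
m <- ctm(response = a,                       # response-varying basis a(y)
         interacting = b(b1 = b1, b2 = b2),  # box product (b1, ..., bJ)
         shifting = b_shift,                 # constant shift effects
         todistr = "Normal")                 # choice of F_Z
```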

Discrete responses
Categorical data analysis: Chinese survey (cont'd). In a series of non-proportional odds models for the conditional distribution of happiness we study the impact of health, age and income on happiness. Similar to the stratified model ctm_CHFLS_3, we estimate the conditional distribution of happiness separately for each health level, but instead of using a dummy coding we use treatment contrasts.

Continuous responses
Survival analysis: GBSG-2 trial (cont'd). The Cox model for the comparison of the survivor distribution between the untreated and treated group assuming proportional hazards, i.e., the model (F MEV , (a Bs,10 , 1(hormonal therapy)) , (ϑ 1 , β) ), implements the transformation function h(y | treatment) = a Bs,10 (y) ϑ 1 + 1(hormonal therapy)β where a Bs,10 ϑ 1 is the log-cumulative baseline hazard function parameterized by a Bernstein polynomial and β ∈ R is the log-hazard ratio of hormonal therapy.
The function a Bs,10 ϑ 2 is the time-varying treatment effect and can be interpreted as the deviation, on the scale of the transformation function, induced by the hormonal therapy. Under the null hypothesis of no treatment effect, we would expect ϑ 2 ≡ 0. It should be noted that the log-cumulative hazard function a Bs,10 (y) ϑ 1 must be monotone. The deviation function a Bs,10 (y) ϑ 2 does not need to be monotone; however, the sum of the two functions must be. Sums of such Bernstein polynomials with coefficients ϑ 1 and ϑ 2 are again Bernstein polynomials with coefficients ϑ 1 + ϑ 2 . Thus, monotonicity of the sum can be enforced via monotonicity of ϑ 1 + ϑ 2 ; the argument sumconstr = TRUE implements this constraint (the default in this specific situation). Figure 9 shows the time-varying treatment effect a Bs,10 ϑ̂ 2,N , together with a 95% confidence band (see Section 5 for a description of the method). The 95% confidence interval around the log-hazard ratio β̂ is plotted in addition; since the latter is fully covered by the confidence band for the time-varying treatment effect, there is no reason to question the treatment effect computed under the proportional hazards assumption.
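A sketch of how such a non-proportional hazards model could be specified is shown below; the object names are ours (the paper uses its own numbering) and the code assumes package mlt and the GBSG2 data from package TH.data.

```r
## Sketch of a non-proportional hazards model with time-varying
## treatment effect (requires mlt; GBSG2 from package TH.data)
data("GBSG2", package = "TH.data")
B_y <- Bernstein_basis(numeric_var("time", support = range(GBSG2$time)),
                       order = 10, ui = "increasing")
b_horTh <- as.basis(~ horTh, data = GBSG2)  # intercept + treatment contrast
ctm_np <- ctm(B_y, interacting = b_horTh, todistr = "MinExtrVal",
              sumconstr = TRUE)   # enforce monotone theta_1 + theta_2
mlt_np <- mlt(ctm_np, data = GBSG2)
```

Because b_horTh contains an intercept, the first Bernstein polynomial acts as the log-cumulative baseline hazard and the second as the treatment deviation.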

R> logLik(mlt_GBSG2_7)
log Lik. -2605.949 (df=22)

In a second step, we allow an age-varying treatment effect in the model (F MEV , (a Bs,10 (y) ⊗ (1(hormonal therapy), 1 − 1(hormonal therapy)) ⊗ b Bs,3 (age) ) , ϑ). For both treatment groups, we estimate a conditional transformation function of survival time y given age, parameterized as the tensor basis of two Bernstein bases. Each of the two basis functions comes with 10 × 3 linear constraints, so the model was fitted under 60 linear constraints.

R> var_a <- numeric_var("age", support = range(GBSG2$age))
R> B_age <- Bernstein_basis(var_a, order = 3)
R> b_horTh <- as.basis(GBSG2$horTh)
R> ctm_GBSG2_8 <- ctm(B_GBSG2y,
+    interacting = b(horTh = b_horTh, age = B_age),
+    todistr = "MinExtrVal")
R> mlt_GBSG2_8 <- mlt(ctm_GBSG2_8, data = GBSG2)
R> logLik(mlt_GBSG2_8)

log Lik. -2588.755 (df=88)

Figure 10 allows an assessment of the prognostic and predictive properties of age. As the survivor functions are clearly larger under hormonal treatment for all patients, the positive treatment effect applies to all patients. However, the size of the treatment effect varies greatly. For women younger than 30, the effect is most pronounced and levels off a little for older patients. In general, the survival times are longest for women between 40 and 60 years old. Younger women suffer the highest risk; for women older than 60 years, the risk starts to increase again. This effect is shifted towards younger women by the application of hormonal treatment.
Quantile regression: Head circumference. The Fourth Dutch Growth Study (Fredriks et al. 2000) is a cross-sectional study on growth and development of the Dutch population younger than 22 years. Stasinopoulos and Rigby (2007) fitted a growth curve to head circumferences (HC) of 7040 boys using a GAMLSS model with a Box-Cox t distribution describing the first four moments of head circumference conditionally on age. The model showed evidence of kurtosis, especially for older boys. We fit the same growth curves by the conditional transformation model (Φ, (a Bs,3 (HC) ⊗ b Bs,3 (age 1/3 ) ) , ϑ) by maximization of the approximate log-likelihood under 3 × 4 linear constraints.
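A rough sketch of this growth-curve model is given below. It assumes a data frame db with variables head and age (such a data set is distributed, for example, with package gamlss.data); all object names are ours.

```r
## Sketch of the head-circumference growth model: Bernstein basis for
## head, interacting Bernstein basis for age^(1/3) (data 'db' assumed)
db$age3 <- db$age^(1/3)                       # cube-root age scale
B_head <- Bernstein_basis(numeric_var("head", support = range(db$head)),
                          order = 3, ui = "increasing")
B_age <- Bernstein_basis(numeric_var("age3", support = range(db$age3)),
                         order = 3)
ctm_head <- ctm(B_head, interacting = B_age, todistr = "Normal")
mlt_head <- mlt(ctm_head, data = db)
## growth curves as conditional quantiles over a grid of ages
q <- predict(mlt_head, newdata = data.frame(age3 = 1:20 / 10),
             type = "quantile", prob = c(0.025, 0.5, 0.975))
```

The conditional quantiles obtained this way play the role of the centile curves reported in the growth-curve literature.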

Non-normal linear regression: Boston Housing data (cont'd).
A response-varying coefficient model, also called distribution regression (Foresi and Peracchi 1995; Chernozhukov, Fernández-Val, and Melly 2013; Koenker, Leorato, and Peracchi 2013), can also be formulated for the Boston Housing data. The model requires the parameters ϑ 0 − ϑ(x) to be monotone increasing for all possible values of x. This type of constraint can be implemented using the sumconstr = TRUE argument to ctm(). The model is implemented using the basis function c = (a Bs,6 ⊗ (1, x )) ; the intercept is required here and x should be scaled to the unit interval. This model, here using parameters ϑ 0 + ϑ(x), can then be fitted.
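A sketch of such a distribution regression fit is shown below; the choice of explanatory variables is ours for illustration (BostonHousing2 from package mlbench provides the data; package mlt is required).

```r
## Sketch of distribution regression for the Boston Housing data
## (explanatory variables chosen for illustration only)
data("BostonHousing2", package = "mlbench")
B_medv <- Bernstein_basis(numeric_var("cmedv",
                              support = range(BostonHousing2$cmedv)),
                          order = 6, ui = "increasing")
b_x <- as.basis(~ crim + rm + lstat, data = BostonHousing2,
                scale = TRUE)                  # intercept kept, x in [0, 1]
ctm_BH <- ctm(B_medv, interacting = b_x, sumconstr = TRUE,
              todistr = "Normal")
mlt_BH <- mlt(ctm_BH, data = BostonHousing2, scale = TRUE)
```

Every regression coefficient is here a function of the response, so non-constant variance and skewness of the conditional distributions can be picked up by the model.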
For some observations, the variance of the conditional distribution functions seems to be smaller in the more complex distribution regression models with response-varying effects, compared to a linear transformation model with constant shift effects β. There is not much difference between the distribution regression models fitted under the different constraints.

Count responses
Finally, we study a transformation model with response-varying coefficients for count data.
Analysis of count data: Tree pipit counts. Müller and Hothorn (2004) reported data on the number of tree pipits Anthus trivialis, a small passerine bird, counted on 86 forest plots at a light gradient ranging from open and sunny stands (small cover storey) to dense and dark stands (large cover storey). We model the conditional distribution of the number of tree pipits at one plot given the cover storey at this plot by the transformation model (Φ, (a ⊗ b Bs,4 (cover storey) ) , ϑ), where a(y) = e 5 (y + 1), y = 0, . . . , 4; the model is fitted under 4 × 5 linear constraints. In this model for count data, the conditional distribution depends on both the number of counted birds and the cover storey, and the effect of cover storey may change with different numbers of birds observed.
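A sketch of this count model follows, assuming a data frame treepipit with variables counts (values 0 to 4) and coverstorey; object names are ours and package mlt is required.

```r
## Sketch of the count transformation model for tree pipits
## (data frame 'treepipit' with counts and coverstorey assumed)
b_counts <- as.basis(~ factor(counts) - 1,
                     data = treepipit)       # unit vectors e_5(y + 1)
B_cs <- Bernstein_basis(numeric_var("coverstorey",
                            support = range(treepipit$coverstorey)),
                        order = 4)
ctm_tp <- ctm(b_counts, interacting = B_cs, todistr = "Normal")
mlt_tp <- mlt(ctm_tp, data = treepipit)
```

The dummy-coded response basis assigns one transformation-function value to each observed count, and the interacting Bernstein basis lets these values vary smoothly with cover storey.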

Most likely transformations
In this section we review the underpinnings of the mlt() function implementing the most likely transformation estimator. Most of the material in this section originates from Hothorn et al. (2018). For a given transformation function h, the likelihood contribution of a datum C = (y̲, ȳ] ∈ C is defined in terms of the distribution function (Lindsey 1996): L(h | Y ∈ (y̲, ȳ]) = F Z (h(ȳ)) − F Z (h(y̲)). This "exact" definition of the likelihood applies to most practically interesting situations and, in particular, allows discrete and (conceptually) continuous as well as censored or truncated observations C. For a discrete response y k we have ȳ = y k and y̲ = y k−1 , such that the likelihood contribution equals the probability of category y k . For an absolutely continuous response, the likelihood contribution of an "exact" observation y is commonly approximated by the Lebesgue density f Z (h(y)) h′(y) (Lindsey 1999). This approximation only works for relatively precise measurements, i.e., short intervals. If longer intervals are observed, one speaks of "censoring" and relies on the exact definition of the likelihood contribution instead of the approximation (Klein and Moeschberger 2003). In summary, the likelihood contribution of a conceptually "exact continuous" or left-, right- or interval-censored continuous or discrete observation (y̲, ȳ] is given by these expressions under the assumption of random censoring. The likelihood is more complex under dependent censoring (Klein and Moeschberger 2003), but this is not covered by the mlt implementation. The likelihood contribution L(h | Y ∈ (y k−1 , y k ]) of an ordered factor in category y k is equivalent to the term L(h | Y ∈ (y̲, ȳ]) contributed by an interval-censored observation (y̲, ȳ] when category y k was defined by the interval (y̲, ȳ]. Thus, the expression F Z (h(ȳ)) − F Z (h(y̲)) for the likelihood contribution reflects the equivalence of interval-censoring and categorization at corresponding cut-off points.
For truncated observations in the interval (y l , y r ] ⊂ Ξ, the above likelihood contribution is defined in terms of the distribution function conditional on the truncation for all y ∈ (y l , y r ], and thus the likelihood contribution changes to the ratio L(h | Y ∈ (y̲, ȳ]) / L(h | Y ∈ (y l , y r ]) (Klein and Moeschberger 2003). It is important to note that the likelihood is always defined in terms of a distribution function (Lindsey 1999), and it therefore makes sense to directly model the distribution function of interest. The ability to uniquely characterize this distribution function by the transformation function h gives rise to the following definition of an estimator ĥ N . For an independent sample of possibly censored or truncated observations C 1 , . . . , C N from P Y , the estimator maximizing the corresponding log-likelihood is called the most likely transformation (MLT). In mlt, we parameterize the transformation function h(y) as a linear function of its basis-transformed argument y using a basis function a : Ξ → R P such that h(y) = a(y) ϑ, ϑ ∈ R P . The choice of the basis function a is problem-specific; practically relevant examples were discussed in Section 2. In the conditional case we use the basis c(y, x) instead of a(y). The likelihood L only requires evaluation of h; only the approximation via the Lebesgue density for "exact continuous" observations makes the evaluation of the first derivative of h(y) with respect to y necessary. In this case, the derivative is given by h′(y) = a′(y) ϑ and we assume that a′ is available.
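The three types of likelihood contributions can be made concrete in base R for the simplest case of a linear transformation h(y) = ϑ1 + ϑ2 y with F Z = Φ. This is a toy illustration of the formulas above, not part of the mlt interface.

```r
## Toy illustration in base R: likelihood contributions under
## h(y) = theta1 + theta2 * y with F_Z = pnorm (standard normal)
theta <- c(-1, 0.5)
h <- function(y) theta[1] + theta[2] * y

## interval-censored observation (ylow, yup]: F_Z(h(yup)) - F_Z(h(ylow))
L_cens <- function(ylow, yup) pnorm(h(yup)) - pnorm(h(ylow))
## "exact continuous" approximation: f_Z(h(y)) * h'(y)
L_exact <- function(y) dnorm(h(y)) * theta[2]
## truncation to (yl, yr]: renormalize by the truncation probability
L_trunc <- function(ylow, yup, yl, yr) L_cens(ylow, yup) / L_cens(yl, yr)

L_cens(1, 3)   # probability mass placed on the interval (1, 3]
```

Replacing pnorm/dnorm by plogis/dlogis or the minimum extreme value distribution switches between the model families discussed in Section 2.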
In the following we write h = a ϑ and h′ = a′ ϑ for the transformation function and its first derivative, omitting the argument y, and we assume that both functions are bounded away from −∞ and ∞. For the basis functions discussed in Section 2, the constraint ϑ ∈ Θ can be written as Cϑ ≥ 0; thus, the solution to the constrained optimization problem is the maximum likelihood estimator. The plug-in estimator for the most likely transformation is ĥ N := a ϑ̂ N .
The score contribution of an "exact continuous" observation y = (y̲ + ȳ)/2 from an absolutely continuous distribution is approximated by the gradient of the log-density. For an interval-censored or discrete observation (y̲, ȳ], the constant terms F Z (a(±∞) ϑ) = F Z (±∞) = 1 or 0 vanish from the score contribution. For a truncated observation, the score function is s(ϑ | Y ∈ (y̲, ȳ]) − s(ϑ | Y ∈ (y l , y r ]).
The mlt() function uses the convenience interfaces auglag() for augmented Lagrangian minimization (Varadhan 2015, package alabama) and BB() for spectral projected gradients implemented in the spg() function of package BB (Varadhan and Gilbert 2009, 2019) for maximizing this log-likelihood. Starting values are obtained from an estimate of the unconditional distribution function P(Y ≤ y) (for example, the empirical cumulative distribution function, a Kaplan-Meier or Turnbull estimate) via a (constrained) linear regression of z i = F −1 Z (P(Y ≤ y i )) on the design matrix of the transformation model. Optionally, columns of the underlying model matrices are scaled to [−1, 1] (argument scale = TRUE to mlt()), which leads to considerably faster optimization in many cases.
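The starting-value heuristic can be illustrated in base R for the simple linear basis a(y) = (1, y); this toy sketch is independent of mlt.

```r
## Toy base-R sketch of the starting-value computation: regress
## z_i = qnorm(ECDF(y_i)) on the columns of the basis a(y) = (1, y)
set.seed(29)
y <- rchisq(100, df = 3)
p <- (rank(y) - 0.5) / length(y)  # ECDF values, shifted away from 0 and 1
z <- qnorm(p)                     # probability scale -> latent Z scale
start <- coef(lm(z ~ y))          # least-squares starting values
start                             # slope is positive, as a monotone h needs
```

In mlt, the same idea is applied with the model's actual basis functions and, where necessary, under the linear constraints Cϑ ≥ 0.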

Transformation analysis
Based on the maximum likelihood estimator ϑ̂ N and the most likely transformation ĥ N , transformation models can be analyzed on different scales. Plug-in estimators for the distribution and cumulative hazard functions are given by F̂ Y,N = F Z ∘ ĥ N and Λ̂ Y,N = −log(1 − F̂ Y,N ). For a continuous model, the density is given by f̂ Y,N = f Z ∘ (a ϑ̂ N ) × (a′ ϑ̂ N ) and the quantile function is obtained by numerical inversion of the distribution function. For a discrete model we get the density f̂ Y,N (y k ) = F̂ Y,N (y k ) − F̂ Y,N (y k−1 ). The predict() method for 'mlt' objects is the main user interface for the evaluation of these functions, the type of which is selected by its type argument. Conceptually, all functions are evaluated on the support of Y , i.e., on a grid y 1 , . . . , y K (arguments q for the vector of grid points or K for the number of grid points to be generated) for continuous responses, for observations with explanatory variables as given in the newdata argument. That means the transformation function ĥ N (y k | x i ) = c(y k , x i ) ϑ̂ N , k = 1, . . . , K; i = 1, . . . , N new is evaluated for potentially large numbers K and N new by predict() and is returned as a K × N new matrix (columns correspond to observations). In the most general case of a conditional distribution function, the transformation function can be written in matrix form, with A being the matrix with rows a(y k ) for k = 1, . . . , K, B the matrix with rows (b 1 (x i ), . . . , b J (x i )) for i = 1, . . . , N new and B shift the matrix with rows b shift (x i ) for i = 1, . . . , N new ; the transformation function can then be evaluated simultaneously for all k = 1, . . . , K and i = 1, . . . , N new . This product is in the special form of an array model (Currie, Durban, and Eilers 2006) and is computed very quickly by mlt using the tricks described by Currie et al. (2006).
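The computational gain of the array form can be checked in base R for a simple tensor basis a(y) ⊗ b(x): the matrix product A Θ B⊤ gives the same grid of transformation values as the full Kronecker design matrix, without ever materializing the latter. The dimensions below are arbitrary toy values.

```r
## Base-R check of the array-model shortcut: for a tensor basis
## a(y) (x) b(x), evaluating on a K x Nnew grid via A %*% Theta %*% t(B)
## agrees with the full Kronecker design matrix
set.seed(1)
K <- 5; Nnew <- 4; P <- 3; Q <- 2
A <- matrix(runif(K * P), nrow = K)        # rows a(y_k)
B <- matrix(runif(Nnew * Q), nrow = Nnew)  # rows b(x_i)
theta <- runif(P * Q)                      # vec of the P x Q matrix Theta
H_naive <- matrix(kronecker(B, A) %*% theta, nrow = K)
H_array <- A %*% matrix(theta, nrow = P) %*% t(B)
all.equal(H_naive, H_array)                # TRUE
```

The identity used is vec(A Θ B⊤) = (B ⊗ A) vec(Θ); the array form replaces an O(K · Nnew · P · Q) design matrix by two small matrix products.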

Classical likelihood inference
Because the problem of estimating an unknown distribution function is embedded in the maximum likelihood framework in mlt, the asymptotic analysis benefits from standard results on the asymptotic behavior of maximum likelihood estimators. The contribution of an "exact continuous" observation y from an absolutely continuous distribution to the Fisher information is approximated by the negative Hessian of its log-density contribution with respect to ϑ; for a censored or discrete observation, the contribution to the Fisher information is obtained analogously from the corresponding likelihood contribution.

For a truncated observation, the Fisher information is given by the difference of the contributions for the interval (y̲, ȳ] and the truncation interval (y l , y r ], mirroring the score function.
Based on these results, we can construct asymptotically valid confidence intervals and confidence bands for the conditional distribution function from confidence intervals and bands for the linear functions a ϑ.

Categorical data analysis: Chinese survey (cont'd).
For the proportional odds model for happiness given age and income, we compare the score function (using the estfun() method from package sandwich). The positive effect of income is "significant" by standard measures (the classical coefficient table was obtained using cftest() from package multcomp; Hothorn, Bretz, and Westfall 2008, 2019).

Density estimation: Geyser data (cont'd).
A relatively simple method for the construction of asymptotic confidence bands is based on the multivariate joint normal distribution of the model coefficients ϑ̂ N . A confidence band for the unconditional transformation function of waiting times is derived from simultaneous confidence intervals for the linear function Aϑ̂ N as implemented in package multcomp. Here A is the matrix of Bernstein basis functions evaluated on a grid of K waiting times y 1 , . . . , y K . The quantile adjusted for multiplicity is then used on a finer grid of values for plotting purposes.

Simulation-based likelihood inference
The fully specified probabilistic model (F Z , a, ϑ̂ N ) allows sampling from the distribution P̂ Y . For estimated parameters ϑ̂ N , this model-based or "parametric" bootstrap from P̂ Y can be implemented by the probability integral transform: Z 1 , . . . , Z N iid ∼ P Z are drawn and then Y i = inf{y ∈ Ξ | a(y) ϑ̂ N ≥ Z i } is determined by numerical inversion of the distribution function. The simulate() method for 'mlt' objects first computes the distribution function over a grid of K response values, draws nsim times nrow(newdata) random samples from P Z and returns the intervals y k < Y i < y k+1 (interpolate = FALSE) or a linear interpolation thereof (interpolate = TRUE).
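The probability integral transform can be sketched in base R for a linear transformation h(y) = ϑ1 + ϑ2 y, where the inversion is available in closed form; this toy example is independent of mlt.

```r
## Base-R sketch of the model-based bootstrap: sample Z from F_Z and
## invert the (here linear) transformation h(y) = theta1 + theta2 * y
set.seed(42)
theta <- c(-2, 0.5)
Z <- rnorm(1000)                 # Z_i iid from F_Z = pnorm
Y <- (Z - theta[1]) / theta[2]   # inf{y : h(y) >= Z_i} for linear h
## sanity check: transforming the sample back recovers Z
all.equal(theta[1] + theta[2] * Y, Z)   # TRUE
```

For a nonlinear basis, the closed-form inversion is replaced by the numerical inversion over a grid that simulate() performs.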

Density estimation: Geyser data (cont'd).
The model-based bootstrap analogue of the confidence band for the distribution function of waiting times (Figure 14, right panel) based on 100 bootstrap samples is generated from the following code. First, 100 samples of size N = 299 are drawn from the fitted unconditional model mlt_w for waiting times. For each sample, the model ctm_w is refitted, using the parametersθ N as starting values (theta) to speed-up the computations. Next, the log-likelihood ratio is computed for each sample. Last, distribution and density functions are obtained for each of the models in this small loop.
In the conditional case, nsim samples from the N conditional distributionsP Y |X=x i , i = 1, . . . , N are drawn by simulate().

Summary
The computational framework implemented in package mlt allows fitting of a large class of transformation models. Many established R add-on packages implement special cases, mostly in the survival analysis context. The most prominent one is the survival package (Therneau and Grambsch 2000; Therneau 2019) with partial likelihood estimation of the Cox model in coxph() and parametric linear transformation models in survreg(). Parametric proportional hazards (phreg()) and accelerated failure time models are also available from package eha. Interval-censored responses are handled by, for example, coxinterval (Boruvka and Cook 2015), MIICD (Delord 2017) and ICsurv (McMahan and Wang 2014). No special treatment of this situation is necessary in mlt, as the likelihood maximized by mlt() allows arbitrary schemes of random censoring and also truncation.
Alternative likelihood approaches to transformation models often parameterize the hazard or log-hazard function by some spline, including the R add-on packages polspline (Kooperberg 2019b), logspline (Kooperberg 2019a), bshazard (Rebora, Salim, and Reilly 2018) and gss (Gu 2014, 2019). Packages muhaz (Hess and Gentleman 2019) and ICE (Braun 2013) implement kernel smoothing for hazard function estimation; the latter package allows for interval-censored responses. The penalized maximum likelihood estimation procedure for simultaneous estimation of the baseline hazard function and the regression coefficients in a Cox model is available from package survivalMPL (Couturier, Ma, Heritier, and Manuguerra 2017). Estimation of the unconstrained log-hazard function is theoretically attractive, but the resulting erratic estimates have to be dealt with by some form of penalization. A direct parameterization of the transformation function, i.e., the log-cumulative baseline hazard in the Cox model, only requires monotone increasing functions to be fitted. Thus, penalization is not necessary, but one has to deal with a constrained problem. Package mlt follows the latter approach.
Based on the estimation equation procedure by Chen, Jin, and Ying (2002), TransModel (Zhou, Zhang, and Lu 2017) implements continuous-time proportional hazards and proportional odds linear transformation models. Time-varying coefficients can be estimated using packages dynsurv (Wang, Chen, Wang, and Yan 2019) and timereg (Scheike and Martinussen 2006; Scheike and Zhang 2011). Maximum likelihood estimators for a fair share of the models implemented in these established packages can be re-implemented using the computational infrastructure offered by package mlt. The availability of (at least) two independent implementations allows package developers to validate their implementation. In fact, some errors in earlier development versions of mlt could be detected by comparing model outputs. Because the implementation of maximum likelihood estimation for conditional transformation models presented in this paper relies on a rather dense code base (200 lines of pure R code in variables, 860 lines in basefun and 1450 lines in mlt, all convenience functions and user interfaces included), the likelihood of implementation errors is smaller compared to the likelihood of errors in any of the plethora of alternative implementations of special linear transformation models (but not zero, of course).
The modeling abilities of mlt go beyond what is currently available in R. Perhaps the practically most interesting example is distribution regression, i.e., transformation models with response-varying regression coefficients. The only special case available in R we are aware of are Cox models with time-varying effects in packages dynsurv and timereg. For other models, such as the distribution regression analysis of the Boston Housing data presented here, or for non-proportional odds or non-proportional hazards models, implementations are lacking. Our analysis of the bird counts example is novel also from a modeling point of view, as linear transformation models for count data are still waiting to be kissed awake.
One unique feature of mlt is the strict separation of model specification (using ctm()) and model estimation (using mlt()) allowing computations on unfitted models. Models are specified by combinations of basis functions instead of using the rather restrictive formula language. In addition, model coefficients can be altered by the user, for example for computing conditional distribution or density functions or for the evaluation of the log-likelihood at arbitrary values of the parameters. The simulate() method for 'mlt' objects can always be used to draw samples from fitted (or unfitted) transformation models. Thus, the implementation of parametric bootstrap procedures is a straightforward exercise.
The very flexible and powerful user interface of mlt will, however, be incomprehensible for most of its potential users. Because transformation models seldom receive the attention they deserve in statistics courses, the unorthodox presentation of regression models in this documentation of mlt, which ignores the fences between the traditionally compartmentalized fields of "regression analysis", "survival analysis" and "categorical data analysis", will likely also confuse many statisticians, let alone data or subject-matter scientists. Current efforts concentrate on the implementation of convenience interfaces, i.e., higher-level user interfaces for special forms of transformation models, which resemble the traditional notational and naming conventions from the statistical modeling literature. Standard formula-based interfaces to the underlying mlt implementation of Cox models, parametric survival models, models for ordered categorical data, normal linear and non-normal linear models are available in package tram (Hothorn 2019). The corresponding package vignette demonstrates how some of the model outputs presented in this paper can be obtained in simpler ways.
The core functionality provided by mlt was instrumental in developing statistical learning procedures based on transformation models. Transformation trees and corresponding transformation forests were introduced by Hothorn and Zeileis (2017) and implemented in package trtf; an application to conditional distributions for body mass indices was described by Hothorn (2018) and novel survival forests have been evaluated by Korepanova, Seibold, Steffen, and Hothorn (2019). Two different gradient boosting schemes allowing complex models to be built in the transformation modelling framework were proposed by Hothorn (2020d) and are implemented in package tbm.

A. The variables package
The variables package (Hothorn 2020c) offers a small collection of classes and methods for specifying and computing on abstract variable descriptions. The main purpose is to allow querying properties of variables without having access to observations. A variable description allows one to extract the name (variable.name()), description (desc()) and unit (unit()) of the variable, along with the support (support()) and possible bounds (bounds()) of the measurements. The mkgrid() method generates a grid of observations from the variable description. The package differentiates between factors, ordered factors, discrete, and continuous numeric variables.

A.1. Unordered factors
We use eye color as an example of an unordered factor. The corresponding variable description is defined by the name, description and levels of this factor.
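A sketch of such a definition is given below; the levels are chosen for illustration and may differ from the paper's actual example (package variables required).

```r
## Sketch: unordered factor description without any observations
## (levels chosen for illustration)
library("variables")
f_eye <- factor_var("eyecolor", desc = "eye color",
                    levels = c("blue", "brown"))
variable.name(f_eye)
support(f_eye)
mkgrid(f_eye)
```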

A.2. Ordered factors
An ordered factor (temperature in categories is used here as an example) is defined as in the unordered case

R> o_temp <- ordered_var("temp", desc = "temperature",
+    levels = c("cold", "lukewarm", "warm", "hot"))

and the only difference is that explicit bounds are known.

A.3. Discrete numeric variables
Discrete numeric variables are defined by numeric_var() with integer-valued support argument, here using age of a patient as example.
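A sketch of such a definition might look as follows; the support range and unit are illustrative assumptions (package variables required).

```r
## Sketch: discrete numeric variable with integer-valued support
## (support range and unit chosen for illustration)
library("variables")
v_age <- numeric_var("age", desc = "age of patient", unit = "years",
                     support = 25:75)
support(v_age)   # the integer grid 25, ..., 75
bounds(v_age)
```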

B. The basefun package
The basefun package (Hothorn 2020a) implements Bernstein, Legendre, log and polynomial basis functions. In addition, facilities for treating arbitrary model matrices as basis functions evaluated on some data are available. Basis functions can be joined column-wise using c(), or their box product can be generated using b(). The definition of basis functions does not require any actual observations; only variable descriptions (see Appendix A) are necessary. Each basis offers model.matrix() and predict() methods. We illustrate how one can set up some of these basis functions in the following sections.

B.4. Model matrices
Model matrices are basis functions evaluated at some data. The basefun package offers an as.basis() method for 'formula' objects which basically represents unevaluated calls to model.matrix() with two additional arguments (remove_intercept removes the intercept after appropriate contrasts were computed and negative multiplies the model matrix with −1). Note that the data argument does not need to be a 'data.frame' with observations; a variable description is sufficient. If data is a data frame of observations, scale = TRUE scales each column of the model matrix to the unit interval. Here is an example using the iris data.
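The example referred to is not reproduced above; a sketch of what it might look like is given below (our formula, the paper's actual code may differ; package basefun required).

```r
## Sketch of an as.basis() model matrix basis on the iris data
## (formula chosen for illustration)
library("basefun")
iv <- as.basis(~ Species + Sepal.Length + Sepal.Width, data = iris,
               remove_intercept = TRUE, negative = TRUE, scale = TRUE)
head(model.matrix(iv, data = iris))
```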

R> mlt::coef(mod) <- h(x)

We can now simulate from this purely theoretical model

R> d <- as.data.frame(mkgrid(yvar, n = 500))
R> d$grid <- d$y
R> d$y <- simulate(mod, newdata = d)

fit this model to the simulated data

R> fmod <- mlt(mod, data = d, scale = TRUE)

and compare the true (mod) and fitted (fmod) models by looking at the coefficients and log-likelihoods (evaluated for the estimated and true coefficients).

R> logLik(fmod)

log Lik. -1601.597 (df=16)

R> logLik(fmod, parm = coef(mod))

log Lik. -1604.57 (df=16)

The corresponding density functions of the χ²_20 distribution, the approximating true transformation model mod and the estimated transformation model fmod are plotted in Figure 16 using the code shown on top of the plot.