Comparisons of Estimation Procedures for Nonlinear Multilevel Models

We introduce General Multilevel Models and discuss the estimation procedures that may be used to fit multilevel models. We apply the proposed procedures to three-level binary data generated in a simulation study. We compare the procedures by two criteria, Bias and efficiency. We find that the estimates of the fixed effects and variance components are substantially and significantly biased using Longford's Approximation and Goldstein's Generalized Least Squares approaches by two software packages VARCL and ML3. These estimates are not significantly biased and are very close to real values when we use Markov Chain Monte Carlo (MCMC) using Gibbs sampling or Nonparametric Maximum Likelihood (NPML) approach. The Gaussian Quadrature (GQ) approach, even with small number of mass points results in consistent estimates but computationally problematic. We conclude that the MCMC and the NPML approaches are the recommended procedures to fit multilevel models.


Introduction
Many kinds of observational and design data are such that the observations are clustered within groups and the groups in terms are nested in upper groups and form multilevel data. Two types of multilevel data are longitudinal and multicenter clinical trial data. In longitudinal studies, we investigate changes over time in characteristics which is measured repeatedly for each individual.
In medical studies, the measurements may be the number of epileptic seizures which are recorded for each patient over time (Thall and Vail, 1990). In social sciences, the measurements may be residential states which are observed for each individual over time (Clark et al., 1979). Longitudinal data are examples of two level data and the observations, in first level for each individual, in second level, are correlated. In a multicenter clinical study, subjects within a given site are prospectively studied over time. An example of this type of data is the data from the Multicenter AIDS Cohort Study or MACS (Kaslow et al., 1987). Here data exists at three levels: measurement occasion, subject, and site. There may be correlation between the repeated experiences of an individual subject as well as possible correlation between the experiences of subjects within a sit.
Generally speaking, in multilevel data, the observations within the same group are more likely to be correlated than the observations from different groups. So in each level we have some type of correlation. The correlations from all levels should be taken into account and ignoring any one of them may lead to inconsistent estimates and misleading inferences. A well known method of representing this common variation is by adding a common unobserved random effect to the linear predictor for each lower level unit in the same upper level unit. If the distribution of this random effects is conjugate to the distribution of the responses, then maximum likelihood is straightforward. Otherwise the likelihood function does not have a closed form and we need an approach to deal with the problem when we assume a specific distribution for the random effects in each level. A common distribution for the random effects is normal distribution with mean zero (Breslow and Clayton, 1993;McGilchrist, 1994). If the distribution of the responses is not normal then we will have an integral in each level (except for the random coefficient if there exists). In this case the dimension of integrals in the likelihood function is equal to the number of the levels. Some approaches to solve the In these three approaches we assume a specific parametric form of the mixing distribution of the unobserved random effects. Heckman and Singer (1984) and Davies (1987) have shown that the parameter estimation is sensible to the choice of the mixing distribution. This problem can be solved by Nonparametric Maximum Likelihood (NPML) estimation on mixing distribution on a finite number of mass points. This approach is used by Aitkin (1999) for fitting two-level data.
For the data that are clustered and / or longitudinal, random effects regression models have been developed to model continuous data (Laid and Ware, 1982;Bock, 1989;Jennrich and Schluchter, 1986;Bryk and Raudenbush, 1987). The same models have also been developed for dichotomous data (Stiratelli, Laird and Ware, 1984;Anderson and Aitkin, 1985;Wong and Mason, 1985;Zeger and Liang, 1986;Gibbons and Bock,1987;Qu et al. , 1972). Goldstien (1991) has discussed nonlinear multilevel models and their application to discrete response data. Qaqish and Liang (1992) have presented marginal models for correlated binary response within multiple classes and multiple levels of nesting. Hedeker and Gibbons (1994) have proposed a random effects ordinal regression model for multilevel analysis. Their model is proposed for analyzing the clustered longitu-dinal ordinal response data. Gibbons and Hedeker (1997) have explained random effects probit and logistic regression models for three level data. Aitkin (1999) has described an EM algorithm for nonparametric maximum likelihood estimation in generalized linear models with variance components structure. Various approaches have been proposed in recent years to model survival data with two level clustering. For example, Clayton (1991), Gray (1992, Klein (1992), Nielsen et al. (1992, McGilchrist (1993), Lin (1994). Currently Yau (2001) has described and applied a method for modeling survival data with multilevel clustering and random effects.
There are many software for multilevel data analysis and a number of comparisons have been done by some researchers. Kreft et al. (1994) have compared HLM, ML3, VARCL, BMDP5-V, and GENMOD. Rodriguez and Goldman (1995) have evaluated two software packages VARCL and ML3 for fitting models to binary response data by using a Monte Carlo study. Van der Leeden et al. (1996) have compared HLM, ML3, and BMDP5-V on repeated measures data. De Leeuw and Kreft (2001)  Very little work has been done on using and comparing the four mentioned approaches, GQ, Taylor series, MCMC, and NPML in analyzing multi level data.
The purpose of this paper is to model a multilevel data in a general form and explain, apply and compare the above approaches through simulation study. Our analysis focuses on bias and efficiency of estimates produced by the mentioned approaches. However the results will compare some software in fitting multilevel models.
The rest of this paper is organized as follows. In section 2 we introduce multilevel model. In section 3 we describe the estimation procedures. In section 4 we present the simulation study and compare the results from different procedures.
Section 5 summarizes our conclusions.

Multilevel Models
Before introducing multilevel models in a general form consider a simple three level model. Suppose that the data consists of N 1 communities and each community consists of N 2 families with N 3 children within each family. Here communities, families and children define level one, two and three for a three level data respectively. Suppose that we are interested in estimating the effect of x ijk , x ij , and x i , the explanatory variables in levels one, two, and three respectively, on the binary response measured for each child. Moreover assume that we believe the community and the family populations are heterogeneous. To control for heterogeneity we introduce two random effects u ij and u i at the second and first level. With these specifications the linear predictor will be of the form β 1 x ijk + β 2 x ij + β 3 x i + u i + u ij , i = 1, 2, ..., N 1 , j = 1, 2, ..., N 2 , k = 1, 2, ..., N 3 .
Assuming any link function, for example logit, one may estimate β 1 , β 2 , β 3 the fixed effects of x ijk , x ij , and x i and the variances of u ij and u i by using appropriate estimation procedure. Following we introduce multilevel models in a general form and in the next section we will discuss about various estimation approaches.

Multilevel Linear Model
consider the situation that the data have L levels, such that in level i we have f i fixed effects, r i random effects, and g i units. We first define the following notations.
where : β j is f i × 1 column vector of fixed effects for j th unit of level i. Now vector of random effects across all levels. Using these notations a multilevel linear model is defined as, where Y is N × 1 vector of responses and ε is N × 1 vector of error terms.
Assume that the random effects from different units are mutually independent with mean 0 and V ar(u i ) = Ω i . We then have V ar(u) = Ω and Ω =diag L [I g i ⊗ Ω i ]. Moreover assume that V ar(ε) =σ 2 W and Cov(ε, u) = 0. With these assumptions we have E(Y) = Xβ and, The model introduced by Rodriguez and Goldman (1995) is a especial case of our model in equation (1). Their model includes three levels and no random effects in the first level. There are many cases in applications that the analyst is interested in considering random coefficients for individuals in the first level especially a random effect to control omitted variables.

Multilevel Nonlinear Model
Following Goldstein (1991) where X, Z, β, and u have the same definition as above and G, H, η, and δ have the similar structure to X, Z, β, and u. Model (1) is a special case of model (2) if we omit the second and the third terms and f is an identity function.
An important especial case of model (2) is when the response is a vector of proportions, there is no linear component, and f is logit function. The multilevel logit model then is of the form, , Ω, X, Z); for i = 1, ..., N and η is a conditional linear predictor.

Estimation Approaches
We assume that the elements of Ω i are unknown parameters and W is a known matrix. Further we assume that ε has multivariate normal distribution and we look for Maximum Likelihood Estimation (MLE) of the parameters. Under these assumptions the conditional likelihood function for a multilevel model is, where f is the density function of y i . The marginal likelihood function of the parameters is, where g is the density function of the vector u. If we assume that u is normally distributed (Breslow and Clayton, 1993) then, where Φ is the multivariate normal density. For the multilevel logit model (3) the conditional likelihood function is, The marginal likelihood function can be obtained by replacing from (7) in (5) or in (6). Note that for other multilevel nonlinear models such as multilevel probit model we replace µ i in (7) by Φ(η i ), where η i is the linear predictor for individual i. (5) is intractable unless the distribution of u is conjugate to the distribution of y. Expression (6) is also intractable if the distribution of y is not normal. However, simple two level exponential family models other than the normal with a normal random effect may be fitted with difficulty and slow by maximum likelihood. Therefore, in order to maximize the likelihood function (6) many algorithms have been used. Goldstein (1986) have proposed a generalized least squares algorithm that has been implemented in the package ML3. Longford (1987) has proposed a Fisher scoring algorithm that has been implemented in program VARCL. Raudenbush and Bryk (1986) have used an EM algorithm in the package HLM. Mason et al. (1983) have also used an EM algorithm in the program GENMOD. Rodrigues and Goldman (1995) have compared least squares algorithm and Fisher scoring algorithm through package ML3 and program VARCL. Tsutakawa (1985) has used full Bayes and empirical Bayes approaches to analyze two-level models with a random effect at upper level and no explanatory variable at the lower level. Aitkin (1999) has compared the results of Tsutakawa (1985) with his results of applying GQ and NPML approaches using GL1M4 software. Hedeker and Gibbons (1994) have used Gauss

The numerical integration in
Hermite Quadrature approach to analyze two level data set. Gibbons and Hedeker (1997) have used numerical quadrature approach to find the maximum marginal likelihood estimation in random effects probit and Logistic regression models for the three level data. They have also compared parameter estimates for a normal versus rectangular prior to determine the degree to which their estimates are robust to deviation from the assumed normality of the prior distribution of the random effects. Wong and Mason (1985) have proposed empirical Bayes approach and have applied it to estimate the parameters of a two-level model. Qaqish and Liang (1992) have used the Generalized Estimation Equations (GEE) approach of Liang and Zeger (1986). They have estimated the parameters of marginal models for correlated binary responses with multiple classes and multiple levels of nesting. A regression coefficient in this marginal model is interpreted as the change in the "population average" responses rather than the changes in any one cluster's expected response with covariate X.
Except the work of Aitkin (1999) and Rodriguez and Goldman (1995) that have compared some of these approaches, no research has been done to compare the above approaches. For example, Rodriguez and Goldman (1995) have used simulation but have only compared Longford's approximate likelihood (1988 and 1994) and Goldstein's Generalized least squares approach (1991). In this paper we use simulation and will show that other approaches perform better than approaches used in VARCL and ML3. Following we briefly explain the approaches that are tractable to deal with the integrals in (5) and (6).

Gaussian Quadrature Approach
To explain approaches we consider a three level logit model with one random effect at each of the levels two and three. If we consider one explanatory variable at each level then model (5) reduces to where x ijk , x ij , and x i are the explanatory variables in levels one, two, and three respectively. u ij and u i are the random effects with means zero and standard errors σ 1 , σ 2 and density g 1 , g 2 related to second and third levels respectively. y ijk is the response for the k th individual in the j th unit of level two and i th unit of level one. β 1 , β 2 , β 3 are the fixed effects of x ijk , x ij , and x i . Here we need to calculate one dimensional integral.
If we assume that the random effects u ij and u i are distributed normally with means zero and standard errors σ 1 , σ 2 then we can approximate the integrals in (8) by a number of GQ points. The likelihood function (8) becomes, where α 1 , α 2 , ..., α s1 are s 1 Gaussian mass points with probability masses p 1 , p 2 , ... , p s 1 at level two and τ 1 , τ 2 , ..., τ s 2 are s 2 Gaussian mass points with probability masses q 1 , q 2 , ..., q s 2 at level one. We may use equal number of masses for different levels.

Nonparametric Approach
A disadvantage of any approach using a specified parametric form for the distribution of the unobserved random effects is the sensitivity of the parameter estimation to the choice of the distribution of the random effects (Heckman andSinger ,1984 andDavies, 1987). This important problem can be solved by considering the mass points and probability masses as parameters and estimate them together with the structural parameters. So the nonparametric approach is not a simplification of parametric approach. The identification of the number, location and masses of these points of support present formidable computational problems.
This approach has been used by some researches. Hind (1982) and Anderson and Hind (1988) used EM algorithm for fitting the finite mixture distribution used in (9). Aitkin (1999) has used EM algorithm in GLIM4 software to estimate a two level model using GQ and NPML approaches. A general method for estimating a multilevel model using either GQ or NPML is to consider (9) as a multivariate function of parameters. Then maximize this function by using an appropriate programing software as NAG, Fortran library or Fortran power station. Longford (1988 (b)) has proposed an approximation to the likelihood function (6).

Longford's Approximation Approach
The approximation relies on a second order Taylor  For more details about the expansion of the likelihood function in equation (6) one may see Rodriguez and Goldman (1995). Goldstein (1991) has proposed an alternative approach to the estimation of nonlinear multilevel models, including the logistic model with random coefficients.

Goldstein's Generalized Least Square Approach
His method needs a single integration instead of iterating to convergence. In a special case if there are no random effects the proposed procedure is equivalent to the standard algorithm, iteratively reweighted least square used by McCullagh and Nelder (1989) to fit generalized linear models. This approach has used first-order Taylor series expansion and the same problems that mentioned in 3.3 may arise here. However, this approximation has been implemented in the software package ML3 and currently in MLwiN. For more details see Rodriguez and Goldman (1995).

Markov Chain Monte Carlo Approach
As explained in sections 3.1 to 3.4 the computationally burden has limited the analysis of multilevel data in several ways. First , investigators have largely restricted their attention to random intercept models to avoid higher dimensional numerical integration . Second , specialized software is required and is typically optimized for a particular random effects distribution. For example the Gaussian is used in VARCL explained in 3.3.
MCMC techniques is being increasingly used as an approach for dealing with the problems for which there is no exact analytic solution, and for which standard approximation technique, have difficulties. The basic philosophy behind MCMC is to take a Bayesian approach and carry out the necessary numerical integrations using simulation, Gelfand and Smith (1990); Smith and Roberts(1993). Instead of calculating exact or approximate estimates, this computer-intensive technique generates a stream of simulated values for each quantity of interest. The conditional independence assumptions , common in data analysis , mean that the full distribution of all quantities has a simple factorization in terms of conditional distribution of each node given its parents. Thus we only need to provide the parent-child distributions in order to fully specify the model. Bayesian inference Using Gibbs sampling (BUGS) program (Spiegelhalter, Thomas, Best, and Gilks, 1996) carries out Bayesian inference on statistical problems using a simulation technique known as Gibbs sampling. We can use BUGS to analyze multilevel data which uses MCMC techniques.
In multilevel models we assume specific parametric priors for the random effects and non-informative priors with extremely small precision for the structural parameters and the precisions of the random effects. For a three level binary data the logit of µ ijk , the mean response for the ijk th individual, is where u i and u ij have informative priors with non-informative priors for their precisions as their parents. β 1 , β 2 , β 3 have no parents and so have non-informative priors. In order to use Gibbs sampling we need to successively sample from the distribution of each node given all the others in equation 10. These are known as full conditional distributions. Geman and Geman (1984) have shown that under mild conditions this process eventually provides samples from the joint posterior distribution of the unknown quantities. For model 10 we start with initial values 3 , σ 1 , σ 2 , draw β 1 from the distribution of β 1 |β 3 , σ 1 , σ 1 , β 3 , σ 1 , σ After a large number of iterations we consider the last sample as the initial val-ues for the parameters and continue the sampling for another large number of sampling. The average of the observations, for each parameter, then will be the estimate of the parameter. It should be mentioned that BUGS software only provides a simple method to solve the high dimensionality problem of the integrals.
So the sensitivity of the parameter estimation to the choice of random effects distribution mentioned in section 3.2 is still a serious problem. But since our comparison of methods of estimation is based on using the true distributions of the random effects that generate the data, this problem does not arise. So we can focus on the differences of the approaches from other points of view.

Simulation Study
In empirical study, since the true value of the parameters are not known we can never be certain if the results of empirical work are accurate and so we may have misleading comparisons of underlying approaches. In simulation studies we know the true value of the parameter and so we can compare the approaches to see which one is more accurate in estimating the parameters of the model. On the other hand in simulation studies we are not certain if the simulation results are relevant in practice. But if the simulation's structure is built related to the real data structure then we can rely on the simulation's results. For comparisons of estimation procedures we followed the simulation's structure proposed by Rodriguez and Goldman (1995). They have simulated data sets using the same hierarchial structure as one of the Guatemalan data sets analyzed by Pebley and Goldman (1992). We have used their rectangular structure design in order to compare the estimation procedures mentioned in section 3 with their results. i.e. σ 2 1 = σ 2 2 = 1.

Simulation of Data
The results from Rodriguez and Goldman (1995), reported in table 1, show large significant biases for all the parameters. When they used VARCL software, except the fixed effect at third level, all the other estimates are significantly biased. The estimates of the parameters β 1 , β 2 , β 3 , σ 1 , σ 2 are 24. 4, 22.5, 9.4, 19.9, 25 To apply the NPML approach we have used the subroutine LCONF from Fortran Power Station 4.0 software to maximize the likelihood function in equation 9. The subroutine LCONF minimize a general objective function subject to linear equality/inequality constraints (See Appendix D for codes). Table 1 shows that the results from the performance of the NPML approach are better than the results from the GQ approach in estimating the standard errors of the random effects. With 3 mass points β 1 , β 2 are 0.3 and 2.8 percent downward bias and β 3 , σ 1 , σ 2 are 24.4, 35.0, 24.3 percent upward bias respectively. With 2 mass points β 1 , β 2 , β 3 , σ 2 are 2.5, 11.6, 24.9, 20.7 percent downward bias and σ 1 is 2.9 percent upward bias respectively. The increment of about 88 in deviance (from 7286.49 to 7375.01) with the reduction of 4 in degree of freedom confirm to report the results for 3 mass points in table 1. Non of these biases are significant. The average time for each run was 1 and 4 minutes, using a 500 pentium, for 2 and 3 mass points respectively.
To apply the MCMC approach using Gibbs sampling we have used BUGS software (See Appendix E for codes). It is assumed that the prior distributions of u i and u ij to be normal with means 0 and standard errors σ 1 and σ 2 respectively. β 1 , β 2 , β 3 have non-informative normal prior with mean 0 and standard error 1000, σ 1 and σ 2 have non-informative gamma prior with mean 1 and variance 1000. In order to get over the influence of the initial values we have performed 500 iterations of the Gibbs sampler and then have updated another 1000 iterations to estimate the parameters. Table 1 shows that this approach performs excellent with at most 2.8% bias for the fixed effect at the second level. The standard deviation of estimates are small and none of the biases are statistically significant. Table 1 shows that the MCMC approach results in very small MSE. The average time for each run was 5 minutes by using a 500 pentium.
From the point of bias the least biases are obtained by using BUGS program that uses MCMC method using Gibbs sampling. The NPML approach performs better than VARCL and ML3. The GQ results better estimation for β 2 , β 3 than VARCL, ML3, and NPML. The results from BUGS, NPML, and GQ are not significantly biased while only the estimation of β 3 produced by VARCL is not significantly biased. The values of MSE reported in Table 1 show that the efficiency of the NPML is more than VARCL in estimating β 1 , β 2 but this is vice versus in estimating β 3 , σ 1 , σ 2 . Both NPML and VARCL are more efficient than GQ. The MSE obtained from using BUGS is much less than those obtained from using other approaches and consequently this approach estimates the parameters more efficient than other approaches. Since the standard deviations of estimates from ML3 have not been reported by Rodriguez and Goldman (1995) we can not compare the efficiency of their estimates with the other approaches. All runs in applying NPML and BUGS perform without any problem but, as mentioned, in performing GQ many runs results poor estimates for the standard error of the random effect in the third level. The GQ approach is used in GLIM4 (Aitkin, 1999) and SABR (Barry, Francis, and Davies, 1989) software to evaluate a one dimensional integral. Our simulations show that although the GQ approach may performs well in the problems containing one dimensional integral but care should be done in using this approach in fitting multilevel data. Table 2 shows that when the variances of the random effects are small, i.e.
σ 2 1 = σ 2 2 = 0.16, non of the estimates are significantly biased. Using GQ or NPML results in large absolute biases for the standard errors of the random effects. But since the standard deviations of the estimates are also large these biases are not statistically significant. VARCL and BUGS perform almost the same but with less biases using BUGS. For the GQ approach, except the plot for the second level fixed effect, the plots show some lack of symmetry. For the MCMC approach we have a bit deviation from asymptotic normality for the fixed effect of the first level. Our conclusion is that, non of the NPML and MCMC approaches result in significant deviation from asymptotic normality.
To complete the comparisons of approaches two points should be mentioned.
Firstly, our main aim is to estimate the explanatory variables effects. The estimation of the variances of the random effects is for identifying the levels of correlations explained in section 1. So comparisons of approaches with respect to the explanatory variables effects is more important than comparisons with respect to the variances of the random effects. The above comparisons show that if the analyst know the distributions of the random effects the MCMC approach is the best method. While, if there is no prior information about the distribution of the random effects the NPML approach is recommended. This approach is also recommended if the analyst believes in the results of Heckman and Singer (1984) and Davies (1987).  Gilks et al. 1996). Although this investigation recommend the use of MCMC approach but still it is recommended that more than one approach is tried; if similar results are obtained then more confidence can be placed in the estimates. If the results from approaches are different then a simulation study based on the data is recommended before choosing any software. Further research is needed to clearly indicate that how the performance of the various methods is affected by the number of levels considered and which factors play a significant role. Another simulation study may be proposed to investigate the effect of sample size, variance of random effects, type of explanatory variable, number of levels, and other factors on the performance of the various methods.

Conclusions
In this paper we reviewed multilevel models and introduced some estimation procedures that may be applied to fit these models. Some of them have been used by other researchers but no comparisons have been made. We have compared these approaches through simulation study for binary responses. The structure of simulations are the same as rectangular structure proposes by Rodriguez and Goldman (1995). We showed that the substantial significant biases coming from VARCL and ML3, reported by Rodriguez and Goldman (1995), can be vanished by applying the MCMC method using Gibbs sampling. The efficiency of the MCMC approach is considerably high and recommended if we have prior information about the distributions of the random effects. If there is no prior information, then parameters estimates may be sensible to the choice of the mixing distributions (Heckman and Singer, 1984). In this case the NPML approach is recommended and our simulation study shows that the nonparametric approach performs better than VARCL and ML3. Although our simulation study was for binary data but both the NPML and MCMC approaches are easily applicable to other types of multilevel data.

Acknowledgments
The author is grateful to anonymous referees for their helpful comments and suggestions. Third level fixed effect(GQ)  The following three nested loops are used to generate a random effect and a fixed ! effect at level 3, a random effect and a fixed effect at level 2, and a fixed ! effect and a binomial response at level 1. Generating Binomial random number in first level for using as response.