GEEQBOX : A MATLAB Toolbox for Generalized Estimating Equations and Quasi-Least Squares

The GEEQBOX toolbox analyzes correlated data via the method of generalized estimating equations (GEE) and quasi-least squares (QLS), an approach based on GEE that overcomes some limitations of GEE that have been noted in the literature. GEEQBOX is currently able to handle correlated data that follows a normal, Bernoulli or Poisson distribution, and that is assumed to have an AR(1), Markov, tri-diagonal, equicorrelated, unstructured or working independence correlation structure. This toolbox is for use with MATLAB .


Introduction
The method of generalized estimating equations (GEE, Liang and Zeger 1986) is widely used because it allows for straight-forward analysis of correlated outcomes that can be discrete or continuous. GEE relies on the specification of the correlation structure, which results in some limitations. For example, Crowder (1995) used simple examples to demonstrate that if the pattern in the correlations is misspecified, there may be no solution (asymptotically) to the GEE moment-based estimating equation for the correlation parameter. In practice, this can result in failure to converge in a GEE analysis. Another limitation is that relatively few correlation structures have been implemented in the major statistical software packages that implement GEE. Although a simple structure is often reasonable to describe the expected pattern of associations, expansion of GEE for implementation of more complex structures could be beneficial when the correlations are of scientific interest, or a particular pattern is biologically plausible.
The method of quasi-least squares (QLS) is a two-stage approach for estimation of the correlation parameter in the framework of GEE that overcomes some of the limitations that were just described; see Chaganty (1997) for a description of stage one of QLS for data with an equal number of observations per subject (balanced data), Shults (1996) and Shults and Chaganty (1998) for stage one for unbalanced data, and Chaganty and Shults (1999) for stage two of QLS. First, QLS can sometimes yield meaningful results when GEE fails to converge, or when the estimated correlation matrix is not positive definite for GEE. For example, Shults et al. (2007) demonstrated that application of a simple (tri-diagonal) correlation structure in analysis of data from a study of obesity in children with renal disease resulted in a non-positive definite estimated correlation matrix for GEE; in contrast, the estimated correlation matrix for QLS (implemented in Stata, StataCorp. 2003) was positive definite. Next, QLS allows for relatively straightforward implementation of patterned correlation structures. For example, see Shults and Morrow (2002), Shults et al. (2004), and Shults et al. (2006) for studies whose analysis benefited from QLS with structures that previously had not been implemented in the framework of GEE.
In this manuscript, we consider a study that was described in Nunez- Anton and Woodworth (1994) and Chaganty and Shults (1999). In this trial, profoundly deaf subjects were surgically implanted with one of two types of hearing aids. Tests designed to measure hearing ability were then administered to the patients at 1, 9, 18 and 30 months post-implant. Because measurements from this trial were not equally spaced in time, it would be reasonable to consider application of a correlation structure that depends on the actual temporal spacing of measurements, in addition to the usual simple patterns for the correlations that are applied in a GEE analysis. For this reason, we demonstrate implementation via QLS of the Markov structure that was described in Naik and Prabhala (2002); the Markov structure assumes that the correlation between measurements depends on the temporal spacing of measurements and declines with increasing separation in time, which are both reasonable assumptions for data from this study.
This article presents the GEEQBOX toolbox for analysis of correlated data with GEE and QLS using the mathematical software MATLAB (The MathWorks, Inc. 2007). The toolbox currently allows for: three possible data distributions, six assumed correlation structures, estimation by either GEE or QLS.
This article does not replace the user guide or statistical documentation of QLS that the reader can find on the internet at http://www.cceb.upenn.edu/~sratclif/QLSproject. html. Rather, it provides an overview of the features of the GEEQBOX toolbox and some examples of its use.
The paper is organized as follows. Section 2 provides some notation and a brief description of the methods of GEE and QLS. Section 3 describes the technical features of the GEEQBOX toolbox. An example data analysis is shown in Section 4 to demonstrate the implementation of GEEQBOX. Discussion, including conclusions and plans for continued expansion of the toolbox, are then provided in Section 5.

A brief description of QLS and GEE
This section provides some notation; a description of the correlation structures that are implemented in GEEQBOX; and a summary of QLS and GEE. For more detail regarding QLS and GEE, please see the references that were provided in the introduction. In addition, see Hardin and Hilbe (2003) for an excellent and comprehensive text on GEE.

Notation
For analysis of a longitudinal study, we assume that measurements Y i = (y i1 , · · · , y in i ) and associated covariates x ij = (x ij1 , · · · , x ijp ) were collected on subject i at times T i = (t i1 , · · · , t in i ) , for i = 1, · · · , m. The data are considered balanced and equally spaced when n i = n ∀ i and |t ij − t ij−1 | = γ ∀ i, and j, respectively. For analysis of a cross-sectional study, e.g., if one measurement is collected on each of several subjects within multiple clusters, then Y i = (y i1 , · · · , y in i ) represents the n i measurements that were collected within cluster i.
The expected value and variance of measurement y ij on subject (or cluster) i are assumed to equal E(y ij ) = g −1 (x ij β) = u ij and Var(y ij ) = φh(u ij ), respectively, where φ is a known or unknown scale parameter. We also let U i (β) represent the n i × 1 vector of expected values u ij on subject i. For longitudinal and cross-sectional studies, observations are assumed to be independent if they are measured on different subjects or clusters, respectively. However, within subjects or clusters, they are assumed to be correlated, with a pattern of association that can be described by a working correlation structure. The working structure for subject (or cluster) i, denoted by Corr(Y i ) = R i (α), depends on a correlation parameter α that can be scalar or vector-valued. The covariance matrix of Y i is then given by Cov Both GEE and QLS are iterative approaches that alternate between (1) updating the estimate of the regression parameter β by solving the GEE estimating equation for β and (2) updating the estimate of the correlation parameter α via moment estimation (GEE) or solving an unbiased estimating equation for α in two stages (QLS).

Working correlation structures
GEEQBOX currently implements the following structures, with plans to implement additional structures that will be made available on the web.
Equicorrelated: This structure assumes that all pairwise correlations within a cluster are equal, so that Corr(y ij , y ik ) = α. This structure is plausible for cross-sectional studies, e.g., to describe the pattern of association of weights among litter-mates of baby rats.
First-order autoregressive AR(1): This structure assumes that the correlation among repeated measurements on a subject depends on their separation in order of measurement, so that Corr(y ij , y ik ) = α j−k . This structure is plausible for longitudinal studies in which the collection times of measurements are equally spaced in time, e.g., in a weight loss intervention that measures weights on subjects at baseline and then at three and six months post-baseline.
Markov: This structure assumes that the correlation among repeated measurements on a subject depends on their timing of measurement, so that Corr(y ij , y ik ) = α |t ij −t ik | . This structure generalizes the AR(1) structure to allow for unequal spacing of measurements. The estimate for α will be within the interval (−1, 1). However, as for the AR(1) structure, a negative value for α is typically not biologically plausible. GEE-QBOX therefore uses QLS to obtain an estimate of α ∈ (0, 1). We note that GEEQBOX does not implement the Markov structure for GEE because it is not straightforward to obtain a moment estimate for this structure.
Tri-diagonal: This structure assumes that the correlation among measurements on a subject is constant for measurements that are separated by one measurement occasion, so that Corr(y ij , y ik ) = α for |j − k| = 1 and is zero otherwise. The authors are not aware of many practical applications for this structure, but it was implemented in Liang and Zeger (1986) and in most standard software packages that implement GEE.
Unstructured: This structure does not assume any pattern for the intra-subject correlations, so that Corr(Y ij , Y ik ) = α jk . This structure has been implemented in QLS (Chaganty 1997;Chaganty and Shults 1999) but the algorithms are somewhat complex. GEEQBOX therefore implements a moment estimate using GEE.
Working Independent: Another popular structure is the identity matrix. Implementation of this structure is straightforward because β can then be estimated in a noniterative process. However, several authors have shown that incorrect application of the working independence structure can result in a serious loss in efficiency in estimation of β (e.g., Sutradhar and Das 2000;Wang and Carey 2004;Shults et al. 2006)

GEE estimates of the correlation parameter
For GEE, GEEQBOX implements the following moment estimates that are implemented in PROC GENMOD in SAS (SAS Institute Inc. 2003). For the equicorrelated structure, the GEE moment estimate is given by: is the Pearson residual for subject i at time t ij and p is the dimension of β. For the AR(1) and tri-diagonal estimates, the GEE moment estimate is: . For the unstructured correlation matrix, GEEQBOX implements the following moment estimate for element j, k of the matrix: A moment based estimator has not been proposed in the literature for implementation of the more general Markov correlation for GEE, which provides motivation for implementation of QLS.

QLS estimates of the correlation parameter
While GEE typically uses moment estimates for α, QLS estimates α by obtaining a solution to an unbiased estimating equation in two stages (see Sun et al. 2006, for more details). In stage one, QLS alternates between updating the estimates of β and solving the stage one estimating equation for α until convergence.
The solutionα to (1) is not consistent. Stage two of QLS therefore obtains a consistent estimateα QLS as the solution to the stage two estimating equation for α.
The final QLS estimateβ QLS of β is then obtained by solving the GEE estimating equation for β evaluated atα QLS . For estimating equations that do not have a unique solution, GEEQBOX uses the bisection method to obtain a solution in the feasible region for α.
For the AR(1) structure and for unbalanced data, Shults and Chaganty (1998) proved that the feasible stage one estimateα can be expressed as: while the stage two estimateα QLS−AR1 (Chaganty and Shults 1999) is given bŷ For the Markov structure and unbalanced data, Shults and Chaganty (1998) provided the QLS stage one estimating equation for α: where e ij = |t ij − t i,j−1 |. Note that GEEQBOX requires that e ij ≥ 1 ∀ i and j. The stage two estimating equation for the Markov structure (Chaganty and Shults 1999) is given by: For the equicorrelated structure and for unbalanced data, Shults (1996) proved that there will be a unique feasible solution to the following stage one estimating equation for α: where I n i is the identity matrix and e i is a n i × 1 column vector of ones. Shults and Morrow (2002) obtained the stage two estimateα QLS−EQC : For the tri-diagonal structure and unbalanced data, GEEQBOX obtains solutions to the stage one and two estimating equations (1) and (2) for the tri-diagonal structure by first constructing the tri-diagonal matrix R i (α). Next, to evaluate GEEQBOX implements the following expression: ∂δ is an n i × n i matrix with ones on the off-diagonal and zero elsewhere, i.e., the (j, k) th element of ∂R i (δ) ∂δ is 1 if |j − k| = 1 and is 0 otherwise.

Testing hypotheses involving the regression parameter
The asymptotic distribution of the QLS estimateβ QLS is the same as the asymptotic distribution of the GEE estimateβ GEE . GEEQBOX therefore provides both model-based and sandwich-based estimates of the covariance matrix ofβ (Liang and Zeger 1986). The covariance matrix depends on the scalar parameter φ; GEEQBOX implements the estimate provided in Chaganty and Shults (1999). The model-based estimate of the covariance matrix is appropriate when the user has a high degree of confidence that the correlation structure has been correctly specified. It has the following form: The robust sandwich covariance matrix has the following form: GEEQBOX provides estimated standard errors, 95% confidence intervals, and p values for the tests β j = 0 that are based on both the model and sandwich covariance matrices.

Some technical features of the GEEQBOX toolbox
The development of the toolbox began in 2005, and is for use with MATLAB. It consists of two main functions, both of which can be called like any standard function in the MATLAB environment.
gee function: calculates estimates using GEE.
Both functions require the same inputs and produce the same layout of results. The required inputs for both functions are an N × 1 vector of repeated measures outcomes, plus corresponding vectors of subject id's, measurement times, and a matrix of fixed effects.

Data representation
Both main functions require the same inputs: id, y, t, and X, in that order. For example, using the gee function the command would be gee(id, y, t, X).
Each row of X should contain the observation or covariates associated with a single time point t ij . The vectors id and y should contain the associated unique numerical identifier for the subject and outcome, respectively. Thus, these four inputs should have N (total number of observations across all subjects) rows.
In addition, the measurements must be sorted so that all measurements from the same subject (id) are listed on consecutive rows. If id=[1 1 2 2 2 1 1]', then the program would count this as 3 subjects since there are 3 changes in id numbers. However, the id's do not have to be consecutive numbers. For example, id=[12 12 12 10 10 10 99 99]' would produce the same results as id=[1 1 1 2 2 2 3 3]'.
The matrix of covariates, X, should be set-up so that each column contains a separate covariate. At present, there should be no missing data in X. A constant term is not included by default in the programs. Thus, in order to include a constant in the model, a column of ones must be included as a covariate in X. This column of ones should be the final column of X. The programs will default to calling the beta estimate by the associated column number of X.
The functions also have a number of optional inputs to control the assumed distribution of the data and correction structure, as well as naming the fixed effects in the output and controlling the convergence tolerance and maximum number of iterations. The distribution can be specified by a single number of letter in the family input variable. The default distribution is a normal (n or 1) distribution, but Bernoulli (b or 2) or Poisson (p or 3) may also be specified. The correlation structure is specified in the corr variable. The default correlation structure is AR(1) (ar1 or 1) for gee and Markov (markov or 2) for qls. Any of the other correlation structures described in Section 2.2 can also be optionally chosen, and avilable options are listed in the help file for each function. Thus to specify a Poisson distribution with an equicorrelated (equi) correlation structure, we would use the command gee(id, y, t, X, 'p', 'equi').
The default output display uses column numbers to label the variables in X. The third optional input varnames can be used to overwrite these display names. The varnames variable is structured variable with each item being a string variable. For example, the three columns of X could be labeled A, B, and C with the commands: > varnames = $\{$'A', 'B', 'C'$\}$; > gee(id, y, t, X, 'p', 'equi', varnames);

Output
Each function produces the same printed results and output variables. The printed results consist of the initial values used by the algorithms, the estimated covariance parameter (α), scale parameter (φ), and the covariate parameter estimates (β). In addition, the standard errors, corresponding z-values, p values and 95% confidence intervals for each β j are also produced. Two versions of these values are presented; the one based on the robust covariance matrix and the one based on the model-based covariance matrix. The model-based results should be used when the specified working correlation matrix is known to be correct; otherwise, the robust results should be used.
Three variables are produced as outputs to the functions. These are the estimated β's, α, and a structured variable that contains the entire printed results from the robust estimations in the cell variable results.robust and the model-based estimation in the cell variable results.model.

Example
Here we present results obtained using GEEQBOX applied to data provided in Table 3 of Nunez-Anton and Woodworth (1994). This data set contains the following variables: subject id; group (A or B); month of measurement; and percentage. The variable percent represents the percent correct scores on a sentence test administered under audition-only conditions to groups of subjects wearing two different cochlear implants, referred to here as A and B. The electrode array was surgically implanted 5 to 6 weeks prior to being electrically connected to the external speech processor. Subjects were profoundly, bilaterally deaf, thus preconnection baseline values for the sentence test were all zero. At the time of the analysis reported here, data were available for 23 subjects in group A and 21 subjects in group B, with measurements scheduled at 1, 9, 18, 30 months after connection." (Nunez-Anton and Woodworth 1994) In the worked examples presented here β = (β 1 , β 2 , β 3 ) where β 1 is the regression coefficient associated with month of measurement; β 2 is the regression coefficient associated with group (group = 0 for A; group = 1 for B); and β 3 is the regression coefficient associated with cons, the constant that takes value one. The worked examples below are for the continuous outcome percentage as well as a binary variable, high, that takes value 1 for values of percentage ≥ 50, and takes value 0 otherwise. We note that our goal is not to present a complete analysis, but rather to demonstrate implementation of our toolbox and to highlight some of its features.
The data is contained in the files audio.dat (ASCII file) or audio.mat (MATLAB data file). The X matrix of interest is set as with associated variable names for displaying the results: For the continuous outcome percent, we can fit the model percent = βX + ε assuming a normal distribution for Y (identity link ) and an equicorrelated correlation structure via QLS or GEE using the respective commands: > [betahat, alphahat, results] = qls(id, percent, month, X, 'n', 'equi', varnames); > [betahat, alphahat, results] = gee(id, percent, month, X, 'n', 'equi', varnames); The resulting estimates of β with 95% confidence-intervals (low and up lim) are given in Table 1.
To demonstrate the sensitivity of results to choice of working correlation structure, we next fit the Markov correlation structure that is appropriate when the correlation between measurements declines with increasing separation in time. As discussed in the introduction, this structure is not readily applicable for GEE. We therefore implement the Markov structure using QLS via the following command: > [betahat,alphahat,results] = qls(id, percent, month, X, 'n', 'markov', varnames); The resulting estimates of β with 95% confidence-intervals (low and up lim) are given in Table 2.
We note that in this particular example, the results are not sensitive to the choice of equicorrelated versus Markov structure.
For the binary outcome, high (= 1 if percentage ≥ 50; = 0 otherwise) is modeled using an equicorrelated correlation structure and a Bernoulli link function. The associated commands for obtaining the QLS and GEE estimates are:   Table 3: Estimates of β and associated 95% confidence-intervals (low and up lim) for the high outcome assuming an equicorrelatd correlation structure.

Discussion
The GEEQBOX MATLAB toolbox can be used to analyze correlated data via either the method of generalized estimating equations (GEE) or quasi-least squares (QLS). As demonstrated in Section 4, QLS allowed for implementation of the Markov structure, with similar results for the Markov and equicorrelated structures. If the results had differed, findings based on application of the Markov structure might be preferred because an assumption of equal correlations for all measurements within a subject (which is required for application of the equicorrelated structure) is very strong and might not be reasonable for test result data, i.e., we might anticipate that test results from two examinations that occur closely together in time will be more similar, and therefore more highly correlated, than examinations that are taken further apart in time. In general, implementation of QLS will allow for consideration of correlation structures, including the Markov, that previously have not been available for GEE. Careful comparison of analysis results between several structures might strengthen confidence in a strong finding, e.g., if a results persists across several structures, or might result in the need to choose the most plausible structure, e.g., if the results are not consistent across structures.
Future and ongoing work of these authors will include implementation of correlation structures that have not previously been implemented in the framework of GEE. Related research will involve the development and comparison of methods for choosing between several correlation structures, which will be especially important for studies in which the findings differ according to choice of correlation structure. Future versions of the toolbox will be made available via the web site. These will also incorporate new correlation structures, add utility functions for manipulating and displaying the clustered data, methods for analyzing nested correlation structures and a window based "point-and-click" way of running the functions.