A Graphical User Interface to Generalized Linear Models in matlab

Generalized linear models unite a wide variety of statistical models in a common theoretical frame-work. This paper discusses glmlab |software that enables such models to be (cid:12)tted in the popular mathematical package matlab . It provides a graphical user interface to the powerful matlab computational engine to produce a program that is easy to use but with many features, including o(cid:11)sets, prior weights and user-de(cid:12)ned distributions and link functions. matlab ’s graphical capacities are also utilized in providing a number of simple residual diagnostic plots.


Introduction
Generalized linear models (or glm's) were introduced by Nelder and Wedderburn 8] in 1972 as a means of unifying a d i v erse range of statistical models, including multiple regression, log-linear analysis and probit analysis. The subsequent release of the Glim software (Francis et. al. 4]) enabled the theory to be applied easily to practical problems. Since that time, other popular statistical software packages, including S-Plus (Statistical Sciences 10]) and SPSS (Norusis 9]), have incorporated glm's. This paper explores the use of matlab (The MathWorks Inc. 5]) to implement glm's using a graphical interface in the program glmlab. The motivation for the development of the program comes from the need to teach a short course in statistical models, with a section on glm's, to a w i d e cross-section of students: di erent students may be familiar with di erent p a c kages, and perhaps some students not familiar with any statistical package at all. All have a rm grounding in matlab, h o wever. There are many advantages with this approach: the rich variety of mathematical and graphical functions of matlab are readily available there is no need to purchase or learn a specialist statistics package (such a s Glim) the same code can be used for all the di erent matlab platforms the graphical environment makes glm's quickly available to students in a short course. The software is not intended to take the place of commercial packages, but rather to bring the world of glm's to users of matlab.
Section 2 provides a theoretical background to glm's. Section 3 discusses some of the methods used in the software. The program itself is discussed in Section 4, and two examples are discussed in Section 5. Section 6 brie y concludes with a short discussion.

Background Theory
Generalized linear models are based around distributions in the exponential family, s u c h t h a t f Y (y i i ) = a(y i ) exp fy i i ; ( i )g= ] (1) for known functions a( ) a n d ( ). In such models, i = 0 ( i ) and Var(y i ) = V ( i ), where V ( i ) = 00 ( i ) is known as the variance function and is the dispersion parameter. The linear predictor, usually referred to as , i s g i v en by = X T .
The mean parameter, , is related to the linear predictor, through a monotonic, di erentiable link function so that g i ( i ) = i . (Usually, the link function is the same for all points i = 1 : : : n .) The link function can be chosen independently of the distribution, but a popular link function (called the canonical link function) is found by setting i = i . The standard multiple linear regression technique sets the link function at = . Distributions of the form (1) include the normal (Gaussian), inverse Gaussian, gamma, Poisson and binomial distributions.
The deviance of a model (a measure of the distance between y and ) i s g i v en by where d(y i i ) is the unit deviance de ned as d i (y i i ) = 2 Z yi i y i ; u V (u) du and w i are prior weights. The X matrix can contain quantitative variables and also qualitative variables (also called categorical variables, or factors). Each level of the factor is identi ed with a variable. However, there are usually dependencies between the variables. This overparameterization introduces a singular design matrix which must be removed prior to the tting of any model. There are many di erent methods available for doing so, including Helmert contrasts (as favoured by S-Plus), orthogonal polynomials, sum-to-zero constraints and corner-point parameterizations (as in Glim). All are di erent methods of introducing qualitative v ariables by h a ving k ; 1 independent v ariables for a variables with k levels.
The theory of glm's has been documented by Nelder and Wedderburn 8], McCullagh and Nelder 6] and more brie y by Dobson 2] among others.

Methods
The algorithms for tting glm's to data are well established and robust (see McCullagh and Nelder 6] and Nelder and Wedderburn 8]). The maximum likelihood estimates of can be found using iterative least squares: set 0 = X T 0 to be the initial value of the linear predictor, and 0 the corresponding value of the tted value (recall that = g( )). The link function is linearized so that z 0 =^ 0 + ( y ;^ 0 ) d d 0 and is then regressed onto the covariates X with quadratic weights de ned as to produce a new estimate of . The algorithm is repeated until convergence.
Starting values for the algorithm are easily obtained from the data. Setting 0 = y,^ 0 is then obtained from the link function. In some cases, the method needs some re ning (for example, to avoid calculating log 0 when tting a log-linear model with zero counts). The program itself|called glmlab|consists of numerous m-les (matlab functions and scripts) for tting generalized linear models. Large portions of the code involve setting up the graphical interface, parsing the input strings and subsequent c hecks of the inputs, and executing menu commands. The example pieces of code included in the Appendices concern the actual tting of the model (the code probably of most interest to readers of this paper). The actual tting algorithm is implemented in the le irls.m as shown in Appendix A.
Each distribution used in the program has an associated le that contains the information relevant t o glmlab namely the variance function and the deviance (see Appendix C for an example of the gamma distribution). Similarly, each link function has an associated le containing information about the link namely nding given nding given and nding d =d given . An example using the probit link function is given in Appendix D. This idea of placing relevant information about distribution and links in les enables the user to create new distributions and link functions with a minimum of knowledge of matlab programming.
Parameters such as the maximum number of iterations, the parameter accuracy and the ill-conditioning tolerance, can be altered from one of the drop-down menus on the main screen.
The program produces an output le called DETAILS.m that contains information about the variables tted, and deviance from each model see Section 5. Screen output gives information such as parameter estimates and residual deviance. This information is easily captured using the matlab diary command. There are many functions written for the analysis of particular statistical problems in matlab, including a subset of problems in the generalized linear models framework. However, even with the Statistics Toolbox, matlab lacks a procedure for analysing the full range of generalized linear models.
This section discusses the program glmlab that has been written to allow matlab to analyse such situations. glmlab uses a graphical interface and so is easy to learn and use. Among other features, it allows the user to add distributions and link functions that are not included, and save and load work between sessions. It does not require the user to have access to the Statistics Toolbox.
Some aspects of the program are listed below.

Data Entry Windows
There are four areas in the main screen (see Figure 1) for the ent r y o f d a t a o r v ariable names: Response (y): The response (or y) v ariable is entered here (but see Sections 4.6 and 5.2 also).
Covariates (X): T h e c o variates, or X-matrix, is entered here. The constant (or intercept) is automatically tted, but this can be altered by the user in the Options menu. Prior W eights: Any prior weights can be entered here (for example, in a weighted regression, to omit structural zeros, or optionally with the binomial distribution as in Section 5.2). O set: Any o set variables are entered here. An o set is a variable with a known coe cient (see Section 5.1). Valid matlab workspace variables can be entered, as we l l a s m o s t v alid matlab commands that produce matrices of the correct type and size. For example, the user could enter magic(4) as the covariates, and 1.0, 3.2, -1.2, 0.5]' as the response. Some commands, especially if complex, will not work. Using vector variables de ned in the matlab workspace is recommended.

Menu Items
There are eight menu items in the main glmlab window (see Figure 1): File: The le menu opens data les (*.mat les) loads and saves models (*.glg les) exits from glmlab quits matlab.  Distributions: In the distributions menu, the user selects the distribution of choice. There are ve built-in distributions (normal, inverse Gaussian, gamma, Poisson, binomial). Users can also add their own distributions which will appear in the menu. Distributions have default link functions and scale parameters as shown in Table 1. Link: In the link menu, the user selects the link function of choice. There are eight built-in link functions (identity, log, square root, power, reciprocal, complementary log-log, probit and logit (or logistic)). Users can also add their own links which will appear in the menu. For each distribution, the default link function is the canonical link (see Table 1). Scale Parameter: The scale parameter can be set to a xed (positive) value, or to be estimated by the mean deviance.

Buttons
There are seven buttons in the main glmlab window (see Figure 1) three of these along the bottom are of the most interest.

Commands
There are only a few commands that need to be learnt t o u s e glmlab: glmlab: Once in matlab, the user starts the program by t yping glmlab at the matlab prompt. fac: The fac command is used to ag a variable as qualitative (see Section 5.1). fac uses a cornerpoint parameterization (as in Glim) that includes each level of the factor as a dummy v ariable, and excludes the rst column to preserve full-rank.

Binomial Responses
Binomial responses variables require some special handling. Three link function are unique to the binomial distribution (and are unavailable otherwise from the Link menu): the logit (or logistic), probit and complementary log-log link functions. In addition, the response variable must re ect the binomial situation of counts and sample size. In this situation, response variable consists of two columns: the rst for the counts, and the second for the sample sizes. (Note that this is di erent t h a n t h e c o n vention adopted in S-Plus, where the response has the two columns as the number of successes and the number of failures.) When the data to be analysed is in the form of probabilities, only one column is needed. See Section 5.2 for an example using the binomial distribution.  The data contains many i n teresting features that serve to demonstrate features of the glmlab software. For example, there are many structural zeros, since (for example) ships made after 1975 cannot have periods of operation before 1975. There is also an observational zero in the data. The authors consider an initial model of the form log(expected number of damage incidents) = 0 + log(aggregate months of service) + (e ect due to ship type) + (e ect due to year of construction) + (e ect due to service period): The rst term after the intercept is a quantitative v ariable with a known coe cient o f 1 . Such a v ariable is known as an o set. As is usual with count data, the authors decide to use the Poisson distribution with the logarithm link function. They expect overdispersion due to expected inter-ship variability and so we over-ride the default xed scale parameter option in glmlab and estimate with the mean dispersion. To t the model discussed in McCullagh and Nelder 6], variables are entered as shown in Figure 2. Note that the o set variable is entered as log(Service) this shows that glmlab can happily accept transformations being entered as variable names. log(Service) has been used as the o set because of the model in Equation 2 proposed by McCullagh and Nelder. The variable Weights is a vector of prior weights that omits the structural zeros (it is zero for the structural zeros, and one elsewhere). The following options are also set: To declare the distribution, choose Poisson from the Distribution menu. The results, naturally, agree with those given in McCullagh and Nelder 6]. Variables with numbers following are qualitative v ariables the numbers refer to the levels of the variable. They are understood to be in reference to the rst level of the variable, the same way i n w h i c h Glim treats qualitative v ariables.
After a model has been tted, the Plots menu becomes available, and residual plots can be generated to informally examine the residuals. For example, if deviance residuals were chosen from the Residual Type menu, the Residuals vs Transformed Fitted Values plot is displayed on the screen see Figure 3.
To include an interaction term between (say) Year of Operation and Year of Construction as a covariate, enter the variables into the main window as shown in Figure 4. Pushing the Fit Speci ed Model button produces the following output: The nal parameter is aliased, in that it contains no information that is not already contained in the other variables.

Example: Binomial Data
Because of the particular nature of the binomial distribution, a simple example is considered here. The data comes from Bliss 1] (cited in Dobson 2]) and is shown in Table 3. The data involves counting the number of beetles killed after ve hours of exposure to various concentrations of gaseous carbon disulphide (CS 2 ).
The analysis concerns estimating the proportion r i =n i of beetles that are killed by the gas.
The variables in matlab were named Dose, Number and Killed for the obvious variables. Dobson analyses the data using a logit link function. The data is entered into glmlab as shown in Figure 5. In particular, take note of the entry for the response variable, which i s e n tered as two columns. After choosing    The results agree with those given in Dobson. An alternative method is to t the model using the probabilities r i =n i as the response (that is, one column of probabilities), and use n i as the prior weights.
The parameter estimates are identical. The residual plotted against the tted values (produced using the Plot, Residual vs Fitted Value option) in Figure 6 shows a possible curvature.

Discussion
The program glmlab has been discussed for using matlab to t the wide class of statistical models known as generalized linear models. The software is not meant to replace commercial packages, but to provide matlab with facilities for dealing with such a class of models. While glmlab may not be as powerful as programs like Glim and S-Plus, it o ers an easy to use introduction to generalized linear models in a user-friendly and powerful environment. It allows many di erent types of models to be tted, including common models such as logit models, ANOVA, and multiple regression. cvd=real( sqrt( covarbeta(ii,ii)+covarbeta(jj,jj)-2*covarbeta(ii,jj) ) ) end if ss(5)<10, %MINUTES time= time,'0',num2str(ss(5)),':'] else time= time,num2str(ss(5)),':'] end if ss(6)<10,%SECONDS time= time,'0',num2str(ss(6)),tag] else time= time,num2str(ss(6)),tag] end