COVAR: Computer Program for Multifactor Relative Risks and Tests of Hypotheses Using a Variance-Covariance Matrix from Linear and Log-Linear Regression

,


INTRODUCTION
Additive and multiplicative models are widely used in epidemiology to obtain maximum likelihood (ML) estimates of relative risks (RR), odds-ratios (OR) and hazard ratios (HR). Numerous computer programs have been developed that generate ML estimates of regression coefficients and variance-covariance matrices [1,2,3,4,5]. Despite this wide diversity of programs, there is a shortage of stand alone algorithms that allow the statistician to generate multifactor relative risks and confidence limits, and conduct tests of hypotheses using results of previous regression analyses.
This report documents a computer program to obtain point estimates and confidence intervals (CI) of RR for single or joint effects using the regression coefficients and a variance-covariance matrix from a previous modeling session.
The program has been developed mainly for Grizzle-Starmer-Koch (GSK) regression models for multinomial categorical data, Cox proportional hazards (PH), logistic and Poisson regression results, but can be used with the results of other models. The critical data required to apply the computer code are: There is only one output format, listing the regression coefficients, their names, variance-covariance matrix, point estimates, corresponding CIs, minimum modified χ 2 tests of hypotheses with one-sided tail probabilities and units applied to each coefficient. The program and all of its subroutines are written in FORTRAN-77 and are run on a desktop PC with an AMD-K6-166MHz chip. The algorithm may be compiled and run on Unix machines as

NOTATION AND THEORY
Thus far we have discussed the program's necessary input data. In this section we will address the relationship between asymptotic likelihood theory and the ML estimation of the regression coefficients and variance-covariance matrix. According to asymptotic likelihood theory, the score vectors of a regression model are where logL is the log of the likelihood function for the particular model under consideration, that is, the GSK, Cox proportional hazards, logit, Poisson or other model. The Fisher information matrix I(b) of the score vectors is the generalized-inverse of which is the variance-covariance matrix, V b , written as σ(b0, b1) · · · σ(b0, bj) σ(b0, b1) σ 2 (b1) · · · σ(b1, bj) . . . . . . . . . . . . σ 2 (b0, bj) σ(b1, bj) · · · σ 2 (bj) which is a p × p real symmetric matrix whose principal diagonal consists of the parameter variances. The offdiagonals are the covariances between each of the coefficients. Most computer programs print the diagonal and the upper or lower triangular of the variance-covariance matrix, which makes no difference since the matrix is symmetric. When Newton-Raphson iteration is used for convergence, V b is the generalized-inverse of the Fisher information matrix. For iteratively reweighted least squares (IRLS) regression, V b is the generalized-inverse of the weighted dispersion matrix shown as A best-asymptotic-normal (BAN) estimate of the parameter vector b is obtained by adding the solution vector to the previous parameter vector at each iteration In (5) X is the design matrix, W is a diagonal matrix of inverse variance weights and Y is the column vector of residuals. Using b and V b from, for example, a four-variable model including terms for an intercept b0 and three risk factors (b1 ,b2 , and b3), the point estimate of RR for the three factors considered jointly is where b1, b2, and b3 are the regression coefficients and u1, u2 and u3 are the units applied to the coefficients. The RR in (8) is based on reference cell coding where each p-level risk factor is coded with p − 1 independent variables. The Taylor series-based (1 − α) cofidence interval (CI) for the interval estimate can be expressed by Abramowitz and Stegun [6] give a useful method for determining percentage points of the standard deviate, Zα, in (9) for a specified a level in the form where and c0 = 2.515517 d1 = 1.432788 c1 = 0.802853 d2 = 0.189269 c2 = 0.010328 d3 = 0.001308. The variance, V , of the logarithm of the RR (9) takes on the form where σ 2 (bi) is the variance of bi, σij (bi, bj) is the covariance between bi and bj (3) and (u * i − ui) and (u * j − uj) are the change in units applied to the coefficients. A test of hypothesis of no significant effect due to the joint factors of the parameters in (8) is which can be rewritten as Ho : where C is the contrast matrix forming the desired linear combination of the regression coefficients for the threeparameter test. Thus, for the three joint factors in (8), the C matrix is Grizzle, Starmer, and Koch [7] introduced the test statistic of the linear hypothesis Ho : Cb = 0, given by which is χ 2 distributed with degrees of freedom equal to the number of independent rows in C. Matrix multiplication in (16) is first carried out by obtaining the outer products b T C T and Cb. The variance-covariance matrix V b is premultiplied by C to give CV b , which is postmultiplied by the transpose of is obtained by using singular value decomposition, which avoids such problems as nullity (zero elements), singularity and collinearity, which may be inherent in ill-conditioned or "sparse" matrices. Lastly, If the test statistic exceeds some tabled value of χ 2 (d.f.,1−α) , then we conclude that there is a statistically significant joint effect in the presence of the three factors in our example.  Real Work: Cross-product of C and V b matrices, CV b CXC(NCOl,NCOL) Real Work: Cross-product of CV b and C T matrices,

AUXILIARY ALGORITHMS
The algorithms MATMUL, TRNPOS, MATINV, SVDCMP are required. MATMUL is used for matrix multiplication; TRNPOS is used to tranpose matrices, both given in [8]. MATINV performs a check on the singular values returned from the singular value decomposition algorithm SVDCMP based on [9]; the function CUMNOR returns Zα for a given α; the functions ALG and GAMAIN return the χ 2 tail probabilities for a given χ 2 value and d.f.

RESTRICTION
All input parameters are checked for nullity and a fault message is returned if there is an illegal entry.

TIME
A thorough investigation of absolute timing has not been performed; however, it should be noted that execution time is a function of the number of model parameters and factors to be considered.

PRECISION
The algorithm may be converted to double precision by making the following changes: 1. Change REAL to DOUBLE PRECISION in the algorithm.
2. Change the constants to double precision.
3. Change EXP to DEXP and ALOG to DLOG in all applicable routines.

Estimating Relative Risk for Multiple Risk Factors
Let us take, as an example, the results of a unconditional (unmatched) logistic regression analysis of oral contraceptive use (OC), smoking and myocardial infarction (MI) found in [10]. The coefficients are given as where the variable names to the right of the coefficients are the intercept, OC use (coded 0-no, 1-yes), age (continuous variable), smoking 1-24 cigarettes per day (coded 0-no, 1-yes), smoking ≥ 25 cigarettes per day (coded 0-no, 1-yes), interaction term for OC use and smoking 1-24 cigarettes per day and an interaction term for OC use and smoking ≥ 25 cigarettes per day. The variance-covariance matrix V b for our example is The RR for a woman who is an OC user that smokes ≥ 25 cigarettes per day would be determined by substituting in the values of the coefficients and applying a unit of 1.0 to each in (8) If we now substitute into (12) the variances and covariances for the three variables in (18), we obtain V as The 95% CI for the RR is then calculated from (9) as Now that the RR and Taylor series-based CI have been estimated, we can conduct a simultaneous test of hypothesis for no significant effect due to OC use, smoking ≥ 25 cigarettes per day with respect to the RR: For the three factors, the C matrix is By substituting the appropriate values into (16) and performing the matrix manipulation to obtain χ 2 we get The resulting χ 2 test statistic of 131.65 is significantly greater than the tabled χ 2 (3,0.95) value of 7.82, indicating that there is a significant joint effect. Thus we reject the hypothesis (22) that there is no joint effect due to OC use, smoking ≥ 25 cigarettes per day and the interaction between OC use and smoking ≥ 25 cigarettes per day.
After performing estimating relative risk or testing hypotheses, the user is asked to continue with processing. If the response is "no," the program is terminated, otherwise the algorithm recycles and queries the user for input data for the next cycle (run).

Testing Hypotheses for Trends
This section describes how to use COVAR for conducting tests of hypothesis for trends in coefficients. The example is based on multinomial categorical data related to pathological outcome among 2525 thryoid surgical patients. These particular patients resided near the Semipalatinsk Test Site, Kazakstan, where the former Soviet Union conducted nuclear weapons testing from 1949-89. These data are shown in Table 1, which contains 7 columns for outcome (Goiter, Adenoma, Cancer, Hashimoto's thyroiditis, Riedel's thyroiditis, Dequervain's thyroiditis, and Other).
Our goal is to determine whether there has been an increase in the proportion of thyroid cancers identified at surgery throughout the time periods. In the "Cancer" column of Table 1, one can notice the proportion of cancer out of all other outcomes increases over time. Although inferences can be made by visualizing non-zero effects and increasing trends of proportions, we nevertheless must perform hypothesis tests to discredit the null hypothesis of no effect, no interaction, and no trend. First, we made a GSK regression by regressing the ratio of the count for each outcome (column) in each time period by the total row count for each time period. As an example, the first record in the data file for the GSK regression has 324/386 as the dependent variable, followed by 12/386 in the second record, 5/386 in the third record, 37/386 if the fourth, 3/386 in the fifth record, 3/386 in the sixth, and 2/386 in the seventh. The eighth record begins with 212/354 as the dependent variable, and so on. Dummy coding with 0 and 1 was used to specify the level of the outcome, time period, and interaction between cancer and time period. Arranged in matrix notation, the column vector of coefficients is  (31) In the parameter vector, we can see coefficients for 7 outcomes (Goiter, Adenoma, Cancer, Hashimoto's thyroiditis, Riedel's thyroiditis, Dequervain's thyroiditis, and Other), 6 time periods (1966-71, 1972-76, 1977-81, 1982-86, 1987-91, and 1992-94), and 6 interaction terms for Cancer and Period (Cancer * 1966-71, Cancer * 1972-76, Cancer * 1977-81, Cancer * 1982-86, Cancer * 1987-91, Cancer * 1992-94). Using sum-to-zero constraints, we have a total of 19 coefficients. (Sum-to-zero constraints do not use of a constant term nor p − 1 levels to describe each factor. Rather, p terms are used for each factor without a constant). We notice that coefficients for the outcomes (coefficients 1-7), periods (coefficients 8-13), and cancer*period interaction (coefficients 14-19) are non-zero. The increasing proportion of cancer over time is also noticeable by the increasing trend in values of coefficients 14 through 18, but the trend did not continue for the last time period (coefficient 19).
For the purpose of this example, we shall focus on three hypothesis tests. First, we would like to know if there was a "period effect" present in the data, that is, if the apportionment of various thyroid outcomes changed over time. In (31) the period effect is expressed by the 6 coefficients 8-13. Thus, we are conducting a simultaneous test of the 6 coefficients with the hypothesis Since there are 6 coefficients to test and 19 total coefficients, this results in a 6 × 19 C matrix shown as The second test we will perform identifies whether the cancer*period interaction is significant. The interaction terms are coefficients 14-19, so the test notation is The C matrix now becomes  Our last test determines the presence of a statistically significant trend in the cancer*period interaction term. Trend tests using contrast matrices are performed by using equal "spacing" throughout the coefficients to be tested. Thus, we have the following notation and the C matrix Results of the χ 2 tests using (16) and C matrices above are 4575.67 (6 d.f.) for the period effect, 164.99 (6 d.f) for the cancer*period interaction and 64.18 (1 d.f.) for the test of trend for the cancer*period interaction. These results indicate that there is a significant period effect, a significant cancer*period interaction, and a significant trend for the proportion of cancer to increase over time.

INPUT
There are seven forms of input into the computer program: the number of parameters, the variable names of the regression coefficients, the regression coefficients, the variance-covariance matrix of the regression coefficients, the number of coefficients used in RR or tests of hypotheses, the probability of the CIs, and the units applied to each coefficient. The first four items, i.e., the number of parameters, variable names, regression coefficients, and variance-covariance matrix can be entered from either the keyboard or a disk-file. The last three items, namely, the number of coefficients used in RR or tests of hypotheses, probability of the CIs, and units applied to each coefficient can only be entered by using the keyboard.

USER INPUT DATA FILE
This section discusses the data format when the user has specified that the regression coefficients and variancecovariance matrix are to be entered using the input disk-file. Table 2 lists the data in each record (card) of the input disk-file. When entering the regression coefficients and variance-covariance matrix with the keyboard, however, the data must follow the form described below.
The examples in the Appendix give listings of input files according to the format specified in Table 2.

SAMPLE INPUT/OUPUT CASES
The output for all computer runs is in tabular form, giving the variable names, the coefficients, variance-covariance matrix, single or joint effects, contrast matrices, units, point estimates and lower and upper bounds of the 1-α CIs for each run, the chi-square test statistic with degrees of freedom and p-values. Two examples using an input disk-file are given in the Appendix.  Table 3 lists the names of the files that were used to program, link and execute COVAR. As one notices, the source code is an ASCII text file and the object and executable files have been compiled with Microsoft FORTRAN Powerstation Version 4.10. There is only one optional input data file, which is read when the uses specifies data input from disk. In this case, the filename to be read is specified by the user at run-time.

AVAILABILITY
The program and all of its subroutines are available from the Journal of Statistical Software free of charge at http://www.stat.ucla.edu/journals/jss/