Exact Hypothesis Tests for Log-linear Models with exactLoglinTest

This manuscript overviews exact testing of goodness of fit for log-linear models using the R package exactLoglinTest. This package evaluates model fit for Poisson log-linear models by conditioning on minimal sufficient statistics to remove nuisance parameters. A Monte Carlo algorithm is proposed to estimate P values from the resulting conditional distribution. In particular, this package implements a sequentially rounded normal approximation and importance sampling to approximate probabilities from the conditional distribution. Usually this results in a high percentage of valid samples. In instances where it does not, a Metropolis-Hastings algorithm can be used instead that makes more localized jumps within the reference set. The manuscript details how some conditional tests for binomial logit models can also be viewed as conditional Poisson log-linear models and hence can be performed via exactLoglinTest. A diverse battery of examples is considered to highlight the use, features and extensions of the software. Notably, potential extensions to evaluating disclosure risk are also considered.


The manuscript is laid out as follows. In Section 1 notation and background information are presented. Section 2 gives the basics of the algorithms, while Section 3 covers several example usages. Finally, Section 4 concludes with a discussion.

Problem formulation
Consider the data given in Table 1, which is a 4×4 contingency table compiled by the United States Census Bureau tabulating subjects' residence in 1985 by their residence in 1980. This data set will be used as motivation for the approach.
Let y denote the vector of observed cell counts, assumed to be independent Poisson with mean vector μ satisfying log(μ) = xβ, where β is a vector of unknown parameters. Here "log" is presumed to act component-wise on vectors, and the design matrix, x, is assumed to be full rank. For example, for the Residency data, we might consider the familiar model of independence. The columns of x would contain a constant (intercept term), three columns of indicators for the 1980 residence and three columns of indicators for the 1985 residence.
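To make this concrete, the following base-R sketch (with illustrative factor names, not objects from the package) builds the independence-model design matrix for a 4 x 4 table and confirms it is full rank:

```r
# Hypothetical sketch: the independence-model design matrix for a 4x4 table.
# Factor names 'res80' and 'res85' are illustrative placeholders.
tab <- expand.grid(res80 = factor(1:4), res85 = factor(1:4))  # 16 cells
x <- model.matrix(~ res80 + res85, data = tab)
dim(x)      # 16 cells by 7 columns: intercept + 3 + 3 indicators
qr(x)$rank  # full rank: 7
```

With treatment contrasts, each 4-level factor contributes three indicator columns, matching the description above.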
Consider investigating model fit using goodness-of-fit statistics. Specifically, let h be a test statistic of interest such that larger values of h support the alternative hypothesis, such as the Pearson chi-squared statistic, Σ_i (y_i − μ_i)² / μ_i, or the deviance statistic, 2 Σ_i y_i log(y_i / μ_i). Here the term y_i log(y_i / μ_i) is defined to be 0 when y_i = 0, and the μ_i are maximum likelihood estimates of the means. For the Residency data set these statistics are both on the order of 120,000 on 9 degrees of freedom. The independence model results in such a poor fit because of the large counts down the diagonal; that is, people tend to stay in the same geographic region. In Section 3.1 a better fitting model for this data is considered that incorporates model parameters for each diagonal count.
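As an illustration, both statistics can be computed directly for a made-up 4 x 4 table (the counts below are hypothetical, not the Census counts of Table 1):

```r
# Illustrative only: a made-up 4x4 table with a strong diagonal,
# mimicking the structure (not the values) of the Residency data.
y <- matrix(c(50, 3, 2, 1,
               4, 40, 3, 2,
               2, 3, 45, 4,
               1, 2, 3, 60), 4, 4, byrow = TRUE)
n  <- sum(y)
mu <- outer(rowSums(y), colSums(y)) / n   # MLE fitted means under independence
pearson  <- sum((y - mu)^2 / mu)
deviance <- 2 * sum(ifelse(y == 0, 0, y * log(y / mu)))
c(pearson, deviance)
pchisq(pearson, df = (4 - 1) * (4 - 1), lower.tail = FALSE)  # chi-squared approx.
```

The strong diagonal inflates both statistics, just as it does for the real data.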
For Poisson log-linear models, it is well known that the sufficient statistic for β under the null hypothesis is x'y, where a prime denotes a transpose. The existence of closed form sufficient statistics for the Poisson log-linear model yields a method for testing goodness of fit or eliminating nuisance parameters. In particular, because x'y is sufficient for β, the conditional distribution of y given x'y does not depend on any parameters. Furthermore, it can be shown that

f(u | x'u = s) = C / (u_1! u_2! ... u_n!)   for u in Γ(s),        (1)

where C is a normalizing constant, u = (u_1, ..., u_n) is a generic vector of non-negative counts and Γ(s) = {u | x'u = s}. Here Γ(s) is the so-called "reference set", or the set of allowable values of y given that x'y = s.
To illustrate, consider the independence model for the Residency data. The sufficient statistics, x'y, can be shown to be the margins of the table. Hence Γ(s) is the collection of all tables that satisfy the margins. In this case, the reference set is easily characterized. However, in general, Γ(s) is either too complicated or contains too many elements to be characterized in any useful way.
A conditional P value calculates the probability of all tables with values of h at least as large as the observed value. Notationally, the conditional P value is

P{h(y) ≥ h(y_obs) | x'y = s_obs} = Σ f(u | x'u = s_obs),

the sum being over all u in Γ(s_obs) with h(u) ≥ h(y_obs), where y_obs is the observed table and s_obs = x'y_obs.
When compared to a fixed nominal type I error rate, such a P value is "exact". It should be noted that the accolade "exact" is given to tests that guarantee the nominal type I error rate unconditionally. Thus a test that never rejects the null hypothesis is technically exact in any situation. In particular, this illustrates that exactness may be a desirable property, but it is not sufficient for a test to be acceptable. Moreover, this example (never rejecting) is particularly relevant in our setting because Γ(s_obs) may contain one or few elements. Hence the conditional P value will be exactly or nearly one regardless of the evidence in the data vis-à-vis the two hypotheses. However, it is also the case that these conservative conditional tests can produce P values that are smaller than those calculated via chi-squared approximations (see Subsection 3.2 for an example).
A general problem with exact P values is their calculation. The reference set, Γ(s_obs), is often too large or complicated to enumerate. In this manuscript the use of Monte Carlo to approximate conditional P values is explored. Specifically, consider simulating J complete tables, say {y^(j)} for j = 1, ..., J. Then a Monte Carlo approximation to the conditional P value is

Σ_j w^(j) I{h(y^(j)) ≥ h(y_obs)} / Σ_j w^(j),

where I is the indicator function and the {w^(j)} are importance weights: the ratio of the target mass function (given by Equation 1) evaluated at the simulated table to that of the mass function used for simulation. Note that when h is the Pearson statistic or the deviance, the fitted values, μ, do not change by iteration, because every simulated data set has the same sufficient statistics.
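For the special case of the independence model, this scheme can be sketched with base R alone: r2dtable (an implementation of Patefield's algorithm) samples tables exactly from the conditional distribution given the margins, so the importance weights are constant and the Monte Carlo P value reduces to a simple proportion. The counts below are hypothetical:

```r
# Illustrative sketch of the Monte Carlo conditional P value for a 2x2 table.
set.seed(1)
y  <- matrix(c(8, 5, 4, 9), 2, 2)            # hypothetical counts
mu <- outer(rowSums(y), colSums(y)) / sum(y) # fitted means, fixed over iterations
h    <- function(u) sum((u - mu)^2 / mu)     # Pearson statistic
hobs <- h(y)
# r2dtable samples exactly from the conditional distribution given the margins,
# so every weight w^(j) equals 1 and the P value is a plain proportion.
sims <- r2dtable(10000, rowSums(y), colSums(y))
phat <- mean(sapply(sims, h) >= hobs)
phat
```

For general log-linear models no such direct sampler is available, which is what motivates the normal-approximation algorithms described next.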

Binomial calculations
Conditional inference for binomial-logit models is a special case of conditional inference for Poisson log-linear models. Therefore, the testing method outlined can also be applied to binomial-logit models, provided some care is taken in model specification.
Consider a binomial logit model of the form b_i ~ Bin(n_i, p_i) for i = 1, ..., k with

logit(p_i) = γ z_i + x_i'β,        (2)

where γ is a scalar parameter of interest, z_i is a covariate and β is a p dimensional vector of nuisance parameters. Frequently, x_i contains only a stratum indicator and an intercept term. Conditioning on the sufficient statistic for β results in standard conditional logistic regression. For this purpose, we suggest the coxph function as described in Venables and Ripley (2002). Instead we consider the more general case where x_i is arbitrary. However, we stipulate that conditional inference in such settings is not always informative, as the loss of information from conditioning can sometimes be quite severe. For example, in many cases the reference set might contain only the observed data.
Consider testing H_0: γ = 0 versus H_a: γ < 0 and the following Poisson log-linear model for the success (y_i1 = b_i) and failure (y_i2 = n_i − b_i) counts:

log(μ_ij) = α_i + (γ z_i + x_i'β) I(j = 1)        (3)

for j = 1, 2 and i = 1, ..., k, where I(j = 1) indicates a success count. The sufficient statistics for the α_i are y_i1 + y_i2 = n_i. Then it can be shown that the conditional distribution of y_i1 given y_i1 + y_i2 is precisely the model given by Equation 2. Therefore, conditioning out the nuisance parameters {α_i} and β for the Poisson log-linear model yields exactly the same (null) conditional distribution as conditioning out β in Model 2. Furthermore, this exercise indicates exactly how to perform the calculations, which is pertinent, since exactLoglinTest only accepts models in the form of Poisson log-linear models. It is also possible to represent many multinomial and product multinomial data sets as instances of conditional Poisson models. We refer to Agresti (1990, Chapter 8 and Section 8.6.7 in particular) for more details.
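The reshaping this representation requires can be sketched in base R. The data frame below is hypothetical; it simply recasts binomial rows as success/failure Poisson counts with a stratum factor alpha corresponding to the α_i:

```r
# Hypothetical sketch: recast binomial rows (b successes out of n trials)
# as success/failure Poisson counts with a stratum indicator 'alpha'.
bin.dat <- data.frame(b = c(3, 7, 2), n = c(10, 12, 9), z = c(1, 0, 1))
pois.dat <- data.frame(
  y     = c(bin.dat$b, bin.dat$n - bin.dat$b),   # successes, then failures
  surv  = rep(c(1, 0), each = nrow(bin.dat)),    # success/failure indicator
  z     = rep(bin.dat$z, 2),
  alpha = factor(rep(seq_len(nrow(bin.dat)), 2)) # one alpha_i per binomial row
)
# Conditioning on the sufficient statistics in the Poisson model with terms
# z:factor(surv) + factor(surv) + alpha then matches conditional inference
# on gamma in the binomial logit model.
```

Each stratum total y_i1 + y_i2 equals the binomial denominator n_i, which is what conditioning on the α_i sufficient statistics fixes.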

The software
The software exactLoglinTest is an implementation of the algorithms presented in Booth and Butler (1999) and Caffo and Booth (2001) using the R open-source programming language (R Development Core Team 2006a). At the heart of both algorithms is a sequentially generated rounded normal approximation to the conditional distribution. Full descriptions of the algorithms are given in the technical references above while below we give a brief overview and summarize related methods.
Both algorithms use the normal approximation to the Poisson as their base. That is, for large μ, y will be approximately Normal{μ, D(μ)}, where D(μ) is a diagonal matrix with diagonal μ. Using the properties of the multivariate normal distribution, an approximation for the distribution of y given x'y can be found. This process fixes some of the cells of y and leaves some to be simulated. For example, in the independence model, x'y is the margins of the table; subtracting the other rows from the column margins then determines the last row, so only the remaining rows need be simulated. The choice of which cells to fix and which to simulate is somewhat arbitrary. The software does an iterative search to find appropriate cells, though the choice is not optimized.
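The fact that fixing x'y determines some cells can be seen directly. In the sketch below (illustrative counts), the last row of a table is recovered from the column margins once the other rows are known:

```r
# Illustrative: with margins fixed, the last row of a table is determined.
y <- matrix(c(10, 2, 3,
               4, 8, 1,
               2, 3, 9), 3, 3, byrow = TRUE)
cmarg <- colSums(y)               # fixed by conditioning on x'y
free  <- y[1:2, ]                 # cells left to be simulated
last  <- cmarg - colSums(free)    # last row recovered from the column margins
all(last == y[3, ])               # TRUE
```

Only the free cells need a (rounded) normal proposal; the rest follow deterministically.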
The algorithm in Booth and Butler (1999) employed the sequential distributions from the normal approximation to the distribution of y given x y as a candidate distribution for importance sampling. A novel sequential rounding scheme was used to force the simulated data to be non-negative integers.
The normal approximation allows for negative cell entries, resulting in simulated tables that must be discarded and a potential loss in algorithmic efficiency. A refinement of the algorithm fixes not only x y, but also some of the remaining free cells. Because the normal approximation is lower dimensional, a greater percentage of valid tables can be generated this way (as developed in Caffo and Booth 2001). However, this modification generates a Markov chain instead of iid samples and hence requires more thoughtful use of the algorithm. As such, one should switch to Markov chain sampling only when importance sampling produces an extremely low percentage of valid tables.
Both the importance sampling and MCMC methods are implemented in a function, mcexact. This function takes a log-linear model specification in the same form as a call to glm. The function mcexact uses glm to construct a design matrix and obtain fitted means (μ). The algorithm proceeds to calculate goodness-of-fit statistics and subsequently runs the Monte Carlo algorithm. The returned object is a list of class "bab" or "cab" (acronyms for "Booth And Butler" and "Caffo And Booth" respectively) depending on which of the two algorithms was used. The object has available methods summary, print and update. Also included in the package are utility functions for more direct interaction with the simulated tables.
Below we discuss several competing algorithms to the rounded normal approximations used in Booth and Butler (1999) and Caffo and Booth (2001). However, we note that direct software competitors to the exactLoglinTest algorithms are few. From this perspective, the best developed are the network algorithms (see Mehta and Patel 1980, 1983) and related methods incorporated in the software StatXact (Mehta 1991). This proprietary software suite is considered the industry standard in this area. Being open source and freely available, exactLoglinTest targets a different user base. In addition, we note that exactLoglinTest differs from the StatXact software suite by focusing only on log-linear models via Monte Carlo calculations.
Of the many competing algorithms, perhaps the most general is due to Diaconis and Sturmfels (1998), who used computational algebra to provide simple random walk algorithms based on Markov bases. A limitation of this approach is that, in some settings, the computational algebra required to obtain the Markov bases is impractical. However, Dobra (2003) calculated the Markov bases for a large class of graphical models. Related algorithms using elegant random scan Gibbs samplers were given by McDonald, Smith, and Forster (1999), among others. Furthermore, relevant recent developments in sequential importance sampling (Chen, Diaconis, Holmes, and Liu 2005; Chen, Dinwoodie, and Sullivant 2006) are applicable to this setting. We refer the reader to Caffo and Booth (2003) for an overview of Monte Carlo algorithms in this area.
In addition to Monte Carlo algorithms, there has been much relevant research in the area of fast exact enumeration calculations for specific models (such as in Patefield 1981; Booth, Capanu, and Heigenhauser 2005; Hirji, Mehta, and Patel 1987). Also, highly accurate saddlepoint approximations have been applied with success (Strawderman and Wells 1998).

Examples
A copy of the package can be obtained at the Comprehensive R Archive Network, http://CRAN.R-project.org/. Refer to the "R Installation and Administration" manual (R Development Core Team 2006b), included with your R distribution, for details on how to install packages. In addition, a noweb version of this document can be found at the author's web site.
Assuming it is properly installed, one can load exactLoglinTest with

R> library("exactLoglinTest")
R> set.seed(1)

The set.seed(1) command sets the random number seed to a specific value, so that results can be reproduced.

Residency data
Recall the Residency data from Table 1. The default method used for sampling is the importance sampling algorithm of Booth and Butler (1999). Because this method rejects simulated tables with negative entries, the number of desired simulations, nosim, may not be met in maxiter iterations.
The returned object is a list of class "bab", storing the results as well as all of the relevant information necessary to restart the simulation. More information can be obtained with summary. The "t degrees of freedom" refers to the degrees of freedom used as a tuning parameter within the algorithm, while df refers to the model degrees of freedom.
As it stands, the Monte Carlo standard error, mcse, is too large for the P value estimate to be useful; the simulation can be resumed with update. Here nosim is the number of additional simulations desired and maxiter is the maximum number of iterations allowed. It is important to note that update can only resume the simulation with a new Monte Carlo sample size. It does not allow users to change the model formulation; one must rerun mcexact independently to do that. In practice, we recommend that users employ a large number of simulations, updating results until twice the Monte Carlo standard error is below the number of significant digits required for reporting the P value.
This example illustrates the point that the underlying algorithms are very efficient when the cell counts are large. When this is the case, the large sample approximations are close to the conditional results, as we can see using the observed deviance and Pearson statistics given in the output above:

R> pchisq(c(2.986, 2.982), 3, lower.tail = FALSE)
[1] 0.3937887 0.3944088

Pathologists' tumor ratings
The following example is interesting in that the large sample results differ drastically from the conditional results. Moreover, the conditional results are less conservative. The data are given in Table 2.

Alligator food choice data using MCMC
This example illustrates the algorithm from Caffo and Booth (2001) using the data and Poisson log-linear model for the alligator food choice data shown in Table 3. This data set and model are a good choice for MCMC, as the percentage of valid tables generated using method = "bab" is very small, less than 1% of the tables simulated. It is often the case that the MCMC algorithm will be preferable when the table is large and/or sparse. Using MCMC introduces further complications in reliably running and using the output of the algorithm. Throughout this example we consider a log-linear model in the factors F = food choice, L = lake, S = size and G = gender.
The algorithm from Caffo and Booth (2001) uses local moves to reduce the number of tables with negative entries that the chain produces. This method can be invoked with the option method = "cab" of mcexact. The parameter p of mcexact is the average proportion of table entries left fixed. A chain with p = 0.9 will leave most of the table entries fixed from one iteration to the next. A high value of p will often result in a high proportion of valid (non-negative) simulated tables. Unfortunately, a large value of p can cause the chain to mix slowly, because the tables will be very similar from one iteration to the next. However, it is also possible that a small value of p will produce too many tables with negative entries. Hence the Metropolis/Hastings/Green algorithm will stay at the current table for long periods and again result in a slowly mixing chain. Therefore, adjusting the value of p is usually required.
It is also usually useful to look at the chain of P values for the deviance

R> top <- cumsum(alligator.mcx$chain[, 1] >= alligator.mcx$dobs[1])
R> bottom <- 1:alligator.mcx$nosim
R> dev.p <- top/bottom
R> plot(dev.p, type = "l", ylab = "P value", xlab = "iteration")
R> title("Deviance P value by iteration")

and Pearson statistics. Note that there is an extremely slow decay in the autocorrelations of the chain of goodness-of-fit statistics and that the P values for the two statistics do not seem to have converged. Therefore, a much longer final run should be performed (not shown). Also, mcexact uses batch means to estimate the Monte Carlo error, and so this suggests we should use very large batch sizes.
With regard to batch sizes, mcexact does not require the total number of simulations to be a multiple of the batch size. If the algorithm terminates in the middle of completing a batch, that batch is not used in the P value calculations. However, the simulations are not wasted if the algorithm is resumed with update.
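The batch-means calculation itself is simple and can be sketched in base R. The indicator chain below is an iid stand-in for illustration only; a real mcexact chain is correlated, which is exactly why large batches are needed there:

```r
# Hypothetical sketch of the batch-means estimate of Monte Carlo error.
# 'ind' stands in for the chain of indicators I{h(y^(j)) >= h(y_obs)}.
set.seed(1)
ind    <- rbinom(10000, 1, 0.3)          # placeholder; real chains are correlated
bsize  <- 500
nbatch <- floor(length(ind) / bsize)     # an incomplete final batch is dropped
bm     <- colMeans(matrix(ind[1:(nbatch * bsize)], nrow = bsize))
phat   <- mean(bm)                       # P value estimate
mcse   <- sd(bm) / sqrt(nbatch)          # batch-means Monte Carlo standard error
c(phat, mcse)
```

With correlated draws, bsize must be large enough that batch means are nearly independent, or mcse will be understated.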
A final run for this data set, discarding all of the initial runs, could be performed by setting flush = TRUE as an argument to update. Here flush = TRUE tells update to throw out all of the data used in the initial tinkering, except that it starts the new chain from the final table of the initial runs. This is a harmless way to burn in the chain without throwing away samples from the final run. The chain can be restarted at the default starting value, the observed data, by rerunning mcexact.

Exact score test for binomial counts
The data given below are contained in the R dataset Titanic (see Dawson 1995). Refer to the help page for the data set for more information. The data cross-classify survival counts of the Titanic passengers by class, gender and age. The variable alpha is added to correspond to the α_i terms from Equation 3.
We view each person's survival as a binary outcome. Furthermore, we use a model where a person's age, gender and class are additive effects on the logit scale. The corresponding Poisson log-linear model has model formula

y ~ (factor(class) + factor(age) + factor(sex)):factor(surv) + factor(surv) + factor(alpha)

Because mcexact was designed for goodness-of-fit tests, one must work more directly with the simulated data in cases with an alternate goal. We use this example to illustrate the simulateConditional command, which only performs the simulation. Its result is a matrix of the simulated y values, with each row being a simulation. When method = "bab" the importance weights, on the log scale, are given in the final column.
Consider the gender effect in particular. To calculate an exact P value, simulateConditional is used to simulate tables conditioning on all of the remaining parameters, setting the interaction factor(surv):factor(sex) to 0.
R> chain <- simulateConditional(y ~ factor(surv) + (factor(class) +
+     factor(age)):factor(surv) + factor(alpha), dat = titanic.dat,
+     nosim = 10^4, method = "cab", p = 0.1)

A P value for a score test of H_0: γ = 0 versus H_a: γ < 0 simply counts the proportion of tables whose sufficient statistic for γ is smaller than the observed value. Using the notation from Equation 3, the sufficient statistic for γ is s_γ = Σ_i z_i y_i1 ≡ z'y. We calculate the chain of sufficient statistics and the observed sufficient statistic below.
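Given a matrix of simulated tables, this score-test calculation reduces to a few lines. The sketch below uses placeholder data in place of simulateConditional output, with z an illustrative covariate vector:

```r
# Hypothetical sketch: one-sided exact score P value from simulated tables.
# 'sims' stands in for simulateConditional output (rows = simulated y vectors).
set.seed(1)
z    <- c(1, 0, 1, 0)                              # illustrative covariate values
yobs <- c(2, 5, 1, 6)                              # illustrative observed counts
sims <- matrix(rpois(4000, lambda = 3), ncol = 4)  # placeholder simulated tables
sobs   <- sum(z * yobs)                            # observed statistic z'y
schain <- sims %*% z                               # chain of sufficient statistics
pval   <- mean(schain <= sobs)                     # one-sided: H_a is gamma < 0
```

With real simulateConditional output under method = "cab", the same proportion is taken over the chain of tables.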

Application to disclosure limitation
Though there are certainly more rigorous procedures available (see Dobra et al. 2002), exactLoglinTest is a useful tool for exploring disclosure limitation in contingency tables. Consider the Czech auto worker data given in Table 4. After loading the data into a data frame, we investigate the potential disclosure risk from releasing all two-way margins from this table.

R> mean(chain[, 55] > 0 & chain[, 55] < 3)
[1] 0.603

This generalized hypergeometric model was used because it fixes all two-way margins. However, that model need not fit the data well (in fact, it does not). Therefore, in addition to simulating from the generalized hypergeometric density, it is also of interest to simulate from other densities, such as a uniform distribution on tables with these margins. Though the normal approximations in exactLoglinTest were tailored specifically to the hypergeometric density, the package allows for other target distributions. Here the density must be specified on the log scale. Multiplicative constants (additive on the log scale) can be discarded. Therefore, to specify a uniform density on the log scale, a function that returns 0 is supplied to the dens option.
R> chain2 <- simulateConditional(y ~ (A + B + C + D + E + F)^2,
+     data = czech.dat, method = "cab", nosim = 10^3, p = 0.4,
+     dens = function(y) 0)

For each cell under consideration, the algorithm discovered tables that satisfy the margins but do not have a one or two cell count. Hence, the disclosure risk in releasing the two-way margins seems minimal.
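The cell-by-cell screening used above can be sketched in base R, with a placeholder chain standing in for simulateConditional output:

```r
# Hypothetical sketch: screen every cell of a 2^6 table for small-count risk.
# 'chain' stands in for simulateConditional output (columns = table cells).
set.seed(1)
chain <- matrix(rpois(6400, lambda = 5), ncol = 64)  # placeholder simulations
risk  <- colMeans(chain > 0 & chain < 3)  # per-cell P(count is 1 or 2)
which(risk > 0.9)                         # cells that are nearly identified
```

A cell whose simulated count is 1 or 2 in nearly every table satisfying the released margins would be a disclosure concern; here no such cell appears.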

Discussion
Exact tests are useful tools for investigating goodness of fit for contingency table data. The primary limitation of these tests is the difficulty in implementing them. The software exactLoglinTest can help solve this problem for many examples. The program uses sequentially rounded normal approximations to the relevant conditional distribution to produce Monte Carlo approximations to P values. In this manuscript we investigated three straightforward examples of exactLoglinTest and considered two potentially useful extensions of the program.