eco: R Package for Ecological Inference in 2 × 2 Tables

eco is a publicly available R package that implements the Bayesian and likelihood methods proposed in Imai, Lu, and Strauss (2008b) for ecological inference in 2 × 2 tables, as well as the method of bounds introduced by Duncan and Davis (1953). The package fits both parametric and nonparametric models using either the Expectation-Maximization algorithms (for likelihood models) or the Markov chain Monte Carlo algorithms (for Bayesian models). For all models, individual-level data can be directly incorporated into the estimation whenever such data are available. Along with in-sample and out-of-sample predictions, the package also provides functionality that allows one to quantify the effect of data aggregation on parameter estimation and hypothesis testing under the parametric likelihood models. This paper illustrates the usage of eco with several real data examples that are also part of the package.


Introduction
This paper illustrates how to use eco, a publicly available R package (R Development Core Team, 2007), to implement the Bayesian and likelihood methods proposed in Imai, Lu, and Strauss (2008b) for ecological inference in 2 × 2 tables. The package also implements the method of bounds introduced by Duncan and Davis (1953) for the analysis of general R × C tables. Ecological inference refers to "inferences about individual behavior drawn from data about aggregates" (Freedman, 1999, p. 4027). Such cross-level inferences are frequently conducted in epidemiology, political science, and sociology when only aggregate-level data are available (e.g., Greenland and Robins, 1994; Achen and Shively, 1995; King, 1997; King et al., 2004). Yet, the difficulty of ecological inference is that an observed correlation at the aggregate level does not necessarily imply the same individual-level relationship. Using an example of literacy rates across different racial groups, Robinson (1950) powerfully illustrated this "ecological fallacy." Since Robinson's seminal article, various methods have been proposed for ecological inference. Duncan and Davis (1953) showed how to derive bounds on the unknown quantities of interest from aggregate data. We generalize and implement this method for R × C tables in eco. Goodman (1953, 1959) developed the regression-based approach to ecological inference, which gained popularity among applied researchers over the next several decades (e.g., Freedman et al., 1991; Achen and Shively, 1995; Gelman et al., 2001); this approach can be easily implemented via the lm() command in R, and hence is not implemented in eco. Recent years have witnessed a growing number of new methods based on modern statistical techniques (e.g., King et al., 1999; Rosen et al., 2001; Imai and King, 2004; Judge et al., 2004; Wakefield, 2004); some of these methods are available in R via Zelig (Imai, King, and Lau, 2008a) and MCMCpack (Martin and Quinn, 2006).
At the same time, the appropriateness of the assumptions underlying some of these models is often disputed (e.g., Freedman et al., 1998; Cho, 1998; King, 1999; Cho and Gaines, 2004).
In a recent paper, Imai, Lu, and Strauss (2008b) have proposed a theoretical framework for Bayesian and likelihood inference in 2 × 2 ecological tables. The framework is based on the theory of coarse data originally developed by Heitjan and Rubin (1991). We show that ecological inference can be formulated as a coarse data problem and that Bayesian and likelihood inference can be conducted within this coarse data framework. The main advantage of this framework is that it clarifies the modeling assumptions necessary for Bayesian and likelihood ecological inference. In particular, Imai, Lu, and Strauss (2008b) show that the ecological inference problem can be decomposed into three key factors: distributional effects, which address the possible misspecification of parametric modeling assumptions about the unknown distribution of missing data; contextual effects, which represent the possible correlation between missing data and observed variables; and aggregation effects, which are directly related to the loss of information caused by data aggregation.
Furthermore, while Imai, Lu, and Strauss (2008b) propose statistical methods to address distributional and contextual effects, they also show that aggregation effects cannot be overcome by statistical adjustments. Instead, they demonstrate how to formally quantify the effect of data aggregation on parameter estimation and hypothesis testing. In this paper, we illustrate how to implement these proposed methods using the R package eco, which is freely available at the Comprehensive R Archive Network (http://cran.r-project.org/). In Section 2, we start our discussion by describing and generalizing the method of bounds (Duncan and Davis, 1953). We then outline the parametric and nonparametric models proposed by Imai, Lu, and Strauss (2008b), which respect the constraints imposed by the bounds. We also briefly review the method to formally quantify the aggregation effects. In Section 3, we illustrate the use of the eco package through the analysis of several real data examples. Finally, the appendices at the end of the paper provide detailed references for the commands and data sets that are part of the eco package.

Table 1: 2 × 2 Ecological Table for the Racial Voting Example. X_i, Y_i, W_i1, and W_i2 are proportions, and hence lie between 0 and 1. The unit of observation is typically a geographical unit and is denoted by i.

The Methodology
We use racial voting as a concrete example to describe ecological inference in 2 × 2 tables. Although this is a prominent example in political science, other problems in different disciplines may fit into the same framework. Table 1 presents a 2 × 2 ecological table for the racial voting example. Suppose that from the census data we observe the fraction of registered white and black voters for each county, i.e., X_i and 1 − X_i. The overall turnout rate Y_i can be obtained from the election returns for each county. However, the proportions of black and white voters who turned out, W_i1 and W_i2 respectively, are unknown. The eco package implements the method of bounds and fits both parametric and nonparametric models for such ecological data. The estimation is based on the Expectation-Maximization (EM) algorithms (Dempster, Laird, and Rubin, 1977) for likelihood models and on the Markov chain Monte Carlo (MCMC) algorithms for Bayesian models. These algorithms are described in Imai, Lu, and Strauss (2008b). Below, we briefly summarize each method and model. Note that although we do not discuss the issue of convergence of Markov chains in detail, users of eco should follow standard advice and conduct convergence diagnostics, perhaps using the coda package.

The Method of Bounds
Suppose that in a simple random sample of size n from a population, we observe the margins of Table 1 for each county i. The method of bounds is based on the following deterministic relationship,

Y_i = X_i W_i1 + (1 − X_i) W_i2.    (1)

When Y_i is equal to either 0 or 1, W_i1 and W_i2 are completely known. If X_i = 1, then W_i1 = Y_i but W_i2 does not exist. Similarly, if X_i = 0, then W_i2 = Y_i but W_i1 does not exist. King (1997) called equation 1 a tomography line. For every i, this tomography line defines a deterministic relationship between the missing data, W_i = (W_i1, W_i2), and the observed data, (Y_i, X_i). Duncan and Davis (1953) first recognized that with equation 1, one can narrow the original bound of [0, 1] for W_i to the following intervals,

W_i1 ∈ [ max(0, (Y_i − (1 − X_i)) / X_i), min(1, Y_i / X_i) ],    (2)
W_i2 ∈ [ max(0, (Y_i − X_i) / (1 − X_i)), min(1, Y_i / (1 − X_i)) ].    (3)

Given these bounds for each i (e.g., a county), the analysis of larger units (e.g., a state) can be carried out by simply aggregating the upper and lower bounds with appropriate weights; N_i X_i and N_i (1 − X_i) for W_i1 and W_i2, respectively, where N_i is the total number of voters in county i. When the resulting bounds are sufficiently narrow, researchers can draw reasonably informative conclusions about (in-sample) missing cells.

The bounds in equations 2 and 3 can be easily generalized to the situation of R × C ecological tables where R ≥ 2 and C ≥ 2. These generalized bounds can also be computed via the eco package. Suppose that we denote the observed row and column margins by Y_ir and X_ic for c = 1, . . . , C and r = 1, . . . , R, where ∑_{c=1}^C X_ic = 1 and ∑_{r=1}^R Y_ir = 1 for all i. Then, the unobserved proportion in the rth row and cth column can be denoted by W_irc. The results in the statistical literature on contingency tables (Bonferroni, 1936; Fréchet, 1940; Hoeffding, 1940) imply that the bounds are given by,

W_irc ∈ [ max(0, (X_ic + Y_ir − 1) / Y_ir), min(1, X_ic / Y_ir) ].    (4)

Although applied researchers often find the bounds too wide for their purposes, the method of bounds shows the identifying power of the data without any statistical assumption. That is, the bounds indicate the exact degree to which the data are informative about W_i. For this reason, statistical analysis that does not incorporate this deterministic relationship is likely to be sensitive to modeling assumptions. The eco package computes the bounds for the general R × C case as well as for the 2 × 2 case (see Section 3.1).
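To make the bounds concrete, here is a minimal Python sketch (illustrative only, not part of eco; the function names are ours) that computes the county-level intervals in equations 2 and 3 and the population-weighted aggregate bounds described above:

```python
# Illustrative sketch of the Duncan-Davis method of bounds for a 2 x 2 table.
def bounds_2x2(X, Y):
    """Return ((W1_lo, W1_hi), (W2_lo, W2_hi)) for one unit, given the
    row margin X (e.g., black share) and column margin Y (e.g., turnout)."""
    W1 = (max(0.0, (Y - (1 - X)) / X), min(1.0, Y / X)) if X > 0 else None
    W2 = (max(0.0, (Y - X) / (1 - X)), min(1.0, Y / (1 - X))) if X < 1 else None
    return W1, W2

def aggregate_bounds(X, Y, N):
    """Aggregate bounds for W1 across units, weighting county i by N_i * X_i
    as described in the text."""
    lo = hi = wsum = 0.0
    for x, y, n in zip(X, Y, N):
        if x == 0:          # W1 does not exist when X_i = 0
            continue
        b1, _ = bounds_2x2(x, y)
        lo += n * x * b1[0]
        hi += n * x * b1[1]
        wsum += n * x
    return lo / wsum, hi / wsum
```

For instance, a county with X_i = 0.5 and Y_i = 0.6 yields the interval [0.2, 1] for both W_i1 and W_i2.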

Parametric Models
Next, we describe the parametric models implemented by the package eco. Imai, Lu, and Strauss (2008b) propose parametric models based on three assumptions. The simplest parametric model is based on the assumption of coarsened at random (CAR) and is defined by,

W*_i | μ, Σ ~ indep. N(μ, Σ),

where W*_i = (logit(W_i1), logit(W_i2)), μ represents a 2 × 1 vector of population means, and Σ is a 2 × 2 positive-definite variance matrix. The model, which is similar to the ones proposed by King (1997) and Wakefield (2004), assumes independence between W_i and X_i and thus no contextual effect. The maximum likelihood (ML) estimates of μ and Σ can be computed via the EM algorithm. The Bayesian analysis, on the other hand, is based on the following conjugate (Normal-Inverse-Wishart) prior distribution,

μ | Σ ~ N(μ_0, Σ / τ_0) and Σ ~ InvWishart(ν_0, S_0^{-1}),

where μ_0 denotes a (2 × 1) vector of the prior mean, τ_0 is a scalar, ν_0 is the prior degrees of freedom parameter, and S_0 represents a (2 × 2) positive definite prior scale matrix. The posterior inference can then be conducted by the MCMC algorithm.
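For intuition, the CAR setup is easy to simulate. The following Python sketch (illustrative only, under assumed parameter values; not eco code) draws W*_i from a bivariate normal on the logit scale, draws X_i independently of W_i (the CAR assumption), and forms Y_i from the accounting identity:

```python
# Sketch: simulate aggregate data (X_i, Y_i) under the CAR model.
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_car(n, mu=(0.5, 1.0), sd=(0.8, 0.8), rho=0.3, seed=1):
    """Return a list of (X_i, Y_i, W_i1, W_i2) tuples. The true W_i are
    kept only to check the simulation; an analyst would observe (X_i, Y_i)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Correlated bivariate normal draw for W*_i = (logit W_i1, logit W_i2).
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        w1 = inv_logit(mu[0] + sd[0] * z1)
        w2 = inv_logit(mu[1] + sd[1] * z2)
        x = rng.random()                 # X_i independent of W_i (CAR)
        y = x * w1 + (1 - x) * w2        # the accounting identity (equation 1)
        data.append((x, y, w1, w2))
    return data
```

Because Y_i is a convex combination of W_i1 and W_i2, every simulated Y_i lies between the two unobserved cell proportions, which is exactly the information the bounds exploit.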
The assumption of no contextual effect under the CAR model is unrealistic in many situations. Imai, Lu, and Strauss (2008b) consider two modeling strategies in order to relax this assumption.
First, one may collect additional covariates Z_i and assume no contextual effect after conditioning on Z_i. Such a strategy is often employed in the literature (e.g., King, 1997; King et al., 1999). Thus, we can extend our CAR model to the following CCAR (conditionally coarsened at random) model,

W*_i | Z_i, β, Σ ~ indep. N(Z_i^T β, Σ),

where β represents a (k × 1) vector of coefficients, and Z_i is a (k × 2) matrix of covariates. The ML estimates of β and Σ can be obtained by the ECM algorithm, and the Bayesian analysis can be conducted by placing the semi-conjugate prior distribution,

β ~ N(β_0, A_0^{-1}) and Σ ~ InvWishart(ν_0, S_0^{-1}),

where β_0 is a (k × 1) vector of prior means, and A_0 is a (k × k) prior precision matrix. The MCMC algorithm can be used to sample from the posterior distribution.
Finally, Imai, Lu, and Strauss (2008b) suggest an alternative approach in which the contextual effects are directly modeled without additional covariates. This NCAR (not coarsened at random) model is formally defined as,

(W*_i1, W*_i2, X*_i) | η, Φ ~ indep. N(η, Φ),

where X*_i = logit(X_i), η is a (3 × 1) vector of population means, and Φ is a (3 × 3) covariance matrix. The ML estimates of η and Φ can be obtained by the EM algorithm, whereas the Bayesian analysis of the NCAR model can be conducted in the same way as under the CAR model, except that the NCAR model relies upon the trivariate normal distribution rather than the bivariate normal distribution. An advantage of the NCAR model over the CCAR model is that the former does not require the availability of additional covariates to model the contextual effects. Indeed, under the NCAR model, one need not specify the conditional expectation function of W*_i given Z_i. The eco package implements all three models within the Bayesian or maximum likelihood framework (see Section 3.2).

Nonparametric Models
To address the distributional effects, Imai, Lu, and Strauss (2008b) propose Bayesian nonparametric models based on a Dirichlet process prior (e.g., Dey et al., 1998). This model generalizes the CAR and NCAR parametric models to the case of an unknown distribution of W*_i. Under the CAR assumption, the Bayesian nonparametric model can be written as follows,

W*_i | μ_i, Σ_i ~ indep. N(μ_i, Σ_i), with (μ_i, Σ_i) | G ~ G and G ~ D(G_0, α),

where D(G_0, α) represents the Dirichlet process prior with the base prior distribution G_0 and the scalar concentration parameter α. Under G_0, (μ_i, Σ_i) is distributed as,

μ_i | Σ_i ~ N(μ_0, Σ_i / τ_0) and Σ_i ~ InvWishart(ν_0, S_0^{-1}).

The MCMC algorithm summarized in Imai, Lu, and Strauss (2008b) can be used to sample from the posterior distribution of this model. Furthermore, the nonparametric NCAR model can be formulated in the same manner by using the parametric NCAR model as the base model and specifying the Dirichlet process prior distribution on (η_i, Φ_i), where η and Φ are now indexed by i. The package eco implements this Bayesian nonparametric model under both the CAR and NCAR assumptions (see Section 3.4).
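The Dirichlet process prior can be understood through the stick-breaking construction: a random distribution G ~ D(G_0, α) is an infinite mixture of point masses whose weights are obtained by repeatedly breaking off Beta(1, α) fractions of a unit-length stick. The Python sketch below (illustrative only; a standard normal stands in for the base distribution G_0, and the draw is truncated at a finite number of atoms) draws one such realization:

```python
# Sketch of a truncated draw G ~ D(G0, alpha) via stick-breaking.
import random

def stick_breaking(alpha, n_atoms, seed=0):
    """Return (weights, atoms) of a truncated stick-breaking draw.
    Larger alpha spreads mass over more atoms (less clustering)."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1, alpha)      # break off a Beta(1, alpha) fraction
        weights.append(remaining * v)
        remaining *= (1 - v)               # shorten the remaining stick
    atoms = [rng.gauss(0, 1) for _ in range(n_atoms)]  # atoms drawn from G0
    return weights, atoms
```

Because observations sharing an atom share the same (μ_i, Σ_i), the model behaves like a mixture of normals with a data-driven number of components, which is what frees the analyst from a single parametric distributional assumption.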

Formal Assessment of Aggregation Effects
The fourth method we implement via the eco package is the formal assessment of aggregation effects under the parametric models. Imai, Lu, and Strauss (2008b) propose to measure the effect of data aggregation on parameter estimation and hypothesis testing by calculating the fraction of missing information. The idea is to quantify the amount of information the observed aggregate-level data provide in comparison with the information one would obtain if the individual-level data were available. In the context of parameter estimation, the fraction of missing information is defined as,

F_θ = diag( I − I_obs I_com^{-1} ),    (5)

where I_obs is the observed Fisher information matrix and I_com represents the expected information matrix based on the complete-data log-likelihood function. Then, each element of the vector F_θ represents the fraction of missing information for each parameter. In the eco package, we use the Supplemented EM (SEM) algorithm (Meng and Rubin, 1991) to compute the fraction of missing information for the parametric CAR and NCAR models (see Section 3.3).
For hypothesis testing, we follow the approach proposed by Kong, Meng, and Nicolae (2005) and compute the fraction of missing information against the null hypothesis H_0 : θ = θ_0, which is defined by,

F_H = 1 − { l_obs(θ̂ | Y, X) − l_obs(θ_0 | Y, X) } / E{ l_com(θ̂ | W, X) − l_com(θ_0 | W, X) },    (6)

where l_obs(θ | Y, X) and l_com(θ | W, X) represent the observed-data and the complete-data log-likelihood functions, respectively. Moreover, θ̂ is the ML estimate of θ, and the expectation is taken over the conditional distribution of W given (Y, X). Thus, F_H equals one minus the logarithm of the observed likelihood ratio statistic divided by the logarithm of the expected likelihood ratio statistic. In the eco package, we use the SEM algorithm and compute F_H under the null hypothesis of equal marginal means, i.e., H_0 : E(W_1) = E(W_2), for the parametric CAR and NCAR models (see Section 3.3).
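As a small numerical illustration of both quantities (not eco output; the matrices and log-likelihood-ratio values are made up, and we take the estimation-side fraction to be the diagonal of I − I_obs I_com^{-1}, the EM rate matrix of Meng and Rubin):

```python
# Sketch: fractions of missing information for estimation and testing.
def frac_missing_info(I_obs, I_com):
    """Diagonal of I - I_obs * I_com^{-1} for 2 x 2 information matrices
    (given as nested lists); one fraction per parameter."""
    (a, b), (c, d) = I_com
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]       # I_com^{-1}
    prod = [[sum(I_obs[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]                              # I_obs * I_com^{-1}
    return [1 - prod[0][0], 1 - prod[1][1]]

def frac_missing_info_test(log_lr_obs, log_lr_com):
    """F_H: one minus the observed log-likelihood-ratio statistic divided
    by the expected complete-data log-likelihood-ratio statistic."""
    return 1 - log_lr_obs / log_lr_com
```

For example, if the observed data carry half the complete-data information about the first parameter, the corresponding element of F_θ is 0.5, and an observed log-likelihood-ratio of 2 against an expected complete-data value of 8 gives F_H = 0.75.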

Additional Individual-Level Data
When the bounds are not informative, ecological inference is difficult. The parametric inference will be sensitive to modeling assumptions, and the nonparametric model will not be able to recover the underlying distribution. Therefore, incorporating individual-level data may be helpful whenever such additional information is available. For example, one might conduct a survey in randomly selected counties to obtain such information. Sometimes, a small-scale survey can be conducted to get rough estimates of W_i for some counties, and incorporating such auxiliary information can also be helpful (Wakefield, 2004). In the eco package, it is straightforward to incorporate such information into the estimation of both parametric and nonparametric models (see Section 3.5).

Illustrative Examples
In this section, we illustrate how to implement the methods described in Section 2 via the package eco, using several example data sets that are also part of the package. The detailed references for the commands and data sets we use appear in Appendices B and C, respectively.

Computing the Bounds
We first consider the computation of the bounds described in Section 2.1 using the function ecoBD(). We illustrate the use of this function with the voter registration data from 275 counties of four Southern states in the United States: Florida, Louisiana, North Carolina, and South Carolina. The data set is taken from King (1997) and is available as reg as part of the eco package. To load this data set, type at the R prompt (after loading the package via the library(eco) command),

> data(reg)
which stores the data frame reg in the workspace. In this data set, X is the fraction of black voters in each county, Y represents the fraction of registered voters, and N is the total number of voters in each county. The registration rates are also observed separately for blacks and whites, and are given by W1 and W2, respectively; a summary of these variables can be viewed via summary(reg). To compute the bounds using the reg data set, we simply use the following syntax,

> res <- ecoBD(Y ~ X, N = N, data = reg)

which prints out the aggregate lower and upper bounds. For example, the registration rate for blacks lies between 0.34 and 0.99, while that for whites is between 0.70 and 0.93. The actual registration rates for blacks and whites (which are usually unknown but in this case can be estimated using the sample means of W1 and W2 in the data set reg) are 0.56 and 0.86, respectively.
The county-level bounds are also stored in the output object from ecoBD(); for example, the bounds for the first county can be obtained by extracting the first element of the stored bounds. In terms of counts, the county-level bounds are stored as Nmin and Nmax.

Fitting the Parametric Models
In this section, we illustrate how to use the eco package to fit the parametric ecological inference models. First, we review the maximum likelihood (ML) estimation of the parametric models described in Section 2.2 using the function ecoML(). We then demonstrate the fitting of the Bayesian parametric model using the eco() function. The data set used to illustrate these functions is the race and literacy data set first collected by Robinson (1950) at the state level, and then refined to the county level by King (1997). For 1,040 counties, the marginal percentage of blacks (X), the marginal literacy rate (Y), and the population (N) are available, as well as the true cell-level values for literacy among blacks (W1) and whites (W2). As before, the data set can be loaded via data(census), and its summary statistics can be viewed by typing summary(census) at the R prompt.

The Maximum Likelihood Estimation. We first demonstrate the ML estimation of the CAR model, which assumes no contextual effect, via the EM algorithm. Using the default values provided by the function (see the command reference in Appendix B), with the exception of suppressing the output, the following command fits the model and stores its output as res.ML,

> res.ML <- ecoML(Y ~ X, data = census, verbose = FALSE)

As this fitting process via the EM algorithm may take a long time on some computers, setting verbose to TRUE (its default value) tracks the progress of the program. The OPTIONS output line is for verification purposes, and informs the user that there is no contextual effect to be modeled, that the correlation parameter (ρ) is not fixed by the user, and that the fraction of missing information will be calculated via the SEM algorithm. The next lines of output display the iteration number of the EM loop, the parameter values for that iteration, and the observed log-likelihood for the previous set of parameter values.
As shown by the cycle 1 output line, the default starting parameter values for CAR are μ_1 = 0, μ_2 = 0, σ_1^2 = 1, σ_2^2 = 1, and ρ = 0.
A few remarks about convergence are worth mentioning, although we refer readers to the relevant literature for the details (e.g., McLachlan and Krishnan, 1997). When the absolute difference between each parameter value and its value from the previous iteration falls below the convergence threshold, epsilon, the algorithm stops. The last line of the output displays the final, converged parameter estimates. The default value for epsilon is 10^-10; this threshold can be changed by the user. The maximum number of iterations to cycle through before halting (maxit) can also be adjusted; the default value for maxit is 1,000. Failure to set these inputs to appropriate values may result in inaccurate estimates.
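The stopping rule just described can be sketched in a few lines of Python (illustrative only, not the internals of ecoML(); the update function that performs one E-step and M-step is assumed given):

```python
# Sketch of an EM driver with the epsilon / maxit stopping rule described above.
def run_em(update, theta, epsilon=1e-10, maxit=1000):
    """Iterate `update` (one EM step on a parameter list) until every
    parameter moves by less than epsilon, or maxit iterations are reached.
    Returns (final parameters, number of iterations used)."""
    for it in range(1, maxit + 1):
        new = update(theta)
        if max(abs(n - o) for n, o in zip(new, theta)) < epsilon:
            return new, it
        theta = new
    return theta, maxit
```

With a toy linear update that halves each parameter, the change at iteration k is 0.5^k, so the rule fires at the first k with 0.5^k < 10^-10, i.e., k = 34; a too-small maxit would instead halt early with an inaccurate estimate, which is the failure mode warned about above.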
Once the EM algorithm is completed, the SEM algorithm will begin automatically (as long as the SEM parameter is not set to FALSE). See Section 3.3 for details on executing the SEM algorithm. If the fraction of missing information is not desired, the user should set SEM to FALSE to save computation time.
Summary statistics for the fitted model can be displayed simply by using the summary() function (a shorter summary is available via the print() function). The "weighted" in-sample predictions are computed based on weights proportional to the group-specific population size, while assuming the overall population size is the same across counties (when information about the overall population size is available, it can be supplied via the option N, as shown in the next example below).
Under the CAR assumption, the point estimate for the unweighted, mean black literacy rate is 66% and for the unweighted, mean white literacy, 92%. The estimated unweighted standard deviations for the county-level black and white literacy rates are 7.6% and 5.9% respectively. The correlation between logit-transformed county literacy rates is estimated to be 0.271 (rho). In addition to the aggregate-level in-sample predictions, county-level estimates are available in the n × 2 matrix res.ML$W. Use the names() command to display all available elements of the output object returned by the ecoML() function.
Bayesian Estimation. Next, we illustrate how to fit the Bayesian parametric models via the MCMC algorithms using eco. Here, we use the NCAR model, which, unlike the CAR model, allows for a contextual effect. First, the model can be fitted by the Gibbs sampler using the following syntax,

> res <- eco(Y ~ X, N = N, data = census, context = TRUE, parameter = TRUE, verbose = TRUE)

where context = TRUE indicates that the NCAR ecological inference model is to be fitted, parameter = TRUE means that the posterior draws of the parameters will be saved in the output res in order to make out-of-sample predictions based on the fitted model, and N specifies the total number of individual-level observations within each aggregate unit so that both weighted and unweighted estimates can be obtained. The progress of the Gibbs sampler is printed to the screen if verbose is set to TRUE. Moreover, one can specify the prior distribution for the parameters of the multivariate normal distribution in eco() (see Appendix B for details). The default is a noninformative prior, which is approximately uniform on W_i. In this example, the other parameters are all set to their default values. In particular, only 5,000 Gibbs draws are taken, with no initial burn-in draws. In practice, to ensure proper convergence, the MCMC should be run for a longer period and the initial draws should be discarded so that inference is based on draws from the target posterior distribution. We refer the readers to standard texts on Bayesian data analysis for general advice on convergence (e.g., Gelman et al., 2004). Here, we implement the following syntax, which fits the same NCAR model as above but discards the initial 20,000 draws from a total of 50,000 draws while saving every 10th draw (thus using a thinning interval of 10),

> res <- eco(Y ~ X, N = N, data = census, context = TRUE, parameter = TRUE, n.draws = 50000, burnin = 20000, thin = 9, verbose = TRUE)

The summary of the fitted model can be viewed using the summary() function,

> summary(res)
Call: eco(formula = Y ~ X, data = census, N = N, context = TRUE, parameter = TRUE, n.draws = 50000, burnin = 20000, thin = 9, verbose = TRUE)

where the first part of the output summarizes the posterior distribution of the parameters of the trivariate normal distribution for the NCAR model; the coordinates of the distribution are (W*_1, W*_2, X*), i.e., the logit-transformed black literacy rate, white literacy rate, and black composition, respectively. Since N is specified in eco(), the "weighted" in-sample predictions aggregate the estimates according to the actual group-specific population sizes. After controlling for the possible correlation between racial composition and literacy rate (i.e., the contextual effect), the in-sample estimates, especially those of W1, are slightly improved compared to the CAR model (see the ML estimation of the CAR model earlier in this section).
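The burn-in and thinning arithmetic for a run like the one above is worth checking explicitly. Assuming eco retains every (thin + 1)-th draw after the burn-in period (consistent with thin = 9 saving every 10th draw, as stated above), the number of retained posterior draws per observation is:

```python
# Sketch: number of posterior draws retained after burn-in and thinning,
# assuming every (thin + 1)-th post-burn-in draw is kept.
def retained_draws(n_draws, burnin, thin):
    return (n_draws - burnin) // (thin + 1)
```

For the run above, (50000 − 20000) / 10 = 3,000 draws are retained, which is the per-observation draw count that reappears in the nonparametric example later (825,000 = 275 × 3,000).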
In addition, the predict() function allows one to obtain out-of-sample predictions of W_1 and W_2 based on their posterior predictive distributions, using the posterior draws of the model parameters. In the case of the NCAR model, the predictions are based on the values of X_i, which can be taken from another data set via the option newdata (the default, which we use here, is the data set used to fit the model). As before, the summary() function will summarize the results. The default number of Monte Carlo draws is the same as the number of MCMC draws stored in the object res, but this number can be changed by users via the newdraw option.
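The mechanics of such posterior predictive draws can be sketched as follows (illustrative Python only, not the predict() internals; we assume we are handed stored posterior draws of the means, standard deviations, and correlation on the logit scale, as in a CAR-type fit):

```python
# Sketch: posterior predictive draws of (W1, W2) from stored parameter draws.
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def predictive_draws(posterior, seed=0):
    """For each stored draw (mu1, mu2, s1, s2, rho), sample one new (W1, W2)
    from the bivariate normal on the logit scale and transform back."""
    rng = random.Random(seed)
    out = []
    for mu1, mu2, s1, s2, rho in posterior:
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        out.append((inv_logit(mu1 + s1 * z1), inv_logit(mu2 + s2 * z2)))
    return out
```

One predictive draw per stored parameter draw is exactly why the default number of Monte Carlo draws equals the number of MCMC draws stored in res.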

Quantifying the Aggregation Effects
We revisit the function ecoML() to outline the computation of the fraction of missing information described in Section 2.4. As in Section 3.2, the census data set is used. If the SEM option is left at its default value of TRUE, the ecoML() function proceeds to the SEM algorithm once the EM algorithm is completed. For instance, the fraction of missing information for the estimation of black literacy is about 62.6%. The analogous quantity for white literacy is lower because in some counties whites make up a very large proportion of the population, thus resulting in tighter bounds. In addition to computing the fraction of missing information for parameter estimation, the eco package includes limited functionality for hypothesis testing (see Section 2.4). Currently, setting the hyptest parameter of the ecoML() function to TRUE will calculate the quantities of interest with the parameters constrained to the null hypothesis μ_1 = μ_2. Continuing with the census example, this null hypothesis states that the mean black literacy rate is equal to the mean white literacy rate. To restrict the parameter space to this hypothesis, simply type,

> res.ML.HT <- ecoML(Y ~ X, data = census, verbose = FALSE, hyptest = TRUE)

With the results of the algorithm for both the constrained and unconstrained problems, the calculation of the fraction of missing information under this hypothesis is straightforward: first calculate the observed log-likelihood ratio statistic, then the expected complete-data log-likelihood ratio statistic, and then subtract the ratio of these two quantities from 1 (see equation 6).

Fitting the Nonparametric Models
To avoid the distributional assumptions common to parametric models, the eco package also fits the nonparametric Bayesian models described in Section 2.3. Here, we use the voter registration data set (see Section 3.1) to illustrate the use of the ecoNP() function, which fits the Bayesian nonparametric models via the MCMC algorithms. The following command fits the nonparametric model under the CAR assumption,

> res <- ecoNP(Y ~ X, N = N, data = reg, parameter = TRUE, n.draws = 50000, burnin = 20000, thin = 9, verbose = TRUE)

where the default prior specification places a diffuse distribution on the concentration parameter of the Dirichlet process prior. This default specification can be changed by the user (see Appendix B). Many of the inputs of ecoNP() are the same as those of eco(). For example, the nonparametric NCAR model can be fitted with the context = TRUE option. As with the Bayesian parametric models, one can summarize the in-sample estimates of W using summary(). Unlike the parametric models, however, no parameter estimates are presented, since the distribution of W is estimated nonparametrically based on a mixture of normal distributions,

> summary(res)

Call: ecoNP(formula = Y ~ X, data = reg, N = N, parameter = TRUE, n.draws = 50000, burnin = 20000, thin = 9, verbose = TRUE)

*** Insample Predictions ***

Since the distribution is estimated nonparametrically, the out-of-sample prediction is generated for each observation. Hence, the total number of Monte Carlo draws is 825,000 = 275 × 3,000.

Incorporating the Individual-level Data
If survey or other supplemental individual-level data are available, this information can easily be incorporated into the models. Adding such data will alleviate the adverse aggregation effects endemic to ecological inference. In the eco package, the three main functions, eco(), ecoML(), and ecoNP(), all accept supplemental individual-level data. In this section, we illustrate an example using ecoML() and eco().
Continuing with the county literacy rates example, assume that the actual within-county literacy rates for whites and blacks are collected for the first 100 counties (and only these counties). Simply create an n × 2 matrix of the supplemental data, with W1 in the first column and W2 in the second column,

> survey.records <- 1:100
> survey.data <- census[survey.records, c("W1", "W2")]

Next, execute ecoML(), setting the supplement parameter to the matrix of survey data. Under the NCAR model, the mean black literacy rate is estimated to be 71%, which is higher than estimated under the CAR model with the same survey data provided. The average white literacy rate is adjusted slightly downward in this more general model, from 94% to 93%. The change in the fraction of missing information is not monotonic when switching models: the information loss due to aggregation is larger for estimating the mean black literacy rate under NCAR, but smaller for estimating the mean white literacy rate.
Similarly, the eco package can fit Bayesian models with supplemental individual-level data. Below, we fit the parametric CAR and NCAR models via eco() using the same census data, with the data for the first 100 counties revealed.
data
An optional data frame in which to interpret the variables in formula. The default is the environment in which eco is called.

N
An optional variable representing the size of the unit; e.g., the total number of voters.

supplement
An optional matrix of supplemental data. The matrix has two columns, which contain additional individual-level data, such as survey data, for W_1 and W_2, respectively. If NULL, no additional individual-level data are included in the model. The default is NULL.

context
Logical. If TRUE, the contextual effect is also modeled; that is, the row margin X and the unknowns W_1 and W_2 are assumed to be correlated. See Imai, Lu, and Strauss (2008b) for details. The default is FALSE.

mu0
A scalar or a numeric vector that specifies the prior mean for the mean parameter µ for (W_1, W_2) (or for (W_1, W_2, X) if context=TRUE). If mu0 is a scalar, its value will be repeated to yield a vector of the length of µ; otherwise, it needs to be a vector of the same length as µ. When context=TRUE, the length of µ is 3; otherwise, it is 2. The default is 0.

tau0
A positive integer representing the scale parameter of the Normal-Inverse-Wishart prior for the mean and variance parameters (µ, Σ). The default is 2.

nu0
A positive integer representing the prior degrees of freedom of the Normal-Inverse-Wishart prior for the mean and variance parameters (µ, Σ). The default is 4.

S0
A positive scalar or a positive definite matrix that specifies the prior scale matrix of the Normal-Inverse-Wishart prior for the mean and variance parameters (µ, Σ). If it is a scalar, the prior scale matrix will be a diagonal matrix with the same dimensions as Σ, with the diagonal elements all taking the value of S0; otherwise, S0 needs to have the same dimensions as Σ. When context=TRUE, Σ is a 3 × 3 matrix; otherwise, it is 2 × 2. The default is 10.

mu.start
A scalar or a numeric vector that specifies the starting values of the mean parameter µ. If it is a scalar, its value is repeated to yield a vector of the length of µ; otherwise, it must be a vector of the same length as µ. When context=FALSE, the length of µ is 2; otherwise it is 3. The default is 0.
Sigma.start A scalar or a positive definite matrix that specifies the starting value of the variance matrix Σ. If it is a scalar, the starting value is a diagonal matrix of the same dimensions as Σ with all diagonal elements equal to Sigma.start; otherwise, Sigma.start must have the same dimensions as Σ. When context=TRUE, Σ is a 3 × 3 matrix; otherwise, it is 2 × 2. The default is 10.
where Y i and X i represent the observed margins, and W 1 and W 2 are unknown variables. In this example, Y i is the turnout rate in the ith precinct, and X i is the proportion of African Americans in the ith precinct. The unknowns W 1i and W 2i are the black and white turnout rates, respectively. All variables are proportions and hence bounded between 0 and 1. For each i, the following deterministic relationship holds: Y i = X i W 1i + (1 − X i ) W 2i .
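This accounting identity, Y i = X i W 1i + (1 − X i ) W 2i , can be checked numerically in base R with made-up precinct-level values (illustrative numbers, not data from the package):

```r
# Hypothetical precinct-level proportions (illustrative values only)
X  <- c(0.20, 0.55, 0.80)   # proportion African American
W1 <- c(0.50, 0.60, 0.70)   # black turnout rate (unobserved in practice)
W2 <- c(0.40, 0.45, 0.55)   # white turnout rate (unobserved in practice)

# The observed overall turnout follows deterministically from the margins
Y <- X * W1 + (1 - X) * W2
Y   # 0.42, 0.5325, 0.67
```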

Value
An object of class eco containing the following elements: call The matched call.

X
The row margin, X.

Y
The column margin, Y .

N
The size of each table, N .

burnin
The number of initial burn-in draws.

thin
The thinning interval.

nu0
The prior degrees of freedom.

tau0
The prior scale parameter.

mu0
The prior mean.

S0
The prior scale matrix.
W A three dimensional array storing the posterior in-sample predictions of W . The first dimension indexes the Monte Carlo draws, the second dimension indexes the columns of the table, and the third dimension represents the observations.

Wmin
A numeric matrix storing the lower bounds of W .

Wmax
A numeric matrix storing the upper bounds of W .

mu
The posterior draws of the population mean parameter, µ.

Sigma
The posterior draws of the population variance matrix, Σ.

data
An optional data frame in which to interpret the variables in formula. The default is the environment in which ecoBD is called.

N
An optional variable representing the size of the unit; e.g., the total number of voters. If formula is entered as counts and the last row and/or column is omitted, this input is necessary.

Details
The data may be entered either in the form of counts or proportions. If proportions are used, formula may omit the last row and/or column of the tables, which can be calculated from the remaining margins. For example, Y ~ X specifies Y as the first column margin and X as the first row margin in 2 × 2 tables. If counts are used, formula may omit the last row and/or column margin of the table only if N is supplied. In this example, the rows will be labeled as X and not X, and the columns will be labeled as Y and not Y.
An R × C ecological table in the form of counts: where n ir. and n i.c represent the observed margins, N i represents the size of the table, and n irc are unknown variables. Note that for each i, the following deterministic relationships hold: n ir. = Σ c n irc for r = 1, . . . , R, and n i.c = Σ r n irc for c = 1, . . . , C. Then, each of the unknown inner cells can be bounded in the following manner: max(0, n ir. + n i.c − N i ) ≤ n irc ≤ min(n ir. , n i.c ).
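These count bounds are what ecoBD computes for every table; a base-R sketch for a single table illustrates the formula (the function name dd_bounds_counts is ours, not part of the package):

```r
# Duncan-Davis bounds on the inner cells of one R x C table of counts.
# n_r: row margins (length R); n_c: column margins (length C).
dd_bounds_counts <- function(n_r, n_c) {
  N <- sum(n_r)
  stopifnot(isTRUE(all.equal(N, sum(n_c))))  # the two margins must agree
  list(lower = outer(n_r, n_c, function(r, c) pmax(0, r + c - N)),
       upper = outer(n_r, n_c, pmin))
}

b <- dd_bounds_counts(n_r = c(60, 40), n_c = c(70, 30))
b$lower[1, 1]   # max(0, 60 + 70 - 100) = 30
b$upper[1, 1]   # min(60, 70) = 60
```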
If the sizes of the tables, N , are provided, the bounds on the counts are also computed. An R × C ecological table in the form of proportions: where Y ir and X ic represent the observed margins, and W irc are unknown variables. Note that for each i, the following deterministic relationships hold: Y ir = Σ c X ic W irc for r = 1, . . . , R, and Σ r W irc = 1 for c = 1, . . . , C. Then, each of the inner cells of the table can be bounded in the following manner: max(0, (Y ir + X ic − 1)/X ic ) ≤ W irc ≤ min(1, Y ir /X ic ).
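Dividing the count bounds by the corresponding column total gives the analogous bounds on the proportions, max(0, (Y ir + X ic − 1)/X ic ) ≤ W irc ≤ min(1, Y ir /X ic ); ecoBD reports these as Wmin and Wmax. A base-R sketch of that derivation (dd_bounds_prop is our illustrative helper, not a package function):

```r
# Bounds on the proportions W_irc, obtained by dividing the count bounds
# by the column total.  Y_r: row margins; X_c: column margins (each sums to 1).
dd_bounds_prop <- function(Y_r, X_c) {
  list(lower = outer(Y_r, X_c, function(y, x) pmax(0, (y + x - 1) / x)),
       upper = outer(Y_r, X_c, function(y, x) pmin(1, y / x)))
}

p <- dd_bounds_prop(Y_r = c(0.6, 0.4), X_c = c(0.7, 0.3))
p$lower[1, 1]   # (0.6 + 0.7 - 1) / 0.7, about 0.43
p$upper[1, 1]   # 0.6 / 0.7, about 0.86
```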

Value
An object of class ecoBD containing the following elements (When three dimensional arrays are used, the first dimension indexes the observations, the second dimension indexes the row numbers, and the third dimension indexes the column numbers): call The matched call.

Wmin
A three dimensional array of lower bounds for proportions.

Wmax
A three dimensional array of upper bounds for proportions.

Nmin
A three dimensional array of lower bounds for counts.

Nmax
A three dimensional array of upper bounds for counts.
The object can be printed through print.ecoBD.

Description
ecoML is used to fit parametric models for ecological inference in 2 × 2 tables via Expectation-Maximization (EM) algorithms. The data are specified in proportions. In its most basic setting, the algorithm assumes that the individual-level proportions (i.e., W 1 and W 2 ) are distributed bivariate normally (after logit transformations). The function calculates point estimates of the parameters for models based on different assumptions. The standard errors of the point estimates are also computed via the Supplemented EM algorithm. Moreover, ecoML quantifies the amount of missing information associated with each parameter and allows researchers to examine the impact of missing information on parameter estimation in ecological inference. The models and algorithms are described in Imai, Lu and Strauss (2008b).

Arguments formula
A symbolic description of the model to be fit, specifying the column and row margins of 2×2 ecological tables. Y~X specifies Y as the column margin (e.g., turnout) and X (e.g., percent African-American) as the row margin. Details and specific examples are given below.
data An optional data frame in which to interpret the variables in formula. The default is the environment in which ecoML is called.

N
An optional variable representing the size of the unit; e.g., the total number of voters.
supplement An optional matrix of supplemental data. The matrix has two columns, which contain additional individual-level data such as survey data for W 1 and W 2 , respectively. If NULL, no additional individual-level data are included in the model. The default is NULL.
fix.rho Logical. If TRUE, the correlation (when context=TRUE) or the partial correlation (when context=FALSE) between W 1 and W 2 is fixed throughout the estimation. For details, see Imai, Lu and Strauss (2008b). The default is FALSE.
context Logical. If TRUE, the contextual effect is also modeled. In this case, the row margin (i.e., X) and the individual-level rates (i.e., W 1 and W 2 ) are assumed to be distributed tri-variate normally (after logit transformations). See Imai, Lu and Strauss (2008b) for details. The default is FALSE.
sem Logical. If TRUE, the standard errors of the parameter estimates, as well as the fractions of missing information, are estimated via the SEM algorithm. The default is TRUE.
theta.start A numeric vector that specifies the starting values for the mean, variance, and covariance. When context = FALSE, the elements of theta.start correspond to (E(W 1 ), E(W 2 ), var(W 1 ), var(W 2 ), corr(W 1 , W 2 )). When context = TRUE, the elements of theta.start correspond to (E(W 1 ), E(W 2 ), var(W 1 ), var(W 2 ), corr(W 1 , X), corr(W 2 , X), corr(W 1 , W 2 )). Moreover, when fix.rho=TRUE, corr(W 1 , W 2 ) is set to the correlation between W 1 and W 2 when context = FALSE, and to the partial correlation between W 1 and W 2 given X when context = TRUE. The default is c(0,0,1,1,0).
epsilon A positive number that specifies the convergence criterion for the EM algorithm. The square root of epsilon is the convergence criterion for the SEM algorithm. The default is 10^(-10).
maxit A positive integer specifying the maximum number of iterations before the convergence criterion is met. The default is 1000.
loglik Logical. If TRUE, the value of the log-likelihood function at each iteration of EM is saved. The default is TRUE.
hyptest Logical. If TRUE, model is estimated under the null hypothesis that means of W 1 and W 2 are the same. The default is FALSE.
verbose Logical. If TRUE, the progress of the EM and SEM algorithms is printed to the screen. The default is FALSE.

Details
When sem is TRUE, ecoML computes the observed-data information matrix for the parameters of interest based on the Supplemented EM algorithm. The inverse of the observed-data information matrix can be used to estimate the variance-covariance matrix for the parameters estimated via the EM algorithm. In addition, it also computes the expected complete-data information matrix. Based on these two measures, one can further calculate the fraction of missing information associated with each parameter. See Imai, Lu and Strauss (2008b) for more details about the fraction of missing information.
Moreover, when hyptest=TRUE, ecoML estimates the parametric model under the null hypothesis that µ 1 = µ 2 . One can then construct a likelihood ratio test to assess the hypothesis of equal means. The associated fraction of missing information for the test statistic can also be calculated. For details, see Imai, Lu and Strauss (2008b).
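A typical ecoML call on the census data might look as follows (a sketch; the default settings are assumed to be adequate here):

```r
library(eco)
data(census)

# ML fit under the CAR assumption; sem = TRUE additionally runs the
# Supplemented EM algorithm, so standard errors and fractions of
# missing information are computed.
ml.car <- ecoML(Y ~ X, data = census, N = N, sem = TRUE)

# NCAR model: include the contextual effect of X
ml.ncar <- ecoML(Y ~ X, data = census, N = N, context = TRUE, sem = TRUE)

# Fraction of missing information associated with each parameter
ml.car$Fmis
```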

Value
An object of class ecoML containing the following elements: call The matched call.

X
The row margin, X.

Y
The column margin, Y .

N
The size of each table, N .

context
The assumption under which the model is estimated. If context = FALSE, the CAR assumption is adopted and no contextual effect is modeled. If context = TRUE, the NCAR assumption is adopted and the contextual effect is modeled.
sem Whether SEM algorithm is used to estimate the standard errors and observed information matrix for the parameter estimates.

fix.rho
Whether the correlation or the partial correlation between W 1 and W 2 is fixed in the estimation.

r12
If fix.rho = TRUE, the value that corr(W 1 , W 2 ) is fixed to.

epsilon
The convergence criterion for the EM algorithm. The square root of epsilon is the convergence criterion for the SEM algorithm.

W
In-sample estimation of W 1 and W 2 .

suff.stat
The sufficient statistics for theta.em.
iters.em Number of EM iterations before convergence is achieved.
iters.sem Number of SEM iterations before convergence is achieved.
loglik The log-likelihood of the model when convergence is achieved.
loglik.log.em A vector saving the value of the log-likelihood function at each iteration of the EM algorithm.
mu.log.em A matrix saving the unweighted mean estimation of the logit-transformed individual-level proportions (i.e., W 1 and W 2 ) at each iteration of the EM process.

Sigma.log.em
A matrix saving the log of the variance estimates of the logit-transformed individual-level proportions (i.e., W 1 and W 2 ) at each iteration of the EM process. Note that the non-transformed variances are displayed on the screen (when verbose = TRUE).

rho.fisher.em
A matrix saving the Fisher transformation of the estimates of the correlations between the logit-transformed individual-level proportions (i.e., W 1 and W 2 ) at each iteration of the EM process. Note that the non-transformed correlations are displayed on the screen (when verbose = TRUE).

DM
The matrix characterizing the rates of convergence of the EM algorithm. This information is also used to calculate the observed-data information matrix.

Icom
The (expected) complete-data information matrix estimated via the SEM algorithm. When context=FALSE and fix.rho=TRUE, Icom is 4 × 4. When context=FALSE and fix.rho=FALSE, Icom is 5 × 5. When context=TRUE, Icom is 9 × 9.

Iobs
The observed-data information matrix. Its dimension is the same as that of Icom.

Imiss
The difference between Icom and Iobs. Its dimension is the same as that of Icom.

Vobs
The (symmetrized) variance-covariance matrix of the ML parameter estimates. Its dimension is the same as that of Icom.

Vcom
The (expected) complete-data variance-covariance matrix. Its dimension is the same as that of Icom.

Vobs.original
The estimated variance-covariance matrix of the ML parameter estimates. Its dimension is the same as that of Icom.

Fmis
The fraction of missing information associated with each parameter estimate.

VFmis
The proportion of increased variance associated with each parameter estimate due to missing data.

Ieigen
The largest eigenvalue of Imiss.
Icom.trans The complete-data information matrix for the Fisher-transformed parameters.
Iobs.trans The observed-data information matrix for the Fisher-transformed parameters.
Fmis.trans The fractions of missing information associated with the Fisher-transformed parameters.
Description
ecoNP is used to fit the nonparametric Bayesian model (based on a Dirichlet process prior) for ecological inference in 2 × 2 tables via Markov chain Monte Carlo. It gives in-sample predictions as well as out-of-sample predictions for population inference. The models and algorithms are described in Imai, Lu and Strauss (2008b).
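A sketch of a typical ecoNP call follows; the number of draws and the gamma prior parameters a0 and b0 are illustrative choices, not recommendations:

```r
library(eco)
data(census)

# Nonparametric Bayesian model with a Dirichlet process prior; with
# alpha = NULL (the default), the concentration parameter is updated at
# each Gibbs draw under a gamma prior with parameters a0 and b0.
res.np <- ecoNP(Y ~ X, N = N, data = census,
                n.draws = 10000, burnin = 5000, thin = 5,
                a0 = 1, b0 = 0.1)
```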
data An optional data frame in which to interpret the variables in formula. The default is the environment in which ecoNP is called.
N An optional variable representing the size of the unit; e.g., the total number of voters.
supplement An optional matrix of supplemental data. The matrix has two columns, which contain additional individual-level data such as survey data for W 1 and W 2 , respectively. If NULL, no additional individual-level data are included in the model. The default is NULL.
context Logical. If TRUE, the contextual effect is also modeled; that is, the row margin X and the unknown W 1 and W 2 are assumed to be correlated. See Imai, Lu and Strauss (2008b) for details. The default is FALSE.

mu0
A scalar or a numeric vector that specifies the prior mean for the mean parameter µ of the base prior distribution G 0 (see Imai, Lu and Strauss (2008b) for detailed descriptions of the Dirichlet process prior and the normal base prior distribution). If it is a scalar, its value is repeated to yield a vector of the length of µ; otherwise, it must be a vector of the same length as µ. When context=TRUE, the length of µ is 3; otherwise it is 2. The default is 0.
tau0 A positive integer representing the scale parameter of the Normal-Inverse Wishart prior for the mean and variance parameter (µ i , Σ i ) of each observation. The default is 2.
nu0 A positive integer representing the prior degrees of freedom of the variance matrix Σ i . The default is 4.

S0
A positive scalar or a positive definite matrix that specifies the prior scale matrix for the variance matrix Σ i . If it is a scalar, the prior scale matrix is a diagonal matrix of the same dimensions as Σ i with all diagonal elements equal to S0; otherwise, S0 must have the same dimensions as Σ i . When context=TRUE, Σ i is a 3 × 3 matrix; otherwise, it is 2 × 2. The default is 10.
alpha A positive scalar representing a user-specified fixed value of the concentration parameter, α. If NULL, α will be updated at each Gibbs draw, and its prior parameters a0 and b0 need to be specified. The default is NULL.

a0
A positive scalar representing the shape parameter of the gamma prior distribution for α.

Value
An object of class ecoNP containing the following elements: call The matched call.

X
The row margin, X.

Y
The column margin, Y .

burnin
The number of initial burn-in draws.

thin
The thinning interval.

nu0
The prior degrees of freedom.

tau0
The prior scale parameter.

mu0
The prior mean.

S0
The prior scale matrix.

a0
The prior shape parameter.

b0
The prior scale parameter.
W A three dimensional array storing the posterior in-sample predictions of W . The first dimension indexes the Monte Carlo draws, the second dimension indexes the columns of the table, and the third dimension represents the observations.

Wmin
A numeric matrix storing the lower bounds of W .

Wmax
A numeric matrix storing the upper bounds of W .

mu
A three dimensional array storing the posterior draws of the population mean parameter, µ. The first dimension indexes the Monte Carlo draws, the second dimension indexes the columns of the table, and the third dimension represents the observations.

Sigma
A three dimensional array storing the posterior draws of the population variance matrix, Σ. The first dimension indexes the Monte Carlo draws, the second dimension indexes the parameters, and the third dimension represents the observations.

alpha
The posterior draws of α.

obs
An integer or vector of integers specifying the observation number(s) whose posterior draws will be used for predictions. The default is NULL, in which case all the observations in the data set are selected.
cond Logical. If TRUE, the conditional prediction will be made for the parametric model with contextual effects. The default is FALSE.
verbose Logical. If TRUE, helpful messages along with a progress report on the Monte Carlo sampling from the posterior predictive distributions are printed on the screen. The default is FALSE.
... further arguments passed to or from other methods.

Details
The posterior predictive values are computed using the Monte Carlo sample stored in the eco or ecoNP output (or other sample if newdraw is specified). Given each Monte Carlo sample of the parameters, we sample the vector-valued latent variable from the appropriate multivariate Normal distribution. Then, we apply the inverse logit transformation to obtain the predictive values of proportions, W . The computation may be slow (especially for the nonparametric model) if a large Monte Carlo sample of the model parameters is used. In either case, setting verbose = TRUE may be helpful in monitoring the progress of the code.
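For example, assuming res.np is a fitted ecoNP object from an earlier call (a sketch):

```r
# Out-of-sample predictions from the posterior predictive distribution;
# verbose = TRUE prints a progress report, which is helpful for the
# slower nonparametric model.
pred <- predict(res.np, verbose = TRUE)
summary(pred)
```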
Value
predict.eco yields a matrix of class predict.eco containing the Monte Carlo sample from the posterior predictive distribution of the inner cells of the ecological tables. summary.predict.eco will summarize the output, and print.summary.predict.eco will print the summary.

## S3 method for class 'summary.eco':
print(x, digits = max(3, getOption("digits") - 3), ...)

Arguments
object An output object from eco.

CI
A vector of lower and upper bounds for the Bayesian credible intervals used to summarize the results. The default is the equal tail 95 percent credible interval.
x An object of class summary.eco.
digits The number of significant digits to use when printing.
param Logical. If TRUE, the posterior estimates of the population parameters will be provided. The default value is TRUE.
units Logical. If TRUE, the in-sample predictions for each unit or for a subset of units will be provided. The default value is FALSE.
subset A numeric vector indicating the subset of the units whose in-sample predictions are to be provided when units is TRUE. The default value is NULL, in which case the in-sample predictions for every unit are provided.
... further arguments passed to or from other methods.
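Assuming res is a fitted eco object, a summary call might look as follows (a sketch; the 90 percent interval and the unit subset are illustrative):

```r
# Population parameters plus in-sample predictions for the first ten
# units, summarized with 90 percent equal-tail credible intervals
summary(res, CI = c(5, 95), param = TRUE, units = TRUE, subset = 1:10)
```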
Value
summary.eco yields an object of class summary.eco containing the following elements:
call The call from eco.

n.obs
The number of units.

n.draws
The number of Monte Carlo samples.

agg.table
Aggregate posterior estimates of the marginal means of W 1 and W 2 using X and N as weights.

CI
A vector of lower and upper bounds for the Bayesian credible intervals used to summarize the results. The default is the equal tail 95 percent credible interval.
x An object of class summary.ecoNP.
digits The number of significant digits to use when printing.
param Logical. If TRUE, the posterior estimates of the population parameters will be provided. The default value is FALSE.
units Logical. If TRUE, the in-sample predictions for each unit or for a subset of units will be provided. The default value is FALSE.
subset A numeric vector indicating the subset of the units whose in-sample predictions are to be provided when units is TRUE. The default value is NULL, in which case the in-sample predictions for every unit are provided.
... further arguments passed to or from other methods.
Value
summary.ecoNP yields an object of class summary.ecoNP containing the following elements:
call The call from ecoNP.

n.obs
The number of units.

n.draws
The number of Monte Carlo samples.

agg.table
Aggregate posterior estimates of the marginal means of W 1 and W 2 using X and N as weights.

Description
This data set contains the proportion of residents who are black, the proportion of residents who can read, the total population, as well as the actual black and white literacy rates for 1040 counties in the US. The data set was originally analyzed by Robinson (1950) at the state level; King (1997) recoded the 1910 census at the county level. The data set includes only those who are older than 10 years of age.

data(census)
Format
A data frame containing 5 variables and 1040 observations:
X numeric The proportion of Black residents in each county
Y numeric The overall literacy rate in each county
N numeric The total number of residents in each county
W1 numeric The actual Black literacy rate
W2 numeric The actual White literacy rate
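Once eco is loaded, the data set can be inspected directly (a sketch):

```r
library(eco)
data(census)

str(census)                      # 1040 rows; variables X, Y, N, W1, W2
summary(census$W1 - census$W2)   # county-level black-white literacy gap
```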