Constructing two-sided simultaneous confidence intervals for multinomial proportions for small counts in a large number of cells

Confidence intervals for multinomial proportions are often constructed using large-sample methods that rely on expected cell counts of 5 or greater. In situations that give rise to a large number of categories, the cell counts may not be of adequate size to ensure the appropriate overall coverage probability and alternative methods of construction have been proposed. Sison and Glaz (1995) developed a method of constructing two-sided confidence intervals for multinomial proportions that is based on the doubly truncated Poisson distribution and their method performs well when the cell counts are fairly equally dispersed over a large number of categories. In fact, the Sison and Glaz (1995) intervals appear to outperform other methods of simultaneous construction in terms of coverage probabilities and interval length in these situations. To make the method available to researchers, we have developed a SAS macro to construct the intervals proposed by Sison and Glaz (1995).


Introduction
Confidence intervals for multinomial proportions can be constructed by relying on the usual large-sample methods when the cell counts are adequate (>5 per cell) so that coverage probabilities are at or near the nominal confidence level. The authors [1] presented a SAS  macro for simultaneous confidence interval construction for multinomial proportions using several of the large-sample methods. The original macro was written in the interactive matrix language IML [2] and is easy to implement using a standard SAS macro call.
In some instances, however, the researcher may find that the cell counts in several cells are too small to rely on the large-sample methods. If the cell counts are fairly evenly dispersed over a large number of categories, methods proposed by Sison and Glaz [3] appear to maintain an adequate coverage probability where other methods perform poorly as evidenced by those authors' simulation comparisons. In the present paper, we develop a macro that constructs simultaneous intervals based on the methods proposed by Sison and Glaz [3]. We describe various methods of construction in Section 2 to differentiate the usual large-sample approaches from those of Sison and Glaz [3]. We discuss the SAS macro in Section 3 and illustrate the method with an example in Section 4. We discuss caveats in using the macro in Section 5. We give a slightly more detailed discussion of the computing aspects of the truncated Poisson probabilities in Appendix A. The macro is listed in Appendix B, an example of the calling routine in Appendix C. Sample output is listed in Appendix D.

Methods of constructing two-sided confidence intervals for multinomial proportions
Let n = (n 1 , …, n k ) T represent the vector of observed cell counts from a k x 1 classification table where n = n 1 + ... + n k is the total sample size. Thus, n i (i = 1, … , k) is the number of observations and p i = n i / n is the proportion observed in the i th cell of the k x 1 table (i = 1, ... , k). Assuming the total sample size n is fixed, the vector n is an observation from a multinomial distribution with parameters π π π π = (π 1 , ... , π k ) T where π i is the population proportion for the i th cell. The vector p = (p 1 , ... , p k ) T is the maximum likelihood estimator of π π π π and is unbiased. The variance of p i is π i (1 − π i )/n and is usually estimated by p i (1 − p i )/n. The covariance matrix is Σ Σ Σ Σ = (π π π π − ππ ππ ππ ππ T )/n with variances along the diagonal and is estimated by S = (p − pp T )/n. S converges to Σ Σ Σ Σ as n grows large.
Recently, Agresti and Coull [4] published an informative article elucidating the coverage properties of confidence interval construction for a binomial parameter. They compared score intervals attributed to Wilson [5] with Wald [6] intervals and exact intervals based on binomial probabilities. The Wald intervals are found by solving for π i where χ 2 = χ 2 (α, 1) is the upper 100(1 − α) percentage point of the chi-square . The Wald intervals are sometimes given in introductory texts and reported by some computer packages, but they do not perform well with respect to coverage probability or interval length. The Wilson intervals are found by solving for π i and do not require estimating the variance from the sample. Both types are related to Pearson's [7] chi-square statistic and can be developed using multivariate normal theory [1,8].
The authors [1,11] discussed various methods of constructing simultaneous intervals for multinomial proportions as adaptations of the Quesenberry and Hurst [9] or the Wald intervals. Adaptations include variance correction factors, continuity corrections and Bonferroni adjustments. The method proposed by Sison and Glaz [3], however, differs from others and provides an alternative that performs well in instances that are not amenable to normal large-sample theory.
Following a presentation by Levin [12], Sison and Glaz [3] used the relationship between the Poisson, truncated Poisson and multinomial distributions to construct simultaneous confidence intervals for the multinomial proportions. They presented two procedures, but recommended the procedure that was least intensive computationally.
We outline that procedure here and give more detail concerning computing the truncated Poisson probabilities in Appendix A.
Assume Z i (i = 1, ... , k) are independent Poisson random variables such that where Z i , i = 1, ... , k, are independent Poisson random variables with λ i = nπ i and W is the sum of k independent random variables from truncated Poisson distributions on the interval [b i , a i ] also with mean λ i = nπ i .
Because λ i = nπ i is usually unknown, we use λ i = np i to estimate the moments and rely on an Edgeworth expansion to estimate the probability mass function for the approximate truncated Poisson distribution as outlined in Appendix A. We search for a positive integer c by solving ρ(c) = P(p i -c/n < π i < p i + c/n ; i = 1, ... , k) ≈ 1 − α.
That is, we search for a single number c that will give us approximately the correct coverage probability for the entire set of events A 1 , ..., A k . We use the same value of c for all proportions so that the method works best when we have nearly equal proportions across all k cells. We give equal weight to each proportion and, therefore, the interval lengths are the equal. Because the interval lengths are equal, the Sison and Glaz [3] method relies on fairly even dispersion of the cell counts across all categories. The Sison and Glaz [3] method is quite different from the Wilson or Quesenberry and Hurst methods that provide potentially different interval lengths based on the variance associated with a particular cell count.
Sison and Glaz [3] suggested finding an integer such that ρ(c) < 1 − α < ρ(c + 1) and using an interpolation adjustment to form the simultaneous confidence intervals . They note that the adjustment is necessary to correct for skewness. An alternative approach that is easy to implement with the SAS macro we provided is to compute slightly wider intervals as (p i -c/n -1/n < π i < p i + c/n + 1/n) that is equivalent to using c + 1 instead of c where ρ(c) < 1 − α < ρ(c + 1). If n is relatively large, these intervals will be only slightly more conservative than those proposed by Sison and Glaz [3] but will ensure that the coverage probability is at least as large as the specified level, assuming np is a good approximation of nπ. The intervals we propose, however, do not account for skewness as do those of Sison and Glaz [3].

A SAS macro for two-sided multinomial confidence intervals
We have written a SAS macro using PROC IML that takes multinomial cell counts as input and constructs simultaneous confidence intervals for multinomial parameters as output. The algorithm, based on the methods of Sison and Glaz [3], When ρ(c) < 1 − α < ρ(c + 1), the algorithm stops. The correction factor δ is calculated and the endpoints are presented as part of the output.

Example
The example used to illustrate the calling routine was chosen to complement data from Sison and Glaz [3] where cell counts represent the number of "personal crimes The output gives the estimated coverage probabilities using c and c + 1 as 0.953 and 0.932 for this particular data set. These are not, however, the same as the expected coverage probabilities that Sison and Glaz [3] simulated and should not be interpreted as such. The coverage probability for this particular sample using the Sison and Glaz [3] method is not calculated since the Poisson function relies on discrete counts that are between c and c + 1, but the estimated coverage probability for this particular observation is somewhere between P(c) and P(c + 1), provided the observed proportions are near the true population proportions. The volume is also based on this particular sample and is included as a comparison between the two methods. Thus, the reported volume is not to be interpreted as an expected volume.
The observed proportions and confidence intervals using both the recommended correction δ and those based on c + 1 are also listed. Necessarily, the intervals using δ are slightly skewed with center at p + δ/n. The intervals based on c + 1 use endpoints ± (c/n + 1/n). The macro also outputs the mid-points of the Sison and Glaz [3] intervals that, similar to the weighted estimators discussed by Agresti and Coull [4], may also be used to estimate the proportions.
For comparison, Table 1 gives other intervals discussed by May and Johnson [1].

Conclusion
As discussed in previous articles [1,11], the Sison and Glaz [3] intervals are not always appropriate. If the number of cells is small, say no more than 10, the Goodman Patel and Read [13] discussed the Edgeworth expansion in general and also gave references and details using is the probability density function of the standard normal and 15 x 45 Levin [12] and Patel and Read [13] labeled γ 1 the "coefficient of skewness" and γ 2 the "coefficient of excess". We need to compute the first 4 central moments of the truncated Poisson distribution. Hjorth [14] discussed the relationship between the Edgeworth expansion and the moment generating function that the reader may find useful.
Mood, Graybill and Boes [15] noted that for discrete distributions the factorial moments may be easier to calculate than the central moments. The central moments can be calculated from the factorial moments using standard formulae. For r > 0, the r th factorial moment of the truncated Poisson distribution on [b, a] with location parameter λ Note that the µ (r) are functions of cumulative Poisson probabilities. The moments are undefined for P(Z = z) when z < 0 and it is customary to assume µ (r) = 0 in such cases.
The central moments are obtained from the factorial moments as follows: Given the means of the Poisson distribution, we can find the first 4 factorial moments from which we compute the first 4 central moments of the truncated Poisson distribution.