Rgbp: An R Package for Gaussian, Poisson, and Binomial Random Effects Models with Frequency Coverage Evaluations

Rgbp is an R package that provides estimates and verifiable confidence intervals for random effects in two-level conjugate hierarchical models for overdispersed Gaussian, Poisson, and Binomial data. Rgbp models aggregate data from k independent groups summarized by observed sufficient statistics for each random effect, such as sample means, possibly with covariates. Rgbp uses approximate Bayesian machinery with unique improper priors for the hyper-parameters, which leads to good repeated sampling coverage properties for random effects. A special feature of Rgbp is an option that generates synthetic data sets to check whether the interval estimates for random effects actually meet the nominal confidence levels. Additionally, Rgbp provides inference statistics for the hyper-parameters, e.g., regression coefficients.


Introduction
Gaussian, Poisson, or Binomial data from several independent groups sometimes have more variation than the assumed Gaussian, Poisson, or Binomial distributions of the first-level observed data. To account for this extra variability, called overdispersion, a two-level conjugate hierarchical model regards first-level mean parameters as random effects that come from a population-level conjugate prior distribution. The main goal of our two-level conjugate modeling is to estimate these random effects for a comparison between groups. For example, this model can be used to estimate unknown true batting averages (random effects) of baseball players for a comparison among players based on their numbers of hits and at-bats possibly with their covariate information.
With an assumption of homogeneity within each group, the observed data are group-level aggregate data from k independent groups, composed of sufficient statistics for their k random effects (without the population-level data). Specifically, the data for the Gaussian model consist of each group's sample mean and its standard error, those for the Poisson model use each group's outcome count and an exposure measure, and those for the Binomial model use the number of each group's successful outcomes together with the total number of trials. For example, the observed data for the Binomial model can be the number of hits out of at-bats for each of k baseball players, a sufficient statistic for the unknown true batting average of each player. The data analyzed by Rgbp may incorporate each group's covariate information, e.g., each player's position.
These types of data are common in various fields for estimating random effects. For example, biologists may be interested in unknown true tumor incidence rates in analyzing litter data composed of each litter's number of tumor-bearing animals out of total number of animals at risk (Tamura and Young 1987). The unknown true mortality rates on myocardial infarction can be estimated based on the death rate data collected from several independent clinical studies via a meta analysis (Gelman, Carlin, Stern, and Rubin 2014). County-level or statelevel summary data in small area estimation problems (Ghosh and Rao 1994;Rao 2003) can be used to estimate population parameters, such as unknown unemployment rates.
For such data, assuming homogeneity within each group, Rgbp's two-level model may be viewed as a conjugate hierarchical generalized linear model (HGLM, Lee and Nelder 1996) where each random effect comes from a conjugate prior distribution. However, the HGLM focuses on estimating regression coefficients to explore associations between covariates and observed data. While Rgbp does that too, its emphasis concerns making valid point and interval estimates for the k random effects for a comparison among groups.
A defining feature of Rgbp is to evaluate the repeated sampling coverage properties of the interval estimates for random effects (Christiansen and Morris 1997;Daniels 1999;Tang 2002; Morris and Tang 2011;Morris and Lysy 2012). This procedure distinguishes Rgbp from other R packages for similar hierarchical models, such as hglm (Alam, Rönnegård, and Shen 2015) for fitting conjugate hierarchical generalized models, because most software produces only estimation results without providing a quick way to evaluate their estimation procedures. The evaluation procedure which we call frequency method checking uses a parametric bootstrapping method that generates synthetic data sets given the fitted values of the estimated hyperparameters. The frequency method checking estimates the coverage rates of interval estimates based on the simulated data sets and checks whether the estimated coverage rates achieve (or exceed) the nominal confidence level.
Rgbp combines Bayesian modeling tools with our improper hyper-prior distributions on the second-level parameters. These hyper-prior distributions are known to produce good repeated sampling coverage rates for the Bayesian interval estimates for the k random effects in twolevel Gaussian models (Morris and Tang 2011;Morris and Lysy 2012;Kelly 2014). We extend these hyper-prior distributions for Rgbp's Poisson and Binomial hierarchical models.
For fitting the hierarchical model, Rgbp uses adjustment for density maximization (ADM, Morris 1988a;Christiansen and Morris 1997;Morris and Tang 2011). ADM approximates a posterior density or a likelihood function by fitting a selected (one dimensional) Pearson family, based on the first two derivatives of the given density function. For example, when the Normal distribution is the chosen Pearson family, ADM reduces to a Delta method via maxi-mum likelihood estimation. Besides ADM, Rgbp provides an option for the Binomial hierarchical model to draw independent posterior samples of random effects and hyper-parameters via an acceptance-rejection method (Robert and Casella 2013).
The rest of this paper is organized as follows. We specify the Bayesian hierarchical models and discuss their posterior propriety in Section 2. In Section 3, we explain the inferential models used to estimate the model parameters. We describe the estimation procedures including ADM and the acceptance-rejection method in Section 4 and 5, respectively. We introduce frequency method checking techniques in Section 6. We explain the basic usages of Rgbp's main functions with three examples in Section 7, and specify the functions further with various options in Section 8.

Conjugate hierarchical modeling structure
Rgbp allows users to choose one of three hierarchical models according to the type of data, namely Normal-Normal, Poisson-Gamma, and Binomial-Beta models. Although there are more hierarchical models, we choose the three models because these are based on the most common types of data we may encounter in practice. Also, their conjugacy leads to linear posterior means simplifying computations.
Our parametrization of the three hierarchical models leads to an intuitive shrinkage interpretation in inference because the shrinkage factors under our parametrization are determined by the relative amount of information in the prior compared to the data (Morris 1983).

Normal-Normal model for Gaussian data
The following Normal-Normal hierarchical model (hereafter the Gaussian model) assumed by Rgbp is useful when the group-level aggregate data from k independent groups are continuous (or approximately continuous) variables with known standard errors. The subscript j below indicates the j th group among k groups in the dataset. For j = 1, 2, . . . , k, where y j is an observed unbiased estimate, e.g., sample mean, for random effect µ j , V j is a completely known standard error of y j , µ E j is an expected random effect defined as E(µ j | β, A) = x j β = β 1 x j,1 + β 2 x j,2 + · · · + β m x j,m , and m is the number of unknown regression coefficients. It is assumed that the second-level variance A is unknown and that the vector of m regression coefficients β is also unknown unless otherwise specified. If no covariates are available, but with an unknown intercept term, then x j β = β 1 (m = 1) and thus µ E j = µ E = β 1 for all j, resulting in an exchangeable conjugate prior distribution for the random effects. Based on these conjugate prior distributions for random effects, it is easy to derive the conditional posterior distribution of each random effect. For j = 1, 2, . . . , k, where B j ≡ V j /(V j + A) is a shrinkage factor of group j and y = (y 1 , y 2 , . . . , y k ) . The conditional posterior mean E(µ j | β, A, y), denoted by µ * j , is a convex combination of the observed sample mean y j and the expected random effect µ E j weighted by the shrinkage factor B j . If the variance of the conjugate prior distribution, A, is smaller than the variance of the observed distribution, V j , then we expect the posterior mean to borrow more information from the second-level conjugate prior distribution.

Poisson-Gamma model for Poisson data
Rgbp can estimate a conjugate Poisson-Gamma hierarchical model (hereafter the Poisson model) when the group-level aggregate data from k independent groups consist of non-negative count data without upper limit. However, its usage is limited to the case where the expected random effect, λ E j = exp(x j β), is known (or equivalently all the regression coefficients β are known (m = 0)); we may be able to obtain this information from the past studies or from experts. For j = 1, 2, . . . , k, where y j is the number of events happening, n j is the exposure of group j, which is not necessarily an integer, λ E j = E(λ j | r) is the known expected random effect (m = 0), and r is the unknown second-level variance component. The mean and variance of this conjugate Gamma prior distribution are λ E and λ E /r, respectively 1 . Albert (1988) interprets r as the amount of prior information as n j represents the amount of observed information because the uncertainty of the conjugate prior distribution increases as r decreases and vice versa. The conditional posterior distribution of the random effect λ j for this Poisson model is whereȳ j ≡ y j /n j . The mean and variance of the conditional posterior distribution are where B j ≡ r/(r + n j ) is the shrinkage factor of group j, the relative amount of information in the prior compared to the data. The conditional posterior mean is a convex combination ofȳ j and λ E j weighted by B j . If the conjugate prior distribution contains more information than the observed data have, i.e., ensemble sample size r exceeds individual sample size n j , then the posterior mean shrinks towards the prior mean by more than 50%.
The conditional posterior variance in (7) is linear in the conditional posterior mean, whereas a slightly different parameterization for a Poisson-Gamma model has been used elsewhere (Christiansen and Morris 1997) that makes the variances quadratic functions of means.

Binomial-Beta model for Binomial data
Rgbp can fit a conjugate Binomial-Beta hierarchical model (hereafter the Binomial model) when the group-level aggregate data from k independent groups are composed of each group's 1 The density function of this Gamma prior distribution in (5) number of successes out of total number of trials. The expected random effect in the Binomial model is either known (m = 0) or unknown (m ≥ 1). For j = 1, 2, . . . , k, where y j is the number of successes out of n j trials, p E j is the expected random effect of group j defined as p E j ≡ E(p j | β, r) = exp(x j β)/(1 + exp(x j β)). The vector of the m logistic regression coefficients β and the second-level variance component r are unknown. The mean and variance of the conjugate Beta prior distribution for group j are p E j and p E j (1−p E j )/(r+1), respectively. The resultant conditional posterior distribution of random effect p j is whereȳ j = y j /n j is the observed proportion of group j. The mean and variance of the conditional posterior distribution are The conditional posterior mean p * j is a convex combination ofȳ j and p E j weighted by B j ≡ r/(r+n j ) like the Poisson model. If the conjugate prior distribution contains more information than the observed distribution does (r > n j ), then the resulting conditional posterior mean borrows more information from the conjugate Beta prior distribution.

Hyper-prior distribution
Hyper-prior distributions are the distributions assigned to the second-level parameters called hyper-parameters. Our choices for the hyper-prior distributions are β ∼ Uniform on R m and A ∼ Uniform(0, ∞) (or 1/r ∼ Uniform(0, ∞)).
The improper flat hyper-prior distribution on β is a common non-informative choice. In the Gaussian case, the flat hyper-prior distribution on the second-level variance A is known to produce good repeated sampling coverage properties of the Bayesian interval estimates for the random effects (Morris and Tang  In the other two cases, Poisson and Binomial, the flat prior distribution on 1/r induces the same improper prior distribution on shrinkages (dB j /B 2 j ) as does A with the Uniform(0, ∞) for the Gaussian case. The Poisson model with this hyper-prior distribution on r, i.e., dr/r 2 , provides posterior propriety if there are at least two groups whose observed values y j are non-zero and the expected random effects, λ E j , are known (m = 0); see Appendix A for its proof. If λ E j is unknown, Rgbp cannot yet give reliable results because we have not verified posterior propriety. If the Poisson is being used as an approximation to the Binomial and the exposures are known integer values, then we recommend using the Binomial model with the same hyper-prior distributions.
As for posterior propriety of the Binomial model, let us define an interior group as the group whose number of successes y j are neither 0 nor n j , and k y as the number of interior groups among k groups. The full posterior distribution of random effects and hyper-parameters is proper if and only if there are at least two interior groups in the data and the k y × m covariate matrix of the interior groups is of full rank m (k y ≥ m) (Tak and Morris 2016).

The inferential model
The likelihood function of hyper-parameters, A and β, for the Gaussian model is derived from the independent Normal distributions of the observed data with random effects integrated out. .
The joint posterior density function of hyper-parameters for the Gaussian model is proportional to their likelihood function in (13) because we use flat improper hyper-prior density functions for A and β: The likelihood function of r for the Poisson model comes from the independent Negative-Binomial distributions of the observed data with the random effects integrated out.
where Γ(a) is a gamma function defined as ∞ 0 x a−1 exp(x)dx for a positive constant a. The posterior density function of r for the Poisson model is the likelihood function in (15) times the hyper-prior density function of r, i.e., dr/r 2 : The likelihood function of hyper-parameters r and β for the Binomial model is derived from the independent Beta-Binomial distributions of the observed data with random effects integrated out (Skellam 1948).
indicates a beta function for positive constants a and b. The joint posterior density function of hyper-parameters f (r, β | y) for the Binomial model is proportional to their likelihood function in (17) multiplied by the hyper-prior density functions of r and β based on distributions in (12): Our goal is to obtain the point and interval estimates of the random effects from their joint unconditional posterior density which, for the Gaussian model, can be expressed as where µ = (µ 1 , µ 2 , . . . , µ k ) and the distributions in the integrand are given in (3) and (14). For the Poisson model, the joint unconditional posterior density for the random effects is where λ = (λ 1 , λ 2 , . . . , λ k ) and the distributions in the integrand are given in (6) and (16). For the Binomial model, the joint unconditional posterior density for the random effects is where p = (p 1 , p 2 , . . . , p k ) and the distributions in the integrand are given in (10) and (18).

Estimation via the adjustment for density maximization
We illustrate our estimation procedure which utilizes adjustment for density maximization (ADM, Morris 1988a; Christiansen and Morris 1997; Morris and Tang 2011). ADM is a method to approximate a distribution by a member of Pearson family of distributions and obtain moment estimates via maximization. The ADM procedure for the Gaussian model adopted in Rgbp is well documented in Kelly (2014) and thus we describe the ADM procedure using the Poisson and Binomial model in this section.

Estimation for shrinkage factors and expected random effects
Our goal here is to estimate the unconditional posterior moments of the shrinkage factors and the expected random effects because they are used to estimate the unconditional posterior moments of the random effects.

Unconditional posterior moments of shrinkage factors
It is noted that the shrinkage factors are a function of r, i.e., B j = B j (r) = r/(r + n j ) (or a function of A for the Gaussian model). A common method of estimation of B j is to approximate the likelihood of r with two derivatives and use a Delta method for an asymptotic Normal distribution ofB j (r M LE ). This Normal approximation, however, is defined on (−∞, ∞) whereas B j lies on the unit interval between 0 and 1, and hence in small sample sizes the Delta method can result in point estimates lying on the boundary of the parameter space, from which the restricted MLE procedure sometimes suffers (Morris and Tang 2011).
To continue with a maximization-based estimation procedure but to steer clear of aforementioned boundary issues we make use of ADM. The ADM approximates the distribution of the function of the parameter of interest by one of the Pearson family distributions using the first two derivatives as the Delta method does; the Delta method is a special case of the ADM based on the Normal distribution.
The ADM procedure specified in Morris and Tang (2011) assumes that the unconditional posterior distribution of a shrinkage factor follows a Beta distribution; for j = 1, 2, . . . , k, The mean of Beta distribution a 1j /(a 1j +a 0j ) is not the same as its mode (a j1 −1)/(a j1 +a j0 − 2). The ADM works on an adjusted posterior distribution is the same as the mean of the original Beta distribution. The assumed posterior mean and variance of the jth shrinkage factor are The ADM estimates these mean and variance using the marginal posterior distribution of r, f (r | y) ∝ L(r)/r 2 . The marginal likelihood, L(r) = L(β, r)dβ, for the Binomial model is obtained via the Laplace approximation with a Lebesque measure on β and that for the Poisson model is specified in (15). (24) involves calculating the second derivative of log(f A (B j | y)), we work on a logarithmic scale of r, i.e., α = − log(r) (or α = log(A) for the Gaussian model). This is because the distribution of α is more symmetric than that of r and α is defined on a real line without any boundary issues. Morris and Tang (2011), the posterior mean in (23) is estimated bŷ
To estimate the variance in (24), Morris and Tang (2011) introduced the invariant information defined as This invariant information is the negative second-derivative of α + log(L(α)) evaluated at the modeα. Using the invariant information, we estimate the unconditional posterior variance of shrinkage factor in (24) by We obtain the estimates of a 1j and a 0j , the two parameters of the Beta distribution in (22), by matching them to the estimated unconditional posterior mean and variance of B j specified in (25) and (27) as follows.â The moments of the Beta distribution are well defined as a function of a 1j and a 0j , i.e., The ADM approximation to the shrinkage factors via Beta distributions is empirically proven to be more accurate than a Laplace approximation (Morris 1988a;Christiansen and Morris 1997;Morris and Tang 2011;Morris and Lysy 2012).

Unconditional posterior moments of expected random effects
We estimate the unconditional posterior moments of expected random effects using their relationship to the conditional posterior moments. For a non-negative constant c, the unconditional posterior moments are We approximate the unconditional posterior moments on the left hand side by the conditional posterior moments withα inserted (Kass and Steffey 1989), i.e., by E((p E j ) c |α, y). However, calculating conditional posterior moments of each expected random effect involves an intractable integration. For example, the first conditional posterior moment of p E j is Thus, we use another ADM, assuming the conditional posterior distribution of each expected random effect is a Beta distribution as follows.
where G 1 is a random variable following a Gamma(b 1j , 1) distribution and independently G 0 has a Gamma(b 0j , 1) distribution. The representation in (32) is equivalent to exp(x j β)|α, y ∼ G 1 /G 0 , a ratio of two independent Gamma random variables. Its mean and variance are To estimate b 1j and b 0j , we assume that the conditional posterior distribution of β givenα and y follows a Normal distribution with meanβ and variance-covariance matrixΣ, wherê β is the mode of f (β |α, y) andΣ is an inverse of the negative Hessian matrix at the mode. Thus, the posterior distribution of x j β is also Normal with mean x jβ and variance x jΣ x j .
Using the property of the log-Normal distribution for exp(x j β), we estimate the posterior mean and variance in (33) and (34) aŝ We estimate the values of b 1j and b 0j by matching them to the estimated unconditional posterior mean and variance of exp(x j β) in (35) and (36), that is, Finally, we estimate the unconditional posterior moments of the expected random effects bŷ The ADM approximation to a log-Normal density via a F -distribution (represented by a ratio of two independent Gamma random variables) is known to be more accurate than the Laplace approximation (Morris 1988a).
For the Gaussian model (Morris and Tang 2011), the conditional posterior distribution of β givenÂ and y is Normal whose mean and variance-covariance matrix are Because x β givenÂ and y is also Normally distributed, we easily obtain the conditional posterior moments of µ E j = x β givenÂ and use them to estimate unconditional posterior moments of µ E j .

Estimation for random effects
We illustrate how we obtain approximate unconditional posterior distributions of random effects using the estimated unconditional posterior moments of shrinkage factors and those of expected random effects. It is intractable to derive analytically the unconditional posterior distribution of each random effect for the three models. Thus, we approximate the distributions by matching the estimated posterior moments with a skewed-Normal distribution (Azzalini 1985) for the Gaussian model, a Gamma distribution for the Poisson model, and a Beta distribution for the Binomial model; for j = 1, 2, . . . , k, where (φ, ω, δ) of the skewed-Normal distribution are location, scale, and skewness parameters, respectively.
Morris and Lysy (2012) first noted that the unconditional posterior distribution of the random effect in a two-level conjugate Gaussian model might be skewed. Kelly (2014) shows that the skewed-Normal approximation to the unconditional posterior distribution of the random effect is better than a Normal approximation (µ j | y ∼ Normal) in terms of the repeated sampling coverage properties of random effects. Kelly (2014) estimates the first three moments of the random effects by noting that µ j is Normally distributed given A and y, and thus estimates the moments by using the ADM approximation of the shrinkage factors, B j , and the law of third cumulants (Brillinger 1969). The three estimated moments are then matched to the first three moments of the skewed-Normal distribution, i.e., E(µ j | y) = φ + ωδ 2/π, VAR(µ j | y) = ω 2 (1 − 2δ 2 /π), and Skewness(µ j | y) = (4 − π)δ 3 /[2(π/2 − δ 2 ) 3/2 ] (Azzalini 1985). The full derivation can be found in Kelly (2014).
The unconditional posterior mean and variance of random effect λ j in the Poisson model are To estimate these, we insert the estimated unconditional posterior moments of shrinkage factors in (29) into both (43) and (46). Letμ λ j andσ 2 λ j denote the estimated unconditional posterior mean and variance, respectively. The estimates of the two parameters s 1j and s 0j in (41) To estimate the unconditional posterior moments of random effects in the Binomial model, we assume that hyper-parameters r and β are independent a posteriori. With this assumption, the unconditional posterior mean and variance of random effect p j are where the approximation in (51) is a first-order Taylor approximation. By inserting the estimated unconditional posterior moments of shrinkage factors in (29) and those of expected random effect in (38) into both (48) and (52), we obtain the estimates of the unconditional posterior mean and variance of each random effect, denoted byμ p j andσ 2 p j , respectively. We thus obtain the estimates of two parameters t 1j and t 0j in (42) as follows.
Finally, the assumed unconditional posterior distribution of random effect for the Gaussian model is that for the Poisson model is and that for the Binomial model is Our point and interval estimates of each random effect are the mean and (2.5%, 97.5%) quantiles (if we assign 95% confidence level) of the assumed unconditional posterior distribution in (54), (55), or (56).
For the Binomial model, Rgbp provides a fully Bayesian approach for drawing posterior samples of random effects and hyper-parameters, which we illustrate in the next section.

The acceptance-rejection method for the Binomial model
In this section, we illustrate an option of Rgbp that provides a way to draw posterior samples of random effects and hyper-parameters via the acceptance-rejection (A-R) method (Robert and Casella 2013) for the Binomial model. Unlike the approximate Bayesian machinery specified in the previous section, this method does not assume that hyper-parameters are independent a posteriori. The joint posterior density function of α = − log(r) and β based on their joint hyper-prior density function in (12) is The A-R method is useful when it is difficult to sample a parameter of interest θ directly from its target probability density f (θ), which is known up to a normalizing constant, but an easy-to-sample envelope function g(θ) is available. The A-R method samples θ from the envelope g(θ) and accepts it with a probability f (θ)/(M g(θ)), where M is a constant making f (θ)/g(θ) ≤ M for all θ. The distribution of the accepted θ exactly follows f (θ). The A-R method is stable as long as the tails of the envelope function are thicker than those of the target density function.
The goal of the A-R method for the Binomial model is to draw posterior samples of hyperparameters from (57), using an easy-to-sample envelope function g(α, β) that has thicker tails than the target density function.
We factor the envelope function into two parts, g(α, β) = g 1 (α)g 2 (β) to model the tails of each function separately. We consider the tail behavior of the conditional posterior density function f (α | β, y) to establish g 1 (α); f (α | β, y) behaves as exp(−α(k − 1)) when α goes to ∞ and as exp(α) when α goes to −∞. It indicates that f (α | β, y) is skewed to the left because the right tail touches the x-axis faster than the left tail does as long as k > 1. A skewed t-distribution is a good candidate for g 1 (α) because it behaves as a power law on both tails, leading to thicker tails than those of f (α | β, y).
It is too complicated to figure out the tail behaviors of f (β | α, y). However, because f (β | α, y) in the Gaussian model (as an approximation) has a multivariate Gaussian density function (Morris and Tang 2011; Kelly 2014), we consider a multivariate t-distribution with four degrees of freedom as a good candidate for g 2 (β).
The notation t 4 (β | ξ, S) in (59) indicates a density function of a multivariate t-distribution of β with four degrees of freedom, a location vector ξ, and a m × m scale matrix S that leads to the variance-covariance matrix 2S.
Rgbp determines the parameters of g 1 (α) and g 2 (β), i.e., l, σ, a, b, ξ, and S, to make the product of g 1 (α) and g 2 (β) similar to the target joint posterior density f (α, β | y). First, Rgbp obtains the mode of f (α, β | y), (α,β), and the inverse of the negative Hessian matrix at the mode. We define −H −1 α to indicate the (1, 1) th element of the negative Hessian matrix and −H −1 β to represent the negative Hessian matrix without the first row and the first column.
For g 1 (α), Rgbp sets (a, b) to (k, 2k) if k is less than 10 (or to (log(k), 2 log(k)) otherwise) for a left-skewness and these small values of a and b lead to thick tails. Rgbp matches the mode of g 1 (α) specified in (60) toα by setting the location parameter l toα − (a − b) √ a + b/ (2a + 1)(2b + 1). Rgbp sets the scale parameter σ to (−H −1 α ) 0.5 ψ, where ψ is a tuning parameter; when the A-R method produces extreme weights defined in (62) below, we need enlarge the value of ψ. For implementation of the acceptance-rejection method, Rgbp draws four times more trial samples than the desired number of samples, denoted by N , independently from g 1 (α) and g 2 (β). Rgbp calculates 4N weights, each of which is defined as Rgbp accepts each pair of (α (i) , β (i) ) with a probability w i /M where M is set to the maximum of all the 4N weights. When Rgbp accepts more than N pairs, it discards the redundant. If Rgbp accepts less than N pairs, then it additionally draws N (six times the shortage) pairs and calculates a new maximum M from all the previous and new weights; Rgbp accepts or rejects the entire pairs again with new probabilities w j /M , j = 1, 2, . . . , 4N + N .
After obtaining posterior samples of hyper-parameters, Rgbp draws posterior samples of random effects from f (p | y) in (21). The integration on the right hand side of (21) can be done by sampling p from f (p j | β, r, y) in (10) for j = 1, 2, . . . , k, given r = exp(−α) and β that are already sampled from f (α, β | y) via the A-R method.

Frequency method checking
The question as to whether the interval estimates of random effects for given confidence level obtained by a specific model achieve the nominal coverage rate for any true parameter values is one of the key model evaluation criteria. Unlike standard model checking methods that test whether a two-level model is appropriate for data (Dean 1992;Christiansen and Morris 1996), frequency method checking is a procedure to evaluate the coverage properties of the model. Conditioning that the two-level model is appropriate, the frequency method checking generates pseudo-data sets given specific values of hyper-parameters and estimates unknown coverage probabilities based on these mock data sets (a parametric bootstrapping). We describe the frequency method checking based on the Gaussian model because the idea can be easily applied to the other two models.
6.1. Pseudo-data generation Figure 1 displays the process of generating pseudo-data sets. It is noted that the conjugate prior distribution of each random effect in (2) is completely determined by two hyperparameters, A and β. Fixing these hyper-parameters at specific values, we generate N sim sets of random effects from the conjugate prior distribution, i.e., {µ (i) , i = 1, . . . , N sim }, where the superscript (i) indicates the i-th simulation. Next, using the distribution of observed data in (1), we generate N sim sets of observed data sets {y (i) , i = 1, . . . , N sim } given each µ (i) .

Coverage probability estimation
After fitting the Gaussian model for each simulated data set, we obtain interval estimates of the random effects µ (i) . Let (μ j, upp ) represent the lower and upper bounds of the interval estimate of random effect j based on the i-th simulation given a specific confidence level. We define the coverage indicator of random effect j on the i-th mock data set as This shrinkage indicator is equal to the value one if the random effect j in simulation i is between its interval estimates and zero otherwise.

Simple unbiased coverage estimator.
When the confidence level is 95%, the proportion of 95% interval estimates that contain random effect j is an intuitive choice for the coverage rate estimator for random effect j. This estimator implicitly assumes that there exist k unknown coverage probabilities of random effects, denoted by C A,β (µ j ) for j = 1, 2, . . . , k, depending on the values of the hyper-parameters that generate random effects and mock data sets. The coverage indicators for random effect j in (63) is assumed to follow an independent and identically distributed Bernoulli distribution given the unknown coverage rate C A,β (µ j ). The sample mean of these coverage indicators is a simple unbiased coverage estimator for C A,β (µ j ); for j = 1, 2, . . . , k, The unbiased variance estimator of VAR(Ī(µ j )) is, for j = 1, 2, . . . , k, (65)

Rao-Blackwellized unbiased coverage estimator.
Frequency method checking is computationally expensive in nature because it fits a model on every mock data set. The situation deteriorates if the number of simulations or the size of data is large, or the estimation method is computationally demanding. Christiansen and Morris (1997) and Tang (2002) use a Rao-Blackwellized (RB) unbiased coverage estimator for the unknown coverage rate of each random effect, which is more efficient than the simple unbiased coverage estimator. For j = 1, 2, . . . , k, where the sample mean of the interior conditional expectations in (66) is the RB unbiased coverage estimator. Specifically, We can easily compute the conditional posterior probabilities in (68) using the cumulative density function of the Gaussian conditional posterior distribution of each random effect in (3). The variance ofĪ RB (µ j ) does not exceed the variance of a simple unbiased coverage estimator,Ī(µ j ) (Rao 1945;Blackwell 1947).

Overall unbiased coverage estimator
To summarize the frequency method checking, we report the overall unbiased coverage estimate and its variance estimate,

Examples
In this section, we demonstrate how Rgbp can be used to analyze three realistic data sets: Medical profiling of 31 hospitals with Poisson distributed fatality counts; educational assessment of eight schools with Normally distributed data; and evaluation of 18 baseball hitters with Binomial success rates and one covariate. For each example, we construct 95% confidence intervals. Additional usages and options of the functions in Rgbp can be found in Section 8.

Poisson data with 31 hospitals: Known expected random effect
We analyze a data set of 31 hospitals in New York State consisting of the outcomes of the coronary artery bypass graft (CABG) surgery (Morris and Lysy 2012). The data set contains the number of deaths, y, for a specified period after CABG surgeries out of the total number of patients, n, receiving CABG surgeries in each hospital. A goal would be to obtain the point and interval estimates for the unknown true fatality rates (random effects) of 31 hospitals to evaluate each hospital's reliability on the CABG surgery (Morris and Christiansen (1995) use a similar Poisson model to handle these hospital profile data). We interpret the caseloads, n, as exposures and assume that the state-level fatality rate per exposure of this surgery is known, λ E j = 0.03 (m = 0). The following code can be used to load these data into R. R> library("Rgbp") R> data("hospital") R> y <-hospital$d R> n <-hospital$n The function gbp can then be used to fit a Poisson-Gamma to the fatality rates in New York States with the expected random effect, λ E j , equal to 0.03. The output contains information about (from the left) the observed fatality ratesȳ j , caseloads n j , known expected random effect λ E j , shrinkage estimatesB j , lower bounds (2.5%) of posterior interval estimatesλ j,low , posterior meansÊ(λ j |y), upper bounds (97.5%) of posterior interval estimatesλ j,upp , and posterior standard deviations SD(λ j |y) for random effects based on the assumed unconditional Gamma posterior distributions in (55).

R>
A function summary shows selective information about hospitals with minimum, median, and maximum exposures and the estimation result of the hyper-parameter α = − log(r). The output of summary shows thatr = exp(6.53) = 684, which is an indicator of how valuable and informative the second-level hierarchy is. It means that the 25 hospitals with caseload less than 684 patients shrink their sample means towards the prior mean (0.03) more than 50%. For example, the shrinkage estimate of the first hospital (B 1 = 0.911) was calculated by 684 / (684 + 67), where 67 is its caseload (n 1 ). As for this hospital, using more information from the conjugate prior distribution is an appropriate choice because the amount of observed information (67) is much less than the amount of state-level information (684).

R> summary(p.output)
To obtain a graphical summary, we use the function plot.
R> plot(p.output) Figure 2: Shrinkage plot and 95% interval plot for fatality rates at 31 hospitals.
The shrinkage plot (Efron and Morris 1975;Morris and Lysy 2012) in the first panel of Figure 2 shows the regression towards the mean; the observed fatality rates, denoted by empty dots on the upper horizontal line, are shrinking towards the known expected random effect, denoted by a blue vertical line at 0.03, to the different extents. The red dots on the bottom line denotes the estimated posterior means. Some hospitals' ranks have changed by shrinking more sharply towards 0.03 than the others. For example, an empty square at the crossing point of the two left-most lines (8th and 23rd hospitals on the list above) indicates that the seemingly safest hospital in terms of the observed mortality rate is probably not the safest in terms of the estimated posterior mean accounting for the different caseloads of these two hospitals.
To be specific, their observed fatality rates (y j , j = 8, 23) are 0.0136 and 0.0150 and caseloads (n j , j = 8, 23) are 295 and 602, respectively. Considering solely the observed fatality rates may lead to an unfair comparison because the latter hospital handled twice the caseload.
Rgbp accounts for this caseload difference, making the posterior mean for the random effect of the former hospital shrink toward the state-level mean (λ E j =0.03) more rapidly than that for the latter hospital.
The point estimates are not enough to evaluate hospital reliability because one hospital may have a lower point estimate but larger uncertainty (variance) than the other. The second plot of Figure 2 displays the 95% interval estimates. Each posterior mean (red dot) is between the sample mean (empty dot) and the known expected random effect (a blue horizontal line).
This 95% interval plot reveals that the 31st hospital has the lowest upper bound even though its point estimate (λ 31 = 0.0235) is slightly larger than that of the 23rd hospital (λ 23 = 0.0230). The observed mortality rates for these two hospitals (y j , j = 23, 31) are 0.0150 and 0.0201 and the caseloads (n j , j = 23, 31) are 602 and 1340 each. The 31st hospital has twice the caseload, which leads to borrowing less information from the New York State-level hierarchy (or shrinking less toward the state-level mean, 0.03) with smaller variance. Based on the point and interval estimates, the 31st hospital seems a better choice than the 23rd hospital. (Note that this conclusion is based on the data, assuming no covariate information about the overall case difficulties in each hospital. A more reliable analysis must take into account all the possible covariate information and instead of our Poisson model, we recommend using our Binomial model to account for covariate information.) Next, we perform frequency method checking to question how reliable the estimation procedure is, assuming r equals its estimated value,r = 683.53. The function coverage generates synthetic data sets starting with the estimated value of r as a generative value. For reference, we could designate other generative values of r and λ E j by adding two arguments, A.or.r and mean.PriorDist, into the code below, see Section 8.1 for details. R> p.coverage <-coverage(p.output, nsim = 1000) In Figure 3, the black horizontal line at 0.95 represents the nominal confidence level and the red circles indicate RB unbiased coverage estimates,Ī RB (λ j ) for j = 1, 2, . . . , 31. The overall unbiased coverage estimate across all the hospitals,Ī RB in (70), is 0.955. None of RB unbiased coverage estimates for the 31 hospitals are less than 0.95 regardless of their caseloads, which range from 67 for hospital 1 to 1,340 for hospital 31. This result shows that the interval estimates for this particular dataset adequately achieves a 95% confidence level if r =r.
The following code provides 31 RB unbiased coverage estimates and their standard errors (the output is omitted for space reasons). R> p.coverage$coverageRB R> p.coverage$se.coverageRB The code below produces 31 simple unbiased coverage estimates and their standard errors.

R> p.coverage$coverageS R> p.coverage$se.coverageS
It turns out that the variance estimate of the RB unbiased coverage estimate for the first hospital (0.0016 2 ) is about 19 times smaller than that of the simple one (0.0070 2 ). It means that the RB unbiased coverage estimates based on 1,000 simulations (N sim ) are as precise as the simple unbiased coverage estimates based on 19,000 simulations in terms of estimating the coverage probability for the first hospital, C r,λ E (λ 1 ).

Gaussian data with eight schools: Unknown expected random effect and no covariates
The Education Testing Service conducted randomized experiments in eight separate schools (groups) to test whether students (units) SAT scores are affected by coaching. The dataset contains the estimated coaching effects on SAT scores (y j , j = 1, . . . , 8) and standard errors (V 0.5 j , j = 1, . . . , 8) of the eight schools (Rubin 1981). These data are contained in the package and can be loaded into R as follows.
R> library("Rgbp") R> data("schools") R> y <-schools$y R> se <-schools$se Due to the nature of the test each school's coaching effect has an approximately Normal sampling distribution with approximately known sampling variances, based on large sample consideration. At the second hierarchy, the mean for each school is assumed to be drawn from a common Normal distribution (m = 1).
R> g.output <-gbp(y, se, model = "gaussian") R> g.output This output from gbp summarizes the results. In this Gaussian model the amount of shrinkage for each unit is governed by the shrinkage factor, B j = V j /(V j + A). As such, schools whose variation within the school (V j ) is less than the between-school variation (A) will shrink greater than 50%. The results provided by gpb suggests that there is little evidence that the training provided much added benefit due to the fact that every school's 95% posterior interval contains zero. In the case where the number of groups is large Rgbp provides a summary feature: R> summary(g.output) Main summary: The summary provides results regarding the second level hierarchy parameters. It can be seen that the estimate of the expected random effect, µ E = β 1 (beta1), is not significantly different from zero suggesting that there is no effect of the coaching program on SAT math scores.
Rgbp also provides functionality to plot the results of the analysis as seen in Figure 4. Plotting the results provides a visual aid to understanding but is only largely beneficial when the number of groups (k) is small.
R> plot(g.output) Figure 4: Shrinkage plot and 95% interval plot for eight schools.
The frequency method checking generates new pseudo-data from our assumed model. Unless otherwise specified, the procedure fixes the hyper-parameter values at their estimates (Â and β 1 in this example) and then simulates random effects µ j for each group j. The model is then estimated and this is repeated an N sim (nsim) number of times to estimate the coverage probabilities of the procedure.
As seen in Figure 5 the desired 95% confidence level, denoted by a black horizontal line at 0.95, is achieved for each school in this example. All the coverage estimates depend on the chosen generative values of A and β 1 , and the assumption that the model is valid.
In addition, RB unbiased coverage estimate and its standard error for each school can be calculated with the command below.
R> g.coverage$coverageRB The data of 18 major league baseball players contain the batting averages through their first 45 official at-bats of the 1970 season (Efron and Morris 1975). A binary covariate is created that is equal to the value one if a player is an outfielder and zero otherwise. The data can be loaded into R with the following code.
R> library("Rgbp") R> data("baseball") R> y <-baseball$Hits R> n <-baseball$At.Bats R> x <-ifelse(baseball$Position == "fielder", 1, 0) Conditional on the unknown true batting average (random effect) of each player it is assumed that the at-bats are independent and therefore, y j | p j ∼ Binomial(45, p j ) independently for j = 1, . . . , 18. Our goal is to obtain point and interval estimates of each random effect whilst considering the additional information on whether the player is an outfielder or not. The function gbp provides a way to incorporate such covariate information seamlessly into the model so that the regression towards the mean occurs within outfielders and non-outfielders separately.
R> b.output <-gbp(y, n, x, model = "binomial") R> b.output The regression coefficient for the outfielder indicator is significant, considering that p value forβ 2 (beta2) is 0.038. It means that the two estimates for the expected random effects for the outfielders and infielders are significantly different. Also, the positive sign ofβ 2 indicates that the population batting average for outfielders tends to be higher than that for infielders. The estimated odds ratio is exp(0.389) = 1.48.

R> plot(b.output)
The shrinkage plot in Figure 6 shows that the observed batting averages (empty dots) on the upper horizontal line shrink towards the two expected random effects, 0.233 and 0.310. The short red line symbols near some empty dots are for when two or more points have the same mean and are plotted over each other. For example, five players (from the 11th player to the 15th) have the same batting average, 0.222, and at this point on the upper horizontal line, there are short red lines toward five directions.
The 95% interval plot in Figure 6 shows the range of true batting average for each player, which clarifies the regression towards the mean within two groups. The 10th, 14th, and 15th players, for example, are outfielders but their observed batting averages are far lower than the first five outfielders. This can be attributed to their bad luck because their observed batting averages are close to the lower bounds of their interval estimates. The regression towards the mean indicates that their batting averages shrink towards the expected random effect of outfielders (0.310) in the long run.
To check the level of trust in these interval estimates, we proceed to frequency method checking by assuming the estimates, 112.95 forr and (-1.194, 0.389) forβ, are the generative values. In Figure 7, the estimated coverage probabilities for random effects are beyond 0.95, conservatively satisfying the 95% confidence level if r =r and β =β. If we want to draw 2,000 posterior samples of random effects and hyper-parameters from their full posterior distribution via the A-R method, we use the following R code.
R> b.output <-gbp(y, n, x, model = "binomial", n.AR = 2000) The sampling result saved in b.output consists of 8,000 weights (b.output$weight), 2,000 posterior samples of α (b.output$alpha), a 2,000×2 matrix of β (b.output$beta) each column of which corresponds to 2,000 posterior samples of each regression coefficient, and a k×2,000 matrix of random effects (b.output$p) each row of which has posterior samples of each random effect.
If we run the frequency method checking using this sampling result, b.output, obtained via the A-R method, the N sim simulations also run the A-R method each time.

Usage of functions in Rgbp
In this section, we describe more specific usage with various options of the two main functions of Rgbp, i.e., gbp for model fitting and coverage for frequency method checking.

Model fitting
The function gbp creates an S3 object "gbp" on which three generic functions plot, print, and summary are defined.
There are two cases according to whether covariates are available or not. When no covariates are available, the function gbp requires fitting an intercept term or designating known values of the expected random effects, i.e., the intercept term must be either estimated or known. The default of gbp is to fit an intercept term. The value(s) of the known expected random effect(s) can be assigned through an optional argument mean.PriorDist. Note that gbp can fit the Poisson model only when the values of expected random effects, λ E j , are known. The usage of gbp to fit each model without any covariates is R> g.output <-gbp(y, se.or.n, model = "gaussian") R> b.output <-gbp(y, se.or.n, model = "binomial") R> p.output <-gbp(y, se.or.n, mean.PriorDist, model = "poisson") The argument y is a vector of k observed sample means for the Gaussian model, k observed numbers of successful outcomes for the Binomial model, and k observed outcome counts for the Poisson model. The argument se.or.n is a vector of k standard errors of each sample mean for the Gaussian model, k numbers of trials for the Binomial model, and k exposures for the Poisson model. The argument mean.PriorDist is either a constant (if all the known expected random effects are the same) or a vector of k known expected random effects.
If covariate information for each group is available, users can fit the Gaussian and Binomial models, using the following codes.
R> g.output <-gbp(y, se.or.n, X, model = "gaussian") R> b.output <-gbp(y, se.or.n, X, model = "binomial") The argument X is a matrix of covariate(s) each column of which corresponds to one covariate for k groups. For example, if users have two covariates for each group, the argument X must be a k × 2 matrix to estimate three regression coefficients β = (β 1 , β 2 , β 3 ) including an intercept term, β 1 , as a default. If users do not want to include an intercept term (β 1 = 0), estimating two regression coefficients for the two covariates, users can add an optional argument intercept as follows.
R> g.output <-gbp(y, se.or.n, X, model = "gaussian", intercept = FALSE) The function gbp contains more optional arguments. The argument confidece.lvl, whose default value is 0.95, sets the confidence level, producing 100×confidece.lvl% interval estimates for the random effects. For the Gaussian model, setting the argument normal.CI to TRUE lets gbp use a Normal approximation to the unconditional posterior distribution of the random effect (Morris and Tang 2011). The default value of normal.CI is FALSE for the skewed-Normal approximation (Kelly 2014).
The function gbp uses the A-R method to fit the Binomial model if users assign the desired number of posterior samples (N in (62)) through the argument n.AR; its default value is zero. There are several arguments related to the A-R method. The argument n.AR.factor determines how many trial samples the method draws; its default value is four, meaning that the function gbp draws 4×n.AR trial samples and accepts or rejects them. The argument trial.scale is ψ determining the scale parameter σ of the skewed-t distribution; its default value is 1.3. The argument save.result indicates whether gbp saves the whole posterior samples of the random effects and hyper-parameters; its default value is TRUE. The two arguments t and u, taking on non-negative and positive values, respectively, allow users to choose the joint hyper-prior density function, f (r, β) ∝ 1/(t + r) u+1 ; the default values for t and u are 0 and 1, respectively, for the joint hyper-prior density function specified in (12).
For example, when there are two covariates, the following code produces 2,000 posterior samples of random effects and those of hyper-parameters, r and β (3×1) including an intercept term, via the A-R method with 8,000 trial samples.
R> b.output <-gbp(y, se.or.n, X, model = "binomial", n.AR = 2000) The object b.output above contains 8,000 weights (b.output$weight), 2,000 posterior samples of α (b.output$alpha), a 2,000×3 matrix of β (b.output$beta) each column of which corresponds to 2,000 posterior samples of each regression coefficient, and a k×2,000 matrix of random effects (b.output$p) each row of which has posterior samples of each random effect.
The S3 object "gbp" benefits from three generic functions, print, summary, and plot. The estimation result for all the random effects appears if users type the "gbp" object in the R console, which plays the same role of the function print with its default argument sort = TRUE. When the argument sort is set to TRUE, the function print displays the estimation result for all the groups in the ascending order of n for the Binomial and Poisson model and the descending order of standard errors for the Gaussian model. When the argument sort is FALSE, the estimation result is returned in the order of data input.

R> b.output R> print(b.output, sort = FALSE)
The function summary prints a detailed estimation result, including the estimation result for the hyper-parameters, A (or r) and β.

R> summary(b.output)
The function plot draws a shrinkage plot and 100×confidece.lvl% interval plot for random effects, see Figure 2, 4, or 6 for example. Its default argument "sort = TRUE" displays the 100×confidece.lvl% interval plot in the ascending order of n for the Binomial and Poisson model and the descending order of standard errors for the Gaussian model. When the argument sort is set to FALSE the 100×confidece.lvl% interval plot is displayed in the order of data input.

Frequency method checking
The function coverage conducts the frequency method checking. It estimates the coverage properties for our estimators of the random effects at a particular value of the hyperparameters by averaging the coverage over many simulated datasets. The basic usage of coverage needs a "gbp" object, such as b.output above, as the first argument; R> cov <-coverage(b.output, nsim = 1000) The argument nsim sets the number of simulations, N sim , defined in Section 6.1. If users do not assign values of the hyper-parameters through the arguments A.or.r and reg.coef, then the function coverage automatically sets the estimated posterior modes of hyper-parameters saved in the "gbp" object (or their posterior medians if the acceptance-rejection method for the Binomial model is used) as the generative values of hyper-parameters. If users want to conduct the frequency method checking with different generated values of hyper-parameters, for example, r = 100 and β = (2, 5) when one covariate was used with an intercept term, then users can specify them via the arguments A.or.r and reg.coef; R> cov <-coverage(b.output, A.or.r = 100, reg.coef = c(2, 5), nsim = 1000) When users fit a model via gbp with known expected random effects, e.g., a Poisson model with known values of {λ 1 , λ 2 , . . . , λ k }, coverage conducts the frequency method checking based on these known values as a default. However, users may want to conduct the frequency method checking with different known values of the expected random effects. For example, if users want to try a different value of the expected random effect, e.g., λ j = 30 (or can be a vector of different values), the argument mean.PriorDist is added as follows.
R> cov <-coverage(p.output, mean.PriorDist = 30, nsim = 1000) The resulting frequency method checking is based on the estimated posterior mode of r (because it is not specified through A.or.r) and the newly specified value of the expected random effect, 30, as its known value.
Though the function coverage does not produce an S3 object, the result of coverage contains various numerical details; k RB coverage estimates (cov$coverageRB) and their standard errors (cov$se.coverageRB), overall unbiased coverage estimate (cov$overall.coverageRB) and its standard error (cov$se.overall.coverageRB), etc.
A coverage plot summarizing the result of coverage automatically appears. If the result is saved in a variable such as cov above, then users can recall the coverage plot, using the function coverage.plot.

Discussion
Rgbp is an R package for estimating and validating two-level Gaussian, Poisson, and Binomial hierarchical models. The package aims to provide a procedure that is computationally efficient with good frequency properties and includes frequency method checking functionality to examine repeated sampling properties and to test that the method is valid at specified hyper-parameter values.
As an alternative to other maximization based estimation methods such as MLE and REML, Rgbp provides approximate point and interval estimates of parameters via ADM. Using the ADM approach, with our specified choice of priors, protects from cases of overshrinkage and undercoverage from which the aforementioned methods suffer (Morris 1988b).
A benefit of Rgbp is that it produces non-random output (except the A-R method for the Binomial model) and so results are easily reproduced and compared across studies. In addition to being a stand-alone analysis tool the package can be used as an aid in a broader estimation procedure. For example, by checking the similarity of output of Rgbp and that of another estimation procedure such as MCMC (Markov chain Monte Carlo), the package can be used as a confirmatory tool to check whether the alternative procedure has been programmed correctly. In addition, the parameter estimates obtained via Rgbp can be used to initialize an MCMC thus decreasing time to convergence. Lastly, due to its speed and ease of use, Rgbp can be used as a method of preliminary data analysis. Such results may tell statisticians and practitioners alike whether a more intensive method in terms of implementation and computational time, such as MCMC, is needed.

A. Posterior propriety of the Poisson model
If the posterior distribution of r is proper, then the full posterior distribution of random effects and r is also proper because f (λ, r | y) = f (λ | y) × f (r | y), where f (λ | y) is a product of k proper conditional posterior density function in (41). Thus, our goal is to show that ∞ 0 f (r | y)dr < ∞: = 1 r 2 r k j=1 y j + · · · + a k r k exp   −r k j=1 λ E j log(1 + n j /r) where the polynomial function of r in the bracket has constant coefficients.
If there are at least two groups whose observed values y j are non-zero, then f (r | y) goes to zero as r approaches zero due to the polynomial function of r in (73); the following two factors in (73) approach one. As r becomes infinite, f (r | y) touches zero exponentially fast due to the exponential term in the middle of (73). Thus, the integration of f (r | y) must be finite.