deconvolveR: A G-Modeling Program for Deconvolution and Empirical Bayes Estimation

Empirical Bayes inference assumes an unknown prior density g(θ) has yielded (unobservable) parameters Θ_1, Θ_2, ..., Θ_N, and that each Θ_i produces an independent observation X_i from p_i(X_i | Θ_i). The marginal density f_i(X_i) is a convolution of the prior g and p_i. The Bayes deconvolution problem is that of recovering g from the data. Although estimation of g (so-called g-modeling) is difficult, the results are more encouraging if the prior g is restricted to lie within a parametric family of distributions. We present a deconvolution approach in which g is restricted to a parametric exponential family, along with an R package, deconvolveR, designed for the purpose.


Introduction
Modern scientific technology excels at the production of large data sets composed of a great many small estimation problems. A microarray experiment, for example, might produce N one-dimensional Normal-theory estimates,
X_i ~ N(Θ_i, 1), i = 1, 2, ..., N, (Equation 1)
with the estimation of the Θ_i being the goal. This was the case for the prostate cancer study pictured in Figure 2.1 of Efron (2010), where there were N = 6,033 genes, with X_i measuring a standardized difference between patients and controls for the ith gene.
A Bayesian analysis begins with a prior density g(θ) for the Θ_i. Inference is based on the posterior density of Θ_i given X_i = x,
g(θ | x) = g(θ) p(x | θ) / f(x), (Equation 2)
where p(x | θ) is the density of X given Θ = θ, and f(x) is the marginal density of X,
f(x) = ∫ g(θ) p(x | θ) dθ. (Equation 3)
In the setting of Equation 1, p(x | θ) is ϕ(x − θ), with ϕ the standard Normal density. What if we don't know the prior density g(θ)? Empirical Bayes methods attempt to estimate g(θ) from the observed sample X = (X_1, X_2, ..., X_N). An estimate ĝ(·) then produces posterior approximations ĝ(θ | x) from Equation 2. Both ĝ(θ) and ĝ(θ | x) can be of interest in applied problems.
In the Normal model above, f(x) is the convolution of g(θ) with a standard N(0, 1) density. The empirical Bayes task is one of deconvolution: using the observed sample X from f(x) to estimate g(θ). This can be a formidable job. Convergence rates of ĝ to g are notoriously slow in the general framework where g(θ) can be anything at all (Carroll and Hall 1988). Efron (2016) showed that parametric models, where g(θ) is assumed to lie in a known exponential family, allow reasonably efficient and practical estimation algorithms. This is the "g-modeling" referred to in our title.
Empirical Bayes deconvolution and estimation does not require the Normal model. We might, for example, have X_i given Θ_i Poisson, or X_i given Θ_i Binomial, etc., the only requirement being a known specification of the distribution p(x | θ) for X_i given Θ_i. The "Bayes deconvolution problem" is a general name for estimating g(θ) in Equation 3 given a random sample from f(x).
Empirical Bayes applications have traditionally been dominated by f-modeling, where the probability models for the marginal density f(x), usually exponential families, are fit directly to the observed sample. Several packages for such estimation are available in R, particularly as part of the Bioconductor project (Gentleman et al. 2004). Package siggenes (Schwender 2020) implements the approach outlined in Efron (2010) for differential expression of genes; others, such as baySeq (Hardcastle 2019) and edgeR (Robinson, McCarthy, and Smyth 2010), use an empirical Bayes approach to estimate the parameter of a Negative Binomial prior or a dispersion parameter. Our package, deconvolveR, on the other hand, is devoted specifically to g-modeling.
Section 2 presents a brief review of g-modeling estimation theory, illustrated in Section 3 with a Poisson example relating to the "missing species problem", a classical empirical Bayes test case. The main content of this note appears in Section 4: a guide to a new R (R Core Team 2020) package deconvolveR, for the empirical Bayes estimation of g(θ) and g(θ | x). Package deconvolveR (Efron and Narasimhan 2020) is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=deconvolveR and GitHub. Efron (2016) gives a full explanation of the theory and its implementation.

Empirical Bayes estimation theory
This section presents a condensed review of the empirical Bayes estimation theory in Efron (2014, 2016), emphasizing its application as carried out by the deconvolveR package of Section 4. An unknown probability density g(θ) (possibly having discrete atoms) has yielded an unobservable random sample of independent realizations
Θ_1, Θ_2, ..., Θ_N ~ g(θ).
Each Θ_i independently produces an observed value X_i according to a known family of probability densities p(x | θ),
X_i ~ p(x | Θ_i), i = 1, 2, ..., N.
From the observer's point of view, the X_i are an independent and identically distributed (i.i.d.) sample from the marginal density f(x),
f(x) = ∫_T p(x | θ) g(θ) dθ, (Equation 7)
with T the sample space of the Θ_i. We wish to estimate g(θ) from the observed sample X = (X_1, X_2, ..., X_N).
The prior density g(θ) might in general be some mixture of discrete and continuous distributions. Here it will be convenient, both for explanation and computation, to assume that Θ's sample space T is finite and discrete,
T = {θ_(1), θ_(2), ..., θ_(m)}. (Equation 8)
A continuous formulation appears in Remark A1 of Efron (2016), but this is of no advantage in applications, since Equation 8 can always be specified finely enough to capture as much detail as is possible within the practical limitations of empirical Bayes inference. The support points θ_(j) in Equation 8 need not be equally spaced, and in fact are not in the Shakespeare example of Section 3. The only downside of unequal spacing is that some care is needed to give the correct visual impression when plotting ĝ(θ); see Figure 2 in Section 3. In particular, a "flat prior" won't have equal weights if the θ_(j) are unequally spaced.
Choosing T can be a matter of some numerical experimentation, to see if a wider range or finer grid substantially changes the estimated prior ĝ(θ). For the Normal model (Equation 1), the range of T can be assumed smaller than that of the observed sample X_1, X_2, ..., X_N, though "not much smaller" is a good rule of thumb. There is no great penalty for increasing the number of support points m in Equation 8, but often no great gain either.
Similarly X, the sample space of the observations X_i, is assumed finite and discrete,
X = {x_(1), x_(2), ..., x_(n)}. (Equation 9)
This is no restriction, since X can be taken to be the entire order statistic of X_i values. (Or, for continuous situations like Equation 1, the X_i can be discretized by binning.) In the discrete formulation of Equation 8, the prior g(θ) is represented by a vector g = (g_1, g_2, ..., g_m)', with g_j = g(θ_(j)). Likewise, the marginal f(x) in Equation 7 has vector form f = (f_1, f_2, ..., f_n)', f_k = f(x_(k)). Both g and f have nonnegative components summing to 1. Letting
p_kj = p(x_(k) | θ_(j)), (Equation 10)
we define the n × m matrix P = (p_kj). An advantage of the specifications in Equation 8 and Equation 9 is that the convolution-type relationship in Equation 3 between g(θ) and f(x) reduces to matrix multiplication,
f = P g.
The count vector y = (y_1, y_2, ..., y_n)', with y_k = #{X_i = x_(k)}, is a sufficient statistic for g; it follows a multinomial distribution on n categories, with N draws and probability vector f,
y ~ Mult_n(N, f).
G-modeling assumes that g(θ) is a member of a p-parameter exponential family on T, expressed in the discrete formulation as
g(α) = exp(Q α) / c(α), (Equation 14)
where Q is an m × p structure matrix, the default choice in deconvolveR being the natural spline basis ns(T, p); α is the unknown p-dimensional natural parameter vector; and c(α) is the divisor necessary to make g sum to 1. There is no deep mathematical reason for choosing splines, though their good behavior at the extremes of T helps reduce the volatility of ĝ(θ).
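As a concrete illustration of this discrete setup (not code from the package itself), the following sketch builds P, Q, g(α), and f = P g for the Normal case of Equation 1; the grids and the α vector are made up for illustration.

R> # Illustrative sketch of the discrete forward model f = P g
R> library(splines)
R> theta <- seq(-3, 3, by = 0.1)                     # support T, m = 61 points
R> x <- seq(-4, 4, by = 0.2)                         # support X, n = 41 points
R> P <- sapply(theta, function(th) dnorm(x - th))    # n x m matrix, p_kj = phi(x_k - theta_j)
R> P <- apply(P, 2, function(col) col / sum(col))    # normalize columns into discrete densities
R> Q <- ns(theta, df = 5)                            # m x p structure matrix
R> alpha <- c(1, -1, 0.5, 0, 0.25)                   # hypothetical natural parameter vector
R> g <- as.vector(exp(Q %*% alpha)); g <- g / sum(g) # g(alpha) of Equation 14
R> f <- as.vector(P %*% g)                           # marginal f = P g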
Coordinate-wise, Equation 14 says that g_j(α) = exp(Q_j α) / c(α), with Q_j the jth row of Q and c(α) = Σ_{l=1}^m exp(Q_l α). The log likelihood function l(α) of y is
l(α) = Σ_{k=1}^n y_k log f_k(α),
where f(α) = P g(α). Define
w_kj(α) = g_j(α) {p_kj / f_k(α) − 1},
and let W_k(α) be the m-vector (w_k1(α), w_k2(α), ..., w_km(α)) for k = 1, 2, ..., n. (Note that w_kj(α) equals g_{j/k}(α) − g_j(α), where g_{j/k}(α) indicates the conditional probability of θ_(j) given x_(k).) The score function for α then turns out to be
l̇(α) = ∂l(α)/∂α = Q' W_+(α), where W_+(α) = Σ_{k=1}^n y_k W_k(α). (Equation 20)
The maximum likelihood estimate (MLE) α̂ is found by numerically maximizing l(α) or by solving l̇(α) = 0.
There is also a compact expression for the Fisher information matrix I(α) = E_α{l̇(α) l̇(α)'},
I(α) = N Q' {Σ_{k=1}^n f_k(α) W_k(α) W_k(α)'} Q.
We could take I(α)^{-1} as an estimate of covariance for α̂. However, a small amount of regularization greatly improves the stability of α̂ and its corresponding deconvolution estimate g(α̂).
Rather than l(α), deconvolveR maximizes a penalized log likelihood
m(α) = l(α) − s(α), with s(α) = c_0 (Σ_{l=1}^p α_l^2)^{1/2} = c_0 ‖α‖,
where c_0 = 1 is the default value in deconvolveR. Standard asymptotic calculations give
cov(α̂) ≈ {I(α) + s̈(α)}^{-1} I(α) {I(α) + s̈(α)}^{-1} (Equation 25)
as an approximate covariance matrix of α̂ when α is the true value in model (14). The Hessian matrix s̈(α) in Equation 25 is calculated to be
s̈(α) = (c_0 / ‖α‖) {I − α α' / ‖α‖^2},
with I the p × p identity.
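For readers who want to see the objective concretely, here is a small sketch of the penalized log likelihood for given P, Q, and count vector y; it is illustrative only and is not the package's internal code.

R> # Sketch of m(alpha) = l(alpha) - c0 * ||alpha||
R> penalizedLoglik <- function(alpha, P, Q, y, c0 = 1) {
+    g <- as.vector(exp(Q %*% alpha)); g <- g / sum(g)  # g(alpha), Equation 14
+    f <- as.vector(P %*% g)                            # f(alpha) = P g(alpha)
+    sum(y * log(f)) - c0 * sqrt(sum(alpha^2))          # log likelihood minus penalty
+  }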
Finally, a p × p matrix D(α), defined in Efron (2016), leads to an approximate covariance matrix cov(ĝ) (Equation 28) for the estimated prior ĝ = g(α̂). Larger values of c_0 shrink g(α̂) more forcefully toward the flat prior g = (1/m, 1/m, ..., 1/m)' (if T is equally spaced). Looking at Equation 25, a measure of the strength of the penalty term compared to the observed data is the ratio of traces S(α) (Equation 29). S(α) is printed out by deconvolveR, allowing adjustment of c_0 for more or less shrinking if so desired. S(α) was quite small in our examples, supporting c_0 = 2 as a conservative choice.

The Shakespeare data
Word counts for the entire Shakespearean canon appear in Table 1: 14,376 distinct words were so rare they appeared just once each, 4,343 twice each, 2,292 three times each, with the table continuing on to the five words observed exactly 100 times each throughout the canon. We assume that the ith distinct word, in a hypothetical listing of Shakespeare's complete vocabulary (not just the part seen in the canon), appeared X_i times in the canon, with X_i following a Poisson distribution with expectation Θ_i,
X_i ~ Poisson(Θ_i).
As in Efron and Thisted (1976), we are interested in the distribution of the unseen parameters Θ_i, but here based on the g-modeling methodology of Section 2.
The support set T for Θ (Equation 8) was taken to be equally spaced on the λ = log(θ) scale, with m = 341 support points, a denser grid than turned out to be necessary; the smallest support point is θ_(1) = exp(−4) = 0.018 (see the discussion at the end of this section). Modeling Θ on a log scale is useful here because rare words, with very small values of Θ, comprise the bulk of Shakespeare's vocabulary, as Table 1 suggests.
(Eight hundred forty-six distinct words appear more than 100 times each in the canon; these are common words such as "and" or "the" that form the bulk of the canon's approximately 900,000 total word count, but they are of less interest here than those at the rarer end of the Θ distribution.) Table 1 gives the count vector y, with y_1 = 14,376, y_2 = 4,343, and so on.
The structure matrix Q (Equation 14) was taken to be a natural spline basis in λ with five degrees of freedom, ns(λ, 5) in the R language, a 341 × 5 matrix. Some care is needed in setting the entries p_kj of the matrix P. Letting p°_kj = exp(−θ_(j)) θ_(j)^{x_(k)} / x_(k)! denote the usual Poisson probability, the entries p_kj (Equation 10) are taken to be
p_kj = p°_kj / Σ_{h=1}^{100} p°_hj. (Equation 36)
This compensates for the truncated data in Table 1: the zero category (words in Shakespeare's vocabulary that he didn't use in the canon) is necessarily missing. (Also missing, less necessarily, are words appearing more than 100 times each.) The definition in Equation 36 makes column j of P the truncated Poisson distribution of X given Θ = θ_(j). The function deconv() in the R package deconvolveR was run with y, T, Q, and P as specified above, and with regularization constant c_0 = 2. The solid curve in Figure 1 plots the entries of ĝ = (ĝ_1, ĝ_2, ..., ĝ_m), drawn as a continuous curve, versus λ_j = log(θ_(j)). About 45% of the total mass Σ_j ĝ_j = 1 lies below Θ = 1 (λ = 0), indicating the prevalence of rare words in Shakespeare's usage.
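For readers who want to see the construction explicitly, the sketch below builds the support grid, the spline basis Q, and the truncated-Poisson P matrix of Equation 36; deconv() builds the equivalent matrices internally, so this is for illustration only, and the upper grid limit of 4.5 (with 0.025 spacing, giving m = 341) is an assumption.

R> # Hedged sketch of the support grid, Q, and the truncated-Poisson P of Equation 36
R> lambda <- seq(-4, 4.5, by = 0.025)       # 341 points on the log(theta) scale
R> tau <- exp(lambda)                       # support points theta_(j)
R> Q <- splines::ns(lambda, df = 5)         # 341 x 5 structure matrix
R> P <- sapply(tau, function(th) { p <- dpois(1:100, th); p / sum(p) })  # 100 x 341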
Forty-five percent is an underestimate. A word with parameter Θ_i has probability exp(−Θ_i) of yielding X_i = 0, in which case it will not be observed. Words with small Θ_i values are therefore systematically thinned out of the observed counts. We can correct for the thinning by defining
g̃_j = c_1 ĝ_j / (1 − exp(−θ_(j))), (Equation 37)
with c_1 the constant that makes g̃ sum to 1. The red dashed curve in Figure 1 shows g̃. This is a g-modeling estimate of the prior distribution of Θ we would see if there were no data truncation. It puts 88% of its probability mass below Θ = 1. (See the later discussion for some difficulties with this result.)

Equation 37 is by no means obvious. It helps to consider a simple case: suppose Θ can take on only two possible values, T = {1, 10}, with equal prior probabilities in the untruncated situation, say g̃_1(1) = g̃_2(10) = 0.5. If we begin with some large number Ñ of draws X_i ~ Poisson(Θ_i), we will observe about Ñ/2 of those from Θ = 10, but only about 0.632 Ñ/2 of those from Θ = 1, since X_i = 0 is unobservable. That is, we will effectively have prior weights g_1(1) = 0.632/(1 + 0.632) = 0.387 and g_2(10) = 1/(1 + 0.632) = 0.613.

The estimated prior g̃ = (g̃_1, g̃_2, ..., g̃_m) can be used to carry out Bayesian computations for the Θ_i parameters, for instance calculating the posterior probabilities
g̃(θ_(j) | x_(k)) = c_k g̃_j p_kj,
where p_kj is the truncated Poisson density of Equation 36 and c_k = (Σ_j g̃_j p_kj)^{-1}. The cases x_(k) equal to 1, 2, 4, and 8 appear in Figure 2, now graphed versus θ instead of log(θ). (To compensate for the unequal spacing of the θ_(j) values, the plotted curves are actually proportional to g̃_j p_kj / θ_(j).) The preponderance of small Θ values seen in Figure 1 pulls the mode of g̃(θ | x) toward zero, though less so for larger x.
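In code, the correction is a short computation. The sketch below assumes result is the deconv() fit just described and tau the Θ support grid; the "g" column name of the stats component is an assumption about the package output.

R> # Thinning correction (Equation 37): divide by the probability of being observed
R> ghat <- result$stats[, "g"]              # fitted prior from the deconv() output
R> gTilde <- ghat / (1 - exp(-tau))
R> gTilde <- gTilde / sum(gTilde)           # the c1 normalization, so gTilde sums to 1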
The vertical green bars in Figure 1 indicate ± one standard error for ĝ_j. These were obtained as the square roots of the diagonal elements of cov(ĝ) from Equation 28. As a check on the obtained values, a parametric bootstrap simulation was run: bootstrap count vectors y* were drawn from the multinomial distribution with n = 100 categories (the length of the support of X) and N = 30,688 draws (the total number of counts in Table 1), with probability vector f̂ = P ĝ; the resulting bootstrap standard errors were then compared with their theoretical counterparts. The agreement is quite good in this case. In practice the bootstrap calculations are usually easy to carry out as a reassuring supplement to the theory.
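A hedged sketch of such a bootstrap check follows; it reuses ghat and tau from the sketch above, assumes the fitted P matrix is returned as result$P, and uses a modest number of replications for illustration.

R> # Parametric bootstrap: resample count vectors from the multinomial model and refit
R> fHat <- as.vector(result$P %*% ghat)     # fitted marginal f-hat = P g-hat
R> gStar <- replicate(200, {
+    yStar <- as.vector(rmultinom(1, size = 30688, prob = fHat))   # N from Table 1
+    deconv(tau = tau, y = yStar, n = 100, c0 = 2)$stats[, "g"]
+  })
R> seBoot <- apply(gStar, 1, sd)            # compare with the theoretical standard errors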
Looking back at Table 1, it is tempting to ask how many "new" words (i.e., distinct words not appearing in the canon) we might find in a trove of newly discovered Shakespeare. This is Fisher's famous missing species problem. Suppose then that a previously unknown Shakespearean corpus of length t · C were found, with C ≈ 900,000 the length of the known canon. Assuming a Poisson process model with intensity Θ_i for word i, the probability that word i did not appear in the canon but does appear in the new corpus is
exp(−Θ_i) (1 − exp(−t Θ_i)). (Equation 40)
Equation 40 and the definition in Equation 37 give, after some work, an estimate R̂(t) (Equation 41) of R(t), the expected number of distinct new words found, divided by N, the observed number of distinct words in the canon. A graph of Shakespeare's R̂(t) function is shown in Figure 4, along with standard error bars derived from Equation 25. It predicts R(t) = 1, that is, a doubling of Shakespeare's observed vocabulary, at t = 3.74.
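The curve itself is easy to compute from ĝ. The sketch below uses the estimate R̂(t) = Σ_j ĝ_j e^{−θ_(j)} (1 − e^{−t θ_(j)}) / (1 − e^{−θ_(j)}), one natural way of combining Equation 40 with the thinning correction of Equation 37 (the exact expression used for Figure 4 is given in the package vignette); ghat and tau are as in the sketches above, and standard error bars are omitted.

R> # Hedged sketch of the missing species prediction curve R(t)
R> Rhat <- function(t) sum(ghat * exp(-tau) * (1 - exp(-t * tau)) / (1 - exp(-tau)))
R> tGrid <- seq(0, 10, by = 0.1)
R> plot(tGrid, sapply(tGrid, Rhat), type = "l", xlab = "t", ylab = "R(t)")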
All of this might seem like the rankest kind of statistical speculation. In fact, though, the estimate in Equation 41 performs well in cross-validatory tests where part of the canon is set aside and then predicted from the remainder; see Thisted and Efron (1987).
The resulting prediction curve, shown in Figure 4, is nearly the same as that for our five degrees of freedom spline model.
The missing species problem has an inestimable aspect at its rarest extreme: if Shakespeare knew 1,000,000 words that he only employed once each in a million canons, these would remain effectively invisible to us. By taking θ_(1) = exp(−4) = 0.018 in Equation 32, our model legislates out the one-in-a-million cases. It gives a good fit to the data, the fit passing a Wilks' test against the observed count vector y, so in this sense it cannot be improved by lowering θ_(1).
All of this seems mainly of pedantic interest in the Shakespeare example; less so, however, in biological applications of the missing species problem, where, for instance, the occurrence rates of cloned DNA segments can range over many orders of magnitude.

A guide to a new package deconvolveR
The R package deconvolveR contains one main function, deconv(), that directly handles three exponential family models: Binomial, Normal, and Poisson. Since users may wish to experiment with other exponential family models or change the details of how Q is normalized, deconv() also accepts user-specified Q and P matrices in its invocation.
The maximum likelihood estimation is carried out using the non-linear optimization function nlm() in R with the gradient of the likelihood computed via the theoretical formula in Equation 20. This has a practical effect in that harmless warnings may be generated during the optimization. The Hessian, although available, is not used to guide the optimization in the current version of the software due to numerical considerations.
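To make the preceding description concrete, here is a hedged sketch of what such an optimization might look like; it is not the package's internal code, and it assumes P, Q, and the count vector y of Section 2 are available. Since nlm() minimizes, the negative penalized log likelihood is passed along with its analytical gradient, built from the score formula of Equation 20.

R> negObjective <- function(alpha, P, Q, y, c0 = 1) {
+    g <- as.vector(exp(Q %*% alpha)); g <- g / sum(g)    # g(alpha)
+    f <- as.vector(P %*% g)                              # f(alpha) = P g(alpha)
+    val <- -(sum(y * log(f)) - c0 * sqrt(sum(alpha^2)))  # negative penalized log likelihood
+    W <- sapply(seq_along(f), function(k) g * (P[k, ] / f[k] - 1))  # columns are the W_k
+    grad <- -(t(Q) %*% (W %*% y) - c0 * alpha / sqrt(sum(alpha^2)))
+    attr(val, "gradient") <- as.vector(grad)             # analytical gradient for nlm()
+    val
+  }
R> fit <- nlm(negObjective, p = rep(1, ncol(Q)), P = P, Q = Q, y = y)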
The package contains a vignette that provides the complete code to reproduce all the results in this paper with additional details. Below we illustrate its use with examples from the three main models, first with simulated data for the Poisson and Normal cases, followed by real data examples using the Shakespeare word count data and an intestinal surgery dataset. We note that although g has discrete support, we plot it as a continuous curve throughout.

A Poisson simulation
Suppose the Θ_i are drawn from a χ² density with 10 degrees of freedom and the X_i | Θ_i are Poisson with expectation Θ_i:
Θ_i ~ χ²_10 and X_i | Θ_i ~ Poisson(Θ_i), i = 1, 2, ..., N.
We carry out 1000 simulations, each with N = 1000 observations, by first generating the Θ_i and then creating a 1000 × 1000 data matrix whose columns are the simulated Poisson samples. Taking the support of Θ to be the discrete set T = (1, 2, ..., 32), we apply the deconv() function to each column of the matrix to obtain the estimate ĝ along with a host of other statistics.
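A hedged sketch of that setup follows; the seed and object names are illustrative, and the deconv() arguments follow the package's documented interface.

R> # Simulate Theta_i ~ chi-squared(10) and 1000 Poisson samples of size N = 1000
R> library(deconvolveR)
R> set.seed(1783)                                   # illustrative seed
R> N <- 1000; nSIM <- 1000
R> theta <- rchisq(N, df = 10)
R> data <- sapply(seq_len(nSIM), function(k) rpois(n = N, lambda = theta))
R> tau <- seq(1, 32)                                # discrete support set T
R> results <- apply(data, 2, function(x)
+    deconv(tau = tau, X = x, ignoreZero = FALSE, family = "Poisson"))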
Note the use of ignoreZero = FALSE in the call above: here, unlike in the Shakespeare example, zero counts are observed.
We have once again relied on the default Poisson family and a natural cubic spline basis with five degrees of freedom for Q. The columns of Q are standardized to have mean zero and sum of squares equal to 1. The regularization parameter c0 is left at its default value of 1. We then construct a table of the estimated g(θ) and related statistics.
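A hedged sketch of such a table, averaging the estimates over the simulations, is shown below; the "g" column name of the stats component is an assumption about the package output.

R> # Summarize ghat across the 1000 simulations and compare with the true prior
R> gEst <- sapply(results, function(r) r$stats[, "g"])      # m x nSIM matrix of estimates
R> gTrue <- dchisq(tau, df = 10); gTrue <- gTrue / sum(gTrue)
R> tbl <- data.frame(theta = tau, gTrue = gTrue,
+                    gHatMean = rowMeans(gEst),              # average estimate
+                    gHatSD = apply(gEst, 1, sd))            # spread across simulations
R> head(tbl)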

A Normal model
Next, we consider data in which the unobserved means have a substantial atom at zero, with the remaining probability spread around −3. To deconvolve these data, we specify an atom at zero using the parameter deltaAt, which applies only to the Normal case. Using τ = (−6, −5.75, ..., 3) and a fifth-degree polynomial basis for Q yields an estimated probability for µ = 0 of 0.887 ± 0.009, with a bias of about −0.006.
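A hedged sketch of a generating mechanism consistent with this description follows; the 0.9 atom probability and the N(−3, 1) component are assumptions, the exact recipe being given in the package vignette.

R> # Simulate theta_i with an atom at 0 and the rest around -3, then z_i ~ N(theta_i, 1)
R> set.seed(129023)                                 # illustrative seed
R> N <- 1000
R> atom <- runif(N) < 0.90                          # assumed atom probability
R> theta <- ifelse(atom, 0, rnorm(N, mean = -3, sd = 1))
R> data <- data.frame(theta = theta, z = rnorm(N, mean = theta, sd = 1))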
R> tau <- seq(from = -6, to = 3, by = 0.25)
R> result <- deconv(tau = tau, X = data$z, deltaAt = 0, family = "Normal")

The density estimates, once the atom at zero is removed, are not accurate at all, but the g-modeling estimates of the conditional 95% credible intervals for µ given z (code included in the package vignette) are a good match for the Bayes intervals, as shown in Figure 6.

A twin towers model
In the previous example, the distribution of θ had a significant atom at 0, with the rest of the density smeared around −3. We now consider a case where θ has a bimodal distribution (included in the package as the dataset disjointTheta). Figure 7 shows a histogram of the θ values alongside a histogram of the data, generated as Z_i ~ N(θ_i, 1). The left side of Figure 8 reveals the effect of varying the degrees of freedom of the natural spline basis; in this case, a choice of 6 or 7 appears reasonable to capture the bimodality. The right side of Figure 8 shows the effect of the regularization parameter c_0 on the estimates: larger values of c_0 smooth out ĝ, making the bimodality less prominent.
Here S(α) can serve as a guide for choosing c 0 . For varying degrees of freedom, Table 3 shows the estimates of S(α). A choice of c 0 ≤ 4 would avoid excessive penalization.
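A hedged sketch of this kind of exploration follows; pDegree and c0 are the deconv() arguments for the spline degrees of freedom and the regularization constant, the support grid is an assumption, and z is generated from the disjointTheta values as described above.

R> # Vary the spline degrees of freedom, recording S(alpha-hat) for each fit
R> set.seed(32)                                     # illustrative seed
R> z <- rnorm(length(disjointTheta), mean = disjointTheta, sd = 1)
R> tau <- seq(from = -4, to = 6, by = 0.2)          # assumed support grid for theta
R> fits <- lapply(2:8, function(df)
+    deconv(tau = tau, X = z, family = "Normal", pDegree = df))
R> sapply(fits, function(fit) fit$S)                # penalty-to-data ratios, cf. Table 3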

Shakespeare example
The data for the Shakespeare example is included in the package as dataset bardWordCount.
Here, the data is a (truncated) vector of Poisson counts for the frequencies of words that appeared exactly once, twice, and so on, up to 100 times. We construct the support set T, equally spaced on the λ = log(θ) scale, and call deconv() as shown below. By default, deconv() assumes a Poisson family and works on one sample at a time. The invocation provides the prior support tau = T, the sufficient statistic of counts y, and indicates the support of X via the n = 100 parameter, so that X = (1, 2, ..., 100). The parameter c0 is the regularization parameter in Equation 24.
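Following the package vignette, the call looks roughly like this; the grid spacing of 0.025 is what gives the m = 341 support points mentioned earlier.

R> lambda <- seq(-4, 4.5, by = 0.025)               # 341 points on the log(theta) scale
R> tau <- exp(lambda)
R> result <- deconv(tau = tau, y = bardWordCount, n = 100, c0 = 2)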
The result is a list with a number of quantities, including the MLE α̂, the covariance matrix of α̂, and the matrices P and Q. Printing the head and tail rows of the stats component displays ĝ, the cumulative Ĝ, standard errors, and biases. One could also print out the ratio of traces S(α̂) of Equation 29, for example:

R> print(result$S)
[1] 0.005534954

This indicates that the penalty term with c_0 = 2 used above is not too big compared to the observed data.

Intestinal surgery example
The dataset surg contains data on intestinal surgery for 844 cancer patients. In the study, surgeons removed satellite nodes for later testing. The data consist of pairs (n_i, X_i), where n_i is the number of satellites removed and X_i is the number found to be malignant among them. We assume a Binomial model, X_i ~ Binomial(n_i, θ_i), with θ_i the probability that any one satellite site is malignant for the ith patient.
We take τ = (0.01, 0.02, ..., 0.99) (so m = 99), Q to be a natural spline with five degrees of freedom whose columns are standardized to mean 0 and sum of squares equal to 1, and the penalization parameter at its default value of 1.
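A hedged sketch of the call follows; it assumes surg has columns n (satellites removed) and s (number found malignant), passed to deconv() as a two-column matrix via X.

R> tau <- seq(from = 0.01, to = 0.99, by = 0.01)
R> result <- deconv(tau = tau, X = cbind(surg$n, surg$s),
+                   family = "Binomial", ignoreZero = FALSE)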
As a check on the estimates of standard error and bias provided by deconv(), we compare the results with what we obtain using a parametric bootstrap. The bootstrap is run as follows.
For each of 1000 runs, 844 simulated realizations Θ*_i were sampled from the density ĝ. Each gave an observation X*_i ~ Binomial(n_i, Θ*_i), with n_i the ith sample size from the original dataset. Finally, α̂* was computed using deconv(). The results are shown in Table 4.
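A hedged sketch of that bootstrap is shown below; ghat, tau, and the surg columns are as above, and the mle component name for α̂ is an assumption about the deconv() output.

R> # Parametric bootstrap: Theta* from ghat, X* ~ Binomial(n_i, Theta*), then refit
R> ghat <- result$stats[, "g"]
R> alphaStar <- replicate(1000, {
+    thetaStar <- sample(tau, size = nrow(surg), replace = TRUE, prob = ghat)
+    xStar <- rbinom(n = nrow(surg), size = surg$n, prob = thetaStar)
+    deconv(tau = tau, X = cbind(surg$n, xStar),
+           family = "Binomial", ignoreZero = FALSE)$mle
+  })
R> apply(alphaStar, 1, sd)                          # bootstrap standard errors for alpha-hat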

Summary
Empirical Bayes estimation exploded into the statistics field around the 1950s as a new branch of statistical inference. Practical tools have been mostly related to f -modeling where probability models are proposed for the marginal density f (x) of the data. Empirical Bayes deconvolution, the problem of estimating the prior g from the data, is harder, with slow nonparametric rates of convergence (Carroll and Hall 1988). Parametric modeling of g (Efron 2016) from a known exponential family offers a way forward and our package deconvolveR implements this approach. The examples shown here indicate that it works well for a range of problems.