Bayesian random-effects meta-analysis using the bayesmeta R package

The random-effects or normal-normal hierarchical model is commonly utilized in a wide range of meta-analysis applications. A Bayesian approach to inference is very attractive in this context, especially when a meta-analysis is based only on few studies. The bayesmeta R package provides readily accessible tools to perform Bayesian meta-analyses and generate plots and summaries, without having to worry about computational details. It allows for flexible prior specification and instant access to the resulting posterior distributions, including prediction and shrinkage estimation, and facilitating for example quick sensitivity checks. The present paper introduces the underlying theory and showcases its usage.


Meta-analysis
Evidence commonly comes in separate bits, and not necessarily from a single experiment. In contemporary science, the careful conduct of systematic reviews of the available evidence from diverse data sources is an effective and ubiquitously practiced means of compiling relevant information. In this context, meta-analyses allow for the formal, mathematical combination of information to merge data from individual investigations to a joint result. Along with qualitative, often informal assessment and evaluation of the present evidence, meta-analytic methods have become a powerful tool to guide objective decision-making (Chalmers et al. 2002;Liberati et al. 2009;Hedges and Olkin 1985;Hartung et al. 2008;Borenstein et al. 2009). Applications of meta-analytic methods span such diverse fields as agriculture, astronomy, biology, ecology, education, health research, medicine, psychology, and many more (Chalmers et al. 2002).
When empirical data from separate experiments are to be combined, one usually needs to be concerned about the straight one-to-one comparability of the provided results. There may be obvious or concealed sources of heterogeneity between different studies, originating e.g. from differences in the selection and treatment of subjects, or in the exact definition of outcomes. Residual heterogeneity may be anticipated in the modeling stage and considered in the estimation process; a common approach is to include an additional variance component to account for between-study variability. On the technical side, the consideration of such a heterogeneity parameter leads to a random-effects model rather than a fixed-effect model (Hedges and Olkin 1985;Hartung et al. 2008;Borenstein et al. 2009). Inclusion of a non-zero heterogeneity will generally lead to more conservative results, as opposed to a "naïve" merging of the given data without consideration of potentially heterogeneous data sources.

The normal-normal hierarchical model (NNHM)
A wide range of problems may be approached using the normal-normal hierarchical model (NNHM); this generic random-effects model is applicable when the estimates to be combined are given along with their uncertainties (standard errors) on a real-valued scale. Many problems are commonly solved this way, often after a transformation stage to re-formulate the problem on an appropriate scale. For example, binary data given in terms of contingency tables are routinely expressed in terms of logarithmic odds ratios (and associated standard errors), which are then readily processed via the NNHM.
In the NNHM, measurements and standard errors are modeled via normal distributions, using means and their standard errors as sufficient statistics, while on a second hierarchy level the heterogeneity is modeled as an additive normal variance component as well. The model then has two parameters, the (real-valued) effect, and the (positive) heterogeneity. If the heterogeneity is zero, then the model reduces to the special case of a fixed-effect model (Hedges and Olkin 1985;Hartung et al. 2008;Higgins and Green 2011;Borenstein et al. 2009). The model and terminology are described in detail in Section 2.1 below. The bayesmeta package is based on this simple yet ubiquitous form of the NNHM.

The Bayesian solution
The bayesmeta package implements a Bayesian approach to inference. Bayesian modeling has previously been advocated and used in the meta-analysis context (Smith et al. 1995;Sutton and Abrams 2001;Spiegelhalter 2004;Spiegelhalter et al. 2004;Higgins et al. 2009;Lunn et al. 2013); the difference to the more common "frequentist" methods is that the problem is approached by expressing states of information via probability distributions, where the consideration of new data then constitutes an update to a previous information state (Gelman et al. 2014;Jaynes 2003;Spiegelhalter et al. 1999). A Bayesian analysis allows (and in fact requires) the specification of prior information, expressing the a priori knowledge, before data are taken into account. Technically this means the definition of a probability distribution, the prior distribution, over the unknowns in the statistical model. Once model and prior are specified, the results of a Bayesian analysis are uniquely determined; however, implementing the necessary computations to derive these in practice may still be tricky.
While analysis results will of course depend on the prior setting, the range of reasonable specifications however is usually limited. In the meta-analysis context, non-informative or weakly informative priors for the effect are readily defined, if required. For the betweenstudy heterogeneity an informative specification is often appropriate, especially when only a small number of studies is involved. Interestingly, the number of studies combined in the meta-analyses archived in the Cochrane Library is reported by both Davey et al. (2011) and Kontopantelis et al. (2013) with a median of 3 and a 75% quantile of 6, so that in practice a majority of analyses (including subgroup analyses and secondary outcomes) here is based on as few as 2-3 studies; such cases may not be as unusual as one might expect, at least in medical contexts. Typical meta-analysis sizes may vary across fields; for example, the data collected by van Erp et al. (2017) indicate a median number of 12 and first and third quartiles of 5 and 33 studies, respectively, for meta-analyses published in the Psychological Bulletin. Standard options for priors are available here, confining the prior probability within reasonable ranges ). Long-run properties of Bayesian methods have also been compared with common frequentist approaches by Friede et al. (2017a,b), with a focus on the common case of very few studies.
Bayesian methods commonly are computationally more demanding than other methods; usually these require the determination of high-dimensional integrals. In some (usually simpler) cases, the necessary integrals can be solved analytically, but it was mostly with the advent of modern computers and especially the development of Markov chain Monte Carlo (MCMC) methods that Bayesian analyses have become more generally tractable (Metropolis and Ulam 1949;Gilks et al. 1996). In the present case of random-effects meta-analysis within the NNHM, where only two unkown parameters are to be inferred, computations may be simplified by utilizing numerical integration or importance resampling , both of which require relatively little manual tweaking in order to get them to work. It turns out that computations may be done partly analytically and partly numerically, offering another approach to simplify calculations via the direct algorithm . Utilizing this method, the bayesmeta package provides direct access to quasi-analytical posterior distributions without having to worry about setup, diagnosis or post-processing of MCMC algorithms. The present paper describes some of the methods along with the usage of the bayesmeta package.

Other common approaches
A frequentist approach to inference is largely focused on long-run average operating characteristics of estimators. In this framework, meta-analysis using the NNHM is most commonly done in two stages, where first the heterogeneity parameter is estimated, and then the effect estimate is derived based on the heterogenity estimate. The choice of a heterogeneity estimator poses a problem on its own; a host of different heterogeneity estimators have been described, for a comprehensive summary of the most common ones see e.g. Veroniki et al. (2016). A common problem with such estimators of the heterogeneity variance component is that they frequently turn out as zero, effectively resulting in a fixed-effect model, which is usually seen as an undesirable feature. Within this context, Chung et al. (2013) proposed a penalized likelihood approach, utilizing a Gamma type "prior" penalty term in order to guarantee non-zero heterogeneity estimates.
The treatment and estimation of heterogeneity in practice has been investigated e.g. by Pullenayegum (2011), Turner et al. (2012) and Kontopantelis et al. (2013). When looking at large numbers of meta-analyses published by the Cochrane Collaboration, the majority (57%) of heterogeneity estimates in fact turned out as zero (Turner et al. 2012), while the numbers are higher for "small" meta-analyses, and lower for analyses involving many studies (Kontopantelis et al. 2013). Meanwhile the choice of analysis method (fixed-or random-effects) also correlates with the number of studies involved, with larger numbers of studies increasing the chances of a random-effects model being employed (Kontopantelis et al. 2013). Kontopantelis et al. (2013) also compared the fraction of heterogeneity estimates resulting as zero in actual meta-analyses with that obtained from simulation, suggesting that heterogeneity is commonly underestimated or remains undetected.
Once an estimate for the amount of heterogeneity has been arrived at, what is commonly done is to use this as a plug-in estimate and proceed to compute further tests and estimates conditioning on the heterogeneity estimate as if its true value were known (Hedges and Olkin 1985;Hartung et al. 2008;Borenstein et al. 2009). Such a procedure would be warranted if the heterogeneity estimate was estimated with relatively great precision. Notable exceptions here are the methods proposed by Follmann and Proschan (1999); Hartung and Knapp (2001a,b) and Sidik and Jonkman (2002), where the estimation uncertainty in heterogeneity is accounted for (on the technical side resulting in an inflated standard error and a heavier-tailed Student-t distribution to be utilized for deriving tests or confidence intervals), or bootstrap methods (van den Noortgate and Onghena 2005) and parameter estimation in the generalized inference framework (Friedrich and Knapp 2013).

Bayesian and frequentist approaches in comparison
While the interpretation of results from a frequentist analysis, especially significance tests and confidence intervals, is commonly challenging and often misunderstood (Morey et al. 2016;Hoekstra et al. 2014), Bayesian results usually address the actual research question more directly and may be interpreted more intuitively (Jaynes 2003;Spiegelhalter et al. 2004;Szucs and Ioannidis 2017;Kruschke and Liddell 2018). On the one hand, frequentist confidence intervals aim to uniformly provide a pre-specified coverage probability conditionally on any single point in parameter space, while Bayesian credible intervals account for the prior distribution and consequently provide proper coverage on average over the prior (Dawid 1982; see also Appendix A.5, page 39). By their construction, they directly relate to the information on the parameters after considering the data at hand, which is not quite the intention behind classical confidence statements; even if a proper ("frequentist") coverage probability is attained, this may still lead to rather counterintuitive conclusions in the face of actual data (Jaynes 1976;Morey et al. 2016).
In some statistical applications, there is little difference between the results from frequentist and Bayesian analyses; often one may be considered a limiting or special case of the other, while interpretations remain somewhat different (Bartholomew 1965;Jaynes 1976;Lindley 1977;Severini 1991;Spiegelhalter et al. 1999;Bayarri and Berger 2004). This is not necessarily the case in the present context, as meta-analyses are quite commonly based on few studies, so that certain large-sample asymptotics may not apply. A common misconception, namely that a Bayesian analysis based on a uniform prior generally yielded identical results to a frequentist, purely likelihood-based analysis, is exposed as such here. A crucial feature of meta-analysis problems is that one of the parameters, the heterogeneity, is confined to a bounded parameter space, which sometimes causes problems for frequentist methods (Mandelkern 2002), partly because heterogeneity estimates commonly are not adequately characterized through a mere point estimate and an associated standard error. The common use of a plug-in estimate for the heterogeneity in frequentist procedures then turns out problematic, as such a strategy usually only makes sense when the estimated parameter is associated with relatively little uncertainty. Choice of a suitable heterogeneity estimator adds to the complication, as, despite their common aim, actual estimates may turn out quite differently, adding some degree of arbitrariness to the inference (Veroniki et al. 2016). Within a Bayesian context, these issues do not pose difficulties, and inference on some parameters while accounting for uncertainty in other nuisance parameters is straightforwardly solved through marginalization. This way, uncertainty in heterogeneity is readily accommodated, and since no asymptotic arguments need to be invoked, results are valid also for small sample sizes. For such reasons Bayesian methods have been considered particularly well-suited for hierarchical models in general (Browne and Draper 2006;Kruschke and Liddell 2018), and for meta-analysis problems in particular (Smith et al. 1995;Sutton and Abrams 2001). While Bayesian modeling necessitates the specification of a prior probability distribution over all parameters, the range of plausible formulations in a given context is usually limited. Differences in results corresponding to different prior settings are quite natural, as effectively these correspond to differing answers to differently posed questions.
Use of a coherent Bayesian framework also naturally facilitates advanced computations, in which the posterior from a previous analysis constitutes the prior for a subsequent analysis. This is useful for example in sequential meta-analyses (Spence et al. 2016), in the design of future experiments (Schmidli et al. 2017), or when utilizing historical data in the analysis of clinical trials (Wandel et al. 2017).

Implementation
A number of software packages have been developed for frequentist inference within the NNHM framework, for example the Review Manager (RevMan), that is freely available from the Cochrane Collaboration (The Cochrane Collaboration 2014; Higgins and Green 2011), or, within R, the metafor and meta packages (Viechtbauer 2010;Schwarzer 2007;Schwarzer et al. 2015).
Bayesian analyses are usually computationally more demanding, and quite generally these can be approached using MCMC methods (Gilks et al. 1996). For example, meta-analysis along with the extension to meta-regression is implemented in the bmeta R package (Ding and Baio 2015) by utilizing Gibbs sampling via JAGS (Plummer 2003(Plummer , 2008). An MCMC approach offers great flexibility, and a number of model variations are also available, for example, a nonparametric generalization of the NNHM in the bspmma package (Burr 2012), a generalized approach based on model averaging in the metaBMA package (Heck et al. 2017), or methods suitable for the special problem of meta-analysis of diagnostic studies in the bamdit and metamisc packages (Verde 2011;Debray and de Jong 2012).
In certain model constellations, it may be possible to derive exact posterior distributions, as for example implemented in the mmeta R package, which utilizes a parametric model for meta-analysis of count data that are provided in terms of contingency tables (Luo et al. 2014). Otherwise, inference in a range of model classes may be approached via integrated nested Laplace approximations (INLA), as utilized e.g. in the meta4diag package for meta-analysis of diagnostic studies (Guo and Riebler 2015), or in the nmaINLA package for network-metaanalysis and -regression (Günhan 2017).
The bayesmeta package aims to provide easy access to a fully Bayesian analysis approach within the common NNHM framework. While the use of MCMC methods would be an option here, these usually require a certain amount of expertise and experience in set-up and convergence diagnostics. Also, inference based on MCMC output always contains a certain noise component due to the finite number of samples, which may sometimes constitute a nuisance. Use of the bayesmeta packages instantly provides accurate posterior summary figures analogous to output familiar from common (frequentist) meta-analysis output. Posterior distributions may be accessed in quasi-analytical form, and advanced methods, e.g. for prediction or shrinkage estimation, are also provided. Computations are fast and reproducible, allowing for quick sensitivity checks and facilitating larger-scale simulations. Accuracy of the implementation (calibration) may be verified via simulation (see also Appendix A.5).

Outline
The remaining paper is mostly arranged in two major parts. In the following Section 2, the underlying theory is introduced; first the common NNHM (random-effects) model and its notation are explained, and prior distributions for the two parameters are discussed. Then the resulting likelihood, marginal likelihood and posterior distributions are presented and some general points are introduced.
In Section 3, the actual usage of the bayesmeta package is demonstrated; an example data set is introduced, along which the steps of a Bayesian meta-analysis are shown. The determination of summary statistics and plots, as well as possible variations in the analysis setup and the computation of posterior predictive p-values are presented. Section 4 then concludes with a summary.
2. Random-effects meta-analysis 2.1. The normal-normal hierarchical model The aim is to infer a quantity µ, the effect, based on a number k of different measurements which are provided along with their corresponding uncertainties. What is known are the empirical estimates y i (of µ) that are associated with known standard errors σ i ; these constitute the "input data". The ith study's measurement y i (where i = 1, . . . , k) is assumed to arise as exchangeable and normally distributed around the study's true parameter value θ i : where the variability is due to the sampling error, whose magnitude is given by the (known) standard error σ i . All studies do not necessarily have identical true values θ i ; in order to accommodate potential between-study heterogeneity in the model, we assume that each study i measures a quantity θ i that differs from the overall mean µ by another exchangeable, normally distributed offset with variance τ 2 ≥ 0: This second model stage implements the random effects assumption.
Especially when the study-specific parameters θ i are not of primary interest, the notation may be simplified by integrating out the "intermediate" θ i terms and stating the model in its marginal form as (Hedges and Olkin 1985;Hartung et al. 2008;Borenstein et al. 2009Borenstein et al. , 2010. The two unknowns remaining to be inferred are the mean effect µ and the heterogeneity τ , which is commonly considered a nuisance parameter. The studies' shrinkage estimates of θ i are however sometimes also of interest and may be inferred from the model as well. In the special case of zero heterogeneity (τ = 0), the model simplifies to a fixed-effect model in which the study-specific means θ i are all identical (θ 1 = . . . = θ k = µ).
Such two-stage hierarchical models of an overall mean (µ) and study-specific parameters (θ i ) with a random effect for each study are commonly utilized in meta-analysis applications. The simple case of normally distributed error terms at both stages is often convenient and easily tractable, and it also constitutes a good approximation in many cases. So, while the effect here is treated as a continuous parameter, the model is quite commonly utilized to also process different types of data (e.g. logarithmic odds ratios from dichotomous data, etc.) after transformation to a real-valued effect scale (Hedges and Olkin 1985;Hartung et al. 2008;Borenstein et al. 2009;Viechtbauer 2010; Higgins and Green 2011).

General
Among the two unknowns, the effect µ is commonly of primary interest, while the heterogeneity τ usually is considered a nuisance parameter. In order to infer the parameters, we need to specify our prior information about µ and τ in terms of their joint prior probability density function p(µ, τ ). What exactly constitutes a reasonable prior distribution always depends on the given context (Gelman et al. 2014;Spiegelhalter et al. 2004;Jaynes 2003). For computational convenience, in the following we assume that we can factor the prior density into independent marginals: p(µ, τ ) = p(µ) × p(τ ). While this may not seem unreasonable, depending on the context, one may also argue in favour of a dependent prior specification (e.g., Senn 2007;Pullenayegum 2011). In the following, we aim to provide a comprehensive overview of popular or sensible options. We will discuss proper as well as improper priors; when using improper priors, the usual care must be taken, as the resulting posterior then may or may not be a proper probability distribution (Gelman et al. 2014). The discussed heterogeneity priors are also summarized in Table 1 below.

The effect parameter µ
An obvious choice of a non-informative prior for the effect µ, being a location parameter, is an improper uniform distribution over the real line (Gelman et al. 2014;Spiegelhalter et al. 2004;Jaynes 2003). A normal prior (with mean µ p and variance σ 2 p ) is a natural choice as an informative prior for the effect µ, and these two are also the cases we will restrict ourselves to for computational convenience and feasibility in the following. The normal prior here constitutes the conditionally conjugate prior distribution for the effect (see also Section 2.5 below). The uninformative uniform prior would also result as the limiting case for increasing prior uncertainty (σ p → ∞).
A way to guide the choice of a vague prior is by consideration of unit information priors (Kass and Wasserman 1995). The idea here is to specify the prior such that its information content (variance) is in some way, possibly somewhat heuristically, equivalent to a single observational unit. For example, if the endpoint is a logarithmic odds ratio (log-OR), a neutral unit information prior may be given by a normal prior with zero mean (centered around an odds ratio of 1, i.e., "no effect") and a standard deviation of σ p = 4. For a derivation, see also Appendix A.1.

The heterogeneity parameter τ : proper, informative priors
Especially since in the meta-analysis context one is commonly dealing with very small numbers of studies k, where not much information on between-study heterogeneity may be expected to be gained from the data, it may be worth while considering the use of informative priors. Depending on the exact context, there often is some information on what values for the heterogeneity are more plausible and which ones are less so, and making use of the present information may make a difference in the end. For example, if the meta-analysis is based on logarithmic odds ratios, it will usually make sense to assume that heterogeneity is unlikely to exceed, say, τ = log(10) ≈ 2.3, which would correspond to roughly an expected factor 10 difference in effects (odds ratios) between trials due to heterogeneity. An extensive discussion of such cases is provided in Spiegelhalter et al. (2004, Sec. 5.7). Values for τ between 0.1 and 0.5 here are considered "reasonable", values between 0.5 and 1.0 are "fairly high" and values beyond 1.0 are "fairly extreme". An analogous reasoning would apply for similarly defined outcomes, for example, logarithmic relative risks, logarithmic hazard ratios, or logarithmic variances (Schmidli et al. 2017). Consideration of the magnitude of unit information variances (see previous paragraph) may also be helpful in this context, as variability (heterogeneity) between studies will usually be expected to be substantially below the variability between individuals. Along these lines, it is often useful to also consider the implications of prior specifications in terms of the corresponding prior predictive distributions; see also Section 3.4 below. The impact of variations of how exactly prior information is implemented in the model may eventually also be checked via sensitivity analyses.
A sensible informative choice for p(τ ) may be the maximum entropy prior for a pre-specified prior expectation E[τ ], the exponential distribution with rate λ = 1 E[τ ] (Jaynes 1968(Jaynes , 2003Gregory 2005). Log-normal or half-normal prior distributions, e.g. with pre-specified quantiles, may also be useful alternatives. For example, for log-OR (or similar) endpoints, the routine use of half-normal distributions with scale 0.5 or 1.0 has been suggested by Friede et al. (2017a,b) and was shown to work well in simulations. In order to gain robustness, one may also consider mixture distributions as informative priors, for example half-Student-t, half-Cauchy, or Lomax distributions, which may be considered heavy-tailed variants of half-normal or exponential distributions (Johnson et al. 1994). Use of a heavy-tailed prior distribution will allow for discounting of the prior in favour of the data in case the data appear to be in conflict with prior expectations (O'Hagan and Pericchi 2012;Schmidli et al. 2014). The use of weakly informative half-Student-t or half-Cauchy priors may also be motivated via theoretical arguments, as these can be shown to also exhibit favourable frequentist properties (Gelman 2006;Polson and Scott 2012).
Although an inverse-Gamma distribution for an informative prior may seem to be an obvious choice, use of this distribution is generally not recommended (Gelman 2006;Polson and Scott 2012). More on informative (as well as uninformative) priors may be found in Spiegelhalter et al. (2004), Gelman (2006) and Polson and Scott (2012). Some empirical evidence to consider for informative priors for certain types of endpoints may be found e.g. in Pullenayegum (2011), Turner et al. (2012, Kontopantelis et al. (2013) and van Erp et al. (2017). In particular, Rhodes et al. (2015) and Turner et al. (2015) derived empirical priors based on data from the Cochrane database of systematic reviews; prior information here is expressed in terms of log-normal or log-Student-t distributions.
The heterogeneity parameter τ : proper, 'non-informative' priors Some "non-informative" proper priors have been proposed that are scale-invariant in the sense that (like the Jeffreys prior discussed below as well) they depend only on the standard errors σ i . A re-expression of the estimation problem on a different measurement scale would entail a proportional re-scaling of standard errors and so inference effectively remains unaffected. Such priors are discussed e.g. by Spiegelhalter et al. (2004, Sec. 5.7.3) and Berger and Deely (1988). Priors like these, however, are somewhat problematic from a logical perspective, as these imply that the prior information on the heterogeneity depended on the accuracy of the individual studies' estimates (Senn 2007).
The following two priors both depend on the harmonic mean s 2 0 of squared standard errors, i.e., The uniform shrinkage prior results from considering the "average shrinkage" S(τ ) = s 2 0 s 2 0 +τ 2 ; placing a uniform prior on S(τ ) results in a prior density for the heterogeneity, which has a median of s 0 . For a detailed discussion see e.g. Spiegelhalter et al. (2004) or Daniels (1999). A uniform prior in S(τ ) is equivalent to a uniform prior in 1 − S(τ ) = τ 2 s 2 0 +τ 2 , which is an expression very similar to the I 2 measure of heterogeneity due to Higgins and Thompson (2002). Substiting the harmonic mean s 2 0 for their average (ŝ 2 ) in the prior density (5) hence yields a uniform prior in I 2 . The DuMouchel prior has a similar form and is defined through This implies a log-logistic distribution for the heterogeneity τ that has its mode at τ = 0 and its median at τ = s 0 DuMouchel and Normand 2000).
A conventional prior as a proper variation of the Jeffreys prior (see also the closely related variant in (12) below) was given by Berger and Deely (1988) as This prior is in particular intended as a non-informative but proper choice for testing or model selection purposes (Berger and Deely 1988;Berger and Pericchi 2001).

The heterogeneity parameter τ : improper priors
Uninformative priors It is not so obvious what exactly would qualify a prior for τ as "uninformative". One might argue that an uninformative prior should have a probability density function that is monotonically decreasing in τ ; another question would be whether the density's intercept p(τ = 0) should be positive or finite, or what the density's derivative near zero should be. In general, the uninformative prior for a scale parameter in a simple normal model is commonly taken to be uniform in log(τ ) (and log(τ 2 )) with density p(τ ) ∝ 1 τ (Jeffreys 1946;Gelman et al. 2014), however, this "log-uniform" prior will not lead to proper, integrable posteriors in the present context (Gelman 2006). Another reasonable choice may be the improper uniform prior on the positive real line, but care must be taken here as usual, as the posterior may end up improper as well; this will not result in a proper posterior when there are only one or two estimates available (i.e., when k ≤ 2) and an (improper) uniform effect prior is used (Gelman 2006). The uniform prior may be considered a conservative choice in a particular sense, as shown below (Appendix A.2), but on the other hand it may also be considered overly conservative, as it tends to attach a lot of weight to potentially unreasonably large heterogeneity values. Gelman (2006) generally recommends a uniform heterogeneity prior as an uninformative default, unless the number of studies k is small, or an informative prior is desired or for other reasons.
One may also argue via certain requirements that an uninformative prior should meet (Jaynes 1968(Jaynes , 2003. For example, it may be reasonable to demand invariance with respect to re-scaling of τ for the prior density p(τ ), leading to a constraint of the form for any scaling factor s > 0 and some positive-valued function f (s) (i.e., re-scaling should not affect the density's shape). This requirement obviously restricts the range of priors to those with monotonic density functions. It leads to a family of improper prior distributions with densities for a ∈ R. This family includes (for a = −1) the common log-uniform prior for a scale parameter mentioned above, or (for a = 0) the uniform prior. But this class also includes further interesting cases, like, for −1 < a < 0, a compromise between the above two uniform and log-uniform priors that is (locally) integrable over any interval [0, u] with 0 < u < ∞ while also being shorter-tailed on the right than the improper uniform prior. An obvious example is (for a = −0.5) the prior with monotonically decreasing density function which corresponds to a uniform prior in √ τ . This prior has the unusual property that the prior density, and with that the posterior as well, exhibits a pole (i.e., approaches infinity) at the origin. A value of a = 1 would lead to a uniform prior in τ 2 , with an even higher preference for large heterogeneity values, which requires at least k ≥ 4 studies for a proper posterior; this prior is generally not recommended (Gelman 2006).

The Jeffreys prior
The non-informative Jeffreys prior (Gelman et al. 2014;Jeffreys 1946) for this problem results from the form of the likelihood (see equation (3) or (14) below), or more specifically, the associated expected Fisher information J(µ, τ ); its probability density function is given by p(µ, τ ) ∝ det(J(µ, τ )). This general form of Jeffreys' prior however is generally not recommended when the set of parameters includes a location parameter as in the present case; see e.g. Jeffreys (1946), Jeffreys (1961, Sec. III.3.10), Berger (1985, Sec. 3.3.3) and Kass and Wasserman (1996, Sec. 2.2). Instead, location parameters are commonly treated as fixed and are conditioned upon (Berger 1985;Kass and Wasserman 1996). In the present case (since µ and τ are orthogonal in the sense that the Fisher information matrix' off-diagonal elements are zero), this leads to Tibshirani's non-informative prior (Tibshirani 1989;Kass and Wasserman 1996, Sec. 3.7), a variation of the general Jeffreys prior, which is of the form In the following, we will consider this variant as the Jeffreys prior for the NNHM. This prior also constitutes the Berger-Bernardo reference prior for the present problem (Bodnar et al. 2016(Bodnar et al. , 2017. The prior is improper, as the right tail asymptotically behaves like p(τ ) ∝ 1 τ , but it is locally integrable in the left tail with p(0) = 0. The resuling posterior is proper as long as k ≥ 2 (Bodnar et al. 2017).
In case of constant standard errors σ i = σ, the prior's mode is at τ = σ. Otherwise the mode tends to be near the smallest σ i , but the prior may also be multimodal. The Jeffreys prior's dependence on the standard errors σ i implies that the prior information varies with the precision of the underlying data y i . With greater precision, lower heterogeneity values are considered plausible. On the other hand, the prior is invariant to the overall scale of the problem (as it scales with the standard errors σ i ) like the proper non-informative priors mentioned above.
Another variation of the Jeffreys prior was given by Berger and Deely (1988) and is defined as This prior is also improper, and it equals the Jeffreys prior in case all standard errors σ i are identical.

Choice of a prior
The selection of a prior for the effect µ is relatively straightforward. The normal prior's variance allows to vary the width from narrow/informative to wide/uninformative; the improper uniform prior as a limiting case is also available, and this may be the obvious default choice in many cases. Consideration of the unit information prior's width may also help judging the amount of information conveyed by a given informative prior.
The heterogeneity priors discussed above may roughly be categorized in four classes, as shown in Table 1. First of all, one needs to decide whether a proper prior is desired or required. Arguments in favour of a proper prior may include the need for finite marginal likelihoods and Bayes factors in model selection problems, general preference, or a small number (k) of studies.  (5), DuMouchel (6), conventional (7) Jeffreys (11), Among the proper priors one then has the choice between informative distributions, and priors that are supposed to be non-informative, which however depend on the involved studies' standard errors σ i . The improper priors discussed here are all uninformative in one or another sense; the Jeffreys and Berger-Deely priors also depend on the σ i , they require at least k = 2 available studies, the uniform prior is independent of the σ i and requires at least 3 studies.
Some prior densities are illustrated in Figure 1. As the choice of a sensible informative prior depends on the context, and some other priors depend on the σ i values, the priors shown here correspond to the example discussed in Section 3 below. The proper informative halfnormal and half-Cauchy priors with scale 0.5 are reasonable choices for log-ORs and similar endpoints. The log-normal prior's parameters are recommended for the type of investigation based on the analysis by Turner et al. (2015). The proper uniform shrinkage, DuMouchel and conventional priors depend on the involved studies' standard errors σ i . The improper Jeffreys and Berger-Deely prior densities do not integrate to a finite value, so their overall scaling is somewhat arbitrary here. Gelman (2006) generally recommends the improper uniform heterogeneity prior, unless the number of studies k is small, or an informative prior is desired or for other reasons. In those cases, an informative prior from the half-Student-t family is recommended, which includes half-Cauchy and half-normal priors as special or limiting cases. Use of the half-Cauchy family is further supported by Polson and Scott (2012) based also on classical frequentist properties. If, for example, the endpoint is a log-OR, then, using the categorization by Spiegelhalter et al. (2004, Sec. 5.7), a half-normal prior with scale 0.5 may confine heterogeneity mostly to "reasonable" to "fairly high" values and leave about 5% probability for "fairly extreme" heterogeneity. A larger scale parameter or a heavier-tailed distribution may then serve as a more conservative or more robust reference for a sensitivity check (Friede et al. 2017a,b). The Jeffreys prior constitutes another default choice of an uninformative prior; as the Berger-Bernardo reference prior it represents the least informative prior in a certain sense (Bodnar et al. 2017), and it will yield a proper posterior as long as at least 2 studies are available. Half-normal and half-Cauchy parameters are reasonable choices for log-OR endpoints. The log-normal parameters are chosen according to Turner et al. (2015). The uniform shrinkage and DuMouchel priors are scaled relative to the harmonic mean of squared standard errors s 2 0 . The Jeffreys and Berger-Deely priors are improper, so their densities do not integrate to a finite value.

Likelihood
The form of the likelihood follows from the assumptions introduced in Section 2.1. The NNHM is essentially a simple normal model with unknown mean and an unknown variance component; the resulting likelihood function is given by where y and σ denote the vectors of k effect measures y i and their standard errors σ i . For any practical application it is often more useful to consider the logarithmic likelihood, i.e.,

Marginalization
In order to do inference within a Bayesian framework, it is usually necessary to compute integrals involving the posterior distribution (Gelman et al. 2014). For example, in a multiparameter model, one may be interested in the marginal posterior distribution or in the posterior expectation of a certain parameter, both of which result as integrals. Key to the bayesmeta implementation is the partly analytical and partly numerical integration over parameter space. In the following, we will derive the marginal posterior distribution of the heterogeneity parameter via the marginal likelihood, and we will later see how marginal and conditional distributions may be utilized to evaluate the required integrals. The likelihood is initially a function of both parameters (µ and τ ), and the marginal likelihood of the heterogeneity τ results from integration over the effect µ, using its prior distribution, which we specified to be either uniform or normal.

Uniform prior
Using the improper uniform prior for the effect µ (p(µ) ∝ 1), we can derive the marginal likelihood, marginalized over µ, For the NNHM, the integral turns out as where µ(τ ) is the conditional posterior mean of µ for a given heterogeneity τ . Conditional mean and standard deviation are given by (17) A derivation is provided in Appendix A.3; the standard deviation σ(τ ) will become relevant later on. On the logarithmic scale the marginal likelihood then is:

Conjugate normal prior
The normal effect prior here is the conditionally conjugate prior distribution, since the resulting conditional posterior (for a given τ value) again is of a normal form. Calculations for the (proper) normal prior for the effect µ work similarly to the previous derivation. Assume the prior for µ is normal with mean µ p and variance σ 2 p , i.e., it is defined through the probability density function The necessary integral for the marginal likelihood then results as One can see that the prior parameters (µ p and σ p ) enter in a similar manner as the data points (y i and σ i ). In analogy to the previous derivation, define the conditional posterior mean and standard deviation and the logarithmic marginal likelihood turns out as Note that, comparing equations (18) and (22) (as well as (17) and (21)), as expected, use of the uniform prior constitutes the limiting case of large prior uncertainty (σ p → ∞).

Conditional effect posteriors
As long as a uniform or normal prior for the effect µ is used, the effect's conditional posterior distribution for a given heterogeneity value, p(µ|τ, y, σ), again is normal with mean µ(τ ) and standard deviation σ(τ ) as given in equations (17) or (21), respectively (Gelman et al. 2014).
Note that the conditional posterior moments (17) are also commonly utilized in frequentist fixed-effect and random-effects meta-analyses. The mean µ(τ ) constitutes the conditional maximum likelihood estimate (of µ), conditional on a particular amount of heterogeneity τ , while σ(τ ) gives the corresponding (conditional) standard error. Plugging in τ = 0 yields the fixed-effect estimate of µ, while a value τ > 0 yields a random-effects estimate (Hedges and Olkin 1985, Sec. 6); see also Section 3.5 below for an example.

Marginal and joint posterior
Having derived the marginal likelihood p( y|τ, σ) in Section 2.4, the (one-dimensional) marginal posterior density of τ may be computed (up to a normalizing constant) by multiplication with the heterogeneity prior This feature was one of the reasons for specifying the priors for µ and τ as independent (see Section 2.2.1). One-dimensional integration can now easily be done numerically for arbitrary priors p(τ ), as long as the resulting posterior is proper.
In this formulation, it becomes obvious that the effect's marginal distribution is a continuous mixture distribution, in which the normal conditionals p(µ|τ, y, σ) are mixed via the marginal p(τ | y, σ) with (Seidel 2010;Lindsay 1995). This expression allows for easy numerical approximation of posterior integrals of interest. For example, the marginal distribution of the effect µ (the normal mixture) may be approximated by using a discrete grid of τ values and summing up the normal conditionals using weights defined through τ 's marginal density: where the set of τ j is appropriately chosen and corresponding "weights" w j (with j w j = 1) are based on the marginal p(τ ). With that, it is now relatively straightforward to work with the joint distribution, derive marginals, moments, implement Monte Carlo integration, and so on. A general prescription of how to approach a discrete approximation as sketched in (26) while keeping the accuracy under control is given by the direct algorithm described by Röver and Friede (2017). A few more technical details are also given in Section 2.11 and Appendix A.4 below.

Predictive distribution
The predictive distribution expresses the posterior knowledge about a "future" observation, i.e., an additional draw θ k+1 from the underlying population of studies. This is commonly of interest in order to judge the amount of heterogeneity relative to the estimation uncertainty (Riley et al. 2011;Guddat et al. 2012;Bender et al. 2014), or for extrapolation in the design and analysis of future studies (Schmidli et al. 2014). Technically, the predictive distribution p(θ k+1 | y, σ) is similar to the marginal distribution of the effect µ (see previous section).

Shrinkage estimates of study-specific means
Sometimes it is of interest to also infer the posterior distributions of the study-specific parameters θ j . These may e.g. be in the focus if a meta-analysis is performed in order support the analysis of a particular study by borrowing strength from a number of related studies (Gelman et al. 2014;Schmidli et al. 2014;Wandel et al. 2017). Conditionally on a particular heterogeneity value τ , these distributions are again normal with moments given by Var(θ j | τ, y, σ) = 1 (Gelman et al. 2014, Sec. 5.5). These expressions illustrate the shrinkage of posterior estimates towards the common mean as a function of the heterogeneity. Analogously to the effect's posterior and predictive distribution, these conditional moments again allow to approximate each individual θ i 's marginal posterior distribution via a discrete mixture to marginalize over the heterogeneity.

Credible intervals
Credible intervals derived from a posterior probability distribution may be computed e.g. using the distribution's α 2 and (1− α 2 ) quantiles. However, such a simple central interval may not necessarily be the most sensible summary of a posterior distribution, especially if it is skewed or extends to the boundary of its parameter space. In such cases, it usually makes more sense to consider the highest posterior density (HPD) region, i.e., a (1−α) credible region enclosing the (1−α) posterior probability where the posterior density is largest (Gelman et al. 2014). Such a region may be disjoint and hard to determine, but closely related (and identical for unimodal distributions) is the shortest credible interval. Both types of intervals, central and shortest, will be considered in the following.

Posterior predictive checks and p-values
Posterior predictive model checks allow to investigate the fit of a model to a given data set (Gelman et al. 1996;Gelman 2003;Gelman et al. 2014). The consistency of data and model is explored by comparing the actual data to data sets predicted via the posterior distribution. The comparison is usually done graphically, or via suitable summary statistics of actual and predicted data; a discrepancy then is an indicator of a poor model fit.
If the summary statistic is one-dimensional, then the comparison may be formalized by focusing on the fractions of predicted values above or below the actually observed value. This leads to the concept of posterior predictive p-values, which are closely related to classical p-values (Meng 1994;Berkhof et al. 2000;Gelman 2013;Wasserstein 2016). Posterior predictive p-values have been applied and advocated in a range of contexts, including e.g. educational testing (Sinharay et al. 2006), metrology (Kacker et al. 2008), psychology (van de Schoot et al. 2014) and biology (Chambert et al. 2014).
In the context of the NNHM, posterior predictive checks are useful, as they allow to investigate certain hypotheses of interest, like for example µ ≥ 0, τ = 0 or θ i = 0. The posterior predictive distribution conditional on a particular hypothesis may then be explored in order to derive a corresponding posterior predictive p-value. The choice of a suitable summary statistic however may still pose a challenge. The posterior predictive checks here are implemented via Monte Carlo sampling, therefore parts of these procedures are computationally expensive.

How the bayesmeta() function works internally
The bayesmeta() function utilizes the fact that in the context of the NNHM the resulting posterior is only 2-dimensional (for now ignoring the θ i parameters) and may be expressed as a mixture distribution (see (24)) where the heterogeneity's marginal p(τ | y, σ) is known, and the effect's conditionals p(µ|τ, y, σ) are all of a normal form. This setup allows to approximate the effect marginal by a discrete mixture (see (26)) while keeping the accuracy under control; the accuracy requirements are formulated via the direct algorithm's two tuning parameters (δ and ) ).
An example of joint and marginal posterior densities of the two parameters is illustrated in Figure 3 below (see page 23). The joint posterior density (top right) is easily evaluated based on likelihood and prior density, both of which are available in analytical form (see Sections 2.2 and 2.3). The heterogeneity's marginal density (bottom right) is also easily computed, based on marginal likelihood and prior (see (23)); only its normalizing constant needs to be computed numerically (using the integrate() function available in R). The CDF is also computed using numerical integration, and the quantile function is evaluated using again the CDF and inverting it via R's uniroot() root-finding function. Figure 3) is approximated by a mixture of a finite number of normal distributions. In terms of equation (26), what is required is a finite set of support points τ j , the parameters (means and standard deviations) of the associated normal conditionals p(µ|τ j ), and the corresponding weights w j . These are all determined using the direct algorithm, and in the bayesmeta() output (see the following section) one can find these in the "...$support" element. In the example shown in Figure 3, the effect marginal is based on a 17-component normal mixture; this number of components is sufficient to bound the discrepancy between actual marginal and mixture approximation to amount to a Kullback-Leibler divergence below δ = 1%. The desired accuracy can be pre-specified via the "delta" and "epsilon" arguments .

Now the effect's marginal density (bottom left panel of
Computations related to such discrete, finite mixtures are relatively straightforward; density and CDF are linear combination of the components' (normal) densities and CDFs, random number generation is simple, and moments are also easily derived (Seidel 2010;Lindsay 1995). A few more details on the implementation are given in Appendix A.4. Many of the internal computations heavily rely on numerical integration, root-finding and optimization via R's integrate(), uniroot(), optimize() and optim() functions. Accuracy of the eventual implementation is confirmed using simulations in Appendix A.5.
3. Using the bayesmeta package 3.1. General Before proceeding to an exemplary analysis, we will first introduce an example data set and go through the common procedure of effect size derivation step-by-step. This will serve to introduce some context and generate a set of estimates (y i ) and associated standard errors (σ i ); the subsequent section will then pick up the analysis from that starting point.

Example data: a systematic review in immunosuppression
Interleukin-2 receptor antagonists (IL-2RA) are commonly used as part of immunosuppressive therapy after organ transplantation. Treatment strategies and responses are different for adults and children, and it was of interest to investigate the effectiveness of IL-2RA in preventing acute rejection (AR) events following liver transplantation in paediatric patients. A systematic literature review was performed, and six controlled studies were found reporting on the occurrence of AR events in paediatric liver transplant recipients (Crins et al. 2014).
The binary data on AR events from each of the six studies may be summarized in a 2×2-table as shown in Table 2. The data shown here come from the earliest of the studies found in the review (Heffron et al. 2003). Here one can already see that the treatment appears to be effective, as roughly only a quarter of patients in the IL-2RA group experienced an AR event, compared to three quarters in the control group.
In order to compare the effect magnitude between different studies, a common effect measure is computed from each contingency table (for each study i). One such measure is the logarithmic odds ratio (log-OR), comparing the odds of an event in treatment-and controlgroups. The log-OR estimate is given by y i = log a/b c/d , where a to d are the event counts as defined in Table 2; the corresponding standard error is σ i = 1 a + 1 b + 1 c + 1 d . In the above example, the odds ratio is 14/47 15/5 = 14 141 ≈ 0.10; we have y 1 = log 14/47 15/5 = −2.31 and σ 1 = 1 14 + 1 47 + 1 15 + 1 5 = 0.60. A wide range of other measures is available for contingency tables as well as other types of study outcomes; for example, in the present case one might alternatively be interested in (logarithmic) relative risks (RR) (log a/(a+b) c/(c+d) ) instead of the log-ORs (Hedges and Olkin 1985;Hartung et al. 2008;Borenstein et al. 2009;Viechtbauer 2010;Higgins and Green 2011;Deeks 2002). The original data and derived log-ORs for all  six studies from the systematic review are shown in Table 3.
The transplantation data set is also contained in the bayesmeta package; the data need to be loaded via the data() function: R> require("bayesmeta") R> data("CrinsEtAl2014") R> CrinsEtAl2014 Effect sizes and standard errors can be calculated from the plain count data either by implementing the corresponding formulas (see above), or, much easier and recommended, by using e.g. the metafor package's escalc() function: R> require("metafor") R> crins.es <-escalc(measure="OR", + ai=exp.AR.events, n1i=exp.total, ci=cont.AR.events, n2i=cont.total, + slab=publication, data=CrinsEtAl2014) R> crins.es One can see that the escalc() function uses a terminology analogous to that in Table 2 to interface with binary outcome data; the "ai" input argument corresponds to the a i table entries (number a of events in the treatment group for each study i), and so on. The output of the escalc() function (here: the data frame named "crins.es") will then be the original data along with two additional columns named "yi" and "vi" containing the calculated effect sizes (y i ) and the squared (!) standard errors (σ 2 i ), respectively. Note that for computing the 6th study's log-OR (see Table 3), a continuity correction was necessary, because one of the contingency table entries was zero (Sweeting et al. 2004). For more details on effect size calculation and default behaviour, see also Viechtbauer (2010) or the escalc() function's online documentation.

Performing a Bayesian random-effects meta-analysis
The bayesmeta() function In order to perform a random-effects meta-analysis, we need to specify the data, as well as the prior for the unknown parameters µ and τ (see Section 2.2). For the effect µ we are restricted to normal or uniform priors; here we use a vague prior centered at µ p = 0, which corresponds to an OR of 1, i.e., no effect. The prior standard deviation we set to σ p = 4, corresponding to the vague unit information prior (see Section 2.2.2). For the heterogeneity, we use a half-normal prior with scale 0.5, confining the a priori expected heterogeneity to τ ≤ 0.98 with 95% probability (i.e., allowing for "fairly extreme" values with only about 5% prior probability).
With the log-ORs computed as in the previous section, we can now execute the analysis using the following call R> ma01 <-bayesmeta(y = crins.es[,"yi"], sigma = sqrt(crins.es[,"vi"]), + labels = crins.es[,"publication"], mu.prior.mean = 0, mu.prior.sd = 4, + tau.prior = function(t){dhalfnormal(t,scale=0.5)}) The first three arguments pass the data (vectors of estimates y i and standard errors σ i ) and (optionally) a vector of corresponding study labels to the bayesmeta() function. Note that the metafor package's escalc() function returned variances (i.e., squared standard errors), while the bayesmeta() function's "sigma" argument requires the standard errors (i.e., the square root of the variances); hence the additional square-root-transformation here. The following arguments specify the prior mean and standard deviation of the (normal) prior for the effect µ. Finally, the last argument specifies the prior for the heterogeneity τ . While for the effect prior we are restricted to using normal or improper uniform priors, the heterogeneity prior can be of essentially any type. Specification of the heterogeneity prior works via specification of its prior density function. While this type of argument specification is somewhat unusual, it is reasonably straightforward, as one can see above. The dhalfnormal() function here is the half-normal distribution's density function; see also the corresponding online help (e.g. via entering "?dhalfnormal" in R).
Retrieving and processing the "yi" and "vi" elements (as well as study labels, if available) from an escalc() result in general is not complicated, and the bayesmeta() function can also do this automatically for any escalc() output, including the many types of effect sizes that are available (Viechtbauer 2010). Using simply the escalc() function's output as an input, the identical result can be achieved by calling R> ma01 <-bayesmeta(crins.es, mu.prior.mean = 0, mu.prior.sd = 4, + tau.prior = function(t){dhalfnormal(t,scale=0.5)}) The bayesmeta() computations may take up to a few seconds, but with that the main calculations are done, and the essential results are stored in the generated object of class "bayesmeta" (here named "ma01"). One can inspect the results by printing the returned object:
6 estimates: Heffron (2003), Gibelli (2004), Schuller (2005), Ganschow (2005) One can see that the analysis was based on k = 6 studies, that both parameters' priors were found to be proper, and maximum-likelihood (ML) as well as maximum-a-posteriori (MAP) values are quoted. Probably most interestingly, under "marginal posterior summary" one can find summary statistics describing the marginal posterior distributions of heterogeneity (τ ) and effect (µ), which may often be the most relevant figures. The resulting posterior median and 95% credible interval for the effect µ here are at a log-OR of −1.57 [−2.23, −0.93]; this information may eventually constitute the essential result in many cases.

The forestplot() function
To illustrate data and results, one can use the forestplot() function. This function is actually a bayesmeta-specific method based on the forestplot package's generic forestplot() function (Gordon and Lumley 2017). In its simplest form, it may be used as R> forestplot(ma01) Figure 2 shows the forestplot() function's default output for the example analysis. In the figure one can see all estimates y i along with 95% intervals based on the provided standard errors σ i . At the bottom, 95% credible intervals for the effect and for the predictive distribution are shown (Lewis and Clarke 2001;Guddat et al. 2012). Next to each of the quoted estimates (as specified through y i and σ i ), the shrinkage intervals for the study-specific effects θ i are also quoted estimate shrinkage estimate study Heffron (2003) Gibelli (2004) Schuller (2005) Ganschow (2005) Spada (2006) Gras (

The plot() function
The analysis output may be inspected more closely using the plot() function:

R> plot(ma01)
The output for our example is shown in Figure 3; in particular, the joint and marginal posterior distributions are illustrated in detail. Prior densities may be superimposed by using the "prior=TRUE" argument, and axis ranges may also be specified manually; see also the online help for the plot.bayesmeta() method.

Elements of the bayesmeta() output
It is possible to access the joint and marginal densities shown in Figure 3 (and more) directly from the bayesmeta() output. As usual for an object returned from a non-trivial analysis function, the result of a bayesmeta() call is a list object of class "bayesmeta" containing a number of further individual objects. One can check the complete listing of available entries in the online documentation. For example, there is the "...$summary" entry giving some basic summary statistics: R> ma01$summary tau mu theta mode 0.2453139 -1.5691214 -1.5632732 median 0.3445023 -1.5734819 -1.5701653 ma01 effect Gras (2008) Spada (2006) Ganschow (2005) Schuller (2005) Gibelli (2004) Heffron (2003) prediction θ k +1 Some of these we already saw in the output when simply printing the object (see above).
The additional third column here shows summary statistics for the predictive distribution of a 'future' study (θ k+1 ). One can also access the original data (the y i and σ i ) in the "...$y" and "...$sigma" entries, or the study labels and the total number of studies (k) in the "...$labels" and "...$k" entries.
For example, suppose that we assume a rate of AR events of p c = 50% for the control group, and we are interested in the implied risk difference based on our analysis. The risk difference is simply p t − p c , where p t is the event rate in the treatment (IL-2RA) group. To determine the distribution of the risk difference we can now simply use Monte Carlo sampling and run R> prob.control <-0.5 R> logodds.control <-log(prob.control / (1 -prob.control)) R> logodds.treat <-(logodds.control + + ma01$rposterior(n=10000, tau.sample=FALSE)) R> prob.treat <-exp(logodds.treat) / (1 + exp(logodds.treat)) R> riskdiff <-(prob.treat -prob.control) R> median(riskdiff) [1] -0.3284975 R> quantile(riskdiff, c(0.025, 0.975)) 2.5% 97.5% -0.4028368 -0.2149175 So here we find a median risk difference of −0.33 and a 95% credible interval of [−0.40, −0.21] for this example. The risk difference distribution could now also be investigated further using histograms etc.

Prior predictive distributions
In order to judge the implications of settings of the heterogeneity prior, it is often useful to consider prior predictive distributions (Gelman et al. 2014). Any fixed value of τ will imply a certain (prior) distribution p(θ i |µ, τ ) and variability among the true study-specific means θ 1 , . . . , θ k , namely, a normal distribution with Var(θ i |µ, τ ) = τ 2 (see also (2)). Depending on the type of endpoint (e.g., log-ORs), the implied variability can be interpreted and judged on the corresponding outcome scale (Spiegelhalter et al. 2004, Sec. 5.7).
Assuming a prior distribution for τ , rather than a fixed value, also implies assumptions on the a priori expected distribution and variability of the true study parameters θ i . The prior predictive distribution p(θ i |τ ) of the θ i values then is a mixture of normal distributions, with mean µ and with the prior p(τ ) as the mixing distribution for the normal standard deviation (Seidel 2010;Lindsay 1995). As the name suggests, the prior predictive distribution is actually closely related to the (posterior) predictive distribution discussed above (Gelman et al. 2014). This mixture distribution can again be evaluated using the direct algorithm ; this approach is implemented in the normalmixture() function.
Consider the half-normal prior distribution with scale 0.5 that was used for the heterogeneity in the above analysis. We can now check what prior predictive distribution this prior corresponds to. We only need to supply the prior CDF (the mean µ is by default set to zero): R> hn05 <-normalmixture(cdf=function(t){phalfnormal(t, scale=0.5)}) One can check the returned result (e.g. via str(hn05)); the result is a list with several elements, among which most importantly are the mixture's density, cumulative distribution and quantile functions ("...$density()", "...$cdf()" and "...$quantile()", respectively).

Informative heterogeneity priors
It may also make sense to consider empirical information for the setup of an informative heterogeneity prior, for example, when other evidence is extremely sparse. In medical or psychological contexts, some evidence for certain types of endpoints may be found e.g. in Pullenayegum (2011), Turner et al. (2012, Kontopantelis et al. (2013) and van Erp et al. (2017). Instantly applicable for a meta-analysis are the numbers given by Rhodes et al. (2015) and Turner et al. (2015), where in both cases the complete Cochrane Database of Systematic Reviews was analyzed to infer the predictive distribution of heterogeneity for specific applications. The investigation by Rhodes et al. (2015) here was concerned with mean difference endpoints, while Turner et al. (2015) focused on log-OR endpoints. The derived prior distributions are directly available in the bayesmeta package via the RhodesEtAlPrior() and TurnerEtAlPrior() functions. For our present example (a log-OR endpoint whose definition may be categorized as "surgical / device related success / failure", and where the comparison is between pharmacological treatment and control), we can derive the prior simply as R> tp <-TurnerEtAlPrior(outcome = "surgical", + comparator1 = "pharmacological", comparator2 = "placebo / control") For a complete description of the possible input options see the online documentation; the RhodesEtAlPrior() function then works similarly. The function output is a list with several entries, including the prior density, cumulative distribution and quantile function (in this case a log-normal distribution) in the "...$dprior()", "...$pprior()" and "...$qprior()" elements. This way we can e.g. check what magnitude of heterogeneity values is a priori expected for this setting by determining the median as well as 2.5% and 97.5% quantiles: R> tp$qprior(c(0.025, 0.5, 0.975)) [1] 0.06233896 0.34300852 1.88734045 The prior density can now immediately be used and passed on to the bayesmeta() function; for example, we can use the same effect prior as before and the "empirical" prior for the heterogeneity: R> ma02 <-bayesmeta(crins.es, mu.prior.mean = 0, mu.prior.sd = 4, + tau.prior = tp$dprior) Comparing the results to the previous analysis (e.g. via their "...$summary" outputs), one can see that in this case they are very similar. The two corresponding prior densities are also shown in Figure 1 (page 12; solid and dotted blue lines).

Non-informative priors
As discussed in Section 2.2, an obvious choice of an uninformative prior for the effect µ would be the (improper) uniform prior on the real line; this one can be utilized by simply leaving the mu.prior.mean and mu.prior.sd parameters unspecified. In order to use one of the uninformative heterogeneity priors discussed in Section 2.2, these do not need to be specified "manually" in terms of their probability density function; a set of priors is already pre-implemented and may be specified via a character string. The default setting for example is tau.prior="uniform". If one wants to use, say, the uniform effect prior along with the Jeffreys prior for the heterogeneity τ (see Section 2.2 and Bodnar et al. (2017)), one can run R> ma03 <-bayesmeta(crins.es, tau.prior="Jeffreys") The complete list of possible options is described in detail in the online documentation.

Making the connection with frequentist results
Frequentist and Bayesian approaches to inference within the NNHM framework are obviously related, and it may be interesting to highlight the connection between the corresponding results. A simple frequentist analysis may be performed e.g. using the metafor package's "rma()" function via R> ma04 <-rma(crins.es) (Viechtbauer 2010). By default, the restricted ML (REML) heterogeneity estimatorτ REML is used, but the exact type of estimator does not matter here. The heterogeneity point estimate here turns out as: R> sqrt(ma04$tau2) [1] 0.4670268 and we can retrieve the effect estimate and its standard error via: In the Bayesian setup, these numbers correspond to conditional posterior moments of the effect (µ|τ =τ REML ) in an analysis using the uniform effect prior. Such an analysis was performed in the previous section (uniform effect prior and Jeffreys heterogeneity prior; the heterogeneity prior does not matter here) and stored in the "ma03" object. From this we can retrieve the effect's conditional posterior moments (mean and standard deviation for τ =τ REML ) using the "...$cond.moment()" function: quoted estimate shrinkage estimate study Heffron (2003) Spada (  and we can see that these correspond exactly to the frequentist effect estimate. Both analysis approaches are related through the use of the same likelihood function; in the Bayesian analysis uncertainty (e.g. in the heterogeneity) is accounted for via integration, and a prior distribution for both parameters is considered.

Posterior predictive checks
A meta-analysis of two studies Posterior predictive p-values allow to quantify the consistency of the data with certain parametric hypotheses; see Section 2.10. In the following we will determine some p-values from the bayesmeta() output; to this end, we will investigate a second meta-analysis example involving only two studies.
Of the six studies considered in the pediatric transplantation example (see Figure 2), only two were randomized (Heffron et al. 2003;Spada et al. 2006). Since randomized studies are usually considered as evidence of higher quality, now suppose one was interested in combining the randomized studies only. Computations analogous to the preceding example may be done via R> ma05 <-bayesmeta(crins.es[crins.es[,"randomized"]=="yes",], + mu.prior.mean=0, mu.prior.sd=4, + tau.prior=function(t){dhalfnormal(t,scale=0.5)}) Figure 6 shows the forest plot for this analysis. Based on these two studies only, we can now inspect e.g. the estimate of the overall effect µ; comparing to the previous analysis (Figure 2), the (absolute) estimate is slightly larger, but the credible interval is wider.

Posterior predictive p-values for the effect (µ)
The obvious 'null' hypothesis to be tested here is H 0 : µ ≥ 0 (i.e., no effect or a harmful effect) versus the alternative H 1 : µ < 0 (a beneficial effect). We may now derive a posterior predictive p-value in order to express to what extent the data are consistent with or in contradiction to the null hypothesis. In order to evaluate the "discrepancy" between data and null hypothesis, we need a test statistic or discrepancy variable that in some sense measures or captures this (in-) compatibility.
An obvious candidate may e.g. be the posterior probability of a beneficial effect, P(µ < 0 | y). This probability here is identical to the posterior cumulative distribution function (CDF) evaluated at the hypothesized value µ = 0. Large values then are evidence against, and small values speak in favour of the null hypothesis. In the present example data set we can evaluate this figure as R> ma05$pposterior(mu=0) [1] 0.9974968 Regarding our hypothesis setup, the question then is, how (un-) likely our observed value of 0.9975 is under the null hypothesis (H 0 : µ ≥ 0). In order to answer that question, we need the posterior distribution of the test statistic conditional on the null hypothesis (and the data). Using Monte Carlo sampling, we can generate draws of parameters from the conditional posterior distribution (µ , τ , θ | y, µ ≥ 0) and then generate new data based on these (y | µ , τ , θ ) from which we can compute replications of the test statistic and determine its distribution.
In the bayesmeta package, posterior predictive checks are implemented in the pppvalue() function. In order to generate posterior predictive draws, we need to specify the involved hypotheses, the test statistic, and the number of Monte Carlo replications to be generated; here we use n = 1000, which may take a few minutes to compute: R> p1 <-pppvalue(ma05, parameter="mu", value=0, alternative="less", + statistic="cdf", n=1000) Since the p-value is eventually computed based on the generated Monte Carlo samples, a value of n 100 will usually be appropriate. By default, a progress bar is shown during computation, allowing to estimate the remaining computation time. We can then inspect the result by printing the returned object:

R> p1
'bayesmeta' posterior predictive p-value (one-sided) data: ma05 cdf = 0.9975, Monte Carlo replicates = 1000, p-value = 0.01 alternative hypothesis: true effect (mu) is less than 0 The default output restates the hypothesis setup and shows a posterior predictive p-value of 0.01. This means that in 10 of the 1000 replications generated (1%) the statistic was larger than our observed 0.9975. One can also do a quick check of the uncertainty in this Monte-Carlo'ed p-value using e.g. the prop.test() function, which here yields a 95% confidence interval ranging roughly from 0.5% to 2%.
The replications are also stored in detail in the generated object (here: "p1"). The list object contains a "...$replicates" element, which again contains vectors of generated τ and µ draws, matrices of the corresponding θ and y draws, and finally the test statistic values along with an indicator showing which ones constitute the "tail area" the p-value is based on. Using the provided output, one can visualize how the posterior predictive p-value is computed; executing R> plot(ma05, which=2, mulim=c(-3.5, 1), taulim=c(0,2)) R> abline(h=p1$null.value) # (the null-hypothesized mu value) R> points(p1$replicates$tau, p1$replicates$mu, col="cyan") # (the samples) one can see the joint posterior distribution of heterogeneity τ and effect µ along with the generated samples, which, according to the specified null hypothesis, are confined to µ ≥ 0 (see Figure 7, left panel). The resulting test statistic values can be illustrated via their empirical cumulative distribution function, which can be generated by R> plot(ecdf(p1$replicates$statistic[,1])) R> abline(v = p1$statistic, col="red") R> abline(h = 1-p1$p.value, col="green") (see Figure 7, right panel). The "test statistic" values range between 0 and 1, and their distribution is clearly not uniform. The actualized value in the present data set (0.9975, vertical red line) is situated in the upper tail of the distribution of replicated statistics values, and the remaining tail area (horizontal green line) eventually defines the p-value.

Posterior predictive p-values for the heterogeneity (τ )
Computation of posterior predictive p-values for the heterogeneity works analogously. Use of the posterior CDF (P(τ ≤ 0|y)) to test for zero heterogeneity does not make sense, as this figure will always be zero, for the original as well as any replicated data. In order to test for zero heterogeneity, we could use the classical Cochran's Q statistic: R> p2 <-pppvalue(ma05, parameter="tau", value=0, alternative="greater", + statistic="q", n=1000) which here yields a p-value of 24.4%. In this case computations are much faster, since computationally expensive re-analyses of the data are not necessary to compute the test statistic.
The resulting p-value should be identical to the "classical" result, since under the null hypothesis considered (here: τ = 0) the Q-statistic follows a χ 2 -distribution, as in the frequentist setting.
In a Bayesian context, it may also make sense to consider using for example the Bayes factor of the hypothesis of τ = 0 as the "test statistic" or "discrepancy measure". The pppvalue() function is able to utilize arbitrary functions as a statistic; to use the Bayes factor, we can define the function R> BF <-function(y, sigma) + { + bm <-bayesmeta(y=y, sigma=sigma, + mu.prior.mean=0, mu.prior.sd=4, + tau.prior=function(t){dhalfnormal(t, scale=0.5)}, + interval.type="central") + return(bm$bayesfactor[1,"tau=0"]) + } Two things are worth noting here. Firstly, it makes sense to use matching (especially prior) specifications for the bayesmeta() call within the BF() function as for the original analysis (here: the previously generated "ma05" object). Secondly, the use of central intervals (see the "interval.type" argument) is more efficient, since these are faster to compute, and the intervals are otherwise irrelevant here. In order to utilize the function for a posterior predictive p-value, we can then call R> p3 <-pppvalue(ma05, parameter="tau", value=0, alternative="greater", + statistic=BF, rejection.region="lower.tail", n=1000, sigma=ma05$sigma) Note that the rejection region needs to be specified explicitly here (small Bayes factors constitute evidence against the null hypothesis). Additional arguments may be passed to the statistic function, like the "sigma" argument above. The Bayes factor in this case yields a similar p-value to Cochran's Q statistic (p = 22.2%).

Posterior predictive p-values for individual effects (θ i )
Quite commonly in a meta-analysis, interest may also be in one of the study specific parameters θ i (Schmidli et al. 2014;Wandel et al. 2017). For example, suppose that at the end of the latter of the two concerned studies (Spada, 2006) a meta-analysis was performed to evaluate the cumulative evidence, but main interest still was in the outcome of the second study that had just been conducted; it would then only be considered in the light of the previous evidence. In such a scenario, we can then evaluate a posterior predictive p-value for the 2nd study's effect (θ 2 ); this shrinkage estimate is also shown in Figure 6. Using the pppvalue() function, we can simply refer to a particular study's parameter by its index or its label: R> p4 <-pppvalue(ma05, parameter="Spada", value=0, alternative="less", + statistic="cdf", n=1000) which here results in a p-value of around 16.1%.

Summary
A Bayesian approach has distinct advantages in the context of meta-analysis; it allows to coherently process the uncertainty in the heterogeneity (nuisance) parameter while focusing on inference for the effect parameter(s), small sample sizes (numbers of studies) do not pose a difficulty, and interpretation of the results is very straightforward. Since meta-analyses are quite commonly based on only very few studies, the opportunity to formally utilize external information in the analysis via the prior specification may be a welcome feature. Unlike for some other methods whose results depend on the specification of secondary details, a Bayesian analysis result is uniquely defined once the model (likelihood and prior) is specified.
The application of Bayesian reasoning for this purpose is not a novelty ), but it usually comes with a certain computational burden; often MCMC methods are necessary, which demand a substantial amount of attention on their own (Gilks et al. 1996). The bayesmeta package (Röver 2015) allows to perform Bayesian random-effects metaanalyses without the need to worry too much about the computational details. Some of the technical details of the computational approach underlying the package have been described elsewhere . The simple normal-normal hierarchical model (NNHM) treated here is applicable in a wide range of contexts and is routinely used for many types of input data and effect measures (Hedges and Olkin 1985;Hartung et al. 2008;Viechtbauer 2010;Borenstein et al. 2009). The bayesmeta implementation allows for quick, accurate and reproducible computation, and it has already faciltated some larger-scale simulation studies to compare Bayesian results with common alternative approaches and evaluate their relative performance (Friede et al. 2017a,b). Usage of the bayesmeta package is not more complicated to use than many other common meta-analysis tools. The availability of predictive distributions and shrinkage estimates makes the package attractive also for advanced evidence synthesis applications, like extrapolation to future studies (Schmidli et al. 2014(Schmidli et al. , 2017Wandel et al. 2017). Since the generic NNHM appears in different fields of application, use of the bayesmeta package may also be extended to other areas of research beyond common meta-analysis. For example, it could as well be used to model hierarchical structures within a study (e.g., groups of patients), or a two-stage approach may be useful for meta-analysis based on individual-patient data. In future, the same numerical approach might be extended to the more general case of meta-regression.
the improper uniform prior. The derivation goes as follows. Suppose we have a bounded parameter domain [a, ∞], a likelihood function f (x) ≥ 0 (x ∈ [a, ∞]) with ∞ a f (x) dx < ∞, and a prior with monotonically decreasing probability density function p(·), so that 0 < p(a) < ∞, and a ≤ x < y ⇒ 0 ≤ p(y) ≤ p(x). Using the (improper) uniform prior or prior p we get different posteriors with cumulative distribution functions F 1 (·) and F p (·), respectively. From the above assumptions follows that for all y > a. This means that with F p (y) ≥ F 1 (y), the posterior using the uniform prior is stochastically larger than the posterior based on any other prior among the class of priors with monotonically decreasing density p(·) and finite p(a) (provided the uniform prior yields a proper posterior). In our context, this especially implies that quantiles or expectations based on the uniform prior are larger (Shaked and Shanthikumar 2007).
The class of priors with finite intercept and monotonically decreasing density to which the above property applies includes e.g. the exponential, half-normal, half-Student-t, half-Cauchy and Lomax distributions (Johnson et al. 1994), or uniform distributions with a finite upper bound.

A.4. Mixture implementation details
The approximation of marginal effect distributions etc. is implemented via the direct algorithm as described by Röver and Friede (2017). This approximation is utilized to evaluate posterior distributions of the overall effect µ as well as shrinkage estimates θ i and predictions θ k+1 . In all three cases, the distributions of interest are mixtures of conditionally normal distributions; in order to construct the approximate discrete mixture, it is necessary to evaluate symmetrized divergences of the conditional distributions. The symmetrized divergence (relative entropy) for two normal distributions with mean and variance parameters (µ A , σ 2 A ) and (µ B , σ 2 B ), respectively, is given by . Since the conditional means of µ|τ and θ k+1 |τ are identical, while the conditional variance of the latter is always equal to or larger than the former (see Section 2.8), a grid constructed for the effect's posterior (µ) can always be used for the predictive distribution (θ k+1 ) without loss of accuracy. In order not to have to construct and consider several separate τ grids also for the different shrinkage distributions, the general algorithm is slightly extended. Instead of determining divergences corresponding to pairs of τ values with respect to each of the shrinkage distributions individually, the maximum divergence across effect posterior as well as all k shrinkage distributions is considered. The result is a single grid in τ values that may be re-used for all three types of distributions.

A.5. Calibration check
The inferential statements returned by a Bayesian analysis differ in their probabilistic claims from those returned by frequentist analyses. For example, while a frequentist 95% confidence interval usually is supposed to yield 95% coverage for repeated data generation and analysis conditional on any single point in parameter space, a Bayesian analysis is to be understood conditional on the assumed prior distribution, and hence the coverage holds for repeated sampling of parameters from the prior and subsequent data generation and analysis. While frequentist analyses often rely on large-sample-size asymptotics (here: large k), Bayesian posterior analyses generally should (at least for proper priors) yield exact coverages, independent of sample sizes (Dawid 1982;Gneiting et al. 2007). The accuracy (calibration) of Bayesian analysis software may be checked exploiting this property (Cook et al. 2006). The aim of this section is to demonstrate that the bayesmeta implementation in fact yields consistent results.
What one can read off from the plots directly is the empirical coverage of one-sided upper credible limits. For example, one-sided 95% credible limits empirically exhibited a coverage of close to 95% in the simulations. For the heterogeneity, a curve above the main diagonal may be interpreted as "conservative" (in the sense of a tendency to overestimate heterogeneity), while for the effect, a conservative procedure should yield a curve below the diagonal at the lower end and above the diagonal at the upper end (i.e., leading to intervals that tend to be wider than necessary).