Bayesian Model Averaging Employing Fixed and Flexible Priors: The BMS Package for R

This article describes the BMS (Bayesian model sampling) package for R, which implements Bayesian model averaging for linear regression models. The package excels in allowing for a variety of prior structures, among them the "binomial-beta" prior on the model space and the so-called "hyper-g" specifications for Zellner's g prior. Moreover, the BMS package allows the user to specify her own model priors and offers the possibility of subjective inference by setting "prior inclusion probabilities" according to the researcher's beliefs. Graphical analysis of results is provided by numerous built-in plot functions of posterior densities, predictive densities, and graphical illustrations that compare results under different prior settings. Finally, the package provides full enumeration of the model space for small-scale problems as well as two efficient MCMC (Markov chain Monte Carlo) samplers that sort through the model space when the number of potential covariates is large.


A brief introduction to Bayesian model averaging
Model uncertainty is a problem that arises frequently in applied econometrics: Which set of covariates is appropriate to explain variation in the response variable? Are my results robust to the in- or exclusion of additional explanatory variables? In addressing these issues, Bayesian model averaging (BMA) has become a popular alternative to model selection. The remainder of this article is structured as follows: This section reiterates some basic concepts and introduces notation for readers with limited knowledge of BMA. Section 2 explains estimation with package BMS (Bayesian model sampling) under a default prior setting. Section 3 introduces different priors on the model space, while Section 4 introduces the MCMC samplers implemented in package BMS. Section 5 provides an overview of the g prior settings, and Section 6 describes prediction in BMA. Section 7 concludes.

Bayesian model averaging
Bayesian model averaging addresses model uncertainty in a canonical regression problem. Suppose a linear model structure, with y being the dependent variable, α_γ a constant, β_γ the coefficients, and ε a normal IID error term with variance σ²:

y = α_γ + X_γ β_γ + ε,   ε ∼ N(0, σ² I).   (1)

A problem arises when there are many potential explanatory variables in a matrix X: Which variables X_γ ∈ {X} should then be included in the model? And how important are they?
The direct approach of doing inference on a single linear model that includes all variables is inefficient or even infeasible with a limited number of observations. BMA tackles the problem by estimating models for all possible combinations of {X} and constructing a weighted average over all of them. If X contains K potential variables, this means estimating 2^K variable combinations and thus 2^K models. The model weights for this averaging stem from posterior model probabilities that arise from Bayes' theorem:

p(M_γ |y, X) = p(y|M_γ, X) p(M_γ) / p(y|X) ∝ p(y|M_γ, X) p(M_γ).   (2)

Here, p(y|X) denotes the integrated likelihood, which is constant over all models and is thus simply a multiplicative term. Therefore, the posterior model probability (PMP) is proportional (embodied by the sign ∝) to the marginal likelihood p(y|M_γ, X), which reflects the probability of the data given model M_γ. The marginal likelihood of model M_γ is multiplied by its prior model probability p(M_γ), indicating how probable the researcher thinks model M_γ is before looking at the data. The difference between p(y|X) and p(y|M_γ, X) is that integration is once over the model space (p(y|X)) and once, for a given model, over the parameter space (p(y|M_γ, X)). By renormalizing the product above, one can infer the PMPs and thus the model-weighted posterior distribution for any statistic θ (e.g., the estimator of the coefficient β_γ):

p(θ|y, X) = Σ_{γ=1}^{2^K} p(θ|M_γ, y, X) p(M_γ |y, X).
The model prior p(M_γ) has to be elicited by the researcher and should reflect prior beliefs. A popular choice is to set a uniform prior probability for each model, p(M_γ) ∝ 1, to represent the lack of prior knowledge. Further model prior options will be explored in Section 3.

Bayesian linear models and Zellner's g prior
The specific expressions for the marginal likelihoods p(y|M_γ, X) and the posterior distributions p(θ|M_γ, y, X) depend on the chosen estimation framework. The literature standard is to use a "Bayesian regression" linear model with a specific prior structure called "Zellner's g prior", as outlined in this section. For each individual model M_γ, suppose a normal error structure as in (1). Obtaining posterior distributions requires specifying priors on the model parameters. Here, we place "improper" priors on the constant and the error variance, which means they are evenly distributed over their domain: p(α_γ) ∝ 1, i.e., complete prior uncertainty about where the constant is located. Similarly, set p(σ) ∝ σ^(-1).
The crucial prior is the one on the regression coefficients β_γ: Before looking at the data (y, X), the researcher formulates her prior beliefs on the coefficients as a normal distribution with a specified mean and variance. It is common to assume a conservative prior mean of zero for the coefficients to reflect that not much is known about them. Their variance structure is defined according to Zellner's g prior as σ² g (X_γ' X_γ)^(-1): This means that the researcher thinks coefficients are zero, and that their variance-covariance structure is broadly in line with that of the data X_γ. The hyperparameter g embodies how certain the researcher is that coefficients are indeed zero: A small g means a small prior coefficient variance and therefore implies that the researcher is quite certain (or conservative) that the coefficients are indeed zero. In contrast, a large g means that the researcher is very uncertain that coefficients are zero.
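Written out, the coefficient prior just described is the conditional normal distribution:

```latex
% Zellner's g prior: conservative zero mean, covariance proportional to (X'X)^{-1}
\beta_\gamma \mid g, \sigma^2, M_\gamma \;\sim\;
  N\!\left(0,\; \sigma^2\, g \,\left(X_\gamma' X_\gamma\right)^{-1}\right)
```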
The posterior distribution of coefficients reflects prior uncertainty: Given g, it follows a t-distribution with expected value

E(β_γ |y, X, g, M_γ) = g/(1+g) β̂_γ,

where β̂_γ is the standard OLS estimator for model M_γ. The expected value of coefficients is thus a convex combination of the OLS estimator and the prior mean (zero). The more conservative (smaller) g, the more important is the prior, and the more the expected value of coefficients is shrunk toward the prior mean of zero. As g → ∞, the coefficient estimator approaches the OLS estimator. Similarly, the posterior variance of β_γ is affected by the choice of g:

Cov(β_γ |y, X, g, M_γ) = [(y − ȳ)'(y − ȳ) / (N − 3)] · g/(1+g) · (1 − g/(1+g) R²_γ) (X_γ' X_γ)^(-1).
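To make the shrinkage mechanics concrete, the following sketch computes the posterior moments above from OLS quantities on simulated data (plain R, independent of package BMS; all variable names are ours):

```r
# Sketch: posterior mean and covariance of coefficients under Zellner's g prior,
# following the formulas above. Data are simulated; g = N (unit information prior).
set.seed(1)
N <- 50
X <- cbind(rnorm(N), rnorm(N))
y <- 0.5 * X[, 1] + rnorm(N)
g <- N

Xc <- scale(X, scale = FALSE)          # demean regressors (constant handled separately)
yc <- y - mean(y)
beta_ols  <- solve(crossprod(Xc), crossprod(Xc, yc))
beta_post <- (g / (1 + g)) * beta_ols  # posterior mean: OLS shrunk toward zero
R2   <- sum((Xc %*% beta_ols)^2) / sum(yc^2)
covp <- sum(yc^2) / (N - 3) * (g / (1 + g)) *
        (1 - g / (1 + g) * R2) * solve(crossprod(Xc))
```

With g = N = 50 the shrinkage factor g/(1+g) is close to 1, so the posterior mean stays close to OLS; a small g would pull it markedly toward zero.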
I.e., the posterior covariance is similar to that of the OLS estimator, times a factor that includes g. Appendix A.3 shows how to apply the function zlm in order to estimate such models outside of the BMA context.
For BMA, this prior framework results in a very simple marginal likelihood p(y|M_γ, X, g) that is related to the R-squared and includes a size penalty factor adjusting for model size k_γ:

p(y|M_γ, X, g) ∝ ((y − ȳ)'(y − ȳ))^(-(N-1)/2) (1 + g)^(-k_γ/2) (1 − g/(1+g) R²_γ)^(-(N-1)/2).

The crucial choice here concerns the form of the hyperparameter g. A popular "default" approach is the "unit information prior" (UIP), which sets g = N commonly for all models and thus attributes about the same amount of information to the prior as is contained in one observation. Please refer to Section 5 for a discussion of other g priors. 5
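A minimal sketch of this marginal likelihood in plain R (log scale, dropping the (y − ȳ)'(y − ȳ) term that is constant across models; the function name log_ml is our own):

```r
# Sketch: log marginal likelihood of a model under Zellner's g prior, up to a
# constant shared by all models (so only differences between models matter):
#   log p(y | M, X, g) = const - (k/2) log(1+g) - ((N-1)/2) log(1 - g/(1+g) R2)
log_ml <- function(R2, k, N, g = N) {
  -k / 2 * log(1 + g) - (N - 1) / 2 * log(1 - g / (1 + g) * R2)
}

# The size penalty at work: identical fit, one extra regressor, lower likelihood
log_ml(R2 = 0.5, k = 3, N = 30)
log_ml(R2 = 0.5, k = 4, N = 30)
```

Holding R² fixed, the (1 + g)^(-k/2) term penalizes each additional regressor, which is what disciplines model size in BMA.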

A BMA example: Attitude data
This section shows how to run BMA using the R (R Core Team 2015) package BMS (Feldkircher and Zeugner 2015). Package BMS is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=BMS. We will use a small data set for illustration and show how to obtain posterior coefficient and model statistics.

Model sampling
Equipped with this basic framework, let us explore one of the data sets built into R: The attitude dataset describes the overall satisfaction rating of a large organization's employees as a function of several specific factors, such as complaints (the handling of employee complaints within the organization); for more information type help("attitude"). The data includes K = 6 potential explanatory variables, which means 2^6 = 64 model combinations. Let us stick with the UIP g prior (in this case g = N = 30). Moreover, assume uniform model priors (which means that our prior expected model size is K/2 = 3).
First load the data set by typing:

R> data("attitude", package = "datasets")

In order to perform BMA you have to load the BMS package first, via the command:

R> library("BMS")
Now perform Bayesian model sampling via the function bms, and write results into the variable att.
R> att <- bms(attitude, mprior = "uniform", g = "UIP", user.int = FALSE)

Here, mprior = "uniform" assigns a uniform model prior, and g = "UIP" the unit information prior on Zellner's g. The option user.int = FALSE is used to suppress user-interactive output for the moment. 6 The first argument is the data frame attitude, and bms assumes that its first column is the response variable. 7

Coefficient results
The coefficient results can be obtained via:

R> coef(att)

5 Note that package BMS is, in principle, not restricted to Zellner's g prior, as quite different coefficient priors might be defined by R-savvy users.
6 Note that the argument g = "UIP" is actually redundant, as this is the default option for bms. The default model prior is somewhat different but does not matter very much with this data. Therefore, the command att = bms(attitude) gives broadly similar results.
7 The specification of data can be supplied in different manners, e.g., using a 'formula'. Type help("lm") for a comparable function.

The resulting matrix shows the variable names and corresponding statistics: The second column Post Mean displays the coefficients averaged over all models, including the models wherein the variable was not contained (implying that the coefficient is zero in that case). The covariate complaints has a comparatively large coefficient and seems to be most important. The importance of the variables in explaining the data is given in the first column PIP, which represents posterior inclusion probabilities, i.e., the sum of PMPs for all models wherein a covariate was included. We see that with 99.96%, virtually all of the posterior model mass rests on models that include complaints. In contrast, learning has an intermediate PIP of 40.6%, while the other covariates do not seem to matter much. Consequently their (unconditional) coefficients 8 are quite low, since the results quite often include models where these coefficients are zero.

The coefficients' posterior standard deviations (Post SD) provide further evidence: for example, complaints is certainly positive, while advance is most likely negative. In fact, the coefficient sign can also be inferred from the fourth column Cond.Pos.Sign, the "posterior probability of a positive coefficient expected value conditional on inclusion", or "sign certainty". Here, we see that in all encountered models containing these variables, the (expected values of the) coefficients for complaints and learning were positive. In contrast, the corresponding number for privileges is near zero, i.e., in virtually all models that include privileges, its coefficient sign is negative. Finally, the last column idx denotes the index of the variables' appearance in the original data set, as our results are sorted by PIP.
In addition to inferring about the importance of our variables, it may be more interesting to look at their standardized coefficients. 9

8 Unconditional coefficients are defined as E(β|y, X) = Σ_{γ=1}^{2^K} E(β_γ|y, X, M_γ) p(M_γ|y, X), i.e., a weighted average over all models, including those where this particular coefficient was restricted to zero. A conditional coefficient, in contrast, is "conditional on inclusion", i.e., a weighted average only over those models where its regressor was included. Conditional coefficients may be obtained with the command coef(att, condi.coef = TRUE).
9 Standardized coefficients arise if both the response y and the regressors X are normalized to mean zero and variance one, thus effectively bringing the data down to the same order of magnitude.

The standardized coefficients reveal a similar pattern of importance as discussed above, but one sees that learning actually does not matter much in terms of magnitude. Note that the argument order.by.pip = FALSE leads to the covariates being represented in their original order, while include.constant = TRUE also prints out a (standardized) constant.

Other results
Other basic information about the sampling procedure can be obtained via:

R> summary(att)
       No. Obs.     Model Prior    g-Prior
           "30"   "uniform / 3"      "UIP"

Shrinkage-Stats
    "Av=0.9677"

It reiterates some of the facts we already know, but adds some additional information such as Mean no. regressors, the posterior expected model size (cf. Section 3).
Finally we can examine the distribution of posterior model probabilities by invoking the function topmodels. This yields a binary matrix with the variables arranged row-wise and the models column-wise. For a particular model (i.e., column) a 0 indicates exclusion and a 1 inclusion of the variable associated with a given row. For the sake of illustration we will focus on the three models with the highest posterior model probabilities:

R> topmodels(att)[, 1:3]

                   20        28         29
complaints  1.0000000 1.0000000 1.00000000
privileges  0.0000000 0.0000000 0.00000000
learning    0.0000000 1.0000000 1.00000000
raises      0.0000000 0.0000000 0.00000000
critical    0.0000000 0.0000000 0.00000000
advance     0.0000000 0.0000000 1.00000000

Figure 1: Image plot. Blue color corresponds to a positive coefficient, red to a negative coefficient, and white to non-inclusion of the respective variable. The horizontal axis is scaled by the models' posterior model probabilities.
The posterior model probability for each model is given at the bottom of the matrix. The distinction between PMP (Exact) and PMP (MCMC) is of importance if an MCMC sampler was used, and will be discussed in Section 4.3. Note that we can access the PMP for any model directly using the function pmpmodel, cf. help("pmpmodel"). The best model, with 29% posterior model probability, is the one that only includes complaints. However, the second-best model additionally includes learning and has a PMP of 17%. Use the command beta.draws(att) to obtain the actual (expected values of) posterior coefficient estimates for each of these models.
In order to get a more comprehensive overview over the models, use the command:

R> image(att)

This produces Figure 1.
Here, blue color corresponds to a positive coefficient, red to a negative coefficient, and white to non-inclusion (a zero coefficient). On the horizontal axis the best models are shown, scaled by their PMPs. We see again that the best model with most mass only includes complaints. Moreover, we see that complaints is included in virtually all model mass, and unanimously with a positive coefficient. In contrast, raises is included very infrequently, and its coefficient sign changes according to the model. Use image(att, yprop2pip = TRUE) for another illustrative variant of this plot.

Model size and model priors
Invoking the command summary(att) yielded the important posterior statistic Mean no. regressors, the posterior expected model size (i.e., the average number of included regressors), which in our case was 2.11. Note that the posterior expected model size is equal to the sum of PIPs, which can be verified via:

R> sum(coef(att)[, 1])
[1] 2.112147

This value contrasts with the prior expected model size implicitly used in our model sampling: With 2^K possible variable combinations, a uniform model prior means a common prior model probability of p(M_γ) = 2^(-K). However, this implies a prior expected model size of Σ_{k=0}^{K} (K choose k) k 2^(-K) = K/2. Moreover, since there are more possible models of size 3 than, e.g., of size 1 or 5, the uniform model prior puts more mass on intermediate model sizes, e.g., expecting a model of size k_γ = 3 with probability (6 choose 3) 2^(-K) = 31%. In order to examine how far the posterior model size distribution matches up to this prior, type:
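The combinatorics above are easy to verify directly in plain R (no package functions involved):

```r
# Under a uniform model prior with K = 6, each model has probability 2^-6;
# prior model size is then binomially distributed with theta = 1/2.
K <- 6
p_size <- choose(K, 0:K) * 2^-K   # prior probability of each model size 0..K
sum(0:K * p_size)                 # prior expected model size: K/2 = 3
p_size[3 + 1]                     # P(size = 3) = choose(6,3)/64 = 0.3125
```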

R> plotModelsize(att)
The results are illustrated in Figure 2, left panel.
We see that while the model prior implies a symmetric distribution around K/2 = 3, updating it with the data yields a posterior that puts more importance on parsimonious models. In order to illustrate the impact of the uniform model prior assumption, we might consider other popular model priors that allow more freedom in choosing prior expected model size and other factors.

Binomial model prior
The binomial model prior constitutes a simple and popular alternative to the uniform prior we just employed. It starts from the covariates' viewpoint, placing a common and fixed inclusion probability θ on each regressor. The prior probability of a model of size k_γ is therefore the product of inclusion and exclusion probabilities:

p(M_γ) = θ^(k_γ) (1 − θ)^(K − k_γ).

Since the expected model size is m̄ = Kθ, the researcher's prior choice reduces to eliciting a prior expected model size m̄ (which defines θ via the relation θ = m̄/K). Choosing a prior model size of K/2 yields θ = 1/2 and thus exactly the uniform model prior p(M_γ) = 2^(-K) illustrated in the previous section. Therefore, putting the prior model size at a value smaller than K/2 tilts the prior distribution toward smaller model sizes, and vice versa. For instance, let us impose a fixed inclusion probability prior such that the prior model size equals m̄ = 2. Here, the option user.int = TRUE directly prints out the results as from coef and summary:

R> att_fixed <- bms(attitude, mprior = "fixed", mprior.size = 2,
+    user.int = TRUE)

Mean no. regressors               Draws             Burnins
           "1.6114"                "64"                 "0"
               Time  No. models visited      Modelspace 2^K
    "0.029845 secs"                "64"                "64"
          % visited         % Topmodels            Corr PMP
              "100"               "100"                "NA"
           No. Obs.         Model Prior             g-Prior
               "30"         "fixed / 2"               "UIP"

Shrinkage-Stats
    "Av=0.9677"

Time difference of 0.029845 secs

As seen in Mean no. regressors and illustrated in the right panel of Figure 2, the posterior model size is now 1.61, somewhat smaller than with uniform model priors. Since the posterior model size equals the sum of PIPs, many of them have also become smaller than in att. But interestingly, the PIP of complaints has remained at near 100%.
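Independently of package BMS, the binomial ("fixed") prior just described can be written out in a few lines of plain R:

```r
# Binomial ("fixed") model prior: common inclusion probability theta per regressor,
# elicited through a prior expected model size mbar via theta = mbar / K.
K <- 6
mbar <- 2
theta <- mbar / K
prior_prob <- function(k) theta^k * (1 - theta)^(K - k)  # one model of size k
p_size <- choose(K, 0:K) * sapply(0:K, prior_prob)       # prior size distribution
sum(0:K * p_size)                                        # prior expected size: 2
```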

Custom prior inclusion probabilities
In view of the pervasive impact of complaints, one might wonder whether its importance would also remain robust to a greatly unfair prior. For instance, one could define a prior inclusion probability of only θ = 0.01 for complaints, while setting a "standard" prior inclusion probability of θ = 0.5 for all other variables. Such a prior might be submitted to bms by assigning a vector of prior inclusion probabilities via its mprior.size argument:

R> att_pip <- bms(attitude, mprior = "pip",
+    mprior.size = c(0.01, 0.5, 0.5, 0.5, 0.5, 0.5), user.int = FALSE)

This implies a prior model size of m̄ = 0.01 + 5 × 0.5 = 2.51.
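With variable-specific inclusion probabilities, a model's prior probability is the product of inclusion and exclusion probabilities, and the prior expected model size is simply their sum; a quick sketch in plain R (the inclusion vector inc is our illustrative choice):

```r
# Custom prior inclusion probabilities: one theta per covariate.
theta <- c(0.01, 0.5, 0.5, 0.5, 0.5, 0.5)
sum(theta)                              # prior expected model size: 2.51

# Prior probability of one particular model, here including variables 1 and 3
inc <- c(1, 0, 1, 0, 0, 0)
prod(theta^inc * (1 - theta)^(1 - inc))
```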
The results can be obtained with summary(att_pip):

Binomial-beta model priors
Like the uniform prior, the fixed common θ in the binomial prior centers the mass of its distribution near the prior model size. A look at the prior model size distribution with the following command shows that it is quite concentrated around its mode, as illustrated in Figure 3, left panel.

R> plotModelsize(att_pip)
This feature is sometimes criticized, in particular by Ley and Steel (2009): They note that to reflect prior uncertainty about model size, one should rather impose a prior that is less tight around the prior expected model size. Therefore, Ley and Steel (2009) propose to put a hyperprior on the inclusion probability θ, effectively drawing it from a Beta distribution. In terms of researcher input, this prior again only requires choosing the prior expected model size. However, the resulting prior distribution is considerably less tight and should thus reduce the risk of unintended consequences from imposing a particular prior model size. In the vein of Ley and Steel (2009), a prior on the model space that is completely flat can be invoked by anchoring the binomial-beta prior on an expected model size of K/2 = 6/2 = 3 regressors:

R> att_random <- bms(attitude, mprior = "random", mprior.size = 3,
+    user.int = FALSE)
R> plotModelsize(att_random)

As Figure 3, right panel, illustrates, the prior on the model space is flat, while the posterior model size turns out to be 1.73. In terms of coefficient and posterior model size distribution, the results are very similar to those of att_fixed, even though the latter approach involved a tighter model prior. Concluding, a decrease of prior importance by the use of the binomial-beta framework supports the results found in att_fixed.
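Under the binomial-beta prior with the Ley and Steel (2009) hyperparameter choice a = 1 and b = (K − m̄)/m̄, the prior model size distribution is beta-binomial; a plain-R sketch confirms that anchoring at m̄ = K/2 indeed yields a flat distribution over model sizes:

```r
# Beta-binomial prior on model size k: theta ~ Beta(a, b) with a = 1 and
# b = (K - mbar)/mbar, so that the prior expected model size is K * a/(a+b) = mbar.
K <- 6
mbar <- K / 2
a <- 1
b <- (K - mbar) / mbar
p_size <- choose(K, 0:K) * beta(a + 0:K, b + K - 0:K) / beta(a, b)
round(p_size, 4)   # flat: each of the K+1 model sizes gets probability 1/7
```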
We can compare the PIPs from the four approaches presented so far with the following command:

R> plotComp(Uniform = att, Fixed = att_fixed, PIP = att_pip,
+    Random = att_random)

MCMC samplers
With a small number of variables, it is straightforward to enumerate all potential variable combinations to obtain posterior results. For a larger number of covariates, this becomes more time-intensive: enumerating all models for 25 covariates takes about 3 hours on a modern PC, and going much beyond that quickly becomes infeasible: With 50 covariates, for instance, there are more than a quadrillion (≈ 10^15) potential models to consider. In such a case, MCMC samplers gather results on the most important part of the posterior model distribution and thus approximate it as closely as possible. BMA mostly relies on the Metropolis-Hastings algorithm, which "walks" through the model space as follows: At step i, the sampler stands at a certain "current" model M_i with PMP p(M_i|y, X). In step i + 1, a candidate model M_j is proposed. The sampler switches from the current model to model M_j with probability p_{i,j}:

p_{i,j} = min(1, p(M_j|y, X) / p(M_i|y, X)).

In case model M_j is rejected, the sampler moves to the next step and proposes a new model M_k against M_i. In case model M_j is accepted, it becomes the current model and has to survive against further candidate models in the next step. In this manner, the number of times each model is kept will converge to the distribution of posterior model probabilities p(M_i|y, X).
In addition to enumerating all models, package BMS implements two MCMC samplers that differ in the way they propose candidate models:

Birth-death sampler ("bd"): This is the standard model sampler used in most BMA routines. One of the K potential covariates is randomly chosen; if the chosen covariate is already part of the current model M_i, then the candidate model M_j will have the same set of covariates as M_i except for the chosen variable ("dropping" a variable). If the chosen covariate is not contained in M_i, then the candidate model will contain all the variables from M_i plus the chosen covariate ("adding" a variable).
Reversible-jump sampler ("rev.jump"): Adapted to BMA by Madigan and York (1995), this sampler draws a candidate by the birth-death method with 50% probability. In the other case (also with 50% probability) a "swap" is proposed, i.e., the candidate model M_j randomly drops one covariate with respect to M_i and adds one chosen at random from the potential covariates that were not included in model M_i.
Enumeration ("enumerate"): For up to fourteen covariates, complete enumeration of all models is the default option: This means that, instead of an approximation by means of the aforementioned MCMC sampling schemes, all possible models are evaluated. As enumeration becomes quite time-consuming or infeasible for many variables, the default option is mcmc = "bd" in case of K > 14, though enumeration can still be invoked with the argument mcmc = "enumerate".
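The Metropolis-Hastings walk with birth-death proposals described above can be sketched in a few lines of plain R; here the marginal likelihood is approximated through BIC rather than the g-prior expression of Section 1.2, no burn-in is discarded, and uniform model priors are assumed (so they cancel in the acceptance ratio), so this is an illustration only:

```r
# Sketch: Metropolis-Hastings over the model space with birth-death proposals.
# 'lml' returns the (approximate) log marginal likelihood of an inclusion vector.
set.seed(42)
N <- 60
K <- 5
X <- matrix(rnorm(N * K), N, K)
y <- X[, 1] + rnorm(N)        # only the first covariate carries signal

lml <- function(inc) {        # BIC-based stand-in for the g-prior likelihood
  k <- sum(inc)
  fit <- if (k == 0) lm(y ~ 1) else lm(y ~ X[, inc == 1])
  -0.5 * BIC(fit)
}

cur <- rep(0, K)              # start from the null model
counts <- rep(0, K)
for (i in 1:5000) {
  cand <- cur
  j <- sample(K, 1)           # birth-death proposal: flip one covariate
  cand[j] <- 1 - cand[j]
  if (log(runif(1)) < lml(cand) - lml(cur)) cur <- cand
  counts <- counts + cur      # iteration counts per covariate
}
counts / 5000                 # MCMC-based posterior inclusion probabilities
```

As expected, the signal covariate accumulates nearly all iteration counts, while the noise covariates are visited only occasionally.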
The quality of an MCMC approximation to the actual posterior distribution depends on the number of draws the MCMC sampler runs for. In particular, the sampler has to start out from some model that might not be a "good" one. Hence the first batch of iterations will typically not draw models with high PMPs, as the sampler will converge to spheres of models with the largest marginal likelihoods only after a while. Therefore, this first set of iterations (the "burn-ins") is to be omitted from the computation of results. In bms, the argument burn specifies the number of burn-ins, and the argument iter the number of subsequent iterations to be retained.

An example: Economic growth
In one of the most prominent applications of BMA, Fernández, Ley, and Steel (2001b) analyze the importance of 41 explanatory variables for long-term economic growth in 72 countries by means of BMA. The data set is available in package BMS; a short description is available via help("datafls"). They employ a uniform model prior and the birth-death MCMC sampler. Their g prior is set to g = max(N, K²), a mechanism such that PMPs asymptotically behave either like the Bayesian information criterion (with g = N) or like the risk inflation criterion (g = K²); in bms this prior is assigned via the argument g = "BRIC".
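The bms call that generates the fls1 object discussed below is not shown in this excerpt; a call consistent with the settings just described would look as follows (the burn and iter values are our assumption, the remaining arguments follow the text):

```r
# Assumed call for the FLS growth example: g = "BRIC", uniform model prior,
# birth-death sampler, 2,000 retained models. The burn/iter values are
# illustrative choices, not taken from the original source.
library("BMS")
data("datafls", package = "BMS")
fls1 <- bms(datafls, burn = 50000, iter = 100000, g = "BRIC",
            mprior = "uniform", nmodel = 2000, mcmc = "bd", user.int = FALSE)
```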
           No. Obs.         Model Prior             g-Prior
               "72"    "uniform / 20.5"              "BRIC"

Shrinkage-Stats
    "Av=0.9994"

Note that due to stochastic variation in the MCMC chain, your results might slightly differ from those reported here. Under Corr PMP, we find the correlation between iteration counts and analytical PMPs for the 2,000 best models (the number 2,000 was specified with the nmodel = 2000 argument). This correlation (retrievable via summary(fls1)["Corr PMP"]) is far from perfect but already indicates a good degree of convergence. For a closer look at convergence between analytical and MCMC PMPs, compare the actual distribution of both concepts:

R> plotConv(fls1)

Figure 5, left panel, presents the best 2,000 models encountered, ordered by their analytical PMP (the red line), and plots their MCMC iteration counts (the blue line). For an even closer look, one might just check the corresponding image for the best 100 models only. The results are provided in Figure 5, right panel, and can be achieved with the following command: 15

R> plotConv(fls1[1:100])

Analytical vs. MCMC likelihoods
The example above already achieved a decent level of correlation between analytical likelihoods and iteration counts with a comparatively small number of sampling draws. In general, the more complicated the distribution of marginal likelihoods, the more difficulties the sampler will meet before converging to a good approximation of PMPs. The quality of approximation may be inferred from the number of times a model got drawn vs. its actual marginal likelihood. Partly for this reason, bms retains a pre-specified number of models with the highest PMPs encountered during MCMC sampling, for which PMPs and draw counts are stored. Their respective distributions and their correlation indicate how well the sampler has converged.

15 With bma objects such as fls1, the indexing parentheses [] are used to select subsets of the (ranked) best models retained in the object. For instance, while fls1 contains 2,000 models, fls1[1:100] only contains the 100 best models among them. Correspondingly, fls1[37] would only contain the 37th best model.
However, due to RAM limits, the sampling chain can hardly retain more than a few hundred thousand of these models. Instead, it computes aggregate statistics on the fly, taking iteration counts as model weights. For model convergence and some posterior statistics, bms retains only the "top" (highest PMP) nmodel models it encounters during the iterations. Since the time for updating the iteration counts for the "top" models grows in line with their number, the sampler becomes considerably slower the more "top" models are to be kept. Still, if they are sufficiently numerous, those best models can already cover most of the posterior model mass; in this case it is feasible to base posterior statistics on analytical likelihoods instead of MCMC frequencies, just as in the enumeration case from Section 2. From bms results, the PMPs of "top" models may be displayed with the command pmp. The numbers in the left-hand column represent analytical PMPs (PMP (Exact)) while the right-hand side displays MCMC-based PMPs (PMP (MCMC)). Both decline in roughly the same fashion; however, sometimes the values for analytical PMPs differ considerably from the MCMC-based ones. This comes from the fact that MCMC-based PMPs derive from iteration counts, while the "exact" PMPs are calculated from comparing the analytical likelihoods of the best models, cf. Equation 2. In order to see the importance of all "top" models with respect to the full model space, we can thus sum up their MCMC-based PMPs as follows: 18

R> colSums(pmp(fls1))

In the call to topmodels on page 7, the PMPs under "MCMC" and analytical ("exact") concepts were equal since 1) enumeration bases both "top" model calculation and aggregate on-the-fly results on analytical PMPs, and 2) all possible models were retained in the object att.
18 Note that this share was already provided in column % Topmodels resulting from the summary command on page 14.
The ordering of covariates in terms of PIP, as well as the coefficients, are roughly similar. However, the PIPs under exact = TRUE are somewhat larger than for the MCMC results. Closer inspection will also show that the analytical results downgrade the PIPs of the worst variables with respect to the PIPs under MCMC. This stems from the fact that analytical results do not take into account the many "bad" models that include "worse" covariates and are factored into the MCMC results.
Whether to prefer analytical or MCMC results is a matter of taste; however, the literature mostly reports coefficients the analytical way: Fernández et al. (2001b), for instance, retain 5,000 models and report results based on them.

Combining sampling chains
The MCMC samplers described in Section 4.1 need to discard the first batch of draws (the burn-ins), since they start out from some peculiar starting model and may reach the altitudes of "high" PMPs only after many iterations. Here, choosing an appropriate starting model may help to speed up convergence. By default, bms selects its starting model as follows: from the full model, all covariates with OLS t-statistics > 0.2 are kept and included in the starting model. Other starting models may be assigned outright or chosen according to a similar mechanism (cf. argument start.value in help("bms")).

One can compare the degree of MCMC convergence for both models via:

R> round(cor(pmp(fls2))[2, 1], 2)
R> round(cor(pmp(fls1))[2, 1], 2)

Accordingly, with 0.86, the correlation between analytical and MCMC PMPs in fls2 is a bit smaller than that from the fls1 example in Section 4.3, which is 0.92. However, the results of the two sampling runs may be combined to yield more iterations and thus a better representation of the PMP distribution.
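The combination step itself can be done with the generic c function, which package BMS provides for bma objects (the name fls_combi is our choice, matching the object referred to below):

```r
# Combine two sampling chains into one bma object; posterior statistics are
# then based on the union of retained models and the summed iteration counts.
fls_combi <- c(fls1, fls2)
summary(fls_combi)
```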
           No. Obs.         Model Prior             g-Prior
               "72"    "uniform / 20.5"              "BRIC"

Shrinkage-Stats
    "Av=0.9994"

With 0.95, the PMP correlation for the combined results is better than that of either of its two constituent chains fls1 and fls2. Still, the PIPs and coefficients do not change much with respect to fls1, as evidenced, e.g., by plotComp(fls1, fls_combi, comp = "Std Mean").

Alternative fixed g priors
Virtually all BMA applications rely on the presented framework with Zellner's g prior, and the bulk of them relies on specifying a fixed g. As mentioned in Section 1.2, the value of g corresponds to the degree of prior uncertainty: A low g renders the prior coefficient distribution tight around a zero mean, while a large g implies large prior coefficient variance and thus decreases the importance of the coefficient prior.
While some default elicitation mechanisms for the g prior (e.g., we have seen UIP and BRIC) are quite popular, they are also subject to severe criticism. Some authors (e.g., Fernández, Ley, and Steel 2001a) advocate a comparatively large g prior to minimize prior impact on the results, stay close to the OLS coefficients, and represent the absolute lack of prior knowledge. Others (e.g., Ciccone and Jarociński 2010) demonstrate that such a large g may not be robust to noise innovations and risks over-fitting, in particular if the noise component plays a substantial role in the data. Yet others (Eicher, Papageorgiou, and Raftery 2011) advocate intermediate fixed values for the g prior or present alternative default specifications (Liang, Paulo, Molina, Clyde, and Berger 2008). In package BMS, any fixed g prior may be specified directly by submitting its value to the bms function argument g. For instance, compare the results for the Fernández et al. (2001b) setting when a more conservative prior such as g = 5 is employed (and far too few iterations are performed):

R> fls_g5 <- bms(datafls, burn = 20000, iter = 50000, g = 5,
+    mprior = "uniform", user.int = FALSE)
R> coef(fls_g5)

Mean no. regressors               Draws             Burnins
          "20.1228"             "50000"             "20000"
               Time  No. models visited      Modelspace 2^K
   "11.58503 secs"             "43996"           "2.2e+12"
          % visited         % Topmodels            Corr PMP
            "2e-06"                 "2"            "0.0856"
           No. Obs.
No. Obs.         "72"
Model Prior      "uniform / 20.5"
g-Prior          "numeric"
Shrinkage-Stats  "Av=0.8333"

The PIPs and coefficients for the best five covariates are comparable to the results from Section 4.2 but considerably smaller, due to the tight shrinkage factor of g/(1 + g) = 5/6 (cf., Section 1.2). More importantly, with 20.1 the posterior expected model size exceeds that of fls_combi by a large amount. This stems from the less severe size penalty imposed by eliciting a small g. Finally, with a correlation between analytical and MCMC PMPs of −0.03, we can conclude that the MCMC sampler has not converged at all yet. Feldkircher and Zeugner (2009) show that the smaller the g prior, the less concentrated is the PMP distribution, and therefore the harder it is for the MCMC sampler to provide a reasonable approximation of the actual PMP distribution. Hence the above command should actually be run with many more iterations in order to achieve meaningful results.

Model-specific g priors
The examples and references above illustrate that eliciting a fixed g prior common to all models can be fraught with difficulties and unintended consequences. Under a shrinkage factor close to 1 (i.e., under a large g), posterior estimates can easily overfit. This has implications not only for the estimated coefficients but also for the PIPs: A too large, "overfitting" shrinkage factor leads to tight PMP concentrations and small model sizes, which results in an unduly skewed PIP distribution. Consequently, an "overfitting" shrinkage factor attributes a relatively high PIP to just a few variables, while all other covariates obtain very low PIPs. In contrast, a too low shrinkage factor (i.e., a low g) does not exploit the signals in the data and typically leads to very similar, intermediate PIPs for a large share of covariates. In order to address this issue, several authors have proposed to rely on model-specific "flexible" g priors (cf., Liang et al. 2008 for an overview). The virtue of such flexible shrinkage factors g/(1 + g) is that they adapt to the data: The better the signal-to-noise ratio, the closer the (expected) posterior shrinkage factor will be to 1, and vice versa. Consequently, the average shrinkage factor over all models can be interpreted as a Bayesian "goodness-of-fit" indicator. Feldkircher and Zeugner (2009) show how the model-specific priors implemented in BMS can be interpreted in terms of the OLS F statistic. Overall, there are two flexible g priors that allow for closed-form solutions and are implemented in package BMS:

Empirical Bayes g, local ("EBL"): g_γ = argmax_g p(y | M_γ, X, g). Authors such as George and Foster (2000) or Hansen and Yu (2001) advocate an "empirical Bayes" approach, using the information contained in the data (y, X) to elicit g via maximum likelihood. This amounts to setting g_γ = max(0, F_γ^OLS − 1), where F_γ^OLS is the standard OLS F statistic for model M_γ.
Each single model thus has a single shrinkage factor g_γ/(1 + g_γ) estimated in this way, but these shrinkage factors differ across models. The function gdensity provides density plots and data on the discrete distribution of the shrinkage factor over all models.
Note that the local "EBL" prior is popular with some for its simplicity and effectiveness, while it is despised by others: It does not constitute a "real" prior, since it involves "peeking" at the data in order to formulate the prior. Moreover, asymptotic "consistency" of BMA is not guaranteed in this case.
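A minimal base-R sketch of the EBL rule on simulated data (all variable names below are illustrative) shows how the model-specific g follows from the OLS F statistic:

```r
# Local empirical Bayes: g = max(0, F_OLS - 1) for a given model,
# illustrated on a simulated data set.
set.seed(42)
N <- 72; k <- 3
X <- matrix(rnorm(N * k), N, k)
y <- X %*% c(0.5, -0.3, 0.1) + rnorm(N)
fit   <- lm(y ~ X)
F_ols <- summary(fit)$fstatistic["value"]   # standard OLS F statistic
g_ebl <- max(0, F_ols - 1)                  # model-specific EBL g
g_ebl / (1 + g_ebl)                         # implied shrinkage factor
```

A better-fitting model yields a larger F statistic and hence a shrinkage factor closer to one.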
Hyper-g prior ("hyper"): Liang et al. (2008) propose putting a hyper-prior on g. In order to arrive at closed-form solutions, they suggest a Beta prior on the shrinkage factor of the form g/(1 + g) ~ Beta(1, a/2 − 1), where a is a parameter in the range 2 < a ≤ 4. The prior expected value of the shrinkage factor is then E(g/(1 + g)) = 2/a. Moreover, setting a = 4 corresponds to a uniform prior distribution of g/(1 + g) over the interval [0, 1], while a → 2 concentrates prior mass very close to 1 (thus corresponding to g → ∞). (bms allows setting a via the argument g = "hyper = x", where x denotes the a parameter.) The virtue of the hyper-prior is that it allows for prior assumptions about g, but relies on Bayesian updating to adjust it. This limits the risk of unintended consequences for the posterior results, while retaining the theoretical advantages of a fixed g. Therefore Feldkircher and Zeugner (2009) prefer the hyper-g prior over the other available g prior frameworks.
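The prior moments of the hyper-g specification are easy to verify by simulation in base R (a sketch; the value a = 3 is just an example):

```r
# Hyper-g: shrinkage factor g/(1+g) ~ Beta(1, a/2 - 1), prior mean 2/a.
set.seed(1)
a <- 3
s <- rbeta(1e5, shape1 = 1, shape2 = a / 2 - 1)
c(simulated = mean(s), analytical = 2 / a)   # both close to 2/3
```

Setting a = 4 reduces the prior to Beta(1, 1), i.e., a uniform distribution of the shrinkage factor on [0, 1].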
In an application, Feldkircher and Zeugner (2012) show that the use of the hyper-g prior leads to more robust results. Moreover, the hyper-g prior has the advantage over similar proposals that its posterior moments are available in closed form and can thus be computed at a speed only slightly slower than under fixed g priors.
Both model-specific g priors adapt to the data: The better the signal-to-noise ratio, the closer the (expected) posterior shrinkage factor will be to one, and vice versa. Average statistics on the shrinkage factor can therefore be interpreted as a "goodness-of-fit" indicator. See also Feldkircher and Zeugner (2009), who show that both EBL and hyper-g can be interpreted in terms of the OLS F statistic. 21 Consider, for instance, the Fernández et al. (2001b) example under an empirical Bayes prior:

R> fls_ebl <- bms(datafls, burn = 20000, iter = 50000, g = "EBL",
+    mprior = "uniform", nmodel = 1000, user.int = FALSE)
R> summary(fls_ebl)
No. Obs.         "72"
Model Prior      "uniform / 20.5"
g-Prior          "EBL"
Shrinkage-Stats  "Av=0.9595"

The entry Shrinkage-Stats reports a posterior average EBL shrinkage factor of 0.96, which corresponds to a shrinkage factor g/(1 + g) under g ≈ 24. Consequently, posterior model size is considerably larger than under fls_combi, and the sampler has had a harder time converging, as evidenced by the quite low Corr PMP. This can also be seen by invoking the plot(fls_ebl) command, which yields a two-layer plot combining the plotModelsize and plotConv plots illustrated previously.
The above results show that using a flexible and model-specific prior on the Fernández et al. (2001b) data results in rather small posterior estimates of g/(1 + g), thus indicating that the g = "BRIC" prior used in fls_combi may be set too far from zero. This interacts with the uniform model prior to concentrate posterior model mass on quite large models. However, imposing a uniform model prior means to expect a model size of K/2 = 20.5, which may seem overblown. Instead, try to impose a smaller model size through a corresponding model prior, e.g., a prior model size of 7 as in Sala-i-Martin, Doppelhofer, and Miller (2004). This can be combined with a hyper-g prior, where the argument g = "hyper = UIP" imposes an a parameter such that the prior expected value of g corresponds to the unit information prior (g = N). 22

R> fls_hyper <- bms(datafls, burn = 20000, iter = 50000, g = "hyper=UIP",
+    mprior = "random", mprior.size = 7, nmodel = 1000, user.int = FALSE)
R> summary(fls_hyper)

21 For instance, the posterior expected value of the shrinkage factor for each model M_s with k_s parameters and R-squared R²_s has a closed form that relates mainly to R²_s and k_s and thus resembles a goodness-of-fit indicator. Feldkircher and Zeugner (2009) note that this also holds for its average across models: Under some fairly standard conditions, 1/(1 − E(g/(1 + g)|X, y)) behaves similarly to an adjusted F statistic of the model average.
22 This is the default hyper-g prior and may thus also be obtained with g = "hyper".

Figure 6: Left panel: Posterior density of the shrinkage factor g/(1 + g). The solid line corresponds to its expected value E(g/(1 + g)|y), the dashed lines to a +/− 2 standard deviation interval. Right panel: Image plot. Blue color corresponds to a positive coefficient, red to a negative coefficient, and white to non-inclusion of the respective variable. The horizontal axis is scaled by the models' posterior model probabilities. Hyper-g prior employed.
No. Obs.         "72"
Model Prior      "random / 7"
g-Prior          "hyper (a=2.02778)"
Shrinkage-Stats  "Av=0.9613, Stdev=0.018"

From Shrinkage-Stats, posterior expected shrinkage is 0.96 with rather tight standard deviation bounds. Similar to the EBL case before, the data thus indicate that shrinkage should be smaller than under the BRIC prior (corresponding to a fixed g of g ≈ 24) and should not vary much from its expected value. Since the hyper-g prior induces a proper posterior distribution for the shrinkage factor, it might be helpful to plot its density with the command below:

R> gdensity(fls_hyper)

The focus on smaller models is evidenced by charting the best 1,000 models with the image command:
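The reported a = 2.02778 can be reproduced from the condition that the prior expected shrinkage 2/a equals the UIP shrinkage N/(1 + N), which gives a = 2(1 + N)/N (a quick check in base R):

```r
# g = "hyper = UIP": choose a so that E(g/(1+g)) = 2/a = N/(1+N).
N <- 72                  # observations in datafls
a <- 2 * (1 + N) / N
round(a, 5)              # 2.02778, as in the summary output above
```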

R> image(fls_hyper)
In a broad sense, the coefficient results correspond to those of fls_combi, at least in terms of expected values. However, the results from fls_hyper were obtained under more sophisticated priors that were specifically designed to avoid unintended influence from prior parameters: By construction, the large shrinkage factor under fls_combi induced a quite small posterior model size of 10.4 and concentrated posterior mass tightly on the best models encountered (they make up 39% of the entire model mass). In contrast, the hyper-g prior employed for fls_hyper indicated a rather low posterior shrinkage factor and consequently resulted in a higher posterior model size (16) and less model mass concentration (13%).

Posterior coefficient densities
In order to compare more than just coefficient expected values, it might be preferable to look at the entire posterior distribution of coefficients. For instance, the posterior density of the coefficient for Muslim, a variable with a PIP of 64%, can be generated via density:

R> density(fls_combi, reg = "Muslim")

The computed marginal posterior densities are a Bayesian model averaging mixture of the marginal posterior densities of the individual models. The accuracy of the result therefore depends on the number of "best" models contained in fls_combi. Note that the marginal posterior density can be interpreted as "conditional on inclusion": If the posterior inclusion probability of a variable is smaller than one, then part of its posterior density is a point mass at zero. The integral of the returned density vector therefore adds up to the posterior inclusion probability, i.e., the probability that the coefficient is not zero.
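The "conditional on inclusion" interpretation can be illustrated with a small base-R sketch; the mixture weights and component densities below are made up for illustration and are not taken from datafls:

```r
# BMA marginal density of a coefficient as a PMP-weighted mixture over
# the models that include it; its numeric integral equals the PIP.
grid <- seq(-0.1, 0.1, length.out = 2001)
pmp  <- c(0.40, 0.28)    # illustrative PMPs of two models including the variable
dens <- pmp[1] * dnorm(grid, mean = 0.010, sd = 0.010) +
        pmp[2] * dnorm(grid, mean = 0.015, sd = 0.012)
sum(dens) * (grid[2] - grid[1])   # approximately sum(pmp) = 0.68, the PIP
```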
The marginal densities of posterior coefficient distributions can be plotted with the argument reg specifying the variable under scrutiny. The right panel of Figure 7 illustrates that the coefficient is clearly above zero, but somewhat skewed. The integral of this density adds up to 0.68. The addons argument assigns the vertical bars to be drawn: The conditional expected coefficient from the MCMC results (E) is indicated in contrast to the expected coefficient based on analytical PMPs (e). In addition, the expected coefficients under the individual models are plotted (b) and a legend is included (l). The density seems more symmetric than before, and the analytical results are a bit smaller than what could be expected from the MCMC results.
Nonetheless, even though fls_hyper and fls_combi applied very different g and model priors, the results for the Muslim covariate are broadly similar: It is unanimously positive, with a conditional expected value somewhat above 0.01. In fact, 95% of the posterior coefficient mass seems to be concentrated between 0.004 and 0.022.

Predictive densities
Of course, BMA lends itself not only to inference, but also to prediction. The employed "Bayesian regression" models naturally give rise to predictive densities, whose mixture yields the BMA predictive density, a procedure very similar to the coefficient densities explored in the previous section.
Let us, for instance, use the information from the first 70 countries contained in datafls to forecast economic growth for the last two, namely Zambia (identifier ZM) and Zimbabwe (identifier ZW). Based on their macro-fundamentals (i.e., the explanatory variables contained in datafls) we can form predictions invoking the function pred.density:

R> fcstbma <- bms(datafls[1:70, ], mprior = "uniform", burn = 20000,
+    iter = 50000, user.int = FALSE)
R> pdens <- pred.density(fcstbma, newdata = datafls[71:72, ])

The resulting object pdens holds the distribution of the forecast for the two countries, conditional on what we know from the other countries and on the explanatory data from Zambia and Zimbabwe. The expected value of this growth forecast is very similar to the classical point forecast and can be accessed with pdens$fit. 24 Likewise, the standard deviations of the predictive distribution correspond to classical standard errors and are returned by pdens$std.err. The predictive density for economic growth in, e.g., Zimbabwe may also be visualized with the following command: 25

R> plot(pdens, 2)

Figure 8, left panel, shows that conditional on Zimbabwe's explanatory data, we expect growth to be concentrated around 0. And the actual value in datafls[72, 1], 0.0046, is not too far off from that prediction. A closer look at both densities with the function quantile shows that for Zimbabwe, any growth rate between −0.01 and 0.01 is quite likely.
R> quantile(pdens, c(0.05, 0.95))

             5%          95%
ZM  0.003284431  0.02752649
ZW -0.010804288  0.01154784

For Zambia (ZM), though, based on our regression model we would expect growth to be positive. Compared to Zimbabwe, however, economic growth over our evaluation period has been even worse. 26 Under the predictive density for Zambia, its realized value (−0.012) seems quite unlikely.
To compare the BMA prediction performance with actual outcomes, we could look, e.g., at the forecast error:

R> pdens$fit - datafls[71:72, 1]

The predictive density at the realized value is quite high for Zimbabwe (similar to the mode of the predictive density as seen in Figure 8, left panel), whereas the one for Zambia is quite low. In order to visualize how bad the forecast for Zambia was, compare a plot of the predictive density to the actual outcome, which is situated far to the left:

R> plot(pdens, "ZM", realized.y = datafls["ZM", 1])

The results for Zambia might imply that the country can be considered an outlier. One could also try different prior settings and compare the resulting models by their joint predictions for Zambia and Zimbabwe (or even more countries). Last, as opposed to evaluating the goodness of forecasts via mean squared errors, we illustrate how to make use of the whole distribution of the forecast. Following Fernández et al. (2001a), we calculate the "log-predictive score" (LPS), defined as the negative average of the log predictive densities evaluated at the realized outcomes:

LPS = −(1/M) Σ_i log p(y_i^f | X, y, x_i^f),

where p(y_i^f | X, y, x_i^f) denotes the predictive density for y_i^f (e.g., Zambian growth) based on the model information (y, X) (the first 70 countries) and the explanatory variables for the forecast observation (Zambian investment, schooling, etc.).

24 Note that this is equivalent to predict(fcstbma, datafls[71:72, ]).
25 Here, 2 means to plot the second forecasted observation, in this case ZW, the 72nd row of datafls.
26 Note that since ZM is the row name of the 71st row of datafls, this is equivalent to calling datafls[71, ].
The log-predictive score can be accessed with the function lps:

R> lps(pdens, datafls[71:72, 1])

[1] -0.604418

When comparing different forecasting settings, the model that achieves the lower score is to be preferred. Note, however, that the LPS is only meaningful when comparing different forecast settings.
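As a sketch of the LPS computation, here under the simplifying assumption of normal predictive densities and with illustrative numbers rather than the actual BMS output:

```r
# LPS = - average log predictive density at the realized outcomes.
y_real <- c(-0.012, 0.0046)   # realized growth for ZM and ZW
mu     <- c( 0.015, 0.0005)   # illustrative predictive means
sdev   <- c( 0.008, 0.0070)   # illustrative predictive std. errors
-mean(dnorm(y_real, mean = mu, sd = sdev, log = TRUE))
```

BMS's lps evaluates the exact predictive densities of the model mixture, so its value differs from this normal approximation.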

Concluding remarks
The BMS package implements Bayesian model averaging for R. It excels in offering a range of widely used prior structures coupled with efficient MCMC algorithms to sort through the model space. Amini and Parmeter (2011) and Amini and Parmeter (2012) carry out a comparison of R software packages that implement Bayesian model averaging, in particular the packages BAS (Clyde 2012) and BMA (Raftery, Hoeting, Volinsky, Painter, and Yeung 2014). Amini and Parmeter (2012) conclude that package BMS is the only one among its competitors that is able to reproduce the empirical results in Fernández et al. (2001b), Doppelhofer and Weeks (2009), and the working paper version of Masanjala and Papageorgiou (2008).
The BMS package is very flexible regarding the use of prior information. It allows for uniform and binomial-beta priors on the model space as well as informative prior inclusion probabilities. Via these "customized" model priors one can thus fuse prior beliefs into the otherwise purely agnostic analysis that is prevalent in the applied literature using BMA. The BMS package also provides various specifications for Zellner's g prior, including the so-called hyper-g priors advocated in Liang et al. (2008), Ley and Steel (2012), and Feldkircher and Zeugner (2009). The sensitivity of BMA results to the specification of Zellner's g prior is well documented in the literature (Feldkircher and Zeugner 2012). It is thus of great importance to offer a wide range of prior specifications in order to allow the user to carry out a serious sensitivity analysis.
Finally, the package comes with numerous graphical tools to analyze posterior coefficient densities, the posterior model size, or predictive densities. It also includes a graphical representation of the model space via an image plot. The flexibility and user-friendliness of the package is evidenced by the fact that numerous studies have used package BMS to perform Bayesian model averaging (among others, see Giannone, Lenza, and Reichlin 2011; Horváth 2011; Amini and Parmeter 2012; Babecký, Havránek, Matěju, Rusnák, Šmídková, and Vašíček 2013; Horváth 2013; Iršová and Havránek 2013; Feldkircher 2014; Feldkircher, Horváth, and Rusnák 2014; Horváth, Rusnák, Šmídková, and Zapal 2014). Future versions of BMS might include spatial regression models (Crespo Cuaresma and Feldkircher 2013; Crespo Cuaresma, Doppelhofer, and Feldkircher 2014), model averaging based on the predictive likelihood (Eklund and Karlsson 2007; Feldkircher 2012), so-called "heredity priors" to handle related predictors (Chipman 1996), and dilution priors that penalize models which contain highly collinear regressors (George 2010). For a detailed discussion of dilution and heredity priors, see Moser and Hofmarcher (2014). An add-on to package BMS, also programmed by Moser and Hofmarcher (2014), is available at https://bitbucket.org/matmo/dilutbms2, featuring among others the dilution prior put forward by Durlauf, Kourtellos, and Tan (2012), the heredity priors proposed by Chipman (1996) and employed in Feldkircher et al. (2014) and Feldkircher (2014), and the tessellation sampler proposed by George (2010) that accounts for model space redundancy. The interested reader is further referred to the "BMS-blog" section at http://bms.zeugner.eu/. The web page contains (video) tutorials on the usage of package BMS as well as further BMS add-on packages.
Parameter (mprior.size): a vector of size K + 1, detailing the prior θ_j for each model size from 0 to K; any real > 0 is admissible.

A.2. Available g priors -Synopsis
The following provides an overview of the g priors available in bms. The default is g = "UIP".

Fixed g
Argument: g = x where x is a positive real scalar.
Concept: fixed g common to all models.
Sub-options: The unit information prior g = "UIP" sets g = N; g = "BRIC" sets g = max(N, K²), a combination of BIC and RIC. (Note that these two options guarantee asymptotic consistency.) Other options include g = "RIC" for g = K² and g = "HQ" for the Hannan-Quinn setting g = log(N)³.
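For the growth data used throughout (N = 72 observations, K = 41 candidate regressors), these defaults evaluate as follows (base-R sketch):

```r
# Fixed-g defaults evaluated for N = 72 observations, K = 41 regressors.
N <- 72; K <- 41
c(UIP  = N,            # 72
  RIC  = K^2,          # 1681
  BRIC = max(N, K^2),  # 1681
  HQ   = log(N)^3)     # about 78.2
```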
Hyper-g prior
Sub-options: g = "hyper = x" with x defining the parameter a (e.g., g = "hyper = 3" sets a = 3). g = "hyper" resp. g = "hyper = UIP" sets the prior expected shrinkage factor equivalent to that of the UIP prior, E(g/(1 + g)) = N/(1 + N); g = "hyper = BRIC" sets the prior expected shrinkage factor equivalent to that of the BRIC prior. Note that the latter two options guarantee asymptotic consistency.

A.3. "Bayesian regression" with Zellner's g -Bayesian model selection
The linear model presented in Section 1.2 using Zellner's g prior is implemented in the function zlm. For instance, we might consider the attitude data from Section 2 and estimate just the full model containing all 6 variables. For this purpose, first load the built-in data set with the command

R> data("attitude", package = "datasets")

The full model is obtained by applying the function zlm to the data set and storing the estimate in att_full. Zellner's g prior is specified by the argument g in the same way as in Section 5. 27 The results can then be displayed using, e.g., the summary method:

R> att_full <- zlm(attitude, g = "UIP")
R> summary(att_full)

The results are very similar to those resulting from OLS (which can be obtained via summary(lm(attitude))). The less conservative the prior, i.e., the larger g becomes, the closer the results get to OLS. But remember that the full model was not the best model from the BMA application in Section 2. In order to extract the best encountered model, use the function as.zlm (with the argument model specifying the rank-order of the model to be extracted). The following command reads the best model from the BMA results into the variable att_best:

R> att_best <- as.zlm(att, model = 1)
R> summary(att_best)

As suspected, the best model according to BMA is the one including only complaints and the intercept, as it has the highest log-marginal likelihood (logLik(att_best)). In this way, the command as.zlm can be combined with bms for "Bayesian model selection", i.e., using the model prior and posterior framework to focus on the model with the highest posterior mass. Via the utility model.frame, this best model can be converted straightforwardly into a standard OLS model:

R> att_bestlm <- lm(model.frame(as.zlm(att)))
R> summary(att_bestlm)

A.4. BMA when keeping a fixed set of regressors
While BMA should usually compare as many models as possible, some considerations might dictate the restriction to a subspace of the 2^K models. For complicated settings one might employ a custom-designed model prior (cf., Section A.1). By far the most common setting, though, is to keep some regressors fixed in the model setting, and to apply Bayesian model uncertainty only to a subset of regressors.

No. Obs.         "30"
Model Prior      "uniform / 4"
g-Prior          "UIP"
Shrinkage-Stats  "Av=0.9677"
Time difference of 0.008205175 secs

The results show that the PIPs and the coefficients for the remaining variables increase somewhat compared to att. The higher PIPs are related to the fact that the posterior model size (as in sum(coef(att_learn)[, 1])) is considerably larger than under att. This follows naturally from our model prior: Putting a uniform prior on all models between parameter size 2 (the base model) and 6 (the full model) implies a prior expected model size of 4 for att_learn instead of the 3 for att. 28 So to achieve comparable results, one needs to take the number of fixed regressors into account when setting the model prior parameter mprior.size.

Consider another example: Suppose we would like to sample the importance and coefficients of the cultural dummies in the data set datafls, conditional on information from the remaining "hard" variables. This implies keeping 27 fixed regressors, while sampling over the 14 cultural dummies. Since model uncertainty thus applies only to 2^14 = 16,384 models, we resort to full enumeration of the model space.

28 The command att_learn2 = bms(attitude, mprior = "fixed", mprior.size = 3, fixed.reg = c("complaints", "learning")) produces coefficients that are much more similar to att.
R> fls_culture <- bms(datafls, fixed.reg = c(1, 8:16, 24, 26:41),
+    mprior = "random", mprior.size = 28, mcmc = "enumerate",
+    user.int = FALSE)

Here, the vector c(1, 8:16, 24, 26:41) denotes the indices of the regressors in datafls to be kept fixed. 29 Moreover, we use the binomial-beta ("random") model prior. The prior model size of 28 embodies our prior expectation that on average 1 out of the 14 cultural dummies should be included in the true model. As before, we find that Confucian (with a positive sign) as well as Hindu and SubSahara (negative signs) have the most important impact conditional on "hard" information. Moreover, the data seem to attribute more importance to the cultural dummies than we expected under our model prior: Comparing prior and posterior model size with the following command shows how much importance is attributed to the dummies.
R> plotModelsize(fls_culture, ksubset = 27:41)

Figure 9 shows that the expected posterior model size is close to 33, which means that about 6 of the cultural dummies should actually be included in a "true" model.