BFpack: Flexible Bayes Factor Testing of Scientific Theories in R

There has been tremendous methodological development of Bayes factors for hypothesis testing in the social and behavioral sciences and related fields. This development is due to the flexibility of the Bayes factor for testing multiple hypotheses simultaneously, its ability to test complex hypotheses involving equality as well as order constraints on the parameters of interest, and the interpretability of the outcome as the weight of evidence provided by the data in support of competing scientific theories. The available software tools for Bayesian hypothesis testing are still limited, however. In this paper we present a new R package called BFpack that contains functions for Bayes factor hypothesis testing for many common testing problems. The software includes novel tools (i) for Bayesian exploratory testing (null vs positive vs negative effects), (ii) for Bayesian confirmatory testing (competing hypotheses with equality and/or order constraints), (iii) for common statistical analyses, such as linear regression, generalized linear models, (multivariate) analysis of (co)variance, correlation analysis, and random intercept models, (iv) using default priors, and (v) while allowing data to contain observations that are missing at random.


Introduction
This paper presents a new software package called BFpack, which can be used for computing Bayes factors and posterior probabilities for statistical hypotheses in common testing problems in the social and behavioral sciences, medical research, and related fields. This new package responds to the increasing interest of the scientific community in testing statistical hypotheses using Bayes factors in the software environment R (R Development Core Team, 2013). Bayes factors enjoy many useful practical and theoretical properties which are not generally shared by classical significance tests. These include the intuitive interpretation as the relative evidence in the data between two hypotheses, the ability to simultaneously test multiple hypotheses which may contain equality as well as order constraints on the parameters of interest, and consistent behavior, which implies that the true hypothesis will be selected with probability one as the sample size grows. The interested reader is referred to the many important contributions including (but not limited to) Jeffreys (1961); Berger & Delampady (1987); Sellke et al. (2001); E.-J. Wagenmakers (2007); Rouder et al. (2009); Masson (2011); Hoijtink (2011); E. Wagenmakers et al. (2018); Hoijtink et al. (2019), and the references therein. This has resulted in a growing literature in which Bayes factors are used for testing scientific expectations (Well et al., 2008; Van de Schoot et al., 2006; Braeken et al., 2015; Vrinten et al., 2016; van Schie et al., 2016; Mulder & Wagenmakers, 2016; Hoijtink & Chow, 2017; de Jong et al., 2017; Gronau et al., 2017; Schönbrodt et al., 2017; van Ravenzwaaij et al., 2019; Flore et al., 2019; Dogge et al., 2019; Zondervan-Zwijnenburg et al., 2019).
The Bayes factors that are implemented in BFpack are based on recent developments in Bayesian hypothesis testing of location parameters, such as (adjusted) means and regression coefficients (Gu et al., 2017, 2019; Mulder, 2014b), variance components, such as group variances and intraclass correlations (Böing-Messing & Mulder, 2017; Mulder & Fox, 2019), and measures of association (Mulder, 2016; Mulder & Gelissen, 2019). These Bayes factors can be used for common testing problems in the social and behavioral sciences and related fields, such as (multivariate) t testing, (multivariate) linear regression, (multivariate) analysis of (co)variance, and correlation analysis. The package allows users to perform (i) exploratory Bayesian tests of whether a model parameter equals zero, is negative, or is positive, and (ii) confirmatory Bayesian tests in which users specify a set of competing hypotheses with equality and/or order constraints on the parameters of interest. This allows users to test their scientific expectations in a direct manner. Thus, by providing Bayesian statistical tests for multiple hypotheses with equality as well as order constraints, BFpack makes important contributions beyond existing software packages such as lmtest (Hothorn et al., 2019) and car (J. Fox & Weisberg, 2019), which contain key functions for classical significance tests of a single equality-constrained null hypothesis, e.g., lmtest::coeftest() and car::linearHypothesis().
To ensure a simple and user-friendly experience, the different Bayes factor tests are implemented via a single function called BF, which is the workhorse of the package. The function needs a fitted model object obtained from a standard R analysis (e.g., lm, glm; see Table 1 for a complete overview) and, in the case of a confirmatory test, a string that specifies a set of competing hypotheses (example hypotheses are provided in Table 2). Another optional argument is the specification of the prior probabilities of the hypotheses. By building on these traditional statistical analyses, which are well established in the R community, we provide users with additional statistical measures which cannot be obtained under a frequentist framework, such as default measures of the relative evidence in the data between competing statistical hypotheses as quantified by the Bayes factor.
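Once Bayes factors against a common (unconstrained) alternative are available, posterior hypothesis probabilities follow from Bayes' theorem by weighting each Bayes factor with its prior probability. A minimal language-agnostic sketch with hypothetical Bayes factor values (not BFpack output):

```python
# hypothetical Bayes factors of three hypotheses against an
# unconstrained alternative, and equal prior probabilities
bf_u = {"H1": 4.2, "H2": 0.3, "H3": 1.0}
prior = {h: 1 / 3 for h in bf_u}

# the posterior probability of H_t is proportional to prior_t * BF_tu
denom = sum(prior[h] * bf_u[h] for h in bf_u)
post = {h: prior[h] * bf_u[h] / denom for h in bf_u}
# post["H1"] is roughly 0.76: most posterior support goes to H1
```

Changing the prior probabilities (the optional argument mentioned above) simply reweights the same Bayes factors.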
When testing hypotheses using the Bayes factor, the use of arbitrary or ad hoc priors should generally be avoided (Lindley, 1957; Jeffreys, 1961; Bartlett, 1957; Berger & Pericchi, 2001). Therefore the tests implemented in BFpack are based on default Bayes factor methodology. Default Bayes factors can be computed without requiring external prior knowledge about the magnitude of the parameters. The motivation is that, even when prior information is available, formulating informative priors which accurately reflect one's prior beliefs under all separate hypotheses under investigation is a very challenging and time-consuming endeavor (Berger, 2006). Different default Bayes factors with default priors are implemented for testing different types of parameters, such as location parameters (e.g., means or regression coefficients in univariate/multivariate normal linear models), measures of association (e.g., correlations in multivariate normal distributions), and variance components (e.g., group variances, intraclass correlations). For testing unbounded parameters, such as location parameters and group variances, adjusted fractional Bayes factors (O'Hagan, 1995; Mulder, 2014b; Böing-Messing & Mulder, 2017) have been implemented. These Bayes factors have analytic expressions and are therefore easy to compute. For testing bounded parameters, such as measures of association and intraclass correlations, proper uniform priors are implemented. When testing intraclass correlations under random intercept models, a novel marginal modeling approach is employed where the random effects are integrated out (Mulder & Fox, 2019; J.-P. Fox et al., 2017; Mulder & Fox, 2013).
On the one hand, these tests can be used for testing hypotheses on intraclass correlations based on substantive considerations; on the other hand, the tests can be used as a tool when building multilevel models, as the marginal modeling approach provides a more general framework for testing covariance structures than regular mixed effects models.

Table 1: R functions, packages, descriptions of tests, parameters of interest, example names of the parameters that are tested, and a classification of whether the test is an exact or an approximate Bayes factor test. For this table we assume that 'y1' is the label of an outcome variable, 'x1' is the label of a numeric predictor variable, and 'g1' is the label of a level of a grouping (factor) variable. The actual names that are used depend on the names of the variables in the model.
To also facilitate the use of Bayes factors for more general testing problems, an approximate Bayes factor is implemented, which is based on a large sample approximation resulting in an approximately Gaussian posterior distribution. The approximate Bayes factor only requires the (classical) estimates of the parameters that are tested, the corresponding error covariance matrix, and the sample size of the data that were used to obtain the estimates and covariance matrix. The resulting approximate Bayes factor can be viewed as a Bayesian counterpart of the classical Wald test. This makes the approximate Bayes factor very useful as a general test for statistical hypotheses.

Table 2: Examples of the hypotheses that can be tested.
Equality testing: H_1: θ_1 = θ_2 = θ_3 vs H_2: "not H_1"
Order testing: H_1: θ_1 > θ_2 > θ_3 vs H_2: θ_1 < θ_2 < θ_3 vs H_3: "neither H_1, nor H_2"
Equality and order testing: H_1: θ_12 < θ_13 = θ_14 vs H_2: "not H_1"

Note that even though it is possible to also use the approximate Bayes factor for testing problems for which exact tailor-made Bayes factors are available in BFpack (e.g., for an lm object), we recommend using the exact tailored Bayes factors when they are available. The reason is that the exact Bayes factors give an exact quantification of the evidence between statistical hypotheses instead of an approximate one. Further note that because the sample size is also required to perform the approximate Bayes factor test, BF() cannot be applied directly to the output of classical testing functions such as lmtest::coeftest(). In Section 4.5 we show how to do this with one additional step. Table 1 shows for which models an exact Bayes factor is implemented and for which models we make use of the approximation.
Before presenting the statistical methodology and functionality of BFpack, it is important to understand what BFpack adds to the currently available software packages for Bayes factor testing. First, the R package BayesFactor mainly focuses on precise and interval null hypotheses of single parameters in Student t tests, anova designs, and regression models. It is not designed for testing more complex relationships between multiple parameters. Second, the package BIEMS (Mulder et al., 2012), which comes with a user interface for Windows, can be used for testing various equality and order hypotheses under the multivariate normal linear model. The computation of the Bayes factors, however, is too slow for general usage when simultaneously testing many equality constraints, as equality constraints are approximated with interval constraints that are made sufficiently small using a computationally intensive step-wise algorithm. Third, the bain package computes approximate default Bayes factors by assuming normality of the posterior and a default prior. The package has shown good performance for challenging testing problems such as structural equation models. The BFpack package also builds on some of the functionality of bain in more complex statistical models. Unlike bain, however, the implementation in BFpack builds on existing R functions such as dmvnorm or pmvnorm from the mvtnorm package (Genz et al., 2016) instead of calling external Fortran subroutines. This results in Bayes factors that essentially have zero Monte Carlo error. Furthermore, it is important to note that the Gaussian nature of the default prior in bain may not be appropriate when testing bounded parameters, such as measures of association or intraclass correlations, or when the Gaussian approximation of the posterior would be too crude, such as when testing group variances in the case of small sample sizes.
Finally, the free statistical software environment JASP (Love et al., 2019), which has contributed tremendously to the use of Bayes factors in psychological research and other research fields, is specifically designed for non-R users by providing a user-friendly graphical user interface similar to SPSS. The Bayes factors implemented in JASP rely on other packages such as BayesFactor and bain. BFpack, on the other hand, is developed to give R users a flexible tool for testing a very broad class of hypotheses involving equality and/or order constraints on various types of parameters (means, regression coefficients, variance components, and measures of association) under common statistical models by building on standard R functions.
The paper is organized as follows. Section 2 describes the key aspects of the Bayes factor methodology that is implemented in BFpack. This section separately describes Bayes factors for location parameters, for measures of association, and for variance components. Section 3 gives a general explanation of how the main function BF should be used. Section 4 presents eight different applications of the methodology and software for a variety of testing problems. The paper ends with some concluding remarks in Section 5.

Technical background of the default Bayes factors
The general form of the hypotheses that can be tested using BFpack consists of a set of linear equality constraints and a set of linear order constraints on the vector of model parameters, denoted by θ of size P, i.e.,

H_t : R_E θ = r_E, R_O θ > r_O,   (1)

where [R_E | r_E] is a q_E × (P + 1) augmented matrix specifying the equality constraints and [R_O | r_O] is a q_O × (P + 1) augmented matrix specifying the order constraints. A hypothesis index is omitted to keep the notation simple. In the case that R_O is of full row rank (which is most often the case), a parameter transformation can be applied according to which the q_E equality-restricted parameters equal θ_E = R_E θ, the q_O order-restricted parameters equal θ_O = R_O θ, and the P − q_O − q_E nuisance parameters equal φ = Dθ, where the (P − q_E − q_O) × P dummy matrix D is chosen such that the transformation is one-to-one. Subsequently the hypothesis can equivalently be formulated as

H_t : θ_E = r_E, θ_O > r_O,   (3)

where the nuisance parameters φ are omitted. Note that for most order hypotheses the matrix R_O will be of full row rank; this holds, for example, for H_t : θ_1 > θ_2 > θ_3. Therefore we will work with the formulation in Equation 3 throughout this paper to keep the notation simple. In the case R_O is not of full row rank, which is for instance the case for H_t : (θ_1, θ_2) > (θ_3, θ_4), a similar type of formulation of H_t can be produced as in Equation 3.¹ Next we specify a prior for the free (possibly order-constrained) parameters under H_t, denoted by π_t, by truncating an unconstrained prior, π_u, that is specified under an unconstrained alternative model, i.e.,

π_t(θ_O, φ) ∝ π_u(θ_O, φ | θ_E = r_E) I(θ_O > r_O),   (4)

where I(·) denotes the indicator function.
Using this pair of priors under the constrained hypothesis H_t and the unconstrained alternative hypothesis H_u, we can write the Bayes factor of H_t against H_u as

BF_tu = [ π_u(θ_E = r_E | Y) / π_u(θ_E = r_E) ] × [ P(θ_O > r_O | θ_E = r_E, Y) / P(θ_O > r_O | θ_E = r_E) ],   (5)

where the first factor is a ratio of posterior and prior densities of θ_E evaluated at the constant vector r_E, which can be viewed as a multivariate Savage-Dickey density ratio (Dickey, 1971; Wetzels et al., 2010; Mulder et al., 2010), and the second factor is a ratio of conditional posterior and prior probabilities that the order constraints hold given that the equality constraints hold. We shall refer to Equation 5 as an extended Savage-Dickey density ratio. In the non-full-rank case, the (P − q_E) × P matrix D̄, which consists of the unique rows of the matrices specifying the order constraints and the nuisance transformation, is used, where (generalized) Moore-Penrose inverses are used for the non-square matrices.
The expression can simply be computed when the marginal and conditional posteriors and priors belong to known probability distributions (examples will be given later), and thus direct computation of the marginal likelihood, which can be a challenging problem, is avoided. The four statistical measures in Equation 5 have the following intuitive interpretations:

• The marginal posterior density evaluated at θ_E = r_E (the numerator of the first factor) is a measure of the relative fit of the equality constraints of H_t relative to H_u, as a large (small) posterior value under the unconstrained model indicates that there is evidence in the data that θ_E is (not) close to r_E.
• The conditional posterior probability of θ O > r O given θ E = r E (numerator of second factor) is a measure of the relative fit of the order constraints of H t relative to H u as a large (small) probability under the unconstrained model indicates that there is evidence in the data that the order constraints (do not) hold.
• The marginal prior density evaluated at θ E = r E (denominator of first factor) is a measure of the relative complexity of the equality constraints of H t relative to H u as a large (small) prior value indicates that the prior for θ E is (not) concentrated around r E , and thus there is little (big) difference between the precise formulation θ E = r E or the unconstrained formulation H u .
• The conditional prior probability of θ O > r O given θ E = r E (denominator of second factor) is a measure of the relative complexity of the order constraints of H t relative to H u as a large (small) probability under the unconstrained model indicates that the order constrained subspace under H t is relatively large (small), indicating that the constrained model is complex (simple).
It is important to note that by conditioning on θ E = r E in Equation 4, we make specific assumptions about the prior of the free parameters under H t in relation to the unconstrained prior (Verdinelli & Wasserman, 1995;Marin & Robert, 2010), and therefore the expression should be used with some care (see also Consonni & Veronese, 2008, for an interesting discussion on this topic). Below we provide examples of Bayes factors that can and Bayes factors that cannot be expressed as an extended Savage-Dickey density ratio.
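As a toy illustration of these four quantities, consider a single parameter θ with a normal unconstrained prior centered at the test value 0 and a normal unconstrained posterior. The numbers below are hypothetical (not BFpack defaults); the sketch only shows how the density ratio and the probability ratio are combined:

```python
import math

def norm_pdf(x, mu, sd):
    # density of a normal distribution at x
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def norm_sf(x, mu, sd):
    # P(theta > x) under a normal distribution
    return 0.5 * math.erfc((x - mu) / (sd * math.sqrt(2)))

prior_mu, prior_sd = 0.0, 1.0   # unconstrained prior centered at the boundary 0
post_mu, post_sd = 0.4, 0.15    # unconstrained posterior after observing data

# equality test H1: theta = 0, via the Savage-Dickey density ratio
bf_eq = norm_pdf(0.0, post_mu, post_sd) / norm_pdf(0.0, prior_mu, prior_sd)

# order test H2: theta > 0, via the ratio of posterior to prior probabilities
bf_order = norm_sf(0.0, post_mu, post_sd) / norm_sf(0.0, prior_mu, prior_sd)
```

Here bf_eq is below 1 (the posterior mass has moved away from 0, so the data argue against θ = 0), while bf_order is close to 2 (nearly all posterior mass is positive against a prior probability of 0.5).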

Testing location parameters
Many common testing problems in statistical science involve testing of location parameters, which determine the 'location' or 'shift' of the distribution of the data. Examples of location parameters are means, regression coefficients, and factor loadings. These parameters are unbounded, and under an objective Bayesian estimation framework flat improper priors are specified for them, i.e., π_u(θ) ∝ 1.
Fractional Bayes methodology is an effective framework for testing location parameters. Informative (subjective) prior specification is avoided by splitting the data into a minimal fraction that is used for updating a noninformative improper prior to a proper default prior and a maximal fraction that is used for hypothesis testing (O'Hagan, 1995; De Santis & Spezzaferri, 1999). Despite the various useful properties of fractional Bayes factors (e.g., consistency, coherence when testing multiple hypotheses, invariance to transformations of the data; O'Hagan, 1997), an adjustment was needed in order for the fractional Bayes factor to function as an Occam's razor when testing order hypotheses (Mulder, 2014b; Mulder & Olsson-Collentine, 2019). This is achieved by shifting the default prior to the boundary of the constrained space.² In the simple case of testing θ < 0 versus θ > 0, the default prior would be centered at 0 (instead of around the MLE) so that the prior probabilities of θ < 0 and θ > 0 under the unconstrained model are both equal to 0.5, which reflects that a negative effect is a priori equally likely as a positive effect. Centering the unconstrained prior on the boundary also resulted in desirable testing behavior for order hypotheses when using intrinsic Bayes factors (Mulder et al., 2010; Mulder, 2014a) and when using the BIC (Mulder & Raftery, accepted). Interestingly, when testing location parameters with flat improper priors, the adjusted fractional Bayes factor (and the fractional Bayes factor as well) of H_t against H_u can be expressed as an extended Savage-Dickey density ratio as in Equation 5, i.e.,

BF_tu = [ π_u(θ_E = r_E | Y) / π*_u(θ_E = r_E | Y^b) ] × [ P(θ_O > r_O | θ_E = r_E, Y) / P*(θ_O > r_O | θ_E = r_E, Y^b) ],   (6)

where the distributions conditional on Y^b in the denominators denote the unconstrained default priors that contain a minimal fraction b of the information in the complete data Y, and the asterisk (*) denotes the default prior adjustment.
When the data contain information from different groups and the sample sizes vary strongly across groups, it is generally recommended to use group-specific fractions to properly control the amount of prior information from each group (De Santis & Spezzaferri, 2001).

Univariate/multivariate normal linear models
Recently, the adjusted fractional Bayes factor was derived for testing hypotheses under the multivariate normal linear model with multiple groups. Under this model the unconstrained posterior of the matrix of location parameters follows a matrix Student t distribution T_u, and the unconstrained default prior follows a matrix Cauchy distribution C_u, where b denotes a vector of minimal fractions that are group specific. Under these matrix-variate distributions, the posterior and prior densities at r_E, and the conditional posterior and prior probabilities that the order constraints hold, do not have analytic expressions. In BFpack, these quantities are computed using Monte Carlo integration. We use the fact that draws from a matrix Student t and a matrix Cauchy distribution can be obtained by first sampling a covariance matrix from an inverse Wishart distribution, and subsequently drawing the matrix of location parameters from the corresponding matrix Gaussian distribution conditional on the drawn covariance matrix (Box & Tiao, 1973). Therefore, the posterior density evaluated at θ_E = r_E can be obtained by repeatedly drawing covariance matrices from the marginal posterior and computing the posterior density as the arithmetic average of the Gaussian densities evaluated at θ_E = r_E, which have analytic expressions. The Gaussian densities are computed using the dmvnorm function from the mvtnorm package (Genz et al., 2016). Such a procedure is also implemented to obtain the prior density and the conditional prior and posterior probabilities. The Gaussian probabilities are obtained using the pmvnorm function from the mvtnorm package.
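This mixture construction can be illustrated in one dimension: a Student t density is an inverse-gamma scale mixture of normal densities, so averaging normal densities over variance draws recovers the t density. A language-agnostic sketch (the matrix-variate case in BFpack works analogously, with inverse Wishart draws in place of the inverse gamma):

```python
import math
import random

nu = 5  # degrees of freedom of the target Student t distribution
random.seed(42)

def norm_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# draw variances with 1/sigma^2 ~ Gamma(nu/2, rate = nu/2), i.e.
# sigma^2 ~ inverse gamma(nu/2, nu/2), and average the conditional
# normal densities at the evaluation point 0
draws = 100_000
est = 0.0
for _ in range(draws):
    inv_var = random.gammavariate(nu / 2, 2 / nu)  # shape nu/2, scale 2/nu
    est += norm_pdf(0.0, 0.0, 1.0 / inv_var)
est /= draws

# analytic Student t density at 0 for comparison
t_dens = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
```

With 100,000 draws the Monte Carlo average agrees with the analytic t density to roughly three decimals, which is the same averaging logic used for the matrix-variate densities above.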
In case the constraints are formulated only on the effects belonging to the same dependent variable, or only on the effects belonging to the same independent (predictor) variable, the marginal and conditional distributions of the unconstrained parameters follow multivariate Student t distributions. The respective measures of relative complexity and fit then have analytic expressions which are efficiently computed using dmvt and pmvt (mvtnorm package) in BFpack. Finally, note that fractional Bayes factors between two constrained hypotheses can be obtained using the coherence property of the Bayes factor, i.e., BF_tt' = BF_tu / BF_t'u. This Bayes factor test is executed when the data are fitted using the R functions t_test, lm, aov, and manova. Note that the usual t test function in R, t.test, cannot be used because its output (of class htest) does not contain the observed sample means and sample variances of the two groups, which are needed for the computation of the Bayes factors. For this reason, the equivalent function t_test (from the bain package) is used, which contains the sample means and variances in addition to the standard output of t.test.

General statistical models
Under more complex statistical models where the four quantities in Equation 6 do not have analytic expressions, or when they cannot be computed efficiently via Monte Carlo estimation, an approximation of the adjusted fractional Bayes factor can be used (Gu et al., 2017). This approximation relies on large sample theory, where the unconstrained posterior and default prior are approximated with Gaussian distributions. As such, this approximate default Bayes factor can be viewed as a Bayesian counterpart of the classical Wald test. It can be used when no exact Bayes factor is available (see the sixth column in Table 1).
First the nuisance parameters are integrated out to yield the marginal posterior for (θ_E, θ_O). Following large sample theory (Gelman et al., 2014, Ch. 4), this posterior can then be approximated with a multivariate Gaussian distribution using the MLE and error covariance matrix. The approximated Gaussian posterior can then be used to obtain estimates of the posterior quantities in the numerators of Equation 6. The corresponding default prior for (θ_E, θ_O) is obtained by raising the posterior to a minimal fraction b, which results in a multivariate Gaussian distribution where the error covariance matrix is multiplied by the reciprocal of the minimal fraction and the mean is shifted towards the boundary of the constrained space. This default Bayes factor can then be written as

BF_tu = [ N(θ_E = r_E | Y) / N*(θ_E = r_E | Y^b) ] × [ P_N(θ_O > r_O | θ_E = r_E, Y) / P*_N(θ_O > r_O | θ_E = r_E, Y^b) ],   (8)

where N denotes an unconstrained multivariate (or matrix-variate) normal distribution and the asterisk (*) again denotes the default prior adjustment. In BFpack the posterior and prior densities, and the posterior and prior probabilities, are directly computed using the dmvnorm and pmvnorm functions from the mvtnorm package, respectively. In the case the matrix of order constraints is not of full row rank, pmvnorm cannot be used for computing the needed probabilities; in this special case the bain function is called from the bain package. Hence, in order to compute the approximate Bayes factor in Equation 8, only the MLEs, the error covariance matrix, and the sample size are needed. The minimal fraction is then set equal to the number of constraints that are tested divided by the sample size. These three elements can simply be extracted from fitted model objects obtained using other packages in R. Currently BFpack supports objects of class glm, coxph, polr, survreg, and zeroinfl. When executing BF() on an object of these classes, the function BF.default is called, which extracts the estimates, the error covariance matrix, and the sample size from the fitted model object to compute Equation 8 for the hypotheses of interest.
This Bayes factor is also called when the input argument for BF() is a named vector with the estimates, its corresponding error covariance matrix, and the sample size, together with the hypotheses of interest. This is explained in more detail in Section 3.
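For a single coefficient, the whole approximation reduces to a few lines. The numbers below are hypothetical summary statistics, and the fraction follows the rule above (one constraint divided by the sample size); this is a sketch of the Gaussian Savage-Dickey step, not the BFpack implementation:

```python
import math

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# hypothetical classical output for one regression coefficient
theta_hat, se, n = 0.30, 0.10, 120

# minimal fraction: number of constraints tested divided by the sample size
b = 1 / n

# approximate Gaussian posterior N(theta_hat, se^2); the default prior is the
# posterior raised to the power b, i.e. a normal with variance se^2 / b, with
# its mean shifted to the boundary value 0 of the constrained space
bf_eq = norm_pdf(0.0, theta_hat, se) / norm_pdf(0.0, 0.0, se / math.sqrt(b))
```

The ratio compares how plausible the test value 0 is under the peaked posterior versus the wide, boundary-centered default prior.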

Testing measures of association
Correlation coefficients and other measures of association play a central role in applied research to quantify the strength of the linear relationship between two variables, possibly controlling for other variables. Measures of association satisfy two conditions: first, they are bounded between -1 and 1, and second, they lie in a correlation matrix, which must be positive definite. The second condition implies that the correlations need to satisfy a complex set of constraints (e.g., Rousseeuw & Molenberghs, 1994). The volume of this subspace for increasing dimensions of the correlation matrix was explored by Joe (2006).
As measures of association are bounded, fractional Bayes methodology is not needed: the noninformative joint uniform prior for the correlations in the correlation matrix is already proper, and thus a regular default Bayes factor can be computed. This was also recommended by Jeffreys (1935). This joint uniform prior assumes that any configuration of correlations that results in a positive definite correlation matrix is equally likely a priori. Equivalently, proper uniform priors can be formulated for the measures of association under the constrained hypotheses under investigation. It is easy to show that this proper uniform prior under H_t can be written as a truncation of the unconstrained joint uniform prior as in Equation 4, and therefore the Bayes factor of a constrained hypothesis against an unconstrained alternative can be expressed as an extended Savage-Dickey density ratio as in Equation 5, where the unconstrained prior in the denominator is the joint uniform prior and the unconstrained posterior is proportional to the likelihood times this uniform prior (Mulder & Gelissen, 2019). Furthermore, as was shown by Mulder (2016), the unconstrained posterior for the measures of association can be well approximated with a multivariate normal distribution after a Fisher transformation of the parameters. This can be explained by the fact that the sample correlation and the population correlation play a similar role in the likelihood (Johnson & Kotz, 1970), and therefore approximate normality is achieved for the posterior when using a noninformative prior such as the employed joint uniform prior. The Bayes factor for measures of association that is implemented in BFpack can therefore be written as an extended Savage-Dickey density ratio in which the unconstrained posterior is approximated with a multivariate normal distribution on the Fisher-transformed parameters.³ To obtain the prior measures of relative complexity, numerical estimates can be obtained by approximating the joint prior using unconstrained draws, from which the prior density and probability can simply be computed using the number of draws satisfying the constraints.
In BFpack this is done by calling Fortran 90 subroutines from R. This Bayes factor test can be executed when the fitted model is a multivariate linear regression model (so that the fitted object is of class mlm). Furthermore, an approximation of the Bayes factor is obtained when the fitted model object is obtained using the R function hetcor (from the polycor package; J. Fox, 2016). The mean vector and covariance matrix of the approximately multivariate normal posterior for the measures of association are obtained by extracting the estimates and standard errors from the hetcor object.
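The prior-complexity estimation via unconstrained draws can be sketched for a 3 × 3 correlation matrix: sample the three correlations uniformly on (-1, 1), keep the draws that give a positive definite matrix, and count how many satisfy the constraints. A stand-alone sketch (not the Fortran implementation):

```python
import random

def is_pos_def(r12, r13, r23):
    # Sylvester's criterion for a 3x3 correlation matrix with unit diagonal
    return 1 - r12 ** 2 > 0 and \
        1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2 > 0

random.seed(7)
n_draws = 200_000
pd_draws = []
for _ in range(n_draws):
    d = tuple(random.uniform(-1, 1) for _ in range(3))
    if is_pos_def(*d):
        pd_draws.append(d)

# fraction of the cube (-1,1)^3 that gives a positive definite matrix
# (about 0.617 in three dimensions; cf. Joe, 2006)
pd_fraction = len(pd_draws) / n_draws

# prior probability of the order constraint r12 > r13 > r23 under the
# joint uniform prior; by permutation symmetry this should be close to 1/6
prior_prob = sum(1 for r12, r13, r23 in pd_draws if r12 > r13 > r23) / len(pd_draws)
```

The same counting logic gives prior densities (via draws in a small neighborhood of the equality value) and extends to higher-dimensional correlation matrices.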

Testing group variances
Testing the heterogeneity of group variances plays a central role in psychological science and related fields. A default Bayes factor for testing equality and order hypotheses on group variances was developed by Böing-Messing & Mulder (2017) using adjusted fractional Bayes methodology. As variance parameters belong to the family of scale parameters, a scale adjustment is needed to obtain a default Bayes factor that functions as an Occam's razor for order hypotheses on variances (instead of a location shift as for location parameters; see Böing-Messing & Mulder, 2016). Because the noninformative independence Jeffreys prior for group variances across competing equality-constrained hypotheses does not satisfy Equation 4, the fractional Bayes factor for the equality part (i.e., the first factor in Equation 5) cannot be expressed as a Savage-Dickey density ratio, but the ratio of (conditional) probabilities for the order part is still present.
The Bayes factor for the group variance test can be written as

BF_tu = BF_t'u × [ P_IG(θ_O > r_O | θ_E = r_E, Y) / P*_IG(θ_O > r_O | θ_E = r_E, Y^b) ],   (10)

where BF_t'u denotes the fractional Bayes factor of hypothesis H_t' : θ_E = r_E (i.e., hypothesis H_t where the order constraints are omitted; see also Pericchi et al., 2008) against H_u, and IG indicates that the (conditional) probabilities are computed under unconstrained inverse gamma distributions. We refer the interested reader to Böing-Messing & Mulder (2017) for the mathematical expressions and derivations.
This Bayes factor test can be executed when the fitted model object is obtained from the R function bartlett_test, which was designed for BFpack. This test is equivalent to the usual bartlett.test, but the output object (of class BF_bartlett) also contains the sample variances and sample sizes, which are needed for computing the Bayes factors in Equation 10.
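The order part of such a test boils down to a posterior probability that one group variance exceeds another, which can be estimated by simulation from the inverse gamma (conditional) posteriors of the variances. A hypothetical sketch with invented summary statistics (priors are ignored here for simplicity, so this is not the exact fractional Bayes factor computation):

```python
import random

random.seed(3)

# hypothetical sample variances and sample sizes for two groups
s2 = {"g1": 2.0, "g2": 1.0}
n = {"g1": 40, "g2": 40}

def draw_var(group):
    # sigma^2 | data ~ inverse gamma((n - 1)/2, (n - 1) s^2 / 2):
    # draw from the matching gamma with unit scale and invert
    shape = (n[group] - 1) / 2
    rate = (n[group] - 1) * s2[group] / 2
    return rate / random.gammavariate(shape, 1.0)

# Monte Carlo estimate of P(sigma_g1^2 > sigma_g2^2 | data)
draws = 50_000
p_order = sum(draw_var("g1") > draw_var("g2") for _ in range(draws)) / draws
```

With the sample variance of group g1 twice that of g2 at these sample sizes, the posterior probability of the ordering is close to 0.98.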

Testing between-cluster variances and intraclass correlations in mixed effects models
The multilevel or mixed effects model is the gold standard for modeling hierarchically structured data. In the mixed effects model the within-cluster variability is modeled separately from the between-cluster variability. The intraclass correlation plays a central role as a measure of the relative degree of clustering in the data, where an intraclass correlation close to 1 (0) indicates a very high (low) degree of clustering in the data. Despite the widespread usage of mixed effects models in the (applied) statistical literature, there are few statistical tests for variance components; exceptions include Westfall & Gönen (1996); Garcia-Donato & Sun (2007); Saville & Herring (2009); Thalmann et al. (2017). The complicating factor is that testing whether the between-cluster variance equals zero is a boundary problem. In BFpack a Bayes factor testing procedure is implemented for intraclass correlations (and random intercept variances) under a marginal modeling framework where the random effects are integrated out (Mulder & Fox, 2019; J.-P. Fox et al., 2017; Mulder & Fox, 2013). Under the marginal model the intraclass correlations become covariance parameters which may attain negative values. This crucial step allows us to test the appropriateness of a random effects model using the posterior probability that an intraclass correlation is positive. The implemented Bayes factors make use of stretched uniform priors for the intraclass correlations on the interval (-1/(p - 1), 1), where p is the cluster size. This prior is equivalent to a shifted-F prior on the between-cluster variances.
As when testing group variances, the equality part of the Bayes factor of a constrained hypothesis on the intraclass correlations against an unconstrained alternative cannot be expressed as a Savage-Dickey density ratio. The Bayes factor is instead expressed in terms of marginal likelihoods under shifted-F priors, where shifted-F refers to the fact that the conditional draws for the between-cluster variances are drawn from shifted-F priors in the Gibbs sampler. The marginal likelihood is estimated using importance sampling; see Mulder & Fox (2019) for the mathematical details. These Bayes factors can be used for testing the degree of clustering in the data (e.g., testing whether clustering is present among students from different schools), or for testing whether the degree of clustering varies across different cluster categories (e.g., testing the degree of clustering among students from private schools against the degree of clustering among students from public schools). To execute these tests, an object from the lmer function with random intercepts (which may be category specific) is needed. Currently BFpack only supports intraclass correlation testing in the case of equally sized clusters.

Bayes factor computation for data with missing observations
Bayesian (and non-Bayesian) hypothesis testing when the data contain missing observations has not received much attention in the literature. This is quite surprising as missing data are ubiquitous in statistical practice. If the data contain missing observations, listwise deletion is generally not recommended as it results in a loss of information and possible bias (Rubin, 1987, 1996). Multiple imputation is generally the recommended method, in which many imputed data sets are randomly created using an imputation model. The analyses are then performed on all the imputed data sets and averaged in a proper manner (Little & Rubin, 2002).
In the case of model uncertainty, properly handling missing data may become increasingly complex as different imputation models need to be used for computing the marginal likelihoods under the different hypotheses. Hoijtink, Gu, Mulder, & Rosseel (2018) however showed that the computation can be considerably simplified for specific Bayes factors and testing problems. This is also the case for Bayes factors that can be expressed as the extended Savage-Dickey density ratio in Equation 5. The reason is that the four key quantities (i.e., the measures of relative fit and relative complexity for the equality and order constraints) are all computed under the same unconstrained model. Therefore we only need an unbiased estimate of the unconstrained posterior (and possibly of the unconstrained default prior in the case of a data-based prior), and can use this to estimate the four key quantities. If we write a complete data matrix Y as a data matrix containing only the observed values, Y o , and a data matrix containing only the missing values, Y m , the relative fit of the equality constraints can be computed by integrating the unconstrained posterior density at the constrained values over the distribution of Y m given Y o , which can be estimated by averaging this density across multiply imputed data sets (and analogously for the other three key quantities).

Bayes factor testing using the package
The Bayes factors described in the previous section can be executed by calling the function BF. The function has the following arguments:
• x, a fitted model object that is obtained using an R function. An overview of the R functions that are currently supported can be found in Table 1.
• hypothesis, a string that specifies the hypotheses with equality and/or order constraints on the parameters of interest.
-The parameter names are based on the names of the estimated effects. Thus, if the coefficients in a fitted lm object have the names weight, height, and length, then the constraints in the hypothesis argument should be formulated on these names.
-Constraints within a hypothesis are separated with an ampersand '&'. Hypotheses are separated using a semicolon ';'. For example, hypothesis = "weight > height & height > 0; weight = height = 0" implies that the first hypothesis assumes that the effect of weight is larger than the effect of height and that the effect of height is positive, and the second hypothesis assumes that the two effects are equal to zero. Note that the first hypothesis could equivalently have been written as weight > height > 0.
-Brackets, '(' and ')', can be used to combine constraints on multiple parameters. For example, hypothesis = "(weight, height, length) > 0" denotes a hypothesis where the effects of weight, height, and length are all positive. This could equivalently have been written as hypothesis = "weight > 0 & height > 0 & length > 0".
-If the subspaces under the hypotheses do not cover the complete parameter space, a complement hypothesis is automatically added. For example, if an equality hypothesis and an order hypothesis are formulated, say, hypothesis = "weight = height = length; weight > height > length", the complement hypothesis covers the remaining subspace where neither "weight = height = length" holds, nor "weight > height > length" holds.
-In general we recommend not specifying order hypotheses that are nested, such as hypothesis = "weight > height > length; weight > (height, length)", where the first hypothesis (which assumes that the effect of weight is larger than the effect of height, and the effect of height is larger than the effect of length) is nested in the second hypothesis (which assumes that the effect of weight is largest but no constraints are specified between the effects of height and length). The reason is that the Bayes factor for the simpler hypothesis against the more complex hypothesis would be bounded. Therefore the scale of the Bayes factor would become more difficult to interpret, and the evidence could not accumulate to infinity for the true hypothesis if the true hypothesis were the smaller order hypothesis (e.g., see Mulder et al., 2010). If however a researcher has theoretical reasons to formulate nested order hypotheses, these can of course be formulated and tested using the BF function.
-The default setting is hypothesis = NULL, which only gives the output of the exploratory tests of whether each parameter is zero, negative, or positive when assuming equal prior probabilities, e.g., hypothesis = "weight = 0; weight < 0; weight > 0", for the effect of weight. These exploratory tests are also executed when a confirmatory test is requested via the hypothesis argument.
-When testing hypotheses on variance components (Section 2.3), only simple constraints are allowed in which a parameter is equal to, greater than, or smaller than another parameter. When testing intraclass correlations, an intraclass correlation can also be compared to 0 under a hypothesis.
• prior, a numeric vector of prior probabilities of the hypotheses. The default setting is prior = NULL which specifies equal prior probabilities.
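To make the argument syntax concrete, the following minimal sketch uses the built-in mtcars data rather than one of the package's own example data sets (the coefficient names wt and hp come from the fitted lm object; the hypotheses themselves are purely illustrative):

```r
library(BFpack)

# Fit a linear model; the coefficient names (wt, hp) are the names that
# must be used in the hypothesis argument.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two competing hypotheses separated by ';'; constraints within a
# hypothesis are separated by '&'. A complement hypothesis is added
# automatically because the two hypotheses do not cover the full space.
res <- BF(fit, hypothesis = "wt < 0 & hp < 0; wt = hp = 0")
summary(res)
```

The summary then reports the exploratory tests per coefficient as well as the confirmatory test of the two formulated hypotheses and their complement.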
In case the class of the fitted model x is not supported, BF.default() is called, which executes the approximate Bayes factor test (sixth column of Table 1). In this case, the following (additional) arguments are required:
• x, a named numeric vector of the estimates (e.g., MLEs) of the parameters of interest, where the labels are equal to the names of the parameters used in the hypothesis argument.
• Sigma, the approximate posterior covariance matrix (e.g., error covariance matrix) of the parameters of interest.
• n, the sample size that was used to acquire the estimates and covariance matrix.
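When only summary statistics are available, the default method can thus be called directly. The sketch below uses made-up numbers; the estimates, covariance matrix, and sample size are illustrative and not taken from any real analysis:

```r
library(BFpack)

# Named estimates (e.g., MLEs); the names are used in the hypothesis argument.
est <- c(weight = 0.31, height = 0.11)

# Approximate posterior (error) covariance matrix of the estimates.
Sig <- matrix(c(0.040, 0.002,
                0.002, 0.036), nrow = 2)

# n is the sample size used to acquire the estimates and covariance matrix.
res <- BF(est, Sigma = Sig, n = 100,
          hypothesis = "weight > height > 0; weight = height = 0")
summary(res)
```

This route makes the approximate Bayes factor available for any model for which estimates and an error covariance matrix can be obtained.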
The output is an object of class BF. When printing an object of class BF via the print() function, the posterior probabilities of the hypotheses under evaluation are provided or, in the case hypothesis = NULL, the posterior probabilities of the exploratory tests of whether each parameter is zero, negative, or positive. The summary() function shows the results of the exploratory tests and, if hypotheses are specified in the hypothesis argument, the results of the confirmatory tests: the posterior probabilities of the hypotheses of interest, the evidence matrix which shows the Bayes factor between each pair of hypotheses, a specification table which shows all the measures of relative fit and complexity for the equality and/or order constraints of the hypotheses, and an overview of the hypotheses that are tested.

Applications
This section presents a variety of testing problems that can be executed using BFpack. On https://github.com/cjvanlissa/BFpack_paper, an R Markdown version of the paper can be found to facilitate the reproducibility of the analyses.

Application 1: Bayesian t testing in medical research
The example for a one-sample t test was discussed in Howell (2012, p. 196), and originally presented in Rosa et al. (1998). An experiment was conducted to investigate whether practitioners of the therapeutic touch (a widely used nursing practice) can effectively identify which of their hands is below the experimenter's hand under blinded conditions. Twenty-eight practitioners were involved and each was tested 10 times in the experiment. Researchers expected an average of 5 correct answers from each practitioner, as this is the number expected by chance if the practitioners cannot actually detect the experimenter's hand. In this example, the data are the number of correct answers, from 0 to 10, of the n = 28 practitioners. The null and alternative hypotheses are H 1 : µ = 5 and H 2 : µ > 5, where µ is the population mean. If H 1 : µ = 5 is true, practitioners give correct answers at chance level, whereas H 2 : µ > 5 implies that practitioners do better than expected by random chance. The BF function automatically adds the complement hypothesis, H 3 : µ < 5, which would imply that practitioners do worse than expected by chance.
As there is virtually no prior belief that H 3 may be true, and we (for this example) assume that the hypotheses of interest, H 1 and H 2 , are equally likely a priori, we set the prior probabilities for H 1 , H 2 , and H 3 in the confirmatory test to 0.5, 0.5, and 0, respectively, using the prior argument. Hypotheses H 1 : µ = 5 versus H 2 : µ > 5 are tested using the frequentist t test function t_test from the R package bain and the Bayesian t test function BF from the R package BFpack.

R> install.packages("BFpack")
R> library(BFpack)
R> install.packages("bain")
R> library(bain)
R> ttest1 <- t_test(therapeutic, alternative = "greater", mu = 5)
R> print(ttest1)
R> BF1 <- BF(ttest1, hypothesis = "mu = 5; mu > 5", prior = c(.5, .5, 0))
R> summary(BF1)

The first four lines install and load the packages BFpack and bain. The t_test function then renders the classical right one-sided t test and stores the result in the object ttest1, which contains the t value, degrees of freedom, and p value, as well as the 95% confidence interval:

data:  therapeutic
t = -1.9318, df = 27, p-value = 0.968
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
 3.857523      Inf
sample estimates:
mean of x
 4.392857

The p value indicates that there is no reason to reject H 1 against the right one-tailed alternative using traditional choices for the significance level. This however does not imply that there is evidence in favor of H 1 , as significance tests cannot be used for quantifying evidence for a null hypothesis. Next, the object ttest1 is used as the input for the BF function for a Bayesian t test. Specifically, BF(ttest1, hypothesis = "mu = 5; mu > 5", prior = c(.5, .5, 0)) executes a confirmatory Bayes factor test where the hypothesis H 1 : µ = 5 is tested against H 2 : µ > 5. The complement hypothesis, H 3 : µ < 5 in this case, is automatically added.
The argument prior = c(.5, .5, 0) indicates that the first two hypotheses are assumed to be equally likely a priori and the third hypothesis does not receive any prior support. The call summary(BF1) shows all the results of the exploratory and confirmatory tests. The results of the exploratory tests show that the posterior probabilities of the precise null (µ = 5), a negative effect (µ < 5), and a positive effect (µ > 5) are 0.345, 0.634, and 0.021, respectively, while assuming equal prior probabilities for the three hypotheses, i.e., P (H 1 ) = P (H 2 ) = P (H 3 ) = 1/3. These results show that the presence of a negative effect is most plausible given the observed data, but the evidence is relatively weak as there is still a probability of 0.345 that the precise null is true, and a small probability of 0.021 that there is a positive population effect.
The exploratory test however ignores the researchers' prior expectations that the first two hypotheses were equally likely while there was no reason to believe that the third hypothesis could be true, i.e., P (H 1 ) = P (H 2 ) = 1/2 and P (H 3 ) = 0. Taking these prior probabilities into account, the confirmatory test shows that there is clearly the most evidence that a therapeutic touch effect does not exist (H 1 ), with a posterior probability of 0.943, followed by the hypothesis that a therapeutic touch effect exists (H 2 ), with a posterior probability of 0.057. Furthermore, the Evidence matrix shows that the Bayes factor for H 1 against H 2 equals 16.473, which is equal to the ratio of the (non-rounded) posterior probabilities of the respective hypotheses because equal prior probabilities were specified for these two hypotheses.
Finally, the Specification table shows the measures of relative complexity and relative fit for the constrained hypotheses. The relative complexity of the one-sided hypotheses (H 2 : µ > 5 and H 3 : µ < 5) equals 0.5 (column comp_O), which can be explained by the fact that the implied one-sided subspaces cover half of the unconstrained space. Furthermore, the posterior probability masses in the regions µ > 5 and µ < 5 under the unconstrained model equal 0.032 and 0.968 (column fit_O), respectively, which quantify the relative fit of the one-sided hypotheses. The unconstrained default prior and posterior density at µ = 5 equal 0.195 and 0.205 (columns comp_E and fit_E), which quantify the relative complexity and fit of the precise hypothesis, respectively.

Application 2: 2-way ANOVA to investigate numerical judgement
Janiszewski & Uy (2008) executed several experiments to investigate the numerical judgments of participants. In one of the experiments (referred to as '4a') the outcome variable was the amount by which the price for a television estimated by a participant differed from an anchor price (expressed by means of a z score), and the two factors were (1) whether the anchor price was rounded, e.g., $5000, or precise, e.g., $4989 (anchor = rounded or precise, respectively); and (2) whether or not the participants received a suggestion that the estimated price is close to the anchor value (motivation = low or high, respectively). An example of a question, with anchor = rounded and motivation = low, was: "The retail price of a TV is $5000 (rounded). The actual price is only slightly lower than the retail price. Can you guess the price?". Alternatively, by changing '$5000' to '$4989' in the question, a precise anchor price is obtained; by changing 'slightly lower' to 'lower', a question with high motivation is obtained. This 2 × 2 ANOVA design can be tested using BFpack as follows:

R> aov1 <- aov(price ~ anchor * motivation, data = tvprices)
R> BF(aov1)

For an object of class aov, BFpack also provides the Bayes factors for the existence of the main effects and interaction effects in the exploratory tests. The results show clear evidence that there is a main effect of the anchor factor and a main effect of the motivation factor (with posterior probabilities of approximately 1). Furthermore, there is some evidence that an interaction effect between the two factors is present (with a posterior probability of 0.749). More data would need to be collected in order to draw a more decisive conclusion regarding the existence of an interaction. It is interesting to see how the posterior probabilities of the hypotheses from the exploratory test relate to the p values of the classical significance tests of H 0 : β = 0 versus H 1 : β ≠ 0. This can be checked by running:

R> ct <- lmtest::coeftest(aov1)

This results in p values of 0.66697, 1.842e−10, 1.410e−06, and 0.01115 for (Intercept), anchorrounded, motivationlow, and anchorrounded:motivationlow, respectively. Thus, when using a significance level of 0.05, we would conclude that there is enough evidence in the data to conclude that an interaction effect is present. The Bayesian test, on the other hand, suggests that we have to be more cautious as there is still a posterior probability of 0.144 that no interaction effect is present. This illustrates that Bayes factor tests are generally more conservative than classical significance tests, which is one of the reasons for the recent recommendations to use smaller significance levels than the traditional choice of 0.05 (Benjamin et al., 2018).

Application 3: Testing group variances in neuropsychology
Silverstein et al. (1995) conducted a psychological study to compare the attentional performances of 17 Tourette's syndrome (TS) patients, 17 ADHD patients, and 17 control subjects who did not suffer from TS or ADHD. The participants were shown a total of 120 sequences of either 3 or 12 letters. Each sequence contained either the letter T or the letter F at a random position. Each sequence was presented for 55 milliseconds, after which the participants had to indicate as quickly as possible whether the shown sequence contained a T or an F. After a participant completed all 120 sequences, his or her accuracy was calculated as the percentage of correct answers. In this section, we are interested in comparing the variances of the accuracies in the three groups. Research has shown that ADHD patients tend to be more variable in their attentional performances than subjects who do not suffer from ADHD (e.g., Kofler et al., 2013; Russell et al., 2006). It is less well documented whether TS patients are less or more variable in their attentional performances than healthy control subjects. We will therefore test the following set of hypotheses to investigate whether TS patients are as variable in their attentional performances as either ADHD patients or healthy controls (C): H 1 : σ 2 C = σ 2 TS < σ 2 ADHD and H 2 : σ 2 C < σ 2 TS = σ 2 ADHD. We will test these hypotheses against the null hypothesis stating equality of variances, H 0 : σ 2 C = σ 2 TS = σ 2 ADHD, as well as the complement of the three aforementioned hypotheses, H 3 : ¬(H 0 ∨ H 1 ∨ H 2 ). We include the complement to safeguard against the data supporting none of H 0 , H 1 , and H 2 . Silverstein et al. (1995) reported the following sample variances of the accuracies in the three groups: s 2 C = 15.52, s 2 TS = 20.07, and s 2 ADHD = 38.81. The data are contained in a dataset called attention. In BFpack, we can conduct the multiple hypothesis test and weigh the evidence in favor of the four hypotheses as follows:
R> bartlett <- bartlett_test(x = attention$accuracy, g = attention$group)
R> hypothesis <- c("Controls = TS < ADHD; Controls < TS = ADHD;
+    Controls = TS = ADHD")
R> set.seed(358)
R> BF_var <- BF(bartlett, hypothesis)

Note that we use equal prior probabilities for the hypotheses by omitting the prior argument in the call to the BF function. The exploratory posterior probabilities for homogeneity of group variances can be obtained by running summary(BF_var). This results in evidence for equality of group variances. Note that the p value of the classical Bartlett test for these data equals 0.1638, which implies that the null hypothesis of homogeneity of variances cannot be rejected using common significance levels, such as 0.05 or 0.01. This p value cannot, however, be used as a measure of the evidence in the data in favor of homogeneity of group variances. This can be done using the proposed Bayes factor test, which shows that the probability that the variances are equal is approximately 0.803. The confirmatory test provides a more detailed analysis of the relative plausibility of the hypotheses (also obtained using the summary() call). Thus, H 1 receives the strongest support from the data, but H 2 and H 0 are viable competitors. It appears that even the complement H 3 cannot be ruled out entirely, given its posterior probability of 0.058. To conclude, the results indicate that the TS population is as heterogeneous in its attentional performances as the healthy control population in this specific task, but further research would be required to obtain more conclusive evidence.

Application 4: Multivariate linear regression in fMRI studies
It is well established that the fusiform face area (FFA), located in the inferior temporal cortex of the brain, plays an important role in the recognition of faces. These data come from a study on the association between the thickness of specific cortical layers of the FFA and individual differences in the ability to recognize faces and vehicles (McGuigin et al., under review). High-resolution fMRI was recorded from 13 adult participants, after which the thickness of the superficial, middle, and deep layers of the FFA was quantified for each individual. In addition, individual differences in face and vehicle recognition ability were assessed using a battery of tests.

Analysis of the complete data
In this example, two alternative hypotheses are tested. In a recent study, McGuigin et al. (2016) found that individual differences in the overall thickness of the FFA are negatively correlated with the ability to recognize faces but positively correlated with the ability to recognize cars. Hypothesis H 1 is the most parsimonious extension of these findings. It specifies that the magnitude and direction of the association between object recognition and layer thickness is not moderated by layer. To elaborate, consider a multivariate multiple regression model with the cortical thickness measures for the superficial, middle, and deep layers as three repeated (dependent) measures for each participant, and facial recognition ability and vehicle recognition ability as two predictors. Hypothesis H 1 is a main-effects-only model specifying that main effect terms for face and vehicle recognition are sufficient to predict the thickness of the layers. The absence of layer × face or layer × vehicle interaction terms means that the relations between face and vehicle recognition and thickness are invariant across cortical layers. In other words, this hypothesis specifies that

H 1 : Face_on_Deep = Face_on_Superficial = Face_on_Middle < 0 < Vehicle_on_Deep = Vehicle_on_Superficial = Vehicle_on_Middle.

That is, regression coefficients between face recognition and cortical thickness measures are expected to be negative, coefficients between vehicle recognition and cortical thickness measures are expected to be positive, and no layer-specific effect is expected for either faces or vehicles.
Hypothesis H 2 is based on prior findings concerning the early development of facial recognition abilities and the more rapid development of the deep layer of the FFA. This evidence leads to the following hypothesis:

H 2 : Face_on_Deep < Face_on_Superficial = Face_on_Middle < 0 < Vehicle_on_Deep = Vehicle_on_Superficial = Vehicle_on_Middle.

That is, the negative effect between facial recognition and cortical thickness would be more pronounced in the deep layer, relative to the superficial and middle layers. One could attempt to test and compare these two hypotheses using linear mixed effects models software (e.g., the gls function in the nlme package in R) with an appropriate covariance structure on the residuals to account for within-subject dependence. Alternatively, one could use a model selection framework like that embodied in the BayesFactor package in R. Unfortunately, while these approaches can test some components of each hypothesis, they are not well suited to test the directional component of H 1 , which specifies that all coefficients involving faces are smaller than 0 and that all coefficients involving vehicles are larger than 0. This hypothesis can, however, be tested using BFpack in the following way:

R> fmri.lm <- lm(cbind(Superficial, Middle, Deep) ~ Face + Vehicle,
+    data = fmri)
R> constraints.fmri <- "Face_on_Deep = Face_on_Superficial = Face_on_Middle
+    < 0 < Vehicle_on_Deep = Vehicle_on_Superficial = Vehicle_on_Middle;
+    Face_on_Deep < Face_on_Superficial = Face_on_Middle < 0 <
+    Vehicle_on_Deep = Vehicle_on_Superficial = Vehicle_on_Middle"
R> set.seed(123)
R> BF_fmri <- BF(fmri.lm, hypothesis = constraints.fmri)
R> summary(BF_fmri)

This results in the following posterior probabilities and evidence matrix. In this analysis, hypothesis H 3 is the complement hypothesis. The evidence matrix reveals that there is clear evidence for H 2 against H 1 (B 21 = 42.391) and extreme evidence for H 2 against H 3 (B 23 = 565.93). The same conclusion can be drawn from the posterior probabilities of the hypotheses.
Based on these results we would conclude that hypothesis H 2 receives the most evidence, and the Bayesian probability of drawing the wrong conclusion after observing the data is relatively small, namely 0.025.

Analysis with missing observations
Here we illustrate how a Bayes factor test can be executed when observations in the fMRI data set are missing at random. A slightly simpler hypothesis test is considered to reduce the computation time:

R> constraints.fmri2 <-
+   "Face_on_Deep = Face_on_Superficial = Face_on_Middle < 0;
+    Face_on_Deep < Face_on_Superficial = Face_on_Middle < 0"

First the Bayes factors and posterior probabilities are obtained for this hypothesis test on the complete data set:

R> fmri.lm2 <- lm(cbind(Superficial, Middle, Deep) ~ Face +
+    Vehicle, data = fmri)
R> BF.fmri2 <- BF(fmri.lm2, hypothesis = constraints.fmri2)

This results in posterior probabilities of 0.050, 0.927, and 0.023 for the two constrained hypotheses and the complement hypothesis, respectively. The Bayes factor of the most supported hypothesis (H 2 ) against the second most supported hypothesis (H 1 ) equals B 21 = 18.443. Now 10 missing observations (out of 65 separate observations in total) are randomly created such that they are missing at random:

R> fmri_missing <- fmri
R> set.seed(1234)
R> for(i in 1:10){
+   fmri_missing[sample(1:nrow(fmri), 1), sample(1:ncol(fmri), 1)] <- NA
+ }

This results in 7 rows with at least one missing observation. Listwise deletion would therefore leave us with only 6 complete observations (of the 13 rows in total). Even though listwise deletion is generally not recommended (Rubin, 1987, 1996), for this illustration we compute the Bayes factors and posterior probabilities based on these 6 complete observations to illustrate the loss of evidence that results from listwise deletion.
R> fmri_listdel <- fmri_missing[!is.na(apply(fmri_missing, 1, sum)), ]
R> fmri.lm2_listdel <- lm(cbind(Superficial, Middle, Deep) ~ Face + Vehicle,
+    data = fmri_listdel)
R> BF.fmri2_listdel <- BF(fmri.lm2_listdel, hypothesis = constraints.fmri2)
R> print(BF.fmri2_listdel)

This results in posterior probabilities of 0.010, 0.820, and 0.170 for the two constrained hypotheses and the complement hypothesis, respectively. As expected, the evidence for hypothesis H 2 , which received the most evidence based on the complete data set, decreased from 0.927 to 0.820.
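The listwise-deletion analysis above is shown only for illustration. A multiple-imputation approach in the spirit of the procedure described earlier could be sketched as follows; the mice calls are standard, but the averaging step and the BFtable_confirmatory accessor are assumptions that may differ across BFpack versions:

```r
# Sketch only: estimate the four key quantities (relative fit and complexity
# of the equality and order constraints) under the unconstrained model by
# averaging across imputed data sets.
library(mice)     # multiple imputation
library(BFpack)

M <- 50
imp <- mice(fmri_missing, m = M, method = "norm", print = FALSE)

tables <- lapply(seq_len(M), function(i) {
  d <- complete(imp, i)                      # i-th imputed data set
  fit_i <- lm(cbind(Superficial, Middle, Deep) ~ Face + Vehicle, data = d)
  # Assumed accessor for the fit/complexity table of a BF object:
  BF(fit_i, hypothesis = constraints.fmri2)$BFtable_confirmatory
})

# Average the fit/complexity measures over imputations; the Bayes factors
# can then be recomputed from these averaged quantities.
avg_table <- Reduce(`+`, tables) / M
```

Averaging the four key quantities (rather than the final Bayes factors) follows the logic of the extended Savage-Dickey representation described in the missing-data section above.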

Application 5: Logistic regression in forensic psychology
The presence of systematic biases in the legal system runs counter to society's expectation of fairness. Moreover, such biases can have profound personal ramifications, and the topic therefore warrants close scrutiny. Wilson & Rule (2015) examined the correlation between perceived facial trustworthiness and criminal-sentencing outcomes (data available at https://osf.io/7mazn/). In Study 1, photos of inmates who had been sentenced to death (or not) were rated by different groups of participants on trustworthiness, 'Afrocentricity' (how stereotypically 'black' the faces were perceived to be), attractiveness, and facial maturity. Each photo was also coded for the presence of glasses/tattoos and for facial width-to-height ratio. A logistic regression with sentencing as outcome was fitted to the predictors.
Previous research had shown that the facial width-to-height ratio (fWHR) has a positive effect on perceived aggression and may thus also have a positive effect on sentencing outcomes. In addition, perceived Afrocentricity had been shown to be associated with harsher sentences (Wilson & Rule, 2015). In the first hypothesis it was expected that all three predictors have a positive effect on the probability of being sentenced to death; additionally, we might expect lack of perceived trustworthiness to have the largest effect. In the second hypothesis it was assumed that only (lack of) trustworthiness has a positive effect. Finally, the complement hypothesis was considered. The hypotheses can then be summarized as H 1 : ztrust > (zfWHR, zAfro) > 0, H 2 : ztrust > zfWHR = zAfro = 0, and the complement H 3 : neither H 1 nor H 2 . Before fitting the logistic regression we reverse-coded the trustworthiness scale and standardized it to be able to compare the magnitudes of the three effects. We can then test these hypotheses using BFpack and the fitted glm object from R. Note that the fitted object also contains covariates. The full logistic regression model was first fitted, and the above hypotheses were then tested on the fitted glm object:

R> fit <- glm(sent ~ ztrust + zfWHR + zAfro + glasses + attract +
+    maturity + tattoos, family = binomial(), data = wilson)
R> set.seed(123)
R> BF_glm <- BF(fit, hypothesis = "ztrust > (zfWHR, zAfro) > 0;
+    ztrust > zfWHR = zAfro = 0")
R> summary(BF_glm)

In the output we see little support for the first two hypotheses; the complement receives the most support.

Based on these results we see that the complement receives the most evidence. The fact that neither of the two anticipated hypotheses was supported by the data indicates that the theory is not yet well developed. Closer inspection of the beta coefficients reveals that this result is largely driven by the negative effect of perceived Afrocentricity on sentencing harshness (β zAfro = −0.18071). This unexpected result is discussed further by Wilson & Rule (2015).

Application 6: Testing measures of association in neuropsychology
Schizophrenia is often conceptualized as a disorder of "dysconnectivity" characterized by disruption in neural circuits that connect different regions of the brain (e.g., Friston & Frith, 1995). This data set (originally collected by Ichinose, Han, Polyn, Park, and Tomarken (2019), and summarized in Tomarken & Mulder, in preparation) can be used to test whether such dysconnection is manifested behaviorally as weaker correlations among measures that we would expect to be highly correlated among non-schizophrenic individuals. Twenty patients suffering from schizophrenia (the SZ group) and 20 healthy control participants (the HC group) were administered six measures of working memory. Ichinose et al. hypothesized that each of the 15 correlations among these measures would be smaller in the schizophrenic group than in the control group. This data set is an interesting case of how an order-constrained Bayesian approach can provide a more powerful and more appropriate test than alternative methods. Table 3 presents the Pearson correlations for the two groups. Several features are notable: (1) each of the 15 correlations is higher in the HC group than in the SZ group; (2) on average, the correlations in the HC group are rather high (0.59 on average); and (3) the average correlation within the SZ group is essentially 0. Despite this clear pattern, there were significant differences between the HC and SZ groups on only 2 of the 15 correlations when the false discovery rate was used to control for multiple testing. Given that the overall pattern of group differences is consistent with the hypothesis, simultaneous testing procedures would appear to be a better approach than tests on individual correlations. Indeed, both maximum likelihood and resampling tests convincingly indicated that the covariance and correlation matrices differ across groups (p < 0.01). However, there are a number of ways in which two correlation or covariance matrices may differ.
Thus, the conventional procedures for comparing matrices do not test the specific hypothesis that, for each of the 15 correlations, the value for the HC group is greater than the value for the SZ group.
This hypothesis can, however, be tested in a straightforward manner using BFpack. H1 specifies that each correlation in the HC group is larger than the corresponding correlation in the SZ group (i.e., a total of 15 order constraints is imposed). HA represents any pattern of correlations other than those consistent with H1. The R syntax is as follows:

R> lm6 <- lm(cbind(Im, Del, Wmn, Cat, Fas, Rat) ~ -1 + Group,
+    data = memory)
R> set.seed(123)
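The 15 order constraints need not be typed by hand; the hypothesis string for the BF call can be assembled programmatically. The sketch below is illustrative only: the parameter-name pattern (e.g., "Del_with_Im_in_GroupHC") is an assumption about how BFpack labels group-specific correlations, and the actual names should be checked in the fitted object's output before use.

```r
# Build the 15 order constraints "rho_HC > rho_SZ", one for each pair of
# the six working-memory measures. The "<v2>_with_<v1>_in_Group<g>" naming
# pattern is assumed, not guaranteed; verify it against BFpack's output.
vars <- c("Im", "Del", "Wmn", "Cat", "Fas", "Rat")
pairs <- combn(vars, 2)                  # 2 x 15 matrix of measure pairs
constraints <- apply(pairs, 2, function(p)
  paste0(p[2], "_with_", p[1], "_in_GroupHC > ",
         p[2], "_with_", p[1], "_in_GroupSZ"))
h1 <- paste(constraints, collapse = " & ")
# BF1 <- BF(lm6, hypothesis = h1)       # requires BFpack and the memory data
```

Joining the constraints with "&" expresses a single joint hypothesis, so H1 holds only when all 15 inequalities hold simultaneously.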

Application 7: Testing intraclass correlations in educational testing
Data from the Trends in International Mathematics and Science Study (TIMSS; http://www.iea.nl/timss) were used to examine differences in the intraclass correlations of four countries (The Netherlands (NL), Croatia (HR), Germany (DE), and Denmark (DK)) with respect to the mathematics achievements of fourth graders (the first plausible value was used as the measure of mathematics achievement). The sample design of the TIMSS data set comprises three levels, with students nested within classrooms/schools and classrooms/schools nested within countries (one classroom is sampled per school). In this example, the TIMSS 2011 assessment was considered.
The intraclass correlation was defined as the correlation among the measured mathematics achievements of grade-4 students attending the same school. This intraclass correlation was assumed to be homogeneous across schools in the same country, but was allowed to differ across countries. For the four countries, differences in intraclass correlations were tested using the Bayes factor. The size of the intraclass correlation can be of specific interest, since sampling becomes less efficient as the intraclass correlation increases. Countries with low intraclass correlations have fewer restrictions on the sample design, whereas countries with high intraclass correlations require more efficient sample designs, larger sample sizes, or both. Knowledge about the size of this heterogeneity provides useful information for optimizing the development of a suitable sample design and minimizing the effects of high intraclass correlations.
The TIMSS data sample in BFpack consists of four countries, with data from The Netherlands (93, 112), Croatia (139, 106), Germany (179, 170), and Denmark (166, 153), where the sampled numbers of schools are given in parentheses for 2011 and 2015, respectively. Differences in intraclass correlations were tested conditional on several student variables (e.g., gender, the student sampling weight variable). The following hypotheses on the intraclass correlations were considered in the analyses: country-ordered intraclass correlations were specified by hypothesis H1, equal (invariant) intraclass correlations were represented by hypothesis H2, and their complement was specified as hypothesis H3:

H1: ρNL < ρHR < ρDE < ρDK
H2: ρNL = ρHR = ρDE = ρDK
H3: neither H1 nor H2.
The ordering of the intraclass correlations was hypothesized by considering the reported standard errors of the country-mean scores. From the variance inflation factor, 1 + (p − 1)ρ, with p the number of students in each school (balanced design), it follows that the variance of the mean increases with the intraclass correlation coefficient. As a result, the ordering of the estimated standard errors of the average mathematics achievements of fourth graders across the cycles from 2003 to 2015 was used to hypothesize the ordering of the intraclass correlations. From a more substantive perspective, schools in the Netherlands are expected to differ little in their performance (low intraclass correlation), in contrast to Denmark, where school performances may differ considerably (high intraclass correlation).
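To make the sampling-efficiency argument concrete, the design effect implied by the variance inflation factor 1 + (p − 1)ρ can be computed directly. The ρ and p values below are illustrative assumptions, not estimates from the TIMSS data:

```r
# Variance inflation factor (design effect) for a balanced design with
# p students per school and intraclass correlation rho.
design_effect <- function(rho, p) 1 + (p - 1) * rho

# Illustrative (assumed) values: with 20 students per school, a country
# with rho = 0.45 needs a sample more than three times as large as a
# country with rho = 0.10 to reach the same precision for the mean.
design_effect(0.10, 20)  # 2.9
design_effect(0.45, 20)  # 9.55
```

This is why countries with higher intraclass correlations benefit more from larger samples or more efficient stratification.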
A linear mixed effects model was used to obtain (restricted) maximum likelihood estimates of the fixed effects of the student variables and the country means, the four random effects corresponding to the clustering of students in schools in each country, and the measurement error variance, given the 2011 assessment data.
R> library(lme4)
R> timssICC_subset <- timssICC[(timssICC$groupNL11 == 1) +
+    (timssICC$groupHR11 == 1) + (timssICC$groupDE11 == 1) +
+    (timssICC$groupDK11 == 1) > 0, ]
R> outlme1 <- lmer(math ~ -1 + gender + weight + lln +
+    groupNL11 + (0 + groupNL11 | schoolID) +
+    groupHR11 + (0 + groupHR11 | schoolID) +
+    groupDE11 + (0 + groupDE11 | schoolID) +
+    groupDK11 + (0 + groupDK11 | schoolID),
+    data = timssICC_subset)

Here the schoolID factor variable assigns a unique code to each school, and each country-specific group variable (e.g., groupNL11) equals one when a school belongs to that country and zero otherwise. The lmer output object (Bates et al., 2019) was used as input to the BF function for the Bayes factor computation, where hypotheses H1 and H2 were added as arguments in the function call:

R> set.seed(123)
R> BFicc <- BF(outlme1, hypothesis =
+    "groupNL11 < groupHR11 < groupDE11 < groupDK11;
+    groupNL11 = groupHR11 = groupDE11 = groupDK11")

The output object contains the posterior mean and median estimates of the intraclass correlations (obtained via BFicc$estimates), which are presented in Table 4. The REML intraclass correlation estimates, which follow directly from the random effect estimates in the lmer output, are also given for each country. The posterior mean and REML estimates are quite close, and the REML estimates lie between the 2.5% and 97.5% percentile estimates. Running summary(BFicc) gives the results of the exploratory and confirmatory tests. The exploratory tests provide posterior probabilities of whether each intraclass correlation is zero, negative, or positive. Evidence in favor of a negative intraclass correlation would indicate that a multilevel model may not be appropriate for these data (Mulder & Fox, 2019).
As can be seen, the exploratory results indicate that a multilevel model is appropriate for these data:

Bayesian hypothesis test
Type: Exploratory
Object: lmerMod
Parameter: intraclass correlations
Method: Bayes factor based on uniform priors

          icc=0 icc<0 icc>0
groupNL11     0     0     1
groupHR11     0     0     1
groupDE11     0     0     1
groupDK11     0     0     1

Furthermore, the posterior probabilities of the specified hypotheses show how our beliefs regarding the variation of school performance across countries are updated in light of the observed data. The posterior probabilities in the confirmatory test reveal that H1 and H2 are approximately equally plausible (with posterior probabilities of 0.509 and 0.471, respectively), whereas the complement hypothesis H3 is unlikely to be true (with a posterior probability of 0.020). The data thus gave most support to an ordering of the intraclass correlations in which the Netherlands has the smallest intraclass correlation and Denmark the largest, although this evidence is practically equal to the evidence for the equality hypothesis. Efficient sampling strategies are needed in countries with positive intraclass correlations, and countries with higher intraclass correlations will benefit more from efficient stratification strategies.
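Because the three hypotheses were assigned equal prior probabilities, the posterior probabilities reported for the ordered hypothesis (0.509), the equality hypothesis (0.471), and the complement (0.020) translate directly into Bayes factors as simple ratios:

```r
# Under equal prior probabilities, the Bayes factor between two hypotheses
# equals the ratio of their posterior probabilities.
post <- c(order = 0.509, equal = 0.471, complement = 0.020)

BF_order_vs_equal <- unname(post["order"] / post["equal"])
round(BF_order_vs_equal, 2)  # 1.08: only weak evidence for the ordering

unname(post["order"] / post["complement"])  # roughly 25: the complement
                                            # is clearly outcompeted
```

This makes explicit why the ordered and equality hypotheses are practically indistinguishable here while the complement can be dismissed.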

Concluding remarks
The R package BFpack was designed to allow substantive researchers to perform Bayes factor tests via commonly used statistical functions in R, such as lm, aov, hetcor, or glm. Furthermore, by specifying a simple string that captures the hypotheses of interest, users can exploit the flexibility of Bayes factors to simultaneously test multiple hypotheses that may involve equality as well as order constraints on the parameters of interest. This allows users to move beyond traditional null hypothesis (significance) testing. In the near future the package will be extended to include more complex statistical models, such as structural equation models and generalized linear mixed models.