BSL: An R Package for Efficient Parameter Estimation for Simulation-Based Models via Bayesian Synthetic Likelihood

Bayesian synthetic likelihood (BSL) is a popular method for estimating the parameter posterior distribution for complex statistical models and stochastic processes that possess a computationally intractable likelihood function. Instead of evaluating the likelihood, BSL approximates the likelihood of a judiciously chosen summary statistic of the data via model simulation and density estimation. Compared to alternative methods such as approximate Bayesian computation (ABC), BSL requires little tuning and fewer model simulations when the chosen summary statistic is high-dimensional. The original synthetic likelihood relies on a multivariate normal approximation of the intractable likelihood, where the mean and covariance are estimated by simulation. An extension of BSL considers replacing the sample covariance with a penalized covariance estimator to reduce the number of required model simulations. Further, a semi-parametric approach has been developed to relax the normality assumption. In this paper, we present an R package called BSL that amalgamates the aforementioned methods and more into a single, easy-to-use and coherent piece of software. The R package also includes several examples to illustrate how to use the package and demonstrate the utility of the methods.


Introduction
In the Bayesian framework, inference on the parameter θ ∈ Θ ⊆ R^p of a statistical model is carried out using the posterior distribution p(θ|y), where y is the observed data. Bayes' theorem shows that, by incorporating information about θ obtained from the observed data y via the likelihood function p(y|θ), the prior knowledge p(θ) can be updated to provide the posterior distribution

p(θ|y) = p(y|θ)p(θ) / ∫_Θ p(y|θ)p(θ) dθ.

In most applications, the evidence ∫_Θ p(y|θ)p(θ) dθ involves high-dimensional integration and is intractable. Recovery of the posterior distribution often relies on sampling methods, such as Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC; Del Moral, Doucet, and Jasra 2006).
However, in complex models, the likelihood function can be intractable or very expensive to evaluate. The terminology "likelihood-free inference" typically refers to inference techniques that do not require direct evaluation of the likelihood function, but instead rely on model simulations to approximate the likelihood in some way. One successful method is approximate Bayesian computation (ABC; Sisson, Fan, and Beaumont 2018). ABC essentially estimates the intractable likelihood non-parametrically at θ with a simulation x ∼ p(·|θ). The raw dataset is usually reduced to summaries with a carefully chosen summary statistic function S(·) : R^δ → R^d, where δ and d are the dimensions of the raw data and summary statistic, respectively. Denote the observed and simulated summary statistics by s_y = S(y) and s_x = S(x), respectively. The ABC likelihood function is given by

p_ϵ(s_y|θ) = ∫ K_ϵ(ρ(s_y, s_x)) p(x|θ) dx,

where ρ(·, ·) is a discrepancy function that measures the distance between the observed and simulated summary statistics under a certain metric, e.g., the Euclidean distance, and K_ϵ(·) is a kernel weighting function, usually an indicator function I(· < ϵ) for convenience, which links the distance and the tolerance ϵ. Here ϵ is used to trade off the bias and variance of the likelihood estimator, i.e., as ϵ approaches zero the bias reduces but the variance increases. In sampling methods like MCMC, a likelihood estimator with large variance can reduce efficiency by causing the Markov chain to become stuck (Doucet, Pitt, Deligiannidis, and Kohn 2015).
Due to the non-parametric nature of the ABC likelihood estimate, ABC can be very inefficient when the summary statistic is high-dimensional (Blum 2010), which is often referred to as the curse of dimensionality. Furthermore, ABC requires the user to select ρ, K_ϵ and ϵ, and the results can be sensitive to these choices. Wood (2010) proposes using a multivariate normal distribution to approximate the likelihood function; such an approximation is called the synthetic likelihood (SL). We later extend the term to cover not only the multivariate normal distribution but also other reasonable parametric approximations of the likelihood (Drovandi, Pettitt, and Lee 2015, provide a general framework for such methods). Price et al. (2018) analyze SL in the Bayesian framework and name it Bayesian synthetic likelihood (BSL), using MCMC to explore the parameter space. That paper presents extensive empirical results showing that BSL can outperform ABC even when the model summary statistics show small departures from normality. BSL not only scales better with the summary statistic dimension, but also requires less tuning and can be accelerated with parallel computing. Recently, asymptotic properties of BSL have been derived under various assumptions, including a posterior concentration result, and Frazier, Nott, Drovandi, and Kohn (2021) show that the BSL posterior mean is consistent and asymptotically normally distributed.
In the standard BSL approach, the most obvious computational drawback is the need to generate a large number of simulations n to estimate a high-dimensional covariance matrix in the SL. Several strategies have been devised to reduce the number of simulations n required to estimate the synthetic likelihood. An, South, Nott, and Drovandi (2019) propose using the graphical lasso to estimate a penalized covariance and provide an algorithm for selecting the penalty to ensure reasonable mixing when placed inside an MCMC algorithm. Ong, Nott, Tran, Sisson, and Drovandi (2018a) consider the shrinkage estimator of Warton (2008) in the context of variational Bayesian synthetic likelihood to reduce the number of simulations. To improve the efficacy of Warton's shrinkage estimator within BSL, Priddle, Sisson, and Drovandi (in press) employ a whitening transformation of the summary statistic, which can significantly de-correlate the summary statistics so that more shrinkage can be applied without losing much accuracy. Everitt (2017) considers a bootstrap approximation to lower the variance of the SL estimator. Furthermore, the normality assumption of SL has been examined by An, Nott, and Drovandi (2020), who consider a more flexible approximation of the synthetic likelihood using a semi-parametric estimator with a Gaussian copula. A robust BSL method has also been developed that introduces a latent parameter to account for the possibility that the assumed model cannot reproduce all of the observed summary statistics.
There are existing packages in R (R Core Team 2021) for likelihood-free inference. For example, the R packages abc (Csillery, Francois, and Blum 2012) and EasyABC (Jabot, Faure, and Dumoulin 2013) exist for various ABC methods. The R package synlik (Fasiolo and Wood 2021) implements the classic SL method of Wood (2010). The package provides diagnostic tools for the normal synthetic likelihood and incorporates MCMC to find the approximate posterior distribution. However, synlik is limited to only the standard SL approach. Outside of R, the ABCpy (Dutta et al. 2021) package in Python (Van Rossum et al. 2011) includes most of the popular ABC algorithms and the standard SL approach. Further, the ELFI (engine for likelihood-free inference) package (Lintusaari et al. 2018) in Python currently has support for various ABC algorithms and also BOLFI (Bayesian optimization for likelihood-free inference; Gutmann and Corander 2016).
Given the wide applicability of BSL (e.g., Karabatsos 2018; Barbu, Sethuraman, Billig, and Levy 2018), it is important that BSL methods are directly accessible to practitioners. BSL has been used for statistical inference in forest models (Hartig, Dislich, Wiegand, and Huth 2014), cell biology (Drovandi, Grazian, Mengersen, and Robert 2018), cosmology (Leclercq 2018), stochastic differential equation mixed effects models (Picchini and Forman 2019) and time-censored aggregate data (Chen, Ye, and Zhai 2020). In this paper we introduce our BSL R package (An, South, and Drovandi 2022), which implements the Bayesian version of synthetic likelihood and many of the extensions listed earlier, together with additional functionality detailed later in the paper. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=BSL. It is flexible in terms of the prior specification, the implementation of the model simulation function and the choice of the summary statistic function. Further, it includes several built-in examples to help the user learn how to use the package.
The rest of this paper is structured as follows. Section 2 provides a brief description of the statistical background for SL and several extensions based on unbiased estimators, semiparametric approaches, robustness to misspecification and shrinkage approaches. In Sections 3 and 4, we introduce the main functionalities of BSL with an illustrative toy example and a real example, respectively. Section 5 concludes the paper with a summary and further discussion.

Bayesian synthetic likelihood
Following the notation in Section 1, we focus on reviewing four SL estimators of p(s_y|θ) and two shrinkage covariance estimation methods in Sections 2.1 to 2.5. We briefly introduce other implementation details in Section 2.6.
The SL estimator can be viewed as an auxiliary likelihood function p_A(s_y|ϕ), where the subscript "A" denotes that we are using a parametric density approximation and ϕ is the parameter of this parametric family of densities. To link this auxiliary likelihood to the actual likelihood, there is a functional relationship between ϕ and θ, which may be denoted as ϕ(θ); however, we drop the dependence on θ for notational convenience. The SL posterior is given by

p_A(θ|s_y) ∝ p_A(s_y|ϕ(θ)) p(θ).

Unfortunately, the mapping from θ to ϕ is typically unknown. However, ϕ can be estimated with simulations. The estimated synthetic likelihood is placed within an MCMC algorithm to sample from the corresponding approximate posterior of θ. Below we describe the synthetic likelihood estimators supported by our package, which amount to choosing the form of p_A and the type of estimator of ϕ.

The standard BSL likelihood estimator
Assume we have obtained a collection of n simulations from the model at a proposed parameter value θ, i.e., x_1, . . . , x_n iid∼ p(·|θ). The corresponding summary statistics are denoted s_i = S(x_i) for i = 1, . . . , n. Following Wood (2010), the SL p(s_y|θ) is assumed to be roughly multivariate normal. The classic SL estimator can be written as

p̂_sl(s_y|θ) = N(s_y; µ_n(θ), Σ_n(θ)),    (1)

where ϕ_sl = (µ(θ), Σ(θ)) is estimated with the sample mean and covariance ϕ̂_sl = (µ_n(θ), Σ_n(θ)),

µ_n(θ) = (1/n) ∑_{i=1}^n s_i,   Σ_n(θ) = (1/(n−1)) ∑_{i=1}^n (s_i − µ_n(θ))(s_i − µ_n(θ))^⊤.    (2)

Although ϕ̂_sl is an unbiased estimator of ϕ_sl, N(s_y; ϕ̂_sl) is not an unbiased estimator of N(s_y; ϕ_sl). However, empirical results demonstrate that the approximate posterior exhibits very weak dependence on n (Price et al. 2018). Nevertheless, if n is prohibitively small, then there can be significant Monte Carlo error in the MCMC approximation of the BSL target.
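As an illustration of Equations 1 and 2, the log synthetic likelihood can be computed in a few lines of R. The following is a minimal sketch rather than the package's internal implementation; the function name and the matrix ssx of simulated summaries (one row per simulation) are our own illustration.

R> library(mvtnorm)
R> synLogLik <- function(ssy, ssx) {
+    mu <- colMeans(ssx)   # sample mean, Equation 2
+    Sigma <- cov(ssx)     # sample covariance, Equation 2
+    dmvnorm(ssy, mean = mu, sigma = Sigma, log = TRUE)   # Equation 1, log scale
+  }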

An unbiased BSL likelihood estimator
Ghurye and Olkin (1969) provide an unbiased estimator for the multivariate normal density based on independent simulations from it. Price et al. (2018) show the viability of this estimator by using it in place of the standard SL estimator within an MCMC algorithm; the resulting method is denoted uBSL. The estimator requires n > d + 3 for unbiasedness to hold. The estimated auxiliary parameter ϕ̂_go can be written as (µ_n(θ), M_n(θ)), where M_n(θ) = (n − 1)Σ_n(θ), and µ_n(θ) and Σ_n(θ) are defined in Equation 2. The unbiased likelihood estimator is given by

p̂_go(s_y|θ) = (2π)^{−d/2} [c(d, n−2) / (c(d, n−1)(1 − 1/n)^{d/2})] ψ^{(n−d−3)/2} / |M_n(θ)|^{(n−d−2)/2},    (3)

where ψ = |M_n(θ) − (s_y − µ_n(θ))(s_y − µ_n(θ))^⊤/(1 − 1/n)| if this matrix is positive definite and 0 otherwise, and

c(k, v) = 2^{−kv/2} π^{−k(k−1)/4} / ∏_{i=1}^k Γ((v − i + 1)/2).

In spite of the fact that the normality assumption is rarely true in practice, the approximate posterior distributions obtained by uBSL show less dependence on n when compared to BSL (Price et al. 2018).

A semi-parametric BSL likelihood estimator
Although the estimators of Equations 1 and 3 are straightforward to compute, the normality assumption of the summary statistic could be restrictive in some cases. It is thus of interest to consider a more robust likelihood estimator that can handle non-normal summary statistics whilst not sacrificing much on computational efficiency.
Here we briefly describe the semi-parametric likelihood estimator of An et al. (2020), which is aliased as "semiBSL" by the authors. The paper shows examples where BSL fails to produce accurate posterior approximations whilst semiBSL can still produce reasonable posterior estimates. The semiBSL estimator harnesses kernel density estimation (KDE) to nonparametrically estimate the univariate distributions of the summary statistics and uses the Gaussian copula to model the dependence amongst summaries. This semi-parametric estimator provides additional robustness over the multivariate normal assumption and is quick to apply.
The estimator involves two aspects: modeling the marginals and modeling the dependence structure. We first describe KDE for estimating the marginals. Denote by s_i^j, for j = 1, . . . , d, the j-th component of s_i. If the summary statistic is continuous or approximately continuous (e.g., large counts), a KDE of the j-th marginal density is given by

ĝ_j(x) = (1/n) ∑_{i=1}^n K_h(x − s_i^j),    (4)

where K_h(·) is a non-negative kernel function with bandwidth parameter h. See Izenman (1991) for a review and discussion of KDE. The corresponding estimated cumulative distribution function Ĝ_j(·) can be obtained similarly. Various kernel functions can be used; An et al. (2020) select the Gaussian kernel for simplicity and for its unbounded support.
By using KDE, the semiBSL estimator can accommodate non-normal marginal summary statistics. For the second aspect, semiBSL uses a Gaussian copula to model dependence between summaries. The density function of the Gaussian copula is given by

c_R(u) = |R|^{−1/2} exp{ −½ η^⊤ (R^{−1} − I_d) η },

where R is the correlation matrix, u = (u_1, . . . , u_d)^⊤, η = (Φ^{−1}(u_1), . . . , Φ^{−1}(u_d))^⊤, Φ^{−1}(·) is the inverse CDF of a standard normal distribution, and I_d is a d-dimensional identity matrix. The main appeal of the Gaussian copula here is that it is fast to estimate the correlation matrix (see details later). Combining the two aspects, we obtain a tractable likelihood estimator p̂_ssl(s_y|ϕ̂_ssl), where ϕ_ssl = (R, u_1, . . . , u_d, g_1(s_y^1), . . . , g_d(s_y^d)) is estimated by ϕ̂_ssl = (R̂, û_1, . . . , û_d, ĝ_1(s_y^1), . . . , ĝ_d(s_y^d)), with û_j = Ĝ_j(s_y^j) and ĝ_j(s_y^j) given by Equation 4. The full semiBSL density estimator is given by

p̂_ssl(s_y|ϕ̂_ssl) = |R̂|^{−1/2} exp{ −½ η̂_{s_y}^⊤ (R̂^{−1} − I_d) η̂_{s_y} } ∏_{j=1}^d ĝ_j(s_y^j),    (5)

where η̂_{s_y} = (Φ^{−1}(û_1), . . . , Φ^{−1}(û_d))^⊤. Finally, we provide details on obtaining R̂. With a collection of simulations {s_i}_{i=1}^n, it can either be estimated with maximum likelihood estimation or, as An et al. (2020) advocate, with the Gaussian rank correlation (GRC; Boudt, Cornelissen, and Croux 2012), which is known to be more robust. The (i, j)-th entry of the GRC matrix is given by

ρ̂_{ij} = [ ∑_{k=1}^n Φ^{−1}(r(s_k^i)/(n+1)) Φ^{−1}(r(s_k^j)/(n+1)) ] / [ ∑_{k=1}^n Φ^{−1}(k/(n+1))^2 ],

where r(·) : R → A, with A ≡ {1, . . . , n}, is the rank function. Note that here the GRC is used to estimate the correlation matrix in semiBSL. However, our package also permits the use of the GRC in standard BSL to provide more robustness; in that case, a transformation back to a covariance matrix is needed afterwards.
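As an illustration, the GRC matrix can be computed in a few lines of R. This is a minimal sketch (the function name and input are our own illustration), assuming ssx is an n × d matrix of simulated summaries with no ties:

R> grc <- function(ssx) {
+    n <- nrow(ssx)
+    z <- qnorm(apply(ssx, 2, rank) / (n + 1))   # normal scores of the ranks
+    crossprod(z) / sum(qnorm((1:n) / (n + 1))^2)   # GRC matrix
+  }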
We point out that when the data generation process is complicated, most of the computational cost will be spent on model simulations. Thus the computational overhead introduced by the semi-parametric likelihood estimator may be negligible relative to the standard SL estimator.

A robust BSL likelihood estimator for model misspecification
Model misspecification can be particularly harmful to likelihood-free simulation-based approaches, as the data generating process (DGP) is often complicated. Frazier, Robert, and Rousseau (2020) demonstrate that the pseudo-true parameter value is not recovered asymptotically by ABC algorithms under model misspecification. Further, BSL has been shown to produce unreliable inferences when the model is not compatible with the observed summary statistic (compatibility is described in more detail in Definition 1 below). Let P_θ and Π_θ be the probability measures for the likelihood and prior beliefs for some θ ∈ Θ ⊆ R^p, respectively. Let b_0 := plim S(y) and b(θ) := plim S(x) be the probability limits of the observed and simulated summary statistics, respectively.
Definition 1. The model P_θ × Π_θ and summary statistic map S(·) are compatible if b_0 ∈ {b(θ) : θ ∈ Θ}.

The above definition of compatibility essentially means that, asymptotically, the observed summary statistic can be replicated under the assumed DGP at some parameter value with prior support. With the notion of compatibility defined, two different approaches have been proposed, mean adjustment and variance inflation, that help address incompatibility between the model and summary statistics. We refer to these methods in this paper as robust BSL (rBSL). Both approaches define an augmented parameter ζ = (θ, γ), where γ ∈ G is an additional free parameter accounting for mean adjustment or variance inflation, depending on the approach. The prior distribution of ζ can be written as p(θ) × p(γ), where p(γ) is the prior distribution of γ. We use the original authors' prior choices in the BSL package, but refer the readers to the original paper for details. When Definition 1 holds, the posterior for γ converges to its prior, and the inferences on θ are similar to those obtained with standard BSL. When compatibility is not satisfied, the posterior for γ concentrates away from zero, and differs from the prior, for the statistics that the model cannot match. This allows the inferences on θ to be less impacted by statistics with which the model is not compatible. A useful by-product of rBSL, in the presence of incompatibility, is that the variance of the robust synthetic likelihood estimator is much smaller than that of the standard synthetic likelihood estimator, since the mean adjustment and variance inflation help ensure that the observed statistic does not sit as far in the tails of the synthetic likelihood. Consequently, the efficiency of MCMC can be greatly improved with rBSL when compatibility is not met.
The rBSL likelihood is still approximated by a Gaussian distribution, where ϕ_rsl denotes the auxiliary parameter of the chosen approach, i.e., ϕ_rsl-m for the mean adjustment approach and ϕ_rsl-v for the variance inflation approach. For notational convenience, in Section 2.4, let µ and Σ denote the sample mean µ_n(θ) and sample covariance Σ_n(θ) of Equation 2, respectively.

Mean adjustment
The mean adjustment approach (rBSL-M) aims to handle incompatibility by adjusting the simulated mean. Following the notation for the sample mean and covariance in Equation 2, the adjusted mean parameter is defined as

µ̃_n(ζ) = µ + diag(Σ)^{1/2} γ,

where diag(A) represents the diagonal matrix with the same diagonal elements as A. Scaling by diag(Σ)^{1/2} ensures that the components of γ have the same scale. The auxiliary parameter for rBSL-M can be written as ϕ_rsl-m = (µ̃_n(ζ), Σ).
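In R, the adjustment amounts to one line; a sketch, assuming mu, Sigma and gamma hold the quantities above:

R> muTilde <- mu + sqrt(diag(Sigma)) * gamma   # rBSL-M adjusted mean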

Variance inflation
Similarly, the variance inflation approach (rBSL-V) allows the variance of the synthetic likelihood to be inflated. The variance parameter for rBSL-V is given by

Σ̃_n(ζ) = Σ + diag(σ_11 γ_1^2, . . . , σ_dd γ_d^2),

where σ_ij is the (i, j)-th element of Σ. The auxiliary parameter for rBSL-V can be written as ϕ_rsl-v = (µ, Σ̃_n(ζ)).

An accelerated likelihood estimator with shrinkage estimation
We can see from Equations 1 and 5 that both estimators involve estimation of a covariance or correlation matrix. The number of free parameters in the covariance matrix grows quadratically with the number of summary statistics. Thus, for a high-dimensional summary statistic, a large number n of simulations may be required to estimate the likelihood with reasonable precision.
Here we introduce two shrinkage estimators of the covariance matrix that have been shown to be useful in empirical examinations (Ong et al. 2018a; An et al. 2019, 2020). The shrinkage estimators can be applied to BSL and semiBSL. Note that we skip uBSL because the unbiasedness property would be violated by shrinkage estimation. The selection of the penalty parameter is discussed later in Section 2.6. For clarity, we change notation temporarily in Section 2.5 and restore it afterwards. Let x_1, . . . , x_n be samples from a multivariate normal distribution N(µ, Σ), where each sample has dimension d. Let S be the sample covariance and Θ = Σ^{−1} the precision matrix.

Graphical lasso
The graphical lasso (Friedman, Hastie, and Tibshirani 2008) aims to find a sparse precision matrix by maximizing the following l_1-regularized log-likelihood

Θ̂ = argmax_Θ { log |Θ| − tr(SΘ) − λ∥Θ∥_1 + c },    (6)

where c is a constant, ∥·∥_1 is the l_1 norm, and λ is the penalty parameter. The problem is solved with convex optimization. The penalty parameter λ controls the sparsity of Θ, where large λ leads to high sparsity and vice versa. We refer the readers to Friedman et al. (2008) for more details about the graphical lasso. In our package we use the graphical lasso implementation in the R package glasso (Friedman, Hastie, and Tibshirani 2019).

Warton's estimator
Warton (2008) introduces a straightforward shrinkage estimator that is extremely fast to compute but still performs well. Define D̂ as the diagonal matrix of S; the sample correlation matrix can then be computed as Ĉ = D̂^{−1/2} S D̂^{−1/2}. Warton's shrinkage estimator of the correlation matrix is given by

Ĉ_λ = λĈ + (1 − λ)I_d,    (7)

where λ ∈ [0, 1] is the shrinkage parameter and I_d is the identity matrix. The correlation matrix reduces to the identity matrix as λ approaches 0. With a correlation-to-covariance conversion, we obtain Warton's covariance estimator of Σ,

Σ̂_λ = D̂^{1/2} Ĉ_λ D̂^{1/2}.
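A minimal R sketch of Equation 7 and the conversion back to a covariance estimate (the function name is our own illustration):

R> wartonShrinkage <- function(S, lambda) {
+    Dhalf <- diag(sqrt(diag(S)))   # D-hat^{1/2}
+    Clambda <- lambda * cov2cor(S) + (1 - lambda) * diag(nrow(S))   # Equation 7
+    Dhalf %*% Clambda %*% Dhalf    # back to a covariance matrix
+  }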

Warton's estimation with whitening transformations
Motivated by further reducing the number of model simulations, Priddle et al. (in press) propose a de-correlation technique based on whitening transformations to encourage greater shrinkage of the covariance matrix. With the summary statistics substantially de-correlated, Warton's estimator (Equation 7) can then be applied with a large amount of shrinkage to reduce the number of model simulations required.
In some numerical examples, Priddle et al. (in press) find that combining the whitening transformations together with Warton's shrinkage estimator permits the use of a diagonal covariance matrix in the synthetic likelihood without much loss of accuracy compared to standard BSL. The whitening transformation (Kessy, Lewin, and Strimmer 2018) is a linear transformation that aims to de-correlate the random variables (or summary statistics in the context of BSL) such that the whitened covariance matrix could be close to an identity matrix.
A whitening transformation of s using W is s̃ = W s; in the BSL context, s represents the summary statistic. The covariance matrix of the whitened summary statistic is given by

COV(s̃) = W Σ W^⊤,

where W is the whitening matrix. Ideally, upon the above transformation, the covariance matrix of the synthetic likelihood would be close to an identity matrix, which is achieved when W^⊤W = Θ. However, the restriction W^⊤W = Θ still does not identify a unique matrix W. Kessy et al. (2018) outline five natural choices of the whitening matrix. Priddle et al. (in press) demonstrate that the principal component analysis (PCA) whitening method produces the most accurate posterior results when applied in a BSL algorithm. Thus, we take PCA as the default method for estimating the whitening matrix. The implementation of whitening transformations in BSL is through the R package whitening (Strimmer, Jendoubi, Kessy, and Lewin 2019).
Using an eigendecomposition, the covariance matrix is decomposed as Σ = UΛU^⊤, where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. The PCA whitening matrix is given by

W^{PCA} = Λ^{−1/2} U^⊤.

The whitening matrix is determined with a relatively large number of offline simulations at a point estimate of the parameter, and is subsequently fixed during BSL. If the point estimate has reasonable posterior support, then the whitening transformation can be effective in de-correlating the statistic at model parameter values with non-negligible posterior support (Priddle et al. in press).
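A minimal R sketch of the PCA whitening matrix (the function name is our own illustration), assuming Sigma is a covariance estimate obtained from offline simulations:

R> pcaWhiteningMatrix <- function(Sigma) {
+    e <- eigen(Sigma, symmetric = TRUE)
+    diag(1 / sqrt(e$values)) %*% t(e$vectors)   # W = Lambda^{-1/2} U^T
+  }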
Penalty selection

Here we use λ to denote the penalty parameter of both the graphical lasso and Warton's estimator for notational convenience. An et al. (2019) introduce a penalty selection approach for BSLasso (BSL with the graphical lasso); the general approach can also be used with Warton's shrinkage estimator. Our goal is to reduce the value of n required while still maintaining a similar level of noise in the SL approximation. Denote by σ the standard deviation of the penalized log-likelihood estimator that we are aiming for; the value of σ is typically around 1.5, as suggested below in Section 2.6. The algorithm for selecting the penalty parameter is given in Algorithm 1.

Algorithm 1: Procedure to select the penalty value λ for BSL or semiBSL with shrinkage estimation.
Input: the BSL method to be used, either BSL or semiBSL, denoted method; the shrinkage estimation to be used, either graphical lasso or Warton, denoted shrinkage; a parameter value with non-negligible posterior support θ_0; the number of simulations n; a number of potential penalty values, λ_1, . . . , λ_K; the standard deviation of the log-likelihood estimator to aim for, σ; and the number of repeats M.
Output: the selected penalty parameter λ.
for m = 1 to M do
    Simulate x_1, . . . , x_n iid∼ p(·|θ_0) and compute s_i = S(x_i) for i = 1, . . . , n.
    for k = 1 to K do
        if method is BSL then
            Compute the sample mean and the penalized covariance estimate using shrinkage with penalty λ_k from the samples {s_i}_{i=1}^n. Estimate the log SL l_{m,k} = log p̂_sl(s_y|ϕ̂_sl) with Equation 1.
        else (method is semiBSL)
            Compute and penalize the Gaussian rank correlation matrix using shrinkage with penalty λ_k from the samples {s_i}_{i=1}^n, and save it to R̂. Estimate the log SL l_{m,k} = log p̂_ssl(s_y|ϕ̂_ssl) with Equation 5.
        end
    end
end
Compute σ_k as the standard deviation of {l_{m,k}}_{m=1}^M for k = 1, . . . , K.
Choose the λ_k for which the empirical standard deviation σ_k is closest to σ.

Incorporating BSL with MCMC
Once we have selected a likelihood function estimator, we can use a Bayesian sampling algorithm to sample from the BSL approximation to the posterior p(θ|s_y). This is currently achieved using MCMC in our package. Pseudo-code for the MCMC BSL algorithm (except rBSL) is provided in Algorithm 2. The rBSL estimator of Section 2.4 operates on the augmented parameter space Θ × G, and a component-wise MCMC algorithm is used to update θ and γ alternately. The update of θ is the same as in Algorithm 2, and in each iteration γ is updated with a slice sampler (Neal 2003). Price et al. (2018) demonstrate empirically that the BSL posterior depends weakly on the number of simulations n used to estimate the synthetic likelihood at each iteration of MCMC. Further, it has been shown, under some conditions, that the BSL posterior can achieve correct frequentist uncertainty quantification asymptotically with n growing at any arbitrarily slow rate as the sample size increases. Therefore, as recommended by Price et al. (2018), we suggest tuning n to maximize computational efficiency, in a similar way to that suggested in the pseudo-marginal MCMC literature (Andrieu and Roberts 2009). In pseudo-marginal MCMC, an intractable likelihood is replaced with a non-negative unbiased estimator based on some form of Monte Carlo using N "particles"; for example, using a particle filter with N particles to obtain an unbiased estimator of the likelihood for state space models (Andrieu, Doucet, and Holenstein 2010). Increasing N reduces the variance of the likelihood estimator, which leads to a more statistically efficient MCMC algorithm; however, the cost per MCMC iteration increases. On the other hand, taking N too small leads to poor statistical efficiency of MCMC, as the chain often gets stuck for long periods at a greatly overestimated value of the likelihood. Doucet et al. (2015) suggest choosing N so that the estimated log-likelihood has a standard deviation of roughly 1 when evaluated at a parameter value with reasonable posterior support. In the context of standard BSL, Price et al. (2018) demonstrate empirically that aiming for a standard deviation of the synthetic log-likelihood estimator of around 1.5-2 at a representative parameter value provides the most computationally efficient results.
As shown in Price et al. (2018), there is another connection between BSL and pseudo-marginal MCMC, specifically related to the uBSL algorithm of Section 2.2. Unlike pseudo-marginal MCMC, the synthetic likelihood estimated from n simulations does not produce an unbiased estimate of the actual synthetic likelihood, i.e., the one in which the relationship between (µ, Σ) and θ is known, or, equivalently, the one that would be obtained if we could perform an infinite number of model simulations. Therefore, as already mentioned, BSL theoretically depends on n. The uBSL algorithm of Section 2.2 can be thought of as a pseudo-marginal MCMC method that targets p(θ|s_y) when the summary statistics follow a multivariate Gaussian distribution, since it uses an exactly unbiased estimator of the multivariate Gaussian density.
We recommend that the initial value of the Markov chain has non-negligible support under the BSL posterior. Our experience with MCMC BSL is that if the chain is initialized far in the tails of the BSL posterior, then MCMC BSL exhibits slow convergence. The initial value for the Markov chain may be sourced from experts, previous analyses in the literature or a short run of another likelihood-free algorithm. In low dimensions, using BSL with multiple randomized starts may be helpful in finding a region with non-negligible support.
Algorithm 2: MCMC BSL.
Input: the observed summary statistic s_y; the prior distribution p(θ); the proposal distribution q(·|·); the number of simulations per iteration n; the number of MCMC iterations M; and the initial value of the chain θ^0.
Output: MCMC samples θ^1, . . . , θ^M from the approximate BSL posterior.
Simulate x_1, . . . , x_n iid∼ p(·|θ^0) and compute s_i = S(x_i). Estimate the auxiliary parameter ϕ̂ depending on the SL estimator. Compute the estimated SL p̂_A = p̂_A(s_y|ϕ̂) with Equation 1, 3 or 5.
for i = 1 to M do
    Draw θ* ∼ q(·|θ^{i−1}).
    Simulate x*_1, . . . , x*_n iid∼ p(·|θ*) and compute s*_i = S(x*_i). Estimate the auxiliary parameter ϕ̂* depending on the SL estimator. Compute the estimated SL p̂*_A = p̂_A(s_y|ϕ̂*) with Equation 1, 3 or 5.
    With probability min{1, [p̂*_A p(θ*) q(θ^{i−1}|θ*)] / [p̂_A p(θ^{i−1}) q(θ*|θ^{i−1})]}, set θ^i = θ* and p̂_A = p̂*_A; otherwise set θ^i = θ^{i−1}.
end

Using the BSL package
In this section, we introduce the BSL package with an illustrative example available from the package. In the following section we consider a real example, which is also available in the package. The two main functionalities offered in the BSL package are (1) running an MCMC algorithm with a chosen SL estimator to sample from the approximate posterior distribution, and (2) selecting the penalty parameter for a given SL estimator and shrinkage method. Parallel computing is supported with the R package foreach (Kane, Emerson, and Weston 2013; Microsoft Corporation and Weston 2020b) so that the n independent model simulations can be run on a multi-core computer if desired.

Description of the MA(2) example
Here we consider the MA(2) (moving average of order 2) time series model with parameter θ = (θ_1, θ_2). The parameter space is constrained to {θ : −1 < θ_2 < 1, θ_1 + θ_2 > −1, θ_1 − θ_2 < 1} so that the process is stationary and invertible. The model has two favorable properties as an illustrative example: (1) the likelihood is tractable, and hence sampling from the true posterior is feasible for comparisons, and (2) the data generation process is fast. The model can be described with the following equation

y_t = z_t + θ_1 z_{t−1} + θ_2 z_{t−2},  t = 1, . . . , T,

where z_t iid∼ N(0, 1) for t = −1, 0, . . . , T is white noise. The prior is set to be uniform on the feasible region of the parameter space, and zero elsewhere. We set T = 50 and use the raw data as the summary statistic. Note that the likelihood function is exactly multivariate normal with mean zero, Var(y_t) = 1 + θ_1^2 + θ_2^2, Cov(y_t, y_{t−1}) = θ_1 + θ_1θ_2, Cov(y_t, y_{t−2}) = θ_2, and all other covariances equal to 0. We load the observed data and simulation function with

R> data("ma2", package = "BSL")

The model object
Before running any BSL algorithm, we shall first define the model, or DGP, that will later be used in simulations. The S4 class 'MODEL' contains all the ingredients that constitute a model; namely, a simulation function (fnSim), a summary statistic function (fnSum), a point estimate or initial value of the parameter (theta0) and other optional arguments (for example simArgs and fnLogPrior as given below). The first three must be provided for a valid 'MODEL' object. The validity check is automatic upon calling the creator function newModel. The following command creates a 'MODEL' object for the MA(2) example.

R> ma2Model <- newModel(fnSim = ma2_sim, fnSum = function(x) x,
+    simArgs = list(T = 50), theta0 = c(0.6, 0.2), fnLogPrior = ma2_prior)

*** initialize "MODEL" ***
has simulation function: TRUE
has summary statistics function: TRUE
has initial guess / point estimate of the parameter: TRUE
running a short simulation test ... success
*** end initialize ***

Here we use the built-in simulation function ma2_sim. For simplicity, the summary statistic function returns the raw data without any processing. We set the initial guess of the parameter to theta0 = c(0.6, 0.2), which will also be used as the starting value in MCMC. We can see in the console message that, upon initialization, the newModel creator function automatically checks that the three requisites exist. By default, the creator function also checks that the functions run properly by performing 10 simulations as a short test. The short test can be disabled with the flag test = FALSE if desired.

The simulation function
A fast simulation function is important to the efficiency of BSL methods. The package supports two types of simulation functions. The first type produces multiple simulations with vectorized code; where such vectorization can be implemented, it is preferable since it speeds up the simulation process. However, in most complex applications it will not be possible to vectorize independent simulations, so the second type of simulation function generates a single realization of the model.
The simulation function must be provided by the user. We recommend optimizing the simulation function to avoid any inefficiency. If the DGP is complex and time-consuming, we also encourage users to write their own simulation function in C/C++ and use Rcpp (Eddelbuettel and François 2011) to call the function within R, as the running time of the algorithm is typically dominated by model simulations. For a reference text on writing functions in Rcpp, see Eddelbuettel (2013).

Non-vectorized simulation function.
For a non-vectorized simulation function, the model parameter θ must be supplied as the first argument of the function. Additional arguments can be put after θ if needed. The output of a non-vectorized simulation function is not restricted to any R class, but must be appropriate to pass to the summary statistic function directly. For example, the function below takes two arguments and returns a single simulation result. The additional parameter T determines the length of the time series, and can be supplied as a list in the model creator (simArgs = list(T = 50)).

R> ma2_sim
function(theta, TT) {
    rand <- rnorm(TT + 2)
    y <- rand[3:(TT + 2)] + theta[1] * rand[2:(TT + 1)] + theta[2] * rand[1:TT]
    return(y)
}
<bytecode: 0x55dc695279d8>
<environment: namespace:BSL>

Vectorized simulation function.
The vectorized simulation function is slightly different to the non-vectorized one. It is specified with the argument fnSimVec in the model creator newModel. If a vectorized simulation function is provided, it will override the usage of the non-vectorized function fnSim. The input of fnSimVec must follow the order n, θ, and then any additional arguments. The output can either be a list of the n simulation results or a matrix where each row represents a simulation result. Note that vectorized simulation is incompatible with the parallel computing described later. Below is a vectorized simulation function for the MA(2) model.
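The following is a minimal sketch consistent with the non-vectorized ma2_sim above; the function name ma2_sim_vec is our own illustration, and each row of the returned matrix is one simulated series.

R> ma2_sim_vec <- function(n, theta, TT) {
+    rand <- matrix(rnorm(n * (TT + 2)), nrow = n)   # n independent noise series
+    rand[, 3:(TT + 2)] + theta[1] * rand[, 2:(TT + 1)] + theta[2] * rand[, 1:TT]
+  }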

The summary statistic function
The summary statistic function takes a simulation result as the first input. Similarly to the simulation function, if the summary statistic function requires additional arguments, they can be stored as a list and passed to sumArgs in the model creator function. The output of the summary statistic function must be a d-dimensional vector of summary statistics. For simplicity, we load the identity function that returns the same data as the summary statistic in the MA(2) example.

The prior function
fnLogPrior computes the log density of the prior distribution. If the prior function is not provided, an improper flat prior is used by default; however, in practice, defining a proper prior distribution is always recommended. This function should take only θ as its input argument, and the output must not be +∞. Note that the normalization constant of the prior distribution can be ignored, as we do below for the MA(2) example. Here the prior is uniform on the defined triangular region of θ.
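As a sketch, a log-prior of this form could be written as follows (the exact built-in ma2_prior may differ); the log of a logical condition yields 0 inside the support and -Inf outside:

R> ma2_prior <- function(theta) {
+    log(theta[2] > -1 & theta[2] < 1 & theta[1] + theta[2] > -1 &
+        theta[1] - theta[2] < 1)
+  }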

Shrinkage of the likelihood estimator
Implementation of shrinkage methods described in Section 2.5 can be specified with arguments shrinkage ("glasso" for the graphical lasso estimator and "Warton" for Warton's estimator) and penalty for the penalty value (λ in the graphical lasso or Warton's estimator). Note that the shrinkage takes place on the covariance matrix for method "BSL", and on the correlation matrix (of the Gaussian copula) for method "semiBSL".
Shrinkage should only be used when a relatively large number of entries are near zero in the precision matrix (for the graphical lasso) or the covariance matrix (for Warton's approach). This can be checked by inspecting the inverse correlation matrix or the correlation matrix at a representative parameter value. In the MA(2) example, we are able to calculate the true covariance matrix; for more complex models, the covariance matrix can be estimated with a large number of model simulations at a reasonable parameter value. Here we use the ggcorrplot function from the package ggcorrplot (Kassambara 2019) to visualize the matrices. The visualizations of the correlation matrix and inverse correlation matrix for the MA(2) example are shown in Figure 1.
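A sketch of how such plots can be produced, using the closed-form MA(2) covariance given in Section 3.1 (the banded matrix construction here is our own illustration):

R> library(ggcorrplot)
R> theta <- c(0.6, 0.2)
R> Sigma <- diag(1 + theta[1]^2 + theta[2]^2, 50)   # Var(y_t) on the diagonal
R> Sigma[abs(row(Sigma) - col(Sigma)) == 1] <- theta[1] + theta[1] * theta[2]
R> Sigma[abs(row(Sigma) - col(Sigma)) == 2] <- theta[2]
R> ggcorrplot(cov2cor(Sigma))          # correlation matrix
R> ggcorrplot(cov2cor(solve(Sigma)))   # inverse correlation matrix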
Both figures suggest that the correlation and inverse correlation matrices are sparse. It is also recommended to check the level of sparsity by computing the proportion of partial correlations below a certain threshold. For example, 81% of the partial correlations between the summary statistics of the MA(2) example are below 0.01. Thus, for this example, shrinkage estimation is expected to reduce the number of model simulations required for estimating the synthetic likelihood whilst not sacrificing much posterior accuracy. We describe how to select the penalty parameter in our BSL package in Section 3.5. The following code runs MCMC BSL with the graphical lasso and Warton's shrinkage estimators.
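A sketch of such calls, assuming the penalty values have already been selected (the specific penalty and n values here are placeholders; see Section 3.5):

R> resultMa2Glasso <- bsl(y = ma2$data, n = 300, M = 100000, model = ma2Model,
+    covRandWalk = ma2$cov, method = "BSL", shrinkage = "glasso",
+    penalty = 0.03, verbose = TRUE)
R> resultMa2Warton <- bsl(y = ma2$data, n = 300, M = 100000, model = ma2Model,
+    covRandWalk = ma2$cov, method = "BSL", shrinkage = "Warton",
+    penalty = 0.7, verbose = TRUE)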
The whitening matrix can be estimated with the function estimateWhiteningMatrix prior to running bsl.

R> W <- estimateWhiteningMatrix(20000, ma2Model, method = "PCA",
+    thetaPoint = c(0.6, 0.2))

Supplying the estimated whitening matrix to the bsl function via the argument whitening enables the whitening transformation. Alternatively, simply setting whitening = TRUE automatically estimates a whitening matrix with 1000 model simulations at the parameter value provided in model. The following code runs the standard BSL method with Warton's shrinkage estimator and the whitening transformation.
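A sketch of this call, reusing the whitening matrix W estimated above (the penalty value is a placeholder):

R> resultMa2Whitening <- bsl(y = ma2$data, n = 300, M = 100000, model = ma2Model,
+    covRandWalk = ma2$cov, method = "BSL", shrinkage = "Warton",
+    penalty = 0.8, whitening = W, verbose = TRUE)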

Parameter transformation
If the prior distribution is bounded or a normal random walk cannot explore the parameter space efficiently, parameter transformation is recommended. Recall that p(θ) is the prior distribution for θ ∈ Θ ⊆ R^p. Suppose the parameters are independent and bounded; the prior function can then be decomposed as

p(θ) = ∏_{i=1}^p p(θ_i), with θ_i ∈ [a_i, b_i],

where a_i and b_i are the lower and upper bounds of θ_i. A straightforward but effective one-to-one transformation that maps the range of a parameter to the real line is the logit transformation, given by

θ̃_i = log((θ_i − a_i)/(b_i − θ_i)).

If the parameter is only bounded on one side, the transformation degenerates to a log transformation, i.e.,

θ̃_i = log(θ_i − a_i) if b_i = ∞, or θ̃_i = log(b_i − θ_i) if a_i = −∞.

The bsl function provides an easy-to-use log/logit transformation for independent and bounded parameters. The argument logitTransformBound is a p by 2 matrix containing the lower and upper bounds for each parameter. The only other argument that needs to be changed is covRandWalk, which now specifies the random walk covariance on the transformed scale.
There is no need to do any reparameterization of the prior or simulation functions.
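For the MA(2) example, a sketch of such a call could look as follows, using a bounding box that contains the triangular support (the bounds and the random walk covariance here are illustrative):

R> resultMa2Logit <- bsl(y = ma2$data, n = 500, M = 100000, model = ma2Model,
+    covRandWalk = diag(0.3, 2),   # now on the transformed scale
+    method = "BSL", logitTransformBound = matrix(c(-2, -1, 2, 1), nrow = 2),
+    verbose = TRUE)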
It is also possible to code a customized parameter transformation by editing the simulation function directly. In this case, the prior function should also be changed subject to the reparameterization.

Parallel computation
As a simulation-based method, the computational cost of BSL is mostly driven by the speed of the simulation process. Parallel computation is vital in complex applications where vectorization is not possible. In our BSL package, the n independent model simulations can be distributed across workers on a multi-core machine. The R package doParallel (Microsoft Corporation and Weston 2020a) provides a way to set up the CPU cores for parallel computation. For example, the following code determines all the available cores of a CPU and sets up the clusters for parallel jobs as well as registers the backend for persistent reproducible parallel loops.
R> ncores <- detectCores()
R> cl <- makeCluster(ncores - 1)
R> registerDoParallel(cl)
R> registerDoRNG(1)

This should be run prior to the main bsl function if parallel computing is desired. To enable parallel computing within BSL, set parallel to TRUE in the bsl function. We utilize the function foreach of the package foreach (Microsoft Corporation and Weston 2020b) to run parallel simulations in our BSL package. All other arguments supported by foreach can be passed with parallelArgs in the bsl and selectPenalty functions. For illustration, the following code runs standard MCMC BSL with parallel computing.

R> resultMa2BSLParallel <- bsl(y = ma2$data, n = 500, M = 100000,
+    model = ma2Model, covRandWalk = ma2$cov, method = "BSL",
+    parallel = TRUE, verbose = TRUE)

Note that parallel computing introduces additional communication time between workers. If the model simulation process is straightforward, parallel computing might increase the overall running time rather than reduce it, so it is not recommended for the MA(2) example; vectorization can be more effective in such cases. Once the parallel computation is completed, the following code shuts down the parallel cores.
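The shutdown code itself was not reproduced above; a standard way to do this with the parallel package, reusing the cluster object cl created earlier, is:

R> stopCluster(cl)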

Interpret and visualize the BSL result
The output of the function bsl is saved as an S4 object 'BSL', which includes theta (approximate posterior samples), loglike (the MCMC chain of estimated log-likelihood values), acceptanceRate (acceptance rate of MCMC) for inspection of the Markov chain, as well as several other arguments which help to analyze the result. A full list of the returned values can be checked with help(bsl, package = "BSL"). We provide the basic show, summary and plot methods for the 'BSL' class.
The following code provides some common MCMC diagnostics for the MCMC BSL result of the MA(2) example. The column ESS in the summary output stands for the effective sample size of the approximate posterior samples, as estimated by the R package coda (Plummer, Best, Cowles, and Vines 2006). Here we only show the trace plot of the synthetic log-likelihood estimates (Figure 2); we also recommend other specialized R packages (such as coda, Plummer et al. 2006, and plotMCMC, Magnusson and Stewart 2020) for further visualizations and quantitative diagnostics of MCMC convergence.
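A sketch of these diagnostics using the methods provided for the 'BSL' class (the object name resultMa2BSL is illustrative; the loglike slot is documented above):

R> summary(resultMa2BSL)            # acceptance rate and ESS per parameter
R> plot(resultMa2BSL)               # posterior plots for the 'BSL' object
R> plot(ts(resultMa2BSL@loglike))   # trace of the synthetic log-likelihood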

Selecting the penalty parameter for shrinkage
If shrinkage is desired in estimating the SL, the corresponding penalty parameter value must be selected prior to running the bsl function. This can be done via the selectPenalty function. Multiple choices for the number of simulations n can be tested at the same time; simulations from the largest value of n are re-used for smaller values of n by subsampling. The basic arguments of selectPenalty include the summary statistic of the observed data ssy; a vector of the number of simulations n to test; a list of the candidate penalty values lambda_all corresponding to each n; a point estimate of the parameter theta; the number of repeats M; the target standard deviation sigma; the model of interest model; the SL estimator method and the shrinkage estimation method shrinkage. Example code is given below (Figure 7) for selecting the λ of glasso with the standard BSL estimator for the MA(2) example.
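A sketch of such a call (the candidate penalty grids and other settings here are illustrative placeholders):

R> ns <- c(50, 150, 300)
R> lambdas <- list(exp(seq(-5.5, -1.5, length.out = 20)),
+    exp(seq(-8, -2, length.out = 20)), exp(seq(-9, -3, length.out = 20)))
R> penaltyGlasso <- selectPenalty(ssy = ma2$data, n = ns, lambda_all = lambdas,
+    theta = c(0.6, 0.2), M = 100, sigma = 1.5, model = ma2Model,
+    method = "BSL", shrinkage = "glasso")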

Toad example
In this section, we use the BSL package to find posterior distributions for a complex model fitted to a real dataset. The model was proposed by Marchand, Boenke, and Green (2017) to simulate the movements of an amphibian, Fowler's toad. Marchand et al. (2017) used ABC for parameter inference to bypass the evaluation of the intractable likelihood. The real data are available from the supplementary material of Marchand et al. (2017).

Model description
The model assumes that the toads hide in their refuge sites during the day and only move around to forage at night. At the end of the foraging period, the location of a toad is ∆x away from its previous refuge site. Here ∆x, also called the overnight displacement, follows a Lévy alpha-stable distribution with stability parameter 0 < α ≤ 2 and scale parameter γ > 0. The distribution becomes increasingly heavy-tailed as α falls below 2. At the end of nighttime foraging, the toad either returns to one of its previous refuge sites or builds a new one at its current location.
The probability of return is given by p 0 . Marchand et al. (2017) develop three different return models (random return, nearest return and distance-based return) to describe the return strategy. Here for illustrative purposes, we use the random return model where the refuge site is selected at random from all the previous sites. Multiple visits increase the chance of being selected.
We use the real data provided in the supplementary material of Marchand et al. (2017). The GPS locations are collected in the daytime when the toads are assumed to rest in the refuge sites. For simplicity, the 2D locations are projected to a straight line, so that locations can be represented by scalar values. The observation matrix Y contains the locations of n t = 66 toads over n d = 63 days, i.e., Y is of dimension n t × n d . Roughly 81% of the real observation matrix Y is made up of missing entries. We also set the same entries as NA for new simulations. The parameter of interest is θ = (α, γ, p 0 ) ⊤ . We place a uniform prior over (1, 2) × (0, 100) × (0, 0.9) in this example.
We reduce the observation matrix down to four sets of displacements, for lags of 1, 2, 4 and 8 days. For instance, the displacements for a lag of one day are y_{i,j+1} − y_{i,j} for i = 1, . . . , n_t and j = 1, . . . , n_d − 1. Following Marchand et al. (2017), we classify the displacements as returns and non-returns according to whether the absolute distance of a displacement is less than 10 meters or not. We consider a summary statistic that contains the frequency of returns, the median of the non-returns, and the log differences of adjacent p-quantiles of the non-returns, where p = 0, 0.1, . . . , 1.

Approximate the posterior with BSL
As usual, we need to set up the 'MODEL' object for the toad example. The simulation function toad_sim calls an Rcpp function that uses C++ in the backend. The summary statistic function toad_sum computes the quantile summary statistic just described in Section 4.1.
R> data("toad", package = "BSL") R> toadModel <-newModel(fnSim = toad_sim, fnSum = toad_sum, + theta0 = toad$theta0, fnLogPrior = toad_prior, + simArgs = toad$sim_args_real, + thetaNames = expression(alpha, gamma, p[0])) *** initialize "MODEL" *** has simulation function: TRUE has summary statistics function: TRUE has initial guess / point estimate of the parameter: TRUE running a short simulation test ... success *** end initialize *** The printed messages indicate that model is a valid 'MODEL' object for BSL. However, it is also recommended to look at the density distributions of the summary statistic to detect any possible problems or malfunctions associated with the model. Figure 8 plots the marginal KDE distributions of 1000 simulated summary statistics. We will utilize parallel computation when possible in the toad example. Suppose we have set up the CPU cores for parallel computing as described in the section on parallel computing of Section 3.2.

Discussion
This paper has presented the first comprehensive software package for Bayesian synthetic likelihood methods. The package implements four different types of synthetic likelihood estimators (standard, unbiased, semi-parametric and "robust") and includes functionality for two different types of shrinkage covariance estimators to reduce the number of required model simulations. The package includes functions for extracting useful statistics and visualizations of the results after running bsl. It also includes four built-in examples to illustrate the functionality of the package.
Apart from the MA(2) and toad examples described in Sections 3 and 4, respectively, the package also includes two other examples: a discrete-time stochastic cell biology model, cell, and the multivariate g-and-k quantile distribution, mgnk (Drovandi and Pettitt 2011). The former features a high-dimensional summary statistic, while the latter can involve a large number of parameters. Additional descriptions and example code can be found in the package documentation.
We use multivariate normal random walk MCMC as the sampling method to explore the parameter space for all the current BSL methods in the package, which means that sampling with BSL can be slow when model simulation is computationally intensive, and potentially inefficient when there are a large number of parameters. More sophisticated proposal distributions for MCMC BSL would be an interesting direction for future work. Unfortunately, alternative Monte Carlo schemes like rejection sampling and sequential Monte Carlo (Del Moral et al. 2006) are less appealing in BSL than in ABC, because BSL's normality assumption can fail in regions of very low posterior support. Considering a BSL analogue of the Hamiltonian Monte Carlo (HMC) ABC algorithm of Meeds, Leenders, and Welling (2015) may be another interesting direction for future work. The downside of HMC in the likelihood-free context is that efficiency is reduced by the need to estimate, rather than directly evaluate, the gradient of the log-likelihood. Future research for this package could consider incorporating the variational Bayesian synthetic likelihood methods (Ong et al. 2018a; Ong, Nott, Tran, Sisson, and Drovandi 2018b), which scale to a large number of summary statistics and/or parameters at the expense of assuming a multivariate normal approximation of the posterior.
Another possible extension is to model choice. In the ABC setting, low-dimensional summary statistics for model selection can be produced by classification methods estimated from a large number of (model indicator, parameter, statistic) triples simulated from a reference distribution; ABC model choice then proceeds using the lower dimensional summary statistic produced by the classification method. Extending BSL to the model selection setting requires further research.
We welcome R developers worldwide to help contribute to the BSL package. We have made the source code available at https://github.com/LeahPrice/BSL to allow other researchers to contribute to the BSL package.

Computational details
The results in this paper were obtained using R 4.0.3 with the BSL 3.2.3 package. R itself and all packages used are available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/.