Bayesian State-Space Modelling on High-Performance Hardware Using LibBi

LibBi is a software package for state-space modelling and Bayesian inference on modern computer hardware, including multi-core central processing units (CPUs), many-core graphics processing units (GPUs) and distributed-memory clusters of such devices. The software parses a domain-specific language for model specification, then optimises, generates, compiles and runs code for the given model, inference method and hardware platform. In presenting the software, this work serves as an introduction to state-space models and the specialised methods developed for Bayesian inference with them. The focus is on sequential Monte Carlo (SMC) methods such as the particle filter for state estimation, and the particle Markov chain Monte Carlo (PMCMC) and SMC^2 methods for parameter estimation. All are well-suited to current computer hardware. Two examples are given and developed throughout, one a linear three-element windkessel model of the human arterial system, the other a nonlinear Lorenz '96 model. These are specified in the prescribed modelling language, and LibBi demonstrated by performing inference with them. Empirical results are presented, including a performance comparison of the software with different hardware configurations.


Introduction
State-space models (SSMs) have important applications in the study of physical, chemical and biological processes. Examples are numerous, but include marine biogeochemistry [16,37,17,59], ecological population dynamics [73,57,60,35], Functional Magnetic Resonance Imaging [64,52,53], biochemistry [29,30] and object tracking [71]. They are particularly useful for modelling uncertainties in the parameters, states and observations of such processes, and particularly successful where a rigorous probabilistic treatment of these uncertainties leads to improved scientific insight or risk-informed decision making.
SSMs can be interpreted as a special class within the more general class of Bayesian hierarchical models (BHMs). For a given data set, existing methods for inference with BHMs, such as Gibbs sampling, can be applied. However, specialist machinery for SSMs has been developed to better deal with peculiarities of models in the class. These include nonlinearity, multiple modality, missing closed-form densities and significant correlations between state variables and parameters. Variants of sequential Monte Carlo (SMC) [14] are particularly attractive, including particle Markov chain Monte Carlo (PMCMC) [4] and SMC 2 [8]. These methods have two particularly pragmatic qualities: they admit a wide range of SSMs, including nonlinear and non-Gaussian models, without approximation bias, and they are well-suited to recent, highly parallel, computer architectures [44,55].
Commodity computing hardware has diverged from the monoculture of x86 CPUs in the 1990s to the ecosystem of diverse desktop central processing units (CPUs), mobile processors and specialist graphics processing units (GPUs) of today. Adding to the challenge since 2004 is that Moore's Law, as applied to computing performance, has been upheld not by increasing clock speed, but by broadening parallelism [70]. This should not be understood as a passing fad, nor a deliberate design choice for modern applications, but as a necessity enforced by physical limits, the most critical of which is energy consumption [65]. Thus, while architectures continue to change, their reliance on parallelism is unlikely to, at least in the foreseeable future. The implication for statistical computing is clear: in order to make best use of current and future architectures, algorithms must be parallelised. On this criterion, SMC methods are a good fit.
Given these specialised methods for SSMs, it is appropriate that specialised software be available also. LibBi 1 is such a package. Nominally, the name is a contraction of "Library for Bayesian inference", and pronounced "Libby". Its design goals are accessibility and speed. It accepts SSMs specified in its own domain-specific modelling language, which is parsed and optimised to generate and compile C++ code for the execution of inference methods. The code exploits technologies that include SSE for vector parallelism, OpenMP for multithreaded shared-memory parallelism, MPI for distributed-memory parallelism, and CUDA for GPU parallelism. The user interacts with the package via a command-line interface, with input and output files handled in the standard NetCDF format, based on the highperformance HDF5. This work serves as a brief introduction to SSMs, appropriate Bayesian inference methods for them, the modern computing context, and the consideration of all three in LibBi. The material is presented in three parts: Section 2 introduces SSMs as a special class of BHMs; Section 3 provides specialist methods for inference with the class; Section 4 provides technical information on LibBi itself. Two examples are developed throughout. At the end of Section 2, these examples are developed as SSMs and specified in the LibBi modelling language. At the end of Section 3, posterior distributions are obtained for each model conditioned on simulated data sets, using the methods presented. In Section 4, the performance of LibBi on various hardware platforms is compared. Section 5 summarises results. Section 6 provides information on available supplementary materials to reproduce the example results throughout this work.

State-space models
State-space models (SSMs) are suitable for modelling dynamical systems that have been observed at one or more instances in time. They consist of parameters Θ ∈ R N θ , a latent continuous-or discrete-time state process X(t) ∈ R Nx , and an observed continuous-or discrete-time process Y(t) ∈ R Ny . A starting time is given by t 0 , and an ordered sequence of observation times by t 1 , . . . , t T .
The most general BHM over these random variables admits a joint density of the form: The SSM class can be considered a specialisation of the general BHM class. It imposes the Markov property on the state process X(t), and at each time t i permits the observation Y(t i ) to depend only on parameters Θ and the state X(t i ). Formally, the joint density takes the form: also depicted as a graphical model in Figure 1. The prior density is factored into a parameter density, an initial state density, and a product of one or more transition densities. The likelihood function is given as a product of observation densities. Specific models in the class may, of course, exploit further conditional independencies and so introduce additional hierarchical structure. This is demonstrated in the examples that follow.

Examples
Two example SSMs are introduced here. The first is a linear three-element windkessel model of arterial blood pressure [72], the second a nonlinear eight-dimensional Lorenz '96 model of chaotic atmospheric processes [45]. In each case the original deterministic model is introduced, then stochasticity added, and a prior distribution over parameters specified, to massage it into the SSM framework.
The state-space model, with parameters Θ, latent Markov state process X(t 0:T ) and observation process Y(t 1:T ). Figure 2: Windkessel models represented as circuit diagrams, (left) two-element, and (right) threeelement. P a (t) gives aortal pressure and P p (t) peripheral pressure. F (t) gives blood flow.

Windkessel model
A windkessel model can be used to relate blood pressure and blood flow in the human arterial system [72]. The simplest two-element windkessel [22] is physically inspired: it couples a pump (the heart) and a chamber (the arterial system), with fluid (blood) flowing from the pump to the chamber, and returning via a closed loop. Air in the chamber is compressed according to the volume of fluid (analogous to the compliance of arteries).
The two-element windkessel model is commonly represented as an RC circuit as in Figure 2 (left). Voltage models blood pressure and current blood flow. A resistor models narrowing vessel width in the periphery of the arterial system, and a capacitor the compliance of arteries.
The equations of the two-element windkessel are readily derived from circuit theory applied to Figure  2 (left) [40]: This is a linear differential equation. Assuming a discrete time step ∆t and constant flow F (t) over the interval [t, t + ∆t), it may be solved analytically to give: The two-element windkessel captures blood pressure changes during diastole (heart dilatation) well, but not during systole (heart contraction) [72]. An improvement is the three-element windkessel in Figure  2 (right), which introduces additional impedence from the aortic valve. Again using circuit theory, aortal pressure, P a , may be related to peripheral pressure, P p , by [40]: The three-element windkessel is adopted henceforth. Blood flow, F (t), would usually be measured, but for demonstrative purposes it is simply prescribed the following functional form: where F max = 500 ml s −1 gives maximum flow, T s = 0.3 s is time spent in systole, T d = 0.5 s is time spent in diastole, with mod(x, y) giving the non-integer part of x/y. This models quickly increasing then decreasing blood flow during systole, and no flow during diastole. The function is discretised to time steps of ∆t and held constant in between. An SSM can be constructed around the above equations. Input F (t) is prescribed as above, and a Gaussian noise term ξ(t) of zero mean and variance σ 2 is introduced to extend the state process from deterministic to stochastic behaviour (details below). The SSM has parameters Θ ≡ R, C, Z, σ 2 T , state X(t) ≡ P p (t) and observation Y(t) ≡ P a (t). The time step is fixed to ∆t = 10 −2 s.
The complete model is specified in the LibBi modelling language in Figure 3. The specification begins with a model statement to name the model. It proceeds with the time step size declared on line 5 as a constant value. Following this, the four parameters of the model, R (R), C (C), Z (Z) and sigma2 (σ 2 ) are declared on lines 7-10, the input F (F (t)) on line 11, the noise term xi (ξ(t)) on line 12, the state variable Pp (P p (t)) on line 13, and the observation Pa (P a (t)) on line 14.
Recall (2), which gives the general joint distribution of an SSM. The specific joint distribution for this SSM is: Following the variable declarations in Figure 3 are four blocks, declared using the sub keyword, each with the same name as, and describing the form of, one of the factors in (6). The prior distribution over parameters is given by: described in the parameter block on lines 16-21 of Figure 3. The hyperparameters of these distributions are guided by Kind et al. [40]. The prior distribution over the initial state is given by: where the second argument is the standard deviation. This is described in the initial block on lines 23-25 of Figure 3. The transition model is a stochastic extension of (3). This use of stochasticity might be interpreted as representing some uncertainty in the formulation of the model given the biological phenomena it is meant to represent. The noise term ξ(t) is introduced additively to blood flow, F (t): This form is described in the transition block on lines 27-30 of Figure 3. The observation model is based on (4), with additional noise: P a (t) ∼ N(P p (t) + ZF (t), 2).  This is described in the observation block on lines 32-34 of Figure 3.
One additional block is specified for the windkessel model. This is the proposal_parameter block on lines 36-41. It gives the proposal distribution over parameters that is used during marginal Metropolis-Hastings sampling, described in Section 3.2.1.
Note that the input F does not appear on the left side in any block. Instead, as an input variable, its value at each time is drawn from an input file. This is prepared in advance from (5) as a NetCDF file.

Lorenz '96 model
Lorenz '96 models are useful for the simulation of some important atmospheric processes [45]. They can be challenging to handle, owing to chaotic behaviour in most regimes. An N -dimensional Lorenz '96 model over the vector x ∈ R N is given by the ordinary differential equations (ODEs): where indices into x are taken to be cyclic, so that x n−N ≡ x n ≡ x n+N . F acts as a constant forcing term that induces various behaviours ranging from decay, to periodicity, to chaos. The bifurcation diagram of this deterministic system is given in Figure 4 (left). A deterministic model such as this is inappropriate for the SSM framework. To derive a suitable stochastic model, the ODEs of (7) can be converted to stochastic differential equations (SDEs): Each dW n represents an increment of a standard Wiener process [24, §3.8.1], and σ is a parameter used to scale these. As the diffusion term is additive the SDEs may be interpreted equivalently [42, p157] Figure 4 (right).
There is no analytical solution to integrate either the ODEs or SDEs to obtain a form for X n ; they must be integrated numerically. A classic fourth-order Runge-Kutta with fixed step-size ∆t suffices for this purpose, and was used for the original deterministic model [45]. For the stochastic model, by formally adopting a Stratonovich interpretation of the (8), the SDEs can be converted back to ODEs of the form [74]: where each noise term ∆W n ∼ N(0, √ ∆t) is an increment of the Wiener process over the time step of size ∆t. In this form the classic fourth-order Runge-Kutta can be used again. This final form is that used for the transition model of the SSM.
The SSM consists of two parameters, Θ ≡ (F, σ 2 ) T , along with a state vector x(t) and observation vector y(t), both of length 8. The complete model is specified in the LibBi modelling language in Figure  5.
In Figure 5, a dimension named n, of size 8, is first declared on line 5. It is given a cyclic boundary condition, recalling that, in (7), indices are interpreted cyclically. The two parameters of the model, F (F ) and sigma2 (σ 2 ) are then declared on lines 9-10, the state vector x (x(t)) on line 11, the noise vector deltaW (∆W(t)) on line 12, and the observation vector y (y(t)) on line 13. The square bracket syntax is used to extend the vector variables over the previously declared dimension n. Note that while x(t), ∆W(t) and y(t) all vary in time, there is no explicit declaration for the time dimension in LibBi.
Recall (2), which gives the general joint distribution of an SSM. The specific joint distribution for the Lorenz '96 SSM is: Note the way that decay behaviour at 0 ≤ F ≤ 1 yields to periodic behaviour for F > 1, and eventually to chaotic behaviour. The prescribed prior distribution F ∼ U(8, 12) places the state-space model firmly within the chaotic regime.
The prior distribution over parameters is given by: described in the parameter block on lines 15-18 of Figure 5. The prior distribution over the initial state of x is given by: where the second argument is the standard deviation. This is described in the observation block on lines 31-33 of Figure 5.
Two additional blocks are specified for the Lorenz '96 model. These are the proposal_parameter and proposal_initial blocks on lines 35-38 and 40-42, respectively. These specify the proposal distributions used for Metropolis-Hastings sampling, detailed in Section 3.2.1.

Inference methods
An SSM is constructed as the joint distribution p (Y(t 1:T ), X(t 0:T ), Θ), typically factorised as in (2). The task of Bayesian inference is to condition this joint distribution on some particular data set, Y(t 1:T ) = y(t 1:T ), to obtain the posterior distribution p(X(t 0:T ), Θ|y(t 1:T )).  The posterior distribution may be written: p(X(t 0:T ), Θ|y(t 1:T )) posterior = p (Θ|y(t 1:T )) p (X(t 0:T )|Θ, y(t 1:T )) . (11) Obtaining the first factor constitutes parameter estimation, while obtaining the second factor, conditioning on some particular Θ = θ drawn from the first, constitutes state estimation. Methods for Bayesian inference will in limited cases derive a closed form for the posterior distribution. In most cases this is unachievable, however, and either an approximate closed form is fit, or Monte Carlo sampling is performed.
Because an SSM is a member of the broader BHM class, any method for inference over BHMs can be applied to SSMs also. Methods that do not exploit the additional hierarchical structure of SSMs may be sub-optimal, however. SSMs also exhibit other qualities that can erode the effectiveness of generic methods for BHMs. Common occurrences are strong autocorrelations in the Markov process relative to the observation frequency [e.g. 54], and strong correlations between the model parameters and state [e.g. 57]. The Gibbs sampler [25], a mainstay of inference for BHMs, is known to mix slowly in such conditions [58]. It is also common for the Markov process, while it can be simulated, to not yield a useable closed-form transition density [e.g. 6,20,29,30,53,54]. This precludes an analytically derived conditional distribution for Gibbs sampling, or even computation of the acceptance ratio for Metropolis-Hastings-within-Gibbs.
Fortunately, specialised methods for Bayesian inference with SSMs do exist. Methods based on sequential Monte Carlo (SMC) are the focus of this work, and of the LibBi software. An important exception is the Kalman filter [39], which is optimal for a further specialisation of the SSM class: that of linear-Gaussian models. An introduction to the Kalman filter is given here also.
We begin with state estimation in Section 3.1, which will lead into parameter estimation in Section 3.2.

State estimation
Sampling the second factor of (11), conditioned on a particular parameter setting Θ = θ, constitutes state estimation. This has been well-studied in the Bayesian filtering literature. Two methods, the Kalman filter and the particle filter, are introduced here.

The Kalman filter
The Kalman filter [39] can be used in the special case of SSMs where both the transition and observation densities are linear and Gaussian, and the initial state model is Gaussian 2 . It is optimal in such cases, producing the exact closed-form solution. Models that fit this class can be expressed in the following form: where N(µ, U) denotes the normal distribution with mean vector µ and square-root of the covariance matrix U. It is understood that all symbols may depend on θ. The Kalman filter is usually initialised with the mean vector µ(t 0 ) and covariance matrix Σ(t 0 ) of the initial state p(X(t 0 )|θ). Here the square-root Kalman filter is presented, where the covariance matrix is replaced with its upper-triangular Cholesky square-root U(t 0 ), i.e. Σ(t 0 ) = U (t 0 )U(t 0 ). Using U(t 0 ) has certain numerical and computational advantages; indeed most computational manipulations of Gaussian distributions ultimately use the Cholesky square-root and not Σ(t 0 ) directly. After initialisation, the Kalman filter iterates through observation times with interleaved prediction and correction steps. Pseudocode is given in Algorithm 1. The prediction step computes p(X(t i−1:i )|y(t 1:i−1 ), θ), which is Gaussian with mean and upper-triangular Cholesky factor of the covariance matrix The cross term C(t i ) is used later. The correction step first computes p( and upper-triangular Cholesky factor of the covariance matrix It then conditions this on Y(t i ) = y(t i ), in the usual fashion for a Gaussian distribution, to obtain p(X(t i )|y(t 1:i ), θ), Gaussian with mean µ(t i ) and Cholesky factor U(t i ).
In the context of parameter estimation in Section 3.2, two further outputs are required of the Kalman filter. The first is to deliver the likelihood of θ, marginalised over X(t 0:T ). The computation is given on line 15 of Algorithm 1 (| · | denotes the matrix determinant and · the Euclidean norm). The second requirement is to deliver a single sample x (t 0:T ) ∼ p(X(t 0:T )|θ, y(t 1:T )). This is achieved with a backward pass through time at the end of Algorithm 1, in a manner similar to the Rauch-Tung-Striebel smoother [63].
A number of Kalman-type filters exist for nonlinear and non-Gaussian models. These include the mixture-of-Gaussians [1], extended [67], ensemble [19] and unscented [38] Kalman filters and accompanying smoothers [63,66]. However, for nonlinear and non-Gaussian cases these are approximate methods that introduce bias. The particle filter [31,14] is an alternative that does not, and admits any SSM of the form (2). The particle filter is the focus for the nonlinear case in this work. The approximate Kalman-type filters are not treated.

The particle filter
The particle filter [31,14], of the family of SMC methods, can be used for any SSM as defined by (2). For this general form an analytical solution is not forthcoming, and the particle filter instead relies on importance sampling. Numerous variants are available, but the most basic, that of the bootstrap particle filter [31], is described in pseudocode in Algorithm 2 and visualised in Figure 6.
For a given θ, the particle filter is initialised by drawing P x number of random samples, x j (t 0 ) ∼ p(X(t 0 )|θ), for j = 1, . . . , P x , and weighting each uniformly with w j (t 0 ) = 1/P x . These are referred to as particles. It proceeds sequentially through observation times through a series of propagation, weighting and resampling steps. In the propagation step each particle is advanced to the next observation time with x j (t i ) ∼ p(X(t i )|x a j (ti) (t i−1 ), θ), where a j (t i ) is the index of the particle's ancestor at the previous time, t i−1 (more below). It is then weighted with the likelihood of the new observation, w j (t i ) = p(y(t i )|x j (t i ), θ). The resampling step restores the particle stock to equal weights by resampling particles with replacement, where the probability of each particle being drawn is proportional to its weight w j . Particles with high weight tend to be replicated, while particles with low weight tend to be eliminated. It is this process that determines the ancestor indices for the next time propagation.
The particle filter can be used to compute an unbiased estimate of the likelihood of θ, marginalised over X(t 0:T ). The computation [10] is given on line 10 of Algorithm 2. It can also produce a samplê x (t 0:T ) ∼ p(X(t 0:T )|θ, y(t 1:T )) by tracing a single particle back through its ancestry [4]. This occurs as the last step in Algorithm 2.

Parameter estimation
Sampling from the first factor of (11) constitutes parameter estimation, for which two approaches are given here. The first is marginal Metropolis-Hastings [49,33], of the family of Markov chain Monte Carlo (MCMC), using either a Kalman filter to compute the likelihood of parameters, marginalised over the state, or a particle filter to estimate it. When a particle filter is used the approach is more specifically known as particle marginal Metropolis-Hastings (PMMH), from the family of particle Markov Algorithm 1 The Kalman filter for given θ. This is presented over a time interval [t r , t s ], returning the marginal likelihood l = p(y(t 0:s )|θ) and a state sample x (t 0:s ) ∼ p(X(t 0:s )|θ, y(t 0:s )), so that it may be called from Algorithms 3 & 4 later. The usual, standalone algorithm sets r = 0 and s = T .
22 return (l, x (t 0:s )) Algorithm 2 The bootstrap particle filter for given θ. This is presented over a time interval [t r , t s ], returning an estimate of the marginal likelihoodl = p(y(t 0:s )|θ) and an unbiased state samplex (t 0:s ) ∼ p(X(t 0:s )|θ, y(t 0:s )), so that it may be called from Algorithms 3 & 4 later. The usual, standalone algorithm sets r = 0 and s = T .
Recall that the particle filter for state estimation is a type of SMC algorithm, indeed it is prototypical. Similar methods may be employed for parameter estimation, and this is considered here. Within the SMC algorithm over parameters, a Kalman or particle filter is used to compute weights; when the latter is used the method is more specifically known as SMC 2 [8].

Marginal Metropolis-Hastings
The marginal Metropolis-Hastings algorithm is given in Algorithm 3. It is initialised with some arbitrary setting of parameters θ 0 , then for i = 1, 2, . . ., a new setting θ ∼ q(θ |θ) is proposed from some proposal distribution q. The move is accepted with probability given by the Metropolis-Hastings rule: If accepted then θ i = θ , otherwise θ i = θ i−1 . By construction, the Markov chain θ 0 , θ 1 , . . . is ergodic to the posterior distribution, so that, after convergence, states of the chain may be considered samples from it. The process continues until as many samples as desired have been drawn, a number denoted P θ . Either a Kalman or particle filter may be used to compute or estimate the marginal likelihood term p(y(t 1:T )|θ ) in (15) at each step. When estimated, a valid sampler is still obtained [4]. The single state sample drawn from the Kalman or particle filter completes the sample with a draw from the second factor of (11).

Sequential Monte Carlo
An alternative approach to sampling the first factor of (11) is to replace the MCMC over parameters with SMC over parameters. SMC over parameters works similarly to SMC over state variables. It is initialised by drawing P θ number of random samples from the prior distribution over parameters, p(Θ), and weighting them uniformly with v j (t 0 ) = 1/P θ for j = 1, . . . , P θ . These are referred to as θ-particles and θ-weights. It proceeds sequentially through observation times with a series of propagation, weighting and resampling steps, just as for the particle filter for state estimation, along with a new rejuvenation step. Pseudocode for the algorithm is given in Algorithm 4.
To each θ-particle is attached a Kalman or particle filter. Propagating a θ-particle involves advancing the attached filter through time. Weighting of a θ-particle uses the marginal likelihood p(y(t i )|θ j (t i )), computed by the attached Kalman filter or estimated by the attached particle filter. Weighting with an unbiased estimate of that marginal likelihood gives a valid SMC algorithm as in the random-weight particle filter [20,21], and is more specifically called SMC 2 [8]. Resampling involves drawing a new set of unweighted θ-particles by weight, along with their attached filters.
The resampling of θ-particles at each time t i depletes the number of unique values represented. Resampling has the same effect in the particle filter for state estimation, but in that case particles diversify again in the next propagation step. As parameters do not change in time, they cannot diversify Algorithm 4 Sequential Monte Carlo (SMC) for parameter estimation. When the particle filter is used (i.e. Particle-Filter for Filter), the algorithm gives SMC 2 [8].
, t i−1:i ) / / propagate and weight θ-particle j in this way. To fix this, an additional step, called the rejuvenation step, is inserted after the resampling of θ-particles [8]. The aim of the step is to diversify the values of θ-particles while preserving their distribution. To this end it is sufficient to take a single marginal Metropolis-Hastings step for each θparticle, θ j (t i ). This works by proposing a move to a new value θ (t i ) ∼ q(θ (t i )|θ j (t i )), estimating the marginal likelihoodp(y(t 1:i )|θ (t i )) with the Kalman or particle filter, and then accepting or rejecting the move using the acceptance probability (15). Line 7 of Algorithm 4 achieves this by calling the Marginal-Metropolis-Hastings function of Algorithm 3.

Parallelisation
The methods presented exhibit varying degrees of parallelisability, and therefore suitability to modern computing hardware. SMC is particularly promising in this regard [44]. The propagation, weighting and (in the case of parameter estimation) rejuvenation steps can be performed in parallel for each (θ-)particle. There is limited scope for parallelisation of the Kalman filter: its matrix operations can be multithreaded or even performed on GPU, but this will only be faster for very large matrices [68], and may be slower for small matrices. For the particle filter, the minimum degree of parallelism is P x , the number of particles, with the potential for further parallelisation according to model-specific structure. The bottleneck in highperformance implementation of the particle filter is the resampling step. Resampling algorithms, such as the multinomial, systematic or stratified [41] approaches, require a collective operation over the weight vector. This means that all threads must synchronize. The development of asynchronous resampling methods to alleviate this bottleneck is still an active area of research [50,55,12]. In the meantime, the particle filter is best suited to shared-memory architectures where the collective operation can be performed efficiently, although approaches for distributed-memory resampling have been proposed [7].
Marginal Metropolis-Hastings is itself a sequential method, but inherits the degree of parallelism of the filter used at each step. Again, the Kalman filter offers limited scope, but using the particle filter (the PMMH sampler) gives P x -way parallelism. The setting of P x is complicated in this context, however. The tradeoff is to maximise the mixing rate of the Metropolis-Hastings chain against the computational expense of running P x particles. The optimal choice relates to the variance in the likelihood estimator [15,54]. Increasing P x decreases this variance, but does not necessarily improve the real-time mixing of the Metropolis-Hastings chain for a fixed computational budget. Because arbitrarily increasing P x to consume all available hardware resources has depreciating returns on mixing rate, there are limits on the degree of parallelism of PMMH.
A possible solution is to run multiple marginal Metropolis-Hastings chains. This also has limits. In practice, one usually removes some number, B θ , of steps from the start of each chain to correct for the initialisation bias of θ 0 . B θ depends on the autocorrelation of the chain, and should be sufficiently large for the influence of θ 0 to be forgotten. With multiple chains, each chain must take at least B θ steps before drawing one or more samples that will be preserved. This imposes a serial limitation on the maximum speedup achievable by parallelisation with multiple chains, which may be quantified by Amdahl's law [2] as P θ /B θ . The performance gains of multiple chains might therefore be disappointing without some additional strategy to reduce B θ . Adaptation [28,32,3] and tempering [23,26,48,27,56] are both established means of reducing B θ for single chains. Using these for each chain in isolation can reduce B θ , but only by a factor that is independent of the number of chains. Reducing B θ relative to the number of chains is to be preferred. Population MCMC [43] attempts this via an evolutionary [5] selection, crossing and mutation of multiple chains. Craiu et al. [9] more explicitly target the posterior with an ensemble of chains, using the covariance of samples across all chains to adapt the proposal covariance of individual chains. This remains an active area of research.
Using SMC for parameter estimation gives at least a P θ -way parallelism, even with the Kalman filter. Using the particle filter (the SMC 2 method) has a much higher degree of parallelism, at P θ P x . This is very promising. As for the particle filter, however, the resampling step is synchronous, which can be a bottleneck to the scaling of the algorithm. This also remains an active area of research.

Examples
The example models introduced in Section 2 are used within LibBi to demonstrate the inference methods above. The three-element windkessel model is an example where Kalman filter-based methods are suitable, while the nonlinear Lorenz '96 model requires the particle filter. In both cases simulated data sets are used.
LibBi provides a libbi command to access its functionality from the command line. It is used as follows: where command is the particular command to execute, followed by zero or more command-line options to specify input files and configuration parameters. The pertinent command for these examples is sample, which draws samples from the joint, prior or posterior distribution. Options may be given on the command line itself, or, as is often convenient, by listing them in a configuration file and giving the name of that file on the command line instead, like so: libbi sample @config.conf All of the examples use configuration files in this way. The contents of these files are given in Figure 7.

Windkessel example
The three-element windkessel model has Gaussian initial state, linear and Gaussian transition model, and linear and Gaussian observation model. Kalman filter-based methods are suitable in this case. The model can be coerced into the matrix form of (12)(13)(14), but it is not necessary to do so by hand: LibBi uses symbolic differentiation to derive the matrix form internally. Consequently, no changes are required to the model, as specified in Appendix ??, to apply the Kalman filter.
The windkessel model requires two input files: one giving the values of the input variable F (t), and one giving the values of the observed variable P a (t). The input is shown in the top-left plot of Figure 8, and observations in the top-right. These have been prepared in advance as NetCDF files data/input.nc and data/obs.nc, available in the supplementary materials (see Section 6).
A first step is often to simulate the model's prior distribution. This is useful to validate the model specification. Command-line options for this purpose are given in the prior.conf file in Figure 7. The command is then:  with results output to the NetCDF file results/posterior.nc. State estimates are shown in the topright of Figure 8, and parameter estimates in the lower four plots.

Lorenz '96 example
A similar procedure is followed for the Lorenz '96 model, although a prediction forward in time will also be made. Because it is a nonlinear model, the particle filter is used for state estimation and marginal likelihood estimates.
The Lorenz '96 model requires one input file, containing the observations. This has been prepared in advance as data/obs_sparse.nc, available in the supplementary materials (see Section 6). It contains observations of the first four components of y(t) at every other time step (recall ∆t = 0.05) on the interval t = [0, 3]. It is an example of how sparse or missing data can be handled by LibBi. The aim is to condition the joint distribution on those observations that fall in the interval t = [0, 2], and set aside the remainder to validate a forward prediction on the interval t = (2, 3].
The first task is to simulate the prior distribution. The prior.conf configuration file in Figure 9 is set up for this purpose. The command: libbi sample @prior.conf outputs samples to the results/prior.nc file. These are shown in grey in Figures 11 and 12. The posterior is sampled with options from the posterior.conf configuration file of Figure 9. The particle filter is the default option for state estimation, and marginal Metropolis-Hastings for parameter estimation, so no options to select an appropriate method are required. The command is: libbi sample @posterior.conf Results are output to results/posterior.nc, and plot in blue in Figures 10-12. SMC 2 does work to sample from the posterior distribution also. It may be selected by adding the --sampler smc2 option (see the posterior_smc2.conf file in the supplementary materials).
Finally, a prediction can be made. The configuration file prediction.conf of Figure 9 is set up for this. The output file of the posterior sample is used as an input file for the prediction (--init-file results/posterior.nc) so as to extend the state estimate forward in time to form a posterior prediction. The command is:   libbi sample @prediction.conf Results are output to results/prediction.nc, and plot in red in  Note that the observations on t = (2, 3] do not factor in to the prediction, they are shown in Figure 11 only for validation of the prediction.

The LibBi software
LibBi is made up of a C++ template library and Perl frontend. The C++ template library provides the inference methods, along with supporting functionality for, among other things, input and output, memory management, matrices and vectors, and pseudorandom number generation. The Perl frontend parses the LibBi modelling language, generates model-specific C++ code using the Perl Template Toolkit 3 , uses a GNU Autotools 4 build system to compile and link this against the C++ template library, and finally runs the program. Code is configured for the hardware platform according to options set by the user when calling the libbi command. LibBi supports several hardware architectures and high-performance computing technologies, including vector SIMD (Single Instruction Multiple Data) operations on CPU using SSE (Streaming SIMD Extensions), multithreading on multicore CPUs using OpenMP 5 , generalpurpose GPU programming on NVIDIA GPUs using CUDA (Compute Unified Device Architecture) 6 , and distributed-memory computing using MPI (Message Passing Interface) 7 . The compilation process is hidden from the user. Intermediate files are preserved in a hidden directory to avoid repeated effort, although a short wait is noticeable after making changes to a model that require recompilation. The LibBi modelling language is described with an LALR grammar [13]. Tools such as Yacc 8 and GNU Bison 9 may be used to compile parsers from the grammar. Yapp 10 , a Yacc-like parser compiler for Perl, is used for this purpose. Input and output files use the standard NetCDF 11 format, based on    Table 1: Capabilities of the CPU and GPU devices used for empirical experiments, from a programming perspective. Specifications are obtained from /proc/cpuinfo for the CPU, and the CUDA SDK deviceQuery program for the GPU.
HDF5 12 , and are readily created and analysed within various mathematical and statistical packages such as MATLAB 13 , R 14 [62] and GNU Octave 15 [18]. The workflow for LibBi typically sees the user prepare input files in their preferred statistical package, run LibBi from the command-line, then return to their statistical package to summarise and visualise the results. It is hoped that closer integration will be realised in future.
Existing software for general BHMs exists, including the BUGS pair WinBUGS 16 [47] and more recently OpenBUGS 17 [46], as well as JAGS 18 [61] and Stan 19 [69]. Specialist software for SSMs also exists, notably BiiPS 20 . It is against these programs that LibBi might be most closely compared. All of these existing packages accept models specified in the BUGS language or extensions of it [61], while LibBi prescribes its own, albeit similar, language. WinBUGS, OpenBUGS, JAGS and BiiPS build a data structure from the model which is manipulated internally by the client program, while Stan and LibBi take a code generation approach. WinBUGS, OpenBUGS and JAGS target general BHMs, while BiiPS and LibBi target the more-specific class of SSMs. In line with this more specialised focus, BiiPS and LibBi use SMC as a staple method, rather than Gibbs (as in WinBUGS, OpenBUGS and JAGS) or Hamiltonian Monte Carlo [34] (as in Stan). The most notable distinction of LibBi against all of these packages is its high-performance computing focus, where its hardware support, particularly for GPUs and distributed clusters, is unique.
The C++ template library component of LibBi might also be compared to similar libraries such as SMCTC 21 [36]. The template metaprogramming techniques advocated in Johansen [36] are mirrored in LibBi. But where a library requires its user to program their model to a prescribed interface, an additional code generator component, as in LibBi, automates compliance with the interface from a higher-level language in which the model is specified. In LibBi this also facilitates the generation of code for different combinations of methods and hardware.

Performance results
This section offers a comparison of LibBi performance under different hardware configurations. The Lorenz '96 model is used for this purpose. Experiments are conducted on a single machine with two 8-core Intel Xeon E5-2650 CPUs, three NVIDIA Tesla 2075 GPUs, and 128 GB main memory. The salient technical specifications of these devices are given in Table 1. The particle filter, PMMH and SMC 2 methods described in Section 3 are configured according to Table 2 using hardware configurations in Table 3. All methods are tested with the first six configurations in Table 3. The SMC 2 method is also tested with the last two configurations, which use MPI to distribute θ-particles across multiple processes.   Table 3: System configurations for timing results. Headings additionally give the technology (e.g., MPI) and LibBi command-line option (e.g., --enable-mpi) used to enable it. All configurations additionally use the --disable-assert command-line option to disable runtime assertion checks. Figure 13 gives the execution times for each method in each configuration, summarised across 100 repeated runs with different random number seeds. For context, one would ideally like to see: 1. that configurations #2 and #4 execute four times faster than configurations #1 and #3, respectively, due to OpenMP multithreading, 2. that configurations #3 and #4 execute four times faster than configurations #1 and #2, respectively, due to four-way SSE vector parallelism in single precision floating point, 3. that configurations #5 and #6 execute faster than configuration #4, as the algorithms are wellsuited to GPU, and 4. that configurations #7 and #8 execute three times faster than configurations #4 and #6, respectively, due to three MPI processes on uncontested hardware.
Of course, these ideal targets are rarely met in practice due to synchronisation overhead and necessarily serial sections of code. Results in Figure 13 do suggest good gains with OpenMP, SSE and GPU use, however, and modest gains in MPI use. Improving gains with MPI is a topic of current research. Performance of these methods is very much model-dependent, and the results of Figure 13 should be considered demonstrative only. For example, LibBi parallelises predominantly (but not only, e.g., Murray [51]) across particles. For GPU performance to exceed CPU-only performance, typically at least a thousand particles will be required. The number of x-particles per θ sample in these configurations ( Table 2) is sufficiently large that good returns are expected. Not all models justify the use of this many x-particles, however, and for fewer, CPU performance can, and in many cases does, exceed GPU performance.

Summary
LibBi combines the utility the SSMs with the generality and computational scalability of SMC methods, in a manner sensitive to modern hardware realities. Its two aims are accessibility and speed. Accessibility is achieved by decoupling models, from methods, from the hardware on which they run. Models are specified in a modelling language with a rich variety of features, the focus on SMC methods ensures their broad inferential support, and code generation alleviates the user from the combinatorial task of writing model-, method-and hardware-specific code. Speed is achieved through code generation and template metaprogramming techniques, along with support for high-performance technologies such as SSE, OpenMP, MPI and CUDA to make full use of available hardware resources, including GPUs. Good performance gains are obtained with these technologies as demonstrated in Figure 13. Research continues into both statistical and computational advances to scale to higher-dimensional models on large compute clusters.