Getting Started with Particle Metropolis-Hastings for Inference in Nonlinear Dynamical Models

This tutorial provides a gentle introduction to the particle Metropolis-Hastings (PMH) algorithm for parameter inference in nonlinear state-space models together with a software implementation in the statistical programming language R. We employ a step-by-step approach to develop an implementation of the PMH algorithm (and the particle filter within) together with the reader. This final implementation is also available as the package pmhtutorial in the CRAN repository. Throughout the tutorial, we provide some intuition as to how the algorithm operates and discuss some solutions to problems that might occur in practice. To illustrate the use of PMH, we consider parameter inference in a linear Gaussian state-space model with synthetic data and a nonlinear stochastic volatility model with real-world data.


Introduction
We are concerned with Bayesian parameter inference in nonlinear state-space models (SSMs).This is an important problem as SSMs are ubiquitous in e.g., automatic control (Ljung 1999), econometrics (Durbin and Koopman 2012), finance (Tsay 2005) and many other fields.An overview of some concrete applications of state-space modelling is given by Langrock (2011).
The major problem with Bayesian parameter and state inference in SSMs is that it is an analytically intractable problem, which cannot be solved in closed-form.Hence, approximations are required and these typically falls into two categories: analytical approximations and statistical simulations.The focus of this tutorial is on a member of the latter category known as particle Metropolis-Hastings (PMH; Andrieu et al. 2010), which approximates the posterior distribution by employing a specific Markov chain to sample from it.This requires us to be able to point-wise compute unbiased estimates of the posterior.In PMH, these are provided by the particle filter (Gordon et al. 1993;Doucet and Johansen 2011), which is another important and rather general statistical simulation algorithm.
The main aim and contribution of this tutorial is to give a gentle introduction (both in terms of methodology and software) to the PMH algorithm.We assume that the reader is familiar with traditional time series analysis and Kalman filtering at the level of Brockwell and Davis (2002).Some familiarity with Monte Carlo approximations, importance sampling and Markov chain Monte Carlo (MCMC) at the level of Ross (2012) is also beneficial.
The focus is on implementing the particle filter and PMH algorithms step-by-step in the programming language R (R Core Team 2017).We employ literate programming1 to build up the code using a top-down approach, see Section 1.1 in Pharr and Humphreys (2010).The final implementation is available as a R package pmhtutorial distributed under the GPL-2 license via the CRAN repository.The R source code is also provided via GitHub with corresponding versions for MATLAB and Python to enable learning by readers from many different backgrounds, see Section 8 for details.
There are a number of existing tutorials which relates to this one.A thorough introduction to particle filtering is provided by Arulampalam et al. (2002), where a number of different implementations are discussed and pseudo-code provided to facilitate implementation.This is an excellent start for the reader who is particularly interested in particle filtering.However, they do not discuss nor implement PMH.The software LibBi and the connected tutorial by Murray (2015) includes high-performance implementations and examples of particle filtering and PMH.The focus of the present tutorial is a bit different as it tries to explain all the steps of the algorithms and provides basic implementations in several different programming languages.Finally, Ala-Luhtala et al. (2016) presents an advanced tutorial paper on twisted particle filtering.It also includes pseudo-code for PMH but its focus is on particle filtering and does not discuss e.g., how to tune the proposal in PMH.We return to discuss related work and software throughout this tutorial in Sections 3. 3, 4.3, 6 and 7.The disposition of this tutorial is as follows.In Section 2, we introduce the Bayesian parameter and state inference problem in SSMs and discuss how and why the PMH algorithm works.In Sections 3 and 4, we implement the particle filter and PMH algorithm for a toy model and study their properties as well as giving a small survey of results suitable for further study.In Sections 5 and 6, the PMH implementation is adapted to solve a real-world problem from finance.Furthermore, some more advanced modifications and improvements to PMH are discussed and exemplified.Finally, we conclude this tutorial by Section 7 where related software implementations are discussed.

Overview of the PMH algorithm
In this section, we introduce the Bayesian inference problem in SSMs related to the parameters and the latent state.The PMH algorithm is then introduced to solve these intractable problems.The inclusion of the particle filter into PMH is discussed from two complementary perspectives: as an approximation of the acceptance probability and as auxiliary variables in an augmented posterior distribution.
The structure of an SSM represented as a graphical model.

Bayesian inference in SSMs
In Figure 1, we present a graphical model of the SSM with the latent states (top), the observations (bottom), but without exogenous inputs2 .The latent state x t only depends on the previous state x t−1 of the process as it is a (first-order) Markov process.That is, all the information in the past states x 0:t−1 {x s } t−1 s=0 is summarised in the most recent state x t−1 .The observations y 1:t are conditionally independent as they only relate to each other through the states.
We assume that the state and observations are real-valued, i.e., y t ∈ Y ⊆ R ny and x t ∈ X ⊆ R nx .The initial state x 0 is distributed according to the density µ θ (x 0 ) parametrised by some unknown static parameters θ ∈ Θ ⊂ R p .The latent state and the observation processes are governed by the densities3 f θ (x t |x t−1 ) and g θ (y t |x t ), respectively.The density describing the state process gives the probability that the next state is x t given the previous state x t−1 .An analogous interpretation holds for the observation process.With the notation in place, a general nonlinear SSM can be expressed as with π 0 (x 0 ) = µ θ (x 0 ) as the initialisation.In theory, it is possible to construct a sequence of filtering distributions by iteratively applying (5).However in practice, this strategy cannot be implemented for many models of interest as the marginalisation over x t−1 (expressed by the integral) cannot be carried out in closed form.

Constructing the Markov chain
The PMH algorithm offers an elegant solution to both of these intractabilities by leveraging statistical simulation.Firstly, a particle filter is employed to approximate π t−1 (x t−1 ) and obtain point-wise unbiased estimates of the likelihood.Secondly, an MCMC algorithm (Robert and Casella 2004) known as Metropolis-Hastings (MH) is employed to approximate π θ (θ) based on the likelihood estimator provided by the particle filter.PMH emerges as the combination of these two algorithms.
MCMC methods generate samples from a so-called target distribution by constructing a specific Markov chain, which can be used to form an empirical approximation.In this tutorial, the target distribution will be either the parameter posterior π θ (θ) or the posterior of the parameters and states π θ,T (θ, x 0:T ) = p(θ, x 0:T |y 1:T ).We focus here on the former case and the latter follows analogously.Executing the PMH algorithm results in K correlated samples θ (1:K) {θ (k) } K k=1 which can be used to construct a Monte Carlo approximation of π θ (θ).This empirical approximation of the parameter posterior distribution is given by which corresponds to a collection of Dirac delta functions δ θ (dθ) located at θ = θ with equal weights.In practice, histograms or kernel density estimators are employed to visualise the estimate of the parameter posterior obtained from (6).
The construction of the Markov chain within PMH amounts to carrying out two steps to obtain one sample from π θ (θ).The first step is to propose a so-called candidate parameter θ from a proposal distribution q(θ |θ (k−1) ) given the previous state of the Markov chain denoted by θ (k−1) .The user is quite free to choose this proposal but its support should cover the support of the target distribution.The second step is to determine if the state of the Markov chain should be changed to the candidate parameter θ or if it should remain in previous state where Z θ is cancels as it is independent of the current state of the Markov chain.The stochastic behaviour introduced by ( 7) facilitates exploration of (in theory) the entire posterior and is also the necessary condition for the Markov chain to actually have the target as its stationary distribution.The latter is known as detailed balance which ensures that the Markov chain is reversible and have the correct stationary distribution.That is, to ensure that the samples generated by PMH actually are from π θ (θ).
Hence, the stochastic acceptance decision can be seen as a correction of the Markov chain generated by the proposal distribution.The stationary distribution of the corrected Markov chain is the desired target distribution.In a sense this is similar to importance sampling, where samples from a proposal distribution are corrected by weighting to be approximately distributed according to the target distribution.
The intuition for the acceptance probability (7) (disregarding the influence of the proposal q) is that we always accept a candidate parameter θ if π θ θ > π θ θ (k−1) .That is if θ increases the value of the target compared with the previous state θ (k−1) .This results in a mode-seeking behaviour which is similar to that of an optimisation algorithm estimating the maximum of the posterior distribution.However from (7), we also note that a small decrease in the posterior value can be accepted to facilitate exploration of the entire posterior.This is the main difference between PMH and that of an optimisation algorithm, where the former focus on mapping the entire posterior whereas the latter only would like to find the location of its mode.

Approximating test functions
In Bayesian parameter inference, the interest often lies in computing the expectation of a so-called test function, which is a well-behaved (integrable) function ϕ : Θ → R nϕ mapping the parameters to a value on the real space.The expectation with respect to the parameter posterior is given by where e.g., choosing ϕ(θ) = θ corresponds to computing the mean of the parameter posterior.
The expectation in (8) can be estimated via the empirical approximation in (6) according to where the last equality follows from the properties of the Dirac delta function.This estimator is well-behaved and it is possible to establish a law of large numbers (LLN) and a central limit theorem (CLT), see Meyn and Tweedie (2009) for more information.From the LLN, we know that the estimator is consistent (and asymptotically unbiased as K → ∞).Moreover from the CLT, we know that the error is approximately Gaussian with a variance decreasing by 1/K, which is the usual Monte Carlo rate.Note that the LLN usually assumes independent samples but a similar result known as the ergodic theorem gives a similar result (under some assumptions) even when θ (1:K) are correlated 4 .However, the asymptotic variance is usually larger than if the samples would be uncorrelated.

Approximating the acceptance probability
One major obstacle remains to be solved before the PMH algorithm can be implemented.The acceptance probability ( 7) is intractable and cannot be computed as the likelihood is unknown.
From above, we know that the particle filter can provide an unbiased point-wise estimator for the likelihood and therefore also the posterior π θ (θ).It turns out that the unbiasedness property is crucial for PMH to be able to provide a valid empirical approximation of π θ (θ).This approach is known as an exact approximation due to that the likelihood is replaced by an approximation but the algorithm stills retains its validity, see Andrieu and Roberts (2009) for more information.
The particle filter generates a set of samples from π t (x t ) for each t, which can be used to create an empirical approximation.These samples are generated by sequentially applying importance sampling to approximate the solution to (5).The approximation of the filtering distribution is then given by where the particles x (i) t and their normalised weights w (i) t constitutes the so-called particle system generated during a run of the particle filter.The un-normalised weights are denoted by v (i) t .It is possible to prove that the empirical approximation converges to the true distribution when N → ∞ under some regularity assumptions 5 .
The likelihood can then be estimated via the decomposition (3) by inserting the empirical filtering distribution (10) into (4).This results in the estimator 4 The theoretical details underpinning MH have been deliberately omitted.For the ergodic theorem to hold, the Markov chain needs to be ergodic, which means that the Markov chain should be able to explore the entire state space and not get stuck in certain parts of it.The interested reader can find more about the theoretical details in e.g., Tierney (1994) and Meyn and Tweedie (2009) for MH and Andrieu et al. (2010) for PMH.
5 Again we have omitted many of the important theoretical details regarding the particle filter.For more information about these and many other important topics related to this algorithm, see e.g., Crisan and Doucet (2002) and Doucet and Johansen (2011).
where a point-wise estimate of the un-normalised parameter posterior is given by Here, it is assumed that the prior can be computed point-wise by closed-form expressions.
The acceptance probability (7) can be approximated by plugging in (12) resulting in This approximation is only valid if p N θ (y 1:T ) is an unbiased estimator of the likelihood.Fortunately, this is the case when employing the particle filter as its likelihood estimator is unbiased (Del Moral 2013;Pitt et al. 2012) for any N ≥ 1.
As in the parameter inference problem, interest often lies in computing an expectation of a well-behaved test function ϕ : X → R nϕ with respect to π t (x t ) given by where again choosing the test function ϕ(x t ) = x t corresponds to computing the mean of the marginal filtering distribution.This expectation is intractable as π t (x t ) is unknown.Again, we replace it with an estimate provided by an empirical approximation (10), which results in for some 0 ≤ t ≤ T by the properties of the Dirac delta function.In the subsequent sections, we are especially interested in the estimator for the mean of the marginal filtering distribution, which is referred to as the filtered state estimate.
Under some assumptions, the properties of π N t [ϕ] are similar as for the estimator in the PMH algorithm, see Crisan and Doucet (2002) and Doucet and Johansen (2011) for more information.Hence, we have that the estimator is consistent (and asymptotically unbiased when N → ∞) and the error is Gaussian with a variance decreasing by 1/N .

Outlook: Pseudo-marginal algorithms
The viewpoint adopted in the previous discussion on PMH is that it is an MH algorithm which employs a particle filter to approximate the otherwise intractable acceptance probability.Another more general viewpoint is to consider PMH as a pseudo-marginal algorithm (Andrieu and Roberts 2009).In this type of algorithm, some auxiliary variables are introduced to facilitate computations but these are marginalised away during a run of the algorithm.This results in that the PMH algorithm can be seen as a standard MH algorithm operating on the non-standard extended space Θ × U, where Θ and U denote the parameter space and the space of the auxiliary variables, respectively.The resulting extended target is given by Here, the parameter posterior is augmented with u ∈ U which denotes some multivariate random variables with density m(u).From the discussion above, we know that u can be used to construct a point-wise estimator of π N θ (θ|u) via the particle filter by ( 12).The unbiasedness property of the likelihood estimator based on the particle filter gives This means that the un-normalised parameter posterior γ θ (θ) can be recovered by marginalising over all the auxiliary variables u.In the implementation, this results in that the sampled values for u simply can be disregarded.Hence, we will not store them in the subsequent implementations but only keep the samples of θ (and x t ).
We conclude by deriving the pseudo-marginal algorithm following from (16).A proposal for θ and u is selected with the form which is the product of the two proposals selected as This corresponds to an independent proposal for u and a proposal for θ that does not include u.Other options are a topic of current research, see Section 4.3 for some references.The resulting acceptance probability from this choice of target and proposal is given by Note that the resulting acceptance probability is the same as in (13).
These arguments provide a sketch of the proof that PMH generates samples from the target distribution and were first presented by Flury and Shephard (2011).For a more formal treatment and proof of the validity of PMH, see Andrieu and Roberts (2009) and Andrieu et al. (2010).
3 Estimating the states in a linear Gaussian SSM We are now ready to start implementing the PMH algorithm based on the material in Section 2. To simplify the presentation, this section discusses how to estimate the filtering distributions {π t (x t )} T t=0 for a linear Gaussian state-space (LGSS) model.These distributions can be used to compute x N 0:T and p N θ (y 1:T ), i.e., the estimates of the filtered state and the estimate of the likelihood, respectively.The parameter inference problem is treated in the subsequent section.
The particular LGSS model considered is given by where parameters are denoted by θ = {φ, σ v , σ e }. φ ∈ (−1, 1) determines the persistence of the state, while σ v , σ e ∈ R + denote the standard deviations of the state transition noise and the observation noise, respectively.The Gaussian density is denoted by N (x; µ, σ 2 ) with mean µ and standard deviation σ > 0. Figure 2 presents a simulated data record from the model with T = 250 observations using θ = {0.75,1.00, 0.10} and x 0 = 0.The complete code for the data generation is available in generateData.

Implementing the particle filter
The complete source code for the implementation of the particle filter is available in the function particleFilter.Its code skeleton is given by: particleFilter <-function(y, theta, noParticles, initialState) { <initialisation> <initialiseStates> for (t in 2:T) { <resampleParticles> <propagateParticles> <weightParticles> <normaliseWeights> <estimateLikelihood> <estimateState> } <returnEstimates> } The function particleFilter has the inputs: y (vector with T + 1 observations), theta (parameters {φ, σ v , σ e }), noParticles and initialState.The outputs will be the estimates of the filtered states x N 0:T and the estimate of the likelihood p N (θ|y 1:T ).Note that particleFilter iterates over t = 2, . . ., T , which corresponds to time indices t = 1, . . ., T − 1 in (20) as the numbering of vector elements starts from 1 in R. The iteration terminates at time index T −1 as future observations y t+1 are required at each iteration.That is, we assume that the new observation arrives to the algorithm between the propagation and weighting steps.It therefore makes sense to use the information in the weighting step if this is possible.
<initialisation> The particle filter is initialised by allocating memory for the variables: particles, ancestorIndices, weights and normalisedWeights.These correspond to the particles, their ancestor, un-normalised weights and normalised weights, respectively.Moreover, the two variables xHatFiltered and logLikelihood are allocated to store x N 0:T and log p N (θ|y 1:T ), respectively.This is all implemented by: <initialisation> = T <-length(y) -1 particles <-matrix(0, nrow = noParticles, ncol = T + 1) ancestorIndices <-matrix(0, nrow = noParticles, ncol = T + 1) weights <-matrix(1, nrow = noParticles, ncol = T + 1) normalisedWeights <-matrix(0, nrow = noParticles, ncol = T + 1) xHatFiltered <-matrix(0, nrow = T, ncol = 1) logLikelihood <-0 <initialiseStates> For the LGSS model ( 20), we have that µ θ (x 0 ) = δ 0 (x 0 ) so all the particles are initially set to x (1:N ) 0 = x 0 = 0 and all weights to w (1:N ) 0 = 1/N (as all particles are identical).This operation is implemented by: <resampleParticles> The particles in the filter corresponds to a number of different hypotheses of what the value of the latent state could be.The weights represents the probability that a given particle has generated the observation under the model.In the resampling step, this information is used to focus the computational effort of the particle filter to the relevant part of the state-space, i.e., to particles with a large weight.This is done by randomly duplicating particles with large weights and discarding particles with small weights, such that the total number of particles always remains the same.Note that, the resampling step is unbiased in the sense that the expected proportions of the resampled particles are given by the particle weights6 .This step is important as the particle system otherwise would consist of only a single unique particle after a few iterations.This would result in a large variance in ( 14).
In our implementation, we make use of multinomial resampling, which is also known as a weighted bootstrap with replacement.The output from this procedure are the so-called ancestor indices a (i) t for each particle i, which can be interpreted as the parent index of particle i at time t.For each i = 1, . . ., N , the ancestor index is sampled from the multinomial (categorical) distribution with The resampling step is implemented by a call to the function sample by: where the resulting ancestor indices a (1:N ) t are returned as newAncestors and stored in ancestorIndices for bookkeeping.Note that the particle lineage is also resampled at the same time as the particles.All this is done to keep track of the genealogy of the particle system over time.That is, the entire history of the particle system.
<propagateParticles> The hypotheses regarding the state are updated by propagating the particle system forward in time by using the model.This corresponds to sampling from the particle proposal distribution to generate new particles x where information from the previous particle x a (i) t t−1 and the current measurement y t can be included.There are a few different choices for the particle proposal and the most common one is to make use of the model itself, i.e., p θ (x t |x t−1 , y t ) = f θ (x t |x t−1 ).However for the LGSS model, an optimal proposal can be derived using the properties of the Gaussian distribution resulting in This particle proposal is optimal in the sense that it minimises the variance of the incremental particle weights at the current time step7 .For most other SSMs, the optimal proposal is intractable and the state transition model is used instead.The propagation step is implemented by: From ( 22), the ratio between the noise variances is seen to determine the shape of the proposal.
Essentially, there are two different extreme situations (i) σ 2 e /σ 2 v 1 and (ii) σ 2 e /σ 2 v 1.In the first extreme (i), the location of the proposal is essentially governed by y t and the scale is mainly determined by σ e .This corresponds to essentially simulating particles from g θ (y t |x t ).When σ e is small this usually allows for running the particle filter using only a small number of particles.In the second extreme (ii), the proposal essentially simulates from f θ (x t |x t−1 ) and does not take the observation into account.In summary, the performance and characteristics of the optimal proposal are therefore related to the noise variances of the model and their relative sizes.
<weightParticles> and <normaliseWeights> In these steps, the weights required for the resampling step are computed.The weights can be computed using different so-called weighting functions and a standard choice is the observation model, i.e., v t (x t ) = g θ (y t |x t ).However for the LGSS model, we can instead derive an optimal weighting function by again applying the properties of the Gaussian distribution to obtain Remember that in this implementation, the new observation y t+1 is introduced in the particle filter between the propagation and the weighting steps for the first time.This is the reason for the slightly complicated choice of time indices.However, the inclusion of the information in y t+1 in this step is beneficial as it improves the accuracy of the filter.That is, we keep particles that after propagation are likely to result in the observation y t+1 .
In many situations, the resulting weights are small and it is therefore beneficial to work with shifted log-weights to avoid problems with numerical precision.This is done by applying the transformation where v max denotes the largest element in log v (1:N ) t . The weights are then normalised (ensuring that they sum to one) by where the shifts −v max cancel and do not affect the relative sizes of the weights.Hence, the first and third expressions in ( 23) are equivalent but the first expression enjoys better numerical precision.The computation of the weights is implemented by: We remind the reader that we compare y[t+1] and particles[, t] due to the convention for indexing arrays in R, which corresponds to y t and x (1:N ) t−1 in the model, respectively.This is also the reason for the comment regarding the for-loop in particleFilter.We now see that the weights depend on the next observation and that is why the loop needs to run to T − 1 and not T .
<estimateLikelihood> The likelihood p(θ|y 1:t ) can be estimated by inserting the weights from the particle filter into (11).However, it is beneficial to instead estimate the log-likelihood to enjoy better numerical precision as the likelihood often is small.This results in rewriting (11) to obtain the recursion where the log-shifted weights are used to increase the numerical precision.Note that the shift by v max needs to be compensated for in the estimation of the log-likelihood.This recursion is implemented by: <estimateLikelihood> = predictiveLikelihood <-maxWeight + log(sumWeights) -log(noParticles) logLikelihood <-logLikelihood + predictiveLikelihood The estimate of the posterior distribution follows from inserting the estimate of the loglikelihood into (12).
<estimateState> The latent state x t at time t can be estimated by ( 15) given the observations y 1:(t+1) , which corresponds to the estimator which is implemented by: Note that this corresponds to the weights w (i) t = 1/N which would correspond to that all particles are identical according to the expression in (23).However, this is not the case in general and the estimator ( 24) is the result of making using of the optimal choices in the  -6.94 -7.49 -8.72 -9.29 -9.91 -10.87 -11.67 Table 1: The log-bias and the log-MSE of the filtered states for varying N .
propagation and weighing steps of the particle filter.This choice corresponds to a particular type of algorithm known as the fully-adapted particle filter (faPF; Pitt and Shephard 1999), see Section 3.3 for more details.In a faPF, we make use of separate weights in the resampling step and when constructing the empirical approximation of π t (x t ).We have refrained from this more general formulation to simplify the presentation but this instead results to this rather confusing change in the estimator for x N t <returnEstimates> The outputs from particleFilter are returned by: <returnEstimates> = list(xHatFiltered = xHatFiltered, logLikelihood = logLikelihood)

Numerical illustration
We make use of the implementation of particleFilter to estimate the filtered state and to investigate the properties of this estimate for finite N .The complete implementation is found in the script/function example1_lgss.We use the data presented in Figure 2, which was generated in the beginning of this section.The estimates from the particle filter are compared with the corresponding estimates from the Kalman filter.The latter follows from using the properties of the Gaussian distribution to solve (5) exactly, which is only possible for an LGSS model.
In the middle of Figure 3, we present the difference between the optimal state estimate from the Kalman filter and the estimate from the particle filter using N = 20 particles.Two alternative measures of accuracy are the bias (absolute error) and the mean square error (MSE) of the state estimate.These are computed according to where x t denotes the optimal state estimate obtained by the Kalman filter.In Table 1 and in the bottom part of Figure 3, we present the logarithm of the bias and the MSE for different values of N .We note that the bias and the MSE decrease rapidly when increasing N .Hence, we conclude that for this model N does not need to be large for x N t to be accurate.

Outlook: Design choices and generalisations
We make a short de-tour before the tutorial continues with the implementation of the PMH algorithm.In this section, we provide some references to important improvements and further studies of the particle filter.Note that the scientific literature concerning the particle filter is vast and this is only a small and naturally biased selection of topics and references.More comprehensive surveys and tutorials of particle filtering are given by Doucet and Johansen (2011), Cappé et al. (2007), Arulampalam et al. (2002) and Ala-Luhtala et al. (2016) which also provide pseudo-code.The alterations discussed within these papers can easily incorporated into the implementation developed within this section.
In Section 3.1, we employed the faPF (Pitt and Shephard 1999) to estimate the filtered state and likelihood.This algorithm made use of the optimal choices for the particle proposal and weighting function, which corresponds to being able to sample from p(x t+1 |x t , y t+1 ) and to evaluate p(y t+1 |x t ).This is not possible for many SSMs.One possibility is to create Gaussian approximations of the required quantities, see Doucet et al. (2001) and Pitt et al. (2012).
However, these methods rely on quasi-Newton optimisation that can be computationally prohibitive if N is large.See Arulampalam et al. (2002) for a discussion and for pseudo-code.
Another possibility is to make use of a mixture of proposals and weighting functions in the particle filter as described by Kronander and Schön (2014).This type of filters are based on multiple importance sampling, which is commonly used in e.g., computer graphics.
A different approach is to make use of an additional particle filter to approximate the fully adapted proposal, resulting in a nested construction where one particle filter is used within another particle filter to construct a proposal distribution for that particle filter.The resulting construction is referred to as nested sequential Monte Carlo (SMC), see Naesseth et al. (2015) for details.The nested SMC construction makes it possible to consider state-spaces where n x is large, something that other types of particle filter struggles with.
The bootstrap particle filter (bPF) is the default choice as it can be employed for most SSMs.
A drawback with the bPF is that it suffers from poor statistical accuracy when the dimension of the state-space n x grows beyond about 10.In this setting, e.g., faPF or nested SMC are required for state inference.However, there are some approaches which possibly could improve the performance of the bPF in some cases and should be considered for efficient implementations.These includes: (i) parallel implementations, (ii) better resampling schemes, (iii) make use of linear substructures of the SSM and (iv) using quasi-Monte Carlo.
Parallel implementations of the particle filter (i) is a topic for ongoing research but some encouraging results are reported by e.g., Lee et al. (2010) and Lee and Whiteley (2016).
In this tutorial, we make use of multinomial resampling in the particle filter.Alternative resampling schemes (ii) can be useful in decreasing the variance of the estimates, see Hol et al. (2006) and Douc and Cappé (2005) for some comparisons.In general, systematic resampling is recommended for particle filtering.
Another possible improvement is the combination of Kalman and particle filtering (iii), which is possible if the model is conditionally linear and Gaussian in some of its states.The idea is then to make use of Kalman filtering to estimate these states and particle filtering for the remaining states while keeping the linear ones fixed to their Kalman estimates.These types of models are common in engineering and Rao-Blackwellisation schemes like this can lead to a substantial decrease in variance.See e.g., Doucet et al. (2000), Chen and Liu (2000) and Schön et al. (2005) for more information and comparisons.
Quasi-Monte Carlo (iv) is based on so-called quasi-random numbers, which are generated by deterministic recursions to better fill the space compared with standard random numbers.These are useful in standard Monte Carlo to decrease the variance in estimates.The use of quasi-Monte Carlo in particle filtering is discussed by Gerber and Chopin (2015).
Finally, particle filtering is an instance of the SMC method (Del Moral et al. 2006), which represents a general class of algorithms based on importance sampling.SMC can be employed for inference in many statistical models and is a complement/alternative to MCMC.
A particularly interesting member is the SMC 2 algorithm (Chopin et al. 2013;Fulop and Li 2013), which basically is a two-level particle filter similar to nested SMC.The outer particle filter maintains a particle system targeting π θ (θ).The inner particle filter targets π t (x t ) and is run for each particle in the outer filter as these contains the hypotheses of the value of θ.
The likelihood estimates from the inner filter are used to compute the weights in the outer filter.Hence, SMC 2 is an alternative to PMH for parameter inference, see Svensson and Schön (2016) for a comparison.
4 Estimating the parameters in a linear Gaussian SSM This tutorial now proceeds with the main event, where the PMH algorithm is implemented according to the outline provided in Section 2. The particle filter implementation from Section 3.1 is employed to approximate the acceptance probability (13) in the PMH algorithm.
We keep the LGSS model as a toy example to illustrate the implementation and provide an outlook of the PMH literature at the end of this section.
A prior is required to be able to carry out Bayesian parameter inference.The objective in this section is to estimate the parameter posterior for θ = φ while keeping {σ v , σ e } fixed to their true values.Hence, we select the prior p(φ) = T N (−1,1) (φ; 0, 0.5) to ensure that the system is stable (i.e., that the value of x t is bounded for every t).The truncated Gaussian distribution with mean µ, standard deviation σ in the interval z ∈ [a, b] is defined by where I(s) denotes the indicator function with value one if s is true and zero otherwise.
The log-likelihood and log-priors are used to avoid loss of numerical precision.The prior implies that α(θ , θ (k−1) ) = 0 when |φ| > 1.Therefore, the particle filter is not run in this case to save computations and the acceptance probability is set to zero to ensure that θ is rejected.The computation of the acceptance probability is implemented by: <acceptRejectStep> Finally, we need to take a decision for accepting or rejecting the proposed parameter.This is done by simulating a uniform random variable ω over [0, 1] by the built-in R command runif.We accept θ if ω < α θ , θ (k−1) by storing it and its corresponding log-likelihood as the current state of the Markov chain.Otherwise, we keep the current values for the state and the log-likelihood from the previous iteration.The accept/reject step is implemented by:

Numerical illustration
We make use of particleMetropolisHastings to estimate θ = φ using the data in Figure 2. The complete implementation and code is available in the function/script example2_lgss.We initialise the Markov chain in θ 0 = 0.5 and make use of K = 5, 000 iterations (noIterations) discarding the first K b = 1, 000 iterations (noBurnInIterations) as burn-in.That is, we only make use of the last 4, 000 samples to construct the empirical approximation of the parameter posterior to make certain that the Markov chain in fact has reached its stationary regime, see Section 6 for more information.
In Figure 4, three runs of PMH are presented using different step sizes = 0.01 (left), 0.10 (center) and 0.50 (right).The resulting estimate of the posterior mean is φ = 0.66 for the case = 0.10.This is computed by From the ACF, we see that the choice of influences the correlation in the Markov chain.A good choice of is therefore important to obtain an efficient algorithm.We return to discussing this issue in Section 6.3.

R> mean(phi[noBurnInIterations:noIterations])
We note that the parameter estimate differs slightly from the true value 0.75 and that the uncertainty is rather large in the estimate of the parameter posterior.This is due to the relatively small sample size T (and a finite K).From the asymptotic theory of the Bayesian estimator, we know that the posterior mass tends to concentrate around the true parameter as T (and K) tends to infinity.
We exemplify this in Table 2 by estimating the posterior mean and variance using the same setup when T increases.This small study supports that the true parameter is indeed recovered by the posterior mean estimate in the limit.However, the rate of this convergence is determined by the model and therefore it is not possible to give any general guidelines for how large T needs to be to achieve a certain accuracy.

Outlook: Generalisations
We devote this section to give some references for important improvements and further studies of the PMH algorithm.As for the particle filter, the scientific literature related to PMH is vast and quickly growing.This is therefore only a small and biased selection of recent developments.For a broader view of parameter inference in SSM, see e.g., the surveys by Kantas et al. (2015) and Schön et al. (2015).
As discussed in Section 2, the PMH algorithm is a member of the family of exact approximation or pseudo-marginal algorithms (Andrieu and Roberts 2009).Here, the particle filter is used to estimate the target but it is also possible to make use of e.g., importance sampling for this end.PMH is also an instance of so-called particle MCMC (PMCMC; Andrieu et al. 2010) algorithm, which also includes particle versions of Gibbs sampling (Andrieu et al. 2010;Lindsten et al. 2014).PMCMC algorithms are a topic of current research and much effort has been given to improve their performance, see Section 6 for some examples of this effort.
In this tutorial, we make use of an independent proposal for u as discussed in Section 2.5.This essentially means that all particle filters are independent.However, intuitively there could be something to gain by correlating the particle filters as the state estimates are often quite similar for small changes in θ.It turns out that correlating u results in a positive correlation in the estimates of the log-likelihood, which decreases the variance of the estimate of the acceptance probability.In practice, this means that N does not need to scale as rapidly with T as for the case when u are uncorrelated between iterations.This is particularly useful when T is large as it decreases the computational cost of PMH.See Dahlin et al. (2015a), Choppala et al. (2016) and Deligiannidis et al. (2016) for more information and source code.

Application example: estimating the volatility in OMXS30
We continue with a concrete application of the PMH algorithm to infer the parameters of a stochastic volatility (SV; Hull and White 1987) model.This is a nonlinear SSM with Gaussian noise and inference in this type of model is an important problem as the logvolatility (the latent state in this model) is useful for risk management and to price various financial contracts.See e.g., Tsay (2005) and Hull (2009) for more information.A particular parametrisation of the SV model is given by where the parameters are denoted by θ = {µ, φ, σ v }.Here, µ ∈ R, φ ∈ [−1, 1] and σ v ∈ R + denote the mean value, the persistence and standard deviation of the state process, respectively.Note that this model is quite similar to the LGSS model, but here the state x t scales the variance of the observation noise.Hence, we have Gaussian observations with zero mean and a state-dependent standard deviation given by exp(x t /2).
In econometrics, volatility is another word for standard deviation and therefore x t is known as the log-volatility.The measurements in this model y t are so-called log-returns, where s t denotes the price of some financial asset (e.g., an index, stock or commodity) at time t.Here, {s t } T t=1 is the daily closing prices of the NASDAQ OMXS30 index, i.e., a weighted average of the 30 most traded stocks at the Stockholm stock exchange.We extract the data from Quandl 8 for the period between January 2, 2012 and January 2, 2014.The resulting logreturns are presented in the top part of Figure 5.Note the time-varying persistent volatility in the log-returns, i.e., periods of small and large variations.This is known as the volatility clustering effect and is one of the features of real-world data that SV models aim to capture.Looking at (26), this can be achieved when |φ| is close to one and when σ v is small.As these choices result in a first-order autoregressive process with a large autocorrelation.
The objectives in this application are to estimate the parameters θ and the log-volatility x 0:T from the observed data y 1:T .We can estimate both quantities using PMH as both samples from the posterior of the parameter and the state can be obtained at each iteration of the algorithm.To complete the SV model, we assume some priors for the parameters based on domain knowledge of usual ranges of the parameters, i.e., Here, G(a, b) denotes a Gamma distribution with shape a and scale b, i.e., expected value a/b. 8The data is available for download from: https://www.quandl.com/data/NASDAQOMX/OMXS30.

Modifying the implementation
The implementation for the particle filter and PMH need to be adapted to this new model.We outline the necessary modifications by replacing parts of the code in the skeleton for the two algorithms.The resulting implementations and source codes are found in the functions particleFilterSVmodel and particleMetropolisHastingsSVmodel, respectively.In the particle filter, we need to modify all steps except the resampling.
In the initialisation, we need to simulate the initial particle system from µ θ (x 0 ) by replacing: Furthermore, we need to choose a (particle) proposal distribution and the weighting function for the particle filter implementation.For the SV model, we cannot compute a closed-form expression for the optimal choices as for the LGSS model.Therefore, a bPF is employed which corresponds to making use of the state dynamics as the proposal (21) given by and the observation model as the weighting function by .
These two choices result in that the estimator x N t = π t [x] is changed to These three alterations to the particle filter are implemented by replacing: We also need to generalise the PMH code to have more than one parameter.This is straightforward and we refer the reader to the source code for the necessary changes.The major change is to replace the vectors phi and phiProposed with the matrices theta and thetaProposed.
That is, the state of the Markov chain is now three-dimensional corresponding to the three elements in θ.Moreover, the parameter proposal is selected as a multivariate Gaussian distribution centred around the previous parameters θ (k−1) with covariance matrix Σ = diag( ), where denotes a vector with three elements.This is implemented by: The computation of the acceptance probability is also altered to include the new prior distributions.For this model, it is required that |φ| < 1 and σ v > 0 to ensure that it is stable and that the standard deviation is positive.This is implemented by: acceptProbability <-acceptProbability * (abs(thetaProposed[k, 2]) < 1.0) acceptProbability <-acceptProbability * (thetaProposed[k, 3] > 0.0) Furthermore, <acceptRejectStep> needs to be altered to take into account that theta is a vector, see the source code for the details.

Estimating the log-volatility
It is possible to compute an estimate of the log-volatility that takes into account the uncertainty in θ using the pseudo-marginal view of PMH.The particle filter is a deterministic algorithm given u.Hence, the random variables u are equivalent to an estimate of the filtering distribution.The result is that it is quite simple to compute the state estimate by including it in the Markov chain generated by PMH.This is done by modifying the code for the particle filter.After one run of the algorithm, we sample a single trajectory by sampling a particle at time T with probability given by w We then follow the ancestor lineage back to t = 0 and extract the corresponding path in the state-space.This enables us to obtain a better estimate of the log-volatility as both past and future information is utilised.As a consequence, this often reduces the variance of the estimate.This is done by using the stored resampled ancestor indices and the sampled the state trajectory by replacing:  where pmhOutput denotes the output variable from particleMetropolisHastingsSVmodel.

Numerical illustration
We now make use of particleMetropolisHastingsSVmodel and particleFilterSVmodel to the SV model using the OMXS30 data introduced in the beginning of this section.The complete implementation is available in the function/code example3_sv.The resulting posterior estimates, traces and ACFs are presented in Figure 7.We see that the Markov chain and posterior are clearly concentrated around the posterior mean estimate θ = {−0.23,0.97, 0.15} with standard deviations {0.37, 0.02, 0.06}.This confirms our belief that the log-volatility is a slowly varying process as φ is close to one and σ v is small.An estimate of the log-volatility given this parameter is also computed and presented in the second row of Figure 7.We note that the volatility is larger around t = 100 and t = 370, which corresponds well to what is seen in the log-returns.

Selected advanced PMH topics
In this section, we outline a selected number of possible improvements and best practices for the implementation in Section 5. We discuss initialisation, convergence diagnostics and how to improve the so-called mixing of the Markov chain.For the latter, we consider three different approaches: tuning the random walk proposal, re-parametrising the model and tuning the number of particles.This section is concluded by a small survey of more advanced proposal distributions and post-processing.

Initialisation
It is important to initialise the Markov chain in areas of high posterior probability mass to obtain an efficient algorithm.In theory, we can initialise the chain anywhere in the parameter space and still obtain a convergent Markov chain.However, in practice numerical difficulties can occur when the parameter posterior assumes small values or is relatively insensitive to changes in θ.Therefore it is advisable to try to obtain a rough estimate of the parameters to initialise the chain closer to the posterior mode.
In the LGSS model, we can make use of standard optimisation algorithms in combination with a Kalman filter to estimate the mode of the posterior.However, this is not possible for general SSMs as only noisy estimates of the posterior can be obtained from the particle filter.Standard optimisation methods can therefore get stuck in local minima or fail to converge entirely.One approach to mitigate this problem is the simultaneous perturbation and stochastic approximation (SPSA; Spall 1987) algorithm, which can be applied in combination with the particle filter to estimate the posterior mode.

Diagnosing convergence
It is in general difficult to prove that the Markov chain has reached its stationary regime and that the samples obtained are actually samples from π θ (θ).A simple solution is to initialise the algorithm at different points in the parameter space and compare the resulting posterior estimates.This can give an indication that the chain has reached its stationary regime if the resulting estimates are similar.
Another alternative is to make use of the Kolmogorov-Smirnov (KS; Massey 1951) test to establish that the posterior estimate does not change after the burn-in.This is done by dividing the samples {θ (k) } K k=1 into three partitions: the burn-in and two sets of equal number of samples from the stationary phase.As the KS test requires uncorrelated samples, thinning can be applied to decrease the autocorrelation.The Markov chain is thinned by keeping every lth sample, where l is selected such that the autocorrelation is negligible between two retained consecutive samples.A standard KS test is then conducted to conclude if the samples in the two thinned partitions are from the same (stationary) distribution or not.
Other methods for diagnosing convergence are discussed by Robert and Casella (2009) and Gelman et al. (2013).

Improving mixing
Standard Monte Carlo methods assume that independent samples can be obtained from the distribution of interest or from some instrumental distribution.However, the samples obtained from PMH are correlated as they are a realisation generated from a Markov chain.Intuitively, these correlated samples contain less information about the target distribution than if the samples were independent.That is, most samples are similar and the target is not fully explored if the autocorrelation in the Markov chain is large.This autocorrelation is also known as mixing, which is a very important concept in the MCMC literature.It is the key quantity for comparing the efficiency of different MCMC algorithms.
It turns out that the mixing influences the performance of an MCMC algorithm by scaling the asymptotic variance of the estimates.This is apparent from the CLT governing π K θ [ϕ] in (9) which under some assumptions 9 has the form when Here, IACT θ (1:K) denotes the integrated autocorrelation time (IACT) of the Markov chain, which is computed as the area under the ACF.An interpretation of the IACT is that it estimates the number of iterations of an MCMC algorithm between obtaining two uncorrelated samples.Hence, the IACT is one if the samples are independent, which is the case in e.g., importance sampling.Minimising the IACT is therefore the same as maximising the mixing in an MCMC algorithm.
In the right column of Figure 5, we present the ACF for the three parameters in the SV model.This information can be used to compute the corresponding IACTs by where ) 2 denotes the autocorrelation coefficient at lag τ for θ (1:K) .In practice, we cannot compute the IACT exactly as we do not know the autocorrelation coefficients for all possible lags.Also, it is difficult to estimate ρ τ when τ is large and K is rather small.A possible solution to this problem is to cut off the ACF estimate at some lag and only consider lags smaller than this limit L. In this tutorial, we make use of L = 100 lags to estimate the IACT by where ρ K τ = corr(ϕ(θ (k) ), ϕ(θ (k+τ ) )) denotes the estimate of the lag-τ autocorrelation of ϕ.For the results presented in Figure 5, this corresponds to the IACT {135, 86, 66} for each of the three parameters.

Tuning the proposal
The mixing can be improved by tuning the proposal distribution to better fit the posterior distribution.This requires an estimate of the posterior covariance, which acts lika a preconditioning matrix P. The covariance is quite simple to estimate from a pilot run by

R> var(thetaStationary)
where thetaStationary denotes the trace of the Markov chain generated by PMH after the burn-in has been discarded.For the problem considered in Section 5, this results in the following pre-conditioning matrix 9 These assumptions include that the Markov chain should reach its stationary regime quickly enough.This requirement is known as uniform ergodicity, which means that the total variational norm between the distribution of the Markov chain and its stationary distribution should decrease with a geometric rate regardless of the initialisation of the Markov chain, see e.g., Tierney (1994) for more details.which can be used to form a new improved proposal given by q θ θ (k−1) = N θ ; θ (k−1) , 2.562 2 3 P .
This scaling of the covariance matrix was proposed by Sherlock et al. (2015) to minimise the IACT when sampling from a Gaussian target distribution.
In Figure 6, we present the marginal posteriors (in color) for each pair of parameters in (26) together with the original (top) and the tuned proposals (bottom) from ( 28).The tuned proposals fit the posteriors better and this is expected to improve the mixing of the Markov chain.In Figure 7, we present the results obtained by the implementation from Section 5 when using the tuned proposal instead.This code is available in the function example4_sv.
The resulting IACT estimates are {13, 27, 23}, which is a clear improvement in mixing for µ and σ v compared with the results obtained in Section 5.3.The consequence is that K (and therefore computational cost) can be cut by 135/27 = 5 while retaining the same variance in the estimates.Note that the maximum IACTs are compared as these are the limiting factors.
We can also compare the support of the posterior estimates in Figures 5 and 7.According to ( 27), the improved mixing should decrease the asymptotic variance of the estimate.However, no such improvement can been seen in this case.This is probably due to that the variance in the posterior estimates largely results from the finite amount of information in the data.

Re-parametrising the model
Another approach to improve mixing is to re-parametrise the model to obtain unconstrained parameters, which can assume any value on the real line.In Section 5, we constrained φ and σ v to the regions |φ| < 1 and σ v > 0 to obtain a stable and valid SSM.This results in poor mixing if many candidate parameters are proposed which violate these constraints and this increases the autocorrelation.This problem can be mitigated by a re-parametrisation of the model given by such that ψ, ς ∈ R are unconstrained parameters.Hence, the target of PMH is changed to the posterior of ϑ = {µ, ψ, ς}.This transformation of parameters can be compensated for to still obtain samples from the posterior of θ.This is done by taking the Jacobians of (29) into account, which are given by The resulting acceptance probability is calculated according to where the original parameters are used to compute the prior and to estimate the likelihood.
Hence, the proposal operates on ϑ to propose a new candidate parameter ϑ but (30) is applied to obtain θ , which is used to compute the acceptance probability (31).After a run of this implementation, samples from the posterior of θ can be obtained by applying (30) to the samples from the posterior of ϑ.
To implement this, we need change the call to the particle filter and the computation of the acceptance probability.The complete implementation and code is available in the function example5_sv.The resulting mean estimate of the parameter posterior is {−0.12,0.96, 0.17} with standard deviation {0.28, 0.02, 0.05} and IACT {13, 21, 17}.Note that the maximum IACT is even smaller than in Section 6.3.1 resulting in a speed-up of 6.4 compared to the implementation in Section 5.

Selecting the number of particles
The particle filter plays an important role in the PMH algorithm and greatly influences the mixing.If the log-likelihood estimates are noisy (too small N in the particle filter), the chain tends to get stuck for several iterations and this leads to bad performance.This is the result of the fact that sometimes p θ (y 1:T ) p N θ (y 1:T ), which is due to the stochasticity of the estimator.Thus, balancing N and K to obtain good performance at a reasonable computational cost is an important problem.A small N might result in that we need to take K large and vice verse.This problem is investigated by Pitt et al. (2012) and Doucet   et al. (2015).A simple rule-of-thumb is to select N such that the standard deviation of the log-likelihood estimates is between 1.0 and 1.7.These results are derived under certain strict assumptions that might not hold in practice.
Here, we would like to investigate this rule-of-thumb in a practical setting.We make use of the implementation from Section 6.3.1 and re-run it for different choices of N .We also estimate the standard deviation of the log-likelihood estimator from the particle filter for each choice of N using the estimate of the posterior mean as the parameters and 1, 000 independent Monte Carlo runs.The proposal distribution is tuned using a pilot run as before and the Markov chain is initialised in the estimate of the posterior mean.The IACT is computed using L = 250 lags.Table 3 presents some quantities related to the particle filter and PMH as N is varied (keeping all other parameters fixed).
The optimal choice for N is between 100 and 300 depending on the choice of proposal according to the rule-of-thumb of the standard deviation of the log-likelihood.The objective for the user is often to minimise the computational time of PMH to obtain a certain number of uncorrelated samples from the posterior.Hence, a suitable benchmark quantity is the maximum IACT multiplied with the computational time for one iteration.Remember, that IACT is the estimated number of iterations between two uncorrelated samples.The benchmark quantity (last row of Table 3) is therefore an estimate of the computational time per uncorrelated sample.The results are a bit inconclusive, which is the result of the fact that the IACT is very challenging to estimate and thereby noisy.However, the rule-of-thumb seems to be valid for this model and any choice of N between 200 and 400 seems to be a good one.

Including geometric information
In practical applications, we typically encounter performance issues when the number of parameters p grows beyond 5 or as previously discussed when the chain is initialised far from the areas of high posterior probability.This problem occurs even if the rule-of-thumb from Sherlock et al. (2015) is used to tune the proposal.The reason for this is that the problem lies within the use of a random walk, which is known to poorly explore high-dimensional spaces.
To mitigate this problem, it can be useful to take geometrical information about the posterior into account.Girolami and Calderhead (2011) show how to make use of the gradient and the Hessian of the log-posterior to guide the Markov chain to areas of high posterior probability.
In the paper, this is motivated by diffusion processes on Riemann manifolds.However, a perhaps simpler analogy is found in optimisation, where we can make use of noisy gradient ascent or Newton updates in the proposal.Gradient information is useful to guide the Markov chain towards the area of interest during the burn-in and also to keep it in this area after the burn-in phase.The Hessian information can be used to rescale the parameter posterior to make it closer to be isotropic, which greatly simplifies sampling.
In Dahlin et al. (2015b), the authors show how to make use of this type of proposals in the PMH algorithm.We refer to the proposal that makes use of only gradient information as PMH1 (for first-order).The proposal that makes use of both gradient and Hessian information is referred to as PMH2 (for second-order).The challenge here is to obtain good estimates of the gradient and Hessian, which both are analytically intractable for a nonlinear SSM.Furthermore, the computation of these quantities usually requires the use of a particle smoother with a high computational cost, which is prohibitive inside the PMH algorithm.To mitigate this problem, they propose to make use of the faster but more inaccurate fixed-lag particle smoother and instead regularise the Hessian estimate when it is non-positive definite.
In Dahlin et al. (2015c), the authors describe a quasi-Newton PMH2 proposal (qPMH2) based on a noisy quasi-Newton update that does not require any Hessian information but constructs a local approximation of the Hessian based on gradient information.PMH1 and similar algorithms are theoretically studied by Nemeth et al. (2016), which offers a rule-ofthumb to tune the step sizes based on an estimate of the posterior covariance.

Related software
There are a number of software packages related to the current tutorial which implements (i) PMH and/or (ii) particle filtering and SMC.These can be used to quickly carry out Bayesian parameter inference in new SSMs or as building blocks for the user who would like to create his/her own implementations of PMH.
The software LibBi (Murray 2015) provides a platform for Bayesian inference in SSMs using both serial and parallel hardware with a particular focus on high performance computing.It is written in C++ and allows the user to define new models using a modelling language similar to JAGS or BUGS.This enables the user to quickly solve the parameter inference problem using PMH and SMC 2 in many different types of SSMs.An example of using LibBi for inference in the SV model introduced in Section 5 is offered via the homepage 10 , which is helpful in learning how to use the software.Furthermore, interfaces for R and Octave are available via RBi via CRAN and OctBi via the LibBi homepage, respectively.This software is suitable for the user who would like to leverage the power of multi-core computing to carry out inference quickly on large datasets.
Another alternative is Biips (Todeschini et al. 2014), which is implemented in C++ with interfaces to R and MATLAB via the package Rbiips and the toolbox Matbiips, respectively.The functionality of Biips is similar to LibBi with the crucial difference on the focus on parallel computations of the latter.Hence, Biips allows the user to define a model using the BUGS language and carry out inference using PMH.The homepage11 connected to Biips contains some example code that implements the model from Section 5.
The R package pomp (King et al. 2016) includes an impressive setup of different inference methods for SSMs.Among others, this includes particle filtering, PMH and also approximate Bayesian computations (ABC; Marin et al. 2012).This software is a good choice for the reader who use R on a regular basis and would like to explore some recent developments in statistical computing.It also contains a number of pre-defined models for epidemiology.The model specification in pomp is a bit more complicated compared with LibBi and Biips, which could be a drawback for some users.
Probabilistic programming languages (PPLs) constitutes a fairly recent contribution to software for statistical modelling.They extend graphical models with stochastic branches, such as loops and recursions.Three popular PPLs which allow for carrying out inference using PMH in SSMs are Anglican (Tolpin et al. 2016), Venture (Mansinghka et al. 2014) and Probabilistic C (Paige and Wood 2014).Anglican is the most developed language of these three and it runs as a standalone application.Implementation of PMH in Anglican requires some additional coding compared with e.g., LibBi.However, it is a more general framework which allows for inference is a larger class of models.
A C++ template for particle filtering and more general SMC algorithms is provided by SM-CTC (Johansen 2009).This template is complemented with an interface to R via the package RCppSMC (Eddelbuettel and Johansen 2017) and extended to parallel computations in C++ via the software vSMC (Zhou 2015).These templates require additional implementation by the user to be able to carry out inference.They do not allow for easy specification of new models or to carry out inference using e.g., PMH directly.However, they give the user full control of the particle filter, which allows for adding extensions and tailoring the algorithm to a specific problem.
A parallel implementation of SMC using CUDA is provided by Lee et al. (2010) and the source code is available online12 .A MATLAB toolbox for running SMC algorithms in parallel called DeCo is developed by Casarin et al. (2015).DeCo makes use of SMC for combining densities connected to forecasts in economics, but could potentially be generalised for other applications.
To conclude, we would like to briefly mention some additional software.The Python library pyParticleEst (Nordh 2017) provides functionality for state estimation using different types of particle filters.It also includes parameter estimation using Maximum Likelihood via the Expectation Maximisation (EM; Dempster et al. (1977); McLachlan and Krishnan (2008)) algorithm.MH is implemented in Ox by Bos (2011) to estimate the states and parameters for the SV model in Section 5.In this setting, MH is applied to estimate both the states and parameters, so no particle filter is required.The R package smfsb (Wilkinson 2013) accompanying Wilkinson (2011) contains demonstration code for using PMH in a systems biology application.
The GitHub repository13 of the first author of this tutorial contains Python code for PMH1, PMH2, qPMH2 (Dahlin et al. 2015b,c) and pseudo-marginal MH with correlated random variables (Dahlin et al. 2015a).This code is quite similar to the Python supplied with this tutorial and is a good starting point for the interested reader who would like to try out more advanced implementations of PMH.

Conclusions
We have described the PMH algorithm for Bayesian parameter inference in nonlinear SSMs.This includes the particle filter as it plays an important role in PMH and provides an nonnegative unbiased estimator of the likelihood.Furthermore, we have applied PMH for inference in an LGSS model and a SV model using both synthetic and real-world data.
We also identified and discussed a wide range of practical matters related to PMH.This includes initialisation and tuning of the parameter proposal, which are important practical problems.Furthermore, we have provided many references for the reader who could like to continue his/her learning of particle filtering and PMH.The implementation developed within this tutorial can be seen as a compilation of minimal working examples.Hopefully, these code snippets can be of use for the interested reader as a starting point to develop his/her own implementations of the algorithms.
The implementations developed in this tutorial are available as the package pmhtutorial from CRAN.Source code for the implementation in R as well as similar code for MATLAB and Python is available from GitHub at: https://github.com/compops/pmh-tutorial. See the README.mdfiles in the directory corresponding to each programming language for specific comments and for dependencies.

Figure 2 :
Figure 2: Simulated data from the LGSS model with latent state (orange), observations (green) and autocorrelation function (ACF) of the observations (light green).

Figure 3 :
Figure 3: Top and middle: A simulated set of observations (top) from the LGSS model and the error in the latent state estimate (middle) using a particle filter with N = 20.Bottom: the estimated log-bias (left) and log-MSE (right) for the particle filter when varying the number of particles N .

Figure 4 :
Figure 4: The estimate of π θ [φ] in the LGSS model using the PMH algorithm using three different step sizes: = 0.01 (left), 0.10 (center) and 0.50 (right).Top: the estimate of π θ presented as a histogram and kernel density estimate (solid line).Middle: the state of the Markov chain at 1, 000 iterations after the burn-in.Bottom: the estimated ACF of the Markov chain.Dotted lines in the top and middle plots indicate the estimate of the posterior mean.The dotted lines in the bottom plot indicate the 95% confidence intervals of the ACF coefficients.

Figure 5 :
Figure 5: Top: the daily log-returns (dark green) and estimated log-volatility (orange) with 95% confidence intervals of the NASDAQ OMXS30 index for the period January 2, 2012 to January 2, 2014.Bottom: the posterior estimate (left), the trace of the Markov chain (middle) and the corresponding ACF (right) of µ (purple), φ (magenta) and σ v (green) obtained from PMH.The dotted and solid gray lines in the left and middle plots indicate the parameter posterior mean and the parameter priors, respectively.

Figure 6 :
Figure 6: The estimated marginal posterior for µ and φ (purple), µ and σ v (magenta) and φ and σ v (green) obtained from PMH.The dark contours (top/bottom) indicate the original proposal from Section 5 and the new (improved) proposal estimated from the pilot run, respectively.Both proposals are centred at the posterior mean estimate.

Figure 7 :
Figure 7: Top: the daily log-returns (dark green) and estimated log-volatility (orange) with 95% confidence intervals of the NASDAQ OMXS30 index for the period January 2, 2012 and January 2, 2014.Bottom: the posterior estimate (left), the trace of the Markov chain (middle) and the corresponding ACF (right) of µ (purple), φ (magenta) and σ v (green) obtained from PMH. Dotted and solid gray lines in the left and middle plots indicate the parameter posterior mean and the parameter priors, respectively.
Control variates are a common and useful variance reduction technique for standard Monte Carlo, which also can be applied to MCMC algorithms such as PMH.Some interesting work along these lines for MH are found inMira et al. (2013),Papamarkou et al. (2014),Mijatović and Vogrinc (2017) andDellaportas and Kontoyiannis (2012), which should be quite straightforward to implement for PMH.However to the knowledge of the authors of this paper, control variates have not yet been properly investigated for PMH but some encouraging preliminary results are presented in Chapter 4.3 ofDahlin (2016) based onPapamarkou et al. (2014).

Table 3 :
The standard deviation of the log-likelihood estimate, the acceptance probability, the maximum IACT for θ, the computational time per iteration of PMH and a benchmark quantity (see main text) for varying N .