vSMC: Parallel Sequential Monte Carlo in C++

Sequential Monte Carlo is a family of algorithms for sampling from a sequence of distributions. Some of these algorithms, such as particle filters, are widely used in the physics and signal processing literature. More recent developments have established their application to more general inference problems such as Bayesian modeling. These algorithms have attracted considerable attention in recent years because they admit natural and scalable parallelization. However, they are perceived as difficult to implement, and parallel programming, though conceptually appealing, is unfamiliar to many researchers in fields related to sequential Monte Carlo. This paper presents a C++ template library for implementing general sequential Monte Carlo algorithms on parallel hardware. Two examples are presented: a simple particle filter and a classic Bayesian modeling problem.


Introduction
Sequential Monte Carlo (SMC) methods are a class of sampling algorithms that combine importance sampling and resampling. They have been primarily used as "particle filters" to solve optimal filtering problems; see, for example, Cappé, Godsill, and Moulines (2007) and Doucet and Johansen (2011) for recent reviews. They are also used in a static setting where a target distribution is of interest, for example, for the purpose of Bayesian modeling. This was proposed by Del Moral, Doucet, and Jasra (2006b) and developed by Peters (2005) and Del Moral, Doucet, and Jasra (2006a). This framework involves the construction of a sequence of artificial distributions on spaces of increasing dimensions which admit the distributions of interest as particular marginals.
SMC algorithms are perceived as being difficult to implement, and general tools were not available until the work of Johansen (2009), which provided a general framework for implementing SMC algorithms. SMC algorithms admit natural and scalable parallelization. However, existing parallel implementations of SMC algorithms are mostly problem specific, usually associated with particular SMC-related research. Lee, Yau, Giles, Doucet, and Holmes (2010) studied the parallelization of SMC algorithms on GPUs with some generality. Nevertheless, there are few general tools for implementing SMC algorithms on parallel hardware, even though multicore CPUs are ubiquitous today and computing on specialized hardware such as GPUs is increasingly popular.
The purpose of the current work is to provide a general framework for implementing SMC algorithms on both sequential and parallel hardware. The presented framework has two main goals. The first is reusability. It will be demonstrated that the same implementation source code can be used to build a serialized sampler, or, using different programming models (for example, OpenMP (OpenMP Architecture Review Board 2011) and Intel TBB (Intel Corporation 2013c)), to build parallelized samplers for multicore CPUs. The samplers can be scaled to clusters using MPI (Message Passing Interface Forum 2012) with few modifications, and with a little effort they can also be built as parallelized samplers for massively parallel hardware such as GPUs using OpenCL (Khronos OpenCL Working Group 2012). The second is extensibility. It is possible to write a backend for vSMC that targets a new parallel programming model while reusing existing implementations. It is also possible to enhance the library to improve performance for specific applications. Almost all components of the library can be reimplemented by users; thus, if the default implementation is not suitable for a specific application, it can be replaced while integrating seamlessly with the other components.

Sequential importance sampling and resampling
Importance sampling is a technique which allows the calculation of the expectation of a function phi with respect to a distribution pi using samples from some other distribution eta with respect to which pi is absolutely continuous, based on the identity

E_pi[phi(X)] = Int phi(x) (pi(x)/eta(x)) eta(x) dx = E_eta[phi(X) pi(X)/eta(X)].

Thus, let {X^(i)}_{i=1}^N be samples from eta; then E_pi[phi(X)] can be approximated by

phi_hat = (1/N) Sum_{i=1}^N phi(X^(i)) pi(X^(i)) / eta(X^(i)).

In practice pi and eta are often known only up to normalizing constants, which can be estimated using the same samples. Let w^(i) = pi(X^(i))/eta(X^(i)); then we have

phi_hat = Sum_{i=1}^N W^(i) phi(X^(i)),

where W^(i) is proportional to w^(i) and the weights are normalized such that Sum_{i=1}^N W^(i) = 1.

Sequential importance sampling (SIS) generalizes the importance sampling technique to a sequence of distributions {pi_t}_{t>=0} defined on spaces {Prod_{k=0}^t E_k}_{t>=0}. At time t = 0, sample {X_0^(i)}_{i=1}^N from eta_0 and compute the weights W_0^(i) proportional to pi_0(X_0^(i))/eta_0(X_0^(i)). At time t >= 1, each particle path X_{0:t-1}^(i) is extended by a proposal kernel, the weights are updated recursively, and an importance sampling estimate of E_{pi_t}[phi_t(X_{0:t})] can be obtained using {W_t^(i), X_{0:t}^(i)}_{i=1}^N. However, this approach fails as t becomes large: the weights tend to become concentrated on a few particles as the discrepancy between eta_t and pi_t grows. Resampling techniques are applied such that a new particle system {W_bar_t^(i), X_bar_{0:t}^(i)}_{i=1}^M is obtained whose weighted expectations agree, in expectation, with those of the original system. In practice, the resampling algorithm is usually chosen such that M = N and W_bar_t^(i) = 1/N for i = 1, ..., N. Resampling can be performed at each time t or adaptively, based on some criterion of the discrepancy. One popular quantity used to monitor the discrepancy is the effective sample size (ESS), introduced by Liu and Chen (1998), defined as

ESS_t = 1 / Sum_{i=1}^N (W_t^(i))^2,

where {W_t^(i)}_{i=1}^N are the normalized weights. Resampling can then be performed when ESS <= alpha * N, where alpha is in [0, 1].
The common practice of resampling is to replicate particles with large weights and discard those with small weights. In other words, instead of generating a random sample {X_bar_{0:t}^(i)}_{i=1}^N directly, a random sample of integers {R^(i)}_{i=1}^N is generated, such that R^(i) >= 0 for i = 1, ..., N and Sum_{i=1}^N R^(i) = N, and each particle value X_{0:t}^(i) is replicated R^(i) times in the new particle system. The distribution of {R^(i)}_{i=1}^N shall fulfill the requirement of Equation 7. One such distribution is a multinomial distribution of size N and weights (W_t^(1), ..., W_t^(N)). See Douc, Cappé, and Moulines (2005) for some commonly used resampling algorithms.

SMC samplers
SMC samplers allow us to obtain, iteratively, collections of weighted samples from a sequence of distributions {pi_t}_{t>=0} defined over essentially any random variables on some spaces {E_t}_{t>=0}, by constructing a sequence of auxiliary distributions {pi_tilde_t}_{t>=0} on spaces of increasing dimensions,

pi_tilde_t(x_{0:t}) = pi_t(x_t) Prod_{s=0}^{t-1} L_s(x_{s+1}, x_s),

where the sequence of Markov kernels {L_s}_{s=0}^{t-1}, termed backward kernels, is formally arbitrary but critically influences the estimator variance. See Del Moral et al. (2006b) for further details and guidance on the selection of these kernels.
Standard sequential importance sampling and resampling algorithms can then be applied to the sequence of synthetic distributions {pi_tilde_t}_{t>=0}. At time t - 1, assume that a set of weighted particles {W_{t-1}^(i), X_{0:t-1}^(i)}_{i=1}^N approximating pi_tilde_{t-1} is available. At time t, the path of each particle is extended with a Markov kernel, say K_t(x_{t-1}, x_t), and the set of particles {X_{0:t}^(i)}_{i=1}^N then reaches the distribution eta_t(x_{0:t}) = eta_0(x_0) Prod_{s=1}^t K_s(x_{s-1}, x_s), where eta_0 is the initial distribution of the particles. To correct the discrepancy between eta_t and pi_tilde_t, Equation 6 is applied, and in this case

W_t^(i) is proportional to W_{t-1}^(i) w_tilde_t(X_{t-1}^(i), X_t^(i)),

where w_tilde_t, termed the incremental weights, are calculated as

w_tilde_t(x_{t-1}, x_t) = pi_t(x_t) L_{t-1}(x_t, x_{t-1}) / (pi_{t-1}(x_{t-1}) K_t(x_{t-1}, x_t)).

If pi_t is only known up to a normalizing constant, say pi_t(x_t) = gamma_t(x_t)/Z_t, then we can use the unnormalized incremental weights for importance sampling. Further, with the previously normalized weights {W_{t-1}^(i)}_{i=1}^N, the ratio of normalizing constants Z_t/Z_{t-1} can be estimated, and hence, sequentially, so can the ratio between the initial distribution pi_0 and some target pi_T, T >= 1. See Del Moral et al. (2006b) for details on calculating the incremental weights. In practice, when K_t is pi_t-invariant and an approximated suboptimal backward kernel

L_{t-1}(x_t, x_{t-1}) = pi_t(x_{t-1}) K_t(x_{t-1}, x_t) / pi_t(x_t)

is used, the unnormalized incremental weights become

w_tilde_t(x_{t-1}, x_t) = gamma_t(x_{t-1}) / gamma_{t-1}(x_{t-1}).

Other sequential Monte Carlo algorithms
Some other commonly used sequential Monte Carlo algorithms can be viewed as special cases of algorithms introduced above. The annealed importance sampling (AIS; Neal (2001)) can be viewed as SMC samplers without resampling.
Particle filters as seen in the physics and signal processing literature, can also be interpreted as the sequential importance sampling and resampling algorithms. See Doucet and Johansen (2011) for a review of this topic. A simple particle filter example is used in Section 5.1 to demonstrate basic features of the vSMC library.
Using the vSMC library

Overview

The vSMC library makes use of C++ template generic programming to implement general SMC algorithms. The library is formed by a few major modules, listed below. Some features not included in these modules are introduced later in context.

Core
The highest level of abstraction of SMC samplers. Users interact with classes defined within this module to create and manipulate general SMC samplers. Classes in this module include Sampler, Particle and others. These classes use user defined callback functions or callable objects, such as functors, to perform problem specific operations, such as updating particle values and weights. This module is documented in Section 4.1.
Symmetric Multiprocessing (SMP)
This is the form of computing most people use every day, on multiprocessor workstations and multicore desktops and laptops. Classes within this module make it possible to write generic operations that manipulate a single particle and can be applied either sequentially or in parallel through various parallel programming models. A method defined through classes of this module can be used by Sampler as a callback object. This module is documented in Section 4.2.
Message Passing Interface (MPI)
MPI is the de facto standard for parallel programming on distributed memory architectures. This module enables users to adapt implementations of algorithms written for the SMP module such that the same sampler can be parallelized using MPI. In addition, when used together with the SMP module, it allows easy implementation of hybrid parallelization such as MPI/OpenMP. Section 5.2 shows an example of how to extend existing SMP-parallelized samplers using this module.
OpenCL
This module is similar to the two above, except that it eases parallelization through OpenCL, for example for the purpose of general purpose GPU programming (GPGPU). OpenCL is a framework for writing programs that can be executed across heterogeneous platforms; OpenCL programs can run on either CPUs or GPUs. It is beyond the scope of this paper to give a proper introduction to GPGPU, OpenCL, and their use in vSMC. However, the relative performance of this programming model on both CPUs and GPUs is demonstrated in Section 5.2.

Obtaining and installation
vSMC is a header only library. There is practically no installation step. The library can be downloaded from http://zhouyan.github.io/vSMC/vSMC.zip. After downloading and unpacking, one can start using vSMC by ensuring that the compiler can find the headers inside the include directory. To permanently install the library in a system directory, one can simply copy the contents of the include directory to an appropriate place.
Alternatively, one can use the CMake (Martin and Hoffman 2010) configuration script (CMake 2.8 or later is required) and obtain the source with Git (Torvalds et al. 2013).

Documentation
To build the reference manual, one needs Doxygen (van Heesch 2013), version 1.8.3 or later. Continuing from the last step (still in the build directory), invoking make docs will create a doc directory inside build, which contains the HTML reference. Alternatively, the reference manual can also be found at http://zhouyan.github.io/vSMC/doc/html/index.html. It is beyond the scope of this paper to document every feature of the vSMC library; in many places we will refer to this reference manual for further information.

Third-party dependencies
vSMC uses the Random123 (Salmon, Moraes, Dror, and Shaw 2011) counter-based RNG for random number generation. For an SMC sampler with N particles, vSMC constructs N (statistically) independent RNG streams. It is possible to use millions of such streams without a huge memory footprint or other performance penalties. Since each particle has its own independent RNG stream, users are freed from many thread-safety and statistical independence considerations. It is highly recommended that users install this library. The other third-party dependency is the Boost library, version 1.49 or later. However, Boost is actually optional provided that the proper C++11 features are available in the standard library, for example when using clang (The LLVM Developer Group 2013a) with libc++ (The LLVM Developer Group 2013b). The C++11 headers of concern are <functional> and <random>. To instruct vSMC to use the standard library headers instead of falling back to the Boost library, one needs to define the corresponding configuration macros (for example, VSMC_HAS_CXX11LIB_RANDOM for <random>) before including any vSMC headers. The availability of these headers is also checked by the CMake configuration script.
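For illustration, the macro definitions might look as follows. VSMC_HAS_CXX11LIB_RANDOM is the macro named later in this paper; VSMC_HAS_CXX11LIB_FUNCTIONAL and the header path are assumptions following the same naming pattern, so the reference manual should be consulted for the exact names:

```
// Assumed macro and header names; VSMC_HAS_CXX11LIB_RANDOM appears later
// in this paper, the rest follow its naming pattern
#define VSMC_HAS_CXX11LIB_FUNCTIONAL 1
#define VSMC_HAS_CXX11LIB_RANDOM 1

#include <vsmc/core/sampler.hpp>
```

With these macros set, vSMC uses the C++11 <functional> and <random> headers instead of their Boost counterparts.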

Compiler support
vSMC has been tested with recent versions of clang, GCC (GNU Project 2013), the Intel C++ Compiler (Intel Corporation 2013a) and Microsoft Visual C++ (Microsoft Corporation 2012). vSMC can optionally use some C++11 features to improve performance and usability. In particular, as mentioned before, vSMC can use the C++11 standard library instead of the Boost library. At the time of writing, clang with libc++ has the most comprehensive C++11 support with respect to standards compliance and feature completeness. GCC 4.8, Microsoft Visual C++ 2012 and the Intel C++ Compiler 2013 also have very good C++11 support. Note that, provided the Boost library is available, all C++11 language and library features are optional; vSMC can be used with any C++98-conforming compiler.

Core module
The core module abstracts general SMC samplers, which can be viewed as being formed by a few concepts regardless of the specific problem. The following is a list of the most commonly seen components of SMC samplers and their corresponding vSMC abstractions.
• A collection of all particle state values, namely {X_t^(i)}_{i=1}^N. In vSMC, users need to define a class, say T, to abstract this concept. We refer to this as the value collection. We will slightly abuse the generic name T in this paper: whenever a template parameter is mentioned with the name T, it always refers to such a value collection type unless stated otherwise.
• A collection of all particle state values and their associated weights. This is abstracted by a Particle<T> object. We refer to this as the particle collection. A Particle<T> object has three primary sub-objects. One is the value collection object of the above type T. Another is an object that abstracts the weights {W_t^(i)}_{i=1}^N; by default this is a WeightSet object. The last is a collection of RNG streams, one for each particle; by default this is an RngSet object.
• Operations that perform tasks common to all samplers to these particles, such as resampling. These are the member functions of Particle<T>.
• A sampler that updates the particles (state values and weights) using user defined callbacks. This is a Sampler<T> object.
• Monitors that record the importance sampling estimates of certain functions defined for the values as the sampler iterates. These are Monitor<T> objects. A monitor for the estimate of E[h(X_t)] computes h(X_t^(i)) for each i = 1, ..., N. The function value h(X_t) is allowed to be a vector.
Note that within the core module, all operations are applied to Particle<T> objects, that is, to the whole collection {W_t^(i), X_t^(i)}_{i=1}^N, instead of a single particle. Later we will see how to write operations that can be applied to individual particles and can be parallelized easily.

Program structures
A vSMC program usually consists of the following tasks.
• Define a value collection type T.
• Configure the behavior of initialization and updating by adding callable objects to the sampler object.
• Initialize and iterate the sampler.
• Retrieve the outputs, estimates and other information.
In this section we document how to implement each of these tasks. Within the vSMC library, all public classes, namespaces and free functions are declared within the namespace vsmc.

The value collection
The template parameter T is a user defined type that abstracts the value collection. vSMC does not restrict how the values are actually stored. They can be stored in memory, spread among the nodes of a cluster, held in GPU memory, or even kept in a database. However, this kind of flexibility comes at a small price: the value collection needs to fulfill two requirements. We will see later that for the most common usage, vSMC provides readily usable implementations, on top of which users can create problem specific classes.
First, the value collection class T has to provide a constructor of the form

T(SizeType N);

where SizeType is some integer type. Since vSMC allows one to allocate the states in any way suitable, one needs to provide this constructor, which Particle<T> can use to allocate them.
Second, the class has to provide a member function named copy that copies each particle according to the replication numbers produced by a resampling algorithm. For the same reason as above, vSMC has no way to know how to extract and copy a single particle when it performs resampling. The signature of this member function may look like

template <typename SizeType>
void copy(std::size_t N, const SizeType *copy_from);

The pointer copy_from points to an array of N elements, where N is the number of particles. After calling this member function, the value of particle i shall be copied from particle j = copy_from[i]. In other words, particle i is a child of particle copy_from[i], or copy_from[i] is the parent of particle i. If a particle j shall remain itself, then copy_from[j] == j. How the values are actually copied is user defined. Note that the member function can take other forms, as usual when using C++ template generic programming. The actual type of the pointer copy_from, SizeType, is Particle<T>::size_type, which depends on the type T. For example, defining the member function as

void copy(int N, const std::size_t *copy_from);

is also allowed, provided that Particle<T>::size_type is indeed std::size_t. However, writing it as a member function template releases users from finding the actual type of the pointer copy_from and from sometimes troublesome forward declaration issues. We will not elaborate on such technical issues further.

Constructing a sampler
Once the value collection class T is defined, one can start constructing SMC samplers. For example, the following line creates a sampler with N particles:

Sampler<T> sampler(N);

The number of particles is the only mandatory argument of the constructor. There are two optional parameters. The complete signature of the constructor is

explicit Sampler<T>::Sampler(Sampler<T>::size_type N,
    ResampleScheme scheme = Stratified, double threshold = 0.5);

The scheme parameter is self-explanatory; vSMC provides a few built-in resampling schemes, and user defined resampling algorithms can also be used (see the reference manual for details). The threshold parameter is the value of ESS/N below which resampling will be performed. Obviously, if threshold >= 1 then resampling will be performed at every iteration, and if threshold <= 0 then resampling will never be performed. Both parameters can be changed later; however, the size of the sampler can never be changed after construction.

Initialization and updating
All the callable objects that initialize, move and weight the particle collection can be added to a sampler through a set of member functions. All these objects operate on the Particle<T> object. Because vSMC allows one to manipulate the particle collection as a whole, in principle many kinds of parallelization are possible.
vSMC distinguishes two kinds of updating operations. One is simply called a move, and is performed before the possible resampling at each iteration; these moves usually perform the updating of the weights, among other tasks. The other is called an mcmc, and is performed after the possible resampling; these are often MCMC-type moves. Multiple move's or mcmc's are also allowed: in fact, a vSMC sampler consists of a queue of move's and a queue of mcmc's. The move's in the queue can be changed through Sampler<T>::move,

Sampler<T> &move(const Sampler<T>::move_type &new_move, bool append);

If append == true then new_move is appended to the existing (possibly empty) queue; otherwise, the existing queue is cleared and new_move is added. The member function returns a reference to the updated sampler. For example, the following move,

std::size_t move_func(std::size_t iter, Particle<T> &particle);

can be added to a sampler by

sampler.move(move_func, false);

This will clear the (possibly empty) existing queue of move's and set a new one. To add multiple moves into the queue,

sampler.move(move1, true).move(move2, true).move(move3, true);

Objects of class type with operator() properly defined can also be used, as with the initialization method. The queue of mcmc's can be manipulated similarly. See the reference manual for other methods that can be used to manipulate these two queues.
In principle, one can combine all moves into a single move. However, sometimes it is more natural to think in terms of a queue of moves. For instance, if a multi-block Metropolis random walk consists of kernels K_1 and K_2, then one can implement each of them as a function, say mcmc_k1 and mcmc_k2, and add them to the sampler sequentially,

sampler.mcmc(mcmc_k1, true).mcmc(mcmc_k2, true);

At each iteration they will then be applied to the particle collection sequentially, in the order in which they were added.

Running the algorithm, monitoring and outputs
Having set all the operations, one can initialize and iterate the sampler by calling

sampler.initialize((void *)param);
sampler.iterate(iter_num);

The param argument to initialize is optional, with NULL as its default; this parameter is passed to the user defined init_func. The iter_num argument to iterate is also optional; the default is 1.
Before initializing the sampler or after a certain time point, one can add monitors to the sampler. The concept is similar to BUGS (Spiegelhalter, Thomas, and Best 1996)'s monitor statement, except it does not monitor the individual values but rather the importance sampling estimates.
Say we want to monitor an m-vector function h(X_t), whose importance sampling estimate at each iteration is Sum_{i=1}^N W_t^(i) h(X_t^(i)), where {W_t^(i)}_{i=1}^N is the N-vector of normalized weights. To compute this importance sampling estimate, one needs to define the following evaluation function (or a class with operator() properly defined),

void monitor_eval(std::size_t iter, std::size_t m,
    const Particle<T> &particle, double *res);

and add it to the sampler by calling

sampler.monitor("some.user.chosen.variable.name", m, monitor_eval);

When the function monitor_eval is called, iter is the iteration number of the sampler and m is the same value as the one the user passed to Sampler<T>::monitor; thus one does not need a global variable or similar techniques to access this value. The output pointer res points to an N by m output array in row major order; that is, after the function is called, res[i*m + j] shall hold the value of the jth element of h(X_t^(i)). After each iteration of the sampler, the importance sampling estimate will be computed automatically. See the reference manual for various ways to retrieve the results. Usually it is sufficient to output the sampler by

std::ofstream output("file.name");
output << sampler << std::endl;

where the output file will contain the importance sampling estimates among other things. Alternatively, one can use the Monitor<T>::record member function to access specific historical results. See the reference manual for details of the various overloaded versions of this member function.
A reference to the value collection object of type T can be retrieved through the Particle<T>::value member function. Particle<T> objects manage the weights through a weight set object, which by default is of type WeightSet; the Particle<T>::weight_set member function returns a reference to this weight set object. A user defined weight set class that abstracts {W_t^(i)}_{i=1}^N can also be used. The details involve some more advanced C++ template techniques and are documented in the reference manual. One possible reason for replacing the default is to provide special memory management of the weight set. For example, the MPI module provides a special weight set class that manages weights across multiple nodes and performs proper normalization, computation of the ESS, and other tasks.

Implementing initialization and updating
So far we have only discussed how to add initialization and updating objects to a sampler. To actually implement them, one writes callable objects that operate on the Particle<T> object. For example, a move can be implemented through the following function, as mentioned before,

std::size_t move_func(std::size_t iter, Particle<T> &particle);

Inside the body of this function, one can change the values by manipulating the object through the reference returned by particle.value().
One important thing to note is that whenever one of the weight-setting member functions is called, both the weights and the logarithm of the weights will be recalculated and normalized, and the ESS will be updated. The reason for not allowing a single particle's weight to be changed is that, in a multithreaded environment, one thread could change one weight while another thread reads a weight without proper normalization. Conceptually, changing one weight actually changes all the normalized weights.

Generating random numbers
The Particle<T> object has a sub-object which is a collection of RNG engines that can be used with the C++11 <random> or Boost distributions. For each particle i, one can obtain an engine that produces an RNG stream independent of the others by

particle.rng(i);

To generate random variates, one can use the C++11 <random> library, for example,

std::normal_distribution<double> rnorm(mean, sd);
double r = rnorm(particle.rng(i));

or use the Boost library,

boost::random::normal_distribution<double> rnorm(mean, sd);
double r = rnorm(particle.rng(i));

vSMC itself also makes use of C++11 <random> or Boost, depending on the value of the configuration macro VSMC_HAS_CXX11LIB_RANDOM. Though users are free to choose which one to use in their own code, there is a convenient alternative: each class defined in C++11 <random> is imported into the vsmc::cxx11 namespace. Therefore one can use

cxx11::normal_distribution<double> rnorm(mean, sd);

while the underlying implementation of normal_distribution can be either the C++11 standard library or Boost. The benefit is that if one needs to develop on multiple platforms, only some of which support C++11 while others have the Boost library, then only the configuration macro VSMC_HAS_CXX11LIB_RANDOM needs to be changed; this can be configured through CMake and other build systems. Of course, one can also use an entirely different RNG system from those provided by vSMC.

The value collection
Many typical problems' value collections can be viewed as a matrix of some type. For example, a simple particle filter whose state is a vector of length Dim and type double can be viewed as an N by Dim matrix, where N is the number of particles. A trans-dimensional problem can use an N by 1 matrix whose element type is a user defined class, say StateType. For these kinds of problems, vSMC provides a class template

template <MatrixOrder Order, std::size_t Dim, typename StateType>
class StateMatrix;

which provides the constructor and the copy member function required by the core module interface, as well as methods for accessing individual values. The first template parameter (with possible values RowMajor or ColMajor) specifies how the values are ordered in memory; usually one shall choose RowMajor to optimize data access. The second template parameter is the number of variables, an integer value no less than 1, or the special value Dynamic, in which case StateMatrix provides a member function resize_dim so that the number of variables can be changed at runtime. The third template parameter is the type of the state values.
Each particle's state is thus a vector of length Dim, indexed from 0 to Dim - 1. To obtain the value at position pos of the vector of particle i, one can use one of the following member functions,

StateMatrix<RowMajor, Dim, StateType> value(N);
StateType val = value.state(i, pos);
StateType val = value.state(i, Position<Pos>());
StateType val = value.state<Pos>(i);

where Pos is a compile time constant expression whose value is the same as pos, assuming the position is known at compile time. One can also read all values. To read the variable at position pos for all particles,

std::vector<StateType> vec(value.size());
value.read_state(pos, vec.begin());

Or one can read all values through a single iterator,

std::vector<StateType> mat(Dim * value.size());
value.read_state_matrix(ColMajor, mat.begin());

Alternatively, one can read all values through an iterator which points to iterators; see the reference manual for this interface. If the compiler supports C++11 <tuple>, vSMC also provides a StateTuple class template, which is similar to StateMatrix except that the types of the values do not have to be the same for each variable. This is similar to R (R Core Team 2013)'s data.frame. For example, suppose each particle's state is formed by two double's, an int and a user defined type StateType; then the following constructs a value collection using StateTuple,

StateTuple<ColMajor, double, double, int, StateType> value(N);

And there are a few ways to access the state values, similar to StateMatrix,

double x0 = value.state(i, Position<0>());
int x2 = value.state<2>(i);
std::vector<StateType> vec(value.size());
value.read_state(Position<3>(), vec.begin());

See the reference manual for details.

A single particle
For a Particle<T> object, one can construct a SingleParticle<T> object that abstracts one particle of the collection. For example,

Particle<T> particle(N);
SingleParticle<T> sp(i, &particle);

creates a SingleParticle<T> object corresponding to the particle at position i. There are a few member functions of SingleParticle<T> that make access to individual particles easier than through the interface of Particle<T>. First, sp.id() returns the value of the argument i in the above code that created this SingleParticle<T> object. In addition, sp.rng() is equivalent to particle.rng(i). Also, sp.particle() returns a constant reference to the Particle<T> object, and sp.particle_ptr() returns a pointer to such a constant Particle<T> object. Note that one cannot get write access to the Particle<T> object through the interface of SingleParticle<T>; instead, one can only get write access to a single particle. For example, if T is a StateMatrix instantiation or a class derived from it, then sp.state(pos) is equivalent to particle.value().state(i, pos), and the reference it returns is mutable. See the reference manual for more information on the interface of the SingleParticle class template. SingleParticle<T> objects are usually not constructed by users, but rather by the library's other classes in the SMP module, and passed to user defined functions, as we will see very soon.

Sequential and parallel implementations
Once we have the SingleParticle<T> concept, we are ready to introduce how to write implementations of SMC algorithms that manipulate a single particle and can be applied to all particles either in parallel or sequentially. For sequential implementations, this can be done through five base classes,

template <typename BaseState> class StateSEQ;
template <typename T, typename D = Virtual> class InitializeSEQ;
template <typename T, typename D = Virtual> class MoveSEQ;
template <typename T, typename D = Virtual> class MonitorEvalSEQ;
template <typename T, typename D = Virtual> class PathEvalSEQ;

The template parameter BaseState needs to satisfy the general value collection requirements and, in addition, provide a copy_particle member function; StateMatrix is one example. The other base classes expect T to satisfy the general value collection requirements. The details of all these class templates can be found in the reference manual. Here we use the MoveSEQ<T> class as an example to illustrate their usage. Recall that Sampler<T> expects, as a move, a callable object with the following signature,

std::size_t move_func(std::size_t iter, Particle<T> &particle);

For the purpose of illustration, the type T is defined as

typedef StateMatrix<RowMajor, 1, double> T;

Here is a typical implementation of such a function,

std::size_t move_func(std::size_t iter, Particle<T> &particle)
{
    std::vector<double> inc_weight(particle.size());
    for (Particle<T>::size_type i = 0; i != particle.size(); ++i) {
        cxx11::normal_distribution<double> rnorm(0, 1);
        particle.value().state(i, 0) = rnorm(particle.rng(i));
        inc_weight[i] = cal_inc_weight(particle.value().state(i, 0));
    }
    particle.weight_set().add_log_weight(inc_weight.begin());

    return 0;
}

where cal_inc_weight is some function that calculates the logarithm of the incremental weights. As we can see, there are three main parts in a typical move. First, we allocate a vector inc_weight.
Second, we generate normal random variates for each particle and calculate the incremental weights. This is done through a for loop. Third, we add the logarithm of the incremental weights. The first and the third are global operations while the second is local. The first and the third are often optional and may be absent. The local operation is usually the most computationally intensive part of SMC algorithms and can benefit the most from parallelization.
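The three-part structure can be sketched without the vSMC types. In the following self-contained sketch, cal_inc_weight is a hypothetical stand-in for a user-supplied function, and seeding each particle's engine by its index is a simplification of what particle.rng(i) provides:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the user-supplied incremental weight function.
inline double cal_inc_weight(double x) { return -0.5 * x * x; }

// The three parts of a typical move, written without the vSMC types:
// (1) a global allocation, (2) a local per-particle loop, and
// (3) a global update of the logarithm weights.
std::size_t move_func(std::size_t iter, std::vector<double> &state,
                      std::vector<double> &log_weight, unsigned seed)
{
    (void)iter;
    // Part 1: global allocation.
    std::vector<double> inc_weight(state.size());

    // Part 2: local work; each particle uses its own engine, which
    // mirrors the role of particle.rng(i).
    for (std::size_t i = 0; i != state.size(); ++i) {
        std::mt19937 rng(seed + static_cast<unsigned>(i));
        std::normal_distribution<double> rnorm(0, 1);
        state[i] = rnorm(rng);
        inc_weight[i] = cal_inc_weight(state[i]);
    }

    // Part 3: global accumulation of the logarithm incremental weights.
    for (std::size_t i = 0; i != state.size(); ++i)
        log_weight[i] += inc_weight[i];

    return 0;
}
```

Only Part 2 touches one particle at a time, which is why it is the part a parallel backend can distribute across threads.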
In the simplest case, MoveSEQ only takes away the loop around Part 2. However, if one implements the move in this way and then replaces MoveSEQ with MoveOMP, simply changing the base class name causes vSMC to use OpenMP to parallelize the loop. For example, one can declare the class as

#include <vsmc/smp/backend_omp.hpp>

class move : public MoveOMP<T>;

and use exactly the same implementation as before. Now when move::operator() is called, it will be parallelized by OpenMP. Other backends are available in case OpenMP is not. Among them are Cilk Plus (Intel Corporation 2011) and Intel TBB. In addition to these well known parallel programming models, vSMC also has its own implementation using the C++11 <thread> header. Further backends are documented in the reference manual. To use any of these parallelizations, all one needs to do is to change a few base class names. In practice, one can use conditional compilation to select, say, either a sequential implementation or an OpenMP parallelized one. Alternatively, one can configure the source file through a build system such as CMake, which can also determine which programming models are available on the system.
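The effect of switching backends by conditional compilation can be imitated in a self-contained form. Here a hypothetical macro VSMC_USE_OMP (our name, not vSMC's) selects between a sequential loop and an OpenMP parallel loop over the same per-particle body:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The same per-particle body, with the loop parallelized by OpenMP only
// when VSMC_USE_OMP is defined at compile time. Without OpenMP the
// function runs sequentially; the body itself never changes.
inline void transform_all(std::vector<double> &state)
{
#ifdef VSMC_USE_OMP
#pragma omp parallel for
#endif
    for (long i = 0; i < static_cast<long>(state.size()); ++i)
        state[i] = 2 * state[i] + 1; // per-particle "move_state" body
}
```

Compiling with -DVSMC_USE_OMP and an OpenMP-enabled compiler parallelizes the loop; otherwise the identical source builds a sequential version, which is the essence of vSMC's base-class-name switch.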

Adapter
Sometimes, the cumbersome task of writing a class to implement a move or other operation can outweigh the power we gain through object oriented programming. For example, a simple move_state-like function may be all that we need to get the work done. In such cases one can construct a move from such a function through the MoveAdapter class template, instantiated with the corresponding sequential, C++11 <thread>, Intel TBB, Cilk Plus, or OpenMP base class template. The first template parameter is the type of the value collection and the second is the name of the base class template. In fact, MoveAdapter's constructor accepts two more optional arguments, the pre_processor and the post_processor, corresponding to the other two aforementioned member functions. Similar adapters for the other three base classes also exist.
Another scenario where an adapter is desired is when the backend to use needs to be decided at runtime. The above simple adapters can already be used for this purpose. In addition, another form of the adapter is as follows,

class move;

MoveAdapter<T, MoveTBB, move> move_obj;
sampler.move(move_obj, false);

where the class move has the same definition as before except that it no longer derives from any base class. The class move is required to have a default constructor, a copy constructor and an assignment operator.
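The runtime dispatch idea can be sketched without vSMC using type erasure. The move_type alias, the backend names, and make_move below are all illustrative, not part of vSMC's API:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A sampler can store a type-erased move; which concrete wrapper backs
// it is then decided when the program runs.
using move_type =
    std::function<std::size_t(std::size_t, std::vector<double> &)>;

// The sequential backend of this sketch: increment every particle state.
inline std::size_t move_seq(std::size_t, std::vector<double> &s)
{
    for (std::size_t i = 0; i != s.size(); ++i)
        s[i] += 1;
    return 0;
}

// Select a backend by name at runtime. A real program could return an
// OpenMP- or TBB-backed wrapper sharing the same per-particle body;
// this sketch only provides the sequential one.
inline move_type make_move(const std::string &backend)
{
    if (backend == "seq")
        return move_seq;
    return move_seq; // fallback in this sketch
}
```

The adapter classes in vSMC play the role of make_move here: they wrap the same user code in differently parallelized callable objects.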

Thread-safety and scalability considerations
When implementing parallelized SMC algorithms, issues such as thread-safety cannot be avoided even though the vSMC library hides most parallel constructs from the user.
Classes in the vSMC library usually guarantee that their member functions are thread-safe in the sense that calling the same member function on different objects at the same time from different threads is safe. However, calling mutable member functions on the same object from different threads is usually not safe. Calling immutable member functions is generally safe.
There are a few exceptions.
• The constructors of Particle and Sampler are not thread-safe. Therefore, if one needs to construct multiple Sampler objects from different threads, mutex protection is needed. However, subsequent member function calls on the constructed objects are thread-safe according to the above rules.
• Member functions that concern a single particle are generally thread-safe in the sense that one can call them on the same object from different threads as long as they are called for different particles. For example Particle::rng and StateMatrix::state are thread-safe.
In general, one can safely manipulate different individual particles from different threads, which is the minimal requirement for scalable parallelization. However, one cannot manipulate the whole particle collection from different threads through member functions such as WeightSet::set_log_weight.
User defined callbacks shall generally follow the same rules. For example, for a MoveOMP subclass, pre_processor and post_processor do not have to be thread-safe, but move_state needs to be. In general, avoid write access to memory locations shared by all particles from move_state and other similar member functions. One also needs to take extra care when using third-party libraries. For example, in our implementation of the move class, the rnorm object, which is used to generate Normal random variates, is defined within move_state instead of being a class data member, even though it is created with the same parameters for each particle. This is because the call rnorm(sp.rng()) is not thread-safe in some implementations, for example when using the Boost library.
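This rule can be illustrated with a small self-contained sketch. The rng_set class below is our own simplification, not vSMC's RNG set: seeding each engine by index stands in for proper parallel RNG streams.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One engine per particle, mirroring particle.rng(i): concurrent draws
// for *different* particles never touch shared mutable state.
struct rng_set {
    explicit rng_set(std::size_t n)
    {
        for (std::size_t i = 0; i != n; ++i)
            eng_.push_back(
                std::mt19937_64(1 + static_cast<std::uint64_t>(i)));
    }

    std::mt19937_64 &rng(std::size_t i) { return eng_[i]; }

private:
    std::vector<std::mt19937_64> eng_;
};

inline double draw_normal(rng_set &rs, std::size_t i)
{
    // The distribution object is local to the call, exactly as rnorm is
    // local to move_state in the text: even if the engine type is safe,
    // a shared distribution object may carry mutable cached state.
    std::normal_distribution<double> rnorm(0, 1);
    return rnorm(rs.rng(i));
}
```

Because each engine is private to one particle, two threads drawing for particles i and j (i != j) never race; only manipulating the same particle's engine from two threads would be unsafe.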
For scalable performance, certain practices should be avoided when implementing member functions such as move_state. For example, dynamic memory allocation is usually lock-based and thus should be avoided; alternatively, one can use a scalable memory allocator such as the one provided by Intel TBB. In general, in functions such as move_state, one should avoid using locks to guarantee thread-safety, as they can become a bottleneck to parallel performance.

A simple particle filter
The model and algorithm
This is an example used in SMCTC (Johansen 2009). Through this example, we show how to re-implement a simple particle filter in vSMC, walking through the basic features of the library. SMCTC is the first general framework for implementing SMC algorithms in C++ and serves as one of the most important inspirations for the vSMC library. It is widely used by many researchers. One of the goals of the current work is that users familiar with SMCTC shall find little difficulty in using the new library.
The state space model, known as the almost constant velocity model in the tracking literature, provides a simple scenario. The state vector X_t contains the position and velocity of an object moving in a plane, that is, X_t = (x_t^pos, y_t^pos, x_t^vel, y_t^vel)^T. Imperfect observations Y_t = (x_t^obs, y_t^obs)^T of the positions are made at each time instance. The state and observation equations are linear with additive noise, and we assume that the elements of the noise vector V_t are independent Normal with variance 0.02 for position and 0.001 for velocity. The observation noise W_t comprises independent, identically t-distributed random variables with ν = 10 degrees of freedom. The prior at time 0 corresponds to an axis-aligned Gaussian with variance 4 for the position coordinates and 1 for the velocity coordinates. The particle filter algorithm is shown in Algorithm 1.
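The deterministic part of these dynamics can be sketched as follows. A unit time step is an assumption of this sketch, and the noise is passed in explicitly so the transition can be checked in isolation; in the sampler the noise components would be Normal draws with variances 0.02 (position) and 0.001 (velocity):

```cpp
#include <array>
#include <cassert>

// The state vector (x_pos, y_pos, x_vel, y_vel).
using state4 = std::array<double, 4>;

// Almost constant velocity transition: positions advance by the
// velocities over one time step, velocities stay (almost) constant,
// and additive noise perturbs every component.
inline state4 cv_transition(const state4 &x, const state4 &noise)
{
    state4 y;
    y[0] = x[0] + x[2] + noise[0]; // x_pos advances by x_vel
    y[1] = x[1] + x[3] + noise[1]; // y_pos advances by y_vel
    y[2] = x[2] + noise[2];        // velocities drift only via noise
    y[3] = x[3] + noise[3];
    return y;
}
```

With the noise set to zero, an object at the origin with velocity (1, 2) moves to position (1, 2) in one step, which is the "constant velocity" part of the model's name.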

Implementations
We first introduce the body of the main function, showing the typical work flow of a vSMC program.
#include <vsmc/core/sampler.hpp>
#include <vsmc/smp/adapter.hpp>
#include <vsmc/smp/state_matrix.hpp>
#include <vsmc/smp/backend_seq.hpp>

const std::size_t DataNum = 100;
const std::size_t ParticleNum = 1000;
const std::size_t Dim = 4;

The Iteration step is repeated until all data are processed,

    sampler.initialize((void *)"pf.data");
    sampler.iterate(DataNum - 1);

    std::ofstream est("pf.est");
    est << sampler << std::endl;
    est.close();
    est.clear();

    return 0;
}

In the main function, we construct a sampler with ParticleNum particles. Then we add the initialization function cv_init and the move object of type cv_move, as well as a monitor that records the importance sampling estimates of the two position parameters. Next, we initialize the sampler with the data file pf.data and iterate until all data are processed. Finally, we output the results to a file called pf.est. The class cv will be our value collection, a subclass of StateMatrix<RowMajor, Dim, double>. To illustrate both the core module and the SMP module, the initialization cv_init will be implemented as a standalone function, while the move cv_move will be implemented as a class derived from MoveSEQ<cv>. To monitor the importance sampling estimates of the two position parameters, we will implement a simple function cv_monitor_state and use the adapter MonitorEvalAdapter<cv, MonitorEvalSEQ>.
The value collection is an N by Dim (in this case Dim = 4) matrix of type double. We could simply use StateMatrix<RowMajor, Dim, double> as our value collection. However, we would like to enhance its functionality through inheritance. First, since the data is shared by all particles, it is natural to bind it with the value collection. Second, both the initialization and the move need to calculate the log-likelihood. We could implement it as a standalone function, but since the log-likelihood function needs to access the data, it is convenient to implement it as a member function of the value collection cv. Here is the full implementation of this simple value collection class. The log_likelihood member function accepts the iteration number and the particle's id as arguments and returns the log-likelihood of the id'th particle at iteration iter. The read_data member function simply reads the data from a file.
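The storage layout such a value collection builds on can be sketched in a few lines. The class and member names below are our own minimal illustration, not vSMC's actual StateMatrix implementation, which additionally provides copying for resampling, iterators and more:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A minimal RowMajor, N-by-Dim matrix of doubles with a mutable
// state(i, pos) accessor, in the spirit of StateMatrix.
template <std::size_t Dim>
class state_matrix {
public:
    explicit state_matrix(std::size_t n) : size_(n), data_(n * Dim) {}

    std::size_t size() const { return size_; }

    // Row-major indexing: particle i's components are contiguous.
    double &state(std::size_t i, std::size_t pos)
    { return data_[i * Dim + pos]; }

    const double &state(std::size_t i, std::size_t pos) const
    { return data_[i * Dim + pos]; }

private:
    std::size_t size_;
    std::vector<double> data_;
};
```

Row-major storage keeps each particle's Dim components adjacent in memory, which is cache-friendly when a move touches all components of one particle at a time.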
The initialization is implemented through the cv_init function,

std::size_t cv_init (Particle<cv> &particle, void *filename);

It first checks whether filename is NULL. If it is not, we use it to read the data. So the first initialization may look like

sampler.initialize((void *)filename);

and after that, if we want to re-initialize the sampler, we can simply call it again with a NULL pointer.
The updating method cv_move is similar to cv_init. It updates the particle values by adding Normal random variates. However, as we saw above, each call to the move causes a log weight vector to be allocated, and its size does not change between iterations. So it can be viewed as a resource of cv_move, and it is natural to use a class object to manage it. Here is the implementation of cv_move. First, before move_state is called for any particle, the pre_processor is called, as described in Section 4.2. At this step, we resize the vector used for storing the incremental weights. After the first resizing, subsequent calls to resize will only cause reallocation if the size has changed. In our example, the size of the particle system is fixed, so we do not need to worry about excessive dynamic memory allocation. The move_state member function moves each particle according to our model. After move_state has been called for each particle, the post_processor is called and we simply add the logarithm of the incremental weights.
For each particle, we want to monitor the x_pos and y_pos parameters and obtain the importance sampling estimates. To extract the two values from a particle, we can implement the following function and add it to the sampler through

sampler.monitor("pos", 2, cv_est);

If later we decide to monitor all states, we only need to change the 2 in the above line to Dim. After we have implemented all the above, compiled and run the program, a file called pf.est is written by the following statement in the main function,

est << sampler << std::endl;

The output file contains the ESS, resampling indicators, importance sampling estimates and other information in a table. This file can be read by most statistical software for further analysis. For instance, we can process it with R and plot the estimates of the positions against the observations, as shown in Figure 1.
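What the "pos" monitor computes each iteration is a self-normalized importance sampling estimate, i.e., a weighted average of the monitored components over the particles. The following sketch assumes the weights are already normalized; the function name is ours:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted averages of x_pos and y_pos across the particle system:
// the importance sampling estimates the monitor records.
inline std::vector<double> monitor_estimate(
    const std::vector<double> &xpos, const std::vector<double> &ypos,
    const std::vector<double> &weight)
{
    double ex = 0, ey = 0;
    for (std::size_t i = 0; i != weight.size(); ++i) {
        ex += weight[i] * xpos[i];
        ey += weight[i] * ypos[i];
    }
    std::vector<double> est;
    est.push_back(ex);
    est.push_back(ey);
    return est;
}
```

With two equally weighted particles at x positions 1 and 3, the estimate is their plain average, 2; unequal weights tilt the estimate toward the heavier particle.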

The model and algorithm
Since Richardson and Green (1997), the Gaussian mixture model (GMM) has provided a canonical example of a model-order-determination problem. We use the same model as in Del Moral et al. (2006b) to illustrate the implementation of this classic example from the Monte Carlo literature. The model is also used in Zhou, Johansen, and Aston (2013) to demonstrate the use of SMC in the context of Bayesian model comparison, which provides more details of the following setting. The model is as follows; the data y = (y_1, . . . , y_n) are independently and identically distributed as a mixture of r Normal components, where N(µ_j, λ_j^{-1}) denotes the Normal distribution with mean µ_j and precision λ_j; θ_r = (µ_{1:r}, λ_{1:r}, ω_{1:r}) and r is the number of components in each model.

Algorithm 2: SMC algorithm for Bayesian modeling of the Gaussian mixture model.

The parameter space is thus R^r × R_+^r × S_r, where S_r = {ω_{1:r} : 0 ≤ ω_j ≤ 1; ∑_{j=1}^r ω_j = 1} is the standard (r − 1)-simplex. The priors, which are the same for each component, are taken to be µ_j ~ N(ξ, κ^{-1}), λ_j ~ G(ν, χ) and ω_{1:r} ~ D(ρ), where D(ρ) is the symmetric Dirichlet distribution with parameter ρ and G(ν, χ) is the Gamma distribution with shape ν and scale χ. The prior parameters are set in the same manner as in Richardson and Green (1997); see also Zhou et al. (2013) for details. The data are simulated from a four component model with µ_{1:4} = (−3, 0, 3, 6) and λ_j = 2, ω_j = 0.25, j = 1, . . . , 4. Our interest is in simulating from the posterior distribution of models with r components, denoted by M_r, and obtaining the normalizing constant for the purpose of Bayesian model comparison (Robert 2007, chap. 7).
Numerous strategies are possible for constructing a sequence of distributions for the purpose of SMC sampling. One option is to use, for each model M_r, r ∈ {1, 2, . . .}, the sequence {π_t}_{t=0}^{T_r} defined by

π_t(θ_r) ∝ π(θ_r | M_r) p(y | θ_r, M_r)^{α(t/T_r)},

where the number of distributions, T_r, and the annealing schedule, α : [0, 1] → [0, 1], may be different for each model. This leads to Algorithm 2.
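In log form, evaluating a member of this sequence amounts to adding the log-prior and an α-scaled log-likelihood. The sketch below hard-codes a linear schedule α(t/T_r) = t/T_r purely for illustration:

```cpp
#include <cassert>
#include <cstddef>

// log pi_t(theta) = log prior + alpha(t / T) * log likelihood,
// up to the (unknown) normalizing constant.
inline double annealed_log_density(double log_prior, double log_likelihood,
                                   std::size_t t, std::size_t T)
{
    double alpha = static_cast<double>(t) / static_cast<double>(T);
    return log_prior + alpha * log_likelihood;
}
```

At t = 0 the target is the prior alone; at t = T_r the full posterior (prior times likelihood) is recovered, with intermediate distributions bridging smoothly between them.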
The MCMC kernel K_t in Algorithm 2 is constructed as a three-block Metropolis random walk:
1. Update µ_{1:r} through a Normal random walk.
2. Update λ_{1:r} through a random walk.
3. Update ω_{1:r} through a random walk.
The standard direct estimate of the normalizing constants (Del Moral et al. 2006b) can be obtained from the output of this SMC algorithm,

Z_{T_r}/Z_0 ≈ ∏_{t=1}^{T_r} ∑_{i=1}^N W_{t-1}^{(i)} p(y | θ_{t-1}^{(i)}, M_r)^{α(t/T_r) − α((t−1)/T_r)},

where W_{t-1}^{(i)} is the importance weight of sample θ_{t-1}^{(i)}.

Path sampling for estimation of normalizing constants
As shown in Zhou et al. (2013), the estimation of the normalizing constant associated with our sequence of distributions can also be achieved by a Monte Carlo approximation to the path sampling formulation given by Gelman and Meng (1998), also known as thermodynamic integration or Ogata's method. Given a parameter α which defines a family of distributions {p_α = q_α/Z_α}_{α∈[0,1]} that move smoothly from p_0 = q_0/Z_0 to p_1 = q_1/Z_1 as α increases from zero to one, one can estimate the logarithm of the ratio of their normalizing constants via the simple integral relationship

log(Z_1/Z_0) = ∫_0^1 E_α[ d log q_α(θ)/dα ] dα,

where E_α denotes expectation under p_α. The sequence of distributions in the SMC algorithm for this example can be interpreted as belonging to such a family of distributions, with α = α(t/T_r).
The SMC sampler provides us with a set of weighted samples obtained from a sequence of distributions suitable for approximating this integral. At each time t we can obtain an estimate of the expectation within the integral via the usual importance sampling estimator, and the integral can then be approximated by trapezoidal integration. In summary, the path sampling estimator of the ratio of normalizing constants, λ_{T_r} = log(Z_1/Z_0), is the trapezoidal sum of these importance sampling estimates over the grid of α values.
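Numerically, the estimator is an application of the trapezoidal rule; given the grid of α values and, at each, an importance sampling estimate of the expectation inside the integral, it reads:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Trapezoidal approximation of lambda = log(Z_1 / Z_0):
// alpha[t] is the grid of annealing parameters, integrand[t] the
// importance sampling estimate of E_alpha[d log q_alpha / d alpha].
inline double path_sampling(const std::vector<double> &alpha,
                            const std::vector<double> &integrand)
{
    double lambda = 0;
    for (std::size_t t = 1; t != alpha.size(); ++t) {
        lambda += 0.5 * (alpha[t] - alpha[t - 1])
            * (integrand[t] + integrand[t - 1]);
    }
    return lambda;
}
```

For the linear test function f(α) = 2α on the grid {0, 0.5, 1}, the trapezoidal rule is exact and recovers the integral ∫_0^1 2α dα = 1.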

Implementations
In this example we will implement the following classes.
• gmm_state is the value collection class.
• gmm_init is a class that implements operations used to initialize the sampler.
• gmm_move_smc is a class that implements the operations used to update the weights, as well as selecting the random walk proposal scales and the distribution parameter α(t/T_r).
• gmm_move_mu, gmm_move_lambda and gmm_move_weight are classes that implement the random walks, each for one of the three blocks.
• gmm_path is a class that implements monitors for the path sampling estimator. This class is similar to the importance sampling monitor introduced before. It is to be used with Sampler<gmm_state>::path_sampling. Its interface requirement will be documented later.
• gmm_alpha_linear and gmm_alpha_prior are classes that implement two of the many possible annealing schemes, α(t/T_r) = t/T_r (linear) and α(t/T_r) = (t/T_r)^p, p > 1 (prior).
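Stripped of the class machinery, the two schemes are plain functions of the progress s = t/T_r:

```cpp
#include <cassert>
#include <cmath>

// Linear annealing: alpha(s) = s.
inline double alpha_linear(double s) { return s; }

// "Prior" annealing: alpha(s) = s^p with p > 1, which increases alpha
// slowly at first, spending more iterations near the prior.
inline double alpha_prior(double s, double p) { return std::pow(s, p); }
```

Both map [0, 1] onto [0, 1] and agree at the endpoints; they differ only in how quickly the likelihood's influence is introduced along the way.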
• And last, the main function, which configures, initializes and iterates the samplers.
This example is considerably more complicated than the last one. Instead of documenting all the implementation details, for many classes we will only show the interfaces. In most cases, the implementations are straightforward as they are either data member accessors or straight translation of mathematical formulations. For member functions with more complex structures, detailed explanation will be given. Interested readers can see the source code for more details.
Later we will build both sequential and parallelized samplers. A few configuration macros will be defined at compile time. For example, the sequential sampler is compiled with the following header and macros,

#include <vsmc/smp/backend_seq.hpp>

#define BASE_STATE StateSEQ
#define BASE_INIT InitializeSEQ
#define BASE_MOVE MoveSEQ
#define BASE_PATH PathEvalSEQ

The definitions of these macros are changed at compile time to build the parallelized samplers. For example, when using OpenMP parallelization, the header backend_omp.hpp is used instead of backend_seq.hpp, and StateSEQ is changed to StateOMP, with similar changes to the other macros. In the distributed source code, this is configured by the CMake build system.
Again, we first introduce the main function. The required headers are the same as in the particle filter example, in addition to the SMP backend headers described above. The following variables, used in the main function, will be set by user input.
int ParticleNum;
int AnnealingScheme;
int PriorPower;
int CompNum;
std::string DataFile;

In the main function, we create objects that set the distribution parameter α(t/T_r) at each iteration according to the user input AnnealingScheme. Below is the main function. Note that some source code for the IO operations which set the above parameters is omitted.
It is obvious that the parameter class gmm_param needs to store the parameters (µ_{1:r}, λ_{1:r}, ω_{1:r}). We also associate with each particle its log-likelihood and log-prior. Here is the definition of the gmm_param class, with some data access member functions omitted. The comp_num member function allocates the memory for a given number of components. The save_old member function saves the current particle state. It is used before the state is updated with a random walk proposal, as we will see later when we implement gmm_move_mu. The mh_reject_mu member function accepts the Metropolis acceptance probability p and a uniform (0, 1] random variate, say u; it rejects the proposed change if p < u and restores the values of the parameters µ_{1:r} saved by save_old. The member functions mh_reject_lambda and mh_reject_weight do the same for the other two sets of parameters. All three also call mh_reject_common, which restores the stored log-likelihood and log-prior values. The use of these member functions will be seen in the implementation of gmm_move_mu, in the context of which their own implementation becomes obvious. Other member functions provide some useful computations, such as the logarithms of λ_{1:r}, which are used when computing the log-likelihood. The class gmm_state contains some properties common to all particles, such as the data and the distribution parameter α(t/T_r). We also have it record the logarithm of the ratio of normalizing constants, using the NormalizingConstant class. We will see how to update this variable at each iteration in the implementation of gmm_move_smc. The prior parameters are also stored in the value collection. Here is the definition of this value collection class, again with some data access member functions omitted. The variable alpha_inc_ is ∆α = α(t/T_r) − α((t − 1)/T_r), which will be used when we update the weights.
The variable nc_ of type NormalizingConstant will be updated when the weights are changed by gmm_move_smc, and it will compute the standard normalizing constant estimate λ̂_{T_r}. The variables mu0_ and sd0_ are the prior parameters of the means µ_{1:r}. The variables shape0_ and scale0_ are the prior parameters of the precisions λ_{1:r}. The variables mu_sd_, lambda_sd_, and weight_sd_ are the proposal scales of the three random walks, respectively. The data access member functions of these variables are omitted in the above source code snippet.
In the update_log_likelihood member function, the calculation is a straightforward translation of the mathematical formulation. The gmm_param::update_log_lambda member function is used before the loop; it simply calculates log λ_j for j = 1, . . . , r and stores the values. The purpose is to avoid repeated computation of these quantities inside the loop. When the function returns, it uses the mutable version of the gmm_param::log_likelihood member function to update the log-likelihood stored in the param object. This is the reason the function is named with an update prefix. As we will see later, whenever the parameter values are updated, the update is followed by calls to update_log_likelihood and update_log_prior, the latter of which is implemented in a similar fashion. Therefore the value we get by calling gmm_param::log_likelihood is always up-to-date while no repeated computation is involved. Surely there are other, possibly better, design choices; however, for this simple example, this design serves our purpose well.
After initialization, at each iteration, the gmm_move_smc class implements the updating of the weights as well as the selection of the proposal scales and the distribution parameter. For example, when using the linear annealing scheme, we can implement a gmm_alpha_linear class as the following. It accepts the total number of iterations T_r as an argument to its constructor, and it implements an operator() that updates the distribution parameter α(t/T_r). The prior annealing scheme can be implemented similarly. For simplicity and demonstration purposes, we only allow gmm_move_smc to be configured with different annealing schemes, and hard code the proposal scales. An industrial strength design might make this class a template with the annealing scheme and proposal scales as policy template parameters.

        double alpha = particle.value().alpha();
        alpha = alpha < 0.02 ? 0.02 : alpha;
        particle.value().mu_sd(0.15 / alpha);
        particle.value().lambda_sd((1 + std::sqrt(1 / alpha)) * 0.15);
        particle.value().weight_sd((1 + std::sqrt(1 / alpha)) * 0.2);

        incw_.resize(particle.size());
        weight_.resize(particle.size());
        particle.read_weight(weight_.begin());
        double coeff = particle.value().alpha_inc();
        for (vsmc::Particle<gmm_state>::size_type i = 0;
             i != particle.size(); ++i) {
            incw_[i] = coeff * particle.value().state(i, 0).log_likelihood();
        }
        particle.value().nc().add_log_weight(&incw_[0], particle.weight_set());
        particle.weight_set().add_log_weight(&incw_[0]);

        return 0;
    }

    private :

    alpha_setter_type alpha_setter_;
    std::vector<double> incw_;
    std::vector<double> weight_;
};

Note that cxx11::function is an alias to either std::function or boost::function, depending on the value of the configuration macro VSMC_HAS_CXX11LIB_FUNCTIONAL. Objects of this class type can be added to a sampler as a move. The operator() satisfies the interface requirement of the core module. First, it uses alpha_setter_ to set the distribution parameter α(t/T_r).
Second, it sets the proposal scales for the three Metropolis random walks according to the current value of α. Then it computes the unnormalized incremental weights. The NormalizingConstant class has a member function add_log_weight, not unlike the one with the same name in WeightSet. It accepts the logarithm of the incremental weights and a WeightSet object; the standard normalizing constant estimates will be computed using these values. Last, we also modify the WeightSet object itself by adding the logarithm of the incremental weights.
At each iteration, the random walks are also performed. Their implementations are straightforward. Below is the implementation of the random walk on the mean parameters; the random walks on the other parameters are similar.

        param.mu(i) += rmu(sp.rng());
        p = state.update_log_prior(param)
            + state.alpha() * state.update_log_likelihood(param) - p;
        double u = std::log(runif(sp.rng()));

        return param.mh_reject_mu(p, u);
    }
};

First we save in p the logarithm of the target density computed using the old values. Then we call gmm_param::save_old to save the old values. Next we update each parameter with a proposed Normal random variate and compute the new log-prior and log-likelihood, as well as the new value of the target density. Then we accept or reject according to the Metropolis algorithm, as implemented in gmm_param, which manages both the current states and the backups.
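The save/restore mechanics behind mh_reject_mu can be sketched in miniature. The param struct below is a hypothetical reduction of gmm_param to a single scalar; p is interpreted as the log Metropolis acceptance ratio and u as the log of a uniform (0, 1] draw, matching how they are computed above:

```cpp
#include <cassert>
#include <cstddef>

// Minimal save/restore: the parameter keeps a backup of itself; the
// proposal overwrites the current value, and rejection restores the
// backup.
struct param {
    double mu;
    double mu_old;

    void save_old() { mu_old = mu; }

    // p: log acceptance ratio; u: log of a uniform (0, 1] variate.
    // Returns the number of accepted proposals (0 or 1).
    std::size_t mh_reject_mu(double p, double u)
    {
        if (p < u) {
            mu = mu_old; // reject: restore the saved value
            return 0;
        }
        return 1; // accept: keep the proposed value
    }
};
```

In gmm_param the same pattern also restores the cached log-likelihood and log-prior via mh_reject_common, so a rejected proposal leaves the particle exactly as it was.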
Last, we need to monitor certain quantities for inference purposes. Recall that in the main function we used sampler.path_sampling(gmm_path()) to set up the monitoring of the path sampling integrands. The path_sampling member function requires a callable object with the following signature,

double path_eval (std::size_t iter, const Particle<T> &, double *res);

The input parameter iter is the iteration number, the value of t in Equation 18. The return value shall be the value of α_t. The output parameter res shall store the array of values of the path sampling integrands. Our implementation of gmm_path is a subclass of an SMP module base class, which provides an operator() that satisfies the above interface requirement. Its usage is similar to the MoveSEQ template introduced in Section 4.2. The path sampling integrands under this geometric annealing scheme are simply the log-likelihoods, and therefore the implementation of the gmm_path class is rather simple.

Results
After compiling and running the algorithm, the results were consistent with those reported in Del Moral et al. (2006b). For a more in-depth analysis of the methodology, extensions and results, see Zhou et al. (2013).
Extending the implementation using MPI
vSMC's MPI module assumes that identical samplers are constructed on each node, with possibly different numbers of particles to accommodate the differences in capacity among nodes. To extend the above SMP implementation for use with MPI, first, at the beginning of the main function, we add the following,

MPIEnvironment env(argc, argv);

to initialize the MPI environment. When the object env is destroyed at the exit of the main function, the MPI environment is finalized. Second, we need to replace the base value collection class template with StateMPI. So now gmm_state is declared as the following,

class gmm_state : public StateMPI<BASE_STATE<StateMatrix<RowMajor, 1, gmm_param> > >;

The implementation is exactly the same as before. Third, the gmm_param class now needs to be transferable using MPI. Unlike in the SMP situations, a simple copy constructor is not enough. vSMC uses the Boost MPI library, and thus one only needs to write a serialize member function for gmm_param such that the data can be serialized into bytes; see the documentation of the Boost MPI and serialization libraries for details. In summary, the following member function accepts an Archive object as input, and it can perform a store or a load operation based on the Archive type. In a load operation, the Archive object is like an input stream; in a store operation, it is like an output stream. Fourth, after user input of the sampler parameters, we need to synchronize them across all nodes. For example, for the ParticleNum parameter,

boost::mpi::communicator World;
boost::mpi::broadcast(World, ParticleNum, 0);

Last, any importance sampling estimates that are computed on each node need to be combined into final results.
For example, the path sampling results are now obtained by adding the results from each node together,

double ps_sum = 0;
boost::mpi::reduce(World, ps, ps_sum, std::plus<double>(), 0);
ps = ps_sum;

For the standard normalizing constant ratio estimator, we replace NormalizingConstant with NormalizingConstantMPI, which performs this and other tasks.
After these few lines of changes, the sampler is parallelized using MPI and can be deployed on clusters and other distributed memory architectures. On each node, the selected SMP parallelization is used to perform multi-threaded parallelization locally. vSMC's MPI module takes care of normalizing the weights and other tasks.

Parallelization performance
One of the main motivations behind the creation of vSMC is to ease parallelization with different programming models. The same implementation can be used to build different samplers based on which parallel programming models are supported on the user's platform. In this section we compare the performance of various SMP parallel programming models and of OpenCL parallelization.
For different numbers of particles, the wall clock time and speedup are shown in Figure 2.
For 10^4 or more particles, the differences among the programming models are minimal; they all achieve roughly 550% speedup. With smaller numbers of particles, vSMC's C++11 <thread> parallelization is less efficient than the other, industrial strength programming models. However, with 1000 or more particles, which is fewer than in typical applications, the difference is not very significant.
OpenCL implementations are also compared on the same workstation, which has an NVIDIA Quadro 2000 graphics card. OpenCL programs can be compiled to run on both CPUs and GPUs. For the CPU, there are the Intel OpenCL (Intel Corporation 2013b) and AMD APP OpenCL (Advanced Micro Devices, Inc. 2012) platforms. We use the Intel TBB implementation as a baseline for comparison. The same OpenCL implementation is used for all the CPU and GPU runtimes; therefore it is not particularly optimized for any of them. For the GPU implementation, in addition to double precision, we also tested a single precision configuration. Unlike modern CPUs, which have the same performance for double and single precision floating point operations (unless SIMD instructions are used, which can give at most a factor of 2 speedup), GPUs penalize double precision performance heavily.
For different numbers of particles, the wall clock time and speedup are plotted in Figure 3. With smaller numbers of particles, the OpenCL implementations have a high overhead compared to the Intel TBB implementation. With a large number of particles, AMD APP OpenCL has performance similar to the Intel TBB implementation, while Intel OpenCL is about 40% faster than the Intel TBB implementation, due to more efficient vectorization and compiler optimizations. The double precision GPU implementation has a 220% speedup and the single precision one nearly 1600%. As a rough reference for the expected performance gain, the CPU has a theoretical peak performance of 24.48 GFLOPS; the GPU has theoretical peaks of 60 GFLOPS in double precision and 480 GFLOPS in single precision, which represent 245% and 1960% speedup over the CPU, respectively. It is widely believed that OpenCL programming is tedious and hard. However, vSMC provides facilities to manage OpenCL platforms and devices, as well as common operations. Limited by the scope of this paper, the OpenCL implementation (distributed with the vSMC source) is not documented here. Overall, the OpenCL implementation is about 800 lines, including both host and device code, compared to the 500 line SMP implementation. Less than doubling the code base while gaining more than 15 times the performance, we consider the programming effort relatively small.

Discussion
This paper introduced a C++ template library intended for implementing general SMC algorithms and constructing parallel samplers with different programming models. While it is possible to implement many realistic applications within the presented framework, some technical proficiency is still required to implement the problem specific parts of the algorithms. Some basic knowledge of C++ in general, and of how to use a template library, is also required.
It has been shown that, with the presented framework, it is possible to implement parallelized, scalable SMC samplers in an efficient and reusable way. The performance of some common parallel programming models was compared using an example.
Future work may be worthwhile to further ease the implementation of SMC algorithms. However, there is a balance among performance, flexibility and ease of use. vSMC aims to be developer-friendly and to give users as much control as possible over all performance related aspects. For a BUGS-like interface, users may be interested in other software such as BiiPS (Caron, Todeschini, Legrand, and Moral 2012). In addition, LibBi (Murray 2013) provides a user friendly and high performance alternative with a focus on state-space models. Compared with these recent developments, vSMC is less accessible to those with little or no knowledge of C++. However, for researchers with expertise in C++, and in template generic programming in particular, vSMC provides a framework within which superior performance can potentially be obtained and greater flexibility and extensibility are possible.