DNest4: Diffusive Nested Sampling in C++ and Python

In probabilistic (Bayesian) inference, we typically want to compute properties of the posterior distribution, which describes knowledge of unknown quantities in the context of a particular dataset and the assumed prior information. The marginal likelihood, also known as the "evidence", is a key quantity in Bayesian model selection. The Diffusive Nested Sampling algorithm, a variant of Nested Sampling, is a powerful tool for generating posterior samples and estimating marginal likelihoods. It is effective at solving complex problems, including many where the posterior distribution is multimodal or has strong dependencies between variables. DNest4 is an open source (MIT licensed), multi-threaded implementation of this algorithm in C++11, along with associated utilities including: (i) RJObject, a class template for finite mixture models; (ii) a Python package allowing basic use without C++ coding; and (iii) experimental support for models implemented in Julia. In this paper we demonstrate DNest4 usage through examples including simple Bayesian data analysis, finite mixture models, and Approximate Bayesian Computation.


Introduction
Bayesian inference, where probability theory describes degrees of logical implication or subjective certainty, provides a powerful general basis for data analysis (O'Hagan and Forster 2004; Sivia and Skilling 2006). The result of such an analysis is typically posterior probabilities of various hypotheses, or a joint posterior probability distribution for the values of unknown parameters. In compact but standard notation, the posterior distribution for parameters θ given data D, within the context of prior information M, is

p(θ|D, M) = p(θ|M) p(D|θ, M) / p(D|M).

When comparing a set of models {M_j}, the posterior probability of model j is proportional to p(M_j) p(D|M_j), where

p(D|M_j) = ∫ p(θ_j|M_j) p(D|θ_j, M_j) dθ_j

is the marginal likelihood of model j, equal to the expected value of the likelihood function with respect to the prior distribution. This kind of calculation is often called "model selection" or "model averaging", and the results are often presented as ratios of marginal likelihoods, known as Bayes Factors. When discussing computational matters, the prior distribution for parameters is often written π(θ), the likelihood L(θ), and the marginal likelihood Z. A popular alternative name for the marginal likelihood, which emphasizes its role in Bayesian model averaging, is the "evidence".
Nested Sampling (NS; Skilling 2006) is a Monte Carlo method whose main aim is to calculate Z. However, it can also be used to generate samples to represent the posterior distribution π(θ)L(θ)/Z, or any other distribution proportional to π(θ)Φ [L(θ)] where Φ is any monotonic function. This latter property makes NS particularly useful for statistical mechanics calculations (Pártay, Bartók, and Csányi 2010;Baldock, Pártay, Bartók, Payne, and Csányi 2016), where the "canonical" family of distributions proportional to π(θ)L(θ) β is of interest. In such applications, L(θ) is usually equivalent to exp(−energy). NS is particularly efficient for this, since only a single run (exploring regions of higher and higher L) is needed, and different canonical distributions can be obtained by re-weighting the output points.
A defining feature of NS is that it works with a sequence of constrained prior distributions, proportional to π but restricted to regions of the parameter space where L(θ) is above some threshold ℓ:

p(θ|ℓ) ∝ π(θ) if L(θ) > ℓ, and zero otherwise.

DNS uses one or more particles in the parameter space, along with an integer index variable j for each particle, to explore the following joint distribution:

p(θ, j) = p(j) p(θ|j) = w_j π(θ) 1[L(θ) > ℓ_j] / X_j,

where the {ℓ_j} are a sequence of increasing likelihood thresholds or "levels" (with ℓ_0 = 0, i.e., a log-likelihood threshold of −∞), X_j is the amount of prior mass enclosed by level j, and {w_j} is the marginal distribution for j. The marginal distribution for θ is then a mixture of constrained priors:

p(θ) = Σ_j w_j π(θ) 1[L(θ) > ℓ_j] / X_j.

The DNS algorithm consists of two stages. In the first stage, the particle(s) are initialized from the prior π(θ), equivalent to the mixture of constrained priors with a single level whose log-likelihood threshold is −∞. The mixture of constrained priors evolves by adding new levels, each of which compresses the distribution by a factor of about e ≈ 2.71828 (see Brewer et al. (2011) for details). In this stage, the mixture weights {w_j} are set according to

w_j ∝ exp((j − J)/λ),

where J is the index of the highest level created so far and λ is a scale length. This enhances the probability that particles stay close to the highest likelihood regions seen. The second stage sets the mixture weights to be approximately uniform (w_j ∝ 1), with some tweaks described by Brewer et al. (2011).
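As a small illustration of the stage-one weighting, the following Python sketch computes normalized mixture weights following the exponential form w_j ∝ exp((j − J)/λ) described by Brewer et al. (2011). This is an illustrative re-implementation, not DNest4's actual code:

```python
import math

def stage_one_weights(num_levels, lam=10.0):
    """Level weights w_j proportional to exp((j - J)/lam),
    where J is the index of the highest level created so far."""
    J = num_levels - 1
    w = [math.exp((j - J) / lam) for j in range(num_levels)]
    total = sum(w)
    return [wj / total for wj in w]

# With lam = 10, particles strongly favor the most recent (highest) levels.
weights = stage_one_weights(30, lam=10.0)
```

Larger values of λ allow particles to spend more time in lower levels ("backtracking"), which is relevant to the backtracking scale length option discussed later.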
The mixture of constrained priors tends to be easier to sample than the posterior, as the prior is always a mixture component, allowing the MCMC chain to mix between different modes in some circumstances. The marginal likelihood estimate can also be more accurate than standard MCMC-based NS (Brewer et al. 2011), as less information is discarded. DNS has been applied several times in astrophysics (e.g., Pancoast, Brewer, and Treu 2014;Huppenkothen, Brewer, Hogg, Murray, Frean, Elenbaas, Watts, Levin, Van Der Horst, and Kouveliotou 2015;Brewer and Donovan 2015) and was recently used in a biological application (Dybowski, Restif, Goupy, Maskell, Mastroeni, and Grant 2015).
There are several ways of using DNest4. After installing the software (Section 4), you can implement a model as a C++ class (Section 8) and compile it to create an executable file to run DNest4 on that problem. This method offers full control over the design of your class, and allows the opportunity of optimizing performance by preventing a full re-computation of the log-likelihood when only a subset of the parameters has been changed.
Alternatively, the Python bindings (Section 11) allow you to specify a model class in Python, and run DNest4 entirely in the Python interpreter without having to invoke the C++ compiler.

Relation to other algorithms
The Diffusive Nested Sampling algorithm, and its implementation in DNest4, has advantages and disadvantages compared to other Bayesian computation algorithms and software.
In the Nested Sampling sphere, the simple C implementation given by Skilling (2006) leaves it as the user's responsibility to implement the MCMC moves used to explore the constrained priors. The DNS algorithm produces more accurate estimates of Z and ought to outperform classic NS on multimodal problems.
MultiNest (Feroz et al. 2009) is a Fortran Nested Sampling implementation that dispenses with MCMC for generating new particles from constrained prior distributions. Instead, the new particles are generated by making an approximation to the constrained prior using ellipsoids. This tends to work well in low to moderate dimensional parameter spaces (up to ∼ 40). In higher dimensions, DNest4 is more useful than MultiNest since it is based on MCMC. We expect MultiNest to be more useful than DNest4 only on low to moderate dimensional problems with slow likelihood evaluations, where DNest4's reliance on MCMC becomes costly. The more recent POLYCHORD (Handley et al. 2015) combines MultiNest-like methods with MCMC.
Outside of Nested Sampling, the popular JAGS (Plummer et al. 2003) and Stan (Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li, and Riddell 2016) packages are more convenient than DNest4 for specifying models, by virtue of their model-specification languages. For users who are interested only in the posterior distribution and do not need the marginal likelihood, these are very useful. However, being based on Gibbs sampling and Hamiltonian MCMC respectively, they can run into difficulty on multimodal posterior distributions.
emcee (Foreman-Mackey, Hogg, Lang, and Goodman 2013) is a popular Python package for MCMC based on affine-invariant ensemble sampling. The user needs only to specify a Python function evaluating the log of the posterior density (up to a normalizing constant). This user-friendliness is a key advantage of emcee, and its algorithm performs well on low to moderate dimensional (up to ∼ 50) parameter spaces with a unimodal target distribution (which can be highly dependent). However, it can give misleading results in higher dimensions (Huijser, Goodman, and Brewer 2015).
Algorithms based on "annealing" or "tempering", such as parallel tempering (Hansmann 1997) and annealed importance sampling (Neal 2001) are related to Nested Sampling and are useful on multimodal posterior distributions. However, they require much more tuning than NS (in the form of an annealing schedule) and do not work on phase change problems, unlike Nested Sampling (Skilling 2006).
The advantages and disadvantages of a subset of these packages are summarized in Table 1.

Markov chain Monte Carlo
DNS is built upon the Metropolis-Hastings algorithm. In this algorithm, the acceptance probability α is given by

α = min{1, [p(θ′) q(θ|θ′)] / [p(θ) q(θ′|θ)]},

where p is the target density and q(θ′|θ) is the proposal distribution used to generate a new position θ′ from the current position θ. Often, q is symmetric, so the q terms cancel in the acceptance probability.
In DNS, the target distribution is not the posterior but rather the joint distribution in Equation 8. Moves of θ are done keeping j fixed, so we only need to consider the Metropolis acceptance probability for fixed j, i.e., with respect to a single constrained prior like Equation 5. Hence, the appropriate acceptance probability for a proposed move from θ to θ′ is

α = min{1, [π(θ′) q(θ|θ′)] / [π(θ) q(θ′|θ)]} if L(θ′) > ℓ_j, and zero otherwise,

where ℓ_j is the likelihood threshold for the current level j. There are also moves that propose a change to j while keeping θ fixed, but the details are less relevant to the user.
For convenience later on, we separate the prior and proposal-related terms from the likelihood-related term, and write the former as

H = [π(θ′) q(θ|θ′)] / [π(θ) q(θ′|θ)].

The logarithm of H is the user's responsibility when implementing models, and will become relevant in Sections 8.2 and 8.3. In terms of ln(H), the acceptance probability becomes

α = min{1, exp(ln H)} if L(θ′) > ℓ_j, and zero otherwise.
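The within-level acceptance rule can be sketched in Python as follows. This is an illustrative function (not DNest4 code) that returns the acceptance probability α given ln(H), the proposed log-likelihood, and the current level's log-likelihood threshold:

```python
import math

def acceptance_probability(log_H, logl_proposed, level_logl_threshold):
    """Metropolis acceptance probability for a move at fixed level j:
    alpha = min(1, H) if L(theta') exceeds the level threshold, else 0."""
    if logl_proposed <= level_logl_threshold:
        return 0.0
    return min(1.0, math.exp(log_H))
```

In practice one would accept the move when a Uniform(0, 1) draw is below this value, or equivalently compare ln(u) against min(0, ln H) for numerical stability.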

Dependencies and installation
The following instructions apply to Unix-like operating systems such as GNU/Linux, Mac OS X, and FreeBSD.

Running DNest4
To demonstrate DNest4, we will use a simple linear regression example where the sampling distribution is

y_i ∼ Normal(m x_i + b, σ²), independently for each data point i,

and the priors are

m ∼ Normal(0, 1000²), b ∼ Normal(0, 1000²), ln σ ∼ Uniform(−10, 10).
These are naive diffuse priors and do not have any special status. The dataset is shown in Figure 4 and the code is included in the code/Examples/StraightLine subdirectory, and a slightly simplified version of the code is explained in Section 8. To execute DNest4 on this problem, go to this directory and execute main.
The output to the screen should contain information about levels and "saving particles to disk". After 10,000 particles have been saved to the disk, the run will terminate.

Output files
The executable main is responsible for the exploration part of the DNS algorithm (i.e., running the MCMC chain, building levels, and then exploring all the levels). It creates three text output files, sample.txt, sample_info.txt, and levels.txt.
The first output file, sample.txt, contains a sampling of parameter values that represents the mixture of constrained priors (the target distribution used in DNS), not the posterior distribution. Each line of sample.txt represents a point in parameter space. In the linear example, there are three parameters (m, b, and σ), so there are three columns in sample.txt. Each time a point is saved to sample.txt, DNest4 prints the message "Saving a particle to disk. N = ...".
The second output file, sample_info.txt, should have the same number of rows as sample.txt, because it contains metadata about the samples in sample.txt. The first column is the index j, which tells us which "level" the particle was in when it was saved. Level 0 represents the prior, and higher levels represent more constrained versions of the prior. The second column is the log-likelihood value, and the third column is the likelihood "tiebreaker", which allows Nested Sampling to work when there is a region in parameter space with nonzero prior probability where the likelihood is constant. The final column tells us which thread the particle belonged to: when you use DNest4 in multithreaded mode (see Appendix A), each thread is responsible for evolving one or more particles.
The third output file, levels.txt, contains information about the levels that were built during the run. The first column has estimates of the log(X) values of the levels, i.e., how compressed they are relative to the prior, in units of nats. For example, if a level has log(X) = −1.02, its likelihood value encloses exp(−1.02) ≈ 36.1% of the prior mass.
The second column contains the log likelihoods of the levels. The first level, with a log(X) value of 0 and a log likelihood of −10^308 (basically "minus infinity"), is simply the prior. The third column has the "tiebreaker" values for the levels, which again are not particularly useful unless your problem has likelihood plateaus. The fourth and fifth columns are the number of accepted proposals and the total number of proposals that have occurred within each level, which are useful for monitoring the Metropolis acceptance ratio as a function of level. The final two columns, called "exceeds" and "visits", are used to refine the estimates of the level compressions (and hence the log(X) values of the levels in column 1), as discussed in Section 3 of Brewer et al. (2011). The visits column counts the number of times a level (level j, say) has been visited, but only starts counting after the next level (j + 1) has been created. The exceeds column counts the number of times a particle that was in level j had a likelihood that exceeded that of level j + 1.

Postprocessing
The output files themselves are typically not immediately useful. The goal of running NS is usually to obtain posterior samples and the marginal likelihood Z, whereas sample.txt only contains samples from the mixture of constrained priors, so additional post-processing is required. This can be achieved by running the postprocess function from DNest4's Python package, which prints three quantities to the screen. These are the natural log of the marginal likelihood, ln(Z); the information H, which quantifies the degree to which D restricted the range of possible θ values; and the effective sample size, or number of saved particles with significant posterior weight. The postprocess function also saves a file, posterior_sample.txt, containing posterior samples (one per row). Unfortunately, it is harder to compute justified error bars on ln(Z) in DNS than it is in standard NS.
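The quantities involved can be understood from a simplified sketch. Given each saved particle's estimated prior mass and log-likelihood, ln(Z) follows from a log-sum-exp, and the effective sample size follows from the entropy of the normalized posterior weights. This is an illustration of the underlying formulas, not the actual postprocess implementation:

```python
import math

def ns_summaries(log_prior_masses, log_likelihoods):
    """Estimate ln(Z) and the effective sample size from saved particles.
    log_prior_masses: ln of the prior mass assigned to each particle."""
    # ln(Z) = ln sum_i exp(ln w_i + ln L_i), computed stably
    log_wL = [lw + ll for lw, ll in zip(log_prior_masses, log_likelihoods)]
    a = max(log_wL)
    log_Z = a + math.log(sum(math.exp(x - a) for x in log_wL))
    # Normalized posterior weights, and ESS = exp(entropy)
    p = [math.exp(x - log_Z) for x in log_wL]
    ess = math.exp(-sum(pi * math.log(pi) for pi in p if pi > 0.0))
    return log_Z, ess
```

For example, two particles that each carry half the prior mass and have equal likelihood 1 give ln(Z) = 0 and an effective sample size of 2.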
The postprocess function can be called while DNest4 is running. This is helpful for monitoring the progress of a run, by inspecting the output plots.

Options
A plain-text file called OPTIONS resides in the directory from which you execute a run. This file contains numerical parameters controlling the DNS algorithm. Here are the contents of OPTIONS for the linear regression example:

Number of particles
The first option is the number of particles, here set to five. If you use more particles, the same amount of CPU time will be spent evolving more particles, so each one will not be evolved as far. On most problems, five is a sensible default value. On complex problems where the likelihood function has a challenging structure, more particles are useful, but it is usually better to run in multi-threaded mode (see Appendix A).

New level interval
The new level interval controls how quickly DNest4 creates new levels. In this example, this is set to 10,000, so a new level will be created once the MCMC has accumulated 10,000 likelihood values above the current top level. It is difficult to give a sensible default for this quantity because it depends on the complexity of the problem (basically, how good the Metropolis proposals are at exploring the target distribution). However, 10,000 will work for many problems, so we suggest it as a sensible default. Higher values are slower, but more fail-safe.

Save interval
The save interval controls how often DNest4 writes a particle to the output files (what is usually called "thinning"). Saving more frequently (i.e., a smaller save interval) is usually better. However, this can result in big output files if your model prints a lot of parameters to sample.txt, causing the postprocessing to take a long time and/or a lot of RAM. A default suggestion, used in the example, is to set the save interval to the same value as the new level interval.

Thread steps
The "thread steps" parameter controls how frequently separate threads pool their information about the levels (when running in multi-threaded mode, see Appendix A). It should be set to a moderate value, but should also be a small fraction of the new level interval and the save interval. 100 is a suggested default value that should work without problems in the vast majority of cases.

Maximum number of levels
As the name suggests, this tells DNest4 how many levels to create, and therefore controls the factor by which the parameter space is ultimately compressed. An appropriate value for this quantity depends on the specific model and dataset at hand; typically, a larger number of parameters and larger (more informative) datasets will lead to a larger value of H, and therefore need more levels.
In the initial phase of DNS when levels are being created, the particles move to the left in Figure 3, increasing likelihood L and decreasing prior mass X. The posterior distribution is concentrated where the rate of increase of ln(L) and the rate of decrease of ln(X) are approximately equal. Figure 3 is the most useful diagnostic plot for setting the correct maximum number of levels. A clear peak should be visible in the posterior weights plot in the lower panel, such that moving further to the left would not add any more particles with comparable weight to those in the peak.
As described by Skilling (2006), there is no guarantee in principle that another peak would not have appeared had a run continued for longer. Such phase changes are more common in statistical mechanics problems than in data analysis problems, but can appear in the latter (e.g., Brewer 2014; Brewer and Donovan 2015).
Alternatively, DNest4 can try to determine the maximum number of levels automatically, if you set the maximum number of levels to 0. This works well on problems where the MCMC exploration is efficient. When it fails, this is detectable as the posterior weights plot (lower panel of Figure 3) will peak at the left end of its domain. However, such a failed run may still be useful as it can be used to suggest an order of magnitude for the required number of levels.
For example, if the automatic setting results in 150 levels and is later seen to fail, it may be worth setting the maximum number of levels to, say, 1.3 × 150 = 195.

Backtracking scale length
The backtracking scale length, denoted by λ, appeared in Equation 10, and controls the degree to which particles are allowed to "backtrack" down in level during the first stage of DNS when levels are being built. Higher values are more fail-safe, but make it take longer to create the levels. The value 10 used in the linear regression example is a suitable default value that should work in almost all cases. In simple problems where MCMC exploration is easy, lower values from 1 to 5 work sufficiently well.

Equal weight enforcement
This value is the parameter β described in Brewer et al. (2011), and compensates for imprecision in the spacing of the levels, so that the desired mixture weights w j ∝ 1 are achieved during the second stage of the DNS algorithm. The value 100 is recommended.

Maximum number of saves
This controls the length of a DNest4 run, in units of saved particles in sample.txt, which represent the mixture of constrained priors. The number of posterior samples is always less than this. In most applications 5,000 (as in the linear regression example) provides enough posterior samples (typically a few hundred) for sufficiently accurate posterior summaries. However, if you want to plot smooth-looking posterior histograms, you'll need to increase this value.
If you set the maximum number of saves to zero, DNest4 will run until you terminate it manually.

Implementing models
The "classic" method of implementing models in DNest4 is by writing a C++ class, an object of which represents a point in your parameter space.
To run DNest4 on any particular problem, such as this linear regression example, the user needs to define a C++ class to specify the model. Specifically, an object of the class represents a point in the model's parameter space. Member functions are defined which generate the object's parameters from the prior, make proposal steps, evaluate the likelihood, and so on. The sampler calls these member functions while executing a run.
For the simple linear regression example, we will call the class StraightLine. The member variables representing the unknown parameters are defined in the header file StraightLine.h. The class must also define and implement the following member functions: (i) void from_prior(DNest4::RNG& rng), which generates parameter values from the prior; (ii) double perturb(DNest4::RNG& rng), which proposes a change to the parameter values, and returns ln(H) as defined in Equation 15; (iii) double log_likelihood() const, which evaluates the log of the likelihood function; (iv) void print(std::ostream& out) const, which prints parameters of interest to the given output stream; and (v) std::string description() const, which returns a C++ string naming the parameters printed by print(std::ostream&) const. This string is printed (after a comment character #) at the top of the output file.
These are all described (using the linear regression example) in the following sections.
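To make the required interface concrete, here is an illustrative Python analogue of such a model class. The names and priors mirror the linear regression example, but this is a sketch, not the actual C++ code or the DNest4 Python API:

```python
import math
import random

class StraightLine:
    """Illustrative analogue of the C++ model class for y = m*x + b + noise."""

    def __init__(self, data):
        self.data = data          # list of (x, y) pairs
        self.m = self.b = 0.0
        self.sigma = 1.0

    def from_prior(self):
        # m, b ~ Normal(0, 1000^2); ln(sigma) ~ Uniform(-10, 10)
        self.m = 1000.0 * random.gauss(0.0, 1.0)
        self.b = 1000.0 * random.gauss(0.0, 1.0)
        self.sigma = math.exp(random.uniform(-10.0, 10.0))

    def log_likelihood(self):
        # Independent Normal(m*x + b, sigma^2) sampling distribution
        c = -0.5 * math.log(2.0 * math.pi) - math.log(self.sigma)
        return sum(c - 0.5 * ((y - (self.m * x + self.b)) / self.sigma) ** 2
                   for x, y in self.data)
```

The perturb function (returning ln H) is omitted here; it is discussed in detail in the following sections.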

Generating from the prior
The member function from_prior is used to generate straight line parameters from the prior. It generates m, b, and σ from their joint prior. In this case, the priors are all independent, so this reduces to generating the parameters each from their own prior distribution. For convenience, DNest4 provides an RNG class to represent random number generators. The RNG class is just a convenience wrapper for the random number generators built into C++11. As you might expect, there are rand() and randn() member functions to generate double precision values from a Uniform(0, 1) and Normal(0, 1) distribution respectively.

Proposal moves
The perturb member function for the straight line model takes a random number generator as input, makes changes to a subset of the parameters, and returns a double corresponding to ln(H) as defined in Equation 15. The combination of the choice of proposal and the ln(H) value returned must be consistent with the prior distribution. This function first chooses a random integer from {0, 1, 2} using the rand_int(int) member function of the DNest4::RNG class. This determines which of the three parameters (m, b, σ) is modified. In this example, there are no proposals that modify more than one of the parameters at a time, and all proposals are "random walk" proposals that add a perturbation (drawn from a symmetric distribution) to the current value.

Proposals for single parameters
The proposal for m involves adding a perturbation to the current value using the line

m += 1E3 * rng.randh();

A challenge when using MCMC for standard Nested Sampling is that the target distribution is not static: it gets compressed over time. Similarly, in the first stage of DNS the target distribution gets compressed (as levels are added), and in the second stage the target distribution is a mixture of distributions that have been compressed to varying degrees. This makes it difficult to tune step sizes as you would when using the standard Metropolis algorithm to sample the posterior distribution.
Rather than trying to adapt proposal distributions as a function of level, it is much simpler to just use heavy-tailed proposals which have some probability of making a jump of appropriate size. This is slightly wasteful of CPU time, but it saves a lot of human time and is more fail-safe than tuned step sizes. In simple experiments, we have found that heavy-tailed proposals are about as efficient as slice sampling (Neal 2003), but much easier to implement. The following procedure generates x from a heavy-tailed distribution:

1. Generate a ∼ Normal(0, 1);
2. Generate b ∼ Uniform(0, 1);
3. Define t := a/√(−ln b);
4. Generate n ∼ Normal(0, 1);
5. Set x := 10^(1.5 − 3|t|) × n.

The variable t has a Student's t distribution with 2 degrees of freedom. Overall, this procedure generates values x with a maximum scale of tens to hundreds, down to a minimum scale of about 10^−30 with 99% probability, by virtue of the t distribution's heavy tails. For convenience, the RNG class contains a member function randh() to generate values from this distribution. The factor of 1E3 is included because it is a measure of the prior width. Therefore, this proposal will attempt moves whose maximum order of magnitude is a bit wider than the prior (since it would be very surprising if any bigger moves were needed), and will propose moves a few orders of magnitude smaller with moderate probability. We recommend using this strategy (a measure of prior width, multiplied by randh()) as a default proposal that works well in almost all problems.
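The procedure above can be sketched directly in Python (an illustrative re-implementation of randh, not the actual C++11 RNG code):

```python
import math
import random

def randh():
    """Heavy-tailed proposal step: t is Student-t with 2 degrees
    of freedom, so 10**(1.5 - 3|t|) spans many orders of magnitude."""
    a = random.gauss(0.0, 1.0)
    b = random.random()
    while b == 0.0:           # guard against log(0)
        b = random.random()
    t = a / math.sqrt(-math.log(b))
    n = random.gauss(0.0, 1.0)
    return 10.0 ** (1.5 - 3.0 * abs(t)) * n
```

A proposal would then look like value += prior_width * randh(), mirroring the m += 1E3 * rng.randh(); line above.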
Recall that for Nested Sampling, the Metropolis acceptance probability, excluding the term for the likelihood, is min{1, H}, where

H = [π(θ′) q(θ|θ′)] / [π(θ) q(θ′|θ)].

When implementing a model class, the perturb(DNest4::RNG&) member function must return the logarithm of H. Since the prior for m is a normal distribution with mean zero and standard deviation 1000, H is

H = exp(−m′²/(2 × 1000²)) / exp(−m²/(2 × 1000²)),

so that ln(H) = (m² − m′²)/(2 × 1000²). This explains the use of the log_H variable.
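This ln(H) computation can be sketched as a small Python function (illustrative, mirroring the C++ logic for a random-walk proposal under a Normal(0, sd²) prior):

```python
def log_h_normal_prior(value_old, value_new, sd=1000.0):
    """ln(H) = ln pi(new) - ln pi(old) for a symmetric proposal
    under a Normal(0, sd^2) prior: (old^2 - new^2) / (2 sd^2)."""
    return (value_old ** 2 - value_new ** 2) / (2.0 * sd ** 2)
```

A move toward the prior mean yields positive ln(H) (more likely to be accepted), and a move away yields negative ln(H).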
In the example, the proposal for σ is implemented by taking advantage of the uniform prior for ln(σ). So σ is transformed by taking a logarithm, a proposal move is made (that satisfies detailed balance with respect to a uniform distribution), and then σ is exponentiated again. The step for the uniform prior between −10 and +10 uses the following code:

sigma += 20.0 * rng.randh();
DNest4::wrap(sigma, -10.0, 10.0);

The factor of 20 accounts for the prior width, and the wrap(double&, double, double) function uses the modulo operator to construct periodic boundaries. For example, if the perturbation results in a value of 10.2, which is outside the prior range, the value is modified to −9.8. The wrap function has no return value and works by modifying its first argument, which is passed by reference. Alternatively, the log-uniform prior, which has density proportional to 1/σ, could have been used directly by adding ln[(1/σ′)/(1/σ)] to the return value, instead of using the log/exp trick. However, this isn't recommended for a prior distribution like this which covers several orders of magnitude, because the appropriate scale size for the proposal is less clear. This is likely to cause inefficient sampling.
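The wrap behavior can be sketched in Python as follows. Unlike the C++ version, which modifies its argument in place, this illustrative version returns the wrapped value:

```python
def wrap(x, lo, hi):
    """Map x into [lo, hi) using periodic boundaries, so proposals
    that step outside the prior range re-enter from the other side."""
    return lo + (x - lo) % (hi - lo)
```

As in the text's example, a perturbed value of 10.2 on the interval (−10, 10) wraps around to −9.8.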

Consistency of prior and proposal
It is imperative that from_prior and perturb be consistent with each other, and that each implements the prior distributions that you want to use. One technique for testing this is to sample the prior for a long time (by setting the maximum number of levels to 1) and inspect sample.txt to ensure that each parameter is exploring the prior correctly.

Log likelihood
The log-likelihood for the model is evaluated and returned by the log_likelihood member function. For the straight line fitting example, the log likelihood is based on the normal density. The dataset is assumed to be accessible inside this function. In the regression example, this is achieved by having a Data class to represent datasets. Since there will usually only be one dataset, the singleton pattern (a class of which there is one instance, accessible from anywhere) is recommended. The Data class has one static member which is itself an object of class Data, and is accessible using Data::get_instance(); this is the singleton pattern, essentially a way of defining quasi-'global' variables. Alternatively, the data could be defined using static members of your model class.
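The singleton pattern can be sketched in Python as follows. This is an illustrative analogue (the actual C++ Data class reads a file such as road.txt and differs in details):

```python
class Data:
    """Singleton holding the dataset: one instance, globally accessible."""
    _instance = None

    def __init__(self):
        self.x, self.y = [], []

    @classmethod
    def get_instance(cls):
        # Create the single instance on first access, then reuse it
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def load(self, pairs):
        # Illustrative stand-in for loading (x, y) pairs from a file
        self.x = [p[0] for p in pairs]
        self.y = [p[1] for p in pairs]
```

Any function, including the log-likelihood, can then reach the dataset via Data.get_instance() without it being passed as an argument.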

Parameter output
The print and description functions are very simple. The print function specifies that the parameters m, b, and σ are printed on a single line (of sample.txt), separated by spaces (the postprocess function assumes the delimiter is a space).

Running the sampler
The file main.cpp contains the main() function, which is executed after compiling and linking. The contents of main.cpp for the linear regression example are:

#include <iostream>
#include "Data.h"
#include "DNest4/code/DNest4.h"
#include "StraightLine.h"

using namespace std;

int main(int argc, char** argv)
{
    Data::get_instance().load("road.txt");
    DNest4::start<StraightLine>(argc, argv);
    return 0;
}

The first line of main() loads the data into its global instance (so it can be accessed from within the log-likelihood function) and the second line uses the start template function to construct and run the sampler.

Finite mixture models with the RJObject class
Mixture models are a useful way of representing realistic prior information in Bayesian data analysis. To reduce the amount of effort needed to implement mixture models in DNS, Brewer (2014) implemented a template class called RJObject to handle the MCMC moves required. The RJ in RJObject stands for Reversible Jump (Green 1995), as RJObject implements birth and death moves for mixture models with an unknown number of components. An updated version of RJObject is included in DNest4.
If N is the number of components, x_i denotes the vector of parameters of the ith component, and α is the vector of hyperparameters, the prior can be factorized via the product rule, giving

p(α, N, x) = p(α) p(N|α) p(x|N, α).

The specific assumptions of RJObject are that this simplifies to

p(α, N, x) = p(α) p(N) ∏_{i=1}^{N} p(x_i|α).

That is, the prior for the hyperparameters is independent of the number of components, and each component is independent and identically distributed given the hyperparameters. The sampling distribution p(D|x, α) (not shown in the PGM) cannot depend on the ordering of the components. A probabilistic graphical model (PGM) showing this dependence structure is shown in Figure 5. There are no observed data nodes here, but if such a structure forms part of a Bayesian model, an RJObject object within your model class can encapsulate this part of the model.

Brewer (2014) described the motivation for RJObject and some details about the Metropolis proposals underlying it. Here, we demonstrate how to implement a finite mixture model using RJObject. This example can be found in the code/Examples/RJObject_1DMixture directory. The "SineWave" and "GalaxyField" models from Brewer (2014) are also included in the DNest4 repository.

An example mixture model
Colloquially, one might say the data were 'drawn from' a mixture of N normal distributions, and we want to infer N along with the properties (positions, widths, and weights) of those normal distributions. The sampling distribution for the ith data point is a mixture of N gaussian densities with means {μ_j}, standard deviations {σ_j}, and mixture weights {w_j}. The mixture weights must obey a normalization condition Σ_j w_j = 1. The easiest way of implementing this is to use un-normalized weights {W_j}, which do not obey such a condition, and then normalize them by dividing by their sum.
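This weight normalization, and the resulting mixture log-density for a single data point, can be sketched as follows (illustrative Python, not the example's C++ code):

```python
import math

def mixture_logdensity(y, mus, sigmas, Ws):
    """Log density of one data point under a gaussian mixture,
    normalizing the un-normalized weights {W_j} by their sum."""
    total = sum(Ws)
    w = [W / total for W in Ws]
    # Per-component log terms: ln(w_j) + ln Normal(y; mu_j, sigma_j^2)
    terms = [math.log(wj) - math.log(sj) - 0.5 * math.log(2.0 * math.pi)
             - 0.5 * ((y - mj) / sj) ** 2
             for wj, mj, sj in zip(w, mus, sigmas)]
    # Stable log-sum-exp over components
    a = max(terms)
    return a + math.log(sum(math.exp(t - a) for t in terms))
```

Summing this over all data points gives the full log-likelihood that a model class would return.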
Making the connection with Equation 24 and Figure 5, the "components" are the gaussians, the gaussian parameters are {x j } = {(µ j , σ j , W j )}, and hyperparameters α may be used to help specify a sensible joint prior for the gaussian parameters.
We now describe the specific prior distributions we used in this example. The prior for N was

p(N) ∝ 1/(N + 1)

for N ∈ {1, 2, ..., 100}. For the conditional prior of the gaussian parameters, we used Laplace (biexponential) distributions 2 . Normal distributions would be more conventional, but the analytic cumulative distribution function (CDF) of the Laplace distribution makes it easier to implement. These were:

μ_j ∼ Laplace(a_μ, b_μ),
ln σ_j ∼ Laplace(a_ln σ, b_ln σ),
ln W_j ∼ Laplace(0, b_ln W).
2 A Laplace distribution with location parameter a and scale parameter b has density p(x|a, b) = (1/(2b)) exp(−|x − a|/b). The location parameters are denoted with a and scale parameters with b. These priors express the idea that the centers, widths, and relative weights of the gaussian mixture components are probably clustered around some typical value.
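The convenience of the analytic CDF can be illustrated with a Python sketch of the Laplace quantile (inverse CDF) function, which maps a Uniform(0, 1) coordinate to a Laplace(a, b) value. This is the kind of transformation a conditional prior class needs to provide:

```python
import math

def laplace_icdf(u, a=0.0, b=1.0):
    """Inverse CDF of Laplace(a, b): maps u in (0, 1) to a sample."""
    if u < 0.5:
        return a + b * math.log(2.0 * u)
    return a - b * math.log(2.0 * (1.0 - u))
```

Because this function is analytic and invertible, generating component parameters from the conditional prior, and transforming them back to uniform coordinates, are both cheap and exact.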
The model class contains the RJObject as a member:

RJObject<MyConditionalPrior> gaussians;

where the template argument <MyConditionalPrior> is a class implementing the prior for the hyperparameters α and the form of the conditional prior p(x_i|α). The main advantage of the RJObject class is that we do not need to implement any proposals for {(µ_j, σ_j, W_j)}. Rather, these can be handled trivially as follows:

// Generate the components from the prior
gaussians.from_prior(rng);

// Do a Metropolis proposal
double logH = gaussians.perturb(rng);

// Print to output stream 'out'
gaussians.print(out);

The RJObject constructor declaration is:

RJObject(int num_dimensions, int max_num_components, bool fixed,
         const ConditionalPrior& conditional_prior,
         PriorType prior_type=PriorType::uniform);

where num_dimensions is the number of parameters needed to specify a single component (three in this example), max_num_components is the maximum value of N allowed, fixed determines whether N is fixed at N_max or allowed to vary, conditional_prior is an instance of a conditional prior, and prior_type controls the prior for N (the default is uniform over {0, 1, 2, ..., N_max}). The RJObject_1DMixture MyModel initializes its RJObject as follows:

MyModel::MyModel()
:gaussians(3, 100, false, MyConditionalPrior(), PriorType::log_uniform)
{
}

Passing PriorType::log_uniform for the final argument specifies the 1/(N + 1) prior for N. One complication for this model is that RJObject, by default, allows N to be zero, which makes no sense for this particular problem. Therefore, we prevent N = 0 from being generated in MyModel::from_prior, and ensure that it is always rejected if proposed in MyModel::perturb. To access the component parameters, the RJObject member function get_components is used. This was used in the above functions, but more typically it is needed in the log likelihood. The member function get_components returns (by const reference) a std::vector of std::vectors of doubles. For example, parameter two of component zero is:

gaussians.get_components()[0][2]
RJObject's print function prints the dimensionality of each component and the maximum value of N , followed by the hyperparameters, then parameter zero of each component (zero padded when N < N max ), parameter one for each component, and so on.

Conditional priors
The RJObject class is used to manage the components. An additional class is needed to define and manage the hyperparameters α (i.e., define how they are generated from the prior, how they are proposed, and so on). The iid conditional prior p(x_i|α) is also defined by this extra class. In the example, the class is called MyConditionalPrior and inherits from an abstract base class DNest4::ConditionalPrior, which insists that certain member functions be specified.
The member functions from_prior, perturb_hyperparameters, and print do the same things as the similar functions in a standard model class, but for the hyperparameters α. In addition, three functions from_uniform, to_uniform, and log_pdf together define the form of the conditional prior for the component parameters, p(x_i|α). Each of these takes a std::vector of doubles by reference as input. The log_pdf function evaluates ln p(x_i|α) at the current value of α; the position x_i where this is evaluated is passed in via the input vector. In the example, these were defined using Equations 27-29. To simplify this, we defined a class for Laplace distributions rather than explicitly writing out the densities. We decided (arbitrarily) that parameters 0, 1, and 2 are µ, ln(σ), and W respectively. The functions to_uniform and from_uniform must implement the cumulative distribution function (CDF) of the conditional prior and its inverse. That is, if the input vector contains iid draws from Uniform(0, 1), from_uniform should modify them to become draws from p(x|α), and to_uniform should be the inverse of from_uniform. Both of these functions modify the vector argument in-place. For the example, we again used an external class to define these functions for the Laplace distribution.

Figure 6: The 'galaxy' data, with the posterior mean fit (equivalent to the predictive distribution for the "next" data point).
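The contract these two functions must satisfy can be illustrated in standalone form (this is not the DNest4 base class; a Laplace(a, b) conditional prior in every coordinate is assumed for the sketch):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// from_uniform: map iid Uniform(0, 1) draws to Laplace(a, b) draws, in place,
// via the inverse CDF.
inline void from_uniform(std::vector<double>& vec, double a, double b)
{
    for (double& u : vec)
    {
        if (u < 0.5)
            u = a + b * std::log(2.0 * u);
        else
            u = a - b * std::log(2.0 * (1.0 - u));
    }
}

// to_uniform: the inverse of from_uniform, i.e., the Laplace CDF, in place.
inline void to_uniform(std::vector<double>& vec, double a, double b)
{
    for (double& x : vec)
    {
        if (x < a)
            x = 0.5 * std::exp((x - a) / b);
        else
            x = 1.0 - 0.5 * std::exp(-(x - a) / b);
    }
}
```

Applying from_uniform and then to_uniform must return the original vector, which is a useful sanity check for any ConditionalPrior implementation.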

Mixture model results
The data and the posterior mean fit are shown in Figure 6 and the posterior distribution for N is shown in Figure 7. The marginal likelihood was ln(Z) = −232.2, and the information (measured in nats) was H = 29.5.

Background
Nested Sampling can be used to solve Approximate Bayesian Computation (ABC) problems elegantly and efficiently. Also known (misleadingly) as likelihood-free inference, ABC is a set of Monte Carlo techniques for approximating the posterior distribution without having to evaluate the likelihood function L(θ). Instead, the user must be able to cheaply generate simulated data sets from the sampling distribution p(D|θ, M). Since a sampling distribution must be specified, it is not that there is no likelihood function (indeed, a sampling distribution and a data set imply a particular likelihood function), rather that we cannot evaluate it cheaply and therefore cannot use MCMC.

[Figure caption: There are some points with high posterior weight at the left of the plot, but there are so few of them that they make up a trivial fraction of the posterior mass. In practice, however, it is worth re-running with more levels to verify that a second peak is not forming. This is the kind of situation (Skilling 2006) for which Nested Sampling is the best known solution.]
All Bayesian updating conditions on the truth of a proposition. We sometimes speak and use notation as if we're conditioning on the value of a variable, for example by writing the posterior distribution as p(θ|D, M ). However, this is shorthand for p(θ|D = D observed , M ). In the prior state of knowledge, the statement D = D observed could have been either true or false, but it is known to be true in the posterior state of knowledge. In the case of "continuous data" (really a continuous space of possibilities for the data before we learned it) we condition on a proposition like (D ∈ R) where R is a region, and then implicitly take a limit as the size of R goes to zero.
A simple "rejection sampling" version of ABC works by sampling the joint prior distribution for the parameters and data, p(θ, D|M) = p(θ|M)p(D|θ, M), and rejecting samples for which D ≠ D observed, so that the retained samples represent p(θ, D | D = D observed, M). The marginal distribution for θ is then the conditional distribution p(θ|D = D observed, M), that is, the posterior 3.
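A minimal sketch of rejection ABC for a discrete toy problem, where the event D = D observed has nonzero probability (the problem and all names are illustrative): infer the success probability θ of a binomial distribution from a single observed count, keeping only joint prior draws whose simulated count matches the observed one.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Rejection ABC: sample (theta, D) from the joint prior, keep theta only
// when the simulated data exactly matches the observed data.
inline std::vector<double> abc_rejection(int n_trials, int observed_count,
                                         int n_draws, std::mt19937& rng)
{
    std::uniform_real_distribution<double> prior(0.0, 1.0);
    std::vector<double> posterior;
    for (int i = 0; i < n_draws; ++i)
    {
        double theta = prior(rng);                    // theta ~ p(theta)
        std::binomial_distribution<int> lik(n_trials, theta);
        if (lik(rng) == observed_count)               // D = D_observed?
            posterior.push_back(theta);               // keep the sample
    }
    return posterior;
}
```

With a Uniform(0, 1) prior and 7 successes in 10 trials, the retained samples approximate a Beta(8, 4) posterior.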
This approach is rarely usable in practice because the probability of generating a dataset that matches the observed one is extremely low. To work around this, we replace the proposition D = D observed with a logically weaker one (i.e., one that is implied by D = D observed but does not imply it). The weaker proposition is defined using a discrepancy function ρ, which measures how different a simulated dataset D is from the real one D observed. The discrepancy function should take a minimum value of zero when D = D observed. The analysis then proceeds by Monte Carlo sampling the joint posterior for θ and D conditional on the logically weaker proposition ρ(D; D observed) < ε, where ε is some small number. With this proposition, the rejection rate will still typically be very high, but lower than with D = D observed. Typically, ρ is defined by introducing a few summary statistics s_1(D), s_2(D), ..., s_n(D), which hopefully capture most of the relevant information in the data, and then computing the Euclidean distance between the summary statistics of D and D observed. That is,

ρ(D; D observed) = sqrt( Σ_{i=1}^{n} [s_i(D) − s_i(D observed)]² ).

The main challenges associated with ABC are: (i) the choice of discrepancy function ρ(D; D observed), which may involve a choice of summary statistics; (ii) how to choose the value of ε, or how to change it as a run progresses; and (iii) how to make algorithms more efficient than rejection sampling.
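The summary-statistic discrepancy can be sketched as a standalone function (illustrative, not part of DNest4), taking the summary vectors of the simulated and observed datasets:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between the summary statistics of the simulated
// dataset and those of the observed dataset.
inline double discrepancy(const std::vector<double>& s_sim,
                          const std::vector<double>& s_obs)
{
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < s_sim.size(); ++i)
    {
        double diff = s_sim[i] - s_obs[i];
        sum_sq += diff * diff;
    }
    return std::sqrt(sum_sq);
}
```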
Challenge (i) is Bayesian in nature, i.e., it relates to the very definition of the posterior distribution itself. On the other hand, challenges (ii) and (iii) are about the computational implementation. Most ABC analyses are done using Sequential Monte Carlo (SMC; Del Moral, Doucet, and Jasra 2012), another family of algorithms closely related to Nested Sampling and annealing in the sense that they work with sequences of probability distributions.

ABC with nested sampling
NS can be used for ABC by noting that Equation 36 is of the same form as a constrained prior, only instead of being the prior for the parameters p(θ|M) defined on the parameter space, it is the joint prior for the parameters and the data, p(θ, D|M) = p(θ|M)p(D|θ, M), defined on the product space of possible parameters and datasets. Therefore, to implement ABC in DNest4, you implement a model class whose from_prior member function generates parameters and simulated data from the joint prior, and whose perturb member function proposes changes to the parameters and/or the simulated data, in a manner consistent with the joint prior. The log_likelihood function of the class, instead of actually evaluating the log likelihood, should evaluate −ρ(D; D observed). Then, running the sampler will create levels which systematically decrease ρ. The samples from a particular level then represent the ABC posterior defined in Equation 36, with ε corresponding to minus the "log likelihood" of the level. A single DNest4 run therefore allows you to test sensitivity to the value of ε. The marginal likelihood reported will be the probability that ρ(D; D observed) < ε given the model, although caution should be applied if using this for model selection (Robert, Cornuet, Marin, and Pillai 2011).

3 This perspective is useful for connecting Bayesian updating to maximum entropy updating (Caticha and Giffin 2006; Giffin and Caticha 2007).

ABC example
A simple ABC example is included in the directory code/Examples/ABC. The example tries to infer the mean µ and standard deviation σ of a normal distribution from a set of samples. The sampling distribution is

x_i | µ, σ ∼ Normal(µ, σ²), independently for each i,

and the prior is

µ ∼ Uniform(−10, 10)
ln σ ∼ Uniform(−10, 10).
However, instead of conditioning on the full dataset {x i }, the example uses the minimum and maximum values in the data, min({x i }) and max({x i }), as summary statistics 4 .
In ABC applications the model class needs to describe a point in the joint (parameters, data) space, and the from_prior and perturb functions need to respect the joint prior distribution. In Bayesian inference the joint prior tends to make the parameters and data highly dependent - otherwise the data wouldn't ever be informative about the parameters. To improve efficiency, it is helpful to use a change of variables such that the parameters and data-controlling variables are independent. In this example we use

µ ∼ Uniform(−10, 10)
ln σ ∼ Uniform(−10, 10)
n_i ∼ Normal(0, 1)

where each of the n_i enables a data point to be calculated using

x_i = µ + σ n_i.

A general procedure for achieving this is to imagine generating a simulated data set from the parameters. In this process, you would need to call a random number generator many times.
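This change of variables can be sketched as follows (the function name is illustrative): the simulated dataset is assembled deterministically from (µ, ln σ) and the standard normal seeds {n_i}, which are independent of the parameters under the joint prior.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assemble the simulated dataset from the parameters and the standard
// normal seeds: x_i = mu + sigma * n_i, with sigma = exp(log_sigma).
inline std::vector<double> assemble_fake_data(double mu, double log_sigma,
                                              const std::vector<double>& n)
{
    const double sigma = std::exp(log_sigma);
    std::vector<double> x(n.size());
    for (std::size_t i = 0; i < n.size(); ++i)
        x[i] = mu + sigma * n[i];
    return x;
}
```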
The results of the random number calls are independent of the parameters and can be used to parameterize the data set. In the example, the {n_i} variables play this role. Therefore, the variables in the model class are:

class MyModel
{
    private:
        double mu, log_sigma;
        std::vector<double> n;

The proposal involves changing either mu, log_sigma, or one of the ns. The "log-likelihood" is really minus the discrepancy function. Since we parameterized the joint (parameter, data) space using the n variables instead of the data {x_i} itself, we must "assemble" the simulated dataset in order to evaluate the discrepancy function:

// Goodness
double logL = 0.;
logL -= pow(*min_element(x_fake.begin(), x_fake.end()) - x_min, 2);
logL -= pow(*max_element(x_fake.begin(), x_fake.end()) - x_max, 2);
return logL;
}

ABC example results
For ABC applications, there is an alternative postprocess function called postprocess_abc. This generates posterior samples using the value of ε corresponding to the level at ≈ 0.8 × (maximum number of levels).

dnest4.postprocess_abc()
To try different values of ε, you can set the argument threshold_fraction to a value other than 0.8. For example:

dnest4.postprocess_abc(threshold_fraction=0.6)

In the example directory there is a dataset generated from a standard normal distribution. The inference should result in a joint posterior for µ and σ with significant density near (µ = 0, σ = 1). We show posterior samples in Figure 9.

Python bindings
In DNest4, it is also possible to specify and run models in Python. Currently, the only models supported are ones where the parameters are a NumPy array of double precision quantities.