Bayesian Network Constraint-Based Structure Learning Algorithms: Parallel and Optimised Implementations in the bnlearn R Package

It is well known from the literature that the problem of learning the structure of Bayesian networks is very hard to tackle: its computational complexity is super-exponential in the number of nodes in the worst case and polynomial in most real-world scenarios. Efficient implementations of score-based structure learning benefit from past and current research in optimisation theory, which can be adapted to the task by using the network score as the objective function to maximise. This is not true for approaches based on conditional independence tests, called constraint-based learning algorithms. The only optimisation in widespread use, backtracking, leverages the symmetries implied by the definitions of neighbourhood and Markov blanket. In this paper we show that backtracking degrades the stability of Bayesian network structure learning, for little gain in terms of speed. As an alternative, we introduce a general framework to parallelise constraint-based structure learning algorithms, and we investigate its performance using the bnlearn R package on four reference networks and two real-world data sets from genetics and systems biology. We show that on modern multi-core or multiprocessor hardware the proposed approach is always preferable over backtracking, which was developed when single-processor machines were the norm.

Each node in V is associated with one variable, and they are referred to interchangeably.The directed arcs in A that connect them represent direct stochastic dependencies; so if there is no arc connecting two nodes, the corresponding variables are either marginally independent or conditionally independent given (a subset of) the rest.As a result, each local distribution depends only on a single node X i and on its parents (i.e. the nodes X j such that X j → X i , here denoted Π X i ): Common choices for the local and global distributions are multinomial variables (discrete BNs, Heckerman, Geiger, and Chickering 1995); univariate and multivariate normal variables (Gaussian BNs, Geiger and Heckerman 1994); and, less frequently, a combination of the two (conditional Gaussian (CG) BNs, Lauritzen and Wermuth 1989).In the first case, the parameters of interest are the conditional probabilities associated with each variable, represented as conditional probability tables (CPTs); in the second, the parameters of interest are the partial correlation coefficients between each variable and its parents.As for CG BNs, the parameters of interest are again partial correlation coefficients, computed for each node conditional on its continuous parents for each configuration of the discrete parents.
The key advantage of the decomposition in Equation ( 1) is to make local computations possible for most tasks, using just a few variables at a time regardless of the magnitude of m.A related quantity that works to the same effect is the Markov blanket of each node X i , defined as the set B(X i ) of nodes which graphically separates X i from all other nodes V \ {X i , B(X i )} (Pearl 1988, p. 97).In BNs such a set is uniquely identified by the parents (Π X i ), the children (i.e. the nodes X j such that X i → X j ) and the spouses of X i (i.e. the nodes that share a child with X i ).By definition, X i is independent of all other nodes given B(X i ), thus making them redundant for inference on X i .
The task of fitting a BN is called learning and is generally implemented in two steps.
The first is called structure learning, and consists in finding the DAG that encodes the conditional independencies present in the data.This has been achieved in the literature with constraint-based, score-based and hybrid algorithms; for an overview see Koller and Friedman (2009) and Nagarajan, Scutari, and Lèbre (2013).Constraint-based algorithms are based on the seminal work of Pearl on causal graphical models and his Inductive Causation algorithm (IC, Verma and Pearl 1991), which provides a framework for learning the DAG of a BN using conditional independence tests under the assumption that graphical separation and probabilistic independence imply each other (the faithfulness assumption).Tests in common use are the mutual information test (for discrete BNs) and the exact Student's t test for correlation (for GBNs).Score-based algorithms represent an application of heuristic optimisation techniques: each candidate DAG is assigned a network score reflecting its goodness of fit, which the algorithm then attempts to maximise.BIC and posterior probabilities arising from various priors are typical choices.Hybrid algorithms use both conditional independence tests and network scores, the former to reduce the space of candidate DAGs and the latter to identify the optimal DAG among them.Some examples are PC (Spirtes, Glymour, and Scheines 2000), The second step is called parameter learning and, as the name suggests, deals with the estimation of the parameters of the global distribution.Since the graph structure is known from the previous step, this can be done efficiently by estimating the parameters of the local distributions.
Most problems in BN theory have a computational complexity that, in the worst case, scales exponentially with the number of variables.For instance, both structure learning (Chickering 1996;Chickering, Geiger, and Heckerman 1994) and inference (Cooper 1990;Dagum and Luby 1993) are NP-hard and have polynomial complexity even for sparse networks.This is especially problematic in high-dimensional settings such as genetics and systems biology, in which BNs are used for the analysis of gene expressions (Friedman 2004) and protein-protein interactions (Jansen, Yu, Greenbaum, Kluger, Krogan, Chung, Emili, Snyder, Greenblatt, and Gerstein 2003;Sachs, Perez, Pe'er, Lauffenburger, and Nolan 2005); for integrating heterogeneous genetic data (Chang and McGeachie 2011); and to determine disease susceptibility to anemia (Sebastiani, Ramoni, Nolan, Baldwin, and Steinberg 2005) and hypertension (Malovini, Nuzzo, Ferrazzi, Puca, and Bellazzi 2009).
Even though algorithms in recent literature are designed taking scalability into account, it is often impractical to learn BNs from data containing more than a few hundreds of variables without restrictive assumptions on either the structure of the DAG or the nature of the local distributions.Two ways to address this problem are: 1. optimisations: reducing the number of conditional independence tests and network scores computed from the data, either by skipping redundant ones or by limiting local computations to a few variables; 2. parallel implementations: splitting learning across multiple processors or cores to make better use of modern multi-core and multiprocessor hardware.
As far as score-based learning algorithms are concerned, both possibilities have been explored and a wide range of solutions proposed, from efficient caching using decomposable scores (Daly and Shen 2007), to parallel meta-heuristics (Rauber and Rünger 2010) and integer programming (Cussens 2011).The same cannot be said of constraint-based algorithms; even recent ones such as SI-HITON-PC, while efficient, are still implemented with basic backtracking as the only optimisation.We will examine them in Section 2, arguing that they provide only modest speed gains and increase the variability of the learned DAGs by introducing spurious dependencies between the variables.As an alternative, in Section 3 we propose a general framework to create parallel implementations of constraint-based algorithms that scale well on large data sets and do not suffer from this problem.

Constraint-Based Structure Learning and Backtracking
All constraint-based structure learning algorithms share a common three-phase structure inherited from the IC algorithm through PC and GS; it is illustrated in Algorithm 1.The first, optional, phase consists in learning the Markov blanket of each node to reduce the number of candidate DAGs early on.Any algorithm for learning Markov blankets can be plugged in step 1 and extended into a full BN structure learning algorithm, as originally suggested in Margaritis (2003) for GS.Once all Markov blankets have been learned, they are checked for consistency (step 2) using their symmetry; by definition X i ∈ B(X j ) ⇔ X j ∈ B(X i ).Asymmetries are corrected by treating them as false positives and removing the offending nodes from each other's Markov blankets.
The second phase learns the skeleton of the DAG, that is, it identifies which arcs are present in the DAG modulo their direction.This is equivalent to learning the neighbours N (X i ) of each node: its parents and children.As illustrated in step 3, the absence of a set of nodes S X i X j that separates a particular pair X i , X j implies that either X i → X j or X j → X i .
Separating sets are considered in order of increasing size to keep computations as local as possible.Furthermore, if B(X i ) and B(X j ) are available from steps 1 and 2 the search can be greatly reduced because both sets are typically much smaller than V.With the exception of the PC algorithm, which is structured exactly as described in step 3, constraint-based algorithms learn the skeleton by learning each N (X i ) and then enforcing symmetry (step 4).
Finally, in the third phase arc directions are established as in Meek (1995).It is important to note that, for some arcs, both directions are equivalent in the sense that they identify equivalent decompositions of the global distribution.Therefore, some arcs will be left undirected and the algorithm will return a completed partially directed acyclic graph identifying an equivalence class containing multiple DAGs.Such a class is uniquely identified by the skeleton learned in steps 3 and 4, and by the v-structures V l of the form ∈ N (X j ) learned in step 5 (Chickering 1995).Additional arc directions are inferred indirectly in step 6 by ruling out those that would introduce additional v-structures (which would have been identified in step 5) or cycles (which are not allowed in DAGs).
Even in such a general form, we can see that Algorithm 1 performs many checks for graphical separation that are redundant given the symmetry of the B(X i ) and the N (X i ).Intuitively, once we have concluded that X j / ∈ B(X i ) there is no need to check whether X i ∈ B(X j ) in step 1; and similar considerations can be made for neighbours in step 3.In practice, this translates to redundant independence tests computed on the data.To reduce the number of such tests, it is common practice to assume X 1 , . . ., X m are processed sequentially and use backtracking in steps 1 and 3 as follows: Not that in both cases X i can be discarded in the process of learning B(X j ) and N (X j ).
Such a strategy is described in detail, for example, in Tsamardinos et al. (2006, p. 46).It is important to note that, in contrast with enforcing symmetry by construction (e.g. , this heuristic does not automatically introduce a false positive or a false negative in the learning process for every type I or type II error in the independence tests.Even so, backtracking has the undesirable effect of making structure learning depend on the order the variables are stored in the data set, which has been shown to increase errors in the PC algorithm (Colombo and Maathuis 2013).In addition, Algorithm 1 A Template for Constraint-Based Structure Learning Algorithms Input: a data set containing the variables X i , i = 1, . . ., m. Output: a completed partially directed acyclic graph.
1.For each variable X i , learn its Markov blanket B(X i ).
2. Check whether the Markov blankets B(X i ) are symmetric, e.g.X i ∈ B(X j ) ⇔ X j ∈ B(X i ).Assume that nodes for whom symmetry does not hold are false positives and drop them from each other's Markov blankets.
3. For each variable X i , learn the set N (X i ) of its neighbours (i.e. the parents and the children of X i ).Equivalently, for each pair X i , X j , i = j search for a set S X i X j ⊂ V (including S X i X j = ∅) such that X i and X j are independent given S X i X j and X i , X j / ∈ S X i X j .If there is no such a set, place an undirected arc between X i and X j (X i −X j ).If B(X i ) and B(X j ) are available from points 1 and 2, the search for S X i X j can be limited to the smallest of B(X i ) \ X j and B(X j ) \ X i .
4. Check whether the N (X i ) are symmetric, and correct asymmetries as in step 2.

5.
For each pair of non-adjacent variables X i and X j with a common neighbour X k , check whether X k ∈ S X i X j .If not, set the direction of the arcs 6. Set the direction of arcs that are still undirected by applying the following two rules recursively: (a) If X i is adjacent to X j and there is a strictly directed path from X i to X j (a path leading from X i to X j containing no undirected arcs) then set the direction of Return the resulting completed partially directed acyclic graph.
backtracking provides only a modest speed increase compared to a parallel implementation of steps 1-4; we will compare the respective running times in Section 3.
Simulations were performed on a cluster of 7 dual AMD Opteron 6136, each with 16 cores and 78GB of RAM, with R 3.1.0and bnlearn 3.5.
For each BN, we considered 6 different sample sizes (n = 0.1p, 0.2p, 0.5p, p, 2p, 5p), chosen as multiples of p to facilitate comparisons between networks of such different complexity; and 4 different constraint-based structure learning algorithms (GS, Inter-IAMB, MMPC, SI-HITON-PC).Since all reference BNs are discrete, we used the asymptotic χ 2 mutual information test with α = 0.01.For each combination of BN, sample size and algorithm we repeated the following simulation 20 times.First, we loaded the BN from the RDA file downloaded from the repository (alarm.rdabelow) and generated a sample of the appropriate size with rbn.
> hamming(skel.orig,rskel.orig)[1] 0 > hamming(skel.back,rskel.back)[1] 10 Ideally, skel.orig and rskel.origshould be identical and therefore their Hamming distance should be zero.This may not be the case for BNs with deterministic 0-1 nodes, whose structure is unlikely to be learned correctly by any of the considered algorithms; or when conditional independence tests are biased and have low power because of small sample sizes.
The difference between the Hamming distance of skel.orig and rskel.origand that of skel.back and rskel.backgives an indication of the dependence on the ordering of the variables introduced by backtracking.
The results of this simulation are shown in Figure 1.With the exception of ALARM, ANDES, HEPAR II for the GS algorithm, Hamming distance is always greater when backtracking is used.This is true for all the considered sample sizes.In fact, Hamming distance does not appear to converge to zero as sample size increases; on the contrary, large samples make even weak dependencies detectable and thus increase the chances of getting different skeletons.This trend is consistently more marked when using backtracking, as is the range of observed Hamming distances in each configuration of BN, sample size and learning algorithm.The combination of this two facts suggests that backtracking does indeed make learning dependent on the order in which the variables are considered; and that it increases the variability of the learned structure.

A Framework for Parallel Constraint-Based Learning
Constraint-based algorithms as described in Algorithm 1 display coarse-grained parallelism: different parts need to be synchronised only three times, in steps 2, 4 and 6.Steps 1, 3 and 5 are embarrassingly parallel, because each B(X i ), each N (X i ) and each V l can be learned independently from the others.
Step 6 on the other hand is inherently sequential because of its iterative formulation.Parallelising Algorithm 1 on a step-by-step basis is therefore very convenient.As shown in Figure 2, the implementation still follows the same steps; and it can scale efficiently because computationally intensive steps can be partitioned in as many as parts as there are variables.Splitting the tests in large batches corresponding to the B(X i ), N (X i ) and V l also reduces the amount of information exchanged by different parts of the implementation, reducing overhead.

Simulations Reference BNs
All constraint-based learning algorithms in bnlearn have such a parallel implementation thanks to the infrastructure provided by the parallel package (R Core Team 2014).The master R process controls the learning process and distributes the B(X i ), N (X i ) and V l to the slave processes, executing only steps 2, 4 and 6 itself.Consider, for example, the Inter-IAMB algorithm and the data set generated from the ALARM reference BN shipped with bnlearn.
The difference in the number of tests between the two slaves is due to the topology of the BN: the B(X i ) and N (X i ) have different sizes and therefore require different numbers of tests to learn.This in turn also affects the number of tests required to learn the v-structures V l .
Increasing the number of slave processes reduces the number of tests performed by each of them, further increasing the overall performance of the algorithm.
processes and therefore slaves that have fewer tests to perform are left waiting for others to complete (see, for instance, the R code snippets above).Even so, the parallel implementations in bnlearn scale efficiently up to 8 slaves.In the absence of any overhead we would expect the average normalised running time to be approximately 1/8 = 0.125; observed values are in the range [0.157, 0.191].Optimised learning is at best competitive with 2 slaves (MMPC, MUNIN), and at worst may actually degrade performance (LINK, SI-HITON-PC).

Simulations on the Real-World Data
To provide a more realistic benchmarking on large-scale biological data, we applied  (2006).Both data sets are publicly available and have been preprocessed to remove highly correlated variables (COR > 0.95) and to impute missing values with the impute package (Hastie, Tibshirani, Narasimhan, and Chu 2013).The adenocarcinoma data set has a sample size which is extremely small compared to the number of variables, which is common in systems biology.On the other hand, the mice data has a sample size that is typical of large genome-wide association studies.We ran SI-HITON-PC using 1,2,3,4,6,8,10,12,14,16,18 and 20 slaves, averaging over 5 runs in each case; the other algorithms we considered in Section 2 did not scale well enough to handle either data set.Variables were treated as continuous, and independence was tested using the Student's t test for correlation with α = 0.01.
As we can see in Figure 4, we observe a low overhead even for 20 slave processes, with normalised running times of 0.062 (mice) and 0.076 (adenocarcinoma) which are very close to the theoretical 1/20 = 0.05.Similar considerations can be made across the whole range of 2 to 20 slaves, with a measured overhead between 0.02 and 0.08.Surprisingly, overhead seems to decrease in absolute terms with the number of slaves, from 0.04 (adenocarcinoma) and 0.08 (mice) for 3 slaves to 0.012 (adenocarcinoma) and 0.026 (mice) for 20 slaves.However, clusters with 2 slaves have a small overhead (0.021 and 0.037) than those with 3 or 4 slaves.We also note that overhead is comparable to that of the reference BNs in Section 2, suggesting that it does not strongly depend on the size of the BN.Again, the running time of the optimised implementation is comparable with that of 2 slaves.

Discussion and Conclusions
In this paper we proposed a general framework to create parallel implementations of constraintbased structure learning algorithms for BNs and implemented it in bnlearn.Since all these algorithms trace their roots to the IC algorithm from Verma and Pearl (1991), they share a common layout and can be parallelised in the same way.In particular, steps 1, 3 and 5 are embarrassingly parallel and can be trivially split in independent parts to be executed simultaneously.This is important for two reasons.Firstly, it limits the amount of overhead in the parallel computations due to the need of keeping different parts of the algorithms in sync.As we have seen in Section 3, this allows the parallel implementations to scale efficiently in the number of slave processes.Secondly, it implies that the parallel implementation of each algorithm performs the same conditional independence tests as the original.This is in contrast with backtracking, which is the only widespread way of improving the sequential performance of constraint-based algorithms.The simulations in Section 2 suggest that backtracking can increase the variability of the DAGs learned from the data.At the same time, speed gains are competitive at most with the parallel implementation with 2 slave processes.Since most computers in recent times come with at least 2 cores, it is possible to outperform backtracking even on commodity hardware while retaining the lower variability of the unoptimized algorithms.

Figure 1 :
Figure 1: Hamming distance between skeletons learned for the ALARM, ANDES, HEPAR II, LINK and MUNIN reference BNs before and after reversing the ordering of the variables, for various n/p ratios and algorithms.Blue boxplots correspond to structure learning without backtracking, green boxplots to learning with backtracking.

Figure 4 :
Figure4: Running times for learning the skeletons underlying the lung adenocarcinoma(Beer et al. 2002) and mice(Valdar et al. 2006) data sets with SI-HITON-PC.Raw running times are reported for backtracking and for parallel learning with 1, 2, 3, 4 and 20 slave processes.
Only the results for LINK and MUNIN are shown, as they are the largest reference BNs considered in this paper.It is clear from the figure that the gains in running time follow the law of diminishing returns: adding more slaves produces smaller and smaller improvements.Furthermore, tests are never split uniformly among the slave SI-HITON-PC on the lung adenocarcinoma gene expression data (86 observations, 7131 variables representing expression levels) from Beer et al. (2002); and on the WTCCC heterogeneous mice SNP data (1940 observations, 4053 variables representing allele counts) from Valdar et al.