Learning Large-Scale Bayesian Networks with the sparsebn Package

Learning graphical models from data is an important problem with wide applications, ranging from genomics to the social sciences. Nowadays datasets typically have upwards of thousands—sometimes tens or hundreds of thousands—of variables and far fewer samples. To meet this challenge, we develop a new R package called sparsebn for learning the structure of large, sparse graphical models with a focus on Bayesian networks. While there are many existing packages for this task within the R ecosystem, this package focuses on the unique setting of learning large networks from high-dimensional data, possibly with interventions. As such, the methods provided place a premium on scalability and consistency in a high-dimensional setting. Furthermore, in the presence of interventions, the methods implemented here achieve the goal of learning a causal network from data. The sparsebn package is open-source and available on CRAN.


Introduction
Graphical models are a popular tool in machine learning and statistics, and have been used in a variety of applications including genetics (Gao and Cui 2015;Isci, Dogan, Ozturk, and Otu 2014), computational biology (Jones, Buchan, Cozzetto, and Pontil 2012), oncology (Chen, Wang, Chen, Liu, Tang, Mao, Li, Lin, Sun, and Ma 2015), medicine and health care (Nicholson, Cozman, Velikova, van Scheltinga, Lucas, and Spaanderman 2014), logistics (Garvey, Carnovale, and Yeniyurt 2015), finance (Sanford and Moosa 2012), and even software testing (Dejaeger, Verbraken, and Baesens 2013).The widespread growth of high-dimensional biological data in particular has spurred a renewed interest in the use of graphical models to aid in the discovery of novel biological mechanisms (Bühlmann, Kalisch, and Meier 2014).While the past decade has witnessed tremendous developments towards understanding undirected graphical models (Meinshausen and Bühlmann 2006;Ravikumar, Wainwright, Lafferty et al. 2010;Yang, Ravikumar, Allen, and Liu 2015), there has been less progress towards understanding directed graphical models-also known as Bayesian networks (BNs) or structural equation models (SEM)-for high-dimensional data with p n.A BN is represented by a directed acyclic graph (DAG), whose structure contains a richer and different set of conditional independence relations than an undirected graph.Moreover, DAGs are commonly used in causal inference where the direction of an edge encodes causality.Consequently, there have been continuing efforts in structure learning of directed graphs from data.sparsebn Unlike their undirected counterparts, however, the structure learning problem for directed graphical models is complicated by the nonconvexity, nonsmoothness, and nonidentifiability of the underlying statistical problem.These issues have no doubt slowed progress towards fast, scalable algorithms for learning in the presence of thousands-let alone tens of thousandsof variables.Despite progress at the theoretical (van de Geer and Bühlmann 2013; Aragam, Amini, and Zhou 2016) and computational level (Schmidt, Niculescu-Mizil, and Murphy 2007;Xiang and Kim 2013;Fu and Zhou 2013), there is still a lack of user-friendly software for putting these modern tools into the hands of practitioners.
To bridge this gap we have developed sparsebn, a new R (R Core Team 2016) package for structure learning and parameter estimation of large-scale Bayesian networks from highdimensional data.When interventional data are available, an estimated DAG from this package has a natural causal interpretation.While there are many R packages for learning Bayesian networks (Section 2.4), none that we are aware of are specifically tailored to high-dimensional data with experimental interventions.The sparsebn package has been developed from the ground up using recent developments in structure learning (Fu and Zhou 2013;Aragam and Zhou 2015;Gu, Fu, and Zhou 2016) and statistical optimization (Friedman, Hastie, Höfling, and Tibshirani 2007;Friedman, Hastie, and Tibshirani 2010;Mazumder, Friedman, and Hastie 2011).In addition to methods for learning Bayesian networks, this package also includes procedures for learning undirected graphs, fitting structural equations, and is compatible with existing packages in R. All of the code for this package is open-source and available through the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=sparsebn.To briefly illustrate the use of sparsebn, the code below learns the structure of the pathfinder network (Heckerman, Horvitz, and Nathwani 1992): R> library(sparsebn) R> data(pathfinder) R> data <-sparsebnData(pathfinder[["data"]], type = "continuous") R> dags <-estimate.dag(data)R> plotDAG (dags) This code estimates a solution path with 16 total estimates (see Section 3.2) and takes approximately one second to run.The first four estimated networks with an increasing number of edges are shown in Figure 1.This example is explored in more detail in Section 6.2.

Learning Bayesian Networks from Data
We begin by reviewing the necessary background and definitions, and then discuss the existing literature and methods.

Background
The basic model we work with is a p-dimensional random vector X = (X 1 , . . ., X p ) with joint distribution P(X 1 , . . ., X p ). Bayesian networks are directed graphical models whose edges encode conditional independence constraints implied by the distribution P. For continuous data, we assume that X follows a multivariate Gaussian distribution; for discrete data we assume each X j is a factor with r j levels.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Figure 1: Example output from learning the pathfinder network.To save space, only the first four nontrivial estimates are shown here out of the 16 total estimates in the full solution path.
The general theory of Bayesian networks is quite intricate, and we will make no attempt to cover it in detail here.The interested reader is referred to one of the many excellent textbooks on this subject: Koller and Friedman (2009); Spirtes, Glymour, and Scheines (2000); Lauritzen (1996).
Formally, a Bayesian network is defined as a directed acyclic graph G = (V, E) that satisfies the following factorization condition with respect to P: Here, pa(X j ) = {X i : X i → X j ∈ E} is the parent set of X j and θ j encodes the parameters that define the conditional probability distribution (CPD) for X j .
Traditional methods for learning Bayesian networks start with this definition and develop algorithms from a graph-theoretic perspective (Spirtes and Glymour 1991).This approach comes with restrictive assumptions such as strong faithfulness (Uhler, Raskutti, Bühlmann, and Yu 2013;Zhang and Spirtes 2002), which hinders their use in practice.To motivate our work, we adopt a more general approach to Bayesian networks via structural equation models.In this approach, we start by directly modeling each CPD P(X j | pa(X j ), θ j ) via a sparsebn generalized linear model.We will consider two special cases: Gaussian CPDs for continuous data and multi-logit CPDs for discrete data.This approach also fits naturally our framework for learning causal networks discussed in Section 2.2.

Continuous data
Suppose that for each j = 1, . . ., p there exists β j = (β 1j , . . ., β pj ) ∈ R p such that where β jj = 0 (to avoid trivialities) and , we can rewrite (1) as a matrix equation: (2) The model ( 2) is called a structural equation model for X.The matrix B defines the weighted adjacency matrix of a directed graph which, when acyclic, is also a BN for X.Note that to estimate B from data it is not enough to simply regress X j onto X −j -this leads to the so-called neighbourhood regression estimator of the Gaussian graphical model introduced in Meinshausen and Bühlmann (2006).The trouble with this approach is that there is no guarantee that the resulting adjacency matrix will be acyclic.In order to produce a BN, we must constrain B to be acyclic, which couples the parameters of each CPD to one another and induces a nonconvex constraint during the learning phase.For this reason, learning directed graphs is substantially more difficult than learning undirected graphs.
By writing Ω = cov(ε), we see that the parameters (B, Ω) specify a unique normal distribution N (0, Σ) for X.In fact, a little algebra on (2) shows that This gives a way to compute Σ from (B, Ω), and suggests a way to estimate the covariance matrix by setting Σ = (I − B) −T Ω(I − B) −1 .By taking Γ = Σ −1 , we obtain an alternative estimator of the Gaussian graphical model, given by This idea was first explored by Rütimann and Bühlmann (2009) using the PC algorithm (Section 2.3), and a similar idea using our methods is implemented in sparsebn.In many situations, the DAG representation (B, Ω) encodes more conditional independence relations than the inverse covariance matrix Γ, which is one of the motivations for learning DAGs from observational data.

Discrete data
In a discrete Bayesian network, each X i is a finite random variable with r i states.The most common model for discrete Bayesian networks is the product multinomial model, which we briefly review here.Under this model, the parent set of X i , pa(X i ), has q i = j∈pa(X i ) r j possible configurations, which leads to a total of p i=1 r i j∈pa(X i ) r j parameters.Let the possible states of pa(X i ) be indexed by {π k : k = 1, ..., q i } and Θ ijk = P( A discrete Bayesian network can therefore be parameterized by Θ = {Θ ijk ≥ 0 : j Θ ijk = 1}.This parameterization is the basis for various popular structure learning algorithms, including the K2 algorithm and the hybrid MMHC algorithm (Section 2.3).
Instead of this product multinomial model, sparsebn uses a multi-logit model for discrete data.One of the advantages of this approach is a significant reduction in the number of parameters, and recent work suggests that this model serves as a good approximation to the product multinomial model (Gu et al. 2016).Here, we briefly introduce the multi-logit model, and describe how it is used for structure learning.
We use a standard form of the multi-logit model in which each variable X j is encoded by d j = r j − 1 dummy variables (Dobson and Barnett 2008).More specifically, given a reference category (which may be arbitrary), the r j possible values of X j are encoded by a vector of dummy variables z j = (z jk , k = 1, . . ., d j ) ∈ {0, 1} d j .If we choose the last category as the reference, then z j,k = I(X j = k) for k = 1, . . ., r j − 1 so that the reference category is coded as z j = 0. Put z = (1, z 1 , . . ., z p ) ∈ {0, 1} d , where d = 1 + p i=1 d i .Under this parametrization, the conditional distributions take the form where β • j = (β 0 j , β 1 j , . . ., β p j ) ∈ R d and β i j ∈ R d i is the coefficient vector for variable X i to predict the th level of X j with intercept β 0 j .In this model, if X i / ∈ pa(X j ) then we have equivalently β i j = 0 for all .Let β ij = (β i1j , . . ., β ir j j ) ∈ R d i × R r j be the coefficient matrix for edge i to j.Therefore, by estimating β ij , we can infer the structure of a network.If β ij = 0, there does not exist an edge from X i to X j .

Causal DAG learning from interventions
DAGs are a popular model for causal networks, particularly with the use of experimental interventions in addition to observational data (Pearl 2000).Cooper and Yoo (1999); Meganck, Leray, and Manderick (2006); Ellis and Wong (2008) proposed methods to learn causal networks with a mix of observational and experimental data, and Peér, Regev, Elidan, and Friedman (2001); Pournara and Wernisch (2004) inferred gene networks with perturbed expression data.Each of the methods in sparsebn can take interventional data as input in order to learn the precise causal relationships in a system.
The following simple example illustrates how interventions can be used for learning causal relations: Assume the true causal graph is G * : X 1 → X 2 , which is observationally equivalent to the graph G : X 1 ← X 2 .That is, we cannot distinguish between these two graphs using observational data alone.With interventions, however, we could fix X 2 experimentally, which amounts to cutting off the edge from X 1 to X 2 in G * .In this case, the joint distribution factors as P(X 1 )P(X 2 ).In contrast, if we fix X 1 in G * , the relation between X 1 and X 2 stays the same, and the joint distribution factors as P(X 1 )P(X 2 | X 1 ).By exploiting this, experimental interventions can be used to uncover the true causal structure in a physical system.
To see how this can be applied in a statistical setting, we show how a general form of the density function for interventional data can be derived from the pure observational joint density function.Let M ⊂ {1, . . ., p} be the set of variables under intervention, so the joint sparsebn probability decomposes as where P(X i | •) is the marginal distribution of X i from which experimental samples are drawn.Thus, experimental data sets generated from the true DAG G can be considered as data sets generated from a DAG G , where G is obtained by removing all directed edges in G pointing to the variables under intervention.Furthermore, we can see that when M = ∅, (6) is the density function for observational data.For more details regarding causal DAG learning, please refer to Pearl (2000) and references therein.

Previous work
The various algorithms that have been proposed for structure learning of Bayesian networks fall into three main categories: constraint-based methods, score-based methods and hybrid methods.

Constraint-based methods
Constraint-based methods rely on repeated conditional independence tests in order to learn the structure of a network.The main idea is to determine which edges cannot exist in a DAG using statistical tests of independence, a procedure which is justified whenever the socalled faithfulness assumption holds.These algorithms first use independence tests to learn the skeleton of the network, and then orient v-structures along with the rest of the edges.Because of the existence of equivalent DAGs, the direction of some edges may not be decided.
The PC algorithm proposed by Spirtes and Glymour (1991) is a well-known example of this kind of method.Another example is the FCI (Fast Causal Inference) algorithm (Spirtes et al. 2000;Colombo, Maathuis, Kalisch, and Richardson 2012), which allows for latent variables in the network.The output of these algorithms is a partially directed graph, which means that there may be some undirected edges in the estimated graph.While the PC algorithm is a powerful method to learn Bayesian networks in low-dimensions with n very large, the performance of the PC algorithm is less competitive in high-dimensions.

Score-based methods
Score-based methods rely on scoring functions such as the log-likelihood or some other loss function.The goal of these algorithms is to find a DAG that can optimize a given scoring function.Some popular scoring functions include several Bayesian Dirichlet metrics (Buntine 1991;Cooper and Herskovits 1992;Heckerman, Geiger, and Chickering 1995), Bayesian information criterion (Chickering and Heckerman 1997), minimum description length (Bouckaert 1993;Suzuki 1993;Bouckaert 1994;Lam and Bacchus 1994), and entropy (Herskovits and Cooper 1990).One of the classic score-based methods is the greedy hill climbing (HC) algorithm (Russell and Norvig 1995).This algorithm is fast but tends to predict too many edges in high-dimensional settings.For discrete networks, the K2 algorithm (Cooper and Herskovits 1992) is another popular method, however, this method requires prior knowledge about the ordering of the network which is often unavailable in applications.There are also Monte Carlo methods (Ellis and Wong 2008;Zhou 2011), which are quite accurate but also computationally demanding.This limits Monte Carlo methods to smaller networks with only tens of nodes.Each of the learning algorithms implemented in sparsebn is a score-based method, and as such this package represents an attempt to resolve many of the computational and statistical issues cited here with respect to these methods.

Hybrid methods
Finally, there are hybrid methods which combine constraint-based and score-based methods.Hybrid methods first prune the search space by using a constraint-based search, and then learn an optimal DAG structure via score-based search (Tsamardinos, Brown, and Aliferis 2006;Gámez, Mateo, and Puerta 2011).The Max-Min Hill-Climbing (MMHC) algorithm proposed by Tsamardinos et al. (2006) is a powerful method of this kind.

Existing R packages for structure learning
There are several existing R packages for learning and manipulating Bayesian networks.bnlearn is a thorough and actively maintained package that implements a wide variety of classical approaches such as hill-climbing and MMHC (Scutari 2010).pcalg focuses on the causal interpretation of Bayesian networks, and implements the PC and FCI algorithms along with methods for inferring causal effects (Kalisch, Mächler, Colombo, Maathuis, and Bühlmann 2012).Other packages include deal for mixed data (Boettcher and Dethlefsen 2003) and gRain (Højsgaard 2012) for exact and approximate computations.For structural equation models in particular, lavaan is a modern R package that implements many of the standard fitting procedures for SEM (Rosseel 2012).
The main drawback of these packages is that they focus almost exclusively on observational data with few variables and large samples-the so-called low-dimensional regime.As a result, many applications of contemporary interest are precluded: • Datasets with several thousand variables, which arise in machine learning and computational biology, • High-dimensional data with p n, which is common in genomics applications with high-throughput datasets, such as gene expression data that have p ∼ 20, 000 and n ∼ 100, • Experimental data with interventions, which is also common in genomics applications.
The sparsebn package was specifically designed to fill this gap, with an orientation towards large, high-dimensional, experimental data.In order to achieve this, the methods contained in this package rely on a combination of novel methodology and algorithms in order to scale to larger and larger datasets.Of course, these methods also gracefully degrade to handle simpler settings with observational and/or low-dimensional data.Moreover, whenever possible, crosscompatiblity with the above mentioned packages has been provided (Section 4.3).

Learning with Sparse Regularization
To learn a Bayesian network from data, we use a score-based approach based on regularized maximum likelihood estimation that allows for the incorporation of intervention data.In this sparsebn section we discuss these details as well as the block coordinate descent algorithm used to approximate the resulting optimization problem.

Regularized Maximum Likelihood
Suppose X ∈ R n×p is a matrix of observations, and let denote the negative log-likelihood and ρ λ be some regularizer.For example, ρ λ may be the 1 penalty (Tibshirani 1996), the group norm penalty (Yuan and Lin 2006), or a nonconvex penalty such as the SCAD (Fan and Li 2001) or MCP (Zhang 2010).We consider the following program: min where D ⊂ R p×p is the set of weighted adjacency matrices that represent directed graphs without cycles.The resulting problem ( 7) is a nonconvex program, where the nonconvexity arises from (potentially) all three terms: The constraint D, the loss function , and the regularizer ρ λ .
For continuous data, we use a Gaussian likelihood derived from (2) combined with 1 or MCP regularization, and for discrete data we use a multi-logit model as in ( 5) combined with a group lasso penalty.When some data are generated under experimental intervention, we can derive the form of (B; X) as follows: When X i is under intervention, its marginal distribution is known and hence can be considered constant in (6).Recalling that M ⊂ {1, . . ., p} is the set of variables under intervention, it follows that Let I j be the set of row indices of the data matrix X where node X j is under intervention, and O j = {1, . . ., n} − I j be the collection of observations for which X j is not under intervention.Then, according to (6), the negative log-likelihood factorizes as where , f β j is the conditional density for the jth node, and x hj is the value of node X j at the hth data point.By incorporating experimental interventions in this way, we are able to orient the edges in a causal Bayesian network and thus distinguish between equivalent DAGs.Evidently, one advantage of this framework is its universal applicability to different data types and various likelihood models.
Since ( 7) is nonconvex it is generally regarded as infeasible to find an exact global minimizer of this problem.Indeed, score-based structure learning is known to be NP-hard (Chickering, Heckerman, and Meek 2004).Instead, we seek local minimizers of (7) through an optimization scheme based on block coordinate descent.The method is based on the following observation: The difficulty in solving (7) lies in enforcing the constraint B ∈ D, which is a highly nonconvex and singular parameter constraint.Instead, if we consider (7) one edge at a time, the problem simplifies considerably.This is the heuristic exploited by many score-based methods (most notably greedy hill climbing).Unlike conventional methods, however, Fu and Zhou (2013) propose a block-cyclic strategy and show that it outperforms existing approaches based on greedy updates.This general observation can be exploited to construct the family of fast algorithms implemented in sparsebn.

Algorithm details
Recalling that B = (β ij ), the high-level idea behind the method is the following: 1. Repeat outer loop until stopping criterion met: 2. Outer loop.For each pair (j, k), j = k: Since the program (7) depends on the unknown regularization parameter λ, this value must be supplied in advance to the algorithm.In practice, we wish to solve ( 7) for several values of the regularization parameter, so instead of returning a single DAG estimate, the output of this meta-algorithm is a solution path (also called a regularization path, Friedman et al. This is similar to the well-known graphical lasso for undirected graphs (Friedman, Hastie, and Tibshirani 2008).
As λ decreases, there is less regularization, hence the resulting estimates B(λ l ) become more dense.Since our focus is on sparse graphs, in practice we use this fact to terminate the algorithm early if the solution path becomes too dense, i.e. if the number of edges in B(λ l ) exceeds some user-defined threshold.The default values used in sparsebn are 10p edges for continuous data and 3p edges for discrete data.The smaller threshold for discrete data is due to the higher computational complexity of the underlying algorithm as compared with the continuous case.
The detailed implementation of the algorithms use several tricks from the literature on coordinate descent in order to speed up the algorithm: • Warm starts.Given the previous estimate B(λ l−1 ) in the solution path, we use B(λ l−1 ) as the initial guess for the next iterate B(λ l ).Furthermore, it is always possible to choose λ 0 so that B(λ 0 ) = 0 (i.e. the zero matrix), which is the default implementation used by sparsebn.
• Active set iterations.In the inner loop above, the algorithm only updates the nonzero parameters by solving at most p penalized regression or multi-logit regression problems.This is often computationally tractable, which yields significant performance improvements for large graphs.
• Block updates.Instead of updating each β jk one at a time, the algorithms update each parameter as a block {β jk , β kj }.This is justified by the acyclicity assumption: If β jk = 0 and β kj = 0, then the acyclic constraint is violated, and this fact can be exploited in order to update both parameters simultaneously.

sparsebn
• Sparse data structures.Internally, everything is stored using sparse data structures for representing directed acyclic graphs.This saves memory and speeds up the computation of each update.
Compared with existing packages for structure learning, the main novelties of the present methods are 1) The use of 1 and MCP regularization, 2) Block-cyclic (as opposed to oneat-a-time greedy) updates, and 3) The use of warm starts to return a solution path.Further details can be found in Fu and Zhou (2013) and Aragam and Zhou (2015).

Parameter estimation
After learning the structure of a Bayesian network, often it is of interest to then estimate the parameters of the local conditional distributions associated with the learned structure.
For causal DAGs, these parameters determine the causal effect sizes between the parents and their children.
For Gaussian data, this is straightforward by regressing each node onto its parents, using (unpenalized) least squares regression.This produces a weighted adjacency matrix B = ( β ij ) (or more specifically, a solution path B(λ l ), l = 0, . . ., L).Given these weights, we can estimate the conditional variance of each node given its parents by: This yields a variance matrix Ω = diag( ω 2 1 , . . ., ω 2 p ), and together ( B, Ω) can be used to estimate the covariance matrix Σ (see Section 2.1).
For discrete data, we regress each node onto its parents set using multi-logit regression via the R package nnet (Ripley, Venables, and Ripley 2016).Note that for discrete data, instead of a coefficient matrix we get a 4-way array B = ( β ij ), where βij is a matrix and the (l, k) entry of this matrix is the influence that level l of X i has on level k of X j .

The sparsebn package
sparsebn implements four structure learning algorithms based on the framework described in Section 3.Each method is tailored to a specific type of data: • Experimental, Gaussian data (Fu and Zhou 2013; Zhang 2016), • Experimental, discrete data (Gu et al. 2016), • Observational, Gaussian data (Aragam and Zhou 2015), • Observational, discrete data (Gu et al. 2016).
By combining these approaches, sparsebn automatically handles datasets with mixed observational and experimental data.Each algorithm is implemented in C++ using Rcpp (Eddelbuettel and François 2011; Eddelbuettel 2013).In addition to the main algorithms, the package implements high-level methods for fitting, plotting, and manipulating graphical models.
sparsebn is actually a family of R packages, designed to be cross-compatible with minimal external dependencies.To date, sparsebn imports the following packages: • ccdrAlgorithm, based on Aragam and Zhou (2015) and Fu and Zhou (2013), • discretecdAlgorithm, based on Gu et al. (2016), • sparsebnUtils, for housing various common utilities and classes.
The idea is that the codebase for each algorithm is housed inside its own package, allowing for rapid development and convenient extensibility.This allows us to add new algorithms and features rapidly without significant dependency or compatibility constraints.

Speed and scalability improvements
sparsebn is designed to handle large, high-dimensional datasets with p potentially in the thousands.To illustrate this, Figure 2 provides a comparison of our methods with the PC algorithm from the pcalg package and the MMHC algorithm from the bnlearn package.Due to the higher computational workload for discrete algorithms in general, the results here are for Gaussian data only.Furthermore, although these numbers are intended to be illustrative, the interested reader may find more extensive simulations corroborating these results for both types of data in Aragam and Zhou (2015); Gu et al. (2016). sparsebn In order to provide a direct comparison, the times reported in Figure 2 reflect the total time to learn graphs of the same complexity (number of edges), even though sparsebn is capable of computing further into the solution path if desired.Note also that both the PC and MMHC algorithms output only a single graph, whereas our method outputs a full solution path with multiple graph estimates.The results illustrate how sparsebn provides a more favourable scaling when p is large: For the largest graph with 1090 nodes, sparsebn is 17x faster compared to the PC algorithm in pcalg and 52x faster compared to the MMHC algorithm in bnlearn.

Experimental interventions
In addition to scalability, another feature that distinguishes sparsebn is native support for mixed observational and experimental data.As discussed in Section 2.2, experimental interventions allow observationally equivalent DAGs to be distinguished, thereby uncovering the structure of the true causal DAG.To illustrate this, we ran two simulation experiments, reported in Figures 3 and 4. To keep things simple, we focus on discrete data.1 As in the previous subsection, a more thorough evaluation of the effect of interventions can be found in Gu et al. (2016); Zhang (2016); Fu and Zhou (2013).
Figure 3 illustrates how the accuracy of reconstruction improves as interventions are added to each node in the network.We can see that as we process more interventions per node (denoted by m) with n held fixed, the true positive rate (TPR) increases.Further analysis of these results indicates furthermore that the number of reversed edges decreases along with the number of false positives, and as a result the overall structural Hamming distance (SHD), a measure of discrepancy between graphs, decreases.
Figure 4 considers the more practical scenario in which only k (k < p) nodes in the network are under intervention, and over time more interventions on more nodes are able to be collected.This is common when reconstructing large graphs in biological applications in particular.Again, we see that overall the true positive rate increases as more nodes are under intervention, and further analysis shows that the SHD decreases as well.
Since n is held fixed in each simulation, the improvements observed here cannot be attributed to an increase in sample size, illustrating how sparsebn is able to improve estimation of the true causal graph under experimental interventions.In particular, reducing reserved edges that are observationally equivalent shows in an obvious way the utility of interventions for causal inference.

Compatibility
Unfortunately, there is no consistent standard in R for storing and representing graphs.As a result, different domains have adopted different R packages as a de facto standard for graph and network representation.For example, in biology the graph package (Gentleman, Whalen, Huber, and Falcon 2016) seems to be the most popular, whereas in social science and demography the network package (Butts 2008) is more popular.In other domains, the igraph package (Csardi and Nepusz 2006) is popular, which has libraries in R, Python, and C.
For this reason, sparsebn does not provide its own mechanism for manipulating graphs, and instead provides cross-compatibility with each of these three packages.By default, all methods output graphs stored as an edgeList object, which is an internal class with little builtin functionality outside of being a storage mechanism for graph data.In order to make use of the extensive capabilities of the different graph packages in R, we have included the setGraphPackage method to allow users to set a global preference for which graph package to use.Once this preference is set, the full feature set of the selected package (e.g.plotting, manipulation, network statistics, etc.) becomes available to the user.We emphasize that the purpose of sparsebn is not to provide a new library for graph representation and visualization, but instead to provide algorithms for learning their structure.The manipulation of graphs is appropriately left to libraries designed explicitly for that purpose.
Furthermore, to allow cross-compatibility with existing packages for structure learning, we have provided methods to convert the output of sparsebn methods to bnlearn-compatible objects.Compatibility with the pcalg package is possible via the aformentioned graph package, which is the default representation used by pcalg.

Example: Cytometry data
To illustrate the use of this package, we will use the flow cytometry data from Sachs, Perez, Pe'er, Lauffenburger, and Nolan (2005) as a working example in this section.The original dataset consists of 11 continuous variables corresponding to different proteins and phospholipids in human immune system cells, and each observation indicates the measured level of each biomolecule in a single cell under different experimental interventions.Based on this data, a consensus network was reconstructed and validated, which is included in the package as well.First, we load this data: R> library(sparsebn) R> data(cytometryContinuous)

R> names(cytometryContinuous)
[1] "dag" "data" "ivn" Each component of this list stores an important part of the experiment: • dag is the consensus network with 11 nodes and 17 edges, learned via experiments detailed in Sachs et al. (2005).
• data is raw data collected from these experiments.
• ivn is a list of interventions for each observation in the dataset.
The network is visualized in Figure 5, in which a directed edge indicates that a change in the level of the parent will cause a change in the level of the child.This is a relatively small network which we use in order to keep the exposition simple.More examples, including a discrete version of this dataset and applications to large networks with 100-1000s of nodes, will be discussed in Section 6.
To illustrate the use of this package, the rest of this section describes the five main steps to learning Bayesian networks from data: 1) Loading the data, 2) Learning the structure, 3) Estimating the parameters, 4) Selecting the regularization parameter, and 5) Visualizing and assessing the output.We conclude with a brief overview of the various data structures used by sparsebn, which may serve as a useful reference for first-time users.

Loading data
In order to distinguish different types of data-namely, experimental vs. observational and continuous vs. discrete-we use the sparsebnData class which wraps a data.frameinto an object containing this auxiliary information.All of the methods implemented in sparsebn expect input as sparsebnData.To make use of this class, simply type R> cyto.data<-sparsebnData (cytometryContinuous[["data"]], type = "continuous", ivn = cytometryContinuous[["ivn"]]) Notice that we need to explicitly specify that the data is continuous (for discrete data, one would specify type = "discrete").Furthermore, since our data has interventions, we use the ivn argument to specify a list of interventions.Without this argument, it is assumed that the data is purely observational.
Finally, for discrete data, we may wish to manually specify the levels of each variable, which can be done using the levels argument.When this argument is omitted, we attempt to automatically infer the levels.Note that in doing so, however, levels which are missing from the data will not be recognized by the sparsebnData constructor.Note that some of the observations were not under intervention-such datasets with mixed observational and experimental samples are automatically handled by the methods in this package.

nodes 7466 observations
The output of estimate.dag is a sparsebnPath object, which stores the entire solution path that is returned by the method. 2We can inspect the first solution as we would an ordinary R list:

R> cyto.learn[[1]]
CCDr estimate 7466 observations lambda = 86.40602DAG: <Empty graph on 11 nodes.> The first estimate will always be the empty graph, which is a consequence of how we employ warm starts in the solution path.
The third estimate, for example, shows a bit more structure:

R> cyto.learn[[3]]
CCDr estimate 7466 observations lambda = 53.21299 The estimated DAG is represented by a table of parent sets with rows indexed by the child node.For large graphs, explicit output of the parental structure as shown here is omitted by default, however, this behaviour can be overridden via the maxsize argument.Alternatively, we can retrieve the adjacency matrix, {I( sparsebn Finally, for large graphs, it may be desirable to inspect a subset of nodes, which can be done using the show.parentsmethod: R> show.parents(cyto.learn[[3]], c("raf", "pip2")) [raf] mek [pip2] plc In addition to data, there are several optional parameters that can be passed to estimate.dag.The main arguments of interest are lambdas and lambdas.length,which allow the user to adjust the grid of regularization parameters By default, estimate.dagproduces a solution path of 20 estimates with the grid chosen adaptively to the data.This grid can be shortened or lengthened by specifying lambdas.length:R> cyto.learn50<-estimate.dag(data= cyto.data,lambdas.length= 50) For even more fine-tuning, the lambdas argument allows the user to explicitly input their own grid.For convenience we have included the generate.lambdasmethod, which provides a mechanism for generating grids of arbitrary lengths on either a linear or log scale: R> ### Use a linear scale R> cyto.lambdas<-generate.lambdas(lambda.max= 10, lambdas.ratio= 0.001, lambdas.length= 10, scale = "linear") R> cyto.lambdas [1] 10.00 8.89 7.78 6.67 5.56 4.45 3.34 2.23 1.12 0.01 R> ### Use a log scale R> cyto.lambdas<-generate.lambdas(lambda.max= 10, lambdas.ratio= 0.001, lambdas.length= 10, scale = "log") R> cyto.lambdas [1] 10.000 4.642 2.154 1.000 0.464 0.215 0.100 0.046 0.022 0.010 This grid can also be generated manually, although this is not recommended.To run the algorithm using cyto.lambdas(vs. the default): R> cyto.learn.log<-estimate.dag(data= cyto.data,lambdas = cyto.lambdas) Another argument of interest is edge.threshold,which is the easiest way to specify when the algorithm terminates.Specifically, if any point on the solution path contains an estimate with more than edge.thresholdedges, the algorithm will terminate immediately and return what has been estimated up to that point.This makes our methods anytime algorithms, in the sense that they can be interrupted at anytime while still producing valid output.This is convenient when running tests on very large graphs.
The remainder of the arguments to estimate.dagare mainly technical; for brevity we omit a complete description of all the arguments.The interested reader may consult the documentation for more details.

Parameter estimation
It is important to note that the output of estimate.dag is a sequence of graphs, i.e., no parameters (edge weights, variances, etc.) have been estimated at this stage.The next step is to estimate the values of the parameters associated with the underlying joint distribution.This is easy to do: R> cyto.param<-estimate.parameters(cyto.learn,data = cyto.data) The output is a list with each entry containing a component for the weighted adjacency matrix (coefs) and a diagonal matrix for the conditional variances (vars).For example, the parameters of the third estimate in the solution path with six edges are (rounded to two decimal places): For Gaussian data, we can also use (3) to estimate the implied covariance matrix for each solution in the solution path.This is implemented via the get.covariance method (see also get.precision for computing the precision, or inverse covariance, matrix).If the user is only interested in the covariance matrix (resp.precision matrix) and not the underlying DAG, then this extra step can be skipped by directly using the estimate.covariancemethod (resp.estimate.precision).

Model selection
Unlike existing methods, the output of estimate.dag is a solution path with multiple estimates of increasing complexity, indexed by the regularization parameter.Methods such as the PC algorithm and MMHC algorithm depend on the significance level α, and hence also exhibit varying levels of complexity in their output, however, the effect of α is less noticeable compared to λ.Thus, it is important to be able to pick out estimates for inspection and further exploration.
For ad hoc exploration, the select method is useful: This allows one to select an estimate from a sparsebnPath object based on the number of edges, the regularization parameter λ, or the index l.To save space, the output of the following code is suppressed: For practical applications, one is often concerned with selecting an optimal value of λ.In high-dimensions, optimal selection of λ for finite samples is an open problem, and past work has shown that both the prediction oracle and cross-validated choices perform poorly (Meinshausen and Bühlmann 2006;Fu and Zhou 2013).Instead, Fu and Zhou (2013) suggest a practical method for selecting λ based on a trade-off between the increase in log-likelihood and the increase in complexity between solutions.This method is implemented in sparsebn via the method select.parameter: R> select.parameter(cyto.learn,cyto.data) [1] 7 select.parameterreturns the optimal index according to this method, in this case l = 7, corresponding to a value of λ ≈ 20.18204 and an estimated network of 15 edges.

Visualization
In order to visualize graphs estimated by the sparsebn package, we make use of the visualization capabilities of existing graph packages (see Section 4.3).By default, sparsebn uses the popular igraph package: R> getPlotPackage() In addition to igraph, sparsebn is also compatible with the plotting features of the graph (via Rgraphviz) and network packages.It is easy to change which package is used for plotting: R> setPlotPackage("network") R> getPlotPackage() [1] "network" R> setPlotPackage("graph") R> getPlotPackage() [1] "graph" R> setPlotPackage("igraph") R> getPlotPackage() This allows the user complete flexibility over which R package is used for storing data and for visualizing data.In fact, sparsebn even allows one package to be used for visualization and a different package for storage.For comparison, the default output of each package is displayed in Figure 6.
Visualizing the full solution path is easy: Simply call plot(cyto.learn).In order to ensure flexibility, we maintain all of the defaults used by the selected package for the plot method.
It is easy, however, to customize the appearance of the plots if desired.For example, if we would like to compare the consensus cytometry network and the estimated network side by side, we can adjust the arguments to plot as follows: sparsebn q q q q q q q q q q q 1 2 3 The output can be seen in Figures 7a and 7b.
For large graphs, it is helpful to use a different set of defaults, which are provided in a separate method, plotDAG, which can be used for quick plotting (see, for example, Figure 1).
The sparsebnData class is used to represent both continuous and discrete data with experimental interventions.Observational data corresponds to the degenerate case where X does not contain any interventions, and is treated as such by the sparsebn package.R> names(cyto.data) [1] "data" "type" "levels" "ivn" The sparsebnPath class represents a solution path, which is the standard output of all of the learning algorithms implemented in this package.Internally, this is a list of sparsebnFit objects whose jth component corresponds to the jth value of λ in the solution path, R> names(cyto.learn)

NULL
The sparsebnFit class represents an individual graph estimate from a DAG learning algorithm.By default, this graph is stored as an edgeList object, which is an internal implementation of a child-parent edge list.This graph can also be stored in a variety of formats including graphNEL (from the graph package), igraph (from the igraph package), and network (from the network package) by using the setGraphPackage method.

Further examples
In this section we provide three more examples of the functionality of sparsebn: 1) An example with discrete data, 2) A benchmark network from the Bayesian network repository, and 3) An example with 5000 nodes.

Discrete cytometry data
In the previous section we used the cytometry network as an instructional example based on the original, continuous dataset.Sachs et al. (2005) also provided a cleaned, discretized version of this dataset which can be used to illustrate how these methods apply to discrete data.After cleaning and pre-processing, the original continuous measurements were binned into one of three levels: low = 0, medium = 1, or high = 2. Due to the pre-processing, the discrete data contains fewer observations compared to the raw, continuous data.
To use this data, we start by loading it as usual: R> library(sparsebn) R> data(cytometryDiscrete) The code to estimate this graph is essentially the same as in Section 5.For completeness, we present only the essential steps here.As before, the first step is to pass the data into a sparsebnData object: 5400 total rows (5390 rows omitted) Discrete data w/ interventions on 3600/5400 rows.
One of the main purposes of the sparsebnData class is to encode all of the necessary information needed to run the main algorithms; now that the data has been converted into a sparsebnData object, the user will notice almost no difference between the code in Section 5 and the sequel.It is easy to adjust lambdas and lambdas.lengthas before.Note also the difference between the number of solutions here and in Section 5.2; this is a consequence of the stopping criterion used for discrete data (see also Section 3.2).A plot of the DAG with 15 edges (for comparison with the network selected in Section 5.4) is shown in Figure 7c As the parameter space for the multi-logit model is much larger than for the Gaussian model, the output is much more complicated.For example, the node raf has a single parent in the DAG selected above, and so the parameter space for this node is a 2 × 3 matrix: For each extra parent, there will be two more columns added to this matrix.There is one such matrix for each node (this is the default output from nnet; see Section 3.3 for details on the parameterization).

The pathfinder network
In order to illustrate this package on a larger network, we will reconstruct the pathfinder network (Heckerman et al. 1992).The pathfinder network has 109 nodes and 195 edges and is part of the Bayesian network repository,3 a centralized repository of benchmark networks for testing learning algorithms.

sparsebn
We first load the dataset: R> data(pathfinder) R> dat <-sparsebnData (pathfinder[["data"]], type = "c") The data was generated from a Gaussian SEM with β ij = 1 whenever β ij = 0 and ω 2 j = 1 for each j.By (3), we were able to compute the implied covariance matrix and use mvtnorm to generate samples from this distribution.For this example, n = 1000 samples were drawn.

Large networks
To conclude the examples, we finish with an application to the estimation of very large, sparse networks.We begin by illustrating how to generate data from a linear Gaussian SEM.To generate a random, acyclic adjacency matrix, we make use of the random.dagfunction: R> set.seed(1) # for reproducibility R> nnode <-nedge <-5000 R> coefs <-random.dag(nnode,nedge) R> id <-Matrix::Diagonal(n = nnode) # identity matrix R> vars <-id The matrix coefs defines the coefficients of the SEM, and vars defines the conditional variances for each node (see (2)), which we take to be the identity for simplicity.For this example, we will simulate data from a sparse network with p = 5000 nodes and edges.Using the formula in (3), we can then calculate the 5000 × 5000 covariance matrix: R> covMat <-t(solve(id -coefs)) %*% vars %*% solve(id -coefs) Finally, we can use mvtnorm to generate samples from this covariance matrix: R> nobs <-50 R> gaussian.data<-mvtnorm::rmvnorm(n = nobs, sigma = as.matrix(covMat))R> dat <-sparsebnData(gaussian.data, type = "c") We now learn a DAG from this data, and keep track of the time: In this example, it took about 83 seconds for our package to estimate 16 DAGs with p = 5000.
Some comments on the timing are in order.The edge.threshold argument is used here to terminate the algorithm at 5000 edges-in practice, of course, we do not know the true number of edges and so this is meant to be purely illustrative.Moreover, since this parameter can be set arbitrarily high, the algorithm can in principle run for arbitrarily long.In fact, one of the upshots of our methods is that they can be interrupted at anytime, bearing in mind that the result may be a subgraph of the true graph.The edge.threshold parameter allows the user to incorporate prior knowledge of the sparsity (or lack thereof) of the underlying graph.

Conclusion
sparsebn is a fast, modern package for learning the structure of sparse Bayesian networks.By leveraging recent trends in nonconvex optimization, sparse regularization, and causal inference, we are able to scale structure learning to problems containing high-dimensional data with thousands of variables and experimental interventions.This fills a gap in the existing R ecosystem, which already provides excellent methods for traditional problems with large samples and without interventions.All of the code to reproduce the results presented here is available on GitHub, along with the source code of the sparsebn package, which is available on CRAN.
Given the nonconvex nature of the minimization problems here, one future direction is to incorporate stochastic optimization into our package to enhance its global search ability.For example, stochastic gradient descent may be used in the algorithm for discrete Bayesian networks, which will reduce the computational complexity as well.We are also interested in developing divide-and-conquer strategies for ultra-large graphs, say on the scale of 10 5 nodes.Finally, our package can be combined with various post-learning functions for multi-stage learning of causal networks.Given the DAG learned from sparsebn, one may further infer causal relations in a subgraph of interest by making additional causal assumptions or with new intervention data.
(a) Minimize (7) with respect to (β kj , β jk ), holding all other parameters fixed; (b) If the edge k → j (resp.j → k) induces a cycle in the graph, set β kj ← 0 (resp.β jk ← 0) and then update β jk (resp.β kj ); (c) Repeat inner loop until convergence: 3. Inner loop.Fix the edge set E from the outer loop and minimize (7) by cycling through the edge weights β kj for (k, j) ∈ E.

Figure 2 :
Figure 2: Timing comparison (in seconds).The data was simulated from a Gaussian DAG consisting of k = 1, . . ., 10 independent copies of the pathfinder network and n = 50 samples.(solid black line) C = CCDr algorithm implemented in sparsebn, (dashed blue line) P = PC algorithm implemented in pcalg, (dotted green line) M = MMHC algorithm implemented in bnlearn.

Figure 3 :
Figure3: The effect of interventions in learning discrete DAGs.For every node, m interventions are added to otherwise observational data with n = 500 and p = 50.(A) Scale-free network with 49 edges (solid green line), small-world network with 100 edges (dashed red line), polytree with 49 edges (dashed blue line), and bipartite graph with 50 edges (dashed black line); (B) Plot for each network individually.

Figure 4 :
Figure 4: The effect of interventions in learning discrete DAGs.For each k = 0, . . ., 50, m = 10 interventions for k randomly selected nodes are added to otherwise observational data with n = 500 and p = 50.Results are averaged over 20 random permutations of the order in which each node is added as k is increased.(A) Scale-free network with 49 edges (solid green line), small-world network with 100 edges (dashed red line), polytree with 49 edges (dashed blue line), and bipartite graph with 50 edges (dashed black line); (B) Plot for each network individually.

Figure 6 :
Figure 6: Comparison of the default output from the three supported R packages for visualization: (left) igraph, (middle) network, (right) graph.

Figure 7 :
Figure 7: Cleaned up plots of (a) The consensus cytometry network, (b) The learned network based on continuous data, and (c) The learned network based on discretized data.Both learned networks were estimated using estimate.dag.
When selecting by the number of edges or by λ, fuzzy matching is used by default so that the closest match is returned to within a given tolerance.Selecting by index is equivalent to subsetting as usual with the subset operator '[['.