SeqNet : An R Package for Generating Gene-Gene Networks and Simulating RNA-Seq Data

Gene expression data provide an abundant resource for inferring connections in gene regulatory networks. While methodologies developed for this task have shown success, a challenge remains in comparing the performance among methods. Gold-standard datasets are scarce and limited in use. And while tools for simulating expression data are available, they are not designed to resemble the data obtained from RNA-seq experiments. SeqNet is an R package that provides tools for generating a rich variety of gene network structures and simulating RNA-seq data from them. This produces in silico RNA-seq data for benchmarking and assessing gene network inference methods. The package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package= SeqNet and on GitHub at https://github.com/tgrimes/SeqNet .


Introduction
Gene regulatory networks (GRN) describe systems of gene-gene interactions that regulate gene expression. Using high-throughput technologies to measure simultaneous gene expression is an effective avenue for inferring these networks (Sanguinetti and Huynh-Thu 2019). Over the past two decades, a vast collection of methods have been developed for reverse engineering the network structure from gene expression data (Featherstone and Broadie 2002;Pihur et al. 2008;Pesonen et al. 2018). Many of these have been successful in identifying well-known regulatory interactions and identifying new interactions that were later validated (Modi et al. 2011).
There are many excellent review articles on GRN inference methods. Barbosa et al. (2018) summarizes modern methods from various frameworks, such as Bayesian networks, regression-The SeqNet package (Grimes and Datta 2021) is implemented in R (R Core Team 2021) and provides tools for (1) creating random networks that have a topology similar to real transcription networks without the need of a reference network and (2) generating RNA-seq data with conditional gene-gene dependencies defined by the network structure. SeqNet does not require a reference GRN like existing simulators, but it allows users to provide a reference gene expression dataset. This removes any dependence on a reference GRN, which is only available for few well-studied organisms, and enables the simulation of RNA-seq expression profiles. Alternatively, a zero-inflated negative-binomial distribution can be used to model raw RNA-seq counts (see the details given in Appendix A). However, in real datasets the raw counts will contain artifacts that introduce artificial associations among genes, so data are typically normalized before estimating the co-expression network structure (Chowdhury et al. 2019). In this case simulated data should resemble normalized values, not raw counts, and SeqNet will generate the appropriate data when given a normalized dataset as its reference.
We demonstrate that the generated networks are similar to real transcriptional networks by comparing the topology of simulated networks to the E. coli transcription network obtained from RegulonDB. The topology is also compared to networks generated from GeneNetWeaver, SynTReN, and Rogers (Rogers and Girolami 2005), all of which are obtained from the GRNdata R package (Bellot et al. 2021). When generating RNA-seq data, we show that SeqNet produces expression profiles that resemble the reference dataset, and that co-expression methods produce a similar distribution of gene-gene associations on the simulated data as they do on the reference dataset. In an application, SeqNet simulates data from three different underlying network topologies and three different distributions of gene expression to determine how these factors affect the performance of several co-expression methods.
The paper is organized as follows. In Section 2 we present the algorithms for generating network structures, creating weights for the network, and simulating RNA-seq data. Section 3 provides a validation study of the generated network topologies and the characteristics of the simulated RNA-seq data. In Section 4 the main package functions are illustrated in usage examples. An application of SeqNet is shown in Section 5, which compares the relative performance of 12 co-expression methods across different network topologies. Section 6 concludes with a discussion.

Algorithms
The SeqNet simulator consists of three main components: the network generator, the Gaussian graphical model (GGM), and a converter from GGM values to RNA-seq expression data. The network structure is decomposed into individual gene modules that specify local connectivity among subsets of genes. Weights for these local connections are generated under the framework of a GGM. The GGM is used to generate values with dependencies that resemble the global structure obtained when aggregating all the individual local structures. These Gaussian values are finally converted to RNA-seq data such that the dependence structure is maintained. Details for these three components are provided in the following sections.

Network structures
The first step of the simulator is to create a network structure. This structure is represented as an undirected graph whose nodes correspond to genes and edges correspond to nonzero gene-gene associations. The meaning of association is discussed in Section 2.2.

Algorithm 1 Creating a random unweighted network.
Input: p, the number of genes in the network; m, (optional) the number of modules in the network. Output: An unweighted network object.
1: Let G = {1, . . . , p} denote the set of indices for all genes in the network. Set i = 1. 2: Generate a random module size, n i . 3: Sample a subset of n i unique genes, G (i) ⊆ G. 4: Generate a local network structure for the genes in G (i) , and store it as an adjacency matrix, A (i) . 5: Save the resulting module as M (i) = (G (i) , A (i) ) 6: Repeat steps 2-4 until all genes have been selected at least once, or until m modules are created (if m is provided). 7: Return the unweighted network N = (G, where m is the total number of modules created.
The idea behind the SeqNet algorithm is to decompose the global network into a collection of smaller, overlapping modules. The modules are meant to resemble individual gene regulatory pathways. Pathways are sets of genes that interact with each other to control the production of mRNA and proteins. These are biological processes that are activated in response to internal or external stimulus. Each pathway corresponds to a specific process, but they can be organized into a hierarchical structure. For example, the pathway for a general function like metabolism can be broken down into smaller pathways representing each of its individual sub-processes. Some genes may be involved in multiple processes, so there is often overlap among pathways. Individual pathways can be represented mathematically using a graph, where nodes represent genes and edges represent gene-gene interactions.
The SeqNet algorithm generates a (global) network from the ground up by iteratively constructing the overlapping modules. The resulting network is denoted by N = (G, {M (i) } m i=1 ), which contains the set of p genes, G = {1, . . . , p}, and a collection of modules, {M (i) } m i=1 , where M (i) = (G (i) , A (i) ) is the ith module containing genes G (i) ⊂ G and an adjacency matrix, A (i) ∈ {0, 1} |G (i) |×|G (i) | , representing the local network structure. The algorithm consists of three main steps: (1) generate a random module size, (2) sample genes to populate the module, and (3) generate the local network structure for the module. These steps are iterated until all p genes have been sampled. Algorithm 1 outlines this process, and details for the three key steps are provided after.
Note that the resulting network, N , is unweighted because only the structure of the gene-gene connections is specified (through the adjacency matrices A (i) ). At the global level, two genes i, j ∈ G are connected if and only if they are connected in at least one of the modules.

Random module size
A random module size is generated by n = n min + x, where x ∼ NB(n avg − n min , σ 2 = σ 2 ) has a negative-binomial (NB) distribution parameterized to have mean n avg − n min and variance σ 2 . The NB distribution provides the flexibility to model the overdispersion of module sizes found in real pathways. The modules have a default minimum size of n min = 10, with a mean and standard deviation of n avg = 50 and σ = 50, respectively. These parameters can be adjusted by the user. In addition, a maximum module size (optional) can be set, and any generated size above this value will be reduced to it. This may be useful for smaller networks to avoid any one module containing a majority of nodes. By default, the maximum module size is set to 200.

Sampling a subset of genes
The sampling procedure used to select genes for each module is designed to achieve two goals: (1) the modules may be overlapping, i.e., genes may be sampled for multiple modules, and (2) every new module is connected to at least one existing module. The second goal ensures that the global network is a single, giant component similar to those found in real transcription networks (Dobrin et al. 2004;Ma et al. 2004).
The first module is a special case: after generating a random module size, n 1 , a subset of genes, G (1) ⊆ G, is sampled with uniform probability, where G = {1, . . . , p} denotes the global set of genes and |G (1) | = n 1 .
For each additional module, after generating a random module size n i , first a "link node" is sampled with probability dependent on the connectivity of each gene. Then, the remaining n i − 1 genes are sampled with reduced probability given to genes that already belong to a module.
Link nodes are motivated by transcription factors found in real regulatory network, which regulate genes across multiple pathways. The link node is sampled from the existing modules, i−1 j=1 G (j) , so that it links the new module to the rest of the network. This process helps to generate networks with the desired topological characteristics, but it is not meant to mimic the evolutionary process that forms real biological networks -that process is largely unknown. However, the mechanism of selecting a link node achieves goal (2) of ensuring that the global network is a single component. In addition, and perhaps more importantly, link nodes provide a way of building up very large hub genes. Rather than sampling link nodes with uniform probability they are sampled with probability π k , which is defined by Equation 1 below. This sampling probability places greater weight on highly connected genes. Through this mechanism, hubs can continually grow their connections by being included in newly created modules.
The remaining n i − 1 genes are sampled with reduced probability given to genes that already belong to a module. Let G = i−1 j=1 G (j) denote the set of genes selected for all previous modules, and let ∈ G denote the link node that was sampled for the current module. Then, the remaining genes are sampled with probability where ν ∈ [0, 1] is a parameter that controls the amount of overlap among modules. This probability simply reweights the genes that were selected for a previous module by a factor of ν; the normalization factor for this probability is 1/(|G| − (1 − ν)| G| − ν). The parameter ν has a significant influence on the topology of generated networks. Adjusting ν provides a way of exploring a rich distribution of networks. For example, increasing ν will increase the amount of overlap among modules, which will result in more modules being created (assuming the total number of modules is not fixed). The end result will be more connections in the network, i.e., less sparsity and higher average degree, and a decrease in the average distance between nodes.

Generating a local network structure
The network structure generator is influenced by the Watts-Strogatz algorithm (Watts and Strogatz 1998) and the Barabasi-Albert model (Barabási and Albert 1999). In the Watts-Strogatz algorithm, a network of p nodes is initialized by a ring lattice, with each node having initial degree 2K. For each node, edges are rewired to a new neighbor with probability π. If π = 0, then the ring lattice structure is retained; if π = 1, then the network is a random graph.
With intermediate values for π, small-world (but not scale-free) networks are created. The Barabasi-Albert model is a generative algorithm whereby p nodes are added to the network one at a time. Each time a node is added, it is connected to M other nodes. The probability that the new node will be connected to an existing node i is where d i is the degree of node i; that is, the wiring probability is proportional to the degree of the node. This contrasts with the Watts-Strogatz algorithm where the rewiring probability is constant. The preferential wiring to high-degree nodes results in a scale-free network structure.
A scale-free structure is a common assumption when analyzing gene expression data, from normalization (Parsana et al. 2019) to network estimators (Zhang and Horvath 2005). However, there is evidence to suggest that biological networks are not so scale-free (Stumpf and Ingram 2005); this is particularly evident when observing, for example, the E. coli transcription network. One characteristic that stands out is the presence of nodes with exceedingly large degrees, higher than expected under a power-law distribution.
The SeqNet network generator is designed to sample from a diverse range of topologies beyond scale-free networks. The generator for local network structures uses ideas from both the Watts-Strogatz and Barabasi-Albert models. It starts with an initial network structure, then performs a rewiring step, followed by a edge removal step, and it finalizes by reconnecting any disconnected components. Each of these steps is detailed next.
The algorithm initializes the module structure as a ring lattice, which can be visualized by arranging the nodes in circular layout and connecting nodes to their nearest neighbors to form a ring. The initial degree of each node is 2K. The algorithm proceeds in three stages: node rewiring, connection removal, and connecting disconnected components.
The algorithm initializes the module structure as a ring lattice with neighborhood size K. Each connection is then rewired with a constant probability. When a connection is rewired, a new node is selected with a probability that depends on the node degree. Once the rewiring step is completed, an edge removal step is used to delete edges from the network with constant probability; this allows for finer control of the network's sparsity. At this point, the network may contain disconnected components either due to the rewiring or removal of edges, so a final step is added to connect these disconnected components.
In the rewiring step, each of the p nodes are traversed one at a time. On node i, each of its connections may be rewired with a constant probability π rewire . Suppose the connection to node j is to be rewired. Then, a new neighbor, k, is sampled with probability π k from among the remaining nodes, i.e., from {1, . . . , p}\{i, j}. (The construction of π k is discussed next.) In the edge removal step, uniform sampling is used -each edge in the module may be removed with probability π remove . The default values are π rewire = 1 and π remove = 0.5. Whenever rewiring is performed in a module, the rewiring probability, π k , will depend on the degree of gene k. However, rather than rewiring with probability proportional to the node degree, as in the Barabasi-Albert model, we instead consider the percentile of the degree. The goal here is to devise a strategy that builds hub genes more quickly than in the Barabasi-Albert model. The idea behind our strategy is to strongly favor the highest degree nodes, even if they have a small number of connections. This is where percentiles come in: they translate degrees into rankings. The percentile is taken with respect to all the genes that can be rewired to -in a module with p nodes, there will be p − 2 candidates. The percentile for gene k is defined by p k = F D (d k ), where F D (x) = d∈D (d ≤ x)/p is the empirical cumulative distribution function (CDF) of the candidate node degrees, and D = {d 1 , . . . , d p }\{d i , d j } are the degrees for all candidate genes. (Recall that modules are considered as local networks; the degree of a gene is calculated from the global network, i.e., by the total number of connections to it in all modules that currently contain it.) Note that, in the case of all ties, p k = 1 for each k, and in any case at least one p k will be equal to one. Now, candidate genes could be sampled with probability proportional to p k . However, using p k alone would only slightly favor genes with the highest degree. To increase the preference towards the most connected genes, the p k 's need to be adjusted so that smaller values are decreased. This can be accomplished by exponentiating p k . For example, using p 2 k will leave the top rank gene(s) with p k = 1 unchanged, while reducing any lower ranking genes with p k < 1. In general, using probabilities proportional to p α k , where α > 1, can dramatically favor genes with the highest degree. As it turns out, this approach will allow us to generate networks that contain exceedingly large hub nodes, similar to those found in the E. coli network.
Note that the CDF of the Beta distribution achieves the same effect as exponentiating p k : with the Beta distribution, F α,β , parameterized by α and β such that its mean and variance are α α+β and αβ (α+β) 2 (α+β+1) , respectively, we have p α 0 k = F α=α 0 ,β=1 (p k ). Using this setup, the parameters α and β provide a more flexible way of controlling the rewiring preferences. This is the approach used by SeqNet. A small term, , is added to guarantee that all nodes have a nonzero chance of being selected. Thus, the final probability of rewiring to node k is with default values set to α = 100, β = 1, = 10 −5 .
In the edge removal step, each connection in the module has a constant probability, π remove , of being removed. Node degrees do not have any influence on whether a connection is removed.
In the final step, disconnected components in the module are connected. The largest component is identified, then each of the smaller components are wired to it one by one. A single gene is sampled from the smaller component uniformly, and it is wired to a gene k in the large component with probability π k as defined above.
This procedure for generating a random module structure is summarized in Algorithm 2.

Gaussian graphical model
Once a network structure is created, the next step is to weigh each connection. Since the network is composed of individual modules, each having a local network structure, the process of adding weights is performed at the module level.
The connections within a module identify the associations among its genes. An association will be defined as conditional dependence; that is, a nonzero gene-gene association will imply Algorithm 2 Generating a random module structure.
Input: N , the number of genes in the module; d * , a vector containing the external degree of each gene; π rewire = 1, the probability of rewiring an edge; π remove = 0.5, the probability of removing an edge; K = 3, the neighborhood size of the ring lattice; α = 100 and β = 1, the parameters for the Beta distribution; and = 10 −5 , a small constant. Output: A random network.
1: Let G = {1, . . . , N } denote the set of indices for each gene in the module. Initialize a ring lattice, A, of size N such that node i ∈ G has degree d i = 2K. 2: (Rewiring step) For i in G, rewire each connection to node i with probability π rewire .
If a connection between nodes i and j are to be rewired, choose a new node, k, from where the subscript G denotes subsetting the vector onto the nodes contained in G. After rewiring, update d to reflect the changes in degree distribution. 3: (Removal step) For each connection in A, remove the connection with probability π remove . 4: (Component step) Identify the disconnected components in A. If there are more than two, let A 1 denote the largest component and A c denote the remaining components for c = 2, . . . , C. 5: For each c in {2, . . . , C}, sample a node i from A c with uniform probability, and sample a node j from A 1 with probability proportional to Add a connection between nodes i and j in A. 6: Return A. that the expression of two genes are conditionally dependent given the other genes in the module. If an association is zero, meaning there is no edge between the two genes in the module's network, then those two genes are conditionally independent. Using this definition of association, we can translate the network structure into a probabilistic model through the multivariate normal distribution, also referred to as the Gaussian graphical model (Koller and Friedman 2009). In this model, the expression X ∈ R p of the p genes in the module has the joint distribution where µ ∈ R p is a mean vector and Ω ∈ R p×p is a positive definite matrix called the precision matrix. Note, the precision matrix is simply the inverse of the covariance matrix commonly used to parameterize the normal distribution. However, working with the precision matrix is preferable because the partial correlation, ρ ij , between genes i and j can be calculated directly from Ω by, Furthermore, partial correlation is related to conditional dependence in the GGM, whereby two genes have zero partial correlation if and only if they are conditionally independent given the other genes. Therefore, the off-diagonal elements in Ω determine the conditional dependencies among genes. Since conditional dependence is our definition of association in the network, this means that each gene-gene association can be represented through nonzero entries in Ω. So, generating weights for the associations in a module is equivalent to generating a random precision matrix that has the same sparsity structure as the module's adjacency matrix.
In the following sections we outline a procedure for generating a precision matrix and discuss a concern when dealing with multiple networks.

Generating a precision matrix
Given a module M = (G, A) containing p = |G| genes, the task is to generate a positive definite matrix Ω ∈ R p×p that has the same structure as A, meaning A j,k = I(Ω j,k = 0) for every j = k, where I(·) is the indicator function. This task is carried out in three steps.
First, a random symmetric matrix W ∈ R p×p of weights is generated. By default, the entries in the lower triangle of W are sampled uniformly from (−1, −0.5) ∪ (0.5, 1), and the uppertriangular values are updated to ensure symmetry. The support of this uniform distribution can be modified by the user. An initial matrix is constructed by mapping these weights to the module structure byΩ The initial matrixΩ will not be positive definite, so an adjustment must be made. Let λ max (Ω) and λ min (Ω) denote the maximum and minimum eigenvalues ofΩ, respectively. The precision matrix is obtained by Ω =Ω + cI p , where c is Note that c is nonnegative since the diagonal elements of the adjacency matrix A are zero (a gene cannot be associated with itself), hence the trace ofΩ is zero. This means that all eigenvalues are zero and c = 0, or they are a mix of positive and negative values and c > 0. ButΩ is symmetric, so the only case in which all of its eigenvalues are zero will be when it is the zero matrix (i.e. there are no edges in the network). Hence, c = 0 is handled as a special case in which data are generated using independent distributions.
In any non-empty network, c > 0, and this adjustment ensures that the smallest eigenvalue of Ω is positive, so the matrix is positive definite. In addition, this choice of c increases the smallest eigenvalue sufficiently above zero so that computing the inverse of Ω is numerically stable. In particular, it guarantees that the condition number of Ω, defined as κ(Ω) = |λ max (Ω)|/|λ min (Ω)|, is below 10 k + 1 (Gentle 2007). The default value is k = 2.5. Decreasing k will increase the numerical accuracy of the computed inverse (Gentle 2007), but will also shrink the partial correlation values for the associations toward zero. Increasing k will have the opposite effect.

Generating precision matrices for multiple networks
If multiple networks are being constructed, it should be considered whether or not common gene-gene associations across networks should be given the same weight. By giving common edges the same weight, the only difference between networks will be the network structure itself; some edges may be missing or added within a network, but any edge present in multiple networks will have the same association strength. In order to accomplish this, the weight matrix W generated for a given module must be the same across networks, and the adjustment c used to create the precision matrix module must also be constant.
Algorithm 3 outlines the procedure for generating precision matrices for multiple networks.
To reiterate, the key attribute of this algorithm is that common edges across networks will be given the same weight. If this property is not desired, then the individual networks can be weighted by iteratively running the algorithm with a single network as the input.
Algorithm 3 Generating precision matrices for multiple networks.

Input: A collection of unweighted networks, {N
); the networks have the same number of modules, and each module contains the same genes across networks, G (i|1) = . . . = G (i|n) , for i = 1, . . . , m. Output: A set of weighted networks with equal weights for common edges across networks.

Generating RNA-seq data
The final step of the simulator is to generate data from the GGM and convert those values into expression data. There are two goals in creating these data: (1) The gene expression has a dependence structure that reflects the global network structure, and (2) the marginal distribution of expression for each gene resembles the reference RNA-seq data.
The generator is divided into two parts to tackle these goals. First, initial data are generated and aggregated together from the local GGMs defined in each module; this imposes the module dependence structures onto the gene expression. Then, the Gaussian values are transformed into RNA-seq data by sampling from the empirical distribution of the reference dataset. These two steps are detailed in the following sections.

Initializing values from local GGMs
, the first step is to generate a local expression vector,X (j) ∈ R |G (j) | , for each module j = 1, . . . , m; the tilde here is used to distinguish these initial Gaussian values from the eventual RNA-seq data. Based on the GGM of each module, the local expression vector is generated from the multivariate normal distribution, X (j) ∼ N (0, (Ω (j) ) −1 ), where the mean µ = 0 is the zero vector of appropriate length. A sample can be obtained from this distribution by first generating independent standard normal variables Z i ∼ N (0, 1), setting Z (j) = (Z 1 , . . . , Z p ) , and transforming them bỹ The global expression,X ∈ R |G| , is calculated by aggregating the expression over all modules. Consider gene i ∈ G, and let M i = {j : i ∈ G (j) } denote the set of module indices that contain gene i. If the gene does not belong to any module (M i = ∅), then its global expression, X i , is generated directly from the standard normal distribution. Otherwise, its expression is calculated byX denotes the generated expression of this gene in the jth module. Rather than averaging, the weight |M i | −1/2 is used so that the variance of the aggregated expression remains constant; i ) = 1. This procedure results in a random expressionX i ∼ N (0, 1) for each gene such that the covariance between two genes i and j is where (Ω (k) ) −1 ij is the covariance between the two genes in the kth module. Note that, although two genes will be uncorrelated if they are in distinct modules, their partial correlation may be nonzero if the modules overlap. This means that in the presence of overlapping modules, genes with no direct connection may have nonzero partial correlation. Whether or not this property is appropriate depends on the user's model of the data generating process. In particular, we must decide if it is appropriate to model gene expression as the aggregate of independent pathways (modules), or if expression is dependent on the global dependencies alone (collapse the modules into a single, large module). The answer to this question is not clear from a biological perspective, so SeqNet is designed to handle both cases.
The network can be coerced into a single module, which interprets all connections as global connections (this is illustrated in Section 4). In this case, the initialized expression values, X (1) , provide the global values; no aggregation is needed since there is only one module. These generated values have all of the usual properties of a GGM. In particular, the nonzero partial correlations will correspond to direct connections (edges) in the network -this is the main distinction between generating values from a single module compared to aggregating over multiple overlapping modules.

Converting GGM values to RNA-seq data
Once the Gaussian values,X i ∼ N (0, 1) for i ∈ G, are obtained, the final step is to convert these into RNA-seq data. The conversion is performed for each gene, one at a time, using the empirical marginal distribution of expression from the reference RNA-seq data.
is the CDF of the standard normal distribution. The inverse of the empirical CDF is calculated in R using the quantile() function with argument type set to 1, which defines the inverse CDF as The expression values of a given gene in the reference are assumed to be identically distributed. However, raw RNA-seq counts will fail to satisfy this assumption due to differences in library sizes, guanine-cytosine (GC) content, batch effects, and other sources of technical and biological variation (Hitzemann et al. 2013;Chowdhury et al. 2019). Therefore, it is assumed that the data have been appropriately normalized to allow for comparisons across samples, for example, using trimmed mean of M values (TMM) normalization (Robinson and Oshlack 2010), surrogate variable analysis (SVA) (Leek and Storey 2007;Parsana et al. 2019), SVA by partial least squares (SVAPLS) (Chakraborty et al. 2012), or removal of unwanted variation (RUV) (Freytag et al. 2015;Jacob et al. 2016).
Algorithm 4 Generating RNA-seq data from a collections of local GGMs. Output: A matrix X ∈ R n×p of RNA-seq expression data generated from the network.
1: Sample p = |G| columns from the reference dataset Y . If p ref < p, then sample with replacement. 2: Set r = 1. Initialize a n by p matrix X. 3: Generate Gaussian valuesX ∈ R p from the weighted network.

Validation
In this section, we assess the simulator in terms of the topological properties of the generated networks and how well the generated data resemble the reference RNA-seq dataset.

Network topology
Three topological measures are commonly used to characterize networks; these include the degree distribution, clustering coefficient, and average path length. These measures have been used to describe the structure of various complex networks and to assess network generators (Barabasi and Oltvai 2004;Di Camillo et al. 2009).
Let V denote a random node sampled from the nodes in a graph, v ∈ G. The properties of a network are described on two levels: by the local characteristics with respect to V , and by the global characteristics with respect to G. In this setup, V is a random variable with P(V = v) = n −1 for v ∈ G and zero otherwise, where n = |G| is the number of nodes in G. In general, local properties of a topological measure m(v) are expressed in terms of its distribution P(m(V )), while global properties are viewed by its expectation E(m(V )).

Degree
The degree of a node, d(v), is the number of connections to v. The degree of network is said to follow a power-law distribution if Networks with a power-law degree distribution are called scale-free networks (Albert and Barabási 2002). Such networks contain hub nodes that have many more connections than the average node. In biological networks, the parameter γ is often between 2 and 3 (Milo et al. 2002). However, the degree distribution of transcription networks may not be so simple. For example, the exponentially truncated power-law, which has the form P(d(V ) = k) ∝ k −γ exp(−αk), may be more suitable for some networks (Zhang and Horvath 2005;Csányi and Szendröi 2004). In this study, the degree distribution for various networks are compared visually rather than fitting any particular model.
The average degree of the network, provides a global characterization of the graph.

Clustering coefficient
The local clustering coefficient, C(v), of a node is the probability that two of its neighbors are connected. It is defined for nodes of degree d(v) ≥ 2 by, Visually, suppose nodes j and k are connected and are neighbors of i. Then, the three nodes i-j-k form a triangle in the network. This is what the clustering coefficient measuresthe frequency of triangles in the network. As a function of node degree, the average local clustering coefficient has been shown to follow a power-law distribution, with φ = 1 for metabolic networks (Ravasz et al. 2002;Barabasi and Oltvai 2004); the powerlaw scaling of C(k) is characteristic of a hierarchical organization that is not always found in scale-free networks (Ravasz et al. 2002). As with the degree distribution, C(k) will be compared visually for various networks rather than fitting any particular model.
The global clustering coefficient provides a measure of the entire network. It is defined by, Note, this is not equivalent to the average local clustering coefficient, n −1 v∈G C(v). The global clustering coefficient is calculated using the transitivity() function from the igraph R package (Csárdi and Nepusz 2006).

Average path length
The third measure is based on the shortest path between nodes. In particular, the average path length, L(v), for node v is, Length of shortest path between v and w, where G(v) denotes the largest connected component of G that contains v, and |G(v)| is the number of nodes in G(v). This measure is calculated using the distances() function from the igraph R package. As with the clustering coefficient, we consider the average path length as a function of node degree, The global measure is the average shortest path among all pairs of nodes, This measure is sometimes referred to as the characteristic path length. It is calculated using the mean_distance() function from the igraph R package.
When studying a network generating model, it can be useful to study the global characteristics as a function of network size N . For example, consider the characteristic path length as a function of network size, L(N ) = E(L(G) | |G| = N ), where G is now a random network sampled from the generating procedure. It has been shown that scale-free networks with exponent 2 < γ < 3 are ultrasmall (Cohen and Havlin 2003), having an average path length that scales by L(N ) ∼ log log N , compared to small-world networks whose average path length scales by log N . Similar characterizations can be made with d(N ), the expected average degree, and C(N ), the expected global clustering coefficient, of networks of size N . However, when comparing a network generated by SeqNet to the E. coli transcription network, comparisons of the global characteristics will be made at the fixed size N of the transcription network.

E. coli transcription network
In this section we compare the topology of a generated SeqNet network to the E. coli transcription network obtained from RegulonDB . This demonstrates that SeqNet is able to generate a network with a topology similar to a real GRN without using a reference network. Note that GeneNetWeaver and SynTReN both use the E. coli network as a reference, whereas SeqNet and Rogers do not.
RegulonDB provides the strength of evidence for each regulatory interaction, coded as either "strong" or "weak". When constructing the E. coli transcription network, we must choose whether to include only "strong" sources of evidence, or both "strong" and "weak" sources. There are currently 2767 connections that have strong evidence and 1797 connections with weak evidence. The network topology will change depending on whether or not the "weak" connections are included, so we consider both scenarios.
The SeqNet generator has many parameters that can be adjusted. However, in this example only ν will be adjusted, which controls the amount of overlap among modules; the remaining parameters are set to default values. As we will see, the change that occurs in the E. coli network topology when adding "weak" connections can be modeled by simply decreasing ν to allow for more connections in the generated network.

Strong evidence only
In the first case, only connections with strong evidence are included in the E. coli network. The SeqNet generator is set with ν = 10 −3 and default values for all other parameters. All five networks are shown in Figure 1. The main feature visible from this view is the presence of very large hub genes. Notably, the network generated by Rogers does not contain any large hubs.
The local topology for each network is summarized in Figure 2. The first column provides the degree distribution of each network. In a scale-free network, the points will follow a linear Network visualization Figure 1: From left to right: the E. coli transcription network using only "strong" evidence connections; a network generated from SeqNet with ν = 10 −3 ; a network generated by GeneNetWeaver; a network generated by SynTReN; and a network generated by Rogers.
Nodes are scaled by their degree. trend. But each distribution, aside from Rogers, has a longer tail than expected in a powerlaw distribution; this tail indicates the presence of very large hub genes. These results are consistent with previous findings that suggest biological networks deviate from a scale-free topology (Stumpf and Ingram 2005;Khanin and Wit 2006). However, we are not particularly interested in whether the networks are scale-free, but rather, whether the topology of the generated networks are similar to that of the E. coli transcription network.

E. coli
The second and third columns of Figure 2 show the clustering coefficient and average path length, respectively, as a function of node degree. These tend to follow the linear trend more closely. The power-law distribution for the clustering coefficient, C(k), has been previously identified in biological networks (Ravasz et al. 2002). However, to our knowledge the distribution of L(k) has not been previously characterized. These results from the E. coli transcription network suggests that L(k) may also follow a power-law distribution in biological networks.
Focusing on the SeqNet network (row B), the degree distribution and average path length (first and third columns) have very similar patterns compared to the E. coli network (row A). The main difference is found with the clustering coefficient, in which the E. coli network contains more variability in the clustering coefficient among genes of the same degree compared to the SeqNet network.
The global topology for each network is summarized in Table 1. The sparsity, average degree, and average path length of the SeqNet generated network are all similar to the E. coli network. The average degree for the GeneNetWeaver and SynTReN networks are much larger than the E. coli network, and similarly, their average path length is smaller. The GeneNetWeaver network most closely matches the E. coli network in terms of clustering coefficient.

Strong and weak connection
Next, we consider including the connections with weak evidence into the E. coli network. By including more edges, the network will become more dense. To reflect this change, the SeqNet generator is re-run with ν decreased from 10 −3 to 10 −2 to allow for more overlap among modules which will increase density. The new E. coli and SeqNet networks are shown in Figure 3.
The topology of the E. coli network remains relatively unchanged at the local level (see Figure 4) compared to the network with only "strong" connections, but the global characteristics change (see Table 2); as we might expect after including additional connections, the average degree increases and the average path length decreases. Similar changes are found in the SeqNet network after decreasing ν from 10 −3 to 10 −2 .
The sparsity, average degree, and average path length of the SeqNet generated network are again very similar to the E. coli network. The average degree and average path length for the GeneNetWeaver and SynTReN networks now agree with the updated values for the E. coli network. The clustering coefficient is unchanged when adding the weak connections, and the SeqNet clustering coefficient remains larger, as before.

Comparing generated data to reference data
SeqNet is able to simulate expression data that resemble a reference dataset. This is validated in two ways. First, the mean and variance of gene expression profiles are compared between the reference data and the simulated data. Second, since the main use of SeqNet is to assess the performance of co-expression methods, we compare the behavior of several such methods on the reference data compared to the simulated data. Details are provided in the following sections.

Reference RNA-seq datasets
Four RNA-seq datasets were downloaded from LinkedOmics (Vasaikar et al. 2017), a portal to multi-omics data generated by The Cancer Genome Atlas (TCGA). Data were obtained for four cancers: breast invasive carcinoma, glioma, head and heck squamous cell carcinoma, and prostate adenocarcinoma. The downloaded datasets contain reads per kilobase million (RPKM) normalized, log 2 (x + 1) transformed expression levels for transcripts annotated at the gene level. Since RPKM normalization does not align gene expression across samples (Lin et al. 2016), we applied the trimmed mean of M-values (TMM) normalization (Robinson and Oshlack 2010) to the untransformed RPKM data. The TMM-normalized values for each cancer were then used as four reference RNA-seq datasets (without the log transformation).
Due to computational restrictions of some co-expression methods used later, only a subset of genes in each reference dataset is used. However, since we are interested in analyzing coexpression, a random selection of genes would not be ideal because the sampled genes may be unrelated. A better strategy is to randomly sample a set of gene regulatory pathways, and subset the data onto the genes contained in those pathways. This way, the reference data will likely retain some true co-expression patterns among the sampled genes. In this study, 20 pathways are randomly sampled from the Reactome database (Fabregat et al. 2015). Note, the main results in this section are robust to different sets of sampled pathways (results not shown).
Results for the breast cancer reference dataset are reviewed in the following sections. Similar results are obtained for the other three reference data; those details are provided in Appendix B. The breast cancer dataset contains 1093 RNA-seq samples and 15034 genes. Subsetting the data on 20 random pathways resulted in 621 unique genes.

Mean and variance of expression profiles
The first experiment verifies that the simulated expression profiles are similar to those in the reference data. A key feature of RNA-seq data is the presence of overdispersion; the gene expression profiles tend to have much larger variance compared to their mean. This motivates the use of a mean-variance plot to summarize the distribution of expression profiles. That is, the columns of each dataset will be summarized by their mean and variance, and the distribution of mean-variance pairs will be compared across datasets.
In Figure 5, a reference dataset is compared to simulated data using SeqNet and simulated data from a zero-inflated negative-binomial (ZINB) model; details for the ZINB model are provided in Appendix A. A smoothed kernel density estimate is displayed to highlight the shape of each distribution. The outlying points are expression profiles with a density estimate below 0.05. We find that the overall mean-variance relationship is similar across datasets, but the outlying profiles -particularly those with excess variance -show up with the SeqNet generator but are missed by the ZINB model.

Values from co-expression methods
In the second experiment, we verify that co-expression methods produce similar association values on simulated data as they do on the reference dataset; references for these methods are provided in Section 5.1. Each co-expression method produces an association score for each gene-gene pair. For p genes, this results in p(p − 1)/2 different values. In this assessment, Figure 5: Mean-variance plots on a log scale for the reference dataset (left) and simulated data using SeqNet (middle) and a ZINB model (right). The heatmap shows a smoothed kernel density estimate. Outlying points are expression profiles with a density estimate below 0.05. Figure 6: Distribution of association scores from the 12 network inference methods outlined in Section 5.1. The underlying network is based on the estimated partial correlations in the reference dataset.

Distribution of association scores
each method is applied to the reference dataset, simulated data from the SeqNet generator, and simulated data from the ZINB model; the results for each method are summarized by a histogram of the values generated from each dataset.
To simulate data, an underlying network of gene-gene associations must be provided. Since our goal is to show that the associations in simulated data can resemble those in real data, the underlying network is estimated directly from the reference dataset. This way, the associations in the simulated data are close to those in the reference. Results are shown in Figure 6. We find that for each method the values follow a similar distribution in each of the three datasets. The SeqNet and ZINB datasets tend to generate substantial overlap and are indistinguishable for most methods. There is some difference in the simulated and reference datasets, particularly for the correlation methods and ARACNE, but these may be due to genuine differences in the true underlying network and estimated network structures.
SeqNet can also be used to test the null case: what values do the co-expression methods produce when there are no gene-gene associations in the simulated data? In this case the reference dataset is not changed, but network structure is now empty; simulated data will contain independent gene expression profiles. Results from this experiment are shown in Appendix B. Most co-expression methods produce association values near zero, as expected. However, the methods CLR and Silencer continue to produce association scores that are relatively large.

Package usage
The main SeqNet functions are illustrated through a few working examples. A complete reference manual is available from the Comprehensive R Archive Network (CRAN); it contains additional examples, a description, and argument details for every function. The manual can be accessed in R using the help() function. SeqNet can be installed directly from CRAN and loaded using the following commands.

R> install.packages("SeqNet") R> library("SeqNet")
A description of the main functions are provided in Table 3. The package manual contains additional details for all SeqNet functions and their arguments. This documentation is conveniently accessible within R using the help() and the ? commands, e.g.,

R> help(package = "SeqNet") R> ?random_network
The two main SeqNet functions are random_network() and gen_rnaseq(). These can be used to quickly generate a random network of p genes and simulate an RNA-seq dataset of n samples.
R> p <-100 R> n <-100 R> network <-random_network(p) R> generated_data <-gen_rnaseq(n, network) p indicates the number of nodes (genes) in the network and n the number of samples to generate. random_network(p) generates a random network of size p and gen_rnaseq() is used the generate the n samples.
The variable generated_data is a list containing the generated dataset (containing n rows and p columns) and the reference dataset used to create them. Note that the network is not Function Description random_network() Generate a random network containing p nodes (genes). Additional tuning parameters for the network generating algorithm can be used here, including nu, prob_rewire, prob_remove, and alpha. as_single_module() Used to coerce a 'network' into a single, large module rather than a collection of overlapping modules. This affects how data are generated when passed into gen_rnaseq(), as discussed in Section 2.3. The decision to use this function should be based on the user's model of the data generating process. perturb_network() Rewires a 'network' to obtain a differential network. By default, a hub node is randomly sampled from the 'network', and each of its connections are removed with probability 0.5. Note, this function can be used iteratively to turn off many hub nodes. gen_partial_correlations() Generates weights for the connections in a 'network'.
When multiple 'network' objects are provided, connections that are common across networks are given the same weight. print() Printing a 'network' displays the number of nodes, edges, and modules, along with the average degree, clustering coefficient, and average path length in the network. plot() Generates a plot of the 'network'. See also plot_network(). plot_network_diff() Generates a plot of the differential network between two 'network' objects.

gen_rnaseq()
Generate an RNA-seq dataset containing n samples (rows) with a dependence structure defined by a given 'network'. A reference dataset can be provided. If the 'network' is not weighted, then weights are automatically generated with a warning message. sample_reference_data() Helper function for sampling p genes (columns) from a reference dataset. The argument percent_ZI is useful when the reference contains zero-inflated genes to control the proportion of the p sampled genes that are zero-inflated. get_adjacency_matrix() Creates an adjacency matrix from a 'network'. If genes i and j are connected in the 'network' (in any module), then the ijth entry of the adjacency matrix is 1, and otherwise is 0. create_network_from_ adjacency_matrix() Creates a 'network' from an adjacency matrix. This is useful when the user has a prespecified network structure, i.e., does not want to generate a random one, but does want to generate RNA-seq data. weighted; when gen_rnaseq() is called, it automatically uses gen_partial_correlations() to generate weights for the network. Additionally, no reference dataset is provided to gen_rnaseq(); in this case, gen_rnaseq() uses the breast cancer dataset by default. The above code is a minimal example to quickly generate RNA-seq data. However, better practice is to explicitly code the intermediate steps. The weights are generated with gen_partial_correlations() and SeqNet::reference$rnaseq accesses the reference data set: R> p <-100 R> n <-100 R> network <-random_network(p) R> network <-gen_partial_correlations(network) R> df_ref <-SeqNet::reference$rnaseq R> generated_data <-gen_rnaseq(n, network, df_ref) This code shows how the user can generate weights for the network object and access the breast cancer reference data. Note that SeqNet::reference$rnaseq can be replaced by the user's preferred reference data. In addition, any normalization or transformations can be applied to these data before passing it into gen_rnaseq().
The next example illustrates simulating two populations. This is useful, for example, to generate data for a differential network analysis. In most applications, the two networks should be similar overall with only a few connections changed. To achieve this, the perturb_network() function is used to introduce differential connections (perturbations) to the first network. In addition, this example will demonstrate how to coerce a 'network' to a single module using the as_single_module() function (refer to Section 2.3 for details). The networks are visualized using plot(), and the layout of each display can be passed from one plot to the other for easier comparison, as will be shown.
At this stage, we have generated two random networks, network_1 and network_2. Both of these networks have been plotted, but it is difficult to compare the two images. For easier comparison, the plot_network_diff() function is used to plot the differential network. Here, tan edges are connections only present in the first network, red edges are only present in the second network, and black edges are common to both networks. In addition, the size of each node is proportional to the number of differential connections (tan or red edges), which helps identify differentially connected nodes. Note that saving the original plot, g, allows us to maintain the same layout across different plots.  One can also compare two networks by a differential network visualization (see Figure 8): R> plot_network_diff(network_1, network_2, g) To generate precision matricies for these two networks, the gen_partial_correlations() function is used. When passing more than one network into this function, it returns a list of weighted networks.

R> network_list <-gen_partial_correlations(network_1, network_2) R> network_1 <-network_list[[1]] R> network_2 <-network_list[[2]]
These weighted networks can then be used to generate RNA-seq data. The breast cancer dataset is used, and it is subset onto p genes prior to passing it into the generator using the sample_reference_data() function. Using the same p reference genes for both generated datasets ensures that the marginal distribution of each gene is the same in both networks and only the conditional dependences among the genes change. Alternatively, different references can be used for the two populations, which would allow us to simulate differential expression in addition to differential connectivity.
Here, adj_diff is the adjacency matrix for the differential network, and it can be visualized just like a 'network' object using the plot_network() function. Any adjacency matrix can also be used to create a 'network' object with create_network_from_adjacency_matrix():

R> network <-create_network_from_adjacency_matrix(adj_1)
This can be useful if the user wants to modify the structure generated by random_network() or to use a pre-specified network structure to generate RNA-seq data. Note, however, that the adjacency matrix only specifies the global connections (it does not contain information on modules), so the created 'network' will consist of a single module containing all of those global connections. (This is similar to applying the as_single_module() function to a 'network' object.)

Application
The SeqNet package provides a set of tools for creating gene-gene networks and simulating gene expression data from them. These tools can be used in many applications: for example, (1) to compare the performance of various co-expression methods in a particular setting; (2) to explore how robust a method is to assumptions on network topology or distribution of gene expression; (3) to determining whether different transformations of the data improve performance; (4) to display the estimated network and visually compare it to the true underlying network; (5) to determine if a method is able to identify certain motifs in the underlying network; and (6) to assess the performance of differential network analysis methods (Gill et al. 2010;Fukushima 2013;Ha et al. 2015;Ji et al. 2017;Siska and Kechris 2017;Grimes et al. 2019).
The range of applications is too broad to fully cover in this manuscript. Instead, a single application is selected and covered in detail. In this application, SeqNet will be used to assess the performance of 12 co-expression networks in various settings; these settings include three different topologies for the underlying network and three different marginal distributions for the expression data. The goal is to determine how these settings affect the relative performance of each method. This application illustrates the two main utilities of SeqNet: the ability to implement any network structure, and to generate expression data under various distributions.
The R code for this application involves the use of external packages (methods for estimating networks and assessing performance). Since the bulk of code is not specific to the SeqNet package, those details are left as Supplements in the replication materials.
The minet R package (Meyer et al. 2008) is used to run ARACNE, CLR, and MRNET; the default settings for these three methods were used, which includes using "Spearman" as the entropy estimator. The bc3net R package (de Matos Simoes and Emmert-Streib 2012) provides an implementation of BC3NET and C3NET; the default settings were used, except the entropy estimator was changed from "Pearson" to "Spearman" so that all MI-based methods use the same estimator. The degree weighted lasso, DWLasso, is implemented in the DWLasso R package (Sulaimanov et al. 2018). GENIE3, which is a regression tree based method, is implemented in the GENIE3 R package (Huynh-Thu et al. 2010); the default number of trees was decreased from 1000 to 100 due to computational restrictions. The graphical lasso, glasso, is implemented using the glasso R package (Friedman et al. 2019); the regularization parameter was set to 0.1 for all simulations. The shrinkage approach to covariance estimation, Shrinkage, is implemented in the corpcor R package (Schafer et al. 2017). The silencing of indirect correlations method, Silencer, was translated from existing MATLAB code (Barzel and Barabási 2013) into R code. The Pearson and Spearman correlations were both implemented using the base R function cor().

Performance measures
Inferring the connections in a gene-gene network can be viewed as a binary classification problem, in which every gene-gene pair is classified as connected or not connected. The co-expression methods produce values that correspond to the estimated strengths of each gene-gene association. A network can be constructed from these values by setting a threshold and connecting any gene-gene pair with an association strength above the threshold. With a fixed threshold, the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) can be calculated, and performance is evaluated using metrics such as sensitivity (also called recall), specificity, and true discovery rate (TDR; also called precision), which are defined by, sensitivity = recall = TP/(TP + FN), specificity = TN/(TN + FP), TDR = precision = TP/(TP + FP).
However, this leaves the task of choosing an appropriate threshold for each method. In practice, the best threshold will likely depend on the specific problem at hand, as raising or lowering the threshold will adjust the trade-off between sensitivity and specificity. For some problems, a higher threshold may be preferred in order to increase sensitivity at the potential cost of precision, while in other settings having higher precision may be favorable.
To avoid the problem of choosing a threshold, the receiver operating characteristic (ROC) and precision-recall (PR) curves can be used. This approach considers all possible thresholds to explore the overall performance. The ROC curve plots 1 − specificity on the x-axis and sensitivity on the y-axis, and the PR curve plots recall on the x-axis and precision on the y-axis. In the context of classifying connections in the network, there is a large imbalance between the two classes; since the networks are sparse, the number of true connections will be much smaller than the number of possible connections. Because of this, using the ROC alone can provide misleading results. The PR curve avoids these problems because it does not incorporate TN, the number of true negatives, which is the term that will be inflated from the sparsity of the network and can skew the results. In the DREAM3 challenge, both ROC and PR were used to gain a full characterization of performance . Here, we present results for PR and provide those for ROC in Appendix C.
If a method produces a PR (or ROC) curve that dominates all other curves, then that method provides the best performance (Davis and Goadrich 2006). More often, however, the curves for each method will intersect, making the selection of a best performer more ambiguous (Bekkar et al. 2013). The area under the PR curve (AUC-PR) and area under the ROC curve (AUC-ROC) provide convenient summary metrics for the overall performance. Selecting the method with the highest AUC-PR makes the process less ambiguous, even if no single method produces a dominating curve. However, the method with highest AUC-PR will not necessarily be the same method with the highest AUC-ROC (Davis and Goadrich 2006).

Network structures
The SeqNet package provides a novel method for creating random networks. However, the user is not limited to the built-in network generator; any network can be used to generate gene expression data with SeqNet. In this application, three different network generating algorithms are considered: the Watts-Strogatz algorithm for generating random small-world networks (Watts and Strogatz 1998); the Barabasi-Albert method for generating scale-free networks (Barabási and Albert 1999); and the SeqNet algorithm. Figure 10 provides a visualization of networks generated by these three procedures. The performances of each coexpression method are compared across these three underlying network topologies. Figure 10: Example networks generated from the Watts-Strogatz algorithm (left), Barabasi-Albert method (middle) and SeqNet generator (right). Each network contains p = 400 genes. Figure 11: AUC-PR values from 30 simulations using networks generated by the Watts-Strogatz algorithm, the Barabasi-Albert method, and the SeqNet network generator. RNAseq expression profiles were generated with SeqNet using the breast cancer reference dataset. In general, the performance of each method depends on the underlying network topology.

Distribution of gene expression
The SeqNet package is designed to incorporate a reference RNA-seq dataset and generate expression profiles that have similar marginal distributions as the reference. In addition, Figure 12: AUC-PR values from 30 simulations using Gaussian, RNA-seq, and ZINB data. The underlying network is generated using the SeqNet generator. The NB and RNA-seq data are log(x + 1) transformed. Each method has similar performance across different data types. the package can be used to generate data from a Gaussian graphical model, which produces normally distributed expression profiles, or it can generate data from a zero-inflated negativebinomial model, which produces expression profiles having a discrete distribution resembling raw RNA-seq counts. In this application, each of these three approaches are considered. The co-expression method are compared across the three distributions. For simulating RNA-seq data, the reference used is the breast cancer dataset reviewed in Section 3.

Results
All results are from networks containing p = 400 genes and n = 200 samples. This resembles the typical setting where n < p, and the network is moderately large. The main restriction of p and n in this application were due to computational limitations of DWLasso and GENIE3.
The AUC-PR values of each method for different network topologies are shown in Figure 11. Most methods have their best performance on the Watts-Strogatz small-world network (WS), with a decrease in performance on the Barabasi-Albert scale-free networks (BA), and a fur-ther decrease on the SeqNet networks. Some exceptions are DWLasso, which attains its best performance on SeqNet networks -this makes sense because SeqNet networks contain very large hub genes, and DWLasso is designed to handle hubs; MRNET, which has similar performance on both BA and SeqNet networks; and both Pearson and Spearman correlations maintain similar AUC-PR across all three topologies.
The results from different marginal distributions of gene expression are shown in Figure 12. Each method maintains similar performance across the three distributions, suggesting that they are robust to the distribution of expression profiles.
In Figure 13, the relative performances of each method are compared based on AUC-PR, with underlying networks generated by the Watts-Strogatz algorithm, the Barabasi-Albert method, and the SeqNet generator. In each setting, RNA-seq expression data are generated with SeqNet using the breast cancer reference dataset, and a log(x + 1) transformation is applied. These results show that the relative performance depends on the underlying network topology. For example, on WS networks the five methods ARACNE, BC3Net, CLR, glasso, and Silencer all have the best performance. But on BA networks, only ARACNE and BC3Net come out on top. When considering SeqNet networks, DWLasso has a slight edge in performance, followed closely by ARACNE, BC3Net, C3Net, and MRNET.
This application demonstrates the importance of the underlying topology when assessing the relative performance of network estimation methods. With SeqNet, a wide range of topologies can be generated by varying its tuning parameters. The default settings were used in this demonstration, but a range of values should be considered in practice. For example, increasing nu lowers the average path length and increases the average degree of generated networks, while increasing alpha lowers the clustering coefficient and has minimal effect on average degree and path length. The utility of SeqNet comes from exploring these tuning parameters and evaluating the network estimation methods across a wide range of topologies.

Discussion
The SeqNet package provides convenient tools for assessing the performance of co-expression methods. The main feature of SeqNet is its ability to simulate RNA-seq expression data from an underlying network of gene-gene associations, and we have demonstrated that the marginal distributions of simulated data resemble those of a reference dataset. While any network can be used with SeqNet, the package also provides a novel network generator that creates random networks having a topology similar to real biological networks. The SeqNet network generator can be adjusted through various tuning parameters to access a rich distribution of network topologies -the exact relationship between these parameters and the generated topologies will be studied in a future project.
The SeqNet package is implemented in R and freely available through CRAN. The implementation uses C++ code for efficient computation. The algorithm for generating random networks runs in linear time, O(p), where p is the number of nodes in the network. On a 2.2 Ghz Intel Core i7 laptop computer, it takes on average 0.06s, 0.58s, and 7.4s to generate networks containing 10 2 , 10 3 , and 10 4 nodes, respectively. The algorithm for simulating expression data also runs in linear time, O(n), where n is the number of samples generated. On the same machine, it takes on average 0.73s, 5.5s, and 45s to simulate 10 2 , 10 3 , and 10 4 samples from a network containing 1000 nodes.

Method performance on Barabasi-Albert (scale-free) networks
Method performance on SeqNet networks Figure 13: AUC-PR values from 30 simulations with the underlying network generated using the Watts-Strogatz algorithm, the Barabasi-Albert method, and the SeqNet generator. The RNA-seq expression data are generated by SeqNet using the breast cancer dataset.
SeqNet is most useful for evaluating the performance of unsupervised co-expression methods that do not implement biological knowledge. It can be used to gain insight into how performance may change depending on the topology of the underlying network, or depending on the distribution of gene expression. SeqNet can use a reference RNA-seq dataset to simulate realistic data, however only information on the marginal distributions of read counts in the reference are used. Any additional information about the genes or samples are not taken into account. Because of this, SeqNet may have limited utility in evaluating co-expression methods that take advantage of existing domain knowledge, such as the known identity of transcription factors.
A limitation of SeqNet is its reliance on the Gaussian graphical model (GGM). This model allows us to generate data that have a particular conditional dependence structure, but the gene-gene interactions are restricted to monotonic relationships. It is uncertain how well this model is able to capture the true data generating process of gene expression. The dynamics of gene regulation are often modeled using Michaelis-Menten and Hill kinetics, but it is unclear how to simulate high-dimensional RNA-seq data from this model. Using a GGM provides a tractable approach, and our validation study suggests that it does produce data that are similar to the real datasets. Figure 14: Distribution of association scores from 12 network inference methods. The SeqNet and ZINB data contain independent gene expression profiles, i.e., the underlying network contains no connections.

B.2. Prostate cancer
The prostate cancer dataset contains 497 samples. Subsetting the data on 20 random Reactome pathways results in 673 unique genes. Figure 15: Mean-variance plots for the prostate cancer reference dataset and simulated data using SeqNet and a ZINB model. The blue heatmap shows a smoothed kernel density estimate, and the outlying points are expression profiles with a density estimate below 0.05. The axes are drawn on a log scale. Figure 16: Distribution of association scores from various network inference methods. The underlying network used by SeqNet is based on the estimated partial correlations in the reference dataset. Figure 17: Distribution of association scores from various network inference methods. The SeqNet and ZINB data contain independent gene expression profiles, i.e., the underlying network contains no connections.

B.3. Head and neck cancer
The head and neck cancer dataset contains 520 samples. Subsetting the data on 20 random Reactome pathways results in 726 unique genes. Figure 18: Mean-variance plots for the head and neck cancer reference dataset and simulated data using SeqNet and a ZINB model. The blue heatmap shows a smoothed kernel density estimate, and the outlying points are expression profiles with a density estimate below 0.05. The axes are drawn on a log scale. Figure 19: Distribution of association scores from various network inference methods. The underlying network used by SeqNet is based on the estimated partial correlations in the reference dataset.

Null distribution of association scores
Figure 20: Distribution of association scores from various network inference methods. The SeqNet and ZINB data contain independent gene expression profiles, i.e., the underlying network contains no connections.

B.4. Glioma
The Glioma dataset contains 667 samples. Subsetting the data on 20 random Reactome pathways results in 538 unique genes. Figure 21: Mean-variance plots for the glioma reference dataset and simulated data using SeqNet and a zero-inflated negative binomial model. The blue heatmap shows a smoothed kernel density estimate, and the outlying points are expression profiles with a density estimate below 0.05. The axes are drawn on a log scale. Figure 22: Distribution of association scores from various network inference methods. The underlying network used by SeqNet is based on the estimated partial correlations in the reference dataset.

Null distribution of association scores
Figure 23: Distribution of association scores from various network inference methods. The SeqNet and ZINB data contain independent gene expression profiles, i.e., the underlying network contains no connections.

C. Additional results from the application
This section provides results for the AUC-ROC metric, along with full PR and ROC curves. The same general conclusions made in the manuscript for AUC-PR can be made for AUC-ROC; the performance of each method depends on the topology of the underlying network but is robust to different distributions of gene expression. However, there is less variability in AUC-ROC when comparing methods on a fixed network topology; each method provides similar performance in terms of AUC-ROC. This is in contrast to AUC-PR, which produced more separation in performance among the methods. Figure 24: AUC-ROC values from 30 simulations using networks generated by the Watts-Strogatz algorithm, the Barabasi-Albert method, and the SeqNet network generator. RNAseq expression profiles were generated with SeqNet using the breast cancer reference dataset. In general, the performance of each method depends on the underlying network topology. Figure 25: AUC-ROC values from 30 simulations using Gaussian, RNA-seq, and ZINB data. The underlying network is generated using the SeqNet generator. The RNA-seq and ZINB data are log(x + 1) transformed. Each method has similar performance across different data types.

Method performance on SeqNet networks
Figure 26: AUC-ROC values from 30 simulations with the underlying network generated using the Watts-Strogatz algorithm, the Barabasi-Albert method, and the SeqNet generator. The RNA-seq expression data are generated by SeqNet using the breast cancer dataset.
The full PR and ROC curves are shown in the following figures. The results are separated based on the underlying network topology.

Method performance on Watts-Strogatz (small-world) networks
Figure 27: PR and ROC curves from 30 simulations with the underlying network generated using the Watts-Strogatz algorithm. The RNA-seq expression data are generated by SeqNet using the breast cancer dataset.

Method performance on Barabasi-Albert (scale-free) networks
Figure 28: PR and ROC curves from 30 simulations with the underlying network generated using the Barabasi-Albert method. The RNA-seq expression data are generated by SeqNet using the breast cancer dataset.