Computation of Graphlet Orbits for Nodes and Edges in Sparse Graphs

Graphlet analysis is a useful tool for describing local network topology around individual nodes or edges. A node or an edge can be described by a vector containing the counts of diﬀerent kinds of graphlets (small induced subgraphs) in which it appears, or the “roles” (orbits) it has within these graphlets. We implemented an R package with functions for fast computation of such counts on sparse graphs. Instead of enumerating all induced graphlets, our algorithm is based on the derived relations between the counts, which decreases the time complexity by an order of magnitude in comparison with past approaches.


Introduction
Analysis of networks plays a prominent role in many areas of science and business, from genetic and protein networks in bioinformatics to social networks in mining user data. Describing the roles of individual nodes and edges, clustering them, and predicting their future development requires observing their locally defined properties. One of the methods -used particularly in bioinformatics -is based on counting graphlets and graphlet orbits.
Graphlets are small connected simple graphs (Pržulj, Corneil, and Jurisica 2004). There are 9 different graphlets with two to four nodes and 30 graphlets with up to five nodes. In graphlet-based network analysis, we examine induced graphlets within the network: For each node, we count the number of times the node touches an induced graphlet of each kind, which gives a 9-or 30-dimensional vector description of the local topology surrounding the observed node. One of the vector components represents, for instance, the number of times the node is included in an induced star on five nodes. Furthermore, we can group nodes of each graphlet into orbits (Pržulj 2007) with respect to  1: Graphlets with 2-5 nodes with enumeration of orbits. Node colors and line shapes, which are chosen arbitrarily, correspond to orbits within each graphlet. Node orbits are enumerated as in Pržulj (2007); edge orbit numbers are enumerated by increasing orbits of the corresponding node pairs. the graphlet automorphisms (Figure 1(a)). Orbits define the "roles" of the nodes within the graphlet. For instance, in a star on five nodes (G 11 ), one node represents the center and the remaining four nodes are the leaves; the nodes of the star thus form two different orbits (numbered 23 and 22, respectively). Instead of counting only the number of appearances of induced stars that touch an observed node in the network, we can count how many times the node represents the center of such star (i.e., the node is connected to four nodes that are not connected to each other) and how many times it has the role of a leaf (i.e., it is connected to a node that is connected to another three nodes that are disconnected from each other and from the observed node). This gives a finer description of the node's vicinity with a 15-dimensional vector for four-node graphlets and a 73-dimensional vector for five node graphlets.
Similar can be done for edges (Solava, Michaels, and Milenković 2012): There are 68 edge orbits for graphlets with 3-5 nodes, which allow for a characterization of an edge with a 68-dimensional vector (Figure 1(b)). , 2(c) and 2(d) show all four-node subgraphs that include node C; node C appears in orbits 5, 7 and 11. Table 2(e) shows all orbit counts for node C, including those belonging to two-and tree-node subgraphs. Table 2(f) shows the orbit counts for all nodes and orbits. The vector (row) corresponding to node C is quite different from others (e.g., counts for orbits 2, 7, 11), which indicates its special place in the graph. Signatures of A and B, on the other hand, are the same since the two nodes map to each other in an automorphism of the graph. 1 The straightforward computation of orbit counts by enumeration takes O(nd k−1 ) time, where n is the number of nodes (typically thousands or tens of thousands), d is the maximal node degree (usually up to one hundred), and k is the graphlet size (4 or 5). We have recently presented a combinatorial approach for counting orbits of nodes (Hočevar and Demšar 2013) in time that is, for practical purposes, proportional to nd k−2 . Using this technique, the common-size networks from proteomics can be analyzed in a reasonable time of a few hours on a common desktop computer. In this paper, we provide the first complete description of the algorithm, including its novel extension to counting edge orbits (Section 2), and then document the corresponding R package together with two usage examples (Section 3).
The notation used throughout the paper is summarized in Table 1.

Combinatorial approach to orbit counting
Let G = (V, E) be a simple graph with n vertices (V ) and e edges (E). We assume that the graph is sparse (e = O(n)). We will denote graphlets as G i and node orbits as O i . We follow the enumeration by Pržulj (2007) (see Figure 1(a)), in which the orbit numbers are assigned somewhat arbitrarily but with the constraint that the indices of orbits belonging to the graphlets with fewer edges are smaller than those belonging to the graphlets with more edges. We will use E i to denote edge orbits, which we enumerate as shown in Figure 1 Here we decided to ignore the pre-existing enumeration by Solava et al. (2012) and define a more consistent one in which the edge orbits are ordered by the orbits of the corresponding nodes.
The task is to count the number of times a node x appears in each orbit O i , or the number 1 In general, two nodes will have the same signature for k node graphlets if their local neighborhood of up to k − 1 edges is the same.

C B
The count of orbit 0 equals the degree of the node. The inner node of a path on four nodes (Fig. b). 7 2 The central node of a 3-edge star (Fig. c). 11 2 The central node of the L 3,1 lollipop graph (Fig. d).
(e) Non-zero orbit counts for node C.
Orbit 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14  of times an edge e appears in orbit E i . We will denote the two numbers by o i (x) and e i (x); where possible, we will omit x and write only o i and e i . The algorithm computes the counts for all graph nodes (or edges). Computation for just a few nodes can be done faster using a brute force approach (exhaustive enumeration).
Past approaches -such as that in GraphCrunch (Milenković, Lai, and Pržulj 2008)   of induced subgraphs. 2 Their theoretical and empirical complexity of enumerating graphlets of size k is O(nd k−1 ), where d is the maximal node degree in the graph. Our approach builds on the work of Kloks, Kratsch, and Müller (2000), who constructed a system of equations for counting induced subgraphs with four-nodes, and Kowaluk, Lingas, and Lundell (2011), who generalized it for larger subgraphs. We use a similar principle to count orbits; besides, our approach scales better for sparse graphs. In comparison with enumeration-based algorithms, the combinatorial approach decreases the practical time complexity by the factor of d by directly enumerating only the graphlets of size k − 1 and using them to compute the counts for graphlets of size k.

Node orbits
We shall demonstrate the basic idea with an example.  u,v,t: G[{x,u,v,t}] where u, v and t are triplets of nodes that fulfill certain conditions (details are explained below) and c(v, t) is the number of common neighbors of v and t. The left-hand side of the equation is a linear combination of orbit counts that we wish to compute and the right-hand side is a statistics that is easy to obtain.
Equation 1 can be constructed as follows. 3 Let the subgraph on some nodes x, u, v and t be isomorphic to G 6 with x in O 9 (Figure 3(a)). Now we observe the possible extensions of Solid lines belong to graphlet G 6 , dashed lines represent the required edges (as described in Section 2.3) and dotted lines represent additional edges whose presence or absence determines the orbit of the node x.
Gray lines represent edges to other nodes of G.
G [{x, u, v, t}] with a node w ∈ V that is attached to v and t. For each such w ∈ N (v, t), the subgraph on x, u, v, t, w is isomorphic • to G 19 , if w is not connected to x and u ( Fig. 3(b)), or • to G 23 , if w is connected to u, but not to x (Fig. 3(c)), or • to G 25 , if w is connected to x, but not to u ( Fig. 3(d)), or • to G 26 , if w is connected to x and u ( Fig. 3(e)).
This puts x in orbits O 45 , O 56 , O 62 or O 65 , respectively. Therefore, where o i represent orbit counts considering only these particular nodes (and annotations) u, v, t, and all common neighbors of v and t, N (v, t). The term −1 is needed since one of the members of N (v, t) is also u.
Equation 1 relates the total orbit counts o 45 , o 56 , o 62 and o 65 for a fixed node x. We construct it by summing the right-hand side of (2), c(v, t) − 1, over all triplets {u, v, t} ⊆ V such that G [{x, u, v, t}] ∼ = G 9 and v, t ∈ N (x) (to put x in O 9 within this subgraph) and with v < t under some arbitrary ordering of nodes, that is, u,v,t: G[{x,u,v,t}] Despite the condition v < t, some subgraphs are counted multiple times.
• Each subgraph with x in O 56 (G[{x, u, v, t, w}] Figure 3(c)) is counted thrice: The nodes x and u are fixed while v, t and w are exchanging their roles in three possible permutations (the condition v < t prohibits the other three out of the six possible permutations).
• If x belongs to O 62 (G [{x, u, v, t, w}] ∼ = G 25 , Figure 3(d)), the subgraph on quintuplet {x, u, v, t, w} is counted twice, with exchanged roles of u and w; the nodes v and t are fixed due to v < t. Figure 3 , the conditions v < t and v, t ∈ N (x) allow for only one possible annotation of the nodes.
After accounting for these multiple counts of the same orbit when summing (2) over all applicable triplets u, v, t (as in (3)), we get Equation 1. To evaluate such equations we need to pre-compute values c(v, t) and sum them over four-node induced subgraphs of the network. Both of these steps require an enumeration of all four-node induced subgraphs, which is the bottleneck of the method. Because every enumerated four-node subgraph will contribute to the sum on the right side of one or more equations, we can optimize the code and avoid explicitly checking the summation conditions in the equations because they are already included in the enumeration process (see the code snippet in Section 2.4 for illustration).
Certain relations involve more complicated symmetries, for instance The sum runs over all induced subgraphs in G that put node x in orbit O 8 in G 5 . Nodes u and v are its neighbors and t is the node opposite of x in G 5 . We obtain the same graphlet if we attach a new node w either to u or to v, and there are c(u) − 2 and c(v) − 2 such possibilities. There are also three optional edges (to x, y and u or v), which decide the orbit of

Edge orbits
Relations for edge orbits are derived in the same way. For instance, let (x, y) represent an edge of a square (graphlet G 5 ). Let us label the remaining two nodes with u ∈ N (x)\{y} and v ∈ N (y)\{x} ( Figure 4(a)). Extending this pattern with a node w that spans over the edge (u, v) leads to three possible graphlets and hence three different edge orbits of the edge (x, y) ( Figure 4 (b-d)).
Orbit E 56 arises when w is adjacent to either x or y, while the other two orbits, E 43 and E 63 can only arise in one way. Hence the relation between the orbits is

System of equations
We constructed the equations similar to those above to relate each orbit with orbits from graphlets with a larger number of edges. Complete lists of equations are provided in the appendices.
Figure 4: Derivation of the relation between e 43 , e 56 and e 63 . Solid lines belong to graphlet G 5 , dashed lines represent the required edges (since w is defined to span over the edge (u, v)) and dotted lines represent additional edges whose presence or absence determines the orbit of the edge (x, y). Gray lines represent edges to other nodes of G.
For instance, Equation 1 was constructed specifically to relate o 45 with higher orbits. The actual construction of the equation goes in the opposite direction from that presented in the introductory example. We started with the graphlet G 19 , to which the orbit O 45 belongs and picked one of the nodes (labeled w in the above case). We assumed that w is adjacent to v and t, and examined the graphlets in which w may be also adjacent to u and/or x. This ensures that the equation that is set up with O 45 in mind relates O 45 with orbits with higher indices since these graphlets have more edges than G 19 . As a consequence, the resulting system of equations is triangular, and thus independent and easy to solve by going backwards from the higher orbits (starting with O 14 or O 72 , which belong to complete graphs) towards lower orbits.
We impose several constraints on selection of w. Node w cannot coincide with x, or with x or y when computing orbits of edge (x, y). We further require that removal of w does not break the remaining nodes into disconnected subgraphs. Node w must have at most k − 2 neighbors; when it does have k − 2 neighbors, they must be connected. This allows for more time-and space-efficient computations of orbits, as described in Section 2.4. Existence of such nodes for each orbit of four-and five-node graphlets can be proven by exhaustive search, with exception of G 5 , which is handled as a special case.
All equations have the following general form: where cond(S) is a set of conditions that constrain the embedding of G[S] into G and assign labels to nodes. For instance, in Equation 1, condition v, t ∈ N (x) assigns the labels v and t to the nodes in orbit O 10 and v < t ensures that the same quadruplet of nodes is not counted twice.
The sum runs over subgraphs G [S] isomorphic to some graphlet G j on k − 1 nodes, that is, over some three-node graphlet when computing the orbits in four-node graphlets, or over some four-node graphlet when computing orbits in five-node graphlets. The subgraph must include x, and the conditions in the sum put x into some fixed orbit. Additional conditions may impose ordering on the remaining nodes of the graphlet to decrease the number of symmetries.
The terms in the sum are the number of common neighbors of some subsets of nodes in the subgraph (S k ⊂ S). The number of such terms is between 1 and 3. The size of S k is also between 1-3, that is, the terms refer to node degrees and to the number of common neighbors of pairs and triplets of nodes. The criteria for the choice of node w, which are described above, ensure that these terms can be efficiently computed using some pre-computed data as described in the following section.
The left-hand side is a fixed linear combination of orbits to which the node x evolves after extending G j with another node connected to one of subsets S k . The coefficients reflect the symmetries in the graphlets with regard to node assignments.
We prepared a system of 10 equations that relate the 11 node orbits of four-node graphlets, and a system of 57 equations that relate the 58 node orbits of the five-node graphlets. Likewise, we have constructed 9 equations that relate the 10 edge orbits for four-node graphlets and 55 equations for 56 edge orbits on five-node graphlets. By selecting different nodes w, we have empirically verified that it is impossible to construct a full-rank system using our approach and the constraints we put on w.
Due to the rank's deficiency, one of the orbits must be enumerated directly. The most suitable candidates are the orbits belonging to complete graphlets (O 14 and O 72 for nodes, and E 12 and E 68 for edges). First, this allows for a straightforward computation of the orbits since the system is triangular so that lower orbits are computed from the higher. Second, since we assume that the graphs are sparse, we can efficiently compute these orbits by using an enumeration method similar to the Bron-Kerbosch maximal clique enumeration algorithm (Bron and Kerbosch 1973).

Algorithm
The algorithm consists of pre-computation of some data, followed by computation of orbit counts for each node or edge.
1. Pre-computation: • Count the complete graphlets touched by each node or edge.
• Count the common neighbors of each pair and each connected triplet of vertices.
2. For each node or edge: • Compute the right-hand sides by enumeration of k − 1 node graphs using the precomputed data above.
• Solve the system of linear equations.
Our implementation of the algorithm represents the graph with adjacency and incidence lists, which are appropriate for sparse graphs. If the graph has less than 30000 nodes, we also construct an adjacency matrix. The matrix, which uses 1 bit per edge and takes at most around 100 MB, allows us to check for existence of edges between any given pair of nodes in constant time. Without it, the time complexity of the look-up for an edge between two nodes is proportional to the logarithm of the number of neighbors of one of the nodes.
In the following, we will describe each step in more detail.
1. Pre-computation: • For each node, count the number of complete graphlets in which the node or edge participates. We build cliques of size k from cliques of size k − 1 by maintaining a set of candidate nodes that are adjacent to all nodes in the smaller clique. This procedure is similar to the Bron-Kerbosch algorithm with the difference that we are not interested in maximal cliques but in all cliques of a given size. Although the theoretical upper bound of the time complexity of this step is O(ed k−2 ), where d is the maximal node degree, the actual contribution of this step to the total running time is negligible since complete subgraphs (cliques) in sparse networks are rare. 2. For each node or edge: • Compute the right-hand sides of the system of the linear equations. Its general form is shown in Equation 6. For four-node graphlets, the sums run over threenode paths or triangles in which the node appears. For five-node graphlets, they run over four-node graphlets that the node touches. Right-hand sides of equations that sum over the same graphlet can be computed simultaneously. The following code chunk illustrates the computation of the righthand sides of equations for orbits 13, 16 and 20 for an edge (x, y), c(x, y).
The code for computation of the right-hand sides is as follows.
for ( The if-clauses check that the edge belongs to E 3 and impose additional constraints as needed. The computation is sped up by using the pre-computed data from the first two steps. In the above case, the right-hand sides of equations for orbits 13, 16 and 20 are (c(a)−1)+(c(b)−1), (c(x)−2)+(c(y)−2) and c(x, y), respectively. The former two are trivial to compute from the graph, and the latter is pre-computed in the second step above.
Note that the orbits for k − 1-node graphlets (as the orbit 3, above) are computed directly.
The time complexity of this step is O(ed k−3 ).
• Solve the system of equations to obtain orbit counts. Since the system is triangular and the coefficients are fixed, this does not require decomposing or inverting a matrix; the orbits are computed in order, from the higher towards the lower indices, starting with the orbit belonging to the complete graphlet, as for instance, in the following code snippet from the computation of edge orbits.
The total time complexity for all four steps is O(ed k−2 + ed k−3 + ed k−3 + n) for nodes and O(ed k−2 + ed k−3 + ed k−3 + e) for edges. The theoretical complexity is thus O(ed k−2 ), which is the same as for direct enumeration algorithms. Since large networks are typically sparse, the actual contribution of the first term, which comes from enumerating the cliques with k nodes, is negligible in practice. Empirical measurements indeed show that the time complexity is proportional to ed k−3 , that is, O(ed) for four-node graphlets and O(ed 2 ) for five-node graphlets.

The orca package
Package orca (orbit counter) is written mostly in C++, with coercion and wrapper functions in R. The package requires R version 2.15 or higher. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=orca (Hočevar and Demšar 2016). Due to using the C++ standard 11, which is not available on all platforms, CRAN only hosts the package for R 3.1. Packages for R 2.15 and binaries for OS X and MS Windows are available on the supplement page (http://www.biolab.si/supp/Rorca/).

Functions
The package provides four functions: count4 and count5 count the node orbits of graphlets on up to four and up to five nodes, and ecount4 and ecount5 count the edge orbits. All functions accept a single argument, a graph stored in • a graph object from the graph package (Gentleman, Whalen, Huber, and Falcon 2016); • an e × 2 edge matrix in which each row contains a pair of nodes given by one-based integer indices; or • a data frame in the same format.
Functions return a numeric matrix with rows corresponding to graph nodes or edges, and the columns corresponding to orbits, with column 1 corresponding to orbit 0, column 2 to orbit 1 and so forth. 5 We will show the package usage on the Karate club network (Zachary 1977), which is included in the package. The network is visualized in Figure 5.

R> orbits <-count4(karate) R> dim(orbits)
[1] 34 15 The result of count4, which counts node orbits for graphlets with up to four nodes, has 34 rows (the number of nodes) and 15 columns (the number of orbits).
The first four orbits correspond to three-node graphlets. Here are the orbit counts of four-node graphlets for the first four nodes. Note that for such small networks a visualization reveals more than orbit counts. Orbit counts become useful on large networks, which are difficult to plot out. Thiel and Berthold (2012) argue that in exploring networks we are not necessarily interested in nodes that are closely positioned to the query node (spatial similarity), but also in nodes that have a similar neighborhood structure (structural similarity). They proposed activation spreading signature as a topological description of the local neighborhood of graph vertices and demonstrate its use on the Wikipedia for Schools network. We conducted a similar experiment by using orbit counts instead of activation spreading for the signature.

Usage example on the Wikipedia for Schools network
We downloaded the 2013 edition of Wikipedia for Schools (SOS Children 2013) and extracted the network of internal links. 6 We computed the orbits for four-node graphlets and found the nearest neighbors (in terms of Euclidean distances between orbit counts) using FNN (Beygelzimer, Kakadet, Langford, Arya, Mount, and Li 2013) for a few nodes.
After the nodes are described by orbit counts, we find the ten most similar nodes (as defined by Euclidean distance, for the sake of simplicity) to several selected nodes. Results, together with distances divided by 1000, are as follows.  (71), 1754 (72) The results for Canada and Germany are impressive. The two nodes have similar orbit counts -and thus a similar role in the local network topology -as nodes Japan, Italy, Russia and other nodes representing countries, cities and regions. This would indicate that it is possible to recognize the nodes corresponding to countries based on the local network structure represented by orbit counts.
The node orbits -and thus the structure of the network around them -for Isaac Newton and Albert Einstein are also similar to those of other nodes related to physics. The inclusion of World War II in the nodes similar to Germany, and the Church of England with Isaac Newton may, however, be just an instance of the Texas sharpshooter phenomenon. Results for Mahatma Gandhi and Mahabharata are considerably less satisfactory as these two nodes are connected to unrelated nodes.
Exploring why the topology around the nodes is similar in one case and not in another is beyond the scope of this paper. 8 While this example provides an alternative take at the problem explored by Thiel and Berthold (2012), graphlet analysis is most often used in bioinformatics, where orbit counts are assumed to reflect the roles of genes or proteins in the observed networks. An interested reader may find further examples in the cited works of Pržulj and Milenković.

Conclusion
We presented a new package orca for computing the graphlet orbit counts for nodes and edges. This paper provides the first complete description of the underlying algorithm, which runs much faster than the previous approaches; a more detailed comparison is available in Hočevar and Demšar (2013). The novel contribution of the paper is also the generalization of the method to counting the orbits for edges. The package is available on the CRAN repository under the GPL-3 license. Let p(u, v) denote the number of paths on three nodes that start at node u, continue with v and end with some node t, which is not connected to u. We can compute p (u, v) as

C. Equations for node-orbit counts in 5-graphlets
Conditions, P i , define the order of nodes and put x in orbit O i ; e.g., in P 13 node x is in orbit O 13 .

D. Equations for edge-orbit counts in 5-graphlets
Conditions, P i , define the order of nodes and put edge (x, y) in orbit E i ; e.g., in P 13 edge (x, y) is in orbit E 13 .