PAFit : an R Package for Estimating Preferential Attachment and Node Fitness in Temporal Complex Networks

Many real-world systems are proﬁtably described as complex networks that grow over time. Preferential attachment and node ﬁtness are two ubiquitous growth mechanisms that not only explain certain structural properties commonly observed in real-world systems, but are also tied to a number of applications in modeling and inference. In the node ﬁtness mechanism, the probability a node acquires a new edge is proportional to a quantity called ﬁtness that is assumed to be independent of the network structure. On the other hand, in the preferential attachment mechanism, this probability of acquiring new edges is proportional to a function of the current number of edges of the node. While this function is originally assumed to be the linear function, and hence ﬁxed, in general it can be arbitrary, and thus is the target of estimation in real-world datasets. While there are standard statistical packages for estimating the structural properties of complex networks, there is no corresponding package when it comes to the estimation of preferential attachment and node ﬁtness mechanisms. This paper introduces the R package PAFit , which implements well-established statistical methods for estimating preferential attachment and node ﬁtness, as well as a number of functions for generating complex networks from these two mechanisms. The main computational part of the package is implemented in C++ with OpenMP to ensure good performance for large-scale networks. In this paper, we ﬁrst introduce the main functionalities of PAFit using simulated examples, and then use the package to analyze a collaboration network between scientists in the ﬁeld of complex networks.


Introduction
Since the end of the last century, complex networks have been increasingly used in modeling many temporal relations found in diverse fields (Dorogovtsev and Mendes 2003;Caldarelli 2007;Newman 2010). Some notable examples include collaboration networks between authors in a scientific field (Newman 2001), connection networks between computers on the Internet (Barabási et al. 2000), and sexual relation networks between members of a community (Liljeros et al. 2001). One driver of the popularity of this modeling paradigm is that complex networks let us abstract away domain-dependent details and focus on modeling important structural properties observed in real-world systems, in the hope that we will be able to predict or control the future behavior of such systems.
Among the most important real-world network structural properties is degree distribution. Degree distribution lets us understand the proportion of highly and lowly connected nodes in a network. Since most dynamical network processes must travel frequently through highlyconnected nodes, this understanding in turns sheds light on the answers of important practical questions, including how to prevent the spreading of rumors (Nekovee et al. 2007), how to stop a virus outbreak (Pastor-Satorras and Vespignani 2001), and how to guard against cybernetic attacks .
The degree distributions of many real-world networks have been found to be heavy-tailed (Albert and Barabási 1999). The best-known heavy-tailed distribution in network science is the power-law, which is a distribution where the number of nodes in a network with degree k is proportional to k −γ for 2 < γ ≤ 3. Besides the power-law, there is emerging evidence that real-world network degree distributions have other heavy-tailed forms, including the lognormal (Redner 2005), exponential (Dunne et al. 2002), stretched exponential (Newman et al. 2002), and power-law with exponential cut-off (Clauset et al. 2009). All of these heavy-tailed distributions differ from the light-tailed binomial degree distribution, which is characteristic of networks produced by the classical Erdös-Rényi (ER) random graph model (Erdös and Rényi 1959). This prompted the network scientists to search for new modeling ingredients capable of explaining heavy-tailed degree distributions. It turns out that temporal complex network models that incorporate growth mechanisms offer a powerful modeling framework for achieving this end.
Temporal complex network models, or temporal network models for short, are probabilistic generative models of a real-world networks that change with time. In its most common form, a temporal network model assumes that a network grows gradually from some initial state by the addition of new nodes and edges over a large number of discrete time-steps. Some well-known basic models in the field of complex networks are the Barabási-Albert (BA) model (Albert and Barabási 1999) and the Bianconi-Barabási (BB) model (Bianconni and Barabási 2001). Growth mechanisms, which governs how a node acquires new edges in the growth process, are the most important elements that distinguishes different temporal network models. This paper focuses on estimating two interpretable and ubiquitous growth mechanisms: preferential attachment (PA) and node fitness. While they are based on simple concepts that are shared in diverse fields, they are also flexible enough to produce a wide range of different networks. In the PA mechanism, the probability P i a node v i gets a new edge in the future is proportional to some positive function A k i of its current degree k i . This function is called the attachment function.
The name 'preferential attachment' stems from the motivation for the mechanism: if A k is an increasing function on average, a highly connected node will acquire more edges than a lowly-connected node, which is an appealing property in many real-world situations. From now, we will say that preferential attachment exists if A k is an increasing function on average. Note that, however, this meaning differs from the original meaning of the term 'preferential attachment' used in the BA model, which means only the linear case of A k = k. This linear form in fact has been long known in other fields with various names such as 'richget-richer' (Simon 1955) and 'cummulative advantages' (Price 1976). When A k assumes the log-linear form of k α , with α > 0 called the attachment exponent, we have the generalized BA model (Krapivsky et al. 2001).
While P i depends on the degree of v i in the PA mechanism, in the fitness mechanism P i depends only on a positive quantity η i called the fitness of node v i . We can interpret η i as the intrinsic attractiveness of v i . The fitness mechanism offers a simple way to express the variance in edge-acquiring abilities between nodes with the same degree. For example, two early-career scientists with roughly the same number of collaborators at some point in time may acquire different numbers of collaborators in the future based on intrinsic fitnesses.
The PA and node fitness mechanisms combine to produce a wide range of degree distributions. In their combined form, probability P i is proportional to the product of A k i and η i : As we will show in Section 2, (1) encompasses many well-known temporal network models. Based on the functional form of A k and the distribution of η i , the model depicted in (1) can produce networks with various degree distributions (Bianconni and Barabási 2001;Caldarelli et al. 2002;Borgs et al. 2007;Kong et al. 2008).
There are implementations of standard statistical methodologies for estimating complex network degree distribution, for example the R packages igraph (Csardi and Nepusz 2006) and poweRlaw (Gillespie 2015), but the corresponding standard methods for estimating the underlying growth mechanisms are not implemented anywhere. This is unsatisfactory because PA (and other growth mechanisms) and degree distribution are a package deal in so far as temporal complex networks are concerned.
A growing number of interesting applications also call for an implementation of PA and node fitness estimation methods. Based on the functional forms of A k and η i , we can check whether two important social phenomena called 'rich-get-richer' or 'fit-get-richer' exist in a temporal network (Pham et al. 2016). The two mechanisms have also been proposed to be the underlying mechanisms of another important phenomenon called the 'generalized friendship paradox' (Feld 1991;Eom and Jo 2014;Momeni and Rabbat 2015). They are also used in inference problems in biological networks (Sheridan et al. 2010;Guetz and Holmes 2011), World Wide Web (Kong et al. 2008), Internet topology graphs (Bezáková et al. 2006), and citation networks (Wang et al. 2013;Sinatra et al. 2016). Finally, we can classify real-world temporal network data based on the estimated attachment exponent of A k .
This paper introduces the R package PAFit (Pham et al. 2017), which fills the gap with an implementation of the standard PA and node fitness estimation procedures. In particular, we implement Jeong's method (Jeong et al. 2003), Newman's method (Newman 2001) and the PAFit method (Pham et al. 2015(Pham et al. , 2016 in the package. The first two are heuristic methods that are widely used in estimating the attachment function A k in isolation, while the last one is a principled statistical method that can either estimate A k (or η i ) in isolation or simultaneously estimate the two mechanisms. Although using PAFit is advisable in almost every circumstance, Jeong's method and Newman's method are still widely used and might still be appropriate in certain situations. Therefore, the inclusion of the two heuristic methods in the package is warranted. We discuss their strengths and shortcomings in Section 2 when we provide an overview of the methodology.
The package also implements a variety of functions to simulate temporal networks from the PA and node fitness mechanisms, as well as functions to plot the estimated results and underlying uncertainties. We review PAFit's main functions in Section 3. Before demonstrating their usages on three simulated examples in Section 5, we will discuss how PAFit relates with exisiting network analysis packages in Section 4.
In Section 6, the package is showcased with a complete end-to-end work-flow analyzing a collaboration network of scientists from the field of complex networks. Finally, concluding remarks are given in Section 7.

Mathematical background
Here we review the standard methods for estimating the attachment function A k and node fitnesses η i in a temporal network. First we review the estimation of A k in isolation in Section 2.1, then the estimation of η i in isolation in Section 2.2, and finally the joint estimation of A k and η i in Section 2.3. In the course of doing so, we also review the underlying temporal network models assumed in each case.

Attachment function estimation
The methods for estimating the attachment function A k in isolation assume a simplified version of (1), in which the η i are assumed to be 1. Thus the probability P i in (1) only depends on A k . Perhaps the most frequently-encountered parametric version of this model is the log-linear for A k = k α with attachment exponent α > 0. Network scientists are particularly interested in estimating α, since the asymptotic degree distribution of the network corresponds to simple regions of α. If α is less than unity (the sub-linear case), then the degree distribution is a stretched exponential, while in the super-linear case of α > 1, one node will eventually get all the incoming new edges (Krapivsky et al. 2001). It is only the linear case of α = 1 gives rise to a power-law distribution.
Concerning this model, there are three main estimation methods for A k : Jeong's method (Jeong et al. 2003), Newman's method (Newman 2001), and PAFit (Pham et al. 2015). Jeong's method basically makes a histogram of the number of new edges n k connected to a node with degree k, then divides n k by the number of nodes with degree k in the system to get A k (Jeong et al. 2003). Jeong's method has the merit of being simple, but estimates obtained using the method are subject to high variance and low accuracy (Pham et al. 2015). By contrast, Newman's method combines a series of histograms for lower variance and higher accuracy (Newman 2001). Note that in PAFit we implemented a corrected version of Newman's original method (Pham et al. 2015). The main drawback of Newman's method is that the mathematical assumption behind its derivation only holds when α = 1, thus the method at best is heuristic when α = 1 (Pham et al. 2015).
The final method is PAFit (Pham et al. 2015). It iteratively maximizes an objective function that is a combination of the log-likelihood of the model with a regularization term for A k by a Minorize-Maximization algorithm (Hunter and Lange 2000). We defer the details of this term to Section 2.3. There is a hyper-parameter, called r, in the method that controls the strength of the regularization. PAFit chooses r automatically by cross-validation (Pham et al. 2016). The method is not only able to recover A k accurately, but also can estimate confidence intervals of the estimated A k for each k. It mains drawback is that it might be slow, since it is an iterative algorithm.

Node fitness estimation
When we consider only node fitnesses, there are two generative models in the literature with different assumptions on the functional form of A k in (1). While the Caldarelli model (Caldarelli et al. 2002) assumes that A k is 1 for all k, the BB model (Bianconni and Barabási 2001) assumes that A k = k. Both models have been shown to generate networks with various heavy-tailed distributions (Borgs et al. 2007;Kong et al. 2008).
Node fitnesses in both models can be estimated by variants of the PAFit method proposed in Pham et al. (2016), by either setting A k = k for the BB model or A k = 1 for the Caldarelli model. These estimation methods use Minorize-Maximization algorithms that maximize the corresponding log-likelihood functions with a regularization term that regularizes the distribution of η i . Specifically, the inverse variance of this distribution is controlled by a hyper-parameter, called s, which is chosen automatically by cross-validation. We defer a more detail discussion of the regularization to the next section. We note that node fitnesses in the BB model can also be estimated by the method in Kong et al. (2008). But since PAFit has been shown to outperform this method (Pham et al. 2016), we do not bother to include it in the package.

Joint estimation of the attachment function and node fitnesses
Finally, by using the full model in (1) the method PAFit in Pham et al. (2016) can jointly estimate A k and η i . We note this full model includes all the temporal network models mentioned as shown in Table 1. For a more complete table, see Table 1 in Pham et al. (2016).
Temporal network model Attachment function Node fitness Growing ER model (Callaway et al. 2001 Table 1: Some well-known temporal network models that are special cases of model defined by (1).
The objective function of PAFit is a combination of the log-likelihood of the full model defined by (1) and two regularization terms: one for A k and one for η i . While we refer readers to Pham et al. (2016) for details on the log-likelihood function, we will discuss the regularization terms here.

The regularization term for
with w k = T t=1 m k (t) and m k (t) is the number of edges that connect to a degree k node at time-step t. This term controls the shape of A k . When r is large, A k becomes more linear in log-scale. In the limit case when r is very large, we effectively assume that A k = k α . Thus this covers the case of α = 1 in the BA or the BB models and the case of α = 0 in the growing ER or the Caldarelli model.

The regularization term for node fitnesses is defined by
This term controls the variance of the distribution of node fitnesses. The larger the value of s, the more tightly concentrated the values of η i become. If s is infinitely large, then all η i will take the same value. This is effectively equivalent to estimate the attachment function in isolation.
To conclude: joint estimation with the above regularization terms also compasses the two cases of estimating either A k or η i in isolation. As mentioned in the two previous sections, the values of r and s are automatically selected by cross-validation; see Table 2 in Pham et al. (2016) for a summary of the relation of r and s with well-known temporal network models.

Package overview
The PAFit package provides functions to simulate various temporal network models, gather essential network statistics from raw input data, and use these summarized statistics in the estimation of A k and η i . The heavy computational parts of the package are implemented in C++ through the use of the Rcpp package (Eddelbuettel and François 2011;Eddelbuettel 2013). Furthermore, users with a multi-core machine can enjoy a hassle-free speed up through OpenMP parallelization mechanisms implemented in the code. Apart from the main functions, the package also includes a real-world collaboration network dataset between scientists in the field of complex networks. estimates node fitnesses in isolation with either A k = 1 or A k = k joint_estimate estimates the PA function and node fitnesses jointly coauthor contains the collaboration network dataset to_networkDynamic converts a PAFit_net object to a networkDynamic object from_networkDynamic converts a networkDynamic object to a PAFit_net object to_igraph converts a PAFit_net object to an igraph object from_igraph converts an igraph object to a PAFit_net object graph_to_file writes the graph in an PAFit_net object to file graph_from_file reads a graph from file into a PAFit_net object as.PAFit_net converts an edgelist matrix into a PAFit_net object Firstly, most well-known temporal network models based on the PA and node fitness mechanisms can be easily simulated using the package. PAFit implements generate_BA for the BA model, generate_ER for the growing ER model, generate_BB for the BB model and generate_fit_only for the Caldarelli model. These functions have many customizable options, for example the number of new edges at each time-step are tunable stochastic variables. They are actually wrappers of the more powerful generate_net function, which simulates networks with more flexible attachment function and node fitness settings. Given a PA function and a vector of node fitnesses, one can also use the function simulate_true_net to generate a network based on a true network as closely as possible. All statistics that are not relevant to the PA and fitness mechanisms, e.g., the numbers of new nodes and new edges at each time-step, are kept the same as in the true network.
In any case, the output of these functions is a PAFit_net object, which is a list with four fields: type, fitness, PA, and graph. The type field is a string indicates the type of network: "directed" or "undirected". This field is "directed" for the networks generated by the simulation functions. The fitness and PA fields contain the true node fitnesses and PA function respectively. The graph field contains the generated temporal network in a threecolumn matrix format. Each row of this matrix is of the form (id_1 id_2 time_stamp).
While id_1 and id_2 are IDs of the source node and the destination node respectively, time_stamp is the birth time of the edge. This is the so-called edgelist format in which raw temporal networks are stored in many online repositories (Kunegis 2013;Leskovec and Krevl 2014). We will discuss how to use functions provided by PAFit to convert this edgelist format to formats used in other network softwares in the next section. One can apply the function plot directly to a PAFit_net object to visualize its content.
The second functionality of PAFit is implemented in get_statistics. With its core part implemented in C++, this function efficiently collects all temporal network summary statistics that are needed in subsequent estimation of PA and node fitnesses. The input network is assumed to be stored in a PAFit_net object. One can use the function graph_from_file to read an edgelist graph from a text file into a PAFit_net object, or convert an edgelist matrix to this class by the function as.PAFit_net.
The edgelist matrix is assumed to be in the same format as simulated graphs from PAFit, i.e., each row is of the form (id_1 id_2 time_stamp). The node IDs are required to be integers bigger than −1, but need not to be contiguous. Note that (id -1 t) describes a node id that appeared at time t without any edge. There is no assumptions on the values or data types of time_stamp, other than that their chronological order is the same as what the R function order returns. Examples of time-stamps that satisfy this requirement are the integer vector 1:T, the format 'yyyy-mm-dd', and the POSIX time.
The get_statistics function automatically handles both directed and undirected networks. It returns a list containing many statistics that can be used to characterize the network growth process. Notable fields are m_tk containing the number of new edges that connect to a degree-k node at time-step t, and node_degree containing the degree sequence, i.e., the degree of each node at each time-step.
The most important functionality of PAFit is estimating the attachment function and node fitnesses of a temporal network. This is implemented through various methods. There are three usages: estimation of the attachment function in isolation, estimation of the node fitnesses in isolation, and the joint estimation of the attachment function and node fitnesses.
The functions for estimating the attachment function in isolation are: Jeong for Jeong's method, Newman for Newman's method, and only_A_estimate for the PAFit method in Pham et al. (2015). For estimation of node fitnesses in isolation, only_F_estimate implements a variant of the PAFit method in Pham et al. (2016). For the joint estimation of the attachment function and node fitnesses, we implement the full version of the PAFit method (Pham et al. 2016) in joint_estimate. The input of these functions is the output object of the function get_statistics. The output object of these functions contains the estimation results as well as some additional information pertaining to the estimation process. The estimated attachment function and/or node fitnesses can be plotted by using the plot command directly on this output object. This will visualize not only the estimated results but also the remaining uncertainties when possible.

Related network analysis software
Since network analysis has been an important field for a long time, various aspects of it have been implemented in a very large number of software packages. To our best effort, we have confirmed that the estimations of PA and fitness mechanisms in a growing network are not implemented elsewhere. Restricting the discussion to packages in R, the main packages that deal with temporal networks are igraph (Csardi and Nepusz 2006), statnet (Handcock et al. 2008, and RSiena (Ripley et al. 2013).
The igraph package contains the functions sample_pa and sample_growing which are the equivalents of generate_BA and generate_ER in PAFit, respectively. Although igraph also generates networks from many other mechanisms, it does not contain any function for estimating the PA function and/or node fitnesses.
The statnet package is an extensive metapackage that contains many network-related R packages. Among them, packages that deal with temporal networks are: networkDynamic , tsna (Bender-deMoll and Morris 2016), and tergm . The networkDynamic package provides the networkDynamic class to store dynamic networks, and various functions to manipulate them. The tsna package calculates many temporal statistics of a dynamic network stored in a networkDynamic object.
The tergm package fits an exponential random graph model to a networkDynamic object. It allows very flexible modelling based on customizable network statistics. Although it theoretically allows the modelling of the PA function and node fitnesses as in our setting, preimplemented functions in the package only allow for a simple parametric functional form of A k . It also does not have regularization terms specifically designed for non-parametric A k and η i as in our package PAFit, which are essential for the joint estimation of the PA function and node fitnesses (Pham et al. 2016).
The RSiena package allows for versatile modeling of temporal networks using continuoustime exponential random graphs. It can accomodate discreate-time temporal networks, too. The package can model the probability of a new edge based on an extensive list of network statistics. It also implements various estimation methods, including the method of moments and likelihood-based methods. Like the case of tergm, it is theoretically possible to use RSiena to model the PA and fitness mechanisms. However, the pre-implemented functions in the package, as well as all the related usages of the package that we could find in the literature, have focused on estimating the simple parametric forms of the PA function A k = k and A k = √ k. Furthermore, like tergm, RSiena does not implement regularization terms that are crucial in joint estimation of the non-parametric PA function and fitnesses. Lastly, RSiena assumes that the node set is fixed in time, and thus cannot handle the addition of new nodes without introducing some approximations.
PAFit provides functionalities to communicate with these extensive network analysis packages. Using to_networkDynamic and from_networkDynamic, one can convert a PAFit_net object to a networkDynamic's networkDynamic object and vice versa. The functions to_igraph and from_igraph do the same for igraph's igraph objects. The extensive functions of statnet and igraph packages can then be used. One can also output the graph stored in a PAFit_net object to the universal gml format by the function graph_to_file, or read from a gml file by the function graph_from_file.

Package usage
Here we show three usages of PAFit: the estimation of the attachment function A k in isolation in Section 5.1, the estimation of node fitnesses η i in isolation in Section 5.2, and the joint estimation of A k and the η i values in Section 5.3.  Recall that A k is linear in the BA model, i.e., the attachment exponent α is equal to 1, and the node fitnesses are uniform.

Attachment function estimation
One can observe the emergence of hubs in this network by visualize the generated graph at various time-steps by the function plot. The following script plots the network snapshot at time t = 10 in Figure 1a and its corresponding degree distribution in Figure 1d: R> plot(sim_net_1, slice = 1) R> plot(sim_net_1, slice = 1, plot = "degree") Note that if the network is directed, as it is in this example, the option plot = "degree" will plot the in-degree distribution. We can plot in the same way the network snapshots at time t = 10 and t = 100 in Figures 1b and 1c and their corresponding degree distributions in Figures 1e and 1f. The next step is to use the function get_statistics to get summary statistics of the temporal network:

R> stats_1 <-get_statistics(sim_net_1)
With stats_1 containing all the needed summary statistics, we then apply the three methods of estimating the attachment function in isolation: R> result_Jeong <-Jeong(sim_net_1, stats_1) R> result_Newman <-Newman(sim_net_1, stats_1) R> result_PA_only <-only_A_estimate(sim_net_1, stats_1) Let us explain result_PA_only in more detail. Information on the estimated results as well as the estimation process can be viewed by invoking summary:

R> summary(result_PA_only)
Estimation results by the PAFit method. Mode: Only the attachment function was estimated. As stated in Section 2, PAFit method first finds the r parameter, which regularizes the PA function, by cross-validation, and then estimate A k using the chosen r. The estimated function can be assessed via $estimate_result$k and $estimate_result$A of result_PA_only. From this estimated function, the attachment exponent α (when we assume A k = k α ) and its two-sigma confidence interval are also estimated. Hereα is 1.001 ± 0.01 as we can see from the output of summary.
The output also reveals that in default PAFit applies binning with 50 bins. In this procedure, we divide the range of k into bins consist of consecutive degrees, and assume that all k in a bin have the same value of A k . Binning is an important regularization that significantly stabilizes the estimation of the attachment function (Pham et al. 2015). In this example, the center of each bin is stored in the field $center_k of stats_1.
Since the center of a bin is also the PA value corresponding to that bin in the linear PA case, we can plot the estimated attachment function together with the true attachment function using the following script, which produces Figure 2a.
Overall, Newman's method and PAFit estimate the attachment function A k about equally well, while Jeong's method is found to underestimate the function and also exhibits high variance. This can also be observed in the estimated attachment exponent of the three methods: Newman's method and PAFit recover the true α, while Jeong's method underestimates it. Note that in PAFit we also have the confidence intervals (lightblue region in Figure 2a) of the estimated A k , which are unavailable in the other two methods. This is a significant advantage of PAFit over the two since it allows the user to quantify uncertainties in the result.

Node fitnesses estimation
Here we estimate node fitnesses from a BB model generated network with the assumption that A k = k. To demonstrate the functionality of the package, we generate a BB network with a nonstandard setting: R> sim_net_2 <-generate_BB(N = 1000, num_seed = 100, multiple_node = 100, m = 15, s = 10) This network grows from a seed network with N 0 = 100 nodes where the nodes form a line graph. The value of N 0 can be specified by num_seed. At each time-step we add n = 100 new nodes where each node has m = 15 new edges. The values of n and m can be specified via multiple_node and m, respectively. The total number of nodes in the final network is N = 1000. Finally, the distribution from which we generate node fitnesses is the Gamma distribution with mean 1 and inverse variance s = 10.
In its default setting, the function only_F_estimate estimates node fitnesses under the assumption that A k = k. But one also can estimate node fitnesses in the Caldarelli model, i.e., assuming A k = 1 for all k, with the option model_A = "Constant". The function only_F_estimate works by first find the estimated valueŝ of s by cross-validation, and then usesŝ in the subsequent estimation of node fitnesses. The summary information of the estimation result can be viewed by invoking summary:

R> summary(result_fit_only)
Estimation results by the PAFit method. Mode: Only node fitnesses were estimated. We can see that s is slightly underestimated, which means the variance of node fitnesses is overestimated. We can check whether the node fitnesses were estimated well by plotting the estimated fitnesses versus the true fitness by the following script: R> plot(result_fit_only, stats_2, true_f = sim_net_2$fitness, plot = "true_f") This will produce the plot of Figure 3b. It turns out that the estimated node fitnesses agree pretty well with the true node fitnesses. We note that the light blue band around eachη i depicts the confidence intervals of that estimated values. The upper and lower  values can be accessed via $estimate_result$upper_f and $estimate_result$lower_f of result_fit_only, respectively.

Joint estimation of the attachment function and node fitnesses
Here we show how to estimate the attachment function and node fitnesses simultaneously. We need to assume in Section 5.1 the equality of all η i for estimation of A k in isolation, and in Section 5.2 a specific functional form of A k for estimation of η i in isolation. Such assumptions becomes unnecessary when we perform joint estimation, since the appropriate functional forms will be automatically enforced through regularization parameters r and s, which will be chosen by cross-validation. We recommend the joint estimation procedure as the standard estimation procedure in this package, unless there is a specific reason to justify the one or the other of these assumptions.

Analysis of a collaboration network between scientists
In this section, we demonstrate the complete workflow for the joint estimation of A k and η i on a collaboration network between scientists from the field of complex networks. In this network, nodes are scientists and an undirected edge exists between them if and only if they have jointly written a paper. Since it contains no duplicated edges between two scientists, the degree of a node represents the number of collaborators of that scientist. The temporal network is stored in coauthor.net, and the names of the scientists are stored in coauthor.author_id. The network without timestamps was compiled by Mark Newman from the bibliographies of two review articles on complex networks (Newman 2006). The second author of the present work augmented the dataset with timestamps. More information on the dataset can be found in the package reference manual.
The first step in the analysis is to convert the edgelist matrix coauthor.net to a PAFit_net object, and get the summarized statistics using the function get_statistics.
R> set.seed(1) R> true_net <-as.PAFit_net(coauthor.net, type = "undirected") R> net_stats <-get_statistics(true_net) R> summary(net_stats)  The best fit model when we performed joint estimation is close to the BB model. In Figure 5a, the estimated A k is an increasing function on average withα = 0.88±0.02, which is close to 1. We can conclude that preferential attachment roughly exists in this collaboration network. Let us look at the region of small k, where the estimated attachment function is linear, for a concrete example: a network scientist with four collaborators has roughly twice the chance to get a new collaborator, compared with someone who only has two collaborators, assuming they have the same fitness. For comparison, we also plot the estimation results of A k in isolation using Jeong's method, Newman's method and Pham et al. (2015)'s method in Figures 5c, 5d, and 5e, respectively: R> result_Jeong <-Jeong(true_net, net_stats) R> result_Newman <-Newman(true_net, net_stats) R> result_onlyA <-only_A_estimate(true_net, net_stats) R> plot(result_Jeong, net_stats, plot = "A") R> plot(result_Newman, net_stats, plot = "A") R> plot(result_onlyA, net_stats, plot = "A") We notice that the estimated A k of the joint estimation resembles that of Figure 5e, when we estimate it in isolation. The reason is that estimated node fitnesses in Figure 5b are highly concentrate around the mean. Thus their distribution is not very far from the case when all the fitnesses are 1. Nevertheless, we observe that the estimated A k from the joint estimation is reduced compared with that of Figure 5e. This is expected since in joint estimation, a portion of the connection probability in (1) is explained by node fitnesses.
Although the distribution in Figure 5b is concentrate around its mean, we notice that its right tail is rather long, which is a sign that this tail contains interesting information. We can extract the information from this region by finding the 'fittest' network scientists. This can be done as follows: R> sorted_fit <-sort(full_result$estimate_result$f, decreasing = TRUE) R> top_scientist <-coauthor.author_id[names(sorted_fit),] R> print(cbind(sorted_fit[1:10],top_scientist[1:10,2])) This snippet will produce the results show in  one has some familiarity with the field, it is easy to recognize the names of many big-shots in the list. For example, at the top of the list is none other than Barabási, who introduced the BA model. Number two and number three are Mark Newman and Hawoong Jeong, who respectively are the authors of the eponymously named Newman's method and Jeong's method.

Conclusion
We introduced the R package PAFit, which provides a comprehensive framework for the estimation of PA and node fitness mechanisms in the growth of temporal complex networks.
In summary, PAFit implements functions to simulate various temporal network models based on these two mechanisms, gathers summarized statistics from real-world temporal network datasets, and estimates the attachment function and node fitnesses. We provided a number of simulated examples, as well as a complete analysis of a real-world collaboration network. We believe that the package is useful for statistical analysis of temporal networks.