missSBM : An R Package for Handling Missing Values in the Stochastic Block Model

The stochastic block model is a popular probabilistic model for random graphs. It is commonly used to cluster network data by aggregating nodes that share similar connectivity patterns into blocks. When fitting a stochastic block model to a partially observed network, it is important to consider the underlying process that generates the missing values, otherwise the inference may be biased. This paper presents missSBM , an R package that fits stochastic block models when the network is partially observed, i.e., the adjacency matrix contains not only 1s or 0s encoding the presence or absence of edges, but also NA s encoding the missing information between pairs of nodes. This package implements a set of algorithms to adjust the binary stochastic block model, possibly in the presence of external covariates, by performing variational inference suitable for several observation processes. Our implementation automatically explores different block numbers to select the most relevant model according to the integrated classification likelihood criterion. The integrated classification likelihood criterion can also help determine which observation process best fits a given dataset. Finally, missSBM can be used to perform imputation of missing entries in the adjacency matrix. We illustrate the package on a network dataset consisting of interactions between political blogs sampled during the 2007 French presidential election.


Introduction
In many scientific fields, networks are a natural way to represent interaction data.To cite a few examples, a network may represent social interactions such as friendship or collaboration between people in a social network, regulation between genes and their products in a gene regulatory network, or predation between animals in a food web.In this paper, we only consider networks which can be represented by graphs composed of binary edges connecting pairs of nodes (also referred to as dyads in the following).
To date, there are many software programs that perform network-related analyses.It is not surprising that the R community is extremely active in this field.Indeed, the programming language R is particularly well designed for data manipulation and visualization, and is therefore well suited to the manipulation of network data.Of the many network-related packages available, we suggest a classification into three groups1 : (i) Packages for representation, manipulation or visualization tasks, and packages computing descriptive statistics.We mention non-exhaustively the following top representatives: igraph (Csardi and Nepusz 2006), network and sna (Butts 2008a,b).
The package missSBM that we present here belongs to the third category, that is, software that fits a specific probabilistic model on network data.Specifically, missSBM is dedicated to the estimation of the SBM, a mixture of Erdős-Rényi random graphs (Erdős and Rényi 1959) offering a high degree of heterogeneity in connectivity profiles (see Abbe 2017, for a recent review).SBM generally fits real-world network data well while retaining the advantage of being a probabilistic generative model (contrary to mechanistic approaches such as the Barabási-Albert model (Albert and Barabási 2002), defined by a preferential attachment algorithm).The main outcome of an SBM fit is a clustering of the nodes -or "blocks" -so that the nodes share the same properties within the same block.To our knowledge, the reference package for fitting the SBM with the R programming system is blockmodels.It includes efficient implementations of variational algorithms for fitting different flavors of SBMs, tailored to binary network data and valued networks, with optional covariates on the edges.Two other important extensions are available as R packages: the degree-corrected SBM in randnet (Li, Levina, Zhu, and Le 2022) and a dynamic version in dynsbm (Matias and Miele 2020).
Beyond the R framework, there are also Python packages and C++ libraries providing efficient codes for some particular SBMs: the Python packages CommunityDetection (Mejean and Maison 2017) and BipartiteSBM (Yen and Larremore 2018) are dedicated to the estimation of special network structures using various heuristics and network models, among which SBM. Beyond variational approaches, the are Markov chain Monte Carlo (MCMC) methods for inferring the SBM, which solve the exact problem but are generally more computationally demanding: the Python library graph-tool (Peixoto 2014) includes an MCMC sampler to fit the binary SBM and the degree-corrected SBM; C++ libraries sbm_canonical_mcmc (Young, Desrosiers, Hébert-Dufresne, Laurence, and Dubé 2017) and bipartiteSBM-MCMC (Yen and Larremore 2019) implement respectively a MCMC sampler for simple and bipartite SBMs.
Finally MODE-NET (Decelle, Krzakala, and Zhang 2019) implements the belief propagation algorithm for inferring the degree-corrected SBM.
Despite their high quality, an important limitation of the aforementioned software is to require a fully observed network, i.e., no missing value is supported.The main feature of missSBM is to deal with cases where the network data is only partially observed.Specifically, we consider situations where the adjacency matrix of the network data contains not only 1s or 0s for presence or absence of an edge, but also NAs encoding missing information for some dyads.
Note that this situation is different from the case considered in noisySBM: there, a similarity matrix is fully observed between all pairs of nodes, and the goal is to separate the adjacency matrix from the noise, where the adjacency matrix is assumed to be generated by an SBM.
When inferring SBM from network data with missing values, it is important to take into account the underlying process that generates these missing values in the estimation of the model parameters, otherwise it may be biased.Specifically, it is necessary to identify whether the values are missing at random or not (MAR and MNAR, see Little and Rubin 2019).This issue has been studied in the context of network data by Handcock and Gile (2010) for the ERGM and in our methodological paper (Tabouy, Barbillon, and Chiquet 2020) for the SBM.missSBM is an implementation of the methodology developed therein.It also considers new sampling designs and the inclusion of covariates simultaneously in the SBM and in the observation process, which was not studied by Tabouy et al. (2020).Specifically, missSBM implements variational algorithms in the vein of Daudin, Picard, and Robin (2008) and Léger (2016) for estimating the SBM, with or without covariates, under a variety of missing data mechanisms.This includes cases of incomplete data where inference can only be made on the observed portion of the data (MAR), or cases where it is necessary to take the sampling design into account in the inference (MNAR).
Some frameworks deal with missing data but more from a cross-validation perspective than from a sampling perspective.Cross-validation is used to perform model selection for networks, such as choosing the number of blocks or communities (Li, Levina, and Zhu 2020;Chen and Lei 2018) or choosing the latent structure (Hoff 2007).These frameworks are thus very different from ours since the cross-validation is performed under MAR sampling while our main goal is to be able to infer an SBM under various MNAR sampling mechanisms.
The paper is organized as follows: Section 2 introduces the statistical framework of the binary SBM, with or without covariates, and summarizes the key points of its inference under missing data conditions.Section 3 provides basic guidelines for the main functions and object classes.We finally detail in Section 4 a case study that analyzes a network dataset describing the French blogosphere during the period preceding the 2007 French presidential election, illustrating the most striking features of the package.

Binary SBM
In an SBM, the nodes of the set N ≜ {1, . . ., n} are distributed in a set Q ≜ {1, . . ., Q} of hidden blocks that model the latent structure of the graph.Group membership is described by independent categorical variables The probability of having an edge between any pair of nodes (or dyad) depends only on the blocks to which the two nodes belong.Therefore, the presence of an edge between i and j, indicated by the binary variable Y ij , is independent of the other edges conditionally on the latent blocks: where B represents the Bernoulli distribution and D the set of dyads.This set can be either equal to {(i, j) ∈ N2 ; i ̸ = j} if the network is directed or to {(i, j) ∈ N 2 ; i < j}, otherwise 2 .In the following, we denote by π = (π qℓ ) (q,ℓ)∈Q 2 ∈ [0, 1] Q 2 the connectivity matrix, α ∈ )∈D the n × n adjacency matrix.This matrix is binary, with a diagonal filled with NAs and is symmetric if and only if the network is undirected.The vector encompassing all the unknown model parameters is θ = (α, π).A schematic representation of the binary SBM in the undirected case is given in Figure 1, where we highlight the latent clustering.

Accounting for external covariates
In addition to information about the connections between nodes, it is common for network data to be accompanied by additional information about the nodes or dyads, which we call covariates.In social networks, for example, nodes may belong to different categories (gender, occupation, nationality).Covariates on dyads typically represent similarity or dissimilarity between nodes: for example, in a spatial data context where nodes correspond to features with explicit geographic location, the covariates of dyads may be the distances between nodes.
Depending on the analysis, we may want to detect a connectivity pattern beyond the covariate effect.To do so, we implemented in missSBM a variant of Model 1 to include covariates.Let X ij ∈ R m denote the vector of m covariates for the dyad (i, j).If the covariates correspond to the nodes, i.e., X i ∈ R N is associated with node i for all i ∈ N , they are transferred onto the dyad level via a symmetric "similarity" function ϕ(•, the following, X ≜ [X ij ] i,j∈N ∈ (R m ) n×n denotes the array of covariates.An SBM including the effect of these covariates is as follows: where The vector of unknown parameters is now defined by θ = (γ, β, α).Note the connection with logistic regression: Model 2 assumes a logistic link between the presence of an edge Y ij and the corresponding covariates.The intercept term (γ qℓ ) qℓ depends on the blocks of the nodes and describes the heterogeneity of the connections that is not explained by the regression term β ⊤ X ij .

Connections with similar models
The SBM was originally introduced by Nowicki and Snijders (2001).Many extensions have been proposed since then, including other distributions on dyads in addition to incorporating covariates (Mariadassou, Robin, and Vacher 2010).Unlike the latent space model of Hoff et al. (2002) where the latent space is continuous, the latent variables in the SBM lie in a discrete space.When covariates are included as in Equation 2, their effect is removed and the blocks shall then explain the structure in the network beyond the covariates.This approach is similar to Vu, Hunter, and Schweinberger (2013) and opposite to that used by Tallberg (2004) or Binkiewicz, Vogelstein, and Rohe (2017) where covariates help learn the underlying clustering of the nodes since their distribution is assumed to depend on the same latent variables as the SBM.

Missing data and SBM
The main purpose of missSBM is to deal with some simple processes that generate missing values in order to provide more accurate estimates of the parameters underlying an incompletely observed network.The number of nodes n is assumed to be known and the missing information only concerns the dyads.The sampled data can therefore be encoded in an adjacency matrix Y where the missing information -the dyads whose value is unobserved -is encoded by NAs.We also define the n × n observation matrix R such as R ij = 1 if the dyad Y ij is observed and R ij = 0 otherwise.For convenience, we define Y o = {Y ij : R ij = 1} and Y m = {Y ij : R ij = 0} the respective sets of observed and unobserved dyads.
In our framework, an observation process -or sampling design -is a stochastic process that generates R. We then rely on standard missing data theory of Little and Rubin (2019) to classify these designs as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) cases.This framework has to be extended to deal with the latent variables Z in the SBM as we did in Tabouy et al. (2020): Sampling design for SBM is (3) The notation |= stands for independence between random variables.Note that MCAR missingness is a particular case of MAR missingness.This definition provides the general case when covariates X are available.Otherwise, the definition remains valid just by removing X.We denote by ψ the set of parameters associated with the distribution that generates the sampling matrix R.These parameters account for the sampling effort of the network.They live in a space that depends on the observation process as it is detailed in the next section.We assume that ψ and θ live in a product space, so that we can derive the following proposition: This proposition was proven in Tabouy et al. (2020) in the absence of covariate.The generalization to handle covariates is straightforward.In words, the inference can be conducted on the observed part of the network data when the sampling is M(C)AR without incurring any bias.In these cases, adapting the existing algorithms for SBM inference is simple.MNAR sampling designs require, however, more refined inference strategies since the observation process has to be included in the inference.

Examples of sampling designs for networks
This section reviews a set of stochastic processes defining sampling designs available in missSBM.These sampling designs may depend either on (i) the values of the dyads in the network; (ii) the latent clustering of the nodes; or (iii) some covariates, via the vector of parameters ψ.All examples detailed in the following assume that the observations are independent conditionally on Y, Z and X (either observations of the dyads for dyad-centered sampling designs, or observations of the nodes for node-centered sampling designs).From a practical viewpoint, the sampling designs implemented in missSBM allow the user to either (i) generate new data by partially observing an existing network according to a predefined sampling design, with user-defined parameters ψ, or (ii) to fit an SBM model under missing data condition, assuming that the missing entries arise from a given type of sampling design for which the unknown parameters ψ are estimated jointly with the SBM parameters θ.
• Double standard sampling (MNAR): let ψ ≜ (ρ 1 , ρ 0 ) ∈ [0, 1] 2 .Double standard sampling consists in observing dyads with probabilities The probability for sampling a dyad thus intrinsically depends on the presence/absence of the corresponding edge.This double standard sampling is especially likely in real world applications, if it is easier to observe an existing connection than the absence of connection.For instance, in protein-protein networks, the sampling effort is more important to determine the absence of a link than its existence.
• Block-dyad sampling (MNAR): this sampling consists in observing all dyads with probabilities ψ ≜ (ψ qℓ ) (q,ℓ)∈Q 2 ∈ [0, 1] Q 2 depending on the underlying clustering of the network: Here the probability for observing a dyad is driven by the effect of a given covariate: where we recall that g(x) = (1 + e −x ) −1 .Under this sampling, the external covariates may have an impact on both a connection and the ability to observe it.In this case, the sampling remains MAR provided that the covariates are available.

Node-centered sampling designs
A node-centered sampling consists in observing some nodes sampled with probabilities given by the sampling design.Observing a node means observing all the dyads involving that node.For all i ∈ N , we denote by V i the indicator variable for observing node i. Hence if V i = 1 we have R ij = R ji = 1 for all j ∈ N .Node-centered sampling designs are likely in social sciences since a network is sampled through direct interviews.During an interview, individuals (nodes) indicate to whom they are connected.Some individuals may then indicate a connection with an individual not available for an interview.The resulting missing dyads concern dyads between individuals who were not interviewed.Even if the connection is oriented (directed network), we assume that an individual, when interviewed, provides its ingoing and outgoing connections.
• Node sampling (MCAR): the probabilities for observing nodes are uniform: • Snowball sampling (MAR): a first batch of nodes is sampled as in node sampling.Then, a second batch is composed of the neighbors of the first batch (the set of nodes linked to at least a node of the first batch).Other batches can then be obtained through several sampling steps which are called waves.These successive waves are then MAR and not MCAR since they are built on the basis of the previously observed part of Y .
• Degree sampling (MNAR): for all node i ∈ N , P( This sampling may be the consequence of a situation where popular individuals are more likely to be interviewed. • Block-node sampling (MNAR): this sampling consists in observing all dyads corresponding to nodes selected with probabilities ψ ) for all (i, q) ∈ N × Q.This sampling may happen if some communities that shape the connections in the network are not equally reachable.
• Covar-node sampling (MAR): The probability to observe a node is In this sampling, some external information shapes the sampling process of the nodes.Even if the covariates also have an impact on the probabilities of connection, as in the covar-dyad sampling, the sampling design is MAR provided that the covariates are available.

Estimation procedure: A variational expectation-maximization
The SBM is a latent state space model which can be seen as a mixture model for random graphs.Therefore, the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) is the natural choice for the inference since it generally proves very useful for inferring various types of mixture models.It is based on the evaluation of the expectation of the complete log-likelihood of the model, with respect to the conditional distribution of the latent variables given the data.However, this expectation is intractable in the SBM due to the structure of dependency between the latent variables Z and the network Y.In fact, it would require to sum over all possible clusterings for all pairs of nodes, which is out of reach even for a moderate number of nodes or blocks.To address this shortcoming when the network Y is fully observed, Daudin et al. (2008) introduced a variational-EM (V-EM), based on the variational principles of Jordan, Ghahramani, Jaakkola, and Saul (1998).The idea is to maximize a lower bound of the log-likelihood based on an approximation of the true conditional distribution of the latent variable Z.In the case of an SBM with missing data, the level of difficulty is higher since the set of latent variables encompasses both Z (the latent blocks) and Y m (the missing dyads).We propose here a variational distribution of the conditional distribution where complete independence is forced on Z and Y m , using a multinomial, respectively a Bernoulli distribution for Z and Y m .We denote by m(•) and b(•) the probability mass functions of, respectively, the multinomial and the Bernoulli distributions which gives the following expression of the variational distribution: where are the two sets of variational parameters respectively associated with Z and Y m .Interestingly, τ 's are proxies for the posterior probabilities of the group memberships for all nodes, and ν's correspond to the imputed values of the missing dyads in the network data.This variational distribution was chosen in order to replace the intractable E-step of the EM algorithm with a tractable variational E-step.This approximation leads to the following lower bound J of the log-likelihood, where KL is the Kullback-Leibler divergence between the true conditional distribution and its variational approximation: ), the true conditional distribution of the latent variables Z, Y m , we retrieve the standard EM algorithm, requiring the evaluation of the intractable quantity Note that we alleviated the notations above by not explicitly writing the possible conditioning on covariates in the log-likelihoods.
Based on this approximation, the V-EM algorithm consists in alternating updates of the variational parameters {τ , ν} (the VE-step) with updates of the model parameters θ, ψ (the M-step) maximizing J.
Steps VE and M are iterated until convergence like in a standard EM.The algorithm converges to a local maximum of the lower bound of the log-likelihood.This variational is translated into a collection of algorithms for handling missing data with all sampling designs introduced in Section 2.4.When an algorithm reaches convergence, we obtain estimates of the parameters involved in the SBM (θ), in the sampling process (ψ), and also estimates of the variational parameters which bring information on the clustering (τ ) and on the missing dyads (ν).The variational estimator in the SBM is proven to be asymptotically normal in Bickel, Choi, Chang, and Zhang (2013) when the network is fully observed.The extension to the MCAR case is proven in Mariadassou and Tabouy (2020).

Initialization
It is well known that EM-like algorithms are very sensitive to the initialization step, which therefore requires special attention.In missSBM, the initial clustering is obtained by applying the popular absolute eigenvalues spectral clustering (detailed in Rohe, Chatterjee, and Yu 2011) to the adjacency matrix where the NAs are replaced by zero.The initial clustering can also be provided by the user.As specified in the next section, the exploration of the number of blocks also provides several other relevant initializations.

Selection of the number of blocks
One of the main difficulties encountered in SBM inference is the estimation of the number of blocks, which is usually unknown to the user.To address this problem, we use the integrated classification likelihood (ICL) criterion defined in Biernacki, Celeux, and Govaert (2000) and commonly used in the context of mixture models.Note that Saldana, Yu, and Feng (2017), Wang and Bickel (2017), Hu, Zhang, Qin, Yan, and Zhu (2020) and Côme, Jouvin, Latouche, and Bouveyron (2021) provide alternative methods for selecting the number of blocks.The ICL criterion has been adapted to the SBM under missing data condition in Tabouy et al. (2020), and is recalled here: for a Q-blocks model, a sampling design parametrized by K pa- rameters (size of ψ) and ( θ, ψ) = arg max for node-centered sampling.
We have also implemented in missSBM an exploration procedure designed to avoid getting stuck in local minima by producing a convex and robust ICL curve.This exploration procedure is divided into two steps, forward and backward: the forward step creates new initializations for each number Q of block considered, by splitting the blocks obtained from the estimates with Q − 1 blocks.On the other hand, the backward step tries new initializations for each Q by merging groups of the model with Q + 1 groups.The best model in terms of ICL is always retained.The procedure can be iterated until a satisfying shape of the ICL curve is encountered.

Implementation details
missSBM adopts an oriented-object programming spirit for representing most models by means of R6 classes and the R6 package of Chang (2021), which opens the way for future extensions.This approach is not visible to the user, who essentially only has to deal with classical R functions.The most time-consuming parts of the code are written in C++ using the armadillo library for linear algebra (Sanderson and Curtin 2016), in conjunction with the Rcpp and RcppArmadillo packages (Eddelbuettel and François 2011;Eddelbuettel and Sanderson 2014) to interface C++ with R. The numerical performance of our implementation is of the same order as existing variational implementations of the binary SBM (such as blockmodels).It can handle networks with up to a few thousands nodes.To give the reader an idea, Figure 2 reports the timings for adjusting an SBM to the network data considered in Section 4 (the 2007 French political blogosphere, 194 nodes), for a varying number of blocks with a single Intel Core i9-9900 CPU at 3.10GHz, replicated 50 times per block-size.We point out in the discussion some interesting avenues to improve speed in the future and eventually tackle larger networks.

Guidelines for users
This section gives an overview of the basic usage of the package.In particular, it describes the inputs and the outputs of the main functions involved in the procedure that fits an SBM from partially observed network data.Along the section, we also describe the object classes included in the package to facilitate the manipulation of the resulting fitted models.
In addition to this section and to the usual R package documentation, full documentation including a vignette is available as a pkgdown website at http://grosssbm.github.io/missSBM.The missSBM package (Chiquet, Barbillon, and Tabouy 2022) can be installed from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=missSBM.

Parameter estimation, prediction and clustering
Estimation of an SBM from a partially observed network is done by means of the function estimateMissSBM, with the following usage: R> estimateMissSBM(adjacencyMatrix, vBlocks, sampling = "dyad", + covariates = list(), control = list())

Standard usage
This function takes as a first argument -adjacencyMatrix -a square base::matrix or a sparsely encoded Matrix::dgCMatrix, possibly with NA entries corresponding to missing (unobserved) values of the network.The second argument vBlocks contains the successive explored values for the number of blocks, this number being generally unknown.The third argumentsampling -specifies how NA entries are taken into account in the estimation.It must be one of the following character strings, corresponding to one of the sampling designs (or observation processes) depicted in Section 2.4: R> missSBM:::available_samplings [1] "dyad" "covar-dyad" "node" "covar-node" [5] "block-node" "block-dyad" "double-standard" "degree" [9] "snowball" Argument covariates is an optional list with as many entries as covariates.If the covariates are nodal, each entry must be a size-n vector; if the covariates are defined between pairs of nodes, each entry must be an n × n matrix.

Advanced tuning
The argument control is a list to control the estimation and finely tune the V-EM algorithm, with the following entries:

Field Description models
A list of models with class 'missSBM_fit' ICL The vector of ICL associated to the models in the collection bestModel Best model according to the ICL (a 'missSBM_fit' object) optimizationStatus A data.frame summarizing the optimization process for all models
(i) useCov: Logical indicating whether the covariates should be incorporated within the SBM (or just in the sampling).
(ii) clusterInit: Initial method for clustering.Either a character ("spectral") or a list with length(vBlocks) vectors, each with size ncol(adjacencyMatrix), providing a user-defined clustering.Default is "spectral", for absolute spectral clustering.
(iv) threshold: Optimization stops when a V-EM step changes the objective function or the connection parameters by less than threshold (default is 1.10 −2 ).
(v) maxIter: Optimization stops when the number of iteration exceeds maxIter (default is 50).
(vi) fixPointIter: Number of iterations in the fixed-point algorithm used to solve the variational E step (default is 3).
(viii) iterates: Integer for the number of iterations during the exploration process.Only relevant when exploration is different from "none" (default is 1).

Return value: Classes 'missSBM_collection' and 'missSBM_fit'
An output of the function estimateMissSBM is an instance of an R6 object with class 'missSBM_collection', which can be handled as a standard list.Its fields are described in Table 1.
Among the fields of missSBM_collection, models is a list with as many elements as in vBlocks.These elements are R6 objects with class 'missSBM_fit', the fields of which are detailed in Table 2.It gives access to the results of the inference for a fixed number of blocks.
We finally give additional details on fields fittedSBM and fittedSampling in Table 3.Note that fittedSBM enjoys some standard plot, print, coef, predict and fitted methods, most of them inherited from the class 'simpleSBM_fit' of the package sbm.Method Description plot(object, type) Plot either "imputed", "expected", "meso" or "monitoring" print(object) A print method recalling important fields and methods available coef(object, type) Model parameters (type = "mixture", "connectivity", "covariates" or "sampling") predict(object) Estimated adjacency matrix (with imputation) fitted(object) Expected value of the SBM

Models exploration
At the end of the estimation process, it is common that the algorithm gets stuck in some local minima for some values of Q, the current number of blocks.The consequence is a "nonsmooth" ICL curve when it should be theoretically convex and rather smooth.This can lead the user to choose a sub-optimal number of blocks.Thus, the ICL criterion is automatically "smoothed" after a first pass on all the models.The idea is to apply a split and merge strategy to the path of the models stored in missSBM_collection in order to find a better initialization for each value of Q in the V-EM algorithm, so that it converges to the global minimum.This model exploration can be tuned with the argument control, see Section Advanced tuning.
The default goes back and forth a single time.

Parallel computing
Some internal components of estimateMissSBM (initialization, estimation, exploration) use the future framework (Bengtsson 2021) to speed-up the whole process through parallel computing.The future::plan must be set by the user (by default, it is sequential without parallelism).For example, to run missSBM on 10 cores with forking on Unix systems, the following command should be used before calling estimateMissSBM().

Partial observation ("sampling") of some network data
The function observeNetwork generates missing entries in a fully observed adjacency matrix according to a given network sampling design.The usage is the following: R> observeNetwork(adjacencyMatrix, sampling, parameters, clusters = NULL, + covariates = list(), similarity = l1_similarity, intercept = 0) The first argument -adjacencyMatrix -is the totally observed network to be partially observed.The second argumentsampling -is the chosen sampling among the available ones.The third argumentparameters -contains the parameters associated with the selected sampling.Note that its dimension (or its length) depends on the sampling design selected, as described in Section 2.4.The clusters argument only needs to be specified for "block-dyad" and "block-node" sampling designs.The argumentscovariates is bydefault list(), covarSimilarity is set to the l1_similarity function defined by (x, y) ∈ , and finally intercept is set to 0. These last three arguments only need to be specified in the case of an SBM with covariate(s).Note that the intercept is not included in parameters and must be specified independently.The output of observeNetwork is a matrix encoding the adjacency-matrix, with NA values for dyads not observed by the observation process, which can be provided as input to estimateMissSBM.

Illustration: The 2007 French political blogosphere
This section illustrates the features of missSBM by performing an analysis of a real-world data network.It is a sub-network of the French political blogosphere, extracted from a snapshot of more than 1100 blogs collected during a period preceding the 2007 French presidential election, and manually classified by the "Observatoire Présidentielle project" (see Zanghi, Ambroise, and Miele 2008).The network is composed of 194 blogs representing the nodes of the network and of 1432 edges indicating that at least one of the two blogs references the other.In addition to missSBM, our analysis relies on aricode (Chiquet, Rigaill, and Sundqvist 2020) to compute clustering comparison measures and tidyverse to perform data manipulations and producing graphical output.The igraph package, imported by missSBM, is needed for basic graph manipulations.The package future is used for parallel computing.We also fix the seed for reproducibility: R> library("missSBM") R> library("aricode") R> library("tidyverse") R> theme_set(theme_bw()) R> library("igraph") R> library("future") R> set.seed(03052008) We set our future::plan to multicore with 10 workers.(Use multisession if you work on Windows or from RStudio.) R> future::plan("multicore", workers = 10) The frenchblog2007 data set is provided with missSBM 3 as an 'igraph' object.We extract the adjacency matrix corresponding to the blog network after removing the isolated nodes.
We also extract the political party of each blog from the vertex attribute party, which gives us a natural classification of the nodes that could be used as a reference in our analyses.

Standard SBM estimation
At this stage, the data set has no missing entry: every dyad and every node is observed.The adjacency matrix Y of the fully-observed network is stored in the variable blog.We first perform a standard SBM estimation on the fully observed network, including smoothing of the ICL.

R> sbm_full <-estimateMissSBM(blog, blocks, "node")
We inspect the result of the optimization process by plotting the ELBO (variational lower bound of the log-likelihood) against the number of iterations in the V-EM algorithm, accumulated along all the numbers of blocks considered (Figure 3).The ELBO is typically expected to increase with the number of blocks at the end of the successive optimizations, which is indeed the case here: R> plot(sbm_full, type = "monitoring") The ICL criterion is minimal for an SBM with 10 blocks: The corresponding model is stored in the field $bestModel of sbm_full as an object with class 'missSBM_fit'.Printing this object results in a summary of the most important accessible fields and methods:

Partial observation of an existing network
For illustrative purposes, we sample a sub-graph from the blog network to mimic missing data and create a new adjacency matrix with missing entries.Since data consists in interactions between blogs (who references who), it is natural to sample the network with a node-centered sampling design: the following code generates a realization of a 194 × 194 sampling matrix according to the block-node sampling, where the blocks correspond to the clustering estimated by the SBM fitted on the full data set.The sampling rate is either low (0.2) in small blocks or high (0.8) in large blocks.

Estimation of a partially observed network
We now perform SBM inference under missing data conditions by fitting two types of model: first, the SBM under the MNAR block-node sampling design, i.e., under the design that truly generated the missing entries; second, the SBM under the MCAR node sampling design which basically performs inference only on the observed part of the network, neglecting the process that generates the missing values.The estimation is run on both models with the same settings as for the fully observed data.The ICL curve is smoothed with five iterations of forward-backward exploration since the presence of missing values typically increases the chances for falling into local minima.

Sampling design comparison
We plot on Figure 5 the ICL of the three models ("fully observed", "block-node" sampling and "node" sampling) to show how it can be compared to select which sampling design fits at best the data: Note that the curves associated with the block-node sampling and the node sampling are quite close for small numbers of blocks (less than 10) and then depart from each other: the choice of the sampling design is a difficult issue for the network data at stake.We also plot the ICL curve for the collection of SBMs estimated on the fully observed network: although the ICL values cannot be compared between this model and the two obtained on the partially observed network (the datasets are not the same), we point out that the numbers of blocks selected in the different cases remain comparable.A model with 10 blocks is selected with the fully observed network, while a model with only 9 blocks is chosen for the block-node sampling SBM.Indeed, due to the partial sampling, some blocks are less well represented than others, and it seems more likely to cluster some blocks given the available information.
Regarding the clustering obtained by the three variants, we compare them with the Adjusted Rand Index (ARI, Rand 1971) calculated with the aricode package (Chiquet et al. 2020).We use the fully-observed SBM classification as a reference, since its clustering was used to sample the network with the block-node sampling design.We typically expect that a model relying on better modeling of missing values shall lead to a clustering closer to the reference.This is indeed the case when looking at the following piece of code, where it is shown that the ARI with the reference clustering is higher for the block-sampling SBM than for the MCAR node sampling SBM: R> ARI(sbm_block$bestModel$fittedSBM$memberships, + sbm_full$bestModel$fittedSBM$memberships) [1] 0.6570555 R> ARI(sbm_node$bestModel$fittedSBM$memberships, + sbm_full$bestModel$fittedSBM$memberships) [1] 0.5436008

Extraction of the SBM with block-sampling design
The model that we finally retain is thus a block-sampling with 9 blocks.

R> myModel <-sbm_block$bestModel
myModel is an object with class 'missSBM_fit' with two special elements used for storing the results of the estimation of both the SBM (field fittedSBM) and the sampling design (fittedSampling).These two elements are special objects themselves with dedicated fields and methods which are recalled to the user thanks to the print/show methods:

Representation and validation
With myModel, we now have at hand a tool for analyzing the clustering of the French political blogosphere.The first output is the connectivity matrix of the network, which puts into light the community structure of the blogosphere.Indeed, it is revealed by a diagonal filled with high probabilities and off-diagonal with low probabilities.Thus, nodes (blogs) in blocks connect with a high probability with other nodes in the same block and with a low probability with nodes in other blocks.Such a network concentrates most of its edges between nodes of the same blocks.This can be seen by displaying the probability of connection predicted by the SBM at the whole network scale (see Figure 6): R> plot(myModel, type = "expected", dimLabels = list(row = "blogs", + col = "blogs")) For validation, we suggest comparing the clustering of the model with the node attribute corresponding to the political parties to which the blogs belong.First, we remark that the SBM fitted on missing entries carries a little bit less information regarding the political party than the SBM adjusted on the fully observed network: R> ARI(party, myModel$fittedSBM$memberships) [1] 0.4126328 R> ARI(party, sbm_full$bestModel$fittedSBM$memberships) [1] 0.463709 A more detailed comparison between blocks inferred by the SBM and political parties is reported in Figure 7 with an alluvial diagram.Also remember that missSBM performs imputation of the missing dyads in the adjacency matrix.Thus, we can compare the imputed values with the values of the dyad in the fully observed network to validate the performance of our approach.Using the R package pROC (Robin, Turck, Hainard, Tiberti, Lisacek, Sanchez, and Müller 2011), we check the quality of the imputation.We replicate this experiment 500 times with a sampling rate varying between ≈ 0.4 and ≈ 0.9 (always with block-sampling design), fixing the number of blocks to the best one found on the fully observed network.The area under the curve (AUC) is plotted in Figure 8 against the sampling rate, showing the robustness and the good performance of the imputation method.R> blog_subgraph_obs <-blog_subgraph %>% as_adj() %>% + observeNetwork(sampling = "covar-node", parameters = 10, + covariates = list(dummy_party)) R> blocks <-2:8 R> future::plan("multicore", workers = 10) Since the covariate is nodal, a nodal observation process is considered.When it comes to take into account the effect of this covariate on the probability of connection between two nodes in the SBM as in Equation 2, the covariate has to be transferred at the dyad level.This consists in computing an n × n matrix whose elements are equal to one if the corresponding two nodes belong to the same political party, zero otherwise.This matrix computation is directly handled in missSBM, when the option useCov is TRUE.
From the observed network blog_subgraph_obs, four modeling choices (labeled i to iv) are possible whether the covariate is taken into account in the SBM or not, and whether the sampling is set as depending on the covariate or not.For the latter choice, we know that the sampling which depends on the covariate is more appropriate but this information is generally not available a priori.We then run the estimation in these four scenarios below and print the estimated parameters related to the SBM and the sampling.
Figure 10 shows clearly that the "covar-node" sampling is uncovered from the data since the ICL criterion favors this sampling no matter if the covariate is taken into account in the SBM or not.The ICL criterion also points out that the models including the covariate effect in the SBM are better.

Discussion
The R package missSBM enables the estimation of an SBM from partially observed binary networks, even for some observation processes which generate MNAR data.
Although version 1.0.0 of missSBM deals only with binary networks, we deploy a structure that allows for easy inclusion of other variants in the future.In particular, extending missSBM to weighted SBMs where the distribution on the edges belongs to the exponential family (e.g., Poisson distribution or Gaussian distribution, see Mariadassou et al. 2010) should be straightforward in the MAR case.To pave the way of such a generalization, missSBM relies on the package sbm which is designed to offer a collection of methods and algorithms for handling more general SBM, but in the absence of missing data, or at least, with imputed data.Therefore, missSBM could benefit in the future of improvements and new advances in sbm, while focusing specifically on handling of missing data, dealing only with modeling of the observation process and imputing missing entries.The modular object-oriented coding of missSBM also allows the developer to make it easily evolving in the way it can handle missing data: new observation processes could be incorporated by providing new sampling designs which should contain the functions corresponding to the sampling specific steps of the V-EM algorithm.Other dependency structures between the SBM and the covariates could also be modeled and incorporated.For example, the latent variables Z could depend directly on the covariates X instead of the adjacency matrix Y.
Some recent work focuses on improving the speed of variational inference (see e.g., Blei, Kucukelbir, and McAuliffe 2017).This work could be adapted in the SBM context with or without missing data to allow our package to handle larger networks.Resorting to minorizationmaximization algorithms within the Variational E-step as in Vu et al. (2013) could also be of interest to speed up the algorithm.Finally, a common drawback of variational inference is that is provides too narrow standard errors for the estimated parameters (Westling and McCormick 2019).Moreover, there is no uncertainty measure on the stability of node clustering.We consider it as future work to provide the user with confidence intervals and stability measures for clustering based on resampling techniques such as bootstrapping.

Figure 1 :
Figure 1: Schematic representation of an undirected network following the stochastic block model with 3 blocks.Colors are blocks in which nodes are dispatched with probabilities α and the distribution of the dyads depends on colors of nodes with probabilities π.

Figure 2 :
Figure2: Timings for adjusting binary SBM with missSBM on the French political blogosphere with a single core for a varying number of blocks (50 replicated runs per # blocks).

Figure 3 :
Figure 3: Evolution of the ELBO during the optimization for the successive numbers of blocks.

Figure 4 :
Figure 4: Network data reorganized by the estimated block memberships.

Figure 5 :
Figure 5: ICL for models with fully observed data, block-node sampling and node sampling.

Figure 6 :
Figure 6: Probabilities of connection predicted by the SBM with block-node sampling.

Figure 7 :
Figure 7: Alluvial plot between block-node sampling clustering and political parties.

Figure 8 :
Figure 8: AUC of the imputation as a function of the sampling rate.

Figure 9 :
Figure 9: Subnetwork extracted from the French political blogosphere (only blogs with attribute party in {left, right} were kept).

Figure 10 :
Figure 10: ICL criterion for different numbers of blocks under the four models which make different use of the covariate political party.

Table 2 :
Selection of fields in object 'missSBM_fit' with descriptions and types.

Table 3 :
Selection of fields in fittedSBM, fittedSampling and mathematical counterparts.

Table 4 :
Clustering comparison with ARIs between models taking the covariate into account (models (i), (ii), (iii), (iv) and model with fully observed network).