mipfp : An R Package for Multidimensional Array Fitting and Simulating Multivariate Bernoulli Distributions

This paper presents the mipfp package for R, whose core functionality is the updating of a d-dimensional array with respect to given target marginal distributions, which in turn can be multi-dimensional. The implemented methods include the iterative proportional fitting procedure (IPFP), the maximum likelihood method, and the minimum chi-square and least squares procedures. The package also provides an application of the IPFP to simulate data from a multivariate Bernoulli distribution. The functionalities of the package are illustrated through two practical examples: the update of a 3-dimensional contingency table to match the targets for a synthetic population, and the estimation and simulation of the joint distribution of the binary attribute impaired pulmonary function as used by Qaqish, Zink, and Preisser (2012).


Introduction and motivation
Combining information from two or more data sets is an operation commonly required to estimate unknown population counts. Typically this involves integrating fully detailed and disaggregated data from one source with aggregated data from another source. Such processes are frequently implemented when generating synthetic populations (Beckman, Baggerly, and McKay 1996; Guo and Bhat 2007; Barthélemy and Cornelis 2012; Huynh, Barthélemy, and Perez 2016). For example, let x_{ijk} be the unknown population count referring to a cell of a 3-way contingency table, where index i stands for the household type, j for the gender and k for the age category. Often only marginal tables are released by statistical agencies, such as the gender population counts x_{•j•} = Σ_i Σ_k x_{ijk}. A survey, a random sample from a population, or previous estimates may also be available to generate an initial table x*_{ijk} known as the seed. The aim is, then, to estimate x_{ijk} from the marginal counts and the seed.
A number of packages already exist in R (R Core Team 2018) to achieve estimation using the iterative proportional fitting algorithm (IPFP), including ipfp (Blocker 2016), but they are usually limited to initial 2-dimensional tables and 1-dimensional target margins. These packages, however, do not provide a facility to assess the variability of the estimators.
The R function loglin from the stats package (R Core Team 2018) also relies on the IPFP to compute maximum likelihood estimates for log-linear models of multidimensional contingency tables. However, this function uses only the margins of the given initial array to perform the fitting and does not allow externally supplied margins. The fitted log-linear models also assume conditional independence given the margins, which can be an unrealistic assumption. Similar functionality is provided by the packages MASS (Venables and Ripley 2002) and cat (Harding and Tusell 2012).
These observations led the authors to create the mipfp package (Barthélemy and Suesse 2018) for R, providing in its original release a user-friendly multidimensional implementation of the IPFP (giving rise to the name of the package). The current version 3.2.1 of the package also includes several other distance-based fitting methods: maximum likelihood (ML), minimum χ2 (MCSQ) and minimum least squares (LSQ). All methods can deal with contingency tables and target margins of arbitrary dimensions. The package also includes an application of the IPFP to simulate data from multivariate Bernoulli distributions, as well as to estimate their parameters from given marginal probabilities and association parameters, such as correlations and odds ratios. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=mipfp.
The remainder of this paper is organized as follows. In Section 2, we first present the fitting methods available in the package before illustrating how they are used. The simulation and estimation of a multivariate Bernoulli distribution using the package functions is detailed in Section 3. Concluding remarks are found in Section 4.

Fitting multidimensional arrays
In this section we will briefly detail the methods implemented in the mipfp package to fit an initial multi-way (contingency) table with respect to known target marginal distributions (counts). The interested reader can find more details in Suesse, Namazi-Rad, Mokhtarian, and Barthélemy (2017).
The notation in this section refers to a 3-way table but generalizes straightforwardly to any number of dimensions. Assume three categorical variables X_1, X_2 and X_3 with I, J and K levels, respectively. Their initial contingency table, referred to as the seed, has given initial components (or cell counts) x*_{ijk} ∈ R_+, where i ∈ {1, . . . , I}, j ∈ {1, . . . , J} and k ∈ {1, . . . , K} correspond to the levels of the first, second and third variable, respectively. The unknown and estimated components of the fitted table are denoted by x_{ijk} and x̂_{ijk} ∈ R_+, respectively. The vectors x, x* and x̂ contain the elements of the unknown, initial and fitted tables, respectively. The vectorization (or reshaping) of the multidimensional arrays is performed with the last index changing fastest.
The set M of known desired target marginal counts is a non-empty subset of T = {x_{i••}, x_{•j•}, x_{••k}, x_{ij•}, x_{i•k}, x_{•jk}}, where • refers to the summation over the corresponding variable, e.g., x_{•jk} = Σ_i x_{ijk}. It should be noted that in this particular case each target is either a vector or a matrix. It is also assumed that each component of the margins is non-negative and that the target margins are consistent, i.e., they all sum up to the population total N; for example, Σ_i x_{i••} = Σ_j x_{•j•} = N.
The initial cell probabilities are given by the sample proportions π*_{ijk} = x*_{ijk} / N*, where N* is the seed total. Similarly, the estimated cell probabilities are obtained by π̂_{ijk} = x̂_{ijk} / N. Vectorizations of the unknown, initial and estimated cell probabilities are denoted by π, π* and π̂. Finally, let A be the matrix of full rank such that A⊤π = m, where the vector m contains all components but one of every target in M (to ensure that A has full rank), divided by N, i.e., the target probability margins. Note that A has r = I + J + K rows and c columns (the last one being a vector of ones), and the degrees of freedom of the estimator depend on the number of imposed constraints.
The aim is to find a suitable estimator x̂_{ijk} for x_{ijk} such that every resulting margin equals its corresponding target margin in the set M.
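As a side note on the vectorization convention above: base R stores arrays with the first index changing fastest, so obtaining a vector with the last index changing fastest requires transposing the array first. A small base-R illustration:

```r
# R stores arrays with the FIRST index changing fastest, so a vectorization
# with the LAST index changing fastest needs aperm() (a full transpose) first.
x <- array(1:8, dim = c(2, 2, 2))
as.vector(x)         # first index fastest: 1 2 3 4 5 6 7 8
as.vector(aperm(x))  # last index fastest:  1 5 3 7 2 6 4 8
```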
This can be achieved by using the IPFP, the maximum likelihood (ML) method, the minimum χ2 (MCSQ) method or the least squares (LSQ) method. These methods are briefly detailed in the remainder of this section, followed by an illustration of their application using the package's functions on real-world data.

Iterative proportional fitting procedure
One of the most popular approaches to estimate a d-way table based on known marginal tables and an initial contingency table is the iterative proportional fitting procedure originally described by Deming and Stephan (1940), which has been extensively studied over the decades (Lovelace, Birkin, Ballas, and van Leeuwen 2015). This procedure is also known in the literature as raking, matrix scaling or the RAS algorithm.
This method is formally described below for d = 3 but can easily be extended to any number of dimensions. For the sake of completeness we also assume that every target margin is available, i.e., M = T; it should be noted that this assumption is not always satisfied with real-world data. The procedure iteratively updates the cells of the array depending on the targets. For the one-dimensional targets, the adjustments at iteration l are computed by the equations

x_{ijk}^{(l,1)} = x_{ijk}^{(l-1)} · x_{i••} / x_{i••}^{(l-1)},
x_{ijk}^{(l,2)} = x_{ijk}^{(l,1)} · x_{•j•} / x_{•j•}^{(l,1)},
x_{ijk}^{(l,3)} = x_{ijk}^{(l,2)} · x_{••k} / x_{••k}^{(l,2)},     (2)

with analogous adjustments for the two-dimensional targets, e.g., x_{ijk}^{(l,4)} = x_{ijk}^{(l,3)} · x_{ij•} / x_{ij•}^{(l,3)}.
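The update equations above can be sketched in a few lines of base R for a 2-way table. This is an illustration only, not the package's implementation (the function name ipfp2d is made up; the package's Ipfp() handles arbitrary dimensions and margins):

```r
# Minimal IPFP sketch for a 2-way table: alternately rescale rows and
# columns until successive iterates change by less than tol.
ipfp2d <- function(seed, row.tgt, col.tgt, tol = 1e-11, iter = 1000) {
  x <- seed
  for (l in seq_len(iter)) {
    x.old <- x
    x <- x * row.tgt / rowSums(x)                 # adjust to row targets
    x <- sweep(x, 2, col.tgt / colSums(x), "*")   # then to column targets
    if (max(abs(x - x.old)) < tol) break          # stopping criterion
  }
  x
}
seed <- matrix(c(40, 30, 20, 10), nrow = 2)
fit <- ipfp2d(seed, row.tgt = c(50, 50), col.tgt = c(60, 40))
rowSums(fit)  # approximately c(50, 50)
colSums(fit)  # approximately c(60, 40)
```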
These iterations are performed until either a maximum number of iterations iter has been reached or the following stopping criterion is met:

max_{i,j,k} | x_{ijk}^{(l)} − x_{ijk}^{(l−1)} | ≤ tol,

where tol ∈ R_+ is a small constant. By default tol = 10^{−11} and iter = 1000 in the current implementation. The values at the last iteration are the x̂_{ijk}.
If only a subset of the target margins is available, i.e., M ⊂ T , then only the relevant adjustments in (2) are performed.
Finally, it can be noted that if an initial table is not available, this procedure can still be used by choosing initial values for all x_{ijk}^{(0)}. The choice of these initial values must then be made carefully, for the following reasons:
• IPFP preserves the correlation structure defined by the odds ratios of the initial table (Mosteller 1968).
• If the initial table contains some cells x_{ijk}^{(0)} = 0, then these cells remain 0 across the iterations. This property makes it easy to fit tables with structural zeros.
• If x_{ijk}^{(0)} = 1 ∀ i, j, k, then IPFP produces the same results as a log-linear model of conditional independence given the margins (as the R function loglin, among others, would). This may or may not be adequate depending on the application and should therefore be done only when such a model is deemed appropriate.
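The structural-zero behaviour can be checked directly with the package's Ipfp() function. The following is a small made-up example, assuming mipfp is installed; the chosen targets are consistent with the zero cell:

```r
library(mipfp)
# Seed with a structural zero in cell (2, 1); targets consistent with it.
seed <- matrix(c(10, 0, 5, 25), nrow = 2)
tgt.row <- c(30, 10)   # target row margins
tgt.col <- c(12, 28)   # target column margins
res <- Ipfp(seed, target.list = list(1, 2),
            target.data = list(tgt.row, tgt.col))
res$x.hat  # cell (2, 1) stays 0; rows sum to c(30, 10), columns to c(12, 28)
```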

Distance-based approaches
As an alternative to the IPFP, one can use a distance-based approach, consisting of solving the following optimization problem:

π̂ = arg min_π f(π)  subject to  A⊤π = m,     (1)

to find estimates π̂_{ijk}, where different specifications of f(π) lead to different models; see Little and Wu (1991) for the corresponding models. In particular, we consider the following functions:
• the maximum likelihood (ML) method under random sampling, obtained with f(π) = −Σ_{i,j,k} π*_{ijk} ln π_{ijk};
• the minimum χ2 method (MCSQ), characterized by f(π) = Σ_{i,j,k} (π_{ijk} − π*_{ijk})² / π_{ijk};
• and the minimum least squares (LSQ) method, defined by f(π) = Σ_{i,j,k} (π_{ijk} − π*_{ijk})² / π*_{ijk}.
The count estimates are finally obtained by the relation x̂_{ijk} = N π̂_{ijk}.

Table 1: Diagonal elements of D_1 and D_2. The vector division is performed component-wise.
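In the package, these objective functions are selected via the method argument of Estimate(). A toy sketch on an invented 2 × 2 seed, assuming mipfp is installed:

```r
library(mipfp)
# Invented seed and consistent target margins (both totals equal 200).
seed <- array(c(80, 60, 20, 40), dim = c(2, 2))
tgt.dims <- list(1, 2)
tgt.data <- list(c(100, 100), c(120, 80))
r.ml  <- Estimate(seed, target.list = tgt.dims, target.data = tgt.data,
                  method = "ml")
r.lsq <- Estimate(seed, target.list = tgt.dims, target.data = tgt.data,
                  method = "lsq")
r.ml$x.hat  # fitted counts; margins match c(100, 100) and c(120, 80)
```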
Interestingly, all four estimation methods (IPFP, ML, MCSQ and LSQ) are also maximum likelihood methods for various mis-specification models. Such models link a target population with known margins and a sample obtained from a different population, the so-called non-target population. This is practically important, as often the available sample is not a real sample from the target population, and a mis-specification model then allows estimation of the target population counts (see Little and Wu 1991; Suesse et al. 2017, for more details).

Covariance estimation
Assuming that the distribution of the random sample used to derive π* can be approximated by a multinomial distribution, then following Little and Wu (1991) and Freeman and Koch (1976), the asymptotic covariance matrix of the estimated probabilities, which provides an uncertainty measure, is derived from the Delta method and takes the sandwich form

Cov(π̂) = U (U⊤ D_1 U)^{−1} U⊤ D_2 U (U⊤ D_1 U)^{−1} U⊤,

where D_1 and D_2 are diagonal matrices whose diagonal elements are given in Table 1, and the matrix U is the orthogonal complement of A such that A⊤U = 0 and (A|U) has full rank.
The asymptotic covariance of the estimated cell counts is then simply given by Cov(x̂) = N² Cov(π̂). Lang (2004, 2005) also proposed an alternative covariance matrix for the estimators, expressed in terms of H_π̂, which denotes J_h(π̂), the Jacobian, evaluated at π̂, of the constraint function h characterizing the target margins. The matrix A_{−1} differs from A in that the last column of ones is removed.

Goodness of fit statistics
Three statistics measuring the discrepancy between the seed and the fitted table are also provided. Denoting the sample size by N, they are formally defined as:
• Wilks' log-likelihood ratio statistic: G² = 2 N Σ_{i,j,k} π*_{ijk} ln(π*_{ijk} / π̂_{ijk});
• the Wald statistic: W² = N Σ_{i,j,k} (π*_{ijk} − π̂_{ijk})² / π*_{ijk};
• and the Pearson χ² statistic: X² = N Σ_{i,j,k} (π*_{ijk} − π̂_{ijk})² / π̂_{ijk}.
The degrees of freedom for these statistics correspond to the number of components in m.
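The log-likelihood ratio and Pearson statistics can be sketched in base R as follows. This is a hypothetical helper (gof.stats is not part of the package, and the Wald statistic is omitted here):

```r
# Hypothetical helper computing G2 and X2 from the seed proportions p.star
# and the fitted probabilities p.hat (both vectors summing to 1), given the
# sample size N and the degrees of freedom df.
gof.stats <- function(p.star, p.hat, N, df) {
  G2 <- 2 * N * sum(p.star * log(p.star / p.hat))  # Wilks statistic
  X2 <- N * sum((p.star - p.hat)^2 / p.hat)        # Pearson statistic
  c(G2 = G2, p.G2 = pchisq(G2, df, lower.tail = FALSE),
    X2 = X2, p.X2 = pchisq(X2, df, lower.tail = FALSE))
}
gof.stats(c(0.25, 0.25, 0.25, 0.25), c(0.3, 0.2, 0.3, 0.2), N = 100, df = 1)
```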

Functions description and illustrative example
The IPFP and the distance-based approaches are implemented in the Estimate() function. Its arguments, along with their descriptions, are listed in Table 4. The minimal requirements for Estimate() consist of an initial array to be updated, a list describing the dimensions of the target margins, a list containing the data of the target margins, and the method to be used for the estimation. The function returns an object of class 'mipfp', as detailed in Table 5, containing the updated array and information about the convergence of the selected algorithm.
It should be noted that in the case of the distance-based approaches, the optimization is performed using the function solnp from the package Rsolnp (Ghalanos and Theussl 2015).
To illustrate the functionalities of the package, the data frame spnamur, representing a synthetic population of 105,248 individuals for the city of Namur (Belgium), is used. We refer the interested reader to Barthélemy and Toint (2013) for a detailed analysis and description of the generation of this data set. The variables of the data set are detailed in Table 2. For the sake of simplicity we disregard the last three variables in this example and focus only on household type (hht), gender (gen) and professional status (pro). For illustration purposes we obtain the seed and the target margins from the synthetic population: the margins are extracted from the contingency table of the synthetic population, while the seed is obtained by taking a 10% simple random sample drawn (without replacement) from the synthetic population.
Nevertheless it should be noted that in a real world application, the original data from which the seed and the target margins are derived are usually not available. Typically the target margins come from current aggregated census data while the seed is provided from other data sources, such as surveys or disaggregated partial census data.
The package and the data are loaded, and a random seed is set, with:

R> library("mipfp")
R> set.seed(1234)
R> data("spnamur", package = "mipfp")

The 3-dimensional contingency table of the synthetic population is stored in the object true.table. The target margins can then be easily extracted from this table. In this illustrative example, we consider the (multi-dimensional) target marginal distributions detailed in Table 3, for instance:

R> tgt.hht <- apply(true.table, 1, sum)

It can be observed that the targets defined in tgt.hht can be derived from tgt.hht.gen. Ideally the former should be discarded, but it is still included in this example to demonstrate that the package functions are able to detect and remove unnecessary target margins. The target margins and their descriptions (i.e., the indices of the variables involved) are then stored in the lists tgt.data and tgt.list.dims, respectively.

The next step extracts the 10% sample from the original synthetic population in order to generate the initial seed contingency table seed.table. Having a seed and a set of target margins, the IPFP and the distance-based approaches can now be applied to obtain an estimate of the true contingency table. This is achieved with the function Estimate(), returning an object of class 'mipfp':

R> r.ipfp <- Estimate(seed = seed.table, target.list = tgt.list.dims,
+                     target.data = tgt.data)
R> r.ml   <- Estimate(seed = seed.table, target.list = tgt.list.dims,
+                     target.data = tgt.data, method = "ml")
R> r.chi2 <- Estimate(seed = seed.table, target.list = tgt.list.dims,
+                     target.data = tgt.data, method = "chi2")
R> r.lsq  <- Estimate(seed = seed.table, target.list = tgt.list.dims,
+                     target.data = tgt.data, method = "lsq")

A summary of the results can then be obtained using the summary() method; it details the values of the estimates, their standard deviations, their t scores and associated p values, the absolute maximum deviation between every target and its corresponding generated margin (referred to as the margins errors) and the goodness of fit statistics. For instance, the summary for the object r.ipfp is given by:

R> summary(r.ipfp)

These first results from the IPFP show that, after convergence of the algorithm, the margins of the estimated cells fit the known target margins.
Indeed, Margins errors reports a small absolute maximum deviation between the desired and estimated margins (≈ 3.6 × 10^{−12}). The p values associated with each cell strongly reject the underlying null hypothesis that the cell count is zero; hence every cell is significant in this example.
The goodness of fit statistics defined in Section 2.4 and their associated p values indicate that the seed (the 10% sample) agrees with the imposed margins, as expected, since the sample and the margins were obtained from the same population data. More details about the inputs and outputs of the summary() method can be found in Tables 6 and 7. It should be noted that the results of the other methods, stored in r.ml, r.chi2 and r.lsq, are similar to those of r.ipfp.
The confidence intervals of the estimates (Smithson 2002) can be easily computed with the confint() method, as illustrated for the object r.ipfp. The inputs of the method are documented in Table 8 and the output is a matrix containing the lower and upper bounds of the estimates. Other S3 methods for objects of class 'mipfp' are also available, such as vcov() to compute the variance-covariance matrix of the estimates, coef() to extract the estimates and print() to print a short description of the object (see Barthélemy and Suesse 2018).
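A minimal, self-contained sketch of these S3 methods on an invented toy table (assuming mipfp is installed):

```r
library(mipfp)
# Invented 2 x 2 seed and consistent margins (both totals equal 100).
seed <- array(c(50, 30, 10, 10), dim = c(2, 2))
res <- Estimate(seed, target.list = list(1, 2),
                target.data = list(c(60, 40), c(70, 30)))
confint(res, level = 0.95)  # lower/upper bounds of the estimates
vcov(res)                   # covariance matrix of the estimates
coef(res)                   # the estimated cell counts
```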
In order to validate the implementation of the different methods, to assess their convergence and to compare their results, the CompareMaxDev() function is provided. First we check the absolute maximum deviation between every target and its corresponding generated margin:

R> CompareMaxDev(list(r.ipfp, r.ml, r.chi2, r.lsq), echo = TRUE)
In this example the true contingency table is stored in true.table.

Simulating multivariate Bernoulli distributions
Consider K binary variables Y_1, . . . , Y_K with success probabilities π_i = P(Y_i = 1) for i = 1, . . . , K. Under independence, random numbers or data referring to these K variables can easily be generated by drawing independently from the corresponding Bernoulli distributions. Under dependence, the simulation of multivariate binary data becomes more complicated, because the underlying distribution is characterized by 2^K probabilities which add up to 1.
When K is large, specifying and determining 2^K probabilities becomes impractical and often infeasible. A simpler approach is provided by the IPFP, as suggested by various authors, for example Lee (1993) and Gange (1995), and has been applied in various simulation studies of multivariate binary data, e.g., Bilder, Loughin, and Nettleton (2000); Liu and Suesse (2008); Suesse and Liu (2012). The approach is based on specifying the K probabilities π_1, . . . , π_K and the K(K − 1)/2 pairwise probabilities π_ij = P(Y_i = 1, Y_j = 1). The IPFP finds a solution of 2^K probabilities such that the marginal one- and two-dimensional probabilities equal {π_i} and {π_ij}. In practice, specifying pairwise probabilities is difficult, as these are bounded by the marginals, i.e., max(0, π_i + π_j − 1) ≤ π_ij ≤ min(π_i, π_j).
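These bounds are straightforward to compute; a minimal base-R sketch (the helper name pairwise.bounds is hypothetical, not part of mipfp):

```r
# Feasible range for the pairwise probability pi_ij given the two marginals.
pairwise.bounds <- function(pi.i, pi.j) {
  c(lower = max(0, pi.i + pi.j - 1), upper = min(pi.i, pi.j))
}
pairwise.bounds(0.2, 0.4)  # pi_12 must lie in [0, 0.2]
```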
There are often many solutions, and the IPFP converges to one of them. Other approaches exist, for example linear programming (Lee 1993); however, the IPFP is attractive because its final solution usually has strictly positive joint probabilities, meaning that none of the 2^K theoretically possible sequences is excluded. This is in contrast to linear programming, which often has several zero joint probabilities in the final solution, meaning these sequences can never be generated in the sampling process, an undesirable practical property.
Alternatively, correlations and odds ratios are standard measures of association between two random variables and can be used to determine the π_ij required by the IPFP. While correlations are frequently used for continuous random variables, the odds ratio is frequently used for categorical variables, because the correlation between Y_i and Y_j,

COR(Y_i, Y_j) = (π_ij − π_i π_j) / √(π_i (1 − π_i) π_j (1 − π_j)),

is bounded in a similar fashion as π_ij. In contrast, the odds ratio OR_ij for variables Y_i and Y_j, defined by

OR_ij = π_ij (1 − π_i − π_j + π_ij) / ((π_i − π_ij)(π_j − π_ij)),

is not bounded and can take any value in (0, ∞).

Qaqish et al. (2012) analyze a frequently used data set on n = 407 parents and siblings of subjects with chronic obstructive pulmonary disease and their controls, with the binary outcome of interest being impaired pulmonary function. These data are clustered, as observations come from families, and the family size K = 1, 2, . . . , 10 varies. The primary focus of the authors was to model the probability of impaired pulmonary function as a function of sex, race, age, smoking status and an indicator as to whether a relative had the same disease. While a logit model was applied to model {π_i}, the odds ratio was used as the measure of association. To simplify a model for the K(K − 1)/2 odds ratios, each distinct family relationship was modeled with a different parameter: parent-parent with α_PP, parent-sibling with α_PS and sibling-sibling with α_SS. Joint estimation of such mean and association models can be achieved with generalized estimating equations (GEE2) or the method of orthogonalized residuals; see Qaqish et al. (2012) and Liang, Zeger, and Qaqish (1992) for more information on these approaches. For example, the OR estimates for the method of orthogonalized residuals (see method ORTH MOMENT in Table 3 in Liang et al. 1992) are α_PP = 0.283, α_PS = 2.214 and α_SS = 2.186. We do not aim at estimating these parameters but at generating data from the underlying joint distribution.
Function description and illustrative example
The package and the data are loaded with:

R> library("mipfp")
R> data("Qaqish", package = "mipfp")

Let us consider for simplicity a family of four with two parents and two siblings. The matrix containing the odds ratios used in this example is then extracted with:

R> or <- Qaqish$or
R> or

For simplicity, let us fix π = (π_1, π_2, π_3, π_4) as:

R> p <- c(0.2, 0.4, 0.6, 0.8)

which would usually be determined by the logistic regression model for a given set of covariates for the four individuals under consideration. Estimating the joint distribution via the IPFP can then be achieved with:

R> p.joint <- ObtainMultBinaryDist(odds = or, marg.probs = p)

The 2^4 = 16 joint probabilities are stored in p.joint$joint.proba and can now be used to generate multivariate binary data with RMultBinary(). For instance, simulating 100,000 data points of size 4 from the obtained joint distribution can be done with:

R> y.sim <- RMultBinary(n = 1e5, mult.bin.dist = p.joint)$binary.sequences

To confirm the results, the estimated probabilities and odds ratios from the generated random data can be checked; for instance, the marginal probabilities are obtained with:

R> apply(y.sim, 2, mean)
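The pairwise associations can be checked in a similar way by computing empirical odds ratios from the simulated sequences. The helper below is hypothetical (not part of mipfp) and assumes a 0/1 matrix such as y.sim:

```r
# Hypothetical helper: empirical odds ratio between binary columns i and j
# of a 0/1 matrix y, based on the 2 x 2 cross-tabulation of the two columns.
emp.or <- function(y, i, j) {
  tab <- table(factor(y[, i], levels = 0:1), factor(y[, j], levels = 0:1))
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}
# e.g., emp.or(y.sim, 1, 2) should be close to the target alpha_PP = 0.283
```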

Conclusion
In this paper we have presented the R package mipfp. It provides several methods for updating an initial array (or seed) with respect to given target margins, namely the maximum likelihood, minimum χ2 and minimum least squares methods, as well as the well-known iterative proportional fitting procedure. The package provides the first and crucial step for generating synthetic populations where a disaggregated sample and aggregated margins are available. Unlike the other fitting methods already implemented in several packages, it can also compute an approximation of the covariance matrix and standard errors of the estimates, which can be used to assess their accuracy and to derive confidence intervals.
Package mipfp also provides an application of the iterative proportional fitting to simulate data from and estimate multivariate Bernoulli distributions, an important application for simulation studies. The main functions of mipfp have been successfully illustrated through the use of data sets included in the package.
Extensions will be implemented in future versions of the package. Those include additional uncertainty measures for the estimated cells induced by fixed marginals such as Fréchet bounds (Fienberg 1999;Dobra and Fienberg 2001) and rounding procedures of the estimated counts (Lovelace and Ballas 2013). Finally, we also aim at adding methods for synthetic population generation and other estimation methods.

A. Documentation
Argument: Description
seed: An array storing the initial multi-dimensional table to be updated. Each cell must be non-negative.
target.list: A list of the dimensions of the marginal target constraints in target.data. Each component of the list is an array whose cells indicate which dimension the corresponding margin relates to.
target.data: A list containing the data of the target margins. Each component of the list is an array storing a margin. The list order must follow the one defined in target.list. Note that the cells of the arrays must be non-negative.
method: An optional character string indicating which method is to be used to update the seed. This must be one of the strings "ipfp" (iterative proportional fitting procedure, the default), "ml" (maximum likelihood), "chi2" (minimum chi-squared) or "lsq" (least squares).

keep.input: A Boolean indicating whether seed, target.data and target.list must be saved in the output when set to TRUE.
...: Additional arguments that can be passed to the functions Ipfp() and ObtainModelEstimates(). See their respective documentation for more details (Barthélemy and Suesse 2018).

Name: Description
x.hat: Array of the same dimension as seed containing the updated cell values, whose margins match the ones specified in target.list.
p.hat: Array of the same dimension as x.hat containing the updated cell probabilities.
error.margins: List returning, for each margin, the absolute maximum deviation between the desired and generated margin.
conv: Boolean indicating whether the algorithm converged to a solution.
evol.stp.crit: Vector containing the values of the stopping criterion over the iterations if the selected method is "ipfp".
solnp.res: The estimation process uses the solnp optimization function from the R package Rsolnp, and solnp.res is the corresponding object returned by the solver (if the selected method is not "ipfp").
method: The selected estimation method.
call: The matched call.
If keep.input = TRUE:
seed: The original seed.
target.list: The original target.list.
target.data: The original target.data.

method: The method used to generate the estimates.
df: Degrees of freedom of the estimates.
estimates: An array containing the estimates generated by the selected method with their standard deviations and associated t and p values.
error.margins: A list returning, for each margin, the absolute maximum deviation between the desired and generated margin.
vcov: A covariance matrix of the estimates (last index moves fastest) computed using the method specified in cov.method.
tab.gof: A table containing the log-likelihood (G2), Wald (W2) and Pearson chi-squared (X2) statistics with their associated p values.

stats.df: Degrees of freedom for the G2, W2 and X2 statistics.

dim.names: Original dimension names of the estimated table.
l.names: The value of the parameter l.names.

mult.bin.dist: A list describing the joint distribution, as generated by ObtainMultBinaryDist(). The list contains at least the element joint.proba, an array detailing the joint probabilities of the K binary variables. The array has K dimensions of size 2, referring to the 2 possible outcomes of each variable; hence the total number of elements is 2^K. Additionally the list can provide the element var.label, a list containing the names of the K variables.
target.values: A list describing the possible outcomes of each binary variable, for instance {1, 2}. Default = {0, 1}.
Table 9: List of arguments for the function RMultBinary().

Name: Description
binary.sequences: The generated K × n random sequence.
possible.binary.sequences: The set of possible binary sequences, i.e., the domain.

chosen.random.index: The index of the random draws in the domain.

Name: Description
odds: A K × K matrix where the entry in the ith row and jth column represents the odds ratio between variables i and j. Must be provided if corr is not.
corr: A K × K matrix where the entry in the ith row and jth column represents the correlation between variables i and j. Must be provided if odds is not.
marg.probs: A vector with K elements of marginal probabilities, where the ith entry refers to P(X_i = 1).
...: Additional arguments that can be passed to the Ipfp() function, such as tol, iter, print and compute.cov.
Table 11: List of arguments for the function ObtainMultBinaryDist().

Name: Description
joint.proba: The resulting multivariate joint probabilities (from Ipfp).
stp.crit: The final value of the Ipfp stopping criterion.
check.margins: A list returning, for each margin, the absolute maximum deviation between the desired and generated margin. Ideally the elements should be close to 0 (from Ipfp).