gcimpute: A Package for Missing Data Imputation

This article introduces the Python package gcimpute for missing data imputation. gcimpute can impute missing data with many different variable types, including continuous, binary, ordinal, count, and truncated values, by modeling data as samples from a Gaussian copula model. This semiparametric model learns the marginal distribution of each variable to match the empirical distribution, yet describes the interactions between variables with a joint Gaussian that enables fast inference, imputation with confidence intervals, and multiple imputation. The package also provides specialized extensions to handle large datasets (with complexity linear in the number of observations) and streaming datasets (with online imputation). This article describes the underlying methodology and demonstrates how to use the software package.


Introduction
Missing data is ubiquitous in modern datasets, yet most machine learning algorithms and statistical models require complete data. Thus missing data imputation forms the first critical step of many data analysis pipelines. The difficulty is greatest for mixed datasets, which include continuous, binary, ordinal, count, nominal, and truncated variables. Mixed datasets may appear either as a single dataset recording different types of attributes or as an integrated dataset drawn from multiple sources. For example, social survey datasets are generally mixed since they often contain age (continuous), demographic group variables (nominal), and Likert scales (ordinal) measuring how strongly a respondent agrees with certain stated opinions, such as the five category scale: strongly disagree, disagree, neither agree nor disagree, agree, strongly agree. The Cancer Genome Atlas Project (https://www.cancer.gov/ccg/research/genome-sequencing/tcga) is an example of an integrated mixed dataset: it contains gene expression (microarray, continuous), mutation (binary), and microRNA expression (RNA-seq count) data. Imputation may be challenging even for datasets with only continuous variables if the variables have very different scales and variability.
The Gaussian copula model nicely addresses the challenges of modeling mixed data by separating the multivariate interaction of the variables from their marginal distributions (Liu, Lafferty, and Wasserman 2009; Hoff 2007; Fan, Liu, Ning, and Zou 2017). Specifically, this model posits that each data vector is generated by first drawing a latent Gaussian vector and then transforming it to match the observed marginal distribution of each variable. In this way, ordinals result from thresholding continuous latent variables. A copula correlation matrix fully specifies the multivariate interaction and is invariant to strictly monotonic marginal transformations of the variables. Zhao and Udell (2020b) propose to impute missing data by learning a Gaussian copula model from incomplete observations and show empirically that the resulting imputations achieve state-of-the-art performance. Following this line of work, Zhao and Udell (2020a) develop a low rank Gaussian copula that scales well to large datasets, and Zhao, Landgrebe, Shekhtman, and Udell (2022) extend the model to online imputation of a streaming dataset using online model updates. This article introduces an additional methodological advance by extending the Gaussian copula model to support truncated variables. Truncated variables are continuous variables restricted to an interval (which may be half-open); see Section 2.1 and Table 2 for a precise definition. One example is the zero-inflated variable: a non-negative variable with excess zeros, which often appears when a continuous variable is measured by a machine that cannot distinguish small values from zero.
Reliable decision-making with missing data requires a method to assess the uncertainty introduced by imputation. Typically, imputation software quantifies uncertainty either by providing explicit confidence intervals for the imputations or by providing multiple imputations (Rubin 1996). Multiple imputations allow the end user to incorporate imputation uncertainty into subsequent analysis, for example, by conducting the desired analysis on each imputed dataset and combining the results. Zhao and Udell (2020a) derive analytical imputation confidence intervals when all variables are continuous. In this article, we further develop a multiple imputation method for Gaussian copula imputation. Furthermore, we provide confidence intervals based on multiple imputation that are valid for mixed data.
The gcimpute package is available at https://pypi.org/project/gcimpute/ and implements the methodology presented in Zhao and Udell (2020a,b), Zhao et al. (2022), and the new advances mentioned above: it supports imputation for continuous, binary, ordinal, count, and truncated data, confidence intervals, multiple imputation, large-scale imputation using the low rank Gaussian copula model, and online imputation. Nominal variables cannot be directly modeled by a Gaussian copula model, but gcimpute also accepts nominal variables by one-hot encoding them into binary variables.
We present the technical background in Section 2 and demonstrate how to use gcimpute through examples drawn from real datasets in Section 3.

Why not other copulas?
There exist many other copula models (see Jaworski, Durante, Hardle, and Rychlik 2010). However, model estimation for other copula models from incomplete data is not well studied, and the conditional distribution required by the imputation task rarely admits a closed form. These two challenges together make accurate imputation using other copula models very difficult.
Table 1: Suitability of imputation methods for different p (number of variables). For small n (number of samples) scenarios, simple imputation methods such as mean imputation are recommended due to limited information.

Software for missing data imputation
There are many software implementations available for imputing missing data, with R (R Core Team 2023) offering the greatest variety (Mayer, Sportisse, Josse, Tierney, and Vialaneix 2022). An imputation package will tend to work best on data that matches the distributional assumptions used to develop it. The popular package Amelia (Honaker, King, and Blackwell 2011) makes the strong assumption that the input data is jointly normally distributed, which cannot be true for mixed data. Package missMDA (Josse and Husson 2016) imputes missing data based on principal component analysis, and handles mixed data by one-hot encoding nominal variables. Packages mice and missForest iteratively train models to predict each variable from all other variables. They handle mixed data by choosing appropriate learning methods based on each data type. Package missForest uses random forest models as base learners and performs well provided there are sufficient samples to learn the non-linear relationships among variables. In general, it yields more accurate imputations than mice, which uses variants of linear models (Stekhoven and Bühlmann 2012). In the computational experience of the authors, gcimpute outperforms missForest on binary, ordinal, and continuous mixed data (Zhao and Udell 2020b). When the data includes nominal variables, which are poorly modeled by any of the other assumptions (low rank, joint normality, or Gaussian copula), missForest generally works best.
The imputation methods mentioned above typically perform better when the sample size n is large, as there is more information available to learn from. When n is small, simple imputation methods like mean imputation are often recommended due to the limited amount of available information. Amelia, mice, and missForest can work well when the number of variables p is small, but run too slowly for large p. Methods with weaker structural assumptions like gcimpute and missForest yield better imputations, as they are able to learn more complex relationships among the variables. Methods that rely on a low rank assumption scale well to large datasets. They tend to perform well when both n and p are large, as these data tables generally look approximately low rank (Udell and Townsend 2019), but can fail when either n or p is small. Low rank imputation methods include missMDA, softImpute (Hastie and Mazumder 2021), GLRM (Udell, Horn, Zadeh, and Boyd 2016), and the low rank model from gcimpute. Hence, gcimpute provides a compelling imputation method for data of all moderately large sizes. We summarize our recommendations in Table 1.
There are also a few copula based imputation packages in R. Package sbgcop (Hoff 2018) uses the same model as gcimpute but provides a Bayesian implementation using a Markov chain Monte Carlo (MCMC) algorithm. Package gcimpute uses a frequentist approach to achieve the same level of accuracy as sbgcop but much more quickly (Zhao and Udell 2020b). Package mdgc (Christoffersen 2023) amends the algorithm in Zhao and Udell (2020b) by using a higher quality approximation for certain steps in the computation, improving model accuracy but significantly increasing the runtime when the number of variables is large (p > 100). Package CoImp (Di Lascio and Giannerini 2019) uses only complete cases to fit the copula model and is unstable when most instances have missing values. In contrast, gcimpute can robustly fit the model even when every instance contains missing values. Moreover, gcimpute is the first copula package to fit moderately large datasets (large p), by assuming the copula has low rank structure, and the first to fit streaming datasets, using online model estimation.

Mathematical background
Package gcimpute fits a Gaussian copula model on a data table with missing entries and uses the fitted model to impute the missing entries. It can return a single imputed data matrix with imputation confidence intervals, or multiple imputed data matrices. Once a Gaussian copula model is fitted, it can also be used to impute missing entries in new out-of-sample rows.
Let us imagine that we wish to use gcimpute on a data table X with n rows and p columns. We refer to each row x of X as a sample, and to each column as a variable. Package gcimpute is designed for datasets whose variables admit a total order: that is, for any two values x_1 and x_2 of the same variable, either x_1 > x_2 or x_1 ≤ x_2. Each variable may have a distinct type: for example, numeric, Boolean, ordinal, count, or truncated. Nominal variables do not have an ordering relationship. By default, gcimpute encodes nominal variables as binary variables using a one-hot encoding, although other encodings are possible. However, this encoding is not self-consistent for the copula model. We advise users to model their features directly as ordinal or binary, if possible. Package gcimpute learns the univariate distribution of each variable without any distributional assumption, and then estimates the multivariate structure based on the learned univariate distributions.
Package gcimpute offers specialized implementations for large datasets and streaming datasets. Large datasets with many samples or many variables can use an efficient implementation that exploits mini-batch training, parallelism, and low rank structure. For streaming datasets, it can impute missing data immediately upon seeing a new sample and update model parameters without remembering all historical data. This method is more efficient and can offer a better fit for non-stationary data.

Gaussian copula model
The Gaussian copula (Hoff 2007; Liu et al. 2009; Fan et al. 2017; Feng and Ning 2019; Zhao and Udell 2020b) models complex multivariate distributions as transformations of latent Gaussian vectors. More specifically, it assumes that the complete data x ∈ R^p is generated as a monotonic transformation of a latent Gaussian vector z ∈ R^p:

x = (x_1, ..., x_p) = (f_1(z_1), ..., f_p(z_p)) := f(z),   for z ∼ N(0, Σ).

Table 2: For any random variable x admitting a total order, there exists a unique monotonic transformation f such that f(z) = x for a random standard Gaussian z. For each data type of x, this table includes its distribution specification, the marginal f, and its set inverse.
The marginal transformations f_1, ..., f_p : R → R match the distribution of the observed variable x to that of the transformed Gaussian f(z) and are uniquely identifiable given the cumulative distribution function (CDF) of each variable x_j. This model separates the multivariate interaction from the marginal distributions: the monotone f establishes the mapping from the latent variables to the observed variables, while Σ fully specifies the dependence structure. We write x ∼ GC(Σ, f) to denote that x follows the Gaussian copula model with marginal f and copula correlation Σ.
Variables and their marginals. When the variable x_j is continuous, f_j is strictly monotonic. When the variable x_j is ordinal (including binary as a special case), f_j is a monotonic step function (Zhao and Udell 2020b). The copula model also supports one and two sided truncated variables. A one sided truncated variable is a continuous variable truncated either below or above. A variable x truncated below at α has the CDF shown below (at realization x = x*):

F_x(x*) = 1(x* ≥ α) (p_α + (1 − p_α) F(x*)),

where p_α = P(x = α) and F is the CDF of a continuous random variable satisfying F(α) = 0. An upper truncated variable and a two sided truncated variable are defined similarly. The CDF of a truncated variable is a strictly monotonic function with a step either on the left (lower truncated), on the right (upper truncated), or on both sides (two sided truncated). The expressions for f_j, as well as the set inverses f_j^{-1}(x_j) := {z_j | f_j(z_j) = x_j}, are given in Table 2. In short, f_j explains how the data is generated, while f_j^{-1} describes the information available for model inference given the observed data. Figure 1 depicts how a Gaussian variable is transformed into an exponential variable, a lower truncated variable, and an ordinal variable. Figure 2 depicts the dependency structure induced by a Gaussian copula model: it plots randomly drawn samples from 2D Gaussian copula models with the same marginal distributions as in Figure 1. It shows that the Gaussian copula model is much more expressive than the multivariate normal distribution.
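To make the marginal transformations concrete, the following minimal sketch (our own illustration, not gcimpute code) pushes a latent standard Gaussian sample through f(z) = F^{-1}(Φ(z)) for an exponential, a lower truncated, and an ordinal marginal, mirroring Figure 1:

# A minimal sketch (not gcimpute code) of the marginal transformation
# f(z) = F^{-1}(Phi(z)) for three marginals, mirroring Figure 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z = rng.standard_normal(5000)            # latent standard Gaussian sample
u = stats.norm.cdf(z)                    # uniform scores Phi(z)
x_exp = stats.expon.ppf(u)               # continuous: exponential marginal
x_trunc = np.maximum(stats.norm.ppf(u, loc=1), 0)   # lower truncated at 0 (zero-inflated)
x_ord = np.digitize(u, [0.2, 0.5, 0.8])  # ordinal: thresholding into four levels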
By default, gcimpute categorizes a count variable as one of the above variable types based on its distribution (see Section 3.1 for the rule). Package gcimpute also provides a Poisson model for count variables. The difference lies in how f_j and f_j^{-1} are estimated: the estimation changes depending on whether a parametric form of the function to be estimated is assumed. We defer the detailed discussion to Section 2.3, after introducing marginal transformation estimation.

Missing data imputation
Package gcimpute uses the observed entries, along with the estimated dependence structure of the variables, to impute the missing entries. In this section, let us suppose we have estimates of the model parameters f and Σ and see how to impute the missing entries. We discuss how to estimate the model parameters in the next section.
Every sample is independent conditional on f and Σ, so we may independently consider how to impute missing data in each sample. For a sample x ∼ GC(Σ, f), let us denote the observed variables as O and the missing variables as M, so x_O is a vector of length |O| that collects the observed entries. Since x ∼ GC(Σ, f), there exists z ∼ N(0, Σ) such that x = f(z). Our first task is to learn the distribution of the missing entries in the latent space. Denote Σ_{I,J} as the sub-matrix of Σ with rows in I and columns in J. Now, z is multivariate normal, so the conditional distribution of z_M given z_O is also normal:

z_M | z_O ∼ N( Σ_{M,O} Σ_{O,O}^{-1} z_O , Σ_{M,M} − Σ_{M,O} Σ_{O,O}^{-1} Σ_{O,M} ).

We can map this distribution to a distribution on x_M using the marginal transformation f.

Observations and their latent consequences. To estimate the distribution of z_M, we must model the distribution of z_O using the observed values x_O. For an observed continuous variable value x_j, the corresponding z_j takes value f_j^{-1}(x_j) with probability 1. For an observed ordinal variable value x_j, f_j^{-1}(x_j) is an interval, since f_j is a monotonic step function. Hence the distribution of z_j conditional on x_j is a truncated normal on the interval f_j^{-1}(x_j). For an observed truncated variable x_j, z_j takes value f_j^{-1}(x_j) with probability 1 if x_j is not the truncation value; otherwise z_j follows a truncated normal on the interval f_j^{-1}(x_j).

If only a single imputation is used, gcimpute first computes the conditional mean of z_M given x_O, and then returns the imputation by applying the transformation f. If multiple imputations are used, gcimpute instead samples from the conditional distribution of z_M given x_O, and then transforms the sampled values by applying the transformation f.

Package gcimpute can return confidence intervals for any single imputation. If all observed variables x_O are continuous, z_O has all probability mass at a single point and thus z_M has a multivariate normal distribution. In this scenario, gcimpute first computes the normal confidence interval and then transforms it through f to produce a confidence interval for the imputation. In other cases, gcimpute computes an approximate confidence interval by assuming that z_O has all probability mass at its conditional mean given x_O and then computes the normal confidence interval of z_M as it does when all variables are continuous. The approximated confidence intervals are still reasonably well calibrated if there are not too many ordinal variables. Otherwise, gcimpute provides a safer approach to build confidence intervals: perform multiple imputation and take empirical percentiles of the imputed values.
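To illustrate the latent-space computation above, the following sketch (a direct transcription of the conditional normal formula, not gcimpute internals) computes the conditional mean and covariance of z_M given z_O for a known correlation matrix and fully continuous observed entries:

import numpy as np

def conditional_normal(Sigma, z_obs, obs_idx, mis_idx):
    # z_M | z_O ~ N(Sigma_MO Sigma_OO^{-1} z_O,
    #               Sigma_MM - Sigma_MO Sigma_OO^{-1} Sigma_OM)
    S_OO = Sigma[np.ix_(obs_idx, obs_idx)]
    S_MO = Sigma[np.ix_(mis_idx, obs_idx)]
    S_MM = Sigma[np.ix_(mis_idx, mis_idx)]
    A = S_MO @ np.linalg.inv(S_OO)   # regression coefficients
    mean = A @ z_obs                 # conditional mean (used for single imputation)
    cov = S_MM - A @ S_MO.T          # conditional covariance (used for sampling and CIs)
    return mean, cov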

Algorithm
Inference for the Gaussian copula model estimates the marginal transformations f_1, ..., f_p, as well as their inverses, and the copula correlation matrix Σ. The estimates of the marginal distributions and their inverses rely on the empirical distribution of each observed variable. The marginals may be consistently estimated under a missing completely at random (MCAR) mechanism; see Little and Rubin (2019, Chapter 1.3) for precise definitions of missing data mechanisms. Otherwise, these estimates are generally biased: for example, if larger values are missing with higher probability, the empirical distribution is not a consistent estimate of the true distribution.
Estimating the copula correlation Σ is a maximum likelihood estimation (MLE) problem. Estimates for the correlation are consistent under the missing at random (MAR) mechanism, provided the marginals are known or consistently estimated (Little and Rubin 2019).

Marginal transformation estimation. As shown in Table 2, both f_j and f_j^{-1} only depend on the distribution of the observed variable x_j. Thus to estimate the transformation, we must estimate the distribution of x_j, for example, by estimating the CDF and quantile function (for continuous and truncated variables) or the probability of discrete values with positive probability mass (for ordinal and truncated variables). Package gcimpute uses the empirical CDF, quantile function, or discrete probabilities as estimates.

All imputed values are obtained through the estimated f_j, and thus through the empirical quantile estimate of F_j^{-1}. For continuous variables, a linearly interpolated f_j is used, so that the imputation is a weighted average of the observed values. For ordinal variables, the imputation is the most likely observed ordinal level.
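The following sketch illustrates these empirical estimates (our own illustration under the stated assumptions, not the package's internal code): the estimated f_j^{-1} maps an observed value to the latent scale through a slightly shrunk empirical CDF, and the estimated f_j maps a latent Gaussian value back through the empirical quantile function.

import numpy as np
from scipy import stats

def f_inv_hat(x_obs, x):
    # Estimated set-inverse marginal: empirical CDF followed by the standard normal quantile.
    n = len(x_obs)
    ecdf = np.searchsorted(np.sort(x_obs), x, side='right') / (n + 1)
    ecdf = np.clip(ecdf, 1 / (n + 1), n / (n + 1))   # keep away from 0 and 1
    return stats.norm.ppf(ecdf)

def f_hat(x_obs, z):
    # Estimated marginal: standard normal CDF followed by the empirical quantile function.
    return np.quantile(x_obs, stats.norm.cdf(z))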
Suppose you know a parametric form of f_j for the observed data. Can you use this information? Should you use this information? (1) Yes, you can: we include the capacity to do so in our package. (2) No, you probably should not: we have never seen a case in which using the parametric form helps in our simulation studies. For example, for Poisson (count) data with a small mean, most likely values are observed, so treating the data as ordinal works well. For Poisson data with a large mean, the empirical distribution misses certain values, so those values will never appear as imputations. Yet we find that fitting a parametric form instead barely improves performance. We believe that the dangers of model misspecification generally outweigh the advantage of a correctly specified parametric model. Parametric and nonparametric models differ most in their predictions of tail events. Alas, these predictions are never very reliable: it is difficult to correctly extrapolate the tail of a distribution from the bulk (Clauset, Shalizi, and Newman 2009).
Copula correlation estimation. Package gcimpute uses an expectation maximization (EM) algorithm to estimate the copula correlation matrix Σ. Suppose x_1, ..., x_n are n i.i.d. samples from a Gaussian copula model, with observed parts {x_{i,O_i}} for i = 1, ..., n. Denote their corresponding latent Gaussian variables as z_1, ..., z_n. At each E-step, the EM method computes the expected covariance matrix of the latent variables z_i given the observed entries x_{i,O_i}. The M-step finds the MLE for the correlation matrix of z_1, ..., z_n: it updates the model parameter Σ as the correlation matrix associated with the expected covariance matrix computed in the E-step. Each EM step has computational complexity O(np^3) (for dense data).
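The M-step's projection of the expected covariance onto a correlation matrix can be sketched as follows (an illustration of the rescaling step, not the package's internal implementation):

import numpy as np

def covariance_to_correlation(C):
    # Rescale the expected covariance so every diagonal entry equals one,
    # yielding a valid copula correlation matrix.
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)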

Acceleration for large datasets
Package gcimpute runs quickly on large datasets by exploiting parallelism, mini-batch training, and low rank structure to speed up inference. Our EM algorithm parallelizes easily: the most expensive computation, the E-step, is computed as a sum over samples and thus can be easily distributed over multiple cores.
When the number of samples n is large, users can invoke mini-batch training to accelerate inference (Zhao et al. 2022), since a small batch of samples already gives an accurate estimate of the full covariance. This method shuffles the samples, divides them into mini-batches, and uses an online learning algorithm. Concretely, for the t-th mini-batch, gcimpute computes a copula correlation estimate Σ̃_t using only this batch and then updates the model estimate as

Σ_t = (1 − η_t) Σ_{t−1} + η_t Σ̃_t,    (1)

where Σ_t denotes the correlation estimate and η_t ∈ (0, 1) denotes the step size at iteration t. To guarantee convergence, the step sizes {η_t} must be monotonically decreasing and satisfy Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t^2 < ∞. This online EM algorithm converges much faster, as the model is updated more frequently. Zhao et al. (2022) report the mini-batch algorithm can reduce training time by up to 85%.
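The update in Equation 1 is a simple convex combination; a minimal sketch (where Sigma_batch and eta play the roles of Σ̃_t and η_t above) reads:

def online_em_update(Sigma, Sigma_batch, eta):
    # Equation 1: blend the current correlation estimate (a numpy array) with
    # the estimate computed from the latest mini-batch.
    return (1 - eta) * Sigma + eta * Sigma_batch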
When the number of variables p is large, users can invoke a low rank assumption on the covariance to speed up training. The low rank Gaussian copula (LRGC, Zhao and Udell 2020a) assumes a factor model for the latent Gaussian variables:

z = W t + ϵ,   with t ∼ N(0, I_k) and ϵ ∼ N(0, σ^2 I_p),

for some W ∈ R^{p×k} with rank k ≪ p. The k-dimensional t denotes the data generating factors and ϵ denotes random noise. Consequently, the copula correlation matrix has a low rank plus diagonal structure: Σ = W W^⊤ + σ^2 I_p. This factorization decreases the number of parameters from O(p^2) to O(pk) and decreases the per-iteration complexity from O(np^3) to O(npk^2) for dense data. For sparse data, the computation required is linear in the number of observations. Thus, gcimpute can easily fit datasets with thousands of variables (Zhao and Udell 2020a).
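A quick sketch of the low rank plus diagonal structure (illustration only; W and sigma2 below are placeholder values, not fitted parameters):

import numpy as np

rng = np.random.default_rng(0)
p, k = 200, 5
sigma2 = 0.2
W = rng.standard_normal((p, k))
# Rescale rows so that diag(W W^T) + sigma2 = 1, making Sigma a correlation matrix.
W *= np.sqrt((1 - sigma2) / np.sum(W**2, axis=1, keepdims=True))
Sigma = W @ W.T + sigma2 * np.eye(p)     # low rank plus diagonal copula correlation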

Imputation for streaming datasets
Package gcimpute provides an online method to handle imputation in the streaming setting: as new samples arrive, it imputes the missing data immediately and then updates the model parameters. The model update is similar to the offline mini-batch training presented in Equation 1, with the batch estimate Σ̃_t computed from the new samples. Online imputation methods can outperform offline imputation methods for non-stationary data by quickly adapting to a changing distribution, while offline methods are restricted to a single, static model.
Package gcimpute responds to the changing distribution by updating its estimates of the parameters f and Σ after each sample is observed. The marginal estimate only uses the m most recent data points, so the model forgets stale data and the empirical distribution requires constant memory. The hyperparameter m should be chosen to reflect how quickly the distribution changes. A longer window works better when the data distribution is mostly stable but has a few abrupt changes. On the other hand, if the data distribution changes rapidly, a shorter window is needed. The correlation Σ is updated according to the online EM update after observing each new mini-batch, using a constant step size η_t ∈ (0, 1). A constant step size ensures the model keeps learning from new data and forgets stale data.
Streaming datasets may have high autocorrelation, which can improve online imputation. By default, gcimpute imputes missing entries by empirical quantiles of the most recent stored observations. However, it also supports allocating different weights to different stored observations and imputing missing entries by empirical weighted quantiles. Package gcimpute provides an implementation using decaying weights for the m stored observations: weight d^t with d ∈ (0, 1] for each time lag t = 1, ..., m. The decay rate d should be tuned for best performance. This approach interpolates between imputing the last observed value (as d → 0) and the standard Gaussian copula imputation (when d = 1). The user may also supply their own choice of weights.
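A sketch of imputation with decaying weights (illustration only; the helper weighted_quantile below is our own, not a gcimpute function):

import numpy as np

def weighted_quantile(values, weights, q):
    # Empirical weighted quantile: sort the values, accumulate normalized weights,
    # and interpolate at probability level q.
    order = np.argsort(values)
    v, w = np.asarray(values)[order], np.asarray(weights)[order]
    cw = np.cumsum(w) / np.sum(w)
    return np.interp(q, cw, v)

m, d = 10, 0.5
weights = d ** np.arange(1, m + 1)         # weight d^t for time lag t = 1, ..., m
recent_values = np.linspace(1.0, 10.0, m)  # placeholder for the m stored observations
imputed = weighted_quantile(recent_values, weights, q=0.5)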

Software usage
This article demonstrates how to use the Python implementation of gcimpute (an R implementation, gcimputeR, is available at https://github.com/udellgroup/gcimputeR), namely two model classes, GaussianCopula and LowRankGaussianCopula, whose core methods are displayed in Table 3. Our examples rely on some basic Python modules for data manipulation and plotting:

>>> import numpy as np
>>> import pandas as pd
>>> import time
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from tabulate import tabulate

Basic usage
To demonstrate the basic usage of gcimpute, we use demographic data from the 2014 General Social Survey (GSS, https://gss.norc.org/). We consider the variables age (AGE), highest degree (DEGREE), income (RINCOME), subjective class identification (CLASS), satisfaction with work (SATJOB), weeks worked last year (WEEKSWRK), general happiness (HAPPY), and condition of health (HEALTH). All variables are ordinal variables encoded as integers, with varying numbers of ordinal categories. The integers may represent numbers, such as 0, 1, ..., 52 for WEEKSWRK, or ordered categories, such as 1 ("Very happy"), 2 ("Pretty happy"), 3 ("Not too happy") for the question "How would you say things are these days?" (HAPPY). Many missing entries appear due to answers like "Don't know", "No answer", "Not applicable", etc. Variable histograms are plotted in Figure 3. We mask 10% of the observed entries uniformly at random as a test set to evaluate our imputations; a minimal fitting and evaluation workflow is sketched below.
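The sketch below masks the test entries and fits the model; the masking and evaluation code is our own illustration, while GaussianCopula, load_GSS, and fit_transform are the package components introduced above:

>>> from gcimpute.gaussian_copula import GaussianCopula
>>> from gcimpute.helper_data import load_GSS
>>> data_gss = load_GSS()
>>> X = data_gss.to_numpy().astype(float)
>>> # Mask 10% of the observed entries uniformly at random as a test set.
>>> rng = np.random.default_rng(101)
>>> rows, cols = np.where(~np.isnan(X))
>>> test = rng.choice(len(rows), size=int(0.1 * len(rows)), replace=False)
>>> X_masked = X.copy()
>>> X_masked[rows[test], cols[test]] = np.nan
>>> # Fit the Gaussian copula and impute all missing entries.
>>> model = GaussianCopula()
>>> X_imputed = model.fit_transform(X=X_masked)
>>> # Mean absolute error on the held-out test entries.
>>> mae = np.abs(X_imputed[rows[test], cols[test]] - X[rows[test], cols[test]]).mean()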

Monitoring the algorithm fitting
Package gcimpute considers the model to have converged when the model parameters no longer change rapidly: it terminates when ‖Σ_{t+1} − Σ_t‖_F / ‖Σ_t‖_F falls below the specified tol, where Σ_t is the model parameter estimate at the t-th iteration and ‖·‖_F denotes the Frobenius norm. In practice, the default value tol = 0.01 works well and the algorithm converges in fewer than 30 iterations in most cases.
Tracking the objective value may also be useful. The objective value is the marginal likelihood at the observed locations, averaged over all instances. When all variables are continuous, gcimpute computes the exact likelihood. In other cases, gcimpute computes an approximation to the likelihood. The approximation behaves well in most cases, including those with all ordinal variables: it monotonically increases during the fitting process and finally converges. Setting verbose = 1 allows the user to monitor the parameter updates and the objective during fitting.
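For example, assuming tol and verbose are passed at model construction as described above (the values shown are the defaults discussed in this section), convergence monitoring can be enabled as follows:

>>> model = GaussianCopula(tol=0.01, verbose=1)
>>> X_imputed = model.fit_transform(X=X_masked)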

Acceleration for large datasets
In this section, we show how to speed up gcimpute with the acceleration tools described in Section 2.4. To use parallelism with m cores, we call GaussianCopula(n_jobs = m). To use mini-batch training, we set training_mode to "minibatch-offline" in the model call GaussianCopula(). The low rank Gaussian copula is invoked using a different model call, LowRankGaussianCopula(rank = k), with the desired rank k. Mini-batch training for the low rank Gaussian copula is more challenging and remains future work, as the low rank update is nonlinear. Nevertheless, for large n and large p, the parallel low rank Gaussian copula already converges quite rapidly.
Mini-batch training requires a batch size s ≥ p to avoid inverting a singular matrix (Zhao et al. 2022). In practice, it is easy to select s ≥ p, since problems with large p should use LowRankGaussianCopula() instead. The maximum number of iterations matters more for mini-batch methods, because the stochastic fluctuation over mini-batches makes it hard to decide convergence based on the parameter update. Instead of specifying an exact maximum number of iterations, it may be more convenient to select a desired number of complete passes through the data (epochs), i.e., max_iter = (n/s) × num_pass with s as the mini-batch size. Often num_pass = 2 (the default setting) or 3 gives satisfying results; a mini-batch model call is sketched below.
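For instance (illustration only; the batch_size and num_pass values are arbitrary choices, and we assume both are accepted by the GaussianCopula() constructor):

>>> model_minibatch = GaussianCopula(
...     training_mode='minibatch-offline',
...     batch_size=500,      # mini-batch size s; keep s >= p
...     num_pass=2,          # number of complete passes through the data
... )
>>> X_imputed = model_minibatch.fit_transform(X=X_masked)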

Accelerating datasets with many variables: Low rank structure
The low rank Gaussian copula (LRGC) model accelerates convergence by decreasing the number of model parameters. Here we showcase its performance on a subset of the MovieLens 1M dataset (Harper and Konstan 2015): the 400 movies with the most ratings and the users who rated at least 150 of these movies on the scale {1, 2, 3, 4, 5}. That yields a dataset consisting of 914 users and 400 movies with 53.3% of the ratings observed. We further mask 10% of the entries for evaluation.
Here we see that LRGC reduces the runtime by 82% compared to the standard Gaussian copula, even though the number of variables p = 400 is not particularly large. When the number of variables is much larger, the acceleration is even more important. Moreover, LRGC improves the imputation error from 0.613 to 0.577, as shown below.
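A minimal sketch of fitting LRGC on the masked ratings matrix (illustration only; ratings_masked is a placeholder for the masked MovieLens data, rank = 10 is an arbitrary choice, and we assume LowRankGaussianCopula is importable from gcimpute.low_rank_gaussian_copula):

>>> from gcimpute.low_rank_gaussian_copula import LowRankGaussianCopula
>>> model_lrgc = LowRankGaussianCopula(rank=10)
>>> ratings_imputed = model_lrgc.fit_transform(X=ratings_masked)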

Imputation for streaming datasets
Package gcimpute's "minibatch-online" training mode performs streaming imputation: As new samples arrive, it imputes the missing data immediately and then updates the model parameters.We showcase its performance on eight daily recorded economic time series variables from the Federal Reserve Bank of St. Louis (FRED, https://fred.stlouisfed.org/),consisting of 3109 days from 2008-06-03 to 2020-12-31.The selected eight variables are diverse and among the most popular economic variables in FRED: Gold volatility index, stock volatility index, bond spread, dollar index, inflation rate, interest rate, crude oil price, and US dollar to Euro rate, shown in Figure 6.
>>> from gcimpute.helper_data import load_FRED
>>> fred = load_FRED()
>>> fred.plot(
...     subplots = True, layout = (2, 4), figsize = (16, 6),
...     legend = False, title = fred.columns.to_list()
... )

Here we consider a scenario in which some variables are observed as soon as they are generated, while others are observed after a lag of one day. The goal is to predict the unobserved variables each day. We use StockVolatility and CrudeOilPrice as the two unobserved variables. Each day, using a fitted Gaussian copula model, we predict their values based on both their historical values (through the marginal) and the six other variables observed that day (through the copula correlation). After we make our prediction, the actual values are revealed and used to update the Gaussian copula model. Package gcimpute conveniently supports this task. Let us first create a Gaussian copula model to impute streaming datasets (training_mode = "minibatch-online"), shown below. Three hyperparameters control the learning rate of the model: window_size controls the number of recent observations used for marginal estimation; const_stepsize controls the size of the copula correlation update; and batch_size controls the frequency of the copula correlation update. In contrast, decay (the decay rate d in Section 2.5) only controls the imputation and does not influence the model update. Smaller values of decay put less weight on old observations, i.e., forget stale data faster. In economic time series, yesterday's observation often predicts today's value well. We use a small value decay = 0.01, so that the imputation depends most strongly on yesterday's observation, but interpolates over all values in the window. These parameters can be tuned for best performance.
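A sketch of the model creation (the parameter names are those discussed above; the window_size, const_stepsize, and batch_size values are illustrative choices rather than package defaults, and we assume all five parameters are accepted by the GaussianCopula() constructor):

>>> model = GaussianCopula(
...     training_mode='minibatch-online',
...     window_size=200,      # recent observations kept for marginal estimation
...     const_stepsize=0.1,   # constant step size for the correlation update
...     batch_size=5,         # update the correlation every 5 new rows
...     decay=0.01,           # decaying weights used for imputation only
... )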
Next, to conduct the experiment described above, we prepare two data matrices with one row per temporal observation: X for imputing missing entries and X_true for updating the model. We use the first 25 rows to initialize the model.
>>> Xmasked = fred.assign(StockVolatility = np.nan, CrudeOilPrice = np.nan)
>>> Ximp = model.fit_transform(X = Xmasked, X_true = fred, n_train = 25)

More concretely, the Gaussian copula model receives the t-th row of X, imputes its missing entries, and then updates its parameters using the t-th row of X_true. X_true must agree with X at all observed entries in X, but may reveal additional entries that are missing in X. By default, X_true = None, indicating that no entries beyond X are available. In this example, two columns of Xmasked are entirely missing, StockVolatility and CrudeOilPrice, while all other columns are fully observed; all columns in fred are fully observed. We now evaluate the imputation performance and compare against a simple but powerful alternative: yesterday's observation. The predicted series of both methods are almost visually indistinguishable from the true values in Figure 6, but the Gaussian copula predictions perform better on average, with lower mean squared error (MSE), as sketched below.
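A sketch of the MSE comparison (our own evaluation code, not a gcimpute function; it assumes Ximp contains one imputed row per row of fred, and the lag-1 baseline simply predicts yesterday's value):

>>> targets = ['StockVolatility', 'CrudeOilPrice']
>>> truth = fred[targets].iloc[25:]                 # rows after the initialization period
>>> pred_gc = pd.DataFrame(
...     np.asarray(Ximp), columns=fred.columns, index=fred.index
... )[targets].iloc[25:]
>>> pred_lag1 = fred[targets].shift(1).iloc[25:]    # yesterday's observation
>>> mse_gc = ((pred_gc - truth) ** 2).mean()
>>> mse_lag1 = ((pred_lag1 - truth) ** 2).mean()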

Imputation uncertainty
So far we have seen several methods to impute missing data. Package gcimpute also provides functionality to quantify the uncertainty of the imputations: multiple imputation, confidence intervals for a single imputation, and relative reliability for a single imputation. We present the first two notions here, since they are widely used. The third, relative reliability, ranks the imputation quality among all imputed entries, which is well suited for the top-k recommendation task in collaborative filtering (Zhao and Udell 2020a).

Multiple imputation
Multiple imputation creates several imputed copies of the original dataset, each having potentially different imputed values. The uncertainty due to imputation can be propagated into subsequent analyses by analyzing each imputed dataset. See Little and Rubin (2019, Chapter 5.4) for a more detailed introduction to multiple imputation. One common use case for multiple imputation is supervised learning with missing entries: a researcher creates multiple imputed feature datasets, then trains a model on each imputed training feature dataset and predicts with each imputed test feature vector. Finally, they pool all predictions into a single prediction, for example, using the mean or a majority vote. An ensemble model like this often outperforms a single model trained from a single imputation.
We show how to use multiple imputation in gcimpute on a regression task from the UCI repository, the white wine quality dataset (Cortez, Cerdeira, Almeida, Matos, and Reis 2009). This dataset has 11 continuous features and a rating target for 4898 samples. The (transposed) header of the dataset is shown below; a sketch of multiple imputation for this dataset follows.
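The sketch below creates several imputed copies of the wine features; X_wine is a placeholder for the feature matrix with missing entries, and we assume the sample_imputation method listed in Table 3 accepts the number of copies through a num argument:

>>> model_wine = GaussianCopula()
>>> model_wine.fit(X=X_wine)
>>> # Draw 5 imputed copies; each copy fills the missing entries with a
>>> # different sample from the conditional distribution.
>>> X_imp_multi = model_wine.sample_imputation(X=X_wine, num=5)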

Imputation confidence intervals
Confidence intervals (CI) are another important measure of uncertainty. Package gcimpute can return a CI for each imputed value: for example, a 95% CI should contain the true missing value with probability 95%. In general, these CI are not symmetric around the imputed value due to the nonlinear transformation f. We continue to use the white wine dataset for illustration. After fitting the Gaussian copula model, we can obtain the imputation CI as shown below.
>>> ct = model_wine.get_confidence_interval()
>>> upper, lower = ct["upper"], ct["lower"]

By default, the method get_confidence_interval() extracts the imputation CI of the data used to fit the Gaussian copula model, with significance level alpha = 0.05. The empirical coverage of the returned CI is 0.943, as shown below. Hence we see the constructed CI are well calibrated on this dataset. The default setting uses an analytic expression to obtain the CI. As noted in Section 2.2, when some variables are not continuous, a safer approach builds the CI using empirical quantiles computed from multiple imputed values. Let us now construct the quantile CI and compare them with their analytical counterparts. Here we follow the default setting and use 200 samples for computing quantiles. As shown below, the quantile CI has almost the same empirical coverage rate as the analytical CI, validating that the CI are well calibrated.
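The empirical coverage can be computed with a few lines of NumPy (our own evaluation code; mask marks the held-out entries and X_true holds their true values, both hypothetical names):

>>> # Fraction of held-out true values falling inside their confidence intervals.
>>> covered = (lower[mask] <= X_true[mask]) & (X_true[mask] <= upper[mask])
>>> coverage = covered.mean()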

Concluding remarks
Package gcimpute supports a variety of missing data imputation tasks including single imputation, multiple imputation, and imputation confidence intervals, as well as imputation for large datasets and streaming datasets. As a complement to this article, we provide usage vignettes (https://github.com/udellgroup/gcimpute/blob/master/Examples) detailing more specific topics such as troubleshooting, relative reliability for a single imputation, etc.
Although this article focuses on missing data imputation, gcimpute can also be used to fit a Gaussian copula model to complete mixed datasets. The resulting latent correlations may be useful to understand multi-view data collected on the same subjects from different sources. As far as we know, no other software supports Gaussian copula estimation for mixed continuous, binary, ordinal, and truncated variables. Fan et al. (2017) only support continuous and binary mixed data; Feng and Ning (2019) support continuous, binary, and ordinal mixed data; Yoon, Carroll, and Gaynanova (2020) support continuous, binary, and zero-inflated (a special case of truncated) mixed data.
Package gcimpute estimates the model provably well when data is missing uniformly at random (MCAR), and can estimate the copula provably well given the marginals if the data is missing at random (MAR). Adapting the theory to handle data missing not at random (MNAR) is challenging. However, we find empirically that gcimpute still performs reasonably well in this setting. Indeed, many different missing patterns may be called MNAR, and imputation methods designed for one MNAR mechanism do not necessarily outperform on other MNAR data due to this heterogeneity. We advise users to make the choice by evaluating on a validation dataset.

Figure 1: Three monotonic transformations of a Gaussian variable. The third column depicts the transformations that map the data distribution, visualized as both a probability distribution function (histogram approximation) and a cumulative distribution function (analytical form), in the left two columns to the data distribution in the right two columns.

Figure 2: Scatter plots of samples from several 2D Gaussian copula models with different marginals. The data is generated by sampling (z_1, z_2) from a 2D Gaussian distribution with zero mean, unit variance, and 0.65 correlation, and computing x_1 = f_1(z_1) and x_2 = f_2(z_2), where f_1 and f_2 denote the transformations corresponding to the marginals for each model. For Gaussian marginals (1st row and 1st column), the transformation is the identity. For other marginals, the corresponding transformations are plotted in the third column of Figure 1.
The GSS variable histograms in Figure 3 are produced using the following code:

>>> from gcimpute.helper_data import load_GSS
>>> data_gss = load_GSS()
>>> fig, axes = plt.subplots(2, 4, figsize = (12, 6))

Table 3: Overview of the core methods for both the 'GaussianCopula' class and the 'LowRankGaussianCopula' class.

Method | Description
fit | Fit a Gaussian copula from (incomplete) data
transform | Impute incomplete data using a Gaussian copula
fit_transform | Impute incomplete data using the Gaussian copula fitted from the same data
fit_transform_evaluate | Conduct an evaluation on the imputed data returned at each iteration during model fitting
sample_imputation | Sample multiple imputed datasets using a Gaussian copula
get_params | Get parameters of the fitted Gaussian copula
get_vartypes | Get the specified variable types used in model fitting
get_confidence_interval | Get the confidence intervals for the imputed missing entries

Figure 3: Histogram plots for the GSS variables. There are 2538 samples in total.

Figure 4: The estimated latent copula correlations among the GSS variables.

Figure 5: The imputation error for the GSS variables, plotted with respect to the number of iterations run in gcimpute. Satisfying results emerge after four iterations.

Figure 6: Values of the eight selected FRED economic variables from 2008-06-03 to 2020-12-31. For CrudeOilPrice, StockVolatility, and GoldVolatility, the units are US dollars. For BondSpread, the unit is percentage. For other variables, the units are ratios to a baseline (a historical value).
We can specify the type of each variable in model.fit_transform() directly. Otherwise, the default setting works well: it guesses the variable type based on the observed value frequencies.
A variable is treated as continuous if the frequency of its mode is less than 0.1. A variable is treated as lower/upper/two sided truncated if the frequency of its minimum/maximum/both minimum and maximum is more than 0.1 and the distribution, excluding these values, is continuous by the previous rule. All other variables are ordinal. The default threshold value 0.1 works well in general, but can be changed using the parameter min_ord_ratio in the model call GaussianCopula(). For example, let us look at the frequency of the min, max, and mode for each GSS variable.