OrdNor: An R Package for Concurrent Generation of Correlated Ordinal and Normal Data

In this article, operational details of the R package OrdNor that is designed for the concurrent generation of correlated ordinal and normal data are described, and examples of some important functions are given. The package provides needed tools that have been lacking for generating multivariate data with a mixture of ordinal and normal components.


Introduction
Scientific investigation typically involves gathering variables measured on different scales simultaneously. The data collected may be related to interdependent outcomes that are simultaneously targeted by a study, or a research project may be assessing the impact of correlated predictors or risk factors acting together to produce two or more correlated outcomes. For example, different levels of dietary intervention can affect multiple health indicators including BMI (body mass index), blood serum cholesterol level etc., which are inherently correlated. Yet another way by which correlated variables of different types can arise is when conducting longitudinal studies. In such studies, the response variable and time varying covariates are measured within a subject, which induces correlation among the observations. Sophisticated analytical methods are increasingly used to analyze correlated data (Agresti 2002). In order to evaluate and compare the methods for parameter estimation and hypotheses testing in the context of correlated data, authors often resort to Monte Carlo simulation. These simulation studies require a mechanism to generate a set of artificial data that mimic the marginal distribution and correlation structure of data sets encountered in real studies.
Major contributions in random number generation have been focused on efficient generation of samples for either univariate or multivariate distributions of one particular type. There have been several theoretical developments and software implementations in this regard (Em-rich and Piedmonte 1991;Lee 1993;Park, Park, and Shin 1996;Gange 1995;Oman and Zucker 2001;Qaqish 2003;Biswas 2004;Demirtas 2006;Yahav and Shmueli 2012;Ferrari and Barbiero 2012). However, there has been only sporadic interest in generation of multivariate data with mixed data types. Examples of such efforts are found in Demirtas and Doganay (2012) for generating mixture of correlated binary and normal variables, and in Ruscio and Kaczetow (2008) for generating multivariate non-normal data via a sample and iterate (SI) algorithm. We further the effort in mixed data generation by elaborating on an easy-to-use R package that provides functions for concurrent generation of ordinal and normal variables with a given marginal distribution and correlation structure.
The organization of the rest of the article is as follows. In Section 2, we outline the algorithm implemented for the simultaneous generation of correlated ordinal and normal variables. In Section 3, we describe the operational details of the R package OrdNor. In Section 4, we present an example and evaluate the performance of the package. In Section 5, we present a run-time comparison between proposed and alternative algorithm. We conclude the paper with a discussion in Section 6.

Prerequisites
Multivariate normal and multivariate ordinal data generation with underlying normal distribution is well-studied. When it comes to joint generation of normal and ordinal variables, the key is to establish a connection between point-polyserial and polyserial correlations. Once such relationship is established, one can concurrently generate a set of ordinal and normal variables in a unified manner given marginal proportions and a set of correlations. Specific algorithmic steps are given below.
Let Y 1 , Y 2 , . . . , Y j be a set of ordinal variables with proportion vectors p 1 , p 2 , . . . , p j and W l ∼ N (µ l , σ 2 l ), where l = 1, 2, . . . , k. The (j + k) × (j + k) correlation matrix is Σ. Without loss of generality, assume that variables are in a certain order, i.e., the data consist of Y 1 , . . . , Y j , W 1 , . . . , W k . Then, Σ is comprised of three components: Σ OO , Σ ON , and Σ N N , where O and N correspond to ordinal and normal, respectively. In this setup, Σ OO is a j × j submatrix and Σ N N is a k × k submatrix of Σ that stand for the correlations between the ordinal-ordinal and normal-normal combinations, respectively. Similarly, Σ ON represents a j ×k submatrix whose elements are the correlations between ordinal and normal variables. Required parameter values are either specified or estimated from a real data set that is to be mimicked.
2. Check if Σ is positive definite. With a linear transformation in Step 1, correlations will remain unchanged.
3. Check if the elements of Σ OO are within limits. The lower and upper bounds of valid correlation values are found by the generate, sort, and correlate (GSC) algorithm (Demirtas and Hedeker 2011) whose steps are given below.
(a) Generate random samples from the intended distributions independently using a large number of observations (e.g., N = 100, 000). Here, the intended distributions are two univariate multinomial distributions.
(b) To estimate the lower bound, sort the two variables in opposite directions (i.e., from smallest to largest for one of the variables, and from largest to smallest for the other). Then, compute the sample correlation.
(c) To estimate the upper bound, sort the two variables in the same direction, and compute the sample correlation.
4. Check if the elements of Σ ON are within limits using a GSC algorithm. Here, the intended distributions are multinomial and standard normal distributions.
5. Compute the polychoric correlations for the OO combinations using an iterative procedure as described by Ferrari and Barbiero (2012). The steps involved in the computation are as follows.
is the target value of a pairwise correlation between two ordinal variables.
(b) Generate X and Y from the bivariate normal distribution N 2 (0, R OO ) with a large number of data points, where (c) Transform X and Y into ordinal variable X * O and Y * O by discretization based on the target marginal probabilities.
and go back to Step 5b. Otherwise, δ + X O ,Y O is the intermediate correlation.

Repeat
Step 5 for each pair of ordinal variables.
7. Construct a correlation matrix Σ + OO for all ordinal components of the multivariate distribution. Σ + OO is a symmetric submatrix whose diagonal elements are 1.
8. Compute the polyserial correlations for the ON combinations as follows.
(a) Generate X and Y independently with a large number of data points, where X and Y are independent standard normal variables.
(b) Transform X into an ordinal variable X O by discretization based on the target marginal probabilities.
(c) Estimate the upper bound for correlation max(COR(X O , Y )) using GSC; that is sort the two variables in the same direction, and compute the sample correlation.
is the target value of the pairwise correlation between an ordinal and a normal variable. δ X,Y is an intermediate pairwise correlation between two standard normal variables, such that ordinalization of one variable of the pair is expected to result in the target value of pairwise correlation of an ordinal-normal pair.

Repeat
Step 8 for each pair of ordinal and normal variables.
10. Construct a correlation submatrix Σ + ON for all combinations of ordinal and normal components of the multivariate distribution. This is a submatrix of the overall correlation matrix, and it is pertinent to the ordinal-normal part. Hence, the matrix may or may not be square.
12. Check if Σ + is positive definite. If it is not, compute the nearest positive definite correlation matrix.
15. Go back to the original scale for normal variables by reverse centering and scaling, W l = Z l × σ l + µ l , l = 1, 2, . . . , k.

The OrdNor package
The OrdNor package (Amatya and Demirtas 2015) provides functions for concurrently generating correlated ordinal and normal data as described above. The package is available via the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package= OrdNor. Once the package has been installed, the results given in this paper are reproducible. The package contains a data generating function genOrdNor that generates the data set with mixed ordinal and normal variables with pre-specified marginal distribution and correlation structure. The data generating function is supported by three core functions IntermediateON, IntermediateOO, and cmat.star. The remaining three auxiliary functions valid.limits, validate.target.cormat, and validate.plist provide necessary protection against erroneous user inputs. The valid.limits function is a wrapper that encompasses the functions Limit_forON and Limit_forOO. Throughout the article, we use the following substitutes given in Table 1.
The following is a summary of the functions contained in the package. Their functionality, in the context of correlated ordinal and normal data generation, is described in the next three subsections.
genOrdNor: Generates a data set with ordinal and normal variables.

Data generating function
The genOrdNor function generates the mixed multivariate data with given marginal probabilities and correlations (Steps 13 to 15). It takes n, plist, cmat.star, mean.vec, sd.vec, no.ord and no.norm as input arguments. The validate.target.cormat function is called to insure input arguments are correctly specified. Next, the cmat.star function is called and its output is used to generate the intermediate multivariate normal data. It is accomplished through the rmvnorm function from the mvtnorm package (Genz, Bretz, Miwa, Mi, Leisch, Scheipl, and Hothorn 2015;Genz and Bretz 2009). Finally, the ordinalize function transforms the first no.ord components of the multivariate normal data to multivariate ordinal data (Step 14). The remaining no.norm components are scaled and shifted to match the marginal distributions of the normal variables.

Core functions
The  The cmat.star function wraps the two functions above, takes CorrMat, plist, no.ord and no.norm as input arguments and returns the complete matrix of the intermediate pairwise correlations, i.e., Σ + . The function utilizes the ordcont function from the GenOrd package (Barbiero and Ferrari 2015), the nearPD function from the Matrix package (Bates and Maechler 2015) and the is.positive.definite function from the corpcor package (Schäfer, Opgen-Rhein, Zuber, Ahdesmäki, Duarte Silva, and Strimmer 2015) to accomplish various operations. In case Σ + is non-positive definite, it returns the positive definite matrix nearest to the Σ + as described in Step 12.

Auxiliary functions
Trivial specification problems are captured by three validation functions. These are auxiliary functions that are called by the core functions. The core functions expect marginal probabilities of ordinal variables to be specified in the form of cumulative probabilities. The consequence of specifying cumulative probabilities is that the cumulative probability of the last category is not explicitly provided but is implied, i.e., 1. If there are any skipped categories (i.e., a category with 0 probability), the consecutive entries in the corresponding probability vectors will be the same. The validate.plist function checks whether plist meets these requirements.  (marginal,Sigma1,no.ord,no.norm) Error in validate.target.cormat (marginal,Sigma1,no.ord,no.norm Range violation occurred in the target correlation matrix.
The ordinalize function discretizes the standard normal variable into categories defined by the cumulative marginal probability values in pvec. It is used by the other three core functions and the data generating function to transform standard normal variables to ordinal variables.

Example
This section demonstrates the performance of the package via an example. A data set composed of three ordinal (Y 1 , Y 2 , Y 3 ) and four normal variables (W 1 , W 2 , W 3 , W 4 ) with specific marginal distributions and correlation structure (true values are given in Tables 3 and 4), and a sample size of 2000 is generated. The ordinal components were assumed to have two, three and four categories, respectively. Empirical cumulative marginal probabilities for each variable and correlations between each pair are calculated from the generated data. Then the simulation was repeated 1000 times. Th set of cumulative marginal probabilities specified for this example is listed in Table 2. For example, 0.7 in the second column (P Y 2 ) is the cumulative probability of Y 2 to fall in either the first or the second category of that variable, which implies that the probability of the second category is 0.4 (i.e., 0.7−0.3) and of the third category is 0.3. Numbers within parentheses are the empirical 95% confidence intervals (CIs) of the generated ordinal data corresponding to the true cumulative probability of categories of each variable. For instance, the proportion of generated Y 2 that fell in the first or the second category was between (0.680, 0.721) for 95% of the simulated data. Obviously, as n gets larger, the confidence intervals are expected to become narrower.
The correlation structure specified for the seven variables in this example is shown in Table 4. The numbers within the parentheses are the 95% confidence limits of the corresponding correlations. For instance, a correlation of 0.18 is desired between Y 1 and Y 2 . In the simulated data, the empirical correlation between Y 1 and Y 2 was within (0.139, 0.225) for 95% of the simulations.
Figures 1 and 2 demonstrate distributions of empirical marginal probabilities and correlations, respectively. For brevity, we only present figures for the ordinal-ordinal and ordinal-normal pairs. We observe that 98% of all the simulated data have empirical probabilities within 10% of the true probabilities; and 99% of all the simulated data have empirical correlations within 10% of the true correlations. The close resemblance between the specified and empirically computed quantities on average suggests that the procedure works properly.

A comparison of run-time with the SI algorithm
The method proposed by Ruscio and Kaczetow (2008) is also capable of generating simulated data with ordinal and normal components. The algorithm provides an alternative framework to generate data with various combinations of mixed variable types. The method identifies intermediate correlation matrix through an iterative, trial-and-error process.
For the purpose of run-time comparison between OrdNor and the SI algorithm, we utilize the R code (GenData) that is available online at http://www.tcnj.edu/~ruscio/taxometrics. html#NonnormalData.
We generate 15 simulated data sets using each algorithm. We use the same parametric values that are specified in Section 4. Each data set is composed of 3 ordinal variables and 4 normal      variables. We compare the performance of each algorithm for four different sample sizes.
The results of the comparison are presented in Table 5. The results show that OrdNor can be substantially faster for larger data sets. However, for smaller data sets, GenData runs as fast or faster than OrdNor. The comparative advantage in speed depends on the number of variables, the sample size, and the number of data sets being generated. GenData seems to run faster when generating a single data set with a small sample size and a large number of variables. However, OrdNor seems to be advantageous when generating large number of sample data sets with the same population parameter values.  Table 5: Run-time comparison between OrdNor and GenData for generating 10 simulated data sets.

Discussion
The paper described the OrdNor package which is a useful tool for generating correlated mixed multivariate ordinal and normal data with specified marginal distributions and correlation structure. The underlying idea of the procedure is to generate intermediate multivariate normal data whose components after discretization will give the ordinal components of the multivariate ordinal-normal data with desired specifications. Overall, the performance of the algorithm is very satisfactory in that deviations are within tolerable limits of errors. In general, empirical estimates of parameters based on multivariate ordinal-normal data generated using the package are within 10% of the true parameter values. Larger number of rows would further diminish already minor differences. Our simulation studies (not shown for brevity) suggest that the package successfully implements the algorithm across many real-life scenarios as long as there are no specification errors and correlations are within the feasible limits (Demirtas and Hedeker 2011). In case such complications occur, appropriate warning/error messages will be generated to alert the user. In summary, the OrdNor package provides an efficient implementation of the concurrent ordinal and normal data generation algorithm. The package provides a valuable tool for investigators who need a mixed data generation mechanism for their research.