Simulating Correlated Multivariate Pseudorandom Numbers

A modification of the Kaiser and Dickman (1962) procedure for generating multivariate random numbers with specified intercorrelations is proposed. The modified procedure works with both positive definite and non-positive definite population correlation matrices. A SAS module that implements the procedure is also provided.


INTRODUCTION
A fundamental task of quantitative research is to make probability-based inferences about a population characteristic, θ, based on an estimator, θ̂, computed from a "representative" sample drawn from that population. However, classical parametric inference informs us about the population only to the extent that the necessary mathematical assumptions are met, and violation of any of these assumptions may harm the accuracy of the research conclusions. In regression, for instance, when the slope of the independent variable (x) is statistically significant and BLUE (best linear unbiased estimator), a researcher can clearly anticipate what happens to the dependent variable (y) when x changes by one unit, provided that the usual mathematical assumptions of regression (independence, homogeneity of variance, and normality) are met. If any of these assumptions is violated, inference from ordinary least squares (OLS) may be seriously off the mark (Mooney, 1997).
In the real world, however, there are numerous situations in which the data violate at least one of the mathematical assumptions of the inferential statistic to be used in the analysis.
Therefore, statisticians and researchers alike want to know how violation of the mathematical assumption(s) affects an inferential statistic's power (the probability of correctly rejecting a false null hypothesis) and/or its Type I error rate (the probability of incorrectly rejecting a true null hypothesis). Fortunately, many problems in statistics can be meaningfully investigated through Monte Carlo simulations (Brooks et al., 1999).

Statement of the Problem
Monte Carlo simulation is a computer-based technique that enables one to generate artificial random samples from populations with known characteristics, controlling some variables and manipulating others in order to investigate the effect of the manipulation on the robustness of a statistic. For more details about Monte Carlo simulation, the reader is referred to Brooks et al. (1999) and Mooney (1997). Monte Carlo simulations that require correlated data from normal and nonnormal populations are frequently used to investigate the small-sample properties of competing statistics or to compare estimation techniques (Headrick & Sawilowsky, 1999). The most widely used technique for generating correlated data from a normal population (this paper's focus) is the Kaiser and Dickman (1962) procedure. It uses a component analysis (e.g., principal components, square-root components) of a positive definite population correlation matrix (R) to generate multivariate random numbers with specified intercorrelations.
The procedure depends mainly on the decomposition of R and therefore works only with a positive definite R. This limitation is an obstacle that often faces users, because not all real-world situations yield a positive definite R. Thus, a modification of this procedure (or the development of a new one) is highly desirable.

Purpose of the Study
The study proposes a modification of the Kaiser and Dickman procedure in order to reduce the effect of this limitation. That is, the paper presents techniques that enable users of the Kaiser and Dickman procedure to generate multivariate correlated pseudorandom numbers from a non-positive definite R.

REVIEW OF THE LITERATURE
The importance of the problem of generating correlated pseudorandom numbers arises from the considerable attention that has been given to Monte Carlo studies; such studies have long been of interest to applied statisticians and continue to receive substantial focus in the recent statistical literature.
The first to introduce a method for generating correlated pseudorandom numbers was Hoffman in 1924. He presented a simple technique that can simulate only two correlated variables. Kaiser and Dickman (1962) then extended Hoffman's procedure to m ≥ 2, where m is the number of correlated variables. They proposed a method for generating sample and population score matrices, and sample correlation matrices, from a given R. The procedure utilizes component analysis for generating multivariate random numbers. Its disadvantage is that it depends on a matrix factorization that requires positive definiteness, and this condition frequently does not hold. Fleishman (1978) noted that real-world distributions of variables are typically characterized by their first four moments (i.e., mean, variance, skew, and kurtosis) and presented a procedure for generating normal as well as nonnormal random numbers with these moments specified. The procedure accomplishes this by simply taking a linear combination of a random number drawn from a normal distribution and its square and cube. Tadikamalla (1980) criticized Fleishman's procedure for producing variables whose exact distribution is not known, so that they lack probability-density and cumulative-distribution functions, and which, furthermore, cannot attain all possible combinations of skew and kurtosis. Tadikamalla (1980) proposed five alternative procedures for generating nonnormal random numbers and compared them all with Fleishman's power method for speed of execution, simplicity of implementation, and generality of the nonnormal distributions they can generate. Vale and Maurelli (1983) extended the Fleishman (1978) and Kaiser and Dickman (1962) procedures to the multivariate case.
They proposed a method for generating multivariate normal and nonnormal distributions with specified intercorrelations and marginal means, variances, skews, and kurtoses. However, like the Kaiser and Dickman (1962) procedure, it fails when R is not positive definite. Moreover, the procedure fails to generate the desired intercorrelations when the conditional distributions possess extreme skew and/or heavy tails, even for sample sizes as large as N = 100 (Headrick & Sawilowsky, 1999). Headrick and Sawilowsky (1999) proposed a method that combines a generalization of Theorems 1 and 2 of Knapp and Swoyer (1967) with the Fleishman (1978) procedure.
Their procedure generates multivariate nonnormal distributions whose average intercorrelations are closer to the population parameters for skewed and/or heavy-tailed distributions and for small sample sizes. Although the procedure eliminates the need to factor the R underlying the random deviates, it still requires R to be positive definite.
In brief, all procedures that can be used to generate multivariate pseudorandom numbers, except Fleishman's (1978), require, directly or indirectly, a positive definite R and do not work without it. This study presents a modified technique that minimizes the reliance on the positive definiteness condition for R.
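For concreteness, the Fleishman power method reviewed above takes Y = a + bZ + cZ² + dZ³ with Z standard normal and a = −c, where b, c, d solve a nonlinear moment-matching system. The sketch below solves that system numerically; the solver and starting values are our illustrative choices, not Fleishman's published tables.

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_coeffs(skew, ex_kurt):
    """Solve Fleishman's (1978) moment-matching system for b, c, d
    (with a = -c) so that Y = a + b*Z + c*Z**2 + d*Z**3 has mean 0,
    variance 1, and the requested skew and excess kurtosis, Z ~ N(0, 1)."""
    def system(p):
        b, c, d = p
        return [
            b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,                       # variance
            2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew,                 # 3rd moment
            24*(b*d + c**2*(1 + b**2 + 28*b*d)
                + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - ex_kurt, # 4th moment
        ]
    b, c, d = fsolve(system, [1.0, 0.0, 0.0])
    return -c, b, c, d  # a = -c keeps the mean at zero

# Sanity check: zero skew and zero excess kurtosis should recover Y = Z.
a, b, c, d = fleishman_coeffs(0.0, 0.0)
print(round(b, 6), round(c, 6), round(d, 6))  # -> 1.0 0.0 0.0
```

Not every (skew, kurtosis) pair is attainable, which is part of Tadikamalla's (1980) criticism noted above.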

3-1 The Kaiser and Dickman Procedure
The Kaiser and Dickman (1962) procedure is generally applied to generate multivariate normal random numbers and relies on a matrix decomposition. A Cholesky factorization (or any factorization, for that matter) is performed on the R that is to underlie the random numbers. To generate a multivariate random number, one random number is generated for each component, and each random variable is defined as the sum of the products of the variable's component loadings and the random number corresponding to each component. When the input random numbers are normal, the resulting random numbers are multivariate normal, with population intercorrelations equal to those of the matrix originally decomposed.
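In matrix terms, the procedure above amounts to multiplying independent standard normal deviates by a Cholesky factor of R. A minimal NumPy sketch (an illustration, not the paper's SAS module):

```python
import numpy as np

def kaiser_dickman(R, n, seed=None):
    """Generate n rows from a multivariate normal with correlation matrix R.

    Requires R to be positive definite; np.linalg.cholesky fails otherwise."""
    rng = np.random.default_rng(seed)
    U = np.linalg.cholesky(R).T               # R = U'U, U upper triangular
    Z = rng.standard_normal((n, R.shape[0]))  # independent N(0, 1) deviates
    return Z @ U                              # rows have population corr R

R = np.array([[1.0, 0.6],
              [0.6, 1.0]])
X = kaiser_dickman(R, 100_000, seed=0)
print(np.corrcoef(X, rowvar=False).round(2))  # sample correlation near 0.6
```

With a large n the sample correlation matrix of X closely matches the R that was decomposed, which is the property the review above describes.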

3-2 The Modified Procedure
The modified procedure is also used to generate multivariate normal pseudorandom numbers utilizing R. However, unlike other procedures, it works with a positive as well as a nonpositive definite R.
To generate a correlated multivariate data matrix with a specified R, we rewrite R as a block matrix,

R = [ A   B
      B′  C ],

where A (q × q) and C (k × k) are the correlation matrices of Y and X, respectively, and B (q × k) is the matrix of correlations between Y and X. The partition is chosen deliberately to ensure the positive definiteness of both A and C. This is a necessary condition (i.e., the method does not work without it).
Next, one of the following techniques may be applied.
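Positive definiteness of R and of its diagonal blocks can be checked via eigenvalues. The 3 × 3 matrix below is a hypothetical example (our construction, not one of the paper's examples) of a non-positive definite R whose diagonal blocks A and C are nevertheless positive definite:

```python
import numpy as np

def is_positive_definite(M, tol=1e-10):
    """A symmetric matrix is positive definite iff all eigenvalues are > 0."""
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

# Hypothetical R: y correlates 0.9 with x1 and x2, but x1 and x2 are
# uncorrelated -- an impossible (non-positive definite) combination.
R = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, 0.0],
              [0.9, 0.0, 1.0]])
A = R[:1, :1]   # correlation matrix of the y block
C = R[1:, 1:]   # correlation matrix of the x block

print(is_positive_definite(R))                         # False
print(is_positive_definite(A), is_positive_definite(C))  # True True
```

Because R itself cannot be Cholesky-factored, the original Kaiser and Dickman procedure is unusable here, while the blockwise requirement of the modified procedure is satisfied.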

3-2-1 When A is q × q and q = 1
This technique is applied to generate multivariate pseudorandom numbers when R can be partitioned into three matrices: A (1 × 1), B (1 × k), and C (k × k), where k is the number of x's and k > 1. The technique is performed as follows. A total of k correlation matrices, each containing the correlation between y and one x individually, are created:

B_i = [ 1       ρ_yxi
        ρ_yxi   1     ],  ∀ i = 1, 2, …, k.

Then, a multivariate normal data matrix X (n × k) is generated through the equation X = µ + ZU, where µ is the vector containing the variable means, Z (n × k) contains vectors of independent standard normal variates, and U is the Cholesky upper triangular matrix of C (i.e., C = U′U). The factorization is known as the Cholesky factorization.
A data vector y (n × 1) of standard normal variates is generated and concatenated horizontally with each column of the matrix X individually, T_i = [ y  X_i ] (i = 1, 2, …, k). This gives us a total of k matrices, each of size (n × 2).
A total of k vectors Z_i of size (n × 1) are generated independently through the equation Z_i = T_i U_i,c2, where U_i,c2 is the 2nd column of the Cholesky upper triangular matrix of the B_i created in the beginning (so that Z_i = ρ_yxi y + √(1 − ρ²_yxi) X_i). The vector y is then concatenated horizontally with the Z_i, ∀ i = 1, 2, …, k, to create a data matrix of size (n × (k + 1)): D_n×(k+1) = [ y  Z_1  Z_2 …  Z_k ].
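Putting the steps of this technique together, a NumPy sketch (assuming zero means and unit variances, in place of the paper's SAS module):

```python
import numpy as np

def modified_kd_q1(rho_yx, C, n, seed=None):
    """One y and k x's, standardized variables.

    rho_yx : desired correlations between y and each x (length k).
    C      : k x k positive definite working correlation matrix of the x's.
    Returns D = [y, Z_1, ..., Z_k] of size n x (k + 1)."""
    rng = np.random.default_rng(seed)
    rho = np.asarray(rho_yx, dtype=float)
    X = rng.standard_normal((n, len(rho))) @ np.linalg.cholesky(C).T  # corr ~ C
    y = rng.standard_normal(n)                                        # indep. of X
    # Z_i = T_i times the 2nd Cholesky column of B_i:
    #     Z_i = rho_i * y + sqrt(1 - rho_i**2) * X_i
    Z = y[:, None] * rho + X * np.sqrt(1.0 - rho**2)
    return np.column_stack([y, Z])

D = modified_kd_q1([0.9, 0.9], np.eye(2), 100_000, seed=1)
print(np.corrcoef(D, rowvar=False).round(2))  # corr(y, Z_i) near 0.9
```

Note that corr(Z_1, Z_2) comes out near ρ_1ρ_2 + √(1 − ρ_1²)√(1 − ρ_2²)ρ_x (about 0.81 here) rather than the value supplied in C, which is exactly why the procedure adjusts the working correlation among the x's and repeats.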
The resulting data matrix has a correlation matrix that most likely differs from the given population matrix R, and the change takes place mainly in C. Thus, to obtain a data matrix whose correlation matrix is close to the desired R (i.e., with the average intercorrelation among the x's near the target), we manipulate the working correlation value among the x's and repeat the process.
The average intercorrelation among the x's is assessed using the measure proposed by Kaiser (1968), γ = (λ − 1)/(O − 1), where λ is the largest eigenvalue of the correlation matrix and O is the number of variables (which is k when measuring the average intercorrelation among the x's). The larger the value of γ, the greater the correlation: γ = 0 when there is no correlation, and γ = 1 when the correlations among the variables in the set all equal one. Accordingly, one should expect γ to take values near ρ_x.
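Kaiser's index is straightforward to compute from the largest eigenvalue. A small sketch, using the fact that an equicorrelated matrix with common correlation ρ has largest eigenvalue 1 + (O − 1)ρ, so γ recovers ρ exactly:

```python
import numpy as np

def kaiser_gamma(R):
    """Kaiser's (1968) index of average intercorrelation:
    gamma = (lambda_max - 1) / (O - 1), where lambda_max is the largest
    eigenvalue of the correlation matrix R and O its number of variables."""
    lam = np.linalg.eigvalsh(R).max()
    return (lam - 1.0) / (R.shape[0] - 1)

# If every off-diagonal correlation equals 0.84, gamma returns 0.84:
R = np.full((4, 4), 0.84)
np.fill_diagonal(R, 1.0)
print(round(kaiser_gamma(R), 2))  # -> 0.84
```

For the identity matrix (no correlation) γ = 0, and for a matrix of all ones γ = 1, matching the bounds described above.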

Simulation Example 1
Suppose a data matrix of size (20 × 5), one y and four x's, with specified intercorrelations needs to be generated, and suppose the population correlation matrix R (given in Table 1 as R_1) is not positive definite. The Kaiser and Dickman procedure cannot be used here to generate the required data. Thus, the new procedure is implemented instead (the SAS module in Appendix A can be utilized to simulate the data using the modified procedure). Table 1 shows the correlation matrix (R_i) used in the procedure and the correlation matrix of the simulated data (Re_i) at attempt i. At the second attempt (when the correlation among the x's was replaced by 0.84 in the original R), we obtained data with a correlation matrix (Re_2) closer to the original R (which equals R_1) than those of the first (Re_1) and third (Re_3) attempts.
Notice that the Re_i in the table are of full rank (i.e., Rank(Re_i) = 5) and are always compared with the original R; R_2, R_3, etc. are only used within the method to simulate the needed data.
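The attempt-and-compare loop behind Table 1 can be sketched as follows. The equicorrelated working matrix and the Frobenius-norm distance are our illustrative choices, not taken from the paper:

```python
import numpy as np

def simulate(rho_y, rho_x, n, rng):
    """One y and k x's; the x's share a common working intercorrelation rho_x."""
    rho_y = np.asarray(rho_y, dtype=float)
    k = len(rho_y)
    C = np.full((k, k), rho_x)
    np.fill_diagonal(C, 1.0)
    X = rng.standard_normal((n, k)) @ np.linalg.cholesky(C).T
    y = rng.standard_normal(n)
    return np.column_stack([y, y[:, None] * rho_y + X * np.sqrt(1 - rho_y**2)])

def best_rho_x(R_target, rho_y, candidates, n=5000, seed=0):
    """Keep the working rho_x whose empirical correlation matrix lands
    closest (Frobenius norm) to the target R."""
    rng = np.random.default_rng(seed)
    dist = {r: np.linalg.norm(np.corrcoef(simulate(rho_y, r, n, rng),
                                          rowvar=False) - R_target)
            for r in candidates}
    return min(dist, key=dist.get)

# Target: corr(y, x_i) = 0.5 and corr(x_1, x_2) = 0.8.
R_target = np.array([[1.0, 0.5, 0.5],
                     [0.5, 1.0, 0.8],
                     [0.5, 0.8, 1.0]])
print(best_rho_x(R_target, [0.5, 0.5], [0.2, 0.45, 0.73, 0.9]))
```

The winning working value exceeds the target intercorrelation, mirroring the paper's example, where 0.84 had to be substituted into R to reproduce the desired correlations.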

TABLE 1 The Theoretical and Empirical Correlation Values
The correlation matrix used (R_i)    The correlation matrix obtained (Re_i)

3-2-2 When A is q × q and q ≥ 2
This technique is applied when R can be partitioned into three matrices: A (q × q), B (q × k), and C (k × k), where q > 1 and k > 1 are the numbers of y's and x's, respectively. A total of (q × k) correlation matrices, each containing the correlation between one y and one x, are created:

B_ij = [ 1         ρ_yi,xj
         ρ_yi,xj   1       ],  ∀ i = 1, 2, …, q and j = 1, 2, …, k.
Then a multivariate normal data matrix X (n × k) is generated through the equation X = µ + ZU, where µ is the vector containing the variable means, Z contains vectors of independent standard normal variates, and U is the Cholesky upper triangular matrix of C.

Similarly, a multivariate normal data matrix Y (n × q) is generated using the Cholesky upper triangular matrix of A (q × q). Next, each column of the matrix Y is concatenated horizontally with each column of the matrix X individually (i.e., T_ij = [ Y_i  X_j ] ∀ i = 1, 2, …, q and j = 1, 2, …, k), which gives a total of (q × k) matrices, each of size (n × 2). A total of (q × k) vectors Z_ij of size (n × 1) are then generated through the equation Z_ij = T_ij U_ij,c2, where U_ij,c2 is the 2nd column of the Cholesky upper triangular matrix of B_ij, and are used to assemble the required data matrix D of size (n × (q + k)). As in the first case, the resulting data matrix has a correlation matrix that most likely differs from the population matrix, and the change, once again, takes place mainly in C. Thus, to obtain a data matrix whose correlation matrix is close to the desired population correlation matrix (i.e., with the average intercorrelation among the x's near the target), we manipulate the working correlation value among the x's and repeat the process.

Simulation Example 2
Suppose a data matrix with three y's and three x's and specified intercorrelations needs to be generated, and suppose the provided R is not positive definite. Consequently, the modified procedure is implemented as an alternative (the SAS module in Appendix A can also be utilized to simulate the data using the modified procedure). Table 2 shows the correlation matrix (R_i) used in the procedure and the correlation matrix of the simulated data (Re_i) at attempt i. At the third attempt (when the correlation among the x's was replaced by 0.84 in the original population correlation matrix), we obtained data with a correlation matrix (Re_3) closer to the original population correlation matrix (R_1) than those of the first (Re_1) and second (Re_2) attempts.

TABLE 2 The Theoretical and Empirical Correlation Values
The correlation matrix used (R_i)    The correlation matrix obtained (Re_i); variables y_1, y_2, y_3, x_1, x_2, x_3

CONCLUSION
The paper has suggested a modification to the Kaiser and Dickman (1962) procedure that allows correlated multivariate pseudorandom numbers to be generated even when the population correlation matrix R is not positive definite, provided that the diagonal blocks of R (the correlation matrices of the y's and of the x's) are themselves positive definite. A SAS module implementing the modified procedure is provided in Appendix A.