AssocTests : An R Package for Genetic Association Studies

The R package AssocTests provides some procedures which are commonly used in genetic association studies. These procedures are population stratiﬁcation correction through eigenvectors, principal coordinates of clusterings, Tracy-Widom test, distance regression, single-marker test, maximum test based on three Cochran-Armitage trend tests, non-parametric trend test, and non-parametric maximum test. The trait values for these methods should be discrete or continuous. The discrete traits can be coded by 1/0 for cases/controls. The genotype values can be 0, 1, or 2 indicating the number of risk alleles for a biallelic single-nucleotide polymorphism. This article introduces the methods and algorithms implemented in the package. Some examples are provided to illustrate the package’s capability.


Introduction
Genetic association study has become a popular tool to identify the genetic variants predisposing to human complex diseases (Klein et al. 2005;Sladek et al. 2007;Burton et al. 2007;Ardlie et al. 2015). Currently, more than 10,000 deleterious single-nucleotide polymorphisms (SNPs) have been identified (http://www.genome.gov/gwastudies/). Two important issues are often considered in population-based genetic association studies. One issue is the adjustment for confounders in which the population stratification (PS) is noteworthy, and the other is the adoption of powerful tests. Single-marker analysis is crucial in many gene-or pathwaybased procedures, such as the truncated p value combination method (Zaykin, Zhivotovsky, Westfall, and Weir 2002;Yu et al. 2009) and the minimum p value approach (Hoh, Wille, and Ott 2001;Dudbridge and Koeleman 2004). In the single-marker analysis, investigators have and dominant models, has been used to identify the genetic variants associated with type II diabetes (Sladek et al. 2007). The MAX3 test has been included in the SAS JMP Genomics Software (SAS Institute Inc. 2008). However, it was based on the results of Freidlin, Zheng, Li, and Gastwirth (2002) and did not support adjustments for the covariates. Li, Zheng, Li, and Yu (2008) employed the generalized equation to obtain the MAX3 test, which could support adjustments for the covariates.
Finally, the linear regression model is a classical approach to evaluate the association between genetic variants and a quantitative trait when the quantitative trait variable follows a normal distribution. However, if the normal assumption for the quantitative trait variable does not hold, non-parametric tests such as the Kruskal-Wallis test (Kruskal and Wallis 1952) and the Jonckheere-Terpstra test (Jonckheere 1954;Terpstra 1952), are preferred. Recently, Zhang and Li (2015) proposed a non-parametric trend test (NPT) considering the genetic mode of inheritance and showed that it is more powerful than the Kruskal-Wallis test and the Jonckheere-Terpstra test. They also provided a robust non-parametric maximum test (NMAX3), which is free from the genetic models.
In this article, we introduced a new R (R Core Team 2020) package, AssocTests (Wang, Zhang, Li, and Zhu 2020), which provides some procedures focusing on genetic association studies. The package implements the following methods: population stratification correction through eigenvectors (EIGENSTRAT; Price et al. 2006), PCOC, TW test, DR, single-marker test, MAX3, NPT, and NMAX3. The trait values for these methods should be discrete or continuous. The discrete traits can be coded by 1/0 for cases/controls. The genotype values can be 0, 1, or 2 indicating the number of risk alleles for a biallelic SNP. Many packages for genetic association studies are reported. Some packages such as GenABEL (Aulchenko 2013), pbatR (Hoffmann 2018), and snpMatrix (Clayton and Leung 2008), provide functions to perform genome-wide association studies. The function egscore() in the package GenABEL could perform EIGENSTRAT, which is also involved in our package AssocTests (Wang et al. 2020). Some packages such as gap (Zhao 2007), tdthap (Clayton 2013), and Rassoc (Zang, Fung, and Zheng 2010), provide functions to test the association between individual genetic markers and a phenotype. gap supports the genetic data analysis of both population and family data, tdthap is designed for the transmission/disequilibrium tests for extended marker haplotypes, and Rassoc provides functions to perform robust tests for case/control genetic association studies. However, the procedures PCOC, TW test, DR with adjustments for the covariates, MAX3 test with adjustments for the covariates, NPT, and NMAX3 in our package AssocTests are not included in any of these packages. Package AssocTests is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package= AssocTests.
This paper is organized as follows. Section 2 summarizes the methods from which the As-socTests package was developed and also describes the functions in the package. Section 3 illustrates the capabilities of AssocTests by using some simulation data sets. Section 4 provides a real data example related to the PS. Finally, Section 5 concludes the paper.

The R package AssocTests
The procedures provided in the R package AssocTests are eigenstrat(), pcoc(), tw(), dr(), smt(), max3(), npt(), and nmax3(). The eigenstrat() procedure was developed to correct for the PS in genetic association studies by searching for some "top" eigenvectors (Price et al. 2006). The pcoc() procedure could correct for the PS by identifying the clustering and continuous patterns of the genetic variation . The tw() procedure is based on the TW test and could evaluate the significant eigenvalues of a matrix (Tracy and Widom 1994). The dr() procedure is based on the DR and could detect the association between gene patterns and some independent variants of interest with or without adjustments for the covariates ). The smt() procedure is an implementation of the single-marker analysis used to identify the association between a genotype and a trait (Hoh and Ott 2003;Marchini, Donnelly, and Cardon 2005). The max3() procedure is a robust test to identify the association between a SNP and a binary phenotype with or without adjustments for the covariates (Sladek et al. 2007;. Finally, the npt() and nmax3() procedures perform the NPT and the robust NMAX3, which are against the normal assumption and the genetic uncertainty , respectively.

Function index
The function index of the package AssocTests is listed as follows: • eigenstrat(): EIGENSTRAT for correcting for the PS.
• pcoc(): Principal coordinates of clusterings for correcting for the PS.
• max3(): Maximum test based on the maximum value of the three Cochran-Armitage trend tests under the recessive, additive, and dominant models.
• npt(): Non-parametric trend test based on the non-parametric risk under a given genetic model.
• nmax3(): NMAX3 test based on the maximum value of the three NPTs under the recessive, additive, and dominant models.
More details about them are described below.

Function eigenstrat()
The EIGENSTRAT for detecting and correcting for the PS provides the test through searching for the eigenvectors of the similarity matrix among the subjects in population-based genetic association studies (Price et al. 2006). The function eigenstrat() calculates the top eigenvectors or the eigenvectors with significant eigenvalues of the similarity matrix among the subjects to infer the potential population structure. It is used as follows: eigenstrat(genoFile, outFile.Robj = "out.list", outFile.txt = "out.txt", rm.marker.index = NULL, rm.subject.index = NULL, miss.val = 9, num.splits = 10, topK = NULL, signt.eigen.level = 0.01, signal.outlier = FALSE, iter.outlier = 5, sigma.thresh = 6) The arguments of the function are described as follows: • genoFile: A text file containing the genotypes (0, 1, 2, or 9). The element of the file in row i and column j represents the genotype at the i-th marker of the j-th subject. 0, 1, and 2 denote the number of risk alleles, and 9 (default) is for the missing genotype.
• outFile.Robj: The name of an R object for saving the list of the results. Such list is the same as the return value of this function. The default is out.list.
• outFile.txt: A text file for saving the eigenvectors corresponding to the top significant eigenvalues.
• rm.marker.index: A numeric vector containing the indices of the removed markers. The default is NULL.
• rm.subject.index: A numeric vector containing the indices of the removed subjects. The default is NULL.
• miss.val: The value used to fill in the blanks caused by missing values in the input data. The default is 9. The element 9 representing the missing data in the genoFile should be replaced according to the value of miss.val.
• num.splits: The number of groups into which the markers are split. The default is 10.
• topK: The number of eigenvectors to return. If it is NULL, it is calculated by the TW test. The default is NULL. • signal.outlier: A logical value indicating whether the function searches for and deletes outliers of the subjects. The default is FALSE.
• iter.outlier: A numeric value indicating the maximum iteration number for the outlier detection of the subjects. The default is 5.
• sigma.thresh: A numeric value indicating the threshold for the outlier elimination. The default is 6.
The arguments rm.marker.index and rm.subject.index could be provided according to the user-specified rules for data cleaning instances, such as an individual with excessively many missing genotype values. The argument num.splits does not affect the results of the EIGENSTRAT. It is used to reduce the working set (i.e., the amount of memory that an application requires) when we scan the file given by genoFile. Large num.splits results in the need for a small working set and results in slow eigenstrat() function run. The same usage is observed for num.splits in the function pcoc().
This function returns a list of num.markers (the number of markers excluding the removed markers), num.subjects (the number of subjects excluding the outliers), rm.marker.index (the indices of the removed markers), rm.subject.index (the indices of the removed subjects), TW.level (the significance level of the TW test), signal.outlier (indicating whether the function deletes the outliers of the subjects), iter.outlier (the maximum iteration number for the outlier detection), sigma.thresh (the threshold for the outlier elimination), num.outliers (the number of the outliers), outliers.index (the indices of the outliers), num.used.subjects (the number of the used subjects), used.subjects.index (the indices of the used subjects), similarity.matrix (the similarity matrix among the subjects), eigenvalues (the eigenvalues of the similarity matrix), eigenvectors (the eigenvectors corresponding to the eigenvalues), topK (the number of the significant eigenvalues), TW.stat (the observed values of the TW statistics), topK.eigenvalues (the top eigenvalues), topK.eigenvectors (the eigenvectors corresponding to the top eigenvalues), and runtime (the execution time of this function).

Function pcoc()
The PCOC for correcting for the PS identifies the clustering and continuous patterns of the genetic variation. The function pcoc() calculates the principal coordinates and the clustering of the subjects in the PCOC for correcting for the PS. It is used as follows: pcoc(genoFile, outFile.txt = "pcoc.result.txt", n.MonteCarlo = 1000, num.splits = 10, miss.val = 9) Most of the arguments are the same as those in eigenstrat(), and the different ones are as follows: • outFile.txt: A text file for saving the results of this function. The default value is "pcoc.result.txt".
• n.MonteCarlo: The repeat number of the Monte Carlo sampling procedure. The default is 1000.
This function returns a list with elements principal.coordinates and cluster, where principal.coordinates stores the principal coordinates and cluster stores the clustering of the subjects. If the number of the clusters is only one, cluster is omitted.

Function tw()
The TW test detects the significant eigenvalues of a matrix. The function tw() calculates the number of significant eigenvalues and the TW statistics. This function was written by Bejan (2005Bejan ( , 2008 and is used as follows: tw(eigenvalues, eigenL, criticalpoint = 2.0234) The arguments of the function are described as follows: • eigenvalues: A numeric vector whose elements are the eigenvalues of a matrix. The values should be sorted in a descending order.
• eigenL: The number of the eigenvalues.
This function returns a list with the class 'htest' containing statistic (a vector of the TW statistics), alternative (a character string describing the alternative hypothesis), method (a character string indicating the type of the test performed), data.name (a character string providing the name of the data), and SigntEigenL (the number of significant eigenvalues).

Function dr()
The pseudo F statistic based on the DR with or without adjustments for the covariates detects the association between a distance matrix and some independent variants of interest. A distance matrix can be transformed into a similarity matrix easily. The function dr() calculates the observed value of the test statistic and the p value of the test by using the pseudo F statistic based on the DR. It is used as follows: dr(simi.mat, null.space, x.mat, permute = TRUE, n.MonteCarlo = 1000, seed = NULL) The arguments of the function are described as follows: • simi.mat: The similarity matrix among the subjects.
• null.space: A numeric vector containing the indices of those columns in x.mat corresponding to the null space.
• x.mat: The covariate matrix which combines the null space and the matrix of interest.
• permute: A logical value indicating whether the Monte Carlo sampling procedure is invoked without replacement. The default is TRUE.
• n.MonteCarlo: The repeat number of the Monte Carlo sampling procedure. The default is 1000.
• seed: The seed of the random number generator. The default is NULL.
This function returns a list with the class 'htest' containing statistic (the observed value of the test statistic), p.value (the p value of the test), alternative (a character string describing the alternative hypothesis), method (a character string indicating the type of the test performed), and data.name (a character string describing the names of the data). The return values of the functions max3(), npt(), and nmax3() described below are similar to that of this function.

Function smt()
The single-marker test is used to identify the association between the genotype at a biallelic marker and a trait using the Wald test or the Fisher's exact test. The function smt() calculates the number of the valid subjects and the p value of the single-marker test. It is used as follows: smt(y, g, covariates = NULL, min.count = 5, missing.rate = 0.20, y.continuous = FALSE) The arguments of the function are described as follows: • y: A numeric vector of the observed trait values in which the i-th element is for the i-th subject. The elements could be discrete (0 or 1) or continuous. Any missing value is represented by NA.
• g: A numeric vector of the observed genotype values (0, 1, or 2, denoting the number of risk alleles) in which the i-th element is for the i-th subject. Any missing value is represented by NA. g has the same length as y.
• covariates: An optional data frame, list, or environment containing the covariates used in the model. The default is NULL, that is, no covariates are present.
• min.count: A threshold to decide which method is used to calculate the p value when the trait is discrete and covariates = NULL. For a certain genotype and a certain trait value, we have a corresponding number of the subjects. If the minimum value of all such numbers traversing all possible genotypes and trait values is less than min.count, the Fisher's exact test is adopted; otherwise, the Wald test is adopted. The default is 5.
• missing.rate: The highest missing value rate of the genotype values that this function can tolerate. The default is 0.20.
• y.continuous: A logical value indicating whether y is continuous. The default is FALSE.
If y is continuous, this function returns a list with the class 'htest', which contains the components statistic, p.value, alternative, method, data.name, and sample.size. The components statistic, p.value, alternative, method, and data.name are similar to those of dr(). sample.size is a vector providing the numbers of the subjects with the genotypes 0, 1, and 2 (n0, n1, and n2, respectively). If y is discrete, this function returns a list with the class 'htest' containing the components statistic, p.value, alternative, method, data.name, sample.size, and bad.obs. The components statistic, p.value, alternative, method, and data.name are similar to those of dr(). Meanwhile, sample.size is a vector that provides the number of the subjects with the trait value 1 and the genotype 0 (r0), the number of the subjects with the trait value 1 and the genotype 1 (r1), the number of the subjects with the trait value 1 and the genotype 2 (r2), the number of the subjects with the trait value 0 and the genotype 0 (s0), the number of the subjects with the trait value 0 and the genotype 1 (s1), and the number of the subjects with the trait value 0 and the genotype 2 (s2). bad.obs is a vector that provides the number of the missing genotype values with the trait value 1 (r.miss), the number of the missing genotype values with the trait value 0 (s.miss), and the total number of missing genotype values (n.miss).

Function max3()
The MAX3 test based on the trend tests without adjustments for the covariates or based on the Wald tests with adjustments for the covariates is conducted for the association between a SNP and a binary phenotype. The test statistic is the maximum value of the three test statistics derived for the recessive, additive, and dominant models. The function max3() calculates the observed value of the MAX3 statistic and the p value of the MAX3 test. It is used as follows: The arguments of the function are described as follows: • y: A numeric vector of the observed trait values in which the i-th element is for the i-th subject. The elements should be either 0 or 1.
• g: A numeric vector of the observed genotype values (0, 1, or 2, denoting the number of risk alleles) in which the i-th element is for the i-th subject. Any missing value is represented by NA. g has the same length as y.
• covariates: A numeric matrix specifying the covariates used in the model. Each column is for one covariate. The default is NULL, that is, no covariates are needed to be adjusted for. The rhombus formula is an approximation formula to estimate the two-sided test p value for the MAX3 statistic . It is an extension of the W -formula, which was originally derived to estimate the one-sided test p value of the MAX3 statistic (Efron 1997).
The function max3() in the package AssocTests can test the association between a SNP and a binary phenotype with or without correcting for the covariates. This function differs from the function MAX3() in the package Rassoc, which can only test for the association without correcting for the covariates.

Function npt()
The NPT examines the association between a genetic variant and a non-normally distributed quantitative trait based on the non-parametric risk. The function npt() calculates the observed value of the NPT statistic and the p value of this test under a specific genetic model. It is used as follows: npt(y, g, varphi) The arguments of the function are described as follows: • y: A numeric vector of the observed quantitative trait values in which the i-th element is the trait value of the i-th subject.
• g: A numeric vector of the observed genotype values (0, 1, or 2, denoting the number of risk alleles) in which the i-th element is the genotype value of the i-th subject for a biallelic SNP. g has the same length as y.
• varphi: A numeric value representing the genetic model. It should be 0, 0.5, or 1, which indicates that the calculation should be performed under the recessive, additive, or dominant model, respectively.

Function nmax3()
When the genetic model is uncertain, a robust test is preferred. The MAX3 test is a widelyused robust test in case/control association studies. NMAX3 is a non-parametric MAX3 test based on the NPT to evaluate the association between a biallelic SNP and a quantitative trait. The function nmax3() calculates the observed value of the NMAX3 statistic and the p value of this test. It is used as follows: nmax3(y, g) The arguments y and g are the same as those in npt().
The function nmax3() in the package AssocTests differs from the functions max3() described above in the package AssocTests and MAX3() in the package Rassoc. nmax3() is constructed on the basis of the NPT for quantitative trait association studies, whereas max3() and MAX3() are used for case/control association studies and derived from the Cochran-Armitage trend test.

Simulation examples
Some simulation examples are used to illustrate the usages and capabilities of the functions in AssocTests (Wang et al. 2020). The analyses were conducted using the R version 3.6.3 (R Core Team 2020).
The data sets used in this section and the next section have been placed into a data-only package AssocTests.data (Wang, Zhang, Li, and Zhu 2015). This package can be downloaded from https://github.com/statscueb/AssocTests.data or use the function install_github() in the package devtools (Wickham, Hester, and Chang 2019) to install it directly.

Simulation: eigenstrat() and tw()
The simulation data set consists of 1,000 cases and 1,000 controls. For each individual, we generate the genotypes of 10,000 SNPs that are not associated with the disease. Following Price et al. (2006) and , we consider two population substructures for the study population as follows: S1 (two underlying discrete subpopulations) and S2 (three underlying In each population substructure, we consider two levels of the ancestral differences between the cases and the controls, which are moderate and extreme, by varying the sampling fractions summarized in Table 1. For the first simulation example, the population substructure is S1, and the level of the ancestral differences between the cases and the controls is moderate. In the package AssocTests.data, the data sets moderate2PSG and moderate2PSP are the genotype data and the phenotype data, respectively, under this condition. We save the data set moderate2PSG in a text file which can be used as the input of the function eigenstrat().
R> data("moderate2PSG", package = "AssocTests.data") R> data("moderate2PSP", package = "AssocTests.data") R> gFile <-"moderate2PSG.txt" R> write. In function tw(), we use result.E$eigenvalues[1:(n -1)] as the value of eigenvalues and n -1 as the value of eigenL. The criticalpoint is set to 0.9793, which corresponds to the significance level of 0.05. The value of result.E$topK from the function eigenstrat() is 1, which is consistent with the value of result.TW$SigntEigenL from the function tw(). Thus, the number of significant eigenvalues is 1 in this example. For the second simulation example, the population substructure is S2, and the level of the ancestral differences between the cases and the controls is moderate. In the package AssocTests.data (Wang et al. 2015), the data sets moderate3PSG and moderate3PSP are the genotype data and the phenotype data, respectively, generated under this situation. For the third simulation example, the population substructure is S1, and the level of the ancestral differences between the cases and the controls is extreme. In the package AssocTests.data, the data sets extreme2PSG and extreme2PSP are the generated genotype and phenotype data, respectively. For the fourth simulation example, the population substructure is S2, and the level of the ancestral differences between the cases and the controls is extreme. In the package AssocTests.data, the data sets extreme3PSG and extreme3PSP are the generated genotype and phenotype data, respectively. The R codes for the second, third, and fourth simulation examples are similar to those for the first one.
The results of result.E$topK from the function eigenstrat() are 1, 2, 1, and 2 for the four examples, respectively, conforming to the results of result.TW$SigntEigenL from the function tw(). Hence, the numbers of the significant eigenvalues are 1, 2, 1, and 2, respectively. Figure 1 plots the samples in the space of the first two eigenvectors of the similarity matrices for the four simulation examples. The clustering patterns are noticeable, with the subjects from the same subpopulation staying close. Furthermore, the distributions of cases (represented by red circles) are nonuniform in controls (represented by blue pluses), especially in the third and fourth examples (Figure 1 C and D), illustrating the allele frequency differences between the cases and the controls due to systematic ancestry differences, that is, the PS.

Simulation: pcoc()
The simulation design and the examples for the function pcoc() are the same as those for eigenstrat(). For the first example, genoFile = gFile. Furthermore, outFile is set to "moderate2PS.PCOC.txt". The other arguments are all set to their default values.
We can calculate the accuracies of the clusterings provided by the function pcoc() by using the values of result.PCOC$cluster, considering that we know the true clusterings of the subjects in the simulation design. We find that pcoc() can classify the subpopulations with 100% accuracy in the four examples, where the subpopulation patterns are tremendously recognizable and no overlap exists between different subpopulations.

Simulation: dr()
Considering the linkage disequilibriums among the SNPs, we use the real data set that contains the genotypes of 127 SNPs in the uronyl-2-sulfotransferase gene from Genetic Analysis Workshop 16 (Cupples et al. 2009;Amos et al. 2009;) to generate the simulation data set. After the deletion of the subjects containing missing values, the real data set consists of 1,081 subjects. The disease prevalence is set to 0.05 and the first 50 SNPs are assumed to be associated with the disease with a log odds ratio ln(1.05). We use the model ln p j 1−p j = ln(0.05/0.95) + ln(1.05) × x j1 + · · · + ln(1.05) × x j50 (j = 1, · · · , 1081) to simulate the phenotypes of the subjects. In the package AssocTests.data (Wang et al. 2015), the data sets drG, drP, and drS are the genotype data, phenotype data, and the similarity matrix of the genotype data, respectively. In the function dr(), the left and the right parts of the argument x.mat are set to a 1,081-dimensional column vector of 1s and the vector of the phenotype values, respectively. The null.space stores the column indices of the left part of x.mat. The other arguments are all set to their default values.
R> data("drP", package = "AssocTests.data") R> data("drS", package = "AssocTests.data") R> set.seed(100) R> x.mat <-cbind(rep(1, length(drP)), drP) R> result.DR <-dr(simi.mat = drS, null.space = 1, x.mat) R> result.DR Distance regression data: drS and x.mat F = 0.0018221, p-value = 0.011 alternative hypothesis: the pair-wise similarity is influenced by the variants of interest This test is two-sided. The null hypothesis is that all the regression coefficients are 0s, that is, the pair-wise similarity is not influenced by the variants of interest. The alternative hypothesis is that some regression coefficients are nonzero, that is, the pair-wise similarity is influenced by the variants of interest. The result indicates that the p value is less than 0.05, illustrating that the markers are associated with the disease, conforming to the simulation design.

Simulation: smt()
The simulation design in this section is similar to that in . The simulation data set consists of 1,000 cases and 1,000 controls. HWE is assumed and the minor allele frequency (MAF) is set to 0.3. Furthermore, the additive model is considered. The relative risks of the groups with genotypes 1 and 2 relative to the group with genotype 0 are 1.2 and 1.4, respectively. The disease prevalence is set to 0.05.
R> result.SMT <-smt(y, g) R> result.SMT Single-marker test data: y and g p-value = 0.006666 alternative hypothesis: the phenotype is significantly associated with the genotype The p value of the test illustrates that the genotype and the phenotype in this example are significantly associated, with the significance level of 0.05.

Simulation: max3()
The simulation data sets of the first example for max3() are the same as those for smt(). We can use the function max3() to test the association between the genotype g and the phenotype y from Section 3.4. The p values of the tests illustrate that significant association is found between the genotype and the phenotype in this example with the significance level of 0.05.

R>
The simulation design of the second example for max3() is similar to that in . The simulation data sets consist of two subpopulations, the sample sizes of which are both 1,000. The proportions of the cases from the two subpopulations are 0.6 and 0.4, respectively, whereas those of the controls from the two subpopulations are 0.4 and 0.6, respectively. Therefore, the numbers of the cases and the controls are both 1,000. HWE is assumed within each subpopulation. The MAFs are 0.3 and 0.35 for the two subpopulations, respectively. The additive model is considered. The relative risks of the groups with genotypes 1 and 2 relative to the group with genotype 0 are 1.2 and 1.4, respectively, within each subpopulation. The disease prevalence is set to 0.05 for the two subpopulations.
R> z <-matrix(rep(c(0, 1), c(n.sp1, n.sp2)), ncol = 1) R> max3(y, g, covariates = z, Score. According to the results of the function max3(), the p values of the tests are 0.01849, 0.01847, 0.0182, and 0.01815 when we choose to use the Wald test and the twofold integration, the Wald test and the rhombus formula, the score test and the twofold integration, and the score test and the rhombus formula, respectively, illustrating that a significant association is found between the marker and the phenotype with correcting for the PS in this example with the significance level of 0.05.

Simulation: npt() and nmax3()
The simulation data set consists of 1,000 subjects. Following Zhang and Li (2015), we assume that the quantitative trait y relates to a biallelic SNP with the genotype g as the linear model y = β 0 + gβ 1 + e, where e is the error term that follows a generalized extreme value distribution, tGEV(0, 0, 1), with the shape parameter 0, the location parameter 0, and the scale parameter 1. We set β 0 = 0.5, β 1 = ln 1.2, and MAF = 0.3 in the population. The genetic model is assumed to be additive in this simulation.
R> result.NPT <-npt(y, g, 0.5) R> result.NPT Nonparametric trend test data: y and g NPT = 4.1097, p-value = 3.962e-05 alternative hypothesis: the phenotype is significantly associated with the genotype The p value of the NPT for the additive model is 3.962 × 10 −5 , which is far less than the significance level of 0.05. Thus, the quantitative trait is significantly associated with this SNP.
We can also use the function nmax3() to test the association between the quantitative trait y and the biallelic SNP with genotype g by using the NMAX3.
R> result.NMAX3 <-nmax3(y, g) R> result.NMAX3 Nonparametric MAX3 test data: y and g NMAX3 = 4.1097, p-value = 7.779e-05 alternative hypothesis: the phenotype is significantly associated with the genotype This result also shows that the continuous phenotype y is associated with the biallelic SNP by using the NMAX3 with the significance level of 0.05.

An application: Rheumatoid arthritis with PS
This section presents a detailed application of this package on the association analysis of rheumatoid arthritis with the PS. The data is from the Genetic Analysis Workshop 16 Problem 1 (Cupples et al. 2009;Amos et al. 2009). The genotype data set used for correcting for the PS consists of 868 cases and 1,194 controls at 12,749 SNPs that are not associated with the disease . The genotype and the phenotype data sets are saved in arthritisG and arthritisP, respectively, in the package AssocTests.data (Wang et al. 2015).
R> arth.TW <-tw(eigenvalues = arth.E$eigenvalues[1:(nrow(arthritisP) -1)], + eigenL = nrow(arthritisP) -1) R> arth.TW$SigntEigenL The value of arth.E$topK from the function eigenstrat() is 4, which is consistent with the value of arth.TW$SigntEigenL from the function tw(). Thus, the number of significant eigenvalues is 4 in this example. Figure 2 plots the samples in the space of the first four eigenvectors of the similarity matrix. The distributions of the cases (represented by the red circles) are nonuniform in the controls (represented by the blue pluses), especially in Figure 2 C, E, and F, where the fourth eigenvector is involved in, illustrating that the PS exists in the data.
The function pcoc() can be used to detect the population structure of this example. This data set consists of two subpopulations according to the result of pcoc().

Conclusions
In this article, we have outlined some methods and algorithms for the genetic association studies and described the R package AssocTests (Wang et al. 2020), which contains the procedures EIGENSTRAT, PCOC, TW test, DR, single-marker test, MAX3 test with or without adjustments for the covariates, NPT, and NMAX3. The descriptions of the functions have their counterparts in the R package AssocTests.
We demonstrated the usages and the capabilities of this package in some simulation studies and real data analyses in genetic association studies. Actually, the functions can also be used in other application areas, such as food processing, economics, and finance. All the functions in this package performed well. The computational complexity is often extremely high in the genome-wide association study, typically using 500,000 ∼ 1,000,000 SNPs across the genome. The functions in this package are effective. Considerably numerous SNPs are feasibly handled. Although the execution time is relatively long, it is affordable. Furthermore, the multitrait genetic association study (Thoen et al. 2017), which is developing rapidly recently, is not involved in this package. For further works, the methods for the multitrait genetic association study will be implemented in an updated version of this package. Depending on the demands of users, we may consider developing a graphical user interface for this package.