MixMAP : An R Package for Mixed Modeling of Meta-Analysis p Values in Genetic Association Studies

Genetic association studies are commonly conducted to identify genes that explain the variability in a measured trait (e.g., disease status or disease progression). Often, results of these studies are summarized in the form of a p value corresponding to a test of association between each single nucleotide polymorphisms (SNPs) and the trait under study. As genes are comprised of multiple SNPs, post hoc approaches are generally applied to determine gene-level association. For example, if any SNP within a gene is signiﬁcantly associated with the trait at a genome-wide signiﬁcance level ( p < 5 × 10 − 8 ), then the corresponding gene is considered signiﬁcant. A complementary strategy, termed mix ed modeling of m eta- a nalysis p values (MixMAP) was proposed recently to characterize formally the associations between genes (or gene regions) and a trait based on multiple SNP-level p values. Here, the MixMAP package is presented as a means for implementing the MixMAP procedure in R .


Introduction
Genetic association studies provide an opportunity to investigate the relationships between complex disease phenotypes and genetic polymorphisms. First stage analysis of data arising from these studies generally involves characterizing the association between each single nucleotide polymorphisms (SNPs) and a measured trait. For example, in an analysis unadjusted for covariates where interest lies in modeling a binary measure of disease status, investigators may perform a Cochran-Armitage trend test. Here each SNP is defined by the number of variant alleles present. The results of genetic association studies are then summarized by a p value for each SNP as a measure of significance of the association with the trait.
Since interest generally lies in characterizing association between genes (or gene regions) and the trait, where genes are comprised of multiple SNPs, an additional analysis step is required. One simple approach that is commonly applied, is to declare a gene as significant if at least one SNP within that gene reaches "genome-wide significance" defined as a p value less than 5 × 10 −8 . This is equivalent to a Bonferroni adjustment based on one million tests.
In a recent manuscript, a complementary strategy was proposed to use summary level data (p values) to investigate a group of SNPs simultaneously in their association with the trait. This approach, termed mix ed modeling of meta-analysis p values (MixMAP; Foulkes, Matthews, Das, Ferguson, Lin, and Reilly 2012), has the advantage of identifying genes comprised of several SNPs that exhibit moderate signals but do not individually reach genome-wide significance. MixMAP involves applying a mixed effects modeling framework to transformed SNP-level p values with random gene level cluster effects. This is a natural framework as SNPs within the same gene tend to have moderate to high linkage disequilibrium (LD), which in turn leads to potentially correlated p values. For each gene, the MixMAP package provides functions that return an empirical Bayes estimate of the corresponding random effect and a determination of whether the gene exhibits an association with the trait.
This manuscript describes the MixMAP package (Matthews 2015) in R which was developed to implement the MixMAP approach. Section 2 begins with a detailed description of the MixMAP procedure. This is followed in Section 3 by an outline of the MixMAP package with a comprehensive data example. Finally, the manuscript concludes with a brief discussion and topics for future work in Section 4.

The MixMAP approach
MixMAP is designed as a complementary analytic strategy to characterize the association between a gene (or gene region) comprised of one or more SNPs and a trait. This section provides a brief overview of the MixMAP approach, while a more complete description and extensive simulation studies to characterize the method can be found in Foulkes et al. (2012). The primary inputs to the MixMAP procedure are: (1) a set of p values for single SNP tests of association for multiple SNPs within and across genes; (2) the SNP name corresponding to each p value; (3) a mapping of SNPs to genes (or gene regions). For the purpose of this presentation, the term "locus" is used to refer generally to a gene or gene region in which there are moderate to high levels of LD among the SNPs. The input p values can be based on a single cohort study or generated based on a meta-analysis of several genetic association studies, as is common practice in large-scale investigations.
Formally, let L i represent the ith locus, i = 1, 2, . . . , N , and let L i = (s i1 , . . . , s in i ) where s ij is the jth SNP in locus i. In a typical genetic association study, a p value is calculated for each s ij . MixMAP begins by ranking the set of all M (= N i=1 n i ) p values and then applying an inverse normal transformation. The result of this transformation is expressed as y k = Φ −1 (r k ) where r k = k M +1 , k is the rank of the SNP, and Φ −1 (·) is the inverse of the cumulative density function for a standard normal. A mixed effects models is then fit of the form: where y i = (y i1 , . . . , y in i ) is a vector of transformed p values (y k 's) for the ith location and Step 1: Transformation. p values resulting from univariate analysis in a genetic association study are ranked and used to calculate r k . In turn, the r k are transformed to normality to produce y k .
Step 2: Mixed Modeling. Using the y k as the dependent variable, a mixed effects model is fitted with a fixed intercept and random intercepts for each locus.
Step 3: Prediction. Empirical Bayes estimates and one-sided prediction intervals are created for each locus with a correction for multiple loci.
Step 4: Locus Selection. Loci are selected as statistically meaningful if the upper limit of the corresponding prediction interval is less than 0. the matrix X i is the corresponding matrix of covariates. Z i is a vector of 1's of length n i and b i is used to model the latent effect of the ith location. By assumption, we also have that The values of interest here are the collection of estimates of b i for i = 1, . . . , N , corresponding to the gene-level effects. In this case, the best linear unbiased estimator for b i is: where Σ i = COV(y i ) = σ 2 b J n i J n i + σ 2 I n i and J n i is a vector of 1's of length n i . The quantity b i , referred to as the empirical Bayes estimator of b i , is found by replacing σ b and Σ i in Equation 2 with corresponding restricted maximum likelihood (REML) estimates. The prediction variance for the ith locus is given by: where X = (X 1 , . . . , X N ) and Σ is a block diagonal matrix whose diagonal elements are the matrices Σ 1 , . . . , Σ N .
When the number of clusters, N , is large VAR Since the default variance returned by the lmer function in the R package lme4 (Bates, Maechler, and Bolker 2015) is given by VAR(b i |y i ) this can be used in this setting. Here, one is interested in detecting locus effects that are smaller than zero. This is of interest because it is the large negative values of locus effects that correspond to smaller p values. Therefore, one-sided prediction intervals are constructed for each region with the upper limit for the (1 − α)% prediction interval for b i expressed as: As multiple prediction intervals are being created, a correction is applied, and α is replaced with α * = α/N where again N is the number of genes. A summary step-by-step procedure is provided in Figure 1 and the procedure described in this section is implemented in the function mixmapPI.

Testing framework
While prediction intervals are a commonly used framework for estimating random effects, in this context, the width of the prediction interval will approach zero as the number of SNPs within a gene gets large. Therefore, rather than constructing a prediction interval which uses the VAR( b i − b i ) as the variance the prediction interval, we also provide an alternative method using a hypothesis testing framework. This method is less sensitive to the number of SNPs within a gene. Formally, the test of interest is H 0 : As a result, the test statistic for the ith gene is defined as: The null hypothesis is then rejected if U i > C α,i . If C α,i is defined to be C α,i = z (1− α N ) then this equates to a Bonferroni corrected control of the family-wise error rate. This procedure is implemented in the function mixmapTest.

Implementation in the MixMAP package
MixMAP is an R (R Core Team 2015) package that is freely available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=MixMAP and that allows for gene-level association analysis based on p values from an association study, and can be installed on any operating system that has R installed. The MixMAP package relies on the lme4 package to implement the mixed models necessary in some of the MixMAP functions. The primary functions in the MixMAP are mixmapPI, which implements the MixMAP algorithm summarized in Section 2, and mixmapTest, which implements the procedure described in Section 2.1. In order to use the mixmapPI or mixmapTest functions, the user must provide a data frame with the following five variables: SNP name, gene name, chromosome number, base pair coordinate, and p value. Both mixmapTest and mixmapPI output an object of class 'MixMAP' that has associated methods of summary and plot.

Global Lipids Gene Consortium (GLGC) data
The MixMAP package is illustrated in an application using meta-analysis results derived from several independent studies of approximately 100, 000 individuals in the Global Lipids Gene Consortium (GLGC; Teslovich et al. 2010). These data are freely available at http: //www.broadinstitute.org/mpg/pubs/lipids2010/. For the purpose of this presentation, we focus on a subset of 31, 825 SNPs in 2, 960 genes that are also on the ITMAT-Broad-CARe (IBC) 50K SNP array data and can be uniquely mapped to a gene, as described in Foulkes et al. (2012). The trait under study is low-density lipoprotein cholesterol (LDL-C), a known causal risk factor for cardiovascular disease. Each p value contained in the data set corresponds to a test of association between LDL-C and a specific SNP, after adjusting for appropriate covariates, as described in Teslovich et al. (2010). These p values, along with the mapping of SNPs to genes, serve as the primary input to the mixmapTest function as illustrated in the following section.

Application of MixMAP
The first step after installing the MixMAP package is for the user to load the package as follows:

R> library("MixMAP")
The package contains example data named MixMAP_example. This file contains all of the information needed to be run through the mixmapPI function, namely SNP name, gene name, chromosome number, base pair coordinate, and p value. Notably, the chromosome number and base pair coordinate are required for the associated plot function. If these data are not available, empty columns can be added to the input data frame, and the mixmapTest function will still work appropriately. The data are loaded as follows: R> data("MixMAP_example", package = "MixMAP") Inspecting the first ten rows confirms that all of the necessary columns are present for these data to be read directly into the mixmapTest function: Note that the variable name for the column containing the SNP name is "MarkerName". This is different than the default ("SNP") and will need to be explicitly stated in the mixmapTest function. This is also true of the other variable names in this file. The mixmapTest function is applied as follows: R> MixOut <-mixmapTest(MixMAP_example, pval = "GC.Pvalue", + snp = "MarkerName", chr = "Chr", coord = "Coordinate", gene = "Gene") The user can specify the α level used in the hypothesis testing framework. This is indicated by alpha in the mixmapTest function and by default alpha = 0.05. The current implementation of mixmapTest divides this number by the total number of genes to arrive at the effective level of α for each test.
The mixmapTest function returns an object of class 'MixMAP'. Objects of this class contain 4 slots: output, num.genes.detected, detected.genes, and lme.out. The output slot contains a data frame with one row for each gene in the input. The columns include the corresponding empirical Bayes estimates of the random gene-level effects (postEst), the estimated posterior variance (var), the upper limit of the prediction interval (predUpper), and the number of SNPs in the specified gene (snpCount), as well as the given chromosome number (chr) and base pair location (coordinate). The last three columns include the MixMAP p value (MixMAP_pvalue), the Bonferroni adjusted MixMAP p value and the false discovery rate adjusted q value (MixMAP_qvalue). The first 10 rows of the output slot are displayed below for the GLGC data example: The num.genes.detected slot contains an integer vector of length two with the first element containing the number of genes that were selected as statistically meaningful (number detected) and the second containing the total number of genes included in the analysis (total number of genes). For the GLGC data, we have: R> MixOut@num.genes.detected number detected total number of genes 7 2960 The third slot in a 'MixMAP' object is called detected.genes and contains similar information as the output slot, but only for the subset of genes that were selected by the MixMAP algorithm. Along with the information provided in the output slot (i.e., empirical Bayes estimates, posterior variances, upper limit of prediction intervals, etc.), this data frame contains the SNP name with the smallest p value within each selected gene, the probit rank-transformed value related to the smallest p value in the selected gene, and several additional summary measures of the p values within the selected gene. Finally, the last slot, lmer.out, contains the output from the mixed model fit. For the GLGC data, this is given by: Finally, the plot method associated with a 'MixMAP' object will produce a Manhattan-style plot. Here the x-axis represents the base pair location on the specified chromosome and shading is used to distinguish chromosomes. The y-axis represents the maximum of 0 and the absolute value of the empirical Bayes estimate of the gene level random effect. For the example provided, we have: ing from genome-wide association or deep sequencing studies involving a larger number of genes and/or greater coverage within a gene. Additionally, the MixMAP package could be extended to allow for control of potential confounding variables that may lead a gene to be falsely detected.
Further extensions of the MixMAP package will also allow users to read in files that are created as output from R functions that perform GWAS analysis directly, such as GenABEL (GenABEL Project Developers 2013). This would allow users to perform the original testing in R and seamlessly perform the MixMAP algorithm all within the R environment. Finally, extensions of the MixMAP algorithm to the mixture modeling setting are currently being developed. This would allow the model to reflect a more accurate version of the true distribution of the p values, without requiring a first stage ranking, and may result in better levels of detection of genes that are truly associated with the trait of interest.
The MixMAP tool presented herein is intended to complement existing approaches to characterizing gene level association. In the example provided, SNPs are grouped into genes (regions) of moderate to high linkage disequilibrium (LD), defined loosely as a measure of correlation between SNPs. Alternatively, a group of SNPs could be defined by a pathway or a gene set, as described in Subramanian et al. (2005). To emphasize, the algorithm implemented in this package is flexible with respect to choice of grouping and an alternative clustering variable can be input into the algorithm in place of gene.
Finally, in many scenarios where it is desirable to use the MixMAP procedure, SNPs must be mapped to some grouping, often genes. This in itself is not necessarily a trivial task in and of itself. ANNOVAR (Wang, Li, and Hakonarson 2010) is a useful piece of software for annotating SNPs to genes that allows users to annotate genes based on gene definitions (i.e., RefSeq, UCSC, ENSEMBL, GENCODE).