Bayesian Linear Mixed Models with Polygenic Effects

We considered Bayesian estimation of polygenic effects, in particular heritability, in relation to a class of linear mixed models implemented in R (R Core Team 2018). Our approach is applicable to both family-based and population-based studies in human genetics, in which a genetic relationship matrix can be derived either from family structure or from genome-wide data. Using simulated and real data, we demonstrate our implementation of the models in the generic statistical software systems JAGS (Plummer 2017) and Stan (Carpenter et al. 2017) as well as several R packages. In doing so, we have not only provided facilities in R linking standalone programs such as GCTA (Yang, Lee, Goddard, and Visscher 2011) with other packages in R, but also addressed some technical issues in the analysis. Our experience with a host of general and special-purpose software systems will facilitate investigation into more complex models for both human and nonhuman genetics.


Introduction
The genetic basis of quantitative phenotypes has been a long-standing research problem with a large and growing literature; one of the earliest contributions was Fisher (1918) on additive effects of genetic variants (the polygenic effects). In human genetics it is common to estimate heritability, the proportion of the total phenotypic variance that is polygenic, through twin and family studies. In twin studies, polygenic effects are embedded in the correlations between monozygotic and dizygotic twin pairs, under the assumption that monozygotic twins share all their genetic material but dizygotic twins only half. In family studies, the polygenic component is coupled with a relationship matrix in a mixed model with covariates as fixed effects, e.g., Morton and MacLean (1974) and Lange (2002). These models differ from those usually seen in general statistics in that the polygenic effects are represented by a random variable that is correlated among relatives owing to genes shared identity-by-descent. Estimation can be inaccurate, especially because of shared environment, in both twin and family studies.
More recently, the large quantity of single nucleotide polymorphisms (SNPs), single base-pair variants of DNA, available from population-based samples has generated renewed interest in the problem. This is because the data allow a genomic relationship matrix (GRM) to be built as part of a genomewide association study (GWAS) for the identification and characterization of associations between DNA variants and the phenotype (our outcome of interest). Yang et al. (2010) showed that a GRM can be used in the mixed model in much the same way as in models for families, where the relationship matrix is built on familial relationships. Consequently, the ubiquitous availability of DNA also makes the models appropriate for any samples with typed DNA polymorphisms. The approach is applicable to a wide variety of traits including continuous, discrete and time-to-event outcomes (Zhao and Luan 2012). The estimation of heritability (h²), the ratio of total additive genetic variance to total phenotypic variance, is fundamentally important since it largely quantifies the scope of a GWAS for gene discovery and characterization.
Bayesian methods are attractive since generic software systems are available to facilitate the model-building, and they also help to address the uncertainty in parameter estimation. Moreover, they give credible intervals with highest posterior density (HPD), as opposed to frequentist interval estimates, which are often derived under simplifying assumptions. Markov chain Monte Carlo (MCMC) serves as a practical tool for Bayesian inference with a full characterization of the posterior distribution of the variance components as well as heritability. For this reason, Bayesian methods have been widely used in the plant and animal science literature for a broad range of traits, e.g., Yi and Xu (2000); Varona et al. (2005). These applications and the software employed almost exclusively use family structure, given that the inverse of the relationship matrix is then easily calculated, as was also the case with work on humans, e.g., Burton, Scurrah, Tobin, and Palmer (2005). An exception regarding software is package BLR (Perez, de los Campos, Crossa, and Gianola 2010; de los Campos, Perez, Vazquez, and Crossa 2013) in R (R Core Team 2018), which can accommodate a GRM, but the analysis often has to be stopped owing to a nonpositive definite GRM. It is not obvious how these issues can be addressed.
In our own analyses, we have encountered various issues. Our attempts to tackle these problems have led to some useful results, which we believe will facilitate similar analyses by other colleagues. Using a simulated data set and a real data set, we implemented the models in JAGS (just another Gibbs sampler; Plummer 2017), Stan (Stan Development Team 2017; Carpenter et al. 2017) and, in the case of a large sample, package BLR. We wrote utilities in R to read or write a GRM as generated by the software GCTA (Yang et al. 2011) for use with these software packages, together with functions that calculate heritability and its standard error when the polygenic and residual variances and their standard errors are given. We further adapted the R package MCMCglmm (Hadfield 2010) to enable comparison between family-based and genotype-based relationship matrices. These functions are available from the R package gap (Zhao 2007, 2017), where further information can be found. We also give expressions for perturbing the covariance matrix when the GRM is nonpositive definite. We believe our work will be of interest in human genetics as well as animal and plant genetics. Below we briefly describe the polygenic model, a simulated data set as a benchmark, and an application. We then conclude with a summary, which includes generic discussions of non-genetic effects, missing outcomes, efficient implementation, and frequentist and Bayesian estimates of heritability for the GCTA documentation example.

Statistical models
We start with an outline of the linear mixed model, showing how total additive genetic effects can be framed with respect to a relationship matrix. We then consider specification of the Bayesian linear mixed model.

Linear mixed model
To motivate, we consider a study of body mass index (BMI, body weight/height², kg/m²) in relation to sex (0 = man, 1 = woman) and age (in years). A linear model (LM) of BMI on sex and age is

BMI = b_0 + b_1 sex + b_2 age + e,

where b_0 is an intercept, and b_1 and b_2 are the regression coefficients for sex and age, indicating the change in BMI attributable to being a woman rather than a man and to a per-year increase in age, respectively; e is a residual term capturing effects on BMI other than sex and age. As will soon become clear, there is a need for extra terms that are random variables, leading to a linear mixed model (LMM). More generally, let y be a continuous variable and our outcome of interest, X covariates, and u random effects. A LMM has the form

y = Xβ + Zu + e,

where y is an N × 1 vector of observations; X an N × p matrix of known covariates; β a p × 1 vector of unknown regression coefficients; Z a known N × q matrix of variables; u a q × 1 vector of unknown random effects; and e an N × 1 vector of (unobservable random) errors.
We assume that u ∼ N(0, D) and e ∼ N(0, E). Statistical inference for this model, based on the frequentist approach, can be done with maximum likelihood (ML) or restricted maximum likelihood (REML) estimation. Procedures are widely available (see Sorensen and Gianola 2002, for further details).
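As a toy illustration of the BMI linear model above (the data and true effect sizes below are simulated assumptions, not values from any study), ordinary least squares in R recovers the coefficients:

```r
set.seed(1)
n <- 100
sex <- rbinom(n, 1, 0.5)                    # 0 = man, 1 = woman
age <- runif(n, 20, 70)                     # age in years
# assumed true values: b0 = 24, b1 = 1.5, b2 = 0.05
bmi <- 24 + 1.5 * sex + 0.05 * age + rnorm(n, sd = 2)
fit <- lm(bmi ~ sex + age)                  # BMI = b0 + b1 sex + b2 age + e
coef(fit)
```

The fitted coefficients approximate the assumed b_0, b_1 and b_2; the mixed-model extensions below add correlated random effects to this fixed-effects skeleton.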

Linear mixed model with polygenic effects
We assume that our trait of interest, y, is a function of m causal variants, each with effect u_i ∼ N(0, σ²_u), i = 1, ..., m, treated as random, where σ²_u is a polygenic variance. These variants are DNA polymorphisms at particular positions across the genome. At locus i, we assume the two causal alleles are q and Q with frequencies 1 − f_i and f_i, forming genotypes qq, qQ and QQ with additive codes 0, 1 and 2, respectively. The genotype codes follow a Binomial distribution, Bin(2, f_i), with mean 2f_i and variance 2f_i(1 − f_i), leading to normalized additive effects z_i = (x_i − 2f_i)/√(2f_i(1 − f_i)). The simplest form of a polygenic model uses a linear combination of effects from all causal variants, i.e., g = Σ_{i=1}^m z_i u_i, where z_i can be seen as a function of the frequency of the allele with effect, acting as a scaling factor such that E(z_i) = 0 and VAR(z_i) = 1. In matrix notation g = Zu, so g ∼ N(0, σ²_u ZZ′), and σ²_g = mσ²_u is the variance of the total additive effects ("polygenic effects"). From this, VAR(y) = σ²_u ZZ′ + σ²I = σ²_g ZZ′/m + σ²I = σ²_g A + σ²I, where A = ZZ′/m amounts to a relationship matrix, indeed called a GRM at the causal loci, σ² is the residual variance, and I an identity matrix. Heritability is defined as the proportion of phenotypic variance explained by the polygenic effects, namely h² = σ²_g/(σ²_g + σ²). The matrix A can be approximated with genomewide data containing a large number (M) of SNPs analogous to the causal variants, i.e., G = WW′/M, where w_ij = (x_ij − 2p_i)/√(2p_i(1 − p_i)) is the standardized genotype of individual j at SNP i, p_i is the allele frequency, and x_ij = 0, 1, 2 for genotypes a₁a₁, a₁a₂ and a₂a₂ of alleles a₁ and a₂, respectively. A series of refinements of the G matrix has been suggested by Yang et al. (2010).
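The construction G = WW′/M can be sketched in base R as follows (the dosage matrix here is a toy simulation, and monomorphic SNPs are assumed absent so that the scaling never divides by zero):

```r
# Build a GRM G = WW'/M from an N x M dosage matrix X with entries 0, 1, 2,
# standardizing each SNP by its estimated allele frequency.
grm <- function(X) {
  p <- colMeans(X) / 2                            # allele frequencies p_i
  W <- sweep(X, 2, 2 * p, "-")                    # x_ij - 2 p_i
  W <- sweep(W, 2, sqrt(2 * p * (1 - p)), "/")    # / sqrt(2 p_i (1 - p_i))
  tcrossprod(W) / ncol(X)                         # G = WW'/M
}
set.seed(1)
X <- matrix(rbinom(50 * 200, 2, 0.3), 50, 200)    # 50 individuals, 200 SNPs
G <- grm(X)
dim(G)                                            # N x N
```

With unrelated simulated individuals, the diagonal of G averages close to 1 and the off-diagonal entries scatter around 0, as expected of a GRM.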
The GCTA software can generate a GRM from genomewide SNPs in compressed (.grm.gz) or binary (.grm.bin) form and provides REML estimates for the polygenic model.
In summary, our model is similar to (2) in that D = σ²_g G and E = σ²I, where G is an N × N GRM, so that VAR(y) = σ²_g G + σ²I with g being the "polygenic effects". For data on relatives, the additive genetic relationship matrix A can also be derived from a given family structure as twice the kinship matrix (Lange 2002), whose entries represent probabilities of genes shared identity-by-descent among pairs of relatives. Such a matrix can be generated by a number of R packages such as kinship2 (Therneau and Sinnwell 2015), available from the Comprehensive R Archive Network (CRAN).

Bayesian linear mixed model with polygenic effects
A Bayesian linear mixed model (BLMM) with polygenic effects follows the set-up above, whose sampling model is y|β, u, σ² ∼ N(Xβ + Zu, σ²I), with a prior β ∼ N(0, σ²_β B), where B is a known, nonsingular matrix and σ²_β is a hyperparameter. Full specification of the model is furnished with appropriate distributions for the variance components, e.g., Section 6.3 of Sorensen and Gianola (2002). For the polygenic model (3) in this paper, we take the likelihood above together with prior specifications in which s₁ and s₂ are chosen to provide noninformative priors for the variance components, and the matrix B is diagonal.
Other priors for the variance components, such as the uniform, are possible, as in Section 4.2 below; see also Waldmann (2009) and Gelman (2006).

Handling of the G matrix
Simulation of the polygenic effects in Section 2.3 involves the multivariate Normal distribution, which can be very time-consuming when N gets large. A speedup can be achieved by supplying the precision matrix as input to the software described below. More often, a Cholesky decomposition can be applied.
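A sketch of the Cholesky route: rather than sampling g directly from the multivariate Normal N(0, σ²_g G), one draws independent standard normals and premultiplies by the (transposed) Cholesky factor; G is assumed positive definite here, and the toy matrix is purely illustrative.

```r
set.seed(1)
N <- 4
G <- 0.9 * diag(N) + 0.1           # toy PD relationship matrix: 1 on diagonal, 0.1 off
sg2 <- 0.5                         # polygenic variance sigma^2_g
L <- t(chol(G))                    # lower triangle, so that G = L L'
g <- sqrt(sg2) * L %*% rnorm(N)    # g ~ N(0, sg2 * G)
```

Since chol() in R returns the upper triangular factor, it must be transposed before use, which is the point noted for the JAGS version below.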
Expression (5) is amenable to several software environments for MCMC sampling; these are described in Section 3.2 below.

Benchmark
Data from Meyer (1989), as used in Tempelman and Rosa (2004), serve as our benchmark. The pedigrees for the 282 animals derive from an additional 24 base-population (Generation 0) animals that do not have records of their own but are nevertheless of interest with respect to inference on their own additive genetic values. These 24 base animals are presumed unrelated to each other, so the row dimension of u is 306 (282 + 24). To facilitate discussion, the data are made available in package gap from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=gap. Tempelman and Rosa (2004) gave a variety of estimates using SAS (SAS Institute Inc. 2014); we are interested in the REML estimates, which are available from package regress (Clifford and McCullagh 2006).

Bayesian approach
We now turn to the Bayesian approach and begin with a generic implementation based on the BUGS (Bayesian inference using Gibbs sampling) specification. As most such implementations involve large samples, we moved away from WinBUGS (Lunn, Thomas, Best, and Spiegelhalter 2000) and used OpenBUGS (OpenBUGS Foundation 2015) and JAGS under Linux. Both allow for command line execution, but as noted earlier (Sturtz, Ligges, and Gelman 2005) data manipulation is required, which can be greatly facilitated with OpenBUGS, specifically using the R package R2OpenBUGS (Sturtz et al. 2005). We focused on JAGS as it was better tuned under Linux with LAPACK (Anderson et al. 1999) or Intel MKL (Intel 2013), together with its R counterpart R2jags (Su and Yajima 2015). We use multiple chains (e.g., 2 to 4) and the Brooks-Gelman-Rubin (BGR) statistics, provided in JAGS or Stan, to check convergence. Initial parameter values are generally based on subject matter knowledge and/or parameter estimates from classical estimation.
DIC info (using the rule, pD = var(deviance)/2)
pD = 336.4 and DIC = 2517.8
DIC is an estimate of expected predictive error (lower deviance is better).
The version with Cholesky decomposition is as follows, noting that the factored matrix needs to be transposed.
DIC info (using the rule, pD = var(deviance)/2)
pD = 338.2 and DIC = 2519.2
DIC is an estimate of expected predictive error (lower deviance is better).

Stan
We further experimented with Stan, which is appealing to us as it implements faster sampling algorithms (Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin 2014, p. 307). We worked with both the R interface, rstan (Stan Development Team 2018), and the command line version, cmdstan (Stan Development Team 2016).
R> data <- with(meyer, list(N = N, y = y, g1 = g1, g2 = g2, G = G))
R> library("rstan")
R> meyer.stan <- "..."
R> parms <- c("b", "p", "r", "h2")
R> f1 <- stan(model_code = meyer.stan, data = data, chains = 2, iter = 500,
+    verbose = FALSE)
R> f2 <- stan(fit = f1, data = data, chains = 2, iter = 5000, pars = parms)

where results from the first stan call are used as input to the second call. Note that the program is sectioned, with data passed from R declared in data and derived quantities in transformed data; these are followed by parameters and transformed parameters before they are used in model, and our quantities of interest can further be obtained in generated quantities. The results from Stan are shown below and in Figure 1,

Inference for Stan model: df0c4ce12df598b4fcdd553dfe7d2cee.
2 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=5000.

where the BGR diagnostic statistics show convergence of the parameters. The overlapping density plots for the two chains are also shown in Figure 1. Although both OpenBUGS and JAGS work as standalone programs, the Stan counterpart, cmdstan, is much easier to use. We simply make a copy of the program above, say meyer.stan, in the cmdstan directory and issue "stanc" to generate the C++ source, or even "make meyer" to generate the executable. We first prepare our data in R and then use the functions bugs.data and bugs2jags to output an input file for meyer,

R> library("R2OpenBUGS")
R> data <- with(meyer, list(N = N, y = y, g1 = g1, g2 = g2, G = G))
R> bugs.data(data, data.file = "meyer_bugs.txt")
[1] "meyer_bugs.txt"
R> library("coda")
R> bugs2jags("meyer_bugs.txt", "meyer_stan.txt")

and we can call

$ ./meyer sample data file=meyer_stan.txt output file=meyer.csv
$ stansummary meyer.csv

The data file (meyer_stan.txt) is used by the executable to generate our output in meyer.csv, and the summary statistics are given by the stansummary utility.
Equally, rstan can also pick up results to allow for graphical facilities in R.

Parallel computation
It is possible to take advantage of the multicore facilities in R to run multiple chains via package parallel (R Core Team 2018). This can be done as follows using the Meyer data.
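A minimal sketch of the idea with package parallel is given below; run_chain is a stand-in for a single-chain sampler call (e.g., jags() or stan()), not actual model code, and on Windows mc.cores must be 1.

```r
library(parallel)
# One MCMC chain per worker; each chain receives its own random seed.
run_chain <- function(seed) {
  set.seed(seed)
  cumsum(rnorm(1000))       # placeholder for a vector of posterior draws
}
chains <- mclapply(1:4, run_chain, mc.cores = 2)
length(chains)              # 4 chains, run two at a time
```

The chain-specific results can then be combined for convergence diagnostics, e.g., as a coda mcmc.list.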

Nonpositive definite G matrix
We found it more likely to have a nonpositive definite G matrix in (4) and (5) than a kinship matrix. In theory, we can get around this with a perturbation (ε) as described in Guo and Thompson (1991, p. 174), namely replacing G with G̃ ≡ G + (ε/σ²_g)I, so that σ²_g G̃ = σ²_g G + εI; to keep the total variance unchanged, one only needs to amend σ² as σ̃² = σ² − ε. The ε is chosen according to the Gerschgorin theorem (Varga 2004). This is the same as before when ε = 0. While this is mathematically viable, it involves an additional matrix inversion in JAGS, making MCMC convergence even more formidable. We instead used (G + I) in place of the relationship matrix and σ² + σ²_g as the residual variance, which does not involve direct simulation from the multivariate Normal distribution.
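The choice of ε can be sketched in R: the Gerschgorin circle theorem bounds the smallest eigenvalue of G from below by min_i (g_ii − Σ_{j≠i} |g_ij|), so any ε at least the negative of that bound makes G + εI positive (semi)definite. The matrix below is a toy indefinite example, not a real GRM.

```r
# Perturbation eps via the Gerschgorin circle theorem: every eigenvalue of G
# lies within radius sum_{j != i} |g_ij| of some diagonal entry g_ii.
gerschgorin_eps <- function(G, slack = 1e-8) {
  radius <- rowSums(abs(G)) - abs(diag(G))   # off-diagonal absolute row sums
  max(0, max(radius - diag(G))) + slack
}
G <- matrix(c(1, 1.2, 0,
              1.2, 1, 0,
              0,   0, 1), 3, 3)              # smallest eigenvalue is -0.2
eps <- gerschgorin_eps(G)
min(eigen(G + eps * diag(3), symmetric = TRUE, only.values = TRUE)$values)
```

The Gerschgorin bound is conservative, so the resulting ε may be larger than the minimal perturbation needed; computing the smallest eigenvalue directly gives a tighter but more expensive choice.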

Application: Familial vs. genomic heritabilities
The data used in this section were derived from a large family study mirroring the work of Klimentidis et al. (2013), to enable contrasting genetic relationships from family structure and from genome-wide data.

Frequentist approach
Two relationship matrices based on family structure and genomic data were generated by R and GCTA, respectively, to be used by GCTA for REML estimation.

Bayesian approach
Besides the REML results shown above, in a separate analysis of lung function from the same cohort, the two approaches yielded almost identical heritability estimates (Klimentidis et al. 2013). The marked difference in deviance prompted us to characterize the variability of heritability in a Bayesian framework.
For this "large N" (N ≫ 1,000) problem, the implementation in either JAGS or Stan became prohibitively slow, so we resorted to the specific implementations in packages MCMCglmm and BLR that we were aware of. However, an adaptation of package MCMCglmm with a GRM took about three days on our Linux system with 300 burn-ins and 1,000 iterations, making a large number of iterations infeasible. With package BLR, we encountered the issue of a nonpositive definite GRM; while a perturbation could be added to the GRM, it was not clear how the results should then be adjusted. We also explored approximate Bayesian methods, through which AnimalINLA (Holand, Steinsland, Martino, and Jensen 2013) came to our attention. It is derived from package INLA (integrated nested Laplace approximation; Rue, Martino, Lindgren, Simpson, Riebler, and Krainski 2014). It was not obvious that it could handle a GRM, but we wished to explore it.

AnimalINLA
The AnimalINLA package was used first, taking family structure into account.
Based on an inspection of the structure of the object xx above, we could build an object compatible with one from the function compute.Ainverse. However, a GRM would lose the sparseness of the kinship matrix derived from family structure, so this can be expected to be slower than GCTA.

BLR
Our call is as follows,

R> y <- as.matrix(r)
R> eps <- 0.1
R> m <- BLR(y,
+    GF = list(ID = seq_len(N), A = g$GRM + diag(eps, N)),
+    prior = list(varU = list(df = 3, S = 4), varE = list(df = 3, S = 4)),
+    nIter = 500000, burnIn = 150000, thin = 1, saveAt = "fgh.BLR_")
R> attach(m)
R> varU

The columns and rows of the GRM are indexed by the object g$id, whose ordering was used to match that of the phenotypic data. The argument bF specifies flat priors for the regression coefficients mentioned earlier. The argument A is a "symmetric, positive definite" matrix (de los Campos et al. 2013). The priors for the polygenic (varU) and residual (varE) variances follow de los Campos et al. (2013) as scaled inverse χ² with expectation S/(df − 2), where S = VAR(y)(1 − h²)(df − 2); this is roughly the same for both variances. The perturbation ε = 0.1 has made the GRM positive definite. Note that the saveAt option instructs the function to keep the values of bF, varU and varE at each iteration in fgh.BLR_bF.dat, fgh.BLR_varU.dat and fgh.BLR_varE.dat, respectively. Figure 3 shows the results of a very long chain (150,000 burn-ins, 350,000 post-burn-in iterations). The sequences were also converted into an 'mcmc' object for use with package coda, from which we obtained the HPD interval via the function HPDinterval. The density plot is indeed similar to Figure 2.

Summary
We implemented Bayesian linear mixed models that involve a direct use of the relationship matrix. Generic software such as JAGS or Stan offers greater simplicity than purpose-written software and more flexibility for complex models. Through data analysis we showed that the frequentist and Bayesian approaches can give comparable point estimates, but the latter is desirable for its ability to use prior information and produce posterior distributions. For large samples, unlike the case of family structures, where the inverse of the relationship matrix entering the precision of the polygenic effects can be calculated on the fly (Waldmann 2009; Damgaard 2007), generic systems have great difficulty dealing with large genomic matrices. We therefore exploited matrix decomposition and parallel computation, and compiled JAGS against both LAPACK and Intel MKL. Given that the computing time remained prohibitive, we further used approximate Bayesian inference, in particular the Laplace approximation of INLA as in package AnimalINLA, which was again humbled by the high dimensionality and non-sparsity of the GRM. Our analysis also naturally called upon a number of packages in the R system, with its capabilities for data management, powerful programming and modeling.
The implementation has not been seen in the literature, and Stan gave results comparable to the usual REML estimation and to those obtained with JAGS. Our setup enables a relationship matrix from either family or population data to enter a polygenic model directly. The comparison of both types of relationship matrices is now possible with package MCMCglmm, from which a function MCMCgrm was implemented in package gap. Package BLR runs faster but would fail with a nonpositive definite G matrix. Unlike Guo and Thompson (1991), our approach does not involve repeated inversion or factorization of the variance-covariance matrix at the sampling stage, and has enabled analysis with package BLR. The analysis also went beyond our previous experiment (Zhao and Luan 2012), whose focus was only on frequentist approaches. A reviewer pointed out work by Bae, Perls, and Sebastiani (2014), noting previous work on decomposition and conditioning by Waldmann, Hallander, Hoti, and Sillanpää (2008) and Hallander, Waldmann, Wang, and Sillanpää (2010), who "proposed an approach based on a decomposition of the multivariate Normal distribution of the random effects into univariate Normal distributions using conditional distributions", which however "fails to produce accurate results with large multigenerational families", though the authors "were not able to pinpoint the reason for the apparent discrepancy" (between the conditioning and singular value decomposition). In essence, the model in Bae et al. (2014) has a block-diagonal covariance structure whose blocks K_i are the kinship matrices associated with the individual families i = 1, ..., m. In our case, the GRM does not have this block structure. In Hallander et al. (2010), dominance effects were also modeled, and in principle these can be included in our approach similarly to the GRM.
We hope that our work will facilitate exploration of other practical issues of Bayesian linear mixed models with polygenic effects, some of which are highlighted here.

Non-genetic effects
Although we have focused on the polygenic effects, their non-genetic counterparts can be an indispensable part of the research. For instance, BMI may be linked to lifestyle and psychosocial factors such as diet, physical activity and mental health. SNP effects are now commonly estimated as part of a GWAS from the so-called mixed linear model involving polygenic effects with SNP dosage as a fixed effect. Gene-environment interactions are also important.
For non-genetic effects, the g-prior (Zellner 1986) is often used. In our notation, this amounts to β ∼ MVN(β₀, aσ²(X′X)⁻¹), where β₀ is a hyperparameter and a is a positive scalar often chosen to be the sample size; we use a instead of the customary g simply to avoid confusion with the polygenic effects g throughout this paper and elsewhere. The prior can facilitate model comparison since, in the case of multiple linear regression, closed-form results for the regression coefficients can be obtained, but some undesirable properties in model comparison have also been documented (e.g., Pericchi 2005).
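In R, the g-prior covariance aσ²(X′X)⁻¹ for the design of Section 2.1 could be formed as follows (the design matrix, σ², and a below are illustrative assumptions):

```r
set.seed(1)
N <- 100
X <- cbind(1, rbinom(N, 1, 0.5), runif(N, 20, 70))  # intercept, sex, age
sigma2 <- 4                            # assumed residual variance
a <- N                                 # "unit-information" choice for the scalar a
V <- a * sigma2 * solve(crossprod(X))  # prior covariance of beta
dim(V)
```

With a = N, the prior carries roughly the information of a single observation, which is why this choice is popular as a default.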

Efficient implementation
The polygenic modeling would benefit greatly from a truly efficient Bayesian computation system with fine-tuned algorithms. Our limited experience showed that JAGS and Stan are feasible for moderate sample sizes (N ≈ 1,000) but become very time-consuming as the sample size grows. Besides the approaches described in Section 4.1, JAGS can be compiled to use multicore facilities. Recent versions of rstan have an option cores to use all available cores automatically. We do not attempt to elaborate here, as this is an active and evolving area; work such as Kruschke (2015) gives further information.

Missing outcomes

R> names(m)
 [1] "y"       "weights" "mu"      "varE"    "yHat"    "SD.yHat"
 [7] "whichNa" "fit"     "bF"      "SD.bF"   "u"       "SD.u"
[13] "varU"    "prior"   "nIter"   "burnIn"  "thin"

with which we would be more comfortable. It seems that both frequentist and Bayesian approaches yielded smaller variance components than when the missing outcomes were imputed a priori. From the quantities mu and bF we are able to recover regression coefficients for the fixed effects comparable to those seen earlier. Furthermore, the vector whichNa indicates which observations have a missing outcome, so that yHat[whichNa] contains predicted values for those missing outcomes.
The GCTA software can give heritability and standard error estimates for a quantitative trait based on a large number of SNPs. The documentation data involve a quantitative trait for 3,925 individuals and 1,000 SNPs, leading to h² (SE) = 0.022 (0.009). We ran 5,000 burn-ins and 10,000 iterations with package BLR and obtained 0.119 (0.001) with 95% HPD interval (0.098, 0.142), noticeably higher than the REML-based estimate.
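The standard error reported alongside h² can be obtained from the variance component estimates by the delta method; a minimal sketch follows (the function name and the inputs are illustrative, and independence of the two variance estimates is assumed):

```r
# h2 = vg / (vg + ve); delta-method SE from the standard errors of vg and ve.
h2_se <- function(vg, ve, svg, sve) {
  vp <- vg + ve
  d.vg <- ve / vp^2               # partial derivative d h2 / d vg
  d.ve <- -vg / vp^2              # partial derivative d h2 / d ve
  c(h2 = vg / vp, se = sqrt(d.vg^2 * svg^2 + d.ve^2 * sve^2))
}
h2_se(vg = 0.2, ve = 0.8, svg = 0.05, sve = 0.05)
```

Accounting for the covariance between the two estimates, when available from the REML information matrix, would add a cross term to the sum under the square root.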
A Gaussian outcome is but one of many scenarios in which polygenic effects can be included. Frequentist counterparts include packages regress, pedigreemm (Vazquez, Bates, Rosa, Gianola, and Weigel 2010) and coxme (Therneau 2015), all in the R environment, which can handle outcomes that are binary, Poisson, time-to-event, etc. Our focus was on h², and there should be some similarity when approaching other indicators from mixed models such as the coefficient of determination (R²) (Nakagawa and Schielzeth 2013).