R Package wgaim: QTL Analysis in Bi-Parental Populations Using Linear Mixed Models

The wgaim (whole genome average interval mapping) package developed in the R system for statistical computing (R Development Core Team 2011) builds on linear mixed modelling techniques by incorporating a whole genome approach to detecting significant quantitative trait loci (QTL) in bi-parental populations. Much of the sophistication is inherited through the well established linear mixed modelling package ASReml-R (Butler et al. 2009). As wgaim uses an extension of interval mapping to incorporate the whole genome into the analysis, functions are provided which allow conversion of genetic data objects created with the qtl package of Broman and Wu (2010) available in R. Results of QTL analyses are available using summary and print methods as well as diagnostic summaries of the selection method. In addition, the package features a flexible linkage map plotting function that can be easily manipulated to provide an aesthetic viewable genetic map. As a visual summary, QTL obtained from one or more models can also be added to the linkage map.


Introduction
Whole genome analysis is receiving wide attention in the statistical genetics community. In the context of plant breeding experiments the focus is on quantitative trait loci (QTL) which attempt to explain the link between a trait of interest and the underlying genetics of the plant. Many approaches of QTL analysis are available such as marker regression methods (Hayley and Knott 1992;Martinez and Curnow 1992) and interval mapping (Zeng 1994;Whittaker et al. 1996). These methods are common place in QTL software and are available for use in R packages such as the qtl package of Broman and Wu (2010). This particular suite of software is also complemented with a book (Broman and Sen 2009) which has been favourably reviewed (Zhou 2010).
There has also been some focus on the use of numerical integration techniques for the analysis of QTL. Xu (2003) and Zhang et al. (2008) suggest the use of Bayesian variable shrinkage and utilise Markov chain Monte Carlo (MCMC) to perform the analysis. An MCMC approach is also adopted in the R package qtlbim (Yandell et al. 2005). The package builds on the qtl package and the Bayesian paradigm allows an extensible list of trait types to be analysed. The package also makes use of the new model selection technique, the Deviance Information Criterion (Shriner and Yi 2009), to aid in identifying the correct QTL model. Similarly, a non-MCMC approach is adopted in the BayesQTLBIC package (Ball 2010) where the QTL analysis involves the use of the Bayesian Information Criterion (Schwarz 1978) as a QTL model selection tool.
Unfortunately many of the aformentioned methods and their software lack the ability to account for complex extraneous variation usually associated with plant or animal based QTL studies. Limited covariate additions are possible in R package qtlbim and through the inventive online GridQTL software which uses the ideas of Seaton et al. (2002). Kang et al. (2008) uses linear mixed models in the R package EMMA but it does not allow for extraneous random effects and possible complex variance structures that may be needed to capture environmental processes, such as spatial layouts, existing in the experiment. In this paper we discuss the R package wgaim which implements the genetic and inferential derivations of the whole genome average interval mapping (WGAIM) approach of Verbyla et al. (2007). The package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=wgaim. This approach allows the simultaneous modelling of genetic and non-genetic variation through extensions of the linear mixed model. The extended model allows complex extraneous variation to be captured as well as simultaneously incorporating a whole genome analysis to detection and selection of QTL using a linkage map. The underlying linear mixed modelling analysis is achieved computationally using the R package ASReml-R. The simulation results and examples in Verbyla et al. (2007) show that WGAIM is a powerful tool for QTL detection and outperforms more rudimentary methods such as composite interval mapping. As it incorporates the whole genome into the analysis it eliminates the necessity for piecemeal model fitting along the genome which in turn avoids the use of model selection criteria to control the number of false positive QTL. It must be noted Huang and George (2009) also use ASReml-R as their core engine for whole genome QTL analysis in the R package dlmap. In this package, the backward elimination model fitting procedure and multiple testing corrections used to control false positive QTL suggest this package differs markedly from wgaim. In wgaim the false positives are controlled naturally by assuming a background level of QTL variation through a single variance component associated with a contiguous set of QTL across the whole genome. This parameter can then be tested to determine the presence of QTL somewhere on the genome. As a result, a less cumbersome approach to detecting and selecting QTL is ensured.
The WGAIM method uses an extension of interval mapping to perform its analysis. Thus, for convenience and flexibility, the wgaim package provides the ability to convert genetic data objects created in the qtl package to objects for further use in wgaim. The converted objects retain a similar structure to objects created in qtl and therefore can still be used with functions within the package. Users of wgaim need to be aware that it is a software package intended for the analysis and summary of QTL and only contains minimal tools for exploratory linkage map manipulation. Much of the exploratory work can be handled with functions supplied in the qtl package and users should consult its documentation if required. In addition, the interval mapping approach of Verbyla et al. (2007) and its implementation in wgaim is also restricted to populations with only two distinct genotypes. Some of these populations include, double haploid (DH), back-crosses and recombinant inbred lines (RIL). To ensure this rule is adhered to, error trapping has been placed in the appropriate functions of wgaim.
Throughout the WGAIM procedure the underlying linear mixed model analysis is achieved using the highly flexible R software package ASReml-R, built as a front end wrapper for the more sophisticated stand alone version, ASReml ). This software allows the user the ability to flexibly model spatial or environmental variation as well as possible variation that may arise from additional components associated with the experimental design. It uses an average information algorithm developed in Gilmour et al. (1995) that allows efficient computing of residual maximum likelihood (REML) (Patterson and Thompson 1971) estimates for the variance parameters. The use of REML estimation in the linear mixed model context becomes increasingly necessary in situations where the data is unbalanced. Much of its sophistication has been influenced from its common use in the analysis of crop variety trials (Smith et al. 2001(Smith et al. , 2006 where complex additional components such as spatial correlation structures or multiplicative factor analytic models need to be incorporated into the mixed model. If available, the software also allows complex pedigree information to be included (Oakey et al. 2006). Many of these additional flexibilities in ASReml have also established it as a valuable software tool in the livestock industries. In more recent years it has been used as a core engine for more complex genetic analyses as in Gilmour (2007), Verbyla et al. (2007) and Huang and George (2009). The stand alone software and the R package ASReml-R is only commercially available through http://www.vsni.co.uk/ but trial licenses are also available.
The paper is arranged as follows. Section 2 briefly describes the theory of the WGAIM algorithm that is implemented in wgaim. Section 3 presents a walk through a typical QTL analysis using the functions of qtl and wgaim. QTL analyses from two plant breeding experiments are provided in Section 4. The second example shows some of the enhanced features of wgaim including the ability to plot an aesthetic genetic map. For visualization QTL can also be placed on the map post analysis. Post analysis diagnostics are also available which present features of the forward selection procedure used to determine the QTL.

WGAIM theoretical method
Before discussing the functions of the wgaim package it is necessary to provide a theoretical overview of the methodology used in its implementation. The WGAIM approach is a forward selection method that uses a whole genome approach to genetic analysis at each step. Following Verbyla et al. (2007), initially a working model is developed that assumes a QTL in every interval. Thus for a given set of trait observations y = (y 1 , . . . , y n ) consider the model where τ is a t length vector of fixed effects with an associated n × t explanatory design matrix X and u e is a b × 1 length vector of random effects with an associated n × b design matrix Z e . Typically, the distribution of u e ∼ N (0, σ 2 G(ϕ)) and is assumed mutually independent to the residual vector e ∼ N (0, σ 2 R(φ)) with ϕ and φ being vectors of variance ratios.
The vector g in (1) represents a r length vector of genotypic random effects with its associated design matrix Z g . Let c be the number of chromosomes and m k be the number of markers on chromosome k, (k = 1, . . . , c), and q i,k:j represent the parental allele type for line i in interval j on chromosome k. In WGAIM, q i,k:j = ±1, reflecting two possible genotypes AA, BB for DH and RIL and AB, BB for back-cross populations. The ith genetic component of this model is then given by where a k:j is QTL effect size assumed to have distribution a k:j ∼ N (0, σ 2 γ a ) and p i ∼ N (0, σ 2 γ p ) represents a polygenic or residual genetic effect not captured by the QTL effects.
As in interval mapping the vector of QTL allele types are replaced by the expectation of the QTL genotype given the flanking markers. Let m k:j be the jth marker on the kth chromosome and applying a parameter reduction technique from Verbyla et al. (2007) produces a vector of genotypic effects of the form where ψ k:j = θ k:j,j+1 /2d k:j,j+1 (1 − θ k:j,j+1 ) and θ k:j,j+1 , d k:j,j+1 are the recombination fraction and Haldane's genetic distance between marker j and j + 1 respectively on the kth chromosome. Thus M ψ is a fully specified known matrix of pseudo-markers spanning the whole genome. A more detailed overview of this decomposition and its derivation can be found in Verbyla et al. (2007). Let Z a = Z g M ψ then the full working statistical model for analysis is then After the fitting of (3) the simple hypothesis H 0 : γ a = 0 is tested based on the statistic −2 log Λ = −2(log L − log L 0 ) where L and L 0 is the residual likelihood of the working model (3) with and without the random regression QTL effects, Z a a. Stram and Lee (1994) suggest that under H 0 , −2 log Λ is distributed as the mixture 1 2 (χ 2 0 +χ 2 1 ) due to the necessity of testing whether the variance ratio is on the boundary on the parameter space.
If γ a is found to be significant a putative QTL is determined using an outlier detection method based on the alternative outlier model (AOM) for linear mixed models from Gogel (1997) and formalised in Gogel et al. (2001). Verbyla et al. (2007) uses the AOM to develop a score statistic for each of the chromosomes. For example, for the kth chromosome let a k0 = a k + δ k where δ k is a vector of random effects such that δ k ∼ N (0, σ 2 γ a,k I m k −1 ). The full outlier model is where Z a,k is the matrix Z a appropriately subsetted to chromosome k. The REML score is then derived for γ a,k and evaluated at γ a,k = 0, namely where and best linear unbiased predictors (BLUPS)ã k = γ a Z T a,k P y. This score has mean zero and this will occur exactly when the terms in the parentheses of (5) are equal. Scores that depart from zero suggest a departure from γ a,k = 0. A simple statistic that reflects this departure can be based on the "outlier" statistic . This statistic can therefore be calculated from the BLUPS of the QTL sizes and their prediction error variances arising from the working model. In most cases mixed model software, including ASReml-R used in wgaim, provide the ability to extract these components for this use.
In a similar manner to the above once the chromosome with the largest outlier statistic is identified, the individual intervals within that chromosome are checked. For example if the largest t 2 k is from the kth chromosome, a similar derivation can be followed for the outlier statistic of the jth interval, namely A putative QTL is then determined by choosing the largest t 2 k:j within that chromosome. It must be stated at this point that although (4) is formulated to derive the theory for QTL outlier detection there is no requirement to fit this model as the chromosome and interval outlier statistics only contain components obtainable from a fit of the working model proposed in (3). Thus there is only a minimal computational cost to determine an appropriate QTL interval using this method.
Once a QTL interval is selected it is moved into the fixed effects of the working model (3) and the process is repeated until γ a is not significant. After the selection process is complete the selected QTL intervals appear as fixed effects and the final model is where z a,i is the appropriate column of Z a for the ith QTL. This complete approach is known as the WGAIM algorithm.

A casual walk through
A typical QTL analysis with wgaim can be viewed as series of steps with the appropriate functions 1. Fit a base asreml() (see the ASReml-R package) model as in (3) but without the added marker/interval genetic information term Z a a. The asreml() call allows very complex structures for the variance matrices G(ϕ) and R(φ) through its random and rcov arguments. This makes it an ideal modelling tool for capturing non-genetic experimental variation, such as design components and/or extraneous environmental variation. From a plant breeding context Verbyla et al. (2007) also suggests including the polygenic or residual genetic term Z g p in the base model as a simple random effect. Examples of base models can be found in Section 4 of this article.
For a comprehensive overview of the ASReml-R package, including thorough examples of its flexibility, users should, in the first instance, consult the documentation that is included with the package. On any operating system that has ASReml-R installed, the documentation can be found using the simple command asreml.man() in R.
2. Read in genetic data using read.cross() (see the qtl package). This function allows the reading in of genetic information in a number of formats including files generated from commonly used genetic software programs such as Mapmaker and QTL Cartographer.
For the exact requirements of all available file types and their nomenclature users should consult the qtl documentation.
The read.cross() function can also process more advanced genetic crosses. However, in wgaim the QTL analysis is restricted to populations with two genotypes. Thus users should be aware that the class of the returned object from read.cross() needs to have the structure c("bc", "cross"). The "bc" is an abbreviated form for "back-cross". It is this class structure that is checked in the proceeding steps.
3. Convert genetic "cross" data to an "interval" object using cross2int(fullgeno, missgeno = "MartinezCurnow", rem.mark = TRUE, id = "id", subset = NULL) The function contains a number of arguments that allow some linkage map manipulation before calculation of the interval information for each chromosome. If missgeno = "MartinezCurnow", missing values within a chromosome are calculated using the rules of Martinez and Curnow (1992). If missgeno = "Broman" the they are calculated using the default values of argmax.geno() in the qtl package. If rem.mark = TRUE, coincident markers across the genome are removed from the marker set. The correlated markers and how they are connected is returned as part of the final object. The id is a required argument that determines the names of the unique rows of fullgeno and is used for matching names with phenotypic data in the next step. There is also an option to subset your map if desired.
The final genetic data object returned retains the c("bc", "cross") class for backward compatibility with other functions in the qtl package as well as inherits the class "interval" for functionality within the wgaim package.
4. Merge the genetic "interval" data with the base model phenotypic data using wmerge(geno, pheno, by = NULL, ...) All named arguments of this function are required for a successful merging of genotypic and phenotypic data. The geno argument must be a genetic data object inheriting the class "interval" from a call to cross2int(). The pheno can be the usual data frame or a file. If it is a file then it is read in by the wrapper function asreml.read.table() which conveniently converts column names with capital letters to factors. The ... argument can be used as additional arguments to asreml.read.table(). The by argument is used for merging geno and pheno and should be a column name that is present in geno$pheno as well as present in one of the columns of pheno. There is error trapping in the function if these rules are not adhered to.
By default this function initially merges interval information from different chromosomes or linkage groups to form the fully specified interval matrix M ψ in (2) with column names "Chr.<chr>.<int>". The full genetic matrix, M ψ is then merged with the phenotypic data which is equivalent to expanding the genetic information using Z a = Z g M ψ , ensuring replicates of the same line will have the same genetic structure. It should be noted that unmatched elements of by are handled differently depending on whether they are from the geno or pheno data. If elements of by exist in pheno and are unmatched with elements in geno then they are kept to ensure completeness of the phenotypic data. If elements of by exist in geno and not in pheno they are dropped as there will be no phenotypic information available for that genetic line.
The merged object retains all the same components as the "interval" data object with the addition of named components pheno.dat representing the phenotypic data only and full.data representing the fully merged phenotypic and genotypic data.
5. Perform QTL analysis with wgaim() wgaim(baseModel, parentData, TypeI = 0.05, attempts = 5, trace = TRUE, ...) The baseModel argument must be an asreml.object and therefore have "asreml" as its class attribute. Thus a call to wgaim() is actually a call to wgaim.asreml(). This stipulation ensures that an asreml() call has been used to form the base model in step 2 before attempting QTL analysis. An error trapping function, wgaim.default() is called if the class of the base model is not "asreml". The second argument parentData is a data object formed from a call to wmerge(). parentData must be of class "interval" and contain the named component full.data. There are initial checks in wgaim.asreml() to ensure that parentData contains the baseModel data. The TypeI argument allows users to change the significance level for the testing of QTL effects variance component γ a . As asreml() calls output components of the fit to the screen there is an option to trace this to a file if desired.
The fitting of the working model (3) is achieved through added functionality to the asreml() call. The merged interval matrix Z a is added to the base asreml model as a contiguous block of random effects with a single variance component, γ a . If this variance component is significant then the algorithm searches for a QTL. Once a QTL is found it places the appropriate interval column of Z a into the fixed component of the base model and reiterates this process. Thus the process of finding and selecting QTL using wgaim() is automated and may require several model fits. For this reason users must be patient if they are analysing a dataset with a large number of observations or a large number of markers. Upon completion of the algorithm, summary() and print() methods are available to summarize the QTL.

Worked examples 4.1. The zinc data
The zinc data is available in wgaim as a usable date set to display the functionality of the package. The data consists of 200 observations of zinc concentration and shoot length for a DH population of wheat. There are two replicates of 90 double haploid lines from a crossing of the wheat varieties Cascades and Rac875-2 and ten each of the parents in an id variable. The data also includes a Type variable to distinguish the parents from the DH lines. The experiment also contained two blocks in a variable called Block.
A suitable base model for shoot length is explored by considering (3) without the random regression effects, Z a a, attributed to genetic markers/intervals. As we are interested in the genetic variance associated with the DH lines, Type is modelled as a fixed effect (τ ) to ensure the removal of the genetic effect associated with the parents. The Block as a random effect of non-interest (u e ) and id as a set of polygenic random effects p.
R> data("zinc", package = "wgaim") R> sh.fm <-asreml(shoot~Type, random =~Block + id, data = zinc) A simple summary of the variance parameters in the model can be achieved with The summary reveals there is only a small difference between Blocks. However, the genetic variance is more than nine times the residual variance of the model making the response an ideal candidate for QTL analysis.
wgaim is prepackaged with a genetic map associated with the zinc data. The data has already been read in using read.cross() and can be loaded using R> data("raccas", package = "wgaim") Alternatively to illustrate the use of read.cross() in conjunction with this package the same data is available from the extdata directory of the package library as a CSV file. A subset of the data from the CSV file is given in Table 1. This reveals that the CSV file is in the rotated CSV format (see read.cross() from the qtl package). The genotypes are set as AA or AB and missing values are "-". Thus a call to read.cross() is R> wgpath <-system.file("extdata", package = "wgaim") R> raccas <-read.cross("csvr", file = "raccas.csv", + genotypes = c("AA", "AB"), dir = wgpath, na.strings = c("-", "NA")) R> class(raccas)  Table 1: A subset of the genetic data from the comma delimited file raccas.csv.
[1] "bc" "cross" The returned object has the required class structure and is converted to an "interval" object using R> raccas <-cross2int(raccas, missgeno = "Mart", id = "id", rem.mark = TRUE) R> summary(raccas) R> class(raccas) [1] "bc" "cross" "interval" As coincident markers are omitted from the map it is written to a file, "dummy.csv", and read back in using using read.cross() to allow a re-estimation of genetic information for the reduced map. The summary shows there is total of 468 markers across 40 linkage groups. It also reveals that just over 5% of markers were missing and imputed using the rules of Martinez and Curnow (1992). The classes of raccas and their ordering is retained and it now also inherits the class "interval" for use with functions in wgaim.
The genetic "interval" data can now be merged with the phenotypic zinc data using R> raccasM <-wmerge(raccas, zinc, by = "id") R> names(raccasM) [1] "geno" "pheno" "rf" "pheno.dat" "full.data" This newly merged data retains the same classes as raccas and adds a named component "pheno.dat" containing the phenotypic data only and "full.data" containing the essential merging of the phenotypic zinc data with all the chromosomal "intval" components of the genotypic raccas data.
With this newly merged data a QTL analysis is simply R> zn.qtl <-wgaim(sh.fm, parentData = raccasM, na.method.X = "include") R> summary(zn.qtl, raccasM) The analysis reveals four significant QTL in four linkage groups. Verbyla et al. (2007) recommends the use of p-values, rather than the commonly used LOD scores, as the overall test of significance for each of the QTL. The argument LOD = TRUE can be given to summary.wgaim() if LOD scores are necessary.

Sunco-Tasman data
This example stresses the importance of modelling extraneous variation to a ensure a more appropriate QTL analysis. The Sunco-Tasman data is available in the data directory of wgaim and contains the results of a field trial conducted in the year 2000 with 175 double haploid lines from a crossing of wheat varieties Sunco and Tasman. The original field trial was arranged in a 31 rows by 12 columns with two replicates of each line. A milling experiment was then performed which replicated 23% of the field samples producing 456 samples milled over 38 mill days with 12 samples per day. The focus is on the trait milling yield.
components of the fixed model are mean centred covariates of Millord and Row that capture the natural linear trends that occur in the samples across milling order on any given day and across rows in the field. The summary reveals a large genetic variance component. For comparison a NULL model (no extraneous effects) is also fitted.
Each of these QTL models can be summarised visually using link.map(). In this case it calls the method link.map.wgaim() to plot the QTL on the genetic map. Multiple models or traits can be handled through link.map.default(). For example, Figure 3 is produced with R> link.map.default(list(st.qtlF, st.qtlN), stmerge, marker.names = "dist", + cex = 0.6, clist = list(qcol = c("red", "light blue"), mcol = "red", + tcol = c("red", "light blue")), trait.labels = c("Full", "Null")) This QTL plotting procedure is highly flexible to user colour changes. Through an argument clist it allows the user to specify the QTL colour between markers, the colour of the flanking QTL marker names, the colour of the trait names and the rest of the marker names. If no colours are chosen qcol and tcol defaults to rainbow(n) where n is the number of traits. The QTL map reveals that an extra six QTL were detected in the full model compared to the null model, highlighting the importance of modelling extraneous variation appropriately in QTL analyses.
From a statistical standpoint the QTL selected across the genome cannot be expected to be orthogonal. Thus the introduction of the next QTL in the forward selection process will inevitably affect the significance of the previously selected QTL. A post diagnostic evaluation of the QTL p-values in the forward selection process can be displayed using R> tr(st.qtlF, iter = 1:10, digits = 3)     for their significance as well as plotted on a linkage map for visual inspection of their location on the genome.
Currently only QTL analysis of univariate traits is possible with wgaim. However, a multivariate version of the WGAIM algorithm is being researched and preliminary papers have been submitted . The software implementation of this multivariate approach is currently being tested. Research is currently being conducted to determine the inclusion of higher order effects such as epistatic interactions into the WGAIM approach. We are hopeful that these new approaches will be implemented in future releases of wgaim.