Exploring Diallelic Genetic Markers : The HardyWeinberg Package

Testing genetic markers for Hardy-Weinberg equilibrium is an important issue in genetic association studies. The HardyWeinberg package offers the classical tests for equilibrium, functions for power computation and for the simulation of marker data under equilibrium and disequilibrium. Functions for testing equilibrium in the presence of missing data by using multiple imputation are provided. The package also supplies various graphical tools such as ternary plots with acceptance regions, log-ratio plots and Q-Q plots for exploring the equilibrium status of a large set of diallelic markers. Classical tests for equilibrium and graphical representations for diallelic marker data are reviewed. Several data sets illustrate the use of the package.


Introduction
The HardyWeinberg package (?) consists of a set of tools for analyzing diallelic genetic markers in the R environment (?), and is particularly focused on the graphical representation of their (dis)equilibrium condition in various ways. The package is mainly aimed at researchers working in the fields of genetics, statistics, epidemiology, bio-informatics and biostatistics and is available from the Comprehensive R Archive Network (CRAN) at http: //CRAN.R-project.org/package=HardyWeinberg. This paper describes the state of the art of version 1.6.0 of the package. If you appreciate this software and wish to cite it, please cite the corresponding paper in the Journal of Statistical Software (?). The structure of this paper is as follows. In Section 2 we briefly introduce Hardy-Weinberg equilibrium. Section 3 reviews the classical statistical tests and power computation for Hardy-Weinberg equilibrium. Section 4 briefly presents the X-chromosomal tests for equilibrium. Section 6 treats graphical representations of Hardy-Weinberg equilibrium for sets of markers. Section 7 is an example session showing how to analyze genetic markers with the functions of the package. Finally, a discussion (Section 8) with some comments on related packages completes the paper.

Hardy-Weinberg equilibrium
A diallelic genetic marker with alleles a and b with respective frequencies p and q (p + q = 1) is said to be in Hardy-Weinberg equilibrium if the relative genotype frequencies f AA , f AB and f BB are given by p 2 , 2pq and q 2 respectively. This law, independently formulated by ? and ?, is a fundamental principle of modern genetics (?). The term "Hardy-Weinberg equilibrium" was proposed by ?. The law is easily extended to a system with multiple alleles A 1 , . . . A k with frequencies p 1 , . . . , p k , giving genotype frequencies p 2 i for homozygotes and 2p i p j for heterozygotes. An alternative formulation of the law for the diallelic case is obtained by squaring the heterozygote frequency: (1) Hardy-Weinberg equilibrium (HWE) is achieved in one generation of random mating. In the absence of disturbing forces (migration, mutation, selection, among other possibilities) the law predicts that genotype and allele frequencies will remain in their equilibrium state over the generations. We refer to genetic textbooks (??) for a more detailed treatment of the long list of assumptions that underlie HWE. The law plays an important role in the context of genetic association studies for various reasons. Disequilibrium may be the result of genotyping error, most typically the confusion of heterozygotes and homozygotes. Tests for HWE may thus help to detect (gross) genotyping error. On the other hand, disequilibrium among cases in a case-control study may be indicative of disease association. Thus, tests for HWE may also provide clues in marker-disease association studies.

Classical autosomal tests for Hardy-Weinberg equilibrium
There are several statistical tests available for investigating whether a genetic marker can be considered to be in equilibrium or not. The classical chi-square test for goodness-of-fit has been the most popular test for HWE for decades, though nowadays exact procedures are more and more often employed. A likelihood ratio test is also available. A description of the different tests is given by ?, Chapter 3. Bayesian inference for HWE (?????) is not considered here. In the following sections we summarize the chi-square test (Section 3.1), the likelihood ratio test (Section 3.2), the exact test (Section 3.3) and the permutation test (3.4), and also describe the computation of power for HWE tests (Section 3.5). Testing for HWE with missing genotype data is addressed in Section 3.6.

Chi-square test
The chi-square test is the classical test for HWE and is typically explained in genetic textbooks (??). Let n AA , n AB and n BB represent the observed genotype counts, and e AA = np 2 , e AB = 2npq and e BB = nq 2 the expected genotype counts under HWE. The chi-square statistic X 2 can be computed as and compared with a χ 2 1 reference distribution. Alternatively, the chi-square statistic may be a b a n AA 1 2 n AB 1 2 n A b 1 2 n AB n BB 1 2 n B 1 2 n A 1 2 n B n Table 1: Three genotype counts n AA , n AB and n BB represented in a two-way table.
a b a 2 n AA n AB n A b n AB 2 n BB n B n A n B 2 n expressed as where D = 1 2 (n AB −e AB ) indicates the deviation from independence for the heterozygote. The computation of D (or other disequilibrium statistics) is recommended because X 2 itself is not informative about the nature of disequilibrium (excess or lack of heterozygotes). A chi-square test for HWE can be carried out by using function HWChisq of the package, and supplying the vector of the three genotype counts. However, in R standard chi-square tests for independence are typically carried out on tables or matrices. If the genotype counts are re-organized in a two-way layout given in Table 1, then a standard chi-square test for independence (function chisq.test in R) applied to this table is the same as a chi-square test for HWE. The total of Table 1 is the number of individuals, and the margins are half the allele counts. If the table is multiplied by 2 then the margins of the table have a more substantive interpretation as allele counts n A = 2n AA + n AB and n B = 2n BB + n AB , and the total of the table is the total number of alleles, as shown in Table 2.
We note that, due to the multiplication by 2, the latter table has a chi-square statistic that doubles the chi-square statistic of Table 1. It is well known that the chi-square statistic is related to the sample correlation coefficient (r) between two indicator variables for the row and column categories by the expression The indicator matrix corresponding to contingency Table 2 is given in Table 3.
The patterns for aa, ab and bb in this table are repeated n AA , n AB and n BB times respectively. In this table each individual is decomposed into its two constituent genes. The indicator variables I ♀B and I ♂B show whether the corresponding individual received a b allele from their mother or their father respectively. The sample correlation coefficient between the two indicator variables is an estimate for what is known as the inbreeding coefficient in population genetics (?, Chapter 3). The inbreeding coefficient, usually denoted by f , is the probability that the pair of alleles of an individual is identical by descent. In the statistical literature, f is better known as the intraclass correlation coefficient. f can be estimated by maximum likelihood (ML) asf Maternal Paternal Individual Allele Allele

Likelihood ratio test
In general, the likelihood of a sample of genotype counts is given by the multinomial distribution L(P AA , P AB , P BB ) = n n AA , n AB , n BB P n AA AA P n AB AB P n BB BB , and the ML estimator is given by the relative sample genotype frequencies. We thus obtain Under the assumption of HWE, the likelihood is The logarithm of the likelihood ratio of the latter two is given by and the statistic G 2 = −2 ln L 0 L 1 has, asymptotically, a χ 2 1 distribution. The likelihood ratio test for HWE can be carried out using the function HWLratio of the package. Asymptotically, the likelihood ratio test is equivalent to a chi-square test for HWE.

Exact test
Exact test procedures for HWE are based on the conditional distribution of the number of heterozygotes (N AB ) given the minor allele count (N A ). This distribution was derived by ? and ? and is given by Equation 7.
The standard way to compute the p value of an exact test is to sum probabilities according to Equation 7 for all samples that are as likely or less likely than the observed sample. This way to compute the p value has been termed the SELOME p value (select equally likely or more extreme samples). The function HWExact provides the standard exact test for HWE, even though it also implements alternative definitions of the p value. In particular, the function also offers the possibility to do a one-sided test, or to use the mid p value (?). The mid p value is defined as half the probability of the observed sample plus the probabilities of all possible samples that are less likely than the observed sample. The mid p value is less conservative, has a type I error rate that is closer to the nominal level, and has been shown to have better power (?

Permutation test
Hardy-Weinberg equilibrium refers to the statistical independence of alleles within individuals. This independence can also be assessed by a permutation test, where all 2n alleles of all individuals are written out as a single sequence (E.g. AAAAABABBBAA....). This sequence is then permuted many times, and for each permuted sequence pairs of successive alleles are taken as individuals. For each permutation a test statistic (the pseudo-statistic) for disequilibrium is computed. The test statistic for the original observed sample is compared against the distribution of the pseudo-statistic, where the latter was generated under the null hypothesis. The p value of the test is calculated as the fraction of permuted samples for which the pseudo-statistic is equal to or exceeds the test statistic. Such a test is computer intensive but has the advantage that it does not rely on asymptotic assumptions. Function HWPerm performs this test.

Power calculations
The power of the chi-square test or of an exact test can be calculated if the sample size, minor allele count and significance level (α) are known, and if the degree of deviation from equilibrium (the effect size) is specified. The effect size can be specified by providing a disequilibrium parameter θ, given by When there is exact equilibrium θ = 4. The situation θ > 4 refers to heterozygote excess, and the situation θ < 4 refers to heterozygote dearth. Alternatively, the degree of disequilibrium may also be parametrized by using the inbreeding coefficient f . Under inbreeding, the population genotype frequencies are given by

Missing data
Genotype data often have missing values. If missing values are not missing completely at random, inference with respect to HWE may be biased (?). Multiple imputation (?) of missing values, taking information from allele intensities and/or neighboring markers and into account, can improve inference for HWE. Function HWMissing of the package does inference for HWE in the presence of missing data. The multiple imputation part is resolved by the package mice (?). In brief, HWMissing computes the inbreeding coefficient (see Equation 5) for each imputed data set, and combines all estimates according to Rubin's pooling rules. A confidence interval for f and a p value for a test for HWE can then be computed. Alternatively, exact inference for equilibrium when there are missings is also possible by combining the exact p values of the imputed data sets (?). An example of inference for HWE with missing values is given in Section 7.

X-chromosomal tests for Hardy-Weinberg equilibrium
Recently, ? have proposed specific tests for HWE for bi-allelic markers on the X-chromosome. These tests take both males and females into account. The X-chromosomal tests can be carried out by the same functions mentioned in the previous Section (HWChisq, HWLratio, HWExact, HWPerm) and adding the argument x.linked=TRUE to the function call. For a detailed treatment of frequentist X-chromosomal tests, see ?. The frequentist X-chromosomal procedures are omnibus tests that simultaneously test equality of allele frequencies in males and females and Hardy-Weinberg proportions in females. Recently, a Bayesian method for testing bi-allelic X-chromosomal variants has been proposed by Puig, Ginebra and Graffelman (?). Examples of frequentist and Bayesian testing of X-chromosomal markers for HWE are given below in Section 7.

HWE and gender allele frequencies
Testing HWP for a genetic variant is contingent on the assumption of equality of allele frequencies in the sexes. Likewise, a chi-square of exact test for equality of allele frequencies assumes HWP. Recently, ? proposed exact and likelihood ratio procedures that can test HWP and equality of allele frequencies (EAF) jointly or independently, using different scenarios for a bi-allelic variant. Function HWLRtest compares different scenarios with a likelihood ratio test (LRT). Puig, Ginebra and Graffelman (?) describe ten different scenarios for autosomal variants that allow for sex-specific allele frequencies and inbreeding coefficients. The different scenarios can be compared using Bayesian model selection implemented in function HWPosterior. Alternatively, function HWAIC can be used to calculate Akaike's information criterion (AIC), which can also be used to decide which model best fits the data. Examples are given in Section 7.

Graphics for Hardy-Weinberg equilibrium
Several graphics can complement statistical tests for HWE, in particular if many markers are tested simultaneously. The package HardyWeinberg provides several graphical routines which are briefly discussed in the following subsections, where we consider scatter plots (Section 6.1), ternary plots (Section 6.2), log-ratio plots (Section 6.3) and Q-Q plots (Section 6.4). We will use two data sets to illustrate the different graphics. The first data set, HapMapCHBChr1, concerns 225 single nucleotide polymorphisms (SNPs) with no missing data from chromosome 1 for a sample of 84 individuals from the Han Chinese population in Beijing, compiled from the publicly available datasets of the HapMap project (http://hapmap.ncbi.nlm.nih.gov/, ?). The second data set, Mourant, consists of the genotype counts for the mn blood group locus for 216 samples from different human populations. This data set was compiled by ?, Table  2.5. We will refer to these data sets as the HapMap and the Mourant data set respectively.

Scatter plots of genotype frequencies
Relationships between the genotype frequencies can be explored by making scatter plots of the frequencies, such as f AB versus f AA or f BB versus f AA . In these scatter plots, genetic markers tend to follow a particular curve described by the Hardy-Weinberg law. Any scatter plot of two of the three genotype frequencies will reveal structure if the law holds. In a plot of f AB versus f AA , the Hardy-Weinberg law is given by Equation 10, and in a plot of f BB versus f AA the law is described by Equation 11, These relationships are easily derived from Equation 1. Examples of both plots using the HapMapCHBChr1 data set are shown in Figure 1. Both graphs show that all samples cluster closely around the HWE curve. The function HWGenotypePlot can be used to create these plots.

The ternary plot
The Italian statistician Bruno ? represented genotype frequencies in a ternary diagram. This diagram is known as a de Finetti diagram in the genetics literature (?). The HWE condition defines a parabola in the ternary plot. A ternary plot of the genotype frequencies with the HWE parabola is an information-rich graphical display. From this plot one can recover genotype frequencies, allele frequencies, and infer the equilibrium status of a genetic marker at a glance (see Figure 2). The ternary plot is most useful for plotting data consisting of multiple samples that have all been genotyped for the same genetic marker. In that case the three vertices of the display are fully identified. An example is shown in Figure 3 where the genotype counts for the mn  The ternary plot may also be used to represent multiple markers, though this is a bit tricky because the obtained display is no longer uniquely determined. In this case, one vertex, usually the top vertex, is chosen to represent the heterozygote frequency of each marker. The two bottom vertices are used for one of the two homozygote frequencies. It is arbitrary to place aa on the right and bb on the left or the other way round. Representing multiple markers amounts to overplotting all ternary diagrams for each individual marker in such a way that the axes for the heterozygotes always coincide. Despite the indeterminacy of the homozygote vertices, the plot remains highly informative, as now minor allele frequency, genotype frequencies and equilibrium status are visualized simultaneously for many markers in just one plot. ? amplified the ternary plot by representing the acceptance regions of chisquare and exact tests inside the plot. An example with multiple markers is shown in the right panel of Figure 3. This figure shows 225 SNPs of the dataset HapMapCHBChr1. The function HWTernaryPlot of the package allows the construction of ternary plots with the equilibrium parabola and various acceptance regions.

Log-ratio plots
A vector of genotype counts (aa, ab, bb) can be seen as a composition, where these counts form parts of a whole. Compositional data analysis (?) is a branch of statistics dedicated to the analysis of compositions. Some of the tools employed in compositional data analysis such as ternary diagrams and log-ratio transformations can be useful for the analysis of genotype counts. Currently, three types of log-ratio transformations are in use: the additive log-ratio (alr) transformation, the centered log-ratio (clr) transformation and the isometric log-ratio (ilr) transformation (?). Starting with a vector of genotype counts (x = (n AA , n AB , n BB )), the log-ratio transformations for diallelic markers were given by ?, and are also detailed below: where g m (·) denotes the geometric mean of its argument. Note that there exist 3 alr and 3 ilr transformations depending on which genotype count is used as the divisor in the log-ratios. We will use the first of the three ilr transformations, because the isometric log-ratio transformation yields a space with an orthonormal basis, and because HWE is in these coordinates represented by a simple horizontal line. With this transformation, HWE implies that the second ilr coordinate is constant (− 2/3 ln (2)) and the first coordinate is √ 2 times the log odds of the allele frequency. The package includes some standard routines for computing additive, centered and isometric log-ratio coordinates for vectors of genotype counts (HWAlr, HWClr and HWIlr), and also graphical routines that display markers in log-ratio coordinates (HWAlrPlot, HWClrPlot and HWIlrPlot). HWE is represented in log-ratio coordinates by a perfect linear relationship between the first and second log-ratio coordinate. Examples of the log-ratio plots for the Mourant data and the HapMap data are given in Figure 4.

Q-Q plots
Genetic association studies nowadays investigate many markers for their possible relation with diseases. The equilibrium status of the markers is important, since deviation from HWE may be indicative of genotyping error. Moreover, disequilibrium for cases in a case-control study is indicative for disease association. Given that so many markers are tested, it is cumbersome to do this all in a numerical manner only, and it is known beforehand that false positives will arise. Even if we find that 5% of the markers is significant when we use a significance level of α = 0.05, this does not imply that the database as a whole can be considered to be "in equilibrium". The distribution of the test results (chi-square statistics or p values) then becomes interesting to look at. One way to do this is to compare the sample percentiles of the chi-square statistics of all markers with the theoretical percentiles of a χ 2 1 distribution in a chi-square quantile-quantile plot (Q-Q plot). For exact tests, Q-Q plots of the p values are used. Often the uniform distribution is chosen as the reference distribution. However, with discrete data the p value distribution under the null is not uniform. The function HWQqplot of the package plots the p values against samples from the null distribution rather than the  uniform distribution. The function takes into account that sample size and allele frequency can vary over markers. Figure 5 shows Q-Q plots for the HapMap data (left panel) and also for simulated data under moderate inbreeding (right panel, f = 0.05). The green line is the reference line passing through the origin with slope 1. Each grey line plots a sample from the null distribution against the empirical quantiles of the p values. Deviation of the green line from the grey zone is taken as evidence that HWE is violated. The HapMap data set is seen to be in good agreement with what is expected under the null. This is not surprising, as the markers of the project undergo a quality control filter, and markers that strongly deviate from HWE (p value of an exact test < 0.001) are discarded from the project. For the dataset simulated under inbreeding, a manifest deviation from HWE is found. Q-Q plots assume independent observations. We note that this assumption will be violated if the markers under study are closely neighboring markers from the same region of a single chromosome.

An example session
This section shows the basic use of the package for testing and plotting genetic markers. We consider installation (Section 7.1), testing of markers (Section 7.2), power computations (Section 7.5), simulation of marker data (Section 7.6) and graphics for HWE (Section 7.7).

Installation
The HardyWeinberg package can be installed as usual via the command line or graphical user interfaces, e.g., the package can be installed and loaded by:

Testing autosomal markers for HWE
We show how to perform several classical tests for Hardy-Weinberg equilibrium. As an example we use a sample of 1000 individuals genotyped for the mn blood group locus described by ?, Table 2.4. We store the genotype counts (298, 489 and 213 for mm, mn and nn respectively) in a vector x: This shows that the chi-square statistic has value 0.179, and that the corresponding p value for the test is 0.6723. Taking Taking a significance level of α = 0.05, we do not reject HWE for the mn locus. When verbose is set to FALSE (default) the test is silent, and HW.test is a list object containing the results of the test (chi-square statistic, the p value of the test, half the deviation from HWE (D) for the heterozygote (D = 1 2 (f AB − e AB )), the minor allele frequency (p) and the inbreeding coefficient f). By default, HWChisq applies a continuity correction. This is not recommended for low minor allele frequencies. In order to perform a chi-square test without Yates' continuity correction, it is necessary to set the cc parameter to zero: R> HW.test <-HWChisq(x, cc = 0, verbose = TRUE) Chi-square test for Hardy-Weinberg equilibrium (autosomal) Chi2 = 0.2214896 DF = 1 p-value = 0.6379073 D = -3.69375 f = 0.01488253 The test with correction gives a smaller χ 2 -statistic and a larger p value in comparison with the ordinary χ 2 test. The likelihood ratio test for HWE can be performed by typing

R> HW.lrtest <-HWLratio(x, verbose = TRUE)
Likelihood ratio test for Hardy-Weinberg equilibrium G2 = 0.2214663 DF = 1 p-value = 0.637925 Note that the G 2 -statistic and the p value obtained are very close to the chi-square statistic and its p value. An exact test for HWE can be performed by using routine HWExact.

R> HW.exacttest <-HWExact(x, verbose = TRUE)
Haldane Exact test for Hardy-Weinberg equilibrium (autosomal) using SELOME p-value sample counts: nMM = 298 nMN = 489 nNN = 213 H0: HWE (D==0), H1: D <> 0 D = -3.69375 p-value = 0.6556635 The exact test leads to the same conclusion, we do not reject HWE (p value = 0.6557). Both one-sided and two-sided exact tests are possible by using the argument alternative, which can be set to "two.sided", "greater", or "less". Three different ways of computing the p value of an exact test are implemented, and can be specified by the pvaluetype argument, which can be set to dost (double one-sided tail probability), selome (sum equally likely or more extreme) or midp (the mid p value). The exact test is based on a recursive algorithm. For very large samples, R may give an error message "evaluation nested too deeply: infinite recursion". This can usually be resolved by increasing R's limit on the number of nested expressions with options(expressions = 10000) prior to calling HWExact. See ?HWExact for more information on this issue. The permutation test for HWE is activated by: R> set.seed(123) R> HW.permutationtest <-HWPerm(x, verbose = TRUE) Permutation test for Hardy-Weinberg equilibrium Observed statistic: 0.2214896 17000 permutations. p-value: 0.6551765 and the number of permutations can be specified via the nperm argument. By default the chi-square statistic will be used as the test statistic, but alternative statistics may be supplied by the FUN argument.
All routines HWChisq, HWExact, HWLratio and HWPerm assume that the data are supplied as a vector of genotype counts listed in order (aa, ab, bb). The genotype counts may be specified in a different other, but in that case the elements of the count vector must be appropriately labeled. E.g., the HWChisq function may also be called with: The mn data concern a large sample (n = 1000) with an intermediate allele frequency (p = 0.4575), for which all test results closely agree. For smaller samples and more extreme allele frequencies, larger differences between the tests are typically observed.
We also indicate how to test for HWE when there is missing genotype data. We use the data set Markers for that purpose. This gives a significant result (p value = 0.0032). If data can be assumed to be missing completely at random (MCAR), then we may impute missings by randomly sampling the observed data. This can be done by supplying the method = "sample" argument, and we create 50 imputed data sets (m = 50). As could be expected, the conclusion is the same: there is significant deviation from HWE (p = 0.0022). It will make more sense to take advantage of variables that are correlated with SNP1, and use multiple imputation of the missings of SNP1 using a multinomial logit model. The multinomial logit model will be used when we set method = "polyreg" or leave the method argument out, since "polyreg" is the default for imputation of factor variables by means of a multinomial logit model used by package mice. We test SNP1 (with missings) for HWE, using a multinomial logit model to impute SNP1 using information from the allele intensities iG and iT and the neighboring markers SNP2 and SNP3.

R> data(Markers
R> set.seed (123 Autosomal tests for HWP assume equality of allele frequencies in the sexes. When sex is taken into account, several scenarios are possible. The function HWPosterior can be used to perform Bayesian model selection using the posterior probability of each scenario. We consider an example using an SNP of the JPT sample taken from the 1000G project. The results show that for this variant, equality of allele frequencies in the sexes and HWP for both sexes (model M 11 ) is the model with the largest probability. For more accurate results, higher precision of posterior probabilities can be obtained by specifying precision=0.005, at the expense of increasing the computation time.
We analyse the same variant by calculating the AIC for each scenario. This is achieved by R> data(JPTsnps) R> AICs <-HWAIC (JPTsnps[1,1:3 In this case, the AIC criterion identifies the same M 11 model as the best fitting model.

Testing X-chromosomal markers for HWE
We show here how to perform HWE tests for X-chromosomal markers. We use a vector of 5 elements, containing male and female genotype counts. Note that the test including males is significant, whereas the test excluding males is not.
The exact test for HWE for an X-chromosomal marker can be performed by adding the x.linked=TRUE option:

R> HWExact(SNP1,x.linked=TRUE)
Graffelman-Weir exact test for Hardy-Weinberg equilibrium on the X-chromosome using SELOME p-value Sample probability 5.682963e-05 p-value = 0.02085798 which gives a p-value similar to the χ 2 test. When the mid p-value is used we obtain R> HWExact(SNP1,x.linked=TRUE,pvaluetype="midp") Graffelman-Weir exact test for Hardy-Weinberg equilibrium on the X-chromosome using MID p-value Sample probability 5.682963e-05 p-value = 0.02082957 These exact tests show that the joint null of Hardy-Weinberg proportions and equality of allele frequencies has to be rejected. An exact test using the females only gives again a non-significant result:

R> HWLratio(SNP1,x.linked=TRUE)
Likelihood ratio test for Hardy-Weinberg equilibrium for an X-linked marker G2 = 7.693436 DF = 2 p-value = 0.02134969 Finally, a summary of all frequentist X-chromosomal tests is obtained by

R> AFtest(SNP1)
Fisher Exact test for equality of allele frequencies for males and females. For this SNP, there is a significant difference in allele frequency between males and females.
Puig, Ginebra and Graffelman (?) have proposed a Bayesian test for HWE for variants on the X-chromosome which is implemented in the function HWPosterior.
A Bayesian analysis of the same SNP is obtained by:

R> HWPosterior(SNP1,x.linked=TRUE)
Bayesian test for Hardy-Weinberg equilibrium of X-chromosomal variants. and shows that a model with Hardy-Weinberg proportions for females and different allele frequencies for both sexes has the largest posterior probability, and the largest Bayes factor.

Testing sets of markers for HWE
Functions HWCHisq, HWLratio, HWExact, HWPerm test a single diallelic marker for HWE. Large sets of markers can be tested most efficiently with the functions HWChisqStats for the chi-square test, and with HWExactStats for the exact tests. Both these functions allow for Xlinked markers via the x.linked argument. Exact tests that rely on exhaustive enumeration are slow in R, and HWExactStats now uses by default faster C++ code generously shared by Christopher Chang. The same C++ code is used in the current version (2.0) of Plink (?).

Power computation
Tests for HWE have low power for small samples with a low minor allele frequency, or samples that deviate only moderately from HWE. It is therefore important to be able to compute power. The function HWPower can be used to compute the power of a test for HWE. If its argument θ is set to 4 (the default value), then the function computes the type I error rate for the test. Function mac is used to compute the minor allele count. E.g.,: R> x <-c(MM = 298, MN = 489, NN = 213) R> n <-sum(x) R> nM <-mac(x) R> pw4 <-HWPower(n, nM, alpha = 0.05, test = "exact", theta = 4, + pvaluetype = "selome") R> print(pw4) [1] 0.04822774 R> pw8 <-HWPower(n, nM, alpha = 0.05, test = "exact", theta = 8, + pvaluetype = "selome") R> print(pw8) [1] 0.9996853 These computations show that for a large sample like this one, the type I error rate (0.0482) is very close to the nominal rate, 0.05, and that the standard exact test has good power (0.9997) for detecting deviations as large θ = 8, which is a doubling of the number of heterozygotes with respect to HWE. Type I error rate and power for the chi-square test can be calculated by setting test="chisq". With the allele frequency of this sample (0.5425), θ = 8 amounts to an inbreeding coefficient of -0.1698.

Simulating data
The package HardyWeinberg allows for the simulation of genetic markers under equilibrium and disequilibrium conditions. This enables the user to create simulated data sets that match the observed data set in sample size and allele frequency. The comparison of graphics and statistics for observed and simulated datasets is helpful when assessing the extent of HWE for a large set of markers. We simulate m = 100 markers for n = 100 individuals by taking random samples from a multinomial distribution with θ AA = p 2 , θ AB = 2pq, and θ BB = q 2 . This is done by routine HWData, which can generate data sets that are in or out of Hardy-Weinberg equilibrium. Routine HWData can generate data that are in exact equilibrium (exactequilibrium = TRUE) or that are generated from a multinomial distribution (default). The markers generated by HWData are independent (there is no linkage disequilibrium). HWData returns a list with both the matrix of genotype counts Xt and the matrix with genotype compositions Xc with the relative frequencies of aa, ab and bb. Routine HWData can simulate genotype counts under several conditions. A fixed allele frequency can be specified by setting pfixed = TRUE, and setting p to a vector with the desired allele frequencies.
Sampling is then according to Levene-Haldane's exact distribution in Equation 7. If pfixed is FALSE, the given vector p of allele frequencies will be used in sampling from the multinomial distribution. If p is not specified, p will be drawn from a uniform distribution, and genotypes are drawn from a multinomial distribution with probabilities p 2 , 2pq and q 2 for aa, ab and bb respectively. It is also possible to generate data under inbreeding, by specifying a vector of inbreeding coefficients f. We illustrate the use of HWData by simulating several data sets as shown below. Each simulated dataset is plotted in a ternary diagram in Figure 7 in order to show the effect of the different simulation options. We subsequently simulate 100 markers under HWE with allele frequency 0.5 (X1), 100 markers under HWE with a random uniform allele frequency (X2), 100 markers under inbreeding (f = 0.5) with allele frequency 0.5 (X3), 100 markers under inbreeding (f = 0.5) with a random uniform allele frequency (X4), 100 markers with fixed allele frequencies of 0.2, 0.4, 0.6 and 0.8 (25 each, X5) and 100 markers in exact equilibrium with a random uniform allele frequency (X6).  (e) sampling from the Levene-Haldane distribution with fixed allele frequencies, (f) a data set in exact equilibrium with a uniform allele frequency. Red points represent markers that are significant in a chi-square test for HWE, green points represent non-significant markers.

Graphics for HWE
Genetic association studies, genome-wide association studies in particular, use many genetic markers. In this context graphics such as ternary plots, log-ratio plots and Q-Q plots become particularly useful, because they can reveal whether HWE is a reasonable assumption for the whole data set. We begin to explore the Han Chinese HapMap data set by making a ternary plot shown in Figure 8. For large databases of SNPs, drawing the ternary plot can be time consuming. Usually the matrix with genotype counts contains several rows with the same counts. The ternary plot can be constructed faster by plotting only the unique rows of the count matrix. Function UniqueGenotypeCounts extracts the unique rows of the count matrix and also counts their frequency. Figure 8 shows 10 significant SNPs (two significant markers overlap). A ternary plot with the acceptance region of the exact test is shown in the right panel of Figure 8. This plot only shows 4 significant markers, and illustrates that the exact test is more conservative. A log-ratio plot of the same data was already shown in Figure 4, and can be created with HWIlrPlot(HapMapCHBChr1). We proceed to make a Q-Q plot of the exact p values. At the same time, we construct a simulated database that matches the HapMapCHBChr1 database in allele frequency distribution. This is achieved by setting argument p of HWData equal to the allele frequencies of the observed data, where the latter are computed with function af. R> HWQqplot(HapMapCHBChr1) R> dev.off() R> set.seed(123) R> SimulatedData <-HWData(nm = 225, n = 84, p = af(HapMapCHBChr1))$Xt R> HWQqplot(SimulatedData) The Q-Q plots in Figure 9 show that both the HapMap dataset and its simulated counterpart are in agreement with HWE.

Discussion
The package HardyWeinberg offers functions and graphics for analyzing the Hardy-Weinberg equilibrium status of diallelic genetic markers. There are several other packages for the R environment that implement functionality for investigating genetic markers for HWE. The package genetics by ? offers data structures for genetic markers, and also includes several functions for testing markers for HWE and for linkage equilibrium. Bayesian tests for HWE are implemented in the package HWEBayes of ? and the package HWEintrinsic of ?. A loglinear modeling approach to HWE is available in the package hwde from ?. The PLINK software by ? is a standard in genetic data analysis, and can interact with R by means of the package Rserve (?).
We briefly enumerate and comment some features of the HardyWeinberg package not provided by the aforementioned R packages: the package provides several graphics for HWE (ternary plots with acceptance regions, log-ratio plots and Q-Q plots against the truly null distribution). These graphics are useful for analyzing datasets of multiple markers (e.g., a set of markers used in a candidate gene study, or the study of a specific genomic region), and can shed light on the question if the HWE assumption is tenable for the dataset as a whole. The functions provided for the simulation of marker data under HWE (and under disequilibrium) are also useful in this respect. They allow to create datasets that are similar to the observed data in terms of sample size and allele frequency distribution. The comparison of HWE graphics for simulated and observed data can help to rule out or confirm the HWE assumption. Functions for power calculation make it possible to compute the power to detect deviation from HWE for the data at hand. The exact tests of the package are apparently the only ones available that implement several types of p values, and HardyWeinberg is apparently the only software package that performs inference for HWE with missing genotype information using multiple imputation. Future versions of the package may incorporate functions for testing for HWE with multiple alleles. All tests of the package assume homogeneous samples of individuals from one population. Testing for HWE with individuals from different populations (stratification) may also be addressed in future versions of the package.