POPS : A Software for Prediction of Population Genetic Structure Using Latent Regression Models

The software POPS performs inference of population genetic structure using multilocus genotypic data. Based on a hierarchical Bayesian framework for latent regression models, POPS implements algorithms that improve estimation of individual admixture proportions and cluster membership probabilities by using geographic and environmental information. In addition, POPS deﬁnes ancestry distribution models allowing its users to forecast admixture proportion and cluster membership geographic variation under changing environmental conditions. We illustrate a typical use of POPS using data for an alpine plant species, for which POPS predicts changes in spatial population structure assuming a particular scenario of climate change.

Population genetic structure is commonly estimated by identifying genetic clusters defined as genetically divergent groups of individuals that arise from isolation of populations, and by computing individual membership probabilities for each genetic cluster (Davies, Villablanca, and Roderick 1999;Pritchard, Stephens, and Donnelly 2000). Because genetic ancestry can be shared among several clusters, another way to estimate population structure is to infer admixture proportions representing the relative contributions of distinct ancestral populations to a genome. Identifying genetic clusters based on multilocus genetic data can be viewed as an instance of unsupervised learning based on multivariate categorical variables. The first efforts to infer genetic clusters and individual admixture proportions using Bayesian modeling date back to the introduction of the computer programs STRUCTURE and PARTITION (Pritchard et al. 2000;Dawson and Belkhir 2001). Although STRUCTURE algorithms and their derivatives were widely used to study the influence of landscape features on evolutionary processes, these approaches did not incorporate information on environmental variables.
In this article, we introduce POPS, a software that implements Bayesian clustering algorithms based on genetic, geographic and environmental variables. The principle of POPS is that individuals sharing similar environmental conditions and geographically close to each other are also likely to share genetic ancestry. To achieve this objective POPS assigns individuals or genes to genetic groups after modeling the effects of geography and environment on individual membership and admixture proportions. POPS is based on latent regression models that consider individual ancestry as a hidden response variable regressed on geographical and environmental covariates (e.g., Bandeen-Roche, Miglioretti, Zeger, and Rathouz 1997;Chung, Flaherty, and Schafer 2006;Linzer and Lewis 2011). Hidden regression models are used to obtain predictions of cluster membership and admixture proportions from geographic and environmental variables. In the context of climate change, POPS can be used to forecast changes in population genetic structure of species in response to global warming (Jay et al. 2012).
POPS is implemented in the C++ programming language and can be run from a commandline engine or a graphical user interface. The program takes input data files in a format compatible with existing clustering algorithms like STRUCTURE (Pritchard et al. 2000) and TESS (Chen, Durand, Forbes, and François 2007). POPS returns textual and graphical results of inferred and predicted membership probabilities and admixture coefficients, allele frequencies in the estimated clusters, and values of criteria for model selection. In Sections 2 and 3, we describe the hierarchical models used by POPS and their implementation using Markov chain Monte Carlo (MCMC) algorithms. In Section 4, we explain how to use the software from its graphical user interface and from its command-line engine. In Section 5, we illustrate the use of POPS on an example data set and show the main features of our program.

Models
We consider a data matrix, x, with N rows and L columns that records genotypic data for a sample of N individuals genotyped at L loci. For haploid species, each entry of x corresponds to the occurrence of a given allele encoded as a categorical variable. In polyploid species there are A × N (A ≥ 2) rows instead of N rows in the data matrix. Each copy corresponds to one of the A chromosomes carried by an organism. To each sampled individual corresponds a geographic sample site where coordinates are recorded. In addition to the genetic and geographical data, we assume that a set of environmental covariates are measured at each sample site. We denote byX S the geographic coordinates of a site, and byX E the subset of measured environmental variables. We denote byX the N × (D + 1) design matrix containing the value 1 in its first column plus D geographical and environmental covariates in its subsequent columns.

Objectives of POPS models
The models implemented in the program POPS fall into two main classes according to whether or not genetic admixture is considered. A model without admixture assumes that each individual originates from a unique ancestral group whereas a model with admixture assumes that individuals may share ancestry from more than one source population. In statistical terms, individual membership to a genetic cluster and admixture proportions correspond to nonobserved random variables. For readers not familiar with Bayesian inference of population structure, we recommend to read Pritchard et al. (2000) where the software STRUCTURE is introduced and the standard Bayesian model of population genetic structure is carefully explained.
Assuming that there are K ancestral groups -also called genetic clusters -models without admixture infer latent group labels corresponding to the membership of each individual, z i , of one of the K clusters. As with the program STRUCTURE, POPS estimates a matrix of ancestral allele frequencies p = (p k j ), k = 1, . . . , K, = 1, . . . , L and j = 1, . . . , J . Each entry of p corresponds to the frequency of allele j at locus in population k, and J is the number of distinct alleles observed at locus .
Models with admixture suppose that each individual genome shares ancestry from K ancestral clusters. Admixture models estimate a matrix, q, where each element, q ik , i = 1, . . . , N , k = 1, . . . , K, corresponds to the proportion of individual i's genome that originates from cluster k. In admixture models, POPS estimates cluster labels, z ( ,a) i , for each allele copy, a = 1, . . . , A, at each locus, = 1, . . . , L.
POPS provides estimates and predictions for the hidden variables z and q. Compared to existing software packages, the basic principle of POPS is that the inclusion of geographical and environmental variables improves the accuracy of inference when there is genuine association between those variables and population genetic structure. Using latent correlation models and posterior predictive analysis, POPS additionally enables building geographic maps of ancestry distribution under various scenarios of environmental change.

Models without admixture
Let us denote by x i = x ( ,a) i the multilocus genotype obtained by concatenating genotypes at all loci for individual i. Models without admixture are based on a refinement of latent class models that incorporates geographical and environmental covariates in mixture models. In latent class models, the data are modeled using mixtures of K class-specific distributions (1) In POPS, P(x i |z i = k) integrates over the allele frequencies p, and the conditional distribution is defined as in the STRUCTURE model (Pritchard et al. 2000) In other words, the allele frequencies at each locus are assumed to be independent within each ancestral group, and the distribution of allele counts at each locus is assumed to be multinomial with frequency p k . .
POPS incorporates knowledge on geographical and environmental covariates,X, by considering the following distribution More specifically POPS considers a particular class of latent class models called concomitantvariable latent class or latent class feed-forward models (Dayton and Macready 1988;Vermunt and Magidson 2003). Such models were implemented in the R (R Core Team 2015) package poLCA (Linzer and Lewis 2011). They assume that the covariates have no influence on the distribution of the data in a given class, i.e., the right term in the previous sum can be simplified as In POPS, the conditional probabilities P(z i = k|X) are computed using a probit regression model (Jay, François, and Blum 2011). To emphasize geographical and environmental covariates, we denote byX S the subset of geographic covariates and byX E the subset of environmental covariates. In the probit model, the cluster K is considered as a reference cluster, and all regression coefficients are estimated with respect to this reference. The probit model considers vectors of unobserved continuous variables c i = (c i,1 , . . . , c i,K−1 ) as response variables in K − 1 regression equations (Albert and Chib 1993) Id K−1 is the identity matrix, β E k and β S k are vectors of regression parameters, and f is a polynomial function of degree lower than 3. The function f represents a trend surface for the unobserved variables c i,k . Its introduction is motivated by the definition of admixture models for which the estimation of individual ancestry coefficients is performed by using trend surfaces and universal kriging models (see next section and Durand, Jay, Gaggiotti, and François 2009). The variable z i can be obtained from the vector c i as follows Since the probability of having more than one cluster label maximizing c i,k is equal to zero, the definition of z i is unambiguous. Note that equivalent clustering solutions can be obtained from the models after any permutation of group labels.
Finally, the posterior distribution in the model without admixture can be written as where P(p) and P(β) are prior distributions. The prior distribution on allele frequencies within each ancestral group is a Dirichlet distribution where λ is set to 1 by default. Note that other values of λ can be chosen by a POPS user.

Models with admixture
Models with admixture assume that the genome of each individual originates from K ancestral populations. Since there is a latent class variable for each allele at each locus, admixture models can be considered as extensions of latent class models. In the admixture models implemented in STRUCTURE, Equation 1 extends to where q ik denotes the admixture coefficient of individual i in population k.
POPS incorporates geographical and environmental covariates in admixture models as in latent class feed-forward models, which implies where P(x i |p, q) is given by Equation 7 and q i. depends on the covariatesX via the hyperparameters α i. as described in the next paragraph.
The admixture coefficients are sampled from a Dirichlet distribution where the parameters, α ik , are proportional to the expected amount of individual admixture from each ancestral group. The geographical and environmental covariatesX are incorporated by considering the parameter α as a response variable in a multivariate latent regression model (Durand et al. 2009). POPS implements the following log-normal model where y ik is a zero-mean conditional autoregressive Gaussian model (CAR ;Besag 1975;Ripley 1981;Vounatsou, Smith, and Gelfand 2000). In Equation 10 the second term represents a spatial trend that accounts for broad-scale spatial patterns. The third term corresponds to a spatially autocorrelated residual that accounts for local effects. The first term measures the effect of environmental covariates once spatial effects are removed (Lichstein, Simons, Shriner, and Franzreb 2002). In the CAR model, the distribution of y ik depends on the values of y jk at neighboring sites. The model is conditionally Gaussian with mean and variance where ρ k is the magnitude of the spatial neighborhood effect in cluster k, w ij are weights that determine the neighborhood of i and the relative effect of j on i. The parameter σ 2 k is a variance parameter for the cluster k. The neighborhood is obtained from a Dirichlet tessellation built on sample site coordinates (François, Ancelet, and Guillot 2006). The weights are functions of the great-circle distance between sites, d ij , where θ is equal to the mean value of great-circle distance between sample sites.
The posterior distribution of the multidimensional parameter (z, p, β, q, α) can be obtained as The prior distribution on p is given by Equation 6, and the conditional distribution P(z|q) is given by

Label switching
In this section, we briefly discuss label switching that is a common issue for STRUCTURE-like models. Label switching refers to the invariance of the likelihood function under relabeling of the mixture components. The likelihood is unchanged when permuting the labels (Stephens 2000). Because of this invariance to permutation, the marginal posterior distribution of an individual's membership coefficient, z i , is given by P(z i = k|x,X) = 1/K, for k = 1, . . . , K.
Label switching in POPS can be addressed using the software CLUMPP (Jakobsson and Rosenberg 2007) that provides alignments of distinct clustering results, and allows users to compare these results. Once distinct clustering solutions are aligned with CLUMPP, we recommend to use a model averaging approach based on the outputs of several MCMC runs. Averaging the prediction of membership or admixture coefficients over several posterior regions obtained with different MCMC outputs improve prediction performance.

Inference and posterior predictive simulations
In this section we describe the MCMC algorithms implemented in POPS to perform parameter inference, sample from "modes" of the posterior distribution, deal with label switching issues (identifiability), and use posterior predictive simulations to make predictions using new values of the geographical and environmental covariates.

Models without admixture
To sample from the posterior distribution P(z, p, β|x,X), POPS implements MCMC algorithms using Gibbs sampling updates.
Updating p. To update the allele frequencies p, we consider the same Gibbs sampler updating step as in the software STRUCTURE (Pritchard et al. 2000). It is performed by simulating allele frequencies as follows where n k j denotes the number of copies of allele j in population k at locus , k = 1, . . . , K, = 1, . . . , L, j = 1, . . . , J .
Updating z. Since the cluster label z can be obtained from the latent variable c in a deterministic fashion, z and c are updated simultaneously. Using Bayes' formula, the conditional distribution of (c, z) can be written as P(c, z|β, p, x,X) ∝ P(x|p, z)P(c|β,X)P(z|c).
Samples from the above conditional distribution are generated using the following rejection algorithm: Step 1. For each i, generate c i from the regression Equation 3 and determine z i with the rule described in Equation 4.
Step 2. Given z i = k from Step 1, accept the pair (c i , z i ) with probability otherwise return to Step 1. The probability P(x i |p, z i = k) can be computed from Equation 2.
Updating β. POPS uses a diffuse prior distribution β ∼ N (0, B −1 ) with B = 0. The Gibbs sampler proceeds by updating values of β using its conditional distribution (Albert and Chib 1993) where V = (Ω Ω) −1 , and Ω is the concatenation of the geographical f (X S ) and envi-ronmentalX E variables.

Models with admixture
To perform inference in admixture models, POPS uses a hybrid MCMC algorithm. Unless specified otherwise, updates are done using Gibbs samplers. Updates of p, q and z are similar to those used in the algorithm of STRUCTURE (Pritchard et al. 2000). The updates of (α, β, σ 2 , ρ) are similar to those used in the algorithm of TESS (Durand et al. 2009).
Updating p. Given x and z, p is simulated according to Equation 15 that is given for the model without admixture.
Updating q. For an individual i, the admixture coefficients are sampled from a Dirichlet distribution where m ik is the number of allele copies that originated from population k.
Updating z. A cluster label z (i,a) is simulated independently for each triplet (i, a, ) from Updating (α, y). The parameters (α ik , y ik ) are updated for each pair (i, k) using a Metropolis-Hastings algorithm. For each (i 0 , k 0 ) ∈ {1, . . . , N } × {1, . . . , K}, a new value y * i 0 k 0 is sampled from the Gaussian conditional probability distribution The latent variable y * i 0 k 0 is accepted with probability is the current value of the parameter, and Γ denotes the Gamma function. If Updating β. POPS uses a diffuse prior distribution on β, β ∼ N (0, B −1 ), with B = 0, so that the conditional posterior distribution on β is given by where V = Ω (Id − ρW )Ω −1 , Id is the identity matrix, Ω = (X E , f (X S )), and W = (w ij ).
Updating σ 2 . POPS uses a Gamma distribution for updating the hyperprior parameter φ k = 1/σ 2 k for each k in {1, . . . , K} given by where Ga(a, b) denotes the Gamma distribution with shape a and rate b.
Updating ρ. Let e = (e 1 , . . . , e N ) be the N eigenvalues of the weight matrix W , and e max the largest eigenvalue. POPS uses a uniform hyperprior distribution in the range (0, e −1 max ) on ρ. POPS updates ρ k independently for each k by using a Metropolis-Hastings step. A new value ρ * of ρ k is proposed from a Gaussian random walk with a fixed variance equal to 0.05. POPS rejects the proposed value when the value is not within the interval (0, e −1 max ). Otherwise, the program accepts it with probability

Posterior predictive simulations and model selection
An important feature of POPS is that the inferred model can be used for predicting cluster membership and admixture proportions given new values of environmental and spatial covariates. In models without admixture, membership coefficients can be obtained by sampling latent cluster labels using Equations 3 and 4, where regression coefficients are generated from their posterior distribution. For each individual, the relative frequency of predicted cluster labels provides a predicted membership probability for each cluster. Similarly in admixture models, admixture coefficients can be predicted using new values of the environmental and geographical covariates. Predicted admixture coefficients can be obtained by sampling regression coefficients from the posterior distribution, and by simulating admixture coefficients using Equations 9 and 10.

Addressing label switching and multimodality
To address label switching and multimodality, we use the following post-processing methods for the MCMC runs. First, we run several MCMC runs and we select a subset of runs minimizing the DIC (Spiegelhalter, Best, Carlin, and Linde 2002). Because of multimodality, MCMC runs may visit regions of the posterior distribution that correspond to distinct modes of the posterior distribution and are truly distinct solutions (Jakobsson and Rosenberg 2007). By selecting runs with the lowest DIC values, we discard runs that visit less interesting regions of the posterior distribution.
In STRUCTURE MCMC runs, it is generally observed that labels do not switch during a single MCMC run. When we run the MCMC algorithm several times, the replicates that we select using the DIC generally provide comparable membership or admixture coefficients after permutation of cluster labels (Jakobsson and Rosenberg 2007). We use the program CLUMPP, which implements an algorithm that deals with label switching by looking for an optimal alignment of estimates obtained from distinct MCMC runs. Then, we average the results over different MCMC runs in order to improve the prediction of cluster memberships and admixture coefficients.

Using POPS
POPS is a free software implemented in the C++ programming language, and the graphical user interface uses the library Qt (Qt Project 2015). In this section, we briefly describe its graphical user interface, and we explain the main instructions of the command-line engine.

POPS graphical user interface (GUI)
To start an analysis using the POPS GUI, the user must create a project by loading a file containing genetic data, and spatial and environmental data ( Figure 1). The data format required by POPS is similar to STRUCTURE or TESS formats. By default the genotypes should be stored on one row for an haploid individual, and two rows for a diploid individual. Other formats compatible with STRUCTURE and TESS can also be specified. In addition to geographic or environmental data, there can be additional rows or columns present in the input file for informational or other purposes. Columns must be stored in a predefined order: (1) extra columns (optional, e.g., identifier for samples, population labels), (2) qualita- Users must specify the following parameters: • model with or without admixture, • degree of the spatial trend surface (0, 1, 2, or 3), • range of values for the maximal number of clusters K, • number of runs to launch for each value of K, • number of sweeps of each run (total number of steps and number of burn-in iterations).
Optional parameters can be specified, including hyperparameters for admixture models: the scale parameter θ and initial CAR variances (one value for all clusters).

Outputs
The program POPS provides textual and graphical summaries of MCMC runs, including estimates of model parameters, log-likelihood histories of each run for convergence diagnostics, DIC values for model selection, and visualizations of vectors of posterior probabilities for cluster labels (models without admixture), or matrices of admixture coefficients (admixture models). These graphical outputs use barplot representation of ancestry coefficient matrices, as commonly displayed by other ancestry estimation programs. When geographic information Figure 2: Analysis of a Rhododendron ferrugineum L. data set using the POPS GUI. Graphical results are displayed for one run. A: a tree widget from which the user has access to project and data information, run options, textual and graphical outputs. All output results can be displayed by double-clicking on the appropriate items in the tree widget B: Log-likelihood history of the run. C, D, E: 3 graphical outputs, where colors correspond to clusters. C: Each plant is assigned to one cluster (for which the admixture coefficient is maximal) and assignments are displayed where each black point represents a sample site. D: Admixture coefficients estimated for each sample. E: Admixture coefficients predicted from environmental data for each sample. In D and E, individuals are represented as vertical lines partitioned into segments corresponding to the fraction of their genomes assigned/predicted to each genetic cluster.
is provided, POPS displays hard-clustering assignments where each individual is assigned to the cluster for which the membership or admixture coefficient is maximal. An additional feature of POPS is to compute predictions of population genetic structure for new environmental data. Supplementary scripts are provided on the POPS web page to display actual or predicted population genetic structures on geographic maps using the fields package for R (Nychka, Furrer, and Sain 2015). Finally, correlation coefficients evaluating the correlation between matrices of inferred and predicted coefficients are computed and returned to users for facilitating the interpretation of results.
Output results are accessible by double-clicking on the corresponding items in the tree widget ( Figure 2A). In addition, the "Summarize Runs" button allows the user to quickly access the summaries of all runs (K, trend degree, model used, DIC, average log-likelihood, correlation score). POPS also provides a tool to export runs to CLUMPP that can average membership and admixture coefficients estimated from multiple MCMC runs (Jakobsson and Rosenberg 2007). Finally, the summary window can be used to perform predictions. To predict membership or admixture coefficients for new values of covariates, the user needs to load a file containing new data and choose runs that will be used for predictions. Textual and graphical results of predicted coefficients are computed for each selected run, and are accessible from the tree widget ( Figure 2A). Prediction outputs from independent runs can also be exported to CLUMPP.

POPS command-line options
POPS is based on two command-line engines: one for models without admixture, and the other for models with admixture. We describe the main commands for both programs. When there are no options given to POPS, it will show its typical usage and exit. In the following the main POPS options are given. They can be specified in any order.

Required parameters:
-F File name of input data file. Folder Name of output result files (default: current folder). -orun Suffix to append to output result file names, e.g., a run number (-orun1 or -orun0001) or a specific run name (-orunAdm002).

-sp
Special data format: 1 individual = 1 row (-spy: yes, -spn: no, default). Execute pops|more and popsAdm|more or see POPS manual for extra options. The command pops (or pops.exe on the Windows operating system) runs models without admixture, whereas popsAdm (or popsAdm.exe) runs admixture models.
Assume now that the file example.txt contains an extra column (identifier for samples, -c1), 1 quantitative covariate (temperature -X1), and 2 columns for longitude and latitude. To run POPS with the same run parameters as before but using the covariates, and setting the degree of the trend surface to 1 (-T1), the command is To run the POPS admixture model using the same data and the same parameters as shown in the preceding command, we additionally specify the default value of the spatial interaction parameter (-P0.6) $ popsAdm -Fexample.txt -N268 -A1 -XL0 -X1 -T1 -L86 -K3 -D1.0 -S1000 \ > -B200 -P0.6 -r1 -c1 -iExample -oExample -orun003

Examples
In this section, we illustrate several features of POPS using a data set of 377 individuals from plant species Rhododendron ferrugineum L., with each individual genotyped at 111 loci (INTRABIODIV database, Gugerli et al. 2008). Rhododendron ferrugineum L. is a small and evergreen shrub present in European mountains. With the genetic data, we consider geographic and environmental covariates consisting of latitude, longitude, average minimum and maximum annual temperatures, average precipitations, and an additional set of topographic variables measured at each sampled site.

Estimating population genetic structure
A subset of 60 individuals were randomly chosen among the 377 samples to constitute a test set. We ran POPS on the remaining 317 plants, and we reported the results from a single run using the admixture model with K = 3 clusters. The run used 1,000 sweeps following a burn-in period of 200 sweeps. The log-likelihood function increased quickly during the run, and then reached a stationary state ( Figure 2B). Admixture coefficients were estimated for each individual and displayed in the graphical user interface in Figure 2D. Hard-clustering assignments are displayed on the tessellation built from the sample sites locations ( Figure 2C). The map shows that the 3 inferred genetic clusters correspond to three well-separated geographical regions, in the southwestern region (red cluster), central region (blue cluster) and northeastern region (green cluster). R scripts based on kriging methods allow us to display admixture coefficients spatially (Figure 3). Though substantial admixture occurs within contact zones between clusters, the results are consistent with hard-clustering assignments. To evaluate the improvement on the estimation of population genetic structure obtained with models using environmental covariates, we computed DIC for models that do not use any environmental information. We find that DIC decreases from 19565 to 19525 when adding environmental covariates.

Predicting population genetic structure based on covariates
POPS can test if a set of covariates included in a model is useful to predict population genetic structure by computing the correlation between the estimated admixture coefficients and admixture coefficients predicted from the geographical and environmental covariates (Figures 2D and E). A correlation score of ≈ 0.96, reported in the tree widget, indicates that prediction from environmental variables is accurate. When we use POPS to predict admixture coefficients from the covariates of 60 individuals contained in the test data set (and not used for inference), the correlation score is equal to 0.85 (Figure 4).

Forecasting population genetic structure under changes
According to the Intergovernmental Panel on Climate Change, environmental conditions may change drastically during the coming century (Intergovernmental Panel on Climate Change 2007). Temperatures are predicted to rise by 1.8 to 4 • C, depending on the IPCC expert projection. Precipitations are also likely to increase in several regions. These changes are now acknowledged to have an impact on species distributions and there has been increasing evidence of species' range shifts due to climate change (Parmesan and Yohe 2003). POPS provides a framework to investigate and forecast modifications in population genetic structure in response to environmental changes. We used POPS to forecast changes in population genetic structure of the species Rhododendron ferrugineum L. under a 2 • C temperature increase and a 40% augmentation in precipitation levels (see Jay et al. 2012, for a more extensive study). Admixture coefficients computed with the projected climatic variables are displayed in Fig-Figure 4: Admixture coefficients predicted for the 60 individuals contained in the test set. Admixture coefficients are computed for the 3 clusters inferred, using geographic, climatic and topographic information but no genetic data. The correlation between predicted admixture coefficients and coefficients estimated for the 60 test individuals using all environmental and genetic information is 0.85. Figure 5: Admixture coefficients predicted for Rhododendron ferrugineum L. under a global warming scenario (2 • C temperature increase and 40% precipitation increase). A: Admixture coefficients are displayed on a bar chart. B: Admixture coefficients greater than 0.5 are displayed spatially using kriging methods. ure 5. A comparison with current admixture coefficients ( Figure 2) provides evidence that the contact zones between the central and the southern cluster, and between the central and the northern cluster, shift in the northward and westward directions respectively.

Conclusion
POPS is a software package that can be used to estimate population structure from individual multilocus genotypes and external covariates without assuming predefined populations.
POPS jointly infers population genetic structure and the effect of environmental covariates on this structure. POPS users can use environmental covariates to predict genetic structure or for changing environmental conditions. POPS can be used either from a graphical user interface or from a command-line engine. R scripts for post-processing results and displaying membership or admixture coefficients are available in the POPS package. POPS can be downloaded from http://membres-timc.imag.fr/Olivier.Francois/pops.html.