BayesSUR: An R package for high-dimensional multivariate Bayesian variable and covariance selection in linear regression

In molecular biology, advances in high-throughput technologies have made it possible to study complex multivariate phenotypes and their simultaneous associations with high-dimensional genomic and other omics data, a problem that can be studied with high-dimensional multi-response regression, where the response variables are potentially highly correlated. To this purpose, we recently introduced several multivariate Bayesian variable and covariance selection models, e.g., Bayesian estimation methods for sparse seemingly unrelated regression for variable and covariance selection. Several variable selection priors have been implemented in this context, in particular the hotspot detection prior for latent variable inclusion indicators, which results in sparse variable selection for associations between predictors and multiple phenotypes. We also propose an alternative, which uses a Markov random field (MRF) prior for incorporating prior knowledge about the dependence structure of the inclusion indicators. Inference of Bayesian seemingly unrelated regression (SUR) by Markov chain Monte Carlo methods is made computationally feasible by factorisation of the covariance matrix amongst the response variables. In this paper we present BayesSUR, an R package, which allows the user to easily specify and run a range of different Bayesian SUR models, which have been implemented in C++ for computational efficiency. The R package allows the specification of the models in a modular way, where the user chooses the priors for variable selection and for covariance selection separately. We demonstrate the performance of sparse SUR models with the hotspot prior and spike-and-slab MRF prior on synthetic and real data sets representing eQTL or mQTL studies and in vitro anti-cancer drug screening studies as examples for typical applications.


Introduction
With the development of high-throughput technologies in molecular biology, the large-scale molecular characterization of biological samples has become common-place, for example by genome-wide measurement of gene expression, single nucleotide polymorphisms (SNP) or CpG methylation status. Other complex phenotypes, for example, pharmacological profiling from large-scale cancer drug screens, are also measured in order to guide personalized cancer therapies (Garnett et al. 2012; Barretina et al. 2012;Gray and Mills 2015). The analysis of joint associations between multiple correlated phenotypes and high-dimensional molecular features is challenging.
When multiple phenotypes and high-dimensional genomic information are jointly analyzed, the Bayesian framework allows to specify in a flexible manner the complex relationships between the highly structured data sets. Much work has been done in this area in recent years. Our software package BayesSUR  gathers together several models that we have proposed for high-dimensional regression of multiple responses and also introduces a novel model, allowing for different priors for variable selection in the regression models and for different assumptions about the dependence structure between responses.
Bayesian variable selection uses latent indicator variables to explicitly add or remove predictors in each regression during the model search. Here, as we consider simultaneously many predictors and several responses, we have a matrix of variable selection indicators. Different variable selection priors have been proposed in the literature. For example, Jia and Xu (2007) mapped multiple phenotypes to genetic markers (i.e., expression quantitative trait loci, eQTL) using the spike-and-slab prior and hyper predictor-effect prior. Liquet, Mengersen, Pettitt, and Sutton (2017) incorporated group structures of multiple predictors via a (multivariate) spike-and-slab prior. The corresponding R (R Core Team 2021) package MBSGS  is available from the Comprehensive R Archive Network (CRAN) at https:// CRAN.R-project.org/package=MBSGS. Bottolo et al. (2011) and Lewin et al. (2015b) further proposed the hotspot prior for variable selection in multivariate regression, in which the probability of association between the predictors and responses is decomposed multiplicatively into predictor and response random effects. This prior is implemented in a multivariate Bayesian hierarchical regression setup in the software R2HESS (Lewin, Campanella, Saadi, Liquet, and Chadeau-Hyam 2015a), available from https://www.mrc-bsu.cam.ac.uk/software/. Lee, Tadesse, Baccarelli, Schwartz, and Coull (2017) used the Markov random field (MRF) prior to encourage joint selection of the same variable across several correlated response variables. Their C-based R package mBvs (Lee, Tadesse, Coull, and Starr 2021) is available from CRAN (https://CRAN.R-project.org/package=mBvs).
For high-dimensional predictors and multivariate responses, the space of models is very large. To overcome the infeasibility of the enumerated model space for the MCMC samplers in the high-dimensional situation, Bottolo and Richardson (2010) proposed an evolutionary stochastic search (ESS) algorithm based on evolutionary Monte Carlo. This sampler has been extended in a number of situations and efficient implementation of ESS for multivariate Bayesian hierarchical regression has been provided by the C++-based R package R2GUESS (Liquet, Bottolo, Campanella, Richardson, and Chadeau-Hyam 2016). Richardson, Bottolo, and Rosenthal (2011) proposed a new model and computationally efficient hierarchical evolutionary stochastic search algorithm (HESS) for multi-response (i.e., multivariate) regression which assumes independence between residuals across responses and is implemented in the R2HESS package. Petretto et al. (2010) used the inverse Wishart prior on the covariance matrix of residuals in order to do simultaneous analysis of multiple response variables allowing for correlations in response residuals, for more moderate sized data sets.
In order to analyze larger numbers of response variables, yet retain the ability to estimate dependence structures between them, sparsity can be introduced into the residual covariances, as well as into the regression model selection. Holmes, Denison, and Mallick (2002) adapted seemingly unrelated regression (SUR) to the Bayesian framework and used a Markov chain Monte Carlo (MCMC) algorithm for the analytically intractable posterior inference. The hyper-inverse Wishart prior has been used to learn a sparser graph structure for the covariance matrix of high-dimensional variables (Carvalho, Massam, and West 2007;Wang 2010;Bhadra and Mallick 2013), thus performing covariance selection. However, these approaches are not computationally feasible if the number of input variables is very large. Bottolo, Banterle, Richardson, Ala-Korpela, Järvelin, and Lewin (2021) recently developed a Bayesian variable selection model which employs the hotspot prior for variable selection, learns a structured covariance matrix and implements the ESS algorithm in the SUR framework to further improve computational efficiency.
The BayesSUR package implements many of these possible choices for high-dimensional multiresponse regressions by allowing the user to choose among three different prior structures for the residual covariance matrix and among three priors for the joint distribution of the variable selection indicators. This includes a novel model setup, where the MRF prior for incorporating prior knowledge about the dependence structure of the inclusion indicators is combined with Bayesian SUR models (Zhao, Banterle, Lewin, and Zucknick 2021). BayesSUR employs ESS as a basic variable selection algorithm.

Models specification
The BayesSUR package fits a Bayesian seemingly unrelated regression model with a number of options for variable selection, and where the covariance matrix structure is allowed to be diagonal, dense or sparse. It encompasses three classes of Bayesian multi-response linear regression models: hierarchical related regressions (HRR, Richardson et al. 2011), dense and sparse seemingly unrelated regressions (dSUR and SSUR, Bottolo et al. 2021), and the structured seemingly unrelated regression, which makes use of a Markov random field (MRF) prior ).
The regression model is written as where Y is a n × s matrix of outcome variables with s × s covariance matrix C, X is a n × p matrix of predictors for all outcomes, B is a p × s matrix of regression coefficients, U is the matrix of residuals, vec(·) indicates the vectorization of a matrix, N (µ, Σ) denotes a normal distribution with mean vector µ and covariance matrix Σ, 0 denotes a column vector with all elements zero, ⊗ is the Kronecker product and I n is the identity matrix of size n × n.
We use a binary latent indicator matrix Γ = {γ jk } to perform variable selection. A spikeand-slab prior is used to find a sparse relevant subset of predictors that explain the variability Table 1: Nine models across three priors of C by three priors of Γ.
The precision matrix W is generally decomposed into a shrinkage coefficient and a matrix that governs the covariance structure of the regression coefficients. Here we use W = w −1 I sp , meaning that all the regression coefficients are a priori independent, with an inverse gamma hyperprior on the shrinkage coefficient w, i.e., w ∼ IGamma(a w , b w ). The binary latent indicator matrix Γ has three possible options for priors: the independent hierarchical Bernoulli prior, the hotspot prior and the MRF prior. The covariance matrix C also has three possible options for priors: the independent inverse gamma prior, the inverse Wishart prior and hyper-inverse Wishart prior. Thus, we consider nine possible models (Table 1) across all combinations of three priors for C and three priors for Γ.

Hierarchical related regression (HRR)
The hierarchical related regression model assumes that C is a diagonal matrix which translates into conditional independence between the multiple response variables, so the likelihood factorizes across responses. An inverse gamma prior is specified for the residual covariance, i.e., σ 2 k ∼ IGamma(a σ , b σ ) which, combined with the priors in (2) is conjugate with the likelihood in the model in (1). We can thus sample the variable selection structure Γ marginally with respect to C and B. For inference for this model, Richardson et al. (2011) implemented the hierarchical evolutionary stochastic search algorithm (HESS).

HRR with independent Bernoulli prior
For a simple independent prior on the regression model selection, the binary latent indicators follow a Bernoulli prior γ jk |ω jk ∼ Ber(ω j ), j = 1, · · · , p, k = 1, · · · , s, with a further hierarchical Beta prior on ω j , i.e., ω j ∼ Beta(a ω , b ω ), which quantifies the probability for each predictor to be associated with any response variable. Richardson et al. (2011) and Bottolo et al. (2011) proposed decomposing the probability of association parameter ω jk in (4) as ω jk = o k × π j , where o k accounts for the sparsity of each response model and π j controls the propensity of each predictor to be associated with multiple responses simultaneously.

HRR with MRF prior
To consider the relationship between different predictors and associate highly correlated responses with the same predictors, we set a Markov random field prior on the latent binary vector γ where G is an adjacency matrix containing prior information about similarities amongst the binary model selection indicators γ = vec(Γ). The parameters d and e are treated as fixed in the model. Alternative approaches include the use of a hyperprior on e (Stingo, Chen, Tadesse, and Vannucci 2011) or to fit the model repeatedly over a grid of values for these parameters, in order to detect the phase transition boundary for e (Lee et al. 2017) and to identify a sensible combination of d and e that corresponds to prior expectations of overall model sparsity and sparsity for the MRF graph.

Dense seemingly unrelated regression (dSUR)
The HRR models in Section 2.1 assume residual independence between any two response variables because of the diagonal matrix C in (3). It is possible to estimate a full covariance matrix by specifying an inverse Wishart prior, i.e., C ∼ IW(ν, τ I s ). To avoid estimating the dense and large covariance matrix directly, Bottolo et al. (2021) exploited a factorization of the dense covariance matrix to transform the parameter space (ν, τ ) of the inverse Wishart distribution to space {σ 2 k , ρ kl |σ 2 k : k = 1, · · · , s; l < k}, with priors Here, we assume that τ ∼ Gamma(a τ , b τ ). Thus, model (1) is rewritten as where u l = y l − Xβ l and β l is the lth column of B, so again the likelihood is factorized across responses.
Similarly to the HRR model, employing either the simple independence prior (4), the hotspot prior (5) or the MRF prior (6) for the indicator matrix Γ results in different sparsity specifications for the regression coefficients in the dSUR model. The marginal likelihood integrating out B is no longer available for this model, so joint sampling of B, Γ and C is required. However, the reparameterization of the model (8) enables fast computation using the MCMC algorithm.

Sparse seemingly unrelated regression (SSUR)
Another approach to model the covariance matrix C is to specify a hyper-inverse Wishart prior, which means the multiple response variables have an underlying graph G encoding the conditional dependence structure between responses. In this setup, a sparse graph corresponds to a sparse precision matrix C −1 . From a computational point of view, it is infeasible to specify a hyper-inverse Wishart prior directly on C −1 in high dimensions (Carvalho et al. 2007;Jones, Carvalho, Dobra, Hans, Carter, and West 2005;Uhler, Lenkoski, and Richards 2018;Deshpande, Ročková, and George 2019). However, Bottolo et al. (2021) used a transformation of C to factorize the likelihood as in Equation 8. The hyper-inverse Wishart distribution, i.e., C ∼ HIW G (ν, τ I s ), becomes in the transformed variables the scalar variance σ 2 qt and the associated correlation vector ρ qt = (ρ 1,qt , ρ 2,qt , · · · , ρ t−1,qt ) with where Q is the number of prime components in the decomposable graph G, and S q and R q are the separators and residual components of G, respectively. |S q | and |R q | denote the number of variables in these components. For more technical details, please refer to Bottolo et al. (2021).
As prior for the graph we use a Bernoulli prior with probability η on each edge E kk of the graph as in The three priors on β γ , i.e., independence (4), hotspot (5) and MRF (6) priors can be used in the SSUR model.

MCMC sampler and posterior inference
To sample from the posterior distribution, we use the evolutionary stochastic search algorithm Bottolo et al. 2011;Lewin et al. 2015b), which uses a particular form of evolutionary Monte Carlo (EMC) introduced by Liang and Wong (2000). Multiple tempered Markov chains are run in parallel and both exchange and crossover moves are allowed between the chains to improve mixing between potentially different modes in the posterior. Note that we run multiple tempered chains at the same temperature instead of a ladder of different temperatures as was proposed in the original implementations of the (H)ESS sampler in Bottolo and Richardson (2010); Bottolo et al. (2011); Lewin et al. (2015b). The temperature is adapted during the burn-in phase of the MCMC sampling.
The main chain samples from the un-tempered posterior distribution, which is used for all inference. For each response variable, we use a Gibbs sampler to update the regression coefficients vector, β k (k = 1, · · · , s), based on the conditional posterior corresponding to the specific model selected among the models presented in Sections 2.2 and 2.3. After L MCMC iterations, we obtain B (1) , · · · , B (L) and the estimate of the posterior mean iŝ where b is the number of burn-in iterations. Posterior full conditionals are also available to update σ 2 k and ρ kl for the dSUR model and σ 2 qt and ρ qt for the SSUR model. In the HRR models in Section 2.1, the regression coefficients and residual covariances have been integrated out and therefore the MCMC output cannot be used directly for posterior inference of these parameters. However, for B, the posterior distribution conditional on Γ can be derived analytically for the HRR models and this is the output for B that is provided in the BayesSUR package for HRR models.
At MCMC iteration t we also update each binary latent vector γ k (k = 1, · · · , s) via a Metropolis-Hastings sampler, jointly proposing an update for the corresponding β k . After L iterations, using the binary matrices Γ (1) , · · · , Γ (L) , the marginal posterior inclusion probabilities (mPIP) of the predictors are estimated bŷ In the SSUR models, another important parameter is G in the hyper-inverse Wishart prior for the covariance matrix C. It is updated by the junction tree sampler (Green and Thomas 2013;Bottolo et al. 2021) jointly with the corresponding proposal for σ 2 qt and ρ qt |σ 2 qt in (10). At each MCMC iteration we then extract the adjacency matrix G (t) (t = 1, · · · , L), from which we derive posterior mean estimators of the edge inclusion probabilities aŝ Note that even though a priori the graph G is decomposable, the posterior mean estimateĜ can be outside the space of decomposable models (see Bottolo et al. 2021).
The hyper-parameter τ in the inverse Wishart prior or hyper-inverse Wishart prior is updated by a random walk Metropolis-Hastings sampler. The hyper-parameter η and the variance w in the spike-and-slab prior are sampled from their posterior conditionals. For details see Bottolo et al. (2021).

The R package BayesSUR
The package BayesSUR is available from CRAN at http://CRAN.R-project.org/package= BayesSUR. This article refers to version 2.0-1. The main function is BayesSUR(), which has various arguments that can be used to specify the models introduced in Section 2, by setting the priors for the covariance matrix C and the binary latent indicator matrix Γ. In addition, MCMC parameters (nIter, burnin, nChains) can also be defined. The following syntax example introduces the most important function arguments, which are further explained below. The full list of all arguments in function BayesSUR() is given in Table 2.
BayesSUR(data, Y, X, X_0, covariancePrior, gammaPrior, nIter, burnin, nChains, ...) The data can be provided as a large combined numeric matrix [Y, X, X 0 ] of dimension n × (s + p) via the argument data; in that case the arguments Y, X and X_0 need to contain the dimensions of the individual response variables Y, predictors under selection X and fixed predictors X 0 (i.e., mandatory predictors that will always be included in the model). Alternatively, it is also possible to supply X 0 , X and Y directly as numeric matrices via the arguments X_0, X and Y. In that case, argument data needs to be NULL, which is the default.
The arguments covariancePrior and gammaPrior specify the different models introduced in Section 2. When using the Markov random field prior (6) for the latent binary vector γ, an additional argument mrfG is needed to assign the edge potentials; this can either be specified as a numeric matrix or as a file directory path leading to a text file with the corresponding information. For example, the HRR model with independent hierarchical prior in Section 2.1 is specified by (covariancePrior = "IG", gammaPrior = "hierarchical"), the dSUR model with hotspot prior in Section 2.2 by (covariancePrior = "IW", gammaPrior = "hotspot") and the SSUR model with MRF prior in Section 2.3 for example by (covariancePrior = "HIW", gammaPrior = "MRF", mrfG = "mrfFile.txt").
The MCMC parameter arguments nIter, burnin and nChains indicate the total number of MCMC iterations, the number of iterations in the burn-in period and the number of parallel tempered chains in the evolutionary stochastic search MCMC algorithm, respectively. See Section 2.4 and, e.g., Bottolo and Richardson (2010) for more details on the ESS algorithm.
The main function BayesSUR() is used to fit the model. It returns an object of S3 class 'BayesSUR' in a list format, which includes the input parameters and directory paths of output text files, so that other functions can retrieve the MCMC output from the output files, load them into R and further process the output for posterior inference of the model output.
In particular, a summary() function has been provided for 'BayesSUR' class objects, which is used to summarize the output produced by BayesSUR(). For this purpose, a number of predictors are selected into the model by thresholding the posterior means of the latent indicator variables. By default, the threshold is 0.5, i.e., variable j is selected into the model for response k ifγ jk > 0.5. The summary() function also outputs the quantiles of the conditional predictive ordinates (CPO, Gelfand 1996), top predictors with respect to average marginal Argument Description Numeric matrix or indices with respect to the argument data for the responses. X Numeric matrix or indices with respect to the argument data for the predictors.

X_0
Numeric matrix or indices with respect to the argument data for predictors forced to be included (i.e., that are not part of the variable selection procedure). Default is NULL. outFilePath Directory path where the output files are saved. covariancePrior Prior for the covariance matrix; "IG": independent inverse gamma prior, "IW": inverse Wishart prior, "HIW": hyper-inverse Wishart prior (default). gammaPrior Prior for the binary latent variable Γ; "hierarchical": Bernoulli prior, "hotspot": hotspot prior (default), "MRF": Markov random field prior. mrfG A numeric matrix or a path to the file containing (the edge list of) the G matrix for the MRF prior on Γ. Default is NULL. nIter Total number of MCMC iterations. burnin Number of iterations in the burn-in period. nChains Number of parallel chains in the evolutionary stochastic search MCMC algorithm. gammaSampler Local move sampler for the binary latent variable Γ, either (default) "bandit" for a Thompson-sampling inspired sampler or "MC3" for the usual M C 3 sampler. gammaInit Γ initialization to either all zeros ("0"), all ones ("1"), MLE-informed ("MLE") or (default) randomly ("R"). hyperpar A list of named prior hyperparameters to use instead of the default values, including a_w, b_w, a_sigma, b_sigma, a_omega, b_omega, a_o, b_o, a_pi, b_pi, nu, a_tau, b_tau, a_eta and b_eta. They Maximum threads used for parallelization. Default is 1. output_* Allow (TRUE) or suppress (FALSE) the output for *; possible outputs are Γ, G, B, σ, π, tail (hotspot tail probability, see Bottolo and Richardson 2010) or model_size. Default is all: TRUE. tmpFolder The path to a temporary folder where intermediate data files are stored (will be erased at the end of the MCMC sampling). It is specified relative to outFilePath.

Function Description
BayesSUR() Main function of the package. Fits any of the models introduced in Section 2. Returns an object of S3 class 'BayesSUR', which is a list which includes the input parameters (input) and directory paths of output text files (output), as well as the run status and function call. print() Print a short summary of the fitted model generated by BayesSUR(), which is an object of class 'BayesSUR'. summary() Summarize the fitted model generated by BayesSUR(), which is an object of class 'BayesSUR'. coef() Extract the posterior mean of the coefficients of a 'BayesSUR' class object. fitted() Return the fitted response values that correspond to the posterior mean of the coefficients matrix of a 'BayesSUR' class object.

predict()
Predict responses corresponding to the posterior mean of the coefficients, return posterior mean of coefficients or indices of non-zero coefficients of a 'BayesSUR' class object.

plot()
Main plot function to be called by the user. Creates a selection of plots for a 'BayesSUR' class object by calling one or several of the specific plot functions below as specified by the combination of the two arguments estimator and type. elpd() Measure the prediction accuracy by the expected log pointwise predictive density (elpd). The out-of-sample predictive fit can either be estimated by Bayesian leave-one-out cross-validation (LOO) or by widely applicable information criterion (WAIC, Vehtari et al. 2017). See Appendix A for details. getEstimator() Extract the posterior mean of the parameters (i.e., B, Γ and G) of a 'BayesSUR' class object. Also, the log-likelihood of Γ, model size and G can be extracted for the MCMC diagnostics. plotEstimator() Plot the estimated relationships between response variables and estimated coefficients of a 'BayesSUR' class object with argument estimator = c("beta", "gamma", "Gy"). plotGraph() Plot the estimated graph for multiple response variables from a 'BayesSUR' class object with argument estimator = "Gy". plotNetwork() Plot the network representation of the associations between responses and predictors, based on the estimatedΓ matrix of a 'BayesSUR' class object with argument estimator = c("gamma", "Gy"). plotManhattan() Plot Manhattan-like plots for marginal posterior inclusion probabilities (mPIP) and numbers of responses of association for predictors of a 'BayesSUR' class object with argument estimator = "gamma". plotMCMCdiag() Show trace plots and diagnostic density plots of a 'BayesSUR' class object with argument estimator = "logP". plotCPO() Plot the conditional predictive ordinate (CPO) for each individual of a fitted model generated by BayesSUR() with argument estimator = "CPO". CPO is used to identify potential outliers (Gelfand 1996). To use a specific estimator, the function getEstimator() is convenient to extract point estimates of the coefficients matrixB, latent indicator variable matrixΓ or learned structurê G from the directory path of the model object. All point estimates are posterior means, thusγ jk is the marginal posterior inclusion probability for variable j to be selected in the regression for response k, andĜ kl is the marginal posterior edge inclusion probability between responses k and l, i.e., the marginal posterior probability of conditional dependence between k and l. The regression coefficient estimatesB can be the marginal posterior means over all models, independently ofΓ (with default argument beta.type = "marginal"). Then, β jk represents the shrunken estimate of the effect of variable j in the regression for response k. Alternatively,β jk can be the posterior mean conditional on γ jk = 1 with argument beta.type = "conditional". If beta.type = "conditional" and Pmax = 0.5 are chosen, then these conditionalβ jk estimates correspond to the coefficients in a median probability model (Barbieri and Berger 2004).
In addition, the generic S3 methods coef(), predict(), and fitted() can be used to extract regression coefficients, predicted responses, or indices of non-zero coefficients, all corresponding to the posterior mean estimates of an 'BayesSUR' object.
The main function for creating plots of a fitted BayesSUR model, is the generic S3 method plot(). It creates a selection of the above plots, which the user can specify via the estimator and type arguments. If both arguments are set to NULL (default), then all available plots are shown in an interactive manner. The main plot() function uses the following specific plot functions internally. These can also be called directly by the user. The function plotEstimator() visualizes the three estimators. To show the relationship of multiple response variables with each other, the function plotGraph() prints the structure graph based onĜ. Furthermore, the structure relations between multiple response variables and predictors can be shown via function plotNetwork(). The marginal posterior probabilities of individual predictors are illustrated via the plotManhattan() function, which also shows the number of associated response variables of each predictor.
Model fit can be investigated with elpd() and plotCPO(). elpd() estimates the expected log pointwise predictive density (Vehtari et al. 2017) to assess out-of-sample prediction accuracy. plotCPO() plots the conditional predictive ordinate for each individual, i.e., the leave-oneout cross-validation predictive density. CPOs are useful for identifying potential outliers (Gelfand 1996). To check convergence of the MCMC sampler, function plotMCMCdiag() prints traceplots and density plots for moving windows over the MCMC chains. The igraph package (Csárdi and Nepusz 2006) is used for constructing the graph plots. Note that the igraph package creates the layout in a dynamic way, which is determined among other things by the size of the figure window. The layout of the plots obtained with the replication material may thus differ from those shown in the manuscript.

Quick start with a simple example
In the following example, we illustrate a simple simulation study where we run two models: the default model choice, which is an SSUR model with the hotspot prior, and in addition an SSUR model with the MRF prior. The purpose of the latter is to illustrate how we can construct an MRF prior graph. We simulate a data set X with dimensions n × p = 10 × 15, i.e., 10 observations and 15 input variables, a sparse coefficients matrix B with dimension p × s = 15 × 3, which creates associations between the input variables and s = 3 response variables, and random noise E. The response matrix is generated by the linear model Y = XB + E.
R> plot(fit, estimator = c("beta", "gamma"), type = "heatmap", + fig.tex = TRUE, output = "exampleEst", xlab = "Predictors", + ylab = "Responses") Before running the SSUR model with the MRF prior, we need to construct the edge potentials matrix G. If we assume (in accordance with the true matrix B in this simulation scenario) that the second and third predictors are related to the first two response variables, this implies that γ 21 , γ 22 , γ 31 and γ 32 are expected to be related and therefore we might want to encourage these variables to be selected together. In addition, we assume that we know that the first and fourth predictors are associated with the third response variable, and therefore we encourage the selection of γ 13 as well. Since matrix G represents prior relations of any two predictors corresponding to vec{Γ}, it can be generated by the following code: R> G <-matrix(0, ncol = s * p, nrow = s * p) R> combn1 <-combn(rep((1:2 -1) * p, each = length(2:3)) + + rep(2:3, times = length(1:2)), 2) R> combn2 <-combn(rep((3-1) * p, each = length(c(1, 4))) + + rep(c(1, 4), times = length (3) Calling BayesSUR() with the argument gammaPrior = "MRF" will run the SSUR model with the MRF prior, and the argument mrfG = G imports the edge potentials for the MRF prior. The two hyper-parameters d and e for the MRF prior (6)

Two extended examples based on real data
In this section, we use a simulated eQTL data set and real data from a pharmacogenomic database to illustrate the usage of the BayesSUR package. The first example is under the known true model and demonstrates the recovery performance of the models introduced in Section 2. It also demonstrates a full data analysis step by step. The second example illustrates how to use potential relationships between multiple response variables and input predictors as the prior information in Bayesian SUR models and showcases how the resulting estimated graph structures can be visualized with functions provided in the package.

Simulated eQTL data
Similarly to Bottolo et al. (2021), we simulate single nucleotide polymorphism (SNP) data X by resampling from the scrime package (Schwender 2018), with p = 150 SNPs and n = 100 subjects. To construct multiple response variables Y (with s = 10) with structured correlation -which we imagine to represent gene expression measurements of genes that are potentially affected by the SNPs -we first fix a sparse latent indicator variable Γ and then design a decomposable graph for responses to build association patterns between multi-response variables and predictors. The non-zero coefficients are sampled from the normal distribution independently and the noise term from a multivariate normal distribution with the precision matrix sampled from the G-Wishart distribution W G (2, M ) (Mohammadi and Wit 2019). Finally, the simulated gene expression data Y is then generated from the linear model (1). The concrete steps are as follows: • Simulate SNPs data X from the scrime package, dim(X) = n × p.
• Design a decomposable graph G as in the right panel of Figure 3, dim(G) = s × s.
• Design a sparse matrix Γ as in the left panel of Figure 3, dim(Γ) = p × s.
The resulting average signal-to-noise ratio is 25. The R code for the simulation can be found through help("exampleEQTL").

R> attach(exampleEQTL)
In the BayesSUR package, the data Y and X are provided as a numeric matrix in the first list component data of the example data set exampleEQTL. Here the first 10 columns of data are the Y variables, and the last 150 columns are the X variables. The second component of exampleEQTL is blockList which specifies the indices of Y and X in data. The third component is the true latent indicator matrix Γ of regression coefficients. The fourth component is the true graph G between response variables. Throughout this section we attach the data set for more concise R code. Figure 3 shows the true Γ and decomposible graph G used in the eQTL simulation scenario. The following code shows how to fit an SSUR model with hotspot prior for the indicator variables Γ and the sparsity-inducing hyper-inverse Wishart prior for the covariance using the main function BayesSUR().  Figure 5: The estimated structure of the ten response variables is visualized by plot(fit, estimator = "Gy", type = "graph") withĜ thresholded at 0.5 (left). The true structure is shown with plotGraph(Gy), where Gy is the true adjacency matrix (right).  Figure 4 summarizes the posterior inference results by plots forB,Γ andĜ created with the function plot() with arguments estimator = c("beta", "gamma", "Gy") and type = "heatmap". When comparing with Figure 3, we see that this SSUR model has good recovery of the true latent indicator matrix Γ and of the structure of the responses as represented by G. The function plot() can also visualize the estimated structure of the ten gene expression variables as shown in the right panel of Figure 5 with arguments estimator = "Gy" and type = "graph". For comparison, the true structure is shown in the left panel (created by function plotGraph()). When we threshold the posterior selection probability estimates for G and for Γ at 0.5, the resulting full network between the ten gene expression variables and 150 SNPs is displayed in Figure 6. Furthermore, the Manhattan-like plots in Figure 7 show both, the marginal posterior inclusion probabilities (mPIP) of the SNP variables (top panel) and the number of gene expression response variables associated with each SNP (bottom panel).
after subtracting the burn-in length.

The genomics of drug sensitivity in cancer data
In this section we analyze a subset of the Genomics of Drug Sensitivity in Cancer (GDSC) data set from a large-scale pharmacogenomic study (Yang et al. 2013;Garnett et al. 2012). We analyze the pharmacological profiling of n = 499 cell lines from p 0 = 13 different tissue types for s = 7 cancer drugs. The sensitivity of the cell lines to each of the drugs was summarized by the log(IC 50 ) values estimated from in vitro dose response experiments. The cell lines are characterized by p 1 = 343 selected gene expression features (GEX), p 2 = 426 genes affected by copy number variations (CNV) and p 3 = 68 genes with point mutations (MUT). The data sets were downloaded from ftp://ftp.sanger.ac.uk/pub4/cancerrxgene/releases/release-5.0/ and processed as described in help("exampleGDSC"). Gene expression features are logtransformed. Garnett et al. (2012) provide the target genes or pathways for all drugs. The aim of this study was to identify molecular characteristics that help predict the response of a cell line to a particular drug. Because many of the drugs share common targets and mechanisms of action, the response of cell lines to many of the drugs is expected to be correlated. Therefore, a multivariate model seems appropriate: where the elements of B 0 and non-zero elements of B 1 , B 2 and B 3 are independent and identically distributed with the prior N (0, w).
We may know the biological relationships within and between drugs and molecular features, so that the MRF prior (6) can be used to learn the above multivariate model well. In our example, we know that the four drugs RDEA119, PD-0325901, CI-1040 and AZD6244 are MEK inhibitors which affect the MAPK/ERK pathway. Drugs Nilotinib and Axitinib are Bcr-Abl tyrosine kinase inhibitors which inhibit the mutated BCR-ABL gene. Finally, the drug Methotrexate is a chemotherapy agent and general immune system suppressant, which is not associated with a particular molecular target gene or pathway. For the target genes (and genes in target pathways) we consider all characteristics (GEX, CNV, MUT) available in our data set as being potentially associated. Based on this information, we construct edge potentials for the MRF prior: • edges between all features representing genes in the MAPK/ERK pathway and the four MEK inhibitors; • edges between all features representing the Bcr-Abl fusion gene and the two Bcr-Abl inhibitors, see illustration in Figure 9(a); Figure 9: Illustration of the relationship between drugs and a group of related genes. The left panel is for the Bcr-Abl fusion gene and the corresponding related genes. The right panel is for all drugs and gene TP35 as one example with features representing all three data sources. The names with suffix ".GEX", ".CNV" and ".MUT" are features of expression, copy number variation and mutation, respectively.
• edges between all features from different data sources (i.e., GEX, CNV and MUT) representing a gene and all drugs, see illustration in Figure 9(b).
By matching the selected genes with the gene set of the MAPK/ERK pathway from the KEGG database, 57 features are considered to be connected to the four MEK inhibitors. The two genes (i.e., BCR and ABL) representing the Bcr-Abl fusion are connected with five features in the data set, which are BCR-ABL mutation, BCR gene expression, BCR copy number variation, ABL gene expression and ABL copy number variation (Figure 9(a)). In addition, there are 347 small feature groups representing the different available data sources for each of the genes in the data set, which are potentially connected to all drugs. Figure 9(a) illustrates the edges between drugs Nilotinib, Axitinib and the related genes of the Bcr-Abl fusion gene, and Figure 9(b) uses the TP53 gene as an example for how the different data sources representing a gene are related to each drug, thus linking the data sources together. Based on this information, we construct an edge list of the matrix G for the MRF prior.
First, we load and attach the data. Note that in this example, we illustrate the use of the specific plot functions plotEstimator(), plotGraph() and plotNetwork(), which are called directly here rather than via the generic plot() function as in the examples above.
R> data("exampleGDSC", package = "BayesSUR") R> attach(exampleGDSC) The following code chunk will run the MCMC sampler to fit the model. This represents a full analysis, which might take several hours to run with the chosen MCMC parameter values (nIter = 200000, nChains = 6, burnin = 100000) and no parallelization (maxThreads = 1 by default). Approximate results for an initial assessment of the model can be achieved with much shorter MCMC runs. Note that we use the X_0 argument for the thirteen cancer tissue types, which are included in the model as mandatory predictors that are always selected.
R> hyperpar <-list(mrf_d = -3, mrf_e = 0.2) R> set.seed(6437) After fitting an SSUR model with the MRF prior, the structure of the seven drugs, G, has been learned as illustrated in Figure 10, where edges between two drugs k and k indicate thatĜ kk > 0.5. All expected associations between the drugs within each drug group are found, but some additional connections are also identified: there are edges between Axitinib and Methotrexate and between CI-1040 and both Nilotinib and Axitinib.

Conclusion
The BayesSUR package presents a series of multivariate Bayesian variable selection models, for which the ESS algorithm is employed for posterior inference over the model space. It provides a unified R package and a consistent interface for the C++ implementations of individual models. The package supports all combinations of the covariance priors and variable selection priors from Section 2 in the Bayesian HRR and SUR model frameworks. This includes the MRF prior on the latent indicator variables to allow the user to make use of prior knowledge of the relationships between both response variables and predictors. To overcome the computational cost for data sets with large numbers of input variables, parallel processing is also implemented with respect to multiple chains, and for calculation of likelihoods of parameters and samples, although the MCMC algorithm itself is still challenging to be parallelized. We demonstrated the modeling aspects of variable selection and structure recovery to identify relationships between multivariate (potentially high-dimensional) responses as well as between responses and high-dimensional predictors, by applying the package to a simulated eQTL data set and to pharmacogenomic data from the GDSC project.
Possible extensions of the R package include the implementation of different priors to introduce even more flexibility in the modeling choices. In particular, the g-prior could be considered for the regression coefficients matrix B Richardson et al. 2011;Lewin et al. 2015b), whereas currently only the independence prior is available. In addition, the spike-and-slab prior on the covariance matrix C (Wang 2015;Banerjee and Ghosal 2015;Deshpande et al. 2019) might be useful, or the horseshoe prior on the latent indicator variable Γ, which was recently implemented in the multivariate regression setup by Ruffieux et al. (2020).
where Var[·] denotes the variance of logarithm y i |y that can be estimated from the MCMC iterations.