Software for Distributed Computation on Medical Databases: A Demonstration Project

Bringing together the information latent in distributed medical databases promises to personalize medical care by enabling reliable, stable modeling of outcomes with rich feature sets (including patient characteristics and treatments received). However, there are barriers to aggregation of medical data, due to lack of standardization of ontologies, privacy concerns, proprietary attitudes toward data, and a reluctance to give up control over end use. Aggregation of data is not always necessary for model fitting. In models based on maximizing a likelihood, the computations can be distributed, with aggregation limited to the intermediate results of calculations on local data, rather than raw data. Distributed fitting is also possible for singular value decomposition. There has been work on the technical aspects of shared computation for particular applications, but little has been published on the software needed to support the "social networking" aspect of shared computing, to reduce the barriers to collaboration. We describe a set of software tools that allow the rapid assembly of a collaborative computational project, based on the flexible and extensible R statistical software and other open source packages, that can work across a heterogeneous collection of database environments, with full transparency to allow local officials concerned with privacy protections to validate the safety of the method. We describe the principles, architecture, and successful test results for the site-stratified Cox model and rank-k Singular Value Decomposition (SVD).


Introduction
Bringing together the information latent in distributed medical databases promises to personalize medical care by enabling reliable, stable modeling of outcomes with rich feature sets (including patient characteristics and treatments received). To realize that promise, the National Institutes of Health (NIH) and other organizations support the aggregation of data from such databases, particularly the data developed by investigators on NIH-supported projects. These aggregated databases are made available to qualified investigators to explore and extract the useful information they contain. However, there are high (and growing) barriers to aggregation of medical data, some of them having to do with lack of standardization of ontologies, others owing to privacy concerns, and still others arising from a generally proprietary attitude toward one's own institution's data, and a reluctance to give up control. In addition, the sheer size and complexity of some databases has caused the NIH to think about splitting the storage of aggregated databases across several centers; see for example the NIH Collaboratory (NIH 2014).
It has long been known that aggregation of data is not always necessary for model fitting. In many circumstances, such as in fitting models based on maximizing a likelihood, the computations can be distributed, with aggregation limited to the intermediate results of calculations on local data, rather than raw data (Murtagh, Demir, Jenkings, Wallace, Murtagh, Boniol, Bota, Laflamme, Boffetta, Ferretti, and Burton 2012). Indeed, sometimes distribution of the calculation among sites is necessary to share a heavy computational burden, as in model fitting via ADMM (Boyd, Parikh, Chu, Peleato, and Eckstein 2011). It can even help attack the "conflicting ontologies" problem, by shifting the task of translation to the sites.
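As a concrete illustration (ours, not taken from the cited work), consider a logistic-regression log-likelihood: it is a sum over subjects, so each site can evaluate its own portion and return a single number, which the master simply adds up.

```r
# Each "site" holds private data; only scalar log-likelihood
# contributions travel to the master.
set.seed(1)
make_site <- function(n) {
  x <- matrix(rnorm(n * 2), n, 2)
  y <- rbinom(n, 1, plogis(x %*% c(1, -1)))
  list(x = x, y = y)
}
sites <- lapply(c(50, 80, 70), make_site)

local_loglik <- function(site, beta) {    # computed locally at each site
  eta <- site$x %*% beta
  sum(site$y * eta - log(1 + exp(eta)))   # local contribution only
}

beta <- c(0.5, -0.5)
distributed <- sum(sapply(sites, local_loglik, beta = beta))  # master aggregates

# The same value as if the raw data had been pooled centrally:
pooled <- local_loglik(list(x = do.call(rbind, lapply(sites, `[[`, "x")),
                            y = unlist(lapply(sites, `[[`, "y"))), beta)
all.equal(distributed, pooled)  # TRUE
```

The same decomposition applies to the score and information, so a Newton-type iteration can run at the master with only these summaries crossing site boundaries.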
There has been work on the technical aspects of shared computation (Wu, Jiang, and Ohno-Machado 2012b; Wu, Jiang, Kim, and Ohno-Machado 2012a; Jiang, Li, Wang, Wu, Xue, Ohno-Machado, and Jiang 2013) for particular applications, but little has been published on the software needed to support the "social networking" aspect of shared computing, to reduce the barriers to collaboration. We describe a set of software tools that allow the rapid assembly of a collaborative computational project, based on the flexible and extensible R (R Core Team 2014) statistical software and other open source packages, that can work across a heterogeneous collection of database environments (REDCap, i2b2, local versions), with full transparency to allow local officials concerned with privacy protections to validate the safety of the method. The software, examples, and documentation can be found at GitHub, and it is freely modifiable by users. The rationale, scientific uses, and further details of the social networking aspects of the method are discussed elsewhere. In this article we describe the principles, architecture, and successful test results for two privacy-preserving examples. The first is a well-known distributed version of rank-k Singular Value Decomposition and the second is a novel instance (fitting a site-stratified Cox Proportional Hazards Model). These illustrate the challenges and solutions involved in implementing a shared computation. We show how to take any model fit that breaks up in a similar way and implement it as a distributed computation. We describe some possible use cases that may be of interest.

Management of computations for a collaboration
We begin with an overview of the process and our software, with details in the Implementation section following. As illustrated in Figure 1, the setup of the software to manage computations for a collaboration begins with a few steps that need only be completed once at a site, creating an environment that will support many different collaborative efforts. The local systems administrator must provide a computation server, install the application distcomp, and make the server URL available to the site investigators. The administrator would also set up the required permissions.
The other steps are taken by the site investigator for a given collaborative project (a "computation"). We suppose first that the investigator is the originator of a new computation and therefore has access to a dataset that will be used as that site's contribution to the shared effort. The investigator runs a function from the distcomp package on his or her local computer, which brings up a browser window with instructions for entering the location of the dataset, the names of its variables, and certain information about the format and values of the fields. The model formula, in R format, is also entered. Completing this task and surviving various checks on data integrity generates a unique computation-specific identifier and a collection of metadata that define the computation and dataset.
Then the investigator sets up the slave process by running another function in the distcomp package. The result is that a copy of the local dataset and the metadata generated in the previous step is placed in the computation server under the unique computation identifier. At this point the originating investigator is ready to recruit other collaborators. They will have to take the step of creating a computation server, as described above, unless it has already been done at that site. Then, they will run the same setup function locally, and enter the computation definition (metadata) sent by the originating investigator. The function will return a URL (pointing to the computation server) that is to be transmitted back to the originating investigator.
Once the originating investigator has the URLs from the sites in hand, he or she runs the function for setting up a master site, and then finally another function that runs the computation. The process of setting up slave and master sites can be repeated for any number of different computations and datasets.
We imagine that a group of collaborators interested in a particular topic (say, prediction of outcome of therapy for a subtype of melanoma) would run (and subsequently update) several such computations, perhaps variations of models. The distributed computations of interest to that group's investigators might be listed and managed on a website where a registry of active and proposed projects is maintained. Depending on the scope of the collaboration there may be several such websites. One may imagine a website providing the following capabilities for a collaboration:
Registration Registering users, providing credentials, bona fides, etc.
Available Methods A mechanism for adding to a growing list of statistical computations that may be implemented in the distcomp projects. The latest version of the package will automatically provide an up-to-date list.
Project Linkages Associating users with projects so that they may participate in the collaboration. In order to prevent malicious use and avoid inappropriate computational burdens on already established projects there are some checks and balances, such as requiring a new user to bring new data to the collaboration.
Project Examples/Templates Providing some testable examples for potential collaborators so that there is a complete understanding of the requirements of each site. Some canonical templates will be provided. Users may run them on local sites to completely examine the entire process.
Project Publishing Creation of projects and publishing them so that they move into place on a publicly visible website. A group of users or a single user will create a new collaborative project and lead it.
Tools Downloadable tools, virtual machines and detailed tool documentation for enabling dry-run experiments locally.
Documentation Wikis for facilitating communication within the collaboration.
There are many open source tools that can be used for some of the tasks described above. We focus our attention on those tasks for which new tools required development. These are the tools that allow rapid setup of a computation, at both master and slave sites, as described above, with minimal ongoing burden once the initial steps are taken. The intent is to lower the bar to collaboration to the point where it is as easy to set up a distributed collaboration as it would be to even begin the discussion of data aggregation.

Implementation
Our implementation is in the form of the distcomp R package, which utilizes other R packages, notably opencpu and shiny (RStudio, Inc. 2014), to provide the infrastructure support.
We use a master-slave model, where the master is in charge of the overall computation (the main iteration in an optimization, for example) and the slaves merely compute functions on local datasets and return the function results. Both the master and slave sites are expected to have our package distcomp installed (from CRAN). In the course of a single fit, the master process will make an unspecified number of computation requests to the slave sites over a secure HTTP connection. The distcomp package is designed to be used in conjunction with an OpenCPU (Ooms 2014) server or an Rserve (Urbanek 2013) process. Several other analogous servers could be used, but our current implementation is explicitly geared to opencpu because it is straightforward to install and configure. Therefore, in what follows, we assume an opencpu R server.
We summarize the main steps in implementing a distributed computation, followed by details of each step. Figure 1 illustrates the first three steps.
Define the Computation An R function defineNewComputation in distcomp lets one define the formula and variables to be used in the Cox model. The result of this step is an output file, the 'computation definition' or compdef for brevity, that will encode the characteristics of the computation. The file includes the R model formula and variables, but no data and may therefore be shared freely. The compdef file is used in all the ensuing steps.
Set up a Slave Process for the Computation The function setupSlave uses the compdef to create a slave process that can participate in the computation. This requires an R Server process that is suitably configured for talking across networks. Several checks are built into this step and if everything is successful, the server becomes ready to perform the new computation using the provided compdef.
Build a Master Process for the Computation The function setupMaster allows a user to create a master process (R code) for the computation that will run the computation and provide the final results of a fit, such as the coefficients and estimates of variance. It takes as input the compdef created earlier and the URLs of the sites participating in the computation, and outputs an R script which may be run.
Fit the Model The master process code can now be run to do the model fitting.
The master and slave processes communicate via simple messages. This requires a robust serialization protocol, since R data structures (consisting of vectors, matrices, and lists) will be passing over the wire. The current implementation uses the simple JSON serialization format for transmitting the data structures, as it is the best-supported format in opencpu.
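In outline, one round trip through the four steps looks as follows. The function names are those of the distcomp package described above; the calls are schematic (the real functions are interactive shiny applications, and no arguments are shown here because we do not wish to suggest exact signatures):

```r
library("distcomp")

## 1. Originator: define the computation; produces a compdef (.rds)
##    containing the formula and metadata, but no patient data.
defineNewComputation()

## 2. Each site: register the compdef and local data with the site's
##    opencpu server; returns the site URL to send back to the originator.
setupSlave()

## 3. Originator: combine the compdef with the collected site URLs;
##    generates an R script that will drive the fit.
setupMaster()

## 4. Run the generated master script; it iterates, querying each slave
##    over HTTPS, and returns coefficients and variance estimates.
```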
A typical computation will involve the optimization of some criterion, very often a likelihood function. For any specific computation, the associated master will initiate this optimization.

Define a Computation
The function defineNewComputation invokes a shiny web application, and Figure 1 gives an overview of what is expected by the application. A sample screen shot is shown in Figure 2. Besides gathering some generic information such as a name and title for the computation, a formula for the model and a starting dataset are also taken as input. We expect that the initial proposer has actual data on subjects from her site in some analyzable form (CSV assumed here) with meaningful names for the covariates. Other screens enable the user to choose from the available computations, check data for conformability, and execute a dry run. For example, the provided formula is checked against the dataset to ensure the model can actually be fit. The process is not complete until the fit proceeds without error; the various buttons in the user interface are enabled or disabled appropriately. The variables in the formula then become the variables that all other participating sites will have to provide at a minimum.
An identifier (for all practical purposes unique) is generated and associated with every computation definition. This ensures that a slave site (i.e. the associated R computation engine) may participate in more than one computation; the identifier serves to unambiguously distinguish various computations and associate the appropriate data and functions with each computation.
The final result is a compdef object, containing all necessary metadata defining the computation (but no individual patient data), that is saved in an .rds file.

Build Slave Process for the Computation
Before a slave site can be configured for a computation, an opencpu server needs to be set up and its profile modified so that the distcomp package is loaded and configured with a workspace. The workspace configuration requires a one-time editing of a security policy (at least on the commonly used Ubuntu Linux servers) so that the opencpu server may write serialized objects into the area.
The function setupSlave invokes another shiny webapp that requires several inputs: the URL of the opencpu server, the compdef (see section 3.1) and the site specific data. The shiny webapp performs sanity checks, ensures that the model can be fit and if everything succeeds, it uploads information to the opencpu server so that the computation becomes accessible. The site is then ready to participate in a computation.

Build Master Process for the Computation
The function setupMaster invokes a shiny webapp that creates a master process R object that can interact with slave processes. Inputs to this application are the compdef generated in section 3.1 and the URLs of the sites participating in the computation. Once again some sanity checks are performed to ensure that the sites are indeed addressable, and finally the R code for running the computation is generated. The application updates the user R profile with code that ensures the instantiation of an object with the correct identifier, variables, and formula. No data is needed at this point, but the master object can do no real computation until slave sites are added.

Two Examples
We present two self-contained examples that may be run in an R session using the distcomp package in conjunction with the opencpu R package. The first example is more detailed to show R code behind the shiny applications in Figure 1.
The distcomp package may be installed from Github as follows.
library("devtools")
install_github("hrpcisd/distcomp")

Since we use a single opencpu server on a single machine, the data for each site has to be kept separate, and the implementation ensures that the server performs computations only with site-specific data. Furthermore, an empty directory has to be set aside as the workspace for the opencpu server. This is best done by modifying the user R profile to contain the following two lines.
library("distcomp")
distcompSetup(workspace = "full path of workspace",
              ssl.verifyhost = FALSE, ssl.verifypeer = FALSE)

(On Unix and MacOS, the user's .Rprofile file in the home directory will suffice, but on Windows, this needs to be inserted into the site profile.) Thus every R invocation will have the distcomp package loaded and will know about the workspace.
The full structure of the workspace is shown in Figure 3. The defn folders store compdef objects under various identifiers generated for the computations. The instances folders store the instantiated objects used in the computations; these may be repeatedly saved during iterations as they change state.

A Rank-k Approximation Example
We consider the problem of approximating a matrix by another low-rank matrix. Assuming that the original matrix X is row-partitioned into sub-matrices X_j, j = 1, . . . , K, at K sites, there is a well-known iterative singular value decomposition algorithm (see appendix B) to obtain a low-rank approximation using the singular vectors, which is implemented in distcomp. The example below assumes three sites. We now generate random data for three sites, start the opencpu server, and set the URLs for the sites to be the (same) local opencpu URL.
siteDataFiles <- lapply(seq.int(nSites),
                        function(i) paste("site", i, ".rds", sep = ""))
ok <- Map(uploadNewComputation, siteURLs,
          lapply(seq.int(nSites), function(i) svdDef),
          siteData,
          siteDataFiles)
stopifnot(all(ok))

At this point, the sites are instantiated and ready to compute. So we instantiate the master, specifying that we are using a local server, which ensures that our opencpu server handles the data for each site appropriately. This returns the approximate first singular value and the associated vector (up to a sign change). All five singular vectors can be obtained as shown below. We also compare the results to the SVD on the aggregated matrix. As we can see, the distributed SVD recovers the same factors (up to sign changes) as the standard SVD algorithm. There is a slight loss of numerical precision in the alternating power algorithm versus the standard LAPACK implementation, and the JSON serialization format employed here also loses precision. Future implementations will use a more portable format such as Google protocol buffers, as we note in section 6.

A Distributed Stratified Cox Model Example
We next present a distributed Stratified Cox model fit using the UIS dataset from Hosmer and Lemeshow (Hosmer, Lemeshow, and May 2008) on time until return to drug use for patients enrolled in two different residential treatment programs. With all data in one place, one would fit a stratified Cox proportional hazards model using site (0 or 1) as a stratification variable.
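That the stratified partial likelihood separates by site can be verified with the survival package (included with R); the dataset below is simulated for illustration, and coxph with iter.max = 0 is used merely to evaluate the log partial likelihood at a fixed coefficient value:

```r
library("survival")
set.seed(2)
d <- data.frame(time   = rexp(120),
                status = rbinom(120, 1, 0.7),
                age    = rnorm(120),
                site   = rep(0:1, each = 60))
beta <- 0.3

# Log partial likelihood at a fixed beta (no Newton iterations performed).
ll_at <- function(dat)
  coxph(Surv(time, status) ~ age, data = dat, init = beta,
        control = coxph.control(iter.max = 0))$loglik[2]

# Stratified evaluation on the pooled data ...
pooled <- coxph(Surv(time, status) ~ age + strata(site), data = d,
                init = beta, control = coxph.control(iter.max = 0))$loglik[2]
# ... equals the sum of the two sites' local log partial likelihoods.
by_site <- sum(sapply(split(d, d$site), ll_at))
all.equal(pooled, by_site)  # TRUE
```

This is exactly the property the distributed fit exploits: each slave evaluates its own term of the likelihood (and its derivatives), and only those summaries are returned to the master.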
We have successfully tested other examples in a real three-site configuration, involving two US and one UK site. Tests involving real locally-originated data are underway.

Privacy Control and Confidence
The motivating principles guiding the architecture of the distributed computation method we have built include the preservation of data privacy, retention of local control, and building confidence in the safety of the method by using an open platform. We list several features that serve these principles.
Registering machines Only known machines are permitted to request computations.
Logging All requests for computation are logged by the local computation server, providing accountability. Site-specific dashboards can be developed (as shiny apps, for example) for analyzing these logs.
Single computation server A site could use a single computation server for all distributed collaborations, or a small number, simplifying the oversight task. This simplicity, however, creates a loophole. Our model relies on associating a unique identifier with each compdef and associated data at each site. Thus, if slave site A participates in two computations id1 and id2 with two different institutions B and C respectively, there is the possibility that B may be able to run a computation on data that was made available to C (and vice versa) if B knows id2. However, this is easily handled by means of a generic gatekeeper and reverse-proxy application in front of the opencpu server, along with further runtime checks on requests.
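The gatekeeper mentioned above could be as simple as a lookup table mapping each computation identifier to the requester authorized to invoke it; the sketch below is purely illustrative (the names and data structure are ours, not part of distcomp):

```r
# Hypothetical access-control list: computation id -> authorized requester.
acl <- list(id1 = "institutionB.example.org",
            id2 = "institutionC.example.org")

# Reject any request whose identifier is unknown or whose requester
# does not match the institution registered for that identifier.
authorize <- function(comp_id, requester)
  identical(acl[[comp_id]], requester)

authorize("id1", "institutionB.example.org")  # TRUE
authorize("id2", "institutionB.example.org")  # FALSE: B may not reuse C's id
```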
Contingent participation A site agrees to or refuses a specific computation, which it can inspect, and can withdraw at any time from any computation, so it retains complete control over the future use of its data.
Limited computation The computation server can only respond to the specific computation request described in the compdef. It cannot run other functions or execute other instructions.

Some possible use cases
We anticipate that the main initial use will be to make it possible for investigators who are already known to each other to quickly pool information by mounting "pop-up" collaborations. Once the system is set up, understood and has passed local due diligence for privacy protection, the additional cost and effort required to start a new computation among such a working group should be negligible. As use cases occur, a byproduct will be a growing set of mutually understood data fields, making future computations easier to set up.
There are a few other use cases that we expect to see after some time.
Preparation for aggregation A group of sites who are considering the possibility of aggregating data for some purpose might start their collaboration with a few "small wins", to test their ability to rationalize their data dictionaries and see if there is sufficient value in aggregation.
Combining information when control over data use is important Owners of data, such as pharmaceutical companies, who are not willing to transfer them may be willing to participate in collaborations that make use of the distributed technologies. In a single-slave configuration, information from a single source could be used to fit a model proposed by an outside party (acting as the master). This provides a way for organizations to allow others to access their data (for the purpose of model fitting) in a controlled manner.
Virtual aggregation In a group of sites that often collaborate, a library of "translated terms" and the associated data fields could be developed, creating a virtual data aggregation, whose information would be accessed by distributed model building. If sufficient trust has been created, it is possible to make the setup of a new computation nearly transparent to the investigator.

Discussion
The data aggregator must work in a highly restrictive regulatory regime that controls and limits the export of (identified) Protected Health Information (PHI) beyond the originating institution. Owners of data can be reluctant to cede control over the use of their data, even when privacy issues can be resolved. It seems realistic to predict that regulatory concerns, privacy issues, and reluctance to let raw data leave the local source will continue to make it difficult to construct central repositories of data on biomarkers, treatments, and clinical outcomes, despite ongoing efforts by the NIH to support (and even mandate) such repositories. The barriers to access for such data do not seem to be falling. This is of special concern for the aggregators of data (such as gene sequences) that are increasingly likely to emerge as part of the diagnostic and treatment process, and therefore become part of a system of medical records. Such data would be regulated as PHI (under HIPAA) in the future if a consensus develops that they cannot be irreversibly de-identified.
The lack of uniform standards for data management, database ontologies, and the specific content and scope of databases makes global aggregation of complex datasets laborious, expensive, and time-consuming. Reluctance to adopt common standards makes central repositories difficult to achieve, and the effort is seldom funded by outside sources. Investigators might be willing to pay the modest price of standardizing the part of the data that is needed for a particular project, and then to continue extending common ontologies as they are rewarded with analyses.
There are limitations to our current implementation. We have not yet built methods beyond the two described above. We are working on others, including the lasso. We expect that the repertoire of models will expand as others add their own features and make them available. There is not yet an easy-to-use procedure for handling factor variables whose support varies by site. This and many other improvements in ease of use will require further effort. It is often the case that most data at sites is held in databases such as REDCap or i2b2. Our examples have used pure R structures for persisting the data; a simple modification can enable the use of a system such as REDCap. In discussions with personnel at some sites, this configuration seems most acceptable to the CIOs, enabling periodic updates of the data.
We used R for our current implementation since it offers great potential to add new methods. Often, this involves mere re-engineering of already existing algorithms in packages to enable distributed computations. However, a robust implementation of a distributed computation must deal with many modes of failure. More work remains to be done on this, particularly in implementing a robust messaging system to deal with failures.
The current implementation makes use of the JSON serialization format, simply because it is the best-supported format in opencpu. It would be better to use a well-established, portable, and stable serialization format, such as Google protocol buffers (Google 2012; Francois, Eddelbuettel, Stokely, and Ooms 2014). It will be straightforward to adopt that protocol once it is more widely supported in opencpu, for example.
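The precision issue is easy to see with the jsonlite package, which provides the JSON serialization underlying opencpu; note its conservative default for the number of digits:

```r
library("jsonlite")

toJSON(pi)                 # digits are rounded by default
fromJSON(toJSON(pi)) == pi # FALSE: the round trip is lossy

# Requesting more digits recovers (nearly) full double precision:
abs(fromJSON(toJSON(pi, digits = 15)) - pi) < 1e-12  # TRUE
```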
These limitations of our first implementation notwithstanding, we hope that lowering the barriers to shared statistical computation will accelerate the pace of collaboration and increase the accessibility of information that is now locked up.

A. The computational method for a distributed Cox PH fit
We focus on survival data, where a number of covariates are gathered on subjects who are followed over time. The random quantity of interest is the time to "event" T_i for each subject i, which may be observed (in which case the observation is uncensored) or not (in which case it is censored). Thus, each subject i yields data (T_i, delta_i), where delta_i = 1 for an observed event and 0 otherwise (censoring is assumed independent of T_i), in addition to covariates that may vary over time.
The Cox proportional hazards model assumes a hazard function of the form

\lambda(t) = \lambda_0(t) \exp(X \beta),

where \lambda is an n × 1 vector of hazards, n being the number of subjects, \lambda_0 an unspecified baseline hazard, X the n × p matrix of covariates, and \beta a p × 1 vector of coefficients.
Model fitting and inference is accomplished by maximizing a partial likelihood function. The explicit form of this likelihood can be written out using a counting process formulation following Therneau and Grambsch (Terry M. Therneau and Patricia M. Grambsch 2000).
For each subject i, let N_i(t) be the number of observed events in [0, t] and Y_i(t) the indicator of whether subject i is in the "risk set" at time t, that is, may still contribute an event at time t. Then the log of the partial likelihood function introduced by Cox (Cox 1972) has the form

\ell(\beta) = \sum_i \int_0^\infty \left[ X_i(t)\beta - \log \sum_j Y_j(t) r_j(\beta, t) \right] dN_i(t),

where i indexes the subject, r_i(\beta, t) = \exp[X_i(t)\beta] is the risk score, and X_i(t) is the covariate row vector for subject i. The score vector is given by

U(\beta) = \sum_i \int_0^\infty \left[ X_i(t) - \bar{x}(\beta, t) \right] dN_i(t),

where

\bar{x}(\beta, t) = \frac{\sum_j Y_j(t) r_j(\beta, t) X_j(t)}{\sum_j Y_j(t) r_j(\beta, t)}.

The negative of the Hessian is the Fisher information, given by

I(\beta) = \sum_i \int_0^\infty V(\beta, t) \, dN_i(t),

where

V(\beta, t) = \frac{\sum_j Y_j(t) r_j(\beta, t) \left[ X_j(t) - \bar{x}(\beta, t) \right]^T \left[ X_j(t) - \bar{x}(\beta, t) \right]}{\sum_j Y_j(t) r_j(\beta, t)}.

The solution \hat{\beta} of the score equation U(\beta) = 0 and the inverse of the information matrix, I^{-1}(\hat{\beta}), are used as the estimate and its variance, respectively, together with the fact that the estimate is asymptotically normal.

B. The computational method for a distributed SVD

The singular value decomposition (SVD) is a method for summarizing a dataset. The SVD of X, the n × p matrix of covariates, is formed by the matrices U, V, and D such that X = U D V^T. Additionally, we assume that the diagonal values of D are nonnegative and sorted, namely that D = diag(d_1, d_2, . . . , d_min(n,p)) with d_1 ≥ d_2 ≥ · · · ≥ 0. With these assumptions, the SVD of X is unique up to the signs of the columns of U and V.

The distributed, privacy-preserving version keeps each row block X_[j] at its site; only p-vectors and block norms cross site boundaries.

Data: each slave site j holds private data X_[j] in R^{n_j × p}
Result: V in R^{p × k} and D at the master; each U_[j] remains at its slave site
foreach slave site j do transmit n_j to the master end
for i ← 1 to k do
    foreach slave site j do u_[j] ← (1, 1, . . . , 1) of length n_j end
    transmit V and D to the slaves;
    repeat
        foreach slave site j do compute v_[j] ← X_[j]^T u_[j] and transmit it to the master end
        v ← Σ_j v_[j];  v ← v / ‖v‖;  transmit v to the slaves;
        foreach slave site j do calculate u_[j] ← X_[j] v and transmit ‖u_[j]‖ to the master end
        ‖u‖ ← (Σ_j ‖u_[j]‖²)^{1/2};  transmit ‖u‖ to the slaves;
        d_i ← ‖u‖;
    until convergence;
    V ← cbind(V, v);
    foreach slave site j do U_[j] ← cbind(U_[j], u_[j] / ‖u‖) end
end
Algorithm 2: Privacy-preserving algorithm for rank-k SVD
The SVD is a sort of factor model which decomposes the variance of X into what are called principal components. The first column of V, v_1, is called the first principal component of X; it maximizes Var(Xv) over unit vectors v. The values of u_1 can be viewed as loadings that tell how much of the factor v_1 is present in each observation. The quantity d_1² / Σ_j d_j² is the proportion of the variance of X that can be explained by v_1, which means that the SVD also sorts the components in order of their contribution to variance. When we take a rank-k SVD we keep only the k most important factors, which is equivalent to setting d_{k+1} = d_{k+2} = · · · = d_{min(n,p)} = 0.
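The variance decomposition just described can be checked in a few lines of R on a centered matrix:

```r
set.seed(3)
# Center the columns so that singular values measure variance.
X <- scale(matrix(rnorm(200), 20, 10), center = TRUE, scale = FALSE)
d <- svd(X)$d
prop <- d^2 / sum(d^2)   # proportion of variance per component
# prop is sorted in decreasing order and sums to 1
```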
The SVD has many uses in medical research. Often, the results are themselves of interest for examining a factor model, as is the case for methods like Latent Semantic Indexing (Chen, Martin, Daimon, and Maudsley 2013). One recent example of an SVD applied to genomic data is (Novembre, Johnson, Bryc, Kutalik, Boyko, Auton, Indap, King, Bergmann, Nelson et al. 2008), where the authors reconstruct a map of Europe from the first two principal components of SNP data. Additionally, SVDs are often used as a preprocessing step before further data analysis. There are many examples of this, including principal component regression, SVD-based imputation methods, and Surrogate Variable Analysis, a technique for removing unwanted variation from high-dimensional data (Mazumder, Hastie, and Tibshirani 2010; Leek and Storey 2007).
Calculating an SVD can be a challenging problem from a numerical-stability standpoint. For this reason, it is relatively rare for people to implement their own version of an SVD; R and Matlab both use the LAPACK implementation (Anderson, Bai, Bischof, Blackford, Demmel, Dongarra, Du Croz, Greenbaum, Hammarling, McKenney et al. 1999). Most of these methods involve transforming the X matrix in a way that cannot be done while maintaining privacy. That said, there is a relatively simple iterative procedure for calculating the rank-1 SVD that can be modified to be privacy preserving. Algorithm 1 calculates (u, v, d), the rank-1 SVD of X.
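To make the iteration concrete, the following self-contained R sketch (our illustration, independent of the distcomp implementation) row-partitions a random matrix across three simulated "sites" and runs the alternating iteration; only p-vectors and block norms would cross site boundaries, never the raw rows:

```r
set.seed(4)
p <- 5
# Private row blocks X_[j], one per site.
blocks <- lapply(c(7, 9, 8), function(n) matrix(rnorm(n * p), n, p))
u <- lapply(blocks, function(Xj) rep(1, nrow(Xj)))   # u_[j] initialized to ones

for (iter in 1:200) {
  # Each site computes X_[j]' u_[j]; the master sums the p-vectors.
  v <- Reduce(`+`, Map(function(Xj, uj) crossprod(Xj, uj), blocks, u))
  v <- v / sqrt(sum(v^2))                            # master normalizes v
  u <- lapply(blocks, function(Xj) Xj %*% v)         # sites update u_[j] = X_[j] v
  d1 <- sqrt(sum(unlist(u)^2))                       # master assembles ||u|| = d_1
}

# Compare with the LAPACK SVD of the aggregated matrix:
s <- svd(do.call(rbind, blocks))
abs(d1 - s$d[1])                        # leading singular value agrees
max(abs(abs(drop(v)) - abs(s$v[, 1])))  # v agrees up to a sign change
```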