GaussianProcesses.jl: A Nonparametric Bayes package for the Julia Language

Gaussian processes are a class of flexible nonparametric Bayesian tools that are widely used across the sciences, and in industry, to model complex data sources. Key to applying Gaussian process models is the availability of well-developed open source software, which is available in many programming languages. In this paper, we present a tutorial of the GaussianProcesses.jl package that has been developed for the Julia language. GaussianProcesses.jl utilises the inherent computational benefits of the Julia language, including multiple dispatch and just-in-time compilation, to produce a fast, flexible and user-friendly Gaussian processes package. The package provides a range of mean and kernel functions with supporting inference tools to fit the Gaussian process models, as well as a range of alternative likelihood functions to handle non-Gaussian data (e.g. binary classification models). The package makes efficient use of existing Julia packages to provide users with a range of optimization and plotting tools.


Introduction
Gaussian processes (GPs) are a family of stochastic processes which provide a flexible nonparametric tool for modelling data. In the most basic setting, a Gaussian process models a latent function based on a finite set of observations. The Gaussian process can be viewed as an extension of the multivariate Gaussian distribution to an infinite number of dimensions, where any finite collection of dimensions results in a multivariate Gaussian distribution, which is completely specified by its mean and covariance functions. The choice of mean and covariance function, the latter also known as the kernel, imposes smoothness assumptions on the latent function of interest and determines the correlation between output observations y as a function of the Euclidean distance between their respective input data points x.
Gaussian processes have been widely used across a vast range of scientific and industrial fields, for example, to model astronomical time series (Foreman-Mackey, Agol, Ambikasaran, and Angus 2017), hypergraphs (Pinder, Turnbull, Nemeth, and Leslie 2021) and brain networks (Wang, Durante, Jung, and Dunson 2017), or for improved soil mapping (Gonzalez, Cook, Oberthur, Jarvis, Bagnell, and Dias 2007) and robotic control (Deisenroth, Fox, and Rasmussen 2015). Arguably, the success of Gaussian processes in these various fields stems from the ease with which scientists and practitioners can apply Gaussian processes to their problems, as well as the general flexibility afforded to GPs for modelling various data forms.
Gaussian processes have a longstanding history in geostatistics (Matheron 1963) for modelling spatial data. However, more recent interest in GPs has stemmed from the machine learning and other scientific communities. In particular, the successful uptake of GPs in other areas has been a result of high-quality and freely available software. There are now a number of excellent Gaussian process packages available in several computing and scientific programming languages. One of the most mature of these is the GPML package (Rasmussen and Nickisch 2017) for the MATLAB language (The MathWorks Inc. 2019), which was originally developed to demonstrate the algorithms in the book by Rasmussen and Williams (2006) and provides a wide range of functionality. Packages written for other languages, for example the Python (Van Rossum et al. 2011) packages GPy (GPy Contributors 2012) and GPflow (Matthews, Van Der Wilk, Nickson, Fujii, Boukouvalas, León-Villagrá, Ghahramani, and Hensman 2017), have incorporated more recent developments in the area of Gaussian processes, most notably implementations of sparse Gaussian processes and graphics processing unit (GPU) accelerations. This paper presents a new package, GaussianProcesses.jl, for implementing Gaussian processes in the recently developed Julia programming language. Julia (Bezanson, Edelman, Karpinski, and Shah 2017), an open source programming language, is designed specifically for numerical computing and has many features which make it attractive for implementing Gaussian processes. Two of the most useful and unique features of Julia are just-in-time (JIT) compilation and multiple dispatch. JIT compilation compiles a function into binary code the first time it is used, which allows code to run much more efficiently compared with interpreted code in other dynamic programming languages. This provides a solution to the "two-language" problem: in contrast to e.g., R (R Core Team 2022) or Python, where performance-critical parts are often delegated to libraries written in C/C++ (Stroustrup 2013) or Fortran, it is possible to write highly performant code in Julia, while keeping the convenience of a high-level language. Multiple dispatch allows functions to be dynamically dispatched based on the types of the supplied arguments. In the context of our package, this allows us to have a general framework for operating on Gaussian processes, while allowing us to implement more efficient functions for the different types of objects which will be used with the process. Similar to the R language, Julia has an excellent package manager system which allows users to easily install packages from inside the Julia read-eval-print loop (REPL), as well as many well-developed packages for statistical analysis.
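To illustrate the dispatch mechanism with a minimal sketch (the type and function names here are purely illustrative and are not taken from the package), the same generic function name selects a specialized method based on the concrete argument type:

julia> abstract type AbstractGP end
julia> struct ExactGP <: AbstractGP end      # hypothetical type for exact inference
julia> struct ApproxGP <: AbstractGP end     # hypothetical type for approximate inference
julia> describe(gp::ExactGP) = "closed-form predictive distribution"
julia> describe(gp::ApproxGP) = "approximate predictive distribution"
julia> describe(ExactGP())
"closed-form predictive distribution"

The package exploits the same mechanism, for example by dispatching inference routines on exact and approximate Gaussian process types (see Section 3).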
GaussianProcesses.jl is an open source package which is entirely written in Julia. It supports a wide choice of mean, kernel and likelihood functions (see Appendix A) with a convenient interface for composing existing functions via summation or multiplication. The package leverages other Julia packages to extend its functionality and ensure computational efficiency. For example, hyperparameters of the Gaussian process are optimized using the Optim.jl package (Mogensen and Riseth 2018) which provides a range of efficient and configurable unconstrained optimization algorithms; prior distributions for hyperparameters can be set using the Distributions.jl package (Besançon, Papamarkou, Anthoff, Arslan, Byrne, Lin, and Pearson 2021). Additionally, this package has now become a dependency of other Julia packages, for example, BayesianOptimization.jl (Brea 2021), a demo of which is given in Section 4.4. The run-time speed of GaussianProcesses.jl has been heavily optimized and is competitive with other Gaussian process packages.
Within the Julia language, Gaussian process software is currently limited. Aside from GaussianProcesses.jl, two other packages currently under development are Stheno.jl (Tebbutt, Bruinsma, and Turner 2022) and AugmentedGaussianProcesses.jl (Galy-Fajou, Wenzel, and Opper 2020). The Stheno.jl package is designed to provide ease of compatibility with other Julia packages to allow users to leverage the benefits of Bayesian inference (AdvancedHMC.jl, Xu, Ge, Tebbutt, Tarek, Trapp, and Ghahramani 2020), optimization (Optim.jl, Mogensen and Riseth 2018) and automatic differentiation (Zygote.jl, Innes 2018) when fitting Gaussian process models. Stheno.jl is designed for Gaussian process regression and does not currently provide features to handle non-Gaussian likelihoods, e.g., classification data. The AugmentedGaussianProcesses.jl package is developed primarily for data-augmented sparse Gaussian processes. The data-augmentation structure of this package allows users to utilize Gaussian and non-Gaussian likelihoods by transforming the likelihood functions into conditionally conjugate likelihoods, which allows for faster inference via block coordinate updates. AugmentedGaussianProcesses.jl is geared towards variational inference instead of the Markov chain Monte Carlo-based inference that is primarily used in GaussianProcesses.jl.
The paper is organized as follows. Section 2 provides an introduction to Gaussian processes, how they can be applied to model Gaussian and non-Gaussian observational data, and how enhanced computational efficiency can be achieved through sparse approximations. Section 3 gives an overview of the main functionality of the package which is presented through a simple application of fitting a Gaussian process to simulated data. This is then followed by five application demos in Section 4 which highlight how Gaussian processes can be applied to classification problems, time series modelling, count data, black-box optimization and computationally-efficient large-scale nonparametric modelling via sparse Gaussian process approximations. Finally, the paper concludes (Section 5) with a discussion of ongoing package developments which will provide further functionality in future releases of the package.

Gaussian processes in a nutshell
Gaussian processes are a class of models which are popular tools for nonlinear regression and classification problems. They have been applied extensively in scientific disciplines ranging from modelling environmental sensor data (Osborne, Roberts, Rogers, and Jennings 2008) to astronomical time series data (Wilson, Dann, and Nickisch 2015), all within a Bayesian nonparametric setting. A Gaussian process (GP) defines a distribution over functions, p(f), where f : 𝒳 → ℝ is a function mapping from the input space 𝒳 to the space of real numbers. The space of functions f can be infinite-dimensional, for example when 𝒳 ⊆ ℝ^d, but for any finite subset of inputs X = {x_1, x_2, ..., x_n} ⊂ 𝒳 we define f := {f(x_i)}_{i=1}^n as a random variable whose marginal distribution p(f) is a multivariate Gaussian.
The Gaussian process framework provides a flexible structure for modelling a wide range of data types. In this package we consider models of the following general form,

y | f, θ ∼ ∏_{i=1}^n p(y_i | f_i, θ),    f | θ ∼ GP(m_θ(x), k_θ(x, x′)),    (1)

where y = (y_1, y_2, ..., y_n) ∈ 𝒴 and x ∈ 𝒳 are the observations and covariates, respectively, and f_i := f(x_i). We assume that the responses y are conditionally independent given f, and as a result the likelihood p(y | f, θ) factorizes over the observations. For notational convenience, we let θ ∈ Θ ⊆ ℝ^d denote the vector of model parameters for both the likelihood function and the Gaussian process prior.
The Gaussian process prior is completely specified by its mean function m_θ(x) and covariance function k_θ(x, x′), also known as the kernel. The mean function is commonly set to zero (i.e., m_θ(x) = 0, ∀x), which can often be achieved by centering the observations (i.e., working with y − E(y)). If the observations cannot be re-centered in this way, for example if they display a linear or periodic trend, then the zero mean function can still be applied, with the trend modelled through the kernel function.
The kernel determines the correlation between any two function values f_i and f_j in the output space as a function of their corresponding inputs x_i and x_j. The user is free to choose any kernel that best models the data, as long as the covariance matrix formed by the kernel is symmetric and positive semi-definite. Perhaps the most common kernel function is the squared exponential,

k_θ(x, x′) = σ² exp(−‖x − x′‖² / (2ℓ²)).

For this kernel the correlation between f_i and f_j is determined by the Euclidean distance between x_i and x_j and the hyperparameters θ = (σ, ℓ), where ℓ determines how quickly the correlation decays as the distance between x and x′ grows. A wide range of kernels exists to model different data patterns, and more complex kernels can be created from the sum and product of simpler kernels (Duvenaud 2014); see Chapter 4 of Rasmussen and Williams (2006) for a detailed discussion. Figure 1 shows one-dimensional Gaussian processes sampled from three simple kernels (squared exponential, periodic and linear) and three composite kernels, and demonstrates how combining these kernels provides a richer correlation structure to capture more intricate function behavior.
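As a brief sketch of how such compositions are expressed with the package's kernel constructors (the exact argument order of the Periodic constructor, here log length scale, log signal standard deviation and log period, is an assumption; all arguments are on the log scale, as discussed in Section 3):

julia> using GaussianProcesses
julia> se  = SE(0.0, 0.0)                # squared exponential: log length scale, log signal std
julia> per = Periodic(0.0, 0.0, 0.0)     # periodic kernel
julia> ksum  = se + per                  # additive kernel
julia> kprod = se * per                  # product kernel

The resulting composite kernels can be passed to the GP constructor in exactly the same way as the simple kernels.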
Often we are interested in predicting function values f_* at new inputs x_*. For a finite set of function values f, the joint distribution between these observed points and the test points f_* is a multivariate Gaussian,

(f, f_*) ∼ N( 0, [ K_{f,f}  K_{f,*} ; K_{*,f}  K_{*,*} ] ),    (2)

where K_{f,f} = k_θ(X, X), K_{f,*} = k_θ(X, x_*) and K_{*,*} = k_θ(x_*, x_*).
By the properties of the multivariate Gaussian distribution, the conditional distribution of f_* given f is also Gaussian for fixed X and x_*. Extending to the general case, the conditional distribution for the latent function f(x_*) is itself a Gaussian process,

f_* | f ∼ N( K_{*,f} K_{f,f}^{-1} f,  K_{*,*} − K_{*,f} K_{f,f}^{-1} K_{f,*} ).    (3)

Using the modelling framework in Equation 1, we have a Gaussian process prior p(f | θ) over the function f(x). If we let D = {X, y} represent our observed data, where X = (x_1, x_2, ..., x_n), then the likelihood of the data, conditional on the function values f, is p(D | f, θ). Using Bayes' theorem, the posterior distribution for the function f is p(f | D, θ) ∝ p(D | f, θ)p(f | θ). In the general setting, the posterior is non-Gaussian (see Section 2.1 for an exception) and cannot be expressed in analytic form, but it can often be approximated using a Laplace approximation (Williams and Barber 1998), expectation propagation (Minka 2001), variational inference (Opper and Archambeau 2009) or Stein variational gradient descent (Pinder, Nemeth, and Leslie 2020). Alternatively, simulation-based inference methods, including Markov chain Monte Carlo (MCMC) algorithms (Robert, Casella, and Casella 2004), can be applied.
From the posterior distribution, we can derive the marginal predictive distribution of y_* at test points x_* by integrating out the latent function,

p(y_* | x_*, D, θ) = ∫ p(y_* | f_*, θ) p(f_* | f, θ) p(f | D, θ) df df_*.    (4)

Calculating this integral is generally intractable, with the exception of nonlinear regression with Gaussian observations (see Section 2.1). In settings such as classification, the marginal predictive distribution is intractable but can be approximated using the methods mentioned above. In Section 2.2 we introduce an MCMC algorithm for sampling exactly from the posterior distribution and use these samples to evaluate the marginal predictive distribution through Monte Carlo integration.

Nonparametric regression: The analytic case
We start by considering a special case of Equation 1 in which the observations follow a Gaussian distribution,

y | f ∼ N(f, σ²I_n).    (5)

In this instance, the posterior for the latent variables, conditional on the data, can be derived analytically as a Gaussian distribution (see Rasmussen and Williams 2006),

f | D, θ ∼ N( K_{f,f}(K_{f,f} + σ²I)^{-1} y,  K_{f,f} − K_{f,f}(K_{f,f} + σ²I)^{-1} K_{f,f} ).    (6)

The predictive distribution for y_* in Equation 4 can also be calculated analytically by noting that the likelihood in Equation 5, the posterior in Equation 6 and the conditional distribution for f_* in Equation 3 are all Gaussian, and integration over a product of Gaussians again produces a Gaussian distribution,

y_* | x_*, D, θ ∼ N( µ(x_*), Σ(x_*, x_*′) ),    (7)

where µ(x_*) = K_{*,f}(K_{f,f} + σ²I)^{-1} y and Σ(x_*, x_*′) = K_{*,*} + σ²I − K_{*,f}(K_{f,f} + σ²I)^{-1} K_{f,*} (see Rasmussen and Williams 2006 for the full derivation).
The quality of the Gaussian process fit to the data depends on the model hyperparameters, θ, which appear in the mean and kernel functions, as well as on the observation noise σ². Estimating these parameters requires the marginal likelihood of the data, obtained by marginalising over the latent function values f. Assuming the Gaussian observation model in Equation 5, the marginal distribution is p(y | X, θ, σ²) = N(0, K_{f,f} + σ²I). For convenience of optimization we work with the log-marginal likelihood

log p(y | X, θ, σ²) = −½ yᵀ(K_{f,f} + σ²I)^{-1} y − ½ log |K_{f,f} + σ²I| − (n/2) log 2π.    (8)

The tractability of the log-marginal likelihood allows for the straightforward calculation of its gradient with respect to the hyperparameters. Efficient gradient-based optimization techniques (e.g., L-BFGS and conjugate gradients) can be applied to numerically maximize the log-marginal likelihood function. In practice, we utilize the excellent Optim.jl package and provide an interface for the user to specify their choice of optimization algorithm. Alternatively, a Bayesian approach can be taken, where samples are drawn from the posterior of the hyperparameters using the in-built MCMC algorithm; see Section 3 for an example.
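As a small sketch of this interface (toy data; the assumption here is that alternative Optim.jl solvers are passed through the method keyword of optimize!, as described in Section 3):

julia> using GaussianProcesses, Optim
julia> x = 2π*rand(10);  y = sin.(x) + 0.05*randn(10)        # toy regression data
julia> gp = GP(x, y, MeanZero(), SE(0.0, 0.0), -2.0)         # GP with log observation noise -2.0
julia> optimize!(gp)                                         # maximize the log-marginal likelihood (L-BFGS by default)
julia> optimize!(gp; method = ConjugateGradient())           # illustrative: supply an alternative Optim.jl solver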

Gaussian processes with non-Gaussian data
In Section 2.1 we considered the simple tractable case of nonlinear regression with Gaussian observations. The modelling framework given in Equation 1 is general enough to extend the Gaussian process model to a wide range of data types. For example, Gaussian processes can be applied to binary classification problems (see Rasmussen and Williams 2006, Chapter 3), by using a Bernoulli likelihood function (see Section 4.1 for more details).
When the likelihood p(y | f, θ) is non-Gaussian, the posterior distribution of the latent function, conditional on observed data, p(f | D, θ), does not have a closed-form solution. A popular approach for addressing this problem is to replace the posterior with an analytic approximation, such as a Gaussian distribution derived from a Laplace approximation (Williams and Barber 1998) or an expectation-propagation algorithm (Minka 2001). These approximations are simple to employ and can work well in practice on specific problems (Nickisch and Rasmussen 2008); in general, however, they struggle when the posterior is significantly non-Gaussian. Alternatively, rather than trying to find a tractable approximation to the posterior, one can sample from it, use the samples as a stochastic approximation, and evaluate integrals of interest through Monte Carlo integration (Ripley 2009).
MCMC methods represent a general class of algorithms for sampling from high-dimensional posterior distributions. They have favorable theoretical support guaranteeing algorithmic convergence (Roberts and Rosenthal 2004) and are generally easy to implement, requiring only that the posterior density can be evaluated pointwise. We use the centred parameterization given in Murray and Adams (2010); Filippone, Zhong, and Girolami (2013); Hensman, Matthews, Filippone, and Ghahramani (2015), which has been shown to improve the accuracy of MCMC algorithms by de-coupling the strong dependence between f and θ.
Re-parameterizing Equation 1, we write

f = L_θ v,    v ∼ N(0, I),    (9)

where L_θ is the lower Cholesky factor of the covariance matrix K_θ, with (i, j)-element k_θ(x_i, x_j), so that L_θ L_θᵀ = K_θ. The random variables v are independent under the prior and a deterministic transformation gives the function values f. The posterior distribution p(f | D, θ), or in the transformed setting p(v | D, θ), usually does not have a closed-form expression. Using MCMC we can instead sample from this distribution, or, when the model parameters θ are unknown, from the joint posterior

p(v, θ | D) ∝ p(D | L_θ v, θ) p(v) p(θ).    (10)

Numerous MCMC algorithms have been proposed for sampling from the Gaussian process posterior (see Titsias, Rattray, and Lawrence 2011, for a review). In this package we use the highly efficient Hamiltonian Monte Carlo (HMC) algorithm (Neal 2010), which utilizes gradient information to sample efficiently from the posterior. Under the re-parameterized model in Equation 9, calculating the gradient of the posterior requires the derivative of the Cholesky factor L_θ. We calculate this derivative using the blocked algorithm of Murray (2016).
Given the HMC samples {v^(i), θ^(i)}, function values f are recovered through the deterministic transform f^(i) = L_{θ^(i)} v^(i). Monte Carlo integration is then used to estimate the marginal predictive distribution from Equation 4, where the remaining one-dimensional integral over f_* can be efficiently evaluated using Gauss-Hermite quadrature (Liu and Pierce 1994).
In contrast to MCMC, a variational approach can be employed to perform inference in a nonconjugate Gaussian process model. Unlike MCMC, which approximates the posterior using sampling, variational inference (VI) uses optimization to minimize a discrepancy between an approximating distribution and the true posterior of the process (Opper and Archambeau 2009). Many schemes exist to perform VI; under the general framework, one seeks an optimal distribution q⋆ from a family of probability distributions Q parameterized by a set of variational parameters λ. To find q⋆, the Kullback-Leibler (KL) divergence between the approximating distribution and the true posterior is minimized,

q⋆ = argmin_{q ∈ Q} KL( q(f) ‖ p(f | D, θ) ).    (11)

It is common to let q = N(m, V), where m is a mean vector and V a covariance matrix. Consequently, the variational parameters to be estimated through VI are λ = (m, V).
Optimizing Equation 11 in closed form is not possible due to the need to evaluate the intractable marginal likelihood when computing p(f | D, θ). To circumvent this, the objective function can be reformulated in terms of an evidence lower bound (ELBO),

ELBO(q) = E_q[log p(y | f, θ)] − KL( q(f) ‖ p(f | θ) ).    (12)

While the KL-divergence term is now tractable, as it only involves the evaluation of the Gaussian process prior distribution, the expectations E_q[log p(y_n | f_n)] required in the first term are, in general, intractable. Using the variational objective presented in Khan, Mohamed, and Murphy (2012), a tractable objective function can be expressed as

L(m, V) = ∑_{n=1}^N g(y_n, m_n, V_nn) − ½ [ tr(ΩV) + (m − µ)ᵀΩ(m − µ) − N + log det(K_{f,f}) − log det(V) ],    (13)

where Ω = K_{f,f}^{-1}, N is the number of observations and µ is the Gaussian process prior mean. The function g(·) denotes the likelihood-specific local variational bound, defined such that E(log p(y_n | f_n)) ≥ g(y_n, m_n, V_nn). An extensive list of local variational bounds can be found in Khan et al. (2012); for example, for Poisson data the bound is g(y_n, m_n, V_nn) = y_n m_n − exp(m_n + V_nn/2) − log(y_n!), where V_nn denotes a single term from the diagonal of V. Upon convergence of the optimizer, the posterior covariance of f is approximated by V.
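For concreteness, the Poisson bound above can be written as a small Julia function (an illustrative sketch, not part of the package's API; log y! is evaluated via the log-gamma function):

julia> using SpecialFunctions
julia> # local variational bound g(yₙ, mₙ, Vₙₙ) for Poisson data: y*m - exp(m + v/2) - log(y!)
julia> g(y, m, v) = y*m - exp(m + v/2) - loggamma(y + 1)
julia> g(3, 1.0, 0.5)     # bound contribution of a single observation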
Using the Optim.jl package, we can maximize Equation 13 with respect to m and V to find q⋆. Performing this optimization requires gradients of the variational lower bound. To allow variational inference to be easily extended to new likelihoods, the computation of derivatives is handled using the automatic differentiation functionality in Zygote.jl.
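A minimal sketch of how such gradients can be obtained with Zygote.jl is shown below; the objective here is a toy stand-in for the variational bound, not the package's internal implementation:

julia> using Zygote
julia> y = [1.0, 3.0, 0.0];  m = zeros(3);  v = fill(0.1, 3)     # toy data and variational parameters
julia> obj(m) = sum(y .* m .- exp.(m .+ 0.5 .* v))               # stand-in for the bound in Equation 13
julia> dm = Zygote.gradient(obj, m)[1]                           # gradient with respect to the variational mean m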

Scaling Gaussian processes with sparse approximations
When applying Gaussian processes to a dataset of size n, an unfortunate by-product is the O(n³) computational cost, which stems from the need to invert and compute the determinant of the n × n kernel matrix K_{f,f}. There exist a number of approaches for deriving more scalable Gaussian processes: sparsity-inducing kernels (Melkumyan and Ramos 2009), Nyström-based eigendecompositions (Williams and Seeger 2001), variational posterior approximations (Titsias 2009), neighborhood partitioning schemes (Datta, Banerjee, Finley, and Gelfand 2016), and divide-and-conquer strategies (Guhaniyogi, Li, Savitsky, and Srivastava 2017). In this package, scalability is achieved by approximating the Gaussian process prior with a set of inducing points u of size m, such that m ≪ n (Quiñonero-Candela and Rasmussen 2005).
Due to the consistency of a Gaussian process, the joint prior in Equation 2 can be recovered from a sparse Gaussian process through marginalization of u,

p(f, f_*) = ∫ p(f, f_* | u) p(u) du,

where u ∼ N(0, K_{u,u}) and K_{u,u} = k_θ(X_u, X_u) is an m × m covariance matrix. An approximation is induced under the sparse framework only through the assumption that f and f_* are conditionally independent given u,

p(f, f_*) ≈ q(f, f_*) = ∫ q(f | u) q(f_* | u) p(u) du.

From this dependency structure, it can be seen that f and f_* are dependent only through the information expressed in u. The fundamental difference between the four sparse Gaussian process schemes implemented in this package lies in the additional assumptions that each scheme imposes upon the conditional distributions q(f | u) and q(f_* | u). In the exact case, these two conditional distributions can be expressed as

p(f | u) = N( K_{f,u} K_{u,u}^{-1} u,  K_{f,f} − Q_{f,f} ),    (14)
p(f_* | u) = N( K_{*,u} K_{u,u}^{-1} u,  K_{*,*} − Q_{*,*} ),    (15)

where Q_{f,f} = K_{f,u} K_{u,u}^{-1} K_{u,f}. The simplest, and most computationally efficient, sparse method is the subset of regressors (SoR) strategy. SoR assumes a deterministic relationship between f and u, making the Gaussian process marginal predictive distribution in Equation 4 equivalent to that of a degenerate, finite-rank Gaussian process (a sketch of its standard form is given below). Such scalability comes at the great cost of wildly inaccurate predictive uncertainties that often underestimate the true posterior variance, as the model can only express m degrees of freedom. This occurs because at most m linearly independent functions can be drawn from the prior, and consequently prior variances are poorly approximated.
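For reference, the SoR predictive distribution takes the standard form given in Quiñonero-Candela and Rasmussen (2005), stated here for completeness in the notation above, with observation noise σ²:

q(f_* | D) = N( σ^{-2} K_{*,u} Σ K_{u,f} y,  K_{*,u} Σ K_{u,*} ),    Σ = ( σ^{-2} K_{u,f} K_{f,u} + K_{u,u} )^{-1}.

The predictive variance involves only the rank-m matrix K_{*,u} Σ K_{u,*}, which is the source of the underestimated uncertainty discussed above.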
A more elegant sparse method is the deterministic training conditional (DTC) approach of Seeger, Williams, and Lawrence (2003). DTC addresses the inaccuracy of the Gaussian process posterior variance by computing the likelihood using information from all n data points, not just u. This is achieved by projecting f onto the inducing points, replacing f with K_{f,u} K_{u,u}^{-1} u in the likelihood. With an exact likelihood computation, an approximation is still required on the Gaussian process joint prior,

q(f, f_*) = N( 0, [ Q_{f,f}  Q_{f,*} ; Q_{*,f}  K_{*,*} ] ).    (16)

Through retention of an exact likelihood, coupled with an approximate prior, a deterministic relationship need only be imposed between u and f, allowing an exact test conditional (Equation 15) to be computed. Given that the test conditional is now exact, and the prior variance of f_* is computed using K_{*,*} rather than Q_{*,*}, more reasonable predictive uncertainties are produced. Note that, while an exact test conditional is being computed, the DTC approximation is not an exact Gaussian process, as the process is no longer consistent across training and test cases due to the inclusion of K_{*,*} in Equation 16.
The fully independent training conditional (FITC) scheme enables a richer covariance structure by preserving the exact prior covariances along the diagonal of the sparse covariance matrix (Snelson and Ghahramani 2006). This can be seen in the model's joint prior,

q(f, f_*) = N( 0, [ Q_{f,f} − diag[Q_{f,f} − K_{f,f}]  Q_{f,*} ; Q_{*,f}  K_{*,*} ] ).    (17)

As with the DTC, FITC imposes an approximation on the training conditional in Equation 14, but computes Equation 15 exactly. An important extension to Equation 17, proposed in Quiñonero-Candela and Rasmussen (2005), reformulates the prior variance for f_* as Q_{*,*} − diag[Q_{*,*} − K_{*,*}]. This assumption of full independence within the conditionals of both f and f_* ensures that the FITC approximation is equivalent to exact inference within a non-degenerate Gaussian process, a property not enjoyed by the aforementioned sparse approximations.
The final sparse method implemented within the package is the full-scale approximation (FSA) of Sang and Huang (2012). A full-scale approximation further enriches the prior covariance structure by adding a series of blocked corrective terms, formed from K_{f,f} − Q_{f,f} and restricted to blocks along its diagonal, to the prior covariance of f. The predictive uncertainties yielded by a full-scale approach are far superior to those of the previous sparse approximations; however, this comes at the cost of increased computational complexity due to the denser covariance matrix. As with the DTC and FITC approximations, the exact test conditional of Equation 15 is preserved, while the approximation of the training conditional in Equation 14 takes the form q(f | u) = N( K_{f,u} K_{u,u}^{-1} u, blockdiag[K_{f,f} − Q_{f,f}] ). Adopting a full-scale approach requires the practitioner to specify the number of blocks, k, a priori. The trade-off when making this decision is that fewer blocks result in a more accurate predictive distribution, but at a higher computational cost. As recommended by Tresp (2000), it is common to select k = n/m, so that each block is of dimension m × m. In the extreme case that k = n, each block contains a single observation and the full-scale approach reduces to the FITC approximation, while if k = 1, just a single block exists and the exact Gaussian process is recovered.
A final note with regard to sparse approximations is that the set of inducing point locations X_u, such that u = f(X_u), will heavily influence the resulting approximation. Modern extensions to the sparse methods detailed above seek to learn X_u concurrently with the hyperparameters during optimization.
However, such an approach, whilst elegant, requires first-order derivatives with respect to the inducing point locations to be available, a functionality not currently implemented in the package. Instead, the practitioner is required to specify a priori a set of points corresponding to the locations X_u. A simple, yet effective, approach is to divide the input space into equidistant knots and use these knot points as X_u.
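A sketch of the equidistant-knot approach for a one-dimensional input is given below (variable names are illustrative; the 1 × m orientation of the inducing inputs follows the row-matrix convention used for the training inputs in Section 4.5):

julia> m  = 20                                          # number of inducing points
julia> knots = range(minimum(x), stop = maximum(x), length = m)   # assumes training inputs x are defined
julia> Xu = Matrix(collect(knots)')                     # equidistant knots stored as a 1 × m matrix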

The package
The package can be installed from the Julia package repository during a Julia session using the package manager tool. Typing the ] symbol activates the package manager, after which the GaussianProcesses.jl package can be installed with the command

pkg> add GaussianProcesses

Alternatively, the Pkg.jl package (Julia Programming Language 2022) can be used:

julia> Pkg.add("GaussianProcesses")

Julia will also install all of the required dependency packages. Documentation for types and functions in the package, like other documentation in Julia, can be accessed through the help mode in the Julia REPL. Help mode is activated by typing ? at the prompt, and documentation can then be searched by entering the name of a function or type. For example, the documentation for optimize! reads:

Optimize the hyperparameters of Gaussian process gp based on type II maximum likelihood estimation. This function performs gradient-based optimization using the Optim package, to which the user is referred for further details.

Keyword arguments:
* `domean::Bool`: Mean function hyperparameters should be optimized
* `kern::Bool`: Kernel function hyperparameters should be optimized
* `noise::Bool`: Observation noise hyperparameter should be optimized (GPE only)
* `lik::Bool`: Likelihood hyperparameters should be optimized (GPA only)
* `kwargs`: Keyword arguments for the optimize function from the Optim package

The main function in the package is GP, which fits the Gaussian process model to covariates X and responses y. As discussed in the previous section, the Gaussian process is completely specified by its mean and kernel functions, and possibly a likelihood when the observations y are non-Gaussian.
julia> gp = GP(X, y, mean, kernel)
julia> gp = GP(X, y, mean, kernel, likelihood)

This highlights the use of Julia's multiple dispatch feature. The GP function will, in the background, construct either an object of type GPE or GPA for exact or approximate inference, respectively, depending on whether or not a likelihood function is specified. If no likelihood function is given, then it is assumed that y is Gaussian distributed, as in the analytic case of Equation 5.
In this section we will highlight the functionality of the package by considering a simple Gaussian process regression example which follows the tractable case outlined in Section 2.1. We start by loading the package and simulating n = 10 data points.

julia> using GaussianProcesses, Random
julia> Random.seed!(13579)
julia> n = 10;
julia> x = 2π * rand(n);
julia> y = sin.(x) + 0.05*randn(n);

Note that Julia supports UTF-8 characters, and so one can use Greek characters to improve the readability of the code.
The first step in modelling data with a Gaussian process is to choose the mean and kernel functions which describe the data. There are a variety of mean and kernel functions available in the package (see Appendix A for a list). Note that all hyperparameters for the mean and kernel functions, and the observation noise σ, are given on the log scale. The Gaussian process is represented by an object of type GP and constructed from the observation data, a mean function and kernel, and optionally the observation noise. For this example we use a zero mean function and a squared exponential kernel with signal standard deviation and length scale parameters equal to 1.0 (recalling that the constructor arguments are specified on the log scale). Once the GP function has been applied to the data, a summary of the GP object is printed, providing some basic information such as the type of mean and kernel functions used, as well as the value of the marginal log-likelihood in Equation 8.

Once we have fitted the GP to the data, we can calculate the predicted mean and variance of the function at unobserved points {x_*, y_*}, conditional on the observed data D = {y, X}. This is done with the predict_y function. We can also calculate the predictive distribution for the latent function f_* using the predict_f function. The predict_y function returns the mean vector µ(x_*) and covariance matrix Σ(x_*, x_*′) of the predictive distribution in Equation 7 (or a variance vector if full_cov=false). For a given vector x_* (in this example a sequence between 0 and 2π with 0.1 spacing) we have:

julia> x = 0:0.1:2π
julia> µ, Σ = predict_y(gp, x);

Plotting one- and two-dimensional GPs is straightforward, and in the package we utilize the recipes approach to plotting graphs from the Plots.jl (Breloff 2021) package. Plots.jl provides a general interface for plotting with several different backends, including PyPlot.jl (Johnson 2021), Plotly.jl (Malmaud, Zheng, Hanson, Knowles, and Palmer 2021) and GR.jl (Heinen 2021). The default plot function plot(gp) outputs the predicted mean and variance of the function (i.e., uses predict_f in the background), with the uncertainty in the function represented by a confidence ribbon (set to 95% by default). All optional plotting arguments are given after the ; symbol.

julia> using Plots
julia> pyplot()
julia> plot(gp; xlabel="x", ylabel="y", title="Gaussian process", legend=false, fmt=:png)

The parameters θ are optimized using the Optim.jl package (see the right-hand side of Figure 2). This offers users a range of optimization algorithms which can be applied to estimate the parameters using maximum likelihood estimation. Gradients are available for all mean and kernel functions used in the package, and it is therefore recommended that the user applies gradient-based optimization techniques. By default, the optimize! function uses the L-BFGS solver; alternative solvers can be applied, and the user should refer to the Optim.jl documentation for further details. For highly non-convex models, gradient-based methods will only converge to a local optimum. In these settings, a popular pragmatic solution is to run the optimizer multiple times from different initial parameter values and then choose the best parameter set from these multiple runs.
Parameters can also be estimated using a Bayesian approach, where instead of maximising the log-likelihood function, we approximate the marginal posterior distribution p(θ, σ | D) ∝ p(D | θ, σ)p(θ, σ). We use MCMC sampling (specifically HMC sampling) to draw samples from the posterior distribution with the mcmc function. Prior distributions are assigned to the parameters of the mean and kernel functions through the set_priors! function. The log-noise parameter σ is given a non-informative prior p(σ) ∝ 1. A wide range of prior distributions is available through the Distributions.jl package. Further details on the MCMC sampling in the package are given in Section 4.1.
julia> using Distributions
julia> set_priors!(kern, [Normal(0,1), Normal(0,1)])
julia> chain = mcmc(gp)
julia> plot(chain', label=["Noise", "SE log length", "SE log scale"])

The regression example above can be easily extended to higher dimensions. For the purpose of visualization, and without loss of generality, we consider a two-dimensional regression example. When d > 1 (recalling that 𝒳 ⊆ ℝ^d), there is the option to use either an isotropic (Iso) kernel or an automatic relevance determination (ARD) kernel. The Iso kernels have one length scale parameter ℓ which is the same for all dimensions. The ARD kernels, however, have a different length scale parameter for each dimension. To obtain Iso or ARD kernels, a kernel function is called either with a single length scale parameter or with a vector of parameters. For example, below we use the Matérn 5/2 ARD kernel; if we wanted to use the Iso alternative instead, we would set the kernel as kern = Matern(5/2,0.0,0.0).
In this example we use a composite kernel represented as the sum of a Matérn 5/2 ARD kernel and a squared exponential isotropic kernel. This is easily implemented using the + symbol, or in the case of a product kernel, using the * symbol. By default, the in-built plot function returns only the mean of the GP in the two-dimensional setting. There is an optional var argument which can be used to plot the two-dimensional variance (see Figure 4).

julia> p1 = plot(gp2; title="Mean of GP")
julia> p2 = plot(gp2; var=true, title="Variance of GP", fill=true)
julia> plot(p1, p2; fmt=:pdf)

The Plots.jl package provides a flexible recipe structure which allows the user to change the plotting backend, e.g., from PyPlot.jl to GR.jl. The package also provides a rich array of plotting functions, such as contour, surface and heatmap plots. In this example we use the GR.jl backend to also allow for a wireframe plot:

julia> gr()
julia> p1 = contour(gp2)
julia> p2 = surface(gp2)
julia> p3 = heatmap(gp2)
julia> p4 = wireframe(gp2)
julia> plot(p1, p2, p3, p4; fmt=:pdf)
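The construction of the two-dimensional GP object gp2 used above is not shown; a sketch consistent with the description (simulated data and hypothetical variable names) might look like:

julia> Random.seed!(1)
julia> n = 50
julia> X = rand(2, n)                                        # 2 × n matrix of inputs
julia> y = sin.(X[1,:]) .* cos.(X[2,:]) + 0.05*randn(n)
julia> kern = Matern(5/2, [0.0, 0.0], 0.0) + SE(0.0, 0.0)    # Matérn 5/2 ARD plus squared exponential Iso
julia> gp2 = GP(X, y, MeanZero(), kern)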

Demos
So far we have considered Gaussian processes where the data y are assumed to follow a Gaussian distribution centred on the latent Gaussian process function f (Equation 5). As highlighted in Section 2.2, Gaussian processes can easily be extended to model non-Gaussian data by assuming an appropriate likelihood for the data conditional on a latent Gaussian process function. This approach has been widely applied, for example, in machine learning for classification problems (Williams and Barber 1998) and in geostatistics for spatial point process modelling (Møller, Syversveen, and Waagepetersen 1998). In this section, we will show how the GaussianProcesses.jl package can be used to fit Gaussian process models for binary classification, time series and count data. The code for all examples is also available in notebook format (https://github.com/STOR-i/GaussianProcesses.jl/tree/master/notebooks).

Binary classification
In this example we show how the approximate GP function can be used for supervised classification, with MCMC used to estimate the latent process. We use the crabs dataset from the R package MASS. In this dataset we are interested in predicting whether a crab is of the blue or orange colour form. Our aim is to perform a Bayesian analysis and calculate the posterior distribution of the latent GP function f and parameters θ from the training data {X, y}.
We first load the data and create a training sample, where the response is converted from characters to booleans:

julia> using GaussianProcesses, RDatasets, Random

We assume a zero mean GP with a Matérn 3/2 kernel. We use the automatic relevance determination (ARD) kernel to allow each dimension of the predictor variables to have a different length scale. As this is binary classification, we use the Bernoulli likelihood, y_i | f_i ∼ Bernoulli(Φ(f_i)), where Φ : ℝ → [0, 1] is the cumulative distribution function of a standard Gaussian and acts as a link function that maps the GP function to the interval [0,1], giving the probability that y_i = 1. Note that BernLik requires the observations to be of type Bool and, unlike some likelihood functions (e.g., Student-t), does not contain any parameters to be set at initialization.
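The remaining data-preparation commands are not shown above; a sketch of one way to build the training sample is given below. The dataset identifier, the species column name :Sp and the positions of the five morphological measurements are assumptions about how RDatasets exposes the MASS crabs data and should be checked against the loaded data frame.

julia> crabs = dataset("MASS", "crabs");                  # assumed dataset identifier
julia> crabs = crabs[shuffle(1:size(crabs, 1)), :];       # randomly permute the rows
julia> train = crabs[1:div(end, 2), :];                   # use half of the data for training
julia> y = Vector{Bool}(train[:, :Sp] .== "B");           # blue colour form coded as true (assumed column name)
julia> X = Matrix(train[:, 4:end])';                      # five measurements as a 5 × n matrix (assumed column positions)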
julia> mZero = MeanZero();
julia> kern = Matern(3/2, zeros(5), 0.0);
julia> lik = BernLik();

We fit the Gaussian process using the general GP function. This function is a shorthand for the GPA function, which is used to generate approximations of the latent function using MCMC or variational inference when the likelihood is non-Gaussian. As we are taking a Bayesian approach to infer the latent function and model parameters, we assign prior distributions to the unknown variables. As outlined in the general modelling framework in Equation 9, the latent function f is reparameterized as f = L_θ v, where v ∼ N(0_n, I_n) is the prior on the transformed latent function. Using the Distributions.jl package we can assign normal priors to each of the Matérn kernel parameters. If the mean and likelihood functions also contained parameters, then we could set these priors in the same way using gp.mean and gp.lik in place of gp.kernel, respectively.
julia> set_priors!(gp.kernel, [Distributions.Normal(0.0,2.0) for i in 1:6])

Samples from the posterior distribution of the latent function and parameters, f, θ | D, are drawn using MCMC sampling. The mcmc function uses a Hamiltonian Monte Carlo sampler (Neal 2010). By default, the function runs for nIter=1000 iterations and uses a step-size of ϵ = 0.01 with a random number of leap-frog steps L between 5 and 15. Setting Lmin=1 and Lmax=1 gives the MALA algorithm (Roberts and Rosenthal 1998). Additionally, after the MCMC sampling is complete, the Markov chain can be post-processed using the burn and thin arguments to remove the burn-in phase (e.g., the first 100 iterations) and thin the Markov chain to reduce the autocorrelation by removing values systematically (e.g., if thin=5 then only every fifth value is retained).
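For example, a call combining these arguments might look as follows (keyword names as described above; the values are illustrative):

julia> samples = mcmc(gp; nIter = 10000, burn = 1000, thin = 10);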

Time series
Gaussian processes can be used to model nonlinear time series. We consider the problem of predicting future concentrations of CO₂ in the atmosphere. The data are taken from the Mauna Loa observatory in Hawaii, which records the monthly average atmospheric concentration of CO₂ (in parts per million) between 1958 and 2015. For the purpose of testing the predictive accuracy of the Gaussian process model, we fit the GP to the historical data from 1958 to 2004 and optimize the parameters using maximum likelihood estimation.
We employ a seemingly complex kernel function to model these data, following the kernel structure given in Rasmussen and Williams (2006), Chapter 5. The kernel is composed of simpler kernels, with each term accounting for a different aspect of the variation in the data. For example, the Periodic kernel captures the seasonal effect of CO₂ absorption by plants. A detailed description of each kernel's contribution is given in Rasmussen and Williams (2006), Chapter 5.
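A sketch of such a composite kernel, in the spirit of Rasmussen and Williams (2006, Chapter 5), is given below. The SE, Periodic and RQ constructors are assumed to follow the conventions of Appendix A (arguments on the log scale), and the initial values shown are illustrative rather than the values used to produce the results in this section.

julia> # long-term trend + quasi-periodic seasonal term + medium-term irregularities + short-term variation
julia> kernel = SE(4.0, 4.0) +
                Periodic(0.0, 1.0, 0.0) * SE(4.0, 0.0) +
                RQ(0.0, 0.0, -1.0) +
                SE(-2.0, -2.0)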

Count data
Gaussian process models can be incredibly flexible for modelling non-Gaussian data. One such example is count data y, which can be modelled with a Poisson distribution,

y_i | f_i ∼ Poisson(λ_i),    λ_i = exp(f_i),

where f is a latent Gaussian process modelling the log-rate parameter.
The package contains two methods for performing inference for Gaussian processes with non-Gaussian data: MCMC and variational inference, as described in Section 2.2. To demonstrate the performance of both algorithms, we simulate 20 observations such that y_i ∼ Poisson(exp(f(x_i))), where f(x_i) = 2cos(2x_i) and x_i ∈ [−3, 3].
julia> using GaussianProcesses, Distributions, Plots, Random
julia> Random.seed!(203617)
julia> pyplot()
julia> n = 20
julia> X = collect(range(-3, stop=3, length=n));
julia> f = 2*cos.(2*X);

The MCMC algorithm is a sampling-based approach, and so we use the mcmc() function to draw samples from the posterior distribution. Conditional on these posterior samples, we can also draw samples from the predictive distribution.

julia> samples = mcmc(gpmc; nIter=10000);
julia> xtest = range(minimum(gpmc.x), stop=maximum(gpmc.x), length=50);
julia> ymean = [];
julia> fsamples = Array{Float64}(undef, size(samples,2), length(xtest));
julia> for i in 1:size(samples,2)
           set_params!(gpmc, samples[:,i])
           update_target!(gpmc)
           push!(ymean, predict_y(gpmc, xtest)[1])
           fsamples[i,:] = rand(gpmc, xtest)
       end

The variational inference approach relies on optimization rather than sampling to approximate the posterior. Using the ELBO in Equation 12 as an objective, we optimize the free variational parameters in a similar manner to maximizing the marginal log-likelihood when optimizing the GP parameters in the exact Gaussian process setting (see Section 3 for an example).

julia> plot!(xtest, mean(ymean), label="posterior mean", w=2)
julia> xx = range(-3, stop=3, length=1000);
julia> f_xx = 2*cos.(2*xx);
julia> plot!(xx, exp.(f_xx), label="truth")
julia> scatter!(X, Y, label="data")

The results of both algorithms are presented in Figure 8. As can be seen in the left-hand panel, MCMC offers a far richer approximation to the posterior of the Gaussian process. Asymptotically, MCMC will in fact yield the true posterior, whereas variational inference is not equipped with such theoretical guarantees. Further, as the variational posterior approximation is constrained to be a set of independent Gaussian distributions, it will often fail to capture the intricate details of a non-Gaussian posterior. Typically, however, variational methods scale more efficiently than MCMC, as they do not require the costly evaluation of a proposal distribution. Moreover, modern techniques such as stochastic optimization and graphics processing unit (GPU) acceleration can be incorporated into variational schemes, a task left for future work.
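The simulation of the Poisson responses Y and the construction of gpmc are not shown above; a sketch consistent with the description is given below. The PoisLik likelihood name and the kernel choice are assumptions for illustration.

julia> Y = rand.(Poisson.(exp.(f)));                               # draw Poisson counts with rate exp(f(xᵢ))
julia> gpmc = GP(X, vec(Y), MeanZero(), Matern(3/2, 0.0, 0.0), PoisLik());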

Bayesian optimization
This section introduces the BayesianOptimization.jl package, which requires GaussianProcesses.jl as a dependency. We highlight some of the memory-efficiency features of Julia and show how Gaussian processes can be applied to optimize noisy or costly black-box objective functions (Shahriari, Swersky, Wang, Adams, and de Freitas 2016). In Bayesian optimization, an objective function l(x) is evaluated at some points y_1 = l(x_1), y_2 = l(x_2), ..., y_t = l(x_t).
A model M(D_t), e.g., a Gaussian process, is fitted to these observations D_t = {(x_i, y_i)}_{i=1,...,t} and used to determine the next input point x_{t+1} at which the objective function should be evaluated. The model is refitted with the inclusion of the new observation (x_{t+1}, y_{t+1}), and M(D_{t+1}) is used to acquire the next input point. With a clever acquisition of the next input points, Bayesian optimization can find the optima of the objective function with fewer function evaluations than alternative optimization methods (Shahriari et al. 2016).
Since the observed data sets at different time steps are highly correlated, D_{t+1} = D_t ∪ {(x_{t+1}, y_{t+1})}, it would be wasteful to refit a Gaussian process to D_{t+1} without reusing the model M(D_t) that was already fitted to D_t. To avoid refitting, GaussianProcesses.jl includes the function ElasticGPE, which creates a Gaussian process to which new observations can be added at little cost. In the following example, we create an elastic and exact GP for two input dimensions with an initial capacity for 3000 observations, whose capacity is increased by 1000 observations whenever the current capacity limit is reached. Under the hood, an elastic GP allocates memory for the number of observations specified with the keyword argument capacity and uses views to select only the part of memory that is already filled with actual observations. Whenever the current capacity c is reached, memory for c + stepsize observations is allocated and the old data are copied over. Elastic GPs use efficient rank-one updates of the Cholesky decomposition that holds the covariance data of the GP.
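A sketch of such a construction is shown below. The capacity and stepsize keywords are as described above; the mean, kernel and logNoise keywords and the append! call for adding an observation are assumptions based on our reading of the package interface.

julia> gp = ElasticGPE(2,                                 # two input dimensions
                       mean = MeanConst(0.0),
                       kernel = SEArd([0.0, 0.0], 5.0),
                       logNoise = 0.0,
                       capacity = 3000, stepsize = 1000)
julia> append!(gp, [1.5, 0.4], 0.8)                       # add one observation at x = (1.5, 0.4) with y = 0.8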
In the following example we use Bayesian optimization on a not-so-costly, but noisy, one-dimensional objective function, f(x) = 0.1·(x − 2)² + cos(π/2·x) + ϵ, where ϵ ∼ N(0, 1), which is illustrated in Figure 9. During the optimization, the hyperparameters of the Gaussian process are fitted using maximum likelihood. BayesianOptimization.jl uses the automatic differentiation tools in ForwardDiff.jl (Revels, Lubin, and Papamarkou 2016) to compute gradients of the acquisition function (in this example, ExpectedImprovement()). After evaluating the function at 200 positions, the global minimum of the Gaussian process at model_optimizer = [1.97054] is close to the global minimum of the noise-free objective function.
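A sketch of how such an optimization might be set up with BayesianOptimization.jl is given below; the constructor and function names (BOpt, boptimize!, MAPGPOptimizer, Min) and their argument order reflect our reading of that package's interface and should be checked against its documentation before use.

julia> using BayesianOptimization, GaussianProcesses
julia> f(x) = 0.1*(x[1] - 2)^2 + cos(pi/2*x[1]) + randn()                  # noisy objective
julia> model = ElasticGPE(1, mean = MeanConst(0.0), kernel = SEArd([0.0], 5.0), logNoise = 0.0)
julia> opt = BOpt(f, model, ExpectedImprovement(), MAPGPOptimizer(every = 50),
                  [-5.0], [5.0]; maxiterations = 200, sense = Min)         # search interval [-5, 5]
julia> result = boptimize!(opt)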

Sparse inputs
In this section we will demonstrate how each of the sparse approximations detailed in Section 2.3 can be used. The performance of each sparse method is demonstrated by fitting a sparse Gaussian process to a set of n = 5000 data points that are simulated from f(x) = |x − 5| cos(2x).

julia> using Random, Distributions
julia> function fstar(x::Float64)
           return abs(x-5)*cos(2*x)
       end
julia> σy = 10.0
julia> n = 5000
julia> Random.seed!(1)
julia> Xdistr = Beta(7,7)
julia> ϵdistr = Normal(0,σy)
julia> x = rand(Xdistr, n)*10
julia> X = Matrix(x')
julia> Y = fstar.(x) .+ rand(ϵdistr, n)

The set of inducing point locations X_u used here is consistent across methods and is defined explicitly. With the inducing point locations defined, we can now fit each of the sparse Gaussian process approximations. The practitioner is free to select any of the sparse approaches outlined in Section 2.3, each of which can be invoked using the syntax below. The only syntactic deviation is the full-scale approach, which requires the practitioner to choose the covariance matrix's local blocks. In this example, m blocks have been created, with a one-to-one mapping between blocks and inducing point locations. Below we illustrate the commands for employing the subset of regressors method, the deterministic training conditional approach, the fully independent training conditional scheme and the full-scale approach.
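A sketch of how each approximation might be invoked is given below; the constructor names (SoR, DTC, FITC, FSA) and their argument order follow our reading of the package's sparse GP interface and are assumptions rather than verified signatures.

julia> Xu = Matrix(collect(range(0.5, stop = 9.5, length = 10))')      # illustrative equidistant inducing points
julia> mZero = MeanZero(); kern = SE(0.0, 0.0); logNoise = log(σy)
julia> gp_sor  = GaussianProcesses.SoR(X, Xu, Y, mZero, kern, logNoise);
julia> gp_dtc  = GaussianProcesses.DTC(X, Xu, Y, mZero, kern, logNoise);
julia> gp_fitc = GaussianProcesses.FITC(X, Xu, Y, mZero, kern, logNoise);
julia> blocks  = [argmin(abs.(xi .- vec(Xu))) for xi in x]             # map each observation to its nearest inducing point
julia> gp_fsa  = GaussianProcesses.FSA(X, Xu, blocks, Y, mZero, kern, logNoise);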
As discussed in Section 2.3, each sparse method yields a computational acceleration, but this often comes at the cost of poorer predictive inference, as shown in Figure 10. This is nowhere more apparent than in the SoR approach, where the posterior predictions are excessively confident, particularly beyond the range of the inducing points. Both the DTC and FITC are more conservative in their predictive uncertainty as the process moves away from the inducing point locations, while sacrificing little in terms of computational efficiency; see Table 1 for computational timing results.

Future developments
GaussianProcesses.jl is a fully formed package providing a range of kernel, mean and likelihood functions, and inference tools for Gaussian process modelling with Gaussian and non-Gaussian data types. The inclusion of new features in the package is ongoing, and the development of the package can be followed via the GitHub page https://github.com/STOR-i/GaussianProcesses.jl. The following package enhancements are currently under development:
• Automatic differentiation - The package provides functionality for maximum likelihood estimation of the hyperparameters, or Bayesian inference using an MCMC algorithm. In both cases, these functions require gradients to be calculated for optimization or sampling. Currently, derivatives of functions of interest (e.g., the log-likelihood function) are hand-coded for computational efficiency. However, recent tests have shown that calculating these gradients using automatic differentiation does not incur a significant additional computational overhead. In the future, the package will move towards implementing all gradient calculations using automatic differentiation. The main advantage of this approach is that users will be able to add new functionality to the package more easily, for example, creating new kernel functions.
• Gaussian process latent variable model (GPLVM) - Currently the package is well suited to supervised learning tasks, where an observed value exists for each input. GPLVMs are a probabilistic dimensionality reduction method that uses Gaussian processes to learn a mapping between an observed, possibly very high-dimensional, variable and a reduced-dimension latent space. GPLVMs are a popular method for dimensionality reduction as they go beyond principal component analysis by learning a nonlinear relationship between the observations and the corresponding latent space. Furthermore, a GPLVM is also able to express the uncertainty surrounding the structure of the latent space. In the future, the package will support the original GPLVM of Lawrence (2004) and its Bayesian counterpart (Titsias and Lawrence 2010).