qgam: Bayesian non-parametric quantile regression modelling in R

Generalized additive models (GAMs) are flexible non-linear regression models, which can be fitted efficiently using the approximate Bayesian methods provided by the mgcv R package. While the GAM methods provided by mgcv are based on the assumption that the response distribution is modelled parametrically, here we discuss more flexible methods that do not entail any parametric assumption. In particular, this article introduces the qgam package, which is an extension of mgcv providing fast calibrated Bayesian methods for fitting quantile GAMs (QGAMs) in R. QGAMs are based on a smooth version of the pinball loss of Koenker (2005), rather than on a likelihood function, hence jointly achieving satisfactory accuracy of the quantile point estimates and coverage of the corresponding credible intervals requires adopting the specialized Bayesian fitting framework of Fasiolo, Wood, Zaffran, Nedellec, and Goude (2020b). Here we detail how this framework is implemented in qgam and we provide examples illustrating how the package should be used in practice.


Introduction: additive quantile modelling in R
Generalized additive models (GAMs, Hastie and Tibshirani 1986) are flexible regression models, where the relation between the response distribution and several covariates is modelled non-parametrically, typically via spline basis expansions. In standard GAMs only one parameter of the response distribution (typically the location) is modelled additively, but Rigby and Stasinopoulos (2005) developed methods for handling more flexible generalized additive models for location, scale and shape (GAMLSS), where potentially all the parameters of the response distribution are allowed to depend on the covariates. The quantile GAM models (QGAMs) described in this work provide even more flexibility by modelling the quantiles of the conditional response distribution individually, thus avoiding any parametric assumption on the distribution of the response variable.
The purpose of this article is to discuss methods and software for QGAM modelling in R. In particular, we focus on the qgam R package (Fasiolo et al. 2020a), which implements the fast calibrated Bayesian fitting methods proposed by Fasiolo et al. (2020b). The qgam package is an extension of the recommended mgcv package (Wood 2019), which provides tools for building, fitting and visualizing GAM and GAMLSS models. A new package is required, because QGAMs cannot be handled using the standard GAM methods implemented by mgcv. In particular, QGAMs are based on the pinball loss function (Koenker and Bassett 1978), rather than on a probabilistic model of the response distribution, and the absence of a likelihood function impedes direct application of Bayes' rule when updating the prior distribution on the regression coefficients given the observed responses. While this problem can be overcome by adopting the coherent Bayesian belief updating framework of Bissiri et al. (2016), which effectively leads to the application of Bayes' rule using a loss-based pseudo-likelihood, naïvely plugging such a pseudo-likelihood into standard Bayesian fitting methods can lead to inaccurate quantile estimates and inadequate coverage of the corresponding credible intervals, as discussed, for instance, in Yang et al. (2016) and Sriram (2015). To avoid such issues, qgam implements the calibrated Bayesian methods of Fasiolo et al. (2020b), which explicitly aim at selecting the 'learning rate' tuning parameter of the loss so as to achieve near-nominal frequentist coverage of the quantile credible intervals. Furthermore, qgam bases quantile regression on a smoothed version of the pinball loss, which enables the adoption of fast maximum a posteriori (MAP) and empirical Bayes methods to estimate the regression coefficients and select their prior variance hyper-parameters, respectively. The smoothness of the new loss is determined by minimizing the asymptotic mean squared error (MSE) of the estimated regression coefficients, approximated using a location-scale GAM model.
To the best of our knowledge and at the time of writing, the QGAM model fitting framework proposed by Fasiolo et al. (2020b) is the only one able to estimate all the regression coefficients and prior hyper-parameters using fast direct optimization methods and to provide, at no extra computational cost, credible intervals which achieve adequate coverage for tail quantiles. It should be clarified that, when we describe qgam as "fast", we mean that it is the fastest additive quantile regression method we know of that has the properties just described. Hence, for example, the rqss function provided by the quantreg package (Koenker 2013) might be faster, if the smoothing parameters are known or if the model of interest contains at most one or two such parameters, which could be selected by minimizing a model selection criterion numerically, as discussed in Koenker (2011), Section 3. Another alternative to the calibrated Bayesian estimation methods provided by qgam is additive quantile regression via gradient boosting, available in R via the mboost package (Hothorn et al. 2018). Fasiolo et al. (2020b) show that boosting can provide accurate point estimates, but selecting the optimal number of boosting steps requires running a computationally intensive cross-validation routine. Furthermore, the uncertainty of the estimated quantiles must be obtained by bootstrapping. Waldmann et al. (2013) propose Bayesian methods for estimating QGAMs via the BayesX stand-alone software (Brezger et al. 2005), with a specific quantile regression family being accessible from R using the bamlss package (Umlauf et al. 2018). However, their proposal is based on Markov chain Monte Carlo (MCMC) methods, which are slower than the direct optimization methods adopted here, and the resulting credible intervals struggle to achieve nominal frequentist coverage for tail quantiles, as detailed in Waldmann et al. (2013). Another alternative is the VGAM (Yee 2010) R package but, as for quantreg, the prior smoothing hyper-parameters have to be selected manually. Lin et al. (2013) focus on variable selection, not smoothing, hence the corresponding software cannot be considered an alternative to qgam.
The rest of the paper is structured as follows. In Section 2 we first present the basic structure of QGAM models, then we show how such models can be fitted using the fast calibrated Bayesian methods of Fasiolo et al. (2020b) and we explain how the fitting framework is implemented in the qgam package. In Section 3 we illustrate how the package can be used for quantile GAM modelling. In particular, the first example introduces the basic features of the package, while the second example considers a more realistic application, focused on electricity demand forecasting using functional effects. In Section 4 we outline some promising directions for future work on additive quantile modelling with qgam.

General structure of additive quantile regression models
Let y be a continuous random variable (r.v.) with conditional distribution p(y|x), where x is a p-dimensional vector of covariates. Define also the conditional quantiles of p(y|x) by µ_τ(x) = F^{-1}(τ|x), where τ ∈ (0, 1) and F^{-1} is the inverse conditional cumulative distribution function (c.d.f.) of y. In quantile regression (Koenker and Bassett 1978) the conditional quantiles are modelled individually, without specifying a model for p(y|x). Direct quantile estimation is generally achieved by exploiting the following alternative definition of a quantile

µ_τ(x) = argmin_µ E{ρ_τ(y − µ)/σ | x},   (1)

where ρ_τ(z)/σ, with ρ_τ(z) = (τ − 1) z 1(z < 0) + τ z 1(z ≥ 0), is the scaled version of the so-called 'pinball' or 'check' loss. Here σ > 0 is a scale parameter, which can potentially depend on the covariates x. Its role will be clarified in Section 2.2. In qgam the pinball loss, which is piecewise linear and has discontinuous derivatives, is replaced by the extended log-F (ELF) loss (Fasiolo et al. 2020b), that is

ρ̃_τ(z) = (τ − 1) z/σ + λ log(1 + e^{z/(λσ)}),   (2)

where λ > 0, and the pinball loss is recovered as λ → 0+. The ELF loss is a smoothed version of the pinball loss which, as will be explained later, has the advantages of leading to more accurate quantile estimates (if λ is tuned adequately) and of enabling the use of efficient computational methods for model fitting.
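As a quick numerical check (a sketch, not part of the package), the following code compares the pinball loss with a numerically stable implementation of the ELF loss (with σ = 1), confirming that the latter converges to the former as λ → 0+:

```r
# Pinball (check) loss, as defined above
pinball <- function(z, tau) ifelse(z < 0, (tau - 1) * z, tau * z)

# ELF loss with sigma = 1; log(1 + exp(u)) is computed stably as
# max(u, 0) + log1p(exp(-|u|)) to avoid overflow for small lambda
elf <- function(z, tau, lambda) {
  u <- z / lambda
  (tau - 1) * z + lambda * (pmax(u, 0) + log1p(exp(-abs(u))))
}

z <- seq(-2, 2, length.out = 401)
max(abs(elf(z, tau = 0.9, lambda = 1e-4) - pinball(z, tau = 0.9)))
# maximal discrepancy is lambda * log(2), attained at z = 0
```

The discrepancy between the two losses is largest at the kink of the pinball loss, which the ELF loss smooths out.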
In this work we assume that the conditional quantiles have an additive structure, that is

µ(x) = Σ_j f_j(x),

where the f_j's are parametric, random or smooth effects. The latter are built using spline basis expansions, so the j-th effect can be written

f_j(x) = Σ_{k=1}^{K_j} β_{jk} b_{jk}(x),

where b_{j1}, ..., b_{jK_j} are the spline basis functions used to build the j-th effect and β_{j1}, ..., β_{jK_j} are the corresponding regression coefficients. The basis functions are known and fixed, while the regression coefficients must be estimated. Note that the quantile µ depends both on x and β, but in the following we will refer to it using µ(x) or µ(β), depending on the context. In a Bayesian framework, the complexity of the smooth and random effects is controlled using a prior distribution on the regression coefficients, which we indicate with p(β). In this work we assume that the prior is an improper multivariate Gaussian distribution, centered at 0 and with positive semi-definite precision matrix S_γ = Σ_{l=1}^m γ_l S_l. Here the S_l's are positive semi-definite matrices, scaled by the positive parameters γ = {γ_1, ..., γ_m}. We assume that the S_l's are given, while the vector γ needs to be selected. The S_l's related to random effects are often simple diagonal matrices, while the S_l's related to smooth effects have more complex structures, aimed at penalizing departures from (some definition of) smoothness. Hence, increasing γ leads to a prior where the random effects coefficients are more concentrated around zero and the smooth effects less wiggly. In the following we refer to γ as the vector of smoothing parameters, with the understanding that some of its elements could instead be controlling the prior precision of the random effects. In general, there need not be a one-to-one correspondence between the smoothing parameters and the effects because, for example, a smoothing parameter might be determining the prior precision of several effects, or the complexity of a single effect might be controlled using multiple smoothing parameters. See Wood (2017) for a detailed account of smooth and random effects model structures, in an additive modelling context. This section outlined the basic QGAM modelling framework; the next two will detail how such models can be estimated within a fast calibrated Bayesian framework.
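To make the basis-plus-penalty construction concrete, the sketch below uses mgcv's public smoothCon constructor (not part of qgam's internals) to build a single thin plate regression spline and inspect its basis matrix and penalty matrix, the latter being one of the S_l blocks entering the prior precision:

```r
library(mgcv)

dat <- data.frame(x = runif(200))
# Build the smooth without fitting anything: smoothCon returns a list of
# smooth objects, here containing a single thin plate spline with k = 10
sm <- smoothCon(s(x, k = 10, bs = "tp"), data = dat)[[1]]

dim(sm$X)       # 200 x 10: basis functions b_k(x_i) evaluated at the data
dim(sm$S[[1]])  # 10 x 10: penalty matrix, penalizing departures from smoothness
```

The fitted effect is X β for the corresponding coefficient block β, and the quadratic form β^T S β is the contribution of this smooth to the prior log-density.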

Performing the Bayesian update under the ELF loss
Let us indicate with p(y|β) the true conditional distribution of the response, where the dependency on x has been temporarily suppressed to simplify the notation, and indicate with p(β) the prior distribution, which is implicitly a function of the smoothing parameters γ. Assume for the moment that the latter have been fixed to some value, so that the prior is given. Recall that we are basing additive quantile regression on the ELF loss, rather than on a probabilistic model for p(y|β). This is an impediment to performing Bayesian inference for QGAMs, as the lack of a likelihood function prevents us from applying Bayes' rule to update p(β) to the corresponding posterior, p(β|y). We address the issue by adopting the belief updating (BU) framework of Bissiri et al. (2016), which allows us to perform the Bayesian update using a general loss function, rather than a likelihood. In particular, as detailed in Fasiolo et al. (2020b), applying the BU framework to the ELF loss leads to the following update formula

p(β|y) ∝ p(β) ∏_{i=1}^n p̃(y_i|β),   (3)

where

p̃(y|β) ∝ exp{−ρ̃_τ(y − µ)},   (4)

with ρ̃_τ denoting the ELF loss, is the probability density function (p.d.f.) of the ELF distribution, which is an extension of the log-F distribution of Jones (2008). Here λ determines the smoothness of the loss, while 1/σ plays the role of a 'learning rate', determining the relative weights of the loss-based pseudo-likelihood and the prior. In fact, letting λ → 0+ leads to p(β|y) ∝ exp{−Σ_i ρ_τ(y_i − µ_i)/σ} p(β) and, as 1/σ increases, the loss-based likelihood progressively dominates the prior, thus leading to faster learning and a higher risk of overfitting. Decreasing 1/σ has the opposite effect. In contexts where the variance of y varies strongly with x, it can be advantageous to let σ depend on x via the decomposition σ(x) = σ_0 σ̃(x), where σ_0 is the baseline or average learning rate and σ̃(x) is its x-dependent component.
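The role of the learning rate can be illustrated with a toy intercept-only example (a sketch, not the package's machinery): we form a Gibbs posterior for a single quantile parameter µ, using the plain pinball loss and a N(0, 1) prior, and look at how the MAP estimate reacts to 1/σ:

```r
set.seed(1)
y <- rnorm(100, mean = 5)

pinball <- function(z, tau) ifelse(z < 0, (tau - 1) * z, tau * z)

# Negative log Gibbs posterior for the median (tau = 0.5) of y,
# with learning rate 1/sigma and a standard normal prior on mu
mapEst <- function(sigma) {
  negpost <- function(mu) sum(pinball(y - mu, tau = 0.5)) / sigma + mu^2 / 2
  optimize(negpost, interval = c(-10, 20))$minimum
}

mapEst(sigma = 100)  # small learning rate: MAP heavily shrunk towards the prior mean 0
mapEst(sigma = 0.1)  # large learning rate: MAP close to the sample median (about 5)
```

As 1/σ grows, the loss term dominates the prior and the MAP estimate moves from the prior mean towards the empirical median, mirroring the overfitting risk discussed above.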
Having defined (3) which, following Syring and Martin (2015), we refer to as the Gibbs posterior, the next section outlines how the fast calibrated Bayesian methods proposed by Fasiolo et al. (2020b), and implemented by qgam, can be used to fit ELF-based QGAMs.

Fast calibrated Bayesian model fitting methods
We fit quantile GAMs using three nested optimization routines, in particular:

1. in the outer iteration, the baseline learning rate 1/σ_0 is selected by minimizing a calibration loss function numerically;

2. for fixed σ_0, the loss smoothness λ and the x-dependent component of the learning rate, σ̃(x), are determined using closed-form expressions, while the smoothing parameters γ are selected by numerically optimizing an intermediate criterion;

3. for fixed σ(x) = σ_0 σ̃(x), λ and γ, the regression coefficients β are estimated by numerically optimizing an inner criterion.
Of course, the parameter τ ∈ (0, 1), which indicates the quantile of interest, is kept fixed throughout. The fact that the three iterations are nested, not sequential, means that evaluating the outer objective function requires solving the intermediate optimization problem and, in turn, each step of the latter entails running the inner optimization to convergence. In the following we provide a methodological outline of the three iterations, going from the inner to the outer one, and in Section 2.4 we describe how the whole procedure is implemented in qgam.

Inner iteration: MAP estimation of the regression coefficients
The inner iteration estimates the regression coefficients using efficient maximum a posteriori (MAP) methods, for fixed γ, σ(x) and λ. In particular, let y = {y_1, ..., y_n} be the vector of observed responses and note that the conditional quantile is modelled by µ(x_i) = x_i^T β, where x_i is the i-th row of the n × d design matrix X, containing the spline basis functions evaluated at the covariate values. It is easy to show that maximizing the logarithm of the ELF-based Gibbs posterior (3) is equivalent to minimizing the criterion

V(β) = (1/2) Σ_{i=1}^n Dev_i{β, σ(x_i), λ} + (1/2) β^T S_γ β,   (5)

where Dev_i{β, σ(x_i), λ} is the i-th deviance component, based on the ELF density (4). Criterion (5) is smooth and can be minimized efficiently with respect to (w.r.t.) β using a penalized iteratively re-weighted least squares (PIRLS) algorithm, as detailed in Fasiolo et al. (2020b).
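For fixed λ, σ(x) and γ the inner objective is smooth, so even a generic optimizer can locate the MAP estimate. The sketch below illustrates the structure of the criterion on a toy linear quantile model, with a simple diagonal penalty standing in for β^T S_γ β; it is not the PIRLS algorithm used by qgam:

```r
set.seed(2)
n <- 200
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)

X <- cbind(1, x)    # toy design matrix
S <- diag(c(0, 1))  # toy penalty matrix (no penalty on the intercept)

# Smooth ELF loss with sigma = 1, computed stably
elf <- function(z, tau, lambda) {
  u <- z / lambda
  (tau - 1) * z + lambda * (pmax(u, 0) + log1p(exp(-abs(u))))
}

# Penalized objective: ELF-based fidelity plus quadratic prior penalty
obj <- function(beta, tau = 0.5, lambda = 0.05, gamma = 1) {
  sum(elf(y - X %*% beta, tau, lambda)) + gamma * sum(beta * (S %*% beta)) / 2
}

betaHat <- optim(c(0, 0), obj, method = "BFGS")$par
betaHat  # MAP estimate, close to the true coefficients (2, 3) for the median
```

Because the ELF loss is differentiable everywhere, gradient-based methods such as BFGS (or PIRLS, in the package) converge reliably, which is not the case for the kinked pinball loss.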

Intermediate iteration: selection of the smoothing parameters and ELF loss smoothness
The intermediate iteration selects the smoothing parameters by maximizing a Laplace approximation to the ELF-based marginal likelihood. The latter is

log p{y|γ, σ(x), λ} = log ∫ p̃{y|β, σ(x), λ} p(β|γ) dβ,   (6)

where we have made explicit the dependence of the prior on the smoothing parameters. Applying a Laplace approximation to the intractable integral (6) leads to the following Laplace approximate marginal likelihood (LAML) criterion

log p̂{y|γ, σ(x), λ} = ll{σ(x), λ} − (1/2) Σ_i Dev_i{β̂, σ(x_i), λ} − (1/2) β̂^T S_γ β̂ + (1/2) log|S_γ|_+ − (1/2) log|H + S_γ| + (M_p/2) log(2π),   (7)

where β̂ is the minimizer of (5), estimated using the inner iteration, H is the Hessian of the unpenalized part of (5) w.r.t. β at β̂, ll{σ(x), λ} is the saturated log-likelihood corresponding to the ELF density (4), M_p is the dimension of the null-space of S_γ and |S_γ|_+ is the product of its non-zero eigenvalues. Fasiolo et al. (2020b) focus on a marginal loss criterion, which is equal to the negative of (6) up to a normalization constant, to stress that σ_0 should not be selected by jointly maximizing (7) w.r.t. γ and σ_0. This would be appropriate if σ_0 were the scale parameter of a standard GAM, but here the outer calibration procedure for σ_0 should be used instead. We prefer to adopt a likelihood-based terminology to emphasize that (7) can be maximized w.r.t. γ using the numerically stable methods of Wood et al. (2016), which are aimed at standard probabilistic GAMs.
Before describing the outer iteration, we need to specify how σ̃(x) and λ, which are held constant throughout the intermediate and inner iterations, are determined for fixed σ_0. Recall that σ(x) = σ_0 σ̃(x) and define the scaled loss bandwidth h(x) = σ(x) λ. Assume that the responses follow the location-scale model y_i = α(x_i) + κ(x_i) z_i, where the z_i's are i.i.d. with E(z|x) = 0 and var(z|x) = 1. Fasiolo et al. (2020b) show that, under the above location-scale model and further assumptions specified therein, the asymptotic MSE of the regression coefficients is minimized by a bandwidth h*(x), given in (8), which is proportional to κ(x) and depends also on the sample size n, the number of coefficients d, and on f_z, f_z' and F_z, that is the p.d.f. of z, its first derivative and its c.d.f. The idea is to use σ̃(x) to modulate the baseline scaled loss smoothness, λσ_0, and learning rate, 1/σ_0, to make them respectively directly and inversely proportional to the conditional variance of y. Hence, for fixed σ_0, we obtain σ̃(x) and λ using the formulas just mentioned. This, of course, requires estimates of α(x) and κ(x) in the location-scale model, as well as of f_z and F_z. Such estimates need to be obtained only once, before initiating the nested QGAM fitting procedure described here, hence we will discuss the specific approach we follow in Section 2.4, where we also describe its software implementation.

Outer iteration: selection of the learning rate by calibrated Bayes
Recall that 1/σ_0 is the baseline learning rate, which determines the relative weight of the loss and the prior in the Gibbs posterior (3). Increasing 1/σ_0 leads to faster learning, that is, wigglier fitted quantile curves and narrower posterior credible intervals. Fasiolo et al. (2020b) select σ_0 using a calibration procedure aimed at guaranteeing that the credible intervals for µ(x) approximately achieve the correct frequentist coverage. This is achieved by minimizing an estimate, given in (9), of the integrated Kullback-Leibler (IKL) divergence between two Gaussian approximations of the posterior distribution of µ(x). Here KL(•, •) and N(•, •) indicate respectively the KL divergence and the univariate Gaussian distribution, ζ is a positive parameter which we fix to 1/2, and v(x) = x^T V x and v_s(x) = x^T V_s x are the posterior variances of µ(x) under two alternative posterior covariance matrices for β.
In particular, V is based on I, the negative Hessian of the ELF-based log-likelihood, while V_s is a sandwich-type covariance matrix, which involves also the covariance matrix of the gradient of the ELF loss under the data generating process. v̂_s(x_i) is an estimate of v_s(x_i), based on the regularized estimator of V_s described in Fasiolo et al. (2020b) (who use a different notation and indicate v_s(x_i) and V_s by ṽ(x_i) and Ṽ, respectively). The IKL objective function is generally smooth and convex, hence we minimize it efficiently using Brent's method (Brent 2013). Fasiolo et al. (2020b) also propose an alternative calibration procedure, based on another version of the IKL loss, where v̂_s(x) is substituted by the pointwise sample variance of the estimated quantile µ̂(x), obtained by bootstrapping. Such a procedure requires re-fitting the model to several bootstrap samples, hence it is computationally more expensive than minimization of the IKL loss defined above, but it can lead to better coverage in small samples. We refer to Fasiolo et al. (2020b) for explanations of how minimizing either version of the IKL loss w.r.t. σ_0 leads to better frequentist coverage of the posterior credible intervals, relative to joint maximization of LAML (7) w.r.t. σ_0 and γ.
This section outlined the Bayesian QGAM fitting framework of Fasiolo et al. (2020b) from a methodological point of view. The next section explains how the fitting framework just described is implemented in the qgam package.

Software implementation of the fitting framework
The qgam package is an extension of mgcv providing additional tools for handling QGAMs.
Here we focus mainly on how the package implements the fitting framework of Fasiolo et al. (2020b), while in Section 3 we provide usage examples. The err argument of qgam can be used to set the ELF loss smoothness manually. By default qgam determines the loss smoothness automatically, using the methods described in Section 2.3, hence most users will not need to use argument err. However, Appendix A provides some guidelines for users who decide to select the loss smoothness manually. In qgam version 1.3.2, arguments multicore, cluster, ncores and paropts are relevant only when the learning rate is calibrated by bootstrapping, and can be used to perform bootstrap re-fitting in parallel. control is a list of control parameters useful, for instance, to choose the type of calibration used and to suppress the text output printed by qgam, while argGam is a list of arguments to be passed to gam.
To describe how qgam implements the fitting framework of Fasiolo et al. (2020b), assume that we want to fit a QGAM for quantile τ = 0.5, with model formula myForm = list(y ~ s(x1) + s(x2), ~ s(x3)) and using data from a data.frame called myDat. Here y ~ s(x1) + s(x2) is the model used for the median, that is µ_τ(x) = f_1(x_1) + f_2(x_2), where f_1 and f_2 are two smooth effects constructed, by default, using ten thin plate spline basis functions.
The second element of the formula, ~ s(x3), is the model used for the variance κ(x) in the preliminary location-scale GAM model fit. The first thing to point out is that, if σ_0 and the bandwidth h(x) were known, QGAM models could be fitted using gam directly. In fact, we could fit a median quantile model using

R> fit <- gam(y ~ s(x1) + s(x2), data = myDat,
+             family = elf(theta = 1, qu = 0.5, co = 1))

where qgam::elf is a family providing the ELF log-likelihood function and its derivatives. Its arguments theta and co correspond respectively to log σ_0 and h(x), which have been arbitrarily fixed to 1 here. Hence, given σ_0 and h(x), the intermediate and inner iterations for selecting γ and estimating β take place inside gam. But these parameters are generally unknown, so qgam determines σ_0 using the outer iteration and h(x) via the preliminary location-scale GAM fit. In particular, upon calling qgam(myForm, data = myDat, qu = 0.5) the qgam function executes the following pseudo-code:

1. estimate α(x) and κ(x) by fitting a Gaussian location-scale GAM via gam(myForm, data = myDat, family = gaulss), where mgcv::gaulss is a Gaussian location-scale family.
2. numerically fit the sinh-arcsinh density of Jones and Pewsey (2009) to the standardized Gaussian GAM residuals, to provide estimates of f_z, f_z' and F_z^{-1}(τ).
3. estimate h*(x) by plugging the estimates obtained in the previous two steps into (8), where d is set to the number of effective degrees of freedom used to model α(x) in step 1.
4. call the tuneLearnFast function, which minimizes the IKL loss w.r.t. log σ_0 using an outer Brent algorithm. For each trial value of log σ_0: (a) call gam with arguments formula = myForm[[1]] and family = elf, the latter having parameter theta fixed to the current value of log σ_0 and co fixed to h*(x). ELF parameters λ and σ(x) are obtained within elf by λ = n^{-1} Σ_i h*(x_i)/σ_0 and then σ(x_i) = h*(x_i)/λ. The gam call performs the intermediate iteration for selecting γ and the inner iteration for estimating β.
(b) estimate the IKL loss using (9), where v(x) is easily obtained from the gam output, while v_s(x) is estimated using the regularized estimator of Fasiolo et al. (2020b).
Upon convergence, return the value of log σ_0 minimizing the IKL loss.
5. call gam as in step 4(a), with elf parameter theta fixed to the value of log σ_0 returned by tuneLearnFast, and co set to h*(x).

6. return the output of the last gam call, with the output of the tuneLearnFast call stored in its $calibr slot.
Note that a variety of different density estimators could be used in step 2 above. We use the sinh-arcsinh distribution because it provides a good balance between flexibility and stability. However, future versions of qgam might allow users to provide their own density estimator. Similar considerations hold for the Gaussian GAM used in step 1. The tuneLearnFast function of step 4 has exactly the same arguments as qgam, and it returns a list containing diagnostic information regarding the outer iteration and the value of σ_0 which minimizes the IKL loss. The output of qgam is an object of class c("qgam", "gam", "glm", "lm"), which can be manipulated using any of the S3 methods available for objects of class "gam". Version 1.3.2 of the qgam package does not provide any S4 methods or classes.
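In other words, a qgam call is roughly equivalent to calling tuneLearnFast and then passing the selected log σ_0 to qgam via its lsig argument, which skips the outer calibration. The following sketch, which assumes the qgam version 1.3.2 interface described above, illustrates this on the mcycle data used in the next section:

```r
library(qgam)
library(MASS)

form <- list(accel ~ s(times, k = 20, bs = "ad"), ~ s(times))

# Outer iteration only: calibrate the log learning rate for quantile 0.9
calib <- tuneLearnFast(form, data = mcycle, qu = 0.9)
calib$lsig  # selected value of log(sigma_0)

# Fit with the calibrated learning rate; no further calibration is performed
fit <- qgam(form, data = mcycle, qu = 0.9, lsig = calib$lsig)
```

Separating the two steps is useful, for instance, when the same calibrated learning rate is to be reused across several closely related model fits.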

A basic example: the motorcycle data set
Here we consider the classic motorcycle accident data set of Silverman (1985), available in the MASS package (Ripley et al. 2013). We start by loading the data and fitting a QGAM for quantile τ = 0.9 as follows

R> library("MASS")
R> library("qgam")
R> fitCycle1 <- qgam(list(form = accel ~ s(times, k = 20, bs = "ad"),
+                         ~ s(times)), data = mcycle, qu = 0.9)

We are using the model µ_τ(time) = f(time), where f(•) is an adaptive smooth effect built using 20 P-spline basis functions (Eilers and Marx 1996). We use an adaptive smooth because, as shown in Figure 1, the shape of the conditional quantile varies sharply between roughly 10 and 40 milliseconds, but it is otherwise quite flat. Recall that the second part of the formula, ~ s(times), is not used to model the quantile, but the variance κ(x) of the response within the location-scale GAM model.

Figure 1: Fitted smooth effect of time on quantile 0.9 of the mcycle data set, with 95% credible intervals obtained using a Gaussian approximation to the posterior.
Given that fitCycle1 inherits from class "gam", generic function calls such as

R> predict(fitCycle1)[1:5]
          1           2           3           4           5 
0.460372275 0.421737850 0.271621363 0.146774904 0.005606131 

dispatch to the relevant method (here predict.gam). Similarly, the fitted effects in the quantile model can be plotted using

R> plot(fitCycle1)

which dispatches to plot.gam, and produces the plot shown in Figure 1. To save memory, no component of the Gaussian location-scale fit is stored in the output of qgam, but the selected values of the parameters log σ_0 and h(x) can be extracted by

R> fitCycle1$family$getTheta()
     0.9 
1.237221 
R> fitCycle1$family$getCo()[1:5]
[1] 0.2791352 0.2728345 0.2549963 0.2442300 0.2346063

Above we fitted a single quantile, but multiple quantiles can be estimated at once using the following function call

R> fitCycleM <- mqgam(list(form = accel ~ s(times, k = 20, bs = "ad"),
+                          ~ s(times)), data = mcycle,
+                     qu = c(0.1, 0.25, 0.5, 0.75, 0.9))

where the mqgam function fits a QGAM to each of the five quantiles. The main advantage of using mqgam, rather than several calls to qgam, is that it saves memory by storing a single version of potentially large objects, such as the model matrix, for all the fitted QGAMs.
The five fitted QGAM models can be found in the list fitCycleM$fit, but calling functions such as plot(fitCycleM$fit[[1]]) would trigger an error because some object elements are missing. The output of mqgam should instead be manipulated using the qdo function. More convenient methods for plotting multiple QGAMs jointly are provided by the mgcViz package (Fasiolo et al. 2019a). In the following we will only use the mgcViz tools that are specific to quantile GAMs, and we refer to Fasiolo et al. (2019b) and to the relevant package documentation for more details. To exploit the visualizations offered by mgcViz, it is necessary to transform the output of mqgam as follows

R> library("mgcViz")
R> fitCycleM <- getViz(fitCycleM)

which produces an object of class c("mqgamViz", "mgamViz"). This is simply a list of objects of class "gamViz", which inherits from the "qgam" class. Given that the elements of fitCycleM are full objects, without any missing slot, they can be manipulated directly rather than via qdo. For example, we can use plot(fitCycleM[[1]]) to plot the fitted effect for quantile τ = 0.1. Of course, transforming fitCycleM using getViz has some memory cost, because fitCycleM now stores a copy of the model matrix for each quantile.
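For instance, once fitCycleM has been converted with getViz, the fitted quantile curves can be collected into a matrix (here called preds, with one column per quantile) and inspected directly; this is a sketch showing one of several ways to obtain such predictions:

```r
# One column of fitted values per quantile, in increasing order of tau
preds <- sapply(fitCycleM, predict)
colnames(preds) <- c(0.1, 0.25, 0.5, 0.75, 0.9)

# Plot the five fitted quantile curves against time
matplot(mcycle$times, preds, type = "l", lty = 1,
        xlab = "times", ylab = "accel")

# Minimal gap between consecutive fitted quantiles (negative => crossing)
min(apply(preds, 1, diff))
```

A matrix of this form is convenient for checking quantile crossing, as done with the preds object in the discussion that follows.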
Having transformed fitCycleM, the effect of time on all quantiles can now be visualized jointly by

R> plot(fitCycleM)
which produces the plot shown in Figure 2, by dispatching to mgcViz::plot.mgamViz. The latter produces an object of class c("plotGam", "gg"), which is a wrapper around one or several objects of class ggplot, defined in the ggplot2 package (Wickham et al. 2019). Note that Figure 2 shows the fitted effects for each τ which, in this case, are equal to the fitted quantiles up to some vertical shifts (the intercepts). The effects cross each other because they are all centered (see Wood (2017) for details on identifiability constraints in GAMs), hence this does not imply that the corresponding quantile estimates (which are not centered) cross within the range of the data, time ∈ (0, 60). In fact, if we plot the estimated quantiles, we do not see any clear crossing in Figure 3. Indeed, the minimal distance between consecutive fitted quantiles (stored column-wise in a matrix preds) is

R> min(apply(preds, 1, diff))
[1] 0.1524413

hence there is no crossing within the range of the data. However, quantiles fitted with qgam often cross somewhere (outside the data range in this example), because they are fitted independently and without any non-crossing constraint. This issue is not specific to qgam, but is a well known drawback of distribution-free quantile regression (Koenker 2005).
In the mcycle data set, the variance of the response (accel) varies strongly with the independent variable (times), hence it is natural to let the learning rate and the ELF loss smoothness vary along the latter. This is achieved by providing a model formula which is a list with two elements, where the second, ~ s(times), is used by qgam to model the conditional variance κ(x) in the location-scale GAM model. However, it is possible to avoid modelling the variance and to assume that it is constant along the covariates. For example, in the following code

R> fitCycleConst <- qgam(accel ~ s(times, k = 20, bs = "ad"),
+                        data = mcycle, qu = 0.9)

we are providing only the model for the quantile τ = 0.9, rather than a list of two formulas, so qgam uses a constant learning rate 1/σ(x) = 1/σ_0 and loss bandwidth h*(x) = h*.

Figure 4: Fitted quantile τ = 0.9, with 95% credible intervals, when the learning rate and the loss smoothness are allowed to vary with the independent variable (left) or not (right).

For the mcycle data set, ignoring the heteroscedasticity leads to the poor results shown in Figure 4.
One problem with the quantile fit shown on the right is that it lies far above all the responses, for time < 10ms.But we are fitting quantile 0.9, hence we should expect around 10% of the responses to lie above the fit.The issue is that the bias of the ELF loss used by qgam is inversely proportional to the conditional variance of the response (see Fasiolo et al. (2020b) for details), hence not modelling the response variance leads to high bias for time < 10ms.
Ignoring heteroscedasticity can also lead to credible intervals that have non-constant coverage.
In fact, the width of the intervals should be proportional to the variance of the response, but in the right plot of Figure 4 they have nearly constant width, thus providing an incorrect representation of the uncertainty of the fit. Of course, in most data sets heteroscedasticity is not as dramatic as in mcycle, and using a location-only model often leads to satisfactory results. However, it is important to be aware of the issues highlighted by this example.

Functional QGAM modelling of electricity demand data
In this example we consider smart meter data from n c = 247 anonymized residential customers from Sydney, Australia, covering the period between the 3rd of July 2010 and the 30th of June 2011.The data has been downloaded from https://www.ausgrid.com.au, and it originally contained electricity demand from 300 customers, at 30min resolution.We discarded 53 customers because their demand was too irregular (e.g., they were absent from home for over 30 consecutive days), and we integrated the demand data with temperature data from the National Climatic Data Center (www.ncdc.noaa.gov),covering the same period.The aim of this example is to illustrate how to use quantile GAMs in qgam, in the context of an electricity demand forecasting application.We start by loading the following data set: R> data("AUDem", package = "qgam") R> meanDem <-AUDem$meanDem R> head(meanDem, 3) • doy the day of the year, from 1 to 365; • tod the time of day, ranging from 18 to 22, where 18 indicates the period from 17:00 to 17:30, 18.5 the period from 17:30 to 18:00 and so on; • dem the demand (in KWh) during a 30min period, averaged over the n c households; • dow factor variable indicating the day of the week; • temp the external temperature at Sydney airport, in degrees Celsius; • date local date and time; • dem48 the lagged mean demand, that is the average demand (dem) during the same 30min period of the previous day; Assume that we aim at producing a probabilistic forecast of the average demand one day ahead, using additive quantile regression.We start by dividing the data into a learning and a testing set: R> cutDate <-as.Date("2011-04-01 00:00:00") R> meanDemLearn <-subset(meanDem, as.Date(date) < cutDate) # Learning R> meanDemTest <-subset(meanDem, as.Date(date) >= cutDate) # Testing which leaves the last three months for testing.Figure 5 shows the average demand between 19:30 and 20:00, over the whole period.We focus only on the period between 17:30 and 21:30, because the demand dynamics 
change considerably before and after this time slot, hence modelling demand across the whole day would require a much more sophisticated and computationally expensive model than any of those considered below.
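A plot along the lines of Figure 5 can be reproduced, up to aesthetics, as follows. The original plotting code is not shown in the text, so the sketch below is our assumption; in particular, based on the encoding of tod described above, we assume that tod = 20.5 corresponds to the 19:30-20:00 slot.

```r
# Sketch: plot the average demand between 19:30 and 20:00 (cf. Figure 5).
# Assumption: tod = 20.5 encodes the 19:30-20:00 half-hour, given that
# tod = 18 encodes 17:00-17:30 and each step adds half an hour.
library("qgam")
data("AUDem", package = "qgam")
meanDem <- AUDem$meanDem
slot <- subset(meanDem, tod == 20.5)
plot(slot$date, slot$dem, type = "l",
     xlab = "Date", ylab = "Demand (kWh)")
```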
We start by modelling the τ-th quantile of the average demand using the quantile model

µ_τ(x_t) = Σ_{j=1}^{7} ψ_j I(dow_t = j) + ψ_8 dem48_t + f_1(tod_t) + f_2(temp_t) + f_3(doy_t),   (10)

where I(dow_t = j) = 1 when dow_t is the j-th day of the week and zero otherwise, f_1 and f_2 are smooth effects built using thin plate splines, while f_3 is a cyclical effect constructed using cubic regression splines. We use a cyclic effect for doy so that the seasonal effect has the same value on the 31st of December and on the 1st of January. Given that demand variability seems to spike during the austral summer (see Figure 5), we model the variance κ(x) using a cyclic effect for doy in the location-scale Gaussian GAM. We fit the model just described to quantiles τ = 0.1, 0.3, 0.5, 0.7, 0.9 using the following code

R> library("qgam")
R> qusObj <- seq(0.1, 0.9, length.out = 5)
R> fitQ <- mqgam(list(dem ~ dow + dem48 +
+      s(tod, k = 6) + s(temp) + s(doy, bs = "cc"), ~ s(doy, bs = "cc")),
+    qu = qusObj, data = meanDemLearn)

We then plot the estimated smooth effects and the linear coefficients of lagged demand using

R> library("mgcViz")
R> fitQ <- getViz(fitQ)
R> print(plot(fitQ, allTerms = TRUE, select = c(1:3, 5)), pages = 1)

The output, up to some aesthetic manipulation, is shown in Figure 6. Note that we used the plotting tools provided by mgcViz, which required us to convert the output of mqgam using getViz. Then we used plot to plot all model terms (allTerms = TRUE is used to plot the parametric terms in addition to the smooth effects), we selected a subset of the effects using the select argument and we used print to arrange all the plots on a single page.
The shapes of the estimated effects seem quite reasonable: the tod effect increases as people get home and decreases later in the evening, the temperature plot shows a strong cooling effect and a negligible heating effect, and the effect of doy shows a main mode during the austral winter and a lower, narrower, mode during the summer. The discrepancy between the effects estimated on different quantiles can be interpreted in terms of the corresponding effect on the shape of the demand distribution. For example, at tod = 18 the effects f_1 are more spread out than at tod = 19 implying that, all else being equal, the variance of the demand is higher at tod = 18. Similarly, at doy ≈ 25 the effect f_3 is more positive for quantile τ = 0.9 than for τ = 0.1, while the positions are reversed at doy ≈ 120. Hence, the demand distribution is more skewed to the right in the first case. The linear coefficient of the lagged demand increases with τ hence, all else being equal, the demand variance one day ahead increases with the lagged demand.
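The increase of the lagged-demand coefficient with τ can also be checked numerically, using the qdo wrapper listed in Table 1. A minimal sketch, assuming fitQ is the raw output of mqgam() above (before the getViz() conversion):

```r
# Sketch: extract the linear coefficient of dem48 for each fitted quantile.
# Assumes `fitQ` is the raw mqgam() output, before conversion with getViz().
library("qgam")

qusObj <- seq(0.1, 0.9, length.out = 5)
coefDem48 <- sapply(qusObj, function(q) {
  qdo(fitQ, qu = q, fun = coef)["dem48"]  # apply coef() to the fit for quantile q
})
names(coefDem48) <- qusObj
round(coefDem48, 3)  # expected to increase with tau
```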
So far we considered only the mean demand over the portfolio of customers, however the original data set contains demand for each of the n_c households. Average demand is typically much easier to forecast than individual demand, because the highly irregular consumption of individual customers is averaged out. However, the by-customer data should carry more information, hence it is interesting to verify whether it can be used to improve the quantile forecasts for the average demand. We capture part of the information contained in the by-customer demand data by considering a model which, for storage reasons, uses a functional summary of the full demand data, as explained in the following.
We consider the functional quantile model

µ_τ(x_t) = Σ_{j=1}^{7} ψ_j I(dow_t = j) + ψ_8 dem48_t + f_1(tod_t) + f_2(temp_t) + f_3(doy_t) + ∫_0^1 f_4(p) G^{-1}_{t-48}(p) dp,   (11)

where µ_τ(x_t) is, as in (10), the τ-th quantile of the average demand, while we indicate with G_t the c.d.f. of the individual demand across customers at time t and with G^{-1}_{t-48}(p) the corresponding p-th quantile, at the same instant of the previous day. Function f_4 is a smooth effect along p ∈ [0, 1], constructed using a thin plate spline basis. Hence, the last term in (11) is a functional effect which takes as input the whole quantile function of the across-customers demand distribution at time t − 48, and integrates it over p using the weight function f_4(p). The aim is to verify whether the shape of the distribution of the individual demand at time t − 48 can be used to improve the forecast of the average demand at time t. The regression coefficients of f_4(p) are unknown, but they can be estimated using the methods presented in Section 2.3. In particular, expanding f_4 over a spline basis b_1, ..., b_{K_4} with coefficients β_1, ..., β_{K_4}, we have that

∫_0^1 f_4(p) G^{-1}_{t-48}(p) dp = Σ_{k=1}^{K_4} β_k b̃_k,  with  b̃_k = ∫_0^1 b_k(p) G^{-1}_{t-48}(p) dp,

where the transformed basis b̃_1, ..., b̃_{K_4} implicitly depends on time t. Hence, once the transformed basis is obtained, the functional effect is linear in the regression coefficients and estimating them presents no extra difficulty, relative to standard smooth effects. The across-customers demand c.d.f. G_{t-48} is unknown, but we have a sample of size n_c from it, consisting of the by-customer demand data. Hence, we can approximate the transformed basis using

b̃_k ≈ (1/n_q) Σ_{j=1}^{n_q} b_k(p_j) Ĝ^{-1}_{t-48}(p_j),   (12)

where Ĝ^{-1}_{t-48}(p_1), ..., Ĝ^{-1}_{t-48}(p_{n_q}) is a set of empirical quantiles of the n_c by-customer demand observations at time t − 48, corresponding to probability levels 0 < p_1 < · · · < p_{n_q} < 1. To reduce the amount of data stored in the qgam package, we use n_q = 20 < n_c quantiles equally spaced between p_1 = 0.01 and p_{n_q} = 0.99. The quantiles are stored in

R> qDem48 <- AUDem$qDem48
R> qDem48[1:3, c(1, 2, 5, 15, 19, 20)]

where each row contains Ĝ^{-1}_{t-48}(p_1), ..., Ĝ^{-1}_{t-48}(p_{n_q}), for some t.
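The approximation in (12) is just an equally weighted average over empirical quantiles. The following self-contained base-R sketch illustrates the idea, with a toy weight function standing in for a basis function b_k; the sample and weight function are illustrative, not part of qgam:

```r
# Sketch: approximate the integral over (0, 1) of b(p) * Ginv(p) dp using nq
# empirical quantiles, mimicking the transformed-basis construction in (12).
set.seed(1)
y  <- rgamma(247, shape = 4, rate = 1)        # stand-in for by-customer demand at t - 48
nq <- 20
p  <- seq(0.01, 0.99, length.out = nq)        # probability levels p_1, ..., p_nq
Ginv <- quantile(y, probs = p, names = FALSE) # empirical quantile function

b <- function(p) p - 0.5                      # toy basis function, centred on (0, 1)
bTilde <- mean(b(p) * Ginv)                   # analogue of (12), with equal weights 1/nq
```

Since b is increasing and the empirical quantile function is non-decreasing, bTilde is positive here; in qgam the same kind of average is computed for each spline basis function of f_4 and each time t.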
Figure 7 shows five selected rows of qDem48, each line being an estimate of the quantile function G^{-1}_{t-48}, for a different t. To fit model (11) with qgam, we also need a matrix probLev, each row of which contains the probability levels p_1, p_2, ..., p_{n_q}. This is computed as follows

R> probLev <- matrix(seq(0.01, 0.99, length.out = 20),
+    nrow = nrow(meanDem), ncol = 20, byrow = TRUE)
R> probLev[1:3, c(1, 2, 5, 15, 19, 20)]

After adding the matrices just defined to the learning and testing sets, we are ready to fit model (11) using

R> fitFunQ <- mqgamV(list(dem ~ dow + s(temp) + s(doy, bs = "cc") +
+      s(tod, k = 6) + s(probLev, by = qDem48), ~ s(doy, bs = "cc")),
+    qu = qusObj, data = meanDemLearn)

where we are using the mgcViz::mqgamV function, which is simply a shortcut for fitting the model using mqgam and then transforming it using getViz. Here the functional effect is constructed using s(probLev, by = qDem48), which is the constructor typically used for by-variable or varying-coefficient effects in mgcv. The corresponding transformed spline basis is constructed by gam, which performs the summation in (12) using the rows of the probLev and qDem48 matrices.
We can have a look at the fitted functional smooth effects f_4(p) using

R> plot(fitFunQ, select = 4) + labs(color = expression(tau))

which produces the plot shown in Figure 8. It is interesting to note that the effects corresponding to low and high probabilities diverge, suggesting that the extremes of the quantile function G^{-1}_{t-48}(p) have a stronger linear effect on the high quantiles of the average demand (e.g., µ_τ(x) with τ = 0.9) than on the low ones. However, the plot.mgamViz method does not include credible intervals when plotting the same effect for multiple quantiles, to avoid cluttering the plot. In Figure 9 we compare the effects for τ = 0.1 and 0.9, and we include 95% credible intervals. This shows that the discrepancy between the effects is statistically stronger for p ≈ 1 than for p ≈ 0.
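A comparison such as the one in Figure 9 can be obtained, up to aesthetics, by plotting the functional effect separately for the two extreme quantiles, so that credible intervals can be retained. The exact code used for the figure is not shown in the text, so the mgcViz calls below are our assumption:

```r
# Sketch: plot f4(p) for tau = 0.1 and tau = 0.9 with credible intervals.
# Assumes fitFunQ is the mqgamV() output above, indexable by quantile.
library("mgcViz")

pl1 <- plot(sm(fitFunQ[[1]], 4)) + l_ciPoly() + l_fitLine()  # tau = 0.1
pl5 <- plot(sm(fitFunQ[[5]], 4)) + l_ciPoly() + l_fitLine()  # tau = 0.9
gridPrint(pl1, pl5, ncol = 2)
```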
Of course, model (11) could be improved further. For example, a call to

R> check(fitFunQ[[3]])
Theor. proportion of neg. resid.: 0.5   Actual proportion: 0.5155229
Integrated absolute bias |F(mu) - F(mu0)| = 0.02719397
Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-3.421902e-08, 4.

shows that, for τ = 0.5, the effective degrees of freedom (edf) are quite close to their maximum possible value (k') for the temperature effect. This means that the fit is using all the degrees of freedom available for that effect, hence we might want to consider increasing the number of basis functions by using s(temp, k = 20) in the model formula. However, one might be reluctant to increase the number of basis functions used to fit more extreme quantiles (e.g., τ = 0.9), where the data is sparser. This highlights a disadvantage of using mqgam, rather than repeated calls to qgam, when fitting several QGAMs: the same model formula is used for all the quantiles. The model could also be improved by, for example, including the effect of lagged temperatures, often useful for capturing thermal inertia in buildings.
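Whether the functional effect actually improves the forecasts can be assessed by scoring both models on the test set with the pinball loss itself. A sketch, assuming fitQ and fitFunQ are raw (m)qgam fits and that meanDemTest already contains the probLev and qDem48 matrices; the pinLoss helper is ours, not a qgam function:

```r
# Sketch: average pinball (check) loss on the test set, per quantile level.
# pinLoss() is a hypothetical helper defined here; lower scores are better.
pinLoss <- function(y, mu, qu) {
  d <- y - mu
  mean(pmax(qu * d, (qu - 1) * d))
}

qusObj <- seq(0.1, 0.9, length.out = 5)
scores <- sapply(qusObj, function(q) {
  muQ    <- qdo(fitQ,    qu = q, fun = predict, newdata = meanDemTest)
  muFunQ <- qdo(fitFunQ, qu = q, fun = predict, newdata = meanDemTest)
  c(base       = pinLoss(meanDemTest$dem, muQ,    q),
    functional = pinLoss(meanDemTest$dem, muFunQ, q))
})
colnames(scores) <- qusObj
round(scores, 4)
```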

Conclusions
The qgam R package provides tools for building, fitting and checking quantile additive models. The main advantage of such models, relative to standard probabilistic GAMs, is that they do not make any parametric assumption on the distribution of the response variable. This is achieved by modelling and estimating the conditional quantiles directly, using the pinball loss. Model fitting is based on the fast calibrated Bayesian framework of Fasiolo et al. (2020b), which provides calibrated credible intervals, via IKL selection of the learning rate, and computational efficiency, by smoothing the pinball loss and by exploiting the marginal likelihood methods of Wood et al. (2016). Model building is handled by mgcv, hence the quantile GAMs built using qgam can include any of the effect types made available by mgcv, among others: cyclic, adaptive and multivariate tensor-product smooths, as well as functional and random effects.
At the time of writing, qgam version 1.3.2 can handle data sets and models of moderate size. In particular, the current fitting methods require explicit formation of the n × d model matrix X, which leads to high memory usage for large n. Wood et al. (2015) proposed a GAM fitting framework where explicit formation of X is avoided by adopting block-oriented iterations for smoothing parameter selection and regression coefficient estimation. The bam function in mgcv implements such methods, as well as the marginal discretization approach of Wood et al. (2017), which allows it to fit models with up to 10^4 coefficients and 10^8 data points. Future work will aim at improving the scalability of the Bayesian QGAM fitting framework currently implemented in qgam, by exploiting the block-oriented discretized fitting methods provided by mgcv for the inner and intermediate iterations, and by developing a similarly scalable version of the IKL minimization routine currently used for learning rate selection.

A. Choosing the loss bandwidth manually
The qgam function determines the ELF loss bandwidth h(x) automatically, but this parameter can be chosen manually by specifying the err argument, which is set to NULL by default. Here we provide some guidelines for users who wish to choose this parameter manually. In particular, the examples in Section A.1 illustrate how parameter err affects the statistical and computational performance of the fitting methods provided by qgam.

A.1. Illustrating the effect of the loss bandwidth and some diagnostics
To clarify the interpretation of parameter err, let us consider a univariate context where we want to estimate the τ-th quantile of response y and there are no covariates. Define ε = |F(µ*) − F(µ_0)|, where F is the c.d.f. of y, µ_0 is the true quantile at probability level τ and µ* is the minimizer of the expected smoothed ELF loss. Given that µ_0 is the minimizer of the expected pinball loss (see (1)), ε is the asymptotic absolute bias induced by the use of the smooth ELF loss, with bandwidth h = λσ. In Appendix A.2 we prove that

ε ≤ 2 log(2) h sup_z f(z) = 2 log(2) h / √(2πκ),

where f is the p.d.f. of y and the last equality holds if y is Gaussian with variance κ. Via argument err, the qgam function allows users to set the maximum tolerable bias ε, which is then used to determine the loss bandwidth by

h = ε √(2πκ) / (2 log 2).   (14)

Here κ is estimated via the location-scale GAM model, and can vary with x as usual. Parameter h = h(x), obtained via (14), is then used to determine λ and σ(x) as in Section 2.3. Of course, ε (err in the software) is an approximate bound, because the response is not normally distributed in practice.
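As a toy illustration of how a bias tolerance maps to a loss bandwidth under a Gaussian assumption, consider the sketch below. The scaling constant follows the Gaussian bound discussed above and is an assumption for illustration only; qgam performs this conversion internally and the exact constant it uses may differ.

```r
# Toy sketch: map a bias tolerance `err` to a loss bandwidth h, assuming a
# Gaussian response with variance kappa. Illustrative only: the exact constant
# used internally by qgam may differ.
errToBandwidth <- function(err, kappa) {
  err * sqrt(2 * pi * kappa) / (2 * log(2))
}

h1 <- errToBandwidth(0.05, kappa = 1)
h2 <- errToBandwidth(0.10, kappa = 1)  # doubling err doubles the bandwidth
```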
In qgam we let users choose ε, rather than λ, because it does not depend on the scale of the response y. Indeed, considering the interpretation of ε, reasonable values of this parameter fall in (0, 1) (but qgam checks only that err is positive). To illustrate how err influences the accuracy of the quantile estimates and the computational performance of qgam, we consider a simple data set simulated as follows

R> set.seed(5523)
R> x <- seq(-3, 3, length.out = 1e3)
R> X <- data.frame("x" = x, "y" = x + x^2 + rgamma(1e3, 4, 1))

Hence y_i = x_i + x_i^2 + z_i, where z_i ~ Gamma(4, 1). Assuming that we want to estimate quantiles 0.05, 0.5 and 0.95, we fit each of them using several values of err. The resulting quantile fits are shown in Figure 10. It is clear that the bias is positive for τ = 0.95, negative for τ = 0.05 and that its absolute value increases with err, becoming clearly visible for err > 0.1. However, increasing err affects the median estimate only slightly. Heuristically, this is because the bias induced by the smoothed loss is greater when the response distribution is more asymmetric around the quantile of interest.
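Such a fitting loop might look like the following sketch; the err grid below is an illustrative assumption, as the original values are not shown here.

```r
# Sketch: fit quantiles 0.05, 0.5 and 0.95 under several loss bandwidth
# tolerances. The err values are illustrative assumptions.
library("qgam")

qus  <- c(0.05, 0.5, 0.95)
errs <- c(0.01, 0.05, 0.1, 0.3)
fits <- lapply(errs, function(e) {
  mqgam(y ~ s(x), qu = qus, err = e, data = X)
})
names(fits) <- paste0("err = ", errs)
```

In the text below, fitSmallErr and fitBigErr refer to fits obtained with a very small and a larger value of err, respectively.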
One might conclude that the loss smoothness should be kept as small as possible, but this leads to computational and numerical stability issues. In fact, decreasing err tends to slow down computation and, in this example, using a very small tolerance increases the computing time dramatically. More importantly, using a very low value of err can lead to numerical problems. In fact, the following code

R> check(fitSmallErr$calibr, sel = 2)

produces the plot in Figure 11, which shows what the estimated IKL loss minimized by the outer iteration looks like. The dots indicate the points at which the loss has been evaluated during the outer Brent optimization, and the vertical red line indicates the value of log σ_0 which minimizes the loss. Here the loss seems to be discontinuous, which is due to numerical instabilities. In fact, the analogous plot for fitBigErr (not shown) looks smooth and convex.
To produce the diagnostic plot in Figure 11 we used the generic qgam::check function, which dispatched to the check.learnFast method. This is because the output of qgam contains the results of a call to tuneLearnFast in the $calibr slot. The call to check generates two plots and here we are showing only the second one by choosing sel = 2.
We can also use check to obtain an estimate of the bias attributable to the smoothed loss, together with information regarding the convergence of the smoothing parameter estimation routine. This is achieved by

R> check(fitBigErr)
Theor. proportion of neg. resid.: 0.95   Actual proportion: 0.952
Integrated absolute bias |F(mu) - F(mu0)| = 0.004319621
Method: REML   Optimizer: outer newton

The right-hand plot produced by check is a histogram of all the estimated values of the bias |F{µ*(x_i)} − F{µ_0(x_i)}|, and it shows that the bias is low for all observations. Hence, the text and visual output of check.qgam confirm that the LAML optimization achieved full convergence and that we should not be too concerned about the bias. The last few lines of the text output indicate that the effective degrees of freedom (edf) used to model the effect of x are well below the degrees of freedom available for this smooth effect (k'). This suggests that the number of basis functions used to model the effect of x was sufficiently large, while edf ≈ k' would indicate that the fit is using all the available degrees of freedom, and that we might want to increase the number of basis functions used.
The results presented in this section and our practical experience on loss smoothness selection and convergence checking with qgam can be summarized in the following set of suggestions:
• the automatic procedure for selecting the loss smoothness generally offers a good compromise between statistical bias, variance and numerical stability;
• the old default (used in versions of the qgam package lower than 1.3.0) was err = 0.05, which generally does not lead to unacceptably high levels of bias;
• if the calibration loss plotted by check(fittedQGAM$calibr) is irregular, or the text printed by check(fittedQGAM) does not confirm that full convergence was achieved, try to increase err;
• if you have to increase err to 0.2 or higher to avoid convergence issues, then there might be something wrong with your model (for example, it is missing an important effect);
• the bias attributable to the adoption of a smoothed ELF loss can be estimated using check(fittedQGAM);
• you might get messages saying that outer Newton did not converge fully during estimation. These are generated by mgcv::gam during LAML maximization, and should not be problematic as long as the calibration loss is smooth (which you can check using check(fittedQGAM$calibr)) and check(fittedQGAM) states that full convergence was achieved;
• in preliminary studies, that is when you are exploring different model structures and you are not yet interested in getting the most accurate estimates, do not decrease err too much as it considerably slows down computation;
• setting err too low is generally not a good idea. In fact, err is an approximate upper bound on the bias, the latter being generally much lower than this parameter suggests, and it is arguably better to have a small amount of bias than numerical problems.
As stated above, the loss smoothness selection procedure used by default in qgam and mqgam typically provides stable fits and a near optimal bias-variance trade-off.

Figure 2: Plot showing the fitted additive effects of time on all five quantiles, obtained using the mgcViz package.

Figure 3: Plot showing the five conditional quantiles fitted using qgam.
Figure 4: Fitted quantile τ = 0.9, with 95% credible intervals, when the learning rate and the loss smoothness are allowed to vary with the independent variable (left) or not (right).

Figure 6: Estimated effects of time of day (tod), temperature (temp), day of year (doy) and lagged demand (dem48) from model (10), for the quantiles labelled using the id variable.

Figure 7: Estimates of the individual demand quantile function G^{-1}_{t-48}, for five values of t.

Figure 10: Quantile fits for τ = 0.05, 0.5 and 0.95 for different values of err. The dashed lines are the true quantiles.

Figure 11: Plot produced by check.learnFast, showing the estimated IKL loss at several values of log σ_0, when err = 0.001. The vertical jump in the loss indicates numerical problems.

Table 1 lists the main functions provided by qgam. The most commonly used one is itself called qgam and has the following arguments.

Function name    Description
qgam             Fits a QGAM model for a single quantile τ. It is analogous to the
                 gam function in mgcv, hence any smooth effect type available in
                 gam is also available in qgam.
mqgam            Fits the same QGAM model to a vector of k quantiles τ_1, ..., τ_k
                 more efficiently than by calling qgam repeatedly.
tuneLearnFast    Selects the learning rate by minimizing (9) using Brent's method.
tuneLearn        Evaluates (9) on a grid of values of σ_0.
check            Given a model fitted using qgam, produces some diagnostic plots.
qdo              Is a wrapper for using standard generics (e.g., summary) on the
                 output of mqgam. Needed because mqgam does not output an object
                 of class gam.

Table 1: Main functions provided by qgam.

Arguments form and data have the same meaning as in mgcv::gam (henceforth just gam), qu is the quantile of interest τ, lsig is log σ_0 and err is a positive parameter allowing to