bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond)

Over the last decades, the challenges in applied regression and in predictive modeling have been changing considerably: (1) More flexible model specifications are needed as big(ger) data become available, facilitated by more powerful computing infrastructure. (2) Full probabilistic modeling rather than predicting just means or expectations is crucial in many applications. (3) Interest in Bayesian inference has been increasing, both as an appealing framework for regularizing or penalizing model estimation and as a natural alternative to classical frequentist inference. However, while there has been a lot of research in all three areas, also leading to associated software packages, a modular software implementation that makes it easy to combine all three aspects has not yet been available. To fill this gap, the R package bamlss is introduced for Bayesian additive models for location, scale, and shape (and beyond). At the core of the package are algorithms for highly-efficient Bayesian estimation and inference that can be applied to generalized additive models (GAMs) or generalized additive models for location, scale, and shape (GAMLSS), also known as distributional regression. However, its building blocks are designed as "Lego bricks" encompassing various distributions (exponential family, Cox, joint models, ...), regression terms (linear, splines, random effects, tensor products, spatial fields, ...), and estimators (MCMC, backfitting, gradient boosting, lasso, ...). It is demonstrated how these can be easily recombined to make classical models more flexible or to create new custom models for specific modeling challenges.


Introduction
Many modern modeling tasks necessitate flexible regression tools that can deal with: (1) Large data sets that can be long (many observations) and/or wide (many variables or complex effect types).
(2) Probabilistic forecasts that capture the entire distribution and not only its mean or expectation. (3) Enhanced inference infrastructure, typically Bayesian, broadening classical frequentist methodology. A popular framework combining flexible regression with probabilistic modeling is that of generalized additive models (GAMs, Hastie and Tibshirani 1990), later extended to generalized additive models for location, scale, and shape (GAMLSS, Rigby and Stasinopoulos 2005), also known as Bayesian structured additive distributional regression (Klein, Kneib, Lang, and Sohn 2015c), which encompasses (generalized) linear models (GLMs, Nelder and Wedderburn 1972) as special cases. Bayesian inference in these models can be seen as a natural framework for penalizing flexible model terms and for overcoming potential problems with p values and classical null hypothesis significance testing (Wasserstein and Lazar 2016). However, when fitting such models to big data (long and/or wide), classical estimation techniques using standard algorithms like iteratively weighted least squares (IWLS, Gamerman 1997) or Markov chain Monte Carlo (MCMC) might not be feasible. Instead, regularized estimation techniques such as the lasso or boosting (Friedman, Hastie, and Tibshirani 2010; Mayr, Fenske, Hofner, Kneib, and Schmid 2012), or further advanced custom algorithms (Wood 2017), might be necessary. Hence, to facilitate addressing all challenges and needs simultaneously, independent of a specific estimation strategy and/or fitting algorithm, the bamlss package for the R system for statistical computing (R Core Team 2021) implements a modular "Lego toolbox", extending the work of Umlauf, Klein, and Zeileis (2018). In this framework, not only the response distribution is a "Lego brick" (as in a classical GLM) or the regression terms (as in a GAM), but also the estimation algorithm, such as a specific MCMC sampler.
The idea of a "Lego toolbox" for regression models has, of course, been around for some time: several implementations, Bayesian and frequentist, not only make it easy to implement new distributions, but also new model terms, from splines to neural networks to regression trees; in some implementations the optimization routines can be exchanged as well. The following is a list of well-known packages for regression models in the R ecosystem whose implementations are designed to be particularly flexible.
• GAMs and GAMLSSs are available in a number of packages, most notably the mgcv package (Wood 2017) and also the gamlss family of packages (Stasinopoulos, Rigby, Heller, Voudouris, and Bastiani 2017; Rigby, Stasinopoulos, Heller, and Bastiani 2019) and VGAM (Yee 2010). The latter two are notable for their support of a wide range of response distributions. While VGAM is restrictive with respect to the integration of flexible model terms, the gamlss package also supports (user-defined) smooth additive terms of general type (e.g., neural networks and regression trees); however, inference is mainly supported only for linear model terms. In contrast, mgcv excels at providing highly-optimized algorithms for general smooth models (Wood, Pya, and Säfken 2016), including inference, as well as the dedicated bam() function for big data that is long and/or wide (Wood, Li, Shaddick, and Augustin 2017). All these packages rely on frequentist estimation strategies. Moreover, mgcv provides sophisticated infrastructure for generating new classes of smooth terms (which is fully adopted by the bamlss package).
• Bayesian inference is not only an increasingly popular alternative to classical frequentist inference, it is also particularly attractive for hierarchical or multilevel models and for penalizing regression effects through suitable prior distributions. Also, fully Bayesian approaches using MCMC are appealing in flexible regression models for obtaining credible intervals from the posterior samples. The brms package (Bürkner 2017) is notable for providing a standard R workflow for estimating Bayesian multilevel models using Stan (Carpenter et al. 2017). Also, the above-mentioned mgcv package supports estimation of Bayesian GAMs via its jagam() function (Wood 2016) based on JAGS (Plummer 2003).
For more flexibility, going beyond these capabilities, it is in principle possible to directly implement custom models using general purpose MCMC software like JAGS, Stan, or the BUGS family of packages (Lunn, Thomas, Best, and Spiegelhalter 2000; Goudie, Turner, De Angelis, and Thomas 2020). However, for complex models, e.g., using large data sets, spatial effects, or higher-order interactions, sampling times from these generic MCMC engines can become long, sometimes prohibitively long. This has been addressed by dedicated packages for Bayesian additive models, e.g., the standalone package BayesX (Brezger, Kneib, and Lang 2005; Belitz, Brezger, Klein, Kneib, Lang, and Umlauf 2015) was the first to provide highly-efficient sampling schemes for very large data sets as well as spatial/multilevel models and structured additive distributional regression. An R interface is available in R2BayesX (Umlauf, Adler, Kneib, Lang, and Zeileis 2015). Instead of fully Bayesian MCMC, it is also possible to employ posterior mean estimation via the integrated nested Laplace approximation to estimate flexible Bayesian regression models. This is provided in the comprehensive R package INLA (Rue, Martino, and Chopin 2009), which is popular for estimating complex spatial Bayesian regression models (see, e.g., Lindgren and Rue 2015; Bivand, Gómez-Rubio, and Rue 2015).
• Regularized estimation and explicit variable selection might be necessary, though, for going beyond the models described above, especially for large/wide data with many potential regressors and corresponding effects/interactions/etc. Widely-used approaches for this include the lasso, e.g., as available for GLM-type models in the R package glmnet (Friedman et al. 2010), or gradient boosting as available for GAMLSS-type models in the R package gamboostLSS (Hofner, Mayr, and Schmid 2016). However, many of these packages do not cover Bayesian posterior estimation.
In summary, the discussion above highlights that many different packages with different strengths are already available in R. However, a package combining all the aspects above in a single framework is not readily available, as there are typically limitations with respect to the inferential framework, the distributions and/or complexity of the models supported, or the estimation techniques and fitting algorithms. The package bamlss, available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=bamlss, tries to fill this gap with a modular "Lego" approach to flexible Bayesian regression providing:
• The usual R "look & feel" for regression modeling.
• An extensible "plug & play" approach for regression terms.
• Modular combinations of fitting algorithms and samplers.
Especially the last item is notable because the models in bamlss are not limited to a specific estimation algorithm; different engines can be plugged in without necessitating changes in other aspects of the model specification (such as response distributions or regression terms). By default, bamlss uses IWLS-based backfitting for optimizing the model and IWLS-based MCMC for sampling from the posterior distribution. However, alternative optimizers and samplers are also implemented, supporting the lasso, boosting, and more. Moreover, the package builds on the well-established mgcv infrastructure for smooth model terms, uses R's formula syntax for model specification, and provides standard extractor methods like summary(), plot(), predict(), etc.
The remainder of this paper is organized as follows. In Section 2, three motivating examples illustrate first steps with bamlss and showcase the flexibility of the provided infrastructure. Section 3 introduces the flexible regression framework in more detail. A thorough introduction to the R package bamlss, describing the most important building blocks for developing families, model terms, and estimation algorithms, is then given in Section 4. In Section 5 we highlight the unified modeling approach using a complex distributional regression model for lightning counts in complex terrain. Further details and examples for the bamlss package can be found online at http://www.bamlss.org/.

Motivating examples
This section gives a first quick overview of the functionality of the package. The first example demonstrates that the usual "look & feel" of well-established model fitting functions like glm() is an elementary part of bamlss, i.e., first steps and basic handling of the package should be relatively simple. The second example shows that the package can deal with a variety of different model terms and that model fitting engines can easily be exchanged; here, we exemplify this feature by applying a lasso-type estimation engine. The third example then explains how full distributional regression models can be estimated and showcases once more the flexibility of the provided modeling infrastructure.

Basic Bayesian regression: Logit model
This example data is taken from the AER package (Kleiber and Zeileis 2008) and is about labor force participation (yes/no) of women in Switzerland in 1981 (Gerfin 1996). The bamlss package and the data can be loaded with

R> library("bamlss")
R> data("SwissLabor", package = "AER")

The data frame contains 872 observations of 6 variables, some of which might have a nonlinear influence on the response labor participation. Now, a standard Bayesian binomial logit model can be fitted using the default MCMC algorithm (sampler function sam_GMCMC(); see also Section 4 for other options). The MCMC algorithm uses iteratively weighted least squares (IWLS, Gamerman 1997; for more details see Section 3.2) proposals, which have very good mixing properties and computational advantages when using very large data sets (Lang, Umlauf, Wechselberger, Harttgen, and Kneib 2014). First, the model formula is specified with

R> f <- participation ~ income + age + education +
+   youngkids + oldkids + foreign + I(age^2)

Then, to reproduce the results, the seed of the random number generator is set and the model is estimated by

R> set.seed(123)
R> b <- bamlss(f, family = "binomial", data = SwissLabor,
+   n.iter = 1200, burnin = 200, thin = 1)

Note that the default number of iterations (n.iter) for the MCMC sampler is 1200, the burn-in phase (burnin) is 200, and the thinning (thin) is 1. The reason is that during the modeling process, users usually want to obtain first results rather quickly. Afterwards, when a final model is estimated, the number of iterations of the sampler is usually set much higher to get close to i.i.d. samples from the posterior distribution. To obtain reasonable starting values for the MCMC sampler, a backfitting algorithm that optimizes the posterior mode is run first. Using the main model fitting function bamlss(), all model fitting engines can be exchanged, which is explained in detail in Section 4 and the application in Section 5. The default model fitting engines use family objects (see also Section 4), similar to the families that can be used with the glm() function, which enables easy implementation of new distributions (models).
Note that the model contains a quadratic term for the variable age in order to capture nonlinearities. The resulting object b is of class "bamlss", for which standard extractor functions like summary(), coef(), plot(), predict(), etc. are available. The model summary output is printed by

R> summary(b)
Call:
bamlss(formula = f, family = "binomial", data = SwissLabor)
---
Family: binomial
Link function: pi = logit
...

The summary is based on the MCMC samples, which suggest "significant" effects for all covariates except the variable education, since its 95% credible interval contains zero. In addition, the acceptance probabilities alpha are reported, i.e., the acceptance probability of the sample candidate based on the proposal and the posterior distribution, which is calculated in each iteration, indicating proper behavior of the MCMC algorithm. The column parameters shows the respective posterior mode estimates of the regression coefficients, which are calculated by the upstream optimizer algorithm (the column is named parameters because optimizer functions can in principle return any type of parameters). Further results from the optimizer are reported at the very end of the output: the corrected AIC (AICc, Hurvich and Tsai 1989; Cavanaugh 1997), the equivalent degrees of freedom (edf), the log-likelihood (logLik), etc. In addition, bamlss provides extractor functions for information criteria like the DIC (function DIC()) and the widely applicable information criterion (WAIC, Watanabe 2010; function WAIC()), as well as for the out-of-sample continuous ranked probability score (CRPS, Gneiting, Balabdaoui, and Raftery 2007; function CRPS()). Note that CRPS() approximates the score numerically, while the scoringRules package (Jordan, Krüger, and Lerch 2019) can compute the CRPS very efficiently for some distributions. The usage of the provided functions is similar to the generic AIC() and BIC(), e.g., the DIC can be computed with

R> DIC(b)

Before proceeding with the analysis, users usually perform additional convergence checks of the MCMC chains by looking at traceplots and autocorrelation (besides acceptance probabilities).
R> plot(b, which = c("samples", "max-acf"))

These diagnostics are visualized in Figure 1 and reveal approximate convergence of the MCMC chains, i.e., there is no visible trend, and the very low autocorrelation shown for the intercept, together with the maximum autocorrelation calculated for each lag across all parameters, suggests close to i.i.d. samples from the posterior distribution. As mentioned above, the user could also increase the number of iterations and the burn-in phase, as well as adapt the thinning parameter (arguments n.iter, burnin and thin), to make the significant bar at lag one disappear. Note that the function call would compute all trace and autocorrelation plots; for convenience we only show the plots for the intercept. In addition, samples can also be extracted using the function samples(), which returns an object of class "mcmc", a class provided by the coda package (Plummer, Best, Cowles, and Vines 2006) that includes a rich infrastructure for further convergence diagnostics, e.g., Gelman and Rubin's convergence diagnostic (Gelman and Rubin 1992; Brooks and Gelman 1998) or Heidelberger and Welch's convergence diagnostic (Heidelberger and Welch 1981, 1983).
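For example, coda diagnostics can be applied directly to the extracted samples (a minimal sketch; any other coda diagnostic can be used analogously):

R> sa <- samples(b)
R> library("coda")
R> geweke.diag(sa)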
Model predictions on the probability scale can be obtained by the predict() method, e.g., to visualize the effect of covariate age on the probability we can create a new data frame for prediction

R> nd <- data.frame(income = 11, age = seq(2, 6.2, length = 100),
+   education = 12, youngkids = 1, oldkids = 1, foreign = "no")

Afterwards, we predict for both cases of the variable foreign

R> nd$p_swiss <- predict(b, newdata = nd, type = "parameter", FUN = c95)
R> nd$foreign <- "yes"
R> nd$p_foreign <- predict(b, newdata = nd, type = "parameter", FUN = c95)

The predict() method is applied to all MCMC samples, and the argument FUN specifies a function that is applied to the predictor or distribution parameter samples. The default is the mean() function; in this case we additionally extract the empirical 2.5% and 97.5% quantiles using the function c95() to obtain credible intervals (note that individual samples can be extracted by passing FUN = identity, so users can easily generate their own statistics). The estimated effect can then be visualized as in Figure 2; a sketch of the plotting step is given below.

Figure 2: Left panel, quadratic polynomial effect of covariate age on the estimated probabilities for both cases, foreign "yes" and "no". Right panel, effect on logit^(-1)(π) of variable age using regression splines (see Section 2.2). The solid lines represent mean estimates, the shaded areas show 95% credible intervals.
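Because FUN = c95 returns the 2.5% quantile, the mean, and the 97.5% quantile per observation, a quick version of the left panel of Figure 2 can be drawn with base graphics (a sketch only, not the original figure code):

R> matplot(nd$age, nd$p_swiss, type = "l", lty = c(2, 1, 2), col = 1,
+   xlab = "age", ylab = "probability of participation")
R> matlines(nd$age, nd$p_foreign, lty = c(2, 1, 2), col = 2)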

Flexible model terms and estimators
Using the flexible infrastructure of bamlss, model terms can be easily exchanged. To give a first impression of the modeling capabilities, we again use the SwissLabor data and the binomial logit model of Section 2.1; however, in this example we use regression splines to capture the nonlinear effect of the variable age.
As noted in the introduction, the bamlss package leverages infrastructure from the R package mgcv (Wood 2021) for setting up the design and penalty matrices for smooth terms by calling mgcv's smooth.construct() or smoothCon(), i.e., new user-defined smooth terms can also be added by providing new classes and methods for these generic functions. To estimate a spline model instead of a polynomial model for the variable age, the model formula only needs to be slightly adapted

R> f <- participation ~ income + education +
+   youngkids + oldkids + foreign + s(age, k = 10)

The function s() is the smooth term constructor from the mgcv package; the default of s() are thin-plate regression splines with k = 10 basis functions. The model is again fitted by

R> set.seed(123)
R> b <- bamlss(f, family = "binomial", data = SwissLabor)

The estimated nonlinear effect can be plotted instantly by typing

R> plot(b, term = "s(age)")

The estimated effect based on regression splines is shown in the right panel of Figure 2 and reveals that the quadratic polynomial of Section 2.1 seems to capture the nonlinearity appropriately.

To give a better impression of the types of model terms that can be used within the bamlss framework, Table 1 lists commonly-used specifications.

Table 1: Commonly-used model term specifications.
• Linear effects Xβ: x1 + x2 + x3.
• Nonlinear effects of continuous covariates, f(x) = f(x1): s(x1).
• Two-dimensional surfaces, f(x) = f(x1, x2): s(x1, x2), te(x1, x2) or ti(x1, x2) (higher-dimensional terms possible).
• Spatially correlated effects, f(x) = f_spat(xs): s(xs, bs = "mrf", xt = list(penalty = K)), where xs is a factor indicating the discrete regional information and K is a supplied penalty matrix. Other options within the xt argument are possible, please see the documentation of smooth.construct.mrf.smooth.spec().
• Varying coefficients, f(x) = x1 f(x2): s(x2, by = x1).
• Spatially varying effects: s(xs, bs = "mrf", xt = list(penalty = K), by = x1), s(x2, x3, by = x1) or te(x2, x3, by = x1).
• Random intercepts with cluster index c, f(x) = β_c: s(id, bs = "re"), where id is a factor of cluster indices.
• Random slopes with cluster index c, f(x) = x1 β_c: s(id, x1, bs = "re"), as above with continuous covariate x1.
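These building blocks can be freely combined within a single formula. For instance, a predictor mixing linear, smooth, spatial, and random effects could be specified as follows (purely illustrative; y, x1, x2, region, id, and the penalty matrix K are hypothetical):

R> f <- y ~ x1 + s(x2) +
+   s(region, bs = "mrf", xt = list(penalty = K)) +
+   s(id, bs = "re")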
Besides the supported infrastructure from the mgcv package, it is also possible to implement completely new model terms that follow different setups than the basis function approach (see also Appendix B for an example using growth curves). Moreover, using bamlss, estimation engines can also be exchanged. As an example, we estimate the nonlinear age effect in the SwissLabor example using a fused lasso algorithm (see also Section 5 for a complex example using gradient boosting optimization). The algorithm performs variable selection in combination with factor fusion (clustering) and can also be used to identify interpretable nonlinearities. Methodological details on lasso-type penalization using bamlss are provided in Groll, Hambuckers, Kneib, and Umlauf (2019). To apply the fused lasso, the numeric variable age is categorized using empirical quantiles, e.g., with

R> SwissLabor$cage <- cut(SwissLabor$age,
+   breaks = quantile(SwissLabor$age, prob = seq(0, 1, length = 10)),
+   include.lowest = TRUE, ordered_result = TRUE)

The formula for the fused lasso model is then specified with the special la() model term constructor function provided in bamlss:

R> f <- participation ~ income + education + youngkids + oldkids + foreign +
+   la(cage, fuse = 2)

where the argument fuse specifies the type of fusion (nominal fusion fuse = 1, ordered fusion fuse = 2). To estimate the fused lasso model, only the default optimizer function in the bamlss() wrapper call needs to be exchanged

R> b <- bamlss(f, family = "binomial", data = SwissLabor,
+   optimizer = opt_lasso, sampler = FALSE,
+   criterion = "AIC", upper = exp(5), lower = 1)

The optimum shrinkage parameter λ is selected by the AIC (another option is criterion = "BIC"). The arguments upper and lower determine the search interval for λ; by default nlambda = 100 values are generated. Note that no MCMC sampling is run after the opt_lasso() estimation engine is applied, hence the argument sampler = FALSE in the bamlss() call.
The AIC curve and the coefficient paths, including the optimum shrinkage parameter λ, can be visualized with

R> pathplot(b)

Figure 3 shows the AIC curve and coefficient paths for cage. The AIC curve attains its minimum at the vertical gray dashed line. The coefficient paths clearly show that the algorithm can either shrink categories out of the model (shrink to zero) or even fuse them. In the right panel of Figure 3, the estimated effect of the categorized variable age is shown. The effect is computed by predicting without intercept using the optimum stopping iteration, which is selected by the AIC and can be extracted with the function lasso_stop(). The stopping iteration is passed to the predict() method via the mstop argument.
R> page <- predict(b, term = "cage", intercept = FALSE,
+   mstop = lasso_stop(b))

The figure is then created using the untransformed original covariate on the x-axis.

R> plot2d(page ~ age, data = SwissLabor, rug = TRUE)

Using the fused lasso estimation, nonlinearities can again be identified, similar to the spline-based estimate in the right panel of Figure 2.

Location-scale model
Here, we extend the framework and estimate a distributional regression model that not only captures the mean (or location) of the response variable but also its variance (or scale). As an example, we employ the number of weekly fatalities in Austria from 2000-2020 (up to week 46 in 2020) as obtained from the Eurostat database (https://ec.europa.eu/eurostat/). The data is available in the bamlss package as fatalities, providing the number (num) of fatalities in each year and week. It can be loaded with

R> data("fatalities", package = "bamlss")

The idea of the subsequent analysis is to estimate a reference mortality model based on the data from 2000-2019, prior to the COVID-19 (Corona virus disease 2019) crisis, in order to bring out graphically the excess mortality in 2020. Excess mortality is often employed for assessing the effects of exceptional events such as pandemics (Leon et al. 2020) or natural catastrophes (Fouillet et al. 2008). First, we split the data into the corresponding subsets.
R> d19 <- subset(fatalities, year <= 2019)
R> d20 <- subset(fatalities, year >= 2020)

To capture the long-term seasonal trend of the fatality number distribution, we employ a simple model here: log-fatalities are assumed to be normally distributed with smooth seasonal variations in both mean and variance. As shown below, the log-transformation stabilizes skewness and variance in the data somewhat, so that a normal model works sufficiently well. Cyclic splines with respect to the week of the year are employed to capture the smooth seasonal trends while assuring that the values at the beginning and the end of the year match. The model formula is now a list with elements for the mean of log(num) (corresponding to the parameter mu) and the standard deviation sigma of the normal distribution.
R> f <- list(
+   log(num) ~ s(week, bs = "cc", k = 20),
+   sigma ~ s(week, bs = "cc", k = 20)
+ )

The function s() is again the smooth term constructor from the mgcv package (Wood 2021) and bs = "cc" specifies a penalized cyclic cubic regression spline. (Other smooth terms such as te() or ti() could be included in the same way.) Based on this, bamlss() is used to estimate a full Bayesian regression model using the NO() normal family from the gamlss.dist package.

R> library("gamlss.dist")
R> set.seed(456)
R> b <- bamlss(f, data = d19, family = NO)

The resulting estimated effects along with their 95% credible intervals can be easily visualized using the plot() method:

R> plot(b)

The resulting displays are shown in Figure 4, depicting a clear nonlinear relationship for both distribution parameters. The left panel shows that mean log-fatalities are much higher in winter than in summer, with a peak around February matching the highest risk for influenza and other viral infections in Austria. The right panel shows that the standard deviation is also highest at around the same time, but that there is another local maximum in the summer months, possibly related to recurrent heat waves that can be quite stressful for the cardiovascular system (Fouillet et al. 2008). Figure 5 shows the predicted 5%, 50%, and 95% quantiles (in black) of the corresponding normal distributions along with the observed fatalities in 2000-2019 (in light gray) and in 2020 (in red, up to week 46). Thus, the quantiles reflect the effects already conveyed by the predicted parameters in Figure 4. This shows that the fatalities in 2020 are above the median almost throughout all weeks and above the 95% quantile for a couple of weeks in spring and in the fall/winter, respectively. While the mortality in the spring period is only moderately increased, it is much higher than in previous years in fall/winter during the second COVID-19 wave in Austria.
In the following, we show how to draw Figure 5 using the bamlss infrastructure. First, we set up a new data frame and predict the distribution parameters for each week of the year.
R> nd <- data.frame(week = 1:53)
R> par <- predict(b, newdata = nd, type = "parameter")

Based on these, the fitted quantiles can be computed using the quantile function from the family of the model (see Section 4.2 for details). The exp() transformation maps the fitted values from the log-scale back to the original frequency scale.

R> nd$fit <- sapply(c(0.05, 0.5, 0.95),
+   function(p) exp(family(b)$q(p, par)))

Finally, the estimated quantiles and observed data can be visualized using matplot(), after reshaping the data to "wide" format with a separate column for each year; a sketch of this step is given below.
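One way to carry out the reshaping and plotting (a sketch only, assuming the columns year, week, and num in fatalities; this is not the original figure code):

R> X <- reshape(d19[, c("year", "week", "num")], idvar = "week",
+   timevar = "year", direction = "wide")
R> matplot(X$week, X[, -1], type = "l", lty = 1, col = "lightgray",
+   xlab = "Week", ylab = "Number of fatalities")
R> matlines(nd$week, nd$fit, lty = c(2, 1, 2), col = 1)
R> lines(num ~ week, data = d20, col = 2, lwd = 2)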
Finally, the model fit can be checked using a histogram and a Q-Q plot of the quantile residuals:

R> plot(b, which = c("hist-resid", "qq-resid"), c95 = TRUE)

By setting c95 = TRUE, the Q-Q plot includes 95% credible intervals. Both plots show that the log-transformation of the fatality numbers only partially captures the right-skewed observations and that the model fit is therefore not ideal in the upper tail. In an accompanying online vignette at http://www.bamlss.org/articles/fatalities.html we show how to find a better-fitting response distribution. While this improves the distributional model, the qualitative insights regarding the excess mortality in Austria during the COVID-19 crisis remain unchanged.

A flexible Bayesian model framework
This section briefly summarizes the BAMLSS modeling framework. For a detailed methodological description please refer to Umlauf et al. (2018), as well as to the references given below that discuss various applications and extensions that are also implemented in bamlss. The following outlines the framework from the viewpoint of distributional regression models; however, please note that model classes like, e.g., GLMs and GAMs or even survival joint models (Köhler, Umlauf, Beyerlein, Winkler, Ziegler, and Greven 2017; Köhler, Umlauf, and Greven 2018) are special cases in this setup.

Model structure
Within the framework of GAMLSS or distributional regression models, all parameters of the response distribution can be modeled by explanatory variables such that

  y ~ D(h_1(θ_1) = η_1, ..., h_K(θ_K) = η_K),   (1)

where D denotes a parametric distribution for the response variable y with K parameters θ_k, k = 1, ..., K, that are linked to additive predictors using known monotonic and twice differentiable functions h_k(·). Note that the response may also be a q-dimensional vector y = (y_1, ..., y_q)^⊤, e.g., when D is a multivariate distribution (see, e.g., Klein, Kneib, Klasen, and Lang 2015a). The additive predictor for the k-th parameter is given by

  η_k = f_1k(X; β_1k) + ... + f_{J_k k}(X; β_{J_k k}),   (2)

based on j = 1, ..., J_k unspecified (possibly nonlinear) functions f_jk(·), applied to each row of the generic data matrix X, encompassing all available covariate information. The corresponding parameters β_k = (β_{1k}^⊤, ..., β_{J_k k}^⊤)^⊤ are typically regression coefficients pertaining to model matrices X_k = (X_{1k}, ..., X_{J_k k}), whose structure only depends on the type of covariate(s) and prior assumptions about f_jk(·).
Usually, the functions f_jk(·) are based on a basis function approach, where η_k then is a typical GAM-type or so-called structured additive predictor (STAR, Fahrmeir, Kneib, and Lang 2004). Similar to Stasinopoulos et al. (2017), Umlauf et al. (2018) relax this assumption and let f_jk(·) be an unspecified composition of covariate data and regression coefficients. For example, the functions f_jk(·) could also represent nonlinear growth curves, a regression tree, a neural network, or lasso-penalized model terms as shown in Section 2.2.
For full Bayesian inference, priors need to be assigned to the regression coefficients β_jk. To be as flexible as possible, Umlauf et al. (2018) use the rather general prior p_jk(β_jk; τ_jk, α_jk) for the j-th model term of the k-th parameter, where the form of p_jk(·) depends on the type of function f_jk(·). Here, τ = (τ_{11}, ..., τ_{J_1 1}, ..., τ_{1K}, ..., τ_{J_K K})^⊤ is the vector of all assigned hyper-parameters, e.g., representing smoothing variances (shrinkage parameters). Similarly, α_jk is the set of all fixed prior specifications, i.e., for GAM-type models α_jk usually holds the so-called penalty matrices, amongst others. In most situations the prior p_jk(β_jk; τ_jk, α_jk) is based on a multivariate normal kernel for β_jk and on inverse gamma distributions for each τ_jk = (τ_{1jk}, ..., τ_{L_jk jk})^⊤, but as indicated previously, in principle any type of prior can be used (Umlauf et al. 2018). Examples of distributional models that fit well in this framework are the ones for:
• Univariate responses of any type, e.g., counts with zero-inflation and/or overdispersion as proposed in Klein, Kneib, and Lang (2015b).

Posterior estimation
Estimation typically requires evaluating the log-likelihood function ℓ(β; y, X) and its derivatives w.r.t. all regression coefficients β a number of times. For Bayesian inference, the log-posterior is used either for posterior mode estimation or for solving high-dimensional integrals, e.g., for posterior mean estimation, for which MCMC samples need to be computed.
Although the types of models that can be fitted within the flexible BAMLSS framework can be quite complex, Umlauf et al. (2018) show that there are a number of similarities between optimization and sampling concepts. Fortunately, and despite the differing model term complexity, algorithms for posterior mode and mean estimation can be summarized into a partitioned updating scheme with separate updating equations using leapfrog or zigzag iteration (Aitkin 1987; Smyth 1996), e.g., with updating equations

  β_jk^{(t+1)} = U_jk(β_jk^{(t)}, ·),   j = 1, ..., J_k,   k = 1, ..., K,   (3)

where U_jk(·) is an updating function, e.g., for generating one Newton-Raphson step or for getting the next step in an MCMC simulation. Rigby and Stasinopoulos (2005) showed that using a basis function approach, i.e., where each function f_jk(·) can be represented by a linear combination of a design matrix and regression coefficients, the updating functions U_jk(·) for posterior mode (frequentist penalized likelihood) estimation of β_jk share an iteratively weighted least squares (IWLS, Gamerman 1997) updating step

  β_jk^{(t+1)} = (X_jk^⊤ W_kk X_jk + G_jk(τ_jk))^{-1} X_jk^⊤ W_kk (z_k - η_{k,-j}^{(t+1)}),   (4)

with weight matrices W_kk and working responses z_k, similar to the well-known IWLS updating scheme for generalized linear models (GLM, Nelder and Wedderburn 1972). In the same way, approximate full conditionals π(β_jk | ·) for MCMC are constructed with this updating step (Gamerman 1997; Fahrmeir et al. 2004; Brezger and Lang 2006; Klein and Kneib 2016b). The matrices G_jk(τ_jk) are derivative matrices of the priors p_jk(β_jk; τ_jk, α_jk) w.r.t. the regression coefficients β_jk; e.g., using basis functions for f_jk(·), the G_jk(τ_jk) can be penalty matrices that penalize complexity, as in a P-spline representation (Eilers and Marx 1996).
Even if the functions f_jk(·) are not based on a basis function approach, the updating scheme (4) can be further generalized to

  β_jk^{(t+1)} = U_jk(z_k - η_{k,-j}^{(t+1)}),   (5)

i.e., theoretically any updating function applied to the "partial residuals" z_k - η_{k,-j}^{(t+1)} can be used (for detailed derivations see also Umlauf et al. 2018).
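To make the update concrete, the following minimal R sketch implements one IWLS step (4) for a single model term; the names X, w, e, and G stand for X_jk, the diagonal of W_kk, the partial residuals z_k - η_{k,-j}, and G_jk(τ_jk), respectively, and are not part of the bamlss API:

iwls_update <- function(X, w, e, G) {
  ## t(X * w) equals t(X) %*% diag(w) for a weight vector w.
  XW <- t(X * w)
  ## Solve the penalized weighted least squares system of Equation (4).
  solve(XW %*% X + G, XW %*% e)
}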
The great advantage of this modular architecture is that the concept is not limited to modeling the distributional parameters θ_k in (1); e.g., as mentioned above, based on the survival function, Köhler et al. (2017) and Köhler et al. (2018) implement Bayesian joint models for survival and longitudinal data. Moreover, the updating schemes are not restricted to any particular estimation engine, e.g., Groll et al. (2019) use the framework to implement lasso-type penalization for GAMLSS, and Simon, Fabsic, Mayr, Umlauf, and Zeileis (2018) investigate gradient boosting with stability selection algorithms (see also Section 5). More recently, Klein, Simon, and Umlauf (2019) implemented neural network distributional regression models.

Measures of performance
Model choice and variable selection are important in distributional regression due to the large number of candidate models. The following lists commonly-used tools: • Information criteria can be used to compare different model specifications. For posterior mode estimation, the Akaike information criterion (AIC) or the corrected AIC, as well as the Bayesian information criterion (BIC), can be used. Estimation of model complexity is based on the so-called equivalent degrees of freedom (EDF), i.e., for each model term the trace of the smoother matrix is computed (see, e.g., Hastie and Tibshirani 1990) and the total degrees of freedom are approximated by the sum over all distributional parameters and model terms.
For MCMC based estimation, model choice mainly relies on the deviance information criterion (DIC, Spiegelhalter, Best, Carlin, and Van der Linde 2002) and the widely applicable information criterion (WAIC, Watanabe 2010).
• Quantile residuals (Dunn and Smyth 1996) can be used to evaluate the model fit.
The residuals can be assessed by quantile-quantile plots, probability integral transform (PIT) histograms, or worm plots (Van Buuren and Fredriks 2001).
• Scoring rules: Sometimes it is helpful to evaluate the performance on a test data set (or, for instance, based on cross-validation). For this, proper scoring rules (Gneiting et al. 2007) can be utilized.
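In bamlss these tools map onto the extractor functions already encountered in Section 2.1; a sketch, assuming a fitted "bamlss" object b1 and a hold-out data set d_test:

R> DIC(b1)
R> WAIC(b1)
R> CRPS(b1, newdata = d_test)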

Evaluation and interpretation
• Plotting: Estimated functions f̂_jk(·) are usually subject to a centering constraint (e.g., Σ_i f̂_jk(x_i) = 0); therefore, simple effect plots are a straightforward method for evaluating the importance of individual model terms and can also be used for the respective interpretations. In distributional regression it can sometimes also be useful to look at transformations of the original model parameters, such as the expected value or the variance of the response variable y.
• Predictions: For obtaining such transformations, model predictions need to be computed. This can be done either manually via the corresponding predict() method, or with the R package distreg.vis (Stadlmann 2021), which provides a graphical user interface for the visualization of distributional regression models.
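For a Gaussian model, for instance, such transformations follow directly from the predicted parameters (a minimal sketch, with b and nd as in Section 2.3):

R> par <- predict(b, newdata = nd, type = "parameter")
R> E_y <- par$mu
R> Var_y <- par$sigma^2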

The bamlss package
The R package bamlss provides a modular software architecture for flexible Bayesian regression models (and beyond). The implementation follows the conceptual framework presented in Umlauf et al. (2018), which supports Bayesian and/or frequentist estimation engines using complex, possibly nonlinear model terms of any type. The highlights of the package are:
• A unified model description where a formula specifies how to set up the predictors from the data, and the family holds information about the response distribution, i.e., the model. A transformer function can optionally set up modified terms, e.g., using a mixed model representation for smooth terms.
• Support for modular and exchangeable updating functions or complete model fitting engines, in order to implement either algorithms for maximizing the log-posterior (posterior mode estimation) or for solving high-dimensional integrals, e.g., for posterior mean or median estimation. First, an (optional) optimizer() function can be run, e.g., for computing posterior mode estimates. Second, a sampler() is employed for full Bayesian inference with MCMC, which uses the posterior mode estimates from the optimizer() as starting values. An additional step can be used for preparing the results(), e.g., for creating model term effect plots.
• Standard post-modeling extractor functions to create sampling statistics, visualizations, predictions, amongst others.
The modular architecture of bamlss is illustrated in Figure 7. As mentioned above, the first step in model development is to set up the design and penalty matrices for a model that is specified by the family object. Therefore, a formula is processed together with the data using the bamlss.frame() function.

Table 2: Currently available functions that can be used for pre-processing, estimation, and post-processing within the bamlss framework.

In a second pre-processing step, the returned model frame may also be transformed. The BAMLSS model frame can then be used with optimizer() and/or sampler() functions in the estimation step. This is probably the main advantage of the architecture: users can easily exchange and integrate user-defined estimation functions. The only requirement is to keep the structure of the bamlss.frame() function, as well as of the optimizer() and sampler() functions. Note that there is a naming convention: optimizer functions start with the prefix opt_* and sampler functions with sam_*. After the estimation step, optional post-processing functions can be applied to create additional sampling statistics, function samplestats(), or results that can be used for plotting the estimated effects, function results(). The post-processing step is optional since it is not necessarily needed in the last output step, e.g., for computing predictions. This feature is especially important when using large data sets, because the run time for computing samplestats() or results() can be quite long, or the computations can even lead to memory problems.
In summary, besides implementing models using the family infrastructure (see Section 4.2), the architecture is very flexible, such that users interested in implementing new and non-standard models or algorithms only need to focus on the estimation step, i.e., write optimizer() or sampler() functions, and get all post-processing and extractor functionality "for free". This way, prototyping becomes relatively easy, but the integration of high-performance estimation engines is facilitated as well. Table 2 provides an overview of the currently available functions. Note that the sampler functions sam_BayesX() and sam_JAGS() require installation of the BayesXsrc (Umlauf, Adler, Kneib, Lang, and Zeileis 2021) and JAGS interface packages, respectively. The table also shows that the NO() family is compatible with all pure R implementations of optimizer and sampler functions, but not with special samplers like BayesX and JAGS; these can only be used with the gaussian_bamlss() family. In addition, neither NO() nor gaussian_bamlss() has its own special optimizer or sampler implemented, unlike, e.g., the cox_bamlss() family.
To exemplify the presented "Lego toolbox", the following R code estimates the logit model using the SwissLabor data presented in Section 2.1. First, the data is loaded and the model formula is specified with

R> data("SwissLabor", package = "AER")
R> f <- participation ~ income + age + education +
+   youngkids + oldkids + foreign + I(age^2)

In the second step, the necessary design matrices are constructed using the model frame parser function bamlss.frame()

R> bf <- bamlss.frame(f, data = SwissLabor, family = "binomial")

Then, posterior mode estimates are obtained using the implemented backfitting estimation function opt_bfit()

R> pm <- with(bf, opt_bfit(x, y, family))

The estimated parameters returned from the function opt_bfit() can then be used as starting values for the MCMC sampler function sam_GMCMC()

R> set.seed(123)
R> samps <- with(bf, sam_GMCMC(x, y, family, start = pm$parameters))

Using the parameter samples returned from the function sam_GMCMC(), statistics like the DIC are computed using the samplestats() function

R> stats <- with(bf, samplestats(samps, x, y, family))
R> print(unlist(stats))
    logLik        DIC         pd
-512.72579 1033.32501    7.87343

As can be seen in the code above, estimation engines have the common arguments x (holding the design and penalty matrices), y (the response data) and family (the bamlss family object). For implementing new estimation engines, users only need to keep the argument structures and the return values, i.e., for optimizer() functions a named numeric vector of estimated parameters, and for sampler() functions parameter samples of class "mcmc" or "mcmc.list" (see package coda, Plummer et al. 2006). More details on the naming convention and the structure of the return value of bamlss.frame() are given in Section 4.1.
To ease the modeling process, all of the individual modeling steps presented above can be executed using the wrapper function bamlss(). The main arguments of bamlss() are

bamlss(formula, family = "gaussian", data = NULL,
  transform = NULL,                        ## Pre-processing
  optimizer = NULL, sampler = NULL,        ## Estimation
  samplestats = NULL, results = NULL, ...) ## Post-processing

where the first line basically represents the standard model frame specifications (see Chambers and Hastie 1992). All other arguments represent functions presented in Table 2 and can be exchanged. Note that the default for the argument optimizer is the backfitting estimation function opt_bfit() and the default for the argument sampler is the sam_GMCMC() sampling function, which is a quite generic implementation. More specifically, sam_GMCMC() accepts proposal functions for each model term, which do not necessarily have to be the same and can be exchanged; e.g., the core proposal function is implemented in C and is additionally optimized for large design and penalty matrices, such that sampling with very large data sets is possible (see Lang et al. 2014 for details on the algorithms; in contrast, e.g., sam_JAGS() is only suitable for moderately sized data and low-complexity model terms). For more details on sam_GMCMC() please see the bamlss manual. The returned fitted model object is a list of class "bamlss", which is supported by several standard methods and extractor functions, such as plot(), summary() and predict().
As already exemplified in Section 2, using the model fitting wrapper function bamlss() it is straightforward to use different modeling approaches by simply exchanging the estimation engines. This feature can be particularly important in complex modeling situations, where good mixing of the MCMC algorithm requires very good starting values. One use case is presented in Section 5, where for stability reasons posterior mode estimates are obtained using the gradient boosting optimizer function opt_boost(). Afterwards, the MCMC sampling engine sam_GMCMC() is applied with the boosting estimates as starting values; a sketch of this workflow is shown below.
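In wrapper form, this two-stage workflow amounts to exchanging only the optimizer argument (a sketch, reusing f and SwissLabor from Section 2.1):

R> b <- bamlss(f, family = "binomial", data = SwissLabor,
+   optimizer = opt_boost, sampler = sam_GMCMC)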

The BAMLSS model frame
Similar to the well-known model.frame() function that is used, e.g., by the linear model fitting function lm() or, for generalized linear models, by glm(), the bamlss.frame() function extracts a "model frame" for fitting distributional regression models. Internally, the function parses model formulae, one for each parameter of the distribution, using the Formula package infrastructure (Zeileis and Croissant 2010), in combination with model.matrix() processing for linear effects and smooth.construct() processing from the mgcv package to set up design and penalty matrices for unspecified smooth function estimation (Wood 2021; see also, e.g., the documentation of the functions s() and te()).

The most important arguments are
bamlss.frame(formula, data = NULL, family = "gaussian",
  weights = NULL, subset = NULL, offset = NULL,
  na.action = na.omit, contrasts = NULL, ...)

The argument formula can be a classical model formula, e.g., as used by the lm() function, or an extended bamlss formula including smooth term specifications like s() or te(), which is internally parsed by the function bamlss.formula(). Note that the bamlss package uses special family objects that can be passed either as a character string without the "_bamlss" extension of the bamlss family name (see the manual ?bamlss.family for a list of available families), or as the family function itself. In addition, all families of the gamlss (Stasinopoulos and Rigby 2021a) and gamlss.dist (Stasinopoulos and Rigby 2021b) packages are supported, i.e., there is a transformer function that reads all necessary components and transfers them into a family object for bamlss.
The returned object, a named list of class "bamlss.frame", can be employed with the model fitting engines listed in Table 2. The most important elements used for estimation are: • x: A named list whose elements correspond to the parameters specified in the family object. For each distribution parameter, the list contains all design and penalty matrices needed for modeling (see the upcoming example).
• y: The response data.
To better understand the structure of the "bamlss.frame" object, a print method is provided. For illustration, we simulate data

R> set.seed(111)
R> d <- GAMart()

and set up a "bamlss.frame" object for a Gaussian distributional regression model including smooth terms. First, a model formula is needed

R> f <- list(
+   num ~ x1 + s(x2) + s(x3) + te(lon, lat),
+   sigma ~ x1 + s(x2) + s(x3) + te(lon, lat)
+ )

Afterwards, the model frame can be computed with

R> bf <- bamlss.frame(f, data = d, family = "gaussian")

For writing a new estimation engine, the user can directly work with the model.matrix elements for linear effects and with the smooth.construct list for smooth effects. The smooth.construct is a named list which is compiled using the smoothCon() function of the mgcv package, employing the generic smooth.construct() method for setting up smooth terms.
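The per-parameter structure can be inspected directly, e.g. (a sketch; the element names follow the description above):

R> names(bf$x)
R> names(bf$x$mu)
R> names(bf$x$mu$smooth.construct)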
As shown in Appendix B, the bamlss.frame() function can also process special model terms, i.e., model terms that are not necessarily represented by a linear matrix-vector product.

Family objects
Family objects are important building blocks in the design of BAMLSS models. The implementation in bamlss follows the well-established structure of family objects that are supported, e.g., by the base R model fitting function glm(), or the family objects of the gamlss and VGAM packages. This means that users can also easily write new family objects to be used with bamlss. Such family objects specify the distribution by collecting functions of the density, the respective log-likelihood, the first-order derivatives of the log-likelihood w.r.t. the predictors (the score function), and (optionally) the second-order derivatives of the log-likelihood w.r.t. the predictors or their expectation (the Hessian). Commonly-used distributions are already implemented in bamlss; note that the ones from the gamlss and gamlss.dist packages can also be accessed through the bamlss package (see Section 2.3 for an example).
We illustrate how to build a bamlss family by hand along the Gaussian distribution, with density

  f(y; μ, σ) = (1 / (√(2π) σ)) exp(-(y - μ)² / (2σ²))

and log-likelihood function

  ℓ(y; μ, σ) = -log(√(2π) σ) - (y - μ)² / (2σ²)

for an individual observation. The sum of the log-likelihood function over all observations is the target function of the optimization problem.
In the distributional regression framework, the parameters are linked to the predictors by link functions, h_k(θ_k) = η_k. For the Gaussian, μ and σ are linked to η_μ and η_σ by the identity function and the logarithm, respectively, i.e., μ = η_μ and log(σ) = η_σ.
The score functions in bamlss are the first derivatives of the log-likelihood w.r.t. the predictors:

  ∂ℓ/∂η_μ = (y - μ) / σ²,    ∂ℓ/∂η_σ = (y - μ)² / σ² - 1.

For the second derivatives of the log-likelihood we are able to obtain the negative expectations

  E(-∂²ℓ/∂η_μ²) = 1/σ²    and    E(-∂²ℓ/∂η_σ²) = 2.

In more detail, the default bamlss estimation engines are based on IWLS updating functions and do not require the mixed elements of the Hessian, i.e., the backfitting optimizer function opt_bfit() uses leapfrog or zigzag iterations (Smyth 1996) and the MCMC sampler function sam_GMCMC() also only updates one model term f_jk(·) at a time; hence, only the diagonal elements of the Fisher information matrix are needed. Furthermore, it is not mandatory to use the expected Fisher information, but for numerical stability it is recommended. If the information on the second derivatives is not provided, bamlss.frame() will set up approximate versions by numerical differentiation of the score functions; the same mechanism is applied for first-order derivatives. Hence, in quite a few cases implementing a new family can be based solely on the specification of the density function; however, in terms of optimization runtime this is certainly not the most efficient choice. For distributions for which the expectation of the second derivative is intractable or does not exist, the user can rely on two options: the first option is to simply take the Hessian evaluated at the observations and corresponding predictors, rather than computing the theoretical expectation analytically, for filling the diagonals of the weight matrices W_kk. The second option is to find a good approximation of the expectation (see, e.g., Klein et al. 2015b, for the case of the overdispersion parameter of the negative binomial distribution). Now we have to write a function that returns a "family.bamlss" object (S3), which encapsulates the functions for density, score, and Hessian, and the names of the family, the parameters, and the link functions. The required elements are listed in Table 3. Note that there are no other specifications to follow; for example, one could also build a family that allows for flexible link functions (like the families from the gamlss package).
Basically all of these functions take as first argument the response y and as second argument a named list par holding the evaluated parameters of the distribution. The example implementation is shown in Appendix A; a condensed sketch follows below.
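As a condensed sketch of such a constructor (the full version is given in Appendix A; the name gaussian2_bamlss is illustrative), a hand-built Gaussian family could look like this:

gaussian2_bamlss <- function(...) {
  f <- list(
    "family" = "gaussian2",
    "names" = c("mu", "sigma"),
    "links" = c(mu = "identity", sigma = "log"),
    ## Density of the normal distribution.
    "d" = function(y, par, log = FALSE) {
      dnorm(y, mean = par$mu, sd = par$sigma, log = log)
    },
    ## First derivatives of the log-likelihood w.r.t. the predictors.
    "score" = list(
      "mu" = function(y, par, ...) (y - par$mu) / par$sigma^2,
      "sigma" = function(y, par, ...) (y - par$mu)^2 / par$sigma^2 - 1
    ),
    ## Negative expected second derivatives w.r.t. the predictors.
    "hess" = list(
      "mu" = function(y, par, ...) 1 / par$sigma^2,
      "sigma" = function(y, par, ...) rep(2, length(y))
    )
  )
  class(f) <- "family.bamlss"
  return(f)
}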

Table 3: Elements of the Gaussian "family.bamlss" object.
• family: Character string with the name of the family.
• names: Vector of character strings with the names of the parameters.
• links: Vector of character strings with the names of the link functions.
• d: A function returning the density, with arguments d(y, par, log = FALSE) (see below).
• p: The cumulative distribution function, p(y, par, ...).
• score: A list of functions (one for each parameter) returning the first derivatives of the log-likelihood w.r.t. the predictors.
• hess: A list of functions (one for each parameter) returning the negative second derivatives of the log-likelihood w.r.t. the predictors.
Optionally, the "family.bamlss" object can be extended by functions for
• the quantile function (the inverse CDF), q(p, par), and a random number generator, r(n, par), which can be convenient for predictions and simulations,
• the log-likelihood, loglik(y, par), which can help to speed up optimization,
• the expectation, mu(par, ...),
• initial values for optimization, which have to be provided as a list containing a function for each parameter,
• a customized predict() function which will be called by predict.bamlss(), e.g., as implemented in the family cox_bamlss(),
• and, similarly, a customized residuals() function that should be used by residuals.bamlss().
Once all formulas for a family are worked out, it usually takes about an hour to create a new family object. Of course, this also depends on the complexity of the density function. For some families it can be worthwhile, for speed reasons, to port the functions to C, for example. In our experience, programming then takes only slightly longer, about 2 to 3 hours.
For a list of all implemented families, please see the documentation of ?bamlss.family.

Estimation engines
Estimation engines in bamlss are usually based on the model frame setup function bamlss.frame() (see Section 4.1), i.e., the functions all have an x argument, which contains all the necessary model and penalty matrices, and a y argument, which is the response (univariate or multivariate). In addition, an estimation engine usually has a family argument, which specifies the model to be estimated. However, this is not a mandatory argument, i.e., one could also write an estimation function that is designed for one specific problem only. As mentioned at the beginning of Section 4, there is a naming convention: optimizer functions start with the prefix opt_* and sampler functions with sam_*. The naming convention is not mandatory, but it gives the user a better overview of the many functions of the package.
The modeling setup is best explained by looking at the main estimation engines provided by bamlss. The default optimizer used by the bamlss() wrapper function is opt_bfit(), which is a backfitting routine. The most important arguments are

opt_bfit(x, y, family, start = NULL, weights = NULL, offset = NULL, ...)

The default sampling engine in bamlss is sam_GMCMC(); again, the most important arguments are

sam_GMCMC(x, y, family, start = NULL, weights = NULL, offset = NULL, ...)

So basically the arguments of the optimizer and the sampling function are the same; the main difference is the return value. In bamlss, optimizer functions usually return a vector of estimated regression coefficients (parameters), while sampling functions return a matrix of parameter samples of class "mcmc" or "mcmc.list" (for details see the documentation of the coda package).
Internally, what the optimizer or sampling function actually processes is not important for the bamlss() wrapper function, as long as a vector or matrix of parameters is returned. For optimizer functions, the return value needs to be a named list with an element "parameters", the vector (or also a matrix, e.g., for the opt_lasso() and opt_boost() optimizers) of estimated parameters. The most important requirement for making use of all extractor functions like summary.bamlss(), predict.bamlss(), plot.bamlss(), residuals.bamlss(), etc., is to follow the naming convention of the returned estimates. The parameter names are based on the names of the distribution parameters as specified in the family object; for example, the family object gaussian_bamlss() has the parameter names "mu" and "sigma", which therefore prefix the names of the corresponding estimates.
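A minimal (hypothetical) optimizer skeleton illustrating this return structure is sketched below; the coefficient names assume the prefixing convention just described (e.g., "mu.p.(Intercept)" for a parametric intercept of mu), which should be checked against Appendix C:

opt_toy <- function(x, y, family, ...) {
  ## A toy engine that "estimates" only intercepts and sets them to zero.
  list("parameters" = c(
    "mu.p.(Intercept)" = 0,
    "sigma.p.(Intercept)" = 0
  ))
}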
An example of how to set up an estimation engine for linear regression models in bamlss is given in Appendix C. That example also provides details on the naming convention and the return values of optimizer and sampler functions.

Flexible count regression for lightning reanalysis
This section illustrates the workflow with bamlss along a small case study. We want to build a statistical model linking positive counts of cloud-to-ground lightning discharges to atmospheric quantities from a reanalysis dataset.

Motivation and data
The region we focus on is the European Eastern Alps. Cloud-to-ground lightning discharges, detected by the Austrian Lightning Detection and Information System (ALDIS, Schulz, Cummins, Diendorfer, and Dorninger 2005), are counted on grids with a mesh size of 32 km. The lightning observations are available for the period 2010-2018. The reanalysis data come from ERA5, the fifth generation of the ECMWF (European Centre for Medium-Range Weather Forecasts) atmospheric reanalyses of the global climate (Copernicus Climate Change Service 2017; Hersbach et al. 2020). ERA5 provides globally complete and consistent pseudo-observations of the atmosphere using the laws of physics. The horizontal resolution is approx. 32 km, while the temporal resolution is hourly and covers the years from 1979 to the present. In this example application we work only with a small subset of the data, which can be accessed from the accompanying R package FlashAustria (Simon 2021).

R> nrow(FlashAustriaTrain)
[1] 12000

The motivation for this application is as follows: lightning counts are not modeled within the atmospheric reanalyses, as their spatial resolution is too coarse for resolving the convective events that lead to lightning discharges. Homogeneous lightning observations are only available for a period in the order of a decade, here 2010-2018. Thus, based on a probabilistic statistical model, lightning counts for the time before 2010 could be fitted, enabling the analysis of lightning events in the past for which no observations are available. On the one hand this will increase our knowledge about the physical processes leading to such events, and on the other it will enable quantifying how these extreme short-term events are affected by a changing climate (Westra et al. 2014). Table 4 lists the covariates considered, which are based on a small subset of ERA5 quantities (Copernicus Climate Change Service 2017; Hersbach and et al. 2020) and include variables that are known to be good predictors for convective events (e.g., Simon et al. 2018).

Table 4: Covariates from the ERA5 reanalysis considered for modeling lightning counts.

Abbreviation    Description
d2m             2 metre dewpoint temperature, a measure of the humidity of the air: the temperature to which the air, at 2 metres above the surface of the Earth, would have to be cooled for saturation to occur.
q_prof_PC1      First principal component of the vertical profile of specific humidity q, decomposed by principal component analysis (PCA).
cswc_prof_PC4   Fourth principal component of the PCA-decomposed vertical profile of specific snow water content cswc.
t_prof_PC1      First principal component of the PCA-decomposed vertical profile of temperature t.
v_prof_PC2      Second principal component of the PCA-decomposed vertical profile of the v-component of the wind v.
sqrt_cape       Square root of convective available potential energy (CAPE), an indication of the (in)stability of the atmosphere.
sqrt_lsp        Square root of large-scale precipitation: accumulated liquid and frozen water, comprising rain and snow, generated by the cloud scheme of the numerical model.

Model specification
The response of our statistical model consists of positive counts, with a mean of 13.61 and a variance of 1180.63. Thus, we are facing a truncated count data distribution which is highly overdispersed (Cameron and Trivedi 2013). Simon, Mayr, Umlauf, and Zeileis (2019) employed a zero-truncated negative binomial distribution, which is specified by two parameters µ > 0 and θ > 0: µ is the expectation of the underlying untruncated negative binomial, and θ modifies the variance of the untruncated negative binomial via VAR(Ỹ) = µ + µ²/θ, where Ỹ is a latent random variable following the underlying untruncated negative binomial distribution.
The spatial and temporal scale of aggregation of the lightning discharges here differs from the one in Simon et al. (2019). Therefore, it is worth comparing the zero-truncated negative binomial against other distributions that can capture the truncation of the count data and its overdispersion. Hence, we also consider the zero-truncated Sichel distribution, which can additionally capture skewed count data.
The zero-truncated negative binomial distribution is implemented as ztnbinom_bamlss() within bamlss while the Sichel is available as SICHEL() within gamlss.dist. Using the gamlss.tr package the latter is truncated at zero so that it can be readily plugged into the family argument of bamlss().
R> library("gamlss.dist")
R> library("gamlss.tr")
R> ztSICHEL <- trun(0, family = "SICHEL", local = FALSE)

In the following we illustrate how to model the lightning counts with either of the two distributions. To specify smooth terms for all distributional parameters (for ztnbinom_bamlss() the parameters µ and θ, and for ztSICHEL() the parameters µ, σ, and ν) we set up a list of three formulas. Smooth P-splines (Eilers and Marx 1996), known for their good sampling properties, are employed for all predictors in the formula for µ. For the (over)dispersion model, large-scale precipitation is used in the second formula (without a parameter name on the left-hand side in order to be applicable to both distributional models). Finally, for the Sichel distribution a constant shape parameter is added in the third formula (which is ignored when using the formula list with the zero-truncated negative binomial distribution).
R> f <- list(
+   counts ~ s(d2m, bs = "ps") + s(q_prof_PC1, bs = "ps") +
+     s(cswc_prof_PC4, bs = "ps") + s(t_prof_PC1, bs = "ps") +
+     s(v_prof_PC2, bs = "ps") + s(sqrt_cape, bs = "ps"),
+   ~ s(sqrt_lsp, bs = "ps"),
+   ~ 1
+ )

Now we have all ingredients at hand to feed the standard interface for statistical models in R: a formula f, the families ztnbinom_bamlss() and ztSICHEL(), and a data set FlashAustriaTrain. Within the bamlss() call we also provide arguments which are passed forward to the optimizer and the sampler. We choose the gradient boosting optimizer opt_boost() in order to find initial values for the default sampler sam_GMCMC(). Gradient boosting has proven to be a very stable method for finding regression coefficients that serve as initial values for an MCMC sampler (Simon et al. 2019). In the following, we illustrate the estimation of the model with the ztSICHEL() family. We set the number of boosting iterations to maxit = 1000. For the sampling we allow 1000 iterations as burn-in phase and apply a thinning of 3 to the resulting chains. Running n.iter = 2000 iterations on 3 cores in parallel leads to 1000 MCMC samples in the end (note that parallel chains are started using the function mclapply() of the base R parallel package by setting the argument cores; see the manual of the sampler function sam_GMCMC()).

logLik -36636.9 eps 0.0003 iteration 1000
qsel 7
elapsed time: 28.99min
Starting the sampler...

|********************| 100% 0.00sec 115.67min

The model was fitted on three Intel Xeon CPU E5-4660 v4 cores with 2.20GHz, on which the boosting took about 29 minutes and the sampling about 1.9 hours on average. This is relatively slow, as ztSICHEL() uses a rather generic high-level R implementation. To fit the model flash_model_ztnbinom we used family = ztnbinom_bamlss() instead of family = ztSICHEL, with all other specifications left untouched.
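A call along the following lines reproduces this setup (a sketch mirroring the settings reported above; the exact call is part of the supplemental materials):

R> flash_model_ztSICHEL <- bamlss(f, data = FlashAustriaTrain,
+   family = ztSICHEL, optimizer = opt_boost, sampler = sam_GMCMC,
+   maxit = 1000, burnin = 1000, thin = 3, n.iter = 2000, cores = 3)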
R> data("FlashAustriaModel", package = "FlashAustria")

The corresponding R code is provided in the supplemental materials.

Model diagnostics
To select one of the two models, we examine their calibration using worm plots (Van Buuren and Fredriks 2001). The worm plot is implemented within bamlss for objects of class "bamlss.residuals" and can be selected via the which argument of the plot method for these objects:

R> resids <- c(
+   "ztnbinom" = residuals(flash_model_ztnbinom),
+   "ztSICHEL" = residuals(flash_model_ztSICHEL)
+ )
R> plot(resids, which = "wp", main = "Worm plot")

The worm plots (Figure 8) reveal that both ztSICHEL and ztnbinom somewhat underestimate the mass of the upper tail. However, for ztSICHEL the effect is less pronounced and the overall calibration is much better than for ztnbinom. Hence, we focus on the ztSICHEL() model but remark that most qualitative insights are very similar for ztnbinom.
As a next diagnostic we check the log-likelihood contributions of the individual terms during the boosting optimization (Figure 9).

R> pathplot(flash_model_ztSICHEL, which = "loglik.contrib")

After 1000 iterations the term s(q_prof_PC1).mu has the highest contribution to the log-likelihood with 144, followed by s(sqrt_cape).mu with 115 and s(d2m).mu with 50. The additional contributions to the log-likelihood at the end of the boosting procedure are very small, signaling that the algorithm has approached a state that is suitable for initializing the MCMC sampling.
The MCMC chains can be assessed by visualizations of their traces and autocorrelation functions (ACFs), exemplified in Figure 10 for the term s(q_prof_PC1) (for parameter µ of the Sichel distribution).
R> plot(flash_model_ztSICHEL, model = "mu", term = "s(q_prof_PC1)",
+   which = "samples")

The traces reveal samples around stable means, confirming that the 1000 boosting iterations and the 1000 burn-in samples were sufficient. The ACFs reveal quite some autocorrelation after the thinning, suggesting that the sampling effort should be increased further in a final model run.

Predictions and visualizations
As the boosting summary (Figure 9) reveals that the terms s(q_prof_PC1), s(sqrt_cape), and s(d2m) have the largest contributions to improving the fit, the corresponding effects are shown in Figure 11 to illustrate how the atmospheric quantities of the reanalysis are related to lightning events. The effects are presented on the scale of the additive predictor of the distributional parameter µ, i.e., on the log scale. A higher log(µ) results in a higher expectation of the count data distribution.
R> plot(flash_model_ztSICHEL, model = "mu",
+   term = c("s(q_prof_PC1)", "s(sqrt_cape)", "s(d2m)"))

The effect s(q_prof_PC1) shows a clear decrease. As q_prof_PC1 is the leading principal component of the vertical profile of specific humidity, one has to consider the corresponding spatial mode (not shown) for interpretation: positive values of q_prof_PC1 are linked to more moisture in the lower atmosphere (below 850 hPa) and less moisture in the mid atmosphere (between 850 hPa and 600 hPa). Thus, smaller values of the principal component mean that more moisture is available in the mid atmosphere, a source of latent energy that is released when water transfers from the gas to the liquid phase. This energy supports the occurrence of deep convection and thus of heavy lightning events. s(sqrt_cape) reveals an increasing shape. This means that higher convective available potential energy (CAPE) increases µ, which increases the expectation of the distribution and is thus associated with higher probabilities for events with high counts. Physically, the shape of the effect is meaningful, as more convective available potential energy has the potential to lead to heavier lightning events. The same is true for the increasing effect of s(d2m). Finally, the model is leveraged to predict a case from the period before 2010, for which no lightning data are available. The corresponding data set contains time and space information, and is of class "sf" (Pebesma 2018). We predict the parameters for this case and derive the probability of observing 10 or more flashes within a grid box and hour, conditional on thunderstorm activity, by applying the cumulative distribution function ...$p() of the family, which can be extracted from the fitted model using family(). The family contains functions to map the predictors to the parameter scale, the density, the cumulative distribution function, the log-likelihood, and scores and Hessians. The function ...$p() takes the quantile as its first argument and the list with the parameters, as returned by predict(), as its second argument.
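A sketch of this computation (the name FlashAustriaCase for the prediction data set is hypothetical; the full code is part of the supplemental materials):

R> fam <- family(flash_model_ztSICHEL)
R> par <- predict(flash_model_ztSICHEL, newdata = FlashAustriaCase,
+   type = "parameter")
R> ## P(Y >= 10) = 1 - P(Y <= 9) for the zero-truncated count distribution.
R> prob10 <- 1 - fam$p(9, par)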

Conclusion
The R package bamlss provides a comprehensive toolbox for estimating Bayesian distributional regression models. The package follows the typical R "look & feel", which makes it easy to get started. Similar to other implementations, bamlss is a modular "Lego toolbox"; however, the package stands out from others in that it makes complete sampling and/or optimizer functions exchangeable, so that users interested in extensions can easily set up new models, while all the elaborate data processing infrastructure and extractor functions are provided by the package. Several examples illustrate the functionality of the package. For the future it is planned to provide algorithms for very large data sets, a topic that the package so far treats only superficially. In addition, it is planned to expand the family infrastructure to support families from other implementations more easily.

A. Gaussian family object
The following R code sketches an example implementation of the Gaussian distribution as presented in Section 4.2.
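This is a condensed sketch of such a family object; the version shipped in the package additionally provides score and Hessian functions:

gaussian_bamlss <- function(...) {
  f <- list(
    "family" = "gaussian",
    "names" = c("mu", "sigma"),                      ## distribution parameters
    "links" = c("mu" = "identity", "sigma" = "log"), ## link functions
    "d" = function(y, par, log = FALSE) {            ## density function
      dnorm(y, mean = par$mu, sd = par$sigma, log = log)
    },
    "p" = function(y, par, ...) {                    ## distribution function
      pnorm(y, mean = par$mu, sd = par$sigma, ...)
    }
  )
  class(f) <- "family.bamlss"
  return(f)
}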

B. Special model terms
The default estimation engines opt_bfit() and sam_GMCMC() (and also the gradient boosting optimizer opt_boost()) in bamlss provide support for the implementation of special model terms, i.e., model terms that cannot be represented by the mgcv smooth term constructor infrastructure. One simple example of such a special model term is a nonlinear growth curve, e.g., a nonlinear Gompertz curve; also the lasso model term constructor la() presented in Section 2.2 is a special bamlss model term. The special model term constructor is needed in the Gompertz case because the growth curve is nonlinear in its parameters β, so the default backfitting and sampling strategies cannot be applied. Fortunately, estimation algorithms in distributional regression can be split into separate updating equations (see also Section 3.2). This means that each model term can have its own updating function. A user interested in this feature only needs to write new smooth.construct() and Predict.matrix() methods, where the returned smooth object supplies the following additional elements:
• update(): An updating function to be used with optimizer opt_bfit().
• prior(): A function of the parameters b that evaluates the log-prior. Note that additional functions grad() and hess() can be supplied, evaluating the first and second derivatives of the log-prior w.r.t. the parameters b.
• fixed: Is the number of degrees of freedom fixed or not?
• state: This is a named list with starting values for the "parameters", the "fitted.values" and degrees of freedom "edf". Note that regression coefficients are always named with "b*" and shrinkage or smoothing variances with "tau2*" in the "parameters" vector.
• special.npar: The total number of parameters of this model term. This is needed for the internal setup because, e.g., the Gompertz function has three parameters while the design matrix has only one column.
To compute predictions for this model term, a new method for the Predict.matrix() function needs to be implemented, too.
In summary, only a few things have to be considered in order to build up special bamlss model terms. The sketch of a Gompertz smooth constructor given below is a good starting point for readers interested in using this feature.
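The following is a condensed sketch of such a constructor, assuming the elements listed above; the prior, the starting values, and the use of the generic optim()-based updater bfit_optim are illustrative choices rather than the exact implementation:

smooth.construct.gc.smooth.spec <- function(object, data, knots) {
  object$X <- matrix(as.numeric(data[[object$term]]), ncol = 1)
  object$fit.fun <- function(X, b, ...) {
    ## Gompertz growth curve f(x) = b1 * exp(-b2 * exp(-b3 * x)).
    b["b1"] * exp(-b["b2"] * exp(-b["b3"] * drop(X)))
  }
  object$update <- bfit_optim          ## updating function used by opt_bfit()
  object$prior <- function(b) sum(dnorm(b, sd = 1000, log = TRUE))
  object$fixed <- TRUE                 ## degrees of freedom are fixed
  object$special.npar <- 3             ## three parameters, one design column
  object$state <- list(
    "parameters" = c("b1" = 1, "b2" = 0.5, "b3" = 0.1),
    "fitted.values" = rep(0, nrow(object$X)),
    "edf" = 3
  )
  class(object) <- c("gc.smooth", "no.mgcv", "special")
  object
}

Predict.matrix.gc.smooth <- function(object, data) {
  matrix(as.numeric(data[[object$term]]), ncol = 1)
}

Once the constructor is registered, the term can be used in a bamlss formula, e.g., as s(time, bs = "gc").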

C. Model fitting engines for linear regression
In the following, to explain the setup and the naming convention of estimation engines in more detail, we implement

• a new family object for simple linear models y = x⊤β + ε with ε ∼ N(0, σ²),
• an optimizer function,
• and additionally an MCMC sampling function.
For illustration, the family object is kept very simple: we only model the mean function in terms of covariates.
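For example, a least-squares optimizer could be sketched as follows (a minimal sketch, assuming a purely linear predictor for "mu"; the slot x$mu$model.matrix holding the design matrix of the parametric terms is set up by bamlss.frame()):

opt_ols <- function(x, y, ...) {
  if (is.data.frame(y)) y <- y[[1L]]   ## response may arrive as a data frame
  X <- x$mu$model.matrix               ## design matrix of the linear terms
  ## ordinary least squares via the Cholesky factor of X'X
  b <- drop(chol2inv(chol(crossprod(X))) %*% crossprod(X, y))
  names(b) <- paste0("mu.p.", colnames(X))  ## naming convention "mu.p.<coef>"
  list("parameters" = b)
}

Such a function can then be plugged into bamlss() via the optimizer argument, exactly like the engines shipped with the package.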