Evaluating Probabilistic Forecasts with scoringRules

Probabilistic forecasts in the form of probability distributions over future events have become popular in several fields including meteorology, hydrology, economics, and demography. In typical applications, many alternative statistical models and data sources can be used to produce probabilistic forecasts. Hence, evaluating and selecting among competing methods is an important task. The scoringRules package for R provides functionality for comparative evaluation of probabilistic models based on proper scoring rules, covering a wide range of situations in applied work. This paper discusses implementation and usage details, presents case studies from meteorology and economics, and points to the relevant background literature.


Introduction: Forecast evaluation
Forecasts are generally surrounded by uncertainty, and being able to quantify this uncertainty is key to good decision making. Accordingly, probabilistic forecasts in the form of predictive probability distributions over future quantities or events have become popular over the last decades in various fields including meteorology, climate science, hydrology, seismology, economics, finance, demography and political science. Important examples include the United Nations' probabilistic population forecasts (Raftery et al. 2014), inflation projections issued by the Bank of England (e.g., Clements 2004), and the now widespread use of probabilistic ensemble methods in meteorology (Gneiting and Raftery 2005; Leutbecher and Palmer 2008). For recent reviews, see Gneiting and Katzfuss (2014) and Raftery (2016).
With the proliferation of probabilistic models arises the need for tools to evaluate the appropriateness of models and forecasts in a principled way. Various measures of forecast performance have been developed over the past decades to address this demand. Scoring rules are functions S(F, y) that evaluate the accuracy of a forecast distribution F, given that an outcome y was observed. As such, they allow one to compare alternative models, a crucial ability given the variety of theories, data sources and statistical specifications available in many situations. Conceptually, scoring rules can be thought of as error measures for distribution functions: While the squared error SE(x, y) = (y − x)² measures the performance of a point forecast x, a scoring rule S(F, y) measures the performance of a distribution forecast F. This paper introduces the R (R Core Team 2017) software package scoringRules (Jordan et al. 2018b), which provides functions to compute scoring rules for a variety of distributions F that come up in applied work, and popular choices of S. Two main classes of probabilistic forecasts are parametric distributions and distributions that are not known analytically, but are indirectly described through a sample of simulation draws. For example, Bayesian forecasts produced via Markov chain Monte Carlo (MCMC) methods take the latter form. Hence, the scoringRules package provides a general framework for model evaluation that covers both classical (frequentist) and Bayesian forecasting methods.
The scoringRules package aims to be a comprehensive library for computing scoring rules. We offer implementations of several known (but not routinely applied) formulas, and implement some closed-form expressions that were previously unavailable. Whenever more than one implementation variant exists, we offer statistically principled default choices. The package contains the continuous ranked probability score (CRPS) and the logarithmic score, as well as the multivariate energy score and variogram score. All of these scoring rules are proper, which means that forecasters have an incentive to state their true belief, see Section 2.
It is worth emphasizing that scoring rules are designed for comparative forecast evaluation. That is, one wants to know whether model A or model B provides better forecasts, in terms of a proper scoring rule. Comparative forecast evaluation is of interest either for choosing a specification for future use, or for comparing various scientific approaches. A distinct, complementary issue is to check the suitability of a given model via tools for absolute forecast evaluation. Here, the focus typically lies on diagnostics, e.g., on whether the predictive distributions match the observations statistically (for example, via probability integral transform histograms; see Gneiting and Katzfuss 2014). To retain focus, the scoringRules package does not cover absolute forecast evaluation.
Comparative forecast evaluation shares key aspects with out-of-sample model comparison. In that sense, scoringRules is broadly related to all software packages which help users determine an appropriate model for the data at hand. Perhaps most fundamentally, the stats (R Core Team 2017) package provides the traditional Akaike and Bayes information criteria for in-sample selection among linear models. The packages caret (Kuhn et al. 2018) and forecast (Hyndman and Khandakar 2008) provide cross-validation tools suitable for cross-sectional and time series data, respectively. The loo (Vehtari et al. 2018) package implements recent proposals to select among Bayesian models. In contrast to existing software, a key novelty of the scoringRules package is its extensive coverage of the CRPS. This scoring rule is attractive for both practical and theoretical reasons (Gneiting and Raftery 2007; Krüger et al. 2016), yet its more widespread use has been hampered by computational challenges. In providing analytical formulas and efficient numerical implementations, we hope to enable convenient use of the CRPS in applied work.
To the best of our knowledge, scoringRules is the first R package designed as a library for computing proper scoring rules for many types of forecast distributions. However, a number of existing R packages include scoring rule computations for more specific empirical situations: The ensembleBMA (Fraley et al. 2018) and ensembleMOS (Yuen et al. 2018) packages contain formulas for the CRPS of a small subset of the distributions listed in Table 1 which are relevant for post-processing ensemble weather forecasts (Fraley et al. 2011), and can only be applied to specific data structures utilized in the packages. The surveillance (Meyer et al. 2017) package provides functions to compute the logarithmic score and other scoring rules for count data models in epidemiology. By contrast, the distributions contained in scoringRules are relevant in applications across disciplines, and the score functions are generally applicable. The scoring (Merkle and Steyvers 2013) package focuses on discrete (categorical) outcomes, for which it offers a large number of proper scoring rules. It is thus complementary to scoringRules, which supports a wide range of forecast distributions while focusing on a smaller number of scoring rules. Furthermore, the verification (NCAR - Research Applications Laboratory 2015) and SpecsVerification (Siegert 2017) packages contain implementations of the CRPS for simulated forecast distributions. Our contribution in that domain is twofold: First, we offer efficient implementations to deal with predictive distributions given as large samples. MCMC methods are popular across the disciplines, and many sophisticated R implementations are available (e.g., Kastner 2016; Carpenter et al. 2017). Second, we include various implementation options, and propose principled default settings based on recent research (Krüger et al. 2016).
For programming languages other than R, implementations of proper scoring rules are sparse, and generally cover a much narrower score computation functionality than the scoringRules package. The properscoring (The Climate Corporation 2015) package for the Python (Python Software Foundation 2017) language provides implementations of the CRPS for Gaussian distributions and for forecast distributions given by a discrete sample. Several institutionally supported software packages include tools to compute scoring rules, but typically require input in specific data formats and are tailored towards operational use at meteorological institutions. The Model Evaluation Tools (Developmental Testbed Center 2018) software provides code to compute the CRPS based on a sample from the forecast distribution. However, note that a Gaussian approximation is applied which can be problematic if the underlying distribution is not Gaussian, see Krüger et al. (2016). The Ensemble Verification System (Brown et al. 2010) also provides an implementation of the CRPS for discrete samples. For a general overview of software for forecast evaluation in meteorology, see Pocernich (2012).
The remainder of this paper is organized as follows. Section 2 provides some theoretical background on scoring rules, and introduces the logarithmic score and the continuous ranked probability score. Section 3 gives an overview of the score computation functionality in the scoringRules package and presents the implementation of univariate proper scoring rules. In Section 4, we present usage examples in two case studies. In a meteorological example of accumulated precipitation forecasts, we compare ensemble system output from numerical weather prediction models to parametric forecast distributions from statistical post-processing models. A second case study shows how using analytical information of a Bayesian time series model for the growth rate of the US economy's gross domestic product (GDP) can help in evaluating the model's simulation output. Definitions and details on the use of multivariate scoring rules are provided in Section 5. The paper closes with a discussion in Section 6. Two appendices contain various closed-form expressions for the CRPS, as well as details on evaluating multivariate forecast distributions.

Theoretical background
Probabilistic forecasts usually fit one of two categories: parametric distributions or simulated samples. The former type is represented by its cumulative distribution function (CDF) or its probability density function (PDF), whereas the latter is often used if the predictive distribution is not available analytically. Here, we give a brief overview of the theoretical background for comparative forecast evaluation in both cases.

Proper scoring rules
Let Ω denote the set of possible values of the quantity of interest, Y, and let F denote a convex class of probability distributions on Ω. A scoring rule is a function that assigns numerical values to pairs of forecasts F ∈ F and observations y ∈ Ω. For now, we restrict our attention to univariate observations and set Ω = R or subsets thereof, and identify probabilistic forecasts F with the associated CDF F or PDF f. In Section 5, we will consider multivariate scoring rules for which Ω = R^d.
We consider scoring rules to be negatively oriented, such that a lower score indicates a better forecast. For a proper scoring rule, the expected score is optimized if the true distribution of the observation is issued as a forecast, i.e., if

E_G S(G, Y) ≤ E_G S(F, Y)

for all F, G ∈ F, where E_G denotes the expectation with respect to Y ∼ G. A scoring rule is further called strictly proper if equality holds only if F = G. Using a proper scoring rule is critical for comparative evaluation, i.e., the ranking of forecasts. In practice, the lowest average score over multiple forecast cases among competing forecasters indicates the best predictive performance, and in this setup, proper scoring rules compel forecasters to truthfully report what they think is the true distribution. See Gneiting and Raftery (2007) for a more detailed review of the mathematical properties.
Popular examples of proper scoring rules for Ω = R include the logarithmic score and the continuous ranked probability score. The logarithmic score (LogS; Good 1952) is defined as

LogS(F, y) = − log f(y),

where F admits a PDF f, and is a strictly proper scoring rule relative to the class of probability distributions with densities. The continuous ranked probability score (Matheson and Winkler 1976) is defined in terms of the predictive CDF F and is given by

CRPS(F, y) = ∫_R (F(z) − 1{y ≤ z})² dz,   (1)

where 1{y ≤ z} denotes the indicator function which is one if y ≤ z and zero otherwise. If the first moment of F is finite, the CRPS can be written as

CRPS(F, y) = E_F |X₁ − y| − (1/2) E_F |X₁ − X₂|,

where X₁ and X₂ are independent random variables with distribution F, see Gneiting and Raftery (2007). The CRPS is a strictly proper scoring rule for the class of probability distributions with finite first moment. Closed-form expressions of the integral in Equation 1 can be obtained for many parametric distributions and allow for exact and efficient computation of the CRPS. They are implemented in the scoringRules package for a range of parametric families, see Table 1 for an overview, and are provided in Appendix A.
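As a concrete illustration, the Gaussian CRPS admits a well-known closed form (Gneiting and Raftery 2007), which also lets us check propriety numerically. The following is a base-R sketch with an illustrative function name, not the package's internal code:

```r
# Closed-form CRPS for a Gaussian forecast N(mu, sd^2),
# following Gneiting and Raftery (2007); base R only.
crps_norm_cf <- function(y, mu = 0, sd = 1) {
  z <- (y - mu) / sd
  sd * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}

# Propriety in action: under Y ~ N(0, 1), the true forecast N(0, 1)
# achieves a lower average CRPS than the misspecified N(1, 2^2).
set.seed(1)
y <- rnorm(1e5)
score_true  <- mean(crps_norm_cf(y, 0, 1))
score_wrong <- mean(crps_norm_cf(y, 1, 2))
```

As propriety requires, the true forecast attains the smaller average score in this Monte Carlo check.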

Model assessment based on simulated forecast distributions
In various applications, the forecast distribution of interest F is not available in an analytic form, but only through a simulated sample X₁, . . ., X_m ∼ F. Examples include Bayesian forecasting applications where the sample is generated by an MCMC algorithm, or ensemble weather forecasting applications where the different sample values are generated by numerical weather prediction models with different model physics and/or initial conditions. In order to compute the value of a proper scoring rule, the simulated sample needs to be converted into an estimated distribution (say, F̂_m(z)) that can be evaluated at any point z ∈ R. The implementation choices and default settings in the scoringRules package follow the findings of Krüger et al. (2016), who provide a systematic analysis of probabilistic forecasting based on MCMC output.
For the CRPS, the empirical CDF F̂_m is a natural approximation of the predictive CDF. In this case, the CRPS reduces to

CRPS(F̂_m, y) = (1/m) Σ_{i=1}^m |X_i − y| − (1/(2m²)) Σ_{i=1}^m Σ_{j=1}^m |X_i − X_j|,   (2)

which allows one to compute the CRPS directly from the simulated sample, see Grimit et al. (2006). Implementations of Equation 2 are rather inefficient with computational complexity O(m²), and can be improved upon with representations using the order statistics X_(1), . . ., X_(m), i.e., the sorted simulated sample, thus achieving an average O(m log m) performance. In the scoringRules package, we use an algebraically equivalent representation of the CRPS based on the generalized quantile function (Laio and Tamea 2007), leading to

CRPS(F̂_m, y) = (2/m²) Σ_{i=1}^m (X_(i) − y) (m 1{y < X_(i)} − i + 1/2),   (3)

which Murphy (1970) reported in the context of the precursory, discrete version of the CRPS. We refer to Jordan (2016) for details.
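The two representations can be sketched in a few lines of base R; crps_edf_slow implements the double sum of Equation 2 and crps_edf_fast the sorted-sample form of Equation 3 (illustrative function names, not the package implementation):

```r
# Two algebraically equivalent estimators of the CRPS from a
# simulated sample x: the O(m^2) double sum (Equation 2) and the
# O(m log m) sorted-sample representation (Equation 3).
crps_edf_slow <- function(y, x) {
  m <- length(x)
  mean(abs(x - y)) - sum(abs(outer(x, x, "-"))) / (2 * m^2)
}
crps_edf_fast <- function(y, x) {
  m <- length(x)
  x <- sort(x)
  2 / m^2 * sum((x - y) * (m * (y < x) - seq_len(m) + 0.5))
}
```

For instance, both functions return 0.5 for the sample {1, 3} and observation 2, but only the sorted-sample version scales well with m.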
In contrast to the CRPS, the computation of LogS requires a predictive density. An estimator can be obtained with classical nonparametric kernel density estimation (KDE, e.g., Silverman 1986). However, this estimator is valid only under stringent theoretical assumptions and can be fragile in practice: If the outcome falls into the tails of the simulated forecast distribution, the estimated score may be highly sensitive to the choice of the bandwidth tuning parameter.
In an MCMC context, a mixture-of-parameters estimator that utilizes a simulated sample of parameter draws rather than draws from the posterior predictive distribution is a better and often much more efficient choice, see Krüger et al. (2016). This mixture-of-parameters estimator is specific to the model being used, in that one needs to know the analytic form of the forecast distribution conditional on the parameter draws. If such knowledge is available, the mixture-of-parameters estimator can be implemented using functionality for parametric forecast distributions. We provide an example in Section 4.2. This example features a conditionally Gaussian forecast distribution, a typical case in applications.
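For a conditionally Gaussian model, the mixture-of-parameters estimator amounts to a one-liner. The following base-R sketch (with a hypothetical function name) mixes the Gaussian densities implied by the MCMC parameter draws uniformly:

```r
# Mixture-of-parameters estimator of LogS for a conditionally
# Gaussian model: the predictive density is the uniform mixture of
# the Gaussian densities implied by the draws (mu_i, sigma_i).
logs_mixture <- function(y, mu, sigma) {
  -log(mean(dnorm(y, mean = mu, sd = sigma)))
}
```

With a single parameter draw, this reduces to the exact Gaussian LogS; with many draws, it estimates the LogS of the posterior predictive mixture.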

Package design and functionality
The main functionality of scoringRules is the computation of scores. The essential functions for this purpose follow the naming convention [score]_[suffix](), where the two most mature choices for [score] are crps and logs. Regarding the [suffix], we aim for analogy to the usual d/p/q/r functions for parametric families of distributions, both in terms of naming convention and usage details, e.g.,

crps_norm(y, mean = 0, sd = 1, location = mean, scale = sd)

Based on these computation functions, package developers may write S3 methods that hook into the respective S3 generic functions, currently limited to

crps(y, ...)
logs(y, ...)

As the implementation of additional scoring rules matures, this list will be extended. We reserve methods for the class 'numeric', e.g.,

crps.numeric(y, family, ...)

which are pedantic wrappers for the corresponding [score]_[family]() functions, but with meaningful error messages, making the 'numeric' class methods more suitable for interactive use as opposed to numerical score optimization or package integration. Table 1 gives a list of the implemented parametric families, as does the 'numeric' class method documentation for the respective score, e.g., see ?crps.numeric.
Echoing the distinction in Section 2 between parametric and empirical predictive distributions, we note that computation functions following the naming scheme [score]_sample(), see Sections 3.2 and 5, have a special status and cannot be called via the 'numeric' class method.

Parametric predictive distributions
When the predictive distribution comes from a parametric family, we have two options to perform the score computation and obtain the resulting vector of score values, i.e., via the score generics or the computation functions. For a Gaussian distribution:

R> library("scoringRules")
R> obs <- rnorm(10)
R> crps(obs, "norm", mean = c(1:10), sd = c(1:10))

 [1] 0.288 1.625 1.570 2.003 2.744 3.688 3.270 4.884 4.162 6.067

R> crps_norm(obs, mean = c(1:10), sd = c(1:10))

 [1] 0.288 1.625 1.570 2.003 2.744 3.688 3.270 4.884 4.162 6.067

The results are identical, except when the much stricter input checks of the 'numeric' class method are violated. This helps in detecting manual errors during interactive use, or in debugging automated model fitting and evaluation. Other restrictions in the 'numeric' class method include the necessity to pass input to all arguments, i.e., default values in the computation functions are ignored, and that all numerical arguments should be vectors of the same length, with the exception that vectors of length one will be recycled. In contrast, the computation functions aim to closely obey standard R conventions.
In package development, we expect predominant use of the computation functions, where developers will handle regularity checks themselves. As a trivial example, we define functions that only depend on the argument y and compute scores for a fixed predictive gamma distribution:

R> crps_y <- function(y) crps_gamma(y, shape = 2, scale = 1.5)
R> logs_y <- function(y) logs_gamma(y, shape = 2, scale = 1.5)

In Figure 1 these functions are used to illustrate the dependence between the score value and the observation for a gamma forecast distribution. The logarithmic score increases rapidly as the observation approaches the right-sided limit at zero, and the minimum score value is attained if the observation equals the predictive distribution's mode. By contrast, the CRPS is more symmetric around its minimum, which is attained at the median of the forecast distribution; in particular, it increases more slowly as the observation approaches zero.
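The integral in Equation 1 can also be evaluated numerically, which is useful for cross-checking closed-form implementations. A base-R sketch for the fixed gamma forecast above (illustrative function name; assumes y > 0):

```r
# CRPS of the gamma forecast with shape 2 and scale 1.5 by numerical
# integration of Equation 1; split at the observation y, where the
# integrand has a kink (assumes y > 0, matching the gamma support).
crps_gamma_num <- function(y, shape = 2, scale = 1.5) {
  f <- function(z) (pgamma(z, shape = shape, scale = scale) - (z >= y))^2
  integrate(f, 0, y)$value + integrate(f, y, Inf)$value
}
```

Consistent with the discussion of Figure 1, the numerically computed CRPS is smallest near the median of the forecast distribution and grows as the observation moves into the tails.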

Simulated predictive distributions
Often forecast distributions can only be given as simulated samples, e.g., ensemble systems in weather prediction (Section 4.1) or MCMC output in econometrics (Section 4.2). We provide functions for both univariate and multivariate samples. The latter are discussed in Section 5, whereas the former are presented here:

crps_sample(y, dat, method = "edf", w = NULL, bw = NULL, num_int = FALSE, show_messages = TRUE)
logs_sample(y, dat, bw = NULL, show_messages = FALSE)

The input for y is a vector of observations, and the input for dat is a matrix with the number of rows matching the length of y and each row comprising one simulated sample. When y has length one, dat may also be a vector. Random sampling from the forecast distribution can be seen as an option to approximate the values of the proper scoring rules. To empirically assess the quality of this approximation and to illustrate the use of the score functions, consider a Gaussian toy example in which we examine the performance of forecasts given as samples of size up to 5 000, with the approximation experiment independently repeated 500 times. In this case, the true CRPS and LogS values can be computed using the crps() and logs() functions. Figure 2 graphically illustrates how the scores based on sampling approximations become more accurate as the sample size increases.
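A scaled-down version of this experiment can be reproduced with base R alone, using the closed-form Gaussian CRPS and the sorted-sample CRPS estimator described in Section 2 (illustrative code with a single observation y and three sample sizes):

```r
# As the sample size m grows, the empirical-CDF CRPS estimate for
# draws from N(0, 1) approaches the analytical CRPS (base R only).
crps_norm_exact <- function(y) {
  y * (2 * pnorm(y) - 1) + 2 * dnorm(y) - 1 / sqrt(pi)
}
crps_edf <- function(y, x) {
  m <- length(x)
  x <- sort(x)
  2 / m^2 * sum((x - y) * (m * (y < x) - seq_len(m) + 0.5))
}

set.seed(123)
y <- 0.5
err <- sapply(c(50, 500, 5000),
              function(m) abs(crps_edf(y, rnorm(m)) - crps_norm_exact(y)))
```

The absolute approximation error shrinks with the sample size, mirroring the pattern shown in Figure 2.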
The method argument controls which approximation method is used in crps_sample(), with possible choices given by "edf" (empirical distribution function) and "kde" (kernel density estimation). The default choice "edf" corresponds to computing the approximation from Equation 2, implemented as in Equation 3. A vector or matrix of weights, matching the input for dat, can be passed to the argument w in order to compute the CRPS of a weighted empirical distribution, i.e., of any distribution with a finite number of outcomes.
For kernel density estimation, i.e., the default in logs_sample() and the corresponding method in crps_sample(), we use a Gaussian kernel to estimate the predictive distribution. Kernel density estimation is an unusual choice for the CRPS, but it is the only implemented option for evaluating the LogS of a simulated sample; in particular, the empirical distribution function is not applicable to LogS because an estimated density is required. The bw argument allows one to manually select a bandwidth parameter for kernel density estimation; by default, the bw.nrd() function from the stats (R Core Team 2017) package is employed.
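As a base-R sketch of the kernel density approach (illustrative function name, not the package implementation), the LogS estimate is the negative log of the average Gaussian kernel density at the observation, with the bandwidth defaulting to bw.nrd():

```r
# Kernel density estimate of LogS from a simulated sample dat:
# a Gaussian kernel centered at each draw, bandwidth via bw.nrd().
logs_kde <- function(y, dat, bw = bw.nrd(dat)) {
  -log(mean(dnorm(y, mean = dat, sd = bw)))
}

set.seed(1)
dat <- rnorm(1000)
val <- logs_kde(0, dat)   # close to the exact LogS -log(dnorm(0))
```

Note the bandwidth sensitivity discussed above: for observations in the tails of dat, the value of bw can change the estimated score considerably.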

Probabilistic weather forecasting via ensemble post-processing
In numerical weather prediction (NWP), physical processes in the atmosphere are modeled through systems of partial differential equations that are solved numerically on three-dimensional grids. To account for major sources of uncertainty, weather forecasts are typically obtained from multiple runs of NWP models with varying initial conditions and model physics, resulting in a set of deterministic predictions, called the forecast ensemble. While ensemble predictions are an important step from deterministic to probabilistic forecasts, they tend to be biased and underdispersive (such that, empirically, the actual observation falls outside the range of the ensemble too frequently). Hence, ensembles require some form of statistical post-processing. Over the past decade, a variety of approaches to statistical post-processing has been proposed, including non-homogeneous regression (Gneiting et al. 2005) and Bayesian model averaging (Raftery et al. 2005).
Here we illustrate how to evaluate post-processed ensemble forecasts of precipitation, based on data and methods from the crch (Messner et al. 2016) package for R. We model the conditional distribution of precipitation accumulation, Y ≥ 0, given the ensemble forecasts X₁, . . ., X_m, using censored non-homogeneous regression models of the form

P(Y = 0) = F_θ(0),   (4)
P(Y ≤ y) = F_θ(y), y > 0,   (5)

where F_θ is the CDF of a continuous parametric distribution with parameters θ. Equations 4 and 5 specify a mixed discrete-continuous forecast distribution for precipitation: There is a positive probability of observing no precipitation at all (Y = 0); however, if Y > 0, it can take many possible values y. In order to incorporate information from the raw forecast ensemble, we let θ be a function of X₁, . . ., X_m, i.e., we use features of the raw ensemble to determine the parameters of the forecast distribution. Specifically, we consider different location-scale families F_θ, and model the location parameter µ as a linear function of the ensemble mean x̄,

µ = a₀ + a₁ x̄,

and the scale parameter σ as a linear function of the logarithm of the ensemble standard deviation s,

log(σ) = b₀ + b₁ log(s).
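For a Gaussian F_θ, the mixed discrete-continuous distribution of Equations 4 and 5 can be sketched in base R as follows (hypothetical function names, for illustration only):

```r
# Mixed discrete-continuous forecast distribution for precipitation,
# sketched for a Gaussian F_theta censored at zero: a point mass
# P(Y = 0) = F_theta(0) and the continuous CDF F_theta(y) for y > 0.
pcens_norm <- function(y, mu, sigma) {
  ifelse(y < 0, 0, pnorm(y, mean = mu, sd = sigma))
}
prob_dry <- function(mu, sigma) pnorm(0, mean = mu, sd = sigma)
```

For example, a forecast with location 0 assigns probability 0.5 to a dry day (Y = 0), and the CDF jumps from 0 to that probability at y = 0.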
A logarithmic link function is used to ensure positivity of the scale parameter. The coefficients a₀, a₁, b₀, b₁ can be estimated using maximum likelihood approaches implemented in the crch package. As the choice of a suitable parametric family F_θ is not obvious, we estimate censored regression models based on the logistic, Student's t, and Gaussian distributions, and produce the parameters of the forecast distributions for the evaluation period using built-in functionality of the crch package. We only show the code for the Gaussian model since it can be adapted straightforwardly for the logistic and Student's t models.
R> CRCHgauss <- crch(rain ~ ensmean | log(enssd), data_train,
+    dist = "gaussian", left = 0)
R> gauss_mu <- predict(CRCHgauss, data_eval, type = "location")
R> gauss_sc <- predict(CRCHgauss, data_eval, type = "scale")

The raw ensemble of forecasts is a natural benchmark for comparison, since interest commonly lies in quantifying the gains in forecast accuracy that result from post-processing. Figure 3 shows the models' forecast distributions in three illustrative cases. To evaluate forecast performance over the entire out-of-sample period, we use the function crps() for the model outputs and the function crps_sample() to compute the CRPS of the raw ensemble. Note that we have to turn ens_fc into an object of class 'matrix' manually.
R> obs <- data_eval$rain
R> gauss_crps <- crps(obs, family = "cnorm", location = gauss_mu,
+    scale = gauss_sc, lower = 0, upper = Inf)
R> ens_crps <- crps_sample(obs, dat = as.matrix(ens_fc))

The mean CRPS values indicate that all post-processing models substantially improve upon the raw ensemble forecasts. There are only small differences between the censored regression models, with the models based on the logistic and Student's t distributions slightly outperforming the model based on a normal distribution.

Bayesian forecasts of US GDP growth rate
We next present a representative example from economics, where the predictive distribution is given as a simulated sample. Hamilton (1989) first proposed the Markov switching autoregressive model to allow for regime-dependent modeling, i.e., to capture different recurring time-series characteristics. We consider the following simple variant of the model:

Y_t = ν + α Y_{t−1} + ε_t,   ε_t ∼ N(0, σ²_{s_t}),

where Y_t is the observed GDP growth rate of quarter t, and s_t ∈ {1, 2} is a latent state variable that represents two regimes in the residual variance σ²_{s_t}. Since s_t evolves according to a first-order Markov chain, the specification allows for the possibility that periods of high (or low) volatility cluster over time. The latter is a salient feature of the US GDP growth rate: For example, the series was much more volatile in the 1970s than in the 1990s. The model is estimated using Bayesian Markov chain Monte Carlo methods (e.g., Frühwirth-Schnatter 2006). Our implementation closely follows Krüger et al. (2016, Section 5), and uses the ar_ms() function and the data set gdp, both included in the scoringRules package. The data set gdp comprises US GDP growth observations val, and the corresponding quarters dt and vintages vint. Measuring economic variables is challenging; hence records tend to be revised, and each quarter has its own time series, or vintage, of past observations. As a result, the data set for 271 quarters from 1947 to 2014 contains 33 616 records.
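To build intuition for the model, the following base-R sketch simulates such a process with illustrative parameter values (not those estimated from the GDP data):

```r
# Simulating a Markov switching AR(1) process: the residual variance
# switches between two regimes governed by a first-order Markov
# chain, so that volatile periods cluster over time.
set.seed(1)
n <- 200
P <- matrix(c(0.95, 0.05,           # transition probabilities:
              0.10, 0.90),          # rows = current state, cols = next state
            nrow = 2, byrow = TRUE)
sigma <- c(0.5, 2)                  # low- and high-volatility regimes
nu <- 0.5; alpha <- 0.4             # illustrative mean-equation parameters

s <- numeric(n); y <- numeric(n)
s[1] <- 1; y[1] <- nu
for (t in 2:n) {
  s[t] <- sample(1:2, 1, prob = P[s[t - 1], ])
  y[t] <- nu + alpha * y[t - 1] + rnorm(1, sd = sigma[s[t]])
}
```

Because the chain is persistent (large diagonal entries in P), the simulated path shows extended calm stretches interrupted by clustered bursts of volatility, mimicking the behavior of the GDP growth series.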
We split the data into a training sample containing the data before the first quarter of 2014, and an evaluation period containing only the four quarters of 2014, using the most recent vintage in both cases:

R> data("gdp", package = "scoringRules")
R> data_train <- subset(gdp, vint == "2014Q1")
R> data_eval <- subset(gdp, vint == "2015Q1" & grepl("2014", dt))

As is typical for MCMC-based analysis, the model's forecast distribution F₀ is not available as an analytical formula, but must be approximated in some way. Following Krüger et al. (2016), a generic MCMC algorithm to generate samples of the parameter vector θ and of the posterior predictive distribution proceeds as follows: For i = 1, . . ., m, draw θ_i ∼ K(· | θ_{i−1}), where K is a transition kernel that is specific to the model under use, and then draw X_i ∼ F_c(· | θ_i), where F_c denotes the conditional distribution given the parameter values.
We use the function ar_ms() to fit the model and produce forecasts for the four quarters of 2014 based on the information available at the end of year 2013, i.e., a single prediction case where the forecast horizon extends from one to four quarters ahead. Here, the conditional distribution F_c is Gaussian, and we run the chain for 20 000 iterations.

R> h <- 4
R> m <- 20000
R> fc_params <- ar_ms(data_train$val, forecast_periods = h, n_rep = m)
This function call yields a simulated sample corresponding to {θ₁, . . ., θ_m}, in the form of matrices of parameters for the mean and standard deviation. We transpose these matrices so that rows correspond to the observations and columns to the position in the Markov chain:

R> mu <- t(fc_params$fcMeans)
R> Sd <- t(fc_params$fcSds)

Next, we draw the sample of possible observations corresponding to {X₁, . . ., X_m}, conditional on the Gaussian assumption and the available parameter information. We consider two competing estimators of the posterior predictive distribution F₀: The mixture-of-parameters estimator (MPE) builds on the simulated parameter values by mixing a series of Gaussian distributions uniformly, whereas the empirical-CDF-based approximation utilizes the simulated sample from the conditional distribution given the parameter values, instead of building on the simulated parameter values directly. A standard choice for a smoother approximation is to replace the indicator function with a Gaussian kernel, as in the logs_sample() function.
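The draw of {X₁, . . ., X_m} can be sketched in base R; here we substitute toy parameter matrices for the actual ar_ms() output, exploiting that rnorm() recycles matrix-valued mean and sd arguments:

```r
# Drawing the posterior predictive sample for a conditionally
# Gaussian model: each column of mu and Sd holds the h-step-ahead
# parameters of one MCMC draw (toy values stand in for the matrices
# t(fc_params$fcMeans) and t(fc_params$fcSds) from ar_ms()).
set.seed(1)
h <- 4; m <- 1000
mu <- matrix(rnorm(h * m, mean = 2), nrow = h)   # stand-in parameter draws
Sd <- matrix(runif(h * m, 1, 2), nrow = h)       # stand-in parameter draws
X  <- matrix(rnorm(h * m, mean = mu, sd = Sd), nrow = h)
```

Each row of X is then a simulated sample from the posterior predictive distribution of one forecast horizon, in the format expected by crps_sample() and logs_sample().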
The two alternative estimators are illustrated in Figure 4: For each date, the histogram represents a simulated sample from the model's forecast distribution, and the black line indicates the mixture-of-parameters estimator. We can observe a distinct increase in forecast uncertainty as the forecast horizon grows from one to four quarters ahead.
The algorithm and approximation methods just sketched are not idiosyncratic to our example, but arise whenever a Bayesian model is used for forecasting. For illustrative R implementations of other Bayesian models, see, e.g., the packages bayesGARCH (Ardia and Hoogerheide 2010) and stochvol (Kastner 2016).

Parameter estimation with scoring rules
Apart from comparative forecast evaluation, proper scoring rules also provide useful tools for parameter estimation.In the optimum score estimation framework of Gneiting and Raftery (2007, Section 9.1), the parameters of a model's forecast distribution are determined by optimizing the value of a proper scoring rule, on average over a training sample.Optimum score estimation based on the LogS corresponds to classical maximum likelihood estimation.
The score functions to compute CRPS and LogS for parametric forecast distributions included in scoringRules (see Table 1) thus offer tools for the straightforward implementation of such optimum score estimation approaches. Specifically, the worker functions of the form [score]_[family]() entail little overhead in terms of input checks and are thus well suited for use in numerical optimization procedures such as optim(). Furthermore, functions to compute gradients and Hessian matrices of the CRPS have been implemented for a subset of parametric families, and can be supplied to assist numerical optimizers. Such functions are available for the "logis", "norm" and "t" families and truncated and censored versions thereof ("clogis", "tlogis", "cnorm", "tnorm", "ct", "tt"); the corresponding computation functions follow the naming schemes [gradcrps]_[family]() and [hesscrps]_[family](). However, we emphasize that implementing minimum CRPS or LogS estimation approaches is possible for all parametric families listed in Table 1, even if analytical gradient and Hessian functions are not supplied but are instead approximated numerically by optim(). As an example, the mean and standard deviation of a normal distribution with true values −1 and 2 can be estimated by passing the score functions to optim(); the estimation with sample size 500 is repeated 1 000 times.

(Figure: boxplots of the deviations of the estimates from the true values, µ̂ − µ and σ̂ − σ, for minimum CRPS and maximum likelihood estimation.)

For the standard deviation parameter σ, the difference between estimate and true value exhibits slightly less variability for the maximum likelihood method.
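A self-contained base-R sketch of minimum CRPS estimation for a single simulated sample (one of the 1 000 repetitions) could look as follows, using the closed-form Gaussian CRPS and a log-scale parametrization; function and variable names are illustrative:

```r
# Minimum CRPS estimation for a Gaussian: minimize the average
# closed-form CRPS over a training sample with optim(), using
# log(sigma) as the free parameter to keep the scale positive.
crps_norm_cf <- function(y, mu, sd) {
  z <- (y - mu) / sd
  sd * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}

set.seed(1)
y <- rnorm(500, mean = -1, sd = 2)   # true values mu = -1, sigma = 2

obj <- function(par) mean(crps_norm_cf(y, par[1], exp(par[2])))
fit <- optim(c(0, 0), obj)           # Nelder-Mead on (mu, log sigma)
est <- c(mu = fit$par[1], sigma = exp(fit$par[2]))
```

The same scaffold works for any family in Table 1 by swapping in the corresponding crps_[family]() function; supplying the analytical gradient via the gr argument of optim() can speed up convergence where available.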

Multivariate scoring rules
The basic concept of proper scoring rules can be extended to multivariate forecast distributions for which the support Ω is given by R^d, d ∈ {2, 3, . . .}. A variety of multivariate proper scoring rules has been proposed in the literature. For example, the univariate LogS allows for a straightforward generalization to multivariate forecast distributions. However, parametric modeling and forecasting of multivariate observations is challenging, and when sampling is a feasible alternative we encounter the same, even exacerbated, problems in kernel density estimation as for univariate samples. The univariate CRPS can also be generalized to multivariate forecast distributions; one such generalization, the energy score, is discussed below. Finding closed-form expressions for parametric distributions is even more involved than for the univariate CRPS, but the robustness in the evaluation of sample forecasts is retained. We refer to Gneiting et al. (2008) and Scheuerer and Hamill (2015b) for a detailed discussion of multivariate proper scoring rules and limit our attention to the case where probabilistic forecasts are given as samples from the forecast distributions.
Let y = (y^(1), . . ., y^(d)) ∈ Ω = R^d, and let F denote a forecast distribution on R^d given through a simulated sample X_1, . . ., X_m. The scoringRules package provides implementations of the energy score (ES; Gneiting et al. 2008),

ES(F, y) = (1/m) Σ_{i=1}^{m} ||X_i − y|| − (1/(2m²)) Σ_{i=1}^{m} Σ_{j=1}^{m} ||X_i − X_j||,

where || · || denotes the Euclidean norm on R^d, and the variogram score of order p (VS_p; Scheuerer and Hamill 2015b),

VS_p(F, y) = Σ_{i=1}^{d} Σ_{j=1}^{d} w_{i,j} ( |y^(i) − y^(j)|^p − (1/m) Σ_{k=1}^{m} |X_k^(i) − X_k^(j)|^p )².

In the definition of VS_p, w_{i,j} is a non-negative weight that allows one to emphasize or downweight pairs of component combinations based on subjective expert decisions, and p is the order of the variogram score. Typical choices of p include 0.5 and 1.
ES and VS_p are implemented for multivariate forecast distributions given through simulated samples as the functions

es_sample(y, dat)
vs_sample(y, dat, w = NULL, p = 0.5)

These functions can only evaluate a single multivariate forecast case and always return a single number to simplify use and documentation; see Appendix B for an example of how to use apply() functions or for loops to apply them sequentially to multiple forecast cases. The observation input y is required to be a vector of length d, and the corresponding forecast input dat has to be given as a d × m matrix, the columns of which are the simulated samples X_1, . . ., X_m from the multivariate forecast distribution. In vs_sample() it is possible to specify a d × d matrix of non-negative weights for w, as described above.
The entry in the i-th row and j-th column of w corresponds to the weight assigned to the combination of the i-th and j-th components. If no weights are specified, constant weights with w_{i,j} = 1 for all i, j ∈ {1, . . ., d} are used. For details and examples on choosing appropriate weights, see Scheuerer and Hamill (2015b).
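To make the sample formulas concrete, the following base-R sketch implements them directly. The helper names es_sketch() and vs_sketch() are ours, for illustration only; the package's es_sample() and vs_sample() compute the same quantities with efficient compiled code.

```r
# Minimal base-R sketch of the sample versions of ES and VS_p.
# y: observation vector of length d; dat: d x m matrix whose columns
# are the simulated samples X_1, ..., X_m.
es_sketch <- function(y, dat) {
  m <- ncol(dat)
  # mean Euclidean distance between sample members and the observation
  term1 <- mean(sqrt(colSums((dat - y)^2)))
  # mean pairwise distance among sample members (full matrix counts
  # each pair twice, hence the division by 2 * m^2)
  dists <- as.matrix(dist(t(dat)))
  term1 - sum(dists) / (2 * m^2)
}

vs_sketch <- function(y, dat, w = NULL, p = 0.5) {
  d <- length(y)
  if (is.null(w)) w <- matrix(1, d, d)  # constant weights by default
  out <- 0
  for (i in 1:d) {
    for (j in 1:d) {
      vf <- mean(abs(dat[i, ] - dat[j, ])^p)  # forecast variogram
      out <- out + w[i, j] * (abs(y[i] - y[j])^p - vf)^2
    }
  }
  out
}
```

For a point forecast (m = 1), the ES reduces to the Euclidean distance between forecast and observation, and both scores vanish when the sample coincides with the observation.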
In the following, we give a usage example of the multivariate scoring rules using the results from the economic case study in Section 4.2. Instead of evaluating the forecasts separately for each horizon (as we did before), we now jointly evaluate the forecast performance over the four forecast horizons based on the four-variate simulated sample.
R> names(obs) <- NULL
R> es_sample(obs, dat = X)
[1] 4.13
R> vs_sample(obs, dat = X)
[1] 7.05

While this simple example refers to a single forecast case and a single model, a typical empirical analysis would consider the average scores (across several forecast cases) of two or more models.

Summary and discussion
The scoringRules package enables computing proper scoring rules for parametric and simulated forecast distributions. The package covers a wide range of situations prevalent in work on modeling and forecasting, and provides generally applicable and numerically efficient implementations based on recent theoretical considerations.
The main functions of the package, crps() and logs(), are S3 generics, for which we provide methods crps.numeric() and logs.numeric(). The package can be extended naturally by defining S3 methods for classes other than 'numeric'. For example, consider a fitted model object of class 'crch', obtained from the R package of the same name (Messner et al. 2016). An object of this class contains a detailed specification of the fitted model's forecast distribution (such as the parametric family of distributions and the values of the fitted parameters). This information could be utilized to write a specific method that computes the CRPS of a fitted model object.
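The S3 extension mechanism can be sketched in a few lines. The class 'pointfc' and its field below are our own toy illustration, not part of scoringRules or crch; a real method for class 'crch' would instead read the fitted family and parameters from the model object.

```r
# Hypothetical sketch of extending the crps() generic to a new class.
crps <- function(y, ...) UseMethod("crps")  # generic, as in scoringRules

# For a deterministic (point) forecast, the CRPS reduces to the
# absolute error between forecast and observation.
crps.pointfc <- function(y, object, ...) abs(y - object$value)

fc <- structure(list(value = 1.5), class = "pointfc")
crps(2, fc)  # dispatches to crps.pointfc(); absolute error 0.5
```

The same pattern, a method that extracts the forecast distribution from the fitted object and evaluates a score, carries over directly to classes describing full parametric distributions.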
The choice of an appropriate proper scoring rule for model evaluation or parameter estimation is a non-trivial task. We have implemented the widely used LogS and CRPS along with the multivariate ES and VS_p. Possible future extensions of the scoringRules package include the addition of further proper scoring rules, such as the Dawid-Sebastiani score (Dawid and Sebastiani 1999), which has been partially implemented. Further, given the availability of appropriate analytical expressions, the list of covered parametric families can be extended as demand arises and time allows.

Laplace distribution
The function crps_lapl() computes the CRPS for the standard Laplace distribution, and generalizes via location parameter µ ∈ R and scale parameter σ > 0. The CDFs are given by F_{µ,σ}(x) = F((x − µ)/σ) and F(x) = 1/2 + sgn(x)(1 − exp(−|x|))/2.

Logistic distribution
The function crps_logis() computes the CRPS for the standard logistic distribution, and generalizes via location parameter µ ∈ R and scale parameter σ > 0. The CDFs are given by F_{µ,σ}(x) = F((x − µ)/σ) and F(x) = (1 + exp(−x))^{−1}.

Normal distribution
The function crps_norm() computes the CRPS for the standard normal distribution, and generalizes via mean parameter µ ∈ R and sd parameter σ > 0, or alternatively, location and scale. The CDFs are given by Φ_{µ,σ}(x) = Φ((x − µ)/σ) and the standard normal CDF Φ. Derived by Gneiting et al. (2005).
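The closed-form expression of Gneiting et al. (2005) can be checked against the integral definition CRPS(F, y) = ∫ (F(z) − 1{z ≥ y})² dz. The helper names below are ours, for illustration; crps_norm() in the package implements the closed form.

```r
# Closed-form CRPS of a normal forecast distribution
crps_norm_closed <- function(y, mean = 0, sd = 1) {
  z <- (y - mean) / sd
  sd * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}

# Numerical evaluation of the defining integral, split at y so the
# integrand is smooth on each piece
crps_norm_numeric <- function(y, mean = 0, sd = 1) {
  left  <- integrate(function(z) pnorm(z, mean, sd)^2, -Inf, y)$value
  right <- integrate(function(z) (1 - pnorm(z, mean, sd))^2, y, Inf)$value
  left + right
}
```

For the setting of Figure 2 (mean 2, standard deviation 3, observation 0), both evaluations agree to numerical precision.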

Student's t distribution
The function crps_t() computes the CRPS for the standard Student's t distribution with df parameter ν > 1, and generalizes via location parameter µ ∈ R and scale parameter σ > 0. The CDFs and PDF are given by F_{ν,µ,σ}(x) = F_ν((x − µ)/σ), where F_ν and f_ν denote the CDF and PDF of the standard t distribution with ν degrees of freedom.

Two-piece normal distribution
The function crps_2pnorm() computes the CRPS for the two-piece normal distribution with scale1 and scale2 parameters σ₁, σ₂ > 0, and generalizes via location parameter µ ∈ R, where F^{u,U}_{l,L} is the CDF of the generalized truncated/censored normal distribution as in Section A.4.7. The CDF of the two-piece normal distribution is given by F(x) = (2σ₁/(σ₁ + σ₂)) Φ((x − µ)/σ₁) for x < µ, and F(x) = (σ₁ − σ₂)/(σ₁ + σ₂) + (2σ₂/(σ₁ + σ₂)) Φ((x − µ)/σ₂) for x ≥ µ. Gneiting and Thorarinsdottir (2010) give an explicit CRPS formula.

Exponential distribution
The function crps_exp() computes the CRPS for the exponential distribution with rate parameter λ > 0. The CDF is given by F_λ(x) = 1 − exp(−λx) for x ≥ 0, and F_λ(x) = 0 otherwise.

Gamma distribution
The function crps_gamma() computes the CRPS for the gamma distribution with shape parameter α > 0 and rate parameter β > 0, or alternatively, scale = 1/rate. The CDF F_{α,β} is the regularized incomplete gamma function. Derived by Möller and Scheuerer (2015).

Log-normal distribution
The function crps_lnorm() computes the CRPS for the log-normal distribution with meanlog parameter µ ∈ R and sdlog parameter σ > 0. The CDF is given by F_{µ,σ}(x) = Φ((log x − µ)/σ) for x > 0, and 0 otherwise. Derived by Baran and Lerch (2015).

Beta distribution
The function crps_beta() computes the CRPS for the beta distribution with shape1 and shape2 parameters α, β > 0, and generalizes via lower and upper parameters l, u ∈ R, l < u. The CDFs are given by F^u_{l,α,β}(x) = F_{α,β}((x − l)/(u − l)), where F_{α,β} denotes the CDF of the standard beta distribution. Taillardat et al. (2016) give an equivalent expression.

Continuous uniform distribution
The function crps_unif() computes the CRPS for the continuous uniform distribution on the unit interval, and generalizes via min and max parameters l, u ∈ R, l < u, and by allowing point masses in the boundaries, i.e., lmass and umass parameters L, U ≥ 0. The standard CDF is F(x) = x for x ∈ [0, 1], with F(x) = 0 for x < 0 and F(x) = 1 for x > 1.
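For the standard uniform without point masses, the defining integral CRPS(F, y) = ∫ (F(z) − 1{z ≥ y})² dz reduces to a simple closed form for y ∈ [0, 1], namely y³/3 + (1 − y)³/3. The helper names below are ours, for illustration only.

```r
# Closed form of the standard uniform CRPS for observations in [0, 1]
crps_unif01 <- function(y) y^3 / 3 + (1 - y)^3 / 3

# Numerical check via the integral definition, split at y
crps_unif01_numeric <- function(y) {
  left  <- integrate(function(z) punif(z)^2, 0, y)$value
  right <- integrate(function(z) (1 - punif(z))^2, y, 1)$value
  left + right
}
```

The score is minimized at the median y = 1/2, where it equals 1/12, and rises to 1/3 at the boundaries.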

Exponential distribution with point mass
The function crps_expM() computes the CRPS for the standard exponential distribution, and generalizes via location parameter µ ∈ R and scale parameter σ > 0, and by allowing a point mass in the boundary, i.e., a mass parameter M ∈ [0, 1]. The CDFs are given by F_{M,µ,σ}(x) = F_M((x − µ)/σ) and F_M(x) = M + (1 − M)(1 − exp(−x)) for x ≥ 0, and 0 otherwise.

Generalized Pareto distribution with point mass
The function crps_gpd() computes the CRPS for the generalized Pareto distribution with shape parameter ξ < 1, and generalizes via location parameter µ ∈ R and scale parameter σ > 0, and by allowing a point mass in the lower boundary, i.e., a mass parameter M ∈ [0, 1]. The CDFs are given by F_{M,ξ,µ,σ}(x) = F_{M,ξ}((x − µ)/σ). Friederichs and Thorarinsdottir (2012) give a CRPS formula for the generalized Pareto distribution without a point mass.

Generalized truncated/censored logistic distribution
The function crps_gtclogis() computes the CRPS for the generalized truncated/censored logistic distribution with location parameter µ ∈ R, scale parameter σ > 0, lower and upper boundary parameters l, u ∈ R, l < u, and by allowing point masses in the boundaries, i.e., lmass and umass parameters L, U ≥ 0. The standard logistic CDF is F(x) = (1 + exp(−x))^{−1}, and an otherwise required function is G(x) = xF(x) + log F(−x). The function crps_clogis() computes the CRPS for the special case in which the tail probabilities collapse into the respective boundary (censoring), and the function crps_tlogis() computes the CRPS for the special case L = U = 0 (truncation). Taillardat et al. (2016) give a formula for left-censoring at zero. Möller and Scheuerer (2015) give a formula for left-truncating at zero.

Generalized truncated/censored normal distribution
The function crps_gtcnorm() computes the CRPS for the generalized truncated/censored normal distribution with location parameter µ ∈ R, scale parameter σ > 0, lower and upper boundary parameters l, u ∈ R, l < u, and by allowing point masses in the boundaries, i.e., lmass and umass parameters L, U ≥ 0. The formulas involve the clamped observation, which equals l for y < l, y for l ≤ y < u, and u for y ≥ u.
The function crps_cnorm() computes the CRPS for the special case in which the tail probabilities collapse into the respective boundary (censoring), where the CDF equals 0 for x < l, Φ((x − µ)/σ) for l ≤ x < u, and 1 for x ≥ u.
The function crps_tnorm() computes the CRPS for the special case L = U = 0 (truncation), where the CDF equals 0 for x < l, (F(x) − F(l))/(F(u) − F(l)) with F(x) = Φ((x − µ)/σ) for l ≤ x < u, and 1 for x ≥ u. Thorarinsdottir and Gneiting (2010) give a formula for left-censoring at zero. Gneiting et al. (2006) give a formula for left-truncating at zero.
Generalized truncated/censored Student's t distribution

The function crps_gtct() computes the CRPS for the generalized truncated/censored Student's t distribution, with parameters analogous to the normal case. The formulas again involve the clamped observation, which equals l for y < l, y for l ≤ y < u, and u for y ≥ u. The function crps_ct() computes the CRPS for the special case in which the tail probabilities collapse into the respective boundary (censoring).
The function crps_tt() computes the CRPS for the special case L = U = 0 (truncation).

Hypergeometric distribution
The function crps_hyper() computes the CRPS for the hypergeometric distribution with two population parameters, the number m = 0, 1, . . . of entities with the relevant feature and the number n = 0, 1, . . . of entities without that feature, and a parameter for the size k = 0, . . ., m + n of the sample to be drawn.
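For a distribution supported on the integers, the integral definition CRPS(F, y) = ∫ (F(z) − 1{z ≥ y})² dz reduces to a finite sum over unit intervals, since the CDF is piecewise constant. The following base-R sketch exploits this; the helper name is ours and illustrates the computation rather than the package's implementation.

```r
# CRPS for the hypergeometric distribution via the integral definition.
# Assumes k >= 1 and an integer observation y in {0, ..., k}. Beyond k,
# the CDF equals 1 and the indicator as well, so those terms cancel.
crps_hyper_sketch <- function(y, m, n, k) {
  x <- 0:(k - 1)  # left endpoints of the unit intervals [x, x + 1)
  sum((phyper(x, m, n, k) - as.numeric(x >= y))^2)
}
```

As a check, for the degenerate case m = 5, n = 0, k = 5 the distribution is a point mass at 5, and the CRPS reduces to the absolute error |5 − y|.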

Figure 1: Score values for a gamma distribution with shape = 2 and scale = 1.5, dependent on the observation (functions crps_y() and logs_y() defined in the text). A scaled version of the predictive density is shown in gray.

Figure 2: The scores of a Gaussian forecast distribution with mean 2 and standard deviation 3 when a value of 0 is observed, estimated from an independent sample from the predictive distribution, and shown as a function of the size of the (expanding) sample. The horizontal dashed line represents the analytically calculated score. The gray area indicates empirical 90% confidence intervals for each sample size computed from 500 independent replications of the simulation experiment, and the gray line shows the corresponding mean value over all repetitions.

Figure 3: Illustration of the forecast distributions of the censored regression models for three illustrative 3-day accumulation periods (plot titles indicate the end of each period). The predicted probabilities of zero precipitation are shown as solid thick vertical lines at 0, and the colored thin lines indicate the upper tail on the positive half axis of the forecast densities f_θ, c.f. Equations 4 and 5. The raw ensemble forecasts are shown as short line segments at the bottom, and the realizing observation is indicated by the long dashed line.

Figure 4: Forecast distributions for the growth rate of US GDP. The forecasts stem from a Bayesian time series model, as detailed in Krüger et al. (2016, Section 5). Histograms summarize simulated forecast draws at each date. The mixture-of-normals approximation to the distribution is shown in black; realizing observations are shown by the dashed line.

Gebetsberger et al. (2017) provide a detailed comparison of maximum likelihood and minimum CRPS estimation in the context of non-homogeneous regression models for post-processing ensemble weather forecasts. Here we illustrate the use for minimum CRPS estimation in a simple simulation example. Consider an independent sample y_1, . . ., y_n from a normal distribution with mean µ and standard deviation σ. The analytical maximum likelihood estimates are given by μ_ML = (1/n) Σ_{i=1}^{n} y_i and σ²_ML = (1/n) Σ_{i=1}^{n} (y_i − μ_ML)². To determine the corresponding minimum CRPS estimates numerically, define wrapper functions which compute the mean CRPS and the corresponding gradient for a vector of training data y_train and distribution parameters param.

R> meancrps <- function(y_train, param) mean(crps_norm(y = y_train,
+    mean = param[1], sd = param[2]))
R> grad_meancrps <- function(y_train, param) apply(gradcrps_norm(y_train,
+    param[1], param[2]), 2, mean)
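A self-contained version of this estimation step can be sketched in base R by inlining the closed-form normal CRPS (so that no package is required) and handing the resulting criterion to optim(). The log-parameterization of the scale, and the helper names, are our own choices for this sketch; the document's version uses crps_norm() and gradcrps_norm() instead.

```r
# Minimum CRPS estimation for a normal sample, sketched without any
# package dependencies. True parameters: mean -1, sd 2.
set.seed(42)
y_train <- rnorm(500, mean = -1, sd = 2)

meancrps <- function(y_train, param) {
  s <- exp(param[2])  # log-parameterization keeps the scale positive
  z <- (y_train - param[1]) / s
  # closed-form CRPS of the normal distribution, averaged over the sample
  mean(s * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi)))
}

fit <- optim(par = c(0, 0), fn = function(p) meancrps(y_train, p))
c(mean = fit$par[1], sd = exp(fit$par[2]))  # close to the true (-1, 2)
```

With analytical gradients supplied via the gr argument and method = "BFGS", the same optimization typically converges in fewer evaluations.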

Figure 5: Boxplots of deviations from the true parameter values for estimates obtained via minimum CRPS and minimum LogS (i.e., maximum likelihood) estimation, based on 1 000 independent samples of size 500 from a normal distribution with mean µ = −1 and standard deviation σ = 2.

Table 1: List of implemented parametric families for which the CRPS and LogS can be computed via crps() and logs(). The character string is the corresponding value for the family argument. The CRPS formulas are given in Appendix A.
The crch package (Messner et al. 2016) contains a data set RainIbk of precipitation for Innsbruck, Austria. It comprises ensemble forecasts rainfc.1, . . ., rainfc.11 and observations rain for 4971 cases from January 2000 to September 2013. The precipitation amounts are accumulated over three days, and the corresponding 11-member ensemble consists of accumulated precipitation amount forecasts between five and eight days ahead. Following Messner et al. (2016), we model the square root of precipitation amounts and omit forecast cases where the ensemble has a standard deviation of zero. Following Scheuerer and Hamill (2015a) and Messner et al. (2016), we thus consider three alternative choices: the logistic, Gaussian, and Student's t distributions. For details and further alternatives, see, e.g., Messner et al. (2014), Scheuerer (2014), and Scheuerer and Hamill (2015a).