extremefit: A Package for Extreme Quantiles

extremefit is a package to estimate extreme quantiles and probabilities of rare events. The idea of our approach is to adjust the tail of the distribution function over a threshold with a Pareto distribution. We propose a pointwise data driven procedure to choose the threshold. To illustrate the method, we use simulated data sets and three real-world data sets included in the package.


Introduction
The investigation of extreme values plays an important role in several practical domains, such as insurance, biology and geology. For example, in Buishand, De Haan, and Zhou (2008), the authors study extremes to determine how severe rainfall periods occur in North Holland. Sharma, Khare, and Chakrabarti (1999) use an extreme values procedure to predict violations of air quality standards. Various applications have been presented in areas such as hydrology (Davison and Smith 1990; Katz, Parlange, and Naveau 2002), insurance (McNeil 1997; Rootzén and Tajvidi 1997) or finance (Danielsson and De Vries 1997; McNeil 1998; Embrechts, Resnick, and Samorodnitsky 1999; Gençay and Selçuk 2004). Other applications range from rainfall data (Gardes and Girard 2010) to earthquake analysis (Sornette, Knopoff, Kagan, and Vanneste 1996). Extreme value theory uses appropriate statistical models to estimate extreme quantiles and probabilities of rare events.
The idea of the approach implemented in the R (R Core Team 2018) package extremefit (Durrieu, Grama, Jaunatre, Pham, and Tricot 2018), which is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=extremefit, is to fit a Pareto distribution to the data over a threshold τ using the peak-over-threshold method. The choice of τ is a challenging problem: a large value leads to high variability, while a small value may increase the bias. We refer to Hall and Welsh (1985), Drees and Kaufmann (1998), Guillou and Hall (2001), Huisman, Koedijk, Kool, and Palm (2001), Beirlant, Goegebeur, Teugels, and Segers (2004), Grama and Spokoiny (2007, 2008) and El Methni, Gardes, Girard, and Guillou (2012), where several procedures for choosing the threshold τ have been proposed. Here, we adopt the method from Grama and Spokoiny (2008) and Durrieu, Grama, Pham, and Tricot (2015). The package extremefit also covers the modeling of time-dependent data. The analysis of time series involves a bandwidth parameter h whose data driven choice is non-trivial. We refer to Staniswalis (1989) and Loader (2006) for the choice of the bandwidth in nonparametric regression. For the purposes of extreme value modeling, we use a cross-validation approach from Durrieu et al. (2015).
The extremefit package is based on the methodology described in Durrieu et al. (2015). The package performs a nonparametric estimation of extreme quantiles and probabilities of rare events. It proposes a pointwise choice of the threshold τ and, for time series, a global choice of the bandwidth h and it provides graphical representations of the results.
The paper is organized as follows. Section 2 gives an overview of several existing R packages dealing with extreme value analysis. In Section 3, we describe the model and the estimation of the parameters, including the threshold τ and the bandwidth h choices. Section 4 contains a simulation study whose aim is to illustrate the performance of our approach. In Section 5, we give several applications on real data sets and we conclude in Section 6.

Extreme value packages
There exist several R packages dealing with extreme value analysis. We give a short description of some of them; for a detailed description of these packages, we refer to Gilleland, Ribatet, and Stephenson (2013). There also exists a CRAN Task View on extreme value analysis, which describes the registered packages available on CRAN (Dutang and Jaunatre 2017). Among the available packages, the well-known peak-over-threshold method mentioned before has several implementations, e.g., in the POT package (Ribatet and Dutang 2016).
Some of the packages have a specific use, such as the package SpatialExtremes (Ribatet, Singleton, and R Core Team 2018), which models spatial extremes and provides maximum likelihood estimation, Bayesian hierarchical and copula modeling, or the package fExtremes (Wuertz and many others 2017) for financial purposes using functions from the packages evd (Stephenson and Ferro 2018), evir (Pfaff, McNeil, and Stephenson 2018) and others.
The copula package (Hofert and Mächler 2011; Kojadinovic and Yan 2010) provides tools for exploring and modeling dependent data using copulas. The evd package provides both block maxima and peak-over-threshold computations based on maximum likelihood estimation in the univariate and bivariate cases. The evdbayes package (Stephenson and Ribatet 2014) extends the evd package with Bayesian statistical methods for univariate extreme value models. The package extRemes (Gilleland 2018) also implements univariate estimation of block maxima and peak-over-threshold by maximum likelihood estimation, allowing for non-stationarity. The package evir is based on fitting a generalized Pareto distribution with the Hill estimator over a given threshold. The package lmom (Hosking 2017) uses L-moments to estimate the parameters of extreme value distributions and provides quantile estimation for reliability or survival analysis. The package texmex (Southworth and Heffernan 2018) provides statistical extreme value modeling of threshold excesses, maxima and multivariate extremes, including maximum likelihood and Bayesian estimation of parameters.
In contrast to the previously described packages, the extremefit package provides tools for modeling heavy-tailed distributions without assuming a general parametric structure. The idea is to fit a parametric Pareto model to the tail of the unknown distribution over some threshold. The remaining part of the distribution is estimated nonparametrically, and a data driven algorithm for choosing the threshold is proposed in Section 3.2. We also provide a version of this method for analyzing extreme values of a time series, based on nonparametric kernel estimation. A data driven choice of the bandwidth parameter is given in Section 3.3. These estimators are studied in more detail in Durrieu et al. (2015).

Model and estimator
We consider F_t(x) = P(X ≤ x | T = t), the conditional distribution of a random variable X given a time covariate T = t, where x ∈ [x_0, ∞) and t ∈ [0, T_max]. We observe independent random variables X_{t_1}, ..., X_{t_n} associated with a sequence of times 0 ≤ t_1 < ... < t_n ≤ T_max, such that for each t_i the random variable X_{t_i} has distribution function F_{t_i}. The purpose of the extremefit package is to provide a pointwise estimation of the tail probability S_t(x) = 1 − F_t(x) and of the extreme p-quantile F_t^{−1}(p) for any t ∈ [0, T_max], given x > x_0 and p ∈ (0, 1). We assume that F_t is in the domain of attraction of the Fréchet distribution. The idea is to adjust, for some τ ≥ x_0, the excess distribution function

    F_{t,τ}(x) = 1 − (1 − F_t(x)) / (1 − F_t(τ)),  x ∈ [τ, ∞),    (1)

by a Pareto distribution

    G_{τ,θ}(x) = 1 − (x/τ)^{−1/θ},  x ∈ [τ, ∞),    (2)

where θ > 0 and τ ≥ x_0 is an unknown threshold, both depending on t. The justification of this approach is given by the Fisher-Tippett-Gnedenko theorem (Beirlant et al. 2004, Theorem 2.1), which states that F_t is in the domain of attraction of the Fréchet distribution if and only if 1 − F_{t,τ}(τx) → x^{−1/θ} as τ → ∞. This consideration is based on the peak-over-threshold (POT) approach (Beirlant et al. 2004). We consider the semi-parametric model defined by

    F_{t,τ,θ}(x) = F_t(x)                               if x ∈ [x_0, τ],
    F_{t,τ,θ}(x) = 1 − (1 − F_t(τ)) (x/τ)^{−1/θ}        if x ∈ (τ, ∞),    (3)

where τ ≥ x_0 is the threshold parameter. We explain in the sequel how to estimate F_t and θ, which are unknown in (3).
The estimator of F_t(x) is taken as the weighted empirical distribution function

    F̂_{t,h}(x) = Σ_{i=1}^{n} W_i(t) 1{X_{t_i} ≤ x},    (4)

where

    W_i(t) = K((t_i − t)/h) / Σ_{j=1}^{n} K((t_j − t)/h)    (5)

are the weights, K(·) is a kernel function assumed to be continuous, non-negative, symmetric, with support on the real line and such that K(x) ≤ 1, and h > 0 is a bandwidth.
By maximizing the weighted quasi-log-likelihood function (see Durrieu et al. 2015; Staniswalis 1989; Loader 2006) with respect to θ, we obtain the estimator

    θ̂_{t,h,τ} = Σ_{i=1}^{n} W_i(t) 1{X_{t_i} > τ} log(X_{t_i}/τ) / Σ_{i=1}^{n} W_i(t) 1{X_{t_i} > τ}.    (6)

Plugging (4) and (6) into the semi-parametric model (3), we obtain

    F̂_{t,h,τ}(x) = F̂_{t,h}(x)                                        if x ∈ [x_0, τ],
    F̂_{t,h,τ}(x) = 1 − (1 − F̂_{t,h}(τ)) (x/τ)^{−1/θ̂_{t,h,τ}}        if x ∈ (τ, ∞).    (7)

For any p ∈ (0, 1), the estimator of the p-quantile of X_t is defined by

    q̂_p(t) = F̂_{t,h}^{−1}(p)                              if p < p̂_τ,
    q̂_p(t) = τ ((1 − p)/(1 − p̂_τ))^{−θ̂_{t,h,τ}}          if p ≥ p̂_τ,    (8)

where p̂_τ = F̂_{t,h}(τ).
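The estimators above can be written down in a few lines of base R. The following is a minimal illustration of formulas (4)-(8), not the package's internal implementation; the kernel, sample and variable names are ours.

```r
# Truncated Gaussian kernel on [-1, 1] with sigma = 1/3 (one of the kernels
# used in the package; chosen here for illustration)
K <- function(u) exp(-u^2 / (2 * (1/3)^2)) * (abs(u) <= 1)

# Kernel weights W_i(t) of Equation (5)
weights <- function(times, t, h) {
  w <- K((times - t) / h)
  w / sum(w)
}

# Weighted empirical distribution function, Equation (4)
F.hat <- function(x, X, times, t, h) sum(weights(times, t, h) * (X <= x))

# Weighted quasi-likelihood estimator of theta, Equation (6)
theta.hat <- function(X, times, t, h, tau) {
  w <- weights(times, t, h)
  over <- X > tau
  sum(w[over] * log(X[over] / tau)) / sum(w[over])
}

# Plug-in extreme-quantile estimator, Equation (8), for p >= p.tau
quantile.hat <- function(p, X, times, t, h, tau) {
  p.tau <- F.hat(tau, X, times, t, h)
  tau * ((1 - p) / (1 - p.tau))^(-theta.hat(X, times, t, h, tau))
}

# Toy check on an i.i.d. Pareto sample with theta = 1
set.seed(1)
X <- 1 / runif(200)
times <- seq(0, 1, length.out = 200)
tau <- quantile(X, 0.9)
theta.hat(X, times, t = 0.5, h = 1, tau)          # estimate of theta (true value 1)
quantile.hat(0.999, X, times, t = 0.5, h = 1, tau) # extreme 0.999-quantile
```

Since the toy sample is i.i.d., the time covariate plays no real role here; with time-dependent data, the weights localize the estimation around t.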

Selection of the threshold
The determination of the threshold τ in model (3) is based on a sequential testing procedure, namely a goodness-of-fit test for the parametric part of the model. At each step of the procedure, the Pareto tail adjustment is tested based on the k upper order statistics. If it is not rejected, the number k of upper statistics is increased and the tail adjustment is tested again, until it is rejected. If the test rejects the parametric tail fit from the very beginning, the Pareto tail adjustment is not significant. On the other hand, if all the tests accept the parametric Pareto fit, then the underlying distribution F_{t,τ,θ} follows a Pareto distribution G_{τ,θ}. The critical value, denoted by D, depends on the kernel choice and is determined by Monte-Carlo simulation using the CriticalValue function of the package.
In Table 1, we display the critical values obtained with CriticalValue for several kernel functions.

Table 1: Critical values associated with kernel functions. The Gaussian kernel with standard deviation 1/3 is approximated by the truncated Gaussian kernel. [Table entries not recovered; for the truncated Gaussian kernel the critical value is D = 3.4.]

The default values of the parameters in the algorithm yielded satisfying estimation results in a simulation study without being time-consuming, even for large data sets. The choice of these tuning parameters is explained in Durrieu et al. (2015).
The following commands compute the critical value D for the truncated Gaussian kernel with σ = 1 (default value) and display the empirical distribution function of the goodness-of-fit test statistic which determines the threshold τ. The parameters n and NMC define, respectively, the sample size and the number of Monte-Carlo simulated samples.
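The commands referred to above did not survive extraction; below is a hedged reconstruction. The argument names follow the CRAN documentation of CriticalValue, small values of n and NMC keep the run short, and the call is guarded in case extremefit is not installed.

```r
# Hedged sketch: compute the critical value D for the truncated Gaussian
# kernel with sigma = 1 (the default). Set plot = TRUE to display the
# empirical distribution function of the goodness-of-fit test statistic.
if (requireNamespace("extremefit", quietly = TRUE)) {
  library(extremefit)
  n   <- 50   # sample size (increase in practice)
  NMC <- 20   # number of Monte-Carlo samples (increase in practice)
  D <- CriticalValue(NMC, n, TruncGauss.kernel, prob = 0.99, plot = FALSE)
  print(D)
}
```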

Selection of the bandwidth h
We determine the bandwidth h by cross-validation from a geometric sequence of the form h_l = a q^l, l = 0, ..., M_h − 1, with q = (b/a)^{1/(M_h − 1)}, where a is the minimum bandwidth of the sequence, b is the maximum bandwidth of the sequence and M_h is the length of the sequence. The choice is performed globally on a grid of time points, where the number K of points on the grid is defined by the user. The choice K = n is possible but can be time-consuming for large samples. We recommend using a fraction of n.
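The bandwidth grid can be built in one line of base R; a minimal sketch, where a, b and M_h are user choices:

```r
# Geometric bandwidth grid h_l = a * q^l, l = 0, ..., Mh - 1, with
# q = (b/a)^(1/(Mh - 1)), so that the grid runs from a to b.
a  <- 0.05   # minimum bandwidth
b  <- 2      # maximum bandwidth
Mh <- 20     # length of the sequence
q <- (b / a)^(1 / (Mh - 1))
h.seq <- a * q^(0:(Mh - 1))
range(h.seq)  # first element is a, last is b
```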

Package presentation on simulated data
In this section, we demonstrate the extremefit package by applying it to two simulated data sets.
The following code displays the computation of the survival probabilities and quantiles using the adaptive choice of the threshold provided by the hill.adapt function.
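The code referred to above was lost in extraction; below is a hedged reconstruction. The hill.adapt and predict interfaces follow the package documentation, the heavy-tailed sample is ours, and the call is guarded in case extremefit is not installed.

```r
# Fit the adaptive Pareto tail to an i.i.d. heavy-tailed sample and query
# the fit with the predict method for 'hill.adapt' objects.
if (requireNamespace("extremefit", quietly = TRUE)) {
  library(extremefit)
  set.seed(5)
  X <- abs(rcauchy(200))                 # heavy-tailed sample
  HH <- hill.adapt(X)                    # adaptive choice of the threshold tau
  predict(HH, newdata = c(3, 5), type = "survival")        # P(X > 3), P(X > 5)
  predict(HH, newdata = c(0.99, 0.999), type = "quantile") # extreme quantiles
}
```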

Wind data
Wind is mainly studied as an alternative to fossil energy. Extreme value analyses of wind speed have been carried out to model wind storm losses or to detect areas that can be subject to hurricanes (see Rootzén and Tajvidi 1997; Simiu and Heckert 1996).
The wind data in the package extremefit comes from the Airport of Brest (France) and represents the average wind speed per day from 1976 to 2005. The data set is included in the package extremefit and can be loaded by the following code.
R> data("dataWind", package = "extremefit")
R> attach(dataWind)

The commands below illustrate the function hill.adapt on the wind data set and compute a monthly estimation of the survival probabilities 1 − F_{t,h,τ}(x) for a given x = 100 km/h with the predict method for 'hill.adapt' objects.
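The monthly estimation described above can be sketched as follows. This is a hedged reconstruction: we assume dataWind provides Month and Speed columns with the speed already in km/h (see help("dataWind", package = "extremefit")), and the $p component of the predict output follows the package examples.

```r
# Monthly estimate of the survival probability P(X > 100 km/h): fit
# hill.adapt to each month's speeds and query the fit at x = 100.
if (requireNamespace("extremefit", quietly = TRUE)) {
  library(extremefit)
  data("dataWind", package = "extremefit")
  if (all(c("Month", "Speed") %in% names(dataWind))) {  # guard the assumed layout
    p.monthly <- sapply(1:12, function(m) {
      H <- hill.adapt(dataWind$Speed[dataWind$Month == m])
      predict(H, newdata = 100, type = "survival")$p
    })
    round(p.monthly, 5)
  }
}
```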

Sea shores water quality
The study of pollution in the aquatic environment is an important problem. Human activities tend to pollute the environment, and water quality monitoring is necessary. The bivalves' activity is investigated by modeling the valve movements using high-frequency valvometry. The electronic equipment is described in Tran, Ciret, Ciutat, Durrieu, and Massabuau (2003) and was modified by Chambon et al. (2007). More information can be found at http://molluscan-eye.epoc.u-bordeaux1.fr/. High-frequency data (10 Hz) are produced by noninvasive valvometric techniques, and the study of the bivalves' behavior in their natural habitat has led to several statistical models (Sow, Durrieu, and Briollais 2011; Schmitt et al. 2011; Jou and Liao 2006; Coudret, Durrieu, and Saracco 2015; Azaïs, Coudret, and Durrieu 2014; Durrieu et al. 2015, 2016). In the presence of a pollutant, the activity of the bivalves is modified, so they can be used as bioindicators to detect perturbations in aquatic systems (pollution, global warming). Groups of oysters Crassostrea gigas of the same age are installed at several sites around the world; we concentrate on the Locmariaquer site (GPS coordinates 47°34′N, 2°56′W) in France, where the oysters are placed in a traditional oyster bag. In the package extremefit, we provide a sample of the measurements for one oyster over one day. The data can be accessed by

R> data("dataOyster", package = "extremefit")

The description of the data can be found with the R command help("dataOyster", package = "extremefit"). The following code extracts the velocities and the time covariate and displays the data.
R> Velocity <- dataOyster$data[, 3]
R> time <- dataOyster$data[, 1]
R> plot(time, Velocity, type = "l", xlab = "time (hour)",
+    ylab = "Velocity (mm/s)")

We observe in Figure 7 that the velocity is equal to zero during two periods of time. To facilitate the study of these data, we have included a time grid from which the intervals with null velocities are removed. The grid of time can be accessed by dataOyster$Tgrid. We shift the data to be positive.

R> new.Tgrid <- dataOyster$Tgrid
R> X <- Velocity + (-min(Velocity))
The bandwidth parameter is selected by the cross-validation method (h_cv = 0.3) using bandwidth.CV, but we set it manually in the following command because of the long computation time.
R> hcv <- 0.3
R> TS.Oyster <- hill.ts(X, t = time, new.Tgrid, h = hcv,
+    TruncGauss.kernel, CritVal = 3.4)

The estimates of the extreme quantile of order 0.999 and of the probabilities of rare events are computed as described in Section 3.2. The critical value of the sequential test is D = 3.4 for the truncated Gaussian kernel; see Table 1. A global study of a set of 16 oysters over a 6-month period is given in Durrieu et al. (2015).

Electric consumption
Electric consumption is an important challenge due to population growth and increasing needs in many countries, and it bears on state procurement policies. Statistical models can therefore help to understand electric consumption. Multiple models have been used to forecast electric consumption, such as regression and time series models (Bianco, Manca, and Nardini 2009; Ranjan and Jain 1999; Harris and Liu 1993; Bercu and Proïa 2013). Durand, Bozzi, Celeux, and Derquenne (2004) used a hidden Markov model to forecast electric consumption.
A research project conducted in France (Lorient, GPS coordinates 47°45′N, 3°22′W) concerns the measurement of electric consumption using Linky, a smart communicating electric meter (http://www.enedis.fr/linky-communicating-meter). Installed in end-consumers' dwellings and linked to a supervision center, this meter is in constant interaction with the network. The Linky electric meter allows a measurement of the electric consumption every 10 minutes.
To prevent major power outages, the SOLENN project (http://www.smartgrid-solenn.fr/en/) is testing an alternative to load shedding. Data on electric consumption are collected on selected dwellings to study the effect of a decrease of the maximal power limit. For example, a dwelling with a maximal electric power contract of 9 kilovolt-amperes (kVA) is decreased to 6 kVA. This experiment makes it possible to study the consumption of a dwelling under an electric constraint related to the needs of the network. For instance, after an incident such as a power outage on the electric network, the objective is to limit the number of dwellings without electricity. If, during the time period where the electric constraint is applied, the electric consumption of the dwelling exceeds the restricted maximal power, the breaker cuts off and the dwelling has no more electricity. The consumer can, at that time, close the circuit breaker and get the electricity back. In any case, at the end of the electric constraint, the network manager can close the breaker using the Linky electric meter, which is connected to the network. The control of the cutoff breakers is crucial to prevent customer dissatisfaction and to detect which dwellings are at risk.
The extreme value modeling approach described in Section 3 was carried out on the electric consumption data of one dwelling from December 24, 2015 to June 29, 2016. These data are available in the extremefit package, and Figure 9 displays the electric consumption of the dwelling, which has a maximal power contract of 9 kVA.

R> data("LoadCurve", package = "extremefit")
R> Date <- as.POSIXct(LoadCurve$data$Time * 86400, origin = "1970-01-01")
R> plot(Date, LoadCurve$data$Value / 1000, type = "l",
+    ylab = "Electric consumption (kVA)", xlab = "Date")

We consider the following grid of time Tgrid:

R> Tgrid <- seq(min(LoadCurve$data$Time), max(LoadCurve$data$Time),
+    length = 400)

We observe in Figure 9 missing values in April 2016 due to a technical problem. We modify the grid of time by removing the intervals of Tgrid with no data.

R> new.Tgrid <- LoadCurve$Tgrid
We choose the truncated Gaussian kernel, for which the associated critical value of the goodness-of-fit test is D = 3.4 (see Table 1). The bandwidth parameter is selected by the cross-validation method (h_cv = 3.78) using bandwidth.CV, but we set it manually in the following command because of the long computation time.
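The command referred to above was lost; a hedged reconstruction mirroring the oyster example (same hill.ts interface, truncated Gaussian kernel and critical value D = 3.4), wrapped so that the lengthy computation only runs interactively:

```r
# Fit the time-dependent Pareto tail to the load curve on the cleaned
# grid of time; this computation takes a while on the full data set.
if (interactive() && requireNamespace("extremefit", quietly = TRUE)) {
  library(extremefit)
  data("LoadCurve", package = "extremefit")
  new.Tgrid <- LoadCurve$Tgrid
  hcv <- 3.78
  TS.LoadCurve <- hill.ts(LoadCurve$data$Value, t = LoadCurve$data$Time,
                          new.Tgrid, h = hcv,
                          TruncGauss.kernel, CritVal = 3.4)
}
```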
Using a prediction method coupled with the method implemented in the package extremefit, it would be possible to detect a high probability of a breaker cutting off and to react accordingly.

Conclusion
This paper focuses on the functions contained in the package extremefit for estimating extreme quantiles and probabilities of rare events. The package extremefit also works well on large data sets, and its performance was illustrated on simulated data and on real-world data sets.
The pointwise data driven choice of the threshold allows a flexible estimation of the extreme quantiles. Wider adoption of this method by the scientific community will improve the estimation of extreme quantiles and probabilities of rare events using the peak-over-threshold approach.