NHPoisson : An R Package for Fitting and Validating Nonhomogeneous Poisson Processes

NHPoisson is an R package for the modeling of nonhomogeneous Poisson processes in one dimension. It includes functions for data preparation, maximum likelihood estimation, covariate selection and inference based on asymptotic distributions and simulation methods. It also provides specific methods for the estimation of Poisson processes resulting from a peak over threshold approach. In addition, the package supports a wide range of model validation tools and functions for generating nonhomogenous Poisson process trajectories. This paper is a description of the package and aims to help those interested in modeling data using nonhomogeneous Poisson processes.


Introduction
Despite the large number of potential applications of Poisson processes, the scientific community in many fields has not yet taken full advantage of Poisson process modeling, particularly in a nonstationary framework.Most of the published applications are restricted to the environmental field, see the review of extreme event studies on hydrology by Katz, Parlange, and Naveau (2002).Ogata (1988) used Poisson processes to model earthquakes, and Madsen, Rasmussen, and Rosbjerg (1997) and Abaurrea and Cebrián (2002) modeled hydrological and meteorological drought occurrences, respectively.The statistical techniques for modeling and fitting this type of processes are well-known, see Embrechts, Klüppelberg, and Mikosch (1997), Coles (2001), or Kutoyants (1998) who discuss the estimation theory for nonhomogeneous Poisson processes.However, the implementation of these techniques is a time-consuming and non trivial task and the tools for validating nonhomogeneous Poisson processes are not as well developed as for other statistical models.Accordingly, an easy-to-use software, which provides simple and fast algorithms that facilitate the modeling and validation of nonhomogeneous Poisson processes, can be very useful.
NHPoisson (Cebrian 2015) is an R (R Core Team 2015) package which provides an assembly of tools for all the steps involved in data modeling using this type of processes.In particular, it supplies functions to model the Poisson intensity as a function of covariates, including the estimation of Poisson processes resulting from a peak over threshold approach.It also contains a wide toolkit for model selection and validation analysis, including several types of residuals and diagnostic techniques adapted from other statistical models.NHPoisson is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=NHPoisson.
The outline of the paper is as follows.Section 2 introduces Poisson processes and the peak over threshold approach.Some R packages related to this topic are reviewed in Section 3, and Section 4 describes the main functions and capabilities of the package.Section 5 shows an example of data analysis using NHPoisson: the modeling of the occurrence of extreme heat events in a daily temperature series.Finally, some conclusions and future work are discussed in Section 6.

Poisson processes and peak over threshold models
A Poisson process (PP in short) is a point process, i.e., a random collection of points in a space where each point represents the occurrence of an event.Time processes are the most common, but PPs can also model events in space or in space-time.The PP assumes that points occur randomly at a given intensity λ (> 0), which characterizes the frequency these events are expected to occur with.The model allows us to calculate values such as the probability of observing a certain number of events in a time period, or the expected time between two events.Examples of events that can be modeled as a PP include the occurrence of extreme environmental events such as fires, earthquakes or heat waves, the times of customer arrivals at queues, the occurrence times of claims in a given period, etc. Herein, we will restrict our attention to the modeling of one-dimensional PPs, in particular, to time PPs.
Statistical modeling based on a PP requires the estimation of the intensity of the process.A realization of a time PP is a sequence of points (t i ) n i=1 observed in a region of R + .The information provided is that events occur at those points and nowhere else.In most of the real problems, the intensity is time-varying and a nonhomogeneous Poisson process (NHPP) must be considered.The behavior of a time NHPP is generally modeled by representing the intensity λ(t) as a function of covariates, that can include time trends, seasonal terms and/or external factors.In order to guarantee that λ is positive, a logarithm link is usually used, log(λ(t; β)) = X (t)β where X (t) is the row vector of covariates at time t and β the vector of parameters.
PPs often appear in the framework of peak over threshold (POT) models.POT models consider the occurrence times of peaks in a sequence of random variables, where a peak is the maximum value in a run of consecutive excesses (observations over a threshold).A well-known result of extreme value theory states that, in an i.i.d.series, the occurrences of excesses over a sufficiently extreme threshold u follow a PP, and the excesses have a generalized Pareto distribution.More precisely, given a series of i.i.d.random variables Y 1 , Y 2 , . .., the process on regions of the form (0, 1] × [u, ∞) is approximately a PP for sufficiently high u.
Although these results require i.i.d.series (Y i ), most of them are also applicable, under quite mild conditions, to processes with short-term dependence (Coles 2001).

Point processes and POT in other R packages
There are several R packages dealing with point processes.Package ppstat (Hansen 2013) focuses on the analysis of marked point processes with discrete marks.It fits generalized linear point process models, including the Hawkes and the Gibbs families, where the intensity is formulated in terms of a linear combination of linear filters.No validation tools are implemented.
The extensive package spatstat (Baddeley and Turner 2005;Baddeley, Turner, Mateu, and Bevan 2013) is a package for analyzing spatial point patterns.It supports the model fitting of spatial point processes, in particular Poisson processes, and related inference and diagnostic tools.The function ppm fits a spatial point process to an observed point pattern and allows the inclusion of covariates.Two estimation methods, pseudo-maximum likelihood and Huang-Ogata approximate maximum likelihood, are implemented.The available functions and data types (point patterns, spatial windows, pixel images, etc.) are intended for spatial point processes.Some of its functions could be applied to one-dimensional data, but the analysis of time point processes has their own peculiarities, and the spatial framework, in particular the specification of the points of the process, is unnecessarily complicated for the one-dimensional case.The package yuima (Brouste et al. 2014) develops tools for inference and simulation of stochastic differential equations to model random evolution along continuous, or practically continuous, time.A wide variety of time processes, including compound Poisson processes, can be fitted using a quasi-likelihood approach.Different tools for model selection are available in the package, but the inclusion of covariates is not implemented.
There are also some R packages devoted to POT and extreme analysis, such as evir (Pfaff, McNeil, and Stephenson 2012), POT (Ribatet 2012), and ismev (Heffernan and Stephenson 2014) and its graphical user interface extRemes (Gilleland and Katz 2011).The functions fitpp and pot, from POT and evir respectively, fit the exceedances above a threshold, in the homogeneous case, using the point process characterization as described in Coles (2001).This approach requires both the occurrence times and the exceedance values.The function pp.fit from ismev fits the point process model with a generalized linear modeling of each parameter, by maximum likelihood.There are a wide set of tools in these packages for selecting an adequate threshold to apply the POT model, such as the functions meplot (in evir), mrl.plot (in ismev) or tcplot and mrlplot (in POT).Since these functions cover the analysis of threshold selection, no function of this type is included in NHPoisson.
As far as we know, no commercial statistical software, such as SPSS (IBM Corporation 2013) or Minitab (Minitab Inc. 2010), supports the fit of NHPPs.

Features and capabilities
NHPoisson aims to provide a global framework for the modeling of data using NHPPs in one dimension.One of the new functionalities is the maximum likelihood estimation of PP models with the intensity modeled as a function of time-dependent covariates.In particular, it includes a specific likelihood function to estimate the PP resulting from the POT approach, with the peculiarity that it uses information only on the point occurrences, and not on the exceedance values.This likelihood function can also handle missing values.Another capability is the estimation of models with fixed β parameters.The package provides automatic tools for model selection by AIC and likelihood ratio tests, and a comprehensive toolkit for residual analysis and diagnostics to check the fitted model.Random generation of NHPPs with a given time-varying intensity and inference methods based on asymptotic distributions and simulation are also implemented.

Data and exploratory analysis
The usual input to fit a time PP is the vector of points containing the occurrence times of the events (t i ) n i=1 .Since the PP is often used in the framework of POT models, the function fitPP.funalso provides the possibility of using as input the original series (y i ) T i=1 and the threshold which defines the extreme events.The function POTevents.funcalculates some characteristics of these extreme events: the first and the maximum excess positions, the mean excess, the maximum excess and the length of the events.The maximum excess position defines the occurrence point in the point process.Missing values in the original series are not allowed in this function, since it is difficult to adopt a general rule about how to handle them, specially in the case of missing information between two exceedances.
The functions emplambda.funand emplambdaD.funcalculate the empirical occurrence rates on overlapping and disjoint intervals, respectively.The occurrence rate is calculated as the number of points in the considered interval divided by its length, and each rate is assigned to the mean point of the interval.Overlapping intervals are defined by a constant length (argument lint), while disjoint intervals can be specified by the number of intervals (argument nint) or by a constant length lint.All the intervals have the same length except the last one, which will be shorter if the length of the series is not a multiple of lint or nint.In both cases, a plot of the empirical rate over time can be optionally performed.

Fitting the model
The function fitPP.funfits a NHPP by conditional maximum likelihood, given the covariates.It is worth to recall that efficient estimation based only on the conditional model, ignoring the information of the marginal process, can only be carried out if weak exogeneity holds (Ericsson and Irons 1994).The conditional likelihood function is, where λ(t; β) = exp(X (t)β) and A is the space where the point process is defined.Assuming that λ(t; β) is constant in each time unit, the expression of the loglikelihood is where T is the length of the observation period.
Some adjustments are necessary to define the likelihood function of a NHPP based on the POT approach.Only one occurrence point can be assigned to each event, and that is the point of maximum intensity.The other observations in the extreme event cannot be considered as non occurrence points, since their covariates would show their extreme character.To avoid this problem, these observations are eliminated from the estimation sample.The result is the likelihood we would have obtained, if only the point of maximum intensity in each event had been observed.This is done using an index variable inddat, equal to one in the observations to be used and zero otherwise.
This index also allows us to handle missing values by assigning them a value zero.Consequently, all the functions with the argument inddat, such as fitPP.fun,emplambda.funor emplambdaD.fun,can deal with missing values.The function fitPP.funcan also fit homogeneous Poisson processes (HPP) by defining the intensity with an intercept and no covariate.
The maximum likelihood (ML) estimation can be done using two different optimization routines, optim or nlminb, selected in the argument minfun.This is useful since, in some cases, one routine can succeed to converge when the other fails.fitPP.funalso allows us to fit a model keeping fixed some β parameters (offset terms) using the fixed argument.This can be used to specify an a priori known component to be included in the linear predictor.
The output of fitPP.fun is an object of the S4 class 'mlePP', which encapsulates the most relevant information about the maximum likelihood estimation of a NHPP.Some of the slots of this class are: the maximum value of the loglikelihood function, the ML estimators of the β parameters, their covariance matrix, the fitted values λ(t), and the lower and upper limits of the confidence intervals for each λ(t), whose calculation is described below.The estimation of the β covariance matrix is based on the asymptotic distribution of ML estimators and calculated as the inverse of the negative of the Hessian, where the second derivative of the considered loglikelihood is The class 'mlePP' is a subclass of the S4 class 'mle' defined in stats4, and most of the generic functions with 'mle' methods, such as logLik, extractAIC or summary, can be applied to the output of fitPP.fun.Some other generic functions related to fitted models, such as AIC or BIC, can also be applied to 'mlePP' objects.There is an 'mle' method for vcov, which calculates the approximate variance-covariance matrix of β.However, the solve function used to obtain the inverse of the negative of the Hessian matrix may not work, specially when its dimension is high.For this reason, we have implemented the function VARbeta.fun, which works in a wider range of situations, since when solve gives an error, the calculation of the inverse via its Cholesky decomposition is tried.
The package also provides a method for 'mlePP' objects for the generic function profile, very similar to the 'mle' method for this function in stats4.As in the 'mle' method, the output is a 'profile.mle'object, and the 'mle' method for confint can be used to obtain confidence intervals based on profile likelihoods of the β parameters.Additionally, confidence intervals based on the asymptotic normal approximation of the ML estimators of the β parameters are calculated by the function confintAsin.fun.

Confidence intervals for the intensity
Confidence intervals for λ(t) = exp(ν(t)) can be obtained by transforming the confidence intervals for ν(t) = X (t)β.Another option is based on the delta method, but this approach uses an approximation and does not always guarantee the confidence level of the interval.Both alternatives can be calculated by fitPP.fun,which calls either the function CItran.fun or CIdelta.fundepending on the argument CIty = "Transf" or CIty = "Delta".In general, the transformation approach (the default) should be preferred.
The function fitPP.funoptionally performs, with the argument dplot = TRUE, a plot of the fitted intensities together with the confidence intervals for each λ(t).

Model selection
There are different tools for aiding in model selection.When the number of covariates is high, automatic procedures such as a stepwise selection can be useful.The function stepAICmle.funperforms a stepwise model selection by AIC for 'mlePP' objects.Three directions, "forward", "backward" and "both", are implemented.The initial model is specified by the argument ImlePP, and the algorithm stops when, according to the AIC, none of the covariates eliminated from the model or added from the potential covariates set (argument covariatesAdd) improves the model fitted in the previous step.The definition of AIC uses constant k = 2, but a different value k can be passed as an additional argument.
Authors such as Cook and Weisberg (1999) warn that automatic model selection methods must be used with caution since they can reject reasonable submodels from further consideration.For a more thorough selection, it can be useful to perform the likelihood ratio test by the function testlik.fun,which compares a general model ModG, with a reduced model ModR (a particular case of the first).
To make the selection procedure easier, the function LRTpv.funcalculates, for each covariate in a model, the p value of a likelihood ratio test comparing the fitted NHPP with the model excluding that covariate from the linear predictor.The functions dropAIC.funand addAIC.fun(both used in stepAICmle.fun)perform a similar task but using the AIC value.The first calculates the AIC for all one-covariate deletions from the current model, while the second calculates it for all one-additions from the covariates supplied in covariatesAdd.In both cases, the best covariate to be dropped from (added to) the model is returned together with some additional information.All these automatic selection functions are not implemented for models with fixed parameters.

Generation of NHPP trajectories and simulation based inference
The generation of trajectories from a NHPP is a useful tool, particularly for simulation based inference.There exist quite a few methods to generate the points from a NHPP.One of the most popular is the thinning approach by Lewis and Shedler (1979), an exact method which does not require the discretization of the intensity function.However, an inversion approach ( Çinlar 1975) has been preferred here, since it is faster when the inversion can be easily computed, as in this case.The inversion approach is based on the fact that a NHPP can be transformed into a unit-rate HPP, using the following time scale transformation (Daley and Vere-Jones 2003), The first step is to generate the points t H i in a unit rate HPP via the generation of independent exponential distances.Then, the points t N H i in a NHPP with intensity λ(t) are the result from transforming the points t H i with the inverse of the function in Equation 2. Since it is assumed that λ(t) is constant in [t, t + 1), the points t N H i are calculated as the lowest integer such that i , where r runs in the non-negative integer numbers.This method is supported by the function simNHP.fun,which generates the points in a period (0, T ).The intensity in each time of this period must be provided by the argument lambda, and the value T is determined by the length of this intensity vector.
In order to allow the exact reproduction of the results, a seed can be set in the generation process with the argument fixed.seed.If the argument is an integer, it gives the seed and if it is NULL (the default option), a random seed will be used.

Simulation based inference
Inference on the β parameters and on λ(t) intensities can be carried out using likelihood ratio tests, as indicated in Section 4.2.However, there are other values whose inference is difficult to be performed using exact or approximate distributions.In those cases, simulation based inference is a useful too to calculate, for example, envelopes for the number of events to occur in a given time interval, or for the time of the k-th occurrence.The upper and lower limits of these envelopes are the m lowest and m highest values of a simulated sample.These envelopes are not real confidence intervals for the value of the function, but significance bands that specify the critical points for a Monte Carlo test with a significance level α = 2m/(1 + niter ), where niter is the number of simulated trajectories (Ripley 1981).
The function GenEnv.funcalculates simulated envelopes for the value of the function specified in the argument fun.name, which must be a function of the NHPP points and any additional argument.It calls the function funSim.fun(not intended for the users), which generates a simulated sample and evaluates the function in that sample.As in simNHP.fun, the argument fixed.seedallows us to set the seed in the simulation process.

Model validation
NHPoisson includes the function globalval.funwhich carries out a complete standard analysis to check the fitted NHPP.However, users can customize their own validation analysis using the available individual functions described in this section.
Two types of residuals can be calculated: uniform (or exponential) and raw residuals.Both of them are useful and provide complementary information.

Exponential or uniform residuals
A usual approach to validate a NHPP with intensity λ(t) is to transform it into a unit-rate HPP, using the time transformation in Equation 2.Then, the validation analysis consists in analyzing the inter-event distances d * i = t * i − t * i−1 that, under the null hypothesis, must be an i.i.d.exponential sample.Equivalently, the validation can be based on the transformed distances exp(−d * i ) that must be an i.i.d.uniform sample.These residuals may be plotted against covariates, or potential covariates, as an informal assessment of the need of including them.For more details on this topic, see the diagnostic analysis for Cox-Snell type residuals in Collett (1994).
Hence, the validation analysis starts by applying the function transfH.fun to transform the NHPP into a HPP.Then, the exponential and uniform residuals of the homogeneous process are calculated using unifres.fun.The function graphresU.funchecks the uniform behavior and the serial correlation in the uniform residuals.To analyze the correlation, the function calculates the Pearson correlation coefficient, the Ljung-Box p values and a lagged serial correlation plot.To analyze the uniform behavior, a Kolmogorov-Smirnov test and a uniform qqplot with a 95% confidence envelope are performed.In addition, an index plot of the residuals and residual plots against the variables specified in the argument Xvariables are drawn.

Raw residuals
A complementary approach to validate a NHPP is the analysis of the raw residuals, see Lewis (1972) and Baddeley, Turner, Møller, and Hazelton (2005) for this type of residuals in time and space-time processes.The error (or innovation) process is ε(t) = N (t) − t 0 λ(u)du which, when the model is true, must be a zero-mean martingale.The corresponding raw residual process is with I t i an indicator variable, equal to 1 at time t i .The residuals are defined by the increments of the raw process in intervals (l 1 , l 2 ), which can be disjoint or overlapping, divided by its length, (3) These residuals can be interpreted as observed minus fitted value residuals and they are useful to check the fitted PP intensity since, under the correct model, r(l 1 , l 2 ) ≈ 0.
The functions CalcRes.fun and CalcResD.funcalculate the raw residuals using overlapping and disjoint intervals (l 1 , l 2 ) respectively.The residuals ro(l 1 , l 2 ) are based on overlapping intervals of a given length, centered on each time t.It is noteworthy that, due to their overlapping structure, they will not be independent even under the model.The residuals rd(l 1 , l 2 ) are calculated on disjoint intervals of equal length, and assigned to the mean point of the interval.In both functions, the empirical and the mean fitted intensities used to calculate the raw residuals, t i ∈(l 1 ,l 2 ) I t i /(l 2 − l 1 ) and l 2 l 1 λ(u)du/(l 2 − l 1 ), are elements of the output list.Scaled versions of the residuals, can be calculated using a non negative weight function h(u).By default, the Pearson residuals are calculated, h(u) = 1/ λ(u), but any other weight function can be provided through the argument h.The function graphrate.funplots, for comparative purposes, the empirical and the mean fitted intensities against time in the same graph.
A usual diagnostic tool for analyzing the covariate effects is a residual plot.In the case of raw residuals, which are not linked to a time instant but to an interval, two different plots can be used depending on the characteristics of the covariate: the time residual plot or the lurking variable plot.A qqplot to check the residual distribution is also implemented.

Time residual plots
The function graphres.funplots any type of residuals (raw or scaled, overlapping or disjoint) against time or a specific type of variables.In these plots, the residuals are plotted against the value of the variable in the mean point of the interval and, consequently, they are adequate only for variables which are lineal or, at least, monotonous functions of time.A lowess smoother of the residuals can be optionally plotted to make the interpretation easier.In the case of overlapping intervals, the residuals of the occurrence points are marked with different symbols.
Approximate confidence envelopes can be optionally plotted in these graphs.These envelopes are based on the approach suggested by Baddeley et al. (2005) for spatial inhomogeneous Poisson processes.Their interpretation, including the pointwise significance, is similar to the limits ±2 used for the standardized residuals in linear regression analysis.The envelopes are based on the variance of the error process ε(l 1 , l 2 ) = ε(l 2 ) − ε(l 1 ).The variance of the raw error processes for NHPPs is and for the scaled Pearson error process These variances can be estimated by plugging into the expressions the fitted intensity λ(u).
It is noteworthy that the variance of the error V [ε(l 1 , l 2 )] will overestimate the variance of the residual process and, consequently, ±2 V [ε(l 1 , l 2 )] will be wider than a confidence interval based on the residual process variance.For the raw residuals defined in (3), the envelopes are, where i runs over the integers in (l 1 , l 2 ).Analogously, for the Pearson residuals in Equation 4, Lurking variable plots A possible way to analyze the relationship between the residuals linked to intervals and a covariate which is not a monotonous function of time is to plot them against the mean value of the variable in that interval.A more complicated but better approach is to calculate the residuals on intervals defined by the values of the covariate instead of by time.The following approach is a modification of the lurking variable plot by Baddeley et al. (2005).The intervals considered to calculate the residuals for a covariate X(t) are where P X,i is the sample i-percentile of X.The function graphResX.funplots the residuals against the mean value of the covariate in that interval.In the case of raw or Pearson residuals, it also calculates approximate confidence envelopes as the ones described for time residual plots.The function graphResCov.funperforms the lurking variable plot for each variable in the argument Xvar.
Residual qqplot A common plot to check the distributional assumptions in a model is the qqplot.The theoretical distribution of the NHPP raw residuals is unknown but a simulation based qqplot can be calculated.The function resQQplot.funcompares the empirical quantiles of the residuals with the expected quantiles under the fitted model, obtained from the following simulation process.
First, the model is fitted to the data set and the residuals r i are calculated.The order statistics r (1) ≤ r (2) ≤ . . .≤ r (n) are obtained from the residual sample.Then, niter independent trajectories of a NHPP( λ(t)) are generated and a NHPP is fitted to each one.The corresponding fitted intensities and the ordered residuals from the i-th trajectory are denoted λ(i) (t) and r , respectively.The expected j-th quantile of the residuals under the fitted model is calculated as the sample mean of the j-th order statistic Individual (1 − α)% envelopes for each residual are defined by the corresponding α/2 and 1 − α/2 sample quantiles.
The function resQQplot.funcalculates for each residual the simulated expected quantile and plots them together with the 1 − α individual envelopes.Disjoint or overlapping and Pearson or raw residuals are supported by this function, but the overlapping option can require a high time computing cost.The output list of the function includes the expected quantiles and the upper and lower limits of the envelopes of the residuals.The argument fixed.seedallows us to set the seed to obtain reproducible results.

A global validation function
A standard validation analysis to check a NHPP model can be performed in one step using the function globalval.fun.First, the uniform residuals are calculated and analyzed.The residuals are plotted against the covariates in Xvar and, if this argument is NULL, all the covariates in the fitted model are considered.The empirical and the cumulative mean fitted intensities (using optionally overlapping or disjoint intervals) are plotted together.Using these intensities, the raw or scaled residuals are calculated, and residual plots against time and the variables in Xvart are carried out.The lurking variable plots for the variables in Xvar are also performed; as before, if this argument is NULL, all the covariates in the fitted model are considered.Finally, a residual qqplot to check the distributional assumptions can be optionally calculated.By default, the qqplot is carried out for the Pearson residuals in disjoint intervals.
The argument histWgraph = TRUE (the default option) opens a new graphical device (using dev.new(record = TRUE), which may not work on all systems), so that the history of all the plots carried out by the function is recorded in the same device.This means that the resulting plots can be accessed by scrolling up and down.The argument plotDisp provides the layout to display the residual plots, 2 × 2 by default.These both arguments are available in all the functions which carry out several residual plots: graphres.fun,graphresU.fun,graphresCov.fun,etc.

An example
In this section, we illustrate how the package NHPoisson can be used to model a data set, from the data preparation for a POT approach to the validation analysis.Data are daily maximum temperature series, and our aim is to model the occurrence of its extreme heat events using a NHPP.

Data
The data set BarTxTn is available in the package NHPoisson.The variables Tx and Tn are the daily maximum and minimum temperature series, in tenths of • C, corresponding to the summer months May, June, July, August and September (MJJAS) from 1951 to 2004 in Barcelona (Spain).The date (day, month and year) of the observations, and some variables representing the short and long-term temperature evolution are also available.The daily temperature series were provided by the Spanish Meteorological Office (AEMET).
The variables are recorded at a discrete time scale, as in most real data applications.However, given that the time unit is short compared with the length of the recorded period and the occurrence rate very low, the use of a continuous time model, such as the PP, is justified.

Data preparation
To define an extreme heat event (EHE) we use the POT approach.The extreme threshold is 31.8• C, which is the 95th percentile of Tx during a reference period (months JJA in [1971(months JJA in [ , 2000]]).The selection of this threshold (not shown here) is based on climate basis and on statistical criteria about the Poisson behavior of the resulting peak series (using for example the functions diplot, mrlplot and tcplot of the POT package).The EHEs and their characteristics are obtained by, R> data("BarTxTn", package = "NHPoisson") R> dateB <-cbind(BarTxTn$ano, BarTxTn$mes, BarTxTn$diames) R> BarEv <-POTevents.fun(T= BarTxTn$Tx, thres = 318, date = dateB)

Number of events: 137 Number of excesses over threshold 318 : 253
where the matrix dateB contains the date of each observation, and the object BarEv several characteristics of the 137 EHEs, such as Px, the points of maximum intensity in the extreme events, which are the occurrence points of the process.Another element of the object is the index inddat, which marks the observations to be used in the estimation process, To describe the time evolution of the EHE occurrence, a plot of its empirical rate calculated in 153-day long overlapping periods is performed using the following commands (153 is the length of the period MJJAS), R> tB <-BarTxTn$ano + rep(c(0:152) / 153,55) R> emplambdaB <-emplambda.fun(posE= Px, inddat = inddat, t = tB, + lint = 153, tit = "Barcelona") where tB is an index time of the summer months linked as consecutive time periods.The argument inddat must be specified only when there are missing values, or when the point process is defined using the POT approach and the length of the resulting events can be greater than one.By default, all the inddat values are equal to 1 so that all the available observations are included in the estimation process.

Fitting the model
Due to the characteristics of the EHE occurrence (seasonal behavior and a nonmonotonic trend), it must be modeled by a PP with a nonhomogeneous intensity (Abaurrea, Asín, Cebrián, and Centelles 2007).This intensity is modeled as a deterministic function of covariates such as long and short-term temperature, and seasonal terms (the part corresponding to the summer months of the harmonic functions describing the annual cycle).Temperatureharmonic interaction terms are also considered as possible covariates.
As an exploratory step to check which covariates are more influential, we carry out an automatic stepwise selection by AIC in both directions.First, the initial model which only includes the intercept is fitted, and the first order harmonic terms and Txm31 , Tnm31 , TTx and TTn temperature covariates (all available from BarT xT n) are considered as potential covariates.In order to make the interpretation of the output easier, names are assigned to each covariate, using dimnames.It is recalled that with the arguments modSim = TRUE, dplot = FALSE and modCI = FALSE, the information on the screen is eliminated, the fitted intensity is not plotted and the confidence intervals are not calculated.
R> aux <-LRTpv.fun(modB.final)0.05893695 0.0728897 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 residuals(t−1) residuals (t) Pearson coef.: 0.02.P−value: 0.833 0 5 10 15 20 −0.2 0.2 0.4 0.6 0.8 1.0 lag ACF q q q q q q q q q q q q q q q q q q q q q 5 10 15 20 0.0 0.2 0.4 0.6 0. unif quantiles unires q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q KS (uniform) p−value: 0.359 Model: BARCELONA; cos, sin, Txm31

Model validation
The validation starts by the analysis of the uniform residuals using the commands, R> posEHB <-transfH.fun(modB.final)$posEHR> resB <-unifres.fun(posEHB)R> graphresU.fun(unires= resB$unires, posE = modB.final@posE,+ Xvariables = cbind(covB.final,BarTxTn$dia), + namXv = c("cos", "sin", "Txm31", "summer day index"), + tit = "BARCELONA; cos, sin, Txm31", addlow = FALSE) The results of the autocorrelation analysis and the uniform behaviour are satisfactory, see Figure 2. The plots of the uniform residuals against the covariates included in the model and the index time corresponding to the summer cycle, not shown here, are also requested for  The output of the function CalcResD.fun is a list and one of its elements are the residuals, which can be used in other commands, for example to analyze their normal behavior

R> qqnorm(ResDB$RawRes)
Other elements in the output list are the empirical and the fitted intensities used to calculate the raw residuals.They can be graphically compared, see Figure 3, using the following command, R> graphrate.fun(ResDB,tit = "BARCELONA; cos, sin, Txm31") The lurking variable plots of nonscaled residuals against Txm31 (a covariate included in the model) and TTx (not included), and their confidence envelopes, are obtained from the following code snippet, R> covBtemp <-cbind(BarTxTn$Txm31, BarTxTn$TTx) R> aux <-graphResCov.fun(mlePP= modB.final,nint = 50, q q q q q q q q q q q q q q qq q qqqq q qq q qq q q q q q q q q q q q q q q q q q q q q q q q 180 200 220 240 260 280 300 −0.05 0.00 0.05

Txm31
Raw residuals q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q 240 245 250 + X = covBtemp, namX = c("Txm31", "TTx"), + tit = "BARCELONA; cos, sin, Txm31", typeRes = "Raw") The lurking plots in Figure 4 show that the influence of Txm31 is properly captured by the model and that the inclusion of TTx is not necessary.If we want to obtain plots displayed individually (but recorded on the same graphical device), the arguments plotDisp=c(1,1) and the default option histWgraph=TRUE should be used.
The time residual plot for the ResDB residuals and the residual qqplot are carried out using the following commands (in resQQplot.funthe seed is set to obtain reproducible results.),R> graphres.fun(objres= ResDB, typeRes = "Pearson", addlow = TRUE, + plotDisp = c(1, 1), tit = "BARCELONA; cos, sin, Txm31") R> aux <-resQQplot.fun(nsim= 100, objres = ResDB, covariates = covB.final,+ tit = "BARCELONA; cos, sin, Txm31", fixed.seed= 123) The plots do not show any evidence against the fitted model, see Figure 5.It is noteworthy to recall that depending on the number of points in the process, the number of simulations and the computer specifications, the computation time of the function resQQplot.funcan be quite high.In any case, at least 100 simulations are recommended.

Simulation based inference
Once the model is satisfactorily checked, inference can be performed.The following command q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Model: BARCELONA; cos, sin, Txm31 empirical quantile of the residuals q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Figure 5: Time residual plot and residual qqplot (with seed 123 and calculated using one core) for the Pearson disjoint residuals.where posk.fun(x,k) returns the element in the k-th row of the vector x.The estimated occurrence time of the first EHE in a summer with the considered intensity is the 24th of June and the prediction envelope moves from the 21st of May to the 25th of July.It is noteworthy that this function is estimating the occurrence time of the first EHE given that, at least, one EHE occurs in that period.This is due to the fact that simulated trajectories where no EHE occurs are not included in the calculations.Denoting the number of points in the specified time interval by N (., .), the probability that this occurs is P (N (8263, 8415) = 0) = 8415 8263 λ(t)dt = 0.023, practically negligible.This probability could also be estimated by simulation using the function simNHP.fun.The simplest, but not fastest, way is R> L <-NULL R> for (i in c(1:1000)) + L[i] <-length(simNHP.fun(lambda= + modB.final@lambdafit[8263:8415],fixed.seed= (123 + i))$posNH) R> P0 <-sum(L == 0) / 1000 R> P0 [1] 0.014

Conclusions and future work
NHPoisson is an R package designed to facilitate the modeling of data using one-dimensional NHPPs whose intensity is described as a function of covariates.This is an important feature since the use of nonhomogeneous processes allows us to include information on the state of the system, the geographical position of an event, or different environmental characteristics.The package provides an assembly of all the steps involved in the modeling process: data preparation, maximum likelihood estimation (including fixed parameters), covariate selection, inference, simulation, and a wide toolkit for validation analysis, including residuals and diagnostic techniques adapted from other statistical models.Since the output of the main function fitPP.fun is an S4 class object which extends the 'mle' class, many available methods for this type of objects and some generic functions related to fitted models can also be used.
This package does not include techniques on threshold selection for POT models since they are already available in other packages.However, an extension of the package could include the implementation of new techniques on this topic, such as the ones developed by Wadsworth and Tawn (2012).Some future additions for the package are functions for the statistical modeling based on multivariate nonhomogeneous Poisson processes.An example of this type of model is the common Poisson shock process, a multivariate process with dependent components.

Figure 1 :
Figure 1: Fitted intensity and transformed confidence intervals (Section 4.2) of the selected NHPP model.

Figure 2 :
Figure 2: Validation plots of the uniform residuals of the final NHPP model.

Figure 3 :
Figure 3: Empirical and fitted intensities calculated on disjoint summer intervals.

Figure 4 :
Figure 4: Lurking plots based on raw residuals for two temperature variables.