TSclust : An R Package for Time Series Clustering

Time series clustering is an active research area with applications in a wide range of fields. One key component in cluster analysis is determining a proper dissimilarity measure between two data objects, and many criteria have been proposed in the literature to assess dissimilarity between two time series. The R package TSclust aims to implement a large set of well-established peer-reviewed time series dissimilarity measures, including measures based on raw data, extracted features, underlying parametric models, complexity levels, and forecast behaviors. Computation of these measures allows the user to perform clustering with conventional clustering algorithms. TSclust also includes a clustering procedure based on p values from checking the equality of generating models, and some utilities to evaluate cluster solutions. The implemented dissimilarity functions are accessible individually to facilitate extension and use outside the clustering context. The main features of TSclust are described and examples of its use are presented.


Introduction
Clustering is an unsupervised learning task aimed at partitioning a set of unlabeled data objects into homogeneous groups or clusters. The partition is performed in such a way that objects in the same cluster are more similar to each other than to objects in different clusters, according to some defined criterion. In many real applications, the cluster analysis must be performed on time series data. Finding stocks that behave in a similar way, determining products with similar selling patterns, identifying countries with similar population growth or regions with similar temperature are some typical applications where searching for similarity between time series is clearly motivated. In fact, time series clustering problems arise in a natural way in a wide variety of fields, including economics, finance, medicine, ecology, environmental studies, engineering, and many others. Frequently, the grouping of series plays a central role in the studied problem. These arguments motivate the growing interest in the literature on time series clustering, especially in the last two decades, where a huge number of contributions on this topic have been provided. An excellent survey on time series clustering is given by Liao (2005) and references therein, although significant new contributions have been made subsequently. Particularly important in the last decade has been the explosion of papers on the topic coming from both the data mining and pattern recognition communities. Fu (2011) provides a complete and comprehensive overview of recent time series data mining directions, including a range of key problems such as representation, indexing and segmentation of series, measures of dissimilarity, clustering procedures and visualization tools.
A crucial question in cluster analysis is establishing what we mean by "similar" data objects, i.e., determining a suitable similarity/dissimilarity measure between two objects. In the specific context of time series data, the concept of dissimilarity is particularly complex due to the dynamic character of the series. Dissimilarities usually considered in conventional clustering may not work adequately with time-dependent data because they ignore the interdependence relationship between values. For this reason, different approaches to define dissimilarity between time series have been proposed in the literature, and a short overview of them is presented below.
If the objective is to compare profiles of series, then conventional distances between raw data (Euclidean or Manhattan, among others) evaluating a one-to-one mapping of each pair of sequences can produce satisfactory results. Sometimes, depending on the domain, the chosen dissimilarity must satisfy properties of invariance to specific distortions of the data to obtain proper results. Batista, Wang, and Keogh (2011) provide an interesting review of dissimilarity measures designed to be invariant to features like local scaling (warping), uniform scaling, offset, amplitude scaling, phase, complexity, and so on. Other times, dissimilarity is measured by comparing sequences of serial features extracted from the original time series, such as autocorrelations, cross-correlations, spectral features, wavelet coefficients, and so on (see Kovacić 1998; Struzik and Siebes 1999; Galeano and Peña 2000; Caiado, Crato, and Peña 2006; Douzal Chouakria and Nagabhushan 2007, among others). These feature-based approaches aim to represent the dynamic structure of each series by a feature vector of lower dimension, thus allowing a dimensionality reduction (time series are essentially high-dimensional data) and a meaningful saving in computation time. In addition, the features commonly used in the literature for the similarity matching problem can be obtained in a straightforward manner, and all of these properties result in more efficient clustering procedures. An alternative approach consists in assuming specific underlying models and evaluating the dissimilarity between the fitted models (some references following this approach are Piccolo 1990; Maharaj 1996, 2002; Kakizawa, Shumway, and Taniguchi 1998; Vilar and Pértega 2004, among many others).
The most commonly considered criterion has been to assume that the time series are generated by ARIMA processes, although researchers in speech recognition and machine learning have also adopted alternative models such as Markov chains (MC, see Ramoni, Sebastiani, and Cohen 2002) or hidden Markov models (HMM, see Smyth 1997; Oates, Firoiu, and Cohen 1999, among others). An interesting set of references on these approaches can be found in Bagnall, Janacek, de la Iglesia, and Zhang (2003). Other dissimilarity measures aim to compare the levels of complexity of time series (see, e.g., Li, Chen, Li, Ma, and Vitányi 2004; Sculley and Brodley 2006; Keogh, Lonardi, Ratanamahatana, Wei, Lee, and Handley 2007; Brandmaier 2011; Batista et al. 2011). The complexity of a time series can be estimated by following different approaches although, roughly speaking, most of them are based on the notion of Kolmogorov complexity or algorithmic entropy (Li and Vitányi 2007). Kolmogorov complexity can be thought of as the ultimate lower bound of all measures of information, but unfortunately it cannot be computed and must be approximated in practice. Two prominent approaches to evaluate complexity differences between two time series are: (i) using algorithms based on data compression (see, e.g., Li, Badger, Chen, Kwong, Kearney, and Zhang 2001; Li et al. 2004; Keogh, Lonardi, and Ratanamahatana 2004; Cilibrasi and Vitányi 2005; Keogh et al. 2007), and (ii) considering differences between permutation distributions (Brandmaier 2011). Beyond all these criteria to define dissimilarity between series, it is worth mentioning that dissimilarity measures are often tailored to the problem at hand, highlighting properties of the time series that are of interest in the specific context.
For instance, there are practical situations where the real interest of the clustering lies in the properties of the predictions, as in the case of any sustainable development problem or in situations where the concern is to reach target values at a pre-specified future time. Works by Alonso, Berrendero, Hernandez, and Justel (2006) and Vilar, Alonso, and Vilar (2010) focused on this idea and considered a notion of dissimilarity governed by the performance of future forecasts. Specifically, two time series are similar if their forecasts for a specific future time are close.
Summing up, there exists a broad range of measures to compare time series, and the choice of a proper dissimilarity measure depends largely on the nature of the clustering, i.e., on determining what the purpose of the grouping is. Once the dissimilarity measure is determined, an initial pairwise dissimilarity matrix can be obtained and a conventional clustering algorithm can then be used to form groups of objects. In fact, most of the time series clustering approaches reviewed by Liao (2005) are variations of general procedures (e.g., k-means or hierarchical clustering) that use a range of dissimilarities specifically designed to deal with time series. According to this consideration, our attention focused on the implementation of these measures, and the package TSclust presented in this work is the result of integrating a wide range of dissimilarity measures to perform time series clustering. The package contains commonly used dissimilarity measures, including complexity-based measures, model-based measures, feature-based measures and the prediction-based dissimilarity introduced by Vilar et al. (2010). Some of these measures work in the time domain but others are developed in the frequency domain, and as will be seen later, in many cases their implementation requires a range of statistical techniques (autoregressive model estimation, kernel density estimation, local polynomial regression, automatic bandwidth selectors, resampling procedures and wavelets, among others). All dissimilarity measures included in TSclust have been previously studied in the time series clustering literature and are properly referenced throughout this work. It is worth stressing that some dissimilarity measures work under certain regularity conditions while others are applicable in more general contexts.
Indeed, TSclust users should carefully analyze which specific measures are most suitable to capture similarity in their clustering problem; to this end, some considerations on how to choose a proper dissimilarity measure are discussed in Section 4.
Although the intended spirit of TSclust is not to provide a self-contained tool to perform time series clustering, some additional clustering utilities (not available in other R packages to the best of our knowledge) have been incorporated into TSclust as useful complements. For instance, TSclust implements a clustering algorithm based on checking the equality of generating models, proposed by Maharaj (2000), as well as two clustering validation indices used by different authors in time series clustering (see Section 3 for details).
In short, the TSclust package is an attempt to integrate a wide set of dissimilarity measures between time series so that the user can compare their performance and identify useful dissimilarity criteria in a given context. Clustering is indeed a typical scenario of application of these measures, and in fact, all of them have been studied in the clustering framework. Moreover, the package was designed to provide a flexible and extensible environment, with functions accessible individually for easier extension and use outside the clustering context. TSclust is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=TSclust. The remainder of this paper is organized as follows. The different dissimilarity measures between time series implemented in TSclust are introduced in Section 2. The main features of each measure are briefly discussed, and appropriate references are provided for those readers interested in a more detailed analysis of each dissimilarity measure and its properties. Some specific clustering tools included in the package and a quick overview of clustering utilities available in R are presented in Section 3. Some general considerations on the choice of a proper dissimilarity measure to perform time series clustering are given in Section 4. Section 5 is devoted to illustrating the functionality of the package via a series of applications. Specifically, the behavior in time series clustering of several dissimilarity measures is analyzed by using a set of simulated series and several illustrative applications with real data. Finally, the main conclusions are reported in Section 6.

Dissimilarity measures for time series in TSclust
Many dissimilarity measures between time series have been proposed in the literature and a representative set of them has been implemented in TSclust. This section is devoted to briefly describing the implemented measures, which have been grouped into four categories: model-free measures (Section 2.1), model-based measures (Section 2.2), complexity-based measures (Section 2.3) and prediction-based measures (Section 2.4).
In the remainder of this section, unless otherwise specified, X_T = (X_1, . . . , X_T) and Y_T = (Y_1, . . . , Y_T) denote partial realizations from two real-valued processes X = {X_t, t ∈ Z} and Y = {Y_t, t ∈ Z}, respectively. Note that serial realizations of the same length T are initially assumed, although this restriction can be relaxed in some cases.

Model-free approaches
A simple approach to measure the proximity between X T and Y T is to consider conventional metrics based on the closeness of their values at specific points of time. Some commonly used raw-values-based dissimilarity measures are introduced below.

Minkowski distance
The Minkowski distance of order q, with q a positive integer, also called the L_q-norm distance, is defined by

d_{L_q}(X_T, Y_T) = \left( \sum_{t=1}^{T} |X_t - Y_t|^q \right)^{1/q}.

The Minkowski distance is typically used with q = 2 (Euclidean distance) or q = 1 (Manhattan distance). This metric is very sensitive to signal transformations such as shifting or time scaling (stretching or shrinking of the time axis). On the other hand, the proximity notion relies on the closeness of the values observed at corresponding points of time, so that the observations are treated as if they were independent. In particular, d_{L_q} is invariant to permutations over time.
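To make the definition concrete, d_{L_q} can be computed in a few lines of base R (a minimal sketch for illustration; the function name is ours, and TSclust users would normally rely on the package's own dissimilarity functions):

```r
# Minkowski (L_q) distance between two equal-length series;
# q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
minkowski_dist <- function(x, y, q = 2) {
  stopifnot(length(x) == length(y))
  sum(abs(x - y)^q)^(1 / q)
}

x <- c(1, 2, 3, 4)
y <- c(2, 2, 5, 4)
minkowski_dist(x, y, q = 2)  # sqrt(1 + 0 + 4 + 0) = sqrt(5)
minkowski_dist(x, y, q = 1)  # 1 + 0 + 2 + 0 = 3
```

Note that applying the same permutation of time points to both series leaves the result unchanged, illustrating that the temporal ordering is ignored.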

Fréchet distance
This distance was introduced by Fréchet (1906) to measure the proximity between continuous curves, but it has been extensively used in the discrete case (see Eiter and Mannila 1994) and in the time series framework. A formal definition for the discrete case can be given as follows. Let M be the set of all possible sequences of m pairs preserving the order of observations, in the form

r = ((X_{a_1}, Y_{b_1}), \ldots, (X_{a_m}, Y_{b_m})),

with a_i, b_j ∈ {1, . . . , T} such that a_1 = b_1 = 1, a_m = b_m = T, and a_{i+1} ∈ {a_i, a_i + 1} and b_{i+1} ∈ {b_i, b_i + 1}, for i ∈ {1, . . . , m − 1}. Then the Fréchet distance is defined by

d_F(X_T, Y_T) = \min_{r \in M} \, \max_{i = 1, \ldots, m} |X_{a_i} - Y_{b_i}|.

Unlike the Minkowski distance, the Fréchet distance does not just treat the series as two sets of points, but takes into account the ordering of the observations. Note that d_F can be computed on series of different lengths.
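For illustration, the discrete Fréchet distance can be evaluated with the classical dynamic programming recursion of Eiter and Mannila (1994); the following base R sketch (function name ours) assumes both series have length at least two:

```r
# Discrete Fréchet distance via dynamic programming:
# Fm[i, j] holds the Fréchet distance between the first i points of x
# and the first j points of y.
frechet_dist <- function(x, y) {
  n <- length(x); m <- length(y)
  Fm <- matrix(Inf, n, m)
  Fm[1, 1] <- abs(x[1] - y[1])
  for (i in 2:n) Fm[i, 1] <- max(Fm[i - 1, 1], abs(x[i] - y[1]))
  for (j in 2:m) Fm[1, j] <- max(Fm[1, j - 1], abs(x[1] - y[j]))
  for (i in 2:n) for (j in 2:m)
    Fm[i, j] <- max(abs(x[i] - y[j]),
                    min(Fm[i - 1, j], Fm[i, j - 1], Fm[i - 1, j - 1]))
  Fm[n, m]  # Fréchet distance between the full series
}

frechet_dist(c(0, 1, 2), c(0, 2))  # 1
```

The example uses series of different lengths, which the recursion handles naturally.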

Dynamic time warping distance
The dynamic time warping (DTW) distance was studied in depth by Sankoff and Kruskal (1983) and proposed to find patterns in time series by Berndt and Clifford (1994). Like the Fréchet distance, the DTW distance aims to find a mapping r between the series so that a specific distance measure between the coupled observations (X_{a_i}, Y_{b_i}) is minimized. The DTW distance is defined by

d_{DTW}(X_T, Y_T) = \min_{r \in M} \sum_{i=1}^{m} |X_{a_i} - Y_{b_i}|.

In TSclust, d_F and d_DTW are computed using the R packages longitudinalData (Genolini 2014) and dtw (Giorgino 2009), respectively. Both distances allow similar shapes to be recognized, even in the presence of signal transformations such as shifting and/or scaling. However, as in the case of d_{L_q}, both d_F and d_DTW ignore the temporal structure of the values, as the proximity is based on the differences |X_{a_i} − Y_{b_i}| regardless of the behavior around these values. A dissimilarity index combining both behavior and values proximity estimation is described below.
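For illustration, the DTW distance admits a dynamic programming solution analogous to the Fréchet recursion, accumulating the coupled absolute differences instead of taking their maximum (a minimal base R sketch, not the dtw package interface used by TSclust):

```r
# Dynamic time warping distance via dynamic programming; D is padded
# with an extra row/column of Inf so the recursion needs no edge cases.
dtw_dist <- function(x, y) {
  n <- length(x); m <- length(y)
  D <- matrix(Inf, n + 1, m + 1)
  D[1, 1] <- 0
  for (i in 1:n) for (j in 1:m)
    D[i + 1, j + 1] <- abs(x[i] - y[j]) +
      min(D[i, j + 1], D[i + 1, j], D[i, j])
  D[n + 1, m + 1]  # total cost of the optimal warping path
}

dtw_dist(c(0, 1, 2), c(0, 2))  # 1
```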

An adaptive dissimilarity index covering both proximity on values and on behavior
Douzal Chouakria and Nagabhushan (2007) introduce a dissimilarity measure addressed to cover both conventional measures for the proximity on observations and temporal correlation for the behavior proximity estimation. The proximity between the dynamic behaviors of the series is evaluated by means of the first order temporal correlation coefficient, which is defined by

CORT(X_T, Y_T) = \frac{\sum_{t=1}^{T-1} (X_{t+1} - X_t)(Y_{t+1} - Y_t)}{\sqrt{\sum_{t=1}^{T-1} (X_{t+1} - X_t)^2} \sqrt{\sum_{t=1}^{T-1} (Y_{t+1} - Y_t)^2}}.

CORT(X_T, Y_T) belongs to the interval [−1, 1]. The value CORT(X_T, Y_T) = 1 means that both series show a similar dynamic behavior, i.e., their growths (positive or negative) at any instant of time are similar in direction and rate. The value CORT(X_T, Y_T) = −1 implies a similar growth in rate but opposite in direction (opposite behavior). Finally, CORT(X_T, Y_T) = 0 expresses that there is no monotonicity between X_T and Y_T, and their growth rates are stochastically linearly independent (different behaviors).
The dissimilarity index proposed by Douzal Chouakria and Nagabhushan (2007) modulates the proximity between the raw values of two series X_T and Y_T using the coefficient CORT(X_T, Y_T). Specifically, it is defined by

d_{CORT}(X_T, Y_T) = \phi_k[CORT(X_T, Y_T)] \cdot d(X_T, Y_T),
where φ_k(·) is an adaptive tuning function to automatically modulate a conventional raw-data distance d(X_T, Y_T) (e.g., d_{L_q}, d_F or d_DTW) according to the temporal correlation. The modulating function should increase (decrease) the weight of the dissimilarity between observations as the temporal correlation decreases from 0 to −1 (increases from 0 to +1). In addition, d_CORT(X_T, Y_T) should approach the raw-data discrepancy when the temporal correlation is zero. Instead of, for instance, a linear tuning function, Douzal Chouakria and Nagabhushan propose to use the exponential adaptive function

\phi_k(u) = \frac{2}{1 + \exp(ku)}, \quad k \geq 0.

Now, we focus on model-free dissimilarity measures based on particular features of the time series. The idea is to replace the raw data by a reduced number of features characterizing the time series, which allows us to take into account the temporal structure of the series.
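Returning to the adaptive index just described, CORT and d_CORT are straightforward to sketch in base R; in this illustration the Euclidean distance plays the role of the raw-data component d, and k = 2 is an arbitrary choice of the tuning parameter:

```r
# First-order temporal correlation coefficient CORT.
cort <- function(x, y) {
  dx <- diff(x); dy <- diff(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

# Adaptive dissimilarity: exponential tuning function phi_k modulating
# a conventional raw-data distance (Euclidean here, for illustration).
d_cort <- function(x, y, k = 2) {
  phi <- 2 / (1 + exp(k * cort(x, y)))
  phi * sqrt(sum((x - y)^2))
}

cort(c(1, 2, 4, 3), c(2, 3, 5, 4))  # 1: identical dynamic behavior
```

Two series differing only by a level shift have CORT = 1, so the raw-data discrepancy is downweighted; series with opposite growth patterns have CORT close to −1 and their discrepancy is inflated.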

Correlation-based distances
A first and simple dissimilarity criterion is to consider the Pearson correlation coefficient between X_T and Y_T, given by

COR(X_T, Y_T) = \frac{\sum_{t=1}^{T} (X_t - \bar{X}_T)(Y_t - \bar{Y}_T)}{\sqrt{\sum_{t=1}^{T} (X_t - \bar{X}_T)^2} \sqrt{\sum_{t=1}^{T} (Y_t - \bar{Y}_T)^2}},

with \bar{X}_T and \bar{Y}_T the average values of the serial realizations X_T and Y_T, respectively. Golay, Kollias, Stoll, Meier, Valavanis, and Boesiger (2005) construct a fuzzy k-means algorithm using the following two cross-correlation-based distances:

d_{COR.1}(X_T, Y_T) = \sqrt{2 \left( 1 - COR(X_T, Y_T) \right)},

d_{COR.2}(X_T, Y_T) = \sqrt{\left( \frac{1 - COR(X_T, Y_T)}{1 + COR(X_T, Y_T)} \right)^{\beta}}, \quad \text{with } \beta \geq 0.

Note that d_{COR.2} becomes infinite when COR(X_T, Y_T) = −1, and the parameter β regulates how fast the distance decreases. Graphs of the functions d_1 = d^2_{COR.1} and d_2 = d^2_{COR.2} for some values of β are shown in Figure 1.
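Both correlation-based distances reduce to one-liners in base R (a sketch; cor() computes the Pearson coefficient, and max(0, ...) guards against tiny negative values from floating-point roundoff):

```r
# Correlation-based distances of Golay et al. (2005).
d_cor1 <- function(x, y) sqrt(max(0, 2 * (1 - cor(x, y))))
d_cor2 <- function(x, y, beta = 1) {
  r <- cor(x, y)
  sqrt(((1 - r) / (1 + r))^beta)  # infinite when r = -1
}

x <- c(1, 2, 3, 5, 4)
d_cor1(x, 2 * x + 3)  # ~0: perfectly correlated series
```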

Autocorrelation-based distances
Let \hat{\rho}_{X_T} = (\hat{\rho}_{1,X_T}, \ldots, \hat{\rho}_{L,X_T})^\top and \hat{\rho}_{Y_T} = (\hat{\rho}_{1,Y_T}, \ldots, \hat{\rho}_{L,Y_T})^\top be the estimated autocorrelation vectors of X_T and Y_T, respectively, for some L such that \hat{\rho}_{i,X_T} ≈ 0 and \hat{\rho}_{i,Y_T} ≈ 0 for i > L. Galeano and Peña (2000) define a distance between X_T and Y_T as follows:

d_{ACF}(X_T, Y_T) = \sqrt{ \left( \hat{\rho}_{X_T} - \hat{\rho}_{Y_T} \right)^\top \Omega \left( \hat{\rho}_{X_T} - \hat{\rho}_{Y_T} \right) },

where Ω is a matrix of weights.
Some common choices of Ω are:

(i) Uniform weights, taking Ω = I. In this case d_ACF becomes the Euclidean distance between the estimated autocorrelation functions:

d_{ACFU}(X_T, Y_T) = \sqrt{ \sum_{i=1}^{L} \left( \hat{\rho}_{i,X_T} - \hat{\rho}_{i,Y_T} \right)^2 }.

(ii) Geometric weights decaying with the autocorrelation lag, so that d_ACF takes the form:

d_{ACFG}(X_T, Y_T) = \sqrt{ \sum_{i=1}^{L} p(1-p)^i \left( \hat{\rho}_{i,X_T} - \hat{\rho}_{i,Y_T} \right)^2 }, \quad \text{with } 0 < p < 1.

Analogous distances can be constructed by considering the partial autocorrelation functions (PACFs) instead of the ACFs. Hereafter, the notation d_PACFU and d_PACFG will be used to denote the Euclidean distance between the estimated partial autocorrelation coefficients with uniform weights and with geometric weights decaying with the lag, respectively.
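For illustration, the uniform-weight version d_ACFU can be sketched with base R's acf(); the truncation lag L is chosen by the user:

```r
# Euclidean distance between estimated autocorrelations up to lag L
# (i.e., d_ACF with Omega = I).
d_acfu <- function(x, y, L = 10) {
  rho_x <- acf(x, lag.max = L, plot = FALSE)$acf[-1]  # drop lag 0
  rho_y <- acf(y, lag.max = L, plot = FALSE)$acf[-1]
  sqrt(sum((rho_x - rho_y)^2))
}
```

The geometric-weight version d_ACFG would only replace the inner sum by p(1 − p)^i weighted squared differences.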
So far all metrics work in the time domain, but the frequency domain approach also offers an interesting alternative to measure the dissimilarity between time series. The key idea is to assess the dissimilarity between the corresponding spectral representations of the series.

Periodogram-based distances
Let I_{X_T}(\lambda_k) and I_{Y_T}(\lambda_k) be the periodograms of X_T and Y_T, respectively, at frequencies \lambda_k = 2\pi k / T, k = 1, \ldots, n, with n = [(T-1)/2]. Caiado et al. (2006) consider, among others, the following periodogram-based dissimilarity measures.

(i) The Euclidean distance between the periodogram ordinates:

d_P(X_T, Y_T) = \frac{1}{n} \sqrt{ \sum_{k=1}^{n} \left( I_{X_T}(\lambda_k) - I_{Y_T}(\lambda_k) \right)^2 }.

(ii) If we are not interested in the process scale but only in its correlation structure, better results can be obtained using the Euclidean distance between the normalized periodogram ordinates:

d_{NP}(X_T, Y_T) = \frac{1}{n} \sqrt{ \sum_{k=1}^{n} \left( NI_{X_T}(\lambda_k) - NI_{Y_T}(\lambda_k) \right)^2 },

where NI_{X_T}(\lambda_k) = I_{X_T}(\lambda_k)/\hat{\sigma}^2_{X_T} and NI_{Y_T}(\lambda_k) = I_{Y_T}(\lambda_k)/\hat{\sigma}^2_{Y_T}, with \hat{\sigma}^2_{X_T} and \hat{\sigma}^2_{Y_T} the sample variances of X_T and Y_T, respectively.
(iii) As the variance of the periodogram ordinates is proportional to the spectrum value at the corresponding frequencies, it makes sense to use the logarithm of the normalized periodogram:

d_{LNP}(X_T, Y_T) = \frac{1}{n} \sqrt{ \sum_{k=1}^{n} \left( \log NI_{X_T}(\lambda_k) - \log NI_{Y_T}(\lambda_k) \right)^2 }.

Casado de Lucas (2010) considers a distance measure based on the cumulative versions of the periodograms, i.e., the integrated periodograms, and argues that the approaches based on the integrated periodogram present several advantages over the ones based on periodograms. In particular:

The periodogram is an asymptotically unbiased but inconsistent estimator of the spectral density, while the integrated periodogram is a consistent estimator of the spectral distribution.
From a theoretical point of view, the spectral distribution always exists, but the spectral density exists only under absolutely continuous distributions.
The integrated periodogram completely determines the stochastic process.
The following distance based on the integrated periodogram, in a normalized and a non-normalized version, is proposed in Casado de Lucas (2010):

d_{IP}(X_T, Y_T) = \int_{-\pi}^{\pi} \left| F_{X_T}(\lambda) - F_{Y_T}(\lambda) \right| d\lambda,

where F_{X_T}(\lambda_j) = C_{X_T}^{-1} \sum_{i=1}^{j} I_{X_T}(\lambda_i) and F_{Y_T}(\lambda_j) = C_{Y_T}^{-1} \sum_{i=1}^{j} I_{Y_T}(\lambda_i), with C_{X_T} = \sum_i I_{X_T}(\lambda_i) and C_{Y_T} = \sum_i I_{Y_T}(\lambda_i) for the normalized version, and C_{X_T} = C_{Y_T} = 1 for the non-normalized version.
The normalized version gives more weight to the shape of the curves, while the non-normalized version considers the scale. Casado de Lucas suggests using the normalized version when the graphs of the functions tend to intersect, and the non-normalized version when they do not.
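These periodogram-based measures can be sketched with base R's spec.pgram(); in the sketch below the integral in d_IP is approximated by a sum over the Fourier frequencies (a simplification for illustration):

```r
# Raw periodogram ordinates (no taper, no detrending, no smoothing).
pgram <- function(x)
  spec.pgram(x, taper = 0, detrend = FALSE, fast = FALSE, plot = FALSE)$spec

# Euclidean distance between periodogram ordinates (d_P).
d_p <- function(x, y) {
  Ix <- pgram(x); Iy <- pgram(y)
  sqrt(sum((Ix - Iy)^2)) / length(Ix)
}

# Normalized integrated-periodogram distance (d_IP), the integral
# approximated by a sum over the Fourier frequencies.
d_ip <- function(x, y) {
  Fx <- cumsum(pgram(x)); Fy <- cumsum(pgram(y))
  sum(abs(Fx / Fx[length(Fx)] - Fy / Fy[length(Fy)]))
}
```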

Dissimilarity measures based on nonparametric spectral estimators
Kakizawa et al. (1998) proposed a general spectral disparity measure between two series given by

d_W(X_T, Y_T) = \frac{1}{4\pi} \int_{-\pi}^{\pi} W\!\left( \frac{f_{X_T}(\lambda)}{f_{Y_T}(\lambda)} \right) d\lambda,

where f_{X_T} and f_{Y_T} denote the spectral densities of X_T and Y_T, respectively, and W(·) is a divergence function satisfying appropriate regularity conditions to ensure that d_W has the quasi-distance property. If, for example, W(x) = log(αx + (1 − α)) − α log x, with 0 < α < 1, then d_W corresponds to the limiting spectral approximation of the Chernoff information in the time domain (Shumway and Unger 1974). Note that d_W is not a true distance because it is not symmetric and does not satisfy the triangle inequality. For clustering purposes, it is more convenient to make the divergence function symmetric by setting \tilde{W}(x) = W(x) + W(x^{-1}).
In practice, the spectra f_{X_T} and f_{Y_T} are unknown and must be estimated beforehand. Vilar and Pértega (2004) studied the asymptotic properties of d_W when f_{X_T} and f_{Y_T} are replaced by nonparametric estimators constructed via local linear regression. These approximations can be done in three different ways (Fan and Kreutzberger 1998), resulting in three different versions of the d_W dissimilarity measure. Specifically, d_W(DLS), when the spectra are replaced by local linear smoothers of the periodograms, obtained via least squares.
d W (LS ) , when the spectra are replaced by the exponential transformation of local linear smoothers of the log-periodograms, obtained via least squares.
d W (LK ) , when the spectra are replaced by the exponential transformation of local linear smoothers of the log-periodograms, here obtained by using the maximum local likelihood criterion.
Due to the asymptotic inefficiency of d_W(LS) with respect to both d_W(DLS) and d_W(LK), only the latter two metrics are implemented in TSclust. In particular, the weighted least squares smoother is obtained using the R package locpol (Ojeda Cabrera 2012). The default value of the bandwidth is an automatic plug-in selector specifically designed for local linear Gaussian kernel regression (see Ruppert, Sheather, and Wand 1995). This plug-in method is implemented using the R package KernSmooth (Wand 2014).
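As an illustration of the spectral disparity measure, the following sketch evaluates d_W with the symmetrized Chernoff-type divergence W̃; plain kernel-smoothed periodograms (spec.pgram with modified Daniell spans) stand in for the local linear estimators actually used in TSclust, and the integral is approximated by an average over the Fourier frequencies:

```r
# Spectral disparity d_W with W(x) = log(alpha*x + 1 - alpha) - alpha*log(x),
# symmetrized as W~(x) = W(x) + W(1/x). Smoothed periodograms replace the
# unknown spectral densities (a simplification for illustration).
d_w <- function(x, y, alpha = 0.5, spans = c(5, 5)) {
  fx <- spec.pgram(x, spans = spans, taper = 0, plot = FALSE)$spec
  fy <- spec.pgram(y, spans = spans, taper = 0, plot = FALSE)$spec
  W  <- function(r) log(alpha * r + 1 - alpha) - alpha * log(r)
  mean(W(fx / fy) + W(fy / fx))  # symmetrized divergence, averaged
}
```

Since W̃(1) = 0 and W̃ is non-negative for positive arguments, the measure vanishes for series with identical spectral estimates.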
Two alternative nonparametric spectral dissimilarity measures introduced by Pértega and Vilar (2010) are also implemented in TSclust. In both cases, the discrepancy measure is given by a nonparametric statistic originally introduced to check the equality of the log-spectra of two processes.
The first alternative comes from the generalized likelihood ratio test approach introduced by Fan and Zhang (2004) to check whether the density of an observed time series belongs to a parametric family. Pértega and Vilar (2010) introduce a slight modification of this test statistic in order to check the equality of two log-spectra, resulting in

d_{GLK}(X_T, Y_T) = \sum_{k=1}^{n} \left[ Z_k - \hat{\mu}(\lambda_k) - 2 \log\left( 1 + e^{Z_k - \hat{\mu}(\lambda_k)} \right) \right] - \sum_{k=1}^{n} \left[ Z_k - 2 \log\left( 1 + e^{Z_k} \right) \right],

where Z_k = \log(I_{X_T}(\lambda_k)) - \log(I_{Y_T}(\lambda_k)), and \hat{\mu}(\lambda_k) is the local maximum likelihood estimator of \mu(\lambda_k) = \log(f_{X_T}(\lambda_k)) - \log(f_{Y_T}(\lambda_k)) computed by local linear fitting.
The second distance evaluates the integrated squared differences between nonparametric estimators of the log-spectra and is given by

d_{ISD}(X_T, Y_T) = \int_{-\pi}^{\pi} \left( \hat{m}_{X_T}(\lambda) - \hat{m}_{Y_T}(\lambda) \right)^2 d\lambda,

where \hat{m}_{X_T} and \hat{m}_{Y_T} are local linear smoothers of the log-periodograms obtained using the maximum local likelihood criterion.
In all cases, local linear smoothing techniques were considered to approximate the unknown log-spectra because of their good theoretical and practical properties, such as nice minimax efficiency properties and absence of boundary effects, among others (Fan and Gijbels 1996). Moreover, these properties are satisfied under very general regularity conditions of the spectral densities, thus providing a great versatility to approximate spectra from different kinds of processes. Hence, it is intuitively expected that these methods are more robust in cluster analysis than other approaches based on ad hoc parametric modeling.

A dissimilarity measure based on the discrete wavelet transform
The discrete wavelet transform (DWT) is a useful feature extraction technique often used to measure dissimilarity between time series. The DWT performs a scale-wise decomposition of the time series in such a way that most of the energy of the time series can be represented by only a few coefficients. The basic idea is to replace the original series by their wavelet approximation coefficients at an appropriate scale, and then to measure the dissimilarity between these wavelet approximations. A detailed description of wavelet methods for time series analysis can be found in Percival and Walden (2006), and some interesting references where wavelets are used for clustering time series are Struzik and Siebes (1999), Chan and Fu (1999), Popivanov and Miller (2002), Chan, Fu, and Yu (2003) and Zhang, Ho, Zhang, and Lin (2006), among others.
There is indeed a key point when we wish to use the DWT technique for clustering: the choice of an appropriate scale to obtain an accurate clustering. In TSclust, an automatic algorithm to determine this scale is implemented. The algorithm was proposed by Zhang et al. (2006) and is aimed at selecting the scale by balancing two conflicting requirements: an efficient reduction of the dimensionality and the preservation of as much information from the original data as possible. Specifically, the algorithm works as follows.
Consider a set of time series X^{(1)}_T, \ldots, X^{(n)}_T, and denote by A^{(i)}_j and D^{(i)}_k, k = j, j+1, \ldots, J−1, the coefficients corresponding to the discrete wavelet transform of X^{(i)}_T at scale j. The A^{(i)}_j are called approximation coefficients (or smooth coefficients) and represent the smooth behavior of the data. The D^{(i)}_k, k = j, j + 1, \ldots, J − 1, are called the detail coefficients because they represent the finer, more high-frequency nature of the data. As feature vector, Zhang et al. propose to retain the approximation coefficients A^{(i)}_j for the particular scale j corresponding to the highest scale that satisfies the energy condition in Equation 2, where E(Z) = \sum_{k=1}^{s} z_k^2 denotes the energy associated with a vector Z = (z_1, \ldots, z_s)^\top \in R^s. The argument behind this criterion is that the sum of squared errors (SSE) between X^{(i)}_T and the reconstructed approximation series \tilde{X}^{(i)}_T equals the energy of the discarded detail coefficients. Therefore, removing the detail coefficients within a scale j satisfying Equation 2 and higher scales achieves a tradeoff between lower dimensionality and lower SSE (i.e., lower energy loss). Moreover, Zhang et al. show that the proposed algorithm is efficient when the Haar wavelet is used. Once the appropriate scale j is determined, the dissimilarity between two series X^{(u)}_T and X^{(v)}_T is measured by the Euclidean distance between their respective approximation coefficients A^{(u)}_j and A^{(v)}_j. In TSclust, the discrete wavelet transform of the time series is computed using the R package wmtsa (Constantine and Percival 2014).

Dissimilarity based on the symbolic representation SAX
Symbolization involves the transformation of time series into sequences of discretized symbols that can be efficiently processed to extract information about the underlying processes. Although different techniques to generate symbolic representations of time series have been introduced in the literature (Daw, Finney, and Tracy 2003), many of them fail to define a dissimilarity measure in the symbolic space for several reasons, such as producing a poor dimension reduction or showing little correlation with a dissimilarity measure in the original space. The symbolic representation SAX (symbolic aggregate approximation) introduced by Lin, Keogh, Lonardi, and Chiu (2003) does not suffer from these flaws. The SAX approach first transforms the original data into the piecewise aggregate approximation (PAA) representation (Yi and Faloutsos 2000; Keogh, Chakrabarti, Pazzani, and Mehrotra 2000) and then symbolizes the PAA representation into a discrete string. The PAA intermediate representation guarantees a high dimensionality reduction power, and furthermore, makes it possible to prove that a dissimilarity based on the symbolic strings lower bounds the true dissimilarity between the original time series (Lin et al. 2003). As a result, the SAX representation is highly competitive for dealing with a variety of data mining problems, including cluster analysis of time series. The SAX approach is outlined below.
1. Normalization Time series are transformed to have zero mean and unit variance.
2. PAA representation Each normalized series X_T is represented in a w-dimensional space by the vector \bar{X}_w = (\bar{X}_1, \ldots, \bar{X}_w), whose i-th element is calculated by

\bar{X}_i = \frac{w}{T} \sum_{t = (T/w)(i-1)+1}^{(T/w)\, i} X_t. \quad (3)

Thus, the data are first divided into w segments of equal length. Then, the i-th component of the PAA representation is given by the mean value of the data falling within the i-th segment. Equation 3 assumes that T/w is an integer, but this assumption can be relaxed by weighting the contribution of data points placed in adjacent segments (see details on this generalization in Section 3.5 of Lin, Keogh, Wei, and Lonardi 2007).
3. Symbolization Given an alphabet of α symbols {l_1, \ldots, l_α}, a set of breakpoints −∞ = β_0 < β_1 < \ldots < β_{α−1} < β_α = ∞ is determined so that the areas under the standard Gaussian density between consecutive breakpoints are all equal to 1/α, and each PAA coefficient \bar{X}_i is mapped to the symbol l_j whenever β_{j−1} ≤ \bar{X}_i < β_j. This way, each symbol of the resulting word \hat{X}_α, i.e., of the SAX symbolic representation, is generated with approximately the same probability.
4. MINDIST dissimilarity measure Though many dissimilarities can be defined over the SAX representation, one that approximates the Euclidean distance can be useful. First, the distance between a pair of symbols l_i and l_j, i, j ∈ {1, . . . , α}, is defined by

dist(l_i, l_j) = \begin{cases} 0 & \text{if } |i - j| \leq 1, \\ \beta_{\max(i,j)-1} - \beta_{\min(i,j)} & \text{otherwise}. \end{cases}

Note that the square matrix containing all pairwise distances between symbols needs to be computed only once. Based on this matrix, the dissimilarity between the SAX words \hat{X}_α = (\hat{x}_1, \ldots, \hat{x}_w) and \hat{Y}_α = (\hat{y}_1, \ldots, \hat{y}_w) is directly calculated as

d_{MINDIST.SAX}(X_T, Y_T) = \sqrt{\frac{T}{w}} \sqrt{\sum_{i=1}^{w} dist(\hat{x}_i, \hat{y}_i)^2}.

Complete information on the SAX approach and related issues can be found at the web page of Keogh (2014) and references therein. TSclust incorporates routines to generate the PAA and SAX representations and to compute the dissimilarity d_MINDIST.SAX.
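The whole SAX pipeline (normalization, PAA, Gaussian breakpoints, MINDIST) fits in a short base R sketch; here symbols are coded as integers 1, . . . , α, and T/w is assumed to be an integer:

```r
# 2. PAA: mean of each of the w equal-length segments.
paa <- function(x, w) colMeans(matrix(x, nrow = length(x) / w, ncol = w))

# 1.+3. Normalize, then map PAA coefficients to alpha equiprobable
# symbols using standard Gaussian breakpoints.
sax <- function(x, w, alpha) {
  z <- (x - mean(x)) / sd(x)
  beta <- qnorm((1:(alpha - 1)) / alpha)
  findInterval(paa(z, w), beta) + 1   # symbols coded 1..alpha
}

# 4. MINDIST between two SAX words s1, s2 from series of length Tn.
mindist_sax <- function(s1, s2, Tn, alpha) {
  beta <- qnorm((1:(alpha - 1)) / alpha)
  d <- numeric(length(s1))
  far <- abs(s1 - s2) > 1             # adjacent symbols get distance 0
  d[far] <- beta[pmax(s1, s2)[far] - 1] - beta[pmin(s1, s2)[far]]
  sqrt(Tn / length(s1)) * sqrt(sum(d^2))
}
```

For instance, two 8-point series compressed to 2-symbol words over a 4-letter alphabet yield a MINDIST that lower bounds their Euclidean distance after normalization.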

Model-based approaches
Model-based dissimilarity measures assume that the series are generated by specific parametric structures. The main approach in the literature is to assume that the generating processes of X_T and Y_T follow invertible ARIMA models. In this case, the idea is to fit an ARIMA model to each series and then measure the dissimilarity between the fitted models. The first step requires estimating the structure and the parameters of the ARIMA models. The structure is either assumed to be given or automatically estimated using, for example, Akaike's information criterion (AIC) or Schwarz's Bayesian information criterion (BIC). The parameter values are commonly fitted using generalized least squares estimators. Some of the most relevant dissimilarity measures derived in the literature under the assumption of underlying ARIMA models are described below.

Piccolo distance
Piccolo (1990) defines a dissimilarity measure in the class of invertible ARIMA processes as the Euclidean distance between the AR(∞) operators approximating the corresponding ARIMA structures. Piccolo argues that the autoregressive expansions convey all the useful information about the stochastic structure of this kind of processes (except for initial values).
If the series are non-stationary, differencing is carried out to make them stationary, and if the series possess seasonality, then it should be removed before further analysis. Then, a definite criterion such as AIC or BIC is used to fit truncated AR(∞) models of orders k 1 and k 2 that approximate the generating processes of X T and Y T , respectively. This approach allows us to overcome the problem of obtaining ad hoc ARMA approximations for each of the series subjected to clustering.
If Π̂_{X_T} = (π̂_{1,X_T}, . . . , π̂_{k1,X_T}) and Π̂_{Y_T} = (π̂_{1,Y_T}, . . . , π̂_{k2,Y_T}) denote the AR(k1) and AR(k2) parameter estimates for X_T and Y_T, respectively, then the Piccolo's distance takes the form d_PIC(X_T, Y_T) = sqrt(Σ_{j=1}^{k} (π̂′_{j,X_T} − π̂′_{j,Y_T})²), where k = max(k1, k2), π̂′_{j,X_T} = π̂_{j,X_T} if j ≤ k1 and π̂′_{j,X_T} = 0 otherwise, and the π̂′_{j,Y_T} are defined analogously. Besides satisfying the properties of a distance (non-negativity, symmetry and the triangle inequality), d_PIC always exists for any invertible ARIMA process since the autoregressive coefficients π_j are well-defined, absolutely summable quantities.
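The idea behind the Piccolo distance can be sketched in base R by fitting truncated AR models (here via Yule-Walker with AIC order selection, as in stats::ar()), zero-padding the shorter coefficient vector and taking the Euclidean norm. The name piccolo_dist() is illustrative; TSclust exposes this measure through its own dissimilarity functions.

```r
# Hedged base-R sketch of the Piccolo distance (piccolo_dist is hypothetical).
piccolo_dist <- function(x, y) {
  pi_x <- ar(x)$ar   # AR(k1) coefficients, order selected by AIC
  pi_y <- ar(y)$ar   # AR(k2) coefficients
  k <- max(length(pi_x), length(pi_y))
  pad <- function(p) c(p, rep(0, k - length(p)))  # zero-pad to common length
  sqrt(sum((pad(pi_x) - pad(pi_y))^2))
}
```

The zero-padding step is what makes the distance well defined when the two selected autoregressive orders differ.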

Maharaj distance
For the class of invertible and stationary ARMA processes, Maharaj (1996, 2000) introduced two discrepancy measures based on hypothesis testing to determine whether or not two time series have significantly different generating processes. The first of these metrics is given by the test statistic d_MAH(X_T, Y_T) = sqrt(T) (Π̂_{X_T} − Π̂_{Y_T})′ V̂^{−1} (Π̂_{X_T} − Π̂_{Y_T}), where Π̂_{X_T} and Π̂_{Y_T} are the AR(k) parameter estimates of X_T and Y_T, respectively, with k selected as in the Piccolo's distance, and V̂ is an estimator of V = σ²_{X_T} R_{X_T}^{−1} + σ²_{Y_T} R_{Y_T}^{−1}, with σ²_{X_T} and σ²_{Y_T} denoting the variances of the white noise processes associated with X_T and Y_T, and R_{X_T} and R_{Y_T} the sample covariance matrices of both series.
Maharaj demonstrated that d_MAH is asymptotically χ² distributed under the null hypothesis of equality of generating processes, i.e., assuming that Π_{X_T} = Π_{Y_T}. Therefore, the dissimilarity between Π̂_{X_T} and Π̂_{Y_T} can also be measured through the associated p value, i.e., by considering d_MAH,p(X_T, Y_T) = P(χ²_k > d_MAH(X_T, Y_T)). Both the test statistic d_MAH and the associated p value d_MAH,p satisfy the properties of non-negativity and symmetry, so either of them can be used as a dissimilarity measure between X_T and Y_T. Although d_MAH and d_PIC evaluate the dissimilarity between two series by comparing their autoregressive approximations, there is a substantial difference between them: the Piccolo's distance does not take into account the variances of the white noise processes associated with the observed series, while the Maharaj's statistic involves these variances in its definition. It is important to be aware of this fact when these dissimilarity measures are used to carry out clustering, because d_MAH will be affected by the scale of the series.
It is also worth emphasizing that if a hierarchical algorithm starting from the pairwise matrix of p values d_MAH,p is developed, then a clustering homogeneity criterion is implicitly provided by pre-specifying a threshold significance level α (e.g., 5% or 1%). Those series with associated p values greater than α will be grouped together, which implies that only those series whose dynamic structures are not significantly different at level α will be placed in the same group.
Measures d_MAH and d_MAH,p come from a hypothesis testing procedure designed to compare two independent time series. To overcome this limitation, Maharaj (2000) introduced a new testing procedure that can be applied to time series that are not necessarily independent. In this case, a pooled model jointly including the models fitted to X_T and Y_T is considered, and the combined vector of 2k AR parameters Π = (Π_{X_T}, Π_{Y_T}) is estimated by generalized least squares. Assuming that the two models are correlated at the same points in time but uncorrelated across observations, the proposed test statistic (say d_MAHext) is also asymptotically distributed as χ² with k degrees of freedom. As before, a dissimilarity measure (say d_MAHext,p) based on the p values associated with this new test can be constructed.
Cepstral-based distance

Kalpakis, Gada, and Puttagunta (2001) propose the use of the linear predictive coding (LPC) cepstrum for clustering ARIMA time series. The cepstrum is defined as the inverse Fourier transform of the short-time logarithmic amplitude spectrum. The cepstrum constructed using the autoregression coefficients from a linear model of the signal is referred to as the LPC cepstrum, since it is derived from the linear predictive coding of the signal. Kalpakis et al. argue that LPC cepstral coefficients have good properties to discriminate between ARIMA time series. In particular, only a few LPC cepstral coefficients retain a high amount of information on the underlying ARIMA model.
Consider a time series X_T following an AR(p) structure, i.e., X_t = Σ_{r=1}^{p} φ_r X_{t−r} + ε_t, where the φ_r are the autoregression coefficients and ε_t is a white noise process with zero mean and non-zero variance. The LPC cepstral coefficients can then be derived recursively from the autoregressive coefficients φ_r. In order to measure the distance between two time series X_T and Y_T, Kalpakis et al. (2001) consider the Euclidean distance between their corresponding estimated LPC cepstral coefficients.

Complexity-based approaches
A representative group of dissimilarity measures based on comparing levels of complexity of time series is presented in this section. Here, the similarity of two time series does not rely on specific serial features or on knowledge of the underlying models, but on measuring the level of information shared by both time series. The mutual information between two series can be formally established using the Kolmogorov complexity concept, although this measure cannot be computed in practice and must be approximated. Two approaches to measure complexity differences between time series are briefly described below.

Compression-based dissimilarity measures
The Kolmogorov complexity K(x) of an object x is the length of the shortest program capable of producing x on a universal computer, such as a Turing machine (Li and Vitányi 2007).
Thus, K(x) is the minimal quantity of information required to generate x by an algorithm, and therefore the level of complexity of x is related to K(x). Analogously, given two objects x and y, the conditional Kolmogorov complexity K(x|y) of x given y is defined as the length of the shortest program producing x when y is given as an auxiliary input. Therefore, K(x) − K(x|y) measures the amount of information that y produces about x. Based on these concepts, Li et al. (2004) propose a normalized information distance (NID) between two objects x and y given by d_NID(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}. Li et al. show that d_NID is a metric taking values in [0, 1] and that it is universal in the following sense: d_NID leads to the smallest values (up to constant precision) among a broad class of normalized distances. The metric d_NID can be applied to different collections of objects such as time series, images, texts, etc. From now on we assume that x and y represent the time series X_T and Y_T, respectively.
As Kolmogorov complexity is not computable, d_NID is approximated by replacing the quantities K(·) by the lengths of the compressed objects obtained from data compressors such as gzip, bzip2, etc. Consider a specific data compression algorithm and denote by C(X_T) the compressed size of X_T. The denominator of d_NID is easy to approximate by max{C(X_T), C(Y_T)}, but the numerator involves conditional Kolmogorov complexities, which are more difficult to approximate. Li et al. (2004) overcome this drawback by taking into account that K(x|y) is roughly equal to K(xy) − K(y), where K(xy) is the length of the shortest program to compute the concatenation of x and y. The resulting approximation is called the "normalized compression distance" (NCD) and takes the form d_NCD(X_T, Y_T) = (C(X_T Y_T) − min{C(X_T), C(Y_T)}) / max{C(X_T), C(Y_T)}, where C(X_T Y_T) denotes the compressed size of the concatenation of X_T and Y_T. The NCD dissimilarity takes non-negative values ranging from 0 to 1 + ε, where ε is due to flaws in the compression techniques. The smaller d_NCD(X_T, Y_T) is, the more closely related X_T and Y_T are. The optimality properties satisfied by the theoretical version d_NID do not directly hold for d_NCD, although Cilibrasi and Vitányi (2005) show that d_NCD satisfies the metric properties and approximates optimality when a specific notion of "normal" compressor is considered. The axiomatic notion of a normal compressor includes symmetry as one of the basic features, and the authors argue that all standard compressors are asymptotically normal. The good performance of d_NCD has been illustrated by Cilibrasi and Vitányi (2005) on a wide set of experiments in areas as diverse as genomics, virology, languages, literature, music, handwritten digits and astronomy. Moreover, the dissimilarity turns out to be robust under changes of the underlying compressor type: statistical (PPMZ), Lempel-Ziv dictionary based (gzip) or block based (bzip2). Keogh et al. (2004) (see also Keogh et al.
2007) use a simplified version of the NCD dissimilarity called the "compression-based dissimilarity measure" (CDM), defined as d_CDM(X_T, Y_T) = C(X_T Y_T) / (C(X_T) + C(Y_T)). The CDM dissimilarity ranges from 1/2 to 1, where 1/2 indicates pure identity and 1 indicates maximum discrepancy. Although Keogh et al. (2004) do not provide a theoretical analysis (d_CDM is not a metric), they emphasize that their simpler proposal produces successful results in clustering and anomaly detection.
Other compression-based dissimilarity measures have been proposed in the literature. In particular, four different compression-based measures, including NCD and CDM, were compared in an experimental study by Sculley and Brodley (2006), who conclude that the four examined methods lead to very similar results. In TSclust, d_NCD and d_CDM are implemented using the compression function memCompress() included in the base package of R.
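Since memCompress() is available in base R, the NCD and CDM formulas can be sketched directly. The functions ncd() and cdm() below are illustrative stand-ins for the TSclust implementations: series are coerced to character strings before compression, which is one reasonable convention, not necessarily the one used by the package.

```r
# Base-R sketch of NCD and CDM using gzip as the compressor C(.)
# (ncd/cdm/csize are hypothetical names, not the TSclust API).
csize <- function(x) {
  # compressed size (in bytes) of the character representation of x
  length(memCompress(as.character(x), type = "gzip"))
}
ncd <- function(x, y) {
  cxy <- length(memCompress(as.character(c(x, y)), type = "gzip"))
  (cxy - min(csize(x), csize(y))) / max(csize(x), csize(y))
}
cdm <- function(x, y) {
  cxy <- length(memCompress(as.character(c(x, y)), type = "gzip"))
  cxy / (csize(x) + csize(y))
}
```

As expected from the definitions, both measures return noticeably smaller values when a series is compared with itself than when it is compared with an unrelated series.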

Permutation distribution clustering
Permutation distribution clustering (PDC) represents an alternative complexity-based approach to clustering time series. Dissimilarity between series is described in terms of the divergence between the distributions of order patterns in m-dimensional embeddings of the original series. Specifically, given X_T, an m-dimensional embedding is constructed by considering X′_m = {(X_t, X_{t+1}, . . . , X_{t+m−1}), t = 1, . . . , T − m + 1}. Then, for each X_m ∈ X′_m, the permutation Π(X_m) obtained by sorting X_m in ascending order (the so-called codeword of X_m) is recorded, and the distribution of these permutations on X′_m, P(X_T) (the so-called codebook of X_T), is used to characterize the complexity of X_T. The dissimilarity between two time series X_T and Y_T is then measured in terms of the dissimilarity between their codebooks P(X_T) and P(Y_T). Brandmaier (2011) establishes this dissimilarity as the α-divergence between codebooks. The α-divergence concept (Amari 2007) generalizes the Kullback-Leibler divergence, and the parameter α can be chosen to obtain a symmetric divergence. An interesting discussion on the use of different divergence measures and the nice properties of the permutation distribution for performing clustering (computational efficiency, robustness to drift, invariance to differences in mean and variance, among others) can be found in Brandmaier (2011).
The choice of the embedding dimension m is the crucial point of the PDC approach. A small value of m might lead to a permutation distribution with low representational power, while a large value of m quickly leads to less reliable estimates of the codeword frequencies. Brandmaier (2011) proposes a heuristic procedure to automatically select the embedding dimension, thus yielding a parameter-free clustering method.
A package specifically devoted to this approach, pdc (Brandmaier 2014), is available on the CRAN package repository. For this reason, the dissimilarity based on permutation distributions has not been reimplemented in our package; it is computed directly by the corresponding function in pdc, which is automatically loaded together with TSclust.
A complexity-invariant dissimilarity measure

Batista et al. (2011) argue that, under many dissimilarity measures, pairs of time series with high levels of complexity frequently tend to be further apart than pairs of simple series. This way, complex series are incorrectly assigned to classes with less complexity. In order to mitigate this effect, the authors propose to use information about the complexity difference between two series as a correction factor for existing dissimilarity measures. Specifically, a general complexity-invariant dissimilarity measure called CID is defined as d_CID(X_T, Y_T) = CF(X_T, Y_T) · d(X_T, Y_T), where d(X_T, Y_T) denotes a conventional raw-data distance and CF(X_T, Y_T) is a complexity correction factor given by CF(X_T, Y_T) = max{CE(X_T), CE(Y_T)} / min{CE(X_T), CE(Y_T)}, with CE(X_T) a complexity estimator of X_T. If all series have the same complexity, then d_CID(X_T, Y_T) = d(X_T, Y_T). Nevertheless, an important complexity difference between X_T and Y_T translates into an increase of the dissimilarity between them. The complexity estimator considered by Batista et al. is very simple and consists in computing CE(X_T) = sqrt(Σ_{t=1}^{T−1} (X_{t+1} − X_t)²). The CID method is intuitive, parameter-free, invariant to the complexity of time series, computationally efficient, and it has produced improvements in accuracy in several clustering experiments carried out by Batista et al. (2011). In TSclust, d_CID has been implemented using the Euclidean distance as the dissimilarity measure d.
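The CID definition is simple enough to sketch directly in base R with the Euclidean distance as the base measure d, as in TSclust. The names ce() and cid() are illustrative.

```r
# Base-R sketch of the CID dissimilarity (ce/cid are hypothetical names).
ce <- function(x) sqrt(sum(diff(x)^2))           # complexity estimate CE(.)
cid <- function(x, y) {
  ed <- sqrt(sum((x - y)^2))                     # Euclidean base distance
  cf <- max(ce(x), ce(y)) / min(ce(x), ce(y))    # complexity correction factor
  cf * ed
}
```

Since CF ≥ 1 by construction, d_CID never decreases the base distance; it equals it exactly when the two complexity estimates coincide.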

Prediction-based approaches
We now focus on a different dissimilarity notion governed by the performance of future forecasts, i.e., two time series are similar if their forecasts at a specific future time are close. Obviously, a clustering procedure based on this dissimilarity concept may produce results very different from those generated by model-based or feature-based clustering methods. For instance, two time series coming from the same generating process can produce different forecasts at a pre-specified horizon, and hence these series might not be clustered together under this dissimilarity criterion. There are many practical situations where the real interest of the clustering rests directly on the properties of the predictions, as in sustainable development problems or in situations where the concern is to reach target values in pre-specified future periods. Alonso et al. (2006) proposed a dissimilarity measure based on comparing the forecast densities of each series at a future horizon of interest. They argue that using full forecast densities makes it possible to take into account the variability of the predictions, which is completely ignored when comparisons are based on point forecasts. In practice, the forecast densities are unknown and must be approximated from the data. Alonso et al. construct this approximation using a smoothed sieve bootstrap procedure combined with kernel density estimation techniques. This procedure requires assuming that the time series admit an AR(∞) representation, because the sieve bootstrap is based on resampling residuals from autoregressive approximations. Vilar et al. (2010) extend this methodology to cover the case of nonparametric models of arbitrary autoregressions. In this new scenario, the sieve bootstrap is not valid, and the forecast densities are approximated by a bootstrap procedure that mimics the generating processes without assuming any parametric model for the true autoregressive structure of the series.
The most general procedure proposed by Vilar et al. (2010) has been implemented in TSclust, thus allowing the classification of general autoregressive models, including extensively studied parametric models such as the threshold autoregressive (TAR), the exponential autoregressive (EXPAR), the smooth transition autoregressive (STAR) and the bilinear models, among others (see Tong and Yeung 2000, and references therein).
Specifically, let X_T and Y_T be realizations of stationary processes admitting a general autoregressive representation of the form S_t = ϕ(S_{t−1}) + ε_t, with {ε_t} an i.i.d. sequence and ϕ(·) a smooth function not restricted to any pre-specified parametric model. Given a particular future time T + h, Vilar et al. introduce the following distance between X_T and Y_T: d_PRED,h(X_T, Y_T) = ∫ |f̂_{X_{T+h}}(u) − f̂_{Y_{T+h}}(u)| du, where f̂_{X_{T+h}} and f̂_{Y_{T+h}} denote estimates of the forecast densities at horizon T + h for X_T and Y_T, respectively.
By construction, d_PRED,h(X_T, Y_T) measures the distance between X_T and Y_T in terms of the disparity between the behaviors of their predicted values at horizon T + h. The true forecast densities are replaced by kernel-type estimators based on bootstrap predictions. Although different resampling procedures can be used to obtain the bootstrap predictions, we consider a bootstrap procedure based on generating a process S*_t = ϕ̂_g(S*_{t−1}) + ε*_t, where ϕ̂_g is a nonparametric estimator of ϕ and {ε*_t} is a conditionally i.i.d. resample from the nonparametric residuals. This bootstrap method, called autoregression bootstrap, completely mimics the dependence structure of the underlying process. The autoregression bootstrap uses an approach similar to the residual-based resampling of linear autoregressions, but it has the advantage of being free of the linearity requirement, and hence it can be applied to a wider class of nonparametric models. A detailed description of the steps involved in generating a set of bootstrap predictions can be found in Vilar et al. (2010).
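To convey the flavor of the construction, the sketch below uses a fitted linear AR(1) with residual resampling as a stand-in for the nonparametric estimator ϕ̂_g (a strong simplification of the autoregression bootstrap of Vilar et al.), builds kernel estimates of the h-step forecast densities and computes the L1 distance between them on a common grid. All function names are illustrative.

```r
# Simplified illustration of a prediction-based dissimilarity: a linear AR(1)
# residual bootstrap replaces the nonparametric autoregression (hypothetical
# helper names; TSclust implements the full procedure).
boot_pred <- function(x, h = 1, B = 200) {
  fit <- ar(x, order.max = 1, aic = FALSE)     # crude AR(1) stand-in for phi
  res <- na.omit(fit$resid)
  res <- res - mean(res)                       # centered residuals to resample
  replicate(B, {
    s <- tail(x, 1)
    for (i in seq_len(h))                      # iterate the fitted recursion
      s <- fit$x.mean + fit$ar * (s - fit$x.mean) + sample(res, 1)
    s
  })
}
pred_dist <- function(x, y, h = 1, B = 200) {
  fx <- density(boot_pred(x, h, B))            # kernel forecast density of x
  fy <- density(boot_pred(y, h, B))            # kernel forecast density of y
  grid <- seq(min(fx$x, fy$x), max(fx$x, fy$x), length.out = 512)
  fxg <- approx(fx$x, fx$y, grid, yleft = 0, yright = 0)$y
  fyg <- approx(fy$x, fy$y, grid, yleft = 0, yright = 0)$y
  sum(abs(fxg - fyg)) * diff(grid[1:2])        # L1 distance between densities
}
```

Two series with the same dynamics but shifted levels yield nearly disjoint forecast densities, so the L1 distance approaches its maximum value of 2.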

Tools for time series clustering
The dissimilarity measures presented in Section 2 and included in TSclust are especially useful for performing clustering on a given set of m time series X_T^(1), . . . , X_T^(m). Once the dissimilarity measure is selected according to the desired clustering purpose, a pairwise dissimilarity matrix D can be obtained and taken as the starting point of a conventional clustering algorithm. Many clustering tools are available through R packages and can be used with dissimilarities generated from TSclust. Nevertheless, some additional tools often used in time series clustering, but also useful outside this domain, have been implemented in TSclust. In this section, we begin by describing the clustering utilities implemented in TSclust and continue with a quick overview of some available options in R to perform, validate and visualize clustering. This overview is not intended to be a comprehensive list, but only a starting point to illustrate how to take advantage of TSclust by interoperating with functions from other packages.

A hierarchical clustering algorithm based on p values
The algorithm, introduced by Maharaj (2000), takes as its starting point the m × m matrix P = (p_{i,j}), whose (i, j)-th entry, p_{i,j}, for i ≠ j, corresponds to the p value obtained by testing whether or not X_T^(i) and X_T^(j) come from the same generating model. The algorithm then proceeds in a similar way to an agglomerative hierarchical clustering based on P, although in this case it only groups together those series whose associated p values are greater than a significance level α previously specified by the user. In other words, the i-th series X_T^(i) will merge into a specific cluster C_k formed by the series X_T^(j_1), . . . , X_T^(j_{m_k}) if and only if p_{i,j_l} ≥ α for all l = 1, . . . , m_k. Analogously, two clusters will be joined together if and only if the p values of all pairs of series across the two clusters are greater than α. This algorithm behaves similarly to the single linkage procedure because the dissimilarity between two clusters is the smallest dissimilarity (the greatest p value) between series in the two groups. Unlike single linkage, however, two clusters will not be joined by the Maharaj's algorithm when a significant difference between any pair of series of the candidate clusters is found. Note also that, unlike conventional hierarchical methods, this algorithm has the advantage of automatically providing the number of clusters, which obviously depends on the prefixed significance level. Furthermore, the compactness of each cluster can be evaluated by examining the p values within it. The flow chart of this clustering algorithm can be seen in Figure 3 of Maharaj (2000), and it is available in TSclust by invoking the function pvalues.clust().
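The grouping criterion can be illustrated with a deliberately simplified base-R sketch: a series joins an existing cluster only if its p values against all current members exceed α. This greedy pass over the series (the name pclust() is hypothetical) only conveys the rule; TSclust's pvalues.clust() follows the full algorithm of Maharaj (2000).

```r
# Simplified greedy illustration of the p-value grouping rule
# (pclust is a hypothetical name, not the TSclust API).
pclust <- function(P, alpha) {
  m <- nrow(P)
  groups <- rep(0L, m)
  for (i in seq_len(m)) {
    placed <- FALSE
    for (g in seq_len(max(groups))) {
      members <- which(groups == g)
      # join cluster g only if every pairwise p value exceeds alpha
      if (all(P[i, members] >= alpha)) { groups[i] <- g; placed <- TRUE; break }
    }
    if (!placed) groups[i] <- max(groups) + 1L   # open a new cluster
  }
  groups
}
```

For a 3 × 3 p-value matrix where series 1 and 2 are mutually compatible but series 3 differs significantly from both, the rule yields the partition {1, 2} and {3}.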

Cluster evaluation
Two clustering evaluation criteria based on known ground truth have been implemented in TSclust. The first consists in calculating an index that measures the amount of agreement between the true cluster partition G = {G_1, . . . , G_k} (the "ground truth"), assumed to be known, and the experimental cluster solution A = {A_1, . . . , A_k} obtained by the clustering method under evaluation. The similarity index Sim(G, A) is defined by Sim(G, A) = (1/k) Σ_{i=1}^{k} max_{1≤j≤k} Sim(G_i, A_j), where Sim(G_i, A_j) = 2|G_i ∩ A_j| / (|G_i| + |A_j|). The other evaluation method implemented in TSclust uses a one-nearest-neighbour (1-NN) classifier evaluated by leave-one-out cross-validation. Given the true cluster partition G = {G_1, . . . , G_k} and a particular dissimilarity matrix D (i.e., a dist object), the 1-NN classifier assigns each series X_T^(i) to the element of G containing the nearest series X_T^(j), j ≠ i, according to the dissimilarities in D. The proportion of correctly classified series is then calculated. Note that this criterion directly evaluates the efficacy of the dissimilarity measure regardless of the clustering algorithm considered. This evaluation procedure is intuitive, straightforward to implement, parameter-free and asymptotically optimal in the Bayes sense (Tan, Steinbach, and Kumar 2006). These reasons support its intensive use in a broad range of pattern recognition applications, including time series clustering (see, e.g., Keogh and Kasetty 2003). In TSclust, the evaluation based on the 1-NN classifier is available by invoking the function loo1nn.cv(), whose implementation allows us to deal with ties. Specifically, tied dissimilarities are resolved by majority vote. If ties occur in voting, a candidate cluster is selected at random and a warning is produced.
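The similarity index above can be sketched in base R for partitions encoded as label vectors. The name sim_index() is illustrative (TSclust provides cluster.evaluation() for this purpose), and the sketch assumes the Dice-type per-cluster overlap given in the definition.

```r
# Base-R sketch of the partition similarity index Sim(G, A)
# (sim_index is a hypothetical name, not the TSclust API).
sim_index <- function(g, a) {
  Gs <- split(seq_along(g), g)   # true clusters as index sets
  As <- split(seq_along(a), a)   # experimental clusters as index sets
  # for each true cluster, take the best-matching experimental cluster
  mean(sapply(Gs, function(Gi) {
    max(sapply(As, function(Aj)
      2 * length(intersect(Gi, Aj)) / (length(Gi) + length(Aj))))
  }))
}
```

The index equals 1 when the two partitions coincide and decreases toward 0 as they diverge.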

Some available R clustering tools
As already mentioned, TSclust should not be seen as a stand-alone package but as a user-extensible framework whose functionality is strengthened by interfacing with existing general clustering methods. To this end, a small sample of clustering utilities available in the R system and helpful for interacting with outcomes from TSclust is summarized below.
First, time series dissimilarities integrated in TSclust will usually be the starting point to conduct a clustering algorithm. A wide range of functions to develop different cluster algorithms are available in different R packages. For instance, primary functions to perform hierarchical clustering are hclust() in the package stats (R Core Team 2014) and agnes() in the package cluster (Maechler, Rousseeuw, Struyf, Hubert, and Hornik 2014). The cluster package also includes the functions diana() and pam() to carry out divisive hierarchical clustering and partitioning clustering around "medoids", respectively.
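A typical pipeline feeds a dist object into one of these algorithms. The toy example below uses a plain Euclidean dist for self-containment; in practice the dist object would come from a TSclust dissimilarity function.

```r
# Toy pipeline: dissimilarity matrix -> hierarchical clustering -> partition.
# dist() here is a stand-in for a dissimilarity matrix produced by TSclust.
set.seed(42)
series <- matrix(rnorm(300), ncol = 3)   # 3 toy series of length 100 (columns)
D <- dist(t(series))                     # pairwise distances between the series
hc <- hclust(D, method = "complete")     # agglomerative hierarchical clustering
clusters <- cutree(hc, k = 2)            # hard partition with 2 clusters
```

The same dist object could equally be passed to cluster::agnes(), cluster::diana() or cluster::pam() mentioned above.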
On the other hand, a variety of measures aimed at validating the results of a cluster analysis and comparing different cluster solutions are available in several R libraries. For example, the function clValid() in the package clValid (Brock, Pihur, Datta, and Datta 2008) reports a broad range of measures of clustering validation, and the function cluster.stats() in the package fpc (Hennig 2014) provides a number of distance-based statistics, which can be used for cluster validation, comparison between clustering solutions and decision about the number of clusters. Additional cluster validation techniques can be also found in the package clv (Nieweglowski 2013). Some of the tools described in this section are used in the illustrative examples of Section 5.

Considerations on the dissimilarity measure selection
As TSclust provides a broad range of dissimilarity measures to perform clustering of time series, some considerations on the choice of a proper dissimilarity are given in the present section. First of all, it is worth stressing that TSclust should not be used as a test bed where all available dissimilarities are executed and the one reporting the "best" results is selected.
In particular, what constitutes a "good" clustering is often unclear, as the perception of a good clustering differs across users. Although interesting experimental comparisons of several dissimilarity measures are available in the literature (see, e.g., Keogh and Kasetty 2003; Ding, Trajcevski, Scheuermann, Wang, and Keogh 2008; Pértega and Vilar 2010), we recommend users of TSclust to choose the proper criterion according to the nature and specific purpose of the clustering task. Only by doing so will the cluster solution admit an interpretation in terms of the grouping target.
A first important issue is to decide whether clustering must be governed by a "shape-based" or a "structure-based" dissimilarity concept (see, e.g., Lin and Li 2009; Corduas 2010). Shape-based dissimilarity aims to compare the geometric profiles of the series, or alternatively, representations of them designed to reduce the dimensionality of the problem. Thus, the shape-based dissimilarity principle is mainly dominated by local comparisons. On the other hand, structure-based dissimilarity aims to compare underlying dependence structures. Higher-level dynamic structures describing the global performance of the series must be captured and compared in this case. To shed some light on the differences between both dissimilarity concepts, consider a simple and intuitive synthetic dataset of 9 time series generated from three different patterns, say P1, P2 and P3. All of them are displayed in Figure 2(a). The underlying patterns are identified by colour and type of line: blue solid lines for P1, red dashed lines for P2 and black dotted lines for P3. The profiles of the P1 and P2 series move close together within a narrow band, so the P1 and P2 series are the closest ones if similarity is measured in terms of proximity between geometric profiles. If, e.g., a shape-based dissimilarity like the Euclidean distance is used to perform clustering on this dataset, then the P1 and P2 series are placed together, forming a mixed cluster (see Figure 2(b)). On the other hand, it can be seen that P1 and P3 have increasing/decreasing patterns closer to each other than to P2. This way, the P1 and P3 series become the closest ones if similarity is understood in terms of underlying correlation structures. In fact, P1 and P3 are the first merged to form a cluster when clustering is carried out using a structure-based dissimilarity like d_COR or d_CORT with k > 1 (see Figure 2(c)).
When the interest focuses on shape-based dissimilarity, conventional distances between raw data (e.g., of L_p type) or complexity-based measures (e.g., the CID dissimilarity) can produce satisfactory results, although sometimes measures invariant to specific distortions of the data may be required. For instance, time series recorded on different scales require prior normalization to cancel differences in amplitude and thus properly match similar shapes. In fact, conventional metrics like the Minkowski, DTW or Fréchet distances may lead to common misunderstandings if the time series are not previously normalized (see the illustrative examples in Rakthanmanon et al. 2012). Also, many biological series (e.g., recordings of motion capture data) are efficiently compared using dissimilarities invariant to local scaling (warping). DTW-based techniques work especially well in these scenarios (Ding et al. 2008). Nevertheless, it is worth remarking that the high computational cost of dynamic time warping makes it a less desirable choice for long time series.
Figure 2: Dendrograms from clustering based on the Euclidean distance (shape-based dissimilarity, panel b) and the d_CORT dissimilarity with parameter k = 2 (structure-based dissimilarity, panel c).

Overall, shape-based dissimilarities work well with short time series, but they can fail with long sequences, especially when a high amount of noise or anomalous records is present. In these situations, a structure-based dissimilarity aimed at comparing global underlying structures can be more appropriate. This is frequently the case when performing cluster analysis of economic or financial indicators. A range of feature- and model-based dissimilarities are included in TSclust to be used when a structure-based dissimilarity is required. Note that some of these dissimilarity measures assume regularity conditions for the series at hand, and users must be aware of this. For example, the model-based measures included in TSclust are only applicable to time series with stationary and linear underlying processes, while the prediction-based measures are free of the linearity requirement but assume autoregressive structures of lag one.
Prediction-based dissimilarity responds to a different and easily interpretable clustering target. The real interest is not in grouping series showing similar shapes or underlying structures, but series with similar forecasts at a specific future time. Figure 3 illustrates this idea. Cluster analysis of the depicted time series based on a shape- or structure-based dissimilarity will obviously produce an entirely different result from one based on the forecasts at a specific horizon.
Once the clustering objectives are made clear, the range of proper measures becomes more limited, and other suitable properties must then be considered. For example, most dissimilarity measures introduced in TSclust require setting a number of input parameters, which might produce an undesirable variability of results. Although the implementation of some measures, such as d_DWT or d_PDC, includes automatic procedures to determine these values, a careful adjustment of these parameters is recommended to find the true underlying patterns. Practitioners should therefore consult the key references to understand how adequate values can be determined in each case.
Computational complexity is also an important point. Although all measures in TSclust have been implemented using efficient algorithms available in R, some of them include procedures with a high computational cost. For instance, the measure d_W(LK) involves numerical integration of differences between local linear smoothers computed by maximum local likelihood, which implies repeatedly solving an optimization problem in two variables. Prediction-based measures construct kernel density estimators based on bootstrap predictions, thus increasing the computational complexity by a factor B, where B is the number of bootstrap samples. As a result, both dissimilarity measures have a significantly high computational complexity and could be unfeasible for clustering databases including very long series. Note also that the computing times of some measures can vary largely according to the values of the dependent parameters. For instance, the autocorrelation-based distances and the Maharaj's distance have computing times O(T × lag.max) and O(T), respectively, but these times become O(T³) when the user specifies a matrix of weights Ω in the autocorrelation-based measures or considers the extended version of the Maharaj's distance.

Figure 3: Illustrative example showing the two clusters obtained with any shape- or structure-based dissimilarity (evaluated on the observation period) and the two clusters derived from cluster analysis based on point predictions at a specific future time. Note that both groupings differ.
Attention must also be paid to the choice of the clustering algorithm, which should not distort the clustering process by modifying the pre-established dissimilarity notion. In this sense, some well-known clustering methods may not work properly. For instance, the popular k-means algorithm moves each series to the cluster whose centroid is closest, recalculates the cluster centroid and repeats the assignment procedure until no time series is reassigned. Therefore, k-means involves computing dissimilarities between time series and "centroids of a set of time series", which might not be properly defined. If, for example, the prediction-based metric d_PRED,h given in Equation 4 is considered, then a centroid would be a kernel prediction density generated from an average of different series, which is not reasonable at all. Similar arguments apply to other dissimilarity measures introduced in Section 2. In addition, if a hierarchical method is considered, caution is needed when Ward's method is chosen as the linkage criterion, because this algorithm is based on Euclidean distances between objects.

Illustrative use of TSclust package
In this section, the use of TSclust is illustrated by working on several sets of real and synthetic time series included in the package itself. Note that the examples with real data are considered here to show the usefulness of TSclust, not to answer substantive questions about the data. Since TSclust provides a considerable number of dissimilarity measures between time series, it is first worth getting acquainted with the functions that compute these dissimilarities, which are summarized in Table 1. Given a set of time series, the functions in Table 1 are mainly used to create a pairwise dissimilarity matrix (a dist object) that serves as the basis for several R clustering utilities.

An example with a synthetic dataset
For testing and comparison purposes, TSclust includes a collection of eighteen synthetic time series available by loading synthetic.tseries. More precisely, synthetic.tseries consists of three partial realizations of length T = 200 from each of the six first-order autoregressive models enumerated in Table 2.
In all cases, ε t are i.i.d. Gaussian random variables with zero mean and unit variance. These models have been selected to include both linear and non-linear structures. Model M1 is an AR(1) process with moderate autocorrelation. Model M2 is a bilinear process with approximately quadratic conditional mean, and is thus strongly non-linear. Model M3 is an exponential autoregressive model with a more complex non-linear structure, although very close to linearity. Model M4 is a self-exciting threshold autoregressive model with a relatively strong non-linearity. Finally, Models M5, a general non-linear autoregressive model, and M6, a smooth transition autoregressive model, present a weak non-linear structure. Therefore, the selected models represent different non-linear structures for the conditional mean, thus providing a valuable scenario for examining the clustering performance of the dissimilarity measures implemented in TSclust. In fact, some of these models were previously considered by Vilar et al. (2010) in a broad simulation study conducted with this objective. Here, our interest is mainly illustrative, and for this reason our analysis is limited to this set of series. Indeed, a more extensive analysis including a large number of simulated replicates would be required to draw rigorous conclusions.
We proceed as follows. Series in synthetic.tseries are subjected to clustering by using different dissimilarity measures and several clustering algorithms. Assuming that the clustering is governed by the similarity between underlying models, the "true" cluster solution is given by the six clusters involving the three series from the same generating model. Then, the experimental cluster solutions are compared with the true cluster solution by using the cluster similarity function cluster.evaluation() included in TSclust.
The first steps are loading the package and the dataset, and creating the true cluster solution.
R> library("TSclust")
R> data("synthetic.tseries")
R> true_cluster <- rep(1:6, each = 3)

We start by testing the dissimilarity measure d IP , which computes the L 2 distance between integrated periodograms and is evaluated in TSclust by invoking diss.INT.PER(). The wrapper function diss() provides an easy way to apply any of the TSclust dissimilarity functions to a matrix of time series data. Function diss() takes a dataset and a string specifying the TSclust dissimilarity to be used, and returns a dist object with all the pairwise dissimilarities. The dataset can be passed as an mts object, or alternatively as a matrix, list or data.frame object. Additional arguments required by the particular dissimilarity must also be passed to diss(). In our example, diss() is used as follows.
R> IP.dis <- diss(synthetic.tseries, "INT.PER")

Now the object IP.dis contains the dissimilarity matrix obtained by applying d IP to each pair of series in synthetic.tseries, and can be taken as the starting point for several conventional clustering methods. For example, we use hclust() of the base package stats to conduct a hierarchical clustering algorithm, and obtain the six-cluster solution with the cutree() function of stats.
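The hierarchical step just described can be sketched as follows (a minimal sketch; complete linkage is the hclust() default):

```r
# Hierarchical clustering on the integrated-periodogram dissimilarities,
# cutting the tree at k = 6 and comparing with the true partition
IP.hclus <- cutree(hclust(IP.dis), k = 6)
cluster.evaluation(true_cluster, IP.hclus)
```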
R> library("cluster")
R> IP.pamclus <- pam(IP.dis, k = 6)$clustering
R> cluster.evaluation(true_cluster, IP.pamclus)

[1] 0.8888889

Recall that the output of cluster.evaluation() lies between 0 and 1 and admits a simple interpretation: the closer it is to 1, the higher the agreement between the true and experimental partitions. It can therefore be concluded that the model-free measure d IP produces good results in this case, showing its best performance when the PAM algorithm is used. In fact, the cluster solution obtained with the PAM algorithm misclassifies only two series. Any dissimilarity function can be tested in a similar way. Consider for example the dissimilarity based on the simple autocorrelations, d ACF .
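A sketch of the same workflow with the ACF-based dissimilarity follows; the p argument (geometric decay of the weights given to the lags) is an illustrative choice, not a recommended value:

```r
# ACF-based dissimilarity matrix, then hierarchical and PAM six-cluster
# solutions evaluated against the true partition
ACF.dis <- diss(synthetic.tseries, "ACF", p = 0.05)
ACF.hclus <- cutree(hclust(ACF.dis), k = 6)
cluster.evaluation(true_cluster, ACF.hclus)
ACF.pamclus <- pam(ACF.dis, k = 6)$clustering
cluster.evaluation(true_cluster, ACF.pamclus)
```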
Let us now consider the model-based dissimilarity d MAH,p introduced by Maharaj (2000), which is implemented in the function diss.AR.MAH(). As d MAH,p produces p values obtained by testing the equality of the generating ARMA models, we can use the hierarchical algorithm pvalues.clust() based on p values described in Section 3. The number of clusters is automatically established by the clustering algorithm depending on the threshold significance specified in significance. Using a 0.05 significance level, only four clusters are identified. The AR(1) series are correctly placed in one group, the exponential autoregressive series (EXPAR) form another group although one of them is erroneously left out, and the two remaining groups mainly place together the SETAR and STAR structures and the bilinear and NLAR series, respectively. This poor result is indeed expected, because d MAH,p assumes that the generating processes are ARMA models. Note that by increasing the threshold significance, the pairwise equality tests become less conservative, and thus a greater number of clusters can be found. For instance, setting significance = 0.6 in our example, six clusters are found.
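The procedure just described can be sketched as follows (a minimal sketch; with method "AR.MAH", diss() returns a list whose p_value component feeds pvalues.clust()):

```r
# p values from pairwise tests of equality of generating ARMA models,
# then p value-based hierarchical clustering at a 0.05 significance level
AR.MAH.PVAL <- diss(synthetic.tseries, "AR.MAH")$p_value
AR.MAH.clus <- pvalues.clust(AR.MAH.PVAL, significance = 0.05)
```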

Clustering interest rate series
Our second example deals with long-term interest rate data included in TSclust and available by loading interest.rates. This dataset is formed by 18 time series of length T = 215 representing monthly long-term interest rates (10-year bonds), expressed as percentages per annum, for seventeen countries and the Economic and Monetary Union (EMU). The observation period goes from January 1995 to November 2012 (OECD source: http://www.oecd.org/). Long-term government bonds are the instrument whose yield is used as the representative "interest rate" for each area, and hence these indicators are a vital tool of monetary policy and are taken into account when dealing with variables like investment, inflation, and unemployment. In particular, one of the convergence criteria required for all countries in the euro area is related to this indicator. A lattice plot of the series in interest.rates is shown in Figure 4.
Data in interest.rates are used here to perform clustering from two different points of view. First, the original series are clustered according to the behavior of their predictions at a given horizon. Identifying groups of countries with similar predictions for their long-term interest rates is a meaningful objective due to the influence of these indicators on monetary policy actions. In any case, our application merely intends to illustrate the usefulness of TSclust for this objective, without studying this interesting issue in depth. The proper function to carry out prediction-based clustering is diss.PRED(), which computes the L 1 distance between bootstrap approximations of the h-steps-ahead forecast densities for two given time series. Here we search for countries with similar six-months-ahead predictions (h = 6). As the bootstrap mechanism requires stationarity and the time series under study are clearly non-stationary, the original series are first transformed by taking logarithms (if required) and an appropriate number of regular differences. Bootstrap prediction-paths are generated from the transformed series, and then back-transformed to obtain bootstrap predictions for the original series. All these steps are automatically carried out by diss.PRED(). The required arguments for this function are the horizon of interest, h, the number of bootstrap resamples, B, and how the transformation must be carried out, that is, whether or not logarithms are taken (logarithm.x and logarithm.y) and the number of regular differences to be applied (differences.x and differences.y). We start by comparing the predictions for Finland and USA (columns 13 and 16 of interest.rates, respectively).
R> data("interest.rates")
R> diss.PRED(interest.rates[, 13], interest.rates[, 16], h = 6, B = 2000,
+    logarithm.x = TRUE, logarithm.y = TRUE, differences.x = 1,
+    differences.y = 1, plot = TRUE)$L1dist

[1] 0.2248442

The plot generated by diss.PRED() is shown on the right part of Figure 6. Note that the prediction densities have similar means but differ in their shapes (the forecast density of Finland presents a higher kurtosis), and this feature is captured by the L 1 distance between prediction densities (taking the value 0.2248442 in this particular case). Prediction-based dissimilarities can also be computed for the whole original dataset. For instance, assuming that the recommended transformation for all series in interest.rates is to take one difference of the logarithm, the dissimilarity matrix based on the six-months-ahead prediction density estimates can be obtained as follows.
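A sketch of this computation with diss() is given below; our reading of the interface is that the vector arguments logarithms and differences specify the transformation applied to each of the 18 series, and B is kept moderate for speed:

```r
# Prediction-based dissimilarity matrix for the whole dataset: one regular
# difference of the logarithm is applied to every series before bootstrapping
dpred <- diss(interest.rates, "PRED", h = 6, B = 1200,
              logarithms = rep(TRUE, 18), differences = rep(1, 18),
              plot = TRUE)
```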
R> hc.dpred <- hclust(dpred$dist)
R> plot(hc.dpred)

Figure 5 shows the resulting dendrogram. The four-cluster solution seems to determine reasonably well-separated groups, namely C 1 = {Ireland, Italy, Spain, Portugal}, C 2 = {Germany, Sweden, Denmark}, C 3 = {Switzerland, Japan}, and the fourth cluster C 4 formed by the remaining countries. Indeed, if a finer partition identifying more compact groups is required, then C 4 can be split into two homogeneous subclusters, and Portugal (and to a lesser extent Spain and Denmark) would leave their groups to appear as isolated points. To gain insight into the clustering, all the estimated forecast densities have been jointly depicted in Figure 6 by using the option plot = TRUE in the function call. The countries in C 1 are those most affected by the debt crisis, presenting the largest predictions for their long-term interest rates. All of them show a high amount of variability in their prediction densities (especially Portugal), thus increasing the uncertainty about their future interest rates. By contrast, Japan and Switzerland, forming C 3 , are the countries with the lowest predictions and the least uncertainty. C 1 and C 3 are perfectly separated and constitute very stable clusters because there is no overlap between densities from these two groups. Clusters C 2 and C 4 collect countries with intermediate predictions, also well-separated from C 1 and C 3 . Denmark, Germany and Sweden, forming C 2 , present predictions close to the lowest ones from C 3 (especially Denmark), and the countries in C 4 show somewhat higher predictions, in any case significantly lower than those of the countries in C 1 . It is worth remarking that a classification based on pointwise predictions differs from a classification based on prediction densities. For example, Portugal and Spain are clearly isolated points if a pointwise prediction approach is considered, while they fall in the same cluster under the density criterion.
Now, a different clustering problem based on the same interest rate dataset is posed. Specifically, we are interested in performing cluster analysis on the series transformed by taking the first difference of the logarithms, i.e., by using the series Y (i) t = log X (i) t − log X (i) t−1 , i = 1, . . . , 18. This transformation approximately measures the percentage change in the interest rates from month to month, and therefore this new cluster analysis intends to group together those countries with similar monthly increases in their interest rates. A lattice plot of the new time series is shown in Figure 7.
The clustering aim is now to identify similar dependence structures hidden behind the wiggly and noisy profiles observed in Figure 7. Note that the non-stationarity of the original series has been removed by taking differences, and therefore dissimilarity measures constructed under the stationarity assumption can be used. For illustrative purposes, we perform hierarchical clustering with complete linkage and use several dissimilarity measures implemented in TSclust, namely the dissimilarities based on the simple autocorrelation functions, on the logarithm of the normalized periodograms, and on the AR-metric introduced by Piccolo. The average Silhouette coefficients were examined for several k-cluster solutions, and it was observed that k = 5 groups yields a reasonably compact solution. First, the time series are transformed and a matrix for storing the 5-cluster solutions is created.
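The steps just described can be sketched as follows; the method strings and arguments reflect our reading of the diss() interface, and complete linkage is the hclust() default:

```r
# Transform to monthly relative changes and store one 5-cluster solution per
# dissimilarity: ACF, log normalized periodogram, and Piccolo's AR-metric
rel.int.rates <- diff(log(interest.rates))
five.cluster.sol <- matrix(0, nrow = 3, ncol = ncol(rel.int.rates),
                           dimnames = list(c("ACF", "LNP", "AR.PIC"),
                                           colnames(rel.int.rates)))

five.cluster.sol[1, ] <- cutree(hclust(diss(rel.int.rates, "ACF")), k = 5)
five.cluster.sol[2, ] <- cutree(hclust(diss(rel.int.rates, "PER",
                                            logarithm = TRUE,
                                            normalize = TRUE)), k = 5)
five.cluster.sol[3, ] <- cutree(hclust(diss(rel.int.rates, "AR.PIC")), k = 5)
```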

Matching pairs of time series from different domains
Now, we consider a dataset formed by a collection of 18 pairs of time series covering different domains, including finance, science, medicine, and industry. This dataset was presented by Keogh et al. (2004) to evaluate different clustering procedures and has later been considered by other authors (Lin and Li 2009; Brandmaier 2011, among others). The dataset is freely available at http://www.cs.ucr.edu/~eamonn/SIGKDD2004/, and is included in TSclust as paired.tseries. Each pair of series was selected from a different domain in datasets from the University of California Riverside (UCR) Time Series Archive. For example, the dataset includes the series thoracic and abdominal from the Fetal-ECG archive, annual power demand series for two different periods, Power: April-June and Power: Jan-March, from the Italy-Power-Demand archive, and so on. The 36 time series are displayed in Figure 8, where a different colour is used to identify each pair. Note that some of the distinct pairs present clearly different profiles.
One of the experiments described by Keogh et al. (2004) consisted of performing agglomerative hierarchical cluster analysis starting from different pairwise dissimilarities. The main objective was to identify the 18 pairs at the lowest level of the tree, assuming that a clustering in which each pair of time series is closest to each other is the ground truth. For this purpose, the quality of clustering is evaluated with a measure Q, defined as the number of correct pairings at the lowest clustering level divided by 18. A perfect result is attained when Q = 1, while the authors argue that Q = 0 would be expected for a random clustering due to the high number of possible dendrograms with 36 objects (greater than 3 × 10^49).
Package TSclust is a useful tool to carry out this type of experiment in a simple way. For illustrative purposes, some of the dissimilarity measures are evaluated using functions of TSclust. We begin by loading the dataset and creating the "ground-truth" solution. Since hclust() will be used to perform hierarchical clustering, the i-th row of the output component merge describes the merging of clusters at step i of the hierarchical process. A negative element j in the row means that observation −j was merged at this stage. Hence, a row with negative entries indicates agglomeration of singletons and, regarding the order of the series in the database, this row represents one of the desired pairings when it takes the form (−j, −(j + 1)), with j odd. For this reason, true_pairs is defined as the 2 × 18 matrix whose columns are the pairs (−j, −(j + 1)) for odd j. The Q measure introduced by Keogh et al. (2004) is then evaluated by counting the number of matches between true_pairs and the merge component of the hclust() output.
R> data("paired.tseries")
R> true_pairs <- data.frame(-matrix(1:36, nrow = 2, byrow = FALSE))

A glance at Figure 8 suggests substantial differences between the shapes and complexity levels of some pairs of series. Thus, a conventional shape-based dissimilarity or a complexity-based dissimilarity could produce good results. We begin by examining three dissimilarities: the Euclidean distance, which evaluates point-to-point discrepancy and so emphasizes local shape differences; d CID , addressed to compare shape complexities; and d PDC , measuring divergence between the permutation distributions of the series.
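A sketch of this evaluation is given below; count.pairs() is a hypothetical helper (not part of TSclust) that counts the merges agglomerating the two singletons of a true pair:

```r
# Hypothetical helper: number of hclust merges that join the two
# singletons of a true pair at the lowest level of the tree
count.pairs <- function(merge, true_pairs) {
  singleton.rows <- merge[merge[, 1] < 0 & merge[, 2] < 0, , drop = FALSE]
  sum(apply(singleton.rows, 1, function(row) {
    any(apply(true_pairs, 2, function(p) setequal(p, row)))
  }))
}

# Pairwise dissimilarity matrices for the three measures under study
EUCL.dis <- diss(paired.tseries, "EUCL")
CID.dis  <- diss(paired.tseries, "CID")
PDC.dis  <- diss(paired.tseries, "PDC")

# Q measures: fraction of the 18 true pairs recovered at the lowest level
Q.EUCL <- count.pairs(hclust(EUCL.dis)$merge, true_pairs) / 18
Q.CID  <- count.pairs(hclust(CID.dis)$merge,  true_pairs) / 18
Q.PDC  <- count.pairs(hclust(PDC.dis)$merge,  true_pairs) / 18
```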
Next, the compression-based dissimilarity diss.CDM() is examined using gzip as the compression algorithm. Alternative options in TSclust for the compression algorithm are bzip2, xz, and min. If min is selected, the other three options are tested, and the one giving the best compression is used.
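This step can be sketched as follows (a minimal sketch; the resulting dendrogram can then be inspected for correct pairings at its lowest level):

```r
# Compression-based dissimilarity matrix using gzip, followed by
# agglomerative hierarchical clustering
CDM.dis <- diss(paired.tseries, "CDM", type = "gzip")
plot(hclust(CDM.dis))
```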
Dissimilarity diss.CDM() with gzip compression correctly identifies 9 of the 18 pairs. Nevertheless, compression-based procedures can be affected by numerical issues (Keogh et al. 2004), and a symbolic representation of the series can help overcome these problems. Although there are many different symbolic approximations of time series in the literature (see, e.g., the review of symbolic analysis of experimental data provided by Daw et al. 2003), the Symbolic Aggregate approXimation (SAX) representation has been implemented in TSclust because it allows both dimensionality reduction and lower bounding (Lin et al. 2003). SAX consists of a z-normalization, a Piecewise Aggregate Approximation (PAA) reduction, and a further transformation in which the PAA values are mapped to symbols with equiprobability, thus obtaining the SAX representation (see Section 2.1). Both the PAA representation and the SAX symbolization are performed in TSclust using the routines PAA() and convert.to.SAX.symbol(), respectively. A graphical illustration of the combined result of these procedures is provided by the SAX.plot() function. See Figure 9, generated by the code below.
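A sketch of the SAX pipeline on a single series follows; the window count w and alphabet size alpha are illustrative choices, and the exact argument names of SAX.plot() are our reading of the interface:

```r
# SAX representation of one series: z-normalize, reduce with PAA, then map
# the reduced values to an equiprobable 4-symbol alphabet
x <- paired.tseries[, 1]
z <- (x - mean(x)) / sd(x)                        # z-normalization
x.paa <- PAA(z, w = 20)                           # PAA reduction to 20 frames
x.sax <- convert.to.SAX.symbol(x.paa, alpha = 4)  # symbols coded 1, ..., 4
SAX.plot(as.ts(x), w = 20, alpha = 4)             # combined illustration
```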

Conclusions
A key issue in time series clustering is the choice of a suitable measure to assess the dissimilarity between two time series. A large number of well-established and peer-reviewed dissimilarity measures between time series have been proposed in the literature, and a wide range of them are included in the R package TSclust presented in this work. TSclust is mainly an attempt to integrate different time series dissimilarity criteria into a single software package in order to check and compare their behavior in clustering. The main motivation behind this package is that, to our knowledge, no previous packages were available targeting the problem of clustering time series, except for the pdc package (Brandmaier 2014), which focuses on permutation distribution clustering. Nevertheless, the demand for a package with these features is supported by the increasing number of references and applications in different fields. A brief description and key references for the dissimilarity measures implemented in TSclust are included in the first part of this paper to give insight into their rationale. By using real and synthetic dataset examples, the capabilities and usefulness of the package are illustrated. Several scenarios with different clustering objectives have been proposed to support the need to consider different dissimilarity measures. In this sense, some general considerations on how to choose a proper dissimilarity are also discussed. The package TSclust is under continuous development, with the purpose of incorporating new dissimilarity criteria and clustering utilities in the time series framework.