kml and kml3d: R Packages to Cluster Longitudinal Data

Longitudinal studies are essential tools in medical research. In these studies, variables are not restricted to single measurements but can be seen as variable-trajectories, either single or joint. Thus, an important question concerns the identification of homogeneous patient trajectories. kml and kml3d are R packages providing an implementation of k-means designed to work specifically on trajectories (kml) or on joint trajectories (kml3d). They provide various tools to work on longitudinal data: imputation methods for trajectories (nine classic and one original), methods to define starting conditions in k-means (four classic and three original) and quality criteria to choose the best number of clusters (four classic and one original). In addition, they offer graphic facilities to "visualize" the trajectories, either in 2D (single trajectory) or 3D (joint-trajectories). The 3D graph representing the mean joint-trajectories of each cluster can be exported through LaTeX in a 3D dynamic rotating PDF graph (Figures 1 and 9).


Introduction
Longitudinal studies are becoming essential tools in epidemiological research. In these studies, the same variables are measured repeatedly over time. This enables the evolution of a parameter of interest to be examined over time. This kind of variable will be referred to as a "variable-trajectory". Because longitudinal datasets usually include a large number of variables, a key issue is to study the joint evolution of several variable-trajectories. This kind of variable will be referred to as a "joint-trajectory".
A standard way to work with variable-trajectories is to cluster them into distinct groups of patients with homogeneous characteristics. One strength of these classification methods is that they enable the conversion of several correlated continuous variables into a single categorical variable. The categories obtained can then be used, for instance, in a regression model, either as predictors or as dependent variables. Various methods have been designed for this purpose (Tarpey and Kinateder 2003; Rossi, Conan-Guez, and Golli 2004; Nagin 2005; Muthén and Muthén 2012) but they all present several weaknesses.
1. Most methods cluster according to a single variable-trajectory (except some model-based packages like mixAK, see Komárek 2009; Komárek, Hansen, Kuiper, van Buuren, and Lesaffre 2010). To date, joint-trajectories are mainly clustered by (1) clustering each single variable-trajectory independently, and (2) considering the combination obtained by crossing the partitions, which is not a convenient solution because it cannot detect complex interactions occurring between variables.
2. The determination of the optimal number of clusters is still an unsolved issue; the criteria designed for this task do not always lead to a clear answer.
3. The problem of missing data is almost ubiquitous in observational research. This is particularly true in longitudinal studies, where at least one follow-up assessment is often missing (Little 1993; Hedeker and Gibbons 1997; Mallinckrodt, Lane, Schnell, Peng, and Mancuso 2008; Laird 1988). The appropriate handling of dropout and withdrawal is still a key issue. The classic management techniques may result in considerable loss of information, especially when all patients with at least one missing measure are removed from the analyses. The development of imputation methods, such as imputation according to the mean or the classic LOCF (last observation carried forward), entails a lesser loss of information, but can generate results that are just as biased (Engels and Diehr 2003).
4. Partitioning techniques often rely on mathematical theories whose considerable complexity constitutes a barrier to actual implementation. Therefore, algorithms leading to approximate solutions are commonly used. In the case of k-means, given an initial configuration, the algorithm converges towards a maximum, but there is no way to be sure whether this maximum is global or merely local. One solution is to run k-means several times from different initial configurations; eventually, one configuration may lead to the global maximum. But even so, the user has no certainty about the optimality of the partition obtained.
kml (Genolini 2015a; Genolini and Falissard 2010) and kml3d (Genolini 2015b; Genolini et al. 2013b) are two R (R Core Team 2015) packages that implement k-means in the context of longitudinal data. kml is designed to work specifically on single trajectories while kml3d clusters joint trajectories. They offer different solutions to the issues tackled above. Both packages are available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=kml and http://CRAN.R-project.org/package=kml3d.
1. Package kml3d clusters several variable-trajectories jointly. This combines several correlated continuous variables (the joint-trajectories) into a single nominal variable (group) that summarizes the information contained in the correlated variables. This makes the use of this information for further statistical analysis much easier. In addition, package kml3d provides tools to visualize 3D dynamic graphs in an R session and offers the possibility of exporting them in a 3D dynamic PDF file (Figure 1). This visualization enables a better representation of the interaction between the two variable-trajectories.
2. The problem of selecting the number of clusters remains thus far unsolved. Nevertheless, various quality criteria have been proposed to choose the "right" number of clusters. As is often the case when several solutions exist, none is fully satisfactory. Packages kml and kml3d therefore offer various quality indices, mainly the ones that show the best performances according to the scientific literature (Shim et al. 2005; Milligan and Cooper 1985).
Depending on their field of interest and their knowledge, users can choose the one that is the most appropriate for their particular study. Furthermore, when several criteria lead to the same number of clusters, it increases the confidence that one can have in the results.
3. Regarding missing data, packages kml and kml3d provide several imputation methods dedicated to trajectories (nine classic and one original, see Section 2.4), so that patients with incomplete follow-up need not be excluded from the analyses.
4. The problem of local or global maximum is also an open question. Packages kml and kml3d do not solve it, but offer seven different ways to set an initial configuration, including three innovative methods.
Finally, packages kml and kml3d integrate all those tools into a single, "user-friendly" function. They automatically perform the k-means algorithm for a different number of clusters, enabling multiple draws for each of them and varying the initialization methods.
This article presents these two packages. Section 2 introduces the statistical techniques used. Section 3 details the algorithms and the main functions. Section 4 gives examples of use. Section 5 is the discussion.

Statistical techniques
Introduction to k-means

K-means is a hill-climbing algorithm belonging to the EM class (expectation-maximization; Celeux and Govaert 1992). EM algorithms work as follows: initially, each observation is assigned to a cluster. Then the optimal clustering is reached by alternating two phases. During the expectation phase, the centers of the different clusters (known as seeds) are computed. The maximization phase then consists in assigning each observation to its "nearest cluster". The alternation of the two phases is repeated until no further changes occur in the clusters.
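To make the alternation concrete, here is a minimal k-means sketch on a matrix of trajectories (one line per subject), using plain Euclidean distance. It is an illustration only, not the packages' implementation (kml relies on the optimized Hartigan and Wong version):

simpleKmeans <- function(traj, clusters, maxIt = 100) {
  for (it in seq_len(maxIt)) {
    ## Expectation phase: compute the seed (mean trajectory) of each cluster
    seeds <- t(sapply(sort(unique(clusters)),
                      function(g) colMeans(traj[clusters == g, , drop = FALSE])))
    ## Maximization phase: assign each trajectory to its nearest seed
    newClusters <- apply(traj, 1, function(y)
      which.min(colSums((t(seeds) - y)^2)))
    if (all(newClusters == clusters)) break   # no change: convergence
    clusters <- newClusters
  }
  clusters
}

## Example: 40 artificial trajectories, random initial assignment to 2 clusters
traj <- rbind(matrix(rnorm(100, mean = 0), 20), matrix(rnorm(100, mean = 3), 20))
simpleKmeans(traj, clusters = sample(1:2, 40, replace = TRUE))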

Notations
Let S be a set of n subjects. For each subject, m variable-trajectories (denoted $A, B, \ldots, M$) are measured at t different times. Let $y_{ijX}$ denote the value of variable X for subject i at time j. The joint-trajectory of subject i is the matrix $y_{i..}$ with m lines and t columns, where lines are single variable-trajectories $y_{i.A}, \ldots, y_{i.M}$. If j is fixed, the sequence $y_{ij.} = (y_{ijA}, \ldots, y_{ijM})$ is called the individual's state at time j; it is the jth column of the matrix $y_{i..}$. The aim of clustering is to divide S into k homogeneous sub-groups.

Clustering joint-trajectories
The two main problems encountered in partitioning joint-trajectories are the choice of the distance between individuals and problems of relative weight of variables measured on different scales.
Defining a distance between joint-trajectories

K-means can work with various distances: Euclidean, Manhattan, Minkowski (the generalization of the two previous distances) and many others. Working on joint-trajectories raises the question of the distance between two joint-trajectories. More precisely, considering the joint-trajectories of two individuals $y_{1..}$ and $y_{2..}$, we seek to define $d(y_{1..}, y_{2..})$, the distance between $y_{1..}$ and $y_{2..}$. Strictly speaking, this is a distance between two matrices. Several methods are possible; we will focus on two. The first is to consider the t columns of the two matrices, then to compute the t distances between the t couples of columns and finally to combine these t distances using a function that will combine the "column-distances". The second is to consider the m lines of the two matrices, then to compute the m distances between the m couples of lines and finally to combine these m distances using a function that will combine the "line-distances".
More formally, let Dist be a distance function and $\|\cdot\|$ be a norm. To compute a distance d between $y_{1..}$ and $y_{2..}$ according to the first method, for each fixed j, we define the distance between $y_{1j.}$ and $y_{2j.}$ (distance between the individuals' states at time j) as $d_{j.}(y_{1j.}, y_{2j.}) = Dist(y_{1j.}, y_{2j.})$. This is the distance between column j in matrix $y_{1..}$ and column j in matrix $y_{2..}$. The result is a "vector of t distances" $(d_{1.}(y_{11.}, y_{21.}), d_{2.}(y_{12.}, y_{22.}), \ldots, d_{t.}(y_{1t.}, y_{2t.}))$. These t distances are then combined by taking the norm $\|\cdot\|$ of this vector, which defines $d(y_{1..}, y_{2..})$.
To compute a distance d' between $y_{1..}$ and $y_{2..}$ according to the second method, for each variable X, we define the distance between $y_{1.X}$ and $y_{2.X}$ (distance between the individual trajectories of X) as $d_{.X}(y_{1.X}, y_{2.X}) = Dist(y_{1.X}, y_{2.X})$. This is the distance between line X in matrix $y_{1..}$ and line X in matrix $y_{2..}$. The result is a "vector of m distances" $(d_{.A}(y_{1.A}, y_{2.A}), \ldots, d_{.M}(y_{1.M}, y_{2.M}))$. Then we combine these m distances by considering the norm $\|\cdot\|$ of the vector of distances. The choice of the norm $\|\cdot\|$ and the distance Dist can lead to the definition of a large number of distances. Interestingly, in the case where $\|\cdot\|$ is the standard p-norm and Dist is the Minkowski distance with parameter p, choosing either method gives the same result, $d(y_{1..}, y_{2..}) = d'(y_{1..}, y_{2..})$, since both amount to summing $|y_{1jX} - y_{2jX}|^p$ over all j and X before taking the pth root. We can therefore define the Minkowski distance between two joint variable-trajectories as:

$$Dist_p(y_{1..}, y_{2..}) = \sqrt[p]{\sum_{j=1}^{t} \sum_{X=A}^{M} |y_{1jX} - y_{2jX}|^p}$$

The Euclidean distance is obtained by setting p = 2, the Manhattan distance by setting p = 1 and the maximum distance by passing to the limit p → +∞. In practice, the kml3d package uses the Euclidean distance as the default. It also allows users to define their own distance.
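As an illustration of this formula, here is a short R sketch, assuming each joint-trajectory is stored as an m × t matrix (one line per variable, one column per time); this is not the package's internal code:

## Minkowski distance of parameter p between two joint-trajectories
distJoint <- function(y1, y2, p = 2) {
  sum(abs(y1 - y2)^p)^(1 / p)
}

y1 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)  # two variables, three times
y2 <- matrix(c(1, 1, 4, 4, 5, 8), nrow = 2)
distJoint(y1, y2, p = 2)  # Euclidean distance between the two matrices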

Data standardization
Since longitudinal studies deal with several different kinds of variables, the joint variables can be measured on different scales. This problem has already been extensively discussed in the classic (non-trajectory) situation (Everitt et al. 2001). A possible solution is to normalize the data. This can also be done with trajectories. As a slight difference from the classic situation, each variable-trajectory is not normalized at each time but in its entirety: let $\bar{y}_{..X}$ and $s_{..X}$ be respectively the mean and the standard deviation of $y_{..X}$. Then, the "global normalization" turns the measure $y_{ijX}$ into

$$y'_{ijX} = \frac{y_{ijX} - \bar{y}_{..X}}{s_{..X}}$$

The normalized joint trajectory $y'_{i..}$ is obtained by normalizing its single trajectories $y_{i.A}, \ldots, y_{i.M}$ one by one.
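A minimal sketch of this global normalization, assuming the trajectories of one variable X are stored as an n × t matrix (this mirrors what the default scaling does, but is not the package's own code):

## Global normalization: one mean and one standard deviation
## computed over the whole variable-trajectory matrix
globalNormalize <- function(trajX) {
  (trajX - mean(trajX, na.rm = TRUE)) / sd(as.vector(trajX), na.rm = TRUE)
}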

Imputation methods for longitudinal data
In their founding documents, Rubin and Little distinguished three kinds of missing values (Rubin 1976; Little and Rubin 1987). Let $Y^{TRUE}$ be the trajectories without missing values (unavailable data); let $Y^{OBS}$ be the trajectories with missing values (the available measured longitudinal data); let R denote the Boolean matrix of the locations of the missing values; let $Y^{MISS}$ be the missing part of $Y^{TRUE}$. Thus, $Y^{TRUE} = Y^{OBS} + Y^{MISS}$. Their classification of missing values is then based on a potential link between R and $Y^{TRUE}$, $Y^{OBS}$, and $Y^{MISS}$ as follows:

MCAR: A value is Missing Completely At Random if the probability of missingness for observation $y_{ijX}$ is independent of $Y^{TRUE}$.
MAR: A value is Missing At Random if the probability of missingness for observation $y_{ijX}$ is independent of $Y^{MISS}$, but may be dependent on the observed values $Y^{OBS}$. For example, if patients who performed badly at time j - 1 decide to miss time j, the missing data will be MAR.

MNAR: A value is Missing Not At Random if the probability of missingness for observation $y_{ijX}$ depends on its own value, i.e., on $Y^{MISS}$. Typically, the probability for an observation $y_{ijX}$ to be missing at time j depends on the current value of y at time j. For example, if patients who think that they are going to perform badly at time j refuse to be tested at time j, the data will be MNAR.
The impact of the mechanism of missingness on the imputation of the missing values was examined by Molenberghs et al. (2004). In the particular case of longitudinal data, the missingness mechanisms can also be classified according to the position of the missing values within the trajectory:

Intermittent missing data are data missing within a trajectory. Formally, $y_{ijX}$ is an intermittent missing value if there exist a and b, a < j < b, such that $y_{iaX}$ and $y_{ibX}$ are not missing.
Monotone missing data are data missing either at the beginning or at the end of a trajectory. This includes the case of left- or right-truncated variables: if a value is missing, then all the following (or, conversely, preceding) values are also missing. Formally, $y_{ijX}$ is a right monotone missing value if, for all d > j, $y_{idX}$ is missing. Conversely, $y_{ijX}$ is a left monotone missing value if, for all d < j, $y_{idX}$ is missing.
Orthogonally, imputation methods are grouped according to the information necessary for their implementation. Cross-sectional methods impute $y_{ijX}$ using data collected at time j, that is, according to the values of the other individuals at the same time j:

1. Cross Mean replaces $y_{ijX}$ by the mean of the values of variable X observed at time j.
2. Cross Median replaces $y_{ijX}$ by the median of the values of variable X observed at time j.
3. Cross Hot Deck replaces $y_{ijX}$ by a value chosen randomly among all the values of variable X observed at time j.
Longitudinal methods impute $y_{ijX}$ using the non-missing data of variable X for subject i. The imputation is performed independently of the data of the other individuals:

4. Traj Mean replaces $y_{ijX}$ by the mean of the values of the trajectory $y_{i.X}$.

5. Traj Median replaces $y_{ijX}$ by the median of the values of the trajectory $y_{i.X}$.
6. Traj Hot Deck replaces $y_{ijX}$ by a value chosen randomly among the values of the trajectory $y_{i.X}$.
7. Last Observation Carried Forward (LOCF) replaces $y_{ijX}$ by the previous non-missing value.
8. Linear Interpolation imputes $y_{ijX}$ by drawing a line between the two non-missing values that immediately precede and follow the missing value. Let $y_{iaX}$ and $y_{ibX}$ be the closest preceding and following non-missing values of $y_{ijX}$; then

$$y_{ijX} = y_{iaX} + \frac{y_{ibX} - y_{iaX}}{b - a}(j - a)$$

9. Spline Interpolation imputes $y_{ijX}$ by drawing a cubic spline between the non-missing values that immediately precede and follow the missing value. For mathematical details, see Fritsch and Carlson (1980).

Cross-sectional & Longitudinal methods use both longitudinal and cross-sectional information:
10. Copy Mean is a new method. The main idea is to impute using linear interpolation, and then to add a variation to make the trajectory follow the "shape" of the population's mean trajectory. Formally, if $y_{ijX}$ is missing, let $y_{iaX}$ and $y_{ibX}$ be the closest preceding and following non-missing values of $y_{ijX}$ (all the notation introduced here is illustrated in Figure 2(a)); let $\bar{y}_{..X} = (\bar{y}_{.1X}, \ldots, \bar{y}_{.tX})$ denote the mean trajectory of the population S; let $y^{LI}_{ijX}$ be the value obtained by imputing $y_{ijX}$ using linear interpolation; and finally, let $\bar{y}^{LI}_{.jX}$ be the value obtained by applying a linear interpolation between a and b on the mean trajectory $\bar{y}_{..X}$, that is $\bar{y}^{LI}_{.jX} = \bar{y}_{.aX} + \frac{\bar{y}_{.bX} - \bar{y}_{.aX}}{b - a}(j - a)$. Then the average variation at time j is the difference between $\bar{y}_{.jX}$ and $\bar{y}^{LI}_{.jX}$, that is $AV_{jX} = \bar{y}_{.jX} - \bar{y}^{LI}_{.jX}$. Finally, Copy Mean imputes $y_{ijX}$ by adding the average variation $AV_{jX}$ to the result of the linear interpolation: $y^{CM}_{ijX} = y^{LI}_{ijX} + AV_{jX}$. Figure 2 shows an example of a trajectory imputed using Copy Mean. The efficiency of this method has been shown in Genolini et al. (2013a).
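The following R sketch applies Copy Mean to a single intermittent missing value, directly from the formulas above (an illustration under the stated notation, not the package's own imputation routine):

## Copy Mean for one intermittent missing value at position j of trajectory yi;
## meanTraj is the mean trajectory of the population
copyMeanOne <- function(yi, meanTraj, j) {
  a <- max(which(!is.na(yi[1:(j - 1)])))             # closest preceding value
  b <- j - 1 + min(which(!is.na(yi[j:length(yi)])))  # closest following value
  linInt <- function(v) v[a] + (v[b] - v[a]) / (b - a) * (j - a)
  AV <- meanTraj[j] - linInt(meanTraj)  # average variation at time j
  linInt(yi) + AV                       # linear interpolation plus variation
}

copyMeanOne(yi = c(5, 6, NA, 9, 8), meanTraj = c(4, 5, 7, 6, 5), j = 3)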
All these methods work both on intermittent and on monotone missing values, except two: (1) Linear Interpolation needs to know the values surrounding the missing value, so it cannot impute monotone missing values; and (2) since Copy Mean uses Linear Interpolation, it inherits this weakness. To overcome this problem, a modified version of Copy Mean is dedicated to the imputation of monotone missing values:

10'. Copy Mean for monotone missing values: As with intermittent missing data, the main idea is first to impute using a longitudinal method, then to add a variation. The only difference is that, since Linear Interpolation is not available for monotone missing values, the LOCF method is used instead. Formally, if $y_{ijX}$ is a right monotone missing value, let $y_{iaX}$ (a < j) be the last known value of the trajectory $y_{i.X}$ (all the notation introduced here is illustrated in Figure 2(b)); let $\bar{y}_{..X} = (\bar{y}_{.1X}, \ldots, \bar{y}_{.tX})$ denote the mean trajectory of the population S; let $y^{LOCF}_{ijX}$ be the value obtained by imputing $y_{ijX}$ using LOCF: $y^{LOCF}_{ijX} = y_{iaX}$; and finally, let $\bar{y}^{LOCF}_{.jX}$ be the value obtained by applying LOCF on the mean trajectory $\bar{y}_{..X}$, that is $\bar{y}^{LOCF}_{.jX} = \bar{y}_{.aX}$. Then the average variation at time j is the difference $AV_{jX} = \bar{y}_{.jX} - \bar{y}^{LOCF}_{.jX}$. Finally, Copy Mean imputes $y_{ijX}$ by adding the average variation $AV_{jX}$ to the result of the LOCF imputation: $y^{CM}_{ijX} = y^{LOCF}_{ijX} + AV_{jX}$. For the imputation of left monotone missing values, NOCB (next observation carried backward) is used instead of LOCF. The rest of the method remains the same.

Quality criteria
Quality criteria are indices associated with a partition. They take high values for partitions of "high quality", low values otherwise (or the reverse, depending on the criterion). Different definitions of "high" and "low" quality partitions lead to different indices, but the operating principles are mostly similar: a "good" partition is a partition whose clusters are (1) compact and (2) well separated from each other. So most of the indices compute some kind of "within-cluster compactness index" and "between-cluster separation index", and then divide one by the other. More specifically:

Calinski & Harabasz criterion (Caliński and Harabasz 1974):

$$C(k) = \frac{Trace(B)}{Trace(W)} \times \frac{n - k}{k - 1}$$

where B is the between-cluster covariance matrix (so high values of Trace(B) denote well-separated clusters) and W is the within-cluster covariance matrix (so low values of Trace(W) correspond to compact clusters).
Two variants of this criterion are also provided; they differ only in the factor weighting the ratio Trace(B)/Trace(W), where B is the between-cluster covariance matrix and W is the within-cluster covariance matrix, as above.
Ray & Turi criterion (Ray and Turi 1999): the ratio of a within-cluster compactness measure (the average squared distance between each point and its cluster center) to a between-cluster separation measure (the minimum squared distance between two cluster centers).

Davies & Bouldin criterion (Davies and Bouldin 1979): based on the ratio DistInterne/DistExterne, where DistInterne is a cluster compactness measure (for example, the maximum distance between two points of the cluster) and DistExterne is a cluster separation measure (for example, the distance between cluster centers).
These five criteria are non-parametric: they can be computed without making any hypothesis. Under additional hypotheses, some parametric criteria like the BIC (Schwarz 1978), the AIC (Akaike 1974), the AIC with correction for finite sample sizes (Hurvich and Tsai 1989) or the global posterior probability (Bolstad 2007) can also be computed. This computation needs two kinds of information: the likelihood and the number of measures.
The former can be found by assuming that, for each cluster C and each time j, the single trajectory variable $Y_{.jX}$ follows a normal law with variance $\sigma^2 = \mathrm{var}_{i \in S, j \in \{1,\ldots,t\}}(Y_{ijX})$ (homoscedasticity: one variance for all the times and all the groups) and mean $\mu_{jC} = \mathrm{mean}_{i \in C}(Y_{ijX})$ (the mean trajectories can change at any time and for any group). Using these hypotheses, it is possible to compute the posterior probabilities of each individual trajectory, and then the likelihood.
The number of measures is more problematic: in a classic study, the number of independent observations is equal to the sample size. In longitudinal studies, the sample size is n while the number of observations is N = t · n. But these t · n observations are not independent measures. Therefore, deciding whether the number of measures should be n or N is not straightforward. Packages kml and kml3d compute the criteria using both definitions; for example, $BIC = -2\log(\mathcal{L}) + h\log(n)$ and $BIC2 = -2\log(\mathcal{L}) + h\log(N)$, where h is the number of parameters in the model and $\mathcal{L}$ the likelihood. Note that although these parametric criteria can always be computed, they are pertinent only if the hypothesis is true (the variable follows a normal law).
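As a worked illustration of the two conventions (using the standard BIC formula; the packages' exact bookkeeping of the number of parameters h may differ):

## BIC under the two sample-size conventions: n subjects vs. N = t * n measures
bicSubjects <- function(logLik, h, n)    -2 * logLik + h * log(n)
bicMeasures <- function(logLik, h, n, t) -2 * logLik + h * log(n * t)

bicSubjects(logLik = -1234, h = 10, n = 100)         # uses n
bicMeasures(logLik = -1234, h = 10, n = 100, t = 5)  # uses N = 500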

Maximization or minimization?
Among all these criteria, some should be maximized (a high value denoting a good partition), others should be minimized (a low value denoting a good partition). This might be confusing, in particular when comparing several criteria together. To avoid this confusion, packages kml and kml3d compute the criteria that should be maximized as they are, and compute the opposite of the criteria that should be minimized. As such, all the criteria proposed by packages kml and kml3d should be maximized. This enables the representation of several criteria on the same graph, making them easier to compare (see Figure 6).

Initialization of k-means
The first step of the k-means algorithm is to choose an initial configuration, that is, a set of k cluster centers. This choice has a dual importance: (1) on it depends the partition towards which the algorithm will converge (local or global maximum); (2) on it also depends the convergence time. Ideally, if a method were able to choose an initial configuration fairly close to the best partition, k-means would converge quickly to the optimal solution. Some authors have therefore proposed various initialization methods (Pena, Lozano, and Larranaga 1999; Khan and Ahmad 2004; Redmond and Heneghan 2007; Steinley and Brusco 2007). Most of these methods try to build an initial configuration in which the initial centers are as distant as possible from each other. The idea is that individuals that are clearly distant probably belong to different clusters, which favors quick convergence to a good partition. However, whatever the method chosen, it is nevertheless not certain that the initial configuration enables convergence to the best partition. Thus most methods include a non-deterministic process that allows the user to run the method several times. Since the runs start from different initial configurations, they might converge to different maxima; eventually, one of them may reach the global maximum. The packages kml and kml3d offer seven methods for choosing initial configurations:

1. randomK: k individuals are chosen randomly. They are the initial cluster centers.
2. randomAll: All individuals are randomly assigned to a cluster. The mean of each cluster is the cluster center.
3. maxDist: This method is incremental. First, it selects the two individuals that are the most distant and considers them as the first two centers. Then it adds, one by one, the individual that is the farthest away from the centers already selected. More precisely:
(a) Compute the matrix of the distances between all points.
(b) Choose the two farthest points as the first two centers $c_1$ and $c_2$.
(c) Start a loop:
    i. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
    ii. Choose as a new center $c_i$ the data point for which $D(c_i)$ is maximum.
    iii. Repeat steps 3(c)i and 3(c)ii until k centers have been chosen.

4. kmeans+: The maxDist method is more effective than randomAll or randomK (see Genolini and Falissard 2011) since the initial centers are distant from each other. But it is also time-consuming, with a complexity in $O(n^2)$ (computation of the matrix of all the distances). kmeans+ is an improvement of maxDist based on a similar principle: distant centers are added to the list of already selected centers, which ensures that the property of having initial centers distant from each other is retained. The only difference is that the first center is chosen randomly. Thus, it is no longer necessary to compute the matrix of all the distances between individuals; computing the distances between each individual and the centers already selected (at most k of them) is sufficient, giving a complexity of $O(nk)$:
(a) Choose a center $c_1$ at random (uniformly) from among the data points.
(b) Start a loop:
    i. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
    ii. Choose as a new center $c_i$ the data point for which $D(c_i)$ is maximum.
    iii. Repeat these two steps until k centers have been chosen.

5. kmeans-: The efficiency of kmeans+ is slightly worse than that of maxDist. Indeed, the first point selected may be a "bad" choice (e.g., a point located between two clusters). A solution to improve this is to choose the first and the second centers according to the kmeans+ rules, then to remove the first center (the one that may be a bad choice) from the list. The second point cannot be a bad choice since it is at least distant from the first one. After the first center has been removed, the second center becomes the new first center. Then the other centers are added one by one as in kmeans+. Formally:
(a) Choose one center $c_0$ uniformly at random from among the data points.
(b) For each data point x, compute D(x), the distance between x and $c_0$.
(c) Choose as the second center $c_1$ the data point for which $D(c_1)$ is maximum.
(d) Remove $c_0$ from the list of centers. $c_1$ is now the "new" first center.
(e) Start a loop:
    i. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
    ii. Choose as a new center $c_i$ the data point for which $D(c_i)$ is maximum.
    iii. Repeat steps 5(e)i and 5(e)ii until k centers have been chosen.
6. kmeans--: kmeans- has the same advantage as maxDist in terms of center repartition (centers distant from one another) and the same advantage as kmeans+ in terms of complexity ($O(kn)$). But except for the initial draw, kmeans- is a deterministic method. This greatly impairs the benefit of multiple draws, which is one of the strengths of k-means. A way to solve this problem is to choose randomly the centers that are added to the list of the previously selected centers. To maintain a high probability of obtaining centers that are distant from each other, the probability for an individual x to be added to the list of selected centers is proportional to the square of the distance between x and the already selected centers. Formally, kmeans-- is the same algorithm as kmeans- except for steps (c) and (e)ii:
(a) Choose one center $c_0$ uniformly at random from among the data points.
(b) For each data point x, compute D(x), the distance between x and $c_0$.
(c) Choose one new center $c_1$ at random, using a weighted probability distribution where a point x is chosen with probability proportional to $D(x)^2$.
(d) Remove $c_0$ from the list of centers.
(e) Start a loop:
    i. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
    ii. Randomly choose a data point as the new center $c_i$, using a weighted probability distribution where a point x is chosen with probability proportional to $D(x)^2$.
    iii. Repeat steps 6(e)i and 6(e)ii until k centers have been chosen.
kmeans-- combines all the advantages of the previous methods: the centers are distant from each other, the first center cannot be a bad choice, the time complexity is $O(kn)$ and the method is non-deterministic.

7. kmeans++: In the same way, kmeans++ (Arthur and Vassilvitskii 2007) is the non-deterministic version of kmeans+ (a sketch of this seeding is given after this list):
(a) Choose one center $c_0$ uniformly at random from among the data points.
(b) Start a loop:
    i. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
    ii. Randomly choose a data point as the new center $c_i$, using a weighted probability distribution where a point x is chosen with probability proportional to $D(x)^2$.
    iii. Repeat steps 7(b)i and 7(b)ii until k centers have been chosen.
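A minimal R sketch of the kmeans++ seeding on a matrix of trajectories (one line per subject, Euclidean distance); it illustrates the selection rule only and is not the packages' implementation:

## kmeans++ seeding: first center uniform, then each new center drawn with
## probability proportional to the squared distance to the nearest chosen center
kmeansPlusPlus <- function(traj, k) {
  n <- nrow(traj)
  centers <- sample(n, 1)                       # (a) first center, uniform
  while (length(centers) < k) {                 # (b) the loop
    D2 <- apply(traj, 1, function(y)            # squared distance to nearest center
      min(colSums((t(traj[centers, , drop = FALSE]) - y)^2)))
    centers <- c(centers, sample(n, 1, prob = D2))
  }
  traj[centers, , drop = FALSE]                 # the k initial centers
}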

Overview
An overview of the packages and the algorithm is presented in Figure 3.

Preparation of the data
Packages kml and kml3d cluster longitudinal data. One of their properties is that they "memorize" all the partitions that they find. To do this, they use an S4 structure (Chambers 2008; Genolini 2008). 'Partition' objects contain the clusters. They also contain some information such as the size of each cluster, as well as quality criteria (see Section 2.5). The second object, 'ClusterLongData' (or 'ClusterLongData3d' for kml3d), mainly contains a field traj that stores the trajectories, and fields c2, c3, c4, ..., c26 that store lists of 'Partition' objects with respectively 2, 3, 4, ..., 26 clusters. Data preparation therefore simply consists in transforming the longitudinal data into a 'ClusterLongData' or 'ClusterLongData3d' object.
In package kml, this can be done using the function cld(), which turns a data frame or a matrix into a 'ClusterLongData' object. Its arguments include the following (the type of each argument is given in brackets):

maxNa ([numeric]): maximum number of missing values tolerated on a trajectory. If a trajectory has more missing values than maxNa, it is removed from the analysis.
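For instance, a minimal call might look as follows (the data frame and the column indices are hypothetical; timeInData is assumed to give the columns holding the successive measurements, as in the kml3d example below):

R> library("kml")
R> myData <- data.frame(id = paste0("S", 1:5),
+    t1 = rnorm(5), t2 = rnorm(5), t3 = rnorm(5))
R> myCld <- cld(myData, timeInData = 2:4, maxNa = 1)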
Package kml3d works in nearly the same way: the object 'ClusterLongData3d' contains the same fields as the object 'ClusterLongData'. It is built using the function cld3d() from a data frame or a three-dimensional array. The main arguments of cld3d() are:

timeInData ([list(vector(numeric))]): if traj is a data frame, timeInData specifies the columns that contain the trajectories. The list labels are the names of the variables. The vector of numbers associated with each variable gives the column numbers in the data frame that store the variable. For example, timeInData = list(A = c(2, 4, 6), B = c(3, 5, 9)) defines a joint-trajectory composed of two variables A and B, the trajectories of A being contained in columns 2, 4 and 6, those of B in columns 3, 5 and 9. This argument is not used if traj is an array.

varNames ([character]): names of the variables measured.
maxNa ([numeric] or [vector(numeric)]): maximum number of missing values tolerated on a trajectory. If a trajectory has more missing values than maxNa, it is removed from the analysis. In the 3D case, maxNa can be a single numeric value (the same for all the variables) or a vector of numeric values (one per variable).
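A sketch of such a call, reusing the column layout of the example above (the data frame myJointData is hypothetical):

R> library("kml3d")
R> myCld3d <- cld3d(myJointData,
+    timeInData = list(A = c(2, 4, 6), B = c(3, 5, 9)), maxNa = 1)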

Finding the optimal partition
Once an object of class 'ClusterLongData' has been created, the kml() function can be called. kml() runs k-means several times, varying the starting conditions and the number of clusters. On each run, it finds a 'Partition' and stores it in the appropriate field. The starting condition can be "randomAll", "randomK", "maxDist", "kmeans++", "kmeans+", "kmeans--" or "kmeans-", as described in Section 2.6. In addition, two shortcut values are available: "all" stands for c("maxDist", "kmeans-") followed by an alternation of "kmeans--" and "randomK"; "nearlyAll" stands for "kmeans-" followed by an alternation of "kmeans--" and "randomK". By default, kml() runs k-means for k ∈ {2, 3, 4, 5, 6} clusters, 20 times each, using "nearlyAll". The k-means version used here is the Hartigan and Wong version (Hartigan and Wong 1979). The default distance is the Euclidean distance with Gower adjustment (Gower 1966). kml() can also work with user-defined distances through the optional argument distance. If specified, distance should be a function that takes two trajectories and returns a number, letting the user compute a non-classic distance (such as the adaptive dissimilarity index, Hennig and Hausdorf 2006; dynamic time warping, Rath and Manmatha 2003; Fréchet distance, Fréchet 1905; ...).
There are two implementations of kml(). The first one (kmlSlow) is programmed in R. It is called by kml() when the user asks for some graphical display or wants to use a non-classical user-defined distance function. The second (kmlFast) is optimized in C. It is run only when all the parameters are standard. This second version is around 20 times faster than the first one. The dispatching to kmlSlow or kmlFast is done automatically by kml(), according to the options defined by the user. Every 'Partition' found by kml() is stored in the 'ClusterLongData' object. Fields c2, c3, ..., c26 are lists storing the partitions with a specific number of clusters (for example, the field c3 stores all the 'Partition' objects with 3 clusters). Storage is performed in real time: if kml() is interrupted before the computation ends, the partitions already found are not lost. When kml() is re-run several times on the same data, the new partitions found are added to the previous ones. This is convenient when the user asks for several runs, then realizes that the result is not optimal and asks for further runs. In addition, kml() saves all the partitions on the hard disk at frequent intervals (the frequency can be specified by the user) to guard against any system interruption that may occur when the algorithm has to run for a long time (up to several days).
The main arguments of kml() are: object (['ClusterLongData']): contains trajectories to clusters and all the partitions already found.
nbRedrawing ([numeric]): sets the number of times that k-means must be run (with different starting conditions) for each number of clusters. The default value is 20.
toPlot ([character]): while it is running, kml() can display two graphs. The first displays the quality criterion of all the partitions already found. The second represents the evolution of the clustering process (the different steps in k-means). According to the value of toPlot, the first (toPlot = "criterion"), the second (toPlot = "traj"), none (toPlot = "none") or both (toPlot = "both") will be displayed.
parAlgo (['ParKml']): some more advanced options can be specified using the argument parAlgo. parAlgo takes as value an object of class 'ParKml'. Objects of class 'ParKml' contain different fields that detail the parameters that kml() should use (such as a user-defined distance, the starting condition, the save frequency, . . . ).
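Putting these arguments together, a typical call might be the following (cldSDQ is assumed to be a 'ClusterLongData' object built with cld(); parALGO() is the constructor of 'ParKml' objects, assumed here to accept the fields startingCond and saveFreq):

R> kml(cldSDQ, nbRedrawing = 20, toPlot = "criterion")
R> kml(cldSDQ, parAlgo = parALGO(startingCond = "kmeans++", saveFreq = 100))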
The kml3d() function from package kml3d works on the same principle. The only difference is that the argument object belongs to the class 'ClusterLongData3d' instead of 'ClusterLongData'.

Exporting results
After kml() has found some partitions, the user can "visualize" them, and then decide to export some of them. This can be done via the function choice(). choice() opens a graphic window split into two parts (see Figure 4 for an example). On the left, all the partitions that have been found by kml() are represented. Each partition is represented by a number giving its number of clusters. The height of the points represents the value of a specific quality criterion (presented in Section 2.5) called the "active criterion". The name of the active criterion appears above the graph. This graph thus gives an overall view of all the partitions that have been found.
Among all the partitions plotted on the left-hand graph, one is "selected" (the one highlighted with a black dot, see Figure 4). This partition is plotted on the right-hand part of the graphic window. More precisely, the right-hand graph plots the longitudinal data and highlights the cluster structure of the selected partition using colors. According to the user's choice, the trajectories belonging to a specific cluster may or may not be plotted, colorized, or labeled (ditto for the mean trajectories of each group). This is a dynamic process: the user can change the selected partition (by moving the black dot); he/she can also change the active criterion. The clusters and mean trajectories shown on the right side of the graphic window are amended accordingly. This allows the user to visualize various partitions, but also to check whether the best partition based on one criterion remains the best according to another.
Finally, the user can choose to export some partitions by selecting them (simply by pressing the space bar when the partition is shown). When he/she has made his/her choices, the clusters and related information (quality criteria, frequency of individuals in each cluster, ...) are exported into .csv files. Graphs are also exported in a format compatible with savePlot(). For joint trajectories, the process is similar. The only difference is that the right-hand graph shows every single variable-trajectory (one graph for each) that is part of the joint-variable trajectories (see Figure 5). More precisely, the choice() function arguments are:

object (['ClusterLongData']): object containing the trajectories and all the partitions found by kml() or by kml3d() that the user wants to see.
typeGraph ([character]): For every selected partition, choice() can export graphs. typeGraph sets the format that will be used. The possible formats are those available for savePlot.

3D rotating graph in a PDF file
For joint trajectories, kml3d provides tools to visualize the trajectories. The plot3d() function displays the joint-trajectories in a three-dimensional space where the x-axis represents time, the y-axis the first variable, and the z-axis the second one. This graph function is built using the rgl package (Adler and Murdoch 2014). It is therefore possible to move the plotted graph in three dimensions using the mouse. This helps to visualize the joint evolution of the trajectories fairly accurately.
Moreover, recent versions of the PDF standard accept dynamic graphs. In association with LaTeX and Asymptote (Hammerlindl, Bowman, and Prince 2014), kml3d makes it possible to export 3D graphic representations of the mean trajectories inside a .pdf document. This is done using the misc3d package (Feng and Tierney 2008). The 3D trajectories can thus easily be sent to a colleague, or can be dynamically included in an article (like the present article, see Figures 1 and 9). The procedure for producing 3D dynamic graphs is the following:

1. plot3dPdf(): create a scene, that is, a collection of triangle objects that represent a 3D image.
2. saveTrianglesAsASY(): export the scene into an .asy file.

3. The .asy file cannot be included in a LaTeX file, which can read only .prc files. So the next step is to use Asymptote to convert the .asy file into a .prc file. This is done by the command: asy -inlineimage -tex pdflatex scene.asy.
4. The previous step produces a file scene+0.prc that can be included in a LaTeX file. makeLatexFile() creates a LaTeX file (default name: "main.tex") that is directly compilable using pdflatex.
Then compiling main.tex produces a .pdf document that contains the 3D dynamic graph.

Other functions
The four previous functions are probably sufficient for most users. Nevertheless, the following functions may also be helpful in some specific cases.
Scaling then restoring the original data (package kml3d only)

When working on joint-trajectories, variables may be measured on different scales. This can give too much weight to one of the variables at the expense of another. A possible solution to overcome this problem is to normalize the data. The scale() function can either normalize the data globally (global normalization as defined in Section 2.3) or change the scales according to values defined by the user. Whatever the changes made by the scale() function, the function restoreRealData() can restore the data to their original form. By default, the scaling (normalization) and restoring operations are automatically performed by kml3d(), but this can be switched off (using the argument parAlgo); the scale() function can then be used to specify values manually.
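A sketch of a manual workflow, assuming the automatic scaling has been switched off via parAlgo (myCld3d is a 'ClusterLongData3d' object as built above):

R> scale(myCld3d)            # global normalization of each variable-trajectory
R> kml3d(myCld3d)            # cluster the normalized joint-trajectories
R> restoreRealData(myCld3d)  # put the trajectories back on their original scales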

Quality criteria comparison
As we noted in the introduction, the quality criteria used to select the "correct" number of clusters are not always efficient. Using several of them might strengthen the reliability of the results. The function plotAllCriterion() displays several criteria on the same graph. In order to make them comparable, kml() and kml3d() compute the opposite of the criteria that should be minimized, so that all the criteria have to be maximized. In addition, the criteria are mapped into [0, 1], which makes them more convenient to compare. Figure 6 gives an example of concordant and discordant criteria.

Fuzzy k-means
The kml package also provides a fuzzy version of k-means (fuzzyKmlSlow()). It has the same syntax and offers the same facilities as the function kml().

Generating artificial longitudinal data
When researchers want to test new algorithms or methods, they usually first work on artificial data. These data have the advantage of being well known, and thus providing the optimal solution. A way to demonstrate the effectiveness of a new method is to prove that it is better than the others at finding the optimal solution on artificial data.
Package kml (resp. kml3d) provides a function that generates artificial longitudinal data. The random generation works as follows. A data set shape is defined by a number of groups k and by k real functions from R to R (resp. from R to R^m), one for each group. These define the typical single trajectories (resp. typical joint-trajectories) followed by the individuals of each group. Data sets are then created from the data set shape: first, a number of individuals per group is set. The trajectory of an individual is obtained by adding a personal variation and a residual variation to the typical trajectory of its group. An individual's personal variation is constant over time (it represents the individual's specificities) while the residual variation can change at each time. Finally, a percentage of missing values (Missing Completely At Random according to the Rubin classification) can be added to each cluster. A sketch of this mechanism is given below.
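A minimal R sketch of the generation mechanism just described (not the package's own generator; all the argument names are illustrative):

## Typical group trajectories + constant personal effect
## + time-varying residual noise + MCAR missing values
genArtificial <- function(nbPerGroup = 50, time = 1:10,
                          shapes = list(function(t) 0,
                                        function(t) t,
                                        function(t) 10 - t),
                          sdPersonal = 2, sdResidual = 1, pMissing = 0.1) {
  trajList <- lapply(shapes, function(f) {
    t(sapply(seq_len(nbPerGroup), function(i)
      f(time) + rnorm(1, 0, sdPersonal) + rnorm(length(time), 0, sdResidual)))
  })
  traj <- do.call(rbind, trajList)
  traj[runif(length(traj)) < pMissing] <- NA  # MCAR missing values
  traj
}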

Using package kml
Package kml has already been used in various studies (Touchette et al. 2008; Pryor et al. 2011; Pingault et al. 2011, 2014). The following example is included in the package.

Dataset description
The EPIPAGE cohort, funded by INSERM and the French general health authority, is a multi-regional French follow-up survey of severely premature children. It included more than 4000 children born at less than 33 weeks of gestational age, and two control samples of children, born at 33-34 weeks of gestational age and at full term respectively. The general objectives were to study short- and long-term motor, cognitive and behavioral outcomes in these children, and to determine the impact of medical practice, care provision and organization of perinatal care, environment, family circle and living conditions on child health and development. About 2600 children born severely premature and, respectively, 400 and 600 controls were followed up to the age of 5 years (Larroque et al. 2008) and then to the age of 8 (Larroque et al. 2011).
The database belongs to the INSERM unit U953 (P.Y. Ancel) which has agreed to include some of the data in the package.

Code
A part of the EPIPAGE database has been included in the package, mainly the gender and the "Strengths and Difficulties Questionnaire" (SDQ) score at ages 3, 4, 5 and 8. The SDQ is a behavioral questionnaire for children and adolescents aged 4 through 16. It measures the severity of the disability (a higher score indicates a greater disability). We present here a complete example of code for analyzing the data. The data are loaded simply using the data() instruction:

R> library("kml")
R> data("epipageShort", package = "kml")
R> head(epipageShort)

  id gender sdq3 sdq4 sdq5 sdq8
1 S1   Male   24

Partitioning the data is then performed using the function kml(). At first, we want to "see" the clustering process. Since it might be slow, we ask for only two redrawings, with the graph display.
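The calls building the 'ClusterLongData' object and launching the clustering are sketched below; the column indices and measurement times follow the dataset description above, but the exact arguments are an assumption:

R> cldSDQ <- cld(epipageShort, timeInData = 3:6, time = c(3, 4, 5, 8))
R> kml(cldSDQ, nbRedrawing = 2, toPlot = "both")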

R> choice(cldSDQ)
R> plotAllCriterion(cldSDQ)
The result of choice() is presented in Figure 4; the result of plotAllCriterion() in Figure 7(a). According to the latter, there is no clear evidence for any "best number of clusters". According to expert opinion, the partition with four clusters seems the most relevant. According to this partition, four profiles can be found in the population: (A) high then decreasing, (B) low, (C) low then increasing and (D) high trajectories (see Figure 7(b)).
From a medical standpoint, this result is particularly pertinent: at three years of age, groups (A) and (D) are close. Then the health of some of these children does not change, while others see an improvement in their condition. The opposite occurs for groups (B) and (C): close at three years of age, some children keep good health while some others see their condition deteriorate.
Finding the determinants that explain the divergence between (A) and (D) (or (B) and (C)) is therefore a major issue. This can be done by trying to explain membership of group (A) or (D) using classical analyses. In our example, we examine whether gender has an impact:

R> epipageShort$clusters <- getClusters(cldSDQ, 4)
R> epipageGroupAD <- epipageShort[epipageShort$clusters %in% c("A", "D"), ]
R> summary(glm(clusters ~ gender, data = epipageGroupAD,
+    family = "binomial"))

Call:
glm(formula = clusters ~ gender, family = "binomial",
    data = epipageGroupAD)

In our example, we see that gender has no influence on belonging to group A or D.

Using package kml3d
Package kml3d has already been used in various studies (Pingault et al. 2012; Wahl et al. 2014). The following example is included in the package.

Presentation of the data
The QUIDEL database aims to gain better knowledge of hormone profiles among women who have no fertility problem. This database has been described as the largest existing database on hormone profiles in the normal human menstrual cycle, including ultrasound scans on the day of ovulation (Ecochard 2006). It involves 107 women and 283 cycles in all, with identification of the day of ovulation and daily titration of the levels of the four main hormones in the ovulation cycle. It has already been the subject of numerous publications (including Ecochard and Gougeon 2000; Ecochard, Boehringer, Rabilloud, and Marret 2001). The database belongs to the laboratory in charge of the analysis of hormone trajectories (CNRS 5558, René Ecochard), which has agreed to include some of the data in the package.

Code
In this analysis, we investigate the joint evolution of pregnanediol and temperature. The data are included in the package kml3d.
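The preparation and clustering calls are sketched below (pregnandiol is the dataset shipped with kml3d; the column indices given to timeInData are placeholders):

R> library("kml3d")
R> data("pregnandiol", package = "kml3d")
R> cldPreg <- cld3d(pregnandiol,
+    timeInData = list(pregnandiol = 2:28, temperature = 29:55))
R> kml3d(cldPreg)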

R> choice(cldPreg)
There is no clear evidence for any "best number of clusters" (see Figure 8).
If, according to expert opinion, we consider the partition with 4 clusters, we can export it as a 3D dynamic PDF graph. We first build an .asy file and create a LaTeX main document.
R> scene <- plot3dPdf(cldPreg, 3)
R> saveTrianglesAsASY(scene)
R> makeLatexFile()

Then, using Asymptote (in a console), we can build a .prc file that can be included in a .pdf file using LaTeX.
Cons> asy -inlineimage -tex pdflatex scene.asy
Cons> pdflatex main.tex

Table 1: Execution times (in seconds) using fastKml, for 2 to 6 clusters with 20 rerolls each; n is the number of individuals, t the length of the trajectories. NA indicates that R is unable to allocate a vector of this size.
The result is presented in Figure 9.

Default values
There is a huge number of k-means variants in the literature. Different authors propose to modify the initial conditions, the center calculation method, the distance between individuals, the distance between clusters, or the stopping criteria, while others sequentialize the cluster allocation or randomly disrupt some individuals to escape from local maxima (see Liao 2005 for a survey of these techniques). It is obviously not possible to test all these variants, and even less their combinations. In packages kml and kml3d, we decided to define some default settings for the non-expert, and to allow the expert user to modify a large number of parameters. We chose the default values in two ways: when there is an article showing the superiority of one method over the others (kmeans++, Copy Mean, the Calinski & Harabasz criterion), it is used as the default method. Otherwise, we took the method that most closely matches the original k-means algorithm (Hartigan and Wong 1979): the distance between trajectories is the Euclidean distance, the cluster centers are the means of the clusters, and the distance between two clusters is the Euclidean distance between their centers. This version of k-means is also the one used in Genolini and Falissard (2010), an article showing that in some cases kml outperforms Proc Traj. Table 1 shows the limitations in terms of data size and, when appropriate, the average execution time. The computation was performed using fastKml with the default settings (number of clusters: 2 to 6; 20 rerolls each). The tests were conducted under Windows on a laptop equipped with an Intel Core i5-2520M CPU @ 2.50GHz and 8GB of RAM.

Limitations
On small data sizes, k-means has a clear advantage over mixture models: it always converges. Thus, the limitation in terms of size lies in the meaning that one can give to the results. Is it reasonable to present the average trajectory of a group of three or four individuals? Can we consider that only two or three repeated measurements define a trajectory? The answer depends on the specificities of each scientific field. In any case, packages kml and kml3d can work with very small datasets.

Discussion
This article presents the packages kml and kml3d, versions of k-means adapted to the analysis of single and joint variable trajectories respectively. They are able to deal with missing values, provide an easy way to run the algorithm several times and their plotting facilities help the user to choose the appropriate number of clusters when criteria traditionally devoted to this task are not efficient. Package kml3d also provides devices for displaying and exporting 3D dynamic graphs.

Limitations
The limitations of packages kml and kml3d are those inherent in all clustering algorithms. These techniques are mainly exploratory; they cannot statistically test the reality of cluster existence. Packages kml and kml3d are not model-based, which can be an advantage (non-parametric, more flexible) but also a disadvantage (no scope for testing goodness of fit). The determination of the optimal number of clusters is still an unsettled issue. The EM algorithm can also be particularly sensitive to the problem of local maxima. Finally, the kml3d package provides some tools making it possible to "see" the trajectories in 3D, but these tools can only display two variables at the same time. This might be a problem when clustering data using more than two joint variables.

Strengths
Packages kml and kml3d provide non-parametric algorithms: they do not need any prior information, and they enable the clustering of trajectories that do not follow polynomial or otherwise parametric shapes. They avoid the issues related to model selection. Package kml3d provides a way to cluster data according to several joint trajectories. This can help to highlight complex relationships between variable-trajectories; it also combines the information of several strongly correlated variable-trajectories into a single nominal variable.

Perspectives
A number of unsolved problems still need further investigation. In the context of joint trajectories, the optimization of the number of clusters is becoming an increasingly important issue, since it is not possible to graphically represent the result of the partitioning process when there are more than two variables. It is possible that the particular situation of joint longitudinal data could lead to an efficient solution not yet found in the general context of cluster analysis. Another interesting approach would be to compare the efficiency of the initialization methods: kmeans++ is more efficient than the classic randomAll, but the other methods have not yet been formally evaluated. It would be interesting to study their respective performances. Another interesting extension would be the clustering of binary longitudinal data, an approach already used in Subtil, Boussari, Bastard, Etard, Ecochard, and Genolini (2015).