RSKC : An R Package for a Robust and Sparse K-Means Clustering Algorithm

Witten and Tibshirani (2010) proposed an algorithm to simultaneously find clusters and select clustering variables, called sparse K-means (SK-means). SK-means is particularly useful when the dataset has a large fraction of noise variables (that is, variables without useful information to separate the clusters). SK-means works very well on clean and complete data but cannot handle outliers or missing data. To remedy these problems we introduce a new robust and sparse K-means clustering algorithm implemented in the R package RSKC. We demonstrate the use of our package on four datasets. We also conduct a Monte Carlo study to compare the performances of RSK-means and SK-means regarding the selection of important variables and identification of clusters. Our simulation study shows that RSK-means performs well on clean data and better than SK-means and other competitors on outlier-contaminated data.


Introduction
K-means is a very popular clustering method introduced by Steinhaus (1956) and popularized by MacQueen (1967). K-means is conceptually simple, optimizes a natural objective function, and is widely implemented in statistical packages. Datasets may contain two types of variables: clustering variables and noise variables. Clustering variables change their behavior from cluster to cluster while noise variables behave similarly across clusters. It is easy to construct examples where K-means breaks down in the presence of a large fraction of noise variables. Furthermore, it may be of interest to simultaneously find the clusters and a small number of variables which are sufficient to uncover the cluster structure. Witten and Tibshirani (2010) proposed an algorithm called sparse K-means (SK-means) to address these issues. Panels (b) and (c) of Figure 1 show the partitions obtained by K-means and SK-means on such an example: K-means is unable to uncover the cluster structure while SK-means produces a good partition.
SK-means cleverly exploits the fact that commonly used dissimilarity measures (e.g., the squared Euclidean distance) can be additively decomposed into $p$ terms, each depending on a single variable. Given a cluster partition $\mathcal{C} = (C_1, C_2, \ldots, C_K)$, the between-cluster dissimilarity measure $B(\mathcal{C})$ satisfies
$$B(\mathcal{C}) = \sum_{j=1}^{p} B_j(\mathcal{C}),$$
where $B_j(\mathcal{C})$ solely depends on the $j$th variable. The basic idea is to assign weights to each variable and base the clustering algorithm on the corresponding weighted dissimilarity measure. Given $p$ non-negative weights $\mathbf{w} = (w_1, w_2, \ldots, w_p)$, consider the weighted between-cluster dissimilarity measure
$$B(\mathcal{C}, \mathbf{w}) = \sum_{j=1}^{p} w_j \, B_j(\mathcal{C}). \quad (1)$$
SK-means searches for the pair $(\mathbf{w}^*, \mathcal{C}^*)$ that maximizes (1) subject to the constraints $\sum_{j=1}^{p} w_j^2 \le 1$ and $\sum_{j=1}^{p} w_j \le l$, for a given value of $l$ between 1 and $\sqrt{p}$. In other words, SK-means performs a regularized (LASSO-type) version of K-means.
Sections 2 and 4 show that a very small fraction of outliers (e.g., 1% of outlying observations, contaminating as little as one out of several hundred features) can drastically upset the performance of SK-means. To remedy this problem we propose a robustified alternative called robust and sparse K-means (RSK-means). Our simulation study and real data examples show that RSK-means works well on both clean and contaminated data.
The rest of the paper is organized as follows. Section 2 describes RSK-means. Section 3 illustrates the implementation of our method in the RSKC package using four real datasets. Section 4 reports the results of our simulation study and Section 5 provides concluding remarks.

Robust and Sparse K-Means
The K-means clustering algorithm finds $K$ clusters $C_1, \ldots, C_K$ that minimize the within-cluster sum of squares
$$W(\mathcal{C}) = \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, i' \in C_k} d_{i,i'}, \quad (2)$$
where $C_1, \ldots, C_K$ are the disjoint sets of cluster indices, $n_k$ is the number of observations in the $k$th cluster, and $d_{i,i'} = \sum_{j=1}^{p} d_{i,i',j}$ is the (additive) dissimilarity measure between the $i$th and $i'$th observations. When our observations are vectors $x_1, \ldots, x_n$ in $\mathbb{R}^p$ and $d_{i,i'}$ is the squared Euclidean distance between $x_i$ and $x_{i'}$, we have $d_{i,i',j} = (x_{i,j} - x_{i',j})^2$, $j = 1, \ldots, p$, and
$$W(\mathcal{C}) = 2 \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \mu_k \|^2,$$
where $\mu_k$ is the sample mean of the observations in the $k$th cluster. A popular algorithm (see Lloyd 1982) to find local solutions to Equation 2 is as follows. First randomly select $K$ initial "centers" $\mu_1, \ldots, \mu_K$ and iterate the following two steps until convergence: (a) Given cluster centers $\mu_1, \ldots, \mu_K$, assign each point to the cluster with the closest center.
(b) Given a cluster assignment, update the cluster centers to be the sample mean of the observations in each cluster.
Although this algorithm decreases the objective function at each iteration, it may be trapped in a local minimum. Hence, one tries several starting points and returns the best solution.
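To make the iteration concrete, below is a minimal R sketch of Lloyd's algorithm; it is our own illustration (not the implementation in any of the packages discussed here), assumes squared Euclidean distances, and does not guard against empty clusters.

R> lloyd_kmeans <- function(x, K, max_iter = 100) {
+   centers <- x[sample(nrow(x), K), , drop = FALSE]  # random initial centers
+   cl <- integer(nrow(x))
+   for (it in seq_len(max_iter)) {
+     # (a) assign each point to the cluster with the closest center
+     d2 <- sapply(seq_len(K), function(k)
+       rowSums(sweep(x, 2, centers[k, ])^2))
+     new_cl <- max.col(-d2)              # index of the smallest distance
+     if (identical(new_cl, cl)) break    # assignments stable: converged
+     cl <- new_cl
+     # (b) update each center to the sample mean of its cluster
+     for (k in seq_len(K))
+       centers[k, ] <- colMeans(x[cl == k, , drop = FALSE])
+   }
+   list(cluster = cl, centers = centers)
+ }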
Cuesta-Albertos, Gordaliza, and Matran (1997) proposed a modification of this algorithm in order to obtain outlier-robust clusters. The main idea is to replace step (b) by (b') Given a cluster assignment, trim α100% of the observations with the largest distance to their cluster centers, and update the cluster centers to the sample mean of the remaining observations in each cluster.
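In the same sketch notation, step (b') replaces the center update with a trimmed one (again a hypothetical helper of ours, not package code):

R> trimmed_update <- function(x, cl, centers, alpha) {
+   d2 <- rowSums((x - centers[cl, , drop = FALSE])^2)  # distance to own center
+   keep <- rank(-d2) > ceiling(alpha * nrow(x))        # drop the farthest alpha*100%
+   for (k in seq_len(nrow(centers)))
+     centers[k, ] <- colMeans(x[keep & cl == k, , drop = FALSE])
+   centers
+ }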
Since the total sum of squares $\frac{1}{n} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{i,i'}$ does not depend on the cluster assignments, minimizing the within-cluster sum of squares is equivalent to maximizing the between-cluster sum of squares
$$B(\mathcal{C}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{i,i'} - W(\mathcal{C}).$$
Witten and Tibshirani (2010) introduced non-negative weights $w_j$, $j = 1, \ldots, p$, for each feature and solved
$$\max_{\mathcal{C}, \mathbf{w}} \; \sum_{j=1}^{p} w_j \, B_j(\mathcal{C}) \quad (3)$$
subject to $\|\mathbf{w}\|_2 \le 1$, $\|\mathbf{w}\|_1 \le l$ and $w_j \ge 0$, $j = 1, \ldots, p$, where $l > 1$ determines the degree of sparsity (in terms of non-zero weights) of the solution. The optimization problem in Equation 3 can be solved by iterating the following steps: (a) Given weights $\mathbf{w}$ and cluster centers $\mu_1, \ldots, \mu_K$, assign each point to the cluster with the closest center using weighted squared Euclidean distances.
(b) Given weights w and cluster assignments C 1 , . . . , C K , update the cluster centers to be the weighted sample mean of the observations in each cluster.
(c) Given cluster assignments $C_1, \ldots, C_K$ and centers $\mu_1, \ldots, \mu_K$, find a new set of weights $\mathbf{w}$ by maximizing (3) with the cluster assignments held fixed. There is a closed form expression for the vector of weights that solves this optimization problem (see Witten and Tibshirani 2010).
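For reference, the closed form (from Witten and Tibshirani 2010) soft-thresholds the vector of per-feature between-cluster sums $B = (B_1, \ldots, B_p)$ and renormalizes, with the threshold chosen by binary search so that the $L_1$ constraint is met; the sketch below uses our own helper names.

R> soft <- function(b, delta) pmax(b - delta, 0)
R> update_weights <- function(B, l) {
+   w <- soft(B, 0); w <- w / sqrt(sum(w^2))
+   if (sum(w) <= l) return(w)      # L1 constraint already satisfied
+   lo <- 0; hi <- max(B)
+   for (i in 1:50) {               # binary search for the threshold
+     delta <- (lo + hi) / 2
+     w <- soft(B, delta); w <- w / sqrt(sum(w^2))
+     if (sum(w) > l) lo <- delta else hi <- delta
+   }
+   w
+ }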
A naive first approach to robustify this algorithm is to use trimmed K-means with weighted features, and then optimize the weights using the trimmed sample. In other words, to replace step (b) above with step (b') and to calculate the $B_j(C_1, \ldots, C_K)$, $j = 1, \ldots, p$, in step (c) without the observations flagged as outliers. A problem with this approach is that if an observation is outlying in a feature that received a small weight in steps (a) and (b'), then this observation might not be trimmed. In this case, the variable where the outlier is most evident will receive a very high weight in step (c) (because this feature will be associated with a very large $B_j$). This may in turn cause the weighted K-means steps above to form a cluster containing this single point, which will then not be downweighted (since its distance to the cluster center will be zero). This phenomenon is illustrated in the following example.

Simulated data example: We generate n = 300 5-dimensional observations arranged in three clusters of equal size. The observations have multivariate normal distributions with covariance matrix equal to the identity and centers at $(\mu, \mu, 0, 0, 0)$ with $\mu = -3$, 0 and 3 for each cluster, respectively. Only the first 2 features contain information on the clusters. Figure 2 contains the scatterplot of these 2 clustering features. Colors and shapes indicate the true cluster labels.
To illustrate the problem mentioned above, we replace the 4th and 5th entries of the first 3 observations with large outliers. The naive robustification described above returns a disappointing partition because it fails to correctly identify the clustering features, placing all the weight on the noise ones. The result is shown in Figure 2 (b). As expected, SK-means also fails in this case assigning all the weights to the noise features and forming one small cluster with the 3 outliers (see Figure 2 (c)). Finally, Figure 2 (d) shows the partition found by our RSK-means algorithm which is described below.
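A short sketch that generates data of this form (the seed and the outlier magnitude are illustrative choices of ours):

R> set.seed(1)
R> mu <- rep(c(-3, 0, 3), each = 100)  # cluster centers in the first 2 features
R> x <- cbind(mu, mu, 0, 0, 0) + matrix(rnorm(300 * 5), 300, 5)
R> x[1:3, 4:5] <- 25                   # outliers in the 4th and 5th features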
The key step in our proposal is to use two sets of trimmed observations, which we call the weighted and unweighted trimmed sets. By doing this, zero weights will not necessarily mask outliers in the noise features. Our algorithm can be described as follows:
(a) Perform trimmed K-means on the weighted dataset:
(i) Given weights $\mathbf{w}$ and cluster centers $\mu_1, \ldots, \mu_K$, assign each point to the cluster with the closest center using weighted squared Euclidean distances.
(ii) Given weights $\mathbf{w}$ and cluster assignments, trim the α100% of observations with the largest distances to their cluster centers, and update the cluster centers to the sample mean of the remaining observations in each cluster.
(iii) Iterate the two steps above until convergence.
(b) Let O W be the subscripts of the α100% cases labelled as outliers in the final step of the weighted trimmed K-means procedure above.
(c) Using the partition returned by trimmed K-means, calculate the unweighted cluster centers $\tilde\mu_k$, $k = 1, \ldots, K$. For each observation $x_i$, let $\tilde d_i$ be the unweighted distance to its cluster center, i.e., $\tilde d_i = \sum_{j=1}^{p} (x_{i,j} - \tilde\mu_{k,j})^2$ for $i \in C_k$.
(d) Let $O_E$ be the subscripts of the α100% cases with the largest unweighted distances $\tilde d_i$, and let $O = O_W \cup O_E$ be the final set of trimmed points.
(e) Given cluster assignments $C_1, \ldots, C_K$, centers $\mu_1, \ldots, \mu_K$ and trimmed points $O$, find a new set of weights $\mathbf{w}$ by solving the optimization problem in Equation 3 with the $B_j$ computed without the observations in $O$.
This algorithm is called RSK-means and it is implemented in the R package RSKC (Kondo 2016).
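The following sketch shows how the two trimmed sets combine; it is simplified in that both sets of distances use the same centers, whereas steps (c) and (d) above compute $\tilde d_i$ from the unweighted centers $\tilde\mu_k$ (the helper is hypothetical):

R> flag_outliers <- function(x, cl, centers, w, alpha) {
+   m  <- ceiling(alpha * nrow(x))
+   dW <- rowSums(sweep((x - centers[cl, ])^2, 2, w, `*`))  # weighted distances
+   dE <- rowSums((x - centers[cl, ])^2)                    # unweighted distances
+   oW <- order(dW, decreasing = TRUE)[seq_len(m)]
+   oE <- order(dE, decreasing = TRUE)[seq_len(m)]
+   union(oW, oE)  # O = O_W U O_E, excluded from the weight update in (e)
+ }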
RSK-means requires the selection of three parameters: the $L_1$ bound $l$, the trimming proportion α, and the number of clusters K. In our experience the choice α = 0.10 works well for most applications. The choice of $l$ determines the degree of sparsity, so $l$ can be chosen to achieve a desired number of selected features. In practice, one can also consider several combinations of values for these two parameters and compare the results. The problem of selecting K is outside the scope of this paper and has been discussed extensively in the literature for standard K-means clustering; see Milligan and Cooper (1985), Kaufman and Rousseeuw (1990), Sugar and James (2003), and Tibshirani and Walther (2005). To select the number of clusters K we recommend the Clest algorithm of Dudoit and Fridlyand (2002), which is the default method implemented in the RSKC package.
Missing values are a challenging difficulty for clustering algorithms. Commonly used implementations of K-means (including SK-means and trimmed K-means) do not work if there are missing values in the data. However, even in moderately high-dimensional datasets it is not uncommon to find that a large proportion of observations have a missing value in at least one variable, so using only complete cases may result in a significant loss of information. Unfortunately, standard imputation methods will typically not work well in a clustering application, since the missing values may depend on the cluster to which the observation belongs, and the clusters are of course unknown. The RSKC package implements the following adjustment to deal with potential missing values. Consider the $i$th observation $x_i = (x_{i,1}, \ldots, x_{i,p})$ and the center of the $k$th cluster $\mu_k = (\mu_{k,1}, \ldots, \mu_{k,p})$. Let $M_{i,k} \subset \{1, 2, \ldots, p\}$ be the set of coordinates where either $x_i$ or $\mu_k$ has a missing value. Our proposed algorithm scales the (weighted) distances according to the number of missing features. More specifically, the adjusted distance between $x_i$ and $\mu_k$ is
$$d(x_i, \mu_k) = S_{i,k} \sum_{j \notin M_{i,k}} w_j \, (x_{i,j} - \mu_{k,j})^2, \qquad S_{i,k} = \frac{\sum_{j=1}^{p} w_j}{\sum_{j \notin M_{i,k}} w_j},$$
where $w_j$, $1 \le j \le p$, are the sparsity weights.
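A sketch of this adjustment for a single observation-center pair (our own helper, not the package's internal code):

R> adj_dist <- function(x_i, mu_k, w) {
+   ok <- !(is.na(x_i) | is.na(mu_k))  # coordinates outside M_ik
+   S  <- sum(w) / sum(w[ok])          # scaling factor S_ik
+   S * sum(w[ok] * (x_i[ok] - mu_k[ok])^2)
+ }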

Using the RSKC Package
In this section we illustrate the functionality available in the RSKC package (version 2.4.2) through the analysis of four real datasets. The main function in the package is called RSKC and it implements the RSK-means algorithm described in Section 2. The function RSKC has five main arguments:
• d: An N by p data matrix, where each row corresponds to a data point.
• ncl: The number of clusters to be identified.
• L1: The upper bound for the $L_1$ constraint on the vector of weights w, L1 > 1.
• alpha: The proportion of observations to be trimmed.
• nstart: The number of random initial partitions used in step (a).
In addition, the function RSKC runs K-means when alpha = 0 and L1 = NULL; sparse K-means when alpha = 0 and L1 is a value greater than 1; and trimmed K-means when alpha > 0 and L1 = NULL.
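For example, assuming a data matrix d with three clusters, the four special cases correspond to calls of the form:

R> re.km  <- RSKC(d, ncl = 3, alpha = 0,   L1 = NULL)  # K-means
R> re.sk  <- RSKC(d, ncl = 3, alpha = 0,   L1 = 6)     # sparse K-means
R> re.tk  <- RSKC(d, ncl = 3, alpha = 0.1, L1 = NULL)  # trimmed K-means
R> re.rsk <- RSKC(d, ncl = 3, alpha = 0.1, L1 = 6)     # RSK-means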
Several other R packages implement K-means type algorithms. For example, the function kmeans in stats (R Core Team 2016), the trimmed K-means algorithm in trimcluster (Hennig 2012), and the SK-means algorithm in sparcl (Witten and Tibshirani 2013). None of these can handle missing values.

Optical recognition of handwritten digits
The dataset was downloaded from the UCI Machine Learning Repository (Bache and Lichman 2013) and consists of n = 1797 digitized images of the numbers 0 to 9 handwritten by 13 subjects. The number of observations for each digit varies between 174 and 183. The raw observations are 32 x 32 bitmaps, which were divided into 64 non-overlapping blocks of 4 x 4 bits. For each block we observe the number of "black" pixels. Hence, each image is represented by p = 64 variables recording the count of pixels in each block, taking integer values between 0 and 16. This dataset is included in the RSKC package in the object optd. Its row names identify the true digit in the image and the column names identify the position of each block in the original bitmap. The following code loads the data into R:

R> library("RSKC")
R> data("optd")
R> truedigit <- rownames(optd)

We run K-means, trimmed K-means, SK-means and RSK-means on the data ignoring the known labels (confirmatory cluster analysis). Since the digits tend to appear in the center of the images, the left and right ends are often left blank. We therefore select an $L_1$ bound that returns around 16 weights equal to zero, corresponding to the 16 blocks on the two sides of each picture. The trimming proportion α is set to 0.10, which works well in most cases. The observations trimmed by RSKC are not necessarily removed from the cluster partition, as they can be assigned to the closest cluster.

We use the classification error rate (CER, Chipman and Tibshirani 2005) to measure the agreement between the true labels and the cluster assignments returned by each algorithm. Given two partitions of a dataset, the CER is the proportion of pairs of observations that are together in one partition and apart in the other. Note that the CER is one minus the Rand index (Rand 1971). We also compute a measure of the probability of correct classification for each partition. Given the clusters identified by an algorithm, we compute the proportion of cases from each original class that are assigned to each cluster. The largest of these proportions is the sensitivity of that class. This number represents the proportion of cases of each class that remained "together" in the partition returned by the clustering algorithm. We also report the label of the cluster where this majority of points was assigned. Given two vectors of labels label1 and label2, the sensitivities of the classes in label1 with respect to the partition in label2 can be computed using Sensitivity(label1, label2).
The following code runs K-means, trimmed K-means, SK-means and RSK-means and computes the CER and the sensitivity of the true classes (digits) for each of the returned partitions; in the sensitivity output, the column labels represent the true classes.
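A sketch of those calls (using the $L_1$ bound 5.7 discussed below and α = 0.1; CER and Sensitivity are functions in the RSKC package):

R> set.seed(1)
R> re <- list(
+   K   = RSKC(optd, ncl = 10, alpha = 0,   L1 = NULL, nstart = 300),
+   TK  = RSKC(optd, ncl = 10, alpha = 0.1, L1 = NULL, nstart = 300),
+   SK  = RSKC(optd, ncl = 10, alpha = 0,   L1 = 5.7,  nstart = 300),
+   RSK = RSKC(optd, ncl = 10, alpha = 0.1, L1 = 5.7,  nstart = 300))
R> sapply(re, function(r) CER(r$labels, truedigit))  # CER of each partition
R> Sensitivity(truedigit, re$RSK$labels)             # sensitivities of true digits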
Note that the largest relative frequency in the kth row is the reported sensitivity of the kth true class. Figure 3 shows that K-means, trimmed K-means and SK-means split the digits "1", "8" and "9" into 2 or 3 different clusters, whereas in the RSK-means partition the observations corresponding to each digit are mostly assigned to a single cluster.
The rows in Figure 4 depict the conditional relative frequencies of the true classes given the estimated clusters. Note that clusters "D", "G" and "I" from K-means are relatively weak (no true class dominates). Similarly, clusters "D", "H" and "I" from trimmed K-means and clusters "C", "E", "F" and "J" from SK-means are rather weak. The weakest cluster in the RSK-means partition is cluster "F", although even it is largely dominated by 9's.
The sensitivity of the true digit 1 from SK-means is particularly low (45%), and Figure 3 shows that the true 1's are separated into three distinct clusters: "C", "E" and "F". The subjects tend to write 1 in three ways: a straight vertical line with a hat (an Arial-like style), a straight line with a hat and a horizontal stroke at the bottom (a Times New Roman, TNR, style), and a thick straight line. SK-means classifies the Arial-style observations into a cluster containing observations with true digit 9, the TNR-style observations into a cluster containing observations with true digit 2, and the straight-line observations into a cluster containing observations with true digit 8.
The function showbitmap displays the original 32 x 32 bitmap of any observation in optd. The following commands return the bitmaps of the true-digit-1 observations that were classified into the clusters labelled "C", "E" or "F" by SK-means.
We present partial output from the R console for one observation of each writing style (Arial, TNR and straight line). For instance, observation 973 (straight-line style) was classified into the cluster containing the true-digit-8 observations.
It is interesting to note that, of the observations trimmed by RSK-means, about 40% are true 1's, and many of these are in turn of the "Arial font" type, which were later assigned to a cluster dominated by 9's. Had they not been trimmed, these 1's might have affected the center of the cluster that contains most of the 9's. In other words, trimming these misclassified observations results in a better separation between the clusters containing true 1's and true 9's.
The choice of $L_1$ bound $L_1 = 5.7$ returns 19 weights equal to zero. These correspond to 19 blocks that were not used in the construction of the clusters by RSK-means and SK-means. The following script displays the names of the variables which received zero weights:

R> names(which(re$RSK$weights == 0))

[Figure 5: Recall-performance curves for the different cluster partitions, plotting the percentage of retained cases versus the CER obtained using only the assigned cases.]
It is good practice to flag the potential outliers identified by each method and assess the performance of the methods after these potential outliers have been removed. A good method should achieve a small CER over a large range of recall percentages. For each of the considered algorithms, we computed the weighted Euclidean distance from each case to its cluster center; for the non-sparse methods, all feature weights are set to 1, and for the robust methods the cluster centers are computed without the trimmed cases. For each clustering method, we sorted the weighted distances in increasing order and then computed the CER without the β% of cases with the largest weighted distances (removed as potential outliers). Figure 5 plots the percentage of retained cases (that is, 100 − β%) versus the CER computed without the β% most extreme cases. Note that, for the robust procedures, the first 10% of removed cases are those originally trimmed by the robust algorithms when determining the cluster centers and partitions. For all methods we observe an overall increasing trend in CER as the recall percentage increases, indicating that most of the removed cases are indeed misclassified outliers. RSK-means returns the lowest CER for all the considered recall percentages.
In summary, RSK-means identifies the true handwritten digits better than the other considered algorithms. It is also interesting that it does so while ignoring the left and right sides of the pictures, and that many of the trimmed "outliers" are 1's that look very similar to 9's.

Digits from a collection of Dutch utility maps
The dataset, available from the UCI Machine Learning Repository, was extracted from a collection of Dutch utility maps and consists of digitized handwritten numerals between 0 and 9. The object DutchUtility available in the RSKC package contains this dataset. There are 200 instances of each digit, for a total of n = 2000 images. The raw observations consist of 32 x 45 bitmaps, which are divided into a 16 x 15 grid of non-overlapping blocks, each of size 2 x 3 bits. The number of "black" pixels in each block is recorded as a feature taking integer values between 0 and 6; hence each image is represented by p = 240 features.
As in the previous example, we use K-means, trimmed K-means, SK-means and RSK-means to identify K = 10 clusters. The choice $L_1 = 12.4$ results in approximately 30 feature weights equal to zero, which we expect to ignore the 15 blocks on the left side and the 15 blocks on the right side of the images, as before. Again, the trimming proportion is set at 0.10. The CER and the sensitivity of the true classes (digits) for each method are computed as in the previous example; a sketch of the calls follows.
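The calls mirror those of the previous example (assuming, as for optd, that the row names of DutchUtility hold the true digits):

R> data("DutchUtility")
R> truedigit <- rownames(DutchUtility)
R> re.tk  <- RSKC(DutchUtility, ncl = 10, alpha = 0.1, L1 = NULL, nstart = 300)
R> re.rsk <- RSKC(DutchUtility, ncl = 10, alpha = 0.1, L1 = 12.4, nstart = 300)
R> CER(re.rsk$labels, truedigit)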

[Sensitivity output not reproduced in full; the cluster labels, in the order reported, were F, J, E, H, D, G, B, I, C and A.]
K-means returns cluster "B", which contains 96% of the true 0's and 84% of the 8's. Similarly, SK-means returns cluster "F", which contains 96% of the true 0's and 86% of the 8's. On the other hand, trimmed K-means and RSK-means create 10 clusters, each of which uniquely corresponds to a majority of observations from a single true digit group. Although the CER is almost the same for trimmed K-means and RSK-means, the sensitivity for digit "2" of trimmed K-means is substantially larger than that of RSK-means, and the sensitivity for digit "6" of RSK-means is substantially larger than that of trimmed K-means. This could be due to the weight allocation. Figure 6 shows the weight allocations over the 16 x 15 reduced-scale bitmap. RSK-means assigns zero weights to 12 (out of 15) bottom pixels. As the horizontal line of the digit "2" is often written in the bottom pixels, treating these features as noise resulted in the relatively poor performance of RSK-means on this digit. The left panel of Figure 7 shows the 16 x 15 reduced-scale bitmap of an observation with true digit "2" which was misclassified by RSK-means but classified correctly by trimmed K-means. We note that the original 32 x 45 bitmaps are not available. The RSKC package contains the function showDigit which, given an observation index, returns the reduced-scale bitmap; Figure 7 can be generated with

R> showDigit(467)
Conversely, RSK-means outperforms trimmed K-means in terms of the sensitivity for digit "6", as trimmed K-means tends to confuse observations with true digits "6" and "4". Observations with true digit "4" or "6" both have high pixel counts in the inverted-triangle region at the top center of the bitmap, which makes them hard to distinguish there. The better separation of observations with true digits "4" and "6" could be because RSK-means assigns less weight to this region (see Figure 6). Unlike the handwritten digits of the previous example, the digits from the collection of Dutch utility maps are not always centered. In the absence of obvious noise features, the sparsity component of RSK-means did not contribute much to its good performance on this dataset; its trimming component, on the other hand, played an important role.

DBWorld e-mail
The next dataset contains n = 64 bodies of e-mails in binary "bag-of-words" representation. Filannino (2011) manually collected the dataset from the DBWorld mailing list between October 19 and 29, 2011. The dataset is available from the UCI Machine Learning Repository and is included in our package as the object DBWorld. The DBWorld mailing list announces conferences, jobs, books, software and grants. Of the 64 e-mails in this dataset, 29 correspond to conference announcements. Filannino (2011) applied supervised learning algorithms to classify the e-mails into two classes: "conference announcement" and "everything else".
A "bag-of-words" representation of a document consists of a vector of p zeroes and ones, where p is the number of words extracted from the entire corpus (the size of the dictionary). Typically these collections do not contain many common or highly unusual words. For example, the dictionary in this example does not contain words such as "the", "is" or "which", so-called stop words, and words that have less than 3 characters or more than 30 characters. The entry of the vector is 1 if the corresponding word belongs to the e-mail and 0 otherwise. The number of unique words in this dataset is p = 4702. The following code loads the data into R R> library("RSKC") R> data("DBWorld") R> true <-rownames(DBWorld) To compare the performance of the different K-means variants in identifying the conference announcements from the other documents, we run K-means, trimmed K-means, SK-means and RSK-means on the data ignoring the known labels. Since there is no information about the number of clustering features or the number of outliers that may be present in the data, we explore the results obtained with different values of the sparsity and trimming parameters. Specifically, we use L 1 = 10, 12, . . . , 30 and α = 0.00, 0.05, 0.1, always with K = 2. The following code runs the 4 algorithms for all combinations of L 1 and α and computes the CER performances. This code also returns the size of the smallest cluster and the number of nonzero weights for each procedure as well as the plots in Figure 8.

E-mails 26 and 50 turn out to have identical bag-of-words representations:

R> mean(DBWorld[50, ] == DBWorld[26, ])
[1] 1

When we increase the amount of sparsity (e.g., $L_1$ > 18) both robust fits identify roughly the same two clusters with a much lower CER (bottom panel of Figure 8). The table of classification errors with respect to the true labels for the clusters found by RSK-means with α = 0.10 and $L_1$ = 20 can be computed with the following commands:

R> ialpha <- which(alphas == 0.10)

It is important to note that for this partition RSK-means trims e-mails 26 and 50, while the non-sparse trimmed K-means fits (with either α = 0.05 or 0.10) do not. In other words, these potentially atypical documents are trimmed only by the robust and sparse method; robustness alone does not seem to be sufficient. Finally, this partition agrees reasonably well with the true classes (see the CER plot in the bottom panel of Figure 8).
In order to investigate why the variables chosen by the sparse and robust method provide a better partition, we show that the selected features seem to have a higher degree of association with the true classes than the other variables. To measure the association between words and true classes we use the chi-squared test on the corresponding 2-by-2 contingency table ("presence"/"absence" of the word versus "conference announcement"/"other"). Of the 4702 words in the dictionary, 167 (3.6%) yield a p value of 0.05 or less. Although RSK-means with α = 0.1 and $L_1$ = 20 returns nonzero weights for 3819 words (Figure 8), most of the weight (95%) is assigned to only 371 variables. Of these, 74 (20%) are among the words with a small chi-squared p value. In other words, many of the highly weighted words from RSK-means are indeed associated with the e-mail classes.
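A sketch of this computation (warnings from chisq.test about small expected counts are suppressed, and constant words, if any, are skipped):

R> word.pvals <- apply(DBWorld, 2, function(word)
+   tryCatch(suppressWarnings(chisq.test(table(word, true))$p.value),
+            error = function(e) NA))
R> sum(word.pvals <= 0.05, na.rm = TRUE)  # words associated with the classes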
In summary, RSK-means with a relatively large sparsity parameter ($L_1$) performed best in identifying conference announcement e-mails. This method assigns most weight to features that are highly associated with the true underlying classes. Furthermore, RSK-means trims the two e-mails that, if left unchecked, tend to form their own non-informative 2-point cluster.

Biometric identification of subjects with miniature inertial sensors
In this example we analyze data collected from inertial sensors placed on the bodies of 8 individuals. The dataset is available online from the UCI Machine Learning Repository. The original objective of the experiment conducted by Altun, Barshan, and Tuncel (2010) was to assess the performance of supervised learning methods for identifying human activities. We will focus on one of these activities (sitting), with the goal of identifying the different subjects. As recent mobile devices contain accelerometers, biometric subject identification based on inertial sensors has great potential to serve as an unobtrusive authentication procedure for mobile phone users (Derawi, Nickel, Bours, and Busch 2010).
The observations come from an accelerometer, a gyroscope and a magnetometer, with a 25-Hz sampling frequency. Measurements are taken for 5 minutes and divided into 5-second segments, resulting in 60 observations per subject (n = 480). Each 5-second time window produces 125 measurements, which are then converted into 96 features: the maximum, minimum, mean, skewness, kurtosis, the largest 30 peaks of the discrete Fourier transform of the signal, the 30 associated frequencies, and the auto-covariance function at lags 0 to 30. Each individual wears 5 sets of sensors, each sensor unit contains three trackers (accelerometer, gyroscope and magnetometer), and each tracker records data in 3 dimensions. This results in 45 (5 × 3 × 3) sets of signals per subject, for a total of 4320 features (45 × 96). These features coincide with those in Altun et al. (2010), except that they considered only the top 5 Fourier peaks and auto-correlations up to lag 10.
The following code loads the corresponding data matrix sitting.
R> library("RSKC") R> sitting0 <-read.csv("sitting.csv") R> true <-sitting0[,1] R> sitting <-as.matrix(sitting0[,-1]) R> featActivity <-read.csv("featActivity.csv") Each row of the matrix featActivity identifies the source of each feature (location of the sensor on the subject's body, the type of tracker, direction, and the name of the summary statistics). Because the units of the available features are different, we normalize them to the interval [0,1] as follows: each column x j of sitting is transformed into (x j −min(x j ))/(max(x j )− min(x j )) R> dat.norm0 <-apply(sitting, 2, function(x) (x -min(x))/(max(x) -min(x))) We now remove features that remain constant across our observations, which results in a total of p = 4301 features.
R> complete.columns <- complete.cases(t(dat.norm0))
R> dat.norm <- dat.norm0[, complete.columns]
R> feat <- featActivity[complete.columns, ]

We use K-means, trimmed K-means and their sparse versions to simultaneously reduce the number of features and identify clusters of observations coming from the same individual in the study. Altun et al. (2010) reduce the number of features via principal component analysis, keeping the 30 components with the largest eigenvalues, so an ideal level of sparsity could be around 30 nonzero-weighted features. For demonstration purposes, however, we analyze the dataset assuming no information about the number of clustering features, only that the sample size is the same for each subject. For the sparse methods we set the $L_1$ bound to 5, 10, 15, . . . , 45, whereas for the robust methods we consider α = 0.05 and 0.1.

Figure 9 shows the results of our analysis for different levels of sparsity, in terms of the number of non-zero weights, the size of the smallest cluster and the CER. Since we know that there are 60 observations per subject, we expect a good partition to have approximately 60 points in each cluster, which corresponds to a large minimum cluster size. Note that RSK-means with $L_1$ = 5 or 10 and α = 0.05 or 0.1, and SK-means with $L_1$ = 5 or 10, attain the largest possible minimum cluster size. K-means, in contrast, returns one very small cluster of size 3. Interestingly, these 3 cases are indeed trimmed (in either the weighted or the unweighted distances) by both robust clustering methods for all considered values of α and $L_1$. Finally, note from the last panel that all the partitions with the "right" smallest cluster size (60) are perfect solutions (CER = 0).
The ability of a clustering method to filter out noise features seems to be an advantage in this example, where the sparse methods outperform the non-sparse ones. To investigate which features are selected by the sparse variants we analyze the feature weights. SK-means with $L_1$ = 5 and RSK-means with α = 0.1 and $L_1$ = 5 assign non-zero weights to only 39 and 38 features, respectively. Furthermore, the weight allocations of the two methods are very similar, so here we only look at the weights returned by RSK-means, obtained with:

R> set.seed(1)
R> reRSK <- RSKC(dat.norm, L1 = 5, nstart = 300, ncl = 8, alpha = 0.1)

We first compute the proportion of weight associated with the measurements taken by each of the 5 sensor units on the individual's body:

R> myRound <- function(x) {
+   name.x <- names(x)
+   or <- order(x, decreasing = TRUE)
+   out <- paste(sprintf("%1.1f", x), "%", sep = "")
+   out[x == 0] <- "ZERO"
+   names(out) <- name.x
+   return(out[or])
+ }
R> w <- reRSK$weights
R> sw <- sum(w)/100
R> myRound(tapply(w, feat$body, function(x) sum(x)/sw))

right.arm  left.arm right.leg  left.leg     torso
  "30.8%"   "27.8%"   "23.9%"    "9.6%"    "7.8%"

The largest weights are assigned to the features based on the signals from the right arm (30.8%) while the features based on torso signals receive the smallest weights (7.8%). We now look at the proportion of weight allocated to the features originating from each type of tracker:

R> myRound(tapply(w, feat$tracker, function(x) sum(x)/sw))

magnetometer accelrometer    gyroscope
     "92.3%"       "7.7%"       "ZERO"

Here "ZERO" denotes an exact zero. As one might expect, the features obtained from the gyroscope do not seem to be useful for clustering sitting individuals, while those taken from the magnetometer receive the largest proportion of weight. The five summary statistics receiving the highest weights can be found as follows:

R> myRound(tapply(w, feat$statistic, function(x) sum(x)/sw))[1:5]

  peak1    mean     max     min    ACF0
"30.4%" "29.2%" "21.3%" "19.1%"  "ZERO"

The largest weight (30.4%) is assigned to the features corresponding to the first Fourier peak, followed by the mean, maximum and minimum. The remaining statistics receive zero weights.
To explore the precision of the cluster partitions obtained when outliers are removed from the data, we construct a curve similar to a recall curve. For each clustering method (K-means, trimmed K-means, SK-means with $L_1$ = 5 and RSK-means with $L_1$ = 5 and α = 0.1), we sort the weighted distances of each point to its cluster center and then compute the CER obtained after removing the β% of cases with the largest distances, which may be considered potential outliers. The curves are displayed in Figure 10. Note that the curves for the sparse methods (RSK-means and SK-means) are constant at CER = 0, while those of the non-sparse methods have an increasing trend. This indicates that the cases misclassified by K-means and trimmed K-means are indeed possible outliers (those with the largest weighted distances to their cluster centers).

[Figure 10: Recall-performance curves for the different cluster partitions on the sitting dataset. Each curve corresponds to a clustering method and displays the resulting CER when the β% of cases with the largest distances to their cluster centers are removed.]
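A sketch of one such curve for a fitted object r (assuming, as in the outputs above, that it exposes labels, weights and centers components; for the non-sparse fits all weights are set to 1):

R> recall.cer <- function(r, x, true, betas = seq(0, 20, by = 2)) {
+   d2 <- sapply(seq_len(nrow(x)), function(i)
+     sum(r$weights * (x[i, ] - r$centers[r$labels[i], ])^2))
+   sapply(betas, function(b) {
+     keep <- rank(-d2) > ceiling(b / 100 * length(d2))  # drop the farthest b%
+     CER(r$labels[keep], true[keep])
+   })
+ }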
In summary, we see that in this example sparse methods outperform their non-sparse counterparts. Non-sparse methods tend to misclassify the cases with large distances to their cluster centers. Both sparse methods achieve a notable dimension reduction (RSK-means uses 38 features, while SK-means uses 39) while still correctly identifying all clusters. The results of this analysis suggest that signals from gyroscopes may not be useful for identifying sitting individuals, and that few, relatively simple summary statistics suffice for this purpose.

Simulation Results
In this section we report the results of a simulation study that investigates the properties of the proposed RSK-means algorithm and compares it with K-means, trimmed K-means, and SK-means. Our simulated datasets contain n = 60 observations generated from multivariate normal distributions with covariance matrix equal to the identity. We form 3 clusters of equal size by setting the mean vectors to $\mu \, b$, where $b \in \mathbb{R}^{500}$ has its first 50 entries equal to 1 followed by 450 zeroes, and $\mu = -1$, 0 and 1 for the three clusters, respectively. Note that the clusters are determined by the first 50 features only (which we call clustering features), the remaining 450 being noise features.
We will assess the performance of the different cluster procedures with regard to two outcomes: the identification of the true clusters and the identification of the true clustering features.

[Figure 11: Boxplots of CERs calculated between the true partition and the partitions from the four algorithms, over 1000 simulated datasets of size n = 60 with p = 500 features (50 of them true clustering features). "K", "SK", "TK" and "RSK" denote K-means, sparse K-means, trimmed K-means and robust sparse K-means, respectively.]
To measure the degree of cluster agreement we use the CER. To investigate the correct identification of the clustering features we adapt the average precision measure, computed as follows: sort the features in decreasing order according to their weights and count the number of true clustering features appearing among the 50 top-ranked features.
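In R this count can be read off the fitted weights directly; a sketch for one fitted object re, recalling that the first 50 features are the true clustering ones:

R> top50 <- order(re$weights, decreasing = TRUE)[1:50]
R> sum(top50 <= 50)  # true clustering features among the 50 top-ranked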
We consider two contamination configurations, referred to as Models 1 and 2 below. Both can be thought of as producing "scattered" outliers located outside the range of the data. Additional experiments, in which outliers were introduced by replacing a single feature with a large value, are discussed in Kondo, Salibian-Barrera, and Zamar (2012). The general conclusions of all our studies agree with those reported here.
The $L_1$ bound for each algorithm was selected so that approximately 50 features would receive positive weights on clean datasets. More specifically, we generated 50 datasets without outliers and, for each algorithm, considered the 11 $L_1$ bounds 5.7, 5.8, ..., 6.7; for each dataset, the bound resulting in the number of non-zero weights closest to 50 was recorded.
In our simulation study we used the corresponding average of the selected $L_1$ bounds for each algorithm: 6.264 for SK-means and 6.234 for RSK-means. The proportion α of trimmed observations in trimmed K-means and RSK-means was set to 0.1, the true proportion of outliers.
We generated 1000 datasets from each model. Figure 11 shows the boxplots of the CERs between the true partition and the partitions returned by the algorithms. When the data do not contain outliers, the performances of K-means, SK-means and RSK-means are comparable. The results for Models 1 and 2 show that the outliers affect the performance of K-means and SK-means, and also that the presence of 450 noise features upsets the performance of the trimmed K-means algorithm. RSK-means, however, retains small values of the CER under both types of contamination.
To compare the algorithms with regard to the features they select, we consider the median weight assigned to the true clustering features. The results are shown in Figure 12. When there are no outliers in the data, the SK-means and RSK-means algorithms assign very similar weights to the correct clustering features. The presence of outliers, however, causes SK-means to assign much smaller weights to the clustering features. Table 1 contains the average number of non-zero weights returned by each algorithm and the average number of true clustering features among the 50 features receiving the highest weights. When the data are clean, both SK-means and RSK-means return approximately 50 clustering features, and these are among the ones with the highest weights. The presence of outliers results in SK-means either concentrating on noise features, so that the average precision drops to zero (Model 1), or selecting about 300 features (Model 2), while RSK-means remains unaffected.

Conclusion
We propose a robust algorithm to simultaneously identify clusters and features using K-means, and illustrate our method and the R package RSKC with the analysis of four real datasets. The main idea behind our proposal is to adapt the SK-means algorithm of Witten and Tibshirani (2010) by trimming a given fraction of the observations that are farthest away from their cluster centers, following the approach of the trimmed K-means algorithm of Cuesta-Albertos et al. (1997). Sparsity is obtained by assigning weights to the features and bounding their $L_1$ norm; only the features with positive optimal weights are used to determine the clusters. Because outliers may contain atypical entries in features that are being downweighted, our algorithm also considers the distances of each point to its cluster center computed using all features. Our simulations and examples show that RSK-means works as well as SK-means when there are no outliers in the data, and much better than SK-means in the presence of outliers. The handwritten digit data (the first real data example) and the DBWorld dataset (the third) show that when noise features and outliers are present, RSK-means can recover the true partition, outperforming all the considered competing algorithms. The dataset of digits from Dutch utility maps (the second) shows that when sparsity is not clearly present, RSK-means still does a good job and can deal with a relatively large fraction of outliers. The dataset containing signals from body-worn miniature inertial sensors (the fourth) shows that when outliers are not clearly present, RSK-means outperforms K-means and trimmed K-means, and performs as well as SK-means by identifying sensible clustering features. Finally, the RSK-means algorithm is implemented in the R package RSKC, available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=RSKC, which includes the two digit datasets and the DBWorld dataset.