Fast Computation of Trimmed Means

We present two methods of computing trimmed means in O(n) time, without sorting the data. The existing method implemented in major statistical packages relies on sorting, which takes O(n log n) time. The proposed algorithms are based on the quickselect algorithm for computing order statistics, which has O(n) expected running time, and are an order of magnitude faster than the existing method for large data sets.


Introduction
Trimmed means are frequently used in the statistical sciences as robust estimates of location. They are much less sensitive to outliers than the arithmetic mean, and are also often used in image processing as filters. The α%-trimmed mean of x ∈ R^n is the quantity

    TM(x) = 1/(n − 2K) Σ_{i=K+1}^{n−K} x_(i),    (1)

where x_(i) are the order statistics of x and K = [αn].
The current method of computing a trimmed mean requires sorting the data, and its complexity is O(n log n). This sort-based method is implemented in major statistical packages such as SAS, SPSS and R. For small to medium sized data sets the CPU time is negligible. However, for very large data sets (n ≥ 10^5) the sort operation becomes a bottleneck. Data sets of this size or larger are now common (analysis of internet traffic and microarray data processing are two noteworthy examples).
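For reference, the sort-based computation amounts to the following sketch (the function name is ours, not from any package's source):

```cpp
#include <algorithm>
#include <vector>

// The existing approach: sort a copy of the data in O(n log n),
// then average the order statistics x_(K+1) .. x_(n-K).
double trimmed_mean_sort(std::vector<double> x, int K) {
    std::sort(x.begin(), x.end());
    double s = 0.0;
    for (std::size_t i = K; i + K < x.size(); ++i) s += x[i];
    return s / (x.size() - 2.0 * K);
}
```

The sort dominates the running time; the averaging pass is linear.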
In this note we draw attention to the little-known fact that the sort operation can be avoided by using a simple alternative formula based on the computation of order statistics, which can be done in O(n) expected time using the quickselect algorithm (Sedgewick 1988; Press, Teukolsky, Vetterling, and Flannery 2002), or in O(n) worst-case time using the BFPRT median-of-medians algorithm (Blum, Floyd, Pratt, Rivest, and Tarjan 1973). We propose two different algorithms to calculate trimmed means. The gain in numerical performance is six- to ten-fold, as illustrated by our numerical experiments. Existing algorithms can be modified very easily, and source code in C++ is provided.
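The selection step can be sketched with the C++ standard library, whose std::nth_element provides the same O(n) expected-time behaviour (it is typically an introselect; the quickselect of Press et al. 2002 is a drop-in alternative). The function name is ours:

```cpp
#include <algorithm>
#include <vector>

// Returns the k-th order statistic x_(k) (1-based k) in O(n) expected
// time, without fully sorting the data. The copy is taken by value so
// the caller's array is left untouched.
double order_statistic(std::vector<double> x, std::size_t k) {
    std::nth_element(x.begin(), x.begin() + (k - 1), x.end());
    return x[k - 1];
}
```

After the call, the element at position k−1 is exactly where a full sort would place it, which is all the trimmed-mean formulas below require.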

Computation of trimmed and Winsorized means
The following quantity is known as the K-Winsorized mean:

    WM(x) = 1/n [ K x_(K+1) + Σ_{i=K+1}^{n−K} x_(i) + K x_(n−K) ].    (2)

By comparing to (1), we can easily see that

    TM(x) = ( n WM(x) − K ( x_(K+1) + x_(n−K) ) ) / (n − 2K).    (3)

Suppose we have computed the order statistics x_(K+1) and x_(n−K) using the quickselect algorithm in O(n) time. Then we can compute the Winsorized mean without sorting, by using

    WM(x) = 1/n Σ_{i=1}^{n} min( max( x_i , x_(K+1) ) , x_(n−K) ).    (4)

The trimmed mean is then computed from WM, also without sorting, by (3).
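This route translates directly into C++; in the sketch below the name is ours, and the thresholds lo = x_(K+1) and hi = x_(n−K) are assumed to have been precomputed by quickselect:

```cpp
#include <algorithm>
#include <vector>

// Trimmed mean via the Winsorized mean: clamp every element to
// [lo, hi] and average (the Winsorized mean), then strip the K
// clamped copies of each threshold to recover the trimmed mean.
double trimmed_mean_winsor(const std::vector<double>& x, int K,
                           double lo, double hi) {
    const std::size_t n = x.size();
    double wm = 0.0;
    for (double v : x)                       // clamp, then average
        wm += std::min(std::max(v, lo), hi);
    wm /= n;
    // Remove the K clamped copies of each threshold.
    return (n * wm - K * (lo + hi)) / (n - 2.0 * K);
}
```

Both passes are linear, so with two quickselect calls for the thresholds the whole computation is O(n).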
For very large n ≥ 10^8, caution should be exercised when using (4) because of roundoff errors. The sum contains an extra 2K terms compared to (2), which may result in a loss of precision. In our experiments, when we used a 4-byte float type (in the C language) to calculate WM using (4), it led to an unacceptable loss of precision and produced values up to twice the actual trimmed mean for n = 10^6. While the use of double or extended precision easily solves this problem on CPUs, this is not the case when calculations are performed on Graphics Processing Units (GPUs).
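The effect is easy to reproduce; the following demonstration is our own, not from the paper. Accumulating ten million copies of 0.1 in a 4-byte float drifts far from the true sum 10^6, because once the running sum is large, 0.1 is smaller than the spacing between adjacent floats and each addition rounds badly; an 8-byte double accumulator stays accurate:

```cpp
// Accumulate n copies of 0.1 in a 4-byte float accumulator.
double sum_as_float(int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += 0.1f;
    return s;
}

// Same sum with an 8-byte double accumulator.
double sum_as_double(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += 0.1;
    return s;
}
```

With n = 10^7 the float result is off by roughly 10% while the double result is accurate to well under one unit.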
General purpose GPUs, such as NVIDIA's Tesla unit (NVIDIA Corporation 2008), allow one to offload routine numerical calculations to a GPU device, which possesses several hundred processor cores and is capable of executing hundreds of thousands of threads in parallel. Many modern software packages can execute parts of their code on GPUs. Summation is easily parallelized for GPUs by using a generic reduce operation over binary trees, see NVIDIA Corporation (2007). However, the use of extended precision in a reduction on GPUs bears a cost, as a working array of extended-precision numbers of size n has to be used. Below we present an alternative method which does not use the extra terms in the sum (4).
Note that it would be incorrect to compute the trimmed mean analogously to (4) by replacing the terms outside the range [x_(K+1), x_(n−K)] with zeroes: the resulting expression is not the trimmed mean, and is in fact a discontinuous function of the components of x.
Instead, let us compute the trimmed sum

    S = Σ_{i=K+1}^{n−K} x_(i)    (5)

in the following way:

    S = Σ_{i=1}^{n} w_i x_i,  with  w_i = 1 if x_(K+1) < x_i < x_(n−K),
                                    w_i = a/b if x_i = x_(K+1),
                                    w_i = c/d if x_i = x_(n−K),
                                    w_i = 0 otherwise,    (6)

where the integers a, b, c, d are computed as described below. Note that if no elements of x coincide, there are no issues with the computation: we let a = b = c = d = 1, and (5) and (6) coincide. A problem occurs when some elements coincide. The multiplicity of the k-th order statistic is the number of components of x equal to x_(k).

Consider the numbers

    b = #{ i : x_i = x_(K+1) },    a = #{ i : x_i ≤ x_(K+1) } − K,
    d = #{ i : x_i = x_(n−K) },    c = #{ i : x_i ≥ x_(n−K) } − K,

so that b and d are the multiplicities of x_(K+1) and x_(n−K), while a and c count how many of the tied copies of each threshold fall inside the trimmed range. With these choices (5) and (6) coincide even in the presence of ties. The calculation of the numbers a, b, c, d above is done in O(n) operations when x_(K+1) and x_(n−K) are known. Therefore the overall complexity of evaluating S in (6) is O(n).
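A minimal one-pass sketch of evaluating S (the name trimmed_sum is ours) follows. Since the b tied copies of x_(K+1) each receive weight a/b, their total contribution is a·x_(K+1), and likewise c·x_(n−K) for the upper threshold, so only a and c are needed for the sum itself; the sketch assumes the two thresholds are precomputed and distinct:

```cpp
#include <vector>

// One O(n) pass evaluating the trimmed sum S, given the thresholds
// lo = x_(K+1) and hi = x_(n-K) (assumed distinct).
double trimmed_sum(const std::vector<double>& x, int K, double lo, double hi) {
    long a = -static_cast<long>(K);   // ends up as #{x_i <= lo} - K
    long c = -static_cast<long>(K);   // ends up as #{x_i >= hi} - K
    double s = 0.0;
    for (double v : x) {
        if (v <= lo) ++a;
        if (v >= hi) ++c;
        if (v > lo && v < hi) s += v; // strictly interior terms
    }
    // Tied threshold values enter with total weight a and c respectively.
    return s + a * lo + c * hi;
}
```

The trimmed mean is then S / (n − 2K).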
Compared to formulas (3)–(4), there are fewer roundoff errors, at a marginal additional cost. In the calculations we performed using 4-byte floats, the loss of precision was ten times smaller than with (3)–(4).

Numerical comparison
Let us now illustrate the advantages of the formulas based on (3) and (6) numerically. We considered several data sets, generated as follows.
Calculations were performed on a Pentium IV 2.1 GHz workstation with 1 GB RAM. The results were very similar for all different data sets and are averaged in Table 1.
In a separate series of experiments, we compared the CPU time for calculating multiple trimmed means of smaller data sets (n < 100) by formulas (6) and (3), and by using the traditional sort operation. Here all methods performed equally, and there was no speedup.

Conclusion
We described two alternative algorithms for calculating trimmed means without the sort operation. The speedup is an order of magnitude on very large data sets (n ≥ 10^5). The algorithm based on the Winsorized mean is faster, but requires care with roundoff errors. The proposed algorithms are easy to implement and to incorporate into existing statistical packages.
A. Source code
C++ source code for trimmed means based on (6). Note that arrays in C are indexed from 0. The code requires the quickselect procedure available from Press et al. (2002). Here n is the length of the array x and K = [αn].

return TrimmedSum(x,n,K)/(n-2*K); }
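Only the final wrapper of the original listing appears above. As a self-contained sketch of the same algorithm (not the paper's verbatim code), std::nth_element can stand in for the quickselect of Press et al. (2002); the function name and structure are ours:

```cpp
#include <algorithm>
#include <vector>

// Trimmed mean based on (6). Arrays in C++ are 0-based, so after
// selection x_(K+1) is x[K] and x_(n-K) is x[n-K-1].
double TrimmedMean(std::vector<double> x, int K) {
    const std::size_t n = x.size();
    std::nth_element(x.begin(), x.begin() + K, x.end());
    const double lo = x[K];                        // x_(K+1)
    std::nth_element(x.begin(), x.begin() + (n - K - 1), x.end());
    const double hi = x[n - K - 1];                // x_(n-K)
    if (lo == hi) return lo;                       // all kept elements equal
    // One pass: a = #{x_i <= lo} - K and c = #{x_i >= hi} - K give the
    // total weight of the tied threshold values in (6).
    long a = -static_cast<long>(K), c = -static_cast<long>(K);
    double s = 0.0;
    for (double v : x) {
        if (v <= lo) ++a;
        if (v >= hi) ++c;
        if (v > lo && v < hi) s += v;              // strictly interior terms
    }
    return (s + a * lo + c * hi) / (n - 2.0 * K);
}
```

The two selection calls and the single accumulation pass give O(n) expected time overall.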