ks : Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R

Kernel smoothing is one of the most widely used non-parametric data smoothing techniques. We introduce a new R package ks for multivariate kernel smoothing. Currently it contains functionality for kernel density estimation and kernel discriminant analysis. It is a comprehensive package for bandwidth matrix selection, implementing a wide range of data-driven diagonal and unconstrained bandwidth selectors.


Introduction
Kernel density estimation is a popular tool for visualising the distribution of data. See Simonoff (1996), for example, for an overview. When multivariate kernel density estimation is considered, it is usually in the constrained context with diagonal bandwidth matrices, e.g. in the packages sm (Bowman and Azzalini 2007) and KernSmooth (Wand and Ripley 2006) for R (R Development Core Team 2007). We introduce a new R package ks, available from the Comprehensive R Archive Network at http://CRAN.R-project.org/, which implements diagonal and unconstrained data-driven bandwidth matrices for kernel density estimation, and which can also be used for multivariate kernel discriminant analysis. The ks package implements selectors for 2- to 6-dimensional data.
In Section 2, we supply a brief review of kernel density estimation and indicate some reasons why unconstrained matrix selectors would be a useful generalisation over diagonal selectors. The optimal bandwidth selection problem for these unconstrained matrices is considered in Section 3. Also in this section are examples from ks for density estimation. We demonstrate the package's functionality for discriminant analysis in Section 4. With these unconstrained selectors now available, we conclude by offering some general recommendations.

Kernel density estimation
For a d-variate random sample X_1, X_2, ..., X_n drawn from a density f, the kernel density estimate is defined by

f̂(x; H) = n^(-1) Σ_{i=1}^n K_H(x − X_i),

where x = (x_1, x_2, ..., x_d)' and X_i = (X_i1, X_i2, ..., X_id)', i = 1, 2, ..., n. Here K(x) is the kernel, a symmetric probability density function; H is the bandwidth matrix, which is symmetric and positive-definite; and K_H(x) = |H|^(-1/2) K(H^(-1/2) x). The choice of K is not crucial: we take the standard normal kernel K(x) = (2π)^(-d/2) exp(−x'x / 2) throughout. In contrast, the choice of H is crucial in determining the performance of f̂.
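The estimator can be sketched directly in a few lines of R. This is a minimal illustration of the formula above with a normal kernel, not the implementation used in ks; the sample and bandwidth below are arbitrary.

```r
# Evaluate fhat(x; H) = n^-1 sum_i K_H(x - X_i) at a single point x,
# with K the standard d-variate normal kernel.
kde_point <- function(x, X, H) {
  d <- ncol(X)
  Hinv <- solve(H)
  const <- (2 * pi)^(-d / 2) / sqrt(det(H))   # |H|^(-1/2) (2 pi)^(-d/2)
  diffs <- sweep(X, 2, x)                     # rows are X_i - x
  quad <- rowSums((diffs %*% Hinv) * diffs)   # Mahalanobis quadratic forms
  mean(const * exp(-0.5 * quad))              # average of K_H(x - X_i)
}

set.seed(1)
X <- cbind(rnorm(200), rnorm(200))            # arbitrary bivariate sample
H <- matrix(c(0.3, 0.1, 0.1, 0.2), 2, 2)      # arbitrary unconstrained bandwidth
fhat0 <- kde_point(c(0, 0), X, H)
```

Evaluating this function over a grid and summing the cell masses gives a quick sanity check that f̂ integrates to approximately 1.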
According to Wand and Jones (1993), the most useful parameterisations of the bandwidth matrix are the diagonal, H = diag(h_1^2, h_2^2, ..., h_d^2), and the general or unconstrained, which places no restrictions on H other than that it remain symmetric and positive definite.
To compare these two parameterisations, we examine the 'dumbbell' density, a normal mixture displayed on the left in Figure 1. This density is unimodal. On the right is a sample of 200 data points.
From this data sample, we compute a diagonal and an unconstrained bandwidth matrix. The off-diagonal elements of the latter matrix are almost as large in magnitude as its diagonal elements, and so induce substantial smoothing in an oblique direction.
The respective kernel density estimates are produced in Figure 2. The diagonal bandwidth matrix constrains the smoothing to directions parallel to the co-ordinate axes, so it cannot apply accurate levels of smoothing to the obliquely oriented central portion; the result is a bimodal density estimate. The unconstrained bandwidth matrix correctly produces a unimodal density estimate. Wand and Jones (1993) and Duong and Hazelton (2003, 2005b) contain more extensive simulation studies, over a range of target densities, of the relative gains in efficacy from an unconstrained parameterisation compared to a diagonal one. The general conclusion from these papers is that an unconstrained bandwidth matrix is most useful when there is large probability mass oriented away from the co-ordinate directions, as for the dumbbell density examined here.
Closely related to the bandwidth parameterisation is the pre-transformation of the data. Instead of basing our bandwidth selection on the original data X_1, X_2, ..., X_n, we could use the transformed data X*_1, X*_2, ..., X*_n, where the transformation is either sphering,

X*_i = S^(-1/2) X_i,

where S is the sample covariance matrix of the untransformed data; or scaling,

X*_i = S_D^(-1/2) X_i,

where S_D = diag(s_1^2, s_2^2, ..., s_d^2) and s_1^2, s_2^2, ..., s_d^2 are the marginal sample variances. The bandwidth matrix H* selected for the sphered or scaled data can be back-transformed to the original scale by H = S^(1/2) H* S^(1/2) or H = S_D^(1/2) H* S_D^(1/2), as appropriate. The transformed data are more aligned with the co-ordinate axes, which may lead us to believe that more restricted bandwidth parameterisations are suitable for these transformed data.
Sphering or scaling with the most restricted parameterisation H* = h*^2 I, where I is the identity matrix, gives H = h*^2 S or H = h*^2 S_D. However, Wand and Jones (1993) strongly advise against these combinations of pre-transformation and bandwidth parameterisation.
In light of this, we need to use at least the diagonal bandwidth matrix with these transformations. Let H* = diag(h*_1^2, h*_2^2, ..., h*_d^2). The bandwidth matrix obtained from pre-scaled data is H = diag(s_1^2 h*_1^2, s_2^2 h*_2^2, ..., s_d^2 h*_d^2), which is still a diagonal matrix, so it is not clear that this offers any advantage over a diagonal bandwidth matrix without pre-scaling. On the other hand, with pre-sphering, the resulting H = S^(1/2) H* S^(1/2) is non-diagonal. However, its off-diagonal elements are still constrained since they rely on the sample variance, whereas those of the unconstrained parameterisation have no such restriction. The latter will therefore be better in situations where the sample variance is not an appropriate summary of the dispersion of the data. With this in mind, we focus on bandwidth selectors for the maximally general unconstrained parameterisation.
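The sphering transformation and its back-transformation can be sketched in a few lines of base R. This is a minimal illustration; the symmetric matrix square root via eigendecomposition is one standard choice, and the bandwidth H* below is an arbitrary placeholder rather than a selected one.

```r
# Symmetric matrix square root via eigendecomposition: A = V diag(l) V'
# gives A^(1/2) = V diag(sqrt(l)) V'.
mat_sqrt <- function(A) {
  e <- eigen(A, symmetric = TRUE)
  e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
}

set.seed(2)
x1 <- rnorm(500)
X <- cbind(x1, 0.8 * x1 + rnorm(500))   # correlated bivariate sample
S <- cov(X)
S_half <- mat_sqrt(S)

Xc <- sweep(X, 2, colMeans(X))          # centred (covariance is unaffected)
X_star <- t(solve(S_half, t(Xc)))       # sphered data: cov(X_star) = I

H_star <- diag(c(0.2, 0.25))            # placeholder bandwidth on sphered scale
H <- S_half %*% H_star %*% S_half       # back-transformed: H = S^(1/2) H* S^(1/2)
```

Even though H* is diagonal here, the back-transformed H picks up off-diagonal entries from S, which is the constrained form of oblique smoothing discussed above.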

Optimal bandwidth selectors
We briefly present the main ideas behind optimal, data-based bandwidth selectors. The reader can consult Duong and Hazelton (2003, 2005b) and references therein for more details on the multivariate bandwidth selection problem.
We measure the performance of f̂ (in common with the majority of researchers in this field) using the Mean Integrated Squared Error (MISE) criterion,

MISE(H) = E ∫ [f̂(x; H) − f(x)]^2 dx.

Our aim in bandwidth selection is to estimate

H_MISE = argmin_H MISE(H)

over the space of all symmetric, positive definite d × d matrices. It is well known that the optimal bandwidth H_MISE does not have a closed form. To make progress, it is usual to employ an asymptotic approximation, known as the AMISE (Asymptotic MISE): for the normal kernel,

AMISE(H) = n^(-1) (4π)^(-d/2) |H|^(-1/2) + (1/4) (vech H)' Ψ_4 (vech H),

where vech is the vector half operator which stacks the lower triangular half of a matrix into a vector, e.g. vech [a b; b c] = (a, b, c)'.
The subscript 4 on Ψ_4 relates to the order of the derivatives involved. See Wand and Jones (1995, p. 98) for the general expression of the (1/2)d(d + 1) × (1/2)d(d + 1) matrix Ψ_4; for the purposes of this paper, this expression is not required. All that we need is that the elements of Ψ_4 are integrated density derivative functionals

ψ_r = ∫ f^(r)(x) f(x) dx, |r| = 4,

where r = (r_1, r_2, ..., r_d) is a multi-index and f^(r) is the corresponding partial derivative of f. We make use of the tractability of the AMISE by seeking

H_AMISE = argmin_H AMISE(H).

For the next step we estimate the MISE or AMISE. A data-driven bandwidth selector is then either

Ĥ = argmin_H M̂ISE(H) or Ĥ = argmin_H ÂMISE(H).

Different selectors arise from the different methods used in the estimation step.

Plug-in bandwidth selectors
The most well-known univariate plug-in selector is due to Sheather and Jones (1991); one multivariate extension of this is developed in Wand and Jones (1994). The plug-in estimate of the AMISE, formed by replacing Ψ_4 with an estimate Ψ̂_4, can be numerically minimised to give the plug-in bandwidth matrix Ĥ_PI. Noting that ψ_r = E[f^(r)(X)], each functional can be estimated by

ψ̂_r(G) = n^(-1) Σ_{i=1}^n f̂^(r)(X_i; G) = n^(-2) Σ_{i=1}^n Σ_{j=1}^n K_G^(r)(X_i − X_j),

where G is a pilot bandwidth matrix. These ψ̂_r(G) are the elements of Ψ̂_4.
Like H, we need to choose a sensible value for G. We consider pilot bandwidth matrices of the form G = g 2 I along with the pre-transformations of Section 2. This combination of bandwidth parameterisation and pre-transformation is ill-advised for selecting the final bandwidth H but it is acceptable for selecting a pilot bandwidth G. This is because it is not crucial to select G with the same accuracy as for H. With this restricted parameterisation g 2 I, we are able to derive analytical expressions for optimal pilot selectors, allowing us to avoid computationally intensive numerical optimisation for pilot bandwidth selection.
The MSE (Mean Squared Error) for ψ̂_r(g) is

MSE ψ̂_r(g) = E[ψ̂_r(g) − ψ_r]^2.

Wand and Jones (1994) select g to minimise the Asymptotic MSE (AMSE). However, attempting to minimise this criterion may lead to numerical instabilities, as documented in Duong and Hazelton (2003). These authors propose the Sum of AMSE (SAMSE) criterion, which has better numerical and theoretical properties:

SAMSE(g) = Σ_{r:|r|=4} AMSE ψ̂_r(g).

Duong and Hazelton (2003) contains the explicit formula for the selector which minimises this SAMSE.

Cross validation bandwidth selectors
Cross-validation selectors are the main alternative to plug-in selectors. There are three main flavours of cross-validation selectors: least squares, biased and smoothed.

Least squares cross validation
The multivariate version of the least squares cross validation (LSCV) criterion of Rudemo (1982) and Bowman (1984) is

LSCV(H) = ∫ f̂(x; H)^2 dx − 2 n^(-1) Σ_{i=1}^n f̂_{−i}(X_i; H),

where the leave-one-out estimator is

f̂_{−i}(x; H) = (n − 1)^(-1) Σ_{j≠i} K_H(x − X_j).

The LSCV selector Ĥ_LSCV is the minimiser of LSCV(H). For the normal kernel, the convolution identity K_H * K_H = K_{2H} lets us rewrite LSCV as

LSCV(H) = n^(-2) Σ_{i=1}^n Σ_{j=1}^n K_{2H}(X_i − X_j) − 2 [n(n − 1)]^(-1) Σ_{i=1}^n Σ_{j≠i} K_H(X_i − X_j).
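This double-sum form can be sketched directly in base R. The function below is an illustrative evaluation of the rewritten criterion for an arbitrary simulated sample, not the ks implementation.

```r
# LSCV(H) for the normal kernel via double sums over pairwise differences,
# using K_H * K_H = K_(2H).
lscv <- function(X, H) {
  n <- nrow(X); d <- ncol(X)
  kmat <- function(B) {                 # matrix of K_B(X_i - X_j) over all pairs
    Y <- X %*% solve(chol(B))          # whiten so squared distances are x' B^-1 x
    D2 <- as.matrix(dist(Y))^2
    (2 * pi)^(-d / 2) / sqrt(det(B)) * exp(-0.5 * D2)
  }
  K2H <- kmat(2 * H)                    # from int fhat^2 dx = n^-2 sum_ij K_2H
  KH <- kmat(H)
  diag(KH) <- 0                         # leave-one-out: drop the i = j terms
  sum(K2H) / n^2 - 2 * sum(KH) / (n * (n - 1))
}

set.seed(3)
X <- cbind(rnorm(100), rnorm(100))
```

A numerical optimiser over symmetric positive definite H (e.g. over its Cholesky factor) would then give Ĥ_LSCV; the criterion blows up as H shrinks towards zero, which is visible by evaluating it at a very small bandwidth.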

Biased cross validation
Plug-in methods use a pilot bandwidth matrix G, which is independent of H, to estimate Ψ_4. For biased cross validation (BCV), we instead set G = H and use slightly different estimators of the ψ_r. There are two versions of BCV, depending on which of two estimators ψ̌_r or ψ̃_r is used; see Sain, Baggerly, and Scott (1994). The estimates Ψ̌_4 and Ψ̃_4 are obtained from Ψ_4 by substituting ψ̌_r and ψ̃_r for ψ_r, and substituting these into the AMISE gives the criteria BCV1(H) and BCV2(H) respectively. The BCV selectors Ĥ_BCV1 and Ĥ_BCV2 are the minimisers of the appropriate BCV function.

Smoothed cross validation
Smoothed cross validation (SCV) was introduced by Hall, Marron, and Park (1992). SCV can be motivated by starting with a slightly modified version of LSCV in Equation (6), known as the leave-in-diagonals version, valid for data samples which have no repeated values:

n^(-2) Σ_{i=1}^n Σ_{j=1}^n (K_{2H} − 2 K_H + K_0)(X_i − X_j),

where K_0 is the Dirac delta function. To form SCV, we pre-smooth the data differences X_i − X_j by K_{2G}, i.e. replace each kernel by its convolution with K_{2G}, giving

SCV(H) = n^(-1) (4π)^(-d/2) |H|^(-1/2) + n^(-2) Σ_{i=1}^n Σ_{j=1}^n (K_{2H+2G} − 2 K_{H+2G} + K_{2G})(X_i − X_j).

The SCV selector Ĥ_SCV is the minimiser of SCV(H). Again, we consider pilot bandwidth matrices of the form G = g^2 I with pre-transformations. A suitable selector for g is the minimiser of the asymptotic squared distance between the elements ĥ_SCV,ij of Ĥ_SCV and the corresponding elements h_AMISE,ij of H_AMISE; its closed form expression is given in Duong and Hazelton (2005b). Duong and Hazelton (2005a) show that, for a selector Ĥ, the distance between Ĥ and H_AMISE is driven by the accuracy with which Ψ_4 is estimated. For the plug-in selector, they show that this distance is of order tr E[(Ψ̂_4(g) − Ψ_4)^2], which is a weighted sum of the individual mean squared errors E[ψ̂_r(g) − ψ_r]^2, |r| = 4. The SAMSE pilot selector minimises the unweighted sum of these terms, yet still attains the same asymptotic order. This establishes that the SAMSE pilot asymptotically minimises the distance between Ĥ_PI and H_AMISE. An alternative pilot selector based on the weighted MSEs may have better finite sample properties, but this is not pursued further here.
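Under the same conventions as the LSCV sketch, the SCV criterion with pilot G = g^2 I can be evaluated as a double sum. This is an illustration assuming the normal-kernel convolution identities, not the ks code; the sample, H and g below are arbitrary.

```r
# SCV(H) = n^-1 (4 pi)^(-d/2) |H|^(-1/2)
#          + n^-2 sum_ij (K_(2H+2G) - 2 K_(H+2G) + K_(2G))(X_i - X_j),
# with pilot G = g^2 I and K the normal kernel.
scv <- function(X, H, g) {
  n <- nrow(X); d <- ncol(X)
  G <- g^2 * diag(d)
  kmat <- function(B) {                 # K_B at all pairwise differences
    Y <- X %*% solve(chol(B))
    D2 <- as.matrix(dist(Y))^2
    (2 * pi)^(-d / 2) / sqrt(det(B)) * exp(-0.5 * D2)
  }
  bias2 <- sum(kmat(2 * H + 2 * G) - 2 * kmat(H + 2 * G) + kmat(2 * G)) / n^2
  (4 * pi)^(-d / 2) / (n * sqrt(det(H))) + bias2
}

set.seed(4)
X <- cbind(rnorm(100), rnorm(100))
s <- scv(X, 0.09 * diag(2), g = 0.2)
```

The double sum is an integral of a squared difference of smoothed estimates, so it is non-negative, and the whole criterion is strictly positive.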

Code examples
As in Section 2, we generate a sample of 200 points from the dumbbell density to compute bandwidth selectors and their corresponding density estimates.
There are three other arguments which further specify the plug-in selector. The argument nstage is the number of pilot estimation stages (1 or 2); we recommend using 2 stages of pilot estimation, see Wand and Jones (1995, pp. 73-74). The type of pilot estimation is set using pilot ("amse" or "samse"). The argument pre selects the pre-transformation ("scale" or "sphere"). Either pre-sphering or pre-scaling can be used with the unconstrained bandwidths. For the diagonal bandwidths, only pre-scaling should be used, since the back-transformation of pre-sphering results in a non-diagonal matrix.
Recalling that the true density is unimodal, of the density estimates displayed in Figure 3, only the density estimates with unconstrained bandwidth matrices and pre-sphering reproduce this unimodality.
The commands Hlscv and Hlscv.diag are the unconstrained and diagonal LSCV selectors. The command Hbcv implements both BCV1 and BCV2. The default is BCV1; set whichbcv = 2 to call BCV2. Their diagonal counterpart is Hbcv.diag. The unconstrained SCV selector is Hscv and its diagonal version is Hscv.diag. The argument pre is the same as before.
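Drawing these commands together, a typical session might look like the following sketch, where x is assumed to hold the 200-point dumbbell sample and fhat the resulting estimate; the argument names are as described above, though defaults may differ across ks versions:

R> library("ks")
R> Hpi1 <- Hpi(x = x, pilot = "amse", pre = "scale")
R> Hpi2 <- Hpi(x = x, pilot = "samse", pre = "sphere")
R> Hlscv1 <- Hlscv(x = x)
R> Hscv1 <- Hscv(x = x, pre = "sphere")
R> fhat <- kde(x = x, H = Hpi2)
R> plot(fhat)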
More extensive simulation studies are reported in Duong and Hazelton (2003, 2005b). These confirm, for a wider range of target density shapes, the general behaviour and performance of the bandwidth selectors demonstrated here.
So far the calls to kde compute f̂ exactly. This exact computation has O(n^2) complexity, which becomes infeasible for large sample sizes, say n = 10 000, on a current desktop PC. One common technique for increasing computational speed for these large samples is binned kernel estimation, see Wand and Jones (1995, Appendix D), as implemented in KernSmooth (Wand and Ripley 2006). Binning converts the data sample of size n to a grid of size m, so binned estimation remains O(m) regardless of the sample size. Suitable default binning grid sizes are m = 401, 151^2, 51^3 for d = 1, 2, 3. Computing density estimates on a grid becomes less feasible for d > 3.
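The binning idea can be sketched in one dimension with simple nearest-grid-point binning. This is an illustration only; KernSmooth uses the more accurate linear binning, and the kernel sums over the grid can be accelerated further by fast convolution.

```r
# Binned 1-d KDE: reduce n data points to counts on an m-point grid, then
# run the kernel sums over grid points only, O(m^2) here instead of O(n^2).
binned_kde <- function(x, h, m = 401) {
  g <- seq(min(x) - 4 * h, max(x) + 4 * h, length.out = m)
  delta <- g[2] - g[1]
  # assign each observation to its nearest grid point and count
  counts <- tabulate(findInterval(x, g - delta / 2), nbins = m)
  fhat <- sapply(g, function(gj) sum(counts * dnorm(gj - g, sd = h)) / length(x))
  list(eval.points = g, estimate = fhat)
}

set.seed(6)
x <- rnorm(1000)
est <- binned_kde(x, h = 0.3)
```

With m = 401 the binned estimate is visually indistinguishable from the exact O(n^2) computation at the grid points, while the cost no longer grows with n.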
Binned estimation is only defined with diagonal bandwidth matrices. Applicable cases include kernel density estimators with diagonal bandwidth matrices and the integrated density functional estimators with g 2 I pilot bandwidth matrices for the plug-in and SCV selectors.

Kernel discriminant analysis
We have ν populations, each associated with a density f j and a prior probability π j , j = 1, 2, . . . , ν. We wish to allocate a point x in the sample space to one (and only one) of these populations. A common allocation rule is the Bayes discriminant rule: Allocate x to group j 0 where j 0 = argmax j∈{1,2,...,ν} π j f j (x).
From each f_j, we have a random sample X_j1, X_j2, ..., X_j,n_j. These X_ji are collectively known as the training data. In the cases where we wish only to classify a data sample, rather than the entire sample space, the random sample Y_1, Y_2, ..., Y_m drawn from Σ_{j=1}^ν π_j f_j is known as the test data. The kernel discriminant rule (KDR) is obtained from the Bayes discriminant rule by replacing f_j by its kernel density estimate f̂_j(x; H_j), with π_j usually replaced by the sample proportion π̂_j = n_j/n, where n = Σ_{j=1}^ν n_j; that is,

KDR: Allocate x to group j_0 where j_0 = argmax_{j∈{1,2,...,ν}} π̂_j f̂_j(x; H_j).

This now raises the question of the most appropriate bandwidth selectors for these density estimates. We take the simplest approach, using an optimal bandwidth selector for each of the training data sub-samples. These bandwidths are MISE-optimal for each single density function. An alternative is to select bandwidths which satisfy an error measure tailored to discriminant analysis, as is done by Hall and Kang (2005). Their data-driven bandwidth selectors are too computationally intensive for inclusion in ks. Fortunately, they show that their discriminant analysis-optimal selectors have the same asymptotic order as density estimation-optimal selectors.
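The allocation rule itself is a one-liner once the group densities can be evaluated. Below is a minimal base-R sketch with hypothetical, well-separated training groups and fixed illustrative bandwidths, not the ks implementation.

```r
# Evaluate fhat(x; H) = n^-1 sum_i K_H(x - X_i) at a point x (normal kernel).
kde_point <- function(x, X, H) {
  d <- ncol(X)
  diffs <- sweep(X, 2, x)
  quad <- rowSums((diffs %*% solve(H)) * diffs)
  mean((2 * pi)^(-d / 2) / sqrt(det(H)) * exp(-0.5 * quad))
}

# Kernel discriminant rule: allocate x to argmax_j pihat_j * fhat_j(x; H_j).
kdr <- function(x, groups, Hs) {
  n <- sum(sapply(groups, nrow))
  scores <- sapply(seq_along(groups), function(j)
    (nrow(groups[[j]]) / n) * kde_point(x, groups[[j]], Hs[[j]]))
  which.max(scores)                       # index j0 of the allocated group
}

set.seed(5)
g1 <- cbind(rnorm(50) - 3, rnorm(50))     # training data, group 1
g2 <- cbind(rnorm(50) + 3, rnorm(50))     # training data, group 2
Hs <- list(0.25 * diag(2), 0.25 * diag(2))  # fixed illustrative bandwidths
```

In practice each H_j would come from one of the selectors of Section 3, which is what the Hkda commands below automate.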
To illustrate, we use the well-known iris data which is part of the base software for R. There are 50 records each from 3 species of iris: Iris setosa, I. versicolor and I. virginica. We take the first three variables: sepal length, sepal width and petal length. Figure 5 is a scatter plot of these data.
R> library("MASS")
R> data("iris")
R> ir <- iris[, 1:3]
R> ir.group <- iris[, 5]

The commands for bandwidth selection in this case are similar to those for density estimation. The basic commands are Hkda and Hkda.diag for unconstrained and diagonal matrices. We set bw = "plugin", bw = "lscv" or bw = "scv" for the plug-in, LSCV or SCV selectors. Due to the poor performance of the BCV selectors for density estimation, they are not implemented for discriminant analysis. The other arguments which further specify the bandwidth selectors are the same as before.
R> kda.kde(x = ir, x.group = ir.group, Hs = Hpi1)

The kda.kde class has a plot method. For this example, it calls the rgl (Adler and Murdoch 2007) and misc3d (Feng and Tierney 2007) packages to render the 3-dimensional display. In Figure 6, the density estimate for each group is plotted in a separate colour, using the same colours as in Figure 5. The default is to plot the contours of the upper 25% and 50% highest density regions, with the more opaque contour shell being 25% and the less opaque being 50%.
The computation of a misclassification rate (MR) for a kernel discriminant rule depends on whether the test data are independent of the training data. If they are independent, then

MR = 1 − m^(-1) Σ_{j=1}^m 1{Y_j is correctly classified using KDR},

where KDR is the kernel discriminant rule in Equation (10). If the training and test data are the same, then it is appropriate to use a cross validated estimate of the misclassification rate:

MR_CV = 1 − n^(-1) Σ_{j=1}^ν Σ_{i=1}^{n_j} 1{X_ji is correctly classified using KDR_{−ji}},

where KDR_{−ji} is the kernel discriminant rule in Equation (10) except that π̂_j and f̂_j(x; H_j) are replaced by π̂_{j,−i} = (n_j − 1)/(n − 1) and the corresponding density estimate computed with X_ji omitted. For test data independent of the training data, the command is compare, which is not illustrated here. Since our test data are the same as the training data, we use

R> compare.kda.cv(x = ir, x.group = ir.group, bw = "plugin",
+   pilot = "samse", pre = "sphere")
R> compare.kda.diag.cv(x = ir, x.group = ir.group, bw = "plugin",
+   pilot = "samse", pre = "scale")
R> compare.kda.cv(x = ir, x.group = ir.group, bw = "scv", pre = "sphere")
R> compare.kda.diag.cv(x = ir, x.group = ir.group, bw = "scv", pre = "scale")

which give misclassification rates: plug-in 0.0533, plug-in diagonal 0.0667, SCV 0.0400 and SCV diagonal 0.0533. The unconstrained SCV selector performs the best in this case.

General recommendations
The variety of bandwidth selectors available in ks may now pose a problem of too much choice.
The unconstrained bandwidth selectors will outperform their diagonal counterparts when the data have large probability mass oriented obliquely to the co-ordinate axes. Amongst the unconstrained selectors, we advise against the BCV selectors. The LSCV selector is useful in some cases, though its performance is known to be highly variable. The 2-stage plug-in and the SCV selectors can be recommended for general use.