CompPD : A MATLAB Package for Computing Projection Depth

Since the seminal work of Tukey (1975), depth functions have proved extremely useful in robust data analysis and inference for multivariate data. Many notions of depth have been developed in the last decades. Among others, projection depth appears to be very favorable. It turns out that (Zuo 2003; Zuo, Cui, and He 2004; Zuo 2006), with appropriate choices of univariate location and scale estimators, the projection depth induced estimators usually possess very high breakdown point robustness and ﬁnite sample relative eﬃciency. However, the computation of the projection depth seems hopeless and intimidating if not impossible. This hinders the further inference procedures development in practice. Sporadically algorithms exist in individual papers, though an uniﬁed computation package for projection depth has not been documented. To ﬁll the gap, a MATLAB package entitled CompPD is presented in this paper, which is in fact an implementation of the latest developments (Liu, Zuo, and Wang 2013; Liu and Zuo 2014). Illustrative examples are also provided to guide readers through step-by-step usage of package CompPD to demonstrate its utility.


Introduction
In statistics, order statistics play very important roles in defining many desirable robust methods, which can serve as powerful alternatives to their traditional counterparts when outliers are present. Nevertheless, such methods depend heavily on the ordering of data. For univariate observations, the order principle is clear, and can naturally arise from the intrinsic order on the real line R 1 . However, a tool that can provide a desirable ordering for multivariate observations is not straightforward available in higher dimensions R p (p ≥ 2).
To serve this purpose, Tukey (1975) heuristically introduced a new depth notion, i.e., halfspace depth. In the literature, this depth notion is also commonly referred to as "Tukey depth" in order to reflect Tukey's seminal work in this field. The most outstanding property of this depth notion is that it can provide the observations with a new ordering way. That is, unlike the traditional case, observations would be sorted center-outwardly according to this new principle. The deeper the point relative to the data cloud, the higher the depth value it has. Such an ordering has proved extremely powerful in extending univariate tools of signs and ranks, order statistics, quantiles, and outlyingness measures to the multivariate setting in a unified way (Serfling 2006).
In addition to the Tukey depth, there exist other depth notions in the literature. Among others, the most prominent and prevailing ones are simplicial depth (Liu 1990), projection depth (Liu 1992;Zuo 2003) and zonoid depth (Koshevoy and Mosler 1997) in the location setting, and regression depth (Rousseeuw and Hubert 1999) in the regression setting. Ideally, a desirable notion of depth should satisfy four key properties as stated in Zuo and Serfling (2000), namely, affine equivariance, maximality at center, monotonicity relative to the deepest point and vanishing at infinity. Many of the early results were summarized in Zuo and Serfling (2000), and the updated results for depths can be found in Liu, Serfling, and Souvaine (2006).
Depth-induced statistics (or procedures) usually inherit the desirable properties of their univariate counterparts. That is, they are affine equivariant, and possess very high breakdown robustness. The exception of the aforementioned depth notions is the zonoid depth, which centers at the regular mean and is consequently non-robust (Koshevoy and Mosler 1997). As a result, depth-induced statistics have been paid increasing attentions in the practice of inducing many favorable methods when outliers are present. For example, Ghosh and Chaudhuri (2005) proposed a depth-based classification technique for unequal prior problems. Yeh and Singh (1997) investigated a nonparametric method for constructing bootstrap confidence regions based on the depth-induced ordering. Chenouri and Small (2012) developed several nonparametric tests for testing the equality of two or more multivariate populations relying on statistical depth functions such as halfspace depth.
However, the price to be paid is that the computation of most depth-induced procedures is usually very challenging. Many authors have devoted themselves to overcoming this problem in the last decades. To name but a few, Rousseeuw and Ruts (1996) proposed two efficient algorithms for exactly computing bivariate halfspace depth and simplicial depth, respectively. Rousseeuw and Struyf (1998) studied the exact and approximate computation of halfspace depth in higher dimensions. Hallin, Paindaveine, andŠiman (2010) and Paindaveine andŠiman (2012) considered the computing issue of halfspace regions (or contours) in spaces with dimension p ≥ 2 by using parametric linear programming. Dyckerhoff, Mosler, and Koshevoy (1996) and Mosler, Lange, and Bazovkin (2009) (see also Bazovkin and Mosler (2012)) investigated the efficient geometric algorithms for computing the zonoid depth value of a single point and the depth regions in any dimension, respectively. Publicly accessible packages on computing halfspace depth and zonoid depth are now available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=depth (Genest, Masse, and Plante 2012) and http://www.jstatsoft.org/v47/i13/ (Bazovkin and Mosler 2012), respectively.
In this paper, we are interested to address the computing issue of projection depth by offering a new MATLAB (The MathWorks Inc. 2009) package CompPD. Our motivation lies in the following fact. As a major depth notion, projection depth enjoys many desirable properties. It has continuous depth regions and does not vanish even outside the convex hull of the data, which can bring great convenience to practical applications, such as classification (Ghosh and Chaudhuri 2005) and outlier identification (Dang and Serfling 2010). Furthermore, under mild conditions, the projection depth median is unique. This uniqueness property seldom holds true for many other depth notions such as halfspace depth and simplicial depth (Zuo 2013). With appropriate choices of univariate location and scale estimators, the projection depth induced estimators usually possess very high breakdown point robustness and finite sample relative efficiency (Zuo 2003;Zuo et al. 2004;Zuo 2006). However, application packages which can be used to compute the projection depth and its related estimators are currently lacking. The only available package is ExPD2D developed by Zuo and Ye (2009) in R. Unfortunately, ExPD2D is designed primitively for bivariate data as an implementation of Zuo and Lai (2011) and the code is inefficient and does not take account of the x-free property, namely, that the computation of optimal direction vectors is independent of the point x for which the depth value is being computed, of projection depth (Liu et al. 2013). In this paper, we fill the gap by presenting a MATLAB package CompPD, which is in fact an implementation of the latest developments (Liu et al. 2013;Liu and Zuo 2014). Codes of fast approximate algorithms are also provided for higher dimensions. Projection depth and all of its induced estimators can be conveniently computed by utilizing the current package.
The rest of this paper is organized as follows. In Section 2, we give necessary backgrounds on the definitions of projection depth, adjusted projection depth and their associated estimators. In Section 3, we describe the major functions contained in the MATLAB package. In Section 4, we provide some illustrative examples to demonstrate the utility of the package through stepby-step usage. In Section 5, we end the paper with a brief summary.

Preliminary
In this section, we will describe the definitions of the projection depth and adjusted projection depth and their associated estimators. Since the discussion focuses mainly on the computing issue, we only provide their sample versions in the following.
By utilizing the projection pursuit technique and taking the supremum of all the univariate outlyingness values of the projections of x onto lines, one can obtain the following pdimensional outlyingness function (Stahel 1981;Donoho 1982) O(x, X n ) = sup where S p−1 = {v ∈ R p : v = 1} with · standing for the Euclidean norm, u X n = {u X 1 , u X 2 , . . . , u X n }. Based on O(x, X n ), the projection depth value of a point x with respect to X n in R p then can be defined as (Liu 1992;Zuo 2003) PD(x, X n ) = (1 + O(x, X n )) −1 .
As a major depth notion, projection depth can induce affine equivariant, nested and convex depth regions (Zuo 2003): where 0 ≤ α ≤ α * = sup x∈R p p(x). In practical applications, depth regions PR(α, X n ) play similar roles as the quantiles in univariate statistics. As a special case, the innermost region contains only a single point θ under some mild conditions (Zuo 2013). In the literature, θ is usually referred to as projection depth median PM(X n ).
Furthermore, by choosing proper weight functions such as one can define highly robust multivariate estimators, e.g., the Stahel-Donoho estimators (Zuo et al. 2004) ω 2,i , and the projection depth trimmed mean (Zuo 2006) Note that for any direction u (∈ S p−1 ), the univariate outlyingness function o 1 (u x, u X n ) is always symmetrical about Med(u X n ) with respect to (w.r.t.) x. As a result, the projection depth may fail to capture the real shape of the data cloud when skewed data are present (Hubert and Van der Veeken 2010). This motivates us to consider an adjusted version of the aforementioned projection depth as follows: APD(x, X n ) = 1 + sup where Q 1 (Z n ) = Z (q 1 ) and Q 3 (Z n ) = Z (q 2 ) , q 1 = n/4 , q 2 = n − q 1 + 1, and · denotes the ceiling function. For convenience, hereafter we denote a(x) = APD(x, X n ). Without loss of generality, we call it adjusted projection depth. Using similar proofs to Liu et al. (2013), one can show that this adjusted projection depth can also be exactly computed and shares the x-property of the projection depth.
Based on the adjusted projection depth, one can define the adjusted projection depth regions APR(α, X n ) and median APM(X n ) as by following a similar fashion to that of projection depth.

The MATLAB package CompPD
This paper includes a MATLAB implementation of the latest developments (Zuo and Lai 2011;Liu et al. 2013;Liu and Zuo 2014) concerning the computation of projection depth and adjusted projection depth and their associated estimators as described in the earlier paragraph. Let us review the included code entitled CompPD.
The package consists of thirty files, including two data sets, one script, and twenty-seven functions. All of these files must be in MATLAB's current directory or on the search path to use the package CompPD. Tables 1 and 2 present a brief list of the main files included in the package.
The general usage of the code is as follows. The function PertX is used to perform data preprocessing when the input data are from a real application, and may be not in general position. Handled by PertX, X would be perturbed by adding some random noises of a very small magnitude e× TINY0, and then centralized at its mean vector, where e is uniformly distributed over (−1, 1). The default value of TINY0 is 10 −6 . The user also can choose another value through calling PertX(X, TINY0) with a user-defined TINY0.
Note that both of projection depth and adjusted projection depth possess the x-free property (Liu et al. 2013). We construct special functions to compute the optimal direction vectors, based on which one can conveniently obtain the exact depth value(s) and the related estimators. Hereafter we refer to the optimal direction vectors as ODVs.
For the case of projection depth, this task can be fulfilled by calling ExVecPD2D(X) and ExVecPDHD(X). ExVecPD2D finds the ODVs counter-clockwise and is constructed only for bivariate data. ExVecPDHD evaluates the ODVs in higher dimensions by utilizing the breadthfirst search algorithm, and is in fact an implementation of the algorithm developed in Liu and Zuo (2014). The input argument X should be an n × p matrix, where n denotes the sample size. Both of them return a struct with the form Part of the Boston housing data (65 × 3 matrix).

quakes.mat
Dataset load quakes Locations of 1000 seismic events of MB > 4.0 occurred in a cube near Fiji since 1964 (1000 × 2 matrix).

Script edit examples
Examples of this section.

APC2D.m
Function APC2D(X, vec, alpha) Compute the αth bivariate adjusted projection depth contour(s) of X based on vec.

APDVal.m
Function apdv = APDVal(X, vec, x) Compute the adjusted projection depth value(s) of x w.r.t. a p-dimensional (p ≥ 2) data set X based on vec.

APM.m
Function apm0 = APM(X, vec) Compute the adjusted projection depth median of X based on vec.

DDplot.m
Function DDplot(X, Y) Depth versus depth plot of X versus Y .

DSplot.m
Function DSplot(X, pdv) Depth-size scatter plot of a bivariate or trivariate sample X.

ExVecAPD2D.m Function vec = ExVecAPD2D(X)
Compute the optimal direction vectors of the adjusted projection depth w.r.t. a bivariate sample X.

ExVecAPDHD.m Function vec = ExVecAPDHD(X)
Compute the optimal direction vectors of the adjusted projection depth w.r.t. a 3-dimensional sample X.

ExVecPD2D.m
Function vec = ExVecPD2D(X) Compute the optimal direction vectors of the projection depth w.r.t. a bivariate sample X.

Function vec = ExVecPDHD(X)
Compute the optimal direction vectors of the projection depth w.r.t. a 3-dimensional sample X.

RndVecAPD.m
Function vec = RndVecAPD(X, m) Generate m random direction vectors for computing the p-dimensional (p ≥ 2) random adjusted projection depth by using the uniform distribution on the sphere.

RndVecPD.m
Function vec = RndVecPD(X, m) Generate m random direction vectors for computing the p-dimensional (p ≥ 2) random projection depth by using the uniform distribution on the sphere.  It is worth mentioning that ExVecPDHD and ExVecAPDHD are ideally limited to work in spaces with dimension 3 ≤ p ≤ 8, because they need to call a built-in function convhulln (Barber, Dobkinm, and Huhdanpaa 1996), which has such a dimensional limitation. However, we do not recommend using them when p > 3 since they consume quite considerable CPU times in such spaces. Both of them search the whole p-dimensional space quadrant-by-quadrant. In each quadrant, the default maximum number of iterations is 35000. Therefore, when n and/or p increase, they may miss finding some ODVs. A similar trick is used in compContourM1, which aims at computing the high dimensional halfspace depth contours (Paindaveine andŠiman 2012). compContourM1 has been developed by Prof. Davy Paindaveine and his coauthors, and can be downloaded from http://homepages.ulb.ac.be/~dpaindav/.
In addition, we also construct a function AppVecPD to generate m direction vectors for approximating the projection depth and its associated estimators. Its typical usage is vec = AppVecPD(X, m). The resulting direction vectors are data-dependent and affine equivariant; see the code for more details. The function RndVecPD is used to generate a finite number of random direction vectors vec by using the uniform distribution on the sphere. vec has the same form as vecpd. RndVecPD is computationally simple, but cannot generate random unit vectors in an affine equivariant way. So this function is not useful in computing estimators such as the Stahel-Donoho estimator if the affine equivariance is a big concern. Similarly, for the case of adjusted projection depth, the interested user is advised to refer to AppVecAPD and RndVecAPD, which return a struct with the same form as vecapd.
The main functions OutVal(X, vec, x) and PDVal(X, vec, x) are used to evaluate the outlyingness and projection depth value(s) of x w.r.t. X based on vec, respectively. If vec is returned from functions such as ExVecPD2D, the resulting outlyingness/projection depth value(s) is (are) exact. The function [pm0, pdpm0] = PM(X, vec, IsLP) is constructed to compute the projection depth median based on vec, and returns the computed median pm0 and its corresponding depth value pdpm0. IsLP is an indicator, and IsLP = true means calling the function linprog. It is worth mentioning that, when the number of direction vectors contained in vec are large, linprog called in PM consumes quite considerable CPU times. In such a case, we recommend the user to employ only the following code instead initx0 = median(X, 2); while tmpv > 1e-12 outfun = @(x0, AMat, bvec) max(AMat * x0 -bvec); [pm0, outv] = fminsearch(@(x0)outfun(x0, AMat, bvec), initx0); tmpv = norm(initx0 -pm0); initx0 = pm0; end pdvpm = 1 / (1 + outv); by calling PM(X, vec) or PM(X, vec, false), because one can show that the projection depth median is a global optimum and no local optimum exists based on a finite number of direction vectors vec (Liu et al. 2013). The function fminsearch works well if there is no local optimum in the objective function; see its help file for details. Remarkably, fminsearch is much faster than linprog, although the result of fminsearch has slightly lower accuracy than that of linprog.
To learn more details about the files included in CompPD, the interested user is advised to consult the help text and comments within each files, including optional input and output arguments that are not discussed here. Code for running the examples can be found in the script file Examples.

Using CompPD
In this section, we will show line-by-line the sequence of commands by utilizing some illustrative examples.

Bivariate example: quakes data
Our first example is a R data set from package datasets (R Core Team 2015). The data set gives the locations of 1000 seismic events of M B > 4.0 occurred in a cube near Fiji since 1964. The original data frame includes 5 variables. In the following, we pick up only the first two variables, namely, the latitude and longitude of events, in the discussion.
The first step in the analysis is to load the data and perform some preliminary operations. The resulting scatter plot is reported in Figure 1.
Next, the function ExVecPD2D(X) will be called to find the ODVs. Doing so illustrates how to compute the projection depth and the associated estimators.     In univariate descriptive statistics, the QQ-plot (quantile versus quantile plot) is a very useful graphical method for comparing two probability distributions by plotting their quantiles against each other. In higher dimensions, a similar tool is the DD-plot (depth versus depth plot) introduced by Liu, Parelius, and Singh (1999). This package contains a function DDplot of this kind based on the projection depth.
As a special case of such applications, DDplot can be used to test for multinormality (with a unspecified mean vector and covariance matrix) by calling DDplot(X). Usually, there are two natural ways to do that on the basis of the DD-plot. The first one is to use theoretical depth values associated with the multinormal distribution with mean and covariance parameter values that are matching the sample mean vector and covariance matrix of the data at hand. The second one is to compute the DD-plot on the basis of standardized data (where standardization, again, is based on the sample mean and sample covariance matrix). From affine-invariance, both procedures are equivalent. Nevertheless, the second one has the advantage that no new functions have to be introduced, and hence is utilized in DDplot. Provided in the sequel are some illustrations of its usage, and the results are shown in Figure 3.

DDplot(quakes)
DDplot(normrnd(0, 1, 1000, 2)) DDplot(normrnd(0, 0.1, 1000, 2)) DDplot(normrnd(0, 5, 1000, 2)) Ideally, if the observations are normally distributed, the points in the DD-plot would approximately lie on the line y = x; see Figure 3(b). In this sense, it is improper to conclude that X = quakes are generated from a normal distribution; see Figure 3(a). In recent years, the DD-plot was mostly used in a classification context. In this two-sample setup, DDplot(X, Y, Z) reports a scatter plot with x-coordinates of each point being the depth of the corresponding data point Z within the observations of the first sample X and y-coordinates of each point being the depth of the corresponding data point Z within observations of the second sample Y. New observations to be classified, in the most standard case,  are then classified into population one or two according to their position with respect to the main bisector. See Li, Cuesta-Albertos, and Liu (2012) and references therein for more details about depth-based classifiers.
We move on to compute the projection depth regions. The function PC2D is constructed to serve this purpose. Several regions can be plotted simultaneously. The exact and approximate regions can be computed by calling alpha0 = 0.1:0.1:1; PC2D(X, ExVec1, alpha0, false) PC2D(X, AppVec1, alpha0, false) where alpha0 denotes the projection depth values.
If the given alpha0 > α * (the maximum depth value), PC2D does nothing. The resulting exact and approximate depth regions are showed in Figures 4(a) and 4(b), respectively. Here depth regions are indexed by their "depth" α 0 . Equivalently, they may be indexed by their probability content, i.e., in the sample case, by the proportion of data points outside the region. In practical applications, this alternative indexing may be very useful and can be fulfilled by calling the following code (see Figures 4(c)-4(d)).

Trivariate example: Boston housing data
Our next example is a trivariate data set, which is actually a part of the Boston housing data downloaded from http://lib.stat.cmu.edu/datasets/boston/. In the sequel we take the first 65 items of variables rm, dis and lstat, where rm denotes the average number of rooms per dwelling, dis the weighted distances to five Boston employment centers, and lstat the percentage of lower status of the population, respectively.  The data can be obtained through calling load BostData X = BostData; The function ExVecPDHD is used to find the ODVs. One can trace this program by calling   Based on ExVec3, the projection depth and its induced estimators can be computed by calling functions such as PDVal and PM similar to the bivariate example given above. The exception is the computation of depth regions, which needs to call the function PC3D. That is,

Summary
As a major depth notion, projection depth has many desirable properties. It can induce favorable estimators with high breakdown point robustness, without having to pay a price of losing too much efficiency. It has continuous depth regions. For a given data cloud, the depth value does not vanish even outside the convex hull. Its induced median is unique, etc. However, application packages which can be used to compute the projection depth and its related estimators are currently lacking, although sporadically algorithms exist in individual papers.
In this paper, we present a MATLAB package CompPD, which is in fact a MATLAB implementation of the latest developments about the computation of projection depth and its related estimators (Zuo and Lai 2011;Liu et al. 2013;Liu and Zuo 2014). The current version has its limitations. We are committed to continuous improvement of the package as research advances in this field. We wish that the developed package has the potential to help practitioners to obtain satisfactory results when the data are contaminated or heavy-tailed.