An R Package for p Values for Classification

Let (X,Y ) be a random variable consisting of an observed feature vector X and an unobserved class label Y ∈ {1, 2, . . . , L} with unknown joint distribution. In addition, let D be a training data set consisting of n completely observed independent copies of (X,Y ). Instead of providing point predictors (classifiers) for Y , we compute for each b ∈ {1, 2, . . . , L} a p value πb(X,D) for the null hypothesis that Y = b, treating Y temporarily as a fixed parameter, i.e., we construct a prediction region for Y with a certain confidence. The advantages of this approach over more traditional ones are reviewed briefly. In principle, any reasonable classifier can be modified to yield nonparametric p values. We describe the R package pvclass which computes nonparametric p values for the potential class memberships of new observations as well as cross-validated p values for the training data. Additionally, it provides graphical displays and quantitative analyses of the p values.


Introduction
Let (X, Y) be a pair of random variables, consisting of an observed feature vector X with values in a feature space X and an unobserved class label Y ∈ Y := {1, 2, . . ., L} with L ≥ 2 possible values. Our aim is inference about Y with a given confidence, based on X and certain training data. In the sequel we provide a brief introduction to the particular paradigm of p values as introduced by Dümbgen, Igl, and Munk (2008). It is closely related to Neyman-Pearson classification; see Scott (2007), Zhao, Feng, Wang, and Tong (2015) and the references cited therein.

From classifiers to p values
For the moment let us assume that the joint distribution of X and Y is known, i.e., the a priori probabilities w_y := P(Y = y) and the conditional distributions P_y := L(X | Y = y) for all y ∈ Y. Here and throughout, L(·) stands for 'distribution of'. Further let P_y have density f_y > 0 with respect to some measure M on X. In the simplest case, we choose a classifier Ŷ : X → Y, i.e., a point predictor of Y, and claim that Y = Ŷ(X). However, we have no information about confidence. A Bayesian approach would be to calculate the posterior distribution of Y given X, i.e., the posterior weights w_y(X) := P(Y = y | X) = w_y f_y(X)/f(X) with the density f = Σ_{y∈Y} w_y f_y of L(X) with respect to M. In fact, a classifier Ŷ* satisfying Ŷ*(X) ∈ arg max_{y∈Y} w_y(X) is optimal in the sense of minimizing the risk R(Ŷ) := P(Ŷ(X) ≠ Y). Furthermore, the posterior weights w_y(X) carry additional information such as P(Ŷ*(X) ≠ Y | X) = 1 − max_{y∈Y} w_y(X). However, this depends very sensitively on the prior weights w_y. In some realistic settings, e.g., case-control studies in biostatistics, estimation of the conditional distributions P_y from training data is possible, but we have no information about the a priori probabilities w_y. Moreover, if some classes y have very small prior weights, the classifier Ŷ* tends to ignore these, i.e., the class-dependent risk P(Ŷ* ≠ Y | Y = y) may be rather large for some classes y.
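The posterior weights are easy to compute in a toy example. The following sketch uses made-up parameters (two univariate Gaussian classes with equal variance and equal prior weights, purely for illustration) to evaluate w_y(x) and the corresponding Bayes classifier Ŷ*:

```r
# Posterior weights w_y(x) for a made-up two-class example:
# P_1 = N(0, 1), P_2 = N(2, 1), prior weights w = (0.5, 0.5).
post.weights <- function(x, w = c(0.5, 0.5), mu = c(0, 2), sigma = 1) {
  f <- w * dnorm(x, mean = mu, sd = sigma)  # w_y * f_y(x)
  f / sum(f)                                # posterior weights w_y(x)
}

# The Bayes classifier picks the class with maximal posterior weight.
bayes.classifier <- function(x, ...) which.max(post.weights(x, ...))
```

At the symmetry point x = 1 both weights equal 1/2; shifting the prior weights w shifts this boundary, which illustrates how sensitively Ŷ* depends on them.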
Given p values π_θ(X) for all θ ∈ Y and a test level α ∈ (0, 1), we define the prediction region Ŷ_α(X) := {θ ∈ Y : π_θ(X) > α}. Thus we can exclude all classes θ ∉ Ŷ_α(X) with confidence 1 − α. If there is only one θ ∈ Ŷ_α(X), we have classified X uniquely with confidence 1 − α. There exist intrinsically difficult classification problems, especially in biomedical applications. But even in such situations, some observations may be classified uniquely with high confidence. And in settings with L ≥ 3 potential classes, excluding one or more classes with a prescribed confidence might be useful.
Alternatively, we can calculate p values which do not depend on the prior probabilities w_y. In the underlying two-class example with gamma distributions, w_2(x) is increasing in x, which leads to a natural definition of the p values. Figure 2 shows the p value functions π_1(x) and π_2(x). In addition, the three regions where Ŷ_0.1(x) = {1}, {2}, {1, 2} are marked. With other choices of the gamma parameters there could be a region where Ŷ_0.1(x) = ∅.
To assess how well the two classes can be discriminated, it is also of interest to look at the two receiver operating characteristic (ROC) functions depicted in Figure 3. The first ROC function specifies for each test level α the probability that class 2 is rejected at this level if class 1 is the true one, i.e. the conditional probability that 2 ∉ Ŷ_α(X) given Y = 1. The second ROC curve is analogous with the roles of classes 1 and 2 interchanged. For a general account of ROC curves in binary classification and hypothesis testing we refer to Altman and Bland (1994) and Fawcett (2006).

Optimal p values as benchmark
Suppose that L((f_θ/f)(X)) is continuous. This is true, for instance, if the P_y are nondegenerate Gaussian distributions on X = R^q and not all identical. Then the Neyman-Pearson lemma shows that the p value π*_θ is optimal in the sense that it minimizes the risk R_α(π_θ) := P(π_θ(X) > α) for any α ∈ (0, 1); cf. Dümbgen et al. (2008). There are two other representations of π*_θ(x). The first shows that π*_θ(x) is a non-decreasing function of w_θ(x). The second shows that the prior weight w_θ itself is irrelevant for the optimal p value π*_θ; only the ratios w_c/w_b with b, c ≠ θ matter. In particular, in case of L = 2 classes, T*_1(x) = (f_2/f_1)(x) = T*_2(x)^{-1}, and the optimal p values do not depend on the prior distribution of Y at all.

Training data and nonparametric p values
The joint distribution of (X, Y) is typically unknown. In this case we compute p values π_θ(X, D) and prediction regions Ŷ_α(X, D) := {θ ∈ Y : π_θ(X, D) > α} depending on the current feature vector X as well as on a training data set D consisting of n completely observed pairs (X_1, Y_1), (X_2, Y_2), . . ., (X_n, Y_n). We assume that (X, Y) and (X_1, Y_1), . . ., (X_n, Y_n) are independent with L(X_i | Y_i = y) = P_y. This setting includes situations with stratified training data, e.g., case-control studies, as well as i.i.d. training data. Condition (1) can be extended in two ways, for any θ ∈ Y and α ∈ (0, 1). Condition (3) corresponds to "single use" of the p values in the sense that we consider how well one future observation (X, Y) is classified. This condition can be guaranteed in various settings. Condition (4) corresponds to "multiple use" of the p values in the sense that the training data D are utilized to classify arbitrarily many future observations. Thus it is reasonable to condition on the training data D as well as the class label of future observations. As explained in Dümbgen et al. (2008), Condition (4) can be guaranteed under moderate assumptions, but the summand o_p(1) cannot be avoided in general.
For the computation of the p values we condition on the n + 1 class labels Y_1, Y_2, . . ., Y_n and Y and treat them as fixed parameters. We write G_θ := {i : Y_i = θ} and N_θ := #G_θ for θ ∈ Y, and assume tacitly that all these group sizes N_θ are positive.
To compute a nonparametric p value π_θ(X, D) for one particular class label θ, let T_θ(X, D) be a test statistic which is symmetric in (X_i)_{i∈G_θ} and quantifies the implausibility of 'Y = θ'.
To test the latter hypothesis we define D(X, θ) to be the training data augmented by (X, θ), i.e., we assume temporarily that Y = θ. If that is true, the N_θ + 1 random variables X and X_i, i ∈ G_θ, are exchangeable. Hence the nonparametric p value

π_θ(X, D) := (N_θ + 1)^{-1} ( 1 + #{ i ∈ G_θ : T_θ(X_i, D(X, θ)) ≥ T_θ(X, D(X, θ)) } )

satisfies P(π_θ(X, D) ≤ α | Y = θ) ≤ α for any α ∈ (0, 1). Note that the smallest possible value of this p value is 1/(N_θ + 1), so class θ can only be excluded at level α if 1/(N_θ + 1) ≤ α. For instance, if α = 0.05, N_θ should be at least 19.
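This construction can be sketched in a few lines of base R. The default test statistic below (squared distance to the class mean in the augmented data) is only a hypothetical placeholder, not one of the statistics implemented in pvclass; any statistic that is symmetric in the group-θ observations may be plugged in:

```r
# Nonparametric p value for 'Y = theta': augment the class-theta training
# features by the new observation x, evaluate the test statistic for all
# N_theta + 1 exchangeable points, and return the relative rank of x.
# T.stat is a placeholder; here: squared distance to the group mean.
nonparam.pv <- function(x, X, Y, theta,
                        T.stat = function(z, Xg) sum((z - colMeans(Xg))^2)) {
  Xaug  <- rbind(x, X[Y == theta, , drop = FALSE])  # augmented data D(x, theta)
  Tvals <- apply(Xaug, 1, T.stat, Xg = Xaug)
  mean(Tvals >= Tvals[1])   # (1 + #{i : T_i >= T_0}) / (N_theta + 1)
}
```

With N_θ = 19 training observations in group θ, an extreme outlier x receives the smallest attainable p value 1/20 = 0.05.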
As to the test statistic T_θ(X, D), there are no restrictions except for the symmetry in (X_i)_{i∈G_θ}.
The optimal p value π*_θ(x) suggests using an estimator for the weighted likelihood ratio T*_θ(x) or a strictly increasing transformation thereof. Section 2 contains explicit examples for T_θ(X, D).
More details on the nonparametric p values, including asymptotic properties such as (4), can be found in Dümbgen et al. (2008) and Zumbrunnen (2009).

Cross-validated p values and ROC functions
To visualize the separability of different classes by means of a given test statistic, we compute cross-validated p values for the training data. We also compute the empirical conditional inclusion probabilities Î_α(b, θ) and the empirical pattern probabilities P̂_α(b, S) for b, θ ∈ Y and S ⊂ Y. These numbers can be interpreted as estimators of the conditional probabilities that θ ∈ Ŷ_α(X, D) and that Ŷ_α(X, D) = S, respectively, given Y = b and D. Finally, for training data with large group sizes N_b we also display the L^2 empirical ROC functions as a plot matrix with L rows and L columns. In row b and column θ we depict the function α ↦ 1 − Î_α(b, θ). For any fixed α (horizontal axis), the value 1 − Î_α(b, θ) (vertical axis) is an estimator for the probability that class θ is rejected at this level if class b is the true one, i.e. the conditional probability that θ ∉ Ŷ_α(X, D) given Y = b and D. The plots on the diagonal (b = θ) are essentially straight lines connecting (0, 0) and (1, 1). Deviations are due to the fact that π_θ(·, D) has a discrete distribution.

Data example buerk
To illustrate the main functions of pvclass we use the data set buerk provided by pvclass. It was collected by Prof. Dr. Conny Georg Bürk at the university hospital in Lübeck, Germany, and contains data on 21556 surgeries in a certain time period (end of the 1990s). Besides the mortality and the morbidity, it contains 21 variables describing the condition of the patient and the surgery.

Choices of test statistics
In this section we describe the test statistics T_θ(·, ·) implemented in the package pvclass. As indicated in Section 5, users could easily incorporate test statistics corresponding to their own favorite classifiers. From now on we assume that the feature vector X consists of q numerical covariates. If the raw data involve categorical covariates, they should be coded by means of {0, 1}-valued indicator variables as usual. In our procedures the latter are treated like numerical variables.

Plug-in estimator for standard model
In the standard model, where P_θ = N_q(µ_θ, Σ) with mean vectors µ_θ ∈ R^q and a common symmetric, nonsingular covariance matrix Σ ∈ R^{q×q}, the test statistic for the optimal p value has an explicit form. To compute the nonparametric p values, we replace the unknown parameters in T*_θ with the corresponding estimators. We use N_b/n as a proxy for w_b and the standard estimator µ̂_θ := N_θ^{-1} Σ_{i∈G_θ} X_i for µ_θ. For Σ the package pvclass offers the standard estimator as well as the more robust M estimators Σ̂_M and Σ̂_sym. The reason to use a robust estimator for Σ is that for the computation of π_θ(X, D) we add the new observation X temporarily to the class θ, and X may be an outlier with respect to the distribution P_θ.
The first M estimator, Σ̂_M, is the maximum likelihood estimator in the model where P_θ = N_q(µ_θ, c_θ Σ) with c_θ > 0 and Σ ∈ R^{q×q} symmetric and positive definite with det(Σ) = 1. For the calculations we use that Σ̂_M is the solution of a fixed point equation. The second M estimator, Σ̂_sym, is a generalization to more than one group of the symmetrized version of Tyler's M estimator, as defined in Dümbgen (1998). It is also the solution of a fixed point equation. (If observations X_i within one group G_b are identical, the latter equation has to be modified somewhat.) For more details on M estimators we refer to Dümbgen, Pauly, and Schweizer (2015) and Dümbgen, Nordhausen, and Schuhmacher (2016).

Nearest neighbors and weighted nearest neighbors
Suppose that d(·, ·) is some metric on X. Let B(x, r) := {y ∈ X : d(x, y) ≤ r}, and for a fixed positive integer k ≤ n let R_k(x) denote the distance from x to its k-th nearest training observation. Further let P̂_θ(B) denote the empirical distribution of the points X_i with Y_i = θ. The k nearest neighbor estimator of w_θ(x) is then given by the class proportions among the k nearest neighbors of x. This estimator weights all k nearest neighbors of the observation X equally. However, it would be reasonable to assign larger weights to training observations which are closer to X. pvclass also offers a weighted nearest neighbor estimator for w_θ(x). To compute this, we first order the training data according to their distance to X and then assign descending weights to them. Let R_i denote the rank of training observation X_i according to its distance to x. The weighted nearest neighbor estimator for w_θ(x) is then a weighted class proportion with weights W(R_i). In pvclass either a linear weight function or an exponential weight function with a tuning parameter τ > 0 can be used. For the linear weight function, τ should be in (0, 1]. Alternatively, one can specify arbitrary weights with an n-dimensional vector W.
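A weighted nearest neighbor estimate of w_θ(x) can be sketched as follows. The linear weight function used here, decaying in the distance rank and positive for τ ∈ (0, 1], is one plausible choice for illustration; pvclass's exact weight functions may differ in detail:

```r
# Sketch of a weighted nearest neighbor estimate of the posterior weight
# w_theta(x). Weights decay linearly in the distance rank R_i; for
# tau in (0, 1] all weights remain positive.
wnn.estimate <- function(x, X, Y, theta, tau = 0.5) {
  d <- sqrt(rowSums(sweep(X, 2, x)^2))      # Euclidean distances to x
  r <- rank(d, ties.method = "first")       # distance ranks R_i
  W <- pmax(1 - tau * (r - 1) / nrow(X), 0) # linearly decaying weights
  sum(W[Y == theta]) / sum(W)               # weighted vote for class theta
}
```

Since the weights are normalized, the estimates for all classes sum to one, as a posterior weight estimator should.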
Often the different variables of a data set are measured on different scales. To take this into account, pvclass offers, besides the fixed Euclidean distance, two data-driven distances which are scale invariant. The first is the data-driven Euclidean distance, where we divide each component of X by its sample standard deviation and then use the Euclidean distance. For the second we assume that all classes have a common covariance matrix and estimate this with one of the estimators described in Section 2.1. Then we use the Mahalanobis distance with respect to the estimated covariance matrix Σ̂.
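Both data-driven distances can be written down directly. In this sketch the covariance is supplied as an argument; the plain pooled sample covariance would do for illustration, whereas pvclass can use the robust M estimators of Section 2.1 instead:

```r
# Data-driven Euclidean distance: divide each component by its sample
# standard deviation, then take the Euclidean distance.
std.euclid <- function(x, y, X) {
  s <- apply(X, 2, sd)              # componentwise sample standard deviations
  sqrt(sum(((x - y) / s)^2))
}

# Mahalanobis distance with respect to an estimated covariance matrix.
mahal.dist <- function(x, y, Sigma) {
  d <- x - y
  sqrt(drop(t(d) %*% solve(Sigma) %*% d))
}
```

For a diagonal covariance estimate the two distances coincide, which is a handy sanity check.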

Penalized multicategory logistic regression
One of our versions of penalized multicategory logistic regression is similar to the regularized multinomial regression introduced by Friedman, Hastie, and Tibshirani (2010); the other one is a variation of the procedure of Zhu and Hastie (2004). Let X = R^q and let X contain the values of q − 1 numerical or binary variables and a constant term. Our temporary working assumption (6) is that P(Y = θ | X = x) = exp(b_θ^T x) / Σ_{z∈Y} exp(b_z^T x) for unknown parameters b_1, b_2, . . ., b_L ∈ R^q. With estimators b̂_z = b̂_z(D) to be specified below, p values are computed with the corresponding test statistics. The parameter b = (b_1, b_2, . . ., b_L) ∈ R^{Lq} is estimated via penalized maximum likelihood. We pretend temporarily that the training observations (X_i, Y_i) are independent copies of (X, Y) satisfying (6), and denote the corresponding negative conditional log-likelihood function, given the X_i, by Λ(b). The parametrization in (6) is not unique, because P(Y = θ | X = x) remains unchanged if we add one and the same arbitrary vector to all parameters b_z. One way to guarantee uniqueness of the parameter b is to require a centering constraint (7) for all j = 1, 2, . . ., q. To enforce (7), at least for some j, we can add a penalty R_0(b) with numbers σ_j ≥ 0 to Λ(b). The choice of σ_j will depend on further regularization terms.
Regularization 1: Penalizing subvectors. For logistic regression there are various good reasons to regularize the functional Λ or Λ + R_0. One is to avoid numerical problems. Another is to guarantee existence of a minimizer in cases where Λ alone has no minimizer. This happens, for instance, if one subgroup {X_i : Y_i = θ} can be separated from the remaining training observations by a hyperplane. Moreover, we want to favor parameter vectors with only few large components. A first way to do this would be to add a penalty R_1(b) with numbers τ_j ≥ 0 to Λ(b) + R_0(b). Here and throughout, ‖·‖ denotes the Euclidean norm. This regularization is motivated by Tibshirani's (1996) LASSO and similar in spirit to penalized logistic regression as proposed by Zhu and Hastie (2004). Minimizing Λ(b) + R_0(b) + R_1(b) enforces Condition (7) whenever σ_j + τ_j > 0 for all j.
Regularization 2: Component-wise penalties. A simple form of regularization, analogous to Tibshirani's (1996) LASSO, is to add a component-wise penalty R_2(b). Choice of σ_j and τ_j. The three versions of penalized logistic regression are available in pvclass, specified by the parameter pen.method and the tuning parameter τ_o. Here S_j is the sample standard deviation (within groups) of the j-th components of the X_i.
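To make the objective concrete, here is a sketch of the negative conditional log-likelihood Λ(b) together with a simple group penalty in the spirit of R_1. The exact penalties, scalings, and constraints used by pvclass may differ:

```r
# Negative conditional log-likelihood of the multinomial logistic model (6),
# plus a simple group penalty on the coefficient subvectors b_{.j}
# (a sketch in the spirit of R_1; pvclass's exact penalty may differ).
# B is an L x q matrix whose rows are b_1, ..., b_L; Y takes values in 1..L.
neg.loglik <- function(B, X, Y) {
  eta <- X %*% t(B)                 # n x L matrix of linear scores b_z^T x_i
  sum(log(rowSums(exp(eta))) - eta[cbind(seq_along(Y), Y)])
}

pen.loglik <- function(B, X, Y, tau = 1) {
  neg.loglik(B, X, Y) + tau * sum(sqrt(colSums(B^2)))  # group penalty
}
```

At B = 0 all classes are equally likely, so the negative log-likelihood equals n log L; this is a convenient check for any implementation.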
For results about the existence of unique minimizers we refer to Zumbrunnen (2014).

Implementation and main functions
The package pvclass (Zumbrunnen and Dümbgen 2017) was written in the R programming language (R Core Team 2017) and depends on the recommended package Matrix (Bates and Mächler 2017).
The main functions of pvclass compute p values for the potential class memberships of new observations (pvs) as well as cross-validated p values for training data (cvpvs).With the function analyze.pvs the package pvclass also provides graphical displays and quantitative analyses of the p values.
In this section we illustrate the main functions with the data set buerk of Section 1.6. We use the mortality as class label Y. The original data set contains 21556 observations. To get a smaller data set, which is easier to handle, we take all the 662 observations with Y = 1 and, using the function sample, we choose randomly 3 · 662 observations with Y = 0. In this example the empirical pattern probabilities for uniquely correct classification P̂_α(b, {b}) are 0.4 for both classes.

Cross-validated p values
The cross-validated p values can be illustrated graphically with analyze.pvs in the same way as the p values for the new observations. In the following example we suppress the plot of the p values and get only the plot of the ROC functions, see Figure 6. The output shows the empirical pattern probabilities P̂_α(b, S) as described in Section 3.1.

Choice of tuning parameters
Some of the previous test statistics depend on a tuning parameter τ > 0, i.e., T_θ(x, D) = T_θ^{(τ)}(x, D). Our goal is to choose this parameter in a data-driven way such that overfitting of the training data is avoided while symmetry in (X_i)_{i∈G_θ} is preserved.
The following criterion turned out to be quite useful. In view of our specific construction of the p values, we compute for all training observations with Y_i ≠ θ the test statistic T_θ^{(τ)}(X_i, D_i(X_i, X, θ)), where D_i(X_i, X, θ) denotes the training data after adding the observation (X, θ) and setting the class label of observation X_i to θ. Then we take the sum of these values, S(τ, X, θ), and choose τ to maximize this criterion.

Implementations in pvclass.
The functions for the nearest neighbor methods accept vectors for the parameters k and τ , respectively.If a vector is provided, the function searches for the best of the given parameters.If no parameter is provided (k = NULL or tau = 0) certain default vectors are used.
In connection with penalized logistic regression, note that to determine the optimal tuning parameters for a new observation X and all potential class memberships, the test statistic has to be computed (L − 1) · n · l times, where l is the number of tuning parameters from which we want to choose the optimal one. This can be very computer-intensive, especially for penalized logistic regression in high dimensions. For the latter method we observed in simulated and real data examples that the p values do not depend too sensitively on the choice of the penalty parameter and that S(τ_o, X, θ) is unimodal in τ_o. Therefore, if find.tau == TRUE, we only consider a few values for τ_o on a logarithmic grid as follows: We start with a certain value τ_o (default 10) and compare S(τ_o, X, θ) with S(τ_o · δ, X, θ) for some δ > 1 (default 2). While S(τ_o · δ, X, θ) > S(τ_o, X, θ) and τ_o is smaller than a certain threshold τ_max (default 100), we replace τ_o with τ_o · δ. If in the very beginning S(τ_o, X, θ) ≥ S(τ_o · δ, X, θ), we compare S(τ_o, X, θ) with S(τ_o/δ, X, θ). While S(τ_o/δ, X, θ) > S(τ_o, X, θ) and τ_o is larger than a certain threshold τ_min (default 1), we replace τ_o with τ_o/δ.
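The search just described amounts to a hill climb on a logarithmic grid. A sketch, with S as a placeholder for the data-dependent criterion S(τ_o, X, θ):

```r
# Logarithmic grid search for tau.o as described above. S is a placeholder
# for the criterion S(tau.o, X, theta); it is assumed unimodal in tau.o.
search.tau <- function(S, tau.o = 10, delta = 2, tau.min = 1, tau.max = 100) {
  if (S(tau.o * delta) > S(tau.o)) {
    # climb upwards while the criterion keeps improving
    while (S(tau.o * delta) > S(tau.o) && tau.o < tau.max)
      tau.o <- tau.o * delta
  } else {
    # otherwise climb downwards
    while (S(tau.o / delta) > S(tau.o) && tau.o > tau.min)
      tau.o <- tau.o / delta
  }
  tau.o
}
```

For a unimodal criterion this stops one grid step past the mode, evaluating S only O(log(τ_max/τ_min)) times instead of once per candidate value.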

Numerical examples
We illustrate the procedure with two simulated data sets. Figure 7 shows boxplots of the distributions of the rates of uniquely correctly classified test observations for different values of τ. The misclassification rates depend heavily on the training set. Nevertheless, our method is very stable and chooses for 99.8% of the test observations τ* = 16, which has the highest median rate of uniquely correct classification.
Figure 8 shows boxplots of the distributions of the rates of uniquely correctly classified test observations for different values of τ. As in the previous example, the misclassification rates depend heavily on the training set. Again our method is very stable and chooses for all test observations τ* = 16, which has the highest median rate of uniquely correct classification.
Example 3 (Buerk). For the hospital data described in Section 3 we computed p values for τ = 1, 2, . . ., 100. The amount of regularization had no influence on the misclassification rates in this example.

Relation to other classifiers and packages
There are numerous software packages for classification. It should be stressed that our package does not provide just another classifier. Our aim is to provide a first open-source implementation of nonparametric p values for classification. Indeed, one can easily combine our package with one's favorite classifier, e.g. based on neural nets or support vector machines, as long as it involves some plausibility or implausibility measures for class memberships. Precisely, suppose that for given training data D your favorite classifier provides, explicitly or implicitly, a matrix [T_iy]_{i,y} as in (8), where T_iy measures the implausibility of "Y_i = y" for the feature vector X_i. (If T_iy measures the plausibility of "Y_i = y", just replace T_iy with −T_iy.) To compute a p value π_θ(X, D) for a new observation, one has to compute this matrix with the augmented data matrix D(X, θ) in place of D. This results in a matrix with an additional first row [T_01, T_02, . . ., T_0L] corresponding to the additional observation (X_0, Y_0) = (X, θ). Note that in general the other n rows will be different from the original n rows of (8). Then the p value π_θ(X, D) is given by π_θ(X, D) = (N_θ + 1)^{-1} #{i : i = 0 or Y_i = θ, T_iθ ≥ T_0θ}. For θ ≠ Y_i one may use the same formula, but one has to recalculate the matrix (8) after replacing (X_i, Y_i) in D with (X_i, θ). Note that this affects the group sizes N_θ and N_{Y_i}, too. The resulting matrix [π_θ(X_i, D_i)]_{i,θ} ∈ [0, 1]^{n×L} of cross-validated p values may be analyzed by means of analyze.pvs.
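This recipe can be written generically. The implausibility function below is a stand-in for your favorite classifier; for illustration it scores an observation by its squared distance to the mean of the hypothesized class within the augmented data:

```r
# Turn any classifier's implausibility scores into a nonparametric p value,
# following the recipe above. 'implaus' must return, for data (Xa, Ya), the
# implausibility of 'Y = theta' for every row of Xa, i.e. column theta of
# the score matrix (8) computed on the augmented data.
pv.from.classifier <- function(x, X, Y, theta, implaus) {
  Xaug <- rbind(x, X)                  # augmented data D(x, theta)
  Yaug <- c(theta, Y)                  # the new observation gets label theta
  Tth  <- implaus(Xaug, Yaug, theta)   # implausibility of 'Y = theta' per row
  g    <- which(Yaug == theta)         # group G_theta, including row 1 (= x)
  mean(Tth[g] >= Tth[1])               # relative rank among N_theta + 1 points
}

# Stand-in classifier score: squared distance to the class-theta mean.
implaus.demo <- function(Xa, Ya, theta) {
  m <- colMeans(Xa[Ya == theta, , drop = FALSE])
  rowSums(sweep(Xa, 2, m)^2)
}
```

Any classifier whose scores are symmetric in the group-θ observations can replace implaus.demo here; the p value construction itself is unchanged.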
These remarks indicate that the computation of our p values is necessarily more involved than the computation of a traditional classifier. Roughly speaking, to evaluate one future feature vector, one has to compute the underlying classifier L times, and for the computation of all cross-validated p values one needs nL computations of the classifier. On the other hand, the underlying data sets D, D(X, θ), D_i etc. are very similar, and in our implementation in pvclass we utilize various tricks to exploit these redundancies.

Figure 2: p value functions for class memberships.

Figure 4: Illustration of the p values without indicating the class labels of the test data.

Figure 5: Illustration of the p values with class labels of the test data.
The function cvpvs returns a matrix PV containing cross-validated nonparametric p values for the potential class memberships of the training data. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p value for the null hypothesis that Y[i] = b. For the following example we use the logistic regression with penalty parameter tau.o = 2. The option progress = FALSE suppresses the printing of the computation progress.

R> PV.cv <- cvpvs(X = Xtrain, Y = Ytrain, method = "logreg", tau.o = 2)

Figure 6: ROC curves of the cross-validated p values.
T_θ^{(τ)}(X, D(X, θ)) should take relatively big values whenever Y ≠ θ; hence we compute this test statistic for all training observations with Y_i ≠ θ.

Figure 7: Distributions of the rates of uniquely correctly classified test observations for different values of τ for Example 1.

Figure 8: Distributions of the rates of uniquely correctly classified test observations for different values of τ for Example 2.
To analyze the training data itself by means of cross-validated p values, one has to compute the matrix (8) 1 + n(L − 1) times: For i = 1, 2, . . ., n, the original data set D and the resulting matrix (8) yield the cross-validated p value π_θ(X_i, D_i) = N_θ^{-1} #{j ≤ n : Y_j = θ, T_jθ ≥ T_iθ} for θ = Y_i.
For the test data set we choose 100 observations from each of the classes. So we end up with a training data set containing 2448 observations, whereof 562 belong to class 1.