Random-projection ensemble classification

We introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random projection ensemble classifier can be controlled by terms that do not depend on the original data dimension and a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high-dimensional classifiers via an extensive simulation study, which reveals its excellent finite-sample performance.


Introduction
Supervised classification concerns the task of assigning an object (or a number of objects) to one of two or more groups, based on a sample of labelled training data. The problem was first studied in generality in the famous work of Fisher (1936), where he introduced some of the ideas of Linear Discriminant Analysis (LDA) and applied them to his iris data set. Nowadays, classification problems arise in a plethora of applications, including spam filtering, fraud detection, medical diagnoses, market research, natural language processing and many others.
An increasing number of modern classification problems are high-dimensional, in the sense that the dimension p of the feature vectors may be comparable to or even greater than the number of training data points, n. In such settings, classical methods such as those mentioned in the previous paragraph tend to perform poorly (Bickel and Levina, 2004), and may even be intractable; for example, this is the case for LDA, where the problems are caused by the fact that the sample covariance matrix is not invertible when p ≥ n.
Many methods proposed to overcome such problems assume that the optimal decision boundary between the classes is linear, e.g. Friedman (1989) and Hastie et al. (1995). Another common approach assumes that only a small subset of the features is relevant for classification. Examples of works that impose such a sparsity condition include Fan and Fan (2008), where it is also assumed that the features are independent, as well as Tibshirani et al. (2003) and Guo, Hastie and Tibshirani (2007), where soft thresholding is used to obtain a sparse boundary. More recently, Witten and Tibshirani (2011) and Fan, Feng and Tong (2012) both solve an optimisation problem similar to Fisher's linear discriminant, with the addition of an ℓ1 penalty term to encourage sparsity.
In this paper we attempt to avoid the curse of dimensionality by projecting the feature vectors at random into a lower-dimensional space. The use of random projections in high-dimensional statistical problems is motivated by the celebrated Johnson-Lindenstrauss Lemma (e.g. Dasgupta and Gupta, 2002). This lemma states that, given x_1, …, x_n ∈ R^p, ǫ ∈ (0, 1) and d > 8 log n / ǫ², there exists a linear map f : R^p → R^d such that (1 − ǫ)‖x_i − x_j‖² ≤ ‖f(x_i) − f(x_j)‖² ≤ (1 + ǫ)‖x_i − x_j‖² for all i, j = 1, …, n. In fact, the function f that nearly preserves the pairwise distances can be found in randomised polynomial time using random projections distributed according to Haar measure, as described in Section 3 below. It is interesting to note that the lower bound on d in the Johnson-Lindenstrauss lemma does not depend on p. As a result, random projections have been used successfully as a computational time saver: when p is large compared to log n, one may project the data at random into a lower-dimensional space and run the statistical procedure on the projected data, potentially making great computational savings, while achieving comparable or even improved statistical performance. As one example of the above strategy, Durrant and Kaban (2013) obtained Vapnik-Chervonenkis type bounds on the generalisation error of a linear classifier trained on a single random projection of the data. See also Dasgupta (1999), Ailon and Chazelle (2006) and McWilliams et al. (2014) for other instances.
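As an illustration of the distance-preservation phenomenon behind the Johnson-Lindenstrauss lemma (not the paper's method itself), the following sketch projects data with a scaled Gaussian random matrix and inspects the ratios of projected to original pairwise squared distances; the dimensions and seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 50, 1000, 300  # note d depends on n (via log n) but not on p

X = rng.standard_normal((n, p))
# Gaussian random projection, scaled so squared distances are preserved in expectation
R = rng.standard_normal((d, p)) / np.sqrt(d)
Z = X @ R.T

def pdist2(M):
    # squared pairwise Euclidean distances between the rows of M
    sq = (M ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * M @ M.T

mask = ~np.eye(n, dtype=bool)          # exclude the zero diagonal
ratios = pdist2(Z)[mask] / pdist2(X)[mask]
# the ratios concentrate around 1, in line with the lemma
```

Increasing d tightens the concentration of the ratios around 1, at a rate that does not involve p.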
Other works have sought to reap the benefits of aggregating over many random projections. For instance, Marzetta, Tucci and Simon (2011) considered estimating a p × p population inverse covariance (precision) matrix using B^{-1} Σ_{b=1}^{B} A_b^T (A_b Σ̂ A_b^T)^{-1} A_b, where Σ̂ denotes the sample covariance matrix and A_1, …, A_B are random projections from R^p to R^d. Lopes, Jacob and Wainwright (2011) used this estimate when testing for a difference between two Gaussian population means in high dimensions, while Durrant and Kaban (2014) applied the same technique in Fisher's linear discriminant for a high-dimensional classification problem.
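A minimal sketch of this ensemble precision-matrix idea, assuming the average-of-back-projected-inverses form given above; the QR draw is a standard way to obtain an orthonormal d × p matrix and stands in here for exact Haar sampling.

```python
import numpy as np

def rp_precision_estimate(S, d, B, rng):
    """Average A^T (A S A^T)^{-1} A over B random d x p projections A
    (a sketch of the Marzetta-Tucci-Simon-style ensemble estimate)."""
    p = S.shape[0]
    est = np.zeros_like(S)
    for _ in range(B):
        Q, _ = np.linalg.qr(rng.standard_normal((p, d)))  # p x d, orthonormal columns
        A = Q.T                                           # d x p projection
        est += A.T @ np.linalg.inv(A @ S @ A.T) @ A
    return est / B

rng = np.random.default_rng(1)
p, n, d, B = 20, 10, 5, 200
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)              # singular: rank at most n - 1 < p
Omega = rp_precision_estimate(S, d, B, rng)
# S itself is not invertible, yet the ensemble estimate is well defined and symmetric
```

Each d × d compression A S A^T is invertible with probability one even when S is singular, which is what makes the estimate usable when p ≥ n.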
Our proposed methodology for high-dimensional classification has some similarities with the techniques described above, in the sense that we consider many random projections of the data, but is also closely related to bagging (Breiman, 1996), since the ultimate assignment of each test point is made by aggregation and a vote. Bagging has proved to be an effective tool for improving unstable classifiers; indeed, a bagged version of the (generally inconsistent) 1-nearest neighbour classifier is universally consistent as long as the resample size is carefully chosen; see Hall and Samworth (2005). More generally, bagging has been shown to be particularly effective in high-dimensional problems such as variable selection (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013). Another related approach to ours is Blaser and Fryzlewicz (2015), who consider ensembles of random rotations, as opposed to projections. One of the basic but fundamental observations that underpins our proposal is the fact that aggregating the classifications of all random projections is not sensible, since most of these projections will typically destroy the class structure in the data; see the top row of Figure 1. For this reason, we advocate partitioning the projections into non-overlapping blocks, and within each block we retain only the projection yielding the smallest estimate of the test error. The attraction of this strategy is illustrated in the bottom row of Figure 1, where we see a much clearer partition of the classes. Another key feature of our proposal is the realisation that a simple majority vote of the classifications based on the retained projections can be highly suboptimal; instead, we argue that the voting threshold should be chosen in a data-driven fashion in an attempt to minimise the test error of the infinite-simulation version of our random projection ensemble classifier. In fact, this estimate of the optimal threshold turns out to be remarkably effective in practice; see Section 5.1 for further details. We emphasise that our methodology can be used in conjunction with any base classifier, though we particularly have in mind classifiers designed for use in low-dimensional settings. The random projection ensemble classifier can therefore be regarded as a general technique for either extending the applicability of an existing classifier to high dimensions, or improving its performance.
Our theoretical results are divided into three parts. In the first, we consider a generic base classifier and a generic method for generating the random projections into R^d, and quantify the difference between the test error of the random projection ensemble classifier and its infinite-simulation counterpart as the number of projections increases. We then consider selecting random projections from non-overlapping blocks by initially drawing them according to Haar measure, and then within each block retaining the projection that minimises an estimate of the test error. Under a condition implied by the widely-used sufficient dimension reduction assumption (Li, 1991; Cook, 1998; Lee et al., 2013), we can then control the difference between the test error of the random projection classifier and the Bayes risk as a function of terms that depend on the performance of the base classifier based on projected data and our method for estimating the test error, as well as terms that become negligible as the number of projections increases. The final part of our theory gives risk bounds on the first two of these terms for specific choices of base classifier, namely Fisher's linear discriminant and the k-nearest neighbour classifier. The key point here is that these bounds only depend on d, the sample size n and the number of projections, and not on the original data dimension p.
The remainder of the paper is organised as follows. Our methodology and general theory are developed in Sections 2 and 3. Specific choices of base classifier are discussed in Section 4, while Section 5 is devoted to a consideration of the practical issues of the choice of voting threshold, projected dimension and the number of projections used. In Section 6 we present results from an extensive empirical analysis on both simulated and real data, where we compare the performance of the random projection ensemble classifier with several popular techniques for high-dimensional classification. The outcomes are extremely encouraging, and suggest that the random projection ensemble classifier has excellent finite-sample performance in a variety of different high-dimensional classification settings. We conclude with a discussion of various extensions and open problems. All proofs are deferred to the Appendix.
Finally in this section, we introduce the following general notation used throughout the paper. For a sufficiently smooth real-valued function g defined on a neighbourhood of t ∈ R, let ġ(t) and g̈(t) denote its first and second derivatives at t, and let ⌊t⌋ and {t} := t − ⌊t⌋ denote the integer and fractional parts of t, respectively.

A generic random projection ensemble classifier
We start by describing our setting and defining the relevant notation. Suppose that the pair (X, Y) takes values in R^p × {1, 2}, with joint distribution P, characterised by π_1 := P(Y = 1) and P_r, the conditional distribution of X|Y = r, for r = 1, 2. For convenience, we let π_2 := P(Y = 2) = 1 − π_1. In the alternative characterisation of P, we let P_X denote the marginal distribution of X and write η(x) := P(Y = 1|X = x) for the regression function. Recall that a classifier on R^p is a Borel measurable function C : R^p → {1, 2}, with the interpretation that we assign a point x ∈ R^p to class C(x). We let C_p denote the set of all such classifiers.
The misclassification rate, or risk, of a classifier C is R(C) := P{C(X) ≠ Y}, and is minimised by the Bayes classifier C^Bayes(x) := 1 if η(x) ≥ 1/2, and 2 otherwise (Devroye, Györfi and Lugosi, 1996, p. 10).
Of course, we cannot use the Bayes classifier in practice, since η is unknown. Nevertheless, we often have access to a sample of training data that we can use to construct an approximation to the Bayes classifier. Throughout this section and Section 3, it is convenient to consider the training sample T_n := {(x_1, y_1), …, (x_n, y_n)} to be fixed points in R^p × {1, 2}. Our methodology will be applied to a base classifier Ĉ_n = Ĉ_{n,T_{n,d}}, which we assume can be constructed from an arbitrary training sample T_{n,d} of size n in R^d × {1, 2}; thus Ĉ_n is a measurable function from (R^d × {1, 2})^n to C_d. Now assume that d ≤ p. We say a matrix A ∈ R^{d×p} is a projection if AA^T = I_{d×d}, the d-dimensional identity matrix. Let A = A_{d×p} := {A ∈ R^{d×p} : AA^T = I_{d×d}} be the set of all such matrices. Given a projection A ∈ A, define projected data z_i^A := Ax_i and y_i^A := y_i for i = 1, …, n, and let Ĉ_n^A denote the base classifier trained on T_{n,d}^A := {(z_1^A, y_1^A), …, (z_n^A, y_n^A)}. Note that although Ĉ_n^A is a classifier on R^p, the value of Ĉ_n^A(x) only depends on x through its d-dimensional projection Ax.
We now define a generic ensemble classifier based on random projections. For B_1 ∈ N, let A_1, A_2, …, A_{B_1} denote independent and identically distributed projections in A, independent of (X, Y). The distribution on A is left unspecified at this stage, and in fact our proposed method ultimately involves choosing this distribution depending on T_n. Now set

ν_n(x) := B_1^{-1} Σ_{b_1=1}^{B_1} 1_{{Ĉ_n^{A_{b_1}}(x) = 1}}.  (1)

For α ∈ (0, 1), the random projection ensemble classifier is defined to be

Ĉ_n^{RP}(x) := 1 if ν_n(x) ≥ α, and 2 otherwise.  (2)

We emphasise again here the additional flexibility afforded by not pre-specifying the voting threshold α to be 1/2. Our analysis of the random projection ensemble classifier will require some further definitions. Let μ_n(x) := E{ν_n(x)} = P{Ĉ_n^{A_1}(x) = 1}, where the randomness here comes from the random projections. Let G_{n,1} and G_{n,2} denote the distribution functions of μ_n(X)|{Y = 1} and μ_n(X)|{Y = 2}, respectively. We will make use of the following assumption:

(A.1) G_{n,1} and G_{n,2} are twice differentiable at α.
The first derivatives of G_{n,1} and G_{n,2}, when they exist, are denoted g_{n,1} and g_{n,2} respectively; under (A.1), these derivatives are well defined in a neighbourhood of α. Our first main result below gives an asymptotic expansion for the test error L(Ĉ_n^{RP}) := P{Ĉ_n^{RP}(X) ≠ Y} of our generic random projection ensemble classifier as the number of projections increases. In particular, we show that this test error can be well approximated by the test error of the infinite-simulation random projection classifier Ĉ_n^{RP*}(x) := 1 if μ_n(x) ≥ α, and 2 otherwise. This infinite-simulation classifier turns out to be easier to analyse in subsequent results. Note that under (A.1),

L(Ĉ_n^{RP*}) = π_1 G_{n,1}(α) + π_2 {1 − G_{n,2}(α)}.  (3)

The proof of Theorem 1 in the Appendix is lengthy, and involves a one-term Edgeworth approximation to the distribution function of a standardised binomial random variable. One of the technical challenges is to show that the error in this approximation holds uniformly in the binomial proportion.
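The ensemble vote (1) and threshold rule (2) can be sketched as follows, with a toy nearest-class-mean base classifier standing in for an arbitrary choice of base classifier; the data-generating model and all dimensions here are illustrative assumptions.

```python
import numpy as np

def base_classifier(train_z, train_y, z):
    # toy base classifier on projected data: nearest class mean
    # (an assumption here; any base classifier could be substituted)
    m1 = train_z[train_y == 1].mean(axis=0)
    m2 = train_z[train_y == 2].mean(axis=0)
    d1 = np.linalg.norm(z - m1, axis=-1)
    d2 = np.linalg.norm(z - m2, axis=-1)
    return np.where(d1 <= d2, 1, 2)

def rp_ensemble_predict(X_train, y_train, x, projections, alpha):
    # nu(x): fraction of projections under which the base classifier votes class 1
    votes = [base_classifier(X_train @ A.T, y_train, A @ x) == 1
             for A in projections]
    nu = np.mean(votes)
    return 1 if nu >= alpha else 2      # the threshold rule (2)

rng = np.random.default_rng(2)
p, d, n, B1 = 30, 2, 60, 25
X = rng.standard_normal((n, p)); X[:n // 2, 0] += 3.0
y = np.repeat([1, 2], n // 2)
projections = []
for _ in range(B1):
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))  # orthonormal d x p projection
    projections.append(Q.T)
label = rp_ensemble_predict(X, y, X[0], projections, alpha=0.5)
```

The point of the construction is that nu plays the role of ν_n(x), and the threshold alpha is a free parameter rather than being fixed at 1/2.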
Our next result controls the test excess risk, i.e. the difference between the test error and the Bayes risk, of the infinite-simulation random projection classifier in terms of the expected test excess risk of the classifier based on a single random projection. An attractive feature of this result is its generality: no assumptions are placed on the configuration of the training data T_n, the distribution P of the test point (X, Y), or on the distribution of the individual projections.
Theorem 2. We have

L(Ĉ_n^{RP*}) − R(C^Bayes) ≤ (1/min(α, 1 − α)) [E{L(Ĉ_n^{A_1})} − R(C^Bayes)].


Choosing good random projections

In this section, we study a special case of the generic random projection ensemble classifier introduced in Section 2, where we propose a screening method for choosing the random projections. Let L̂_n^A be an estimator of L_n^A, based on {(z_1^A, y_1^A), …, (z_n^A, y_n^A)}, that takes values in the set {0, 1/n, …, 1}. Examples of such estimators include resubstitution and leave-one-out estimates; we discuss these choices in greater detail in Section 4. For B_1, B_2 ∈ N, let {A_{b_1,b_2} : b_1 = 1, …, B_1, b_2 = 1, …, B_2} denote independent projections, independent of (X, Y), distributed according to Haar measure on A. One way to simulate from Haar measure on the set A is to first generate a matrix R ∈ R^{d×p}, where each entry is drawn independently from a standard normal distribution, and then take A^T to be the matrix of left singular vectors in the singular value decomposition of R^T. For b_1 = 1, …, B_1, let

b_2^*(b_1) := sargmin_{b_2 ∈ {1,…,B_2}} L̂_n^{A_{b_1,b_2}},

where sargmin denotes the smallest index where the minimum is attained in the case of a tie. We now set A_{b_1} := A_{b_1,b_2^*(b_1)}, and consider the random projection ensemble classifier from Section 2 constructed using the independent projections A_1, …, A_{B_1}.
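The Haar simulation step described above can be sketched directly: draw a d × p matrix of independent standard normals and take the left singular vectors of its transpose.

```python
import numpy as np

def sample_haar_projection(p, d, rng):
    # Draw R with i.i.d. N(0, 1) entries, then take A^T to be the matrix of
    # left singular vectors of R^T, as described in the text.
    R = rng.standard_normal((d, p))
    U, _, _ = np.linalg.svd(R.T, full_matrices=False)  # U is p x d with orthonormal columns
    return U.T                                         # A is d x p with A A^T = I_d

rng = np.random.default_rng(3)
A = sample_haar_projection(p=10, d=3, rng=rng)
# A A^T equals the 3 x 3 identity up to floating-point error
```

The reduced SVD suffices since only the first d left singular vectors of the p × d matrix R^T are needed.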

Let L*_n := min_{A∈A} L̂_n^A denote the optimal test error estimate over all projections. The minimum is attained here, since L̂_n^A takes only finitely many values. For j = 0, 1, …, ⌊n(1 − L*_n)⌋, let β_n(j) := P{L̂_n^A ≤ L*_n + j/n}, where A is distributed according to Haar measure on A. We assume the following:¹

¹We define L_n^A through an integral rather than defining L_n^A := P{Ĉ_n^A(X) ≠ Y} to make it clear that when A is a random projection, it should be conditioned on when computing L_n^A.
(A.2) There exist β_0 ∈ (0, 1) and β, ρ > 0 such that β_n satisfies a corresponding polynomial lower bound near j = 0.

Condition (A.2) asks for a certain growth rate of the distribution function of L̂_n^A close to its minimum value L*_n; observe that the strength of the condition decreases as B_2 increases. Under this condition, the following result is a starting point for controlling the expected test excess risk of the classifier based on a single projection chosen according to the scheme described above.
Proposition 3. Assume (A.2). Then the expected test error estimate of the selected projection, E(L̂_n^{A_1}), exceeds L*_n by a term that decreases as B_2 increases.

The form of the bound in Proposition 3 motivates us to seek to control L*_n − R(C^Bayes) in terms of the test excess risk of a classifier based on the projected data, in the hope that we will be able to show that this does not depend on p. To this end, define the regression function on R^d induced by the projection A ∈ A to be η^A(z) := P(Y = 1|AX = z). The corresponding induced Bayes classifier, which is the optimal classifier knowing only the distribution of (AX, Y), is given by C^{A-Bayes}(z) := 1 if η^A(z) ≥ 1/2, and 2 otherwise. In order to ensure that L*_n will be close to the Bayes risk, we will invoke an additional assumption on the form of the Bayes classifier:

(A.3) There exists a projection A* ∈ A such that P_X({x ∈ R^p : η(x) ≥ 1/2} △ {x ∈ R^p : η^{A*}(A*x) ≥ 1/2}) = 0, where △ denotes symmetric difference.

Condition (A.3) requires that the set of points x ∈ R^p assigned by the Bayes classifier to class 1 can be expressed as a function of a d-dimensional projection of x. Note that if the Bayes decision boundary is a hyperplane, then (A.3) holds with d = 1. Moreover, Proposition 4 below shows that, in fact, (A.3) holds under the sufficient dimension reduction condition, which states that Y is independent of X given A*X; see Cook (1998) for many statistical settings where such an assumption is natural.
Finally, then, we are in a position to control the test excess risk of our random projection ensemble classifier in terms of the test excess risk of a classifier based on d-dimensional data, as well as terms that reflect our ability to estimate the test error of classifiers based on projected data and terms that depend on B 1 and B 2 .
Theorem 5. Assume (A.1), (A.2) and (A.3). Then L(Ĉ_n^{RP}) − R(C^Bayes) is bounded above by a sum of four terms: two reflecting the performance of the base classifier on the projected data and our ability to estimate its test error, and two that depend on B_1 and B_2.

Regarding the bound in Theorem 5 as a sum of four terms, we see that the last two of these can be seen as the price we have to pay for the fact that we do not have access to an infinite sample of random projections. These terms can be made negligible by choosing B_1 and B_2 to be sufficiently large, but it should be noted that ǫ_n may increase with B_2. This is a reflection of the fact that minimising an estimate of test error may lead to overfitting. The behaviour of this term, together with that of L_n^{A*} − R^{A*-Bayes} and ǫ_n^{A*}, depends on the choice of base classifier, but in the next section we describe settings where these terms can be bounded by expressions that do not depend on p.

Possible choices of the base classifier
In this section, we change our previous perspective and regard the training data T_n := {(X_1, Y_1), …, (X_n, Y_n)} as independent random pairs with distribution P, so our earlier statements are interpreted conditionally on T_n. We consider particular choices of base classifier, and study the first two terms in the bound in Theorem 5.

Linear Discriminant Analysis
Linear Discriminant Analysis (LDA), introduced by Fisher (1936), is arguably the simplest classification technique. Recall that in the special case where X|Y = r ∼ N_p(µ_r, Σ), we have C^Bayes(x) = 1 if and only if log(π_1/π_2) + {x − (µ_1 + µ_2)/2}^T Σ^{-1}(µ_1 − µ_2) ≥ 0. In LDA, π_r, µ_r and Σ are estimated by their sample versions, using a pooled estimate of Σ. Although LDA cannot be applied directly when p ≥ n, since the sample covariance matrix is singular, we can still use it as the base classifier for a random projection ensemble, provided that d < n. Indeed, noting that for any A ∈ A, we have AX|Y = r ∼ N_d(µ_r^A, Σ^A), where µ_r^A := Aµ_r and Σ^A := AΣA^T, we can define

Ĉ_n^A(x) := 1 if log(π̂_1/π̂_2) + {Ax − (μ̂_1^A + μ̂_2^A)/2}^T Ω̂^A(μ̂_1^A − μ̂_2^A) ≥ 0, and 2 otherwise.  (5)

Here, π̂_r, μ̂_r^A and Σ̂^A denote the sample versions computed from the projected data, and Ω̂^A := (Σ̂^A)^{-1}.
Write Φ for the standard normal distribution function. Under the normal model specified above, the test error of the LDA classifier can be written explicitly in terms of Φ, where δ̂^A := μ̂_2^A − μ̂_1^A and μ̄^A := (μ̂_1^A + μ̂_2^A)/2. Okamoto (1963) studied the excess risk of the LDA classifier in an asymptotic regime in which d is fixed as n diverges. In fact, he considered a very slightly different data generating model, in which the training sample sizes from each population are assumed to be known in advance, so that without loss of generality, we may assume that Y_1 = … = Y_{n_1} = 1 and Y_{n_1+1} = … = Y_n = 2, while X_i|Y_i = r ∼ N_p(µ_r, Σ), as before. Specialising his results for simplicity to the case where n is even and n_1 = n_2 = n/2, Okamoto (1963) gave an asymptotic expansion for the excess risk of the LDA classifier (5) on the projected data; this is the source of the bound (6) below. It remains to control the errors ǫ_n and ǫ_n^{A*} in Theorem 5. For the LDA classifier, we consider the resubstitution estimator L̂_n^A := n^{-1} Σ_{i=1}^n 1_{{Ĉ_n^A(x_i) ≠ y_i}}. Devroye and Wagner (1976) provided a Vapnik-Chervonenkis bound for L̂_n^A under no assumptions on the underlying data generating mechanism, a tail bound (8) valid for every n ∈ N and ǫ > 0; see also Devroye et al. (1996, Theorem 23.1). We can then conclude a corresponding bound (9) on E(|ǫ_n^{A*}|). In the case of ǫ_n, the bound (8) cannot be applied directly, because (X_1, Y_1), …, (X_n, Y_n) are no longer independent conditional on A_1. Nevertheless, since A_{1,1}, …, A_{1,B_2} are independent of T_n, a version of the same tail bound still holds, and we can therefore conclude by almost the same argument as that leading to (9) that the bound (10) holds. Note that none of the bounds (6), (9) and (10) depends on the original data dimension p. Moreover, substituting the bound (10) into Theorem 5 reveals a trade-off in the choice of B_2 when using LDA as the base classifier. Choosing B_2 to be large gives us a good chance of finding a projection with a small estimate of test error, but we may incur a small overfitting penalty as reflected by (10).
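The block-screening step with an LDA base classifier and the resubstitution estimator can be sketched as follows; the pooled-covariance LDA here follows the form of (5), and the data-generating model is an arbitrary illustration.

```python
import numpy as np

def lda_train(Z, y):
    # Pooled-covariance LDA on d-dimensional projected data (a sketch of (5))
    Z1, Z2 = Z[y == 1], Z[y == 2]
    pi1 = len(Z1) / len(Z)
    m1, m2 = Z1.mean(axis=0), Z2.mean(axis=0)
    S = ((Z1 - m1).T @ (Z1 - m1) + (Z2 - m2).T @ (Z2 - m2)) / (len(Z) - 2)
    w = np.linalg.inv(S) @ (m1 - m2)
    b = np.log(pi1 / (1 - pi1)) - 0.5 * (m1 + m2) @ w
    return lambda Z_new: np.where(Z_new @ w + b >= 0, 1, 2)

def select_projection(X, y, d, B2, rng):
    # Within a block of B2 Haar-type projections, keep the one minimising the
    # resubstitution (training error) estimate; ties keep the first index (sargmin)
    best_A, best_err = None, np.inf
    p = X.shape[1]
    for _ in range(B2):
        Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
        A = Q.T
        clf = lda_train(X @ A.T, y)
        err = np.mean(clf(X @ A.T) != y)
        if err < best_err:
            best_A, best_err = A, err
    return best_A, best_err

rng = np.random.default_rng(4)
p, d, n = 40, 2, 80
X = rng.standard_normal((n, p)); X[: n // 2] += 1.0
y = np.repeat([1, 2], n // 2)
A_star, err = select_projection(X, y, d, B2=50, rng=rng)
```

Because each LDA fit happens in d dimensions, the covariance inversion is well posed even though p could far exceed n.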

Quadratic Discriminant Analysis
Quadratic Discriminant Analysis (QDA) is designed to handle situations where the class-conditional covariance matrices are unequal. Recall that when X|Y = r ∼ N_p(µ_r, Σ_r) and π_r := P(Y = r), for r = 1, 2, the Bayes decision boundary is the set of points x ∈ R^p at which the two classes' quadratic discriminant functions are equal. In QDA, π_r, µ_r and Σ_r are estimated by their sample versions. If p ≥ min(n_1, n_2), where n_r := Σ_{i=1}^n 1_{{Y_i = r}} is the number of training sample observations from the rth class, then at least one of the sample covariance matrix estimates is singular, and QDA cannot be used directly. Nevertheless, we can still choose d < min(n_1, n_2) and use QDA as the base classifier in a random projection ensemble, where π̂_r^A and μ̂_r^A are as defined in Section 4.1, and the class-specific covariance estimates Σ̂_r^A are computed separately for r = 1, 2. Unfortunately, analogous theory to that presented in Section 4.1 does not appear to exist for the QDA classifier (unlike for LDA, the risk does not have a closed form). Nevertheless, we found in our simulations presented in Section 6 that the QDA random projection ensemble classifier can perform very well in practice. In this case, we estimate the test errors using the leave-one-out estimator

L̂_n^A := n^{-1} Σ_{i=1}^n 1_{{Ĉ_{n,i}^A(z_i^A) ≠ y_i^A}},  (11)

where Ĉ_{n,i}^A denotes the classifier Ĉ_n^A, trained without the ith pair, i.e. based on T_{n,d}^A \ {(z_i^A, y_i^A)}. For a method like QDA that involves estimating more parameters than LDA, we found that the leave-one-out estimator was less susceptible to overfitting than the resubstitution estimator.
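The leave-one-out estimator (11) can be sketched generically, for any base classifier supplied as a training function; the nearest-class-mean trainer below is a hypothetical stand-in, not the QDA rule itself.

```python
import numpy as np

def leave_one_out_error(Z, y, train):
    # (11): refit the base classifier n times, each time without pair i,
    # and count errors made on the held-out points
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        clf = train(Z[keep], y[keep])
        mistakes += clf(Z[i:i + 1])[0] != y[i]
    return mistakes / n

def nearest_mean_train(Z, y):
    # hypothetical stand-in base classifier (any train function works)
    m1, m2 = Z[y == 1].mean(axis=0), Z[y == 2].mean(axis=0)
    return lambda Znew: np.where(np.linalg.norm(Znew - m1, axis=1)
                                 <= np.linalg.norm(Znew - m2, axis=1), 1, 2)

rng = np.random.default_rng(5)
Z = rng.standard_normal((30, 3)); Z[:15] += 2.0
y = np.repeat([1, 2], 15)
loo = leave_one_out_error(Z, y, nearest_mean_train)
```

The estimator costs n refits per projection, which is why it is reserved for base classifiers, like QDA, where the extra robustness to overfitting is worth the computation.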

The k-nearest neighbour classifier
The k-nearest neighbour classifier (knn), first proposed by Fix and Hodges (1951), is a nonparametric method that classifies the test point x ∈ R^p according to a majority vote over the classes of the k nearest training data points to x. The enormous popularity of the knn classifier can be attributed partly to its simplicity and intuitive appeal; however, it also has the attractive property of being universally consistent: for every distribution P, as long as k → ∞ and k/n → 0, the risk of the knn classifier converges to the Bayes risk (Devroye et al., 1996, Theorem 6.4). Hall, Park and Samworth (2008) derived the rate of convergence of the excess risk of the k-nearest neighbour classifier. Under regularity conditions, the optimal choice of k, in terms of minimising the excess risk, is O(n^{4/(p+4)}), and the rate of convergence of the excess risk with this choice is O(n^{−4/(p+4)}). Thus, in common with other nonparametric methods, there is a 'curse of dimensionality' effect that means the classifier typically performs poorly in high dimensions. Samworth (2012a,b) found the optimal way of assigning decreasing weights to increasingly distant neighbours, and quantified the asymptotic improvement in risk over the unweighted version, but the rate of convergence remains the same.
We can use the knn classifier as the base classifier for a random projection ensemble as follows: for a test point x ∈ R^p, we classify Ax by a majority vote over the classes of its k nearest neighbours among the projected training points z_1^A, …, z_n^A. The theory described in the previous paragraph can be applied to show that, under regularity conditions, the expected excess risk over the induced Bayes classifier is O(n^{−4/(d+4)}). Once again, a natural estimate of the test error in this case is the leave-one-out estimator defined in (11), where we use the same value of k on the leave-one-out datasets as for the base classifier, and where distance ties are split in the same way as for the base classifier. For this estimator, Devroye and Wagner (1979) showed a bound valid for every n ∈ N; see also Devroye et al. (1996, Chapter 24). Devroye and Wagner (1979) also provided a tail bound analogous to (8) for the leave-one-out estimator. Arguing very similarly to Section 4.1, we can deduce a corresponding bound under no conditions on the data generating mechanism.
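A minimal sketch of the knn base classifier on (projected) d-dimensional data, ignoring the distance-tie conventions discussed above; the toy data are an illustrative assumption.

```python
import numpy as np

def knn_predict(Z_train, y_train, Z_test, k):
    # majority vote over the classes of the k nearest training points
    # (Euclidean distance); ties in the vote default to class 2 here
    out = np.empty(len(Z_test), dtype=int)
    for i, z in enumerate(Z_test):
        idx = np.argsort(np.linalg.norm(Z_train - z, axis=1))[:k]
        votes = y_train[idx]
        out[i] = 1 if np.sum(votes == 1) > k / 2 else 2
    return out

rng = np.random.default_rng(6)
Z = rng.standard_normal((40, 2)); Z[:20] += 2.5   # d = 2 projected data
y = np.repeat([1, 2], 20)
pred = knn_predict(Z, y, Z[:5], k=5)
```

Running knn in d rather than p dimensions is exactly what converts the n^{-4/(p+4)} rate into the dimension-free n^{-4/(d+4)} rate discussed above.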

Choice of α
We now discuss the choice of the voting threshold α in (2). The expression for the test error of the infinite-simulation random projection ensemble classifier given in (3) suggests the 'oracle' choice

α* ∈ argmin_{α' ∈ (0,1)} [π_1 G_{n,1}(α') + π_2{1 − G_{n,2}(α')}].

Note that, if assumption (A.1) holds and α* ∈ (0, 1), then π_1 g_{n,1}(α*) = π_2 g_{n,2}(α*), and the expansion in Theorem 1 simplifies accordingly. Of course, α* cannot be used directly, because we do not know G_{n,1} and G_{n,2} (and we may not know π_1 and π_2 either). Nevertheless, for the LDA base classifier we can estimate G_{n,r} using the empirical distribution function Ĝ_{n,r} of {ν_n(X_i) : Y_i = r}, for r = 1, 2. For the QDA and k-nearest neighbour base classifiers, we use a leave-one-out-based estimate ν̃_n(X_i), in which each base classifier is trained without the ith pair, in place of ν_n(X_i). We also estimate π_r by π̂_r := n^{-1} Σ_{i=1}^n 1_{{Y_i = r}}, and then set the cut-off in (2) as

α̂ ∈ argmin_{α' ∈ (0,1)} [π̂_1 Ĝ_{n,1}(α') + π̂_2{1 − Ĝ_{n,2}(α')}].  (13)

Since empirical distribution functions are piecewise constant, the objective function in (13) does not have a unique minimum, so we choose α̂ to be the average of the smallest and largest minimisers. An attractive feature of the method is that {ν_n(X_i) : i = 1, …, n} (or {ν̃_n(X_i) : i = 1, …, n} in the case of QDA and knn) are already calculated in order to choose the projections, so calculating Ĝ_{n,1} and Ĝ_{n,2} carries negligible extra computational cost.
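A sketch of the empirical threshold choice in (13), assuming the vote fractions nu and class labels are already in hand; the simulated vote distributions are illustrative assumptions.

```python
import numpy as np

def choose_alpha(nu, y, pi1):
    # Empirical version of (13): scan candidate thresholds and minimise
    # pi1 * G1(alpha) + pi2 * (1 - G2(alpha)), the training error of
    # assigning class 1 whenever nu(X) >= alpha
    nu1, nu2 = nu[y == 1], nu[y == 2]
    candidates = np.unique(np.concatenate(([0.0, 1.0], nu)))

    def objective(a):
        G1 = np.mean(nu1 < a)          # class-1 points voted below the threshold
        G2 = np.mean(nu2 < a)
        return pi1 * G1 + (1 - pi1) * (1 - G2)

    vals = np.array([objective(a) for a in candidates])
    best = candidates[vals == vals.min()]
    # average of smallest and largest minimisers, as in the text
    return 0.5 * (best.min() + best.max())

rng = np.random.default_rng(7)
nu = np.concatenate([rng.uniform(0.5, 1.0, 70),   # class-1 vote fractions
                     rng.uniform(0.0, 0.6, 30)])  # class-2 vote fractions
y = np.repeat([1, 2], [70, 30])
alpha = choose_alpha(nu, y, pi1=0.7)
```

With imbalanced classes, as here, the minimiser typically sits well away from the naive choice 1/2, which is precisely the situation in which fixing α = 1/2 can assign every point to a single class.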
Figures 2 and 3 illustrate π̂_1 Ĝ_{n,1} for different base classifiers as well as different values of n and π_1. Here, a very good approximation to the estimand was obtained using an independent data set of size 5000. Unsurprisingly, the performance of the estimator improves as n increases, but the most notable feature of these plots is the fact that, for all classifiers and even for small sample sizes, α̂ is an excellent estimator of α*, and may be a substantial improvement on the naive choice α = 1/2 (which may result in a classifier that assigns every point to a single class).

Choice of d
We want to choose d as small as possible in order to obtain the best possible performance bounds as described in Section 4 above.This also reduces the computational cost.However, the performance bounds rely on assumption (A.3), whose strength decreases as d increases, so we want to choose d large enough that this condition holds (at least approximately).
In Section 6 we see that the random projection ensemble method is quite robust to the choice of d. Nevertheless, in some circumstances it may be desirable to have an automatic choice. As one way to do this, suppose that we wish to choose d from a set D ⊆ {1, …, p}; we can then compare, for each d ∈ D, the empirical distribution of the test error estimates of the selected projections, and select the value of d for which this distribution is most favourable. In Figures 4 and 5 we present these empirical distribution functions, for d ∈ {2, 3, 4, 5}, for one training dataset from Model 1 (described in Section 6.1.1) and Model 2 (described in Section 6.1.2). In each case, we set π_1 = 1/2, n = 100, p = 50 and B_1 = B_2 = 100. These plots suggest, particularly for Model 2, that projecting into a slightly larger dimensional space is preferable, and indeed this appears to be the case from the relevant entry of Table 2 below.
The ideas presented here may also be used to decide between two different base classifiers. For example, comparing the green lines across different panels of Figure 4, we see that for Model 1 and d = 5, we might expect the best results with the QDA base classifier, and indeed this is confirmed by the simulation results in Table 1 below, where π_1 = 0.33, p = 50 and d = 2.

Choice of B 1 and B 2
In order to minimise the third term in the bound in Theorem 5, we should choose B_1 to be as large as possible. The constraint, of course, is that the computational cost of the random projection classifier scales linearly with B_1. The choice of B_2 is more subtle; while the fourth term in the bound in Theorem 5 decreases as B_2 increases, we saw in Section 4 that upper bounds on E(|ǫ_n|) may increase with B_2. In principle, we could try to use the expressions given in Theorem 5 and Section 4 to choose B_2 to minimise the overall upper bound on L(Ĉ_n^{RP}) − R(C^Bayes). Although β_0, β and ρ are unknown to the practitioner, these could be estimated based on the empirical distribution of {L̂_n^{A_b}}_{b=1}^B, where {A_b}_{b=1}^B are independent projections drawn according to Haar measure. In practice, however, we found that an involved approach such as this was unnecessary, and that the ensemble method was robust to the choice of B_1 and B_2. In all of our simulations, we set B_1 = B_2 = 100.

[Figure 5: empirical distribution functions of the test error estimates for the LDA (left), QDA (middle) and knn (right) base classifiers after projecting, for Model 2, with π_1 = 1/2, n = 100, p = 50, and d = 2 (black), 3 (red), 4 (blue) and 5 (green).]

Empirical analysis
In this section, we assess the empirical performance of the random projection ensemble classifier. Throughout this section, RP-knn_d, RP-LDA_d and RP-QDA_d denote the random projection classifier with knn, LDA and QDA base classifiers, respectively; the subscript d refers to the dimension of the image space of the projections. For comparison, we also present results for several state-of-the-art methods for high-dimensional classification, namely Penalized LDA (Witten and Tibshirani, 2011), Nearest Shrunken Centroids (Tibshirani et al., 2003), Shrunken Centroids Regularized Discriminant Analysis (Guo, Hastie and Tibshirani, 2007) and Independence Rules (IR) (Bickel and Levina, 2004), as well as for the base classifier applied in the original space. For the standard knn classifier, we chose k via leave-one-out cross-validation. The tuning parameters for the other methods were chosen using the default settings in the corresponding R packages PenLDA (Witten, 2011), NSC (Hastie et al., 2015) and SCRDA (Islam and Mcleod, 2015), namely 6-fold, 10-fold and 10-fold cross-validation, respectively.
In each of the simulated examples below, we used n ∈ {50, 100, 200}, p = 50, two different values of the prior probability π 1 and B 1 = B 2 = 100.Each experiment was repeated 100 times and the risk was estimated on an independent test set of size 1000.

Independent features
Model 1: Here, P_1 is the distribution of p independent components, each with a standard Laplace distribution, while P_2 = N_p(µ, I_{p×p}), with µ = (1/8)(1, …, 1)^T. In Model 1, the class boundaries are non-linear and, in fact, assumption (A.3) is not satisfied for any d < p. Nevertheless, in Table 1 we see that the random projection versions outperform their respective vanilla counterparts (when these are tractable) as well as all of the other methods implemented. The RP-QDA classifiers perform especially well; in particular, they are able to cope better with the non-linearity of the class boundaries than the RP-LDA classifiers. There is little difference between the performance of the d = 2 and d = 5 versions of the random projection classifiers.

Model 2: Here, P_r is the distribution of µ_r + Z/√(U/ν_r), where Z ∼ N_p(0, Σ_r) independent of U ∼ χ²_{ν_r}. Thus, P_r is the multivariate t-distribution centred at µ_r, with ν_r degrees of freedom and shape parameter Σ_r. We set µ_1 = 0, µ_2 = 2(1, …, 1, 0, …, 0)^T, where µ_2 has 5 non-zero components, ν_1 = 1, ν_2 = 2, Σ_1 = I_{p×p} and Σ_2 = (Σ_{j,k}), where Σ_{j,j} = 1, Σ_{j,k} = 0.5 if max(j, k) ≤ 5 and j ≠ k, and Σ_{j,k} = 0 otherwise. Model 2 explores the effect of heavy tails and the presence of correlation between the features. Again, assumption (A.3) is not satisfied for any d < p. We see in Table 2 that the RP-knn classifiers are the most successful among all of the methods in the comparison, and there is little difference between the d = 2 and d = 5 versions. The RP-QDA classifiers perform poorly here because the heavy-tailed distributions mean that the mean and covariance matrix estimates are poor.
Model 3 is chosen to investigate a setting in which one class is multi-modal. Note that assumption (A.3) holds with d = 5; indeed, for example, the first five rows of A* may be taken to be the first five standard Euclidean basis vectors. In Table 3, we see that the random projection ensembles for each of the three base classifiers outperform their standard counterparts. The RP-knn and RP-QDA classifiers are particularly effective here. Even though the Bayes decision boundary is sparse, in that it depends only on the first 5 features, the PenLDA, NSC, SCRDA and IR classifiers perform poorly because the boundary is non-linear.
The setting in Model 4 is chosen so that assumption (A.3) holds with d = 3; in fact, A* can be taken to be the first three rows of R^T. Note that, if R = I_{p×p}, then the model would be sparse and we would expect the PenLDA, NSC and SCRDA methods to perform very well. However, for a generic R, the model is not sparse, and the random projection ensemble methods, which are invariant to the choice of coordinate system, typically perform as well as or better than them (especially RP-LDA). Classification is difficult in this setting, and the risks of all of the classifiers are considerably greater than the Bayes risk.

Real data example
The Ozone data set, available from the UCI machine learning repository, classifies days as either 'normal days' (class 1) or 'ozone days' (class 2), and records p = 72 potentially relevant features. After removing the missing values, there are 1719 observations in class 1 and 128 observations in class 2. To avoid the situation where classifiers assign every point to class 1, we subsampled 384 of the class 1 observations to yield a prevalence ratio of 3:1. We then sampled a training set of size n ∈ {50, 100, 200} and used the remaining 512 − n pairs as the test set. The means and standard errors of the risk, estimated over 100 repetitions of the training set sampling, are presented in Table 5. In each case, we took B_1 = B_2 = 100. In this example, all of the random projection ensemble classifiers performed very well, though SCRDA was also competitive, particularly for the largest sample size.

Discussion and extensions
We have introduced a general framework for high-dimensional classification via the combination of the results of applying a base classifier on carefully selected low-dimensional random projections of the data. One of its attractive features is its generality: the approach can be used in conjunction with any base classifier. Moreover, although we explored in detail one method for combining the random projections (partly because it facilitates rigorous statistical analysis), there are many other options available here. For instance, instead of only retaining the projection within each block yielding the smallest estimate of test error, one might give weights to the different projections, where the weights decrease as the estimate of test error increases. Another interesting avenue to explore would be alternative methods for estimating the test error, such as sample splitting. The idea here would be to split the sample T_n into independent subsamples T_{n,1} and T_{n,2}, say, of sizes n^(1) and n^(2), and to use T_{n,2} to estimate the test error L^A_{n^(1),1} of the classifier trained on T_{n,1}. Since T_{n,1} and T_{n,2} are independent, we can apply Hoeffding's inequality to bound the error of this estimate, uniformly over the projections considered. It then follows, by very similar arguments to those given in Section 4.1, that analogous risk bounds hold. The advantages of this approach are twofold: first, the bounds hold for any choice of base classifier (and still without any assumptions on the data generating mechanism); second, the bounds on the terms in (14) merely rely on Hoeffding's inequality, as opposed to Vapnik-Chervonenkis theory, so are typically sharper. The disadvantage is that the other terms in the bound in Theorem 5 will tend to be larger due to the reduced effective sample size. The choice of n^(1) and n^(2) in such an approach therefore becomes an interesting question.
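The sample-splitting idea rests on the fact that Hoeffding's inequality controls the deviation of a held-out error estimate for a classifier trained on the other half of the data. As a numerical sanity check, a sketch (the true error value 0.2 and all names are hypothetical; the paper's bound is additionally uniform over the projections via a union bound):

```python
import numpy as np

def hoeffding_bound(n2, delta):
    # If the held-out sample T_{n,2} has size n2, then with probability at
    # least 1 - delta the empirical error of a classifier trained on the
    # independent sample T_{n,1} deviates from its true error by at most eps,
    # where eps = sqrt(log(2/delta) / (2 * n2)).
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n2))

# Monte Carlo illustration: each held-out estimate is a binomial proportion.
rng = np.random.default_rng(0)
n2, delta, true_err = 500, 0.05, 0.2
eps = hoeffding_bound(n2, delta)
errs = rng.binomial(n2, true_err, size=2000) / n2
coverage = float(np.mean(np.abs(errs - true_err) <= eps))
```

Since Hoeffding's inequality is distribution-free, the empirical coverage should comfortably exceed 1 − δ, reflecting the conservativeness of the bound.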
Many practical classification problems involve K > 2 classes. The main issue in extending our methodology to such settings is the definition of Ĉ^RP_n analogous to (2). One approach introduces class-wise voting thresholds α_1, . . ., α_K, whose choice is analogous to the choice of α in the case K = 2. It is therefore natural to seek to minimise the test error of the corresponding infinite-simulation random projection ensemble classifier, as before.
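One way to turn class-wise thresholds into an assignment rule is sketched below. This particular rule (assign to the class whose vote fraction exceeds its threshold by the largest margin) is an assumption for illustration, not the paper's definition; the function name is hypothetical.

```python
import numpy as np

def multiclass_rp_assign(vote_fracs, alphas):
    # vote_fracs[i, k]: fraction of the B1 selected projections on which the
    # base classifier assigns point i to class k+1 (rows sum to 1).
    # alphas[k]: voting threshold alpha_{k+1} for class k+1.
    # Assign each point to the class with the largest threshold-adjusted
    # vote fraction; with K = 2 and alphas = (1 - alpha, alpha) this
    # reduces to comparing the class-1 vote fraction with alpha.
    margins = np.asarray(vote_fracs) - np.asarray(alphas)
    return margins.argmax(axis=1) + 1   # labels 1, ..., K
```

Choosing α_1, . . ., α_K to minimise an estimate of the test error of the resulting classifier then parallels the data-driven choice of α in the two-class case.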
In other situations, it may be advantageous to consider alternative types of projection, perhaps because of additional structure in the problem. One particularly interesting issue concerns ultrahigh-dimensional settings, say p in the thousands. Here, it may be too time-consuming to generate enough random projections to explore adequately the space A_{d×p}. As a mathematical quantification of this, the cardinality of an ε-net in the Euclidean norm of the surface of the Euclidean ball in R^p increases exponentially in p (e.g. Vershynin, 2012; Kim and Samworth, 2014). In such challenging problems, one might restrict the projections A to be axis-aligned, so that each row of A consists of a single non-zero component, equal to 1, and p − 1 zero components. There are then only (p choose d) ≤ p^d/d! choices for the projections, and if d is small, it may be feasible even to carry out an exhaustive search. Of course, this approach loses one of the attractive features of our original proposal, namely its equivariance under orthogonal transformations. Nevertheless, corresponding theory can be obtained provided that the projection A* in (A.3) is axis-aligned. This is a much stronger requirement, but it seems that imposing greater structure is inevitable if we are to obtain good classification in such settings.
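Enumerating the axis-aligned projections, and checking the (p choose d) ≤ p^d/d! count quoted above, is straightforward; a sketch:

```python
import numpy as np
from itertools import combinations
from math import comb, factorial

def axis_aligned_projections(p, d):
    # Enumerate all axis-aligned d x p projections: each row contains a
    # single non-zero component, equal to 1, and the selected coordinates
    # are distinct, so there are exactly (p choose d) such matrices.
    for rows in combinations(range(p), d):
        A = np.zeros((d, p))
        A[list(range(d)), list(rows)] = 1.0
        yield A

# For small d an exhaustive search over all projections is feasible.
projs = list(axis_aligned_projections(6, 2))
assert len(projs) == comb(6, 2) == 15
assert all(A.shape == (2, 6) and A.sum() == 2 for A in projs)
assert comb(50, 3) <= 50**3 / factorial(3)   # the bound in the text
```

For p = 1000 and d = 3 this is roughly 1.7 × 10^8 projections: large but enumerable, in contrast with a net over all of A_{d×p}.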
Recall that G_{n,1} and G_{n,2} are the distribution functions of μ_n(X)|{Y = 1} and μ_n(X)|{Y = 2}, respectively. We can therefore write
where, here and throughout the proof, T denotes a Bin(B_1, θ) random variable. Similarly,
It follows that
where
; we now show that
as B_1 → ∞. Our proof involves a one-term Edgeworth expansion of the binomial distribution function in (15), where the error term is controlled uniformly in the parameter. The expansion relies on the following version of Esseen's smoothing lemma.
Theorem 6 (Esseen, 1945, Chapter 2, Theorem 2b).
• The set of discontinuities of F and G is contained in {t_i : i ∈ Z}, where (t_i) is a strictly increasing sequence with inf_i {t_{i+1} − t_i} ≥ c_1; moreover, F is constant on the intervals [t_i, t_{i+1}) for all i ∈ Z.
Then there exist constants c_2, C_2 > 0 such that
Let σ² := θ(1 − θ), and let Φ and φ denote the standard normal distribution and density functions, respectively. Moreover, for t ∈ R, let
In Proposition 7 below, we apply Theorem 6 to the following functions:
and
Proposition 7. Let F_{B_1} and G_{B_1} be as in (16) and (17). There exists a constant C > 0 such that, for all B_1 ∈ N,
sup_{θ∈(0,1)}
Proposition 7, whose proof is given after the proof of Theorem 5, bounds, uniformly in θ, the error in the one-term Edgeworth expansion G_{B_1} of the distribution function F_{B_1}. Returning to the proof of Theorem 1, we will argue that the dominant contribution to the integral in (15) arises from the interval (max{0, α − ε_1}, min{α + ε_1, 1}), where ε_1 := B_1^{−1/2} log B_1. For the remainder of the proof, we assume that B_1 is large enough that [α − ε_1, α + ε_1] ⊆ (0, 1).
For the region |θ − α| < ε_1, by Proposition 7, there exists C′ > 0 such that, for all B_1 sufficiently large,
where r(t) := p(t) + q(t). Hence, using the fact that, for large B_1,
sup_{|θ−α|<ε_1}
as B_1 → ∞. To aid exposition, we will henceforth concentrate on the dominant terms in our expansions, denoting the remainder terms by R_1, R_2, . . .; these remainders are then controlled at the end of the argument. For the first term in (19), we write
Now, for the first term in (20),
For the second term in (20), write
Returning to the second term in (19), observe that
The claim (15) will now follow from (18), (19), (20), (21), (22) and (23), once we have shown that
as B_1 → ∞.
To bound R_1: Observe that, by a Taylor expansion about ζ = α, there exists B_0 ∈ N such that, for all B_1 > B_0 and all θ, ζ,
Using this bound with ζ = θ, we deduce that, for all B_1 sufficiently large,
To bound R_2: Since g°_n is differentiable at α, given ε > 0, there exists δ_ε > 0 such that
for all |θ − α| < δ_ε. It follows that, for all B_1 sufficiently large,
To bound R_4: By the bound in (25), we have that, given ε > 0, for all B_1 sufficiently large,
To bound R_5: For all B_1 sufficiently large,
To bound R_6: We write R_6 = R_61 + R_62, where
and
for all sufficiently large B_1. We deduce that R_61 = o(B_1^{−1}) as B_1 → ∞. To control R_62, by the mean value theorem, we have, for all B_1 sufficiently large and all such ζ,
Thus, for large B_1,
By the bound in (25), given ε > 0, for all B_1 sufficiently large,
Moreover, by the mean value theorem, for all B_1 sufficiently large and all |ζ − α| ≤ ε_1,
It follows that, for all B_1 sufficiently large,
We have now established the claim (24), and the result follows.
Proof of Proposition 4. For a Borel set C ⊆ R^d, let P_{A*X}(C) := ∫_{{x : A*x ∈ C}} dP_X(x), so that P_{A*X} is the marginal distribution of A*X. Further, for z ∈ R^d, write P_{X|A*X=z} for the conditional distribution of X given A*X = z. If Y is independent of X given A*X, and if B is a Borel subset of R^p, then
We deduce that P_X({x ∈ R^p : η(x) ≠ η^{A*}(A*x)}) = 0; in particular, (A.3) holds, as required. Note that we have used (A.3) to obtain the penultimate equality. The result now follows immediately from these facts, together with Theorem 1, Theorem 2 and Proposition 3.

Figure 1: Different two-dimensional projections of a sample of size n = 200 from Model 2 in Section 6.1.2, with p = 50 dimensions and prior probability π_1 = 1/2. Top row: three projections drawn from Haar measure; bottom row: the projections with the smallest estimate of test error out of 100 Haar projections, with LDA (left), Quadratic Discriminant Analysis (middle) and k-nearest neighbour (right) base classifiers.

Figures 4 and 5 do not suggest great differences in performance for different choices of d, especially for the QDA and knn base classifiers. For the LDA classifier, it appears,

Figure 5: Empirical distribution functions of the test error estimates { L̂^{A_{d,b_1}} }