Sparse principal component analysis via axis-aligned random projections

Summary. We introduce a new method for sparse principal component analysis, based on the aggregation of eigenvector information from carefully selected axis-aligned random projections of the sample covariance matrix. Unlike most alternative approaches, our algorithm is non-iterative, so it is not vulnerable to a bad choice of initialization. We provide theoretical guarantees under which our principal subspace estimator can attain the minimax optimal rate of convergence in polynomial time. In addition, our theory provides a more refined understanding of the statistical and computational trade-off in the problem of sparse principal component estimation, revealing a subtle interplay between the effective sample size and the number of random projections that are required to achieve the minimax optimal rate. Numerical studies provide further insight into the procedure and confirm its highly competitive finite sample performance.


Introduction
Principal component analysis (PCA) is one of the most widely used techniques for dimensionality reduction in statistics, image processing and many other fields. The aim is to project the data along directions that explain the greatest proportion of the variance in the population. In the simplest setting, where we seek a single, univariate projection of our data, we may estimate this optimal direction by computing the leading eigenvector of the sample covariance matrix.
Despite its successes and enormous popularity, it has been well known for a decade or more that PCA breaks down as soon as the dimensionality $p$ of the data is of the same order as the sample size $n$. More precisely, suppose that $X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} N_p(0, \Sigma)$, with $p \geq 2$, are observations from a Gaussian distribution with a spiked covariance matrix $\Sigma = I_p + v_1 v_1^T$ whose leading eigenvector is $v_1 \in S^{p-1} := \{v \in \mathbb{R}^p : \|v\|_2 = 1\}$, and let $\hat{v}_1$ denote the leading unit-length eigenvector of the sample covariance matrix $\hat{\Sigma} := n^{-1} \sum_{i=1}^n X_i X_i^T$. Then Johnstone and Lu (2009) and Paul (2007) showed that $\hat{v}_1$ is a consistent estimator of $v_1$, i.e. $|\hat{v}_1^T v_1| \to_p 1$, if and only if $p = p_n$ satisfies $p/n \to 0$ as $n \to \infty$. It is also worth noting that the principal component $v_1$ may be a linear combination of all elements of the canonical basis in $\mathbb{R}^p$, which can often make it difficult to interpret the estimated projected directions (Jolliffe et al., 2003).
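As a quick numerical illustration of this consistency phenomenon (ours, not from the paper), the following Python sketch estimates the average overlap $|\hat{v}_1^T v_1|$ under the spiked model above for several aspect ratios $p/n$; the specific sample size, ratios and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def spiked_overlap(n, p, reps=10):
    """Average |v1_hat^T v1| under Sigma = I_p + v1 v1^T with v1 = e1."""
    v1 = np.zeros(p); v1[0] = 1.0
    overlaps = []
    for _ in range(reps):
        # Each row is N_p(0, I + v1 v1^T): add an independent extra N(0,1)
        # contribution to the first coordinate only.
        X = rng.standard_normal((n, p)) + rng.standard_normal((n, 1)) * v1
        S = X.T @ X / n                      # sample covariance (mean zero by construction)
        w, V = np.linalg.eigh(S)             # eigenvalues in ascending order
        overlaps.append(abs(V[:, -1] @ v1))  # leading sample eigenvector vs truth
    return np.mean(overlaps)

n = 500
for ratio in [0.05, 0.5, 2.0]:
    print(f"p/n = {ratio}: mean |v1_hat^T v1| = {spiked_overlap(n, int(ratio * n)):.3f}")
```

The overlap is close to 1 for small $p/n$ and degrades markedly once $p$ is comparable with $n$, in line with the Johnstone and Lu (2009) and Paul (2007) results quoted above.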
To remedy this situation, and to provide additional interpretability to the principal components in high-dimensional settings, Jolliffe et al. (2003) and Zou et al. (2006) proposed sparse principal component analysis (SPCA). Here it is assumed that the leading population eigenvectors belong to the $k$-sparse unit ball
$$B_0^{p-1}(k) := \Bigl\{ v = (v^{(1)}, \ldots, v^{(p)})^T \in S^{p-1} : \textstyle\sum_{j=1}^p \mathbb{1}_{\{v^{(j)} \neq 0\}} \leq k \Bigr\}$$
for some $k \in \{1, \ldots, p\}$. In addition to the easier interpretability, a large amount of research effort has shown that such an assumption facilitates improved estimation performance (e.g. Johnstone and Lu (2009), Paul and Johnstone (2012), Vu and Lei (2013), Cai et al. (2013), Ma (2013) and Wang et al. (2016)). To give a flavour of these results, let $\mathcal{V}_n$ denote the set of all estimators of $v_1$, i.e. the class of Borel measurable functions from $\mathbb{R}^{n \times p}$ to $S^{p-1}$. Vu and Lei (2013) introduced a class $\mathcal{Q}$ of sub-Gaussian distributions whose first principal component $v_1$ belongs to $B_0^{p-1}(k)$ and showed that
$$\inf_{\tilde{v}_1 \in \mathcal{V}_n} \sup_{Q \in \mathcal{Q}} \mathbb{E}_Q\, L(\tilde{v}_1, v_1) \asymp \min\Bigl\{ \Bigl(\frac{k \log p}{n}\Bigr)^{1/2}, 1 \Bigr\}, \qquad (1)$$
where $a_n \asymp b_n$ means that $0 < \liminf_{n \to \infty} |a_n / b_n| \leq \limsup_{n \to \infty} |a_n / b_n| < \infty$. Thus, consistent estimation is possible in this framework provided only that $k = k_n$ and $p = p_n$ satisfy $k \log(p)/n \to 0$. Vu and Lei (2013) showed further that this estimation rate is achieved by the natural estimator
$$\hat{v}_1 \in \underset{v \in B_0^{p-1}(k)}{\arg\max}\ v^T \hat{\Sigma} v. \qquad (2)$$
However, results such as expression (1) do not complete the story of SPCA. Indeed, computing the estimator defined in expression (2) turns out to be an NP-hard problem (e.g. Tillmann and Pfetsch (2014)): the naive approach would require searching through all $\binom{p}{k}$ of the $k \times k$ symmetric submatrices of $\hat{\Sigma}$, which takes exponential time in $k$. Therefore, in parallel with the theoretical developments described above, numerous alternative algorithms for SPCA have been proposed in recent years. For instance, several references have introduced techniques based on solving the non-convex optimization problem (2) by invoking an $l_1$-penalty (e.g. Jolliffe et al. (2003), Zou et al. (2006), Shen and Huang (2008) and Witten et al. (2009)). Typically, these methods are fast but lack theoretical performance guarantees. In contrast, d'Aspremont et al. (2007) proposed to solve problem (2) via semidefinite relaxation. This approach and its variants were analysed by Amini and Wainwright (2009), Vu et al. (2013) and Wang et al. (2014, 2016), and have been proved to achieve the minimax rate of convergence under certain assumptions on the underlying distribution and asymptotic regime, but the algorithm is slow compared with other approaches. In a separate, recent development, it is now understood that, conditionally on a planted clique hypothesis from theoretical computer science, there is an asymptotic regime in which no randomized polynomial time algorithm can attain the minimax optimal rate (Wang et al., 2016). Various fast iterative algorithms were introduced by Johnstone and Lu (2009), Paul and Johnstone (2012) and Ma (2013); the last of these was shown to attain the minimax rate under a Gaussian spiked covariance model. We also mention the computationally efficient combinatorial approaches proposed by Moghaddam et al. (2006) and d'Aspremont et al. (2008), which aim to find solutions to the optimization problem (2) by using greedy methods.
A common feature of all the computationally efficient algorithms mentioned above is that they are iterative, in the sense that, starting from an initial guess $v^{[0]} \in \mathbb{R}^p$, they refine their guess by producing a finite sequence of iterates $v^{[1]}, \ldots, v^{[T]} \in \mathbb{R}^p$, with the estimator defined to be the final iterate. A major drawback of such iterative methods is that a bad initialization may yield a disastrous final estimate. To illustrate this point, we ran a simple simulation in which the underlying distribution is $N_{400}(0, \Sigma)$, with
$$\Sigma = 10 J_{10} \oplus (8.9 J_{390} + I_{390}) + 0.01 I_{400}, \qquad (3)$$
where $J_q := \mathbf{1}_q \mathbf{1}_q^T / q \in \mathbb{R}^{q \times q}$ denotes the matrix with each entry equal to $1/q$. In this example, $v_1 = (\mathbf{1}_{10}^T, \mathbf{0}_{390}^T)^T / \sqrt{10}$, so $k = 10$. Fig. 1 shows, for several SPCA algorithms, different sample sizes and different initialization methods, the average values of the loss function (4) over 100 repetitions of the experiment. In Figs 1(a) and 1(b), the initialization methods used were the default recommendations of the respective authors, namely diagonal thresholding (d'Aspremont et al., 2008; Ma, 2013) and classical PCA (Zou et al., 2006; Shen and Huang, 2008; Witten et al., 2009). We note that the consistency of diagonal thresholding relies on a spiked covariance structure, which is violated in this example. In Figs 1(c) and 1(d), we ran the same algorithms with 10 independent initializing vectors chosen uniformly at random on $S^{p-1}$, and we selected the solution $v$ from these 10 that maximizes $v \mapsto v^T \hat{\Sigma} v$. The main observation is that each of the previously proposed algorithms produces very poor estimates, with some almost orthogonal to the true principal component! The reason is that all the default initialization procedures fail to find a good starting point. For some methods, this problem may be fixed by increasing the number of random initializations, but it may take an enormous number of such random restarts (and consequently a very long time) to achieve this. We demonstrate this in Figs 1(e) and 1(f), where, for $n = 350$ (Fig. 1(e)) and $n = 2000$ (Fig. 1(f)), we plot the logarithm of the average loss as time increases through the number of random restarts. As an alternative method, in Figs 1(a)-1(d), we also present the corresponding results for the variant, due to Wang et al. (2016), of the semidefinite programming algorithm introduced by d'Aspremont et al. (2007). This method is guaranteed to converge from any initialization and so does not suffer the poor performance mentioned above. However, the semidefinite programming algorithm took even longer to reach algorithmic convergence than any of the alternative approaches: in the setting of Figs 1(e) and 1(f), it finally reached a logarithmic average loss of around $-4$ (Fig. 1(e)) and $-5.9$ (Fig. 1(f)) after an average time of $\exp(8) \approx 3000$ s (Fig. 1(e)) and $\exp(9.25) \approx 10\,000$ s (Fig. 1(f)); this slow running time means that it does not appear in Figs 1(e) and 1(f). We refer to Section 4.2 for further comparisons using different examples.
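For concreteness, the following sketch constructs covariance model (3) under our reading of the garbled display, namely a block-diagonal structure $10 J_{10} \oplus (8.9 J_{390} + I_{390})$ plus $0.01 I_{400}$; the block-diagonal interpretation is an assumption, though it is consistent with the stated leading eigenvector $v_1$ and with $k = 10$.

```python
import numpy as np

def J(q):
    """J_q = 1_q 1_q^T / q, the matrix with every entry equal to 1/q."""
    return np.full((q, q), 1.0 / q)

# Covariance model (3), as reconstructed here: a block-diagonal matrix with
# blocks 10*J_10 and 8.9*J_390 + I_390, plus a 0.01*I_400 ridge.
Sigma = np.zeros((400, 400))
Sigma[:10, :10] = 10 * J(10)
Sigma[10:, 10:] = 8.9 * J(390) + np.eye(390)
Sigma += 0.01 * np.eye(400)

# Under this reading, the leading eigenvalue is 10.01 (the next is 9.91), and
# the leading eigenvector is v1 = (1_10^T, 0_390^T)^T / sqrt(10), so k = 10.
v1 = np.zeros(400); v1[:10] = 1 / np.sqrt(10)
w, V = np.linalg.eigh(Sigma)
print(np.allclose(abs(V[:, -1] @ v1), 1.0))   # True: v1 is the top eigenvector
```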
In Section 2 of this paper, we propose a novel algorithm for SPCA that aggregates estimates over carefully chosen axis-aligned random projections of the data into a lower-dimensional space. In contrast with the other algorithms mentioned above, it is non-iterative and does not depend on a choice of initialization, so it has no difficulty with the simulation example above. Indeed, from the blue curve in Fig. 1, we see that it outperforms even the semidefinite programming algorithm, compared with which it was over 7000 times faster in the $n = 2000$ case.
Our algorithm, which we refer to as SPCAvRP, turns out to be attractive for both theoretical and computational reasons. From a theoretical point of view, it provides a new perspective on the statistical and computational trade-off involved in the SPCA problem. As we show in Section 3, when the effective sample size is large, the SPCAvRP procedure can attain the minimax optimal rate with a number of projections that grows only polynomially in the problem parameters. In contrast, if one were to use a number of random projections exponentially large in $k$, SPCAvRP could even achieve this minimax rate in a much smaller effective sample size regime. Although this exponentially large number of projections may seem discouraging, we emphasize that it is not a drawback of the SPCAvRP algorithm but simply a reflection of the fundamental difficulty of the problem in this effective sample size regime. Indeed, Wang et al. (2016) established a computational lower bound, which reveals that no randomized polynomial time algorithm can attain the minimax rate of convergence for these effective sample sizes. The elucidation of the transition from a polynomial to an exponentially large number of projections illustrates the fascinating statistical and computational trade-off in this problem. The computational attractions of the proposed algorithm include the fact that it is highly scalable because of easy parallelization, and that it does not even require computation of $\hat{\Sigma} \in \mathbb{R}^{p \times p}$, since it suffices to extract principal submatrices of $\hat{\Sigma}$, which can be done by computing the sample covariance matrices of the projected data. This may result in a significant computational saving if $p$ is very large. Several numerical aspects of the algorithm, including a finite sample comparison with alternative methods on both simulated and real data, are considered in Section 4. These reveal that our SPCAvRP algorithm has very competitive performance and, furthermore, enjoys robustness properties that iterative algorithms do not share. The proofs of all of our results are given in Appendix A.
Algorithms based on random projections have recently been shown to be highly effective for several different problems in high-dimensional statistical inference. For instance, in the context of high-dimensional classification, Cannings and Samworth (2017) showed that their random projection ensemble classifier, which aggregates over projections that yield small estimates of the test error, can result in excellent performance. Marzetta et al. (2011) employed an ensemble of random projections to construct an estimator of the population covariance matrix and its inverse in the setting where $n < p$. Fowler (2009) introduced a so-called compressive projection PCA that reconstructs the sample principal components from many low-dimensional projections of the data. Finally, to decrease the computational burden of classical PCA, Qi and Hughes (2012) and Pourkamali-Anaraki and Hughes (2014) proposed estimating $v_1(\Sigma)$ by the leading eigenvector of $n^{-1} \sum_{i=1}^n P_i X_i X_i^T P_i$, where $P_1, \ldots, P_n$ are random projections of a particular form.

Notation
We conclude this introduction with some notation used throughout the paper. For $r \in \mathbb{N}$, let $[r] := \{1, \ldots, r\}$. For a vector $u \in \mathbb{R}^p$, we write $u^{(j)}$ for its $j$th component and $\|u\|_2 := \{\sum_{j=1}^p (u^{(j)})^2\}^{1/2}$ for its Euclidean norm. For a real symmetric matrix $U \in \mathbb{R}^{p \times p}$, let $\lambda_1(U) \geq \lambda_2(U) \geq \cdots \geq \lambda_p(U)$ denote its eigenvalues, arranged in decreasing order, and let $v_1(U), \ldots, v_p(U)$ denote the corresponding eigenvectors. In addition, for $m \in [p]$, we write $V_m(U) := (v_1(U), \ldots, v_m(U))$ for the $p \times m$ matrix whose columns are the $m$ leading eigenvectors of $U$. In the special case where $U = \Sigma$, we drop the argument and write $\lambda_r = \lambda_r(\Sigma)$, $v_r = v_r(\Sigma)$ and $V_m = V_m(\Sigma)$. For a general $U \in \mathbb{R}^{p \times m}$, we define $U^{(j,j')}$ to be the $(j,j')$th entry of $U$, and $U^{(j,\cdot)}$ the $j$th row of $U$, regarded as a column vector. Given $S \subseteq [p]$ and $S' \subseteq [m]$, we write $U^{(S,S')}$ for the $|S| \times |S'|$ matrix obtained by extracting the rows of $U$ indexed by $S$ and the columns indexed by $S'$; we also write $U^{(S,\cdot)} := U^{(S,[m])}$. We write $\|U\|_{\mathrm{op}} := \sup_{x \in S^{m-1}} \|Ux\|_2$ and $\|U\|_F := (\sum_{j=1}^p \sum_{j'=1}^m |U^{(j,j')}|^2)^{1/2}$ for the operator and Frobenius norms of $U$ respectively. We denote the set of real orthogonal $p \times p$ matrices by $\mathbb{O}_p$ and the set of real $p \times m$ matrices with orthonormal columns by $\mathbb{O}_{p,m}$. For matrices $U, V \in \mathbb{O}_{p,m}$, we define the loss function
$$L(U, V) := \|\sin \Theta(U, V)\|_F,$$
where the sine function acts elementwise, and where $\Theta(U, V)$ is the $m \times m$ diagonal matrix whose $j$th diagonal entry is the $j$th principal angle between $U$ and $V$, i.e. $\cos^{-1}(\sigma_j)$, where $\sigma_j$ is the $j$th singular value of $U^T V$. Observe that this loss function reduces to expression (4) when $m = 1$.
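As a concrete rendering of this definition, here is a minimal Python sketch computing the sin-$\Theta$ loss from the singular values of $U^T V$; whether the paper additionally normalizes by $1/\sqrt{m}$ is not recoverable from the text here, so the unnormalized version below is an assumption.

```python
import numpy as np

def sin_theta_loss(U, V):
    """L(U, V) = ||sin Theta(U, V)||_F for U, V with orthonormal columns.

    The singular values sigma_j of U^T V are the cosines of the principal
    angles, so ||sin Theta||_F^2 = sum_j (1 - sigma_j^2).  If the paper's
    definition carries a 1/sqrt(m) normalization, rescale accordingly.
    """
    sigma = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.sqrt(np.sum(np.clip(1.0 - sigma**2, 0.0, None)))
```

For $m = 1$ this reduces to $\sqrt{1 - (u^T v)^2}$, matching expression (4).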
For any index set $J \subseteq [p]$ we write $P_J$ to denote the projection onto the span of $\{e_j : j \in J\}$, where $e_1, \ldots, e_p$ are the standard Euclidean basis vectors in $\mathbb{R}^p$, so that $P_J$ is a $p \times p$ diagonal matrix whose $j$th diagonal entry is $\mathbb{1}_{\{j \in J\}}$. Finally, for $a, b \in \mathbb{R}$, we write $a \lesssim b$ to mean that there is a universal constant $C > 0$ such that $a \leq Cb$.

Single principal component estimation
In this section, we describe our algorithm for estimating a single principal component $v_1$ in detail; more general estimation of multiple principal components $v_1, \ldots, v_m$ is treated in Section 2.2. Let $x_1, \ldots, x_n$ be data points in $\mathbb{R}^p$ and let $\hat{\Sigma} := n^{-1} \sum_{i=1}^n x_i x_i^T$. We think of $x_1, \ldots, x_n$ as independent realizations of a zero-mean random vector $X$, so a practitioner may choose to centre each variable so that $\sum_{i=1}^n x_i^{(j)} = 0$ for each $j \in [p]$. For $d \in [p]$, let $\mathcal{P}_d := \{P_S : S \subseteq [p], |S| = d\}$ denote the set of $d$-dimensional, axis-aligned projections. For fixed $A, B \in \mathbb{N}$, consider projections $\{P_{a,b} : a \in [A], b \in [B]\}$ independently and uniformly distributed on $\mathcal{P}_d$. We think of these projections as consisting of $A$ groups, each of cardinality $B$. For each $a \in [A]$, let $b^\star(a) := \operatorname{sargmax}_{b \in [B]} \lambda_1(P_{a,b} \hat{\Sigma} P_{a,b})$ denote the index of the selected projection within the $a$th group, where sargmax denotes the smallest element of the argmax in the lexicographic ordering. The idea is that the non-zero entries of $P_{a,b^\star(a)} \hat{\Sigma} P_{a,b^\star(a)}$ form a principal submatrix of $\hat{\Sigma}$ that should have a large leading eigenvalue, so the non-zero entries of the corresponding leading eigenvector $\hat{v}_{a,b^\star(a),1}$ of $P_{a,b^\star(a)} \hat{\Sigma} P_{a,b^\star(a)}$ should have some overlap with those of $v_1$. Observe that, if $d = k$ and $\{P_{a,b} : b \in [B]\}$ were to contain all $\binom{p}{k}$ such projections, then the leading eigenvector of $P_{a,b^\star(a)} \hat{\Sigma} P_{a,b^\star(a)}$ would yield the minimax optimal estimator in problem (2). Of course, it would typically be too computationally expensive to compute all such projections, so instead we consider only $B$ randomly chosen projections.
The remaining challenge is to aggregate over the selected projections. For this, for each co-ordinate $j \in [p]$, we compute an importance score $\hat{w}^{(j)}$, defined as an average over $a \in [A]$ of the quantities $\lambda_1(P_{a,b^\star(a)} \hat{\Sigma} P_{a,b^\star(a)}) \, |\hat{v}_{a,b^\star(a),1}^{(j)}|$. This means that we take account, not just of the frequency with which each co-ordinate is chosen, but also of their corresponding magnitudes in the selected eigenvector, as well as an estimate of the signal strength. Finally, we select the $l$ indices $\hat{S}$ corresponding to the largest values of $\hat{w}^{(1)}, \ldots, \hat{w}^{(p)}$ and output our estimate $\hat{v}_1$ as the leading eigenvector of $P_{\hat{S}} \hat{\Sigma} P_{\hat{S}}$. Pseudocode for our SPCAvRP algorithm is given in algorithm 1 in Table 1.
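The following Python sketch mirrors the verbal description of algorithm 1 above; the precise weighting in the importance score (the selected projection's leading eigenvalue times the absolute eigenvector loadings) is our reading of the elided display (5), so the sketch should be treated as an assumption rather than the paper's exact pseudocode.

```python
import numpy as np

def spca_vrp(X, d, l, A, B, rng=None):
    """Sketch of SPCAvRP (algorithm 1): single sparse principal component.

    X: n x p data matrix (assumed centred); d: projection dimension;
    l: output sparsity; A, B: number of projection groups / group size.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    w = np.zeros(p)                          # importance scores w_hat
    for _ in range(A):
        best = None                          # (lambda_1, eigvec, support) of selected projection
        for _ in range(B):
            S = rng.choice(p, size=d, replace=False)   # axis-aligned projection P_S
            C = (X[:, S].T @ X[:, S]) / n    # d x d principal submatrix of Sigma_hat
            evals, evecs = np.linalg.eigh(C)
            if best is None or evals[-1] > best[0]:    # first max kept (sargmax-like tie-break)
                best = (evals[-1], evecs[:, -1], S)
        lam, v, S = best                     # projection with largest leading eigenvalue wins
        w[S] += lam * np.abs(v) / A          # aggregate eigenvalue-weighted loadings
    S_hat = np.argsort(w)[-l:]               # top-l co-ordinates by importance
    C = (X[:, S_hat].T @ X[:, S_hat]) / n
    evals, evecs = np.linalg.eigh(C)
    v1 = np.zeros(p)
    v1[S_hat] = evecs[:, -1]                 # leading eigenvector of P_S Sigma_hat P_S
    return v1
```

Note that the sketch never forms the full $p \times p$ matrix $\hat{\Sigma}$: each projected covariance matrix is computed directly from the selected columns of the data, in line with the computational remarks below.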
Besides the intuitive selection of the most important co-ordinates, the use of axis-aligned projections facilitates faster computation than the use of general orthogonal projections. Indeed, the multiplication of $\hat{\Sigma} \in \mathbb{R}^{p \times p}$ by an axis-aligned projection $P \in \mathcal{P}_d$ from the left (or right) can be recast as the selection of the $d$ rows (or columns) of $\hat{\Sigma}$ corresponding to the indices of the non-zero diagonal entries of $P$. Thus, instead of the typical $O(p^2 d)$ matrix multiplication complexity, only $O(pd)$ operations are required. We also remark that, instead of storing $P$, it suffices to store its non-zero indices.
More generally, the computational complexity of algorithm 1 can be analysed as follows. Generating the $AB$ initial random projections takes $O(ABd)$ operations. Next, we need to compute the matrices $P_{a,b} \hat{\Sigma} P_{a,b}$: we can either compute $\hat{\Sigma}$ once, at a cost of $O(np^2)$ operations, and then extract the relevant $d \times d$ principal submatrices, or we can compute the sample covariance matrix of each projected data set directly, at a total cost of $O(ABnd^2)$ operations; if $ABd^2 < p^2$, then the second option is preferable.
The rest of algorithm 1 entails computing an eigendecomposition of each $d \times d$ matrix, and computing $\{b^\star(a) : a \in [A]\}$, $\hat{w}$, $\hat{S}$ and $\hat{v}_1$, which altogether amounts to $O(ABd^3 + Ap + l^3)$ operations. Thus, assuming that $n \gtrsim d$, the overall computational complexity of the SPCAvRP algorithm is $O(\min\{np^2 + ABd^3 + Ap + l^3,\ ABnd^2 + Ap + l^3\})$. We also note that, because of the use of random projections, the algorithm is highly parallelizable. In particular, both 'for' loops of algorithm 1 can be parallelized, and the selection of good projections can easily be carried out by using different (up to $A$) machines.
Finally, we note that the numbers $A$ and $B$ of projections, the dimension $d$ of those projections and the sparsity $l$ of the final estimator need to be provided as inputs to algorithm 1. The effect of these parameter choices on the theoretical guarantees of our SPCAvRP algorithm is elucidated in Section 3, whereas their practical selection is discussed in Section 4.1.

Multiple principal component estimation
The estimation of higher-order principal components is typically achieved via a deflation scheme. Having computed estimates $\hat{v}_1, \ldots, \hat{v}_{r-1}$ of the top $r - 1$ principal components, the aim of such a procedure is to estimate the $r$th principal component based on modified observations, which have had their correlation with these previously estimated components removed (e.g. Mackey (2009)). For any matrix $V \in \mathbb{R}^{p \times r}$ of full column rank, we define the projection onto the orthogonal complement of the column space of $V$ by $\mathrm{Proj}^\perp(V) := I_p - V(V^T V)^{-1} V^T$ if $V \neq 0$, and $\mathrm{Proj}^\perp(V) := I_p$ otherwise. Then, writing $\hat{V}_{r-1} := (\hat{v}_1, \ldots, \hat{v}_{r-1})$, one possibility to implement a deflation scheme is to set $\tilde{x}_i := \mathrm{Proj}^\perp(\hat{V}_{r-1}) x_i$ for $i \in [n]$. Note that, in sparse PCA, by contrast with classical PCA, the estimated principal components from such a deflation scheme are typically neither sparse nor orthogonal. In algorithm 2 in Table 2, we therefore propose a modified deflation scheme, which in combination with algorithm 1 can be used to compute arbitrarily many $m \in [p]$ principal components that are orthogonal (as well as sparse), as verified in lemma 1 below.
We remark that, in fact, our proposed deflation method can be used in conjunction with any SPCA algorithm.
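As a concrete illustration, here is a minimal sketch of the plain deflation scheme just described, reusing the spca_vrp sketch from Section 2.1; note that it implements the standard scheme based on $\mathrm{Proj}^\perp$, not the paper's modified algorithm 2 (whose full specification appears in Table 2), so orthogonality and sparsity of the output are not guaranteed.

```python
import numpy as np

def proj_orth_complement(V):
    """Proj_perp(V) = I - V (V^T V)^{-1} V^T (identity if V has no columns)."""
    if V.shape[1] == 0:
        return np.eye(V.shape[0])
    return np.eye(V.shape[0]) - V @ np.linalg.solve(V.T @ V, V.T)

def deflated_spca(X, m, d, l, A, B, rng=None):
    """Estimate m components by repeatedly deflating the data.

    This is the plain deflation scheme described in the text; the paper's
    modified scheme (algorithm 2) additionally enforces orthogonality and
    sparsity of the estimates, which this sketch does not reproduce.
    """
    n, p = X.shape
    V_hat = np.empty((p, 0))
    for _ in range(m):
        X_defl = X @ proj_orth_complement(V_hat)   # remove span of previous estimates
        v = spca_vrp(X_defl, d, l, A, B, rng)      # sketch from Section 2.1
        V_hat = np.column_stack([V_hat, v])
    return V_hat
```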
Although algorithm 2 can conveniently be used to compute sparse principal components up to order $m$, it requires algorithm 1 to be executed $m$ times. Instead, we can modify algorithm 1 to estimate directly the leading eigenspace of dimension $m$ (the subspace spanned by the columns of the matrix $V_m = (v_1, \ldots, v_m)$) at a computational cost that is not much higher than that of executing algorithm 1 only once. For this, we propose a generalization of the SPCAvRP algorithm for eigenspace estimation in algorithm 3 in Table 3. In this generalization, $A$ projections are selected from a total of $A \times B$ random projections, with the $a$th selected projection chosen to maximize the sum of the $m$ leading eigenvalues of $P_{a,b} \hat{\Sigma} P_{a,b}$ over $b \in [B]$. Each selected projection contributes an eigenvalue-weighted measure of the magnitudes of the $j$th entries of its $m$ leading eigenvectors to the importance score of the $j$th co-ordinate, and, analogously to the single-component-estimation case, we average these contributions over $a \in [A]$ to obtain a vector of final importance scores. Again, similarly to the case $m = 1$, we then retain the co-ordinates with the top $l$ importance scores to obtain a final projection and our $m$ estimated principal components. A notable difference, then, between algorithm 3 and the deflation scheme (algorithm 2) is that we now estimate the union of the supports of the leading $m$ eigenvectors of $\Sigma$ simultaneously rather than one at a time. A consequence is that algorithm 3 is particularly well suited to a sparsity setting known in the literature as 'row sparsity' (Vu and Lei, 2013), where the leading eigenvectors of interest may share common support, because it borrows strength in the estimation of this support from the simultaneous nature of the multiple-component estimation. However, algorithm 2 may have a slight advantage in cases where the leading eigenvectors have disjoint supports; see Section 4.2.2 for further discussion.
Observe that, for $m = 1$, both algorithm 2 and algorithm 3 reduce to algorithm 1. Furthermore, for any $m$, up to the step where $\hat{w}$ is computed, algorithm 3 has the same complexity as algorithm 1, with the total complexity of algorithm 3 amounting to $O(\min\{np^2 + ABd^3 + Amp + l^3,\ ABnd^2 + Amp + l^3\})$, provided that $n \gtrsim d$.

Theoretical guarantees
In this section, we focus on the general algorithm 3. We assume that $X_1, \ldots, X_n$ are independently sampled from a distribution $Q$ satisfying a restricted covariance concentration (RCC) condition, introduced in Wang et al. (2016). Recall that, for $K > 0$, we say that a zero-mean distribution $Q$ on $\mathbb{R}^p$ satisfies an RCC condition with parameter $K$, and write $Q \in \mathrm{RCC}_p(K)$, if, for all $\delta > 0$, $n \in \mathbb{N}$ and $r \in [p]$, we have
$$\mathbb{P}\Bigl[ \sup_{u \in B_0^{p-1}(r)} \bigl| u^T (\hat{\Sigma} - \Sigma) u \bigr| \geq K \max\Bigl\{ \Bigl(\frac{r \log(p/\delta)}{n}\Bigr)^{1/2}, \frac{r \log(p/\delta)}{n} \Bigr\} \Bigr] \leq \delta$$
(Wang et al. (2016), proposition 1).
As mentioned in Section 2.2, our theoretical justification of algorithm 3 does not require that the leading eigenvectors enjoy disjoint supports. Instead, we ask for $V_m$ to have not too many non-zero rows, and for these non-zero rows to have comparable Euclidean norms (i.e. to satisfy an incoherence condition). More precisely, writing $\mathrm{nnzr}(V)$ for the number of non-zero rows of a matrix $V$, for $\mu \geq 1$, we consider the setting where $V_m$ belongs to the set
$$\mathbb{O}_{p,m,k}(\mu) := \Bigl\{ V \in \mathbb{O}_{p,m} : \mathrm{nnzr}(V) \leq k,\ \max_{j \in [p]} \|V^{(j,\cdot)}\|_2 \leq \mu (m/k)^{1/2} \Bigr\}. \qquad (7)$$
The following theorem is our main result on the performance of our SPCAvRP algorithm.
Theorem 1. Suppose that $Q \in \mathrm{RCC}_p(K)$ has an associated covariance matrix of the spiked form $\Sigma = I_p + \sum_{r=1}^m \theta_r v_r v_r^T$, with $\theta_1 \geq \cdots \geq \theta_m > 0$ and $V_m \in \mathbb{O}_{p,m,k}(\mu)$, and suppose that the effective sample size condition (9) holds. Then, with probability at least $1 - 2p^{-3} - p \exp\{-A \theta_m^2 / (50 p^2 \mu^8 \theta_1^2)\}$, the output of algorithm 3 satisfies the bound (10). An immediate consequence of theorem 1 is that, provided that $A \gtrsim p^2 \mu^8 \theta_1^2 \theta_m^{-2} \log p$, our SPCAvRP algorithm achieves the bound (10) with high probability under the conditions of theorem 1. The salient observation here is that this choice of $A$, together with the algorithmic complexity analysis given in Section 2.2, ensures that algorithm 3 achieves the rate in bound (10) in polynomial time (provided that we regard $\mu$, $\theta_1$ and $\theta_m$ as constants).
The minimax lower bound in proposition 1 below complements theorem 1 by showing that this rate is minimax optimal, up to logarithmic factors, over all possible estimation procedures, provided that $l \asymp k$, that $m \log(p/k) \gtrsim \log p$ and that we regard $K$ and $\mu$ as constants (as well as other regularity conditions). It is important to note that this does not contradict the fundamental statistical and computational trade-off for this problem established in Wang et al. (2016), because condition (9) ensures that we are in the high effective sample size regime defined in that work. Assuming the planted clique hypothesis from theoretical computer science, this is the only setting in which any (randomized) polynomial time algorithm can be consistent.
The following proposition establishes a minimax lower bound for principal subspace estimation. It is similar to existing minimax lower bounds in the literature for SPCA under row sparsity, e.g. theorem 3.1 of Vu and Lei (2013). The main difference is that we show that imposing an incoherence condition on the eigenspace does not make the problem any easier from this minimax perspective. For any $V \in \mathbb{O}_{p,m}$ and $\theta > 0$, we write $P_{V,\theta} := N_p(0, I_p + \theta V V^T)$, and recall the definition of $\mathbb{O}_{p,m,k}(\mu)$ from expression (7).
An interesting aspect of theorem 1 is that the same conclusion holds for every $B \in \mathbb{N}$. It is attractive that we do not need to make any restrictions here; however, we would also expect the statistical performance of the algorithm to improve as $B$ increases. Indeed, this is what we observe empirically; see Fig. 2. We can also quantify the benefit of increasing $B$ theoretically in the special setting where all signal co-ordinates have homogeneous signal strength, i.e. $V_m \in \mathbb{O}_{p,m,k}(1)$. As illustrated by the following corollary (to theorem 1) and its proof, as $B$ increases, signal co-ordinates are selected with increasing probability by the best projection within each group of $B$ projections, and this significantly reduces the number of groups $A$ required for rate optimal estimation. Recall that the hypergeometric distribution $\mathrm{HyperGeom}(d, k, p)$ models the number of white balls obtained when drawing $d$ balls uniformly and without replacement from an urn containing $p$ balls, $k$ of which are white. We write $F_{\mathrm{HG}}(\cdot\,; d, k, p)$ for its distribution function.
Corollary 1. In addition to the conditions of theorem 1, assume that $\mu = 1$ and that $\theta_1 = \cdots = \theta_m$. Since, in this corollary, we use lemma 4 in Appendix A.5 instead of expression (16) in Appendix A.2 to control the inclusion probability of signal co-ordinates, the condition $d \geq k$ from theorem 1 is in fact no longer needed. We note that, for any fixed $t$, the function $F_{\mathrm{HG}}(t-1; d, k, p)$ is decreasing in $d \in [p]$. Thus, corollary 1 also illustrates a computational trade-off between the choices of $d$ and $B$. This trade-off is also demonstrated numerically in Fig. 6 in Section 4.1.2.
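As a quick sanity check of the monotonicity remark above (and of the trade-off between $d$ and $B$), the following snippet evaluates, via scipy, the probability that a uniformly random $d$-dimensional axis-aligned projection captures at least $t$ of the $k$ signal co-ordinates; the values of $p$, $k$ and $t$ are illustrative choices of ours.

```python
from scipy.stats import hypergeom

p, k = 100, 10   # ambient dimension and sparsity (illustrative values)
t = 5            # required number of captured signal co-ordinates
for d in [10, 20, 40, 80]:
    # The number of signal co-ordinates in a uniform d-subset of [p] is
    # HyperGeom(d, k, p), so P(at least t captured) = 1 - F_HG(t - 1; d, k, p).
    prob = 1 - hypergeom.cdf(t - 1, p, k, d)
    print(f"d = {d:2d}: P(>= {t} signal co-ordinates captured) = {prob:.4f}")
```

The capture probability increases with $d$, so a smaller $d$ must be compensated by more projection draws (larger $B$), which is exactly the trade-off explored numerically in Section 4.1.2.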
Finally, we remark that our algorithm enables us to understand the statistical and computational trade-off in SPCA in a more refined way. Recall that, in the limiting case $B = \infty$, the estimator produced by algorithm 3 (with $d = l = k$ and, for simplicity of discussion, $m = 1$) is equal to the estimator $\hat{v}_1$ given in problem (2), i.e. the leading $k$-sparse eigenvector of $\hat{\Sigma}$. In fact, this is already true with high probability for $B \gtrsim \binom{p}{k}$. Hence, for $B$ exponentially large, the SPCAvRP estimator is minimax rate optimal as long as $n \gtrsim m k \theta_m^{-2} \log p$, which corresponds to the intermediate effective sample size regime defined in Wang et al. (2016). For such a choice of $B$, however, algorithm 3 will not run in polynomial time, which is in agreement with the conclusion of Wang et al. (2016) that no randomized polynomial time algorithm can attain the minimax rate of convergence in this intermediate effective sample size regime. In contrast, as mentioned above, SPCAvRP is minimax rate optimal, using only a polynomial number of projections, in the high effective sample size regime, as discussed after theorem 1. Therefore, the flexibility in varying the number of projections in our algorithm enables us to analyse its performance in a continuum of scenarios, ranging from settings where consistent estimation is barely possible through to high effective sample size regimes where the estimation problem is much easier.

Numerical experiments
In this section we demonstrate the performance of our proposed method in different examples and discuss the practical choice of its input parameters.We also compare our method with several existing sparse principal component estimation algorithms on both simulated and experimental data.All experiments were carried out using the R package SPCAvRP (Gataric et al., 2018).

Choice of input parameters

Choice of A and B
In Fig. 2, we show that choosing $B > 1$, which ensures that we make a non-trivial selection within each group of projections, considerably improves the statistical performance of the SPCAvRP algorithm. Specifically, we see that, using the same total number of random projections, our two-stage procedure has superior performance to naive aggregation over all projections, which corresponds to setting $B = 1$ in the SPCAvRP algorithm. Interestingly, Fig. 2 shows that simply increasing the number of projections, without performing a selection step, does not noticeably improve the performance of the basic aggregation. We note that, even for the relatively small choices $A = 50$ and $B = 25$, the SPCAvRP algorithm does significantly better than naive aggregation over 180,000 projections. Fig. 3 demonstrates the effect of increasing either $A$ or $B$ while keeping the other fixed. We can see from Fig. 3(a) that increasing $A$ steadily improves the quality of estimation, especially in the medium effective sample size regime and when $A$ is relatively small. This agrees with theorem 1, where the bound on the probability of attaining the minimax optimal rate improves as $A$ increases. Thus, in practice, we should choose $A$ to be as large as possible, subject to our computational budget. The choice of $B$, however, is a little more delicate. In some settings, such as the single-spiked homogeneous model in Fig. 3(b), the performance appears to improve as $B$ increases, though the effect is only really noticeable in the intermediate effective sample size regime. In contrast, we can also construct examples where, as $B$ increases, some signal co-ordinates have increasingly high probability of inclusion compared with other signal co-ordinates, making the latter less easily distinguishable from the noise co-ordinates. Hence the performance does not necessarily improve as $B$ increases; see Fig. 4.
In general, we find that $A$ and $B$ should increase with $p$. On the basis of our numerical experiments, we suggest using $B = A/3$, with $A = 300$ when $p \approx 100$ and $A = 800$ when $p \approx 1000$.

Choice of d
So far in our simulations we have assumed that the true sparsity level $k$ is known, and we took the dimension $d$ of the random projections to be equal to $k$; in practice, however, $k$ may not be known in advance. In Fig. 5, we see that, for a wide range of values of $d$, the loss curves are relatively close to each other, indicating the robustness of the SPCAvRP algorithm to the choice of $d$. For the homogeneous signal case, the loss curves for different choices of $d$ merge in the high effective sample size regime, whereas, in the intermediate effective sample size regime, we may in fact see improved performance when $d$ exceeds $k$. In the inhomogeneous case, the loss curves improve as $d$ increases up to $k$ and then exhibit little dependence on $d$ when $d \geq k$.
Although decreasing $d$ reduces the computation time per projection, a smaller choice of $d$ makes it less likely that each signal co-ordinate will be selected in a given random projection. This means that a smaller $d$ requires larger numbers of projections $A$ and $B$ to achieve the desired accuracy, thereby increasing the overall computation time. To illustrate this computational trade-off, in Fig. 6, for a single-spiked homogeneous model, we plot the trajectories of the average loss as a function of time (characterized by the choices of $A$ and $B$) for various choices of $d$. Broadly speaking, the figures reveal that choosing $d < k$ needs to be compensated by very large choices of $A$ and $B$ to achieve statistical performance similar to that obtained with $d$ equal to, or even somewhat larger than, $k$.
In practice, we suggest using $d = k$ when $k$ is known; when $k$ is not given in advance, we would advocate erring on the side of projecting into a subspace of dimension slightly larger than the sparsity level of the true eigenvectors, as this permits significantly smaller choices of $A$ and $B$, resulting in an overall time saving.

Choice of l
The parameter $l$ corresponds to the sparsity of the computed estimator; large values of $l$ increase the chance that signal co-ordinates are discovered, but also increase the probability of including noise co-ordinates. This statistical trade-off is typical of any algorithm that aims to estimate the support of a sparse eigenvector. It is worth noting that many of the SPCA algorithms proposed in the literature have a tuning parameter corresponding to the level of sparsity, and cross-validation techniques for its choice have been proposed in earlier works (e.g. Witten et al. (2009)).
A particularly popular approach in the SPCA literature (e.g. Shen and Huang (2008)) is to choose $l$ by inspecting the explained variance. More precisely, for each $l$ on a grid of plausible values, we can compute an estimate $\hat{v}_{1,l} \in B_0^{p-1}(l)$ and its explained variance $\widehat{\mathrm{var}}_l := \hat{v}_{1,l}^T \hat{\Sigma} \hat{v}_{1,l}$, and then plot $\widehat{\mathrm{var}}_l$ against $l$. As can be seen from Fig. 7, $\widehat{\mathrm{var}}_l$ increases with $l$, but plateaus off for $l \geq k$. An attractive feature of our algorithm is that this procedure does not significantly increase the total computational time, since there is no need to rerun the entire algorithm for each value of $l$. Recall that $\hat{w}$ in expression (5) ranks the co-ordinates by their importance. Therefore, we need to compute $\hat{w}$ only once, and can then calculate $\widehat{\mathrm{var}}_l$ by selecting the top $l$ co-ordinates of $\hat{w}$ for each value of $l$. In cases where higher-order principal components need to be computed, namely when $m > 1$, we can choose $l = \mathrm{nnzr}(V_m)$ in algorithm 3, and $l_r = \|v_r\|_0$, $r \in [m]$, in algorithm 2, when these quantities are known. If this is not so, we can choose $l$ in algorithm 3 in a similar fashion to that described above, by replacing $\hat{v}_{1,l}$ with $\hat{V}_{m,l}$, where $\mathrm{nnzr}(\hat{V}_{m,l}) \leq l$, or we can choose $l_r$ by inspecting the explained variance at each iteration $r$ of algorithm 2.
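To make this diagnostic concrete, here is a minimal sketch, assuming the importance scores $\hat{w}$ have already been computed (e.g. by the algorithm 1 sketch in Section 2.1); for each candidate $l$ it evaluates $\widehat{\mathrm{var}}_l$, which equals the leading eigenvalue of the selected principal submatrix.

```python
import numpy as np

def explained_variance_path(X, w, l_grid):
    """For each candidate sparsity l, select the top-l co-ordinates of the
    importance scores w, and return the explained variance of the leading
    eigenvector on that support, i.e. v_{1,l}^T Sigma_hat v_{1,l}."""
    n, p = X.shape
    order = np.argsort(w)[::-1]              # co-ordinates ranked by importance
    var_path = []
    for l in l_grid:
        S = order[:l]
        C = (X[:, S].T @ X[:, S]) / n        # l x l principal submatrix of Sigma_hat
        evals = np.linalg.eigvalsh(C)
        var_path.append(evals[-1])           # leading eigenvalue = explained variance
    return np.array(var_path)                # plot against l_grid; look for a plateau
```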

Comparison with existing methods
In this subsection, we compare our method with several existing approaches for SPCA. We first present examples where only the first principal component is computed, followed by examples of higher-order principal component estimation and an illustration on some genetic data. Among the competing methods are those of Zou et al. (2006), Witten et al. (2009) and Ma (2013), which are used with their default parameters.

First principal component
In addition to the example presented in Fig. 1 in Section 1, we consider four further examples with data generated from an $N_p(0, \Sigma)$ distribution, where $\Sigma$ takes one of the two forms given in display (11), with various choices of $p \in \{100, 200, 1000, 2000\}$ and $k \in \{10, 30\}$. We compare the empirical performance of our algorithm with the methods proposed by Zou et al. (2006), Shen and Huang (2008), d'Aspremont et al. (2008), Witten et al. (2009) and Ma (2013), as well as the semidefinite programming method mentioned in Section 1, by computing the average loss for each algorithm over 100 repetitions on the same set of data. We note that these are all iterative methods whose success, with the exception of the semidefinite programming method, depends on good initialization, so we recall their default choices. The methods of Zou et al. (2006), Shen and Huang (2008) and Witten et al. (2009) use an eigendecomposition of the sample covariance matrix, i.e. classical PCA, to compute their initial point, whereas d'Aspremont et al. (2008) and Ma (2013) select their initialization according to the largest diagonal entries of $\hat{\Sigma}$.
In Fig. 8, we see that, although the average losses of all algorithms decay appropriately with the sample size $n$ in the double-spiked $\Sigma^{(1)}$ setting, most of them perform very poorly in the setting of $\Sigma^{(2)}$, where the spiked structure is absent. Indeed, only the SPCAvRP and semidefinite programming algorithms perform well in both settings.

Higher order components
In Table 4 and Fig. 9 we compare algorithms 2 and 3 with existing SPCA algorithms for subspace estimation, namely those proposed by Zou et al. (2006), Witten et al. (2009) and Ma (2013). For this, we simulate observations from a normal distribution with a covariance matrix that is two spiked and three spiked respectively. From Table 4 and Fig. 9, we observe that the SPCAvRP estimators computed by algorithms 2 and 3 perform well when compared with the alternative approaches. When the supports of the leading eigenvectors are disjoint, namely $S_r \cap S_q = \emptyset$ for $r \neq q$, $r, q \in [m]$, where $S_r := \{j \in [p] : v_r^{(j)} \neq 0\}$, we observe that the deflation scheme proposed in algorithm 2 may perform better than algorithm 3, since it estimates each support $S_r$ individually. In contrast, if the supports overlap, algorithm 3 may perform better than algorithm 2, since it directly estimates $\bigcup_{r=1}^m S_r$. From Table 4, we also see that only the SPCAvRP algorithms and the algorithm proposed by Ma (2013) compute components that are orthogonal in both cases $S_1 \cap S_2 = \emptyset$ and $S_1 \cap S_2 \neq \emptyset$.

Microarray data
We test our SPCAvRP algorithm on the Alon et al. (1999) gene expression data set, which contains 40 colon tumour and 22 normal observations. A preprocessed data set can be downloaded from the R package datamicroarray (Ramey, 2016), with a total of $p = 2000$ features and $n = 62$ observations. For comparison with alternative SPCA approaches, we use algorithms that accept the output sparsity $l$ as an input parameter, namely those proposed by Zou et al. (2006), d'Aspremont et al. (2008) and Shen and Huang (2008). For each $l$ considered, we computed the estimator $\hat{v}_{1,l}$ of the first principal component, and in Fig. 10 we plot the explained variance $\widehat{\mathrm{var}}_l := \hat{v}_{1,l}^T \hat{\Sigma} \hat{v}_{1,l}$ as well as two different metrics for the separability of the two classes of observations projected along the first principal component $\hat{v}_{1,l}$, namely the Wasserstein distance $W_l$ of order 1 and the p-value of Welch's t-test (Welch, 1947). Furthermore, in Fig. 11, we display the corresponding values for $l = 20$, together with boxplots of the observations projected along $\hat{v}_{1,20}$. From Figs 10 and 11, we observe that the SPCAvRP algorithm performs similarly to those proposed by d'Aspremont et al. (2008) and Shen and Huang (2008), all of which are superior in this instance to the SPCA algorithm of Zou et al. (2006). In particular, for small values of $l$, we observe a steep slope of the blue Wasserstein and p-value curves corresponding to the SPCAvRP algorithm in Fig. 10, indicating that the two classes are well separated by projecting the observations along the estimated principal component, which contains expression levels of only a few different genes.
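Both separability metrics are available in scipy; the following sketch (with variable names of our choosing) computes them for observations projected along an estimated component.

```python
import numpy as np
from scipy.stats import wasserstein_distance, ttest_ind

def separability(X, labels, v):
    """Order-1 Wasserstein distance and Welch t-test p-value between the two
    classes of observations projected along an estimated component v."""
    scores = X @ v                                    # projections of the n observations
    a, b = scores[labels == 0], scores[labels == 1]   # split by class label
    W = wasserstein_distance(a, b)                    # order-1 Wasserstein distance
    pval = ttest_ind(a, b, equal_var=False).pvalue    # Welch's t-test (unequal variances)
    return W, pval
```

A small Wasserstein distance and a large p-value indicate poor class separation along the estimated component, so a steep improvement in both metrics as $l$ grows mirrors the behaviour reported in Fig. 10.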

A.4. Proof of corollary 1
The proof of theorem 1 remains valid in the setting of corollary 1, with inequality (31) holding for $j, j' \in S_0$. Recall the definition of $q_j$ from the proof of theorem 1. By equation (31), for any $j \in S_0$, we have on $\Omega_{\mathrm{RCC}}$ that $q_j$ satisfies the lower bound (32), where the penultimate inequality in display (32) uses Markov's inequality and the fact that the pair $(M, R)$ is independent of $X$, and the final bound follows from lemma 4 below. Now, using expression (32) in place of expression (16), we find that $\mathbb{E}(\hat{w}_a^{(j)} - \hat{w}_a^{(j')} \mid X) \geq t m \theta_m / (8 k^2)$ instead of inequality (22). Thus, $\mathbb{P}(\Omega^c \mid X) \leq p \exp\{-A t^2 / (800 k^2)\}$. The desired result then follows in a similar fashion to the proof of theorem 1.

Fig. 1.
Comparison of various approaches using covariance model (3) (Zou et al. (2006); Shen and Huang (2008), $l_1$-thresholding; Shen and Huang (2008), $l_0$-thresholding; d'Aspremont et al. (2008); Witten et al. (2009); Ma (2013); semidefinite programming; SPCAvRP): (a)-(d), average loss (4) for different sample sizes n, on the normal scale in (a), (c) and on the log-log scale in (b), (d), with default initialization in (a), (b) and the best of 10 random initializations in (c), (d); (e), (f), average loss (4) against time in seconds on the log-log scale, with n = 350 in (e) and n = 2000 in (f) (we vary the number of random projections, A ∈ [50, 200] and B = ⌈A/2⌉, for SPCAvRP, and the number of random initializations, from 1 to 250, for the other iterative competing methods)

Fig. 3.
Average loss $L(\hat{v}_1, v_1)$ as the sample size n increases, for various choices of A or B (the distribution is $N_p(0, I_p + v_1 v_1^T)$ with $v_1 = k^{-1/2}(\mathbf{1}_k^T, \mathbf{0}_{p-k}^T)^T$, p = 50 and k = 7, and the other algorithmic parameters are d = l = 7): (a) B = 100 and A varied over {5, 10, 15, 20, 25, 30, 35, 40, 50, 100}; (b) A = 200 and B varied over {5, 10, 15, 25, 40, 75, 100, 150, 200, 300}

Table 1.
Algorithm 1: pseudocode for the SPCAvRP algorithm for a single principal component