Sparse partial least squares regression for simultaneous dimension reduction and variable selection

Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data.


Introduction
With the recent advancements in biotechnology such as the use of genomewide microarrays and high throughput sequencing, regression-based modelling of high dimensional data in biology has never been more important. Two important statistical problems commonly arise within regression problems that concern modern biological data. The first is the selection of a set of important variables among a large number of predictors. Utilizing the sparsity principle, e.g. operating under the assumption that a small subset of the variables is deriving the underlying process, with L 1 -penalty has been promoted as an effective solution (Tibshirani, 1996;Efron et al., 2004). The second problem is that such a variable selection exercise often arises as an ill-posed problem where (a) the sample size n is much smaller than the total number of variables (p) and (b) covariates are highly correlated.
Dimension reduction techniques such as principal components analysis (PCA) or partial least squares (PLS) have recently gained much attention for addressing these within the context of genomic data (Boulesteix and Strimmer, 2006).
Although dimension reduction via PCA or PLS is a principled way of dealing with ill-posed problems, it does not automatically lead to selection of relevant variables. Typically, all or a large portion of the variables contribute to final direction vectors which represent linear combinations of original predictors. Imposing sparsity in the midst of the dimension reduction step might lead to simultaneous dimension reduction and variable selection. Recently, Huang et al. (2004) proposed a penalized PLS method that thresholds the final PLS estimator. Although this imposes sparsity on the solution itself, it does not necessarily lead to sparse linear combinations of the original predictors. Our goal is to impose sparsity in the dimension reduction step of PLS so that sparsity can play a direct principled role.
The rest of the paper is organized as follows. We review general principles of the PLS methodology in Section 2. We show that PLS regression for either a univariate or multivariate response provides consistent estimators only under restricted conditions, and the consistency property does not extend to the very large p and small n paradigm. We formulate sparse partial least squares (SPLS) regression by relating it to sparse principal components analysis (SPCA) (Jolliffe et al., 2003;Zou et al., 2006) in Section 3 and provide an efficient algorithm for solving the SPLS regression formulation in Section 4. Methods for tuning the sparsity parameter and the number of components are also discussed in this section. Simulation studies and an application to transcription factor activity analysis by integrating microarray gene expression and chromatin immuno-precipitation-microarray chip (CHIP-chip) data are provided in Sections 5 and 6.

Partial least squares regression
2.1. Description of partial least squares regression PLS regression, which was introduced by Wold (1966), has been used as an alternative approach to ordinary least squares (OLS) regression in ill-conditioned linear regression models that arise in several disciplines such as chemistry, economics and medicine (de Jong, 1993). At the core of PLS regression is a dimension reduction technique that operates under the assumption of a basic latent decomposition of the response matrix (Y ∈ R n×q ) and predictor matrix (X ∈ R n×p / : Y = TQ T + F and X = TP T + E, where T ∈ R n×K is a matrix that produces K linear combinations (scores); P ∈ R p×K and Q ∈ R q×K are matrices of coefficients (loadings), and E ∈ R n×p and F ∈ R n×q are matrices of random errors.
To specify the latent component matrix T such that T = XW , PLS requires finding the columns of W = .w 1 , w 2 , . . . , w K / from successive optimization problems. The criterion to find the kth direction vector w k for univariate Y is formulated as w k = arg max w {corr 2 .Y , Xw/ var.Xw/} subject to w T w = 1, w T Σ XX w j = 0, .1/ for j = 1, . . . , k − 1, where Σ XX is the covariance of X. As evident from this formulation, PLS seeks direction vectors that not only relate X to Y but also capture the most variable directions in the X -space (Frank and Friedman, 1993). There are two main formulations for finding PLS direction vectors in the context of multivariate Y. These vectors were originally derived from an algorithm, known as NIPALS (Wold, 1966), without a specific optimization problem formulation. Subsequently, a statistically inspired modification of PLS, known as SIMPLS (de Jong, 1993), was proposed with an algorithm by directly extending the univariate PLS formulation. Later, ter Braak and de Jong (1998) identified the 'PLS2' formulation which the NIPALS algorithm actually solves. The PLS2 formulation is given by w k = arg max w .w T σ XY σ XY T w/ subject to w T .I p − W k−1 W + k−1 /w = 1 and w T Σ XX w j = 0, .2/ for j = 1, . . . , k − 1, where σ XY is the covariance of X and Y , I p denotes a p × p identity matrix and W + k−1 is the unique Moore-Penrose inverse of W k−1 = .w 1 , . . . , w k−1 /. The SIMPLS formulation is given by w k = arg max w .w T σ XY σ T XY w/ subject to w T w = 1 and w T Σ XX w j = 0, .3/ for j = 1, . . . , k − 1. Both formulations have the same objective function but different constraints and thus yield different sets of direction vectors. Their prediction performances depend on the nature of the data (de Jong, 1993;ter Braak andde Jong, 1998). de Jong (1993) showed that both formulations become equivalent and yield the same set of direction vectors for univariate Y.
In the actual fitting of the PLS regression, either the NIPALS or the SIMPLS algorithm is used for obtaining the PLS estimator. The NIPALS algorithm produces the direction vector d k+1 with respect to the deflated matrixX k+1 at the .k + 1/th step by solving In contrast, the SIMPLS algorithm produces the .k + 1/th direction vectorŵ k+1 directly with respect to the original matrix X by solving After estimating the latent components (T K = XŴ K ) by using K numbers of direction vectors, loadings Q are estimated via solving min Q . Y − T K Q T 2 /. This leads to the final estimator β PLS =Ŵ KQ T , whereQ is the solution of this least squares problem.

2.
2. An asymptotic property of partial least squares regression 2.2.1. Partial least squares regression for univariate Y Stoica and Soderstorom (1998) derived asymptotic formulae for the bias and variance of the PLS estimator for the univariate case. These formulae are valid if the 'signal-to-noise ratio' is high or if n is large and the predictors are uncorrelated with the residuals. Naik and Tsai (2000) proved consistency of the PLS estimator under normality assumptions on both Y and X in addition to consistency of S XY and S XX and the following condition 1. This condition, which is known as the Helland and Almoy (1994) condition, implies that an integer K exists such that exactly K of the eigenvectors of Σ XX have non-zero components along σ XY .
Condition 1. There are eigenvectors v j .j = 1, . . . , K/ of Σ XX corresponding to different eigenvalues, such that σ XY = Σ K j=1 α j v j and α 1 , . . . , α K are non-zero. We note that the consistency proof of Naik and Tsai (2000) requires p to be fixed. In many fields of modern genomic research, data sets contain a large number of variables with a much smaller number of observations (e.g. gene expression data sets where the variables are of the order of thousands and the sample size is of the order of tens). Therefore, we investigate the consistency of the PLS regression estimator under the very large p and small n paradigm and extend the result of Naik and Tsai (2000) for the case where p is allowed to grow with n at an appropriate rate. In this setting, we need additional assumptions on both X and Y to ensure the consistency of S XX and S XY , which is the conventional assumption for fixed p. Recently, Johnstone and Lu (2004) proved that the leading PC of S XX is consistent if and only if p=n → 0. Hence, we adopt their assumptions for X to ensure consistency of S XX and S XY . Assumptions for X from Johnstone and Lu (2004) are as follows.
We remark that the above factor model for X is similar to that of Helland (1990) except for having an additional random error term e i . All properties of PLS in Helland (1990) will hold, as the eigenvectors of Σ XX and Σ XX − σ 2 1 I p are the same. We take the assumptions for Y from Helland (1990) with an additional norm condition on β.
We next show that, under the above assumptions and condition 1, the PLS estimator is consistent if and only if p grows much slower than n. Theorem 1. Under assumptions 1 and 2, and condition 1, (a) if p=n → 0, then β PLS − β 2 → 0 in probability and (b) if p=n → k 0 for k 0 > 0, then β PLS − β 2 > 0 in probability.
The main implication of this theorem is that the PLS estimator is not suitable for very large p and small n problems in complete generality. Although PLS utilizes a dimension reduction technique by using a few latent factors, it cannot avoid the sample size issue since a reasonable size of n is required to estimate sample covariances consistently as shown in the proof of theorem 1 in Appendix A. A referee pointed out that a qualitatively equivalent result has been obtained by Nadler and Coifman (2005), where the root-mean-squared error of the PLS estimator has an additional error term that depends on p 2 =n 2 .

Partial least squares regression for multivariate Y
There are limited or virtually no results on the theoretical properties of PLS regression within the context of a multivariate response. Counterintuitive simulation results, where multivariate PLS shows a minor improvement in prediction error, were reported in Frank and Friedman (1993). Later, Helland (2000) argued by intuition that, since multivariate PLS achieves parsimonious models by using the same reduced model space for all the responses, the net gain of sharing the model space could be negative if, in fact, all the responses require different reduced model spaces. Thus, we next introduce a specific setting for multivariate PLS regression in the light of Helland's (2000) intuition and extend the consistency result of univariate PLS to the multivariate case.
Assume that all the response variables have linear relationships with the same set of covariates: Y 1 = Xb 1 + f 1 , Y 2 = Xb 2 + f 2 , . . . , Y q = Xb q + f q , where b 1 , . . . , b q are p × 1 coefficient vectors and f 1 , . . . , f q are independent error vectors from N .0, σ 2 I n /. Since the shared reduced model space of each response is determined by b i s, we impose a restriction on these coefficients. Namely, we require the existence of eigenvectors v 1 , . . . , v K of Σ XX that span the solution space, which each b i belongs to.
We have proved consistency of the PLS estimator for a univariate response using the facts that S XY is proportional to the first direction vector and the solution space, whichβ PLS belongs to, can be explicitly characterized by {S XY , . . . , S K−1 XX S XY }. However, for a multivariate response, PLS finds the first direction vector as the first left singular vector of S XY . The presence of remaining directions in the column space of S XY makes it difficult to characterize the solution space explicitly. Furthermore, the solution space varies depending on the algorithm that is used to fit the model. If we further assume that b i = k i b 1 for constants k 2 , . . . , k q then Σ XY becomes a rank 1 matrix and these challenges are reduced, thereby leading to a setting where we can start to understand characteristics of multivariate PLS.
Condition 2 and assumption 3 below recapitulate these assumptions where the set of regression coefficients b 1 , b 2 , . . . , b q are represented by the coefficient matrix B.
Condition 2. There are eigenvectors v j .j = 1, . . . , K/ of Σ XX corresponding to different eigenvalues, such that σ XY i = Σ K j=1 α ij v j and α i1 , . . . , α iK are non-zero for i = 1, . . . , q. Assumption 3. Assume that Y = XB + F , where columns of F are independent and from N .0, σ 2 I n /. B is a rank 1 matrix with singular value decomposition ϑuv T , where ϑ denotes the singular value and u and v are left and right singular vectors respectively. In addition, ϑ < ∞ and q is fixed.
Lemma 1 proves the convergence of the first direction vector which plays a key role in forming the solution space of the PLS estimator. The proof is provided in Appendix A.
Lemma 1. Under assumption 3, whereŵ 1 is the estimate of the first direction vector w 1 and is given by Σ XX u= Σ XX u 2 .
The main implication of lemma 1 is that, under the given conditions, the convergence rate of the first direction vector from multivariate PLS is the same as that of a single univariate PLS. Since the application of univariate PLS for a multivariate response requires estimating q numbers of separate direction vectors, the advantage of multivariate PLS is immediate. The proof of lemma 1 relies on obtaining the left singular vector s by the rank 1 approximation of S XY , minimizing S XY − ςst T 1 F . Here, · F denotes Frobenius norm, ς is the non-zero singular value of S XY and s and t 1 are left and right singular vectors respectively. As a result, s can be represented by where t 1i is the ith element of t 1 , and sgn.t 1i / = sgn.s T S XY i /. This form of s provides intuition for estimating the first multivariate PLS direction vector. Namely, the first direction vector can be interpreted as the weighted sum of sign-adjusted covariance vectors. Directions with stronger signals contribute more in a sign-adjusted manner.
The above discussion highlighted the advantage of multivariate PLS compared with univariate PLS in terms of estimation of the direction vectors. Next, we present the convergence result of the final PLS solution.
Theorem 2. Under assumptions 1 and 3, condition 2 and for fixed K and q, B PLS − B 2 → 0 in probability if and only if p=n → 0.
Theorem 2 implies that, under the given conditions and for fixed K and q, the PLS estimator is consistent regardless of the algorithmic variant that is used if p=n → 0. Although PLS solutions from algorithmic variants might differ for finite n, these solutions are consistent. Moreover, the fixed q case is practical in most applications because we can always cluster Y s into smaller groups before linking them to X. We refer to Chun and Keleş (2009) for an application of this idea within the context of expression quantitative loci mapping.
Our results for multivariate Y are based on the equal variance assumption on the components of the error matrix F. Even though the popular objective functions of multivariate PLS given in expressions (2) and (3) do not involve a scaling factor for each component of multivariate Y , in practice, Y s are often scaled before the analysis. Violation of the equal variance assumption will affect the performance of PLS regression (Helland, 2000). Therefore, if there are reasons to believe that the error levels in Y , not the signal strengths, are different, scaling will aid in satisfying the equal variance assumption of our theoretical result.

Motivation for the sparsity principle in partial least squares regression
To motivate the sparsity principle, we now explicitly illustrate how a large number of irrelevant variables affect the PLS estimator through a simple example. This observation is central to our methodological development. We utilize the closed form solution of Helland (1990) where X 1 and X 2 denote p 1 relevant and p − p 1 irrelevant variables respectively and each column of X 2 follows N .0, I n /. We assume the existence of a latent variable (K = 1) as well as a fixed number of relevant variables (p 1 ) and let p grow at the rate O.k n/, where the constant k is sufficiently large to have where σ 1 and σ 2 are from Section 2.2.1. It is not difficult to obtain a sufficiently large k to satisfy condition (4) for fixed p 1 . Then, the PLS estimator can be approximated bŷ . 6/ Approximation (5) follows from lemma 2 in Appendix A and assumption (4). Approximation (6) is due to the fact that the largest and smallest eigenvalues of the Wishart matrix are O.k / (Geman, 1980). In this example, the large number of noise variables forces the loadings in the direction of S XY to be attenuated and thereby cause inconsistency. From a practical point of view, since latent factors of PLS have contributions from all the variables, the interpretation becomes difficult in the presence of large numbers of noise variables. Motivated by the observation that noise variables enter the PLS regression via direction vectors and attenuate estimates of the regression parameters, we consider imposing sparsity on the direction vectors.

Finding the first sparse partial least squares direction vector
We start with formulation of the first SPLS direction vector and illustrate the main ideas within this simpler problem. We formulate the objective function for the first SPLS direction vector by adding an L 1 -constraint to problems (2) and (3): where M = X T YY T X and λ determines the amount of sparsity. The same approach has been used in SPCA. By specifying M to be X T X in expression (7), this objective function coincides with that of a simplified component lasso technique called 'SCOTLASS' (Jolliffe et al., 2003) and both SPLS and SPCA correspond to the same class of maximum eigenvalue problem with a sparsity constraint. Jolliffe et al. (2003) pointed out that the solution of this formulation tends not to be sufficiently sparse and the problem is not convex. This convexity issue was revisited by d' Aspremont et al. (2007) in direct SPCA by reformulating the criterion in terms of W = ww T , thereby producing a semidefinite programming problem that is known to be convex. However, the sparsity issue remained.
To obtain a sufficiently sparse solution, we reformulate the SPLS criterion (7) by generalizing the regression formulation of SPCA (Zou et al., 2006). This formulation promotes the exact zero property by imposing an L 1 -penalty onto a surrogate of the direction vector (c) instead of the original direction vector (w), while keeping w and c close to each other: In this formulation, the L 1 -penalty encourages sparsity on c whereas the L 2 -penalty addresses the potential singularity in M when solving for c. We shall rescale c to have norm 1 and use this scaled version as the estimated direction vector. We note that this problem becomes that of SCOTLASS when w = c and M = X T X, SPCA when κ = 1 2 and M = X T X, and the original maximum eigenvalue problem of PLS when κ = 1. We aim to reduce the effect of the concave part (hence the local solution issue) by using a small κ.

Solution for the generalized regression formulation of sparse partial least squares
We solve the generalized regression formulation of SPLS given in expression (8) by alternatively iterating between solving for w for fixed c and solving for c after fixing w.
When solving for c for fixed w, problem (8) becomes . 10/ This problem, which is equivalent to the naive elastic net (EN) problem of Zou and Hastie (2005) when Y in the naive EN is replaced with Z T w, can be solved efficiently via the least angle regression spline algorithm LARS (Efron et al., 2004). SPLS often requires a large λ 2 -value to solve problem (10) because Z is a q × p matrix with usually small q, i.e. q = 1 for univariate Y.
As a remedy, we use an EN formulation with λ 2 = ∞ and this yields the solution to have the form of a soft thresholded estimator (Zou and Hastie, 2005). This concludes our solution of the regression formulation for general Y (univariate or multivariate). We further have the following simplification for univariate Y (q = 1).
Theorem 3. For univariate Y , the solution of problem (8) Proof. For a given c and κ = 0:5, it follows thatŵ =Z since the singular value decomposition of ZZ T c yields U =Z and V = 1. For a given c and 0 < κ < 0:5, the solution is given by w = {Z T c=. Z 2 + λ Å /}Z by using the Woodbury formula (Golub and van Loan, 1987). Noting that Z T c=. Z 2 + λ Å / is a scalar and by the norm constraint, we haveŵ =Z. Sinceŵ does not depend on c, we haveĉ = .|Z| − λ 1 =2/ + sgn.Z/ for large λ 2 .

Sparse partial least squares algorithm
In this section, we present the complete SPLS algorithm which encompasses the formulation of the first SPLS direction vector from Section 3.1 as well as an efficient algorithm for obtaining all the other direction vectors and coefficient estimates.
In principle, the objective function for the first SPLS direction vector can be utilized at each step of the NIPALS or SIMPLS algorithm to obtain the rest of the direction vectors. We call this idea the naive SPLS algorithm. However, this naive SPLS algorithm loses the conjugacy of the direction vectors. A similar issue appears in SPCA, where none of the methods proposed (Jolliffe et al., 2003;Zou et al., 2006;d'Aspremont et al., 2007) produces orthogonal sparse principal components. Although conjugacy can be obtained by the Gram-Schmidt conjugation of the derived sparse direction vectors, these post-conjugated vectors do not inherit the property of Krylov subsequences which is known to be crucial for the convergence of the algorithm (Krämer, 2007). Essentially, such a post-orthogonalization does not guarantee the existence of the solution among the iterations.
To address this concern, we propose an SPLS algorithm which leads to a sparse solution by keeping the Krylov subsequence structure of the direction vectors in a restricted X -space of selected variables. Specifically, at each step of either the NIPALS or the SIMPLS algorithm, it searches for relevant variables, the so-called active variables, by optimizing expression (8) and updates all direction vectors to form a Krylov subsequence on the subspace of the active variables. This is simply achieved by conducting PLS regression by using the selected variables. Let A be an index set for active variables and K the number of components. Denote X A as the submatrix of X whose column indices are contained in A. The SPLS algorithm can utilize either the NIPALS or the SIMPLS algorithm as described below.
Step 2: while k K, (a) findŵ by solving the objective (8) in Section 3.1 with M = X T Y 1 Y T 1 X for the NIPALS and M = X T 1 YY T X 1 for the SIMPLS algorithm, c) fit PLS with X A by using k number of latent components and (d) updateβ PLS by using the new PLS estimates of the direction vectors, The original NIPALS algorithm includes deflation steps for both X -and Y -matrices, but the same M-matrix can be computed via the deflation of either X or Y owing to the idempotency of the projection matrix. In our SPLS-NIPALS algorithm, we chose to deflate the Y -matrix because, in that case, the eigenvector X T Y 1 = X T Y 1 of M is proportional to the current correlations in the LARS algorithm for univariate Y. Hence, the LARS and SPLS-NIPALS algorithms use the same criterion to select active variables in this case. However, the SPLS-NIPALS algorithm differs from LARS in that it selects more than one variable at a time and utilizes the conjugate gradient (CG) method to compute the coefficients at each step (Friedman and Popescu, 2004). This, in particular, implies that the SPLS-NIPALS algorithm can select a group of correlated variables simultaneously. The cost of computing coefficients at each step of the SPLS algorithm is less than or equal to that of LARS as the CG method avoids matrix inversion.
The SPLS-SIMPLS algorithm has similar attributes to the SPLS-NIPALS algorithm. It also uses the CG method and selects more than one variable at each step and handles multivariate responses. However, the M-matrix is no longer proportional to the current correlations of the LARS algorithm. SIMPLS yields direction vectors directly satisfying the conjugacy constraint, which may hamper the ability of revealing relevant variables. In contrast, the direction vectors at each step of the NIPALS algorithm are derived to maximize the current correlations on the basis of residual matrices, and conjugated direction vectors are computed at the final stage. Thus, the SPLS-NIPALS algorithm is more likely to choose the correct set of relevant variables when the signals of the relevant variables are weak. A small simulation study investigating this point is presented in Section 5.1.

Choosing the thresholding parameter and the number of hidden components
Although the SPLS regression formulation in expression (8) has four tuning parameters (κ, λ 1 , λ 2 and K ), only two of these are key tuning parameters, namely the thresholding parameter λ 1 and the number of hidden components K. As we discussed in theorem 3 of Section 3.2, the solution does not depend on κ for univariate Y. For multivariate Y , we show with a simulation study in Section 5.2 that setting κ smaller than 1 2 generally avoids local solution issues. Different κ-values have the effect of starting the algorithm with different starting values. Since the algorithm is computationally inexpensive (the average run time including the tuning is only 9 min for a sample size of n = 100 with p = 5000 predictors on a 64-bit machine with 2.66 GHz central processor unit), users are encouraged to try several κ-values. Finally, as described in Section 3.2, setting the λ 2 -parameter to ∞ yields the thresholded estimator which depends only on λ 1 . Therefore, we proceed with the tuning mechanisms for the two key parameters λ 1 and K. We start with univariate Y since imposing an L 1 -penalty has the simple form of thresholding, and then we discuss multivariate Y.
We start with describing a form of soft thresholded direction vectorw :w = .|ŵ| − η max 1 i p |ŵ i |/ I.|ŵ| η max 1 i p |ŵ i |/ sgn.ŵ/, where 0 η 1. Here, η plays the role of the sparsity parameter λ 1 in theorem 3. This form of soft thresholding retains components that are greater than some fraction of the maximum component. A similar approach was utilized in Friedman and Popescu (2004) with hard thresholding as opposed to our soft thresholding scheme. The single tuning parameter η is tuned by cross-validation (CV) for all the direction vectors. We do not use separate sparsity parameters for individual directions because tuning multiple parameters is computationally prohibitive and may not produce a unique minimum for the CV criterion. Next, we describe a hard thresholding approach by the control of the false discovery rate FDR. SPLS selects variables which exhibit high correlations with Y in the first step and adds additional variables with high partial correlations in the subsequent steps. Although we are imposing sparsity on direction vectors via an L 1 -penalty, the thresholded form of our solution for univariate Y allows us to compare and contrast our approach directly with the supervised PC approach of Bair et al. (2006) that operates by an initial screening of the predictor variables. Selecting related variables on the basis of correlations has been utilized in supervised PCs, and, in a way, we further extend this approach by utilizing partial correlations in the later steps. Owing to uniform consistency of correlations (or partial correlations after taking into account the effect of relevant variables), FDR control is expected to work well even in the large p and small n scenario (Kosorok and Ma, 2007). As we described in Section 4, the components of the direction vectors for univariate Y have the form of a correlation coefficient (or a partial correlation coefficient after the first step) between the individual covariate and response, and a thresholding parameter can be determined by control of the FDR at a prespecified level α. Letr k YX i ·T k−1 1 denote the sample partial correlation of the ith variable X i with Y given T k−1 1 , where T k−1 1 denotes the set of first k − 1 latent variables included in the model. Under the normality assumption on X and Y , and the null hypothesis H 0i : r k YX i ·T k−1 1 = 0, the z-transformed (partial) correlation coefficients have the distribution (Bendel and Afifi, 1976) We compute the corresponding p-valuesp i , for i = 1, . . . , p, for the (partial) correlation coefficients by using this statistic and arrange them in ascending order:p [1] . . . p [p] . After definingm = max{m :p [m] .m=p/α}, the hard thresholded direction vector becomesw =ŵ I.|ŵ| > |ŵ| [p−m+1] / based on the Benjamini and Hochberg (1995) FDR procedure. We remark that the solution from FDR control is minimax optimal if α ∈ [0, 1 2 ] and α > γ= log.p/ .γ > 0/ under independence among tests. As long as α decreases with an appropriate rate as p increases, thresholding by FDR control is optimal without knowing the level of sparsity and, hence, reduces computation considerably. Although we do not have this independence, this adaptivity may work since the argument for minimax optimality mainly depends on marginal properties (Abramovich et al., 2006).
As discussed in Section 3.2, for multivariate Y , the solution for SPLS is obtained through iterations and the resulting solution has a form of soft thresholding. Although hard thresholding with FDR control is no longer applicable, we can still employ soft thresholding based on CV. The number of hidden components, K , is tuned by CV as in the original PLS. We note that CV will be a function of two arguments for soft thresholding and that of one argument for hard thresholding and thereby making hard thresholding computationally much cheaper than soft thresholding.

Comparison between SPLS-NIPALS and SPLS-SIMPLS algorithms
We conducted a small simulation study to compare variable selection performances of the two SPLS variants, SPLS-NIPALS and SPLS-SIMPLS. The data-generating mechanism is set as follows. Columns of X are generated by X i = H j + " i for n j−1 + 1 i n j , where j = 1, . . . , 3 and .n 0 , n 1 , n 2 , n 3 / = .0, 6, 13, 30/. Here, H 1 , H 2 and H 3 are independent random vectors from N .0, 25I 100 / and the " i s are from N .0, I 100 /. Columns of Y are generated by Y 1 = 0:1H 1 − 2H 2 + f 1 , and Y i+1 = 1:2Y i + f i , where the f i s are from N .0, I 100 /, i = 1, . . . , q = 10. We generated 100 simulated data sets and analysed them using both the SPLS-NIPALS and the SPLS-SIMPLS algorithms. Table 1 reports the first quartile, median, and the third quartile of the numbers of correctly and incorrectly selected variables. We observe that the SPLS-NIPALS algorithm performs better in identifying larger numbers of correct variables with a smaller number of false positive results compared with the SPLS-SIMPLS algorithm. Further investigation reveals that the relevant variables that the SPLS-SIMPLS algorithm misses are typically from the H 1 -component with weaker signal.

Setting the weight factor κ in the general regression formulation of problem (8)
We ran a small simulation study to examine how the generalization of the regression formulation given in expression (8) helps to avoid the local solution issue. The data-generating mechanism is set as follows. Columns of X are generated by X i = H j + " i for n j−1 + 1 i n j , where j = 1, . . . , 4 and .n 0 , . . . , n 4 / = .0, 4, 8, 10, 100/. Here, H 1 is a random vector from N .0, 290I 1000 /, H 2 is a random vector from N .0, 300I 1000 /, H 3 = −0:3H 1 + 0:925H 2 and H 4 = 0. The " i s are independent identically distributed random vectors from N .0, I 1000 /. For illustration, we use M = X T X. When κ = 0:5, the algorithm becomes stuck at a local solution in 27 out of 100 simulation runs. When κ = 0:1, 0:3, 0:4, the correct solution is obtained in all runs. This indicates that a slight imbalance giving less weight to the concave objective function of formulation (8) might lead to a numerically easier optimization problem.

Comparisons with recent variable selection methods in terms of prediction power and variable selection
In this section, we compare SPLS regression with other popular methods in terms of prediction and variable selection performances in various correlated covariates settings. We include OLS and the lasso, which are not particularly tailored for correlated variables. We also consider dimension reduction methods such as PLS, principal component regression (PCR) and supervised PCs, which ought to be appropriate for highly correlated variables. The EN is also included in these comparisons since it can handle highly correlated variables.
We first consider the case where there is a reasonable number of observations (i.e. n > p) and set n = 400 and p = 40. We vary the number of spurious variables as q = 10 and q = 30, and the noise-to-signal ratios as 0.1 and 0.2. Hidden variables H 1 , . . . , H 3 are from N .0, 25I n /, and the columns of the covariate matrix X are generated by X i = H j + " i for n j−1 + 1 i n j , where j = 1, . . . , 3, .n 0 , . . . , n 3 / = .0, .p − q/=2, p − q, p/ and " 1 , . . . , " p are drawn independently from N .0, I n /. Y is generated by 3H 1 − 4H 2 + f , where f is normally distributed with mean 0. This mechanism generates covariates, subsets of which are highly correlated.
We, then, consider the case where the sample size is smaller than the number of the variables (i.e. n < p) and set n = 40 and p = 80. The numbers of spurious variables are set to q = 20 and q = 40, and noise-to-signal ratios to 0.1 and 0.2 respectively. X and Y are generated similarly to the above n > p case.
We select the optimal tuning parameters for most of the methods by using tenfold CV. Since the CV curve tends to be flat in this simulation study, we first identify parameters of which CV scores are less than 1.1 times the minimum of the CV scores. We select the smallest K and the largest η among the selected parameters for SPLS, the largest λ 2 and the smallest step size for the EN and the smallest step size for the lasso. We use the F -statistic (the default CV score in the R package superpc) from the fitted model as a CV score for supervised PC. Then, we use the same procedure to generate an independent test data set and predict Y on this test data set on the basis of the fitted models. For each parameter setting, we perform 30 runs of simulations and compute the mean and standard deviation of the mean-squared prediction errors. The averages of the sensitivities and specificities are computed across the simulations to compare the accuracy of variable selection. The results are presented in Tables 2 and 3. Although not so surprising, the methods with an intrinsic variable selection property show smaller prediction errors compared with the methods lacking this property. For n > p, the lasso, SPLS, supervised PCs and the EN show similar prediction performances in all four scenarios. This  holds for the n < p case, except that supervised PC shows a slight increase in prediction error for dense models (p = 80 and q = 20). For the model selection accuracy, SPLS, supervised PCs and the EN show excellent performances, whereas the lasso exhibits poor performance by missing relevant variables. SPLS performs better than other methods for n < p and high noiseto-signal ratio scenarios. We observe that the EN misses relevant variables in the n < p scenario, even though its L 2 -penalty aims to handle these cases specifically. Moreover, the EN performs well for the right size of the regularization parameter λ 2 , but finding the optimal size objectively through CV seems to be a challenging task. In general, both SPLS-CV and SPLS-FDR perform at least as well as other methods (Table 3). Especially, when n < p, the lasso fails to identify important variables, whereas SPLS regression succeeds. This is because, although the number of SPLS latent components is limited by n, the actual number of variables that makes up the latent components can exceed n.

Comparisons of predictive power among methods that handle multicollinearity
In this section, we compare SPLS regression with some of the popular methods that handle multicollinearity such as PLS, PCR, ridge regression, a mixed variance-covariance approach, gene shaving (Hastie et al., 2000) and supervised PCs (Bair et al., 2006). These comparisons are motivated by those presented in Bair et al. (2006). We compare only prediction performances since all methods except for gene shaving and supervised PCs are not equipped with variable selection. For the dimension reduction methods, we allow only one latent component for a fair comparison.
Throughout these simulations, we set p = 5000 and n = 100. All the scenarios follow the general model of Y = Xβ + f , but the underlying data generation for X is varying. We devise simulation scenarios where the multicollinearity is due to the presence of one main latent variable (simulations 1 and 2), the presence of multiple latent variables (simulation 3) and the presence of a correlation structure that is not induced by latent variables but some other mechanism (simulation 4). We select the optimal tuning parameters and compute the prediction errors as in Section 5.3. The results are summarized in Table 4.
The first simulation scenario is the same as the 'simple simulation' that was utilized by Bair  et al. (2006), where hidden components H 1 and H 2 are defined as follows: H 1j equals 3 for 1 j 50 and 4 for 51 j n and H 2j = 3:5 for 1 j n. Columns of X are generated by X i = H 1 + " i for 1 i 50 and H 2 + " i for 51 i p, where " i are an independent identically distributed random vector from N .0, I n /. β is a p × 1 vector, where the ith element is 1/25 for 1 i 50 and 0 for 51 i p. f is a random vector from N .0, 1:5 2 I n /. Although this scenario is ideal for supervised PCs in that Y is related to one main hidden component, SPLS regression shows a comparable performance with supervised PCs and gene shaving.
The third simulation is designed to compare the prediction performances of the methods when all methods are allowed to use only one latent component, even though there are more than one hidden components related to Y. This scenario aims to illustrate the differences of the derived latent components depending on whether they are guided by the response Y. H 1 and H 2 are generated as H 1j = 2:5 I.j 50/ + 4 I.j > 50/, H 2j = 2:5 I.1 j 25 or 51 j 75/ + 4 I.26 j 50 or 76 j 100/. .H 3 , . . . , H 6 / are defined in the same way as .H 2 , . . . , H 5 / in the second simulation. Columns of X are generated by X i = H j + " i for n j−1 + 1 i n j , j = 1, . . . , 6, and .n 0 , . . . , n 6 / = .0, 25, 50, 100, 200, 300, p/. f is a random vector from N .0, I n /. Gene shaving and SPLS both exhibit good predictive performance in this scenario. In a way, when the number of components in the model is fixed, the methods which utilize Y when deriving latent components can achieve better predictive performances compared with methods that utilize only X when deriving these vectors. This agrees with the prior observation that PLS typically requires a smaller number of latent components than that of PCA (Frank and Friedman, 1993).
The fourth simulation is designed to compare the prediction performances of the methods when the relevant variables are not governed by a latent variable model. We generate the first 50 columns of X from a multivariate normal distribution with auto-regressive covariance, and the remaining 4950 columns of X are generated from hidden components as before. Five hidden components are generated as follows: H 1j equals 1 for 1 j 50 and 6 for 51 j n and H 2 , . . . , H 5 are the same as in the second simulation. Denoting X = .X .1/ , X .2/ / by using a partitioned matrix, we generate rows of X .1/ from N .0, Σ 50×50 /, where Σ 50×50 is from an AR(1) process with an auto-correlation ρ = 0:9. Columns of X .2/ are generated by X .2/ i = U j + " i for n j−1 + 1 i n j , where j = 1, . . . , 5 and .n 0 , . . . , n 5 / = .0, 50, 100, 200, 300, p − 50/. β is a p × 1 vector and its ith element is given by β i = k j for n j−1 + 1 i n j , where j = 1, . . . , 6, .n 0 , . . . , n 6 / = .0, 10, 20, 30, 40, 50, p/ and .k 1 , . . . , k 6 / = .8, 6, 4, 2, 1, 0/=25. SPLS regression and gene shaving perform well, indicating that they have the ability to handle such a correlation structure. As in the third simulation, these two methods may gain some advantage in handling more general correlation structures by utilizing response Y when deriving direction vectors.

Case-study: application to yeast cell cycle data set
Transcription factors (TFs) play an important role for interpreting a genome's regulatory code by binding to specific sequences to induce or repress gene expression. It is of general interest to identify TFs which are related to regulation of the cell cycle, which is one of the fundamental processes in a eukaryotic cell. Recently, Boulesteix and Strimmer (2005) performed an integrative analysis of gene expression and CHIP-chip data measuring the amount of transcription and physical binding of TFs respectively, to address this question. Their analysis focused on estimation rather than variable selection. In this section, we focus on identifying cell cycle regulating TFs.
We utilize a yeast cell cycle gene expression data set from Spellman et al. (1998). This experiment measures messenger ribonucleic acid levels every 7 min for 119 min with a total of 18 measurements covering two cell cycle periods. The second data set, CHIP-chip data of Lee et al. (2002), contains binding information of 106 TFs which elucidates which transcriptional regulators bind to promoter sequences of genes across the yeast genome. After excluding genes with missing values in either of the experiments, 542 cell-cycle-related genes are retained.
We analyse these data sets with our proposed multivariate (SPLS-NIPALS) and univariate SPLS regression methods, and also with the lasso for a comparison and summarize the results in Table 5. Since CHIP-chip data provide a proxy for the binary outcome of binding, we scale the CHIP-chip data and use tenfold CV for tuning. Multivariate SPLS selects the least number of TFs (32 TFs), and univariate SPLS selects 70 TFs. The lasso selects the largest number of TFs, 100 out of 106. There are a total of 21 experimentally confirmed cell-cycle-related TFs (Wang et al., 2007), and we report the number of confirmed TFs among those selected as a guideline for performance comparisons. In Table 5, we also report a hypergeometric probability calculation quantifying chance occurrences of the number of confirmed TFs among the variables selected by each method. A comparison of these probabilities indicates that multivariate SPLS has more evidence that selection of a large number of confirmed TFs is not due to chance.
We next compare results from multivariate and univariate SPLS. There are a total of 28 TFs which are selected by both methods and nine of these are experimentally verified according to the literature. The estimators, i.e. TF activities, of selected TFs in general show periodicity. This is indeed a desirable property since the 18 time points cover two periods of a cell cycle. Interestingly, as depicted Fig. 1, multivariate SPLS regression obtains smoother estimates of TF activities compared with univariate SPLS. A total of four TFs are selected only by multivariate SPLS regression. These coefficients are small but consistent across the time points (Fig. 2). A total of 42 TFs are selected only by univariate SPLS, and eight of these are among the confirmed TFs. These TFs do not show periodicity or have non zero coefficients only at few time points (the data are not shown). In general, multivariate SPLS regression can capture the weak effects that are consistent across the time points.

Discussion
PLS regression has been successfully utilized in ill-conditioned linear regression problems that arise in several scientific disciplines. Goutis (1996) showed that PLS yields shrinkage estimators. Butler and Denham (2000) argued that it may provide peculiar shrinkage in the sense that some of the components of the regression coefficient vector may expand instead of shrinking. However, as argued by Rosipal and Krämer (2006), this does not necessarily lead to worse shrinkage because PLS estimators are highly non-linear. We showed that both univariate and multivariate PLS regression estimators are consistent under the latent model assumption with strong restrictions on the number of variables and the sample size. This makes the suitability of PLS for the contemporary very large p and small n paradigm questionable. We argued and illustrated that imposing sparsity on direction vectors helps to avoid sample size problems in the presence of large numbers of irrelevant variables. We further developed a regression technique called SPLS. SPLS regression is also likely to yield shrinkage estimators since the methodology can be considered as a form of PLS regression on a restricted set of predictors. Analysis of its shrinkage properties is among our current investigations. SPLS regression is computationally efficient since it solves a linear equation by employing a CG algorithm rather than matrix inversion at each step.
We presented the solution of the SPLS criterion for the direction vectors and proposed an accompanying SPLS regression algorithm. Our SPLS regression algorithm has connections to other variable selection algorithms including the EN (Zou and Hastie, 2005) and the threshold gradient (Friedman and Popescu, 2004) method. The EN method deals with collinearity in variable selection by incorporating the ridge regression method into the LARS algorithm. In a way, SPLS handles the same issue by fusing the PLS technique into the LARS algorithm. SPLS can also be related to the threshold gradient method in that both algorithms use only the thresholded gradient and not the Hessian. However, SPLS achieves faster convergence by using the CG.
We presented proof-of-principle simulation studies with combinations of small and large number of predictors and sample sizes. These illustrated that SPLS regression achieves both high predictive power and accuracy for finding the relevant variables. Moreover, it can select a higher number of relevant variables than the available sample size since the number of variables that contribute to the direction vectors is not limited by the sample size.
Our application with SPLS involved two recent genomic data types, namely gene expression data and genomewide binding data of TFs. The response variable was continuous and a linear modelling framework followed naturally. Extensions of SPLS to other modelling frameworks such as generalized linear models and survival models are exciting future directions. Our application with integrative analysis of expression and TF binding date highlighted the use of SPLS within the context of a multivariate response. We expect that several genomic problems with multivariate responses, e.g. linking expression of a cluster of genes to genetic marker data, might lend themselves to the multivariate SPLS framework. We provide an implementation of the SPLS regression methodology as an R package at http://cran.r-project.org/web/ packages/spls.
The first claim is proved by using the definition of a matrix norm and lemmas 2 and 3 as For the second claim, we focus on R S XXR − RΣ XX R 2 .RΣ XX R/ −1 2 .RS XXR / −1 2 since (Golub and van Loan, 1987). Here, .RΣ XX R/ −1 2 and .RS XXR / −1 2 are finite as .RΣ XX R/ −1 and .RS XXR / −1 are non-singular for a given K. Using this fact as well as the triangular and Hölder's inequalities, we can easily show the second claim. The third claim follows by the fact that R − R 2 → 0 in probability, lemma 2 and the triangular and Hölder's inequalities.
Next, we can establish that β = Σ −1 XX σ XY = R.R T Σ XX R/ −1 R T σ XY by using the same argument of proposition 1 of Naik and Tsai (2000).

A.2. Proof of lemma 1
We remark that Σ q i=1 h i S XY i = .Σ h i v i /λS XX u + Σ h i X T f i =n, and

A.3. Proof of theorem 2
We start with the sufficient condition of the convergence. We shall first characterize the space that is generated by the direction vectors of each algorithm. For the NIPALS algorithm, we denoteŴ NIP K = .ŵ 1 , . . . ,ŵ K / and D K = .d 1 , . . . , d K / as direction vectors for the original covariate and the deflated covariates respectively. The first direction vector d 1 .=ŵ 1 / is obtained by S XY t 1 = S XY t 1 2 , where t 1 is the right singular vector of S XY . We denote s i,1 = S XY t i = S XY t i 2 , as this form of vector recurs in the remaining steps. Then, span.Ŵ NIP 1 / = span.s 1, 1 /. Define ψ i as the step size vector at the ith step, and the ith current correlation matrix C i as .1=n/X T .Y − XΣ i−1 j=1ŵ j ψ T j /. The current correlation matrix at the second step is given by C 2 = S XY − S XXŵ1 ψ T 1 and thus the second direction vector d 2 is proportional to S XY t 2 − ψ T 1 t 2 S XXŵ1 , where t 2 is the right singular vector of C 2 . Then span.Ŵ NIP 2 / = span.s 1, 1 , s 2, 1 + l 2, 1 S XX s 1, 1 /, where l 2, 1 = ψ T 1 t 2 = S XY t 2 2 . Similarly, we can obtain span.Ŵ NIP K / = span s 1, 1 , s 2, 1 + l 2, 1 S XX s 1, 1 , . . . , s K,1 + K−1 i=1 l K, i S i XX s K−i,1 : Now, we observe that span.Ŵ NIP i / does not form a Krylov space, because s i,1 is not the same as s 1, 1 for multivariate Y. However, it forms a Krylov space for large n, since S XY t= S XY t 2 − w 1 2 → 0 for any q-dimensional random vector t subject to t 2 = 1 almost surely, following lemma 1.
For the SIMPLS algorithm, using the fact that the ith direction vector of SIMPLS is obtained sequentially from the left singular vector of D i = .I − Π P i−1 /S XY , where P i−1 = S XXŴ i−1 SIM .Ŵ i−1 SIM S XXŴ i−1 SIM / −1 , we can characterize span.Ŵ SIM K / = span s 1, 1 , s 2, 1 + l 2, 1 S XX s 1, 1 , . . . , s K,1 + K−1 i=1 l K, i S i XX s K−i,1 : We note that s i,1 s and l i, j s from the NIPALS and SIMPLS algorithms are different because the t i s are from C i and D i for the NIPALS and SIMPLS algorithms respectively. Next, we shall focus on the convergence of the NIPALS estimator, because the convergence of the SIMPLS estimator can be proved by the same argument owing to the structural similarity of span.Ŵ NIP K / and span.Ŵ SIM K /. DenotingW = .s 1, 1 , s 2, 1 + l 2, 1 S XX s 1, 1 , . . . , s K,1 + Σ K−1 i=1 l K, i S i XX s K−i,1 / andW = .s 1, 1 , s 1, 1 + l 2, 1 S XX s 1, 1 , . . . , s 1, 1 + Σ K−1 i=1 l K, i S i XX s 1, 1 /, one can show that W −W 2 = O p { √ .p=n/} by using the fact s i,1 − w 1 2 = O p { √ .p=n/} for i = 1, . . . , K. Since span.W/ can also be represented as span.s 1, 1 , S K−1 XX s 1, 1 , . . . , S XX s 1, 1 / .= span.R//, we have that B NIP −R.R T S XXR / −1RT S XY 2 = O p { √ .p=n/}. Thus, we now deal with the convergence ofR.R T S XXR / −1RT S XY , which has a similar form to that of the univariate response case.
Since The necessity condition of the convergence is proved as follows. Assume that B PLS − B 2 → 0 in probability, when p=n → k 0 .> 0/. Following the argument in the proof of theorem 1, we have lim{P. R T E T F=n 2 = 0/} = 1: Since E T F=n 2 = 0 almost surely, this equation implies that P{range.E T F=n/ ⊂ null.R T /} → 1 as n → ∞.
If p=n → k 0 .> 0/, this contradicts the fact that E T F i = d χ .n/ χ .p/ U p , where F i denotes the ith column of F and U p is a vector uniform on the surface of the unit sphere S p−1 , as the dimension of null.R T / is p − K.