Sparse partial least squares regression for simultaneous dimension reduction and variable selection
Summary. Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well‐known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data.
1. Introduction
With the recent advancements in biotechnology such as the use of genomewide microarrays and high throughput sequencing, regression‐based modelling of high dimensional data in biology has never been more important. Two important statistical problems commonly arise within regression problems that concern modern biological data. The first is the selection of a set of important variables among a large number of predictors. Utilizing the sparsity principle, i.e. operating under the assumption that a small subset of the variables drives the underlying process, through an L1‐penalty has been promoted as an effective solution (Tibshirani, 1996; Efron et al., 2004). The second problem is that such a variable selection exercise often arises as an ill‐posed problem where
- (a) the sample size n is much smaller than the total number of variables p, and
- (b) the covariates are highly correlated.
Dimension reduction techniques such as principal components analysis (PCA) or partial least squares (PLS) have recently gained much attention for addressing these within the context of genomic data (Boulesteix and Strimmer, 2006).
Although dimension reduction via PCA or PLS is a principled way of dealing with ill‐posed problems, it does not automatically lead to selection of relevant variables. Typically, all or a large portion of the variables contribute to final direction vectors which represent linear combinations of original predictors. Imposing sparsity in the midst of the dimension reduction step might lead to simultaneous dimension reduction and variable selection. Recently, Huang et al. (2004) proposed a penalized PLS method that thresholds the final PLS estimator. Although this imposes sparsity on the solution itself, it does not necessarily lead to sparse linear combinations of the original predictors. Our goal is to impose sparsity in the dimension reduction step of PLS so that sparsity can play a direct principled role.
The rest of the paper is organized as follows. We review general principles of the PLS methodology in Section 2. We show that PLS regression for either a univariate or multivariate response provides consistent estimators only under restricted conditions, and the consistency property does not extend to the very large p and small n paradigm. We formulate sparse partial least squares (SPLS) regression by relating it to sparse principal components analysis (SPCA) (Jolliffe et al., 2003; Zou et al., 2006) in Section 3 and provide an efficient algorithm for solving the SPLS regression formulation in Section 4. Methods for tuning the sparsity parameter and the number of components are also discussed in this section. Simulation studies and an application to transcription factor activity analysis by integrating microarray gene expression and chromatin immuno‐precipitation–microarray chip (CHIP–chip) data are provided in Sections 5 and 6.
2. Partial least squares regression
2.1. Description of partial least squares regression
PLS regression, which was introduced by Wold (1966), has been used as an alternative approach to ordinary least squares (OLS) regression in ill‐conditioned linear regression models that arise in several disciplines such as chemistry, economics and medicine (de Jong, 1993). At the core of PLS regression is a dimension reduction technique that operates under the assumption of a basic latent decomposition of the response matrix Y ∈ ℝ^{n×q} and the predictor matrix X ∈ ℝ^{n×p}:

$$Y = T Q^{\mathrm{T}} + F, \qquad X = T P^{\mathrm{T}} + E, \tag{1}$$

where T ∈ ℝ^{n×K} is a matrix that produces K linear combinations (scores), P ∈ ℝ^{p×K} and Q ∈ ℝ^{q×K} are matrices of coefficients (loadings), and E ∈ ℝ^{n×p} and F ∈ ℝ^{n×q} are matrices of random errors.

PLS estimates the score matrix as T̂ = XŴK, where the columns of ŴK = (ŵ1,…,ŵK) are direction vectors derived from X and Y. The two main algorithmic variants, NIPALS (Wold, 1966) and SIMPLS (de Jong, 1993), differ in how they obtain these direction vectors. The NIPALS algorithm works with deflated data matrices Xk and Yk, the residuals of X and Y after the first k latent components have been removed, and finds the direction vector dk+1 at the (k+1)th step by solving

$$\hat d_{k+1} = \arg\max_{d}\; d^{\mathrm{T}} X_k^{\mathrm{T}} Y_k Y_k^{\mathrm{T}} X_k d \quad \text{subject to } d^{\mathrm{T}} d = 1. \tag{2}$$

At the final Kth step, ŴK, the direction matrix with respect to the original matrix X, is computed from the deflation relationship between X and Xk; this transformation involves Wk−1^+, which is the unique Moore–Penrose inverse of Wk−1=(w1,…,wk−1), and DK=(d1,…,dK). In contrast, the SIMPLS algorithm produces the (k+1)th direction vector ŵk+1 directly with respect to the original matrix X; the SIMPLS formulation is given by

$$\hat w_{k+1} = \arg\max_{w}\; w^{\mathrm{T}} X^{\mathrm{T}} Y Y^{\mathrm{T}} X w \quad \text{subject to } w^{\mathrm{T}} w = 1 \text{ and } w^{\mathrm{T}} S_{XX} \hat w_j = 0 \text{ for } j = 1, \ldots, k, \tag{3}$$

where SXX = XTX/n. After estimating the latent components T̂K = XŴK by using the K direction vectors, the loadings Q are estimated via solving minQ(‖Y−T̂KQT‖2). This leads to the final estimator β̂PLS = ŴK Q̂T, where Q̂T = (T̂KT T̂K)−1 T̂KT Y is the solution of this least squares problem.
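As a concrete illustration of these quantities, the following minimal R sketch (ours, purely for illustration) computes the first direction vector as the leading eigenvector of M=XTYYTX and the corresponding one‐component fit of equation (1).

```r
# Minimal sketch (not the paper's code): a one-component PLS fit.
# The first direction vector maximizes w' M w with M = X'Y Y'X subject to ||w|| = 1,
# i.e. it is the leading eigenvector of M; loadings and coefficients follow equation (1).
pls1 <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(as.matrix(Y), center = TRUE, scale = FALSE)
  M  <- crossprod(X, Y) %*% crossprod(Y, X)        # M = X'Y Y'X  (p x p)
  w1 <- eigen(M, symmetric = TRUE)$vectors[, 1]    # first direction vector
  T1 <- X %*% w1                                   # latent score, T = X w1
  Qt <- solve(crossprod(T1), crossprod(T1, Y))     # Q' = (T'T)^{-1} T'Y
  beta <- matrix(w1, ncol = 1) %*% Qt              # final estimator, beta = W Q'
  list(direction = w1, scores = T1, coefficients = beta)
}

# Example with simulated data:
# set.seed(1); X <- matrix(rnorm(1000), 100); y <- X[, 1] - X[, 2] + rnorm(100)
# fit <- pls1(X, y)
```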
2.2. An asymptotic property of partial least squares regression
2.2.1. Partial least squares regression for univariate Y
Stoica and Söderström (1998) derived asymptotic formulae for the bias and variance of the PLS estimator for the univariate case. These formulae are valid if the ‘signal‐to‐noise ratio’ is high or if n is large and the predictors are uncorrelated with the residuals. Naik and Tsai (2000) proved consistency of the PLS estimator under normality assumptions on both Y and X, in addition to consistency of SXY and SXX and the following condition 1. This condition, which is known as the Helland and Almøy (1994) condition, implies that an integer K exists such that exactly K of the eigenvectors of ΣXX have non‐zero components along σXY.
Condition 1. There are eigenvectors vj (j=1,…,K) of ΣXX corresponding to different eigenvalues, such that σXY = α1v1 + … + αKvK and α1,…,αK are non‐zero.
We note that the consistency proof of Naik and Tsai (2000) requires p to be fixed. In many fields of modern genomic research, data sets contain a large number of variables with a much smaller number of observations (e.g. gene expression data sets where the variables are of the order of thousands and the sample size is of the order of tens). Therefore, we investigate the consistency of the PLS regression estimator under the very large p and small n paradigm and extend the result of Naik and Tsai (2000) for the case where p is allowed to grow with n at an appropriate rate. In this setting, we need additional assumptions on both X and Y to ensure the consistency of SXX and SXY, which is the conventional assumption for fixed p. Recently, Johnstone and Lu (2004) proved that the leading PC of SXX is consistent if and only if p/n→0. Hence, we adopt their assumptions for X to ensure consistency of SXX and SXY. Assumptions for X from Johnstone and Lu (2004) are as follows.
Assumption 1. Assume that each row xi of X follows the model xi = Σj=1,…,m υij ρj + σ1 ei, for some constant σ1, where
- (a) ρj, j=1,…,m, are mutually orthogonal PCs with norms ‖ρ1‖ ≥ ‖ρ2‖ ≥ … ≥ ‖ρm‖,
- (b) the multipliers υij ∼ N(0,1) are independent over the indices of both i and j,
- (c) the noise vectors ei∼N(0,Ip) are independent among themselves and of the random effects υij, and
- (d) p(n), m(n) and {ρj(n), j=1,…,m} are functions of n, and the norms of the PCs converge as sequences: ϱ(n)=(‖ρ1(n)‖,…,‖ρj(n)‖,…)→ϱ=(ϱ1,…,ϱj,…). We also write ϱ+ for the limiting l1‐norm: ϱ+=Σj ϱj.
We remark that the above factor model for X is similar to that of Helland (1990) except for having an additional random error term ei. All properties of PLS in Helland (1990) will hold, as the eigenvectors of ΣXX and ΣXX − σ1²Ip are the same. We take the assumptions for Y from Helland (1990) with an additional norm condition on β.
Assumption 2. Assume that Y and X have the relationship Y=Xβ+σ2f, where f∼N(0,In) is independent of X, ‖β‖2<∞ and σ2 is a constant.
We next show that, under the above assumptions and condition 1, the PLS estimator is consistent if and only if p grows much slower than n.
Theorem 1. Under assumptions 1 and 2, and condition 1,
- (a) if p/n→0, then ‖β̂PLS−β‖2→0 in probability, and
- (b) if p/n→k0 for k0>0, then β̂PLS does not converge to β in probability.
The main implication of this theorem is that the PLS estimator is not suitable for very large p and small n problems in complete generality. Although PLS utilizes a dimension reduction technique by using a few latent factors, it cannot avoid the sample size issue since a reasonable size of n is required to estimate sample covariances consistently, as shown in the proof of theorem 1 in Appendix A. A referee pointed out that a qualitatively equivalent result has been obtained by Nadler and Coifman (2005), where the root‐mean‐squared error of the PLS estimator has an additional error term that depends on p²/n².
2.2.2. Partial least squares regression for multivariate Y
There are limited theoretical results on the properties of PLS regression within the context of a multivariate response. Counterintuitive simulation results, where multivariate PLS showed only a minor improvement in prediction error, were reported in Frank and Friedman (1993). Later, Helland (2000) argued on intuitive grounds that, since multivariate PLS achieves parsimonious models by using the same reduced model space for all the responses, the net gain of sharing the model space could be negative if, in fact, the responses require different reduced model spaces. Thus, we next introduce a specific setting for multivariate PLS regression in the light of Helland's (2000) intuition and extend the consistency result of univariate PLS to the multivariate case.
Assume that all the response variables have linear relationships with the same set of covariates: Y1=Xb1+f1, Y2=Xb2+f2, …, Yq=Xbq+fq, where b1,…,bq are p×1 coefficient vectors and f1,…,fq are independent error vectors from N(0, σ²In). Since the shared reduced model space of each response is determined by the bi s, we impose a restriction on these coefficients. Namely, we require the existence of eigenvectors v1,…,vK of ΣXX that span the solution space to which each bi belongs.
We have proved consistency of the PLS estimator for a univariate response by using the facts that SXY is proportional to the first direction vector and that the solution space, to which β̂PLS belongs, can be explicitly characterized as the Krylov space spanned by (SXY, SXX SXY, …, SXX^{K−1} SXY). However, for a multivariate response, PLS finds the first direction vector as the first left singular vector of SXY. The presence of remaining directions in the column space of SXY makes it difficult to characterize the solution space explicitly. Furthermore, the solution space varies depending on the algorithm that is used to fit the model. If we further assume that bi=kib1 for constants k2,…,kq, then ΣXY becomes a rank 1 matrix and these challenges are reduced, thereby leading to a setting where we can start to understand characteristics of multivariate PLS.
Condition 2 and assumption 3 below recapitulate these assumptions where the set of regression coefficients b1,b2,…,bq are represented by the coefficient matrix B.
Condition 2. There are eigenvectors vj (j=1,…,K) of ΣXX corresponding to different eigenvalues, such that bi = αi1v1 + … + αiKvK and αi1,…,αiK are non‐zero for i=1,…,q.

Assumption 3. Assume that Y=XB+F, where the columns of F are independent and from N(0, σ²In). B is a rank 1 matrix with singular value decomposition ϑuvT, where ϑ denotes the singular value, and u and v are the left and right singular vectors respectively. In addition, ϑ<∞ and q is fixed.
Lemma 1 proves the convergence of the first direction vector, which plays a key role in forming the solution space of the PLS estimator. The proof is provided in Appendix A.

Lemma 1. Under assumptions 1 and 3, ‖s − w1‖2 = Op{√(p/n)}, where s is the estimate of the first direction vector w1 and w1 is given by ΣXXu/‖ΣXXu‖2. Here, ‖·‖F denotes the Frobenius norm, ς is the non‐zero singular value of SXY, and s and t1 are its left and right singular vectors respectively. As a result, s can be represented by s = SXY t1/ς.
The above discussion highlighted the advantage of multivariate PLS compared with univariate PLS in terms of estimation of the direction vectors. Next, we present the convergence result of the final PLS solution.
Theorem 2. Under assumptions 1 and 3, condition 2 and for fixed K and q, ‖B̂PLS − B‖F → 0 in probability if and only if p/n→0.
Theorem 2 implies that, under the given conditions and for fixed K and q, the PLS estimator is consistent regardless of the algorithmic variant that is used if p/n→0. Although PLS solutions from algorithmic variants might differ for finite n, these solutions are consistent. Moreover, the fixed q case is practical in most applications because we can always cluster Ys into smaller groups before linking them to X. We refer to Chun and Keleş (2009) for an application of this idea within the context of expression quantitative loci mapping.
Our results for multivariate Y are based on the equal variance assumption on the components of the error matrix F. Even though the popular objective functions of multivariate PLS given in expressions (2) and (3) do not involve a scaling factor for each component of multivariate Y, in practice, Ys are often scaled before the analysis. Violation of the equal variance assumption will affect the performance of PLS regression (Helland, 2000). Therefore, if there are reasons to believe that the error levels in Y, not the signal strengths, are different, scaling will aid in satisfying the equal variance assumption of our theoretical result.
2.3. Motivation for the sparsity principle in partial least squares regression
To motivate the sparsity principle, we now explicitly illustrate how a large number of irrelevant variables affects the PLS estimator through a simple example. This observation is central to our methodological development. We utilize the closed form solution of Helland (1990) for univariate PLS regression, β̂PLS = R̂(R̂T SXX R̂)−1 R̂T SXY, where R̂ = (SXY, SXX SXY, …, SXX^{K−1} SXY). We assume the existence of a latent variable (K=1) as well as a fixed number of relevant variables (p1) and let p grow at the rate O(k′n), where the constant k′ is sufficiently large for the noise variables to dominate the sample covariance matrix SXX (assumption (4)). For K=1, the closed form solution then behaves as

$$\hat\beta^{\mathrm{PLS}} = \frac{S_{XY}^{\mathrm{T}} S_{XY}}{S_{XY}^{\mathrm{T}} S_{XX} S_{XY}}\, S_{XY} \tag{5}$$
$$\approx O(1/k')\, S_{XY}. \tag{6}$$

Approximation (5) follows from lemma 2 in Appendix A and assumption (4). Approximation (6) is due to the fact that the largest and smallest eigenvalues of the Wishart matrix are O(k′) (Geman, 1980). In this example, the large number of noise variables forces the loadings in the direction of SXY to be attenuated and thereby causes inconsistency.
From a practical point of view, since latent factors of PLS have contributions from all the variables, the interpretation becomes difficult in the presence of large numbers of noise variables. Motivated by the observation that noise variables enter the PLS regression via direction vectors and attenuate estimates of the regression parameters, we consider imposing sparsity on the direction vectors.
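The attenuation described above is easy to reproduce numerically. The following small R sketch (ours; the sample size, signal strength and number of relevant variables are arbitrary choices) applies Helland's closed form with K=1 to data containing a handful of relevant variables and an increasing number of pure noise columns.

```r
# Sketch (ours): attenuation of the one-component PLS estimator as noise variables
# are added, using the closed form beta = {S_XY'S_XY / (S_XY'S_XX S_XY)} S_XY.
set.seed(1)
n <- 50; p1 <- 5                              # assumed sample size and relevant variables
attenuation <- function(p_noise) {
  H <- rnorm(n)                               # single latent variable (K = 1)
  X <- cbind(matrix(H, n, p1) + matrix(rnorm(n * p1), n),   # p1 relevant columns
             matrix(rnorm(n * p_noise), n))                 # pure noise columns
  y <- 2 * H + rnorm(n)
  Sxx <- crossprod(X) / n
  Sxy <- crossprod(X, y) / n
  beta <- drop(crossprod(Sxy) / (t(Sxy) %*% Sxx %*% Sxy)) * Sxy
  max(abs(beta[1:p1]))                        # size of the loadings on the relevant variables
}
sapply(c(10, 100, 500, 2000), attenuation)    # shrinks towards 0 as p grows with n fixed
```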
3. Sparse partial least squares regression
3.1. Finding the first sparse partial least squares direction vector
A natural first attempt at obtaining a sparse first direction vector is to add an L1‐constraint to the PLS criterion, in the spirit of SCOTLASS (Jolliffe et al., 2003):

$$\max_{w}\; w^{\mathrm{T}} M w \quad \text{subject to } w^{\mathrm{T}} w = 1, \; \|w\|_1 \leq \lambda, \tag{7}$$

where M = XTYYTX. Jolliffe et al. (2003) pointed out that the solution of this formulation tends not to be sufficiently sparse and the problem is not convex. This convexity issue was revisited by d'Aspremont et al. (2007) in direct SPCA by reformulating the criterion in terms of W=wwT, thereby producing a semidefinite programming problem that is known to be convex. However, the sparsity issue remained.

We therefore promote exact zero loadings by imposing the L1‐penalty on a surrogate of the direction vector, c, which is kept close to the original direction vector w, and solve the generalized regression formulation

$$\min_{w,c}\; -\kappa\, w^{\mathrm{T}} M w + (1-\kappa)(c-w)^{\mathrm{T}} M (c-w) + \lambda_1 \|c\|_1 + \lambda_2 \|c\|_2^2 \quad \text{subject to } w^{\mathrm{T}} w = 1. \tag{8}$$

In this formulation, the L1‐penalty encourages sparsity on c whereas the L2‐penalty addresses the potential singularity in M when solving for c. We shall rescale c to have norm 1 and use this scaled version as the estimated direction vector. We note that this problem becomes that of SCOTLASS when w=c and M=XTX, SPCA when κ=0 and M=XTX, and the original maximum eigenvalue problem of PLS when κ=1. We aim to reduce the effect of the concave part (hence the local solution issue) by using a small κ.
3.2. Solution for the generalized regression formulation of sparse partial least squares
We solve the generalized regression formulation of SPLS given in expression (8) by alternately iterating between solving for w with c fixed and solving for c with w fixed.

For a fixed c, expression (8) reduces to the w‐subproblem

$$\min_{w}\; -\kappa\, w^{\mathrm{T}} M w + (1-\kappa)(c-w)^{\mathrm{T}} M (c-w) \quad \text{subject to } w^{\mathrm{T}} w = 1. \tag{9}$$

For κ = 1/2, the quadratic terms in w cancel, the objective function in problem (9) reduces to −wTMc and the solution is w=UVT, where U and V are obtained from the singular value decomposition of Mc (Zou et al., 2006).

For a fixed w, solving expression (8) for c amounts to

$$\min_{c}\; (c-w)^{\mathrm{T}} M (c-w) + \lambda_1 \|c\|_1 + \lambda_2 \|c\|_2^2. \tag{10}$$

This problem, which is equivalent to the naive elastic net (EN) problem of Zou and Hastie (2005) when Y in the naive EN is replaced with ZTw for Z=XTY, can be solved efficiently via the least angle regression algorithm LARS (Efron et al., 2004). SPLS often requires a large λ2‐value to solve problem (10) because the design matrix ZT of this EN problem is a q×p matrix with usually small q, i.e. q=1 for univariate Y. As a remedy, we use an EN formulation with λ2=∞ and this yields the solution to have the form of a soft thresholded estimator (Zou and Hastie, 2005). This concludes our solution of the regression formulation for general Y (univariate or multivariate). We further have the following simplification for univariate Y (q=1).
Theorem 3. For univariate Y, the solution of problem (8) takes the form of a componentwise soft thresholded estimator of ŵ1, i.e. ĉi = (|ŵ1,i| − η)+ sign(ŵ1,i) for a threshold η that is determined by λ1, where ŵ1 = XTY/‖XTY‖2 is the first direction vector of PLS.

Proof. Write Z = XTY, so that M = ZZT. For a given c and κ=0.5, it follows that ŵ = ZZTc/‖ZZTc‖2 = Z/‖Z‖2 (up to sign), since the singular value decomposition of ZZTc yields U = ZZTc/‖ZZTc‖2 and V=1. For a given c and 0<κ<0.5, the solution is given by w={ZTc/(‖Z‖²+λ*)}Z by using the Woodbury formula (Golub and van Loan, 1987). Noting that ZTc/(‖Z‖²+λ*) is a scalar and by the norm constraint, we have ŵ = Z/‖Z‖2 = ŵ1. Since ŵ does not depend on c, solving problem (10) with λ2=∞ yields the soft thresholded form of ĉ stated in the theorem for large λ2.
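For multivariate Y, the alternating scheme of this section can be sketched in R as follows. This is our own illustrative implementation: it uses κ=1/2 for the w‐step, applies the λ2=∞ soft‐threshold form of the c‐step to Mw, and parameterizes the threshold by a fraction η ∈ [0,1) of the largest component in the spirit of Section 4.2 rather than by λ1 itself.

```r
# Sketch (ours): alternating updates for one SPLS direction vector of problem (8).
spls_direction <- function(X, Y, eta, niter = 20) {
  Z <- crossprod(X, as.matrix(Y))              # Z = X'Y  (p x q)
  M <- tcrossprod(Z)                           # M = Z Z' = X'Y Y'X
  c_vec <- svd(Z)$u[, 1]                       # start from the first left singular vector
  for (i in seq_len(niter)) {
    # w-step: for kappa = 1/2 the objective of (9) reduces to -w'Mc, so w = Mc/||Mc||
    w <- drop(M %*% c_vec)
    w <- w / sqrt(sum(w^2))
    # c-step (lambda2 = Inf): soft thresholding, keeping components of Mw that exceed
    # a fraction eta of the largest component
    mw <- drop(M %*% w)
    c_vec <- sign(mw) * pmax(abs(mw) - eta * max(abs(mw)), 0)
    if (all(c_vec == 0)) break                 # threshold too aggressive; stop early
  }
  if (any(c_vec != 0)) c_vec <- c_vec / sqrt(sum(c_vec^2))   # rescale to norm 1
  c_vec
}
```

For univariate Y the loop stabilizes after a single pass, in agreement with theorem 3.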
4. Implementation and algorithmic details
4.1. Sparse partial least squares algorithm
In this section, we present the complete SPLS algorithm which encompasses the formulation of the first SPLS direction vector from Section 3.1 as well as an efficient algorithm for obtaining all the other direction vectors and coefficient estimates.
In principle, the objective function for the first SPLS direction vector can be utilized at each step of the NIPALS or SIMPLS algorithm to obtain the rest of the direction vectors. We call this idea the naive SPLS algorithm. However, this naive SPLS algorithm loses the conjugacy of the direction vectors. A similar issue appears in SPCA, where none of the methods proposed (Jolliffe et al., 2003; Zou et al., 2006; d'Aspremont et al., 2007) produces orthogonal sparse principal components. Although conjugacy can be obtained by the Gram–Schmidt conjugation of the derived sparse direction vectors, these post‐conjugated vectors do not inherit the property of Krylov subsequences which is known to be crucial for the convergence of the algorithm (Krämer, 2007). Essentially, such a post‐orthogonalization does not guarantee the existence of the solution among the iterations.
To address this concern, we propose an SPLS algorithm which leads to a sparse solution by keeping the Krylov subsequence structure of the direction vectors in a restricted X‐space of selected variables. Specifically, at each step of either the NIPALS or the SIMPLS algorithm, it searches for relevant variables, the so‐called active variables, by optimizing expression (8) and updates all direction vectors to form a Krylov subsequence on the subspace of the active variables. This is simply achieved by conducting PLS regression by using the selected variables. Let 𝒜 be an index set for the active variables and K the number of components. Denote by X𝒜 the submatrix of X whose column indices are contained in 𝒜. The SPLS algorithm can utilize either the NIPALS or the SIMPLS algorithm as described below; an illustrative R sketch for univariate Y follows the algorithm steps.
- Step 1: set β̂PLS = 0, 𝒜 = { } and k = 1. For the NIPALS algorithm set Y1=Y, and for the SIMPLS algorithm set X1=X.
- Step 2: while k ≤ K,
  - (a) find ŵ by solving the objective (8) in Section 3.1 with M=XTY1Y1TX for the NIPALS and M=X1TYYTX1 for the SIMPLS algorithm,
  - (b) update 𝒜 as {i : ŵi ≠ 0} ∪ {i : β̂i,PLS ≠ 0},
  - (c) fit PLS with X𝒜 by using k number of latent components and
  - (d) update β̂PLS by using the new PLS estimates of the direction vectors and update k with k←k+1; for the NIPALS algorithm, update Y1 through Y1←Y−Xβ̂PLS, and for the SIMPLS algorithm, update X1 through X1←X(Ip−P̂𝒜), where P̂𝒜 denotes the projection matrix onto the space spanned by the direction vectors obtained so far.
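For univariate Y, the loop above admits a compact illustration. The R sketch below is ours and is not the reference implementation (that is provided in the spls package); in particular, it refits PLS on the active set with Helland's closed form rather than with the conjugate gradient computation used in practice, and η is the soft‐threshold fraction of Section 4.2.

```r
# Sketch (ours): the SPLS loop for univariate y (NIPALS-type deflation of the response).
spls_univariate <- function(X, y, K, eta) {
  n <- nrow(X); p <- ncol(X)
  X <- scale(X, center = TRUE, scale = FALSE); y <- y - mean(y)
  pls_refit <- function(XA, y, k) {            # Helland's closed form on the active set
    Sxx <- crossprod(XA) / n; Sxy <- crossprod(XA, y) / n
    R <- Sxy; v <- Sxy
    if (k > 1) for (j in 2:k) { v <- Sxx %*% v; R <- cbind(R, v) }
    drop(R %*% solve(t(R) %*% Sxx %*% R, t(R) %*% Sxy))
  }
  beta <- rep(0, p); A <- integer(0); y1 <- y
  for (k in 1:K) {
    w <- drop(crossprod(X, y1))                # direction for the current residual, X'y1
    c_vec <- sign(w) * pmax(abs(w) - eta * max(abs(w)), 0)   # step 2(a): thresholded direction
    A <- union(A, which(c_vec != 0))           # step 2(b): update the active set
    beta <- rep(0, p)
    beta[A] <- pls_refit(X[, A, drop = FALSE], y, min(k, length(A)))  # steps 2(c)-(d)
    y1 <- y - X %*% beta                       # NIPALS-type deflation of the response
  }
  list(active = sort(A), coefficients = beta)
}
```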
The original NIPALS algorithm includes deflation steps for both X‐ and Y‐ matrices, but the same M‐matrix can be computed via the deflation of either X or Y owing to the idempotency of the projection matrix. In our SPLS–NIPALS algorithm, we chose to deflate the Y‐matrix because, in that case, the eigenvector XTY1/‖XTY1‖ of M is proportional to the current correlations in the LARS algorithm for univariate Y. Hence, the LARS and SPLS–NIPALS algorithms use the same criterion to select active variables in this case. However, the SPLS–NIPALS algorithm differs from LARS in that it selects more than one variable at a time and utilizes the conjugate gradient (CG) method to compute the coefficients at each step (Friedman and Popescu, 2004). This, in particular, implies that the SPLS–NIPALS algorithm can select a group of correlated variables simultaneously. The cost of computing coefficients at each step of the SPLS algorithm is less than or equal to that of LARS as the CG method avoids matrix inversion.
The SPLS–SIMPLS algorithm has similar attributes to the SPLS–NIPALS algorithm. It also uses the CG method and selects more than one variable at each step and handles multivariate responses. However, the M‐matrix is no longer proportional to the current correlations of the LARS algorithm. SIMPLS yields direction vectors directly satisfying the conjugacy constraint, which may hamper the ability of revealing relevant variables. In contrast, the direction vectors at each step of the NIPALS algorithm are derived to maximize the current correlations on the basis of residual matrices, and conjugated direction vectors are computed at the final stage. Thus, the SPLS–NIPALS algorithm is more likely to choose the correct set of relevant variables when the signals of the relevant variables are weak. A small simulation study investigating this point is presented in Section 5.1.
4.2. Choosing the thresholding parameter and the number of hidden components
Although the SPLS regression formulation in expression (8) has four tuning parameters (κ,λ1,λ2 and K), only two of these are key tuning parameters, namely the thresholding parameter λ1 and the number of hidden components K. As we discussed in theorem 3 of Section 3.2, the solution does not depend on κ for univariate Y. For multivariate Y, we show with a simulation study in Section 5.2 that setting κ smaller than 1/2 generally avoids local solution issues. Different κ‐values have the effect of starting the algorithm with different starting values. Since the algorithm is computationally inexpensive (the average run time including the tuning is only 9 min for a sample size of n=100 with p=5000 predictors on a 64‐bit machine with 2.66 GHz central processor unit), users are encouraged to try several κ‐values. Finally, as described in Section 3.2, setting the λ2‐parameter to ∞ yields the thresholded estimator which depends only on λ1. Therefore, we proceed with the tuning mechanisms for the two key parameters λ1 and K. We start with univariate Y since imposing an L1‐penalty has the simple form of thresholding, and then we discuss multivariate Y.
We start by describing a form of soft thresholded direction vector,

$$\hat w_i^{\mathrm{new}} = \Big(|\hat w_i| - \eta \max_{1\leq j\leq p}|\hat w_j|\Big)_{+}\,\mathrm{sign}(\hat w_i),$$

where 0 ≤ η ≤ 1. Here, η plays the role of the sparsity parameter λ1 in theorem 3. This form of soft thresholding retains components that are greater than some fraction of the maximum component. A similar approach was utilized in Friedman and Popescu (2004) with hard thresholding as opposed to our soft thresholding scheme. The single tuning parameter η is tuned by cross‐validation (CV) for all the direction vectors. We do not use separate sparsity parameters for individual directions because tuning multiple parameters is computationally prohibitive and may not produce a unique minimum for the CV criterion.
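A direct transcription of this rule into R is the following small helper (ours, for illustration only).

```r
# Soft thresholded direction vector: keep components larger than a fraction eta
# of the largest component, shrinking the survivors towards zero.
soft_threshold_dv <- function(w, eta) {
  stopifnot(eta >= 0, eta <= 1)
  sign(w) * pmax(abs(w) - eta * max(abs(w)), 0)
}
# e.g. soft_threshold_dv(c(0.9, -0.5, 0.05, 0.02), eta = 0.4) zeroes the last two components
```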
Let r̂i,k denote the sample partial correlation of the ith variable Xi with Y given T̂k−1, where T̂k−1 denotes the set of first k−1 latent variables included in the model. Under the normality assumption on X and Y, and the null hypothesis that the corresponding population partial correlation is 0, the z‐transformed (partial) correlation coefficients have the distribution (Bendel and Afifi, 1976)

$$\sqrt{n-k-2}\;\tfrac{1}{2}\log\!\left(\frac{1+\hat r_{i,k}}{1-\hat r_{i,k}}\right) \sim N(0,1) \quad \text{approximately}.$$

We compute the corresponding p‐values pi, for i=1,…,p, for the (partial) correlation coefficients by using this statistic and arrange them in ascending order: p(1) ≤ p(2) ≤ … ≤ p(p). After defining î = max{i : p(i) ≤ (i/p)α}, the hard thresholded direction vector becomes ŵi,new = ŵi 1{pi ≤ p(î)}, based on the Benjamini and Hochberg (1995) FDR procedure.
We remark that the solution from FDR control is minimax optimal if the underlying direction vector is sufficiently sparse and α>γ/log(p) for some γ>0, under independence among the tests. As long as α decreases at an appropriate rate as p increases, thresholding by FDR control is optimal without knowing the level of sparsity and, hence, reduces computation considerably. Although we do not have this independence, this adaptivity may work since the argument for minimax optimality mainly depends on marginal properties (Abramovich et al., 2006).
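A hedged R sketch of this FDR‐based hard thresholding is given below; the Fisher z approximation and its degrees of freedom are our reading of a Bendel–Afifi type statistic, and pcor stands for the sample partial correlations of the p variables with Y given the k−1 latent components already in the model.

```r
# Sketch (ours): hard thresholding of a direction vector by Benjamini-Hochberg FDR control.
fdr_threshold_dv <- function(w, pcor, n, n_conditioned = 0, alpha = 0.1) {
  z <- sqrt(n - n_conditioned - 3) * atanh(pcor)   # z-transformed (partial) correlations
  pval <- 2 * pnorm(-abs(z))                       # two-sided p-values under the null
  keep <- p.adjust(pval, method = "BH") <= alpha   # BH step-up selection at level alpha
  ifelse(keep, w, 0)                               # hard thresholded direction vector
}
```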
As discussed in Section 3.2, for multivariate Y, the solution for SPLS is obtained through iterations and the resulting solution has a form of soft thresholding. Although hard thresholding with FDR control is no longer applicable, we can still employ soft thresholding based on CV. The number of hidden components, K, is tuned by CV as in the original PLS. We note that CV will be a function of two arguments for soft thresholding and that of one argument for hard thresholding and thereby making hard thresholding computationally much cheaper than soft thresholding.
5. Simulation studies
5.1. Comparison between SPLS–NIPALS and SPLS–SIMPLS algorithms
We conducted a small simulation study to compare the variable selection performances of the two SPLS variants, SPLS–NIPALS and SPLS–SIMPLS. The data‐generating mechanism is set as follows. Columns of X are generated by Xi=Hj+ɛi for nj−1+1 ≤ i ≤ nj, where j=1,…,3 and (n0,n1,n2,n3)=(0,6,13,30). Here, H1, H2 and H3 are independent multivariate normal random vectors and the ɛis are independent multivariate normal noise vectors. Columns of Y are generated by Y1=0.1H1−2H2+f1 and Yi+1=1.2Yi+fi, where the fis are independent multivariate normal noise vectors. We generated 100 simulated data sets and analysed them by using both the SPLS–NIPALS and the SPLS–SIMPLS algorithms. Table 1 reports the first quartile, the median and the third quartile of the numbers of correctly and incorrectly selected variables. We observe that the SPLS–NIPALS algorithm performs better, identifying larger numbers of correct variables with a smaller number of false positive results compared with the SPLS–SIMPLS algorithm. Further investigation reveals that the relevant variables that the SPLS–SIMPLS algorithm misses are typically from the H1‐component, which has the weaker signal.
| Method | Number of correct variables† | Number of incorrect variables† |
|---|---|---|
| SPLS–NIPALS | 9.75 / 12 / 13 | 0 / 0 / 2 |
| SPLS–SIMPLS | 7 / 9 / 13 | 0 / 2 / 5 |
- †First quartile/median/third quartile.
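The data‐generating mechanism of this comparison can be reproduced with a short R sketch; the sample size, the noise standard deviations and the number of response columns are not stated above, so the values used here are assumptions.

```r
# Sketch (ours) of the Section 5.1 data-generating mechanism (assumed n, variances and q).
set.seed(123)
n <- 100
cuts <- c(0, 6, 13, 30)                              # (n0, n1, n2, n3)
H <- replicate(3, rnorm(n))                          # hidden variables H1, H2, H3
X <- sapply(1:30, function(i) {
  j <- findInterval(i, cuts + 1)                     # block j with n_{j-1} < i <= n_j
  H[, j] + rnorm(n)
})
Y <- matrix(0, n, 4)                                 # q = 4 responses (assumed)
Y[, 1] <- 0.1 * H[, 1] - 2 * H[, 2] + rnorm(n)
for (i in 2:4) Y[, i] <- 1.2 * Y[, i - 1] + rnorm(n) # Y_{i+1} = 1.2 Y_i + f_i
```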
5.2. Setting the weight factor κ in the general regression formulation of problem (8)
We ran a small simulation study to examine how the generalization of the regression formulation given in expression (8) helps to avoid the local solution issue. The data‐generating mechanism is set as follows. Columns of X are generated by Xi=Hj+ɛi for nj−1+1 ≤ i ≤ nj, where j=1,…,4 and (n0,…,n4)=(0,4,8,10,100). Here, H1, H2 and H3 are multivariate normal random vectors and H4=0. The ɛis are independent identically distributed multivariate normal random vectors. For illustration, we use M=XTX. When κ=0.5, the algorithm becomes stuck at a local solution in 27 out of 100 simulation runs. When κ=0.1, 0.3, 0.4, the correct solution is obtained in all runs. This indicates that a slight imbalance giving less weight to the concave part of the objective function of formulation (8) might lead to a numerically easier optimization problem.
5.3. Comparisons with recent variable selection methods in terms of prediction power and variable selection
In this section, we compare SPLS regression with other popular methods in terms of prediction and variable selection performances in various correlated covariates settings. We include OLS and the lasso, which are not particularly tailored for correlated variables. We also consider dimension reduction methods such as PLS, principal component regression (PCR) and supervised PCs, which ought to be appropriate for highly correlated variables. The EN is also included in these comparisons since it can handle highly correlated variables.
We first consider the case where there is a reasonable number of observations (i.e. n>p) and set n=400 and p=40. We vary the number of spurious variables as q=10 and q=30, and the noise‐to‐signal ratios as 0.1 and 0.2. Hidden variables H1,…,H3 are independent multivariate normal random vectors, and the columns of the covariate matrix X are generated by Xi=Hj+ɛi for nj−1+1 ≤ i ≤ nj, where j=1,…,3, (n0,…,n3)=(0,(p−q)/2,p−q,p) and ɛ1,…,ɛp are drawn independently from a multivariate normal distribution. Y is generated by 3H1−4H2+f, where f is normally distributed with mean 0. This mechanism generates covariates, subsets of which are highly correlated.
We, then, consider the case where the sample size is smaller than the number of the variables (i.e. n<p) and set n=40 and p=80. The numbers of spurious variables are set to q=20 and q=40, and noise‐to‐signal ratios to 0.1 and 0.2 respectively. X and Y are generated similarly to the above n>p case.
We select the optimal tuning parameters for most of the methods by using tenfold CV. Since the CV curve tends to be flat in this simulation study, we first identify the parameter values whose CV scores are less than 1.1 times the minimum of the CV scores. Among these, we select the smallest K and the largest η for SPLS, the largest λ2 and the smallest step size for the EN, and the smallest step size for the lasso. We use the F‐statistic (the default CV score in the R package superpc) from the fitted model as a CV score for supervised PCs. We then use the same procedure to generate an independent test data set and predict Y on this test data set on the basis of the fitted models. For each parameter setting, we perform 30 runs of simulations and compute the mean and standard deviation of the mean‐squared prediction errors. The averages of the sensitivities and specificities are computed across the simulations to compare the accuracy of variable selection. The results are presented in Tables 2 and 3.
Mean‐squared prediction errors (standard errors in parentheses) for the following methods†:

| p/n/q/ns settings | PLS | PCR | OLS | Lasso | SPLS1 | SPLS2 | Supervised PCs | EN |
|---|---|---|---|---|---|---|---|---|
| 40/400/10/0.1 | 31417.9 (552.5) | 15717.1 (224.2) | 31444.4 (554.0) | 208.3 (10.4) | 199.8 (9.0) | 201.4 (11.2) | 198.6 (9.5) | 200.1 (10.0) |
| 40/400/10/0.2 | 31872.0 (544.4) | 16186.5 (231.4) | 31956.9 (548.9) | 697.3 (15.7) | 661.4 (13.9) | 658.7 (15.7) | 658.8 (14.2) | 685.5 (17.7) |
| 40/400/30/0.1 | 31409.1 (552.5) | 20914.2 (1324.4) | 31431.7 (554.2) | 205.0 (9.5) | 203.3 (10.1) | 205.5 (11.1) | 202.7 (9.4) | 203.1 (9.7) |
| 40/400/30/0.2 | 31863.7 (544.1) | 21336.0 (1307.6) | 31939.3 (549.1) | 678.6 (13.6) | 661.2 (14.4) | 663.5 (15.6) | 663.5 (14.4) | 684.9 (19.3) |
| 80/40/20/0.1 | 29121.4 (1583.2) | 15678.0 (652.9) | | 485.2 (48.4) | 538.4 (70.5) | 494.6 (63.0) | 720.0 (240.0) | 533.9 (75.3) |
| 80/40/20/0.2 | 30766.9 (1386.0) | 16386.5 (636.8) | | 1099.2 (86.0) | 1019.5 (74.6) | 965.5 (74.7) | 2015.8 (523.6) | 1050.7 (84.5) |
| 80/40/40/0.1 | 29116.2 (1591.7) | 17416.1 (924.2) | | 502.4 (54.0) | 506.9 (66.9) | 497.7 (62.8) | 522.7 (69.4) | 545.3 (77.1) |
| 80/40/40/0.2 | 29732.4 (1605.8) | 17940.8 (932.2) | | 1007.2 (82.9) | 1013.3 (78.7) | 964.4 (74.6) | 1080.6 (165.6) | 1018.7 (74.9) |
- †p, the number of covariates; n, the sample size; q, the number of spurious variables; ns, noise‐to‐signal ratio; SPLS1, SPLS tuned by FDR control (FDR = 0.1); SPLS2, SPLS tuned by CV; SE, standard error.
Sensitivities and specificities of variable selection for the following methods†:

| p/n/q/ns settings | Lasso sensitivity | Lasso specificity | SPLS1 sensitivity | SPLS1 specificity | SPLS2 sensitivity | SPLS2 specificity | SuperPC sensitivity | SuperPC specificity | EN sensitivity | EN specificity |
|---|---|---|---|---|---|---|---|---|---|---|
| 40/400/10/0.1 | 0.76 | 1.00 | 1.00 | 0.83 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 |
| 40/400/10/0.2 | 0.67 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 1.00 | 1.00 | 0.94 | 0.97 |
| 40/400/30/0.1 | 1.00 | 0.98 | 1.00 | 0.83 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 |
| 40/400/30/0.2 | 0.96 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 |
| 80/40/20/0.1 | 0.15 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.97 | 0.93 | 0.72 | 0.99 |
| 80/40/20/0.2 | 0.12 | 1.00 | 1.00 | 0.67 | 1.00 | 1.00 | 0.86 | 0.83 | 0.80 | 0.98 |
| 80/40/40/0.1 | 0.21 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 1.00 | 0.93 | 0.72 | 0.99 |
| 80/40/40/0.2 | 0.15 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.97 | 0.90 | 0.80 | 0.98 |
- †p, the number of covariates; n, the sample size; q, the number of spurious variables; ns, noise‐to‐signal ratio; SPLS1, SPLS tuned by FDR control (FDR = 0.1); SPLS2, SPLS tuned by CV.
Not surprisingly, the methods with an intrinsic variable selection property show smaller prediction errors than the methods lacking this property. For n>p, the lasso, SPLS, supervised PCs and the EN show similar prediction performances in all four scenarios. This holds for the n<p case, except that supervised PCs show a slight increase in prediction error for dense models (p=80 and q=20). In terms of model selection accuracy, SPLS, supervised PCs and the EN show excellent performances, whereas the lasso exhibits poor performance by missing relevant variables. SPLS performs better than the other methods in the n<p and high noise‐to‐signal ratio scenarios. We observe that the EN misses relevant variables in the n<p scenario, even though its L2‐penalty aims to handle these cases specifically. Moreover, the EN performs well for the right size of the regularization parameter λ2, but finding the optimal size objectively through CV seems to be a challenging task.
In general, both SPLS–CV and SPLS–FDR perform at least as well as other methods (Table 3). Especially, when n<p, the lasso fails to identify important variables, whereas SPLS regression succeeds. This is because, although the number of SPLS latent components is limited by n, the actual number of variables that makes up the latent components can exceed n.
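The tuning rule used throughout this section (retain parameter values whose CV score is within 1.1 times the minimum, then take the smallest K and the largest η) can be sketched as follows; cv_score stands for a user‐supplied function that returns the CV criterion for a given (K, η) pair, and the grids are arbitrary.

```r
# Sketch (ours) of the '1.1 times the minimum CV score' tuning rule for SPLS.
select_spls_tuning <- function(cv_score, K_grid = 1:5, eta_grid = seq(0.1, 0.9, by = 0.1)) {
  grid <- expand.grid(K = K_grid, eta = eta_grid)
  grid$score <- mapply(cv_score, grid$K, grid$eta)
  near_opt <- grid[grid$score <= 1.1 * min(grid$score), ]
  near_opt <- near_opt[near_opt$K == min(near_opt$K), ]   # prefer the smallest K
  near_opt[which.max(near_opt$eta), c("K", "eta")]        # then the largest eta
}
```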
5.4. Comparisons of predictive power among methods that handle multicollinearity
In this section, we compare SPLS regression with some of the popular methods that handle multicollinearity, such as PLS, PCR, ridge regression, a mixed variance–covariance approach, gene shaving (Hastie et al., 2000) and supervised PCs (Bair et al., 2006). These comparisons are motivated by those presented in Bair et al. (2006). We compare only prediction performances since, with the exception of gene shaving and supervised PCs, these methods are not equipped with variable selection. For the dimension reduction methods, we allow only one latent component for a fair comparison.
Throughout these simulations, we set p=5000 and n=100. All the scenarios follow the general model of Y=Xβ+f, but the underlying data generation for X is varying. We devise simulation scenarios where the multicollinearity is due to the presence of one main latent variable (simulations 1 and 2), the presence of multiple latent variables (simulation 3) and the presence of a correlation structure that is not induced by latent variables but some other mechanism (simulation 4). We select the optimal tuning parameters and compute the prediction errors as in Section 5.3. The results are summarized in Table 4.
Mean‐squared prediction errors (standard errors in parentheses) for the following simulations†:

| Method | Simulation 1 | Simulation 2 | Simulation 3 | Simulation 4 |
|---|---|---|---|---|
| PCR1 | 320.67 (8.07) | 308.93 (7.13) | 241.75 (5.62) | 2730.53 (75.82) |
| PLS1 | 301.25 (7.32) | 292.70 (7.69) | 209.19 (4.58) | 1748.53 (47.47) |
| Ridge regression | 304.80 (7.47) | 296.36 (7.81) | 211.59 (4.70) | 1723.58 (46.41) |
| Supervised PC | 252.01 (9.71) | 248.26 (7.68) | 134.90 (3.34) | 263.46 (14.98) |
| SPLS1(FDR) | 256.22 (13.82) | 246.28 (7.87) | 139.01 (3.74) | 290.78 (13.29) |
| SPLS1(CV) | 257.40 (9.66) | 261.14 (8.11) | 120.27 (3.42) | 195.63 (7.59) |
| Mixed variance–covariance | 301.05 (7.31) | 292.46 (7.67) | 209.45 (4.58) | 1748.65 (47.58) |
| Gene shaving | 255.60 (9.28) | 292.46 (7.67) | 119.39 (3.31) | 203.46 (7.95) |
| True | 224.13 (5.12) | 218.04 (6.80) | 96.90 (3.02) | 99.12 (2.50) |
- †PCR1, PCR with one component; PLS1, PLS with one component; SPLS1(FDR), SPLS with one component tuned by FDR control (FDR = 0.4); SPLS1(CV), SPLS with one component tuned by CV; True, true model.
The first simulation scenario is the same as the ‘simple simulation’ that was utilized by Bair et al. (2006), where hidden components H1 and H2 are defined as follows: H1j equals 3 for 1 ≤ j ≤ 50 and 4 for 51 ≤ j ≤ n, and H2j=3.5 for 1 ≤ j ≤ n. Columns of X are generated by Xi=H1+ɛi for 1 ≤ i ≤ 50 and Xi=H2+ɛi for 51 ≤ i ≤ p, where the ɛis are independent identically distributed multivariate normal random vectors. β is a p×1 vector, where the ith element is 1/25 for 1 ≤ i ≤ 50 and 0 for 51 ≤ i ≤ p. f is a multivariate normal random vector. Although this scenario is ideal for supervised PCs in that Y is related to one main hidden component, SPLS regression shows a comparable performance with supervised PCs and gene shaving.
The second simulation was referred to as the ‘hard simulation’ by Bair et al. (2006), where more complicated hidden components are generated, and the rest of the data generation remains the same as in the simple simulation. H1,…,H5 are generated by H1j=3 I(j ≤ 50)+4 I(j>50), H2j=3.5+1.5 I(u1j ≤ 0.4), H3j=3.5+0.5 I(u2j ≤ 0.7), H4j=3.5−1.5 I(u3j ≤ 0.3) and H5j=3.5, for 1 ≤ j ≤ n, where u1j, u2j and u3j are independent identically distributed random variables from Unif(0,1). Columns of X are generated by Xi=Hj+ɛi for nj−1+1 ≤ i ≤ nj, where j=1,…,5 and (n0,…,n5)=(0,50,100,200,300,p). As seen in Table 4, when there are complex latent components, SPLS and supervised PCs show the best performance. These two simulation studies illustrate that both SPLS and supervised PCs have good prediction performances under the latent component model with few relevant variables.
The third simulation is designed to compare the prediction performances of the methods when all methods are allowed to use only one latent component, even though more than one hidden component is related to Y. This scenario aims to illustrate the differences in the derived latent components depending on whether they are guided by the response Y. H1 and H2 are generated as H1j=2.5 I(j ≤ 50)+4 I(j>50) and H2j=2.5 I(1 ≤ j ≤ 25 or 51 ≤ j ≤ 75)+4 I(26 ≤ j ≤ 50 or 76 ≤ j ≤ 100). (H3,…,H6) are defined in the same way as (H2,…,H5) in the second simulation. Columns of X are generated by Xi=Hj+ɛi for nj−1+1 ≤ i ≤ nj, j=1,…,6, and (n0,…,n6)=(0,25,50,100,200,300,p). f is a multivariate normal random vector. Gene shaving and SPLS both exhibit good predictive performance in this scenario. In a way, when the number of components in the model is fixed, the methods which utilize Y when deriving latent components can achieve better predictive performances than methods that utilize only X when deriving these vectors. This agrees with the prior observation that PLS typically requires a smaller number of latent components than PCA does (Frank and Friedman, 1993).
The fourth simulation is designed to compare the prediction performances of the methods when the relevant variables are not governed by a latent variable model. We generate the first 50 columns of X from a multivariate normal distribution with an auto‐regressive covariance structure, and the remaining 4950 columns of X are generated from hidden components as before. Five hidden components are generated as follows: H1j equals 1 for 1 ≤ j ≤ 50 and 6 for 51 ≤ j ≤ n, and H2,…,H5 are the same as in the second simulation. Denoting X=(X(1),X(2)) by using a partitioned matrix, we generate rows of X(1) from N(0, Σ50×50), where Σ50×50 is from an AR(1) process with an auto‐correlation ρ=0.9. Columns of X(2) are generated by Xi=Hj+ɛi for nj−1+1 ≤ i ≤ nj, where j=1,…,5 and (n0,…,n5)=(0,50,100,200,300,p−50). β is a p×1 vector and its ith element is given by βi=kj for nj−1+1 ≤ i ≤ nj, where j=1,…,6, (n0,…,n6)=(0,10,20,30,40,50,p) and (k1,…,k6)=(8,6,4,2,1,0)/25. SPLS regression and gene shaving perform well, indicating that they have the ability to handle such a correlation structure. As in the third simulation, these two methods may gain some advantage in handling more general correlation structures by utilizing the response Y when deriving direction vectors.
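The only non‐standard ingredient of this fourth scenario is the auto‐regressive block; a minimal R sketch of its generation (with an assumed sample size) is given below.

```r
# Sketch (ours): the first 50 columns of X in simulation 4, drawn from N(0, Sigma)
# with an AR(1) covariance Sigma (rho = 0.9); n is an assumed value.
set.seed(4)
n <- 100; rho <- 0.9
Sigma <- rho^abs(outer(1:50, 1:50, "-"))     # AR(1) covariance, 50 x 50
L <- chol(Sigma)                             # upper triangular factor, Sigma = L'L
X1 <- matrix(rnorm(n * 50), n, 50) %*% L     # rows of X(1) have covariance Sigma
```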
6. Case‐study: application to yeast cell cycle data set
Transcription factors (TFs) play an important role for interpreting a genome's regulatory code by binding to specific sequences to induce or repress gene expression. It is of general interest to identify TFs which are related to regulation of the cell cycle, which is one of the fundamental processes in a eukaryotic cell. Recently, Boulesteix and Strimmer (2005) performed an integrative analysis of gene expression and CHIP–chip data measuring the amount of transcription and physical binding of TFs respectively, to address this question. Their analysis focused on estimation rather than variable selection. In this section, we focus on identifying cell cycle regulating TFs.
We utilize a yeast cell cycle gene expression data set from Spellman et al. (1998). This experiment measures messenger ribonucleic acid levels every 7 min for 119 min with a total of 18 measurements covering two cell cycle periods. The second data set, CHIP–chip data of Lee et al. (2002), contains binding information of 106 TFs which elucidates which transcriptional regulators bind to promoter sequences of genes across the yeast genome. After excluding genes with missing values in either of the experiments, 542 cell‐cycle‐related genes are retained.
We analyse these data sets with our proposed multivariate (SPLS–NIPALS) and univariate SPLS regression methods, and also with the lasso for a comparison and summarize the results in Table 5. Since CHIP–chip data provide a proxy for the binary outcome of binding, we scale the CHIP–chip data and use tenfold CV for tuning. Multivariate SPLS selects the least number of TFs (32 TFs), and univariate SPLS selects 70 TFs. The lasso selects the largest number of TFs, 100 out of 106. There are a total of 21 experimentally confirmed cell‐cycle‐related TFs (Wang et al., 2007), and we report the number of confirmed TFs among those selected as a guideline for performance comparisons. In Table 5, we also report a hypergeometric probability calculation quantifying chance occurrences of the number of confirmed TFs among the variables selected by each method. A comparison of these probabilities indicates that multivariate SPLS has more evidence that selection of a large number of confirmed TFs is not due to chance.
| Method | Number of TFs selected (s) | Number of confirmed TFs (k) | Prob(K ≥ k) |
|---|---|---|---|
| Multivariate SPLS | 32 | 10 | 0.034 |
| Univariate SPLS | 70 | 17 | 0.058 |
| Lasso | 100 | 21 | 0.256 |
| Total | 106 | 21 | |
- †Prob(K ≥ k) denotes the probability of observing at least k confirmed variables out of 85 unconfirmed and 21 confirmed variables in a random draw of s variables.
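The chance probabilities in the last column of Table 5 are hypergeometric tail probabilities and can be reproduced with a single call to phyper in R.

```r
# Prob(K >= k): probability of drawing at least k of the 21 confirmed TFs when s of
# the 106 TFs are selected at random.
prob_at_least <- function(k, s, confirmed = 21, total = 106) {
  phyper(k - 1, m = confirmed, n = total - confirmed, k = s, lower.tail = FALSE)
}
prob_at_least(10, 32)   # multivariate SPLS row of Table 5
prob_at_least(17, 70)   # univariate SPLS row of Table 5
```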
We next compare the results from multivariate and univariate SPLS. There are a total of 28 TFs which are selected by both methods, and nine of these are experimentally verified according to the literature. The estimators, i.e. TF activities, of the selected TFs in general show periodicity. This is indeed a desirable property since the 18 time points cover two periods of a cell cycle. Interestingly, as depicted in Fig. 1, multivariate SPLS regression obtains smoother estimates of TF activities than univariate SPLS. A total of four TFs are selected only by multivariate SPLS regression. Their coefficients are small but consistent across the time points (Fig. 2). A total of 42 TFs are selected only by univariate SPLS, and eight of these are among the confirmed TFs. These TFs do not show periodicity or have non‐zero coefficients at only a few time points (the data are not shown). In general, multivariate SPLS regression can capture weak effects that are consistent across the time points.

Fig. 1. Estimated TF activities for the 21 confirmed TFs, as estimated by the multivariate and by the univariate SPLS regressions (plots for ABF‐1, CBF‐1, GCR2 and SKN7 are not displayed since the TF activities of these factors were zero by both the univariate and the multivariate SPLS; the y‐axis denotes estimated coefficients and the x‐axis is time). Multivariate SPLS regression yields smoother estimates and exhibits periodicity.

Fig. 2. Estimated TF activities for the TFs selected only by the multivariate SPLS regression; the magnitudes of the estimated TF activities are small but consistent across the time points.
7. Discussion
PLS regression has been successfully utilized in ill‐conditioned linear regression problems that arise in several scientific disciplines. Goutis (1996) showed that PLS yields shrinkage estimators. Butler and Denham (2000) argued that it may provide peculiar shrinkage in the sense that some of the components of the regression coefficient vector may expand instead of shrinking. However, as argued by Rosipal and Krämer (2006), this does not necessarily lead to worse shrinkage because PLS estimators are highly non‐linear. We showed that both univariate and multivariate PLS regression estimators are consistent under the latent model assumption with strong restrictions on the number of variables and the sample size. This makes the suitability of PLS for the contemporary very large p and small n paradigm questionable. We argued and illustrated that imposing sparsity on direction vectors helps to avoid sample size problems in the presence of large numbers of irrelevant variables. We further developed a regression technique called SPLS. SPLS regression is also likely to yield shrinkage estimators since the methodology can be considered as a form of PLS regression on a restricted set of predictors. Analysis of its shrinkage properties is among our current investigations. SPLS regression is computationally efficient since it solves a linear equation by employing a CG algorithm rather than matrix inversion at each step.
We presented the solution of the SPLS criterion for the direction vectors and proposed an accompanying SPLS regression algorithm. Our SPLS regression algorithm has connections to other variable selection algorithms including the EN (Zou and Hastie, 2005) and the threshold gradient (Friedman and Popescu, 2004) method. The EN method deals with collinearity in variable selection by incorporating the ridge regression method into the LARS algorithm. In a way, SPLS handles the same issue by fusing the PLS technique into the LARS algorithm. SPLS can also be related to the threshold gradient method in that both algorithms use only the thresholded gradient and not the Hessian. However, SPLS achieves faster convergence by using the CG.
We presented proof‐of‐principle simulation studies with combinations of small and large number of predictors and sample sizes. These illustrated that SPLS regression achieves both high predictive power and accuracy for finding the relevant variables. Moreover, it can select a higher number of relevant variables than the available sample size since the number of variables that contribute to the direction vectors is not limited by the sample size.
Our application of SPLS involved two recent genomic data types, namely gene expression data and genomewide binding data of TFs. The response variable was continuous and a linear modelling framework followed naturally. Extensions of SPLS to other modelling frameworks such as generalized linear models and survival models are exciting future directions. Our integrative analysis of expression and TF binding data highlighted the use of SPLS within the context of a multivariate response. We expect that several genomic problems with multivariate responses, e.g. linking expression of a cluster of genes to genetic marker data, might lend themselves to the multivariate SPLS framework. We provide an implementation of the SPLS regression methodology as an R package at http://cran.r‐project.org/web/packages/spls.
Acknowledgements
This research has been supported by National Institutes of Health grant HG003747 and National Science Foundation grant DMS 0804597 to SK.
Appendix
Appendix A: Proofs of the theorems
We first introduce lemmas 2 and 3 and then utilize these in the proof of theorem 1. ‖A‖2 for matrix A ∈ Rn×k is defined as the largest singular value of A.

Lemma 2. Under assumptions 1 and 2, if p/n → k0 ∈ [0,∞), then ‖SXX − ΣXX‖2 = Op{√(p/n)} and ‖SXY − σXY‖2 = Op{√(p/n)}.

Proof. The first part of lemma 2 was proved by Johnstone and Lu (2004), and we shall show the second part on the basis of their argument. We decompose SXY−σXY as (An+Bn+Cn)β+Dn, where An+Bn+Cn is the decomposition of SXX−ΣXX used by Johnstone and Lu (2004) and Dn = σ2XTf/n. We remark that here E is defined to be an n×p matrix of which the ith row is ei, whereas the corresponding matrix Z in Johnstone and Lu (2004) is a p×n matrix. We aim to show that the norm of each component of the decomposition is Op{√(p/n)}. Johnstone and Lu (2004) showed that, if p/n→k0 ∈ [0,∞), then ‖An‖2→0, ‖Bn‖2→σ1√k0 Σjϱj and ‖Cn‖2 converges, almost surely. Hence, we examine ‖Dn‖2, whose components have the distributions υjTf =d χnχ1Uj for 1 ≤ j ≤ m and ETf =d χnχpUm+1, where χn, χ1 and χp denote the square roots of χ2 random variables with n, 1 and p degrees of freedom respectively and the Ujs are random vectors that are uniform on the surface of the unit sphere Sp−1 in Rp. After denoting aj=υjTf for 1 ≤ j ≤ m and am+1=ETf, we have that ‖aj/n‖2→0 almost surely, for 1 ≤ j ≤ m, and ‖am+1/n‖2→√k0 almost surely from the previous results on the distributions. By using a version of the dominated convergence theorem (Pratt, 1960), the results follow; in particular ‖Dn‖2→√k0σ1σ2 almost surely, and thus the lemma is proved.
(11)
(12)

A.1. Proof of theorem 1

. First, we establish that


It is sufficient to show that
and
in probability.

since

are finite as (RΣXXR)−1 and
are non‐singular for a given K. Using this fact as well as the triangular and Hölder's inequalities, we can easily show the second claim. The third claim follows by the fact that
in probability, lemma 2 and the triangular and Hölder's inequalities.
Next, we can establish that
by using the same argument of proposition 1 of Naik and Tsai (2000).
almost surely,
(13)
in probability for p/n→k0(>0),
(14)Since ‖ETf/n‖2≠0 almost surely, equation (14) implies that
as n→∞.
This contradicts the fact that ETf=dχ(n)χ(p)Up, where Up is a vector uniform on the surface of the unit sphere Sp−1, as the dimension of
is p−K.
A.2. Proof of lemma 1
, and

. Then, we have

A.3. Proof of theorem 2
Denote ŴK=(ŵ1,…,ŵK) and DK=(d1,…,dK) as the direction vectors for the original covariates and the deflated covariates respectively.
is obtained by SXYt1/‖SXYt1‖2, where t1 is the right singular vector of SXY. We denote si,1=SXYti/‖SXYti‖2, as this form of vector recurs in the remaining steps. Then,
. Define ψi as the step size vector at the ith step, and the ith current correlation matrix
as
. The current correlation matrix at the second step is given by
and thus the second direction vector d2 is proportional to
, where t2 is the right singular vector of
. Then
, where
. Similarly, we can obtain

Now, we observe that
does not form a Krylov space, because si,1 is not the same as s1,1 for multivariate Y. However, it forms a Krylov space for large n, since ‖SXYt/‖SXYt‖2−w1‖2→0 for any q‐dimensional random vector t subject to ‖t‖2=1 almost surely, following lemma 1.
, where
, we can characterize

We note that si,1s and li,js from the NIPALS and SIMPLS algorithms are different because the tis are from
and
for the NIPALS and SIMPLS algorithms respectively.
Next, we shall focus on the convergence of the NIPALS estimator, because the convergence of the SIMPLS estimator can be proved by the same argument owing to the structural similarity of
and
.
Denoting
and
, one can show that
by using the fact ‖si,1−w1‖2=Op{√(p/n)} for i=1,…,K. Since
can also be represented as
, we have that
. Thus, we now deal with the convergence of
, which has a similar form to that of the univariate response case.
for i=1,…,K, one can show that
, where R=(w1,ΣXXw1,…,ΣXX^{K−1}w1). The convergence of the estimator can be established similarly to the argument in theorem 1 with the following additional argument:
(15)
independently. The remainder of the proof is a simple extension of the proof of theorem 1.
in probability, when p/n→k0 (>0). Following the argument in the proof of theorem 1, we have

Since ‖ETF/n‖2≠0 almost surely, this equation implies that
as n→∞.
If p/n→k0 (>0), this contradicts the fact that ETFi=dχ(n)χ(p)Up, where Fi denotes the ith column of F and Up is a vector uniform on the surface of the unit sphere Sp−1, as the dimension of
is p−K.
- Álvaro Cuadros-Inostroza, Claudio Verdugo-Alegría, Lothar Willmitzer, Yerko Moreno-Simunovic, José G. Vallarino, Non-Targeted Metabolite Profiles and Sensory Properties Elucidate Commonalities and Differences of Wines Made with the Same Variety but Different Cultivar Clones, Metabolites, 10.3390/metabo10060220, 10, 6, (220), (2020).
- Jie Huang, Jiazhou Chen, Bin Zhang, Lei Zhu, Hongmin Cai, Evaluation of gene–drug common module identification methods using pharmacogenomics data, Briefings in Bioinformatics, 10.1093/bib/bbaa087, (2020).
- Lasanthi C. R. Pelawa Watagoda, David J. Olive, Comparing six shrinkage estimators with large sample theory and asymptotically optimal prediction intervals, Statistical Papers, 10.1007/s00362-020-01193-1, (2020).
- Sadi Alawadi, David Mera, Manuel Fernández-Delgado, Fahed Alkhabbas, Carl Magnus Olsson, Paul Davidsson, A comparison of machine learning algorithms for forecasting indoor temperature in smart buildings, Energy Systems, 10.1007/s12667-020-00376-x, (2020).
- Chuan-Quan Li, You-Wu Lin, Qing-Song Xu, An enhanced random forest with canonical partial least squares for classification, Communications in Statistics - Theory and Methods, 10.1080/03610926.2020.1716249, (1-11), (2020).
- Reza Foodeh, Saeed Ebadollahi, Mohammad Reza Daliri, Regularized Partial Least Square Regression for Continuous Decoding in Brain-Computer Interfaces, Neuroinformatics, 10.1007/s12021-020-09455-x, (2020).
- Susana Santos, Léa Maitre, Charline Warembourg, Lydiane Agier, Lorenzo Richiardi, Xavier Basagaña, Martine Vrijheid, Applying the exposome concept in birth cohort research: a review of statistical approaches, European Journal of Epidemiology, 10.1007/s10654-020-00625-4, (2020).
- Maryam Fatemi, Mohammad Reza Daliri, Nonlinear sparse partial least squares: an investigation of the effect of nonlinearity and sparsity on the decoding of intracranial data, Journal of Neural Engineering, 10.1088/1741-2552/ab5d47, 17, 1, (016055), (2020).
- Kai Xu, Zhiling Shen, Xudong Huang, Qing Cheng, Projection correlation between scalar and vector variables and its use in feature screening with multi-response data, Journal of Statistical Computation and Simulation, 10.1080/00949655.2020.1753057, (1-20), (2020).
- Debamita Kundu, Riten Mitra, Jeremy T. Gaskins, Bayesian variable selection for multioutcome models through shared shrinkage, Scandinavian Journal of Statistics, 10.1111/sjos.12455, 0, 0, (2020).
- Linjun Chen, Guangquan Lu, Yangding Li, Jiaye Li, Malong Tan, Local Structure Preservation for Nonlinear Clustering, Neural Processing Letters, 10.1007/s11063-020-10251-6, (2020).
- Alvaro Mendez-Civieta, M. Carmen Aguilera-Morillo, Rosa E. Lillo, Adaptive sparse group LASSO in quantile regression, Advances in Data Analysis and Classification, 10.1007/s11634-020-00413-8, (2020).
- Adolphus Wagala, Graciela González-Farías, Rogelio Ramos, Oscar Dalmau, PLS Generalized Linear Regression and Kernel Multilogit Algorithm (KMA) for Microarray Data Classification Problem, Revista Colombiana de Estadística, 10.15446/rce.v43n2.81811, 43, 2, (233-249), (2020).
- Dunfu Yang, Gyuhyeong Goh, Haiyan Wang, A fully Bayesian approach to sparse reduced-rank multivariate regression, Statistical Modelling, 10.1177/1471082X20948697, (1471082X2094869), (2020).
- Matteo Zoccolillo, Claudia Moia, Sergio Comincini, Davide Cittaro, Dejan Lazarevic, Karen A. Pisani, Jan M. Wit, Mauro Bozzola, Identification of novel genetic variants associated with short stature in a Baka Pygmies population, Human Genetics, 10.1007/s00439-020-02191-x, (2020).
- Kai Bao, Xiaofei Li, Lucy Poveda, Weihong Qi, Nathalie Selevsek, Pinar Gumus, Gulnur Emingil, Jonas Grossmann, Patricia I. Diaz, George Hajishengallis, Nagihan Bostanci, Georgios N. Belibasakis, Proteome and Microbiome Mapping of Human Gingival Tissue in Health and Disease, Frontiers in Cellular and Infection Microbiology, 10.3389/fcimb.2020.588155, 10, (2020).
- Chuan‐Quan Li, Qing‐Song Xu, High‐dimensional spectral data classification with nonparametric feature screening, Journal of Chemometrics, 10.1002/cem.3199, 34, 3, (2019).
- Ida Henriette Caspersen, Cathrine Thomsen, Line Småstuen Haug, Helle K. Knutsen, Anne Lise Brantsæter, Eleni Papadopoulou, Iris Erlund, Thomas Lundh, Jan Alexander, Helle Margrete Meltzer, Patterns and dietary determinants of essential and toxic elements in blood measured in mid-pregnancy: The Norwegian Environmental Biobank, Science of The Total Environment, 10.1016/j.scitotenv.2019.03.291, 671, (299-308), (2019).
- Ryan M.J. Genga, Eric M. Kernfeld, Krishna M. Parsi, Teagan J. Parsons, Michael J. Ziller, René Maehr, Single-Cell RNA-Sequencing-Based CRISPRi Screening Resolves Molecular Drivers of Early Human Endoderm Development, Cell Reports, 10.1016/j.celrep.2019.03.076, 27, 3, (708-718.e10), (2019).
- Megan M. Niedzwiecki, Douglas I. Walker, Roel Vermeulen, Marc Chadeau-Hyam, Dean P. Jones, Gary W. Miller, The Exposome: Molecules to Populations, Annual Review of Pharmacology and Toxicology, 10.1146/annurev-pharmtox-010818-021315, 59, 1, (107-127), (2019).
- Mbulisi Sibanda, Onisimo Mutanga, Timothy Dube, Mologadi C. Mothapo, Paramu L. Mafongoya, Remote sensing equivalent water thickness of grass treated with different fertiliser regimes using resample HyspIRI and EnMAP data, Physics and Chemistry of the Earth, Parts A/B/C, 10.1016/j.pce.2018.12.003, (2019).
- Dengdeng Yu, Li Zhang, Ivan Mizera, Bei Jiang, Linglong Kong, Sparse wavelet estimation in quantile regression with multiple functional predictors, Computational Statistics & Data Analysis, 10.1016/j.csda.2018.12.002, (2019).
- Gokhan Mert Yagli, Dazhi Yang, Dipti Srinivasan, Automatic hourly solar forecasting using machine learning models, Renewable and Sustainable Energy Reviews, 10.1016/j.rser.2019.02.006, 105, (487-498), (2019).
- Balu K. Chacko, Matthew R. Smith, Michelle S. Johnson, Gloria Benavides, Matilda L. Culp, Jyotsna Pilli, Sruti Shiva, Karan Uppal, Young-Mi Go, Dean P. Jones, Victor M. Darley-Usmar, Mitochondria in precision medicine; linking bioenergetics and metabolomics in platelets, Redox Biology, 10.1016/j.redox.2019.101165, (101165), (2019).
- Sumira Jan, Parvaiz Ahmad, An Integrated Approach to Plant Biology via Multi-Analogous Methods, Ecometabolomics, 10.1016/B978-0-12-814872-3.00002-3, (57-126), (2019).
- C. S. Zhao, T. L. Pan, S. T. Yang, Y. Sun, Y. Zhang, Y. R. Ge, B. E. Dong, Z. S. Zhang, H. M. Zhang, Quantifying the response of aquatic biodiversity to variations in river hydrology and water quality in a healthy water ecology pilot city, China, Marine and Freshwater Research, 10.1071/MF18385, 70, 5, (670), (2019).
- Jose Camacho, Gabriel Macia-Fernandez, Noemi Marta Fuentes-Garcia, Edoardo Saccenti, Semi-Supervised Multivariate Statistical Network Monitoring for Learning Security Threats, IEEE Transactions on Information Forensics and Security, 10.1109/TIFS.2019.2894358, 14, 8, (2179-2189), (2019).
- Ruxianguli Aimuzi, Kai Luo, Qian Chen, Hui Wang, Liping Feng, Fengxiu Ouyang, Jun Zhang, Perfluoroalkyl and polyfluoroalkyl substances and fetal thyroid hormone levels in umbilical cord blood among newborns by prelabor caesarean delivery, Environment International, 10.1016/j.envint.2019.104929, 130, (104929), (2019).
- Nicolas Cain, Oliver Alka, Torben Segelke, Kristian von Wuthenau, Oliver Kohlbacher, Markus Fischer, Food fingerprinting: Mass spectrometric determination of the cocoa shell content (Theobroma cacao L.) in cocoa products by HPLC-QTOF-MS, Food Chemistry, 10.1016/j.foodchem.2019.125013, 298, (125013), (2019).
- Marc Chadeau-Hyam, Roel Vermeulen, Statistical Models to Explore the Exposome: From OMICs Profiling to ‘Mechanome’ Characterization, Unraveling the Exposome, 10.1007/978-3-319-89321-1, (279-314), (2019).
- Soufiane Ajana, Niyazi Acar, Lionel Bretillon, Boris P Hejblum, Hélène Jacqmin-Gadda, Cécile Delcourt, Niyazi Acar, Soufiane Ajana, Olivier Berdeaux, Sylvain Bouton, Lionel Bretillon, Alain Bron, Benjamin Buaud, Stéphanie Cabaret, Audrey Cougnard-Grégoire, Catherine Creuzot-Garcher, Cécile Delcourt, Marie-Noelle Delyfer, Catherine Féart-Couret, Valérie Febvret, Stéphane Grégoire, Zhiguo He, Jean-François Korobelnik, Lucy Martine, Bénédicte Merle, Carole Vaysse, Benefits of dimension reduction in penalized regression methods for high-dimensional grouped data: a case study in low sample size, Bioinformatics, 10.1093/bioinformatics/btz135, 35, 19, (3628-3634), (2019).
- Nina Lazarevic, Adrian G. Barnett, Peter D. Sly, Luke D. Knibbs, Statistical Methodology in Studies of Prenatal Exposure to Mixtures of Endocrine-Disrupting Chemicals: A Review of Existing Approaches and New Alternatives, Environmental Health Perspectives, 10.1289/EHP2207, 127, 2, (026001), (2019).
- William R. Johnson, Ajmal Mian, David G. Lloyd, Jacqueline A. Alderson, On-field player workload exposure and knee injury risk monitoring via deep learning, Journal of Biomechanics, 10.1016/j.jbiomech.2019.07.002, (2019).
- David Coronado-Gutiérrez, Gorane Santamaría, Sergi Ganau, Xavier Bargalló, Stefania Orlando, M. Eulalia Oliva-Brañas, Alvaro Perez-Moreno, Xavier P. Burgos-Artizzu, Quantitative Ultrasound Image Analysis of Axillary Lymph Nodes to Diagnose Metastatic Involvement in Breast Cancer, Ultrasound in Medicine & Biology, 10.1016/j.ultrasmedbio.2019.07.413, (2019).
- Ahmad Mani-Varnosfaderani, Sparse Methods, Reference Module in Chemistry, Molecular Sciences and Chemical Engineering, 10.1016/B978-0-12-409547-2.14592-5, (2019).
- Matthew Ryan Smith, Balu K. Chacko, Michelle S. Johnson, Gloria A. Benavides, Karan Uppal, Young-Mi Go, Dean P. Jones, Victor M. Darley-Usmar, A precision medicine approach to defining the impact of doxorubicin on the bioenergetic-metabolite interactome in human platelets, Redox Biology, 10.1016/j.redox.2019.101311, (101311), (2019).
- Chao Shang, Xiaolin Huang, Fan Yang, Dexian Huang, undefined, 2019 1st International Conference on Industrial Artificial Intelligence (IAI), 10.1109/ICIAI.2019.8850796, (1-6), (2019).
- Li Xiao, Julia M. Stephen, Tony W. Wilson, Vince D. Calhoun, Yu-Ping Wang, Alternating Diffusion Map Based Fusion of Multimodal Brain Connectivity Networks for IQ Prediction, IEEE Transactions on Biomedical Engineering, 10.1109/TBME.2018.2884129, 66, 8, (2140-2151), (2019).
- Guang-Hui Fu, Min-Jie Zong, Feng-Hua Wang, Lun-Zhao Yi, A Comparison of Sparse Partial Least Squares and Elastic Net in Wavelength Selection on NIR Spectroscopy Data, International Journal of Analytical Chemistry, 10.1155/2019/7314916, 2019, (1-12), (2019).
- C. M. Clingensmith, S. Grunwald, S. P. Wani, Evaluation of calibration subsetting and new chemometric methods on the spectral prediction of key soil properties in a data‐limited environment, European Journal of Soil Science, 10.1111/ejss.12753, 70, 1, (107-126), (2019).
- Stefan Feuerriegel, Julius Gordon, News-based forecasts of macroeconomic indicators: A semantic path model for interpretable predictions, European Journal of Operational Research, 10.1016/j.ejor.2018.05.068, 272, 1, (162-175), (2019).
- Petter Stefansson, Ulf G. Indahl, Kristian H. Liland, Ingunn Burud, Orders of magnitude speed increase in partial least squares feature selection with new simple indexing technique for very tall data sets, Journal of Chemometrics, 10.1002/cem.3141, 33, 11, (2019).
- Tahir Mehmood, Maryam Sadiq, Muhammad Aslam, Filter-Based Factor Selection Methods in Partial Least Squares Regression, IEEE Access, 10.1109/ACCESS.2019.2948782, 7, (153499-153508), (2019).
- Laura Febvay, Erwann Hamon, Raphaël Recht, Nicolas Andres, Mathilde Vincent, Dalal Aoudé‐Werner, Hervé This, Identification of markers of thermal processing (“roasting”) in aqueous extracts of L. seeds through NMR fingerprinting and chemometrics, Magnetic Resonance in Chemistry, 10.1002/mrc.4834, 57, 9, (589-602), (2019).
- Cletah Shoko, Onisimo Mutanga, Timothy Dube, Remotely sensed C3 and C4 grass species aboveground biomass variability in response to seasonal climate and topography, African Journal of Ecology, 10.1111/aje.12622, 57, 4, (477-489), (2019).
- Peter Filzmoser, Sven Serneels, Ricardo Maronna, Christophe Croux, Robust Multivariate Methods in Chemometrics, Reference Module in Chemistry, Molecular Sciences and Chemical Engineering, 10.1016/B978-0-12-409547-2.14642-6, (2019).
- Young-Mi Go, Matthew R. Smith, Douglas I. Walker, Karan Uppal, Patricia Rohrbeck, Pamela L. Krahl, Philip K. Hopke, Mark J. Utell, Timothy M. Mallon, Dean P. Jones, Metabolome-Wide Association Study of Deployment to Balad, Iraq or Bagram, Afghanistan, Journal of Occupational and Environmental Medicine, 10.1097/JOM.0000000000001665, 61, (S25-S34), (2019).
- Agoston Mihalik, Fabio S. Ferreira, Michael Moutoussis, Gabriel Ziegler, Rick A. Adams, Maria J. Rosa, Gita Prabhu, Leticia de Oliveira, Mirtes Pereira, Edward T. Bullmore, Peter Fonagy, Ian M. Goodyer, Peter B. Jones, John Shawe-Taylor, Raymond Dolan, Janaina Mourao-Miranda, Multiple hold-outs with stability: improving the generalizability of machine learning analyses of brain-behaviour relationships, Biological Psychiatry, 10.1016/j.biopsych.2019.12.001, (2019).
- John H. Kalivas, Steven D. Brown, Calibration Methodologies, Reference Module in Chemistry, Molecular Sciences and Chemical Engineering, 10.1016/B978-0-12-409547-2.14666-9, (2019).
- Yuxi Wang, Zhenhong Jia, Jie Yang, An Variable Selection Method of the Significance Multivariate Correlation Competitive Population Analysis for Near-Infrared Spectroscopy in Chemical Modeling, IEEE Access, 10.1109/ACCESS.2019.2954115, 7, (167195-167209), (2019).
- Hee-Yeon Suh, Ho-Jin Lee, Yun-Sic Lee, Soo-Heang Eo, Richard E. Donatelli, Shin-Jae Lee, Predicting soft tissue changes after orthognathic surgery: The sparse partial least squares method , The Angle Orthodontist, 10.2319/120518-851.1, 89, 6, (910-916), (2019).
- Tulio L. Campos, Pasi K. Korhonen, Robin B. Gasser, Neil D. Young, An evaluation of machine learning approaches for the prediction of essential genes in eukaryotes using protein sequence-derived features, Computational and Structural Biotechnology Journal, 10.1016/j.csbj.2019.05.008, (2019).
- Hu Peng, Yuchen Jiang, Xiang Li, Hao Luo, Shen Yin, undefined, 2019 CAA Symposium on Fault Detection, Supervision and Safety for Technical Processes (SAFEPROCESS), 10.1109/SAFEPROCESS45799.2019.9213394, (631-636), (2019).
- Qing Wang, Kaicen Wang, Wenrui Wu, Eleni Giannoulatou, Joshua W. K. Ho, Lanjuan Li, Host and microbiome multi-omics integration: applications and methodologies, Biophysical Reviews, 10.1007/s12551-018-0491-7, (2019).
- Camilo Broc, Borja Calvo, Benoit Liquet, Penalized Partial Least Square applied to structured data, Arabian Journal of Mathematics, 10.1007/s40065-019-0248-6, (2019).
- Sun Hye Kim, Fani Boukouvala, Machine learning-based surrogate modeling for data-driven optimization: a comparison of subset selection for regression techniques, Optimization Letters, 10.1007/s11590-019-01428-7, (2019).
- Heng-Hui Lue, Pairwise directions estimation for multivariate response regression data, Journal of Statistical Computation and Simulation, 10.1080/00949655.2019.1572145, (1-19), (2019).
- Marie Chavent, Robin Genuer, Jérôme Saracco, Combining clustering of variables and feature selection using random forests, Communications in Statistics - Simulation and Computation, 10.1080/03610918.2018.1563145, (1-20), (2019).
- Julian Hagenauer, Hichem Omrani, Marco Helbich, Assessing the performance of 38 machine learning models: the case of land consumption rates in Bavaria, Germany, International Journal of Geographical Information Science, 10.1080/13658816.2019.1579333, (1-21), (2019).
- Vincent Guillemot, Derek Beaton, Arnaud Gloaguen, Tommy Löfstedt, Brian Levine, Nicolas Raymond, Arthur Tenenhaus, Hervé Abdi, A constrained singular value decomposition method that integrates sparsity and orthogonality, PLOS ONE, 10.1371/journal.pone.0211463, 14, 3, (e0211463), (2019).
- Zhiying Long, Yubao Wang, Xuanping Liu, Li Yao, Two-step paretial least square regression classifiers in brain-state decoding using functional magnetic resonance imaging, PLOS ONE, 10.1371/journal.pone.0214937, 14, 4, (e0214937), (2019).
- Annette Vriens, Tim S. Nawrot, Bram G. Janssen, Willy Baeyens, Liesbeth Bruckers, Adrian Covaci, Sam De Craemer, Stefaan De Henauw, Elly Den Hond, Ilse Loots, Vera Nelen, Thomas Schettgen, Greet Schoeters, Dries S. Martens, Michelle Plusquin, Exposure to Environmental Pollutants and Their Association with Biomarkers of Aging: A Multipollutant Approach, Environmental Science & Technology, 10.1021/acs.est.8b07141, (2019).
- Adnan Khan Niazi, Etienne Delannoy, Rana Khalid Iqbal, Daria Mileshina, Romain Val, Marta Gabryelska, Eliza Wyszko, Ludivine Soubigou-Taconnat, Maciej Szymanski, Jan Barciszewski, Frédérique Weber-Lotfi, José Manuel Gualberto, André Dietrich, Mitochondrial Transcriptome Control and Intercompartment Cross-Talk During Plant Development, Cells, 10.3390/cells8060583, 8, 6, (583), (2019).
- Rebecca C. Scholten, Joachim Hill, Willy Werner, Henning Buddenbaum, Jonathan P. Dash, Mireia Gomez Gallego, Carol A. Rolando, Grant D. Pearse, Robin Hartley, Honey Jane Estarija, Michael S. Watt, Hyperspectral VNIR-spectroscopy and imagery as a tool for monitoring herbicide damage in wilding conifers, Biological Invasions, 10.1007/s10530-019-02055-0, (2019).
- Kris Sankaran, Susan P. Holmes, Multitable Methods for Microbiome Data Integration, Frontiers in Genetics, 10.3389/fgene.2019.00627, 10, (2019).
- Antik Chakraborty, Anirban Bhattacharya, Bani K Mallick, Bayesian sparse multiple regression for simultaneous rank reduction and variable selection, Biometrika, 10.1093/biomet/asz056, (2019).
- Ali Razzaq, Bushra Sadia, Ali Raza, Muhammad Khalid Hameed, Fozia Saleem, Metabolomics: A Way Forward for Crop Improvement, Metabolites, 10.3390/metabo9120303, 9, 12, (303), (2019).
- undefined Antonelli, undefined Claggett, undefined Henglin, undefined Kim, undefined Ovsak, undefined Kim, undefined Deng, undefined Rao, undefined Tyagi, undefined Watrous, undefined Lagerborg, undefined Hushcha, undefined Demler, undefined Mora, undefined Niiranen, undefined Pereira, undefined Jain, undefined Cheng, Statistical Workflow for Feature Selection in Human Metabolomics Data, Metabolites, 10.3390/metabo9070143, 9, 7, (143), (2019).
- Luis A. Barboza, Julien Emile-Geay, Bo Li, Wan He, Efficient Reconstructions of Common Era Climate via Integrated Nested Laplace Approximations, Journal of Agricultural, Biological and Environmental Statistics, 10.1007/s13253-019-00372-4, (2019).
- Vahid Habibi, Hasan Ahmadi, Mohammad Jafari, Abolfazl Moeini, Application of nonlinear models and groundwater index to predict desertification case study: Sharifabad watershed, Natural Hazards, 10.1007/s11069-019-03769-z, (2019).
- Qingchao Jiang, Xuefeng Yan, Biao Huang, Review and Perspectives of Data-Driven Distributed Monitoring for Industrial Plant-Wide Processes, Industrial & Engineering Chemistry Research, 10.1021/acs.iecr.9b02391, (2019).
- Duo Jiang, Courtney R. Armour, Chenxiao Hu, Meng Mei, Chuan Tian, Thomas J. Sharpton, Yuan Jiang, Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities, Frontiers in Genetics, 10.3389/fgene.2019.00995, 10, (2019).
- Chen Chen, Bin He, Wenping Yuan, Lanlan Guo, Yafeng Zhang, Increasing interannual variability of global vegetation greenness, Environmental Research Letters, 10.1088/1748-9326/ab4ffc, 14, 12, (124005), (2019).
- Bing Cai Kok, Ji Sok Choi, Hyelim Oh, Ji Yeh Choi, Sparse Extended Redundancy Analysis: Variable Selection via the Exclusive LASSO, Multivariate Behavioral Research, 10.1080/00273171.2019.1694477, (1-21), (2019).
- See more




