Stability selection
Summary. Estimation of structure, such as in variable selection, graphical modelling or cluster analysis, is notoriously difficult, especially for high dimensional data. We introduce stability selection. It is based on subsampling in combination with (high dimensional) selection algorithms. As such, the method is extremely general and has a very wide range of applicability. Stability selection provides finite sample control for some error rates of false discoveries and hence a transparent principle to choose a proper amount of regularization for structure estimation. Variable selection and structure estimation improve markedly for a range of selection methods if stability selection is applied. We prove for the randomized lasso that stability selection will be variable selection consistent even if the necessary conditions for consistency of the original lasso method are violated. We demonstrate stability selection for variable selection and Gaussian graphical modelling, using real and simulated data.
1. Introduction
Estimation of discrete structure, such as graphs or clusters, or variable selection is an age‐old problem in statistics. It has enjoyed increased attention in recent years due to the massive growth of data across many scientific disciplines. These large data sets often make estimation of discrete structures or variable selection imperative for improved understanding and interpretation. Most classical results do not cover the loosely defined case of high dimensional data, and it is mainly in this area where we motivate the promising properties of our new stability selection.
In the context of regression, for example, an active area of research is to study the p≫n case, where the number of variables or covariates p exceeds the number of observations n; for an early overview see for example van de Geer and van Houwelingen (2004). In a similar spirit, graphical modelling with many more nodes than sample size has been the focus of recent research, and cluster analysis is another widely used technique to infer a discrete structure from observed data.
Challenges with estimation of discrete structures include computational aspects, since corresponding optimization problems are discrete, as well as determining the right amount of regularization, e.g. in an asymptotic sense for consistent structure estimation. Substantial progress has been made over recent years in developing computationally tractable methods which have provable statistical (asymptotic) properties, even for the high dimensional setting with many more variables than samples. One interesting stream of research has focused on relaxations of some discrete optimization problems, e.g. by l1‐penalty approaches (Donoho and Elad, 2003; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006; Wainwright, 2009; Yuan and Lin, 2007) or greedy algorithms (Freund and Schapire, 1996; Tropp, 2004; Zhang, 2009). The practical usefulness of such procedures has been demonstrated in various applications. However, the general issue of selecting a proper amount of regularization (for the procedures that were mentioned above and for many others) for obtaining a right‐sized structure or model has largely remained a problem with unsatisfactory solutions.
We address the problem of proper regularization with a very generic subsampling approach (bootstrapping would behave similarly). We show that subsampling can be used to determine the amount of regularization such that a certain familywise type I error rate in multiple testing can be conservatively controlled for finite sample size. Particularly for complex, high dimensional problems, a finite sample control is much more valuable than an asymptotic statement with the number of observations tending to ∞. Beyond the issue of choosing the amount of regularization, the subsampling approach yields a new structure estimation or variable selection scheme. For the more specialized case of high dimensional linear models, we prove what we expect in greater generality: namely that subsampling in conjunction with l1‐penalized estimation requires much weaker assumptions on the design matrix for asymptotically consistent variable selection than what is needed for the (non‐subsampled) l1‐penalty scheme. Furthermore, we show that additional improvements can be achieved by randomizing not only via subsampling but also in the selection process for the variables, bearing some resemblance to the successful tree‐based random‐forest algorithm (Breiman, 2001). Subsampling (and bootstrapping) has been primarily used so far for asymptotic statistical inference in terms of standard errors, confidence intervals and statistical testing. Our work here is of a very different nature: the marriage of subsampling and high dimensional selection algorithms yields finite sample familywise error control and markedly improved structure estimation or selection methods.
1.1. Preliminaries and examples
In general, let β be a p‐dimensional vector, where β is sparse in the sense that s<p components are non‐zero. In other words, ‖β‖0=s<p. Denote the set of non‐zero values by S={k:βk≠0} and the set of variables with vanishing coefficient by N={k:βk=0}. The goal of structure estimation is to infer the set S from noisy observations.
A prime example is variable selection in regression, where β is the vector of regression coefficients in a linear model

Y = Xβ + ε, (1)

with n-dimensional response Y, n×p design matrix X and noise vector ε. The lasso (Tibshirani, 1996; Chen et al., 2001) is

β̂^λ = argmin_{β ∈ R^p} ( ‖Y − Xβ‖₂² + λ Σ_{k=1}^p |β_k| ), (2)

where λ ∈ R⁺ is a regularization parameter and we typically assume that the covariates are on the same scale, i.e. ‖X^{(k)}‖₂ = 1 for all k = 1,…,p. An attractive feature of the lasso is its computational feasibility for large p since the optimization problem (2) is convex. Furthermore, the lasso can select variables by shrinking certain estimated coefficients exactly to 0. We can then estimate the set S of non-zero β-coefficients by

Ŝ^λ = {k : β̂_k^λ ≠ 0},

which involves convex optimization only. Substantial understanding has been gained over the last few years about consistency of such lasso variable selection (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006; Wainwright, 2009; Yuan and Lin, 2007), and we present the details in Section 3.1. Among the challenges are the issue of choosing a proper amount of regularization λ for consistent variable selection and the fact that restrictive design conditions are needed for asymptotically recovering the true set S of relevant covariates.
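To make the notation concrete, the selection set Ŝ^λ can be computed with any off-the-shelf lasso solver; the following is a minimal sketch (not from the original paper) using scikit-learn, assuming X and Y are numpy arrays. Note that scikit-learn's Lasso minimizes (1/(2n))‖Y − Xb‖₂² + alpha‖b‖₁, so alpha = λ/(2n) reproduces the penalty scaling of equation (2).

```python
# Minimal sketch: the lasso selection set S_hat(lambda) of equation (2).
import numpy as np
from sklearn.linear_model import Lasso

def lasso_selection_set(X, Y, lam):
    n = X.shape[0]
    # alpha = lambda / (2n) matches the scaling of equation (2)
    fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, Y)
    return set(np.flatnonzero(fit.coef_))  # S_hat(lambda) = {k : beta_k != 0}
```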
Another example is the selection of the edge set in a Gaussian graphical model, where the observations are IID realizations of a d-dimensional random vector with a N_d(μ, Σ) distribution. Two variables j and k are conditionally dependent, given all other variables, if and only if

(Σ^{−1})_{jk} ≠ 0, (3)

and we then draw an edge between nodes j and k in a corresponding graph (Lauritzen, 1996). The structure estimation is thus on the index set

{(j, k) : 1 ⩽ j < k ⩽ d},

which has cardinality p = d(d−1)/2 (and, of course, we can represent the structure as a p×1 vector), and the set of relevant conditional dependences is

S = {(j, k) : j < k and (Σ^{−1})_{jk} ≠ 0}.

Similarly to the problem of variable selection in regression, l0-norm methods are computationally very difficult and become very quickly unfeasible for moderate or large values of d. A relaxation with l1-type penalties has also proven to be useful in this context (Meinshausen and Bühlmann, 2006). A recent proposal is the graphical lasso (Friedman et al., 2008):

Θ̂^λ = argmin_{Θ ⪰ 0} { −log det(Θ) + tr(Θ Σ̂^{ML}) + λ Σ_{j<k} |Θ_{jk}| }, (4)

where Σ̂^{ML} is the empirical covariance matrix and the minimum is taken over all non-negative definite symmetric matrices Θ. This amounts to an l1-penalized estimator of the Gaussian log-likelihood, partially maximized over the mean vector μ. The estimated graph structure is then

Ŝ^λ = {(j, k) : j < k and Θ̂^λ_{jk} ≠ 0},

which involves convex optimization only and is computationally feasible for large values of d.
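As an illustration of expression (4), the edge set can be extracted from the penalized precision matrix; a hedged sketch with scikit-learn's GraphicalLasso (whose alpha plays the role of λ) is given below. The tolerance 1e-10 is our choice, guarding only against numerical round-off.

```python
# Sketch: edge set of the graphical lasso estimate (4) for an n x d matrix X.
import numpy as np
from sklearn.covariance import GraphicalLasso

def graphical_lasso_edges(X, lam):
    prec = GraphicalLasso(alpha=lam).fit(X).precision_  # estimate of Sigma^{-1}
    d = prec.shape[0]
    return {(j, k) for j in range(d) for k in range(j + 1, d)
            if abs(prec[j, k]) > 1e-10}                 # S_hat: nonzero (j,k), j<k
```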
Another potential area of application is clustering. Choosing the correct number of clusters is a notoriously difficult problem. Looking for clusters that are stable under perturbations or subsampling of the data can help to obtain a better sense of a meaningful number of clusters and to validate results. Indeed, there has been some activity in this area, most notably in the context of consensus clustering (Monti et al., 2003). For an early application see Bhattacharjee et al. (2005). Our proposed false discovery control can be applied to consensus clustering, yielding good estimates of the parameters of a suitable base clustering method for consensus clustering.
1.2. Outline
The use of resampling for validation is certainly not new; we merely try to put it into a more formal framework and to show certain empirical and theoretical advantages of doing so. It seems difficult to give a complete coverage of all previous work in the area, as notions of stability, resampling and perturbations are very natural in the context of structure estimation and variable selection. We reference and compare with previous work throughout the paper.
The structure of the paper is as follows. The generic stability selection approach, its familywise type I multiple‐testing error control and some representative examples from high dimensional linear models and Gaussian graphical models are presented in Section 2. A detailed asymptotic analysis of the lasso and randomized lasso for high dimensional linear models is given in Section 3 and more numerical results are described in Section 4. After a discussion in Section 5, we collect all the technical proofs in Appendix A.
2. Stability selection
Stability selection is not a new variable selection technique. Its aim is rather to enhance and improve existing methods. First, we give a general description of stability selection and we present specific examples and applications later. We assume throughout this section that the data, which are denoted here by Z(1),…,Z(n), are IID (e.g. Z(i)=(X(i),Y(i)) with covariate X(i) and response Y(i)).
For a generic structure estimation or variable selection technique, we assume that we have a tuning parameter λ ∈ Λ ⊆ R⁺ that determines the amount of regularization. This tuning parameter could be the penalty parameter in l1-penalized regression (see estimator (2)) or in Gaussian graphical modelling (see expression (4)), or it may be the number of steps in forward variable selection or orthogonal matching pursuit (OMP) (Mallat and Zhang, 1993) or the number of iterations in matching pursuit (Mallat and Zhang, 1993) or boosting (Freund and Schapire, 1996); a large number of steps or iterations would have the opposite meaning from a large penalty parameter, but this does not cause conceptual problems. For every value λ ∈ Λ, we obtain a structure estimate Ŝ^λ ⊆ {1,…,p}. It is then of interest to determine whether there is a λ ∈ Λ such that Ŝ^λ is identical to S with high probability, and how to achieve that right amount of regularization.
2.1. Stability paths
We motivate the concept of stability paths in what follows, first for regression. Stability paths are derived from the concept of regularization paths. A regularization path is given by the coefficient value of each variable over all regularization parameters: {β̂_k^λ : λ ∈ Λ, k = 1,…,p}. Stability paths (which are defined below) are, in contrast, the probability for each variable to be selected when randomly resampling from the data. For any given regularization parameter λ ∈ Λ, the selected set Ŝ^λ is implicitly a function of the samples I = {1,…,n}. We write Ŝ^λ = Ŝ^λ(I) where necessary to express this dependence.

Definition 1 (selection probabilities). Let I be a random subsample of {1,…,n} of size ⌊n/2⌋, drawn without replacement. For every set K ⊆ {1,…,p}, the probability of K being in the selected set Ŝ^λ(I) is

Π̂_K^λ = P*{K ⊆ Ŝ^λ(I)}. (5)

Remark 1. The probability P* in equation (5) is with respect to both the random subsampling and other sources of randomness if Ŝ^λ is a randomized algorithm; see Section 3.1.
The sample size of ⌊n/2⌋ is chosen as it resembles most closely the bootstrap (Freedman, 1977; Bühlmann and Yu, 2002) while allowing a computationally efficient implementation. Note that random subsampling can be viewed as a computational short cut for computing the relative frequency of the event {K ⊆ Ŝ^λ(I)} over all n-choose-m subsets I ⊆ {1,…,n} of size m = ⌊n/2⌋, which itself is a U-statistic of order m = ⌊n/2⌋. Subsampling has also been advocated in a related context in Valdar et al. (2009).
For every variable k = 1,…,p, the stability path is given by the selection probabilities {Π̂_k^λ : λ ∈ Λ}. It is a complement to the usual path plots that show the coefficients of all variables k = 1,…,p as a function of the regularization parameter. It can be seen in Fig. 1 that this simple path plot is potentially very useful for improved variable selection for high dimensional data.
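In code, the selection probabilities (5) are straightforward to approximate by Monte Carlo; the sketch below (ours, not from the original) subsamples ⌊n/2⌋ observations repeatedly and records selection frequencies for each variable on a grid of λ values. The `selector` argument could be, for instance, the `lasso_selection_set` sketched earlier.

```python
# Sketch: Monte Carlo approximation of the selection probabilities (5),
# and hence of the stability path, over a grid of lambda values.
import numpy as np

def stability_path(X, Y, lambdas, selector, n_subsamples=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((len(lambdas), p))
    for _ in range(n_subsamples):
        I = rng.choice(n, size=n // 2, replace=False)  # subsample, |I| = n/2
        for i, lam in enumerate(lambdas):
            for k in selector(X[I], Y[I], lam):
                freq[i, k] += 1.0
    return freq / n_subsamples    # Pi_hat[i, k] ~ P*(k in S_hat^lambda(I))
```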

Fig. 1. (a) Lasso path for the vitamin gene expression data set (full curves, paths of the six non-permuted genes; broken curves, paths of the 4082 permuted genes; selecting a model with all six unpermuted genes invariably means selecting a large number of irrelevant noise variables); (b) stability path of the lasso (the first four variables chosen with stability selection are truly non-permuted variables); (c) stability path for the randomized lasso with weakness α=0.2, introduced in Section 3.1 (now all six non-permuted variables are chosen before any noise variable enters the model)
In the remainder of the paper, we look at the selection probabilities of individual variables. The definition above covers sets of variables also. We could monitor the selection probability of a set of functionally related variables, say, by asking how often at least one variable in this set is chosen or how often all variables in the set are chosen.
2.2. Example I: variable selection in regression
We apply stability selection to the lasso that is defined in equation (2). We work with a gene expression data set for illustration which was kindly provided by DSM Nutritional Products (Switzerland). For n=115 samples, there is a continuous response variable measuring the logarithm of the riboflavin (vitamin B2) production rate of Bacillus subtilis, and we have p=4088 continuous covariates measuring the logarithm of gene expressions from essentially the whole genome of Bacillus subtilis. Certain mutations of genes are thought to lead to higher vitamin concentrations, and the challenge is to identify those relevant genes via a linear regression analysis, i.e. we consider a linear model as in equation (1) and want to infer the set S={k : βk≠0}.
Instability of the selected set of genes has been noted before (Ein-Dor et al., 2005; Michiels et al., 2005), when using either marginal association or variable selection in a regression or classification model. Davis et al. (2006) were close in spirit to our approach by arguing for 'consensus' gene signatures which assess the stability of selection, whereas Zucknick et al. (2008) proposed to measure the stability of so-called 'molecular profiles' by the Jaccard index.
To see how the lasso and the related stability path cope with noise variables, we randomly permute all except six of the 4088 gene expressions across the samples, using the same permutation to keep the dependence structure between the permuted gene expressions intact. The set of six unpermuted genes has been chosen randomly among the 200 genes with the highest marginal association with the response. The lasso path {β̂_k^λ : λ ∈ Λ, k = 1,…,p} is shown in Fig. 1(a), as a function of the regularization parameter λ (rescaled so that λ=1 is the minimal λ-value for which the null model is selected and λ=0 amounts to the basis pursuit solution). Three of the 'relevant' (unpermuted) genes stand out, but the remaining three variables are hidden within the paths of the noise (permuted) genes. Fig. 1(b) shows the stability path. At least four relevant variables stand out much more clearly now than they did in the regularization path plot. Fig. 1(c) shows the stability plot for the randomized lasso, which will be introduced in Section 3.1: now all six unpermuted variables stand above the permuted variables, and the separation between (potentially) relevant variables and irrelevant variables is even better.
Choosing the right regularization parameter is very difficult for the original path. The prediction optimal and cross‐validated choices include too many variables (Meinshausen and Bühlmann, 2006; Leng et al., 2006) and the same effect can be observed in this example, where 14 permuted variables are included in the model that was chosen by cross‐validation. Fig. 1 motivates that choosing the right regularization parameter is much less critical for the stability path and that we have a better chance of selecting truly relevant variables.
2.3. Stability selection
In a traditional setting, variable selection would amount to choosing one element of the set of models

{Ŝ^λ : λ ∈ Λ}. (6)

With stability selection, we do not simply select one model in the list (6). Instead the data are perturbed (e.g. by subsampling) many times and we choose all structures or variables that occur in a large fraction of the resulting selection sets.

Definition 2 (stable variables). For a cut-off π_thr with 0 < π_thr < 1 and a set of regularization parameters Λ, the set of stable variables is defined as

Ŝ^stable = {k : max_{λ∈Λ} Π̂_k^λ ⩾ π_thr}. (7)

We keep variables with a high selection probability and disregard those with low selection probabilities. The exact cut-off π_thr with 0 < π_thr < 1 is a tuning parameter, but the results vary surprisingly little for sensible choices in a range of cut-offs. Nor do results depend strongly on the choice of the regularization λ or the regularization region Λ; see Fig. 1 for an example.
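Given a matrix of estimated selection probabilities, definition (7) reduces to a single thresholding operation, as in this small sketch (`pi_hat` being the output of the `stability_path` sketch above, with rows indexing the λ grid and columns the variables):

```python
# Sketch: the stable set (7) from estimated selection probabilities.
import numpy as np

def stable_set(pi_hat, pi_thr=0.6):
    return set(np.flatnonzero(pi_hat.max(axis=0) >= pi_thr))
```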
Before we present some guidance on how to choose the cut-off parameter and the regularization region Λ below, it is worthwhile to point out that there have been related ideas in the literature on Bayesian model selection. Barbieri and Berger (2004) showed certain predictive optimality results for the so-called median probability model, consisting of variables which have posterior probability 1/2 or greater of being in the model (as opposed to choosing the model with the highest posterior probability). Lee et al. (2003) and Sha et al. (2004) are examples of more applied papers considering Bayesian variable selection in this context.
2.4. Choice of regularization and error control
When trying to recover the set S, a natural goal is to include as few variables of the set N of noise variables as possible. The choice of the regularization parameter is hence crucial. An advantage of our stability selection is that the choice of the initial set of regularization parameters Λ typically does not have a very strong influence on the results, as long as Λ is varied within reason. Another advantage, which we focus on below, is the ability to choose this set of regularization parameters in a way that guarantees, under stronger assumptions, a certain bound on the expected number of false selections.
Let Ŝ^Λ = ∪_{λ∈Λ} Ŝ^λ be the set of selected structures or variables when varying the regularization λ in the set Λ. Let q_Λ be the average number of selected variables, q_Λ = E{|Ŝ^Λ(I)|}. Define V to be the number of falsely selected variables with stability selection,

V = |N ∩ Ŝ^stable|.
In general, it is very difficult to control E(V), as the distribution of the underlying estimator Ŝ^λ depends on many unknown quantities. Exact control is only possible under some simplifying assumptions.

Theorem 1. Assume that the distribution of {1_{k ∈ Ŝ^λ}, k ∈ N} is exchangeable for all λ ∈ Λ. Also, assume that the original procedure is not worse than random guessing, i.e.

E(|S ∩ Ŝ^Λ|) / E(|N ∩ Ŝ^Λ|) ⩾ |S| / |N|. (8)

The expected number V of falsely selected variables is then bounded for π_thr ∈ (1/2, 1) by

E(V) ⩽ (1 / (2π_thr − 1)) (q_Λ² / p). (9)

We shall discuss below how to make constructive use of the value q_Λ, which is in general an unknown quantity. The expected number of falsely selected variables is sometimes called the per-family error rate or, if divided by p, the per-comparison error rate in multiple testing (Dudoit et al., 2003). Choosing fewer variables (reducing q_Λ) or increasing the threshold π_thr for selection will, unsurprisingly, reduce the expected number of falsely selected variables, with a minimal achievable non-trivial value of 1/p (for π_thr = 1 and q_Λ = 1) for the per-family error rate. This seems sufficiently low for all practical purposes as long as p > 10, say.
The exchangeability assumption involved is perhaps stronger than we would wish, but there does not seem to be a way of achieving error control in the same generality without making similar assumptions. In recent independent work, Fan et al. (2009) made use of a similar condition for error control in regression. In regression and classification, the exchangeability assumption is fulfilled for all reasonable procedures
(whose results do not depend on the ordering of variables) if the design is random and the distribution of (Y, X_S, X_N) is invariant under permutations of the variables in N. The simplest example is independence between each X_k, k ∈ N, and all other variables, including Y. To give another example for regression in model (1), the condition is satisfied if the error has a normal distribution and X has a joint normal distribution for which it holds for all pairs k, k′ ∈ N that cov(X_k, X_l) = cov(X_{k′}, X_l) for all l ∉ {k, k′}. For real data, we have no guarantee that the assumption is fulfilled, but the numerical examples in Section 4 show that the bound holds up very well for real data.
Note also that the assumption of exchangeability is only needed to prove theorem 1. All other benefits of stability selection that are shown in this paper do not rely on this assumption. Besides exchangeability, we needed another, quite harmless, assumption, namely that the original procedure is not worse than random guessing. One would certainly hope that this assumption is fulfilled. If it is not, the results below are still valid with slightly weaker constants. The assumption seems so weak, however, that we do not pursue this further.
The threshold value π_thr is a tuning parameter whose influence is very small. For sensible values in the range of, say, π_thr ∈ (0.6, 0.9), results tend to be very similar. Once the threshold has been chosen at some default value, the regularization region Λ is determined by the desired error control. Specifically, for a default cut-off value π_thr = 0.9, choosing the regularization parameters Λ such that, say, q_Λ = √(0.8p) will control E(V) ⩽ 1, and choosing Λ such that q_Λ = √(0.8αp) controls the familywise error rate at level α, i.e. P(V > 0) ⩽ α. Of course, we can proceed the other way round by fixing the regularization region Λ and choosing π_thr such that E(V) is controlled at the desired level.
To do this, we need knowledge about q_Λ. This can easily be achieved by regularizing the selection procedure Ŝ^λ in terms of the number of selected variables q: the domain Λ for the regularization parameter λ then determines the number q of selected variables, i.e. q = q(Λ). For example, with l1-norm penalization as in expressions (2) or (4), the number q is given by the number of variables which enter first in the regularization path when varying λ from a maximal value λ_max down to some minimal value λ_min. Mathematically, λ_min is such that q = |∪_{λ ⩾ λ_min} Ŝ^λ|.
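Both directions of this calibration follow from inverting the bound (9); the helper functions below are a sketch of the two ways round just described (function names are ours, not from the original).

```python
# Sketch: calibrating stability selection via the bound
#   E(V) <= q_Lambda^2 / ((2 * pi_thr - 1) * p),   expression (9).
import math

def q_for_error_control(p, pi_thr, ev_max):
    # largest q_Lambda compatible with E(V) <= ev_max at threshold pi_thr
    return math.sqrt((2 * pi_thr - 1) * p * ev_max)

def pi_thr_for_error_control(p, q, ev_max):
    # smallest threshold compatible with E(V) <= ev_max for a fixed q_Lambda
    return 0.5 * (1 + q ** 2 / (p * ev_max))

# e.g. q_for_error_control(p, 0.9, 1.0) == sqrt(0.8 * p), as in the text
```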
Without stability selection, the regularization parameter λ invariably must depend on the unknown noise level of the observations. The advantages of stability selection are that
- (a)
exact error control is possible and
- (b)
the method works fine even though the noise level is unknown.
This is a real advantage in high dimensional problems with p≫n, as it is very difficult to estimate the noise level in these settings.
2.4.1. Pointwise control
For some applications, evaluating subsampling replicates of Ŝ^λ is already computationally very demanding for a single value of λ. If this single value λ is chosen such that some overfitting occurs, so that the set Ŝ^λ is rather too large in the sense that it contains S with high probability, the same approach as above can be used and is in our experience very successful. Results typically do not depend strongly on the regularization λ utilized; see the example below for graphical modelling. Setting Λ = {λ}, we can immediately transfer all the results above to the case of what we here call pointwise control. For methods which select structures incrementally, i.e. for which Ŝ^λ ⊆ Ŝ^{λ′} for all λ ⩾ λ′, pointwise control and control with Λ = [λ, ∞) are equivalent, since Π̂_k^λ is then monotonically increasing with decreasing λ for all k = 1,…,p.
2.5. Example II: graphical modelling
Stability selection is also promising for graphical modelling. Here we focus on Gaussian graphical models as described in Section 1.1 around formulae (3) and (4).
The pattern of non‐zero entries in the inverse covariance matrix Σ−1 corresponds to the edges between the corresponding pairs of variables in the associated graph and is equivalent to a non‐zero partial correlation (or conditional dependence) between such pairs of variables (Lauritzen, 1996).
There has been interest recently in using l1-penalties for model selection in Gaussian graphical models due to their computational efficiency for moderate and large graphs (Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Friedman et al., 2008; Banerjee and El Ghaoui, 2008; Bickel and Levina, 2008; Rothman et al., 2008). Here we work with the graphical lasso (Friedman et al., 2008), as applied to the data from 160 randomly selected genes from the vitamin gene expression data set (without the response variable) that was introduced in Section 2.2. We want to infer the set of non-zero entries in the inverse covariance matrix Σ^{−1}. Parts of the resulting regularization path of the graphical lasso, i.e. the estimated graphs Ŝ^λ = {(j,k) : j < k and Θ̂^λ_{jk} ≠ 0}, where Θ̂^λ is the graphical lasso estimator (4), are shown in Figs 2(a)–2(f) for various values of the regularization parameter λ. For reasons of display, variables (genes) are ordered first using hierarchical clustering and are symbolized by nodes arranged in a circle. Stability selection is shown in Figs 2(g)–2(l). We pursue a pointwise control approach. For each value of λ, we select the threshold π_thr to guarantee that E(V) ⩽ 30, i.e. we expect fewer than 30 wrong edges among the 12720 possible edges in the graph. The set of stable edges varies remarkably little for the majority of the path, and the choice of q (which is implied by λ) does not seem to be critical, as already observed for variable selection in regression.
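As a worked instance of this pointwise control (with illustrative, hypothetical values of the per-λ selection count q), the threshold implied by the bound (9) for E(V) ⩽ 30 among the 12720 candidate edges can be computed directly:

```python
# Hypothetical worked example of the pointwise error control used here.
p_edges, ev_max = 12720, 30
for q in (100, 200, 400):          # illustrative per-lambda selection counts
    pi_thr = 0.5 * (1 + q ** 2 / (p_edges * ev_max))
    print(q, round(pi_thr, 3))     # 100 -> 0.513, 200 -> 0.552, 400 -> 0.71
```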

Fig. 2. Vitamin gene expression data set: (a)–(f) regularization path of the graphical lasso and (g)–(l) the corresponding pointwise stability-selected models; (a), (g) λ=0.46; (b), (h) λ=0.448; (c), (i) λ=0.436; (d), (j) λ=0.424; (e), (k) λ=0.412; (f), (l) λ=0.4
Next, we permute the variables (expression values) randomly, using a different permutation for each variable (gene). The true graph is now the empty graph. As can be seen from Fig. 3, stability selection now selects just a very few edges or none at all (as it should). Figs 3(a)–3(f) show the corresponding graphs estimated with the graphical lasso, which yields a much poorer selection of edges.

Fig. 3. Same plots as in Fig. 2 but with the variables (expression values of each gene) permuted independently (the empty graph is the true model; with stability selection, only a few errors are made, as guaranteed by the error control); (a), (g) λ=0.065; (b), (h) λ=0.063; (c), (i) λ=0.061; (d), (j) λ=0.059; (e), (k) λ=0.057; (f), (l) λ=0.055
2.6. Computational requirements
Stability selection demands that we rerun the selection procedure Ŝ^λ multiple times. Evaluating the selection probabilities over 100 subsamples seems sufficient in practice. The algorithmic complexity of the lasso in expression (2) or in expression (13) in Section 3.1 is of the order O{np min(n,p)}; see Efron et al. (2004). In the p>n regime, running the full lasso path on a subsample of size n/2 hence costs a quarter of running the algorithm on the full data set, and running 100 subsamples costs 25 times as much as a single fit on the full data set. This cost can be compared with the cost of cross-validation, as this is what we must often resort to in practice to select the regularization parameter. Running tenfold cross-validation uses approximately 10 × 0.9² = 8.1 times as many computational resources as a single fit on the full data set. Stability selection is thus roughly three times more expensive than tenfold cross-validation. This analysis is based on the fact that the computational complexity scales like O(n²) with the number of observations (assuming that p>n). If computational costs scaled linearly with sample size (e.g. for the lasso with p<n), this factor would increase to roughly 5.5.
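The cost accounting above can be summarized in a few lines, assuming a per-fit cost proportional to n² (respectively n) in the number of observations:

```python
# Sketch: computational cost relative to a single fit on the full data set.
def relative_cost(frac_n, n_fits, exponent=2):
    return n_fits * frac_n ** exponent

print(relative_cost(0.5, 100))                  # stability selection: 25.0
print(relative_cost(0.9, 10))                   # tenfold CV: 10 * 0.9^2 = 8.1
print(relative_cost(0.5, 100) / relative_cost(0.9, 10))         # ~3.1
print(relative_cost(0.5, 100, 1) / relative_cost(0.9, 10, 1))   # ~5.6 if O(n)
```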
Stability selection with the lasso (using 100 subsamples) for a data set with p=1000 and n=100 takes about 10 s on a 2.2‐GHz processor, using the implementation of Friedman et al. (2007). Computational costs of this order would often seem worthwhile, given the potential benefits.
3. Consistent variable selection
We consider the linear model as in equation (1),

Y = Xβ + ε, (10)

with IID Gaussian noise ε. The predictor variables are normalized with ‖X^{(k)}‖₂ = 1 for all k ∈ {1,…,p}. We allow for high dimensional settings where p ≫ n.
Stability selection is attractive for two reasons. First, the choice of a proper regularization parameter for variable selection is crucial and notoriously difficult, especially because the noise level is unknown. With stability selection, results are much less sensitive to the choice of the regularization. Second, we shall show that stability selection makes variable selection consistent in settings where the original methods fail.
Consistent variable selection of a procedure Ŝ is understood to be equivalent to

P(Ŝ = S) → 1 as n → ∞. (11)

It is clearly of interest to know under which conditions consistent variable selection can be achieved. In the high dimensional context, this places a restriction on the growth of the number p of variables and the sparsity |S|, typically of the form |S| log(p) = o(n) (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006; Wainwright, 2009). Although this assumption is often realistic, there are stronger assumptions on the design matrix that need to be satisfied for consistent variable selection. For the lasso, it amounts to the 'neighbourhood stability' condition (Meinshausen and Bühlmann, 2006), which is equivalent to the 'irrepresentable condition' (Zhao and Yu, 2006; Zou, 2006; Yuan and Lin, 2007). For OMP (which is essentially forward variable selection), the so-called 'exact recovery criterion' (Tropp, 2004; Zhang, 2009) is sufficient and necessary for consistent variable selection.
Here, we show that these conditions can be circumvented more directly by using stability selection, also giving guidance on the proper amount of regularization. For brevity, we shall only discuss in detail the case of the lasso whereas the analysis of OMP is just alluded to.
An interesting aspect is that stability selection with the original procedures alone often yields very large improvements already. Moreover, adding some extra randomness in the spirit of random forests (Breiman, 2001) considerably weakens the conditions that are needed for consistent variable selection, as discussed next.
3.1. Lasso and randomized lasso
The lasso (Tibshirani, 1996; Chen et al., 2001) estimator is given in expression (2). For the lasso to recover the true set of variables consistently, in the sense that P(Ŝ^λ = S) → 1 as n → ∞, it turns out that the design needs to satisfy some assumptions, the strongest of which is arguably the so-called neighbourhood stability condition (Meinshausen and Bühlmann, 2006), which is equivalent to the irrepresentable condition (Zhao and Yu, 2006; Zou, 2006; Yuan and Lin, 2007):

‖X_N^T X_S (X_S^T X_S)^{−1} sign(β_S)‖_∞ < 1. (12)

Condition (12) is sufficient and (almost) necessary (the word 'almost' refers to the fact that a necessary relationship uses '⩽' instead of '<'). If this condition is violated, all that we can hope for is recovery of the regression vector β in an l2-sense of convergence by achieving ‖β̂^λ − β‖₂ → 0 in probability for n → ∞. The main assumption here consists of bounds on the sparse eigenvalues, as discussed below. This type of l2-convergence can be used to achieve consistent variable selection in a two-stage procedure by thresholding or, preferably, the adaptive lasso (Zou, 2006; Huang et al., 2008). The disadvantage of such a two-step procedure is the need to choose several tuning parameters without proper guidance on how these parameters can be chosen in practice. We propose the randomized lasso as an alternative. Despite its simplicity, it is consistent for variable selection even if the irrepresentable condition (12) is violated.
The randomized lasso is a new generalization of the lasso. Whereas the lasso penalizes the absolute value |β_k| of every component with a penalty term proportional to λ, the randomized lasso changes the penalty λ to a randomly chosen value in the range [λ, λ/α].

Definition 3 (randomized lasso). Let W_k, k = 1,…,p, be IID random variables taking values in [α, 1] for some weakness α ∈ (0, 1]. The randomized lasso estimator β̂^{λ,W} for regularization parameter λ ∈ R⁺ is then

β̂^{λ,W} = argmin_{β ∈ R^p} ( ‖Y − Xβ‖₂² + λ Σ_{k=1}^{p} |β_k| / W_k ). (13)

A proposal for the distribution of the weights W_k is described below, just before theorem 2. The word 'weakness' is borrowed from the terminology of weak greedy algorithms (Temlyakov, 2000), which are loosely related to our randomized lasso. Implementation of estimator (13) is straightforward by appropriate rescaling of the predictor variables (with scale factor W_k for the kth variable). Using these rescaled variables, the standard lasso is solved, using for example the LARS algorithm (Efron et al., 2004) or fast co-ordinatewise approaches (Meier et al., 2008; Friedman et al., 2007). The perturbation of the penalty weights is reminiscent of the reweighting in the adaptive lasso (Zou, 2006). Here, however, the reweighting is not based on any previous estimate but is simply chosen at random! As such, it is very simple to implement. However, it seems nonsensical at first sight, since we surely cannot expect any improvement from such a random perturbation. If applied only with one random perturbation, the randomized lasso is indeed not very useful. However, applying the randomized lasso many times and looking for variables that are chosen often turns out to yield a very powerful procedure.
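A minimal sketch of one draw of estimator (13), implemented via the rescaling route just described; the weight distribution (W_k = α with probability p_w, and W_k = 1 otherwise) anticipates the proposal made just before theorem 2, and the scikit-learn scaling is as in the earlier lasso sketch.

```python
# Sketch: one draw of the randomized lasso (13). Rescaling column k by W_k
# and solving a standard lasso reproduces the penalty lambda * |b_k| / W_k.
import numpy as np
from sklearn.linear_model import Lasso

def randomized_lasso_set(X, Y, lam, weakness=0.5, p_w=0.5, rng=None):
    rng = rng or np.random.default_rng()
    n, p = X.shape
    W = np.where(rng.random(p) < p_w, weakness, 1.0)   # W_k in {alpha, 1}
    fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X * W, Y)
    return set(np.flatnonzero(fit.coef_))   # zero pattern is that of (13)
```

Plugged into the `stability_path` sketch above as the `selector`, this yields stability selection with the randomized lasso.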
3.1.1. Consistency for randomized lasso with stability selection
For stability selection with the randomized lasso, we can do without the irrepresentable condition (12) and need only a condition on the sparse eigenvalues of the design (Candes and Tao, 2007; van de Geer, 2008; Meinshausen and Yu, 2009; Bickel et al., 2009), which was also called the sparse Riesz condition in Zhang and Huang (2008).

Definition 4 (sparse eigenvalues). For 0 ⩽ m ⩽ p, the minimal and maximal sparse eigenvalues of the design X are

φ_min(m) = min_{β : ‖β‖₀ ⩽ m} ‖Xβ‖₂² / ‖β‖₂²,   φ_max(m) = max_{β : ‖β‖₀ ⩽ m} ‖Xβ‖₂² / ‖β‖₂². (14)

We must constrain sparse eigenvalues to succeed.

Assumption 1 (sparse eigenvalues). There is some C > 1 and some κ ⩾ 9 such that

φ_max(Cs²) / φ_min(Cs²)^{3/2} < √C / κ. (15)

Theorem 2 (consistent variable selection). Assume the linear model (10) with sparsity s = |S| ⩾ 7, let assumption 1 be satisfied and let the weakness be given by α² = ν φ_min(Cs²)/Cs² for some ν ∈ ((7/κ)², 1/√2). Let a_n → ∞, n ∈ N, be an arbitrary sequence, let λ_min be a minimal amount of regularization and define the set of variables with small regression coefficients such that

S_small;λ = {k : |β_k| ⩽ 0.3 (Cs)^{3/2} λ}. (16)

We have not specified the exact form of perturbations that we shall be using for the randomized lasso (13). For what follows, we consider the randomized lasso (13) where the weights W_k are sampled independently as W_k = α with probability p_w ∈ (0, 1) and W_k = 1 otherwise. Other perturbations are certainly possible and often work just as well in practice. Under these conditions there is a set Ω_A in the sample space of Y with P(Ω_A) ⩾ 1 − 5/(p ∨ a_n) such that, on Ω_A, stability selection with the randomized lasso selects no noise variables,

N ∩ Ŝ^{λ,W} = ∅, (17)

for all λ ⩾ λ_min. On the same set Ω_A, all variables with sufficiently large regression coefficients are selected,

S ∖ S_small;λ ⊆ Ŝ^{λ,W}. (18)
Remark 3. Theorem 2 is valid for all λ ⩾ λ_min. This is noteworthy, as it means that, even if the value of λ is chosen too large (i.e. considerably larger than λ_min), no noise variables will be selected and expression (17) holds true. Only some important variables might be missed. This effect has been seen in the empirical examples, where stability selection is very insensitive to the choice of λ. In contrast, a hard thresholded solution of the lasso with a value of λ that is too large will lead to the inclusion of noise variables. Stability selection with the randomized lasso thus exhibits the important property of being conservative and guarding against falsely positive selections.
Let Π̂_k be the selection probability of a variable k ∈ S ∖ S_small;λ under both random-weight perturbations and subsampling of ⌊n/2⌋ out of n observations. The probability that Π̂_k is above the threshold π_thr ∈ (0, 1) is bounded from below, by a Markov-type inequality,

P(Π̂_k > π_thr) ⩾ 1 − 5 / {(1 − π_thr)(p ∨ a_{n/2})},

as a consequence of theorem 2. If 5/(p ∨ a_{n/2}) is sufficiently small in comparison with 1 − π_thr, this elementary inequality implies that important variables in S ∖ S_small;λ are still chosen by stability selection (subsampling and random-weights perturbation) with very high probability. A similar argument shows that noise variables are likewise still not chosen, with very high probability. Empirically, combining random-weight perturbations with subsampling yields very competitive results, and this is what we recommend to use.
3.1.2. Relation to other work
In related and very interesting work, Bach (2008) proposed the 'bolasso' (for bootstrapped enhanced lasso) and showed that using a finite number of subsamples of the original lasso procedure and applying essentially stability selection with π_thr = 1 yields consistent variable selection, under the condition that the penalty parameter λ vanishes faster than typically assumed, at rate n^{−1/2}, and that the model dimension p is fixed. Although the latter condition could possibly be merely technical, the first distinguishes the bolasso from our results. Applying stability selection to the randomized lasso, no false variable is selected for all sufficiently large values of λ; see remark 3. In other words, if λ is chosen 'too large' with the randomized lasso, only truly relevant variables are chosen (though a few might be missed). If λ is chosen too large with the bolasso, noise variables might be picked up. Fig. 4 is a good illustration. Picking the regularization in Fig. 4(a) (without extra randomness) to select the correct model is much more difficult than in Fig. 4(c), where extra randomness is added. The same distinction can be made with two-stage procedures like the adaptive lasso (Zou, 2006) or hard thresholding (Candes and Tao, 2007; Meinshausen and Yu, 2009), where variables are thresholded. If λ is picked too large (and λ is notoriously difficult to choose), false variables will invariably enter the model. In contrast, stability selection with the randomized lasso does not pick wrong variables if λ is chosen too large.

Fig. 4. Stability paths for the randomized lasso by using weakness parameters (a) α=1 (identical to the original lasso), (b) α=0.5 and (c) α=0.2 (full curves, coefficients of the first two (relevant) variables; broken curve, coefficient of the third (irrelevant) variable; dotted curves, coefficients from all other (irrelevant) variables): introducing the randomized version helps to avoid choosing the third (irrelevant) predictor variable
3.2. Example
We illustrate the results on the randomized lasso with a small simulation example: p = n = 200, and the predictor variables are sampled from an N(0, Σ) distribution, where Σ is the identity matrix except for the entries Σ₁₃ = Σ₂₃ = ρ and their symmetrical counterparts. We use the regression vector β = (1, 1, 0, 0, …, 0). The response Y is obtained from the linear model Y = Xβ + ε in equation (1), where ε₁,…,ε_n are IID Gaussian. For ρ > 0.5, the irrepresentable condition (12) is violated and the lasso cannot correctly identify the first two variables as the truly important ones, since it always includes the third variable superfluously as well. Using the randomized version of the lasso, the two relevant variables are still chosen with probability close to 1, whereas the irrelevant third variable is chosen only with much lower probability; the corresponding probabilities are shown for the randomized lasso in Fig. 4. This allows us to separate relevant and irrelevant variables. And, indeed, the randomized lasso is consistent under stability selection.
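A sketch of this simulation design follows; the noise level sigma and the particular value rho = 0.6 are our illustrative choices (any ρ > 0.5 with 2ρ² < 1 keeps Σ positive definite while violating condition (12)).

```python
# Sketch: the small simulation of this example; Sigma is the identity
# except for the entries linking variables 1, 2 (relevant) with variable 3.
import numpy as np

def simulate(rho=0.6, n=200, p=200, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = np.eye(p)
    Sigma[0, 2] = Sigma[2, 0] = Sigma[1, 2] = Sigma[2, 1] = rho
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:2] = 1.0                       # beta = (1, 1, 0, ..., 0)
    Y = X @ beta + sigma * rng.standard_normal(n)
    return X, Y
```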
3.3. Randomized orthogonal matching pursuit
Interesting alternatives to the lasso or greedy forward search in this context are the recently proposed forward–backward search (Zhang, 2008) and the MC+ algorithm (Zhang, 2007), which both provably lead to consistent variable selection under weak conditions on sparse eigenvalues, despite being greedy solutions to non‐convex optimization problems. It will be very interesting to explore the effect of stability selection on these algorithms, but this is beyond the scope of this paper.
Here, we look instead at OMP, a greedy forward search in the variable space. The iterative sure independence screening procedure (Fan and Lv, 2008) entails OMP as a special case. We shall examine the effect of stability selection under subsampling and additional randomization. To have a clear definition of randomized OMP with weakness 0<α<1 and q iterations, we define it as follows (a code sketch is given after the list).
- (a) Set R₁ = Y. Set m = 0 and Ŝ₀ = ∅.
- (b) For m = 1,…,q:
  - (i) find the maximal absolute correlation of a variable with the current residuals, c_max = max_k |X_k^T R_m|;
  - (ii) define the candidate set K = {k : |X_k^T R_m| ⩾ α c_max};
  - (iii) select randomly a variable k_sel in the set K and set Ŝ_m = Ŝ_{m−1} ∪ {k_sel};
  - (iv) let R_{m+1} = Y − P_m Y, where P_m = X_{Ŝ_m}(X_{Ŝ_m}^T X_{Ŝ_m})^{−1} X_{Ŝ_m}^T is the projection onto the variables in Ŝ_m.
- (c) Return the selected sets Ŝ₁ ⊆ Ŝ₂ ⊆ … ⊆ Ŝ_q.
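The following is a hedged sketch of steps (a)–(c), with numpy's least squares standing in for the explicit projection P_m:

```python
# Sketch: randomized orthogonal matching pursuit with weakness alpha.
import numpy as np

def randomized_omp(X, Y, q, weakness=0.9, seed=0):
    rng = np.random.default_rng(seed)
    selected, sets = [], []
    R = Y.copy()                                   # step (a): R_1 = Y
    for _ in range(q):
        corr = np.abs(X.T @ R)                     # |X_k' R_m|
        corr[selected] = 0.0                       # residuals are already
                                                   # orthogonal; numerical safety
        K = np.flatnonzero(corr >= weakness * corr.max())   # step (ii)
        selected.append(int(rng.choice(K)))        # step (iii)
        coef, *_ = np.linalg.lstsq(X[:, selected], Y, rcond=None)
        R = Y - X[:, selected] @ coef              # step (iv): R_{m+1}
        sets.append(list(selected))
    return sets                                    # step (c)
```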
OMP recovers the true set S of variables if the exact recovery criterion (Tropp, 2004; Zhang, 2009) holds:

max_{k ∈ N} ‖(X_S^T X_S)^{−1} X_S^T X_k‖₁ < 1. (19)

This is a sufficient condition for consistent variable selection. If it is not fulfilled, there are regression coefficients that cause OMP or its weak variant to fail in recovering the exact set S of relevant variables. Surprisingly, this condition is quite similar to the irrepresentable condition (Zhao and Yu, 2006) or the neighbourhood stability condition (Meinshausen and Bühlmann, 2006).
In the spirit of theorem 2, we also have a proof that stability selection for randomized OMP is asymptotically consistent for variable selection in linear models, even if the right-hand side in condition (19) is not bounded by 1 but instead by a possibly large constant (assuming that the weakness α is sufficiently low). This indicates that stability selection has a more general potential for improved structure estimation, beyond the case of the lasso that was presented in theorem 2. It is noteworthy that our proof involves artificially adding noise covariates. In practice, this often seems to help, but a more involved discussion is beyond the scope of this paper. We shall give empirical evidence for the usefulness of stability selection under subsampling and additional randomization for OMP in the numerical examples below.
There is an inherent trade-off when choosing the weakness α. A negative consequence of a low α is that the reweighted design can come closer to singularity, leading to unfavourable conditioning of the weighted design matrix. However, a low value of α makes it less likely that irrelevant variables are selected. This is a surprising result, but it rests on the fact that an irrelevant variable can only be chosen if the corresponding irrepresentable condition (12) is violated. By randomly perturbing the weights with a low α, the condition will sometimes be fulfilled for the perturbed design even though it is violated for the original data, lowering the selection probability of such a variable. A low value of α thus helps stability selection to avoid selecting noise variables whose irrepresentable condition (12) is violated. In practice, choosing α in the range (0.2, 0.8) gives very useful results.
4. Numerical results
To investigate further the effects of stability selection, we focus here on the application of stability selection to the lasso and randomized lasso for both regression and the natural extension to binary classification. The effect on OMP and randomized OMP will also be examined.
For regression (the lasso and OMP), we generate observations by Y=Xβ+ɛ. For binary classification, we use the logistic linear model under the binomial family. To generate the design matrices X, we use two real and five simulated data sets.
- (a)
Independent predictor variables: all p=1000 predictor variables are IID standard normal; sample sizes n=100 and n=1000.
- (b)
Block structure with 10 blocks: the p=1000-dimensional predictor variable follows an N(0, Σ) distribution, where Σ_{kk} = 1 for all k, Σ_{km} = 0.5 if k ≠ m and k ≡ m (mod 10), and Σ_{km} = 0 otherwise; sample sizes n=200 and n=1000.
- (c)
Toeplitz design: the p=1000-dimensional predictor variable follows an N(0, Σ) distribution, where Σ_{km} = ρ^{|k−m|} with ρ = 0.99; sample sizes n=200 and n=1000.
- (d)
Factor model with two factors: let φ1 and φ2 be two latent variables following IID standard normal distributions. Each predictor variable Xk, for k=1,…,p, is generated as Xk=fk,1φ1+fk,2φ2+ηk, where fk,1,fk,2 and ηk have IID standard normal distributions for all k=1,…,p; sample sizes are n=200 and n=1000, and p=1000.
- (e)
Data set (e) identical to (d) but with 10 instead of two factors.
- (f)
Motif regression data set: this is a data set (p=660 and n=2587) about finding transcription factor binding sites (motifs) in DNA sequences. The real‐valued predictor variables are abundance scores for p candidate motifs (for each of the genes). Our data set is from a heat shock experiment with yeast. For a general description and motivation about motif regression we refer to Conlon et al. (2003).
- (g)
This data set is the vitamin gene expression data (with p=4088 and n=158) that were described in Section 2.2.
We do not use the response values from the real data sets, however, as we need to know which variables are truly relevant or irrelevant. For this, we create sparse regression vectors by setting βk=0 for all k=1,…,p, except for a randomly chosen set S of coefficients, where βk is chosen independently and uniformly in [0,1] for all k ∈ S. The size s=|S| of the active set is varied between 4 and 50, depending on the data set. For regression, the noise vector (ε₁,…,ε_n) has IID N(0, σ²/n) entries, where the rescaling of the variance with n is due to the rescaling of the predictor variables to unit norm, i.e. ‖X^{(k)}‖₂=1. The noise level σ² is chosen to achieve signal-to-noise ratios of 0.5 and 2. For binary classification, we scale the vector β to achieve one of two given Bayes misclassification rates. Each of the 64 scenarios is run 100 times: once using the standard procedure (the lasso or OMP), once using stability selection with subsampling and once using stability selection with subsampling and additional randomization (α=0.5 for the randomized lasso and α=0.9 for randomized OMP). The methods are thus in total evaluated on about 20000 simulations each.
The solution of stability selection cannot be reproduced by simply selecting the right penalty with the lasso, since stability selection provides a fundamentally new solution. To compare the power of both approaches, we look at the probability that γs of the s relevant variables can be recovered without error, where γ ∈ {0.1, 0.4}. A set of γs variables is said to be recovered successfully by the lasso or OMP selection if there is a regularization parameter such that at least ⌈γs⌉ variables in S have a non-zero regression coefficient and all variables in N={1,…,p}∖S have a zero regression coefficient. For stability selection, recovery without error means that the ⌈γs⌉ variables with the highest selection probability max_{λ⩾λ_min} Π̂_k^λ are all in S. The value λ_min is chosen such that at most √(0.8p) variables are selected in the whole path of solutions for λ ⩾ λ_min. Note that this notion neglects the fact that the most advantageous regularization parameter is selected automatically here for the lasso and OMP but not for stability selection.
Results are shown in Fig. 5 for the lasso applied to regression, and in Fig. 6 for OMP applied to regression and the lasso applied to binary classification. In Fig. 5, we also give the median number of variables violating the irrepresentable condition (denoted by ‘violations’) and the average of the maximal correlation between a randomly chosen variable and all other variables (‘max cor’) as two measures of the difficulty of the problem.

Fig. 5. Probability of selecting 0.1s and 0.4s important variables without selecting a noise variable, with the lasso in the regression setting and with stability selection under subsampling, for the 64 different settings; ×, results for stability selection under additional randomization (α=0.5)


Fig. 6. Equivalent plot to Fig. 5 for (a) the lasso applied to binary classification and (b) OMP applied to regression
Stability selection identifies as many correct variables as the underlying method itself, or more. In some settings (e.g. in (a), or in (b) and (f) when the number s of relevant variables is very small), stability selection does not improve matters and yields results comparable with those of the underlying method. That stability selection does not help for scenario (a) is to be expected, as the design is nearly orthogonal (very weak empirical correlations between variables), thus almost decomposing into p univariate decisions, and we would not expect stability selection to help in a univariate framework. However, the gain of stability selection under subsampling is often substantial, irrespective of the sparsity of the signal and the signal-to-noise ratio. Additional randomization helps in cases where many variables violate the irrepresentable condition, e.g. in setting (e). This is in line with our theory.
Instead of giving full receiver operating characteristic curves for each simulation setting, we look at the number of falsely chosen variables when selecting 20% and 80% of all s relevant variables. The mean number of falsely selected variables in each of these two cases is shown in Fig. 7 for the lasso and in Fig. 8 for OMP. When using the lasso, stability selection with subsampling can increase the number of falsely selected variables when there is a large number of relevant variables (s=50) and we are looking to identify a large proportion (80%) of these. Yet all methods somewhat fail when trying to recover such a large number of relevant variables, often yielding hundreds of false positive results. In all other settings, stability selection seems advantageous, often selecting substantially fewer variables falsely than the standard lasso. For OMP, the gains are even more pronounced.

Fig. 7. Average number of falsely chosen variables in the regression setting when selecting (a) 20% or (b) 80% of all s correct variables: lasso results; stability selection under subsampling; ×, results for stability selection under additional randomization (α=0.5)

Fig. 8. Equivalent plot to Fig. 7 when using OMP instead of the lasso
Next, we test how well the error control of theorem 1 holds up for these data sets. For the motif regression data set (f) and the vitamin gene expression data set (g), the lasso is applied, with and without randomization. For both data sets, the signal-to-noise ratio is varied between 0.5, 1 and 2. The number of non-zero coefficients s is varied in steps of 1 between 1 and 12, with a standard normal distribution for the randomly chosen non-zero coefficients. Each of the 72 settings is run 20 times. We are interested in the comparison between the cross-validated solution and stability selection. For stability selection, we chose q_Λ = √(0.8p) and a threshold of π_thr = 0.6, corresponding to a control of E(V) ⩽ 2.5, where V is the number of wrongly selected variables. The control is mathematically derived under the assumption of exchangeability for the distribution of noise variables; see theorem 1. This assumption is most probably not fulfilled for the given data sets, and it is of interest to see how well the bound holds up for real data. Results are shown in Fig. 9. Stability selection reduces the number of falsely selected variables dramatically, while maintaining almost the same power to detect relevant variables. The number of falsely chosen variables is remarkably well controlled at the desired level, giving empirical evidence that the derived error control is useful beyond the setting of exchangeability. Stability selection thus helps to select a useful amount of regularization.

Fig. 9. Comparison of stability selection (for the randomized lasso with (a) α=0.5 and (b) α=1) with cross-validation for the standard lasso (•), for the real data sets (f) and (g), showing the average proportion of correctly identified relevant variables versus the average number of falsely selected variables: each pair of points corresponds to a simulation setting (some specified signal-to-noise ratio and s); the number of wrongly selected variables is controlled at E(V) ⩽ 2.5; for stability selection, the proportion of correctly identified relevant variables is very close to the cross-validation solution, whereas the number of falsely selected variables is reduced dramatically
5. Discussion
Stability selection addresses the notoriously difficult problem of structure estimation or variable selection, especially for high dimensional problems. Cross-validation often fails for high dimensional data, sometimes spectacularly. Stability selection is based on subsampling in combination with (high dimensional) selection algorithms. The method is extremely general, and we have demonstrated its applicability for variable selection in regression and for Gaussian graphical modelling.
Stability selection provides finite sample familywise multiple‐testing error control (or control of other error rates of false discoveries) and hence a transparent principle to choose a proper amount of regularization for structure estimation or variable selection. Furthermore, the solution of stability selection depends surprisingly little on the initial regularization chosen. This is an additional great benefit besides error control.
Another property of stability selection is the improvement over a prespecified selection method. Often computationally efficient algorithms for high dimensional selection are inconsistent, even in rather simple settings. We prove for the randomized lasso that stability selection will be variable selection consistent even if the necessary conditions for consistency of the original method are violated. And, thus, stability selection will asymptotically select the right model in scenarios where the lasso fails.
In short, stability selection is the marriage of subsampling and high dimensional selection algorithms, yielding finite sample familywise error control and markedly improved structure estimation. Both of these main properties have been demonstrated on simulated and real data.
Acknowledgements
We thank Bin Yu and Peter Bickel for inspiring discussions and the referees for many helpful comments and suggestions which greatly helped to improve the manuscript. NM thanks the Forschungsinstitut für Mathematik at the Eidgenössische Technische Hochschule Zürich for generous support and hospitality.
Appendices
Appendix A
A.1. Sample splitting
Let I₁ and I₂ be a random split of the samples {1,…,n} into two disjoint subsets, with |I₁| = |I₂| = ⌊n/2⌋. Define the simultaneous selection probability for any set K ⊆ {1,…,p} as

Π̂_K^{simult,λ} = P*{K ⊆ Ŝ^λ(I₁) ∧ K ⊆ Ŝ^λ(I₂)}, (20)

where the probability P* is, as in equation (5), with respect to the random split into I₁ and I₂ (and other sources of randomness if Ŝ^λ is a randomized algorithm).
We work with the selection probabilities that are based on subsampling but the following lemma lets us convert these probabilities easily into simultaneous selection probabilities based on sample splitting; the latter is used for the proof of theorem 1. The bound is rather tight for selection probabilities that are close to 1.
Lemma 1 (lower bound for simultaneous selection probabilities). For any set K ⊆ {1,…,p}, a lower bound for the simultaneous selection probability is given by

Π̂_K^{simult,λ} ⩾ 2 Π̂_K^λ − 1. (21)

Proof. Define s_K({1,1}) = P*{K ⊆ Ŝ^λ(I₁) ∧ K ⊆ Ŝ^λ(I₂)}. Note that the two events are not independent, as the probability is only with respect to a random split of the fixed samples {1,…,n} into I₁ and I₂. The probabilities s_K({1,0}), s_K({0,1}) and s_K({0,0}) are defined equivalently, e.g. s_K({1,0}) = P*{K ⊆ Ŝ^λ(I₁) ∧ K ⊄ Ŝ^λ(I₂)}, and analogously for the other two. Note that

s_K({1,1}) + s_K({1,0}) + s_K({0,1}) + s_K({0,0}) = 1

and

s_K({1,1}) + s_K({1,0}) = s_K({1,1}) + s_K({0,1}) = Π̂_K^λ.

Hence

Π̂_K^{simult,λ} = s_K({1,1}) = 2 Π̂_K^λ − 1 + s_K({0,0}) ⩾ 2 Π̂_K^λ − 1.
A.2. Proof of theorem 1
We shall first show that P(k ∈ Ŝ^Λ) ⩽ q_Λ/p for all k ∈ N, using the definitions Ŝ^Λ(I) = ∪_{λ∈Λ} Ŝ^λ(I) and q_Λ = E{|Ŝ^Λ(I)|}. Define furthermore N_Λ = N ∩ Ŝ^Λ to be the set of noise variables (in N) which appear in Ŝ^Λ, and analogously U_Λ = S ∩ Ŝ^Λ. The expected number of selected variables can be written as q_Λ = E(|N_Λ|) + E(|U_Λ|). Using assumption (8) (which asserts that the method is not worse than random guessing), it follows that E(|U_Λ|) ⩾ E(|N_Λ|)|S|/|N|. Putting this together, (1 + |S|/|N|) E(|N_Λ|) ⩽ q_Λ and hence |N|^{−1} E(|N_Λ|) ⩽ q_Λ/p. Using the exchangeability assumption, we have P(k ∈ Ŝ^Λ) = E(|N_Λ|)/|N| for all k ∈ N and hence, for k ∈ N, it holds that P(k ∈ Ŝ^Λ) ⩽ q_Λ/p, as desired. Note that this result is independent of the sample size that is used in the construction of Ŝ^Λ. Now, using lemma 2 below, it follows that P(Π̂_k^{simult,Λ} ⩾ ξ) ⩽ (q_Λ/p)²/ξ for all 0 < ξ < 1 and k ∈ N. Using lemma 1, it follows that P(max_{λ∈Λ} Π̂_k^λ ⩾ π_thr) ⩽ P{Π̂_k^{simult,Λ} ⩾ 2π_thr − 1} ⩽ (q_Λ/p)²/(2π_thr − 1). Hence

E(V) = Σ_{k∈N} P(k ∈ Ŝ^stable) ⩽ p (q_Λ/p)² / (2π_thr − 1) = (1/(2π_thr − 1)) (q_Λ²/p),

which completes the proof.
Lemma 2. Let K ⊆ {1,…,p} and let Ŝ^λ be the set of selected variables based on a sample size of ⌊n/2⌋. If P(K ⊆ Ŝ^λ) ⩽ ε, then

P(Π̂_K^{simult,λ} ⩾ ξ) ⩽ ε²/ξ.

If, analogously, P(K ⊆ Ŝ^Λ) ⩽ ε for some set Λ of regularization parameters, then

P(Π̂_K^{simult,Λ} ⩾ ξ) ⩽ ε²/ξ.
Proof. Let I₁, I₂ ⊆ {1,…,n} be, as above, the random split of the samples {1,…,n} into two disjoint subsets, where |I_i| = ⌊n/2⌋ for i = 1, 2. Define the binary random variable H_K^λ for all subsets K ⊆ {1,…,p} as

H_K^λ = 1{K ⊆ Ŝ^λ(I₁) ∧ K ⊆ Ŝ^λ(I₂)}.

Denote the data (the n samples) by Z. The simultaneous selection probability Π̂_K^{simult,λ}, as defined in expression (20), is then

Π̂_K^{simult,λ} = E*(H_K^λ | Z),

where the expectation E* is with respect to the random split of the n samples into sets I₁ and I₂ (and additional randomness if Ŝ^λ is a randomized algorithm). To prove the first part, the inequality P(K ⊆ Ŝ^λ) ⩽ ε (for a sample size ⌊n/2⌋) implies that E(H_K^λ) ⩽ ε², as the two events {K ⊆ Ŝ^λ(I₁)} and {K ⊆ Ŝ^λ(I₂)} are based on disjoint halves of the data. Therefore, E(Π̂_K^{simult,λ}) = E{E*(H_K^λ | Z)} = E(H_K^λ) ⩽ ε². Using a Markov-type inequality,

ξ P(Π̂_K^{simult,λ} ⩾ ξ) ⩽ E(Π̂_K^{simult,λ}) ⩽ ε².

Thus P(Π̂_K^{simult,λ} ⩾ ξ) ⩽ ε²/ξ, completing the proof of the first claim. The proof of the second part follows analogously.
A.3. Proof of theorem 2
Instead of working directly with form (13) of the randomized lasso estimator, we consider the equivalent formulation of the standard lasso estimator, where all variables have initially unit norm and are then rescaled by their random weights W.
Definition 6 (additional notation). For weights W as in expression (13), let X_W be the matrix of rescaled variables, with X_W^{(k)} = W_k X^{(k)} for each k = 1,…,p. Let φ_max^W and φ_min^W be the maximal and minimal sparse eigenvalues, analogous to expression (14), for X_W instead of X.
The proof rests mainly on the twofold effect that a weakness α<1 has on the selection properties of the lasso. The first effect is that the singular values of the design can be distorted if working with the reweighted variables Xw instead of X itself. A bound on the ratio between largest and smallest eigenvalue is derived in lemma 3, effectively yielding a lower bound for useful values of α. The following lemma 4 then asserts, for such values of α, that the relevant variables in S are chosen with high probability under any random sampling of the weights. The next lemma 5 establishes the key advantage of the randomized lasso as it shows that the irrepresentable condition (12) is sometimes fulfilled under randomly sampled weights, even though it is not fulfilled for the original data. Variables which are wrongly chosen because condition (12) is not satisfied for the original unweighted data will thus not be selected by stability selection. The final result is established in lemma 7 after a bound on the noise contribution in lemma 6.
by
and assume that s7. Let W be weights generated randomly in [α,1], as in expression (13), and let Xw be the corresponding rescaled predictor variables, as in definition 6. For α2ν φmin(Cs2)/Cs2, with
, it holds under assumption 1 for all random realizations W that
(22)
and the second inequality by s1. It follows that
(23)
be again the p×p diagonal matrix with diagonal entries
for all k=1,…,p and 0 on the non‐diagonal elements. Then
and, taking suprema over all
with diagonal entries in [α,1],

and the fact that
as well as
and thus
for all
with diagonal entries in [α,1]. The corresponding argument for φmin(m) yields the bound
for all
. Claim (22) follows by observing that
for s ≥ 7, since C ≥ 1 by assumption 1 and hence
.
Lemma 4. Let Ŝ^{λ,w} be the set of selected variables of the randomized lasso with weakness α ∈ (0,1] and randomly sampled weights W. Suppose that the weakness satisfies α² ≥ (7/κ)² φmin(Cs²)/Cs². Under the assumptions of theorem 2, there is a set Ω0 in the sample space of Y with P(Y ∈ Ω0) ≥ 1−3/(p∨an), such that for all realizations W=w, for p ≥ 5, if Y ∈ Ω0,
(24)
, as, by definition,
, as in lemma 3. The quantity C=c*/c_* in Zhang and Huang (2008) is identical to our notation and is bounded for all random realizations of W=w, as long as α² ≥ (7/κ)² φmin(Cs²)/Cs², using lemma 3, by

(25)
(26)
. The last inequality implies, by definition of Ssmall;λ in theorem 2, that claim (24) holds, which completes the proof.
Lemma 5. Let A ⊆ {1,…,p} and consider vectors v of weights that fulfil vj ≤ wj for all j ∈ {1,…,p}, with equality for all j ∈ A. Then, for α² ≤ φmin(m)/(m√2),
(27)
be the realization of W for which
and
for all other j ∈ {1,…,p}∖{k}. The probability of this realization is clearly pw(1−pw)^{p−1} under the sampling scheme that is used for the weights. Let
be the selected set of variables under these weights. Let now
be the set of all weights for which wk=α and wj=1 for all j ∈ A, and arbitrary values in {α,1} for all wj with j∉A∪{k}. The probability for a random weight being in this set is pw(1−pw)^{|A|}. By the assumption on K, it holds that K(w)=A for all weights in this set, since vj ≤ wj for all j ∈ {1,…,p} with equality for j ∈ A. For all weights in this set, it follows moreover that

(28)
, it is sufficient to show, for
,

As XAγ is the projection of Xk into the space that is spanned by XA and
, it holds that
. Using
, it follows that
, which shows result (28) and thus completes the proof.
Lemma 6. Let PA be the projection into the space that is spanned by all variables in subset A⊆{1,…,p}. Suppose that p>10. Then there is a set Ω1 with P(Ω1) ≥ 1−2/(p∨an), such that, for all ω ∈ Ω1,
(29)
be the event that
. As entries in ɛ are IID
for all δ ∈ (0,1). Note that, for all A ⊆ {1,…,p} and k∉A,
. Define
as
(30)
, showing that this bound is related to a bound in Zhang and Huang (2008), and we repeat a similar argument. Each term n^{1/2}‖PAɛ‖2/σ has a χ|A|‐distribution as long as XA is of full rank |A|. Hence, using the same standard tail bound as in the proof of theorem 3 of Zhang and Huang (2008),

,

and concluding that P(Ω1) ≥ 1−2/(p∨an) for all p>10.
Lemma 7. Consider again the probability for variable k of being in the selected subset, with respect to random sampling of the weights W. Then, under the assumptions of theorem 2, for all k∉S and p>10, there is a set ΩA with P(ΩA) ≥ 1−5/(p∨an) such that, for all ω ∈ ΩA and λ ≥ λmin,
(31)
(32)
only if
(33)
is the solution to expression (13) with the constraint that
, which is comparable with the analysis in Meinshausen and Bühlmann (2006). Let
be the set of non‐zero coefficients and
be the set of regression coefficients which are either truly non‐zero or estimated as non‐zero (or both). We shall use
as a shorthand notation for
. Let
be the projection operator into the space that is spanned by all variables in the set
. For all W=w, this is identical to

in condition (33) into the two terms
(34)
and, by definition of
above,
. Thus the left‐hand term in expression (34) is bounded from above by


We now apply lemma 5 to the rightmost term. The set
is a function of the weight vector and satisfies for every realization of the observations Y ∈ Ω0 the conditions in lemma 5 on the set K(w). First,
. Second, by definition of
above,
for all weights w. Third, it follows by the Karush–Kuhn–Tucker conditions for the lasso that the set of non‐zero coefficients of
and
is identical for two weight vectors w and v, as long as vj=wj for all components with non‐zero coefficients and vj ≤ wj for all remaining components (increasing the penalty on zero coefficients will leave them zero, if the penalty for non‐zero coefficients is kept constant). Hence there is a set Ωw in the sample space of W with Pw(Ωw) ≥ 1−δw such that
. Moreover, for the same set Ωw, we have
. Hence, for all ω ∈ Ω0∩Ω1 and all ω ∈ Ωw, the left‐hand side of condition (33) is bounded from above by λmin/7+2^{−1/4}λ<λ and variable k∉S is hence not part of the set
. It follows that
with δw=pw(1−pw)^{Cs²} for all k∉S. This completes the first part (31) of the proof.
For the second part (32), we need to show that, for all ω ∈ Ω0∩Ω1, all variables k in S are chosen with probability at least 1−δw (with respect to random sampling of the weights W), except possibly for variables in Ssmall;λ⊆S, defined in theorem 2. For all ω ∈ Ω0, however, it follows directly from lemma 4 that
. Hence, for all k ∈ S∖Ssmall;λ, the selection probability satisfies
for all Y ∈ Ω0, which completes the proof.
Since the statement in lemma 7 is a reformulation of the assertion of theorem 2, the proof of the latter is complete.
Discussion on the paper by Meinshausen and Bühlmann
Sylvia Richardson (Imperial College London)
This stimulating paper on combining resampling with l1‐selection algorithms makes important contributions for the analysis of high dimensional data. What I found particularly appealing in this paper is that it puts on a firm footing the idea of using the stability under resampling to select a set of variables, by
- (a)
estimating for each variable Xk its inclusion probability
in resampled subsets and
- (b)
formulating a selection rule based on the maximum of these over a regularization domain Λ.
Using inclusion probabilities in resampled subsets had been discussed in an informal way for a considerable time in applied work, in particular in genomics. Early work on extracting prognostic signatures from gene expression data was soon questioned as it was noticed that such signatures had little reproducibility. The idea of intersecting or combining resampled signatures followed. For example Zucknick et al. (2008) investigated the stability of gene expression signature in ovarian cancer derived by different supervised learning procedures including the lasso, the elastic net and random forests by computing their inclusion frequencies under resampling for profiles of different sizes (see Fig. 2 of Zucknick et al. (2008)).
I shall focus my discussion on the variable selection aspect rather than the graphical modelling, and specifically I shall comment on two aspects:
- (a)
trying to understand better the applicability of the bound in theorem 1 and the performance of this method beyond the reported simulations;
- (b)
putting ‘stability’ into a broader context and relating or comparing it with other approaches.
The focus of theorem 1 is on the control of the familywise error rate, control which depends on two quantities: the threshold πthr and the average number of selected variables over the regularization domain qΛ. It is informative to work out the bounds that are obtained as successive variables are selected for particular thresholds. On the vitamin data set, using the recommended qΛ=√(0.8p)=57, the maximal selection probability reaches 0.9 for only one variable, with a bound E(V) ≤ 1. Lowering πthr to 0.6 selects three variables with a small qΛ=3.9 and hence a useful bound E(V) ≤ 0.02. But the bound on E(V) reaches 3.8 if the domain is extended until a fourth variable is included. Hence, in this example, the bounds of theorem 1 would restrict the selection to three variables (Fig. 10(a)). If the practical use of the stability plots is extended to a ranking of the features according to the values of the selection probabilities, as suggested in the simulations, the bounds of theorem 1 thus appear to be quite conservative for deriving a cut‐off.
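The arithmetic behind these bounds is easy to reproduce. A minimal sketch, assuming the bound of theorem 1 in the form E(V) ≤ qΛ²/{(2πthr−1)p}; the printed values match the figures quoted above.

```python
def theorem1_bound(q_lambda, pi_thr, p):
    """Upper bound on E(V) from theorem 1: q^2 / ((2*pi_thr - 1) * p)."""
    assert 0.5 < pi_thr <= 1.0, "the threshold must exceed 1/2"
    return q_lambda ** 2 / ((2 * pi_thr - 1) * p)

# Vitamin data set, p = 4088:
print(theorem1_bound(57.0, 0.9, 4088))  # about 0.99, i.e. E(V) <= 1
print(theorem1_bound(3.9, 0.6, 4088))   # about 0.019, i.e. E(V) <= 0.02
```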

Fig. 10. Stability paths for the vitamin data set: (a) illustration of the bounds of theorem 1 for various values of qΛ and πthr for the lasso; (b) randomized lasso with α=0.5; (c) randomized lasso with α=0.2
With respect to the randomized lasso, the relevant quantities in the consistency theorem 2 are the threshold δ and the bound on the βk. Unfortunately, these quantities do not seem amenable to explicit computations. The authors seem to rely instead on a semiqualitative interpretation of the plots, described in terms such as ‘variables standing out’ or ‘better separated’, without giving quantitative guidelines on how to judge such a separation. However, the values of the selection probabilities are clearly influenced by the choice of weakness α (see Figs 10(b) and 10(c)), indicating that the thresholds for stability selection should be adapted with respect to α. Besides the elegant theoretical results of theorem 2, it is thus not entirely clear how to use the randomized stability paths in practice.
Broadly speaking, stability selection and machine learning methods can both be viewed as ‘ensemble learning’ procedures following Hastie et al. (2009). Counting the number of times that a variable is selected in each of the resampled n/2 subsets for particular values of λ is just one way of combining the information of a collection of lasso learners. In this respect, it is a little surprising that the authors have not opened up the discussion on connections between their approach and ensemble methods such as ‘bagging’ or ‘stacking’. Exploiting this connection could potentially lead to revisiting some of the choices made in their procedure, such as the set of learners that are combined (e.g. involving learners with more complex penalties such as in the elastic net) and the size of the subsamples, and to investigate the performance of combination rules that would exploit more than the marginal information, e.g. the order, or the stability of subsets.
Linking stability selection to Bayesian approaches provides further intriguing questions. It is well known that the penalty λ can be viewed as a parameter in a Laplace prior on the regression coefficients β. To ‘stabilize’ inference, the authors take the maximum of the selection probabilities over a domain Λ. From a Bayesian perspective, the choice of using the maximum rather than some form of integration over Λ is questionable. Have the authors considered alternative choices to the maximum and would some of their results carry over?
This naturally leads me to discuss the connection with the Bayesian variable selection (BVS) context, where stability and predictive performance are achieved, not by resampling the data but by allowing parameter and model uncertainty. In this light, model averaging for BVS could be viewed as an ensemble method. There are several strategies for BVS, differing in their prior model of the regression coefficients and the model search strategy. One way (but by no means the only way) to exploit the output of the BVS search is to compute marginal probabilities of inclusion for each variable, averaging over the space of models that are visited. In the large p, small n paradigm, ranking the posterior probabilities of inclusion to select relevant variables is commonly done. Of course, when the covariates are dependent, joint rather than marginal thresholding should also be considered.
To understand better the power and sensitivity of stability selection, and to investigate further the claim that is made by the authors of empirical evidence of good performance even when the ‘irrepresentable condition’ is violated, we have implemented their procedure on a set of simulated examples under two scenarios of large p, small n, the first inspired by classical test cases for variable selection that were devised by George and McCulloch (1997) and the second based on phased genotype data from the HapMap project. In both cases, a few of the regressors have strong correlation with the noise variables. In parallel, we have run two Bayesian stochastic search algorithms, shotgun stochastic search (Hans et al., 2007) and evolutionary stochastic search (Bottolo and Richardson, 2010), on the same data sets. Receiver operating characteristic curves and false discovery rate curves averaged over 25 replicates are presented in Fig. 11. It is clear from the plots that, in these two cases of large p, small n, good power for stability selection is achieved only at the expense of selecting a large number of false positive discoveries, a fact that can also be clearly seen in Fig. 7 of the paper. The Bayesian stochastic algorithms outperform the stability selection procedures in both scenarios. By their capacity to explore efficiently important parts of the parameter and model space and to perform averaging according to the support of each model, the Bayesian learners here have an enhanced performance.

Fig. 11. Receiver operating characteristic curves comparing stability selection (with randomization, α=0.5, and without) and Bayesian variable selection using either shotgun stochastic search or evolutionary stochastic search (E(p)=5); variables are ranked by marginal probabilities of inclusion (Bayesian methods) or by maximal selection probability; results are averaged over 25 simulated data sets and details of the simulation set‐up can be found in Bottolo and Richardson (2010): (a) p=300, n=120, s=16 and average maximum correlation 0.68; (b) HapMap data example, p=775, n=120, s=10 and average maximum correlation 0.88
As can be surmised from my comments, I have found this paper enjoyable, thought provoking and rich for future research directions, and I heartily congratulate the authors.
John Shawe‐Taylor and Shiliang Sun (University College London)
We congratulate the authors on a paper with an exciting mix of novel theoretical insights and practical experimental testing and verification of the ideas. We provide a personal view of the developments that were introduced by the paper, mentioning some areas where further work might be usefully undertaken, before presenting some results assessing the generalization performance of stability selection on a medical data set.
The paper introduces a general method for assessing the reliability of including component features in a model. The authors independently follow a similar line to that proposed by Bach (2008), who proposed to run the lasso algorithm on bootstrap samples and to include only those features that occur in all the models thus created. Meinshausen and Bühlmann refine this idea by assessing the probability that a feature is included in models created with random subsets of ⌊n/2⌋ training examples. Features are included if this probability exceeds a threshold πthr.
Theorem 1 provides a theoretical bound on the expected number of falsely selected variables in terms of πthr and qΛ, the expected number of features to be included in the models for a fixed subset of the training data but over a range of values of the regularization parameter λ ∈ Λ. The theorem is quite general, but makes one non‐trivial assumption: that the distribution over the inclusion of false variables is exchangeable. In their evaluation of this bound on a range of real world training sets, albeit with artificial regression functions, the authors demonstrate a remarkable agreement between the bound value (chosen to equal 2.5) and the true number of falsely included variables.
We would have liked to have seen further assessment of the reliability of the bound in different regimes, i.e. bound values as fixed by different qΛ and πthr. The experimental results indicate that in the data sets that were considered the exchangeability assumption either holds or, if it fails to hold, does not adversely affect the quality of the bound. We believe that it would have been useful to explore in greater detail which of these explanations is more probable.
One relatively minor misfit between the theory and practical experiments was the fact that the theoretical results are in terms of the expected value of the quantities over random subsets, whereas in practice a small sample is used to estimate the features to include as well as quantities such as qΛ. Perhaps finite sample methods for estimating fit with the assumption of exchangeability could also be considered. This might lead to an on‐line approach where samples are generated until the required accuracy is achieved.
Theorem 2 provides a more refined analysis in that it also provides guarantees that relevant examples are included provided that they play a significant part in the true model, which is something that theorem 1 does not address. Though stability selection as defined refers to the use of random subsampling and all the experiments make use of this strategy, theorem 2 analyses the effect of a ‘randomized lasso’ algorithm that randomly rescales the features before training on the full set. Furthermore, the proof of theorem 2 does not make it easy for the reader to gain an intuitive understanding of the key ideas behind the result.
Our final suggestion for further elucidation of the usefulness of the ideas that are presented in the paper is to look at the effects of stability selection on the generalization performance of the resulting models.
As an example we have applied the approach to a data set that is concerned with predicting the level of cholesterol of subjects on the basis of risk factors and single‐nucleotide polymorphism genotype features.
The data set includes 1842 subjects or examples. The feature set (input) includes six risk factors (age, smo, bmi, apob, apoa, hdl) and 787 genotypes. Each genotype takes a value in {1,2,3}. As preprocessing, each risk factor is normalized to have mean 0 and variance 1. For each example, its output is the averaged level of cholesterol over five successive years. The whole data were divided into a training set of 1200 examples and a test set of the remaining 642 examples. We shall report the test performance averaged across 10 different random divisions of training and test sets. The performance is evaluated through the root‐mean‐square error. In addition to standard ‘stability selection’ we report performance for a variant in which complementary pairs of subsets are used.
We report results for four methods:
- (a)
ridge regression with the original features (method M1);
- (b)
the lasso with the original features (method M2);
- (c)
ridge regression with the features identified by stability selection (method M3);
- (d)
the lasso with the features identified by stability selection (method M4).
The variants of M3 and M4 based on complementary pairs of subsets are denoted M3c and M4c. The performances of the first two methods are independent of πthr and provide a baseline given in Table 1.
| | Results for method M1 | Results for method M2 |
|---|---|---|
| Root‐mean‐square error | 0.752 (0.017) | 0.707 (0.017) |
| Number of retained features | 792 (0.66) | 109 (5.22) |
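A schematic of this kind of comparison, under stated assumptions (scikit-learn's Ridge and Lasso with placeholder penalty strengths; `stable_idx` denotes an index set produced by stability selection), might look as follows; it is a sketch, not the discussants' exact implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

def rmse_comparison(X, y, stable_idx, n_train=1200, seed=0):
    """Root-mean-square test error of ridge and the lasso on all
    features (M1, M2) and on stability-selected features (M3, M4)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    train, test = perm[:n_train], perm[n_train:]
    out = {}
    for name, model, idx in [
        ("M1", Ridge(alpha=1.0), slice(None)),
        ("M2", Lasso(alpha=0.01, max_iter=5000), slice(None)),
        ("M3", Ridge(alpha=1.0), stable_idx),
        ("M4", Lasso(alpha=0.01, max_iter=5000), stable_idx),
    ]:
        fit = model.fit(X[train][:, idx], y[train])
        pred = fit.predict(X[test][:, idx])
        out[name] = mean_squared_error(y[test], pred) ** 0.5
    return out
```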
For the two methods involving stability selection we experiment with values of πthr from the set {0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50}. The results for various values of πthr for methods M3 and M4 using standard subsampling and the randomized lasso are given in Table 2, whereas using the complementary sampling gives the results of Table 3.
| πthr | Number of features | Results for method M3 | Results for method M4 |
|---|---|---|---|
| 0.20 | 117.4 (6.2) | 0.722 (0.017) | 0.716 (0.017) |
| 0.25 | 86.8 (5.2) | 0.720 (0.016) | 0.715 (0.016) |
| 0.30 | 64.7 (4.1) | 0.719 (0.017) | 0.715 (0.017) |
| 0.35 | 45.3 (4.1) | 0.716 (0.016) | 0.715 (0.017) |
| 0.40 | 27.3 (3.8) | 0.714 (0.016) | 0.713 (0.016) |
| 0.45 | 17.7 (1.9) | 0.712 (0.016) | 0.710 (0.016) |
| 0.50 | 11.4 (1.6) | 0.714 (0.019) | 0.713 (0.019) |
| πthr | Number of features | Results for method M3c | Results for method M4c |
|---|---|---|---|
| 0.20 | 116.5 (4.4) | 0.721 (0.017) | 0.715 (0.017) |
| 0.25 | 83.8 (3.0) | 0.720 (0.017) | 0.715 (0.017) |
| 0.30 | 62.4 (3.6) | 0.718 (0.017) | 0.714 (0.016) |
| 0.35 | 44.2 (3.2) | 0.717 (0.015) | 0.716 (0.016) |
| 0.40 | 27.4 (3.4) | 0.714 (0.015) | 0.713 (0.015) |
| 0.45 | 18.2 (1.7) | 0.714 (0.012) | 0.710 (0.013) |
| 0.50 | 11.8 (1.8) | 0.715 (0.014) | 0.713 (0.014) |
The results suggest that stability selection has not improved the generalization ability of the resulting regressors, though clearly the lasso methods outperform ridge regression. The performance is remarkably stable across different values of πthr despite the number of stable variables undergoing an order of magnitude reduction.
The vote of thanks was passed by acclamation.
Tso‐Jung Yen (Academia Sinica, Taipei) and Yu‐Min Yen (London School of Economics and Political Science)
We congratulate the authors for tackling a challenging statistical problem with an effective and easily implementable method. Our comments and interest in the paper are as follows. First, the authors claim that the final result is insensitive to the tuning parameter λ. However, we have found that it may still be affected by the range of λ, particularly in the p≪m situation, where m is the subsampling size. In this situation, when λ→0, the subsampling estimation results of the lasso will approach those of ordinary least squares. Consequently, λmin→0 will lead to selection probabilities close to 1 for all k ∈ {1,…,p} with high probability.
Fig. 12 shows the estimated number of selected variables with the lasso when λmin varies. Suppose that we require E(V) ≤ 2; then, with λmin=10.11, the corresponding qΛ is already too large: substituting this value into equation (9), we obtain an infeasible value πthr=1.022>1.
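The infeasibility can be read off directly by rearranging the bound of theorem 1 (equation (9)) for πthr, assuming the bound in the form E(V) ≤ qΛ²/{(2πthr−1)p}:

```latex
E(V) \le \frac{q_\Lambda^{2}}{(2\pi_{\mathrm{thr}}-1)\,p}
\qquad\Longleftrightarrow\qquad
\pi_{\mathrm{thr}} \ge \frac{1}{2}\Bigl(1 + \frac{q_\Lambda^{2}}{p\,E(V)}\Bigr).
```

Any qΛ with qΛ² > p E(V) therefore forces πthr>1; with p=10 and E(V)=2 this happens as soon as qΛ exceeds √20≈4.47, which is consistent with the value observed at λmin=10.11.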

Fig. 12. Boxplots for the estimated number of selected variables (average values of 200 subsampling estimations of qΛ): the data set used is the diabetes data presented in Efron et al. (2004), with p=10 and n=442; the method used is the lasso and λmin is varied at nine different levels

Given E(V) ≤ 2, πthr=0.9 and p=10, we have q*=4. The estimation results with the randomized lasso are shown in Fig. 13, which indicates that only paths falling in region H (two variables) are selected.

Fig. 13. Result of stability selection for the diabetes data with πthr=0.9, p=10 and q*=2.828: the method used is the randomized lasso with Wk ∈ [0.5,1]; the plot indicates that stability selection will only select variables with paths falling in region H
Secondly, in addition to E(V)/p, we may also be interested in controlling the false discovery rate. Conventionally, this quantity may be approximated by the ratio of the expected number of false selections to the expected total number of selections, but it is unknown whether such an approximation works well in regression‐based variable selection.
Finally, we are interested in the relationship between qΛ and the degrees of freedom of the lasso (dflasso). As indicated in Efron et al. (2004) and Zou et al. (2007), the number of non‐zero coefficients of the lasso fit is an unbiased estimate of dflasso, whereas qΛ is the expected number of selected variables by definition. They are different: dflasso relies on a single λ and the whole sample, but qΛ depends on Λ and the subsample set I. We are wondering whether they will have some common features, and this may be useful to link the method with other traditional methods such as Cp, Akaike's information criterion and the Bayes information criterion.
Rajen Shah and Richard Samworth (University of Cambridge)
We congratulate the authors for their innovative and thought‐provoking paper. Here we propose a minor variant of the subsampling algorithm that is the basis of stability selection. Instead of drawing individual subsamples at random, we advocate drawing disjoint pairs of subsamples at random. This variant appears to have favourable properties.

- (a)
Letting VM be the number of falsely selected variables, E(VM) satisfies the same upper bound as in theorem 1 of the paper. Briefly, defining
the result corresponding to lemma 1 of the paper is
The arguments of lemma 2 and theorem 1 follow through since
. Thus we have the same error control as in the paper even for finite M, as well as the infinite subsampling case.
- (b)
Simulations suggest that we obtain a slight decrease in the Monte Carlo variance. A heuristic explanation is that, when n is even, each observation is contained in the same number of subsamples. This minimizes the sum of the pairwise intersection sizes of our subsamples.
- (c)
With essentially no extra computational cost, we obtain estimates of simultaneous selection probabilities, which can also be useful for variable selection; see Fan et al. (2009).
- (d)
If, in addition to the assumptions of theorem 1, we also assume that the distribution of the selection proportions is unimodal, we obtain improved bounds:
For a visual comparison between this bound and that of theorem 1, see Fig. 14. The improvement suggests that using sample splitting with this bound can lead to more accurate error control than using standard stability selection.
- (e)
This new bound gives guidance about the choice of M. For instance, when πthr=0.6, choosing M>52 ensures that the bound on E(VM) is within 5% of its limit as M→∞. When πthr=0.9, choosing M>78 has the same effect.

Fig. 14. Factor multiplying qΛ²/p, plotted against πthr, for each of the bounds: the bound of theorem 1 and the new bound with M=∞
Christian Hennig (University College London)
Stability selection seems to be a fruitful idea.
As is usually done with variable selection, the authors present it as a mathematical problem in which the task is to pick a few variables with truly non‐zero coefficients out of many variables with true βk=0. However, in reality we do not believe model assumptions to be precisely fulfilled, and in most cases we believe that the (in some sense) closest linear regression approximation to reality does not have any regression coefficients precisely equal to zero.
It is of course fine to have theory about the idealized situation with many zero coefficients, but in more realistic situations the quality of a variable selection method cannot be determined by considering models and data alone. It would be necessary to specify ‘costs’ for including or excluding variables with ‘small’ true βk, which may depend on whether we would rather optimize the predictive quality or rather favour models with small numbers of variables enabling simple interpretations. We may even be interested in stability of the selection in its own right. Accepting the dependence of the choice of a method on the aim of data analysis, it would be very useful for promising methods such as stability selection to have a ‘profile’ of potential aims for which this is particularly suited, or rather not preferable.
Considering the authors' remark at the end of Section 1.1: in Hennig (2010) it is illustrated in which sense the problem of finding the correct number of clusters s cannot be decided on the basis of models and data alone; moreover, in some simulation set‐ups given there it turns out that s is not necessarily estimated most stably if it is chosen by a subsampling method that looks for stable clusterings given s (based on ‘prediction strength’; Tibshirani and Walther (2005)).
Paul D. W. Kirk, Alexandra M. Lewin and Michael P. H. Stumpf (Imperial College London)
We consider stability selection when several of the relevant variables are correlated with one another. Like the authors, we are interested in variable relevance, rather than prediction; hence we wish to select all relevant variables.
To illustrate, we use a simulated example, which is similar to that of the authors, in which p=500 and n=50, the predictors are sampled from an N(0,Σ) distribution and the response is given by Y=Σi=1,…,8 Xi+ɛ, where ɛ is a zero‐centred Gaussian noise term with variance 0.1. Here Σ is the identity matrix except for the elements Σ1,2=Σ3,4=Σ4,5=Σ3,5=0.8 and their symmetrical counterparts. Thus two sets of predictors are correlated: {X1,X2} and {X3,X4,X5}.
The problem
For variables that are correlated with each other, different realizations of the simulation example above result in different stability paths; for example some realizations will stably select X1 with high probability but not X2, some will stably select X2 but not X1 (as in Fig. 15(a)) and others will select both variables with lower probability, and hence may not select either with sufficiently high probability to be chosen in the final analysis. In fact there is a clear relationship between the marginal selection probabilities for X1 and X2, as shown in Fig. 15(b), which shows these probabilities for 1000 realizations.

Fig. 15. (a) Stability path for a particular realization of the simulation example (showing the paths of X1 and X2, of X3, X4 and X5, of X6, X7 and X8, and of the irrelevant variables); (b) for 1000 realizations, selection probabilities for X1 and X2 at λ=0.25 estimated by using the authors’ subsampling method (the plot illustrates the density of these points in the 0–1 square, with lighter squares indicating higher density, showing a clear negative relationship); (c) using the same realization as in (a), the stability path when correlated variables are grouped together; (d) again using the same realization, a stability path by using the elastic net (with mixing parameter set to 0.2)
Solution 1
One approach is to use the lasso as before, but to calculate selection probabilities for sets of correlated predictors. Fig. 15(c) shows the stability paths for grouped predictors for the same realization as in Fig. 15(a), in which only one member of each correlated set would have been selected with high probability. Grouping them enables us to select the groups as required.
Solution 2
The obvious drawback to selection probabilities for groups is that the groups must be defined from the outset. We propose to use the elastic net (Zou and Hastie, 2005), which uses a linear combination of the lasso l1‐penalty and the ridge l2‐penalty. The l2‐penalty lets the algorithm include groups of correlated variables, whereas the l1‐penalty ensures that most regression coefficients are set to 0. We find that using marginal selection probabilities with the elastic net can give us all members of the correlated groups without defining them in advance, as shown in Fig. 15(d).
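A minimal sketch of this second solution, assuming scikit-learn's ElasticNet as the base learner (with mixing parameter l1_ratio=0.2, as in Fig. 15(d)):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def enet_selection_frequencies(X, y, alpha, l1_ratio=0.2, B=100, seed=0):
    """Fraction of random half-samples on which each variable receives
    a non-zero elastic net coefficient: one point on the stability path
    for a single value of the penalty `alpha`."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(B):
        I = rng.choice(n, size=n // 2, replace=False)
        fit = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                         max_iter=5000).fit(X[I], y[I])
        freq += fit.coef_ != 0
    return freq / B
```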
J. T. Kent (University of Leeds)
This has been a fascinating paper dealing, in particular, with the important problem of variable selection in regression. I have two simple questions about the methodology in this setting.
First, if we are willing to assume joint normality of the Y,X data, then all the information in the data will be captured by the sufficient statistics, namely the first two sample moments, together with the sample size n. Presumably there is no need to resample from the data in this situation; in principle, inferences could be made analytically from the set of sample correlations between the variables, though in practice a version of the parametric bootstrap might be used. More generally, the use of resampling methods seems to carry with it an implicit assumption or accommodation of non‐normality and leads to the question of how the methodology of the paper will be affected by different types of non‐normality.
Second, I am not entirely clear what happens under approximate collinearity between the explanatory variables. In the conventional forward search algorithm in regression analysis, we are often faced with the situation where two variables x1 and x2 have similar explanatory power. If x1 is in the model, then there is no need to include x2; conversely, if x2 is in the model there is no need to include x1. If I understand your procedure correctly, you will tend to include x1 half the time and x2 half the time, leading to stability probabilities of about 50% each. If so, you might falsely conclude that neither variable is needed.
Axel Gandy (Imperial College London)
I congratulate the authors on their stimulating paper. The following comments concern the practical implementation of selecting stable variables.
The paper defines the set of stable variables in expression (7) as those k for which the maximal selection probability exceeds a fixed threshold 0<πthr<1. In practice, these selection probabilities, and therefore also the set of stable variables, cannot be evaluated explicitly. The standard way of approximating them via Monte Carlo simulation is to generate J independent subsamples Ij of {1,…,n} of size ⌊n/2⌋ each and to approximate each selection probability by the relative frequency with which the variable is selected over the J subsamples. The true selection probability and its Monte Carlo estimate can be on different sides of the threshold πthr, leading potentially to a different set of stable variables.
In Gandy (2009) a sequential algorithm with guaranteed error bounds is proposed. Applied to the situation of the present paper, the algorithm will produce an estimate of each selection probability with a bound on the probability of the estimate being on a different side of πthr from the true selection probability. More precisely, for an (arbitrarily small) ɛ>0,

denotes the probability distribution of the simulation conditionally on the observed data.
Besides this guaranteed performance, the algorithm in Gandy (2009) is a sequential algorithm and will come to a decision based on only a small number of samples if the selection probability is far from the threshold πthr.

One may also want such a guarantee simultaneously over all regularization parameters λ ∈ Λ and all variables k. This can be accomplished by running the algorithm of Gandy (2009) for each λ and k with the Bonferroni corrected threshold ɛ/(p#Λ). These can be run in parallel using the same subsamples Ij. The Bonferroni correction would be conservative. Devising a less conservative correction could be a topic for further research.
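For reference, the fixed-J Monte Carlo approximation discussed at the start of this comment can be sketched as follows (`select` stands for a generic base selection procedure and is our naming); Gandy (2009) replaces the fixed J by a sequential stopping rule with a guaranteed resampling risk.

```python
import numpy as np

def pi_hat(X, y, select, J=100, seed=0):
    """Relative frequency with which each variable is selected over J
    independent subsamples of size floor(n/2): a fixed-J Monte Carlo
    approximation of the selection probabilities."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(J):
        I = rng.choice(n, size=n // 2, replace=False)
        for k in select(X[I], y[I]):
            counts[k] += 1
    return counts / J
```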
Howell Tong (London School of Economics and Political Science)

- (a)
treating the problem as one of smoothing;
- (b)
focusing on structural stability rather than variable selection;
- (c)
treating the regularization parameters as some hyperparameters in a Bayesian framework.
Finally, I have a minor question. Have the authors tried to use their stability selection on model selection in time series modelling?
Chris Holmes (Oxford University)
The authors are to be congratulated on a ground breaking paper. The following comments are made from the perspective of a casual Bayesian observer.
Consider the loss that is incurred in taking actions, a(·), using a selected subset Sk when nature is really in Sk′. An optimal way to proceed then is according to the principle of maximum expected utility, i.e. to choose the action so as to minimize your expected loss, where the posterior mass assigned to subset S characterizes all the information about the unknown state of nature. Such an approach provides provably coherent decision making in the face of uncertainty and is prescriptive in how to take actions by using variable selection; see Lindley (1968) and Brown et al. (1999) for instance. It feels to me that crisp variable selection without incorporation of utility is a little like breaking the eggs without making the omelette. The job is only half done—with apologies to Savage (1954) for the analogy.
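Written out, the recipe is the standard expected-loss minimization (the notation is ours, not the paper's):

```latex
a^{*} \;=\; \arg\min_{a} \sum_{S} L(a, S)\, \pi(S \mid \mathrm{data}),
```

where the sum ranges over candidate subsets S and π(S | data) is the posterior mass referred to above.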
Crisp variable subset selection is rare but much more common is for the statistician to work with the owner of the data to determine the relevance of the measured variables and to understand better the dependence structures within the data, i.e. as part of a dialogue whereby statistical evidence is combined with expert judgement. On the one hand the Bayesian works with the posterior distribution
, which is a function of the data and a prior distribution which captures any rich domain knowledge that may exist; on the other hand stability selection reports
, which is a function of the data and the algorithm. It seems to the casual Bayesian observer that the former is more objective while providing a formal mechanism to incorporate any domain knowledge which might exist about the problem.
The following contributions were received in writing after the meeting.
Ismaïl Ahmed and Sylvia Richardson (Imperial College London)
The object of this contribution is to discuss further the vitamin example that was provided by the authors. This example is given to ‘see how the lasso and the related stability path cope with noise variables’. It shows that, on the basis of a graphical analysis of the stability path, we can select five of the six unpermuted genes whereas, with the lasso path, the situation seems to be much less clear.
Thanks to the authors, we had the opportunity to reanalyse the vitamin data set that was used in the paper. The first thing that we would like to remark is that by performing a simple univariate analysis, i.e. by using each of the 4088 genes one at a time and then adjusting the corresponding p‐values for multiplicities at a 5% level for the false discovery rate, we also pick up five of the six unpermuted covariates. The results are illustrated by Fig. 16, which also shows that there is an important discrepancy between the first five q‐values and the remaining values.

Fig. 16. Estimated q‐values according to the p‐values resulting from the 4088 univariate analyses: the estimated q‐values are obtained with the location‐based estimator (Dalmasso et al., 2005); for a false discovery rate of 0.05 or lower, five null hypotheses are rejected
Furthermore, we also performed a standard multivariate regression analysis restricted to the six unpermuted covariates, thus removing all the noise variables. The results, which are illustrated in Table 4, show that only one unpermuted gene is associated with a p‐value that is less than 0.05 and that three unpermuted genes have a p‐value that is less than 0.10. Consequently, it seems unclear whether any multivariate selection method could or should pick up more than these three variables. And indeed, when applying the shotgun stochastic search algorithm of Hans et al. (2007) to the whole data set with 20000 iterations, no more than these three variables could possibly be selected with regard to their posterior importance measure (as defined in equation (2) of Hans et al. (2007)) over the 100 000 top visited models.
| Parameter | Estimate | Standard error | t‐value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | −7.7107 | 0.1782 | −43.27 | 0.0000 |
| X1407 | −0.1221 | 0.1912 | −0.64 | 0.5246 |
| X1885 | 0.6665 | 0.3888 | 1.71 | 0.0894 |
| X3228 | −0.1094 | 0.2716 | −0.40 | 0.6880 |
| X3301 | 0.4697 | 0.2750 | 1.71 | 0.0905 |
| X3496 | 0.6183 | 0.3077 | 2.01 | 0.0470 |
| X3803 | −0.1729 | 0.3271 | −0.53 | 0.5982 |
It thus seems to us puzzling that, on this example, stability selection behaves more like a univariate approach than a multivariate one.
Phil Brown and Jim Griffin (University of Kent, Canterbury)

We have also given a full Bayesian analysis in Griffin and Brown (2010) illustrating the limitations of straight lasso penalization using another robustness prior, the variance gamma prior.
David Draper (University of California, Santa Cruz)
I have two comments on this interesting and useful paper.
- (a)
I am interested in pursuing connections with Bayesian ideas beyond the authors’ mention of the Barbieri and Berger (2004) results. I reinterpret three of the concepts in the present paper in Bayesian language.
- (i)
Frequentist penalization of the log‐likelihood to regularize the problem is often equivalent to Bayesian choice of a prior distribution (for example, think of the l1‐norm penalty term in the lasso in the paper's equation (2) as a log‐prior for β; might there be an even better prior to achieve the goals of this paper?).
- (ii)
Under the assumption that the rows Z(i) of the data matrix are sampled independently from the underlying data‐generating distribution, resampling the existing data is like sampling from the posterior predictive distribution (given the data seen so far) for future rows in the data matrix (and of course, if we already had enough such rows, it would no longer be true that p≫n and the good predictors would be more apparent).
- (iii)
When estimated via resampling, stability paths are just Monte Carlo approximations to expectations of the indicator function, for inclusion of a given variable, with respect to the posterior distribution for the unknowns in the model.
-
I bring this up because it has often proved useful in the past to figure out what is the Bayesian model (both prior and likelihood) for which an algorithmic procedure, like that described in this paper, is approximately a good posterior summary of the quantities of interest, because even better procedures can then be found; a good example is the reverse engineering of neural networks from a Bayesian viewpoint by Lee (2004). Can the authors suggest the next steps in this algorithm‐to‐Bayes research agenda applied to their procedure?
- (b)
The authors have cast their problem in inferential language, but the real goal of structure discovery is often decision making (for instance, a drug company trying to maximize production of riboflavin in the example in Section 2.2 will want to decide in which genes to encourage mutation; money is lost both by failing to pursue good genes and by pursuing bad ones); moreover, when the problem is posed inferentially there is no straightforward way to see how to trade off false positive against false negative selections, whereas this is an inescapable part of a decision‐making approach. What does the authors’ procedure look like in a real world problem when it is optimized for making decisions, via maximization of expected utility?
Zhou Fang (Oxford University)
The authors suggest an interesting and novel way of enhancing variable selection methods and should be congratulated for their contribution. However, recalculation with subsamples can be computationally costly. Here, a heuristic is suggested that may allow some benefits of stability selection with the lasso without resampling.
,
being the design matrix and coefficient estimates for the selected variables
,


Consider a simulation. Using n=50 and p=100, generate (Z(1),…,Z(p),V) independently standard normal and set X(k)=Z(k)+V. We set Y=Xβ+ɛ, with β=(1,1,1,1,1,0,…,0)T and ɛ independent Gaussian noise. Fig. 17 shows results from various variable selection techniques.

Fig. 17. (a) Lasso, (b) stability selection, (c) perturbation and (d) perturbation probability paths on a simulated data set (full curves, true covariates; broken curves, irrelevant covariates)
In Fig. 17(a), V creates correlation in the predictors, hence generating false positive selections. For high regularization, these spurious fits rival the true fits in magnitude. As suggested in the paper, stability selection is more effective at controlling false positive selections.
of A, as defined by


This approximates the stability selection path, except for low regularization, where failure to consider cases of unselected variables being selected under a perturbation means that we overestimate stability. However, this calculation, even inefficiently implemented, took 0.5 s to compute, compared with 14 s for 200 resamples of stability selection.
Torsten Hothorn (Ludwig‐Maximilians‐Universität München)
Stability selection brings together statistical error control and model‐based variable selection. The method, which controls the probability of selecting—and thus potentially interpreting—a model containing at least one non‐influential variable, will increase the confidence in scientific findings obtained from high dimensional or otherwise complex models.
The idea of studying the stability of the variable selection procedure applied to a specific problem by means of resampling is simple and easy to implement. And the authors point out that this straightforward approach has actually been used much earlier. The first reference that I could find is a paper on model selection in Cox regression by Sauerbrei and Schumacher (1992). Today, multiple‐testing procedures utilizing the joint distribution of the estimated parameters can be applied in such low dimensional models for variable and structure selection under control of the familywise error rate (Haufe et al. (2010) present a nice application to multivariate time series). With their theorem 1, Nicolai Meinshausen and Peter Bühlmann now provide the means for proper error control also in much more complex models.
Two issues seem worth further attention to me: the exchangeability assumption that is made in theorem 1 and the prediction error of models fitted by using only the selected variables. One popular approach for variable selection in higher dimensions is based on the permutation variable importance measure that is used in random forests. Interestingly, it was found by Strobl et al. (2008) that correlated predictor variables receive a higher variable importance than is justified by the data‐generating process. The reason is that exchangeability is (implicitly) assumed by the permutation scheme that is applied to derive these variable importances. The problem can be addressed by applying a conditional permutation scheme and I wonder whether a more elaborate resampling technique taking covariate information into account might allow for a less strong assumption for stability selection as well.
Concerning my second point, the simulation results show that stability selection controls the number of falsely selected variables. I wonder how the performance (measured by the out‐of‐sample prediction error) of a model that is fitted to only the selected variables compares with the performance of the underlying standard procedure (including a cross‐validated choice of hyperparameters). If the probability that an important variable is missed by stability selection is low, there should not be much difference. However, if stability selection is too restrictive, I would expect the prediction error of the underlying standard model to be better. This would be another hint that interpretable models and high prediction accuracy might not be achievable at the same time.
Chenlei Leng and David J. Nott (National University of Singapore)
We congratulate Meinshausen and Bühlmann on an elegant piece of work which shows the usefulness of introducing additional elements of randomness into the lasso and other feature selection procedures through subsampling and other mechanisms. It is now well understood that certain restrictive assumptions (Zhao and Yu, 2006; Wainwright, 2009) must be imposed on the design matrix for the lasso to be a consistent model selector, although adaptive versions of the lasso can circumvent the problem (Zou, 2006). However, as convincingly pointed out by Meinshausen and Bühlmann, by considering multiple sparse models obtained from perturbations of the original feature selection problem the performance of the original lasso, which uses just a single fit, can be improved.
We believe that a Bayesian perspective has much to offer when thinking about randomized versions of the lasso. We offer two alternative approaches, where the randomness comes from an appropriate posterior distribution.
- (a)
Our first approach puts a prior on the parameters in the full model. Given a draw of the parameters, say β* from the posterior distribution, we consider projecting the model that is formed by this realization onto subspaces defined via some form of l1‐constraint on the parameters. Defining the loss function as the expected Kullback–Leibler divergence between this model and its projection, we use any of the following constraints on the subspace
inspired by the lasso and adaptive lasso penalty respectively. Owing to the l1‐penalty, in the posterior distribution of the projection there is positive probability that some parameters are exactly 0 and the posterior distribution on the model space that is induced by the projection allows exploration of model uncertainty. This idea is discussed in Nott and Leng (2010) and extends a Bayesian variable selection approach of Dupuis and Robert (2003) which considers projections onto subspaces that are defined by sets of active covariates.
- (b)
In on‐going work, we consider the following adaptive lasso (Zou, 2006):
In comparison with the usual methods, which determine a single estimate of the tuning parameters in expression (35) (Zou, 2006; Wang et al., 2007), we generalize the Bayesian lasso method in Park and Casella (2008) to produce a posterior sample of
, which is denoted as
. For each b, we plug
into expression (35), which gives a sparse estimate β*b of β. The estimated parameters
can then be used for prediction and assessing model uncertainty. This is very much like the randomized lasso of Meinshausen and Bühlmann, but the randomness enters very naturally through a posterior distribution on hyperparameters. Our preliminary results show that this approach works competitively in prediction and model selection compared with the lasso and adaptive lasso.
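A sketch of the plug-in step under these assumptions: `weight_sampler` is a placeholder for a draw from the posterior over per-variable penalty weights, and the column rescaling lets a standard lasso solver be reused for each draw.

```python
import numpy as np
from sklearn.linear_model import Lasso

def posterior_weighted_lasso(X, y, weight_sampler, B=50, lam=0.1, seed=0):
    """For each of B posterior draws of positive weights w (shape (p,)),
    solve the weighted lasso by rescaling columns and mapping the
    solution back; returns the B sparse coefficient vectors."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    betas = np.zeros((B, p))
    for b in range(B):
        w = weight_sampler(rng)                   # placeholder posterior draw
        fit = Lasso(alpha=lam, max_iter=5000).fit(X / w, y)
        betas[b] = fit.coef_ / w                  # undo the column rescaling
    return betas
```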
Rebecca Nugent, Alessandro Rinaldo, Aarti Singh and Larry Wasserman (Carnegie Mellon University,Pittsburgh)
Meinshausen and Bühlmann argue for using stability‐based methods. We suspect that the methods that are introduced in the current paper will generate much interest.
Stability methods have gained popularity lately. See Lange et al. (2004) and Ben‐Hur et al. (2002) for example. There are cases where stability can lead to poor answers (Ben‐David et al., 2006). Some caution is needed.
General view of stability
Let
be some class of procedures indexed by a tuning parameter h. We think of larger h as corresponding to larger bias. Our view of the stability approach is to use the least biased procedure subject to having an acceptable variability. This has a Neyman–Pearson flavour to it since we optimize what we cannot control subject to bounds on what we can control. The advantage is that variance is estimable whereas bias, generally, is not. There is no notion of approximating the ‘truth’ so it is not required that the model be correct. In contrast, Meinshausen and Bühlmann seem to be more focused on finding the ‘true structure’.
from X (with bandwidth h) and construct a kernel density estimator
from Y. Define the instability by

is the empirical distribution based on Z. Under certain conditions, Rinaldo and Wasserman (2010) showed the following theorem.
Theorem 3. Let h* be the diameter of {p>λ} and let d be the dimension of the support of Xi. Then:
- (a)
Ξ(0)=0 and Ξ(h)=0 for all h ≥ h*;
- (b)
;
- (c)
As
;
- (d)
for each h ∈ (0,h*),
for constants c1 and c2.
True structure?
The authors spend time discussing the search for true structure. In general, we feel that there is too much emphasis on finding true structure. Consider the linear model. It is a virtual certainty that the model is wrong. Nevertheless, we all use the linear model because it often leads to good predictions. The search for good predictors is much different from the search for true structure. The latter is not even well defined when the model is wrong, which it always is.
Adam J. Rothman, Elizaveta Levina and Ji Zhu (University of Michigan, Ann Arbor)
We congratulate the authors on developing a clever and practical method for improving high dimensional variable selection, and establishing an impressive array of theoretical performance guarantees. We are particularly interested in stability selection in graphical models, which is illustrated with one brief example in the paper. To investigate the performance of stability selection combined with the graphical lasso a little further, we performed the following simple simulation. The data are generated from the Np(0,Ω^{−1}) distribution, where Ωii=1, Ωi,i−1=Ωi−1,i=0.3 and the rest are 0. We selected p=30 and n=100, and performed 50 replications. Stability selection with pointwise control was implemented with bootstrap samples of size n/2 drawn 100 times.
We selected four different values of the tuning parameter λ for the graphical lasso, which correspond to the marked points along the receiver operating characteristic (ROC) curves for the graphical lasso in Fig. 18. The ROC curve showing false positive and true positive rates of detecting 0s in Ω for the graphical lasso was obtained by varying the tuning parameter λ and averaging over replications. For each fixed λ, we applied stability selection varying πthr within the recommended range of 0.6–0.9, which resulted in an ROC curve for stability selection. The ROC curves show that stability selection reduces the false positive rate, as it should, and shifts the graphical lasso result down along the ROC curve; essentially, it is equivalent to the graphical lasso with a larger λ. Figs 18(a) and 18(b) have λs which are too small, and stability selection mostly improves on the graphical lasso result, but it does appear somewhat sensitive to the exact value of λ: if λ is very small (Fig. 18(a)), stability selection only improves on the graphical lasso for large values of πthr. In Figs 18(c) and 18(d), λ is just right or too large, and then applying stability selection makes the overall result worse. This example confirms that stability selection is a useful computational tool to improve on the false positive rate of the graphical lasso when tuning over the full range of λ is more expensive than doing bootstrap replications. However, since it does seem somewhat sensitive to the choice of a suitable small λ, it seems that combining it with some kind of initial crude cross‐validation could result in even better performance. It would be interesting to consider whether there are particular types of the inverse covariance matrix that benefit from stability selection more than others, and whether any theoretical results can be obtained specifically for such structures; in particular, it would be interesting to know whether stability selection can perform better than the graphical lasso with oracle λ.
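A minimal sketch of this simulation, assuming scikit-learn's GraphicalLasso as a stand-in for the graphical lasso, and random half-samples in place of the bootstrap that is used above:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def edge_selection_frequencies(p=30, n=100, lam=0.06, B=100, seed=0):
    """Simulate X ~ N_p(0, Omega^{-1}) with the tridiagonal precision
    matrix described above (1 on the diagonal, 0.3 on the first
    off-diagonals) and estimate how often the graphical lasso selects
    each edge over B random half-samples."""
    rng = np.random.default_rng(seed)
    Omega = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
    X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
    freq = np.zeros((p, p))
    for _ in range(B):
        I = rng.choice(n, size=n // 2, replace=False)
        prec = GraphicalLasso(alpha=lam, max_iter=200).fit(X[I]).precision_
        freq += np.abs(prec) > 1e-8
    return freq / B
```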

Fig. 18. Graphical lasso ROC curve and four stability selection ROC curves obtained by varying πthr from 0.6 to 0.9 for fixed values of λ of (a) 0.01, (b) 0.06, (c) 0.23 and (d) 0.40: × marks the point on the graphical lasso ROC curve corresponding to the fixed λ
A. B. Tsybakov (Centre de Recherche en Economie et Statistique, Université Paris 6 and Ecole Polytechnique,Paris)
I congratulate the authors on a thought‐provoking paper, which pioneers many interesting ideas. My question is about the comparison with other selection methods, such as the adaptive lasso or thresholded lasso (TL). In the theory these methods have better selection properties than those stated in theorem 2. For example, consider the TL, which thresholds the lasso estimator at the level cτ, where the lasso estimator is computed with λ as in Bickel et al. (2009), τ=√{log(p)/n} and c>0 is such that the lasso estimation error is at most cτ with high probability under the restricted eigenvalue (RE) condition of Bickel et al. (2009). Then a two‐line proof using expression (7.9) in Bickel et al. (2009) shows that, with the same probability, under the RE condition the TL selects S correctly whenever mink∈S|βk|>Cs^{1/2}τ for some C>0 depending only on σ² and the eigenvalues of X′X/n. Since also c depends only on X and σ² (see Bickel et al. (2009)), c can be evaluated from the data. The RE condition is substantially weaker than assumption 1 of theorem 2, and mink∈S|βk| need not be as large as C′s^{3/2}τ, as required in theorem 2. We may interpret this as saying that stability selection is successful if the relevant βk are very large and the Gram matrix is very nice, whereas for smaller βk and less diagonal Gram matrices it is safer to use the TL. Of course, here we compare only the ‘upper bounds’, but it is not clear why stability selection does not achieve at least similar behaviour to that of the TL. Is it only technical or is there an intrinsic reason?
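For concreteness, a hedged sketch of the TL as described: the constant c and the noise level σ are placeholders that must be calibrated as in Bickel et al. (2009), and the penalty is only of the theoretical order.

```python
import numpy as np
from sklearn.linear_model import Lasso

def thresholded_lasso(X, y, c=1.0, sigma=1.0):
    """Fit the lasso at a penalty of order sigma * sqrt(log(p)/n) and
    keep the coefficients exceeding c * tau, tau = sqrt(log(p)/n)."""
    n, p = X.shape
    tau = np.sqrt(np.log(p) / n)
    beta = Lasso(alpha=2 * sigma * tau, max_iter=5000).fit(X, y).coef_
    return {k for k in range(p) if abs(beta[k]) > c * tau}
```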
Cun‐Hui Zhang (Rutgers University, Piscataway)
I congratulate the authors for their correct call for attention to the utility of randomized variable selection and great effort in studying its effectiveness.
In variable selection, a false variable may have a significant observed association with the response variable by representing a part of the realized noise through luck or by correlating with the true variables. A fundamental challenge in such structure estimation problems with high dimensional data is to deal with the competition of many such false variables for the attention of a statistical learning algorithm.
The solution proposed here is to simulate the selection probabilities of each variable with a randomized learning algorithm and to estimate the structure by choosing the variables with high simulated selection probabilities. The success of the proposed method in the numerical experiments is very impressive, especially in some cases at a level of difficulty that has rarely been touched on earlier. I applaud the authors for raising the bar for future numerical experiments in the field.
On the theoretical side, the paper considers two assumptions to guarantee the success of the method proposed:
- (a)
many false variables compete among themselves at random so each false variable has only a small chance of catching the attention of the randomized learning algorithm;
- (b)
the original randomized learning algorithm is not worse than random guessing.
The first assumption controls false discoveries whereas the second ensures a certain statistical power of detecting the true structure. Under these two assumptions, theorem 1 asserts in a broad context the validity of an upper bound for the total number of false discoveries. This result has the potential for an enormous influence, especially in biology, text mining and other areas that are overwhelmed with poorly understood large data.
Because of the potential for great influence of such a mathematical inequality in the practice of statistics, possibly by many non‐statisticians, we must proceed with equally great caution. In this spirit, I comment on the two assumptions as follows.

where I is the identity and 1 denotes matrices of proper dimensions with 1 for all entries. I wonder whether such an assumption could be tested.
Assumption (b) may not always hold for the lasso. For qΛ<|S|, a counterexample seems to exist with
, where XS and ZN are independent standard normal vectors and the components of ρS are of the same sign as those of β.
Hui Zou (University of Minnesota,Minneapolis)
I congratulate Dr Meinshausen and Professor Bühlmann on developing stability selection for addressing the difficult problem of variable selection with high dimensional data. Stability selection is intuitively appealing, general and supported by finite sample theory.
Regularization parameter selection in sparse learning is often guided by model comparison criteria such as the Akaike information criterion and the Bayes information criterion, in which a measure of prediction accuracy is a crucial component. It is quite intriguing to see that stability selection directly targets variable selection without using any prediction measurement. The advantage of stability selection is well demonstrated by theorem 1, in which inequality (9) controls the number of false selections. In the context of variable selection, inequality (9) is very useful when the number of missed true variables is small. In an ideal situation we wish to have S⊆Ŝ^stable while controlling the number of false selections. An interesting theoretical problem is whether a non-trivial lower bound could be established for P(S⊆Ŝ^stable).
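For reference, the control on the expected number E(V) of false selections that inequality (9) of theorem 1 provides can be displayed as

```latex
E(V) \;\le\; \frac{1}{2\pi_{\mathrm{thr}} - 1} \, \frac{q_\Lambda^2}{p},
```

where πthr is the selection threshold and qΛ the (average) number of variables selected by the base procedure.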
To see how well stability selection identifies relevant variables, we did some simulations in which stability selection was applied to sure independence screening (SIS) (Fan and Lv, 2008). In the linear regression case, SIS picks the top d≪p variables that have the highest correlation with the response variable. Denote by Ŝ^SIS_d the SIS selector with reduced dimension d; then q=d, and the bound of theorem 1 applies with qΛ=d. We need to consider only pointwise control, because the selection probability of each variable is monotonically increasing with increasing d. The simulation model is y=Σj βjxj+ε, with ε a standard normal error, (x1,…,xp) independent and identically distributed N(0,1), β1=β2=…=β10=1 and βj=0 for j>10. Following Meinshausen and Bühlmann, we let d=⌊√{(2πthr−1)p}⌋ to guarantee E(V)≤1. Moreover, in the p≫n setting SIS discovers all relevant variables with very high probability (Fan and Lv, 2008). For the simulation study we considered an asymptotic high dimension setting (p=20 000; n=800) and a more realistic high dimension setting (p=4000; n=200).
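As a concrete illustration, here is a minimal sketch of this simulation under our reading of the set-up above (not Zou's code; the marginal correlation scoring for SIS, the number of subsamples and the seeds are illustrative choices).

```python
import numpy as np

def sis(X, y, d):
    """Sure independence screening: keep the indices of the d variables with
    the largest absolute marginal association with the response."""
    scores = np.abs(X.T @ (y - y.mean()))  # proportional to |correlation| for standardized X
    return np.argsort(scores)[-d:]

def sis_stability(X, y, pi_thr, n_subsamples=100, seed=0):
    """Stability selection on top of SIS, with the reduced dimension
    d = floor(sqrt((2 * pi_thr - 1) * p)) so that E(V) <= 1 by theorem 1."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    d = int(np.sqrt((2 * pi_thr - 1) * p))
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample of size n/2
        counts[sis(X[idx], y[idx], d)] += 1
    return np.flatnonzero(counts / n_subsamples >= pi_thr)

# The 'more realistic' setting: p = 4000, n = 200, beta_1 = ... = beta_10 = 1.
rng = np.random.default_rng(1)
n, p = 200, 4000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)
print(sis_stability(X, y, pi_thr=0.6))
```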
Table 5 summarizes the simulation results. First of all, in all four cases the number of false selections by stability selection is much smaller than 1, the nominal upper bound. For the case of p=20 000 and n=800, both SIS and stability selection select all true variables; in particular, stability selection with πthr=0.9 achieves perfect variable selection in all 100 replications. When p=4000 and n=200, SIS still has a reasonably low missing rate (less than 5%), but stability selection with πthr=0.6 and πthr=0.9 selects only about six and three relevant variables respectively. This performance is not very satisfactory. From this example we also see that, with finite samples, the choice of πthr can have a significant effect on the missing rate of stability selection, although its effect on the number of false discoveries is almost negligible.
Table 5. Average numbers of true and false selections over 100 replications

| πthr | d | True positives, SIS | True positives, stability selection | False positives, stability selection |
|---|---|---|---|---|
| p=20 000, n=800 | | | | |
| 0.6 | 63 | 10 | 10 | 0.22 |
| 0.9 | 126 | 10 | 10 | 0 |
| p=4000, n=200 | | | | |
| 0.6 | 28 | 9.52 | 5.91 | 0.09 |
| 0.9 | 56 | 9.68 | 3.23 | 0.01 |
The authors replied later, in writing, as follows.
We are very grateful to all the discussants for their many insightful and inspiring comments. Although we cannot respond in a brief rejoinder to every issue that has been raised, we present some additional thoughts relating to the stimulating contributions.
Connections to Bayesian approaches
Richardson, Brown and Griffin, Draper, and Leng and Nott discuss interesting possible connections between stability selection (or other randomized selection procedures) and Bayesian approaches with appropriately chosen priors. The randomized lasso has the most immediate relation, as pointed out by Brown and Griffin in connection with their interesting paper (Griffin and Brown, 2007). They also raise the question whether subsampling is then still necessary. Although we do not have a theoretical answer here, subsampling seems in practice to improve a randomized procedure (or the equivalent Bayesian counterpart). Nor are we ‘throwing away real data’ with subsampling, since the final selection probabilities over subsampled data are U-statistics of order ⌊n/2⌋ and use all n samples, not just a random subset. Stability selection is closely related to bagging (Breiman, 1996), as pointed out by Richardson, but it aggregates selection outcomes rather than predictions and assigns an error rate via our theorem 1. The approach of Nott and Leng (2010) seems very interesting in the context of Bayesian variable selection.
Bayesian decision theoretic framework
Draper and Holmes point out that the Bayesian framework lends itself naturally to decision theoretic analysis. And, indeed, this is one of the advantages of Bayesian statistics. In our examples of biomarker discovery and, more generally, of variable selection (and also, for example, graphical modelling), the workflow consists of two steps. The aim is to obtain
- (a)
a good ranked list of potentially interesting markers or variables which we
- (b)
then need to cut off at some position in the list.
Although a decision theoretic analysis is mostly helpful in step (b), stability selection potentially improves both (a) and (b). In the frequentist set-up, the issue of where to cut the list in step (b) involves the choice of an acceptable type I error rate. This choice is perhaps not as satisfying as a full decision theoretic treatment, but it is often useful in practice: each ‘discovery’ needs to be validated by further experiments, which are often very costly, and the chosen framework aims to maximize the number of true discoveries under a given budget that can be spent on falsely chosen variables or hypotheses.
Exchangeability assumption
Shawe-Taylor and Sun, and Zhang raise, very legitimately, the question of whether the exchangeability assumption in theorem 1 is too stringent. As we wrote in the paper, the results do seem to hold up very well for real data sets where the assumption is likely to be violated (and theorem 2 does not make use of the strong exchangeability assumption). It is perhaps also worth mentioning that the assumptions can be weakened considerably for specific applications. For the special case of high dimensional linear models, we worked out a related solution in follow-up work (Meinshausen et al., 2009).
Tightness of bounds and two‐step procedures
Tsybakov correctly points out that sharper bounds on the l2-distance are available for the standard lasso and that these could be exploited for variable selection by hard thresholding of coefficients or by the adaptive lasso. The reasons for the looser results for the randomized lasso are, in our view, technical rather than intrinsic. The stability selection algorithm, which involves subsampling and randomization of covariates, is much more difficult to analyse, and this perhaps opens up interesting areas for further mathematical investigation. We thought it interesting, nevertheless, that the irrepresentable condition can be considerably weakened by randomizing the covariates instead of using two-step procedures such as hard thresholding or the adaptive lasso.
Power and false positive selections
Zou, Richardson, Shah and Samworth, and Rothman, Levina and Zhu examined the power of the method to detect important variables and compared it with alternative approaches for some examples. Although it is obviously true that no method will be universally ‘optimal’, stability selection places a strong emphasis on avoiding false positive selections. This is in contrast with, say, sure independence screening as used by Zou, which is a screening method (by name!) and sits at the opposite end of the spectrum, placing a large emphasis on power while accepting many false positive selections. For the simulation results of Zou, we suspect that sure independence screening would have a much larger false positive rate for p=4000, but we could not see it reported. Rothman, Levina and Zhu compare receiver operating characteristic curves for the example of graphical modelling. It is not entirely unexpected from our point of view that the gain of stability selection is very small or indeed non-existent, since the simulation takes place in a Toeplitz design which is very close to complete independence between all variables. For regression, it was already shown in the paper that stability selection cannot be expected to improve performance for independent or very weakly correlated variables. And our theorem 2 showed that we can expect major improvements only if the irrepresentable condition is violated, which has analogies in Gaussian graphical modelling (Meinshausen, 2008; Ravikumar et al., 2008).
Generalization performance and sparsity
Richardson, Shawe-Taylor and Sun, Tsybakov and Hothorn discuss the connection between generalization performance and sparsity of the selected set of variables. Hothorn mentions that achieving both optimal predictive accuracy and consistent variable selection might be very difficult, as manifested also in the Akaike information criterion versus Bayes information criterion dilemma for lower dimensional problems. Shawe-Taylor and Sun illustrate that stability selection will in general produce rather sparse models, which is in agreement with the discussion on false positive selections above. Their example also demonstrates, impressively, that predictive performance is sometimes compromised only very marginally when using much sparser models than those produced by the lasso under cross-validation. In general, stability selection will yield much sparser models than the lasso with cross-validation. How much predictive performance one is willing to sacrifice for higher sparsity of the model, if any, should be application driven. If the answer is ‘none’, stability selection might not be appropriate.
Approximating the true model
If the true model is not itself linear and sparse, a sensible sparse target of inference can still be defined, for example

S_approx = argmin{ |M| : ‖f*−fM‖^2 ≤ C2, M⊆{1,…,p} },   (37)

where fM is the best approximation of f* whose coefficient vector has non-zero components only in the set M⊆{1,…,p}. Here, C2 is a suitable positive number, typically depending on n, and we denote by f, f* and fM the n×1 vectors evaluated at the observed covariates. Clearly, if the true model is linear and sparse, with many regression coefficients equal to 0 and the few non-zero regression coefficients all sufficiently large, then the set S_approx in expression (37) equals the set S of the true active variables. Theorem 1 will remain valid under an appropriate exchangeability assumption for selection of variables in the complement of S_approx, which might or might not be realistic. The mathematical arguments for extending theorem 2 to such a setting seem to be more involved.
Correlated predictor variables
Kirk, Lewin and Stumpf, and Kent raise the issue of correlated predictor variables and examine the behaviour of stability selection for highly correlated designs. This is a very important discussion point. As mentioned above, stability selection puts a large emphasis on avoiding false positive selections and, as a consequence, might miss important variables if they are highly correlated with irrelevant variables. This is similar to the behaviour of a classical test for a regression coefficient in p≪n situations. For situations where we are more interested in whether there are interesting variables in a certain group of variables, the proposal of Kirk, Lewin and Stumpf to test stability of sets of variables (finding those sets possibly by the elastic net) seems very interesting and useful.
Numerical example of vitamin gene expression data
Ahmed and Richardson analyse our gene expression data set with several competing methods and come to the conclusion that at most three genes should be selected. They raise the question whether stability selection is selecting too many variables. However, as shown in the initial contribution to the discussion by Richardson, stability selection is in fact also selecting only three genes under reasonable type I error control. The methods seem to be in agreement here.
Choice of regularization
Yen and Yen mention that the number q of selected variables can grow very large for small regularization parameters λ and propose an interesting way to choose a suitable region for the regularization parameter. Instead of restricting λ to larger values, a useful alternative in practice is to select only the first q variables that appear when lowering the regularization; q can then be chosen a priori to yield non-trivial bounds in theorem 1, as illustrated below.
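As a small illustration of this point (a sketch assuming that the bound of theorem 1, E(V) ≤ qΛ²/{(2πthr−1)p}, is the operative constraint), the a priori choice of q for a target error level α is:

```python
import math

def max_q(p, pi_thr, alpha=1.0):
    """Largest number q of selected variables for which the theorem-1 bound
    E(V) <= q**2 / ((2 * pi_thr - 1) * p) stays at or below the target alpha."""
    return math.floor(math.sqrt(alpha * (2 * pi_thr - 1) * p))

# For example, p = 4000 and pi_thr = 0.9 give q = 56, matching the d used by Zou above.
print(max_q(4000, 0.9))
```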
Computational improvements
Gandy and Fang both propose interesting extensions that help to alleviate the computational challenge of having to fit a model on many subsamples of the data. An interesting alternative to the procedure proposed by Gandy is given by the improved bounds suggested by Shah and Samworth.
Dependent data
Tong notes that stability selection makes an inherent assumption of independence between observations. We have not yet tried to apply the method to dependent data such as time series, and the standard subsampling scheme will not be suitable under dependence. A block-based approach, with independent subsampling of blocks (where dependence is captured within blocks, at least approximately) along the lines of Künsch (1989), might be an interesting alternative to explore in this context.
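To make the block idea concrete, here is a minimal sketch of such a scheme; the non-overlapping blocks, the block length and the subsampled fraction are illustrative assumptions rather than a recommendation from the paper.

```python
import numpy as np

def block_subsample_indices(n, block_len, frac=0.5, seed=None):
    """Subsample whole contiguous blocks of consecutive observations, so that
    within-block dependence (e.g. of a time series) is approximately preserved;
    roughly a fraction `frac` of the blocks is kept, drawn without replacement."""
    rng = np.random.default_rng(seed)
    starts = np.arange(0, n - block_len + 1, block_len)  # non-overlapping block starts
    n_keep = max(1, int(frac * len(starts)))
    chosen = np.sort(rng.choice(starts, size=n_keep, replace=False))
    return np.concatenate([np.arange(s, s + block_len) for s in chosen])

# Example: keep about half of the length-20 blocks of a series of length 200.
print(block_subsample_indices(200, block_len=20, frac=0.5, seed=0))
```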
Connections to clustering and density cluster estimation
Nugent, Rinaldo, Singh and Wasserman, and Hennig provide fascinating connections to related ideas in clustering and density cluster estimation. As described in the paper, the consensus clustering of Monti et al. (2003) provides another interesting connection.
Related ideas
Richardson and Hothorn mention numerous related previous references. We tried to point out many connections to previous work but have inevitably missed important ones. It is perhaps worth emphasizing again the similarity, at a crude level, to the work of Bach (2008) on the bolasso, which was developed independently and simultaneously.
We thank all the contributors again for their many interesting and thoughtful comments, which have already opened up, and will continue to open up, new research in this area. We would like to convey special thanks to Rajen Shah and Richard Samworth, who spotted a mistake in the definition of the assumption ‘not worse than random guessing’ in an earlier version of the manuscript. Their improved bounds will also make stability selection less conservative and address John Shawe-Taylor's comment regarding the finite amount of random subsampling in practice versus our theoretical arguments corresponding to all possible subsamples. Finally, we thank the Royal Statistical Society and the journal for hosting this discussion.
References
- Akaike, H. (1980) Seasonal adjustment by a Bayesian modeling. J. Time Ser. Anal., 1, 1–13.
- Bach, F. (2008) Bolasso: model consistent Lasso estimation through the bootstrap. In Proc. 25th Int. Conf. Machine Learning, pp. 33–40. New York: Association for Computing Machinery.
- Barbieri, M. M. and Berger, J. O. (2004) Optimal predictive model selection. Ann. Statist., 32, 870–897.
- Belsley, D. A., Kuh, E. and Welsch, R. E. (1980) Regression Diagnostics, ch. 2. New York: Wiley.
- Ben-David, S., von Luxburg, U. and Pál, D. (2006) A sober look at clustering stability. Learn. Theor., no. 4005, 5–19.
- Ben-Hur, A., Elisseeff, A. and Guyon, I. (2002) A stability based method for discovering structure in clustered data. Pacific Symp. Biocomputing.
- Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis. Berlin: Springer.
- Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37, 1705–1732.
- Bottolo, L. and Richardson, S. (2010) Evolutionary Stochastic Search for Bayesian model exploration. Preprint. (Available from http://arxiv.org/abs/1002.2706.)
- Breiman, L. (1996) Bagging predictors. Mach. Learn., 24, 123–140.
- , and (1999) The choice of variables in multivariate regression: a non‐conjugate Bayesian decision theory approach. Biometrika, 60, 627– 641.
- Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007) Sparsity oracle inequalities for the Lasso. Electron. J. Statist., 1, 169–194.
- Dalmasso, C., Broët, P. and Moreau, T. (2005) A simple procedure for estimating the false discovery rate. Bioinformatics, 21, 660–668.
- Dupuis, J. A. and Robert, C. P. (2003) Variable selection in qualitative models via an entropic explanatory power. J. Statist. Planng Inf., 111, 77–94.
- Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. Ann. Statist., 32, 407–499.
- Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Statist. Soc. B, 70, 849–911.
- Fan, J., Samworth, R. and Wu, Y. (2009) Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res., 10, 2013–2038.
- Gandy, A. (2009) Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. J. Am. Statist. Ass., 104, 1504–1511.
- , and (2010) Prediction and variable selection with the adaptive Lasso. Preprint arXiv:1001.5176v1.
- George, E. I. and McCulloch, R. E. (1997) Approaches for Bayesian variable selection. Statist. Sin., 7, 339–373.
- Griffin, J. E. and Brown, P. J. (2007) Bayesian adaptive lassos with non-convex penalisation. Institute of Mathematics and Statistics, University of Kent, Canterbury. (Available from http://www.kent.ac.uk/ims/personal/jeg28/.)
- Griffin, J. E. and Brown, P. J. (2010) Inference with normal-gamma prior distributions in regression problems. Bayesn Anal., 5, 171–181.
- Hans, C., Dobra, A. and West, M. (2007) Shotgun Stochastic Search for large p regression. J. Am. Statist. Ass., 102, 507–517.
- Hastie, T., Tibshirani, R. and Friedman, J. (2008) The Elements of Statistical Learning, 2nd edn. New York: Springer.
- Haufe, S., Müller, K.-R., Nolte, G. and Krämer, N. (2010) Sparse causal discovery in multivariate time series. In J. Mach. Learn. Res. Wrkshp Conf. Proc., vol. 6, Causality: Objectives and Assessment, pp. 97–106. (Available from http://www.JMLR.org.)
- Hennig, C. (2010) Methods for merging Gaussian mixture components. Adv. Data Anal. Classificn, 4, 3–34.
- Johnstone, I. M. and Silverman, B. W. (2005) Empirical Bayes selection of wavelet thresholds. Ann. Statist., 33, 1700–1752.
- Künsch, H. R. (1989) The jackknife and the bootstrap for general stationary observations. Ann. Statist., 17, 1217–1241.
- Lange, T., Roth, V., Braun, M. L. and Buhmann, J. M. (2004) Stability-based validation of clustering solutions. Neur. Computn, 16, 1299–1323.
- Lee, H. K. H. (2004) Bayesian Nonparametrics via Neural Networks. Philadelphia: Society for Industrial and Applied Mathematics.
- Lindley, D. V. (1968) The choice of variables in multiple regression (with discussion). J. R. Statist. Soc. B, 30, 31–66.
- Meinshausen, N. (2008) A note on the Lasso for graphical Gaussian model selection. Statist. Probab. Lett., 78, 880–884.
- Meinshausen, N., Meier, L. and Bühlmann, P. (2009) P-values for high-dimensional regression. J. Am. Statist. Ass., 104, 1671–1681.
- Monti, S., Tamayo, P., Mesirov, J. and Golub, T. (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn., 52, 91–118.
- Nott, D. J. and Leng, C. (2010) Bayesian projection approaches to variable selection in generalized linear models. Computnl Statist. Data Anal., to be published.
- Park, T. and Casella, G. (2008) The Bayesian Lasso. J. Am. Statist. Ass., 103, 681–686.
- Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2008) High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Preprint arXiv:0811.3628.
- Rinaldo, A. and Wasserman, L. (2010) Generalized density clustering. Ann. Statist., to be published. (Available from http://arxiv.org/abs/0907.3454.)
- Sauerbrei, W. and Schumacher, M. (1992) A bootstrap resampling procedure for model building: application to the Cox regression model. Statist. Med., 11, 2093–2109.
- Savage, L. J. (1954) The Foundations of Statistics. New York: Wiley.
- Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T. and Zeileis, A. (2008) Conditional variable importance for random forests. BMC Bioinform., 9.
- Tibshirani, R. and Walther, G. (2005) Cluster validation by prediction strength. J. Computnl Graph. Statist., 14, 511–528.
- (2010) Obituary of Hirotugu Akaike. J. R. Statist. Soc. A, 173, 451– 454.
- Wainwright, M. J. (2009) Sharp thresholds for high-dimensional and noisy recovery of sparsity. IEEE Trans. Inform. Theor., 55, 2183–2202.
- Wang, H., Li, G. and Tsai, C.-L. (2007) Regression coefficient and autoregressive order shrinkage and selection via the lasso. J. R. Statist. Soc. B, 69, 63–78.
- Zhao, P. and Yu, B. (2006) On model selection consistency of lasso. J. Mach. Learn. Res., 7, 2541–2563.
- Zou, H. (2006) The adaptive lasso and its oracle properties. J. Am. Statist. Ass., 101, 1418–1429.
- Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. J. R. Statist. Soc. B, 67, 301–320.
- Zou, H., Hastie, T. and Tibshirani, R. (2007) On the degrees of freedom of the LASSO. Ann. Statist., 35, 2173–2192.
- , and (2008) Comparing the characteristics of gene expression profiles derived by univariate and multivariate classification methods. Statist. Applic. Genet. Molec. Biol., 7, no. 1, article 7.