Volume 75, Issue 1

Variable selection with error control: another look at stability selection

First published: 21 June 2012
Citations: 104
Address for correspondence: Richard Samworth, Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge, CB3 0WB, UK.
E‐mail: r.j.samworth@statslab.cam.ac.uk

Abstract

Summary. Stability selection was recently introduced by Meinshausen and Bühlmann as a very general technique designed to improve the performance of a variable selection algorithm. It is based on aggregating the results of applying a selection procedure to subsamples of the data. We introduce a variant, called complementary pairs stability selection, and derive bounds both on the expected number of variables included by complementary pairs stability selection that have low selection probability under the original procedure, and on the expected number of high selection probability variables that are excluded. These results require no (e.g. exchangeability) assumptions on the underlying model or on the quality of the original selection procedure. Under reasonable shape restrictions, the bounds can be further tightened, yielding improved error control, and therefore increasing the applicability of the methodology.
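The aggregation step described in the summary can be illustrated with a minimal sketch. It assumes that complementary pairs are formed by splitting a random permutation of the observations into two disjoint halves of size ⌊n/2⌋, and that a variable is retained when its selection proportion over all 2B subsamples reaches a threshold. The base selector `top_q_by_correlation`, the function name `cpss`, and the default values of `n_pairs` and `threshold` are hypothetical choices made for illustration; they are not taken from the paper, and any selection procedure could be substituted.

```python
import numpy as np


def top_q_by_correlation(X, y, q):
    """Hypothetical base selector: the q columns of X with the largest
    absolute sample correlation with y (illustrative stand-in only)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom = np.where(denom == 0, 1.0, denom)  # guard against constant columns
    scores = np.abs(Xc.T @ yc) / denom
    return np.argsort(scores)[-q:]


def cpss(X, y, selector, n_pairs=50, threshold=0.6, seed=0):
    """Sketch of complementary pairs stability selection.

    For each of n_pairs random splits, the selector is applied to two
    disjoint halves of the data; a variable is kept if it is selected
    in at least `threshold` of the 2 * n_pairs subsamples.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    half = n // 2
    counts = np.zeros(p)
    for _ in range(n_pairs):
        perm = rng.permutation(n)
        # The two index sets are disjoint: a complementary pair.
        for idx in (perm[:half], perm[half:2 * half]):
            counts[selector(X[idx], y[idx])] += 1
    freq = counts / (2 * n_pairs)
    return np.flatnonzero(freq >= threshold), freq


# Usage on synthetic data (assumed setup): three signal variables out of fifty.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = 3.0 * X[:, :3].sum(axis=1) + rng.standard_normal(200)
selected, freq = cpss(X, y, lambda Xs, ys: top_q_by_correlation(Xs, ys, q=5))
print("selected variables:", selected)
```

Because the base selector here always returns exactly q variables, the selection proportions sum to q, which gives a quick sanity check on the bookkeeping. The error bounds discussed in the summary are not computed by this sketch.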
