Volume 74, Issue 2

Strong rules for discarding predictors in lasso‐type problems

First published: 03 November 2011
Citations: 142
Address for correspondence: Robert Tibshirani, Departments of Statistics and Health Research and Policy, Stanford University, Stanford, CA 94305, USA.
E‐mail: tibs@stanford.edu

Abstract

Summary. We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed ‘SAFE’ rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non‐zero coefficients in the solution. We therefore combine them with simple checks of the Karush–Kuhn–Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush–Kuhn–Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
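The screen-then-check idea summarized above can be illustrated with a minimal sketch. It assumes the lasso objective is written as (1/2)||y − Xβ||² + λ||β||₁ and uses the basic strong rule as stated in the body of the paper: discard predictor j when |x_jᵀy| < 2λ − λ_max, where λ_max = max_j |x_jᵀy|. The function names (`lasso_cd`, `strong_rule_lasso`) and the plain coordinate-descent solver are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Plain coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1 (illustrative solver)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of x_j with the partial residual (x_j's own fit added back).
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def strong_rule_lasso(X, y, lam, tol=1e-8):
    """Screen with the basic strong rule, solve on the survivors, then check KKT."""
    corr = np.abs(X.T @ y)
    lam_max = corr.max()
    # Basic strong rule: discard predictor j if |x_j^T y| < 2*lam - lam_max.
    keep = corr >= 2.0 * lam - lam_max
    while True:
        beta = np.zeros(X.shape[1])
        if keep.any():
            beta[keep] = lasso_cd(X[:, keep], y, lam)
        # KKT condition for a zero coefficient: |x_j^T (y - X beta)| <= lam.
        grad = np.abs(X.T @ (y - X @ beta))
        violators = (~keep) & (grad > lam + tol)
        if not violators.any():
            return beta, keep
        # The rule discarded an active predictor by mistake: add it back and re-solve.
        keep |= violators
```

Because the rule is not foolproof, the loop re-solves whenever a discarded predictor violates its Karush–Kuhn–Tucker condition, so the returned coefficients match those of the unscreened problem up to solver tolerance.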
