Volume 63, Issue 2

Estimating the number of clusters in a data set via the gap statistic

First published: 06 January 2002
Citations: 2,078
Robert Tibshirani Department of Health Research and Policy and Department of Statistics, Stanford University, Stanford, CA 94305, USA. E‐mail: tibs@stat.stanford.edu

Abstract

We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K‐means or hierarchical), comparing the change in within‐cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
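The comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation. The abstract does not spell out the details, so the following choices are assumptions: the gap is taken as the mean log dispersion over reference data sets minus the log dispersion of the observed data, the reference null distribution is uniform over the bounding box of the data, and the estimate is the smallest k whose gap is within one simulation standard error of the gap at k + 1. The names `within_dispersion`, `gap_statistic` and `make_blobs` are hypothetical, and a plain K-means with random restarts stands in for "any clustering algorithm".

```python
import math
import random
import statistics


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_dispersion(points, k, rng, n_init=15, n_iter=50):
    """Lowest within-cluster sum of squares found over n_init K-means runs.

    Assumes k is smaller than the number of distinct points, so the
    returned value is strictly positive and its logarithm is defined.
    """
    best = math.inf
    for _ in range(n_init):
        centers = rng.sample(points, k)  # random initial centres
        labels = None
        for _ in range(n_iter):
            new_labels = [
                min(range(k), key=lambda j: sq_dist(p, centers[j]))
                for p in points
            ]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:  # an empty cluster keeps its old centre
                    centers[j] = tuple(
                        sum(c) / len(members) for c in zip(*members)
                    )
        w = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
        best = min(best, w)
    return best


def gap_statistic(points, k_max=5, n_ref=10, seed=0):
    """Return (estimated number of clusters, list of gaps for k = 1..k_max)."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(within_dispersion(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            # Assumed reference null: uniform over the data's bounding box.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(within_dispersion(ref, k, rng)))
        gaps.append(statistics.mean(ref_logs) - log_w)
        # Simulation error of the reference term, inflated for finite n_ref.
        errs.append(statistics.pstdev(ref_logs) * math.sqrt(1 + 1 / n_ref))
    for k in range(1, k_max):
        # gaps[k] and errs[k] hold the values for k + 1 clusters.
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


def make_blobs(seed=1):
    """Toy data: three tight, well-separated 2-D Gaussian clusters."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [
        (rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
        for cx, cy in centers
        for _ in range(30)
    ]


if __name__ == "__main__":
    k_hat, gaps = gap_statistic(make_blobs())
    print("estimated number of clusters:", k_hat)
    print("gap values:", [round(g, 2) for g in gaps])
```

Taking the best of several K-means restarts matters here: a poor local optimum inflates the observed dispersion and can distort the gap curve. The pure-Python loops are kept for self-containment and are only practical for small data sets.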
