Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study
Abstract
The paper presents an illustration and empirical study of releasing multiply imputed, fully synthetic public use microdata. Simulations based on data from the US Current Population Survey are used to evaluate the potential validity of inferences based on fully synthetic data for a variety of descriptive and analytic estimands, to assess the degree of protection of confidentiality that is afforded by fully synthetic data and to illustrate the specification of synthetic data imputation models. Benefits and limitations of releasing fully synthetic data sets are discussed.