Volume 168, Issue 1

Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study

First published: 15 December 2004
Citations: 84
Jerome P. Reiter, Institute of Statistics and Decision Sciences, Box 90251, Duke University, Durham, NC 27708, USA.
E‐mail: jerry@stat.duke.edu

Abstract

Summary. The paper presents an illustration and empirical study of releasing multiply imputed, fully synthetic public use microdata. Simulations based on data from the US Current Population Survey are used to evaluate the potential validity of inferences based on fully synthetic data for a variety of descriptive and analytic estimands, to assess the degree of protection of confidentiality that is afforded by fully synthetic data and to illustrate the specification of synthetic data imputation models. Benefits and limitations of releasing fully synthetic data sets are discussed.
