Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data

Publication: Article, Other literature type
Date: 29 Jun 2021
Country: Austria
Publisher: Frontiers Media SA
Journal: Frontiers in Big Data, volume 4 (eISSN: 2624-909X)

Authors: Michael Platzer; Thomas Reutterer

doi: 10.3389/fdata.2021.679939

pmid: 34268491

pmc: PMC8276128

Abstract

AI-based data synthesis has progressed rapidly over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. This is reflected in the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets remains an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured by statistical distances between lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating each synthetic record's distance to its closest record in the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer has indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these with traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available as open source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
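
To illustrate the two kinds of metric described in the abstract, the following is a minimal Python sketch. It is not the authors' published implementation. It assumes that every column has already been discretized into integer category codes, uses total variation distance for a single marginal, and uses Hamming distance for the distance to closest record. These choices, and all function names, are assumptions made for this example.

```python
import numpy as np


def marginal_tvd(real, synth, col):
    """Total variation distance between the empirical distributions of one
    column in the real and the synthetic data (assumed metric choice)."""
    cats = np.union1d(real[:, col], synth[:, col])
    p = np.array([(real[:, col] == c).mean() for c in cats])
    q = np.array([(synth[:, col] == c).mean() for c in cats])
    return 0.5 * np.abs(p - q).sum()


def dcr(queries, reference):
    """Distance to closest record: for each query row, the smallest Hamming
    distance (number of differing columns) to any reference row."""
    # Broadcast to shape (n_queries, n_reference, n_cols), then count mismatches.
    mismatches = (queries[:, None, :] != reference[None, :, :]).sum(axis=2)
    return mismatches.min(axis=1)


def share_closer_to_training(synth, train, holdout):
    """Share of synthetic rows that are closer to the training data than to
    the holdout data; ties are counted as one half."""
    d_train = dcr(synth, train)
    d_hold = dcr(synth, holdout)
    closer = (d_train < d_hold).sum()
    ties = (d_train == d_hold).sum()
    return (closer + 0.5 * ties) / len(synth)


if __name__ == "__main__":
    # Hypothetical data: 5 categorical columns with 4 levels each.
    rng = np.random.default_rng(0)
    train = rng.integers(0, 4, size=(200, 5))
    holdout = rng.integers(0, 4, size=(200, 5))
    synth = rng.integers(0, 4, size=(200, 5))
    print("TVD of column 0:", marginal_tvd(train, synth, 0))
    print("Share closer to training:", share_closer_to_training(synth, train, holdout))
```

Under these assumptions, and with training and holdout sets of equal size, a share close to 0.5 is consistent with synthetic records being no closer to the training data than to unseen data, while a share well above 0.5 would suggest that the synthesizer reproduces individual training records.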

Country

Austria

Related Organizations

Wirtschaftsuniversität Wien (Vienna University of Economics and Business, WU)
Austria

Keywords

Big Data, 502020 Market research, 502019 Marketing, 502, structured data, Information technology, synthetic data, privacy, T58.5-58.64, anonymization, fidelity, 502052 Business administration, self-supervised learning, statistical disclosure control, mixed-type data

Impact by BIP!

selected citations: 33
popularity: Top 10%
influence: Top 10%
impulse: Top 10%

Access routes: Green, gold

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
