Data Flush

Publication: Article, Other literature type
Date: 28 Apr 2022
Language: English
Publisher: MIT Press - Journals
Journal: Harvard Data Science Review
Funded by: NIH | Discovering causal genes,..., NIH | Causal and integrative de..., NIH | Genetic Association and P... +3 projects

Authors: Xiaotong Shen; Xuan Bi; Rex Shen;

doi: 10.1162/99608f92.681fe3bd

pmid: 36909365

pmc: PMC9997048

Abstract

Data perturbation is a technique for generating synthetic data by adding "noise" to raw data. It has an array of applications in science and engineering, primarily in data security and privacy. One challenge is that perturbation usually produces synthetic data that gains privacy protection at the cost of information loss. This information loss, in turn, reduces the accuracy of any statistical or machine learning method based on the synthetic data, weakening downstream analysis and degrading machine learning performance. In this article, we introduce and advocate a fundamental principle of data perturbation: the distribution of the raw data must be preserved. To achieve this, we propose a new scheme, named data flush, which ensures the validity of downstream analysis and maintains the predictive accuracy of a learning task. It perturbs data nonlinearly while accommodating strict privacy protection requirements such as differential privacy. We highlight multiple facets of data flush through examples.
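The abstract does not spell out the scheme itself, so the following is only a minimal sketch of the contrast it draws, not the authors' implementation. It compares plain additive noise, which inflates the spread of the data, with one standard way to perturb nonlinearly while keeping the distribution: map each value through its CDF F, add noise, map the result back to (0, 1) through the CDF G of the noisy quantity, and invert F. The sketch assumes that F is known (a standard normal here), that the noise is Laplace with an arbitrary scale, and that no privacy calibration is attempted; the function names are hypothetical.

```python
import math
import random
import statistics


def laplace_noise(rng, scale):
    # The difference of two i.i.d. Exp(1) draws, times scale, is Laplace(0, scale).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def additive_perturb(data, scale, rng):
    # Baseline: linear perturbation, x + e. This changes the distribution.
    return [x + laplace_noise(rng, scale) for x in data]


def uniform_plus_laplace_cdf(t, scale):
    # CDF G of U + e, with U ~ Uniform(0, 1) and e ~ Laplace(0, scale) independent.
    # A(x) is the antiderivative of the Laplace CDF; G(t) = A(t) - A(t - 1).
    def antiderivative(x):
        return max(x, 0.0) + 0.5 * scale * math.exp(-abs(x) / scale)

    return antiderivative(t) - antiderivative(t - 1.0)


def distribution_preserving_perturb(data, dist, scale, rng):
    # Nonlinear perturbation: F^{-1}(G(F(x) + e)).
    # `dist` is assumed to be the true distribution of the raw data.
    eps = 1e-12  # keep probabilities strictly inside (0, 1) for inv_cdf
    out = []
    for x in data:
        noisy_u = dist.cdf(x) + laplace_noise(rng, scale)
        p = uniform_plus_laplace_cdf(noisy_u, scale)
        p = min(max(p, eps), 1.0 - eps)
        out.append(dist.inv_cdf(p))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    dist = statistics.NormalDist(0.0, 1.0)
    raw = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    print("raw stdev:      ", statistics.pstdev(raw))
    print("additive stdev: ", statistics.pstdev(additive_perturb(raw, 1.0, rng)))
    print("preserved stdev:", statistics.pstdev(
        distribution_preserving_perturb(raw, dist, 1.0, rng)))
```

Because G(F(x) + e) is uniform on (0, 1) whenever F is the true continuous CDF, applying the inverse of F returns values with the original distribution, even though each individual value has been perturbed. In practice F would have to be estimated, which this sketch does not address.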

Related Organizations

University of Minnesota System, United States
University of Minnesota, United States
Department of Statistics, Stanford University, United States

Keywords

Electronic computers. Computer science, QA75.5-76.95, Article

Impact by BIP!

selected citations: 1
popularity: Average
influence: Average
impulse: Average