Duplicate record elimination in large data files

Article, 01 Jun 1983, English
Publisher: Association for Computing Machinery (ACM)
Journal: ACM Transactions on Database Systems, volume 8, pages 255-265 (issn: 0362-5915, eissn: 1557-4644)

Authors: Dina Bitton; David J. DeWitt

doi: 10.1145/319983.319987

Abstract

The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
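
The contrast drawn in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' procedure, which operates on runs in external storage and is analyzed with a combinatorial cost model; it only shows the general idea of discarding duplicates during each merge so that intermediate runs shrink early, next to the baseline of sorting first and scanning afterwards. The function names `merge_dedup`, `sort_dedup`, and `sort_then_scan` are hypothetical and chosen for illustration.

```python
def merge_dedup(left, right):
    """Merge two sorted, duplicate-free runs into one sorted, duplicate-free run."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            out.append(left[i])
            i += 1
        elif right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            # Same record in both runs: keep one copy and advance both sides.
            out.append(left[i])
            i += 1
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def sort_dedup(records):
    """Bottom-up merge sort that drops duplicates at every merge (illustrative)."""
    runs = [[r] for r in records]  # initial runs of length 1 are trivially duplicate-free
    if not runs:
        return []
    while len(runs) > 1:
        merged = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged.append(merge_dedup(runs[k], runs[k + 1]))
            else:
                merged.append(runs[k])  # odd run out is carried to the next pass
        runs = merged
    return runs[0]


def sort_then_scan(records):
    """Baseline: sort everything, then remove duplicates in one sequential pass."""
    out = []
    for r in sorted(records):
        if not out or out[-1] != r:
            out.append(r)
    return out
```

Both functions return the same result for any input. The difference is in how much data the later merge passes have to handle: in `sort_dedup` a duplicated record stops being carried forward as soon as two of its copies meet in a merge, whereas `sort_then_scan` carries every copy through the whole sort.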

Related Organizations

University of Wisconsin–Madison
United States
University of Wisconsin–Oshkosh
United States

Keywords

Data structures, Information storage and retrieval of data, projection operator, Searching and sorting, merge-sort, duplicate elimination, sorting

Impact by BIP!

- Selected citations (derived from selected sources): 109
- Popularity (current attention, based on the underlying citation network): Top 10%
- Influence (overall impact, based on the underlying citation network): Top 0.1%
- Impulse (initial momentum directly after publication): Average

Open access: bronze

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
