TableDC: Deep Clustering for Tabular Data

Name: TableDC: Deep Clustering for Tabular Data
Keywords: FOS: Computer and information sciences, Computer Science - Databases, Data integration, Databases (cs.DB), deep clustering, Mahalanobis distance, data cleaning

Hafiz Tayyab Rauf; André Freitas; Norman William Paton

arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

Pure University of Manchester

Conference object . 2025

License: CC BY

Data sources: Pure University of Manchester

The University of Manchester - Institutional Repository

Contribution for newspaper or weekly magazine . 2025

Data sources: The University of Manchester - Institutional Repository

Proceedings of the ACM on Management of Data

Article . 2025 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2024

License: CC BY

Data sources: Datacite

DBLP

Article

Data sources: DBLP

Publication type: Article, Preprint, Conference object, Contribution for newspaper or weekly magazine
Date: 17 Jun 2025 (embargo end date: 01 Jan 2024)
Country: United Kingdom
Language: English
Publisher: Association for Computing Machinery (ACM)
Journal: Proceedings of the ACM on Management of Data, volume 3, pages 1-28 (eISSN: 2836-6573)
Funded by: FCT | D4

DOI: 10.1145/3725366, 10.48550/arxiv.2405.17723

arXiv: 2405.17723

Abstract

Deep clustering (DC), a fusion of deep representation learning and clustering, has recently demonstrated positive results in data science, particularly in text processing and computer vision. However, joint optimization of feature learning and data distribution in the multi-dimensional space is domain-specific, so existing DC methods struggle to generalize to other application domains (such as data integration). In data management tasks, where high-density embeddings and overlapping clusters dominate, a data management-specific DC algorithm should be able to interact better with the data properties to support data integration tasks. This paper presents a deep clustering algorithm for tabular data (TableDC) that reflects the properties of data management applications that cluster tables (schema inference), rows (entity resolution) and columns (domain discovery). To address overlapping clusters, TableDC integrates Mahalanobis distance, which considers variance and correlation within the data, offering a similarity method suitable for tabular data in high-dimensional latent spaces. TableDC also shows higher tolerance to outliers through its heavy-tailed Cauchy distribution as the similarity kernel. The proposed similarity measure is particularly beneficial where the embeddings of raw data are densely packed and exhibit high degrees of overlap. Data integration tasks may also involve large numbers of clusters, which challenges the scalability of existing DC methods. TableDC learns data embeddings with a large number of clusters more efficiently than baseline DC methods, which scale in quadratic time. We evaluated TableDC against several existing DC, standard clustering (SC), and state-of-the-art bespoke methods on benchmark datasets. TableDC consistently outperforms existing DC, SC and bespoke methods.
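The similarity measure described in the abstract combines two ideas: a Mahalanobis distance, which rescales differences by the covariance of the data, and a heavy-tailed Cauchy kernel, which turns distances into soft cluster assignments. The following is a minimal NumPy sketch of that combination, not the authors' implementation. The function names, the single shared covariance matrix, the scale parameter `gamma`, and the row-wise normalization are all assumptions made for illustration; the paper and the TableDC software on GitHub define the actual formulation.

```python
import numpy as np


def mahalanobis_sq(Z, centers, cov_inv):
    """Squared Mahalanobis distance from each embedding to each centre.

    Z: (n, d) embeddings, centers: (k, d), cov_inv: (d, d) inverse covariance.
    Returns an (n, k) array. (Hypothetical helper, not from the paper.)
    """
    diff = Z[:, None, :] - centers[None, :, :]            # (n, k, d)
    return np.einsum("nkd,de,nke->nk", diff, cov_inv, diff)


def cauchy_soft_assign(Z, centers, cov_inv, gamma=1.0):
    """Soft assignments from a Cauchy-shaped kernel over the distances.

    gamma is an assumed scale parameter; rows are normalized to sum to 1.
    """
    d2 = mahalanobis_sq(Z, centers, cov_inv)
    q = 1.0 / (1.0 + d2 / gamma**2)                        # heavy-tailed kernel
    return q / q.sum(axis=1, keepdims=True)


# Illustrative data: two overlapping, correlated clusters in 2-D.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
Z = np.vstack([
    rng.multivariate_normal([0.0, 0.0], cov, size=50),
    rng.multivariate_normal([1.0, 1.0], cov, size=50),
])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])

# Assumption: one covariance estimated from all points, with a small ridge
# term so that the matrix inverse is well conditioned.
cov_inv = np.linalg.inv(np.cov(Z, rowvar=False) + 1e-6 * np.eye(2))
q = cauchy_soft_assign(Z, centers, cov_inv)
print(q[:3])
```

Because the Cauchy kernel decays polynomially rather than exponentially, a point far from every centre still receives non-negligible assignment mass, which is one way to read the outlier tolerance mentioned in the abstract. With `cov_inv` set to the identity matrix, the distance reduces to the squared Euclidean distance.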

Country

United Kingdom

Related Organizations

University of Salford, United Kingdom
University of Manchester, Department of Computer Science, United Kingdom
University of Manchester, United Kingdom

Related research

TableDC software on GitHub (relation: IsRelatedTo)

Impact by BIP!

Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
