Clustering Categorical Data: Soft Rounding K-Modes

Publication: Article, Preprint
Date: 01 Jan 2023
Embargo end date: 01 Jan 2022
Publisher: Elsevier BV
Journal: Information and Computation, volume 296, page 105115 (ISSN: 0890-5401)

Authors: Surya Teja Gavva; Karthik C. S.; Sharath Punna

doi: 10.2139/ssrn.4369635, 10.1016/j.ic.2023.105115, 10.48550/arxiv.2210.09640

arXiv: 2210.09640

Abstract

Over the last three decades, researchers have intensively explored various clustering tools for categorical data analysis. Despite the proposal of various clustering algorithms, the classical k-modes algorithm remains a popular choice for unsupervised learning of categorical data. Surprisingly, our first insight is that in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft rounding variant of the k-modes algorithm (SoftModes) and theoretically prove that our variant addresses the drawbacks of the k-modes algorithm in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.
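To make the contrast in the abstract concrete, here is a minimal Python sketch of the classical k-modes loop (Hamming-distance assignment, coordinate-wise mode update) next to one possible "soft" update. The abstract does not state the SoftModes rounding rule, so the soft update below is an assumption for illustration only: it samples each center coordinate with probability proportional to a power of the category's frequency in the cluster. The names `soft_center`, `power`, and `k_modes` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of coordinates in which two categorical points differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its nearest center (ties go to the lowest index)."""
    return [min(range(len(centers)), key=lambda j: hamming(p, centers[j]))
            for p in points]


def mode_center(cluster):
    """Classical k-modes update: the most frequent category in every coordinate."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


def soft_center(cluster, rng, power=2.0):
    """ASSUMED soft update (not the paper's stated rule): sample each coordinate
    with probability proportional to (category count) ** power."""
    center = []
    for col in zip(*cluster):
        counts = Counter(col)
        values = sorted(counts)
        weights = [counts[v] ** power for v in values]
        center.append(rng.choices(values, weights=weights, k=1)[0])
    return tuple(center)


def k_modes(points, k, iters=10, soft=False, seed=0):
    """Alternate between assignment and center update for a fixed number of rounds."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    labels = assign(points, centers)
    for _ in range(iters):
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = (soft_center(members, rng) if soft
                              else mode_center(members))
        labels = assign(points, centers)
    return labels, centers
```

Calling `k_modes(points, k, soft=True)` switches the center update from the hard mode to the sampled variant. With a large `power` the sampled center approaches the hard mode, while a small `power` spreads probability across the less frequent categories.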

Related Organizations

Rutgers, The State University of New Jersey
United States

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Theory of computing, Data Structures and Algorithms (cs.DS), Machine Learning (cs.LG)

Impact by BIP!

- Selected citations: 7. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering