Clusterability Test for Categorical Data

Publication: Article, Preprint, 01 Jan 2023
Embargo end date: 01 Jan 2023
Publisher: Elsevier BV
Journal: Knowledge and Information Systems, volume 67, pages 4,113-4,138 (ISSN: 0219-1377, eISSN: 0219-3116)

Authors: Lianyu Hu; Junjie Dong; Mudi Jiang; Yan Liu; Zengyou He

doi: 10.2139/ssrn.4651548, 10.1007/s10115-024-02317-x, 10.48550/arxiv.2307.07346

arXiv: 2307.07346


Abstract

The objective of clusterability evaluation is to check whether a clustering structure exists within the data set. As a crucial yet often-overlooked issue in cluster analysis, it is essential to conduct such a test before applying any clustering algorithm. If a data set is unclusterable, any subsequent clustering analysis would not yield valid results. Despite its importance, the majority of existing studies focus on numerical data, leaving the clusterability evaluation issue for categorical data as an open problem. Here we present TestCat, a testing-based approach to assess the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly associated attribute pairs and hence the sum of chi-squared statistics of all attribute pairs is employed as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms those solutions based on existing clusterability evaluation methods for numeric data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.
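The test statistic described in the abstract, the sum of chi-squared statistics over all attribute pairs, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names (`chi2_stat`, `sum_pairwise_chi2`, `permutation_p_value`) are hypothetical, and because the abstract does not give the analytical $p$-value formula, the sketch substitutes a permutation-based $p$-value, obtained by shuffling each attribute independently to break associations between attributes.

```python
import random
from collections import Counter
from itertools import combinations


def chi2_stat(x, y):
    """Pearson chi-squared statistic for the contingency table of two columns."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_counts = Counter(x)
    col_counts = Counter(y)
    stat = 0.0
    for a, ra in row_counts.items():
        for b, cb in col_counts.items():
            expected = ra * cb / n          # count expected under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def sum_pairwise_chi2(columns):
    """Sum of chi-squared statistics over all pairs of attributes (columns)."""
    pairs = combinations(range(len(columns)), 2)
    return sum(chi2_stat(columns[i], columns[j]) for i, j in pairs)


def permutation_p_value(columns, n_perm=200, seed=0):
    """Assumption: a permutation null stands in for the paper's analytical p-value.

    Each column is shuffled independently, which keeps its marginal
    distribution but destroys any association with the other columns.
    """
    rng = random.Random(seed)
    observed = sum_pairwise_chi2(columns)
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for col in columns:
            c = list(col)
            rng.shuffle(c)
            shuffled.append(c)
        if sum_pairwise_chi2(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy data: two groups of 10 objects whose three attributes agree perfectly.
    toy = [
        ["a"] * 10 + ["b"] * 10,
        ["x"] * 10 + ["y"] * 10,
        ["p"] * 10 + ["q"] * 10,
    ]
    stat, p = permutation_p_value(toy)
    print(f"statistic = {stat:.1f}, permutation p-value = {p:.4f}")
```

A small $p$-value here suggests that the attributes are more strongly associated than independent attributes would be, which is the signal the abstract ties to clusterability. The permutation approach is slower than an analytical $p$-value and is used only to keep the sketch self-contained.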

28 pages, 12 appendix pages, 17 figures

Related Organizations

Dalian Polytechnic University, China (People's Republic of)
Dalian University, China (People's Republic of)
Dalian University of Technology, China (People's Republic of)

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Applications (stat.AP), Statistics - Applications, Machine Learning (cs.LG)

Impact by BIP!

- Selected citations: 4. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Open access route: Green