Publication: Article, Other literature type; 21 Dec 2022; English; Publisher: Springer Science and Business Media LLC; Journal: Journal of Big Data, volume 9 (eissn: 2196-1115,

Authors: Abdul Wahab Akram; Zareen Alamgir;

doi: 10.1186/s40537-022-00671-7 , 10.60692/we5x0-4b932 , 10.60692/dr0s7-sgv75

Distributed fuzzy clustering algorithm for mixed-mode data in Apache SPARK

Abstract

Fuzzy clustering is a data mining technique that allows each data point to belong to more than one cluster, each with some degree of membership. It is widely employed in exploratory data mining to discover overlapping communities in social networks, find structure in spectral data, and capture user interests in recommendation systems. The variety and volume of data are increasing at a tremendous rate. Data is power: massive data, paired with an effective technique, can reveal valuable information. Existing fuzzy clustering algorithms do not perform well on massive heterogeneous datasets, and processing an enormous amount of data is beyond the capacity of a single processor. What is needed are fuzzy clustering techniques that run on a distributed framework for Big Data processing and can handle heterogeneous data. In this research, we evaluate the performance of the recently proposed algorithm for fuzzy clustering of mixed-mode data, FCMD-MD (D’Urso and Massari in Inf Sci 505:513–534, 2019), on different real-world datasets. We develop a distributed FCMD-MD, a fuzzy clustering algorithm for mixed-mode data in Apache SPARK. The experimental results show that the algorithm is scalable, performs well in a distributed environment, and clusters enormous heterogeneous data with high accuracy. We also compare the performance of distributed FCMD-MD with that of the distributed k-medoid algorithm.
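To make the idea of "degree of membership" on mixed-mode data concrete, here is a minimal single-machine sketch. It is not the FCMD-MD algorithm from the paper. It assumes a Gower-style dissimilarity (range-scaled absolute difference for numeric attributes, 0/1 mismatch for categorical ones) and the membership formula commonly used in fuzzy c-medoids, with fuzzifier `m`. The names `mixed_dissimilarity` and `memberships` are hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Sequence, Union

Value = Union[float, str]


def mixed_dissimilarity(a: Sequence[Value], b: Sequence[Value],
                        ranges: Sequence[Optional[float]]) -> float:
    """Gower-style dissimilarity (an assumption, not the paper's measure).

    Numeric attributes contribute |x - y| / range; categorical (str)
    attributes contribute 0 on a match and 1 on a mismatch.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if isinstance(x, str):
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / r
    return total / len(a)


def memberships(point: Sequence[Value], medoids: Sequence[Sequence[Value]],
                ranges: Sequence[Optional[float]], m: float = 2.0) -> List[float]:
    """Fuzzy membership of one point in each cluster; the values sum to 1."""
    d = [mixed_dissimilarity(point, c, ranges) for c in medoids]
    # A point that coincides with one or more medoids belongs to them fully.
    zeros = [v == 0.0 for v in d]
    if any(zeros):
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    # Closer medoids get larger weights; m > 1 controls how soft the split is.
    power = 1.0 / (m - 1.0)
    weights = [(1.0 / v) ** power for v in d]
    total = sum(weights)
    return [w / total for w in weights]


# Two medoids over one numeric and one categorical attribute.
medoids = [(1.0, "a"), (9.0, "b")]
ranges = (10.0, None)  # numeric range for attribute 0; unused for the category
u = memberships((5.0, "a"), medoids, ranges)
```

In this toy setup the point `(5.0, "a")` is numerically halfway between the medoids but shares its category with the first one, so it receives a larger, but not exclusive, membership in the first cluster. In a distributed setting the per-point membership computation is independent across points, which is what makes this step a natural fit for a map-style operation over partitions.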

Related Organizations

National University of Computer and Emerging Sciences
Pakistan

Keywords

Computer engineering. Computer hardware, Cluster Validation, Artificial intelligence, Information technology, TK7885-7895, Anomaly Detection in High-Dimensional Data, Database, Big data, Fuzzy Clustering, Cluster analysis, Document Clustering, Machine learning, Data mining, Data Clustering Techniques and Algorithms, Scalability, Statistical and Nonlinear Physics, QA75.5-76.95, T58.5-58.64, Semi-supervised Clustering, Computer science, Programming language, Fuzzy logic, Algorithm, Physics and Astronomy, Electronic computers. Computer science, Physical Sciences, Statistical Mechanics of Complex Networks, SPARK (programming language), Stream Data Clustering

Impact by BIP!

selected citations: 6. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.