
Abstract
Fuzzy clustering is a data mining technique that allows each data point to belong to more than one cluster, each with a degree of membership. It is widely used in exploratory data mining to discover overlapping communities in social networks, find structure in spectral data, and capture user interests in recommendation systems. The variety and volume of data are increasing rapidly, and massive data combined with an effective technique can reveal valuable information. Existing fuzzy clustering algorithms, however, do not perform well on massive heterogeneous datasets, and processing such volumes of data is beyond the capacity of a single processor. Fuzzy clustering techniques are therefore needed that run on a distributed framework for Big Data processing and can handle heterogeneous data. In this research, we evaluate the performance of the recently proposed algorithm for fuzzy clustering of mixed-mode data, FCMD-MD (D’Urso and Massari in Inf Sci 505:513–534, 2019), on different real-world datasets. We develop a distributed FCMD-MD, a fuzzy clustering algorithm for mixed-mode data, in Apache Spark. The experimental results show that the algorithm is scalable, performs well in a distributed environment, and clusters enormous heterogeneous data with high accuracy. We also compare the performance of distributed FCMD-MD with that of the distributed k-medoid algorithm.
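The idea of a degree of membership can be illustrated with a minimal sketch. It uses the textbook fuzzy c-means membership formula with Euclidean distance, which is an assumption made here for illustration and is not the FCMD-MD procedure evaluated in the paper. The function name `fuzzy_memberships`, the fuzzifier `m`, and the sample coordinates are all hypothetical.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Membership of one point in each cluster (textbook fuzzy c-means rule).

    Illustrative only: this is not the FCMD-MD algorithm from the paper.
    """
    # Distance from the point to every cluster center; the floor avoids
    # division by zero when the point coincides with a center.
    dists = [max(math.dist(point, c), 1e-12) for c in centers]
    # Each membership is proportional to d^(-2/(m-1)); closer centers weigh more.
    weights = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(weights)
    # Normalise so the memberships of the point sum to 1.
    return [w / total for w in weights]

centers = [(0.0, 0.0), (2.0, 0.0)]           # hypothetical cluster centers
midpoint = fuzzy_memberships((1.0, 0.0), centers)
near_first = fuzzy_memberships((0.5, 0.0), centers)
print(midpoint, near_first)
```

A point equidistant from both centers is shared equally between the two clusters, while a point nearer the first center receives most, but not all, of its membership there. Hard clustering would instead assign each point to exactly one cluster.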
Computer engineering. Computer hardware, Cluster Validation, Artificial intelligence, Information technology, TK7885-7895, Anomaly Detection in High-Dimensional Data, Database, Big data, Fuzzy Clustering, Cluster analysis, Artificial Intelligence, Document Clustering, Machine learning, Data mining, Data Clustering Techniques and Algorithms, Fuzzy clustering, Scalability, Statistical and Nonlinear Physics, QA75.5-76.95, T58.5-58.64, Semi-supervised Clustering, Computer science, Programming language, Fuzzy logic, Algorithm, Physics and Astronomy, Electronic computers. Computer science, Computer Science, Physical Sciences, Statistical Mechanics of Complex Networks, SPARK (programming language), Stream Data Clustering
| Indicator | Value | Description |
| --- | --- | --- |
| Selected citations | 6 | These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). |
| Popularity | Top 10% | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. |
| Influence | Average | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). |
| Impulse | Top 10% | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. |
