Research on parallel data processing of data mining platform in the background of cloud computing

Article, 01 Jan 2021, English

Publisher: Walter de Gruyter GmbH

Journal: Journal of Intelligent Systems, volume 30, pages 479-486 (eISSN: 2191-026X)

Authors: Bu Lingrui; Zhang Hui; Xing Haiyan; Wu Lijun;

doi: 10.1515/jisys-2020-0113

Abstract

The efficient processing of large-scale data has considerable practical value. In this study, a data mining platform based on the Hadoop distributed file system was designed, and the K-means algorithm was then improved using the idea of max-min distance. On the Hadoop distributed file system platform, the algorithm was parallelized with MapReduce. Finally, the data processing performance of the algorithm was analyzed on the Iris data set. The results showed that the parallel algorithm classified more samples correctly than the traditional algorithm; in a single-machine environment, the parallel algorithm took longer to run; on large data sets, the traditional algorithm ran out of memory, whereas the parallel algorithm completed the computation; and the acceleration ratio of the parallel algorithm rose as the cluster size and data set size grew, showing a good parallel effect. The experimental results verify the reliability of the parallel algorithm in big data processing and contribute to further improving the efficiency of data mining.
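The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, under stated assumptions: max-min distance seeding starts from the first sample (an assumption), each further seed is the sample farthest from its nearest existing seed, and the MapReduce step is simulated in a single process with plain Python functions rather than the Hadoop API. The names `max_min_init`, `map_phase`, `reduce_phase`, and `kmeans` are hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict


def max_min_init(points, k):
    """Choose k initial centers by the max-min distance rule.

    Assumption: the first sample is the first center. Each further center
    is the sample whose distance to its nearest chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(farthest)
    return centers


def map_phase(points, centers):
    """Map step: emit (index of nearest center, point) for every sample."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield idx, p


def reduce_phase(pairs, centers):
    """Reduce step: average the points assigned to each center index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # centers with no points keep their position
    for idx, pts in groups.items():
        new_centers[idx] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return new_centers


def kmeans(points, k, max_iter=100, tol=1e-9):
    """Iterate map and reduce until the centers stop moving."""
    centers = max_min_init(points, k)
    for _ in range(max_iter):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        if all(math.dist(a, b) <= tol for a, b in zip(centers, new_centers)):
            break
        centers = new_centers
    return centers


if __name__ == "__main__":
    samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(samples, 2))
```

On a real cluster the map step would run on separate data splits and the reduce step would receive the pairs grouped by center index; here both run sequentially, so the sketch illustrates the data flow only and says nothing about running time.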

Keywords

clustering algorithm, parallel processing, Science, Electronic computers. Computer science, cloud computing, Q, data mining, QA75.5-76.95, hadoop platform
