
Applying clustering algorithms effectively to large-scale data is an important research topic in data mining. Based on an in-depth analysis of the Hadoop platform architecture and the Canopy-kmeans clustering algorithm, this work optimizes and parallelizes Canopy-kmeans. The data are first grouped and sampled using statistical methods and then clustered, which facilitates parallelization and reduces time complexity. The selection of the initial Canopy center points is optimized using the minimum-maximum principle, a data outlier average sampling method is used to ensure that samples are drawn uniformly from the original data, and the k-means iterative calculation process is optimized. The improved algorithm is designed and implemented in parallel on the MapReduce framework of the Hadoop platform. Experiments show that the improved parallel Canopy-kmeans algorithm is effective and converges when clustering massive amounts of numerical data, and that it improves both clustering accuracy and running time to some degree.
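The abstract does not give the details of the center-selection step, so the following is only a minimal single-machine sketch of one common reading of the minimum-maximum principle: each new initial center is the point whose distance to its nearest already-chosen center is largest. The function names `max_min_centers` and `kmeans`, the random choice of the first center, and the stopping tolerance are assumptions made for illustration, not the authors' implementation. The grouping, sampling, and Hadoop MapReduce parts are not shown.

```python
import math
import random


def max_min_centers(points, k, seed=0):
    """Pick k initial centers by max-min distance (illustrative assumption).

    The first center is drawn at random; each later center is the point
    that is farthest from its nearest already-chosen center.
    """
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    # nearest[i] = distance from points[i] to its closest chosen center
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return centers


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Plain Lloyd iterations starting from the given centers."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0),
            (10, 10), (10, 11), (11, 10),
            (20, 0), (20, 1), (21, 0)]
    init = max_min_centers(data, k=3)
    final_centers, groups = kmeans(data, init)
    print(sorted(final_centers))
```

In a MapReduce setting, the assignment step would typically run in the mappers and the per-cluster averaging in the reducers, with one job per iteration; how the paper arranges this is not stated in the abstract.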
