A clustering algorithm that claims to be simpler and faster than others

Yewang Chen, Yuanyuan Yang, Songwen Pei, Yi Chen, Jixiang Du, A simple rapid sample-based clustering for large-scale data, Engineering Applications of Artificial Intelligence, Volume 133, Part F, 2024 DOI: 10.1016/j.engappai.2024.108551.

Large-scale data clustering is a crucial task in addressing big data challenges. However, existing approaches often struggle to efficiently and effectively identify different types of big data, making it a significant challenge. In this paper, we propose a novel sample-based clustering algorithm, which is very simple but extremely efficient, and runs in about O(n×r) expected time, where n is the size of the dataset and r is the category number. The method is based on two key assumptions: (1) The data of each sufficient sample should have similar data distribution, as well as category distribution, to the entire data set; (2) the representative of each category in all sufficient samples conform to Gaussian distribution. It processes data in two stages, one is to classify data in each local sample independently, and the other is to globally classify data by assigning each point to the category of its nearest representative category center. The experimental results show that the proposed algorithm is effective, which outperforms other current variants of clustering algorithm.

Comments are closed.

Post Navigation