Improving Scalable K-Means++
Hämäläinen, J., Kärkkäinen, T., & Rossi, T. (2021). Improving Scalable K-Means++. Algorithms, 14(1), Article 6. https://doi.org/10.3390/a14010006
© 2020 by the authors. Licensee MDPI, Basel, Switzerland
Two new initialization methods for K-means clustering are proposed. Both proposals apply a divide-and-conquer approach to the K-means‖ type of initialization strategy. The second proposal also uses multiple lower-dimensional subspaces, produced by the random projection method, for the initialization. The proposed methods are scalable and can be run in parallel, which makes them suitable for initializing large-scale problems. In the experiments, the proposed methods are compared with the K-means++ and K-means‖ methods using an extensive set of reference and synthetic large-scale datasets. For the synthetic datasets, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state-of-the-art by improving clustering accuracy and the speed of convergence. We also observe that the currently most popular K-means++ initialization behaves like random initialization in very high-dimensional cases ...
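As background for the K-means++ baseline named in the abstract, the following is a minimal sketch of D² sampling, the seeding rule commonly associated with K-means++. It is not the paper's proposed divide-and-conquer or random-projection method. The function names `squared_distance` and `kmeanspp_seeds` are hypothetical, and the sketch assumes points are given as plain Python tuples of numbers.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """Pick k initial centers by D^2 sampling (illustrative, hypothetical helper)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # First center: chosen uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] holds the squared distance from points[i] to its nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to a uniform pick.
            idx = rng.randrange(len(points))
        else:
            # Sample an index with probability proportional to d2[i].
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Refresh nearest-center distances against the newly added center.
        d2 = [min(old, squared_distance(p, points[idx])) for p, old in zip(points, d2)]
    return centers
```

Calling `kmeanspp_seeds(points, 3, random.Random(0))` returns three of the input points. Passing the random generator explicitly keeps runs reproducible, which helps when comparing initializations.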
Related funder(s): Academy of Finland
Funding program(s): Academy Programme, AoF; Research profiles, AoF
Additional information about funding: The work has been supported by the Academy of Finland from the projects 311877 (Demo) and 315550 (HNP-AI).
Showing items with similar title or keywords.
Ivannikova, Elena (IEEE, 2017). This article proposes a scalable version of the Dependence Clustering algorithm which belongs to the class of spectral clustering methods. The method is implemented in Apache Spark using GraphX API primitives. Moreover, ...
Hämäläinen, Joonas (Jyväskylän yliopisto, 2018). Clustering or cluster analysis is an essential part of data mining, machine learning, and pattern recognition. The most popularly applied clustering methods are partitioning-based or prototype-based methods. Prototype-based ...
Hämäläinen, Joonas; Kärkkäinen, Tommi; Rossi, Tuomo (ESANN, 2018). Datasets for unsupervised clustering can be large and sparse, with a significant portion of missing values. We present here a scalable version of a robust clustering method with the available data strategy. More precisely, a ...
Zolotukhin, Mikhail (University of Jyväskylä, 2014)
Cochez, Michael (University of Jyväskylä, 2016). Information and its derived knowledge are not static. Instead, information is changing over time and our understanding of it evolves with our ability and willingness to consume the information. When compared to humans, ...