Improving Scalable K-Means++
Hämäläinen, J., Kärkkäinen, T., & Rossi, T. (2021). Improving Scalable K-Means++. Algorithms, 14(1), Article 6. https://doi.org/10.3390/a14010006
Published in: Algorithms
Date: 2021
Copyright: © 2020 by the authors. Licensee MDPI, Basel, Switzerland
Two new initialization methods for K-means clustering are proposed. Both proposals are based on applying a divide-and-conquer approach to a K-means‖-type initialization strategy. The second proposal also uses multiple lower-dimensional subspaces, produced by the random projection method, for the initialization. The proposed methods are scalable and can be run in parallel, which makes them suitable for initializing large-scale problems. In the experiments, the proposed methods are compared to the K-means++ and K-means‖ methods using an extensive set of reference and synthetic large-scale datasets. Concerning the latter, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state of the art by improving clustering accuracy and the speed of convergence. We also observe that the currently most popular K-means++ initialization behaves like the random initialization in very high-dimensional cases.
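The paper's two initialization methods are not reproduced here, but as background, the following is a minimal NumPy sketch of the standard K-means++ (D² sampling) initialization that K-means‖ and both proposals build on, together with a plain Gaussian random projection to a lower-dimensional subspace. The function names, the projection scheme, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def kmeans_pp_init(X, k, rng=None):
    """Classic K-means++ (D^2 sampling) initialization of k centers."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = np.empty((k, X.shape[1]))

    # First center: drawn uniformly at random from the data.
    centers[0] = X[rng.integers(n)]
    # Squared distance from each point to its nearest chosen center so far.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)

    for i in range(1, k):
        # Sample the next center with probability proportional to D^2.
        centers[i] = X[rng.choice(n, p=d2 / d2.sum())]
        # Update the nearest-center distances with the new center.
        d2 = np.minimum(d2, np.sum((X - centers[i]) ** 2, axis=1))

    return centers


def random_projection(X, target_dim, rng=None):
    """Gaussian random projection to a lower-dimensional subspace
    (illustrative of the random projection step only)."""
    rng = np.random.default_rng(rng)
    R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 100))   # toy high-dimensional data
    C = kmeans_pp_init(random_projection(X, 10), k=5, rng=rng)
    print(C.shape)                       # (5, 10)
```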
Publisher: MDPI AG
ISSN: 1999-4893
Publication in research information system: https://converis.jyu.fi/converis/portal/detail/Publication/47636982
Related funder(s): Research Council of Finland
Funding program(s): Academy Programme, AoF; Research profiles, AoF
Additional information about funding: The work has been supported by the Academy of Finland from the projects 311877 (Demo) and 315550 (HNP-AI).
Related items
Showing items with similar title or keywords.
- Scalable implementation of dependence clustering in Apache Spark. Ivannikova, Elena (IEEE, 2017): This article proposes a scalable version of the Dependence Clustering algorithm, which belongs to the class of spectral clustering methods. The method is implemented in Apache Spark using GraphX API primitives. Moreover, ...
- Improving Clustering and Cluster Validation with Missing Data Using Distance Estimation Methods. Niemelä, Marko; Kärkkäinen, Tommi (Springer, 2022): Missing data introduces a challenge in the field of unsupervised learning. In clustering, when the form and the number of clusters are to be determined, one needs to deal with the missing values both in the clustering ...
- Improvements and applications of the elements of prototype-based clustering. Hämäläinen, Joonas (Jyväskylän yliopisto, 2018)
- Scalable robust clustering method for large and sparse data. Hämäläinen, Joonas; Kärkkäinen, Tommi; Rossi, Tuomo (ESANN, 2018): Datasets for unsupervised clustering can be large and sparse, with a significant portion of missing values. We present here a scalable version of a robust clustering method with the available data strategy. More precisely, a ...
- Taming big knowledge evolution. Cochez, Michael (University of Jyväskylä, 2016): Information and its derived knowledge are not static. Instead, information is changing over time and our understanding of it evolves with our ability and willingness to consume the information. When compared to humans, ...