Internal Cluster Validation for Data with Missing Values

Abstract
Clustering is an unsupervised data mining method used to partition data into distinct groups. It has numerous applications in various fields, from bioinformatics to object recognition and categorization. Prototype-based clustering methods summarize information in the form of cluster centroids, often called prototypes. Cluster validation methodology provides a means of assessing the goodness of a clustering solution and identifying the optimal number of clusters in the data. Internal cluster validation methods evaluate the quality of a clustering by assessing cluster compactness and separability on the same data set that was used as input in the clustering phase. A common and sometimes complex issue for both data clustering and cluster validation is the presence of missing values in the data, which can occur for many different reasons, such as non-respondents in questionnaire studies or device operation failures. This dissertation focuses on extending cluster validation models to handle missing values in data. Since these models are based not on the values of the data vectors but on the computed distances between these vectors, missing value treatment is covered by direct distance estimation between data vectors. The thesis presents a toolbox that is used to demonstrate the usability of the developed methods for research and development purposes. In addition, the background theory of each element of the toolbox and use case examples are presented. A real-world application is provided in which cluster validation is used to categorize learning game players into distinct profiles using gameplay data in which some of the values are missing. As the main outcome of the thesis, missing value handling methods for data preprocessing, clustering, and cluster validation are presented.
The functionality and validity of the methods are demonstrated using several numerical experiments, and the results confirm the scalability of the techniques and their capability of reliably solving knowledge discovery problems.
Keywords: knowledge discovery, data mining, log data, data preprocessing, missing values, distance computation, distance estimation, clustering, prototype-based clustering, number of clusters, cluster validation, internal cluster validation, cluster validation indices
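The idea of estimating distances directly between incomplete data vectors can be illustrated with a small sketch. The abstract does not state which estimators the thesis uses, so the example below shows one widely known option, the partial distance (available-data) strategy, purely as an assumption. The function names `partial_distance` and `distance_matrix` and the use of NaN to mark missing values are hypothetical choices made for this illustration.

```python
import math


def partial_distance(x, y):
    """Estimate the Euclidean distance between two vectors with missing values.

    Assumption of this sketch: missing values are encoded as NaN.
    Only coordinates observed in BOTH vectors contribute to the sum of
    squared differences; the sum is then rescaled by n / n_observed to
    compensate for the coordinates that had to be skipped.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Keep only the coordinate pairs where neither value is missing.
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return math.nan  # no shared coordinates: the distance is undefined
    squared = sum((a - b) ** 2 for a, b in pairs)
    return math.sqrt(len(x) / len(pairs) * squared)


def distance_matrix(data):
    """Pairwise estimated distances for a list of equally long vectors."""
    n = len(data)
    return [[partial_distance(data[i], data[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    nan = math.nan
    data = [[0.0, 0.0], [3.0, 4.0], [3.0, nan]]
    for row in distance_matrix(data):
        print(row)
```

Because the result is an ordinary distance matrix, any clustering or validation step that operates on pairwise distances rather than on raw coordinates can consume it without imputing the missing values first.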
Main Author
Format
Theses Doctoral thesis
Published
2022
Series
ISBN
978-951-39-9321-4
Publisher
Jyväskylän yliopisto
The permanent address of the publication
https://urn.fi/URN:ISBN:978-951-39-9321-4
ISSN
2489-9003
Language
English
Published in
JYU Dissertations
Contains publications
  • Article I: Niemelä, M., Äyrämö, S., Ronimus, M., Richardson, U., & Lyytinen, H. (2020). Game learning analytics for understanding reading skills in transparent writing system. British Journal of Educational Technology, 51(6), 2376-2390. DOI: 10.1111/bjet.12916. JYX: jyx.jyu.fi/handle/123456789/67896
  • Article II: Niemelä, M., Äyrämö, S., & Kärkkäinen, T. (2018). Comparison of cluster validation indices with missing data. In ESANN 2018 : Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 461-466).
  • Article III: Niemelä, M., & Kärkkäinen, T. (2022). Improving Clustering and Cluster Validation with Missing Data Using Distance Estimation Methods. In T. T. Tuovinen, J. Periaux, & P. Neittaanmäki (Eds.), Computational Sciences and Artificial Intelligence in Industry : New Digital Technologies for Solving Future Societal and Economical Challenges (pp. 123-133). Springer. Intelligent Systems, Control and Automation: Science and Engineering, 76. DOI: 10.1007/978-3-030-70787-3_9
  • Article IV: Niemelä, M., Äyrämö, S., & Kärkkäinen, T. (2022). Toolbox for Distance Estimation and Cluster Validation on Data With Missing Values. IEEE Access, 10, 352-367. DOI: 10.1109/ACCESS.2021.3136435
License
In Copyright; Open Access
Copyright © The Author & University of Jyväskylä
