Internal Cluster Validation for Data with Missing Values

Niemelä, Marko

Abstract

Clustering is an unsupervised data mining method used to assign data to distinct groups. It has numerous applications in various fields, from bioinformatics to object recognition and categorization. Prototype-based clustering methods summarize information in the form of cluster centroids, often called prototypes. Cluster validation methodology provides a means of assessing the goodness of a clustering solution and identifying the optimal number of clusters in the data. Internal cluster validation methods evaluate the quality of a clustering by assessing cluster compactness and separability on the same data set that was used as input in the clustering phase. A common and sometimes complex issue for both data clustering and cluster validation is the presence of missing values in the data, which can occur for many different reasons, such as non-respondents in questionnaire studies or device operation failures. This dissertation focuses on extending cluster validation models to handle missing values in data. Since these models are based not on the values of the data vectors but on the computed distances between these vectors, missing values are treated by estimating the distances between data vectors directly. The thesis presents a toolbox that is used to demonstrate the usability of the developed methods for research and development purposes. In addition, the background theory of each element of the toolbox and use case examples are presented. A real-world application is provided in which cluster validation is used to categorize learning game players into distinct profiles using gameplay data in which some of the values are missing. As the main outcome of the thesis, missing value handling methods for data preprocessing, clustering, and cluster validation are presented. The functionality and validity of the methods are demonstrated in several numerical experiments, and the results confirm the scalability of the techniques and their capability of reliably solving knowledge discovery problems.

Keywords: knowledge discovery, data mining, log data, data preprocessing, missing values, distance computation, distance estimation, clustering, prototype-based clustering, number of clusters, cluster validation, internal cluster validation, cluster validation indices
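
The idea of estimating distances directly between incomplete data vectors, as described in the abstract, can be illustrated with a minimal sketch. The example below assumes a partial distance strategy: the squared Euclidean distance is computed over the coordinates observed in both vectors and rescaled by the share of coordinates used. This is one common estimator and is not necessarily the one implemented in the thesis toolbox; the function name `partial_distance` and the use of NaN to mark missing values are assumptions made for illustration.

```python
import numpy as np


def partial_distance(x, y):
    """Estimate the Euclidean distance between two vectors with missing values.

    Missing entries are assumed to be encoded as NaN (an illustrative choice).
    Only coordinates observed in both vectors contribute, and the squared sum
    is rescaled by (total coordinates / coordinates used).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Coordinates that are observed in both vectors.
    observed = ~np.isnan(x) & ~np.isnan(y)
    n_observed = int(observed.sum())

    # No common observed coordinate: the distance cannot be estimated.
    if n_observed == 0:
        return float("nan")

    diff = x[observed] - y[observed]
    scale = x.size / n_observed
    return float(np.sqrt(scale * np.sum(diff ** 2)))
```

With complete vectors the estimate reduces to the ordinary Euclidean distance. A distance matrix built this way could then be passed to distance-based clustering or validation indices without imputing the missing values first.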

Main Author

Niemelä, Marko

Format

Theses Doctoral thesis

Published

2022

Series

JYU Dissertations

ISBN

978-951-39-9321-4

Publisher

Jyväskylän yliopisto

The permanent address of the publication

https://urn.fi/URN:ISBN:978-951-39-9321-4

ISSN

2489-9003

Language

English

Published in

JYU Dissertations

Contains publications

Article I: Niemelä, M., Äyrämö, S., Ronimus, M., Richardson, U., & Lyytinen, H. (2020). Game learning analytics for understanding reading skills in transparent writing system. British Journal of Educational Technology, 51(6), 2376-2390. DOI: 10.1111/bjet.12916. JYX: jyx.jyu.fi/handle/123456789/67896
Article II: Niemelä, M., Äyrämö, S., & Kärkkäinen, T. (2018). Comparison of cluster validation indices with missing data. In ESANN 2018: Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 461-466).
Article III: Niemelä, M., & Kärkkäinen, T. (2022). Improving Clustering and Cluster Validation with Missing Data Using Distance Estimation Methods. In T. T. Tuovinen, J. Periaux, & P. Neittaanmäki (Eds.), Computational Sciences and Artificial Intelligence in Industry: New Digital Technologies for Solving Future Societal and Economical Challenges (pp. 123-133). Springer. Intelligent Systems, Control and Automation: Science and Engineering, 76. DOI: 10.1007/978-3-030-70787-3_9
Article IV: Niemelä, M., Äyrämö, S., & Kärkkäinen, T. (2022). Toolbox for Distance Estimation and Cluster Validation on Data With Missing Values. IEEE Access, 10, 352-367. DOI: 10.1109/ACCESS.2021.3136435
