Feature extraction for supervised learning in knowledge discovery systems
Data mining aims to reveal regularities hidden in the mass of data in a database, regularities whose existence is not yet known. When the data in the database are highly multidimensional, with individual instances containing numerous features, the performance of many machine learning methods degrades decisively. This phenomenon is called the “curse of dimensionality”, because it often leads to growth in both the computational complexity of processing and the number of classification errors arising during processing. On the other hand, irrelevant or only indirectly relevant features that may be present in the database provide a poor representation space for describing the concept structure of the database. Feature extraction therefore aims at reducing dimensionality, at improving the representation space, or at both, for the needs of supervised machine learning.
The work consists of separate articles and a summary that builds on them. Each article addresses one or two research questions and the related findings, which Pechenizkiy finally combines into a proposal for an arrangement in which collected experience of using data mining techniques and their combinations can systematically support the selection of the most suitable data mining strategy.
...
Knowledge discovery, or data mining, is the process of finding previously unknown and potentially interesting patterns and relations in large databases. The so-called “curse of dimensionality”, which affects many learning algorithms, denotes the drastic increase in computational complexity and classification error as the number of dimensions in the data grows. Besides this problem, some individual features, being irrelevant or only indirectly relevant to the concepts to be learned, form a poor problem representation space. The purpose of this study is to develop the theoretical background and practical aspects of feature extraction (FE) as a means of (1) dimensionality reduction and (2) representation space improvement for supervised learning (SL) in knowledge discovery systems. The focus is on applying conventional Principal Component Analysis (PCA) and two class-conditional approaches for two targets: (1) construction of a base-level classifier, and (2) dynamic integration of the base-level classifiers. The theoretical bases are derived from classical studies in data mining, machine learning and pattern recognition. The software prototype for the experimental study is built within the WEKA open-source machine-learning library in Java. The experimental study, carried out on a number of benchmark and real-world data sets, analyses (1) the importance of using class information in the FE process; (2) the advantages and disadvantages of using either extracted features alone or both original and extracted features for SL; (3) applying FE globally to the whole data and locally within natural clusters; (4) the effect of sampling reduction on FE for SL; and (5) the problem of selecting FE techniques for SL for the problem under consideration. The hypotheses and detailed results of the many-sided experimental research process are reported in the corresponding papers included in the thesis.
The main contributions of the thesis can be divided into (1) contributions to current theoretical knowledge and (2) the development of practical suggestions on applying FE for SL.
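The abstract names conventional PCA as one of the FE techniques studied for dimensionality reduction. As a rough illustration only, the sketch below extracts the first principal component of a small data set and projects the instances onto it. It is not the WEKA-based prototype described in the thesis: the toy data, the function names, and the use of power iteration to find the leading eigenvector are all assumptions made for this example.

```python
import math


def mean_center(rows):
    """Subtract the per-feature mean from every row; return rows and means."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows], means


def covariance(centered):
    """Sample covariance matrix (d x d) of already mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration.

    Assumes the uniform start vector is not orthogonal to that eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero covariance: no direction to extract
            break
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Map each centered instance to its score on one extracted feature."""
    return [sum(a * b for a, b in zip(r, component)) for r in centered]


# Hypothetical toy data: two strongly correlated original features.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]

centered, means = mean_center(data)
cov = covariance(centered)
pc1 = first_principal_component(cov)
scores = project(centered, pc1)  # one extracted feature per instance

# Share of the total variance retained by the single extracted feature.
score_variance = sum(s * s for s in scores) / (len(scores) - 1)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = score_variance / total_variance
```

Here `scores` replaces the two original features with one extracted feature, and `explained` indicates how much of the total variance that feature retains. Plain PCA of this kind ignores class labels; the class-conditional approaches mentioned in the abstract differ precisely in using that information.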
...
Publisher
University of Jyväskylä
ISBN
951-39-2271-5
ISSN
1456-5390
Collections
- Väitöskirjat (doctoral dissertations)
Related items
Showing items with similar title or keywords.
- Linear feature extraction for ranking
  Pandey, Gaurav; Ren, Zhaochun; Wang, Shuaiqiang; Veijalainen, Jari; Rijke, Maarten de (Springer, 2018) We address the feature extraction problem for document ranking in information retrieval. We then propose LifeRank, a Linear feature extraction algorithm for Ranking. In LifeRank, we regard each document collection for ...
- Knowledge discovery using diffusion maps
  Sipola, Tuomo (University of Jyväskylä, 2013)
- Application of a Knowledge Discovery Process to Study Instances of Capacitated Vehicle Routing Problems
  Kärkkäinen, Tommi; Rasku, Jussi (Springer, 2020) Vehicle Routing Problems (VRP) are computationally challenging, constrained optimization problems, which have central role in logistics management. Usually different solvers are being developed and applied for different ...
- Unstable feature relevance in classification tasks
  Skrypnyk, Iryna (University of Jyväskylä, 2011)
- Dynamic integration of data mining methods in knowledge discovery systems
  Tsymbal, Alexey (University of Jyväskylä, 2002)