Improving search engine results using different machine learning models and tools
The aim of this thesis is to provide viable methods that can be used to improve the return position (RP) of a relevant document when a natural language query (NLQ) is applied by a user.
For the purpose of demonstration, we will be using IBM's Watson Discovery Service (WDS) as a search engine that uses supervised machine learning. This feature of WDS enables a user to train the tool so that it can learn to associate the language used in the NLQ to the language used in documents labelled as relevant. Therefore, instead of mapping an NLQ to the relevant document, it will build a model that works in such a way that similar language used in the natural language query will be associated with documents containing similar language as the document labeled as relevant.
The search engine works in such a way that it first searches for the first 100 documents and then ranks the documents based on the training examples provided by the user. In other words, the training example is only applied after the search is complete and the first 100 documents are collected.
The first 100 documents are retrieved based on what has been enabled from options such as: keywords, entities, relations, semantic roles, concept, category classification, sentiment analysis, emotion analysis, and element classification (Watson Discovery Service, 2019).
Bringing 100 documents to be re-ranked for NLQ presents a challenge when the user uses a language that is not present in the documents ingested. For example, the documents ingested could be technical documents using official languages and the user could be using a search word that is commonly used among colleagues. This would mean that even when the training example is present for the type of language used by the user pointing to relevant document, the user will not be able to get the expected documents because they will not have been inside the first 100 documents and therefore will not be re-ranked. Therefore, in this thesis, we will be going through various tools and methods that would enable us to improve the return position of relevant documents that a user expects.
...
Asiasanat
IBM Watson natural language query Watson discovery tiedonhaku hakuohjelmat koneoppiminen big data kyselykielet tiedonhallinta luonnollinen kieli kieli ja kielet Query tiedonhakujärjestelmät information retrieval search engines machine learning query languages information management natural language languages information retrieval systems
Metadata
Näytä kaikki kuvailutiedotKokoelmat
- Pro gradu -tutkielmat [29708]
Lisenssi
Samankaltainen aineisto
Näytetään aineistoja, joilla on samankaltainen nimeke tai asiasanat.
-
Discovering Business Processes from Unstructured Text
Pietikäinen, Sampo (2020)Asiakirjojen käsittely manuaalisesti kuluttaa paljon tietotyöntekijän resursseja. Tämä koskee myös liiketoimintaprossien johtamisen asiantuntijoita, joiden työ voi vaatia useiden liiketoimintaprosessien kuvausten lukemista. ... -
Taming big knowledge evolution
Cochez, Michael (University of Jyväskylä, 2016)Information and its derived knowledge are not static. Instead, information is changing over time and our understanding of it evolves with our ability and willingness to consume the information. When compared to humans, ... -
Feature extraction for supervised learning in knowledge discovery systems
Pechenizkiy, Mykola (University of Jyväskylä, 2005)Tiedon louhinnalla pyritään paljastamaan tietokannasta tietomassaan sisältyviä säännönmukaisuuksia, joiden olemassaolosta ei vielä olla tietoisia. Kun tietokantaan sisältyvät tiedot ovat kovin moniulotteisia, yksittäisten ... -
Part-of-speech tagging in written slang
Korolainen, Valtteri (2014)Erilaiset kieliteknologiasovellukset ovat olleet jo vuosikymmeniä arkipäiväises-sä käytössä. Esimerkiksi ennustava tekstinsyöttö ja automaattinen korjaus ovat olleet käytössä jo vuosikymmeniä. Puheen tunnistus ja kielen ... -
Unstable feature relevance in classification tasks
Skrypnyk, Iryna (University of Jyväskylä, 2011)
Ellei toisin mainittu, julkisesti saatavilla olevia JYX-metatietoja (poislukien tiivistelmät) saa vapaasti uudelleenkäyttää CC0-lisenssillä.