Dealing with a small amount of data : developing Finnish sentiment analysis
Toivanen, I., Lindroos, J., Räsänen, V., & Taipale, S. (2022). Dealing with a small amount of data : developing Finnish sentiment analysis. In 2022 BESC : 9th International Conference on Behavioural and Social Computing. IEEE. https://doi.org/10.1109/besc57393.2022.9995536
Date
2022
Discipline
Policies and Politics of Welfare and Care (focus area)
School of Wellbeing
Social and Public Policy
Copyright
© 2022 IEEE
Sentiment analysis has become increasingly prominent among natural language processing tasks. It entails extracting opinions, emotions, and sentiments from text. In this paper, we develop and test language models for Finnish, a low-resource language. We use the term “low-resource” to describe a language that lacks available resources for language modeling, especially annotated data. We investigate four models: the state-of-the-art FinBERT [1] and three competitive BERT-based alternatives, Finnish ConvBERT [2], Finnish Electra [3], and Finnish RoBERTa [4]. Comparing multiple BERT variants ties in with the additional methods we implement to counteract the lack of annotated data. We base our sentiment analysis on partly annotated survey data collected from eldercare workers and supplement the training data with additional sources. Besides the non-annotated portion of the survey data, we use an external in-domain dataset and an open-source news corpus to determine how the training data can be extended with methods such as pretraining (masked language modeling) and pseudo-labeling. Pretraining and pseudo-labeling, often classed as semi-supervised learning methods, make it possible to use unlabeled data either to initialize the model or to assign pseudo-labels (predictions treated as if they were real labels) to unlabeled samples before the final model is trained. Our results suggest that, of the single BERT models, FinBERT performs best for our use case. Moreover, combining multiple models through ensemble learning further improves predictive performance and outperforms a single FinBERT model. Both pseudo-labeling and ensemble learning proved valuable in extending training data for low-resource languages such as Finnish. However, with pseudo-labeling, proper regularization methods should be considered to prevent confirmation bias from degrading model performance.
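As a rough illustration of how ensemble learning and pseudo-labeling can work together, the sketch below averages class probabilities from several models (soft voting) and keeps only those unlabeled texts whose averaged confidence reaches a threshold. It is a minimal sketch under stated assumptions, not the authors' implementation: the three-class label set, the 0.9 threshold, the function names, and the probability values are all hypothetical, and the paper may select pseudo-labels differently.

```python
from typing import List, Sequence, Tuple

# Assumed label set; the abstract does not state which classes were used.
LABELS = ("negative", "neutral", "positive")


def ensemble_probs(per_model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities from several models (soft voting)."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(model[c] for model in per_model_probs) / n_models
        for c in range(n_classes)
    ]


def pseudo_label(
    texts: Sequence[str],
    probs_per_text: Sequence[Sequence[Sequence[float]]],
    threshold: float = 0.9,  # hypothetical confidence cut-off
) -> List[Tuple[str, str]]:
    """Return (text, label) pairs whose ensemble confidence reaches the threshold."""
    selected = []
    for text, per_model in zip(texts, probs_per_text):
        avg = ensemble_probs(per_model)
        best = max(range(len(avg)), key=avg.__getitem__)
        if avg[best] >= threshold:
            selected.append((text, LABELS[best]))
    return selected


# Hypothetical outputs of two models for two unlabeled sentences.
unlabeled = ["sentence A", "sentence B"]
model_outputs = [
    [[0.02, 0.03, 0.95], [0.04, 0.06, 0.90]],  # models agree and are confident
    [[0.50, 0.30, 0.20], [0.20, 0.30, 0.50]],  # models disagree, low confidence
]
print(pseudo_label(unlabeled, model_outputs))
```

Only the first sentence is kept here, because the second never reaches the threshold after averaging. Discarding low-confidence predictions is one simple way to limit noisy pseudo-labels, although it does not by itself remove the risk of confirmation bias mentioned in the abstract.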
Publisher
IEEE
Parent publication ISBN
979-8-3503-9815-1
Conference
International Conference on Behavioural and Social Computing
Is part of publication
2022 BESC : 9th International Conference on Behavioural and Social Computing
Publication in research information system
https://converis.jyu.fi/converis/portal/detail/Publication/164924906
Similar items
Showing items with a similar title or keywords.
-
“Therefore go and make disciples of all nations” (except of those who do not speak Finnish) : investigating language policies and self-representations on three websites of the Evangelical Lutheran Church of Finland
Peijonen, Pauliina (2019). The aim of the study is to find out how the Evangelical Lutheran Church of Finland relates to speakers of foreign languages, based on its online communication. The study examines the Evangelical Lutheran Church of Finland as one of the superdiversity ...
-
Word sense disambiguation for Finnish with an application to language learning
Robertson, Frankie (2020). The task of automatically determining the correct sense of a word in a natural language expression is called word sense disambiguation. This master's thesis describes word sense ...
-
Un-polarizing news in social media platform
Le Pham, Minh Duc (2019). A person with incorrect information on a given subject or topic may act against his or her own best interest because of faulty beliefs. This is the misinformation problem, and the rise of the internet and social media has only ...
-
Videogames and L2 learning opportunities : does the amount of digital gaming show in increased English proficiency among Finnish upper comprehensive school Students?
Hartikainen, Tuomas (2015). Playing digital games remains a common hobby, especially among young Finns. As a medium, games offer many people contact with the English language in their free time as well, and this kind of informal ...
-
Investigating the language practices and perspectives of language students in a Finnish university
Liimatainen, Piia (2022). In a globalizing world, it is increasingly easy to expand one's linguistic repertoire as linguistic resources become more readily available. Changes in the distribution and use of languages have given rise to new kinds of ...
Unless otherwise stated, publicly available JYX metadata (excluding abstracts) may be freely reused under the CC0 license.