dc.contributor.author: Toivanen, Ida
dc.contributor.author: Lindroos, Jari
dc.contributor.author: Räsänen, Venla
dc.contributor.author: Taipale, Sakari
dc.date.accessioned: 2023-02-24T12:39:23Z
dc.date.available: 2023-02-24T12:39:23Z
dc.date.issued: 2022
dc.identifier.citation: Toivanen, I., Lindroos, J., Räsänen, V., & Taipale, S. (2022). Dealing with a small amount of data : developing Finnish sentiment analysis. In 2022 BESC : 9th International Conference on Behavioural and Social Computing. IEEE. https://doi.org/10.1109/besc57393.2022.9995536
dc.identifier.other: CONVID_164924906
dc.identifier.uri: https://jyx.jyu.fi/handle/123456789/85638
dc.description.abstract: Sentiment analysis has become increasingly prominent among natural language processing tasks. It entails extracting information about opinions, emotions, and sentiments. In this paper, we develop and test language models for Finnish, a low-resource language. We use the term “low-resource” to describe a language that lacks available resources for language modeling, especially annotated data. We investigate four models: the state-of-the-art FinBERT [1] and three competitive alternative BERT models, Finnish ConvBERT [2], Finnish Electra [3], and Finnish RoBERTa [4]. Comparing multiple BERT variants ties in with the additional methods we implement to counteract the lack of annotated data. We base our sentiment analysis on partly annotated survey data collected from eldercare workers and supplement the training data with additional sources. Alongside the non-annotated part of our survey data, we use an external in-domain dataset and an open-source news corpus to determine how training data can be increased with methods such as pretraining (masked language modeling) and pseudo-labeling. Pretraining and pseudo-labeling, often classed as semi-supervised learning methods, make it possible to utilize unlabeled data either by initializing the model or by assigning seemingly real labels to unlabeled samples before the actual model is implemented. Our results suggest that, of the single BERT models, FinBERT performs best for our use case. Moreover, ensemble learning that combines multiple models further improves model performance and predictive power, outperforming a single FinBERT model. Both pseudo-labeling and ensemble learning proved valuable for extending training data for low-resource languages such as Finnish. However, with pseudo-labeling, proper regularization methods should be considered to prevent confirmation bias from affecting model performance. [en]
dc.format.mimetype: application/pdf
dc.language.iso: eng
dc.publisher: IEEE
dc.relation.ispartof: 2022 BESC : 9th International Conference on Behavioural and Social Computing
dc.rights: In Copyright
dc.subject.other: sentiment analysis
dc.subject.other: low-resource language
dc.subject.other: pseudo-labeling
dc.subject.other: BERT
dc.subject.other: ensemble learning
dc.title: Dealing with a small amount of data : developing Finnish sentiment analysis
dc.type: conferenceObject
dc.identifier.urn: URN:NBN:fi:jyu-202302241900
dc.contributor.laitos: Yhteiskuntatieteiden ja filosofian laitos [fi]
dc.contributor.laitos: Department of Social Sciences and Philosophy [en]
dc.contributor.oppiaine: Hyvinvoinnin ja hoivan politiikat (painoala) [fi]
dc.contributor.oppiaine: Hyvinvoinnin tutkimuksen yhteisö [fi]
dc.contributor.oppiaine: Yhteiskuntapolitiikka [fi]
dc.contributor.oppiaine: Policies and Politics of Welfare and Care (focus area) [en]
dc.contributor.oppiaine: School of Wellbeing [en]
dc.contributor.oppiaine: Social and Public Policy [en]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper
dc.relation.isbn: 979-8-3503-9815-1
dc.type.coar: http://purl.org/coar/resource_type/c_5794
dc.description.reviewstatus: peerReviewed
dc.type.version: acceptedVersion
dc.rights.copyright: © 2022 IEEE
dc.rights.accesslevel: openAccess [fi]
dc.relation.conference: International Conference on Behavioural and Social Computing
dc.subject.yso: suomen kieli (Finnish language)
dc.subject.yso: tietokonelingvistiikka (computational linguistics)
dc.format.content: fulltext
jyx.subject.uri: http://www.yso.fi/onto/yso/p8856
jyx.subject.uri: http://www.yso.fi/onto/yso/p6069
dc.rights.url: http://rightsstatements.org/page/InC/1.0/?language=en
dc.relation.doi: 10.1109/besc57393.2022.9995536
dc.type.okm: A4


Unless otherwise noted, the license for this item is In Copyright.