A Contrastive Evaluation of Word Sense Disambiguation Systems for Finnish

Previous work in Word Sense Disambiguation (WSD), like many tasks in natural language processing


Introduction
Like many natural language understanding tasks, Word Sense Disambiguation (WSD) has been referred to as AI-complete (Mallery, 1988, p. 57). That is to say, it is considered as hard as the central problems in artificial intelligence, such as passing the Turing test (Turing, 1950). While in the general case this may be true, the best current systems can at least do better than the (quite tough to beat) Most Frequent Sense (MFS) baseline. Evaluations against common datasets and dictionaries, largely following procedures set out by the shared tasks under the auspices of the SensEval and SemEval workshops, have been key to creating measurable progress in WSD.
For English,  present a recent comparison of different WSD systems across harmonised SensEval and SemEval data sets. Within the Uralic languages, Kahusk et al. (2001) created a manually sense annotated corpus of Estonian so that it could be included in SensEval-2. Two systems based on supervised learning were submitted, presented by Yarowsky et al. (2001) and Vider and Kaljurand (2001). Both systems failed to beat the MFS baseline (Edmonds, 2002, Table 1). For Hungarian, Miháltz (2010) created a sense tagged corpus by translating sense tagged data from English into Hungarian and then performed WSD with a number of supervised systems. Precision was compared with an MFS baseline, but the comparison was only given on a per-word basis. Up until this point, however, no work providing this type of contrastive evaluation of WSD has been published for Finnish. This work rectifies the situation, giving results for systems representing the major approaches to WSD, including some of the systems which have performed best at the task for other languages.

Data and Resources
The minimum resources required to conduct a WSD evaluation are a Lexical Knowledge Base (LKB) and an evaluation corpus. Supervised systems additionally require a training corpus. The current generation of NLP systems make copious use of word embeddings as lexical resources, as do some of the systems evaluated here, and so these are also needed. Here, the FinnWordNet (FiWN) (Lindén and Carlson, 2010) LKB is used, while both the evaluation and training corpus are based on the EuroSense corpus. The rest of this section describes these linguistic resources and their preparation in more depth.

Obtaining a Sense Tagged Corpus
EuroSense  is a multilingual sense tagged corpus, obtained by running the knowledge based Babelfy (Moro et al., 2014) WSD algorithm on multilingual texts. To use this corpus in a way which is compatible with the maximum number of systems and in line with the standards of previous evaluations, it first has to be preprocessed. The preprocessing pipeline is shown in Figure 1.
In the first stage, drop non-Finnish, all non-Finnish text and annotations are removed from the stream.

EuroSense is tagged with synsets from the BabelNet LKB (Navigli and Ponzetto, 2012). This knowledge base is based on the WordNets of many languages, enriched and modified according to other sources such as Wikipedia and Wiktionary. However, here the LKB to be used is FinnWordNet. A mapping file was extracted from BabelNet using its Java API and a local copy, obtained through direct communication with its authors¹. The BabelNet lookup stage applies this mapping. This stage drops annotations which do not exist in FiWN according to the mapping. A BabelNet synset can also map to multiple FiWN synsets, in which case an ambiguous annotation can be produced.

The re-anchor and re-lemmatise stages clean up some problems with the grammatical analyses in EuroSense. EuroSense anchors sometimes include help words associated with certain verb conjugations, for example negative forms, e.g. "ei mene", or the perfect construction "on käynyt". Re-anchor removes these words from the anchor, taking care of the cases in which the whole anchor could actually refer to a lemma form in WordNet, e.g. "olla merkitystä". Re-lemmatise checks that the current lemma is associated with the annotated synsets in FiWN. In case there are no matching synsets, we look back at the surface form and check all possible lemmas obtained from OMorFi (Pirinen, 2015)² for matches against FiWN. At this point, any annotations which do not have exactly one lemma and one synset which exist in FiWN are dropped.

In the penultimate stage, remove empty, any sentences without any annotations are removed entirely. Finally, the XML format is converted from the stand-off annotations of the EuroSense format to the inline annotations of the unified format of .
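The filtering behaviour of the BabelNet lookup stage can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the mapping dictionary, synset identifiers, and function name are all invented for the example.

```python
def babelnet_lookup(annotations, bn_to_fiwn):
    """Map BabelNet synset annotations to FiWN synsets.

    Annotations with no FiWN counterpart are dropped; a BabelNet
    synset mapping to several FiWN synsets yields an ambiguous
    annotation (a list of candidate synsets).
    """
    result = []
    for anchor, bn_synset in annotations:
        fiwn_synsets = bn_to_fiwn.get(bn_synset, [])
        if not fiwn_synsets:
            continue  # drop: no FiWN counterpart according to the mapping
        result.append((anchor, fiwn_synsets))
    return result

# Hypothetical mapping: one unambiguous, one ambiguous, one missing synset
mapping = {"bn:001": ["fiwn:a"], "bn:002": ["fiwn:b", "fiwn:c"]}
annots = [("mennä", "bn:001"), ("käydä", "bn:002"), ("olla", "bn:999")]
print(babelnet_lookup(annots, mapping))
```

Here "olla" is dropped because its synset has no FiWN counterpart, while "käydä" receives an ambiguous annotation with two candidates.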
The corpus is then split into testing and training sections. The testing corpus is made up of the first 1000 sentences, resulting in 4507 tagged instances. The resulting corpus is already sentence and word segmented. Additionally, the instance to be disambiguated is passed to each system with the correct lemma and part of speech tag, meaning the evaluation only tests the disambiguation stage of a full WSD pipeline and not the candidate extraction or POS tagging stage. The corpus is further processed with FinnPOS (Silfverberg et al., 2016)³ for systems that need POS tags and/or lemmas for the words in the context.

Enriching FinnWordNet with frequency data
Many WSD techniques based on WordNet, including the typical implementation of the MFS baseline, assume it is possible to pick the most frequent sense of a lemma by picking the first sense. The reason this works with Princeton WordNet (PWN) (Miller et al., 1990) is that word senses are numbered in descending order of sense occurrence counts, based on the part of the Brown corpus used during its creation⁴. FinnWordNet senses, on the other hand, are randomly ordered.
Since this data is potentially needed even by knowledge based systems, which should not have access to a training corpus, it is estimated here based on the frequency data in PWN. Unlike most PWN aligned WordNets, which are aligned at the synset level, FinnWordNet is aligned with PWN at the lemma level. An example of when this distinction takes effect is when lemmas are structurally similar. For example, in the synset "singer, vocalist, vocalizer, vocaliser", the Finnish lemma laulaja is mapped only to singer rather than to every lemma in the synset. When there is no clear distinction to be made, whole synsets are mapped. This reasoning fits with the existing structure of PWN: Relations between synsets encode purely semantic concerns, whereas relations between lemmas encode so-called morpho-semantic relationships, such as morphological derivation.
Let the Finnish-English lemma mapping be denoted L. The frequency estimate for a Finnish lemma l_fin is then defined like so:

freq(l_fin) = Σ_{(l_fin, l_eng) ∈ L} freq(l_eng) / |{l'_fin : (l'_fin, l_eng) ∈ L}|

The rationale of this approach is that it causes the frequencies of English lemmas to be evenly distributed across all the Finnish lemmas which they map to.

²https://github.com/flammie/omorfi
³https://github.com/mpsilfve/FinnPos
⁴This data is overlapping with, but distinct from, SemCor (Miller et al., 1993).
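A minimal sketch of this frequency transfer follows, reusing the laulaja/singer example from above. The "songster" mappings and all counts are invented for illustration, and the sketch assumes the mapping L contains no duplicate pairs, so the size of each English lemma's fan-out set equals its number of mapping pairs.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def transfer_frequencies(pairs, eng_freq):
    """Distribute each English lemma's frequency evenly across the
    Finnish lemmas it maps to (pairs = the lemma mapping L)."""
    # |{l'_fin : (l'_fin, l_eng) in L}|, assuming no duplicate pairs
    fanout = Counter(eng for _, eng in pairs)
    fin_freq = defaultdict(Fraction)
    for fin, eng in pairs:
        fin_freq[fin] += Fraction(eng_freq[eng], fanout[eng])
    return dict(fin_freq)

L = [("laulaja", "singer"), ("laulaja", "songster"), ("laulajatar", "songster")]
freqs = transfer_frequencies(L, {"singer": 10, "songster": 4})
# laulaja gets all of singer's count plus half of songster's: 10 + 4/2 = 12
print(freqs)
```

Keeping the results as exact fractions matches the later conversion step, where all frequencies are rescaled by a common multiple of the denominators to obtain integer counts.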
To integrate the resulting synthetic frequency data into as many applications as possible, it is made available in the WordNet format⁵. The WordNet format requires sense occurrence counts, meaning the frequency data must be converted to integer values. To perform this conversion, all frequencies are multiplied by the lowest common multiple of the divisors in the above formula. Some care must be taken in downstream applications, since the resulting counts are no longer true counts, but rescaled probabilities. The main consequence here is that systems which use +1 smoothing are reconfigured to use +1000 smoothing.

Table 1 summarises the word embeddings used here. Due to the large number of word forms a Finnish lemma can take, it is of note here whether the word embedding represents word forms or lemmas. In the case an embedding represents word forms, it is additionally of note whether it uses any subword or character level information during its training, which should help to combat data sparsity. Despite the use of subword information, none of these embeddings can analyse out of vocabulary word forms. Cross-lingual word embeddings embed words from multiple languages in the same space, a property utilised in Section 3.2.2.

Word embeddings
To extend word representations to sequences of words such as sentences, taking the arithmetic mean of word embeddings (AWE) has been commonly used as a baseline. Various incremental modifications have been suggested. Rücklé et al. (2018) suggest concatenating the vectors formed by multiple power means, including the arithmetic mean. Variants CATP3 and CATP4 are used here: the former is the concatenation of the minimum, arithmetic mean, and maximum, while the latter also contains the 3rd power mean. Arora et al. (2017) proposed Smooth Inverse Frequency (SIF), which takes a weighted average with weights a/(a + p(w)), where a is a parameter and p(w) is the probability of the word, and then performs common component removal on the resulting vector. In the variant used here (referred to as pre-SIF), a is set to the suggested value of 10⁻³ and common component removal is not performed, while p(w) is estimated based upon the word frequency data of Speer et al. (2018)⁶.

⁵Made available at https://github.com/frankier/fiwn.
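These aggregation schemes can be sketched with toy two-dimensional vectors. This is an illustrative sketch only; in particular, the pre-SIF function below omits common component removal, as described above, and the word probabilities are invented.

```python
import numpy as np

def awe(vecs):
    """Arithmetic mean of word embeddings (AWE)."""
    return np.mean(vecs, axis=0)

def catp3(vecs):
    """CATP3: concatenation of the minimum, arithmetic mean, and maximum."""
    v = np.asarray(vecs)
    return np.concatenate([v.min(axis=0), v.mean(axis=0), v.max(axis=0)])

def pre_sif(vecs, probs, a=1e-3):
    """SIF weighting a/(a + p(w)), without common component removal."""
    weights = [a / (a + p) for p in probs]
    return np.average(vecs, axis=0, weights=weights)

vecs = [[1.0, 0.0], [0.0, 1.0]]
print(awe(vecs))          # the mean of the two vectors
print(catp3(vecs).shape)  # three concatenated 2-d vectors -> (6,)
```

Note that CATP3 triples the dimensionality of the representation, which is why it is listed separately from plain AWE in the result tables.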

Systems and Results
This evaluation is based on the all-words variant of the WSD task. In this task, the aim is to identify and disambiguate all words in some corpus. This is contrasted with the lexical sample approach, where a fixed set of words is chosen for evaluation. There are many systems and approaches which have been proposed for performing WSD. To select techniques for this evaluation, the following criteria were used:
• Prefer techniques which have been used in previous evaluations for English.
• Prefer techniques with existing open source code that can be adapted.
• Apart from this, include also simple schemes, especially if they represent an approach to WSD not covered otherwise.
The last criterion has led to the inclusion of multiple techniques based upon representation learning, where some representation of words or groups of words is learned in an unsupervised manner from a large corpus. To perform WSD based on these representations a relatively simple classifier, such as a nearest neighbour classifier, is then used. This approach to WSD additionally acts as a grounded extrinsic evaluation of the quality of the representations. The results of the evaluation are summarised in Table 2, with variants of the Cross-lingual Lesk and AWE-NN systems broken down in Tables 3 and 4. The rest of this section describes each of the systems in more detail.

Baseline
We can define limits for the performance of the WSD systems. The floor is defined by the proportion of unambiguous test instances. It is the F₁ score obtained by a system which makes correct guesses for unambiguous instances and incorrect guesses for every other instance. The ceiling applies to systems based upon supervised learning, and is the proportion of test instances for which the true sense exists in the training data. It is the F₁ score obtained by a system which correctly associates every item in the test data with the true class seen in the training data, and makes an incorrect guess for every other instance.
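The floor and ceiling can be computed directly from the data. A minimal sketch, assuming test instances are (lemma, gold sense) pairs, a lookup of candidate senses per lemma, and the set of senses attested in the training data; all names and values here are hypothetical:

```python
def floor_score(test, candidates):
    """Floor: proportion of test instances that are unambiguous."""
    return sum(len(candidates[lemma]) == 1 for lemma, _ in test) / len(test)

def ceiling_score(test, train_senses):
    """Supervised ceiling: proportion of test instances whose gold
    sense was seen at least once in the training data."""
    return sum(sense in train_senses for _, sense in test) / len(test)

candidates = {"kieli": ["tongue", "language"], "talo": ["house"]}
test = [("kieli", "language"), ("talo", "house")]
print(floor_score(test, candidates))   # only "talo" is unambiguous
print(ceiling_score(test, {"house"}))  # only "house" was seen in training
```

Both scores equal the F₁ of the idealised systems described above, since those systems attempt every instance (precision, recall and F₁ coincide).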
The random sense baseline picks a random sense by picking the first sense according to a version of FinnWordNet without the frequency data from Section 2.2, i.e. the original sense order in FinnWordNet is assumed to be random. This also gives us a rough estimate of the average ambiguity of the gold standard: 1/29.8% ≈ 3. The MFS baseline also picks the first sense, but uses the estimated frequencies from Section 2.2.

Knowledge based systems
Knowledge based WSD systems use only information in the LKB. In almost all dictionary style resources, this can include the text of the definitions themselves. In WordNet style resources, this can include also the graphical structure of the LKB.

UKB
UKB (Agirre et al., 2014) is a knowledge based system, representing the graph based approach to WSD. Since it works on the level of synsets, the main algorithm is essentially language independent, with the candidate extraction step being the main language dependent component. UKB can also make use of language specific word sense frequencies.
As noted in Agirre et al. (2018), it is easy to get a wide range of results from UKB depending on the particular configuration. The configurations used here are based on the recommended configuration given by Agirre et al. (2018). For all configurations, the ppr_w2w algorithm is used, which runs personalised PageRank for each target word. One notable configuration difference here is that the contexts passed to UKB are fixed to a single sentence. This is the same input as is given to the other systems in this evaluation. Variations with and without access to word sense frequency information are given (freq & no freq), with the latter assumed to be similar to the configuration given in .
By default, the lemmas and POS tags in the contexts given to UKB are from the sense tagged instances of EuroSense. Since some instances have been filtered from EuroSense so as to retain high precision, it may be that UKB is hamstrung by an insufficient context size. To increase the information in the context without extending it beyond the sentence boundary, a high recall, low precision lemma extraction procedure based on OMorFi is performed. The procedure (referred to in Table 2 as extract) adds to the context all possible lemmas of each word form, including parts of compound words, and also extracts multiwords that are in FiWN.

Lesk with cross-lingual word embeddings
A variant of Lesk, referred to hereafter as Lesk with cross-lingual word embeddings (Cross-lingual Lesk), is included to represent the gloss based approach to WSD. The variant presented here is loosely based upon Basile et al. (2014). The technique is a derivative of simplified Lesk (Kilgarriff and Rosenzweig, 2000) in that words are disambiguated by comparing contexts and glosses. For each candidate definition, the word vectors of each word in the definition text are aggregated to obtain a definition vector. The word vectors of the words in the context of the word being disambiguated are also aggregated to obtain a context vector. Definitions are then ranked from best to worst in descending order of cosine similarity between their definition vector and the context vector. Frequency data (freq) can be incorporated by multiplying the obtained cosine similarities by the smoothed probabilities of the synset given the lemma.
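The ranking step can be sketched as follows. This is a minimal illustration of the scoring logic only, assuming the definition and context vectors have already been aggregated; the sense labels, vectors, and probabilities are invented.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lesk_rank(context_vec, definitions, sense_probs=None):
    """Rank candidate senses by cosine similarity between the context
    vector and each definition vector. If sense_probs is given (the
    freq variant), similarities are multiplied by the smoothed
    probability of each sense given the lemma."""
    scores = {}
    for sense, def_vec in definitions.items():
        score = cosine(context_vec, def_vec)
        if sense_probs is not None:
            score *= sense_probs[sense]
        scores[sense] = score
    return sorted(scores, key=scores.get, reverse=True)

defs = {"s1": np.array([1.0, 0.0]), "s2": np.array([0.0, 1.0])}
print(lesk_rank(np.array([0.9, 0.1]), defs))  # s1's gloss is closer
```

As the freq variant shows, a sufficiently skewed sense distribution can overturn the similarity ranking, which is why the frequency-weighted variants behave differently in Table 3.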
Since the words in the context are Finnish, but the words in the definitions are English, cross-lingual word vectors are required. The embeddings used are fastText, Numberbatch and the concatenation of both. Other variations are made by the choice of aggregation function, choosing whether or not to only include words which occur in FiWN, and whether glosses are expanded by adding also the glosses of related synsets. The gloss expansion procedure follows Banerjee and Pedersen (2002, Chapter 6). The results are summarised in Table 3.

Supervised systems
Supervised WSD systems are based on supervised machine learning. Most typically in WSD a separate classifier is learned for each individual lemma.

SupWSD
SupWSD (Papandrea et al., 2017) is a supervised WSD system following the traditional paradigm of combining hand engineered features with a linear classifier, in this case a support vector machine. SupWSD is largely a reimplementation of It Makes Sense (Zhong and Ng, 2010); as such, it uses the same feature templates, and its results should be largely comparable. It was chosen over It Makes Sense since it can handle larger corpora.
All variants include the POS tag and local collocation feature templates, and the default configuration also includes the set of words in the sentence. Variants incorporating the most successful configuration of Iacobacci et al. (2016), exponential decay averaging of word vectors with a window size of 10, are also included for each applicable word embedding from Section 2.3. For each configuration incorporating word vectors, variants without the set of words in the sentence are included, denoted e.g. Word2Vec₋s.
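Exponential decay averaging can be sketched as follows: context word vectors are averaged with weights that decay exponentially with distance from the target word. The decay rate alpha here is a hypothetical illustrative value, not the parameter used by Iacobacci et al. (2016) or SupWSD.

```python
import numpy as np

def exp_decay_average(vecs, target_idx, window=10, alpha=0.9):
    """Weighted average of context word vectors within a window around
    the target word, with weights decaying exponentially with distance.
    The target word itself is excluded from the average."""
    weights, chosen = [], []
    for i, v in enumerate(vecs):
        d = abs(i - target_idx)
        if d == 0 or d > window:
            continue  # skip the target word and out-of-window words
        chosen.append(v)
        weights.append(alpha ** (d - 1))
    return np.average(chosen, axis=0, weights=weights)

# Both context words are at distance 1 from the target, so they are
# weighted equally and the result is their plain mean.
vecs = [np.array([1.0, 0.0]), np.array([0.0, 0.0]), np.array([0.0, 1.0])]
print(exp_decay_average(vecs, target_idx=1))
```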

Nearest neighbour using word embeddings
Nearest neighbour using word embeddings has been used previously by Melamud et al. (2016) as a baseline. This system is very similar to the one outlined in Section 3.2.2. The main difference is that word senses are now represented by all memorised training instances, each themselves represented by the aggregation of word embeddings in their contexts. When a training instance is the nearest neighbour of a test instance, based on cosine distance, its tagged sense is applied to the test instance. This moves the technique from the realm of knowledge based WSD to supervised WSD. Since both tagged instances and the untagged context to be disambiguated are in Finnish, the constraint that word embeddings must be cross-lingual is removed. The results are summarised in Table 4.
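The classification step can be sketched as a plain 1-nearest-neighbour lookup over memorised training instances; the vectors and sense labels below are invented, and the context vectors are assumed to have been aggregated beforehand using one of the schemes from Section 2.3.

```python
import numpy as np

def cosine_dist(u, v):
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nn_wsd(test_vec, training_instances):
    """Assign the sense of the nearest memorised training instance.
    Each instance is a (context_vector, sense) pair; distance is the
    cosine distance between aggregated context embeddings."""
    best = min(training_instances, key=lambda inst: cosine_dist(test_vec, inst[0]))
    return best[1]

train = [(np.array([1.0, 0.0]), "sense_a"), (np.array([0.0, 1.0]), "sense_b")]
print(nn_wsd(np.array([0.8, 0.2]), train))  # nearest to sense_a's context
```

In practice the memorised instances would be restricted to those sharing the target lemma, following the per-lemma classifier setup described in Section 3.2.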

Discussion & Conclusion
This paper has presented the first comparative WSD evaluation for Finnish. In the results presented here, several systems beat the MFS baseline; of the knowledge based systems, both UKB and some variants of Cross-lingual Lesk incorporating frequency information beat it.

This evaluation may be limited by a number of issues. Multiple issues stem from the use of EuroSense. Due to the way it is automatically induced, it contains errors, making its use problematic, especially its use as a gold standard. First, we model these errors as occurring in an essentially random manner. In this case a perfect WSD system would get a less than perfect score, and in fact the performance of all systems would be expected to decrease. It is worth noting that since inter-annotator agreement can be relatively low for word sense annotation, manual annotations can also be modelled as having this type of problem to some degree. Random errors in the training data would also cause the supervised systems to perform worse; however, this does not affect the overall integrity of the evaluation. It is likely, though, that EuroSense in fact contains systematic errors. One type of systematic error is an error of omission: EuroSense assigns senses to a subset of all possible candidate words, filtering out those to which the Babelfy algorithm cannot assign sufficient confidence. This means that the gold standard may be missing words which are in some sense more difficult, artificially increasing the scores of systems which would also have problems with these same words. Perhaps worse are systematic errors which bias certain lemmas within certain types of contexts towards certain incorrect senses. In this case, supervised systems may seem to perform better, but only because they are essentially learning to replicate the systematic errors in EuroSense rather than because they are performing WSD more accurately.
Another factor which may cause this evaluation to present too optimistic a picture of the performance of supervised systems is that the evaluation corpus and training corpus are from the same domain, parliamentary proceedings, which could result in an inflated score in comparison to an evaluation corpus from another domain. Finally, since the corpus is derived from EuroParl, the original language of most of the text is likely not Finnish. Particular features of translated language, sometimes referred to as translationese, may affect the applicability of the results to non-translated Finnish⁷.
Finally, the MFS baseline may have been handicapped in terms of its performance. On the one hand, the MFS baseline may be reasonably analogous to MFS baselines in WSD evaluations for other languages, in that it is ultimately derived from frequency data which is out of domain. On the other hand, estimating the frequencies based on English frequency data is likely quite inaccurate when compared to a possible estimation based on a reasonably sized Finnish language tagged corpus.
⁷For an exploration of some features of translationese in EuroParl, see Koppel and Ordan (2011).

Further work could address the issues with the gold standard by creating a cross-domain manually annotated corpus, ideally based on a corpus of text originally written in Finnish. A training corpus could also be created manually, but this would be a much larger task. It would, however, allow a better MFS baseline to be created. A less work-intensive way of improving the situation with the MFS baseline would be to add one based on the supervised training data, and consider this as an extra MFS baseline, only for supervised methods.
The implementations of the techniques reimplemented for this evaluation, as well as the scripts and configuration files for the adapted open source systems, are publicly available under the Apache v2 license. To ease replicability further, the entire evaluation framework, including all the requirements, WSD systems and lexical resources, is made available as a Docker image⁸.