The International Comparable Corpus : Challenges in building multilingual spoken and written comparable corpora

Čermáková, Ann; Jantunen, Jarmo; Jauhiainen, Tommi; Kirk, John; Křen, Michal; Kupietz, Marc; Uí Dhonnchadha, Elaine

Publisher's PDF

Abstract

This paper reports on the efforts of twelve national teams in building the International Comparable Corpus (ICC; https://korpus.cz/icc) that will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which is considered to be the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC corpus is committed to the idea of re-using existing multilingual resources as much as possible and the design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, ICC will contain approximately the same balance of 40 percent of written language and 60 percent of spoken language distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.

Main Authors

Čermáková, Ann; Jantunen, Jarmo; Jauhiainen, Tommi; Kirk, John; Křen, Michal; Kupietz, Marc; Uí Dhonnchadha, Elaine

Format

Articles; Research article

Published

2021

Series

Research in Corpus Linguistics

Subjects

ICC corpus

contrastive linguistics

comparable corpus

ICE corpus

data sustainability

linguistics

copyright

contrastive research

corpora

comparative linguistics

Publication in research information system

https://converis.jyu.fi/converis/portal/detail/Publication/98442746

Publisher

Asociación Española de Lingüística de Corpus

Original source

http://ricl.aelinco.es/first-view/155-Article%20Text-1147-1-10-20210618.pdf

The permanent address of the publication

https://urn.fi/URN:NBN:fi:jyu-202202071401

Review status

Peer reviewed

ISSN

2243-4712

DOI

https://doi.org/10.32714/ricl.09.01.06

Language

English

Published in

Research in Corpus Linguistics

Citation

Čermáková, A., Jantunen, J., Jauhiainen, T., Kirk, J., Křen, M., Kupietz, M., & Uí Dhonnchadha, E. (2021). The International Comparable Corpus : Challenges in building multilingual spoken and written comparable corpora. Research in Corpus Linguistics, 9(1), 89-103. https://doi.org/10.32714/ricl.09.01.06

