The Oslo Multilingual Corpus


"We are currently developing the Oslo Multilingual Corpus (OMC), which is an extension of the English-Norwegian Parallel Corpus (ENPC).

The ENPC consists of text excerpts of approximately 10,000 to 15,000 words from fictional and non-fictional Norwegian and English original texts and their translations, amounting to a total of 200 texts, or 2.6 million words. German, Dutch and Portuguese translations were added for some of the texts. The texts are SGML-encoded and aligned at sentence level.

The corpus is now being extended on the German side in particular, to ensure equal representation of texts in English, German, and Norwegian, to the extent that this is possible. Recently, the project has been extended to French. Eventually, the corpus will contain original texts in four languages (English, German, French, Norwegian) and their translations into as many as possible of the other three languages. Currently (October 2001), the English-German-Norwegian part of the corpus consists of 32 English, 31 German, and 22 Norwegian original texts with translations into the other two languages, whereas the French-Norwegian part comprises excerpts from 10 Norwegian and 10 French non-fictional texts with their respective translations.

Due to copyright restrictions, the corpus is only available to researchers and graduate students at the universities in Oslo and Bergen."

– Oslo Multilingual Corpus WWW page


Stig Johansson
Department of British and American Studies
University of Oslo
Bergljot Behrens
Department of Linguistics
University of Oslo

Last recorded change to this page: 2007-09-21