The MULTEXT-East resources are a multilingual dataset for language engineering research and development. For Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Resian, Romanian, Russian, Slovene, and Serbian, the dataset contains some or all of the following language resources:
  • the MULTEXT-East morphosyntactic specifications, lexica, and annotated "1984" corpus;
  • MULTEXT-East parallel and comparable text and speech corpora;
  • and associated documentation.
The complete corpora as well as the documentation are encoded in TEI P4.

The MULTEXT-East project was a spin-off of MULTEXT and ran from 1995 to 1997. MULTEXT-East developed language resources for six languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene, as well as for English, as the ‘hub’ language of the project. It also adapted existing tools and standards to these languages. The main results of the project were an annotated multilingual corpus and lexical resources for the seven languages.

The extended results of the project were made available in 1998, first on CD-ROM and then via TRACTOR, the TELRI Research Archive of Computational Tools and Resources.

In the scope of the Concede project, a new release was made available in 2002; it contained only the (updated and corrected) morphosyntactic resources from the first release. This second release was made freely available for research use via the Web.

Finally, the third release was made in 2004. It updates and brings together the first two, adds new languages, and makes the move from SGML to XML, in particular to TEI P4; this work was supported by the TEI task force on SGML to XML migration. Version 3 is also available via the Web, from the home page of the project.

For further information on the MULTEXT-East project, its results, and their exploitation, consult the annotated bibliography of MULTEXT-East, available in HTML and various other formats from the project Web page.

(from the MULTEXT-East WWW page)


Tomaž Erjavec
Jožef Stefan Institute
Jamova 39
SI-1000 Ljubljana
Tel: +386 1 477-3507
Fax: +386 1 425-1038

Last recorded change to this page: 2007-09-21