Multilingual Text Tools and Corpora (MULTEXT)


Multext encompasses a series of projects whose goals are to develop standards and specifications for the encoding and processing of linguistic corpora, and to develop tools, corpora, and linguistic resources embodying these standards. These are being developed for a wide variety of languages, including Bambara, Bulgarian, Catalan, Czech, Dutch, English, Estonian, French, German, Hungarian, Italian, Kikongo, Occitan, Romanian, Slovenian, Spanish, Swedish, and Swahili. All Multext results are made freely and publicly available for non-commercial, non-military purposes.

Corpus Encoding Standard:

MULTEXT, along with EAGLES and the Vassar/CNRS collaboration (supported by the U.S. National Science Foundation), has developed a Corpus Encoding Standard intended to "serve as a widely accepted set of encoding standards for corpus-based work".


The Multext effort has been supported by the European Commission, under the Linguistic Research and Engineering, Copernicus, and Langues regionales et minoritaires programmes; the U.S. National Science Foundation, under the Vassar/CNRS collaboration; the Fonds Francophone pour la Recherche (AUPELF-UREF); the Centre National de la Recherche Scientifique (CNRS) and the Universite de Provence.


Dr. Jean Veronis (coordinator)
Laboratoire Parole et Langage
CNRS & Universite de Provence
29, Av. Robert Schuman
13621 Aix-en-Provence Cedex 1, France
Tel: (+33) 42 95 36 33
Fax: (+33) 42 59 50 96

Last recorded change to this page: 2007-09-21