Deutsches Textarchiv (The German Text Archive)

General description: The DFG-funded project Deutsches Textarchiv (DTA) started in 2007 and is based at the Berlin-Brandenburg Academy of Sciences and Humanities (Berlin-Brandenburgische Akademie der Wissenschaften, BBAW). Its goal is to digitize a large cross-section of printed works in modern New High German, ranging from ca. 1600 to 1900. Page images and electronic full text are available online; the full text can be downloaded as HTML, XML, TCF or plain text. The DTA presents almost exclusively the first editions of the respective works. As of April 2016, 2422 texts dating from 1600–1900 are online and over 400 more are being prepared for publication, comprising a total of more than 650,000 digitized pages with around 1.1 billion characters and roughly 157 million tokens.

Most DTA texts are transcribed by non-native speakers using the double-keying method (vendors guarantee a character accuracy of at least 99.9%). The DTA provides linguistic applications for its corpus, i.e. tokenization, lemmatization, lemma-based and phonetic search, and rewrite rules for historical spelling.
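As an illustration of the idea behind double keying (two independent transcriptions of the same page are compared, and every disagreement is flagged for review), here is a minimal sketch. It is not the DTA's or any vendor's actual workflow; the function name `keying_discrepancies` and the sample line are hypothetical.

```python
from difflib import SequenceMatcher


def keying_discrepancies(first_keying, second_keying):
    """Return (position, first_variant, second_variant) for every span
    where two independently keyed transcriptions disagree."""
    # autojunk=False keeps the comparison purely character-based.
    matcher = SequenceMatcher(None, first_keying, second_keying, autojunk=False)
    discrepancies = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            discrepancies.append((i1, first_keying[i1:i2], second_keying[j1:j2]))
    return discrepancies


# Hypothetical sample line keyed twice; the second typist misread one letter.
first = "Die Wiſſenſchaft der Natur"
second = "Die Wiſſenſchaft der Natnr"

for position, left, right in keying_discrepancies(first, second):
    print(position, repr(left), repr(right))
```

Positions where both keyings agree are accepted automatically; only the flagged spans would need a human decision.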

All DTA texts are freely available for download in several formats: the original XML/TEI texts, an HTML-rendered version, two different kinds of TCF versions, and the raw text transcription. Moreover, CMDI metadata comprising TEI header information may be harvested via OAI-PMH.
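Harvesting via OAI-PMH could be sketched roughly as follows. The `ListRecords` verb, the `metadataPrefix` argument and the response namespace come from the OAI-PMH 2.0 protocol; the base URL `https://example.org/oai`, the prefix value `cmdi`, the record identifier and the sample response are placeholders and assumptions, not the DTA's actual endpoint or data.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_identifiers(response_xml):
    """Extract the record identifiers from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hypothetical, heavily shortened response used instead of a live request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:sample-text-1</identifier>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Placeholder endpoint and assumed prefix value.
print(list_records_url("https://example.org/oai", "cmdi"))
print(record_identifiers(SAMPLE_RESPONSE))
```

A real harvester would additionally follow the protocol's `resumptionToken` to page through large result sets.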

Implementation description: Each text in the DTA is encoded in the XML/TEI P5 format. The markup describes text structures (headlines, paragraphs, speakers, poem lines, index items etc.) as well as the physical layout of the text, down to the position of each character on a page. The text annotation follows the DTA “Base Format” (DTABf), a customization of the TEI P5 Guidelines. The DTABf consists of about 80 TEI P5 elements within <text>, which are needed for the basic formal and semantic structuring of the DTA reference corpus. The DTABf was developed to achieve coherence at the annotation level, given the heterogeneity of the DTA text material over time (1600–1900) and across text types (fiction, functional and scientific texts). Further, frequently updated information on the DTABf is available here: (description), (overview: table elements within text).
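To give a rough idea of how such structural markup can be processed, the following sketch counts the elements used inside `<text>` of a TEI P5 document. The sample document is invented for illustration and is not taken from the DTA; the elements shown (`head`, `p`, `lg`, `l`) are standard TEI P5 elements, and their exact use in the DTABf is assumed here rather than confirmed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical miniature document: one headline, one paragraph, two poem lines.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Erstes Kapitel</head>
        <p>Ein Absatz.</p>
        <lg>
          <l>Erste Zeile</l>
          <l>Zweite Zeile</l>
        </lg>
      </div>
    </body>
  </text>
</TEI>"""


def element_counts(tei_xml):
    """Count how often each element occurs within <text>, including <text> itself."""
    root = ET.fromstring(tei_xml)
    text = root.find(f"{{{TEI_NS}}}text")
    counts = Counter()
    for element in text.iter():
        # Strip the "{namespace}" prefix that ElementTree puts before tag names.
        counts[element.tag.split("}", 1)[-1]] += 1
    return counts


for name, count in sorted(element_counts(SAMPLE).items()):
    print(name, count)
```

Run over a whole corpus, a tally like this shows which elements actually occur, which is one simple way to check texts against a restricted element set.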

Access: Open access / CC BY-NC

References (selected from
  • Haaf, Susanne, Alexander Geyken and Frank Wiegand (2014/15): “The DTA ‘Base Format’: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources”. In: Journal of the Text Encoding Initiative (jTEI) 8, 2014–2015.
  • Geyken, Alexander (2013): “Wege zu einem historischen Referenzkorpus des Deutschen: das Projekt Deutsches Textarchiv”. In: Perspektiven einer corpusbasierten historischen Linguistik und Philologie. Internationale Tagung des Akademienvorhabens „Altägyptisches Wörterbuch“ an der Berlin-Brandenburgischen Akademie der Wissenschaften, 12.–13. Dezember 2011. Edited by Ingelore Hafemann, Berlin 2013, pp. 221–234.
  • Jurish, Bryan (2013): “Canonicalizing the Deutsches Textarchiv”. In: Perspektiven einer corpusbasierten historischen Linguistik und Philologie. Internationale Tagung des Akademienvorhabens „Altägyptisches Wörterbuch“ an der Berlin-Brandenburgischen Akademie der Wissenschaften, 12.–13. Dezember 2011. Edited by Ingelore Hafemann, Berlin 2013, pp. 235–244.
  • Haaf, Susanne, Frank Wiegand and Alexander Geyken (2013): “Measuring the Correctness of Double-Keying: Error Classification and Quality Control in a Large Corpus of TEI-Annotated Historical Text”. In: Journal of the Text Encoding Initiative (jTEI) 4, 2013.
  • Thomas, Christian and Frank Wiegand (2012): “Making great work even better. Appraisal and Digital Curation of widely dispersed Electronic Textual Resources (c. 15th–19th cent.) in CLARIN-D”. Full paper for the International Conference “Historical Corpora 2012”, December 6–9, 2012, Goethe University, Frankfurt, Germany. [Updated print version in: Gippert, Jost / Gehrke, Ralf (eds.): Historical Corpora. Challenges and Perspectives. Tübingen 2015, pp. 181–196.]
  • Geyken, Alexander, Susanne Haaf and Frank Wiegand (2012): “The DTA ‘base format’: A TEI-Subset for the Compilation of Interoperable Corpora”. In: 11th Conference on Natural Language Processing (KONVENS) – Empirical Methods in Natural Language Processing, Proceedings of the Conference. Edited by Jeremy Jancsary. Wien, 2012 (= Schriftenreihe der Österreichischen Gesellschaft für Artificial Intelligence 5).
  • Geyken, Alexander et al. (2012): “TEI und Textkorpora: Fehlerklassifikation und Qualitätskontrolle vor, während und nach der Texterfassung im Deutschen Textarchiv”. In: Jahrbuch für Computerphilologie.
  • Geyken, Alexander et al. (2011): “Das Deutsche Textarchiv: Vom historischen Korpus zum aktiven Archiv”. In: Digitale Wissenschaft. Stand und Entwicklung digital vernetzter Forschung in Deutschland, 20./21. September 2010. Beiträge der Tagung. Edited by Silke Schomburg, Claus Leggewie, Henning Lobin and Cornelius Puschmann. 2nd, expanded edition. hbz, 2011, pp. 157–161.
  • Jurish, Bryan (2011): “Finite-state Canonicalization Techniques for Historical German”. PhD thesis, Universität Potsdam, January 2011, URN urn:nbn:de:kobv:517-opus-55789.


Susanne Haaf / Matthias Boenig / Christian Thomas
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22–23
10117 Berlin
Tel: +49 (0)30 20370 523