Deutsches Textarchiv (The German Text Archive)

Description: The DFG-funded project Deutsches Textarchiv (DTA) started in 2007 and is based at the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW). Its goal is to digitize a large cross-section of printed works in German dating from ca. 1650 to 1900. Page images and electronic full text are available online; the full text can be downloaded as HTML or XML. The DTA presents almost exclusively the first editions of the respective works. As of December 2011, more than 700 texts are available (i.e. 500 million characters), most of them transcribed by non-native speakers using the double-keying method (vendors guarantee a character accuracy of at least 99.9%). The DTA provides linguistic applications for its corpus, namely tokenization, lemmatization, lemma-based and phonetic search, and rewrite rules for historical spelling.
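
To illustrate what tokenization combined with rewrite rules for historical spelling can look like, here is a minimal sketch. The rules, the function names (tokenize, canonicalize) and the sample sentence are assumptions chosen for illustration; they are not the DTA's actual rule set or software. The sketch applies rules token by token without context, which is exactly the limitation that context-aware canonicalization tries to overcome.

```python
import re

# Hypothetical rewrite rules, for illustration only (not the DTA's rule set).
# Each rule maps a historical spelling pattern to a modern equivalent.
REWRITE_RULES = [
    (re.compile("ſ"), "s"),    # long s -> round s (Waſſer -> Wasser)
    (re.compile("ey"), "ei"),  # seyn -> sein
    (re.compile("th"), "t"),   # Theil -> Teil (over-generates, e.g. on "Theater")
]


def tokenize(text):
    """Naive tokenizer: return maximal runs of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text)


def canonicalize(token):
    """Apply every rewrite rule in order to one token, ignoring context."""
    result = token.lower()
    for pattern, replacement in REWRITE_RULES:
        result = pattern.sub(replacement, result)
    # Restore the initial capital, which matters for German nouns.
    if token[:1].isupper():
        result = result[:1].upper() + result[1:]
    return result


if __name__ == "__main__":
    sentence = "Der Theil muß seyn."
    print([canonicalize(t) for t in tokenize(sentence)])
```

Because the "th" rule would also rewrite words such as "Theater", a context-free rule list like this one can only serve as a first approximation.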

Implementation description: Each text in the DTA is encoded in XML following the TEI P5 Guidelines. The markup describes text structures (headings, paragraphs, speakers, lines of verse, index items, etc.) as well as the physical layout of the text, down to the position of each character on a page. The annotation follows the DTA "base format", a customization of the TEI P5 Guidelines. The DTA "base format" consists of about 80 TEI P5 <text> elements that are needed for the basic formal and semantic structuring of the DTA reference corpus. The "base format" was developed to achieve consistent annotation despite the heterogeneity of the DTA text material across time (1650-1900) and text types (fiction, functional and scientific texts). More frequently updated information on the DTA "base format" is available at http://kaskade.dwds.de/dtaq/help/basisformat (description) and http://kaskade.dwds.de/~wiegand/teixml/ (overview: table elements within text).
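
As a rough illustration of working with TEI P5 markup of this kind, the following sketch reads a small TEI fragment with Python's standard library and lists the structural elements found inside <text>. The sample document and the helper names (local_name, element_counts, texts_of) are invented for this example. The elements head, p, lg and l are standard TEI P5 elements, but whether and how a given DTA text uses them is an assumption here, not something taken from the DTA "base format" documentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Official TEI namespace; TEI P5 documents declare it on the root element.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample document, not an actual DTA text.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Erstes Kapitel</head>
        <p>Ein kurzer Absatz.</p>
        <lg>
          <l>Erste Verszeile</l>
          <l>Zweite Verszeile</l>
        </lg>
      </div>
    </body>
  </text>
</TEI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def element_counts(xml_string):
    """Count elements by name inside <text>, including <text> itself."""
    root = ET.fromstring(xml_string)
    text = root.find(f"{{{TEI_NS}}}text")
    return Counter(local_name(el.tag) for el in text.iter())


def texts_of(xml_string, name):
    """Return the text content of every TEI element with the given name."""
    root = ET.fromstring(xml_string)
    return ["".join(el.itertext()).strip()
            for el in root.iter(f"{{{TEI_NS}}}{name}")]


if __name__ == "__main__":
    print(element_counts(SAMPLE))
    print(texts_of(SAMPLE, "head"))
    print(texts_of(SAMPLE, "l"))
```

Counting element names in this way is one simple check of whether a document stays within a restricted element inventory such as the one the "base format" defines.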

Other Related Resources: Not yet available.

Access: Open access / CC BY-NC

References (selected):
  • Geyken, Alexander et al. (2011): "Das Deutsche Textarchiv: Vom historischen Korpus zum aktiven Archiv"; in: Digitale Wissenschaft. Stand und Entwicklung digital vernetzter Forschung in Deutschland, 20./21. September 2010. Beiträge der Tagung. Hrsg. von Silke Schomburg, Claus Leggewie, Henning Lobin und Cornelius Puschmann. 2., ergänzte Fassung. hbz, 2011, S. 157–161. http://www.hbz-nrw.de/dokumentencenter/veroeffentlichungen/Tagung_Digitale_Wissenschaft.pdf#page=159
  • Geyken, Alexander et al. (2011): "TEI und Textkorpora: Fehlerklassifikation und Qualitätskontrolle vor, während und nach der Texterfassung im Deutschen Textarchiv"; in: Jahrbuch für Computerphilologie (forthcoming paper).
  • Jurish, Bryan (2010): "More than Words: Using Token Context to Improve Canonicalization of Historical German"; in: Journal for Language Technology and Computational Linguistics (JLCL), vol. 25/1, 2010.

Contact:

Susanne Haaf / Matthias Schulz / Christian Thomas
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22–23
10117 Berlin
Germany
Tel: +49 (0)30 20370 523
Email: dta@bbaw.de