
Monday, November 9
12:30–5 p.m. TEI Ontologies SIG Workshop (Turkish-American Friendship Room, Shapiro Library)
Tuesday, November 10
9 a.m.–4 p.m. TEI Ontologies SIG Workshop (Turkish-American Friendship Room, Shapiro Library)
Wednesday, November 11
9 a.m.–2 p.m. TEI Ontologies SIG Workshop (Turkish-American Friendship Room, Shapiro Library)
3–6 p.m. A TEI-based Publishing Workflow (Turkish-American Friendship Room, Shapiro Library)
Thursday, November 12
8:30 a.m. Registration opens (Gallery, Hatcher Graduate Library North)
9–9:30 a.m. Welcome (Gallery, Hatcher Graduate Library North)
Chair: Daniel O'Donnell
  1. Daniel O'Donnell (chair of the Board of Directors of the TEI Consortium)
  2. Kevin Hawkins (local organizer)
  3. John Wilkin (executive director of HathiTrust and associate university librarian for information technology, University of Michigan)
9:30–10:30 a.m. Virtual Research Environments in the Humanities: Challenges and New Developments with a Focus on Europe (Gallery, Hatcher Graduate Library North)
Speaker: Elmar Mittler
Chair: Fotis Jannidis

The development of virtual research environments is rather advanced in the sciences, but there are great opportunities to improve research facilities for the humanities as well. Direct access to relevant resources such as digitized text, primary data, services, and tools opens new frontiers of research. Some examples from Germany and Europe will show the collaborative establishment and discuss the chances and challenges of the new generation of research facilities.

10:30–11 a.m. Coffee break (Gallery, Hatcher Graduate Library North); demonstration of Espresso Book Machine (lobby of Shapiro Library)
11 a.m. – 12:30 p.m. New Tools and Perspectives for Manuscript Editing (Gallery, Hatcher Graduate Library North)
Chair: Franz Fischer
  • Do we need a Document Encoding Initiative?
    Elena Pierazzo

    In any branch of Manuscript Studies (Editing, Codicology, Palaeography, Art History, History) the first level of enquiry is always the document, the physical support that lies in front of the scholar’s eyes.

    To understand the text that is contained in the manuscript, a deep study of the manuscript itself is fundamental: the layout, the type of script, the type of writing support, the binding and many other aspects are able to tell us about when, where and why this particular text was composed. The text therefore represents the second level of enquiry, not the first. In the case of modern draft manuscripts scholars must give detailed consideration to the layout, the different stratifications of writing and the disposition of these in the physical space; all of these, together with the understanding of the text, are required to gain insight about the composition, time of revisions, and flow of the text. Furthermore, for some texts we know that the kind of physical support used to record them not only influences but also determines the text itself. For instance, the content and the length of letters are often determined by the size and quantity of paper available to the writer. For these reasons the working group in genetic editions will propose the document as its primary field of investigation and therefore of encoding.

    Despite all of this, the TEI’s approach forces scholars to consider the text first. Of the two main hierarchies (text and document), the TEI privileges the text (hence Text Encoding Initiative) and relegates topological description to empty elements (<pb/>, <lb/>, <cb/>) or attributes (<add place="">, <note place="">). The TEI does not say that documents are not relevant, but that they are less relevant than texts; to use a bibliographical metaphor, texts are ‘substantial’ while documents are ‘accidental’.
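    The asymmetry can be seen in a minimal illustrative fragment (not drawn from any particular project): the textual hierarchy receives container elements, while the document's topology survives only as empty milestones and placement attributes.

    ```xml
    <!-- Textual structure gets real containers; page, column and line
         topology is reduced to empty milestones and attributes. -->
    <p>The paragraph continues <pb n="42"/>across a page break,
      <lb/>across a line break,
      and carries <add place="margin">a marginal addition</add>.</p>
    ```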

    Many contributions (amongst others: Robinson 2009 and Pichler 1995) have discussed the concept of ‘What is a Text’, reaching the conclusion that a text is an abstraction which changes in time according to different scholars’ interpretations and theories. Documents change as well (leaves can be lost, ink can fade) but they are real, tangible objects and much more stable than a text, as they do not depend on a theory and they exist even if we do not read them.

    The paper will discuss the advisability of offering manuscript scholars the possibility of encoding documents as well as texts, presenting examples of medieval and modern manuscripts for which such an approach is desirable.

    References
    Robinson, P. M. (2009). ‘What text really is not, and why editors have to learn to swim’. Literary and Linguistic Computing, 24(1), pp. 41–52.
    Pichler, A. (1995). ‘Advantages of a Machine-Readable Version of Wittgenstein’s Nachlass’. In: Culture and Value: Philosophy and the Cultural Sciences, K. Johannessen and T. Nordenstam (eds.). Austrian Ludwig Wittgenstein Society, Wien, pp. 770–776.

  • Creating a document management system for the digital library of manuscripts: Manuscriptorium and M-Tool [slides]
    Jindrich Marek

    This paper deals with the document management system (DMS) implemented in Manuscriptorium, the digital library of manuscripts and early printed books operated by the National Library of the Czech Republic.

    Currently it is possible to describe manuscripts in TEI P5, a format which covers most of the needs of cataloguers. The TEI Guidelines contain a chapter devoted to manuscript description, and there are online digital libraries able to handle such descriptions. What is missing is an intelligent and ergonomic DMS to help with creating, uploading and managing descriptive metadata. This paper presents such a system, created for the Manuscriptorium digital library.

    Users who create descriptive metadata need, above all, a flexible but standardized form for creating a descriptive record, of the kind already implemented in systems for cataloguing printed books. Such a form is provided by M-Tool, an online instrument which allows users to save and/or upload a digital compound document to the online system. The tool can describe not only manuscripts but also incunabula and early printed books. It creates descriptive metadata, structural metadata (the page level of the digital document representation) and reference metadata (links to images of the document). Being a web application, it is available everywhere: the only technical requirement for the end user is a modern web browser. The interface is quite intuitive and assumes no prior knowledge of XML.

    This tool is complemented by an administration system which controls the digital compound documents. For the cataloguer, it is an interface for uploading and managing descriptive metadata; for the administrator, it is a management system for dealing with the uploaded data: authorizing it or proposing changes to be made by the user. This online system is likewise available everywhere, which keeps administration simple.

    Creating data for this system raises the problem of converting records from other formats and systems. We see three possible ways of integrating external data: 1) harvesting via the OAI protocol; 2) an ad hoc connector; and 3) creating new descriptive records with M-Tool.

    The handling of metadata is tied to the internal organization of the digital library. The basic component of the Manuscriptorium system is the database of evidence records. Every evidence record (written in TEI P5) has three substantial parts: descriptive metadata, technical metadata and structural metadata. Evidence records are dynamically linked to other data or metadata, such as the full text of manuscripts or page images, so every evidence record created by the online tool is a cornerstone in the system, linking to all other data pertinent to the given manuscript.
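    One plausible TEI P5 shape for such an evidence record is sketched below; this is an assumption for illustration, not Manuscriptorium's actual schema, and the filenames are invented. Descriptive metadata sits in <msDesc>, technical metadata in <encodingDesc>, and structural and reference metadata in <facsimile>.

    ```xml
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt><title>Evidence record (sketch)</title></titleStmt>
          <publicationStmt><p>National Library of the Czech Republic</p></publicationStmt>
          <sourceDesc>
            <msDesc><!-- descriptive metadata: identifier, contents, physical description --></msDesc>
          </sourceDesc>
        </fileDesc>
        <encodingDesc><!-- technical metadata --></encodingDesc>
      </teiHeader>
      <facsimile>
        <!-- structural metadata (page level), with reference links to images -->
        <surface n="1r"><graphic url="scans/f001r.jpg"/></surface>
        <surface n="1v"><graphic url="scans/f001v.jpg"/></surface>
      </facsimile>
    </TEI>
    ```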

  • Performing Dada Poetry in N-dimensional space with an electronic genetic edition of poetry by the Baroness Elsa von Freytag-Loringhoven
    Tanya Clement

    Described by The Little Review editor Margaret Anderson as “perhaps the only figure of our generation who deserves the epithet extraordinary,” the Dadaist poet Baroness Elsa von Freytag-Loringhoven published, between 1918 and 1929, approximately forty of her poems in little magazines such as The Little Review, transatlantic review, and transition, and the single issue of New York Dada. Yet, since 1929 the Baroness’s work has rarely appeared in print and until the 1980s, scholarship on her work was not published at all. What ultimately denudes her poetry of its meaning-making possibilities is a failure to interrogate the textual conditions in which her poetry was read and appreciated. Within The Little Review culture of the late 1910s and early 1920s, this textual condition included the collaborative audience so important to a Dadaist performance.

    This talk introduces a theory of text—textual performance—that encourages new kinds of access to and thus new readings of the Baroness’s poetry based on the incorporation of an audience. Simply reproducing a textual performance is not the means by which we can access the textual event of a Baroness poem; rather, a digital edition created within a social online network such as MySpace comprises the collaborative audience needed to bring the texts into play. Jerome McGann asserts that all features of the text, whether perceptual, semantic, syntactic, or rhetorical, signify meaning, not each by itself but in accordance or discordance with the others. The interplay is “recursive” and appears “topological rather than hierarchic,” such that he likens the poetic field of meaning to “a mobile with a shifting set of poles and hinge points carrying a variety of objects.” This field, occupied by the moving mobile, may be conceived as the fourth dimension in which three-dimensional shapes move—a dimension that represents time and space. In other words, the electronic environment constitutes an n-dimensional space in which the Dada performance—the gallery and the cabaret—of a textual event may be played.

    An online social network site provides an example of the ways in which text-based embodiment and real-time performance are already engaged in n-dimensional space. MySpace, for example, which includes a site for the Baroness with over 700 friends, enacts an element of authenticity that is the result of an interactive and collaborative audience. This is an element that is also essential to Dada culture. To make text available in this environment, I have created In Transition: Selected poems by the Baroness Elsa von Freytag-Loringhoven, a genetic edition which includes multiple, annotated versions of twelve poems encoded in P5 in parallel segmentation and represented in an open-source JavaScript environment called the Versioning Machine. The next step to providing access to a textual performance of this work is incorporating this edition into the n-dimensional space provided online by environments like MySpace, Facebook, and Omeka. This talk will discuss these options in terms of the textual experience provided by In Transition, a textual performance of the Baroness’s poetry that has never before been staged in quite this way.
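    Parallel segmentation, the apparatus format the Versioning Machine consumes, interleaves the witnesses' readings at each point of variation. A schematic example follows; the sigla and readings are invented for illustration, not taken from In Transition.

    ```xml
    <!-- Two hypothetical witnesses of a single verse line -->
    <l>
      <app>
        <rdg wit="#magazine">the reading as printed in the magazine</rdg>
        <rdg wit="#draft">the reading as it stands in the draft</rdg>
      </app>
    </l>
    ```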


    1 Anderson, Margaret C. My Thirty Years’ War; the Autobiography: Beginnings and Battles to 1930. New York: Horizon Press, 1970. 177.

    2 McGann, Jerome. Radiant Textuality: Literature After the World Wide Web. New York: Palgrave, 2001. 297.

    3 Please see the Baroness’s page at

    4 At the writing of this CFP, the site at is password protected. It will be publicly available by May 2009.

Between Mass Digitization and Editing (Gallery Instruction Lab, Hatcher Graduate Library North)
Chair: Paul F. Schaffner
  • Between the folds. A hybrid model for online publishing projects integrating philology, mass document repositories, and automatic text analysis [slides]
    Thomas Crombez

    Can "the marriage of heaven and hell" -- namely, bridging the gap between TEI-based editing projects and mass digitization projects -- be accomplished? The two kinds of projects are often presented as opposed approaches to digitization: one based on careful and rigorous editorship, the other on bringing together very large quantities of swiftly gathered 'junk text'.

    In this paper, I would like to propose a hybrid model for online publishing projects that integrates the advantages of both. I will therefore introduce a third component (automatic text analysis) and demonstrate two rudimentary examples.

    The model is constituted of the following components:

    1- CORE

    A carefully curated but by necessity small collection of reliable texts, which will probably be strongly interrelated (e.g., same author, genre, and/or period). The documents feature extensive metadata and are edited according to an internationally accepted standard (e.g., TEI XML).

    2- CLOUD

    A number of large and publicly available repositories of data, which may preferably be consulted by means of an API (e.g., Google Book Search, WordHoard, WordNet, Wikipedia, JSTOR).


    3- ANALYSIS

    A number of web services that enable researchers to visualize and annotate the contents of the Core, while providing added meta-information that is not already manually tagged in the documents but generated (1) through automatic text analysis and (2) by linking the Core documents to the existing web repositories.

    The crux of the model is to analyze and re-format the mass of data contained in the 'Cloud', and to stream it to a specialized audience of researchers studying a particular set of texts. The third component is therefore undoubtedly the most important. By extending the metadata manually added during the editing process with automatically generated metadata, a document profile can be generated for each text in the Core corpus. These document profiles may be compared internally -- e.g., to search for related texts in the Core -- or sent to external repositories through automated queries.

    I would like to illustrate this model by means of two of my recent publishing projects (that are both still in "beta version"):

    1) Corpus Toneelkritiek Interbellum (Flemish Theatre Reviews 1919–1939) is a corpus of short documents on Flemish theatre (mostly reviews and essays) from the inter-war period. All 350 documents were extensively marked up according to the TEI standards, indicating in each text where the name of a director, dramatist, or poet is used, or where a title occurs. This manually added metadata is collected in document profiles which are compared to determine internal links.

    2) Pieter T'Jonck Theatre and Dance Reviews is a research website designed for SARMA (a Belgian platform for dance criticism) that presents the reviews of the critic Pieter T'Jonck (from 1985 up to the present). Texts are fully searchable, but were also analyzed by an automatic parser (MBSP, 'Memory-Based Shallow Parser') to extract all named entities. Documents are presented to researchers online with manually added metadata (date, publication, the performance under review) and automatically generated metadata (a list of names mentioned). These metadata may serve as the basis of new queries, which are also forwarded to Google Book Search in order to retrieve and display relevant quotations from books.

  • TEI stand-off architecture of the National Corpus of Polish
    Piotr Banski and Adam Przepiórkowski

    The National Corpus of Polish (NCP, “NKJP” in Polish) is a shared initiative of four institutions, each of which has created its own corpus in the past. The NCP will be a reference corpus of the Polish language containing hundreds of millions of words (the intended size is one billion). The corpus will be searchable by means of advanced tools that analyse Polish inflection and sentence structure (a 430-million-token demo is already available online). The project, financed by the Polish Ministry of Science and Higher Education, ends in 2010.

    The XML architecture of the NCP follows the stand-off recommendations of the TEI Guidelines, testing them in the process. The general structure of the corpus involves separate layers of (largely) TEI-conformant documents – for some annotation layers, we have consciously created extensions of the TEI schema, with the intention of submitting them as proposals for extending the Guidelines. The layers of annotation are as follows:

    • source text – with minimal markup inside the <body> element: anonymous blocks (<ab>) are created at line breaks; this level is directly referenced by the following two:
    • text structure – markup from the level of the paragraph upwards
    • word-segmentation – paragraphs (a subset of those identified at the text structure layer), sentences and segments; this layer is referenced by the following two:
    • word-sense annotation – identifying word senses; this layer also references a (TEI-conformant) dictionary that exists as a separate part of the corpus
    • morphosyntactic annotation – a layer composed mostly of feature structures, listing the possible morphological interpretations for segments, and identifying the ones that are correct in the given morphosyntactic context; this layer is referenced by:
    • syntactic words layer – identifying syntactic words, which are sometimes sequences of segments identified at the segmentation layer and described at the morphosyntactic layer (this happens mostly in the case of some compounds, some host+clitic combinations, and multi-word-expressions); this layer is referenced by two others:
    • syntactic groups layer – describing the shallow syntactic structure of the analysed sequences
    • named entities layer – identifying named entities
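    In a stand-off arrangement of this kind, each higher layer points back into the layer below rather than repeating its content. The sketch below shows one way a segmentation layer can address spans of the source text; the filenames, identifiers, and the string-range pointer notation are illustrative, not the NCP's actual markup.

    ```xml
    <!-- Assume the source file text.xml contains:
         <ab xml:id="ab1">Ala ma kota</ab>
         The segmentation layer then identifies segments by offset: -->
    <s xml:id="s1">
      <seg xml:id="seg1" corresp="text.xml#string-range(ab1,0,3)"/> <!-- Ala -->
      <seg xml:id="seg2" corresp="text.xml#string-range(ab1,4,2)"/> <!-- ma -->
      <seg xml:id="seg3" corresp="text.xml#string-range(ab1,7,4)"/> <!-- kota -->
    </s>
    ```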

    Due to its size, the NCP cannot be represented as a single XML <teiCorpus> document that XIncludes the particular texts. Instead, we adopted a solution featuring a near-Copernican reversal: all corpus files include the main corpus header, and all TEI documents in a single annotation cluster (the source text plus all the annotation files) include the same local header, describing the text and all the modifications in its descriptions. (Thus, each corpus file includes two headers: the main one, shared by the entire corpus, and the local one for the given annotation cluster.)
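    The double-header arrangement can be sketched as follows (filenames invented for illustration): every file of an annotation cluster XIncludes both the corpus-wide header and the cluster-local header.

    ```xml
    <teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
               xmlns:xi="http://www.w3.org/2001/XInclude">
      <!-- the main header, shared by every file in the corpus -->
      <xi:include href="../corpus_header.xml"/>
      <TEI>
        <!-- the local header, shared by every file in this annotation cluster -->
        <xi:include href="local_header.xml"/>
        <text>
          <body><!-- this layer's annotation --></body>
        </text>
      </TEI>
    </teiCorpus>
    ```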

    The current version of the TEI Guidelines (P5) advocates its own stand-off approach to corpus annotation, different in nature from the original stand-off idea of the Corpus Encoding Standard (CES) and later the XCES. To our knowledge, this particular stand-off approach has not yet been tested in a multi-million-token corpus, and the NCP may be, as far as we are aware, the first very large corpus to employ it. Implementing the TEI stand-off architecture has raised numerous problems of varied nature, some of which have already been signalled on the TEI-L mailing list. These problems concern abstract issues of data representation (e.g., the choice of the XInclude mechanism over XLink) and also more technical issues, such as the absence of the TEI syntax for the structural expression of alternation between competing sequences of segments or the lack of a standardized way of expressing the boundedness of segments identified in the raw text.

    Yet another issue is that of the granularity of the lowest layer of stand-off architecture — of when it is advisable, and of whether it is at all possible, to keep the source text in the raw form, especially in the context of the stand-off methods advocated by TEI P5. In other words, this is a question about the exact point at which corpus annotation should enter the stage and ‘touch’ the source text. Finally, there are also issues concerning the surprising lack of XML tools that could actually bring all the advantages of stand-off approaches to the average user.

    The presentation will demonstrate and discuss the solutions that we have adopted for the problems encountered in the actual practice of annotating thousands of words of a linguistic corpus. We hope that our suggestions will add to the discussion on further refinements of the stand-off annotation approach as advocated by the TEI.


  • TEI-ms use for the cataloguing of digitized Arabic manuscripts (late-breaking submission)
    Mohammed Ou Rabah Soualah and Mohamed Hassoun

    Online access to digitized Arabic manuscripts is an objective of most libraries in the Arab world. Access to these works presupposes efficient research methods: users must be able to search by subject heading on the one hand and by content on the other. A simple and effective cataloguing method is therefore essential. To this end, the TEI P5 manuscript description module (TEI-ms) proves to be a reliable tool for resolving manuscript cataloguing problems. In this paper we present our work on adapting TEI-ms for the cataloguing of old Arabic manuscripts.

12:30–2 p.m. Lunch (on your own)
2–3:30 p.m. Encoding Genetic Editions (Gallery, Hatcher Graduate Library North)
Chair: Amanda Gailey

The TEI is successfully used by many projects encoding digital editions. But it has been recognized for some time that the guidelines on editorial matters focus on traditional editions and provide very little if any guidance for those working in the tradition of genetic criticism, that is, where the main purpose is to analyse the genetic process -- how a work of art developed -- rather than simply offering a survey of variation.

In order to fill this gap, a working group on genetic editions was established almost two years ago within the Special Interest Group on Manuscripts. In two face-to-face meetings the working group outlined the task and divided it into a set of distinct concepts. It was one of the premises of the work to describe all concepts involved in encoding genetic editions first and then, in a second step, to discuss how to map them to the TEI schema and whether it would be necessary to extend the existing schema and introduce special tags and attributes. A first schema was drafted after discussion was advanced enough to provide a list of necessary concepts.

During a two-day workshop these concepts and the schema expressing them were presented to an international group of scholars specialised in genetic editions, who were invited to present particularly difficult cases from their work and to comment on the ideas of the first draft. Immediately after this workshop the task force working on the concepts for genetic editions met to revise their draft and to map the results to the TEI schema. This work will be the basis of a first draft of detailed recommendations for the encoding of genetic editions, which we want to present at the TEI meeting in Michigan.

The task of this working group is to offer guidelines for genetic editions and to make sure that the TEI remains the obvious solution for encoding both editions and the manuscripts of modern authors. The panel will offer a short introduction outlining this task and its work plan, followed by three papers discussing different aspects:

  • The Encoding Model: Principles and Phenomena
    Elena Pierazzo

    The first paper will outline the concepts the working group defined as necessary to encode manuscripts for a genetic edition. It will show typical examples of genetic processes and discuss design decisions. The paper will show how the encoding model is prepared to meet the needs of editors using different scholarly approaches to genetic criticism, for example the French school of critique génétique or the German school of the Genetische Edition.

  • Extending the TEI?
    Lou Burnard and Elena Pierazzo

    The second paper will discuss the process of mapping the genetic concepts to the existing TEI schema. Most concepts can be expressed using TEI elements but some need a bit of tweaking; in particular, it is not always clear whether it is better to introduce new elements or to adapt existing ones.

    The paper also includes a discussion of existing TEI concepts for the encoding of textual alterations and the management of documentary features within a text. The paper will discuss the advantages and disadvantages of encoding genetic editions from a document rather than a text point of view, supporting the discussion with current editorial theory and practice.

  • Case studies
    Fotis Jannidis and Malte Rehbein

    Two case studies will be presented, documenting practical problems in applying the encoding model for genetic criticism in everyday work. They will offer a view of some authorial draft manuscripts and discuss possible solutions to encode them, concentrating on the typical and the problematic. The case studies will also present possibilities to access the texts in new ways using the genetic markup.

TEI Projects and Small Libraries: Examining TEI Markup Decisions and Procedures (roundtable discussion) (Gallery Instruction Lab, Hatcher Graduate Library North)
[workflow handout]
Richard Wisneski, Virginia Dressler, and Stephanie Pasadyn
Chair: Brett Barney

With the recent release of the TEI Consortium’s “Guidelines for Best Encoding Practices,” libraries now have clear levels from which to choose in implementing a digitization project. However, what happens when libraries do not fully conform to one particular level, but rather begin to blur the distinctions between two or more levels? For example, what is lost (or gained) when a library elects to borrow certain aspects of P5 Level 3 encoding while following a P5 Level 2 implementation? Furthermore, how do small libraries carry out large-scale TEI projects, where Level 3 or 4 encoding is desirable, in the face of small staff and limited resources? Are such projects even feasible?

This panel discussion will present one small library’s attempt at a TEI project following Level 2 encoding, along with its desire to incorporate some features defined by Level 3. Case Western Reserve University's Kelvin Smith Library has begun a project to digitize and text-encode its books on Cleveland, Ohio, history. The project, Books on Cleveland, Ohio and the Western Reserve Digital Text Collection, contains over 130 primary source texts on the history of Cleveland and its surrounding area, which date from the mid-nineteenth century to the early twentieth century and cover a wide array of subjects, including ethnic groups in and around Cleveland; Cleveland charity organizations; religious organizations in and near Cleveland; directories; Cleveland educational history; Western Reserve settlement; and historic homes and landmarks of Cleveland and Cuyahoga County. Currently, approximately half of the collection has been scanned and OCR'ed, and an initial workflow has begun, which includes structural text markup in conformance with P5 Level 2 encoding but with aspects of Level 3 encoding, particularly with regard to divisions, lists, and figures. The project is expected to last two years and then continue with contributions from neighboring institutions.

Panelists will give an overview of the project, including its background and rationale. The opening presentations will also cover the current workflow, TEI markup procedures, and implementation despite a relatively small staff and limited resources. Specifically, one presenter will detail the workflow developed by the head of Bibliographic and Metadata Services, with assistance from library administrators, to use student workers, cataloging staff, digital projects librarians, and library science practicum students to scan, OCR, convert to grayscale, and store page images; create TEI documents; and create METS, Dublin Core, and MODS records. Another presenter will detail digitization procedures in spite of limited staff and resources. A person new to text encoding will describe the process of learning TEI and what exactly is taught for text encoding.

Following opening presentations, discussion will revolve around decisions for in-depth encoding versus mass digitization, the ramifications of blurring distinctions between TEI encoding levels, and (re)sources available to assist small institutions in successfully carrying out such projects. We will discuss the Best Practices document’s sections on levels 2 and 3 encoding in particular, with attention to differences between and objectives of these levels. We will explore the implications of blurring distinctions between levels and the results for small libraries when these distinctions break down. Lastly, we will discuss the role project planning plays in encoding level decisions.

3:30–4 p.m. Coffee break (Gallery, Hatcher Graduate Library North)
4–6 p.m. TEI Consortium members' meeting (open to the public) (Gallery, Hatcher Graduate Library North)
6:15 p.m. Reception (Work • Ann Arbor, 306 South State Street)
Friday, November 13
9–10 a.m. Computational Work with Very Large Text Collections: Google Books, HathiTrust, the Open Content Alliance, and the Future of TEI [slides] (Gallery, Hatcher Graduate Library North)
Speaker: John M. Unsworth
Chair: Susan Schreibman

This talk will address the challenges, possibilities, implications, and possible unintended consequences of having very large text collections (on the order of millions of volumes) made available for computational work, in environments where the texts can be reprocessed into new representations, in order to be manipulated with analytical tools. Security and trust considerations, the roles of institutional partners, the impact on humanities (and other disciplines), and the opportunities for the TEI community will be touched upon, among other topics.

10–10:30 a.m. Coffee break (Gallery, Hatcher Graduate Library North); demonstration of Espresso Book Machine (lobby of Shapiro Library)
10:30–11 a.m. Micropapers (Gallery, Hatcher Graduate Library North)
Chair: Syd Bauman
  • DeReKo goes P5: Customizing TEI P5 for the Mannheim German Reference Corpus
    Andreas Witt, Marc Kupietz and Holger Keibel

    DeReKo (Deutsches Referenzkorpus), the Archive of General Reference Corpora of Contemporary Written German at the Institute for the German Language (IDS), is an important resource for the study of the German language. It currently comprises 3.6 billion words and has a growth rate of approximately 300 million words per year. The institute is a public-law institution that defines the ‘documentation of the German language in its current use’ as one of its main goals. It is therefore declared IDS policy to provide for the long-term sustainability of DeReKo.

    The main purpose of DeReKo is to serve as an empirical basis for the scientific study of contemporary written German. It is in general most useful whenever the focus of interest is on the language itself rather than on the information conveyed by it. DeReKo is not intended for use in research areas like computational linguistics, information retrieval or language technology in general. However, the IDS aims at the reusability and interoperability of all of its data, and has therefore been using an annotation model heavily influenced by the TEI Guidelines. Because the use of TEI P3, and especially the application of user-defined extensions, was regarded as cumbersome by some of its users, the IDS developed its own DTD based on the Corpus Encoding Standard (X)CES, which was in turn influenced by TEI P3. Of course, the annotation scheme was continuously extended; whenever possible, the element and attribute names in these extensions were taken from the TEI Guidelines. Nonetheless, over time the DTD diverged increasingly from the TEI annotation scheme. After the release of TEI P5 in 2008, the IDS decided to develop a new annotation scheme that is fully compliant with P5. The presentation will report on the current state of this effort.

  • The Chicago Foreign Language Press Survey in TEI
    Douglas Knox

    The Chicago Foreign Language Press Survey digitization project, beginning just now (May 2009) at the Newberry Library, will use TEI to encode approximately 120,000 sheets of typescript created in the 1930s by a federal Works Progress Administration (WPA) project that selected and translated into English more than 45,000 newspaper articles from 22 different language groups. Although the original articles were published between 1855 and 1938, the 1930s editors left a strong stamp on the collection as a whole, employing their own hierarchical subject scheme to select and organize the articles translated.

    The scale of the project and the nature of the resource call for a level of markup somewhere between the structural neglect of vast mass-digitization of images, on one hand, and intensive literary-critical editing, on the other. The first task is simply to capture sufficient information to construct the kind of database that the original WPA editors envisioned as paper cross-indexes that no longer exist, if they ever did. Even this level of markup requires editorial choices, because the limitations of their indexing controls led the original editors into inconsistencies and mistakes that need to be resolved to make the collection usable as an electronic resource. In the absence of preexisting indexes and inventories, even the initial round of production-focused basic structural markup at full scale has a certain exploratory character. An initial round of transcription markup will provide data for collection-level data analysis that can then inform editorial decisions requiring document-level markup adjustments for the sake of enhancing the usefulness of a digital edition. There will be much more to report by November.

  • Custom metadata elements for a TEI-based documentary edition
    David R. Sewell

    The University of Virginia Press is publishing an ongoing TEI-based digital edition of the papers of the first American presidents and other major political documents. Some of the collection is born-digital, but in most cases we are digitizing volumes of letterpress editions, e.g. the Papers of George Washington (55 vols. to date). This presents us with some markup problems, as we need to treat each digitized volume as a single TEI object (i.e., a document instance with a single TEI/text structure, rather than as a teiCorpus), while at the same time capturing metadata for each individual document that allows us to (1) perform optimal search and retrieval on documents, and (2) at some point in the future divide the individual documents into separate TEI objects, each with its own TEI header.

    At first, we supplied metadata within a <bibl> element placed at the start of each document. There were two problems with this approach: (1) it was tag abuse, as the metadata text was added by us rather than appearing in the transcribed text, and (2) we discovered that for data processing purposes, we needed to record metadata that had no corresponding TEI elements, or that had uniquely used element names (for reasons having to do with our back-end XML database software).

    Our solution was to use the ODD mechanism to define new custom metadata elements and attributes in our own namespace. This was a liberating step, as it freed us to create, in effect, our own metadata language that could be used within an FGEA:mapData element, without the problems or limitations of tei:bibl.

    Time permitting, I will briefly share the portion of our ODD customization that defines our metadata element.
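    The paper's actual ODD customization is not reproduced in the abstract, but a fragment along these lines might look roughly as follows. This is a sketch only: the namespace URI, content model, and class membership are invented, and only the element name echoes the FGEA:mapData mentioned above.

```xml
<!-- Hypothetical ODD fragment: declare a project-specific metadata
     element in its own namespace alongside the standard TEI modules -->
<schemaSpec ident="fgea" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <elementSpec ident="mapData" ns="http://example.org/ns/fgea" mode="add">
    <desc>Per-document metadata recorded at the start of each transcribed
      document, in place of a repurposed tei:bibl.</desc>
    <classes>
      <memberOf key="model.divTop"/>
    </classes>
    <content>
      <macroRef key="macro.specialPara"/>
    </content>
  </elementSpec>
</schemaSpec>
```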

11 a.m. – 12:30 p.m. Some Problems with Using TEI as Seen by Experienced Practitioners and Teachers (roundtable discussion) (Gallery, Hatcher Graduate Library North)
Chair: Julianne Nyhan
  • Evolving TEI standards and the burdens of digital project maintenance
    Andrew Jewell

    As the editor of a thematic research collection (The Willa Cather Archive), Jewell has relied considerably on TEI for the structure of the underlying data, building almost the entirety of the website on a TEI framework. However, given strapped resources, a decision to migrate from P4 to P5 is increasingly difficult to justify: if the current data model is proving sufficient, why alter it? Of course, some would respond with the advice not to alter it; if P4 works, don't move to P5. However, this raises a larger question: if one chooses to mark up text in TEI in order to gain a measure of stability for the data, then the constant evolution of the TEI Guidelines undermines one of the principles for implementing TEI in the first place. Granted, revisions must happen, but how are individual projects to judge when revision of a project's data is appropriate? When is the task of creating valid TEI too onerous or too marginal to the intellectual work of the project, particularly when the overwhelming majority of the project's audience is uninformed about or uninterested in TEI itself?

  • The role of TEI in large text-analysis projects
    Brian Pytlik Zillig

    Google is purportedly, and largely in secret, digitizing seven million books. The Open Content Alliance is, in a separate effort, scanning a million more. Aside from the controversial aspects of this work, these digitization projects raise serious questions about data interoperability. Neither effort is employing the TEI Guidelines as the underlying text model. It is probably safe to assume that the overwhelming majority of these texts will not exist in TEI within the next ten years, if ever. Moreover, Google and OCA each rely upon different file formats. Are we witnessing the creation of gated communities of texts, where you can’t easily get from one to the other? Of course, interoperability issues are to some extent a normal part of the digital humanities. TEI has rarely, if ever, been implemented across projects as a standard per se. Implementations of the TEI Guidelines vary in ways that all scholarly activities do. The MONK Project successfully converted 2,500 XML and SGML files from varying text collections into TEI-A, a subset of TEI intended for analysis. But it is one thing to get a few thousand texts to play together, and another thing to attempt it at the scale of Google and OCA. If we assume that massive numbers of digitized texts, most not in TEI, will someday co-exist in a common electronic environment, we will face this question: How can we begin to interoperate millions of texts?

  • TEI documentation and the need to be responsive and accessible to a varied user community
    Brett Barney

    Practitioners came (and continue to come) to TEI encoding for disparate reasons and from disparate backgrounds. One might argue, in fact, that this diversity has been a key asset to TEI's vitality. In any diverse community, though, forces exist that tend, if unattended, to marginalize and homogenize. I propose to focus on specific TEI community practices with that tendency, beginning with examples drawn from my own ongoing initiation. How much does one need to know to fully participate? How do we assure that those with sufficient aptitude and desire to meet that requirement have adequate access to the knowledge acquisition process? Does the history of TEI elders' (self) exile from the community give reason and/or fodder for modifications in the practices that maintain and renew the community? Is there cause for concern about the proportions of "digital" and "humanities" in TEI's storefront in the digital humanities?

  • TEI in the classroom, with emphasis on the need for markup that engages student interpretive interests
    Amanda Gailey

    Having taught about 10 undergraduate and graduate courses that heavily feature TEI, Gailey has noticed recurring difficulties for students who learn TEI as part of their literary training. A general lack of materials to help introduce TEI to newcomers makes training more decentralized and inconsistent than it needs to be. Also, students often find TEI to have evolved to address interests in the texts that seem fairly removed from their own interests as budding scholars. Specifically, they frequently feel compelled to lay aside an interpretive interest in the texts in order to describe them, though these need not be viewed as conflicting approaches in the classroom. They also are frequently drawn to aspects of texts that are not as fully supported by TEI as others, such as illustrations, physical bibliography, etc. How might we present text encoding as a valuable pedagogical activity? How can we integrate it into a literary education? What sorts of resources would help, and what aspects of TEI could be better developed for literary study?

12:30–2 p.m. Lunch (on your own)
2–3:30 p.m. Scholarly Editing (Gallery, Hatcher Graduate Library North)
Chair: Malte Rehbein
  • TEI Encoding in conjunction with MySQL, JavaScript, and HTML to create a visual variorum edition of Don Quixote [paper]
    Victor Enrique Agosto, Eduardo Urbina, Elena Tillman, Fernando Gonzalez Moreno and Richard Furuta

    Since 2003, the Cervantes Project has been developing a fully documented hypertextual archive to make accessible the textual iconography of the Quixote (Madrid 1605, 1615). With support from an NEH preservation and access grant, we have digitized, indexed, and annotated 25,000 illustrations and have developed search tools and finding aids to enable the identification of discrete images or sets of items based on a newly created taxonomy of the episodes and adventures of the text, which now includes 500 categories, as well as a controlled vocabulary of 400 keywords. The contents of the image database are dynamically updated, and the results are available immediately online through the collection index and a multilayered search engine. In the last year we have applied tags related to both the taxonomy categories and the keywords to two complete editions of the Quixote, one in Spanish and one in English, linking the illustrations to the texts and thus producing a virtual visual variorum edition suitable for both research and teaching purposes.

    In this paper we present the steps taken to TEI-encode both texts of Don Quixote and, in particular, the approach and solutions developed to fit the requirements of our archive regarding visual elements and dynamic links, utilizing the structural divisions, taxonomic categories, and terms mentioned above to navigate from images to texts and from texts to images. In particular, we describe the four-level division used to create the TEI ODD (One Document Does it All), which includes the Part and Chapter headers, the Chapter Image that is linked to all the images in the database for that particular chapter, and the Text/Image level, which corresponds to the terms found in the "Browse image archive by content" finding aid and makes it possible to filter results by individual chapters. Consequently, these levels facilitate the search of information as well as the maintenance and updating of the TEI tags.

    Several difficulties were encountered in the implementation of the project. First, the proper mapping between the taxonomy categories, the keywords, and the TEI tags had to be resolved to work effectively in English and Spanish. Second, we encountered conflicts parsing JavaScript code in the oXygen XML editor that prevented the link tags from opening a window showing only the images for each individual category or keyword. Lastly, we had to deal with the unwanted numbering of paragraphs due to the inflexibility of the TEI Cascading Style Sheet found in oXygen's TEI P5-to-HTML XSLT transformation. As a result of these limitations, we decided, for online presentation, to transform the TEI-encoded text of Don Quixote into an HTML format able to interact dynamically with the MySQL databases of the Quixote iconography archive at the Cervantes Project.

  • Wittgenstein's Nachlass in TEI P5 [slides]
    Tone Merete Bruvik, Alois Pichler and Vemund Olstad

    In 2000, twenty thousand pages of Wittgenstein's Nachlass were published by Oxford University Press on CD-ROM under the title "Wittgenstein's Nachlass - The Bergen Electronic Edition" (BEE). BEE was the result of 10 years of research and editorial work by the Wittgenstein Archives at the University of Bergen (WAB), one of its major achievements being a complete "machine-readable version" of the Wittgenstein Nachlass.

    Today this machine-readable version is being converted to XML-TEI (P5) markup. This paper will, first, describe the steps involved in the conversion process and focus on some of the issues and challenges encountered in it, and, second, present some of the results from conversion, with special emphasis on the HTML text edition outputs produced for the ongoing EU project "DISCOVERY - Digital Semantic Corpora for Virtual Research in Philosophy" (2006-2009), where WAB is one of the main content providers.

    Micro and macro encoding

    The final part of the paper will look into a specific challenge posed by TEI's policy of using different encodings on the micro and macro levels for what we consider to be in fact the same phenomena.


    In TEI, elements like app (apparatus) and del (deletion) are available at the phrase level, which can be looked upon as the micro text level. But in many cases these elements will also be needed at the macro level. As a simple case in point, consider two successive paragraphs that are deleted in one operation. The same problem occurs with apparatus entries, and, we believe, with all the elements in the model.pPart.transcriptional group: add, app, corr, damage, del, orig, reg, restore, sic, supplied and unclear.

    It is true that these macro-level deletions often cross hierarchical boundaries, and that might be the reason the TEI P5 Guidelines suggest that, e.g., <delSpan/> be used in these cases. But such crossing is not uncommon at the micro level as well.

    One would expect that encoding that works at the micro level (inside p elements, in this case) should also be allowed at the macro level (outside the p or div element), if the phenomenon is the same. Whether an entire paragraph is deleted or only a single sentence of a paragraph, it is in both cases the same phenomenon, isn't it? Therefore, one would think that it should also call for the same encoding.
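    The asymmetry at issue can be sketched with a schematic P5 fragment (content invented for illustration):

```xml
<!-- Micro level: <del> is permitted inside <p> -->
<p>This sentence stands, <del>but this clause was struck out</del>.</p>

<!-- Macro level: two whole paragraphs deleted in one operation cannot be
     wrapped in a single <del>; the Guidelines point instead to the empty
     milestone pair <delSpan/> ... <anchor/> -->
<delSpan spanTo="#delEnd1"/>
<p>First deleted paragraph.</p>
<p>Second deleted paragraph.</p>
<anchor xml:id="delEnd1"/>
```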


    1. Wittgenstein's Nachlass: The Bergen Electronic Edition, eds. Wittgenstein Archives at the University of Bergen. Oxford: OUP 2000.

    2. TEI Consortium, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. 1.3.0, February 1, 2009. TEI Consortium. (May 13, 2009)

    3. DISCOVERY - Digital Semantic Corpora for Virtual Research in Philosophy. (May 13, 2009)

  • An Analysis of Text Encoding Approaches: A Case Study [slides]
    (late-breaking submission)
    Aja Teehan and John G Keating


    On the 24th of August we posted a call to the Humanist Mailing List for a text encoding case study; respondents would be provided with five images from a guestbook and asked to encode them as examples of their own approaches to text encoding. Following the receipt of 14 expressions of interest, a full description of the case study was issued by email on the 14th of September. This paper discusses the motivation for and design of the study, and provides an analysis of the submissions received to date.


    We know of the many ways in which mark-up may be developed (see, for instance, the lively SIGs), some of the processes of development (Ide and Véronis: 1995), and the many features to which it can be applied (Burnard: 1999). There have, of course, been discussions of the suitability of mark-up for certain documents (Bradley: 2005, Lavagnino: 2006), and the implications of mark-up for textuality (McGann and Buzzetti: 2006), but despite all of this we know relatively little of how it is applied in practice. It is hoped that by providing reflection upon current practices within the community we can identify ways to build upon existing expertise and improve the digital resources we deliver.

    Given that the sources and uses for text encoding are so varied, it would be unhelpful to compare approaches across publicly accessible projects, even those that expose their XML encodings or methodology. The solution was to provide a single sample, and Use Case, for everyone to encode according to their own design and approach.

    Description of the Case Study

    The respondents were sent a description of the study along with five sample images, non-authoritative transcriptions, and image maps. Our source document was a guestbook from the Castlehyde Estate House. This estate house is historically significant, as the lands once belonged to the family of Douglas Hyde, the first president of Ireland.


    All responses received thus far have included a valid XML encoding, a schema, and a report outlining the approach taken and the rationale behind the techniques employed. All respondents have given permission to disseminate the results and to publish all the received documents, along with the original source, in an on-line corpus for open-access research.

    Results Analysis

    Once the final deadline has passed (Oct 27th), we will analyse the techniques used by the participants to encode problematic features and the rationale behind those techniques. For instance, the document is tabular in nature, but guests have often ignored the physical layout of the page: how will the participants have dealt with the tension between the semantic elements (names, addresses, dates) and the physical elements (cells, rows), especially when one contradicts the other, such as when guests write their name under the “address” column? Having reviewed the participants’ reports, we can confirm that this did indeed provide pause for thought and was admirably dealt with in the encodings. To aid us in our analysis, the participants will fill out a reflective questionnaire.
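    One way the name-under-“address” tension might surface in an encoding (an invented fragment, not drawn from any submission) is to let the table cell record the physical position while an inner element records the semantic category:

```xml
<row>
  <cell role="name"/>
  <cell role="address">
    <!-- the guest wrote a name where an address belongs: the cell keeps
         the physical column, <persName> keeps the semantic reading -->
    <persName>J. Murphy</persName>
  </cell>
  <cell role="date"><date when="1912-06-04">4 June 1912</date></cell>
</row>
```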


    Conclusions will be drawn based upon analysis of the documents and questionnaire provided to us.


    Burnard, Lou, Is Humanities Computing an Academic Discipline? or, Why Humanities Computing Matters. (1999) Accessed online 1st Sept 2009:

    Bradley, John, Documents and Data: modelling materials for Humanities research in XML and relational databases. TEXT Technology (2005); LLC (2007). Accessed online 1st Sept 2009:

    Ide, Nancy and Véronis, Jean (eds), Text Encoding Initiative: Background and Contexts. Dordrecht: Kluwer Academic Publishers (1995).

    Lavagnino, John, When Not to Use TEI. In: Lou Burnard, Katherine O’Brien O’Keeffe, John Unsworth (eds): Electronic Textual Editing. New York: Modern Language Association of America, p. 334–338 (2006).

    McGann, Jerome and Buzzetti, Dino, Critical Editing in a Digital Horizon. In: Lou Burnard, Katherine O’Brien O’Keeffe, John Unsworth (eds): Electronic Textual Editing. New York: Modern Language Association of America, p. 53–73 (2006).

    Please note that the email sent to the respondents describing the Case Study is provided separately under “supporting documents”. We have also supplied a single image used in the case study along with its non-authoritative transcription and image map.

Methods and Techniques (Gallery Instruction Lab, Hatcher Graduate Library North)
Chair: Andreas Witt
  • Declaratively Creating and Processing XML/TEI Data
    Christian Schneiker and Dietmar Seipel

    Declarative programming is one of the main concerns of XML and of XML query and transformation languages such as XQuery and XSLT, and it should be combined with object-oriented programming techniques, which have recently attracted a lot of attention. Especially in the field of natural language processing (NLP), developing less ambiguous, more compact, and faster applications should be one of the main goals, such that the code becomes more understandable than the overstuffed source code produced by languages like C++ and Java (Sperberg-McQueen, 2004; Wadler, 2000). For parsing and annotating different kinds of texts, context-free grammars are much more flexible and reliable than the commonly used regular expressions found in many NLP applications. Furthermore, with extra arguments given to the functor, context-sensitive grammars can also be expressed (Warren, 1999), and probabilistic extensions are possible, too.

    To meet these NLP requirements, we have implemented a declarative toolkit for processing natural language, in which every single step consists of solitary modules that can be assembled in the workflow process. Our underlying XML query and transformation language FnQuery (Seipel, 2002) allows the user to access XML documents in Prolog in a way similar to XPath and XSLT. In combination with extended definite clause grammars (EDCGs) (Schneiker et al., 2009), an extension of the well-known DCGs of common Prolog systems, we can process different types of text and annotate them according to TEI.

    Programming with our XML toolkit is also greatly simplified by the possibility of converting XML metadata (like schemas generated for TEI) into a set of EDCG rules. This technology gives the user the ability to generate flexible and easy-to-read context-free grammars for parsing written language, while the generated output is already annotated according to the defined XML schema. For this purpose, we make use of the fact that an XML schema defines, in a declarative way, a context-free grammar representing the structure of an XML document. The resulting EDCGs can be used for generating parse trees.

    We exemplify the robustness and feasibility of our concept by converting double-keyed raw data of a 19th-century dictionary into a valid TEI encoding for dictionaries.
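    The schema-as-grammar idea can be conveyed with a deliberately tiny sketch. The authors work with EDCGs in Prolog; the following Python stand-in (all names and the entry format are invented) only illustrates how a grammar rule for a dictionary entry can emit the corresponding TEI-style markup as it parses:

```python
import re

# Toy grammar for one raw dictionary line: entry -> headword pos gloss.
# The rule mirrors the idea that a schema (an entry contains a form, a
# gramGrp, and a def) doubles as a context-free grammar for the raw text.
ENTRY_RE = re.compile(r"^(?P<orth>\w+)\s+(?P<pos>[a-z]+\.)\s+(?P<gloss>.+)$")

def parse_entry(line: str) -> str:
    """Parse one raw dictionary line and emit a TEI-style <entry> element."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"line does not match entry grammar: {line!r}")
    return (
        "<entry>"
        f"<form><orth>{m['orth']}</orth></form>"
        f"<gramGrp><pos>{m['pos']}</pos></gramGrp>"
        f"<def>{m['gloss']}</def>"
        "</entry>"
    )

print(parse_entry("Haus n. house, building"))
```

The real EDCG machinery is far more general (it handles recursion and context sensitivity, which a single regular expression cannot), but the correspondence between grammar rule and output markup is the same.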

  • No escape from XLink: the requirements of multi-level stand-off annotation
    Piotr Banski

    Stand-off annotation, whereby the resource being annotated is kept in a relatively raw form, virtually 'surrounded' by layers of information of various sorts (stored and possibly distributed separately), is by now a familiar technique of XML data representation that, in the context of the TEI, dates back to the days of the Corpus Encoding Standard (CES; Ide and Véronis, 1993), an SGML application of the then-current TEI standard. Stand-off annotation is a necessity in the case of direct description of binary data, and an interesting alternative to inline annotation of text resources, with numerous advantages and one prominent disadvantage: the lack of a standard for merging the annotations with the primary data. Earlier, the XLink mechanism was recommended for the purpose of linking individual layers of annotation (XCES; Ide et al., 2000). Currently, in TEI P5, the XInclude mechanism is proposed for this purpose.

    The goal of the presentation is to show that while XInclude is an intriguing method that (nearly) works, there is no escape from XLink when doing multiple-level stand-off annotation of text resources. The reasons are many. The most trivial concern the lack of tools implementing TEI XPointer schemes; this is offset by the relative lack of free tools supporting XLink, although only the basic simple-link XLink mechanism is needed here. Others concern the metaphors behind each of the competing systems: while the metaphor behind XInclude is primarily “merger”, XLink has a more trivial role: pointing. And in some cases, pointing is as little and as much as is exactly needed to make elements of the particular annotation layers correspond, without any further consequences concerning the infoset merger that is the necessary entailment of XInclude. Finally, while XInclude has the intriguing feature of pointing and merging at (almost) the same time, once includes are resolved, some information on the relative positioning of the X-included pieces of a lower-level resource (most importantly the source text) is lost, and the operations that have to be performed in order to recover it defeat the putative advantages of using XInclude in the first place.

    While XInclude is an elegant mechanism that works for simple resource-merging purposes (such as including the <teiHeader> element in its proper place, right before validation and transformations) and has some chance of working in the future for the purpose of addressing spans of text, its use is actually limited to (some) single-level stand-off systems and systems where the primary resource has already been heavily annotated (which in itself is a departure from the stand-off ideal). In all other cases, we need XLink, as a more primitive, and thus more flexible, pointing mechanism with no consequences concerning infoset merging and schema-merging that must follow it.

    The presentation is based on the actual practice of implementing a TEI P5 stand-off architecture for a very large (target size: 10⁹ tokens) linguistic corpus, and is illustrated by fragments of its annotation layers, for which both the XInclude and XLink mechanisms are used.
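    The contrast between the two mechanisms can be made concrete with two schematic fragments (file names and pointer values invented for illustration):

```xml
<!-- XInclude: the shared header is merged into this document's infoset -->
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="corpusHeader.xml"/>
  <text><!-- ... --></text>
</TEI>

<!-- XLink: a segment in a stand-off annotation layer merely points at a
     span of the primary data; no infoset merger is implied -->
<seg xmlns:xlink="http://www.w3.org/1999/xlink"
     xlink:type="simple"
     xlink:href="text.xml#xpointer(string-range(//body,'',120,8))"/>
```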


    • XInclude:
    • XLink:
    • Ide, Nancy and Jean Véronis. (1993). Background and context for the development of a Corpus Encoding Standard, EAGLES Working Paper, 30p. Available at
    • Ide, Nancy, Patrice Bonhomme, and Laurent Romary. (2000). XCES: An XML-based Standard for Linguistic Corpora. Proceedings of the Second Language Resources and Evaluation Conference (LREC), Athens, Greece, 825-30.
  • Teaching “craft” encoding
    Syd Bauman and Julia Flanders

    From January 2007 through April of 2009, the Brown University Women Writers Project sponsored a series of eleven seminars on scholarly text encoding, with funding from the National Endowment for the Humanities. The goals of these seminars were "to provide humanities faculty and students with an opportunity to examine the significance of text encoding as a scholarly practice, through a combination of discussion and practical experimentation...[and] to provide supporting resources for humanities researchers who want to experiment with text encoding on their own, or would like to start or become involved with a digital research project." Our assumption was that there was an emerging need and an audience for TEI training that focused on small-scale projects reflecting the interests of individual scholars.

    Training of this sort would need to focus on the role of markup as an expression of scholarly intention and interpretation, often quite idiosyncratic, rather than on more generic and scalable forms of encoding. It would also need to position the humanities scholar as a creator of digital resources rather than purely as a consumer, although if successful such training would enable scholars to be much more critical and astute consumers as well. To support these seminars we developed a suite of materials whose shape was determined by several goals: to provide participants with permanent web access to the materials used in the seminar, while permitting us to make changes to these materials over time; to create and maintain these materials in TEI using open-source tools; and to support significant hands-on practice in the seminars, using participants' own sample materials.

    The materials we developed include:

    1. A custom TEI schema and accompanying stylesheets for authoring and publishing slides and lectures.

    2. A master set of nearly 40 presentations on a wide variety of topics, which can be combined as needed to cover any range of audience interests and expertise levels.

    3. A custom TEI schema of 311 elements, designed for teaching, which emphasizes the TEI features that are most relevant to these audiences: critical editing, manuscripts, oral histories, personography and interpretive markup, as well as coverage of the usual humanities genres.

    4. A CSS stylesheet template containing selectors (but no styling) for all of the elements in our teaching schema; participants can adapt this by adding style information of their own.

    5. A set of handouts including a short element crib sheet (providing a quick reference to the most common elements), a crib sheet of <oXygen/> commands, a CSS crib sheet, a brief guide to TEI customization with Roma, a document analysis guide, and a brief set of sample documents for participants who did not bring their own.
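    The template in item 4 can be pictured with a few lines; the element names below are common TEI elements, and the empty declaration blocks are what participants fill in with their own styling:

```css
/* Selector-only template: one empty rule per teaching-schema element;
   participants supply their own style declarations */
p         { }
lg        { }
l         { }
del       { }
add       { }
persName  { }
placeName { }
```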

    This paper will describe these materials and our strategies for developing and using them in more detail, and will also discuss the particular challenges and advantages of this mode of teaching. In particular we will consider the role of theoretical discussion and its relation to encoding practice; the interplay between markup and display in a teaching context; and the overall impact of these seminars on the scholarly use of markup.

3:30–4 p.m. Coffee break (Gallery, Hatcher Graduate Library North)
4–5:30 p.m. Poster slam, posters, and tool demos (simultaneous reception) (Gallery, Hatcher Graduate Library North)
Chair: Syd Bauman
  • TEI Chinese Localization Project
    Jen Jou Hung

    The TEI Consortium has recently made efforts to internationalize the standard by providing an infrastructure for partners across the globe to localize the standard in their own languages. Since 2005, Dharma Drum Buddhist College, the Taiwan e-Learning and Digital Archives Program (TELDAP), and the TEI Consortium have worked together to localize the TEI standard. During these four years, the TEI Chinese Localization Project has produced many concrete results, including:

    • the translation of element and attribute definitions into fluent Chinese. All of these translations are available on the TEI website and will be maintained there by the TEI Consortium.
    • the localization or translation of examples for each element.
    • the Chinese translation of the ROMA interface.
    • the Chinese translation of TEI-Lite. Furthermore, we have augmented TEI-Lite with copious annotations, especially where a proper understanding of the examples requires familiarity with European literature.
    • a selected translation from the TEI P5 Guidelines (Chapter 2).
    • the organization of a number of workshops in Taiwan introducing the TEI, led by Julia Flanders and Syd Bauman, Sebastian Rahtz, and Marcus Bingenheimer.
  • NINES (Networked Infrastructure for Nineteenth Century Electronic Scholarship)
    Laura C Mandell and Dana Wheeles

    NINES (Networked Infrastructure for Nineteenth-century Electronic Scholarship) is a scholarly organization devoted to forging links between the material archive of the nineteenth century and the digital research environment of the twenty-first. Our activities are driven by three primary goals: to serve as a peer-reviewing body for digital work in the long 19th-century (British and American), to support scholars’ priorities and best practices in the creation of digital research materials, and to develop software tools for new and traditional forms of research and critical analysis. At the center of these efforts is our online Collex interface, which gathers the best scholarly resources in the field in one place to make them fully searchable and interoperable. It also provides an authoring space in which researchers can create and publish their own work.

    In an effort to connect with scholars engaged in digital scholarship and textual markup, Assistant Director of NINES Laura Mandell (Miami University, Ohio) and Project Manager Dana Wheeles (University of Virginia) would like to present NINES in a poster session at the 2009 TEI Conference. This poster session would include a demonstration of the NINES site and an explanation of the peer-review principles behind its organization. We would also be happy to answer questions about other tools under development in NINES, including Juxta (textual collation software that works with XML-encoded texts) and Ivanhoe (a gamespace for textual inquiry), as well as about the goals of NINES’ sister project, 18thConnect. In doing so, we hope to show scholars the possibilities for sharing and sustaining the resources they construct using TEI markup.

  • The TEI-EJ: A monthly publication of the TEI Education SIG to appear on the TEI website [poster]
    Stephanie Ann Schlitz and Julianne Nyhan

    We view this poster presentation as an opportunity to disseminate information about our (TEI Board and TEI Council approved) proposal to develop and publish a web-based monthly TEI journal, the TEI-EJ. The journal will provide a community-driven forum and offer members of the TEI, whether novice or expert, as well as the broader digital humanities community, new educational insights into the TEI. This will be achieved by publishing articles which address a range of TEI-related issues, including novel implementations; interviews with members; mashups with other markup languages and technologies; favourite aspects; tools development; reflective essays; problems encountered and resolved by users; questions and challenges for the community; and additional topics proposed by prospective authors.

    In addition to this pedagogical function, the monthly series will also aim to:

    • Attract new users to the TEI community by providing them with a novel, example- and evidence-driven way of learning about the TEI that is nevertheless quality-assured.
    • Promote the TEI among existing users by offering an innovative yet educative series of discussions designed to explore and explain aspects of the TEI.
    • Contribute to the longer-term foresight policy of the TEI Council by offering an additional platform for examining usage trends or problems.

    The series will be peer reviewed. In order to cater to more than one learning style, the series will include traditional text articles as well as dynamic content such as video, audio, and other multimedia formats. We expect to finalize the editorial board by late summer 2009 and are in the preliminary stages of developing the publication platform. We will share our progress in these areas via the poster.

    We are also in the process of identifying and inviting members of the international digital humanities community, both senior and junior and drawn from the widest possible range of research and teaching areas, to reflect on their experiences with the TEI and to contribute to TEI-EJ. We expect published contributions to be composed in an instructive style that accommodates both TEI experts and non-experts, and authors will be asked to contextualise their reflections with one or more examples and to explain the significance of their example(s) to their research, the broader digital humanities community, and/or to the TEI community.

    Finally, the poster session will enable us to collaborate with and to seek feedback from TEI members as we take the next step toward publishing our inaugural issue early in 2010.

  • Teaching with TEI [poster]
    Kathryn Tomasek

    Since the fall semester of 2004, students in United States Women’s History courses at Wheaton College in Norton, Massachusetts, have been transcribing and coding nineteenth-century documents from the Wheaton College Archives and Special Collections for digital publication. Because the college was founded as an institution for the education of women in 1834, these records provide an ideal opportunity for students to explore primary sources that reflect women’s experiences and changing ideas about gender since the second quarter of the nineteenth century. The sources that students have used in course work so far have ranged from women’s journals and diaries to minutes of meetings of the Board of Trustees of Wheaton Female Seminary and essays from the institution’s literary magazine Rushlight. For several summers, students have also worked transcribing and editing the journals, diaries, and account books of the institution’s founder, Eliza Baylies Chapin Wheaton, and the journal and ephemera associated with her trip to Europe in spring and summer 1862. And in spring 2009, students in the methods course for History majors transcribed and coded pages from the day book kept by Laban Morey Wheaton, the husband of Eliza Baylies Chapin Wheaton. Between 1828 and 1859, he recorded transactions associated with his business interests in the town of Norton, including agricultural pursuits, land rentals, tax collections, legal services, and the operation of a general store. Currently, a student researcher is transcribing and coding Laban Morey Wheaton’s cash books, and students in the methods course will correlate transactions from the cash books with their own transcriptions from subsequent pages in the day book. 
Combined with the day book, the cash books offer a view not only of the business practices behind the wealth that supported Wheaton Female Seminary through its first thirty years of operation but also of economic relations in the town of Norton during the second quarter of the nineteenth century.

    Transcription and coding with the TEI have the potential to transform our teaching and our scholarship across disciplines. For teaching, collaborative contributions to the digital archive offer students opportunities to work with original primary source documents and thus to begin to understand mediations that are intrinsic to the archival, research, and editorial processes. As students transcribe, proofread, and code documents for analysis, they learn about the nature of historical sources and what sources can and cannot tell us as historians, and such lessons evoke theoretical questions about the nature and purposes of archives. Such projects give students opportunities to “do history,” as recommended in teaching standards supported by the American Historical Association.

    Such teaching projects also give faculty members opportunities to work with previously unexamined or underutilized historical documents. As we use digital technologies to teach our students and to publish original documents from the founding era of the educational institution that became Wheaton College, we contribute to scholarship in U.S. History by vastly increasing access to documents that have previously been available only to those who could be in the same physical space with either the original document or its edited version in book form. As we create this digital archive, we also shape its purposes, developing in the process the archive we need for the twenty-first century. Not least, this new archive offers global access where digital technologies are available, increasing opportunities for transnational research and collaborations.

    This poster will present pedagogical uses as one example of how the TEI provides value distinct from that offered by mass digitization projects. In this poster we plan to describe the use of TEI and benefits thereof in a series of courses, including sample curricula, a sample TEI-encoded document, and excerpts of student feedback.

  • Introducing Mandoku [slides]
    Christian Wittern

    Mandoku is a tool for creating and using digital editions of premodern Chinese texts. It makes it possible to display a digital facsimile of one or more editions and a transcribed text of these editions side by side on the same screen. From there, the texts can be proofread, compared and annotated. A special feature is the possibility to associate characters of the transcription with images cut from the text and a database of character properties and variants, which can be maintained while operating on the text. Interactive commands also assist in identifying and recording structural and semantic features of the texts.

    One of the major obstacles to digitization of premodern Chinese texts is the use of character forms that are much more idiosyncratic than today's standard forms. Since in most cases they cannot be represented, they are exchanged during input for the standard forms. This is a hazardous and error-prone process at best, and one that completely distorts the text in worse cases. To improve on this situation and to make the substitution process itself available to analysis, Mandoku uses the position of a character in a text as the main means of addressing, allowing the character encoding to become part of the modelling process, thus making it available to study and analysis, which in turn should make the process of encoding more tractable even for premodern texts. The current model is still experimental, but initial results have been encouraging. While the format used by this tool has been developed specifically for this purpose, the texts produced using Mandoku can be exported as valid TEI P5. This makes it possible to use this tool in a workflow that starts from plain text and aims at producing a TEI version of these texts that can then be fed to digital repositories, printers or analytic tools.

    Mandoku is a work in progress and is developed by Christian Wittern as part of the Daozang jiyao project at the Institute for Research in Humanities, Kyoto University. The screen shot shows a page from the preface to a commentary on the Daode jing from an 18th-century collection of Daoist texts in an early 20th-century woodblock print. On the left is a facsimile of the print; to the right is the transcribed text. A character has been selected and associated with a portion cut from the image. This will be saved to the database and associated with the character.

  • The Complete Prose of T. S. Eliot: A Joint Project of Emory University and Johns Hopkins University Press
    Alice Hickcox

    In 2005 Dr. Ron Schuchard of Emory University's English department received permission from Mrs. Eliot to edit the prose papers of her husband, T. S. Eliot. The goal was to produce a critical edition of all of Eliot's prose, as well as a fully searchable digital archive accessible to scholars. The Lewis H. Beck Center, the Emory University Libraries' digital text center, was asked to help Dr. Schuchard create digital surrogates of the print materials, and to make the material available to the volume editors, who are located in both the U.S. and Britain.

    With the support of Emory College and the library, the Beck Center recruited "Team Eliot," the Beck Center staff and a few graduate and undergraduate students, to begin digitizing and proofreading over 1000 pieces of text, from articles of a few paragraphs to book-length essays. Eliot experts would edit the texts, working from the texts and images created by Team Eliot.

    We began by creating images of the readily available material and added the rare and unusual material as it became available to us. The images of the text were created as TIFFs, which were used to create text versions either through OCR or rekeying. After the texts were proofread, they were transferred to the system for editorial access.

    The editors have access to the texts and jpeg derivative images through a web drive, supported by a Subversion versioning system. They can edit the texts directly on the web drive, or they can choose to download and upload the texts. They can also install their own svn clients and edit their working copies of the text, and update the repository with their editions.

    The Beck Center is working with the press publishing the American edition, Johns Hopkins University Press, to create an XML-based text, marked up in TEI, which will be used for both the paper and electronic editions. This collaboration has required decision-making and communication well in advance of the receipt of edited texts, in order to agree on the level and detail of the markup in the text, as well as the structure of the files themselves.

    Several issues face the encoders:

    • detecting changes in RTF files, so that the encoded files can be kept up to date, because the editors will make changes even after the files have been encoded.
    • defining a strictly limited tag set that can validate the encoded files and check attribute values.
    • communicating between the press and the Beck Center so that the level of markup and detail of markup matches the needs of the press.

    The schedule calls for the editors of the first two volumes of the Collected Prose to be finished by the end of September, and for the encoded volumes to be delivered to the press by the end of the year.

    This project marks a first for both the press and the Beck Center. This is the first time the press will produce both print and electronic editions from XML-encoded text, and it is the first time that the digital text center has participated in the production process for a print edition.

  • TXM open-source platform demonstration [slides]
    Serge Heiden

    The research project Textométrie is developing an open-source software platform for the textometric analysis of textual data.

    The source data can be encoded in TEI with linguistic annotations.

    The platform provides tools for qualitative data analysis (a deep text search engine with KWIC concordances, hypertextual data rendering and navigation) and for quantitative data analysis (factorial analysis, classification, etc.).

    The demonstration will explore a sample corpus with the various available tools of the platform.

  • Medieval Nordic Text Archive – Menota: Resources, tools and guidelines [poster]
    Tone Merete Bruvik and Odd Einar Haugen

    Medieval Nordic Text Archive – Menota – is a network of Nordic archives, libraries and research departments working with medieval texts and manuscript facsimiles. The network has published the Menota Handbook, containing guidelines on how to encode Medieval Nordic texts using TEI P5. Moreover, Menota hosts an archive of texts encoded according to these guidelines. The archive forms a searchable text corpus (approximately 900,000 words) using Corpus Workbench routines.

    The Menota Handbook offers not only encoding guidelines, but also tools and resources, from recommendations on fonts for medieval texts, via XSLT stylesheets for Menotic texts, to practical advice on how to convert ordinary text files into Menota-compatible XML.

    In the poster, we give an overview of the tools and resources available from Menota, and also look into how we have approached some of the problems using TEI P5 and XML:

    Levels of text representation

    In Menota we have added three elements to represent different levels of text representation:

    <me:facs> contains a reading on a facsimile level

    <me:dipl> contains a reading on a diplomatic level

    <me:norm> contains a reading on a normalised level

    These are defined as part of the Menota namespace 'me'.
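
    As a hypothetical sketch of how these levels combine (the word and its readings are invented for illustration, following the pattern described above), a single word can carry all three readings inside a <choice> element:

    ```xml
    <w>
      <choice>
        <me:facs>kongr</me:facs>
        <me:dipl>kongr</me:dipl>
        <me:norm>konungr</me:norm>
      </choice>
    </w>
    ```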

    Deletions made by the transcriber or editor

    If a piece of text obviously should be deleted, e.g. duplicated text in a dittography, the transcriber or editor might want to make a deletion. This is the converse action of adding text, and should be distinguished from similar actions made by the scribe. While the elements <add> and <del> describe actions by the scribe himself or other scribes, the editorial additions and deletions should be singled out by separate elements. For additions, TEI recommends the element <supplied>, but there is no parallel to the <del> element. We suggest the element <me:expunged>, since the noun expunction and the verb expunge are commonly used for editorial deletion.
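
    For instance (a hypothetical example using the proposed element; the text is invented), a dittography deleted by the editor could be marked as follows, leaving <del> free for the scribe's own deletions:

    ```xml
    <l>hann var <me:expunged>var</me:expunged> konungr</l>
    ```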

    Overlapping structures

    Medieval texts have their share of the overlap problem; we will show the approach we have chosen to handle it.

    In chapter 11, “Representation of Primary Sources”, of the TEI P5 Guidelines the elements <addSpan/>, <delSpan/> and <damageSpan/> are defined. These elements are counterparts to the elements <add>, <del> and <damage>, but are milestone elements, and should be used when the feature to be encoded crosses structural divisions. There are in fact many more elements which can cross structural divisions, e.g. <sic>, <corr>, <unclear> and <supplied>, but there are no corresponding <sicSpan>, <corrSpan>, <unclearSpan> and <suppliedSpan>. Rather than adding these and several other elements, we recommend using one generic empty element to cover all cases of overlapping structures. We have called this new element <me:textSpan/> and given it attributes from the classes “att.spanning”, “att.transcriptional”, “att.typed” and “”, and the attribute @me:category, which contains a reference to the element it is the counterpart to.
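
    On this model, an unclear passage crossing a paragraph boundary might be encoded as follows (a sketch based on the description above; the attribute values and anchor are invented for illustration):

    ```xml
    <p>... <me:textSpan me:category="unclear" spanTo="#ts-end"/> end of one paragraph</p>
    <p>beginning of the next ... <anchor xml:id="ts-end"/> ...</p>
    ```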


    1. TEI Consortium, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 1.3.0. February 1, 2009. TEI Consortium. (Accessed May 13, 2009.)

    2. The Menota Handbook: Guidelines for the Electronic Encoding of Medieval Nordic Primary Sources. Ed. Odd Einar Haugen. Version 2.0. Bergen: Medieval Nordic Text Archive, 2008. ISBN 978-82-8088-400-8.

  • Decoding Patrick - The St Patrick's Confessio Hypertext Stack [poster]
    Franz Fischer

    The St Patrick’s Confessio Hypertext Stack Project aims, in building up a comprehensive digital research environment, to make accessible to academic specialists, as well as to interested lay people, all the textual aspects of St Patrick's Confessio. Composed in the 5th century, this is the very oldest surviving text, in any language, written in Ireland. Beyond establishing that Patrick really existed, it is a highly informative text as regards the origins of Christianity in early medieval Ireland.

    The project envisages providing facsimiles and transcriptions of the extant manuscript testimonies and digital versions of relevant editions – from the editio princeps of 1656 up to the canonical version of the critical text, established in the scholarly edition by Ludwig Bieler in 1951 – together with commentaries and translations into several modern languages. All the different textual components of this digital resource will be realised as one hypertext stack of very closely interlinked text layers, in order to enable the on-line visitor to click through the different manuscripts and text-versions passage by passage and, thus, to retrace, reconstruct and proof-check any established version of the text. The stack itself is then to be embedded in a net of significant contextual information and databases, such as the definitive dictionary entries prepared by the Dictionary of Medieval Latin from Celtic Sources (DMLCS) for many of the most interesting words.

    In addition, the opportunity is being taken to open up a whole horizon of tradition (the perception of the Saint and his work as a powerful founder of monasteries) by providing access to one of the earliest witnesses to the transformation and construction of the later medieval figure of Patrick, namely Muirchú’s Latin Life of Saint Patrick, compiled about 200 years after Patrick’s death. In order to elucidate why Muirchú adapted the account of St Patrick as he did, the original Latin text and an English translation will be provided, as well as an entertaining and highly readable modern narrative composed on behalf of the Activity. This composition, as well as an adapted dialogue performance of the Confessio itself, will also be delivered in audio files. Moreover, a commented iconographic slide show as well as a blog-like platform for user interaction to discuss the popularly held image of the Saint will be set up.

    In order to guarantee usability and sustainability of the resources, each of the text layers will be stored and provided as (a) a digital representation of the actual appearance of the layer, (b) the text involved, in a plain electronic text format, and (c) an electronic text version deeply encoded according to the standards established by the TEI P5 guidelines. To guarantee the digital asset’s durability, availability and ongoing maintenance, the files will be stored as an Academy Digital Resource (ADR) and published online through an XML database.

    The project was conceived and is overseen by Dr Anthony Harvey, editor of the Royal Irish Academy Dictionary of Medieval Latin from Celtic Sources. The Stack has three years’ funding for one Post-doctoral researcher and for one student intern each summer. Technical support is given by the Digital Humanities Observatory (DHO). Furthermore, the value of the Stack is greatly enhanced by the contributions of volunteer collaborators and partner projects. Last but not least, the success of the project is highly dependent on the courtesy of libraries, editors and publishers as regards matters of reproduction and copyright. A preliminary Stack version is going to be launched by summer this year.

    For further information see:

  • A Study of the Use of Digital Scholarly Editions of Letters and Correspondence [poster]
    Bert Van Raemdonck

    In 2004, Lina Karlsson and Linda Malm published a summary of their master's thesis in Human IT. Karlsson and Malm had examined in what form and to what extent 'media-specific value-adding features' were present in a selection of digital scholarly editions on the Web, 'concentrating on hypertextuality, interactivity and hypermediality'. The results of their study gave scholarly editors who had taken the step into the digital era little reason to party: however vigorously some of them had argued that their digital editions were somehow revolutionary, most Web editions seemed 'to reproduce features of the printed media' and did not fulfill 'the potential of the Web to any larger extent'.

    Although some recent digital scholarly editions make new and better use of the potential of the Web, various editions still suffer from the same problem that Karlsson and Malm witnessed five years ago. Moreover, some recent editions have other shortcomings that make them less user-friendly than their creators had hoped. It is therefore somewhat surprising that so little research has been done to find out what users actually want to do with the editions we are making.

    The digital edition I am preparing of a thousand letters concerning the Flemish literary journal Van Nu en Straks (1893–1894; 1896–1901) will be part of the 'Digital Archive of Letters in Flanders' (DALF). Hence, the question of how to encode these letters was an easy one: the 'DALF Guidelines for the description and encoding of modern correspondence material' (an extension of the TEI P4 Guidelines) were already there, perfectly addressing the needs of the project. The question of what to encode turned out to be a lot more difficult: DALF (and TEI) offer a seemingly endless range of elements to encode all kinds of features that may or may not occur in any handwritten letter. Given the simple fact that no scholarly editing project has limitless funding, someone has to decide which features should or should not be encoded for a particular project.

    Because no recent findings by other scholars on this subject are available (particularly not with respect to editions of correspondence), I have created a survey to gather some information on what other scholars would want to do with the letters we are editing. With my poster presentation I will try to prove that we as editors lack detailed information on the way other scholars want to use our editions, and present the online survey I have created.

  • Semantically Rich Tools for Text Exploration
    Andrew Ashton

    Literary scholarship is poised to benefit immensely from the emerging modular software frameworks that support deep and finely grained investigations of digital texts. Humanities research centers, such as the Brown University Women Writers Project, have invested substantially in enriching bodies of literary texts with semantic and structural information, using XML formats such as that of the Text Encoding Initiative (TEI). Recent innovations in scholarly software design offer opportunities to exploit the semantic depth of TEI collections by creating new tools for textual analysis and collaboration. To this end, the Brown University Center for Digital Scholarship (CDS) and the Brown University Women Writers Project (WWP) are beginning a new effort to create a prototype suite of software tools to explore TEI-encoded texts in the new Software Environment for the Advancement of Scholarly Research (SEASR). SEASR provides an environment for creating analytical software tools that are tailored to scholars’ research goals. Textual analyses and visualizations created in SEASR can be shared on the web and adapted by other scholars to support their own research.

    This project, newly funded by the NEH Digital Humanities Start-Up Grant program, will develop a suite of SEASR components to expose and manipulate various facets of the semantic meaning encoded in TEI collections. These SEASR components act as building blocks for scholars to create powerful and nuanced analyses. Components can be shared on the web, recombined, and repurposed in an open, modular environment, lending great flexibility to scholars working with SEASR.

    Examples of TEI components for SEASR might include components that:

    • Extract and separate a collection of texts by genre, then retrieve genre-specific structures within the text (e.g. poems, dramatic speeches, letters, recipes)
    • Distill from the selected texts or text pieces the personal names, and separate these by type (references to historical figures, mythological figures, biblical figures; place names; etc.)
    • Sort the subset of data chronologically.
    • Pass the data through a component that tokenizes and adds morphosyntactic information to each word.
    • Generate a visualization for each genre that describes changes in the association of certain adjectives with personal names, differentiated by gender.

    This poster will outline the SEASR framework, and provide examples of how SEASR can expose and explore the semantic meaning encoded within TEI collections.

  • The Shakespeare Quartos Archive [slides]
    James C. Kuhn

    The Shakespeare Quartos Archive has created encoded transcriptions for thirty-two copies of the five pre-1641 editions of Hamlet held at six participating institutions. Our first phase of work was funded by a JISC/NEH Transatlantic Digitization Collaboration Grant. A short demonstration of collations and the prototype user interface will be accompanied by discussion of encoding choices for representing documents made up of both printed text and manuscript. Quarto transcriptions were created largely on the basis of existing digital images, in a variety of formats and created by a variety of different institutions. Outsourced re-keying and encoding was followed by proofing and quality assurance under the direction of staff at the Oxford Digital Library. Copy-specific details differing from copy to copy, such as manuscript additions to the printed text and damage to the printed text, were added by staff and interns at the Folger Shakespeare Library. Visual and textual comparisons are possible through a variety of techniques, some internal to the SQA user interface developed at the Maryland Institute for Technology in the Humanities, others relying on third-party open source software. An outline of upcoming SQA plans will conclude the session, along with informal questions for conference delegates about organization and workflow in collaborative transcription projects; about ways in which SQA and other TEI-based projects can best support the task of editorial collation of multiple “hybrid” documents made up of print and manuscript elements; and about the rationale and the outline for our future workplan.

5:45 and 6 p.m. Walking tours of Ann Arbor (groups will leave from the Gallery)
Saturday, November 14
10 a.m. – 12:00 p.m. SIG business meetings
  • Correspondence (Gallery Instruction Lab, Hatcher Graduate Library North)
  • Education (Room 310, Hatcher Graduate Library North)
  • Libraries (Faculty Exploratory (room 209), Hatcher Graduate Library North)
  • Manuscripts (Gallery, Hatcher Graduate Library North)
  • Music (Library Information Technology conference room (room 300), Hatcher Graduate Library North)
  • Scholarly Publishing (Area Programs Reference (room 110) / Government Documents Center (room 203), Hatcher Graduate Library North)
  • Text and Graphics (Room 311, Hatcher Graduate Library North)
  • Tools (Government Documents Center (room 203) / Study room near Government Documents Center, Hatcher Graduate Library North)
12–1:30 p.m. Lunch (on your own)
1:30–3 p.m. Tools (Gallery, Hatcher Graduate Library North)
Chair: Aja Teehan
  • Freedict: an Open Source repository of TEI-encoded bilingual dictionaries [slides]
    Piotr Banski and Beata Wójtowicz

    The presentation introduces an open-source project that hosts TEI-encoded data and tools to manipulate them. The project is where open standards meet: the TEI formalism (P4 and P5) has been used to encode lexical databases of varied complexity, and the project tools convert TEI sources into, among others, the distribution format used by the DICT protocol (Faith and Martin, 1997). The dictionary databases are stored at, and distributed by, a well-known open-source distribution and development platform.

    Given the unavailability of free XML-based tools that could play the role of a Dictionary Writing System, the project is on its way to defining levels of dictionary encoding designed to make the process of dictionary creation or conversion easy. The idea is to proceed via stages of increasing complexity of markup, at each stage being able to convert the dictionary source into the desired format – currently, only the DICT format is supported for TEI P5 dictionaries, but tools exist to convert TEI P4 into several open dictionary standards (zbedic, Evolutionary Dictionary, Stardict, Open Dict).

    The phases that a dictionary may go through may involve abuse of some of the TEI elements, e.g., the <def> element that may hold translation equivalents rather than a real definition, or an extended system of <note>s of various types, that can contain labels, usage hints, parts of definitions, and other elements of dictionary microstructure. This tag abuse is documented in the dictionary header and in this way poses less of a danger and makes it possible for the dictionary creator to publish and distribute the dictionary without having to reach the final stage of encoding complexity, which we make one of the crucial objectives of the entire enterprise – we want the authors to be able to “publish early, publish often”, in the Open-Source way.

    Currently, the project offers 68 dictionary databases of varied complexity and different sizes, from 140–150 thousand entries (English–Czech, Hungarian–English) to around 300 entries (Irish–Polish). At least three new large databases are currently being converted to TEI P5.

    Below is a fragment of a possible initial stage of dictionary encoding, anonymized (reconstructed schematically here, with placeholders for the anonymized content):

    <entry>
        <form><orth>[headword]</orth></form>
        <def>[translation equivalent]</def>
    </entry>
    One of the middle stages (note the tag abuse) is presented below:

    <entry xml:id="id-goes-here">
        <form><orth>[headword]</orth></form>
        <note type="def">[extended definition]</note>
    </entry>

    And the following is a possible final stage of dictionary development, with some attributes and possible repeated elements (such as <sense>, <cit> or <quote>) omitted:

      <entry xml:id="id-goes-here">
        <form><orth>[headword]</orth></form>
        <sense>
          <cit type="trans" xml:lang="x">
            <quote>[translation equivalent]</quote>
          </cit>
        </sense>
      </entry>

    Each of these stages is convertible by XSLT tools into DICT databases, and may be used across the Internet practically within hours after the conversion.

    One of the challenges that this system poses is balancing the components responsible for general issues and dictionary/language-specific issues. This is solved by dividing the markup into two general categories: language-specific, where various linguistic features pertaining to the given language pair are encoded in detail, often with tag abuse (see above), and the uniform pre-conversion format, which is translated into the DICT format by a single project-wide set of stylesheets. The mapping between the language-specific format and the pre-conversion format is performed by language-specific XSLT stylesheets.
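
    A language-specific stylesheet of this kind might contain rules such as the following (a minimal, hypothetical sketch, not the project's actual stylesheets): an identity transform plus one rule rewriting the abused <note type="def"> into a plain <def> of the pre-conversion format:

    ```xml
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- default: copy everything through unchanged -->
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <!-- language-specific rule: an abused <note type="def"> becomes a plain <def> -->
      <xsl:template match="tei:note[@type='def']">
        <def xmlns="http://www.tei-c.org/ns/1.0"><xsl:apply-templates/></def>
      </xsl:template>
    </xsl:stylesheet>
    ```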

    The aim of the presentation is to

    • present a case where controlled tag abuse appears useful (as it allows for gradual enhancements of encoding while ensuring that the database is useful at each stage),
    • discuss the strategy adopted for dealing with many diverse lexical descriptions while having to guarantee a fairly uniform output of the conversion from TEI to DICT and other distribution formats (the split between language-specific and pre-conversion formats), and
    • popularize the project among the TEI community as a repository of free lexical databases that can be re-used in other projects.


    • Faith, Rickard, and Bret Martin. 1997. A Dictionary Server Protocol. RFC 2229. Network Working Group. Available from
    • Freedict:
  • Using TEI to Create a Geo-Located Table of Contents for the Poetess Archive in Google Earth
    Laura C. Mandell, Gerald C. Gannod, and Kristen Bachman

    The Poetess Archive contains poetry written and published during what one literary critic has called the “bull market” of poetry. Writings in the poetess tradition were disseminated in myriad collections – miscellanies, beauties, literary annuals, gift books. Much criticism was expended in distinguishing “high” from “low” poetry, true art from the merely popular, the canonical from the non-canonical. Poetess poetry falls by definition into the last of these categories. Yet there has been a resurgence of interest in popular poetry of late.

    The Poetess Archive asks, what is it like to organize poetry written between 1700 and 1900 by triangulating it with the tradition of popular poetry rather than by filtering it through the canon? (Imagine a Norton Anthology of Poetry that focused on women writers: one heading would say, “Wordsworth,” and it would contain all the poetry and journals written by Dorothy Wordsworth, and then, under “Minor Poets,” one would find “excerpts” from “Tintern Abbey” by “W. Wordsworth, Dorothy’s brother”).

    Our use of TEI has allowed us to engage numerous filters for reorganizing poetry in ways other than those that are disciplinary and anthological. In the Poetess Archive, we begin with TEI documents, keying our poetry and poetry collections by hand and then hand-encoding them. An XSLT transform takes information from the metadata-rich TEI header and creates the comma-delimited files that are then loaded into a MySQL database. We have used that database to create an XML web service that uses the SOAP protocol and the Web Service Description Language (WSDL) to facilitate access to the archive.
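The header-to-CSV step can be sketched briefly. The project itself uses an XSLT transform; the element paths and column choices below are illustrative assumptions, not the archive's actual schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def header_to_row(tei_xml):
    """Pull a few metadata fields out of a TEI header (fields assumed)."""
    root = ET.fromstring(tei_xml)
    return [
        root.findtext(f".//{TEI}titleStmt/{TEI}title", default=""),
        root.findtext(f".//{TEI}titleStmt/{TEI}author", default=""),
        root.findtext(f".//{TEI}publicationStmt/{TEI}date", default=""),
    ]

doc = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>Sample Poem</title><author>A. Poet</author></titleStmt>
    <publicationStmt><date>1826</date></publicationStmt>
  </fileDesc></teiHeader>
</TEI>"""

buf = io.StringIO()
csv.writer(buf).writerow(header_to_row(doc))
print(buf.getvalue().strip())  # one comma-delimited row per TEI document
```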

    A mashup is an application that utilizes a number of data sources (in the form of web services) to present that data in some way that may not have previously been envisioned. At Miami University, we have developed a mashup that uses the data from the Poetess Archive web service to geo-associate poetry according to where a particular work was published. By providing parameters such as country of origin and date range, the application can generate time- and location-oriented maps that are viewable within Google Earth and Google Maps. In this way, a user can virtually geo-navigate the world and find poetry that was published in close geographical and chronological proximity.
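The geo-association step amounts to emitting map features from publication metadata. As an illustrative sketch (the title, date, and coordinates below are made-up placeholders, not Poetess Archive data), here is a minimal KML Placemark of the kind Google Earth and Google Maps can display:

```python
def placemark(title, year, lon, lat):
    """Build a KML Placemark for one published work."""
    return (
        "<Placemark>"
        f"<name>{title} ({year})</name>"
        f"<Point><coordinates>{lon},{lat},0</coordinates></Point>"
        "</Placemark>"
    )

# e.g. a work published in London in 1828
print(placemark("Sample Poem", 1828, -0.1276, 51.5072))
```

Wrapping a set of such placemarks in a KML `<Document>`, one per query result, yields a file that either viewer can load directly.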

    While students tend to think of poetry as timeless, and to imagine poets as having access to all the world’s art, sometimes even that art which postdates them, this tool firmly situates English and American poetry in time and space, impelling Poetess Archive users to find out more about those temporal and spatial realities, perhaps through other web services. Thus, we increase intellectual access to the history of poetry while simultaneously increasing real access by projecting the Poetess Archive Database onto different platforms and views.

  • The TEI-Comparator: A web-based comparison tool for XML editions
    James Cummings and Arno Mittelbach

    This paper discusses the creation of a tool for the Holinshed Project at the University of Oxford. Holinshed's Chronicles of England, Scotland, and Ireland was the crowning achievement of Tudor historiography and an important historical source for contemporary playwrights and poets. Holinshed's Chronicles was first printed in 1577 and a second revised and expanded edition followed in 1587. EEBO-TCP had already encoded a version of the 1587 edition, and the Holinshed Project specially commissioned them to create a 1577 edition using the same methodology. The resulting texts were converted to valid TEI P5 XML and used as a base to construct a comparison engine, known as the TEI-Comparator, to assist the editors in understanding the textual differences between the two editions.

    Using the TEI-Comparator involves several stages. The first was to decide which elements in the two TEI XML files should be compared. In this case the appropriate granularity was at the paragraph (and paragraph-like) level. The project was primarily interested in how portions of text were re-used, replaced, expanded, deleted, and modified from one edition to another. This first stage ran a short preparatory script which added unique namespaced IDs to each relevant element in both TEI files. It is the proper linking of these IDs between the two files that the TEI-Comparator was designed to facilitate.
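A sketch of such a preparatory script follows. The project's actual script is not reproduced here; the namespace URI and ID scheme below are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

CMP = "http://example.org/tei-comparator"  # assumed comparator namespace
ET.register_namespace("tc", CMP)

def add_ids(root, prefix):
    """Give every TEI <p> a unique namespaced ID; return the count."""
    n = 0
    for p in root.iter("{http://www.tei-c.org/ns/1.0}p"):
        n += 1
        p.set(f"{{{CMP}}}id", f"{prefix}-{n}")
    return n

doc = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</body></text></TEI>"
)
print(add_ids(doc, "ed1587"))  # number of paragraph elements tagged
```

Keeping the IDs in their own namespace means they can be added and later stripped without disturbing the TEI markup itself.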

    The second stage was to prepare a database of initial comparisons between the two texts using a bespoke fuzzy text-comparison n-gram algorithm designed by Arno Mittelbach (the technical lead for the TEI-Comparator). This Shingle Cloud algorithm transforms both input texts (needle and haystack) into sets of n-grams. It matches the haystack's n-grams against the needle's and constructs a huge binary string recording where they match. This binary string is then interpreted by the algorithm to determine whether the needle can be found in the haystack and, if so, where. The algorithm runs in linear time and, given the language of the originals, was found to work better if the strings of text were regularized (including removal of vowels).
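The core idea can be shown in a toy reimplementation (the real algorithm, its regularization rules, and its match scoring are more involved): turn needle and haystack into character n-grams, then build the binary string that marks each haystack position whose n-gram occurs in the needle.

```python
def ngrams(s, n=4):
    """All character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def shingle_cloud(needle, haystack, n=4):
    """Return a 0/1 string: 1 where the haystack n-gram appears in the needle."""
    need = set(ngrams(needle, n))
    return "".join("1" if g in need else "0" for g in ngrams(haystack, n))

bits = shingle_cloud("crowning achievement",
                     "it was the crowning achievement of Tudor historiography")
print(bits)  # a long run of 1s locates the needle inside the haystack
```

Both the n-gram extraction and the set lookups are linear in the input lengths, which is what makes paragraph-by-paragraph comparison of two long editions tractable.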

    The third stage in using the comparator was for the research assistant on the project to confirm, remove, annotate, or create new links between one edition and the other using a custom interface to the TEI-Comparator constructed in Java using the Google Web Toolkit API. The final stage was to produce output from the work put in by the RA through generating two standalone HTML versions of the texts which were linked together based on the now-confirmed IDs.

    By the time of the TEI 2009 conference, the TEI-Comparator will be publicly available on Sourceforge with documentation and examples to make it easy for others to re-purpose this software for other similar uses, and to submit bugs and requests for future development. Although known as the 'TEI-Comparator', the program does not require TEI input; it works with XML files of any vocabulary as long as the elements being compared have sufficient unique text in them.

Late-Breaking Submissions (Gallery Instruction Lab, Hatcher Graduate Library North)
Chair: Rebecca Welzenbach
  • To Dream the Impossible Schema: Tagging Simultaneous Music, Performance, and Text for the Musical Theatre Online Project
    Douglas Larue Reside

    Music Theater Online, just launched in beta after a thorough reimagining in its final funded months, is one of the very few TEI-based projects that attempts to link text, music, video, and image to create an electronic edition of a single work. In the first year of the project, funded by a National Endowment for the Humanities Digital Humanities Startup Grant, we (with the permission and assistance of the creators and performers) encoded in TEI every extant copy of the libretto of the 2008 Broadway musical Glory Days and, where possible, linked each version of the script to associated audio and video recordings. Using a newly developed web interface, scholars can track the development of the script from the first drafts through the rehearsals and readings to the final Broadway version and, through the use of a web-based collation tool, compare any one version of the script to another.

    The extraordinarily interdisciplinary and multimedial nature of the American musical forced us to push the current instantiation of the TEI to its limits and to extend the schema beyond what is currently described in the recommendations while still remaining within the spirit of the published guidelines. In this presentation, I will discuss the modifications we made to TEI P5 in order to encode a musical, and the challenges that remain in encoding such texts. I will also discuss whether the TEI is, or should even aspire to be, appropriate by itself for works in which verbal text is only one of several important bearers of meaning used to communicate. I will briefly summarize other standard encoding options for music, audio, video, and images (MEI, MusicXML, SMIL, and SVG) which might be combined with TEI to thoroughly encode such multimedial material, and I will explain why Music Theatre Online, for good or ill, decided to expand the TEI rather than combine it with these other schemas.

    Finally, I will outline the methodology we used for encoding annotations and show how other editors might employ the same methods in their web-based editions. The Music Theatre Online interface also includes an annotation tool that uses standoff markup to allow users to annotate any region of the text--even across existing TEI tags. These annotation sets can be downloaded and shared with other users, allowing, for instance, a stage manager of a production of Glory Days to take notes during rehearsals to be shared with the cast at the end of the day. Unlike many similar projects, however, none of the annotations are stored on the project server, which, while limiting the possibilities for aggregating user-generated content, does permit users to maintain complete control of their work. I will discuss the reasons the project team decided to abandon the original (more common) plan for a centralized database for user content and compare the results to other MITH projects (such as the Shakespeare Quartos Archive) which do allow users to store their content on our machines.
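The standoff approach described above can be sketched briefly: annotations point at character ranges in the reading text rather than living inside the TEI tree, so a note can cross existing tag boundaries, and a whole annotation set can be serialized for sharing. The field names and sample text here are illustrative, not the project's actual format.

```python
import json

text = "The cast takes the stage"

# each annotation addresses a character range of the text, not a TEI element
annotations = [
    {"start": 4, "end": 14, "note": "blocking changed in previews"},
]

def apply_notes(text, annotations):
    """Resolve each standoff range back to the text span it annotates."""
    return [(text[a["start"]:a["end"]], a["note"]) for a in annotations]

# annotation sets are serialized for download and sharing,
# never stored on the project server
shared = json.dumps(annotations)
print(apply_notes(text, json.loads(shared)))
```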

  • A Four-Layer Model for Digital Multimedia Editions
    Douglas Larue Reside

    As the first funded year of the Shakespeare Quartos Archive draws to a close and the prototype is launched for public use, the project team has reflected on the lessons learned during the work and defined a sustainability plan for future development. After much thought and discussion, we have developed a four-layer model for electronic editions, currently at the heart of our proposals for a second phase of our work, which we believe will be of interest and use to all those engaged in electronic editing.

    The Shakespeare Quartos Archive contains high-resolution images of every extant pre-1642 copy of Shakespeare’s Hamlet in quarto form, along with TEI-encoded transcripts of these images. Maintaining access to this content as operating systems and scripting interpreters change is among the most pressing problems for our team. We have worked to make the data in this project as open, accessible, and well documented as possible in an effort to ensure continued access to our content even when the interface becomes obsolete.

    Our archive is constructed using a four-layer model which takes as its prime directive open access to content and achieves this by separating content from interface. The first, "content" layer of this structure contains the archive of digital surrogates. In the case of the Shakespeare Quartos Archive this consists of page images; for other projects it might include digitized audio or video recordings. We expect every item at this layer to be identified by a unique, stable URI, accessible without any login or password and maintained for a period of at least 15 years (the short history of personal computing suggests that image formats should be stable enough for such a period). The second layer consists of all the metadata and encoded transcripts that identify and describe the content. Again, this is located at a stable, open URI. Only at the third layer does our interface exist. This interface, constructed out of modular “widgets”, accesses the content and metadata at the lower two levels, but is accepted as an ephemeral means of accessing the content. The fourth layer consists of a database of user-generated annotations and tags created with the interface. This structure not only creates, we hope, a more sustainable model but also looks forward to a future time when enough similar projects and interfaces might exist on the web that a user might choose from several different interfaces for the same content.

  • ODD as a generic specification platform [slides]
    Laurent Romary

    The experience gained since the first concepts for the “new” ODD for P5 were set in 2004 has shown both the great usefulness and the limitations of what has become the core element of the TEI infrastructure as a whole. Indeed, ODD has not only been used to specify all TEI components in P5 but has progressively been adopted by the community as a customisation framework, ranging from personal adaptation of the TEI to the definition of new modules, which could ultimately be submitted to the consortium.

    Still, ODD is so intricately connected to the TEI environment that this at times prevents other communities from using it to specify their own XML formats independently of what is actually available from the TEI. There have been some experiments in this direction, even in other standardisation communities (W3C and ISO, for instance), but all have faced such technical difficulties (element name conflicts, namespace management, the impossibility of simply reusing basic components of the TEI) that the actual success of these endeavours could only be the result of strong motivation and the direct involvement of a TEI specialist.

    In parallel, a debate has started both within the council and in the TEI community as a whole as to whether we should make ODD evolve, in which directions, and with which impact on TEI tool development. The issue is obviously motivated by the feeling that ODD has the potential to be a real specification platform with more architectural capacities (modules and classes) than existing XML schema languages.

    Instead of making a technical analysis of the features that should be modified in or added to the current ODD, I will explore a series of use cases where I identify potential needs and requirements for more expressivity, more flexibility, or more coherence in the core ODD concepts. These use cases will in particular be articulated along two axes:

    • From a TEI internal point of view, how can we reuse customizations to go towards the definition of families of schemas that allow projects or communities of users to manage various workflows and maintain their levels of requirement in coherence with the main TEI framework?
    • Seen from the outside, how can ODD be used independently of the TEI, or more probably in combination with some TEI components, without imposing a precise knowledge of the TEI intricacies?

    This will lead to the elicitation of some requirements concerning module autonomy and inter-dependence which seem solvable through the introduction of the concept of crystals, i.e., connected and autonomous groups of elements. I will show how this concept facilitates the understanding of inheritance mechanisms in ODD and allows more flexibility in contemplating any kind of combination of TEI and non-TEI components.

3–4 p.m. Roundtable discussion: Trends to Watch in the TEI Community (Gallery, Hatcher Graduate Library North)
Chair: Susan Schreibman
  • Arianna Ciula (Science Officer, Humanities Unit, European Science Foundation)
  • Christian-Emil Ore (Chair of CIDOC and Academic Leader, Unit for Digital Documentation, Faculty of Humanities, University of Oslo)
  • Paul F. Schaffner (Head of Electronic Text Production, Digital Library Production Service, University of Michigan Library)
  • Christian Wittern (Associate Professor, Documentation and Information Center for Chinese Studies, Institute for Research in the Humanities, Kyoto University)
4–6 p.m. Working meeting of the SIG on Correspondence (Gallery Instruction Lab, Hatcher Graduate Library North) Working meeting of the SIG on Education (Gallery, Hatcher Graduate Library North)
4–7 p.m., followed by dinner Evening meeting of TEI Board of Directors (Turkish-American Friendship Room, Shapiro Library)
Sunday, November 15
10 a.m. – 4 p.m. Meeting of TEI Board of Directors (Turkish-American Friendship Room, Shapiro Library)