Encoding Information for Interchange: An introduction to the TEI

Humanities Computing Unit

Oxford University

The problem

SGML/XML markup is powerful, flexible, and can be customised to meet most (all?) needs

But to use it, you need a formal specification (aka document type definition or DTD)

Where do you get one from?

How do you choose?

Some answers

Roll your own
- from scratch
- within an existing framework

Take what’s on offer

Use the TEI architecture

The Text Encoding Initiative

Modular Architecture

Customization

Where did the TEI come from?

From the humanities research community
- librarians and cybernauts
- linguists, historians, lexicographers...

Sponsors
- ACH Association for Computers and the Humanities
- ACL Association for Computational Linguistics
- ALLC Association for Literary and Linguistic Computing

Funders
- U.S. National Endowment for the Humanities
- Mellon Foundation
- Commission of the European Communities DG XIII
- Social Sciences and Humanities Research Council of Canada

… and where is it going?

Continued work in new application areas
- manuscript description
- physical description
- non-SGML data
- XML conformance

Continued take-up

Need for new infrastructure

Corrected reprint of P3 due summer 1998

Goals of the TEI

better interchange and integration of data

support for all texts, in all languages, from all periods

guidance for the perplexed: what to encode

assistance for the specialist: how to encode any information of interest

TEI Deliverables

coherent set of recommendations for text encoding

comprising several distinct SGML tagsets

based on existing practice

documented in a reference manual

tutorials for general and specialised audiences

The TEI modus operandi...

identify significant particularities independent of notation or realisation

avoid controversy, over-delicacy, inadequacy

seek generalizable solutions, acceptable to a consensus

... and some consequences

focus on content, not presentation

descriptive, not prescriptive

Occam's razor

modular, extensible dtd

highly general in application, needs customization for particular areas

Who uses TEI?

see http://www-tei.uic/orgs/tei/app/

digital librarians and archivists
- LC, HTI, UVA, CETH, OTA...

Language Engineering projects
- EAGLES, BNC, MULTEXT, Parole, Silfide

academic researchers
- Women Writers Project, Project Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library, and many more...

Designing your DTD

How can a single markup scheme handle a large variety of requirements?
- all texts are alike
- every text is different

Learn from the database designers
- one construct, many views
- each view a selection from the whole

How many dtds might you need?

one (the Corporate or WKWBFY approach)

none (the Anarchic or NWEUMP approach)

as many as it takes (the Mixed Economy or WNSA approach)

The TEI solution: modularization

(a British DTD)

a (very) large number of element and attribute definitions

organised as tagsets (core, base, additional, or auxiliary)

grouped into classes

Combining Tag Sets

And how does one combine tagsets? The how-many-dtds problem is back.
- all tag sets, all the time (the table d'hôte model)
- a few pre-selected combinations (the combination plate model)
- in completely unconstrained abandon (the smorgasbord model)
- one from column A, two from column B (the Chinese menu model)

The Chicago Pizza Model

<!ENTITY % base
    "(deepDish|thinCrust|stuffed)" >
<!ENTITY % topping
    "(pepperoni|mushrooms|sausage|
      pepper | anchovies | ...)" >
<!ELEMENT pizza - -
    (%base;, (tomatoSauce & cheese),
     (%topping;)*) >

To build a view of the TEI dtd, take...

the core tagsets

the base of your choice

the toppings of your choice

<!ENTITY % TEI.prose 'INCLUDE' >
<!ENTITY % TEI.analysis 'INCLUDE' >

<tei.2>.....</tei.2>

… trim to fit ...

user extension files

rename elements

undefine elements to be redefined or removed

<!ENTITY % TEI.extensions.ent SYSTEM 'myMods.ent' >
<!ENTITY % seg 'IGNORE' >

… and cook thoroughly

‘compile’ the dtd to remove all parameterization

easier to use for some software

better project management

see http://firth.natcorp.ox.ac.uk/~tei/pizza.html

don’t forget the documentation!

TEI base tagsets

exactly one must be selected

defines basic structural components

currently defined:
- prose, verse, drama
- transcribed speech
- dictionaries
- terminological databases

mixtures of bases require special treatment

TEI additional tagsets

sets of elements for specialised application areas

can be mixed and matched ad lib

currently provided:
- linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism; names and dates; graphs and trees; figures and tables; language corpora...

How does this work?

Main dtd consists of marked sections, each (potentially) containing one tagset

By default, all tagsets are IGNOREd

<!ENTITY % TEI.prose 'IGNORE' >
<![ %TEI.prose; [
<!-- declarations for the prose tagset -->
]]>
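A tagset is switched on from the document side. This works because the first declaration of an entity is the one that counts, and the document's own DTD subset is read before the external DTD. A minimal sketch, assuming the main DTD lives in a file called tei2.dtd (an assumed name) and that only the prose tagset is wanted:

```sgml
<!DOCTYPE tei.2 SYSTEM "tei2.dtd" [
  <!-- Read before tei2.dtd, so this declaration takes precedence
       over the DTD's own default of IGNORE for TEI.prose -->
  <!ENTITY % TEI.prose "INCLUDE" >
]>
```

When the main DTD later reaches its marked section for the prose tagset, the keyword has already been set to INCLUDE and the section is processed rather than skipped.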

How does this work? (contd)

Tagsets contain element and attlist declarations, each also enclosed by a marked section

By default all elements are INCLUDEd

<!ENTITY % element 'INCLUDE' >
<![ %element; [
<!ELEMENT %n.element; - - (#PCDATA) >
<!ATTLIST %n.element; %a.global; >
]]>

How does this work? (contd)

Element names (GIs) are always referred to indirectly, so that they may be renamed

<!ENTITY % n.elem2 "foo" >
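As a sketch of what this indirection buys: assume the name entity for the paragraph element is n.p, with default value p. The replacement name para is made up for illustration, and the content model is simplified.

```sgml
<!-- In the user's extension file: declared first, so it wins -->
<!ENTITY % n.p "para" >

<!-- In the tagset: the default name, ignored if n.p is already declared -->
<!ENTITY % n.p "p" >

<!-- Every declaration refers to the element through the entity,
     so this now declares an element called para -->
<!ELEMENT %n.p; - - (#PCDATA) >
```

Because content models elsewhere also use the entity rather than the literal name, the rename carries through without editing the tagset itself.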

Element Classes

Model classes
- elements which share syntactic properties (i.e. occur in same position)

Attribute classes
- elements which share attributes

Class membership can be inherited

Another way of doing architectural forms

Some TEI model classes

divn: structural elements like divisions

divtop: elements which can appear at the start of a divn element

chunk: paragraph-like elements

phrase: elements which appear within chunks

Some TEI semantic classes

data: phrases likely to be normalised or processed non textually

biblpart: specialised components of bibliographic descriptions

demographic: descriptive features of participants in a language interaction

Some TEI attribute classes

global: attributes which are available to every element

linking: attributes for elements which have linking semantics
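In practice an attribute class is a parameter entity holding a list of attribute definitions. A minimal sketch, assuming three global attributes (id, n, rend) and omitting the rest; the exact list and defaults in the TEI DTD may differ:

```sgml
<!-- Simplified: treat the attribute list as illustrative only -->
<!ENTITY % a.global '
    id    ID     #IMPLIED
    n     CDATA  #IMPLIED
    rend  CDATA  #IMPLIED' >

<!-- An element joins the class with a single reference -->
<!ATTLIST %n.p; %a.global; >
```

Changing the entity in one place changes the attributes of every element that refers to it.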

The class system in action

Simplifying documentation and understanding of the DTD

Parameterizing content models
- different for different bases

Simplifies customization
- class membership is unaffected
- adding new elements to an existing class

Parameterized content models

“Components”, for example:
- a dictionary is composed of entries
- a play is composed of speeches
- a novel is composed of paragraphs

in each case, the basic “text soup” (and the structural divisions) remain the same, but they are organized differently

How does this work? (contd)

the component class has different members in different bases

<![ %TEI.prose; [
<!ENTITY % m.component "p|list|note" >
]]>

<![ %TEI.dictionaries; [
<!ENTITY % m.component "entry" >
]]>

<!ENTITY % component.seq "(%m.component;)+" >
<!ELEMENT div - - (head?, (%component.seq;), div*) >
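Expanding the entities by hand shows what the parser ends up with. The two declarations below are alternative results, one per choice of base; they are not meant to be combined in one DTD:

```sgml
<!-- (a) Prose base selected: m.component is "p|list|note" -->
<!ELEMENT div - - (head?, ((p|list|note)+), div*) >

<!-- (b) Dictionary base selected: m.component is "entry" -->
<!ELEMENT div - - (head?, ((entry)+), div*) >
```

The source declaration for div is written once; only the value of m.component differs between the two views.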

Customization...

Removing an element involves
- undeclaring it
- (NB: ISO 8879 permits references to undefined elements -- though not all vendors know this)

Adding a new element involves
- determining its class
- defining it
- adding it to that class

Customization (contd)

Modification of an element implies removal followed by addition

Class membership should be unaffected

<!ENTITY % p "IGNORE" >

<!ELEMENT %n.p; - - (#PCDATA) >

How does this work? (contd)

<!ENTITY % m.class "%x.class; name1 | name2 | name3 ..." >

Each model class is defined as a parameter entity

Reference to class members is always indirect

Membership extensible (by a kludge)
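The kludge can be sketched as follows. This assumes the class is called phrase and that the main DTD gives x.phrase an empty default; name1, name2 and name3 are placeholders, as above.

```sgml
<!-- Extension file (read first). Note the trailing "|". -->
<!ENTITY % x.phrase "it | ro |" >

<!-- Main DTD. The empty default only applies when no extension
     has already declared x.phrase. -->
<!ENTITY % x.phrase "" >
<!ENTITY % m.phrase "%x.phrase; name1 | name2 | name3" >

<!-- With the extension:    m.phrase = it | ro | name1 | name2 | name3
     Without the extension: m.phrase = name1 | name2 | name3 -->
```

The trailing "|" has to be supplied by whoever writes the extension, because the empty default must also leave a well-formed group. That is what makes it a kludge.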

An example: the Lampeter corpus

Requirements
- light presentational tagging
- structural markup for access
- demographic information about text production
- small number of tags to ease data capture and validation

Implementation
- tagsets: prose base, and tags from four additional sets
- some extensions, many exclusions

The Lampeter corpus DTD subset

<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.corpus "INCLUDE">
<!ENTITY % TEI.figures "INCLUDE">
<!ENTITY % TEI.transcr "INCLUDE">
<!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd">

The Lampeter corpus extensions.ent

<!ENTITY % biblStruct 'IGNORE' >
<!ENTITY % supplied 'IGNORE' >
<!ENTITY % x.phrase "it|ro|sc|su|bo|go|">
<!ENTITY % x.biblPart "printer|pubFormat|bookSeller|">
<!ENTITY % x.demographic "socecstatusPat|biogNote|">
<!ENTITY % x.globincl "gap|">

The Lampeter corpus extensions.dtd

<!ELEMENT (it|ro|sc|su|bo|go)
    - - (%phrase.seq;) >
<!ELEMENT (persName|printer|pubFormat
    |bookSeller|biogNote|socecstatusPat)
    - - (%phrase.seq;) >

Summary

Designing a successful DTD involves careful, conscious, controlled theft

Modularize the task

A class system helps identify
- what is true of all documents
- what is true of some documents

Modifiability can be compatible with standardization