Encoding Information for Interchange: An introduction to the TEI
Humanities Computing Unit
The problem
- SGML/XML markup is powerful, flexible, and can be customised to meet most (all?) needs
- But to use it, you need a formal specification (aka document type definition or DTD)
- Where do you get one from?
Some answers
- Roll your own
- from scratch
- within an existing framework
The Text Encoding Initiative
Where did the TEI come from?
- From the humanities research community
  - librarians and cybernauts
  - linguists, historians, lexicographers...
- Sponsors
  - ACH: Association for Computers and the Humanities
  - ACL: Association for Computational Linguistics
  - ALLC: Association for Literary and Linguistic Computing
- Funders
  - U.S. National Endowment for the Humanities
  - Mellon Foundation
  - Commission of the European Communities, DG XIII
  - Social Sciences and Humanities Research Council of Canada
… and where is it going?
- Continued work in new application areas
- manuscript description
- physical description
- non-SGML data
- XML conformance
- Need for new infrastructure
- Corrected reprint of P3 due summer 1998
Goals of the TEI
a user-driven codification of existing best practice
- better interchange and integration of data
- support for all texts, in all languages, from all periods
- guidance for the perplexed: what to encode
- assistance for the specialist: how to encode any information of interest
TEI Deliverables
- coherent set of recommendations for text encoding
- comprising several distinct SGML tagsets
- based on existing practice
- documented in a reference manual
- tutorials for general and specialised audiences
The TEI modus operandi...
- identify significant particularities independent of notation or realisation
- avoid controversy, over-delicacy, inadequacy
- seek generalizable solutions, acceptable to a consensus
... and some consequences
- focus on content, not presentation
- descriptive, not prescriptive
- highly general in application, needs customization for particular areas
Who uses TEI?
- see http://www-tei.uic/orgs/tei/app/
- digital librarians and archivists
  - LC, HTI, UVA, CETH, OTA...
- Language Engineering projects
  - EAGLES, BNC, MULTEXT, Parole, Silfide
- academic researchers
  - Women Writers Project, Project Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library, and many more...
Designing your DTD
- How can a single mark-up scheme handle a large variety of requirements?
  - all texts are alike
  - every text is different
- Learn from the database designers
  - one construct, many views
  - each view a selection from the whole
How many dtds might you need?
or is there a better way?
- one (the Corporate or WKWBFY approach)
- none (the Anarchic or NWEUMP approach)
- as many as it takes (the Mixed Economy or WNSA approach)
The TEI solution: modularization
a single main DTD with many faces
- a (very) large number of element and attribute definitions
- organised as tagsets (core, base, additional, or auxiliary)
Combining Tag Sets
- And how does one combine tagsets? The how-many-dtds problem is back.
- all tag sets, all the time (the table d'hôte model)
- a few pre-selected combinations (the combination plate model)
- in completely unconstrained abandon (the smorgasbord model)
- one from column A, two from column B (the Chinese menu model)
The Chicago Pizza Model
"(deepDish|thinCrust|stuffed)" >
"(pepperoni|mushrooms|sausage|pepper|anchovies|...)" >
(%base;, tomatoSauce & cheese,
To build a view of the TEI dtd, take...
<!DOCTYPE TEI.2 SYSTEM 'tei2.dtd' [
<!ENTITY % TEI.prose 'INCLUDE' >
<!ENTITY % TEI.analysis 'INCLUDE' >
- the toppings of your choice
… trim to fit ...
- undefine elements to be redefined or removed
<!ENTITY % TEI.extensions.ent
… and cook thoroughly
- ‘compile’ the dtd to remove all parameterization
- easier to use for some software
- better project management
- see http://firth.natcorp.ox.ac.uk/~tei/pizza.html
- don’t forget the documentation!
TEI base tagsets
- exactly one must be selected
- defines basic structural components
- currently defined:
  - prose, verse, drama
  - transcribed speech
  - dictionaries
  - terminological databases
- mixtures of bases require special treatment
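The "exactly one base" rule can be pictured as a simple check over the INCLUDE/IGNORE switches. The following is only a sketch in Python: the entity names other than TEI.prose are assumed by analogy, and the function `selected_base` is hypothetical. Mixed bases, which need special treatment, are simply reported as an error here.

```python
# Assumed entity names, following the TEI.prose pattern used in these slides.
BASE_TAGSETS = {
    "TEI.prose", "TEI.verse", "TEI.drama",
    "TEI.spoken", "TEI.dictionaries", "TEI.terminology",
}

def selected_base(switches):
    """Return the single base tagset switched to INCLUDE.

    `switches` maps parameter entity names to "INCLUDE" or "IGNORE".
    Raises ValueError when no base, or more than one base, is selected.
    """
    chosen = sorted(
        name for name, value in switches.items()
        if name in BASE_TAGSETS and value == "INCLUDE"
    )
    if len(chosen) != 1:
        raise ValueError(f"expected exactly one base tagset, got {chosen}")
    return chosen[0]

# Additional tagsets (corpus, figures) do not count as bases.
lampeter = {"TEI.prose": "INCLUDE", "TEI.corpus": "INCLUDE", "TEI.figures": "INCLUDE"}
print(selected_base(lampeter))
```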
TEI additional tagsets
- sets of elements for specialised application areas
- can be mixed and matched ad lib
- currently provided:
  - linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism; names and dates; graphs and trees; figures and tables; language corpora...
How does this work?
- Main dtd consists of marked sections, each (potentially) containing one tagset
- By default, all tagsets are IGNOREd
<!-- declarations for tagset here -->
<!ENTITY % TEI.tagset "INCLUDE">
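Why does a user's INCLUDE beat the default IGNORE? SGML and XML keep the first declaration of an entity, and the document's own DTD subset is read before the external DTD. A rough Python model of that rule follows; the function `resolve_entities` and the sample defaults are illustrative assumptions, not part of any TEI software.

```python
def resolve_entities(internal_subset, dtd_defaults):
    """Return the effective value of each parameter entity.

    Both arguments are lists of (name, value) pairs in declaration order.
    The internal subset is processed first, and the first declaration of
    a name wins, so later defaults cannot override it.
    """
    effective = {}
    for name, value in list(internal_subset) + list(dtd_defaults):
        effective.setdefault(name, value)
    return effective

# Defaults as the main DTD might declare them: every tagset IGNOREd.
dtd_defaults = [
    ("TEI.prose", "IGNORE"),
    ("TEI.verse", "IGNORE"),
    ("TEI.analysis", "IGNORE"),
]
# Declarations made in the document's DTD subset.
user = [("TEI.prose", "INCLUDE"), ("TEI.analysis", "INCLUDE")]

effective = resolve_entities(user, dtd_defaults)
active = sorted(name for name, value in effective.items() if value == "INCLUDE")
print(active)
```

Only the marked sections whose entity resolves to INCLUDE contribute declarations; the rest are skipped by the parser.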
How does this work? (contd)
- Tagsets contain element and attlist declarations, each also enclosed by a marked section
- By default all elements are INCLUDEd
<!ELEMENT %n.element - - (#PCDATA)>
<!ATTLIST %n.element %a.global >
<!ENTITY % element "IGNORE">
How does this work? (contd)
- Element names (GIs) are always referred to indirectly, so that they may be renamed
<!ELEMENT %n.elem1; - - (%n.elem2;+)>
<!ENTITY % n.elem1 "elem1">
<!ENTITY % n.elem2 "foo">
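The renaming trick is plain text substitution: every `%n.xxx;` reference is replaced by whatever the entity currently holds. A minimal Python sketch of that expansion is given here. The helper `expand` is hypothetical, and the default name "elem2" is an assumption; only "elem1" and "foo" come from the slide.

```python
import re

# A parameter entity reference: '%name' with an optional closing ';'.
PE_REF = re.compile(r"%([A-Za-z][A-Za-z0-9.\-]*);?")

def expand(text, entities):
    """Replace parameter entity references until none remain.

    Assumes every referenced entity is defined and that definitions
    are not circular.
    """
    while True:
        expanded = PE_REF.sub(lambda m: entities[m.group(1)], text)
        if expanded == text:
            return expanded
        text = expanded

declaration = "<!ELEMENT %n.elem1; - - (%n.elem2;+)>"
default_names = {"n.elem1": "elem1", "n.elem2": "elem2"}
renamed = {**default_names, "n.elem2": "foo"}

print(expand(declaration, default_names))
print(expand(declaration, renamed))
```

The same expansion, applied to a whole DTD, is in effect what "compiling" a DTD to remove its parameterization amounts to.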
Element Classes
- Model classes
  - elements which share syntactic properties (i.e. occur in same position)
- Attribute classes
  - elements which share attributes
- Class membership can be inherited
- Another way of doing architectural forms
Some TEI model classes
- divn: structural elements like divisions
<div>, <div1>, <div2>, <lg>, <lg1>...
- divtop: elements which can appear at the start of a divn element
<head>, <epigraph>, <byLine>...
- chunk: paragraph-like elements
- phrase: elements which appear within chunks
<hi>, <foreign>, <date>, <q> ...
Some TEI semantic classes
- data: phrases likely to be normalised or processed non textually
<date>, <time>, <name>...
- biblpart: specialised components of bibliographic descriptions
<author>, <title>, <editor>...
- demographic: descriptive features of participants in a language interaction
<birth>, <socEcstat>, <occupation>...
Some TEI attribute classes
- global: attributes which are available to every element
- linking: attributes for elements which have linking semantics
targType, targOrder, evaluate
The class system in action
- Simplifying documentation and understanding of the DTD
- Parameterizing content models
- different for different bases
- Simplifies customization
- class membership is unaffected
- adding new elements to an existing class
Parameterized content models
- “Components”, for example:
- a dictionary is composed of entries
- a play is composed of speeches
- a novel is composed of paragraphs
- in each case, the basic “text soup” (and the structural divisions) remain the same, but they are organized differently
How does this work? (contd)
- the component class has different members in different bases
<!ENTITY % m.component "p|list|note">
<!ENTITY % m.component "entry">
<!ENTITY % component.seq "(%m.component;)+">
<!ELEMENT div - - (head?, (%component.seq;), div*) >
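To see how one declaration for div yields different content models, the substitution can be done by hand. In the Python sketch below, pairing "p|list|note" with the prose base and "entry" with the dictionaries base is an assumption drawn from the component examples; the function `div_content_model` is hypothetical.

```python
# Members of the component class per base (the pairing is assumed).
M_COMPONENT = {
    "prose": "p|list|note",
    "dictionaries": "entry",
}

def div_content_model(base):
    """Build the content model of div for the given base tagset."""
    component_seq = "(" + M_COMPONENT[base] + ")+"
    return "(head?, (" + component_seq + "), div*)"

for base in ("prose", "dictionaries"):
    print(base, "->", div_content_model(base))
```

The surrounding structure (optional head, nested div) never changes; only the class members filling the middle do.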
Customization...
- Removing an element involves
  - undeclaring it
  - (NB: ISO 8879 permits references to undefined elements -- though not all vendors know this)
- Adding a new element involves
  - determining its class
  - defining it
  - adding it to that class
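The two operations can be modelled with a set of declared elements and a table of class members. This is a sketch only: the function `customise` is hypothetical, and placing a `printer` element in the phrase class is an assumption made for illustration.

```python
def customise(classes, declared, remove=(), add=()):
    """Apply removals and additions without mutating the inputs.

    Removing an element only undeclares it: its name stays in any class
    (and so in content models) that already mentions it.
    Adding an element declares it and appends it to the named class.
    """
    new_declared = set(declared) - set(remove)
    new_classes = {name: list(members) for name, members in classes.items()}
    for element, class_name in add:
        new_declared.add(element)
        new_classes[class_name].append(element)
    return new_classes, new_declared

classes = {"phrase": ["hi", "foreign", "date", "q"]}
declared = {"hi", "foreign", "date", "q"}

new_classes, new_declared = customise(
    classes, declared, remove=["foreign"], add=[("printer", "phrase")]
)
print(new_classes["phrase"])
print(sorted(new_declared))
```

After the call, `foreign` is still named in the phrase class but is no longer declared, which is the situation the NB above describes.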
Customization (contd)
- Modification of an element implies removal followed by addition
- Class membership should be unaffected
<!-- in TEI.extensions.ent -->
<!-- in TEI.extensions.dtd -->
<!ELEMENT %n.p - - (#PCDATA)>
How does this work? (contd)
<!ENTITY % m.class "%x.class; name1 | name2 | name3 ..." >
<!ELEMENT %n.element; - - (%m.class;)+>
- Each model class is defined as a parameter entity
- Reference to class members is always indirect
- Membership extensible (by a kludge)
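The kludge is that the extension entity is pasted in front of the built-in names, so it must end with its own "|". A small Python sketch shows the effect; the function `class_members` and the element name `myElem` are hypothetical.

```python
def class_members(built_in, x_class=""):
    """List the members of a model class after textual expansion.

    `x_class` plays the role of the x.class entity: empty by default,
    or new names followed by a trailing '|'.
    """
    expanded = x_class + " " + " | ".join(built_in)
    return [name.strip() for name in expanded.split("|") if name.strip()]

built_in = ["name1", "name2", "name3"]

print(class_members(built_in))              # default: x.class is empty
print(class_members(built_in, "myElem |"))  # extension with trailing bar
print(class_members(built_in, "myElem"))    # trailing bar forgotten
```

In the last call the new name runs into the first built-in name, which is why the extension has to carry the trailing bar itself.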
An example: the Lampeter corpus
- Requirements
  - light presentational tagging
  - structural markup for access
  - demographic information about text production
  - small number of tags to ease data capture and validation
- Implementation
  - tagsets: prose base, and tags from four additional sets
  - some extensions, many exclusions
The Lampeter corpus DTD subset
<!DOCTYPE TEICORPUS.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.corpus "INCLUDE">
<!ENTITY % TEI.figures "INCLUDE">
<!ENTITY % TEI.transcr "INCLUDE">
<!ENTITY % TEI.extensions.ent
<!ENTITY % TEI.extensions.dtd
<!-- more declarations here -->
The Lampeter corpus extensions.ent
<!ENTITY % analytic 'IGNORE' >
<!ENTITY % biblStruct 'IGNORE' >
<!-- hic desunt multa -->
<!ENTITY % supplied 'IGNORE' >
"printer|pubFormat|bookSeller|">
"socecstatusPat|biogNote|">
<!ENTITY % x.globincl "gap|">
The Lampeter corpus extensions.dtd
<!ELEMENT (it|ro|sc|su|bo|go)
<!ELEMENT (persName|printer|pubFormat
|bookSeller|biogNote|socecstatusPat)
NB: This is a provisional version only! (no attlists, no documentation…)
Summary
- Designing a successful DTD involves careful, conscious, controlled theft
- A class system helps identify
  - what is true of all documents
  - what is true of some documents
- Modifiability can be compatible with standardization