TEI Tite: A recommendation for off-site text encoding
Perry Trolard
Distributed under a Creative Commons Attribution-ShareAlike 3.0 Unported License
Copyright 2011 TEI Consortium.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
This software is provided by the copyright holders and contributors
"as is" and any express or implied warranties, including, but not
limited to, the implied warranties of merchantability and fitness for
a particular purpose are disclaimed. In no event shall the copyright
holder or contributors be liable for any direct, indirect, incidental,
special, exemplary, or consequential damages (including, but not
limited to, procurement of substitute goods or services; loss of use,
data, or profits; or business interruption) however caused and on any
theory of liability, whether in contract, strict liability, or tort
(including negligence or otherwise) arising in any way out of the use
of this software, even if advised of the possibility of such
damage.
TEI material can be licensed differently depending on the use you intend to
make of it. Hence it is made available under both the CC+BY and BSD-2
licences. The CC+BY licence is generally appropriate for usages which
treat TEI content as data or documentation. The BSD-2 licence is
generally appropriate for usage of TEI content in a software
environment. For further information or clarification, please contact
the TEI Consortium.
Written as the principal product of an independent study under John Unsworth at
the Graduate School of Library and Information Science, University of
Illinois at Urbana-Champaign, May 2009.
This electronic document is the original, but it primarily synthesizes work done at the University of Michigan Digital Library Production Service, University of Virginia Digital Library Production Service, and the California Digital Library and represented in their documents Minimum standards for text capture, Text Encoding Guidelines for Keyboarding Vendors, and CDL TEI Base Encoding Guidelines, respectively.
Added add and del elements.
Further clarifications.
Corrected typos and clarified TEI conformance.
Updated metadata, including version number to 1.1, and added some background on what Tite is. Copyedited section comparing Tite and the Best Practices for TEI in Libraries.
Removed unwanted elements & attributes, as described in ticket 3136934; added handShift, described in 3164403.
Revised to use new syntax. Added prose indicating that the absence of teiHeader breaks the abstract model as a convenience to transcribers but needs to be addressed after further processing.
Incorporated prose edits recommended by the TEI-in-Libraries SIG. Removed header module, TEI; added g.
Edited prose to conform to P5 official release.
Removed @rendition from att.global, as there is no tagsDecl element to contain a rendition target. Added several more element exclusions from TEI Lite (handNote, namespace, rendition, tagUsage).
Added sugar elements to att.global.
Added equiv for ornament and cols, and changed prose about ornament to use @rend.
Added equiv for the hi-like objects.
Modified available attributes for gap, unclear; added ed attribute to cols.
Misc. prose edits; cleaned up unnecessary elementSpec elements.
Added text as alternate root (on start attribute of schemaSpec), making teitite-nohead redundant; added div0 to list of elements Tite excludes (it's always been excluded, but now is reported).
Finished initial draft.
TEI Tite: A recommendation for off-site text encoding
Perry Trolard, for the TEI Consortium
Version 1.1, September 2011
Introduction
TEI Tite is a constrained customization of TEI designed for use when outsourcing production
of TEI documents to vendors, who use some combination of OCR and keyboarding to produce encoded
text. While the canonical version of Tite is maintained by the TEI Council, a derived version
is used in the AccessTEI program.
TEI Tite is meant to express a transitional format for documents, not an archival one. A
project outsourcing encoding of documents using Tite should convert Tite documents created by
vendors into a more suitable format for long-term preservation, such as one of the encoding
levels of
Best Practices for TEI in Libraries or a project-specific TEI
customization.
While Tite includes only a limited set of all of the elements in TEI, it should not be
confused with
TEI Lite, which also contains a subset of elements. What
distinguishes Tite from other TEI customizations is that Tite is meant to prescribe
exactly one way of encoding a particular feature of a document in as many cases
as possible, ensuring that any two encoders would produce the same XML document for a
source document.
This document specifies how a source document should be encoded using TEI Tite. Its organizing model is roughly the
structure of a TEI document itself, and it proceeds from high-level features to low, starting
with general requirements, text structure, directions on when to group texts, considerations
about type of text (genre and format), continuing down to instructions on marking phrase-level
features, reference systems, and so forth. In its original ODD (one document does-it-all)
format, this document can generate everything necessary for working in TEI Tite: both
documentation (this Tite-specific prose as well as the full technical documentation for each of
its elements) and schemas in W3C Schema, RELAX NG, or XML DTD. Software utilities,
including the Roma web tool, can generate
these.
Tite uses a subset of the TEI's elements, except for a few
shortcut elements for the convenience of use by vendors
(b, i, ul, sup, sub,
smcap, cols and ornament) which can be
transformed to normal TEI elements.
Tite is not, however, a TEI-conformant customization, since it breaks the TEI
Abstract Model by omitting teiHeader for encoder
convenience. In other respects, Tite was created primarily by
removing elements and attributes from the TEI, and
not by extensive modification. As a TEI
customization, Tite inherits TEI semantics, and ambiguity in this
specification should be resolved with reference to the TEI
Guidelines. What makes Tite distinct is that where the TEI
in general is famously tolerant of multiple methods of encoding a
given feature, Tite seeks uniformity of encoding through
constraint, via its stripped-down tag set and via this
specification.
Tite can be used to encode printed prose, poetry, drama, newspapers, and anything else which
can be described with the basic TEI building-blocks of divisions, paragraphs, line groups, and
speeches.
In this documentation, document refers generally to the item (book, pamphlet,
newspaper, etc.) to be encoded and text to either linguistic (as opposed to
graphic) material or a logically distinct literary unit.
General Requirements
What to Capture
All printed material should be captured: all text (that is, printed characters) should be
transcribed and the presence of graphical items or other non-transcribable elements should be
indicated with markup.
End-of-line Hyphens
A distinction should be maintained in the electronic transcription between end-of-line or
soft hyphens (an artifact of page layout) and hard
hyphens (a linguistic feature). The former should be transcribed as the SOFT HYPHEN (U+00AD)
character; the latter, as the HYPHEN-MINUS (U+002D) character generally available on Western
keyboards. In the rare case of coincidence of the two types — where a word that is normally
hyphenated is split across a line break at its hyphen — the hyphen should be considered hard,
and transcribed as the HYPHEN-MINUS.
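As a minimal sketch of this distinction (the sentences are invented for illustration), the soft hyphen can be typed as the character itself or, as here so that it stays visible, written as the hexadecimal character reference for U+00AD. The hard hyphen is the ordinary keyboard character:

```xml
<!-- "transcription" was split at a line end in print: SOFT HYPHEN (U+00AD) -->
<p>The tran&#xAD;scription was checked twice.</p>

<!-- "well-known" is hyphenated regardless of layout: HYPHEN-MINUS (U+002D) -->
<p>She was a well-known printer.</p>
```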
Character Encoding
Characters should be encoded in UTF-8. For characters not easily input from the keyboard,
use hexadecimal numeric character references (e.g. é, the Latin small letter e with acute accent, is represented
as &#xE9;).
Accuracy and Verification
The standard for accuracy of transcription should be at least 99.99% (1 error in 10,000
characters). The sample size for verification will be 5% of the total text.
Documenting the Encoding Process
Almost surely, difficult encoding situations will arise whose resolution may not be covered
by this documentation or the TEI Guidelines. In such cases, it is important to document the
markup choices that are made. To this end each encoded file should be accompanied by a
document with such notes. These notes should reference features of a document that seem
remarkable to encoders and how these were handled by encoders.
Global Text Structure
TEI Tite text structure
In TEI Tite, text is the root element, containing front matter, the body of the
text, and back matter.
The text's xml:id attribute should contain a unique identifier for the
document being encoded.
Tite omits the teiHeader element as a convenience to transcribers. This departs
from normal TEI practice, which requires TEI as the root element, containing
teiHeader and text elements. In order to bring a document encoded in TEI
Tite into adherence with the TEI Abstract Model, projects should add a teiHeader before
engaging in post-transcription processing.
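The basic structure described in this section might be sketched as follows. The xml:id value is a hypothetical identifier, and the comments stand in for real content:

```xml
<!-- "doc0001" is a made-up identifier for the document being encoded -->
<text xml:id="doc0001">
  <front>
    <!-- title page, preface, and other front matter -->
  </front>
  <body>
    <!-- the main text -->
  </body>
  <back>
    <!-- appendices, index, and other back matter -->
  </back>
</text>
```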
Groups of Texts
A document should be encoded as a group of texts only when each member of the group
contains its own front or back matter (most often, a separate title page). In this case the
group element should be a child of the text element, and should contain
child text elements each containing a front, body, and
back (each text need not have both front and back matter, but should have
at least one). Note that this group of texts will still have its own front and back matter.
When dealing with a group of texts, the basic TEI text structure is modified to look like:
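A sketch of the grouped structure, reconstructed from the description above rather than taken from a real document (the xml:id value is hypothetical, and each inner text is shown with only the parts it happens to have):

```xml
<text xml:id="doc0002">
  <front>
    <!-- front matter belonging to the group as a whole -->
  </front>
  <group>
    <text>
      <front><!-- this member's own title page --></front>
      <body><!-- first member text --></body>
    </text>
    <text>
      <body><!-- second member text --></body>
      <back><!-- this member's own back matter --></back>
    </text>
  </group>
  <back>
    <!-- back matter belonging to the group as a whole -->
  </back>
</text>
```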
In cases where a document appears to contain a group of texts but the above condition is
not met, encode each unit as a (numbered) div with an appropriate type
attribute.
Structural Divisions
Tite uses numbered divisions: div1 through div7, which stand for levels of
nesting within a text. div1s nest inside or are contained by the front,
body, and back elements, div2s nest inside or are contained by
div1s, etc. The document's table of contents is often a good place to find cues
about where structural divisions start and end; other cues can be blank pages, recurring
typographical or ornamental features, or a numbering system ("Chapter 5" etc.). Also, the
presence of a heading will often indicate the beginning of a division.
The type attribute should be used to express the type of division being marked.
Where present, use a name for division type given in the document itself. Though any
constrained enumerated list of type values will have to be determined on a
job-by-job basis, some examples of appropriate division types are: act, article, book, chapter, essay, letter, part, scene, section, subsection.
When a heading is present, encode it with the head element. If there is more than
one heading at the beginning of a given division, encode each heading with its own
head element, using the type attribute to distinguish them. Appropriate
values are: main; sub (subtitle); alt (alternate); desc (descriptive).
The n attribute should be used to record sequential labels associated with a
structural division (numbers, numerals, letters). When present, these labels should also be
transcribed within the content of the head element. For instance:
III: It Awakes
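In markup, a heading like this one might be encoded as sketched here. The division level, the type values, and the nested section are assumptions made for illustration:

```xml
<body>
  <!-- n records the sequential label; the label is also transcribed in head -->
  <div1 type="chapter" n="III">
    <head>III: It Awakes</head>
    <p>...</p>
    <!-- a subdivision nests as the next numbered level -->
    <div2 type="section" n="1">
      <p>...</p>
    </div2>
  </div1>
</body>
```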
False Indicators
A divisional title is a page that resembles a half-title page: it displays the
title or heading of a major structural unit on an otherwise blank page. Divisional
titles should be encoded not with a separate div element, but as a
head within the appropriate div. For half-title pages and
similar fly-title pages see the section on Front
Matter.
Another potential false indication of a new structural division is an ornament
used as an informal division: a printer's ornament of some sort, a string of asterisks or
periods, or a horizontal line. Mark these with the special ornament element. If the
ornament is a horizontal line or printer's device or otherwise not transcribable, make the
element empty and include an appropriate type attribute (line or
ornament); if the ornament is made up of characters, transcribe the characters
into the ornament's content.
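A sketch of both cases, following the wording of this section (the surrounding paragraphs are invented, and type="line" is one of the two values named above):

```xml
<p>... the end of one passage.</p>
<!-- ornament made up of characters: transcribe them as content -->
<ornament>* * * * *</ornament>
<p>The start of the next passage ...</p>
<!-- horizontal rule, not transcribable: empty element -->
<ornament type="line"/>
<p>A further passage ...</p>
```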
Front and Back Matter
Front and back matter should be encoded with the front and back elements,
respectively. div1 elements should contain the major sections and should be
characterized by type attribute values. The exception, however, is the title page,
which should be encoded with the titlePage element and its children. The
titlePart element should have a type attribute with one of the following
values: main; sub (subtitle); desc (descriptive title); alt (alternate title); volume (volume information).
titlePart type="volume" should be used to encode volume information wherever it is
found on the title page, even if it is separated from the other title information. The
elements that make up the titlePage content model are: graphic,
byline, epigraph, docTitle, titlePart,
docAuthor, docEdition, docImprint, docDate,
figure, ornament.
Information on the verso of the title page should be included as well (after a
pb).
Common items to encode in front and back matter, and therefore common type
attribute values for front and back divisions, are:
front: acknowledgements, advertisement, castlist, contents, dedication, fly-title, foreword, introduction, preface
back: appendix, bibliography, colophon, glossary, index
Half-title and fly-title pages may be encountered in the front
matter. A half-title page precedes the title page proper and sometimes includes
volume or series information; a fly-title page comes at the very end of the front
matter, just before the body. In the case of half-titles, encode these as div1
type="half-title" (with titlePart elements as appropriate); in the case of
fly-titles, encode them likewise with div1 type="fly-title", making sure to make
the fly-title division the last part of the front matter (and not the first part of the body,
as may seem reasonable as well).
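The front-matter guidance in this section can be sketched roughly as follows. All titles and names are placeholders, the choice of child elements inside titlePage is illustrative, and the verso content is only indicated by a comment:

```xml
<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">...</titlePart>
      <titlePart type="sub">...</titlePart>
    </docTitle>
    <!-- volume information, wherever it appears on the title page -->
    <titlePart type="volume">...</titlePart>
    <docAuthor>...</docAuthor>
    <docImprint>...</docImprint>
    <pb/>
    <!-- information printed on the verso of the title page follows the pb -->
  </titlePage>
  <div1 type="preface">
    <head>...</head>
    <p>...</p>
  </div1>
  <!-- a fly-title is the last division of the front matter -->
  <div1 type="fly-title">
    <p>...</p>
  </div1>
</front>
```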
Types of Text
Tite is equipped to support basic encoding of several types of text: in terms of genre, it
supports prose, verse, and drama, and in terms of format, it supports books, newspapers,
pamphlets, and other similar printed material. Tite has special elements for letters, verse,
drama, and newspapers.
Letters
opener and closer are elements designed to encode the beginning and ending
sections of letters, prefaces, diary entries, or other personal types of writing. Both
elements contain:
dateline: for recording time and place of composition; use date with a when value (formatted yyyy-mm-dd) to record date information
signed: for recording a signature
salute: for recording a salutation at the beginning ("Dear Roger,") or end ("Yours truly,")
opener contains the additional elements epigraph, argument, and
byline. epigraph will often be useful in the context of a letter. When
encoding an epigraph, make sure to encode the content as you would any other feature, marking
line groups, bibliographical elements, etc.
argument and byline, however, are not intended specifically for use with letters:
argument: for a summary that precedes a division
byline: for a statement of responsibility for the document
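As an illustrative sketch of opener and closer in a letter (the place, date, and signature are invented, and the division level is an assumption):

```xml
<div1 type="letter">
  <opener>
    <!-- date carries a machine-readable value in yyyy-mm-dd form -->
    <dateline>London, <date when="1850-03-14">14 March 1850</date></dateline>
    <salute>Dear Roger,</salute>
  </opener>
  <p>...</p>
  <closer>
    <salute>Yours truly,</salute>
    <signed>A. B.</signed>
  </closer>
</div1>
```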
Verse
All verse should be encoded within at least one lg element, even when there are no
distinct stanzas or when the verse is interspersed with prose. If it is known, use the
type attribute to express the type of line group. Sometimes within a poem there is
a question about what should be tagged as a lg or as a separate div. As a
rough rule of thumb, if there is a title accompanying the division, use the div
element; otherwise, use lg.
Each line of verse should be encoded with the l element, and care should be taken
to distinguish these logical lines of verse from lines motivated by page layout. The latter
should be encoded as lbs. For example, each of the following four lines of verse should be encoded as its own l element within a single lg:
AS virtuous men pass mildly away,
And whisper to their souls to go,
Whilst some of their sad friends do say,
"Now his breath goes," and some say, "No."
Also, use the rend attribute to mark when a line
is indented more than its siblings. Use numbered indent values (e.g.
indent(1), indent(2), etc.) to make clear levels of indentation.
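A sketch of how a four-line stanza might be marked up under these rules. The type value and the choice of which lines carry indent(1) are assumptions, since the printed layout is not reproduced here:

```xml
<lg type="stanza">
  <l>AS virtuous men pass mildly away,</l>
  <!-- indented one level more than its siblings -->
  <l rend="indent(1)">And whisper to their souls to go,</l>
  <l>Whilst some of their sad friends do say,</l>
  <l rend="indent(1)">"Now his breath goes," and some say, "No."</l>
</lg>
```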
Drama
The standard TEI elements for drama should be used: sp, stage,
speaker. If the who attribute is used on sp, also transcribe who
is given as the speaker, in whatever form it is written, in the speaker element.
Short pieces of stage direction that accompany the speaker designation may be included in the
speaker element.
Scenes and acts should be encoded as appropriately nested div elements with
type attributes of scene or act, respectively. Cast lists
can likewise be encoded using div and type="castlist".
Prologues and epilogues can be treated as sps of their own, unless their structure
would be better represented by nested div elements.
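A minimal sketch of nested acts and scenes with a speech and a stage direction; the dialogue, speaker name, and numbering are invented for illustration:

```xml
<div1 type="act" n="1">
  <head>Act I</head>
  <div2 type="scene" n="1">
    <head>Scene 1</head>
    <stage>Enter two servants.</stage>
    <sp>
      <!-- speaker is transcribed in whatever form it is printed -->
      <speaker>First Serv.</speaker>
      <p>...</p>
    </sp>
  </div2>
</div1>
```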
Newspapers
Tite includes the elements cols and cb which are well suited for the
multi-column layout of newspapers. Additional relevant elements are: ref, to encode a
pointer to the continuation of a story in a different column or on a different page; and
figure, to describe illustrations, advertisements, and cartoons.
Block-level Features
Block Quotations
Use the q element to encode block quotations. A block quotation is indicated by its
being set off from surrounding text either with extra line-spacing or margins or with a
different typeface. If the quotation is of an entire text, use the floatingText
element and its children inside the q element:
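A sketch of a quoted complete text, assuming a short letter as the embedded text (the content is a placeholder, and placing paragraphs directly in body is one simple option):

```xml
<q>
  <floatingText>
    <body>
      <!-- the quoted text is encoded as a text in its own right -->
      <p>...</p>
      <p>...</p>
    </body>
  </floatingText>
</q>
```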
If present, transcribe all quotation marks or other delimiters inside the q
element.
Figures
Use the figure element to encode figures. If a figure has a heading or caption,
encode it with the head element. If there is associated text, simply use a p
to encode it.
Tables and Lists
Tables and lists are encoded as in the TEI Guidelines, but note the following.
If a cell in a table is a heading or a label, set the role attribute to
label; if the cell contains data, there is no need to use role:
data is the default. If a cell or row spans more than one column or row, use the
rows or cols attributes set to the number of columns or rows that it
spans.
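For instance, a small table with label cells and one spanning cell might look like this sketch (the cell contents are invented):

```xml
<table>
  <row>
    <cell role="label">Year</cell>
    <cell role="label">Volumes</cell>
  </row>
  <row>
    <!-- data cells need no role: data is the default -->
    <cell>1850</cell>
    <cell>12</cell>
  </row>
  <row>
    <!-- one cell spanning both columns -->
    <cell cols="2">No figures recorded.</cell>
  </row>
</table>
```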
If unsure about whether a structure is best encoded as a list or table, record it as a table
only if it would not be properly understood without tabular layout.
Lists should be encoded as either sequences of items or
label-item pairs. When items in the list contain a label, as in a gloss
list, be sure to use the latter form.
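The two list forms can be sketched as follows. The entries are invented, and type="gloss" is the standard TEI value for a gloss list, assumed here to be permitted in a given job:

```xml
<!-- simple sequence of items -->
<list>
  <item>Paper</item>
  <item>Ink</item>
</list>

<!-- label-item pairs, as in a gloss list -->
<list type="gloss">
  <label>recto</label>
  <item>the front side of a leaf</item>
  <label>verso</label>
  <item>the back side of a leaf</item>
</list>
```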
Notes
Both the reference to the note in the running text and the note itself must be encoded. Use
ptr or ref to encode the reference. If there is no reference in the text
(often the case for marginal notes), supply a ptr element in a reasonable place in
the text running beside the note. If there is a reference (number, symbol, etc.), use the
ref element and include the reference text as the content. In both cases, a
target attribute must be supplied which contains the xml:id value of
the associated note.
When encoding the note itself with the note element, the xml:id and
place attributes must be supplied. See the TEI documentation for acceptable values
for place; the most common will be foot, end,
margin-left (-right, -top, -bot).
Transcribe the note directly after it is referenced in the document. In the case of notes
without explicit reference (pointed to with ptr), set the anchored
attribute to false.
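A sketch of both kinds of note. The identifiers and text are invented, and the target values use the TEI P5 pointer form, a "#" followed by the note's xml:id, which is an assumption about how the identifier should be written:

```xml
<!-- footnote with a printed reference mark: ref holds the mark -->
<p>... as others have argued.<ref target="#n1">1</ref><note xml:id="n1"
  place="foot">1. See the preface.</note> The argument continues ...</p>

<!-- marginal note with no printed reference: ptr supplied by the encoder -->
<p>The harvest failed that year.<ptr target="#n2"/><note xml:id="n2"
  place="margin-left" anchored="false">Famine.</note> Prices rose ...</p>
```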
divWrapper Elements
Elements that can appear at the beginning and end of structural divisions, such as
argument, epigraph, and opener, are called
divWrapper elements in the TEI class system. An argument is
a summary of what is to come; be sure to distinguish this from a heading, which
is a title for the division. If an epigraph comes with bibliographic or simple
citation material, encode this as well. For example:
"I have sworn upon the altar of God eternal hostility against every form of tyranny over the mind of man."
Thomas Jefferson.
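One plausible encoding of this epigraph is sketched here. The use of bibl for the attribution is an assumption; the original markup of the example is not preserved in this text:

```xml
<epigraph>
  <!-- quotation marks are transcribed inside q -->
  <q>"I have sworn upon the altar of God eternal hostility against every
    form of tyranny over the mind of man."</q>
  <bibl>Thomas Jefferson.</bibl>
</epigraph>
```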
Uncertain Blocks
In rare cases where the logical identity of a block-level element is hard to discern, use
the TEI element ab (anonymous block) instead of applying a p or div
element. In these cases, be sure to document this decision in accompanying notes.
Applying this element should be viewed as a last resort.
The gap element should be used when for some reason the document being transcribed
contains illegible text (smudged, torn, missing, etc.) or something outside the scope of
transcription for a given project: characters in an unsupported character set, for instance.
gap indicates that something is omitted. When using gap, set the
reason attribute to an appropriate value. (See unclear below.)
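A brief sketch of gap in running text; the sentence is invented and the reason values are illustrative, not a prescribed list:

```xml
<!-- a smudged word that cannot be read at all -->
<p>The shipment arrived on <gap reason="illegible"/> and was unloaded.</p>

<!-- a passage outside the scope of transcription for this job -->
<p>The inscription reads <gap reason="unsupported-script"/> in the original.</p>
```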
Phrase-level Features
Typographical Changes
There are six elements in Tite that capture specific typographical features:
b: for bold-face glyphs
i: for italicized glyphs
ul: for underlined glyphs
smcap: for glyphs in small-caps
sub: for glyphs in subscript
sup: for glyphs in superscript
These mark the physical change, and are agnostic about a logical motivation for it.
There are two exceptions to this approach, however: marking foreign words and titles. In the
case of foreign words, use the foreign element; in the case of titles, use the
title element only if certain that the word or phrase in question is a title. If a
phrase is, say, italicized, but you are uncertain about its being a title, use the i
element instead. Foreign words should be marked only if they are typographically distinguished
from surrounding text.
If there is a typographical feature not covered by the above elements, the TEI hi
element is still available in Tite. Use it without a rend attribute.
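The typographical elements and the two exceptions might be applied as in this sketch (all sentences are invented for illustration):

```xml
<p>The notice was headed <b>Caution</b> and signed <smcap>The Printer</smcap>.</p>
<p>Water is written H<sub>2</sub>O; the date was the 4<sup>th</sup> of May,
  with the year <ul>underlined</ul>.</p>
<!-- certain to be a title, and a typographically distinct foreign phrase -->
<p>He read <title>The Tempest</title> aloud, <foreign>sotto voce</foreign>.</p>
<!-- italicized, but not certainly a title: mark the physical change only -->
<p>She praised <i>the northern voyage</i> at length.</p>
<!-- a feature not covered by the six elements: hi without rend -->
<p>The first word was set in <hi>blackletter</hi>.</p>
```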
Phrase-level Quotation
For passages set off by quotation marks or another delimiter, use the q element,
including the delimiter inside the tag.
Alignment and Indentation
If the alignment of an element seems remarkable, set the element's rend attribute
to an appropriate value (normally center, right, left, etc.).
However, when an element's semantics already account for its alignment, a description of the alignment is not
necessary. Headings, for instance, do not need to be marked as being centered.
To indicate level of indentation (often in verse), use numerical arguments to
indent, as in indent(1), indent(-1), and so on.
Uncertain Segments
The seg element is the phrase-level analogue to the ab element. If a
phrase-level feature seems to be present but its identity is hard to fathom, use this element.
This, again, is a last resort.
Alternately, when a passage of text is for some reason too hard to read, use the
unclear element, setting the reason attribute to an appropriate value.
When using unclear, surround the entire word with the tag if any part of it is
unclear (not just the illegible letter, say).
Unknown Glyphs
For cases in which it is unknown which character a given glyph corresponds to, mark the
glyph with the g element to indicate the uncertainty. By convention in Tite,
g represents any unknown glyph; no ref attribute is necessary. Note that
unknown glyphs are different from illegible text.
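The difference between the two cases can be sketched as follows (the sentences and the reason value are invented):

```xml
<!-- partly illegible word: wrap the whole word, not just the bad letter -->
<p>We <unclear reason="faded">remember</unclear> the day well.</p>

<!-- legible glyph whose character identity is unknown: empty g, no ref -->
<p>The recipe calls for three <g/> of salt.</p>
```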
Reference Systems
Encode page breaks (pb) at the start of each page, and encode breaks
even for blank pages. If the page is numbered, include the page number as the value of the
n attribute. No matter where the page number is printed on the page,
place the pb element at the top.
If marking column breaks, follow the same rules as for page breaks. Column breaks are
imagined to appear at the top of the column, at the beginning of the column's
text. The cols element exists to record a change in columnar layout. If such a change
occurs, mark the beginning of the new layout with cols and supply the new number of
columns as the value for the n attribute.
If line breaks are to be captured, use the lb element.
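Taken together, the milestone elements in this section might appear as in this sketch. The page numbers and text are invented, and the placement of cols immediately before the first cb is an assumption:

```xml
<pb n="12"/>
<!-- layout changes to two columns from this point -->
<cols n="2"/>
<cb/>
<p>Text of the first column, broken here<lb/>
  across two printed lines.</p>
<cb/>
<p>Text of the second column.</p>
<!-- blank, unnumbered page: still gets a page break -->
<pb/>
<pb n="14"/>
<!-- layout returns to a single column -->
<cols n="1"/>
<p>Text in a single column.</p>
```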
Appendices
TEI Tite and the Best Practices for TEI in Libraries
The
Best Practices for TEI in
Libraries ("BP") creates common definitions of levels of encoding based on depth
of markup applied. Because the levels of encoding provide a tremendously useful common set of
terms, it's helpful to situate TEI Tite according to them.
Mapped to BP levels, TEI Tite would sit between Level 3 and Level 4: it requires use of all the
elements from Level 3 plus additional ones, but requires fewer elements than Level 4.
Relative to Level 3 (Simple Analysis), Tite:
encourages the use of the rend attribute on typographically distinct text
(marked with hi), implicitly, through the provision of convenience elements
(i, b, etc.), and provides the title and foreign
elements for semantic markup of typographically distinct phrases; in Level 3, the
rend attribute is optional, and title and foreign are not provided
provides some genre-specific elements in addition to those for verse that Level 3
also provides (lg, l): sp, speaker, and stage
for drama, and the cols element especially for newspapers.
Level 4 (Basic Content Analysis) provides the most useful comparison for Tite.
The following items represent instances where Tite is less ambitious than Level 4:
Except in the case of the foreign and title elements, it is preferred
in Tite to describe typographical changes physically, rather than semantically; Tite uses
i, b, etc. where Level 4 uses emph, gloss, term.
Tite provides only q for quoted material, where Level 4 is more
discriminating, using quote, said, mentioned, soCalled.
Tite doesn't provide elements for editorial intervention, as Level 4 does:
choice, sic, corr.
Tite doesn't provide entity-specific naming elements, like persName,
placeName, orgName and their list- (listPerson, etc.) forms.
Bringing Tite-encoded documents up to BP Level 4 would simply require application
of additional markup, not significant reworking of markup, and in that way Tite is
compatible with the BP.
Do also keep in mind that Tite lacks both the teiHeader and root TEI
element used in TEI-conformant documents.
Formal specification
b (bold): for capturing typographical feature: bold glyphs.
i (italics): for capturing typographical feature: italicized glyphs.
ul (underline): for capturing typographical feature: underlined glyphs.
sub (subscript): for capturing typographical feature: subscript glyphs.
sup (superscript): for capturing typographical feature: superscript glyphs.
smcap (smallcaps): for capturing typographical feature: glyphs in small capitals.
cols (columns): with the n attribute (denoting new number of columns) is used to mark
where a document changes columnar layout.
ed (attribute of cols): indicates the edition or version in which the change in columnar layout is located at
this point.
ornament: for capturing typographical feature: printer's ornament, horizontal line, strings of
asterisks or periods, etc., indicating an informal division that does not call for a new
div element. If a horizontal rule or printer's ornament, use appropriate
rend attribute and leave the element empty; if the ornament can be represented
with characters, include these in the element.
Acknowledgments
The TEI Tite is simply a synthesis of work done at the University of Michigan Digital Library Production Service, University of Virginia Digital Library Production Service, and the California Digital Library and represented in their documents Minimum standards for text capture, Text Encoding Guidelines for Keyboarding Vendors, and CDL TEI Base Encoding Guidelines, respectively. Many thanks to the institutions and individuals responsible for sharing
their experience and expertise for the benefit of the TEI community at large.
Also, thank you to members of the TEI Special Interest Group on Libraries who provided very
valuable corrections and suggestions.