TEI Lite:
An Introduction to Text Encoding for Interchange
Lou Burnard
C. M. Sperberg-McQueen
Document No: TEI U 5
June 1995, revised May 2002
Prefatory note
TEI Lite was the name adopted for what the TEI editors originally
conceived of as a simple demonstration of how the TEI encoding scheme
might be adopted to meet 90% of the needs of 90% of the TEI user
community. In retrospect, it was predictable that many people should
imagine TEI Lite to be all there is to TEI, or find TEI Lite to be far
too heavy for their needs (to meet the latter criticism, Michael also
prepared a special barebones version of TEI
Lite).
TEI Lite was based largely on our observations of existing and
previous practice in the encoding of texts, particularly as manifest
in the collections of the Oxford
Text Archive and in our own experience. It is therefore
unsurprising that it seems to have become, if not a de facto standard,
at least a common point of departure for electronic text centres and
encoding projects worldwide. Maybe the fact that we actually produced
this shortish, readable manual for it also helped.
That manual was, of course, authored and is maintained in the DTD
it describes, originally as an SGML document and now as an XML
document. This makes it easy to produce a number of differently
formatted versions in HTML, PDF, etc., some of which can be found
in The TEI Vault.
Early adopters of TEI Lite included a number of
‘Electronic Text Centers’, many of whom produced
their own documentation and tutorial materials (some examples are
listed in the TEI Tutorials
pages).
With the publication of TEI P4, the XML version of the TEI
Guidelines, which uses the generation of TEI Lite as an example of the
Modification mechanism built into the TEI Guidelines, the opportunity
has been taken to produce a lightly revised version of the present
document. This revision documents the XML version of the
TEI Lite DTD.
Lou Burnard, May 2002
This document provides an
introduction to the recommendations of the Text Encoding Initiative
(TEI), by describing a manageable subset of the full TEI encoding
scheme. The scheme documented here can be used to encode a wide
variety of commonly encountered textual features, in such a way as to
maximize the usability of electronic transcriptions and to facilitate
their interchange among scholars using different computer systems. It
is also fully compatible with the full TEI scheme, as defined by TEI
document P4, Guidelines for Electronic Text Encoding and
Interchange, published in May 2002, and available from the TEI
Consortium website at
http://www.tei-c.org/cms/Guidelines/P4/html/index.html.
Introduction
The Text
Encoding Initiative (TEI) Guidelines are addressed to anyone who wants
to interchange information stored in an electronic form. They
emphasize the interchange of textual information, but other forms of
information such as images and sound are also addressed. The
Guidelines are equally applicable in the creation of new resources and
in the interchange of existing ones.
The Guidelines provide a
means of making explicit certain features of a text in such a way as
to aid the processing of that text by computer programs running on
different machines. This process of making explicit we call
markup or encoding. Any textual
representation on a computer uses some form of markup; the TEI came
into being partly because of the enormous variety of mutually
incomprehensible encoding schemes currently besetting scholarship, and
partly because of the expanding range of scholarly uses now being
identified for texts in electronic form.
The TEI Guidelines describe an encoding scheme which can be
expressed using a number of different formal languages. The first
editions of the Guidelines used the Standard Generalized
Markup Language (SGML); the most recent edition
(TEI P4, 2002) can also be expressed in the Extensible Markup
Language (XML); future versions may also be expressible in other
schema languages. Such languages have in common
the definition of text in terms of elements and attributes, and rules
governing their appearance within a text. The TEI's use of XML is
ambitious in its complexity and generality, but it is fundamentally no
different from that of any other XML markup scheme, and so any
general-purpose XML-aware software is able to process TEI-conformant
texts.
The TEI was sponsored by the Association for Computers and
the Humanities, the Association for Computational Linguistics, and the
Association for Literary and Linguistic Computing, and is now
maintained and developed by an independent membership consortium, hosted by
four major Universities. Funding has been
provided in part from the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European
Communities, the Andrew W. Mellon Foundation, and the Social Science
and Humanities Research Council of Canada. The Guidelines were
first published in May 1994, after six years of development involving many
hundreds of scholars from different academic disciplines
worldwide. During the years that followed, the Guidelines were
increasingly influential in the development of the
digital library, in the language industries, and even in the
development of the World Wide Web itself. The TEI Consortium was
set up in January 2001, and a year later produced the current
edition of the Guidelines, which has been entirely
revised for XML compatibility.
At the outset of its work, the overall goals of the
TEI were defined by the closing statement of a planning conference
held at Vassar College, N.Y., in November, 1987; these
‘Poughkeepsie Principles’ were further elaborated
in a series of design documents. The Guidelines, say these design
documents,
should:
-
suffice to represent the textual features needed for
research;
-
be simple, clear, and concrete;
-
be easy for researchers to use without special-purpose
software;
-
allow the rigorous definition and efficient processing of
texts;
-
provide for user-defined extensions;
-
conform to existing and emergent standards.
The world of scholarship is large and diverse. For the Guidelines
to have wide acceptability, it was important to ensure that:
-
the common core of textual features be easily shared;
-
additional specialist features be easy to add to (or remove
from) a text;
-
multiple parallel encodings of the same feature should be
possible;
-
the richness of markup should be user-defined, with a very
small minimal requirement;
-
adequate documentation of the text and its encoding should be
provided.
The present document describes a manageable selection from the
extensive set of elements and recommendations resulting from
those design goals, which is called TEI Lite.
In selecting from the several hundred elements defined by
the full TEI scheme, we have tried to identify a useful ‘starter
set’, comprising the elements which almost every user should
know about. Experience working with TEI Lite will be invaluable in
understanding the full TEI DTD and in knowing which optional parts of
the full DTD are necessary for work with particular types of text.
Our goals in defining this subset may be summarized as follows:
-
it should include most of the TEI ‘core’
tag set, since this contains elements relevant to virtually all text
types and all kinds of text-processing work;
-
it should be able to handle adequately a reasonably wide
variety of texts, at the level of detail found in existing practice
(as demonstrated in, for example, the holdings of the Oxford Text
Archive);
-
it should be useful for the production of new documents as well
as encoding of existing ones;
-
it should be usable with a wide range of existing XML
software;
-
it should be derivable from the full TEI DTD using the
extension mechanisms described in the TEI Guidelines;
-
it should be as small and simple as is consistent with the
other goals.
The reader may judge our success in meeting these goals for him or
herself. At the time of writing (1995), our confidence that we have at least
partially done so is borne out by its use in practice for the encoding
of real texts. The Oxford Text Archive uses TEI Lite when it
translates texts from its holdings from their original markup schemes
into SGML; the Electronic Text Centers at the University of Virginia
and the University of Michigan have used TEI Lite to encode their
holdings. And the Text Encoding Initiative itself uses TEI Lite, in
its current technical documentation — including this document.
Although we have tried to make this document self-contained, as
suits a tutorial text, the reader should be aware that it does not
cover every detail of the TEI encoding scheme. All of the elements
described here are fully documented in the TEI Guidelines themselves,
which should be consulted for authoritative reference information on
these, and on the many others which are not described here. Some
basic knowledge of XML is assumed.
A Short Example
We begin with a short example, intended to show what happens when
a passage of prose is typed into a computer by someone with little
sense of the purpose of mark-up, or the potential of electronic texts.
In an ideal world, such output might be generated by a very accurate
optical scanner. It attempts to be faithful to the appearance of the
printed text, by retaining the original line breaks, by introducing
blanks to represent the layout of the original headings and page
breaks, and so forth. Where characters not available on the keyboard
are needed (such as the accented letter a in
faàl or the long dash), it attempts to
mimic their appearance.
CHAPTER 38
READER, I married him. A quiet wedding we had: he and I, the par-
son and clerk, were alone present. When we got back from church, I
went into the kitchen of the manor-house, where Mary was cooking
the dinner, and John cleaning the knives, and I said --
'Mary, I have been married to Mr Rochester this morning.' The
housekeeper and her husband were of that decent, phlegmatic
order of people, to whom one may at any time safely communicate a
remarkable piece of news without incurring the danger of having
one's ears pierced by some shrill ejaculation and subsequently stunned
by a torrent of wordy wonderment. Mary did look up, and she did
stare at me; the ladle with which she was basting a pair of chickens
roasting at the fire, did for some three minutes hang suspended in air,
and for the same space of time John's knives also had rest from the
polishing process; but Mary, bending again over the roast, said only --
'Have you, miss? Well, for sure!'
A short time after she pursued, 'I seed you go out with the master,
but I didn't know you were gone to church to be wed'; and she
basted away. John, when I turned to him, was grinning from ear to
ear.
'I telled Mary how it would be,' he said: 'I knew what Mr Ed-
ward' (John was an old servant, and had known his master when he
was the cadet of the house, therefore he often gave him his Christian
name) -- 'I knew what Mr Edward would do; and I was certain he
would not wait long either: and he's done right, for aught I know. I
wish you joy, miss!' and he politely pulled his forelock.
'Thank you, John. Mr Rochester told me to give you and Mary
this.'
I put into his hand a five-pound note. Without waiting to hear
more, I left the kitchen. In passing the door of that sanctum some time
after, I caught the words --
'She'll happen do better for him nor ony o' t' grand ladies.' And
again, 'If she ben't one o' th' handsomest, she's noan faa\l, and varry
good-natured; and i' his een she's fair beautiful, onybody may see
that.'
I wrote to Moor House and to Cambridge immediately, to say what
I had done: fully explaining also why I had thus acted. Diana and
474
JANE EYRE 475
Mary approved the step unreservedly. Diana announced that she
would just give me time to get over the honeymoon, and then she
would come and see me.
'She had better not wait till then, Jane,' said Mr Rochester, when I
read her letter to him; 'if she does, she will be too late, for our honey-
moon will shine our life long: its beams will only fade over your
grave or mine.'
How St John received the news I don't know: he never answered
the letter in which I communicated it: yet six months after he wrote
to me, without, however, mentioning Mr Rochester's name or allud-
ing to my marriage. His letter was then calm, and though very serious,
kind. He has maintained a regular, though not very frequent correspond-
ence ever since: he hopes I am happy, and trusts I am not of those who
live without God in the world, and only mind earthly
things.
This transcription suffers from a number of shortcomings:
-
the page numbers and running titles are intermingled with the
text in a way which makes it difficult for software to disentangle
them;
-
no distinction is made between single quotation marks and
apostrophe, so it is difficult to know exactly which passages are in
direct speech;
-
the preservation of the copy text's hyphenation means that
simple-minded search programs will not find the broken words;
-
the accented letter in faàl and
the long dash have been rendered by ad hoc keying conventions which
follow no standard pattern and will be processed correctly only if the
transcriber remembers to mention them in the documentation;
-
paragraph divisions are marked only by the use of white space,
and hard carriage returns have been introduced at the end of each
line. Consequently, if the size of type used to print the text
changes, reformatting will be problematic.
We now present the same passage, as it might be encoded using the
TEI Guidelines. As we shall see, there are many ways in which this
encoding could be extended, but as a minimum, the TEI approach allows
us to represent the following distinctions:
-
Paragraph divisions are now marked explicitly.
-
Apostrophes are distinguished from quotation marks.
-
Entity references are used for the accented letter and the long
dash.
-
Page divisions have been marked with an empty
<pb>
element alone.
-
To simplify searching and processing, the lineation of the
original has not been retained and words broken by typographic
accident at the end of a line have been re-assembled without comment.
If the original lineation were of interest, as it might be for an
important printing, it could easily be recorded, though it has not
been here.
-
For convenience of proof reading, a new line has been
introduced at the start of each paragraph, but the indentation is
removed.
<pb n='474'/>
<div type="chapter" n='38'>
<p>Reader, I married him. A quiet wedding we had: he and I,
the parson and clerk, were alone present. When we got back
from church, I went into the kitchen of the manor-house,
where Mary was cooking the dinner, and John cleaning the
knives, and I said &mdash;</p>
<p><q>Mary, I have been married to Mr Rochester this
morning.</q> The housekeeper and her husband were of that
decent, phlegmatic order of people, to whom one may at any
time safely communicate a remarkable piece of news without
incurring the danger of having one's ears pierced by some
shrill ejaculation and subsequently stunned by a torrent of
wordy wonderment. Mary did look up, and she did stare at
me; the ladle with which she was basting a pair of chickens
roasting at the fire, did for some three minutes hang
suspended in air, and for the same space of time John's
knives also had rest from the polishing process; but Mary,
bending again over the roast, said only &mdash;</p>
<p><q>Have you, miss? Well, for sure!</q></p>
<p>A short time after she pursued, <q>I seed you go out with
the master, but I didn't know you were gone to church to be
wed</q>; and she basted away. John, when I turned to him,
was grinning from ear to ear. <q>I telled Mary how it would
be,</q> he said: <q>I knew what Mr Edward</q> (John was an
old servant, and had known his master when he was the cadet
of the house, therefore he often gave him his Christian
name) &mdash; <q>I knew what Mr Edward would do; and I was
certain he would not wait long either: and he's done right,
for aught I know. I wish you joy, miss!</q> and he politely
pulled his forelock.</p>
<p><q>Thank you, John. Mr Rochester told me to give you and
Mary this.</q></p>
<p>I put into his hand a five-pound note. Without waiting
to hear more, I left the kitchen. In passing the door of
that sanctum some time after, I caught the words &mdash;</p>
<p><q>She'll happen do better for him nor ony o' t' grand
ladies.</q> And again, <q>If she ben't one o' th'
handsomest, she's noan fa&agrave;l, and varry good-natured;
and i' his een she's fair beautiful, onybody may see
that.</q></p>
<p>I wrote to Moor House and to Cambridge immediately, to
say what I had done: fully explaining also why I had thus
acted. Diana and <pb n='475'/> Mary approved the step
unreservedly. Diana announced that she would just give me
time to get over the honeymoon, and then she would come and
see me.</p>
<p><q>She had better not wait till then, Jane,</q> said Mr
Rochester, when I read her letter to him; <q>if she does,
she will be too late, for our honeymoon will shine our life
long: its beams will only fade over your grave or mine.</q></p>
<p>How St John received the news I don't know: he never
answered the letter in which I communicated it: yet six
months after he wrote to me, without, however, mentioning Mr
Rochester's name or alluding to my marriage. His letter was
then calm, and though very serious, kind. He has maintained
a regular, though not very frequent correspondence ever
since: he hopes I am happy, and trusts I am not of those who
live without God in the world, and only mind earthly things.</p>
</div>
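As a hedged illustration of the alternative just mentioned: had the
original lineation been of interest, a short paragraph of the passage
might instead have been encoded as in the following sketch, which uses
the empty <lb> element (described in section Page and Line Numbers) to
mark the start of each typographic line. The exact placement of the
<lb> tags relative to the <p> and <q> tags is our assumption; other
placements are possible.
<p><lb/><q>Thank you, John. Mr Rochester told me to give you and Mary
<lb/>this.</q></p>
Words broken across lines by hyphenation, such as the one at the end
of the first printed line of the chapter, would need an explicit
policy, which this sketch deliberately avoids.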
The decision to focus on Brontë's text, rather than on the
printing of it in this particular edition, is one aspect of a
fundamental encoding issue: that of selectivity. An encoding makes
explicit only those textual features of importance to the encoder. It
is not difficult to think of ways in which the encoding of even this
short passage might readily be extended. For example:
-
a regularized form of the passages in dialect could be
provided;
-
footnotes glossing or commenting on any passage could be
added;
-
pointers linking parts of this text to others could be added;
-
proper names of various kinds could be distinguished from the
surrounding text;
-
detailed bibliographic information about the text's provenance
and context could be prefixed to it;
-
a linguistic analysis of the passage into sentences, clauses,
words, etc., could be provided, each unit being associated with
appropriate category codes;
-
the text could be segmented into narrative or discourse
units;
-
systematic analysis or interpretation of the text could be
included in the encoding, with potentially complex alignment or
linkage between the text and the analysis, or between the text and one
or more translations of it;
-
passages in the text could be linked to images or sound held on
other media.
The TEI-recommended way of carrying all of these out is described
in the remainder of this document. The TEI scheme as a whole also
provides for an enormous range of other possibilities, of which we
cite only a few:
-
detailed analysis of the components of names;
-
detailed meta-information providing thesaurus-style information
about the text's origins or topics;
-
information about the printing history or manuscript variations
exhibited by a particular series of versions of the text.
For recommendations on these and many other possibilities, the
full Guidelines should be consulted.
The Structure of a TEI Text
All TEI-conformant texts contain (a) a TEI header
(marked up as a
<teiHeader>
element) and (b) the transcription
of the text proper (marked up as a
<text>
element).
The TEI header provides information analogous to that provided by
the title page of a printed text. It has up to four parts: a
bibliographic description of the machine-readable text, a description
of the way it has been encoded, a non-bibliographic description of the
text (a text profile), and a revision history. The
header is described in more detail in section The Electronic Title
Page.
A TEI text may be unitary (a single work) or
composite (a collection of single works, such as an
anthology). In either case, the text may have an optional front
or back. In between is the body of the
text, which, in the case of a composite text, may consist of
groups, each containing more groups or texts.
A unitary text will be encoded using an overall structure like
this:
<TEI.2>
<teiHeader> [ TEI Header information ] </teiHeader>
<text>
<front> [ front matter ... ] </front>
<body> [ body of text ... ] </body>
<back> [ back matter ... ] </back>
</text>
</TEI.2>
A composite text also has an optional front and back. In between
occur one or more groups of texts, each with its own optional front
and back matter. A composite text will thus be encoded using an
overall structure like this:
<TEI.2>
<teiHeader> [ header information for the composite ] </teiHeader>
<text>
<front> [ front matter for the composite ] </front>
<group>
<text>
<front> [ front matter of first text ] </front>
<body> [ body of first text ] </body>
<back> [ back matter of first text ] </back>
</text>
<text>
<front> [ front matter of second text] </front>
<body> [ body of second text ] </body>
<back> [ back matter of second text ] </back>
</text>
[ more texts or groups of texts here ]
</group>
<back> [ back matter for the composite ] </back>
</text>
</TEI.2>
It is also possible to define a composite of TEI texts, each with
its own header. Such a collection is known as a
TEI corpus,
and may itself have a header:
<teiCorpus>
<teiHeader> [header information for the corpus]</teiHeader>
<TEI.2>
<teiHeader>[header information for first text]</teiHeader>
<text> [first text in corpus] </text>
</TEI.2>
<TEI.2>
<teiHeader>[header information for second text]</teiHeader>
<text> [second text in corpus] </text>
</TEI.2>
</teiCorpus>
It is not however possible to create a composite of corpora --
that is, a number of
<teiCorpus>
elements combined together
and treated as a single object. This is a restriction of the current
version of the TEI Guidelines.
In the remainder of this document, we discuss chiefly simple text
structures. The discussion in each case consists of a short list of
relevant TEI elements with a brief definition of each,
followed by definitions for any attributes specific to
that element. In most cases, short examples are also given.
Encoding the Body
As indicated above, a simple TEI document at the textual level
consists of the following elements:
-
<front>
- contains any prefatory matter (headers, title page, prefaces,
dedications, etc.) found before the start of a text proper.
-
<group>
- contains a number of unitary texts or groups of texts.
-
<body>
- contains the whole body of a single unitary text, excluding any
front or back matter.
-
<back>
- contains any appendixes, etc., following the main part of a
text.
Elements specific to front and back matter are described
below in section
Front and Back Matter. In this section we discuss
the elements making up the body of a text.
Text Division Elements
The body of a prose text may be just a series of paragraphs, or
these paragraphs may be grouped together into chapters, sections,
subsections, etc. In the former case, each paragraph is tagged using
the
<p>
tag. In the latter case, the
<body>
may be
divided either into a series of
<div1>
elements, or into a
series of
<div>
elements, either of which may be further subdivided, as
discussed below:
-
<p>
- marks paragraphs in prose.
-
<div>
- contains a subdivision of the front, body, or back of a text.
-
<div1>
- contains a first-level subdivision of the front, body, or back
of a text (the largest, if
<div0>
is not used, the second largest if it is).
When structural subdivisions smaller than a
<div1>
are
necessary, a
<div1>
may be divided into
<div2>
elements, a
<div2>
into smaller
<div3>
elements, etc.,
down to the level of
<div7>
. If more than seven levels of
structural division are present, one must either modify the TEI tag
set to accept
<div8>
, etc., or else use the unnumbered
<div>
element: a
<div>
may be subdivided by smaller
<div>
elements, without limit to the depth of nesting.
All these
division elements take the following three
attributes:
-
type
- This indicates the conventional name for this category of text
division. Its value will typically be ‘Book’,
‘Chapter’,
‘Poem’, etc. Other possible values
include ‘Group’ for groups of poems,
etc., treated as a single unit, ‘Sonnet’,
‘Speech’, and ‘Song’. Note that whatever value is supplied for the type
attribute of the first
<div>
,
<div1>
,
<div2>
,
etc., in a text is assumed to apply for all subsequent
<div>
,
<div1>
s (etc.) within the same
<body>
. This implies
that a value must be given for the first division element of each
type, or whenever the value changes.
-
id
- This specifies a unique identifier for the division, which may
be used for cross references or other links to it, such as a
commentary, as further discussed in section Cross References and
Links. It is
often useful to provide an id attribute for every
major structural unit in a text, and to derive the ID values in some
systematic way, for example by appending a section number to a short
code for the title of the work in question, as in the examples below.
-
n
- The n attribute specifies a mnemonic short name
or number for the division, which can be used to identify it in
preference to the value given for the id attribute. If a conventional form of reference or
abbreviation for the parts of a work already exists (such as the
book/chapter/verse pattern of Biblical citations), the n
attribute is the place to record it.
The attributes
id and
n,
indeed, are so widely useful that they are allowed on any element in
any TEI DTD: they are
global attributes. Other global
attributes defined in the TEI Lite scheme are discussed in section
Linking Attributes.
The value of every
id attribute must be unique
within a document. One simple way of ensuring that this is so is to
make it reflect the hierarchic structure of the document. For example,
Smith's
Wealth of Nations as first published consists
of five books, each of which is divided into chapters, while some
chapters are further subdivided into parts. We might define
id values for this structure as follows:
<div1 id="WN1" n="I" type="book">
<div2 id="WN101" n="I.1" type="chapter">
... </div2>
<div2 id="WN102" n="I.2" type="chapter">
... </div2>
...
<div2 id="WN110" n="I.10" type="chapter">
<div3 id="WN1101" n="I.10.1" type="part">
... </div3>
<div3 id="WN1102" n="I.10.2" type="part">
... </div3>
</div2>
...
</div1>
<div1 id="WN2" n="II" type="book">
....
</div1>
...
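The same hierarchy could, in principle, be expressed with the
unnumbered <div> element, since a <div> may contain further <div>
elements to any depth. The following is only a sketch of that
alternative, reusing the id, n and type values from the example above;
numbered and unnumbered divisions should not be mixed within the same
text.
<div id="WN1" n="I" type="book">
<div id="WN101" n="I.1" type="chapter">
... </div>
...
<div id="WN110" n="I.10" type="chapter">
<div id="WN1101" n="I.10.1" type="part">
... </div>
<div id="WN1102" n="I.10.2" type="part">
... </div>
</div>
...
</div>
With this style the depth of a division is implied by its nesting
rather than by its element name, so the type attribute carries more of
the burden of saying what each level represents.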
A different numbering scheme may be used for
id and
n attributes: this is often useful where a canonical
reference scheme is used which does not tally with the structure of
the work. For example, in a novel divided into books each containing
chapters, where the chapters are numbered sequentially through the
whole work, rather than within each book, one might use a scheme such
as the following:
<div1 id="TS01" n="1" type="Volume">
<div2 id="TS011" n="1" type="Chapter">
... </div2>
<div2 id="TS012" n="2">
...</div2>
</div1>
<div1 id="TS02" n="2" type="Volume">
<div2 id="TS021" n="3" type="Chapter">
...</div2>
<div2 id="TS022" n="4">
...</div2>
</div1>
Here the work has two volumes, each containing two chapters.
The chapters are numbered conventionally 1 to 4, but the
id
values specified allow them to be regarded additionally as if they
were numbered 1.1, 1.2, 2.1, 2.2.
Headings and Closings
Every
<div>
,
<div1>
,
<div2>
, etc., may
have a title or heading at its start, and (less commonly) a closing
such as ‘End of Chapter 1’. The
following elements may be used to transcribe them:
-
<head>
- contains any heading, for example, the title of a section, or
the heading of a list or glossary.
-
<trailer>
- contains a closing title or footer appearing at the end of a
division of a text.
Some other elements which may be necessary at the beginning or
ending of text divisions are discussed below in section
Prefatory Matter.
Whether or not headings and trailers are included in a
transcription is a matter for the individual transcriber to decide.
Where a heading is completely regular (for example ‘Chapter 1’)
or has been given as an attribute value (e.g.
<div type="Chapter"
n="1">
), it may be omitted; where it contains otherwise
unrecoverable text it should always be included. For example, the
start of Hardy's
Under the Greenwood Tree might be
encoded as follows:
<div1 id="UGT1" n="Winter" type="Part">
<div2 id="UGT11" n="1" type="Chapter">
<head>Mellstock-Lane</head>
<p>To dwellers in a wood almost every species of tree ...
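Since the example above shows only a heading, here is a minimal sketch
of a division with both a heading and a closing. The chapter content
is a placeholder, and the wording of the heading and trailer is
hypothetical, borrowed from the ‘End of Chapter 1’ case mentioned
above.
<div type="Chapter" n="1">
<head>Chapter 1</head>
<p> [ text of the chapter ... ] </p>
<trailer>End of Chapter 1</trailer>
</div>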
Prose, Verse and Drama
As noted above, the paragraphs making up a textual division should
be tagged with the
<p>
tag. For example:
<body>
<p>I fully appreciate Gen. Pope's splendid achievements
with their invaluable results; but you must know that
Major Generalships in the Regular Army, are not as
plenty as blackberries.
</p>
</body>
A number of different tags are provided for the encoding of the
structural components of verse and performance texts (drama, film,
etc.):
-
<l>
- contains a single, possibly incomplete, line of verse.
Attributes include:
-
part
- specifies whether or not the line is metrically complete. Legal
values are:
F for the final part of an incomplete line,
Y if the line is metrically incomplete,
N if the line is complete, or if no claim is made as to
its completeness,
I for the initial part of an incomplete line,
M for a medial part of an incomplete line.
-
<lg>
- contains a group of verse lines functioning as a formal unit
e.g. a stanza, refrain, verse paragraph, etc.
-
<sp>
- contains an individual speech in a performance text, or a
passage presented as such in a prose or verse text. Attributes
include:
-
who
- identifies the speaker of the part by supplying an ID.
-
<speaker>
- contains a special form of heading or label, giving the name of
one or more speakers in a performance text or fragment.
-
<stage>
- contains any kind of stage direction within a performance text
or fragment. Attributes include:
-
type
- indicates the kind of stage direction. Suggested values include
entrance, exit, setting,
delivery, etc.
Here, for example, is the start of a poetic text in which verse
lines and stanzas are tagged:
<lg n="I">
<l>I Sing the progresse of a
deathlesse soule,</l>
<l>Whom Fate, with God made,
but doth not controule,</l>
<l>Plac'd in most shapes; all times
before the law</l>
<l>Yoak'd us, and when, and since,
in this I sing.</l>
<l>And the great world to his aged evening;</l>
<l>From infant morne, through manly noone I draw.</l>
<l>What the gold Chaldee, of silver Persian saw,</l>
<l>Greeke brass, or Roman iron, is in this one;</l>
<l>A worke t'out weare Seths pillars, bricke and stone,</l>
<l>And (holy writs excepted) made to yeeld to none,</l>
</lg>
Note that the
<l>
element marks verse lines, not typographic
lines: the original lineation of the first few lines above has not
therefore been made explicit by this encoding, and may be lost. The
<lb>
element described in section Page and Line Numbers may be
used to mark typographic lines if so desired.
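As a sketch of this, the first verse line of the example might be
encoded as follows if its typographic lineation were to be kept. We
assume here that the printed line break falls where the example above
wraps the line; that is a guess from the layout, not something the
encoding itself records.
<l>I Sing the progresse of a
<lb/>deathlesse soule,</l>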
Sometimes, particularly in dramatic texts, verse lines are split
between speakers. The easiest way of encoding this is to use the
part attribute to indicate that the lines so
fragmented are incomplete, as in this example:
<div1 type="Act" n="I"><head>ACT I</head>
<div2 type="Scene" n="1"><head>SCENE I</head>
<stage rend="italic">
Enter Barnardo and Francisco, two Sentinels, at several doors</stage>
<sp><speaker>Barn</speaker><l part="Y">Who's there?</l></sp>
<sp><speaker>Fran</speaker><l>Nay, answer me. Stand and unfold
yourself.</l></sp>
<sp><speaker>Barn</speaker><l part="I">Long live the King!</l></sp>
<sp><speaker>Fran</speaker><l part="M">Barnardo?</l></sp>
<sp><speaker>Barn</speaker><l part="F">He.</l></sp>
<sp><speaker>Fran</speaker><l>You come most carefully upon
your hour.</l></sp>
The same mechanism may be applied to stanzas which are divided
between two speakers:
<sp><speaker>First voice</speaker>
<lg type="stanza" part="I">
<l>But why drives on that ship so fast</l>
<l>Withouten wave or wind?</l>
</lg></sp>
<sp><speaker>Second Voice</speaker>
<lg part="F">
<l>The air is cut away before.</l>
<l>And closes from behind.</l>
</lg></sp>
This example shows how dialogue presented in a prose work as if it
were drama should be encoded. It also demonstrates the use of the
who attribute to bear a code identifying the speaker
of the piece of dialogue concerned:
<sp who="OPI"><speaker>The Reverend Doctor Opimian</speaker>
<p>I do not think I have named a single unpresentable fish.</p>
</sp>
<sp who="GRM"><speaker>Mr Gryll</speaker>
<p>Bream, Doctor: there is not much to be said for bream.</p>
</sp>
<sp who="OPI"><speaker>The Reverend Doctor Opimian</speaker>
<p>On the contrary, sir, I think there is much to be said for him.
In the first place....</p>
<p>Fish, Miss Gryll -- I could discourse to you on fish by
the hour: but for the present I will forbear.</p>
</sp>
Page and Line Numbers
Page and line breaks may be marked with the following empty
elements.
-
<pb>
- marks the boundary between one page of a text and the next in a
standard reference system.
-
<lb>
- marks the start of a new (typographic) line in some edition or
version of a text.
These elements mark a single point in the text, not a span
of text. The global
n attribute should be used to
supply the number of the page or line beginning at the tag. In
addition, these two elements share the following attribute:
-
ed
- indicates the edition or version in which the page break is
located at this point.
When working from a paginated original, it is often useful to
record its pagination, if only to simplify later proof-reading.
Recording the line breaks may be useful for the same reason; treatment
of end-of-line hyphenation in printed source texts will require some
consideration.
If pagination, etc., are marked for more than one edition, specify
the edition in question using the ed attribute, and supply as many
tags as are necessary. For example, in the following passage we
indicate where the page breaks occur in two different editions
(ED1 and ED2):
<p>I wrote to Moor House and to Cambridge immediately, to
say what I had done: fully explaining also why I had thus
acted. Diana and <pb ed="ED1" n="475"/> Mary approved the
step unreservedly. Diana announced that she would
<pb ed="ED2" n="485"/>just give me time to get over the
honeymoon, and then she would come and see me.</p>
The
<pb>
and
<lb>
elements are special cases of
the general class of
milestone elements which mark
reference points within a text. TEI Lite also includes a generic
<milestone>
element, which is not restricted to special cases
but can mark any kind of reference point: for example, a column
break, the start of a new kind of section not otherwise tagged, etc.
This element has the following description and attributes:
-
<milestone>
- marks the boundary between sections of a text, as indicated by
changes in a standard reference system. Attributes include:
-
ed
- indicates the
edition or version to which the milestone
applies.
-
unit
- indicates what kind of section is changing at this milestone.
The names used for types of unit and for editions referred to by
the ed and unit attributes may be chosen
freely, but should be documented in the header.
The
<milestone>
element may be used to replace the others,
or the others may be used as a set; they should not be mixed
arbitrarily.
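As a sketch of the first option, the page breaks from the passage
quoted earlier in this section could be recorded with the generic
element alone. The unit value page is our own choice of name and, like
the edition codes ED1 and ED2, would need to be documented in the
header:

```xml
<!-- unit="page" is an assumed name; unit names may be chosen freely -->
<p>I wrote to Moor House and to Cambridge immediately, to
say what I had done: fully explaining also why I had thus
acted. Diana and <milestone ed="ED1" unit="page" n="475"/> Mary
approved the step unreservedly. Diana announced that she would
<milestone ed="ED2" unit="page" n="485"/>just give me time to get
over the honeymoon, and then she would come and see me.</p>
```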
Marking Highlighted Phrases
Changes of Typeface, etc.
Highlighted words or phrases are those made visibly different from
the rest of the text, typically by a change of type font, handwriting
style, or ink color, intended to draw the reader's attention to
them.
The global rend attribute can be attached to any
element, and used wherever necessary to specify details of the
highlighting used for it. For example, a heading rendered in bold
might be tagged <head rend="bold">, and one in
italic <head rend="italic">.
It is not always possible or desirable to interpret the reasons
for such changes of rendering in a text. In such cases, the element
<hi>
may be used to mark a sequence of highlighted text
without making any claim as to its status.
-
<hi>
- marks a word or phrase as graphically distinct from the
surrounding text, for reasons concerning which no claim is
made.
In the following example, the use of a distinct typeface for the
subheading and for the included name are recorded but not interpreted:
<p><hi rend="gothic">And this Indenture further witnesseth</hi>
that the said <hi rend="italic">Walter Shandy</hi>, merchant,
in consideration of the said intended marriage ...</p>
Alternatively, where the cause for the highlighting can be
identified with confidence, a number of other, more specific, elements
are available.
-
<emph>
- marks words or phrases which are stressed or emphasized for
linguistic or rhetorical effect.
-
<foreign>
- identifies a word or phrase as belonging to some language other
than that of the surrounding text.
-
<mentioned>
- marks words or phrases mentioned, not used.
-
<term>
- contains a single-word, multi-word or symbolic designation
which is regarded as a technical term.
-
<title>
- contains the title of a work, whether article, book, journal,
or series, including any alternative titles or subtitles. Attributes
include:
-
level
- indicates whether this is the title of an article, book,
journal, series, or unpublished material. Legal values are:
m for monographic title (book, collection, or other item
published as a distinct item, including single volumes of multi-volume
works); s (series title); j (journal title);
u for title of unpublished material (including theses and
dissertations unless published by a commercial press); a for
analytic title (article, poem, or other item published as part of a
larger item).
-
type
- classifies the title according to some convenient typology.
Sample values include:
abbreviated, main, subordinate
(for subtitles and titles of parts), and parallel (for
alternate titles, often in another language, by which the work is also
known).
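The level and type attributes might be combined as in the following
sketch. The article title is invented purely for illustration, and
the choice of abbreviated for the short form of a longer book title
is an assumption about how an encoder might apply the sample values:

```xml
<!-- the article title is invented; the other two titles are real -->
<p>See <title level="a">A Note on Verse Lines</title> in
<title level="j">Computers and the Humanities</title>, and compare
<title level="m" type="abbreviated">Tristram Shandy</title>.</p>
```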
Some features (notably quotations and glosses) may be found in a
text either marked by highlighting, or with quotation marks. In
either case, the elements
<q>
and
<gloss>
(as
discussed in the following section) should be used. If the rendition
is to be recorded, use the global rend attribute.
As an example of the elements defined here, consider the following
sentence:
On the one hand the Nibelungenlied
is associated with the new rise of romance of twelfth-century France,
the romans d'antiquité, the romances of Chrétien
de Troyes, and the German adaptations of these works by Heinrich van
Veldeke, Hartmann von Aue, and Wolfram von Eschenbach.
Interpreting the role of the highlighting, the sentence might
look like this:
<p>On the one hand the <title>Nibelungenlied</title> is associated
with the new rise of romance of twelfth-century France, the
<foreign>romans d'antiquité</foreign>, the romances of
Chrétien de Troyes, ...</p>
Describing only the appearance of the original, it might look
like this:
<p>On the one hand the <hi rend="italic">Nibelungenlied</hi>
is associated with the new rise of romance of twelfth-century
France, the <hi rend="italic">romans
d'antiquité</hi>, the romances of
Chrétien de Troyes, ...</p>
Quotations and Related Features
Like changes of typeface, quotation marks are conventionally used
to denote several different features within a text, of which the most
frequent is quotation. When possible, we recommend that the
underlying feature be tagged, rather than the simple fact that
quotation marks appear in the text, using the following elements:
-
<q>
- contains a quotation or apparent quotation, that is, a representation
of speech or thought marked as being quoted from someone else (whether
in fact quoted or not); in narrative, the words are usually those of
a character or speaker; in dictionaries,
<q>
may be used to
mark real or contrived examples of usage. Attributes include:
-
type
- may be used to indicate whether the quoted matter is spoken or
thought, or to characterize it more finely. Sample values include:
spoken (for representation of direct speech, usually marked
by quotation marks) and thought (for representation of
thought, e.g. internal monologue).
-
who
- identifies the speaker of a piece of direct speech.
-
<mentioned>
- marks words or phrases mentioned, not used.
-
<soCalled>
- contains a word or phrase for which the author or narrator
indicates a disclaiming of responsibility, for example by the use of
scare quotes or italics.
-
<gloss>
- marks a word or phrase which provides a gloss or definition for
some other word or phrase. Attributes include:
-
target
- identifies the associated word or phrase.
Here is a simple example of a quotation:
<p>Few dictionary makers are likely to forget
Dr. Johnson's description of the
lexicographer as <q>a harmless drudge.</q></p>
To record how a quotation was printed (for example,
in-line or set off as a display or
block quotation), the rend attribute
should be used. This may also be used to indicate the kind of
quotation marks used.
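For instance, if the same quotation had been set off as a display
in the source, that fact might be recorded as in the sketch below.
The value block is an assumed convention, not a prescribed one, and
the introductory wording of the sentence is ours:

```xml
<!-- rend="block" is an assumed value meaning "set off as a display" -->
<p>Dr. Johnson describes the lexicographer as
<q rend="block">a harmless drudge.</q></p>
```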
Direct speech interrupted by a narrator can be represented simply
by ending the quotation and beginning it again after the interruption,
as in the following example:
<p><q>Who-e debel you?</q> — he at last said — <q>you
no speak-e, damme, I kill-e.</q> And so saying, the lighted
tomahawk began flourishing about me in the dark.</p>
If it is important to convey the idea that the two
<q>
elements together reproduce a single speech, the linking attributes
next and
prev may be used, as described in section
Linking Attributes.
Quotations may be accompanied by a reference to the source or
speaker, using the
who attribute, whether or not the
source is given in the text, as in the following example:
<q who="Wilson">Spaulding, he came down into the office just this
day eight weeks with this very paper in his hand, and he
says:—<q who="Spaulding">I wish to the Lord, Mr. Wilson, that
I was a red-headed man.</q></q>
This example also demonstrates how quotations may be embedded
within other quotations: one speaker (Wilson) quotes another speaker
(Spaulding).
The creator of the electronic text must decide whether quotation
marks are replaced by the tags or whether the tags are added and the
quotation marks kept. If the quotation marks are removed from the
text, the rend attribute may be used to record the way
in which they were rendered in the copy text.
As with highlighting, it is not always possible and may not be
considered desirable to interpret the function of quotation marks in a
text in this way. In such cases, the tag
<hi rend="quoted">
might be used to mark quoted text without making any claim as to its
status.
Foreign Words or Expressions
Words or phrases which are not in the main language of the text
may be tagged as such in one of two ways. If the word or phrase is
already tagged for some reason, the element indicated should bear a
value for the global
lang attribute indicating the
language used. Where there is no applicable element, the element
<foreign>
may be used, again using the
lang
attribute. For example:
<p>John has real <foreign lang="fra">savoir-faire</foreign>.</p>
<p>Have you read <title lang="deu">Die Dreigroschenoper</title>?</p>
<p><mentioned lang="fra">Savoir-faire</mentioned> is French for know-how.</p>
<p>The court issued a writ of <term lang="lat">mandamus</term>.</p>
As these examples show, the
<foreign>
element should not
be used to tag foreign words if some other more specific element such
as
<title>
,
<mentioned>
, or
<term>
applies.
The global lang attribute may be attached to any
element to show that it uses some other language than that of the
surrounding text.
Notes
All notes, whether printed as footnotes, endnotes, marginalia, or
elsewhere, should be marked using the same element:
-
<note>
- contains a note or annotation. Attributes include:
-
type
- describes the type of note.
-
resp
- indicates who is responsible for the annotation: author,
editor, translator, etc. The value might be
author,
editor, etc., or the initials of the individual who
added the annotation.
-
place
- indicates where the note appears in the source text. Sample
values include inline, interlinear, left,
right, foot, and end, for
notes which appear as marked paragraphs in the body of the text,
between the lines, in the left or right margin, at the foot of the
page, or at the end of the chapter or volume, respectively.
-
target
- indicates the point of attachment of a note, or the beginning
of the span to which the note is attached.
-
targetEnd
- points to the end of the span to which the note is attached, if
the note is not embedded in the text at that point.
-
anchored
- indicates whether the copy text shows the exact place of
reference for the note.
Where possible, the body of a note should be inserted in the
text at the point at which its identifier or mark first appears. This
may not be possible for example with marginalia, which may not be
anchored to an exact location. For simplicity, it may be adequate to
position marginal notes before the relevant paragraph or other
element. Notes may also be placed in a separate division of the text
(as end-notes are, in printed books) and linked to the relevant
portion of the text using their
target attribute.
The n attribute may be used to supply the number
or identifier of a note if this is required. The resp
attribute should be used consistently to distinguish between authorial
and editorial notes, if the work has both kinds; otherwise, the TEI
header should state which kind they are.
Examples:
<p>Collections are ensembles of distinct
entities or objects of any sort.
<note place="foot" n="1">
We explain below why we use the uncommon term
<mentioned>collection</mentioned>
instead of the expected <mentioned>set</mentioned>.
Our usage corresponds to the <mentioned>aggregate</mentioned>
of many mathematical writings and to the sense of
<mentioned>class</mentioned> found
in older logical writings.
</note>
The elements ...</p>
<lg id="RAM609">
<note place="margin">The curse is finally expiated</note>
<l>And now this spell was snapt: once more</l>
<l>I viewed the ocean green,</l>
<l>And looked far forth, yet little saw</l>
<l>Of what had else been seen —</l>
</lg>
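Where notes are instead gathered into a separate division, the
target attribute can link each one back to its point of attachment.
The following is only a sketch: the identifiers N1 and N1REF, the
division type notes, and the attribution of the note to the author
are all assumptions made for illustration:

```xml
<p>Collections are ensembles of distinct
entities or objects of any sort.<anchor id="N1REF"/>
The elements ...</p>
<!-- ... remainder of the text ... -->
<div type="notes"><head>Notes</head>
<!-- target points back at the anchor marking the point of attachment -->
<note id="N1" n="1" place="end" resp="author" target="N1REF">
We explain below why we use the uncommon term
<mentioned>collection</mentioned>
instead of the expected <mentioned>set</mentioned>.
</note>
</div>
```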
Cross References and Links
Explicit cross references or links from one point in a text to
another in the same SGML document may be encoded using the elements
described in section Simple Cross References. References or links to
elements of some other SGML document, or to parts of non-SGML
documents, may be encoded using the TEI extended pointers
described in section Extended Pointers. Implicit links (such as
the association between two parallel texts, or that between a text and
its interpretation) may be encoded using the linking attributes
discussed in section Linking Attributes.
Simple Cross References
A cross reference from one point within a single document to
another can be encoded using either of the following elements:
-
<ref>
- a reference to another location in the current document, in
terms of one or more identifiable elements, possibly modified by
additional text or comment.
-
<ptr>
- a pointer to another location in the current document in terms
of one or more identifiable elements.
These elements share the following attributes:
-
target
- specifies the destination of the pointer as one or more SGML
identifiers
-
type
- categorizes the pointer in some respect, using any convenient
set of categories.
-
targType
- specifies the type (or types) of element to which this pointer
may point.
-
crDate
- specifies when this pointer was made.
-
resp
- specifies the creator of the pointer.
The difference between these two elements is that
<ptr>
is
an empty element, simply marking a point from which a link is to be
made, whereas
<ref>
may contain some text as well —
typically the text of the cross-reference itself. The
<ptr>
element would be used for a cross reference which is to be indicated by
some non-verbal means such as a symbol or icon, or in an electronic
text by a button. It is also useful in document production systems,
where the formatter can generate the correct verbal form of the cross
reference.
The following two forms, for example, are logically equivalent
(assuming we have documented somewhere the exact verbal form of cross
references represented by
<ptr>
elements):
See especially <ref target="SEC12">section 12 on page
34</ref>.
See especially <ptr
target="SEC12"/>.
The value of the
target attribute must have been used as the
identifier of some other element within the current document. This implies that the
passage or phrase being pointed at must bear an identifier, and must
therefore be tagged as an element of some kind. In the following
example, the cross reference is to a
<div>
element:
...
see especially <ptr target="SEC12"/>.
...
<div id="SEC12"><head>Concerning Identifiers...
...
Because the
id attribute is global, any element in
a document may be pointed to in this way. In the following example, a
paragraph has been given an identifier so that it may be pointed at:
...
this is discussed in <ref target="pspec">the paragraph on links</ref>
...
<p id="pspec">Links may be made to any kind of element
...
The
targType attribute can be used to specify that
the element pointed to must be of a particular type, as in the
following example:
...
this is discussed in <ref target="dspec" targType="div div2">
the section on links</ref>
This reference should fail if the element with identifier
dspec is neither a
<div>
nor a
<div2>
.
Note however that this additional check cannot be carried out by an
SGML or XML parser
alone, since such parsers can only check that some element
dspec exists.
The
type attribute can be used to categorize the
link represented by the pointer in any convenient way. The
resp and
crDate attributes may also be
used to represent the person or agency responsible for making the
link, and its date of creation, as in the following example:
...
this is discussed in
<ref type="xref" resp="auto" crDate="950521" target="dspec" targType="div div2">
the section on links</ref>
These attributes are most likely to be of use in hypertext
systems containing very many pointers used for a variety of purposes
and created by a variety of means.
Sometimes the target of a cross reference does not correspond
with any particular feature of a text, and so may not be tagged as an
element of some kind. If the desired target is simply a point in the
current document, the easiest way to mark it is by introducing an
<anchor>
element at the appropriate spot. If the target is
some sequence of words not otherwise tagged, the
<seg>
element
may be introduced to mark them. These two elements are described as
follows:
-
<anchor>
- specifies a location or point within a document so that it may
be pointed to.
-
<seg>
- identifies a span or segment of text within a document so that
it may be pointed to. Attributes include
-
type
- categorizes the segment
In the following (imaginary) example,
<ref>
elements have
been used to represent points in this text which are to be linked in
some way to other parts of it; in the first case to a point, and in
the second, to a sequence of words:
Returning to <ref target="ABCD">the point where I dozed
off</ref>, I noticed that <ref target="EFGH">three
words</ref> had been circled in red by a previous reader
This encoding requires that elements with the specified
identifiers (
ABCD and
EFGH in this
example) are to be found somewhere else in the current document.
Assuming that no element already exists to carry these identifiers,
the
<anchor>
and
<seg>
elements may be used:
.... <anchor type="bookmark" id="ABCD"/> ....
....<seg type="target" id="EFGH"> ... </seg> ...
The type attribute should be used (as above) to
distinguish amongst different purposes for which these general purpose
elements might be used in a text. Some other uses are discussed in
section Linking Attributes below.
Extended Pointers
The elements
<ptr>
and
<ref>
can only be used for
cross-references or links whose targets occur within the same
document as their source. They can also refer only to elements
explicitly tagged in the document.
The elements discussed in this section are not restricted in this way.
-
<xptr>
- defines a pointer to another location in the current document
or an external document.
-
<xref>
- defines a pointer to another location in the current document
or an external document, possibly modified by additional text or
comment.
In addition to the pointer attributes already discussed in section
Simple Cross References above, these elements share the following
additional attributes, which are used to specify the target of the
cross reference or link in place of the
target attribute:
-
doc
- specifies the document within which the required location is to
be found, by default the current document.
-
from
- specifies the start of the destination of the pointer as an
expression in the TEI extended pointer syntax, by default the whole of
the document indicated by the doc attribute.
-
to
- specifies the endpoint of the destination of the pointer as an
expression in the TEI extended pointer syntax; may only be specified
if the from attribute has been.
A full specification of the language used to express the target of
TEI extended pointers is beyond the scope of this document; we
list here only a few of its more generally useful features. The full
Guidelines should be consulted for more detail.
An
<xptr>
(or
<xref>
) may point to the whole of
some other document simply by supplying an entity name as the value of
the
doc attribute, as in this example:
see <xref doc="P3">The TEI Guidelines, passim</xref>
This example assumes that some system or public entity with the
name
P3 has been declared. This declaration has to be
included within the DTD in force when the document is parsed;
the manner of doing so is specific to the authoring software in use
(as further discussed in section Figures and Graphics).
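As a rough sketch of such a declaration, the document type
declaration might carry an internal subset like the one below. The
notation name TEIDOC, the file names, and the system identifiers are
all assumptions made for illustration; we also assume that the doc
attribute names an unparsed entity, and the exact form required may
differ with the software in use:

```xml
<!DOCTYPE TEI.2 SYSTEM "teixlite.dtd" [
  <!-- hypothetical notation and file name, for illustration only -->
  <!NOTATION TEIDOC SYSTEM "text/xml">
  <!ENTITY P3 SYSTEM "p3.xml" NDATA TEIDOC>
]>
```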
The from attribute is used to specify some
location within whatever document is specified by the doc
attribute. The specification uses a special language, called the
TEI extended pointer syntax, only some details of which
are given here. In this language, locations are defined as a series of
steps, each one identifying some part of the document,
often in terms of the locations identified by the previous step. For
example, you would point to the third sentence of the second paragraph
of chapter two by selecting chapter two in the first step, the second
paragraph in the second step, and the third sentence in the last step.
A step can be defined in terms of the document tree itself, using such
concepts as parent,
descendant, preceding, etc., or, more loosely, in
terms of text patterns, word or character positions. You can also use
a foreign (non-SGML) notation, or specify a location within a graphic
in terms of its co-ordinate system.
The from and to attributes use the
same notation. Each points to some portion of the target document;
the extended pointer as a whole points to the section beginning at the
start of the from and running to the end of the
to.
The first step in a location path will often be to specify the
identifier of some element within the target document, as in this
example:
<xptr doc="P3" from="id (SA)"/>
This selects the whole of whatever element bears the
identifier
SA within the entity
P3. If a
finer-grained target is required, other steps might follow. The
following keywords are available for you to select other elements in
terms of their relationship to this one:
-
child
- elements contained by this one.
-
ancestor
- elements which contain this one, directly or indirectly.
-
previous
- elements with the same parent as this one but preceding it in
the document.
-
next
- elements with the same parent as this one and following it in
the document.
-
preceding
- elements in the document which start before this one does,
irrespective of their parents.
-
following
- elements in the document which start after this one does,
irrespective of their parents.
Each of these keywords implies a particular set of elements (the
set of children, the set of ancestors, the set of previous siblings,
etc.); to specify which element in the set we are pointing at, the
keyword may optionally be followed by a parenthesized list containing:
-
a positive or negative number, indicating which of the possibly
many elements found is intended (+1 indicating the first element
encountered, starting from the current location, and -1 indicating the
last), or the keyword all, indicating that all the elements
in the set are to be pointed at;
-
a generic identifier, indicating the type of element required,
or a star indicating that any element type will do;
-
a set of attribute names and values, indicating that the
element selected should have attributes with the names and values
specified, if any.
Continuing the above example, the following reference will select
the third
<p>
element directly contained by whatever element
has the identifier
SA:
<xptr doc="P3" from="id (SA) child (3 p)"/>
Similarly, assuming that the entity
P3 is in fact
a reference to the XML form of the TEI Guidelines, then the following
reference will select section 14.2.2 of that publication in which (as
it happens) the extended pointer syntax is formally defined:
For full details, see
<xref doc="P3" from="id (SA) child (2 div2) child (2 div3)">
TEI Extended pointer syntax definition
</xref>
Normally, the scope of a cross reference will be adequately
defined by the
from attribute. For some documents,
however, it may be more convenient to define both a starting and an
ending scope. As noted above, the
to attribute is
provided for this purpose. For example,
<xptr doc="P1" from="id (XYZ)" to="id (ABC)"/>
is an extended pointer whose target is the sequence starting
at the beginning of whatever element in document
P1
has identifier
XYZ and ending at the end of whatever
element in the same document has identifier
ABC. Any
elements in between are also included, irrespective of structure; the
pointer is erroneous if the end of
ABC precedes the
start of
XYZ.
Very complex specifications are easily built using this syntax.
For example, the following reference will select the most recent
<head>
element which carries an attribute
lang with the value
lat, and which occurs before the start of the element with
identifier
SA:
<xptr doc="P3" from="id (SA) preceding (1 head lang lat)"/>
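The other keywords listed above combine in the same way. The two
pointers below are a sketch only: they assume that the element SA
lies somewhere inside a div1 and that at least one p follows it under
the same parent, neither of which is stated in this document:

```xml
<!-- the nearest div1 which contains the element SA -->
<xptr doc="P3" from="id (SA) ancestor (1 div1)"/>
<!-- the first p which follows SA and shares its parent -->
<xptr doc="P3" from="id (SA) next (1 p)"/>
```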
If no value is supplied for the
doc attribute, the
current document is assumed. Thus, the following references are
semantically equivalent. They both indicate the element with
identifier
X1 within the current document:
<ptr target="X1"/>
<xptr from="id (X1)"/>
The TEI Extended Pointer Syntax was defined before the more recent
XLink specifications, which are however to some extent derived
from it. Work is currently going on to harmonize the two
specification languages.
Linking Attributes
The following special purpose
linking attributes are
defined for every element in the TEI Lite DTD:
-
ana
- links an element with its interpretation.
-
corresp
- links an element with one or more other corresponding elements.
-
next
- links an element to the next element in an aggregate.
-
prev
- links an element to the previous element in an aggregate.
The
ana (analysis) attribute is intended for use
where a set of abstract analyses or interpretations have been defined
somewhere within a document, as further discussed in section
Interpretation and
Analysis. For example, a linguistic analysis of the sentence
‘John loves Nancy’ might be encoded as follows:
<seg type="sentence" ana="SVO">
<seg type="lex" ana="NP1">John</seg>
<seg type="lex" ana="VVI">loves</seg>
<seg type="lex" ana="NP1">Nancy</seg>
</seg>
This encoding implies the existence elsewhere in the
document of elements with identifiers
SVO,
NP1,
and
VVI where the significance of these particular codes
is explained. Note the use of the
<seg>
element to mark
particular components of the analysis, distinguished by the
type
attribute.
The
corresp (corresponding) attribute provides a
simple way of representing some form of correspondence between two
elements in a text. For example, in a multilingual text, it may be
used to link translation equivalents, as in the following example:
<seg lang="FRA" id="FR1" corresp="EN1">Jean aime Nancy</seg>
<seg lang="ENG" id="EN1" corresp="FR1">John loves Nancy</seg>
The same mechanism may be used for a variety of purposes. In the
following example, it has been used to represent anaphoric
correspondences between ‘the show’
and ‘Shirley’, and between
‘NBC’ and ‘the network’:
<p><title id="shirley">Shirley</title>, which made
its Friday night debut only a month ago, was
not listed on <name id="nbc">NBC</name>'s new schedule,
although <seg id="network" corresp="nbc">the network</seg>
says <seg id="show" corresp="shirley">the show</seg>
still is being considered.</p>
The
next and
prev attributes
provide a simple way of linking together the components of a
discontinuous element, as in the following example:
<q id="Q1a" next="Q1b">Who-e debel you?</q>
&mdash; he at last said &mdash;
<q id="Q1b" prev="Q1a">you no speak-e,
damme, I kill-e.</q> And so saying,
the lighted tomahawk began flourishing
about me in the dark.
Editorial Interventions
The process of encoding an electronic text has much in common with
the process of editing a manuscript or other text for printed
publication. In both cases a conscientious editor may wish to record
both the original state of the source and any editorial correction or
other change made in it. The elements discussed in this and the next
section provide some facilities for meeting these needs.
The following pair of elements may be used to mark
correction, that is editorial changes introduced where
the editor believes the original to be erroneous:
-
<corr>
- contains the correct form of a passage apparently erroneous in
the copy text. Attributes include:
-
sic
- gives the original form of the apparent error in the copy text.
-
resp
- signifies the editor or transcriber responsible for suggesting
the correction held as the content of the
<corr>
element.
-
cert
- signifies the degree of certainty ascribed to the correction
held as the content of the
<corr>
element.
-
<sic>
- contains text reproduced although apparently incorrect or
inaccurate. Attributes include:
-
corr
- gives a correction for the apparent error in the copy text.
-
resp
- signifies the editor or transcriber responsible for suggesting
the correction.
-
cert
- signifies the degree of certainty ascribed to the correction.
The following pair of elements may be used to mark
normalization, that is editorial changes introduced for
the sake of consistency or modernization of a text:
-
<orig>
- contains the original form of a reading, for which a
regularized form is given in an attribute value. Attributes include:
-
reg
- gives a regularized (normalized) form of the text.
-
resp
- identifies the individual responsible for the regularization of
the word or phrase.
-
<reg>
- contains a reading which has been regularized or normalized in
some sense. Attributes include:
-
orig
- gives the unregularized form of the text as found in the source
copy.
-
resp
- identifies the individual responsible for the regularization of
the word or phrase.
For example, the reading
... for his nose was as sharp as a pen and a' table of green feelds
is taken by Gifford as involving (1) the erroneous substitution
of
table for
babbled,
and (2) the non-standard spellings
a' and
feelds for
he and
fields. Gifford's conjecture might be encoded
thus:
... for his nose was as sharp as a pen and <reg orig="a'">he</reg>
<corr sic="table" resp="Gifford">babbl'd</corr> of green
<reg orig="feelds">fields</reg>
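The two elements of each pair mirror one another, so the same conjecture can also be recorded with the copy-text readings kept as content and the editorial readings moved into attribute values. The following is a sketch only, using the sic, corr, orig, reg and resp names listed above; attributing the correction to Gifford through resp is an assumption made for illustration:

```xml
... for his nose was as sharp as a pen and <orig reg="he">a'</orig>
<sic corr="babbl'd" resp="Gifford">table</sic> of green
<orig reg="fields">feelds</orig>
```

Which form to prefer will typically depend on which reading the encoder wants treated as the running text: the element content is what appears in the transcription itself, while the attribute value travels with it as annotation.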
Omissions, Deletions, and Additions
Besides correcting or normalizing words and phrases,
editors and transcribers may also supply missing material, omit
material, or transcribe material deleted or crossed out in the source.
Some material may also be particularly hard to transcribe
because it is hard to make out on the page. The following elements
may be used to record such phenomena:
-
<add>
- contains letters, words, or phrases inserted in the text by an
author, scribe, annotator, or corrector. Attributes include:
-
place
- if the addition is written into the copy text, indicates where
the additional text is written. Sample values include
inline,
supralinear, infralinear,
left (in left margin),
right (in right margin),
top,
bottom, etc.
-
<gap>
- indicates a point where material has been omitted in a
transcription, whether for editorial reasons described in the TEI
header, as part of sampling practice, or because the material is
illegible or inaudible. Attributes include:
-
desc
- gives a description of the omitted text.
-
resp
- indicates the editor, transcriber or encoder responsible for
the decision not to provide any transcription of the text and hence
the application of the
<gap>
tag.
-
<del>
- contains a letter, word or passage deleted, marked as deleted,
or otherwise indicated as superfluous or spurious in the copy text by
an author, scribe, annotator or corrector. Attributes include:
-
type
- classifies the type of deletion using any convenient typology.
-
status
- may be used to indicate faulty deletions, e.g. strikeouts which
include too much or too little text.
-
hand
- signifies the hand of the agent which made the deletion.
-
<unclear>
- contains a word, phrase, or passage which cannot be transcribed
with certainty because it is illegible or inaudible in the source.
Attributes include:
-
reason
- indicates why the material is hard to transcribe.
-
resp
- indicates the individual responsible for the transcription of
the letter, word or passage contained within the
<unclear>
element.
These elements may be used to record changes made by an editor, by
the transcriber, or (in manuscript material) by the author or scribe.
For example, if the source for an electronic text read
The following elements are provided for
for simple editorial interventions.
then it might be felt desirable to correct the obvious error,
but at the same time to record the deletion of the superfluous second
for, thus:
The following elements are provided for
<del hand="LB">for</del> simple editorial interventions.
The attribute value
LB on the
hand
attribute indicates that ‘LB’
corrected the duplication of
for.
If the source read
The following elements provided for
for simple editorial interventions.
(i.e. if the verb had been
inadvertently dropped) then the corrected text might read:
The following elements <add hand="LB">are</add> provided for
<del hand="LB">for</del> simple editorial interventions.
Here the attribute value
LB on the
hand
attribute of both elements indicates that ‘LB’
supplied the missing verb as well as correcting the duplication of
for.
These elements are not limited to changes made by an editor; they
can also be used to record authorial changes in manuscripts. A
manuscript in which the author has first written ‘How it galls me,
what a galling shadow’, then crossed out the word
galls and inserted
dogs
might be encoded thus:
How it <del hand="DHL" type="overstrike">galls</del>
<add hand="DHL" place="supralinear">dogs</add> me,
what a galling shadow
Similarly, the
<unclear>
and
<gap>
elements may be
used together to indicate the omission of illegible material; the
following example also shows the use of
<add>
for a
conjectural emendation:
One hundred &amp; twenty good regulars joined to me
<unclear><gap reason="indecipherable"/></unclear>
&amp; instantly, would aid me signally <add hand="ed">in?</add>
an enterprise against Wilmington.
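By contrast, where a word can still be made out, but only doubtfully, the reading itself may be kept as the content of unclear rather than being replaced by gap. The following sketch assumes, purely for illustration, that the word in is faintly legible in the source rather than absent; the attribute values faded and ed are hypothetical:

```xml
would aid me signally <unclear reason="faded" resp="ed">in</unclear>
an enterprise against Wilmington.
```

Here the resp attribute names whoever is responsible for the doubtful transcription, and reason records why the reading is uncertain.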
The
<del>
element marks material which is transcribed as
part of the electronic text despite being marked as deleted, while
<gap>
marks the location of material which is omitted from the
electronic text, whether it is legible or not. A language corpus, for
example, might omit long quotations in foreign languages:
<p> ... An example of a list appearing in a fief ledger of
<name type="place">Koldinghus</name> <date>1611/12</date>
is given below. It shows cash income from a sale of
honey.</p>
<q><gap desc="quotation from ledger"
reason="in Danish"/></q>
<p>A description of the overall structure of the account is
once again ... </p>
Other corpora (particularly those constructed before the widespread
use of scanners) systematically omit figures and
mathematics:
<p>At the bottom of your screen below the mode line is the
<term>minibuffer</term>. This is the area where Emacs
echoes the commands you enter and