TEI Lite: Encoding for Interchange: an introduction to the TEI

This document provides an introduction to the recommendations of the Text Encoding Initiative (TEI), by describing a specific subset of the full TEI encoding scheme. The scheme documented here can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize the usability of electronic transcriptions and to facilitate their interchange among scholars using different computer systems. It is fully compatible with the full TEI scheme, as defined by TEI document P5, Guidelines for Electronic Text Encoding and Interchange, as of February 2006, and available from the TEI Consortium website at http://www.tei-c.org.

1 Introduction

The Text Encoding Initiative (TEI) Guidelines are addressed to anyone who wants to interchange information stored in an electronic form. They emphasize the interchange of textual information, but other forms of information such as images and sound are also addressed. The Guidelines are equally applicable in the creation of new resources and in the interchange of existing ones.

The Guidelines provide a means of making explicit certain features of a text in such a way as to aid the processing of that text by computer software running on different machines. This process of making explicit we call markup or encoding. Any textual representation on a computer uses some form of markup; the TEI came into being partly because of the enormous variety of mutually incomprehensible encoding schemes currently besetting scholarship, and partly because of the expanding range of scholarly uses now being identified for texts in electronic form.

The TEI Guidelines describe an encoding scheme which can be expressed using a number of different formal languages. The first editions of the Guidelines used the Standard Generalized Markup Language (SGML); since 2002, this has been replaced by the use of the Extensible Markup Language (XML). These markup languages have in common the definition of text in terms of elements and attributes, and rules governing their appearance within a text. The TEI's use of XML is ambitious in its complexity and generality, but it is fundamentally no different from that of any other XML markup scheme, and so any general-purpose XML-aware software is able to process TEI-conformant texts.

Since 2001, the TEI has been a community initiative supported by an international membership consortium. It was originally an international research project sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing, with substantial funding over its first five years from the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European Communities, the Andrew W. Mellon Foundation, the Social Science and Humanities Research Council of Canada and others. The Guidelines were first published in May 1994, after six years of development involving many hundreds of scholars from different academic disciplines worldwide. During the years that followed, the Guidelines became increasingly influential in the development of the digital library, in the language industries, and even in the development of the World Wide Web itself. The TEI Consortium was set up in January 2001, and a year later produced an edition of the Guidelines entirely revised for XML compatibility. In 2004, it set about a major revision of the Guidelines to take full advantage of new schema languages, the first release of which appeared in 2005. This revision of the TEI Lite document conforms to version 2.1 of this most recent edition of the Guidelines, TEI P5, released in June 2012.

At the outset of its work, the overall goals of the TEI were defined by the closing statement of a planning conference held at Vassar College, N.Y., in November, 1987; these ‘Poughkeepsie Principles’ were further elaborated in a series of design documents. The Guidelines, say these design documents, should:

suffice to represent the textual features needed for research;
be simple, clear, and concrete;
be easy for researchers to use without special-purpose software;
allow the rigorous definition and efficient processing of texts;
provide for user-defined extensions;
conform to existing and emergent standards.

The world of scholarship is large and diverse. For the Guidelines to have wide acceptability, it was important to ensure that:

the common core of textual features be easily shared;
additional specialist features be easy to add to (or remove from) a text;
multiple parallel encodings of the same feature should be possible;
the richness of markup should be user-defined, with a very small minimal requirement;
adequate documentation of the text and its encoding should be provided.

The present document describes a manageable selection from the extensive set of elements and recommendations resulting from those design goals, which is called TEI Lite.

In selecting from the several hundred elements defined by the full TEI scheme, we have tried to identify a useful ‘starter set’, comprising the elements which almost every user should know about. Experience working with TEI Lite will be invaluable in understanding the full TEI scheme and in knowing how to integrate specialized parts of it into the general TEI framework.

Our goals in defining this subset may be summarized as follows:

it should be able to handle adequately a reasonably wide variety of texts, at the level of detail found in existing practice (as demonstrated in, for example, the holdings of the Oxford Text Archive);
it should be useful for the production of new documents (such as this one) as well as the encoding of existing texts;
it should be usable with a wide range of existing XML software;
it should be a pure subset of the full TEI scheme and defined using the customization methods described in the TEI Guidelines;
it should be as small and simple as is consistent with the other goals.

Readers may judge for themselves our success in meeting these goals.

Although we have tried to make this document self-contained, as suits a tutorial text, the reader should be aware that it does not cover every detail of the TEI encoding scheme. All of the elements described here are fully documented in the TEI Guidelines themselves, which should be consulted for authoritative reference information on these, and on the many others which are not described here. Some basic knowledge of XML is assumed.

2 A Short Example

We begin with a short example, intended to show what happens when a passage of prose is typed into a computer by someone with little sense of the purpose of mark-up, or the potential of electronic texts. In an ideal world, such output might be generated by a very accurate optical scanner. It attempts to be faithful to the appearance of the printed text, by retaining the original line breaks, by introducing blanks to represent the layout of the original headings and page breaks, and so forth. Where characters not available on the keyboard are needed (such as the accented letter a in faàl or the long dash), it attempts to mimic their appearance.

CHAPTER 38 READER, I married him. A quiet wedding we had: he and I, the par- son and clerk, were alone present. When we got back from church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and John cleaning the knives, and I said -- 'Mary, I have been married to Mr Rochester this morning.' The housekeeper and her husband were of that decent, phlegmatic order of people, to whom one may at any time safely communicate a remarkable piece of news without incurring the danger of having one's ears pierced by some shrill ejaculation and subsequently stunned by a torrent of wordy wonderment. Mary did look up, and she did stare at me; the ladle with which she was basting a pair of chickens roasting at the fire, did for some three minutes hang suspended in air, and for the same space of time John's knives also had rest from the polishing process; but Mary, bending again over the roast, said only -- 'Have you, miss? Well, for sure!' A short time after she pursued, 'I seed you go out with the master, but I didn't know you were gone to church to be wed'; and she basted away. John, when I turned to him, was grinning from ear to ear. 'I telled Mary how it would be,' he said: 'I knew what Mr Ed- ward' (John was an old servant, and had known his master when he was the cadet of the house, therefore he often gave him his Christian name) -- 'I knew what Mr Edward would do; and I was certain he would not wait long either: and he's done right, for aught I know. I wish you joy, miss!' and he politely pulled his forelock. 'Thank you, John. Mr Rochester told me to give you and Mary this.' I put into his hand a five-pound note. Without waiting to hear more, I left the kitchen. In passing the door of that sanctum some time after, I caught the words -- 'She'll happen do better for him nor ony o' t' grand ladies.' And again, 'If she ben't one o' th' handsomest, she's noan faa\l, and varry good-natured; and i' his een she's fair beautiful, onybody may see that.' 
I wrote to Moor House and to Cambridge immediately, to say what I had done: fully explaining also why I had thus acted. Diana and 474 JANE EYRE 475 Mary approved the step unreservedly. Diana announced that she would just give me time to get over the honeymoon, and then she would come and see me. 'She had better not wait till then, Jane,' said Mr Rochester, when I read her letter to him; 'if she does, she will be too late, for our honey- moon will shine our life long: its beams will only fade over your grave or mine.' How St John received the news I don't know: he never answered the letter in which I communicated it: yet six months after he wrote to me, without, however, mentioning Mr Rochester's name or allud- ing to my marriage. His letter was then calm, and though very serious, kind. He has maintained a regular, though not very frequent correspond- ence ever since: he hopes I am happy, and trusts I am not of those who live without God in the world, and only mind earthly things.

This transcription suffers from a number of shortcomings:

the page numbers and running titles are intermingled with the text in a way which makes it difficult for software to disentangle them;
no distinction is made between single quotation marks and apostrophes, so it is difficult to know exactly which passages are in direct speech;
the preservation of the copy text's hyphenation means that simple-minded search programs will not find the broken words;
the accented letter in faàl and the long dash have been rendered by ad hoc keying conventions which follow no standard pattern and will be processed correctly only if the transcriber remembers to mention them in the documentation;
paragraph divisions are marked only by the use of white space, and hard carriage returns have been introduced at the end of each line. Consequently, if the size of type used to print the text changes, reformatting will be problematic.

We now present the same passage, as it might be encoded using the TEI Guidelines. As we shall see, there are many ways in which this encoding could be extended, but as a minimum, the TEI approach allows us to represent the following distinctions:

Paragraph and chapter divisions are now marked explicitly.
Apostrophes are distinguished from quotation marks; direct speech is explicitly marked.
The accented letter and the long dash are correctly represented.
Page divisions have been marked with an empty pb element alone.
The lineation of the original has not been retained and words broken by typographic accident at the end of a line have been re-assembled without comment.
For convenience of proof reading, a new line has been introduced at the start of each paragraph, but the indentation is removed.

<pb n="474"/>
<div type="chapter" n="38">
<p>Reader, I married him. A quiet wedding we had: he and I, the parson and clerk, were alone
   present. When we got back from church, I went into the kitchen of the manor-house, where
   Mary was cooking the dinner, and John cleaning the knives, and I said —</p>
<p>
  <q>Mary, I have been married to Mr Rochester this morning.</q> The housekeeper and her
   husband were of that decent, phlegmatic order of people, to whom one may at any time safely
   communicate a remarkable piece of news without incurring the danger of having one's ears
   pierced by some shrill ejaculation and subsequently stunned by a torrent of wordy
   wonderment. Mary did look up, and she did stare at me; the ladle with which she was basting
   a pair of chickens roasting at the fire, did for some three minutes hang suspended in air,
   and for the same space of time John's knives also had rest from the polishing process; but
   Mary, bending again over the roast, said only —</p>
<p>
  <q>Have you, miss? Well, for sure!</q>
</p>
<p>A short time after she pursued, <q>I seed you go out with the master, but I didn't know
     you were gone to church to be wed</q>; and she basted away. John, when I turned to him, was
   grinning from ear to ear. <q>I telled Mary how it would be,</q> he said: <q>I knew what Mr
     Edward</q> (John was an old servant, and had known his master when he was the cadet of the
   house, therefore he often gave him his Christian name) — <q>I knew what Mr Edward would do;
     and I was certain he would not wait long either: and he's done right, for aught I know. I
     wish you joy, miss!</q> and he politely pulled his forelock.</p>
<p>
  <q>Thank you, John. Mr Rochester told me to give you and Mary this.</q>
</p>
<p>I put into his hand a five-pound note. Without waiting to hear more, I left the kitchen.
   In passing the door of that sanctum some time after, I caught the words —</p>
<p>
  <q>She'll happen do better for him nor ony o' t' grand ladies.</q> And again, <q>If she
     ben't one o' th' handsomest, she's noan faàl, and varry good-natured; and i' his een she's
     fair beautiful, onybody may see that.</q>
</p>
<p>I wrote to Moor House and to Cambridge immediately, to say what I had done: fully
   explaining also why I had thus acted. Diana and <pb n="475"/> Mary approved the step
   unreservedly. Diana announced that she would just give me time to get over the honeymoon,
   and then she would come and see me.</p>
<p>
  <q>She had better not wait till then, Jane,</q> said Mr Rochester, when I read her letter
   to him; <q>if she does, she will be too late, for our honeymoon will shine our life long:
     its beams will only fade over your grave or mine.</q>
</p>
<p>How St John received the news I don't know: he never answered the letter in which I
   communicated it: yet six months after he wrote to me, without, however, mentioning Mr
   Rochester's name or alluding to my marriage. His letter was then calm, and though very
   serious, kind. He has maintained a regular, though not very frequent correspondence ever
   since: he hopes I am happy, and trusts I am not of those who live without God in the world,
   and only mind earthly things.</p>
</div>

This particular encoding represents a set of choices or priorities. As a trivial example, note that in the second example, end-of-line hyphenation has been silently removed. Conceivably Brontë (or her printer) intended the word ‘honeymoon’ to appear as ‘honey-moon’ on its second appearance, though this seems unlikely: our decision to focus on Brontë's text, rather than on the printing of it in this particular edition, makes it impossible to be certain. This is an instance of the fundamental selectivity of any encoding. An encoding makes explicit only those textual features of importance to the encoder. It is not difficult to think of ways in which the encoding of even this short passage might readily be extended. For example:

a regularized form of the passages in dialect could be provided;
footnotes glossing or commenting on any passage could be added;
pointers linking parts of this text to others could be added;
proper names of various kinds could be distinguished from the surrounding text;
detailed bibliographic information about the text's provenance and context could be prefixed to it;
a linguistic analysis of the passage into sentences, clauses, words, etc., could be provided, each unit being associated with appropriate category codes;
the text could be segmented into narrative or discourse units;
systematic analysis or interpretation of the text could be included in the encoding, with potentially complex alignment or linkage between the text and the analysis, or between the text and one or more translations of it;
passages in the text could be linked to images or sound held on other media.

TEI-recommended ways of carrying out most of these are described in the remainder of this document. The TEI scheme as a whole also provides for an enormous range of other possibilities, of which we cite only a few:

detailed analysis of the components of names;
detailed meta-information providing thesaurus-style information about the text's origins or topics;
information about the printing history or manuscript variations exhibited by a particular series of versions of the text.

For recommendations on these and many other possibilities, the full Guidelines should be consulted.

3 The Structure of a TEI Text

All TEI-conformant texts contain (a) a TEI header (marked up as a teiHeader element) and (b) the transcription of the text proper (marked up as a text element). These two elements are combined to form a single TEI element, which must be declared within the TEI namespace¹.

The TEI header provides information analogous to that provided by the title page of a printed text. It has up to four parts: a bibliographic description of the machine-readable text, a description of the way it has been encoded, a non-bibliographic description of the text (a text profile), and a revision history. The header is described in more detail in section 19 The Electronic Title Page.

A TEI text may be unitary (a single work) or composite (a collection of single works, such as an anthology). In either case, the text may have an optional front or back. In between is the body of the text, which, in the case of a composite text, may consist of groups, each containing more groups or texts.

A unitary text will be encoded using an overall structure like this:
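
A minimal sketch of such a structure is given here, with XML comments standing in for the actual content; the namespace declaration uses the URI defined for TEI P5 documents:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <!-- header information -->
 </teiHeader>
 <text>
  <front>
   <!-- front matter, if any -->
  </front>
  <body>
   <!-- body of the text -->
  </body>
  <back>
   <!-- back matter, if any -->
  </back>
 </text>
</TEI>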

A composite text also has an optional front and back. In between occur one or more groups of texts, each with its own optional front and back matter. A composite text will thus be encoded using an overall structure like this:
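
The following sketch shows one way such a structure might look. XML comments stand in for the actual content, and the number of texts within the group is arbitrary:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <!-- header information for the composite -->
 </teiHeader>
 <text>
  <front>
   <!-- front matter for the composite, if any -->
  </front>
  <group>
   <text>
    <front>
     <!-- front matter of the first text, if any -->
    </front>
    <body>
     <!-- body of the first text -->
    </body>
    <back>
     <!-- back matter of the first text, if any -->
    </back>
   </text>
   <text>
    <body>
     <!-- body of the second text -->
    </body>
   </text>
   <!-- more texts, or nested groups of texts, here -->
  </group>
  <back>
   <!-- back matter for the composite, if any -->
  </back>
 </text>
</TEI>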

It is also possible to define a composite of complete TEI texts, each with its own header. Such a collection is known as a TEI corpus, and may itself have a header:
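
A corpus of this kind might be sketched as follows, again with XML comments standing in for the actual content and with an arbitrary number of member texts:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <!-- header information for the corpus as a whole -->
 </teiHeader>
 <TEI>
  <teiHeader>
   <!-- header information for the first text -->
  </teiHeader>
  <text>
   <!-- first text in the corpus -->
  </text>
 </TEI>
 <TEI>
  <teiHeader>
   <!-- header information for the second text -->
  </teiHeader>
  <text>
   <!-- second text in the corpus -->
  </text>
 </TEI>
 <!-- more TEI elements here -->
</teiCorpus>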

It is also possible to create a composite of corpora -- that is, one teiCorpus element may contain many nested teiCorpus elements rather than many nested TEI elements, to any depth considered necessary.

In the remainder of this document, we discuss chiefly simple text structures. The discussion in each case consists of a short list of relevant TEI elements with a brief definition of each, followed by definitions for any attributes specific to that element, and a reference to any classes of which the element is a member. These references are linked to full specifications for each object, as given in the TEI Guidelines. In most cases, short examples are also given.

For example, here are the elements discussed so far:

TEI (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.
teiHeader (TEI Header) supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text.
text contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
teiCorpus contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.

4 Encoding the Body

As indicated above, a simple TEI document at the textual level consists of the following elements:

front (front matter) contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found at the start of a document, before the main body.
group contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc.
body (text body) contains the whole body of a single unitary text, excluding any front or back matter.
back (back matter) contains any appendixes, etc. following the main part of a text.

Elements specific to front and back matter are described below in section 18 Front and Back Matter. In this section we discuss the elements making up the body of a text.

4.1 Text Division Elements

The body of a prose text may be just a series of paragraphs, or these paragraphs may be grouped together into chapters, sections, subsections, etc. Each paragraph is tagged using the p tag. The div element is used to represent any such grouping of paragraphs.

p (paragraph) marks paragraphs in prose.
div (text division) contains a subdivision of the front, body, or back of a text.

The type attribute on the div element may be used to supply a conventional name for a category of text division, or otherwise to distinguish one kind of division from another. Typical values might be ‘book’, ‘chapter’, ‘section’, ‘part’, ‘poem’, ‘song’, etc. For a given project, it will usually be advisable to define and adhere to a specific list of such values.

A div element may itself contain further, nested, divs, thus mimicking the traditional structure of a book, which can be decomposed hierarchically into units such as parts, containing chapters, containing sections, and so on. TEI texts in general conform to this simple hierarchic model.

The xml:id attribute may be used to supply a unique identifier for the division, which may be used for cross references or other links to it, such as a commentary, as further discussed in section 8 Cross References and Links. It is often useful to provide an xml:id attribute for every major structural unit in a text, and to derive its values in some systematic way, for example by appending a section number to a short code for the title of the work in question, as in the examples below. It is particularly useful to supply such identifiers if the resource concerned is to be made available over the web, since they make it much easier for other web-based applications to link directly to the corresponding parts of your text.

The n attribute may be used to supply (additionally or alternatively) a short mnemonic name or number for a division, or any other element. If a conventional form of reference or abbreviation for the parts of a work already exists (such as the book/chapter/verse pattern of Biblical citations), the n attribute is the place to record it; unlike the identifier supplied by xml:id, it does not need to be unique.

The xml:lang attribute may be used to specify the language of the division. Languages are identified by an internationally defined code, as further discussed in section 6.3 Foreign Words or Expressions below.
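
As a minimal sketch, a chapter written in French within a work otherwise in English might be marked as follows. The division and its n value are hypothetical; fr is the standard code for French:

<div type="chapter" n="3" xml:lang="fr">
 <p>
  <!-- text of the chapter, in French -->
 </p>
</div>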

The rend attribute may be used to supply information about the rendition (appearance) of a division, or any other element, as further discussed in section 6 Marking Highlighted Phrases below. As with the type attribute, a project will often find it useful to predefine the possible values for this attribute, but TEI Lite does not constrain it in any way.

These four attributes, xml:id, n, xml:lang, and rend, are so widely useful that they are allowed on any element in any TEI schema: they are global attributes. Other global attributes defined in the TEI Lite scheme are discussed in section 8.3 Special kinds of Linking.

The value of every xml:id attribute should be unique within a document. One simple way of ensuring that this is so is to make it reflect the hierarchic structure of the document. For example, Smith's Wealth of Nations as first published consists of five books, each of which is divided into chapters, while some chapters are further subdivided into parts. We might define xml:id values for this structure as follows:
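
One possible scheme is sketched here. The WN prefix and the numbering pattern are illustrative assumptions, only a few divisions are shown, and the content of each division is omitted:

<body>
 <div xml:id="WN1" n="I" type="book">
  <div xml:id="WN101" n="I.1" type="chapter">
   <!-- ... -->
  </div>
  <div xml:id="WN102" n="I.2" type="chapter">
   <!-- ... -->
  </div>
  <!-- ... -->
  <div xml:id="WN110" n="I.10" type="chapter">
   <div xml:id="WN1101" n="I.10.1" type="part">
    <!-- ... -->
   </div>
   <div xml:id="WN1102" n="I.10.2" type="part">
    <!-- ... -->
   </div>
  </div>
  <!-- ... -->
 </div>
 <div xml:id="WN2" n="II" type="book">
  <!-- ... -->
 </div>
 <!-- ... -->
</body>

Because each identifier is built from the identifier of the enclosing division, no two divisions can receive the same value.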

A different numbering scheme may be used for xml:id and n attributes: this is often useful where a canonical reference scheme is used which does not tally with the structure of the work. For example, in a novel divided into volumes each containing chapters, where the chapters are numbered sequentially through the whole work, rather than within each volume, one might use a scheme such as the following:
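
A sketch of such a scheme is given here; the NV prefix is hypothetical and the content of each chapter is omitted:

<body>
 <div xml:id="NV01" n="1" type="volume">
  <div xml:id="NV011" n="1" type="chapter">
   <!-- ... -->
  </div>
  <div xml:id="NV012" n="2" type="chapter">
   <!-- ... -->
  </div>
 </div>
 <div xml:id="NV02" n="2" type="volume">
  <div xml:id="NV021" n="3" type="chapter">
   <!-- ... -->
  </div>
  <div xml:id="NV022" n="4" type="chapter">
   <!-- ... -->
  </div>
 </div>
</body>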

Here the work has two volumes, each containing two chapters. The chapters are numbered conventionally 1 to 4, but the xml:id values specified allow them to be regarded additionally as if they were numbered 1.1, 1.2, 2.1, 2.2.

4.2 Headings and Closings

Every div may have a title or heading at its start, and (less commonly) a trailer such as ‘End of Chapter 1’ at its end. The following elements may be used to transcribe them:

head (heading) contains any type of heading, for example the title of a section, or the heading of a list, glossary, manuscript description, etc.
trailer contains a closing title or footer appearing at the end of a division of a text.
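
As a minimal sketch, a chapter whose heading and trailer are both transcribed might be encoded as follows; the heading text is a placeholder and the body of the chapter is omitted:

<div type="chapter" n="1">
 <head>Chapter 1</head>
 <p>
  <!-- text of the chapter -->
 </p>
 <trailer>End of Chapter 1</trailer>
</div>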

Some other elements which may be necessary at the beginning or ending of text divisions are discussed below in section 18.1.2 Prefatory Matter.

Whether or not headings and trailers are included in a transcription is a matter for the individual transcriber to decide. Where a heading is completely regular (for example ‘Chapter 1’) or may be automatically constructed from attribute values (e.g. <div type="chapter" n="1">), it may be omitted; where it contains otherwise unrecoverable text it should always be included. For example, the start of Hardy's Under the Greenwood Tree might be encoded as follows:

<div xml:id="UGT1" n="Winter" type="Part">
<div xml:id="UGT11" n="1" type="Chapter">
<head>Mellstock-Lane</head>
<p>To dwellers in a wood almost every species of tree ... </p>
</div>
</div>

4.3 Prose, Verse and Drama

As in the Brontë example above, the paragraphs making up a textual division are tagged with the p tag. In poetic or dramatic texts different tags are needed, to represent verse lines and stanzas in the first case, or individual speeches and stage directions in the second:

l (verse line) contains a single, possibly incomplete, line of verse.
lg (line group) contains one or more verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc.
sp (speech) An individual speech in a performance text, or a passage presented as such in a prose or verse text.
speaker A specialized form of heading or label, giving the name of one or more speakers in a dramatic text or fragment.
stage (stage direction) contains any kind of stage direction within a dramatic text or fragment.

Here, for example, is the start of a poetic text in which verse lines and stanzas are tagged:

<lg n="I">
<l>I Sing the progresse of a
   deathlesse soule,</l>
<l>Whom Fate, with God made, but doth not controule,</l>
<l>Plac'd in
   most shapes; all times before the law</l>
<l>Yoak'd us, and when, and since, in this I
   sing.</l>
<l>And the great world to his aged evening;</l>
<l>From infant morne, through manly
   noone I draw.</l>
<l>What the gold Chaldee, of silver Persian saw,</l>
<l>Greeke brass, or
   Roman iron, is in this one;</l>
<l>A worke t'out weare Seths pillars, bricke and
   stone,</l>
<l>And (holy writs excepted) made to yeeld to none,</l>
</lg>

Note that the l element marks verse lines, not typographic lines: the original lineation of the first few lines above has not therefore been made explicit by this encoding, and may be lost. The lb element described in section 5 Page and Line Numbers might additionally be used to mark typographic lines if so desired.
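
For instance, assuming that the typographic line break of the original falls where the transcription above breaks its first line, that verse line might be encoded as:

<l>I Sing the progresse of a <lb/>deathlesse soule,</l>

The l element still delimits the verse line as a whole, while the empty lb element records the point at which the printed line ends.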

Here is the end of a famous dramatic text, in which speeches and stage directions are marked:

<sp>
<speaker>Vladimir</speaker>
<p>Pull on your trousers.</p>
</sp>
<sp>
<speaker>Estragon</speaker>
<p>You want me to pull off my trousers?</p>
</sp>
<sp>
<speaker>Vladimir</speaker>
<p>Pull <emph>on</emph> your trousers.</p>
</sp>
<sp>
<speaker>Estragon</speaker>
<p>
<stage>(realizing his trousers are down)</stage>.
True</p>
</sp>
<stage>He pulls up his trousers</stage>
<sp>
<speaker>Vladimir</speaker>
<p>Well? Shall we go?</p>
</sp>
<sp>
<speaker>Estragon</speaker>
<p>Yes, let's go.</p>
</sp>
<stage>They do not move.</stage>

Note that the stage (stage direction) element can appear either within a speech or between speeches. The sp ("speech") element contains, following an optional speaker element indicating who is speaking, either paragraphs (if the speech is in prose) or verse lines or stanzas as in the next example. In verse drama, it is quite common to find that verse lines are split between speakers. The easiest way of encoding this is to use the part attribute to indicate that the lines so fragmented are incomplete:

<div type="Act" n="I">
<head>ACT I</head>
<div type="Scene" n="1">
  <head>SCENE I</head>
  <stage rend="italic"> Enter Barnardo and Francisco, two Sentinels, at several doors</stage>
  <sp>
   <speaker>Barn</speaker>
   <l part="Y">Who's there?</l>
  </sp>
  <sp>
   <speaker>Fran</speaker>
   <l>Nay, answer me. Stand and unfold yourself.</l>
  </sp>
  <sp>
   <speaker>Barn</speaker>
   <l part="I">Long live the King!</l>
  </sp>
  <sp>
   <speaker>Fran</speaker>
   <l part="M">Barnardo?</l>
  </sp>
  <sp>
   <speaker>Barn</speaker>
   <l part="F">He.</l>
  </sp>
  <sp>
   <speaker>Fran</speaker>
   <l>You come most carefully upon your hour.</l>
  </sp>

</div>
</div>

The same mechanism may be applied to stanzas which are divided between two speakers:

<div>
<sp>
  <speaker>First voice</speaker>
  <lg type="stanza" part="I">
   <l>But why drives on that ship so fast</l>
   <l>Withouten wave or wind?</l>
  </lg>
</sp>
<sp>
  <speaker>Second Voice</speaker>
  <lg part="F">
   <l>The air is cut away before.</l>
   <l>And closes from behind.</l>
  </lg>
</sp>

</div>

The sp element can also be used for dialogue presented in a prose work as if it were drama, as in the next example, which also demonstrates the use of the who attribute to bear a code identifying the speaker of the piece of dialogue concerned:

<div>
<sp who="#OPI">
  <speaker>The reverend Doctor Opimian</speaker>
  <p>I do not think I have named a single unpresentable fish.</p>
</sp>
<sp who="#GRM">
  <speaker>Mr Gryll</speaker>
  <p>Bream, Doctor: there is not much to be said for bream.</p>
</sp>
<sp who="#OPI">
  <speaker>The Reverend Doctor Opimian</speaker>
  <p>On the contrary, sir, I think there is much to be said for him. In the first
     place....</p>
  <p>Fish, Miss Gryll -- I could discourse to you on fish by the hour: but for the present I
     will forbear.</p>
</sp>
</div>

Here the who attribute values (#OPI etc.) are links, pointing to a list of the characters in the novel, each of which has an identifier:

<list>
<head>Characters in the novel</head>
<item xml:id="OPI">
  <name>Dr Opimian</name> : named for the famous Roman fine wine</item>
<item xml:id="GRM">
  <name>Mr Gryll</name> : named for the mythical Gryllus, one of Ulysses'
   sailors transformed by Circe into a pig, who argues that he was happier in that state than
   as a man</item>
</list>

5 Page and Line Numbers

Page and line breaks etc. may be marked with the following elements.

pb (page break) marks the boundary between one page of a text and the next in a standard reference system.
lb (line break) marks the start of a new (typographic) line in some edition or version of a text.
milestone marks a boundary point separating any kind of section of a text, typically but not necessarily indicating a point at which some part of a standard reference system changes, where the change is not represented by a structural element.

These elements mark a single point in the text, not a span of text. The global n attribute should be used to supply the number of the page or line beginning at the tag.

When working from a paginated original, it is often useful to record its pagination, if only to simplify later proof-reading. It is also useful for synchronizing an encoded text with a set of page images. Recording the line breaks may be useful for similar reasons.

If features such as pagination or lineation are marked for more than one edition, specify the edition in question using the ed attribute, and supply as many tags as are necessary. For example, in the following passage we indicate where the page breaks occur in two different editions (ED1 and ED2):

<p>I wrote to Moor House and to Cambridge immediately, to say what I had done: fully
explaining also why I had thus acted. Diana and <pb ed="ED1" n="475"/> Mary approved the step
unreservedly. Diana announced that she would <pb ed="ED2" n="485"/>just give me time to get
over the honeymoon, and then she would come and see me.</p>

A special attribute, break, may be used to indicate whether or not this empty element is to be considered word-breaking, irrespective of any adjacent whitespace. For example, in the following encoded sample:
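
The sample itself does not appear at this point. A minimal sketch of what such an encoding might look like is given below; the title-page wording and the use of the value no for break (meaning that the line break does not separate two words) are assumptions made for illustration:

```xml
<titlePart>
 <lb/>With Additions, ne-<lb break="no"/>ver before Printed.
</titlePart>
```

In this sketch the second lb falls inside the word never, so a processor should not treat it as a word boundary even though it marks the start of a new typographic line.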

The pb and lb elements are special cases of the general class of milestone elements which mark reference points within a text. The generic milestone element can mark any kind of reference point: for example, a column break, the start of a new kind of section not otherwise tagged, or in general any significant change in the text not marked by an XML element. The names used for types of unit and for editions referred to by the ed and unit attributes may be chosen freely, but should be documented in the header refsDecl element (see 19.2.3 Reference and Classification Declarations). The milestone element may be used to replace the others, or the others may be used as a set; they should not be mixed arbitrarily.
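
As a hedged illustration of the generic element, a column break might be recorded as follows. The unit value column, the n value, and the edition code ED1 are assumed for this sketch and would need to be documented in the header as described above:

```xml
<p>... the last words of the first column
<milestone unit="column" n="2" ed="ED1"/>
the first words of the second column ...</p>
```

Like pb and lb, the milestone marks a single point: the n attribute gives the number of the column that begins at the tag.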

6 Marking Highlighted Phrases

6.1 Changes of Typeface, etc.

Highlighted words or phrases are those made visibly different from the rest of the text, typically by a change of type font, handwriting style, ink colour etc., which is intended to draw the reader's attention to some associated change.

The global rend attribute can be attached to any element, and used wherever necessary to specify details of the highlighting used for it in the source. For example, a heading rendered in bold might be tagged <head rend="bold">, and one in italic <head rend="italic">.

The values to be used for the rend attribute are not specified by the TEI Guidelines, since they will depend entirely on the needs of the particular project. Some typical values might include italic, bold etc. for font variations; center, right etc. for alignment; large, small etc. for size; smallcaps, allcaps etc. for type variants and so on. Several such words may be used in combination as necessary, but no formal syntax is proposed. The full TEI Guidelines provide more rigorous mechanisms, using other W3C standards such as CSS, as an alternative to the use of rend.
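
A brief sketch of rend values used in combination is given below. Since no formal syntax is proposed, the space-separated form, the particular values, and the sample wording are all assumptions that a project would need to fix and document for itself:

```xml
<head rend="bold center">Chapter the First</head>
<p rend="small">Printed for the Author.</p>
```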

It is not always possible or desirable to interpret the reasons for such changes of rendering in a text. In such cases, the element hi may be used to mark a sequence of highlighted text without making any claim as to its status.

hi (highlighted) marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made.

In the following example, the use of a distinct typeface for the subheading and for the included name is recorded but not interpreted:

<p>
<hi rend="gothic">And this Indenture further
witnesseth</hi> that the said <hi rend="italic">Walter Shandy</hi>, merchant, in
consideration of the said intended marriage ...
</p>

Alternatively, where the cause for the highlighting can be identified with confidence, a number of other, more specific, elements are available.

emph (emphasized) marks words or phrases which are stressed or emphasized for linguistic or rhetorical effect.
foreign (foreign) identifies a word or phrase as belonging to some language other than that of the surrounding text.
gloss identifies a phrase or word used to provide a gloss or definition for some other word or phrase.
label contains any label or heading used to identify part of a text, typically but not exclusively in a list or glossary.
mentioned marks words or phrases mentioned, not used.
term contains a single-word, multi-word, or symbolic designation which is regarded as a technical term.
title contains a title for any kind of work.

Some features (notably quotations and glosses) may be found in a text either marked by highlighting, or with quotation marks. In either case, the elements q and gloss (as discussed in the following section) should be used. If the highlighting is to be recorded, use the global rend attribute.

As an example of the elements defined here, consider the following sentence:

On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité, the romances of Chrétien de Troyes, and the German adaptations of these works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram von Eschenbach.

Interpreting the role of the highlighting, the sentence might look like this:

<p>On the one hand the <title>Nibelungenlied</title>
is associated with the new rise of romance of twelfth-century France, the <foreign>romans
d'antiquité</foreign>, the romances of Chrétien de Troyes, ...</p>

Describing only the appearance of the original, it might look like this:

<p>On the one hand the <hi rend="italic">Nibelungenlied</hi> is associated with the new rise of romance of twelfth-century France,
the <hi rend="italic">romans d'antiquité</hi>, the romances of Chrétien de Troyes,
...</p>

6.2 Quotations and Related Features

Like changes of typeface, quotation marks are conventionally used to denote several different features within a text, of which the most frequent is quotation. When possible, we recommend that the underlying feature be tagged, rather than the simple fact that quotation marks appear in the text, using the following elements:

q (quoted) contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used.
mentioned marks words or phrases mentioned, not used.
soCalled contains a word or phrase for which the author or narrator indicates a disclaiming of responsibility, for example by the use of scare quotes or italics.
gloss identifies a phrase or word used to provide a gloss or definition for some other word or phrase.

Here is a simple example of a quotation:

<p>Few dictionary makers are likely to forget Dr. Johnson's description of the
lexicographer as <q>a harmless drudge.</q>
</p>

To record how a quotation was printed (for example, in-line or set off as a display or block quotation), the rend attribute should be used. This may also be used to indicate the kind of quotation marks used.

Direct speech interrupted by a narrator can be represented simply by ending the quotation and beginning it again after the interruption, as in the following example:

<p>
<q>Who-e debel you?</q> — he at last said —
<q>you no speak-e, damme, I kill-e.</q> And so saying, the lighted tomahawk began
flourishing about me in the dark.
</p>

If it is important to convey the idea that the two q elements together make up a single speech, the linking attributes next and prev may be used, as described in section 8.3 Special kinds of Linking.

Quotations may be accompanied by a reference to the source or speaker, using the who attribute, whether or not this is explicit in the text, as in the following example:

<q who="#Wilson">Spaulding, he came
down into the office just this day eight weeks with this very paper in his hand, and he
says:—<q who="#Spaulding">I wish to the Lord, Mr. Wilson, that I was a red-headed
man.</q>
</q>

This example also demonstrates how quotations may be embedded within other quotations: one speaker (Wilson) quotes another speaker (Spaulding).

The creator of the electronic text must decide whether quotation marks are replaced by the tags or whether the tags are added and the quotation marks kept. If the quotation marks are removed from the text, the rend attribute may be used to record the way in which they were rendered in the copy text.

The full TEI Guidelines provide additional elements to distinguish direct speech, quotation, and other typical uses of quotation marks, although it is not always possible, and may not be considered desirable, to interpret the function of quotation marks in a text. For simplicity, only q (which may be used for any such case) has been included in TEI Lite.

6.3 Foreign Words or Expressions

Words or phrases which are not in the main language of the text may be tagged as such in one of two ways. If the word or phrase is already tagged for some reason, the element indicated should bear a value for the global xml:lang attribute indicating the language used. Where there is no applicable element, the element foreign may be used, again using the xml:lang attribute. For example:

<p>John has real <foreign xml:lang="fr">savoir-faire</foreign>.</p>
<p>Have you read <title xml:lang="de">Die
Dreigroschenoper</title>?</p>
<p>
<mentioned xml:lang="fr">Savoir-faire</mentioned> is French
for know-how.
</p>
<p>The court issued a writ of <term xml:lang="la">mandamus</term>.</p>

As these examples show, the foreign element should not be used to tag foreign words if some other more specific element such as title, mentioned, or term applies. The global xml:lang attribute may be attached to any element to show that it uses some other language than that of the surrounding text.

The codes used to identify languages, supplied on the xml:lang attribute, must be constructed in a particular way, and must conform to common Internet standards², as further explained in the relevant section of the TEI Guidelines. Some simple example codes for a few languages are given here:

zh	Chinese	grc	Ancient Greek
en	English	el	Greek
enm	Middle English	ja	Japanese
fr	French	la	Latin
de	German	sa	Sanskrit

7 Notes

All notes, whether printed as footnotes, endnotes, marginalia, or elsewhere, should be marked using the same element:

note contains a note or annotation.

Where possible, the body of a note should be inserted in the text at the point at which its identifier or mark first appears. This may not be possible, for example, with marginalia, which may not be anchored to an exact location. For simplicity, it may be adequate to position marginal notes before the relevant paragraph or other element. Notes may also be placed in a separate division of the text (as endnotes are in printed books) and linked to the relevant portion of the text using their target attribute.

The n attribute may be used to supply the number or identifier of a note if this is required. The resp attribute should be used consistently to distinguish between authorial and editorial notes, if the work has both kinds.

Examples:

<p>Collections are ensembles of
distinct entities or objects of any sort. <note place="foot" n="1"> We explain below why we
   use the uncommon term <mentioned>collection</mentioned> instead of the expected
<mentioned>set</mentioned>. Our usage corresponds to the <mentioned>aggregate</mentioned>
   of many mathematical writings and to the sense of <mentioned>class</mentioned> found in
   older logical writings. </note> The elements ...</p>

<lg xml:id="RAM609">
<note place="margin">The
   curse is finally expiated</note>
<l>And now this spell was snapt: once more</l>
<l>I viewed
   the ocean green,</l>
<l>And looked far forth, yet little saw</l>
<l>Of what had else been seen
   —</l>
</lg>
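
Neither example above shows a note held in a separate division. A minimal sketch of that arrangement follows; the identifier P12, the division type notes, the resp value #ED, and the wording of the note are all hypothetical:

```xml
<p xml:id="P12">Collections are ensembles of distinct entities or objects of any sort. ...</p>
<!-- ... remainder of the text ... -->
<div type="notes">
 <note n="1" resp="#ED" target="#P12">The term is discussed at greater length in the
   introduction.</note>
</div>
```

Here the target attribute links the note back to the paragraph it annotates, while resp points to a statement of who is responsible for the note.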

8 Cross References and Links

Explicit cross references or links from one point in a text to another in the same or another document may be encoded using the elements described in this section. Implicit links (such as the association between two parallel texts, or that between a text and its interpretation) may be encoded using the linking attributes discussed in section 8.3 Special kinds of Linking.

8.1 Simple Cross References

A cross reference from one point within a single document to another can be encoded using either of the following elements:

ref (reference) defines a reference to another location, possibly modified by additional text or comment.
ptr/ (pointer) defines a pointer to another location.

The difference between these two elements is that ptr is an empty element, simply marking a point from which a link is to be made, whereas ref may contain some text as well, typically identifying the target of the cross reference. The ptr element would be used for a cross reference which is to be indicated by some non-verbal means such as a symbol or icon, or in an electronic text by a button. It is also useful in document production systems, where the formatter can generate the correct verbal form of the cross reference.

The following two forms, for example, are logically equivalent:

See especially <ref target="#SEC12">section 12 on
page 34</ref>.

See especially <ptr target="#SEC12"/>.

The value of the target attribute on either element may be the identifier of some other element within the current document. The passage or phrase being pointed at must bear an identifier, and must therefore be tagged as an element of some kind. In the following example, the cross reference is to a div element:

... see especially <ptr target="#SEC12"/>. ...
<div xml:id="SEC12">
<head>Concerning Identifiers</head>

</div>

Because the xml:id attribute is global, any element in a TEI document may be pointed to in this way. In the following example, a paragraph has been given an identifier so that it may be pointed at:

... this is
discussed in <ref target="#pspec">the paragraph on links</ref> ...
<p xml:id="pspec">Links
may be made to any kind of element ...</p>

Sometimes the target of a cross reference does not correspond with any particular feature of a text, and so may not be tagged as an element of some kind. If the desired target is simply a point in the current document, the easiest way to mark it is by introducing an anchor element at the appropriate spot. If the target is some sequence of words not otherwise tagged, the seg element may be introduced to mark them. These two elements are described as follows:

anchor/ (anchor point) attaches an identifier to a point within a text, whether or not it corresponds with a textual element.
seg (arbitrary segment) represents any segmentation of text below the ‘chunk’ level.

In the following (imaginary) example, ref elements have been used to represent points in this text which are to be linked in some way to other parts of it; in the first case to a point, and in the second, to a sequence of words:

Returning to <ref target="#ABCD">the point where I
dozed off</ref>, I noticed that <ref target="#EFGH">three words</ref> had been circled in
red by a previous reader

This encoding requires that elements with the specified identifiers (ABCD and EFGH in this example) are to be found somewhere else in the current document. Assuming that no element already exists to carry these identifiers, the anchor and seg elements may be used:

....
<anchor type="bookmark" xml:id="ABCD"/> .... ....<seg type="target" xml:id="EFGH"> ...
</seg> ...

The type attribute should be used (as above) to distinguish amongst different purposes for which these general purpose elements might be used in a text. Some other uses are discussed in section 8.3 Special kinds of Linking below.

8.2 Pointing to other documents

So far, we have shown how the elements ptr and ref may be used for cross-references or links whose targets occur within the same document as their source. However, the same elements may also be used to refer to elements in any other XML document or resource, such as a document on the web, or a database component. This is possible because the value of the target attribute may be any valid Uniform Resource Identifier (URI) [A full definition of this term, defined by the W3C (the consortium which manages the development and maintenance of the World Wide Web), is beyond the scope of this tutorial: however, the most frequently encountered version of a URI is the familiar ‘URL’ used to indicate a web page, such as http://www.tei-c.org/index.xml].

A URI may reference a web page or just a part of one, for example http://www.tei-c.org/index.xml#SEC2. The sharp sign indicates that what follows it is the identifier of an element to be located within the XML document identified by what precedes it: this example will therefore locate an element which has an xml:id attribute value of SEC2 within the document retrieved from http://www.tei-c.org/index.xml. In the examples we have discussed so far, the part to the left of the sharp sign has been omitted: this is understood to mean that the referenced element is to be located within the current document.
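
As a sketch of the two cases just described, the following uses the URI given above with ref and ptr; the link text is invented for illustration, and the existence of an element with the identifier SEC2 in that document is assumed:

```xml
See <ref target="http://www.tei-c.org/index.xml#SEC2">the second section of the TEI
home page</ref>, or follow the pointer <ptr target="http://www.tei-c.org/index.xml#SEC2"/>.
Within the current document, the shorter form <ptr target="#SEC2"/> is sufficient.
```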

Parts of an XML document can be specified by means of other more sophisticated mechanisms using a special language called XPath, also defined by the W3C. This is particularly useful where the elements to be linked to do not bear identifiers and must therefore be located by some other means.

8.3 Special kinds of Linking

The following special purpose linking attributes are defined for every element in the TEI Lite scheme:

ana: links an element with its interpretation.
corresp: links an element with one or more other corresponding elements.
next: links an element to the next element in an aggregate.
prev: links an element to the previous element in an aggregate.

The ana (analysis) attribute is intended for use where a set of abstract analyses or interpretations have been defined somewhere within a document, as further discussed in section 15 Interpretation and Analysis. For example, a linguistic analysis of the sentence ‘John loves Nancy’ might be encoded as follows:

<seg type="sentence" ana="#SVO">
<seg type="lex" ana="#NP1">John</seg>
<seg type="lex" ana="#VV1">loves</seg>
<seg type="lex" ana="#NP1">Nancy</seg>
</seg>

This encoding implies the existence elsewhere in the document of elements with identifiers SVO, NP1, and VV1 where the significance of these particular codes is explained. Note the use of the seg element to mark particular components of the analysis, distinguished by the type attribute.
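
The elements being pointed at might be supplied along the following lines. This is only a sketch: it assumes the interp and interpGrp elements described in the section on interpretation and analysis, and the type value and the glosses given for each code are illustrative guesses rather than definitions taken from the source:

```xml
<interpGrp type="linguistic">
 <interp xml:id="SVO">sentence with subject-verb-object order</interp>
 <interp xml:id="NP1">noun phrase, singular</interp>
 <interp xml:id="VV1">inflected verb, present-tense singular</interp>
</interpGrp>
```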

The corresp (corresponding) attribute provides a simple way of representing some form of correspondence between two elements in a text. For example, in a multilingual text, it may be used to link translation equivalents, as in the following example:

<seg xml:lang="fr" xml:id="FR1" corresp="#EN1">Jean
aime Nancy</seg>
<seg xml:lang="en" xml:id="EN1" corresp="#FR1">John loves
Nancy</seg>

The same mechanism may be used for a variety of purposes. In the following example, it has been used to represent the correspondences between ‘the show’ and ‘Shirley’, and between ‘NBC’ and ‘the network’:

<p>
<title xml:id="shirley">Shirley</title>, which
made its Friday night debut only a month ago, was not listed on <name xml:id="nbc">NBC</name>'s new schedule, although <seg xml:id="network" corresp="#nbc">the network</seg>
says <seg xml:id="show" corresp="#shirley">the show</seg> still is being
considered.
</p>

The next and prev attributes provide a simple way of linking together the components of a discontinuous element, as in the following example:

<q xml:id="Q1a" next="#Q1b">Who-e debel you?</q> —
he at last said — <q xml:id="Q1b" prev="#Q1a">you no speak-e, damme, I kill-e.</q> And so
saying, the lighted tomahawk began flourishing about me in the dark.

9 Editorial Interventions

The process of encoding an electronic text has much in common with the process of editing a manuscript or other text for printed publication. In either case a conscientious editor may wish to record both the original state of the source and any editorial correction or other change made in it. The elements discussed in this and the next section provide some facilities for meeting these needs.

9.1 Correction and Normalization

The following elements may be used to mark correction, that is editorial changes introduced where the editor believes the original to be erroneous:

corr (correction) contains the correct form of a passage apparently erroneous in the copy text.
sic (Latin for thus or so) contains text reproduced although apparently incorrect or inaccurate.

The following elements may be used to mark normalization, that is editorial changes introduced for the sake of consistency or modernization of a text:

orig (original form) contains a reading which is marked as following the original, rather than being normalized or corrected.
reg (regularization) contains a reading which has been regularized or normalized in some sense.

As an example, consider this extract from the quarto printing of Shakespeare's Henry V.

... for his nose was as sharp as a pen and a table of green feelds

A modern editor might wish to make a number of interventions here, specifically to modernize (or normalize) the Elizabethan spellings of a' and feelds for he and fields respectively. He or she might also want to emend table to babbl'd, following an editorial tradition that goes back to the 18th-century Shakespearian scholar Lewis Theobald. The following encoding would then be appropriate:

... for his nose was as sharp as
a pen and <reg>he</reg>
<corr resp="#Theobald">babbl'd</corr> of green
<reg>fields</reg>

A more conservative or source-oriented editor, however, might want to retain the original, but at the same time signal that some of the readings it contains are in some sense anomalous:

... for his nose was as sharp as a pen and
<orig>a</orig>
<sic>table</sic> of green
<orig>feelds</orig>

Finally, a modern digital editor may decide to combine both possibilities in a single composite text, using the choice element.

choice groups a number of alternative encodings for the same point in a text.

This allows an editor to mark where alternative readings are possible:

... for his nose was as sharp as a pen and
<choice>
<orig>a</orig>
<reg>he</reg>
</choice>
<choice>
<corr resp="#Theobald">babbl'd</corr>
<sic>table</sic>
</choice> of green

<choice>
<orig>feelds</orig>
<reg>fields</reg>
</choice>

9.2 Omissions, Deletions, and Additions

In addition to correcting or normalizing words and phrases, editors and transcribers may also supply missing material, omit material, or transcribe material deleted or crossed out in the source. In addition, some material may be particularly hard to transcribe because it is hard to make out on the page. The following elements may be used to record such phenomena:

add (addition) contains letters, words, or phrases inserted in the text by an author, scribe, annotator, or corrector.
gap (gap) indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible, invisible, or inaudible.
del (deletion) contains a letter, word, or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, annotator, or corrector.
unclear contains a word, phrase, or passage which cannot be transcribed with certainty because it is illegible or inaudible in the source.

These elements may be used to record changes made by an editor, by the transcriber, or (in manuscript material) by the author or scribe. For example, if the source for an electronic text read

The following elements are provided for for simple editorial interventions.

then it might be felt desirable to correct the obvious error, but at the same time to record the deletion of the superfluous second for, thus:

The following elements are provided for <del resp="#LB">for</del> simple editorial interventions.

The attribute value #LB on the resp attribute is used to point to a fuller definition (typically in a respStmt element) for the agency responsible for correcting the duplication of for.

If the source read

The following elements provided for simple editorial interventions.

(i.e. if the verb had been inadvertently dropped) then the corrected text might read:

The following elements <add resp="#LB">are</add> provided for simple editorial interventions.

These elements are not limited to changes made by an editor; they can also be used to record authorial changes in manuscripts. A manuscript in which the author has first written ‘How it galls me, what a galling shadow’, then crossed out the word galls and inserted dogs might be encoded thus:

How it <del hand="#DHL" type="overstrike">galls</del>
<add hand="#DHL" place="supralinear">dogs</add> me, what a galling shadow

Again, the code #DHL points to another location where more information about the hand concerned is to be found³.

Similarly, the unclear and gap elements may be used together to indicate the omission of illegible material; the following example also shows the use of add for a conjectural emendation:

One hundred
&amp; twenty good regulars joined to me <unclear>
<gap reason="indecipherable"/>
</unclear>
&amp; instantly, would aid me signally <add hand="#ed">in?</add> an enterprise against
Wilmington.

The del element marks material which is transcribed as part of the electronic text despite being marked as deleted, while gap marks the location of material which is omitted from the electronic text, whether it is legible or not. A language corpus, for example, might omit long quotations in foreign languages:

<p> ... An example of a list appearing in a fief
ledger of <name type="place">Koldinghus</name>
<date>1611/12</date> is given below. It shows cash income from a sale of
honey.</p>
<gap>
<desc>quotation from ledger (in Danish)</desc>
</gap>
<p>A description of the
overall structure of the account is once again ... </p>

Other corpora (particularly those constructed before the widespread use of scanners) systematically omit figures and mathematics:

<p>At the bottom of your screen below the mode line is the <term>minibuffer</term>. This is
the area where Emacs echoes the commands you enter and where you specify filenames for Emacs
to find, values for search and replace, and so on. <gap reason="graphic">
<desc>diagram of
Emacs screen</desc>
</gap>
</p>

9.3 Abbreviations and their Expansion

Like names, dates, and numbers, abbreviations may be transcribed as they stand or expanded; they may be left unmarked, or encoded using the following elements:

abbr (abbreviation) contains an abbreviation of any sort.
expan (expansion) contains the expansion of an abbreviation.

The abbr element is useful as a means of distinguishing semi-lexical items such as acronyms or jargon:

We can sum up the above
discussion as follows: the identity of a <abbr>CC</abbr> is defined by that calibration of
values which motivates the elements of its <abbr>GSP</abbr>;

Every manufacturer of <abbr>3GL</abbr> or
<abbr>4GL</abbr> languages is currently nailing on <abbr>OOP</abbr> extensions

The type attribute may be used to distinguish types of abbreviation by their function.

The expan element is used to mark an expansion supplied by an encoder. This element is particularly useful in the transcription of manuscript materials. For example, the character p with a bar through its descender as a conventional representation for the word per is commonly encountered in medieval European manuscripts. An encoder may choose to expand this as follows:
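
The example is not reproduced at this point. A minimal sketch, assuming the encoder records only the expanded form and not the barred character itself, would be:

```xml
<expan>per</expan>
```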

The expansion corresponding with an abbreviated form may not always contain the same letters as the abbreviation. Where it does, however, common editorial practice is to italicize or otherwise signal which letters have been supplied. The expan element should not be used for this purpose since its function is to indicate an expanded form, not a part of one. For example, consider the common abbreviation wt (for with) found in medieval texts. In a modern edition, an editor might wish to represent this as ‘with’, italicising the letters not found in the source. One simple means of achieving that would be an encoding such as the following:
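
The encoding itself does not appear at this point. One possibility, offered as a sketch rather than as the original example, wraps the whole expanded form in expan and uses hi to flag the supplied letters; the rend value italic is an assumption:

```xml
<expan>w<hi rend="italic">i</hi>t<hi rend="italic">h</hi>
</expan>
```

Note that expan still encloses the complete expanded word; only hi is used to single out the individual letters.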

To record both an abbreviation and its expansion, the choice element mentioned above may be used to group the abbreviated form with its proposed expansion:
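
No example follows at this point. A minimal sketch, reusing the wt abbreviation discussed above and following the same pattern as the choice examples in the section on correction and normalization, might be:

```xml
<choice>
 <abbr>wt</abbr>
 <expan>with</expan>
</choice>
```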

10 Names, Dates, and Numbers

The TEI scheme defines elements for a large number of ‘data-like’ features which may appear almost anywhere within almost any kind of text. These features may be of particular interest in a range of disciplines; they all relate to objects external to the text itself, such as the names of persons and places, numbers and dates. They also pose particular problems for many natural language processing (NLP) applications because of the variety of ways in which they may be presented within a text. The elements described here, by making such features explicit, reduce the complexity of processing texts containing them.

10.1 Names and Referring Strings

A referring string is a phrase which refers to some person, place, object, etc. Two elements are provided to mark such strings:

rs (referencing string) contains a general purpose name or referring string.
name (name, proper noun) contains a proper noun or noun phrase.

The type attribute is used to distinguish amongst (for example) names of persons, places and organizations, where this is possible:

<q>My dear <rs type="person">Mr. Bennet</rs>, </q>
said his lady to him one day,
<q>have you heard that <rs type="place">Netherfield Park</rs>
is let at last?</q>

It being one of the principles of the <rs type="organization">Circumlocution Office</rs> never, on any account whatsoever, to give a
straightforward answer, <rs type="person">Mr Barnacle</rs> said,
<q>Possibly.</q>

As the following example shows, the rs element may be used for any reference to a person, place, etc, not necessarily one in the form of a proper noun or noun phrase.

<q>My dear <rs type="person">Mr. Bennet</rs>,</q>
said <rs type="person">his lady</rs> to him one day...

The name element by contrast is provided for the special case of referencing strings which consist only of proper nouns; it may be used synonymously with the rs element, or nested within it if a referring string contains a mixture of common and proper nouns.
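
The nested case might be sketched as follows; the phrase is invented for illustration, and the use of type on both elements is an assumption rather than a requirement:

```xml
<rs type="place">the small parish of <name type="place">Montaillou</name>
</rs>
```

Here the whole phrase is a referring string, while only the proper noun within it is marked as a name.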

Simply tagging something as a name is rarely enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. Moreover, name prefixes such as van or de la may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer.

The key attribute provides an alternative normalized identifier for the object being named, like a database record key. It may thus be useful as a means of gathering together all references to the same individual or location scattered throughout a document:

<q>My dear <rs type="person" key="BENM1">Mr.
Bennet</rs>, </q> said <rs type="person" key="BENM2">his lady</rs> to him one day,
<q>have
you heard that <rs type="place" key="NETP1">Netherfield Park</rs> is let at
last?</q>

This use should be distinguished from the case of the reg (regularization) element, which provides a means of marking the standard form of a referencing string as demonstrated below:

<name type="person" key="WADLM1">
<choice>
<orig>Walter de la Mare</orig>
<reg>de la Mare, Walter</reg>
</choice>
</name> was
born at <name key="Ch1" type="place">Charlton</name>, in <name key="KT1" type="county">Kent</name>, in 1873.

The index element discussed in section 16.3 Index Generation may be more appropriate if the function of the regularization is to provide a consistent index:

<p>
<name type="place">Montaillou</name> is not a
large parish. At the time of the events which led to <name type="person">Fournier</name>'s
<index>
<term>Benedict XII, Pope of Avignon (Jacques Fournier)</term>
</index>
investigations, the local population consisted of between 200 and 250
inhabitants.
</p>

Although adequate for many simple applications, these methods have two inconveniences: if the name occurs many times, then its regularised form must be repeated many times; and the burden of additional XML markup in the body of the text may be inconvenient to maintain and complex to process. For applications such as onomastics, relating to persons or places named rather than the name itself, or wherever a detailed analysis of the component parts of a name is needed, the full TEI Guidelines provide a range of other solutions.

10.2 Dates and Times

Tags for the more detailed encoding of times and dates include the following:

date contains a date in any format.
time contains a phrase defining a time of day in any format.

These elements have a number of attributes which can be used to provide normalised versions of their values.

att.datable provides attributes for normalization of elements that contain dates, times, or datable events.

calendar	indicates the system or calendar to which the date represented by the content of this element belongs.
period	supplies a pointer to some location defining a named period of time within which the datable item is understood to have occurred.
when	supplies the value of the date or time in a standard form, e.g. yyyy-mm-dd.

The when attribute specifies a normalized form for the date or time, using one of the standard formats defined by ISO 8601. Partial dates or times (e.g. ‘1990’, ‘September 1990’, ‘twelvish’) can be expressed by omitting a part of the value supplied, as in the following examples:

<date when="1980-02-21">21
Feb 1980</date>
<date when="1990">1990</date>
<date when="1990-09">September 1990</date>
<date when="--09">September</date>
<date when="2001-09-11T12:48:00">Sept 11th, 12 minutes before 9
am</date>

Note in the last example the use of a normalized representation for the date string which includes a time: this example could thus equally well be tagged using the time element.

Given on the <date when="1977-06-12">Twelfth
Day of June in the Year of Our Lord One Thousand Nine Hundred and Seventy-seven of the
Republic the Two Hundredth and first and of the University the Eighty-Sixth.</date>

<l>specially when it's nine below zero</l>
<l>and <time when="15:00:00">three o'clock in the afternoon</time>
</l>

10.3 Numbers

Numbers can be written with either letters or digits (twenty-one, xxi, and 21) and their presentation is language-dependent (e.g. English 5th becomes Greek 5.; English 123,456.78 equals French 123.456,78). In natural-language processing or machine-translation applications, it is often helpful to distinguish them from other, more ‘lexical’ parts of the text. In other applications, the ability to record a number's value in standard notation is important. The num element provides this possibility:

num (number) contains a number, written in any form.

For example:

<num value="33">xxxiii</num>
<num type="cardinal" value="21">twenty-one</num>
<num type="percentage" value="10">ten percent</num>
<num type="percentage" value="10">10%</num>
<num type="ordinal" value="5">5th</num>
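
The same value can be recorded for differently formatted numerals. The following sketch assumes that the value attribute accepts a decimal written with a full stop, and uses xml:lang only to label the French form; the figures are those quoted at the start of this section.

<num value="123456.78">123,456.78</num>
<num xml:lang="fr" value="123456.78">123.456,78</num>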

11 Lists

The element list is used to mark any kind of list. A list is a sequence of text items, which may be numbered, bulleted, or arranged as a glossary list. Each item may be preceded by an item label (in a glossary list, this label is the term being defined):

list (list) contains any sequence of items organized as a list.
item contains one component of a list.
label contains any label or heading used to identify part of a text, typically but not exclusively in a list or glossary.

Individual list items are tagged with item. The first item may optionally be preceded by a head, which gives a heading for the list. The numbering of a list may be omitted, indicated using the n attribute on each item, or (rarely) tagged as content using the label element. The following are all thus equivalent:

<list>
<head>A short list</head>
<item>First item in list.</item>
<item>Second item in list.</item>
<item>Third item in list.</item>
</list>
<list>
<head>A short list</head>
<item n="1">First item in list.</item>
<item n="2">Second item in list.</item>
<item n="3">Third item in list.</item>
</list>
<list>
<head>A short list</head>
<label>1</label>
<item>First item in list.</item>
<label>2</label>
<item>Second item in list.</item>
<label>3</label>
<item>Third item in list.</item>
</list>

The styles should not be mixed in the same list.

A simple two-column table may be treated as a glossary list, tagged <list type="gloss">. Here, each item comprises a term and a gloss, marked with label and item respectively. These correspond to the elements term and gloss, which can occur anywhere in prose text.

<list type="gloss">
<head>Vocabulary</head>
<label xml:lang="enm">nu</label>
<item>now</item>
<label xml:lang="enm">lhude</label>
<item>loudly</item>
<label xml:lang="enm">bloweth</label>
<item>blooms</item>
<label xml:lang="enm">med</label>
<item>meadow</item>
<label xml:lang="enm">wude</label>
<item>wood</item>
<label xml:lang="enm">awe</label>
<item>ewe</item>
<label xml:lang="enm">lhouth</label>
<item>lows</item>
<label xml:lang="enm">sterteth</label>
<item>bounds, frisks</item>
<label xml:lang="enm">verteth</label>
<item xml:lang="la">pedit</item>
<label xml:lang="enm">murie</label>
<item>merrily</item>
<label xml:lang="enm">swik</label>
<item>cease</item>
<label xml:lang="enm">naver</label>
<item>never</item>
</list>

Where the internal structure of a list item is more complex, it may be preferable to regard the list as a table, for which special-purpose tagging is defined below (13 Tables).

Lists of whatever kind can, of course, nest within list items to any depth required. Here, for example, a glossary list contains two items, each of which is itself a simple list:

<list type="gloss">
<label>EVIL</label>
<item>
  <list type="simple">
   <item>I am cast upon a horrible desolate island, void of all hope of recovery.</item>
   <item>I am singled out and separated as it were from all the world to be miserable.</item>
   <item>I am divided from mankind — a solitaire; one banished from human society.</item>
  </list>
</item>
<label>GOOD</label>
<item>
  <list type="simple">
   <item>But I am alive; and not drowned, as all my ship's company were.</item>
   <item>But I am singled out, too, from all the ship's crew, to be spared from
       death...</item>
   <item>But I am not starved, and perishing on a barren place, affording no
       sustenances....</item>
  </list>
</item>
</list>

A list need not necessarily be displayed in list format. For example,

<p>On those remote pages it is written that animals
are divided into <list rend="run-on">
  <item n="a">those that belong to the Emperor,</item>
  <item n="b"> embalmed ones, </item>
  <item n="c"> those that are trained, </item>
  <item n="d"> suckling pigs, </item>
  <item n="e"> mermaids, </item>
  <item n="f"> fabulous ones, </item>
  <item n="g"> stray dogs, </item>
  <item n="h"> those that are included in this classification, </item>
  <item n="i"> those that tremble as if they were mad, </item>
  <item n="j"> innumerable ones, </item>
  <item n="k"> those drawn with a very fine camel's-hair brush, </item>
  <item n="l"> others, </item>
  <item n="m"> those that have just broken a flower vase, </item>
  <item n="n"> those that resemble flies from a distance.</item>
</list>
</p>

Lists of bibliographic items should be tagged using the listBibl element, described in the next section.

12 Bibliographic Citations

It is often useful to distinguish bibliographic citations where they occur within texts being transcribed for research, if only so that they will be properly formatted when the text is printed out. The element bibl is provided for this purpose. Where the components of a bibliographic reference are to be distinguished, the following elements may be used as appropriate. It is generally useful to mark at least those parts (such as the titles of articles, books, and journals) which will need special formatting. The other elements are provided for cases where particular interest attaches to such details.

bibl (bibliographic citation) contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
author in a bibliographic reference, contains the name(s) of an author, personal or corporate, of a work; for example in the same form as that provided by a recognized bibliographic name authority.
biblScope (scope of citation) defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work.
date contains a date in any format.
editor secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution or organization, (or of several such) acting as editor, compiler, translator, etc.
publisher provides the name of the organization responsible for the publication or distribution of a bibliographic item.
pubPlace (publication place) contains the name of the place where a bibliographic item was published.
title contains a title for any kind of work.

For example, the following editorial note might be transcribed as shown:

He was a member of Parliament for Warwickshire in 1445, and died March 14, 1470 (according to Kittredge, Harvard Studies 5. 88ff).

He was a member of Parliament for Warwickshire
in 1445, and died March 14, 1470 (according to <bibl>
<author>Kittredge</author>,
<title>Harvard Studies</title>
<biblScope>5. 88ff</biblScope>
</bibl>).

For lists of bibliographic citations, the listBibl element should be used; it may contain a series of bibl elements.
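
A minimal sketch of such a list, reusing the citation transcribed above; the second entry is a placeholder with invented content, included only to show that bibl elements simply follow one another.

<listBibl>
<bibl>
<author>Kittredge</author>,
<title>Harvard Studies</title>
<biblScope>5. 88ff</biblScope>
</bibl>
<bibl>
<author>An Author</author>,
<title>A Title</title>
</bibl>
</listBibl>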

13 Tables

Tables represent a challenge for any text processing system, but simple tables, at least, appear in so many texts that even in the simplified TEI tag set presented here, markup for tables is necessary. The following elements are provided for this purpose:

table contains text displayed in tabular form, in rows and columns.
row contains one row of a table.
cell contains one cell of a table.

For example, Defoe uses mortality tables like the following in the Journal of the Plague Year to show the rise and ebb of the epidemic:

<p>It was indeed coming on amain, for the burials
that same week were in the next adjoining parishes thus:— <table rows="3" cols="4">
  <row role="data">
   <cell role="label">St. Leonard's, Shoreditch</cell>
   <cell>64</cell>
   <cell>84</cell>
   <cell>119</cell>
  </row>
  <row role="data">
   <cell role="label">St. Botolph's, Bishopsgate</cell>
   <cell>65</cell>
   <cell>105</cell>
   <cell>116</cell>
  </row>
  <row role="data">
   <cell role="label">St. Giles's, Cripplegate</cell>
   <cell>213</cell>
   <cell>421</cell>
   <cell>554</cell>
  </row>
</table>
</p>
<p>This shutting up of houses was at first counted a very cruel and unchristian
method, and the poor people so confined made bitter lamentations. ... </p>

14 Figures and Graphics

Not all the components of a document are necessarily textual. The most straightforward text will often contain diagrams or illustrations, to say nothing of documents in which image and text are inextricably intertwined, or electronic resources in which the two are complementary.

The encoder may simply record the presence of a graphic within the text, possibly with a brief description of its content, and may also provide a link to a digitized version of the graphic, using the following elements:

graphic indicates the location of an inline graphic, illustration, or figure.
figure groups elements representing or containing graphic information such as an illustration, formula, or figure.
figDesc (description of figure) contains a brief prose description of the appearance or content of a graphic figure, for use when documenting an image without displaying it.

Any textual information accompanying the graphic, such as a heading and/or caption, may be included within the figure element itself, in a head and one or more p elements, as also may any text appearing within the graphic itself. It is strongly recommended that a prose description of the image be supplied, as the content of a figDesc element, for the use of applications which are not able to render the graphic, and to render the document accessible to vision-impaired readers. (Such text is not normally considered part of the document proper.)

The simplest use for these elements is to mark the position of a graphic and provide a link to it, as in this example:
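
A minimal sketch of such an encoding, reconstructed from the description that follows; the n values on the two pb elements are assumed:

<pb n="412"/>
<graphic url="p412fig.png"/>
<pb n="413"/>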

This indicates that the graphic contained by the file p412fig.png appears between pages 412 and 413.

The graphic element can appear anywhere that textual content is permitted, within but not between paragraphs or headings. In the following example, the encoder has decided to treat a specific printer's ornament as a heading:
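
A minimal sketch of this arrangement, in which ornament.png is a hypothetical file name standing in for the image of the ornament:

<head>
<graphic url="ornament.png"/>
</head>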

More usually, a graphic will have at the least an identifying title, which may be encoded using the head element, or a number of figures may be grouped together in a particular structure. It is also often convenient to include a brief description of the image. The figure element provides a means of wrapping one or more such elements together as a kind of graphic ‘block’:

<figure>
<graphic url="fessipic.png"/>
<head>Mr Fezziwig's Ball</head>
<figDesc>A Cruikshank
engraving showing Mr Fezziwig leading a group of revellers.</figDesc>
</figure>

These cases should be carefully distinguished from the case where an encoded text is complemented by a collection of digital images, maintained as a distinct resource. The facs attribute may be used to associate any element in an encoded text with a digital facsimile of it. In the simple case where only page images are available, the facs attribute on the pb element may be used to associate each image with an appropriate point in the text:
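
A minimal sketch of this arrangement, assuming image files named page1.png and page2.png and page numbers supplied with the n attribute:

<pb facs="page1.png" n="1"/>
<!-- text contained on page 1 is encoded here -->
<pb facs="page2.png" n="2"/>
<!-- text contained on page 2 is encoded here -->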

This method is only appropriate in the simple case where each digital image file page1.png etc. corresponds with a single transcribed and encoded page. If more detailed alignment of image and transcription is required, for example because the image files actually represent double page spreads, more sophisticated mechanisms are provided in the full TEI Guidelines.

15 Interpretation and Analysis

It is often said that all markup is a form of interpretation or analysis. While it is certainly difficult, and may be impossible, to distinguish firmly between ‘objective’ and ‘subjective’ information in any universal way, it remains true that judgments concerning the latter are typically regarded as more likely to provoke controversy than those concerning the former. Many scholars therefore prefer to record such interpretations only if it is possible to alert the reader that they are considered more open to dispute than the rest of the markup. This section describes some of the elements provided by the TEI scheme to meet this need.

15.1 Orthographic Sentences

Interpretation typically ranges across the whole of a text, with no particular respect to other structural units. A useful preliminary to intensive interpretation is therefore to segment the text into discrete and identifiable units, each of which can then bear a label for use as a sort of ‘canonical reference’. To facilitate such uses, these units may not cross each other, nor nest within each other. They may conveniently be represented using the following element:

s (s-unit) contains a sentence-like division of a text.

As the name suggests, the s element is most commonly used (in linguistic applications at least) for marking orthographic sentences, that is, units defined by orthographic features such as punctuation. For example, the passage from Jane Eyre discussed earlier might be divided into s-units as follows:

<pb n="474"/>
<div type="chapter" n="38">
<p>
  <s n="001">Reader, I married him.</s>
  <s n="002">A quiet wedding we had:</s>
  <s n="003">he
     and I, the parson and clerk, were alone present.</s>
  <s n="004">When we got back from
     church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and
     John cleaning the knives, and I said —</s>
</p>
<p>
  <q>
   <s n="005">Mary, I have been married to Mr Rochester this morning.</s>
  </q> ... </p>
</div>

Note that s elements cannot nest: the beginning of one s element implies that the previous one has finished. When s-units are tagged as shown above, it is advisable to tag the entire text end-to-end, so that every word in the text being analysed will be contained by exactly one s element, whose identifier can then be used to specify a unique reference for it. If the identifiers used are unique within the document, then the xml:id attribute might be used in preference to the n used in the above example.
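
As a sketch of that alternative, the first two s-units might carry xml:id values instead of n. The identifier scheme (JE38.001 and so on) is invented for illustration; it begins with letters because an XML identifier may not start with a digit.

<s xml:id="JE38.001">Reader, I married him.</s>
<s xml:id="JE38.002">A quiet wedding we had:</s>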

15.2 Words and punctuation

Tokenization, that is, the identification of lexical or non-lexical tokens within a text, is a very common requirement for all kinds of textual analysis, and not an entirely trivial one. The decision as to whether, for example, ‘can't’ in English or ‘du’ in French should be treated as one word or two is not simple. Consequently it is often useful to make explicit the preferred tokenization in a marked up text. The following elements are available for this purpose:

w (word) represents a grammatical (not necessarily orthographic) word.
pc (punctuation character) a character or string of characters regarded as constituting a single punctuation mark.

For example, the output from a part of speech tagger might be recorded in TEI Lite as follows:

<s n="1">
<w ana="#NP0">Marley</w>
<w ana="#VBD">was</w>
<w ana="#AJ0">dead</w>
<pc>:</pc>
<w ana="#TO0">to</w>
<w ana="#VBB">begin</w>
<w ana="#PRP">with</w>
<pc>. </pc>
</s>

In this example, each word has been decorated with an automatically generated part of speech code, using the ana attribute discussed in section 8.3 Special kinds of Linking above. The w element also allows each word to be associated with a root form or lemma, either explicitly using the lemma attribute, or by reference, using the lemmaRef attribute, as in this example:

...<w ana="#VBD" lemma="be" lemmaRef="http://www.myLexicon.com/be">was</w> ...

15.3 General-Purpose Interpretation Elements

The w element is a specialisation of the seg element, which has already been introduced for use in identifying otherwise unmarked targets of cross references and hypertext links (see section 8 Cross References and Links). The seg element identifies some phrase-level portion of text to which the encoder may assign a user-specified type, as well as a unique identifier; it may thus be used to tag textual features for which there is no other provision in the published TEI Guidelines.

For example, the Guidelines provide no ‘apostrophe’ element to mark parts of a literary text in which the narrator addresses the reader (or hearer) directly. One approach might be to regard these as instances of the q element, distinguished from others by an appropriate value for the who attribute. A possibly simpler, and certainly more general, solution would however be to use the seg element as follows:

<div type="chapter" n="38">
<p>
<seg type="apostrophe">Reader, I married him.</seg> A quiet wedding we had: ...</p>
</div>

The type attribute on the seg element can take any value, and so can be used to record phrase-level phenomena of any kind; it is good practice to record the values used and their significance in the header.
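
For comparison, the first approach mentioned above, using q with a who attribute, might be sketched as follows. The pointer #narrator is hypothetical and assumes that an element with that identifier is declared elsewhere in the document.

<q who="#narrator">Reader, I married him.</q>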

A seg element of one type (unlike the s element which it superficially resembles) can be nested within a seg element of the same or another type. This enables quite complex structures to be represented; some examples were given in section 8.3 Special kinds of Linking above. However, because it must respect the requirement that elements be properly nested and may not cut across each other, it cannot cope with the common requirement to associate an interpretation with arbitrary segments of a text which may completely ignore the document hierarchy. It also requires that the interpretation itself be represented by a single coded value in the type attribute.

Neither restriction applies to the interp element, which provides powerful features for the encoding of quite complex interpretive information in a relatively straightforward manner.

interp (interpretation) summarizes a specific interpretative annotation which can be linked to a span of text.
interpGrp (interpretation group) collects together a set of related interpretations which share responsibility or type.

These elements allow the encoder to specify both the class of an interpretation, and the particular instance of that class which the interpretation involves. Thus, whereas with seg one can say simply that something is an apostrophe, with interp one can say that it is an instance (apostrophe) of a larger class (rhetorical figures).

Moreover, interp does not enclose the passage it describes; it must be linked to the passage to which it applies either by means of the ana attribute discussed in section 8.3 Special kinds of Linking above, or by means of its own inst attribute. This means that any kind of analysis can be represented, with no need to respect the document hierarchy, and also facilitates the grouping of analyses of a particular type together. A special purpose interpGrp element is provided for the latter purpose.

For example, suppose that you wish to mark such diverse aspects of a text as themes or subject matter, rhetorical figures, and the locations of individual scenes of the narrative. Different portions of our sample passage from Jane Eyre for example, might be associated with the rhetorical figures of apostrophe, hyperbole, and metaphor; with subject-matter references to churches, servants, cooking, postal service, and honeymoons; and with scenes located in the church, in the kitchen, and in an unspecified location (drawing room?).

These interpretations could be placed anywhere within the text element; it is however good practice to put them all in the same place (e.g. a separate section of the front or back matter), as in the following example:

<back>
<div type="Interpretations">
  <p>
   <interp xml:id="fig-apos-1" resp="#LB-MSM" type="figureOfSpeech">apostrophe</interp>
   <interp xml:id="fig-hyp-1" resp="#LB-MSM" type="figureOfSpeech">hyperbole</interp>
   <interp xml:id="set-church-1" resp="#LB-MSM" type="setting">church</interp>
   <interp xml:id="ref-church-1" resp="#LB-MSM" type="reference">church</interp>
   <interp xml:id="ref-serv-1" resp="#LB-MSM" type="reference">servants</interp>
  </p>
</div>
</back>

The evident redundancy of this encoding can be considerably reduced by using the interpGrp element to group together all those interp elements which share common attribute values, as follows:

<back>
<div type="Interpretations">
  <p>
   <interpGrp type="figureOfSpeech" resp="#LB-MSM">
    <interp xml:id="fig-apos">apostrophe</interp>
    <interp xml:id="fig-hyp">hyperbole</interp>
    <interp xml:id="fig-meta">metaphor</interp>
   </interpGrp>
   <interpGrp type="scene-setting" resp="#LB-MSM">
    <interp xml:id="set-church">church</interp>
    <interp xml:id="set-kitch">kitchen</interp>
    <interp xml:id="set-unspec">unspecified</interp>
   </interpGrp>
   <interpGrp type="reference" resp="#LB-MSM">
    <interp xml:id="ref-church">church</interp>
    <interp xml:id="ref-serv">servants</interp>
    <interp xml:id="ref-cook">cooking</interp>
   </interpGrp>
  </p>
</div>
</back>

Once these interpretation elements have been defined, they can be linked with the parts of the text to which they apply in either or both of two ways. The ana attribute can be used on whichever element is appropriate:

<div type="chapter" n="38">
<p xml:id="P38.1" ana="#set-church #set-kitch">
<s xml:id="P38.1.1" ana="#fig-apos">Reader, I
married him.</s>
</p>
</div>

Note in this example that since the paragraph has two settings (in the church and in the kitchen), the identifiers of both have been supplied.

Alternatively, the interp elements can point to all the parts of the text to which they apply, using their inst attribute:

<interp
  xml:id="fig-apos-2"
  type="figureOfSpeech"
  resp="#LB-MSM"
  inst="#P38.1.1">apostrophe</interp>
<interp
  xml:id="set-church-2"
  type="scene-setting"
  inst="#P38.1"
  resp="#LB-MSM">church</interp>
<interp
  xml:id="set-kitchen-2"
  type="scene-setting"
  inst="#P38.1"
  resp="#LB-MSM">kitchen</interp>

The interp element is not limited to any particular type of analysis. The literary analysis shown above is but one possibility; one could equally well use interp to capture a linguistic part-of-speech analysis. For instance, the example sentence given in section 8.3 Special kinds of Linking assumes a linguistic analysis which might be represented as follows:

<interp xml:id="NP1" type="pos">noun
phrase, singular</interp>
<interp xml:id="VV1" type="pos">inflected verb, present-tense
singular</interp> ...
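
The part-of-speech codes used with the ana attribute in section 15.2 Words and punctuation could be declared in the same way. The sketch below assumes that the codes follow the CLAWS C5 tagset; the glosses are paraphrases, and only three of the codes are shown.

<interpGrp type="pos">
<interp xml:id="NP0">proper noun</interp>
<interp xml:id="VBD">past tense form of the verb be</interp>
<interp xml:id="AJ0">adjective, general or positive</interp>
</interpGrp>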

16 Technical Documentation

Although the focus of this document is on the use of the TEI scheme for the encoding of existing ‘pre-electronic’ documents, the same scheme may also be used for the encoding of new documents. In the preparation of new documents (such as this one), XML has much to recommend it: the document's structure can be clearly represented, and the same electronic text can be re-used for many purposes — to provide both online hypertext or browsable versions and well-formatted typeset versions from a common source for example.

To facilitate this, the TEI Lite schema includes some elements for marking features of technical documents in general, and of XML-related documents in particular.

16.1 Additional Elements for Technical Documents

The following elements may be used to mark particular features of technical documents:

eg (example) contains any kind of illustrative example.
code contains literal code from some formal language such as a programming language.
ident (identifier) contains an identifier or name for an object of some kind in a formal language. ident is used for tokens such as variable names, class names, type names, function names etc. in formal programming languages.
gi (element name) contains the name (generic identifier) of an element.
att (attribute) contains the name of an attribute appearing within running text.
formula contains a mathematical or other formula.
val (value) contains a single attribute value.

The following example shows how these elements might be used to encode a passage from a tutorial introducing the Fortran programming language:

<p>It is traditional to introduce a language with a
program like the following: <eg>
CHAR*12 GRTG
GRTG = 'HELLO WORLD'
PRINT *, GRTG
END
</eg>
</p>
<p>This simple example first declares a variable <ident>GRTG</ident>, in the line
<code>CHAR*12 GRTG</code>, which identifies <ident>GRTG</ident> as consisting of 12 bytes
of type <ident>CHAR</ident>. To this variable, the value <val>HELLO WORLD</val> is then
assigned.</p>

A formatting application, given a text like that above, can be instructed to format examples appropriately (e.g. to preserve line breaks, or to use a distinctive font). Similarly, the use of tags such as ident greatly facilitates the construction of a useful index.

The formula element should be used to enclose a mathematical or chemical formula presented within the text as a distinct item. Since formulae generally include a large variety of special typographic features not otherwise present in ordinary text, it will usually be necessary to present the body of the formula in a specialized notation. The notation used should be specified by the notation attribute, as in the following example:

<formula notation="tex"> \begin{math}E =
mc^{2}\end{math} </formula>

A particular problem arises when XML encoding is the subject of discussion within a technical document, itself encoded in XML. In such a document, it is essential to distinguish clearly the markup occurring within examples from that marking up the document itself, and end-tags are highly likely to occur. One simple solution is to use the predefined entity reference &lt; to represent each < character which marks the start of an XML tag within the examples. A more general solution is to mark off the whole body of each example as containing data which is not to be scanned for XML mark-up by the parser. This is achieved by enclosing it within a special XML construct called a CDATA marked section, as in the following example:

<p>A list should be encoded as follows: <eg><![CDATA[
<list>
<item>First item in the list</item>
<item>Second item</item>
</list>
]]></eg> The <gi>list</gi> element consists of a series of <gi>item</gi> elements.</p>

The list element used within the example above will not be regarded as forming part of the document proper, because it is embedded within a marked section (beginning with the special markup declaration <![CDATA[ , and ending with ]]>).

Note also the use of the gi element to tag references to element names (or generic identifiers) within the body of the text.
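
For comparison, the simpler solution mentioned above, in which each < that opens a tag inside the example is written as &lt;, might be sketched for the same list as follows:

<p>A list should be encoded as follows: <eg>
&lt;list>
&lt;item>First item in the list&lt;/item>
&lt;item>Second item&lt;/item>
&lt;/list>
</eg> The <gi>list</gi> element consists of a series of <gi>item</gi> elements.</p>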

16.2 Generated Divisions

Most modern document production systems have the ability to generate automatically whole sections such as a table of contents or an index. The TEI Lite scheme provides an element to mark the location at which such a generated section should be placed.

divGen (automatically generated text division) indicates the location at which a textual division generated automatically by a text-processing application is to appear.

The divGen element can be placed anywhere that a division element would be legal, as in the following example:

<front>
<titlePage>

</titlePage>
<divGen type="toc"/>
<div>
<head>Preface</head>

</div>
</front>
<body>

</body>
<back>
<div>
<head>Appendix</head>

</div>
<divGen type="index" n="Index"/>
</back>

This example also demonstrates the use of the type attribute to distinguish the different kinds of division to be generated: in the first case a table of contents (a toc) and in the second an index.

When an existing index or table of contents is to be encoded (rather than one being generated) for some reason, the list element discussed in section 11 Lists should be used.
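
A minimal sketch of an existing table of contents transcribed in this way; the div type value and the entries are assumed, echoing the divisions of the example above:

<div type="contents">
<head>Contents</head>
<list>
<item>Preface</item>
<item>Appendix</item>
</list>
</div>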

16.3 Index Generation

While production of a table of contents from a properly tagged document is generally unproblematic for an automatic processor, the production of a good quality index will often require more careful tagging. It may not be enough simply to produce a list of all parts tagged in some particular way, although extracting (for example) all occurrences of elements such as term or name will often be a good point of departure for an index.

The TEI schema provides a special purpose index tag which may be used to mark both the parts of the document which should be indexed, and how the indexing should be done.

index (index entry) marks a location to be indexed for whatever purpose.

For example, the second paragraph of this section might include the following:

... TEI Lite also provides a special purpose
<gi>index</gi> tag
<index>
<term>indexing</term>
</index>
<index>
<term>index (tag)</term>
<index>
<term>use in index generation</term>
</index>
</index>
which may be used ...

The index element can also be used to provide a form of interpretive or analytic information. For example, in a study of Ovid, it might be desired to record all the poet's references to different figures, for comparative stylistic study. In the following lines of the Metamorphoses, such a study would record the poet's references to Jupiter (as deus, se, and as the subject of confiteor [in inflectional form number 227]), to Jupiter-in-the-guise-of-a-bull (as imago tauri fallacis and the subject of teneo), and so on.⁴

<l n="3.001">iamque deus posita fallacis
imagine tauri</l>
<l n="3.002">se confessus erat Dictaeaque rura tenebat</l>

This need might be met using the note element discussed in section 7 Notes, or with the interp element discussed in section 15 Interpretation and Analysis. Here we demonstrate how it might also be satisfied by using the index element.

We assume that the object is to generate more than one index: one for names of deities (called dn), another for onomastic references (called on), a third for pronominal references (called pr) and so forth. One way of achieving this might be as follows:

<l n="3.001">iamque deus posita
fallacis imagine tauri <index indexName="dn">
  <term>Iuppiter</term>
  <index>
   <term>deus</term>
  </index>
</index>
<index indexName="on">
  <term>Iuppiter (taurus)</term>
  <index>
   <term>imago tauri
       fallacis</term>
  </index>
</index>
</l>
<l n="3.002">se confessus erat Dictaeaque rura tenebat
<index indexName="pr">
  <term>Iuppiter</term>
  <index>
   <term>se</term>
  </index>
</index>
<index indexName="v">
  <term>Iuppiter</term>
  <index>
   <term>confiteor
       (v227)</term>
  </index>
</index>
</l>

For each index element above, an entry will be generated in the appropriate index, using as headword the content of the term element it contains; the term elements nested within the secondary index element in each case provide a secondary keyword. The actual reference will be taken from the context in which the index element appears, i.e. in this case the identifier of the l element containing it.
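How the generated index is laid out is left to the processing software. As a purely illustrative sketch (the layout is an assumption, not something prescribed by TEI Lite), the dn and pr indexes produced from the two lines above might look something like this, with the reference taken here from the n attribute of the enclosing l element:

```xml
<!-- hypothetical output of an index-generating processor -->
<div type="index" n="dn">
  <list>
    <item>Iuppiter
      <list>
        <item>deus, 3.001</item>
      </list>
    </item>
  </list>
</div>
<div type="index" n="pr">
  <list>
    <item>Iuppiter
      <list>
        <item>se, 3.002</item>
      </list>
    </item>
  </list>
</div>
```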

16.4 Addresses

The address element is used to mark a postal address of any kind. It contains one or more addrLine elements, one for each line of the address.

address contains a postal address, for example of a publisher, an organization, or an individual.
addrLine (address line) contains one line of a postal address.

Here is a simple example:

<address>
<addrLine>Computer Center (M/C 135)</addrLine>
<addrLine>1940 W. Taylor, Room 124</addrLine>
<addrLine>Chicago, IL 60612-7352</addrLine>
<addrLine>U.S.A.</addrLine>
</address>

The individual parts of an address may be further distinguished by using the name element discussed above (section 10.1 Names and Referring Strings).

<address>
<addrLine>Computer Center (M/C 135)</addrLine>
<addrLine>1940 W. Taylor, Room 124</addrLine>
<addrLine>
<name type="city">Chicago</name>, IL 60612-7352</addrLine>
<addrLine>
<name type="country">USA</name>
</addrLine>
</address>

17 Character Sets, Diacritics, etc.

With the advent of XML and its adoption of Unicode as the required character set for all documents, most problems previously associated with the representation of the diverse languages and writing systems of the world are greatly reduced. For those working with standard forms of the European languages in particular, almost no special action is needed: any XML editor should enable you to input accented letters or other ‘non-ASCII’ characters directly, and they should be stored in the resulting file in a way which is transferable directly between different systems.

There are two important exceptions: the characters & and < may not be entered directly in an XML document, since they have a special significance as initiating markup. They must always be represented as entity references, like this: &amp;amp; or &amp;lt;. Other characters may also be represented by means of entity reference where necessary, for example to retain compatibility with a pre-Unicode processing system.
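As a small illustration, a paragraph containing both characters might be entered as follows. The sentence itself is invented for the example; an XML processor would present its content to the reader as "Brown & Co. found that 3 < 5."

```xml
<!-- the ampersand and the less-than sign are written as entity references -->
<p>Brown &amp; Co. found that 3 &lt; 5.</p>
```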

18 Front and Back Matter

18.1 Front Matter

For many purposes, particularly in older texts, the preliminary material such as title pages, prefatory epistles, etc., may provide very useful additional linguistic or social information. P5 provides a set of recommendations for distinguishing the textual elements most commonly encountered in front matter, which are summarized here.

18.1.1 Title Page

The start of a title page should be marked with the element titlePage. All text contained on the page should be transcribed and tagged with the appropriate element from the following list:

titlePage (title page) contains the title page of a text, appearing within the front or back matter.
docTitle (document title) contains the title of a document, including all its constituents, as given on a title page.
titlePart contains a subsection or division of the title of a work, as indicated on a title page.
byline contains the primary statement of responsibility given for a work on its title page or at the head or end of the work.
docAuthor (document author) contains the name of the author of the document, as given on the title page (often but not always contained in a byline).
docDate (document date) contains the date of a document, as given (usually) on a title page.
docEdition (document edition) contains an edition statement as presented on a title page of a document.
docImprint (document imprint) contains the imprint statement (place and date of publication, publisher name), as given (usually) at the foot of a title page.
epigraph contains a quotation, anonymous or attributed, appearing at the start or end of a section or on a title page.

Typeface distinctions should be marked with the rend attribute when necessary, as described above. Very detailed description of the letter spacing and sizing used in ornamental titles is not as yet provided for by the Guidelines. Changes of language should be marked by appropriate use of the xml:lang attribute or the foreign element, as necessary. Names of people, places, or organizations may be tagged using the name element wherever they appear, if no more specific element is available.

Two example title pages follow:

<titlePage rend="Roman">
<docTitle>
  <titlePart type="main"> PARADISE REGAIN'D. A POEM In IV <hi>BOOKS</hi>. </titlePart>
  <titlePart> To which is added <title>SAMSON AGONISTES</title>. </titlePart>
</docTitle>
<byline>The Author <docAuthor>JOHN MILTON</docAuthor>
</byline>
<docImprint>
  <name>LONDON</name>, Printed by <name>J.M.</name> for <name>John Starkey</name>
   at the <name>Mitre</name> in <name>Fleetstreet</name>, near
<name>Temple-Bar.</name>
</docImprint>
<docDate>MDCLXXI</docDate>
</titlePage>

<titlePage>
<docTitle>
  <titlePart type="main"> Lives of the Queens of England, from the Norman
     Conquest;</titlePart>
  <titlePart type="sub">with anecdotes of their courts. </titlePart>
</docTitle>
<titlePart>Now first published from Official Records and other authentic documents private
   as well as public.</titlePart>
<docEdition>New edition, with corrections and additions</docEdition>
<byline>By <docAuthor>Agnes Strickland</docAuthor>
</byline>
<epigraph>
  <q>The treasures of antiquity laid up in old historic rolls, I opened.</q>
  <bibl>BEAUMONT</bibl>
</epigraph>
<docImprint>Philadelphia: Blanchard and Lea</docImprint>
<docDate>1860.</docDate>
</titlePage>

As elsewhere, the ref attribute may be used to link a name with a canonical definition of the entity being named. For example:

<byline>By <docAuthor>
  <name
    ref="http://en.wikipedia.org/wiki/Agnes_Strickland">Agnes
     Strickland</name>
</docAuthor>
</byline>

18.1.2 Prefatory Matter

Major blocks of text within the front matter should be marked using div elements; the following suggested values for the type attribute may be used to distinguish various common types of prefatory matter:

preface: A foreword or preface addressed to the reader in which the author or publisher explains the content, purpose, or origin of the text.
dedication: A formal offering or dedication of a text to one or more persons or institutions by the author.
abstract: A summary of the content of a text as continuous prose.
ack: A formal declaration of acknowledgment by the author in which persons and institutions are thanked for their part in the creation of a text.
contents: A table of contents, specifying the structure of a work and listing its constituents. The list element should be used to mark its structure.
frontispiece: A pictorial frontispiece, possibly including some text.

Where other kinds of prefatory matter are encountered, the encoder is at liberty to invent other values for the type attribute.

Like any text division, those in front matter may contain low level structural or non-structural elements as described elsewhere. They will generally begin with a heading or title of some kind which should be tagged using the head element. Epistles will contain the following additional elements:

salute (salutation) contains a salutation or greeting prefixed to a foreword, dedicatory epistle, or other division of a text, or the salutation in the closing of a letter, preface, etc.
signed (signature) contains the closing salutation, etc., appended to a foreword, dedicatory epistle, or other division of a text.
byline contains the primary statement of responsibility given for a work on its title page or at the head or end of the work.
dateline contains a brief description of the place, date, time, etc. of production of a letter, newspaper story, or other work, prefixed or suffixed to it as a kind of heading or trailer.
argument A formal list or prose description of the topics addressed by a subdivision of a text.
cit (cited quotation) contains a quotation from some other document, together with a bibliographic reference to its source. In a dictionary it may contain an example text with at least one occurrence of the word form, used in the sense being described, or a translation of the headword, or an example.
opener groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter.
closer groups together salutations, datelines, and similar phrases appearing as a final group at the end of a division, especially of a letter.

Epistles which appear elsewhere in a text will, of course, contain these same elements.

As an example, the dedication at the start of Milton's Comus should be marked up as follows:

<div type="dedication">
<head>To the Right Honourable <name>JOHN Lord Viscount BRACLY</name>, Son and Heir apparent
   to the Earl of Bridgewater, &amp;c.</head>
<salute>MY LORD,</salute>
<p>THis <hi>Poem</hi>, which receiv'd its first occasion of Birth from your Self, and
   others of your Noble Family .... and as in this representation your attendant
<name>Thyrsis</name>, so now in all reall expression</p>
<closer>
  <salute>Your faithfull, and most humble servant</salute>
  <signed>
   <name>H. LAWES.</name>
  </signed>
</closer>
</div>

18.2 Back Matter

18.2.1 Structural Divisions of Back Matter

Because of variations in publishing practice, back matter can contain virtually any of the elements listed above for front matter, and the same elements should be used where this is so. Additionally, back matter may contain the following types of matter within the back element. Like the structural divisions of the body, these should be marked as div elements, and distinguished by the following suggested values of the type attribute:

appendix: An ancillary self-contained section of a work, often providing additional but in some sense extra-canonical text.
glossary: A list of terms associated with definition texts (‘glosses’): this should be encoded as a <list type="gloss"> element.
notes: A section in which textual or other kinds of notes are gathered together.
bibliogr: A list of bibliographic citations: this should be encoded as a listBibl.
index: Any form of pre-existing index to the work (An index may also be generated for a document by using the index element described above).
colophon: A statement appearing at the end of a book describing the conditions of its physical production.
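The following sketch shows how three of these division types might be combined within a single back element. The headings and the glossary entry are invented for illustration; the bibliographic citation is borrowed from the source description example given later in this document.

```xml
<back>
  <div type="appendix">
    <head>Appendix</head>
    <!-- ancillary text goes here -->
  </div>
  <div type="glossary">
    <head>Glossary</head>
    <list type="gloss">
      <label>markup</label>
      <item>the process of making features of a text explicit</item>
    </list>
  </div>
  <div type="bibliogr">
    <head>Bibliography</head>
    <listBibl>
      <bibl>The first folio of Shakespeare, prepared by Charlton Hinman
        (The Norton Facsimile, 1968)</bibl>
    </listBibl>
  </div>
</back>
```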

19 The Electronic Title Page

Every TEI text has a header which provides information analogous to that provided by the title page of printed text. The header is introduced by the element teiHeader and has four major parts:

fileDesc (file description) contains a full bibliographic description of an electronic file.
encodingDesc (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.
profileDesc (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting.
revisionDesc (revision description) summarizes the revision history for a file.
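The overall shape of a header using all four parts can be sketched as follows. This is a skeleton only: the comments stand in for content, so it is not valid as it stands, and the order shown simply follows the list above.

```xml
<teiHeader>
  <fileDesc>
    <!-- full bibliographic description of the electronic file -->
  </fileDesc>
  <encodingDesc>
    <!-- relationship between the electronic text and its source(s) -->
  </encodingDesc>
  <profileDesc>
    <!-- languages used, circumstances of production, classification -->
  </profileDesc>
  <revisionDesc>
    <!-- revision history of the file -->
  </revisionDesc>
</teiHeader>
```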

A corpus or collection of texts with many shared characteristics may have one header for the corpus and individual headers for each component of the corpus. In this case the type attribute indicates the type of header. <teiHeader type="corpus"> introduces the header for corpus-level information.
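A corpus with one shared header and one header per text might be organized roughly as follows. This is a sketch under two assumptions: that the individual texts are grouped by a teiCorpus element, and that text is an acceptable value of the type attribute for the header of an individual text. The comments stand in for real content.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader type="corpus">
    <!-- information shared by every text in the corpus -->
  </teiHeader>
  <TEI>
    <teiHeader type="text">
      <!-- information specific to the first text -->
    </teiHeader>
    <text>
      <!-- the first text -->
    </text>
  </TEI>
  <TEI>
    <teiHeader type="text">
      <!-- information specific to the second text -->
    </teiHeader>
    <text>
      <!-- the second text -->
    </text>
  </TEI>
</teiCorpus>
```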

Some of the header elements contain running prose which consists of one or more p elements. Others are grouped:

Elements whose names end in Stmt (for statement) usually enclose a group of elements recording some structured information.
Elements whose names end in Decl (for declaration) enclose information about specific encoding practices.
Elements whose names end in Desc (for description) contain a prose description.

19.1 The File Description

The fileDesc element is mandatory. It contains a full bibliographic description of the file with the following elements:

titleStmt (title statement) groups information about the title of a work and those responsible for its content.
editionStmt (edition statement) groups information relating to one edition of a text.
extent describes the approximate size of a text as stored on some carrier medium, whether digital or non-digital, specified in any convenient units.
publicationStmt (publication statement) groups information concerning the publication or distribution of an electronic or other text.
seriesStmt (series statement) groups information about the series, if any, to which a publication belongs.
notesStmt (notes statement) collects together any notes providing information about a text additional to that recorded in other parts of the bibliographic description.
sourceDesc (source description) describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as "born digital" for a text which has no previous existence.

A minimal header has the following structure:
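One plausible rendering of that structure, assuming that the title statement, publication statement, and source description are the only components required within fileDesc (as in the shortest TEI document shown in the element reference at the end of this document), is sketched here. The comments stand in for real content.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <!-- title of the digital resource and those responsible for it -->
    </titleStmt>
    <publicationStmt>
      <!-- how the resource is published or distributed -->
    </publicationStmt>
    <sourceDesc>
      <!-- the source(s) from which the resource is derived -->
    </sourceDesc>
  </fileDesc>
</teiHeader>
```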

19.1.1 The Title Statement

The following elements can be used in the titleStmt:

title contains a title for any kind of work.
author in a bibliographic reference, contains the name(s) of an author, personal or corporate, of a work; for example in the same form as that provided by a recognized bibliographic name authority.
sponsor specifies the name of a sponsoring organization or institution.
funder (funding body) specifies the name of an individual, institution, or organization responsible for the funding of a project or text.
principal (principal researcher) supplies the name of the principal researcher responsible for the creation of an electronic text.
respStmt (statement of responsibility) supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply. May also be used to encode information about individuals or organizations which have played a role in the production or distribution of a bibliographic work.

The title of a digital resource derived from a non-digital one will obviously be similar. However, it is important to distinguish the title of the computer file from that of the source text, for example:

[title of source]: a machine readable transcription
[title of source]: electronic edition
A machine readable version of: [title of source]

The respStmt element contains the following subcomponents:

resp (responsibility) contains a phrase describing the nature of a person's intellectual responsibility, or an organization's role in the production or distribution of a work.
name (name, proper noun) contains a proper noun or noun phrase.

Example:

<titleStmt>
<title>Two stories by Edgar Allan Poe: a machine readable transcription</title>
<author>Poe, Edgar Allan (1809-1849)</author>
<respStmt>
<resp>compiled by</resp>
<name>James D. Benson</name>
</respStmt>
</titleStmt>

19.1.2 The Edition Statement

The editionStmt groups information relating to one edition of the digital resource (where edition is used as elsewhere in bibliography), and may include the following elements:

edition (edition) describes the particularities of one edition of a text.
respStmt (statement of responsibility) supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply. May also be used to encode information about individuals or organizations which have played a role in the production or distribution of a bibliographic work.

Example:

<editionStmt>
<edition n="U2">Third
draft, substantially revised <date>1987</date>
</edition>
</editionStmt>

Determining exactly what constitutes a new edition of an electronic text is left to the encoder.

19.1.3 The Extent Statement

The extent statement describes the approximate size of the digital resource.

Example:

<extent>4532
bytes</extent>

19.1.4 The Publication Statement

The publicationStmt is mandatory. It may contain a simple prose description or groups of the elements described below:

publisher provides the name of the organization responsible for the publication or distribution of a bibliographic item.
distributor supplies the name of a person or other agency responsible for the distribution of a text.
authority (release authority) supplies the name of a person or other agency responsible for making a work available, other than a publisher or distributor.

At least one of these three elements must be present, unless the entire publication statement is in prose. The following elements may occur within them:

pubPlace (publication place) contains the name of the place where a bibliographic item was published.
address contains a postal address, for example of a publisher, an organization, or an individual.
idno (identifier) supplies any form of identifier used to identify some object, such as a bibliographic item, a person, a title, an organization, etc. in a standardized way.
availability supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, any licence applying to it, etc.
licence contains information about a licence or other legal agreement applicable to the text.
date contains a date in any format.

Example:

<publicationStmt>
<publisher>University of Victoria Humanities Computing and Media Centre</publisher>
<pubPlace>Victoria, BC</pubPlace>
<date>2011</date>
<availability status="restricted">
  <licence
    target="http://creativecommons.org/licenses/by-sa/3.0/"> Distributed under a
     Creative Commons Attribution-ShareAlike 3.0 Unported License </licence>
</availability>
</publicationStmt>

19.1.5 Series and Notes Statements

The seriesStmt element groups information about the series, if any, to which a publication belongs. It may contain title, idno, or respStmt elements.

The notesStmt, if used, contains one or more note elements which contain a note or annotation. Some information found in the notes area in conventional bibliography has been assigned specific elements in the TEI scheme.
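As a hedged illustration, the two statements might appear within a fileDesc as shown below. The series title, the volume number, and the wording of the note are all invented placeholders, as is the use of the value vol for the type attribute of idno.

```xml
<seriesStmt>
  <!-- hypothetical series title and volume number -->
  <title>Hypothetical Series of Electronic Texts</title>
  <idno type="vol">3</idno>
</seriesStmt>
<notesStmt>
  <!-- hypothetical note -->
  <note>Transcribed from a microfilm copy of the source.</note>
</notesStmt>
```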

19.1.6 The Source Description

The sourceDesc is a mandatory element which records details of the source or sources from which the computer file is derived. It may contain simple prose or a bibliographic citation, using one or more of the following elements:

bibl (bibliographic citation) contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
listBibl (citation list) contains a list of bibliographic citations of any kind.

Examples:

<sourceDesc>
<bibl>The first folio of Shakespeare, prepared by Charlton Hinman (The Norton Facsimile,
1968)</bibl>
</sourceDesc>

<sourceDesc>
<bibl>
  <author>CNN Network News</author>
  <title>News headlines</title>
  <date>12 Jun
     1989</date>
</bibl>
</sourceDesc>

19.2 The Encoding Description

The encodingDesc element specifies the methods and editorial principles which governed the transcription of the text. Its use is highly recommended. It may contain a prose description, or elements from the following list:

projectDesc (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
samplingDecl (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
editorialDecl (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text.
refsDecl (references declaration) specifies how canonical references are constructed for this text.
classDecl (classification declarations) contains one or more taxonomies defining any classificatory codes used elsewhere in the text.

19.2.1 Project and Sampling Descriptions

Examples of projectDesc and samplingDecl:

<encodingDesc>
<projectDesc>
  <p>Texts collected for
     use in the Claremont Shakespeare Clinic, June 1990.
  </p>
</projectDesc>
</encodingDesc>

<encodingDesc>
<samplingDecl>
<p>Samples of
2000 words taken from the beginning of the text</p>
</samplingDecl>
</encodingDesc>

19.2.2 Editorial Declarations

The editorialDecl contains a prose description of the practices used when encoding the text. Typically this description should cover such topics as the following, each of which may conveniently be given as a separate paragraph.

correction: how and under what circumstances corrections have been made in the text.
normalization: the extent to which the original source has been regularized or normalized.
quotation: what has been done with quotation marks in the original -- have they been retained or replaced by entity references, are opening and closing quotes distinguished, etc.
hyphenation: what has been done with hyphens (especially end-of-line hyphens) in the original -- have they been retained, replaced by entity references, etc.
segmentation: how the text has been segmented, for example into sentences, tone-units, graphemic strata, etc.
interpretation: what analytic or interpretive information has been added to the text.

Example:

<editorialDecl>
<p>The part of
   speech analysis applied throughout section 4 was added by hand and has not been
   checked.</p>
<p>Errors in transcription controlled by using the WordPerfect spelling
   checker.</p>
<p>All words converted to Modern American spelling using Webster's 9th
   Collegiate dictionary.</p>
</editorialDecl>

19.2.3 Reference and Classification Declarations

The refsDecl element is used to document the way in which any standard referencing scheme built into the encoding works. In its simplest form, it consists of a prose description.

Example:

<refsDecl>
<p>The <att>n</att>
   attribute on each <gi>div</gi> contains the canonical reference for each division in the
   form XX.yyy where XX is the book number in roman numeral and yyy is the section number in
   arabic.</p>
<p>Milestone tags refer to the edition of 1830 as E30 and that of 1850 as E50.
</p>
</refsDecl>

The classDecl element groups together definitions or sources for any descriptive classification schemes used by other parts of the header. At least one such scheme must be provided, encoded using the following elements:

taxonomy defines a typology either implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy.
bibl (bibliographic citation) contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
category contains an individual descriptive category, possibly nested within a superordinate category, within a user-defined taxonomy.
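catDesc (category description) describes some category within a taxonomy or text typology in the form of a brief prose description.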

In the simplest case, the taxonomy may be defined by a bibliographic reference, as in the following example:

<classDecl>
<taxonomy xml:id="LC-SH">
<bibl>Library of Congress Subject Headings
</bibl>
</taxonomy>
</classDecl>

Alternatively, or in addition, the encoder may define a special purpose classification scheme, as in the following example:

<taxonomy xml:id="B">
<bibl>Brown Corpus</bibl>
<category xml:id="B.A">
  <catDesc>Press
     Reportage</catDesc>
  <category xml:id="B.A1">
   <catDesc>Daily</catDesc>
  </category>
  <category xml:id="B.A2">
   <catDesc>Sunday</catDesc>
  </category>
  <category xml:id="B.A3">
   <catDesc>National</catDesc>
  </category>
  <category xml:id="B.A4">
   <catDesc>Provincial</catDesc>
  </category>
  <category xml:id="B.A5">
   <catDesc>Political</catDesc>
  </category>
  <category xml:id="B.A6">
   <catDesc>Sports</catDesc>
  </category>
</category>
<category xml:id="B.D">
  <catDesc>Religion</catDesc>
  <category xml:id="B.D1">
   <catDesc>Books</catDesc>
  </category>
  <category xml:id="B.D2">
   <catDesc>Periodicals and
       tracts</catDesc>
  </category>
</category>
</taxonomy>

Linkage between a particular text and a category within such a taxonomy is made by means of the catRef element within the textClass element, as described in the next section below.

19.3 The Profile Description

The profileDesc element enables information characterizing various descriptive aspects of a text to be recorded within a single framework. It has three optional components:

creation contains information about the creation of a text.
langUsage (language usage) describes the languages, sublanguages, registers, dialects, etc. represented within a text.
textClass (text classification) groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.

The creation element is useful for documenting where a work was created, even though it may not have been published or recorded there.

Example:

<creation>
<date when="1992-08">August 1992</date>
<name type="place">Taos, New Mexico</name>
</creation>

The langUsage element is useful where a text contains many different languages. It may contain language elements to document each particular language used:

language characterizes a single language or sublanguage used within a text.

For example, a text containing predominantly text in French as spoken in Quebec, but also smaller amounts of British and Canadian English might be documented as follows:

<langUsage>
<language ident="fr-CA" usage="60">Québecois</language>
<language ident="en-CA" usage="20">Canadian business English</language>
<language ident="en-GB" usage="20">British English</language>
</langUsage>

The textClass element classifies a text. This may be done with reference to a classification system locally defined by means of the classDecl element, or by reference to some externally defined established scheme such as the Universal Decimal Classification. Texts may also be classified using lists of keywords, which may themselves be drawn from locally or externally defined control lists. The following elements are used to supply such classifications:

classCode (classification code) contains the classification code used for this text in some standard classification system.
catRef (category reference) specifies one or more defined categories within some taxonomy or text typology.
keywords contains a list of keywords or phrases identifying the topic or nature of a text.

The simplest way of classifying a text is by means of the classCode element. For example, a text with classification 410 in the Universal Decimal Classification might be documented as follows:
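One plausible encoding is sketched below. It assumes that the scheme attribute identifies the classification system in use; the URI given for the Universal Decimal Classification is an assumption made for illustration.

```xml
<textClass>
  <!-- the scheme URI is an assumed identifier for the UDC -->
  <classCode scheme="http://www.udc.org">410</classCode>
</textClass>
```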

When a classification scheme has been locally defined using the taxonomy element discussed in the preceding subsection, the catRef element should be used to reference it. To continue the earlier example, a work classified in the Brown Corpus as Press reportage - Sunday and also as Religion might be documented as follows:
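A sketch of such an encoding, assuming that the target attribute of catRef takes a space-separated list of pointers to the xml:id values of category elements. In the Brown Corpus taxonomy defined in section 19.2.3, B.A2 identifies Sunday press reportage and B.D identifies Religion.

```xml
<textClass>
  <!-- points to the categories B.A2 (Sunday) and B.D (Religion) -->
  <catRef target="#B.A2 #B.D"/>
</textClass>
```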

The element keywords contains a list of keywords or phrases identifying the topic or nature of a text. As usual, the attribute scheme identifies the source from which these terms are taken. For example, if the LC Subject Headings are used, following declaration of that classification system in a taxonomy element as above:

<textClass>
<keywords scheme="#LC-SH">
  <list>
   <item>English literature -- History and criticism -- Data processing.</item>
   <item>English literature -- History and criticism -- Theory etc.</item>
   <item>English language -- Style -- Data processing.</item>
  </list>
</keywords>
</textClass>

Multiple classifications may be supplied using any of the mechanisms described in this section.

19.4 The Revision Description

The revisionDesc element provides a change log in which each change made to a text may be recorded. The log may be recorded as a sequence of change elements each of which contains a brief description of the change. The attributes when and who may be used to identify when the change was carried out and the agency responsible for it.

Example:

<revisionDesc>
<change when="1991-03-06" who="#EMB">File format updated</change>
<change when="1990-05-25" who="#EMB">Stuart's corrections entered</change>
</revisionDesc>

In a production environment it will usually be found preferable to use some kind of automated system to track and record changes. Many such version control systems, as they are known, can also be configured to update the TEI Header of a file automatically.

List of Elements Described

The TEI Lite schema is a pure subset of TEI P5. In the following list of elements and classes used, some information, notably the examples, derives from the canonical definition for the element in TEI P5 and may therefore refer to elements or attributes not provided by TEI Lite. Note however that only the elements listed here are available within the TEI Lite schema. These specifications also refer to many attributes which although available in TEI Lite are not discussed in this tutorial for lack of space.

Schema tei_lite: changed components

<TEI>

<TEI> (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element. [4. 15.1. ]
Module	textstructure — List of Elements Described
Attributes	att.global (@xml:id, @n, @xml:lang, @rend, @style) (att.global.linking (@next, @prev)) (att.global.analytic (@ana)) (att.global.facs (@facs))
Contained by	core: teiCorpus
May contain	header: teiHeader textstructure: text
Declaration	element TEI { att.global.attributes, ( teiHeader, ( ( model.resourceLike+, text? ) | text ) ) }
Schematron	<s:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
Schematron	<s:ns prefix="rng" uri="http://relaxng.org/ns/structure/1.0"/>
Example
<TEI version="5.0" xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
  <fileDesc>
   <titleStmt>
    <title>The shortest TEI Document Imaginable</title>
   </titleStmt>
   <publicationStmt>
    <p>First published as part of TEI P2, this is the P5 version using a name space.</p>
   </publicationStmt>
   <sourceDesc>
    <p>No source: this is an original work.</p>
   </sourceDesc>
  </fileDesc>
</teiHeader>
<text>
  <body>
   <p>This is about the shortest TEI document imaginable.</p>
  </body>
</text>
</TEI>
Note	This element is required.

att.datable.w3c

att.datable.w3c provides attributes for normalization of elements that contain datable events using the W3C datatypes.

Module tei — List of Elements Described

Members att.datable [name date time licence creation change]

Attributes

when

supplies the value of the date or time in a standard form, e.g. yyyy-mm-dd.

Status	Optional
Datatype	`data.temporal.w3c`
Values	A normalized form of temporal expression conforming to the W3C XML Schema Part 2: Datatypes Second Edition.
Examples of W3C date, time, and date & time formats. <p> <date when="1945-10-24">24 Oct 45</date> <date when="1996-09-24T07:25:00Z">September 24th, 1996 at 3:25 in the morning</date> <time when="1999-01-04T20:42:00-05:00">Jan 4 1999 at 8 pm</time> <time when="14:12:38">fourteen twelve and 38 seconds</time> <date when="1962-10">October of 1962</date> <date when="--06-12">June 12th</date> <date when="---01">the first of the month</date> <date when="--08">August</date> <date when="2006">MMVI</date> <date when="0056">AD 56</date> <date when="-0056">56 BC</date> </p>
This list begins in the year 1632, more precisely on Trinity Sunday, i.e. the Sunday after Pentecost, in that year the <date calendar="#Julian" when="1632-06-06">27th of May (old style)</date>.
<opener> <dateline> <placeName>Dorchester, Village,</placeName> <date when="1828-03-02">March 2d. 1828.</date> </dateline> <salute>To Mrs. Cornell,</salute> Sunday <time when="12:00:00">noon.</time> </opener>
Note	The value of the when attribute should be the normalized representation of the date, time, or combined date & time intended, in any of the standard formats specified by XML Schema Part 2: Datatypes Second Edition, using the Gregorian calendar. The most commonly-encountered format for the date part of the when attribute is `yyyy-mm-dd`, but `yyyy`, `--mm`, `---dd`, `yyyy-mm`, or `--mm-dd` may also be used. For the time part, the form `hh:mm:ss` is used. Note that this format does not currently permit use of the value 0000 to represent the year 1 BCE; instead the value -0001 should be used.
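
The restriction on the year 0000 described in the note above can be illustrated with a short fragment. This is a minimal sketch with an invented sentence, assuming the XML Schema Part 2 (Second Edition) reading of negative years that the note describes:

```xml
<!-- Invented sentence for illustration. Under the datatype described above,
     1 BCE is written as -0001; the value 0000 is not permitted. -->
<p>The year <date when="-0001">1 BCE</date> was followed directly by
   <date when="0001">1 CE</date>.</p>
```

The element content keeps the date as it appears in the source, while the when attribute carries the normalized form.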

att.global

att.global provides attributes common to all elements in the TEI encoding scheme.

Module tei — List of Elements Described

Members p foreign emph hi q cit mentioned soCalled desc gloss term sic corr choice reg orig gap add del unclear name rs address addrLine num date time abbr expan ptr ref list item label head note index graphic milestone pb lb author editor respStmt resp title publisher biblScope pubPlace bibl listBibl relatedItem l lg sp speaker stage teiCorpus divGen teiHeader fileDesc titleStmt sponsor funder principal editionStmt edition extent publicationStmt distributor authority idno availability licence seriesStmt notesStmt sourceDesc encodingDesc projectDesc samplingDecl editorialDecl refsDecl classDecl taxonomy category profileDesc creation langUsage language textClass keywords classCode catRef revisionDesc change TEI text body group div trailer byline dateline argument epigraph opener closer salute signed postscript titlePage docTitle titlePart docAuthor docEdition docImprint docDate front back table row cell formula figure figDesc anchor seg s w pc interp interpGrp att code eg gi ident val

Attributes

Attributes att.global.linking (@next, @prev) att.global.analytic (@ana) att.global.facs (@facs)

xml:id

(identifier) provides a unique identifier for the element bearing the attribute.

Status	Optional
Datatype	`xsd:ID`
Values	any valid XML identifier.
Note	The xml:id attribute may be used to specify a canonical reference for an element; see section 3.10.
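
As a hedged illustration of how an identifier is typically put to use, the following fragment gives one paragraph an xml:id and points at it from another. The identifier p-intro and the wording are invented for this sketch; ref and its target attribute are among the elements and attributes listed for TEI Lite.

```xml
<!-- "p-intro" is a hypothetical identifier; it must be unique in the document. -->
<p xml:id="p-intro">The opening paragraph of the chapter.</p>

<!-- A pointer elsewhere in the same document refers to it as "#p-intro". -->
<p>For the background, see <ref target="#p-intro">the opening paragraph</ref>.</p>
```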

n

(number) gives a number (or other label) for an element, which is not necessarily unique within the document.

Status	Optional
Datatype	`data.text`
Values	the value consists of a single token which may however contain punctuation characters, whitespace or word separating characters. It need not be restricted to numbers.
Note	The n attribute may be used to specify the numbering of chapters, sections, list items, etc.; it may also be used in the specification of a standard reference system for the text.
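
A minimal sketch of the n attribute used for chapter and paragraph numbering follows. The content is invented; the point is that the same value of n may recur in different places, since, unlike xml:id, it need not be unique.

```xml
<!-- Invented content. The value n="1" appears on both paragraphs,
     which is permitted because n need not be unique. -->
<div type="chapter" n="1">
  <p n="1">First paragraph of the first chapter.</p>
</div>
<div type="chapter" n="2">
  <p n="1">First paragraph of the second chapter.</p>
</div>
```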

xml:lang

(language) indicates the language of the element content using a ‘tag’ generated according to BCP 47.

Status	Optional
Datatype	`data.language`
Values	The value must conform to BCP 47. If the value is a private use code (i.e., starts with x- or contains -x-), a language element with a matching value for its ident attribute should be supplied in the TEI Header to document this value. Such documentation may also optionally be supplied for non-private-use codes, though these must remain consistent with their Internet Engineering Task Force (IETF) definitions.
<p> … The consequences of this rapid depopulation were the loss of the last <foreign xml:lang="rap">ariki</foreign> or chief (Routledge 1920:205,210) and their connections to ancestral territorial organization.</p>
Note	The xml:lang value will be inherited from the immediately enclosing element, or from its parent, and so on up the document hierarchy. It is generally good practice to specify xml:lang at the highest appropriate level, bearing in mind that the teiHeader may need a different default from the associated resource element or elements, and that a single TEI document may contain texts in many languages.
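
The inheritance rule and the documentation of private use codes can be sketched together. The two fragments below are illustrative only: the code x-example and all of the content are invented, and the first fragment would sit inside the profileDesc of the TEI header while the second belongs to the text.

```xml
<!-- Header fragment: documents the private use code (hypothetical value). -->
<langUsage>
  <language ident="x-example">An invented code standing in for a project-specific language variety.</language>
</langUsage>

<!-- Text fragment: xml:lang is set once and inherited by descendants. -->
<text xml:lang="en">
  <body>
    <!-- This paragraph inherits "en"; the foreign element overrides it locally. -->
    <p>She greeted him with a cheerful <foreign xml:lang="fr">bonjour</foreign> and carried on.</p>
    <!-- This paragraph overrides the inherited value for its own content. -->
    <p xml:lang="x-example">…</p>
  </body>
</text>
```

Declaring the language once on text keeps the markup compact; only the passages that depart from that default need their own xml:lang.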

rend

(rendition) indicates how the element in question was rendered or presented in the source text.

Status	Optional
Datatype	1–∞ occurrences of `data.word` separated by whitespace
Values	may contain any number of tokens, each of which may contain letters, punctuation marks, or symbols, but not whitespace or word-separating characters.
<head rend="align(center) case(allcaps)"> <lb/>To The <lb/>Duchesse <lb/>of <lb/>Newcastle, <lb/>On Her <lb/> <hi rend="case(mixed)">New Blazing-World</hi>. </head>
Note	These Guidelines make no binding recommendations for the values of the rend attribute; the characteristics of visual presentation vary too much from text to text and the decision to record or ignore individual characteristics varies too much from project to project. Some potentially useful conventions are noted from time to time at appropriate points in the Guidelines. The values of the rend attribute are a set of sequence-indeterminate individual tokens separated by whitespace.

style

contains an expression in some formal style definition language which defines the rendering or presentation used for this element in the source text.

Status	Optional
Datatype	`data.text`
<head style="text-align: center; font-variant: small-caps"> <lb/>To The <lb/>Duchesse <lb/>of <lb/>Newcastle, <lb/>On Her <lb/> <hi style="font-variant: normal">New Blazing-World</hi>. </head>
Note	Unlike the attribute values of rend, the style attribute may contain whitespace. This attribute is intended for recording inline stylistic information concerning the source, not any particular output.

att.global.linking

att.global.linking defines a set of attributes for hypertext and other linking, which are enabled for all elements when the additional tag set for linking is selected.

Module linking — List of Elements Described

Members att.global [p foreign emph hi q cit mentioned soCalled desc gloss term sic corr choice reg orig gap add del unclear name rs address addrLine num date time abbr expan ptr ref list item label head note index graphic milestone pb lb author editor respStmt resp title publisher biblScope pubPlace bibl listBibl relatedItem l lg sp speaker stage teiCorpus divGen teiHeader fileDesc titleStmt sponsor funder principal editionStmt edition extent publicationStmt distributor authority idno availability licence seriesStmt notesStmt sourceDesc encodingDesc projectDesc samplingDecl editorialDecl refsDecl classDecl taxonomy category profileDesc creation langUsage language textClass keywords classCode catRef revisionDesc change TEI text body group div trailer byline dateline argument epigraph opener closer salute signed postscript titlePage docTitle titlePart docAuthor docEdition docImprint docDate front back table row cell formula figure figDesc anchor seg s w pc interp interpGrp att code eg gi ident val]

Attributes

next

points to the next element of a virtual aggregate of which the current element is part.

Status	Optional
Datatype	`data.pointer`
Values	a URI.

prev

(previous) points to the previous element of a virtual aggregate of which the current element is part.

Status	Optional
Datatype	`data.pointer`
Values	a URI.
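
A common use of these two attributes is to join the parts of a single utterance that the source interrupts. The fragment below is a sketch with invented content and hypothetical identifiers (q1a, q1b); each part points to the other by URI.

```xml
<!-- Two q elements form one virtual aggregate: the first names its
     successor with next, the second names its predecessor with prev. -->
<p><q xml:id="q1a" next="#q1b">I never expected</q>, she said,
   <q xml:id="q1b" prev="#q1a">to see you here again.</q></p>
```

Supplying both attributes is redundant but lets software traverse the aggregate in either direction; whether a project records one or both is an encoding decision.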

Schema tei_lite: unchanged components

abbr: (abbreviation) contains an abbreviation of any sort. [3.5.5. ]

add: (addition) contains letters, words, or phrases inserted in the text by an author, scribe, annotator, or corrector. [3.4.3. ]

addrLine: (address line) contains one line of a postal address. [3.5.2. 2.2.4. 3.11.2.3. ]

address: contains a postal address, for example of a publisher, an organization, or an individual. [3.5.2. 2.2.4. 3.11.2.3. ]

anchor: (anchor point) attaches an identifier to a point within a text, whether or not it corresponds with a textual element. [8.4.2. 16.4. ]

argument: A formal list or prose description of the topics addressed by a subdivision of a text. [4.2. 4.6. ]

att: (attribute) contains the name of an attribute appearing within running text. [22. ]

att.ascribed: provides attributes for elements representing speech or action that can be ascribed to a specific individual.

att.breaking: provides an attribute to indicate whether or not the element concerned is considered to mark the end of an orthographic token in the same way as whitespace.

att.cReferencing: provides an attribute which may be used to supply a canonical reference as a means of identifying the target of a pointer.

att.canonical: provides attributes which can be used to associate a representation such as a name or title with canonical information about the object being named or referenced.

att.datable: provides attributes for normalization of elements that contain dates, times, or datable events.

att.datcat: introduces dcr:datacat and dcr:ValueDatacat attributes that may be used to align XML elements or attributes with the appropriate Data Categories (DCs) defined by the ISO 12620:2009 standard and stored in the Web repository called ISOCat at http://www.isocat.org/.

att.declarable: provides attributes for those elements in the TEI Header which may be independently selected by means of the special purpose decls attribute.

att.declaring: provides attributes for elements which may be independently associated with a particular declarable element within the header, thus overriding the inherited default for that element.

att.dimensions: provides attributes for describing the size of physical objects.

att.divLike: provides attributes common to all elements which behave in the same way as divisions.

att.docStatus: provides attributes for use on metadata elements describing the status of a document.

att.editLike: provides attributes describing the nature of an encoded scholarly intervention or interpretation of any kind.

att.global.analytic: provides additional global attributes for associating specific analyses or interpretations with appropriate portions of a text.

att.global.facs: groups elements corresponding with all or part of an image, because they contain an alternative representation of it, typically but not necessarily a transcription of it.

att.interpLike: provides attributes for elements which represent a formal analysis or interpretation.

att.milestoneUnit: provides an attribute to indicate the type of section which is changing at a specific milestone.

att.naming: provides attributes common to elements which refer to named persons, places, organizations etc.

att.personal: (attributes for components of names usually, but not necessarily, personal names) common attributes for those elements which form part of a name usually, but not necessarily, a personal name.

att.placement: provides attributes for describing where on the source page or object a textual element appears.

att.pointing: defines a set of attributes used by all elements which point to other elements by means of one or more URI references.

att.ranging: provides attributes for describing numerical ranges.

att.responsibility: provides attributes indicating who is responsible for something asserted by the markup and the degree of certainty associated with it.

att.segLike: provides attributes for elements used for arbitrary segmentation.

att.sortable: provides attributes for elements in lists or groups that are sortable, but whose sorting key cannot be derived mechanically from the element content.

att.sourced: provides attributes identifying the source edition from which some encoded feature derives.

att.spanning: provides attributes for elements which delimit a span of text by pointing mechanisms rather than by enclosing it.

att.tableDecoration: provides attributes used to decorate rows or cells of a table.

att.transcriptional: provides attributes specific to elements encoding authorial or scribal intervention in a text when transcribing manuscript or similar sources.

att.translatable: provides attributes used to indicate the status of a translatable portion of an ODD document.

att.typed: provides attributes which can be used to classify or subclassify elements in any way.

author: in a bibliographic reference, contains the name(s) of an author, personal or corporate, of a work; for example in the same form as that provided by a recognized bibliographic name authority. [3.11.2.2. 2.2.1. ]

authority: (release authority) supplies the name of a person or other agency responsible for making a work available, other than a publisher or distributor. [2.2.4. ]

availability: supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, any licence applying to it, etc. [2.2.4. ]

back: (back matter) contains any appendixes, etc. following the main part of a text. [4.7. 4. ]

bibl: (bibliographic citation) contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged. [3.11.1. 2.2.7. 15.3.2. ]

biblScope: (scope of citation) defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work. [3.11.2.3. ]

body: (text body) contains the whole body of a single unitary text, excluding any front or back matter. [4. ]

byline: contains the primary statement of responsibility given for a work on its title page or at the head or end of the work. [4.2.2. 4.5. ]

catRef: (category reference) specifies one or more defined categories within some taxonomy or text typology. [2.4.3. ]

category: contains an individual descriptive category, possibly nested within a superordinate category, within a user-defined taxonomy. [2.3.7. ]

cell: contains one cell of a table. [14.1.1. ]

change: documents a change or set of changes made during the production of a source document, or during the revision of an electronic file. [2.5. 2.4.1. ]

choice: groups a number of alternative encodings for the same point in a text. [3.4. ]

cit: (cited quotation) contains a quotation from some other document, together with a bibliographic reference to its source. In a dictionary it may contain an example text with at least one occurrence of the word form, used in the sense being described, or a translation of the headword, or an example. [3.3.3. 4.3.1. 9.3.5.1. ]

classCode: (classification code) contains the classification code used for this text in some standard classification system. [2.4.3. ]

classDecl: (classification declarations) contains one or more taxonomies defining any classificatory codes used elsewhere in the text. [2.3.7. 2.3. ]

closer: groups together salutations, datelines, and similar phrases appearing as a final group at the end of a division, especially of a letter. [4.2.2. 4.2. ]

code: contains literal code from some formal language such as a programming language.

corr: (correction) contains the correct form of a passage apparently erroneous in the copy text. [3.4.1. ]

creation: contains information about the creation of a text. [2.4.1. 2.4. ]

data.certainty: defines the range of attribute values expressing a degree of certainty.

data.code: defines the range of attribute values expressing a coded value by means of a pointer to some other element which contains a definition for it.

data.count: defines the range of attribute values used for a non-negative integer value used as a count.

data.duration.w3c: defines the range of attribute values available for representation of a duration in time using W3C datatypes.

data.enumerated: defines the range of attribute values expressed as a single XML name taken from a list of documented possibilities.

data.language: defines the range of attribute values used to identify a particular combination of human language and writing system.

data.name: defines the range of attribute values expressed as an XML Name.

data.namespace: defines the range of attribute values used to indicate XML namespaces as defined by the W3C Namespaces in XML Technical Recommendation.

data.numeric: defines the range of attribute values used for numeric values.

data.outputMeasurement: defines a range of values for use in specifying the size of an object that is intended for display on the web.

data.pointer: defines the range of attribute values used to provide a single URI pointer to any other resource, either within the current document or elsewhere.

data.probability: defines the range of attribute values expressing a probability.

data.temporal.w3c: defines the range of attribute values expressing a temporal expression such as a date, a time, or a combination of them, that conform to the W3C XML Schema Part 2: Datatypes specification.

data.text: defines the range of attribute values used to express some kind of identifying string as a single sequence of Unicode characters possibly including whitespace.

data.truthValue: defines the range of attribute values used to express a truth value.

data.version: defines the range of attribute values which may be used to specify a TEI version number.

data.word: defines the range of attribute values expressed as a single word or token.

data.xTruthValue: (extended truth value) defines the range of attribute values used to express a truth value which may be unknown.

date: contains a date in any format. [3.5.4. 2.2.4. 2.5. 3.11.2.3. 15.2.3. 13.3.6. ]

dateline: contains a brief description of the place, date, time, etc. of production of a letter, newspaper story, or other work, prefixed or suffixed to it as a kind of heading or trailer. [4.2.2. ]

del: (deletion) contains a letter, word, or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, annotator, or corrector. [3.4.3. ]

desc: (description) contains a brief description of the object documented by its parent element, including its intended usage, purpose, or application where this is appropriate. [22.4.4. 22.4.5. 22.4.6. 22.4.7. ]

distributor: supplies the name of a person or other agency responsible for the distribution of a text. [2.2.4. ]

div: (text division) contains a subdivision of the front, body, or back of a text. [4.1. ]

divGen: (automatically generated text division) indicates the location at which a textual division generated automatically by a text-processing application is to appear. [3.8.2. ]

docAuthor: (document author) contains the name of the author of the document, as given on the title page (often but not always contained in a byline). [4.6. ]

docDate: (document date) contains the date of a document, as given (usually) on a title page. [4.6. ]

docEdition: (document edition) contains an edition statement as presented on a title page of a document. [4.6. ]

docImprint: (document imprint) contains the imprint statement (place and date of publication, publisher name), as given (usually) at the foot of a title page. [4.6. ]

docTitle: (document title) contains the title of a document, including all its constituents, as given on a title page. [4.6. ]

edition: (edition) describes the particularities of one edition of a text. [2.2.2. ]

editionStmt: (edition statement) groups information relating to one edition of a text. [2.2.2. 2.2. ]

editor: secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution or organization, (or of several such) acting as editor, compiler, translator, etc. [3.11.2.2. ]

editorialDecl: (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text. [2.3.3. 2.3. 15.3.2. ]

eg: (example) contains any kind of illustrative example. [22.4.4. 22.4.5. ]

emph: (emphasized) marks words or phrases which are stressed or emphasized for linguistic or rhetorical effect. [3.3.2.2. 3.3.2. ]

encodingDesc: (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived. [2.3. 2.1.1. ]

epigraph: contains a quotation, anonymous or attributed, appearing at the start or end of a section or on a title page. [4.2.3. 4.2. 4.6. ]

expan: (expansion) contains the expansion of an abbreviation. [3.5.5. ]

extent: describes the approximate size of a text as stored on some carrier medium, whether digital or non-digital, specified in any convenient units. [2.2.3. 2.2. 3.11.2.3. ]

figDesc: (description of figure) contains a brief prose description of the appearance or content of a graphic figure, for use when documenting an image without displaying it. [14.4. ]

figure: groups elements representing or containing graphic information such as an illustration, formula, or figure. [14.4. ]

fileDesc: (file description) contains a full bibliographic description of an electronic file. [2.2. 2.1.1. ]

foreign: (foreign) identifies a word or phrase as belonging to some language other than that of the surrounding text. [3.3.2.1. ]

formula: contains a mathematical or other formula. [14.2. ]

front: (front matter) contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found at the start of a document, before the main body. [4.6. 4. ]

funder: (funding body) specifies the name of an individual, institution, or organization responsible for the funding of a project or text. [2.2.1. ]

gap: (gap) indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible, invisible, or inaudible. [3.4.3. ]

gi: (element name) contains the name (generic identifier) of an element. [22. 22.4.4. ]

gloss: identifies a phrase or word used to provide a gloss or definition for some other word or phrase. [3.3.4. ]

graphic: indicates the location of an inline graphic, illustration, or figure. [3.9. ]

group: contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc. [4. 4.3.1. 15.1. ]

head: (heading) contains any type of heading, for example the title of a section, or the heading of a list, glossary, manuscript description, etc. [4.2.1. ]

hi: (highlighted) marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made. [3.3.2.2. 3.3.2. ]

ident: (identifier) contains an identifier or name for an object of some kind in a formal language. ident is used for tokens such as variable names, class names, type names, function names etc. in formal programming languages. [22.1.1. ]

idno: (identifier) supplies any form of identifier used to identify some object, such as a bibliographic item, a person, a title, an organization, etc. in a standardized way. [2.2.4. 2.2.5. 3.11.2.3. ]

index: (index entry) marks a location to be indexed for whatever purpose. [3.8.2. ]

interp: (interpretation) summarizes a specific interpretative annotation which can be linked to a span of text. [17.3. ]

interpGrp: (interpretation group) collects together a set of related interpretations which share responsibility or type. [17.3. ]

item: contains one component of a list. [3.7. 2.5. ]

keywords: contains a list of keywords or phrases identifying the topic or nature of a text. [2.4.3. ]

l: (verse line) contains a single, possibly incomplete, line of verse. [3.12.1. 3.12. 7.2.5. ]

label: contains any label or heading used to identify part of a text, typically but not exclusively in a list or glossary. [3.7. ]

langUsage: (language usage) describes the languages, sublanguages, registers, dialects, etc. represented within a text. [2.4.2. 2.4. 15.3.2. ]

language: characterizes a single language or sublanguage used within a text. [2.4.2. ]

lb: (line break) marks the start of a new (typographic) line in some edition or version of a text. [3.10.3. 7.2.5. ]

lg: (line group) contains one or more verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc. [3.12.1. 3.12. 7.2.5. ]

licence: contains information about a licence or other legal agreement applicable to the text. [2.2.4. ]

list: (list) contains any sequence of items organized as a list. [3.7. ]

listBibl: (citation list) contains a list of bibliographic citations of any kind. [3.11.1. 2.2.7. 15.3.2. ]

macro.limitedContent: (paragraph content) defines the content of prose elements that are not used for transcription of extant materials.

macro.paraContent: (paragraph content) defines the content of paragraphs and similar elements.

macro.phraseSeq: (phrase sequence) defines a sequence of character data and phrase-level elements.

macro.phraseSeq.limited: (limited phrase sequence) defines a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents.

macro.specialPara: ('special' paragraph content) defines the content model of elements such as notes or list items, which either contain a series of component-level elements or else have the same structure as a paragraph, containing a series of phrase-level and inter-level elements.

mentioned: marks words or phrases mentioned, not used. [3.3.3. ]

milestone: marks a boundary point separating any kind of section of a text, typically but not necessarily indicating a point at which some part of a standard reference system changes, where the change is not represented by a structural element. [3.10.3. ]

model.addrPart: groups elements such as names or postal codes which may appear as part of a postal address.

model.addressLike: groups elements used to represent a postal or e-mail address.

model.availabilityPart: groups elements such as licences and paragraphs of text which may appear as part of an availability statement.

model.biblLike: groups elements containing a bibliographic description.

model.biblPart: groups elements which represent components of a bibliographic description.

model.certLike: groups elements which are used to indicate uncertainty or precision of other elements.

model.choicePart: groups elements (other than choice itself) which can be used within a choice alternation.

model.common: groups common chunk- and inter-level elements.

model.dateLike: groups elements containing temporal expressions.

model.descLike: groups elements which contain a description of their function.

model.div1Like: groups top-level structural divisions.

model.divBottom: groups elements appearing at the end of a text division.

model.divBottomPart: groups elements which can occur only at the end of a text division.

model.divGenLike: groups elements used to represent a structural division which is generated rather than explicitly present in the source.

model.divLike: groups elements used to represent un-numbered generic structural divisions.

model.divPart: groups paragraph-level elements appearing directly within divisions.

model.divTop: groups elements appearing at the beginning of a text division.

model.divTopPart: groups elements which can occur only at the beginning of a text division.

model.divWrapper: groups elements which can appear at either top or bottom of a textual division.

model.editorialDeclPart: groups elements which may be used inside editorialDecl and appear multiple times.

model.egLike: groups elements containing examples or illustrations.

model.emphLike: groups phrase-level elements which are typographically distinct and to which a specific function can be attributed.

model.encodingDescPart: groups elements which may be used inside encodingDesc and appear multiple times.

model.entryPart: groups elements appearing at any level within a dictionary entry.

model.entryPart.top: groups high level elements within a structured dictionary entry.

model.frontPart: groups elements which appear at the level of divisions within front or back matter.

model.gLike: groups elements used to represent individual non-Unicode characters or glyphs.

model.global: groups elements which may appear at any point within a TEI text.

model.global.edit: groups globally available elements which perform a specifically editorial function.

model.global.meta: groups globally available elements which describe the status of other elements.

model.glossLike: groups elements which provide an alternative name, explanation, or description for any markup construct.

model.graphicLike: groups elements containing images, formulae, and similar objects.

model.headLike: groups elements used to provide a title or heading at the start of a text division.

model.hiLike: groups phrase-level elements which are typographically distinct but to which no specific function can be attributed.

model.highlighted: groups phrase-level elements which are typographically distinct.

model.imprintPart: groups the bibliographic elements which occur inside imprints.

model.inter: groups elements which can appear either within or between paragraph-like elements.

model.lLike: groups elements representing metrical components such as verse lines.

model.lPart: groups phrase-level elements which may appear within verse only.

model.labelLike: groups elements used to gloss or explain other parts of a document.

model.limitedPhrase: groups phrase-level elements excluding those elements primarily intended for transcription of existing sources.

model.linePart: groups transcriptional elements which appear within lines or zones of a source-oriented transcription within a <sourceDoc> element.

model.listLike: groups list-like elements.

model.measureLike: groups elements which denote a number, a quantity, a measurement, or similar piece of text that conveys some numerical meaning.

model.milestoneLike: groups milestone-style elements used to represent reference systems.

model.nameLike: groups elements which name or refer to a person, place, or organization.

model.nameLike.agent: groups elements which contain names of individuals or corporate bodies.

model.noteLike: groups globally-available note-like elements.

model.pLike: groups paragraph-like elements.

model.pLike.front: groups paragraph-like elements which can occur as direct constituents of front matter.

model.pPart.data: groups phrase-level elements containing names, dates, numbers, measures, and similar data.

model.pPart.edit: groups phrase-level elements for simple editorial correction and transcription.

model.pPart.editorial: groups phrase-level elements for simple editorial interventions that may be useful both in transcribing and in authoring.

model.pPart.transcriptional: groups phrase-level elements used for editorial transcription of pre-existing source materials.

model.phrase: groups elements which can occur at the level of individual words or phrases.

model.phrase.xml: groups phrase-level elements used to encode XML constructs such as element names, attribute names, and attribute values.

model.profileDescPart: groups elements which may be used inside profileDesc and appear multiple times.

model.ptrLike: groups elements used for purposes of location and reference.

model.publicationStmtPart: groups elements which may appear within the publicationStmt element of the TEI Header.

model.qLike: groups elements related to highlighting which can appear either within or between chunk-level elements.

model.quoteLike: groups elements used to directly contain quotations.

model.resourceLike: groups non-textual elements which may appear together with a header and a text to constitute a TEI document.

model.respLike: groups elements which are used to indicate intellectual or other significant responsibility, for example within a bibliographic element.

model.segLike: groups elements used for arbitrary segmentation.

model.sourceDescPart: groups elements which may be used inside sourceDesc and appear multiple times.

model.stageLike: groups elements containing stage directions or similar things defined by the module for performance texts.

model.teiHeaderPart: groups high level elements which may appear more than once in a TEI Header.

model.titlepagePart: groups elements which can occur as direct constituents of a title page, such as docTitle, docAuthor, docImprint, or epigraph.

name: (name, proper noun) contains a proper noun or noun phrase. [3.5.1. ]

note: contains a note or annotation. [3.8.1. 2.2.6. 3.11.2.6. 9.3.5.4. ]

notesStmt: (notes statement) collects together any notes providing information about a text additional to that recorded in other parts of the bibliographic description. [2.2.6. 2.2. ]

num: (number) contains a number, written in any form. [3.5.3. ]

opener: groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter. [4.2. ]

orig: (original form) contains a reading which is marked as following the original, rather than being normalized or corrected. [3.4.2. 12. ]

p: (paragraph) marks paragraphs in prose. [3.1. 7.2.5. ]

pb: (page break) marks the boundary between one page of a text and the next in a standard reference system. [3.10.3. ]

pc: (punctuation character) contains a character or string of characters regarded as constituting a single punctuation mark. [17.1. ]

postscript: contains a postscript, e.g. to a letter. [4.2. ]

principal: (principal researcher) supplies the name of the principal researcher responsible for the creation of an electronic text. [2.2.1. ]

profileDesc: (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting. [2.4. 2.1.1. ]

projectDesc: (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected. [2.3.1. 2.3. 15.3.2. ]

ptr: (pointer) defines a pointer to another location. [3.6. 16.1. ]

pubPlace: (publication place) contains the name of the place where a bibliographic item was published. [3.11.2.3. ]

publicationStmt: (publication statement) groups information concerning the publication or distribution of an electronic or other text. [2.2.4. 2.2. ]

publisher: provides the name of the organization responsible for the publication or distribution of a bibliographic item. [3.11.2.3. 2.2.4. ]

q: (quoted) contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used. [3.3.3. ]

ref: (reference) defines a reference to another location, possibly modified by additional text or comment. [3.6. 16.1. ]
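The difference between ptr and ref can be illustrated with a short sketch. It assumes both carry a target attribute pointing at an element whose identifier is sec-intro; that identifier and the surrounding sentences are invented for illustration.

```xml
<!-- ptr is empty: the processing software supplies any visible link text -->
<p>The background is summarized elsewhere (<ptr target="#sec-intro"/>).</p>

<!-- ref carries its own text, which modifies or describes the reference -->
<p>The background is summarized in <ref target="#sec-intro">the introduction</ref>.</p>
```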

refsDecl: (references declaration) specifies how canonical references are constructed for this text. [2.3.6.3. 2.3. 2.3.6. ]

reg: (regularization) contains a reading which has been regularized or normalized in some sense. [3.4.2. 12. ]
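As a minimal sketch, an original spelling and its regularized form might be recorded side by side as follows. This assumes the choice element, described elsewhere in this document, is used to group the two readings; the sentence itself is invented for illustration.

```xml
<p>He looked
  <choice>
    <orig>vpon</orig>   <!-- spelling as it appears in the source -->
    <reg>upon</reg>     <!-- regularized modern spelling -->
  </choice>
the scene.</p>
```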

relatedItem: contains or references some other bibliographic item which is related to the present one in some specified manner, for example as a constituent or alternative version of it. [3.11.2.5. ]

resp: (responsibility) contains a phrase describing the nature of a person's intellectual responsibility, or an organization's role in the production or distribution of a work. [3.11.2.2. 2.2.1. 2.2.2. 2.2.5. ]

respStmt: (statement of responsibility) supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply. May also be used to encode information about individuals or organizations which have played a role in the production or distribution of a bibliographic work. [3.11.2.2. 2.2.1. 2.2.2. 2.2.5. ]

revisionDesc: (revision description) summarizes the revision history for a file. [2.5. 2.1.1. ]

row: contains one row of a table. [14.1.1. ]

rs: (referencing string) contains a general purpose name or referring string. [13.2.1. 3.5.1. ]

s: (s-unit) contains a sentence-like division of a text. [17.1. 8.4.1. ]

salute: (salutation) contains a salutation or greeting prefixed to a foreword, dedicatory epistle, or other division of a text, or the salutation in the closing of a letter, preface, etc. [4.2.2. ]

samplingDecl: (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection. [2.3.2. 2.3. 15.3.2. ]

seg: (arbitrary segment) represents any segmentation of text below the ‘chunk’ level. [16.3. 6.2. 7.2.5. ]

seriesStmt: (series statement) groups information about the series, if any, to which a publication belongs. [2.2.5. 2.2. ]

sic: (Latin for thus or so) contains text reproduced although apparently incorrect or inaccurate. [3.4.1. ]

signed: (signature) contains the closing salutation, etc., appended to a foreword, dedicatory epistle, or other division of a text. [4.2.2. ]

soCalled: contains a word or phrase for which the author or narrator indicates a disclaiming of responsibility, for example by the use of scare quotes or italics. [3.3.3. ]

sourceDesc: (source description) describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as "born digital" for a text which has no previous existence. [2.2.7. ]

sp: (speech) contains an individual speech in a performance text, or a passage presented as such in a prose or verse text. [3.12.2. 3.12. 7.2.2. ]

speaker: contains a specialized form of heading or label, giving the name of one or more speakers in a dramatic text or fragment. [3.12.2. ]

stage: (stage direction) contains any kind of stage direction within a dramatic text or fragment. [3.12.2. 3.12. 7.2.4. ]
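The three performance-text elements sp, speaker, and stage are typically used together. The following is a sketch only: the speaker name, stage direction, and line of dialogue are invented, and the speech content is assumed to be prose held in a p element.

```xml
<sp>
  <speaker>First Clerk</speaker>          <!-- who is speaking -->
  <stage>Looking up from the ledger</stage>  <!-- direction within the speech -->
  <p>The accounts do not balance, sir.</p>   <!-- the speech itself -->
</sp>
```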

table: contains text displayed in tabular form, in rows and columns. [14.1.1. ]
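A small table might be encoded along the following lines. This sketch assumes the cell element and the role attribute described elsewhere in this document; the cell contents are invented for illustration.

```xml
<table>
  <row role="label">            <!-- a row of column headings -->
    <cell>Element</cell>
    <cell>Purpose</cell>
  </row>
  <row>                          <!-- an ordinary data row -->
    <cell>row</cell>
    <cell>one row of a table</cell>
  </row>
</table>
```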

taxonomy: defines a typology either implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy. [2.3.7. ]

teiCorpus: contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text. [4. 15.1. ]

teiHeader: (TEI Header) supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text. [2.1.1. 15.1. ]
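As a hedged illustration of how the header elements listed here fit together, a very small teiHeader might look like the following. It assumes the fileDesc element, described elsewhere in this document, as the wrapper for the three statements; the title and the prose inside each statement are placeholders, not recommended wording.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A sample text: an electronic transcription</title>
    </titleStmt>
    <publicationStmt>
      <p>Unpublished working draft.</p>
    </publicationStmt>
    <sourceDesc>
      <p>Born digital.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```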

term: contains a single-word, multi-word, or symbolic designation which is regarded as a technical term. [3.3.4. ]

text: contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample. [4. 15.1. ]

textClass: (text classification) groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc. [2.4.3. ]

time: contains a phrase defining a time of day in any format. [3.5.4. ]

title: contains a title for any kind of work. [3.11.2.2. 2.2.1. 2.2.5. ]

titlePage: (title page) contains the title page of a text, appearing within the front or back matter. [4.6. ]

titlePart: contains a subsection or division of the title of a work, as indicated on a title page. [4.6. ]

titleStmt: (title statement) groups information about the title of a work and those responsible for its content. [2.2.1. 2.2. ]

trailer: contains a closing title or footer appearing at the end of a division of a text. [4.2.4. 4.2. ]

unclear: contains a word, phrase, or passage which cannot be transcribed with certainty because it is illegible or inaudible in the source. [11.3.3.1. 3.4.3. ]

val: (value) contains a single attribute value. [22. 22.4.5. ]

w: (word) represents a grammatical (not necessarily orthographic) word. [17.1. ]
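The segmentation elements s, w, and pc, all listed above, can be combined to mark up a sentence word by word. The following is a minimal sketch with an invented sentence; real projects would usually add identifying or analytic attributes, which are omitted here.

```xml
<s>
  <w>The</w>
  <w>ship</w>
  <w>sailed</w><pc>.</pc>   <!-- the full stop is tagged separately from the word -->
</s>
```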
