3 Elements Available in All TEI Documents

This chapter describes elements which may appear in any kind of text and the tags used to mark them in all TEI documents. Most of these elements are freely floating phrases, which can appear at any point within the textual structure, although they must generally be contained by a higher-level element of some kind (such as a paragraph). A few of the elements described in this chapter (for example, bibliographic citations and lists) have a comparatively well-defined internal structure, but most of them have no consistent inner structure of their own. In the general case, they contain only a few words, and are often identifiable in a conventionally printed text by the use of typographic conventions such as shifts of font, use of quotation or other punctuation marks, or other changes in layout.

This chapter begins by describing the p tag used to mark paragraphs, the prototypical formal unit for running text in many TEI modules. This is followed, in section 3.2 Treatment of Punctuation, by a discussion of some specific problems associated with the interpretation of conventional punctuation, and the methods proposed by the Guidelines for resolving ambiguities therein.

The next section (section 3.3 Highlighting and Quotation) describes a number of phrase-level elements commonly marked by typographic features (and thus well-represented in conventional markup languages). These include features commonly marked by font shifts (section 3.3.2 Emphasis, Foreign Words, and Unusual Language) and features commonly marked by quotation marks (section 3.3.3 Quotation) as well as such features as terms, cited words, and glosses (section 3.3.4 Terms, Glosses, Equivalents, and Descriptions).

Section 3.4 Simple Editorial Changes introduces some phrase-level elements which may be used to record simple editorial interventions, such as emendation or correction of the encoded text. The elements described here constitute a simple subset of the full mechanisms for encoding such information (described in full in chapter 11 Representation of Primary Sources), which should be adequate to most commonly encountered situations.

The next section (section 3.5 Names, Numbers, Dates, Abbreviations, and Addresses) describes several phrase-level and inter-level elements which, although often of interest for analysis or processing, are rarely explicitly identified in conventional printing. These include names (section 3.5.1 Referring Strings), numbers and measures (section 3.5.3 Numbers and Measures), dates and times (section 3.5.4 Dates and Times), abbreviations (section 3.5.5 Abbreviations and Their Expansions), and addresses (section 3.5.2 Addresses).

In the same way, the following section (section 3.6 Simple Links and Cross-References) presents only a subset of the facilities available for the encoding of cross-references or text-linkage. The full story may be found in chapter 16 Linking, Segmentation, and Alignment; the tags presented here are intended to be usable for a wide variety of simple applications.

Sections 3.7 Lists, and 3.8 Notes, Annotation, and Indexing, describe two kinds of quasi-structural elements: lists and notes. These may appear either within chunk-level elements such as paragraphs, or between them. Several kinds of lists are catered for, of an arbitrary complexity. The section on notes discusses both notes found in the source and simple mechanisms for adding annotations of an interpretive nature during the encoding; again, only a subset of the facilities described in full elsewhere (specifically, in chapter 17 Simple Analytic Mechanisms) is discussed.

Section 3.9 Graphics and other non-textual components introduces some simple ways of representing graphic or other non-textual content found in a text. A fuller discussion of the multimedia facilities supported by these Guidelines may be found in chapters 14 Tables, Formulæ, and Graphics and 16 Linking, Segmentation, and Alignment.

Next, section 3.10 Reference Systems, describes methods of encoding within a text the conventional system or systems used when making references to the text. Some reference systems have attained canonical authority and must be recorded to make the text useable in normal work; in other cases, a convenient reference system must be created by the creator or analyst of an electronic text.

Like lists and notes, the bibliographic citations discussed in section 3.11 Bibliographic Citations and References, may be regarded as structural elements in their own right. A range of possibilities is presented for the encoding of bibliographic citations or references, which may be treated as simple phrases within a running text, or as highly-structured components suitable for inclusion in a bibliographic database.

Additional elements for the encoding of passages of verse or drama (whether prose or verse) are discussed in section 3.12 Passages of Verse or Drama.

The chapter concludes with a technical overview of the structure and organization of the module described here. This should be read in conjunction with chapter 1 The TEI Infrastructure, describing the structure of the TEI document type definition.

3.1 Paragraphs

The paragraph is the fundamental organizational unit for all prose texts, being the smallest regular unit into which prose can be divided. Prose can appear in all TEI texts, even those that are primarily of another genre (e.g., verse); thus the paragraph is described here, as an element which can appear in any kind of text.

Paragraphs can contain any of the other elements described within this chapter, as well as some other elements which are specific to individual text types. We distinguish phrase-level elements, which must be entirely contained within a paragraph and cannot appear except within one, from chunks, which can appear between, but not within, paragraphs, and from inter-level elements, which can appear either within a single paragraph or between paragraphs. The class of phrases includes emphasized or quoted phrases, names, dates, etc. The class of inter-level elements includes bibliographic citations, notes, lists, etc. The class of chunks includes the paragraph itself, and other elements which have similar structural properties, notably the ab (anonymous block) element described in 16.3 Blocks, Segments, and Anchors, which may be used as an alternative to the paragraph in some kinds of texts.
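
The three classes can be seen side by side in a small fragment. The following is only an illustrative sketch: the text is invented, and the enclosing div is assumed merely to give the fragment a single root.

```xml
<div>
 <!-- chunk: the paragraph itself -->
 <p>The committee met on <date>12 May</date> and agreed three points:
  <!-- date above is phrase-level: it may appear only inside a chunk -->
  <!-- list is inter-level: here it appears inside a paragraph -->
  <list>
   <item>the budget</item>
   <item>the schedule</item>
   <item>the venue</item>
  </list>
 </p>
 <!-- the same inter-level element, this time between paragraphs -->
 <list>
  <item>Minutes to follow.</item>
 </list>
 <p>No other business was raised.</p>
</div>
```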

Because paragraphs may appear in different base or additional tag sets, their possible contents may differ in different kinds of documents. In particular, additional elements not listed in this chapter may appear in paragraphs in certain kinds of text. However, the elements described in this chapter are always by default available in all kinds of text.

The paragraph is marked using the p element:
  • p (paragraph) marks paragraphs in prose.

If a consistent internal subdivision of paragraphs is desired, the s or seg (‘segment’) elements may be used, as discussed in chapters 16 Linking, Segmentation, and Alignment and 17 Simple Analytic Mechanisms respectively. More usually, however, paragraphs have no firm internal structure, but contain prose encoded as a mix of characters, entity references, phrases marked as described in the rest of this chapter, and embedded elements like lists, figures, or tables.
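
As a minimal sketch of such internal subdivision (the sentences are invented, and the type value on seg is a hypothetical user-defined code, not a recommended one):

```xml
<p>
 <!-- each s marks one orthographic sentence -->
 <s>The rain stopped at noon.</s>
 <!-- seg marks an arbitrary sub-sentence segment -->
 <s>By evening the river had <seg type="clause">fallen back within its banks</seg>.</s>
</p>
```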

Since paragraphs are usually explicitly marked in Western texts, typically by indentation, the application of the p tag usually presents few problems.

In some cases, the body of a text may comprise but a single paragraph:
<body>
 <p>I fully appreciate Gen. Pope's splendid achievements with their
   invaluable results; but you must know that Major Generalships in the
   Regular Army, are not as plenty as blackberries.</p>
</body>
This news story shows typically short journalistic paragraphs:
<head>SARAJEVO, Bosnia and Herzegovina, April 19</head>
<p>Serbs seized more territory in this struggling new country today as
the United States Air Force ended a two-day airlift of humanitarian
aid into the capital, Sarajevo.</p>
<p>International relief workers called on European Community nations
to step up their humanitarian aid to the former Yugoslav republic,
in conjunction with new American aid flights if necessary.</p>
<p>A special envoy from the European Community, Colin Doyle, harshly
condemned the decision by Serbs to shell Sarajevo on Saturday night
during a visit to the Bosnian capital by a senior American official,
Deputy Assistant Secretary of State Ralph R. Johnson.</p>
<p>...</p>
The following extract from a Russian fairy tale demonstrates how other phrase level elements (in this case q elements representing direct speech; see section 3.3.3 Quotation) may be nested within, but not across, paragraphs:
<p>A fly built a castle, a tall and mighty castle.
There came to the castle the Crawling Louse. <q>Who,
   who's in the castle? Who, who's in your house?</q>
said the Crawling Louse. <q>I, I, the Languishing Fly.
   And who art thou?</q>
 <q>I'm the Crawling Louse.</q>
</p>
<p>Then came to the castle the Leaping Flea. <q>Who,
   who's in the castle?</q> said the Leaping Flea. <q>I,
   I, the Languishing Fly, and I, the Crawling Louse. And
   who art thou?</q>
 <q>I'm the Leaping Flea.</q>
</p>
<p>Then came to the castle the Mischievous Mosquito.
<q>Who, who's in the castle?</q> said the Mischievous
Mosquito. <q>I, I, the Languishing Fly, and I, the
   Crawling Louse, and I, the Leaping Flea. And who art
   thou?</q>
 <q>I'm the Mischievous Mosquito.</q>
</p>

3.2 Treatment of Punctuation

Punctuation marks cause problems for text markup when they are not available in the character set used and when they are significantly ambiguous. To a large extent, the availability of the Unicode character set addresses most such problems, since it provides specific code points for most punctuation marks, and also distinguishes glyphs (such as stop, comma, and hyphen) which are used with different functions. Thus, for example, different Unicode code points are available for the hyphen used as a minus sign, as a word breaking hyphen, as a soft hyphen, or as a ‘non-breaking’ hyphen. The facilities described in chapter 5 Representation of Non-standard Characters and Glyphs may also be used to define markup for non-standard punctuation characters.
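
The hyphen case can be illustrated with numeric character references. This is a sketch with invented sentences; the code points named in the comments are the standard Unicode assignments, but which of them a project uses is an encoding decision.

```xml
<p>
 <!-- U+2212 MINUS SIGN -->
 The temperature fell to &#x2212;5 degrees.
 <!-- U+2010 HYPHEN -->
 It is a well&#x2010;known example.
 <!-- U+00AD SOFT HYPHEN: an optional break point, normally invisible -->
 This is an extra&#xAD;ordinary case.
 <!-- U+2011 NON-BREAKING HYPHEN -->
 Please send an e&#x2011;mail.
</p>
```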

Full stop (period) may mark (orthographic) sentence boundaries, abbreviations, decimal points, or serve as a visual aid in printing numbers. These usages can be distinguished by tagging S-units, abbreviations, and numbers, as described in sections 16.3 Blocks, Segments, and Anchors, 3.5.5 Abbreviations and Their Expansions, and 3.5.3 Numbers and Measures. However, there are independent reasons for tagging these, whether or not they are marked by full stops, and the polysemy of the full stop itself is perhaps no different from that of any character in the writing system.
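
A short sketch of how tagging the underlying features separates these uses of the full stop (the sentences are invented; the value attributes give normalized numbers):

```xml
<p>
 <!-- the stop in "Dr." belongs to an abbreviation, the one in "3.5" is a decimal point,
      and the last one ends the S-unit -->
 <s><abbr>Dr.</abbr> Lee paid <num value="3.5">3.5</num> pounds for the book.</s>
 <!-- here the stop inside the number is only a visual aid for thousands -->
 <s>The print run was <num value="12500">12.500</num> copies.</s>
</p>
```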

Question mark and exclamation mark typically mark the end of orthographic sentences, but may also be used as a mid-sentence comment by the author (! to express surprise or some other strong feeling, ? to query a word or expression or mark a sentence as dubious in linguistic discussion). These uses may be distinguished by marking S-units, in which case the mid-sentence uses of these punctuation marks may be left unmarked, or tagged using the c element discussed in 17.1 Linguistic Segment Categories.

Dashes are used for a variety of purposes: insertion, interruption, new speaker (in dialogue), list item. In the latter two cases it is preferable to mark the underlying feature using the elements q or item, on which see section 3.3.3 Quotation, and section 3.7 Lists, respectively.
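
For instance, where a source opens each speaker's turn or each list item with a dash, the feature rather than the dash might be encoded along the following lines. The content is invented, and the enclosing div is assumed only to give the fragment a single root.

```xml
<div>
 <!-- source: two turns of dialogue, each introduced by a dash -->
 <p>
  <q>Where are you going?</q>
  <q>To the market.</q>
 </p>
 <!-- source: three items, each introduced by a dash -->
 <list>
  <item>bread</item>
  <item>cheese</item>
  <item>wine</item>
 </list>
</div>
```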

Quotation marks may be removed from text contained by q or quote elements, especially as quotations are not always marked by quotation marks (notably long quotations) or may be marked in a variety of ways; see the discussion of quotation and related features in section 3.3.3 Quotation.

Apostrophes must be distinguished from single quote marks. As with hyphens, this disambiguation may be performed by selecting an appropriate Unicode character, but it may also be represented by using explicit XML tags for quotations as suggested above. However, apostrophes have a variety of uses. In English they mark contractions, genitive forms, and (occasionally) plural forms. Full disambiguation of these uses belongs to the level of linguistic analysis and interpretation.

Parentheses and other marks of suspension such as dashes or ellipses are often used to signal information about the syntactic structure of a text fragment. Full disambiguation of their uses also belongs to the level of linguistic analysis and interpretation, and is therefore discussed in chapter 17 Simple Analytic Mechanisms.

Where punctuation marks are disambiguated by tagging the underlying feature they signal, it may be debated whether they should be excluded or left as part of the text. In the case of quotation marks, it may be more convenient to distinguish opening from closing marks simply by using the appropriate Unicode character than to use the q element, with or without a rend attribute. The solution chosen will vary depending upon the feature and depending upon the purpose of the project.
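
The two approaches to quotation marks might look as follows. This is a sketch with an invented sentence; the rend value follows the informal PRE and POST convention used in examples later in this chapter and is not a defined vocabulary.

```xml
<div>
 <!-- Option 1: keep the marks as Unicode characters and add no q element -->
 <p>“Come in,” she said.</p>
 <!-- Option 2: tag the feature, drop the marks, and describe them in rend -->
 <p><q rend="PRE ldquo POST rdquo">Come in,</q> she said.</p>
</div>
```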

3.3 Highlighting and Quotation

This section deals with a variety of textual features, all of which have in common that they are frequently realized in conventional printing practice by the use of such features as underlining, italic fonts, or quotation marks, collectively referred to here as highlighting. After an initial discussion of this phenomenon and alternate approaches to encoding it, this section describes ways of encoding the following textual features, all of which are conventionally rendered using some kind of highlighting:
  • emphasis, foreign words and other linguistically distinct uses of highlighting
  • representation of speech and thought, quotation, etc.
  • technical terms, glosses, etc.

3.3.1 What Is Highlighting?

By highlighting we mean the use of any combination of typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings.9 The purpose of highlighting is generally to draw the reader's attention to some feature or characteristic of the passage highlighted; this section describes the elements recommended by these Guidelines for the encoding of such textual features.

In conventionally printed modern texts, highlighting is often employed to identify words or phrases which are regarded as being one or more of the following:
  • distinct in some way — as foreign, dialectal, archaic, technical, etc.
  • emphatic, and which would for example be stressed when spoken
  • not part of the body of the text, for example cross-references, titles, headings, labels, etc.
  • identified with a distinct narrative stream, for example an internal monologue or commentary.
  • attributed by the narrator to some other agency, either within the text or outside it: for example, direct speech or quotation.
  • set apart from the text in some other way: for example, proverbial phrases, words mentioned but not used, names of persons and places in older texts, editorial corrections or additions, etc.

The textual functions indicated by highlighting may not be rendered consistently in different parts of a text or in different texts. (For example, a foreign word may appear in italics if the surrounding text is in roman, but in roman if the surrounding text is in italics.) For this reason, these Guidelines distinguish between the encoding of rendering itself and the encoding of the underlying feature expressed by it.

Highlighting as such may be encoded by using either of the global attributes rend or rendition (see 1.3.1.1 Global Attributes). This allows the encoder both to specify the function of a highlighted phrase or word, by selecting the appropriate element described here or elsewhere in the Guidelines, and to further describe the way in which it is highlighted, by means of the rend attribute. If the encoder wishes to offer no interpretation of the feature underlying the use of highlighting in the source text, then the hi element may be used, which indicates only that the text so tagged was highlighted in some way.
  • hi (highlighted) marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made.
The hi element is provided by the model.hiLike class.

The possible values carried by the rend attribute are not formally defined in this version of the Guidelines. Since the rend attribute may be used to document any peculiarity of the way a given segment of text was rendered in the original source text, it may need to express a very large range of typographic features, by no means restricted to typeface, type size, etc.

Where it is both appropriate and feasible, these Guidelines recommend that the textual feature marked by the highlighting should be encoded, rather than just the simple fact of the highlighting. This is for the following reasons:
  • the same kind of highlighting may be used for different purposes in different contexts
  • the same textual function may be highlighted in different ways in different contexts
  • for analytic purposes, it is in general more useful to know the intended function of a highlighted phrase than simply that it is distinct.

In many, if not most, cases the underlying function of a highlighted phrase will be obvious and non-controversial, since the distinctions indicated by a change of highlighting correspond with distinctions discussed elsewhere in these Guidelines. The elements available to record such distinctions are, for the most part, members of the model.emphLike class. This and the model.hiLike class mentioned above constitute the model.highlighted class, which is a phrase level class. Members of this class may appear anywhere within paragraph level elements.

The distinction between the two classes is simple, and typified by the two elements hi and emph: the former marks simply that a passage is typographically distinct in some way, while the latter asserts that a passage is linguistically emphasized for some purpose. These two properties, though often combined, are not identical. It should be recognized, however, that cases do exist in which it is not economically feasible to mark the underlying function (e.g. in the preparation of large text corpora), as well as cases in which it is not intellectually appropriate (as in the transcription of some older materials, or in the preparation of material for the study of typographic practice). In such cases, the hi element or some other element from the model.hiLike class should be used.

Elements which are sometimes realized by typographic distinction but which are not discussed in this section include title (discussed in section 3.11 Bibliographic Citations and References) and name (discussed in section 3.5.1 Referring Strings).

3.3.2 Emphasis, Foreign Words, and Unusual Language

This subsection discusses the following elements:
  • foreign (foreign) identifies a word or phrase as belonging to some language other than that of the surrounding text.
  • emph (emphasized) marks words or phrases which are stressed or emphasized for linguistic or rhetorical effect.
  • distinct identifies any word or phrase which is regarded as linguistically distinct, for example as archaic, technical, dialectal, non-preferred, etc., or as forming part of a sublanguage.
These elements are all members of the model.emphLike class.
3.3.2.1 Foreign Words or Expressions
Words or phrases which are not in the main language of the text should be tagged as such, at least where the fact is indicated in the text. Where the word or phrase concerned is already distinguished from the rest of the text by virtue of its function (for example, because it is a name, a technical term, a quotation, a mentioned word, etc.) then the global xml:lang attribute should be used to specify additionally that its language distinguishes it from the surrounding text. Any element in the TEI scheme may take an xml:lang attribute, which specifies both the writing system and the language used by its content (see section vi.1. Language identification for discussion of this attribute). Where there is no other applicable element, the element foreign may be used to provide a peg onto which the xml:lang attribute may be attached.
<q>Aren't you confusing <foreign xml:lang="la">post hoc</foreign> with <foreign xml:lang="la">propter hoc</foreign>?</q> said the Bee Master.
<q>Wax-moth only succeed when weak bees let them in.</q>
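By contrast, where a functional element already applies, the language can be recorded on that element directly. A minimal sketch, with invented sentences:

```xml
<p>Her favourite maxim was <quote xml:lang="la">festina lente</quote>,
 and she signed every letter <name xml:lang="fr">Marie-Claire</name>.
 <!-- no functional element applies here, so foreign carries xml:lang -->
 He felt a sense of <foreign xml:lang="fr">déjà vu</foreign>.</p>
```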
The foreign element should not be used to represent foreign words which are mentioned or glossed within the text: for these use the appropriate element from section 3.3.4 Terms, Glosses, Equivalents, and Descriptions below. Compare the following example sentences:
John eats a <foreign xml:lang="fr">croissant</foreign> every morning.
<mentioned xml:lang="fr">Croissant</mentioned> is difficult to
pronounce with your mouth full.
A <term xml:lang="fr">croissant</term> is a crescent-shaped
piece of light, buttery, pastry that is usually eaten for
breakfast, especially in France.
3.3.2.2 Emphatic Words and Phrases
The emph element is provided to mark words or phrases which are linguistically emphatic or stressed. Text which is only typographically ‘emphasized’ falls into the class of highlighted text, and may be tagged with the hi element. In printed works, emphasis is generally indicated by devices such as the use of an italic font, a large typeface, or extra wide letter spacing; in manuscripts and typescripts, it is usually indicated by the use of underlining. As the following examples demonstrate, an encoder may choose whether or not to make explicit the particular type of rendition associated with the emphasis by use of the rend attribute. If a source text consistently renders a particular feature (e.g. emphasis or words in foreign languages) in a particular way, the rendering associated with that feature may be described in the TEI header using the rendition element. The rend attribute may then be used to describe examples which deviate from the norm. For example, assuming that the TEI Header has defined a default rendering for the emph element, the following encoding would use it:
<q>Sex, sir, is <emph>purely</emph> a
question of appetite!</q> Tarr exclaimed.
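A header declaration that would supply such a default might look like the sketch below. It assumes that CSS is the style language and that the version of the scheme in use allows a selector attribute on rendition; other versions may associate a default rendition with an element differently. The identifier emph-default is hypothetical.

```xml
<encodingDesc>
 <tagsDecl>
  <!-- assumed default: every emph is rendered in italic unless rend says otherwise -->
  <rendition xml:id="emph-default" scheme="css" selector="emph">font-style: italic;</rendition>
 </tagsDecl>
</encodingDesc>
```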
If on the other hand no such default has been defined for the element, the encoder may specify it informally using the rend attribute:
<q>What it all comes to is this,</q> he said.

<q>
 <emph rend="italic">What does Christopher
   Robin do in the morning nowadays?</emph>
</q>
or, if a rendition element has been provided in the header (but not necessarily associated with any other element), the rendition attribute may be used to point to it:
<l>Here Thou, great <name rend="italics">Anna</name>!
whom three Realms obey,</l>
<l>Doth sometimes Counsel take —
and sometimes <emph rendition="#italic">Tea</emph>.</l>
<!-- in the header ... -->
<rendition xml:id="italic" scheme="css">font-style:italic</rendition>
Further information on the use of the rendition element is provided at 2.3.4 The Tagging Declaration.

The hi element is used to mark words or phrases which are highlighted in some way, but for which identification of the intended distinction is difficult, controversial, or impossible. It enables an encoder simply to record the fact of highlighting, possibly describing it by the use of a rend or rendition attribute, as discussed above, without however taking a position as to the function of the highlighting. This may also be useful if the text is to be processed in two stages: representing simply typographic distinctions during a first pass, and then replacing the hi elements with more specific elements in a second pass.

Some simple examples:
<hi rend="gothic">And this Indenture further witnesseth</hi>
that the said <hi rend="italic">Walter Shandy</hi>, merchant,
in consideration of the said intended marriage ...
In this example, the first highlighted phrase uses black letter or gothic print to mimic the appearance of a legal document, and the second uses italic to mark Walter Shandy as a name. In a second pass, the elements head or label might be appropriate for the first use, and the element name for the second.
The heaviest rain, and snow, and hail, and sleet, could
boast of the advantage over him in only one respect. They
often <hi rend="quoted">came down</hi> handsomely, and
Scrooge never did.
In this example, the phrase came down uses inverted commas to indicate a play on words.10 In a second pass, the element soCalled might be preferred.
3.3.2.3 Other Linguistically Distinct Material

For some kinds of analysis, it may be desirable to encode the linguistic distinctiveness of words and phrases with more delicacy than is allowed by the foreign element. The distinct element is provided for this purpose. Its attributes allow for additional information characterizing the nature of the linguistic distinction to be made in two distinct ways: the type attribute simply assigns a user-defined code of some kind to the word or phrase which assigns it to some register, sub-language, etc. No recommendations as to the set of values for this attribute are provided at this time, as little consensus exists in the field.

Alternatively, the remaining three attributes may be used in combination to place a word or phrase on a three-dimensional scale sometimes used in descriptive linguistics, as for example in Mattheier et al, 1988. The time attribute places a word diachronically, for example as archaic, old-fashioned, contemporary, futuristic, etc.; the space attribute places a word diatopically, that is, with respect to a geographical classification, for example as national, regional, international, etc.; the social attribute places a word diastatically, that is, with respect to a social classification, for example as technical, polite, impolite, restricted, etc. Again, no recommendations are made for the values of these attributes at this time; the encoder should provide a description of the scheme used in the appropriate section of the header (see section 2.3 The Encoding Description).

Examples:
Next morning a boy in that dormitory confided to his
bosom friend, a <distinct type="psSlang">fag</distinct> of
Macrea's, that there was trouble in their midst which
King <distinct type="archaic">would fain</distinct> keep
secret.
Next morning a boy in that dormitory confided to his
bosom friend, a
<distinct time="1900" space="GB" social="publicschool">fag</distinct>
of Macrea's, that there was trouble in their midst which
King <distinct time="archaic">would fain</distinct> keep
secret.
Where more complex (or more rigorous) interpretive analyses of the associations of a word are required, the more detailed and general mechanisms described in chapter 18 Feature Structures should be preferred to these simple characterizations. It may also be preferable to record the kinds of analysis suggested here by means of the simple annotation element note described in section 3.8 Notes, Annotation, and Indexing, or the span element described in section 17.3 Spans and Interpretations.

3.3.3 Quotation

One form of presentational variation found particularly frequently in written and printed texts is the use of quotation marks. As with the typographic variations discussed in the preceding section, it is generally helpful to separate the encoding of the underlying textual feature (for example, a quotation or a piece of direct speech) from the encoding of its rendering (for example, the use of a particular style of quotation marks).

This section discusses the following elements, all of which are often rendered by the use of quotation marks:
  • q (separated from the surrounding text with quotation marks) contains material which is marked as (ostensibly) being somehow different than the surrounding text, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used.
  • said (speech or thought) indicates passages thought or spoken aloud, whether explicitly indicated in the source or not, whether directly or indirectly reported, whether by real people or fictional characters.
    direct may be used to indicate whether the quoted matter is regarded as direct or indirect speech.
    aloud may be used to indicate whether the quoted matter is regarded as having been vocalized or signed.
  • quote (quotation) contains a phrase or passage attributed by the narrator or author to some agency external to the text.
  • cit (cited quotation) contains a quotation from some other document, together with a bibliographic reference to its source. In a dictionary it may contain an example text with at least one occurrence of the word form, used in the sense being described, or a translation of the headword, or an example.
  • mentioned marks words or phrases mentioned, not used.
  • soCalled contains a word or phrase for which the author or narrator indicates a disclaiming of responsibility, for example by the use of scare quotes or italics.
The elements mentioned and soCalled are members of the class model.emphLike; the q and said elements are members of the class model.qLike in their own right, while cit and quote are members of model.quoteLike, a subclass of model.qLike. The model.qLike class is itself a subclass of model.inter; hence q, said, cit, and quote are permitted both within and between paragraph-level elements.

The most common and important use of quotation marks is, of course, to mark quotation, by which we mean simply any part of the text attributed by the author or narrator to some agency other than the narrative voice. The q element may be used if no further distinction beyond this is judged necessary. If however it is felt necessary to distinguish passages which are in some sense external to the work from passages of direct speech or thought, a more precise element may be chosen from the list above. Typical examples include passages cited from other works, for which the element quote may be used, and words or phrases spoken or thought by people or characters within the current work, for which the element said may be used. The soCalled element is used for cases where the author or narrator distances him or herself from the words in question without however attributing them to any other voice in particular. The mentioned element is appropriate for a case where a word or phrase is being discussed in the body of a text rather than forming part of the text directly.

As noted above, if the distinction among these various reasons why a passage is offset from surrounding text cannot be made reliably, or is not of interest, then all quoted matter may simply be marked using the q element.
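
The distinctions drawn above can be sketched on a few invented sentences. The enclosing div is assumed only to give the fragment a single root, and the content of bibl is a placeholder rather than a real reference.

```xml
<div>
 <!-- attributed to an agency outside the text -->
 <p>As the proverb has it, <quote>a stitch in time saves nine</quote>.</p>
 <!-- spoken by a character within the work -->
 <p><said>I shall mend it tonight,</said> Mary replied.</p>
 <!-- the narrator distances himself or herself from the word -->
 <p>The <soCalled>repair</soCalled> lasted one day.</p>
 <!-- the word is discussed, not used -->
 <p>The word <mentioned>stitch</mentioned> has six letters.</p>
 <!-- a quotation paired with its source -->
 <cit>
  <quote>A stitch in time saves nine.</quote>
  <bibl>Placeholder citation, for illustration only</bibl>
 </cit>
</div>
```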

Quotation may be indicated in a printed source by changes in type face, by special punctuation marks (single or double or angled quotes, dashes, etc.) and by layout (indented paragraphs, etc.). If these characteristics are of interest, one or other of the global rend or rendition attributes discussed in section 1.3.1.1 Global Attributes may be used to record them.

Quotation marks themselves may, like other punctuation marks, be felt for some purposes to be worth retaining within a text, quite independently of their description by the rend attribute. This should generally be done using the appropriate Unicode character, or, if this is not possible, a numeric character reference (see Character References).
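
A sketch of the two ways of retaining the marks, using an invented sentence. The references shown are the standard code points for the left and right single quotation marks (U+2018 and U+2019); the enclosing div is assumed only to give the fragment a single root.

```xml
<div>
 <!-- marks entered directly as Unicode characters -->
 <p><q>‘Who goes there?’</q> cried the guard.</p>
 <!-- the same marks entered as numeric character references -->
 <p><q>&#x2018;Who goes there?&#x2019;</q> cried the guard.</p>
</div>
```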

Alternatively, the encoder may suppress all quotation marks, possibly recording their form using some appropriate set of conventions in the rend attribute. Some examples are shown below:
<said rend="PRE lsquo POST rsquo">Who-e debel
you?</said> — he at last said —
<said rend="PRE lsquo POST rsquo">you no speak-e,
damme, I kill-e.</said> And so saying,
the lighted tomahawk began flourishing
about me in the dark.
Adolphe se tourna vers lui :
<said>— Alors, Albert, quoi de neuf?</said>
<said>— Pas grand-chose.</said>
<said>— Il fait beau,</said> dit Robert.
Adolphe se tourna vers lui :
<said rend="PRE mdash">Alors,
Albert, quoi de neuf ?</said>
<said rend="PRE mdash">Pas grand-chose.</said>
<said rend="PRE mdash">Il fait beau,</said>
dit Robert.
As members of the att.ascribed class, elements said and q share the following attribute:
  • att.ascribed provides attributes for elements representing speech or action that can be ascribed to a specific individual.
    who indicates the person, or group of people, to whom the element content is ascribed.
This may be used to make explicit who is speaking:
Adolphe se tourna vers lui :
<said who="#Adolphe">— Alors, Albert,
quoi de neuf?</said>
<said who="#Albert">— Pas grand-chose.</said>
<said who="#Robert">— Il fait beau,</said>
dit Robert.

<!-- .... -->
<list type="speakers">
 <item xml:id="Adolphe"/>
 <item xml:id="Albert"/>
 <item xml:id="Robert"/>
</list>
The who attribute may be supplied whether or not an indication of the speaker is given explicitly in the text. Its value may (as above) be based on a normalized form of the speaker's name, but its role is to act as a pointer to a location elsewhere in the text where data about each speaker may be supplied. Such information is most appropriately placed within the participant description component of the TEI Header, as further discussed in 15.2.2 The Participant Description, but for simple cases like the above, a simple list of speakers located in the front or back matter of the text may suffice.
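The pointer role of the who attribute can be illustrated with a short script. The following is only a minimal sketch in Python, not part of the Guidelines: it assumes an un-namespaced document, a single local pointer (of the form #id) in each who value, and a speaker list whose items contain the speaker's name; the function name speakers_by_utterance is hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predeclared, so ElementTree reports xml:id under this name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Illustrative sample modelled on the example above (no TEI namespace, for brevity).
SAMPLE = """<text>
  <p>Adolphe se tourna vers lui :
    <said who="#Adolphe">Alors, Albert, quoi de neuf?</said>
    <said who="#Albert">Pas grand-chose.</said>
    <said who="#Robert">Il fait beau,</said> dit Robert.</p>
  <list type="speakers">
    <item xml:id="Adolphe">Adolphe</item>
    <item xml:id="Albert">Albert</item>
    <item xml:id="Robert">Robert</item>
  </list>
</text>"""


def speakers_by_utterance(root):
    """Pair each said element's text with the content of the element its who value points to."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    pairs = []
    for said in root.iter("said"):
        pointer = said.get("who", "")
        # Assumption: one local pointer per attribute value.
        key = pointer[1:] if pointer.startswith("#") else pointer
        target = by_id.get(key)
        name = target.text if target is not None else None
        words = " ".join("".join(said.itertext()).split())
        pairs.append((name, words))
    return pairs


if __name__ == "__main__":
    for name, words in speakers_by_utterance(ET.fromstring(SAMPLE)):
        print(f"{name}: {words}")
```

A pointer that matches no xml:id yields None for the speaker, which makes dangling references easy to detect.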
It may also be useful to distinguish representations of speech from representations of thought, in modern printed texts often indicated by a change of typeface. The aloud attribute is provided for this purpose, as in this example:
<said aloud="true">Oh yes,</said> said Henry,
<said aloud="false">I mean
Gordon Macrae, for example…</said>
<said aloud="false">Jungian
Analyst with Winebox! That's what you called him, you callous bastard,
didn't you? Eh? Eh?</said>
Quoted matter may be embedded within quoted matter, as when one speaker reports the speech of another:
<said who="#Wilson">Spaulding, he came down into the office just this day
eight weeks with this very paper in his hand, and he says:—
<said who="#WilsonSpaulding">I wish to the Lord, Mr. Wilson, that I was a
   red-headed man.</said>
</said>
<!-- ... -->
<list type="speakers">
 <item xml:id="Wilson">Wilson</item>
 <item xml:id="WilsonSpaulding">Spaulding reported by Wilson</item>
<!-- ...-->
</list>
Direct speech nested in this way is treated in the same way as elsewhere: a change of rendition may occur, but the same element should be used. An encoder may however choose to distinguish between direct speech which contains quotations from extra-textual matter and direct speech itself, as in the following example:
<p>
 <said>The Lord! The Lord! It is Sakya Muni himself,</said> the lama half
sobbed; and under his breath began the wonderful Buddhist
invocation:-<said>
  <quote>
   <l>To Him the Way — the Law — Apart —</l>
   <l>Whom Maya held beneath her heart</l>
   <l>Ananda's Lord — the Bodhisat</l>
  </quote>
   And He is here! The Most Excellent Law is here also. My
   pilgrimage is well begun. And what work! What work!</said>
</p>
Quotations from other works are often accompanied by a reference to their source. The cit element may be used to group together the quotation and its associated bibliographic reference, which should be encoded using the elements for bibliographic references discussed in section 3.11 Bibliographic Citations and References, as in the following example.
<div xml:id="mm01" type="chapter">
 <head>Chapter 1</head>
 <epigraph>
  <cit>
   <quote>
    <l>Since I can do no good because a woman</l>
    <l>Reach constantly at something that is near it.</l>
   </quote>
   <bibl>
    <title>The Maid's Tragedy</title>
    <author>Beaumont and Fletcher</author>
   </bibl>
  </cit>
 </epigraph>
 <p>Miss Brooke had that kind of beauty which seems to be thrown into
   relief by poor dress...</p>
</div>
Like other bibliographic references, the citation attached to a quotation may be represented simply by a pointer, as in this example:
Lexicography has shown little sign of being affected by the
work of followers of J.R. Firth, probably best summarized
in his slogan, <cit>
 <quote>You shall know a word by the company it keeps.</quote>
 <ref>(Firth, 1957)</ref>
</cit>
Unlike most of the other elements discussed in this chapter, direct speech and quotations may frequently contain other high-level elements such as paragraphs or verse lines, as well as being themselves contained by such elements. Three possible solutions exist for this well-known structural problem:
  • the quotation is broken into segments, each of which is entirely contained within a paragraph
  • the quotation is marked up using stand-off markup
  • the quotation boundaries are represented by empty segment boundary delimiter elements
For further discussion and several examples, see chapter 20 Non-hierarchical Structures.
Finally, in this section, the element soCalled is provided for all cases in which quotation marks are used to distance the quoted text from the narrator or speaker. Common examples include the ‘scare’ quotes often found in newspaper headlines and advertising copy, where the effect is to cast doubts on the veracity of an assertion:
<head>PM dodges <soCalled>election threat</soCalled> in interview</head>
The same element should be used to mark a variety of special ironic usages. Some further examples follow:
He hated <soCalled>good</soCalled> books.
<soCalled>Croissants</soCalled> indeed! toast not good enough for you?
Although Chomsky's decision that all NL
sentences are finite objects was never justified by arguments from
the attested properties of NLs, it did have a certain
<soCalled>social</soCalled> justification. It was commonly assumed in
works on logic until fairly recently that the notion
<mentioned>language</mentioned> is necessarily restricted to finite
strings.

3.3.4 Terms, Glosses, Equivalents, and Descriptions

This section describes a set of textual elements which are used to provide a gloss, alternate identification, or description of something.

Technical terms are often italicized or emboldened upon first mention in printed texts; an explanation or gloss is sometimes given in quotation marks. Linguistic analyses conventionally cite words in languages under discussion in italics, providing a gloss immediately following marked with single quotation marks. Other texts in which individual words or phrases are mentioned (for example, as examples) rather than used may mark them either with italics or with quotation marks, and will gloss them less regularly.
  • term contains a single-word, multi-word, or symbolic designation which is regarded as a technical term.
  • gloss identifies a phrase or word used to provide a gloss or definition for some other word or phrase.
These elements are also members of the class model.emphLike.

A term may appear with or without a gloss, as may a mentioned element. Where the gloss is present, it may be linked to the term it is glossing by means of its target attribute. To establish such a link, the encoder should give an xml:id value to the term or mentioned element and provide that id as the value of the target attribute on the gloss element. The following examples demonstrate this facility: for more discussion of this and other kinds of linkage within TEI documents, see chapter 16 Linking, Segmentation, and Alignment.

Examples:
We may define <term xml:id="TDPv" rend="sc">discoursal point of view</term>
as
<gloss target="#TDPv">the relationship, expressed through discourse
structure, between the implied author or some other addresser,
and the fiction.</gloss>
<gloss rend="unmarked" target="#PRSR">A computational device that infers
structure from grammatical strings of words</gloss> is known as a
<term xml:id="PRSR">parser</term>, and much of the history of NLP over the
last 20 years has been occupied with the design of parsers.
There is thus a striking accentual difference between a verbal
form like <mentioned xml:id="cw234" xml:lang="grc">eluthemen</mentioned>
<gloss target="#cw234">we were released,</gloss> accented on the
second syllable of the word, and its participial derivative

<mentioned xml:id="cw235" xml:lang="grc">lutheis</mentioned>
<gloss target="#cw235">released,</gloss> accented on the last.
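The link between a gloss and the term it glosses can be followed mechanically. The sketch below, in Python, is offered only as an illustration under stated assumptions: the document is un-namespaced, each target value is a single local pointer, and the function names normalized_text and glossary are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The parser example from above, wrapped in a p element for the sketch.
SAMPLE = """<p><gloss rend="unmarked" target="#PRSR">A computational device that infers
structure from grammatical strings of words</gloss> is known as a
<term xml:id="PRSR">parser</term>, and much of the history of NLP over the
last 20 years has been occupied with the design of parsers.</p>"""


def normalized_text(element):
    """Return the element's text content with runs of whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def glossary(root):
    """Map each glossed term or mentioned element to the text of its gloss."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    entries = {}
    for gloss in root.iter("gloss"):
        target = gloss.get("target", "")
        if not target.startswith("#"):
            continue  # unlinked gloss, or a pointer this sketch does not handle
        glossed = by_id.get(target[1:])
        if glossed is not None and glossed.tag in ("term", "mentioned"):
            entries[normalized_text(glossed)] = normalized_text(gloss)
    return entries


if __name__ == "__main__":
    for term, definition in glossary(ET.fromstring(SAMPLE)).items():
        print(f"{term}: {definition}")
```

Because the link is carried by the target attribute, the relative order of gloss and term in the text does not matter.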
Another group of elements is used to supply different kinds of names for objects described by the TEI. Examples of this are documentation of elements, attributes, classes (and also attribute values where appropriate), and description of glyphs.
  • altIdent (alternate identifier) supplies the recommended XML name for an element, class, attribute, etc. in some language.
  • desc (description) contains a brief description of the object documented by its parent element, including its intended usage, purpose, or application where this is appropriate.
  • equiv/ (equivalent) specifies a component which is considered equivalent to the parent element, either by co-reference, or by external link.
    uri (uniform resource identifier) references the underlying concept of which the parent is a representation by means of some external identifier
    filter references an external script which contains a method to transform instances of this element to canonical TEI
    name names the underlying concept of which the parent is a representation
Along with the gloss element mentioned above, these elements constitute the model.glossLike class.
The gloss element may be used to provide a brief explanation for the name of the object if this is not self-explanatory. For example, the specification for the element ab used to mark arbitrary blocks of text begins as follows:
<elementSpec module="linking" ident="ab">
 <gloss>anonymous block</gloss>
<!--... -->
</elementSpec>
A gloss may also be supplied for an attribute name or an attribute value in similar circumstances:
<valList type="open">
 <valItem ident="susp">
  <gloss>suspension</gloss>
  <desc>the abbreviation provides the first letter(s)
     of the word or phrase, omitting the remainder.</desc>
 </valItem>
 <valItem ident="contr">
  <gloss>contraction</gloss>
  <desc>the abbreviation omits some letter(s) in the middle.</desc>
 </valItem>
<!--...-->
</valList>
Note that this is quite distinct from the use of the desc element, which contains a full description of the intended semantics for the object.
The equiv element is used to document equivalencies between the concept represented by this object and the same concept as described in other schemes or ontologies. The uri attribute is used to supply a pointer to some location where such external concepts are defined. For example, to indicate that the TEI death element corresponds to the concept defined by the CIDOC CRM category E69, the declaration for the former might begin as follows:
<elementSpec module="namesdates" ident="death">
 <equiv name="E69" uri="http://cidoc.ics.forth.gr/"/>
<!--... -->
</elementSpec>
The equiv element may also be used to map newly-defined elements onto existing constructs in the TEI, using the filter and name attributes to point to an implementation of the mapping. This is useful when a TEI customization (see 23.2 Personalization and Customization) defines ‘shortcuts’ for convenience of data entry or markup readability. For example, suppose that in some TEI customization an element <bo> has been defined which is conceptually equivalent to the standard markup construct <hi rend='bold'>. The following declarations would additionally indicate that instances of the <bo> element can be converted to canonical TEI by obtaining a filter from the URI specified, and running the procedure with the name bold. The mimeType attribute specifies the language (in this case XSL) in which the filter is written:
<elementSpec ident="bo" ns="http://www.example.org/ns/notTEI">
 <equiv
   filter="http://www.example.com/equiv-filter.xsl"
   mimeType="text/xsl"
   name="bold"/>

 <gloss>bold</gloss>
 <desc>contains a sequence of characters rendered in a bold face.</desc>
<!-- ... -->
</elementSpec>
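To make the intended effect of such a filter concrete, the following Python sketch shows the kind of transformation a procedure named bold might perform. It is a hand-written stand-in, not the XSL filter referenced in the declaration above (whose contents are not reproduced here), and for brevity it assumes that both the shortcut element and the result are un-namespaced.

```python
import xml.etree.ElementTree as ET


def bold(root):
    """Rewrite every bo shortcut element as the equivalent hi element with rend="bold"."""
    # Collect the matches first so that renaming does not interfere with iteration.
    for element in list(root.iter("bo")):
        element.tag = "hi"
        element.set("rend", "bold")
    return root


if __name__ == "__main__":
    paragraph = ET.fromstring("<p>a <bo>very</bo> simple case</p>")
    print(ET.tostring(bold(paragraph), encoding="unicode"))
```

Content and following text of each converted element are left untouched; only the element name and its rend attribute change.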
The altIdent element is used to provide an alternative name for an object, for example using a different natural language. Thus, the following might be used to indicate that the abbr element should be identified using the German word Abkürzung:
<elementSpec ident="abbr" mode="change">
 <altIdent xml:lang="de">Abkürzung</altIdent>
<!--...-->
</elementSpec>
In the same way, the following specification for the graphic element indicates that the attribute url may also be referred to using the alternate identifier href:
<elementSpec ident="graphic" mode="change">
 <attList>
  <attDef mode="change" ident="url">
   <altIdent>href</altIdent>
  </attDef>
<!-- .... -->
 </attList>
</elementSpec>

By default, the altIdent of a component is identical to the value of its ident attribute.
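The default described here amounts to a simple fallback rule, which may be sketched in Python as follows. This is an illustration only: the schemaSpec wrapper and the second specification are supplied for the sake of the sketch, the function names effective_name and name_table are hypothetical, and a real processor would also need to take account of xml:lang and namespaces.

```python
import xml.etree.ElementTree as ET

# Illustrative specifications; only the first supplies an altIdent.
SPECS = """<schemaSpec ident="sketch">
 <elementSpec ident="abbr" mode="change">
  <altIdent xml:lang="de">Abkürzung</altIdent>
 </elementSpec>
 <elementSpec ident="foreign" mode="change"/>
</schemaSpec>"""


def effective_name(spec):
    """Return the altIdent of a specification, falling back to its ident attribute."""
    alt = spec.find("altIdent")
    if alt is not None and alt.text and alt.text.strip():
        return alt.text.strip()
    return spec.get("ident")


def name_table(root):
    """Map the ident of each elementSpec to the name under which it would be known."""
    return {spec.get("ident"): effective_name(spec) for spec in root.iter("elementSpec")}


if __name__ == "__main__":
    print(name_table(ET.fromstring(SPECS)))
```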

The contents of the desc element provide a brief characterization of the intended function of the object being documented in a form that permits its quotation out of context, as in the following example:
<elementSpec module="core" ident="foreign">
<!--... -->
 <desc>identifies a word or phrase as belonging to some language other
   than that of the surrounding text. </desc>
<!--... -->
</elementSpec>
By convention, a desc element begins with a verb such as contains, indicates, specifies, etc. and contains a single clause.

3.3.5 Some Further Examples

As a simple example of the elements discussed here, consider the following sentence:

On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité, the romances of Chrétien de Troyes, and the German adaptations of these works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram von Eschenbach.

A first approximation to the encoding of this sentence might be simply to record the fact that the phrases printed above in italics are highlighted, as follows:
On the one hand the <hi rend="italic">Nibelungenlied</hi> is
associated with the new rise of romance of twelfth-century France,
the <hi xml:lang="fr" rend="italic">romans d'antiquité</hi>,
the romances of Chrétien de Troyes, ...
This encoding would, however, lose the important distinction between an italicized title and an italicized foreign phrase. Many other phrases might also be italicized in the text, and a retrieval program seeking to identify foreign terms (for example) would not be able to produce reliable results by simply looking for italicized words. Where economic and intellectual constraints permit, therefore, it would be preferable to encode both the function of the highlighted phrases and their appearance, as follows:
On the one hand the <title rend="italic">Nibelungenlied</title>
is associated with the new rise of romance of twelfth-century France,
the <foreign rend="italic">romans d'antiquité</foreign>, the
romances of Chrétien de Troyes, ...
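The practical difference between the two encodings can be seen by running a simple retrieval over each. The Python sketch below is illustrative only: it wraps abridged versions of the two encodings in a p element, ignores namespaces, and uses the hypothetical function names italicized and foreign_phrases.

```python
import xml.etree.ElementTree as ET

# Abridged versions of the two encodings shown above.
BY_APPEARANCE = """<p>On the one hand the <hi rend="italic">Nibelungenlied</hi> is
associated with the new rise of romance of twelfth-century France,
the <hi xml:lang="fr" rend="italic">romans d'antiquité</hi>, the
romances of Chrétien de Troyes</p>"""

BY_FUNCTION = """<p>On the one hand the <title rend="italic">Nibelungenlied</title>
is associated with the new rise of romance of twelfth-century France,
the <foreign rend="italic">romans d'antiquité</foreign>, the
romances of Chrétien de Troyes</p>"""


def italicized(root):
    """Return the text of every element recorded as rendered in italic."""
    return ["".join(el.itertext()) for el in root.iter() if el.get("rend") == "italic"]


def foreign_phrases(root):
    """Return the text of every element explicitly marked as a foreign phrase."""
    return ["".join(el.itertext()) for el in root.iter("foreign")]


if __name__ == "__main__":
    # Searching by appearance returns the title as well as the foreign phrase.
    print(italicized(ET.fromstring(BY_APPEARANCE)))
    # Searching by function returns the foreign phrase alone.
    print(foreign_phrases(ET.fromstring(BY_FUNCTION)))
```

Since the second encoding retains the rend attribute, a search by appearance remains possible there as well.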
In this example, the decision as to which textual features are distinguished by the highlighting is relatively uncontroversial. As a less straightforward example, consider the use of italic font in the following passage:

A pretty common case, I believe; in all vehement debatings. She says I am too witty; Anglicé, too pert; I, that she is too wise; that is to say, being likewise put into English, not so young as she has been: in short, she is grown so much into a mother, that she had forgotten she ever was a daughter. ...

Clearly, the word vehement is not italicized for the same reason as the phrase not so young as she has been; the former is emphasized, while the latter is proverbial. It also provides an ironic gloss for the words too wise, in the same way as too pert glosses too witty. The glossed phrases are not, however, technical terms or cited words, but quoted phrases, as if the writer were putting words into her own and her mother's mouths. Finally, the words mother and daughter are apparently italicized simply to oppose them in the sentence; certainly they do not fit into any of the categories so far proposed as reasons for italicizing. Note also that the word Anglicé is not italicized although it is not generally considered an English word.

The following sample encoding for the above passage attempts to take into account all the above points:
A pretty common case, I believe; in all <emph>vehement</emph>
debatings. She says I am <q rend="italic">too witty</q>;
<foreign xml:lang="la" rend="roman">Anglicé</foreign>,
<gloss rend="italic">too pert</gloss>; I, that she is
<q rend="italic">too wise</q>; that is to say, being likewise
put into English, <gloss rend="italic">not so young as she has
been</gloss>: in short, she is grown so much into a
<hi rend="italic">mother</hi>, that she had forgotten she ever
was a <hi rend="italic">daughter</hi>.

3.4 Simple Editorial Changes

As in editing a printed text, so in encoding a text in electronic form, it may be necessary to accommodate editorial comment on the text and to render account of any changes made to the text in preparing it. The tags described in this section may be used to record such editorial interventions, whether made by the encoder, by the editor of a printed edition used as a copy text, by earlier editors, or by the copyists of manuscripts.

The tags described here handle most common types of editorial intervention and stereotyped comment; where less structured commentary of other types is to be included, it should be marked using the note element described in section 3.8 Notes, Annotation, and Indexing. Systematic interpretive annotation is also possible using the various methods described in chapter 16 Linking, Segmentation, and Alignment. The examples given here illustrate only simple cases of editorial intervention; in particular, they permit economical encoding of a simple set of alternative readings of a short span of text. To encode multiple views of large or heterogeneous spans of text, the mechanisms described in chapter 16 Linking, Segmentation, and Alignment should be used. To encode multiple witnesses of a particular text, a similar mechanism designed specifically for critical editions is described in chapter 12 Critical Apparatus.

For most of the elements discussed here, some encoders may wish to indicate both a responsibility, that is, a code indicating the person or agency responsible for making the editorial intervention in question, and also an indication of the degree of certainty which the encoder wishes to associate with the intervention. Because these requirements are common to many of the elements discussed in this section, they are provided by an attribute class, called att.editLike. All members of this class carry the following optional attributes:
  • att.editLike provides attributes describing the nature of an encoded scholarly intervention or interpretation of any kind.
    cert (certainty) signifies the degree of certainty associated with the intervention or interpretation.
    resp (responsible party) indicates the agency responsible for the intervention or interpretation, for example an editor or transcriber.
    evidence indicates the nature of the evidence supporting the reliability or accuracy of the intervention or interpretation.
Many of the elements discussed here can be used in two ways. Their primary purpose is to indicate that the text encoded as the element's content represents an editorial intervention (or non-intervention) of a specific kind, indicated by the element itself. However, pairs or other meaningful groupings of such elements can also be supplied, wrapped within a special purpose choice element:
  • choice groups a number of alternative encodings for the same point in a text.
This element enables the encoder to represent for example a text in its ‘original’ uncorrected and unaltered form, alongside the same text in one or more ‘edited’ forms. This usage permits software to switch automatically between one ‘view’ of a text and another, so that (for example) a stylesheet may be set to display either the text in its original form or after the application of editorial interventions of particular kinds.

Elements which can be combined in this way constitute the model.choicePart class. The default members of this class are sic, corr, reg, orig, unclear, add, and del; their functions and usage are described further below.

Three categories of editorial intervention are discussed in this section:
  • indication or correction of apparent errors
  • indication or regularization of variant, irregular, non-standard, or eccentric forms
  • editorial additions, suppressions, and omissions

A more extended treatment of the use of these tags in transcriptional and editorial work is given in chapter 11 Representation of Primary Sources.

3.4.1 Apparent Errors

When the copy text is manifestly faulty, an encoder or transcriber may elect simply to correct it without comment, although for scholarly purposes it will often be more generally useful to record both the correction and the original state of the text. The elements described here enable three approaches: marking a correction, marking an apparent error which has been left uncorrected, and recording both. The last can be done in such a way as to make it easy for software to present either the original or the correction.
  • sic (Latin for thus or so) contains text reproduced although apparently incorrect or inaccurate.
  • corr (correction) contains the correct form of a passage apparently erroneous in the copy text.
The following examples show alternative treatment of the same material. The copy text reads:

Another property of computer-assisted historical research is that data modelling must permit any one textual feature or part of a textual feature to be a part of more than one information model and to allow the researcher to draw on several such models simultaneously, for example, to select from a machine-readable text those marginal comments which indicate that the date's mentioned in the main body of the text are incorrect.

An encoder may choose to correct the typographic error, either silently or with an indication that a correction has been made, as follows:
… marginal comments which indicate that the <corr>dates</corr>
mentioned in the main body of the text are incorrect.
Alternatively, the encoder may simply record the typographic error without correcting it, either without comment or with a sic element to indicate the error is not a transcription error in the encoding:
… marginal comments which indicate that the <sic>date's</sic>
mentioned in the main body of the text are incorrect.
If the encoder elects both to record the original source text and to provide a correction for the sake of word-search and other programs, both sic and corr are used, wrapped in a choice:
… marginal comments which indicate that the
<choice>
 <corr>dates</corr>
 <sic>date's</sic>
</choice> mentioned in the main body of the text are
incorrect.
The sic and corr elements can appear in either order.
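The way in which software might switch between the original and the corrected reading can be sketched briefly. The following Python fragment is a minimal illustration, not a definitive implementation: it assumes an un-namespaced document, treats sic and orig as the ‘original’ view and corr and reg as the ‘edited’ view, and falls back to the first alternative where a choice contains neither; the names PREFERRED, render, and view_text are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p>marginal comments which indicate that the
<choice>
 <corr>dates</corr>
 <sic>date's</sic>
</choice> mentioned in the main body of the text are
incorrect.</p>"""

# Which children of choice are preferred in each view (an assumption of this sketch).
PREFERRED = {"original": ("sic", "orig"), "edited": ("corr", "reg")}


def render(element, view):
    """Return the text of element, taking one alternative from each choice."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "choice":
            picked = next((alt for alt in child if alt.tag in PREFERRED[view]), None)
            if picked is None and len(child):
                picked = child[0]  # fall back to the first alternative
            if picked is not None:
                parts.append(render(picked, view))
        else:
            parts.append(render(child, view))
        parts.append(child.tail or "")
    return "".join(parts)


def view_text(xml, view):
    """Parse xml and return the chosen view with whitespace collapsed."""
    return " ".join(render(ET.fromstring(xml), view).split())


if __name__ == "__main__":
    print(view_text(SAMPLE, "original"))
    print(view_text(SAMPLE, "edited"))
```

Because each alternative is selected by element name rather than by position, the result is the same whichever of sic and corr comes first.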
If it is desired to indicate the person or edition responsible for the emendation, this might be done as follows:
… marginal comments which indicate that the
<choice>
 <corr resp="#msm">dates</corr>
 <sic>date's</sic>
</choice> mentioned in the main body of the text are
incorrect.

<!-- within the header for this document ... -->
<respStmt>
 <resp>editor</resp>
 <name xml:id="msm">C.M. Sperberg-McQueen</name>
</respStmt>
Here the resp attribute has been used to indicate responsibility for the correction. Its value (#msm) is an example of the pointer values discussed in section 3.6 Simple Links and Cross-References; in this case, it points to a name element within the TEI Header, but any element might be indicated in this way, including for example a person element (if the module described in 13 Names, Dates, People, and Places has been included), or one of the bibliographic elements described in 3.11 Bibliographic Citations and References, if the correction has been taken from some other source. The resp attribute is available for all elements which are part of the att.editLike class. The same class makes available a cert attribute, which may be used to indicate the degree of editorial confidence in a particular correction, as in the following example: