Text Encoding Initiative
The XML Version of the TEI Guidelines
6 Elements Available in All TEI Documents
This chapter describes elements which may appear in any kind of text and the tags used to mark them in all TEI documents. Most of these elements are freely floating phrases, which can appear at any point within the textual structure, although they must generally be contained by a higher-level element of some kind (such as a paragraph). A few of the elements described in this chapter (for example, bibliographic citations and lists) have a comparatively well-defined internal structure, but most of them have no consistent inner structure of their own. In the general case, they contain only a few words, and are often identifiable in a conventionally printed text by the use of typographic conventions such as shifts of font, use of quotation or other punctuation marks, or other changes in layout. This chapter begins by describing the <p> tag used to mark paragraphs, which serve as the fundamental formal unit for running text in many base tag sets, and are available in all. This is followed, in section 6.2 Treatment of Punctuation, by a discussion of some specific problems associated with the interpretation of conventional punctuation, and the methods proposed by the current Guidelines for resolving ambiguities therein. The next section (section 6.3 Highlighting and Quotation) describes a number of phrase-level elements commonly marked by typographic features (and thus well-represented in conventional markup languages). These include features commonly marked by font shifts (section 6.3.2 Emphasis, Foreign Words, and Unusual Language) and features commonly marked by quotation marks (section 6.3.3 Quotation) as well as such features as terms, cited words, and glosses (section 6.3.4 Terms, Glosses, and Cited Words). The next section (section 6.4 Names, Numbers, Dates, Abbreviations, and Addresses) describes several phrase-level and inter-level elements which, although often of interest for analysis or processing, are rarely explicitly identified in conventional printing. 
These include names (section 6.4.1 Referring Strings), numbers and measures (section 6.4.3 Numbers and Measures), dates and times (section 6.4.4 Dates and Times), abbreviations (section 6.4.5 Abbreviations and Their Expansions), and addresses (section 6.4.2 Addresses). Section 6.5 Simple Editorial Changes introduces some phrase-level elements which may be used to record simple editorial emendation or correction of the encoded text. The tags described here constitute a simple subset of the full mechanisms for encoding such information (described in full in chapter 18 Transcription of Primary Sources), which should be adequate to most commonly encountered situations. In the same way, the following section (section 6.6 Simple Links and Cross References) presents only a subset of the facilities available for the encoding of cross-references or text-linkage. The full story may be found in chapter 14 Linking, Segmentation, and Alignment; the tags presented here are intended to be usable for a wide variety of simple applications. Sections 6.7 Lists, and 6.8 Notes, Annotation, and Indexing, describe two kinds of quasi-structural elements, lists and notes, which may appear either within chunk-level elements such as paragraphs, or between them. Several kinds of lists are catered for, of an arbitrary complexity. The section on notes discusses both notes found in the source and simple mechanisms for adding annotations of an interpretive nature during the encoding; again, only a subset of the facilities described in full elsewhere (specifically, in chapter 15 Simple Analytic Mechanisms) is discussed. Next, section 6.9 Reference Systems, describes methods of encoding within a text the conventional system or systems used when making references to the text. 
Some reference systems have attained canonical authority and must be recorded to make the text useable in normal work; in other cases, a convenient reference system must be created by the creator or analyst of an electronic text.
Like lists and notes, the bibliographic citations discussed in section 6.10 Bibliographic Citations and References may be regarded as structural elements in their own right. A range of possibilities is presented for the encoding of bibliographic citations or references, which may be treated as simple phrases within a running text, or as highly-structured components suitable for inclusion in a bibliographic database.
Additional elements for the encoding of passages of verse or drama (whether prose or verse) are discussed in section 6.11 Passages of Verse or Drama.
The chapter concludes with a technical overview of the structure and organization of the tag set described here. This should be read in conjunction with chapter 3 Structure of the TEI Document Type Definition.
6.1 Paragraphs
The paragraph is the fundamental organizational unit for all prose texts, being the smallest regular unit into which prose can be divided. Prose can appear in all TEI texts, not simply in those using the prose base (section 8 Base Tag Set for Prose); the paragraph is therefore described here, as an element which can appear in any kind of text.
Paragraphs can contain any of the other elements described within this chapter, as well as some other elements which are specific to individual text types. We distinguish phrase-level elements, which must be entirely contained within a paragraph and cannot appear except within one, from chunks, which can appear between, but not within, paragraphs, and from inter-level elements, which can appear either within a single paragraph or between paragraphs. The class of phrases includes emphasized or quoted phrases, names, dates, etc.
The class of inter-level elements includes bibliographic citations, notes, lists, etc. The class of chunks includes the paragraph itself, and other elements which have similar structural properties, notably the <ab> (anonymous block) element (described in 14.3 Blocks, Segments and Anchors), which may be used as an alternative to the paragraph in some kinds of texts.
Because paragraphs may appear in different base or additional tag sets, their possible contents may differ in different kinds of documents. In particular, additional elements not listed in this chapter may appear in paragraphs in certain kinds of text. However, the elements described in this chapter are always available by default in all kinds of text.
The paragraph is marked using the <p> element:
If a consistent internal subdivision of paragraphs is desired, the <s> or <seg> (‘segment’) elements may be used, as discussed in chapters 14 Linking, Segmentation, and Alignment and 15 Simple Analytic Mechanisms respectively. More usually, however, paragraphs have no firm internal structure, but contain prose encoded as a mix of characters, entity references, phrases marked as described in the rest of this chapter, and embedded elements like lists, figures, or tables.
Since paragraphs are usually explicitly marked in Western texts, typically by indentation, the application of the <p> tag usually presents few problems.
In some cases, the body of a text may comprise but a single paragraph:
<body>
<p>I fully appreciate Gen. Pope's splendid achievements with their invaluable results; but you must know that Major Generalships in the Regular Army, are not as plenty as blackberries.</p>
</body>
This news story shows typically short journalistic paragraphs:
<head>SARAJEVO, Bosnia and Herzegovina, April 19</head>
<p>Serbs seized more territory in this struggling new country today as the United States Air Force ended a two-day airlift of humanitarian aid into the capital, Sarajevo.</p>
<p>International relief workers called on European Community nations to step up their humanitarian aid to the former Yugoslav republic, in conjunction with new American aid flights if necessary.</p>
<p>A special envoy from the European Community, Colin Doyle, harshly condemned the decision by Serbs to shell Sarajevo on Saturday night during a visit to the Bosnian capital by a senior American official, Deputy Assistant Secretary of State Ralph R. Johnson.</p>
<p>...</p>
The following extract from a Russian fairy tale demonstrates how other phrase level elements (in this case <q> elements representing direct speech; see section 6.3.3 Quotation) may be nested within, but not across, paragraphs:
<p>A fly built a castle, a tall and mighty castle. There came to the castle the Crawling Louse.
<q>Who, who's in the castle? Who, who's in your house?</q> said the Crawling Louse.
<q>I, I, the Languishing Fly. And who art thou?</q><q>I'm the Crawling Louse.</q>
</p>
<p>Then came to the castle the Leaping Flea.
<q>Who, who's in the castle?</q> said the Leaping Flea.
<q>I, I, the Languishing Fly, and I, the Crawling Louse. And who art thou?</q><q>I'm the Leaping Flea.</q>
</p>
<p>Then came to the castle the Mischievous Mosquito.
<q>Who, who's in the castle?</q> said the Mischievous Mosquito.
<q>I, I, the Languishing Fly, and I, the Crawling Louse, and I, the Leaping Flea. And who art thou?</q><q>I'm the Mischievous Mosquito.</q>
</p>
The <p> element is formally declared as follows:
<!-- 6.1: Paragraph-->
<!ELEMENT p %om.RO; %paraContent;>
<!ATTLIST p
%a.global;
TEIform CDATA 'p' >
<!-- end of 6.1-->
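The distinction drawn in this section between phrase-level, inter-level, and chunk-level elements can be sketched as follows. This is a hypothetical fragment: the wording is invented for illustration, and only elements named in this chapter are used.

```xml
<!-- Hypothetical fragment; the text is invented for illustration -->
<div>
  <!-- p is a chunk: it appears between, not within, other paragraphs -->
  <p>The survey was completed on <date>12 May 1890</date>, and its
    <emph>principal</emph> findings were:
    <!-- list is inter-level: here it sits inside a paragraph -->
    <list>
      <item>a revised boundary;</item>
      <item>two new rights of way.</item>
    </list>
  </p>
  <!-- the same inter-level element, this time between paragraphs -->
  <list>
    <item>Appendix A: maps</item>
    <item>Appendix B: correspondence</item>
  </list>
  <p>No further survey was commissioned.</p>
</div>
```

The phrase-level elements (date and emph) could not be moved out to stand between the two paragraphs; the list can appear in either position.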
6.2 Treatment of Punctuation
Punctuation marks cause problems for text markup because they may not be available in the character set used and because they are often ambiguous. In the former case entity names should be used to render the punctuation mark (see 4 Languages and Character Sets). In the latter case, ambiguous punctuation may be treated as described below.
Full stop (period) may mark (orthographic) sentence boundaries, abbreviations, decimal points, or serve as a visual aid in printing numbers. These usages can be distinguished by tagging S-units, abbreviations, and numbers, as described in sections 14.3 Blocks, Segments and Anchors, 6.4.5 Abbreviations and Their Expansions, and 6.4.3 Numbers and Measures. There are independent reasons for tagging these, whether or not they are marked by full stops. Alternatively, entity names like the following might be used to distinguish stops (and other characters) used for these purposes:
Question mark and exclamation mark typically mark the end of orthographic sentences, but may also be used as a mid-sentence comment by the author (‘!’ to express surprise or some other strong feeling, ‘?’ to query a word or expression or mark a sentence as dubious in linguistic discussion). These uses may be distinguished by marking S-units, in which case the mid-sentence uses of these punctuation marks may be left unmarked. Hyphens at line-end may or may not indicate permanent (`hard') hyphens in the word. Where the lineation of the machine-readable text differs from the original, the editor may either eliminate non-significant line-end hyphens or replace them by a reference to an appropriate character entity.72 Whichever method is adopted, it should be reported using the <hyphenation> element within the encoding declarations in the TEI header. See chapter 5 The TEI Header for discussion of the TEI header and encoding declarations. When creating a machine-readable text from scratch, it is best not to introduce hyphenation simply to make lines of a predefined length, since one cannot then easily tell whether the hyphens are soft or hard. When compounds or prefixed words are hyphenated in mid-sentence, it may be impossible to tell whether the hyphenation is due to formatting or to linguistic concerns. Dashes are best distinguished in form by using the entity names provided in the public entity set ISOpub, defined in ISO 8879: mdash, ndash, and dash (the `true' hyphen). Alternatively, in a standalone XML context, these entities may be represented as Unicode characters —, –, or ‐ respectively. Dashes are used for a variety of purposes: insertion, interruption, new speaker (in dialogue), list item. In the latter two cases it is preferable to mark the underlying feature using the elements <q> or <item>, on which see section 6.3.3 Quotation, and section 6.7 Lists, respectively. 
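As a minimal sketch of two of the recommendations above, the fragment below first records a hyphenation policy in the header and then shows dashes entered by entity reference and by numeric character reference. The policy wording and the sample sentences are invented for illustration, and the eol value shown is only one of the options described in the header chapter.

```xml
<!-- In the TEI header: state how line-end hyphens were handled.
     The policy text is hypothetical. -->
<encodingDesc>
  <editorialDecl>
    <hyphenation eol="some">
      <p>Line-end hyphens judged to be soft were removed; those judged
         to be hard were retained.</p>
    </hyphenation>
  </editorialDecl>
</encodingDesc>

<!-- In the text: entity references (assumes the ISOpub entity set
     is declared in the DTD) -->
<p>She paused&mdash;then went on. See pages 12&ndash;14.</p>

<!-- The same sentences with numeric character references, which need
     no entity declarations -->
<p>She paused&#x2014;then went on. See pages 12&#x2013;14.</p>
```

The numeric forms correspond to the Unicode characters mentioned above (U+2014 for mdash, U+2013 for ndash); which form to prefer depends on whether the document is processed with or without its DTD.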
Quotation marks should generally be replaced by the tags <q> or <quote>, especially as quotations are not always marked by quotation marks (notably long quotations) or may be marked in a variety of ways; see the discussion of quotation and related features in section 6.3.3 Quotation.
Apostrophes must be distinguished from single quote marks. This is best done by tagging quotations or other uses of quotation marks (see above). However, apostrophes have a variety of uses. In English they mark contractions, genitive forms, and (occasionally) plural forms. Full disambiguation of these uses belongs to the level of linguistic analysis and interpretation.
Parentheses and other marks of suspension such as dashes or ellipses are often used to signal information about the syntactic structure of a text fragment. Full disambiguation of their uses also belongs to the level of linguistic analysis and interpretation, and is therefore discussed in chapter 15 Simple Analytic Mechanisms.
Where punctuation marks are disambiguated by tagging the underlying feature they signal, it may be debated whether they should be excluded or left as part of the text. In the case of quotation marks, it may sometimes be more convenient to distinguish opening from closing marks simply by using the appropriate entity reference, rather than using the <q> element, with or without a rend attribute. The solution chosen will vary depending upon the feature and depending upon the purpose of the project.
6.3 Highlighting and Quotation
This section deals with a variety of textual features, all of which have in common that they are frequently realized in conventional printing practice by the use of such features as underlining, italic fonts, or quotation marks, collectively referred to here as highlighting.
After an initial discussion of this phenomenon and alternate approaches to encoding it, this section describes ways of encoding the following textual features, all of which are conventionally rendered using some kind of highlighting:
6.3.1 What Is Highlighting?
By ‘highlighting’ we mean the use of any combination of typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings.73 The purpose of highlighting is generally to draw the reader's attention to some feature or characteristic of the passage highlighted; this section describes the elements recommended by these Guidelines for the encoding of such textual features. In conventionally printed modern texts, highlighting is often employed to identify words or phrases which are regarded as being one or more of the following:
The textual functions signalled by highlighting may not be rendered consistently in different parts of a text or in different texts. (For example, a foreign word may appear in italics if the surrounding text is in roman, but in roman if the surrounding text is in italics.) For this reason, these Guidelines distinguish between the encoding of rendering itself and the encoding of the underlying feature expressed by it. Highlighting as such may be encoded by using the global rend attribute which can be specified for any element in the TEI scheme. This allows the encoder both to specify the function of a highlighted phrase or word, by selecting the appropriate element described here or elsewhere in the Guidelines, and to further describe the way in which it is highlighted, by means of the rend attribute. If the encoder wishes to offer no interpretation of the feature underlying the use of highlighting in the source text, then the <hi> element may be used, which indicates only that the text so tagged was highlighted in some way. The possible values carried by the rend attribute are not formally defined in this version of the Guidelines. Since the rend attribute may be used to document any peculiarity of the way a given segment of text was rendered in the original source text, it may need to express a very large range of typographic features, by no means restricted to type face, type size, etc. Where it is both appropriate and feasible, these Guidelines recommend that the textual feature marked by the highlighting should be encoded, rather than just the simple fact of the highlighting. This is for the following reasons:
In many, if not most, cases the underlying function of a highlighted phrase will be obvious and non-controversial, since the distinctions indicated by a change of highlighting correspond with distinctions discussed elsewhere in these Guidelines. It should be recognized, however, that cases do exist in which it is not economically feasible to mark the underlying function of highlighting (e.g. in the preparation of large text corpora), as well as cases in which it is not intellectually appropriate (as in the transcription of some older materials, or in the preparation of material for the study of typographic practice). In such cases, the <hi> element should be used, as further discussed below.
Elements which are sometimes realized by typographic distinction but which are not discussed in this section include <title> (discussed in section 6.10 Bibliographic Citations and References) and <name> (discussed in section 6.4.1 Referring Strings).
6.3.2 Emphasis, Foreign Words, and Unusual Language
This subsection discusses the following elements:
6.3.2.1 Foreign Words or Expressions
Words or phrases which are not in the main language of the text should be tagged as such, at least where the fact is indicated in the text. Where the word or phrase concerned is already distinguished from the rest of the text by virtue of its function (for example, because it is a name, a technical term, a quotation, a mentioned word, etc.) then the global lang attribute should be used to specify additionally that its language distinguishes it from the surrounding text. Any element in the TEI scheme may take a lang attribute, which specifies both the writing system and the language used by its content (see section 4.3 Code shifting for discussion of this attribute). Where there is no other applicable element, the tag <foreign> may be used to provide a peg onto which the lang attribute may be attached.
<q>Aren't you confusing <foreign lang="la">post hoc</foreign> with <foreign lang="la">propter hoc</foreign>?</q> said the Bee Master. <q>Wax-moth only succeed when weak bees let them in.</q>
The <foreign> tag should not be used to encode foreign words which are mentioned or glossed within the text: for these use the appropriate element from section 6.3.4 Terms, Glosses, and Cited Words below. Compare the following example sentences:
John eats a <foreign lang="fr">croissant</foreign> every morning.
<mentioned lang="fr">Croissant</mentioned> is difficult to pronounce with your mouth full.
A <term lang="fr">croissant</term> is a crescent-shaped piece of light, buttery, pastry that is usually eaten for breakfast, especially in France.
The <foreign> element is formally defined as follows:
<!-- 6.3.2.1: Highlighted phrases-->
<!ELEMENT foreign %om.RR; %paraContent;>
<!ATTLIST foreign
%a.global;
TEIform CDATA 'foreign' >
<!-- end of 6.3.2.1-->
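The rule above, that lang belongs on whatever element already describes the function of the phrase and that foreign is a fallback, can be sketched as follows. The sentences are invented for illustration, and the fragment assumes that the language identifiers la and fr have been declared in whatever way the document's conventions require (see chapter 4 Languages and Character Sets).

```xml
<!-- The phrase is a title, so lang is attached to title -->
<p>She had been reading <title lang="la">De rerum natura</title> all week.</p>

<!-- The phrase is direct speech, so lang is attached to q -->
<p><q lang="fr">Tant pis,</q> he said, and closed the door.</p>

<!-- No other element applies, so foreign carries the lang attribute -->
<p>The plan was abandoned <foreign lang="la">sine die</foreign>.</p>
```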
6.3.2.2 Emphatic Words and Phrases
The <emph> element is provided to mark words or phrases which are linguistically emphatic or stressed. Text which is only typographically ‘emphasized’ falls into the class of highlighted text, and may be tagged with the <hi> element. In printed works, emphasis is generally indicated by devices such as the use of an italic font, a large typeface or extra wide letter spacing; in manuscripts and typescripts, it is usually indicated by the use of underlining.
As the following examples demonstrate, an encoder may choose whether or not to make explicit the particular type of rendition associated with the emphasis, by use of the rend attribute. If a source text consistently renders a particular feature (e.g. emphasis or words in foreign languages) in a particular way, the rendering associated with that feature may be described in the TEI header and the rend attribute used only to describe examples which deviate from the norm.
<q>Sex, sir, is <emph>purely</emph> a question of appetite!</q> Tarr exclaimed.
<q>What it all comes to is this,</q> he said. <q><emph rend="italic">What does Christopher Robin do in the morning nowadays?</emph></q>
<l>Here Thou, great <name rend="italics">Anna</name>! whom three Realms obey,</l>
<l>Doth sometimes Counsel take — and sometimes <emph rend="italic">Tea</emph>.</l>
The <hi> element is used to mark words or phrases which are highlighted in some way, but for which identification of the intended distinction is difficult, controversial or impossible. It enables an encoder simply to record the fact of highlighting, possibly describing it by the use of a rend attribute, as discussed above, without however taking a position as to the function of the highlighting. This may also be useful if the text is to be processed in two stages: representing simply typographic distinctions during a first pass, and then replacing the <hi> tags with more specific tags in a second pass.
<hi rend="gothic">And this Indenture further witnesseth</hi> that the said <hi rend="italic">Walter Shandy</hi>, merchant, in consideration of the said intended marriage ...
In this example, the first highlighted phrase uses black letter or gothic print to mimic the appearance of a legal document, and italic to mark ‘Walter Shandy’ as a name. In a second pass, the elements <head> or <label> might be appropriate for the first use, and the element <name> for the second.
The heaviest rain, and snow, and hail, and sleet, could boast of the advantage over him in only one respect. They often <hi rend="quoted">came down</hi> handsomely, and Scrooge never did.
In this example, the phrase ‘came down’ uses inverted commas to indicate a play on words.74 In a second pass, the element <soCalled> might be preferred.
The <emph> and <hi> elements are formally defined as follows:
<!-- 6.3.2.2: -->
<!ELEMENT emph %om.RR; %paraContent;>
<!ATTLIST emph
%a.global;
TEIform CDATA 'emph' >
<!ELEMENT hi %om.RR; %paraContent;>
<!ATTLIST hi
%a.global;
TEIform CDATA 'hi' >
<!-- end of 6.3.2.2-->
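The two-stage approach described in this section can be sketched by setting the first-pass and second-pass encodings of the same two passages side by side. The second-pass choices shown (label, name, soCalled) are those the text suggests as possibilities; other encoders might reasonably choose differently.

```xml
<!-- First pass: record only the fact of highlighting -->
<hi rend="gothic">And this Indenture further witnesseth</hi> that the
said <hi rend="italic">Walter Shandy</hi>, merchant, ...
They often <hi rend="quoted">came down</hi> handsomely, and Scrooge never did.

<!-- Second pass: replace hi with an element naming the function,
     keeping rend to describe the appearance -->
<label rend="gothic">And this Indenture further witnesseth</label> that the
said <name rend="italic">Walter Shandy</name>, merchant, ...
They often <soCalled rend="quoted">came down</soCalled> handsomely, and Scrooge never did.
```

Because rend is a global attribute, the description of the rendering survives the second pass unchanged; only the element name is revised.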
6.3.2.3 Other Linguistically Distinct Material
For some kinds of analysis, it may be desirable to encode the linguistic distinctiveness of words and phrases with more delicacy than is allowed by the <foreign> element. The <distinct> element is provided for this purpose. Its attributes allow for additional information characterizing the nature of the linguistic distinction to be made in two distinct ways: the type attribute simply assigns a user-defined code of some kind to the word or phrase which assigns it to some register, sub-language, etc. No recommendations as to the set of values for this attribute are provided at this time, as little consensus exists in the field.
Alternatively, the remaining three attributes may be used in combination to place a word or phrase on a three-dimensional scale sometimes used in descriptive linguistics.75 The time attribute places a word diachronically, for example as archaic, old-fashioned, contemporary, futuristic, etc.; the space attribute places a word diatopically, that is, with respect to a geographical classification, for example as national, regional, international, etc.; the social attribute places a word diastatically, that is, with respect to a social classification, for example as technical, polite, impolite, restricted, etc. Again, no recommendations are made for the values of these attributes at this time; the encoder should provide a description of the scheme used in the appropriate section of the header (see section 5.3 The Encoding Description).
Next morning a boy in that dormitory confided to his bosom friend, a <distinct type="psSlang">fag</distinct> of Macrea's, that there was trouble in their midst which King <distinct type="archaic">would fain</distinct> keep secret.
Next morning a boy in that dormitory confided to his bosom friend, a <distinct time="1900" space="GB" social="publicschool">fag</distinct> of Macrea's, that there was trouble in their midst which King <distinct time="archaic">would fain</distinct> keep secret.
Where more complex (or more rigorous) interpretive analyses of the associations of a word are required, the more detailed and general mechanisms described in chapter 16 Feature Structures should be preferred to these simple characterizations. It may also be preferable to record the kinds of analysis suggested here by means of the simple annotation element <note> described in section 6.8 Notes, Annotation, and Indexing, or the <span> element described in section 15.3 Spans and Interpretations.
The <distinct> element has the following formal definition:
<!-- 6.3.2.3: -->
<!ELEMENT distinct %om.RR; %phrase.seq;>
<!ATTLIST distinct
%a.global;
type CDATA #IMPLIED
time CDATA #IMPLIED
space CDATA #IMPLIED
social CDATA #IMPLIED
TEIform CDATA 'distinct' >
<!-- end of 6.3.2.3-->
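Since no values are recommended for the attributes of distinct, the scheme in use has to be described in the header, as noted above. One possible way of doing so is sketched below. The wording is hypothetical, the glosses given for psSlang and archaic are assumptions inferred from the examples in this section, and the choice of the interpretation element within the editorial declaration is only one plausible location.

```xml
<encodingDesc>
  <editorialDecl>
    <interpretation>
      <!-- Hypothetical description of a project-specific scheme -->
      <p>The type attribute of the distinct element takes one of two
         project-specific codes: psSlang (public-school slang) or
         archaic (archaic usage).</p>
    </interpretation>
  </editorialDecl>
</encodingDesc>
```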
6.3.3 Quotation
This section discusses the following elements, all of which are often rendered by the use of quotation marks:
One form of presentational variation found particularly frequently in written and printed texts is the use of quotation marks. As with the typographic variations discussed in the preceding section, it is generally helpful to separate the encoding of the underlying textual feature (for example, a quotation or a piece of direct speech) from the encoding of its rendering (for example, the use of a particular style of quotation marks). The most common and important use of quotation marks is, of course, to mark quotation, by which we mean simply any part of the text attributed by the author or narrator to some agency other than the narrative voice. Typical examples include passages cited from other works, for which the element <quote> may be used, and words or phrases attributed to other voices within the current work, for which the element <q> may be used. If this distinction between intra-textual and inter-textual voices cannot be made reliably, or is not of interest, then all quoted matter may simply be marked using the <q> tag. The editorial policy in this respect should be stated in the encoding description of the TEI Header. The <soCalled> element is used for cases where the author or narrator distances him or herself from the words in question without however attributing them to any other voice in particular. Quotation may be rendered by changes in type face, by special punctuation marks (single or double or angled quotes, dashes, etc.) and by layout (indented paragraphs, etc.). If these characteristics are of interest, an appropriate value for the rend attribute should be given, to record how the <q> or <quote> element is rendered. For discussion of suggested values for this attribute, see below. Quotation marks themselves may, like other punctuation marks, be felt for some purposes to be worth retaining within a text, quite independently of their description by the rend attribute. 
Where this is done, an appropriate entity reference should be chosen from the standard entity sets listed in chapter 37 Obtaining TEI WSDs; this has the advantage that the entity may be redefined as null when the punctuation is to be ignored for some analytic purpose. Well-known ambiguities, such as whether the character ' represents an apostrophe or a closing single quotation mark, or whether the character " represents an opening or closing double quotation mark may all be resolved by the use of appropriate entity references, as discussed in section 6.2 Treatment of Punctuation. Alternatively, the encoder may suppress all quotation marks, possibly recording their form using the rend attribute. Where this is done, the following list of entity names (taken from the public entity sets ISOpub and ISOnum) may be found useful to describe quotation-mark styles common in European and American typesetting:
These may be used in the rend attribute to show how the quotation was opened and closed. For example, if the words ‘PRE’ and ‘POST’ are used to indicate preceding and following punctuation, then the following example would describe a conventional American book printed using single quotation marks:
<q rend="PRE lsquo POST rsquo">Who-e debel you?</q> — he at last said — <q rend="PRE lsquo POST rsquo">you no speak-e, damme, I kill-e.</q> And so saying, the lighted tomahawk began flourishing about me in the dark.
The following example demonstrates alternative policies which may be adopted with respect to encoding of the punctuation used to mark quotation:
Adolphe se tourna vers lui :
<q>— Alors, Albert, quoi de neuf?</q>
<q>— Pas grand-chose.</q>
<q>— Il fait beau,</q> dit Robert.
Adolphe se tourna vers lui :
<q rend="PRE mdash">Alors,
Albert, quoi de neuf ?</q>
<q rend="PRE mdash">Pas grand-chose.</q>
<q rend="PRE mdash">Il fait beau,</q>
dit Robert.
To make explicit who is speaking, which is not always stated in the
above example, the who attribute should be used:
Adolphe se tourna vers lui :
<q who="Adolphe">— Alors, Albert, quoi de neuf?</q>
<q who="Albert">— Pas grand-chose.</q>
<q who="Robert">— Il fait beau,</q> dit Robert.
The who attribute is also useful as a means of supplying a normalized form of the speaker's name, to facilitate selection of text by particular speakers. As indicated above, it may be supplied whether or not an indication of the speaker is given explicitly in the text.
Where investigation of ‘narrative voice’ is the primary object of the encoding, it may be convenient to identify each speaker as a participant in the work, and to associate individual speeches with them by means of the ID/IDREF mechanism. See section 23.2.2 The Participants Description for discussion of the participant description component of the TEI Header. For such analyses, it may also be useful to distinguish representations of speech from representations of thought, in modern printed texts often indicated by a change of typeface. The type attribute should be used for this purpose, as in this example:
<q type="speech">Oh yes,</q> said Henry, <q type="speech">I mean Gordon Macrae, for example…</q> <q type="thought">Jungian Analyst with Winebox! That's what you called him, you callous bastard, didn't you? Eh? Eh?</q>
Quoted matter may be embedded within quoted matter, as when one speaker reports the speech of another:
<q who="Wilson">Spaulding, he came down into the office just this day eight weeks with this very paper in his hand, and he says:— <q who="Spaulding">I wish to the Lord, Mr. Wilson, that I was a red-headed man.</q></q>
Direct speech nested in this way is treated in the same way as elsewhere: a change of rendition may occur, but the same element should be used. An encoder may however choose to distinguish between direct speech which contains quotations from extra-textual matter and direct speech itself, as in the following example:
<p><q>The Lord! The Lord! It is Sakya Muni himself,</q> the lama half
sobbed; and under his breath began the wonderful Buddhist
invocation:-<q>
<quote>
<l>To Him the Way — the Law — Apart —</l>
<l>Whom Maya held beneath her heart</l>
<l>Ananda's Lord — the Bodhisat</l>
</quote>
And He is here! The Most Excellent Law is here also. My
pilgrimage is well begun. And what work! What work!</q>
</p>
Quotations from other works are often accompanied by a reference to their source. The <cit> element may be used to group together the quotation and its associated bibliographic reference, which should be encoded using the elements for bibliographic references discussed in section 6.10 Bibliographic Citations and References, as in the following example. <div id="mm01" type="chapter">
<head>Chapter 1</head>
<epigraph><cit>
<quote>
<l>Since I can do no good because a woman</l>
<l>Reach constantly at something that is near it.</l>
</quote>
<bibl>
<title>The Maid's Tragedy</title>
<author>Beaumont and Fletcher</author>
</bibl>
</cit></epigraph>
<p>Miss Brooke had that kind of beauty which seems to be thrown into
relief by poor dress...</p>
</div>
Like other bibliographic references, the citation attached to a
quotation may be represented simply by a pointer, as in this example:
Lexicography has shown little sign of being affected by the
work of followers of J.R. Firth, probably best summarized
in his slogan, <cit>
<quote>You shall know a word by the company it keeps.</quote>
<ref target="fi57">(Firth, 1957)</ref>
</cit>
Unlike most of the other elements discussed in this chapter, direct
speech and quotations may frequently contain other high-level elements
such as paragraphs or verse lines, as well as being themselves contained
by such elements. Three possible solutions exist for this well-known
structural problem:
For further discussion, and several examples, see chapter 31 Multiple Hierarchies. Finally, in this section, the element <soCalled> is provided for all cases in which quotation marks are used to distance the quoted text from the narrator or speaker. Common examples include the `scare' quotes often found in newspaper headlines and advertising copy, where the effect is to cast doubts on the veracity of an assertion: <head>PM dodges <soCalled>election threat</soCalled> in interview</head> The same element should be used to mark a variety of special ironic usages. Some further examples follow: He hated <soCalled>good</soCalled> books. <soCalled>Croissants</soCalled> indeed! toast not good enough for you? Although Chomsky's decision that all NL sentences are finite objects was never justified by arguments from the attested properties of NLs, it did have a certain <soCalled>social</soCalled> justification. It was commonly assumed in works on logic until fairly recently that the notion <mentioned>language</mentioned> is necessarily restricted to finite strings. The elements discussed in this section are formally defined as follows: <!-- 6.3.3: Quotation-->
<!ELEMENT q %om.RR; %specialPara;>
<!ATTLIST q
%a.global;
type CDATA #IMPLIED
direct (y | n | unspecified) "unspecified"
who CDATA #IMPLIED
TEIform CDATA 'q' >
<!ELEMENT quote %om.RR; %specialPara;>
<!ATTLIST quote
%a.global;
TEIform CDATA 'quote' >
<!ELEMENT cit %om.RR; ( (q | quote | %m.bibl; | %m.loc; | %m.Incl; )+)>
<!ATTLIST cit
%a.global;
TEIform CDATA 'cit' >
<!ELEMENT soCalled %om.RR; %phrase.seq;>
<!ATTLIST soCalled
%a.global;
TEIform CDATA 'soCalled' >
<!-- end of 6.3.3-->
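The use of the who attribute to select text by speaker, described in this section, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample is adapted from the French example in this section, the function name speeches_by_speaker is hypothetical, and the standard-library parser used here does not validate against the TEI DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample adapted from the example in this section (wrapped in <p> so that it parses).
SAMPLE = """<p>Adolphe se tourna vers lui :
<q who="Adolphe">Alors, Albert, quoi de neuf?</q>
<q who="Albert">Pas grand-chose.</q>
<q who="Robert">Il fait beau,</q> dit Robert.</p>"""


def speeches_by_speaker(xml_text):
    """Map each value of the who attribute to the text of its <q> elements."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for q in root.iter("q"):
        speaker = q.get("who")
        if speaker is not None:
            # itertext() flattens any nested markup, including nested <q> elements.
            grouped[speaker].append("".join(q.itertext()).strip())
    return dict(grouped)


speeches = speeches_by_speaker(SAMPLE)
```

Because nested quotations are flattened into the enclosing speech, an application that needs to keep them apart would have to treat inner <q> elements separately.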
6.3.4 Terms, Glosses, and Cited WordsThis section describes the following textual elements, all of which have in common that they may be variously realized using italics, quotation marks, or other devices:
Technical terms are often italicized or emboldened upon first mention in printed texts; an explanation or gloss is sometimes given in quotation marks. Linguistic analyses conventionally cite words in languages under discussion in italics, providing a gloss immediately following marked with single quotation marks. Other texts in which individual words or phrases are mentioned (for example, as examples) rather than used may mark them either with italics or with quotation marks, and will gloss them less regularly. A <term> may appear with or without a gloss, as may a <mentioned> element. Where the <gloss> is present, it may be linked to the term it is glossing by means of the ID/IDREF mechanism. To establish such a link, the encoder should give an id value to the <term> or <mentioned> element and provide that id as the value of the target attribute on the <gloss> element. The following examples demonstrate this facility: for more discussion of this and other kinds of linkage within TEI documents, see chapter 14 Linking, Segmentation, and Alignment. We may define <term id="tdpv" rend="sc">discoursal point of view</term> as <gloss target="tdpv">the relationship, expressed through discourse structure, between the implied author or some other addresser, and the fiction.</gloss> <gloss rend="unmarked" target="t1">A computational device that infers structure from grammatical strings of words</gloss> is known as a <term id="t1">parser</term>, and much of the history of NLP over the last 20 years has been occupied with the design of parsers. There is thus a striking accentual difference between a verbal form like <mentioned id="cw234" lang="grc">eluthemen</mentioned> <gloss target="cw234">we were released,</gloss> accented on the second syllable of the word, and its participial derivative <mentioned id="cw235" lang="grc">lutheis</mentioned> <gloss target="cw235">released,</gloss> accented on the last. 
The elements discussed in this section have the following formal definitions: <!-- 6.3.4: Terms, glosses, etc.-->
<!ELEMENT term %om.RR; %phrase.seq;>
<!ATTLIST term
%a.global;
type CDATA #IMPLIED
TEIform CDATA 'term' >
<!ELEMENT mentioned %om.RR; %phrase.seq;>
<!ATTLIST mentioned
%a.global;
TEIform CDATA 'mentioned' >
<!ELEMENT gloss %om.RR; %phrase.seq;>
<!ATTLIST gloss
%a.global;
target IDREF #IMPLIED
TEIform CDATA 'gloss' >
<!-- end of 6.3.4-->
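The link between a <gloss> and the <term> or <mentioned> element it explains can be followed mechanically. The following Python sketch pairs each gloss with the element whose id matches its target attribute; the function name pair_glosses is hypothetical, and the sample is the first example from this section wrapped in a <p> element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p>We may define <term id="tdpv" rend="sc">discoursal point of view</term>
as <gloss target="tdpv">the relationship, expressed through discourse structure,
between the implied author or some other addresser, and the fiction.</gloss></p>"""


def pair_glosses(xml_text):
    """Return (glossed text, gloss text) pairs by resolving each target to an id."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an id attribute.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    pairs = []
    for gloss in root.iter("gloss"):
        target = by_id.get(gloss.get("target"))
        if target is not None:
            glossed = "".join(target.itertext())
            # Collapse the line breaks that occur inside the gloss text.
            explanation = " ".join("".join(gloss.itertext()).split())
            pairs.append((glossed, explanation))
    return pairs


pairs = pair_glosses(SAMPLE)
```

Glosses with no target attribute, or with a target that matches no id, are silently skipped in this sketch; a stricter application might report them instead.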
6.3.5 Some Further ExamplesAs a simple example of the elements discussed here, consider the following sentence: On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité, the romances of Chrétien de Troyes, and the German adaptations of these works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram von Eschenbach.A first approximation to the encoding of this sentence might be simply to record the fact that the phrases printed above in italics are highlighted, as follows: On the one hand the <hi rend="italic">Nibelungenlied</hi> is associated with the new rise of romance of twelfth-century France, the <hi lang="fr" rend="italic">romans d'antiquité</hi>, the romances of Chrétien de Troyes, ...This encoding would however lose the important distinction between an italicized title and an italicized foreign phrase. Many other phrases might also be italicized in the text, and a retrieval program seeking to identify foreign terms (for example) would not be able to produce reliable results by simply looking for italicized words. Where economic and intellectual constraints permit, therefore, it would be preferable to encode both the function of the highlighted phrases and their appearance, as follows: On the one hand the <title rend="italic">Nibelungenlied</title> is associated with the new rise of romance of twelfth-century France, the <foreign rend="italic">romans d'antiquité</foreign>, the romances of Chrétien de Troyes, ... In this example, the decision as to which textual features are distinguished by the highlighting is relatively uncontroversial. As a less straightforward example, consider the use of italic font in the following passage from Samuel Richardson's Clarissa (1747). A pretty common case, I believe; in all vehement debatings. 
She says I am too witty; Anglicè, too pert; I, that she is too wise; that is to say, being likewise put into English, not so young as she has been: in short, she is grown so much into a mother, that she had forgotten she ever was a daughter. ... Clearly, the word ‘vehement’ is not italicized for the same reason as the phrase ‘not so young as she has been’; the former is emphasized, while the latter is proverbial. The latter also provides an ironic gloss for the words ‘too wise’, in the same way as ‘too pert’ glosses ‘too witty’. The glossed phrases are not however technical terms or cited words, but quoted phrases, as if Clarissa were putting words into her own and her mother's mouths. Finally, the words ‘mother’ and ‘daughter’ are apparently italicized simply to oppose them in the sentence; certainly they do not fit into any of the categories so far proposed as reasons for italicizing. Note also that the word ‘Anglicè’ is not italicized although it is not generally considered an English word. The following sample encoding for the above passage attempts to take into account all the above points: A pretty common case, I believe; in all <emph>vehement</emph> debatings. She says I am <q rend="italic">too witty</q>; <foreign lang="la" rend="roman">Anglicè</foreign>, <gloss rend="italic">too pert</gloss>; I, that she is <q rend="italic">too wise</q>; that is to say, being likewise put into English, <gloss rend="italic">not so young as she has been</gloss>: in short, she is grown so much into a <hi rend="italic">mother</hi>, that she had forgotten she ever was a <hi rend="italic">daughter</hi>.
6.4 Names, Numbers, Dates, Abbreviations, and Addresses
This section describes a number of textual features which it is often convenient to distinguish from their surrounding text.
Names, dates, and numbers are likely to be of particular importance to the scholar treating a text as a source for a database; distinguishing such items from the surrounding text is however equally important to the scholar primarily interested in lexis. The treatment of these textual features proposed here is not intended to be exhaustive: fuller treatments for names, numbers, measures, and dates are provided in the additional tag set for names and dates (see chapter 20 Names and Dates).
6.4.1 Referring Strings
A referring string is a phrase which refers to some person, place, object, etc. Two elements are provided to mark such strings:
<q>My dear <rs type="person">Mr. Bennet</rs></q>, said his lady to him one day, <q>have you heard that <rs type="place">Netherfield Park</rs> is let at last?</q> Collectors of water-rents were appointed by the <rs type="organization">Watering Committee</rs>. They were paid a commission not exceeding four per cent, and gave bond. It being one of the principles of the <rs type="org">Circumlocution Office</rs> never, on any account whatsoever, to give a straightforward answer, <rs type="person">Mr Barnacle</rs> said, <q>Possibly.</q> As the following example shows, the <rs> element may be used for any reference to a person, place, etc., not only to references in the form of a proper noun or noun phrase. <q>My dear <rs type="person">Mr. Bennet</rs></q>, said <rs type="person">his lady</rs> to him one day ... The <name> element by contrast is provided for the special case of referencing strings which consist only of proper nouns; it may be used synonymously with the <rs> element, or nested within it if a referring string contains a mixture of common and proper nouns. The following example shows an alternative way of encoding the short sentence from Pride and Prejudice quoted above: <q>My dear <name type="person">Mr. Bennet</name>,</q> said <rs type="person">his lady</rs> to him one day, <q>have you heard that <name type="place">Netherfield Park</name> is let at last?</q> The following example shows how a proper name may be nested within a referring string: <rs>His Excellency the Life President, <name>Ngwazi Dr H. Kamuzu Banda</name></rs> Simply tagging something as a name is generally not enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. 
Moreover, name prefixes such as ‘van’ or ‘de la’, may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer. The following attributes, common to all members of the names element class, are provided to help overcome these difficulties:
Either or both of these attributes may be specified, as appropriate. The key attribute may be useful as a means of gathering together all references to the same individual or location scattered throughout a document: <q>My dear <rs key="BENM1" type="person"> Mr. Bennet</rs>,</q> said <rs key="BENM2" type="person">his lady</rs> to him one day, <q>have you heard that <rs key="NETP1" type="place">Netherfield Park</rs> is let at last?</q> This use should be distinguished from the case of the reg (regularization) attribute, which provides a means of marking the standard form of a referencing string as demonstrated below: My personal life during the administration of <rs key="POJA1" reg="Polk, James K." type="person">Col. Polk</rs> has but poorly compensated me for the suspended enjoyments and pursuits of private and professional spheres <name key="VOM1" reg="Volanges, Mme de" type="person">Mme. de Volanges</name> marie sa fille: c'est encore un secret; mais elle m'en a fait part hier. <name key="WADLM1" reg="de la Mare, Walter" type="person">Walter de la Mare</name> was born at <name key="Ch1" type="place">Charlton</name>, in <name key="KT1" type="county">Kent</name>, in 1873. <name type="place">Montaillou</name> is not a large parish. At the time of the events which led to <name reg="Benedict XII, Pope of Avignon (Jacques Fournier)" type="person">Fournier's</name> investigations, the local population consisted of between 200 and 250 inhabitants. This method is adequate for many simple applications. For more complex applications, such as onomastics, or wherever a detailed analysis of the component parts of a name is needed, the specialized elements described in chapter 20 Names and Dates or the analytical tools described in chapter 16 Feature Structures should be used. These elements are formally declared as follows: <!-- 6.4.1: Proper Nouns-->
<!ELEMENT name %om.RR; %phrase.seq;>
<!ATTLIST name
%a.global;
%a.names;
type CDATA #IMPLIED
TEIform CDATA 'name' >
<!ELEMENT rs %om.RR; %phrase.seq;>
<!ATTLIST rs
%a.global;
%a.names;
type CDATA #IMPLIED
TEIform CDATA 'rs' >
<!-- end of 6.4.1-->
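The gathering of scattered references by means of the key attribute, and the use of reg to supply a standard form, might be processed as in the following Python sketch. The function name index_names is hypothetical, and falling back to the transcribed text when no reg value is given is an assumption of this sketch rather than a rule of the Guidelines.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<p><name key="WADLM1" reg="de la Mare, Walter" type="person">Walter de la Mare</name>
was born at <name key="Ch1" type="place">Charlton</name>, in
<name key="KT1" type="county">Kent</name>, in 1873.</p>"""


def index_names(xml_text):
    """Group <name> and <rs> elements by key, preferring reg over the surface form."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for el in root.iter():
        if el.tag in ("name", "rs") and el.get("key") is not None:
            surface = "".join(el.itertext()).strip()
            # Assumption: when reg is absent, the text as transcribed is used instead.
            index[el.get("key")].append(el.get("reg", surface))
    return dict(index)


index = index_names(SAMPLE)
```

Each key then leads to every recorded form of the corresponding person or place, which is the kind of lookup a database-oriented application would need.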
6.4.2 AddressesThe simplest way of encoding an address is to regard it as a series of distinct lines, just as they might be printed on an envelope. The following elements support this view:
Alternatively, an address may be encoded as a structure composed of the following elements, which constitute the addrPart element class:
<address> <addrLine>110 Southmoor Road,</addrLine> <addrLine>Oxford OX2 6RB,</addrLine> <addrLine>UK</addrLine> </address> The above address could also be represented as follows: <address> <street>110 Southmoor Road</street> <name type="city">Oxford</name> <postCode>OX2 6RB</postCode> <name type="country">United Kingdom</name> </address> The order of elements within an address is highly culture-specific, and is therefore unconstrained: <address> <name type="org">Università di Bologna</name> <name type="country">Italy</name> <postCode>40126</postCode> <name type="city">Bologna</name> <street>via Marsala 24</street> </address> For further discussion of ways of regularizing the names of places, see section 6.4 Names, Numbers, Dates, Abbreviations, and Addresses. A full postal address may also include the name of the addressee, tagged as above using the general purpose <name> element. When the additional tag set for names and dates is enabled, more specific elements such as <publisher> or <org> may be used, as further discussed in chapter 20 Names and Dates. The <address> element and its components are formally described as follows: <!-- 6.4.2: Addresses and their components-->
<!ELEMENT address %om.RO; ( (%m.Incl;)*,
( (addrLine, (%m.Incl;)*)+ | ((%m.addrPart;), (%m.Incl;)*)* ) ) >
<!ATTLIST address
%a.global;
TEIform CDATA 'address' >
<!ELEMENT addrLine %om.RO; %phrase.seq;>
<!ATTLIST addrLine
%a.global;
TEIform CDATA 'addrLine' >
<!ELEMENT street %om.RO; %phrase.seq;>
<!ATTLIST street
%a.global;
TEIform CDATA 'street' >
<!ELEMENT postCode %om.RO; (#PCDATA)>
<!ATTLIST postCode
%a.global;
TEIform CDATA 'postCode' >
<!ELEMENT postBox %om.RO; (#PCDATA)>
<!ATTLIST postBox
%a.global;
TEIform CDATA 'postBox' >
<!--Other components of addresses should be represented
using the general purpose NAME element-->
<!-- end of 6.4.2-->
6.4.3 Numbers and MeasuresThis section describes two elements provided for the simple encoding of numbers and measures and gives some indication of circumstances in which this may usefully be done. The following phrase level elements are provided for this purpose:
Like names or abbreviations, numbers can occur virtually anywhere in a text. Numbers are special in that they can be written with either letters or digits (‘twenty-one’, ‘xxi’, and ‘21’) and their presentation is language-dependent (e.g. English ‘5th’ becomes Greek ‘5.’; English ‘123,456.78’ equals French ‘123.456,78’). For many kinds of application, e.g. natural-language processing or machine translation, numbers are not regarded as `lexical' in the same way as other parts of a text. For these and other applications, the <num> element provides a convenient method of distinguishing numbers from the surrounding text. For other kinds of application, numbers are only useful if normalized: here the <num> element is useful precisely because it provides a standardized way of representing a numerical value. <num value="33">xxxiii</num> <num type="cardinal" value="21">twenty-one</num> <num type="percentage" value="10">ten percent</num> <num type="percentage" value="10">10%</num> <num type="ordinal" value="5">5th</num> <num type="fraction" value="0,5">one half</num> <num type="fraction" value="0,5">1/2</num> The word ‘measure’ is used here to refer to a special kind of referring string, the referent of which is a `virtual object'. In its fullest form, a measure consists of a number, a phrase expressing units of measure and a phrase expressing the commodity being measured. Not all of these components need be present in every case. For some applications, particularly quantitative ones, the internal components of measure need to be marked so that their values can be calculated. Thus, in order to evaluate a monetary measure according to some standard, it is necessary to mark its currency unit (e.g. US dollars, pounds sterling). Similarly, the expression ‘2 ounces’ will have a different meaning when it is associated with ‘flour’ from that which it has when associated with ‘water’. 
Such applications will require the elements discussed in chapter 20 Names and Dates, or the more powerful analytical tools discussed in chapter 16 Feature Structures. Elsewhere, it may be sufficient simply to encode measures as such, perhaps also indicating their numeric content with the <num> element, as in the following examples: <l>I've measured it from side to side</l> <l>'Tis <measure reg="0.924 m" type="length"> <num value="3">three</num> feet</measure> long, and <measure reg="0.616 m" type="length"> <num value="2">two</num> feet</measure> wide.</l>As the above example also demonstrates, the <measure> element is a member of the class names like other referencing strings, and may thus bear a reg attribute to indicate a normalized value. The form of normalization used should conform to a defined standard such as the International System of Units (SI). The <measure> element may also carry a key attribute to indicate a database key value, as in the following example: <list>
<item><measure key="BH2" type="volume">
<num value="2">ii</num> bags hops
</measure>
</item>
<item><measure key="TW6" type="volume">
<num value="6">six</num> trusses Woolen and linen goods
</measure>
</item>
<item><measure key="WC5" type="weight">
5 tonnes coale
</measure>
</item>
<!-- ... -->
</list>
These elements are formally defined as follows: <!-- 6.4.3: Numbers and measures-->
<!ELEMENT num %om.RR; %phrase.seq;>
<!ATTLIST num
%a.global;
type CDATA #IMPLIED
value CDATA #IMPLIED
TEIform CDATA 'num' >
<!ELEMENT measure %om.RR; %phrase.seq;>
<!ATTLIST measure
%a.global;
%a.names;
type CDATA #IMPLIED
TEIform CDATA 'measure' >
<!-- end of 6.4.3-->
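An application that needs numbers only in normalized form can read the value attribute and ignore the way each number is written in the text. The Python sketch below does this for three of the examples in this section; the function name numeric_values is hypothetical, and the sketch assumes that every value is a whole number, which is not true of fractions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<list>
<item><num value="33">xxxiii</num></item>
<item><num type="cardinal" value="21">twenty-one</num></item>
<item><num type="ordinal" value="5">5th</num></item>
</list>"""


def numeric_values(xml_text):
    """Return (text as written, normalized integer value) for each <num> with a value."""
    root = ET.fromstring(xml_text)
    found = []
    for num in root.iter("num"):
        value = num.get("value")
        if value is not None:
            # Assumption: values are integral; int() raises ValueError otherwise.
            found.append(("".join(num.itertext()), int(value)))
    return found


values = numeric_values(SAMPLE)
total = sum(v for _, v in values)
```

The same three numbers, written as a roman numeral, in words, and as an ordinal, become directly comparable once the value attribute is used.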
6.4.4 Dates and TimesDates and times, like numbers, can appear in widely varying culture- and language-dependent forms, and can pose similar problems in automatic language processing. The following elements are provided to identify them:
Dates can occur virtually anywhere in a text, but in some contexts (e.g. bibliographic citations) their encoding is recommended or required rather than optional. Times can also appear anywhere but are generally optional. Partial dates or times (e.g. ‘1990’, ‘September 1990’, ‘twelvish’) can be expressed in the value attribute by simply omitting a part of the value supplied. Imprecise dates or times (for example ‘early August’, ‘some time after ten and before twelve’) may be expressed as date or time ranges. If either end of the date or time range is known to be accurate (for example, ‘at some time before 1230’, ‘a few days after Hallowe'en’), the exact attribute may be used to specify this. Where the certainty (i.e. reliability) of the date or time itself is in question, rather than its precision, the encoder should record this fact using the mechanisms discussed in chapter 17 Certainty and Responsibility. These mechanisms are useful primarily for fully specified dates or times known with certainty. If component parts of dates or times are to be marked up, or if a more complex analysis of the meaning of a temporal expression is required, the techniques described in chapter 20 Names and Dates should be used in preference to the simple method outlined here. 
The value attribute is a useful way of normalizing or disambiguating dates and times which can appear in many formats, as the following examples show: <date value="1980-02-12">12/2/1980</date> Given on the <date value="1977-06-12">Twelfth Day of June in the Year of Our Lord One Thousand Nine Hundred and Seventy-seven of the Republic the Two Hundredth and first and of the University the Eighty-Sixth.</date> <date value="2001">2001</date> <date value="2001-09">September 2001</date> <date value="2001-09-11">11 Sept 01</date> <date value="2001-09-11">9/11</date>, <time value="08:48">8:48</time> <date value="2001-09-11T12:48Z">Sept 11th, 12 minutes before 9 am</date>Note in the last example the use of a normalized representation for the date string which includes a time: this example could thus equally well be tagged using the <time> element. The following examples demonstrate the use of the <dateRange> element to mark a period of time: Those five years — <dateRange from="1918" to="1923">1918 to 1923</dateRange> — had been, he suspected, somehow very important. The Eddic poems are preserved in a unique manuscript (Codex Regius 2365) from <dateRange from="1250" to="1300">the second half of the thirteenth century</dateRange>, and <title>Hervarar saga</title> dates from <date value="1300">around 1300</date>. These elements are formally defined as follows: <!-- 6.4.4: Dates and times-->
<!ELEMENT date %om.RR; %phrase.seq;>
<!ATTLIST date
%a.global;
calendar CDATA #IMPLIED
value CDATA #IMPLIED
certainty CDATA #IMPLIED
TEIform CDATA 'date' >
<!ELEMENT dateRange %om.RO; %phrase.seq;>
<!ATTLIST dateRange
%a.global;
calendar CDATA #IMPLIED
from CDATA #IMPLIED
to CDATA #IMPLIED
exact (to|from|both|none) #IMPLIED
TEIform CDATA 'dateRange' >
<!ELEMENT time %om.RR; %phrase.seq;>
<!ATTLIST time
%a.global;
value CDATA #IMPLIED
type (am | pm | 24hour | descriptive) #IMPLIED
zone CDATA #IMPLIED
TEIform CDATA 'time' >
<!ELEMENT timeRange %om.RR; %phrase.seq;>
<!ATTLIST timeRange
%a.global;
from CDATA #IMPLIED
to CDATA #IMPLIED
exact (to|from|both|none) #IMPLIED
TEIform CDATA 'timeRange' >
<!-- end of 6.4.4-->
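Since partial dates are expressed by omitting part of the value attribute, software can recover the precision of a date from the form of that value. The following Python sketch classifies values of the form year, year-month, or year-month-day, as used in the examples in this section; the names DATE_VALUE, date_precision, and precisions are hypothetical, and values that include a time (such as 2001-09-11T12:48Z) are deliberately left unclassified.

```python
import re
import xml.etree.ElementTree as ET

# Matches YYYY, YYYY-MM, or YYYY-MM-DD (date-only values, an assumption of this sketch).
DATE_VALUE = re.compile(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")

SAMPLE = """<p><date value="2001">2001</date>,
<date value="2001-09">September 2001</date>,
<date value="2001-09-11">11 Sept 01</date></p>"""


def date_precision(value):
    """Return 'year', 'month', or 'day', or None if the value has another form."""
    match = DATE_VALUE.fullmatch(value)
    if match is None:
        return None
    _year, month, day = match.groups()
    if day is not None:
        return "day"
    if month is not None:
        return "month"
    return "year"


def precisions(xml_text):
    """Pair the value of each <date> element with its precision."""
    root = ET.fromstring(xml_text)
    return [(d.get("value"), date_precision(d.get("value", "")))
            for d in root.iter("date")]


result = precisions(SAMPLE)
```

The sketch checks only the shape of the value; it does not verify that the month and day are valid calendar values.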
6.4.5 Abbreviations and Their ExpansionsIt is sometimes desirable to mark abbreviations in the copy text, whether to trigger special processing for them, to provide the full form of the word or phrase abbreviated, or to allow for different possible expansions of the abbreviation. Abbreviations may be transcribed as they stand, or expanded; they may be left unmarked, or marked using these tags:
The <abbr> element is useful as a means of distinguishing semi-lexical items such as acronyms or jargon: We can sum up the above discussion as follows: the identity of a <abbr>CC</abbr> is defined by that calibration of values which motivates the elements of its <abbr>GSP</abbr>; ... Every manufacturer of <abbr>3GL</abbr> or <abbr>4GL</abbr> languages is currently nailing on <abbr>OOP</abbr> extensions. The type attribute may be used to distinguish types of abbreviation by their function, and the expan attribute may be used to supply an expansion: <abbr type="title">Dr.</abbr> <abbr type="initial">M.</abbr> Deegan is the Director of the <abbr expan="Computers in Teaching Initiative" type="acronym">CTI</abbr> Centre for Textual Studies. Abbreviations such as ‘Dr. M.’ above may be treated as two abbreviations, as above, or as one: <abbr>Dr. M.</abbr> Deegan is the Director of the <abbr>CTI</abbr> Centre for Textual Studies. This element is particularly useful where manuscript materials in which abbreviation is very frequent are being transcribed. For example: <l>Ex<abbr expan="per" resp="pg" type="brevigraph">&per;</abbr>ience, thogh noon auctoritee</l> <l>Were in this world, is right ynogh for me</l> <l>To speke of wo that is in mariage;</l> Here an entity reference per has been used to represent the common manuscript symbol ‘crossed-p’, and its expansion supplied in the associated <abbr> tag. The same lines might be transcribed, expanded, as follows: <l>Ex<expan abbr="&per;" resp="pg" type="brevigraph">per</expan>ience, thogh noon auctoritee</l> <l>Were in this world, is right ynogh for me</l> <l>To speke of wo that is in mariage;</l> In practice, it may be most convenient to transcribe the abbreviation as an entity reference; this allows the entity reference itself to be expanded either as an <abbr> or as an <expan> element, depending on the processing to be done at the moment. 
(For further discussion of such documentation, see section 25.4.3 Documenting Coded Character Sets and Entity Sets.) The text shown here: <l>Ex&per;ience, thogh noon auctoritee</l> <l>Were in this world, is right ynogh for me</l> <l>To speke of wo that is in mariage;</l> may be expanded as desired by supplying whichever of the following two alternative entity declarations is appropriate: <!ENTITY per "<abbr type='brevigraph' expan='per' resp='pg'>&p.crossed;</abbr>"> <!ENTITY per "<expan type='brevigraph' abbr='&p.crossed;' resp='pg'>per</expan>"> For further discussion of manuscript abbreviations, see chapter 18 Transcription of Primary Sources. These elements are formally defined as follows: <!-- 6.4.5: Abbreviations-->
<!ELEMENT abbr %om.RR; %phrase.seq;>
<!ATTLIST abbr
%a.global;
expan CDATA #IMPLIED
resp IDREF %INHERITED;
cert CDATA #IMPLIED
type CDATA #IMPLIED
TEIform CDATA 'abbr' >
<!ELEMENT expan %om.RR; %phrase.seq;>
<!ATTLIST expan
%a.global;
abbr CDATA #IMPLIED
resp IDREF %INHERITED;
cert CDATA #IMPLIED
type CDATA #IMPLIED
TEIform CDATA 'expan' >
<!-- end of 6.4.5-->
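The choice between reading an abbreviation as it stands and reading its expansion can also be made at processing time from the expan attribute. The Python sketch below renders the same encoded sentence in either form; the function name render is hypothetical, and leaving an <abbr> unchanged when it has no expan attribute is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p><abbr type="title">Dr.</abbr> <abbr type="initial">M.</abbr> Deegan is the
Director of the <abbr expan="Computers in Teaching Initiative" type="acronym">CTI</abbr>
Centre for Textual Studies.</p>"""


def render(xml_text, expand):
    """Return the plain text, replacing <abbr> content by its expan value if expand is true."""
    root = ET.fromstring(xml_text)
    parts = []

    def walk(el):
        if expand and el.tag == "abbr" and el.get("expan") is not None:
            # Use the supplied expansion instead of the abbreviated content.
            parts.append(el.get("expan"))
            return
        parts.append(el.text or "")
        for child in el:
            walk(child)
            parts.append(child.tail or "")

    walk(root)
    # Normalize the line breaks and runs of spaces found in the source.
    return " ".join("".join(parts).split())


abbreviated = render(SAMPLE, expand=False)
expanded = render(SAMPLE, expand=True)
```

Only the acronym changes between the two renderings here, because the other two abbreviations carry no expansion.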
6.5 Simple Editorial ChangesAs in editing a printed text, so in encoding a text in electronic form, it may be necessary to accommodate editorial comment on the text and to render account of any changes made to the text in preparing it. The tags described in this section may be used to record such editorial interventions, whether made by the encoder, by the editor of a printed edition used as a copy text, by earlier editors, or by the copyists of manuscripts. The tags described here handle most common types of editorial intervention and stereotyped comment; where less structured commentary of other types is to be included, it should be marked using the <note> element described in section 6.8 Notes, Annotation, and Indexing. Systematic interpretive annotation is also possible using the various methods described in chapter 14 Linking, Segmentation, and Alignment. The examples given here illustrate only simple cases of editorial intervention; in particular, they permit economical encoding of two alternative readings of a text only. To encode more than two views of any one segment of text, the mechanisms described in chapters 14 Linking, Segmentation, and Alignment and 19 Critical Apparatus must be used. The first two pairs of elements here discussed (<sic> and <corr>, <reg> and <orig>) may both be used to record simultaneously a text in its `original', uncorrected and unaltered form and also in an `edited' form. In this way they resemble the pair <abbr> and <expan>, described in section 6.4.5 Abbreviations and Their Expansions. Such paired elements enable software to move automatically from one `view' of the text to the other. Three categories of editorial intervention are discussed in this section:
A more extended treatment of the use of these tags in transcriptional and editorial work is given in chapter 18 Transcription of Primary Sources.
6.5.1 Correction of Apparent Errors
When the copy text is manifestly faulty, an encoder or transcriber may elect simply to correct it without comment. For scholarly purposes, it will often be more generally useful to record both the correction and the original state of the text. The elements described here enable this to be done in such a way as not to distract the reader.
The following examples show alternative treatment of the same material. The copy text reads: Another property of computer-assisted historical research is that data modelling must permit any one textual feature or part of a textual feature to be a part of more than one information model and to allow the researcher to draw on several such models simultaneously, for example, to select from a machine-readable text those marginal comments which indicate that the date's mentioned in the main body of the text are incorrect. An encoder may choose to correct the typographic error, either silently or with an indication that a correction has been made, as follows: ... marginal comments which indicate that the <corr>dates</corr> mentioned in the main body of the text are incorrect. Alternatively, the encoder may simply record the typographic error without correcting it, either without comment or with a <sic> element to indicate the error is not a transcription error in the encoding: ... marginal comments which indicate that the <sic>date's</sic> mentioned in the main body of the text are incorrect. If the encoder elects both to record the original source text and to provide a correction for the sake of word-search and other programs, either <sic> or <corr> may be used with the appropriate attribute: ... marginal comments which indicate that the <sic corr="dates" resp="msm">date's</sic> mentioned in the main body of the text are incorrect. ... marginal comments which indicate that the <corr sic="date's" resp="MSM">dates</corr> mentioned in the main body of the text are incorrect.If both readings are given, the choice between <sic> and <corr> is largely a question of individual preference; since both record the same information, either may be mechanically transformed into the other. 
If the original reading contains tags, it will prove more convenient to use <sic> than <corr> (and vice versa if there are tags within the corrected reading), since tags are not recognized in attribute values. If both readings contain subordinate tags, then recourse must be had to the methods described in chapter 19 Critical Apparatus. The cert attribute on the <sic> and <corr> elements permits a statement of the degree of editorial confidence in a particular correction. For example, using a confidence scale of one to ten, an editor may indicate the conjectural status of a correction by assigning a value to this attribute of less than ten. In the following instance, some uncertainty is expressed concerning a commonly-accepted emendation: An <corr sic="Antony" cert="8">Autumn</corr> it was, That grew the more by reapingSee further the discussion in section 18.1.3 Correction and Conjecture. Where the correction takes the form of adding text, the encoder must choose whether to use the <corr> (or <sic>) tag, the <add> tag (see section 6.5.3 Additions, Deletions, and Omissions below), or the more detailed facilities provided by the additional tag set for primary source description. The following discussion may be helpful when making this decision:
The formal definition of these elements is as follows:

<!-- 6.5.1: Editorial tags for correction-->
<!ELEMENT sic %om.RR; %specialPara;>
<!ATTLIST sic
%a.global;
corr CDATA #IMPLIED
resp CDATA %INHERITED;
cert CDATA #IMPLIED
TEIform CDATA 'sic' >
<!ELEMENT corr %om.RR; %specialPara;>
<!ATTLIST corr
%a.global;
sic CDATA #IMPLIED
resp CDATA %INHERITED;
cert CDATA #IMPLIED
TEIform CDATA 'corr' >
<!-- end of 6.5.1-->
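As an illustration of how the cert attribute on <corr> might be used in processing, the sketch below chooses between the corrected and the original reading according to a confidence threshold. It rests on several assumptions that are not made by the Guidelines: that cert holds an integer on the one-to-ten scale used in the example above (the DTD declares it only as CDATA), that a missing cert means the editor expressed no doubt, and that the function name preferred_reading is hypothetical.

```python
import xml.etree.ElementTree as ET

def preferred_reading(corr, threshold=10):
    """Return the corrected text if cert meets the threshold, else the sic reading.

    Assumption: cert is an integer from 1 to 10; an absent cert is
    treated as full confidence.
    """
    cert = corr.get("cert")
    original = corr.get("sic")
    confident = cert is None or int(cert) >= threshold
    # Fall back to the correction when no original reading was recorded.
    if confident or original is None:
        return corr.text or ""
    return original

line = ET.fromstring('<l>An <corr sic="Antony" cert="8">Autumn</corr> it was,</l>')
corr = line.find("corr")
print(preferred_reading(corr, threshold=7))  # Autumn
print(preferred_reading(corr, threshold=9))  # Antony
```

A project using some other scale for cert would need to replace the integer comparison accordingly.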
6.5.2 Regularization and Normalization

When the source text makes extensive use of variant forms or non-standard spellings, it may be desirable for a number of reasons to regularize it: that is, to provide `standard' or `regularized' forms equivalent to the non-standard forms.[76] As with other such changes to the copy text, the changes may be made silently (in which case the TEI header should specify the types of silent changes made) or may be explicitly marked using the following elements:
Typical applications for these elements include the production of editions intended for student or lay readers, linguistic research in which spelling or usage variation is not the main question at issue, production of spelling dictionaries, etc. Consider this 16th-century text:

how godly a dede it is to overthrowe so wicked a race the world may judge: for my part I thinke there canot be a greater sacryfice to God.

An encoder may choose to preserve the original spelling of this text, but simply flag it as nonstandard by using the <orig> element with no attributes specified, as follows:

how godly a <orig>dede</orig> it is to <orig>overthrowe</orig> so wicked a race the world may judge: for my part I <orig>thinke</orig> there <orig>canot</orig> be a greater <orig>sacryfice</orig> to God.

Alternatively, the encoder may simply indicate that certain words have been modernized by using the <reg> element with no attributes specified, as follows:

how godly a <reg>deed</reg> it is to <reg>overthrow</reg> so wicked a race the world may judge: for my part I <reg>think</reg> there <reg>cannot</reg> be a greater <reg>sacrifice</reg> to God.

More usefully, the encoder may elect to record both old and new spellings, so that (for example) the same electronic text may serve as the basis of an old- or new-spelling edition:

how godly a <reg orig="dede">deed</reg> it is to <reg orig="overthrowe">overthrow</reg> so wicked a race the world may judge: for my part I <reg orig="thinke">think</reg> there <reg orig="canot">cannot</reg> be a greater <reg orig="sacryfice">sacrifice</reg> to God.

Or the <orig> tag might be preferred:

how godly a <orig reg="deed">dede</orig> it is to <orig reg="overthrow">overthrowe</orig> so wicked a race the world may judge: for my part I <orig reg="think">thinke</orig> there <orig reg="cannot">canot</orig> be a greater <orig reg="sacrifice">sacryfice</orig> to God.

The resp attribute should be used to specify the agency responsible for the regularization.
This may be an identifiable individual, for example an editor, or a descriptive phrase such as ‘copyist’. For example, in the first stanza of the Old Norse poem Grógaldr, the manuscript form ‘dura’ is usually regularized in modern editions to ‘dyra’ (doors). The manuscript's ‘vek ek þik dauðra dura’ might thus be recorded together with its regularization in two ways, as follows:

vek ek þik dauðra <reg orig="dura" resp="ed">dyra</reg>

or:

vek ek þik dauðra <orig reg="dyra" resp="ed">dura</orig>

These elements are formally defined as follows:

<!-- 6.5.2: Editorial tags for regularization-->
<!ELEMENT reg %om.RR; %phrase.seq;>
<!ATTLIST reg
%a.global;
orig CDATA #IMPLIED
resp CDATA #IMPLIED
TEIform CDATA 'reg' >
<!ELEMENT orig %om.RR; %phrase.seq;>
<!ATTLIST orig
%a.global;
reg CDATA #IMPLIED
resp CDATA #IMPLIED
TEIform CDATA 'orig' >
<!-- end of 6.5.2-->
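The remark above that a text recording both spellings may serve as the basis of an old- or new-spelling edition can be illustrated with a short sketch. It assumes the encoded passage is well-formed XML read with Python's standard xml.etree.ElementTree module, and that the <reg> and <orig> elements are direct children of the element being flattened; the function name spelling_edition is hypothetical. The sample mixes the two tagging styles only to exercise both branches; a real encoding would normally use one style throughout.

```python
import xml.etree.ElementTree as ET

def spelling_edition(root, modern=True):
    """Flatten an element to plain text, choosing one reading per <reg>/<orig>.

    Hypothetical helper: only direct children of root are examined.
    """
    parts = [root.text or ""]
    for child in root:
        if child.tag == "reg":
            # Content is the regularized form; the orig attribute holds the source form.
            text = child.text if modern else child.get("orig", child.text)
        elif child.tag == "orig":
            # Content is the source form; the reg attribute holds the regularized form.
            text = child.get("reg", child.text) if modern else child.text
        else:
            text = "".join(child.itertext())
        parts.append(text or "")
        parts.append(child.tail or "")
    return "".join(parts)

sample = ('<p>how godly a <reg orig="dede">deed</reg> it is to '
          '<orig reg="overthrow">overthrowe</orig> so wicked a race</p>')
root = ET.fromstring(sample)
print(spelling_edition(root, modern=True))
print(spelling_edition(root, modern=False))
```

Where an element records only one reading (no orig or reg attribute), the sketch falls back to the element content for both editions.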
6.5.3 Additions, Deletions, and Omissions

The following elements are used to indicate when words or phrases have been omitted from, added to, or marked for deletion from, a text. Like the other editorial elements, they allow for a wide range of editorial practices:
Encoders may choose to omit parts of the copy text for reasons ranging from illegibility of the source or impossibility of transcribing it, to editorial policy, e.g. a systematic exclusion of poetry or prose from an encoding. The full details of the policy decisions concerned should be documented in the TEI Header (see section 5.3 The Encoding Description). Each place in the text at which omission has taken place should be marked with a <gap> element, optionally with further information about the reason for the omission, its extent, and the person or agency responsible for it, as in the following examples:

<gap desc="Prose commentary" reason="sampling" resp="pr" extent="120 lines"/>

... Their arrangement with respect to Jupiter and to each other was as follows: <gap desc="diagram" reason="sampling" extent="2 cm x 1 col"/> That is, there were two stars on the easterly side and one to the west; ...

<gap desc="ink blot" reason="illegible" extent="two words"/>

<gap reason="overwriting, illegible" resp="h1" extent="8 chars"/>

The <add> and <del> elements may be used to record where words or phrases have been added or deleted in the copy text. They are not appropriate where longer passages have been added or deleted, which span several elements; for these, the elements <addSpan> and <delSpan>, or other mechanisms described in section 18 Transcription of Primary Sources must be used.

Additions to a text may be recorded for a number of reasons. Sometimes they are marked in a distinctive way in the source text, for example by brackets or insertion above the line (supralinear insertion), as in the following example, taken from a 19th century manuscript:

The story I am going to relate is true as to its main facts, and as to the consequences <add place="supralinear" resp="auth">of these facts</add> from which this tale takes its title.
The <add> element should not be used to mark editorial changes, such as supplying a word omitted by mistake from the source text or a passage present in another version. In these cases, either the <corr> or <supplied> tags should be used, as discussed above in section 6.5.1 Correction of Apparent Errors, and in section 18.1.3 Correction and Conjecture, respectively.

The <unclear> element is used to mark passages in the original which cannot be read with confidence, or about which the transcriber is uncertain for other reasons, as for example when transcribing a partially inaudible or illegible source. Its reason and resp attributes are used, as with the <gap> element, to indicate the cause of uncertainty and the person responsible for the conjectured reading.

<l>And where the sandy mountain Fenwick scald</l>
<l><unclear reason="ink blot" resp="LB">The</unclear> sea between yet hence his pray'r prevail'd</l>

or from a spoken text:

and then <unclear reason="passing truck">marbled queen</unclear>

Where the material affected is entirely illegible or inaudible, the <gap> element discussed above should be used in preference.

The <del> element is used to mark material which is deleted in the source but which can still be read with some degree of confidence, as opposed to material which has been omitted by the encoder or transcriber either because it is entirely illegible or for some other reason. This is of particular importance in transcribing manuscript material, though deletion is also found in printed texts, sometimes for humorous purposes:

<l>One day I will sojourn to your shores</l>
<l>I live in the middle of England</l>
<l>But!</l>
<l>Norway! My soul resides in your watery <del type="overstrike">fiords fyords fiiords</del></l>
<l>Inlets.</l>

The type attribute may be used to distinguish different methods of deletion in manuscript or typescript material, as in this line from the typescript of Eliot's Waste Land:

<l><del type="overtyped">Mein</del> Frisch <del type="overstrike">schwebt</del> weht der Wind</l>

Deletion in manuscript or typescript is often associated with addition:

<l><del type="overstrike">Inviolable</del> <add place="infralinear">Inexplicable</add> splendour of Corinthian white and gold</l>

The <del> element should not be used where the deletion is such that material cannot be read with confidence, or read at all, or where the material has been omitted by the transcriber or editor for some other reason. Where the material cannot be read with confidence following deletion, the <unclear> tag should be used with the reason attribute indicating that the difficulty of transcription is due to deletion. Where material has been omitted by the transcriber or editor, this may be indicated by use of the <corr> (or <sic>) and <gap> elements. Observe that the distinction between recommended uses of the