Text Encoding Initiative
The XML Version of the TEI Guidelines
5 The TEI Header
This chapter addresses the problems of describing an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented. Such documentation is equally necessary for scholars using the texts, for software processing them, and for cataloguers in libraries and archives. Together these descriptions and declarations provide an electronic analogue to the title page attached to a printed work. They also constitute an equivalent for the content of the code books or introductory manuals customarily accompanying electronic data sets.
Every TEI-conformant text must carry such a set of descriptions, prefixed to it and encoded as described in this chapter. The set is known as the TEI header, tagged <teiHeader>, and it has four major parts: a file description (<fileDesc>), an encoding description (<encodingDesc>), a text profile (<profileDesc>), and a revision history (<revisionDesc>).
A TEI header can be a very large and complex object, or it may be a very simple one. Some application areas (for example, the construction of language corpora and the transcription of spoken texts) will require more specialized and detailed information than others. The present proposals therefore define both a core set of elements (all of which may be used without formality in any TEI header) and additional tagsets, which may be invoked as extensions as needed. For more details of this extension mechanism, see chapter 3.2 Core, Base, and Additional Tag Sets; the header extensions are fully described in chapter 23 Language Corpora, which should be read in conjunction with the present chapter.
The next section of the present chapter briefly introduces the overall structure of the header, and the kinds of data it may contain. This is followed by a detailed description of all the constituent elements which may be used in the core header. Section 5.6 Minimal and Recommended Headers, at the end of the present chapter, discusses the recommended content of a minimal TEI header, and its relation to standard library cataloguing practices. Recommendations relevant to the use of TEI headers as free-standing documents, for interchange among libraries, data archives, and similar institutions may be found in chapter 24 The Independent Header.
5.1 Organization of the TEI Header
5.1.1 The TEI Header and Its Components
The <teiHeader> element should be clearly distinguished both from the prolog, which comprises either the XML declaration or the SGML declaration, and the document type declaration (see chapter 2 A Gentle Introduction to XML); and from the front matter of the text itself (for which see section 7.4 Front Matter). A composite text, such as a corpus or collection, may contain several headers, as further discussed below. In the usual case however, a TEI-conformant text will contain a single <teiHeader> element, followed by a single <text> element.
The header element has the following description:
As discussed above, the <teiHeader> element has four principal components: <fileDesc>, <encodingDesc>, <profileDesc>, and <revisionDesc>.
Of these, only the <fileDesc> element is required in all TEI headers; the others are optional. The full form of a TEI header is thus:
<teiHeader>
<fileDesc> <!-- ... --> </fileDesc>
<encodingDesc> <!-- ... --> </encodingDesc>
<profileDesc> <!-- ... --> </profileDesc>
<revisionDesc> <!-- ... --> </revisionDesc>
</teiHeader>
while a minimal header takes the form:
<teiHeader>
<fileDesc> <!-- ... --> </fileDesc>
</teiHeader>
In the case of language corpora or collections, it may be desirable to record header information either at the level of individual components in the corpus or collection, or once for all at the level of the corpus or collection itself, or at both levels. More details concerning the tagging of composite texts are given in chapter 23 Language Corpora, which should be read in conjunction with the current chapter. An optional type attribute may also be supplied on the <teiHeader> element to indicate whether the header applies to a corpus or a single text. A corpus may thus take the form:
<teiCorpus.2>
<teiHeader type='corpus'>
<!-- header for corpus-level information -->
</teiHeader>
<TEI.2>
<teiHeader type='text'>
<!-- header for text-level information -->
</teiHeader>
<text> <!-- ... --> </text>
</TEI.2>
<TEI.2>
<teiHeader type='text'> <!-- ... --> </teiHeader>
<text> <!-- ... --> </text>
</TEI.2>
<!-- etc. -->
</teiCorpus.2>
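The corpus structure shown above can be inspected with a short program. The following Python sketch is illustrative only and is not part of these Guidelines: the function name header_types and the sample document are assumptions made for the example. Because the standard-library parser used here does not read the DTD, the declared default value of the type attribute (text) is supplied by hand.

```python
import xml.etree.ElementTree as ET

# Illustrative sample; the second text-level header relies on the default type.
CORPUS = """
<teiCorpus.2>
  <teiHeader type='corpus'><fileDesc/></teiHeader>
  <TEI.2>
    <teiHeader type='text'><fileDesc/></teiHeader>
    <text/>
  </TEI.2>
  <TEI.2>
    <teiHeader><fileDesc/></teiHeader>
    <text/>
  </TEI.2>
</teiCorpus.2>
"""

def header_types(corpus_xml):
    """Return (corpus header type, [type of each text-level header])."""
    root = ET.fromstring(corpus_xml)
    # The DTD default "text" is applied manually, since no DTD is read.
    corpus_type = root.find("teiHeader").get("type", "text")
    text_types = [
        child.find("teiHeader").get("type", "text")
        for child in root
        if child.tag == "TEI.2"
    ]
    return corpus_type, text_types

print(header_types(CORPUS))
```

A real corpus would of course carry full file descriptions in each header; the empty <fileDesc/> elements above merely keep the sample short.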
The tags required for the TEI header are defined in the DTD file teihdr2.dtd, which first defines the <teiHeader> element:
<!-- 5.1.1: The TEI Header-->
<!--teihdr2.dtd Tags for TEI Header.-->
<!--
** Copyright 2004 TEI Consortium.
** See the main DTD fragment 'tei2.dtd' or the file 'COPYING' for the
** complete copyright notice.
-->
<!ELEMENT teiHeader %om.RR; (fileDesc, encodingDesc*, profileDesc*,
revisionDesc?)>
<!ATTLIST teiHeader
%a.global;
type CDATA "text"
creator CDATA #IMPLIED
status (new | update) "new"
date.created %ISO-date; #IMPLIED
date.updated %ISO-date; #IMPLIED
TEIform CDATA 'teiHeader' >
[continued in 5.1.1: ]
<!-- end of 5.1.1-->
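The content model just given can be checked without a validating parser. The following Python sketch is an illustration under stated assumptions, not a substitute for DTD validation: it rewrites the content model of <teiHeader> as a regular expression over the names of the child elements, and the function name header_is_valid is hypothetical. Attributes and the content of each child are not examined.

```python
import re
import xml.etree.ElementTree as ET

# The content model (fileDesc, encodingDesc*, profileDesc*, revisionDesc?)
# expressed over a sequence of child names, each followed by one space.
HEADER_MODEL = re.compile(
    r"fileDesc (encodingDesc )*(profileDesc )*(revisionDesc )?"
)

def header_is_valid(header_xml):
    """Check only the order and number of the children of <teiHeader>."""
    header = ET.fromstring(header_xml)
    if header.tag != "teiHeader":
        return False
    names = "".join(child.tag + " " for child in header)
    return HEADER_MODEL.fullmatch(names) is not None

MINIMAL = "<teiHeader><fileDesc><!-- ... --></fileDesc></teiHeader>"
FULL = ("<teiHeader><fileDesc/><encodingDesc/>"
        "<profileDesc/><revisionDesc/></teiHeader>")
MISORDERED = "<teiHeader><revisionDesc/><fileDesc/></teiHeader>"

for sample in (MINIMAL, FULL, MISORDERED):
    print(header_is_valid(sample))
```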
Then it defines the rest of the header elements, embedding the DTD fragments found later in this chapter:
<!-- 5.1.1: -->
[declarations from 5.2: The file description inserted here ]
[declarations from 5.2.7: The source description inserted here ]
[declarations from 5.3: The encoding description inserted here ]
[declarations from 5.4: The profile description inserted here ]
[declarations from 5.5: The Revision Description inserted here ]
<!-- end of 5.1.1-->
5.1.2 Types of Content in the TEI Header
The elements occurring within the TEI header may contain several types of content; the following list indicates how these types of content are described in the following sections:
5.2 The File Description
This section describes the <fileDesc> element, which is the first component of the <teiHeader> element.
The bibliographic description of a machine-readable text resembles in structure that of a book, an article, or any other kind of textual object. The file description element of the TEI header has therefore been closely modelled on existing standards in library cataloguing; it should thus provide enough information to allow users to give standard bibliographic references to the electronic text, and to allow cataloguers to catalogue it. Bibliographic citations occurring elsewhere in the header, and also in the text itself, are derived from the same model (on bibliographic citations in general, see further section 6.10 Bibliographic Citations and References). See further section 5.7 Note for Library Cataloguers.
The bibliographic description of the electronic text (not its source) is given in the mandatory <fileDesc> element:
The <fileDesc> element contains three mandatory elements and four optional elements, each of which is described in more detail in sections 5.2.1 The Title Statement to 5.2.6 The Notes Statement below. These elements are listed below in the order in which they must be given within the <fileDesc> element.
A file description containing all possible subelements has the following structure:
<teiHeader>
<fileDesc>
<titleStmt> <!-- ... --> </titleStmt>
<editionStmt> <!-- ... --> </editionStmt>
<extent> <!-- ... --> </extent>
<publicationStmt> <!-- ... --> </publicationStmt>
<seriesStmt> <!-- ... --> </seriesStmt>
<notesStmt> <!-- ... --> </notesStmt>
<sourceDesc> <!-- ... --> </sourceDesc>
</fileDesc>
<!-- remainder of TEI Header here -->
</teiHeader>
Several of these elements may be omitted; a minimal file description has
the following structure:
<teiHeader>
<fileDesc>
<titleStmt> <!-- ... --> </titleStmt>
<publicationStmt> <!-- ... --> </publicationStmt>
<sourceDesc> <!-- ... --> </sourceDesc>
</fileDesc>
<!-- remainder of TEI Header here -->
</teiHeader>
The <fileDesc> itself has the following formal definition:
<!-- 5.2: The file description-->
<!ELEMENT fileDesc %om.RR; (titleStmt, editionStmt?, extent?,
publicationStmt, seriesStmt?, notesStmt?,
sourceDesc+ ) >
<!ATTLIST fileDesc
%a.global;
TEIform CDATA 'fileDesc' >
[declarations from 5.2.1: The title statement inserted here ]
[declarations from 5.2.2: The edition statement inserted here ]
[declarations from 5.2.3: The extent statement inserted here ]
[declarations from 5.2.4: The publication statement inserted here ]
[declarations from 5.2.5: The series statement inserted here ]
[declarations from 5.2.6: The notes statement inserted here ]
<!-- end of 5.2-->
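Because the components of <fileDesc> must appear in a fixed order, software which generates headers may find it convenient to assemble them from named parts. The following Python sketch shows one way of doing so; the names FILE_DESC_ORDER, build_file_desc, and prose are assumptions introduced for the example. It places each supplied component in the order required by the declaration above and rejects input lacking a mandatory component. For brevity it accepts only one <sourceDesc>, although the declaration permits several.

```python
import xml.etree.ElementTree as ET

# Order required by the fileDesc declaration; True marks mandatory elements.
FILE_DESC_ORDER = [
    ("titleStmt", True), ("editionStmt", False), ("extent", False),
    ("publicationStmt", True), ("seriesStmt", False),
    ("notesStmt", False), ("sourceDesc", True),
]

def build_file_desc(parts):
    """Assemble <fileDesc> from a dict mapping element names to Elements."""
    unknown = set(parts) - {name for name, _ in FILE_DESC_ORDER}
    if unknown:
        raise ValueError(f"not a fileDesc component: {sorted(unknown)}")
    file_desc = ET.Element("fileDesc")
    for name, mandatory in FILE_DESC_ORDER:
        if name in parts:
            file_desc.append(parts[name])
        elif mandatory:
            raise ValueError(f"missing mandatory element: {name}")
    return file_desc

def prose(name, text):
    """Hypothetical helper: an element holding one prose paragraph."""
    element = ET.Element(name)
    ET.SubElement(element, "p").text = text
    return element

title_stmt = ET.Element("titleStmt")
ET.SubElement(title_stmt, "title").text = "A sample text: electronic version"

# The parts are deliberately supplied out of order.
parts = {
    "sourceDesc": prose("sourceDesc",
                        "No source: created in machine-readable form."),
    "titleStmt": title_stmt,
    "publicationStmt": prose("publicationStmt", "Unpublished draft."),
}
print(ET.tostring(build_file_desc(parts), encoding="unicode"))
```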
5.2.1 The Title Statement
The <titleStmt> element is the first component of the <fileDesc> element, and is mandatory:
The <title> element contains the chief name of the file, including any alternative title or subtitles it may have. It may be repeated if the file has more than one title (perhaps in different languages), and takes whatever form is considered appropriate by its creator. Where the electronic work is derived from an existing source text, it is strongly recommended that the title of the electronic work should be derived from that of the source, but should also be clearly distinguishable from it. For example, do not call the computer file ‘A Sanskrit-English Dictionary, based upon the St. Petersburg Lexicons’. Call it, rather, ‘Sanskrit-English Dictionary, based upon the St. Petersburg Lexicons: a machine readable transcription’. If you wish to retain some or all of the title of the source text in the title of the computer file, then introduce one of the following phrases:
This will distinguish the computer file from the source text in citations and in catalogues which contain descriptions of both types of material.
The computer file will almost certainly have an external name (its ‘filename’ or ‘data set name’) or reference number on the computer system where it resides at any time. This name is likely to change frequently, as new copies of the file are made on the computer system. Its form is entirely dependent on the particular computer system in use and thus cannot always easily be transferred from one system to another. For these reasons, these Guidelines strongly recommend that such names should not be used as the <title> for any computer file.
Helpful guidance on the formulation of useful descriptive titles in difficult cases may be found in the Anglo-American Cataloguing Rules (AACR 2), chapter 25, or in equivalent national-level bibliographical documentation.
The specialized elements <author>, <sponsor>, <funder>, and <principal>, and the more general <respStmt> provide the statements of responsibility which identify the persons responsible for the intellectual or artistic content of an item and any corporate bodies from which it emanates.
Any number of statements of responsibility may occur within the title statement. At a minimum, identify the author of the text and the creator of the machine-readable file. If the bibliographic description is for a corpus, identify the creator of the corpus. These identifications are mandatory when applicable, though not enforceable by the parser. Optionally include also names of others involved in the transcription or elaboration of the text, sponsors, and funding agencies. The name of the person responsible for physical data input need not normally be recorded, unless that person is also intellectually responsible for some aspect of the creation of the file.
Where the person whose responsibility is to be documented is not an author, sponsor, funding body, or principal researcher, the <respStmt> element should be used. This has two subcomponents: a <name> element identifying a responsible individual or organization, and a <resp> element indicating the nature of the responsibility. No specific recommendations are made at this time as to appropriate content for the <resp>: it should make clear the nature of the responsibility concerned, as in the examples below.
Names given may be personal names or corporate names. Give all names in the form in which the persons or bodies wish to be publicly cited. This would usually be the fullest form of the name, including first names.
<titleStmt>
<title>Capgrave's Life of St. John Norbert: a
machine-readable transcription</title>
<respStmt> <resp>compiled by</resp> <name>P.J. Lucas</name> </respStmt>
</titleStmt>
<titleStmt>
<title>Two stories by Edgar Allan Poe: electronic version</title>
<author>Poe, Edgar Allan (1809-1849)</author>
<respStmt>
<resp>compiled by</resp> <name>James D. Benson</name>
</respStmt>
</titleStmt>
<titleStmt>
<title>Yogadarśanam (arthāt
yogasūtrapāṭhaḥ):
a machine readable transcription.</title>
<title>The Yogasūtras of Patañjali:
a machine readable transcription.</title>
<funder>Wellcome Institute for the History of Medicine</funder>
<principal>Dominik Wujastyk</principal>
<respStmt><name>Wieslaw Mical</name>
<resp>data entry and proof correction</resp>
</respStmt>
<respStmt><name>Jan Hajic</name>
<resp>conversion to TEI-conformant markup</resp></respStmt>
</titleStmt>
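Statements of responsibility such as those in the examples above can be gathered into a uniform list for display or cataloguing. The following Python sketch is one possible approach and rests on assumptions made for the example: the function name responsibilities is hypothetical, the sample title is a placeholder, and the name of each specialized element (such as funder) is simply reused as the label for the role.

```python
import xml.etree.ElementTree as ET

# Elements which name a role directly, as opposed to the general <respStmt>.
SPECIALIZED = ("author", "editor", "sponsor", "funder", "principal")

def responsibilities(title_stmt_xml):
    """Return (role, name) pairs in document order."""
    stmt = ET.fromstring(title_stmt_xml)
    pairs = []
    for child in stmt:
        if child.tag in SPECIALIZED:
            pairs.append((child.tag, "".join(child.itertext()).strip()))
        elif child.tag == "respStmt":
            # <resp> and <name> may occur in either order within <respStmt>.
            resp = child.findtext("resp", default="").strip()
            for name in child.findall("name"):
                pairs.append((resp, "".join(name.itertext()).strip()))
    return pairs

SAMPLE = """
<titleStmt>
  <title>Placeholder title: a machine readable transcription</title>
  <funder>Wellcome Institute for the History of Medicine</funder>
  <principal>Dominik Wujastyk</principal>
  <respStmt><name>Wieslaw Mical</name>
    <resp>data entry and proof correction</resp>
  </respStmt>
  <respStmt><name>Jan Hajic</name>
    <resp>conversion to TEI-conformant markup</resp></respStmt>
</titleStmt>
"""

for role, name in responsibilities(SAMPLE):
    print(f"{role}: {name}")
```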
The formal definition of the <titleStmt> element and its constituents is as follows:
<!-- 5.2.1: The title statement-->
<!ELEMENT titleStmt %om.RO; ((title+, (author | editor
| sponsor | funder | principal
| respStmt)*))>
<!ATTLIST titleStmt
%a.global;
TEIform CDATA 'titleStmt' >
<!ELEMENT sponsor %om.RO; %phrase.seq; >
<!ATTLIST sponsor
%a.global;
TEIform CDATA 'sponsor' >
<!ELEMENT funder %om.RO; %phrase.seq; >
<!ATTLIST funder
%a.global;
TEIform CDATA 'funder' >
<!ELEMENT principal %om.RO; %phrase.seq;>
<!ATTLIST principal
%a.global;
TEIform CDATA 'principal' >
<!--The TITLE, AUTHOR, NAME, RESPSTMT, and RESP elements are
declared in file teicore2.dtd, not here.-->
<!-- end of 5.2.1-->
5.2.2 The Edition Statement
The <editionStmt> element is the second component of the <fileDesc> element. It is optional but recommended.
For printed texts, the word ‘edition’ applies to the set of all the identical copies of an item produced from one master copy and issued by a particular publishing agency or a group of such agencies. A change in the identity of the distributing body or bodies does not normally constitute a change of edition, while a change in the master copy does.
For electronic texts, the notion of a ‘master copy’ is not entirely appropriate, since they are far more easily copied and modified than printed ones; nonetheless the term ‘edition’ may be used for a particular state of a machine-readable text at which substantive changes are made and fixed. Synonymous terms used in these Guidelines are ‘version’, ‘level’, and ‘release’. The words ‘revision’ and ‘update’, by contrast, are used for minor changes to a file which do not amount to a new edition.
No simple rule can specify how ‘substantive’ changes have to be before they are regarded as producing a new edition, rather than a simple update. The general principle proposed here is that the production of a new edition entails a significant change in the intellectual content of the file, rather than its encoding or appearance. The addition of analytic coding to a text would thus constitute a new edition, while automatic conversion from one coded representation to another would not. Changes relating to the character code or physical storage details, corrections of misspellings, simple changes in the arrangement of the contents and changes in the output format do not normally constitute a new edition. The addition of new information (e.g. a linguistic analysis expressed in part-of-speech tagging, sound or graphics, referential links to external datasets) almost always does constitute a new edition.
Clearly, there are always borderline cases and the matter is somewhat arbitrary. The simplest rule is: if you think that your file is a new edition, then call it such.
An edition statement is optional for the first release of a machine-readable file; it is mandatory for each later release, though this requirement cannot be enforced by the parser.
Note that all changes in a file, whether or not they are regarded as constituting a new edition or simply a new revision, should be independently noted in the revision description section of the file header (see section 5.5 The Revision Description).
The <edition> element should contain phrases describing the edition or version, including the word ‘edition’, ‘version’, or equivalent, together with a number or date, or terms indicating difference from other editions such as ‘new edition’, ‘revised edition’ etc. Any dates that occur within the edition statement should be marked with the <date> element. The n attribute of the <edition> element may be used as elsewhere to supply any formal identification (such as a version number) for the edition.
One or more <respStmt> elements may also be used to supply statements of responsibility for the edition in question. These may refer to individuals or corporate bodies and can indicate functions such as that of a reviser, or can name the person or body responsible for the provision of supplementary matter, of appendices, etc., in a new edition. For further detail on the <respStmt> element, see section 6.10 Bibliographic Citations and References.
<editionStmt>
<edition n='P2'>Second draft, substantially
extended, revised, and corrected.</edition>
</editionStmt>
<editionStmt>
<edition>Student's edition, <date>June 1987</date></edition>
<respStmt> <resp>New annotations by</resp> <name>George Brown</name> </respStmt>
</editionStmt>
The formal definition of the <editionStmt> element is as follows:
<!-- 5.2.2: The edition statement-->
<!ELEMENT editionStmt %om.RO; ( (edition, respStmt*) | p+ )>
<!ATTLIST editionStmt
%a.global;
TEIform CDATA 'editionStmt' >
<!ELEMENT edition %om.RO; %phrase.seq;>
<!ATTLIST edition
%a.global;
TEIform CDATA 'edition' >
<!-- end of 5.2.2-->
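The formal identification, descriptive phrase, and dates of an edition statement can be read back as follows. This Python sketch is illustrative only; the function name edition_summary and the shape of the returned dictionary are assumptions. Where the edition statement is given as prose (one or more <p> elements), no structured information is available and the sketch returns None.

```python
import xml.etree.ElementTree as ET

def edition_summary(edition_stmt_xml):
    """Summarise the <edition> element of an edition statement."""
    stmt = ET.fromstring(edition_stmt_xml)
    edition = stmt.find("edition")
    if edition is None:  # prose form: the statement contains only <p>
        return None
    return {
        "n": edition.get("n"),
        # Collapse line breaks and runs of spaces within the phrase.
        "text": " ".join("".join(edition.itertext()).split()),
        "dates": [date.text for date in edition.iter("date")],
    }

DRAFT = """
<editionStmt>
  <edition n='P2'>Second draft, substantially
  extended, revised, and corrected.</edition>
</editionStmt>
"""
STUDENT = """
<editionStmt>
  <edition>Student's edition, <date>June 1987</date></edition>
  <respStmt> <resp>New annotations by</resp> <name>George Brown</name> </respStmt>
</editionStmt>
"""

print(edition_summary(DRAFT))
print(edition_summary(STUDENT))
```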
5.2.3 Type and Extent of File
The <extent> element is the third component of the <fileDesc> element. It is optional.
For printed books, information about the carrier, such as the kind of medium used and its size, is of great importance in cataloguing procedures. The print-oriented rules for bibliographic description of an item's medium and extent need some re-interpretation when applied to electronic media. An electronic file exists as a distinct entity quite independently of its carrier and remains the same intellectual object whether it is stored on a magnetic tape, a CD-ROM, a set of floppy disks, or as a file on a mainframe computer. Since, moreover, these Guidelines are specifically aimed at facilitating transparent document storage and interchange, any purely machine-dependent information should be irrelevant as far as the file header is concerned.
This is particularly true of information about file type, although library-oriented rules for cataloguing often distinguish two types of computer file: ‘data’ and ‘programs’. This distinction is quite difficult to draw in some cases, for example, hypermedia or texts with built-in search and retrieval software.
Although it is equally system-dependent, some measure of the size of the computer file may be of use for cataloguing and other practical purposes. Because the measurement and expression of file size is fraught with difficulties, only very general recommendations are possible; the element <extent> is provided for this purpose. It contains a phrase indicating the size or approximate size of the computer file in one of the following ways:
<extent>between 1 and 2 MB</extent>
<extent>4.2 MiB</extent>
<extent>4532 bytes</extent>
<extent>3200 sentences</extent>
<extent>5 3.5" High Density Diskettes</extent>
The <extent> element has the following formal declaration:
<!-- 5.2.3: The extent statement-->
<!ELEMENT extent %om.RO; %phrase.seq;>
<!ATTLIST extent
%a.global;
TEIform CDATA 'extent' >
<!-- end of 5.2.3-->
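Where the size of a file is to be recorded in bytes, the content of the <extent> element can be generated mechanically. The following Python sketch is one possible convention and not a recommendation of these Guidelines: the function names, the choice of an exact byte count below one mebibyte, and the rounding to one decimal place above it are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

MIB = 1024 * 1024  # bytes in one mebibyte

def extent_phrase(size_in_bytes):
    """Express a byte count as a phrase suitable for <extent>."""
    if size_in_bytes < 0:
        raise ValueError("size cannot be negative")
    if size_in_bytes < MIB:
        return f"{size_in_bytes} bytes"
    return f"{size_in_bytes / MIB:.1f} MiB"

def extent_element(data):
    """Build an <extent> element describing the given bytes object."""
    extent = ET.Element("extent")
    extent.text = extent_phrase(len(data))
    return extent

print(extent_phrase(4532))
print(ET.tostring(extent_element(b"abc"), encoding="unicode"))
```

Other measures mentioned in this section, such as a count of sentences, depend on the segmentation adopted for the text and are not attempted here.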
5.2.4 Publication, Distribution, etc.
The <publicationStmt> element is the fourth component of the <fileDesc> element and is mandatory.
The publisher is the person or institution by whose authority a given edition of the file is made public. The distributor is the person or institution from whom copies of the text may be obtained. Where a text is not considered formally published, but is nevertheless made available for circulation by some individual or organization, this person or institution is termed the release authority. At least one of the above three elements must be present, unless the entire publication statement is given as prose. Each may be followed by one or more of the following elements, in the following order:
Note that the dates, places, etc., given in the publication statement relate to the publisher, distributor, or release authority most recently mentioned. If the text was created at some date other than its date of publication, its date of creation should be given within the <profileDesc> element, not in the publication statement. Give any other useful dates (e.g., dates of collection of data) in a note.
Additional detailed tagsets may be used for the encoding of names, dates, and addresses, as further described in section 6.4 Names, Numbers, Dates, Abbreviations, and Addresses and chapter 20 Names and Dates.
<publicationStmt>
<publisher>Oxford University Press</publisher>
<pubPlace>Oxford</pubPlace> <date>1989</date>
<idno type='ISBN'>0-19-254705-4</idno>
<availability><p>Copyright 1989, Oxford University Press
</p></availability></publicationStmt>
<publicationStmt>
<authority>James D. Benson</authority>
<pubPlace>London</pubPlace> <date>1984</date></publicationStmt>
<publicationStmt>
<publisher>Sigma Press</publisher>
<address>
<addrLine>21 High Street,</addrLine>
<addrLine>Wilmslow,</addrLine>
<addrLine>Cheshire M24 3DF</addrLine>
</address>
<date>1991</date>
<distributor>Oxford Text Archive</distributor>
<idno type='ota'>1256</idno>
<availability>
<p>Available with prior consent of depositor for
purposes of academic research and teaching only.</p>
</availability>
</publicationStmt>
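The rule that dates, places, and other details relate to the publisher, distributor, or release authority most recently mentioned can be applied mechanically. The following Python sketch is a minimal illustration; the function name group_by_agency and the layout of the result are assumptions. A publication statement given entirely as prose yields an empty list, and any detail preceding the first agency is ignored.

```python
import xml.etree.ElementTree as ET

AGENCIES = ("publisher", "distributor", "authority")

def group_by_agency(publication_stmt_xml):
    """Attach each detail to the agency most recently mentioned."""
    stmt = ET.fromstring(publication_stmt_xml)
    groups = []
    for child in stmt:
        # Flatten nested content (e.g. address lines) to one line of text.
        text = " ".join("".join(child.itertext()).split())
        if child.tag in AGENCIES:
            groups.append({"role": child.tag, "name": text, "details": []})
        elif groups:
            groups[-1]["details"].append((child.tag, text))
    return groups

SAMPLE = """
<publicationStmt>
  <publisher>Sigma Press</publisher>
  <address>
    <addrLine>21 High Street,</addrLine>
    <addrLine>Wilmslow,</addrLine>
    <addrLine>Cheshire M24 3DF</addrLine>
  </address>
  <date>1991</date>
  <distributor>Oxford Text Archive</distributor>
  <idno type='ota'>1256</idno>
  <availability>
    <p>Available with prior consent of depositor for
    purposes of academic research and teaching only.</p>
  </availability>
</publicationStmt>
"""

for group in group_by_agency(SAMPLE):
    print(group["role"], group["name"], group["details"])
```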
The publication statement and its components are formally defined as follows:
<!-- 5.2.4: The publication statement-->
<!ELEMENT publicationStmt %om.RO;
( ( p, (%m.Incl;)*)+
| ( (publisher | distributor | authority | pubPlace | address | idno
| availability | date ), (%m.Incl;)*)+ )>
<!ATTLIST publicationStmt
%a.global;
TEIform CDATA 'publicationStmt' >
<!ELEMENT distributor %om.RO; %phrase.seq;>
<!ATTLIST distributor
%a.global;
TEIform CDATA 'distributor' >
<!ELEMENT authority %om.RO; %phrase.seq;>
<!ATTLIST authority
%a.global;
TEIform CDATA 'authority' >
<!ELEMENT idno %om.RO; (#PCDATA)>
<!ATTLIST idno
%a.global;
type CDATA #IMPLIED
TEIform CDATA 'idno' >
<!ELEMENT availability %om.RO; (p+)>
<!ATTLIST availability
%a.global;
status ( free | unknown | restricted ) "unknown"
TEIform CDATA 'availability' >
<!--The PUBLISHER, PUBPLACE, and ADDRESS elements
are defined in file teicore2.dtd.-->
<!-- end of 5.2.4-->
5.2.5 The Series Statement
The <seriesStmt> element is the fifth component of the <fileDesc> element and is optional.
In bibliographic parlance, a series may be defined in one of the following ways:
The <seriesStmt> element may contain a prose description or one or more of the following more specific elements:
The <idno> may be used to supply any identifying number associated with the item, including both standard numbers such as an ISSN and particular issue numbers. (Arabic numerals separated by punctuation are recommended for this purpose: 6.19.33, for example, rather than VI/xix:33). Its type attribute is used to categorize the number further, taking the value ISSN for an ISSN for example.
<seriesStmt>
<title level="s">Machine-Readable Texts for the Study of
Indian Literature</title>
<respStmt> <resp>ed. by</resp> <name>Jan Gonda</name> </respStmt>
<idno type="vol">1.2</idno>
<idno type='ISSN'>0 345 6789</idno>
</seriesStmt>
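The recommended form for issue numbers (6.19.33 rather than VI/xix:33) can be produced automatically from an existing mixed notation. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, any run of punctuation is treated as a separator, and Roman numerals are converted without checking that they are well formed.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(numeral):
    """Convert a Roman numeral (upper or lower case) to an integer."""
    values = [ROMAN[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # Subtractive notation: a smaller value before a larger one
        # is subtracted (as in ix = 9).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

def normalize_number(number):
    """Rewrite e.g. 'VI/xix:33' as Arabic numerals separated by full stops."""
    parts = re.split(r"[^0-9A-Za-z]+", number.strip())
    result = []
    for part in parts:
        if part.isdigit():
            result.append(part)
        elif re.fullmatch(r"[ivxlcdmIVXLCDM]+", part):
            result.append(str(roman_to_int(part)))
        else:
            raise ValueError(f"cannot interpret component: {part!r}")
    return ".".join(result)

print(normalize_number("VI/xix:33"))
```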
The series statement has the following formal definition:
<!-- 5.2.5: The series statement-->
<!ELEMENT seriesStmt %om.RO; ( (title+, (idno | respStmt)*)
| p+ )>
<!ATTLIST seriesStmt
%a.global;
TEIform CDATA 'seriesStmt' >
<!-- end of 5.2.5-->
Its components are all defined elsewhere.
5.2.6 The Notes Statement
The <notesStmt> element is the sixth component of the <fileDesc> element and is optional. If used, it contains one or more <note> elements, each containing a single piece of descriptive information of the kind treated as ‘general notes’ in traditional bibliographic descriptions.
Some information found in the notes area in conventional bibliography has been assigned specific elements in these Guidelines; in particular the following items should be tagged as indicated, rather than as general notes:
Nevertheless, the <notesStmt> element may be used to record potentially significant details about the file and its features, e.g.:
Each such item of information should be tagged using the general-purpose <note> element, which is described in section 6.8 Notes, Annotation, and Indexing. Groups of notes are contained within the <notesStmt> element, as in the following example:
<notesStmt>
<note>Historical commentary provided by Mark Cohen.</note>
<note>OCR scanning done at University of Toronto.</note>
</notesStmt>
The notes statement has the following formal definition:
<!-- 5.2.6: The notes statement-->
<!ELEMENT notesStmt %om.RO; (note+)>
<!ATTLIST notesStmt
%a.global;
TEIform CDATA 'notesStmt' >
<!--The NOTE element is defined with the core tags.-->
<!-- end of 5.2.6-->
5.2.7 The Source Description
The <sourceDesc> element is the seventh and final component of the <fileDesc> element. It is a mandatory element, and is used to record details of the source or sources from which a computer file is derived. This might be a printed text or manuscript, another computer file, an audio or video recording of some kind, or a combination of these. An electronic file may also have no source, if what is being catalogued is an original text created in electronic form.
The <sourceDesc> element may contain a simple prose description, or, more usefully, a bibliographic citation of some kind specifying the provenance of the text. For written or printed sources, the source should be described in the same way as any other bibliographic citation, using one of the following elements:
When the header describes a transcription of spoken material, the <sourceDesc> element may also include the following special-purpose elements, intended for cases where an electronic text is derived from a spoken text rather than a written one:
The <sourceDesc> element may contain a mixture of one or more of the above elements, as in the following examples:
<sourceDesc>
<bibl>The first folio of Shakespeare, prepared by Charlton Hinman (The Norton Facsimile, 1968)</bibl>
</sourceDesc>
<sourceDesc>
<p>No source: created in machine-readable form.</p>
</sourceDesc>
<sourceDesc>
<biblStruct lang='FR'>
<monogr>
<author>Eugène Sue</author>
<title>Martin, l'enfant trouvé</title>
<title type='sub'>Mémoires d'un valet de chambre</title>
<imprint>
<pubPlace>Bruxelles et Leipzig</pubPlace>
<publisher>C. Muquardt</publisher>
<date value="1846">1846</date>
</imprint>
</monogr></biblStruct>
</sourceDesc>
The source description itself has the following formal definition:
<!-- 5.2.7: The source description-->
<!ELEMENT sourceDesc %om.RR; (p | bibl | biblFull | biblStruct
| listBibl | scriptStmt | recordingStmt )+ >
<!ATTLIST sourceDesc
%a.global;
%a.declarable;
TEIform CDATA 'sourceDesc' >
[declarations from 5.2.9: Script statement and recording statement
inserted here ]
<!-- end of 5.2.7-->
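Since a source description may mix prose, bibliographic citations, and statements about spoken sources, software may need to discover which kinds are present before processing it further. The following Python sketch is illustrative only; the function name source_kinds and the three labels it returns (prose, citation, spoken) are assumptions introduced for the example, grouping the elements named in the declaration above.

```python
import xml.etree.ElementTree as ET

# Grouping of the elements permitted by the sourceDesc declaration.
KIND_OF = {
    "p": "prose",
    "bibl": "citation", "biblFull": "citation",
    "biblStruct": "citation", "listBibl": "citation",
    "scriptStmt": "spoken", "recordingStmt": "spoken",
}

def source_kinds(source_desc_xml):
    """Return the sorted kinds of description found in a <sourceDesc>."""
    source_desc = ET.fromstring(source_desc_xml)
    kinds = set()
    for child in source_desc:
        if child.tag not in KIND_OF:
            raise ValueError(f"not permitted in sourceDesc: {child.tag}")
        kinds.add(KIND_OF[child.tag])
    return sorted(kinds)

BORN_DIGITAL = """
<sourceDesc>
  <p>No source: created in machine-readable form.</p>
</sourceDesc>
"""
FACSIMILE = """
<sourceDesc>
  <bibl>The first folio of Shakespeare, prepared by Charlton Hinman
  (The Norton Facsimile, 1968)</bibl>
</sourceDesc>
"""

print(source_kinds(BORN_DIGITAL))
print(source_kinds(FACSIMILE))
```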
5.2.8 Computer Files Derived from Other Computer Files
If a machine-readable text (call it B) is based not on a printed source but upon another machine-readable text (call it A) which includes a TEI file header, then the source text of computer file B is another computer file, A. The four sections of A's file header will need to be incorporated into the new header for B in slightly differing ways, as listed below:
5.2.9 Computer Files Composed of Transcribed Speech
Where an electronic text is derived from a spoken text rather than a written one, it will usually be desirable to record additional information about the recording or broadcast which constitutes its source. Several additional elements are provided for this purpose within the source description element:
Note that detailed information about the participants or setting of an interview or other transcript of spoken language should be recorded in the appropriate division of the profile description, discussed in chapter 23 Language Corpora, rather than as part of the source description. The source description is used to hold information only about the source from which the transcribed speech was taken, for example, any script being read and any technical details of how the recording was produced. If the source was a previously-created transcript, it should be treated in the same way as any other source text.
The <scriptStmt> element should be used where it is known that one or more of the participants in a spoken text is speaking from a previously prepared script. The script itself should be documented in the same way as any other written text, using one of the three citation tags mentioned above. Utterances or groups of utterances may be linked to the script concerned by means of the decls attribute, described in section 23.3 Associating Contextual Information with a Text.
<sourceDesc>
<scriptStmt id='CNN12'>
<bibl>
<author>CNN Network News</author>
<title>News headlines</title>
<date value="1991-06-12">12 Jun 91</date>
</bibl>
</scriptStmt>
<!-- this script statement might be used to document the parts
of a spoken transcript which included a news broadcast -->
<!-- possibly other script statements or recording statements follow -->
</sourceDesc>
The <recordingStmt> is used to group together information relating to the recordings from which the spoken text was transcribed. The element may contain either a prose description or, more helpfully, one or more <recording> elements, each corresponding with a particular recording. The linkage between utterances or groups of utterances and the relevant recording statement is made by means of the decls attribute, described in section 23.3 Associating Contextual Information with a Text.
The <recording> element should be used to provide a description of how and by whom a recording was made. This information may be a prose description, within which such items as statements of responsibility, names, places and dates should be identified using the appropriate phrase level tags. The <recording> element takes two additional attributes, as indicated above: type is used to specify the kind of recording concerned and dur to specify its length. In addition, descriptive information relating to the kind of recording equipment used should be specified using the <equipment> element. Where a recording is taken from a public broadcast, details of the broadcast should be given using the <broadcast> element described further below.
Specialized collections may wish to add further sub-elements to these major components. Note however that this element should be used only for information relating to the recording process itself; information about the setting or participants (for example) is recorded elsewhere: see sections 23.2.3 The Setting Description and 23.2.2 The Participants Description below.
<recording type='video'>
<p>U-matic recording made by college audio-visual department staff,
available as PAL-standard VHS transfer or sound-only cassette</p>
</recording>
<recording type='audio' dur="30 min">
<respStmt>
<resp>Location recording by</resp>
<name>Sound Services Ltd.</name>
</respStmt>
<equipment>
<p>Multiple close microphones mixed down to stereo Digital
Audio Tape, standard play, 44.1 kHz sampling frequency</p>
</equipment>
<date>12 Jan 1987</date>
</recording>
When a recording has been made from a public broadcast, details of the broadcast itself should be supplied within the <recording> element, as a nested <broadcast> element. A broadcast is closely analogous to a publication and the <broadcast> element should therefore contain one of the bibliographic citation elements <bibl>, <biblStruct>, or <biblFull>. The broadcasting agency responsible for a broadcast is regarded as its author, while other participants (for example interviewers, interviewees, directors, producers, etc.) should be specified using the <respStmt> or <editor> element with an appropriate <resp> (see further section 6.10 Bibliographic Citations and References).
<recording type='audio' dur="10 min">
<equipment><p>Recorded from FM Radio to digital tape</p></equipment>
<broadcast>
<bibl>
<title>Interview on foreign policy</title> <author>BBC Radio 5</author>
<respStmt><resp>interviewer</resp><name>Robin Day</name></respStmt>
<respStmt><resp>interviewee</resp><name>Margaret Thatcher</name></respStmt>
<series><title>The World Tonight</title></series>
<note>First broadcast on <date value="1989-11-27">27 Nov 1989</date></note>
</bibl>
</broadcast>
</recording>
When a broadcast contains several distinct recordings (for example a compilation), additional <recording> elements may be further nested within the <broadcast> element.
<recording dur='100'>
<broadcast>
<!-- details of broadcast -->
<recording>
<!-- details of broadcast recording -->
</recording>
</broadcast>
</recording>
Formal definitions for the elements discussed in this section are as follows:
<!-- 5.2.9: Script statement and recording statement-->
<!ELEMENT scriptStmt %om.RR; (p+ | bibl | biblFull | biblStruct)>
<!ATTLIST scriptStmt
%a.global;
%a.declarable;
TEIform CDATA 'scriptStmt' >
<!ELEMENT recordingStmt %om.RR; (p+ | recording+ )>
<!ATTLIST recordingStmt
%a.global;
TEIform CDATA 'recordingStmt' >
<!ELEMENT recording %om.RR; (p+ | (respStmt | equipment | broadcast |
date)*)>
<!ATTLIST recording
%a.global;
%a.declarable;
type (audio | video) "audio"
dur CDATA #IMPLIED
TEIform CDATA 'recording' >
<!ELEMENT equipment %om.RO; (p+)>
<!ATTLIST equipment
%a.global;
%a.declarable;
TEIform CDATA 'equipment' >
<!ELEMENT broadcast %om.RR; (p+ | bibl | biblStruct | biblFull | recording)>
<!ATTLIST broadcast
%a.global;
%a.declarable;
TEIform CDATA 'broadcast' >
<!-- end of 5.2.9-->
This concludes the discussion of the <fileDesc> element and its contents.
5.3 The Encoding Description
The <encodingDesc> element is the second major subdivision of the TEI header. It specifies the methods and editorial principles which governed the transcription or encoding of the text in hand and may also include sets of coded definitions used by other components of the header. Though not formally required, its use is highly recommended.
<!-- 5.3: The encoding description-->
<!ELEMENT encodingDesc %om.RR; (projectDesc*, samplingDecl*,
editorialDecl*, tagsDecl?, refsDecl*,
classDecl*, metDecl*, fsdDecl*,
variantEncoding*, p* )>
<!ATTLIST encodingDesc
%a.global;
TEIform CDATA 'encodingDesc' >
[declarations from 5.3.1: The project description inserted here ]
[declarations from 5.3.2: The sampling declaration inserted here ]
[declarations from 5.3.3: The editorial practices declaration inserted
here ]
[declarations from 5.3.4: Tag usage and rendition declarations inserted
here ]
[declarations from 5.3.5.3: The reference scheme declaration inserted
here ]
[declarations from 5.3.6: The classification declaration inserted here ]
[declarations from 5.3.7: The FSD declaration inserted here ]
[declarations from 5.3.8: Metrical Notation Declaration inserted here ]
[declarations from 5.3.9: Variant-Encoding Declaration inserted here ]
<!-- end of 5.3-->
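The content model above fixes the order in which the subdivisions of <encodingDesc> may appear and allows <tagsDecl> at most once. The following Python fragment is a minimal sketch of that check, not a substitute for a validating parser. The names ENCODING_DESC_MODEL, model_to_regex and children_fit_model are invented for this illustration; only the element names and occurrence indicators are taken from the declaration.

```python
import re

# Child order and occurrence indicators copied from the encodingDesc
# content model: "*" means zero or more, "?" means at most one.
ENCODING_DESC_MODEL = [
    ("projectDesc", "*"), ("samplingDecl", "*"), ("editorialDecl", "*"),
    ("tagsDecl", "?"), ("refsDecl", "*"), ("classDecl", "*"),
    ("metDecl", "*"), ("fsdDecl", "*"), ("variantEncoding", "*"),
    ("p", "*"),
]


def model_to_regex(model):
    """Turn a simple sequence model into a regex over 'name ' tokens."""
    parts = [f"(?:{re.escape(name)} ){occurs}" for name, occurs in model]
    return re.compile("".join(parts))


def children_fit_model(child_names, model=ENCODING_DESC_MODEL):
    """Return True if the child element names respect order and counts."""
    tokens = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(tokens) is not None


print(children_fit_model(["projectDesc", "editorialDecl", "tagsDecl"]))
print(children_fit_model(["editorialDecl", "projectDesc"]))
```

The first call reports a sequence that fits the model; the second reports one that does not, because <projectDesc> must precede <editorialDecl>.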
5.3.1 The Project Description
The <projectDesc> element is the first of the nine optional subdivisions of the <encodingDesc> element. It may be used to describe, in prose, the purpose for which the electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected. This is of particular importance for corpora or miscellaneous collections, but may be of use for any text, for example to explain why one kind of encoding practice has been followed rather than another.
<encodingDesc>
<projectDesc>
<p>Texts collected for use in the
Claremont Shakespeare Clinic, June 1990.</p>
</projectDesc>
</encodingDesc>
This element has the following formal declaration:
<!-- 5.3.1: The project description-->
<!ELEMENT projectDesc %om.RO; (p+)>
<!ATTLIST projectDesc
%a.global;
%a.declarable;
TEIform CDATA 'projectDesc' >
<!-- end of 5.3.1-->
5.3.2 The Sampling Declaration
The <samplingDecl> element is the second of the nine optional subdivisions of the <encodingDesc> element. It contains a prose description of the rationale and methods used in sampling texts, for example to create a representative corpus. The description may cover such matters as the size of the samples and how they were selected, but is not restricted to these.
<samplingDecl>
<p>Samples of 2000 words taken from the beginning of the text.</p>
</samplingDecl>
It may also include a simple description of any parts of the source text included or excluded.
<samplingDecl>
<p>Text of stories only has been transcribed. Pull quotes, captions,
and advertisements have been silently omitted. Any mathematical
expressions requiring symbols not present in the ISOnum or ISOpub
entity sets have been omitted, and their place marked with a GAP
element.</p>
</samplingDecl>
A sampling declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the decls attribute of each text (or subdivision of the text) to which the sampling declaration applies may be used to supply a cross reference to it, as further described in section 23.3 Associating Contextual Information with a Text.
This element has the following formal declaration:
<!-- 5.3.2: The sampling declaration-->
<!ELEMENT samplingDecl %om.RO; (p+)>
<!ATTLIST samplingDecl
%a.global;
%a.declarable;
TEIform CDATA 'samplingDecl' >
<!-- end of 5.3.2-->
5.3.3 The Editorial Practices Declaration
The <editorialDecl> element is the third of the nine optional subdivisions of the <encodingDesc> element. It is used to provide details of the editorial practices applied during the encoding of a text.
Any information about the editorial principles applied not falling under one of the above headings should be recorded in a distinct list of items. Experience shows that a full record should be kept of decisions relating to editorial principles and encoding practice, both for future users of the text and for the project which produced the text in the first instance. A simple example follows:
<editorialDecl id="e2">
<interpretation>
<p>The part of speech analysis applied throughout section 4 was
added by hand and has not been checked.</p>
</interpretation>
<correction>
<p>Errors in transcription controlled by using the
WordPerfect spelling checker.</p>
</correction>
<normalization source="W9">
<p>All words converted to Modern American spelling using
Webster's 9th Collegiate dictionary.</p>
</normalization>
<quotation marks="all" form="std">
<p>All opening quotation marks represented by entity reference ODQ; all closing
quotation marks represented by entity reference CDQ.</p>
</quotation>
</editorialDecl>
These elements are formally defined as follows:
<!-- 5.3.3: The editorial practices declaration-->
<!ELEMENT editorialDecl %om.RO; ( p+ | ((correction | normalization
| quotation | hyphenation | interpretation
| segmentation | stdVals)+, p*))>
<!ATTLIST editorialDecl
%a.global;
%a.declarable;
TEIform CDATA 'editorialDecl' >
<!ELEMENT correction %om.RO; (p+)>
<!ATTLIST correction
%a.global;
%a.declarable;
status (high | medium | low | unknown) "unknown"
method (silent | tags) "silent"
TEIform CDATA 'correction' >
<!ELEMENT normalization %om.RO; (p+)>
<!ATTLIST normalization
%a.global;
%a.declarable;
source CDATA #IMPLIED
method ( silent | tags ) "silent"
TEIform CDATA 'normalization' >
<!ELEMENT quotation %om.RO; (p+)>
<!ATTLIST quotation
%a.global;
%a.declarable;
marks ( none | some | all ) "all"
form (data | rend | std | nonstd | unknown) "unknown"
TEIform CDATA 'quotation' >
<!ELEMENT hyphenation %om.RO; (p+)>
<!ATTLIST hyphenation
%a.global;
%a.declarable;
eol ( all | some | hard | none ) "some"
TEIform CDATA 'hyphenation' >
<!ELEMENT segmentation %om.RO; (p+)>
<!ATTLIST segmentation
%a.global;
%a.declarable;
TEIform CDATA 'segmentation' >
<!ELEMENT stdVals %om.RO; (p+)>
<!ATTLIST stdVals
%a.global;
%a.declarable;
TEIform CDATA 'stdVals' >
<!ELEMENT interpretation %om.RO; (p+)>
<!ATTLIST interpretation
%a.global;
%a.declarable;
TEIform CDATA 'interpretation' >
<!-- end of 5.3.3-->
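Several of the attributes declared above carry default values, so software reading an <editorialDecl> should treat an absent attribute as if its default had been given. The following Python fragment is a sketch of that behaviour for a processor that does not read the DTD itself. The DEFAULTS table is transcribed by hand from the declarations above (the fixed TEIform attribute is left out), and the function name effective_attributes is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Attribute defaults transcribed from the ATTLIST declarations above.
# Attributes declared #IMPLIED (such as source) have no default.
DEFAULTS = {
    "correction": {"status": "unknown", "method": "silent"},
    "normalization": {"method": "silent"},
    "quotation": {"marks": "all", "form": "unknown"},
    "hyphenation": {"eol": "some"},
}


def effective_attributes(element):
    """Merge declared defaults with the attributes actually present."""
    merged = dict(DEFAULTS.get(element.tag, {}))
    merged.update(element.attrib)  # explicit values override defaults
    return merged


# A cut-down version of the example editorial declaration.
sample = ET.fromstring(
    '<editorialDecl id="e2">'
    '<correction><p>Spelling checked.</p></correction>'
    '<normalization source="W9"><p>Modernized.</p></normalization>'
    '<quotation marks="all" form="std"><p>Entities used.</p></quotation>'
    '</editorialDecl>'
)

for child in sample:
    print(child.tag, effective_attributes(child))
```

A validating parser that reads the DTD would supply these defaults itself; the table is needed only when the document is parsed without it.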
An editorial practices declaration which applies to more than one text
or division of a text need not be repeated in the header of each such
text. Instead, the decls attribute of each text (or
subdivision of the text) to which it applies may
be used to supply a cross reference to it, as further described in
section 23.3 Associating Contextual Information with a Text.
5.3.4 The Tagging Declaration
The <tagsDecl> element is the fourth of the nine optional subdivisions of the <encodingDesc> element. It is used to record the following information about the tagging used within a particular text:
This information is conveyed by the following elements:
The <tagsDecl> element consists of an optional sequence of <rendition> elements, each of which must bear a unique identifier, followed by a sequence of <tagUsage> elements, one for each distinct element occurring within the outermost <text> element of a TEI document.
The <rendition> element defined in this version of the TEI Guidelines is a preliminary proposal only, intended to provide a hook for more detailed specifications of default rendition in later versions. The present proposal allows the encoder to enter an informal description of a rendition, or style, as running prose only. This rendition will be assumed to apply, by default, to all occurrences of an element which names its identifier as the value of the render attribute of the appropriate <tagUsage> element. For element occurrences to which this default rendition does not apply, the encoder should specify an explicit description using the global rend attribute on the elements concerned.
For example, the following schematic shows how an encoder might specify that <p> elements are by default to be rendered using one set of specifications identified as style1, while <hi> elements are to use a different set, identified as style2:
<tagsDecl>
<rendition id="style1">
... description of one default rendition here ...
</rendition>
<rendition id="style2">
... description of another default rendition here ...
</rendition>
<tagUsage gi="p" render="style1"> ... </tagUsage>
<tagUsage gi="hi" render="style2"> ... </tagUsage>
<!-- ... -->
</tagsDecl>
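The lookup just described (an explicit rend value on an element takes precedence, otherwise the rendition named by the render attribute of the matching <tagUsage> applies) can be sketched in Python as follows. This is one possible reading of the proposal, not a defined processing model; the function names and the sample rendition texts are invented for this illustration.

```python
import xml.etree.ElementTree as ET


def build_default_renditions(tags_decl):
    """Map each element name (gi) to the prose of its default rendition."""
    renditions = {
        r.get("id"): (r.text or "").strip()
        for r in tags_decl.findall("rendition")
    }
    defaults = {}
    for usage in tags_decl.findall("tagUsage"):
        render = usage.get("render")
        if render is not None:
            defaults[usage.get("gi")] = renditions[render]
    return defaults


def rendition_for(element, defaults):
    """An explicit rend attribute overrides the declared default."""
    explicit = element.get("rend")
    if explicit is not None:
        return explicit
    return defaults.get(element.tag)


# Hypothetical rendition descriptions standing in for the elided prose.
decl = ET.fromstring(
    '<tagsDecl>'
    '<rendition id="style1">indented paragraph</rendition>'
    '<rendition id="style2">italic</rendition>'
    '<tagUsage gi="p" render="style1"/>'
    '<tagUsage gi="hi" render="style2"/>'
    '</tagsDecl>'
)
defaults = build_default_renditions(decl)

print(rendition_for(ET.fromstring('<hi>word</hi>'), defaults))
print(rendition_for(ET.fromstring('<hi rend="bold">word</hi>'), defaults))
```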
No detailed proposals for the content of the <rendition> element have as yet been formulated. Earlier versions of these Guidelines suggested that specifications derived from, or compatible with, the properties standardized as part of the Document Style Semantics and Specification Language (ISO/IEC 10179) might be useful; the Cascading Stylesheet Language (http://www.w3.org/TR/REC-CSS1) is another possible candidate vehicle for their expression, as is the XML vocabulary for specifying formatting semantics which forms a part of the W3C's Extensible Stylesheet Language (http://www.w3.org/TR/xsl).
A <tagsDecl> need not specify any <rendition> element. It must however contain exactly one occurrence of a <tagUsage> element for each distinct element marked within the outermost <text> element associated with the <teiHeader> in which it appears. The <tagUsage> element is used to supply a count of the number of occurrences of this element within the text, which is given as the value of its occurs attribute. It may also be used to hold any additional usage information, which is supplied as running prose within the element itself.
<tagUsage gi="hi" occurs="28">
Used only to mark English words italicised in the copy text.
</tagUsage>
This indicates that the <hi> element appears a total of 28 times
in the <text> element in question, and that the encoder has used
it to mark italicised English words only.
The ident attribute may optionally be used to specify how many of the occurrences of the element in question bear a value for the global id attribute, as in the following example:
<tagUsage gi="pb" occurs="321" ident="321">
Marks page breaks in the York (1734) edition only
</tagUsage>
This indicates that the <pb> element occurs 321 times, and that each
occurrence bears an identifier.
The content of the <tagUsage> element is not susceptible to automatic processing. It should not therefore be used to hold information for which provision is already made by other components of the encoding description.
A TEI-conformant document is not required to contain a <tagsDecl> element, but if one is present, it must contain <tagUsage> elements for each distinct element marked in the associated text, and the counts specified by their occurs attributes must correspond with the number of such elements present in the document, as identified by some conforming processor.
<!-- 5.3.4: Tag usage and rendition declarations-->
<!ELEMENT tagsDecl %om.RO; (rendition*, tagUsage*) >
<!ATTLIST tagsDecl
%a.global;
TEIform CDATA 'tagsDecl' >
<!ELEMENT tagUsage %om.RO; %paraContent; >
<!ATTLIST tagUsage
%a.global;
gi CDATA #REQUIRED
occurs CDATA #IMPLIED
ident CDATA #IMPLIED
render IDREF #IMPLIED
TEIform CDATA 'tagUsage' >
<!ELEMENT rendition %om.RO; %paraContent; >
<!ATTLIST rendition
%a.global;
TEIform CDATA 'rendition' >
<!-- end of 5.3.4-->
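The requirement that the counts in a <tagsDecl> agree with the text can be checked mechanically. The following Python fragment is a sketch of such a check under stated assumptions: it counts only elements inside <text> (not <text> itself), it reads a much simplified document in which <tagsDecl> sits directly inside <teiHeader>, and the function names are invented for this illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_elements(text_element):
    """Count occurrences, and occurrences bearing an id, per element name."""
    occurs, ident = Counter(), Counter()
    for el in text_element.iter():
        if el is text_element:
            continue  # assumption: the <text> element itself is not counted
        occurs[el.tag] += 1
        if el.get("id") is not None:
            ident[el.tag] += 1
    return occurs, ident


def check_tags_decl(tags_decl, text_element):
    """Return a list of discrepancies between tagsDecl and the text."""
    occurs, ident = count_elements(text_element)
    problems, declared = [], set()
    for usage in tags_decl.findall("tagUsage"):
        gi = usage.get("gi")
        if gi in declared:
            problems.append(f"{gi}: more than one tagUsage")
        declared.add(gi)
        for attr, actual in (("occurs", occurs[gi]), ("ident", ident[gi])):
            claimed = usage.get(attr)
            if claimed is not None and int(claimed) != actual:
                problems.append(f"{gi}: {attr}={claimed}, found {actual}")
    for gi in sorted(set(occurs) - declared):
        problems.append(f"{gi}: no tagUsage")
    return problems


# Simplified sample: the count declared for <hi> is deliberately wrong.
sample = ET.fromstring(
    '<TEI.2><teiHeader><tagsDecl>'
    '<tagUsage gi="body" occurs="1"/>'
    '<tagUsage gi="p" occurs="2"/>'
    '<tagUsage gi="hi" occurs="3"/>'
    '<tagUsage gi="pb" occurs="1" ident="1"/>'
    '</tagsDecl></teiHeader>'
    '<text><body><p>a <hi>b</hi><pb id="p1"/></p><p><hi>c</hi></p></body></text>'
    '</TEI.2>'
)

print(check_tags_decl(sample.find(".//tagsDecl"), sample.find("text")))
```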
5.3.5 The Reference System Declaration
The <refsDecl> element is the fifth of the nine optional subdivisions of the <encodingDesc> element. It is used to document the way in which any standard referencing scheme built into the encoding works, either as a series of prose paragraphs or by using the following specialized elements:
A referencing scheme may be described in one of three ways using this element:
Each method is described in more detail below. Only one method can be used within a single <refsDecl> element. More than one <refsDecl> element can be included in the header if more than one canonical reference scheme is to be used in the same document, but the current proposals do not check for mutual inconsistency. A reference declaration can only describe the referencing system applicable to a single document type; if therefore concurrent document types are in use (as discussed in section 6.9 Reference Systems), a <refsDecl> element must be supplied for each; the doctype attribute should be used to specify the document type to which the declaration relates.
5.3.5.1 Prose Method
The referencing scheme may be specified within the <refsDecl> by a simple prose description. Such a description should indicate which elements carry identifying information, and whether this information is represented as attribute values or as content. Any special rules about how the information is to be interpreted when reading or generating a reference string should also be specified here. Such a prose description cannot be processed automatically, and this method of specifying the structure of a canonical reference system is therefore not recommended where automatic processing is intended.
<refsDecl>
<p>The N attribute of each text in this corpus carries a unique
identifying code for the whole text. The title of the text is held
as the content of the first HEAD element within each text. The N
attribute on each DIV1 and DIV2 contains the canonical reference
for each such division, in the form 'XX.yyy', where XX is the book
number in Roman numerals, and yyy the section number in arabic.
Line breaks are marked by empty LINEBREAK elements, each of which
includes the through line number in Casaubon's edition as the
value of its N attribute.</p>
<p>The through line number and the text identifier uniquely identify
any line. A canonical reference may be made up by concatenating
the N values from the TEXT, DIV1, or DIV2 and calculating the line
number within each part.</p>
</refsDecl>
5.3.5.2 Stepwise Method
This method defines each reference as a series of steps, each of which corresponds to a single pair of expressions in the TEI extended pointer notation (for which see section 14.2 Extended Pointers). Often, but not always, each step will also correspond to one portion of the canonical reference itself; in many common forms of canonical reference, each step will narrow the scope within which the next step can be taken. The <refsDecl> element must specify the steps, delimiters, and lengths to be used by an application program, both when constructing references for a given location and when interpreting canonical references within a given document hierarchy. It does so by supplying one or more <step> elements, each of which identifies the type of ‘reference unit’ handled by the step and uses a pair of extended-pointer expressions to indicate the starting and ending pointers of the portion of the document which corresponds to a given portion of the reference string. The element may also give either a delimiter or a length for use in breaking the corresponding reference string up into units.
For example, the reference ‘Matthew 5:29’ might be constructed by stepping down the tree to find an element labelled as the ‘Matthew’ node, then within that to the ‘5’ node, and finally, within that, to the ‘29’ node. The following declarations would be required; the special values %1, %2, and %3 refer here to the strings ‘Matthew’, ‘5’, and ‘29’, respectively.
<refsDecl>
<step refunit="book" delim=" " from="DESCENDANT (1 DIV1 N %1)"/>
<step refunit="chapter" delim=":" from="DESCENDANT (1 DIV2 N %2)"/>
<step refunit="verse" from="DESCENDANT (1 DIV3 N %3)"/>
</refsDecl>
As this example also shows, the steps of such a reference are typically separated by fixed character sequences, called delimiters. In this example, the delimiters are a space (following ‘Matthew’) and a colon (following the chapter number). A processor for canonical references would use the delimiters specified by the delim attributes to break the reference string up into pieces; the pieces would then be used to interpret the %1, etc., in the extended pointer expressions of the from and to attributes.
An alternative to the use of delimiters is to specify a fixed length for each step of the reference: for example, the same reference might be given as ‘MAT05029’, assuming a fixed length of 3 for the first step, 2 for the second, and 3 for the third.
<refsDecl>
<step length="3" from="DESCENDANT (1 DIV1 N %1)"/>
<step length="2" from="DESCENDANT (1 DIV2 N %2)"/>
<step length="3" from="DESCENDANT (1 DIV3 N %3)"/>
</refsDecl>
The order in which the <step> elements are supplied corresponds here with the order of elements within the reference, with the largest (that is, the one nearest the top of the document hierarchy) item first and the smallest last. For a description of the processing required when a canonical reference defined by <step> elements is to be recognized, and examples of its use, see chapter 32 Algorithm for Recognizing Canonical References.
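The first stage of that processing, breaking a reference string into pieces by delimiter or by fixed length and substituting the pieces for %1, %2, and so on, can be sketched in Python as follows. This is an illustration only: it does not evaluate the extended pointer expressions, and it is not the algorithm of chapter 32. The function names are invented here, and the treatment of a step that carries both a length and a delimiter (take the fixed length, then skip the delimiter if present) is an assumption.

```python
import re


def split_reference(reference, steps):
    """Split a canonical reference using each step's delim or length."""
    parts, rest = [], reference
    for step in steps:
        length, delim = step.get("length"), step.get("delim")
        if length is not None:
            n = int(length)
            if len(rest) < n:
                raise ValueError(f"reference too short: {reference!r}")
            part, rest = rest[:n], rest[n:]
            # Assumption: a delimiter given alongside a length is skipped.
            if delim and rest.startswith(delim):
                rest = rest[len(delim):]
        elif delim:
            part, sep, rest = rest.partition(delim)
            if not sep:
                raise ValueError(f"delimiter {delim!r} not found")
        else:
            part, rest = rest, ""  # last step takes what remains
        parts.append(part)
    if rest:
        raise ValueError(f"unconsumed text: {rest!r}")
    return parts


def fill_pointer(template, parts):
    """Replace %1, %2, ... in a pointer expression with the pieces."""
    return re.sub(r"%(\d+)", lambda m: parts[int(m.group(1)) - 1], template)


by_delim = [{"delim": " "}, {"delim": ":"}, {}]
by_length = [{"length": "3"}, {"length": "2"}, {"length": "3"}]

print(split_reference("Matthew 5:29", by_delim))
print(split_reference("MAT05029", by_length))
print(fill_pointer("DESCENDANT (1 DIV2 N %2)", ["Matthew", "5", "29"]))
```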
5.3.5.3 Milestone Method
This method is appropriate when only ‘milestone’ tags (see section 6.9.3 Milestone Tags) are available to provide the required referencing information. It does not provide any abilities which cannot be mimicked by the stepwise referencing method discussed in the previous section, but in the cases where it applies, it provides a somewhat simpler notation. A reference based on milestone tags concatenates the values specified by one or more such tags. Since each tag marks the point at which a value changes, it may be regarded as specifying the state of a variable. A reference declaration using this method therefore specifies the individual components of the canonical reference as a sequence of <state> elements:
For example, the reference ‘Matthew 12:34’ might be thought of as representing the state of three variables: the ‘book’ variable is in state ‘Matthew’; the chapter variable is in state ‘12’, and the verse variable is in state ‘34’. If milestone tagging has been used, there should be a tag marking the point in the text at which each of the above ‘variables’ changes its state. To find ‘Matthew 12:34’ therefore an application must scan left to right through the text, monitoring changes in the state of each of these three variables as it does so. When all three are simultaneously in the required state, the desired point will have been reached. There may of course be several such points.
The delim and length attributes are used to specify components of a canonical reference using this method in exactly the same way as for the stepwise method described in the preceding section. The other attributes are used to determine which instances of <milestone> tags in the text are to be checked for state-changes. A state-change is signalled whenever a new <milestone> tag is found with unit and, optionally, ed attributes identical to those of the <state> element in question. The value for the new state may be given explicitly by the n attribute on the <milestone> element, or it may be implied, if the n attribute is not specified.
For example, for canonical references in the form ‘xx.yyy’ where the ‘xx’ represents the page number in the first edition, and ‘yyy’ the line number within this page, a reference system declaration such as the following would be appropriate:
<refsDecl>
<state ed="first" unit="page" length="2" delim="."/>
<state ed="first" unit="line" length="3"/>
</refsDecl>
This implies that milestone tags of the form
<milestone n="II" ed="first" unit="page"/>
<milestone ed="first" unit="line"/>
will be found throughout the text, marking the positions at which page and line numbers change.
Note that no value has been specified for the n attribute on the second milestone tag above; this implies that its value at each state change is monotonically increased. For more detail on the use of milestone tags, see section 6.9.3 Milestone Tags. The milestone referencing scheme, though conceptually simple, is not supported by a generic SGML or XML parser. Its use places a correspondingly greater burden of verification and accuracy on the encoder.
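The left-to-right scan described in this section can be sketched in Python as follows. Several points are assumptions made for the illustration rather than rules stated here: an implied value is taken to be the previous value plus one (starting from 1), a change in an earlier-listed state is taken to restart the later ones, and the target values are supplied already split and without zero padding. The function name scan_milestones is likewise invented.

```python
import xml.etree.ElementTree as ET


def scan_milestones(root, states, target):
    """Return the milestones at which every state holds its target value.

    states: list of (unit, ed) pairs, coarsest first; ed may be None.
    target: list of required values, in the same order as states.
    """
    current = [None] * len(states)
    hits = []
    for el in root.iter("milestone"):  # document order
        for i, (unit, ed) in enumerate(states):
            if el.get("unit") != unit:
                continue
            if ed is not None and el.get("ed") != ed:
                continue
            n = el.get("n")
            if n is None:
                # Assumption: an implied value counts up by one.
                n = str(int(current[i] or 0) + 1)
            current[i] = n
            # Assumption: finer-grained states restart on this change.
            for j in range(i + 1, len(states)):
                current[j] = None
            if current == list(target):
                hits.append(el)
    return hits


sample = ET.fromstring(
    '<text>'
    '<milestone n="I" ed="first" unit="page"/>'
    '<milestone ed="first" unit="line"/>alpha'
    '<milestone ed="first" unit="line"/>beta'
    '<milestone n="II" ed="first" unit="page"/>'
    '<milestone ed="first" unit="line"/>gamma'
    '<milestone ed="first" unit="line"/>delta'
    '<milestone n="II" ed="second" unit="page"/>'
    '</text>'
)

states = [("page", "first"), ("line", "first")]
for hit in scan_milestones(sample, states, ["II", "2"]):
    print(hit.tail)
```

In the sample, the only point at which the page state is ‘II’ and the line state is ‘2’ is the milestone preceding the word delta; the milestone for the second edition is ignored because its ed value differs.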