Converting Leiden-style editions to TEI Lite XML


by T. J. Finney, ATLA Serials Project, 2001


1. Introduction


These recommendations concern the translation into TEIxLite documents of printed editions that employ the Leiden conventions defined in Chronique d'Égypte 13-14 (1932), pages 285-87. They may also be applied where a transcription is made directly from a manuscript. TEIxLite is an Extensible Markup Language (XML) version of the TEI Lite document type definition. TEI Lite (TEI U5) represents a subset of the full Text Encoding Initiative guidelines (TEI P3).


The recommendations should be read in conjunction with the TEI Lite specification. Although TEI Lite is adequate for most features encountered in a printed edition, there are situations where the encoding methods of the full TEI guidelines are better. Following TEI Lite allows the present recommendations to use a widely adopted framework that is relatively well supported. This in turn should maximize the utility of Leiden-style editions that have been translated into TEIxLite documents according to these recommendations. However, the gain is achieved at the cost of bending less appropriate features of TEI Lite to purposes for which entirely appropriate features exist in the full TEI guidelines. The present recommendations take a minimalist approach to rendering features likely to be encountered in Leiden-style transcriptions. A more comprehensive approach using an XML version of the full TEI guidelines would be less vulnerable to charges of 'tag abuse'.


A brief introduction to XML may help in understanding what follows. XML provides a way to describe the structure and content of a text. A document type definition (DTD) sets down definitions and structural rules to be followed by conforming documents. Not all XML documents require a DTD, but a document that declares one must conform to it in order to be valid. TEIxLite documents conform to the TEIxLite DTD.


Features of a text are marked and described using markup elements that consist of start and end tags contained in angle brackets. Start tags may include attributes that take on particular values.


E.g. <anElement someAttribute='x' anotherAttribute='y'>marked text</anElement>


An element may be empty, in which case its start and end tags can be contracted.


E.g. <emptyElement someAttribute='x' anotherAttribute='y'/>
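The relationship between elements, attributes, and marked text can be seen by parsing the two examples above. The following is only an illustrative sketch using Python's standard library; the function name 'describe' is hypothetical and not part of any TEI tool.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Parse a single element and return its name, attributes, and text."""
    element = ET.fromstring(xml_text)
    return element.tag, dict(element.attrib), element.text


# An element with a start tag, marked text, and an end tag.
full = describe(
    "<anElement someAttribute='x' anotherAttribute='y'>marked text</anElement>"
)

# The contracted form of an empty element.
contracted = describe("<emptyElement someAttribute='x' anotherAttribute='y'/>")

# The same empty element written with separate start and end tags.
expanded = describe(
    "<emptyElement someAttribute='x' anotherAttribute='y'></emptyElement>"
)
```

A parser should report the contracted and expanded forms of an empty element identically, which is why the contraction is safe to use.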


XML also uses entity and character references. These offer a shorthand method of referring to predefined items by their names and Unicode numbers, respectively. An entity reference may be used to represent a character that cannot be directly entered at the keyboard. Entity references must be defined and included before they can be used. The TEIxLite DTD currently includes four sets of standard entity references defined in the files 'iso-lat1.ent', 'iso-lat2.ent', 'iso-num.ent', and 'iso-pub.ent'. These cover a range of symbols, punctuation, and Latin characters with diacritical marks. An entity reference consists of the entity's name placed between '&' and ';'.


E.g. soft hyphen: &shy;


Character references refer to Unicode characters. Almost every conceivable character is included in Unicode. A particular character's code can be determined from the code charts found at the Unicode web site. Decimal character references are placed between '&#' and ';', while hexadecimal (i.e. base sixteen) references are placed between '&#x' and ';'. Codes given in the Unicode charts are hexadecimal.


E.g. soft hyphen: &#0173; = &#x00AD;


There is no need to use character references if Unicode can be entered directly.
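The equivalence of the decimal and hexadecimal forms can be checked with a few lines of Python. This is a sketch only: 'char_refs' is a hypothetical helper, and 'html.unescape' is used here as a convenient stand-in for an XML parser because it resolves numeric references in the same way.

```python
import html


def char_refs(character):
    """Return the decimal and hexadecimal character references for a character."""
    code_point = ord(character)
    return f"&#{code_point};", f"&#x{code_point:04X};"


SOFT_HYPHEN = "\u00ad"
decimal_ref, hex_ref = char_refs(SOFT_HYPHEN)

# Both references resolve to the same character; leading zeros are ignored.
same_character = html.unescape("&#0173;") == html.unescape("&#x00AD;")
```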


Returning now to the recommendations, an essential part of a valid TEI document is the header—the electronic title page. Among other things, the header includes information concerning the provenance of a text, and serves to distinguish between the editor of a printed transcription and the person responsible for its conversion to a TEIxLite document.


The first aim of the person converting a printed edition should be to faithfully reproduce its content. Responsibility for editorial decisions is assumed to belong to the editor of the printed transcription and not the one converting it to XML. Consequently, the 'resp' attribute of the elements used in these recommendations should have a default value of 'ed' or the editor's initials. If it is thought necessary to introduce changes, responsibility for each change needs to be noted using one of the methods provided in TEI Lite (see TEI U5 sections 9 and 20). The <sic> element is suitable for the purpose, with the value of its 'resp' attribute identifying the person responsible. In addition, a log of changes should be kept in the header's revision description.


2. Recommendations


The recommendations are presented below, either as Leiden-style features paired with methods of rendition, or as example renditions under general headings. Each recommendation may include a description of the Leiden-style feature, an authoritative basis [in square brackets], and commentary. References to TEI documents give the document number then section number. For example, TEI P3 18.2.3 means TEI P3, section 18.2.3, and TEI U5 10 means TEI U5, section 10. Recommendations for which no authoritative basis is shown should be treated with due caution.


a\.b\.g\.d\. : <unclear>abgd</unclear>

('\.' represents a dot printed beneath the preceding letter.)

Letters that are really doubtful or so imperfect that, apart from the context, they could be read more than one way.

[TEI P3 18.2.3]


These recommendations assume that doubtfulness or loss of letters is due to manuscript damage. That is, the 'reason' attribute of the relevant elements should have a default value of 'damage'.
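Where a transcription is already held in electronic form, the underdotted letters can be wrapped mechanically. The sketch below assumes that each dot is encoded as the Unicode character COMBINING DOT BELOW (U+0323) following its letter; that encoding, and the function name 'mark_unclear', are assumptions made for illustration.

```python
import re

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, assumed to follow each doubtful letter

# One or more letters, each immediately followed by the combining dot.
UNDERDOTTED_RUN = re.compile(r"(?:[^\W\d_]" + DOT_BELOW + r")+")


def mark_unclear(text):
    """Wrap each run of underdotted letters in an <unclear> element."""

    def wrap(match):
        letters = match.group(0).replace(DOT_BELOW, "")
        return f"<unclear>{letters}</unclear>"

    return UNDERDOTTED_RUN.sub(wrap, text)
```

The dots themselves are dropped from the output because the <unclear> element already records the doubt.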


.... or +-4 : <unclear><gap extent='4'/></unclear>

Illegible letters of which the approximate number is known.

[TEI U5 10]


[....] or [+-4] : <gap extent='4'/>

Lost letters of which the approximate number is known.

[TEI P3 18.1.7]


A letter of which any trace remains belongs outside the brackets.


] or [ ] or [ : <gap/>

Lost letters of which the approximate number is unknown.

[TEI P3 18.2.4]


[abgd] : <add>abgd</add>

The letters are lost, but restored from a parallel or by conjecture.


TEI Lite does not have the <supplied> element for restored text that is contained in the full TEI specification (TEI P3 18.1.7). It is therefore necessary to find an alternative among the available elements. The <gap> element is a logical choice in view of its use for the other categories of lost letters. However, it is an empty element and cannot contain the restored text. This leaves the <add> element, which is suitable for 'letters, words, or phrases inserted in the text by an author, scribe, annotator, or corrector' (TEI U5 10). In this case, the annotator is the editor, and the annotation is the supplement.


a(bgd) : <abbr expan='abgd'>a</abbr>

Parentheses indicate resolution of an abbreviation or symbol.

[TEI P3 6.4.5]


The entire abbreviated word is enclosed in the <abbr> element.


A brevigraph (i.e. scribal symbol for some sequence of letters) may be represented by the corresponding Unicode character or character reference, if it exists.



(1)  a 'kai' compendium:

<abbr expan='kai'>&#x03D7;</abbr>

(2)  'ou' printed with a superscript line for final 'nu':

<abbr expan='oun'>ou&#x0305;</abbr>


If an appropriate Unicode character cannot be found, the letters represented by the brevigraph are given in the 'expan' attribute, and the 'type' attribute is set to a value of 'brevigraph'.



(1)  an abbreviation comprised of letters and a brevigraph that is not encoded:

<abbr expan='lambanomenos' type='brevigraph'>lambano</abbr>

(2)  an abbreviation comprised entirely of a brevigraph that is not encoded:

<abbr expan='estin' type='brevigraph'/>


<abgd> : <sic corr='abgd'/>

Letters the editor regards as mistakenly omitted by the scribe.

[TEI P3 6.5.1]


{abgd} : <sic>abgd</sic>

Letters the editor regards as mistakenly included by the scribe.

[TEI P3 6.5.1]


[[abgd]] : <del hand='hx'>abgd</del>

Letters deleted in the manuscript.

[TEI P3 18.1.4]


The scribe responsible for an alteration is specified using the 'hand' attribute, with recommended values of 'h1' for the first hand (the scribe), 'h2' for the second, and so on. A value of 'hx' may be used if the hand cannot be identified.


E.g. a deletion by the second hand: <del hand='h2'>abgd</del>.


The mode of deletion may be specified using the 'type' attribute. Refer to TEI P3 18.1.4 for suggested values of this attribute.
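The bracket conventions described so far are regular enough to be converted mechanically in simple cases. The following Python sketch is one possible approach, not a complete converter: it assumes transliterated text without nested brackets, ignores the '+-4' notation and lone brackets at line edges, and always assigns deletions to the unidentified hand 'hx'. The names 'LEIDEN', 'to_tei', and 'convert_leiden' are hypothetical.

```python
import re

# One alternative per convention. Order matters: double brackets and the
# special single-bracket forms are tried before the general [abgd] case.
LEIDEN = re.compile(
    r"\[\[(?P<deleted>[^\[\]]+)\]\]"    # [[abgd]] deleted in the manuscript
    r"|\[(?P<dots>\.+)\]"               # [....]   lost, approximate number known
    r"|\[\s*\]"                         # [ ]      lost, number unknown
    r"|\[(?P<restored>[^\[\]]+)\]"      # [abgd]   restored by the editor
    r"|(?P<abbr>\w+)\((?P<rest>\w+)\)"  # a(bgd)   resolved abbreviation
    r"|<(?P<omitted>\w+)>"              # <abgd>   omitted by the scribe
    r"|\{(?P<spurious>\w+)\}"           # {abgd}   included by mistake
)


def to_tei(match):
    """Return the TEIxLite rendition of one matched Leiden feature."""
    found = match.groupdict()
    if found["deleted"] is not None:
        return f"<del hand='hx'>{found['deleted']}</del>"
    if found["dots"] is not None:
        return f"<gap extent='{len(found['dots'])}'/>"
    if found["restored"] is not None:
        return f"<add>{found['restored']}</add>"
    if found["abbr"] is not None:
        expansion = found["abbr"] + found["rest"]
        return f"<abbr expan='{expansion}'>{found['abbr']}</abbr>"
    if found["omitted"] is not None:
        return f"<sic corr='{found['omitted']}'/>"
    if found["spurious"] is not None:
        return f"<sic>{found['spurious']}</sic>"
    return "<gap/>"  # the only alternative without a named group: [ ]


def convert_leiden(line):
    """Convert the Leiden brackets in one line of transcription."""
    return LEIDEN.sub(to_tei, line)
```

A single pass with one combined pattern is used so that the angle brackets of generated tags are never mistaken for the Leiden '<abgd>' convention. Output of this kind would still need to be checked against the printed edition.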


Any values of the hand attribute that are used must be declared in the header's profile description. A separate <ident> element is required for each hand, and the entire set is enclosed in a <creation> element.




<creation>

<ident id='h1'>First hand.</ident>

<ident id='h2'>Second hand.</ident>

<ident id='hx'>Unidentified hand.</ident>

</creation>



This approach is necessary due to the exclusion from TEI Lite of the <hand> and <handList> elements that are featured in the full TEI guidelines.
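Because the declarations and their uses are far apart in the document, it is easy to use a hand that has not been declared. A check along the following lines may help. It is a sketch under simplifying assumptions: the function name 'undeclared_hands' is hypothetical, and the test document passed to it is assumed to be well-formed and free of entity references that only the DTD defines.

```python
import xml.etree.ElementTree as ET


def undeclared_hands(document):
    """List 'hand' values that have no matching <ident> declaration."""
    root = ET.fromstring(document)
    declared = {ident.get("id") for ident in root.iter("ident")}
    used = {
        element.get("hand")
        for element in root.iter()
        if element.get("hand") is not None
    }
    return sorted(used - declared)
```

An empty list means that every hand used in the text has been declared.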


'abgd' : <add hand='hx'>abgd</add>

Interlinear additions which are difficult to print above the lines of the transcription.

[TEI P3 18.1.4]


This method of rendition can also be used for scribal additions that are not interlinear. It is important to always include the 'hand' attribute to distinguish a scribal addition from an editorial supplement. The location of an addition may be specified using the 'place' attribute, with TEI P3 18.1.4 providing a list of suggested values.


E.g. a matching deletion and addition: <del type='subpunction' hand='h2'>dgba</del> <add place='supralinear' hand='h2'>abgd</add>.



abgd : <hi rend='overline'>abgd</hi>

Lines drawn above letters to indicate 'nomina sacra' or numerals.


The function of such a line is to highlight the associated text. In contrast to a brevigraph or compendium, it does not stand for some other text. It is therefore not appropriate to encode the line using character references.


Where the letters are an abbreviation, they should be enclosed in an <abbr> element that provides the corresponding expansion. Letters that represent numerals are enclosed in a <num> element.



(1) a 'nomen sacrum' abbreviation with a superscript line:

<abbr expan='kurios'><hi rend='overline'>ks</hi></abbr>

(2) an 'alpha' used as a numeral:

<num value='1'><hi rend='overline'>a</hi></num>


Overall document structure


Leiden-style editions usually contain an editor's discussion, the manuscript's transcription, associated notes, and a bibliography. This document structure is emulated using <div0> elements with corresponding attribute values. The manuscript transcription is enclosed in a <q> (i.e. quotation) element, thus indicating that it derives from an independent source.



<div0 type='discussion'>

<p>Discussion goes here.</p>

</div0>

<div0 type='transcription'>

<q>Transcription goes here.</q>

</div0>

<div0 type='notes'>

<note>First note goes here.</note>

</div0>

<div0 type='bibliography'>

<listBibl>

<bibl>First bibliographic citation goes here.</bibl>

</listBibl>

</div0>




Any occurrence of the left angle bracket ('<') or ampersand ('&') that is not part of the document's markup must be replaced with the corresponding entity reference (&lt; or &amp;). The right angle bracket ('>') must be replaced with its entity reference (&gt;) when it occurs in the sequence ']]>' if the sequence is not part of the markup.


E.g. <p>… A reverse diple &lt; sometimes occurs …</p>
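When text is prepared by program, the replacement can be left to a library routine. The sketch below uses 'escape' from Python's standard 'xml.sax.saxutils' module; the wrapper name 'escape_text' is hypothetical. Note that 'escape' replaces every right angle bracket, not only those in the sequence ']]>', which is harmless but more than XML requires.

```python
from xml.sax.saxutils import escape


def escape_text(text):
    """Replace '&', '<', and '>' in text content with entity references."""
    return escape(text)
```

This must be applied to text content only, before the markup is added. Applying it to a string that already contains tags or entity references would escape those as well.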


Language and encoding scheme


All languages encountered in the document are declared in the header's profile description. Non-Latin text, diacritical marks, and punctuation may be directly encoded with Unicode characters, or indirectly encoded using a standard transliteration scheme or Unicode character references. If a transliteration scheme is used, it must be identified in the profile description.




<language id='eng'>Text besides the transcription is in English.</language>

<language id='grc'>The transcription is in Greek. The transliteration scheme is TLG Beta Code.</language>



Any foreign word or phrase outside the transcription is marked with a <foreign> element whose 'lang' attribute is set to the relevant language code. The language of the transcription is specified using the enclosing quotation element's 'lang' attribute. If a Unicode character is used to encode a diacritical mark, it follows the modified letter.



(1) a foreign word outside the transcription:

<note>The papyrus reads <foreign lang='grc'>palin</foreign> …</note>

(2) specifying the language of the transcription:

<q lang='grc'>… doulous autou pros …</q>

(3)  a rough breathing encoded using TLG Beta Code:

<q lang='grc'>… o(n …</q>

(4)  the same breathing encoded with a character reference:

<q lang='grc'>… o&#x0314;n …</q>


Note: TLG Beta Code actually uses capitals for Greek letters.


Word division


Whereas editors often insert spaces between words transcribed from 'scriptio continua' manuscripts, they do not normally provide hyphens to mark words divided at line ends. The one translating a Leiden-style transcription to TEIxLite must therefore indicate whether or not the words are divided.


The soft hyphen character (&shy; = &#0173; = &#x00AD;) should be used to indicate word division at the end of a line. The space character is sufficient for word division within lines.



<q lang='grc'>

<lb n='1'/>… doulous autou pros

<lb n='2'/>tous gewrgous labein tous kar&shy;

<lb n='3'/>pous autou kai labontes oi gewr&shy;

<lb n='4'/>goi tous doulous autou o(n men

</q>
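One benefit of marking line-end division with the soft hyphen is that the running text can be recovered by program, for example to build a word index. The following Python sketch assumes that the lines have already been extracted from the <lb> elements and that '&shy;' has been resolved to the soft hyphen character; the function name 'join_divided_words' is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"


def join_divided_words(lines):
    """Rejoin words divided at line ends and return the running text."""
    text = ""
    for line in lines:
        if line.endswith(SOFT_HYPHEN):
            # The word continues on the next line: drop the hyphen, add no space.
            text += line[:-1]
        else:
            # The line ends at a word boundary.
            text += line + " "
    return text.strip()
```

Without the soft hyphens there would be no way to tell 'kar' + 'pous' from two separate words.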



Page, column, and line divisions


TEI Lite has specific elements for page and line breaks but omits the column break element found in the full TEI specification. The more general <milestone> element may be used if it is necessary to indicate a column break. TEI Lite advises against mixing page and line break elements with milestone elements in this manner (TEI U5 5). However, the alternative of using milestone elements for each kind of division is quite costly in terms of the extra keystrokes required: compare <lb n='1'/> with <milestone n='1' unit='line'/>.



(1)  a page break: <pb n='7'/>

(2)  a column break: <milestone n='2' unit='column'/>

(3)  a line break: <lb n='4'/>


These are empty elements that precede the features they mark. The 'n' attribute gives the number of the page, column, or line that begins at the marked point.
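Since each break element precedes the line it numbers, line breaks can be supplied by program once the lines of a transcription have been typed. This is a minimal Python sketch; the function name 'number_lines' and its 'start' parameter are hypothetical.

```python
def number_lines(lines, start=1):
    """Prefix each transcribed line with a numbered <lb/> element."""
    return "\n".join(
        f"<lb n='{number}'/>{line}"
        for number, line in enumerate(lines, start)
    )
```

The 'start' parameter allows numbering to continue from an earlier page or column where the edition numbers lines continuously.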


Use of the milestone element to mark column breaks should be mentioned in the header's editorial declaration.




<p>A milestone element with the 'unit' attribute set to 'column' represents a column break.</p>



Recto, verso, and papyrus direction


Recto and verso sides of a codex leaf may be indicated by appending 'r' or 'v', respectively, to the folio number given in the 'n' attribute of the 'pb' element.


E.g. folio 7 recto: <pb n='7r'/>


Printed editions use arrows to indicate papyrus direction. The corresponding character reference can be used for an arrow that appears in the editorial discussion or notes. Where the arrow is part of the manuscript transcription, it should be encoded by assigning a value of 'horizontal' or 'vertical' to the 'rend' attribute of the 'pb' element. To use a character reference in this context is wrong because it implies that the arrow is part of the manuscript's text.


E.g. fibers horizontal: <pb rend='horizontal'/>


Canonical references


Canonical reference points are marked with empty milestone elements. The 'n' attribute identifies the standard division that begins at the marked point while the 'unit' attribute specifies the kind of division.



<milestone n='Mt' unit='book'/>

               <milestone n='21' unit='chapter'/>

                              <milestone n='34' unit='verse'/>


                              <milestone n='45' unit='verse'/>


Such a use of milestone elements should be mentioned in the header's editorial declaration.




<p>The 'unit' attribute of milestone elements is used to indicate biblical book, chapter, and verse divisions. Book abbreviations follow the ATLAS canonical references schema.</p>
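Because each milestone marks only the start of a division, the full reference at any point has to be assembled from the most recent milestone of each unit. The Python sketch below shows one way to do this. It assumes a well-formed fragment with no DTD-defined entity references, wraps it in a <q> element for parsing, and formats references as 'book chapter:verse'; the function name 'canonical_references' and that format are assumptions.

```python
import xml.etree.ElementTree as ET


def canonical_references(fragment):
    """Return the full reference in force at each verse milestone."""
    root = ET.fromstring(f"<q>{fragment}</q>")
    current = {}  # the most recent 'n' value seen for each unit
    references = []
    for milestone in root.iter("milestone"):
        unit = milestone.get("unit")
        current[unit] = milestone.get("n")
        if unit == "verse":
            book = current.get("book")
            chapter = current.get("chapter")
            references.append(f"{book} {chapter}:{current['verse']}")
    return references
```

Milestones with other 'unit' values, such as 'column', are recorded but do not produce a reference.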





Notes


TEI Lite recommends that, if possible, the body of a note should be inserted in the encoded text at its point of reference (TEI U5 7). As mentioned above, the present recommendations group editorial notes in a separate division in order to emulate the usual structure of a Leiden-style edition. As a consequence, each editorial note needs to be linked to its point of reference.


This is achieved using a <ptr> or <ref> element, depending on whether a single point or a span of text is annotated. The 'target' attribute of the <ptr> or <ref> element is set equal to the 'id' attribute of the relevant note. The one converting the Leiden-style edition to XML must supply unique values for the matching 'target' and 'id' attributes.


An alternative method is required where two notes refer to overlapping spans of text. In such a case, empty <anchor> elements are inserted at the beginning and end of each span. The 'target' and 'targetEnd' attributes of the respective <note> elements are then set equal to the 'id' values supplied for the relevant <anchor> elements.

[TEI U5 8.1]



(1) a <ptr> element marks the reference point:

<lb n='3'/>pous autou kai labontes<ptr target='n1'/> oi gewr&shy;

<note id='n1'><foreign lang='grc'>kai labontes</foreign>: so most MSS; <foreign lang='grc'>labontes de</foreign> 1555 and the Sahidic.</note>

(2) a <ref> element encloses the annotated section:

<lb n='3'/>pous autou <ref target='n1'>kai labontes</ref> oi gewr&shy;

<note id='n1'>…</note>

(3) two notes refer to overlapping spans of text:

<lb n='3'/>pous autou <anchor id='n1a'/>kai<anchor id='n1b'/> labontes<anchor id='n2b'/> oi gewr&shy;

<note target='n1a' targetEnd='n1b'><foreign lang='grc'>kai</foreign>: these letters are lost.</note>

<note target='n1a' targetEnd='n2b'><foreign lang='grc'>kai labontes</foreign>: so most MSS …</note>
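Since the matching 'target' and 'id' values are supplied by hand, it is worth checking that every pointer leads somewhere. The following Python sketch reports any 'target' or 'targetEnd' value with no matching 'id'. It assumes a well-formed document without DTD-defined entity references; the function name 'dangling_targets' is hypothetical.

```python
import xml.etree.ElementTree as ET


def dangling_targets(document):
    """List (element, attribute, value) for links that match no 'id'."""
    root = ET.fromstring(document)
    ids = {
        element.get("id")
        for element in root.iter()
        if element.get("id") is not None
    }
    missing = []
    for element in root.iter():
        for attribute in ("target", "targetEnd"):
            value = element.get(attribute)
            if value is not None and value not in ids:
                missing.append((element.tag, attribute, value))
    return missing
```

The same check covers references from the text to the bibliography, which use the same attributes.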


A manuscript may include a commentary on the primary text. It may also have scribal annotation besides alterations to the primary text. In contrast to editorial notes, the encoded versions of these kinds of annotation are included at the places to which they refer. They are enclosed in <note> elements whose 'type' attribute is set to 'scribal' or 'commentary' as the case requires. Any text of such annotation that is not in the same language as the annotated text needs to be enclosed in a <foreign> element.


Responsibility for the annotation or commentary is indicated with the 'resp' attribute. A scribal note is attributed to the relevant scribe using 'h1' for the first hand, 'h2' for the second, and so on. For commentary, the 'resp' attribute identifies the original author or is set to 'unknown' when authorship has not been established. The location of notes and commentary may be recorded using the 'place' attribute, suggested values of which can be found at TEI P3 6.8.1.



(1) a scribal note placed in the left margin by the third hand:

<note type='scribal' resp='h3' place='left'>amaqestate kai kake. afes ton palaion. mh metapoiei.</note>

(2) commentary by a known author:

<note type='commentary' resp='Ephraem of Syria'>…</note>

(3) commentary by an unknown author:

<note type='commentary' resp='unknown'>…</note>


Bibliographic citations


Bibliographic citations are placed within <bibl> elements which may include further elements such as <author>, <title>, <editor>, <pubPlace>, <publisher>, <date>, and <biblScope>. The <listBibl> element encloses a list of citations.


The <title> element's 'level' attribute takes allowable values of 'm' for monographic (i.e. pertaining to a work published as a distinct item), 's' for series, 'j' for journal, 'u' for unpublished, and 'a' for analytic (i.e. pertaining to articles, poems, etc., published as part of a larger item). There is also a 'type' attribute for classifying the title as 'main', 'subordinate', 'parallel', 'abbreviated', and so on.


The <author> element encloses the statement of primary intellectual responsibility for a work, while the <editor> element contains a secondary statement of responsibility. The latter element's 'role' attribute has a default value of 'editor' but may take any appropriate value including 'translator', 'compiler', or 'illustrator'.


Place of publication, publisher, and date are marked with the corresponding elements shown above. The <biblScope> element contains page numbers, section numbers, etc., that define which parts of the work are referenced. An optional 'type' attribute may be used to specify the kind of reference. Appropriate values include 'pages', 'chapter', 'volume', 'part', and 'issue'. [TEI U5 13, TEI P3 6.10.2]


References in the text are linked to their counterparts in the bibliography using <ptr> or <ref> elements in the same manner as described above for notes.


E.g. a reference in the text that points to an item in the bibliography:

<p><ref target='JDT1997'>Thomas (1997, 8)</ref> regards the hands of P104 and P90 as similar…</p>


<bibl id='JDT1997'>

<author>J. David Thomas</author>

<title level='a'>4404. Matthew XXI 34-37; 43 and 45 (?)</title>

<title level='s'>The Oxyrhynchus Papyri</title>

<biblScope type='volume'>64</biblScope>

<biblScope type='pages'>7-9</biblScope>

<editor>E. W. Handley, U. Wartenberg, et al.</editor>

<pubPlace>London</pubPlace>

<publisher>Egypt Exploration Society</publisher>

<date>1997</date>

</bibl>





Editorial confidence


Where the editor has supplied lost text, resolved an abbreviation or symbol, added text regarded as mistakenly omitted by the scribe, or deleted text regarded as spurious, the 'cert' attribute may be used to indicate the editor's level of confidence in that decision. Values recommended here are 'high', 'med', and 'low'. As a guide, 'high' indicates C > 75% (beyond reasonable doubt), 'med' indicates 25% ≤ C ≤ 75% (doubtful), and 'low' indicates C < 25% (very doubtful), where 'C' is the confidence level. There should be no presumption of the editor's level of confidence unless it is stated or clearly implied.
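Where a confidence level is available as a number, the guide above can be applied as in the following Python sketch. Two points are assumptions: the confidence is expressed as a fraction between 0 and 1, and the boundary values of exactly 25% and 75% are assigned to 'med'. The function name 'cert_value' is hypothetical.

```python
def cert_value(confidence):
    """Map a confidence level between 0 and 1 to a 'cert' attribute value."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    if confidence > 0.75:
        return "high"  # beyond reasonable doubt
    if confidence < 0.25:
        return "low"   # very doubtful
    return "med"       # doubtful; includes the boundary values (an assumption)
```

In keeping with the last sentence above, the function should be called only when the editor has stated or clearly implied a level of confidence; otherwise the 'cert' attribute is better omitted.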


3. Validating the resultant document


The converted Leiden-style edition should be validated using an XML parser. The following outline shows how this might be done:


(1)  Place the completed TEIxLite document (i.e. the converted edition) in a directory along with the TEIxLite DTD and the entity files declared in the DTD.

(2)  Validate the TEIxLite document using an XML parser.


Ensure that the DTD file name matches the one given in the document's DOCTYPE declaration, and that the entity file names match those given in the DTD. The files themselves can be obtained from the following locations:


TEIxLite DTD file:

Standard entity reference files:


A number of XML parsers are available free of charge. Two are mentioned here:


(1) Attempting to open an XML document from within Microsoft's XML Notepad will cause the document to be validated if it has a DTD.


(2) Xerces is a Java-based XML parser provided by the Apache Software Foundation. Both Sun Microsystems' Java platform and the Xerces program must be installed before the parser will run. It can then be invoked using the following command:


java dom.DOMCount -v <filename>


where <filename> is replaced with the file name of the TEIxLite document to be validated.


Each parser will respond with error messages if the document being validated contains XML errors or does not conform to the DTD.
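A quick preliminary check for XML errors can also be made with Python's standard library, as sketched below. This is not a substitute for the validating parsers mentioned above: 'xml.etree.ElementTree' tests only whether a document is well-formed and does not check conformance to the DTD. It will also report entity references such as '&shy;' as undefined, because it does not read the external entity files. The function name 'well_formedness_error' is hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None
```

A typical error caught this way is an empty element written without its closing slash, such as <lb n='1'> in place of <lb n='1'/>.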


4. TEIxLite template


The following is a TEIxLite template for converting Leiden-style editions to XML.


<?xml version="1.0" encoding="ISO-8859-1"?>

<!-- Template for converting a Leiden-style edition to TEIxLite.

     T. J. Finney, ATLA Serials Project, 2001.

     Fill in the places marked ... -->

<!DOCTYPE TEI.2 SYSTEM 'teixlite.dtd'>





<!-- Full bibliographic description of electronic file -->



<!-- Enter title of source text here -->

<title>A machine readable version of ...</title>

<!-- Editor of source text -->


<!-- One responsible for making electronic file -->



<!-- Information on publication of electronic file -->


<!-- One responsible for making electronic file available -->


<!-- Availability. Status may be free, unknown, or restricted. -->

<availability status='...'/>



<!-- Description of source text -->












<!-- Relationship between electronic file and source -->


<!-- Extent of source included -->


<!-- Editorial practices applied during encoding -->


<p>Encoded according to:

<bibl><title>Converting Leiden-style editions to TEI Lite XML</title>

<author>T. J. Finney</author> <date>2001</date></bibl></p>

<!-- Descriptions of milestone usage (optional).

     Delete these p tags and contents if not used. -->



<!-- Declaration of classification systems (optional).

     Delete these classDecl tags and contents if not used. -->


<!-- Insert taxonomy codes and bibliographic references as required -->

<taxonomy id='...'><bibl>...</bibl></taxonomy>




<!-- Description of non-bibliographic aspects of electronic file -->


<!-- Information about creation of the text -->


<!-- Add more hands if required -->

<ident id='h1'>First hand.</ident>

<ident id='h2'>Second hand.</ident>

<ident id='h3'>Third hand.</ident>

<ident id='h4'>Fourth hand.</ident>

<ident id='hx'>Unidentified hand.</ident>



<!-- Insert appropriate language codes and descriptions here.

     cop = Coptic

     eng = English

     fre = French

     ger = German

     grc = Greek (ancient)

     gre = Greek (modern)

     lat = Latin

     See ISO 639-2/B for other three letter language codes.

     Delete the transliteration statement if not applicable. --> 

<language id='...'>Text besides the transcription is in ...</language>

<language id='...'>The transcription is in ... The transliteration scheme is ...</language>


<!-- Classification of electronic file's text (optional).

     Delete these textClass tags and contents if not used. -->


<!-- Insert taxonomy codes and text classifications as required.

     Scheme must match one of the taxonomy codes given above. -->

<classCode scheme='...'>...</classCode>




<!-- Revision history of electronic file. Insert changes as required. -->














<!-- Editor's discussion.

     Sample may be initial, medial, final, unknown, or complete. -->

<div0 type='discussion' sample='...'>




<!-- Manuscript transcription. Sample values as given above. -->

<div0 type='transcription' sample='...'>

<!-- Insert appropriate language code here -->

<q lang='...'>

<pb n='...'/>

<lb n='...'/>...




<!-- Transcription notes. Sample values as given above. -->

<div0 type='notes' sample='...'>




<!-- Bibliography. Sample values as given above. -->

<div0 type='bibliography' sample='...'>











5. Acknowledgments


This set of recommendations was prepared through the support of the American Theological Library Association's Center for Electronic Resources in Theology and Religion. It stems from initial discussions with Gabriel Bodard of TLG, and has been significantly improved through helpful suggestions from a number of people including Nick Nicholas of TLG, James R. Adair of ATLA CERTR, and Lou Burnard of OUCS. Errors, omissions, and infelicities remain my own.




6. Bibliography


Elliott, Tom, Hugh Cayless, and Helen Hawkins. tei.epidoc: structured markup of Greek and Latin epigraphic texts: a proposed "best-practice" guide. Version 0.2. 2001. Online:


"Essai d'unification des méthodes employées dans les éditions de papyrus." Chronique d'Égypte 13-14 (1932): 285-87.


Nicholas, Nick and Rosa Parent, eds. The TLG Beta Code Manual. Rev. and corr. PDF ed. Thesaurus Linguae Graecae, 2000. Online:


Sperberg-McQueen, C. M. and L. Burnard. TEI Lite: An Introduction to Text Encoding for Interchange. TEI U5. Text Encoding Initiative, 1995. Online:


Sperberg-McQueen, C. M. and L. Burnard, eds. Guidelines for Electronic Text Encoding and Interchange. TEI P3, rev. ed. Oxford: Text Encoding Initiative, 1999. Online:


Thomas, J. David. "4404. Matthew XXI 34-37; 43 and 45 (?)". Pages 7-9 in The Oxyrhynchus Papyri. Vol. 64. Edited by E. W. Handley, U. Wartenberg, R. A. Coles, N. Gonis, M. W. Haslam, and J. D. Thomas. London: Egypt Exploration Society, 1997.