16 Linking, Segmentation, and Alignment


This chapter discusses a number of ways in which encoders may represent analyses of the structure of a text which are not necessarily linear or hierarchic. The module defined by this chapter provides for the following common requirements:

These facilities all use the same set of techniques, based on the W3C XPointer framework (Grosso et al. (eds.) (2003)). This framework provides a variety of schemes; the most convenient of them, and the one recommended by these Guidelines, makes use of the global xml:id attribute, as defined in section 1.3.1.1 Global Attributes and introduced in the section of v. A Gentle Introduction to XML titled Identifiers and indicators. When the linking module is included in a schema, the attribute class att.global is extended to include eight additional attributes to support the various kinds of linking listed above. Each of these attributes is introduced in the appropriate section below. In addition, for many of the topics discussed, a choice of methods of encoding is offered, ranging from simple but less general ones, which use attribute values only, to more elaborate and more general ones, which use specialized elements.

16.1 Links

We say that one element points to others if the first has an attribute whose value is a reference to the others: such an element is called a pointer element, or simply a pointer. Among the pointers that have been introduced up to this point in these Guidelines are note, ref, and ptr. These elements all indicate an association between one place in the document (the location of the pointer itself) and one or more others (the elements whose identifiers are specified by the pointer's target attribute). The module described in this chapter introduces a variation on this basic kind of pointer, known as a link, which specifies both ‘ends’ of an association. In addition, we define a syntax for representing locations in a document by a variety of means not dependent on the use of xml:id attributes.

16.1.1 Pointers and Links

In section 3.6 Simple Links and Cross-References we introduced the simplest pointer elements, ptr and ref. Here we introduce additionally the link element, which represents an association between two (or more) locations by specifying each location explicitly. Its own location is irrelevant to the intended linkage.
  • ptr (pointer) defines a pointer to another location.
    target specifies the destination of the pointer by supplying one or more URI References
  • ref (reference) defines a reference to another location, possibly modified by additional text or comment.
    target specifies the destination of the reference by supplying one or more URI References
  • link defines an association or hypertextual link among elements or passages, of some type not more precisely specifiable by other elements.
    targets specifies the identifiers of the elements or passages to be linked or associated.
The ptr element may be called a ‘pure pointer’, because its primary function is simply to point. A pointer sets up a connection between an element (which, in the case of a pure pointer, is simply a location in a document), and one or more others, known collectively as its target. The ptr and ref elements bear a target attribute (in the singular), because they point, conceptually, at a single target, even if that target may be discontinuous in the document. The link element bears a targets attribute (in the plural), because it specifies at least two targets, each of which is a unitary object. It may be thought of as representing a double link between the objects specified.
As members of the class att.pointing, these elements share a common set of attributes:
  • att.pointing defines a set of attributes used by all elements which point to other elements by means of one or more URI references.
    type categorizes the pointer in some respect, using any convenient set of categories.
    evaluate specifies the intended meaning when the target of a pointer is itself a pointer.
Double connection among elements could also be expressed by a combination of pointer elements, for example, two ptr elements, or one ptr element and one note element. All that is required is that the value of the target (or other pointing) attribute of the one be the value of the xml:id attribute of the other. What the link element accomplishes is the handling of double connection by means of a single element. Thus, in the following encoding:
<ptr xml:id="sa-p1" target="#sa-p2"/>
<ptr xml:id="sa-p2" target="#sa-p1"/>
sa-p1 points to sa-p2, and sa-p2 points to sa-p1. This is logically equivalent to the more compact encoding:
<link targets="#sa-p1 #sa-p2"/>

As noted elsewhere, both the target and the targets attributes take as their value one or more URI references. In the simplest case, a URI reference might indicate an element in the current document (or in some other document) by supplying the value used for its global xml:id attribute. Pointing or linking to external documents, and pointing or linking where identifiers are not available, are implemented by more complex forms of URI reference, as described below in section 16.2 Pointing Mechanisms.
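The equivalence between a link and a set of mutually pointing elements can be made concrete with a small sketch. The Python function below is not part of these Guidelines: the name expand_link is hypothetical, and reading a link with more than two targets as implying every ordered pair of targets is an assumption made only for illustration.

```python
from itertools import permutations


def expand_link(targets: str) -> list[tuple[str, str]]:
    """Return every ordered (source, target) pair implied by a targets value.

    Assumption: each target of a link is taken to be connected to every
    other target, in both directions.
    """
    uris = targets.split()  # the attribute value is a whitespace-separated list
    if len(uris) < 2:
        raise ValueError("a link must specify at least two targets")
    return list(permutations(uris, 2))


pairs = expand_link("#sa-p1 #sa-p2")
for source, target in pairs:
    print(f"{source} points to {target}")
```

For the two-target link shown above, the sketch yields exactly the two connections expressed by the pair of ptr elements.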

16.1.2 Using Pointers and Links

As an example of the use of these mechanisms which establish connections among elements, consider the practice (common in 18th century English verse and elsewhere) of providing footnotes citing parallel passages from classical authors.
[Figure: the original page of Pope's Dunciad which is discussed in the text.]
Such footnotes can of course simply be encoded using the note element (see section 3.8 Notes, Annotation, and Indexing) without a target attribute, placed adjacent to the passage to which the note refers:49
<l>(Diff'rent our parties, but with equal grace</l>
<l>The Goddess smiles on Whig and Tory race,</l>
<l>
 <note type="imitation" place="foot" anchored="false">
  <bibl>Virg. Æn. 10.</bibl>
  <quote>
   <l>Tros Rutulusve fuat; nullo discrimine habebo.</l>
   <l>—— Rex Jupiter omnibus idem.</l>
  </quote>
 </note>'Tis the same rope at sev'ral ends they twist,
</l>
<l>To Dulness, Ridpath is as dear as Mist)</l>

This use of the note element can be called implicit pointing (or implicit linking). It relies on the juxtaposition of the note to the text being commented on for the connection to be understood. If it is felt that the mere juxtaposition of the note to the text does not make it sufficiently clear exactly what text segment is being commented on (for example, is it the immediately preceding line, or the immediately preceding two lines, or what?), or if it is decided to place the note at some distance from the text, then the pointing or the linking must be made explicit. We now consider various methods for doing that.

Firstly, a ptr element might be placed at an appropriate point within the text to link it with the annotation:
<l>(Diff'rent our parties, but with equal grace</l>
<l>The Goddess smiles on Whig and Tory race,
<ptr rend="unmarked" target="#note3.284"/>
</l>
<l>'Tis the same rope at sev'ral ends they twist,</l>
<l>To Dulness, Ridpath is as dear as Mist)</l>
<note
  xml:id="note3.284"
  type="imitation"
  place="foot"
  anchored="false">

 <bibl>Virg. Æn. 10.</bibl>
 <quote>
  <l>Tros Rutulusve fuat; nullo discrimine habebo.</l>
  <l>—— Rex Jupiter omnibus idem.</l>
 </quote>
</note>
The note element has been given an arbitrary identifier (note3.284) to enable it to be specified as the target of the pointer element. Because there is nothing in the text to signal the existence of the annotation, the rend attribute has been given the value unmarked.
Secondly, the target attribute of the note element can be used to point at its associated text, provided that an xml:id attribute has been supplied for the associated text:
<l xml:id="l3.283">(Diff'rent our parties, but with equal grace</l>
<l xml:id="l3.284">The Goddess smiles on Whig and Tory race,</l>
<l xml:id="l3.285">'Tis the same rope at sev'ral ends they twist,</l>
<l xml:id="l3.286">To Dulness, Ridpath is as dear as Mist)</l>
<!-- ... -->
Given this encoding of the text itself, we can now link the various notes to it. In this case, the note itself contains a pointer to the place in the text which it is annotating; this could be encoded using a ref element, which bears a target attribute of its own and contains a (slightly misquoted) extract from the text marked as a quote element:
<note
  type="imitation"
  place="foot"
  anchored="false"
  target="#l3.284">

 <ref rend="sc" target="#l3.284">Verse 283–84.
 <quote>
   <l>——. With equal grace</l>
   <l>Our Goddess smiles on Whig and Tory race.</l>
  </quote>
 </ref>
 <bibl>Virg. Æn. 10.</bibl>
 <quote>
  <l>Tros Rutulusve fuat; nullo discrimine habebo.</l>
  <l>—— Rex Jupiter omnibus idem. </l>
 </quote>
</note>
Combining these two approaches gives us the following associations:
  • a pointer within one line indicates the note
  • the note indicates the line
  • a pointer within the note indicates the line
Note that we do not have any way of pointing from the line itself to the note: the association is implied by containment of the pointer. We do not as yet have a true double link between text and note. To achieve that we will need to supply identifiers for the annotations as well as for the verse lines, and use a link element to associate the two. Note that the ptr element and the target attribute on the note may now be dispensed with:
<note
  xml:id="n3.284"
  type="imitation"
  place="foot"
  anchored="false">

 <ref rend="sc" target="#l3.284">Verse 283–84.
 <quote>
   <l>——. With equal grace</l>
   <l>Our Goddess smiles on Whig and Tory race.</l>
  </quote>
 </ref>
 <bibl>Virg. Æn. 10.</bibl>
 <quote>
  <l>Tros Rutulusve fuat; nullo discrimine habebo.</l>
  <l>—— Rex Jupiter omnibus idem. </l>
 </quote>
</note>
<link targets="#n3.284 #l3.284"/>
The targets attribute of the link element here bears the identifiers of the note followed by that of the verse line. For completeness, we could also allocate an identifier to the reference within the note and encode the association between it and the verse line in the same way:
<note
  xml:id="n3.284"
  type="imitation"
  place="foot"
  anchored="false">

 <ref rend="sc" xml:id="r3.284" target="#l3.284">Verse 283–84.
 <quote>
   <l>——. With equal grace</l>
   <l>Our Goddess smiles on Whig and Tory race.</l>
  </quote>
 </ref>
<!-- ... -->
</note>
<!-- ... -->
<link targets="#r3.284 #l3.284"/>
Indeed, the two links could be combined into one, as follows:
<link targets="#n3.284 #r3.284 #l3.284"/>

16.1.3 Groups of Links

Clearly, there are many reasons for which an encoder might wish to represent a link or association between different elements. For some of them, specific elements are provided in these Guidelines; some of these are discussed elsewhere in the present chapter. The link element is a general purpose element which may be used for any kind of association. The element linkGrp may be used to group links of a particular type together in a single part of the document; such a collection may be used to represent what is sometimes referred to in the literature of Hypertext as a web, a term introduced by the Brown University FRESS project in 1969.
  • linkGrp (link group) defines a collection of associations or hypertextual links.
As a member of the class att.pointing.group, this element shares the following attributes with other members of that class:
  • att.pointing.group defines a set of attributes common to all elements which enclose groups of pointer elements.
    domains optionally specifies the identifiers of the elements within which all elements indicated by the contents of this element lie.
    targFunc (target function) describes the function of each of the values of the targets attribute of the enclosed link, join, or alt tags.
It is also a member of the att.pointing class, and therefore also carries the attributes specified in section 16.1.1 Pointers and Links above, in particular the type attribute:
  • att.pointing defines a set of attributes used by all elements which point to other elements by means of one or more URI references.
    type categorizes the pointer in some respect, using any convenient set of categories.

The linkGrp element provides a convenient way of establishing a default for the type attribute on a group of links of the same type: by default, the type attribute on a link element has the same value as that given for type on the enclosing linkGrp.

Typical software might hide a web entirely from the user, but use it as a source of information about links, which are displayed independently at their referenced locations. Alternatively, software might provide a direct view of the link collection, along with added functions for manipulating the collection, as by filtering, sorting, and so on. To continue our previous example, this text contains many other notes of a kind similar to the one shown above. Here are a few more of the lines to which annotations have to be attached, followed by the annotations themselves:
<l xml:id="l2.79">A place there is, betwixt earth, air and seas</l>
<l xml:id="l2.80">Where from Ambrosia, Jove retires for ease.</l>
<!-- ... -->
<l xml:id="l2.88">Sign'd with that Ichor which from Gods distills.</l>
<!-- ... -->
<note xml:id="n2.79" place="foot" anchored="false">
 <bibl>Ovid Met. 12.</bibl>
 <quote xml:lang="la">
  <l>Orbe locus media est, inter terrasq; fretumq;</l>
  <l>Cœlestesq; plagas —</l>
 </quote>
</note>
<note xml:id="n2.88" place="foot" anchored="false"> Alludes to <bibl>Homer, Iliad 5</bibl> ...
</note>
To avoid having to repeat the specification of type as imitation on each note, we may specify it once for all on a linkGrp element containing all links of this type.
<linkGrp type="imitation">
 <link targets="#n2.79 #l2.79"/>
 <link targets="#n2.88 #l2.88"/>
 <link targets="#n3.284 #l3.284"/>
</linkGrp>
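The defaulting rule for the type attribute can be sketched as follows. This is an illustration only, written with the Python standard library: the function name effective_types is hypothetical, the sample omits the TEI namespace for brevity, and the value gloss on the second link is invented to show an explicit value overriding the default.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second link carries its own type value.
SAMPLE = """
<linkGrp type="imitation">
 <link targets="#n2.79 #l2.79"/>
 <link type="gloss" targets="#n2.88 #l2.88"/>
</linkGrp>
"""


def effective_types(link_grp: ET.Element) -> list[tuple[str, str]]:
    """Pair each link's targets with its own type, or with the linkGrp default."""
    default_type = link_grp.get("type")
    return [
        (link.get("targets"), link.get("type", default_type))
        for link in link_grp.findall("link")
    ]


result = effective_types(ET.fromstring(SAMPLE))
for targets, link_type in result:
    print(link_type, targets)
```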
Additional information for applications that use linkGrp elements can be provided by means of special attributes. First, the domains attribute can be used to identify the text elements within which the individual targets of the links are to be found. Suppose that the text under discussion is organized into a body element, containing the text of the poem, and a back element containing the notes. Then the domains attribute can have as its value the identifiers of the body and the back, to enable an application to verify that the link targets are in fact contained by appropriate elements, or to limit its search space:

<!-- ... --><linkGrp type="imitation" domains="dunciad dunnotes">
 <link targets="#n2.79 #l2.79"/>
 <link targets="#n2.88 #l2.88"/>
<!-- ... -->
 <link targets="#n3.284 #l3.284"/>
<!-- ... -->
</linkGrp>

Note that there must be a single parent element for each ‘domain’; if some notes are contained by a section with identifier dunnotes, and others by a section with identifier dunimits, an intermediate pointer must be provided (as described in section 16.1.4 Intermediate Pointers) within the linkGrp and its identifier used instead.

Next, the targFunc attribute can be used to provide further information about the role or function of the various targets specified for each link in the group. The value of the targFunc attribute is a list of names (formally, name tokens), one for each of the targets in the link; these names can be chosen freely by the encoder, but their significance should be documented in the encoding declaration in the header.50 In the current example, we might think of the note as containing the source of the imitation and the verse line as containing the goal of the imitation. Accordingly, we can specify the linkGrp in the preceding example thus:
<linkGrp type="imitation" domains="dunciad dunnotes" targFunc="source goal">
 <link targets="#n2.79 #l2.79"/>
 <link targets="#n2.88 #l2.88"/>
<!-- ... -->
 <link targets="#n3.284 #l3.284"/>
<!-- ... -->
</linkGrp>
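An application might combine the targFunc value of the group with the targets value of each link roughly as sketched below. The function name label_targets is hypothetical, and rejecting a link whose number of targets differs from the number of names is an assumption of this sketch rather than a rule stated here.

```python
def label_targets(targ_func: str, targets: str) -> list[tuple[str, str]]:
    """Pair each name in targFunc with the target in the same position."""
    roles = targ_func.split()
    uris = targets.split()
    if len(roles) != len(uris):
        # Assumption: a mismatch is treated as an encoding error.
        raise ValueError("targFunc must supply one name for each target")
    return list(zip(roles, uris))


labelled = label_targets("source goal", "#n2.79 #l2.79")
for role, uri in labelled:
    print(f"{role}: {uri}")
```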

16.1.4 Intermediate Pointers

In the preceding examples, we have shown various ways of linking an annotation and a single verse line. However, the example cited in fact requires us to encode an association between the note and a pair of verse lines (lines 283 and 284).

There are a number of possible ways of correcting this error: one could use the target and targetEnd attributes of the note element to delimit the span to which the note applies (see further section 3.8 Notes, Annotation, and Indexing). Alternatively one could create an element to encode the couplet itself and assign it an xml:id attribute, which can then be linked to the note and ref elements. This could be done either explicitly by means of an lg element, as defined in section 3.12.1 Core Tags for Verse or implicitly, by means of the join element discussed in section 16.7 Aggregation.

A third possibility however, is to use an ‘intermediate pointer’ as follows:
<ptr xml:id="l3.283-284" target="#l3.283 #l3.284"/>
When the target attribute of a ptr or ref element specifies more than one element, the indicated elements are intended to be combined or aggregated in some way to produce the object of the pointer. (Such aggregation is however the task of a processing application, and cannot be defined simply by the markup). The xml:id attribute provides an identifier which can then be linked to the note and ref elements:
<link evaluate="all" targets="#n3.284 #r3.284 #l3.283-284"/>

The all value of evaluate is used on the link element to specify that any pointer encountered as a target of that element is itself evaluated. If evaluate had the value none, the link target would be the pointer itself, rather than the objects it points to.
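The difference between the two values of evaluate can be illustrated with a short sketch. It is not a definitive implementation: the function name resolve_targets is hypothetical, only bare-name references such as #n3.284 are handled, only ptr elements are treated as intermediate pointers, and the sample omits the TEI namespace and the ref element for brevity.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """
<body>
 <l xml:id="l3.283">first line of the couplet</l>
 <l xml:id="l3.284">second line of the couplet</l>
 <ptr xml:id="l3.283-284" target="#l3.283 #l3.284"/>
 <note xml:id="n3.284"/>
 <link evaluate="all" targets="#n3.284 #l3.283-284"/>
</body>
"""


def resolve_targets(root, targets, evaluate="all", _seen=()):
    """Return the identifiers finally indicated by a list of bare-name references."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID) is not None}
    resolved = []
    for uri in targets.split():
        ident = uri[1:]  # assumes a bare-name reference, so drop the leading '#'
        element = by_id[ident]
        is_pointer = element.tag == "ptr" and element.get("target") is not None
        if evaluate == "all" and is_pointer and ident not in _seen:
            # Follow the intermediate pointer; _seen guards against cycles.
            resolved.extend(
                resolve_targets(root, element.get("target"), evaluate, _seen + (ident,))
            )
        else:
            resolved.append(ident)
    return resolved


root = ET.fromstring(SAMPLE)
link = root.find("link")
print(resolve_targets(root, link.get("targets"), "all"))
print(resolve_targets(root, link.get("targets"), "none"))
```

With all, the intermediate pointer is replaced by the two verse lines it indicates; with none, the pointer itself remains the target.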

Where a linkGrp element is used to group a collection of link elements, any intermediate pointer elements used by those link elements should be included within the linkGrp.

16.2 Pointing Mechanisms

This section introduces more formally the pointing mechanisms available in the TEI. In addition to those discussed so far, the TEI provides methods of pointing:
  • into documents other than the current document;
  • to a particular element in a document other than the current document using its xml:id;
  • to a particular element whether in the current document or not, using its position in the XML element tree;
  • at arbitrary content in any XML document using TEI-defined XPointer schemes.

All TEI attributes used to point at something else are declared as having the datatype data.pointer, which is defined as a URI reference51; the cases so far discussed are all simple examples of a URI reference. Another familiar example is the mechanism used in XHTML to represent hypertext links by means of the href attribute. A URI reference can reference the whole of an XML resource such as a document or an XML element, or a sub-portion of such a resource, identified by means of an appropriate fragment identifier. Technically speaking, the ‘fragment identifier’ is that portion of a URI reference following the first unescaped ‘#’ character; in practice, it provides a means of accessing some part of the resource described by the URI which is less than the whole.
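The division of a URI reference into a resource part and a fragment identifier can be sketched with the Python standard library. The function names below are hypothetical; urldefrag is the standard library routine that performs the split at the ‘#’ character.

```python
from urllib.parse import urldefrag


def split_reference(uri_reference: str) -> tuple[str, str]:
    """Split a URI reference into (resource, fragment identifier)."""
    resource, fragment = urldefrag(uri_reference)
    return resource, fragment


def points_into_current_document(uri_reference: str) -> bool:
    """True when the reference consists of a fragment identifier alone."""
    resource, fragment = split_reference(uri_reference)
    return resource == "" and fragment != ""


print(split_reference("foo.xml#element(/1/4)"))
print(points_into_current_document("#sect106"))
```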

The first three of the following subsections provide only a brief overview and some examples of the W3C mechanisms recommended. More detailed information on the use of these mechanisms is readily available elsewhere.

16.2.1 Pointing Elsewhere

Like the ubiquitous if misnamed XHTML pointing attribute href, the TEI pointing attributes can point to a document that is not the current document (the one that contains the pointing element) whether it is in the same local filesystem as the current document, or on a different system entirely. In either case, the pointing can be accomplished absolutely (using the entire address of the target document) or relatively (using an address relative to the current base URI in force). The ‘current base URI’ is defined according to Marsh 2001. In general the current base URI in force is the value of the xml:base attribute of the closest ancestor that has one. If there is none, the base URI is that of the current document.

The following example demonstrates an absolute URI reference that points to a remote document:
The current base URI in force is as defined in the
W3C <ref target="http://www.w3.org/TR/xmlbase/">XML
Base</ref> recommendation.
This example points explicitly to a location on the Web, accessible via HTTP. Suppose however that we wish to access a document stored locally in a file. Again we will supply an absolute URI reference, but this time using a different protocol:
This Debian package is distributed under the terms
of the <ref
  target="file:///usr/share/common-licenses/GPL-2">
GNU General Public License</ref>.
In the following example, we use a relative URI reference to point to a local document:
<figure rend="float fullpage">
 <graphic url="Images/compic.png"/>
 <figDesc>The figure shows the page from the <title>Orbis
     pictus</title> of Comenius which is discussed in the text.</figDesc>
</figure>
Since no xml:base is specified here, the location of the resource Images/compic.png is determined relative to the resource indicated by the current base URI, which is the current document.
In the following example, however, we first change the current base URI by setting a new value for xml:base. The resource required is then identified by means of a relative URI:
<div type="chap" xml:base="http://classics.mit.edu/">
 <head>On Ancient Persian Manners</head>
 <p>In the very first story of <ref target="Sadi/gulistan.2.i.html">
   <title>The Gulistan of
       Sa'di</title>
  </ref>,
   Sa'di relates moral advice worthy of Miss Minners ...</p>
<!-- ... -->
</div>
As noted above, the current base URI is found on the nearest ancestor. This provides a useful way of abbreviating URIs within a given scope:
<body>
 <div n="A">
  <p>The base URI here is the current document. A URI such as
  <code>a.xml</code> is equivalent to
  <code>./a.xml</code>.</p>
 </div>
 <div n="B" xml:base="http://www.example.org/">
  <p>The base URI here is
  <code>http://www.example.org/</code>. A
     URI such as <code>a.xml</code> is equivalent to
  <code>http://www.example.org/a.xml</code>.</p>
 </div>
 <div n="C" xml:base="ftp://ftp.example.net/mirror/">
  <p>The base URI here is
  <code>ftp://ftp.example.net/mirror/</code>. A URI such
     as
  <code>a.xml</code> is equivalent to
  <code>ftp://ftp.example.net/mirror/a.xml</code>.</p>
 </div>
 <div n="D">
  <p>The base URI here is the current document. A URI such as
  <code>a.xml</code> is equivalent to
  <code>./a.xml</code>.</p>
 </div>
</body>
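The scoping of xml:base shown in this example can be checked with a short sketch that resolves each relative reference against the base URI in force. It is an illustration under stated assumptions: the function name resolve_targets and the document URI http://example.com/docs/dunciad.xml are hypothetical, and the sample replaces the prose of each div with a single ref element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

SAMPLE = """
<body>
 <div n="A"><ref target="a.xml"/></div>
 <div n="B" xml:base="http://www.example.org/"><ref target="a.xml"/></div>
 <div n="C" xml:base="ftp://ftp.example.net/mirror/"><ref target="a.xml"/></div>
 <div n="D"><ref target="a.xml"/></div>
</body>
"""


def resolve_targets(element, base, found=None):
    """Collect every target value, resolved against the base URI in force."""
    if found is None:
        found = []
    # An xml:base on this element changes the base for it and its descendants.
    base = urljoin(base, element.get(XML_BASE, ""))
    if element.get("target") is not None:
        found.append(urljoin(base, element.get("target")))
    for child in element:
        resolve_targets(child, base, found)
    return found


document_uri = "http://example.com/docs/dunciad.xml"  # hypothetical location
resolved = resolve_targets(ET.fromstring(SAMPLE), document_uri)
for uri in resolved:
    print(uri)
```

Because the base is passed down the tree rather than stored globally, the value set on division B or C has no effect on division D, which falls back to the document's own URI.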

16.2.2 Pointing Locally

Because the default base URI is the current document, a pointer that is specified as a bare name fragment identifier alone acts as a pointer to an element in the current document, as in the following example.
<div type="section" xml:id="sect106">
<!-- ... -->
</div>
<div type="section" n="107" xml:id="sect107">
 <head>Limitations on exclusive rights: Fair use</head>
 <p>Notwithstanding the provisions of
 <ref target="#sect106">section 106</ref>, the fair use of a
   copyrighted work, including such use by reproduction in copies
   or phonorecords or by any other means specified by that section,
   for purposes such as criticism, comment, news reporting,
   teaching (including multiple copies for classroom use),
   scholarship, or research, is not an infringement of copyright.
   In determining whether the use made of a work in any particular
   case is a fair use the factors to be considered shall
   include — 
 <list type="simple">
   <item n="(1)">the purpose and character of the use, including
       whether such use is of a commercial nature or is for nonprofit
       educational purposes;</item>
   <item n="(2)">the nature of the copyrighted work;</item>
   <item n="(3)">the amount and substantiality of the portion
       used in relation to the copyrighted work as a whole;
       and</item>
   <item n="(4)">the effect of the use upon the potential market
       for or value of the copyrighted work.</item>
  </list>
   The fact that a work is unpublished shall not itself bar a
   finding of fair use if such finding is made upon consideration
   of all the above factors.</p>
</div>
This method of pointing, by referring to the xml:id of the target element as a bare name only (e.g., #sect106) is the simplest and often the best approach where it can be applied, i.e. where both the source element and target element are in the same XML document, and where the target element carries an identifier. It is the method used extensively in previous sections of this chapter and elsewhere in these Guidelines.

16.2.3 W3C element() Scheme

If elements are not directly addressable by means of an identifier, because no identifier was originally given to them and the document cannot be modified to add one, they may still be pointed to by means of their position in the XML element tree. This method of pointing uses the element() scheme defined by the World Wide Web Consortium (Grosso et al, 2003). In this scheme, an element may be identified by stepwise navigation using a slash-separated list of child element numbers. For each step the integer n locates the nth child element of the previously located element. Thus a pointer such as <ptr target="foo.xml#element(/1/4)"/> indicates the fourth child element starting from the root element of the document indicated by the URI foo.xml.

For example, the following pointer selects one of Shakespeare's most famous lines:
<ref
  target="http://www.cs.mu.oz.au/621/2003project/hamlet.xml#element(/1/8/2/25/2)">
2B|^2B…</ref>
The URI in this example references an XML resource assumed to be available via the HTTP protocol on the Web; within that file, the specified element() scheme is used to select ‘the first (root-level) element's 8th child element's 2nd child element's 25th child element's 2nd child element’. This is equivalent to the XPath specification /*[1]/*[8]/*[2]/*[25]/*[2].
Rather than specifying a full path starting from the document root, it is also possible in this pointer scheme to specify as starting point any element which carries a value for its xml:id attribute, supplying a unique identifier for it. In this case the identifier is prefixed to the location path. For example, we can point more economically to the same line of Hamlet in a different digital version of the play which provides identifiers for the individual scenes:
<div
  xml:base="/Users/martin/Documents/c5/namelessShakespeare.xml">

 <p>
  <ptr target="#element(sha-ham301/22/2)"/>
 </p>
</div>
Here sha-ham301 is the identifier for the div element containing Act III, Scene I of Hamlet. The second child of the 22nd child of this div element contains the desired l element. This is equivalent to the XPath specification id('sha-ham301')/*[22]/*[2].
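The stepwise navigation performed by the element() scheme can be sketched as follows. This is a minimal illustration rather than a conformant XPointer processor: the function name resolve_element_scheme and the sample document are hypothetical, and only the fragment identifier (not the URI preceding it) is handled.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample document, much smaller than a real play.
SAMPLE = """
<play>
 <title>Sample</title>
 <div xml:id="scene1">
  <sp><speaker>A</speaker><l>first line</l></sp>
  <sp><speaker>B</speaker><l>second line</l></sp>
 </div>
</play>
"""


def resolve_element_scheme(root, pointer):
    """Locate the element addressed by element(/1/...) or element(id/...)."""
    match = re.fullmatch(r"element\(([^/()]*)((?:/\d+)*)\)", pointer)
    if match is None:
        raise ValueError(f"not an element() pointer: {pointer}")
    name, path = match.groups()
    steps = [int(step) for step in path.split("/") if step]
    if name:
        # Start from the element carrying this xml:id value.
        current = next((el for el in root.iter() if el.get(XML_ID) == name), None)
        if current is None:
            raise LookupError(f"no element with identifier {name}")
    else:
        # A path from the document root must begin with /1, the root element.
        if not steps or steps[0] != 1:
            raise LookupError("a path from the root must start with /1")
        current, steps = root, steps[1:]
    for n in steps:
        children = list(current)  # child elements only, in document order
        if not 1 <= n <= len(children):
            raise LookupError(f"no child element number {n}")
        current = children[n - 1]
    return current


root = ET.fromstring(SAMPLE)
print(resolve_element_scheme(root, "element(/1/2/2/2)").text)
print(resolve_element_scheme(root, "element(scene1/1/2)").text)
```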

As noted above, we could also point directly to this line if it had an identifier of its own. In another digital edition of Shakespeare, based on the first folio, each line is given an identifier based on its ‘through line number’. Our pointer to this line can now be represented simply as <ptr target="#element(Ham01245)"/>, or even more simply as <ptr target="#Ham01245"/>. The notation <ptr target="#xxx"/> is a convenient abbreviation for <ptr target="#element(xxx)"/>. This method requires, of course, that the ‘Through Line Number’ is supplied as the value of an xml:id attribute on each line, and must therefore be unique within each document. In section 16.2.5 Canonical References we discuss a method of pointing to the line which does not have this requirement.

16.2.4 TEI XPointer Schemes

The pointing scheme described in this chapter is one of a number of such schemes envisaged by the W3C, which together constitute a framework for addressing data within XML documents, known as the XPointer Framework (Grosso et al 2003). This framework permits the definition of many other named addressing methods, each of which is known as an XPointer Scheme. The W3C has predefined a set of such schemes, and maintains a register for their expansion. The element() scheme described above is one such scheme, defined by the W3C, and widely implemented by XML processing systems.

Another important scheme, also defined by the W3C, and recommended by these Guidelines is the xpath1() pointer scheme, which allows for any part of an XML structure to be selected using the syntax defined by the XPath specification. This is further discussed below, 16.2.4.2 xpath1(Expr). These Guidelines also define five other pointer schemes, which provide access to parts of an XML document such as points within data content or stretches of data content. These additional TEI pointer schemes are defined in sections 16.2.4.3 left(pointer) and right(pointer) to 16.2.4.6 match(pointer, string [, index]) below.

16.2.4.1 Introduction to TEI Pointers

Before discussing the TEI pointer schemes, we introduce slightly more formally the terminology used to define them. So far, we have discussed only ways of pointing at components of the XML information set node such as elements and attributes. However, there is often a need in text analysis to address additional types of location such as the ‘point’ locations between nodes, and ‘ranges’ that may arbitrarily cross the boundaries of nodes in a document. The content of an XML document is organized sequentially as well as hierarchically, and it therefore makes sense to consider ranges of characters within it independently of the nodes to which they belong, for example when making a selection in a text editor. For processing purposes, such a range is best defined by the pair of points at its start and end. It is often useful to think of pointer schemes as analogous to query functions that return nodes in the XML information set (the DOM tree) of an XML document, as in the case of the element and xpath pointer schemes discussed so far, but this is not invariably the case. A point is adjacent to one or two nodes, but is not a node itself, while a range may not even overlap with any complete node in the DOM tree.

The TEI pointer scheme thus distinguishes the following kinds of object:
Node
A node represents a single item in the XML information set for a document. For pointing purposes, the only nodes that are of interest are Text Nodes, Element Nodes, and Attribute nodes.
Node Set
A node set is a set of nodes in the XML information set of a document. In TEI Pointing applications, node sets are only allowed as the result of resolving a URI when multiple URIs would have been allowed where it appears, i.e. in attributes which are declared as permitting two or more data.pointer values as opposed to only one. As the name ‘set’ implies, the individual items in a node set are not ordered, and no assumptions about relative ordering of items in a node set should be made.
Point
A Point represents a point between nodes in a document. Every point is adjacent to either characters or elements, and never to another point. In fact, in the character representation of an XML document, every position between data characters, start-tags or end-tags is a point, and there are no other points. If one treats all character content as if it were broken into single-character text-nodes, every point is definable as either
  • the point preceding a node, and if that node has a predecessor in document order, then it is the same as the point following that predecessor; or
  • the point following a node, and if that node has a successor in document order, then it is the same as the point preceding that successor.
Range
A Range is defined as the portion of a document between two points. Since points may occur anywhere within the document, ranges do not correspond directly to nodes or to node sets. A range may overlap the contents of a node either completely or partially.
The TEI has registered the following five pointer schemes:
xpath1()
Addresses a node or nodeset using the XPath syntax. (16.2.4.2 xpath1(Expr))
left() and right()
addresses the point before (left) or after (right) a node or node set (16.2.4.3 left(pointer) and right(pointer))
range()
addresses the range between two points (16.2.4.4 range(pointer1, pointer2))
string-range()
addresses a range of a specified length starting from a specified point (16.2.4.5 string-range(pointer, offset [, length]))
match()
addresses a range which matches a specified string within a node (16.2.4.6 match(pointer, string [, index]))

The xpath1() scheme refers to the existing XPath specification which is adopted without modification or extension.

The other five schemes overlap in functionality with a W3C draft specification known as the XPointer scheme draft, but are individually much simpler. At the time of this writing, there is no current or scheduled activity at the W3C towards revising this draft or issuing it as a recommendation.

16.2.4.2 xpath1(Expr)
The xpath1() scheme locates a node or node set within an XML Information Set. The single argument Expr is an XPath Expr as defined in the W3C XPath 1 Recommendation. The node or node set resulting from evaluating the XPath is the reference of an address using the xpath1() scheme. The following example selects the first paragraph of the <ftnote> element with an id of fn6 in a paper that discusses XPointers.
<ptr
  target="http://tinyurl.com/267z62/xml/2004/Thompson01/EML2004Thompson01.xml#xpath1(//ftnote[@id='fn6']/para[1])"/>

When a URI reference is specified as the value of an attribute declared as a single data.pointer value, the result must be a single node, and it is an error if the result is a node set. When the URI reference is specified as the value of an attribute declared to permit two or more data.pointer values, each node in the node set is treated as if it were the result of a separate URI reference.
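The single-node rule can be illustrated with a minimal Python sketch. It is not part of the TEI specification: the miniature document, the function name resolve_xpath1, and the allow_many flag are hypothetical, and Python's xml.etree.ElementTree supports only a subset of XPath 1 (which is why the expression starts with a leading "."). Treating an empty result as an error is likewise an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document standing in for the paper referenced above.
DOC = """<paper>
  <ftnote id="fn5"><para>Fifth note.</para></ftnote>
  <ftnote id="fn6"><para>First paragraph.</para><para>Second paragraph.</para></ftnote>
</paper>"""


def resolve_xpath1(root, expr, allow_many=False):
    """Evaluate expr against root.

    allow_many=False models an attribute declared as a single data.pointer:
    anything other than exactly one node is rejected.
    allow_many=True models an attribute permitting two or more values:
    every node in the result is returned as a separate target.
    """
    nodes = root.findall(expr)
    if allow_many:
        return nodes
    if len(nodes) != 1:
        raise ValueError(f"expected a single node, got {len(nodes)}")
    return nodes[0]


root = ET.fromstring(DOC)
first = resolve_xpath1(root, ".//ftnote[@id='fn6']/para[1]")
print(first.text)  # First paragraph.
```

Calling resolve_xpath1(root, ".//ftnote/para") without allow_many raises ValueError, because that expression selects three nodes.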

When an xpath is interpreted by a TEI processor, the information set of the referenced document is interpreted without any additional information supplied by any schema processing that may or may not be present. In particular this means that no whitespace normalization is applied to a document before the xpath is interpreted.

This pointer scheme allows easy, direct use of the most widely-implemented XML query method. It is probably the most robust pointing mechanism for the common situation of selecting an XML element or its contents where an xml:id is not present. The ability to use element names and attribute names and values makes xpath1() pointers more robust than the other mechanisms discussed in this section even if the designated document changes. For durability in the presence of editing, use of xml:id is always recommended when possible.

16.2.4.3 left(pointer) and right(pointer)
The left() (right()) scheme locates the point immediately preceding (following) its argument. The single pointer argument to left() or right() is treated like a fragment identifier itself, and must be a bare name or XPointer pointer. The designation of this argument is resolved with respect to the base URI in effect for the left() or right() according to the normal rules.52 Most pointer schemes return nodes or ranges rather than points; the possibilities for left() and right() pointer schemes are as follows:
A Node
When pointer resolves to a node, the point designated is the point immediately preceding (left()) or following (right()) the node.
A Node Set
When pointer resolves to a node set, the point designated is the point preceding the first element of the set (left()) or following the last element of the set (right()).
A range
When pointer resolves to a range, the point designated is the point designating the start (left()) or end (right()) of the range.
A Point
When pointer resolves to a point, that point is the result. The pointer schemes left() and right() make no change when given a point as argument.
The following example points to the spot immediately following the last character of the element found by walking down the document tree to the 6th child of the 3rd child of the 3rd child of the 1st child of the root element. In this case, the path takes us to a <postcode> node which contains the string ‘20850’, so the point being pointed to is that following the ‘0’ character at the end of the element content.
<p
  xml:base="http://www.mulberrytech.com/Extreme/Proceedings/xml/2002/">

 <ptr
   target="Usdin01/EML2002Usdin01.xml#right(element(/1/1/3/3/6))"/>

</p>
16.2.4.4 range(pointer1, pointer2)
The range() scheme locates a range between two points in an XML information set. The two pointer arguments to range() locate the boundaries of the range by two points, and are interpreted as fragment identifiers. The parameters pointer1 and pointer2 are XPointers themselves, and are resolved according to the rules specified in the definition of the pointer scheme they use.53 Most pointer schemes return nodes or ranges rather than points; the possibilities for range() pointer schemes are as follows:
A Node
When pointer1 resolves to a node, the starting point of the range is the point immediately preceding the node. When pointer2 resolves to a node, the ending point of the range is the point immediately following the node. It is an error if the ending point precedes the starting point of a range.
A range
When pointer1 resolves to a range R, the starting point of the result range is the same as the starting point of R. When pointer2 resolves to a range R, the ending point of the result range is the ending point of R.
A Point
When pointer1 resolves to a point, that point is the start of the range. When pointer2 resolves to a point, that point is the end of the range.
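The resolution rules for left(), right(), and range() can be sketched in Python. This is a deliberately simplified model rather than a conforming implementation: it assumes that a point is an integer offset into the serialized document, that a node or range is a (start, end) pair of such offsets, and that a node set is a list of such pairs. The function names are illustrative only.

```python
def left(loc):
    """Return the point at the start of loc (a point, a span, or a list of spans)."""
    if isinstance(loc, int):      # a point is returned unchanged
        return loc
    if isinstance(loc, list):     # node set: the point before its first member
        return min(start for start, _ in loc)
    return loc[0]                 # node or range: its starting point


def right(loc):
    """Return the point at the end of loc (a point, a span, or a list of spans)."""
    if isinstance(loc, int):
        return loc
    if isinstance(loc, list):     # node set: the point after its last member
        return max(end for _, end in loc)
    return loc[1]


def make_range(pointer1, pointer2):
    """Model range(pointer1, pointer2) for arguments that are points or spans."""
    start, end = left(pointer1), right(pointer2)
    if end < start:
        raise ValueError("ending point precedes starting point")
    return (start, end)


print(make_range((3, 8), (10, 12)))  # (3, 12)
```

In this model, a node passed as pointer1 contributes its preceding point and a node passed as pointer2 contributes its following point, which mirrors the asymmetry described above.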
16.2.4.5 string-range(pointer, offset [, length])

The string-range() scheme locates a range based on character positions. While string-range endpoints are points adjacent to character positions, they must be designated by the characters to which they are adjacent, in the same way that the nodes corresponding to XML elements are. This avoids ambiguity about which point between two characters is indicated when characters are interrupted by markup.

The pointer argument to string-range() designates a node or a range within which a string is to be located. No string range, even an empty one, can be defined by a string-range() if pointer has the empty string as string value. Every string-range is defined based on an ‘origin character’. The origin is numbered 0, and designates the first character of the string-value of pointer. The offset is a character index relative to the origin; the start of the resulting range is the position designated by the sum of the origin and offset.

If length is specified, the end of the range is at a point adjacent to the character designated by the origin added to the offset and length. If the offset is negative, or length is sufficiently large, a string-range can designate characters outside the string-value of the initial pointer. In this case, characters are located using the string-value of the entire document. It is also legal for length plus the origin to exceed the length of the string-value of the document by one, in order to accommodate ranges that include the last character of a document.

If length is not specified, it defaults to the value 1, and the string range contains one character. If it is specified as 0, the zero-length range is interpreted as the point immediately preceding the origin character or offset character if there is one.
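The arithmetic of string-range() can be sketched as follows. This is one possible reading, under stated assumptions: doc_text stands for the string-value of the entire document, the string-value of pointer is assumed to occupy the contiguous slice doc_text[ptr_start:ptr_end], and the result is expressed as a half-open pair of character offsets. Rejecting negative lengths and out-of-document offsets is a choice made for this sketch; all names are hypothetical.

```python
def string_range(doc_text, ptr_start, ptr_end, offset, length=1):
    """Return (start, end) offsets into doc_text for a string-range.

    The origin (character 0) is the first character of the pointer's
    string-value, i.e. doc_text[ptr_start].
    """
    if ptr_end <= ptr_start:
        raise ValueError("pointer has the empty string as string-value")
    if length < 0:
        raise ValueError("negative length (rejected in this sketch)")
    start = ptr_start + offset        # origin plus offset
    end = start + length              # a zero length yields a point
    if start < 0 or end > len(doc_text):
        raise ValueError("range falls outside the document")
    return start, end


doc = "The quick brown fox"
quick = (4, 9)                        # extent of the string-value "quick"

s, e = string_range(doc, *quick, offset=0, length=5)
print(doc[s:e])                       # quick
s, e = string_range(doc, *quick, offset=6, length=5)
print(doc[s:e])                       # brown
```

The second call shows a range that lies wholly outside the string-value of the pointer and is therefore located in the string-value of the whole document.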

16.2.4.6 match(pointer, string [, index])

The match scheme designates the result of a literal match of the argument string within the string-value of the pointer argument. The result is a range from the first matching character to the last. It is an error if there is no matching string. A match may not extend outside the range corresponding to the string value of pointer.

The index argument is an integer greater than or equal to 1, specifying which match should be chosen when there is more than one match within the string-value of pointer. If no index is provided, the default value is 1, indicating the first match found.
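A minimal Python sketch of the match() rules follows. It operates on a plain string standing in for the string-value of pointer and returns half-open character offsets. Whether overlapping occurrences count as separate matches is not stated above; this sketch assumes that they do. The function name and error types are illustrative.

```python
def match(text, needle, index=1):
    """Return (start, end) offsets of the index-th literal match of needle in text."""
    if index < 1:
        raise ValueError("index must be greater than or equal to 1")
    if not needle:
        raise ValueError("empty search string (rejected in this sketch)")
    pos = -1
    for _ in range(index):
        # Resume one character after the previous hit, so overlapping
        # occurrences are counted (an assumption of this sketch).
        pos = text.find(needle, pos + 1)
        if pos == -1:
            raise LookupError("no matching string")
    return pos, pos + len(needle)


print(match("to be or not to be", "to"))     # (0, 2)
print(match("to be or not to be", "to", 2))  # (13, 15)
```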

16.2.5 Canonical References

By ‘canonical’ reference we mean any means of pointing into documents, specific to a community or corpus. For example, biblical scholars might understand ‘Matt 5:7’ to mean ‘the book called Matthew, chapter 5, verse 7.’ They might then wish to translate the string ‘Matt 5:7’ into a pointer into a TEI-encoded document, selecting the element which corresponds to the seventh div element within the fifth div element within the div element with the n attribute valued ‘Matt.’

Several elements in the TEI scheme (gloss, ptr, ref, and term) bear a special attribute, cRef, just for this purpose. Using the system described in this section, an encoder may specify references to canonical works in a discipline-familiar format, and expect software to derive a complete URI from it. The value of the cRef attribute is processed as described in this section, and the resulting URI reference is treated as if it were the value of the target attribute. The cRef and target attributes are mutually exclusive: only one or the other may be specified on any given occurrence of an element.

For the cRef attribute to function as required, a mechanism is needed to define the mapping between (for example) ‘the book called Matt’ and the part of the XML structure which corresponds with it. This is provided by the refsDecl element in the TEI Header, which contains an algorithm for translating a canonical reference string (like Matt 5:7) into a URI such as #xpath1(//div[@n='Matt']/div[5]/div[7]). The refsDecl element is described in section 2.3.5 The Reference System Declaration; the following example is discussed in more detail below in section 16.2.5.1 Worked Example.
<refsDecl xml:id="biblical">
 <cRefPattern
   matchPattern="(.+) (.+):(.+)"
   replacementPattern="#xpath1(//div[@n='$1']/div[$2]/div[$3])">

  <p>This pointer pattern extracts and references the <q>book,</q>
   <q>chapter,</q> and <q>verse</q> parts of a biblical reference.</p>
 </cRefPattern>
 <cRefPattern matchPattern="(.+) (.+)"
   replacementPattern="#xpath1(//div[@n='$1']/div[$2])">

  <p>This pointer pattern extracts and references the <q>book</q> and
  <q>chapter</q> parts of a biblical reference.</p>
 </cRefPattern>
 <cRefPattern matchPattern="(.+)"
   replacementPattern="#xpath1(//div[@n='$1'])">

  <p>This pointer pattern extracts and references just the <q>book</q>
     part of a biblical reference.</p>
 </cRefPattern>
</refsDecl>
When an application encounters a canonical reference as the value of the cRef attribute, it follows a sequence of specific steps to transform it into a URI reference.
  1. Ascertain the correct refsDecl following the rules summarized in section 15.3.3 Summary.
  2. For each cRefPattern element encountered in the appropriate refsDecl, in the order encountered:
    1. match the value of cRef to the regular expression found as the value of the matchPattern attribute
    2. if the cRef value matches, take the value of the replacementPattern attribute and substitute the back references ($1, $2, etc.) with the corresponding matched substrings
    3. the result is taken as if it were a relative or absolute URI reference specified on the target attribute; i.e., it should be used as is or combined with the current xml:base value as usual
    4. no further processing of this cRef against the refsDecl should take place
    5. if, however, the cRef value does not match the regular expression specified on the matchPattern attribute, proceed to the next cRefPattern
  3. If all the cRefPattern elements are examined in turn and none matches, the pointer fails.
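The steps above can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: Python's re module stands in for the XML Schema regular expression language (the two differ in detail), re.fullmatch is used because a pattern is taken here to match the whole cRef value, and the list PATTERNS simply transcribes the cRefPattern elements of the example refsDecl in document order. Locating the correct refsDecl (step 1) and combining the result with xml:base are outside the scope of the sketch.

```python
import re

# (matchPattern, replacementPattern) pairs, in the order they occur in refsDecl.
PATTERNS = [
    (r"(.+) (.+):(.+)", "#xpath1(//div[@n='$1']/div[$2]/div[$3])"),
    (r"(.+) (.+)", "#xpath1(//div[@n='$1']/div[$2])"),
    (r"(.+)", "#xpath1(//div[@n='$1'])"),
]


def resolve_cref(cref, patterns=PATTERNS):
    """Return the URI reference produced by the first matching pattern."""
    for match_pattern, replacement in patterns:
        m = re.fullmatch(match_pattern, cref)
        if m is None:
            continue                      # try the next cRefPattern
        # Substitute $1..$9 with the matched substrings, then stop.
        return re.sub(
            r"\$([1-9])",
            lambda ref: m.group(int(ref.group(1))) or "",
            replacement,
        )
    raise LookupError(f"no cRefPattern matches {cref!r}")  # the pointer fails


print(resolve_cref("Matt 5:7"))  # #xpath1(//div[@n='Matt']/div[5]/div[7])
```

Because the patterns are tried in order and processing stops at the first match, the most specific pattern has to come first, as it does in the example refsDecl.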

The regular expression language used as the value of the matchPattern attribute is that used for the pattern facet of the World Wide Web Consortium's XML Schema Language in an Appendix to XML Schema Part 2.54 The value of the replacementPattern attribute is simply a string, except that occurrences of ‘$1’ through ‘$9’ are replaced by the corresponding substring match. Note that since a maximum of nine substring matches are permitted, the string ‘$18’ means ‘the value of the first matched substring followed by the character ‘8’’ as opposed to ‘the eighteenth matched substring’. If there is a need for an actual string including a dollar sign followed by a digit that is not supposed to be replaced, the dollar sign should be written as %24.
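The two edge cases just described (‘$18’ and ‘%24’) can be checked with a short Python sketch. The function name substitute and the sample strings are hypothetical; the list substrings holds the matched substrings, with its first item corresponding to $1.

```python
import re


def substitute(replacement_pattern, substrings):
    """Replace $1..$9 with the matched substrings (substrings[0] is $1)."""
    return re.sub(
        r"\$([1-9])",                 # a dollar sign and exactly one digit
        lambda ref: substrings[int(ref.group(1)) - 1],
        replacement_pattern,
    )


print(substitute("#v$18", ["Matt"]))        # #vMatt8
print(substitute("cost%241-$1", ["Matt"]))  # cost%241-Matt
```

Only one digit is consumed after each dollar sign, so ‘$18’ expands to the first substring followed by ‘8’, and the escaped ‘%24’ is left untouched.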

16.2.5.1 Worked Example

Let us presume that with the example refsDecl above, an application comes across a cRef value of Matt 5:7 inside a div which has an xml:base of http://www.example.org/resources/books/Bible.xml. The application would first apply the regular expression (.+) (.+):(.+) to ‘Matt 5:7’. This regular expression would successfully match. The first matched substring would be ‘Matt’, the second ‘5’, and the third ‘7’. The application would then apply these substrings to the pattern #xpath1(//div[@n='$1']/div[$2]/div[$3]), producing #xpath1(//div[@n='Matt']/div[5]/div[7]). It would append this to the xml:base in force, thus generating the complete URI Reference http://www.example.org/resources/books/Bible.xml#xpath1(//div[@n='Matt']/div[5]/div[7]).

If, however, the input string had been ‘Matt 5’, the first regular expression would not have matched. The application would have then tried the second, (.+) (.+), producing a successful match, and the matched substrings ‘Matt’ and ‘5’. It would then have substituted those matched substrings into the pattern #xpath1(//div[@n='$1']/div[$2]) to produce a fragment identifier, which when appended to the xml:base in force produces the absolute URI reference http://www.example.org/resources/books/Bible.xml#xpath1(//div[@n='Matt']/div[5]).

If the input string had been ‘Matt’, neither the first nor the second regular expression would have successfully matched. The application would have then tried the third, (.+), producing the matched substring ‘Matt’, and the URI Reference http://www.example.org/resources/books/Bible.xml#xpath1(//div[@n='Matt']).

It is an error to reference more matched substrings than are produced by the regular expression. For example:
<cRefPattern
  matchPattern="(.+) (.+):(.+)"
  replacementPattern="//div[@n='$1']/div[$2]/div[$3]/p[$4]"/>
would produce an error, since only three matched substrings would have been produced, but a fourth ($4) was referenced.
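This kind of error can be detected before any cRef value is processed. The sketch below is an assumption about how a validating tool might work, not something the Guidelines prescribe: it compares the number of capturing groups in matchPattern with the highest back reference used in replacementPattern. Python's re module again stands in for the XML Schema regular expression language, and check_pattern is a hypothetical name.

```python
import re


def check_pattern(match_pattern, replacement_pattern):
    """Raise ValueError if replacement_pattern references a missing substring."""
    available = re.compile(match_pattern).groups      # number of capturing groups
    referenced = [int(d) for d in re.findall(r"\$([1-9])", replacement_pattern)]
    highest = max(referenced, default=0)
    if highest > available:
        raise ValueError(
            f"${highest} referenced but only {available} substrings are produced"
        )
    return available


print(check_pattern("(.+) (.+):(.+)", "//div[@n='$1']/div[$2]/div[$3]"))  # 3
```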

In practice, encoders are likely to prefer much more precise regular expressions than those used as examples above, for example ^\s*([1-9]?[A-Z][a-z]+)\s+([1-9][0-9]?[0-9]?):([1-9][0-9]?)\s*$.

16.2.5.2 Complete and Partial URI Examples
In the above example, the value of cRef was used to generate a Fragment Identifier, which in turn was used to generate a complete URI. The complete URI could be generated directly, as in the following example.
<refsDecl xml:id="USC">
 <cRefPattern
   matchPattern="([0-9][0-9])\s*U\.?S\.?C\.?\s*[Cc](h(\.|ap(ter|\.)?)?)?\s*([1-9][0-9]*)"
   replacementPattern="http://uscode.house.gov/download/pls/$1C$5.txt">

  <p>Matches most standard references to particular
     chapters of the United States Code, e.g.
  <val>11USCC7</val>, <val>17 U.S.C. Chapter 3</val>, or
  <val>14 USC Ch. 5</val>. Note that a leading zero is
     required for the title (must be two digits), but is not
     permitted for the chapter number.</p>
 </cRefPattern>
 <cRefPattern
   matchPattern="([0-9][0-9])\s*U\.?S\.?C\.?\s*[Pp](re(lim(inary)?)?)?\s*[Mm](at(erial)?)?"
   replacementPattern="http://uscode.house.gov/download/pls/$1T.txt">

  <p>Matches references to the preliminary material for a
     given title, e.g. <val>11USCP</val>, <val>17 U.S.C.
       Prelim Mat</val>, or <val>14 USC pm</val>.</p>
 </cRefPattern>
 <cRefPattern
   matchPattern="([0-9][0-9])\s*U\.?S\.?C\.?\s*[Aa](ppend(ix)?)?"
   replacementPattern="http://uscode.house.gov/download/pls/$1A.txt">

  <p>Matches references to the appendix of a given title,
     e.g. <val>05USCA</val>, <val>11 U.S.C. Appendix</val>,
     or <val>18 USC Append</val>.</p>
 </cRefPattern>
</refsDecl>
<!-- ... -->
<p>The example in section <ptr target="#SABN"/> is taken
from <ref cRef="17 USC Ch 1">Subject Matter and Scope of
   Copyright</ref>.</p>
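To see how the first cRefPattern above yields a complete URI, the following Python sketch applies it to a few reference strings. As before, Python's re module is only a stand-in for the XML Schema regular expression language, and the function name usc_chapter_uri is hypothetical. Note that groups 2 to 4 are optional and may not take part in a match; only $1 (the title) and $5 (the chapter) are used in the replacement.

```python
import re

MATCH = r"([0-9][0-9])\s*U\.?S\.?C\.?\s*[Cc](h(\.|ap(ter|\.)?)?)?\s*([1-9][0-9]*)"
REPLACEMENT = "http://uscode.house.gov/download/pls/$1C$5.txt"


def usc_chapter_uri(cref):
    """Return the URI for a chapter reference, or None if cref does not match."""
    m = re.fullmatch(MATCH, cref)
    if m is None:
        return None
    return re.sub(
        r"\$([1-9])",
        lambda ref: m.group(int(ref.group(1))) or "",
        REPLACEMENT,
    )


print(usc_chapter_uri("17 USC Ch 1"))
# http://uscode.house.gov/download/pls/17C1.txt
```

A reference such as ‘7 USC Ch 1’ is rejected because the title must be written with two digits, and ‘17 USC Ch 01’ is rejected because the chapter number may not have a leading zero.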
16.2.5.3 Miscellaneous Usages

Canonical reference pointers are intended for use by TEI encoders. However, this specification might be useful to the development of a process for recognizing canonical references in non-TEI documents (such as plain text documents), possibly as part of their conversion to TEI.

16.3 Blocks, Segments, and Anchors

In this section, we discuss three general-purpose elements which may be used to mark and categorize both a span of text and a point within one. These elements have several uses, most notably to provide elements which can be given identifiers for use when aligning or linking to parts of a document, as discussed elsewhere in this chapter. They also provide a convenient way of extending the semantics of the TEI markup scheme in a theory-neutral manner, by providing for two neutral or ‘anonymous’ elements to which the encoder can add any meaning not supplied by other TEI-defined elements.
  • anchor (anchor point) attaches an identifier to a point within a text, whether or not it corresponds with a textual element.
  • ab (anonymous block) contains any arbitrary component-level unit of text, acting as an anonymous container for phrase or inter level elements analogous to, but without the semantic baggage of, a paragraph.
    part specifies whether or not the block is complete.
The elements anchor, ab, and seg are members of the class att.typed, from which they inherit the following attributes:
  • att.typed provides attributes which can be used to classify or subclassify elements in any way.
    type characterizes the element in some sense, using any convenient classification scheme or typology.
    subtype provides a sub-categorization of the element, if needed.
The seg element is also a member of the class att.segLike from which it inherits the following attributes:
  • att.segLike provides attributes for elements used for arbitrary segmentation.
    function characterizes the function of the segment.
    part specifies whether or not the segment is fragmented by some other structural element, for example a clause which is divided between two or more sentences.

The anchor element may be thought of as an empty seg, or as an artifice enabling an identifier to be attached to any position in a text. Like the milestone element discussed in section 3.10 Reference Systems, it is useful where multiple views of a document are to be combined, for example, when a logical view based on paragraphs or verse lines is to be mapped on to a physical view based on manuscript lines. Like those elements, it is a member of the class model.global and can therefore appear anywhere within a document when the module defined by this chapter is included in a schema. Unlike the other elements in its class, the