<?xml version="1.0" encoding="UTF-8"?>
<!-- © TEI Consortium. Dual-licensed under CC-by and BSD2 licenses; see the file COPYING.txt for details. -->
<?xml-model href="https://jenkins.tei-c.org/job/TEIP5-dev/lastSuccessfulBuild/artifact/P5/release/xml/tei/odd/p5.nvdl" type="application/xml" schematypens="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"?>
<div xmlns="http://www.tei-c.org/ns/1.0" type="div1" xml:id="AI" n="15"><head>Simple Analytic Mechanisms</head>

<p>This chapter describes a module for associating simple analyses and
interpretations with text elements.  We use the term
<term>analysis</term> here to refer to any kind of semantic or
syntactic interpretation which an encoder wishes to attach to all or
part of a text. Examples discussed in this chapter include familiar
linguistic categorizations (such as <q>clause</q>, <q>morpheme</q>,
<q>part-of-speech</q> etc.) and  characterizations of narrative
structure (such as <q>theme</q>, <q>reconciliation</q> etc.). The
mechanisms presented in this chapter are simpler but less powerful
than those described in chapter <ptr target="#FS"/>.
	</p>
<p>Section <ptr target="#AILC"/> introduces elements which can be used
to  characterize
text segments according to the familiar linguistic categories of
<term>sentence</term> or <term>s-unit</term>, <term>clause</term>,
<term>phrase</term>, <term>word</term>, <term>morpheme</term>,
<term>character</term>, and <term>punctuation mark</term>. These elements represent special cases of the
generic <gi>seg</gi> element described in section <ptr target="#SASE"/>.</p>
<p>Section <ptr target="#AIATTS"/> introduces an additional global
attribute which allows passages of text to be associated with
specialized elements representing their interpretation. 
These <soCalled>interpretative</soCalled> elements (<gi>span</gi> and
<gi>interp</gi>) are described in detail in section <ptr target="#AISP"/>.
They allow the encoder to specify an analysis as a series of names and
associated values,<note place="bottom">Or, as they are widely known,
<term>attribute-value pairs</term>; this term should not be confused,
however, with XML attributes and their values, which are similar in
concept but distinct in their formal definitions.</note> each such pair
being linked to one or more stretches of text, either directly, in the
case of spans, or indirectly, in the case of interpretations.</p>
<p>Finally, section <ptr target="#AILA"/> revisits the topic of linguistic
analysis, and illustrates how these interpretative mechanisms may be
used to associate simple linguistic analysis with text segments.</p>

<div type="div2" xml:id="AILC"><head>Linguistic Segment Categories</head>
<p>In this section we introduce specialized <term>linguistic segment
category</term> elements which may be used to represent the segmentation of
a text into the traditional linguistic categories of
<term>sentence</term>, <term>clause</term>, <term>phrase</term>,
<term>word</term>, <term>morpheme</term>, 
<term>characters</term>, and <term>punctuation marks</term>.
</p>
<div type="div3" xml:id="AILCW"><head>Words and Above</head>
<p>Although different languages have very different rules about what
constitutes a <soCalled>word</soCalled> or a
<soCalled>sentence</soCalled>, these remain generally useful concepts.
In this section we discuss elements provided for marking up linguistic
units down to the word level, however defined. 
<specList><specDesc key="s"/><specDesc key="cl"/><specDesc key="phr"/><specDesc key="w"/>
</specList>
 </p>
<p>As members of the <ident type="class">att.segLike</ident> class, these
elements all share the following attribute:
<specList><specDesc key="att.segLike" atts="function"/></specList>
They also share attributes from <ident type="class">att.typed</ident>:
<specList><specDesc key="att.typed" atts="type subtype"/></specList>
</p>
<p>These elements are also all members of the <ident type="class">model.segLike</ident> class, which is a subclass of
<ident type="class">model.phrase</ident>. They may thus appear anywhere
that text is permitted within a document, when the module defined by
this chapter is included in a schema.</p>

<p>The <gi>w</gi> and <gi>pc</gi> elements belong to the <ident type="class">att.linguistic</ident> class, which supplies 
attributes that may be used for lightweight linguistic annotation (see section <ptr target="#AILALW"/> below): 
<specList><specDesc key="att.linguistic" atts="lemma lemmaRef pos msd join"/></specList></p>
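<p>As a minimal sketch of such lightweight annotation, the following fragment
attaches a lemma and a part-of-speech code to each word, and uses
<att>join</att> to record that the punctuation mark is not separated from the
token on its left. The values given for <att>pos</att> are assumed here purely
for illustration and do not represent a recommended tagset:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <!-- the pos values are illustrative assumptions only -->
  <w lemma="from" pos="ADP">from</w>
  <w lemma="whence" pos="ADV">whence</w>
  <w lemma="it" pos="PRON">it</w>
  <w lemma="come" pos="VERB">came</w>
  <pc join="left">;</pc>
</egXML>
</p>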
<p>These elements also have access to the <ident type="class">att.lexicographic.normalized</ident> class, 
which supplies the attributes <att>norm</att> and <att>orig</att>: the former for handling 
normalization/regularization at the word level, the latter for providing the original form if the element 
content is modernized or regularized. Note that these attributes are a local (word-level) alternative 
to the more robust mechanism that uses the <gi>choice</gi>, <gi>orig</gi>, and <gi>reg</gi> elements, 
discussed in section <ptr target="#COEDREG"/> and in chapter <ptr target="#TC"/>. The <gi>choice</gi>-based
mechanism is the default descriptive device, while the <att>norm</att> and <att>orig</att> attributes are used to
handle a subset of normalizations in linguistic contexts where a single sequence of tokens is a priority, for example 
in historical corpora subject to linguistic analysis. It should be stressed that the simplified attribute-based 
mechanism is not meant to be used for editorial interventions.
<note place="bottom">The <ident type="class">att.lexicographic.normalized</ident> class is also used in dictionary 
entries, as discussed in chapter <ptr target="#DI"/>.</note></p> 
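<p>The two attributes might be used as in the following sketch, in which the
early modern spelling <mentioned>haue</mentioned> is an assumed example rather
than a reading taken from a particular source. In the first token the content
preserves the original spelling and <att>norm</att> supplies the regularized
form; in the second the content has been regularized and <att>orig</att>
records the original spelling:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <!-- content keeps the source spelling; norm gives the regularized form -->
  <w norm="have">haue</w>
  <!-- content is regularized; orig records the source spelling -->
  <w orig="haue">have</w>
</egXML>
</p>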

<p>The <gi>s</gi> element may be used simply to segment a text
end-to-end  into a series of non-overlapping segments, referred to here
and elsewhere as <term>s-units</term>, or <term>sentences</term>.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-jj" source="#AI-BIBL-1"><p> 
  <s>Nineteen fifty-four, when I was eighteen years old,
    is held to be a crucial turning point in the history of
    the Afro-American — for the U.S.A. as a whole — the
    year segregation was outlawed by the U.S. Supreme Court.</s>
  <s>It was also a crucial year for me because on June 18,
    1954, I began serving a sentence in state prison for
    possession of marijuana.</s>
</p></egXML>

The <gi>s</gi> element is more restricted both in its content and its
usage than the generic <gi>seg</gi> element. The <gi>seg</gi> unit may
contain anything which can appear within a paragraph: thus it may be
used to enclose members of the <ident type="class">model.inter</ident>
class (such as <gi>bibl</gi> or <gi>list</gi>) as well as other phrase
elements; the <gi>s</gi> unit may only contain phrase-level elements
or text. Also, unlike <gi>seg</gi> elements, <gi>s</gi> elements
should not be nested within each other.<note place="bottom">Neither this
constraint nor the segmentation of the whole of the text
by <gi>s</gi> elements is required by the TEI
Guidelines.</note> The <gi>seg</gi> element is intended for
use as a generic segmentation element, the specific function of which
may be indicated by its <att>type</att> attribute; the other members
of the class are more specialized. Thus, the <gi>s</gi>, <gi>cl</gi>, and
<gi>phr</gi> elements may be thought of as equivalent to <tag>seg
type="s-unit"</tag>, <tag>seg
type="clause"</tag> and <tag>seg type="phrase"</tag>, respectively,
but with the above-mentioned restrictions.
</p>
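<p>By way of illustration, the second s-unit of the example above might be
recast using only the generic <gi>seg</gi> element, as in the following
sketch. The clause and phrase boundaries shown are assumptions made for the
purpose of illustration, not an analysis proposed by these Guidelines:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <seg type="s-unit">
    <seg type="clause">It was also a crucial year for me
      <seg type="clause">because on June 18, 1954, I began serving a sentence
        <seg type="phrase">in state prison</seg>
        for possession of marijuana.</seg>
    </seg>
  </seg>
</egXML>
</p>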
<p>The <gi>s</gi> element may be further subdivided into
<term>clauses</term>, marked with the <gi>cl</gi> element,
as in the following example:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-yo" source="#AI-BIBL-2"><p>
  <s>
    <cl>It was about the beginning of September, 1664,
      <cl>that I, among the rest of my neighbours,
        heard in ordinary discourse
        <cl>that the plague was returned again to Holland; </cl> </cl> </cl>
    <cl>for it had been very violent there, and particularly at
      Amsterdam and Rotterdam, in the year 1663, </cl>
    <cl>whither, <cl>they say,</cl> it was brought,
      <cl>some said</cl> from Italy, others from the Levant, among some goods
      <cl>which were brought home by their Turkey fleet;</cl> </cl>
    <cl>others said it was brought from Candia;
      others from Cyprus. </cl>
  </s>
  <s>
    <cl>It mattered not <cl>from whence it came;</cl> </cl>
    <cl>but all agreed <cl>it was come into Holland again.</cl> </cl>
  </s>
</p></egXML>
</p>
<p>Clauses may be further divided into <gi>phr</gi> elements in the same
way. A text may be segmented directly into clauses, or into
phrases, with no need to include segmentation at a higher level as well.
 </p>
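<p>For instance, the first clause of the second s-unit in the preceding
example might be divided into phrases as in the following sketch, which omits
the enclosing <gi>s</gi> element. The phrase boundaries represent one possible
analysis, assumed here for illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <cl>
    <phr>It</phr>
    <phr>mattered not</phr>
    <cl>
      <phr>from whence</phr>
      <phr>it</phr>
      <phr>came;</phr>
    </cl>
  </cl>
</egXML>
</p>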
<p>For verse texts, the overlapping of metrical and syntactic structure
requires that special care be given to representing both using an
element hierarchy. One simple approach is to split the syntactic phrases
into fragments when they cross verse boundaries, reuniting them 
with the <att>part</att> attribute:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-kc" source="#AI-BIBL-3"><div type="stanza">
  <l><cl part="I">Tweedledum and Tweedledee</cl></l>
  <l><cl part="F">Agreed to have a battle;</cl></l>
  <l><cl part="I">For Tweedledum said <cl part="I">Tweedledee</cl></cl></l>
  <l><cl part="F"><cl part="F">Had spoiled his nice new rattle.</cl></cl></l></div>
<div type="stanza">
  <l><cl part="I">Just then flew down a monstrous crow,</cl></l>
  <l><cl part="F">As black as a tar barrel;</cl></l>
  <l><cl part="I">Which frightened both the heroes so,</cl></l>
  <l><cl part="F"><cl>They quite forgot their quarrel.</cl></cl></l></div></egXML>

Another approach is to use the <att>next</att> and <att>prev</att>
attributes defined in the additional module for linking (chapter <ptr target="#SA"/>):
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-xu" source="#AI-BIBL-3"><l><cl next="#c5" xml:id="c3" part="I">For Tweedledum said
   <cl next="#c6" xml:id="c4" part="I">Tweedledee</cl></cl></l>
<l><cl prev="#c3" xml:id="c5" part="F">  
   <cl prev="#c4" xml:id="c6" part="F">Had spoiled his nice new rattle.</cl></cl></l></egXML>
Other methods are also possible; for discussion, see chapter <ptr target="#NH"/>.
 </p>
<p>The <att>type</att> attribute on linguistic segment categories can
be used to provide additional interpretative information about the
category. The <att>function</att> attribute on the <gi>cl</gi> and
<gi>phr</gi> elements can be used to provide additional information
about the function of the category. Legal values for these
two attributes are not defined by these Guidelines, but should be
documented in the <gi>segmentation</gi> element of the
<gi>encodingDesc</gi> element within the document's header. 
A general approach to the encoding of linguistic categories for 
parts of a text is discussed in section <ptr target="#AILA"/> below.
</p>
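<p>A header fragment documenting such values might look something like the
following sketch. Both the wording and the particular values listed are
assumptions made for the purpose of illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <encodingDesc>
    <segmentation>
      <!-- the values named below are illustrative, not prescribed -->
      <p>Clauses are marked with the cl element. Its type attribute
        distinguishes relative clauses from others; its function attribute
        takes values such as clause_modifier.</p>
      <p>Phrases are marked with the phr element. Its type attribute
        takes the values NP, VP, and PP.</p>
    </segmentation>
  </encodingDesc>
</egXML>
</p>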
<p>Using traditional terminology, these attributes provide a convenient
way of specifying, for example, that the clause <mentioned>from whence it
came</mentioned> is a relative clause modifying another, or that  the
phrase <mentioned>by the U.S. Supreme Court</mentioned> is a prepositional
post-modifier:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-ls"><cl>It mattered not
  <cl type="relative" function="clause_modifier">from whence it came;</cl>
</cl></egXML>
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-ud"><phr type="NP">the year segregation</phr>
<phr>was outlawed</phr>
<phr type="PP" function="postmodifier-agent">by the U.S. Supreme Court.</phr></egXML>
 </p>
<p>Segmentation into clauses and phrases can, of course, be combined.
Detailed encodings such as the following may, however, require careful
formatting if they are to be easily readable.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-rl"><p>
  <s>
    <cl type="finite-declarative" function="independent"> 
      <phr type="NP" function="subject">Nineteen fifty-four,
      <cl type="finite-relative-declarative" function="appositive">
	when <phr type="NP" function="subject">I</phr>
	<phr type="VP" function="predicate">was eighteen years old</phr>
	</cl></phr>,
	<phr type="VP" function="predicate">  
	  <phr type="V" function="verb-main">is held</phr>
	  <phr type="NP" function="complement"> 
	    <cl type="nonfinite" function="predicate-nom.">    
	      <phr type="V" function="copula">to be</phr>
	      <phr type="NP" function="predicate-nom.">a crucial turning point
	      <phr type="PP" function="postmodifier">in
	      <phr type="NP" function="prep.obj.">the history
	      <phr type="PP" function="postmodifier">of the Afro-American</phr>
	      </phr>
	      </phr>
	      —
	      <phr type="PP" function="postmodifier-appositive">for
	      <phr type="NP" function="prep.obj.">the U.S.A.
	      <phr type="PP" function="postmodifier">as a whole</phr>
	      </phr>
	      </phr>
	      </phr>
	      —
	      <phr type="NP" function="appositive-predicate-nom.">the year
	      <cl type="finite-relative" function="adjectival">      
		<phr type="NP" function="subject">segregation</phr>
		<phr type="VP" function="predicate">       
		  <phr type="V" function="verb-main">was outlawed</phr>
		  <phr type="PP" function="postmodifier">by the U.S. Supreme Court</phr>
  </phr></cl></phr></cl></phr></phr>.</cl></s>
  <s>
    <cl type="finite-declarative" function="independent"> 
      <phr type="NP" function="subject">It</phr>
      <phr type="VP" function="predicate">  
	<phr type="V" function="verb-main">was</phr>
	also
	<phr type="NP" function="predicate-nom.">a crucial year for me</phr>
      </phr>
      <cl type="declarative-finite" function="dependent-causative">because
      <phr type="PP" function="sentence_adverb">on June 18, 1954</phr>,
      <phr type="NP" function="subject">I</phr>
      <phr type="VP" function="predicate">
	<phr type="V" function="verb-main">began serving</phr>
	<phr type="NP" function="complement">a sentence in state prison
	<phr type="PP" function="complement">for possession of marijuana</phr>
</phr></phr></cl></cl></s>.</p></egXML></p>
<p>This style of markup may introduce spurious new lines and blanks
into the text. If the original layout is important, it should be
explicitly encoded, using such facilities as the <gi>lb</gi> element,
the global <att>rend</att> or <att>rendition</att> attributes, etc.
</p>
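<p>As a minimal sketch, line beginnings in the source might be recorded
alongside the linguistic segmentation as follows. The positions of the line
breaks are assumed here for illustration and do not reflect any particular
edition:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <s>
    <!-- each lb marks the start of a (hypothetical) line in the source -->
    <lb/><cl>It mattered not <cl>from whence it came;</cl></cl>
    <lb/><cl>but all agreed <cl>it was come into Holland again.</cl></cl>
  </s>
</egXML>
</p>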
<!-- JC: 2018-07-20 Note this paragraph and that below are significantly out of date! -->
<p>The <gi>w</gi>, <gi>m</gi>, and <gi>c</gi> elements are identical
in meaning to the <gi>seg</gi> element with a <att>type</att>
attribute of <q>w</q>, <q>m</q>, or <q>c</q> respectively, and may
occur wherever <gi>seg</gi> is permitted to occur. However, their
content is more constrained than <gi>seg</gi>: for example,
the <gi>w</gi> element should only contain <gi>w</gi>, <gi>m</gi>,
<gi>c</gi> elements or <gi>pc</gi> elements, or plain text; the <gi>m</gi> element should
contain only <gi>c</gi> or <gi>pc</gi> elements or plain text; both
the <gi>c</gi> and <gi>pc</gi> elements
should contain only plain text, most often only a single character or
a sequence of graphemes to be treated as a single
character. Consequently, while these more specific elements can be
translated directly into typed <gi>seg</gi> elements, the reverse is
not necessarily the case.
 </p>
<p>The restriction on the content of the <gi>w</gi> element in
particular requires that a certain care must be exercised when using it,
especially in relation to the use of other tags that one may think of as
<term>word level</term>, but which are in fact defined as <term>phrase
level</term>. Consider the problem of segmenting an occurrence of the
<gi>mentioned</gi> element as a word.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-cu"><mentioned>grandiloquent</mentioned></egXML>
The first of the following two encodings is legitimate; the second is
not, since the <gi>mentioned</gi> element is not part of the content
model of the <gi>w</gi> element:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-sw">
<mentioned><w>grandiloquent</w></mentioned></egXML>
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-sj" valid="false"><w xmlns=""><mentioned>grandiloquent</mentioned></w></egXML></p>
<p>On the other hand, both of the following encodings <emph>are</emph>
legitimate:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-bf"><mentioned>
   <phr>grandiloquent speech</phr>
</mentioned></egXML>
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-cz"><phr>
   <mentioned>grandiloquent speech</mentioned>
</phr></egXML>
The first encoding describes the citing of a phrase. The second
describes a phrase which consists of something mentioned.
<!-- added following brief plug for otherwise  unsung attributes -->
</p>
<p>The <gi>w</gi> element <!-- and <gi>m</gi> elements carry --> carries additional attributes
which may be of use in many indexing or analytic applications. The 
<att>lemma</att> attribute may be used to specify  the
<term>lemma</term>, that is the head- or uninflected form of an
inflected verb or noun, for example:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-yf" xml:lang="la"><s xml:lang="la">  
   <w lemma="timeo">timeo</w>
   <w lemma="danaii">Danaos</w>
   <w lemma="et">et</w>
   <w lemma="donum">dona</w>
   <w lemma="fero">ferentes</w>
</s></egXML>
</p>
<p>In some situations it may be more convenient to use the
<att>lemmaRef</att> pointer attribute than to supply an explicit
uninflected form. This attribute assumes the existence of a list of
uninflected forms, for example in an online lexicon, with which
individual <gi>w</gi> entries can be associated using the usual TEI
pointer mechanisms. Assuming that a
standardized lexicon for Latin is available at the location
<code>http://lexicon.org/latin.xml</code>, we might for example revise the above
example as:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILCW-egXML-un" xml:lang="la"><s xml:lang="la">  
   <w lemmaRef="http://lexicon.org/latin.xml#timeo">timeo</w>
   <w lemmaRef="http://lexicon.org/latin.xml#danaii">Danaos</w>
<!-- ... -->
</s></egXML></p>
</div>

<div xml:id="AIPC"><head>Below the Word Level</head>
<p>It is sometimes helpful to mark up explicitly sub-word components
such as morphemes, characters, or punctuation.
<specList><specDesc key="m"/><specDesc key="c"/><specDesc key="pc"/>
</specList>
 </p>

<p>The <gi>m</gi> element is used to mark up morphologically
identified segmentation below the word level. Analogous to the
<att>lemma</att> attribute for <gi>w</gi>, there is a
<att>baseForm</att> attribute for the <gi>m</gi> element, 
which may be used to indicate the <soCalled>base form</soCalled> of
an inflected morpheme; where appropriate, <gi>m</gi> elements may also
be organized hierarchically:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AIPC-egXML-ld"><w type="adjective">  
   <m type="base">
     <m type="prefix" baseForm="con">com</m>
     <m type="root">fort</m>
   </m>
   <m type="suffix">able</m>
</w></egXML>
</p>
<p>The distinction between <gi>m</gi> and <gi>w</gi> is provided as a
convenience only; it may not be appropriate for all linguistic
theories, nor is it meaningful in all languages. The intention is to
provide a means for those cases where it is considered helpful to distinguish
lexical from sub-lexical tokens, to complement the more general
mechanism already provided by the <gi>seg</gi> element, using which
the above example could alternatively be marked up as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AIPC-egXML-rm"><seg type="adjective">  
   <seg type="base">
     <seg type="prefix">com</seg>
     <seg type="morph">fort</seg>
   </seg>
   <seg type="suffix">able</seg>
</seg></egXML>
See section <ptr target="#AILALW"/> for an alternative to using <att>type</att> in such contexts.
</p>

<p>There is a substantial
linguistic difference between characters like letters or diacritics
and punctuation marks. The former are used to
construct meaningful units like morphemes or words. The latter are
functionally independent units acting at the level of syntactic
units. A word may consist of a single letter (for example <soCalled>I</soCalled> in English),
but this  does not mean that we should use <gi>c</gi> instead of <gi>w</gi>
to mark it up. </p>

<p>The <gi>c</gi> (character) element should be used to mark up any non-lexical
character, whether this appears within a word, or outside it. In the
following example, the encoder wishes to indicate that the letters are
not to be regarded as words:

    <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AIPC-egXML-ow" source="#TwN">
      <phr>
	<c>M</c>
	<c>O</c>
	<c>A</c>
	<c>I</c>
	<w>doth</w>
	<w>sway</w>
	<w>my</w>
	<w>life</w>
      </phr>
    </egXML>
</p>
<p>The <gi>c</gi> element may be used for
individual characters occurring within a <gi>w</gi> or <gi>m</gi>
element which it is desired to distinguish for some reason, as in the
following examples:
    <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AIPC-egXML-ro">
      <m baseForm="not">
	<c>n</c>
	<c type="punct">'</c>
	<c>t</c>
      </m>
    </egXML>
This encoding represents the constituents of a common abbreviation,
but does not indicate that it is in fact an abbreviation; the
<gi>am</gi> element (<ptr target="#PHAB"/>) may be preferred for the
latter purpose.  Generally speaking, the use of <gi>c</gi> to mark
non-lexical punctuation marks is deprecated, since the <gi>pc</gi>
element is provided specifically to distinguish these.
</p>

<p>The <gi>pc</gi> (punctuation character) element should be used to mark up
characters which are specifically regarded as providing punctuation,
rather than constituting parts of a word. It may be particularly
useful when transcribing older written materials, in which an encoding
of the original punctuation may be useful for interpretive or analytic
purposes, in much the same way as an encoding of the original
orthography may be. For example, in the following extract from
a Bodleian Library musical manuscript
<figure xml:id="AIPC-figure-sw">
<graphic url="Images/punctus.png"/>
</figure>
two different punctuation marks are used to distinguish kinds of pause
in the text. The <term>punctus elevatus</term> (which resembles an inverted
semicolon) is not a Unicode character, but may still be encoded using
the <gi>g</gi> element. As further described in chapter <ptr target="#WD"/>, this element points to a definition for the intended
character which may be stored either locally or elsewhere.
    <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AIPC-egXML-ue" source="#punctuseg">deus qui regis omnia
<pc><g ref="#pelev">;</g></pc> natus est in bethlehem
<pc>.</pc>o <pc>.</pc> mira gratia...
<!-- elsewhere -->
<char xml:id="pelev">
<!-- definition of the punctus elevatus character -->
</char>
</egXML>
</p>
<p>The <gi>pc</gi> element carries special attributes to record
analyses of the functional behaviour or classification of the
punctuation mark it contains. The <att>unit</att> attribute may be
used, as on the <gi>milestone</gi> element, to name the kind of unit
which the punctuation mark delimits, for example a paragraph or
section. The <att>pre</att> attribute may be used to indicate whether
the punctuation precedes or follows the unit it delimits. The
<att>force</att> attribute indicates the strength of the association
between the punctuation mark and its adjacent word. </p>
<p>In the following example, the paragraph marker (¶) has been tagged
as a strong punctuation mark, preceding the unit it marks, which is
named <soCalled>para</soCalled>:
    <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AIPC-egXML-rn">
      <p><pc unit="para" force="strong" pre="true">¶</pc>Incipit...</p>
    </egXML>
</p>

<p>A similar encoding can be used for hyphenation:
  
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AIPC-egXML-im" xml:space="preserve">A fire<pc force="strong">-</pc>proof vest is recom<pc force="weak">-</pc><lb/>
mended. 
</egXML>
  
    Refer to <ptr target="#COPU-2"/> for a discussion of the motivations for 
  explicitly recording the presence of hyphens.</p>


<p>The <gi>w</gi>, <gi>m</gi>, <gi>c</gi>, and <gi>pc</gi> elements can be used
together to give a fairly detailed low-level grammatical analysis of
text. For example, consider the following segmentation of the English
S-unit <mentioned>I didn't do it</mentioned>.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AIPC-egXML-dc"><w>I</w>
<w> 
   <m baseForm="do">did</m>
   <m>n't</m>
</w>
<w lemma="do">do</w>
<w>it</w>
<pc>.</pc></egXML>
<!-- shouldn't we attribute this to Bart Simpson? :-) -->
 </p>
<p>This segmentation, crude as it is, succeeds in representing the idea
that <mentioned>did</mentioned> occurring  as a morphological
component of  the word
<mentioned>didn't</mentioned> has something in common with the word <mentioned>do</mentioned>. A further advantage of segmenting the text down
to this level is that it becomes relatively simple to associate each
such segment with a more detailed formal analysis, for example by
providing a baseform, or morphological analysis at whichever level is appropriate. 
This matter is taken up in detail in section <ptr target="#AILA"/>. 
 </p>


<specGrp xml:id="DAILC" n="Linguistic Segment Categories">
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/s.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/cl.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/phr.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/w.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/m.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/c.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/pc.xml"/>
</specGrp>
<specGrp xml:id="DAILA" n="Linguistic Word-level Attributes">
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/att.lexicographic.normalized.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/att.linguistic.xml"/>
</specGrp>
</div>
</div>
<div type="div2" xml:id="AIATTS"><head>Global Attributes for Simple Analyses</head>
<p>When the module described by this chapter is selected, an
additional attribute is defined for all elements:
<specList><specDesc key="att.global.analytic" atts="ana"/></specList>
The <att>ana</att> attribute may be specified for any element.
Its effect is to associate the element with one or more others
representing an analysis or interpretation of it. Its target should be
one of the elements described in section <ptr target="#AISP"/> below,
or some other interpretative element such as <gi>note</gi> (on which 
see section <ptr target="#CONO"/>) or <gi>fs</gi>
(on which see chapter <ptr target="#FS"/>). If a hierarchical form of classification 
is desired, it may instead point to a <gi>category</gi> element at a suitable level in a
<gi>taxonomy</gi> (see <ptr target="#HD55"/>).</p>
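<p>The following sketch shows both possibilities: a clause associated with an
<gi>interp</gi> element, and a phrase associated with a <gi>category</gi> in a
taxonomy. All identifiers, and the descriptions attached to them, are
hypothetical and are supplied only for illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
  <cl ana="#AIegRelCl">from whence it came;</cl>
  <phr ana="#AIegCatPP">by the U.S. Supreme Court</phr>
  <!-- elsewhere in the document -->
  <interp xml:id="AIegRelCl" type="syntax">relative clause</interp>
  <!-- in the header -->
  <taxonomy xml:id="AIegPhraseTypes">
    <category xml:id="AIegCatPP">
      <catDesc>prepositional phrase</catDesc>
    </category>
  </taxonomy>
</egXML>
</p>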
<specGrp xml:id="DAIGA" n="Global attribute for analysis">
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/att.global.analytic.xml"/>
    </specGrp>
</div>
<div type="div2" xml:id="AISP"><head>Spans and Interpretations</head>
<p>The simplest mechanisms for attaching analytic notes in some
structured vocabulary to particular passages of text are provided by the
<gi>span</gi> and <gi>interp</gi> elements, and their associated
grouping elements <gi>spanGrp</gi> and <gi>interpGrp</gi>.
<specList><specDesc key="span"/><specDesc key="spanGrp"/><specDesc key="interp"/><specDesc key="interpGrp"/></specList></p>
<!--
    When Stylesheets bug #370 is fixed we plan to move @type from
    <interp>, <interpGrp>, <span>, and <spanGrp> into att.interpLike.
    When that happens, the next bit should be changed to:

<p>These elements are all members of the class <ident type="class">att.interpLike</ident>, and thus share the following attributes:
<specList><specDesc key="att.interpLike" atts="inst type"/></specList>

    —Syd, 2020-10-30
-->
<p>These elements are all members of the class <ident type="class">att.interpLike</ident>, and thus share the following attribute:
<specList><specDesc key="att.interpLike" atts="inst"/></specList>
  They also inherit the following attributes from <ident type="class">att.global.responsibility</ident>:
  <specList>
<specDesc key="att.global.responsibility" atts="cert resp"/></specList>
</p>
<p>The <att>type</att>  attribute of the
<gi>span</gi> and <gi>interp</gi> elements may be used to indicate
that the annotations are of specific types, for example thematic or
structural. The annotation itself is supplied as the content of the
<gi>span</gi> or <gi>interp</gi> element. 
In the case of the <gi>span</gi> element, the span of text being
annotated is indicated by values of the <att>from</att>,
<att>to</att> or <att>target</att> attributes, used in combination as
follows. If only the <att>from</att> attribute is supplied, then the
span is coterminous with the element indicated by its value; if both
<att>from</att> and <att>to</att> are supplied, the span runs from the
start of the element indicated by the <att>from</att> attribute up to
the end of the element indicated by the <att>to</att> attribute; if
the <att>target</att> attribute is used, the span is defined by
aggregating the contents of the (possibly non-contiguous) elements pointed to by its values. It
is an error to supply only the <att>to</att> attribute; to supply more
than one pointer value for either <att>to</att> or <att>from</att>
attributes; or to supply either of these in conjunction with the
<att>target</att> attribute. 
In the case of <gi>interp</gi> (see below), the span is indicated by a
pointer from a <gi>link</gi> element or some similar mechanism.  The
<att>resp</att> attribute indicates the annotator responsible for this annotation.
</p>
<p>The <gi>span</gi> element provides a simple way of indicating such
features as phrasal verbs in a linguistic analysis, as in this
example:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-ei">
<s><w>What</w><w>did</w><w>you</w><w xml:id="mk01">make</w><w xml:id="up01">up</w></s>
<span from="#mk01" to="#up01">phrasal verb "make up"</span>
</egXML>
Here the two components of the span follow each other, so the
<att>to</att> and <att>from</att> attributes may be used. The
same effect could however be achieved by using the <att>target</att>
attribute:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-oc">
<s><w>What</w><w>did</w><w>you</w><w xml:id="mk02">make</w><w xml:id="up02">up</w></s>
<span target="#mk02 #up02">phrasal verb "make up"</span>
</egXML>
This second approach might be cumbersome if the number of components
to be combined is very large. It is however essential if the 
components do not follow each other, as in this example:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-bd">
<s><w>Did</w><w>you</w><w xml:id="mk03">make</w><w>it</w><w xml:id="up03">up</w></s>
<span target="#mk03 #up03">phrasal verb "make up"</span>
</egXML>
</p>
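<p>When the annotated passage coincides with a single element, the
<att>from</att> attribute alone is sufficient, since the span is then
coterminous with the element indicated. In the following sketch the
identifier, the value of <att>type</att>, and the annotation itself are
assumptions made for illustration:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<s><w xml:id="wh04">What</w><w>did</w><w>you</w><w>make</w><w>up</w></s>
<!-- only from is given, so the span covers exactly the element wh04 -->
<span type="linguistic" from="#wh04">interrogative pronoun</span>
</egXML>
</p>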
<p>The <gi>span</gi> element can be used for any kind of
annotation. In this example it is used in a narratological analysis:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-bq" source="#AI-BIBL-4"><p xml:id="MaQp1s2p114">
<s xml:id="MaQp1s2p114s1">There was certainly a definite point at which the
  thing began.</s>
<s xml:id="MaQp1s2p114s2">It was not; then it was suddenly inescapable,
  and nothing could have frightened it away.</s>
<s xml:id="MaQp1s2p114s3">There was a slow integration, during which she,
  and the little animals, and the moving grasses, and the sun-warmed
  trees, and the slopes of shivering silvery mealies, and the great
  dome of blue light overhead, and the stones of earth under her feet,
  became one, shuddering together in a dissolution of dancing
  atoms.</s>
<s xml:id="MaQp1s2p114s4">She felt the rivers under the ground forcing
  themselves painfully along her veins, swelling them out in an
  unbearable pressure; her flesh was the earth, and suffered growth
  like a ferment; and her eyes stared, fixed like the eye of the
  sun.</s>
<s xml:id="MaQp1s2p114s5">Not for one second longer (if the terms for time
  apply) could she have borne it; but then, with a sudden movement
  forwards and out, the whole process stopped; and <emph rend="italic">that</emph> was <soCalled rend="dquo">the
  moment</soCalled> which it was impossible to remember
  afterwards.</s>
<span from="#MaQp1s2p114s3" to="#MaQp1s2p114s5">the moment</span>
<s xml:id="MaQp1s2p114s6">For during that space of time (which was
  timeless) she understood quite finally her smallness, the
  unimportance of humanity.</s>
</p></egXML>
 </p>
<p>The <gi>span</gi> element may, as in this example, be placed in the
text near the textual span it is associated with. Alternatively, it  may be placed
elsewhere in the same or a different document. Where several
<gi>span</gi> or <gi>interp</gi> elements share the same attributes,
for example having the same responsibility or type, it may be
convenient to group them within a <gi>spanGrp</gi> or <gi>interpGrp</gi> element as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-bs"><spanGrp resp="#DTL">
  <span from="#MaQp1s2p114s3" to="#MaQp1s2p114s5">the moment</span>
  <!-- other spans identified by DTL here -->
</spanGrp></egXML>
 </p>
<p>Spans may also be used to represent structural divisions within 
a narrative, particularly when these do not coincide with the
hierarchy implied by the element structure. Consider the following narrative:
<q rend="display">
<p>Sigmund, the son of Volsung, was a king in Frankish country.
Sinfiotli was the eldest of his sons, the second was Helgi, the
third Hamund.
Borghild, Sigmund's wife, had a brother named —
But Sinfiotli, her stepson, and — both wooed the same woman
and Sinfiotli killed him over it.<note place="bottom">The rule marks spaces
left for the missing name in the manuscript.</note>
And when he came home, Borghild asked him to go away,
but Sigmund offered her weregild, and she was obliged to accept it.
At the funeral feast Borghild was serving beer.  She took poison, a big
drinking horn full, and brought it to Sinfiotli.  When Sinfiotli looked
into the horn, he saw that poison was in it, and said to Sigmund <q>This
drink is cloudy, old man.</q> Sigmund took the horn and drank it off.
It is said that Sigmund was hardy and that poison did him no harm,
inside or out.  And all his sons could tolerate poison on their skin.
Borghild brought another horn to Sinfiotli, and asked him to drink, and
everything happened as before.  And a third time she brought him a horn,
and reproachful words as well, if he didn't drink from it.  He spoke
again to Sigmund as before.  He said <q>Filter it through your mustache,
son!</q> Sinfiotli drank it off and at once fell dead.
</p>
<p>Sigmund carried him a long way in his arms and came to a long,
narrow fjord, and there was a small boat there and a man in it.  He
offered to ferry Sigmund over the fjord.  But when Sigmund carried the
body out to the boat, it was fully laden.  The man said Sigmund should
go around the fjord inland.  The man pushed the boat out and then
suddenly vanished.
</p>
<p>King Sigmund lived a long time in Denmark in the kingdom of
Borghild, after he married her.  Then he went south to Frankish lands,
to the kingdom he had there.  Then he married Hiordis, the daughter of
King Eylimi.  Their son was Sigurd.  King Sigmund fell in a battle with
the sons of Hunding.  And then Hiordis married Alf, the son of King
Hialprec.  Sigurd grew up there as a boy.
</p>
<p>Sigmund and all his sons were tall and outstanding in their
strength, their growth, their intelligence, and their accomplishments.
But Sigurd was the most outstanding of all, and everyone who knows about
the old days says he was the most outstanding of men and the noblest of
all the warrior kings.</p></q>
 </p>
<p>A structural analysis of this text, dividing it into narrative units
in a pattern shared with other texts from the same literature, might
look like this:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-br" source="#AI-eg-01"><p xml:id="P1">
<s xml:id="S1">Sigmund ... was a king in Frankish country.</s>
<s xml:id="S2">Sinfiotli was the eldest of his sons.</s>
<s xml:id="S3">Borghild, Sigmund's wife, had a brother ...</s>
<s xml:id="S4A">But Sinfiotli ... wooed the same woman</s>
<s xml:id="S4B">and Sinfiotli killed him over it.</s>
<s xml:id="S5">And when he came home, ... she was obliged to accept it.</s>
<s xml:id="S6">At the funeral feast Borghild was serving beer.</s>
<s xml:id="S7">She took poison ... and brought it to Sinfiotli.</s>
<s xml:id="S17">Sinfiotli drank it off and at once fell dead.</s>
<anchor xml:id="EOS17"/>
</p>
<p xml:id="P2">Sigmund carried him a long way in his arms ... </p>
<p xml:id="P3">King Sigmund lived a long time in Denmark ... </p>
<p xml:id="P4">Sigmund and all his sons were tall ... </p>
<spanGrp resp="#TMA" type="narrative-structure">
 <span from="#S1" to="#S3">introduction</span>
 <span from="#S4A">conflict</span>
 <span from="#S4B">climax</span>
 <span from="#S5" to="#S17">revenge</span>
 <span from="#EOS17">reconciliation</span>
 <span from="#P2" to="#P4">aftermath</span>
</spanGrp></egXML>
</p>
<p>Note the use of an empty <gi>anchor</gi> element to provide a target for
the <soCalled>reconciliation</soCalled> unit which is normally part of
the narrative pattern but which is not realized in the text shown.
 </p>
<!--<p>If groups of <gi>span</gi> elements with the same <att>resp</att> or
<att>type</att> are used, as in this example, they may be grouped
together inside a <gi>spanGrp</gi> element, with the values of the
common attribute(s) inherited from the higher element, as follows.
 -->
<p>The same analysis may be expressed with the <gi>interp</gi> element
instead of the <gi>span</gi> element; this element provides attributes
for recording an interpretive category and its value, as well as the
identity of the interpreter, but does not itself indicate which passage
of text is being interpreted; the same interpretive structures can thus
be associated with many passages of the text.  The association between
text passages and <gi>interp</gi> elements should be made either by
pointing from the text to the <gi>interp</gi> element with the
<att>ana</att> attribute defined in section <ptr target="#AIATTS"/>, or by
pointing at both text and interpretation from a <gi>link</gi> element,
<!-- If the associations must be made by @ana or <link>, then why does <interp> have an @inst? -sb, 2016-07-20 -->
as described in chapter <ptr target="#SA" type="div1"/>.
 </p>
<p>To encode the narratological example above using <gi>interp</gi>, it is
necessary to create a text element which contains, or corresponds to, the third, fourth, and fifth orthographic sentences (S-units) in
the paragraph.  This can be done either with the <gi>seg</gi> element,
described in <ptr target="#SASE" type="div2"/>, or the <gi>join</gi>
element, described in <ptr target="#SAAG" type="div2"/>.  The resulting
element can then be associated with the <gi>interp</gi> element using the
<att>ana</att> attribute described in section <ptr target="#AIATTS" type="div1"/>.  We illustrate using the <gi>seg</gi> element.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-vp"><p xml:id="MarQp1s2p114">
<s xml:id="MarQp1s2p114s1">There was certainly a definite point ... </s>
<s xml:id="MarQp1s2p114s2">It was not; then it was suddenly inescapable ... </s>
<seg xml:id="MarQp1s2p114s3-5" ana="#moment">
<s xml:id="MarQp1s2p114s3">There was a slow integration ... </s>
<s xml:id="MarQp1s2p114s4">She felt the rivers under the ground ... </s>
<s xml:id="MarQp1s2p114s5">Not for one second longer ... </s>
</seg>
<s xml:id="MarQp1s2p114s6">For during that space of time ... </s>
</p>
<interp xml:id="moment">the moment</interp></egXML>
 </p>
<p>The Sigmund example above can be recoded using <gi>interp</gi> and
<gi>interpGrp</gi> tags in a similar manner. The interpretation
itself can be expressed in an <gi>interpGrp</gi> element, which would
replace the <gi>spanGrp</gi> in the example shown above:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-zb"><interpGrp resp="#TMA" type="structuralunit">
        <interp xml:id="INTRO">introduction</interp>
        <interp xml:id="CONFLICT">conflict</interp>
        <interp xml:id="CLIMAX">climax</interp>
        <interp xml:id="REVENGE">revenge</interp>
        <interp xml:id="RECONCIL">reconciliation</interp>
        <interp xml:id="AFTERM">aftermath</interp>
</interpGrp></egXML>
 </p>
<p>Any of these <gi>interp</gi> elements may be linked to the text either
by means of the <att>ana</att> attribute, or by means of <gi>link</gi>
elements.  Using the <att>ana</att> attribute (on <gi>seg</gi> elements
introduced specifically for this purpose), the text would be encoded as
follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-fm"><p xml:id="PP1">
<seg xml:id="SS1-SS3" ana="#INTRO">
<s xml:id="SS1">Sigmund ... was a king in Frankish country.</s>
<s xml:id="SS2">Sinfiotli was the eldest of his sons.</s>
<s xml:id="SS3">Borghild, Sigmund's wife, had a brother ... </s>
</seg>
<s xml:id="SS4A" ana="#CONFLICT">But Sinfiotli ... wooed the same woman</s>
<s xml:id="SS4B" ana="#CLIMAX">and Sinfiotli killed him over it.</s>
<seg xml:id="SS5-SS17" ana="#REVENGE">
<s xml:id="SS5">And when he came home, ... she was obliged to accept it.</s>
<s xml:id="SS6">At the funeral feast Borghild was serving beer.</s>
<s xml:id="SS17">Sinfiotli drank it off and at once fell dead.</s>
</seg></p>
<anchor xml:id="NIL1" ana="#RECONCIL"/>
<p xml:id="PP2">Sigmund carried him a long way in his arms ... </p>
<p xml:id="PP3">King Sigmund lived a long time in Denmark ... </p>
<p xml:id="PP4">Sigmund and all his sons were tall ... </p>
<join xml:id="PP2-PP4" target="#PP2 #PP3 #PP4" ana="#AFTERM"/></egXML>
 </p>
<p>The linkage may also be accomplished using a <gi>linkGrp</gi> element,
whose content is a set of <gi>link</gi> elements which point to each
interpretive element and its corresponding text unit.  This method does
not require the use of the <att>ana</att> attribute on the text
units.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AISP-egXML-mj" source="#UND"><linkGrp targFunc="interpretation text">
  <link target="#INTRO    #SS1-SS3"/>
  <link target="#CONFLICT #SS4A"/>
  <link target="#CLIMAX   #SS4B"/>
  <link target="#REVENGE  #SS5-SS17"/>
  <link target="#RECONCIL #NIL1"/>
  <link target="#AFTERM   #PP2-PP4"/>
</linkGrp></egXML>
 </p>
<p>One obvious advantage of using <gi>interp</gi> rather than
<gi>span</gi> elements for the Sigmund text is that the <gi>interp</gi>
elements can be reused for marking up other texts in the same document,
whereas the <gi>span</gi> elements cannot.  <!--Another is that the
<gi>interp</gi> element can be used to provide interpretations for
discontinuous text elements (represented by <gi>join</gi> elements).  -->On
the other hand, the use of <gi>interp</gi> elements may require the
creation of special text elements not otherwise needed (e.g. the
<gi>seg</gi> and the <gi>join</gi> in the revised encoding of the text),
whereas the use of <gi>span</gi> elements does not.
 </p>
<specGrp xml:id="DAISP" n="Spans">
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/span.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/spanGrp.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/interp.xml"/>
<include xmlns="http://www.w3.org/2001/XInclude" href="../../Specs/interpGrp.xml"/>
</specGrp>
</div>
<div type="div2" xml:id="AILA"><head>Linguistic Annotation</head>
<p>By <term>linguistic annotation</term> we mean here any annotation
determined by an analysis of linguistic features of the text, excluding
as borderline cases both the formal structural properties of the text
(e.g. its division into chapters or paragraphs) and descriptive
information about its context (the circumstances of its production, its
genre or medium).  The structural properties of any TEI-conformant text
should be represented using the structural elements discussed elsewhere
in this chapter and in chapters <ptr target="#CO"/>, <ptr target="#DS"/>, 
<ptr target="#VE"/>, <ptr target="#DR"/>, <ptr target="#TS"/>, <ptr target="#DI"/>, 
and <ptr target="#CC"/>. The contextual
properties of a TEI text are fully documented in the TEI header, which
is discussed in chapter <ptr target="#HD"/>, and in section <ptr target="#CCAH"/>.
 </p>
<p>Other forms of linguistic annotation may be applied at a number of
levels in a text.  A code (such as a word-class or part-of-speech
code) may be associated with each word or token, or with groups of such
tokens, which may be continuous, discontinuous, or nested.  A code may
also be associated with relationships (such as cohesion) perceived as
existing between distinct parts of a text.  The codes themselves may
stand for discrete and non-decomposable categories, or they may represent
highly articulated bundles of textual features.  Their function may be
to place the annotated part of the text somewhere within a narrowly
linguistic or discoursal domain of analysis, or within a more general
semantic field, or any combination drawn from these and other domains.
 </p>
<p>The manner by which such annotations are generated and attached to
the text may be entirely automatic, entirely manual or a mixture.  The
ease and accuracy with which analysis may be automated may vary with the
level at which the annotation is attached.  The method employed should
be documented in the <gi>interpretation</gi> element within the encoding
description of the TEI header, as described in section <ptr target="#HD53"/>.
Where different parts of a language corpus have used
different annotation methods, the <att>decls</att>
attribute may be used to indicate the fact, as further
discussed in section <ptr target="#CCAS"/>.
</p>
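<p>As a minimal sketch of such documentation, the following header fragment records two
annotation methods, one of which is then selected for a particular text by means of
<att>decls</att>. The identifiers <val>ED-auto</val> and <val>ED-manual</val> and the
wording of the descriptions are invented for illustration only:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<encodingDesc>
  <!-- hypothetical declarations: identifiers and prose are assumptions -->
  <editorialDecl xml:id="ED-auto" default="true">
    <interpretation>
      <p>Word-class codes were assigned automatically and have not been checked by hand.</p>
    </interpretation>
  </editorialDecl>
  <editorialDecl xml:id="ED-manual">
    <interpretation>
      <p>Word-class codes were assigned automatically and then corrected manually.</p>
    </interpretation>
  </editorialDecl>
</encodingDesc>
<!-- ... -->
<text decls="#ED-manual">
  <!-- a text whose annotation was corrected manually -->
</text>
</egXML>
</p>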
  <div type="div3" xml:id="AILAGD"><head>Linguistic Annotation by Means of Generic TEI Devices</head>
<p>As one example of such types of analysis, consider the following
sentence, taken from the Lancaster/IBM Treebank
Project (<ptr target="#AI-BIBL-5"/>).
 <q rend="display">The victim's friends told police that Kruger drove
into the quarry and never surfaced.</q> </p> <p>Our discussion focuses
on the way that this sentence might be analysed using the CLAWS system
developed at the University of Lancaster, but exactly the same
principles may be applied to a wide variety of other systems.<note place="bottom">For the word-class tagging method used by CLAWS see
<ptr target="#AI-BIBL-6"/>; 
for an overview of the system see <ptr type="cit" target="#AI-BIBL-7"/>. The example sentence was processed
using an online version of the CLAWS tagger at <ptr target="http://ucrel.lancs.ac.uk/claws/"/>.</note>
Output from the system consists of a segmented and tokenized version
of the text, in which word class codes have been associated with each
token. CLAWS offers outputs in a variety of non-XML and XML formats:
for example, the simplest format for the sample sentence would be:
<eg xml:space="preserve"><![CDATA[The_AT0 victim_NN1 's_POS friends_NN2 told_VVD police_NN2 that_CJT Kruger_NP0 
drove_VVD into_PRP the_AT0 quarry_NN1 and_CJC never_AV0 surfaced_VVD]]></eg></p>
<p>This may be easily transformed into an equivalent TEI XML representation:

<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILAGD-egXML-ld"><s><w ana="#AT0">The </w> 
<w ana="#NN1">victim</w><w ana="#POS">'s</w> 
<w ana="#NN2">friends </w> <w ana="#VVD">told </w> 
<w ana="#NN2">police </w> <w ana="#CJT">that </w> 
<w ana="#NP0">Kruger </w> <w ana="#VVD">drove </w> <w ana="#PRP">into </w> 
<w ana="#AT0">the </w> <w ana="#NN1">quarry </w> 
<w ana="#CJC">and </w> <w ana="#AV0">never </w> 
<w ana="#VVD">surfaced</w></s></egXML> 

Although the names used for the attribute values here may have some
significance for the human reader (<val>AT0</val> for
<term>article</term>, <val>NN1</val> for <term>singular noun</term>,
<val>NN2</val> for <term>plural noun</term>, etc.) they are
arbitrary codes, used in this case as pointers to other elements which
define their significance more precisely.  If the codes are considered
to be <term>atomic</term>, then the <gi>interp</gi> element described
in section <ptr target="#AISP"/> might be used to supply brief definitions
in the header:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILAGD-egXML-go">
<interpGrp type="POS">
 <interp xml:id="AT0">Definite article</interp>
 <interp xml:id="AV0">Adverb</interp>
 <interp xml:id="CJC">Conjunction</interp>
 <interp xml:id="CJT">Relative that</interp>
 <interp xml:id="NN1">Noun singular</interp>
 <interp xml:id="NN2">Noun plural</interp>
 <interp xml:id="NP0">Proper noun</interp>
 <interp xml:id="POS">Genitive marker</interp>
 <interp xml:id="PRP">Preposition</interp>
 <interp xml:id="VVD">Verb past tense</interp>
</interpGrp>
</egXML> 

If the codes are considered to
be compositional (for example, in that <val>NN1</val> and <val>NN2</val>
have something in common, namely their <term>noun-ness</term>, which
they do not share with, say, <val>VVD</val>), then this
compositionality may be most clearly expressed using a mechanism based
on the <gi>fs</gi> element defined in chapter <ptr target="#FS"/>.
</p>
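<p>A minimal sketch of such feature-structure definitions is given below. The feature names
(<val>class</val>, <val>number</val>, <val>tense</val>), their values, and the identifiers
are assumptions made for illustration and are not taken from the CLAWS documentation:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- hypothetical decomposition of three word-class codes -->
<fs xml:id="fsNN1" type="wordClass">
  <f name="class"><symbol value="noun"/></f>
  <f name="number"><symbol value="singular"/></f>
</fs>
<fs xml:id="fsNN2" type="wordClass">
  <f name="class"><symbol value="noun"/></f>
  <f name="number"><symbol value="plural"/></f>
</fs>
<fs xml:id="fsVVD" type="wordClass">
  <f name="class"><symbol value="verb"/></f>
  <f name="tense"><symbol value="past"/></f>
</fs>
<!-- a word points at a feature structure in the same way as at an interp -->
<w ana="#fsNN2">friends</w>
</egXML>
In such an encoding, the value <val>noun</val> of the <val>class</val> feature states
explicitly what the two nominal codes have in common.
</p>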
<p>This approach requires the text to be fully segmented, using the
linguistic segment elements described in section <ptr target="#AILC"/>, so that the scope of the <att>ana</att> attribute
used to point to each interpretation is clearly defined. A further
analysis into phrase and clause elements can be superimposed on the
word and morpheme tagging in the preceding illustration. For example,
CLAWS provides the following constituent analysis of the sample
sentence (the word class codes have been deleted):
<eg xml:space="preserve"><![CDATA[[N [G The victim's G] friends N] [V told [N police N] [Fn that 
[N Kruger N] [V [V& drove [P into [N the quarry N]P]V&] and 
[V+ never surfaced V+]V]Fn]V]]]></eg></p>
<p>Treating the labels on the brackets as phrase or clause
interpretations, this analysis of the structure of the example sentence
can be combined with the word class analysis and represented as follows
(the symbol <val>V&amp;</val>, representing the first part of a coordinate
phrase, has been replaced by <val>V1</val>, and <val>V+</val>, representing the
second part, has been replaced by <val>V2</val>).
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILAGD-egXML-xd"><s type="sentence">
   <phr ana="#no">  
      <phr ana="#gn">    
         <w ana="#AT0">The</w>
         <w ana="#NN1">victim</w>
         <m ana="#POS">'s</m>
      </phr>
      <w ana="#NN2">friends</w>
   </phr>
   <phr ana="#v">  
      <w ana="#VVD">told</w>
      <phr ana="#no">
         <w ana="#NN2">police</w>
      </phr>
      <cl ana="#fn">    
         <w ana="#CJT">that</w>
         <phr ana="#no">
            <w ana="#NP0">Kruger</w>
         </phr>
         <phr ana="#v">      
            <phr ana="#v1">        
               <w ana="#VVD">drove</w>
               <phr ana="#pr">          
                  <w ana="#PRP">into</w>
                  <phr ana="#no">            
                     <w ana="#AT0">the</w>
                     <w ana="#NN1">quarry</w>
                  </phr>
               </phr>
            </phr>
            <w ana="#CJC">and</w>
            <phr ana="#v2">        
               <w ana="#AV0">never</w>
               <w ana="#VVD">surfaced</w>
            </phr>
         </phr>
      </cl>
   </phr>
   <c ana="#pun">.</c>
</s></egXML>
 </p>
<p>This approach requires the definition of further <gi>interp</gi>
(or <gi>fs</gi>) elements to provide targets for the pointers used to
represent the constituent analysis:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILAGD-egXML-hs">
<interpGrp type="constituentFunction">
 <interp xml:id="v2">coordinate  continuation</interp>
 <interp xml:id="v">verbal</interp>
 <interp xml:id="no">nominal</interp>
 <interp xml:id="gn">genitive</interp>
 <interp xml:id="fn">finite clause</interp>
 <interp xml:id="pr">prepositional</interp>
 <interp xml:id="v1">coordinate  start</interp>
</interpGrp>
</egXML></p>

<p>Alternatively, a <soCalled>stand-off</soCalled> representation for
these analyses might be created using the <gi>linkGrp</gi> element.
In this case, each linguistic segment to be annotated must be supplied with its own
<att>xml:id</att> attribute:

<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILAGD-egXML-iy"><s>
<w xml:id="word-1">The</w> 
<w xml:id="word-2">victim</w> 
<w xml:id="word-3">'s</w> <w xml:id="word-4">friends</w> 
<w xml:id="word-5">told</w> <w xml:id="word-6">police</w> 
<w xml:id="word-7">that</w> <w xml:id="word-8">Kruger</w> 
<w xml:id="word-9">drove</w> <w xml:id="word10">into</w> 
<w xml:id="word11">the</w> <w xml:id="word12">quarry</w> 
<w xml:id="word13">and</w> <w xml:id="word14">never</w> 
<w xml:id="word15">surfaced</w></s></egXML> 

Each segment-interpretation pair may now be represented by means of a
<gi>link</gi> element inside an appropriate <gi>linkGrp</gi> element:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILAGD-egXML-ga" source="#UND"><linkGrp type="POS-annotation">
<link target="#word-1 #AT0"/>
<link target="#word-2 #NN1"/>
<link target="#word-3 #POS"/>
<link target="#word-4 #NN2"/>
<link target="#word-5 #VVD"/>
<link target="#word-6 #NN2"/>
<!--... -->
</linkGrp>
</egXML>
</p>
<p>Each linguistic segment so far discussed has been well-behaved with
respect to the basic document hierarchy, having only a single parent.
Moreover, the segmentation has been complete, in that each part of the
text is accounted for by some segment at each level of analysis, without
discontinuities or overlap.  This state of affairs does not of
course apply in all types of analysis, and these Guidelines provide a
number of mechanisms to support the representation of discontinuities or
multiple analyses.  A brief overview of these facilities is provided in
chapter <ptr target="#NH"/>; also see <ptr target="#SA"/>.  These mechanisms
all depend to a greater or lesser degree on the use of pointing
elements of various kinds.
</p>
</div>
<div type="div3" xml:id="AILALW"><head>Lightweight Linguistic Annotation</head>
<p>While these Guidelines offer a variety of means to add linguistic information to textual
units and much of that has been presented above, two kinds of use cases and two groups of
users call for a dedicated set of specialized attributes to carry linguistic
information. One relevant use case is where basic linguistic information gets added to an
existing resource, in which generic attributes such as <att>type</att> or <att>ana</att>
have already been used to encode other categorizations and analyses. The other group of
users and use cases involves corpus linguists and resources built from scratch as lightly
annotated language corpora. In projects of the latter kind, energy and person-hours are not
devoted to careful literary analysis and hand-encoding of the relevant phenomena, but rather
to the analysis of the completed resources, and therefore the phase of resource-building must be
quick and relatively effortless, requiring minimal structural markup, well-established
containers for grammatical information, and a standardized way of filling them in.</p>

<p>The aims defined above can be realized by means of lightweight linguistic annotation using
attributes that belong to the <ident type="class">att.linguistic</ident> class: <specList>
<specDesc key="att.linguistic" atts="lemma pos msd join"/>
</specList></p>
    
<p>The essence of lightweight linguistic annotation is that the basic grammatical information
is encapsulated at the word level, together with the orthographic shape of the word. This
has clear advantages for automatic processing but, on the other hand, this form of data
encapsulation also imposes restrictions on the extent of information that can be encoded,
essentially limiting it to a single tokenization and lemmatization schema, a single tagset,
and a subset of the possible analyses (out of potentially many guesses at the
part-of-speech or morphosyntactic descriptions, single values have to fit into the existing
attributes). Another important principle that this kind of annotation is sensitive to is the
need for (near) homomorphism between the assumed tokenization (division of the text stream
into minimal units) and the division into minimal syntactic units (<term>word forms</term>,
in the terminology of the ISO Morpho-syntactic Annotation Framework (MAF), ISO 24611<note place="bottom">All 
definitions contained within ISO standards can be accessed at the ISO Online Browsing Platform. For ISO MAF, see 
<ptr target="https://www.iso.org/obp/ui#iso:std:iso:24611:ed-1:v1:en"/>.</note>), because it is the former that results
from the process of tokenization, but the latter that can be lemmatized and meaningfully
described by means of grammatical features. Where tokens are only minimally mismatched with
word forms, various repair strategies can be used (e.g., recursing <gi>w</gi> to capture
multi-token compounds or using <ident type="class">att.fragmentable</ident> to point at
disjoint tokens). Beyond that, more robust TEI mechanisms, based on standoff principles and
feature structures, should replace lightweight annotation.</p>
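<p>The two repair strategies just mentioned might, for instance, be sketched as follows. The
tag values, the lemmas, and the choice of examples are assumptions made for illustration
only. The first listing nests <gi>w</gi> elements so that a multi-token compound is treated
as a single word form:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- hypothetical annotation: one word form realized by two tokens -->
<w lemma="New York" pos="NP0"><w>New</w> <w>York</w></w>
</egXML>
The second uses the <att>part</att> attribute supplied by
<ident type="class">att.fragmentable</ident> to mark the initial and final fragments of a
discontinuous word form, here a German separable verb; note that <att>part</att> does not
by itself state which fragments belong together:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<s>
<!-- hypothetical annotation: "fangen ... an" realizes the single lemma "anfangen" -->
<w pos="PPER" lemma="wir">Wir</w>
<w pos="VVFIN" lemma="anfangen" part="I">fangen</w>
<w pos="ADV" lemma="morgen">morgen</w>
<w pos="PTKVZ" lemma="anfangen" part="F">an</w>
<pc join="left">.</pc>
</s>
</egXML>
</p>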
    <!-- I think it would be best to reference Bański, Haaf, and Mueller (2018) at this point, to avoid an in-depth discussion -->

<p>The basic grammatical information encoded by means of 
<ident type="class">att.linguistic</ident> is sufficient for the purpose of enhancing queries or improving
the analysis of search results by, for example, making it possible to distinguish between
the noun <mentioned>cut</mentioned> and the identically spelled verb
<mentioned>cut</mentioned> in English, and further between e.g. the present-tense form of
<mentioned>cut</mentioned> and its past-tense or past-participial forms. For the former
contrast, the part-of-speech (<att>pos</att>) attribute should be used, whereas the latter
may use <att>pos</att> and/or <att>msd</att> attributes, depending on the annotation
vocabulary adopted for the project in question. The various grammatical realizations of a
single <q>dictionary word</q> can be captured by means of the attribute <att>lemma</att>, which
provides a common label for them. For example, English verbs are typically lemmatized as the
base form (also called <term>bare infinitive</term>), so the value of <att>lemma</att> for
the verbal forms <mentioned>write</mentioned>, <mentioned>writes</mentioned>,
<mentioned>wrote</mentioned>, <mentioned>written</mentioned>, and
<mentioned>writing</mentioned> is typically <val>write</val>.</p>
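<p>By way of illustration, these distinctions might be recorded as follows. The
<att>pos</att> values are drawn from the CLAWS-5 tagset; the choice of that tagset, and the
decision to express tense through <att>pos</att> alone rather than through <att>msd</att>,
are assumptions made for this sketch:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- the noun and the verb "cut" share a lemma but differ in pos -->
<w lemma="cut" pos="NN1">cut</w>
<w lemma="cut" pos="VVB">cut</w>
<!-- the past-tense form of the verb is spelled identically -->
<w lemma="cut" pos="VVD">cut</w>
<!-- distinct forms grouped under a single lemma -->
<w lemma="write" pos="VVZ">writes</w>
<w lemma="write" pos="VVD">wrote</w>
<w lemma="write" pos="VVN">written</w>
<w lemma="write" pos="VVG">writing</w>
</egXML>
</p>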
    
<p>Together with the span-delimiting elements mentioned in this section, such as <gi>s</gi>,
<gi>cl</gi>, or <gi>phr</gi>, lightweight grammatical annotation may be used to build
basic syntactic constituency structures, where hierarchical information is expressed through
span containment rather than by relations among tree nodes. This is however the limit of
this kind of annotation: for the purpose of describing true constituency or dependency
syntactic structures, one needs to turn to more robust mechanisms offered by the TEI, which
may involve graph description (see chapter <ptr target="#GD"/>) or standoff techniques (see
section <ptr target="#SASO"/>), and where grammatical labels may need to be annotated by
means of feature structures (see chapter <ptr target="#FS"/>).</p>
    
<p>Some of the above-mentioned robust methods will also prove handy in cases where more than one tagset 
(label inventory) is used to label the words, or where automatic morphological analysis yields multiple 
possibilities (for example, the form <mentioned>cutting</mentioned> is morphologically ambiguous between 
verbal, adjectival, and nominal) and needs to be followed by (often also automatic) disambiguation in 
morphosyntactic contexts, with varying probabilities that may also need to be recorded together with their 
corresponding part-of-speech and morphosyntactic values.</p>
    
<p>It should be borne in mind that tokenization, lemmatization, part-of-speech identification, and 
morphosyntactic labelling, especially when performed automatically, should in most cases be seen as 
involving pragmatic decisions, dictated by concrete practical goals, economy of description, or the 
demands of particular analytic and/or visualization tools. It comes therefore as no surprise that 
numerous alternative (and often conflicting) lemmatization strategies and tagsets exist, in use by 
various communities and various tools, and that they change with time (a case in point is the CLAWS 
tagset for English, with several versions that merge the part-of-speech and morphosyntactic information 
to various degrees).
<note place="bottom">Given that the English language has relatively poor inflectional
morphology, the decision to merge part-of-speech symbols with morphosyntactic features (as
in, e.g., CLAWS-7, where the value <val>PPHO1</val> signals the 3rd person singular objective personal
pronoun) is fully justified as the most economical approach. For languages with more
robust inflection, the <att>pos</att> and <att>msd</att> attributes will either be used
separately, or the part-of-speech information will be merged into the morphosyntactic
description.</note> The nature and description of these systems are outside the scope of these 
Guidelines, but it has to be stressed that all the strategies adopted for linguistic annotation, 
even at the lightweight level of complexity, <emph>must</emph> be documented in the header of the 
given electronic resource, not only for the purpose of guaranteeing successful data interpretation and exchange, but 
also for the sake of sustainability of the results of the given project.</p>
    
<p>The last of the <ident type="class">att.linguistic</ident> attributes, <att>join</att>, has the most text-technological
flavour. It can be used to compensate for the loss of whitespace-related information in non-inline
markup.</p> 
<p>Compare the following two listings. The first difference between them is in the
tagset used (CLAWS-5 vs. CLAWS-7) and only serves to exemplify the need to document the
choice of descriptive vocabulary in the header, lest the encoded information be unreadable or
confusing. The second difference lies in the treatment of inter-token
whitespace, and it is here that the <att>join</att> attribute proves indispensable.</p>
    
<p>The first example listing uses CLAWS-5 and inline annotation, where whitespace serves as
part of the markup:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILALW-egXML-qn" xml:space="preserve">
<s><w pos="AT0">The</w> <w pos="NN1">victim</w><w pos="POS">'s</w> <w pos="NN2">friends</w> 
   <w pos="VVD">told</w> <w pos="NN2">police</w> <w pos="CJT">that</w> <w pos="NP0">Kruger</w> 
   <w pos="VVD">drove</w> <w pos="PRP">into</w> <w pos="AT0">the</w> <w pos="NN1">quarry</w> 
   <w pos="CJC">and</w> <w pos="AV0">never</w> <w pos="VVD">surfaced</w><pc pos="PUN">.</pc></s>
</egXML></p>
    
<p>In the second example, the attribute <att>join</att> is the only way to encode whether two
tokens are adjacent or not: 
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILALW-egXML-sb"><s>
<w pos="AT">The</w>
<w pos="NN1">victim</w>
<w pos="GE" join="left">'s</w>
<w pos="NN2">friends</w>
<w pos="VVD">told</w>
<w pos="NN2">police</w>
<w pos="CST">that</w>
<w pos="NP1">Kruger</w>
<w pos="VVD">drove</w>
<w pos="II">into</w>
<w pos="AT">the</w>
<w pos="NN1">quarry</w>
<w pos="CC">and</w>
<w pos="RR">never</w>
<w pos="VVD">surfaced</w>
<pc pos="." join="left">.</pc></s></egXML></p>
<p>Note that projects will need to decide whether they want to redundantly encode full
information on the adjacency of each token (in which case, the above listing should also
have <att>join</att> with the value <val>right</val> on the tokens
<mentioned>victim</mentioned> and <mentioned>surfaced</mentioned>), or whether information
on a single direction of adjacency is enough. Strategies vary, and it is important to
document them in the TEI header.</p>
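<p>As a sketch of the redundant strategy, assuming that adjacency is recorded on both tokens of
every adjacent pair and that all other tokens are left unmarked, the same sentence might be
encoded as follows:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><s>
<w pos="AT">The</w>
<!-- adjacency is stated on both members of each adjacent pair -->
<w pos="NN1" join="right">victim</w>
<w pos="GE" join="left">'s</w>
<w pos="NN2">friends</w>
<w pos="VVD">told</w>
<w pos="NN2">police</w>
<w pos="CST">that</w>
<w pos="NP1">Kruger</w>
<w pos="VVD">drove</w>
<w pos="II">into</w>
<w pos="AT">the</w>
<w pos="NN1">quarry</w>
<w pos="CC">and</w>
<w pos="RR">never</w>
<w pos="VVD" join="right">surfaced</w>
<pc pos="." join="left">.</pc></s></egXML>
Under this assumption a processor can reconstruct the original whitespace from either member of
a pair, at the cost of having to keep the two values consistent.</p>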
<p>The following example shows a German sentence <mentioned>Wir fahren in den
Urlaub.</mentioned> (<q>We're going on vacation.</q>) annotated with all the attributes discussed
above.<note place="bottom">The annotation values have been adapted from the <ref target="https://weblicht.sfs.uni-tuebingen.de/weblicht/">CLARIN Weblicht service</ref>,
where e.g. the full morphosyntactic description of the first item reads: <code><![CDATA[[cat pronoun,
personal true, substituting true, person 1, case nominative, number plural]]]></code>, and has been
mapped from a sequence of attribute-value pairs suitable for feature structure notation, into a
compressed form that fits inside a single attribute value.</note>
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILALW-egXML-ab"><s>
<w pos="PPER" lemma="wir" msd="pers:subst:p1:nom:pl">Wir</w>
<w pos="VVFIN" lemma="fahren" msd="p1:pl:pres:ind">fahren</w>
<w pos="APPR" lemma="in" msd="--">in</w>
<w pos="ART" lemma="d" msd="def:acc:sg:masc">den</w>
<w pos="NN" lemma="Urlaub" msd="acc:sg:masc">Urlaub</w>
<pc pos="$." lemma="." msd="--" join="left">.</pc>
</s>
</egXML>
</p>
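<p>For comparison, the full morphosyntactic description of the first item, as quoted in the
note above, could be expanded into a feature structure of the kind described in chapter <ptr target="#FS"/>.
The following is only a sketch: the <att>type</att> value <val>msd</val> and the choice
between <gi>symbol</gi>, <gi>binary</gi>, and <gi>numeric</gi> for each feature value are
assumptions made for illustration, not part of the source annotation:
<egXML xmlns="http://www.tei-c.org/ns/Examples"><fs type="msd">
  <!-- expansion of msd="pers:subst:p1:nom:pl" on the token "Wir" -->
  <f name="cat"><symbol value="pronoun"/></f>
  <f name="personal"><binary value="true"/></f>
  <f name="substituting"><binary value="true"/></f>
  <f name="person"><numeric value="1"/></f>
  <f name="case"><symbol value="nominative"/></f>
  <f name="number"><symbol value="plural"/></f>
</fs></egXML></p>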
<p>The final examples lay out a strategy for dealing with, for example, historical corpora, where it is
important on the one hand to maintain a steady stream of token-level elements (<gi>w</gi> and
<gi>pc</gi>) for efficient processing and, on the other hand,
either to record the original spelling (when the corpus text is normalized) or to record the
normalized variants (when the element content of the corpus preserves the original
spelling). The attribute class <ident type="class">att.lexicographic.normalized</ident> can be used for that purpose:
<specList><specDesc key="att.lexicographic.normalized" atts="norm orig"/></specList></p>
<p>The first fragment below comes from "Gottfried, Newe Welt Vnd Americanische Historien. Frankfurt/M., 1631" 
encoded in the Deutsches Textarchiv and records normalized forms in the <att>norm</att> attribute.
  <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILALW-egXML-du">
    <w norm="unvermutete">vnuermuthete</w>
    <w norm="Freundschaft">Freundſchafft</w>
    <w norm="angeboten">angebotten</w>
  </egXML></p>  
<p>The following example comes from the EarlyPrint project and uses the attribute <att>orig</att> to 
record the original spelling (note that the <att>xml:id</att> attributes have been removed for the 
sake of readability).
  <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILALW-egXML-lo">
    <w lemma="he" pos="pns">he</w>
    <w lemma="have" pos="vvz">hath</w>
    <w lemma="bring" pos="vvn">brought</w>
    <w lemma="forth" pos="av" orig="sorth">forth</w>
  </egXML></p>
</div>
<div type="div3" xml:id="AILASP"><head>Spoken Text</head>
<p>The mechanisms proposed in this chapter may also be used to encode
analyses of an entirely different kind, for example discourse function.
Here is an application of the span technique to record details of a sales
transaction in a spoken text.
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="AILASP-egXML-ok" source="#CONAAB-eg-150"><u xml:id="u1">Can I have ten oranges and a kilo of bananas please?</u>
<u xml:id="u2">Yes, anything else?</u>
<u xml:id="u3">No thanks.</u>
<u xml:id="u4">That'll be dollar forty.</u>
<u xml:id="u5">Two dollars</u>
<u xml:id="u6">Sixty, eighty, two dollars. Thank you.</u>
<spanGrp type="transactions">
   <span from="#u1">sale request</span>
   <span from="#u2" to="#u3">sale compliance</span>
   <span from="#u4">sale</span>
   <span from="#u5">purchase</span>
   <span from="#u6">purchase closure</span>
</spanGrp></egXML>
For further discussion of the <gi>u</gi> (utterance) element and other
elements recommended for transcriptions of spoken language,
see chapter <ptr target="#TS"/>.
</p></div>
</div>
<div>
  <head>Module for Analysis and Interpretation</head>
  <p>The module described in this chapter makes available the
  following components:
  <moduleSpec xml:id="DAI" ident="analysis">
    <idno type="FPI">Analysis and Interpretation</idno>
    <desc xml:lang="en" versionDate="2006-09-13">Simple analytic mechanisms</desc>
    <desc xml:lang="fr" versionDate="2018-07-12">Mécanismes analytiques simples</desc>
    <desc xml:lang="zh-TW" versionDate="2018-07-12">簡易分析機制</desc>
    <desc xml:lang="it" versionDate="2018-07-12">Semplici meccanismi di analisi</desc>
    <desc xml:lang="pt" versionDate="2018-07-12">Mecanismos simples de análise</desc>
    <desc xml:lang="ja" versionDate="2018-07-12">分析モジュール</desc>
  </moduleSpec>
  The selection and combination of modules to form a TEI schema is
  described in <ptr target="#STIN"/>.
  <specGrpRef target="#DAIGA"/>
  <specGrpRef target="#DAISP"/>
  <specGrpRef target="#DAILC"/>
  </p>
</div>
</div>
