Simple Analytic Mechanisms

This chapter describes a module for associating simple analyses and interpretations with text elements. We use the term analysis here to refer to any kind of semantic or syntactic interpretation which an encoder wishes to attach to all or part of a text. Examples discussed in this chapter include familiar linguistic categorizations (such as clause, morpheme, part-of-speech etc.) and characterizations of narrative structure (such as theme, reconciliation etc.). The mechanisms presented in this chapter are simpler but less powerful than those described in chapter .

Section introduces elements which can be used to characterize text segments according to the familiar linguistic categories of sentence or s-unit, clause, phrase, word, morpheme, character, and punctuation mark. These elements represent special cases of the generic seg element described in section .

Section introduces an additional global attribute which allows passages of text to be associated with specialized elements representing their interpretation. These interpretative elements (span and interp) are described in detail in section . They allow the encoder to specify an analysis as a series of names and associated values (or, as they are widely known, attribute-value pairs; this term should not be confused, however, with XML attributes and their values, which are similar in concept but distinct in their formal definitions), each such pair being linked to one or more stretches of text, either directly, in the case of spans, or indirectly, in the case of interpretations.

Finally section revisits the topic of linguistic analysis, and illustrates how these interpretative mechanisms may be used to associate simple linguistic analysis with text segments.

Linguistic Segment Categories

In this section we introduce specialized linguistic segment category elements which may be used to represent the segmentation of a text into the traditional linguistic categories of sentence, clause, phrase, word, morpheme, character, and punctuation mark.

Words and Above

Although different languages have very different rules about what constitutes a word or a sentence, these remain generally useful concepts. In this section we discuss elements provided for marking up linguistic units down to the word level, however defined.

As members of the att.segLike class, these elements all share the following attribute: They also share attributes from att.typed:

These elements are also all members of the model.segLike class, which is a subclass of model.phrase. They may thus appear anywhere that text is permitted within a document, when the module defined by this chapter is included in a schema.

The w and pc elements belong to the att.linguistic class, which supplies attributes that may be used for lightweight linguistic annotation (see section below):

Additionally, these elements also have access to the att.lexicographic.normalized class, which supplies the attributes norm and orig: the former for handling normalization/regularization at the word level, the latter providing the original form if the element content is modernized or regularized. Note that these attributes are a local (word-level) alternative to the robust mechanism that uses the choice, orig, and reg elements, discussed in section and in chapter . The choice-based mechanism is the default descriptive device, while the norm and orig attributes are used to handle a subset of normalizations in linguistic contexts where a single sequence of tokens is a priority, for example in historical corpora subject to linguistic analysis. It needs to be stressed that the simplified attribute-based mechanism is not meant to be used for editorial interventions. The att.lexicographic.normalized class is also used in dictionary entries, as discussed in chapter .

The s element may be used simply to segment a text end-to-end into a series of non-overlapping segments, referred to here and elsewhere as s-units, or sentences.

Nineteen fifty-four, when I was eighteen years old, is held to be a crucial turning point in the history of the Afro-American — for the U.S.A. as a whole — the year segregation was outlawed by the U.S. Supreme Court. It was also a crucial year for me because on June 18, 1954, I began serving a sentence in state prison for possession of marijuana.
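A minimal sketch of how the passage above might be segmented end-to-end with the s element. The markup is reconstructed from the prose for illustration only; the first sentence is abbreviated.

```xml
<p>
  <s>Nineteen fifty-four, when I was eighteen years old, is held to be a
    crucial turning point ...</s>
  <s>It was also a crucial year for me because on June 18, 1954, I began
    serving a sentence in state prison for possession of marijuana.</s>
</p>
```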

The s element is more restricted both in its content and its usage than the generic seg element. The seg unit may contain anything which can appear within a paragraph: thus it may be used to enclose members of the model.inter class (such as bibl or list) as well as other phrase elements; the s unit may only contain phrase-level elements or text. Also, unlike seg elements, s elements should not be nested within each other. (Neither this constraint, nor the requirement that the whole of the text be segmented by s elements, is required by the TEI Guidelines.) The seg element is intended for use as a generic segmentation element, the specific function of which may be indicated by its type attribute; the other members of the class are more specialized. Thus, the s, cl, and phr elements may be thought of as equivalent to seg type="s-unit", seg type="clause" and seg type="phrase", respectively, but with the above-mentioned restrictions.

The s element may be further subdivided into clauses, marked with the cl element, as in the following example:

It was about the beginning of September, 1664, that I, among the rest of my neighbours, heard in ordinary discourse that the plague was returned again to Holland; for it had been very violent there, and particularly at Amsterdam and Rotterdam, in the year 1663, whither, they say, it was brought, some said from Italy, others from the Levant, among some goods which were brought home by their Turkey fleet; others said it was brought from Candia; others from Cyprus. It mattered not from whence it came; but all agreed it was come into Holland again.
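As a hedged illustration, the second sentence of the passage might be divided into clauses as follows. The division shown is one possible analysis, chosen here for brevity, and is not quoted from an authoritative encoding.

```xml
<s>
  <cl>It mattered not from whence it came;</cl>
  <cl>but all agreed it was come into Holland again.</cl>
</s>
```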

Clauses may be further divided into phr elements in the same way. A text may be segmented directly into clauses, or into phrases, with no need to include segmentation at a higher level as well.

For verse texts, the overlapping of metrical and syntactic structure requires that special care be given to representing both using an element hierarchy. One simple approach is to split the syntactic phrases into fragments when they cross verse boundaries, reuniting them with the part attribute:

Tweedledum and Tweedledee Agreed to have a battle; For Tweedledum said Tweedledee Had spoiled his nice new rattle.

Just then flew down a monstrous crow, As black as a tar barrel; Which frightened both the heroes so, They quite forgot their quarrel.
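A sketch of the fragment-and-reunite approach for the first stanza, assuming each clause happens to span exactly two verse lines. The part values I (initial) and F (final) mark the two fragments of each clause; the choice of cl rather than phr is an assumption made for this illustration.

```xml
<lg>
  <l><cl part="I">Tweedledum and Tweedledee</cl></l>
  <l><cl part="F">Agreed to have a battle;</cl></l>
  <l><cl part="I">For Tweedledum said Tweedledee</cl></l>
  <l><cl part="F">Had spoiled his nice new rattle.</cl></l>
</lg>
```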

Another approach is to use the next and prev attributes defined in the additional module for linking (chapter ): For Tweedledum said Tweedledee Had spoiled his nice new rattle. Other methods are also possible; for discussion, see chapter .
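The next and prev approach might look like the following sketch. The xml:id values cl-a and cl-b are hypothetical identifiers invented for this illustration.

```xml
<l><cl xml:id="cl-a" next="#cl-b">For Tweedledum said Tweedledee</cl></l>
<l><cl xml:id="cl-b" prev="#cl-a">Had spoiled his nice new rattle.</cl></l>
```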

The type attribute on linguistic segment categories can be used to provide additional interpretative information about the category. The function attribute on the cl and phr elements can be used to provide additional information about the function of the category. Legal values for these two attributes are not defined by these Guidelines, but should be documented in the segmentation element of the encodingDesc element within the document's header. A general approach to the encoding of linguistic categories for parts of a text is discussed in section below.

Using traditional terminology, these attributes provide a convenient way of specifying, for example, that the clause from whence it came is a relative clause modifying another, or that the phrase by the U.S. Supreme Court is a prepositional post-modifier: It mattered not from whence it came; the year segregation was outlawed by the U.S. Supreme Court.
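A possible encoding of the two fragments just mentioned is sketched below. Since the Guidelines do not define legal values for type and function, the values used here (relative, clause_modifier, prepositional, post_modifier) are assumptions and would need to be documented in the header.

```xml
<s>It mattered not
  <cl type="relative" function="clause_modifier">from whence it came</cl>;</s>

<cl>the year segregation was outlawed
  <phr type="prepositional" function="post_modifier">by the U.S. Supreme Court</phr></cl>
```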

Segmentation into clauses and phrases can, of course, be combined. Such detailed encodings as the following may, however, require careful formatting if they are to be easily readable.

Nineteen fifty-four, when I was eighteen years old, is held to be a crucial turning point in the history of the Afro-American — for the U.S.A. as a whole — the year segregation was outlawed by the U.S. Supreme Court. It was also a crucial year for me because on June 18, 1954, I began serving a sentence in state prison for possession of marijuana.

This style of markup may introduce spurious new lines and blanks into the text. If the original layout is important, it should be explicitly encoded, using such facilities as the lb element, the global rend or rendition attributes, etc.

The w, m, and c elements are identical in meaning to the seg element with a type attribute of w, m, or c respectively, and may occur wherever seg is permitted to occur. However, their content is more constrained than seg: for example, the w element should only contain w, m, c elements or pc elements, or plain text; the m element should contain only c or pc elements or plain text; both the c and pc elements should contain only plain text, most often only a single character or a sequence of graphemes to be treated as a single character. Consequently, while these more specific elements can be translated directly into typed seg elements, the reverse is not necessarily the case.

The restriction on the content of the w element in particular requires that a certain care must be exercised when using it, especially in relation to the use of other tags that one may think of as word level, but which are in fact defined as phrase level. Consider the problem of segmenting an occurrence of the mentioned element as a word. grandiloquent The first of the following two encodings is legitimate; the second is not, since the mentioned element is not part of the content model of the w element: grandiloquent grandiloquent

On the other hand, both of the following encodings are legitimate: grandiloquent speech grandiloquent speech The first encoding describes the citing of a phrase. The second describes a phrase which consists of something mentioned.
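The contrast drawn in the two preceding paragraphs can be sketched as follows. The markup is reconstructed from the prose rather than quoted, so it should be read as an illustration of the content-model restriction only.

```xml
<!-- Legitimate: the w element sits inside mentioned -->
<mentioned><w>grandiloquent</w></mentioned>

<!-- Not legitimate: mentioned is not part of the content model of w -->
<w><mentioned>grandiloquent</mentioned></w>

<!-- Both legitimate: the citing of a phrase ... -->
<mentioned><phr>grandiloquent speech</phr></mentioned>
<!-- ... and a phrase which consists of something mentioned -->
<phr><mentioned>grandiloquent speech</mentioned></phr>
```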

The w element carries additional attributes which may be of use in many indexing or analytic applications. The lemma attribute may be used to specify the lemma, that is the head- or uninflected form of an inflected verb or noun, for example: timeo Danaos et dona ferentes
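A sketch of how the lemma attribute might be applied to this phrase. The lemma values shown are the conventional Latin dictionary forms and are supplied here as an assumption, not taken from the original example.

```xml
<s>
  <w lemma="timeo">timeo</w>
  <w lemma="Danai">Danaos</w>
  <w lemma="et">et</w>
  <w lemma="donum">dona</w>
  <w lemma="fero">ferentes</w>
</s>
```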

In some situations it may be more convenient to use the lemmaRef pointer attribute than to supply an explicit uninflected form. This attribute assumes the existence of a list of uninflected forms, for example in an online lexicon, with which individual w entries can be associated using the usual TEI pointer mechanisms. Assuming that a standardized lexicon for Latin is available at the location http://lexicon.org/latin.xml, we might for example revise the above example as: timeo Danaos
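A sketch of the lemmaRef variant. The lexicon location is the one assumed in the text; the fragment identifiers (#timeo, #Danai) are hypothetical, since the internal structure of that lexicon is not specified.

```xml
<w lemmaRef="http://lexicon.org/latin.xml#timeo">timeo</w>
<w lemmaRef="http://lexicon.org/latin.xml#Danai">Danaos</w>
```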

Below the Word Level

It is sometimes helpful to mark up sub-word components such as morphemes, characters, or punctuation explicitly.

The m element is used to mark up morphologically identified segmentation below the word level. Analogous to the lemma attribute for w, there is a baseForm attribute for the m element, which may be used to indicate the base form of an inflected morpheme; where appropriate, m elements may also be organized hierarchically: com fort able
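One way the word comfortable might be analysed with nested m elements is sketched below. The type values (base, prefix, root, suffix) and the baseForm value con are assumptions chosen for illustration.

```xml
<w><m type="base"><m type="prefix" baseForm="con">com</m><m type="root">fort</m></m><m type="suffix">able</m></w>
```

The elements are written without intervening white space so that the markup does not introduce spaces into the word.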

The distinction between m and w is provided as a convenience only; it may not be appropriate for all linguistic theories, nor is it meaningful in all languages. The intention is to provide a means for those cases where it is considered helpful to distinguish lexical from sub-lexical tokens, to complement the more general mechanism already provided by the seg element, using which the above example could alternatively be marked up as follows: com fort able See section for an alternative to using type in such contexts.
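The seg-based alternative might be sketched as follows, using the type values w and m that the text gives as equivalents of the w and m elements. The flat (non-hierarchical) arrangement of morphemes is a simplification made for this sketch.

```xml
<seg type="w"><seg type="m">com</seg><seg type="m">fort</seg><seg type="m">able</seg></seg>
```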

There is a substantial linguistic difference between characters like letters or diacritics and punctuation marks. The former are used to construct meaningful units like morphemes or words. The latter are functionally independent units acting at the level of syntactic units. A word may consist of a single letter (for example I in English), but this does not mean that we should use c instead of w to mark it up.

The c (character) element should be used to mark up any non-lexical character, whether this appears within a word, or outside it. In the following example, the encoder wishes to indicate that the letters are not to be regarded as words: M O A I doth sway my life
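A sketch of this example, assuming the line is wrapped in an s element and that the remaining tokens are tagged as words; both choices are illustrative assumptions.

```xml
<s><c>M</c> <c>O</c> <c>A</c> <c>I</c> <w>doth</w> <w>sway</w> <w>my</w> <w>life</w></s>
```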

The c element may be used for individual characters occurring within a w or m element which it is desired to distinguish for some reason, as in the following examples: n ' t This encoding represents the constituents of a common abbreviation, but does not indicate that it is in fact an abbreviation; the am element () may be preferred for the latter purpose. Generally speaking, the use of c to mark non-lexical punctuation marks is deprecated, since the pc element is provided specifically to distinguish these.
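A sketch of a c element distinguishing the apostrophe inside a morpheme. The enclosing m element is an assumption; the original example's wrapper is not recoverable from the text.

```xml
<m>n<c>'</c>t</m>
```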

The pc (punctuation character) element should be used to mark up characters which are specifically regarded as providing punctuation, rather than constituting parts of a word. It may be particularly useful when transcribing older written materials, in which an encoding of the original punctuation may be useful for interpretive or analytic purposes, in much the same way as an encoding of the original orthography may be. For example, in the following extract from a Bodleian Library musical manuscript

two different punctuation marks are used to distinguish kinds of pause in the text. The punctus elevatus (which resembles an inverted semicolon) is not a Unicode character, but may still be encoded using the g element. As further described in chapter , this element points to a definition for the intended character which may be stored either locally or elsewhere. deus qui regis omnia ; natus est in bethlehem .o . mira gratia...

The pc element carries special attributes to record analyses of the functional behaviour or classification of the punctuation mark it contains. The unit attribute may be used, as on the milestone element, to name the kind of unit which the punctuation mark delimits, for example a paragraph or section. The pre attribute may be used to indicate whether the punctuation precedes or follows the unit it delimits. The force attribute indicates the strength of the association between the punctuation mark and its adjacent word.

In the following example, the paragraph marker (¶) has been tagged as a strong punctuation mark, preceding the unit it marks, which is named para:

¶Incipit...
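The description above corresponds to an encoding along these lines. The enclosing p element is an assumption made so that the fragment has a container.

```xml
<p><pc force="strong" unit="para" pre="true">¶</pc>Incipit...</p>
```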

A similar encoding can be used for hyphenation: A fire-proof vest is recom- mended. Refer to for a discussion of the motivations for explicitly recording the presence of hyphens.
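A hedged sketch of how the two hyphens might be distinguished. It assumes that the hyphen in fire-proof is treated as a strong mark and the end-of-line hyphen in recom-mended as a weak one, with an lb element carrying break="no" to show that the line break does not separate words; these choices are an interpretation, not a quotation of the original example.

```xml
<s>A fire<pc force="strong">-</pc>proof vest is recom<pc force="weak">-</pc>
<lb break="no"/>mended.</s>
```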

The w, m, c, and pc elements can be used together to give a fairly detailed low-level grammatical analysis of text. For example, consider the following segmentation of the English S-unit I didn't do it. I did n't do it .
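A sketch of the segmentation just described. The baseForm value on the first morpheme is an assumption added to show one way the relationship between did and do could be made explicit.

```xml
<s>
  <w>I</w>
  <w><m baseForm="do">did</m><m>n't</m></w>
  <w>do</w>
  <w>it</w><pc>.</pc>
</s>
```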

This segmentation, crude as it is, succeeds in representing the idea that did occurring as a morphological component of the word didn't has something in common with the word do. A further advantage of segmenting the text down to this level is that it becomes relatively simple to associate each such segment with a more detailed formal analysis, for example by providing a baseform, or morphological analysis at whichever level is appropriate. This matter is taken up in detail in section .

Global Attributes for Simple Analyses

When the module described by this chapter is selected, an additional attribute is defined for all elements: The ana attribute may be specified for any element. Its effect is to associate the element with one or more others representing an analysis or interpretation of it. Its target should be one of the elements described in the section below, or some other interpretative element such as note, on which see section or fs, on which see chapter . If a hierarchical form of classification is desired, then it may point to a category element at a suitable level in a taxonomy; see .

Spans and Interpretations

The simplest mechanisms for attaching analytic notes in some structured vocabulary to particular passages of text are provided by the span and interp elements, and their associated grouping elements spanGrp and interpGrp.

These elements are all members of the class att.interpLike, and thus share the following attribute: They also inherit the following attributes from att.global.responsibility:

The type attribute of the span and interp elements may be used to indicate that the annotations are of specific types, for example thematic or structural. The annotation itself is supplied as the content of the span or interp element. In the case of the span element, the span of text being annotated is indicated by values of the from, to or target attributes, used in combination as follows. If only the from attribute is supplied, then the span is coterminous with the element indicated by its value; if both from and to are supplied, the span runs from the start of the element indicated by the from attribute up to the end of the element indicated by the to attribute; if the target attribute is used, the span is defined by aggregating the contents of the (possibly non-contiguous) elements pointed to by its values. It is an error to supply only the to attribute; to supply more than one pointer value for either to or from attributes; or to supply either of these in conjunction with the target attribute. In the case of interp (see below), the span is indicated by a pointer from a link element or some similar mechanism. The resp attribute indicates the annotator responsible for this annotation.

The span element provides a simple way of indicating such features as phrasal verbs in a linguistic analysis, as in this example: What did you make up [span: phrasal verb "make up"] Here the two components of the span follow each other, so the to and from attributes may be used. The same effect could however be achieved by using the target attribute: What did you make up [span: phrasal verb "make up"] This second approach might be cumbersome if the number of components to be combined is very large. It is however essential if the components do not follow each other, as in this example: Did you make it up [span: phrasal verb "make up"]
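The three cases discussed above can be sketched as follows. All xml:id values are hypothetical identifiers introduced for this illustration.

```xml
<s><w xml:id="w1">What</w> <w xml:id="w2">did</w> <w xml:id="w3">you</w>
  <w xml:id="w4">make</w> <w xml:id="w5">up</w></s>
<!-- Contiguous components: from and to -->
<span from="#w4" to="#w5">phrasal verb "make up"</span>
<!-- The same span expressed with target -->
<span target="#w4 #w5">phrasal verb "make up"</span>

<s><w xml:id="x1">Did</w> <w xml:id="x2">you</w> <w xml:id="x3">make</w>
  <w xml:id="x4">it</w> <w xml:id="x5">up</w></s>
<!-- Non-contiguous components: target is required -->
<span target="#x3 #x5">phrasal verb "make up"</span>
```

Note that from and to each take a single pointer, whereas target takes a white-space-separated list, which is why only target can express the discontinuous case.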

The span element can be used for any kind of annotation. In this example it is used in a narratological analysis:

There was certainly a definite point at which the thing began. It was not; then it was suddenly inescapable, and nothing could have frightened it away. There was a slow integration, during which she, and the little animals, and the moving grasses, and the sun-warmed trees, and the slopes of shivering silvery mealies, and the great dome of blue light overhead, and the stones of earth under her feet, became one, shuddering together in a dissolution of dancing atoms. She felt the rivers under the ground forcing themselves painfully along her veins, swelling them out in an unbearable pressure; her flesh was the earth, and suffered growth like a ferment; and her eyes stared, fixed like the eye of the sun. Not for one second longer (if the terms for time apply) could she have borne it; but then, with a sudden movement forwards and out, the whole process stopped; and that was the moment which it was impossible to remember afterwards. [span: the moment] For during that space of time (which was timeless) she understood quite finally her smallness, the unimportance of humanity.

The span element may, as in this example, be placed in the text near the textual span it is associated with. Alternatively, it may be placed elsewhere in the same or a different document. Where several span or interp elements share the same attributes, for example having the same responsibility or type, it may be convenient to group them within a spanGrp or interpGrp element as follows: the moment
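A sketch of such a grouping, assuming that the sentences of the passage carry the hypothetical identifiers s1 to s6 and that the annotator is identified by a hypothetical pointer #ann1. The type value narrative is likewise an assumption.

```xml
<spanGrp resp="#ann1" type="narrative">
  <span from="#s3" to="#s5">the moment</span>
</spanGrp>
```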

Spans may also be used to represent structural divisions within a narrative, particularly when these do not coincide with the structure implied by the element structure. Consider the following narrative:

Sigmund, the son of Volsung, was a king in Frankish country. Sinfiotli was the eldest of his sons, the second was Helgi, the third Hamund. Borghild, Sigmund's wife, had a brother named — But Sinfiotli, her stepson, and — both wooed the same woman and Sinfiotli killed him over it. (The rule marks spaces left for the missing name in the manuscript.) And when he came home, Borghild asked him to go away, but Sigmund offered her weregild, and she was obliged to accept it. At the funeral feast Borghild was serving beer. She took poison, a big drinking horn full, and brought it to Sinfiotli. When Sinfiotli looked into the horn, he saw that poison was in it, and said to Sigmund This drink is cloudy, old man. Sigmund took the horn and drank it off. It is said that Sigmund was hardy and that poison did him no harm, inside or out. And all his sons could tolerate poison on their skin. Borghild brought another horn to Sinfiotli, and asked him to drink, and everything happened as before. And a third time she brought him a horn, and reproachful words as well, if he didn't drink from it. He spoke again to Sigmund as before. He said Filter it through your mustache, son! Sinfiotli drank it off and at once fell dead.

Sigmund carried him a long way in his arms and came to a long, narrow fjord, and there was a small boat there and a man in it. He offered to ferry Sigmund over the fjord. But when Sigmund carried the body out to the boat, it was fully laden. The man said Sigmund should go around the fjord inland. The man pushed the boat out and then suddenly vanished.

King Sigmund lived a long time in Denmark in the kingdom of Borghild, after he married her. Then he went south to Frankish lands, to the kingdom he had there. Then he married Hiordis, the daughter of King Eylimi. Their son was Sigurd. King Sigmund fell in a battle with the sons of Hunding. And then Hiordis married Alf, the son of King Hialprec. Sigurd grew up there as a boy.

Sigmund and all his sons were tall and outstanding in their strength, their growth, their intelligence, and their accomplishments. But Sigurd was the most outstanding of all, and everyone who knows about the old days says he was the most outstanding of men and the noblest of all the warrior kings.

A structural analysis of this text, dividing it into narrative units in a pattern shared with other texts from the same literature, might look like this:

Sigmund ... was a king in Frankish country. Sinfiotli was the eldest of his sons. Borghild, Sigmund's wife, had a brother ... But Sinfiotli ... wooed the same woman and Sinfiotli killed him over it. And when he came home, ... she was obliged to accept it. At the funeral feast Borghild was serving beer. She took poison ... and brought it to Sinfiotli. Sinfiotli drank it off and at once fell dead.

Sigmund carried him a long way in his arms ...

King Sigmund lived a long time in Denmark ...

Sigmund and all his sons were tall ...

Narrative units: introduction; conflict; climax; revenge; reconciliation; aftermath.

Note the use of an empty anchor element to provide a target for the reconciliation unit which is normally part of the narrative pattern but which is not realized in the text shown.

The same analysis may be expressed with the interp element instead of the span element; this element provides attributes for recording an interpretive category and its value, as well as the identity of the interpreter, but does not itself indicate which passage of text is being interpreted; the same interpretive structures can thus be associated with many passages of the text. The association between text passages and interp elements should be made either by pointing from the text to the interp element with the ana attribute defined in section , or by pointing at both text and interpretation from a link element, as described in chapter .

To encode the first example above using interp, it is necessary to create a text element which contains—or corresponds to—the third, fourth, and fifth orthographic sentences (S-units) in the paragraph. This can be done either with the seg element, described in , or the join element, described in . The resulting element can then be associated with the interp element using the ana attribute described in section . We illustrate using the seg element.

There was certainly a definite point ... It was not; then it was suddenly inescapable ... There was a slow integration ... She felt the rivers under the ground ... Not for one second longer ... For during that space of time ...

the moment
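The arrangement described above might be sketched as follows: a seg element wraps the third, fourth, and fifth s-units and points to the interp element through ana. The identifier moment, the type value, and the resp pointer are all hypothetical.

```xml
<p>
  <s>There was certainly a definite point ...</s>
  <s>It was not; then it was suddenly inescapable ...</s>
  <seg ana="#moment">
    <s>There was a slow integration ...</s>
    <s>She felt the rivers under the ground ...</s>
    <s>Not for one second longer ...</s>
  </seg>
  <s>For during that space of time ...</s>
</p>

<interp xml:id="moment" type="narrative" resp="#ann1">the moment</interp>
```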

The second example above can be recoded using interp and interpGrp tags in a similar manner. The interpretation itself can be expressed in an interpGrp element, which would replace the spanGrp in the example shown above: introduction conflict climax revenge reconciliation aftermath

Any of these interp elements may be linked to the text either by means of the ana attribute, or by means of link elements. Using the ana attribute (on seg elements introduced specifically for this purpose), the text would be encoded as follows:

Sigmund carried him a long way in his arms ...

King Sigmund lived a long time in Denmark ...

Sigmund and all his sons were tall ...

The linkage may also be accomplished using a linkGrp element, whose content is a set of link elements which point to each interpretive element and its corresponding text unit. This method does not require the use of the ana attribute on the text units.

One obvious advantage of using interp rather than span elements for the Sigmund text is that the interp elements can be reused for marking up other texts in the same document, whereas the span elements cannot. On the other hand, the use of interp elements may require the creation of special text elements not otherwise needed (e.g. the seg and the join in the revised encoding of the text), whereas the use of span elements does not.

Linguistic Annotation

By linguistic annotation we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in this chapter and in chapters , , , , , , and . The contextual properties of a TEI text are fully documented in the TEI header, which is discussed in chapter , and in section .

Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete and non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.

The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the interpretation element within the encoding description of the TEI header, as described in section . Where different parts of a language corpus have used different annotation methods, the decls attribute may be used to indicate the fact, as further discussed in section .

Linguistic Annotation by Means of Generic TEI Devices

As one example of such types of analysis, consider the following sentence, taken from the Lancaster/IBM Treebank Project (). The victim's friends told police that Kruger drove into the quarry and never surfaced.

Our discussion focuses on the way that this sentence might be analysed using the CLAWS system developed at the University of Lancaster, but exactly the same principles may be applied to a wide variety of other systems. (For the word-class tagging method used by CLAWS see ; for an overview of the system see . The example sentence was processed using an online version of the CLAWS tagger at .) Output from the system consists of a segmented and tokenized version of the text, in which word class codes have been associated with each token. CLAWS offers outputs in a variety of non-XML and XML formats: for example, the simplest format for the sample sentence would be:

This may be easily transformed into an equivalent TEI XML representation: The victim's friends told police that Kruger drove into the quarry and never surfaced. Although the names used for the attribute values here may have some significance for the human reader (AT0 for article, NN1 for singular noun, NN2 for plural noun, etc.) they are arbitrary codes, used in this case as pointers to other elements which define their significance more precisely. If the codes are considered to be atomic, then the interp element described in section might be used to supply brief definitions in the header: Definite article; Adverb; Conjunction; Relative that; Noun singular; Noun plural; Proper noun; Genitive marker; Preposition; Verb past tense. If the codes are considered to be compositional (for example that NN1 and NN2 have something in common, namely their noun-ness, which they do not share with, say, VVD), then this compositionality may be most clearly expressed using a mechanism based on the fs element defined in chapter .
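For the atomic-code case, the arrangement might be sketched as follows for the first few tokens. The codes AT0, NN1, NN2, and VVD are those named in the text; the code POS for the genitive marker and the interpGrp type value are assumptions made for this sketch.

```xml
<s>
  <w ana="#AT0">The</w>
  <w ana="#NN1">victim</w>
  <w ana="#POS">'s</w>
  <w ana="#NN2">friends</w>
  <w ana="#VVD">told</w>
  <!-- remaining tokens omitted for brevity -->
</s>

<interpGrp type="wordclass">
  <interp xml:id="AT0">Definite article</interp>
  <interp xml:id="NN1">Noun singular</interp>
  <interp xml:id="NN2">Noun plural</interp>
  <interp xml:id="POS">Genitive marker</interp>
  <interp xml:id="VVD">Verb past tense</interp>
</interpGrp>
```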

This approach requires the text to be fully segmented, using the linguistic segment elements described in section , so that the scope of the ana attribute used to point to each interpretation is clearly defined. A further analysis into phrase and clause elements can be superimposed on the word and morpheme tagging in the preceding illustration. For example, CLAWS provides the following constituent analysis of the sample sentence (the word class codes have been deleted):

Treating the labels on the brackets as phrase or clause interpretations, this analysis of the structure of the example sentence can be combined with the word class analysis and represented as follows (the symbol V&, representing the first part of a coordinate phrase, has been replaced by V1, and V+, representing the second part, has been replaced by V2). The victim 's friends told police that Kruger drove into the quarry and never surfaced .

This approach requires the definition of further interp (or fs) elements to provide targets for the pointers used to represent the constituent analysis: coordinate continuation; verbal; nominal; genitive; finite clause; prepositional; coordinate start.

Alternatively, a stand-off representation for these analyses might be created using the linkGrp element. In this case, each linguistic segment to be annotated must be supplied with its own xml:id attribute: The victim 's friends told police that Kruger drove into the quarry and never surfaced. Each segment-interpretation pair may now be represented by means of a link element inside an appropriate linkGrp element:
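A sketch of the stand-off arrangement for the first two tokens. The token identifiers t1 and t2 and the linkGrp type value are hypothetical; the interp elements with identifiers AT0 and NN1 are assumed to be defined elsewhere.

```xml
<s>
  <w xml:id="t1">The</w>
  <w xml:id="t2">victim</w>
  <!-- remaining tokens omitted for brevity -->
</s>

<linkGrp type="analysis">
  <link target="#t1 #AT0"/>
  <link target="#t2 #NN1"/>
</linkGrp>
```

Because each link names both the segment and its interpretation, the w elements need no ana attribute in this arrangement.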

Each linguistic segment so far discussed has been well-behaved with respect to the basic document hierarchy, having only a single parent. Moreover, the segmentation has been complete, in that each part of the text is accounted for by some segment at each level of analysis, without discontinuities or overlap. This state of affairs does not of course apply in all types of analysis, and these Guidelines provide a number of mechanisms to support the representation of discontinuities or multiple analyses. A brief overview of these facilities is provided in chapter ; also see . These mechanisms all depend to a greater or lesser degree on the use of pointing elements of various kinds.

Lightweight Linguistic Annotation

While these Guidelines offer a variety of means of adding linguistic information to textual units, many of which have been presented above, two kinds of use case and two groups of users call for a dedicated set of specialized attributes to carry linguistic information. The first use case is that in which basic linguistic information is added to an existing resource where generic attributes such as type or ana have already been used to encode other categorizations and analyses. The second involves corpus linguists and resources built from scratch as lightly annotated language corpora. In projects of the latter kind, energy and person-hours are devoted not to careful literary analysis and hand-encoding of the relevant phenomena but to the analysis of the completed resources; the resource-building phase must therefore be quick and relatively effortless, requiring minimal structural markup, well-established containers for grammatical information, and a standardized way of filling them in.

The aims defined above can be realized by means of lightweight linguistic annotation using attributes that belong to the att.linguistic class:

The essence of lightweight linguistic annotation is that the basic grammatical information is encapsulated at the word level, together with the orthographic shape of the word. This has clear advantages for automatic processing but, on the other hand, this form of data encapsulation also imposes restrictions on the extent of information that can be encoded, essentially limiting it to a single tokenization and lemmatization schema, a single tagset, and a subset of the possible analyses (out of potentially many guesses at the part-of-speech or morphosyntactic descriptions, single values have to fit into the existing attributes). Another important principle to which this kind of annotation is sensitive is the need for (near) homomorphism between the assumed tokenization (division of the text stream into minimal units) and the division into minimal syntactic units (word forms, in the terminology of ISO Morpho-Syntactic Framework, ISO 24611), because it is the former that results from the process of tokenization, but the latter that can be lemmatized and meaningfully described by means of grammatical features. [Note: All definitions contained within ISO standards can be accessed at the ISO Online Browsing Platform. For ISO MAF, see .] Where tokens are only minimally mismatched with word forms, various repair strategies can be used (e.g., recursing w to capture multi-token compounds or using att.fragmentable to point at disjoint tokens). Beyond that, more robust TEI mechanisms, based on standoff principles and feature structures, should replace lightweight annotation.

The basic grammatical information encoded by means of att.linguistic is sufficient for the purpose of enhancing queries or improving the analysis of search results by, for example, making it possible to distinguish between the noun cut and the identically spelled verb cut in English, and further between e.g. the present-tense form of cut and its past-tense or past-participial forms. For the former contrast, the part-of-speech (pos) attribute should be used, whereas the latter may use pos and/or msd attributes, depending on the annotation vocabulary adopted for the project in question. The various grammatical realizations of a single dictionary word can be captured by means of the attribute lemma, which provides a common label for them. For example, English verbs are typically lemmatized as the base form (also called bare infinitive), so the value of lemma for the verbal forms write, writes, wrote, written, and writing is typically write.
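The kind of query enhancement described here can be sketched in a few lines of Python. The fragment below is hypothetical: the pos and msd values are invented for illustration and do not come from any tagset discussed in this chapter, the TEI namespace is omitted, and find_forms is a helper name introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; pos and msd values are illustrative assumptions.
SAMPLE = """
<s>
  <w lemma="the" pos="DET">The</w>
  <w lemma="cut" pos="NOUN">cut</w>
  <w lemma="heal" pos="VERB" msd="Tense=Past">healed</w>
  <w lemma="after" pos="SCONJ">after</w>
  <w lemma="she" pos="PRON">she</w>
  <w lemma="cut" pos="VERB" msd="Tense=Past">cut</w>
  <w lemma="it" pos="PRON">it</w>
  <w lemma="and" pos="CCONJ">and</w>
  <w lemma="write" pos="VERB" msd="Tense=Past">wrote</w>
  <pc>.</pc>
</s>
"""


def find_forms(root, lemma, pos=None):
    """Return (token index, word form) for each w matching lemma and, optionally, pos."""
    hits = []
    for index, w in enumerate(root.iter("w")):
        if w.get("lemma") == lemma and (pos is None or w.get("pos") == pos):
            hits.append((index, w.text))
    return hits


root = ET.fromstring(SAMPLE)
print(find_forms(root, "cut", pos="NOUN"))  # the noun reading only
print(find_forms(root, "cut", pos="VERB"))  # the verb reading only
print(find_forms(root, "write"))            # any form lemmatized as write
```

A search on the lemma write finds wrote even though the two strings differ, while the pos filter separates the two identically spelled occurrences of cut.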

Together with the span-delimiting elements mentioned in this section, such as s, cl, or phr, lightweight grammatical annotation may be used to build basic syntactic constituency structures, where hierarchical information is expressed through span containment rather than by relations among tree nodes. This is however the limit of this kind of annotation: for the purpose of describing true constituency or dependency syntactic structures, one needs to turn to more robust mechanisms offered by the TEI, which may involve graph description (see chapter ) or standoff techniques (see section ), and where grammatical labels may need to be annotated by means of feature structures (see chapter ).
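To make the idea of hierarchy through span containment concrete, the following Python sketch turns nested s, cl, and phr elements into a labelled bracketing. It is only an illustration under stated assumptions: the fragment and its type values are hypothetical (loosely modelled on the phrase labels used earlier in this chapter), the TEI namespace is omitted, and bracket is a helper name invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the type values are illustrative assumptions.
SAMPLE = """
<s>
  <cl type="finite">
    <phr type="nominal"><w>Krueger</w></phr>
    <phr type="verbal">
      <w>drove</w>
      <phr type="prepositional">
        <w>into</w>
        <phr type="nominal"><w>the</w><w>quarry</w></phr>
      </phr>
    </phr>
  </cl>
</s>
"""


def bracket(el):
    """Render an element as a labelled bracketing derived purely from containment."""
    if el.tag in ("w", "pc"):
        return el.text
    label = el.get("type") or el.tag  # fall back to the element name
    inner = " ".join(bracket(child) for child in el)
    return f"[{label} {inner}]"


print(bracket(ET.fromstring(SAMPLE)))
```

Nothing in this structure can express a discontinuous constituent or a dependency between non-adjacent words, which is where the limit mentioned above is reached.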

Some of the above-mentioned robust methods will also prove useful where more than one tagset (label inventory) is used to label the words, or where automatic morphological analysis yields multiple possibilities (for example, the form cutting is morphologically ambiguous between verbal, adjectival, and nominal readings) that must then be disambiguated, often also automatically, in their morphosyntactic context. Such disambiguation may assign varying probabilities, which may also need to be recorded together with the corresponding part-of-speech and morphosyntactic values.

It should be borne in mind that tokenization, lemmatization, part-of-speech identification, and morphosyntactic labelling, especially when performed automatically, should in most cases be seen as involving pragmatic decisions, dictated by concrete practical goals, economy of description, or the demands of particular analytic and/or visualization tools. It comes therefore as no surprise that numerous alternative (and often conflicting) lemmatization strategies and tagsets exist, in use by various communities and various tools, and that they change with time (a case in point is the CLAWS tagset for English, with several versions that merge the part-of-speech and morphosyntactic information to various degrees). Given that the English language has relatively poor inflectional morphology, the decision to merge part-of-speech symbols with morphosyntactic features (as in, e.g., CLAWS-7, where the value PPHO1 signals the 3rd person singular objective personal pronoun) is fully justified as the most economical approach. For languages with more robust inflection, the pos and msd attributes will either be used separately, or the part-of-speech information will be merged into the morphosyntactic description. The nature and description of these systems is outside the scope of these Guidelines, but it has to be stressed that all the strategies adopted for linguistic annotation, even at the lightweight level of complexity, must be documented in the header of the given electronic resource, not only for the purpose of guaranteeing successful data interpretation and exchange, but also for the sake of sustainability of the results of the given project.

The last of the att.linguistic attributes, join, has the most text-technological flavour. It can be used to compensate for the loss of whitespace-related information in non-inline markup.

Compare the following two listings. The first difference between them is in the tagset used (CLAWS-5 vs. CLAWS-7) and only serves to exemplify the need to document the choice of descriptive vocabulary in the header, lest the encoded information be unreadable or confusing. The second difference lies in the treatment of inter-token whitespace, and it is here that the join attribute proves indispensable.

The first example listing uses CLAWS-5 and inline annotation, where whitespace serves as part of the markup: The victim's friends told police that Kruger drove into the quarry and never surfaced.

In the second example, the attribute join is the only way to encode whether two tokens are adjacent or not: The victim 's friends told police that Kruger drove into the quarry and never surfaced .

Note that projects will need to decide whether they want to redundantly encode full information on the adjacency of each token (in which case, the above listing should also have join with the value right on the tokens victim and surfaced), or whether information on a single direction of adjacency is enough. Strategies vary, and it is important to document them in the TEI header.
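The effect of join can be illustrated with a short Python sketch that rebuilds running text from a token-per-line encoding. The fragment is a shortened, hypothetical version of the sample sentence without part-of-speech codes; the TEI namespace is omitted; detokenize is a name introduced here; and the sketch handles only the values left, right, and both, treating any other value as if no join had been given.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened fragment with one token per line.
SAMPLE = """
<s>
  <w>The</w>
  <w>victim</w>
  <w join="left">'s</w>
  <w>friends</w>
  <w>told</w>
  <w>police</w>
  <pc join="left">.</pc>
</s>
"""


def detokenize(s_element):
    """Rebuild the text, inserting a space unless join marks two tokens as adjacent."""
    tokens = [(el.text, el.get("join")) for el in s_element if el.tag in ("w", "pc")]
    pieces = []
    for i, (form, join) in enumerate(tokens):
        prev_join = tokens[i - 1][1] if i > 0 else None
        # Adjacency may be stated on this token or on the one before it.
        glued = join in ("left", "both") or prev_join in ("right", "both")
        if pieces and not glued:
            pieces.append(" ")
        pieces.append(form)
    return "".join(pieces)


print(detokenize(ET.fromstring(SAMPLE)))
```

Because the function checks both the current and the preceding token, it gives the same result whether a project records adjacency in one direction only or redundantly in both.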

The following example shows a German sentence Wir fahren in den Urlaub. (We're going on vacation.) annotated with all the attributes discussed above. [Note: The annotation values have been adapted from the CLARIN Weblicht service, where e.g. the full morphosyntactic description of the first item reads: , and has been mapped from a sequence of attribute-value pairs suitable for feature structure notation, into a compressed form that fits inside a single attribute value.] Wir fahren in den Urlaub .

The final examples lay out a strategy for dealing with e.g. historical corpora where it is on the one hand important to maintain a steady stream of token-level elements (w and pc) for efficient processing, but, on the other hand, it is also important to either record the original spelling (when the corpus text is normalized) or to record the normalized variants (when the element content of the corpus preserves the original spelling). The attribute class att.lexicographic.normalized can be used for that purpose:

The first fragment below comes from "Gottfried, Newe Welt Vnd Americanische Historien. Frankfurt/M., 1631" encoded in the Deutsches Textarchiv and records normalized forms in the norm attribute. vnuermuthete Freundſchafft angebotten

The following example comes from the EarlyPrint project and uses the attribute orig to record the original spelling (note that the xml:id attributes have been removed for the sake of readability). he hath brought forth
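The two strategies can be compared with a small Python sketch that produces either reading from the same stream of w elements. The word forms in the first fragment are taken from the passage above, but all norm and orig attribute values below are assumptions supplied for illustration and are not quoted from the Deutsches Textarchiv or EarlyPrint; the TEI namespace is omitted, and reading is a helper name invented here.

```python
import xml.etree.ElementTree as ET

# Content keeps the original spelling; norm values are illustrative assumptions.
ORIGINAL_AS_CONTENT = """
<s>
  <w norm="unvermutete">vnuermuthete</w>
  <w norm="Freundschaft">Freundſchafft</w>
  <w norm="angeboten">angebotten</w>
</s>
"""

# Content is normalized; orig values are hypothetical, for illustration only.
NORMALIZED_AS_CONTENT = """
<s>
  <w orig="hee">he</w>
  <w>hath</w>
  <w orig="broght">brought</w>
  <w>forth</w>
</s>
"""


def reading(root, attr):
    """Join the tokens, preferring the given attribute over the element content."""
    return " ".join(w.get(attr, w.text) for w in root.iter("w"))


dta_like = ET.fromstring(ORIGINAL_AS_CONTENT)
early_like = ET.fromstring(NORMALIZED_AS_CONTENT)
print(reading(dta_like, "norm"))    # normalized reading from attributes
print(reading(early_like, "orig"))  # original reading from attributes
```

In both cases the token stream is the same length and in the same order, so tools that count or align tokens are unaffected by which spelling is stored as content.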

Spoken Text

The mechanisms proposed in this chapter may also be used to encode analyses of an entirely different kind, for example discourse function. Here is an application of the span technique to record details of a sales transaction in a spoken text.

Can I have ten oranges and a kilo of bananas please? Yes, anything else? No thanks. That'll be dollar forty. Two dollars Sixty, eighty, two dollars. Thank you.

Span labels: sale request; sale compliance; sale; purchase; purchase closure.

For further discussion of the u (utterance) element and other elements recommended for transcriptions of spoken language, see chapter .
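A span that covers several utterances can be resolved to the utterances it spans with very little code, as the following Python sketch suggests. The utterance boundaries, the identifiers u1 to u3, and the from and to values are assumptions made for this illustration; only the first three utterances and the first two labels of the transaction are shown, the TEI namespace is omitted, and span_coverage is a name introduced here.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: identifiers and span boundaries are assumptions.
SAMPLE = """
<div>
  <u xml:id="u1">Can I have ten oranges and a kilo of bananas please?</u>
  <u xml:id="u2">Yes, anything else?</u>
  <u xml:id="u3">No thanks.</u>
  <spanGrp type="transactions">
    <span from="#u1">sale request</span>
    <span from="#u2" to="#u3">sale compliance</span>
  </spanGrp>
</div>
"""


def span_coverage(root):
    """Return (label, [utterance ids]) for each span, in document order."""
    ids = [u.get(XML_ID) for u in root.iter("u")]
    coverage = []
    for span in root.iter("span"):
        start_ref = span.get("from")
        end_ref = span.get("to", start_ref)  # a span without to covers one element
        start = ids.index(start_ref.lstrip("#"))
        end = ids.index(end_ref.lstrip("#"))
        coverage.append((span.text.strip(), ids[start:end + 1]))
    return coverage


for label, covered in span_coverage(ET.fromstring(SAMPLE)):
    print(label, covered)
```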

Module for Analysis and Interpretation

The module described in this chapter makes available the following components: Analysis and Interpretation: Simple analytic mechanisms. The selection and combination of modules to form a TEI schema is described in .