Notes on SGML Solutions To Markup Problems

David Barnard
   Queen's University, Kingston, Ont.
Lou Burnard
   Oxford University Computing Service
Jean-Pierre Gaspart
Lynne Price
   Frame Technology
C. M. Sperberg-McQueen
   University of Illinois at Chicago Academic Computer Center
Nino Varile
   Commission of the European Communities, Directorate General XIII

Document Number: TEI MLW18
April 16, 1992

Version 4, April 16, 1992

This paper discusses some sample problems in the use of SGML which have come up in the course of the work of the Text Encoding Initiative, and presents a number of example solutions to each problem.

These notes may be taken as representing the views of the committee as to appropriate uses of SGML mechanisms for logical problems; they do not necessarily reflect the views of the committee concerning the individual application areas. The discussion of linguistic examples, especially, focuses on markup problems and is not intended as a full linguistic analysis.(1)

Some problems discussed here are also treated in the work of ANSI committee X3V1.8, Music in Information Processing Standards, especially in their work on hypertext and hypermedia documents. Their documents, notably their document X3V1.8/SD-7, "Journal of Development, ANSI Project X3.749-D, Hypermedia/Time-based Document Structuring Language (HyTime)," should be consulted by anyone working with hypertext problems (as should the documents of TEI work group TR3, Hypermedia). The examples and solutions described here are pedagogical in intent; divergences in this document from the recommendations of X3V1.8 or TR3 should not be taken as rejections of their recommendations.


1 MARKING ARBITRARY SEGMENTS

Problem: how do you mark text segments which are arbitrary both with respect to the primary (or any) hierarchical structure in the text, and also with respect to each other?

Examples:

1  the passage rendered illegible by water stains(2)
2  the passage rendered inaudible by a passing truck (e.g.
   in a transcript of an interview)
3  this is where N.V. moved to the window (e.g. in a gestural transcription of a conversation)
4  the discussion of tariffs in these minutes where the topics are: tariff policy, foreign relations with Japan (including cultural exchange, economic cooperation, tariff policy, patent law), and wheat production (including a digression on wheat tariffs)

Solution 1: Concurrent Markup

If the number of such segment types is bounded and small, use CONCUR. For instance, example 3 shows a segment corresponding to a movement by one participant. By assigning one document type to each participant, we can mark each participant's moves using CONCUR. So NV's move to the window can be bounded by

    <(nino)move desc='walking to window'> ... </(nino)move>

without regard for the other structures in the text, after a DTD with declarations like

    <!DOCTYPE nino [
    <!ELEMENT nino    - O (#PCDATA | move | gesture | ...)* >
    <!ELEMENT move    - O (#PCDATA) >
    <!ELEMENT gesture - O (#PCDATA) >
    <!ATTLIST (move, gesture, ...)
              desc CDATA #IMPLIED >
    ]>

This assumes the gestural transcription will mark MOVEs, GESTUREs, and possibly other things for which here the ellipsis stands in; also that no two gestures or moves overlap or nest. If gestures and moves may nest, the content models for the document type NINO should be revised appropriately: they may express a suitable hierarchy or be replaced by the content model ANY; alternatively, the gestural tags may all be included as inclusion exceptions on the document-type element NINO, thus:

    <!ELEMENT nino - O (#PCDATA) +(move | gesture | ...) >

etc.

Similarly, legibility or audibility information can be readily accommodated by CONCUR with a "leg" or "aud" DTD segmenting the document into segments of a given legibility or audibility.
The DTD might look something like this:

    <!DOCTYPE aud [
    <!ELEMENT aud       - - (clear | distorted | inaudible | dna)+ >
    <!ELEMENT clear     - - (#PCDATA) >
    <!ELEMENT distorted - - (#PCDATA) >
    <!ELEMENT inaudible - - (#PCDATA) >
    <!ELEMENT dna       - - (#PCDATA) >
    <!ATTLIST (distorted, inaudible)
              cause CDATA #IMPLIED >
    ]>

The use of (#PCDATA) as a content model ensures that these tags cannot nest (which we assume would be meaningless) and the content model on AUD ensures that every portion of the document will be marked as clear, distorted, inaudible, or "dna" ("does not apply", for sections of the document to which audibility does not apply--document header, annotation, etc.)

So the document could have, interspersed randomly among other tags, sequences like:

    <(aud)dna> [document header ...] </(aud)dna>
    <(aud)clear> ... </(aud)clear>
    <(aud)inaudible cause='truck'> ... </(aud)inaudible>
    <(aud)clear> ... </(aud)clear>
    <(aud)distorted cause='volume overload'> ... </(aud)distorted>
    <(aud)clear> ... </(aud)clear>

Solution 2a: Single Empty Segment-Boundary Element

Where the number of types of segments is in principle unbounded (e.g. not only move and gesture but an indefinite number of further possibilities), a single <event> tag may be used to mark the beginning or end of any segment, thus dividing the document into time-slices to be managed and grouped by higher-level application software.(3)

If we wish to transcribe, say, movement, location, and whether a participant is smoking, we can segment the text on these lines: if N.V. walks to the window, stands there, and after a few minutes lights a cigarette, returning to the table before putting it out, we could imagine a simple segmentation like this:

    <tei.1>
    <event start id='e237' person='nino' act='sits at table'>
    ... passage A ...
    <event start id='e238' person='nino' act='walking to window'>
    ... passage B ...
    <event end startid='e238' >
    <event start id='e239' person='nino' act='stands at window'>
    ... passage C ...
    <event start id='e240' person='nino' act='smokes'>
    ... passage D ...
    <event end startid='e239' >
    <event start id='e241' person='nino' act='walks to table'>
    ... passage E ...
    <event end startid='e241' >
    ... passage F ...
    <event end startid='e240' >
    ... passage G ...
    </tei.1>

The application software is then responsible for linking each <event end> to its corresponding <event start>, by means of the identical value on the id attribute of the <event start> and the startid attribute (an IDREF attribute) on the <event end> tag, and treating the intervening text as though it were a single element. Here, passages D-F must be treated as though braced by <(nino)smoking> ... </(nino)smoking>, and C-D and F-G as though single segments with NV's location as marked.

The document type declaration fragment required for tagging of this kind would be something like this:

    <!DOCTYPE tei.1 [
    <!-- all normal TEI tags defined, and also EVENT        -->
    <!-- EVENT is defined as part of F.EMPTY and can occur  -->
    <!-- anywhere within prose.                             -->
    <!ENTITY % f.empty "citn.ref | milestone | xref | anchor |
                        include | ext.formula | ext.table |
                        ext.figure | index.term | event" >
    <!ELEMENT event - O EMPTY >
    <!ATTLIST event
              ID       ID                         #IMPLIED
              startid  IDREF                      #IMPLIED
              person   (nino | lynne | lou | jp)  #IMPLIED
              act      CDATA                      #IMPLIED
              endpoint (start | end | point)      #IMPLIED >
    ]>

SGML would not verify that each <event start> tag had exactly one corresponding <event end>, nor that each event-start tag had an ID attribute and each event-end tag a STARTID attribute, but the ID/IDREF mechanism would ensure that each STARTID pointed at exactly one ID.

Solution 2b: Paired Segment-Boundary Elements

The method of the preceding section requires a clean distinction among <event> tags: some mark the beginning of an event and must have an ID value, along with person and act; others mark the end of the same event and should have only a startid value.
A third type marks the point of occurrence of an event without duration, for which person and act values are logically required and an ID value optional. As was noted above, SGML is not in a position to enforce these constraints from the declarations given. We may make the method more reliable, at the cost of two additional element types, by defining three distinct elements for the three types of event tags.

In this revised method, a pair of <event-start> and <event-end> tags may be used to mark the beginning and end of any segment. Events which occur at a single point in time and have no distinct start and end may be represented by a third <event> tag.

The same sequence of events as that given above would be transcribed thus, in this method:

    <tei.1>
    <event-start id='e237' person='nino' act='sits at table'>
    ... passage A ...
    <event-start id='e238' person='nino' act='walking to window'>
    ... passage B ...
    <event-end startid='e238' >
    <event-start id='e239' person='nino' act='stands at window'>
    ... passage C ...
    <event-start id='e240' person='nino' act='smokes'>
    ... passage D ...
    <event-end startid='e239' >
    <event-start id='e241' person='nino' act='walks to table'>
    ... passage E ...
    <event-end startid='e241' >
    ... passage F ...
    <event id='e242' person='nino' act='knocks on wood'>
    ... passage G ...
    <event-end startid='e240' >
    ... passage H ...
    </tei.1>

As before, the application is responsible for treating passages C-D, D-G, and F-H as units.

The document type declaration fragment required for tagging of this kind would be something like this:

    <!DOCTYPE tei.1 [
    <!-- all normal TEI tags defined, and also EVENT        -->
    <!-- EVENT is defined as part of F.EMPTY and can occur  -->
    <!-- anywhere within prose.
    -->
    <!ENTITY % f.empty "citn.ref | milestone | xref | anchor |
                        include | ext.formula | ext.table |
                        ext.figure | index.term | event |
                        event-start | event-end" >
    <!ELEMENT event       - O EMPTY >
    <!ELEMENT event-start - O EMPTY >
    <!ELEMENT event-end   - O EMPTY >
    <!ATTLIST event
              ID      ID                         #IMPLIED
              person  (nino | lynne | lou | jp)  #REQUIRED
              act     CDATA                      #REQUIRED >
    <!ATTLIST event-start
              ID      ID                         #REQUIRED
              person  (nino | lynne | lou | jp)  #REQUIRED
              act     CDATA                      #REQUIRED >
    <!ATTLIST event-end
              startid IDREF                      #REQUIRED >
    ]>

Solution 3: Typed Segment-Boundary Delimiters

If the types of events form a closed set, a different segment-boundary element can be defined for each type of event. Like the <event> tag of solution 2a, these segment-boundary tags would be empty. To define distinct segment-boundary tags for moves and gestures, the DTD would include definitions like these:

    <!ELEMENT move    - O EMPTY>
    <!ELEMENT gesture - O EMPTY>
    <!ATTLIST (move, gesture)
              endpnt  (start | end | point)            #REQUIRED
              person  (nino | lynne | lou | msm | jp ) #REQUIRED
              desc    CDATA                            #IMPLIED
              id      ID                               #IMPLIED
              startid IDREF                            #IMPLIED >

N.V.'s walk to the window would be marked:

    ...
    <move start id='e237' person='nino' desc='walking to window'>
    ...   <!-- during this segment, Nino is walking to the window -->
    ...
    <move end startid='e237' person='nino'>
          <!-- Nino is now at the window -->
    ...

As in solution 2, application software would be responsible for linking the start and end tags (by means of the identical value for the ID and STARTID attributes). SGML would not verify that each START tag had exactly one END, nor that each START tag had an ID attribute and each END tag a STARTID attribute, but the ID/IDREF mechanism would ensure that each STARTID pointed at exactly one ID.
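
The linking step that solutions 2a and 3 leave to application software can be sketched roughly as follows. This is a minimal illustration in Python, not part of any TEI tool: the `Boundary` record, its `offset` field, and the `pair_boundaries` function are hypothetical names, and the sketch assumes the boundary tags have already been extracted from the parsed document in document order. It performs the checks the text says SGML itself does not: that every start tag has an ID, every end tag a STARTID, and every start exactly one end.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Boundary:
    """One empty segment-boundary tag, reduced to what linking needs."""
    endpoint: str                  # 'start', 'end', or 'point'
    offset: int                    # hypothetical position of the tag in the text
    id: Optional[str] = None       # ID attribute (expected on start tags)
    startid: Optional[str] = None  # IDREF attribute (expected on end tags)


def pair_boundaries(tags):
    """Link end tags to start tags; return (spans, problems).

    spans maps each segment ID to (start_offset, end_offset);
    problems lists the constraints that the DTD alone cannot enforce.
    """
    starts, spans, problems = {}, {}, []
    for tag in tags:
        if tag.endpoint == "start":
            if tag.id is None:
                problems.append(f"start tag at {tag.offset} has no id")
            else:
                starts[tag.id] = tag.offset
        elif tag.endpoint == "end":
            if tag.startid is None:
                problems.append(f"end tag at {tag.offset} has no startid")
            elif tag.startid not in starts:
                problems.append(f"end tag at {tag.offset} has no preceding start")
            elif tag.startid in spans:
                problems.append(f"segment {tag.startid} is closed more than once")
            else:
                spans[tag.startid] = (starts[tag.startid], tag.offset)
    for ident in starts:
        if ident not in spans:
            problems.append(f"segment {ident} is never closed")
    return spans, problems


# Usage: one properly closed segment and one that is never closed.
tags = [
    Boundary("start", 10, id="e237"),
    Boundary("end", 42, startid="e237"),
    Boundary("start", 50, id="e238"),
]
spans, problems = pair_boundaries(tags)
```

Segments recovered this way may overlap freely, since each pair is matched by identifier rather than by nesting; that is the property which makes boundary tags usable for arbitrary segments in the first place.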
Better SGML validation can be achieved, as before, by creating pairs of distinct segment-start and segment-end tags.(4)

Solution 4: Arbitrary Segments as Lists of Elements

A variant on solution 1 can be used to provide slightly better support for arbitrary segments in SGML. In this variant, an arbitrary segment of the text is defined as a set of elements of the text; where existing elements do not have precisely the right extension to define the desired segment, special segmentation elements are used to place boundaries in the correct positions in the text. This approach is defined in TEI P1, where the <al.map> element is used to specify arbitrary segments of a text as sets of elements, using <s> elements if necessary to divide the text into the desired chunks.

This approach provides a more declarative interpretation of arbitrary segments (in terms of a set of subtrees of the parse tree, rather than in terms of a specific processing model involving left-to-right scan of the document); it also automatically provides for discontiguous segments. Its disadvantage is in requiring out-of-line markup: the characteristics to be associated with a given arbitrary segment are specified in a separate element, e.g. an <f.struct>, and associated with the arbitrary segment only through an alignment map.

Using this method, N.V.'s walk to the window might be marked this way.(5) Comments are used to show more clearly what is happening; the information is carried by the tags, however, not the comments.

    <tei.1> ...
    <text><body><p> ... text ...
    <s id='s101'>        <!-- Nino sits at the table -->
    ... passage A ...
    </s><s id='s102'>    <!-- Nino starts toward the window -->
    ... passage B1 ...
    </s>
    </p>
    <p><s id='s103'>
    ... passage B2 ...
    </s>                 <!-- Nino arrives at window -->
    <s id='s104'>        <!-- Nino stands at window -->
    ... passage C ...
    <s id='s105'>        <!-- Nino smokes ... -->
    ... passage D1 ...
    </s></s>
    </p>
    <p id='p234'><s id='s106'>
    ... passage D2 ...
    </s>                 <!-- Nino leaves window and goes to table -->
    <s id='s107'>
    ... passage E ...
    </s>                 <!-- Nino arrives at table, sits -->
    </p>
    <p id='p235'>
    ... passage F ...
    </p>
    <p id='p238'>
    <s id='s108'></s>    <!-- Nino knocks on wood -->
    ... passage G ...
                         <!-- Nino stops smoking -->
    </p>
    <p id='p239'>
    ... passage H ...
    </body>
    <analysis>
    <!-- N.B. ordering of f.structs is irrelevant. -->
    <f.struct id='m1'>
       <feature name='who'>Nino</feature>
       <feature name='act'>sits</feature>
       <feature name='loc'>at table.</feature></f.struct>
    <f.struct id='m2'>
       <feature name='who'>Nino</feature>
       <feature name='act'>walks</feature>
       <feature name='loc'>table to window.</feature></f.struct>
    <f.struct id='m3'>
       <feature name='who'>Nino</feature>
       <feature name='act'>stands</feature>
       <feature name='loc'>at window.</feature></f.struct>
    <f.struct id='m4'>
       <feature name='who'>Nino</feature>
       <feature name='act'>smokes.</feature></f.struct>
    <f.struct id='m5'>
       <feature name='who'>Nino</feature>
       <feature name='act'>knocks on wood.</feature></f.struct>
    <f.struct id='m6'>
       <feature name='who'>Nino</feature>
       <feature name='act'>walks</feature>
       <feature name='loc'>window to table.</feature></f.struct>
    </analysis>
    <alignment>
    <al.map><!-- Nino sitting at table, passage A -->
       <al.ptr target='m1'><al.ptr target='s101'>
    </al.map>
    <al.map><!-- Nino walking to window, passages B1, B2 -->
       <al.ptr target='m2'><al.range al.start='s102' al.end='s103'>
    </al.map>
    <al.map><!-- Nino at window, passages C, D -->
       <al.ptr target='m3'><al.range al.start='s104' al.end='s106'>
    </al.map>
    <al.map><!-- Nino smokes, passages D-G -->
       <al.ptr target='m4'><al.range al.start='s105' al.end='p238'>
    </al.map>
    <al.map><!-- Nino knocks on wood, no duration -->
       <al.ptr target='m5'><al.ptr target='s108'>
    </al.map>
    <al.map><!-- Nino walks to table, passage E -->
       <al.ptr target='m6'><al.ptr target='s107'>
    </al.map>
    <al.map><!-- Nino sitting at table, passages F, G, H -->
       <al.ptr target='m1'><al.range al.start='p235' al.end='p238'>
       <!-- N.B.
       we use same f.struct (M1) as for passage A -->
    </al.map>
    </alignment>
    </tei.1>

Discussion

CONCUR is optimal for expressing orthogonal views of the document. Movement by participants in a conversation may be so viewed. Topic shift (ex. 4) is really not orthogonal and might require segment-terminus tags (solution 2). Solution 2 might also be preferred for data capture; a mechanical operation should be able to convert the resulting text to one using concurrent markup.

If the number of views (types of segment) is in principle bounded, prefer CONCUR. If the number of views is in principle unbounded, the event/time-slice technique must be used. In this case one tag (EVENT) will suffice and more should not be used.


2 MARKING DISCONTIGUOUS SEGMENTS

Problem: how do you mark a segment marked by a single feature, but which is discontiguous in the text?

Examples:

1  the words rendered illegible by the stain on the right hand side of this page
2  the finite verb "stellte vor" in the German sentence "Er stellte seine These den Kollegen hoffnungsvoll vor"
3  the discussion of tariffs in these parliamentary minutes (assuming that the discussion wanders back and forth from one topic to another)
4  the root KTB in the Arabic word "al-kaatib"

Solution 1: Co-indexing

Use co-indexing by means of the SGML ID/IDREF mechanism.

If we wish, we can gather all the ID occurrences in other tags elsewhere, in a sort of register which might look like this (for problem example 3):

    <!DOCTYPE tei.1 system "tei1.dtd" [
    <!-- various declarations to allow use of topic declaration
         within front matter and topics within text body ... -->
    <!ELEMENT topic_declaration - O EMPTY >
    <!ATTLIST topic_declaration
              full CDATA #REQUIRED
              ID   ID    #REQUIRED >
    <!ELEMENT topic - - (#PCDATA) >
    <!ATTLIST topic
              topicid IDREF #REQUIRED >
    ]>
    <TEI.1><TEI.header> ... </TEI.header>
    <text><front> ...
    <topiclist>
    <topic_declaration id='tariff' full='Tariffs on steel'>
    <topic_declaration id='wheat'  full='Wheat Crop Projections'>
    <topic_declaration id='flag'   full='National Flag Month'>
    ...
    </topiclist>
    ...
    </front>
    <body> ...
    <topic topicid='tariff'> ... </topic>
    <topic topicid='wheat'> ... </topic>
    <topic topicid='tariff'> ... </topic>
    <topic topicid='flag'> ... </topic>
    <topic topicid='tariff'> ... </topic>
    ...
    </body>

An alternative form would use a single tag for both the ID and the IDREF attributes, using declarations like these (where LEG='legibility' and S='stain'):

    <!ELEMENT leg - - (#PCDATA | s)+ >
    <!ELEMENT s   - O (#PCDATA) >
    <!ATTLIST s
              id    ID    #IMPLIED
              segid IDREF #IMPLIED >

and allowing document sequences like:

    Random statistical quirk for the day: the word "no" appears 1344
    times in the King James Bible, but the <(leg)s id='s23'>word</(leg)s>
    "yes" appears only twice! (Grep for<(leg)s segid='s23'> yourself if
    yo</(leg)s>u don't believe me). At<(leg)s segid='s23'> first I thought
    thi</(leg)s>s was just a hilarious<(leg)s segid='s23'> artifact of
    religious</(leg)s> dogma, so I chec<(leg)s segid='s23'>ked Alice in
    Wonderlan</(leg)s>d -- "yes" appears onl<(leg)s segid='s23'>y once!
    Curiouser </(leg)s>and curiouser. Well it<(leg)s segid='s23'> turns
    out to be</(leg)s> a property of English (yes<(leg)s
    segid='s23'>/no = .0</(leg)s>66 on average), and when you consider
    why this might be, it's undoubtedly due to the fact ...

    (Humanist 3.769, Tue, 21 Nov 89, posting from
    mike@tome.media.mit.edu (Michael Hawley))

Note that one occurrence of S (here the first) must have an ID attribute, and the others an IDREF.

Of the two mechanisms described here, the former, with different tags for the head of the group and the various tails, is to be preferred.

Solution 2: Redundant Separate Storage

For micro-discontinuities like that in example 4, it might be simpler to introduce redundancy and store the discontiguous segment separately, e.g.
with

    <word root='KTB'>al-kaatib</word>

or

    <word><root>KTB</root>
          <form>al-kaatib</form>
    </word>

Since KTB is analysis and not part of the text being lemmatized, the ML committee leaned toward the former solution (root as attribute, not element).

Solution 3: Alignment Mechanism

Another mechanism for marking discontiguous segments is the alignment map mechanism defined in chapter 6 of TEI P1 and described above as solution 4 for arbitrary segmentation.


3 HANDLING AMBIGUOUS CONTENT

Problem: how does one mark multiple analyses of the same content?

Examples:

1  the gross syntactic structure of the sentence "I saw the man with the telescope"
2  the pagination of the various editions of Shakespeare's Hamlet

Solution 1: Concurrent Markup

Use CONCUR and define a separate document type for each edition to be included. Assume that we wish to mark volume, page, and column numbers for some editions, volume and page numbers for others. The following DTD may be embedded for each edition; it assumes that any edition is composed of one or more volumes, each volume comprises a set of pages, and each page can contain character data, lines, or columns. Because different editions have different material, an OMITTED tag is provided to mark some contents as not being present in the edition.

    <!-- Define "VERSION.NAME" in the document type declaration   -->
    <!-- subset before calling these declarations.  Sample:       -->
    <!--     <!DOCTYPE La system 'plrefs.dtd' [                   -->
    <!--     <!ENTITY % version.name "La" >                       -->
    <!--     ]>                                                   -->
    <!--                                                          -->
    <!-- N.B. this hierarchy requires all data to be marked with  -->
    <!-- the volume and page of the edition, or marked as omitted -->
    <!-- A looser hierarchy may be defined if desired, by         -->
    <!-- inserting "#PCDATA | " at the beginning of               -->
    <!-- the content models for %version.name and VOL, or by      -->
    <!-- defining PAGE as (#PCDATA | C | L)* which would          -->
    <!-- allow some lines to be marked without marking all lines.
    -->
    <!-- A tighter hierarchy may be defined by omitting #PCDATA   -->
    <!-- from the content models for PAGE and C, thus requiring   -->
    <!-- all lines to be marked.                                  -->
    <!--                                                          -->
    <!ENTITY % version.name "ref">
    <!ELEMENT %version.name - - (vol | page)* +(omitted) >
    <!ELEMENT omitted       - O (#PCDATA) >
    <!ELEMENT vol           - O (page)* >
    <!ELEMENT page          - O (#PCDATA | l+ | c+) >
    <!-- Columns and lines get short names since they occur often -->
    <!ELEMENT c             - O (#PCDATA | l)* >
    <!ELEMENT l             - O (#PCDATA) >
    <!ATTLIST (vol, page, c, l)
              n  CDATA #IMPLIED
              id ID    #IMPLIED >

This concurrent hierarchy is enabled as shown in the comments; the document contains (after the lines enabling the basic document hierarchy) the sequence of lines (assuming the DTD is stored under the system file identifier "plrefs.dtd"):

    <!DOCTYPE La system 'plrefs.dtd' [
    <!ENTITY % version.name "La" >
    ]>

which call the document type for page and line references and give it the name "La." If page and line numbers from more than one standard edition are to be marked, then the relevant lines may be repeated, each time using a different value for the document type and entity definition (where the example has "La").

Multiple editions of Hamlet might be tagged this way, using this mechanism:

    <!DOCTYPE TEI.1 system "TEI1.DTD" [
       <!ENTITY % TEI.base system "teidram1.dtd" >
    ]>
    <!DOCTYPE F system "plrefs.dtd" [
       <!-- First Folio pagination -->
       <!ENTITY % version.name "F" >
    ]>
    <!DOCTYPE Q1 system "plrefs.dtd" [
       <!-- First Quarto pagination -->
       <!ENTITY % version.name "Q1" >
    ]>
    <!DOCTYPE Q2 system "plrefs.dtd" [
       <!-- Second Quarto pagination -->
       <!ENTITY % version.name "Q2" >
    ]>
    <!DOCTYPE Ri system "plrefs.dtd" [
       <!-- Riverside Shakespeare pagination -->
       <!ENTITY % version.name "Ri" >
    ]>
    <(tei.1)tei.1><(f)f><(q1)q1><(q2)q2><(Ri)Ri>
    <(f)omitted><(q1)omitted><(q2)omitted><(Ri)omitted>
    <!>
    <(tei.1)tei.header> ...
    </(tei.1)tei.header>
    <!>
    </(f)omitted></(q1)omitted></(q2)omitted></(Ri)omitted>
    <(tei.1)text><(tei.1)body>
    <!>
    <!-- Act 1, Scene 1 starts ... -->
    <(tei.1)div1 name='act' n='1'>
    <!-- initial pagination for various editions -->
    <(F)page n='g5a'>
    <(Q1)page n='3'>
    <(Q2)page n='[3]'>
    <(Ri)page n='234'>
    <!-- ... text of Hamlet ... -->
    </(F)page><(F)page n='g5b'>
    <!-- ... text of Hamlet ... -->
    </(Q2)page><(Q2)page n='4'>
    <!-- ... text of Hamlet ... -->
    </(Ri)page><(Ri)page n='235'>
    <!-- ... text of Hamlet ... -->
    </(F)page><(F)page n='g5b'>
    <!-- ... text of Hamlet ... through end ... -->
    </(f)page></(q1)page></(q2)page></(Ri)page>
    </(tei.1)body></(tei.1)text></(tei.1)tei.1>

Solution 2: Redundant Storage of String

The string may be repeated with different markup each time. This is an obvious solution but causes problems for views other than the one in which the ambiguity is visible: they see only the repeated content, not the difference in tagging.

Solution 3: Out-of-Line Markup (Empty Elements)

The chart of the sentence in example 1 may be represented with an empty element for each arc of the chart, with pointers to the endpoints of the arc.

The DTD will have:

    <!ELEMENT sentence (text, parse*) >
    <!ELEMENT text     (#PCDATA) >
    <!ELEMENT parse    (arc)+ >
    <!ELEMENT arc      EMPTY >
    <!ATTLIST arc
              marker (s, np, vp, pp, v, n, p) #IMPLIED
              from   NUMBER                   #REQUIRED
              to     NUMBER                   #REQUIRED >

Here the tokens of the text are numbered 1-N and the nodes (the endpoints of the arcs) 0-N, so that node K follows token K of the text. (If validation of endpoints by the SGML parser is desired, then make these changes or additions to the DTD:

    <!ELEMENT text (word)+ >
    <!ELEMENT word (#PCDATA) >
    <!ATTLIST sentence
              id ID #IMPLIED >
    <!ATTLIST word
              id ID #IMPLIED >
    <!ATTLIST arc
              marker (s, np, vp, pp, v, n, p) #IMPLIED
              from   IDREF                    #REQUIRED
              to     IDREF                    #REQUIRED >

and assign the SENTENCE ID to be the node before the first word.)
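
The numbering convention for arc endpoints can be illustrated with a short sketch. The Python below is only an aid to reading, under the assumption that tokens are obtained by splitting the sentence on whitespace; the function name `arc_tokens` and its parameter names are hypothetical. Because node K follows token K, an arc between two nodes covers exactly the tokens lying between them.

```python
def arc_tokens(tokens, start_node, end_node):
    """Return the tokens covered by an arc between two chart nodes.

    Tokens are numbered 1..N and nodes 0..N, with node K following
    token K, so an arc from node a to node b covers tokens a+1..b.
    """
    if not 0 <= start_node < end_node <= len(tokens):
        raise ValueError("arc endpoints must satisfy 0 <= start < end <= N")
    # Python lists are 0-based, so tokens a+1..b are the slice [a:b].
    return tokens[start_node:end_node]


# The sentence of example 1, split into seven tokens.
tokens = "I saw the man with the telescope".split()

subject = arc_tokens(tokens, 0, 1)      # ['I']
object_np = arc_tokens(tokens, 2, 4)    # ['the', 'man']
prep_phrase = arc_tokens(tokens, 4, 7)  # ['with', 'the', 'telescope']
```

With this convention two arcs are adjacent exactly when the end node of one equals the start node of the other, which is what lets competing analyses share endpoints without repeating the text.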
The text will have (using SHORTTAG to omit redundant attribute names, and using comments in the right margin to indicate selected phrases):

    <sentence><text>I saw the man with the telescope.</text>
    <parse>
    <arc S  from='0' to='7'>
    <arc NP from='0' to='1'>     <!-- I -->
    <arc VP from='1' to='7'>     <!-- s.t.m.w.t.t.-->
    <arc VP from='1' to='4'>     <!-- saw the man -->
    <arc V  from='1' to='2'>     <!-- saw -->
    <arc NP from='2' to='4'>     <!-- the man -->
    <arc PP from='4' to='7'>     <!-- with the telescope -->
    <arc P  from='4' to='5'>     <!-- with -->
    <arc NP from='5' to='7'>     <!-- the telescope -->
    </parse>
    <parse>
    <arc S  from='0' to='7'>
    <arc NP from='0' to='1'>     <!-- I -->
    <arc VP from='1' to='7'>
    <arc V  from='1' to='2'>     <!-- saw -->
    <arc NP from='2' to='7'>     <!-- the man w/ the tel -->
    <arc NP from='2' to='4'>     <!-- the man -->
    <arc PP from='4' to='7'>     <!-- with the telescope -->
    <arc P  from='4' to='5'>     <!-- with -->
    <arc NP from='5' to='7'>     <!-- the telescope -->
    </parse>
    </sentence>

Obviously, the unambiguous arc information can be interspersed with the text, leaving the PARSE elements to group the competing analyses. This does complicate the DTD and the text.

Note: The solution described here is fundamentally similar to that offered by TEI P1's tags for linguistic analysis: out-of-line analysis linked to the analysed text by pointers implemented by SGML ID and IDREF attributes. Like this one, the <f.struct> notation allows multiple analyses of the same content; the intermingling of content and analysis is not contemplated, for simplicity's sake.

Solution 4: Special Notation

Use a special notation to express the parses more compactly, at the cost of losing validation by the SGML parser.
Using a DTD like this:

    <!ELEMENT sentence (text, parse*) >
    <!ELEMENT text     (#PCDATA) >
    <!ELEMENT parse    EMPTY >
    <!ATTLIST parse
              p CDATA #REQUIRED >

We can have a text like this:

    <sentence><text>I saw the man with the telescope.</text>
    <parse p='( (1) ( ((2) (3 4)) ((5)(6 7)) ) )'>
    <parse p='( (1) ( (2) ( (3 4) ((5)(6 7)) ) ) )'>
    </sentence>

Or:

    <sentence><text>I saw the man with the telescope.</text>
    <parse p='s( np(1) vp( vp(v(2) np(3 4)) pp(p(5)np(6 7)) ) )'>
    <parse p='s( np(1) vp( v(2) np( np(3 4) pp(p(5)np(6 7)) ) ) )'>
    </sentence>

Solution 5: Treat as Arbitrary Segments

Treat all parse subtrees as arbitrary segments using the techniques already outlined.

Commentary

Where local ambiguities are independent, leading to combinatorial explosion of overall ambiguity, concurrent markup is not wholly satisfactory, since it requires a separate markup stream for each overall interpretation of the ambiguity. Out-of-line markup, in the style of solution 3 or of the <f.struct> construct of TEI P1, is preferred in these cases.

Where there is no combinatorial explosion (as in the multiple paginations of classic works) and the different segmentations of the text do not interact, CONCUR is the preferred solution.


4 MARKING OVERLAPPING (E.G. BI-CLAUSAL ANALYSIS)

Problem: how does one mark text segments which can associate either left or right, as in "she (took advantage [of) Joan]" or "Broadway Hit or Miss?" or as in apo koinu constructions?

Solutions: these examples appear to be solved by the methods of arbitrary segmentation and by the out-of-line markup mechanisms described in chapter 6 of TEI P1 (<f.struct> and <alignment>).


5 SYNCHRONOUS PARALLEL STRUCTURES AND TRANSCRIPTIONS

Problem: How do you mark the synchronization points of a set of parallel texts (e.g. texts of the Bible, or the nine language versions of EEC legislation, or phonetic, phonemic, and orthographic transcriptions of the same text)?

Examples:

1  parallel texts (translation equivalents)
2  parallel texts (manuscript variants or recensions)
3  phonemic and orthographic transcriptions of same content

Solution 1: Implicit Parallelism

So long as order is preserved, parallelism between synchronous structures can be implicit. The lowest level at which the parallelism is to be expressed contains a sequence of parallel versions.

For example 3, the DTD might include:

    <!ELEMENT segment      - O (phonemic, orthographic) >
    <!ELEMENT phonemic     - O (#PCDATA) >
    <!ELEMENT orthographic - O (#PCDATA) >

(It is assumed that SEGMENT is small enough that all gross text structuring occurs above it in the hierarchy.)

The text then is:

    content ...
    <segment>
    <phonemic> (phonemic transcription of 'the')
    <orthographic>the
    <segment>
    <phonemic> (phonemic transcription of 'fat')
    <orthographic>fat
    <segment>
    <phonemic> (phonemic transcription of 'cat')
    <orthographic>cat
    content ...

This is the method used in the <unit> / <level> tagging defined in chapter 6 of TEI P1 and demonstrated on translation equivalents in appendix A.6.3.

Solution 2: Explicit Synchronization Using Common Identifiers

Where sequence is not preserved, locations or segments must be given identifiers, and cross-references from one text to another must indicate the parallelism. E.g.

    <seg id='s1'>Tor <seg id='s2'>nach Durchfahrt <seg id='s3'>bitte
    <seg id='s4'>zumachen!

and

    <seg id='s3'>Please <seg id='s4'>close <seg id='s1'>the gate
    <seg id='s2'>after passing through!

This is the approach taken in synchronization through canonical references (see appendix A.6.1 of TEI P1); a more elaborated version of the same approach, allowing for one-to-many matching of segments, is found in the <alignment> mechanism of TEI P1 chapter 6.

Solution 3: Explicit Synchronization with Many-to-one Linkages

Where the segments of parallel texts not only appear in varying orders but do not match one-to-one, the use of common identifiers to align the texts does not suffice.
In this case, the <alignment> mechanism of TEI P1 (described above as solution 4 for arbitrary segmentation) must be used. An example of the application of alignment maps to parallel texts may be found in appendix A.6.2 of TEI P1.


6 INTERNAL AND EXTERNAL CROSS-REFERENCES

Problem: how does one refer to locations elsewhere in the same or in a separate document?

Solution: ID/IDREF

Use the ID/IDREF mechanism. For external references, this will require application support, but the specification of an ID name with a (possibly system-dependent) document identity will uniquely point to a specific ID in any document. This is the basic mechanism specified in section 5.7 of TEI P1.


7 VAGUENESS OF LOCATION

Problem: How do you mark a segment or text element with "fuzzy" ends?

Examples:

1  the passage begins approximately here, but it is not certain exactly where
2  the passage begins somewhere between one point (a) and another (b) (e.g. an echo of another text, which may begin and end gradually, providing a section which is certainly an echo, in the opinion of a tagger, surrounded by a penumbra which might or might not represent an echo)
3  the passage referred to by a marginal note which does not have a corresponding symbol in the text

Solution 1: PRECISION Attribute on Tag

Use a PRECISION=VAGUE attribute on the tag whose location is uncertain.

Solution 2: Double Tagging

Use double tagging, either with empty tags (as for arbitrary and overlapping segments) or with nested elements, so that one tag occurs at point (a) of example 2 and one at point (b). The nested elements could be separate elements, the inner representing the text segment where the text feature ("echo" in the example) is certainly present, the outer where it might be present. Alternatively, if the element can self-nest, the outer element could have the attribute STATUS=POSSIBLE and the inner STATUS=CERTAIN.
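
How an application might interpret the nested form of double tagging can be sketched as below. This is a hedged illustration in Python, not a TEI recommendation: the `FuzzySpan` class, its character-offset fields, and the third status value "absent" are all assumptions introduced for the example. The outer span corresponds to the element marked STATUS=POSSIBLE and the inner span to the one marked STATUS=CERTAIN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzySpan:
    """A text feature with fuzzy ends, recorded by double tagging."""
    outer_start: int  # point (a): earliest place the feature may begin
    inner_start: int  # point (b): place from which it is certainly present
    inner_end: int    # last point (exclusive) where it is certainly present
    outer_end: int    # latest place (exclusive) where it may still be present

    def __post_init__(self):
        # The certain region must be nested inside the possible region.
        if not (self.outer_start <= self.inner_start
                <= self.inner_end <= self.outer_end):
            raise ValueError("inner span must lie within the outer span")

    def status(self, position):
        """Classify a text position as 'certain', 'possible', or 'absent'."""
        if self.inner_start <= position < self.inner_end:
            return "certain"
        if self.outer_start <= position < self.outer_end:
            return "possible"
        return "absent"


# Usage: an echo certainly present in 120..180, possibly in 100..200.
echo = FuzzySpan(outer_start=100, inner_start=120, inner_end=180, outer_end=200)
labels = [echo.status(p) for p in (90, 110, 150, 190, 200)]
```

The nesting check in `__post_init__` mirrors the constraint that a content model can enforce when the certain element is declared as content of the possible one; with empty boundary tags the same check would fall to the application.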

-------------------------

(1) Note: it appears clear that before publication this paper needs revision along the following lines:

    1. examples of the logical problems need to be expanded upon somewhat, and may need commentary explaining where the problem lies (e.g. why the marking of arbitrary spans in a text represents any problem at all in SGML)
    2. the example solutions need names not numbers
    3. more of the example solutions should be complete parseable SGML documents, either with radically simplified DTDs or using TEI DTDs and extensions as recommended in TEI P2, second version of the Guidelines
    4. in several cases the TEI P2 version of the solution should be presented instead of the generic SGML solution now given
    5. the topics should be reordered so the paper does not end on such a flat note
    6. some of the solutions need better commentary explaining how the declarations and sample markup work to solve the problem
    7. if Nino's walk to the window is to remain a prominent example, we need to provide a fuller transcription of the entire scene; it may be preferable to use an excerpt from a play and imagine markup describing the movement in a particular performance or production

It is not clear whether a quick review of SGML syntax is needed at the beginning, or readers of the journal should be assumed to know SGML well enough to follow with some commentary. Other comments and suggestions are welcome.

(2) Or more elaborately, the passage marked by straight lines in the left margin, the passage marked by wavy lines in the left margin, the passage underlined by hand with simple straight line, the passage underlined by hand with simple straight line which was later deleted by hand, etc., as in the transcriptions of Wittgenstein's manuscripts in the Norwegian Wittgenstein project.
See Claus Huitfeldt and Viggo Rossvoer, The Norwegian Wittgenstein Project Report 1988 ([Bergen]: NAVFs EDB-Senter for Humanistisk Forskning / Norwegian Computing Centre for the Humanities, 1989), esp. pp. 201-236.

(3) This is the equivalent of the <milestone> tag defined in drafts 1.0 and 1.1 of the guidelines for segmentation of text according to pagination of multiple editions. The <milestone> tag differs, however, in assuming a simple single-level segmentation of the text: the values specified in any <milestone> tag apply to all following text until the next <milestone> marked as belonging to the same edition. Hence no explicit end marker is needed and the ID / IDREF mechanism can be dispensed with.

(4) The TEI expects to recommend better support for such non-hierarchical start-end segment-tag pairs as an enhancement of SGML to be added during the next revision of ISO 8879. See document TEI ML W32 for TEI proposals for the revision of SGML.

(5) N.B. the feature structure tags here use a name attribute rather than a separate tag as in TEI P1 version 1.1; this change will be present in TEI P1 version 2.