Notes on SGML Solutions
 
                           To Markup Problems
 
 
                             David Barnard
                   Queen's University, Kingston, Ont.
                              Lou Burnard
                  Oxford University Computing Service
                          Jean-Pierre Gaspart
                              Lynne Price
                            Frame Technology
                         C. M. Sperberg-McQueen
       University of Illinois at Chicago Academic Computer Center
                              Nino Varile
    Commission of the European Communities, Directorate General XIII
 
                      Document Number:  TEI MLW18
 
                             April 16, 1992
 
                       Version 4, April 16, 1992
 
 
This paper discusses some sample problems in the use of SGML which have
come up in the course of the work of the Text Encoding Initiative, and
presents a number of example solutions to each problem.  These notes may
be taken as representing the views of the committee as to appropriate
uses of SGML mechanisms for logical problems; they do not necessarily
reflect the views of the committee concerning the individual application
areas.  The discussion of linguistic examples, especially, focuses on
markup problems and is not intended as a full linguistic analysis.(1)
 
   Some problems discussed here are also treated in the work of ANSI
committee X3V1.8, Music in Information Processing Standards, especially
in their work on hypertext and hypermedia documents.  Their documents,
notably their document X3V1.8/SD-7, "Journal of Development, ANSI
Project X3.749-D, Hypermedia/Time-based Document Structuring Language
(HyTime)," should be consulted by anyone working with hypertext problems
(as should the documents of TEI work group TR3, Hypermedia).  The exam-
ples and solutions described here are pedagogical in intent; divergences
in this document from the recommendations of X3V1.8 or TR3 should not be
taken as rejections of their recommendations.
 
 
 
                                   1
 
                       MARKING ARBITRARY SEGMENTS
 
 
   Problem:  how do you mark text segments which are arbitrary both with
respect to the primary (or any) hierarchical structure in the text, and
also with respect to each other?
 
   Examples:
 
    1 the passage rendered illegible by water stains(2)
    2 the passage rendered inaudible by a passing truck (e.g. in a tran-
    script of an interview)
    3 this is where N.V. moved to the window (e.g. in a gestural tran-
    scription of a conversation)
    4 the discussion of tariffs in these minutes where the topics are:
    tariff policy, foreign relations with Japan (including cultural
    exchange, economic cooperation, tariff policy, patent law), and
    wheat production (including a digression on wheat tariffs)
 
 
Solution 1:  Concurrent Markup
 
   If the number of such segment types is bounded and small, use CONCUR.
For instance, example 3 shows a segment corresponding to a movement by
one participant.  By assigning one document type to each participant, we
can mark each participant's moves using CONCUR.  So NV's move to the
window can be bounded by
 
     <(nino)move desc='walking to window'> ... </(nino)move>
 
without regard for the other structures in the text, after a DTD with
declarations like
 
     <!DOCTYPE nino [
     <!ELEMENT nino          - O  (#PCDATA | move | gesture | ...)*  >
     <!ELEMENT move          - O  (#PCDATA)                          >
     <!ELEMENT gesture       - O  (#PCDATA)                          >
     <!ATTLIST (move, gesture, ...)
               desc               CDATA               #IMPLIED       >
     ]>
 
This assumes that the gestural transcription will mark MOVEs, GESTUREs,
and possibly other things (represented here by the ellipsis), and that
no two gestures or moves overlap or nest.  If gestures and moves may
nest, the content models for the document type NINO should be revised
appropriately:  they may express a suitable hierarchy or be replaced by
the content model ANY; alternatively, the gestural tags may all be
included as inclusion exceptions on the document-type element NINO,
thus:
 
     <!ELEMENT nino          - O  (#PCDATA) +(move | gesture | ...)  >
     etc.
 
   Similarly, legibility or audibility information can be readily accom-
modated by CONCUR with a "leg" or "aud" DTD segmenting the document into
segments of a given legibility or audibility.  The DTD might look some-
thing like this:
 
     <!DOCTYPE aud [
     <!ELEMENT aud           - -  (clear | distorted
                                  | inaudible | dna)+                >
     <!ELEMENT clear         - -  (#PCDATA)                          >
     <!ELEMENT distorted     - -  (#PCDATA)                          >
     <!ELEMENT inaudible     - -  (#PCDATA)                          >
     <!ELEMENT dna           - -  (#PCDATA)                          >
     <!ATTLIST (distorted, inaudible)
               cause              CDATA               #IMPLIED       >
     ]>
 
The use of (#PCDATA) as a content model ensures that these tags cannot
nest (which we assume would be meaningless) and the content model on AUD
ensures that every portion of the document will be marked as clear,
distorted, inaudible, or "dna" ("does not apply", for sections of the
document to which audibility does not apply: document header,
annotation, etc.).  So the document could have, interspersed randomly
among other tags, sequences like:
 
     <(aud)dna> [document header ...] </(aud)dna>
     <(aud)clear> ... </(aud)clear>
     <(aud)inaudible cause='truck'> ... </(aud)inaudible>
     <(aud)clear> ... </(aud)clear>
     <(aud)distorted cause='volume overload'> ... </(aud)distorted>
     <(aud)clear> ... </(aud)clear>
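
   As an illustration of how application software might read one of
these concurrent views, the following Python sketch collects the
segments of a single document type from a tagged string and ignores the
tags of every other type.  The regular expression, the function name
view_segments, and the sample string are assumptions made for this
note; a real application would rely on an SGML parser that supports
CONCUR rather than on pattern matching.

```python
import re

# Matches CONCUR-style tags such as <(aud)clear> or </(aud)inaudible>.
# Simplified for illustration: it does not handle '>' inside attribute
# values, comments, or marked sections.
TAG = re.compile(r"<(/?)\((\w[\w.]*)\)([\w.-]+)([^>]*)>")


def view_segments(text, doctype):
    """Return (element, content) pairs for one document type.

    Tags that belong to other document types are dropped from the
    content; only their character data is kept.
    """
    segments = []
    current = None          # name of the element open in this view
    buffer = []             # character data collected for it so far
    pos = 0
    for m in TAG.finditer(text):
        if current is not None:
            buffer.append(text[pos:m.start()])
        pos = m.end()
        closing, dtype, name, _attrs = m.groups()
        # Names are compared without regard to case (an assumption
        # matching the usual SGML reference concrete syntax).
        if dtype.lower() != doctype.lower():
            continue        # tag belongs to another view
        if closing:
            segments.append((current, "".join(buffer).strip()))
            current, buffer = None, []
        else:
            current, buffer = name, []
    return segments


sample = ("<(aud)clear>Hello <(tei.1)p>there</(tei.1)p></(aud)clear>"
          "<(aud)inaudible cause='truck'>mumble</(aud)inaudible>")
print(view_segments(sample, "aud"))
# [('clear', 'Hello there'), ('inaudible', 'mumble')]
```

The sketch relies on the property noted above: because the elements of
the AUD view have (#PCDATA) content they cannot nest, so a single
"current element" variable is enough.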
 
 
Solution 2a:  Single Empty Segment-Boundary Element
 
   Where the number of types of segments is in principle unbounded (e.g.
not only move and gesture but an indefinite number of further possibili-
ties), a single <event> tag may be used to mark the beginning or end of
any segment, thus dividing the document into time-slices to be managed
and grouped by higher-level application software.(3)
 
   If we wish to transcribe, say, movement, location, and whether a par-
ticipant is smoking, we can segment the text on these lines:  if N.V.
walks to the window, stands there, and after a few minutes lights a
cigarette, returning to the table before putting it out, we could imag-
ine a simple segmentation like this:
 
         <tei.1>
         <event start id='e237' person='nino' act='sits at table'>
         ... passage A ...
         <event start id='e238' person='nino' act='walking to window'>
         ... passage B ...
         <event end   startid='e238' >
         <event start id='e239' person='nino' act='stands at window'>
         ... passage C ...
         <event start id='e240' person='nino' act='smokes'>
         ... passage D ...
         <event end   startid='e239' >
         <event start id='e241' person='nino' act='walks to table'>
         ... passage E ...
         <event end   startid='e241' >
         ... passage F ...
         <event end   startid='e240' >
         ... passage G ...
         </tei.1>
 
   The application software is then responsible for linking each <event
end> to its corresponding <event start>, by means of the identical value
on the id attribute of the <event start> and the startid attribute (an
IDREF attribute) on the <event end> tag, and treating the intervening
text as though it were a single element.  Here, passages D-F must be
treated as though braced by <(nino)smoking> ... </(nino)smoking>, and
passages C-D as though a single segment with N.V.'s location as marked.
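
   The linking step left to the application can be sketched in Python as
follows.  The representation of the document as a list of passage labels
and attribute dictionaries, and the function name link_events, are
assumptions made for this illustration; they are not part of any TEI
software.

```python
def link_events(stream):
    """Pair each event end with its start.

    `stream` holds passage labels (str) and event tags (dict) in
    document order.  Returns (segments, unclosed): `segments` maps an
    event id to (act, passages covered); `unclosed` lists the ids of
    event starts that never received an end.
    """
    open_events = {}        # id -> (act, passages seen so far)
    segments = {}
    for item in stream:
        if isinstance(item, str):
            # A passage belongs to every event that is currently open.
            for _act, passages in open_events.values():
                passages.append(item)
        elif item["endpoint"] == "start":
            open_events[item["id"]] = (item["act"], [])
        elif item["endpoint"] == "end":
            # Raises KeyError if startid names no open event.
            segments[item["startid"]] = open_events.pop(item["startid"])
    return segments, sorted(open_events)


# The sample transcription above, reduced to its tags and passages.
SAMPLE = [
    {"endpoint": "start", "id": "e237", "act": "sits at table"},
    "A",
    {"endpoint": "start", "id": "e238", "act": "walking to window"},
    "B",
    {"endpoint": "end", "startid": "e238"},
    {"endpoint": "start", "id": "e239", "act": "stands at window"},
    "C",
    {"endpoint": "start", "id": "e240", "act": "smokes"},
    "D",
    {"endpoint": "end", "startid": "e239"},
    {"endpoint": "start", "id": "e241", "act": "walks to table"},
    "E",
    {"endpoint": "end", "startid": "e241"},
    "F",
    {"endpoint": "end", "startid": "e240"},
    "G",
]

segments, unclosed = link_events(SAMPLE)
for event_id in sorted(segments):
    act, passages = segments[event_id]
    print(event_id, act, passages)
# e238 walking to window ['B']
# e239 stands at window ['C', 'D']
# e240 smokes ['D', 'E', 'F']
# e241 walks to table ['E']
print("unclosed:", unclosed)
# unclosed: ['e237']
```

The list of unclosed events is the kind of check that falls to the
application, since nothing in the markup itself pairs the tags.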
 
   The document type declaration fragment required for tagging of this
kind would be something like this:
 
     <!DOCTYPE tei.1 [
     <!-- all normal TEI tags defined, and also EVENT          -->
     <!-- EVENT is defined as part of F.EMPTY and can occur    -->
     <!-- anywhere within prose.                               -->
     <!ENTITY  % f.empty
               "citn.ref | milestone | xref | anchor | include |
               ext.formula | ext.table | ext.figure | index.term
               | event"                                          >
     <!ELEMENT event     - O EMPTY                               >
     <!ATTLIST event
               ID           ID                        #IMPLIED
               startid      IDREF                     #IMPLIED
               person       (nino | lynne | lou | jp) #IMPLIED
               act          CDATA                     #IMPLIED
               endpoint     (start | end | point)     #IMPLIED   >
     ]>
 
   SGML would not verify that each <event start> tag had exactly one
corresponding <event end>, nor that each event-start tag had an ID
attribute and each event-end tag a STARTID attribute, but the ID/IDREF
mechanism would ensure that each STARTID pointed at exactly one ID.
 
 
Solution 2b:  Paired Segment-Boundary Elements
 
   The method of the preceding section requires a clean distinction
among <event> tags:  some mark the beginning of an event and must have
an ID value, along with person and act; others mark the end of the same
event and should have only a startid value.  A third type marks the
point of occurrence of an event without duration, for which person and
act values are logically required and an ID value optional.  As was
noted above, SGML is not in a position to enforce these constraints from
the declarations given.  We may make the method more reliable, at the
cost of two additional element types, by defining three distinct ele-
ments for the three types of event tags.
 
   In this revised method, a pair of <event-start> and <event-end> tags
may be used to mark the beginning and end of any segment.  Events which
occur at a single point in time and have no distinct start and end may
be represented by a third <event> tag.
 
   The same sequence of events as that given above would be transcribed
thus, in this method:
 
         <tei.1>
         <event-start id='e237' person='nino' act='sits at table'>
         ... passage A ...
         <event-start id='e238' person='nino' act='walking to window'>
         ... passage B ...
         <event-end startid='e238' >
         <event-start id='e239' person='nino' act='stands at window'>
         ... passage C ...
         <event-start id='e240' person='nino' act='smokes'>
         ... passage D ...
         <event-end startid='e239' >
         <event-start id='e241' person='nino' act='walks to table'>
         ... passage E ...
         <event-end startid='e241' >
         ... passage F ...
         <event id='e242' person='nino' act='knocks on wood'>
         ... passage G ...
         <event-end startid='e240' >
         ... passage H ...
         </tei.1>
 
   As before, the application is responsible for treating passages C-D
(N.V. standing at the window) and D-G (N.V. smoking) as units.
 
   The document type declaration fragment required for tagging of this
kind would be something like this:
 
     <!DOCTYPE tei.1 [
     <!-- all normal TEI tags defined, and also EVENT,         -->
     <!-- EVENT-START, and EVENT-END, which are defined as     -->
     <!-- part of F.EMPTY and can occur anywhere within prose. -->
     <!ENTITY  % f.empty
               "citn.ref | milestone | xref | anchor | include |
               ext.formula | ext.table | ext.figure | index.term
               | event | event-start | event-end"                >
     <!ELEMENT event       - O EMPTY                             >
     <!ELEMENT event-start - O EMPTY                             >
     <!ELEMENT event-end   - O EMPTY                             >
     <!ATTLIST event
               ID           ID                        #IMPLIED
               person       (nino | lynne | lou | jp) #REQUIRED
               act          CDATA                     #REQUIRED  >
     <!ATTLIST event-start
               ID           ID                        #REQUIRED
               person       (nino | lynne | lou | jp) #REQUIRED
               act          CDATA                     #REQUIRED  >
     <!ATTLIST event-end
               startid      IDREF                     #REQUIRED  >
     ]>
 
 
Solution 3:  Typed Segment-Boundary Delimiters
 
   If the types of events form a closed set, a different segment-
boundary element can be defined for each type of event.  Like the
<event> tag of solution 2a, these segment-boundary tags would be empty.
To define distinct segment-boundary tags for moves and gestures, the DTD
would include definitions like these:
 
     <!ELEMENT move          - O  EMPTY>
     <!ELEMENT gesture       - O  EMPTY>
     <!ATTLIST (move, gesture)
               endpnt             (start | end | point)    #REQUIRED
               person             (nino | lynne |
                                  lou | msm | jp )         #REQUIRED
               desc               CDATA                    #IMPLIED
               id                 ID                       #IMPLIED
               startid            IDREF                    #IMPLIED  >
 
N.V.'s walk to the window would be marked:
 
         ...
         <move start id='e237' person='nino' desc='walking to window'>
         ...
         <!-- during this segment, Nino is walking to the window -->
         ...
         <move end   startid='e237' person='nino'>
         <!-- Nino is now at the window -->
         ...
 
As in solution 2, application software would be responsible for linking
the start and end tags (by means of the identical value for the ID and
STARTID attributes).  SGML would not verify that each START tag had
exactly one END, nor that each START tag had an ID attribute and each
END tag a STARTID attribute, but the ID/IDREF mechanism would ensure
that each STARTID pointed at exactly one ID.  Better SGML validation can
be achieved, as before, by creating pairs of distinct segment-start and
segment-end tags.(4)
 
 
Solution 4:  Arbitrary Segments as Lists of Elements
 
   A variant on solution 1 can be used to provide slightly better sup-
port for arbitrary segments in SGML.  In this variant, an arbitrary seg-
ment of the text is defined as a set of elements of the text; where
existing elements do not have precisely the right extension to define
the desired segment, special segmentation elements are used to place
boundaries in the correct positions in the text.
 
   This approach is defined in TEI P1, where the <al.map> element is
used to specify arbitrary segments of a text as sets of elements, using
<s> elements if necessary to divide the text into the desired chunks.
 
   This approach provides a more declarative interpretation of arbitrary
segments (in terms of a set of subtrees of the parse tree, rather than
in terms of a specific processing model involving left-to-right scan of
the document); it also automatically provides for discontiguous seg-
ments.  Its disadvantage is in requiring out-of-line markup:  the char-
acteristics to be associated with a given arbitrary segment are speci-
fied in a separate element, e.g. an <f.struct>, and associated with the
arbitrary segment only through an alignment map.
 
   Using this method, N.V.'s walk to the window might be marked this
way.(5)  Comments are used to show more clearly what is happening; the
information is carried by the tags, however, not the comments.
 
     <tei.1> ...
     <text><body><p> ... text ...
         <s id='s101'>
         <!-- Nino sits at the table -->
         ... passage A ...
         </s><s id='s102'>
         <!-- Nino starts toward the window -->
         ... passage B1 ...
         </s>
     </p>
     <p><s id='s103'>
         ... passage B2 ...
         </s><!-- Nino arrives at window -->
         <s id='s104'>
         <!-- Nino stands at window -->
         ... passage C ...
         <s id='s105'>
         <!-- Nino smokes ... -->
         ... passage D1 ...
         </s></s>
     </p>
     <p id='p234'><s id='s106'>
         ... passage D2 ...
         </s><!-- Nino leaves window and goes to table -->
         <s id='s107'>
         ... passage E ...
         </s><!-- Nino arrives at table, sits -->
     </p>
     <p id='p235'>
         ... passage F ...
     </p>
     <p id='p238'>
         <s id='s108'></s><!-- Nino knocks on wood -->
         ... passage G ...
         <!-- Nino stops smoking -->
     </p>
     <p id='p239'>
         ... passage H ...
     </body>
     <analysis>
         <!-- N.B. ordering of f.structs is irrelevant. -->
         <f.struct id='m1'>
              <feature name='who'>Nino</feature>
              <feature name='act'>sits</feature>
              <feature name='loc'>at table.</feature></f.struct>
         <f.struct id='m2'>
              <feature name='who'>Nino</feature>
              <feature name='act'>walks</feature>
              <feature name='loc'>table to window.</feature></f.struct>
         <f.struct id='m3'>
              <feature name='who'>Nino</feature>
              <feature name='act'>stands</feature>
              <feature name='loc'>at window.</feature></f.struct>
         <f.struct id='m4'>
              <feature name='who'>Nino</feature>
              <feature name='act'>smokes.</feature></f.struct>
         <f.struct id='m5'>
              <feature name='who'>Nino</feature>
              <feature name='act'>knocks on wood.</feature></f.struct>
         <f.struct id='m6'>
              <feature name='who'>Nino</feature>
              <feature name='act'>walks</feature>
              <feature name='loc'>window to table.</feature></f.struct>
     </analysis>
     <alignment>
     <al.map><!-- Nino sitting at table, passage A -->
         <al.ptr target='m1'><al.ptr target='s101'>
     </al.map>
     <al.map><!-- Nino walking to window, passages B1, B2 -->
         <al.ptr target='m2'><al.range al.start='s102' al.end='s103'>
     </al.map>
     <al.map><!-- Nino at window, passages C, D -->
         <al.ptr target='m3'><al.range al.start='s104' al.end='s106'>
     </al.map>
     <al.map><!-- Nino smokes, passages D-G -->
         <al.ptr target='m4'><al.range al.start='s105' al.end='p238'>
     </al.map>
     <al.map><!-- Nino knocks on wood, no duration -->
         <al.ptr target='m5'><al.ptr target='s108'>
     </al.map>
     <al.map><!-- Nino walks to table, passage E -->
         <al.ptr target='m6'><al.ptr target='s107'>
     </al.map>
     <al.map><!-- Nino sitting at table, passages F, G, H -->
         <al.ptr target='m1'><al.range al.start='p235' al.end='p239'>
         <!-- N.B. we use same f.struct (M1) as for passage A -->
     </al.map>
     </alignment>
     </tei.1>
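
   How an application might resolve such an alignment map to stretches
of text can be sketched in Python.  The table of element extents below
is transcribed by hand from the sample and stands in for what a parser
would supply; the names EXTENT, resolve_ptr, and resolve_range are
hypothetical.  The sketch also assumes that an <al.range> runs from the
start of the element named by al.start to the end of the element named
by al.end, which is how the comments in the sample read.

```python
# Passages of the sample in document order.
PASSAGES = ["A", "B1", "B2", "C", "D1", "D2", "E", "F", "G", "H"]

# Element id -> (index of first passage, index one past the last).
# Note that s105 is nested inside s104, so their extents overlap.
EXTENT = {
    "s101": (0, 1), "s102": (1, 2), "s103": (2, 3),
    "s104": (3, 5), "s105": (4, 5), "s106": (5, 6),
    "s107": (6, 7), "p235": (7, 8), "p238": (8, 9),
}


def resolve_ptr(target):
    """Passages covered by a single element (an <al.ptr> target)."""
    first, last = EXTENT[target]
    return PASSAGES[first:last]


def resolve_range(al_start, al_end):
    """Passages from the start of al_start to the end of al_end."""
    first = EXTENT[al_start][0]
    last = EXTENT[al_end][1]
    if last < first:
        raise ValueError("al.end precedes al.start in document order")
    return PASSAGES[first:last]


print(resolve_ptr("s101"))             # ['A']
print(resolve_range("s102", "s103"))   # ['B1', 'B2']
print(resolve_range("s104", "s106"))   # ['C', 'D1', 'D2']
print(resolve_range("s105", "p238"))   # ['D1', 'D2', 'E', 'F', 'G']
```

Because the segment is computed from element extents rather than from a
left-to-right scan for matching tags, a range may freely begin in one
kind of element and end in another, as the smoking segment does.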
 
 
Discussion
 
   CONCUR is optimal for expressing orthogonal views of the document.
Movement by participants in a conversation may be so viewed.  Topic
shift (ex. 4) is really not orthogonal and might require segment-
terminus tags (solution 2).  Solution 2 might also be preferred for data
capture; a mechanical operation should be able to convert the resulting
text to one using concurrent markup.  If the number of views (types of
segment) is in principle bounded, prefer CONCUR.  If the number of views
is in principle unbounded, the event/time-slice technique must be used.
In this case one tag (EVENT) will suffice and more should not be used.
 
 
 
                                   2
 
                     MARKING DISCONTIGUOUS SEGMENTS
 
 
   Problem:  how do you mark a segment that is characterized by a single
feature but is discontiguous in the text?
 
   Examples:
 
    1 the words rendered illegible by the stain on the right hand side
    of this page
    2 the finite verb "stellte vor" in the German sentence "Er stellte
    seine These den Kollegen hoffnungsvoll vor"
    3 the discussion of tariffs in these parliamentary minutes (assuming
    that the discussion wanders back and forth from one topic to
    another)
    4 the root KTB in the Arabic word "al-kaatib"
 
 
Solution 1:  Co-indexing
 
   Use co-indexing by means of the SGML ID/IDREF mechanism.
If we wish, we can gather all the ID occurrences in other tags else-
where, in a sort of register which might look like this (for problem
example 3):
 
     <!DOCTYPE tei.1 system "tei1.dtd" [
         <!-- various declarations to allow use of topic declaration
              within front matter and topics within text body ... -->
         <!ELEMENT topic_declaration - O  EMPTY                       >
         <!ATTLIST topic_declaration
                   full                   CDATA       #REQUIRED
                   ID                     ID          #REQUIRED       >
         <!ELEMENT topic             - -  (#PCDATA)                   >
         <!ATTLIST topic
                   topicid                IDREF       #REQUIRED       >
 
     ]>
     <TEI.1><TEI.header> ... </TEI.header>
     <text><front> ...
     <topiclist>
         <topic_declaration id='tariff' full='Tariffs on steel'>
         <topic_declaration id='wheat'  full='Wheat Crop Projections'>
         <topic_declaration id='flag'   full='National Flag Month'> ...
     </topiclist> ...
     </front>
     <body> ...
     <topic topicid='tariff'> ... </topic>
     <topic topicid='wheat'> ... </topic>
     <topic topicid='tariff'> ... </topic>
     <topic topicid='flag'> ... </topic>
     <topic topicid='tariff'> ... </topic> ...
     </body>
 
   An alternative form would use a single tag for both the ID and the
IDREF attributes, using declarations like these (where LEG='legibility'
and S='stain'):
 
     <!ELEMENT leg           - -  (#PCDATA | s)+                     >
     <!ELEMENT s             - O  (#PCDATA)                          >
     <!ATTLIST s
               id                 ID                  #IMPLIED
               segid              IDREF               #IMPLIED       >
 
and allowing document sequences like:
 
     Random statistical quirk for the day: the
     word "no" appears 1344 times in the King
     James Bible, but the <(leg)s id='s23'>word</(leg)s> "yes" appears only
     twice!  (Grep for<(leg)s segid='s23'> yourself if yo</(leg)s>u don't
     believe me).  At<(leg)s segid='s23'> first I thought thi</(leg)s>s was
     just a hilarious<(leg)s segid='s23'> artifact of religious</(leg)s>
     dogma, so I chec<(leg)s segid='s23'>ked Alice in Wonderlan</(leg)s>d --
     "yes" appears onl<(leg)s segid='s23'>y once!  Curiouser </(leg)s>and
     curiouser.  Well it<(leg)s segid='s23'> turns out to be</(leg)s> a
     property of English (yes<(leg)s segid='s23'>/no = .0</(leg)s>66 on
     average), and when you consider why this
     might be, it's undoubtedly due to the fact ...
 
     (Humanist 3.769, Tue, 21 Nov 89, posting from
     mike@tome.media.mit.edu (Michael Hawley))
 
Note that one occurrence of S (here the first) must have an ID attri-
bute, and the others an IDREF.
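
   Reassembling the pieces of such a co-indexed segment is again a task
for the application.  The following Python sketch gathers every S
element of the LEG view under the identifier it carries, whether as id
(the head) or as segid (the tails).  The pattern and the function name
collect_segments are assumptions made for this illustration, and the
pattern recognizes only tags written exactly as in the sample.

```python
import re
from collections import defaultdict

# One S element of the LEG view, with either an id or a segid value.
S_ELEMENT = re.compile(
    r"<\(leg\)s\s+(?:id|segid)='([^']+)'>(.*?)</\(leg\)s>",
    re.DOTALL,
)


def collect_segments(text):
    """Map each identifier to its pieces, in document order."""
    groups = defaultdict(list)
    for key, content in S_ELEMENT.findall(text):
        groups[key].append(content)
    return dict(groups)


sample = ("the <(leg)s id='s23'>word</(leg)s> \"yes\" appears only\n"
          "twice!  (Grep for<(leg)s segid='s23'> yourself if yo</(leg)s>u")
print(collect_segments(sample))
# {'s23': ['word', ' yourself if yo']}
```

Each list holds the pieces in document order; whether and how they are
joined for display is left to the application.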
 
   Of the two mechanisms described here, the former, with different tags
for the head of the group and the various tails, is to be preferred.
 
 
Solution 2:  Redundant Separate Storage
 
   For micro-discontinuities like that in example 4, it might be simpler
to introduce redundancy and store the discontiguous segment separately,
e.g. with
 
     <word root='KTB'>al-kaatib</word>
 
or
 
     <word><root>KTB</root>
           <form>al-kaatib</form>
     </word>
 
Since KTB is analysis and not part of the text being lemmatized, the ML
committee leaned toward the former solution (root as attribute, not ele-
ment).
 
 
Solution 3:  Alignment Mechanism
 
   Another mechanism for marking discontiguous segments is the alignment
map mechanism defined in chapter 6 of TEI P1 and described above as
solution 4 for arbitrary segmentation.
 
 
 
                                   3
 
                       HANDLING AMBIGUOUS CONTENT
 
 
   Problem:  how does one mark multiple analyses of the same content?
 
   Examples:
 
    1 the gross syntactic structure of the sentence "I saw the man with
    the telescope"
    2 the pagination of the various editions of Shakespeare's Hamlet
 
 
Solution 1:  Concurrent Markup
 
   Use CONCUR and define a separate document type for each edition to be
included.  Assume that we wish to mark volume, page, and column numbers
for some editions, volume and page numbers for others.  The following
DTD may be embedded for each edition; it assumes that any edition is
composed of one or more volumes, each volume comprises a set of pages,
and each page can contain character data, lines, or columns.  Because
different editions have different material, an OMITTED tag is provided
to mark some contents as not being present in the edition.
 
     <!-- Define "VERSION.NAME" in the document type declaration   -->
     <!-- subset before calling these declarations.  Sample:       -->
     <!--    <!DOCTYPE La system 'plrefs.dtd' [                    -->
     <!--    <!ENTITY % version.name "La" >                        -->
     <!--    ]>                                                    -->
     <!--                                                          -->
     <!-- N.B. this hierarchy requires all data to be marked with  -->
     <!-- the volume and page of the edition, or marked as omitted -->
     <!-- A looser hierarchy may be defined if desired, by         -->
     <!-- inserting "#PCDATA | " at the beginning of               -->
     <!-- the content models for %version.name and VOL, or by      -->
     <!-- defining PAGE as (#PCDATA | C | L)* which would          -->
     <!-- allow some lines to be marked without marking all lines. -->
     <!-- A tighter hierarchy may be defined by omitting #PCDATA   -->
     <!-- from the content models for PAGE and C, thus requiring   -->
     <!-- all lines to be marked.                                  -->
     <!--                                                          -->
     <!ENTITY % version.name "ref">
     <!ELEMENT %version.name - -  (vol | page)*       +(omitted)     >
     <!ELEMENT omitted       - O  (#PCDATA)                          >
     <!ELEMENT vol           - O  (page)*                            >
     <!ELEMENT page          - O  (#PCDATA | l+ | c+)                >
     <!-- Columns and lines get short names since they occur often -->
     <!ELEMENT c             - O  (#PCDATA | l)*                     >
     <!ELEMENT l             - O  (#PCDATA)                          >
     <!ATTLIST (vol, page, c, l)
               n             CDATA                    #IMPLIED
               id            ID                       #IMPLIED       >
 
   This concurrent hierarchy is enabled as shown in the comments; the
document contains (after the lines enabling the basic document hier-
archy) the sequence of lines (assuming the DTD is stored under the sys-
tem file identifier "plrefs.dtd"):
 
         <!DOCTYPE La system 'plrefs.dtd' [
         <!ENTITY % version.name "La" >
         ]>
 
which call the document type for page and line references and give it
the name "La."  If page and line numbers from more than one standard
edition are to be marked, then the relevant lines may be repeated, each
time using a different value for the document type and entity definition
(where the example has "La").
 
   Multiple editions of Hamlet might be tagged this way, using this
mechanism:
 
     <!DOCTYPE TEI.1 system "TEI1.DTD" [
         <!ENTITY % TEI.base system "teidram1.dtd" >
     ]>
     <!DOCTYPE F system "plrefs.dtd" [
         <!-- First Folio pagination -->
         <!ENTITY % version.name "F" >
     ]>
     <!DOCTYPE Q1 system "plrefs.dtd" [
         <!-- First Quarto pagination -->
         <!ENTITY % version.name "Q1" >
     ]>
     <!DOCTYPE Q2 system "plrefs.dtd" [
         <!-- Second Quarto pagination -->
         <!ENTITY % version.name "Q2" >
     ]>
     <!DOCTYPE Ri system "plrefs.dtd" [
         <!-- Riverside Shakespeare pagination -->
         <!ENTITY % version.name "Ri" >
     ]>
     <(tei.1)tei.1><(f)f><(q1)q1><(q2)q2><(Ri)Ri>
     <(f)omitted><(q1)omitted><(q2)omitted><(Ri)omitted>
     <!>
     <(tei.1)tei.header> ... </(tei.1)tei.header>
     <!>
     </(f)omitted></(q1)omitted></(q2)omitted></(Ri)omitted>
     <(tei.1)text><(tei.1)body>
     <!>
     <!-- Act 1, Scene 1 starts ... -->
     <(tei.1)div1 name='act' n='1'>
     <!-- initial pagination for various editions -->
     <(F)page n='g5a'>
     <(Q1)page n='3'>
     <(Q2)page n='[3]'>
     <(Ri)page n='234'>
     <!-- ... text of Hamlet ... -->
     </(F)page><(F)page n='g5b'>
     <!-- ... text of Hamlet ... -->
     </(Q2)page><(Q2)page n='4'>
     <!-- ... text of Hamlet ... -->
     </(Ri)page><(Ri)page n='235'>
     <!-- ... text of Hamlet ... -->
     </(F)page><(F)page n='g5b'>
     <!-- ... text of Hamlet ... through end ... -->
     </(f)page></(q1)page></(q2)page></(Ri)page>
     </(tei.1)body></(tei.1)text></(tei.1)tei.1>
 
 
Solution 2:  Redundant Storage of String
 
   The string may be repeated with different markup each time.  This is
an obvious solution but causes problems for views other than the one in
which the ambiguity is visible:  they see only the repeated content, not
the difference in tagging.
 
 
Solution 3:  Out-of-Line Markup (Empty Elements)
 
   The chart of this sentence may be represented with an empty element
for each arc of the chart, with pointers to the endpoints of the arc.
The DTD will have:
 
     <!ELEMENT sentence (text, parse*) >
     <!ELEMENT text     (#PCDATA)      >
     <!ELEMENT parse    (arc)+         >
     <!ELEMENT arc      EMPTY          >
 
     <!ATTLIST arc      marker (s, np, vp, pp, v, n, p) #IMPLIED
                        from   NUMBER                   #REQUIRED
                        to     NUMBER                   #REQUIRED >
 
Where the tokens of the text are numbered 1-N, and the endpoints of the
nodes are 0-N, node K follows token K of the text.  (If validation of
endpoints by the SGML parser is desired, then make these changes or
additions to the DTD:
 
     <!ELEMENT text     (word)+        >
     <!ELEMENT word     (#PCDATA)      >
     <!ATTLIST sentence id     ID                       #IMPLIED  >
     <!ATTLIST word     id     ID                       #IMPLIED  >
     <!ATTLIST arc      marker (s, np, vp, pp, v, n, p) #IMPLIED
                        from   IDREF                    #REQUIRED
                        to     IDREF                    #REQUIRED >
 
and assign the SENTENCE ID to be the node before the first word.)
 
The text will have the following (using SHORTTAG to omit redundant
attribute names, and using comments in the right margin to indicate
selected phrases):
 
     <sentence><text>I saw the man with the telescope.</text>
     <parse>
       <arc S from='0' to='7'>
          <arc NP from='0' to='1'>        <!-- I           -->
          <arc VP from='1' to='7'>        <!-- s.t.m.w.t.t.-->
             <arc VP from='1' to='4'>     <!-- saw the man -->
                <arc V  from='1' to='2'>  <!-- saw         -->
                <arc NP from='2' to='4'>  <!--     the man -->
             <arc PP from='4' to='7'>     <!-- with the telescope -->
                <arc P  from='4' to='5'>  <!-- with               -->
                <arc NP from='5' to='7'>  <!--      the telescope -->
     </parse>
     <parse>
       <arc S from='0' to='7'>
          <arc NP from='0' to='1'>        <!-- I                  -->
          <arc VP from='1' to='7'>
             <arc V  from='1' to='2'>     <!-- saw                -->
             <arc NP from='2' to='7'>     <!-- the man w/ the tel -->
                <arc NP from='2' to='4'>  <!-- the man            -->
                <arc PP from='4' to='7'>  <!-- with the telescope -->
                   <arc P  from='4' to='5'>   <!-- with           -->
                   <arc NP from='5' to='7'>   <!-- the telescope  -->
     </parse>
     </sentence>
 
Obviously, the unambiguous arc information can be interspersed with the
text, leaving the PARSE elements to group the competing analyses.  This
does complicate the DTD and the text.
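
   An application reading such a chart has to recover the tree from the
node numbers.  The following Python sketch is illustrative only:  it
assumes that the ARC elements are empty, that the indentation in the
example is purely visual, and that the arcs have already been extracted
as (label, from, to) tuples.  The names nest_arcs and show are
hypothetical and belong to no SGML tool.

```python
# Each arc is (label, start_node, end_node).  Nodes are the gaps between
# words, numbered as in the example: 0 before "I", 7 after "telescope".
PARSE_1 = [
    ("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("VP", 1, 4), ("V", 1, 2),
    ("NP", 2, 4), ("PP", 4, 7), ("P", 4, 5), ("NP", 5, 7),
]
PARSE_2 = [
    ("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("V", 1, 2), ("NP", 2, 7),
    ("NP", 2, 4), ("PP", 4, 7), ("P", 4, 5), ("NP", 5, 7),
]


def nest_arcs(arcs):
    """Rebuild the tree implied by a flat list of arcs.

    Returns a nested (label, start, end, children) tuple, or raises
    ValueError if two arcs overlap without one containing the other.
    """
    # Widest arc first at each start node, so parents precede children.
    ordered = sorted(arcs, key=lambda a: (a[1], -a[2]))
    root = None
    stack = []  # arcs still open, outermost first
    for label, start, end in ordered:
        node = (label, start, end, [])
        # Close every open arc that ends at or before this arc's start.
        while stack and stack[-1][2] <= start:
            stack.pop()
        if stack:
            if end > stack[-1][2]:
                raise ValueError(
                    f"arc {label} {start}-{end} crosses {stack[-1][:3]}")
            stack[-1][3].append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("arcs do not share a single root")
        stack.append(node)
    return root


def show(node):
    """Render a nested arc as a labelled bracketing over node numbers."""
    label, start, end, children = node
    if not children:
        return f"{label}({start}-{end})"
    return f"{label}(" + " ".join(show(c) for c in children) + ")"
```

Applied to the two parses above, show(nest_arcs(PARSE_1)) attaches the
PP under the outer VP, while show(nest_arcs(PARSE_2)) attaches it under
the object NP.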
 
   Note:  The solution described here is fundamentally similar to that
offered by TEI P1's tags for linguistic analysis:  out-of-line analysis
linked to the analysed text by pointers implemented by SGML ID and IDREF
attributes.  Like this one, the <f.struct> notation allows multiple
analyses of the same content; the intermingling of content and analysis
is not contemplated, for simplicity's sake.
 
 
Solution 4:  Special Notation
 
   Use a special notation to express the parses more compactly, at the
cost of losing validation by the SGML parser.
 
   Using a DTD like this:
 
     <!ELEMENT sentence (text, parse*) >
     <!ELEMENT text     (#PCDATA)      >
     <!ELEMENT parse    EMPTY          >
     <!ATTLIST parse    p CDATA  #REQUIRED >
 
We can have a text like this:
 
     <sentence><text>I saw the man with the telescope.</text>
       <parse p='( (1) ( ((2)   (3 4)) ((5)(6 7)) ) )'>
       <parse p='( (1) (  (2) ( (3 4)  ((5)(6 7)) ) ) )'>
     </sentence>
 
Or:
 
     <sentence><text>I saw the man with the telescope.</text>
       <parse p='s( np(1) vp( vp(v(2)  np(3 4)) pp(p(5)np(6 7)) ) )'>
       <parse p='s( np(1) vp( v(2) np( np(3 4)  pp(p(5)np(6 7)) ) ) )'>
     </sentence>
 
 
Solution 5:  Treat as Arbitrary Segments
 
   Treat all parse subtrees as arbitrary segments using the techniques
already outlined.
 
 
Commentary
 
   Where local ambiguities are independent, leading to a combinatorial
explosion of overall ambiguity, concurrent markup is not wholly
satisfactory, since it requires a separate markup stream for each
overall interpretation.  Out-of-line markup in the style of solution 3,
or of the <f.struct> construct of TEI P1, is preferred in these cases.
Where there is no combinatorial explosion (as in the multiple
paginations of classic works) and the different segmentations of the
text do not interact, CONCUR is the preferred solution.
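
   The size of the explosion is easy to work out:  with independent
local ambiguities, the number of overall interpretations (and hence of
concurrent markup streams) is the product of the numbers of local
readings.  The Python sketch below merely illustrates that arithmetic;
the function names and the "high"/"low" reading labels are hypothetical.

```python
from itertools import product
from math import prod


def streams_needed(alternatives):
    """Number of concurrent markup streams for independent ambiguities.

    alternatives[i] is the number of competing readings at the i-th
    ambiguous point; each overall interpretation picks one per point.
    """
    return prod(alternatives)


def overall_interpretations(readings):
    """Enumerate every overall interpretation, one tuple per stream."""
    return list(product(*readings))
```

Three independent two-way attachment ambiguities thus already call for
eight streams, and ten of them for 1024.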
 
 
 
                                   4
 
             MARKING OVERLAPPING (E.G. BI-CLAUSAL ANALYSIS)
 
 
   Problem:  how does one mark text segments which can associate either
left or right, as in "she (took advantage [of) Joan]" or "Broadway Hit
or Miss?" or as in apo koinu constructions?
 
   Solutions:  these examples appear to be solved by the methods of
arbitrary segmentation and by the out-of-line markup mechanisms
described in chapter 6 of TEI P1 (<f.struct> and <alignment>).
 
 
 
                                   5
 
           SYNCHRONOUS PARALLEL STRUCTURES AND TRANSCRIPTIONS
 
 
   Problem:  How do you mark the synchronization points of a set of par-
allel texts (e.g. texts of the Bible, or the nine language versions of
EEC legislation, or phonetic, phonemic, and orthographic transcriptions
of the same text)?
 
   Examples:
 
    1 parallel texts (translation equivalents)
    2 parallel texts (manuscript variants or recensions)
    3 phonemic and orthographic transcriptions of same content
 
 
Solution 1:  Implicit Parallelism
 
   So long as order is preserved, parallelism between synchronous struc-
tures can be implicit.  The lowest level at which the parallelism is to
be expressed contains a sequence of parallel versions.  For example 3,
the DTD might include:
 
     <!ELEMENT segment      - O (phonemic, orthographic) >
     <!ELEMENT phonemic     - O (#PCDATA)                >
     <!ELEMENT orthographic - O (#PCDATA)                >
 
(It is assumed that SEGMENT is small enough that all gross text struc-
turing occurs above it in the hierarchy.)  The text then is:
 
     content ...
     <segment>
          <phonemic> (phonemic transcription of 'the')
          <orthographic>the
     <segment>
          <phonemic> (phonemic transcription of 'fat')
          <orthographic>fat
     <segment>
          <phonemic> (phonemic transcription of 'cat')
          <orthographic>cat
     content ...
 
   This is the method used in the <unit> / <level> tagging defined in
chapter 6 of TEI P1 and demonstrated on translation equivalents in
appendix A.6.3.
 
 
Solution 2:  Explicit Synchronization Using Common Identifiers
 
   Where sequence is not preserved, locations or segments must be given
identifiers, and cross-references from one text to another must indicate
the parallelism.  E.g.
 
     <seg id='s1'>Tor
     <seg id='s2'>nach Durchfahrt
     <seg id='s3'>bitte
     <seg id='s4'>zumachen!
 
and
 
     <seg id='s3'>Please
     <seg id='s4'>close
     <seg id='s1'>the gate
     <seg id='s2'>after passing through!
 
   This is the approach taken in synchronization through canonical ref-
erences (see appendix A.6.1 of TEI P1); a more elaborated version of the
same approach, allowing for one-to-many matching of segments, is found
in the <alignment> mechanism of TEI P1 chapter 6.
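
   Alignment by common identifiers amounts to a lookup.  The Python
sketch below assumes that the two texts have already been read into
lists of (identifier, content) pairs, using the German and English
segments shown above; align_by_id is a hypothetical name.

```python
GERMAN = [("s1", "Tor"), ("s2", "nach Durchfahrt"),
          ("s3", "bitte"), ("s4", "zumachen!")]
ENGLISH = [("s3", "Please"), ("s4", "close"),
           ("s1", "the gate"), ("s2", "after passing through!")]


def align_by_id(text_a, text_b):
    """Pair segments that carry the same identifier, in text_a's order."""
    lookup = dict(text_b)
    if len(lookup) != len(text_b):
        raise ValueError("duplicate identifier in second text")
    missing = [seg_id for seg_id, _ in text_a if seg_id not in lookup]
    if missing:
        raise ValueError(f"no counterpart for {missing}")
    return [(seg_id, content, lookup[seg_id]) for seg_id, content in text_a]
```

The result follows the order of whichever text is given first, so the
same two lists can be read off in German or in English order.  Only
one-to-one matches are handled here.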
 
 
Solution 3:  Explicit Synchronization with Many-to-one Linkages
 
   Where the segments of parallel texts not only appear in varying
orders but do not match one-to-one, the use of common identifiers to
align the texts does not suffice.  In this case, the <alignment> mecha-
nism of TEI P1 (described above as solution 4 for arbitrary segmenta-
tion) must be used.  An example of the application of alignment maps to
parallel texts may be found in appendix A.6.2 of TEI P1.
 
 
 
                                   6
 
                 INTERNAL AND EXTERNAL CROSS-REFERENCES
 
 
   Problem:  how does one refer to locations elsewhere in the same or in
a separate document?
 
 
Solution:  ID/IDREF
 
   Use the ID/IDREF mechanism.  For external references, this will
require application support, but an ID name combined with a (possibly
system-dependent) document identifier will point uniquely to a specific
ID in any document.
 
   This is the basic mechanism specified in section 5.7 of TEI P1.
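
   The application support needed for external references can be quite
small.  The Python sketch below is an assumption-laden illustration, not
the TEI mechanism itself:  it supposes that each document has been
reduced to a table from ID to content, and that a reference is a
(document, id) pair in which the document part is empty for internal
references.  LIBRARY, its document names, and resolve are hypothetical.

```python
# Hypothetical library: document identity -> {ID: element content}.
LIBRARY = {
    "doc-a": {"s1": "Tor", "s2": "nach Durchfahrt"},
    "doc-b": {"s1": "the gate"},
}


def resolve(reference, current_doc, library=LIBRARY):
    """Resolve a (document, id) pair; document None means current_doc."""
    doc_name, target_id = reference
    doc_name = doc_name or current_doc
    try:
        return library[doc_name][target_id]
    except KeyError:
        raise LookupError(
            f"no element {target_id!r} in document {doc_name!r}") from None
```

The same ID name may occur in several documents; it is the pairing with
a document identity that makes the reference unique.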
 
 
 
                                   7
 
                         VAGUENESS OF LOCATION
 
 
   Problem:  How do you mark a segment or text element with "fuzzy"
ends?
 
   Examples:
 
    1 the passage begins approximately here, but it is not certain
    exactly where
    2 the passage begins somewhere between one point (a) and another (b)
    (e.g. an echo of another text, which may begin and end gradually,
    providing a section which is certainly an echo, in the opinion of a
    tagger, surrounded by a penumbra which might or might not represent
    an echo)
    3 the passage referred to by a marginal note which does not have a
    corresponding symbol in the text
 
 
Solution 1:  PRECISION Attribute on Tag
 
   Use a PRECISION=VAGUE attribute on the tag whose location is uncer-
tain.
 
 
Solution 2:  Double Tagging
 
   Use double tagging, either with empty tags (as for arbitrary and
overlapping segments) or with nested elements, so that one tag occurs at
point (a) of example 2 and one at point (b).  The nested elements could
be separate elements, the inner representing the text segment where the
text feature ("echo" in the example) is certainly present, the outer
where it might be present.  Alternatively, if the element can self-nest,
the outer element could have the attribute STATUS=POSSIBLE and the inner
STATUS=CERTAIN.
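
   An application can read double tagging as a three-way classification
of every point in the text.  The Python sketch below assumes that the
outer and inner elements have been reduced to (start, end) character
offsets, with the end offset exclusive; the function name and the
"absent" status are hypothetical, while "certain" and "possible" follow
the STATUS values suggested above.

```python
def echo_status(position, possible, certain):
    """Classify a text position against nested 'echo' spans.

    possible is the outer span (the penumbra included) and certain the
    inner span; both are (start, end) offsets with end exclusive.
    """
    p_start, p_end = possible
    c_start, c_end = certain
    if not (p_start <= c_start <= c_end <= p_end):
        raise ValueError("the certain span must lie inside the possible span")
    if c_start <= position < c_end:
        return "certain"
    if p_start <= position < p_end:
        return "possible"
    return "absent"
```

With possible=(100, 180) and certain=(120, 150), offset 130 is certain,
offsets 110 and 160 are possible, and offset 90 lies outside the echo.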
 
-------------------------
 
(1) Note:  it appears clear that before publication this paper needs
    revision along the following lines:
 
    1.    examples of the logical problems need to be expanded upon
          somewhat, and may need commentary explaining where the problem
          lies (e.g. why the marking of arbitrary spans in a text repre-
          sents any problem at all in SGML)
    2.    the example solutions need names not numbers
    3.    more of the example solutions should be complete parseable
          SGML documents, either with radically simplified DTDs or using
          TEI DTDs and extensions as recommended in TEI P2, second ver-
          sion of the Guidelines
    4.    in several cases the TEI P2 version of the solution should be
          presented instead of the generic SGML solution now given
    5.    the topics should be reordered so the paper does not end on
          such a flat note
    6.    some of the solutions need better commentary explaining how
          the declarations and sample markup work to solve the problem
    7.    if Nino's walk to the window is to remain a prominent example,
          we need to provide a fuller transcription of the entire scene;
          it may be preferable to use an excerpt from a play and imagine
          markup describing the movement in a particular performance or
          production
 
    It is not clear whether a quick review of SGML syntax is needed at
    the beginning, or readers of the journal should be assumed to know
    SGML well enough to follow with some commentary.  Other comments and
    suggestions are welcome.
 
(2) Or more elaborately, the passage marked by straight lines in the
    left margin, the passage marked by wavy lines in the left margin,
    the passage underlined by hand with simple straight line, the pas-
    sage underlined by hand with simple straight line which was later
    deleted by hand, etc., as in the transcriptions of Wittgenstein's
    manuscripts in the Norwegian Wittgenstein project.  See Claus
    Huitfeldt and Viggo Rossvoer, The Norwegian Wittgenstein Project
    Report 1988 ([Bergen]:  NAVFs EDB-Senter for Humanistisk Forskning /
    Norwegian Computing Centre for the Humanities, 1989), esp. pp.
    201-236.
 
(3) This is the equivalent of the <milestone> tag defined in drafts 1.0
    and 1.1 of the guidelines for segmentation of text according to
    pagination of multiple editions.  The <milestone> tag differs, how-
    ever, in assuming a simple single-level segmentation of the text:
    the values specified in any <milestone> tag apply to all following
    text until the next <milestone> marked as belonging to the same edi-
    tion.  Hence no explicit end marker is needed and the ID / IDREF
    mechanism can be dispensed with.
 
(4) The TEI expects to recommend better support for such non-
    hierarchical start-end segment-tag pairs as an enhancement of SGML
    to be added during the next revision of ISO 8879.  See document TEI
    ML W32 for TEI proposals for the revision of SGML.
 
(5) N.B. the feature structure tags here use a name attribute rather
    than a separate tag as in TEI P1 version 1.1; this change will be
    present in TEI P1 version 2.
 