Received: by UICVM (Mailer R2.03B) id 5320; Thu, 11 Jan 90 05:19:30 CST
Date:         Thu, 11 Jan 90 12:15:00 +0100
Reply-To:     Text Encoding Initiative - Text Representation Committee list
              <TEI-REP@UICVM>,
              Stig Johansson <h_johansson%use.uio.uninett@NAC.NO>
Sender:       Text Encoding Initiative - Text Representation Committee list
              <TEI-REP@UICVM>
From:         Stig Johansson <h_johansson%use.uio.uninett@NAC.NO>
Subject:      Working paper TEI TR W10 (draft)
To:           Michael Sperberg-McQueen <U35395@UICVM.BITNET>
 
TEI TR W10
Problems with punctuation marks (draft)
 
1. Punctuation marks in ISO 646
 
The following punctuation marks (taken in a rather wide sense) are available
among the non-national characters of ISO 646:
 
space ! " ' ( ) , . / : ; < > ?
 
(According to Wilhelm Ott, we should exclude the exclamation mark. Did all
the other characters come through correctly?)
 
Characters which are not available can be represented by entity references
or dealt with by other forms of markup (see further DeRose, TEI TR W7).
 
2. Entity names
 
The following lists, which have been compiled from ISO 8879-1986(E) and
Bryan (1988), may serve as a basis for further work. (I see in the Luxembourg
minutes that 'entity names should be chosen from existing sets', so
disregard my comments at the beginning of Section 3.)
 
2.1. Separators
 
&period;     period (full stop)
&comma;      comma
&quest;      question mark
&iquest;     inverted question mark
&excl;       exclamation mark
&iexcl;      inverted exclamation mark
&colon;      colon
&semi;       semicolon
 
There are special entity names for full stops used as decimal points and
as indications of ellipsis:
 
&middot;     middle (decimal) dot
&hellip;     horizontal ellipsis (three dots)
&mldr;       em leader (three dots)
&nldr;       en leader (double baseline dot)
 
2.2. Includers
 
&lpar;       left parenthesis (opening)
&rpar;       right parenthesis (closing)
&lsqb;       left square bracket (opening)
&rsqb;       right square bracket (closing)
&lang;       left angle bracket (opening)
&rang;       right angle bracket (closing)
&lcub;       left curly brace (opening)
&rcub;       right curly brace (closing)
 
2.3. Quotation marks
 
&quot;       quotation mark (double, straight)
&lsquo;      left (opening) single quotation mark
&rsquo;      right (closing) single quotation mark
&ldquo;      left (opening) double quotation mark
&rdquo;      right (closing) double quotation mark
&lsquor;     rising single quote, left (low)
&rsquor;     rising single quote, right (high)
&ldquor;     rising double quote, left (low)
&rdquor;     rising double quote, right (high)
&laquo;      left angle quotation mark (French opening guillemet)
&raquo;      right angle quotation mark (French closing guillemet)
&lsaquo;     left single angle quotation mark (embedded guillemet)
&rsaquo;     right single angle quotation mark (embedded guillemet)
 
2.4. Spaces
 
&emsp;       em space
&ensp;       en space (1/2 em)
&numsp;      digit space (width of a number)
&emsp13;     1/3 em space
&puncsp;     punctuation space (width of a comma)
&nbsp;       no break (required) space
 
2.5. Dashes and hyphens
 
&mdash;      em dash
&ndash;      en dash
&dash;       neutral dash/minus
&hyphen;     hyphen
&shy;        soft hyphen
 
2.6. Other
 
&apos;       apostrophe
&sol;        solidus (shilling stroke)
&bsol;       backslash (reverse solidus)
&verbar;     vertical bar
&brvbar;     broken vertical bar
&horbar;     horizontal bar
&lowbar;     lowline (baseline rule)
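
As an illustration of how the lists above might be applied, here is a
minimal sketch that replaces characters outside the basic set by entity
references. The table covers only a handful of the names listed, and the
pairing of each name with a Unicode code point is an assumption made for
the example, not part of this paper. The function name is hypothetical.

```python
# Hypothetical table: character -> entity name from the lists in Section 2.
# The Unicode code points are an assumption added for illustration only.
ENTITY_FOR_CHAR = {
    "\u00bf": "iquest",  # inverted question mark
    "\u00a1": "iexcl",   # inverted exclamation mark
    "\u2026": "hellip",  # horizontal ellipsis
    "\u2018": "lsquo",   # left single quotation mark
    "\u2019": "rsquo",   # right single quotation mark
    "\u201c": "ldquo",   # left double quotation mark
    "\u201d": "rdquo",   # right double quotation mark
    "\u00ab": "laquo",   # left angle quotation mark
    "\u00bb": "raquo",   # right angle quotation mark
    "\u2014": "mdash",   # em dash
    "\u2013": "ndash",   # en dash
    "\u00a0": "nbsp",    # no-break space
}


def to_entity_references(text):
    """Replace every character found in the table by &name;.

    Characters that are not in the table are copied unchanged.
    """
    out = []
    for ch in text:
        name = ENTITY_FOR_CHAR.get(ch)
        out.append("&" + name + ";" if name else ch)
    return "".join(out)
```

A text containing guillemets and an em dash would then carry &laquo;,
&raquo; and &mdash; in place of the original characters.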
 
3. Brief comment on entity names
 
Entity names should be transparent and consistently built up. If there are
groups of related features, this should be reflected in the names.
 
Some of the entity names given above could be made more explicit, e.g.
&shy; = soft hyphen. It would seem preferable to indicate the main type
first and then the particular type (deviating from ISO 8879 and Bryan,
where bodies of entity names may have both prefixes and suffixes), e.g.
&quot.ls; = left single quote, &hyph.s; = soft hyphen, &quest.i; = inverted
question mark. To achieve greater transparency and consistency, it may be
necessary to allow longer names.
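
The renaming suggested here could be tried out mechanically. The sketch
below rewrites entity references according to a small table. Only the three
renamings mentioned in the paragraph above are included; the function name
and the regular expression for entity references are assumptions made for
the example.

```python
import re

# Only these three renamings appear in the text; further entries would
# have to be agreed on before being added.
PROPOSED_NAME = {
    "lsquo": "quot.ls",   # left single quote
    "shy": "hyph.s",      # soft hyphen
    "iquest": "quest.i",  # inverted question mark
}

# Assumed shape of an entity reference: & name ;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def rename_entities(text):
    """Rewrite known entity names to the 'main type first' scheme.

    Names without an entry in PROPOSED_NAME are left as they are.
    """
    def swap(match):
        name = match.group(1)
        return "&" + PROPOSED_NAME.get(name, name) + ";"

    return ENTITY_REF.sub(swap, text)
```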
 
Entity names are used where the character set is insufficient. Some of the
commonest problems will be briefly discussed below.
 
4.1. Period (full stop)
 
The period is used to mark a) sentence endings, b) abbreviations, c) decimal
points, d) enumerators in lists (1., a., etc.), e) parts of list items (e.g.
in bibliographies). We have already seen that decimal points can be identified
by entity names, if necessary. Tagging of abbreviations takes care of the
second use listed above. To identify sentence endings I would like to propose
the following mechanism:
 
We recognise a unit roughly corresponding to an orthographic sentence. Let
us call it an S-unit. Each S-unit is marked by an empty element, optionally
followed by a reference number.
 
S-units are orthographic sentences (ending with a period, an exclamation
mark, or a question mark) or other structurally independent forms (e.g.
headings). From the notion of S-units we may want to exclude enumerators
in lists, names of speakers in dramas, page references, and similar discourse
organisers. The text documentation section of a text should specify whether
S-units are used or whether there is some other form of reference system.
In either case, further details should be provided on the reference system.
 
Marking of S-units in this manner at the same time disambiguates the period
and provides a consistent reference system (which we need anyway).
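
A rough sketch of how S-unit marking might be automated is given below. It
assumes that the empty element is written <s n="K">, which is a hypothetical
notation (no name or attribute is fixed above), and it treats every period,
question mark, or exclamation mark followed by white space as a boundary.
Abbreviations would therefore produce false boundaries unless they have been
tagged or the editor removes the unwanted marks afterwards.

```python
import re

# A boundary is white space that follows a period, question mark, or
# exclamation mark. This is deliberately naive: "e.g. " also matches.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def mark_s_units(text):
    """Prefix each S-unit with an empty element carrying a reference number.

    The element name 's' and the attribute 'n' are assumptions made for
    this example.
    """
    units = [u for u in BOUNDARY.split(text.strip()) if u]
    return "\n".join(
        '<s n="%d">%s' % (number, unit)
        for number, unit in enumerate(units, 1)
    )
```

Decimal points written without a following space (as in 3.5) are not
treated as boundaries by this pattern.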
 
4.2. Question and exclamation marks
 
Question and exclamation marks almost always mark the end of sentences. But
they may be used occasionally for other purposes, e.g. as a mid-sentence
comment by the author (! to express surprise or some other strong feeling).
Where ! or ? closes a quotation embedded in a sentence, such cases can be
distinguished from mid-sentence commenting ! and ? by the following
quotation marks (or markup indicating quotations).
 
4.3. Quotation marks
 
The best way of dealing with quotation marks is probably to replace them by
descriptive markup indicating begin-quote and end-quote, especially as
quotations are not always marked by quotation marks (notably long quotations)
and we need this form of markup anyway. Quotes within quotes can be handled by
SGML (as far as I understand). The main problem arises with cases where
quotation marks are used for other purposes, e.g. to give the title of an
article, to gloss the meaning of a word, to indicate that a word is a
technical term, or to act as a distancing device (as in: she hated 'good'
books). We need special tags for these uses, perhaps something like: title,
gloss, term, so-called.
 
Quotation marks indicating direct speech should have special tags. Note,
incidentally, that direct speech may be unmarked in printed texts or may be
indicated by some other device, such as by a dash or by indentation. (In
dramas there is no need to mark direct speech, as long as we tag stage
directions and speaker attributions.)
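
The replacement of quotation marks by begin-quote and end-quote markup can
be sketched as follows. The tag names <q> and </q> are hypothetical, and the
sketch handles only straight double quotation marks used in pairs; it does
not attempt to recognise titles, glosses, terms, or distancing uses, which
would need a human decision.

```python
def tag_quotations(text, open_tag="<q>", close_tag="</q>"):
    """Replace straight double quotes by alternating start and end tags.

    The tag names are assumptions made for this example. An odd number
    of quotation marks is reported as an error rather than guessed at.
    """
    out = []
    inside = False
    for ch in text:
        if ch == '"':
            out.append(close_tag if inside else open_tag)
            inside = not inside
        else:
            out.append(ch)
    if inside:
        raise ValueError("unbalanced quotation mark")
    return "".join(out)
```

Passing other tag names as arguments gives the variants needed for the
special uses, once an editor has identified them.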
 
4.4. Hyphen
 
In reproducing printed texts it is necessary to reach a decision on the
representation of soft (line-end) hyphens. Where the lineation of the
machine-readable text is different from the original (which is probably
most often the case), the editor can either eliminate soft hyphens or replace
them in some manner (by an entity reference or some other convention, e.g.
hyphen followed by space). It does not matter which solution is chosen, as
long as it is reported (in the section on text documentation).
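
The two options can be sketched as follows: line-end hyphens are either
eliminated or replaced by the entity reference &shy;. The function name and
its parameter are hypothetical. Note that the sketch cannot tell a soft
hyphen from a hard hyphen that happens to fall at the end of a line, so its
output would still need checking.

```python
def join_line_end_hyphens(lines, keep_as_entity=False):
    """Join words that were broken by a hyphen at the end of a line.

    With keep_as_entity=False the hyphen is dropped; with True it is
    replaced by the entity reference &shy;. Every line-end hyphen is
    treated as soft, which is an assumption of this sketch.
    """
    out = []
    carry = ""  # first half of a broken word, moved to the next line
    for line in lines:
        line = carry + line.strip()
        carry = ""
        if line.endswith("-"):
            head, _, fragment = line[:-1].rpartition(" ")
            carry = fragment + ("&shy;" if keep_as_entity else "")
            line = head
        if line:
            out.append(line)
    if carry:
        # The text ended on a broken word; emit the fragment by itself.
        out.append(carry)
    return out
```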
 
Writers of original machine-readable texts should be advised not to use
soft hyphens. (Or some other convention for word continuation should be
devised.)
 
4.5. Apostrophe
 
Apostrophes must be distinguished from single quote marks. This is perhaps
best done by using descriptive tagging for quotations. However, apostrophes
have a variety of uses. In English they mark contractions, genitive forms,
and plural forms (occasionally). Disambiguation of these uses belongs to the
level of (linguistic) analysis and interpretation.
 
5. Final remarks
 
Using markup it is possible not only to express distinctions made in print
but also to transcend limitations inherent in current conventions for
printed texts. The greater distinctiveness carries a price, however: the
text becomes bulkier and harder to use and produce in a direct way. Software
must be provided which assists the writer in putting in tags, e.g. inserting
S-unit tags after periods, question marks, exclamation marks, and end tags
for headings (leaving the writer the choice to delete them where they are not
applicable). To take another example, quotation marks could be replaced by
start and end tags for quotations (leaving the writer the choice to change
the tags to 'title', 'gloss', 'term', 'so-called', etc.). Similarly, there
is a need for software which enables users to rearrange and display the text
according to their needs. Finally, we need software which helps in checking
the accuracy and consistency of marked-up text.
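
As one small example of such checking software, the sketch below reports
entity references whose names are not in the lists of Section 2, and
references that lack the closing semicolon. Requiring the semicolon in every
case is a house rule assumed for this example and may be stricter than SGML
itself; the function name and the form of the report are likewise
hypothetical.

```python
import re

# Entity names taken from the lists in Section 2.
KNOWN_ENTITIES = set("""
    period comma quest iquest excl iexcl colon semi
    middot hellip mldr nldr
    lpar rpar lsqb rsqb lang rang lcub rcub
    quot lsquo rsquo ldquo rdquo lsquor rsquor ldquor rdquor
    laquo raquo lsaquo rsaquo
    emsp ensp numsp emsp13 puncsp nbsp
    mdash ndash dash hyphen shy
    apos sol bsol verbar brvbar horbar lowbar
""".split())

# An ampersand, a name, and an optional closing semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*)(;?)")


def check_entity_references(text):
    """Return (offset, message) pairs for suspicious entity references."""
    problems = []
    for match in ENTITY_REF.finditer(text):
        name, semicolon = match.group(1), match.group(2)
        if not semicolon:
            problems.append((match.start(), "missing semicolon after &" + name))
        elif name not in KNOWN_ENTITIES:
            problems.append((match.start(), "unknown entity &" + name + ";"))
    return problems
```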
 
The age of print has led to great gains in clarity of text representation.
We should build on conventions for printed texts, without being restricted
by them. The ultimate aim is to achieve even greater clarity and efficiency
of use and arrive at a text representation appropriate to the electronic age.
 
20 December 1989
Stig Johansson
University of Oslo