AI2W1
 
<!DOCTYPE ldoc [ ] >
 
<!-- Stig's working paper  *minimally* tagged
      Paragraphs, divs and lists are marked
      Displayed lists of attributes etc are tagged as <fig>
      All tags in the text are tagged <tag>
      Section numbering has been turned into 'ID' value but cross
      references are hard-wired, as Stig left them
                       LB Nov 3, on a train near Bruxelles
      Basic structure has been validated against LDOC3.dtd (same as
      LDOC but with LIST and ITEM rather than OL and LI )
                  LB Nov 4th, back at my desk
-->
<ldoc docnum=AI2w1>
<front><title>
TEI AI2 W1: Working paper on spoken texts
<date>October 1991
<!-- shouldnt all authors be removed anyway ? -->
<author>Stig Johansson, University of Oslo
<author>Lou Burnard, Oxford University Computing Services
<author>Jane Edwards, University of California at Berkeley
<author> And Rosta, University College London
<abstract>
This paper discusses problems in representing speech in
machine-readable form. After some brief introductory
considerations (Sections 1-5), a survey is given of features
marked in a selection of existing encoding schemes (Section
6), followed (Section 8) by proposals for encoding compatible
with the draft guidelines of the Text Encoding Initiative
(TEI). Appendix 1 contains example texts representing
different encoding schemes. Appendix 2 gives a brief summary
of the main features of some encoding schemes. Appendix 3
presents a DTD fragment and examples of texts encoded
according to the conventions proposed. The discussion focuses
on English but should be applicable more generally to the
representation of speech.
</front><body>
<div1><head>The problem
<p>
In the encoding of spoken texts we are faced by a double
problem:
<list>
<item>there are no generally accepted conventions for the
   encoding of spoken texts in machine-readable form;
<item>there is a great deal of variation in the ways in which
   speech has been represented using the written medium
   (transcription).
</list>
We therefore have less to go by than in the encoding of
written texts, where there are generally accepted conventions
which we can build on in producing machine-readable versions.
<p>   In addition to this basic problem, there are other special
difficulties which apply to the encoding of speech. Speech
varies according to a large number of dimensions, many of
which have no counterpart in writing (tempo, loudness, pitch,
etc.). The audibility of speech recorded in natural
communication situations is often less than perfect, affecting
the accuracy of the transcription. The production and
comprehension of speech are intimately bound up with the
speech situation. A transcription must therefore also be able
to capture contextual features. Moreover, there is an ethical
problem in recording and making public what was produced in a
private setting and intended for a limited audience (see
further Section 2.2).
 
<div1><head>From speech to transcription and electronic record
<div2 id=2.1><head>
The requirements of authenticity and computational
tractability
<p>
There is no such thing as a simple conversion of speech to a
transcription. All transcription schemes make a selection of
the features to be encoded, taking into account the intended
uses of the material. The goal of an electronic representation
is to provide a text which can be manipulated by computer to
study the particular features which the researcher wants to
focus on. At the same time, the text must reflect the original
as accurately as possible. We can sum this up by saying that
an electronic representation must strike a balance between the
following two, partially conflicting, requirements:
authenticity and computational tractability.
<p>   A workable transcription must also be convenient to use
(write and read) and reasonably easy to learn. Here we focus
on a representation which makes possible a systematic marking
of discourse phenomena while at the same time allowing
researchers to display the data in any form they like. We hope
that this representation will also promote "insightful
perception and classification of discourse phenomena" (Du
Bois, forthcoming).
 
<div2 id=2.2><head>The ethical requirement
<p>
Speech is typically addressed to a limited audience in a
private setting, while writing is characteristically public
(though there is of course also public speech and private writing). The
very act of recording speech represents an intrusion. If this
intrusion is very obvious (e.g. with video recording), the
communication may be disturbed. On the other hand, if the
recording is too unobtrusive, it may be unethical (if the
speakers are unaware of being recorded).
<p>   It is not only the recording event itself which is
sensitive, but also the act of making the recording available
outside the context where it originated. Transcribers have
usually taken care to mask the identity of the speakers, e.g.
by replacing names by special codes. In other words, the
transcriber must also strike a balance between the requirement
of authenticity and an ethical requirement.
 
<div1 id=3><head>Who needs spoken machine-readable texts and for what
   purpose?
<p>
It follows from what has already been said that a
transcription - and its electronic counterpart - will be
different depending upon the questions it has been set up to
answer. The following are some types of users who might be
interested in spoken machine-readable texts:
<list>
<item>students of ethnology and oral history, who are mainly
   interested in content-based research;
<item>lexicographers, who are mainly interested in studying
   word usage, collocations, and the like;
<item>students of linguistics, who may be interested in research
   on phonology, morphology, lexis, syntax, and discourse;
<item>dialectologists and sociolinguists, whose primary interest
   is in patterns of variation on different linguistic levels;
<item>students of child language or second language acquisition,
   who are concerned with language development on different
   levels;
<item>social scientists and psychologists interested in patterns
   of spoken discourse;
<item>speech researchers, who need an empirical basis for
   setting up and testing systems of automatic linguistic
   analysis;
<item>engineers concerned with the transmission of speech.
</list>
The first two categories are probably best served by a
transcription which follows ordinary written conventions
(perhaps enhanced by coding for key words or lemmas). The
other groups generally need something which goes beyond
"speech as writing".
 
<div1 id=4><head>The delicacy of transcriptions
<p>
The texts in Appendix 1 are examples of different types of
transcription, some existing in machine-readable form and
others only occurring in the written medium. The range of
features represented varies greatly as does the choice of
symbols.
<p>   All the texts have some way of indicating speaker identity
and dividing the text up into units (words and higher-level
units). Words are most often reproduced orthographically,
though sometimes with modified or distorted spelling to
suggest pronunciation. Less often they are transcribed
phonetically or phonemically (as in G and W). The higher-level
units are orthographic sentences (as in A and B) or some sort
of prosodic units (as in H and R).
<p>   The degree of detail given in the transcription may be
extended in different directions (it is worth noting that in
some cases there are different versions of the same text,
differing according to the features represented or, more
generally, in the delicacy of the transcription; see H, J, K,
and Q). Some texts are edited in the direction of a written
standard (e.g. A and B), others carefully code speaker overlap
and preserve pauses, hesitations, repetitions, and other
disfluencies (e.g. H, I, and N). Some texts contain no marking
of phonological features (e.g. A and B), others are carefully
coded for prosodic features like stress and intonation (e.g.
H, I, and Q). The degree of detail given on the speech
situation varies, as does the amount of information provided
on non-verbal sounds, gestures, etc. Note the detailed coding
of non-verbal communication and actions in the second example
in V.
<p>   Two of the example texts in Appendix 1 consistently code
paralinguistic features (H1, W). In some of the texts there is
analytic coding which goes beyond a mere transcription of
audible features, e.g. the part-of-speech coding in J3 and
K3-5 and various types of linguistic analysis of selected words
in F and on the "dependent tier", indicated by lines opening
with %, in R. In G there is an accompanying standard
orthographic rendering, in W and X a running interpretive
commentary, in Y and Z a detailed analysis in discourse terms.
Here we are not concerned with analytic and interpretive
coding, but rather with the establishment of a basic text
which can later be enriched by such coding.
 
<div1 id=5><head>Types of spoken texts
<p>
Before we go on to a detailed discussion of features encoded
in our example texts, we shall briefly list some types of
spoken texts. These are some important categories:
<list>
<item>face-to-face conversation
<item>telephone conversation
<item>interview
<item>debate
<item>commentary
<item>demonstration
<item>story-telling
<item>public speech
</list>
In addition, we may distinguish between pure speech and
various mixed forms:
<list>
<item>scripted speech (as in a broadcast, performances
   of a drama, or reading aloud)
<item>texts spoken to be written (as in dictation)
</list>
Here we are mainly concerned with the most ubiquitous form of
speech (and, indeed, of language), i.e. face-to-face
interaction. If we can represent this prototypical form of
speech, we may assume that the mechanisms suggested can be
extended to deal with other forms of spoken material.
 
<div1 id=6><head>Survey of features marked in existing schemes
<p>
The survey below makes frequent reference to the example texts
in Appendix 1. No attempt is made to give a full description
of each individual scheme (for a comparison of some schemes,
see Appendix 2). The aim is rather to identify the sorts of
features marked in existing schemes. Reference will be made
both to electronic transcriptions and to transcriptions which
only exist in the written medium.
 
<div2 id=6.1><head>Text documentation
<p>Speech is highly context-dependent. It is important to provide
information on the speakers and the setting. This has been
done in different ways. Note the opening paragraphs in A and
B, the first lines in E, the header file in Q, and the header
lines in R. In some cases information of this kind is kept
separate from the machine-readable text file (see Appendix 2).
 
<div2 id=6.2><head>Basic text units
<p>
While written texts are broken up into units of different
sizes (such as chapters, sections, paragraphs, and
orthographic sentences, or S-units) which we can build on in
creating a machine-readable version, there are no such obvious
divisions in speech. In dialogues it is natural to mark turns.
As these may vary greatly in length, we also need some other
sort of unit (which is needed in any case in monologues).
<p>   Most conventional transcriptions of speech (such as A and
B) divide the text up into "sentences" in much the same way as
in ordinary written texts, without, however, specifying the
basis for the division. The closest we can get to such a
specification is probably the "macrosyntagm" used in a corpus
of spoken Swedish (cf. Loman 1972: 58ff.). A macrosyntagm is a
grammatically cohesive unit which is not part of any larger
grammatical construction.
<p>   Linguists often prefer a division based on prosody, e.g.
the tone units in the London-Lund Corpus (I) and in the scheme
set up for the Corpus of Spoken American English (Q). Others
are sceptical of this sort of division and prefer
pause-defined units (Brown et al. 1980: 46ff.).
<p>   In the new International Corpus of English project texts
are divided into "text-units" to be used for reference
purposes, and there is "no necessary connection between text
unit division and any features inherent in the original text"
(Rosta 1990). Text-units are used both in written and spoken
texts. In written texts they often correspond to an
orthographic sentence, in spoken texts to a turn. The primary
criterion is length (around three lines or 240 characters or
twenty words, not including tags).
<p>   On basic text units in different encoding schemes, see
further Appendix 2.
 
<div2 id=6.3><head>Reference system
<p>
If a text is to be used for scholarly purposes, it must be
provided with a reference system which makes it possible to
identify and refer to particular points in the text. Reference
systems have been organized in different ways (see Appendix
2). In the London-Lund Corpus (I) a text is given an
identification code and is divided into tone units, each with
a number. Some spoken machine-readable texts (e.g. K1) do not
seem to contain such a fine-grained reference system.
 
<div2 id=6.4><head>Speaker attribution
<p>
In a spoken text with two or more participants we must have a
mechanism for coding speaker identity. This is generally done
by inserting a prefix before each speaker turn; see A, H, I1,
J, L, N, O, etc. in Appendix 1 and the survey in Appendix 2
(in the London-Lund Corpus there is actually such a prefix
before each tone unit; see I2). The prefix is generally a
code; more information on the speakers may be given in the
accompanying documentation.
<p>   Problems arise where a speaker's turn is interrupted (see
how continuation has been coded in Appendix 2; note the
examples in text I in Appendix 1) and, particularly, where
there is simultaneous speech.
 
<div2 id=6.5><head>Speaker overlap
<p>
In informal speech there will normally be a good deal of
speaker overlap. This has been coded in a variety of ways, as
shown in Appendix 2. In the London-Lund Corpus simultaneous
speech is marked by the insertion of pairs of asterisks; see
text I in Appendix 1. Du Bois et al. (1990) instead use pairs
of brackets; see Q. Both solutions involve separation of each
speaker's contribution and linearization in the transcription.
In some cases speaker overlap is shown by vertical alignment
(alone or combined with some other kind of notation); see O,
P, Q, T, and V.
 
<div2 id=6.6><head>Word form
<p>
Words are rarely transcribed phonetically/phonemically in
extended spoken texts; the only examples in Appendix 1 are G
and W, neither of which exists in machine-readable form.
There are several reasons why there are very few extensive
texts with transcription of segmental features. Such
transcription is very laborious and time-consuming. A detailed
transcription of this kind makes the text difficult to use for
other purposes than close phonological study. And the
conversion to electronic form has been difficult because of
restrictions on the character set.
<p>   Orthographic transcriptions of speech reproduce words with
their conventional spelling and separated by spaces (apart
from conventionalized contractions). This is true not only of
texts like A and B, but also of the much more sophisticated
transcriptions in H, I, etc. In some cases, e.g. in E, P, and
R, the transcription introduces spellings intended to suggest
the way the words were pronounced (takin', an', y(a) know, gon
(t)a be, etc.). Some basically orthographic transcriptions
introduce phonetic symbols in special cases; see the remarks
on quasi-lexical vocalizations in the next section.
<p>   Simple word counts based on orthographic, phonemic, and
phonetic transcriptions of the same text will give quite
different results. Homographs are identical in an orthographic
transcription, although they may be pronounced differently:
row, that (as demonstrative pronoun vs. conjunction and
relative pronoun), etc. Conversely, homophones may be written
quite differently: two/too, so/sew/sow, etc. These are
commonplace examples, but they show that even answers to
seemingly straightforward questions (How many words are there
in this text? How many different words are there?) depend upon
choices made by the transcriber.
<p>   On the treatment of word form in different encoding
schemes, see further the comparison in Appendix 2.
 
<div2 id=6.7><head>Speech management
<p>
A "speech as writing" transcription will normally edit away a
lot of features which are essential for successful spoken
communication. A comparison of texts like A and B versus H, I,
L, and Q is instructive. The former contain little trace of
speech-management phenomena such as pauses, hesitation
markers, interrupted or repeated words, corrections and
reformulations. Interactional devices asking for or providing
feedback are also less prominent. For content-based research
such features are irrelevant or even disruptive; for discourse
analysis they are highly significant. The survey in Appendix 2 illustrates
how such features have been handled. (On speech-management
phenomena, see further Allwood et al. 1990.)
<p>   Different strategies are used to render quasi-lexical
vocalizations such as truncated words, hesitation markers, and
interactional signals. The London-Lund Corpus introduces
phonetic transcription in such cases (see examples in text I);
other transcriptions use ordinary spelling (e.g. b- and uh in
text Q, hmm and mmm in text R). Some schemes have introduced
control lists for such forms; see Appendix 2.
 
<div2 id=6.8><head>Prosodic features
<p>
The most obvious characteristic of speech, i.e. the fact that
it is conveyed by the medium of sound, is often lost
completely in the transcription. The only traces in texts like
A and B are the conventional contractions of function words
(and the occasional use of italics or other means to indicate
emphasis; see one example in B). Even texts produced for
language studies generally reflect only a small part of the
phonological features. The best-known computerized corpus of
spoken English, the London-Lund Corpus (I), focuses on
prosody: pauses, stress, and intonation. The same is true of
the Lancaster/IBM Spoken English Corpus (K1) and the systems
set up for the Corpus of Spoken American English (Q) and the
Survey of English Usage (H) - which, incidentally, is the
system which the transcription in the London-Lund Corpus is
based on. There is a great deal of variation in the way
pauses, stress, and intonation are marked, as shown by our
example texts in Appendix 1 and the survey in Appendix 2. Note
that some schemes have special codes for latching and
lengthening.
 
<div2 id=6.9><head>Paralinguistic features
<p>
Paralinguistic features (tempo, loudness, voice quality,
etc.), which are less systematic than phonological features,
have usually not been marked in transcriptions. Two of our
example texts, however, include elaborate marking of such
features (H1, W); neither of them has an exact counterpart in
machine-readable form. It should be noted that the London-Lund
Corpus is not coded for paralinguistic features, although the
texts were originally transcribed according to the Survey of
English Usage conventions (as in H). The simplification was
made for a number of reasons, "partly practical and technical,
partly linguistic" (Svartvik & Quirk 1980: 14). The omitted
features were considered less central for prosodic and
grammatical studies, which were the main concern of the corpus
compilers.
<p>   For more information on the encoding of paralinguistic
features, see Appendix 2.
 
<div2 id=6.10><head>Non-verbal sounds
<p>
Non-verbal sounds such as laughter and coughing are generally
noted, usually as a comment within brackets. See examples in
texts B, E, H, L, P, and Q. Q has a special code for laughter
(@), which may be repeated to indicate the number of pulses of
laughter. The London-Lund Corpus may add a code to indicate
length of the non-verbal sound, e.g. (. laughs), (- coughs),
(-- giggle). Gumperz & Berenz (forthcoming) distinguish between
such features occurring as interruptions and as overlays. See
further Appendix 2.
 
<div2 id=6.11><head>Kinesic features
<p>
On the edges of paralinguistic features and non-verbal sounds
we find kinesic features, which involve the systematic use of
facial expression and gesture to communicate meaning (gaze,
facial expressions like smiling and frowning, gestures like
nodding and pointing, posture, distance, and tactile contact).
They raise severe problems of transcription and
interpretation. Besides, they require access to video
recordings. For some examples of how such features have been
marked, see especially texts P and V. See also the survey in
Appendix 2.
 
<div2 id=6.12><head>Situational features
<p>
The high degree of context-dependence of speech makes it
essential to record a variety of non-linguistic features.
These include movements of the participants or other features
in the situation which are essential for an understanding of
the text. We have already drawn attention to the need for
documentation on the speech situation (Section 6.1). It is
also important to record changes in the course of the speech
event, e.g. new speakers coming in, long silences, background
noise disturbing the communication, and non-linguistic
activities affecting what is said. Note some comments of this
kind in %-lines in R and the marking of actions in U. Note
also the situation descriptions in texts A and B. See further
Appendix 2.
 
<div2 id=6.13><head>Editorial comment
<p>
In our example texts editorial comment is indicated in a
variety of ways, e.g. by additions within parentheses (such as
"Expletive deleted" in A) or comments in %-lines in R. The
London-Lund Corpus indicates uncertain transcription by double
parentheses (see examples in I); there are also comments
within parentheses like "gap in recording" or "ten seconds
untranscribable". Other schemes indicate uncertain hearing in
other ways. Note the elaborate coding of normalization in the
International Corpus of English (J2). See further Appendix 2.
 
<div2 id=6.14><head>Analytic coding
<p>
Though we are not concerned with analytic coding which goes
beyond the establishment of a basic text, we shall draw
attention to some conventions used in our example texts. Text
F includes analytic coding after selected words. J3 has
word-class markers before each word. K3-5 have word-class tags
accompanying each word (in different ways). R includes speech
act analysis in %-lines (%spa: ...). W and X have a running
interpretive commentary. Y and Z are coded for discourse
analysis. There is thus a great deal of variation in the
features coded (not to speak of the choice of codes).
 
<div2 id=6.15><head>Parallel representation
<p>
In order to convey a broad range of features, some schemes
provide parallel representations of the texts. Note the
phonetic, phonemic, and orthographic versions in text G and,
in particular, the multi-layered coding in V and W. An
adequate scheme for the encoding of spoken texts will no doubt
need to provide mechanisms for parallel representation.
 
<div1 id=7><head>Spoken texts and the Text Encoding Initiative
<p>
To what extent can spoken texts be accommodated within the
Text Encoding Initiative? They were not dealt with in the
current draft (Sperberg-McQueen & Burnard 1990), but many of
the mechanisms which have been suggested for written texts can
no doubt be adapted for spoken material. These include the
ways of handling text documentation, reference systems, and
editorial comment. Phonetic notation can be handled by the
general mechanisms for character representation (though these
will not be dealt with here, as they are the province of
another working group).
<p>   In considering the encoding of face-to-face conversation,
which is the main focus of interest here, we shall sometimes
make a comparison with dramas, which are dealt with (briefly)
in the current TEI draft. This literary genre should of course
not be confused with ordinary face-to-face conversation, but
we can look upon it as a stylized way of representing
conversation, and some of the mechanisms for handling dramatic
texts may be adapted for encoding genuine conversation.
 
<div1 id=8><head>Proposals
<p>
The overall structure we envisage for a spoken text is given
in Figure 1. In other words, a TEI-conformant text consists of
a header and the text body. The latter consists of a timeline
and one or more divisions (div); divisions are needed, for
example, for debates, broadcasts, and other spoken discourse
types where structural subdivisions can easily be identified.
<p>   The timeline is used to coordinate simultaneous phenomena.
It has an attribute "units" and consists of a sequence of
points, with the following attributes: id, elapsed, since.
<xmp><![ CDATA[
Example:
 
<point id=p43 elapsed=13 since=p42>
 
]]></xmp>
This identifies a point "p43" which occurs 13 units (as
specified in the "units" attribute of the <tag>timeline</tag>) after
"p42".
<p>   Divisions consist of elements of the following kinds:
<list>
<item>utterances (u), which contain words (for more detail, see
8.2);
<item>vocals, which consist of other sorts of vocalizations (e.g.
voiced pauses, back-channels, and non-lexical or quasi-lexical
vocalizations of other kinds);
<item>pauses, which are marked by absence of vocalization;
<item>kinesic features, which include gestures and other non-vocal
communicative phenomena;
<item>events, which are non-vocal and non-communicative;
<item>writing, which is needed where the spoken text includes
embedded written elements (as in a lecture or television
commercial).
</list>
<p>In addition, there may be embedded divisions. All of the
elements mentioned have "start" and "end" attributes pointing
to the timeline (there may also be attributes for "duration"
and "units" expressing absolute time). If a "start" attribute
is omitted, the element is assumed to follow the element that
precedes it in the transcription. If an "end" attribute is
omitted, the element is assumed to precede the element that
follows it in the transcription. If "start" and "end"
attributes are used, the order of elements within the text
body is unrestricted.
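<p>For illustration, a minimal sketch of a division containing
elements anchored to the timeline might look as follows (the
point identifiers "p1"-"p3", the speaker codes "A" and "B", and
the wording are invented for the purpose of the example):
<xmp><![ CDATA[
<div>
<u who=A start=p1 end=p2>shall we start</u>
<pause start=p2 end=p3>
<u who=B start=p3>yes let's</u>
</div>
]]></xmp>
As the "end" attribute of the last utterance is omitted, it is
assumed to precede whatever element follows it in the
transcription.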
<p>   Events, kinesic features, vocalizations, and utterances
form a hierarchy (though not in the SGML sense), as shown in
Figure 2. Utterances are a type of vocalization. Vocalizations
are a type of gesture. Gestures in their turn are a type of
event. We can show the relationship using the features
"eventive", "communicative", "anthropophonic" (for sounds
produced by the human vocal apparatus), and "lexical":
<fig>
          eventive  communicative  anthropophonic  lexical
 
event        +            -              -            -
kinesic      +            +              -            -
vocal        +            +              +            -
utterance    +            +              +            +
</fig>
Needless to say, the differences are not always clear-cut.
Among events we include actions like slamming the door, which
may certainly be communicative. Vocals include coughing and
sneezing, which are or may be involuntary noises. And there is
a cline between utterances and vocals, as implied by our use
above of the term "quasi-lexical vocalization". Individual
scholars may differ in the way the borderlines are drawn, but
we claim that the four element types are all necessary for an
adequate representation of speech.
<p>   There is another sort of hierarchy (in the SGML sense)
defining the relationship between linguistic elements of
different ranks, as shown in Figure 3. Texts consist of
divisions, which are made up of utterances, which may in their
turn be broken down into segments (s). See further Section
8.2.
<p>   Before we go on to a more detailed discussion of the
individual types of features (following the same order as in
Section 6 above), we should briefly draw attention to a couple
of other points illustrated in Figure 1. Note that pauses and
vocals may occur between utterances as well as within
utterances and segments. "Shift" is used to handle
paralinguistic features (see Section 8.9) and "pointer"
particularly to handle speech overlap (see Section 8.5).
 
<div2 id=8.1><head>Text documentation
<p>
The overall TEI framework proposed for written texts can be
adapted for spoken texts. Accordingly, there is a header with
three main sections identified by the tags <tag>file.description</tag>,
<tag>encoding.declarations</tag>, and <tag>revision.history</tag>; cf. Figure 1.
The content of the last two need little adaptation, while the
first one will be quite different. With spoken texts there is
no author; instead, there is recording personnel and one or
more transcribers, and the real "authors" appear in a list of
participants (as in the dramatis personae of a play). There is
also (as in all electronic texts) an electronic editor, who
may or may not be identical to the transcriber. There is no
source publication as for printed texts (except for scripted
speech); instead there is a recording event followed by a
transcription stage. The following structure is suggested (see
also Figures 4-8):
<xmp><![ CDATA[
<file.description>
   <title.statement>
      <title>name of the text supplied by the electronic
      editor</title>
      <statement.of.responsibility>
         <name id= >name of the electronic editor</name>
         <role>electronic editor</role>
      </statement.of.responsibility>
   </title.statement>
   <edition.statement>(if applicable)
      <edition>...</edition>
      <statement.of.responsibility>
         <name id= >...</name>
         <role>...</role>
      </statement.of.responsibility>
   </edition.statement>
   <extent.statement>size of file</extent.statement>
   <publication.statement>
      <creation.date>...</creation.date>
      <publication>(if applicable)
         <publisher>...</publisher>
         <place>...</place>
         <date>...</date>
      </publication>
      <distribution>
         <distributor>...</distributor>
         <place>...</place>
         <date>...</date>
      </distribution>
   </publication.statement>
   <script.statement>(if applicable)
      <script id= >
         <title.statement>
            <title>name of the script the recording is based
            on</title>
            <statement.of.responsibility>
               <name>...</name>
               <role>...</role>
            </statement.of.responsibility>
         </title.statement>
         <edition.statement>
            <edition>...</edition>
            <statement.of.responsibility>
               <name>...</name>
               <role>...</role>
            </statement.of.responsibility>
         </edition.statement>
         <publication.statement>
            <creation.date>...</creation.date>
            <publication>
               <publisher>...</publisher>
               <place>...</place>
               <date>...</date>
            </publication>
            <distribution>
               <distributor>...</distributor>
               <place>...</place>
               <date>...</date>
            </distribution>
         </publication.statement>
      </script>
      <script id= >
      ...
      </script>
      ...
   </script.statement>
   <recording.statement id= >
      <statement.of.responsibility>
         <name>...</name>
         <role>...</role>
         <name>...</name>
         <role>...</role>
         ...
      </statement.of.responsibility>
      <recording.information>
         <recording.type>audio/video</recording.type>
         <recording.equipment>...</recording.equipment>
         <recording.duration>...</recording.duration>
      </recording.information>
      <broadcast.statement>(for recordings of broadcasts)
         <title.statement>
            <title type=series/episode>...</title>
            <statement.of.responsibility>
               <name>...</name>
               <role>...</role>
               <name>...</name>
               <role>...</role>
               ...
            </statement.of.responsibility>
         </title.statement>
         <organization>
            <name>...</name>
            <address>...</address>
         </organization>
         <medium type=radio/TV>...</medium>
         <station>...</station>
       </broadcast.statement>
       <list.of.participants>
         <participant id=P1>
            <demographic.information>
               <name>
                  <forename>...</forename>
                   <nickname>...</nickname>
                  <surname>...</surname>
                  <code>(if replacing name)</code>
               </name>
               <age>...</age>
               <sex>...</sex>
               <birth.date>...</birth.date>
               <birth.place>
                  <country>...</country>
                  <town>...</town>
               </birth.place>
                <place.of.residence>
                   <country>...</country>
                   <town>...</town>
                </place.of.residence>
               <education>...</education>
               <occupation>...</occupation>
               <soc.econ.status>...</soc.econ.status>
               <affiliation>(may be needed in case the
               participant is speaking on behalf of an
               organization)</affiliation>
               <native.language>...</native.language>
               <other.language>...</other.language>
               <dialect>...</dialect>
               <other.information>...</other.information>
            </demographic.information>
            <situational.information>
               <relation target=P2>e.g. mother</relation>
               <relation target='P3 P4 P5'>friend</relation>
               ...
               <role>role in the interaction</role>
               <language>(the language used in the
               situation, or comments on language use)
               </language>
                <awareness.of.recording>...</awareness.of.recording>
               <other.information>...</other.information>
            </situational.information>
         </participant>
         <participant id=P2>...
         </participant>
         ...
         <participant.group id=  size=  members= >
            <demographic.information>
               <age>...</age>
               <sex>...</sex>
               ...
            </demographic.information>
            <situational.information>
               <role>school class/audience etc.</role>
                <awareness.of.recording>...</awareness.of.recording>
               ...
            </situational.information>
            (The general principle is that
            <participant.group> can take any of the tags that occur
            under <participant>, provided that they apply
            to the whole group.)
         </participant.group>
      </list.of.participants>
      <setting>
         <location who= >...</location>
         <location who= >(repeated location needed
         with telephone conversation and other cases of
         distanced spoken communication; the "who"
         attribute identifies the relevant participant or
         participants)</location>
         ...
         <time who= >...</time>
         <time who= >(repeated time needed with
         distanced spoken communication)</time>
         ...
         <duration>...</duration>
         <channel type=direct who= >...</channel>
         <channel type=telephone who= >...</channel>
         ...
         <surroundings who= >room/train/library/open
         air, etc.</surroundings>
         <surroundings who= >(repeated surroundings
         needed with distanced spoken communication)
         </surroundings>
         ...
         <activities>activities of the participants
         </activities>
         <other.information>...</other.information>
      </setting>
      <publication.statement>(if the recording is
      published)
      ...
      </publication.statement>
   </recording.statement>
   <recording.statement id= >(repeated statement needed where
   there are several recordings)
   ...
   </recording.statement>
   ...
   <transcription.statement id= >
      <title>(if applicable)</title>
      <statement.of.responsibility>
         <name id= >...</name>
         <role>transcriber</role>
         <name id= >...</name>
         <role>...</role>
         ...
      </statement.of.responsibility>
      <transcription.date>...</transcription.date>
      <revision.history>...</revision.history>
      <publication.statement>(if the transcription
      is published)
      ...
      </publication.statement>
   </transcription.statement>
   <transcription.statement id= >...
   </transcription.statement>
   ...
   <notes.statement>notes stating the language of the
   text, giving copyright information, specifying
   conditions of availability, etc.</notes.statement>
]]></xmp>
<p>The difference with respect to the current TEI heading for
written texts is that "source description" is replaced by a
"script statement" (structured like the heading for a written
text; used where the recording is scripted), a "recording
statement", and a "transcription statement". To stress the
analogy with the heading for written texts, it is preferable
to embed these under "source description" (see Figure 4b):
<xmp><![ CDATA[
<source.description>
   <script.statement>...</script.statement>
   <recording.statement>...</recording.statement>
   <transcription.statement>...</transcription.statement>
</source.description>
]]></xmp>
<p>Spoken texts also differ from written ones in that they often
do not have a title. We suggest that the title of the
electronic text should be lifted from the "script statement"
(if there is a script), the "transcription statement" (if it
has a title), or the "recording statement" (if it has a
title). If not, the electronic editor should assign a title.
<p>   There is a problem in deciding how the responsibility for
the text should be stated. The following may be involved: the
electronic editor, the author of the script, the recording
personnel, the transcriber, the speaker(s). The last four are
naturally embedded in the "script", "recording", and
"transcription" statements, as shown above. The electronic
editor is naturally given in the "title statement" for the
electronic text. A possible solution is that the person who is
regarded as having the main responsibility for the text is
lifted from the "script", "recording", or "transcription"
statement and inserted before the electronic editor,
analogously to the current practice for editions of written
texts. Depending upon the nature of the material, the main
responsibility for the text is then allocated to the author
(where there is a script), the interviewer, the transcriber,
or the speaker(s):
<xmp><![ CDATA[
<title.statement>
   <title>...</title>
   <statement.of.responsibility>
      <name>...</name>
      <role>author/interviewer/transcriber/speaker</role>
      <name>...</name>
      <role>electronic editor</role>
   </statement.of.responsibility>
</title.statement>
]]></xmp>
It is worth considering whether the same lifting procedure
should also apply to electronic editions of written texts (at
the cost of some redundancy of coding).
<p>   The most complicated part of the heading for a spoken text
is the "recording statement". Note, in particular, the
structure suggested for the participants and the setting. Each
participant has an "id", which we can use to define his/her
relationship to other participants, to state the time and
location in cases of distanced communication, to assign
utterances to particular participants, etc. The categories
for participants should be coordinated with those needed in
history and the social sciences (particularly those grouped
together under "demographic information", while those placed
under "situational information" are more speech-specific).
<p>   Some of the types of information provided for above by tags
are probably better expressed as attributes. This applies, for
example, to the sex of participants and participant groups:
<xmp><![ CDATA[
<participant id=  sex=m/f>
<participant.group id= size=  members=  sex=m/f/mixed>
]]></xmp>
Incidentally, the "members" value in the last case is a list
of ids of any participants who belong to the participant group
(i.e. individual participants who also participate jointly).
The value of "size" is a number. Another case where attributes
may be recommended are for "time" (under "setting"). This
should be coordinated with general TEI recommendations for the
coding of date and time. The general principle should be that,
when values for a parameter are predictable and can thus be
constrained by the DTD, the parameter should be an attribute
rather than a tag.
<p>   The "script statement" is analogous to an ordinary heading
for a written text and requires no special machinery. The
"transcription statement" is fairly simple. The complexity
here is transferred to the statement of transcription
principles (under "encoding declarations"). Note that there
may be more than one script and more than one recording and
transcription statement, each with an "id".
<p>   As pointed out above, we have focused on the "file
description" of spoken texts. The other main text
documentation sections will define the principles of encoding
(transcription principles, reference system, etc.) and record
changes in the electronic edition. Note that text
classification goes under "encoding declarations". We do not
address the problem of text classification, as it is the
responsibility of the corpus work group.
<p>   Where a spoken text is part of a corpus, the information
must be split up between the corpus header (what is common for
all the texts in the corpus) and the header for each
individual text; cf. Johansson (forthcoming) and the
suggestions of the corpus work group.
<p>   Note, finally, that we do not suggest that all the slots in
the header should be filled; we just provide a place for the
different sorts of information, in case they are relevant. The
amount of information which goes on the "electronic title
page" will no doubt vary depending upon the type of material
and the intended uses (and depending upon how much information
is, in fact, available). In many cases it may be appropriate
to give a description in prose, e.g. of recording information
or the setting, rather than one broken down by tags. The
following is suggested as an absolute minimum: specification
of title and main responsibility for the text (under "title
statement"), recording time and circumstances of data
gathering (under "recording statement"), some information
about the participants (under "recording statement"),
transcription principles (under "editorial principles" in the
"encoding declarations" section).
 
<div2 id=8.2><head>Basic text units
<p>
We suggest the basic tag <tag>u</tag> (for utterance) referring to a
stretch of speech preceded and followed by silence or a change
of speaker. The <tag>u</tag> tag may have the following attributes:
<fig>
who=A1                  (speaker)
    'A1 B1 C1'          (several speakers or possible
                         speakers)
 
uncertain=              (description of uncertainty, e.g.
                         speaker attribution)
 
script=                 (if applicable, "id" of script)
 
trans=smooth            (smooth transition; default)
      latching          (noticeable lack of pause with respect
                         to the previous utterance)
 
n=1 2 3 ...             (for number)
</fig>
<p>
In addition, <tag>u</tag> has "start" and "end" attributes pointing to
the timeline; cf. the beginning of Section 8. <tag>u</tag> may contain
another <tag>u</tag>, but only where there is a change to/from a script
or between scripts; the speaker value must be the same as for
the matrix <tag>u</tag>.
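<p>A simple utterance might thus be encoded as follows (the
speaker code, numbering, timeline points, and wording are purely
illustrative):
<xmp><![ CDATA[
<u who=A1 trans=latching n=7 start=p14 end=p15>I know exactly
what you mean</u>
]]></xmp>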
<p>   As utterances may vary greatly in length, we also need tags
for lower-level units of different kinds: tone units (or
intonational phrases), pause-defined units, macrosyntagms, or
text-units defined solely for reference purposes (cf. Section
6.2). We suggest the use of the ordinary TEI <tag>s</tag> tag (for
segment) in all these cases, with attributes for type,
truncation, and number (as well as "start" and "end"
attributes pointing to the timeline; cf. the beginning of
Section 8):
<fig>
type=toneunit
     pauseunit
     macrosyntagm
     textunit
 
trunc=no  (default)
      yes
 
n=1 2 3 ...             (for number)
</fig>
The interpretation of the segment types should be defined more
exactly in the "encoding declarations" section of the file
header.
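<p>An utterance divided into tone units might accordingly be
encoded as follows (speaker code, numbering, and wording again
invented for illustration):
<xmp><![ CDATA[
<u who=B1 n=12>
<s type=toneunit n=12.1>well I'm not sure</s>
<s type=toneunit n=12.2>but I think so</s>
</u>
]]></xmp>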
<p>   Exceptionally an <tag>s</tag> may cross utterance boundaries, e.g.
where the addressee completes a macrosyntagm started by the
first speaker (or a speaker continues after a back-channel
from the addressee):
<xmp><![ CDATA[
<u who=A n=1><s type=macrosyntagm trunc=yes n=1.1>have you
heard that John</s></u>
<u who=B n=2><s type=macrosyntagm trunc=yes n=1.2>is
back</s></u>
]]></xmp>
In other words, we have two utterances, each consisting of a
fragment of a macrosyntagm. The identity of the macrosyntagm
is indicated by the "number" attribute of <tag>s</tag>.
<p>   Where more than one type of <tag>s</tag> is needed, we get problems
with conflicting hierarchies. In these cases we must resort to
milestone tags or concurrent markup. Concurrent markup will
certainly be needed where the text is analysed in terms of
turns and back-channels (although, most typically, a turn will
be a <tag>u</tag> and a back-channel a <tag>vocal</tag>). This belongs to the
area of discourse analysis, which is beyond the scope of our
present paper. The same applies to a more detailed analysis of
elements above <tag>u</tag>, where we have just suggested a general
<tag>div</tag> element; see the beginning of Section 8.
<p>   Among linguistic units we also find <tag>writing</tag>; cf. the
beginning of Section 8 and Figure 1. This contains a
representation of an event in which written text appears.
Typical cases are subtitles, captions, and credits in films
and on TV, though overhead slides used in a lecture might also
count. It has the same attributes as <tag>u</tag> and, in addition, an
"incremental" attribute (with the values "yes" or "no"),
specifying whether the writing appears bit by bit or all at
once. If the "who" attribute is specified, it picks out a
participant who generates or reveals the writing, e.g. a
lecturer using an overhead slide or writing on the blackboard.
<tag>writing</tag> may contain <tag>pointer</tag>, but only if the value of
"incremental" is "yes". It should perhaps be allowed to
contain all the tags in written texts. For texts with
extensive elements of <tag>writing</tag> there should be a
corresponding "script statement".
 
<div2 id=8.3><head>Reference system
<p>
The reference system is intimately connected with the choice
of text units. Here we can make use of the "number" attributes
of utterances and <tag>s</tag>-tags, refer to periods between points in
the timeline, use milestone tags, or define a concurrent
reference hierarchy: <tag>(ref)s</tag>...<tag>/(ref)s</tag>. The mechanism(s)
should be declared in the file header, in the "encoding
declarations" section.
 
<div2 id=8.4><head>Speaker attribution
<p>
In dramas speaker attribution is indicated in two ways in the
current TEI scheme: first, by the tag <tag>speaker</tag>, which
identifies speaker prefixes; second, by the "speaker" attribute
of the tag <tag>speech</tag>. We suggest a "who"
attribute for utterances (Section 8.2) and also for vocals
(Section 8.7), kinesic features (Section 8.11), and events
(Section 8.12). The value of the "who" attribute is the "id"
given in the list of participants in the file header.
<p>   Where attribution is uncertain, this may be indicated by an
"uncertain" attribute:
<xmp><![ CDATA[
<u who=A1 uncertain=  >
 
<u who='A1 B1 C1' uncertain=  >
]]></xmp>
The value of "uncertain" is a description of the uncertainty
and optionally a statement of the cause of the uncertainty,
the degree of uncertainty, and the identity of the
transcriber. In the first case above, the probable speaker is
A1, in the second A1, B1, or C1. Where the identity of the
speaker is completely open, the "who" attribute takes the
value "unknown".
<p>   If the utterance is the collective response of a group,
e.g. the audience during a lecture or school children in a
class, we can use the "id" of the participant group, as
defined in the file header. This should be distinguished from
overlapping identical responses of individual participants who
do not otherwise act as a group in the interaction. In the
latter case we recommend the normal mechanisms for speaker
overlap (see Section 8.5).
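<p>A collective response might thus be encoded as follows
(assuming a participant group with the hypothetical "id" PG1 has
been defined in the list of participants in the file header):
<xmp><![ CDATA[
<u who=PG1>good morning</u>
]]></xmp>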
 
<div2 id=8.5><head>Speaker overlap
<p>
Where there is simultaneous speech, the contributions of each
speaker are best separated and presented sequentially. Whole
utterances which overlap are catered for by the "start" and
"end" attributes of the elements (cf. the beginning of Section
8):
<xmp><![ CDATA[
<u who=A1 start=p10 end=p11>have you heard the news</u>
<u who=B1 start=p12 end=p13>no</u>
<u who=C1 start=p12 end=p13>no</u>
]]></xmp>
More likely, there is partial overlap between the utterances.
In these cases we must insert <tag>pointer</tag> tags specifying the
start and end of the overlapping segments. <tag>pointer</tag> is an
empty element, with an attribute pointing to the timeline.
 
<xmp><![ CDATA[
Example (see Figure 9):
<u who=A start=p1 end=p5>this<pointer time=p2>is<pointer
time=p3>my<pointer time=p4>turn</u>
<u who=B start=p2 end=p4>balderdash</u>
<u who=C start=p3 end=p5>no<pointer time=p4>it's mine</u>
]]></xmp>
In other words, the first speaker's "is my" overlaps with the
second speaker's "balderdash", and the first speaker's "my
turn" overlaps with the third speaker's "no it's mine". The
overlap may occur in the middle of a word, as in this example
(adapted from Gumperz & Berenz, forthcoming; see also Figure
10):
<xmp><![ CDATA[
<u who=R start=p1 end=p3>you haven't been to the skill
cen<pointer time=p2>ter</u>
<u who=K start=p2 end=p4>no<pointer time=p3>I haven't</u>
<u who=R start=p3 end=p5>so you have<pointer time=p4>n't seen
the workshop there</u>
]]></xmp>
<p>   Overlap between <tag>u</tag> and <tag>vocal</tag>, <tag>kinesic</tag>,
or <tag>event</tag> is handled in a corresponding manner:
<xmp><![ CDATA[
<u who=A1 start=p1 end=p3>have you read Vanity<pointer
time=p2>Fair</u>
<u who=B1 start=p2 end=p3>yes</u>
<kinesic who=C1 start=p2 end=p3 desc=nod>
]]></xmp>
Overlap between instances of <tag>vocal</tag>, <tag>kinesic</tag>,
and <tag>event</tag> is handled by their "start" and "end"
attributes.
 <p>  Overlap involving <tag>writing</tag> is dealt with in the same
manner as for <tag>u</tag>.
 
<div2 id=8.6><head>Word form
<p>
We shall only consider orthographic representations, as this
is the form of transcription most often used in extended
spoken texts (and as the problems of phonetic notation are the
concern of the work group on character sets). Words will then
be represented as in writing and will normally not present any
difficulties.
<p>   If there are deviations from ordinary orthography
suggesting how the words were pronounced (cf. Section 6.6), it
is preferable to include a normalized form, using the
mechanisms of editorial comment (see Section 8.13).
<p>   Standard conventions for hyphenation, capitalization,
names, and contractions are best used as in writing. If
required, standard contractions may be "normalized" in the
same way as idiosyncratic spellings (thus simplifying
retrieval of word forms). Initial capitalization of text units
is naturally used in a "speech as writing" text with ordinary
punctuation, but is best avoided where the text is divided
into prosodic units.
<p>   In a transcription which is to be prosodically marked it is
essential to write numerical expressions in full, e.g. twenty-
five dollars rather than $25 and five o'clock rather than
5:00.
<p>   The conventions for representing word forms should, like
all other editorial decisions, be stated in the
<tag>encoding.declarations</tag> section of the file header.
<p>   As regards truncated words and other types of quasi-lexical
vocalizations, see the next section.
 
<div2 id=8.7><head>Speech management
<p>
Depending upon the purpose of the study, the transcriber may
edit away the disfluencies which are so typical of unplanned
speech (truncated words or utterances, false starts,
repetitions, voiced and silent pauses) or transcribe the text
as closely as possible to the way it was produced. If the
disfluencies are left in the text, it may be desirable to
distinguish them by tags. These are some suggestions for the
treatment of speech management phenomena:
<list>
<item>truncated segment - use the "truncation" attribute of <tag>s</tag>;
                    see Section 8.2
<item>truncated word - write letter or letter sequence; if the
                 word is recognizable, tag <tag>trunc.word</tag>,
                 with attributes for "editor" and "full";
                 if it is not, tag <tag>vocal</tag>; see further
                 Section 8.13
<item>false start - tag <tag>false.start</tag>, with an "editor" attribute
<item>repetition - tag <tag>repetition</tag>, with an "editor" attribute
See further Section 8.13.
</list>
<p>   Silent pauses are represented as empty elements. They may
be sisters of <tag>u</tag> or included within <tag>u</tag> or <tag>s</tag>;
cf. Figure 1.
The tag <tag>pause</tag> has attributes for "start", "end", "duration",
and "units"; cf. the beginning of Section 8. In addition,
there may be a "type" attribute, which can take values like:
short, medium, long. Within <tag>u</tag> and <tag>s</tag> pauses can be
represented by entity references; see further Section 8.8.
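<p>Pauses between and within utterances might thus be encoded as
follows (speaker codes, timing values, and wording invented for
illustration):
<xmp><![ CDATA[
<u who=A>I was thinking <pause type=short> perhaps we could
leave early</u>
<pause duration=3 units=seconds>
<u who=B>fine</u>
]]></xmp>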
<p>   Voiced pauses and other quasi-lexical vocalizations (e.g.
back-channels) are tagged <tag>vocal</tag>. This is an empty element
which occurs as a sister of <tag>u</tag> or within <tag>u</tag> or <tag>s</tag>;
 cf.
Figure 1. It has the following attributes:
<fig>
who=A1                (speaker)
    'A1 B1 C1'        (several speakers or possible
                       speakers)
 
uncertain=            (description of uncertainty)
 
script=               (if scripted, "id" of script)
 
type=                 (subclassification)
 
iterated=yes
         no           (single; default)
 
desc=                 (verbal description)
 
n=1 2 3 ...           (number)
</fig>
In addition, there are attributes for "start", "end",
"duration", and "units"; cf. the beginning of Section 8.
<p>   It may be convenient to have lists of conventional forms
for use as values of "desc".
Examples (based on lists in
existing encoding schemes; note that the list includes
suggestions for non-verbal sounds; cf. Section 8.10):
<list>
<item>descriptive: burp, click, cough, exhale, giggle, gulp, inhale,
laugh, sneeze, sniff, snort, sob, swallow, throat, yawn
<item>
quasi-lexical: ah, aha, aw, eh, ehm, er, erm, hmm, huh, mm,
mmhmm, oh, ooh, oops, phew, tsk, uh, uh-huh,
uh-uh, um, urgh, yup
</list>
Within <tag>u</tag> and <tag>s</tag> vocals can be represented by entity
references; here we can make use of conventional forms like
those listed above: &amp;cough; &amp;mm; etc.
<p>   As already mentioned in passing, the borderline between <tag>u</tag>
and <tag>vocal</tag> is far from clear-cut. Researchers may wish to
draw the quasi-lexical type within the bounds of <tag>u</tag> and treat
them as words. This would agree with current encoding
practice, where quasi-lexical vocalizations are typically
represented as words and non-verbal sounds by descriptions
within parentheses (cf. Appendix 2). As for all basic
categories, the definition should be made clear in the
"endcoding declarations" section of the file header.
 
<div2 id=8.8><head>Prosodic features
<p>
The marking of prosodic features is of paramount importance,
as these are the ones which structure and organize the spoken
message. In considering pauses in the last section we have
already entered the area of prosody. Boundaries of tone units
(or "intonational phrases") can be indicated by the <tag>s</tag> tag,
as pointed out in Section 8.2.
<p>The most difficult problem is finding a way of marking
stress and pitch patterns. These cannot be represented as
independent elements, as they are superimposed on words or
word sequences. One solution is to reserve special characters
for this purpose, as in a written prosodic transcription, and
to define a set of entity references for different types of
pause, stress, booster, tone, etc. We will not make any
specific suggestions, as this is within the province of the
working group on character sets.
 
<div2 id=8.9><head>Paralinguistic features
<p>
These features characterize stretches of speech, not
necessarily co-extensive with utterances or other text units.
We suggest the milestone tag <tag>shift</tag>, indicating a change in a
specific feature. <tag>shift</tag> may occur within the scope of <tag>u</tag>
and <tag>s</tag> tags; cf. Figure 1. It has attributes for "time"
(pointing to the timeline), "feature", and "new" (defining the
change in the relevant feature). The following are some
important paralinguistic features, with corresponding values
for "new" (the suggestions are based on the Survey of English
Usage transcription, which is in its turn inspired by musical
notation):
<fig>
tempo:
 
a   =  allegro (fast)
aa  =  very fast
ac  =  accelerando (getting faster)
l   =  lento (slow)
ll  =  very slow
ral =  rallentando (getting slower)
 
loud (for loudness):
 
f   =  forte (loud)
ff  =  very loud
cr  =  crescendo (getting louder)
p   =  piano (soft)
pp  =  very soft
dim =  diminuendo (getting softer)
 
pitch (for pitch range):
 
high   =  high pitch-range
low    =  low pitch-range
wide   =  wide pitch-range
narrow =  narrow pitch-range
asc    =  ascending
desc   =  descending
mon    =  monotonous
scan   =  scandent, each succeeding syllable higher than
          the last, generally ending in a falling tone
 
tension:
 
sl  =  slurred
lax =  lax, a little slurred
ten =  tense
pr  =  very precise
st  =  staccato, every stressed syllable being doubly stressed
leg =  legato, every syllable receiving more or less equal
       stress
 
rhythm:
 
rh  =  beatable rhythm
arh =  arrhythmic, particularly halting
spr =  spiky rising, with markedly higher unstressed
       syllables
spf =  spiky falling, with markedly lower unstressed
       syllables
glr =  glissando rising, like spiky rising but the
       unstressed syllables, usually several, also rise
       in pitch relative to each other
glf =  glissando falling, like spiky falling but with the
       unstressed syllables also falling in pitch relative
       to each other
 
voice (for voice quality):
 
wh   =  whisper
br   =  breathy
hsk  =  husky
crk  =  creaky
fal  =  falsetto
res  =  resonant
gig  =  unvoiced laugh or giggle
lau  =  voiced laugh
trm  =  tremulous
sob  =  sobbing
yawn =  yawning
sigh =  sighing
</fig>
The last group is reminiscent of the vocals dealt with at the
end of Section 8.7. But here we are concerned with the sounds
as overlay, not as individual units.
<p>   Shift is marked where there is a significant change in a
feature. In all cases "new" can take the value "normal",
indicating a change back to normal (for the relevant speaker).
Lack of marking implies that there are no conspicuous shifts
or that they are irrelevant for the particular transcription.
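<p>   As an illustration (the speaker "id", the wording, and the
placement of the shifts are invented; the "time" attribute,
pointing to the timeline, is omitted), a stretch spoken
increasingly fast and then at normal tempo might be encoded as
follows:
<xmp><![ CDATA[
<u who=A1>and then <shift feature=tempo new=ac> he just ran
straight out of the room <shift feature=tempo new=normal> and
that was that</ u>
]]></xmp>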
<p>   Paralinguistic features are tied to the utterances of the
individual participant (or participant group). A shift is
valid for the utterance of the same speaker until there is a
new <tag>shift</tag> tag for the same feature. Utterances by other
speakers, which may have quite different paralinguistic
qualities, may intervene. To prevent confusion, <tag>shift</tag> should
perhaps have a "who" attribute, though it is strictly speaking
redundant (as it is always embedded in an element with a "who"
attribute).
<p>   Note that there may be shifts of several features in a
single speaker with different values of the "time" attribute.
In other words, different categories of paralinguistic
features may occur independently of each other and overlap.
<p>   One problem with the recommendations above is that one
would like to link the values for "feature" and "new". This
could be done by replacing the general <tag>shift</tag> tag by specific
tags with a "type" attribute, as in: <tag>temposhift type=ac</tag>,
<tag>loudshift type=ff</tag>. Alternatively, it would be possible to
replace the "feature" attribute of the <tag>shift</tag> tag by "tempo",
"loud", etc. and let them take the "new" values, as in: <tag>shift
tempo=ac</tag>, <tag>shift loud=ff</tag>.
 
<div2 id=8.10><head>Non-verbal sounds
<p>
In our proposal we make no distinction between quasi-lexical
phenomena and other vocalizations. Both are handled by
<tag>vocal</tag>; cf. Section 8.7.
 
<div2 id=8.11><head>Kinesic features
<p>
These can be dealt with in much the same way as the features
taken up in the preceding section, i.e. by using an empty
element, in this case <tag>kinesic</tag>, with the following
attributes:
<fig>
who=A1                ("id" of the participant(s) involved)
    'A1 B1 C1'
 
script=               (if scripted, "id" of script)
 
type=                 (subclassification)
 
iterated=yes
         no           (single; default)
desc=                 (verbal description)
</fig>
In addition, there are "start" and "end" attributes pointing
to the timeline; cf. the beginning of Section 8.
<p>   The tag <tag>kinesic</tag> is a sister of <tag>u</tag>; cf. Figure 1. It
handles the following types of features: facial expression
(smile, frown, etc.), gesture (nod, head-shake, etc.), posture
(tense, relaxed, etc.), pointing, gaze, applause, distance,
tactile contact. With actions the "who" attribute picks out
the active participant; there should perhaps also be a
"target" attribute, to be used, for example, for the goal of
the gazing or pointing (incidentally, "target" might also be
needed as a possible attribute of <tag>u</tag>, in case one wants to
specify that an utterance is addressed to a particular
participant or participant group; cf. asides in a play).
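<p>   The following sketch (with invented "id"s, descriptions, and
wording) shows how a gesture might be recorded between
utterances:
<xmp><![ CDATA[
<u who=A1>could you pass me that one</ u>

<kinesic who=A1 type=pointing desc='points at the folder' start=p7>

<u who=B1>this one</ u>
]]></xmp>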
<p>   Our recommendations for kinesic features present something
of a dilemma. Those who are not interested in gestures and the
like might very well be content with <tag>event</tag>. On the other
hand, our suggestions are probably insufficient for those who
are particularly concerned with kinesic features.
 
<div2 id=8.12><head>Situational features
<p>
These sorts of features are given in stage directions in
dramas, for which the current TEI draft recommends the tag
<tag>stage</tag>. We suggest a tag <tag>event</tag>, to be used for actions and
changes in the speech situation. Like <tag>kinesic</tag> this tag is a
sister of <tag>u</tag>; cf. Figure 1. It has the same attributes:
<fig>
who=A1                 ("id" of participant(s) involved)
    'A1 B1 C1'
 
script=                (if scripted, "id" of script)
 
type=                  (subclassification)
 
iterated=yes
         no            (single; default)
 
desc=                  (verbal description)
</fig>
In addition, there are "start" and "end"attributes pointing to
the timeline; cf. the beginning of Section 8.
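<p>   For example (the descriptions, time points, and "id"s are
invented), an interruption of the interaction might be recorded
as follows:
<xmp><![ CDATA[
<u who=A1>so if we look at the figures for</ u>

<event desc='telephone rings' start=p9 end=p10>

<event who=A1 desc='answers the telephone' start=p10>
]]></xmp>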
<p>   With kinesic and situational features (as well as
paralinguistic features) we have assumed that they are only
marked where the transcriber considers them to be significant
for the interpretation of the interaction. If we want a
complete record throughout the interaction, the best solution
is probably to have concurrent markup streams or to establish
links to an audio or video recording (cf. Section 9).
 
<div2 id=8.13><head>Editorial comment
<p>
An account of the editorial principles in general belongs in
the header, under <tag>encoding.declarations</tag>. This takes up the
type of transcription (orthographic, phonetic, etc.),
transcription principles, definition of categories, policy
with respect to punctuation, capitalization, etc. Here we are
concerned with editorial comments needed at particular points
in the text.
<p>    The current TEI draft defines the tags <tag>norm</tag>, <tag>sic</tag>,
<tag>corr</tag>, <tag>del</tag>, and <tag>add</tag>.
These may be used in spoken texts as well, as in:
<xmp><![ CDATA[
<norm ed=NN sic=an'>and</norm>
 
<norm ed=NN sic=can't>can not</ norm>
]]></xmp>
This approach may be problematic, however, where the original
text is heavily tagged. A possible solution is to use the
approach outlined below for variants, with <tag>alt</tag> tags for the
original and the emended text (and a "source" attribute
associating the alternatives with the original speaker or the
editor, using their individual "id").
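<p>   A sketch of this alternative, using the invented "id"s NN
(for the editor) and A1 (for the speaker) as elsewhere in this
paper, might look as follows:
<xmp><![ CDATA[
<var>
   <alt source=A1>an'</ alt>
   <alt source=NN>and</ alt>
</ var>
]]></xmp>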
<p>   If it is desirable to delete truncated words, repeated
words, and false starts (cf. Section 8.7), we can use a <tag>del</tag>
tag, with a "cause" attribute:
<xmp><![ CDATA[
<del ed=NN cause=trunc.word>s</ del>see
 
<del ed=NN cause=repetition>you you</ del>you know
 
<del ed=NN cause=false.start>it's</ del>he's crazy
]]></xmp>
To handle cases like "expletive deleted" in text A in Appendix
1, one would have to allow the <tag>del</tag> tag to be empty or use an
editorial note (see below).
<p>   Spoken texts require special mechanisms for handling
uncertain transcription. Uncertain speaker attribution was
dealt with above; see Section 8.4. It is quite possible that
"uncertain" should be allowed as a global attribute, providing
for comments on uncertainty with respect to transcription
(e.g.: the transcription of this <tag>u</tag> or <tag>s</tag> is uncertain) or
classification (e.g.: this vocal is on the borderline of ...).
A comment on uncertainty should include identification of the
uncertainty and optionally a statement of the cause of the
uncertainty, the degree of uncertainty, and the identity of
the transcriber.
<p>   Very often there will be uncertainty with respect to words
or brief segments for which there is no tag to carry the
"uncertain" attribute. For these cases we suggest the tag
<tag>uncertain</tag>, with attributes for "transcriber" (where the
value is the "id" given in the file header), "cause", and
"degree" (high, low), as in:
<xmp><![ CDATA[
<uncertain transcr=NN cause=  degree=  >they're</ uncertain>
]]></xmp>
This tag occurs within the scope of <tag>u</tag> or <tag>s</tag>.
<p>   Alternative transcriptions can be identified by the tag
<tag>var</tag>, used in recording variant readings in written texts,
with attributes for "transcriber" and perhaps also
"preference" (high, low, none) as in:
<xmp><![ CDATA[
<var>
   <alt transcr=NN pref=  >they're</ alt>
   <alt transcr=NN pref=  >there're</ alt>
</ var>
]]></xmp>
A special type of variation is found where there is deviation
from a script. This could be handled by <tag>alt</tag> tags with a
"source" attribute ("id" of script vs. of speaker) or,
alternatively, by an editorial note (see below).
<p>   For unintelligible words and segments we can use <tag>vocal</tag>,
with the usual attributes (plus "uncertain"), as in:
 
<xmp><![ CDATA[
 
<vocal who=A1 type=speech desc=  uncertain=  start=  end=
dur=3 units=sylls>
 
]]></xmp>
As a last resort, it is possible to insert an explanatory
note, with an "editor" or "transcriber" attribute, as in:
 
<xmp><![ CDATA[
<note transcr=NN>10 seconds untranscribable</ note>
]]></xmp>
 
All the mechanisms for editorial comment should be coordinated
with those for written texts.
 
<div1 id=9><head>Parallel representation
<p>
The issue of parallel representation naturally arises in the
encoding of spoken texts, as shown by some of our example
texts in Appendix 1 (particularly G and W). The main tool we
suggest for parallel representation is the timeline, which we
use to synchronize utterances, other types of vocalizations,
pauses, paralinguistic features, kinesic features, and events.
The timeline can also be used for multiple representation of
utterances by a single speaker, as in:
 
<xmp><![ CDATA[
 
<u>
   <unit>
      <level type=orth>
         <pointer time=p1>cats<pointer time=p2>
         eat<pointer time=p3>mice<pointer time=p4>
      </ level>
      <level type=phon>
         <pointer time=p1>kats<pointer time=p2>i:t<pointer
          time=p3>mais<pointer time=p4>
      </ level>
   </ unit>
</ u>
]]></xmp>
 
The pointers to the timeline establish the alignment between
the two levels of representation.
<p>   When aligning the recording itself with the transcription,
the unit containing the parallel levels is the text body:
 
<xmp><![ CDATA[
 
<text>
   <timeline>...</ timeline>
   <unit>
      <level type=transcription>...</ level>
      <level type=recording>...</ level>
   </ unit>
</ text>
]]></xmp>
These suggestions are preliminary and should be coordinated
with the mechanisms proposed for linking different levels of
linguistic analysis or parallel texts of a manuscript.
 
<div1 id=10><head>Transcription versus digitized recording
<p>
New technological developments make it possible to link a
transcription computationally with a digitized recording. If
this is done, we may require less from the transcription and
can regard it as a scaffolding which we can use to access
particular points in the recording. The compilers of the
Corpus of Spoken American English (cf. Du Bois et al. 1990) aim
at producing material of this kind. Similarly, it should be
possible to link a transcription with a video recording. The
use of such material is, however, restricted by the ethical
considerations which we drew attention to at the beginning of
this paper (Section 2.2), and it seems premature at this stage
to propose TEI encodings for such material (besides, it falls
within the bounds of the TEI hypertext work group).
 
<div1 id=11><head>Suggestions for further work
<p>
Among the features taken up in this paper there are some which
are more central than others. The minimal requirements for an
encoding scheme for spoken machine-readable texts are that it
provides mechanisms for coding:
<list>
<item>word forms
<item>basic text units
<item>speaker attribution and overlap
<item>a reference system
<item>text documentation (cf. the end of Section 8.1)
</list>
<p>
Other basic requirements are mechanisms for coding prosodic
features and editorial comment. One goal to strive for is some
agreement on basic requirements for the encoding of spoken
texts. We hope the present paper, and the comments it will
provoke, will be a step towards that goal.
<p>   Among the issues which require more work we would like to
single out:
<list>
<item>phonological representation, both of prosodic and segmental
   features (this is within the province of the working group
   on character sets)
<item>mechanisms for linking transcription, sound, and video,
   including possible parallels between the representation of
   speech and music (this is within the province of the
   hypertext work group and should be linked with the current
   HyTime proposals; cf. Goldfarb 1991)
<item>spoken discourse structure, including speech repair
<item>in general: parallel representation, discontinuities, and
   concurrent hierarchies
</list>
<p>
Last but not least, the proposals made in this paper need to
be worked out further and formalized within the framework of
SGML, to the extent that this is possible. As a first step, we
provide a DTD fragment and brief text samples encoded
according to our proposals; see Appendix 3. It is also highly
desirable to work on the mapping between our proposals and
major existing encoding schemes.
 
<div1 id=12><head>Spoken texts - a test case for the Text Encoding
    Initiative?
<p>
We can never get away from the fact that the encoding of
speech, even the establishment of a basic text, involves a lot
of subjective choice and interpretation. There is no blueprint
for a spoken text; there is no one and only transcription. The
most typical case is that a transcription develops cyclically,
beginning with a rough draft, more detail being added as the
study progresses.
<p>   It is no surprise that SGML can handle printed texts; after
all, it was set up for this purpose. It remains to be seen
whether the TEI application of SGML can be extended, in ways
which satisfy the needs of users of spoken machine-readable
texts, to handle the far more diffuse and shifting patterns of
speech.
</body>
<back>
<head>References
 
Allwood, J., J. Nivre, & E. Ahlsen. 1990. Speech Management
   - on the Non-written Life of Speech. Nordic Journal of
   Linguistics, 13, 3-48.
 
Atkinson, J.M. & J. Heritage (eds.). 1984. Structures of
   Social Action: Studies in Conversation Analysis.
   Cambridge: Cambridge University Press.
 
Autesserre, G. Perennou, & M. Rossi. 1991. Methodology for
   the Transcription and Labeling of a Speech Corpus. Journal
   of the International Phonetic Association, 19, 2-15.
 
Boase, S. 1990. London-Lund Corpus: Example Text and
   Transcription Guide. Survey of English Usage,
   University College London.
 
Brown, G., K.L. Currie, & J. Kenworthy. 1980. Questions of
   Intonation. London: Croom Helm.
 
Bruce, G. 1989. Report from the IPA Working Group on
   Suprasegmental Categories. Working Papers (Department
   of Linguistics and Phonetics, Lund University), 35,
   25-40.
 
Bruce, G., forthcoming. Comments on the paper by Jane
   Edwards. To appear in Svartvik (forthcoming).
 
Bruce, G. & P. Touati. 1990. On the Analysis of Prosody in
   Spontaneous Dialogue. Working Papers (Department
   of Linguistics and Phonetics, Lund University),
   36, 37-55.
 
Chafe, W. 1980. The Pear Stories: Cognitive, Cultural and
   Linguistic Aspects of Narrative Production. Norwood,
   N.J.: Ablex.
 
Chafe, W., forthcoming. Prosodic and Functional Units of
   Language. To appear in Edwards & Lampert (forthcoming).
 
Crowdy, S. 1991. The Longman Approach to Spoken Corpus
   Design. Manuscript.
 
Du Bois, J.W., forthcoming. Transcription Design Principles
   for Spoken Discourse Research, to appear in IPrA Papers
   in Pragmatics (1991).
 
Du Bois, J.W., S. Schuetze-Coburn, D. Paolino, & S. Cumming.
   1990. Discourse Transcription. Santa Barbara: University
   of California, Santa Barbara.
 
Du Bois, J.W., S. Schuetze-Coburn, D. Paolino, & S. Cumming,
   forthcoming. Outline of Discourse Transcription. To
   appear in Edwards & Lampert (forthcoming).
 
Edwards, J.A. 1989. Transcription and the New Functionalism:
   A Counterproposal to CHILDES' CHAT Conventions. Berkeley
   Cognitive Science Report 58. University of California,
   Berkeley.
 
Edwards, J., forthcoming. Design Principles in the
   Transcription of Spoken Discourse. To appear in Svartvik
   (forthcoming).
 
Edwards, J.A. & M.D. Lampert (eds.), forthcoming. Talking
   Language: Transcription and Coding of Spoken Discourse.
   Hillsdale, N.J.: Lawrence Erlbaum Associates, Inc.
 
Ehlich, K., forthcoming. HIAT - A Transcription System for
   Discourse Data. To appear in Edwards & Lampert
   (forthcoming).
 
Ehlich, Konrad & Bernd Switalla. 1976. Transkriptionssysteme:
   Eine exemplarische Ubersicht. Studium Linguistik 2:78-105.
 
Ellmann, R. 1988. Oscar Wilde. New York: Knopf.
 
Esling, J.H. 1988. Computer Coding of IPA Symbols. Journal of
   the International Phonetic Association, 18, 99-106.
 
Faerch, C., K. Haastrup & R. Phillipson. 1984. Learner Language
   and Language Learning. Copenhagen: Gyldendal.
 
Goldfarb, C.F. (ed.) 1991. Information Technology -
   Hypermedia/Time-based Structuring Language (HyTime).
   Committee draft, international standard 10744. ISO/IEC CD
   10744.
 
Gumperz, J.J. & N. Berenz, forthcoming. Transcribing
   Conversational Exchanges. To appear in Edwards & Lampert
   (forthcoming).
 
Hout, R. van. 1990. From Language Behaviour to Database: Some
   Comments on Plunkett's Paper. Nordic Journal of
   Linguistics, 13, 201-205.
 
International Phonetic Association. Report on the 1989 Kiel
   Convention. Journal of the International Phonetic
   Association, 19, 67-80.
 
Jassem, W. 1989. IPA Phonemic Transcription Using an IBM PC
   and Compatibles. Journal of the International Phonetic
   Association, 19, 16-23.
 
Johansson, S. Forthcoming. Encoding a Corpus in
   Machine-readable Form. To appear in B.T.S. Atkins et al.
   (eds.), Computational Approaches to the Lexicon: An
   Overview. Oxford University Press.
 
Knowles, G. 1991. Prosodic Labelling: The Problem of Tone
   Group Boundaries. In S. Johansson & A.-B. Stenstrom (eds.),
   English Computer Corpora: Selected Papers and Research
   Guide. Berlin & New York: Mouton de Gruyter. 149-163.
 
Knowles, G. & L. Taylor. 1988. Manual of Information to
   accompany the Lancaster Spoken English Corpus. Lancaster:
   Unit for Computer Research on the English Language,
   University of Lancaster.
 
Kytö, M. 1990. Introduction to the Use of the Helsinki Corpus:
   Diachronic and Dialectal. In Proceedings from the
   Stockholm Conference on the Use of Computers in
   Language Research and Teaching, September 7-9, 1989.
   Stockholm Papers in English Language and Literature 6.
   Stockholm: English Department, University of Stockholm.
   41-56.
 
Labov, W. & D. Fanshel. 1977. Therapeutic Discourse:
   Psychotherapy as Conversation. New York: Academic Press.
 
Lanza, E. 1990. Language Mixing in Infant Bilingualism:
   A Sociolinguistic Perspective. Ph.D. thesis.
   Georgetown University, Washington D.C.
 
Loman, B. 1982. Om talsprakets varianter. In B. Loman (ed.),
   Sprak och samhalle. Lund: Lund University Press. 45-74.
 
MacWhinney, B. 1988. CHAT Manual. Pittsburgh, PA: Department
   of Psychology, Carnegie Mellon University.
 
MacWhinney, B. 1991. The CHILDES Project. Hillsdale, N.J.:
   Lawrence Erlbaum Associates, Inc.
 
Melchers, G. 1972. Studies in Yorkshire Dialects. Based on
   Recordings of 13 Dialect Speakers in the West Riding.
   Part II. Stockholm: English Department, University of
   Stockholm.
 
Ochs, E. 1979. Transcription as Theory. In E. Ochs & B.
   Schieffelin (eds.), Developmental Pragmatics. New York:
   Academic Press. 43-72.
 
Pedersen, L. & M.W. Madsen. 1989. Linguistic Geography in
   Wyoming. In W.A. Kretzschmar et al. (eds.), Computer
   Methods in Dialectology. Special issue of Journal of
   English Linguistics 22.1 (April 1989). 17-24.
 
Pittenger, R.E., C.F. Hockett, & J.J. Danehy. 1960. The First
   Five Minutes. A Sample of Microscopic Interview Analysis.
   Ithaca, N.Y.: Paul Martineau.
 
Plunkett, K. 1990. Computational Tools for Analysis Talk.
   Nordic Journal of Linguistics 13, 187-199.
 
Rosta, A. 1990. The System of Preparation and Annotation of
   I.C.E. Texts. Appended to International Corpus of English,
   Newsletter 9 (ed. S. Greenbaum). University College London.
 
Sinclair, J.McH. & R.M. Coulthard. 1975. Towards an Analysis
   of Discourse: The English Used by Teachers and Pupils.
   London: Oxford University Press.
 
Sperberg-McQueen, C.M. & L. Burnard (eds.). 1990. Guidelines
   for the Encoding and Interchange of Machine-readable
   Texts. Draft version 1.0. Chicago & Oxford: Association
   for Computers and the Humanities/Association for
   Computational Linguistics/Association for Literary and
   Linguistic Computing.
 
Svartvik, J. (ed.). 1990. The London-Lund Corpus of Spoken
   English: Description and Research. Lund Studies in English
   82. Lund: Lund University Press.
 
Svartvik, J. (ed.), forthcoming. Directions in Corpus
   Linguistics. Proceedings of Nobel Symposium 82, Stockholm,
   4-8 August 1991. Berlin: Mouton de Gruyter.
 
Svartvik, J. & R. Quirk (eds.). 1980. A Corpus of English
   Conversation. Lund Studies in English 56. Lund: Lund
   University Press.
 
Terkel, S. 1975. Working. People Talk About What They Do All
   Day and How They Feel About What They Do. New York: Avon
   Books.
 
The White House Transcripts. Submission of Recorded
   Presidential Conversations to the Committee on the
   Judiciary of the House of Representatives by President
   Richard Nixon. By the New York Times Staff for The
   White House Transcripts. New York: Bantam Books. 1974.
 
Wells, J.C. 1987. Computer-coded Phonetic Transcription.
   Journal of the International Phonetic Association, 17,
   94-114.
 
Wells, J.C. 1989. Computer-coded Phonemic Notation of
   Individual Languages of the European Community. Journal
   of the International Phonetic Association, 19, 31-54.