TCW 25: Report on EIT/MMI 3rd Workshop, Berlin 22 Oct 2012

EIT MMI Workgroup Meeting, Berlin 22 Oct 2012

As noted at the last FTF, Laurent Romary in his capacity as ISO TC37/SC4 chair has proposed a new ISO/TEI joint activity in the area of speech transcription, an idea which was initiated in the context of the multimodal interaction activity (MMI) of the European EIT ICT Labs project (

The workshop was organized as a joint working meeting between representatives of the MMI activity, ISO, and TEI, and held at the DIN's offices in Berlin. Two previous workshops have been held, focussing on standardization activities for multimodal interaction in general.

Prime movers in the activity, apart from Laurent, appear to be Thomas Schmidt and Andreas Witt (also convenor of WG6 of ISO/TC 37/SC 4, where the ISO project is likely to develop) from the Institut für Deutsche Sprache in Mannheim. A number of other European research labs, mostly concerned with analysis of corpora of human-computer interaction, were also represented; specifically: Nadia Mana from FBK (Trento, Italy); Tatjana Scheffler (DFKI, Germany); Khiet Truong (Univ of Twente); Benjamin Weiss (TU Berlin); Mathias Wilhelm (DAI Labor); Bertrand Gaiffe (ATILF, Nancy). This being an ISO activity, the real world of commerce and industry was also represented by Felix Burkhardt from Deutsche Telekom's Innovation Lab.

Related ISO activity mentioned by Laurent included the work on Discourse Relations led by Harry Bunt, and the long-awaited MAF (morpho-syntactic annotation framework), both of which are due to appear Real Soon Now. A quick tour de table confirmed my impression that most of the attendees were primarily researchers in human-computer interaction with interests in the construction and encoding of multimodal corpora.

The main business of the day was to go through a preliminary working document drafted by Thomas Schmidt, the objective of which is to confer ISO authority on a subset of the existing TEI proposals for spoken text transcription, with some possible modification. The underlying work is well described in Schmidt's recent excellent article in jTEI: essentially, it consists of a close look at the majority of transcription formats used by the relevant research communities and tools, a synthesis of what they have in common, and suggestions of how that synthesis maps to TEI.

This is to a large extent motivated by concerns about preservation and migration of data in “legacy” formats.

The discussion began by establishing boundaries: despite my proposal to the contrary, there was little appetite to extend the work into the area of truly multimodal transcriptions, which was still generally felt to be insufficiently understood for a practice-based standard to be appropriate. Concern was expressed that we should not make ad hoc premature suggestions. So the document really only concerns transcribed speech. There was no disagreement with the general approach, which is to:

  • distinguish a small number of macro-structural features
  • provide guidelines about how to mark up specific units of analysis at the micro-structural level, using a subset of the TEI.

I was also much cheered by two further remarks he made:

  • the graph-based “annotation framework” formalisation proposed by Bird and Liberman was theoretically complete but so generic as to be practically useless (I paraphrase)
  • at the micro level, everything you need is there in the TEI (I quote)

Discussion focussed on the following points raised by the working document:


Many existing tools organise transcriptions into “tiers” of annotation. These seem to be purely technical artefacts, which can be addressed more exactly by use of XML markup. Unlike “levels” of annotation, they have no semantics. It's doubtful that we need a <tier> element.

Metadata -1

How many of the (very rich) TEI proposals should be included, or mentioned? And how should the three things Thomas had found missing be supplied? I suggested that <appInfo> was an appropriate way to record information about the transcription tool used; that the definition of the transcription system used belonged in the <encodingDesc>; and agreed that there was nothing specifically provided for recording pointers or links to the original video or audio transcribed. In the meeting, I speculated that maybe there was scope for extending (or misusing) <facsimile> for this last purpose; another possibility which occurs to me as I type these notes is that one could also extend <recordingDesc>.
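To make the first two suggestions concrete, here is a rough sketch (mine, not taken from the working document) of a header fragment recording the tool in <appInfo> and the transcription conventions in prose inside <encodingDesc>, plus a few lines of Python that read the tool information back. The tool name "ExampleTranscriber", its version, and the wording of the conventions are all invented for illustration, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment: tool name, version and the description
# of the conventions are invented; the TEI namespace is omitted.
ENCODING_DESC = """
<encodingDesc>
  <appInfo>
    <application ident="ExampleTranscriber" version="1.0">
      <label>Example Transcriber</label>
    </application>
  </appInfo>
  <editorialDecl>
    <p>Transcribed according to an (imaginary) project convention.</p>
  </editorialDecl>
</encodingDesc>
"""

def transcription_tools(encoding_desc_xml):
    """Return (ident, version, label) for each <application> recorded."""
    root = ET.fromstring(encoding_desc_xml)
    return [
        (app.get("ident"), app.get("version"), app.findtext("label"))
        for app in root.iter("application")
    ]

print(transcription_tools(ENCODING_DESC))
```

Nothing here addresses the third gap, the link to the original audio or video, since at the time of the meeting it was not settled where that should go.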


The timeline is fundamental to the macrostructure of a transcript. Thomas' examples all used absolute times for their <when>s, but I suggested that relative ones might be easier. The document ordering both of <when>s and of transcribed speech should reflect the temporal order as far as possible; this would allegedly facilitate interoperability.
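As an illustration of the relative style, the following sketch (my own, with invented identifiers and durations, and without the TEI namespace) encodes each <when> as an @interval @since an earlier point and then resolves every point to an offset from the origin. It leans on exactly the ordering assumption mentioned above: a point can only be resolved if the point it refers to has already been seen.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical timeline: T0 is the origin; the other points are given
# as intervals (in seconds) since an earlier point.
TIMELINE = """
<timeline unit="s" origin="#T0">
  <when xml:id="T0"/>
  <when xml:id="T1" interval="1.5" since="#T0"/>
  <when xml:id="T2" interval="2.0" since="#T1"/>
</timeline>
"""

def resolve_offsets(timeline_xml):
    """Return {id: seconds from origin} for each <when>, in document order."""
    root = ET.fromstring(timeline_xml)
    origin_ref = root.get("origin", "")
    origin = origin_ref.lstrip("#")
    offsets = {}
    for when in root.iter("when"):
        wid = when.get(XML_ID)
        if wid == origin:
            offsets[wid] = 0.0
            continue
        # A missing @since is read as "since the origin".
        since = when.get("since", origin_ref).lstrip("#")
        # Works only if document order follows temporal order, so that
        # the referenced point has already been resolved.
        offsets[wid] = offsets[since] + float(when.get("interval"))
    return offsets

print(resolve_offsets(TIMELINE))
```

If the <when>s were shuffled, the lookup of the referenced point would fail with a KeyError, which is one small way of seeing why keeping document order and temporal order in step helps tools.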


What metadata was needed, required, recommended for the description of participants? (@sex raised its ugly head here). Could we use <person> to refer to artificial respondents in MMI experiments? (yes, if they have person-like characteristics; no otherwise)

It was noted that almost any personal trait or state might be crucial to the analysis of some corpora. We noted that CMDI now recommended using the ISOCAT data category registry as an independent way of defining metadata terminology; also that ISOCAT was now available within the TEI scheme (though whether it fits into personal metadata I am less sure). There was (I think) general agreement that we'd reference the various options available in the TEI but not incorporate all of them.

We agreed that the principles underlying a given transcription should be clearly documented, either in associated articles, in the formal specification for an encoding, or in the header of individual documents.


Several people disliked the expanded name of the <u> element (“utterance”) and its definition, for various theoretical reasons. Its definition should be modified to remove the implication that it necessarily followed a silence, though we seemed to agree that a <u> could only contain a stretch of speech from a single speaker.

The temporal alignment of a <u> can be indicated either by @start and @end or by nested <anchor/>s: the standard should probably recommend use of one method or the other, but not both. We discussed whether or not the fact that existing tools did not support the (even simpler) use of @trans to indicate overlap should lead us not to recommend it.
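The two alignment styles can be put side by side in a small sketch. The utterances, speaker and timeline identifiers below are invented, the TEI namespace is left out, and the rule that mixing both styles is an error is my reading of "one or the other but not both" rather than anything the group wrote down.

```python
import xml.etree.ElementTree as ET

# Style A: alignment carried by @start and @end on <u>.
U_ATTRS = '<u who="#SPK0" start="#T0" end="#T2">good morning</u>'

# Style B: alignment carried by nested <anchor/> elements.
U_ANCHORS = ('<u who="#SPK0"><anchor synch="#T0"/>good '
             '<anchor synch="#T1"/>morning<anchor synch="#T2"/></u>')

def time_points(u_xml):
    """Return the timeline points a <u> refers to, in document order."""
    u = ET.fromstring(u_xml)
    anchors = [a.get("synch").lstrip("#") for a in u.iter("anchor")]
    has_attrs = u.get("start") is not None or u.get("end") is not None
    if anchors and has_attrs:
        # Assumed policy: a single <u> should not mix the two styles.
        raise ValueError("utterance mixes @start/@end with <anchor/>")
    if anchors:
        return anchors
    return [u.get(name).lstrip("#") for name in ("start", "end")]

print(time_points(U_ATTRS))
print(time_points(U_ANCHORS))
```

The anchor style can mark points inside the utterance (T1 here), which the attribute style cannot; that is presumably why both exist.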


Thomas wanted some method of associating with a <u> the whole block of annotations made on it (represented as one or more <interpGrp>s). His document suggested using <div> for this purpose. A lighter-weight solution might be to include <interpGrp> within <u>, or to propose a new wrapper <annotatedU> element.
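A sketch of what the wrapper option might look like follows. The <annotatedU> element is only the name floated in the meeting and exists in no TEI schema as far as these notes go; the part-of-speech labels, the identifiers and the content model for <interp> are likewise my own assumptions, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: <annotatedU> is a proposed name, not an
# existing TEI element; the annotation values are invented.
BLOCK = """
<annotatedU>
  <u who="#SPK0" start="#T0" end="#T1">good morning</u>
  <interpGrp type="pos">
    <interp>ADJ</interp>
    <interp>NOUN</interp>
  </interpGrp>
</annotatedU>
"""

def annotations_by_type(block_xml):
    """Map each <interpGrp> @type to the list of its <interp> values."""
    block = ET.fromstring(block_xml)
    return {
        grp.get("type"): [interp.text for interp in grp.iter("interp")]
        for grp in block.iter("interpGrp")
    }

print(annotations_by_type(BLOCK))
```

The appeal of a dedicated wrapper over <div> is that a tool can pick up an utterance and all of its annotations in one step without guessing what a given <div> is for.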


Laurent noted that MAF recommended use of <w> for individual tokens; we didn't need to take a stand on the definition of “word” but could simply refer to MAF. We needed some way of signalling the things that older transcription formats had found important, e.g. words considered incomplete, false starts, repetitions, abbreviations etc. so we needed to choose an appropriate TEI construct for them, even if we thought the concept was not useful or ill-defined. The general purpose <seg> element might be the simplest solution, but some diplomacy would be needed about how to define its application and its possible @type or @function values.
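The <w> plus <seg> combination might look like the sketch below. The @type values "false-start" and "repetition" are invented placeholders for whatever vocabulary the group eventually agrees on, the utterance is made up, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical utterance: tokens in <w>, legacy phenomena wrapped in
# <seg> with invented @type values.
U = ('<u who="#SPK0"><w>I</w> <seg type="false-start"><w>wan</w></seg> '
     '<w>want</w> <seg type="repetition"><w>to</w> <w>to</w></seg> '
     '<w>go</w></u>')

def tokens_with_phenomena(u_xml):
    """Return (token, enclosing seg type or None) pairs in document order."""
    result = []

    def walk(node, label):
        for child in node:
            if child.tag == "w":
                result.append((child.text, label))
            elif child.tag == "seg":
                # Tokens inside a <seg> inherit its @type as their label.
                walk(child, child.get("type"))

    walk(ET.fromstring(u_xml), None)
    return result

for token, label in tokens_with_phenomena(U):
    print(token, label)
```

A tool that does not care about false starts or repetitions can simply ignore the <seg> layer and still get the same token sequence from the <w> elements.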


This workgroup will probably produce a useful document describing an important use case for the TEI recommendations on spoken language. It is currently a Google Doc which the group has agreed to share with the Council. I undertook to help turn this into an ODD, which could eventually become one of our Exemplars. Work on standardising other aspects of transcribed multimodal interactions probably needs to be deferred to a later stage.