A progress report from the TEI Character Encoding working group


Christian Wittern

Contents

Overview

  • Markup and Character Encoding
  • Extension mechanism in the current TEI Guidelines
  • Proposed new extension mechanism

Markup and Character Encoding

  • Character encoding is the basic transportation layer for all texts
  • It encodes abstract characters, no other information (ideally:-)
  • In XML documents, the only choice for character encoding is Unicode
  • Some characters in Unicode are control characters or otherwise interfere with markup
  • This is discussed further in: Martin Dürst and Asmus Freytag, Unicode in XML and other Markup Languages, at http://www.unicode.org/unicode/reports/tr20/
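
One concrete case of characters interfering with markup: XML 1.0 excludes most C0 control characters from document content altogether. The following is a minimal sketch of a check against the XML 1.0 `Char` production; the function names `is_xml10_char` and `illegal_chars` are made up for this illustration.

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD      # excludes surrogates, U+FFFE, U+FFFF
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_chars(text: str) -> list[str]:
    """List the code points in text that XML 1.0 does not allow."""
    return [f"U+{ord(c):04X}" for c in text if not is_xml10_char(c)]


# A form feed and a NUL are valid Unicode but not valid XML 1.0 content.
print(illegal_chars("a\x0cb\x00"))
```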

Unicode / ISO 10646

  • A universal character set, jointly developed by The Unicode Consortium and ISO/IEC JTC 1/SC 2/WG 2
  • As of Unicode 3.2 (March 2002) more than 94000 characters are encoded.
  • Characters are identified by their names (except Chinese, Japanese, Korean characters)
  • XML can use a subset, but not something completely different.
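
As a small illustration of the point about names, the sketch below uses Python's standard `unicodedata` module. Most characters carry a descriptive name, while CJK unified ideographs only get a name derived from their code point. The helper `describe` is hypothetical, and the results reflect the Unicode version bundled with the Python interpreter, not Unicode 3.2.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


print(describe("\u00fc"))  # a Latin letter with a descriptive name
print(describe("\u5b57"))  # a CJK ideograph, named after its code point
```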

Writing System Declaration (WSD)

  • Since P3, the TEI Guidelines provide a mechanism to declare
    • The language of a document or a part thereof
    • The script used to write that language
    • The encoding used to serialize that script into files
    • The declaration of characters used beyond those provided by that encoding
  • All these functions are bundled together in the WSD.

Problems with P3/P4 WSD

  • Language/Script/Encoding are lumped together and cannot be declared separately
  • The WSD mechanism is cumbersome and little used
  • A large part of the WSD has become obsolete with Unicode as the base character set
  • The extension mechanism relied partly on a glyph registry, which is now defunct

TEI P5

  • The TEI Guidelines are on track for their first major revision (P5)
  • No definite schedule has been set
  • Among other things, this will likely include a schema-based version of the constraints on document structure

Towards a new WSD mechanism (‘WSD-NG’)

  • The WG has been in operation since July 2001
  • There have been two face-to-face meetings (and a lot of email)
  • The WG plans to unbundle the language/script/encoding declaration of the WSD
  • Information about the work and the current draft documents is available at http://www.tei-c.org/Activities/CE
  • Currently, there are three proposed modules of WSD-NG:
    • A module to provide a syntax for defining new characters
    • A module to define properties for characters
    • A module for the linguistic description of writing systems

Defining new characters

  • This was the item most hotly debated at the second meeting in Tuebingen
  • The following suggestions have been discussed:
    1. Use Private Use Area (PUA) characters from Unicode (and escape/document them for interchange)
    2. Use markup constructs (e.g. elements)
  • How both of the above are implemented depends largely on whether or not entity references are available to
  • The problem with using markup constructs is that these cannot be used in attribute values
  • Since markup cannot be used in attribute values, not just for characters but for all language properties, the use of attributes in the TEI Guidelines might need some reconsideration
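
The two suggestions, and the attribute-value problem, can be sketched as follows. This is only an illustration: the element name `g`, its `ref` attribute, and the `lemma` attribute are assumptions made for the example, not syntax agreed on by the WG.

```python
import unicodedata
import xml.etree.ElementTree as ET


def pua_chars(text: str) -> list[str]:
    """Return the distinct Private Use characters (category 'Co') in text."""
    return sorted({c for c in text if unicodedata.category(c) == "Co"})


def is_well_formed(xml: str) -> bool:
    """True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


# Suggestion 1: a PUA character works in content and in attribute values,
# but it has to be documented separately for interchange.
with_pua = '<w lemma="ab\ue000c">ab\ue000c</w>'

# Suggestion 2: a markup construct (hypothetical <g/> element) in content.
in_content = '<w>ab<g ref="#char1"/>c</w>'

# The same construct inside an attribute value is not well-formed XML.
in_attribute = '<w lemma="ab<g ref=\'#char1\'/>c">abc</w>'

print(is_well_formed(with_pua), pua_chars(with_pua))
print(is_well_formed(in_content))
print(is_well_formed(in_attribute))
```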

Character properties

  • Unicode defines a set of normative properties for its characters:
    • Case
    • Combining Classes
    • Conjoining Jamo (1100–11FF)
    • Decomposition (Canonical and Compatibility)
    • Directionality
    • Jamo Short Name
    • Numeric Value
    • Private Use
    • Special Character Properties
    • Surrogate
    • Mirrored
    • Unicode Character Names
  • In addition, there are some informative properties
  • Text encoders may wish to fine-tune these properties
  • This WSD module should make it possible to associate new properties with characters or to override their existing properties
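
One way to picture such a module is as a local overlay on top of the standard property values. The sketch below reads a few properties from Python's `unicodedata` module and lets a project-specific table override them; the `OVERRIDES` table and the name given to the PUA character are invented for the example.

```python
import unicodedata


def base_properties(ch: str) -> dict:
    """Read a few standard Unicode properties for one character."""
    return {
        "name": unicodedata.name(ch, ""),
        "category": unicodedata.category(ch),
        "combining": unicodedata.combining(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
        "decomposition": unicodedata.decomposition(ch),
        "numeric": unicodedata.numeric(ch, None),
    }


# Hypothetical local declarations: a PUA character gets a name and a
# letter category that Unicode itself cannot supply.
OVERRIDES = {
    "\ue000": {"name": "HYPOTHETICAL LOCAL LIGATURE", "category": "Ll"},
}


def properties(ch: str, overrides: dict = OVERRIDES) -> dict:
    """Standard properties with any local overrides applied on top."""
    props = base_properties(ch)
    props.update(overrides.get(ch, {}))
    return props


print(properties("\u00e9")["decomposition"])  # canonical decomposition of é
print(properties("\ue000")["category"])       # overridden locally
```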

Linguistic description of writing systems

  • Eric Albright's Design of an electronic method for describing writing systems serves as a good starting point for this module
  • Work has begun (see CEW05) to enumerate the features needed.
  • A lot more needs to be done here.
  • One of the issues here is that text encoders frequently have to deal with two instances of a given writing system:
    • The writing system as it was when the text was written
    • The modern version of the writing system
  • A frequent requirement for digital texts is to be able to use (at least) either of these for rendering.
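
The rendering requirement can be sketched as a choice between the transcribed forms and a mapping to modern forms. The mapping table below is an invented two-entry example (long s, and u with a superscript e), not a description of any actual writing system.

```python
# Hypothetical mapping from historical forms to their modern equivalents.
MODERN_FORMS = {
    "\u017f": "s",        # long s
    "u\u0364": "\u00fc",  # u + combining small e above, modern u-umlaut
}


def render(text: str, system: str = "original") -> str:
    """Render text in the original or the modern writing system."""
    if system == "original":
        return text
    if system != "modern":
        raise ValueError(f"unknown writing system: {system!r}")
    for old, new in MODERN_FORMS.items():
        text = text.replace(old, new)
    return text


print(render("Wa\u017f\u017fer"))            # as transcribed
print(render("Wa\u017f\u017fer", "modern"))  # modernized
```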

Conclusions

  • We had a lot of discussion, mostly centering on the first module, ‘character representation’
  • A lot of work, especially in the other modules, still needs to be done.
  • It would help the further work of the WG if some of the open architectural issues for P5 could be discussed and decided.