A progress report from the TEI Character Encoding working group
Christian Wittern
Contents
- Overview
- Markup and Character Encoding
- Unicode / ISO 10646
- Writing System Declaration (WSD)
- Problems with P3/P4 WSD
- TEI P5
- Towards a new WSD mechanism (WSD-NG)
- Defining new characters
- Character properties
- Linguistic description of writing systems
- Conclusions
Markup and Character Encoding
- Character encoding is the basic transportation layer for all texts
- It encodes abstract characters, no other information (ideally:-)
- In XML documents, the only choice for character encoding is Unicode
- Some characters in Unicode are control characters or otherwise interfere with markup
- More of this is discussed in: Martin Dürst and Asmus Freytag, Unicode in XML and other Markup Languages at: http://www.unicode.org/unicode/reports/tr20/
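As a rough illustration of the point about control characters, the following sketch uses Python's standard unicodedata module to flag code points whose general category is Cc (control) or Cf (format). The function name, the choice of categories, and the whitespace exceptions are assumptions made for this example; they are not taken from the report cited above.

```python
import unicodedata

# Assumption for this sketch: treat control (Cc) and format (Cf) characters
# as candidates for interfering with markup.
SUSPECT_CATEGORIES = {"Cc", "Cf"}

# Tab, line feed and carriage return are ordinary whitespace in XML.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def find_suspect_characters(text):
    """Return (index, code point, category) for each flagged character."""
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in SUSPECT_CATEGORIES and ch not in ALLOWED_CONTROLS:
            hits.append((index, f"U+{ord(ch):04X}", category))
    return hits


# U+202E (RIGHT-TO-LEFT OVERRIDE) is a format character and gets flagged;
# the trailing newline is not.
print(find_suspect_characters("a\u202Eb\n"))
```

A real checker would follow the recommendations of the report rather than a blanket category test; this only shows how such characters can be located.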
Unicode / ISO 10646
- A universal character set, jointly developed by The Unicode Consortium and ISO/IEC JTC 1/SC 2/WG 2
- As of Unicode 3.2 (March 2002) more than 94000 characters are encoded.
- Characters are identified by their names (except Chinese, Japanese, Korean characters)
- XML can use a subset, but not something completely different.
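The remark about character names can be tried out with Python's standard unicodedata module, as in the minimal sketch below. The helper name describe is an assumption for this example. Note that CJK unified ideographs do have a name in the database, but it is derived mechanically from the code point rather than being descriptive.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX NAME' for one character ('<no name>' if it has none)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}"


# A Latin letter carries a descriptive name.
print(describe("a"))

# A CJK ideograph carries a name built from its code point.
print(describe("\u4E00"))

# Names can also be used to look a character up.
print(unicodedata.lookup("LATIN SMALL LETTER A") == "a")
```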
Writing System Declaration (WSD)
Towards a new WSD mechanism (‘WSD-NG’)
- The WG has been in operation since July 2001
- There have been two face-to-face meetings (and a lot of email)
- The WG plans to unbundle the language/script/encoding declaration of the WSD
- Information about the work and the current draft documents is available at http://www.tei-c.org/Activities/CE
- Currently, there are three proposed modules of WSD-NG:
Defining new characters
- This was the item most hotly debated at the second meeting in Tuebingen
- The following suggestions have been discussed:
- The implementation of both of the above differs considerably depending on whether or not entity references are available to
- The problem with using markup constructs is that they cannot be used in attribute values
- Since this affects not just characters but all language properties, the use of attributes in the TEI Guidelines might need some reconsideration
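The restriction on attribute values can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the element and attribute names (name, g, ref, key) are hypothetical and chosen only for illustration, not taken from the working group's drafts.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical markup construct standing in for a non-Unicode character.
in_content = '<name>Wittern <g ref="#char1"/></name>'

# The same construct inside an attribute value: "<" is not allowed there.
in_attribute = '<name key="Wittern <g ref=\'#char1\'/>"/>'

# A numeric character reference, by contrast, is allowed in an attribute.
char_ref = '<name key="&#x4E00;"/>'

print(is_well_formed(in_content))
print(is_well_formed(in_attribute))
print(ET.fromstring(char_ref).get("key") == "\u4E00")
```

Only the character reference survives in the attribute, which is why a markup-based mechanism for new characters puts pressure on every place where the Guidelines keep text in attributes.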
Character properties
Linguistic description of writing systems
- Eric Albright's Design of an electronic method for describing writing systems serves as a good starting point for this module
- Work has begun (see CEW05) to enumerate the features needed.
- A lot more needs to be done here.
- One of the issues here is that text encoders frequently have to deal with two instances of a given writing system:
- A frequent requirement for digital texts is to be able to use (at least) either of these for rendering.

