TEI Character Encoding WG: Language Identification [CEW09]
Language identification
- The identifier consists of at least one ‘primary’ subtag, which may be followed by one or more ‘extended’ subtags.
- Languages are identified by a language subtag, which may be a two-letter code taken from ISO 639-1 or a three-letter code taken from ISO 639-2.
- ISO 639-2 reserves the range 'qaa' through 'qtz' for private-use codes. These codes should be used for language subtags that are not registered.
- A single-letter primary subtag "x" indicates that the whole language tag is for private use.
- Extended language subtags must begin with the letter "s". They must follow the primary subtag and precede subtags that define other properties of the language. The order is significant.
- Four-character subtags are interpreted as script identifiers taken from ISO 15924.
- Region subtags can be either two-letter country codes taken from ISO 3166 (with exceptions) or three-digit codes from the UN Standard Country Codes for Statistical Use.
- Variant subtags may follow any of the above, but must precede private use extensions.
- Private use extensions are separated from the other subtags by the single letter subtag "x", which must be followed by at least one subtag. They might consist of several subtags separated with "-", but may not exceed a length of 32 characters.
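The ordering rules in the list above can be sketched as a small parser. This is only an illustration under stated assumptions, not a reference implementation: the names `LanguageTag`, `parse_language_tag` and `is_private_language_code` are hypothetical, extended subtags are assumed to be three letters long, any subtag left over before the "x" separator is treated as a variant, and the 32-character limit is assumed to apply to the private use subtags joined with "-". None of these details is fixed by the text above, and the sketch does not check codes against the ISO registries.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LanguageTag:
    language: str = ""          # primary subtag, empty for a wholly private tag
    extended: list = field(default_factory=list)
    script: str = ""
    region: str = ""
    variants: list = field(default_factory=list)
    private: list = field(default_factory=list)


def is_private_language_code(code):
    """True if code lies in the ISO 639-2 private use range qaa..qtz."""
    code = code.lower()
    return len(code) == 3 and "qaa" <= code <= "qtz"


def parse_language_tag(tag):
    subtags = tag.split("-")
    if not all(subtags):
        raise ValueError(f"empty subtag in {tag!r}")
    result = LanguageTag()

    # Everything after the single-letter subtag "x" is private use.
    lowered = [s.lower() for s in subtags]
    if "x" in lowered:
        i = lowered.index("x")
        private = subtags[i + 1:]
        if not private:
            raise ValueError('"x" must be followed by at least one subtag')
        # Assumption: the limit applies to the joined private use subtags.
        if len("-".join(private)) > 32:
            raise ValueError("private use extension exceeds 32 characters")
        result.private = private
        subtags = subtags[:i]
        if not subtags:
            return result       # "x" was the primary subtag

    primary, rest = subtags[0], subtags[1:]
    if not re.fullmatch(r"[A-Za-z]{2,3}", primary):
        raise ValueError(f"unexpected primary subtag {primary!r}")
    result.language = primary.lower()

    pos = 0
    # Assumption: extended subtags are three letters beginning with "s".
    while pos < len(rest) and re.fullmatch(r"[sS][A-Za-z]{2}", rest[pos]):
        result.extended.append(rest[pos].lower())
        pos += 1
    if pos < len(rest) and re.fullmatch(r"[A-Za-z]{4}", rest[pos]):
        result.script = rest[pos].title()
        pos += 1
    if pos < len(rest) and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", rest[pos]):
        result.region = rest[pos].upper()
        pos += 1
    # Assumption: whatever remains before "x" is a variant subtag.
    result.variants = [s.lower() for s in rest[pos:]]
    return result


tag = parse_language_tag("zh-Hant-TW")
print(tag.language, tag.script, tag.region)
```

The sample tag is illustrative. Because the subtag kinds are distinguished by position and length, the parser walks the subtags strictly from left to right, which mirrors the statement that the order is significant.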
Examples of language tags, mostly taken from RFC 3066
It should be noted that the capitalization given here follows established convention (e.g. capital letters for country codes, small letters for language codes), but RFC 3066 does not ascribe any meaning to differences in capitalization.
As can be seen, both RFC 3066 and ISO 639-2 provide extensions that can be employed by private convention. The constructs mentioned above can thus be used to generate identifiers for any language, past or present, used in any area of the world. If such private extensions are used within the context of the TEI, they should be documented within the <language> element of the TEI header, which might also provide a prose description of the language described by the language tag.
While language, region and script can be adequately identified using this mechanism, there is only a very rough provision for expressing a dimension of time for the language of a document; the codes provided (e.g. "grc" for "Greek, Ancient (to 1453)" in ISO 639-2) might not reflect the periods appropriate for the text at hand. Text encoders might express the time window of the language used in the document by means of the extension mechanism defined in RFC 3066 and relate that to a <date> or <dateRange> in the corresponding <language> section of the TEI header.
Equivalences to language identifiers by other authorities can be given in the <language> section as well, but no formal mechanism for doing so has been defined.
The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the lang attribute, including all elements and all attributes where a language might apply. [Note: This will exclude all attributes where a non-textual data type has been specified, for example tokens, boolean values or predefined value lists.]
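The scoping rule can be illustrated with a short sketch that walks a document tree and reports the language in effect for each element. It is a minimal example, assuming an unqualified lang attribute and a made-up sample document; the function name `effective_languages` is hypothetical, and the sketch does not attempt to decide which attributes carry textual data.

```python
import xml.etree.ElementTree as ET


def effective_languages(element, inherited=None, attr="lang"):
    """Yield (element, language) pairs for element and its whole subtree.

    An element without its own lang attribute takes the value in effect
    on its parent; an element with one overrides it for its own subtree.
    """
    current = element.get(attr, inherited)
    yield element, current
    for child in element:
        yield from effective_languages(child, current, attr)


# Hypothetical sample document, used only for illustration.
doc = ET.fromstring(
    '<text lang="en">'
    '<p>Hello <foreign lang="grc">logos</foreign></p>'
    '<p lang="de">Hallo <hi>Welt</hi></p>'
    '</text>'
)

langs = [(el.tag, lang) for el, lang in effective_languages(doc)]
for tag_name, lang in langs:
    print(tag_name, lang)
```

The inherited value is passed down the recursion rather than looked up from each element, since ElementTree elements do not keep a reference to their parent.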
- Phillips, Addison; Davis, Mark. Tags for Identifying Languages. Internet Draft, 2004-04-08 (proposed revision of RFC 3066). http://xml.coverpages.org/draft-phillips-langtags-02a.txt
- Cover, Robin. Language Identifiers in the Markup Context. http://xml.coverpages.org/languageIdentifiers.html
- Bray, Tim; Paoli, Jean; Sperberg-McQueen, C. M.; Maler, Eve (Second Edition); Yergeau, Francois (Third Edition). Extensible Markup Language (XML) 1.0 (Third Edition). W3C Recommendation, 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/

