TEI Internationalization proposal


Contents

Summary

The Text Encoding Initiative Guidelines ( http://www.tei-c.org/ ) have been widely adopted by projects and institutions in many countries in Europe, North America, and Asia, and are used for encoding texts in dozens of languages. The TEI community is broadly international and multilingual, and its geographical reach increases every year. However, the complex encoding of texts at which the TEI excels requires a close understanding of the 522 available elements, and non-English speakers are at a considerable disadvantage in learning and using the Guidelines. The TEI needs to be made more accessible to the international community it seeks to serve.

Ideally, the Guidelines themselves and the TEI tag set would be translated into as many languages as there are user communities. However, translation on this scale is well beyond the resources not only of the TEI but also of most funding bodies. A more realistic approach, which we propose here, is to develop an infrastructure for translation, through which individual language communities may easily produce versions of the essential portions of the Guidelines into other languages. The TEI Guidelines are written in a modular way that makes this infrastructure straightforward to develop. With comparatively modest initial funding, we can produce the necessary framework and sponsor the translation of five initial high-priority languages, drawing on existing efforts already under way in the TEI community. Funding of this proposal will enable us to coordinate these efforts and bring them to prompt and consistent completion.

This proposal provides a rationale for the approach to internationalization on which this work would be based, and describes the technical infrastructure we propose to develop, to maintain the internationalized versions of the TEI material and integrate them with existing tools. It also provides an initial workplan of languages to be addressed. The TEI requests a grant of £10,000, for a project to produce the following:
  1. a working architecture for delivering an internationalized TEI
  2. an interface for translators to use in creating translated versions of components of the TEI Guidelines
  3. translations of the TEI documentation into five languages which the TEI considers to be the highest priority
  4. a framework for coordinating further translation efforts to be undertaken on a volunteer basis, or to be supported by further funding if available

Approach to translation

The TEI Guidelines contain four distinct areas of information which can be translated. Each can be treated separately:
  1. The detailed descriptive prose of the Guidelines chapters and TEI Lite documentation.
  2. The element, attribute names and suggested attribute values which are put into DTDs and schemas. Thus instead of <addrLine> , the TEI user might prefer to write <líneaDirección> , <ligneAdresse> , <linDireccio> or <AdressZeile> . The TEI has an established system for recording such translations, and preserving the relationship to the original names so that document instances can be put back into canonical form.
  3. The summary technical descriptions of elements or attributes. Thus instead of ‘contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.’, the Spanish-speaking user might find it more helpful to read ‘contiene un único documento TEI, compuesto de una cabecera TEI (TEI header) y un cuerpo de texto (text), aislado o como parte de un elemento corpusTei (teiCorpus)’
  4. The examples of usage for each element. ‘Internationalization’ of these could take the form of simple translation, but in practice localisation would be considerably more useful. Localisation involves choosing examples originating in the target language, which illustrate the element's usage more effectively for a native speaker than a translated example could do.
It may be useful to distinguish between what we might call ‘traditional’ or documentary approaches to translation, which focus on translating the descriptive prose of the Guidelines as a document, and ‘formal’ approaches which focus instead on translating the individual components (examples, element and attribute names, technical descriptions) in a way that enables these components to be used within the formal structures of the TEI as a technical standard. While the first approach may be very useful, the results are more difficult to maintain over the long term and are also more difficult to produce, since they cannot be accomplished in discrete chunks. The latter approach is the one we propose here, since it is more easily maintainable (only the affected elements need to be updated when changes are made to the Guidelines) and can be more easily undertaken in a distributed fashion by collaborative groups.
Some translation work has already been undertaken:
  • There have already been six ‘traditional’ translations of the TEI Lite ( http://www.tei-c.org/Lite/ ) documentation into other languages. These have not covered translation of the element names or technical reference documentation. They are in wide use, however, and have created a need for more extensive translations of the Guidelines themselves.
  • The French Groupe d'experts n° 8 within CN 357 (Commission de normalisation «Modélisation, production et accès aux documents») of the CG 46 (Commission générale «Information et documentation») at AFNOR has an interest in TEI translation. Amongst other goals, this group intends to translate the definitions of the TEI elements and attributes into French. So far, they have worked in a ‘traditional way’ on some chapters of the P4 and P5 versions. Dissemination of the resulting French version of these chapters is very limited.
  • Some ‘formal’ work has also been undertaken on translating element and attribute names; Alejandro Bia and Arno Mittelbach have prepared translation sets for Catalan, Spanish, and German. This work is integrated into the Roma ( http://www.tei-c.org/Roma/ ) application, allowing users to create tailored schemas in one of the supported languages. However, while this work is useful, it is not by itself the most effective way to proceed. For example, many of the element names are in an abbreviated form of English (eg <respStmt> ) which are not easy to translate sensibly, and because the abbreviated names are relatively easy to recognize for people used to reading Latin script. Furthermore, unless the reference descriptions are also translated, the element names by themselves do not give a clear idea of what the element is for. Using <infoResp> instead of <respStmt> is not as helpful as translating the description ‘supplies a statement of responsibility for someone responsible for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply.’

Study of the work described aboves suggests that translating the reference descriptions, followed by translation of element names, is likely to be the most effective way to promote the TEI and support its use in non-English-speaking communities. Translation or localisation of examples would be an important further step for some communities.

The TEI Guidelines are constructed in a way that lends itself to this approach. The documentation is built by assembling units stored in a database (which constitutes the source of the TEI Guidelines). Translation can thus be done at the level of these units. This allows for several advantages. The most important of these is the optimization of the translation and its ongoing maintenance. Individual information units (element names, documentation segments, etc.) are stored togther with their translations. When a unit is updated, its translations can be updated as well. New units and new translations can be added as needed. The approach also allows for more advanced localisation of the documentation: users can get documentation with crucial technical parts (eg. definitions) in any available language, or in multiple languages. Figure 1, Example of reference documentation, shows an example of the current display of reference documentation. The labels in the left-hand column can appear in translated form, as will the description text. Figure 2, Example of reference documentation in Japanese shows a full-scale translation into Japanese, while Figure 3, Example of reference documentation in Spanish shows the same thing in Spanish.
Example of reference documentation
Figure 1. Example of reference documentation
Example of reference documentation in Japanese
Figure 2. Example of reference documentation in Japanese
Example of reference documentation in Spanish
Figure 3. Example of reference documentation in Spanish
It should be noted that the examples pose special problems for translation, as they require cultural as well as linguistic localisation. Translating the examples themselves is difficult not only from a practical viewpoint, but also because the significance of the example as an illustration of a particular encoding issue may not translate well. A quotation from Dickens or Beowulf would need to be adapted for a Chinese readership, for example, in order to make the point it illustrates clear. Thus
<lg> <l>Sire Thopas was a doghty swayn;</l> <l>White was his face as payndemayn,</l> <l>His lippes rede as rose;</l> <l>His rode is lyk scarlet in grayn,</l> <l>And I yow telle in good certayn,</l> <l>He hadde a semely nose.</l> </lg>
is probably not a very familiar example of verse for users in China. In the Chinese version of the TEI Guidelines, a suitable Chinese poem would be more helpful. In some cases it may be sufficient to translate the element names and leave the quotation itself in the original language. Best of all would be to find substitute examples in the target language, but that is labor-intensive and raises issues concerning whether the same principle is being illustrated in each case. Translating examples is beyond the scope of the current proposal, but it may be possible to offer translated element names as an intermediary step.

Organization and Language Coverage

Because the TEI Guidelines document a complex and well-established set of methods, we feel that any translation effort needs to work closely with the community of TEI practitioners in the target language. We do not believe that it is sensible to simply pay a professional translator to convert the words, even if we had sufficient funds; the descriptive texts are often short and need genuine experience in text encoding to translate well. Each translation set will also need to be validated by other TEI users in the relevant language community. In addition, with so many languages potentially needing attention, it is important to make any available funds go as far as possible, by using them to mobilize and coordinate existing volunteer efforts. Within the TEI community there are a large number of groups with translation expertise who are willing to collaborate with the TEI in this effort. These include the following:
Chinese Marcus Bingenheimer Chung-hwa Institute of Buddhist Studies, Taipei
French Veronika Lux University of Nancy
German Werner Wegstein Wuerzburg University
Hindi Paul Richards UGS (The PLM Company)
Italian Fabio Ciotti University of Roma
Japanese Ohya Kazushi Tsurumi University, Yokohama
Polish Radoslaw Moszczynski Warsaw University
Romanian Dan Matei CIMEC - Institutul de Memorie Culturala, Bucharest
Slovenian Tomaž Erjavec,Matija Ogrin Jozef Stefan Institute, Ljubljana
Spanish Manuel Sánchez Miguel de Cervantes Digital Library
Tibetan Linda Patrik, Tensin Namdak www.nitartha.org

We therefore propose to invite collaborating institutions to work with us on each language. For the initial set of languages, a small budget will be allocated to each in roughly equal proportions, to be expended in whatever way will have the most impact in the local context. The funding may be used to pay graduate students, to pay a supervisor to organize volunteer translators and check their work, to pay a single translator, or to support travel and participation costs in group meetings. In all cases, the funding serves not as full payment for a translation, but as support which makes it possible for local efforts to be completed successfully, and supervised so that they are consistent. The collaborators will supervise the creation of the translations. The initial translation will then undergo a check by a second collaborator. The final results will undergo a check by the TEI editors before final acceptance. All of the work will be done through a web interface to a central data repository (see below), so that translators can see the intermediate results of their work, and so that the TEI editors and Council can monitor progress and check results as they emerge.

Our initial choice of languages is based on the TEI Council's assessment of the most urgent areas of need, and on the level of available collaborative support. From among the initial proposals we received, we have chosen as the first five for support under this grant the following languages and collaborators:
French: University of Nancy
Spanish: Cervantes Digital Library
German: University of Würzburg
Chinese: Chung-hwa Institute of Buddhist Studies, Taipei
Japanese: Tsurumi University
We will also provide limited funding for the infrastructure work.

Translation infrastructure

In some ways, the translation infrastructure funded by this grant is the most important long-term result of the project, since it will provide a framework for further translation efforts which can be conducted on a volunteer basis, and also for future grants to support translation into specific languages. The infrastructure will provide for workflow management and support for the work of translation. Its development includes three components:
  1. Develop a user-friendly environment through which translators can get access to the individual text units for translation, and upload translated text units. This environment will take the form of a web interface to the underlying data repository in which the TEI documentation is stored. The interface will include features to allow for review of translated material, and viewing of the translations in their context, so that translators can see the ongoing results of their work. This will also enable language communities to begin testing and commenting on the translations as soon as they are begun, rather than awaiting a compilation process once everything is complete.
  2. Make the necessary technical developments to Roma (the tool which compiles the TEI source data to produce specific schemas and documentation) so that it permits more advanced and flexible localizations of the TEI documentation. For example, the user may be permitted to generate any of the following combinations (where ‘names’ means tag and attribute names and attribute values, and ‘descriptions’ means the contents of the <gloss> and <desc> elements in the TEI source):
    • canonical: English names, descriptions in English
    • local descriptions: English names, descriptions in chosen language)
    • local names: names designed to make sense to a speaker of the chosen language, descriptions in English
    • fully localized: both names and descriptions in chosen language
  3. Enhance the delivery infrastructure for TEI (XSLT stylesheets to make HTML and PDF output) so that string constants are available in the chosen language as well as the text. String constants are recurring phrases with specific technical meanings which can easily be translated and then substituted as necessary without regard for the context where they appear. Thus the reference document for an attribute might read ‘Obligatoriu dacă se potriveşte’ for Romanians instead of ‘Mandatory when applicable’.

The infrastructure work will be undertaken by the Research Technologies Service at Oxford University Computing Services.

Work Plan and Deliverables

The groundwork for this project has already been begun, with translation experiments as described and a call for participants. If funding is awarded, we could complete the work described on the following schedule:
January 2007:
Begin work on web interface for translation workflow. Collaborators convene translation teams.
February 2007:
Web interface for workflow is ready for use. Translation teams begin translation process.
August 2007:
Materials are substantially complete, and review process has begun.
September 2007:
Final revisions to TEI delivery infrastructure and to Roma complete.
October 2007:
All materials have been reviewed locally and are ready for final review by TEI Council and editors. Demonstration of system at TEI members meeting.
December 2007:
Project is completed.
The deliverables for the project are:
  1. A web-based interface for submitting translations.
  2. For a minimum of five languages, a translation of each element and attribute name, and their associated description text, maintained as part of the core TEI P5 source.
  3. Support in the Roma schema-generation tool for producing schemas and reference documentation in localized form.
The budget for the project is:
  • Translation of TEI documentation into French, Spanish, German, Chinese, and Japanese: £8500
  • Development of translation environment and web interface; adaptation of Roma tool; and enhancements to TEI stylesheets: £1500
  • Total project cost: £10,000