TEI META Task Force: Status Report [MEW01] Sebastian Rahtz May 2003

Licensed under

Created from scratch by Sebastian Rahtz, November 14th 2000

16 September 2007 Converted to P5

Sebastian Rahtz. Revision: #1

Choice of schema language for the TEI

The task force considered whether the content models of TEI elements should be expressed in the source TEI Guidelines in:

- XML DTD language, as at present
- W3C Schema language
- OASIS Relax NG language
- A new notation of the TEI's devising

There was an almost instant preference for Relax NG, since it:

- Uses XML syntax, enabling easy validation and analysis
- Is very readable, and fairly easy to relate to DTD
- Is well implemented by different processors, and so immediately usable
- Uses W3C Schema datatyping
- Seems likely to be included in the forthcoming ISO DSDL
- Can be converted to W3C Schema if needed

It was therefore agreed to convert the TEI Guidelines so that element content models are represented in Relax NG syntax in its own namespace.

Sebastian Rahtz presented a paper at XML Europe 2002 on how to convert the TEI XML content models to Relax NG. This work, slightly refined, is the basis for an experimental version of TEI P5. A set of derived sample TEI schemas is available for immediate use.

Skeleton work plan for redesigning ODD

It is intended to suggest and implement changes to the ODD system in the following order:

1. Clear up the details of tagDoc
2. Revise (part 1) the TSD tagdoc and make it a standard topping
3. Convert (part 1) the Guidelines to conform to that schema (*)
4. Convert the elemDecl contents to RelaxNG schema (*)
5. Convert attributes (where automatically possible) to use new datatyping scheme (*)
6. Add new entDoc elements defining the datatypes (*)
7. Examine and rework the string and entDoc elements to remove remaining SGML/XML material
8. Rewrite and test the scripts which:
   - generate schemas (*)
   - generate DTDs
   - generate HTML version of the Guidelines
   - generate PDF version of the Guidelines

The items marked (*) have been completed.

9. Clear up the details of higher-level class system
10. Revise (part 2) the TSD tagdoc
11. Convert (part 2) the Guidelines to conform to that schema

Work plan beyond ODD: towards P5

The following tasks need to be completed in order to create P5:

- Make corrections of known errors
- Assess all the attribute datatypes and decide whether:
  - A new datatype should be created (when more than 2 or 3 attributes have the same pattern)
  - An attribute which is now simple text should be reconsidered as a tokenized attribute
  - Extra facets should be added to further refine datatypes
- Assess elements to see whether those with plain text bodies can be datatyped
- Consider all element content models to decide whether they are too restrictive or too loose; consider whether some of the simplifying facilities available in RelaxNG (e.g. interleave) should be used.

Work on ODD markup
attList

We currently have a structure called attList which contains one or more attDef elements, which have datatype and default children. The default holds both things like #IMPLIED and m, while datatype has a mixture of CDATA, 63 and %ISO-date;. It is suggested that:

- attDef has a boolean attribute required
- the default element should only be used to hold genuine default strings or tokens. It will be optional. Some notation will be needed to encompass %INHERITED;
- datatype has a mandatory target attribute, which points to an entDoc defining the datatype. This gives us an extra abstract layer over XML schema datatypes.

Most token choice attributes would be boiled down to genuine datatypes, so all of Y|N, yes | no and true|false would be datatype target="BOOLEAN". In the entDoc, we expound on this and map to the relevant W3C Schema datatype (see section ). Where the choice is limited, e.g. A | B, it is recorded as a set of enumerated values, defined in the body of the datatype:

    <datatype target="TOKEN">
      <rng:choice>
        <value>A</value>
        <value>B</value>
      </rng:choice>
    </datatype>
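The boiling-down of token choices described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `datatype_for` and the `BOOLEAN_CHOICES` list are hypothetical, and the element names simply follow the fragment in the text.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
ET.register_namespace("rng", RNG_NS)

# Token sets treated as boolean (assumption: taken from the examples in the text).
BOOLEAN_CHOICES = [{"y", "n"}, {"yes", "no"}, {"true", "false"}]


def datatype_for(declared: str) -> ET.Element:
    """Build a <datatype> element for a DTD-style token choice such as 'Y|N'."""
    tokens = [t.strip() for t in declared.split("|") if t.strip()]
    datatype = ET.Element("datatype")
    if {t.lower() for t in tokens} in BOOLEAN_CHOICES:
        # Genuine datatype: only the pointer is needed, no enumerated body.
        datatype.set("target", "BOOLEAN")
        return datatype
    # Limited choice: record the enumerated values in the body of the datatype.
    datatype.set("target", "TOKEN")
    choice = ET.SubElement(datatype, f"{{{RNG_NS}}}choice")
    for token in tokens:
        # <value> is left unprefixed, as in the text; a real Relax NG schema
        # would need it in the Relax NG namespace.
        ET.SubElement(choice, "value").text = token
    return datatype
```

Calling `ET.tostring(datatype_for("A | B"), encoding="unicode")` serializes the result for inspection.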

Datatyping in attributes

The task force is asked to use W3C Schema datatypes in the TEI as much as possible.

An analysis of all the current datatype values shows that they fall into four categories:

1. Standard XML datatypes (ID, IDREFS, NMTOKENS, etc)
2. Abstract datatypes linked to entities in the Guidelines (there are only 2 or 3 of these)
3. Text with no conditions
4. Text, but with a fixed set of possibilities

We can deal with the first of these easily; they all map into schema datatypes. The second is simply an indirection. The third will remain as text (but see below). It is suggested that the fourth should be split into:

- attributes where the range of possibilities fits a W3C datatype, or where it makes sense to at least have a common set of values across the TEI
- attributes which really should have token values

However, it is likely that some attributes are mis-classified at present; some of those which are datatyped as free text should be tokenized, and some which are tokenized should be completely free text. It is important to separate out attributes which have completely arbitrary text from those where the text is tokenizable (see section ).
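As a rough illustration of the four categories, the following Python sketch sorts a declared value by its surface form. The function name `classify`, the category labels, and the rule that anything unrecognized counts as unconstrained text are all assumptions made for the example, not part of the task force's analysis.

```python
# DTD attribute types with direct schema equivalents (category 1).
STANDARD_XML_TYPES = {
    "ID", "IDREF", "IDREFS", "NMTOKEN", "NMTOKENS", "ENTITY", "ENTITIES",
}


def classify(declared: str) -> str:
    """Assign a declared attribute value to one of the four categories."""
    value = declared.strip()
    if value in STANDARD_XML_TYPES:
        return "standard"
    if value.startswith("%") and value.endswith(";"):
        # Parameter-entity reference such as %ISO-date; (category 2).
        return "abstract"
    if "|" in value:
        # Fixed set of possibilities such as Y | N (category 4).
        return "enumerated"
    # Assumption: everything else (e.g. CDATA) is unconstrained text (category 3).
    return "text"
```

A check of this kind only reflects how an attribute is currently declared; it cannot detect the mis-classified attributes mentioned above, which need editorial review.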

It is suggested that the system be rationalized so that all the existing datatype entries are replaced by pointers to one of the following datatypes:

Name        Relax NG representation
ANYURI      <rng:data type="anyURI"/>
BOOLEAN     <rng:data type="boolean"/>
DATE        <rng:data type="date"/>
DATETIME    <rng:data type="dateTime"/>
DURATION    <rng:data type="duration"/>
ENTITIES    <rng:data type="ENTITIES"/>
ENTITY      <rng:data type="ENTITY"/>
EXTPTR      <rng:text/>
FLOAT       <rng:data type="float"/>
FORMULA     <rng:text/>
ID          <rng:data type="ID"/>
IDREF       <rng:data type="IDREF"/>
IDREFS      <rng:data type="IDREFS"/>
LANGUAGE    <rng:text/>
NAME        <rng:data type="NCName"/>
NMTOKEN     <rng:data type="NMTOKEN"/>
NMTOKENS    <rng:data type="NMTOKENS"/>
SEX         <rng:choice> <value>m</value> <value>f</value> <value>u</value> <value>x</value> </rng:choice>
TEXT        <rng:text/>
TIME        <rng:data type="time"/>
TOKEN       <rng:empty/>
UBOOLEAN    <rng:choice> <value>true</value> <value>false</value> <value>unknown</value> <value>unspecified</value> </rng:choice>
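The indirection from an abstract datatype name to its Relax NG representation might be implemented along the following lines. This is a minimal sketch: the function name `rng_pattern` and the three lookup tables are hypothetical, the W3C Schema type names use their case-sensitive spellings (for example `NCName`), and TOKEN is left out because its values come from the body of each individual datatype.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"

# Abstract names that map straight onto a W3C Schema datatype.
XSD_TYPES = {
    "ANYURI": "anyURI", "BOOLEAN": "boolean", "DATE": "date",
    "DATETIME": "dateTime", "DURATION": "duration", "ENTITIES": "ENTITIES",
    "ENTITY": "ENTITY", "FLOAT": "float", "ID": "ID", "IDREF": "IDREF",
    "IDREFS": "IDREFS", "NAME": "NCName", "NMTOKEN": "NMTOKEN",
    "NMTOKENS": "NMTOKENS", "TIME": "time",
}
# Abstract names that remain plain text.
TEXT_TYPES = {"EXTPTR", "FORMULA", "LANGUAGE", "TEXT"}
# Abstract names defined by a closed list of values.
ENUM_TYPES = {
    "SEX": ["m", "f", "u", "x"],
    "UBOOLEAN": ["true", "false", "unknown", "unspecified"],
}


def rng_pattern(name: str) -> ET.Element:
    """Return the Relax NG pattern for an abstract datatype name."""
    if name in XSD_TYPES:
        # A real schema also needs a datatypeLibrary attribute on an ancestor.
        return ET.Element(f"{{{RNG_NS}}}data", {"type": XSD_TYPES[name]})
    if name in TEXT_TYPES:
        return ET.Element(f"{{{RNG_NS}}}text")
    if name in ENUM_TYPES:
        choice = ET.Element(f"{{{RNG_NS}}}choice")
        for value in ENUM_TYPES[name]:
            # Unprefixed <value>, as in the listing; a real Relax NG schema
            # would need it in the Relax NG namespace.
            ET.SubElement(choice, "value").text = value
        return choice
    raise KeyError(f"no Relax NG representation recorded for {name!r}")
```

Keeping the mapping in data rather than in code means a new abstract datatype can be added without touching the schema-generation scripts.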

The table "Current datatypes and proposed replacements" lists some current datatype values and how they map to the new scheme. The table "Attributes with datatypes assigned" shows 180 attributes which can automatically be given non-text, non-token datatypes.

Character encoding in attributes

The character encoding workgroup discussed how to deal with attributes which need to use the full range of characters (e.g. variations and names). This task force agreed that the correct approach was to support an alternative notation, by which these attributes could optionally be recorded as elements if the TEI user wishes to use some form of character encoding not permitted in TEI attributes. TEI P5 will therefore:

- Record which attributes have the extended property of being representable as elements
- When making normal DTDs, only support the traditional scheme of attributes
- Allow for special DTDs (from son-of-pizzachef) which support only the element alternative
- When making schemas, support both attribute and element forms

Processing applications (e.g. XSLT stylesheets) will have to decide whether to support both systems, or only one.
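One way a schema generator could support both attribute and element forms is to emit a Relax NG choice between the two. The Python sketch below is purely illustrative: the text does not say how the generated schemas express the alternative, and the helper names `rng` and `attribute_or_element`, as well as the attribute name used in the usage note, are hypothetical.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"


def rng(tag, parent=None, **attrs):
    """Create an element in the Relax NG namespace, optionally under a parent."""
    name = f"{{{RNG_NS}}}{tag}"
    if parent is None:
        return ET.Element(name, attrs)
    return ET.SubElement(parent, name, attrs)


def attribute_or_element(name: str) -> ET.Element:
    """Pattern accepting the same information as an attribute or as a child element."""
    choice = rng("choice")
    rng("text", rng("attribute", choice, name=name))  # traditional attribute form
    rng("text", rng("element", choice, name=name))    # element alternative
    return choice
```

For example, `attribute_or_element("reg")` yields a choice with one `attribute` and one `element` pattern, both named `reg` and both holding text.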

There are over 300 attributes which currently have a text datatype; this includes a good many elements which have a type attribute. The TEI editors will have to decide which of these should be classified as true text (see EDW79).

Namespaces and fragment inclusion

The task force is asked to consider how two situations can be catered for:

- Using fragments of another markup language in TEI XML
- Using fragments of TEI in another markup language

The answer to both of these is XML namespaces. Two vocabularies can be combined if the elements identify their namespace. Using schemas, it is easy to validate a document which goes off into different namespaces at various points; this is demonstrated in a TEI RelaxNG schema which redefines formula to have MathML elements as content. However, to demonstrate the other way round (fragments of TEI embedded in another XML vocabulary) would require assigning a namespace for the TEI. This could be a single namespace for all TEI elements, or a different one for each tagset. The task force considered that the latter would be an unnecessary complication for users, but that a namespace (perhaps http://www.tei-c.org/P5) for the TEI would be a good idea. However, there are two major problems with this, which have prevented the task force from implementing it:

- All existing TEI documents would be invalid, as they would be in an empty namespace. It would be a fairly small fix for each instance to add a namespace declaration to the root element, but that would make it fail with existing DTDs.
- All existing XML processing tools would fail to work with new documents; for instance, XSLT stylesheets which process a current (empty namespace) TEI.2 would fail to identify the new TEI.2 xmlns="http://www.tei-c.org/P5". It will be possible in XSLT 2.0 to write a stylesheet to work with both old and new TEI documents, but using XSLT 1.0 it will be much harder; all stylesheets will need a large rewrite.

This issue requires further investigation.
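The second problem can be reproduced outside XSLT. The following sketch uses Python's standard `xml.etree.ElementTree` rather than a stylesheet, with the namespace URI the text suggests only as a possibility. The sample documents and function names are hypothetical. It shows a namespace-unaware lookup silently finding nothing in a namespaced document, and one possible workaround that compares local names only.

```python
import xml.etree.ElementTree as ET

# Namespace suggested in the text as a possibility, not an adopted value.
TEI_NS = "http://www.tei-c.org/P5"

OLD_DOC = "<TEI.2><text><p>old</p></text></TEI.2>"
NEW_DOC = f'<TEI.2 xmlns="{TEI_NS}"><text><p>new</p></text></TEI.2>'


def local_name(tag: str) -> str:
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def paragraphs_naive(xml: str) -> list:
    """Look for <p> in the empty namespace only, as existing tools do."""
    return [p.text for p in ET.fromstring(xml).iter("p")]


def paragraphs_any_namespace(xml: str) -> list:
    """Match <p> by local name, whatever namespace it is in."""
    return [e.text for e in ET.fromstring(xml).iter() if local_name(e.tag) == "p"]
```

Here `paragraphs_naive(NEW_DOC)` returns an empty list while `paragraphs_any_namespace(NEW_DOC)` returns `["new"]`. Matching on local names alone gives up the protection namespaces provide against name clashes, so it is a transitional device at best.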

A replacement for the Pizza chef

This has not been discussed by the task force, but Sebastian Rahtz has written a paper on the subject for XML Europe 2003. This shows that it is possible to have a simple web application which generates RelaxNG schemas, W3C schemas, and XML DTDs on demand; the prototype, Roma, works solely with the TEI class system, and provides a better interface to it than the Pizza Chef. There are, however, facilities which Roma does not yet provide:

- Adding elements which do not simply follow the class system, but have arbitrary content models and attribute lists. The problem here is how to ask the user to specify the new material without directly writing schema code. It remains to be seen how many requests we will receive for this feature.
- Changing or limiting the content model of elements which do not follow the class system fully. The correct answer to this may be to revise the TEI so that all elements do use the class system 100%, but in the short term this is unrealistic. It may be possible to devise an interface for editing content models.
- Adding entire classes to the TEI. This is a complex matter, which it is unlikely we can provide in a simple web interface.

Tables

Current datatypes and proposed replacements:

Current  ->  New datatype (values)

%ISO-date;  ->  DATE
%extPtr;  ->  EXTPTR
%formulaNotations;  ->  FORMULA
Y | N  ->  BOOLEAN
Y | N | U  ->  UBOOLEAN
YES | NO  ->  BOOLEAN
all | one | none  ->  TOKEN (all, one, none)
all | some | none  ->  TOKEN (all, some, none)
free | unknown | restricted  ->  TOKEN (free, unknown, restricted)
light | sound | prop | block  ->  TOKEN (light, sound, prop, block)
m | f | u  ->  SEX
m | f | u | x  ->  SEX
none | some | all  ->  TOKEN
silent | tags  ->  TOKEN
y | n | u  ->  UBOOLEAN
yes | no  ->  BOOLEAN
Y | N | I | M | F  ->  TOKEN (Y, N, I, M, F)
Y | N | U  ->  UBOOLEAN
Y | N | partial  ->  TOKEN (Y, N, partial)
Y | N  ->  BOOLEAN
Y | N  ->  BOOLEAN
a | m | j | s | u  ->  TOKEN (a, m, j, s, u)
am | pm | 24hour | descriptive  ->  TOKEN (am, pm, 24hour, descriptive)
audio | video  ->  TOKEN (audio, video)
closed | semi | open  ->  TOKEN (closed, semi, open)
composite | uniform  ->  TOKEN (composite, uniform)
data | rend | std | nonstd | unknown  ->  TOKEN (data, rend, std, nonstd, unknown)
eq | ne  ->  TOKEN (eq, ne)
eq | ne | gt | ge | lt | le  ->  TOKEN (eq, ne, gt, ge, lt, le)
eq | ne | lt | le | gt | ge  ->  TOKEN (eq, ne, lt, le, gt, ge)
eq | ne | sb | ns  ->  TOKEN (eq, ne, sb, ns)
eq | ne | sb | ns | lt | le | gt | ge  ->  TOKEN (eq, ne, sb, ns, lt, le, gt, ge)
excl | incl  ->  TOKEN (excl, incl)
fiction | fact | mixed | inapplicable  ->  TOKEN (fiction, fact, mixed, inapplicable)
high | medium | low | unknown  ->  TOKEN (high, medium, low, unknown)
horizontal | vertical  ->  TOKEN (horizontal, vertical)
initial | medial | final | unknown | complete  ->  TOKEN (initial, medial, final, unknown, complete)
internal | external  ->  TOKEN (internal, external)
int | real  ->  TOKEN (int, real)
lexical | punc | lexpunc | digit | space | DL | LD | dia | joiner | other  ->  TOKEN (lexical, punc, lexpunc, digit, space, DL, LD, dia, joiner, other)
location-referenced | double-end-point | parallel-segmentation  ->  TOKEN (location-referenced, double-end-point, parallel-segmentation)
model | atts | both  ->  TOKEN (model, atts, both)
new | update  ->  TOKEN (new, update)
none | partial | complete | inapplicable  ->  TOKEN (none, partial, complete, inapplicable)
pe | ge  ->  TOKEN (pe, ge)
perc | real  ->  TOKEN (perc, real)
req | mwa | rec | rwa | opt  ->  TOKEN (req, mwa, rec, rwa, opt)
role | list  ->  TOKEN (role, list)
root | branches  ->  TOKEN (root, branches)
s | w | ws | sw | m | x  ->  TOKEN (s, w, ws, sw, m, x)
silent | tags  ->  TOKEN (silent, tags)
single | composite | frags | unknown  ->  TOKEN (single, composite, frags, unknown)
single | set | bag | list  ->  TOKEN (single, set, bag, list)
smooth | latching | overlap | pause  ->  TOKEN (smooth, latching, overlap, pause)
tei | iso | national | private | none  ->  TOKEN (tei, iso, national, private, none)
tempo | loud | pitch | tension | rhythm | voice  ->  TOKEN (tempo, loud, pitch, tension, rhythm, voice)
to | from | both | none  ->  TOKEN (to, from, both, none)
unit | set | bag | list  ->  TOKEN (unit, set, bag, list)
y | n | unspecified  ->  UBOOLEAN
y | n  ->  BOOLEAN
yes | abb | init  ->  TOKEN (yes, abb, init)
yes | no  ->  BOOLEAN
yes | no  ->  BOOLEAN
CDATA  ->  TOKEN
ENTITIES  ->  ENTITIES
ENTITY  ->  ENTITY
ID  ->  ID
IDREF  ->  IDREF
IDREFS  ->  IDREFS
NAME  ->  NAME
NMTOKEN  ->  NMTOKEN
NMTOKENS  ->  NMTOKENS

Attributes with datatypes assigned:

element                   attribute     datatype

analysis                  ana           IDREFS
declarable                default       BOOLEAN
declaring                 decls         IDREFS
dictionaries              location      IDREF
dictionaries              mergedin      IDREF
dictionaries              opt           BOOLEAN
edit                      resp          IDREF
formPointers              target        IDREF
global                    id            ID
global                    id            ID
global                    lang          IDREF
interpret                 inst          IDREFS
linking                   corresp       IDREFS
linking                   synch         IDREFS
linking                   sameAs        IDREF
linking                   copyOf        IDREF
linking                   next          IDREF
linking                   prev          IDREF
linking                   exclude       IDREFS
linking                   select        IDREFS
pointer                   targOrder     UBOOLEAN
pointerGroup              domains       IDREFS
readings                  hand          IDREF
TEIform                   TEIform       NAME
terminology               grpPtr        IDREF
terminology               depPtr        IDREF
timed                     start         IDREF
timed                     end           IDREF
xPointer                  doc           ENTITY
xPointer                  from          EXTPTR
xPointer                  to            EXTPTR
abbr                      resp          IDREF
add                       resp          IDREF
add                       hand          IDREF
addSpan                   resp          IDREF
addSpan                   hand          IDREF
addSpan                   to            IDREF
admin                     date          DATE
alt                       targets       IDREFS
app                       from          IDREF
app                       to            IDREF
arc                       from          IDREF
arc                       to            IDREF
att                       tei           BOOLEAN
birth                     date          DATE
catRef                    target        IDREFS
catRef                    scheme        IDREF
cell                      rows          NONNEGATIVEINTEGER
cell                      cols          NONNEGATIVEINTEGER
certainty                 target        IDREFS
classCode                 scheme        IDREF
damage                    resp          IDREF
damage                    hand          IDREF
date                      value         DATE
del                       resp          IDREF
del                       hand          IDREF
delSpan                   resp          IDREF
delSpan                   hand          IDREF
delSpan                   to            IDREF
distance                  exact         UBOOLEAN
docDate                   value         DATE
eLeaf                     value         IDREF
eTree                     value         IDREF
event                     who           IDREF
event                     iterated      UBOOLEAN
expan                     resp          IDREF
f                         fVal          IDREFS
fAlt                      mutExcl       BOOLEAN
figure                    entity        ENTITY
form                      codedCharSet  IDREF
form                      entityStd     ENTITIES
form                      entityLoc     ENTITIES
formula                   notation      FORMULA
fs                        feats         IDREFS
fsdDecl                   fsd           ENTITY
gap                       resp          IDREF
gap                       hand          IDREF
gi                        tei           BOOLEAN
gloss                     target        IDREF
graph                     order         NONNEGATIVEINTEGER
graph                     size          NONNEGATIVEINTEGER
handShift                 new           IDREF
handShift                 old           IDREF
handShift                 resp          IDREF
iNode                     value         IDREF
iNode                     children      IDREFS
iNode                     parent        IDREF
iNode                     ord           BOOLEAN
iNode                     follow        IDREF
iNode                     outDegree     NONNEGATIVEINTEGER
join                      targets       IDREFS
keywords                  scheme        IDREF
kinesic                   who           IDREF
kinesic                   iterated      UBOOLEAN
language                  iso639        LANGUAGE
language                  wsd           ENTITY
leaf                      value         IDREF
leaf                      parent        IDREF
leaf                      follow        IDREF
link                      targets       IDREFS
move                      who           IDREFS
move                      perf          IDREFS
msr                       value         FLOAT
msr                       valueTo       FLOAT
nbr                       value         FLOAT
nbr                       valueTo       FLOAT
node                      value         IDREF
node                      adjTo         IDREFS
node                      adjFrom       IDREFS
node                      adj           IDREFS
node                      inDegree      NONNEGATIVEINTEGER
node                      outDegree     NONNEGATIVEINTEGER
node                      degree        NONNEGATIVEINTEGER
note                      anchored      BOOLEAN
note                      target        IDREFS
note                      targetEnd     IDREFS
occupation                scheme        IDREF
occupation                code          IDREF
pause                     who           IDREF
person                    sex           SEX
personGrp                 sex           SEX
ptr                       target        IDREFS
q                         direct        UBOOLEAN
rate                      value         FLOAT
rate                      valueTo       FLOAT
ref                       target        IDREFS
relation                  active        IDREFS
relation                  passive       IDREFS
relation                  mutual        BOOLEAN
respons                   target        IDREFS
restore                   resp          IDREF
restore                   hand          IDREF
root                      value         IDREF
root                      children      IDREFS
root                      ord           BOOLEAN
root                      outDegree     NONNEGATIVEINTEGER
setting                   who           IDREFS
shift                     who           IDREF
socecStatus               scheme        IDREF
socecStatus               code          IDREF
sound                     discrete      UBOOLEAN
sp                        who           IDREFS
span                      from          IDREF
span                      to            IDREF
state                     length        NONNEGATIVEINTEGER
step                      length        NONNEGATIVEINTEGER
step                      from          EXTPTR
step                      to            EXTPTR
supplied                  hand          IDREF
symbol                    terminal      BOOLEAN
table                     rows          NONNEGATIVEINTEGER
table                     cols          NONNEGATIVEINTEGER
tag                       TEI           BOOLEAN
tagUsage                  occurs        NONNEGATIVEINTEGER
tagUsage                  ident         NONNEGATIVEINTEGER
tagUsage                  render        IDREF
tech                      perf          IDREFS
teiHeader                 date.created  DATE
teiHeader                 date.updated  DATE
time                      value         TIME
timeline                  origin        IDREF
timeRange                 from          TIME
timeRange                 to            TIME
tree                      arity         NONNEGATIVEINTEGER
tree                      order         NONNEGATIVEINTEGER
triangle                  value         IDREF
u                         who           IDREFS
unclear                   hand          IDREF
vAlt                      mutExcl       BOOLEAN
vocal                     who           IDREF
vocal                     iterated      UBOOLEAN
when                      since         IDREF
witDetail                 target        IDREFS
writing                   who           IDREF
writing                   script        IDREF
writing                   gradual       UBOOLEAN
writingSystemDeclaration  date          DATE