Choice of schema language for the TEI
The task force considered whether the content models of TEI elements
should be expressed in the source TEI Guidelines in:
- XML DTD language as at present
- W3C Schema language
- OASIS Relax NG language
- A new notation of the TEI's devising
There was an almost instant preference for
[Relax NG](http://www.relaxng.org/), since it:
- Uses XML syntax, enabling easy validation and analysis
- Is very readable, and fairly easy to relate to DTD
- Is well implemented by different processors, and so immediately
usable
- Uses W3C schema datatyping
- Seems likely to be included in the forthcoming ISO DSDL
- Can be converted to W3C schema if needed
and it was therefore agreed to convert the TEI Guidelines so that
element content models are represented in Relax NG syntax in its
own namespace.
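The first point in the list above (XML syntax makes analysis easy) can be illustrated with a minimal sketch. The miniature schema and the helper names `declared_elements` and `references` are invented for this illustration; they are not taken from the TEI schemas. Any XML parser would do, and Python's standard library is used here only as an example.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"

# A made-up miniature grammar, NOT an extract from the TEI Guidelines.
SCHEMA = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="p"/></start>
  <define name="p">
    <element name="p">
      <zeroOrMore>
        <choice>
          <text/>
          <ref name="hi"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>
  <define name="hi">
    <element name="hi"><text/></element>
  </define>
</grammar>
"""


def declared_elements(schema_text):
    """Return the names of all elements declared in a Relax NG grammar."""
    root = ET.fromstring(schema_text)
    return sorted(el.get("name") for el in root.iter("{%s}element" % RNG_NS))


def references(schema_text, pattern_name):
    """Return the named patterns referenced inside one <define>."""
    root = ET.fromstring(schema_text)
    for define in root.iter("{%s}define" % RNG_NS):
        if define.get("name") == pattern_name:
            return sorted({r.get("name") for r in define.iter("{%s}ref" % RNG_NS)})
    return []
```

Calling `declared_elements(SCHEMA)` lists the element names, and `references(SCHEMA, "p")` shows which patterns the content model of `p` depends on; the same questions asked of a DTD would need a dedicated DTD parser.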
Sebastian Rahtz presented a [paper at XML Europe 2002](../../../Talks/xmleurope2002)
on the subject of how to convert the TEI XML content models to
Relax NG. This work, slightly refined, is the basis for an
experimental version of TEI P5. A set of derived
[sample TEI schemas](../../../Schemas/RelaxNG/P4X/)
is available for immediate use.
Skeleton work plan for redesigning ODD
It is intended to suggest and implement changes to the ODD
system in the following order:
- Clear up the details of tagDoc
- Revise (part 1) the TSD tagdoc and make it a standard
topping
- Convert (part 1) the Guidelines to conform to that schema (*)
- Convert the elemDecl contents to RelaxNG schema (*)
- Convert attributes (where automatically possible)
to use new datatyping scheme (*)
- Add new entDoc elements defining the datatypes (*)
- Examine and rework the string and entDoc
elements to remove remaining SGML/XML material
- Rewrite and test the scripts which
  - generate schemas (*)
  - generate DTDs
  - generate HTML version of the Guidelines
  - generate PDF version of the Guidelines
The items marked (*) have been completed.
- Clear up the details of higher-level class system
- Revise (part 2) the TSD tagdoc
- Convert (part 2) the Guidelines to conform to that schema
Work plan beyond ODD: towards P5
The following tasks need to be completed in order to create
P5:
- Make corrections of known errors
- Assess all the attribute datatypes and
decide whether:
  - A new datatype should be created (when more than 2 or 3
  attributes have the same pattern)
  - An attribute which is now simple text should be reconsidered as
  a tokenized attribute
  - Extra facets should be added to further refine datatypes
- Assess elements to see whether those with plain text bodies
can be datatyped
- Consider all element content models to decide whether they are
too restrictive or too loose; consider whether some of the
simplifying facilities available in RelaxNG (eg
interleave) should be used.
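The first assessment task (finding patterns shared by several attributes) could be partly automated. The following is a rough sketch only: the function names, the `threshold` default of 3, and the placeholder element and attribute names in `SAMPLE` are assumptions made for the illustration. Only the value lists are borrowed from the tables at the end of this report.

```python
from collections import Counter


def normalise(declared):
    """Treat 'a | b' and 'b|a' as the same pattern: a sorted tuple of tokens."""
    return tuple(sorted(tok.strip() for tok in declared.split("|") if tok.strip()))


def candidate_datatypes(attributes, threshold=3):
    """Return the patterns shared by at least `threshold` attributes."""
    counts = Counter(normalise(declared) for _, _, declared in attributes)
    return {pattern: n for pattern, n in counts.items() if n >= threshold}


# Illustrative input: element and attribute names are placeholders.
SAMPLE = [
    ("e1", "a1", "eq | ne | gt | ge | lt | le"),
    ("e2", "a2", "eq | ne | lt | le | gt | ge"),
    ("e3", "a3", "eq|ne|lt|le|gt|ge"),
    ("e4", "a4", "high | medium | low | unknown"),
]
```

With this input, `candidate_datatypes(SAMPLE)` reports the comparison-operator pattern as a candidate for a shared datatype, since three attributes use it once token order and spacing are ignored.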
Work on ODD markup
attList
We currently have a structure called attList
which contains one or more attDef elements, which have
datatype and default children.
The default holds both things like #IMPLIED and m,
while datatype has a mixture of CDATA, 63 and %ISO-date;.
It is suggested that
- attDef has a boolean attribute required
- the default element should only be used to hold
genuine default strings or tokens. It will be optional.
Some notation will be needed to encompass %INHERITED;
- datatype has a mandatory target attribute, which
points to an entDoc, defining the datatype.
This gives us
an extra abstract layer over XML schema datatypes. Most
token choice attributes would be boiled down to
genuine datatypes, so all of Y|N, yes | no and true|false
would be datatype target="BOOLEAN". In the entDoc,
we expound on this and map to the relevant W3C Schema
datatype (see the section on datatyping in attributes below).
Where the choice is limited, eg A | B,
it is recorded as a set of enumerated values,
defined in the body of the datatype:
<datatype target="TOKEN">
  <rng:choice>
    <rng:value>A</rng:value>
    <rng:value>B</rng:value>
  </rng:choice>
</datatype>
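As a rough illustration of the proposal, the sketch below converts an old-style datatype/default pair into the suggested form. It is not the task force's conversion script. The function name `convert_attdef`, the `name` attribute on attDef, and the treatment of #REQUIRED are assumptions made for the example, and %INHERITED; is deliberately not handled because its notation is still open.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
ET.register_namespace("rng", RNG)

# Token sets treated as booleans, following the text above.
BOOLEAN_SETS = [{"Y", "N"}, {"yes", "no"}, {"true", "false"}]
# Two parameter-entity datatypes taken from the tables in this report.
ENTITY_TARGETS = {"%ISO-date;": "DATE", "%extPtr;": "EXTPTR"}
# DTD keywords that are not genuine default values.
DTD_KEYWORDS = {"#IMPLIED", "#REQUIRED"}


def convert_attdef(name, old_datatype, old_default):
    """Build a new-style attDef from old datatype and default strings."""
    # The 'name' attribute is a placeholder for however attDef is identified.
    att = ET.Element("attDef", name=name)
    att.set("required", "true" if old_default == "#REQUIRED" else "false")
    datatype = ET.SubElement(att, "datatype")
    tokens = [t.strip() for t in old_datatype.split("|")]
    if old_datatype in ENTITY_TARGETS:
        datatype.set("target", ENTITY_TARGETS[old_datatype])
    elif set(tokens) in BOOLEAN_SETS:
        datatype.set("target", "BOOLEAN")
    elif len(tokens) > 1:
        # A limited choice: enumerate the values in the datatype body.
        datatype.set("target", "TOKEN")
        choice = ET.SubElement(datatype, "{%s}choice" % RNG)
        for tok in tokens:
            ET.SubElement(choice, "{%s}value" % RNG).text = tok
    else:
        raise ValueError("no mapping sketched for %r" % old_datatype)
    # Only genuine default strings or tokens are kept.
    if old_default not in DTD_KEYWORDS:
        ET.SubElement(att, "default").text = old_default
    return att
```

For instance, `ET.tostring(convert_attdef("x", "A | B", "#IMPLIED"), encoding="unicode")` serializes an attDef whose datatype carries the two enumerated values and which has no default child.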
Datatyping in attributes
The task force is asked to use W3C Schema datatypes in the TEI as
much as possible.
An analysis of all the current datatype values shows
that they fall into four categories:
- Standard XML datatypes (ID, IDREFS, NMTOKENS, etc)
- Abstract datatypes linked to entities in the Guidelines (there
are only 2 or 3 of these)
- Text with no conditions
- Text, but with a fixed set of possibilities
We can deal with the first of these easily; they all map into
schema datatypes. The second is simply an indirection. The third
will remain as text (but see below). It is suggested that
the fourth should be split into:
- attributes where the range of possibilities fits a W3C datatype,
or it makes sense to at least have a common set of values across the
TEI
- attributes which really should have token values
However, it is likely that some attributes are mis-classified at
present; some of those which are datatyped as free text should
be tokenized, and some which are tokenized should be completely free
text. It is important to separate out attributes which have
completely arbitrary text from those where the text is
tokenizable (see the section on character encoding in attributes).
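The four categories can be told apart mechanically from the declared value alone, which is roughly how a first-pass analysis might proceed. This is a sketch under assumptions: the function name `classify`, the category labels, and the rule that anything of the form %name; counts as an abstract datatype are ours, and the list of standard types is limited to the XML attribute types.

```python
# XML attribute types other than CDATA and enumerations.
XML_TYPES = {"ID", "IDREF", "IDREFS", "NMTOKEN", "NMTOKENS", "ENTITY", "ENTITIES"}


def classify(declared):
    """Assign a declared attribute value to one of the four categories."""
    value = declared.strip()
    if value in XML_TYPES:
        return "standard XML datatype"
    if value.startswith("%") and value.endswith(";"):
        # A parameter entity reference such as %ISO-date;
        return "abstract datatype"
    if "|" in value:
        return "text with a fixed set of possibilities"
    return "text with no conditions"
```

Such a classifier cannot detect the mis-classified attributes mentioned above; deciding whether free text should really be tokenized remains an editorial judgement.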
It is suggested that the system be rationalized
so that all the existing datatype entries
are replaced by pointers to one of the following datatypes:
| Name | Relax NG representation |
| --- | --- |
| ANYURI | <rng:data type="anyURI"/> |
| BOOLEAN | <rng:data type="boolean"/> |
| DATE | <rng:data type="date"/> |
| DATETIME | <rng:data type="dateTime"/> |
| DURATION | <rng:data type="duration"/> |
| ENTITIES | <rng:data type="ENTITIES"/> |
| ENTITY | <rng:data type="ENTITY"/> |
| EXTPTR | <rng:text/> |
| FLOAT | <rng:data type="float"/> |
| FORMULA | <rng:text/> |
| ID | <rng:data type="ID"/> |
| IDREF | <rng:data type="IDREF"/> |
| IDREFS | <rng:data type="IDREFS"/> |
| LANGUAGE | <rng:text/> |
| NAME | <rng:data type="NCName"/> |
| NMTOKEN | <rng:data type="NMTOKEN"/> |
| NMTOKENS | <rng:data type="NMTOKENS"/> |
| SEX | <rng:choice> <rng:value>m</rng:value> <rng:value>f</rng:value> <rng:value>u</rng:value> <rng:value>x</rng:value> </rng:choice> |
| TEXT | <rng:text/> |
| TIME | <rng:data type="time"/> |
| TOKEN | <rng:empty/> |
| UBOOLEAN | <rng:choice> <rng:value>true</rng:value> <rng:value>false</rng:value> <rng:value>unknown</rng:value> <rng:value>unspecified</rng:value> </rng:choice> |
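The table "Current datatypes and proposed replacements" (under Tables)
lists some current datatype values and how they map to the new scheme.
The table "Attributes with datatypes assigned" shows 180 attributes which can
automatically be given non-text and non-token datatypes.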
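To show how the indirection through named datatypes might be used when generating schemas, here is a small sketch that turns a datatype name into a Relax NG attribute pattern. It covers only part of the datatype table above, and the function name `attribute_pattern` and the dictionary layout are assumptions for the illustration, not the design of the actual schema-generating scripts.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
XSD = "http://www.w3.org/2001/XMLSchema-datatypes"
ET.register_namespace("rng", RNG)

# Subset of the table: TEI datatype name -> W3C Schema datatype name.
XSD_TYPES = {
    "ANYURI": "anyURI", "BOOLEAN": "boolean", "DATE": "date",
    "DATETIME": "dateTime", "DURATION": "duration", "FLOAT": "float",
    "ID": "ID", "IDREF": "IDREF", "IDREFS": "IDREFS", "TIME": "time",
}
# Datatypes defined by enumeration in the table.
ENUMERATED = {
    "SEX": ["m", "f", "u", "x"],
    "UBOOLEAN": ["true", "false", "unknown", "unspecified"],
}
# Datatypes that remain plain text for now.
PLAIN_TEXT = {"EXTPTR", "FORMULA", "LANGUAGE", "TEXT"}


def attribute_pattern(att_name, datatype):
    """Return an rng:attribute element for one attribute and datatype name."""
    att = ET.Element("{%s}attribute" % RNG, name=att_name)
    if datatype in XSD_TYPES:
        ET.SubElement(att, "{%s}data" % RNG,
                      type=XSD_TYPES[datatype], datatypeLibrary=XSD)
    elif datatype in ENUMERATED:
        choice = ET.SubElement(att, "{%s}choice" % RNG)
        for token in ENUMERATED[datatype]:
            ET.SubElement(choice, "{%s}value" % RNG).text = token
    elif datatype in PLAIN_TEXT:
        ET.SubElement(att, "{%s}text" % RNG)
    else:
        raise KeyError("datatype not covered by this sketch: %s" % datatype)
    return att
```

A call such as `ET.tostring(attribute_pattern("sex", "SEX"), encoding="unicode")` serializes the pattern. Because every attribute points at a named datatype, changing the definition of SEX in one place would change it for both person and personGrp.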
Character encoding in attributes
The character encoding workgroup discussed how to deal
with attributes which need to use the full range of characters
(eg variations, and names). This task force agreed that the correct
approach was to support an alternative notation, by which these
attributes could optionally be recorded as elements if the TEI user
wishes to use some form of character encoding not permitted in TEI
attributes. TEI P5 will therefore:
- Record which attributes have the extended property of being
representable as elements
- When making normal DTDs, only support the traditional scheme
of attributes
- Allow for special DTDs (from son-of-pizzachef) which support
only the element alternative
- When making schemas, support both attribute and element forms
Processing applications (eg XSLT stylesheets) will have to decide
whether to support both systems, or only one.
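What supporting both systems might mean for a processing application can be sketched as follows. The convention assumed here, that the element alternative is a child element with the same name as the attribute, is purely hypothetical (the report does not define it), as are the function name `property_value` and the sample markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical instances: the same information as attribute and as element.
ATTR_FORM = '<name reg="Jean Le Fevre">J. Le Fevre</name>'
ELEM_FORM = '<name><reg>Jean Le Fevre</reg>J. Le Fevre</name>'


def property_value(element, name):
    """Return a property recorded either as an attribute or as a child element."""
    if name in element.attrib:
        return element.get(name)
    child = element.find(name)  # assumed convention: child named like the attribute
    if child is not None:
        # Collect all text, so markup inside the element form is tolerated.
        return "".join(child.itertext())
    return None
```

Both `property_value(ET.fromstring(ATTR_FORM), "reg")` and `property_value(ET.fromstring(ELEM_FORM), "reg")` yield the same string, so downstream code need not know which form the encoder chose.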
There are over 300 attributes which currently have a text
datatype; this includes a good many elements which have a
type attribute. The TEI editors will have to
decide which of these should be classified as true text
(see [EDW79](../../Drafts/edw79.html)).
Namespaces and fragment inclusion
The task force is asked to consider how two situations can be catered for:
- Using fragments of another markup language in TEI XML
- Using fragments of TEI in another markup language
The answer to both of these is XML namespaces. Two
vocabularies can be combined, if the elements identify their
namespace. Using schemas, it is easy to validate a document
which goes off into different namespaces at various points; this
is demonstrated in a [TEI
RelaxNG schema](../../../Schemas/RelaxNG/P4X/test5.rng) which redefines formula
to have MathML elements as content. However, to demonstrate
the other way round (fragments of TEI embedded in another XML
vocabulary) would require assigning a namespace for the
TEI. This could be a single namespace for all TEI elements, or a
different one for each tagset. The task force considered that the
latter would be an unnecessary complication for users, but that
a namespace (perhaps http://www.tei-c.org/P5)
for the TEI would be a good idea. However, there are two
major problems with this, which have prevented the task force
from implementing it:
- All existing TEI documents would be invalid,
as they would be in an empty namespace. It would be a fairly small fix
for each instance to add a namespace declaration to the root element, but
that would make it fail with existing DTDs.
- All existing XML processing tools would fail to work
with new documents; for instance, XSLT stylesheets which process
a current (empty namespace) TEI.2 element would fail to identify
the new TEI.2 element with xmlns="http://www.tei-c.org/P5". It will
be possible in XSLT 2.0 to write a stylesheet to work with both
old and new TEI documents, but using XSLT 1.0 it will be much harder;
all stylesheets will need a large rewrite.
This issue requires further investigation.
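The second problem is not specific to XSLT; any namespace-aware tool shows the same behaviour. The sketch below uses Python's standard XML library to make the point. The namespace URI is the one floated above, not an adopted one, and the sample documents and function names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/P5"  # the namespace suggested in the text

OLD_DOC = "<TEI.2><text><body><p>Old</p></body></text></TEI.2>"
NEW_DOC = '<TEI.2 xmlns="%s"><text><body><p>New</p></body></text></TEI.2>' % TEI_NS


def paragraphs_naive(doc_text):
    """Look up p by its unqualified name, as an existing tool would."""
    return [el.text for el in ET.fromstring(doc_text).iter("p")]


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def paragraphs(doc_text):
    """Find p elements whether or not the document uses the TEI namespace."""
    root = ET.fromstring(doc_text)
    return [el.text for el in root.iter() if local_name(el.tag) == "p"]
```

Here `paragraphs_naive` finds the paragraph in `OLD_DOC` but nothing in `NEW_DOC`, while `paragraphs` handles both at the cost of ignoring namespaces altogether. That is the kind of rewrite every existing stylesheet would need in one form or another.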
A replacement for the Pizza chef
This has not been discussed by the task force, but Sebastian Rahtz
has written a [paper](../../../Talks/xmleurope2003)
on the subject for XML Europe 2003. This shows that it is possible to
have a simple web application which generates RelaxNG schemas, W3C
schemas, and XML DTDs, on demand; the prototype, Roma,
works solely with the TEI class system, and provides a better
interface to it than the Pizza Chef. There are, however,
facilities which Roma does not yet provide:
- Adding elements which do not simply follow the class system,
but have arbitrary content models and attribute lists. The problem
here is how to ask the user to specify the new material without
directly writing schema code. It remains to be seen how many requests
we will receive for this feature.
- Changing or limiting the content model of elements which
do not follow the class system fully. The correct answer to this may
be to revise the TEI so that all elements do use the
class system 100%, but in the short-term this is unrealistic. It may
be possible to devise an interface for editing content models.
- Adding entire classes to the TEI. This is a complex matter,
which it is unlikely we can provide in a simple web interface.
Tables
Current datatypes and proposed replacements:
| Current | New datatype | (values) |
| --- | --- | --- |
| %ISO-date; | DATE | |
| %extPtr; | EXTPTR | |
| %formulaNotations; | FORMULA | |
| Y \| N | BOOLEAN | |
| Y \| N \| U | UBOOLEAN | |
| YES \| NO | BOOLEAN | |
| all \| one \| none | TOKEN | all, one, none |
| all \| some \| none | TOKEN | all, some, none |
| free \| unknown \| restricted | TOKEN | free, unknown, restricted |
| light \| sound \| prop \| block | TOKEN | light, sound, prop, block |
| m \| f \| u | SEX | |
| m \| f \| u \| x | SEX | |
| none \| some \| all | TOKEN | |
| silent \| tags | TOKEN | |
| y \| n \| u | UBOOLEAN | |
| yes \| no | BOOLEAN | |
| Y \| N \| I \| M \| F | TOKEN | Y, N, I, M, F |
| Y \| N \| U | UBOOLEAN | |
| Y \| N \| partial | TOKEN | Y, N, partial |
| Y \| N | BOOLEAN | |
| Y \| N | BOOLEAN | |
| a \| m \| j \| s \| u | TOKEN | a, m, j, s, u |
| am \| pm \| 24hour \| descriptive | TOKEN | am, pm, 24hour, descriptive |
| audio \| video | TOKEN | audio, video |
| closed \| semi \| open | TOKEN | closed, semi, open |
| composite \| uniform | TOKEN | composite, uniform |
| data \| rend \| std \| nonstd \| unknown | TOKEN | data, rend, std, nonstd, unknown |
| eq \| ne | TOKEN | eq, ne |
| eq \| ne \| gt \| ge \| lt \| le | TOKEN | eq, ne, gt, ge, lt, le |
| eq \| ne \| lt \| le \| gt \| ge | TOKEN | eq, ne, lt, le, gt, ge |
| eq \| ne \| sb \| ns | TOKEN | eq, ne, sb, ns |
| eq \| ne \| sb \| ns \| lt \| le \| gt \| ge | TOKEN | eq, ne, sb, ns, lt, le, gt, ge |
| excl \| incl | TOKEN | excl, incl |
| fiction \| fact \| mixed \| inapplicable | TOKEN | fiction, fact, mixed, inapplicable |
| high \| medium \| low \| unknown | TOKEN | high, medium, low, unknown |
| horizontal \| vertical | TOKEN | horizontal, vertical |
| initial \| medial \| final \| unknown \| complete | TOKEN | initial, medial, final, unknown, complete |
| internal \| external | TOKEN | internal, external |
| int \| real | TOKEN | int, real |
| lexical \| punc \| lexpunc \| digit \| space \| DL \| LD \| dia \| joiner \| other | TOKEN | lexical, punc, lexpunc, digit, space, DL, LD, dia, joiner, other |
| location-referenced \| double-end-point \| parallel-segmentation | TOKEN | location-referenced, double-end-point, parallel-segmentation |
| model \| atts \| both | TOKEN | model, atts, both |
| new \| update | TOKEN | new, update |
| none \| partial \| complete \| inapplicable | TOKEN | none, partial, complete, inapplicable |
| pe \| ge | TOKEN | pe, ge |
| perc \| real | TOKEN | perc, real |
| req \| mwa \| rec \| rwa \| opt | TOKEN | req, mwa, rec, rwa, opt |
| role \| list | TOKEN | role, list |
| root \| branches | TOKEN | root, branches |
| s \| w \| ws \| sw \| m \| x | TOKEN | s, w, ws, sw, m, x |
| silent \| tags | TOKEN | silent, tags |
| single \| composite \| frags \| unknown | TOKEN | single, composite, frags, unknown |
| single \| set \| bag \| list | TOKEN | single, set, bag, list |
| smooth \| latching \| overlap \| pause | TOKEN | smooth, latching, overlap, pause |
| tei \| iso \| national \| private \| none | TOKEN | tei, iso, national, private, none |
| tempo \| loud \| pitch \| tension \| rhythm \| voice | TOKEN | tempo, loud, pitch, tension, rhythm, voice |
| to \| from \| both \| none | TOKEN | to, from, both, none |
| unit \| set \| bag \| list | TOKEN | unit, set, bag, list |
| y \| n \| unspecified | UBOOLEAN | |
| y \| n | BOOLEAN | |
| yes \| abb \| init | TOKEN | yes, abb, init |
| yes \| no | BOOLEAN | |
| yes \| no | BOOLEAN | |
| CDATA | TOKEN | |
| ENTITIES | ENTITIES | |
| ENTITY | ENTITY | |
| ID | ID | |
| IDREF | IDREF | |
| IDREFS | IDREFS | |
| NAME | NAME | |
| NMTOKEN | NMTOKEN | |
| NMTOKENS | NMTOKENS | |
Attributes with datatypes assigned:
| element | attribute | datatype |
| --- | --- | --- |
| analysis | ana | typeIDREFS |
| declarable | default | typeBOOLEAN |
| declaring | decls | typeIDREFS |
| dictionaries | location | typeIDREF |
| dictionaries | mergedin | typeIDREF |
| dictionaries | opt | typeBOOLEAN |
| edit | resp | typeIDREF |
| formPointers | target | typeIDREF |
| global | id | typeID |
| global | id | typeID |
| global | lang | typeIDREF |
| interpret | inst | typeIDREFS |
| linking | corresp | typeIDREFS |
| linking | synch | typeIDREFS |
| linking | sameAs | typeIDREF |
| linking | copyOf | typeIDREF |
| linking | next | typeIDREF |
| linking | prev | typeIDREF |
| linking | exclude | typeIDREFS |
| linking | select | typeIDREFS |
| pointer | targOrder | typeUBOOLEAN |
| pointerGroup | domains | typeIDREFS |
| readings | hand | typeIDREF |
| TEIform | TEIform | typeNAME |
| terminology | grpPtr | typeIDREF |
| terminology | depPtr | typeIDREF |
| timed | start | typeIDREF |
| timed | end | typeIDREF |
| xPointer | doc | typeENTITY |
| xPointer | from | typeEXTPTR |
| xPointer | to | typeEXTPTR |
| abbr | resp | typeIDREF |
| add | resp | typeIDREF |
| add | hand | typeIDREF |
| addSpan | resp | typeIDREF |
| addSpan | hand | typeIDREF |
| addSpan | to | typeIDREF |
| admin | date | typeDATE |
| alt | targets | typeIDREFS |
| app | from | typeIDREF |
| app | to | typeIDREF |
| arc | from | typeIDREF |
| arc | to | typeIDREF |
| att | tei | typeBOOLEAN |
| birth | date | typeDATE |
| catRef | target | typeIDREFS |
| catRef | scheme | typeIDREF |
| cell | rows | typeNONNEGATIVEINTEGER |
| cell | cols | typeNONNEGATIVEINTEGER |
| certainty | target | typeIDREFS |
| classCode | scheme | typeIDREF |
| damage | resp | typeIDREF |
| damage | hand | typeIDREF |
| date | value | typeDATE |
| del | resp | typeIDREF |
| del | hand | typeIDREF |
| delSpan | resp | typeIDREF |
| delSpan | hand | typeIDREF |
| delSpan | to | typeIDREF |
| distance | exact | typeUBOOLEAN |
| docDate | value | typeDATE |
| eLeaf | value | typeIDREF |
| eTree | value | typeIDREF |
| event | who | typeIDREF |
| event | iterated | typeUBOOLEAN |
| expan | resp | typeIDREF |
| f | fVal | typeIDREFS |
| fAlt | mutExcl | typeBOOLEAN |
| figure | entity | typeENTITY |
| form | codedCharSet | typeIDREF |
| form | entityStd | typeENTITIES |
| form | entityLoc | typeENTITIES |
| formula | notation | typeFORMULA |
| fs | feats | typeIDREFS |
| fsdDecl | fsd | typeENTITY |
| gap | resp | typeIDREF |
| gap | hand | typeIDREF |
| gi | tei | typeBOOLEAN |
| gloss | target | typeIDREF |
| graph | order | typeNONNEGATIVEINTEGER |
| graph | size | typeNONNEGATIVEINTEGER |
| handShift | new | typeIDREF |
| handShift | old | typeIDREF |
| handShift | resp | typeIDREF |
| iNode | value | typeIDREF |
| iNode | children | typeIDREFS |
| iNode | parent | typeIDREF |
| iNode | ord | typeBOOLEAN |
| iNode | follow | typeIDREF |
| iNode | outDegree | typeNONNEGATIVEINTEGER |
| join | targets | typeIDREFS |
| keywords | scheme | typeIDREF |
| kinesic | who | typeIDREF |
| kinesic | iterated | typeUBOOLEAN |
| language | iso639 | typeLANGUAGE |
| language | wsd | typeENTITY |
| leaf | value | typeIDREF |
| leaf | parent | typeIDREF |
| leaf | follow | typeIDREF |
| link | targets | typeIDREFS |
| move | who | typeIDREFS |
| move | perf | typeIDREFS |
| msr | value | typeFLOAT |
| msr | valueTo | typeFLOAT |
| nbr | value | typeFLOAT |
| nbr | valueTo | typeFLOAT |
| node | value | typeIDREF |
| node | adjTo | typeIDREFS |
| node | adjFrom | typeIDREFS |
| node | adj | typeIDREFS |
| node | inDegree | typeNONNEGATIVEINTEGER |
| node | outDegree | typeNONNEGATIVEINTEGER |
| node | degree | typeNONNEGATIVEINTEGER |
| note | anchored | typeBOOLEAN |
| note | target | typeIDREFS |
| note | targetEnd | typeIDREFS |
| occupation | scheme | typeIDREF |
| occupation | code | typeIDREF |
| pause | who | typeIDREF |
| person | sex | typeSEX |
| personGrp | sex | typeSEX |
| ptr | target | typeIDREFS |
| q | direct | typeUBOOLEAN |
| rate | value | typeFLOAT |
| rate | valueTo | typeFLOAT |
| ref | target | typeIDREFS |
| relation | active | typeIDREFS |
| relation | passive | typeIDREFS |
| relation | mutual | typeBOOLEAN |
| respons | target | typeIDREFS |
| restore | resp | typeIDREF |
| restore | hand | typeIDREF |
| root | value | typeIDREF |
| root | children | typeIDREFS |
| root | ord | typeBOOLEAN |
| root | outDegree | typeNONNEGATIVEINTEGER |
| setting | who | typeIDREFS |
| shift | who | typeIDREF |
| socecStatus | scheme | typeIDREF |
| socecStatus | code | typeIDREF |
| sound | discrete | typeUBOOLEAN |
| sp | who | typeIDREFS |
| span | from | typeIDREF |
| span | to | typeIDREF |
| state | length | typeNONNEGATIVEINTEGER |
| step | length | typeNONNEGATIVEINTEGER |
| step | from | typeEXTPTR |
| step | to | typeEXTPTR |
| supplied | hand | typeIDREF |
| symbol | terminal | typeBOOLEAN |
| table | rows | typeNONNEGATIVEINTEGER |
| table | cols | typeNONNEGATIVEINTEGER |
| tag | TEI | typeBOOLEAN |
| tagUsage | occurs | typeNONNEGATIVEINTEGER |
| tagUsage | ident | typeNONNEGATIVEINTEGER |
| tagUsage | render | typeIDREF |
| tech | perf | typeIDREFS |
| teiHeader | date.created | typeDATE |
| teiHeader | date.updated | typeDATE |
| time | value | typeTIME |
| timeline | origin | typeIDREF |
| timeRange | from | typeTIME |
| timeRange | to | typeTIME |
| tree | arity | typeNONNEGATIVEINTEGER |
| tree | order | typeNONNEGATIVEINTEGER |
| triangle | value | typeIDREF |
| u | who | typeIDREFS |
| unclear | hand | typeIDREF |
| vAlt | mutExcl | typeBOOLEAN |
| vocal | who | typeIDREF |
| vocal | iterated | typeUBOOLEAN |
| when | since | typeIDREF |
| witDetail | target | typeIDREFS |
| writing | who | typeIDREF |
| writing | script | typeIDREF |
| writing | gradual | typeUBOOLEAN |
| writingSystemDeclaration | date | typeDATE |