These checks are meant to compare and classify sample TEI/SGML documents and their properties with a view to XML migration.
We assume a given sample of TEI/SGML, consisting of one or several files, in a separate subdirectory. Some of the suggestions for probing the sample given below are only helpful if the file is a valid, parseable SGML document.
The person checking the sample need not be well acquainted with it, and should not need to spend too much time per sample in order to gather meaningful information.
When answering the items of the checklist, it would be good to state
briefly how the answer was obtained. So, "No" is not as good
as "No (grepped for SUBDOC and didn't find any)."
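On the system used for checking, the following could be useful: nsgmls, spam and sx (from James Clark's SP package), rxp, Perl, and the usual Unix tools grep/egrep, wc and diff. These are the tools the suggestions below rely on.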
Is the sample valid (parseable) SGML? (with its own DTD / with some standard TEI DTD?)
How to find out:
Parse with "nsgmls -s" and retry with your locally stored
DTDs.
Is the sample already XML? Is it valid?
How to find out:
The first hint is the extension (".xml" or
".sgm(l)") of course. Then, look for "/>" in
the code (the key feature of the XML syntax for empty elements). You can
try to parse with rxp or other validating XML parsers.
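The quick hints above can be collected in one place. What follows is only a sketch in Python, assuming the sample fits into memory; the function name "xml_hints" and its result keys are made up for illustration, and the test for an XML declaration is an extra hint not mentioned above. None of this replaces a run through rxp or another validating parser.

```python
import os


def xml_hints(filename, text):
    """Collect cheap hints that a sample is XML rather than SGML."""
    ext = os.path.splitext(filename)[1].lower()
    return {
        # ".xml" suggests XML; ".sgm" or ".sgml" suggest SGML.
        "xml_extension": ext == ".xml",
        "sgml_extension": ext in (".sgm", ".sgml"),
        # XML empty-element tags end in "/>".
        "empty_element_tags": text.count("/>"),
        # Extra hint (assumption, not from the checklist): an XML
        # declaration at the start of the file.
        "xml_declaration": text.lstrip().startswith("<?xml"),
    }
```

A sample with an ".sgml" extension and a non-zero "empty_element_tags" count deserves a closer look, since the two hints contradict each other.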
Does the sample use SUBDOCs? For anything else than WSDs?
How to find out: grep for the word "SUBDOC". Most likely, it will be used to refer to a
Writing System Declaration.
Are all elements fully tagged without minimization techniques?
How to find out:
Search for "</>" (the empty end tag) and compare the number of
"<p" followed by whitespace or ">" with the number of
"</p", using something like "egrep '<p[[:space:]>]' |
wc -l", which counts matching lines rather than occurrences (other likely
candidates could be "li" or "item"; these suggestions are just heuristics, of course).
Run "spam -p -momittag" on the sample, redirect the result
into a new file, and check for differences with diff.
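For samples that cannot be run through spam, the counting heuristic above can be sketched in Python. This is a rough aid under stated assumptions: the function name "minimization_hints" is hypothetical, tag names are compared case-insensitively (as in the SGML reference concrete syntax), and occurrences are counted rather than lines.

```python
import re


def minimization_hints(text, names=("p", "li", "item")):
    """Count start and end tags per element name, plus empty end tags."""
    report = {"empty_end_tags": text.count("</>")}
    for name in names:
        escaped = re.escape(name)
        # Start tag: "<p" followed by whitespace or ">".
        starts = len(re.findall(r"<%s[\s>]" % escaped, text, re.IGNORECASE))
        # End tag: "</p" followed by whitespace or ">".
        ends = len(re.findall(r"</%s[\s>]" % escaped, text, re.IGNORECASE))
        report[name] = (starts, ends)
    return report
```

If the two numbers for an element differ, or if "empty_end_tags" is not zero, the sample most likely uses tag minimization.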
Are all attribute values quoted?
How to find out:
Use a regular expression to find any equal sign not followed by a single
or double quote, then quickly look for suspicious lines. The following
Perl script should do it:
Even better: use spam -p -mattvalue on the sample, redirect
the output into a new file, and compare it against the sample with
diff.
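As a stand-in for the regular-expression check described above, here is a minimal sketch in Python (it is not the Perl script referred to there). It works line by line, so it will also flag equal signs in running text; the name "unquoted_value_lines" is made up for illustration.

```python
import re

# An equal sign, optional whitespace, then something that is neither
# a quote nor whitespace.
UNQUOTED = re.compile(r"""=\s*[^\s"']""")


def unquoted_value_lines(text):
    """Return (line number, line) pairs with '=' not followed by a quote."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if UNQUOTED.search(line):
            hits.append((number, line.strip()))
    return hits
```

The result is meant to be skimmed by eye, as suggested above; an attribute value that starts on the following line will not be caught.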
Are there any omitted attribute names (as in <title m>)?
How to find out: It will show up with the spam technique proposed for the previous check
item, or by running spam -p -mattname and comparing
results. If you can't parse, you can try a complicated regular
expression (a word within a tag, after the tag name, and not followed by
an equal sign; but don't forget that tags can spread over several
lines). It would be useful to have a list of TEI tags where this can happen
(those with attributes declared as enumerated lists of name tokens, since SGML
allows the attribute name to be omitted only for such values).
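The "complicated regular expression" for unparseable samples could look roughly like this in Python. It is a sketch under assumptions: start tags are matched across line breaks, quoted values may contain ">", and every word left over after removing all name=value pairs is reported as a candidate. The names "TAG", "PAIR" and "bare_attribute_values" are hypothetical.

```python
import re

# A start tag: name, then any mix of quoted strings and other characters.
TAG = re.compile(r"""<([A-Za-z][\w.\-]*)((?:"[^"]*"|'[^']*'|[^<>"'])*)>""")
# One attribute specification with an explicit name and value.
PAIR = re.compile(r"""[\w.\-]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+)""")


def bare_attribute_values(text):
    """Return (tag name, word) pairs for words that lack 'name='."""
    found = []
    for match in TAG.finditer(text):
        # Drop a trailing "/" so XML empty-element tags are not reported.
        attributes = match.group(2).rstrip().rstrip("/")
        leftover = PAIR.sub(" ", attributes)
        for word in leftover.split():
            found.append((match.group(1), word))
    return found
```

Since tag names are returned along with the suspicious words, the output can also serve as a starting point for the list of affected TEI tags.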
Does the text use SDATA entity references for well-known (Unicode) characters? Are there any self-defined / non-ISO / non-Unicode SDATA entities?
How to find out:
You can use the following little Perl program to get statistics on the
entity references used in the sample:
What remains to be done is checking the names. Starting from the other end, you can check the document prolog or extension files for entity definitions (assuming the extension files are part of the sample).
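A comparable statistic can be sketched in Python (this is not the Perl program mentioned above). It assumes that entity references are closed with ";". SGML allows the semicolon to be left out in some contexts, so the counts may be too low; the function name "entity_statistics" is made up for illustration.

```python
import re
from collections import Counter

# "&name;" with a name that starts with a letter, which excludes
# numeric character references such as "&#233;".
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def entity_statistics(text):
    """Count how often each entity name is referenced."""
    return Counter(ENTITY_REF.findall(text))
```

Calling "entity_statistics(text).most_common()" lists the names by frequency, which makes unusual, self-defined names easy to spot at the end of the list.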
Are there comments? In formats not legal in XML?
How to find out:
XML comments must be of the form "<!-- ... -->" with
no "--" in between. Empty declarations
"<!>" are forbidden. Pragmatically, you can look for
"--" and "<!>" in the sample.
Are there Processing Instructions? Do they start with a name?
How to find out: Processing instructions start with "<?" and in XML, they
must be followed by a name; the name "xml" is reserved (and forbidden in
all other forms than lowercase). You can simply grep for
"<?" and check what you get.
Does the sample use really obscure SGML features? (CONCUR, ...)
How to find out: If you are an SGML specialist, you could have a quick look at the SGML
declaration and/or the beginning of the document. But the item is here
mostly for completeness' sake. If in doubt, just guess "no".
What kind of warnings and errors do you get from sx?
How to find out:
Try to run sx on the sample and look at the errors and warnings. This,
of course, only works with parseable samples.
On which TEI DTD is the sample based? (P2, P3, P4, TEILite, unknown)
How to find out:
Check the DOCTYPE declaration at the beginning. If the sample comes
with its own DTD file, have a short look at that one. Have a quick look
at the TEI Header. You can use the perl code for the camelCase check to
get a list of tags and check for non-TEI ones.
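Checking the DOCTYPE declaration can be partly automated. The Python sketch below only extracts the pieces (document element, public identifier, system identifier, and whether a DTD subset is opened) and leaves the interpretation to the person checking. It assumes the usual layout of the declaration; the name "doctype_info" and the result keys are hypothetical.

```python
import re

DOCTYPE = re.compile(
    r"""<!DOCTYPE\s+(?P<root>[^\s\[>]+)"""
    r"""(?:\s+PUBLIC\s+(?P<q1>["'])(?P<public>.*?)(?P=q1))?"""
    r"""(?:\s+SYSTEM)?"""
    r"""(?:\s+(?P<q2>["'])(?P<system>.*?)(?P=q2))?"""
    r"""\s*(?P<subset>\[)?""",
    re.IGNORECASE | re.DOTALL,
)


def doctype_info(text):
    """Return the parts of the first DOCTYPE declaration, or None."""
    match = DOCTYPE.search(text)
    if match is None:
        return None
    return {
        "root": match.group("root"),
        "public": match.group("public"),
        "system": match.group("system"),
        # True if "[" follows, i.e. a DTD subset is present.
        "internal_subset": match.group("subset") is not None,
    }
```

The public identifier, if there is one, usually names the TEI version; a missing public identifier together with a local system identifier points to a DTD shipped with the sample.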
Does the sample (consistently) use the TEI camelCase spelling?
How to find out:
Systematic problems (all uppercase or all lowercase) can be spotted with
one look at the start of the document, thanks to the spelling of
"teiHeader". For a deeper check, the following Perl code
should give you a sorted list of tags as they occur in the input (in
their original spelling). Below that is a list of all camelCased tag names (no
guarantee, it's a copy/paste from Sebastian Rahtz's XSL). If I had a
list of all TEI tags, this Perl code could be extended into an
automatic checker for new and mis-cased tags.
Tags that are not all-lowercase: TEI.2, addName,
addSpan, addrLine, altGrp, attDef, attList, attName, attlDecl, baseWsd,
biblFull, biblScope, biblStruct, castGroup, castItem, castList, catDesc,
catRef, classCode, classDecl, classDoc, codedCharSet, dataDesc,
dateRange, dateStruct, delSpan, divGen, docAuthor, docDate, docEdition,
docImprint, docTitle, eLeaf, eTree, editionStmt, editorialDecl,
elemDecl, encodingDesc, entDoc, entName, entitySet, entryFree,
extFigure, fAlt, fDecl, fDescr, fLib, figDesc, fileDesc, firstLang,
foreName, forestGrp, fvLib, genName, geogName, gramGrp, handList,
handShift, headItem, headLabel, iNode, interpGrp, joinGrp, lacunaEnd,
lacunaStart, langKnown, langUsage, linkGrp, listBibl, metDecl, nameLink,
notesStmt, oRef, oVar, offSet, orgDivn, orgName, orgTitle, orgType,
otherForm, pRef, pVar, particDesc, particLinks, persName, personGrp,
placeName, postBox, postCode, profileDesc, projectDesc, pubPlace,
publicationStmt, rdgGrp, recordingStmt, refsDecl, respStmt,
revisionDesc, roleDesc, roleName, samplingDecl, scriptStmt, seriesStmt,
settingDesc, soCalled, socecStatus, sourceDesc, spanGrp, stdVals,
tagDoc, tagUsage, tagsDecl, teiCorpus.2, teiFsd2, teiHeader, termEntry,
textClass, textDesc, timeRange, timeStruct, titlePage, titlePart,
titleStmt, vAlt, vDefault, vRange, valDesc, valList, variantEncoding,
witDetail, witEnd, witList, witStart.
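A deeper check along these lines can be sketched in Python (this is not the Perl code mentioned above). It counts tag names in their original spelling and compares them, ignoring case, with camelCase names from the list. The constant "CAMEL_CASE" holds only a few names from the list to keep the example short, and the function names are made up for illustration.

```python
import re
from collections import Counter

# A few names from the list above; paste in the full list for real use.
CAMEL_CASE = ["TEI.2", "teiHeader", "fileDesc", "titleStmt",
              "publicationStmt", "sourceDesc", "biblScope", "persName"]

# The name in a start tag or an end tag.
TAG_NAME = re.compile(r"</?([A-Za-z][\w.\-]*)")


def tag_spellings(text):
    """Count tag names exactly as they are spelled in the input."""
    return Counter(TAG_NAME.findall(text))


def miscased_tags(text, camel_case=CAMEL_CASE):
    """Map each mis-cased tag name to its expected camelCase spelling."""
    canonical = {name.lower(): name for name in camel_case}
    wrong = {}
    for tag in sorted(tag_spellings(text)):
        expected = canonical.get(tag.lower())
        if expected is not None and tag != expected:
            wrong[tag] = expected
    return wrong
```

Printing "sorted(tag_spellings(text))" gives the sorted tag list; all-lowercase TEI names are not checked here, since that would need a list of all TEI tags.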
Is there a DTD subset? (Look between [ and ] at the beginning of the document,
within the DOCTYPE declaration.) Does it contain more than ENTITY
declarations with TEI DTD parameters and invocations of character
sets?
Does the sample DTD rename TEI tags?
How to find out:
Check DTD extension files (if available) or the DTD itself. A checker
for unknown TEI tags would be nice to spot them automatically.
Are there real DTD modifications? With recommended technique or by editing DTD files?
How to find out:
Modifications should be done in DTD extension files and documented
somewhere. Even if "they" forgot to pack the extension files, these should
be referred to in the DTD subset at the beginning of the document.
Hand-edited DTD files (if available) could contain comments or other
indications. One could try a diff of an alleged TEILite DTD against an
official one (but I think there is more than one official TEI Lite DTD,
and diff might produce a lot of noise anyway). If the sample is
parseable, one could try parsing against an official DTD and watch for
errors.