TEI MI W 06 Migration Case Study Reports


Brown University Women Writers Project

About the Project

Started in 1988, the Brown University Women Writers Project is a long-term research project devoted to early modern women's writing and electronic text encoding. Our goal is to bring texts by pre-Victorian women writers out of the archive and make them accessible to a wide audience of teachers, students, scholars, and the general reader. We support research on women's writing, text encoding, and the role of electronic texts in teaching and scholarship.

Initially funded by the NEH, the WWP is now supported through licensing fees to Women Writers Online, the on-line face to our textbase, and through grant funding from the NEH and the Delmas Foundation. In the past, the WWP has also received grants from the Mellon Foundation and Apple Computer, Inc.

The WWP hires undergraduate and graduate students as encoders. They enter texts using GNU Emacs, Lennart Staflin's psgml mode, and lots of home-grown macros on a Sun Unix system maintained by the University and accessed via an SSH client on Macintosh desktop computers.

In addition to transient student encoders, the WWP has a permanent staff that has fluctuated from one to five, but has been steady at three for several years now. The WWP is part of Brown University's Computing and Information Services. The WWP is advised by a board of scholars, a subset of whom serve on an acquisition committee that decides which texts are next to be added to our textbase.

About the Collection

WWO (the on-line system available to subscribers) currently has over 220 texts, of which just over half have additional contextual materials funded by the Mellon Foundation as part of Renaissance Women Online. The total size of the XML source files for WWO is just over 50 million characters (just under 50 MiB). In addition to the texts available to subscribers, there are over a hundred more ‘under construction’, for a total of just over 97 million characters (just under 93 MiB). Currently we are concentrating on printed works, with an eye towards manuscripts in the future.

Nature of the Content

The collection includes works from every genre: political satire, drama, poetry, religious tracts, and even medical reference. All texts are in English (although there are plenty of phrases and quotations in other languages) and either written or translated by women.

Works from the period included (roughly 1400 to 1850) are often in black-letter; never have a nice, solid, scannable baseline; and often have pagination indicated by signatures, if at all. Several texts have tipped-in pages, errors in the printed page numbers, or publishing errors (such as incorrectly ordered sections). Furthermore, typographic errors and difficult-to-read or damaged photocopies are common.

The WWP includes a variety of different kinds of texts including monographs, broadsides, and analytic pieces. Typically, in cases where work by a woman is included in a larger work by men, only the front and back matter (e.g., title page and colophon) and those parts written by the woman are included.

While there are obviously no photographs in the works involved, there are many figures, typically woodcuts or engravings. These are currently captured only by a very short description in the encoding 1 (i.e., no scanned images are included).

Nature of the Encoding

In our SGML texts all characters were encoded in US-ASCII, with characters outside that set represented by general entity references, often to SDATA entities. In the XML version, all characters are encoded in the US-ASCII subset of ISO 10646 UTF-8, with characters outside that set represented by general entity references, mostly to numeric character references to the appropriate ISO 10646 code point. However, we do have some characters not in Unicode. We do not use the WSD mechanism.

The WWP uses quite a detailed level of encoding: level 5 according to the TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices. For example, an average page break is indicated by
  • an <fw> element (renamed to <mw> ) with type=sig to indicate the signature as printed on the page;
  • an <fw> element (renamed to <mw> ) with type=catch to indicate the catchword (note that no further encoding takes place on a catchword; i.e., it is not encoded as a <persName> even if it is a person's name; this helps avoid over-counting of such features, and ‘false positive’ hits for searches on names);
  • a <pb> element with an n attribute that indicates the ‘idealized’ (i.e., corrected) page number;
  • a <milestone> element, unit=sig with an n attribute that indicates the ‘idealized’ signature; and
  • an <fw> element (renamed to <mw> ) with type=pageNum to indicate the page number as printed on the page.
Here is an example of an encoded page break, along with the first line (in the encoded file) following the page break. Note that in this text the default rendition for <mw> is break(yes) align(right).
  <mw rend="align(center)" type="sig">B3</mw>
  <mw rend="break(no)slant(italic)" type="catch">Va&s;.</mw>
  <pb n="6"/>
  <milestone n="B3v" unit="sig"/>
  <mw rend="place(outside)" type="pageNum">6</mw>
  <sp who="WOWSr4"><speaker><abbr expan="Vasquez">Va&s;.</abbr></speaker>
Modified slightly from WWP TR00444

The WWP textbase uses a single DTD that includes the prose, drama, and verse base tag sets, and the additional tagsets for linking, transcription, names & dates, and figures. This DTD, called wwp-store, includes significant extensions, including complex manipulations of element classes.

The WWP also uses a smaller DTD for internal documentation. This DTD, called wwp-doc, uses only the prose base tag set, and the additional tagsets for linking and for figures. This DTD includes simple extensions, mostly to remove elements (like <pb> ) that make no sense in documents we are authoring.

Before migration the WWP files were nearly XML already. We have required all end tags since 1994; we have required that attributes be quoted since 1997; and we have been using case-sensitive encoding since 1999. The DTD extension files, however, were another story. There we made significant use of SGML features not permitted in XML, particularly inclusion & exclusion exceptions, the ampersand connector, and SDATA entities. At least we have been using an or-bar (instead of comma) in all attribute declared value groups since 1998.

Motivations for Migration

In addition to all the usual reasons of being able to use newer, better tools, there were two further reasons for migrating:
  • enforces our own syntax: The WWP has, for years, had its own subset of SGML instance syntax that documents adhered to (core concrete syntax, all end tags present, all attribute values quoted, no occurrences of <, >, =, etc.). Most of our restrictions on SGML were codified in XML. Thus using an XML validator would allow us to ensure that our documents followed our local syntax conventions much more closely than an SGML validator would, reducing the need for additional syntax-checking software.
  • embarrassment: The WWP is a venerable and respected TEI encoding project, and all staff members have extensive experience in XML, but our own texts were in SGML using P3:1994, i.e. not even P3:1999 (which was for some time unfortunately called ‘P4 beta’), let alone P4. We were getting sick of ‘do what I say, not what I do’.

Barriers to Migration

The biggest impediment to our migration was inertia, particularly in the area of processing our texts for web delivery. We already had a working system, based on SGML texts, for capturing, validating, machine-checking for common errors, printing for proofreading, proofreading, correcting, and transforming for delivery on the web via a commercial tool (DynaWeb); why change? That last step in particular (the transformation from our source SGML to that which DynaWeb could read into its ‘book’) is cumbersome and finicky, and not all those who wrote the software are readily available to us.

Furthermore, I anticipated that while migrating our instances to XML would be easy, migrating the wwp-store DTD would be difficult.

About the Migration Samples

The samples available from the Migration Task Force's web site are a convenience sample that we had already given to TEI for its samples page.

Samples include prose, verse, and drama; interesting features include a letter with a postscript, an interesting <figure> , acrostics, and a complicated note.

Notes on the Conversion Process

There were many sets of files or processes that needed to be migrated. I will discuss three of them here:
  • migration of textbase files (instances of wwp-store)
  • migration of wwp-doc DTD extensions
  • migration of wwp-store DTD extensions
I will not discuss migration of the documentation files, as it was too trivial: each of the fewer than half dozen files conforming to wwp-doc was migrated by hand in under a minute by adding an XML declaration, changing the DOCTYPE declaration, changing ‘ & ’ to ‘ &amp; ’, and inserting a slash in the (extremely rare) empty element tags.

instances of wwp-store

By far the most difficult part of converting the hundreds of instance files from SGML to XML was writing a script to check a file out of the version control system (RCS) under one name (.sgm or .sgml), modify it, and check it back in under another (.xml), thus maintaining the history of revisions to each file despite the name change. 2 Said script executed a Perl program, which might be described as a hack, to actually convert the syntax of a single instance file. The Perl program does not do any true SGML or XML parsing, but rather relies on the fact that the characters < and > do not occur anywhere in any WWP textbase files except as markup. Accordingly, other projects may find this program unhelpful, if not outright harmful.
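
The empty-element part of such a conversion can be illustrated with a short sketch. This is not the WWP's Perl program, which is not reproduced here; it is a minimal Python stand-in, and the list of empty elements (EMPTY_ELEMENTS) and the function name close_empty_tags are assumptions made for illustration. Like the program described above, it does no real parsing and is only safe when < and > occur nowhere except as markup.

```python
import re

# Hypothetical list of EMPTY elements; the real set depends on the DTD in use.
EMPTY_ELEMENTS = ("pb", "milestone", "lb", "anchor", "gap")

# An empty-element start tag, with or without a trailing slash already present.
_EMPTY_TAG = re.compile(
    r"<(" + "|".join(EMPTY_ELEMENTS) + r")\b([^<>]*?)\s*/?>"
)


def close_empty_tags(text: str) -> str:
    """Rewrite <pb n="6"> as <pb n="6"/>; tags that are already closed stay closed."""
    return _EMPTY_TAG.sub(lambda m: "<" + m.group(1) + m.group(2) + "/>", text)


if __name__ == "__main__":
    print(close_empty_tags('<pb n="6"> <milestone n="B3v" unit="sig">'))
```

A real migration would also need to add the XML declaration and adjust the DOCTYPE declaration, which this sketch leaves out.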

One perhaps minor (but nonetheless annoying) problem we have encountered in the migration of the instance files is due to our previous naming convention. In the SGML world, we gave complete SGML files (those with a DOCTYPE declaration and a complete <TEI.2> element) the extension ‘.sgml’; we gave sub-files (those with no DOCTYPE declaration, usually included in one or more complete files by a general entity reference) the extension ‘.sgm’. This allowed for easy differentiation, in particular when using command line tools to perform tasks, e.g. validation or printing, on a set of files.

wwp-doc DTD extensions

I followed the steps outlined in an early draft of miw03.html#index-div-id2653174 :
  1. Created uselessDoc.sgml as a P3 test document. Validates fine.
  2. Ensured that my P4 parsing environment works (it did).
    • In order to validate uselessDoc.sgml against P4(SGML), I had to create a new version of wwp-doc.dtd (which calls P3, new one calls P4). Also added
      <!-- ********* -->
      <!-- TEI flags -->
      <!-- ********* -->
      <!ENTITY % TEI.XML 'INCLUDE' >
    • Got 215 or so validation errors, all reporting ambiguous content in which both the 1st and 2nd occurrences of <divGen> could be matched.
    • Found and removed following two lines from wwpdoc.ent:
      <!-- Changes to element classes (to fix oversights in TEI P3) -->
      <!ENTITY % x.front 'divGen |' >
    • Validates!
  3. Changed wwpdoc.ent and wwpdoc.dtd to be ‘dual-use’; process:
    • changed omissibility indicators to %om.RR; (1 regexp search-and-replace did the trick);
    • changed ‘,’ to ‘|’ in attribute declared value name token group.
  4. Validated against P4 in SGML mode — valid!
  5. (Had been done previously.)
  6. Created uselessDoc.xml from uselessDoc.sgml (added XML declaration).
    • Added SYSTEM identifiers after the PUBLIC identifiers of the TEI.extensions.ent and .dtd declarations;
    • moved intra-declaration comments to be right-after-declaration comments;
    • changed all empty comment declarations (‘<!>’) to blank comment declarations (‘<!-- -->’);
    • added SYSTEM identifiers after the PUBLIC identifier of the WWPiso declaration;
    • complete overhaul of WWPiso: removed large number of declarations no longer used, and changed the remaining ones to be CDATA declarations of numeric character references, rather than SDATA;
    • removed declarations for formulaNotations and formulaContent from wwpdoc.ent, which were declared as CDATA (thus permitting the new default values of CDATA and ‘(#PCDATA)’ to take hold).
    VALID!
  7. To try in SGML mode, added
    <!ENTITY % TEI.XML 'IGNORE'>
    to subset of uselessDoc.sgml. Upon validation nsgmls complained about all the entities declared as ‘&#xNNNN;’; otherwise valid. Pizza page generated flattened DTD without a hitch.
The entire process took the better part of a day (but likely would have taken less if I had not been taking notes for this report).
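
The two ‘dual-use’ edits in step 3 can be sketched in a few lines. This is only an illustration in Python, not the Emacs commands actually used; the function name make_dual_use and the sample declarations are assumptions. It replaces the ‘- -’ omitted-tag minimization parameters with a reference to %om.RR; and turns commas into or-bars inside name token groups in ATTLIST declarations.

```python
import re

# "- -" following the element name in an element declaration.
_OMIT_RR = re.compile(r"(<!ELEMENT\s+\S+\s+)-\s+-(?=\s)")
# A parenthesised group that contains at least one comma.
_COMMA_GROUP = re.compile(r"\(([^()]*,[^()]*)\)")


def _bars_for_commas(group_match):
    inner = re.sub(r"\s*,\s*", " | ", group_match.group(1))
    return "(" + inner + ")"


def make_dual_use(dtd: str) -> str:
    """Apply both edits; comma replacement is limited to ATTLIST declarations."""
    dtd = _OMIT_RR.sub(r"\1%om.RR;", dtd)
    return re.sub(
        r"<!ATTLIST[^>]*>",
        lambda m: _COMMA_GROUP.sub(_bars_for_commas, m.group(0)),
        dtd,
    )


if __name__ == "__main__":
    sample = (
        "<!ELEMENT note - - (#PCDATA)>\n"
        '<!ATTLIST note place (inline, foot, end) "inline">'
    )
    print(make_dual_use(sample))
```

The sketch assumes declarations are written consistently, with no ‘>’ inside comments or attribute literals; a DTD written less regularly would need a real parser.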

wwp-store DTD extensions

Whereas the wwp-doc extensions were relatively straightforward, quite similar to TEI Lite, and not that extensive (some 39 declarations other than comments and element selection), the wwp-store extensions are complicated, change several element classes of TEI, and are quite extensive (some 231 declarations other than comments and element selection). Not surprisingly, migration of the wwp-store DTD extensions has proved more difficult.

While I attempted to follow the same steps as with wwp-doc ( miw03.html#index-div-id2653174 ), I had more difficulty, but successfully completed the first seven steps in a day or two. Step 8 — fix all the errors — is not quite complete yet. Furthermore, I was running into sufficient problems and taking significant enough time migrating, that I eventually stopped taking detailed notes, partially on the theory that anything left is very specific to this particular set of extensions, and would not likely be helpful to others.

Here are some details of the migration process. Unless noted otherwise, validation is with nsgmls 1.3.1. Unless noted otherwise, text changes were made in Emacs; the notation qr/one/two/ is shorthand for M-x query-replace RET one RET two RET !, which replaces string ‘one’ with ‘two’; and qrr/three/four/ is shorthand for M-x query-replace-regexp RET three RET four RET !, which replaces the regular expression ‘three’ with ‘four’.
  1. Chose a test file and altered the DOCTYPE declaration in a copy of it so that instead of calling wwp-store.dtd (our driver DTD that in turn declares the various TEI parameter entities, including the extension files, and then calls the main TEI DTD) via FPI, it now calls the TEI (P3) DTD directly and has an appropriate internal subset.
  2. Tested that P4 parsing environment works against a tiny test file. If I recall correctly I had to increase GRPCNT. 3
    • Altered path in DOCTYPE to point to P4.
    • Since declaration of extension external entities has ‘P3’ in the FPI, redeclared them to plain SYSTEM identifiers for now.
    • 18,500 errors, all but last 17 in DTDs. 18,473 of them are due to ambiguous content models (occurrences of <persName> , <placeName> , <anchor> are the culprits in all cases). 2 are mistakes of mine that shouldn't have been here now (I had been jumping ahead). The instance errors are <text> undeclared, and one of <pb> , <milestone> , or <gap> in a spot it's not allowed.
    • Removed <persName> and <placeName> from x.data in wwpstore.ent.
    • Removed <anchor> from x.notes in wwpstore.ent.
    • Re-validated, whaddayaknow, 27 errors.
    • Removed part from ATTLIST of <seg> in wwpstore.dtd.
    • Fixed the 2 mistakes that I had inadvertently created earlier.
    • Changed ‘globincl’ to ‘Incl’ in both extension files.
    • Re-validated, now 47,464 errors. All but 3 are ambiguous content models with <addSpan> , <delSpan> , <gap> , <figure> , <advertisement> , <note> , <mw> , and <seg> as the culprits. And the three are <addSpan> , <delSpan> , <gap> occurring more than once in %m.Incl;, so I'll start there.
    • Remove <addSpan> , <delSpan> , <gap> from m.globedit in wwpstore.ent.
    • Re-validated, down to 20,581 errors. All of them are ambiguous content models with <figure> , <advertisement> , <note> , <mw> , and <seg> as the culprits (same list as before but without the 3 I fixed :-).
    • Removed <mw> , <figure> , and m.notes from m.Incl in wwpstore.ent.
    • Re-validated, No errors in DTD! Only 2 errors in instance. However, the encoding is correct, and the DTD is wrong. 4
    • Changed all omitted tag minimization parameters to TEI parameter entity references. Note that since I was very consistent in how I wrote the element declarations, I was able to do this in two normal search-and-replace commands (one for EMPTY, one for content models) and one leftover (declared content ANY) by hand.
    • Case is not a problem (we've been case-sensitive since 1999-09).
    • We have no CDATA content models.
    • Inserted a bunch of missing REFCs 5 using a single Emacs query-replace-regexp command. 6
  3. Valid against P4 SGML except for the two occurrences of <mw> in <div> .
  4. Created test_that_p4_sgml_works.xml with just prose; works.
  5. I do not currently have access to a system that has both a recent enough version of OpenSP and the TEI DTDs, so I converted bg.sgml to bg.xml using the Perl program mentioned above (which basically inserts ‘/’ before the closing ‘>’ of empty element tags so long as they don't span 3 or more lines), and then hand-tweaked the DOCTYPE.
    • Changed all empty comment declarations (‘<!>’) to nothing content declarations (‘<!---->’).
    • Inserted SYSTEM identifiers for the 21 external entity declarations. Harder than it sounds; I had to find and obtain the proper XML ISO character entity sets first.
    • (At this point other obligations became pressing, and I stopped work on DTD extension migration for over a month.)
    • Made copies of the local character entity set files (wwpspec.ent and wwpgrk1.ent) in the right place, and changed the paths of the system identifiers in wwpstore.ent to match.
    • In those files, changed ‘CDATA’ to nothing globally.
    • Found XML versions of isogrk1–4 (available from the TEI or the W3C), moved them into the appropriate directory.
    • We have four ‘boilerplate’ files included into the <teiHeader> of each WWP textbase text via general entity references. For testing and migration purposes, I had redeclared these entities in the internal subset of my test files bg.sgml and bg.xml. However, xmllint was (probably appropriately) objecting to the lack of a SYSTEM identifier in the ‘original’ external declarations for these entities in the DTD extension file wwpstore.ent. Thus at this point I took the time to move over the four files, and update our CATALOG entries to point to them. Note that this is still an SGML format catalog, not an XML catalog.
    • Tried validating, found lots of problems:

      Via nsgmls: 201 errors, 110 warnings

      • 199 errors about characters not in document set. These are false errors caused by the underlying presumption that there is a 1-byte character set, I think.
      • 2 errors in the instance: same two <mw> s not permitted in <div> .
      • 68 warnings about #PCDATA in model groups
      • 6 warnings about declared values of attributes
      • 1 warning about an and group
      • 13 warnings about attribute values not a literal
      • 4 warnings about exclusions
      • 1 warning about inclusions
      • 17 warnings about missing REFCs

      Via xmllint: I'm not good at reading the error messages yet, but it looks like it quit after spitting out 8 error messages at the first occurrence of paraContent (which is also where the majority of the 68 warnings came from, I believe).

    • So, I performed quite a few fixes:
      • qr/(%paraContent;)/ %paraContent; /
      • qr/(%phrase.seq;)/ %phrase.seq; /
      • qr/(%specialPara;)/ %specialPara; /
      • qrr/%[A-Za-z0-9.-]+/\&;/ and qr/;;/;/ (yes, it's a hack)
      • qr/ NUMBER / NMTOKEN/
      • qr/ NUMBERS / NMTOKENS/
      • qr/ NAME / NMTOKEN/
      • qr/ NAMES / NMTOKENS/
      • Quoted a bunch of attribute values that should have been literals (I used regexp-search for ‘[A-CE-Za-z]$’ which worked well only because I have been consistent about keeping 1 attribute declaration per line with no whitespace at the end; avoiding ‘D’ allowed me to skip all the false positives of the ‘#IMPLIED’ lines, at the risk of a few false negatives if a true attribute ended in ‘D’ — none did).
      • Eliminated the entire section of our DTD extension that reproduced P3 element declarations unchanged except for requiring end-tags. Since the sole ampersand connector was in one of those content models, that error was thus fixed.
      • Removed declarations for <opener> and <closer> , as the change we made (adding <respLine> , i.e. n.byline) is already present in P4.
      • Fixed declarations for <p> , <seg> , <titlePart> , as they used a reference to %paraContent; which is now a complete content model.
      • Fixed declaration for <text> .
      • I then went through the entire set of declarations, checking whether each source declaration had changed from P3 to P4 and re-copying it over if need be.
      • Several more class level changes in this period, which I failed to write down.

    At this point I tackled the <mw> not permitted in <div> problem. We used to add n.fw to the globincl class. To replicate this, we needed to add it to the m.Incl class. However, in SGML globincl was expressed as an inclusion exception (on <text> ), so the fact that n.fw appeared in phrase was not a problem — putting it in m.Incl, however, results in ambiguous content models, of course. I spent approx 2 hours straightening this out; I did not record all the details, but it included removing n.fw from phrase and adding it to m.refsys.

  6. I have not attempted to validate files in SGML mode, despite having created dual-use DTDs.

After this process, we now have valid XML DTDs a la xmllint, nearly valid (only characters not in character set) via nsgmls, but still get 2 errors from the pizza chef. However, a considerable amount of work is needed on the content model for <front> , as it is not what we want, due in part to the re-arrangement of how <front> is managed — the new fmchunk entity.
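
Two of the search-and-replace fixes listed in step 5 can be written out as a sketch, for readers who do not use Emacs. This is a Python approximation of the qrr/qr commands, not the commands themselves; the function names add_missing_refc and xmlify_declared_values are assumptions. The first function mirrors the two-pass hack: append ‘;’ to every parameter entity reference, then collapse the doubled ‘;;’ that results wherever a REFC was already present.

```python
import re

# SGML declared-value keywords with no XML equivalent, and their replacements.
_KEYWORDS = {
    "NUMBER": "NMTOKEN",
    "NUMBERS": "NMTOKENS",
    "NAME": "NMTOKEN",
    "NAMES": "NMTOKENS",
}


def add_missing_refc(dtd: str) -> str:
    """Append ';' to every %name reference, then undo the doubled ';;'."""
    with_refc = re.sub(r"%[A-Za-z0-9.-]+", lambda m: m.group(0) + ";", dtd)
    return with_refc.replace(";;", ";")


def xmlify_declared_values(dtd: str) -> str:
    """Replace whitespace-delimited NUMBER(S) and NAME(S) keywords."""
    return re.sub(
        r"(?<=\s)(NUMBERS?|NAMES?)(?=\s)",
        lambda m: _KEYWORDS[m.group(1)],
        dtd,
    )


if __name__ == "__main__":
    print(add_missing_refc("<!ELEMENT p %om.RR (%phrase.seq;)>"))
    print(xmlify_declared_values("<!ATTLIST foo n NUMBER #IMPLIED>"))
```

Parameter entity declarations such as <!ENTITY % x.front ...> are untouched because the ‘%’ there is followed by a space. As with the original, this is a hack: any literal ‘;;’ elsewhere in the file would also be collapsed.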

Summary and Evaluation

For the WWP, migrating our instances to XML syntax was trivial. Although migrating them into an XML environment was a bit more difficult, it was still pretty easy, and well worth the effort.

Migrating a small, clean (i.e., without much element class manipulation) set of DTD extensions also turned out to be very easy, and well worth the effort.

Migrating a massive, complicated set of DTD extensions has turned out to be very difficult. Projects in this category may need to do a more careful analysis of the benefits of migrating to XML, as the cost can be significant. Furthermore, unless the original author of the extensions or other TEI DTD expertise is available in-house, these projects should consider hiring outside expert assistance.

British National Corpus

About the Project

I have been saying that migrating the BNC from SGML to XML would be a trivial problem for approximately five years. My bluff was finally called when the Open University asked us to produce a 4 million word subset of the BNC in XML, sampled according to their criteria for use in a new grammar teaching course.

About the Collection

The British National Corpus (BNC) is a 100 million word snapshot of British English taken at the end of the 20th century. It contains 4130 distinct texts, sampled from a very wide range of materials both spoken and written.

Nature of the Content

The BNC includes samples of most written genres, including newspapers, novels, textbooks, ephemera, etc. It also contains transcripts of a very wide variety of spoken English, including informal conversation, radio broadcasts, meetings, lectures, consultations, etc.

Nature of the Encoding

All the material is in English.

In addition to the usual TEI structural tagging, the texts are segmented into sentence-like units and words; each word carries a POS (part of speech) code.

The BNC has its own DTD, using the TEI prose base, the corpus additional tagset, and a number of modifications to the basic TEI model, as further described in the Users Reference Guide. The most recent edition of this Guide includes a section on TEI conformance which explains in excruciating detail the TEI Extension files used to define the BNC DTD.

The tagging makes heavy use of SGML minimization features; notably for part of speech (POS) coding. For example, here is a heading at the start of text A1l:
<head type=MAIN> <s n="1"><w VVG-AJ0>Ripping <w NN2>yarns <w CJC>and <w AJ0>moral <w NN2>minefields<c PUN>: <w NP0>Allan <w CJC>and <w NP0>Janet <w NP0>Ahlberg <w NN1-VVB>talk <w PRP>to <w NP0>Celia <w NP0>Dodd <w PRP>about <w DPS>their <w NN2>bestsellers <w PRP>for <w NN2>children </head>

Motivations for Migration

Desire to use new XML-based tools and to add to the way the corpus can be processed. For example, we can now offer an XSLT stylesheet to convert the texts to HTML for display.

Barriers to Migration

Laziness. No, cost, size, and complexity, chiefly.

About the Migration Samples

For the needs of the OU project, we selected texts totalling one million words for each of four subcorpora: demographically-sampled speech, newspapers, academic prose, and fiction.

The texts chosen were entirely typical of the rest of the BNC as far as format and tagging goes. The process of converting them was entirely automatic.

Notes on the Conversion Process

I spent a lot of time in March defining the process of migrating the DTD, and also drafted a brief document (now on the BNC website at http://www.natcorp.ox.ac.uk/migration.html ) on how to do this properly. What this document does not tell you is that in the end I had to fudge the BNC SGML DTD somewhat. After I had completed the work described in the migration document referred to, and delivered the first few sample XML files, I realised that the BNC suffered badly from what I will call the Underspecification Gotcha. The UG comes about when you have an SGML DTD which makes liberal use of default values for attributes, with declarations such as <!ATTLIST foo bar (ptrzbie|farble|wibble|zip) "zip">

Why is this a problem? Because when such a DTD is used to convert a document to XML, every occurrence of <foo> which does not specify a value for its bar attribute will be converted to read <foo bar="zip"> . And why is that a problem? Well, maybe the BNC project was exceptional in having encoders who did not realise that not specifying a value in some circumstances meant something different from not specifying a value in others. (But I doubt it).

In any case, I went back to the nice TEI-conformant SGML DTD, and mercilessly hacked it so that all defaulted attributes were given the value of #IMPLIED, even where their declared content was given explicitly in the DTD. So, for example, <!ATTLIST foo bar (yes|no|maybe) "maybe"> became <!ATTLIST foo bar (yes|no|maybe) #IMPLIED> passim.
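
The kind of rewrite described above can be sketched as follows. This is a hypothetical Python illustration, not the procedure actually used on the BNC DTD; the function name imply_defaults is an assumption. Within each ATTLIST declaration it replaces every quoted default value with #IMPLIED, leaving #FIXED values alone.

```python
import re

# A quoted literal, optionally preceded by the #FIXED keyword.
_LITERAL = re.compile(r"""(#FIXED\s+)?("[^"]*"|'[^']*')""")


def _imply(literal_match):
    # Keep '#FIXED "value"' as it is; any other literal is a default value.
    if literal_match.group(1):
        return literal_match.group(0)
    return "#IMPLIED"


def imply_defaults(dtd: str) -> str:
    """Replace defaulted attribute values with #IMPLIED in ATTLIST declarations."""
    return re.sub(
        r"<!ATTLIST[^>]*>",
        lambda m: _LITERAL.sub(_imply, m.group(0)),
        dtd,
    )


if __name__ == "__main__":
    print(imply_defaults('<!ATTLIST foo bar (yes|no|maybe) "maybe">'))
```

With the defaults removed, a <foo> that carried no bar attribute in the SGML source stays without one in the XML output, so the distinction between ‘not specified’ and ‘specified as the default’ is not silently erased. The sketch assumes no ‘>’ or quoted text appears inside comments within a declaration.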

The BNC has a typical TEI Corpus structure, with a corpus header that contains codebooks used to validate individual texts by means of the usual SGML IDREF/ID mechanism. The corpus header also contains declarations for two specific speakers who appear in many of the BNC spoken texts (PS00: the unknown participant, and PS01: the unknown group participant). To avoid the nuisance of having to include the corpus header (and then remove it again) when processing individual texts, I modified the SGML DTD so that all IDREF valued attributes were changed to CDATA. Another, even more shameful, admission: a significant number of texts were found to be invalid against the intended XML DTD, although the application of minimization rules had made them appear to be valid against the SGML DTD. The reasons for this are murky, but reinforce in me the feeling that inclusions and minimization are indeed the lapses of good taste which I have long suspected them to be.

The BNC SGML files contain several thousand named character entity references. As delivered, the corpus provides declarations for these as SDATA entities declared explicitly. I used two sets of entity declarations: the one used in SGML land mapped each entity to a null string; the one used in XML land mapped each entity to the equivalent Unicode character number reference. This meant that I could use the exceptionally cool facility provided by OSX of retaining character entity references unconverted when parsing the SGML version, but also run the results through an XSLT transform which would produce meaningful Unicode character entity references.
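
The two parallel sets of entity declarations might be generated from a single name-to-code-point table, as in the sketch below. This is an assumption about how one could do it, not a description of how the BNC files were produced; the function name entity_declarations and the two sample entities are hypothetical.

```python
def entity_declarations(code_points, xml=True):
    """Build one declaration per entity name.

    With xml=True each entity expands to a numeric character reference;
    with xml=False each expands to an empty string, as in the SGML-side set.
    """
    lines = []
    for name in sorted(code_points):
        value = f"&#x{code_points[name]:04X};" if xml else ""
        lines.append(f'<!ENTITY {name} "{value}">')
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {"eacute": 0xE9, "pound": 0xA3}  # illustrative entries only
    print(entity_declarations(sample, xml=True))
    print(entity_declarations(sample, xml=False))
```

Entities standing for markup-significant characters (such as the ampersand or less-than sign) would need extra escaping and are outside the scope of this sketch.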

Interestingly, I can report that the BNC contained entity declarations for the following five non-Unicode characters:
<!ENTITY formula "[formula]" >
<!ENTITY frac17 "[frac17]" >
<!ENTITY FRAC19 "[frac19]" >
<!ENTITY frac47 "[frac47]" >
<!ENTITY shilling "[shilling]" >
The first of these is a BNC stand-in for any non-transcribed formula. Its expansion should really be <gap desc="formula"/> . The last of these should maybe expand to the string /-. The others I am not sure what to do with: fortunately none of them appears in the texts in our current sample.
I wrote a shell script called xmlify which carries out the following steps:
  • extract a filename from the BNC user file identifier
  • produce a wrapper file which can be submitted to an SGML parser
  • run OSX on this file, with parameters which retain both internal and external entity references
  • run an XSLT transformation to ‘pretty print’ the XML file generated in the previous step, thus replacing the named character entity references by appropriate character number references;
  • (optionally) run another XSLT transformation to generate an HTML version of the XML file generated in the previous step
The stylesheet for pretty printing carries out the following transforms:
  • indents the output nicely
  • ensures that each <s> element starts on a new line
  • adds an appropriate <change> element to the <revisionDesc> element in the document header
  • removes any TEIFORM attribute whose value is identical to the gi of the element carrying it
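
The last of these transforms is simple enough to show in miniature. The sketch below is a Python stand-in using the standard library's ElementTree, not the XSLT stylesheet itself; the function name drop_redundant_teiform and the sample document are assumptions, and the comparison is an exact, case-sensitive string match.

```python
import xml.etree.ElementTree as ET


def drop_redundant_teiform(root):
    """Delete TEIFORM wherever its value equals the element's own name (gi)."""
    for element in root.iter():
        if element.attrib.get("TEIFORM") == element.tag:
            del element.attrib["TEIFORM"]
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        '<text><head TEIFORM="head">x</head><s TEIFORM="seg">y</s></text>'
    )
    drop_redundant_teiform(doc)
    print(ET.tostring(doc, encoding="unicode"))
```

An attribute such as TEIFORM="seg" on an <s> element is kept, because there it carries information the element name does not.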

The xmlify shell script, and the XSLT prettyprint stylesheet are available on the task force's tools page; they are also available along with the DTD files etc. that were used in a single archive file.

Summary and Evaluation

This work took longer than expected, of course. It is not yet complete — I haven't tried running it over the whole BNC, though I have run it to produce the four million word OU sample. However, it was a satisfactory demonstration of the advantages of sticking to the standard TEI route. Problems that arose were almost all caused by deviation from the path of righteousness.

I have always maintained that converting the BNC to XML would be prohibitively expensive in diskspace. Well, diskspace is cheap, but here are some figures to go with the assertion.
corpus  files  SGML    XML     factor  SGMLzip  XMLzip  factor
Aca     30     15,192  27,772  1.83    3196     3740    1.17
Fic     25     15,576  29,164  1.87    3456     4004    1.16
Dem     29     21,320  39,028  1.83    3936     4956    1.26
News    94     15,220  27,584  1.81    3696     4240    1.15
This table gives for each of the four 1 million word samples of the BNC XML corpus the number of files it contains, their total size in KiB as uncompressed SGML and XML, and as ZIP archives. The increase in size when going from SGML to XML is far less significant (between 1.15 and 1.26) for the compressed files than it is for the uncompressed files (where the factor is a fairly steady 1.8), because of the repetitiveness of the XML encoding.
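
The growth factors can be recomputed from the sizes reported for the four samples. The short Python sketch below does only that arithmetic; the dictionary layout and the function name growth_factors are assumptions for illustration, while the sizes themselves are the figures given for each subcorpus.

```python
# Sizes in KiB per subcorpus: (SGML, XML, SGML zip, XML zip).
SIZES_KIB = {
    "Aca": (15192, 27772, 3196, 3740),
    "Fic": (15576, 29164, 3456, 4004),
    "Dem": (21320, 39028, 3936, 4956),
    "News": (15220, 27584, 3696, 4240),
}


def growth_factors(sgml, xml, sgml_zip, xml_zip):
    """Return (uncompressed factor, compressed factor), rounded to 2 places."""
    return round(xml / sgml, 2), round(xml_zip / sgml_zip, 2)


if __name__ == "__main__":
    for name, sizes in SIZES_KIB.items():
        print(name, *growth_factors(*sizes))
```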

The MULTEXT-East 1984 Corpus

About the Project

The EU Copernicus MULTEXT-East project (Multilingual Text Tools and Corpora for Central and Eastern European Languages, http://nl.ijs.si/ME/) is a spin-off of the EU MULTEXT project. MULTEXT is working to develop standards, specifications, and tools for the encoding and processing of linguistic corpora in a wide variety of languages. MULTEXT-East, which ran from 1995 to 1997, developed language resources for six Central and Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene) and for English, the 'hub' language of the project. It also adapted existing tools and standards to these languages. The main results of the project were an annotated multilingual corpus and lexical resources for the seven languages. The most important resource turned out to be the parallel corpus of George Orwell's novel 1984, in the English original and in translation, heavily annotated with linguistic information.

MULTEXT-East resources have been used in a number of studies and experiments. In the course of such work, errors and inconsistencies were discovered in the MULTEXT-East specifications and data, most of which were subsequently corrected. But because this work was done at different sites and in different manners, the resources' encoding began to drift apart. The EU Copernicus project Concede (Consortium for Central European Dictionary Encoding), which ran from 1998 to 2000 and comprised many of the same partners as MULTEXT-East, offered the possibility of bringing the versions back onto a common footing. Although Concede was primarily devoted to machine-readable dictionaries and lexical databases, one of its workpackages did consider the integration of its dictionary data with the MULTEXT-East corpus. In the scope of this workpackage, the corrected morphosyntactically annotated corpus was normalised and re-encoded. The Concede release of the MULTEXT-East resources contains the revised and expanded morphosyntactic specifications, the revised lexica, and the significantly corrected and re-encoded 1984 corpus. This edition, V.2, is freely available, under a research license, from http://nl.ijs.si/ME/V2/.

This report documents the SGML to XML conversion process of the Concede edition of the MULTEXT-East 1984 multilingual corpus.

About the Collection

The MULTEXT-East parallel corpus contains the novel 1984 in the original English and in six translations. The novel is approximately 100,000 words in length and is composed of four parts, each consisting of a number of chapters. The complete seven-language corpus contains 46,626 sentences, 618,879 word tokens, and 125,016 punctuation tokens.

The corpus file structure is rather complex; the files relevant for the TEI migration are the following, where xx = bg, cs, en, et, hu, ro, sl:
orwell.xml
    The document file with the corpus header for, and reference to, ohdr-xx
ohdr-xx.tei
    TEI text header for, and reference to, oana-xx
oana-xx.tei
    Text and annotations of '1984' in language xx
msd.tei
    TEI document for morphosyntactic annotation: header for, and reference to, msd-flib and msd-fslib (further explained below)
msd-flib.tei
    TEI feature library with attribute-value tables
msd-fslib.tei
    TEI feature structure library with lexical MSDs
The full corpus is 35 MiB, or 4.5 MiB compressed.

The corpus is meant primarily as a dataset for the development and testing of language technology methods and tools. It has already been used to develop and test machine learning techniques for part-of-speech tagging, word alignment, and word-sense disambiguation. While this research by computational linguistics and language technology professionals has resulted in a relatively large bibliography (a probably somewhat outdated list is given here), the corpus has not, as far as I am aware, been used by a more general audience interested in the novel itself.

Nature of the Content

The novel itself is well known, and deals with an anti-utopian vision of a Stalinist society. An interesting feature from the linguistic perspective is that the novel contains a number of ‘Newspeak’ words (such as Miniluv, doublethink, plusgood, etc.) and, in fact, an Appendix giving the introduction to Newspeak. The interest comes in studying the translations of these non-words, and the varying strategies the translators have used to translate them. Also, any systems processing the text are almost guaranteed a certain number of unknown words.

Nature of the Encoding

The DTD is a parametrization of TEI and uses the following tagsets: TEI.prose, TEI.linking, TEI.analysis, TEI.fs and no local extensions. Character encoding is via ISO 8879:1986//ENTITIES.

The corpus is rooted in the <teiCorpus.2> element, which consists of the header and seven <TEI.2> elements, each one containing one translation of the novel. The corpus and component TEI headers are quite detailed, and include editorial and tags declarations and source and revision descriptions. Each translation is divided into parts and chapters (encoded as <div> elements), and these into paragraphs (encoded as <p> elements).

The most important aspect of the corpus is its linguistic annotations. Hand-validated sentence elements are marked with IDs and serve as the alignment segments. They contain the TEI.analysis word and punctuation tokens ( <w> and <c> ). Word tokens have two attributes, lemma and ana: the latter has as its IDREF value the morphosyntactic description (MSD) of the word in question. E.g.,
<w lemma="clock" ana="Ncnp">clocks</w>
where the MSD signifies "Noun common neuter plural".
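
To make the token structure concrete, here is a minimal sketch of how such a <w> element might be read with Python's standard xml.etree.ElementTree module. The function name read_word is invented for this illustration; the project itself did not necessarily process tokens this way.

```python
import xml.etree.ElementTree as ET


def read_word(xml_fragment):
    """Parse one <w> token and return (surface form, lemma, MSD reference)."""
    w = ET.fromstring(xml_fragment)
    # The ana attribute holds the ID of the MSD, not the description itself.
    return w.text, w.get("lemma"), w.get("ana")


TOKEN = read_word('<w lemma="clock" ana="Ncnp">clocks</w>')
print(TOKEN)
```

The returned MSD reference still has to be resolved against the feature structure library before it yields a readable description.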

MSD IDs are defined in a feature structure library, <fsLib>, contained in a dedicated <TEI.2> element. Each <fs> defines an MSD, specifies which languages it is appropriate for, and describes its decomposition into features. The features are defined in a feature library, <fLib>, which assigns each feature value an attribute and value name.
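
The lookup chain from an MSD ID to its features can be sketched as follows. This is a hedged illustration only: the element names <fsLib>, <fs> and <fLib> come from the description above, but the <libs> wrapper, the attribute names (id, select, feats, name), the <f> and <sym> children, and the feature IDs are assumptions made for the example and may not match the actual MULTEXT-East files.

```python
import xml.etree.ElementTree as ET

# Hypothetical library fragment; attribute names and feature IDs are assumed.
LIBRARIES = """
<libs>
  <fLib>
    <f id="N1.c" name="Type"><sym value="common"/></f>
    <f id="N3.p" name="Number"><sym value="plural"/></f>
  </fLib>
  <fsLib>
    <fs id="Ncnp" select="en" feats="N1.c N3.p"/>
  </fsLib>
</libs>
"""


def expand_msd(root, msd_id):
    """Resolve an MSD ID into a dict of attribute name -> value name."""
    # Index the feature library by feature ID.
    features = {f.get("id"): f for f in root.iter("f")}
    for fs in root.iter("fs"):
        if fs.get("id") == msd_id:
            result = {}
            # feats is treated as a space-separated list of feature IDs.
            for ref in fs.get("feats").split():
                feature = features[ref]
                result[feature.get("name")] = feature.find("sym").get("value")
            return result
    raise KeyError(msd_id)


ROOT = ET.fromstring(LIBRARIES)
EXPANDED = expand_msd(ROOT, "Ncnp")
print(EXPANDED)
```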

In producing the Concede version, special care was taken to make the SGML encoding XML-like, with quoted attributes and camelCased GIs. The XML flavour was, to an extent, enforced by a special SGML declaration for the corpus, but was otherwise achieved through instructions to the contributors of the language components.

Motivations for Migration

On the practical level, the aim was to enable XML processing, in particular XSLT. More generally, there is a need to keep the corpus abreast of current encoding practices.

Barriers to Migration

The only barrier was the effort required to implement the migration.

About the Migration Samples

Although the migration was eventually undertaken for the full corpus, we started with a sample for initial tests and distribution in the scope of the working group. This sample contains only the first chapter (37,014 of the corpus's 618,879 word tokens), but retains the structure of the full corpus and includes the complete ancillary files, such as the FS libraries that define the word-level syntactic tagset.

The sample, like the complete corpus, was not expected to present any great challenges to migration, except for the marked sections in the corpus document. If the entity %ONETEXT; was set to IGNORE, the corpus was processed as a whole (and hence defined the language and other IDs in the <teiCorpus.2> header). If it was set to INCLUDE, each language was taken to constitute its own SGML document (and hence defined the IDs in its <TEI.2> header).

Notes on the Conversion Process

As mentioned, we did not expect any real problems with the migration and did not experience many. The manner in which the conversion was performed is probably unorthodox: instead of using an automatic conversion tool (such as osx), we considered the Emacs editor's simple search-and-replace function a satisfactory and not too time-consuming method. This, of course, would not be an option for a project with a greater number of files and modifications. Below is the list of steps we performed to migrate from P3 SGML to P4 XML:

  1. Much to my embarrassment, the SGML sample itself was found to contain one SGML error in the file orwell.sgml: <tagUsage gi="c" occurs=""6117"">. This was corrected.
  2. One instance of an attribute value was found to be unquoted (file ohdr-en.tei): <availability status=restricted>. This was quoted, to conform to XML syntax.
  3. The empty elements used (actually this is only <sym>) were converted to XML syntax.
  4. The marked section discussed above was dropped: %ONETEXT; was assumed to be IGNORE and the section was commented out.
  5. The last step of the conversion involved making a new TEI P4 XML DTD with PizzaChef.
  6. Finally, the P4 corpus was validated with the rxp validating parser.
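
Steps 2 and 3 were done by hand in the editor, but for a larger collection they could be scripted. The following Python sketch shows one possible approach with regular expressions; it is not what we actually ran. The function names are invented for the illustration, the value attribute on <sym> is assumed, and the patterns deliberately ignore comments, declarations, and attribute values containing '>'.

```python
import re

# Start tags only; comments and declarations begin with "<!" and are skipped.
TAG = re.compile(r"<[A-Za-z][^<>]*>")
# An attribute with a double-quoted, single-quoted, or unquoted value.
ATTR = re.compile(r"""([\w.:-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>/]+)""")
# A <sym> start tag, with or without the XML empty-element slash.
EMPTY_SYM = re.compile(r"<sym\b([^<>]*?)\s*/?>")


def quote_attributes(text):
    """Step 2: put double quotes around unquoted attribute values."""
    def fix_attr(match):
        name, value = match.group(1), match.group(2)
        if value[0] in "\"'":
            return match.group(0)  # already quoted, leave untouched
        return f'{name}="{value}"'

    def fix_tag(match):
        return ATTR.sub(fix_attr, match.group(0))

    return TAG.sub(fix_tag, text)


def close_empty_sym(text):
    """Step 3: rewrite <sym ...> as the XML empty element <sym .../>."""
    return EMPTY_SYM.sub(r"<sym\1/>", text)


def sgml_to_xml(text):
    return close_empty_sym(quote_attributes(text))


print(sgml_to_xml("<availability status=restricted>"))
print(sgml_to_xml('<sym value="plural">'))
```

Even with such a script, the doubled quotes in step 1 would still need a manual fix, and the result should be checked with a validating parser as in step 6.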

Summary and Evaluation

The conversion process turned out to be quite simple, partly due to the initial XML-like encoding of the data. Still, it was surprising that the corpus contained mistakes in the application of XML conventions. This leads to the (probably obvious) conclusion that only a validating parser can guarantee syntactic well-formedness.

As noted above, our simple conversion via the emacs editor would not be an option with a project that contains a greater number of files and modifications to be performed. However, in our case it was an acceptable tool.

It would be difficult to say how long the migration process took with the original sample, as effort was made to answer the WG questionnaire and to explore the various options offered for implementing the conversion. But performing the subsequent migration on the whole MULTEXT-East corpus took only about two hours, and this includes updating the corpus headers.

The XML-ized Concede edition of the MULTEXT-East corpus, together with additional resources, should shortly be released on the MULTEXT-East Web site, as V3.

Corpus of Middle English Prose and Verse

About the Project

This collection of Middle English texts, begun in 1993, was assembled from works contributed by University of Michigan faculty and from texts provided by the Oxford Text Archive, as well as works created specifically for the Corpus by the Humanities Text Initiative (HTI). The HTI is grateful for the permission of all contributors. All texts in the collection are valid SGML documents, tagged in conformance with the TEI Guidelines and converted to the TEI Lite DTD for wider use.

The HTI intends to develop the Corpus of Middle English Prose and Verse into an extensive and reliable collection of Middle English electronic texts, either by converting the texts ourselves or by negotiating access to other collections produced to specified high standards of accuracy. HTI wants the corpus to include all editions of Middle English texts used in the Middle English Dictionary and the more recent scholarly editions, which in some cases may have superseded them.

About the Collection

At present, sixty-one texts are publicly available and more than sixty others will be coming on-line soon. Currently, 30 MiB of data are online; the expanded collection will be 88 MiB. Texts vary in size from 23 KiB to 6 MiB and are encoded in SGML using a P3 TEI Lite-based DTD.

The collection's audience is made up of scholars, students, and general readers interested in Middle English prose and verse. It is part of the Middle English Compendium but can stand on its own. The Corpus is provided with a full array of search mechanisms, so that texts may be searched individually, in user-designated groups, or collectively. In 2002, users from around the world conducted 30,000 searches, 9,000 browses, and 320,000 text views.

Nature of the Content

The content is primarily electronic versions of nineteenth-century print publications, but includes some original translations and transcriptions. The content encompasses poetry, drama, prose fiction and non-fiction. Only five of the texts currently delivered have associated external image files. All of the new texts have page images associated with them through image references encoded in attributes in the page break elements.

The base language of all the texts is Middle English, but other languages are represented, including French, German, Greek, Latin, and Old English.

Nature of the Encoding

The level of markup encoding is level 4, ‘basic content analysis,’ as described in the Digital Library Federation's TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices. This level of encoding aspires to ‘create text that can stand alone as electronic text, identifies hierarchy and typography, specifies function of textual and structural elements, and describes the nature of the content and not merely its appearance. This level is not meant to encode or identify all structural, semantic or bibliographic features of the text.’ However, the texts predate the TEI in Libraries Guidelines and have several inconsistencies, such as unnumbered nested <div>s in front- and backmatter elements and division numbering that begins at <div0>.

The SGML is normalized and so does not make use of tag minimization or unquoted attributes.

Motivations for Migration

The main motivation behind the migration was to be able to take advantage of the many XML and XML-related tools and technologies that are now available, including our own locally-developed digital library middleware, which will provide more functionality than the current online implementation. In addition, we were taking the opportunity to add more material to the collection and to harmonize encoding practices that had changed (intentionally and unintentionally) over time. In addition to the division numbering practices mentioned above, in the earliest days of the HTI we had a number of text encoders who had worked on the Middle English Dictionary and had specialist interests in various aspects of the texts. Some taggers supplied missing text headings or corrected errors in editions with which they were familiar using different encoding practices.

Barriers to Migration

The largest barriers to migration were the encoding harmonization, which added unnecessary complications to an otherwise straightforward task, and the few unusual characters encountered in 19th-century publications of Middle English manuscripts. We did not find Unicode representations for all of the characters, some of which had locally invented character entities. These are displayed as [[entname]]. This is a continuing problem in encoding these older texts, where special characters are used and where even specialists are not in agreement about what exactly they mean. A recent example from EEBO is a G with two dots above, two dots below, and a dot on each side. It is clear in context that it is some unit of measurement, but a measure of how much?

About the Migration Samples

We chose the CME as our migration collection because it is one of our oldest locally-created collections; it contains a variety of genres, languages, and special characters; and it is small and publicly available (unlike the EEBO files we sent as examples to the migration team). In addition, we have a new batch of recently produced texts to add to the collection. As we will be adding to this collection after many years without growth, it seemed to be an ideal time to migrate.

Notes on the Conversion Process

We generally followed the procedures laid out in the TEI migration guidelines. We used sx for general SGML to XML conversion, a locally developed stylesheet for encoding harmonization (e.g., renumbering nested divisions to start with <div1>), and the tei2tei.xsl stylesheet to normalize element and attribute names and cases. As much as possible, character entities in the SGML files were replaced by their Unicode numeric values. A handful of locally-created character entities had to be handled through creative substitution during the conversion process. We now have osx available, and it appears that non-Unicode entity issues could be avoided through use of this tool. This would have been helpful, especially as we are not yet prepared to index and search UTF-8 encoded files.
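
The entity replacement step can be sketched in Python along the following lines. This is an illustration under stated assumptions, not the code we ran: the entity table is a three-entry subset of the ISO names, the entity name gdots is hypothetical, and reusing the [[entname]] display convention as the fallback for unmapped entities is a choice made for the example.

```python
import re

# A tiny illustrative subset of ISO entity names and their code points.
KNOWN_ENTITIES = {
    "thorn": 0x00FE,  # LATIN SMALL LETTER THORN
    "eth": 0x00F0,    # LATIN SMALL LETTER ETH
    "aelig": 0x00E6,  # LATIN SMALL LETTER AE
}
# Entities predefined by XML itself are left alone.
XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}
# Named entity references only; numeric references start with "#".
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def replace_entities(text):
    """Swap named entities for numeric references; flag unknown ones."""
    def substitute(match):
        name = match.group(1)
        if name in XML_BUILTINS:
            return match.group(0)
        if name in KNOWN_ENTITIES:
            return f"&#x{KNOWN_ENTITIES[name]:04X};"
        # No Unicode mapping known: mark the entity by name.
        return f"[[{name}]]"

    return ENTITY.sub(substitute, text)


SAMPLE = replace_entities("&thorn;at &gdots; of gold &amp; siluer")
print(SAMPLE)
```

Because the output uses numeric character references, the files themselves remain plain ASCII, which matters while indexing of UTF-8 encoded files is not yet possible.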

Summary and Evaluation

Aside from the analysis of variant encoding practice and non-standard character entities, and the creation of strategies to deal with them, it was a straightforward process. We were helped by the fact that we already use sx and XSLT in current practice, and hindered by an over-ambitious plan of work.

Japanese Text Initiative, Electronic Text Center, University of Virginia Library

About the Project

The Electronic Text Center at the University of Virginia Library was founded in 1992. The Etext Center builds and maintains an on-line archive of tens of thousands of SGML and XML-encoded electronic texts; it is also a library service point that trains faculty, staff and students in the creation and analysis of electronic texts. In 1996, the Etext Center launched its Japanese Text Initiative (JTI). Founded in partnership with the University of Pittsburgh, the JTI uses an in-house team of specialists to create electronic versions of canonical Japanese literary texts.

About the Collection

The JTI currently contains several hundred titles, including Genji Monogatari, Manyoshu, and other large multi-volume works. This collection serves an international audience, including many visitors from Japan and the Far East; only a tiny percentage of visitors come to the site from the University of Virginia.

Nature of the Content

The JTI collection includes poetry, drama, memoirs, and prose fiction, ranging in date from the medieval period up through the mid-twentieth century. The structure of the texts varies dramatically: the basic divisional unit can be anything from a single line of poetry to an entire chapter of prose. The collection contains many lengthy multi-volume anthologies, which are often broken out into separate files for each volume (this was done to aid local processing and is something we'd eventually like to fix).

Nature of the Encoding

The JTI files are encoded in EUC, a Unix-based double-byte encoding scheme for Japanese characters. EUC is widely supported by web browsers and is also supported by OpenText, the search tool we currently use to index our data (neither OpenText nor XPat, its current iteration, supports Unicode). Many of the texts are actually keyboarded on Windows machines in Shift-JIS encoding but are converted to EUC when they are transferred to the Unix server for delivery.
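
The transcoding step itself is simple. As a hedged illustration (this is not the tool we use on the server), the following Python sketch converts a Shift-JIS byte string to EUC and then on to UTF-8. It assumes that 'EUC' here means EUC-JP and relies on Python's standard shift_jis and euc_jp codecs; the function name transcode is invented for the example.

```python
def transcode(data, source="shift_jis", target="euc_jp"):
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


TITLE = "源氏物語"  # Genji Monogatari
KEYBOARDED = TITLE.encode("shift_jis")  # as typed on a Windows machine
DELIVERED = transcode(KEYBOARDED)       # as stored on the Unix server
AS_UTF8 = transcode(DELIVERED, source="euc_jp", target="utf-8")

print(len(KEYBOARDED), len(DELIVERED), len(AS_UTF8))
```

Characters that exist in neither legacy code table (the obsolete kanji handled through SDATA entities) cannot be converted this way and would raise an encoding error.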

The level of encoding is consistent with Level 4 as described in the TEI in Libraries Guidelines (TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices). In other words, we have encoded basic hierarchical and typographic features, but not all bibliographical or semantic features. In general, the prose works in the collection feature very light markup, while poetry and drama are encoded more thoroughly, with annotations, attributions, and more intricate structural features identified.

The JTI uses a straight TEI Lite DTD, with no extensions or modifications. In addition to the standard ISO character sets, the files use a lengthy catalog of SDATA entities to represent characters that are not present in the EUC code tables (mostly obsolete kanji appearing in the older texts).

As is generally the case at the Etext Center, the JTI has always produced very strict ‘XML-like’ SGML files — there's no tag minimization, attributes always appear in quotation marks, and elements are always in correct camelCase. None of the non-XML compliant SGML features (like SUBDOCS) are ever used.

Motivations for Migration

The University of Virginia Library is developing an XML-based integrated digital repository and the Etext Center will be expected to migrate all of its collections, including the JTI, into this new system. The new system will rely heavily on servlet-driven XSLT stylesheets for text dissemination, and although the indexing tool for this system has not yet been selected, it will most likely be incompatible with SGML. The migration process is especially useful for the JTI texts because it provides a good opportunity to convert the documents to Unicode, which contains a number of kanji not available in EUC and also allows us to combine Japanese characters with other non-Western character sets.

Barriers to Migration

Because the new digital repository is not yet in place, the transition will be slightly awkward and will probably require us to run SGML and XML systems simultaneously for some time. We are in fact running two parallel systems right now — an XML system for new materials that are being created in XML and a separate system for legacy data. Eventually the