							-*-indented-text-*-


James Joyce: Ulysses.  Critical and synoptic edition
====================================================

2002-09-24 / Tobias Rischer


This is a sample for testing SGML->XML conversion procedures, provided
by Tobias Rischer <tobias@rischer.com> to the XML Conversion Workgroup
of the TEI.  

           *****************************************************
           ** For reasons of copyright, this SGML material is **
           **          FOR INTERNAL USE ONLY.                 **
           *****************************************************

This is the demo version of a TEI/SGML encoding of:

	James Joyce: Ulysses.  A critical and synoptic edition.
	Ed. Hans-Walter Gabler et al., 1984/86.

It is "demo" because it only contains the first few pages of each
chapter; some small changes were necessary in the rest of the document
to adapt to this (all due to dangling IDREFs).

This material was created by me in 1997 as part of my diploma thesis in
computer science.  I regret some of the things I did in there.

				   *

About the SGML document:
------------------------

File overview: 

catalog		- Catalog file for DTD and entities
iso/		- ISO entity definitions
tei/		- TEI P3 DTD and entity definitions
uly01.sgml	\
...		 > the chapters of Ulysses, reduced to a few pages each
uly18.sgml	/
uly_bib.sgml	- bibliography of printed editions of Ulysses
uly_ext.dtd	- DTD for my extensions to TEI P3
uly_ext.ent	- parameter entities for my extensions to TEI P3
uly_new.ent	- my own entity definitions for characters not in ISO
uly_spec.sgml	- explanations of very specific layout issues in the text
uly_syn.sgml	- list of "synoptic levels", i.e. layers of textual history
uly_tag.sgml	- TEI-conforming tag set documentation for my extensions
uly_tn.sgml	- "textual notes", an appendix with critical annotations
ulysses.sgml	- the main file of the document

The file "ulysses.sgml" is the main file; the text of the chapters and
the other SGML files are drawn in through the entity mechanism.
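
As a rough illustration of this mechanism, the following Python sketch
lists the external general entities declared in a main file.  The
sample declarations and the entity names in them are made up for the
example; the real ones in "ulysses.sgml" may look different, and may
be resolved through the catalog rather than through system identifiers.

```python
import re

# Matches declarations such as <!ENTITY uly01 SYSTEM "uly01.sgml">.
# Parameter entities (<!ENTITY % name ...>) are deliberately not matched.
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+([A-Za-z][\w.-]*)\s+SYSTEM\s+(?:"([^"]*)"|\'([^\']*)\')',
    re.IGNORECASE,
)


def external_entities(sgml_text):
    """Return {entity name: system identifier} for external general entities."""
    found = {}
    for match in ENTITY_DECL.finditer(sgml_text):
        name = match.group(1)
        # Exactly one of the two quoting alternatives has matched.
        system_id = match.group(2) if match.group(2) is not None else match.group(3)
        found[name] = system_id
    return found


# Hypothetical declarations, for illustration only.
sample = """
<!ENTITY % uly.ext SYSTEM "uly_ext.ent">
<!ENTITY uly01 SYSTEM "uly01.sgml">
<!ENTITY ulybib SYSTEM 'uly_bib.sgml'>
"""
print(external_entities(sample))
```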

The document should be completely self-contained; if you run 

	nsgmls -s ulysses.sgml

it can take some time, but it should pass without errors.
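
For scripted checks, the same call can be wrapped as in the following
Python sketch.  It assumes that nsgmls is on the PATH, that it writes
its messages to standard error, and that it exits with a non-zero
status when it finds errors.  The optional catalog argument uses the
"-c" option and is only needed if the catalog is not picked up
automatically.

```python
import shutil
import subprocess


def nsgmls_command(main_file="ulysses.sgml", catalog=None):
    """Build the argument list for a validation-only nsgmls run."""
    command = ["nsgmls", "-s"]  # -s suppresses the normal output
    if catalog is not None:
        command += ["-c", catalog]
    command.append(main_file)
    return command


def validate(main_file="ulysses.sgml", catalog=None):
    """Return (ok, messages); ok is None when nsgmls is not installed."""
    if shutil.which("nsgmls") is None:
        return None, "nsgmls not found on PATH"
    result = subprocess.run(
        nsgmls_command(main_file, catalog),
        capture_output=True,
        text=True,
    )
    # Assumption: a non-zero exit status means that errors were reported.
    return result.returncode == 0, result.stderr


print(" ".join(nsgmls_command()))
```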

This SGML document uses the TEI P3 (green book) and the ISO character
entity sets, but adds extensions using the mechanisms recommended by P3.

The TEI tagsets used are Prose, Drama, Textual Criticism, Transcription
of Primary Sources, Linking and Figures.

The SGML was mostly created automatically by a conversion process from
legacy data.  The SGML files are very hard to read because a lot of
material is embedded into the text; also, the principles of the Gabler
edition and the way I cast them into SGML are perhaps not entirely
straightforward.

				   *

Specific conversion problems / requirements
-------------------------------------------

The document makes heavy use of tag minimization features; it is not
conceivable to convert it to XML without the help of an SGML parser.
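
One way to put the parser in the loop is to let nsgmls resolve the
minimization and then rebuild explicit tags from its ESIS output
(i.e. a run without "-s").  The Python sketch below handles only start
tags, end tags, attributes and character data; all other record types
are ignored and SDATA brackets are dropped.  Lowercasing the names is
only a placeholder: XML names are case-sensitive, so a real conversion
would need a table of the proper names.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Escapes in ESIS data: backslash, \n (record end), \| (SDATA bracket),
# and \nnn (octal character code).
ESIS_ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def decode_data(data):
    """Turn ESIS escapes back into plain characters."""
    def replace(match):
        token = match.group(1)
        if token == "n":
            return "\n"
        if token == "\\":
            return "\\"
        if token == "|":
            return ""  # SDATA boundary marker, dropped in this sketch
        return chr(int(token, 8))
    return ESIS_ESCAPE.sub(replace, data)


def esis_to_xml(lines):
    """Rebuild fully tagged markup from a sequence of ESIS lines."""
    out = []
    pending = []  # attributes seen since the last start tag
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            name, kind, *value = rest.split(" ", 2)
            if kind != "IMPLIED":
                pending.append((name.lower(), decode_data(value[0] if value else "")))
        elif code == "(":
            attrs = "".join(f" {n}={quoteattr(v)}" for n, v in pending)
            out.append(f"<{rest.lower()}{attrs}>")
            pending = []
        elif code == ")":
            out.append(f"</{rest.lower()}>")
        elif code == "-":
            out.append(escape(decode_data(rest)))
        # Everything else (entities, notations, "C", ...) is ignored here.
    return "".join(out)


# Hand-written ESIS lines, not actual output from this document.
sample = ["AID TOKEN U01", "AN IMPLIED", "(DIV1", "(P",
          r"-one & two\nthree", ")P", ")DIV1", "C"]
print(esis_to_xml(sample))
```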

The document is segmented into several files; this segmentation is
useful for maintenance and should be kept.  (All SGML processors I know
of, i.e. spam and sx, create one continuous output stream.)
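
XML also has external parsed entities, so in principle the segmentation
could be kept by regenerating the declarations for the converted files;
the hard part, splitting the single output stream at the right places,
is not addressed here.  The Python sketch below only generates the
declarations and references for the chapter files.  The entity names
and the ".xml" file names are assumptions derived from the file
overview above, not taken from the actual document.

```python
def chapter_entities(first=1, last=18, extension="xml"):
    """Return (declarations, references) for chapter files uly01..uly18."""
    declarations = []
    references = []
    for number in range(first, last + 1):
        name = f"uly{number:02d}"  # hypothetical entity name
        declarations.append(f'<!ENTITY {name} SYSTEM "{name}.{extension}">')
        references.append(f"&{name};")
    return declarations, references


declarations, references = chapter_entities()
print("\n".join(declarations))
```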

The document DTD extends the TEI.  Obviously, the new tags are not
covered by existing TEI-specific tools (e.g., Sebastian's
caps-normalization stylesheet).

The DTD extension makes use of the inclusion mechanism, mixing my new
elements into TEI content models; this would need careful rewriting.
Example from uly_ext.dtd:

	<!ELEMENT hico - - (%n.lem, (%n.rdg, %n.wit, plref?, depart?)+) 
							+(comm) >

The document uses SDATA entities for specific characters or
character-like features of the text and layout that are not part of
Unicode and, reasonably, never will be.  These are, among others,
ellipses ("...") with different numbers of dots, and characters with a
slash across.  It is conceivable to encode them with something like a
<c> tag instead; I don't think they are used in attribute values.
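
The <c> idea could be sketched in Python as follows.  The entity names
and fallback renderings are invented for the example (the real names
are defined in uly_new.ent), and the sketch relies on the supposition
that no such reference occurs inside an attribute value, where an
element would not be allowed.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical entity names with a plain-text fallback rendering each.
FALLBACKS = {
    "dots2": "..",
    "dots4": "....",
    "xslash": "x/",
}

ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")


def sdata_to_c(text, fallbacks=FALLBACKS):
    """Replace references to the listed entities with <c type="..."> elements."""
    def replace(match):
        name = match.group(1)
        if name not in fallbacks:
            return match.group(0)  # leave ISO and other references untouched
        return f"<c type={quoteattr(name)}>{escape(fallbacks[name])}</c>"
    return ENTITY_REF.sub(replace, text)


print(sdata_to_c("Yes&dots4; I said &amp; so"))
```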

There are tags that are within a word of the text and should not
introduce whitespace.  I attempted to guarantee that and provide "nice"
formatting of the SGML files at the same time; I tried to understand and
apply the SGML rules for omissible white space, but I am not sure this
works even in SGML.  In XML it should not get any worse than it is now.

				   *

Summary on "convertibility"
---------------------------

Some thoughtful manual activity will definitely be necessary for a
conversion to XML.  

At the same time, it seems to me that the automatic part of the
conversion will not be straightforward either with the tools I know of
(spam, sx, Sebastian's normalization stylesheet, ...?).
