No source: this is an original work
The scholarly editor's basic task is to present a reliable text
(undated draft revision, MLA
The book is generally seen as a trustworthy carrier of text because, once printed, text cannot be changed without leaving obvious physical evidence. This stability is accompanied by a corresponding inflexibility. Apart from handwritten marginal annotation, there is little augmentation or manipulation available to the user of a printed text. Electronic texts are far more malleable. They can be modified with great ease and speed. This modification may be careful and deliberate (e.g. editing, adding markup for a new scholarly purpose), it may be whimsical or mendacious (e.g. forgery), or it may be accidental (e.g. mistakes made while editing, or minor mistranslations by a software system). And the nature of the medium makes the potential impact of these modifications greater because the different versions of the text can be quickly duplicated and distributed, beyond recall by the editor. Does the electronic future, then, hold in store something akin to medieval scribal culture? If this is the risk, will scholars be prepared to put several years of their lives into the painstaking creation of electronic editions of important historical documents or works of literature and philosophy?
How can textual reliability be maintained in the electronic environment? There is a major question here of authority and integrity which, if not more acute than that in the print domain, at least has different characteristics. Especially where it is crucial that a text be stable and long-lasting—in the case, for example, of legal statutes, cumulative records or scholarly editions—a non-invasive method of authentication is required. Following a discussion of various problems associated with the markup (encoding) of electronic texts and the danger to ongoing textual reliability that markup poses, we describe a potential model.
Verbal texts being prepared in a scholarly manner for electronic delivery and manipulation need to be marked up for structure and the meaning-bearing aspects of presentation. In the electronic domain, the features of text that, in the print domain, have long been naturalised by readers demand explicit categorising and interpretation. This is not straightforward. The most trivial things can raise tricky questions. What, for instance, is the meaning of small capitals or italics in a nineteenth-century novel? Traditionally, italics are seen either as a form of emphasis (and therefore a substantive aspect of meaning), or as presentational (as in the name of a ship or a painting). Neither can be rendered in the ASCII character set. As they cannot responsibly be ignored, a decision about their function (and therefore their presentation by the software) has to be provided by the human editor. Under the current paradigm, the instruction must be entered into the text file.
Similarly, electronic-text editors are forced to decide whether line-breaks are meaningful, whether a line of white space is a section break within a chapter or whether it was only a convenience dictated by the size of the printed page and the desire to avoid widows and orphans. Editors have to decide whether a wrong-font comma, a white space prior to a mark of punctuation or a half-inked character is meaningful—should it be tagged, or not? The instruction (recorded in markup) will be an editorial interpretation, made, probably, in the context of what is currently known about contemporaneous print-workshop practice and convention. In making explicit what in the physical text was implicit, the editor is inevitably providing a subjective interpretation of the meaning-bearing aspects of text. Of course a later editor, or the same editor returning with new information, may disagree with the earlier interpretation.
The arduous business of entering, proof-reading, amending and consequently re-proofing a transcription containing the new interpretation (the print-edition paradigm) can seemingly be avoided in the electronic medium; but in fact a new state of the text will have been created. Accidental corruption of the verbal text is very possible, so collation and careful checking of the new state against the old will be necessary. The same thing will be true if interpretation of other features of text is added, for example linguistic features, historical annotations or cross-references. Even though markup is usually separated from text by paired demarcators, as its density increases, so does the practical difficulty in proofing the text accurately.
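The checking described here can be partly mechanised. The following is a minimal sketch, assuming markup delimited by angle brackets that never contain a nested `<` or `>`; the two sample sentences and the function name `verbal_text` are invented for illustration. It strips the markup from two states of a text and compares the words and punctuation that remain.

```python
import re

# Assumption: every tag is delimited by '<' and '>' and contains neither.
TAG = re.compile(r"<[^<>]*>")

def verbal_text(marked_up: str) -> str:
    """Return the text with every <...> tag removed."""
    return TAG.sub("", marked_up)

# Two states of the same invented sentence. The second carries a new
# interpretation of the italics and an accidental transposition ("dwan").
old_state = "<p>She boarded the <hi rend='italic'>Beagle</hi> at dawn.</p>"
new_state = "<p>She boarded the <name type='ship'>Beagle</name> at dwan.</p>"

# The markup differs by design; the verbal text should not.
same_words = verbal_text(old_state) == verbal_text(new_state)
print(same_words)  # False: the verbal text was corrupted during re-tagging
```

A comparison of this kind reports only that the two states differ. Locating the difference still requires collation, and the denser the markup, the more there is to strip before the comparison can be trusted.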
Consider the following scenario. No-one expects any two scribal copies of the same work to be textually identical since scribes will almost certainly have changed or added things, large or small. This instability is, however, not restricted to pre-1455 or even the pre-1800 period, before the age of the steam-driven machine press. Optical collation in scholarly editing projects has proved again and again that no two copies of the same edition are precisely identical, even if printed in the industrial age. Printing involves change, as well as wear and tear; inking varies, and paper has imperfections. While recent editorial theory has shown that the physical carrier can itself affect the meaning of text, the prospect of marking up text to record every physical variation in every known copy of a work would create a file of bewildering complexity whose reliability would be in serious doubt. No editor can foresee all the uses to which an electronic scholarly edition can be put, nor all the interpretative markup that will be required. The more the attempt to provide it is pursued via a more and more heavily marked-up file, the more the reliability of the text is put at risk.
This situation argues the need for an automated authentication
technique that separates verbal text from markup while retaining all
the functionality of a computer-manipulable file. The proposal that we
describe below involves such stand-off markup. It also
addresses another problem of markup that has often been observed. The
current standard for the markup of humanities texts, that of the Text
Encoding Initiative, requires an objective textual structuring to be
declared on the assumption that if computers are to manipulate parts
of text powerfully, then text needs to be seen as an ordered hierarchy
of content objects with its various divisions and parts appropriately
identified. The difficulty with this assumption is that texts are not
just objective or ideal things. They incorporate a stream of perhaps
only lightly structured human decision-making, of which traces have
been left behind as part of the production process. Nor can we, as
readers, help participating in the business of making meaning as we
read and interpret what we see on the page. The advantage of our
participation is that we, unlike computers and logic systems, can
handle structural contradictions and overlaps with relative ease and
safety. But if we then attempt to codify the texts for use with
systems that cannot handle contradictions, the systems reveal their
inadequacies. At present, only fudges—partly satisfactory
work-arounds—are possible to deal with this problem.
Authentication technologies were developed by information scientists to provide a reliable basis for sending verifiable messages over networks. These technologies are based on the mathematical routines of cryptography, but are designed to work with clear-text messages. (The subtle forms of meaning-bearing presentation, discussed above, are not normally relevant here.) The goal of such technologies is not to obscure the information contained in the message, but to verify that it was sent by the person claiming to have sent it and has not been altered in the course of transmission. Meeting these requirements has allowed the development of e-commerce with such services as Internet banking.
These services require a large amount of infrastructure to support
them. Changes deemed necessary to the authentication protocols and
procedures must be carried out quickly because of the potential risk
of criminal exploitation of a weakness. While financial institutions
have the money to pay for these high maintenance costs, such
resources are not available to an academic community interested in
authenticating their electronic editions.
Fortunately, the authentication requirements for electronic editions are not as exacting as those for e-commerce, where the creator of a message must be verifiable. In electronic editions, the detection of textual corruption is the primary concern. An authentication system must protect the reliability of the encoded text by indicating if and where a file has been corrupted, thus allowing it to be replaced from a trusted master-copy.
The best authentication method is bit-by-bit comparison of the working copy of the file against a locked master-copy. Some electronic editions at present provide their master files on non-volatile media; working files are always generated afresh from the master files. Unfortunately, this solution is very weak for long-term storage as the master files are bound to a particular storage technology. And the system does not allow for the possibility of revised or additional interpretative markup.
Most authentication methods involve the use of hashing
algorithms.
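As an illustration of what a hashing algorithm offers, the sketch below computes a fixed-length digest for a short text and for a copy that differs by one word. The choice of SHA-256 (via Python's standard `hashlib` module) is our assumption for the example, as are the sample sentences; no particular algorithm is prescribed here.

```python
import hashlib

def digest(text: str) -> str:
    """Return the SHA-256 digest of the text as a hexadecimal string."""
    # Assumption: the text is serialised as UTF-8 before hashing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

master = "The ship sailed at dawn."
altered = "The ship sailed at dusk."

print(digest(master))
print(digest(altered))

# Any change to the text, however small, yields a different digest,
# so storing the digest is enough to detect later corruption.
print(digest(master) == digest(altered))  # False
```

Unlike bit-by-bit comparison, this check needs only the stored digest, not a second complete copy of the file.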
I want to discuss what I consider one of the worst mistakes of
the current software world, embedded markup; which is, regrettably,
the heart of such current standards as SGML and HTML (Nelson 1).
The problem of maintaining the authenticity of a text file across platforms is not a trivial one. In addition, it is desirable to prevent the proliferation of different versions of a text that would otherwise be brought about by (future) developments in, or additions to, markup, annotation, and cross-referencing. The use of stand-off markup, within an electronic-text environment possessing strong authentication characteristics, potentially allows these desiderata to be met.
To illustrate how such authentication might be achieved, let us take the case of a literary work extant in several typesettings. After the base transcription file of each typesetting was prepared, each such file would be a lexical transcription of the original, but only minimally marked up—since the editor's interpretative responsibilities could be reported within stand-off markup files. The verbal text in the base file would need to be contained within uniquely identifiable text elements. This could be done at the level of the paragraph in the case of prose, or at the level of, say, the line in the case of verse. The identifiers would need to be inserted in the text to act as markers, and the text proofed against the original. After proofing, the file's authenticity could be maintained by an authentication mechanism based on a simple hashing algorithm. Ideally, authentication would be done at the text-element level, so that a change to even a single character would be immediately discernible when the hash value for the text element was checked. Authentication at the text-element level would allow possible corruptions to be quarantined while leaving the rest of the text useable.
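Authentication at the text-element level might be sketched as follows. This is an illustration under stated assumptions, not a description of any existing implementation: the identifiers (`p1`, `p2`, `p3`), the sample paragraphs, the use of SHA-256, and the names `element_hash` and `corrupted_elements` are all hypothetical.

```python
import hashlib

def element_hash(text: str) -> str:
    """Hash one text element (assumed serialised as UTF-8)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# A base transcription reduced to uniquely identified text elements.
base = {
    "p1": "The ship sailed at dawn.",
    "p2": "He read the letter twice.",
    "p3": "Nobody came to the quay.",
}

# Recorded once, after the base file has been proofed against the original.
manifest = {eid: element_hash(text) for eid, text in base.items()}

def corrupted_elements(elements, manifest):
    """List the identifiers of elements that are missing or fail their hash."""
    return [
        eid
        for eid, expected in manifest.items()
        if eid not in elements or element_hash(elements[eid]) != expected
    ]

# A working copy in which one word of one paragraph has been altered.
working = dict(base)
working["p2"] = working["p2"].replace("letter", "latter")

print(corrupted_elements(working, manifest))  # ['p2']
```

Only the failing element needs to be quarantined and restored from the master-copy; the other elements verify and remain useable.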
Once the base transcription file had been prepared and proofed, markup
(e.g. SGML using the TEI DTD
A model developed along lines such as these would offer a number of
advantages. First, by supporting the standard TEI-compliant SGML it
could be used within an SGML environment giving access to all the
available browsers and tools. However, the base transcription file
would not be dependent on SGML and the separate markup files could be
easily manipulated to comply with whatever markup schemes were
required.
To date, only one implementation of the proposed model has been
developed for electronic editions: the Just In Time Markup (JITM)
system. It has utilities for inserting tags, subsequently removing
them, and running the verification process. The embedding into the
base file of the markup from the stand-off files creates a virtual
document, a "perspective", which is inserted into a template
conforming to the appropriate Document-Type Definition. Because this
is done just in time, when a call is made to create the new
perspective that incorporates the added markup, an automatic
proofreading of the base file is in effect being continually carried
out.
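The general idea can be sketched in a few lines. What follows is not the JITM code or its file format: the representation of stand-off markup as (element identifier, start offset, end offset, tag name) tuples, the function `make_perspective`, the `<p id="...">` wrapper and the use of SHA-256 are all assumptions made for illustration, and the sketch assumes that spans within one element neither overlap nor nest.

```python
import hashlib

def element_hash(text: str) -> str:
    """Hash one text element (assumed serialised as UTF-8)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def make_perspective(base, manifest, standoff):
    """Verify each base element, then embed the stand-off tags that target it.

    base:     {element_id: text}
    manifest: {element_id: hash recorded after proofing}
    standoff: [(element_id, start, end, tag), ...], spans not overlapping
    """
    rendered = []
    for element_id, text in base.items():
        # Verification happens every time a perspective is requested.
        if element_hash(text) != manifest[element_id]:
            raise ValueError(f"text element {element_id!r} failed verification")
        spans = [s for s in standoff if s[0] == element_id]
        # Insert from the rightmost span so that earlier offsets stay valid.
        for _, start, end, tag in sorted(spans, key=lambda s: s[1], reverse=True):
            text = text[:start] + f"<{tag}>" + text[start:end] + f"</{tag}>" + text[end:]
        rendered.append(f'<p id="{element_id}">{text}</p>')
    return "\n".join(rendered)

base = {"p1": "She boarded the Beagle at dawn."}
manifest = {eid: element_hash(text) for eid, text in base.items()}

# The interpretation "this word names a ship" lives outside the base file.
start = base["p1"].index("Beagle")
standoff = [("p1", start, start + len("Beagle"), "name")]

perspective = make_perspective(base, manifest, standoff)
print(perspective)  # <p id="p1">She boarded the <name>Beagle</name> at dawn.</p>
```

Because the hash is checked on every call, a corrupted base element stops the perspective from being built rather than passing silently into the marked-up output.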
There are further advantages. First, in the original creation of the
base transcription file, proofing can, if desired, be simplified by
separate checking of the markup on the one hand and the words and
punctuation on the other. Second, different or conflicting structural
markups can be applied to the same base file because they are in
different files and can be applied to the base file
selectively. Finally, because the JITM system separates the
transcriptions from the markup, the question of copyright is
simplified. Since the markup is interpretative (as explained above,
and more obviously in the case of added explanatory and textual
notes), a copyright in it can be clearly identified and defended. In
all of this, the base transcription file remains as simple as possible
(thereby greatly easing its portability into future systems) and the
authentication mechanism remains non-invasive. JITM is, in other
words, an open rather than a proprietary system.
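The second of these advantages, the selective application of conflicting structural markups, can be illustrated with a small sketch. The sample text, the tag names `l` and `s`, and the function `apply_markup` are hypothetical; the two sets of spans describe the same words as verse lines and as sentences, two structures that overlap and so could not both be embedded in a single hierarchical file.

```python
def apply_markup(text, spans):
    """Embed one set of non-overlapping (start, end, tag) spans into the text."""
    # Work from the rightmost span so that earlier offsets stay valid.
    for start, end, tag in sorted(spans, reverse=True):
        text = text[:start] + f"<{tag}>" + text[start:end] + f"</{tag}>" + text[end:]
    return text

base = "The tide came in and then the tide went out. We stayed."

# Structure 1: two verse lines, broken after "then".
line_break = base.index(" the tide went")
lines = [(0, line_break, "l"), (line_break + 1, len(base), "l")]

# Structure 2: two sentences; the first runs across the line break.
sentence_end = base.index(" We stayed")
sentences = [(0, sentence_end, "s"), (sentence_end + 1, len(base), "s")]

# Each structure is kept in its own stand-off set and applied on request.
print(apply_markup(base, lines))
# <l>The tide came in and then</l> <l>the tide went out. We stayed.</l>
print(apply_markup(base, sentences))
# <s>The tide came in and then the tide went out.</s> <s>We stayed.</s>
```

Each result is well formed on its own, and the base text is identical in both.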
Ensuring the continuing reliability of electronic editions is a bigger issue than for print editions. The creator's responsibility to the users of the edition does not end with its publication, and steps must be taken to ensure that it is protected against corruption by the very processes and medium that gave it life. Authentication technologies can provide the required reliability, but must be applied in such a way that they do not affect the long-term availability and reliability of the edition through obsolescence.
The use of stand-off markup and abstracted authentication techniques potentially allows editions to have their markup revised, reinterpreted or enhanced, and their protection mechanisms easily upgraded or replaced, when future developments require it. This can be done without compromising the base transcription files or wasting the editorial labor that has gone into establishing them.