Electronic Textual Editing: The Canterbury Tales and other Medieval Texts [Peter Robinson, De Montfort University]


Over the last decade (and longer) editorial activity in older English literature has been marked by fascination with the possibilities of the digital medium. There have been several large-scale editorial projects and a host of smaller initiatives, built from the beginning on computer methods. Some of these have now had more than ten years' experience of the possibilities, problems and actual achievements of the medium. This is long enough for us to begin to suggest some general propositions about the nature of editorial work in this new medium. In this essay, I wish to use my experience with The Canterbury Tales Project, with which I have been involved since its beginnings in 1989, to explore five such propositions.

My five propositions are:
  1. The use of computer technology in the making of a particular edition takes place in a particular research context
  2. A digital edition should be based on full-text transcription of original texts into electronic form, and this transcription should be based on explicit principles
  3. The use of computer-assisted analytic methods may restore historical criticism of large textual traditions as a central aim for scholarly editors
  4. The new technology has the power to alter both how editors edit, and how readers read
  5. Editorial projects generating substantial quantities of transcribed text in electronic form should adopt, from the beginning, an open transcription policy

While I will concentrate on The Canterbury Tales Project, it should be emphasized that this is far from being the only editorial project in older English (and medieval vernacular) literature. Indeed, it seems that all the major editorial projects undertaken in this domain over the last decade have a significant computer component: thus the cluster of projects undertaken by Kevin Kiernan, beginning with the Electronic Beowulf; the Piers Plowman archive and other SEENET initiatives associated with Hoyt Duggan; the Middle English Compendium (itself growing from the digitization of the Middle English Dictionary); individual enterprises by Murray McGillivray, Larry Benson, Peter Baker, Graham Caie, the Bestiary project at Aberdeen, the Roman de la Rose project at Johns Hopkins, and many more. 1

One could speculate at some length about the reasons for the speed with which scholars working in older areas of English literature have taken up the new technologies. One severely practical reason has been the reluctance of traditional publishers to commit to publishing the results of large editorial projects. Also, editors of medieval texts rely heavily on manuscript sources, some of which are famous and fascinating objects in their own right (if not beautiful, indeed) and these lend themselves very well to digitization: thus, for instance, the manuscripts published under the ‘Turning the pages’ initiative by the British Library. It is a material help too that all these texts are comprehensively out of copyright (though rights in manuscript images do need to be negotiated with the individual owners). Medieval authors are safely dead, and so too are any relatives who might have any copyright interest. It is also something of a fortunate chance that at least three of the leading editors of older English texts in the 70s and 80s were fascinated by computers and were among the ‘early adopters’ of the technology: thus Kiernan, Benson and Duggan.

Proposition 1: The use of computer technology in the making of a particular edition takes place in a particular research context.

Besides these practical considerations (which are hardly unique to medieval English literature), a key reason for the particular receptivity of scholars working in early English studies in the last decades of the twentieth century to the promise of computer methods was the state of thinking about editing in this area. There are two aspects to this. Firstly, many editions in this domain, notably those produced for the Early English Text Society since the late nineteenth century, have been closely based on single manuscripts. Such editors of early English texts were ‘best text’ editors before Joseph Bédier (Bédier), and such editions lend themselves very easily to computer representation. One can see Kiernan's Electronic Beowulf as in a direct line of descent from Zupitza's facsimile, which over a hundred years ago presented a full image record alongside a transcription of the whole text (Zupitza). Secondly, for editors in Middle English particularly, there has been the influence of the Athlone edition of Piers Plowman and George Kane's writings about textual editing (Kane 1960). This edition is famously uncompromising and controversial in its insistence on the failure of stemmatic and indeed any other methodology apart from the application of editorial judgement at every point of the text: see for instance articles by Adams, Brewer, Donaldson and Patterson on this (Adams; Brewer; Donaldson; Patterson). For many editors, the sheer confidence (not to say extremism) of Kane's assertions about textual scholarship has provoked a reaction: is this really what editing is? The advent of computer methods has offered a new domain, in which editors might explore routes of editing which run counter to the vision offered by Kane. I will return later to the influence of this edition on my own work.

In this context, the conception of The Canterbury Tales Project in the early 90s sprang from the same impulse which variously moved Kiernan and Duggan (among others): to apply the new methodology not just to convert existing printed editions into digital form, but to use the new methods to try to solve long-standing textual difficulties. This can be seen at a literally microscopic level, in Kiernan's use of fibre-optics to recover readings lost at the charred edges of the Beowulf manuscript (Kiernan). For The Canterbury Tales Project, the textual difficulty is, indeed, the whole text, the whole tradition. There are (by the latest count) 84 manuscripts and four incunable editions of the Tales dating from before 1500. Further, not only did Chaucer notoriously leave the text unfinished, but he seems to have taken no care (as some other medieval authors did) to prepare something like an authorized text or to control the form in which his text was distributed. As a result, editors of the Tales are left with the documents (all the manuscripts and incunables) and no authorial declaration of any kind to help make sense of them all. The very earliest manuscripts bear eloquent witness to the struggles by scribes and their supervisors to put Chaucer's original materials (in whatever form they were) into shape. Later editors have inherited the struggle, and with it the record of earlier scribes and editors.
One can see too the history of textual scholarship in the West through the lens of the Canterbury Tales: through the medieval scribes trying to impose a coherent ordinatio and compilatio on incomplete and complex texts; through the first incunable printers printing a text from a single source, either a manuscript or earlier printing, then progressively modifying the text by reference to other sources; then in the 18th century through the ‘ad fontes’ movement reflected in Tyrwhitt's determination to return to the manuscripts and attempt to establish a new text from study of the originals; then through the 19th and early 20th centuries, attempts to set the manuscripts in some kind of order which might in turn justify the choice of one or more manuscripts as base for the edition (Ruggiers). This enterprise culminated in the massive effort of John Manly and Edith Rickert to collate all the known manuscripts of the Tales to attempt to establish a historical recension of the entire textual tradition, on something like Lachmannian principles (Manly and Rickert).

Manly and Rickert's work appears as a late expression of a nineteenth-century confidence (they began their work in the early 1920s) in the ability of editors to establish definitive texts. It is exactly this confidence which editorial thinking through the latter part of the twentieth century has undermined. Editors of older English texts have found themselves engaged in the same problems, of versioning, variance, of copy-text theory, as their counterparts in other editorial domains. 2 In this context of increasing editorial anxiety, the clarity of Kane's view of editing has found many followers. Over this period, apart from asserting the eminence of the Hengwrt manuscript of the Tales, Manly and Rickert's work has had little influence and was heavily attacked by Kane in an article in Editing Chaucer: The Great Tradition. Indeed, Manly and Rickert's failure (as Kane sees it) to create any kind of valid historical account of the tradition is a cornerstone of Kane's wider argument, that editorial judgement alone must be used to fix the text.

The failure of Manly and Rickert's attempt to create a historical account of the textual tradition, and the vehemence of Kane's assertion that no such account could possibly be created for the Tales, set out a clear challenge. It seemed arguable that the failure of Manly and Rickert did not occur, as Kane suggested, because of a fundamental theoretical flaw in their method. Possibly, they failed because the sheer volume of data generated by their collation (some three million pieces of information on sixty thousand collation cards) quite overwhelmed the tools of analysis available to them: basically, pencil, paper and Edith Rickert's memory. Up to the late 1980s, a few experiments and articles appeared to suggest that a combination of the computer, with its ability to absorb and re-order vast amounts of information, and new methods of analysis being developed within computer science (in the form of sophisticated relational databases) and in mathematics and in other sciences, might be able to make sense of the many millions of pieces of information in a complex collation, and provide a historical reconstruction of the development of the tradition. [Note: The first scholar to explore such methods with an actual tradition appears to have been J. G. Griffith, “A Taxonomic Study of the Manuscript Tradition of Juvenal” (Griffith). For a bibliography of work in this area up to 1992, see my article “Computer-assisted Methods of Stemmatic Analysis” (Robinson and O'Hara). For work done since then see the publications of the STEMMA project, listed at http://www.cta.dmu.ac.uk/projects/stemma/res.html.]

From this account, we can see that any decision to use computer technology in the making of a particular edition takes place in a particular research context. What has been done before, and the controversies reigning at any one moment, determine our sense of what is to be done. It may seem a rather obvious point, that electronic editions are made in a context of editorial theory, just as print editions always have been. But it happens that over the last decade many electronic texts have been made which present the text alone, often with images, but with only the minimum of the additional material (variant apparatus; descriptive and analytic commentaries) which has in the past characterized scholarly editions. There is a place for such ‘plain-text’ enterprises, but their inability to engage with the wider issues surrounding the texts they offer limits their utility.

Proposition 2: A digital edition should be based on full-text transcription of original texts into electronic form, and this transcription should be based on explicit principles.

The perceived failure of Manly and Rickert to create a historical account of the relations between the manuscripts of the Canterbury Tales sets a clear challenge: to apply the emerging methods of computer analysis to create such an account, based on analysis of the agreements and disagreements between the texts they contain. This challenge was the starting point of The Canterbury Tales Project.

This poses an immediate problem: how would we gather the record of agreement and disagreements, on which this analysis would be based? We decided, from the first, that we would do this as follows:
  1. We would make a full-text transcription of the whole text of the manuscripts
  2. We would use computer tools to compare the transcripts, to create the record of agreements and disagreements between the manuscripts

Accordingly, in 1989, and with the help of Susan Hockey, I wrote an application to the Leverhulme Trust for a grant to carry out a series of experiments in the use of computers in textual editing. We proposed to develop a computer collation system for comparing different versions of texts word by word, and to experiment with different methods of analyzing the results. Given the intractable nature of the editorial problems posed by the Canterbury Tales, we chose Geoffrey Chaucer's Wife of Bath's Prologue (830 lines in fifty-four manuscripts and four incunables) as one of the exemplary texts for this experiment.
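To make the idea of word-by-word comparison concrete, here is a minimal sketch in Python. It is not the collation program developed by the project. It uses the standard library's difflib to align two transcribed lines, and the function name and sample readings are invented for illustration.

```python
from difflib import SequenceMatcher

def collate_pair(base, witness):
    """Return word-level differences between two transcribed lines.

    Each difference is a tuple (operation, base_words, witness_words),
    where operation is 'replace', 'delete' or 'insert'.
    """
    base_words = base.split()
    witness_words = witness.split()
    # autojunk=False keeps difflib from discarding frequent words as "junk".
    matcher = SequenceMatcher(a=base_words, b=witness_words, autojunk=False)
    differences = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            differences.append((op, base_words[i1:i2], witness_words[j1:j2]))
    return differences

# Invented sample readings, not taken from the project's transcripts.
line_a = "Experience though noon auctoritee"
line_b = "Experience thogh non auctoritee"
variants = collate_pair(line_a, line_b)
```

A pairwise comparison like this only records where two witnesses part company. Deciding which of those differences count as variants is a separate editorial step.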

We were successful with this application, and work began on this in September 1989. We were fortunate indeed in both the time and the place. In time: the three-year project began just after the inception of the Text Encoding Initiative, and during its life the first steps were being made towards electronic publishing, first on CD-ROM and later over the Internet as the Web began to take shape. In place: Oxford, where the project began, was intensely involved in the Text Encoding Initiative through Susan Hockey and Lou Burnard. Susan was the project leader, and the project was based in the same building (Oxford University Computing Services) as Lou Burnard, who was the European Editor. This close link with the Text Encoding Initiative became crucial because of something to which I had not given much thought before: the need for a stable and rich encoding scheme both to record the transcripts of the original texts which we were to make, and to hold the record of variation created by the collation program. One could say that The Canterbury Tales Project, the TEI and the Web were born and have grown side by side, and that the TEI has been the crucial enabling factor in the Project.

A first, critical impact of the TEI on this project was in the shaping of the transcription guidelines. We needed to make transcripts of the manuscripts in electronic form, ready to submit to a collation program we would develop in this initial project. We had many different people working on the transcriptions so we had to work out a scheme of transcription which could be applied uniformly. There were two aspects to this.

The first aspect concerned the structure of the text: its division into tales, links, lines, blocks of prose, with marginalia, rubrics, glosses, and more. Clearly, we needed some means of indicating all these, even if only (at the most basic level) to be able to locate all the different forms of any one line in the manuscripts so they could be collated. From the very first meetings of the TEI, the rich repertoire of structural encodings it offered was apparent. Accordingly, I designed a set of markup protocols for the Collate program to replicate the TEI structural encodings, with the aim of being able to translate these encodings into the TEI implementation of SGML (later XML) as needed. Over the years, this has proved very successful, and we have experienced no difficulty in moving our files into SGML/XML for publication. This has had great benefits, as it has meant that we have been able to use readily available commercial SGML/XML software (firstly DynaText, later Anastasia) to achieve excellent results. However, we decided at the very beginning of the project that we would not encode the working transcripts themselves in SGML. There was a simple practical reason for this: the SGML editors then available (basically, emacs!) were rather formidable, and well beyond the slender computer abilities of the transcribers we then had. Many years on, the gap between the programs and the transcribers has narrowed, but persists (emacs is still the tool of choice for many). There is also an important principle here. There is no doubting the unique properties of SGML/XML for interchange systems, and the TEI work, of creating a set of guidelines for interchange of an enormous range of humanities texts, is one of the great scholarly achievements of our time. 
But what is good for interchange is not necessarily good for capture, where an efficient and focussed system is required for the transcribers (nor may it be good for programming, as is attempted by the XSLT and similar initiatives—but that is another subject). I have no time for the shibboleth, that one must use SGML or XML everywhere one has a text. One should use it where it is efficient (for interchange and archiving, for example), one should not use it where it is not.

The second aspect of the transcription concerned the words and letters themselves, the marks on the parchment. Encoding the words and letters in a printed text can be quite simple: just establish the characters used by the printer and allocate a computer sign to each. All the transcriber has to do is recognize the character in the text and press the appropriate button. But in a manuscript, where the range of marks which can be made by a scribe is limitless, matters are not so simple at all. The transcriber has to decide which of these marks might be ‘meaningful’ and then decide which, of the range of signs available on the given computer system, best represents that meaning. One can see a world of problems emerging just from this bare outline. What do we mean by meaning? Meaning: for whom, by whom, to whom? Are we speaking of the meanings intended by author, or scribe, or the meanings which might be mediated by ourselves and received by our readers? How adequately can the necessarily limited range of signs available in any computer system (let alone the fewer than 200 available in most text computing environments, pre-Unicode) represent all the possible meanings?

In our first experiments in the early 1990s, we were influenced by the capacity (only just recently developed) to add characters to fonts. So we thought we had an excellent solution. Where we saw a ‘meaningful sign’ which could not be represented in the characters available in our computer font we would just add a character for that sign to the font. In theory, this is very attractive. Scribes use many different graphic (or ‘graphetic’) forms for the one letter. It seemed possible that the choice of forms by any scribe (r rotunda, or z shaped, or long, or ragged; sigma s, kidney shaped s, long s, s ligatured with following letters, and so on) might be distinctive of that scribe. It might be possible to use this information towards finding one of the philosopher's stones of manuscript studies: a means of distinguishing and identifying both individual scribes and scribal schools, as suggested in two articles by Angus McIntosh (McIntosh; McIntosh). This delusion lasted about one month: actually, to the moment when we started transcribing manuscripts written later than around 1410. Naturally, we started the transcription on the manuscripts generally regarded as the oldest and the most important: essentially, those commonly dated before 1410. It happens that these manuscripts (or at least four of them) do form something of a coherent group. Their scribes worked closely together, and the scripts share many common characteristics. So it did appear possible to map the marks in these manuscripts to a finite set of signs, and represent each sign by a single computer character. For the first month then, so long as we dealt only with these manuscripts, we convinced ourselves that this system was working. But when we came to deal with a much later manuscript, we immediately found a character we had not dealt with (a distinctive s in final word position). Fine, we thought: add the character to the font. 
But then we noticed something very alarming: when we looked back at the first manuscripts we had transcribed, we found that the character was indeed present in these—but just not in final position. Worse still: as we looked at more manuscripts, more and more such signs appeared (to the point where it became prohibitively time-consuming to add these to the font). Worse yet: we kept discovering that when we identified such a sign in a new manuscript, we would find that it was indeed present in manuscripts we had already seen—just, we had not noticed it.

At this point, too, we discovered ourselves asking: how ‘new’ is ‘new’? Take the long s: there are long s forms which tower proudly over the other letters, and wave a luxuriant tail into the line below; there are others which skulk among the other characters, ducking their head and tucking in their tails so they barely show. Should we distinguish each and all of these? Where do we stop? And worst of all: one of the letters we observed at this late stage was a distinctive form of s, used in some manuscripts, almost universally, in final position. We could see an argument developing, that certain letter forms were reserved for final word position, and that their function (in a time of uncertain word division) was to mark the ends of words. But we realized that interesting and valid as that argument might be, we could not confidently assert this on the basis of our transcripts. Such an argument might have force only if we could be sure that the occurrence of this particular letter form in this particular position was really distinctive. This would mean that not only would we have to recognize and transcribe this letter securely, in each place where it appears: we should also have to distinguish competing forms of the letter and also recognize and transcribe them securely. Increasingly, we found ourselves lacking confidence in this. It became clear to us that the more signs we distinguished, the greater the possibility of error. Because there were so many, it would be easy for our transcribers to overlook individual signs. Further, the more distinctions we made, the narrower the differences between them, and the easier it would be for a transcriber to misallocate characters. This led us to consider: if we cannot support such an argument with these distinctions in our transcriptions, what use is it to make all these distinctions? At the same time, we noticed another phenomenon which gave us pause.
We observed that in a pair of manuscripts written probably by the one scribe, in one the scribe commonly used a long-stemmed form of r; in the other, the scribe used the more normal r. On the face of it, this rather clear distinction might appear to justify our experiment. But it does not. What does this tell us? Really, not very much: just that the scribe adjusted his practice from manuscript to manuscript. Further, one hardly needs go to the trouble of transcribing the whole of both manuscripts, spending hours on meticulous discrimination of the various r characters, in order to make this one rather facile observation.

In the last paragraphs I have spoken of ‘we’: by this time (early 1992) Elizabeth Solopova, then a graduate student at Oxford, had come to work as a transcriber on the project, at just the time I was wrestling with these problems. She brought to this an understanding of semiotics, and an awareness of the whole range of signs on a manuscript page. In effect, we decided to discard the whole elaborate effort to separate signs according to fine graphic criteria. Instead we asked: what are our transcriptions for? Why are we making them? How will we use them and for what purposes? Who else might use them? In essence, we determined to concentrate on our work as mediators: interpreting the manuscripts for our use, and for others. This approach had one signal advantage. Questions such as ‘what did this scribe mean by this mark’ can never be answered; but the question ‘what do we want to do with these transcripts’ can be answered. We wanted to compare them by computer program and then use the results of the analysis to determine—if possible—the relations between the manuscripts, in terms of genetic descent. Rather clearly, all this additional information concerning variant letter forms was irrelevant to this (or, what is the same thing, would be so rarely relevant that we could not justify the effort of gathering the information). By definition, our transcription then could focus on lexical variation, on the kinds of differences which might survive copying from one manuscript to another. Even this is not unproblematic: purely lexical variation would involve removal of differences at the level of spelling as well as letter form. But this would mean the transcribers would have to regularize all spelling as they transcribed. Regularize, to what? However, the collation tool we had developed by then had the ability to regularize as we collated, thereby shifting the responsibility for deciding exactly what is a variant to the editor, not the transcriber. 
We decided therefore to adopt a ‘graphemic’ system: as transcribers, we would represent individual spellings, but not (normally) the individual letter shapes. However, we also included in our transcriptions sets of markers to represent non-linguistic features, such as varying heights of initial capitals, different kinds of scribal emphasis, and the like: what McGann calls ‘bibliographic codes’. On a strict interpretation of our goal (to analyze only lexical variation) we should not have recorded these features. But we felt a need to record them, nonetheless, for various reasons. Firstly, these features were undeniably ‘there’, were indeed the most striking phenomena about the manuscripts, and so should, somehow, be noted. Secondly, while we could not see how we might use this information, its prominence in the manuscripts, an increasing interest in manuscript layout (not least in Solopova's own graduate work), and the likelihood that we might publish the transcripts, all persuaded us that we should retain this information on the likely chance that it would be useful to others. [Note: There has been a lively debate concerning the meaning and purposes of transcription. One view, argued by Alan Renear, states that text has an objective existence, which a transcription act may witness: thus his contribution to the MII-PESP - Philosophy and Electronic Publishing discussion group on 27 November 1995. The discussion group was established as part of a paper, “Philosophy and Electronic Publishing”, organized by Claus Huitfeldt for publication in an interactive issue of the journal The Monist, published as Volume 80, no. 3, 1997: see http://www.univie.ac.at/philosophie/bureau/intro.htm. An opposite view, asserted by Alois Pichler and Solopova and myself, argued that a transcription is a text constructed for a particular purpose, and has no existence outside this construction: see Pichler, A., Transcriptions, Texts and Interpretations (Pichler), and my article (with Elizabeth Solopova) on the project's transcription guidelines (Robinson and Solopova). I tried to find a third way between the two views in “What Text Really is Not, and Why Editors have to Learn to Swim”, to be published in Computing the Edition (Robinson).]

Through 1992 and 1993 therefore, Elizabeth Solopova and I prepared a set of transcription guidelines which has since become the foundation of the project (Robinson and Solopova). These guidelines amount to a hypothesis of significant difference. One can hardly overestimate the importance of a defined set of transcription guidelines for any project built (as ours is) on full-text transcripts. Perhaps more to the point: the effect of the new medium was to force us to define exactly what is variation, for our purposes, and to build a transcription and collation system to capture that variation.
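The division of labour described above, in which transcribers record the spellings they see and the editor decides at collation time which differences are merely orthographic, can be sketched as follows. This is a hypothetical illustration in Python, not the project's collation system. The regularization table, its entries and the function names are all assumptions made for the example.

```python
# Hypothetical regularization table: spelling as transcribed -> regularized form.
# The entries are invented; an editor would build the real table while collating.
REGULARIZE = {
    "thogh": "though",
    "non": "noon",
    "auctorite": "auctoritee",
}

def regularize(word):
    """Map a transcribed spelling to its regularized form, if one is recorded.

    Lower-casing first means that capitalization is treated as a matter of
    spelling, which is itself an editorial assumption.
    """
    lowered = word.lower()
    return REGULARIZE.get(lowered, lowered)

def classify(reading_a, reading_b):
    """Classify a pair of aligned words from two transcripts."""
    if reading_a == reading_b:
        return "identical"
    if regularize(reading_a) == regularize(reading_b):
        return "spelling variant"
    return "substantive variant"
```

Because the table lives outside the transcripts, the transcripts themselves stay graphemic, and a different editor could apply a different table to the same files.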

Thus, then, the second proposition I offer in this essay: an electronic edition should be based on full-text transcription of original texts into electronic form, and this transcription must be based on explicit principles. The route of full-text transcription we chose was not so obvious as it appears now. Traditionally, editors of texts would prepare collations by selecting a base text, comparing each version one at a time against that text, and recording the differences. One could—in theory—just input this record of difference into the computer and analyze that and thereby save vast amounts of time. This method would have the apparent advantage of circumventing all the difficult questions about transcription discussed above. But it was exactly this shortcut which raised the most serious doubts. In traditional collation the three parts of the editorial process—observing the actual spellings in the manuscript, noting those seen as actually different, and recording the differences—are so compressed into a single act as to make it very difficult to determine just what the editor actually sees in the manuscripts. But if a collation is not based on an explicit declaration of just what the editor sees, what is it based on? The advantage of full-text transcription is that it forces us to state exactly what we do see, and it makes it possible for readers to check what we say we see against what the reader can see.

Proposition 3: The use of computer-assisted analytic methods may restore historical criticism of large textual traditions as a central aim for scholarly editors.

By mid-1992, we had begun to consider what we would do with all the information we were gathering. Rather obviously we should publish it. But what, exactly, would we publish? How? We had a body of transcripts in electronic form. We had electronic tools to compare and analyze them, and we had various additional materials (discussions, descriptions of the manuscripts) also in electronic form. We were also becoming aware of the possibility of digital imaging, and CD-ROM publication, as a means of distributing large volumes of text and perhaps images, had been established through the pioneering ventures of Chadwyck-Healey and OUP. We seemed to have an obvious answer: we should publish electronically. From the work of Chadwyck-Healey, we knew that the DynaText program (then a product of Electronic Book Technologies) was able to publish, on CD-ROM, large volumes of SGML encoded text such as we were then developing. By fortunate chance, it happened that Cambridge University Press were investigating electronic publication, and saw our project as a chance to explore the possibilities. So we formed an alliance: CUP would purchase DynaText; we would work out how to use it both to publish our own CD-ROMs with CUP, and use the knowledge we developed to help CUP publish other CD-ROMs.

Accordingly, in 1996 the first of our CD-ROMs was published: the Wife of Bath's Prologue on CD-ROM (Robinson). This included all the transcripts of the fifty-eight witnesses, images of all the pages of the text in these manuscripts, the spelling databases we had developed as a by-product of the collation, collation in both ‘regularized spelling’ and ‘original spelling’ forms, and various descriptive and discursive materials. As such, it presents a mass of materials which an editor might use in the course of preparing an edition. Indeed, this mass quite overwhelms the rather slender explanatory and discursive materials included on the CD-ROM. As a result, the CD-ROM on its own may give the impression that this is all the aim of the project: to gather the sources of the text, to transcribe and collate them, and then publish all this as an ‘electronic archive’ (as, for example, Matthew Kirschenbaum suggested in a presentation at the Society for Textual Scholarship conference, New York 2001).

However, transcription for us was always a means, and never an end in itself. We sought to compare the transcripts to discover what the texts had in common and what they did not. This itself too was only a means: for what we really wanted to do was to use that information to find out why: why do the texts differ? A reasonable guess was that it was the process of copying which was the cause of the difference: that scribes introduced new readings into copies, and these copies were themselves copied, introducing yet more new readings. This, of course, is the basis of the traditional, ‘Lachmannian’, stemmatics: and it was precisely the denial by Kane of the grounds for this method which sparked this project (Piers Plowman). There is an obvious analogy between the processes of copying and descent we might hypothesize for manuscript copying and those of replication and evolution underlying biological sciences: both appear instances of ‘descent with modification,’ to use Darwin's phrase. Therefore, it seemed possible that the techniques developed for tracing descent in evolutionary biology, especially through comparison of DNA sequences, might be applicable to manuscript traditions. With the aid of Robert O'Hara, then at the University of Wisconsin at Madison, and later with Chris Howe, Adrian Barbrook and Matthew Spencer at the University of Cambridge, we were able to show that phylogenetic software developed for biological sciences gave useful results when applied to manuscript traditions. That is: we can turn our lists of agreements and disagreements among the manuscripts into a form which can be input into a program used by biologists to hypothesize trees of descent among species; we can then use these programs to hypothesize trees of descent among the manuscripts. Thence, we have to ask: what, exactly, do these hypothetical trees of descent represent? Are they useful for editors, or just a curiosity?
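The step of turning lists of agreements and disagreements into input for phylogenetic software can be illustrated with a small sketch. The Python below is an assumption-laden illustration, not the project's actual pipeline. It codes each point of variation as a ‘character’ whose ‘states’ are the competing readings, then counts pairwise disagreements. The witness sigils A to D and all readings are placeholders, and the sketch assumes every witness has a reading at every location.

```python
from itertools import combinations

# Invented collation output: for each point of variation, the regularized
# reading found in each witness. Sigils and readings are placeholders.
collation = [
    {"A": "though", "B": "though", "C": "if", "D": "if"},
    {"A": "noon", "B": "noon", "C": "noon", "D": "no"},
    {"A": "were", "B": "was", "C": "was", "D": "was"},
]

def character_matrix(collation):
    """Code each witness as a list of state numbers, one per variant location."""
    witnesses = sorted(collation[0])
    rows = {w: [] for w in witnesses}
    for location in collation:
        states = {}  # reading -> state number, in order of first appearance
        for w in witnesses:
            reading = location[w]
            states.setdefault(reading, len(states))
            rows[w].append(states[reading])
    return rows

def disagreement_counts(rows):
    """Count, for every pair of witnesses, the locations where they differ."""
    counts = {}
    for a, b in combinations(sorted(rows), 2):
        counts[(a, b)] = sum(x != y for x, y in zip(rows[a], rows[b]))
    return counts

matrix = character_matrix(collation)
distances = disagreement_counts(matrix)
```

A matrix of this shape is the kind of data that tree-building programs take as input. The tree search itself is left to such software and is not attempted here.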

As to the first question, what these trees represent: our experiments with these programs suggest that they may indeed produce representations of relations between manuscripts which do correspond with historical sequences of copying. That is: if a group of manuscripts is shown by the software as descended from a single point within the tradition, then there is a good chance that indeed these manuscripts were all descended from just one exemplar within the tradition. To convert ‘chance’ to ‘probability’ one would need to analyze further: to look at the history of the manuscripts themselves, so far as we can recover it from any external evidence, and look at the readings themselves which cause the software to hypothesize this. In at least one instance, for the manuscript tradition of the Old Norse Svipdagsmøl, we are able to compare external evidence of manuscript relations with the representation offered by phylogenetic software, and the software did succeed in showing close links between manuscripts known to be near relatives by copying (Robinson and O'Hara, ‘Cladistic Analysis’).
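The step of turning lists of agreements and disagreements into input for tree-building software can be illustrated with a minimal sketch. It is not the project's own code: the sigla (A to D), the readings, and the function names are all invented for illustration. Each point of variation becomes one ‘character’, each distinct reading at that point becomes a numbered state, and a witness lacking the passage is coded ‘?’. A simple proportion of disagreement between each pair of witnesses is then computed, of the kind a distance-based method might take as its starting point.

```python
from itertools import combinations

# Hypothetical collation: one dict per point of variation, mapping each
# witness siglum to its reading there. None marks a witness that lacks
# the passage. All sigla and readings are invented for illustration.
collation = [
    {"A": "a", "B": "a", "C": "b", "D": "b"},
    {"A": "a", "B": "a", "C": "a", "D": "b"},
    {"A": "a", "B": "b", "C": "c", "D": "c"},
    {"A": "a", "B": "a", "C": "b", "D": None},
]

def character_matrix(collation):
    """Code each witness as a row of states, one per point of variation."""
    witnesses = sorted({w for site in collation for w in site})
    matrix = {w: [] for w in witnesses}
    for site in collation:
        states = {}  # reading -> state symbol, numbered in order of first appearance
        for w in witnesses:
            reading = site.get(w)
            if reading is None:
                matrix[w].append("?")
            else:
                states.setdefault(reading, str(len(states)))
                matrix[w].append(states[reading])
    return matrix

def disagreement(row_a, row_b):
    """Proportion of places where both witnesses are present and differ."""
    shared = [(a, b) for a, b in zip(row_a, row_b) if a != "?" and b != "?"]
    if not shared:
        return None  # no overlapping text to compare
    return sum(a != b for a, b in shared) / len(shared)

matrix = character_matrix(collation)
for siglum, row in matrix.items():
    print(siglum, "".join(row))
for a, b in combinations(matrix, 2):
    print(a, b, disagreement(matrix[a], matrix[b]))
```

In this toy data, witnesses A and B disagree in one place of four, while A and C disagree in three of four. A tree-building program would place A and B closer together on that evidence, though whether such closeness reflects actual copying is exactly the question that external evidence must help to answer.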

As to the second question, are these reconstructions useful for editors: where these techniques show a group of manuscripts as apparently descended from a single exemplar within the tradition, one should be able to deduce just what readings were introduced by this particular exemplar. One could go yet further: scrutiny of these readings might give answers to questions such as: are these variants likely to have come from authorial revision or from scribal interference? In a long article on the Wife of Bath's Prologue published in the second Project Occasional Papers volume, I suggested that one could discriminate some six ‘fundamental groupings’ among witnesses for this text, and one could use these groupings as a means of identifying contamination and shifts of exemplar (“A Stemmatic Analysis”). One could also isolate through this means the variants characteristic of each group, and therefore apparently descended into each group from a single ancestor within the tradition. These variants could then be examined, and a judgement made on whether they are scribal or authorial in character. From this, I arrived at a reasonably firm set of conclusions: for example, that there was no evidence of word by word revision by Chaucer in the groupings; that the so-called ‘added passages’ were present very early in the tradition, so reinforcing the argument that Chaucer himself wrote them; and that their scattered attestation across the manuscripts might have resulted from these passages having been marked for deletion in very early manuscripts.
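Isolating the variants characteristic of a group can likewise be sketched in a few lines. Again this is an illustration under stated assumptions, not the procedure actually used for the Wife of Bath's Prologue: the sigla, readings, and function name are invented, and the criterion is deliberately strict. A reading counts as characteristic only if every group member that has the passage shares it and no witness outside the group attests it.

```python
# Hypothetical collation: one dict per point of variation, mapping each
# witness siglum to its reading there. None marks a witness that lacks
# the passage. All sigla and readings are invented for illustration.
collation = [
    {"A": "a", "B": "a", "C": "b", "D": "b"},
    {"A": "a", "B": "a", "C": "a", "D": "b"},
    {"A": "a", "B": "b", "C": "c", "D": "c"},
    {"A": "a", "B": "a", "C": "b", "D": None},
]

def characteristic_variants(collation, group):
    """Return (site_index, reading) pairs peculiar to the given group.

    A reading qualifies when all group members that have the passage
    agree on it and no witness outside the group attests it.
    """
    group = set(group)
    found = []
    for index, site in enumerate(collation):
        inside = {site[w] for w in group if site.get(w) is not None}
        outside = {r for w, r in site.items() if w not in group and r is not None}
        if len(inside) == 1 and not (inside & outside):
            found.append((index, next(iter(inside))))
    return found

print(characteristic_variants(collation, {"C", "D"}))
print(characteristic_variants(collation, {"A", "B"}))
```

A real tradition would call for a looser test, since contamination and coincident variation mean that group members rarely agree unanimously. The list such a function returns is only the raw material: the judgement on whether each reading is scribal or authorial remains the editor's.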

This leads to the third of the five lessons the project might offer: after decades of doubt and uncertainty, historical criticism of large textual traditions may return as a central tool of scholarly editors. The denial by Kane and his followers that any kind of ‘genetic’ reconstruction might be possible or useful left the text and the editor in a historical vacuum: an editor must use only judgement, based on a sense of the author's intention and an intuition for how this might be corrupted, to create a pristine text. Against this, our work suggests that historical analysis of textual traditions, in terms of ‘descent with modification’ by the flow of readings from manuscript to manuscript, is possible. It does appear useful to explore the development over time of the tradition: to attempt to trace it both forwards, from the first surviving manuscripts to the latest incunables, and back, from the extant witnesses to the hypothetical ancestors underlying the texts we now have. If so, this is a remarkable instance of advances in one field (evolutionary biology) having a significant effect on a second, quite distinct field (textual scholarship).

Proposition 4: The new technology has the power to alter both how editors edit, and how readers read.

The Wife of Bath's Prologue on CD-ROM was published before these analyses were complete. By the time we came to publish our next CD-ROM, Elizabeth Solopova's edition of The General Prologue on CD-ROM, we had considerably more experience with these phylogenetic methods (Solopova). For the Wife of Bath's Prologue, materials and analysis were separated between the CD-ROM and the printed article. This time, we were determined to unite them. So, I wrote a long “Analysis Workshop” section for the General Prologue CD-ROM along the same lines as the Wife of Bath article (Robinson). Because of the electronic publication mode, we were also able to include the actual software and all the data we used for the analysis, and include also exercises allowing readers to run the software themselves, so that they might confirm, extend or deny the hypotheses suggested in the article. I also wrote a “stemmatic commentary,” which took some 120 individual places in the General Prologue and attempted to use the results of the stemmatic analysis to clarify the range of readings at each place (Robinson).

The movement from the first to the second CD-ROM marked a significant shift in our thinking. This was put as follows, in the “General Editors' Introduction” to the General Prologue CD-ROM:

One might summarize the shift in our thinking in the last two years, underlying the differences between the two CD-ROMs, as follows: the aim of The Wife of Bath's Prologue CD-ROM was to help editors edit; our aim now is also to help readers read.

Elsewhere on the CD-ROM, I give this approach the name ‘New Stemmatics’ and explain it as follows:

Like the stemmatics of the last century, its aim is to illuminate the history of the text. Unlike the stemmatics of the old century, its aim is not a well-made edition, but a well-informed reader.

This shift coincided with (and was indeed largely caused by) an increasing sense of the expressive power of the computing medium. As standard browsers grew in stability and capacity, it became possible to build attractive and responsive interfaces for the vast range of material at our disposal. Not in DynaText however: this was now a venerable (ten years old!) system, and lacked many of the features we could see appearing on the Web and elsewhere. We wanted to do much more than just have text and image appear in a window when you clicked (about all you could do in the early days of hypertext). We wanted to be able to have images resize and scroll as you moved over them, boxes with text appear or the text itself change when you clicked or as the mouse moved, and more. With JavaScript and other technical advances, all this became possible; and we developed a new software tool, Anastasia, specifically to offer a bridge between the XML into which we now decanted all our files and the new JavaScript/HTML-based interfaces now appearing. Our first publication to make use of this combination was our third CD-ROM, the Hengwrt Chaucer Digital Facsimile (Stubbs). For this we had a new aspiration: it should be beautiful. We sought out a clean and efficient interface which would allow the manuscript to speak for itself as clearly as possible. The editor, Estelle Stubbs, wished to present the manuscript not merely as a container of text but as a physical object, with distinctive orderings of page, quire, tale and ink, and we devised a set of tables to show this. Our motivation was to make the manuscript accessible to as many people as possible, at least down to undergraduates beginning their first work on Chaucer and manuscripts. For us, this was an attempt to give practical expression to what was becoming a core belief of the project: that we could use the new tools and our materials to change the way people experience the text.

Here is the fourth of the five lessons I offer: computing technology has the power to alter both how editors edit, and how readers read. In our most recent publications, beginning with The Miller's Tale on CD-ROM, we are applying the technology used for the Hengwrt Chaucer Digital Facsimile to our ‘single-tale’ editions, where (as earlier with The Wife of Bath's Prologue and The General Prologue) we bring together all the transcripts and images of the many versions of any one tale with collations and analyses based on these transcripts and images. These offer readers the opportunity to check efficiently the stability of the text at critical points, and offer too an agreeable means of discovery of how the text came to be how it is. By inviting exploration rather than baffling it, such editions might help us all to be better readers.

Proposition 5: Editorial projects generating substantial quantities of transcribed text in electronic form should adopt, from the beginning, an open transcription policy.

This compressed account rather elides the many difficulties we faced in our (now) thirteen years' progress. One of these, however, looms so large that it must be mentioned. We are necessarily a collaborative project: over forty people, and some six institutions, have contributed significantly to our work. Typically, transcripts are started by one person at one institution, and then checked and rechecked by other people elsewhere. This is of course a great strength: no one person or institution could have done all this. But it can be a source of very great difficulty. If only one of these people, or one of these institutions, decides to insist on control of ‘their’ part of the project, and attempts to use this control to determine how it should be published (or deny its publication altogether), then we have a problem. If the materials should happen to be central to the project, then the existence of the whole project might be threatened. Of course, this might be done for the best of motives: but the effect is the same. The reason for this state of affairs lies in copyright law, and in the legal status of projects such as ours. The Canterbury Tales Project is not a legal entity, and so cannot own anything, including copyright. Rather, the copyright in the transcripts belongs either to the individuals who did them or to the institutions in which those individuals are based, which might vary from case to case. Already, this means typically that we need permission from at least two, and often three, institutions before we can publish any transcripts at all. As time passes, and more institutions become involved (if the project continues at all), every one of these institutions will need to be contacted for each and every act of publication. This, alone, threatens the future viability of the transcripts on which we have spent such energy.
We would like others to take them, re-use them, elaborate them (for example, including the graphetic information we rejected), and republish them: exactly the means of scholarship promoted by the fluid electronic medium. But if any future scholar has to go through a process of increasingly lengthy multi-sided negotiation, then the transcripts will become simply unusable, walled from the world by legal argument.

The answer to this, we can now see, is an open transcription policy, modelled on the copyright licensing arrangements developed by the Open Software Foundation. It is important to note that this policy does not mean that institutions and individuals give up all copyright control. The originators of the transcripts still retain this, and so can still (where possible) make commercial arrangements for their publication, and prevent inappropriate use. What it does mean is that the copyright holders assert that the transcripts may be freely downloaded, used, altered and republished subject to certain conditions (basically: the republication must be under the same conditions; all files must retain a notice with them to this effect; permission must still be sought for any paid-for publication). It seems to me appropriate that copyright holders retain a measure of control, in a collaborative environment. It does not seem to me appropriate that any one copyright holder should have exclusive control, to the point where that holder might, unilaterally and without consultation, determine the conditions under which transcripts might or might not be published, to the possible grave damage of others who have worked on the transcripts. The open transcription policy seems to me to balance nicely the rights and needs of all those involved in a collaborative scholarly project of this kind. The policy was accepted by the project steering group in 2002, as henceforth the official policy of the project. I regret greatly that we did not adopt this course at the very beginning of the project. Here then is my fifth lesson: editorial projects generating substantial quantities of transcribed text should adopt, from the beginning, an open transcription policy. 3

One should not finish this article with the impression that The Canterbury Tales Project is all there is to the electronic editing of medieval texts. We have had perhaps a longer continuous history than any comparable ‘born digital’ project in the medieval realm, but we are certainly not alone. The roll of names given at the beginning of this essay shows the vigorous activity in the field in just England and America. There is also a wide range of similar editorial projects in Europe: one could mention Michael Stoltz's work on the Parzival tradition; the initiatives of Andrea Bozzi at the Consorzio Pisa Ricerche and associates in the BAMBI and other projects; the electronic publications of SISMEL; the ‘virtual reunification’ of the Arnamagnaean collection underway in Copenhagen and Reykjavik, and several other projects in Old Norse; the ‘Forum computerphilologie’ in Germany; and many more. 1 This massive burst of activity, across all the traditional domains of medieval philology, puts one in mind of the grand editorial projects of the nineteenth century. The comparison is humbling: it remains to be seen whether the work we are now doing will last so well.

Notes
1.
For Kevin Kiernan: his Beowulf edition is introduced at http://www.uky.edu/~kiernan/eBeowulf/guide.htm and was published in CD-ROM in 2000 (Kiernan). The lessons of this project have been subsumed into the ambitious ARCHway project, which aims to create a whole architecture for research, teaching and learning in the digital medium, and particularly in the EPT project, which seeks to create an ‘Edition Production Technology’ for multimedia contents in digital libraries (http://beowulf.engl.uky.edu/~kiernan/ARCHway/pubs.htm). For Hoyt N. Duggan: SEENET is at http://www.iath.virginia.edu/seenet/ , with links to the Piers Plowman archive. For the Middle English Compendium: see http://www.hti.umich.edu/mec/ . Perhaps the most fully realized of the other enterprises referred to in this paragraph is Murray McGillivray's Book of the Duchess, released as a CD-ROM in 1997 by the University of Calgary Press; versions of this have also appeared online; see http://www.ucalgary.ca/ucpress/online/pubs/duchess/Websample/mainmenu.htm (McGillivray). Benson's work underlies the Harvard Chaucer pages at http://www.courses.fas.harvard.edu/~chaucer/index.html . Peter Baker's work on early English texts can be viewed through http://www.engl.virginia.edu/OE/ ; Graham Caie's teaching edition of the Miller's Tale is at http://www2.arts.gla.ac.uk/SESLL/EngLang/ugrad/Miller/cover.htm ; the Aberdeen Bestiary project is at http://www.clues.abdn.ac.uk:8080/besttest/firstpag.html ; the Johns Hopkins Roman de la Rose at http://rose.mse.jhu.edu/pages/terms.htm . There are many websites offering access to older English literary resources: for example, the British Academy portal at http://www.britac.ac.uk/portal/ .
2.
A pioneering instance of such discussion is Tim Machan's Textual Criticism and Middle English Texts (Machan). See also Hoyt N. Duggan's review of this in Text 10 (1996) and Machan's reply, both available at http://www.msstate.edu/Archives/TEXT/contents10.html .
3.
Paolo D'Iorio has come to exactly the same conclusion concerning the transcripts of Nietzsche materials prepared for the HyperNietzsche project, a large collaborative editorial project organized on very similar lines to the Canterbury Tales Project. See his HyperNietzsche (D'Iorio).
1.
For Michael Stoltz: see New Philology and New Phylogeny: Aspects of a Critical Electronic Edition of Wolfram's Parzival (Stoltz); for Andrea Bozzi and the BAMBI workstation see Sylvie Calabretto and Andrea Bozzi, “The Philological Workstation BAMBI” (Calabretto and Bozzi) at http://jodi.ecs.soton.ac.uk/ ; for SISMEL see http://www.sismel.it/ , listing six different series of electronic publication; for the Arnamagnaean collection see http://www.hum.ku.dk/ami/ ; for the Medieval Norse Text Archive see http://www.menota.org ; for the Forum computerphilologie see http://computerphilologie.uni-muenchen.de/ . These enterprises represent only a fraction of the activity in the area.