Electronic Textual Editing: The Women Writers Project: A Digital Anthology [Julia Flanders]


Introduction

In the transition to the digital medium, the anthology has undergone a number of significant changes—not least because the constraints that give print anthologies their distinctive character have been largely removed, while the methodological boundaries that separated the anthology from more strictly conceived editions have blurred. In describing the work of the Women Writers Project in this case study, in fact, I will be discussing a collection which began in the spirit of a traditional anthology, and then evolved under the influence of the past decade's debates about electronic editing and digital textuality. The result is a collection which has been described variously as an ‘archive’, an ‘edition’, and an ‘anthology’ and serves, in a sense, the purposes of all three.

Despite the range of textual approaches implied by these three terms, the original spirit of the anthology still animates the creation of these new digital collections. In a sense, what was essential about print anthologies was their aim of bringing together a wide range of material bound by some common factor: a period, an author, a genre, a language. The constraints of space typically require that a print anthology emphasize brief works or excerpts, and at first this may appear to be a functional constraint as well: allowing the reader to focus quickly on the qualities the anthology seeks to highlight, without having to grapple with long texts in their entirety. As a strategy for managing the problem of sheer scale, the anthology, like the canon, exercises a strategic simplification. However, with the digital anthology this same strategic purchase can be achieved not through exclusion and brevity, but through the intelligence of the data itself, which can enable the reader to discover the thematic subcollections within a larger assembly of texts.

Indeed, this emphasis on readerly discovery is part of a crucial shift which has shaped the digital collection and its editorial assumptions. The digital collection, unlike the typical print anthology, has a history of self-consciousness about editing, both in the sense of being uneasy about it and in the sense of wanting to foreground its own activities. This urge has double roots in the early discourse surrounding digital texts. For one thing, the initial scholarly distrust of digital texts encouraged an emphasis on full disclosure: responsible digital projects went far beyond ordinary print practice to document their sources and methods, and to provide the reader with careful linkages to the ‘real’ textual world. At the same time, the enthusiasm for the democratizing possibilities of digital texts (typified in early works such as George Landow's Hypertext or Richard Lanham's The Electronic Word) suggested that digital texts might constitute a critique of the politics of the printed word, including those of traditional editorial practice. Digital projects in the theoretical forefront took up the challenge of giving readers access to raw, unedited textual information (hence the ‘electronic archive’) and to whatever editorial decisions were to be made, so as to make the reader a potential participant in the editorial process, or at least not its unwitting and passive consumer. Early discussions of text encoding, too, foregrounded the question of whether encoding itself was, or could be, an objective process or whether it inevitably constituted an editorial intervention. Even though the climate of these debates has changed over time, they have helped establish the assumption that readers can and should be given a much more participatory and knowledgeable role in the process that brings a text finally before their eyes.

This trend has resulted in digital methods in which the source text retains a distinct existence within the electronic edition, rather than being consumed and ultimately effaced by the editing process. Editors wishing to provide the reader with access to the unedited source materials and also to an edited final product can do so by using an encoding system such as SGML (or, more recently, XML), through which multiple witnesses and their complex relationships may be encoded simultaneously. The final editorial decisions as to which reading was to be preferred could be encoded as well, in such a way that the reader could explore not only a given editor's choices but also the roads not taken, revealing other plausible hypotheses and useful representations based on the same source texts. Editions like the Canterbury Tales Project 1 or the Piers Plowman Archive 2 offer exceptionally rich ground for such an approach—and indeed could not be conceived without it—but even in less editorially fertile collections the reader can usefully be offered choices such as whether to view modernized or unmodernized spellings, original capitalization, relevant excerpts or full texts, and the like. And although this distinction between ‘raw’ source text and the editorial cooking it may receive needs to be tempered by an acknowledgement of how much cooking is already present in the encoding of the source, nonetheless the distinction usefully reflects what is for the reader an important new set of opportunities which had not been available before.

If one result of these developments has been a tendency to view digital collections in the spirit of archives—bodies of source material on which may be built a superstructure of metadata, retrieval and analysis tools, and editorial decisions—the corollary has been an almost ironic interest in the materiality of the text. At its most thorough-going, this results in an approach like that of the Rossetti Archive, 3 in which the physical details of the original text are captured and foregrounded as fundamentally constitutive of the text's significance. Few collections go to these lengths; indeed, most digital anthologies capture very little structural or physical detail from the source. But as the information transcribed diminishes, the ubiquity of page images rises, so that the information to which the reader has most direct access is in fact the physical sequence and appearance of the source. The challenge posed by projects like the Rossetti Archive is how to capture bibliographic codes and textual materiality in ways which can represent them usefully to readers: not simply as visible cues but as data which can give one leverage on the text.

The digital anthology or collection emerges out of these debates and possibilities not so much as a particular kind of artifact, but as a set of activities, which are in fact very similar to the activities readers expect of other kinds of digital objects. Indeed, the distance between the digital ‘archive’, ‘edition’, and ‘anthology’ is not very great, because all three seek to take advantage of the same set of functions which are characteristic of the digital medium. Their differences lie rather in the reference point they choose from the print world, the emphasis they choose in cueing the reader. It might be fair to say that there is no stable category of the ‘digital anthology’, only a body of works which inherit the anthology's logic of collection. For purposes of this discussion, then, a digital anthology is a collection of texts assembled primarily (though not exclusively) for the purpose of providing convenient (or perhaps sole) access to materials which are either unavailable in other forms, or which gain value from being collected together. The anthology may in fact take an archival approach, by presenting the texts in something close to a diplomatic transcription and emphasizing their materiality. It may also present a critical edition of each text (though this is rare, if only for reasons of cost).

The anthology or collection which is the subject of this case study, the Women Writers Project's online collection entitled Women Writers Online, is the product of the history just sketched. In what follows, I will describe the decisions and processes of greatest importance in developing an anthology of this sort, including the choice of editorial approach, the methods by which the text is captured and represented to the reader, the production process and its provision for error detection and correction, the publication infrastructure, and the role of text encoding as an editorial and scholarly research tool.

The WWP Case Study: General points

In presenting a specific case of a genre so broad and so likely to be variable, one should emphasize not only what is distinctive but what is exemplary: not the unfortunate quirks that have been rightly avoided by other projects, but the sound decisions that are worthy of emulation. During the WWP's dozen or more years of research on the digital editing of large textbases, we have occasionally ventured down some very thorny paths, but on the whole our approach has been guided by a few important principles which we feel are defensible and worth generalizing.

Probably the most fundamental of these is the approach we take to the transcription of textual witnesses. As suggested above, the WWP's methodology emerged from a set of debates about digital editing which place great emphasis on presenting readers with a set of data with which they can work, rather than an editorial fait accompli. It also seemed clear that in the case of early women's writing, where so few texts had been republished in any form, the appropriate role for a digital collection would be to make the primary source materials available to the public and encourage their study, rather than to prepare scholarly editions of a few texts. It seemed premature to determine what the best text of these profoundly unfamiliar works might be, when the scholarly community was still recovering basic information about their publication, authorship, and history. As a result, the WWP's emphasis has been on representing specific documents—particular copies of particular books—at a level of detail that would support many different kinds of textual study. Although it might seem absurd to imagine such a text substituting for a visit to the physical archive, our goal was to represent all of the linguistic detail that a view of the original would provide, and to capture all of the document's contents, even where they were almost unconnected with the main work (as for example in the case of advertisements).

In transcribing the text, we preserve the readings of the original text, whether or not they seem correct, explicable, or intended by the author or printer. Our premise here is, first, that errors may be significant, whatever their source: they are part of the information that circulated to readers when the text was first published, and are part of the evidence that literary researchers may wish to view. 4 And secondly, in many cases (particularly in earlier texts) it may be difficult to say with confidence that a given reading is an error. Given that our expectations about meaning and the conventions for its expression are based overwhelmingly on the textual tradition with which we are familiar, it seemed theoretically important not to allow these expectations to ‘correct’ a dissenting text into conformity.

This diplomatic transcription forms the basis for an encoded document which carries a great deal more information. The errors which we stolidly refuse to correct in the base transcription are marked and encoded, with an alternative corrected reading if one is obvious. Illegible passages are identified, with information about the number of letters or words that cannot be read, and a reading supplied from another source if available. Unclear text, where the given reading is uncertain, is also flagged. For a subset of the collection, entitled Renaissance Women Online, we have also added brief introductions and contextual materials giving historical and biographical background. 5 Finally, although the WWP does not at present encode more than one version of a given text, the possibility remains that alternate witnesses could be included and their differences encoded as variant readings. This could be done either by transcribing each witness independently and encoding all the differences between it and the others, or by creating a single master document containing the data necessary to reconstruct all possible versions.
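The layering described above, in which the uncorrected reading and an optional correction sit side by side in the encoding, can be illustrated with a minimal sketch. The element names used here (choice, sic, corr, gap, unclear) are modeled loosely on TEI conventions and are an assumption for illustration; they are not a record of the WWP's actual markup, and the sample sentence and the render function are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: one marked error, one unclear word, one illegible passage.
SAMPLE = (
    '<p>She <choice><sic>recieved</sic><corr>received</corr></choice> the '
    '<unclear>letter</unclear> on <gap extent="2 words"/> last.</p>'
)

def render(elem, view):
    """Return the text of elem for the view 'diplomatic' or 'corrected'."""
    if elem.tag == "choice":
        # Pick the original reading or the editorial correction.
        wanted = "sic" if view == "diplomatic" else "corr"
        child = elem.find(wanted)
        return render(child, view) if child is not None else ""
    if elem.tag == "gap":
        # Illegible text is reported, never guessed at.
        return "[illegible: %s]" % elem.get("extent", "unknown extent")
    # Any other element: its own text plus its children and their tails.
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, view))
        parts.append(child.tail or "")
    return "".join(parts)
```

Calling render(ET.fromstring(SAMPLE), "diplomatic") keeps the reading "recieved", while the "corrected" view substitutes "received"; both views are derived from the same source transcription, which is left unaltered.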

This approach owes a methodological debt to the work of Jerome McGann, whose arguments for the importance of the historical and physical specificity of documents and their production have provided good reasons to regard the document, rather than the ‘work’, as the central object of interest. We treat the authors' specific intentions with respect to literary meaning as not only largely unknowable but also beside the point: what we wish to represent is a cultural document, a piece of historical currency whose modern readers may or may not find in it insight into the author's mind. We do not wish to minimize the difficulty—philosophical and practical—of that insight; on the contrary, we wish to emphasize the historical distance and difference that must be bridged in reading these documents.

While the editions we create do not attempt to represent the ideal text that might emerge from a realization of the author's literary intentions, they do make cautious reference to a more limited model of social intentionality that governs domains like printing conventions, orthography, and genre. Furthermore, the agency of the author and of the other participants in the document's production is clearly an operative analytical category for readers, and insofar as the document bears its traces our encoding seeks to record it in a meaningful way. So for instance we distinguish between footnotes by the author and by her contemporary editors (and of course distinguish both from our own notes of whatever sort). In cases where a text contains sections by different authors, we can associate metadata with each section indicating its separate authorship. Most importantly, the header for each file includes identification of each participant—author, editor, publisher, printer, and potentially many others—together with the possibility of demographic information on each. 6

The fact that we are dealing with a large collection of documents by different authors has some important consequences in this context as well. Although each text has a certain autonomy within the collection and can be read independently, the reader may also ‘read’ the collection as a whole as a larger historical text representing four centuries of women's literate culture in English. The tools for this kind of reading—search and text analysis tools, concordance views, and other forms of textual manipulation—rely on our ability to identify generic commonalities and distinctions and capture these in the encoding of the texts. By using a vocabulary for describing genre and textual structure that locates the particular instance within a larger framework, we not only allow for comparisons across the collection but also potentially between this and other collections similarly prepared.

Finally, this approach is motivated by insights into what might be called information engineering. It is simpler to capture all of the primary data first and then derive from it whatever second-order versions are needed, than to create such a second-order version at the start and then work backwards to the source. On the basis of the information we capture about documents, we can create a multitude of editions which build on specific hypotheses about intention, revision history, and the like.

Text representation

The foregoing may already have suggested some of the complexities of discussing editorial method in a medium where data can—and should—be so clearly separated from display. In describing the source data we capture, we say almost nothing about what the actual representation of the text will be like; in describing the representation, we are only discussing one of a large range of possibilities, or a set of choices offered to the reader. The distinctiveness of the approach is revealed in the tools and constraints that govern what the reader may do with the text and what kinds of information he or she may expect to find there. In the case of the WWP, our greatest investment thus far has been in the capture of the source data. Our current provision for display has been constrained by available resources, and while it responds to the needs of our most frequent users—a university classroom audience—it does not represent our final expectations for the collection.

As we currently present it, the text is displayed in a manner that preserves the most significant details of its original formatting. The more determinate features of the text are reproduced more or less exactly: the original's capitalization, punctuation, and use of italics. Indentation and alignment are more difficult to reproduce with precision, because they depend on the particular size and aspect ratio of the original text block; instead of displaying absolute positioning, we represent what might be termed the ‘significant position’ of the textual unit, i.e. the positioning that distinguishes it from its context and indicates its structural function: whether it is centered, aligned left or right, or indented from one side or the other. Although we record all line breaks from the original, the display suppresses prose line breaks except within headings, letter closings, title pages, and similar features where line breaks may be significant. Similarly, line-end hyphens are retained in our transcription, but are suppressed when the text is relineated for display. From among the readings recorded in our transcription, the display currently provides a version with typographical errors corrected, and with the text's original use of i, j, u, v, and w regularized to conform to modern usage. Additionally, contractions derived from manuscript practice in very early texts (such as õ for on) are displayed in expanded form. These choices result in a reading text which serves a non-specialist audience; very shortly we will also be able to offer a display which shows the original readings and offers the ability to switch between views.
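The relineation step can be sketched roughly as follows. This is an illustration under stated assumptions, not the WWP's display software: it supposes that the transcription marks a line-end hyphen with a distinct character (here the Unicode soft hyphen, a hypothetical convention), and it uses a one-entry expansion table containing only the contraction mentioned above. The function name and sample lines are invented.

```python
SOFT_HYPHEN = "\u00ad"   # hypothetical marker for a line-end hyphen in the transcription
EXPANSIONS = {"õ": "on"}  # the one contraction mentioned in the text

def relineate(lines, expand=True):
    """Join transcribed prose lines into running text for display."""
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith(SOFT_HYPHEN):
            # The word was broken across lines: drop the marker and rejoin it.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    if expand:
        # Display contractions in expanded form; the source lines are untouched.
        for short, full in EXPANSIONS.items():
            out = out.replace(short, full)
    return out
```

Given the two transcribed lines "It was a cõdi" (ending in the marker) and "tion of the sale", the function yields "It was a condition of the sale"; with expand=False the contraction is left as transcribed. Because the line breaks and hyphens remain in the source data, a display that restores them stays possible.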

Several important kinds of textual feature are omitted both from our transcription and from our representation of the text. Most significantly, we do not capture any of the graphical features of the text such as illustrations and ornaments. Our transcription includes placeholders for such features, and in the case of figures (images with representational content) we encode a detailed description of the illustration and a transcription of any words it may contain. Non-representational ornaments, ruled lines, borders, and other printers' devices are noted merely as ‘ornament’ or ‘rule’ without further detail. This may seem like a real impoverishment of the text, and in a sense it is; we do not regard this as an ideal approach. However, the work and cost involved in negotiating rights to reproduce images of the source, and the logistics of digitizing these images, were beyond our reach at the start of the project. The result was a methodology which emphasizes the specifically textual domain: the ability to search, to manipulate the texts, to find material of interest. For graphical features, we reasoned, readers could consult microfilm or the original when necessary, whereas these options could never provide the textual power of the digital version.

Because we currently do not represent more than one copy of a given text, we do not have any apparatus representing textual variants. However, in some texts we do need to represent manuscript deletions and revisions. Where a complete representation of all the marks on the page is required, this is accomplished using an adaptation of marks often used in print, with strikethrough to mark deletions and square brackets to mark additions. However, it is also possible in our newest interface design for the reader to choose a clear reading display which presents either the originally printed version or the revised version, which allows for easy comparison of the two. A similar approach would be used if we did wish to represent textual variants from other sources: the varying text would be marked with some notation or highlighting indicating that a variant exists. Clicking on the text would display the variant reading(s); the reader could also toggle between different versions. In all of these cases, our goal is to present the reader with the option to explore variation in an informed way, or to choose a specific version as a clear reading text.
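The three displays described here (all marks, the original reading, the revised reading) could be derived from one encoded source along the following lines. The del and add element names are borrowed from TEI practice as an assumption, the HTML-like strikethrough tag in the output is only a stand-in for whatever the interface actually uses, and the sample line is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: one word deleted and another added in its place.
REVISED_LINE = "<l>My <del>heart</del><add>soul</add> is full</l>"

def _inner(elem, view):
    """Text of elem's content, with children rendered for the given view."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render_revisions(child, view))
        parts.append(child.tail or "")
    return "".join(parts)

def render_revisions(elem, view="full"):
    """Render elem as 'full' (all marks), 'original', or 'revised'."""
    if elem.tag == "del":
        if view == "full":
            return "<s>" + _inner(elem, view) + "</s>"   # strikethrough
        return _inner(elem, view) if view == "original" else ""
    if elem.tag == "add":
        if view == "full":
            return "[" + _inner(elem, view) + "]"        # square brackets
        return _inner(elem, view) if view == "revised" else ""
    return _inner(elem, view)
```

For the sample line, the "full" view gives "My <s>heart</s>[soul] is full", while "original" and "revised" give the two clear reading texts. Textual variants from other witnesses could in principle be handled by the same kind of switch.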

WWP in context of other projects

There are large numbers of text collections now available online which could be described as digital anthologies. Among those that may be considered scholarly resources—intended for a scholarly or academic audience and created with a high degree of editorial care—responsible practice has coalesced around a set of widely accepted methods, although local practice varies as to the details. There is general agreement as to the importance of providing searchable text, and a page image as well if possible. The advantages of marking up the text using an SGML or XML-based markup language are also well understood and widely accepted, although there are also highly regarded projects which are experimenting with non-SGML-based markup. The Text Encoding Initiative Guidelines for Electronic Text Encoding and Interchange 7 represent the most widely-used markup language in the scholarly community (aside from HTML, whose limitations are becoming clearer as scholarly expectations of digital resources rise). While its provisions for literary editing and text representation are extensive, few projects apply this encoding in detail, largely for reasons of cost and lack of needed expertise. Similarly, the need for good metadata—information by which a text can be identified, located and retrieved from within a collection—is also increasingly well understood, but high-quality metadata is challenging to create and consequently rare.

The WWP is situated among a group of digital anthologies that have adopted the most rigorous practices from among those sketched above. Providing page images is not practicable for us, but our full-text transcriptions and metadata are encoded in SGML following the TEI Guidelines with a degree of detail that is unusual (perhaps even unmatched) among projects of this sort. We are also unusual, though not unique, in providing a detailed account of our editorial and transcriptional methods to the reader as part of our site documentation. Perhaps because print anthologies typically wear their editorial practices lightly and seldom dwell on details of regularization, line-end hyphenation, and the like, digital anthologies (particularly those aimed at a student audience) are often similarly reticent.

As for the WWP's editorial practice, it too matches the range of practice accepted by the community of digital anthologies. The use of unamended or very lightly edited diplomatic transcriptions from single sources is common and serves the goals of such projects well. 8 The WWP's preference is to capture any emendation using SGML encoding, rather than making silent alterations, and as a result our approach more than others may lend itself to offering readers alternative versions of the text (concerning the treatment of details like typographical errors or abbreviations).

Practical procedures

A number of practical procedures are worth describing here, because of their impact on the reliability of the resulting transcriptions. Our choice of source text (the edition and copy to be used for transcription) and our methods of assuring transcriptional accuracy and encoding consistency have required careful thought and planning.

Our basic criteria for choosing a source text are in many ways unremarkable, reflecting the factors that give a text particular scholarly value. We prefer a first edition, or an edition published within the author's lifetime; if there is evidence that a certain edition was revised by the author, that fact will carry weight as well. 9 Within a given edition, we choose a copy that is both legible and complete. From among copies which meet our criteria, we prefer to choose one which is readily available on microfilm.

The accuracy of the transcription is checked by several different kinds of proofreading. Because some of the transcribed characters are actually captured in the markup itself, the first proofreading is performed with the markup visible, so that the proofreader can ensure that features such as rendition, abbreviations, typographical errors in the original, and the use of i, j, u, v, and w have been captured correctly. A subsequent proofreading is ideally performed on a formatted copy which shows only the content of the text. 10 In both cases, the proofreader does a word-by-word reading of the output against a copy of the source. In addition, we have developed tools which list the vocabulary of a given text, expressed as a unique word list, so that individual usages (those most likely to be typographical errors) can be spotted and checked; the list can also be checked against a cumulative dictionary of WWP usages to catch the most frequent errors.
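A word-list check of this kind might look something like the sketch below. It is a simplified illustration, not the WWP's own tool: the tokenization rule, the function names, and the sample dictionary are all assumptions.

```python
import re
from collections import Counter

def word_counts(text):
    """Count each distinct word form in a text, ignoring case."""
    # Runs of letters only; digits, punctuation, and underscores are separators.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

def suspicious_words(text, known_words):
    """List forms that occur once and are absent from the cumulative dictionary."""
    counts = word_counts(text)
    return sorted(w for w, n in counts.items() if n == 1 and w not in known_words)
```

For the text "The quene and the queene saw the quene." with a dictionary containing the, and, saw, and quene, the only form returned is "queene". Such a form is merely a candidate for checking against the source, since in early texts a unique spelling is as likely to be a genuine reading as a transcription error.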

Consistency of encoding is the most difficult to achieve, particularly with a complex encoding system like the TEI Guidelines. There are in many cases several encoding solutions possible for a given textual feature, all of them valid SGML and many of them equally defensible. It is therefore important to build in checks to ensure that similar features are always tagged alike. Like most digitization projects, we rely first of all on careful and extensive documentation which our encoders use both during training and as a reference while they are transcribing. Following the initial transcription and encoding, we also run a set of automated checks which scan for the most frequent kinds of encoding inconsistencies and also check for certain kinds of errors which are difficult to catch manually. Finally, each text is given a final review when the last set of proofreading corrections has been entered. With all of these checks, there are nonetheless differences from text to text which may even amount to differing encoding aesthetics. It is not clear to what extent these affect the digital behavior of the collection; as tools for manipulating digital texts grow more powerful, we will need to develop more nuanced ways of assessing and enforcing consistency.
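As a rough indication of what an automated consistency check can involve, the sketch below flags two kinds of problem in an encoded text: a rendition value outside an agreed vocabulary, and an illegible passage recorded without its extent. The attribute names, the vocabulary, and the rules themselves are hypothetical; the actual checks run by the WWP are not described in detail here.

```python
import xml.etree.ElementTree as ET

# Hypothetical project vocabulary for the rend attribute.
ALLOWED_REND = {"italic", "center", "right", "indent"}

def check_encoding(xml_text):
    """Return a list of messages describing likely encoding inconsistencies."""
    problems = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        rend = elem.get("rend")
        if rend is not None and rend not in ALLOWED_REND:
            problems.append("%s: unexpected rend value %r" % (elem.tag, rend))
        if elem.tag == "gap" and elem.get("extent") is None:
            problems.append("gap: missing extent")
    return problems
```

A document in which one encoder has written rend="italics" where the documentation specifies "italic" is valid markup and would pass a parser, which is why checks of this sort have to be written against the project's own conventions.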

Conclusion

As a case study, the WWP's example illustrates a few tradeoffs which are particularly significant in the transition to digital editing. The WWP anthology emphasizes capturing editorial decisions not as finalities but as contingencies, with important effects. By capturing the text so as to represent its variability as a data structure, we are able to create a distinct editorial space which stands apart from the source transcription and from any final editorial result. This space is accessible to us as editors—it is the place where editing proper can really occur—but it is also accessible to readers, enabling them to inspect the decisions that have been made and choose different strategies if they wish. In our particular case, these choices operate at a broad level, on things like typographical modernization or abbreviations, but they could operate on individual readings as well using the same basic approach. What is crucial here is that the editorial work has not simply been displaced onto the reader, an abnegation of editorial responsibility which—though a heady possibility earlier in digital history—is no longer regarded as desirable by editors or readers. Instead, the process has the potential to be both collaborative and mutable, contingent and yet not flimsy.

This strategy, with its underlying infrastructure of SGML or XML encoding, is now increasingly the approach taken to digital editions of all sorts. And although the digital anthology is a loosely identifiable genre in the landscape of digital editions, it is also in a sense the form which that landscape as a whole is taking. With the development of large-scale retrieval tools and methods of federating digital resources, not only the large digital library collections but also individual editions may be treated as part of one vast textual field. At first such a textual universe seems to have little to do with the anthology tradition: it reverses the selectivity, the annotation, the editorial uniformity, the thematic appositeness which are characteristic of the print anthology. But with care, these qualities can—and should—be relocated into the liminal space between the data and the interface, where choices about which texts, which readings, which presentation, can be made and remade. The digital anthology thus serves both as macrocosm and microcosm, a scale model of the textual world that contains everything and yet fits in the palm of your hand.

Notes
4. Indeed, the many examples of ingenious research (on the quantity of type used to set a given book, or the error rates of different compositors, or the possibilities of pronunciation alternatives) seemed like important indicators of the variety of research that might prove possible and illuminating if the requisite data were available.
5. See http://www.wwp.brown.edu/texts/rwoentry.html . These contextual materials are limited to the RWO collection for reasons having to do with the vagaries of funding; they were created under a generous grant from the Andrew W. Mellon Foundation, as an experiment in digital pedagogy and publication, but the experiment was not extended to the entire WWP collection.
6. Unfortunately limited resources have prevented us from developing this demographic metadata fully, but it would be a natural and extremely valuable addition to the resource and forms part of the methodology we would recommend to others.
8. Digital anthology projects are often motivated by the desire to provide access to rare materials (women's writing, slave narratives, rare books, documentary materials) where a transcription of the original document may be of greater value to readers than an edited version. In addition, there may be practical reasons for this emphasis; few anthology projects have the resources to create substantially edited versions on a large scale, and most prefer to put their resources into digitizing additional texts.
9. Indeed, if there exist two substantially different versions of the text, we strongly consider encoding both, resources permitting.
10. Given the vagaries of SGML output software, it has not always been possible to produce this latter form of proofreading output; in these cases, a second round of proofreading with encoding visible is carried out.

Last recorded change to this page: 2007-10-31