Electronic Textual Editing: Documentary Editing [Bob Rosenberg]
The most important point to be made about any digital documentary edition is that the editors' fundamental intellectual work is unchanged. Editors must devote the profession's characteristic, meticulous attention to selection, transcription, and annotation if the resulting electronic publication is to deserve the respect given to modern microfilm and print publications. At the same time, it is abundantly clear that a digital edition presents opportunities well beyond the possibilities of film and paper. In the case of the Edison Papers, both the microfilm and print editions were well under way, and the electronic edition was seen as a means to combine and extend the work done without significantly altering the established editorial principles.
There are a number of principal considerations facing anyone planning to create an electronic edition of historical documents. If the documents are presented as images, the primary concerns will be construction of a database and creation of those images; if the documents are all transcribed, then preparation and presentation of the text will be foremost. The Edison Papers is working to combine images and text, and I hope that a careful examination of some avenues explored and lessons learned in that process will be helpful to anyone fortunate enough and bold enough to undertake such a task. 1
The Edison Papers is unusual in several respects, the first and most striking being the size of the archive from which it draws. When the project was launched at the end of the 1970s, it was estimated that the collection at the Edison National Historic Site comprised about 1.5 million pages. Within a few years that estimate had grown to approximately five million pages, which made the tens of thousands of Edison pages in other archives, libraries, and repositories seem easily manageable. It was clear that the two editions projected at that time—microfilm and print—would be selective, with the microfilm including about ten percent of the holdings, and the printed volumes including perhaps two or three percent of that ten percent. A second unusual aspect of the Edison corpus is the central importance of drawings and even physical artifacts to an understanding of its subject's work, which is a direct consequence of Edison's being an inventor, fresh territory for documentary editing. Nevertheless, despite these differences and others more subtle, we decided at the outset to hew as closely as possible to the standard practices of documentary editing.
The original plan was to have the microfilm proceed ahead of the print edition, since the microfilm could capture years of Edison's life while the book editors were dissecting it one day at a time. This worked admirably, even if at first it had the book editors champing at the bit, teasing insight out of reams of photocopies. Organizing, comprehending, selecting, and filming the documents was, and remains, a truly Herculean task. When the first part of the collection was published on film in 1985, the book editors had Edison's early work at their fingertips. At the same time, two crucial pieces of the foundation for the electronic edition had been unwittingly put in place. First was the structure of the descriptive data recorded about each document; second was the high quality of the microfilm itself, which would later allow the creation of excellent digital images.
It had been impressed on the project organizers that the only way to control a collection of this size was with an electronic database. Fortunately, the Joseph Henry Papers had already started blazing a trail into that mysterious territory. Using that experience and knowledge as a foundation, the Edison Papers created a database that would prove two decades later to be the heart of their electronic edition. At its inception, the database served two functions: it was the raw material from which was created the detailed, item-level printed index that accompanied the microfilm, and it also contained information about the organization and contents of the documents which was used for in-house research.
Each document's record contained the following fields:
- Group: a two-character field identifying the record group to which the document belonged (‘A’ for accounts, ‘PN’ for pocket notebooks, etc.)
- Location: an eight-character field specifying the position of the document within its record group (‘7204A’ would indicate the first document in the fourth folder for 1872)
- Type: a two-digit field that held a code indicating which of the many types of documents this document was (there are more than 50 types in the edition). A separate table held the codes and their full meaning (‘01’ = Accounts, ‘33’ = Test Reports, ‘79’ = Interviews, etc.)
- Date: three two-character fields: month, day, year. Dating turned out to be very complex. Although it would be nice to use the date function that is built into most databases, it cannot be done in any straightforward way because documents are frequently partially dated, and the date function will not allow dates such as ‘May 1875.’
- Author: two three-character fields containing codes. As in the Type field, there was a separate table holding the codes and the names they represent. There were two Author fields (and two Recipient fields) to allow for situations such as an individual writing on behalf of a company.
- Recipient: two three-character fields containing codes.
- Name Mention: two three-character fields containing codes. Limiting name mentions to two fields meant that many names appearing in documents could not be included, which was unfortunate. However, computational power and storage were much dearer when the project began.
- Subject: three three-character subject codes.
- Status Codes: six single-character fields that flagged information about the documents and their data: What language is it in? Is it a fragment? Is it a photocopy? Is it an attachment or enclosure? Is some part of the date conjectured?
- Reel, Frame, Addframe: three fields that recorded the reel and frame numbers of the document on the microfilm.
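The record layout described above can be sketched as a small data structure. This is only an illustration, not the project's actual schema: the class name `DocumentRecord`, the attribute names, the use of integers for date parts, and the sample values are all assumptions. The point it shows is why the date is stored as three separate optional parts, so that a partial date such as ‘May 1875’ can be recorded and displayed without inventing a day.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


@dataclass
class DocumentRecord:
    """Hypothetical model of one database record (names are assumptions)."""
    group: str                      # record group, e.g. "A" or "PN"
    location: str                   # position within the group, e.g. "7204A"
    doc_type: str                   # type code, e.g. "01" for Accounts
    year: Optional[int] = None      # any part of the date may be unknown
    month: Optional[int] = None
    day: Optional[int] = None
    authors: List[str] = field(default_factory=list)        # name codes
    recipients: List[str] = field(default_factory=list)     # name codes
    name_mentions: List[str] = field(default_factory=list)  # name codes
    subjects: List[str] = field(default_factory=list)       # subject codes
    reel: Optional[int] = None
    frame: Optional[int] = None

    def date_label(self) -> str:
        """Render a full or partial date without forcing missing parts."""
        if self.year is None:
            return "undated"
        if self.month is None:
            return str(self.year)
        if self.day is None:
            return f"{MONTHS[self.month - 1]} {self.year}"
        return f"{MONTHS[self.month - 1]} {self.day}, {self.year}"


# Invented sample record with a month and year but no day.
partial = DocumentRecord(group="PN", location="7204A", doc_type="33",
                         year=1875, month=5)
print(partial.date_label())  # May 1875
```

A built-in date type would reject this record or silently supply a day; keeping the parts separate preserves exactly what the document says.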
As a database structure, this is far from what would now be considered optimal. However, it did contain almost all the information needed for searching and retrieval. Over time, the database migrated to three new programs: first to a different mainframe program, then to a desktop PC, and finally to a newer desktop program. Once on the desktop, the number of name mentions and subjects was increased to 16 each, the Group and Location fields were merged into a single Document ID, and a field was added that held a code for the folder or volume containing the document. With the most recent migration (to Microsoft Access), the data was ‘normalized,’ which means it was broken up into a larger number of interlinked tables, most of which contain only a few fields. This makes the data easier to manipulate; it also means there is no limit to the number of names or subjects recorded for a document.
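To make the idea of normalization concrete, here is a minimal sketch using SQLite from Python rather than Microsoft Access. The table names, column names, document ID, and name codes are all invented for illustration and do not reproduce the project's tables. It shows how moving names into a linking table removes the fixed limit on how many names a document can carry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (doc_id TEXT PRIMARY KEY, type_code TEXT);
    CREATE TABLE name (name_code TEXT PRIMARY KEY, full_name TEXT);
    -- One row per (document, name, role): no fixed number of name slots.
    CREATE TABLE document_name (
        doc_id    TEXT REFERENCES document(doc_id),
        name_code TEXT REFERENCES name(name_code),
        role      TEXT  -- 'author', 'recipient', or 'mention'
    );
""")

# Invented sample data.
conn.execute("INSERT INTO document VALUES ('DOC001', '01')")
conn.executemany("INSERT INTO name VALUES (?, ?)",
                 [("AAA", "Person A"), ("BBB", "Person B"),
                  ("CCC", "Person C"), ("DDD", "Person D")])
conn.executemany("INSERT INTO document_name VALUES ('DOC001', ?, ?)",
                 [("AAA", "author"), ("BBB", "mention"),
                  ("CCC", "mention"), ("DDD", "mention")])


def names_for(doc_id, role):
    """Return the full names linked to a document in a given role."""
    rows = conn.execute(
        """SELECT n.full_name
           FROM document_name AS dn
           JOIN name AS n ON n.name_code = dn.name_code
           WHERE dn.doc_id = ? AND dn.role = ?
           ORDER BY n.full_name""",
        (doc_id, role))
    return [row[0] for row in rows]


print(names_for("DOC001", "mention"))
```

Adding a fourth or fortieth name mention is just another row in `document_name`; nothing in the table definitions needs to change.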
This is not meant to be simply a long and inordinately detailed technical discussion. Any digital documentary edition that provides images of the original documents, with or without transcriptions, can be no better than its database. At the very least the database must have dates and authors' and recipients' names. It should also have information about the organization of the edition. One of the great strengths of a microfilm edition is that once a user finds the first page of a letter, an account book, or a legal proceeding, the successive pages will usually appear on successive frames. Moreover, documents in a given editorial grouping usually appear together on the film. Digital images, however, must be ordered for the user, and beginnings and ends of documents must be flagged somehow. We will return to the issue of the database after discussing images.
In the mid-1990s, after the first graphical browsers had awakened everybody to the Web's potential, a foundation program officer was arguing against the work of entering documents' information into the Edison Papers' database. ‘You don't have to do all that indexing,’ he said. ‘Just scan the documents and put them on the Internet!’ Not only was he wrong about the indexing; he was wrong about the ‘just scan.’ At the time of that conversation, the Edison Papers had published 162 reels of microfilm, each reel averaging slightly more than 1,000 images. The market for microfilm scanning was largely driven by institutions such as banks and insurance companies, which had huge collections on microfilm and which wanted greater access to those images. They were not interested in subtlety, but were content with black-and-white (1-bit) images. Such images can suffice if a typed or printed document is scanned at a sufficiently high spatial resolution (300-600 dpi [dots/pixels per inch]), but most of the documents in the Edison Papers microfilm edition were handwritten, many of them in pencil or light pen, and many on paper that had darkened in the century since they were written. There was no doubt that the documents would have to be scanned as eight-bit images (256 shades of gray), a capability the scanner manufacturers were just beginning to explore. Increasing the bit depth of the images allowed us to scan them at a relatively modest resolution of 200 dpi.
After some misadventure, we settled on a vendor. We soon found that the scanning produced better images when we used negative film, as the amount of light that came through positive film tended to overload the sensors and wash out fainter lines. Besides straightforward quality control issues of light or dark images, we found that the scanner occasionally trapped dust particles between the sensors and the film, creating a streak across dozens or even hundreds of images. Those problems aside, we were pleasantly surprised by the quality of the images. Because the documents were recorded on high-contrast microfilm, we expected little in the way of fine distinction, but in fact the images often revealed details that were nearly or actually impossible to see using a microfilm reader.
The original time estimate for the job was six months. There were some kinks to be worked out in the technology, and it took about two years. When it was done, we had nearly 1,500 CDs holding a terabyte of data. The images, captured as uncompressed TIFF files, averaged around 6 MB each, and before we could deliver them over the Internet we would have to create smaller versions. Even more important, we had to somehow link the images to the appropriate document information, so that when a user called up a particular document the correct images would appear.
The creation of derivative images, like much editorial work, is repetitive but not routine and requires a finicky intelligence. We did not have sufficient storage space to put the full-size images online as a viewing option, so the user was going to get one derivative image and it had to be legible. We aimed at reducing the spatial resolution to 80 dpi, which is about life-size on most computer screens, and using JPEG compression to wind up with an image that was about one percent the size of the original (an average of 60 KB). Because the microfilm edition groups documents by subject or type, the images could often be batch-processed and then reviewed for quality. We did not hesitate to lighten or darken an image if the alteration made it easier to read. (Some editors initially winced at this, reflecting a general uneasiness about manipulating digital images, but it is philosophically no different from changing the lighting when microfilming in order to enhance the contrast of a document; an illegible document is of little use.) Although most of the film was shot at a constant 14:1 reduction, some unusually large or small documents were filmed at other ratios. We tried to scan all the images at the equivalent of 200 dpi on the original. However, in the case of agate-type newspaper clippings and other documents with fine detail, we increased the spatial resolution of the online images to make them legible for the user.
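The figures quoted in the last few paragraphs hang together, as a rough back-of-the-envelope check shows. The sketch below takes the rounded numbers in the text at face value (162 reels, about 1,000 images per reel, about 6 MB per TIFF, about 1,500 CDs) and uses decimal units; the variable names are mine, and the actual totals will have differed somewhat.

```python
REELS = 162                # reels published at the time of the conversation
IMAGES_PER_REEL = 1_000    # "slightly more than 1,000" per reel
TIFF_MB = 6                # average uncompressed TIFF size
CDS = 1_500                # "nearly 1,500 CDs"

total_images = REELS * IMAGES_PER_REEL
total_mb = total_images * TIFF_MB
total_tb = total_mb / 1_000_000          # roughly one terabyte
mb_per_cd = total_mb / CDS               # roughly one CD's capacity

# One percent of a 6 MB original is about 60 KB.
derivative_kb = TIFF_MB * 1_000 * 0.01

# Going from 200 dpi to 80 dpi shrinks each dimension to 40 percent,
# so the pixel count falls to 16 percent of the original.
linear_scale = 80 / 200
pixel_fraction = linear_scale ** 2

# The remaining reduction (16 percent down to 1 percent) has to come
# from JPEG compression, about 16:1 on these assumptions.
jpeg_ratio = pixel_fraction / 0.01

print(total_images, round(total_tb, 3), round(mb_per_cd))
print(round(derivative_kb), round(pixel_fraction, 2), round(jpeg_ratio))
```

On these numbers, resampling alone accounts for only part of the size reduction; most of it comes from the compression step, which is why legibility had to be reviewed after processing.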
Linking images to their document information was a painstaking process. There is no simple one-to-one correspondence between images and their information. A document might contain one or more attachments or enclosures, for example, in which case a user who retrieves the covering document will want to see the enclosures as part of the document, while at the same time the enclosures might be recorded as separate documents themselves. That is, the same image may be linked to more than one document. With the help of an outside programmer (and a 21-inch screen), we created an interface that displayed successive digitized images on one side and database information on the other. Using the microfilm frame numbers in the database, the program would calculate the number of images in a document. Most of the time the calculation was right, but when it was wrong the operator—working with the digital images, the database, and the microfilm—could easily override it.
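The linking step can be illustrated with a small sketch. It assumes, hypothetically, that each record stores the frame on which the document starts and that a document's image count is estimated from the start of the next document on the reel; the text does not say exactly how the program made its calculation. Document names, frame numbers, and the operator's correction are all invented.

```python
def estimate_counts(starts, last_frame):
    """Estimate images per document from start frames on a single reel.

    starts: dict mapping document ID to its first frame number.
    last_frame: number of the final frame on the reel.
    """
    ordered = sorted(starts.items(), key=lambda item: item[1])
    counts = {}
    for i, (doc_id, start) in enumerate(ordered):
        if i + 1 < len(ordered):
            next_start = ordered[i + 1][1]
        else:
            next_start = last_frame + 1
        counts[doc_id] = next_start - start
    return counts


# Invented example: a letter, its enclosure, and an unrelated receipt.
starts = {"letter": 10, "enclosure": 12, "receipt": 15}
counts = estimate_counts(starts, last_frame=15)

# The operator sees that frame 14 does not belong to the enclosure
# and overrides the calculated count.
counts.update({"enclosure": 2})

# Build the document-to-image links.
links = {doc_id: list(range(start, start + counts[doc_id]))
         for doc_id, start in starts.items()}

# A user retrieving the covering letter should also see the enclosure,
# so the same frames are linked to both documents.
links["letter"] = links["letter"] + links["enclosure"]

print(links)
```

The result is a many-to-many relationship: frames 12 and 13 belong both to the enclosure as a document in its own right and to the letter that covered it.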
The result is an online image edition that allows the user to sample or assemble the documents in a number of ways—name, date, document type, editorial organization—and to view as a group documents scattered across many reels of microfilm. At the same time, there are certain characteristics of microfilm that are useful to preserve in an electronic edition. A user landing on a page in the middle of an experimental notebook, account book, or scrapbook is likely to want to browse forward and backward through the entire item. This problem was solved by the creation of a new data table, but the solution was only possible because the structural information identifying the collection of individual documents as a unit was already present in the database.
As might be expected, the other side of the edition—creating live, linked text from the transcribed documents and their editorial apparatus—presented its own set of issues. Again, the electronic text and apparatus, if not identical to those in the published volumes, reflected the same principles of selection, transcription, and annotation. There were both theoretical and practical considerations in the creation of the digital text for the Edison Papers, as is always the case, but the practical issues predominated.
At the start of the project we had chosen ‘a conservative expanded approach [to transcription] that does not try to 'clean up the text' ... to strike a balance between the needs of the scholar for details of editorial emendation, the requirements of all users for readability, and the desire of the editors that all readers obtain a feel and flavor for Edison, his associates, and their era’ (Jenkins lv-vi). Because of the nature of the documents, establishment of an authoritative text—in the sense of choosing between alternative readings—was rarely an issue. Only a tiny percentage of the documents (such as contracts, letters to the editor, or patent specifications) were written or printed more than once.
Traditional text editing conventions generally proved a comfortable fit. Where they were not, we tried to keep our improvisations as close as we could to the spirit of traditional models. For example, we used traditional abbreviations to describe the documents: ‘A’ for autograph, meaning the document was in the hand of the author; ‘L’ for letter; ‘D’ for document; ‘S’ for signed; and so on. However, there was no existing symbol for an artifact, nor was there one for technical notes or notebook entries, both of which we had in abundance. So we created new ones: ‘M’ (Model) for physical objects, and ‘X’ for technical materials. Almost immediately, we realized that notebook entries and similar materials presented a significant new entanglement. From quite early in his career Edison had co-workers beside him at the bench, people who helped him carry out his research plans. Sometimes one of them would work and another would take notes; sometimes the experimenter would make his own notes; sometimes a group of them would work together with one keeping notes; sometimes the researcher was more or less autonomous; sometimes they were pursuing a line of thought that Edison had assigned. What stumped us was the question of authorship raised by such documents. Was the author the person who did the work? Recorded the work? Had the idea for the work that was carried out? Even if those puzzles had answers, most of the time we couldn't assign those roles with much surety. Finally we simply cut the Gordian knot and declared that documents of type ‘X’ had indeterminate authors, even when Edison wrote them. This turned out to reflect the way work was pursued as well as the way many of Edison's co-workers felt about the work. They realized that they were active participants, but they also recognized that when Edison was not in the laboratory, work slowed after a couple of days and that in fact the work would not have existed without Edison to drive it.
Although the quantity of Edison's drawings and their importance are very unusual for documentary editions, with only a few published scientific documents as distant prototypes, there is one documentary category in which the Edison Papers pioneered and which so far remains unique to the project: technological artifacts (Rosenberg). Edison was, after all, an inventor, and the things he created were the core of his work. In the mid-1980s, when we confronted the problem posed by physical objects, we considered several options for presentation, even exploring videodisc (and, in a lighter moment, paper pop-up constructions). Finally we decided to give each artifact an annotated introductory headnote, presenting the object as a photograph if we had one (preferably our own, if we had access to the object, or a historical image if we didn't) or, failing that, a historical sketch, patent drawing, or other representation. Dating was a challenge (we settled on design as comparable to textual composition), as was the slippery analogy of transcription for both photographs and drawings. We have found the system satisfactory for print, and the edition's users have seemed to agree. The electronic edition will offer similar presentation of photographs and drawings. For those instances where an artifact still exists, though (stock tickers, electric lighting, and motion pictures, for example), the internet's potential for displaying sound and motion opens fascinating possibilities for annotation, affording the user detail, depth, and understanding simply not possible with static images and text. As Edison's designs stretch the notion of ‘artifact’ to include his electrical central stations and the Ogdensburg ore-milling plant of the 1890s, questions of presentation will doubtless adapt with them.
It was planned from the beginning of the electronic edition that the text of the print volumes would be included, marked up with SGML (later XML) to take full advantage of the capabilities of live digital text. But before doing any markup, the text itself had to be established, letter-for-letter as accurate as in print. This was not a problem for volumes three and four, because all the corrections, alterations, and additions to the documents and editorial material in those volumes made after their submission to the press—through galleys and page proofs—had been entered into the electronic files of the documents. For the first two volumes, however, edited and published before the electronic edition was on the horizon, the electronic files were uncorrected. Neither scanning with optical character recognition (OCR) nor simply retyping the published text was accurate enough to avoid another full proofreading of the text. However, those two processes combined offered an extremely accurate text, since the computer and the typist did not make the same kinds of mistakes. The scanned text was divided into individual documents, and a second copy of the documents was typed in individually from the volumes. The project secretary then used the ‘compare’ function of WordPerfect to find differences between the two versions and create a corrected one, which she stored. Editorial material, such as front matter, headnotes, and back matter, was treated the same way. 2
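The principle behind this double-keying approach is easy to demonstrate: compare the two independently produced versions and have a person resolve only the places where they disagree. The project used WordPerfect's compare function; the sketch below substitutes Python's standard `difflib` module to show the same idea. The sample sentence and both of its errors are invented.

```python
import difflib


def disagreements(ocr_text, typed_text):
    """Return (ocr_reading, typed_reading) pairs where the versions differ."""
    ocr_words = ocr_text.split()
    typed_words = typed_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=typed_words,
                                      autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(ocr_words[i1:i2]),
                          " ".join(typed_words[j1:j2])))
    return pairs


# Invented example: the scanner misreads "m" as "rn"; the typist
# transposes two letters in a different word.
ocr_version = "Tested the carbon rnicrophone today"
typed_version = "Tested teh carbon microphone today"

for ocr_reading, typed_reading in disagreements(ocr_version, typed_version):
    print(f"OCR: {ocr_reading!r}   typed: {typed_reading!r}")
```

Because the two kinds of error rarely fall on the same word, each version exposes the other's mistakes, and the corrector only has to check the flagged readings against the printed volume.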
What was not clear at the outset was what DTD would be used (even less clear was what software would be used to deliver the text on screen). 3 The work done by David Chesnutt, Michael Sperberg-McQueen, and Susan Hockey for the Model Editions Partnership was immensely helpful, as was the DTD they developed. The MEP tackled a diverse collection of electronic documentary presentations, and the resultant Markup Guidelines for Documentary Editions is now the only reasonable starting place for anyone preparing text for an electronic documentary edition. 4
Some of the tagging we recommend can be automated in very simple ways, e.g. by simple macros in a word processor, or in some cases even by simple global changes. Other tagging we recommend can be automated successfully only by a skilled programmer. Some things fall between the two extremes, and can be performed by an astute editor or a journeyman programmer. Some kinds of tagging cannot be fully automated, even by expert programmers, but an automatic process can propose tagging for a human editor to accept or reject, in much the same way that a selective global change in a word processor allows the user to decide whether or not to make the change, on a case by case basis. And, finally, some tagging is done most simply by hand.
Automated and semi-automated tagging can substantially reduce the cost of tagging an edition, but failed attempts to automate what cannot be automated can consume alarming amounts of time, patience, and money. The art and challenge of managing the creation of an electronic edition using limited resources lies, in no small part, in automating what can be automated, doing manually what must be done manually, and deciding (perhaps with a sigh) to leave untagged what cannot be tagged automatically and is not essential to the edition. It will not always be easy to decide where to class a particular kind of tagging: some kinds of information require manual tagging in some collections of documents, but can be tagged automatically or semi-automatically in others. Right judgment will depend on the body of materials being edited, on the time and resources at hand, and on the skills of the available programming assistance. (Markup Guidelines, Section 2.4)
Those paragraphs constitute guiding principles for editors. We had the advantage of long familiarity with our word processor, which allowed us to write fairly complex macros that greatly simplified much of the tagging. Nevertheless, the actual work required as much intelligence and care as any other in the project.
The structural tagging of the documents tended to be straightforward (which did not always mean easy). Markup targeted the structure and physical presentation of the text—date lines, paragraphs, closings, signatures, damage—rather than such details as names of people or places. Perhaps the most important structural decision concerned images in the documents—what in the volumes we call ‘art’ (as opposed to ‘illustrations,’ which are images we supply). These images are usually integral to the meaning of the documents, especially in notebook entries, patent documents, and other technical material. Occasionally a selected document has no words on the page at all, and we have to insert bracketed letters next to drawings as hooks from which to hang annotation. Often enough the placement of text relative to drawings is meaningful, and there are times when no transcription can fully capture that relationship. On paper we tackle these issues with careful explication, thoughtful design, and sometimes inclusion of text in reproduced art; still, there is sometimes meaning lost in a patchwork of text and excised drawings. But in the electronic edition, with the original document instantly available, the transcribed representation of such documents no longer carries the full burden of interpreting the document to the user. It becomes an aid to reading, and the design of individual documents —still a nightmare of fluidity on screen— becomes a secondary consideration. With the text transcribed, <figure> elements that contain the ID of their respective page image are inserted where images exist on the page. Those elements will appear on screen as icons that, when clicked, will call up the entire page image.
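As a rough illustration of that last step, the following sketch inserts a `<figure>` element carrying a page-image ID into a transcribed document. It uses Python's standard `xml.etree.ElementTree`; the wrapper element names, the attribute name `pageimage`, the image ID, and the sample text are all assumptions made for the example and are not taken from the project's DTD.

```python
import xml.etree.ElementTree as ET

# Invented transcription with two paragraphs; in the original, a
# drawing sits between them on the page.
transcription = ET.fromstring(
    "<doc><p>Tried the new relay.</p><p>It worked well.</p></doc>"
)


def add_figure(parent, position, image_id):
    """Insert an empty <figure> pointing at a page image.

    'pageimage' is a hypothetical attribute name used for illustration.
    """
    figure = ET.Element("figure", {"pageimage": image_id})
    parent.insert(position, figure)
    return figure


# Place the figure between the two paragraphs.
add_figure(transcription, 1, "IMG0001")

print(ET.tostring(transcription, encoding="unicode"))
```

Delivery software could then render each such element as an icon and use the stored ID to fetch the full page image when the user clicks it.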
Embedding editorial material proved challenging. The first significant decision concerned the index. The back-of-the-book index, although created for use in a codex, is a sophisticated intellectual tool that maintains and arguably increases its strength when applied to electronic texts. Full-text searching can help find specific text known to exist, but it is at best a marginally effective way to explore a body of information. 6 A good index not only provides direction to implicit meaning in the text, but it reveals to the user what may be found in the work. In print volumes it is often used to that end as a browsing aid; online, where the scope and depth of a work is harder to judge, such an aid is that much more valuable. The full index—or that for any selected group of documents—can be assembled on the fly for browsing and access. 7 Moreover, index entries can appear on the screen with documents or editorial text, and so serve to link to related material.
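Assembling an index on the fly for a selected group of documents amounts to filtering and regrouping stored index entries. The sketch below assumes, hypothetically, that each entry is kept as a (heading, document ID) pair; the headings, document IDs, and function name are invented and say nothing about how the project actually stores its index.

```python
from collections import defaultdict

# Invented index entries: (heading, document ID).
INDEX_ENTRIES = [
    ("Telegraphy", "DOC001"),
    ("Telegraphy", "DOC003"),
    ("Patents", "DOC002"),
    ("Patents", "DOC003"),
    ("Stock tickers", "DOC002"),
]


def build_index(entries, selected=None):
    """Group document IDs under headings, optionally for a subset only.

    selected: a set of document IDs to include, or None for the full index.
    Headings with no documents in the selection are omitted.
    """
    grouped = defaultdict(set)
    for heading, doc_id in entries:
        if selected is None or doc_id in selected:
            grouped[heading].add(doc_id)
    return {heading: sorted(ids) for heading, ids in sorted(grouped.items())}


full_index = build_index(INDEX_ENTRIES)
partial_index = build_index(INDEX_ENTRIES, selected={"DOC002"})

print(full_index)
print(partial_index)
```

The same stored entries serve both purposes described here: a browsable index for any selection, and a list of headings that can be shown beside a document as links to related material.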
The other complex editorial decision involved references and is still very much in process. In an edition where only a small percentage of the documents are transcribed, annotations often direct the reader to the images of other documents that are not transcribed. In the books, these take the form of lists of documents on the microfilm, identifications accompanied by reel and frame numbers. For correspondence and many other types of documents, the references can be translated into direct links. However, references to material in notebooks are rarely coordinate with the way notebooks are divided into documents. 8 Moreover, Edison and his crew recorded their work in whatever notebook lay to hand, and so the background information for a technical document that represents a week's work is as often as not collected from more than one source. What in the book are haphazard strings of frame numbers must become a new type of online document, an artifice that allows the user to see the relevant notebook pages as the assemblage we intend. Just as we made notebooks, account books, and scrapbooks browsable by creating a new data table, these compound references will be a creation of the database. Like the tagging, this will be half-automated and half-handwork; with the tagging—and the scanning, the data entry, the myriad editorial decisions, and the work that follows on them—it is part of a foundation for an edition we couldn't imagine when the project started twenty-five years ago, one that combines the known strengths of microfilm and books with the remarkable power and access of the digital world.