The most important point to be made about any digital documentary edition is that the editors' fundamental intellectual work is unchanged. Editors must devote the profession's characteristic, meticulous attention to selection, transcription, and annotation if the resulting electronic publication is to deserve the respect given to modern microfilm and print publications. At the same time, it is abundantly clear that a digital edition presents opportunities well beyond the possibilities of film and paper. In the case of the Edison Papers, both the microfilm and print editions were well under way, and the electronic edition was seen as a means to combine and extend the work done without significantly altering the established editorial principles.
There are a number of principal considerations facing anyone
planning to create an electronic edition of historical documents. If
the documents are presented as images, the primary concerns will be
construction of a database and creation of those images; if the
documents are all transcribed, then preparation and presentation of
the text will be foremost. The Edison Papers is working to combine
images and text, and I hope that a careful examination of some avenues
explored and lessons learned in that process will be helpful to anyone
fortunate enough and bold enough to undertake such a
task. (Although I use "we" for convenience throughout this chapter,
the Edison Papers was organized in 1978-9 and I did not join until
1983. I left the project in 2002.)
The Edison Papers is unusual in several respects, the first and most striking being the size of the archive from which it draws. When the project was launched at the end of the 1970s, it was estimated that the collection at the Edison National Historic Site comprised about 1.5 million pages. Within a few years that estimate had grown to approximately five million pages, which made the tens of thousands of Edison pages in other archives, libraries, and repositories seem easily manageable. It was clear that the two editions projected at that time—microfilm and print—would be selective, with the microfilm including about ten percent of the holdings, and the printed volumes including perhaps two or three percent of that ten percent. A second unusual aspect of the Edison corpus is the central importance of drawings and even physical artifacts to an understanding of its subject's work, which is a direct consequence of Edison's being an inventor, fresh territory for documentary editing. Nevertheless, despite these differences and others more subtle, we decided at the outset to hew as closely as possible to the standard practices of documentary editing.
The original plan was to have the microfilm proceed ahead of the print edition, since the microfilm could capture years of Edison's life while the book editors were dissecting it one day at a time. This worked admirably, even if at first it had the book editors champing at the bit, teasing insight out of reams of photocopies. Organizing, comprehending, selecting, and filming the documents was (and remains) a truly Herculean task. When the first part of the collection was published on film in 1985, the book editors had Edison's early work at their fingertips. At the same time, two crucial pieces of the foundation for the electronic edition had been unwittingly put in place. First was the structure of the descriptive data recorded about each document; second was the high quality of the microfilm itself, which would later allow the creation of excellent digital images.
It had been impressed on the project organizers that the only way to control a collection of this size was with an electronic database. Fortunately, the Joseph Henry Papers had already started blazing a trail into that mysterious territory. Using that experience and knowledge as a foundation, the Edison Papers created a database that would prove two decades later to be the heart of their electronic edition. At its inception, the database served two functions: it was the raw material from which was created the detailed, item-level printed index that accompanied the microfilm, and it also contained information about the organization and contents of the documents which was used for in-house research.
The first incarnation of the database lived on a university
mainframe and was written by a hired programmer in Fortran 77. The
main table had 24 data fields, among them:
[...] (A for accounts, PN for pocket notebooks, etc.)
[...] (7204A would indicate the first document in the fourth folder for 1872)
[...] (01 = Accounts, 33 = Test Reports, 79 = Interviews, etc.)
[...] May 1875.
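As a rough illustration of how a coded value such as 7204A might be unpacked, the sketch below assumes, from the single example given, that the code is a two-digit year, a two-digit folder number, and a letter giving the document's position in the folder. The function name, the fixed century, and the treatment of anything outside that one pattern are assumptions, not the project's actual program.

```python
import re
from typing import NamedTuple


class FolderCode(NamedTuple):
    year: int      # four-digit year
    folder: int    # folder number within that year
    position: int  # 1 for "A", 2 for "B", and so on


# Assumed layout: two-digit year, two-digit folder, one capital letter.
_PATTERN = re.compile(r"(\d{2})(\d{2})([A-Z])")


def parse_folder_code(code: str, century: int = 1800) -> FolderCode:
    """Split a code such as '7204A' into year, folder, and position."""
    match = _PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized code: {code!r}")
    yy, folder, letter = match.groups()
    position = ord(letter) - ord("A") + 1
    return FolderCode(century + int(yy), int(folder), position)
```

Keeping the parts as separate values, rather than leaving them packed in one string, is what makes sorting and searching by year or folder straightforward.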
As a database structure, this is far from what would now be
considered optimal. However, it did contain almost all the
information needed for searching and retrieval. Over time, the
database migrated to three new programs: first to a different
mainframe program, then to a desktop PC, and finally to a newer
desktop program. Once on the desktop the number of name mentions and
subjects was increased to 16 each, the Group and Location fields were
merged into a single Document ID, and a field was added that held a
code for the folder or volume containing the document. With the most
recent migration (to Microsoft Access) the data was normalized,
which means it was broken up into a larger
number of interlinked tables, most of which contain only a few
fields. This makes the data easier to manipulate; it also means there
is no limit to the number of names or subjects recorded for a document.
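A minimal sketch of what normalization buys, using Python's built-in sqlite3 module as a stand-in for Microsoft Access. The table names, column names, and helper functions are hypothetical; they are meant only to show how a separate link table removes the fixed ceiling of 16 name mentions per document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (doc_id TEXT PRIMARY KEY, doc_date TEXT);
    CREATE TABLE names (name_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- Link table: one row per name mention, so a document can have any number.
    CREATE TABLE document_names (
        doc_id  TEXT    REFERENCES documents(doc_id),
        name_id INTEGER REFERENCES names(name_id),
        PRIMARY KEY (doc_id, name_id)
    );
""")


def add_mention(doc_id: str, name: str) -> None:
    """Record that a name is mentioned in a document."""
    conn.execute("INSERT OR IGNORE INTO names (name) VALUES (?)", (name,))
    (name_id,) = conn.execute(
        "SELECT name_id FROM names WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO document_names VALUES (?, ?)", (doc_id, name_id)
    )


def names_in(doc_id: str) -> list[str]:
    """Return every name linked to a document, alphabetically."""
    rows = conn.execute(
        "SELECT n.name FROM names n JOIN document_names dn USING (name_id) "
        "WHERE dn.doc_id = ? ORDER BY n.name",
        (doc_id,),
    )
    return [name for (name,) in rows]
```

Each of the three tables has only a few fields, and a name is stored once however many documents mention it.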
This is not meant to be simply a long and inordinately detailed technical discussion. Any digital documentary edition that provides images of the original documents, with or without transcriptions, can be no better than its database. At the very least the database must have dates and authors' and recipients' names. It should also have information about the organization of the edition. One of the great strengths of a microfilm edition is that once a user finds the first page of a letter, an account book, or a legal proceeding, the successive pages will usually appear on successive frames. Moreover, documents in a given editorial grouping usually appear together on the film. Digital images, however, must be ordered for the user, and beginnings and ends of documents must be flagged somehow. We will return to the issue of the database after discussing images.
In the mid-1990s, after the first graphical browsers had awakened
everybody to the Web's potential, a foundation program officer was
arguing against the work of entering documents' information into the
Edison Papers' database. "You don't have to do all that
indexing," he said. "Just scan the documents and put them
on the Internet!" Not only was he wrong about the indexing; he
was wrong about the "just scan."
At the time of that
conversation, the Edison Papers had published 162 reels of microfilm,
each reel averaging slightly more than 1,000 images. The market for
microfilm scanning was largely driven by institutions such as banks
and insurance companies, which had huge collections on microfilm and
which wanted greater access to those images. They were not interested
in subtlety, but were content with black-and-white (1-bit) images.
Such images can suffice if a typed or printed document is scanned at a
sufficiently high spatial resolution (300-600 dpi [dots/pixels per
inch]), but most of the documents in the Edison Papers microfilm
edition were handwritten, many of them in pencil or light pen, and many
on paper that had darkened in the century since they were written.
There was no doubt that the documents would have to be scanned as
eight-bit images (256 shades of gray), a capability the scanner
manufacturers were just beginning to explore. Increasing the bit depth
of the images allowed us to scan them at a relatively modest
resolution of 200 dpi.
After some misadventure, we settled on a vendor. We soon found that the scanning produced better images when we used negative film, as the amount of light that came through positive film tended to overload the sensors and wash out fainter lines. Besides straightforward quality control issues of light or dark images, we found that the scanner occasionally trapped dust particles between the sensors and the film, creating a streak across dozens or even hundreds of images. Those problems aside, we were pleasantly surprised by the quality of the images. Because the documents were recorded on high-contrast microfilm, we expected little in the way of fine distinction, but in fact the images often revealed details that were nearly or actually impossible to see using a microfilm reader.
The original time estimate for the job was six months. There were some kinks to be worked out in the technology, and it took about two years. When it was done, we had nearly 1,500 CDs holding a terabyte of data. The images, captured as uncompressed TIFF files, averaged around 6 MB each, and before we could deliver them over the Internet we would have to create smaller versions. Even more important, we had to somehow link the images to the appropriate document information, so that when a user called up a particular document the correct images would appear.
The creation of derivative images, like much editorial work, is repetitive but not routine and requires a finicky intelligence. We did not have sufficient storage space to put the full-size images online as a viewing option, so the user was going to get one derivative image and it had to be legible. We aimed at reducing the spatial resolution to 80 dpi, which is about life-size on most computer screens, and using JPEG compression to wind up with an image that was about one percent of the size of the original (an average of 60 KB). Because the microfilm edition groups documents by subject or type, the images could often be batch-processed and then reviewed for quality. We did not hesitate to lighten or darken an image if the alteration made it easier to read. (Some editors initially winced at this, reflecting a general uneasiness about manipulating digital images, but it is philosophically no different from changing the lighting when microfilming in order to enhance the contrast of a document; an illegible document is of little use.) Although most of the film was shot at a constant 14:1 reduction, some unusually large or small documents were filmed at other ratios. We tried to scan all the images at the equivalent of 200 dpi on the original. However, in the case of agate-type newspaper clippings and other documents with fine detail, we increased the spatial resolution of the online images to make them legible for the user.
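The figures quoted above hang together, as a little arithmetic shows. The sketch below uses only numbers given in the text (162 reels, roughly 1,000 images per reel, 6 MB masters, 200 dpi scans, 80 dpi derivatives, a one percent target); the variable names and the use of decimal units are assumptions made for the illustration.

```python
# Figures quoted in the text.
REELS = 162
IMAGES_PER_REEL = 1_000   # "slightly more than 1,000", rounded down
MASTER_MB = 6             # average uncompressed TIFF
SCAN_DPI = 200
SCREEN_DPI = 80

total_images = REELS * IMAGES_PER_REEL
total_tb = total_images * MASTER_MB / 1_000_000   # decimal MB to TB

# Resampling from 200 to 80 dpi keeps (80/200) squared of the pixels.
pixel_fraction = (SCREEN_DPI / SCAN_DPI) ** 2
resampled_kb = MASTER_MB * 1_000 * pixel_fraction

# One percent of the master is the stated target for the derivative.
target_kb = MASTER_MB * 1_000 * 0.01

# Compression that JPEG must still supply after resampling.
jpeg_ratio = resampled_kb / target_kb
```

On these assumptions the masters come to just under a terabyte, resampling alone brings an average image down to a little under 1 MB, and JPEG compression has to account for the remaining factor of about sixteen.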
Linking images to their document information was a painstaking process. There is no simple one-to-one correspondence between images and their information. A document might contain one or more attachments or enclosures, for example, in which case a user who retrieves the covering document will want to see the enclosures as part of the document, while at the same time the enclosures might be recorded as separate documents themselves. That is, the same image may be linked to more than one document. With the help of an outside programmer (and a 21-inch screen), we created an interface that displayed successive digitized images on one side and database information on the other. Using the microfilm frame numbers in the database, the program would calculate the number of images in a document. Most of the time the calculation was right, but when it was wrong the operator—working with the digital images, the database, and the microfilm—could easily override it.
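The linking logic can be sketched roughly as follows. The sketch assumes that each database record stores the frame on which its document begins and that the next record on the reel begins where the previous one ends; the function names and the dictionary used for the links are hypothetical, not the interface the outside programmer actually built.

```python
from typing import Optional


def image_count(first_frame: int, next_first_frame: int,
                override: Optional[int] = None) -> int:
    """Estimate how many images belong to a document.

    The estimate is the distance to the next document's first frame.
    A count supplied by the operator always takes precedence.
    """
    if override is not None:
        return override
    if next_first_frame <= first_frame:
        raise ValueError("next document must start on a later frame")
    return next_first_frame - first_frame


def link_images(doc_id: str, first_frame: int, count: int,
                links: dict[int, set[str]]) -> None:
    """Attach frames to a document; a frame may already belong to another."""
    for frame in range(first_frame, first_frame + count):
        links.setdefault(frame, set()).add(doc_id)
```

Storing a set of document identifiers per frame, rather than a single identifier, is what lets an enclosure's images appear both under the covering letter and under the enclosure's own record.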
The result is an online image edition that allows the user to sample or assemble the documents in a number of ways—name, date, document type, editorial organization—and to view as a group documents scattered across many reels of microfilm. At the same time, there are certain characteristics of microfilm that are useful to preserve in an electronic edition. A user landing on a page in the middle of an experimental notebook, account book, or scrapbook is likely to want to browse forward and backward through the entire item. This problem was solved by the creation of a new data table, but the solution was only possible because the structural information identifying the collection of individual documents as a unit was already present in the database.
As might be expected, the other side of the edition—creating live, linked text from the transcribed documents and their editorial apparatus—presented its own set of issues. Again, the electronic text and apparatus, if not identical to those in the published volumes, reflected the same principles of selection, transcription, and annotation. There were both theoretical and practical considerations in the creation of the digital text for the Edison Papers, as is always the case, but the practical issues predominated.
At the start of the project we had chosen a "conservative
expanded approach [to transcription] that does not try to 'clean up
the text' ... to strike a balance between the needs of the
scholar for details of editorial emendation, the requirements of all
users for readability, and the desire of the editors that all readers
obtain a feel and flavor for Edison, his associates, and their
era" (Jenkins lv-vi). Because of the nature of the documents,
establishment of an authoritative text—in the sense of
choosing between alternative readings—was rarely an issue.
Only a tiny percentage of the documents (such as contracts, letters to
the editor, or patent specifications) were written or printed more
than once.
Traditional text editing conventions generally proved a comfortable
fit. Where they did not, we tried to keep our improvisations as close
as we could to the spirit of traditional models. For example, we used
traditional abbreviations to describe the documents: "A" for
autograph, meaning the document was in the hand of the author; "L"
for letter; "D" for document; "S" for signed; and so on. However,
there was no existing symbol for an
artifact, nor was there one for technical notes or notebook entries,
both of which we had in abundance. So we created new ones: "M"
(Model) for physical objects, and "X" for
technical materials. Almost immediately, we realized that notebook
entries and similar materials presented a significant new
entanglement. From quite early in his career Edison had co-workers
beside him at the bench, people who helped him carry out his research
plans. Sometimes one of them would work and another would take notes;
sometimes the experimenter would make his own notes; sometimes a group
of them would work together with one keeping notes; sometimes the
researcher was more or less autonomous; sometimes they were pursuing a
line of thought that Edison had assigned. What stumped us was the
question of authorship raised by such documents. Was the author the
person who did the work? Recorded the work? Had the idea for the
work that was carried out? Even if those puzzles had answers, most of
the time we couldn't assign those roles with much surety. Finally we
simply cut the Gordian knot and declared that documents of type "X"
had indeterminate authors, even when Edison wrote
them. This turned out to reflect the way work was pursued as well as
the way many of Edison's co-workers felt about the work. They realized
that they were active participants, but they also recognized that when
Edison was not in the laboratory work slowed after a couple of days
and that in fact the work would not have existed without Edison to
drive it.
Although the quantity of Edison's drawings and their importance is
very unusual for documentary editions, with only a few published
scientific documents as distant prototypes, there is one documentary
category in which the Edison Papers pioneered and which so far remains
unique to the project—technological artifacts (Rosenberg). Edison was, after all, an inventor, and the things he
created were the core of his work. In the mid-1980s, when we
confronted the problem posed by physical objects, we considered
several options for presentation, even exploring videodisc (and, in a
lighter moment, paper pop-up constructions). Finally we decided to
give each artifact an annotated introductory headnote, presenting the
object as a photograph if we had one (preferably our own, if we
had access to the object, or a historical image if we didn't)
or, failing that, a historical sketch, patent drawing, or other
representation. Dating was a challenge (we settled on design as
comparable to textual composition), as was the slippery analogy
of transcription for both photographs and drawings. We have found the
system satisfactory for print, and the edition's users have seemed to
agree. The electronic edition will offer similar presentation of
photographs and drawings. For those instances where an artifact still
exists, though—stock tickers, electric lighting, and motion
pictures, for example—the internet's potential for displaying
sound and motion opens fascinating possibilities for annotation,
affording the user detail, depth, and understanding simply not
possible with static images and text. As Edison's designs stretch the
notion of
It was planned from the beginning of the electronic edition that the
text of the print volumes would be included, marked up with SGML
(later XML) to take full advantage of the capabilities of live digital
text. But before doing any markup, the text itself had to be
established, letter-for-letter as accurate as in print. This was not
a problem for volumes three and four, because all the corrections,
alterations, and additions to the documents and editorial material in
those volumes made after their submission to the press—through
galleys and page proofs—had been entered into the electronic
files of the documents. For the first two volumes, however, edited
and published before the electronic edition was on the horizon, the
electronic files were uncorrected. Neither scanning with optical
character recognition (OCR) nor simply retyping the published text was
accurate enough to avoid another full proofreading of the text.
However, those two processes combined offered an extremely accurate
text, since the computer and the typist did not make the same kinds of
mistakes. The scanned text was divided into individual documents, and
a second copy of the documents was typed in individually from the
volumes. The project secretary then used the "compare"
function of WordPerfect to find differences between the two versions
and create a corrected one, which she stored. Editorial material,
such as front matter, headnotes, and back matter, was treated the same
way.
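The same double-keying idea can be sketched with Python's standard difflib module standing in for WordPerfect's compare function. The function name and the choice to compare word by word are assumptions for illustration; the point is only that places where the scanned and typed versions disagree are the places a person needs to check.

```python
import difflib


def disagreements(ocr_text: str, typed_text: str) -> list[tuple[str, str]]:
    """Return (ocr, typed) word runs where the two versions differ."""
    ocr_words = ocr_text.split()
    typed_words = typed_text.split()
    matcher = difflib.SequenceMatcher(
        a=ocr_words, b=typed_words, autojunk=False
    )
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep both readings so a proofreader can choose between them.
            diffs.append(
                (" ".join(ocr_words[i1:i2]), " ".join(typed_words[j1:j2]))
            )
    return diffs
```

Because a scanner tends to misread letter shapes while a typist tends to transpose or drop characters, the two versions rarely err in the same place, so an empty result is strong evidence that a passage is correct.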
What was not clear at the outset was what DTD would be used (even
less clear was what software would be used to deliver the text on
screen).
We found, as everyone who attempts
markup does, that tagging the text is a painstaking process.
Some of the tagging we recommend can be automated in very simple ways, e.g. by simple macros in a word processor, or in some cases even by simple global changes. Other tagging we recommend can be automated successfully only by a skilled programmer. Some things fall between the two extremes, and can be performed by an astute editor or a journeyman programmer. Some kinds of tagging cannot be fully automated, even by expert programmers, but an automatic process can propose tagging for a human editor to accept or reject, in much the same way that a selective global change in a word processor allows the user to decide whether or not to make the change, on a case by case basis. And, finally, some tagging is done most simply by hand.
Automated and semi-automated tagging can substantially reduce the cost of tagging an edition, but failed attempts to automate what cannot be automated can consume alarming amounts of time, patience, and money. The art and challenge of managing the creation of an electronic edition using limited resources lies, in no small part, in automating what can be automated, doing manually what must be done manually, and deciding (perhaps with a sigh) to leave untagged what cannot be tagged automatically and is not essential to the edition. It will not always be easy to decide where to class a particular kind of tagging: some kinds of information require manual tagging in some collections of documents, but can be tagged automatically or semi-automatically in others. Right judgment will depend on the body of materials being edited, on the time and resources at hand, and on the skills of the available programming assistance. (Markup Guidelines, Section 2.4)
Those paragraphs constitute guiding principles for editors. We had the advantage of long familiarity with our word processor, which allowed us to write fairly complex macros that greatly simplified much of the tagging. Nevertheless, the actual work required as much intelligence and care as any other in the project.
The structural tagging of the documents tended to be straightforward
(which did not always mean easy). Markup targeted the structure and
physical presentation of the text—date lines, paragraphs,
closings, signatures, damage—rather than such details as names
of people or places. Perhaps the most important structural decision
concerned images in the documents, what in the volumes we call
"art" (as opposed to "illustrations," which are images we
supply). These images are usually integral to the meaning of the
documents, especially in notebook entries, patent documents, and other
technical material. Occasionally a selected document has no words on
the page at all, and we have to insert bracketed letters next to
drawings as hooks from which to hang annotation. Often enough the
placement of text relative to drawings is meaningful, and there are
times when no transcription can fully capture that relationship. On
paper we tackle these issues with careful explication, thoughtful
design, and sometimes inclusion of text in reproduced art; still,
there is sometimes meaning lost in a patchwork of text and excised
drawings. But in the electronic edition, with the original document
instantly available, the transcribed representation of such documents
no longer carries the full burden of interpreting the document to the
user. It becomes an aid to reading, and the design of individual
documents —still a nightmare of fluidity on screen—
becomes a secondary consideration. With the text transcribed,
<figure> elements that contain the ID of their respective page
image are inserted where images exist on the page. Those elements will
appear on screen as icons that, when clicked, will call up the entire
page image.
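A small sketch of how delivery software might pick out those elements, using Python's standard xml.etree.ElementTree. The attribute name "image", the identifier values, and the sample fragment are all invented for this illustration; the actual markup used by the edition is not specified here.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcription fragment; attribute name and IDs are invented.
SAMPLE = """
<doc>
  <p>First paragraph of transcription.</p>
  <figure image="IMG0001"/>
  <p>Second paragraph of transcription.</p>
  <figure image="IMG0002"/>
</doc>
""".strip()


def page_image_ids(xml_text: str, attr: str = "image") -> list[str]:
    """List the page-image IDs carried by <figure> elements, in text order."""
    root = ET.fromstring(xml_text)
    return [fig.get(attr) for fig in root.iter("figure") if fig.get(attr)]
```

Each identifier returned could then be rendered as a clickable icon at the point in the transcription where the drawing falls on the page.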
Embedding editorial material proved challenging. The first
significant decision concerned the index. The back-of-the-book index,
although created for use in a codex, is a sophisticated intellectual
tool that maintains and arguably increases its strength when applied
to electronic texts. Full-text searching can help find specific text
known to exist, but it is at best a marginally effective way to
explore a body of information.
We had final electronic versions of the volumes' indexes, and we used
a macro to convert the alphabetically ordered entries
Having the index prepared this way made it relatively
straightforward for taggers to place entries in the text as they
serially processed the documents. Afterwards a macro inserted an ID
attribute in the element for linking.
The other complex editorial decision involved references and is
still very much in process. In an edition where only a small
percentage of the documents are transcribed, annotations often direct
the reader to the images of other documents that are not
transcribed. In the books, these take the form of lists of documents
on the microfilm, identifications accompanied by reel and frame
numbers. For correspondence and many other types of documents, the
references can be translated into direct links. However, references
to material in notebooks are rarely coordinate with the way notebooks
are divided into documents.