Source Files & their Challenges

We build our datasets from text files provided to the public by the Text Creation Partnership (TCP). For our purposes, we use its XML files encoded with TEI P4. These texts have their own complicated histories. Files from EEBO-TCP Phase I, EEBO-TCP Phase II, ECCO-TCP, and Evans-TCP were hand-keyed from digitized facsimiles.

Transmission History

The digitized facsimiles used for hand-keying come from several microfilm efforts. In 1938, University Microfilms International began microfilming Early English Books (EEB I), a cultural preservation project. The books selected for microfilming are those listed in A. W. Pollard and G. R. Redgrave’s Short-Title Catalogue, a bibliography of English-printed books from 1473 to 1640. The microfilm project Early English Books II (EEB II) includes books listed in Donald Wing’s Short-Title Catalogue, which continues Pollard and Redgrave’s efforts by cataloguing English-printed books from 1641 to 1700. The Eighteenth Century Microfilm Collection emerged in tandem with the Eighteenth-Century Short-Title Catalogue, an electronic finding aid that records the microfilm location of individual texts. It focused on items printed in English from 1701 to 1800. The Evans American Imprints Series is based on Charles Evans’s American Bibliography and Ralph R. Shaw and Richard H. Shoemaker’s American Bibliography.

TCP Files and Selection Criteria

The TCP’s goal is to provide standardized, XML-encoded electronic text editions of early printed books. To make these books accessible, it has partnered with institutions and with several publishing and information-content and technology companies, including ProQuest, Gale, and NewsBank. Its hand-keyed electronic editions make early books searchable, which matters because these books pose problems for machine text-encoding technologies such as optical character recognition (OCR). For an account of working with the textual features of early modern printed texts, see the discussion of project challenges below.

EEBO-TCP I | 25,368 | January 1, 2015
EEBO-TCP II | 28,466 | 5 years from completion date
ECCO-TCP | 2,473 | April 25, 2011
Evans-TCP | 4,977 | June 30, 2014

EEBO-TCP Phase I: Selected first editions and works listed in the New Cambridge Bibliography of English Literature (NCBEL).

EEBO-TCP Phase II: Aims to provide an edition of each unique work in EEBO.

ECCO-TCP: Focused heavily on works by authors whose works span the divide between the 17th and 18th centuries, to produce continuity with EEBO-TCP. It also includes works that didn’t respond well to OCR.

Evans-TCP: Most studied texts from the Evans bibliography, identified by the American Antiquarian Society (AAS).

Challenges of Early Modern Texts

The humanist scholar faces unique problems when automating the study of early modern texts. This section details the historical and technological complications VEP has faced while generating curated datasets of early modern texts from TCP source files and building visual analytic tools to explore topic models of corpora.

Early Modern Typographical Features

One cannot dispute the technological and cultural significance of the printing press. What is generally called the “early modern” period in English history coincides with the era of moveable type, from 1450 to 1800. During this time, print conventions emerged from those of the manuscript, after which print, manuscript, and codex features dynamically contributed to the printed books we are familiar with today.

Nonstandard Spelling

English language speakers and readers often begrudge the complexity of English orthography, or its standard spelling system. Frustration with English orthography results, in part, from sound changes that began in the mid-14th century. English underwent the Great Vowel Shift, a systematic change in the pronunciation of vowels that continued until the eighteenth century. The introduction of the printing press allowed written forms of the English language to reach a wider audience. As a result, spelling, which varied regionally, began to standardize in the 15th and 16th centuries. The orthographic system English writers and readers have inherited is rooted in Middle English spellings that grappled with the pronunciation changes of the Great Vowel Shift.

Spelling variety, such as that found in early modern texts, complicates the automated reading process. Pre-standardized spelling conventions defeat standard text analysis. While a human reader can recognize ivory, ivorie, and juorie as the same word, standard text analysis treats each spelling as a separate entity. Therefore, an individual who wants to study ivory’s connotations in Early Modern England would need to be savvy to typographical variants, like the interchangeability of i’s and j’s or of u’s and v’s, in order to search a corpus with nonstandard spellings.
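The problem can be sketched in a few lines of Python. The token list below is a toy example built from the spellings mentioned above, and variant_pattern is a hypothetical helper written for this illustration, not part of any VEP or TCP tool. It assumes only two rules (i/j and u/v interchange, plus a final -y written as -ie), which is far narrower than the real range of early modern variation.

```python
import re
from collections import Counter

# Toy token list; the spellings come from the paragraph above.
tokens = ["ivory", "ivorie", "juorie", "ivory", "gold"]

# Naive counting treats every distinct spelling as a distinct word.
counts = Counter(tokens)


def variant_pattern(word):
    """Build a regex tolerating i/j and u/v interchange and final -y/-ie.

    Hypothetical helper for illustration only.
    """
    classes = {"i": "[ij]", "j": "[ij]", "u": "[uv]", "v": "[uv]"}
    has_y = word.endswith("y")
    stem = word[:-1] if has_y else word
    body = "".join(classes.get(ch, re.escape(ch)) for ch in stem)
    ending = "(?:y|ie)" if has_y else ""
    return re.compile(rf"^{body}{ending}$")


pattern = variant_pattern("ivory")
# Collect every token that the variant-aware pattern accepts.
matches = [t for t in tokens if pattern.match(t)]

print(counts)   # three separate entries for what a reader sees as one word
print(matches)  # all spellings of "ivory" found by one search
```

A pattern like this trades precision for recall: it finds juorie, but on a large corpus it may also match unrelated words, so results still need a human check.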

Recent efforts like the CIC CLI Virtual Modernisation Project (CIC) have begun to address the difficulties of nonstandard spelling. CIC made texts in the Early English Books Online (EEBO) database easier to search by programming the database to handle spelling variants at search time for the user. For the scholar working directly with text files of early modern writing, however, nonstandard spellings remain a problem for textual analysis. To address the issue for our dataset, we used VARD, a Java program that modernizes Early Modern English spellings into Standard British English spellings. This modernization tool helps researchers improve the accuracy of text analysis of historical corpora.
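To make the idea of modernizing the text itself (rather than expanding a query at search time) concrete, here is a minimal lookup-table sketch. It is not how VARD works internally and does not use VARD's interface; the VARIANTS table and the modernize function are assumptions invented for this illustration, and a real tool relies on much richer rules and dictionaries.

```python
# Hypothetical variant table for illustration; not taken from VARD.
VARIANTS = {
    "ivorie": "ivory",
    "juorie": "ivory",
    "loue": "love",
    "vpon": "upon",
}


def modernize(tokens, table=VARIANTS):
    """Replace each known variant with its modern form; keep other tokens."""
    return [table.get(tok.lower(), tok) for tok in tokens]


sample = "Juorie and loue vpon the table".split()
modernized = modernize(sample)
print(modernized)
```

Note one design consequence: because the lookup is case-insensitive, a replaced token takes the casing stored in the table, so Juorie becomes ivory. Once the text is modernized, ordinary word counts group the variants together.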


The printed book emerged from ornate manuscript and manuscript codex traditions. Designers modeled type fonts after popular handwriting styles (gothic, chancery, italic, and secretary). The evolution of fonts and codex conventions poses challenges for automated reading. Below are several difficulties of dealing with early modern typography.

Font: The variety of fonts in early modern texts poses problems for optical character recognition and hand-keying. Different fonts signal varying registers of meaning and prestige. For example, blackletter is often found in printed legal documents. Texts also rely heavily on roman and italic type.

Special Characters: Early modern texts display a range of special characters and ornate symbols, which require decisions on how to include them in the text. Common examples are the pilcrow ( ¶ ) and the long s ( ſ ).

Letter Interchangeability: The English alphabet did not always contain 26 letters. Letters I, J, U, V, and W have a complex history.

The sound represented by the contemporary W needed a written form when monks began to write Anglo-Saxon phonetically with the roman alphabet. Since the roman alphabet did not contain a character for the sound, monks borrowed the Anglo-Saxon rune wynn ( ƿ ). After the French invaded England in 1066, the rune was replaced: French scribes used UU or VV instead.

I and J, like U and V, were distinguished as separate letters and sounds only late in the history of the alphabet. J and I were used interchangeably in writing and print, as were U and V. These letters began to be used separately during the 16th century, and U and V fully diverged during the 18th century.
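The decisions that special characters require can be expressed as an explicit mapping. The sketch below shows one possible policy, chosen only for illustration and not necessarily the one VEP or the TCP applies: the long s ( ſ ) is mapped to a modern s, and the pilcrow ( ¶ ) is dropped. The names SPECIAL and normalize_special are hypothetical.

```python
# Illustrative policy: long s (U+017F) becomes "s"; pilcrow (U+00B6) is removed.
SPECIAL = str.maketrans({"\u017f": "s", "\u00b6": None})


def normalize_special(text):
    """Apply the character policy, then collapse leftover whitespace."""
    return " ".join(text.translate(SPECIAL).split())


line = "\u00b6 The \u017fecond booke of ble\u017f\u017fings"
cleaned = normalize_special(line)
print(cleaned)
```

Keeping the policy in a single table makes each editorial decision visible and easy to change, for example to preserve the pilcrow as a paragraph marker instead of deleting it.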