VARD & Character Encoding

Everything is always already encoded.

The first time I used VARD, discussed in my previous entry, it was a shiny toy, one with which I wanted to automatically process batches of TCP TEI P4 XML files. That was in February of this year. Since then, interactions I have had with VARD underscore the need to understand how the tools I use work.

The public releases of ECCO-TCP, EEBO-TCP, and Evans-TCP texts is a boon scholars, who can use these texts as the basis for their scholarship. Those who wish to create digital editions and computationally analyze TCP texts may turn to programs like VARD to assist with crucial file pre-processing, like spelling standardization. Given the use of unsupervised or semi-supervised curation methods, those who work with TCP texts must be transparent about how they access, process, and analyze their data sets. The decisions involved in data curation impact interpretations that can be drawn from TCP texts. This transparency demands responsibility on the part of the scholar to know how their methods and tools manipulate the data within TCP XML files.

The purpose of this entry is twofold. It is to make transparent decisions made by the VEP team to process TCP XML files. More importantly, it is to highlight how VARD has interacted with TCP files, providing a resource for scholars and the curious working with them. VARD’s default behavior has serious implications for the spelling standardization of TCP XML files.

(For those who wish to used text extracted from TCP XML files, the following is just as relevant.)

If you use VARD on TCP XML files, you must be extra careful with non-ASCII characters and XML tags. Even ASCII symbols create problems. TCP XML files contain a plethora of symbols, traces of their transcription. The symbols not only capture textual elements, like foreign alphabets and diacritical marks, but transcribers’ experiences with texts. There are symbols to communicate illegible characters (•), ambiguous dot-like punctuation (▪), end-of-line hyphens (∣), even end-of-line hyphens inserted by editors (¦). In the TEI P4 versions of the XML files, many of the symbols are inserted right into the body of the text. In the TEI P5 versions, however, XML tags take their place. (For the interested, here is the TCP character entity list.)

TEI P4: gentle Rea∣der
TEI P5: gentle Rea<g ref="char:EOLhyphen"/>der

Why is it such a problem? VARD doesn’t handle non-ASCII symbols, and TCP transcription frequently includes them within words. When XML tags and symbols occur within a word, VARD processes the characters on each side of the tags/symbols separately.

VARDThe image to the left demonstrates how VARD would automatically process wor|thinesse in both P4 and P5. It leaves wor alone, but suggests chinese as a replacement for thinenesse. (The picture also demonstrates how TEI P4 and P5 are different beasts.)

Symbols and XML tags are ubiquitous in TEI P4 and P5 versions of TCP XML files. They are integral to recreating early modern texts, rife with diacritical marks and foreign alphabets. To drive the point home, any letters with accents and macrons will not be recognized in VARD’s modern English dictionary. VARD will simply read characters on both sides of letters with diacritical marks as different words. XML tags create similar situations, just as diverse as non-ASCII symbols. They contain information from decorative initials to notes in the margin and the number of illegible characters.

Before you process TCP texts with any tool, know the encoded contents of those texts. Be sure to check how processing alters the files. You might be surprised.

The next will entry will discuss how the VEP team prepares TCP XML files for VARD, leveraging ASCII symbols during character cleanup and XML extraction.