VARD

Forcing Standardization in VARD Part 1

Optimizing VARD for the early modern drama corpus required “forcing” lexical changes to create higher levels of standardization in the dataset. Jonathan Hope gave me editorial principles to follow as we considered which words and patterns VARD should change but was not changing. We wanted to standardize prepositions, expand elisions, and preserve verb endings. Unfortunately, preserving Early Modern verb endings (-st, -th) would require an overhaul of VARD’s dictionary. I followed three routes to force standardization: manually selecting some variants over others to change confidence scores; marking non-variants as variants and inputting their standardized forms; and adding words to the dictionary. Read more…

Forcing Standardization in VARD Part 2

The final aspect of standardization I will discuss is the forcing of common early modern spellings to modern equivalents, decisions where the payoff of consistency outweighs slight data loss. The VEP team decided to force bee > be, doe > do, and wee > we. The problems inherent in these forced standardizations are easy to see. Bee in early modern spelling can stand for the insect as well as the verb. Similarly, doe can signify a deer or the verb. Read more…
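The trade-off can be illustrated with a toy sketch. This is not how VARD applies its rules internally; the `FORCED` map and the `force_standardize` function are hypothetical names used only to show why a context-free replacement loses the insect and deer senses.

```python
# Hypothetical forced-replacement map based on the three pairs named above.
FORCED = {"bee": "be", "doe": "do", "wee": "we"}


def force_standardize(tokens, forced=FORCED):
    """Replace every token found in the forced map; leave the rest untouched.

    The lookup is case-sensitive and ignores context, so it cannot tell
    the verb "bee" from the insect "bee".
    """
    return [forced.get(tok, tok) for tok in tokens]


verb_sense = "to bee or not to bee".split()
insect_sense = "the bee stings".split()

print(force_standardize(verb_sense))    # the verb is standardized as intended
print(force_standardize(insect_sense))  # the insect is also rewritten: data loss
```

In the second call the insect becomes "be", which is the slight data loss accepted in exchange for consistency.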

VARD Normalization Errors

Deidre Stuffer
in News
VARD decently standardizes Early Modern English. Sometimes, though, it makes questionable replacements.

ORIGINAL      NORMALIZATION   SHOULD BE
all’s         ell’s           all’s
caus’d        cause           caused
Cicilia       Cicely          Cicilia
courtesie     curtsy          courtesy
diuers        divers          diverse
hir           his             her
ile           isle            I’ll
ist           first           is’t
kild/killd    kilt            killed
maister       moister         master
maist         moist           mayst
*nunc         nuns            nunc
Pauls         Pals            Paul’s
*qui          queen           qui
shees         shoes           she’s / she is
weele         weal            we’ll / we will
where’s       whores          where’s / where is

Of course, you will want to check how your VARD installation handles these words. Read more…

Tweaking VARD: Aggressive Rules for Early Modern English Morphemes and Elisions

Since I have discussed how VARD behaves with character encoding and symbols, I will devote space to explaining how I tweaked VARD to standardize Jonathan Hope’s early modern drama corpus. The size of Hope’s corpus required automating the comparison of VARD’s output with the original play files. Erin Winter wrote a case-sensitive Python script that generated a CSV recording all of VARD’s changes and their frequencies. I compared the original words to VARD’s normalizations, looking only at the highest-frequency changes. Read more…
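The comparison described above might be sketched as follows. This is not Erin Winter’s script, which is not reproduced here; it is a minimal illustration that assumes the original and normalized texts have already been tokenized into lists that align one-to-one (an assumption that breaks when an elision is expanded into two words). The function names `count_changes` and `write_changes_csv` are hypothetical.

```python
import csv
import io
from collections import Counter


def count_changes(original_tokens, normalized_tokens):
    """Count (original, normalized) pairs that differ, case-sensitively."""
    if len(original_tokens) != len(normalized_tokens):
        # Assumption of this sketch: one normalized token per original token.
        raise ValueError("token lists must align one-to-one")
    return Counter(
        (orig, norm)
        for orig, norm in zip(original_tokens, normalized_tokens)
        if orig != norm
    )


def write_changes_csv(changes, out):
    """Write original, normalized, frequency rows, most frequent first."""
    writer = csv.writer(out)
    writer.writerow(["original", "normalized", "frequency"])
    for (orig, norm), freq in changes.most_common():
        writer.writerow([orig, norm, freq])


# Invented sample data, not taken from the corpus.
original = "Wee doe loue the bee and wee doe feare".split()
normalized = "We do love the be and we do fear".split()

changes = count_changes(original, normalized)
buffer = io.StringIO()
write_changes_csv(changes, buffer)
print(buffer.getvalue())
```

Because the comparison is case-sensitive, "Wee" > "We" and "wee" > "we" are counted as separate changes. Sorting by frequency puts the most common replacements at the top of the CSV, which is where a manual review would start.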

VARD & Character Encoding

Deidre Stuffer
in News
Everything is always already encoded. The first time I used VARD, discussed in my previous entry, it was a shiny toy, one with which I wanted to automatically process batches of TCP TEI P4 XML files. That was in February of this year. Since then, my interactions with VARD have underscored the need to understand how the tools I use work. The public releases of ECCO-TCP, EEBO-TCP, and Evans-TCP texts are a boon to scholars, who can use these texts as the basis for their scholarship. Read more…