documentation

Forcing Standardization in VARD Part 1

Optimizing VARD for the early modern drama corpus required “forcing” lexical changes to create higher levels of standardization in the dataset. Jonathan Hope gave me editorial principles to follow as we considered what words/patterns VARD should change that it wasn’t. We wanted to standardize prepositions, expand elisions, and preserve verb endings. Unfortunately, preserving Early Modern verb endings (-st, –th) would require an overhaul of VARD’s dictionary. There were three routes I followed to force standardization: manually selecting variants over others to change confidence scores; marking non-variants as variants and inputting their standardized form; adding words to the dictionary. Read more…

Forcing Standardization in VARD Part 2

The final aspect of standardization I will discuss will be common early modern spellings forced to modern equivalents, decisions where the payoff of consistency outweighs slight data loss. The VEP team decided to force bee > be, doe > do, and wee > we. Naturally one can see the problems inherent to these forced standardizations. Bee in early modern spelling can stand for the insect as well as the verb. Similarly with doe, it can signify a deer or a verb. Read more…

Tweaking VARD: Aggressive Rules for Early Modern English Morphemes and Elisions

Since I have discussed how VARD behaves with character encoding and symbols, I will devote space to explaining how I tweaked VARD to standardize Jonathan Hope’s early modern drama corpus. Given the size of Hope’s corpus, it required automating the process of comparing VARD’s output to the original play files. Erin Winter wrote a case-sensitive python script that generated a CSV recording all of VARD’s changes and their frequencies. I compared the original words to VARD’s normalizations, looking at only the highest frequencies. Read more…