Forcing Standardization in VARD Part 1
Optimizing VARD for the early modern drama corpus required “forcing” lexical changes to create higher levels of standardization in the dataset. Jonathan Hope gave me editorial principles to follow as we considered what words/patterns VARD should change that it wasn’t. We wanted to standardize prepositions, expand elisions, and preserve verb endings. Unfortunately, preserving Early Modern verb endings (-st, –th) would require an overhaul of VARD’s dictionary.
There were three routes I followed to force standardization: manually selecting variants over others to change confidence scores; marking non-variants as variants and inputting their standardized form; adding words to the dictionary.
For the early modern drama corpus, the VEP team identified two grammatical features for forced standardization. We decided to implement consistent spelling for pronouns, adverbs, and prepositions; and expanding elisions that would interfere with algorithmic analysis, like topic modeling. Granted, more could have been changed, but we erred on the side of caution to see how effective the changes would be overall.
After documenting forced changes, I will discuss their implications for the dataset, which will come in the next entry
RULES TO FORCE ELISION EXPANSION (read more here)
t’ to_ Start
th’ the_ Start
PRONOUNS AND CONTRACTIONS
hee > he
hir > her
ide > I’d
ile > I’ll
i’le > I’ll
she’s > she’s
shees > she’s
* wee > we
ADVERB CONTRACTIONS
heeres > here’s
heere’s > here’s
theres > there’s
ther’s > there’s
wheres > where’s
where’s > where’s
ADVERBS/PREPOSITIONS
aboue > above
ne’er > never
ne’re > never
nev’r > never
o’er > over
oe’r > over
ope > open
op’n > open
WORDS ADDED TO DICTIONARY: Cupid, Damon, Leander, Mathias, nunc, Paul’s, Piso, qui, quod, tis, twas, twere, twould
MARKED AS VARIANTS FOR CORRECTION: greene > green, lockes > locks, vs > us, wilde > wild
* I will discuss the implications of our decision for wee in the next entry.