Tweaking VARD: Aggressive Rules for Early Modern English Morphemes and Elisions
Since I have discussed how VARD behaves with character encoding and symbols, I will devote space to explaining how I tweaked VARD to standardize Jonathan Hope’s early modern drama corpus.
Given the size of Hope’s corpus, it required automating the process of comparing VARD’s output to the original play files. Erin Winter wrote a case-sensitive python script that generated a CSV recording all of VARD’s changes and their frequencies. I compared the original words to VARD’s normalizations, looking at only the highest frequencies. I looked at unique spellings changed within the frequency range of approximately 46,000 to 100 times, which amounted to nearly 3,000 cases. (There were approximately 58,000 unique spellings in the corpus changed 10 or fewer times.) To offer a glimpse, here are the 10 most frequent VARD normalizations for the early modern drama corpus:
ORIGINAL | NORMALIZED | FREQUENCY |
---|---|---|
haue | have | 45680 |
selfe | self | 18473 |
Ile | Isle | 16095 |
loue | love | 15666 |
thinke | think | 10450 |
mee | me | 10437 |
vpon | upon | 10287 |
owne | own | 10205 |
vp | up | 9704 |
’tis | it is | 9691 |
The CSV tracking normalizations proved a painless way to identify where VARD needed a gentle push in another direction. Note Ile in the above table. Yes, England is an island (of which writers were aware), but 16,095 changes to Isle seemed suspect. When I looked at files with VARD-inserted XML tags, it became obvious those _Ile_s should have been standardized to _I’ll_s. There, VARD was simply wrong. (I will devote the next post to where VARD goofs–sometimes amusingly–in standardization.)
By researching questionable corrections, I was able to formulate standardization rules more “aggressive” than which the program instantiates with. (You can locate the default rules in the file “rules.txt,” in VARD’s “training” folder.) These rules dictate modern letter substitutions for common early modern letter combinations. Examples of the rules are as follows:
CHARACTERS | CHANGE TO | LOCATION IN WORD |
---|---|---|
vv | w | Anywhere |
ie | y | Anywhere |
Given the above rules, when VARD processes the word alvvaies, the program may suggest multiple variants: alwaies and alvvays. This contributes to competing spellings for variations across documents standardized, which you can find proliferate when VARD handles early modern prepositions and adverbs, even words with hyphens (e.g.: ne’er, ne’re, nev’r normalize differently; should the hyphen be eliminated or maintained?).
My additions to “rules.txt” aided not only spelling standardization but expanding elisions. The rules mainly gave VARD an extra push to handle early modern English morphemes. While “rules.txt” contains the rule ie at the end of words can be changed to y, it didn’t have a rule to help with standardizing the common adverb ending lie. Here is a table of the rules I added:
CHARACTERS | CHANGE TO | LOCATION IN WORD |
---|---|---|
cyon | tion | End |
lie | ly | End |
shyp | ship | End |
t’ | *to_ | Start |
th’ | *the_ | Start |
tiue | tive | End |
vn | un | Start |
vs | us | Anywhere |
ynge | ing | End |
While not comprehensive, the rules definitely aided VARD’s efforts. Of course, entering rules is only one step of the process. For the rules you add to the dictionary, you must manually train VARD to implement them.
- A final word regarding the entries I made to expand the elisions t’ and th’ when they begin words. I typed an underscore (_) to reflect that there is a space after to and the in the rules. VARD will recognize spaces for rule input. In the GUI the rule will be displayed with an underscore; you do not not type the underscores in. The rules worked, and the program properly expanded words after some manual training. It changed th’ambassador to the ambassador, t’change to to change.