Shakespeare’s Words and NotWords
An experiment we often talk about is to see which words are “unique” to Shakespeare (or some other author, or group of books), and which words are conspicuously missing (i.e. words he never uses even though they are very common in the rest of the corpus).
I think this is more an exercise to test our data (especially the standardization) than it is to shed new light on the corpus, but it is still worthwhile.
I start with a corpus (in this case, it’s the “all drama” or 1292 set). I pick a subset (in this case “the 38 plays that have Shakespeare as an author”). I then ask, for each word, “what percentage of its occurrences is in the subset?” If it’s 100%, then every occurrence of the word is in the subset (in this case, Shakespeare was the only author to use the word); if it’s zero, only texts not in the subset use the word (in this case, Shakespeare never used the word).
I then sort the list by the number of documents each word appears in.
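The steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: the function name `subset_shares`, the document ids, and the toy token lists are all invented, and it assumes each document has already been tokenized and standardized.

```python
from collections import Counter

def subset_shares(docs, subset_ids):
    """docs: dict mapping a document id to its list of tokens.
    subset_ids: set of ids in the subset (e.g. the Shakespeare plays).
    Returns (share_in_subset, n_docs, n_occurrences, word) tuples."""
    total = Counter()      # occurrences of each word in the whole corpus
    in_subset = Counter()  # occurrences inside the subset only
    doc_freq = Counter()   # number of documents containing each word
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total.update(counts)
        doc_freq.update(counts.keys())
        if doc_id in subset_ids:
            in_subset.update(counts)
    return [(in_subset[w] / total[w], doc_freq[w], total[w], w) for w in total]

# Toy data, invented for illustration only.
docs = {
    "play_a": ["the", "sequent", "the"],
    "play_b": ["the", "however", "invite"],
    "play_c": ["however", "the"],
}
rows = subset_shares(docs, {"play_a"})

# Words the subset never uses, most widespread first.
never_used = sorted((r for r in rows if r[0] == 0.0), reverse=True)
# Words only the subset uses.
only_subset = sorted((r for r in rows if r[0] == 1.0), reverse=True)
```

Putting the share first and the document count second in each tuple means a plain reverse sort orders words by share, then by how many documents they appear in, which matches the tuple layout of the lists in this post.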
This can be really useful for spotting errors – when we tried this with a draft version of the standardization (using VARD), it said that “john” was the most common word Shakespeare never used – something even I knew had to be wrong.
Here are the most common words (in terms of the number of plays they occur in) that Shakespeare does not use:
[(0.0, 461, 1214, 'designed'),
(0.0, 384, 645, 'invite'),
(0.0, 370, 1055, 'however'),
(0.0, 352, 840, 'oblige'),
(0.0, 304, 618, 'sufferings'),
(0.0, 277, 534, 'ills'),
(0.0, 276, 579, 'whatever'),
(0.0, 264, 423, 'various'),
(0.0, 252, 377, 'secured'),
(0.0, 239, 391, 'contrive'),
(0.0, 234, 343, 'numerous'),
(0.0, 232, 2028, 'mrs'),
(0.0, 229, 324, 'conveyed'),
(0.0, 225, 344, 'heretofore'),
(0.0, 225, 283, 'delude'),
(0.0, 224, 599, 'coxcomb'),
(0.0, 214, 290, 'propitious'),
(0.0, 213, 453, 'matrimony'),
(0.0, 211, 314, 'declared'),
(0.0, 203, 272, 'brighter')]
The way to read this: the word “designed” appeared in 461 documents in the corpus and 1214 times in total, and 0% of those occurrences were in the 38 plays that list Shakespeare as an author.
I have a hard time believing that Shakespeare never used some of these words. Could these be symptoms of a standardization issue?
Here are the 20 most common words that Shakespeare used and no one else did:
[(1.0, 7, 8, 'sequent'),
(1.0, 5, 11, 'prayres'),
(1.0, 5, 11, "didd'st"),
(1.0, 4, 29, 'bardolfe'),
(1.0, 4, 24, 'glouster'),
(1.0, 4, 10, 'come-on'),
(1.0, 4, 8, 'porpentine'),
(1.0, 4, 5, 'leaue-taking'),
(1.0, 4, 4, 'winnowed'),
(1.0, 4, 4, 'vnfirme'),
(1.0, 4, 4, 'out-stretcht'),
(1.0, 4, 4, 'non-pareill'),
(1.0, 3, 22, 'rosaline'),
(1.0, 3, 12, 'thisbie'),
(1.0, 3, 12, 'glousters'),
(1.0, 3, 9, 'faulconbridge'),
(1.0, 3, 6, 'falstaffes'),
(1.0, 3, 5, 'wildenesse'),
(1.0, 3, 4, 'bawcock'),
(1.0, 3, 4, 'acorne')]
If we go deeper into the list (things that occur in only 1 book) we start to see things like character names.
In this list, we definitely see some standardization issues. For example, “acorne” is not standardized to “acorn” because it wasn’t a common enough word to be considered when we made the standardizer dictionary. Leaue-taking, prayres, vnfirme are also obvious examples of words that would be standardized by an encyclopedic standardizer.
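To see why rare spellings slip through, here is a toy sketch of a dictionary-based standardizer. The mapping `STANDARD_FORMS` and the function `standardize` are hypothetical stand-ins; the actual dictionary and its entries are not shown in this post.

```python
# Hypothetical variant -> standard mapping, built only from common words.
STANDARD_FORMS = {"loue": "love", "haue": "have", "giue": "give"}

def standardize(token, table=STANDARD_FORMS):
    """Return the standard form if the token is in the table, else leave it alone."""
    return table.get(token, token)

tokens = ["loue", "acorne", "vnfirme"]
print([standardize(t) for t in tokens])  # ['love', 'acorne', 'vnfirme']
```

Any variant that was too rare to make it into the table passes through unchanged, so it ends up looking like a word that only one author ever used.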
Other parts of the list can also be interesting. Here is the bottom of the list of words Shakespeare did use (the words with the lowest nonzero percentages):
[(0.0016168148746968471, 419, 3711, "don't"),
(0.0015408320493066256, 179, 1298, 'gad'),
(0.0014005602240896359, 359, 714, 'treat'),
(0.0013192612137203166, 339, 758, 'concerned'),
(0.0012300123001230013, 45, 813, 'ego'),
(0.0011534025374855825, 321, 867, 'st'),
(0.0010834236186348862, 418, 923, 'fain'),
(0.00099601593625498006, 376, 1004, 'obliged'),
(0.00072358900144717795, 83, 1382, 'j'),
(0.00051334702258726901, 706, 3896, 'its')]
Shakespeare represents 3.9% of the corpus (in terms of word count – it’s about 3% in terms of number of documents). So a share below 3.9% means he uses a word less than average. He uses “its” and “don’t” a lot less than average. These could be a standardization or a typesetting thing.
In contrast, he accounts for 4.2% of the occurrences of “it” and 4.1% of the occurrences of “the”, which is about what you’d expect.
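One way to make the comparison with the 3.9% baseline explicit is to divide each word's share by it. The helper name `relative_use` is invented for this sketch; the shares are the figures quoted in this post.

```python
def relative_use(share, baseline=0.039):
    """Ratio of a word's share in the subset to the subset's share of the corpus.
    1.0 means average use; below 1.0 means the subset under-uses the word."""
    return share / baseline

shares = [
    ("its", 0.00051334702258726901),
    ("don't", 0.0016168148746968471),
    ("it", 0.042),
    ("the", 0.041),
]
for word, share in shares:
    print(f"{word!r}: {relative_use(share):.2f}x the baseline")
```

On these figures “its” comes out at roughly one hundredth of the baseline rate, while “it” and “the” land just above 1.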
Again, I don’t think this tells us much interesting about Shakespeare. I think it is a useful tool for appreciating the limits of standardization. And I think that this kind of analysis – when applied in a slightly more sophisticated fashion – might turn up interesting things. But that’s the next hack…