News

Shakespeare Annual Association Meeting 2016

Deidre Stuffer
in News
Visualizing English Print is at the Shakespeare Annual Association Meeting in New Orleans! Jonathan Hope, Alan Hogarth, and I will be part of the digital exhibits on Thursday from 10:00 AM to 1:30 PM. Be sure to stop by! We’ll be offering demonstrations of new tools like TextDNA and of our early modern drama and early modern science corpora. Read formatted page...

XML Tags in TCP TEI-P4 Files

Deidre Stuffer
in News
Do you work with the TEI P4 versions of TCP XML files and wonder what all those tags mean? After surveying XML tags in TCP corpora, I made a spreadsheet that lists all of the tags, defines them, and mentions where you may find them within the XML documents. Download the spreadsheet from here. Surveying and examining the files has made it obvious that different TCP corpora have different levels of curation. Read more…
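A tag survey of this kind can be roughed out in a few lines of Python. The sketch below is not the method used to build the spreadsheet; it only counts element names in one XML string. The function name `count_tags` and the sample fragment are hypothetical, and real TCP P4 files may need a DOCTYPE or entity definitions resolved before the standard-library parser will accept them.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text):
    """Return a Counter mapping each element name to how often it occurs."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and every descendant element.
    return Counter(elem.tag for elem in root.iter())


# Hypothetical TEI-like fragment for illustration, not taken from a TCP file.
SAMPLE = (
    "<TEXT><BODY><DIV1>"
    "<HEAD>Title</HEAD><P>One</P><P>Two</P>"
    "</DIV1></BODY></TEXT>"
)

if __name__ == "__main__":
    for tag, n in count_tags(SAMPLE).most_common():
        print(tag, n)
```

Summing the counters from every file in a corpus would give a corpus-wide tag inventory to start a spreadsheet from.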

Forcing Standardization in VARD Part 1

Optimizing VARD for the early modern drama corpus required “forcing” lexical changes to create higher levels of standardization in the dataset. Jonathan Hope gave me editorial principles to follow as we considered which words and patterns VARD should be changing but was not. We wanted to standardize prepositions, expand elisions, and preserve verb endings. Unfortunately, preserving Early Modern verb endings (-st, -th) would require an overhaul of VARD’s dictionary. I followed three routes to force standardization: manually selecting some variants over others to change confidence scores; marking non-variants as variants and inputting their standardized forms; and adding words to the dictionary. Read more…

Forcing Standardization in VARD Part 2

The final aspect of standardization I will discuss is common early modern spellings forced to modern equivalents, decisions where the payoff of consistency outweighs slight data loss. The VEP team decided to force bee > be, doe > do, and wee > we. Naturally, one can see the problems inherent in these forced standardizations. Bee in early modern spelling can stand for the insect as well as the verb. Similarly, doe can signify a deer or a verb. Read more…
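The trade-off can be seen in a small Python sketch. This is not how VARD applies its rules; it is a plain lookup-and-replace, with hypothetical names (`FORCED`, `force_standardize`), that applies the three forced mappings to whole words only and keeps an initial capital.

```python
import re

# The three forced mappings named in the post.
FORCED = {"bee": "be", "doe": "do", "wee": "we"}

# Match the variants as whole words only, in any letter case.
_PATTERN = re.compile(r"\b(" + "|".join(FORCED) + r")\b", re.IGNORECASE)


def force_standardize(text):
    """Replace each forced variant with its modern form."""
    def repl(match):
        word = match.group(0)
        modern = FORCED[word.lower()]
        # Keep an initial capital, e.g. "Wee" becomes "We".
        return modern.capitalize() if word[0].isupper() else modern

    return _PATTERN.sub(repl, text)


if __name__ == "__main__":
    print(force_standardize("Wee doe not see the bee"))
```

Here the final "bee" is the insect, yet it is rewritten to "be" along with the verb forms. That is the slight data loss accepted in exchange for consistency.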

VARD Normalization Errors

Deidre Stuffer
in News
VARD decently standardizes Early Modern English. Sometimes, though, it makes questionable replacements.

ORIGINAL      NORMALIZATION   SHOULD BE
all’s         ell’s           all’s
caus’d        cause           caused
Cicilia       Cicely          Cicilia
courtesie     curtsy          courtesy
diuers        divers          diverse
hir           his             her
ile           isle            I’ll
ist           first           is’t
kild/killd    kilt            killed
maister       moister         master
maist         moist           mayst
*nunc         nuns            nunc
Pauls         Pals            Paul’s
*qui          queen           qui
shees         shoes           she’s / she is
weele         weal            we’ll / we will
where’s       whores          where’s / where is

Of course, you will want to check how your VARD installation handles these words. Read more…

Tweaking VARD: Aggressive Rules for Early Modern English Morphemes and Elisions

Since I have discussed how VARD behaves with character encoding and symbols, I will devote space to explaining how I tweaked VARD to standardize Jonathan Hope’s early modern drama corpus. Given the size of Hope’s corpus, comparing VARD’s output to the original play files had to be automated. Erin Winter wrote a case-sensitive Python script that generated a CSV recording all of VARD’s changes and their frequencies. I compared the original words to VARD’s normalizations, looking only at the highest frequencies. Read more…
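The comparison step might look something like the following Python sketch. It is not Erin Winter’s script, which is not reproduced here. The function names are hypothetical, and it assumes the original and normalized texts have already been split into token lists of equal length that line up one-to-one, which may not hold if a normalization splits or merges words.

```python
import csv
import io
from collections import Counter


def count_changes(original_tokens, normalized_tokens):
    """Count case-sensitive (original, normalized) pairs that differ."""
    if len(original_tokens) != len(normalized_tokens):
        raise ValueError("token sequences must be the same length")
    return Counter(
        (orig, norm)
        for orig, norm in zip(original_tokens, normalized_tokens)
        if orig != norm
    )


def changes_to_csv(changes):
    """Render the counts as CSV text, most frequent change first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["original", "normalized", "frequency"])
    for (orig, norm), freq in changes.most_common():
        writer.writerow([orig, norm, freq])
    return buf.getvalue()


if __name__ == "__main__":
    original = "Wee doe loue wee".split()
    normalized = "We do love we".split()
    print(changes_to_csv(count_changes(original, normalized)))
```

Because the comparison is case-sensitive, "Wee" > "We" and "wee" > "we" are counted as separate changes. Sorting by frequency puts the most common changes at the top for review.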

VARD & ASCII Symbols

Deidre Stuffer
in News
Yes, even ASCII symbols mess up VARD. Those who have tried to extract plain text from TCP TEI P4 or P5 XML files know how difficult it is. While coding tools to extract TCP text, the VEP team grappled with the order of operations to perform. Where is the best place in an extraction pipeline to convert the XML document to text? Where do we want to use VARD? As discussed in my previous post, processing XML files through VARD can be tricky. Read more…

VARD & Character Encoding

Deidre Stuffer
in News
Everything is always already encoded. The first time I used VARD, discussed in my previous entry, it was a shiny toy, one with which I wanted to automatically process batches of TCP TEI P4 XML files. That was in February of this year. Since then, my interactions with VARD have underscored the need to understand how the tools I use work. The public releases of ECCO-TCP, EEBO-TCP, and Evans-TCP texts are a boon to scholars, who can use these texts as the basis for their scholarship. Read more…

Standardizing Early Modern Drama

Deidre Stuffer
in News
We have made great progress with Jonathan Hope’s early modern drama corpus. It now includes plays dated up through 1700, built from TCP corpora. By my count, it comprises 1,257 plays. A corpus of this size and origin requires considerable curation. Beth Ralston has spearheaded metadata collection and cross-referencing, quite the feat, from Glasgow. In Madison, the VEP team has worked on extracting the necessary text from TCP XML files. This effort involved writing and tweaking Python scripts specifically for TEI P4 versions of the TCP offerings. Read more…

Visualizing English Print Project

Deidre Stuffer
in News
Welcome to the Visualizing English Print (VEP) project. Our mission is to scale humanist scholarship: we design visualization tools, and we curate and analyze increasingly large corpora of digital early modern texts. To learn about our visualization programs, visit the “Tools” page. Our tools support visual exploration of topic models, analyze text according to genomic sequence alignment techniques, and tag plain text for rhetorical effects. The “Corpora” page catalogs corpora we have generated in collaboration with domain specialists in early modern drama and early modern scientific writing. Read more…