News

Guest Post: Data-Mining King Lear

hfroehlich
in News
[I am pleased to offer this guest post by Darby Foster, a first-year undergraduate student at Georgia Institute of Technology, majoring in Business Administration/Information Technology Management. Her professor, Dr Sarah Higanbotham, was kind enough to get in touch with me to share Darby’s final paper, which appears in a truncated form here. VEP loves hearing from students whose imaginations have been really taken by the work we do. –hgf] Read more…

Using the metadata builder to guide an analysis

hfroehlich
in News
As we’ve been releasing new resources for interacting with the TCP files, one of the questions that keeps coming up is “This is great, but what are we supposed to do with this stuff?” In this blog post I’m going to show how you can use the Core 1660 Drama corpus (from our Early Modern Drama collection) and the Metadata Builder to look at plays which didn’t explicitly involve Shakespeare as an author. Read more…
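A hypothetical sketch of the kind of filtering that post walks through: load a Metadata Builder spreadsheet for the drama corpus and keep only the plays whose author field does not mention Shakespeare. The filename and column names below are placeholders, not the Builder’s actual output schema.

```python
# Placeholder sketch: filter Metadata Builder output to non-Shakespeare plays.
# The filename "core_1660_drama_metadata.csv" and the "author" column are assumptions.
import pandas as pd

meta = pd.read_csv("core_1660_drama_metadata.csv")
not_shakespeare = meta[~meta["author"].str.contains("Shakespeare", case=False, na=False)]
print(len(not_shakespeare), "plays do not name Shakespeare as an author")
```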

Using the Metadata Builder: Getting the information that you want

hfroehlich
in News
Yesterday, Deidre wrote about the release of our new Metadata Builder, which collates lots of available information about materials included in the Text Creation Partnership transcriptions in one place. For each corpus available, you have the option of downloading metadata only for texts freely available in the public domain or metadata for texts both freely available and presently restricted, to be made available in the public domain in 2020 (we can distribute information about these restricted-access texts, but we can’t share the files). Read more…

VEP's Metadata Builder

Deidre Stuffer
in News
VEP’s Metadata Builder helps users navigate our corpora collections, which are vast in scale. The Metadata Builder provides metadata to users in an accessible, intelligible format, sparing users from having to decipher spreadsheets with more than 120 columns and 160,000 rows. This blog post will explain the motivations for creating the Metadata Builder and its main components, enabling users to understand how to use the tool. Motivations: This tool was built to allow users to generate and download metadata spreadsheets of VEP corpora tailored to their own research interests. Read more…

Making your own rules for use with Ubiqu+Ity

hfroehlich
in News
Several years ago, Michael Witmore and Jonathan Hope published a paper in Shakespeare Quarterly that describes how the string-matching rhetorical analysis software DocuScope is able to identify stylistic fingerprints of genre in Shakespeare’s plays. Visualizing English Print is proud to make the string-matching rules used by DocuScope available online for general use as part of the multivariate textual analysis package Ubiqu+Ity. The DocuScope dictionaries, which were initially designed to analyze rhetorical features such as persuasiveness or first-person reporting, cover 40 million linguistic patterns of English classified into over 100 categories of rhetorical effects (see http://www. Read more…
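As a toy illustration of what string-matching rules of this kind do (this is not the DocuScope or Ubiqu+Ity rule format, only the general idea), each category lists surface patterns and the tagger counts matches per category:

```python
# Toy string-matching rules: NOT the DocuScope/Ubiqu+Ity format, just the idea.
import re
from collections import Counter

rules = {
    "FirstPerson": ["i think", "i believe", "in my view"],
    "Persuade":    ["you must", "you should", "it is necessary"],
}

def tag_text(text, rules):
    """Count how many times each category's patterns occur in the text."""
    counts = Counter()
    lowered = text.lower()
    for category, patterns in rules.items():
        for pattern in patterns:
            counts[category] += len(re.findall(re.escape(pattern), lowered))
    return counts

print(tag_text("I think you should read this; I believe you must.", rules))
# Counter({'FirstPerson': 2, 'Persuade': 2})
```

The real dictionaries operate at a vastly larger scale, but the principle described in the post, matching fixed surface patterns and tallying them by category, is the same.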

VEP Releases

Deidre Stuffer
in News
VEP has been busy improving its visualization tools and processing pipeline! You can read about all the changes in the list below. Release Information: Text Processing Pipeline 2.0 features better Unicode handling during character cleaning and a dictionary that standardizes spelling variation across TCP corpora. Read about the pipeline on the ‘Workflow’ page. Download the pipeline from GitHub. TextDNA is available for download! The download includes sample datasets and Python scripts for curating your own TextDNA datasets. Read more…
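For a rough sense of what dictionary-based spelling standardization involves, here is a minimal sketch; the variant/standard pairs are invented for illustration and are not taken from VEP’s actual dictionary:

```python
# Minimal sketch of dictionary-based spelling standardization.
# The variant/standard pairs are invented examples, not VEP's dictionary.
variants = {"loue": "love", "vertue": "virtue", "doe": "do"}

def standardize(tokens, variants):
    """Map each token to its standardized spelling, if one is listed."""
    return [variants.get(t.lower(), t.lower()) for t in tokens]

print(standardize(["Loue", "and", "vertue"], variants))  # ['love', 'and', 'virtue']
```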

Editing Programmatically; or, Curating ‘Big Data’ Literature Corpora

Deidre Stuffer
in News
No one has time to read and really understand all of the 1,476,894,257 words that make up the Text Creation Partnership (TCP) digital texts. Considering that adults read on average 300 words a minute, it would take someone about 40 years to read every word in the TCP’s 61,000 texts. That 40-year estimate assumes 52 forty-hour work weeks per year: no vacations, no holiday time, no sick time, no lunches or breaks. Read more…
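A quick back-of-the-envelope check of that estimate, using only the figures quoted above:

```python
# Back-of-the-envelope check of the ~40-year reading estimate.
total_words = 1_476_894_257     # words in the TCP digital texts
words_per_minute = 300          # average adult reading speed
hours_per_year = 40 * 52        # 52 forty-hour work weeks, no breaks

hours = total_words / words_per_minute / 60
years = hours / hours_per_year
print(f"about {hours:,.0f} hours, or {years:.1f} working years")  # ~82,000 hours, ~39.5 years
```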

A first attempt at LSA (Latent Semantic Analysis)

Michael Gleicher
in News
LSA is an older corpus processing method – it’s kind of gone out of favor for things like topic modeling – but I like it as an illustration because it is very simple. And to use it, we have to make all the critical decisions we need to make with any statistical modeling approach. And of course, simple experiments are good for exposing errors in our data and process (fixes were made to my code in the process of doing this, but the data is standing up). Read more…
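To show how little machinery LSA needs, here is a minimal sketch (not the code used in the post): build a term-document matrix and reduce it with a truncated SVD via scikit-learn. The weighting scheme, vocabulary cutoffs, and number of dimensions are exactly the kinds of critical decisions the post is talking about.

```python
# Minimal LSA sketch using scikit-learn; not the post's actual code or corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "love and fortune at the court",
    "battle and honour in the field",
    "love and honour at the court",
]  # tiny placeholder corpus

tfidf = TfidfVectorizer()            # weighting: one of the critical modeling decisions
X = tfidf.fit_transform(docs)        # documents x terms, sparse matrix

lsa = TruncatedSVD(n_components=2)   # number of latent dimensions: another decision
doc_coords = lsa.fit_transform(X)    # each document as a point in the reduced space
print(doc_coords.shape)              # (3, 2)
```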

Words / NotWords for Plays and All TCP

Michael Gleicher
in News
Let’s try the words/not words game again… This time I will start with the entire TCP corpus (61315 documents – this includes some Evans and some ECCO). There are 6065951 different words (which is a lot – it gives us a 61315×6065951 matrix). But most words occur once in the whole corpus. I’ll limit things to words that appear in 5 or more documents (there are still 909665 words). Also here I am referring to different words (e. Read more…
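The document-frequency cutoff described there can be sketched in a few lines (an illustrative reimplementation, not the code behind the post):

```python
# Illustrative document-frequency cutoff: keep words seen in >= min_docs documents.
from collections import Counter

def doc_frequencies(documents):
    """Count, for each word, how many documents it appears in."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))        # count each word at most once per document
    return df

def restricted_vocabulary(documents, min_docs=5):
    """Words that appear in at least `min_docs` documents."""
    df = doc_frequencies(documents)
    return {word for word, n in df.items() if n >= min_docs}
```

On the full TCP, this is the step that cuts roughly six million distinct words down to about 900,000, per the figures quoted above.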

The Untranscribable in EEBO

hfroehlich
in News
As part of Visualising English Print, I have been evaluating and validating judgments about non-English print in the Text Creation Partnership transcriptions of EEBO. I’ve been looking at texts which have been classified as non-English (or texts that appear to be non-English, such as lists of names or places) by an automated text tagger. Bi- or multilingual texts cause particular difficulties for this task, as a large percentage of a text can be in English yet still pose problems because it contains a relatively high proportion of untaggable words. Read more…
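One simple heuristic in the spirit of that classification (the post’s actual tagger is not shown or named here) is to flag a text when too large a share of its words falls outside a reference English wordlist. The bilingual difficulty is then easy to see: a half-English text can sit just under any threshold while still carrying a large share of untaggable words.

```python
# Simple non-English heuristic; a stand-in, not the tagger used in the post.
def untaggable_ratio(tokens, english_words):
    """Share of tokens not found in a reference English wordlist."""
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.lower() not in english_words)
    return unknown / len(tokens)

def looks_non_english(tokens, english_words, threshold=0.5):
    """Flag a text when at least `threshold` of its words are untaggable."""
    return untaggable_ratio(tokens, english_words) >= threshold
```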

Shakespeare’s Words and NotWords

Michael Gleicher
in News
An experiment we often talk about is to see what words are “unique” to Shakespeare (or some other author, or group of books), and what words are conspicuously missing (e.g. they are very common in the rest of the corpus). I think this is more an exercise to test our data (especially the standardization) than it is to shed new light on the corpus, but it is still worthwhile. I start with a corpus (in this case, it’s the “all drama” or 1292 set). Read more…
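A rough sketch of that comparison (a reconstruction of the idea, not the author’s code): take the vocabulary of the author’s documents, count word frequencies in the rest of the corpus, and look at the two differences.

```python
# Sketch of the words/notwords comparison; a reconstruction, not the post's code.
from collections import Counter

def words_and_notwords(author_docs, other_docs, common_cutoff=100):
    """author_docs and other_docs are iterables of token lists."""
    author_vocab = {w for doc in author_docs for w in doc}
    other_counts = Counter(w for doc in other_docs for w in doc)

    unique = author_vocab - other_counts.keys()          # words only this author uses
    notwords = {w for w, n in other_counts.items()       # common elsewhere, absent here
                if n >= common_cutoff and w not in author_vocab}
    return unique, notwords
```

The `common_cutoff` parameter is a hypothetical knob for deciding what counts as “very common in the rest of the corpus.”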

Presenting at Yale’s Digital Humanities Lab

Deidre Stuffer
in News
VEP’s own Heather Froehlich recently presented at the Yale Digital Humanities Lab! Here is what Heather has to say about her presentations: On the kind invitation of Cathy DeRose, an alumna of Visualising English Print, I was a visitor at the Yale Digital Humanities Lab last week. While there I gave two presentations: one, a paper about some of my research involving EEBO-TCP, and the other, a 3-hour masterclass on ways of using and accessing EEBO-TCP Phase I. Read more…