Using the Metadata Builder: Getting the information that you want

hfroehlich

October 28, 2016

Yesterday, Deidre wrote about the release of our new Metadata Builder, which collates lots of available information about materials included in the Text Creation Partnership transcriptions in one place. For each corpus available, you have the option of downloading metadata only for texts freely available in the public domain, or metadata for both the freely available texts and the presently restricted ones, which are to be made available in the public domain in 2020 (we can distribute information about these restricted-access texts, but we can’t share the files). As a user of the Metadata Builder, I want to be able to take advantage of all the different metadata options available to supplement and guide my analyses. In this post I’ll walk you through a few ways of obtaining a couple of different kinds of information from what is on offer.

I happen to be interested in the language of dramatic writing. Visualizing English Print offers three different dramatic corpora: the Core Drama 1660 corpus, the Expanded Drama 1660 corpus, and the Expanded Drama 1700 corpus. (Many scholars of Early Modern drama will be familiar with the Database of Early English Playbooks, or DEEP as it is commonly known; this is in some ways quite similar.) Of our dramatic corpora, the Expanded Drama 1700 corpus (ED1700) covers the largest quantity of dramatic writing, so I’ll use it as an example.

If I want all the metadata available for this corpus regardless of public-domain status, I would select ‘All’ available texts in Step 1. However, if I want to use this metadata to guide decisions about a project, I might prefer to use the ‘Unrestricted’ version of the corpus, as these texts are all freely available for download from our site.

First things first: to get all of our available metadata for either version of ED1700 specified in Step 1, select ‘all’ under every drop-down menu in Step 2. This is the “all you can eat” option: it will include every piece of metadata we have available, and from there you can download the spreadsheet and its associated readme file in Steps 4 and 5. With everything in hand, you can always further refine your downloaded spreadsheet, but I find it useful to keep one master spreadsheet pristine and do metadata manipulations, such as organising by author, date or other parameters, in a second copy of the original spreadsheet.
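
For readers who prefer to script this step, here is a minimal sketch of the "pristine master, sorted working copy" habit, using only Python's standard `csv` module. The column names (`tcp_id`, `author`, `date`) and the rows are invented placeholders, not the Metadata Builder's actual headers or data; check the readme file that comes with your download for the real column names.

```python
import csv
import io

# Invented placeholder rows. The column names are assumptions for
# illustration only; the real spreadsheet's headers may differ.
MASTER_CSV = """tcp_id,author,date
X00003,Author B,1640
X00001,Author A,1675
X00002,Author A,1612
"""

def sorted_copy(master_text, keys):
    """Return a new CSV string sorted by `keys`; the input is not modified."""
    reader = csv.DictReader(io.StringIO(master_text))
    # Values are compared as strings, which is fine for four-digit years
    # but would need converting to int for other numeric columns.
    rows = sorted(reader, key=lambda row: tuple(row[k] for k in keys))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Sort by author, then by date, into a separate working copy.
working_copy = sorted_copy(MASTER_CSV, ["author", "date"])
print(working_copy)
```

With a real download you would read the master file from disk and write the result to a differently named file, so the original is never overwritten.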

While it is great to have everything, sometimes that can be too overwhelming. This post is therefore not meant to be a how-to guide but more of a ‘ways of thinking about the Metadata Builder’ guide. Here are a few of the metadata columns we offer which I personally find most useful.

If you want to get the dedicated TCP ID number associated with each transcription, you’ll want to select the category ‘TCP’ from the dropdown menu “Master Metadata”. These unique TCP identifiers each match a specific transcription: so the TCP identification number A01234 will always link to this specific document. ESTC data (including Wing numbers, where applicable) is available under the option ‘ESTC’.

Under Master Metadata, we also offer information from the Wiggins Catalogues of British Drama, including their identification number schema and historical and contemporary generic assignments. Other useful generic information includes the DEEP genre and the Harbage genre, should you want to compare different understandings of genre over time, or use various criteria to show variation in generic forms. I also often want to know how many words are in each text, as this is a common way of describing how big or long a text is. This can be found under the Ubiq categories; select ‘# word tokens’ at the very bottom of the Ubiq dropdown menu in Step 2.

The ability to group plays by company, based on information from the Wiggins Catalogues (using the options for Play Company 1 and Play Company 2 under Master Metadata in Step 2), means that I can easily organise an analysis using attributed information about working theatrical networks of the time and ask what makes the language of plays put on by the King’s Men different from, say, that of all other companies. With the option of including up to five authors as well, I can start to make more complex analyses along multiple axes, such as asking only about single-authored plays performed by Queen Henrietta Maria’s Men that are over 20,000 words long.
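
A query along several axes like this one can be sketched as a simple row filter. The example below is illustrative only: the column names (`author_1`, `author_2`, `play_company_1`, `word_tokens`) are hypothetical stand-ins for whatever headers appear in your downloaded spreadsheet, and the four rows are invented placeholders rather than real plays.

```python
import csv
import io

# Invented placeholder rows; column names are assumptions, not the
# Metadata Builder's actual headers.
SAMPLE = """title,author_1,author_2,play_company_1,word_tokens
Play One,Author A,,Queen Henrietta Maria's Men,21500
Play Two,Author A,Author B,Queen Henrietta Maria's Men,24000
Play Three,Author C,,King's Men,26000
Play Four,Author D,,Queen Henrietta Maria's Men,18000
"""

def select_plays(rows, company, min_words):
    """Keep single-authored plays by `company` with more than `min_words` words."""
    selected = []
    for row in rows:
        # Single-authored here means: a first author is given, a second is not.
        single_author = bool(row["author_1"].strip()) and not row["author_2"].strip()
        right_company = row["play_company_1"].strip() == company
        long_enough = int(row["word_tokens"]) > min_words
        if single_author and right_company and long_enough:
            selected.append(row)
    return selected

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
matches = select_plays(rows, "Queen Henrietta Maria's Men", 20000)
print([row["title"] for row in matches])  # ['Play One']
```

The company name is matched exactly, so the string passed in has to be spelled and punctuated the same way as in the spreadsheet. A fuller version would also check the second company column and the remaining author columns.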

What I am doing here is not limiting my corpus on the basis of arbitrary features, but selecting texts which fit certain parameters in order to get at more specific questions. The more features I pull in, the more information I can base my decisions on, but not all of the categories in Step 2 may be immediately useful. For example, I probably don’t need to know if there are figures (images) in these texts or how many pages long the texts originally were: that’s probably not going to help me. By excluding them from my spreadsheet, I am able to focus on the more relevant information (and if I decide I do want to know about them later, I can always get them from the all-metadata spreadsheet I downloaded in the first instance).
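
If you have already downloaded the full spreadsheet, the same trimming can be done after the fact rather than in Step 2. This is a rough sketch using Python's standard `csv` module; the column names `figures` and `page_count` are hypothetical labels for the two categories mentioned above, and the rows are invented placeholders.

```python
import csv
import io

# Invented placeholder rows; column names are assumptions for illustration.
SAMPLE = """tcp_id,author,figures,page_count,word_tokens
X00001,Author A,0,64,21500
X00002,Author B,2,80,24000
"""

def drop_columns(csv_text, unwanted):
    """Return a CSV string containing every column except those in `unwanted`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in unwanted]
    out = io.StringIO()
    # extrasaction="ignore" tells DictWriter to skip the dropped columns
    # instead of raising an error when it meets them in a row.
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()

trimmed = drop_columns(SAMPLE, {"figures", "page_count"})
print(trimmed)
```

Because the function returns a new string, the full spreadsheet stays intact and the dropped columns can be recovered at any point.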

Another thing I can do with the Metadata Builder is download Docuscope tagging statistics for every text in a specified corpus using the dropdown menu ‘Ubiq’ in Step 2. This means that I do not have to process the ED1700 Unrestricted corpus through Ubiqu+ity myself, but can instead combine multiple metadata categories with the statistical distributions produced by the Docuscope text-tagging schema. By selecting relevant metadata categories such as author(s), date of first performance, theatrical group, and several views of genre assignment, I am setting myself up for quite a nuanced multivariate analysis using these particular features.

Finally, the multivariate analyses I suggested above do not necessarily require the use of further computational methods. The ability to isolate all the texts that share a certain feature can guide any number of decisions for studies which rely on close reading, such as identifying transcriptions which have multi-lingual content, or realising there is a text which you didn’t know about but which has a clear connection to your previous work. The Metadata Builder therefore makes it possible to obtain a lot of information about the huge number of texts now available as a result of the TCP project. We look forward to what you will do with it!