
Metadata Builder

The Metadata Builder is an interface for working with multiple spreadsheets. You can sort and search the metadata, join columns from different spreadsheets, and download spreadsheets customized to your needs. The tool contains spreadsheets for the VEP curated corpora and the VEP supplemental TCP metadata. It also links to live versions of free content. Learn more about our metadata holdings on the ‘Metadata’ page.
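The join, search, and sort operations described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Metadata Builder's own code: the sample rows, the column names (`text_id`, `title`, `year`, `genre`), and the helper functions are all hypothetical.

```python
import csv
import io

# Hypothetical sample data; the real spreadsheets and their columns may differ.
CURATED_CSV = """text_id,title,year
A002,Example Sermon,1623
A001,Example Play,1601
"""

SUPPLEMENTAL_CSV = """text_id,genre
A001,drama
A003,poetry
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dicts, one per spreadsheet row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def left_join(left_rows, right_rows, key):
    """Add the columns of right_rows to every left row that shares the key.

    Left rows without a match keep their data and get empty strings
    for the added columns.
    """
    right_by_key = {row[key]: row for row in right_rows}
    extra_columns = [c for c in (right_rows[0] if right_rows else {}) if c != key]
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for column in extra_columns:
            merged[column] = match.get(column, "")
        joined.append(merged)
    return joined


def search_rows(rows, term):
    """Keep rows in which any cell contains the term (case-insensitive)."""
    term = term.lower()
    return [r for r in rows if any(term in str(v).lower() for v in r.values())]


def sort_rows(rows, column):
    """Return the rows ordered by the values in one column."""
    return sorted(rows, key=lambda r: r[column])


joined = sort_rows(
    left_join(read_rows(CURATED_CSV), read_rows(SUPPLEMENTAL_CSV), "text_id"),
    "text_id",
)
```

A left join is used here so that every row of the first spreadsheet survives even when the second has no matching entry; whether the tool itself behaves this way is an assumption.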


In this project, we investigate multi-scale exploration of large text corpora guided by probabilistic topic models. Unlike prior work that focuses on visualizing topic models, we treat the models as a lens through which the original documents can be viewed, rather than as an end to be visualized in and of themselves. Through this lens, the reader can observe trends and build hypotheses at multiple scales, ranging from across a corpus to within a single text, and support these hypotheses with both algorithmic data and textual examples. Supporting this workflow requires a multi-tiered framework that affords comparisons at many levels, from multiple documents to specific passages to individual words. In doing so, we must overcome challenges including the scale of the corpus, the density of the models, and the overlapping nature of topic distributions.

We tackle these challenges in our implementation of Serendip, a tool that combines view-coordinated re-orderable matrices, small multiples displays, and tagged text to allow readers to develop insight at multiple levels and carry that insight into their analysis of other levels. Serendip uses metadata and reader interaction to highlight trends and areas of potential interest.
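To make the idea of a re-orderable matrix concrete, here is a minimal sketch that reorders a small document-by-topic matrix. It is not Serendip's implementation: the topic proportions, document names, and function names are invented for illustration.

```python
# Hypothetical topic proportions: each document maps to one weight per topic.
DOC_TOPICS = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.1, 0.6, 0.3],
    "doc_c": [0.3, 0.3, 0.4],
}


def order_documents_by_topic(doc_topics, topic_index):
    """Order matrix rows (documents) by their weight on one topic, highest first."""
    return sorted(doc_topics, key=lambda d: doc_topics[d][topic_index], reverse=True)


def order_topics_by_document(doc_topics, doc_id):
    """Order matrix columns (topics) by their weight in one document, highest first."""
    weights = doc_topics[doc_id]
    return sorted(range(len(weights)), key=lambda t: weights[t], reverse=True)


rows_for_topic_1 = order_documents_by_topic(DOC_TOPICS, 1)
columns_for_doc_a = order_topics_by_document(DOC_TOPICS, "doc_a")
```

Sorting rows by a chosen topic brings the documents that use it most to the top, and sorting columns by a chosen document does the same for its dominant topics.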


In this project, we investigate multi-scale exploration of large text corpora through the affordances of the genomics sequencing system, Sequence Surveyor. With TextDNA, users can compare word usage between document collections, between individual documents, or between elements within a document. Word usage can be explored across raw texts, i.e., text documents that have not been processed. It can also be explored through different metrics, such as the frequency with which each word appears in a document.
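As a rough illustration of comparing word usage by frequency, the sketch below counts words in two raw texts and lines up their relative frequencies. It is not TextDNA's code: the tokenization rule (lowercased runs of letters and apostrophes) and all function names are assumptions made for this example.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercased words in a raw text (assumed tokenization rule)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def relative_frequencies(text):
    """Map each word to its share of all word occurrences in the text."""
    counts = word_counts(text)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def rank_by_frequency(text):
    """List words from most to least frequent, breaking ties alphabetically."""
    counts = word_counts(text)
    return [w for w, _ in sorted(counts.items(), key=lambda item: (-item[1], item[0]))]


def compare_usage(text_a, text_b):
    """Pair the relative frequency of every word in text_a and text_b."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    return {
        word: (freq_a.get(word, 0.0), freq_b.get(word, 0.0))
        for word in sorted(set(freq_a) | set(freq_b))
    }
```

The same comparison applies at any scale: the two inputs can be whole collections concatenated together, single documents, or passages within one document.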


Ubiqu+Ity is a front-end web application for DocuScope, a rhetorical text analysis suite by David Kaufer and Suguru Ishizaki at Carnegie Mellon. Ubiqu+Ity is a text-tagging service that generates statistics and web-based tagged text views for your text(s), using the DocuScope dictionaries or your own rules. A preliminary version of the tool can be found here.
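The general idea of dictionary-based tagging can be sketched as follows. This is a simplified illustration only: the `RULES` dictionary, its category names, and the single-word matching are hypothetical and do not reflect the DocuScope dictionary format or Ubiqu+Ity's actual matching logic.

```python
import re
from collections import Counter

# Hypothetical rules mapping lowercased words to category tags.
RULES = {
    "i": "FirstPerson",
    "we": "FirstPerson",
    "perhaps": "Uncertainty",
    "maybe": "Uncertainty",
}


def tag_text(text, rules):
    """Split text into words and punctuation, pairing each token with its tag or None."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [(token, rules.get(token.lower())) for token in tokens]


def tag_statistics(tagged):
    """Count how many tokens received each tag."""
    return Counter(tag for _, tag in tagged if tag is not None)


def render_tagged(tagged):
    """Produce a plain-text view in which tagged tokens appear as [token|Tag]."""
    return " ".join(f"[{tok}|{tag}]" if tag else tok for tok, tag in tagged)


tagged = tag_text("Perhaps we agree.", RULES)
stats = tag_statistics(tagged)
view = render_tagged(tagged)
```

Supplying a different `rules` dictionary corresponds to tagging with your own rules instead of a provided dictionary.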