Text Processing

Pipeline

Below you will find explanations of the Python scripts that comprise the steps of our text processing pipeline, and a record of known issues. Also see the GitHub repo for code, data, and explanations.

VEP Scripts

There are three scripts that handle character cleaning, text extraction, and spelling standardization. characterCleaner prepares TCP XML files to facilitate spelling standardization later in the pipeline. It:

- replaces reserved XML characters (<, >, %) with at-signs (@)
- replaces ampersands (&) with the word “and”
- removes XML comments and TEI XML tags that can interrupt words
- transforms non-ASCII letters into ASCII alternatives (e.g., …) Read more…
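As a rough illustration of the cleaning pass described above, a minimal sketch might look like the following. The function name and the exact substitution rules are assumptions for illustration, not VEP’s actual characterCleaner code; in particular, this sketch assumes the reserved characters appear as escaped entities in the keyed text.

```python
# Hypothetical sketch of a character-cleaning pass; names and rules are
# illustrative assumptions, not VEP's characterCleaner itself.
import re
import unicodedata

def clean_characters(text: str) -> str:
    # Replace escaped reserved characters with at-signs so they cannot
    # be mistaken for markup downstream.
    text = re.sub(r"&lt;|&gt;|%", "@", text)
    # Replace ampersands with the word "and".
    text = text.replace("&amp;", "and")
    # Remove XML comments, which can interrupt words.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # Decompose accented letters, then drop anything with no ASCII equivalent.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

print(clean_characters("caf\u00e9 &amp; ale <!-- keyer's note -->"))
# -> "cafe and ale "
```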

Pipeline (OLD)

Warning: this page is out of date. Please see the page for the more modern pipeline.

VEP Scripts

div-divider:
- processes TEI-formatted XML files and extracts all DIV objects from a file, dividing them into independent files named after their DIV types
- allows extraction of specific DIV types (e.g., “play”)
- naming rule for output files: [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]

div-merger:
- sequentially merges XML elements in an input folder and outputs an XML file that contains all of the XML elements within a tag
- user must specify a name for the output file

pre-VARDer: Read more…
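A hypothetical sketch of what div-divider’s extraction and naming rule could look like follows. The DIV1–DIV7 element names and the TYPE attribute are assumptions based on TEI P4 conventions, and the function is illustrative, not the original script.

```python
# Hypothetical sketch of DIV extraction under the naming rule quoted above;
# element and attribute names follow TEI P4 conventions but are assumptions.
import os
import xml.etree.ElementTree as ET

def divide_divs(xml_path: str, out_dir: str, wanted_type: str | None = None) -> None:
    name = os.path.splitext(os.path.basename(xml_path))[0]
    tree = ET.parse(xml_path)
    global_no = 0                      # running count across all DIVs
    type_counts: dict[str, int] = {}   # per-type counters
    for level in range(1, 8):          # TEI P4 nests divisions as DIV1..DIV7
        for div in tree.iter(f"DIV{level}"):
            global_no += 1
            div_type = div.get("TYPE", "untyped")
            if wanted_type and div_type != wanted_type:
                continue               # allows extraction of specific DIV types
            type_counts[div_type] = type_counts.get(div_type, 0) + 1
            out_name = f"{name}_{global_no}_{div_type}_{type_counts[div_type]}_{level}.xml"
            ET.ElementTree(div).write(os.path.join(out_dir, out_name))
```

Note that this sketch numbers DIVs level by level rather than strictly in document order; the original script’s global numbering may differ.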

Source Files & their Challenges

We build our datasets from text files provided to the public by the Text Creation Partnership (TCP). For our purposes, we use its XML files encoded with TEI P4. These texts have their own complicated histories. Files from EEBO-TCP Phase I, EEBO-TCP Phase II, ECCO-TCP, and Evans-TCP were hand-keyed from digitized facsimiles.

Transmission History

The digitized facsimiles used for hand-keying are from several microfilm efforts. Read more…

Tokenization

Before Doing Things with Texts
written by: Deidre Stuffer

VEP often breaks texts down into their component words. While breaking down a text is computationally fast, the speed belies the difficult decisions behind exactly how to break down strings of characters. The process of segmenting strings of characters into meaningful parts is called tokenization. The resulting meaningful parts are referred to as tokens, and what constitutes a meaningful part must be specified computationally. Read more…
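The point that a token definition must be specified computationally can be made concrete with a small example of our own (not from the post): two common definitions segment the same early modern line differently, especially around elision apostrophes and punctuation.

```python
# Illustrative only: two token definitions disagree on the same string.
import re

line = "Th'inconstant moone, that monethly changes"

print(line.split())                    # whitespace tokens:
# ["Th'inconstant", 'moone,', 'that', 'monethly', 'changes']
print(re.findall(r"[A-Za-z]+", line))  # alphabetic-run tokens:
# ['Th', 'inconstant', 'moone', 'that', 'monethly', 'changes']
```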

Workflow

Deidre Stuffer
in Text Processing
Visualizing English Print’s Text Processing Pipeline Version 2.0

To make the work of the Text Creation Partnership (TCP) more suitable for scalable, computationally driven analysis, VEP has designed a text processing pipeline that includes three important steps in the following order: character cleaning, text extraction, and spelling standardization. We have designed the character cleaning and text extraction processes to facilitate spelling standardization based on the composition of our source files. (Note: this page describes pipeline version 2.0.) Read more…
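A minimal sketch of this three-step ordering follows; each function is a trivial placeholder standing in for the corresponding VEP script, not its real logic.

```python
# Minimal sketch of the pipeline's step ordering with placeholder steps.
import re

def clean_characters(xml_text: str) -> str:
    return xml_text.replace("&amp;", "and")           # step 1: character cleaning

def extract_text(xml_text: str) -> str:
    return re.sub(r"<[^>]+>", " ", xml_text).strip()  # step 2: text extraction

def standardize_spelling(text: str) -> str:
    lookup = {"moone": "moon"}                        # step 3: toy variant lookup
    return " ".join(lookup.get(w, w) for w in text.split())

def run_pipeline(tcp_xml: str) -> str:
    # Order matters: cleaning must precede extraction and standardization.
    return standardize_spelling(extract_text(clean_characters(tcp_xml)))

print(run_pipeline("<P>the moone &amp; sunne</P>"))  # -> "the moon and sunne"
```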

Workflow

To make the work of the Text Creation Partnership (TCP) available for large-scale computational analysis, VEP designed a text-processing pipeline to support spelling standardization and flexible text extraction. For ease of use, VEP generates ASCII representations of TCP source files. The pipeline consequently reduces character variety so that corpora, in a lowest-common-denominator format, can be analyzed by tools other than our own. Our pipeline includes three important steps in the following order: character cleaning, text extraction, and spelling standardization. Read more…
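As a hypothetical illustration of reducing character variety to ASCII in this lowest-common-denominator spirit, consider the sketch below; the substitution table is an assumption for illustration, not VEP’s actual mapping.

```python
# Hypothetical ASCII reduction; the substitution table is illustrative.
import unicodedata

SUBSTITUTIONS = {"ſ": "s", "æ": "ae", "Æ": "AE"}  # long s and ash ligatures

def to_ascii(text: str) -> str:
    for char, repl in SUBSTITUTIONS.items():
        text = text.replace(char, repl)
    # Decompose accented letters, then drop what has no ASCII form.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Cæſar crown'd"))  # -> "Caesar crown'd"
```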