Pipeline (OLD)

Warning: this page is out of date

Please see the page for the more modern pipeline.

VEP Scripts

div-divider:

processes TEI-formatted XML files and extracts all DIV objects from a file, dividing them into independent files named after their DIV types
allows extraction of specific DIV types (e.g., “play”)
naming rule for output files: [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]
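
As a rough illustration of the naming rule above, the following Python sketch walks a TEI file and builds one output name per DIV. It is not the div-divider source: the function name `div_output_names`, the "untyped" fallback, and the reading of [level] as DIV nesting depth are assumptions made for the example. It also assumes the DIV elements are un-namespaced tags such as DIV1 or DIV2 that carry a TYPE attribute.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: DIV elements are tags like DIV, DIV1, DIV2 (any letter case).
DIV_TAG = re.compile(r"^div\d*$", re.IGNORECASE)


def div_output_names(xml_text, name, wanted_types=None):
    """Return (output_name, element) pairs, one per extracted DIV.

    Names follow [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level].
    """
    root = ET.fromstring(xml_text)
    type_counts = Counter()  # running count per DIV type
    global_no = 0            # running count over all DIVs
    results = []

    def walk(elem, level):
        nonlocal global_no
        if DIV_TAG.match(elem.tag):
            level += 1
            global_no += 1
            # "untyped" is a hypothetical fallback for DIVs without a type.
            div_type = elem.get("TYPE") or elem.get("type") or "untyped"
            type_counts[div_type] += 1
            if wanted_types is None or div_type in wanted_types:
                out = f"{name}_{global_no}_{div_type}_{type_counts[div_type]}_{level}"
                results.append((out, elem))
        for child in elem:
            walk(child, level)

    walk(root, 0)
    return results
```

Passing `wanted_types={"play"}` would keep only the DIVs typed "play" while leaving the global numbering unchanged.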

div-merger:

sequentially merges XML elements in an input folder and outputs an XML file that contains all of the XML elements within a tag
user must specify name for output file
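
A minimal sketch of this kind of merge, assuming the input folder holds well-formed *.xml files and that "sequentially" means filename order. The function name `merge_xml_folder` and the default wrapper tag "merged" are hypothetical. The actual script may order files and name the enclosing tag differently.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def merge_xml_folder(input_dir, output_path, wrapper_tag="merged"):
    """Append the root element of every XML file in input_dir to one wrapper.

    Files are read in sorted filename order (an assumption about what
    "sequentially" means). Returns the number of elements merged.
    """
    wrapper = ET.Element(wrapper_tag)
    for path in sorted(Path(input_dir).glob("*.xml")):
        wrapper.append(ET.parse(str(path)).getroot())
    # The caller supplies the output file name, as the script requires.
    ET.ElementTree(wrapper).write(
        str(output_path), encoding="utf-8", xml_declaration=True
    )
    return len(wrapper)
```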

pre-VARDer:

prepares TCP XML files for VARD
replaces XML reserved characters (<, >, %) with at signs (@)
replaces ampersands (&) with the word “and”
removes XML comments and TEI XML tags that can interrupt words: , _,
transforms non-ASCII characters into ASCII alternatives (e.g., “naïve” to “naive”)
replaces dashes (—) with two hyphens (--)
replaces TCP illegible characters (bullet: •) with carets (^)
replaces TCP unrecognizable punctuation (small black square: ▪) with asterisks (*)
replaces non-ASCII characters not assigned ASCII equivalents (e.g., pilcrow: ¶) with at signs (@)
replaces TCP missing word symbol (lozenge in brackets: ◊) with ellipses in parentheses ((…))
removes TCP end-of-line hyphen characters (vertical bar: |, broken vertical bar: ¦)
if desired, removes textual symbols converted to at signs (@)
VEP Unicode Character Substitutions
TCP Unicode Character Survey
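
Several of the character substitutions listed above can be sketched in a few lines of Python. This is an illustration, not the pre-VARDer code. It works on an already-extracted text string rather than on XML, and it uses Unicode NFKD decomposition as a stand-in for the VEP substitution table. It maps a bare lozenge to three periods in parentheses and skips the comment and tag removal step. The name `pre_vard_clean` and the `drop_at_signs` flag are hypothetical.

```python
import unicodedata

# Character-level substitutions taken from the list above.
SUBSTITUTIONS = str.maketrans({
    "&": "and",                      # ampersand -> the word "and"
    "<": "@", ">": "@", "%": "@",    # reserved characters -> at sign
    "\u2014": "--",                  # dash -> two hyphens
    "\u2022": "^",                   # TCP illegible character (bullet)
    "\u25aa": "*",                   # TCP unrecognizable punctuation
    "\u25ca": "(...)",               # TCP missing word symbol (assumed form)
    "|": None, "\u00a6": None,       # TCP end-of-line hyphen characters
})


def pre_vard_clean(text, drop_at_signs=False):
    """Apply the substitutions, then reduce what is left to ASCII."""
    text = text.translate(SUBSTITUTIONS)
    # Assumption: NFKD plus dropping combining marks approximates the
    # "ASCII alternatives" step (e.g. i with diaeresis -> i).
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII has no assigned equivalent -> at sign.
    text = "".join(ch if ord(ch) < 128 else "@" for ch in text)
    if drop_at_signs:
        text = text.replace("@", "")
    return text
```

Applying the explicit table before the generic ASCII fallback keeps the TCP-specific symbols from being flattened to at signs.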

tei-decoder:

flexibly eliminates XML tags, their attributes, and their content to produce plain text
uses a config file that specifies behavior for XML tags, i.e., determines what text to print to a new file
prints text in lines no longer than 80 characters
TCP TEI-P4 XML Tag Survey
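
The config-driven behaviour described above might look roughly like the sketch below. It assumes a much simpler config than the real tool likely uses: a dictionary mapping a tag name to "keep" (print its text) or "drop" (discard the tag and its content). The function name `decode` and these two action names are hypothetical. The sketch also collapses whitespace before wrapping, so it does not preserve source layout.

```python
import textwrap
import xml.etree.ElementTree as ET


def decode(xml_text, config, default="keep", width=80):
    """Strip tags according to config and wrap the text at `width` columns.

    config maps tag name -> "keep" or "drop"; unlisted tags use `default`.
    """
    root = ET.fromstring(xml_text)
    pieces = []

    def walk(elem):
        if config.get(elem.tag, default) == "keep":
            if elem.text:
                pieces.append(elem.text)
            for child in elem:
                walk(child)
                # Tail text belongs to the parent, so it is kept even
                # when the child element itself is dropped.
                if child.tail:
                    pieces.append(child.tail)

    walk(root)
    text = " ".join("".join(pieces).split())
    return textwrap.fill(text, width=width)
```

For instance, `decode("<P>Hello <NOTE>skip me</NOTE>world</P>", {"NOTE": "drop"})` would drop the note and keep the surrounding text.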

VARD

software that standardizes Early Modern English spelling across corpora
version 2.5.4
trained by Deidre Stuffer on Jonathan Hope’s early modern drama corpora
expands contracted pronouns, elided adverbs, and elided prepositions
standardizes early modern English 2nd and 3rd person singular verb endings to modern equivalents
Aggressive Rules
Correcting Normalization Errors
Forcing Standardization I
Forcing Standardization II

Known Issues

1. Hyphenation discrepancy in our analytical tools, caused by how the pipeline preserves end-of-line hyphens in SimpleText files

Explanation: Our plain text generation tools, to the extent they can, preserve the layout of source XML files when generating SimpleText plain text representations. As a result, SimpleText files preserve hyphens that occur at the end of lines in texts. For example, take this intentional, performative hyphen use in poetry from TCP file K012309.000:

Conjunctions, Prepositions, Interjec-
Tions, in blameful negligence. – Ah!

At the moment, our internal-use analytical scripts that generate n-grams treat a hyphen at the end of a line as part of the word and join it to the syllable(s) that continue the word on the next line. Therefore, the algorithm considers “Interjec-Tions” a word distinct from “interjections”. Future iterations of our tools will work to fix this discrepancy so that an end-of-line hyphen is not treated as a valid character within a word.

The tool Ubiqu+Ity removes end-of-line hyphens in its analysis stage, treating “interjec-tions” as “interjections”.
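
One way to remove end-of-line hyphens before tokenizing is sketched below. It is not Ubiqu+Ity's implementation or our n-gram scripts. The names `join_eol_hyphens` and `tokens` are hypothetical, and the tokenizer is deliberately simple (lowercase ASCII letters, with mid-line hyphens kept).

```python
import re

# A hyphen directly after a word character, followed by a line break
# (with optional spaces or tabs) and then another word character.
EOL_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)")


def join_eol_hyphens(text):
    """Join words split across lines by an end-of-line hyphen."""
    return EOL_HYPHEN.sub("", text)


def tokens(text):
    """Lowercase word tokens; hyphens inside a line stay in the token."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", join_eol_hyphens(text).lower())
```

With this approach the two lines quoted above yield "interjections" as a single token. A genuinely hyphenated compound that happens to break at a line end would also be joined, which a fuller solution would need to handle.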