Pipeline (OLD)
Warning: this page is out of date.
Please see the page for the more modern pipeline.
VEP Scripts
div-divider:
- processes TEI-formatted XML files and extracts all DIV objects from a file, dividing them into independent files named after their DIV types
- allows extraction of specific DIV types (e.g., “play”)
- naming rule for output files: [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]
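The naming rule above can be illustrated with a minimal Python sketch. This is not the div-divider script itself. It assumes that DIV elements are tags whose names start with "DIV" (e.g., DIV1, DIV2), that the DIV type is held in a TYPE attribute, that the global number counts all DIVs in document order, that the type number counts DIVs of the same type, and that the level is the nesting depth. The function name `div_output_names` and the sample file name are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def div_output_names(xml_text, name):
    """Yield (output_name, element) for each DIV-like element in document order.

    Assumed reading of the rule:
    [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]
    """
    root = ET.fromstring(xml_text)
    per_type = Counter()  # running count of DIVs seen for each type
    global_no = 0         # running count of all DIVs

    def walk(elem, level):
        nonlocal global_no
        if elem.tag.upper().startswith("DIV"):
            level += 1
            global_no += 1
            div_type = elem.get("TYPE") or elem.get("type") or "unknown"
            per_type[div_type] += 1
            yield f"{name}_{global_no}_{div_type}_{per_type[div_type]}_{level}", elem
        for child in elem:
            yield from walk(child, level)

    yield from walk(root, 0)


sample = (
    '<TEXT><DIV1 TYPE="play"><DIV2 TYPE="act"/><DIV2 TYPE="act"/></DIV1>'
    '<DIV1 TYPE="poem"/></TEXT>'
)
names = [n for n, _ in div_output_names(sample, "A00001")]
print(names)
```

Filtering for a specific DIV type (e.g., "play") would then be a matter of keeping only the pairs whose element has that type.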
div-merger:
- sequentially merges XML elements in an input folder and outputs an XML file that contains all of the XML elements within a tag
- user must specify name for output file
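As a rough illustration of the merging step, the sketch below wraps a sequence of XML elements in one enclosing element. It is an assumption-laden sketch, not the div-merger script: the wrapper tag name `merged` is hypothetical (the name used by the real tool is not given here), and the fragments are passed in as strings rather than read from an input folder.

```python
import xml.etree.ElementTree as ET


def merge_elements(xml_fragments, wrapper_tag="merged"):
    """Append each parsed fragment, in order, to a single wrapper element."""
    root = ET.Element(wrapper_tag)  # hypothetical wrapper tag name
    for fragment in xml_fragments:
        root.append(ET.fromstring(fragment))
    return ET.tostring(root, encoding="unicode")


merged = merge_elements([
    '<DIV1 TYPE="act">first</DIV1>',
    '<DIV1 TYPE="act">second</DIV1>',
])
print(merged)
```

In practice the fragments would come from the files in the input folder, sorted so that the merge order is predictable, and the result would be written to the user-specified output file.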
pre-VARDer:
- prepares TCP XML files for VARD
- replaces XML reserved characters (<, >, %) with at signs (@)
- replaces ampersands (&) with the word “and”
- removes XML comments and TEI XML tags that can interrupt words
- transforms non-ASCII characters into ASCII alternatives (e.g., “naïve” to “naive”)
- replaces dashes (—) with two hyphens (--)
- replaces TCP illegible characters (bullet: •) with carets (^)
- replaces TCP unrecognizable punctuation (small black square: ▪) with asterisks (*)
- replaces non-ASCII characters not assigned ASCII equivalents (e.g., pilcrow: ¶) with at signs (@)
- replaces TCP missing word symbol (lozenge in brackets: ◊) with ellipses in parentheses ((…))
- removes TCP end-of-line hyphen characters (vertical bar: |, broken vertical bar: ¦)
- if desired, removes textual symbols converted to at signs (@)
- VEP Unicode Character Substitutions
- TCP Unicode Character Survey
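The character substitutions listed above can be sketched in Python as follows. This is only an illustration of the rules as described, not the pre-VARDer script. It works on text content rather than on raw XML, it uses Unicode decomposition as a stand-in for the VEP substitution table, and it writes the missing-word replacement as three ASCII periods in parentheses. The names `EXPLICIT` and `pre_vard` are hypothetical.

```python
import unicodedata

# Replacements named in the list above (applied to text content, not markup).
EXPLICIT = {
    "&": "and",
    "<": "@", ">": "@", "%": "@",
    "\u2014": "--",     # dash
    "\u2022": "^",      # TCP illegible character (bullet)
    "\u25AA": "*",      # TCP unrecognizable punctuation (small black square)
    "\u25CA": "(...)",  # TCP missing word symbol (lozenge); ASCII form assumed
    "|": "",            # end-of-line hyphen character (vertical bar)
    "\u00A6": "",       # end-of-line hyphen character (broken vertical bar)
}


def pre_vard(text, drop_at_signs=False):
    """Apply the substitutions above, then fall back to ASCII or '@'."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in EXPLICIT:
            out.append(EXPLICIT[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Assumption: decomposition approximates the "ASCII alternative".
            decomposed = unicodedata.normalize("NFKD", ch)
            ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
            out.append(ascii_part or "@")  # no ASCII equivalent: at sign
    result = "".join(out)
    # Optional last step; note it also removes any "@" present in the input.
    return result.replace("@", "") if drop_at_signs else result


print(pre_vard("na\u00EFve \u2014 Tom & Jerry \u00B6"))
```

A lookup table built from a character survey would be more reliable than decomposition for characters whose preferred ASCII form is a matter of editorial choice.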
tei-decoder:
- flexibly eliminates XML tags, their attributes, and their content to produce plain text
- uses a config file that specifies behavior for XML tags, i.e., determines what text to print to a new file
- prints text in lines no longer than 80 characters
- TCP TEI-P4 XML Tag Survey
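The behavior described above, config-driven tag handling followed by wrapping at 80 characters, can be sketched roughly as below. This is a minimal sketch, not the tei-decoder itself: the config is reduced to a hypothetical set of tags whose content is dropped (`DROP_CONTENT`), every other tag is stripped while its text is kept, and the real config file format is not shown here.

```python
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical config: tags whose content is omitted from the plain text.
DROP_CONTENT = {"NOTE", "FIGDESC"}


def decode(xml_text, drop_content=DROP_CONTENT, width=80):
    """Strip tags, drop configured elements, and wrap lines at `width`."""
    root = ET.fromstring(xml_text)

    def collect(elem):
        if elem.tag in drop_content:
            return ""  # drop the tag together with everything inside it
        parts = [elem.text or ""]
        for child in elem:
            parts.append(collect(child))
            parts.append(child.tail or "")  # text following the child tag
        return "".join(parts)

    wrapped = []
    for line in collect(root).splitlines():
        # Keep existing line breaks; only split lines that are too long.
        wrapped.extend(textwrap.wrap(line, width=width) or [""])
    return "\n".join(wrapped)


print(decode("<P>Keep this <NOTE>drop this</NOTE>text.</P>"))
```

Wrapping line by line, rather than reflowing the whole text, is a choice made here to keep the source layout where possible.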
VARD
software that standardizes Early Modern English spelling across corpora
version 2.5.4
trained by Deidre Stuffer on Jonathan Hope’s early modern drama corpora
expands contracted pronouns, elided adverbs, and elided prepositions
standardizes early modern English 2nd and 3rd person singular verb endings to modern equivalents
Correcting Normalization Errors
[Forcing Standardization I]({{< ref "news/forcingstandardization1.md" >}})
Known Issues
1. Hyphenation Discrepancy that occurs in our analytical tools due to how the pipeline preserves end-of-line hyphens in SimpleText files
Explanation: Our plain text generation tools, to the extent they can, preserve the layout of source XML files when generating SimpleText plain text representations. As a result, SimpleText files preserve hyphens that occur at the end of lines in texts. For example, take this intentional, performative hyphen use in poetry from TCP file K012309.000:
Conjunctions, Prepositions, Interjec-
Tions, in blameful negligence. – Ah!
At the moment, our internal-use analytical scripts that generate n-grams treat an end-of-line hyphen as part of the word, joining it to the syllable(s) that continue on the next line. As a result, the algorithm counts “Interjec-Tions” as a word distinct from “interjections”. Future iterations of our tools will work to fix this discrepancy so that end-of-line hyphens are not treated as valid letters within a word.
The tool Ubiqu+Ity removes end-of-line hyphens in its analysis stage, treating “interjec-tions” as “interjections”.
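One way to remove end-of-line hyphens before tokenizing is sketched below. It is an illustrative sketch only and does not reproduce Ubiqu+Ity's implementation or our n-gram scripts; the function name `join_eol_hyphens` is hypothetical, and lowercasing is applied separately.

```python
import re


def join_eol_hyphens(text):
    """Join a word split by a hyphen at the end of a line with its remainder."""
    # A word character, a hyphen, optional spaces, a line break, optional
    # indentation, then another word character: drop everything in between.
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


lines = "Conjunctions, Prepositions, Interjec-\nTions, in blameful negligence."
joined = join_eol_hyphens(lines).lower()
print(joined)
```

Hyphens inside a line are left alone, but a genuine compound that happens to break at a line end would lose its hyphen under this rule.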