Pipeline
Below you will find explanations of the Python scripts that comprise the steps of our text processing pipeline, and a record of known issues.
Also see the GitHub repo for code, data, and explanations.
VEP Scripts
There are three scripts that handle character cleaning, text extraction, and spelling standardization.
characterCleaner prepares TCP XML files to facilitate spelling standardization later in the pipeline.
replaces reserved XML characters (<, >, %) with at-signs (@)
replaces ampersands (&) with the word “and”
removes XML comments and TEI XML tags that can interrupt words
transforms non-ASCII letters into ASCII alternatives (e.
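The character-level steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual characterCleaner code: the function names `strip_comments` and `clean_text` are hypothetical, the sketch assumes it is applied to text content rather than to the markup itself, and it uses Unicode NFKD decomposition as a stand-in for whatever non-ASCII mapping the real script uses. Removal of TEI tags is left out because the tag names are not given here.

```python
import re
import unicodedata

# Matches XML comments, including ones that span several lines.
XML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(xml: str) -> str:
    """Remove XML comments so they cannot split a word in two."""
    return XML_COMMENT.sub("", xml)


def clean_text(text: str) -> str:
    """Apply the character substitutions to a piece of text content.

    Assumption: `text` is character data only; running this on raw
    markup would also replace the angle brackets of the tags.
    """
    # Reserved characters (<, >, %) become at-signs.
    text = re.sub(r"[<>%]", "@", text)
    # Ampersands become the word "and".
    text = text.replace("&", "and")
    # Assumed stand-in for the ASCII mapping: decompose accented
    # letters and drop the combining marks. Characters with no
    # decomposition are left unchanged.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, `clean_text("café & thé")` yields `"cafe and the"`, and `strip_comments("wo<!-- note -->rd")` yields `"word"`.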