The Untranscribable in EEBO

As part of Visualising English Print, I have been evaluating and validating judgments about non-English print in the Text Creation Partnership transcriptions of EEBO. I’ve been looking at texts which have been classified as non-English (or texts that appear to be non-English, such as lists of names or places) by an automated text tagger. Bi- or multilingual texts cause particular difficulties for this task: most of a text can be in English and yet it can still pose problems by containing a relatively high percentage of untaggable words. (Inconsistent orthography is another big difficulty for this task, which is why VEP is working on improving the machine-readability of the TCP texts.)
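
To make the difficulty concrete, here is a minimal sketch of that kind of screening: flag a text when too many of its tokens cannot be matched against an English wordlist. The wordlist, tokeniser, and threshold below are illustrative stand-ins, not VEP’s actual tagger.

    # Sketch only: flag a text as "possibly non-English" when the share of
    # tokens missing from an English wordlist crosses a threshold. A largely
    # English multilingual text can easily trip a naive check like this.
    import re

    def untaggable_ratio(text, english_words):
        tokens = re.findall(r"[a-z]+", text.lower())
        if not tokens:
            return 1.0
        unknown = sum(1 for t in tokens if t not in english_words)
        return unknown / len(tokens)

    def flag_possibly_non_english(text, english_words, threshold=0.4):
        return untaggable_ratio(text, english_words) >= threshold

A bilingual page can be mostly English and still cross a threshold like this, which is exactly why these texts need checking by hand.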

In the case of Early English Books Online, transcribers were given very specific instructions on how – and what – to transcribe. The TCP provides a map of every character available in Unicode. This page is extremely thorough, covering a huge range of language and symbol character sets, including print symbols (❧, ☞, ⁂, ¶), alchemical symbols (♁, ♃, ℥, ☋), diacritics, and non-Latinate alphabetical symbols, including those associated with Greek, Hebrew, and Cyrillic. All of these characters therefore have the potential to be transcribed.

Characters which are considered part of the Classical Roman alphabet are retained, though there are a few exceptions, such as when the source image is obscured by heavy inking or damage to the page. The TCP guidelines also include an entire section devoted to foreign (that is, non-Roman) alphabets. The entire document is linked here, but I’ve replicated the important part below.

  1. “Foreign” (non-Roman) alphabets. Extended text in a non-roman alphabet. Though individual letters (e.g. Greek or Hebrew letters used as manuscript sigla, symbols, reference marks, or abbreviations) should be recorded as special characters, using character entities (see discussion of Characters, below), entire words or extended passages in a non-Roman alphabet (Cyrillic, Hebrew, Greek, Arabic, etc.) should be recorded simply as <GAP DESC="foreign">, without transcribing the word(s) themselves. The tags cannot contain any text, though any notes, milestones, page-breaks, etc. that appear within the passage should be recorded as usual, using <GAP> tags before and after the interrupting milestones as necessary.

    Surrounding structures should be preserved if possible, at the highest level that applies. A line of verse quoted in Greek, for example, should be recorded as <Q><L><GAP DESC="foreign"></L></Q>; a paragraph in Greek as <P><GAP DESC="foreign"></P>; and a stanza in Greek as <LG><GAP DESC="foreign"></LG>.

    [Image: example of mixed Greek-English text]
    Record as: the semicircle .18.5, <GAP DESC="foreign"> .21.7, <GAP DESC="foreign"> .23

  2. The presence of musical notation should be recorded with the <GAP> tag, with the value of the “DESC” attribute assigned as “music”: <GAP DESC="music">.

    Extended spans of music should be captured using a single <GAP> tag, so long as other material (such as text, illustrations, or a page-break) does not interrupt.

    Lyrics printed between lines of music should be recorded as ordinary prose. At every point at which the line of lyrics ends and a line or two of musical notation appears, insert within the running prose a <GAP DESC="music"> tag.
    Any mathematical formulas or mathematical notation too complicated (or too dependent on two-dimensional layout) to be rendered as plain text should be recorded with the <GAP> tag, with the value of the “DESC” attribute assigned as “math.”

  3. Illegible text, missing and damaged text, or clear but unrecognized symbols all will require some attention from us. Illegible text that cannot be read, for whatever reason, should be marked using variations on the “$” symbol:

    $ = individual character or characters, less than a word.
    $word$ = a whole word
    $span$ = any span of two or more words, less than a page.
    $page$ = a whole page.

    Additional variants are possible if it proves useful to flag some other piece of the structure as unreadable, e.g.:

    $para$ = illegible paragraph
    $line$ = illegible line of verse or prose

    Unknown symbols or characters, if they can be distinguished from illegible characters, should preferably be recorded as “#”.

    The illegibility threshold. Two extremes should be avoided as far as possible: (1) using the illegibility markers promiscuously to avoid capturing text about which there is some difficulty; and (2) “creative” capture of text that really cannot be read, simply in order to avoid using the illegibility marker. We have prepared some examples of both overuse (EXAMPLE SET) and underuse (EXAMPLE SET 1; EXAMPLE SET 2; see also the bottom of SET 3) of the illegibility markers. It is admittedly not always easy to tell when a letter can be recognized with sufficient confidence to make its capture reliable.
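
Because untranscribed material is marked explicitly, it is at least countable. The sketch below tallies the <GAP> placeholders in a single transcription and groups them by their DESC value; the element and attribute names follow the guidelines quoted above, but released TEI P5 versions of the texts may instead use a lowercase <gap> with a reason attribute, so treat the file handling as illustrative rather than a parser for every TCP release.

    # Sketch: tally the <GAP> placeholders in one TCP transcription, grouped
    # by DESC value (foreign, music, math, ...). Names follow the guidelines
    # quoted above; the attribute fallbacks are best-effort for other releases.
    from collections import Counter
    import xml.etree.ElementTree as ET

    def count_gaps(path):
        gaps = Counter()
        for el in ET.parse(path).iter():
            tag = el.tag.rsplit("}", 1)[-1]   # strip any XML namespace
            if tag.upper() == "GAP":
                desc = el.get("DESC") or el.get("desc") or el.get("reason")
                gaps[(desc or "unspecified").lower()] += 1
        return gaps

    # count_gaps("A90307.xml")  # hypothetical filename; the trilingual book
    #                           # discussed below should show many 'foreign' gaps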

These restrictions are primarily a matter of expedience: in order to capture lots of text quickly, transcribers were encouraged to give the most attention to Roman symbols, though glyphs like alchemical symbols are important for reading and understanding the texts in question and are therefore retained.[1] Print symbols such as pilcrows are also commonly included in the transcriptions, as again they are useful for navigating the text. However, as Chris Powell and Paul Schaffner very kindly confirmed for me,

The original instruction came to be modified in the course of actual practice. For one thing, majuscule Greek posed nothing like the same difficulties for capture and review as ligatured early modern Greek type did; in fact, it was difficult to think of a rule that would prevent the keyers from capturing it. So we let them do so, and attempted only to correct their tendency to capture Greek “A” as if it were Latin “A” etc., and so on for the other ambiguous glyphs. Lower-case, ligatured Greek we mostly left uncaptured, unless it served some structural purpose, e.g. was part of a title or chapter heading — or unless the TCP editor felt a rare and inexplicable impulse to type it in.

The end result of all this is that much, perhaps most, of the upper-case Greek has been captured, usually correctly, but that the vast majority of lower-case Greek has not been. (and a little bit of Hebrew, again when it served a structural purpose, and was essential for navigation through the book.) Math only when it could readily be represented as running text, or as a TEI/HTML table. Music only when it consisted of individual notes or symbols, no full tablature.

The implication here is that the final TCP texts are Unicode-compliant, but only some of the characters from the full Unicode set make it into the TCP transcriptions. So Greek block letters (Γ, Σ, Φ, Ω) are sometimes transcribed, but the corresponding lowercase ligatured forms (γ, ς, φ, ω) are not.[2] Hebrew characters (א, ק, ש, מ) are very, very rarely transcribed, though Hebrew transliterated into Latinate characters will have been transcribed. Arabic is also notably absent, even though it is available in Unicode and is visible in several EEBO texts.
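
One way to see this pattern in a given transcription is to tally which Unicode scripts its characters actually belong to. The sketch below uses Unicode character names as a rough proxy for script; it is a heuristic illustration, not a complete script classifier.

    # Sketch: tally which non-Roman scripts survive in a transcription, using
    # Unicode character names ("GREEK CAPITAL ...", "GREEK SMALL ...",
    # "HEBREW ...", "ARABIC ...") as a rough proxy for script.
    import unicodedata
    from collections import Counter

    def surviving_scripts(text):
        counts = Counter()
        for ch in text:
            try:
                name = unicodedata.name(ch)
            except ValueError:      # unnamed characters (controls, etc.)
                continue
            if name.startswith("GREEK CAPITAL"):
                counts["greek uppercase"] += 1
            elif name.startswith("GREEK SMALL"):
                counts["greek lowercase"] += 1
            elif name.startswith("HEBREW"):
                counts["hebrew"] += 1
            elif name.startswith("ARABIC"):
                counts["arabic"] += 1
        return counts

    # On a typical transcription, the pattern described above shows up as a
    # non-zero "greek uppercase" count with little or nothing in the other
    # three categories.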

Using the JISC historical books interface (available to UK subscribing institutions), a user can view the TCP transcriptions alongside the page images from EEBO, making it possible to compare the print with the transcriptions. Here are some examples of what I mean when I say ‘untranscribed’.

This trilingual page from TCPID A90307 shows text in Latin, Greek script, and Hebrew:
[Image: A90307 trilingual transcription issue]

As we can see, the Greek and Hebrew print in this text are rendered as […], even though there is printed material there. A second example, a few pages later in the same book, is even more illustrative of this issue: the Arabic page shown below is listed as unavailable, even though there is definitely print there, and the next page, in Latin, is transcribed.

[Image: A90307 Arabic problem]

Even though they are available in the Unicode character set, these scripts are erased completely from the Text Creation Partnership’s texts.

I now want to mention some notable exceptions. As I’ve stressed, very little of the Hebrew is transcribed unless it’s been transliterated into Latinate characters, as in TCPID A37959, a three-column book which offers translations between Latin, Hebrew and Welsh:

[Image: A57259 transliterated Hebrew]

Syriac and related languages rendered in Latinate characters are also transcribed.

And as Paul and Chris suggest, block Greek characters can be transcribed, as this page from TCPID A57729 shows:

[Image: A57729 Greek block capitals transcribed]

but, as the TCP guidelines recommend, the Greek ligatures further down the page are not transcribed and are instead replaced with […]:

[Image: A57729 Greek script untranscribed]

(The transcriptions do not retain the structure of the printed object!)

There’s also at least one fake alphabet in EEBO which is definitely not transcribed, as there are no corresponding Unicode characters. But much of that book (TCPID A57259) is left untranscribed – though to its credit, […] is used for Arabic rather than pretending it’s not there at all. When present, Greek block characters in this book are transcribed too. It even contains this hyper-detailed transcription of a stylised alphabet, suggesting the definition of a Roman letter can be quite flexible:
[Image: decorated initials]

(Decorated initials are also recorded, in case you were curious.)

Of the languages discussed so far, Greek is by far the most common, and a great deal of historical printed Greek has been removed from the TCP corpus as a result of these rules. Although I haven’t yet seen a book printed entirely in ligatured Greek, it wouldn’t be out of the question; plenty of texts I did look at contained lines, passages, paragraphs, or columns in ligature which are untranscribed and therefore eliminated from the corpus. Arabic and Hebrew are far less frequently found in EEBO, but it’s good to know they’re there. There may be other missing languages I don’t know about, because they don’t appear in the partially transcribed multilingual texts and so leave no trace in the transcriptions at all.

For our purposes, this is not necessarily a bad thing. VEP privileges English-language printed material in our pipeline, as our resources are designed to provide high-accuracy ways of visualising, exploring and understanding English-language print from the TCP texts. At least 99% of Early English print is set in Roman characters, and even including Latin, we have a very high level of accuracy and coverage of the TCP data.

Individually, non-Roman glyphs may not represent massive amounts of early printed material, but in aggregate I’d estimate non-English print represents maybe 1% of the entire EEBO-TCP corpus. This is still not a huge – or even especially meaningful – number, unless you study these languages, in which case it is a big loss to you! But we will soon be releasing all the multilingual texts in the TCP collection, including some notes on the languages that do not make it into the transcriptions, so you will be able to conduct your own investigations on these multilingual documents.

+++++++

[1] A small number of alchemical symbols are also available as emoji these days, and confusingly can render as emoji in the transcriptions. They are: ♈ (Aries), ♉ (Taurus), ♊ (Gemini), ♍ (Virgo), ♎ (Libra), ♏ (Scorpio), ♐ (Sagittarius), ♑ (Capricorn), ♒ (Aquarius), ♓ (Pisces), ♌ (Leo). A full list of Unicode-compliant alchemical symbols is available from https://en.wikipedia.org/wiki/Alchemical_symbol#Unicode

[2] For more on lowercase ligature characters in Greek, see the description of Aldine-style characters by Jane Raisch for the JHI: https://jhiblog.org/2016/08/29/greek-to-me-the-hellenism-of-early-print. This essay is an interesting discussion of Greek in Early Modern print more generally, and worth a read if you are interested in the how/why/what of Greek language printing.