tesseract-ocr/langdata

Add Wynn, Eth, and Ash to Middle English script so it can also be used for Old English (Latin)

grantbarrett opened this issue · 1 comments

One of the holes in Tesseract's ability to do quality OCR on historical texts is that it's missing just three characters, and their absence prevents it from reasonably handling Old English texts written in the Latin alphabet.

If you compare the characters available in the Tesseract trained data for Middle English with the character set of Old English using Latin, you'll see the omissions "Æ æ" (ash), "Ð ð" (eth), and "Ƿ ƿ" (wynn).

https://github.com/tesseract-ocr/langdata_lstm/blob/main/enm/enm.unicharset
https://en.wikipedia.org/wiki/Old_English_Latin_alphabet

https://en.wikipedia.org/wiki/Eth
https://en.wikipedia.org/wiki/Wynn
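
The comparison described above can be sketched in a few lines of Python. This is only an illustration, under the assumption that a unicharset file has the entry count on its first line and one entry per following line, each starting with the symbol. The `SAMPLE` text is a made-up, heavily shortened stand-in, not the contents of the real `enm.unicharset`, and the helper names are hypothetical.

```python
# Letters the issue says Old English (Latin) needs, as (upper, lower) pairs.
REQUIRED = {
    "ash": ("Æ", "æ"),
    "eth": ("Ð", "ð"),
    "wynn": ("Ƿ", "ƿ"),
}


def unicharset_symbols(text):
    """Return the set of symbols listed in the text of a unicharset file.

    Assumed layout: line 1 is the entry count; every later line begins
    with the symbol, followed by whitespace-separated properties.
    """
    lines = text.splitlines()[1:]  # skip the count line
    return {line.split()[0] for line in lines if line.strip()}


def missing_letters(text):
    """Map each letter name to the required forms absent from the unicharset."""
    have = unicharset_symbols(text)
    report = {}
    for name, forms in REQUIRED.items():
        absent = [c for c in forms if c not in have]
        if absent:
            report[name] = absent
    return report


# Hypothetical stand-in data for illustration only (NOT the real enm.unicharset).
SAMPLE = "\n".join([
    "5",
    "NULL 0",
    "a 3",
    "D 5",
    "þ 3",
    "æ 3",
])

print(missing_letters(SAMPLE))
```

To check the real file, download it from the link above and pass its contents instead of `SAMPLE`, for example `missing_letters(open("enm.unicharset", encoding="utf-8").read())`.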

Admittedly these three will require quite a bit of training to distinguish them from an AE digraph, D d, and P p, respectively, but of course, that's what we do here!

As you can see on their respective Wikipedia pages, we may already have trained data for ash and eth in other languages (Danish and Norwegian for ash; Icelandic, Faroese, and Khmer for eth), but there are other letter forms that may need to be accounted for, especially for wynn.

If we were able to make these changes, we could rename the Middle English trained data to show that it covers both Middle and Old English (Latin). That would differentiate it clearly from the "enm" three-letter code, and would especially help those who associate "Old English" primarily with blackletter script, which this trained data would not be suitable to handle. (Blackletter OCR can be handled by the tools at https://emop.tamu.edu/, although they are for older versions of Tesseract.)

It is possible to enhance the existing model with those additional glyphs. The original training was done with artificial training data, but I think that you will get better results with transcribed scans from historic books.