
Retro-digitization workflow

Primary language: Python. License: Creative Commons Zero v1.0 Universal (CC0-1.0).

How to Retro-Digitize a Historical Dictionary

Contact: Ben Bongalon (ben@isawika.org)

Retro-digitization is the process of converting a paper-based historical publication into an electronic format suitable for publishing online or for sharing as a digital resource. In this tutorial, you will learn the workflow we developed to digitize a 1953 bilingual dictionary. For details, see our paper "Using Open-Source Tools to Digitize Lexical Resources for Low-Resource Languages" (upcoming).

We designed the workflow so that even those with modest budgets can conduct their own retro-digitization projects. In doing so, we hope to encourage more communities, especially speakers of minority and indigenous languages, to build e-dictionaries and other digital lexical resources for their mother tongues.

What You'll Do

You will use sample pages from Harold Conklin's 1953 Hanunoo-English dictionary. Hanunoo (IPA: "hanunuʔɔ") is an indigenous language spoken by about 25,000 Hanunoo Mangyan people in the Philippines. Although they have a native writing system called Surat Mangyan, the dictionary prints Hanunoo words in Roman letters and denotes their pronunciations with non-Roman letters. These include five vowels with diacritical marks (á é í ó ú), the eng character 'ŋ', and the glottal stop symbol 'ʔ'. Here are two sample entries from the dictionary where you can see them used.

Two sample entries from the Conklin dictionary for the headwords 'agusbakyang' and 'Agustu'.
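To make the special characters concrete, here is a minimal Python sketch that lists their Unicode code points and names, and checks which of them occur in a piece of text. The names `SPECIAL_CHARS`, `describe`, and `find_special` are our own illustrations and are not part of the project's scripts.

```python
import unicodedata

# The characters singled out above: five accented vowels, the eng,
# and the glottal stop. NFC normalization keeps each accented vowel
# as a single precomposed code point.
SPECIAL_CHARS = unicodedata.normalize("NFC", "áéíóúŋʔ")


def describe(chars):
    """Return (character, code point, Unicode name) for each character."""
    return [(c, f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


def find_special(text, special=SPECIAL_CHARS):
    """Return the set of special characters that occur in text."""
    normalized = unicodedata.normalize("NFC", text)
    return {c for c in normalized if c in special}


if __name__ == "__main__":
    for char, code_point, name in describe(SPECIAL_CHARS):
        print(char, code_point, name)
    # The IPA spelling quoted above contains the glottal stop symbol.
    print(find_special("hanunuʔɔ"))
```

A check like this can be handy later when proofreading OCR output, for example to confirm that a transcribed page contains the eng character at all.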


You will train the open-source Tesseract OCR engine to recognize the special character 'ŋ', because no existing engine recognizes it. The glottal stop symbol will be handled differently, and Tesseract already has a language model that recognizes the vowels with diacritical marks. You will also format the OCR-ed pages into XML, then load, edit, and display them in a locally installed Lexonomy dictionary server. How cool is that? :-)

Example dictionary hosted in Lexonomy
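As a rough illustration of the XML formatting step, the sketch below turns one dictionary entry into an XML string using Python's standard library. The element names (`entry`, `headword`, `pronunciation`, `definition`) and the function `entry_to_xml` are assumptions made for this example. They are not the schema used by the project's scripts, and your Lexonomy dictionary may need a different structure.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(headword, pronunciation, definition):
    """Serialize one dictionary entry as an XML string.

    The element names here are hypothetical placeholders.
    """
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "pronunciation").text = pronunciation
    ET.SubElement(entry, "definition").text = definition
    # encoding="unicode" returns a str and keeps characters such as
    # 'ŋ' as-is instead of converting them to numeric references.
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values; substitute the proofread OCR text.
    print(entry_to_xml("Agustu", "(pronunciation)", "(definition)"))
```

Building the tree with `ElementTree` rather than string concatenation means characters such as `&` and `<` in the OCR text are escaped automatically.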

Prerequisites

  1. Computer running Ubuntu 18.04 or later (see Note below)
  2. Python 3 installed
  3. Admin privileges to install software
  4. Familiarity with running commands in a console
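If you want a quick sanity check before starting, a small script along these lines can confirm that you are running Python 3 and that command-line tools such as `git` are on your `PATH`. The function name `check_prerequisites` and the default tool list are assumptions for this sketch. It does not check the Ubuntu version or admin privileges.

```python
import shutil
import sys


def check_prerequisites(tools=("git",)):
    """Return a dict mapping each prerequisite name to True or False."""
    report = {"python3": sys.version_info >= (3,)}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```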

To follow along, clone the Git project into your working directory.

$ git clone https://github.com/isawika/retro-digitization.git
$ cd retro-digitization

Note: The tutorial should run on other Linux systems with only minor tweaks, but we have not tested this. Running on Mac or Windows should also be possible but needs more work. Contact us if you want to discuss.

The Workflow

We follow the technical steps outlined in the DariahTeach project, highlighted as blue-ish boxes below:

workflow diagram showing the 5 steps

  • Step 1: Planning
  • Step 2: Image Capture
  • Step 3: Text Capture
    • 3.1 - Prepare the Training Data
    • 3.2 - Finetune Tesseract (train the OCR language model)
    • 3.3 - Transcribe the dictionary pages with the trained model
    • 3.4 - Proofread the pages
  • Step 4: Data Modeling & Enrichment
  • Step 5: Publish