Web-Based Box Editor for Tesseract OCR

Web app for editing Tesseract OCR LSTM box files in WordStr format, using Tesseract OCR js, leaflet mapping library, and leaflet.draw.

pages-build-deployment

screenshot

Features

  • Automatically detects text lines;
  • Supports for adding, editing, and deleting bounding boxes;
  • May take existing box files;
  • Exports box files and tagged text;
  • Highlights characters according to regex patterns;
  • Notifies of potential issues and errors;
  • Indicates document tagging progress;
  • Shows character Unicode info popup;
  • Allows choosing language model;
  • Shows text suggestion confidence;
  • Displays invisible characters;
  • Has built-in samples;
  • Remembers preferences;
  • Supports Dark Mode.

How to use

  1. Upload an image and wait for bounding boxes to be automatically generated.
  • optionally, upload an existing box file to edit.
  1. Correct generated bounding boxes.
  • add new boxes using the draw tool;
  • remove boxes using the delete tool;
  • edit boxes by clicking on them and dragging the handles, or adjusting the Left, Bottom, Right, and Top input fields.
  1. Type the ground-truth into the text field.
  2. Press Return or the Next button to commit the current text line.
  3. Download the edited box file using the Download button.

Keyboard shortcuts are available to navigate forward control + shift + down and backward control + shift + up. NOTE: Keyboard shortcuts implementation is buggy and needs fixing.

Acknowledgment

This work was supported by a grant from the Ministry of Research, Innovation and Digitization, CCCDI - UEFISCDI, project number PN-III-P2-2.1-PED-2021-0693, within PNCDI III.

Project website: https://transitional-romanian-transliteration.azurewebsites.net


Original Source Code by Johan Junkka.