comics-ocr

A tool for extracting scripts from comic pages using the Tesseract OCR engine. Inspired by the motion comic Rewind's last message. Useful for making something like pages 18-19 of The Transformers: More than Meets the Eye #16.

Supports the image file formats .jpg, .png, .bmp, and .tiff on Windows and Unix systems, and the archive file formats .rar, .cbr, and .zip on Unix systems only. The Tesseract OCR engine is used without additional training, but it can be trained if needed.

comics-ocr
Prerequisites
Installation
Compatibility
Usage

Prerequisites

Installation

python setup.py install

Compatibility

Supports Python 2.7 and 3.6+.

Usage

See here for a more detailed example (using a simplified version of the tool).

Using as command-line tool

usage: comicsocr [-h] [--paths PATHS [PATHS ...]] [--output-path OUTPUT_PATH] [--config CONFIG]

Tool to extract scripts from comic pages.

optional arguments:
  -h, --help            show this help message and exit
  --paths PATHS [PATHS ...]
                        Paths to comic image files, archive files or directories containing comic image files. Supported file formats (Windows and Unix):
                        .jpg, .png, .bmp, .tiff. Supported archive file formats (Unix only): .rar, .cbr, .zip.
  --output-path OUTPUT_PATH
                        Path to write the comic scripts to.
  --config CONFIG       Configurations.

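E.g., reading a single image file on Windows and writing the result to a text file produces log output like the following: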

[2020-07-20 22:47:58,252] INFO [api.py:54:read_from_file] Reading from file: C:\Users\largecats\Fun\programming\personal-projects\comics-ocr\test\test.jpg
[2020-07-20 22:47:59,299] INFO [reader.py:72:read] 'a ela a'
[2020-07-20 22:48:02,704] INFO [reader.py:72:read] 'THE LAW GAYS THISSORT OF THING HAS TOBE DECLARED ON-SITE.FORMALITIES.'
[2020-07-20 22:48:04,556] INFO [reader.py:72:read] "I DON'T UNDERSTAND WHYWE HAVE TO BE HERE. CAN'TWE FUST... PUSH A BUTTONAND BE DONE VUITH IT?"
[2020-07-20 22:48:05,359] INFO [reader.py:72:read] 'MINING OUTPOST C-12.'
[2020-07-20 22:48:06,166] INFO [reader.py:72:read] 'LONG AGO. PEACETIME.'
[2020-07-20 22:48:07,025] INFO [reader.py:72:read] 'THE CYBERTRON SYSTEM.Zs'
[2020-07-20 22:48:10,287] INFO [reader.py:72:read] 'Pinto d 3 ABO adieSoa an eee'
[2020-07-20 22:48:10,288] INFO [api.py:74:write_to_file] Writing to: C:\Users\largecats\Fun\programming\personal-projects\comics-ocr\test\result.txt
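For readers who want to see how the three options in the help text fit together, the following is a minimal argparse sketch that reproduces the same interface. It is an illustration only, not the tool's actual source: the function name build_parser and the example paths are hypothetical, and the real tool may define its parser differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented comicsocr options (sketch only)."""
    parser = argparse.ArgumentParser(
        prog="comicsocr",
        description="Tool to extract scripts from comic pages.",
    )
    # One or more image files, archive files, or directories.
    parser.add_argument(
        "--paths",
        nargs="+",
        help="Paths to comic image files, archive files or directories "
             "containing comic image files.",
    )
    # argparse stores this option as args.output_path.
    parser.add_argument(
        "--output-path",
        help="Path to write the comic scripts to.",
    )
    parser.add_argument("--config", help="Configurations.")
    return parser


# Hypothetical invocation: two input paths and one output file.
args = build_parser().parse_args(
    ["--paths", "test/test.jpg", "pages", "--output-path", "test/result.txt"]
)
print(args.paths)
print(args.output_path)
```

Because --paths uses nargs="+", several files and directories can be passed in one call, which matches the PATHS [PATHS ...] form shown in the usage line.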

Using as Python library

Call api.read_from_file, api.read_from_archive_file, or api.read_from_directory to read from a single image file, a single archive file, or a directory containing image files or archive files of images, respectively.
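Choosing between the three functions depends on what the path points to. The sketch below shows one way to make that choice from the file extension, using the supported formats listed earlier. The helper pick_reader is hypothetical and is not part of comics-ocr. Since the arguments of the api functions are not documented here, the helper only returns the name of the function to call rather than calling it.

```python
import os

# Supported formats as listed in this README.
IMAGE_EXTENSIONS = {".jpg", ".png", ".bmp", ".tiff"}
ARCHIVE_EXTENSIONS = {".rar", ".cbr", ".zip"}  # Unix only


def pick_reader(path, is_dir=None):
    """Return the name of the api function suited to `path` (hypothetical helper).

    `is_dir` can be passed explicitly to skip the filesystem check.
    """
    if is_dir is None:
        is_dir = os.path.isdir(path)
    if is_dir:
        return "read_from_directory"
    # Compare extensions case-insensitively, so "PAGE.JPG" is accepted.
    extension = os.path.splitext(path)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        return "read_from_file"
    if extension in ARCHIVE_EXTENSIONS:
        return "read_from_archive_file"
    raise ValueError("Unsupported file format: {}".format(path))


print(pick_reader("test/test.jpg", is_dir=False))
print(pick_reader("issues/issue16.cbr", is_dir=False))
print(pick_reader("issues", is_dir=True))
```

The returned name could then be looked up on the api module with getattr, once the arguments each function expects have been checked against the source.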