Chinese documentation

README_CN.md

Online Demo

https://touhou.ai/imgtrans/

Note that the demo may occasionally be unavailable because Google Cloud Platform keeps restarting my instance. If that happens, please wait for me to restart the service, which may take up to 24 hours.
Note that this online demo runs the current main branch version.

Changelogs

2021-08-21

New MST-based text region merge algorithm; text region merging is greatly improved
Add Baidu translator in demo mode
Add Google translator in demo mode
Various bugfixes

2021-07-29

Web demo adds translator, detection resolution and target language options
Slight text color extraction improvement

2021-07-26

Major upgrades for all components, now we are in beta!
Note that in this version all English text is detected as capital letters.
You need Python >= 3.8 for cached_property to work.

Detection model upgrade
OCR model upgrade, better at text color extraction
Inpainting model upgrade
Major text rendering improvement, faster rendering and higher quality text with shadow
Slight mask generation improvement
Various bugfixes
Default detection resolution has been dialed back to 1536 from 2048

2021-07-09

Fix erroneous image rendering when inpainting is not used

2021-06-18

Support manual translation
Support detection and rendering of angled texts

2021-06-13

Text mask completion is now based on CRF; mask quality is drastically improved

2021-06-10

Improve text rendering

2021-06-09

New text region based text direction detection method
Support running demo as web service

2021-05-20

Text detection model is now based on DBNet with ResNet34 backbone
OCR model is now trained with more English sentences
Inpainting model is now based on AOT, which requires far less memory
Default inpainting resolution is now increased to 2048, thanks to the new inpainting model
Support merging hyphenated English words

2021-05-11

Add Youdao translator and set it as the default translator

2021-05-06

Text detection model is now based on DBNet with ResNet101 backbone
OCR model is now deeper
Default detection resolution has been increased to 2048 from 1536

Note that this version is slightly better at handling English text; in every other way it is worse

2021-03-04

Added inpainting model

2021-02-17

First version launched

Translate texts in manga/images

Some manga/images will never be translated, which is why this project was born.
Primarily designed for translating Japanese text, but also supports Chinese and English
Supports inpainting and text rendering
Successor to https://github.com/PatchyVideo/MMDOCR-HighPerformance

How to use

Python>=3.8
Clone this repo
Download ocr.ckpt, detect.ckpt and inpainting.ckpt, and put them in the root directory of this repo
[Optional if using Google translate] Apply for the Youdao translate API, then put your APP_KEY and APP_SECRET in translators/key.py
Run python translate_demo.py --image <path_to_image_file> [--use-inpainting] [--verbose] [--use-cuda] [--translator=google] [--target-lang=CHS]. The result can be found in result/. Add --use-inpainting to enable inpainting and --use-cuda to use CUDA.
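
If you want to drive the command-line mode from a script, a minimal sketch could look like the following. It only assembles the flags listed above; the helper name build_translate_command and the file name page_001.png are made up for illustration and are not part of this repo.

```python
import shlex


def build_translate_command(image_path, use_inpainting=False, verbose=False,
                            use_cuda=False, translator=None, target_lang=None,
                            python="python"):
    """Assemble the argument list for translate_demo.py (hypothetical helper)."""
    cmd = [python, "translate_demo.py", "--image", image_path]
    if use_inpainting:
        cmd.append("--use-inpainting")
    if verbose:
        cmd.append("--verbose")
    if use_cuda:
        cmd.append("--use-cuda")
    if translator:
        cmd.append(f"--translator={translator}")
    if target_lang:
        cmd.append(f"--target-lang={target_lang}")
    return cmd


if __name__ == "__main__":
    cmd = build_translate_command("page_001.png", use_inpainting=True,
                                  translator="google", target_lang="CHS")
    # Print the command; to actually run it from the repo root, pass the
    # list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems with image paths that contain spaces.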

How to use (web service mode)

Python>=3.8
Clone this repo
Download ocr.ckpt, detect.ckpt and inpainting.ckpt, and put them in the root directory of this repo
[Optional if using Google translate] Apply for the Youdao translate API, then put your APP_KEY and APP_SECRET in translators/key.py
Run python translate_demo.py --mode web [--use-inpainting] [--verbose] [--use-cuda] [--translator=google] [--target-lang=CHS]. The demo will be served at http://127.0.0.1:5003

Two modes of translation service are provided by the demo: synchronous mode and asynchronous mode.
In synchronous mode your HTTP POST request finishes once the translation task is finished.
In asynchronous mode your HTTP POST request returns a task_id immediately; you can use this task_id to poll for the translation task state.

Synchronous mode

POST a form request with form data file:<content-of-image> to http://127.0.0.1:5003/run
Wait for response
Use the resultant task_id to find translation result in result/ directory, e.g. using Nginx to expose result/
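
The synchronous steps above can be sketched with the Python standard library alone. This is an illustration, not the project's client: the helper names are made up, the upload file name image.png is a placeholder, and because the response format is not described here the sketch simply returns the raw response text.

```python
import urllib.request
import uuid


def encode_multipart_file(field_name, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def run_sync(image_path, base_url="http://127.0.0.1:5003"):
    """POST the image as form field "file" to /run and block until done."""
    with open(image_path, "rb") as f:
        body, content_type = encode_multipart_file("file", "image.png", f.read())
    request = urllib.request.Request(
        base_url + "/run",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # The call returns only when the translation task has finished.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

The same encode_multipart_file helper can be reused for /submit and /manual-translate, since they take the same form field.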

Asynchronous mode

POST a form request with form data file:<content-of-image> to http://127.0.0.1:5003/submit
Acquire translation task_id
Poll for translation task state by posting JSON {"taskid": <task-id>} to http://127.0.0.1:5003/task-state
Translation is finished when the resultant state is either finished, error or error-lang
Find translation result in result/ directory, e.g. using Nginx to expose result/
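
A polling loop for the asynchronous mode might look like the sketch below. The three terminal states come from the list above; everything else is an assumption, in particular that the /task-state reply is a JSON object with a "state" field. Adjust fetch_task_state to whatever the server actually returns.

```python
import json
import time
import urllib.request

# States after which the task will not change any more (from the list above).
TERMINAL_STATES = {"finished", "error", "error-lang"}


def fetch_task_state(task_id, base_url="http://127.0.0.1:5003"):
    """POST {"taskid": ...} to /task-state and return the reported state."""
    payload = json.dumps({"taskid": task_id}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/task-state",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read().decode("utf-8"))
    # Assumption: the reply carries the state under the key "state".
    return reply["state"]


def wait_for_task(task_id, fetch=fetch_task_state, interval=1.0, max_polls=600):
    """Poll until the task reaches a terminal state, then return that state."""
    for _ in range(max_polls):
        state = fetch(task_id)
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

Passing the fetch function as a parameter makes the loop easy to test without a running server, and max_polls keeps a stuck task from blocking the caller forever.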

Manual translation

Manual translation replaces machine translation with human translators

POST a form request with form data file:<content-of-image> to http://127.0.0.1:5003/manual-translate
Wait for response
You will obtain a JSON response like this:

{
    "task_id": "12c779c9431f954971cae720eb104499",
    "status": "pending",
    "trans_result": [
        {
            "s": "☆上司来ちゃった……",
            "t": ""
        }
    ]
}

Fill in translated texts

{
    "task_id": "12c779c9431f954971cae720eb104499",
    "status": "pending",
    "trans_result": [
        {
            "s": "☆上司来ちゃった……",
            "t": "☆Boss is here..."
        }
    ]
}

Post translated JSON to http://127.0.0.1:5003/post-translation-result
Wait for response
Find translation result in result/ directory, e.g. using Nginx to expose result/
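
The fill-in step can be scripted. The sketch below uses the JSON shape shown above ("trans_result" entries with "s" for source text and "t" for translation); the helper names and the idea of supplying translations as a source-to-target dictionary are illustrative assumptions, not part of this repo.

```python
import copy
import json
import urllib.request


def fill_translations(task, translations):
    """Return a copy of the task with "t" filled in from a source->target dict."""
    filled = copy.deepcopy(task)
    for entry in filled["trans_result"]:
        if entry["s"] in translations:
            entry["t"] = translations[entry["s"]]
    return filled


def post_translation_result(filled_task, base_url="http://127.0.0.1:5003"):
    """POST the filled-in JSON to /post-translation-result, return raw reply."""
    request = urllib.request.Request(
        base_url + "/post-translation-result",
        data=json.dumps(filled_task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# The response from /manual-translate shown above.
task = {
    "task_id": "12c779c9431f954971cae720eb104499",
    "status": "pending",
    "trans_result": [{"s": "☆上司来ちゃった……", "t": ""}],
}
filled = fill_translations(task, {"☆上司来ちゃった……": "☆Boss is here..."})
print(json.dumps(filled, ensure_ascii=False, indent=4))
```

Entries whose source text is missing from the dictionary keep their empty "t", so you can see which lines still need a translator before posting.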

This is a hobby project, you are welcome to contribute

Currently this is only a simple demo and many imperfections exist. We need your support to make this project better!

Next steps

What needs to be done

Inpainting is based on Aggregated Contextual Transformations for High-Resolution Image Inpainting
IMPORTANT!!! HELP NEEDED!!! The current text rendering engine is barely usable, we need your help to improve text rendering!
Text rendering area is determined by detected text lines, not speech bubbles. This works for images without speech bubbles, but makes it impossible to decide where to put translated English text. I have no idea how to solve this.
Ryota et al. proposed using multimodal machine translation, maybe we can add ViT features for building custom NMT models.
Make this project work for video (rewrite the code in C++ and use GPU/other hardware NN accelerators). This would be used for detecting hard subtitles in videos, generating ASS files and removing the subtitles completely.
~~Mask refinement using non-deep-learning algorithms, I am currently testing out a CRF-based algorithm.~~
~~Angled text region merge is not currently supported~~