subtitles_extract

Tool for extracting hardcoded Chinese subtitles from 720p (1280 × 720) video files, based on the EasyOCR tool by JaidedAI.

Inspired by Entrepreneurial Age/创业时代 (2018)

Download:

git clone https://github.com/krviolent/subtitles_extract.git
or click Code -> Download ZIP and extract the archive

Install requirements:

OS: Windows 10/WSL
Instructions: Enable and install WSL

Install python3, ffmpeg, easyocr (https://github.com/JaidedAI/EasyOCR):
sudo apt install python3
sudo apt install ffmpeg
git clone https://github.com/JaidedAI/EasyOCR.git
cd EasyOCR
sudo python3 setup.py install

Use:

Tested on WSL Ubuntu 20.04. I ran into some difficulties running CUDA on Windows to use the GPU for OCR.

	bash scripts/run_extract_subs.sh [video.mp4] [episode_number] [duration_of_video_in_seconds] [frame_rate]
	[duration_of_video_in_seconds] - optional argument
	[frame_rate] = 1
Example:
	bash scripts/run_extract_subs.sh video_ep34.mp4 34 2600
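If you do not know the duration to pass as the third argument, it can be read from the file with ffprobe, which usually ships with the ffmpeg package. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and rounding up to whole seconds is an assumption about what the script expects.

```python
import math
import subprocess


def parse_duration_seconds(ffprobe_output: str) -> int:
    """Round ffprobe's fractional duration (e.g. '2580.48') up to whole seconds."""
    return math.ceil(float(ffprobe_output.strip()))


def probe_duration_seconds(video_path: str) -> int:
    """Ask ffprobe for the container duration of video_path, in seconds."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            video_path,
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if ffprobe fails, e.g. the file is missing
    )
    return parse_duration_seconds(result.stdout)
```

For example, `probe_duration_seconds("video_ep34.mp4")` returns an integer that can be passed as [duration_of_video_in_seconds].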
Split subs_file_[EP].txt into timestamps.txt and textonly.txt:
	python3 scripts/divide_timestamp_and_text.py [episode_number]

Steps to extract subtitles into the text file:

1. crop.sh -> frame_xx/*.jpg
2. Run OCR with the duration in seconds (2580 is 43 minutes; 2600 is OK):
	python3 easyocr_test.py [episode_number] [duration_in_seconds]
	Output is saved to:
		subs/subs_file_[episode_number].txt
		subs/EP.A.[episode_number]/subs_[episode_number].srt
3. Auto-translate obtained subs using https://translatesubtitles.co/
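Step 2 writes an .srt file, so each recognized frame needs a start and end time. The sketch below shows one way a frame number could map to an SRT cue. It is not taken from easyocr_test.py: the function names are hypothetical, and it assumes frame 0 corresponds to time 0 and that each frame covers 1 / frame_rate seconds.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, frame_number: int, text: str, frame_rate: float = 1.0) -> str:
    """Build one SRT cue for the frame at frame_number (assumed to start at 0)."""
    start = frame_number / frame_rate
    end = (frame_number + 1) / frame_rate
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```

With a frame rate of 1, frame 2580 starts at 00:43:00,000, which matches the 43 minute mark mentioned in step 2.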

Optional (replace names, for example):

bash scripts/replace.sh

Command to replace A with B:
sed -i -e 's/[A]/[B]/g' subs_file.srt
This might not work quite right.
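One possible reason is that sed treats the pattern as a regular expression, so characters such as `.` or `[` in a name are not matched literally. As an alternative, here is a small Python sketch that does plain literal replacement. It is not part of this repository; the function name and the example names are hypothetical.

```python
from typing import Dict


def replace_names(text: str, replacements: Dict[str, str]) -> str:
    """Replace each key of replacements with its value, matching literally."""
    # Longest keys first, so a short name does not clobber part of a longer one.
    for old in sorted(replacements, key=len, reverse=True):
        text = text.replace(old, replacements[old])
    return text


def replace_names_in_file(path: str, replacements: Dict[str, str]) -> None:
    """Rewrite a subtitle file in place (assumes it is UTF-8 encoded)."""
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(replace_names(content, replacements))
```

For example, `replace_names_in_file("subs_file.srt", {"李明": "Li Ming"})` would replace every occurrence of that name in the file.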

Info

Duplicate subtitles are not removed during extraction, because the same phrase may be repeated later in the video.
Recognition accuracy is also sometimes poor.
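If you want to reduce duplicates yourself, one option is to merge only runs of identical text in adjacent frames, which keeps phrases that are repeated later in the video. The sketch below is not part of this repository. It assumes the OCR results are available as (frame_number, text) pairs in frame order; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def merge_adjacent_duplicates(
    frames: Iterable[Tuple[int, str]],
) -> List[Tuple[int, int, str]]:
    """Collapse identical text in adjacent frames into (first, last, text)."""
    merged: List[Tuple[int, int, str]] = []
    for number, text in frames:
        # Extend the previous run only if the text matches and there is no gap.
        if merged and merged[-1][2] == text and merged[-1][1] == number - 1:
            first, _, _ = merged[-1]
            merged[-1] = (first, number, text)
        else:
            merged.append((number, number, text))
    return merged
```

Because OCR output can differ slightly from frame to frame, exact matching will miss some duplicates; this only handles the identical case.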