Subtitles Extraction

The key-frame extraction approach comes from Amanpreet Walia.

This project extracts subtitles from a video. First, key frames are extracted from the video; then the subtitle area of each frame is cropped, and the text is recognized with OCR.
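The crop-and-recognize steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the assumption that subtitles sit in the bottom quarter of the frame, and the default `lang="eng"` are all hypothetical. Only `pytesseract.image_to_string(img, lang=...)` is taken from the dependency list.

```python
import numpy as np


def crop_subtitle_area(frame: np.ndarray, height_ratio: float = 0.25) -> np.ndarray:
    """Return the bottom `height_ratio` share of a frame (H x W or H x W x C array).

    Assumption: subtitles are rendered at the bottom of the picture.
    """
    if not 0.0 < height_ratio <= 1.0:
        raise ValueError("height_ratio must be in (0, 1]")
    height = frame.shape[0]
    top = height - max(1, int(round(height * height_ratio)))
    return frame[top:, ...]


def recognize_text(region: np.ndarray, lang: str = "eng") -> str:
    """Run Tesseract on a cropped region.

    Requires both the pytesseract package and the Tesseract binary.
    """
    # Imported lazily so that cropping still works without OCR installed.
    import pytesseract

    return pytesseract.image_to_string(region, lang=lang).strip()
```

With a frame read through OpenCV (for example from `cv2.VideoCapture(...).read()`), `recognize_text(crop_subtitle_area(frame))` would return the recognized subtitle text for that frame.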

Getting Started

Install the following dependencies

OpenCV-Python (basic video processing: reading the frame stream, cropping, frame differencing, processing GUI)
PyTesseract (only its image_to_string(img, lang) function is used)
NumPy (smoothing filter)
SciPy (signal.argrelextrema)
StrsimPy (NormalizedLevenshtein string similarity)
Matplotlib (stem plot of frame differences)
ProgressBar
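To illustrate how the NumPy smoothing filter and SciPy's `signal.argrelextrema` from the list above might fit together, here is a hedged sketch of key-frame selection. It assumes a per-frame difference score has already been computed (for instance with OpenCV), smooths it with a moving average, and treats local maxima as key frames. The function names, the moving-average filter, and the window length of 5 are assumptions; the project may smooth differently.

```python
import numpy as np
from scipy.signal import argrelextrema


def smooth(values, window_len: int = 5) -> np.ndarray:
    """Moving-average smoothing that keeps the input length (reflected edges)."""
    values = np.asarray(values, dtype=float)
    if window_len % 2 == 0:
        window_len += 1  # an odd window keeps the output aligned with the input
    if window_len < 3 or values.size < window_len:
        return values
    half = window_len // 2
    padded = np.pad(values, half, mode="reflect")
    kernel = np.ones(window_len) / window_len
    return np.convolve(padded, kernel, mode="valid")


def key_frame_indices(frame_diffs, window_len: int = 5) -> np.ndarray:
    """Indices where the smoothed frame difference has a strict local maximum."""
    smoothed = smooth(frame_diffs, window_len)
    return argrelextrema(smoothed, np.greater)[0]
```

Because `np.greater` is a strict comparison, a perfectly flat plateau in the smoothed signal yields no maximum. Real frame differences are rarely exactly flat, but this is worth keeping in mind when testing on synthetic data.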

Install any missing dependencies with pip install -r requirements.txt
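Consecutive key frames often show the same subtitle, with small OCR differences between them. The dependency list names StrsimPy's NormalizedLevenshtein for string similarity. The sketch below uses a pure-Python stand-in instead, so it runs without extra packages. It assumes similarity is defined as 1 minus the edit distance divided by the longer string's length. The 0.8 threshold and all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def drop_near_duplicates(lines, threshold: float = 0.8):
    """Keep a line only if it differs enough from the previously kept line."""
    kept = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip frames where OCR found nothing
        if kept and similarity(kept[-1], text) >= threshold:
            continue  # near-duplicate of the previous subtitle
        kept.append(text)
    return kept
```

Comparing only against the last kept line is a deliberate simplification: subtitles appear in order, so duplicates are expected to be adjacent.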

Install Tesseract OCR

Download and install it, then check that it runs. Optionally select additional language support during installation; tesseract --list-langs shows the installed languages.
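As a quick check that Tesseract is installed and visible on the PATH, the language listing can also be read from Python. This is a sketch under assumptions: the helper names are hypothetical, and the parser assumes the output starts with a "List of available languages ..." header followed by one language code per line. The exact header wording varies between Tesseract versions.

```python
import shutil
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line; the remaining lines are language codes.
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages() -> List[str]:
    """Return installed Tesseract languages, or [] if the binary is not found."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Some older versions print the listing to stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

An empty list from `installed_languages()` suggests that Tesseract is not installed or not on the PATH.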

Run

λ python extract_subtitles.py <videopath>

License

This project is licensed under the MIT License - see LICENSE for details
