audiocorpusbuilder

Command-line package for automatically creating a Russian-language audio corpus (speech-text pairs) from YouTube audio tracks and subtitles.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

About

The audiocorpusbuilder package automatically creates a Russian-language audio corpus from YouTube playlists: it downloads each video's audio and subtitles, builds "sound-text" pairs, and saves them to a directory. If a video has no subtitles, audiocorpusbuilder skips it.
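The skipping behaviour described above can be illustrated with a minimal sketch. It is not the package's actual code: the function name `pair_audio_with_subtitles` is hypothetical, and it assumes that an audio file and its subtitle file share the same file name stem.

```python
from pathlib import Path


def pair_audio_with_subtitles(audio_paths, subtitle_paths):
    """Match audio files to subtitle files by file name stem (an assumption).

    Returns (pairs, skipped): pairs is a list of (audio, subtitle) paths,
    skipped lists the audio files that have no matching subtitle file.
    """
    # Index subtitle files by stem, e.g. "Subs/a.vtt" -> "a".
    subs_by_stem = {Path(p).stem: Path(p) for p in subtitle_paths}
    pairs = []
    skipped = []
    for audio in map(Path, audio_paths):
        subtitle = subs_by_stem.get(audio.stem)
        if subtitle is None:
            # No subtitles for this track: leave it out of the corpus.
            skipped.append(audio)
        else:
            pairs.append((audio, subtitle))
    return pairs, skipped
```

A real pipeline would also have to align subtitle timings with the audio; this sketch only shows the file-level matching and the rule that tracks without subtitles are left out.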

Installing

Installation requires Python 3.6 or later and a Linux operating system on your local machine.

You can install it with these commands:

git clone https://github.com/dangrebenkin/audiocorpusbuilder.git
cd audiocorpusbuilder
python3 setup.py install

Start

Before running audiocorpusbuilder, prepare three directories: one for audio tracks, one for subtitles, and one for results (paths should look like '/home/Audio/'). Also create a playlists.txt file containing the playlist links, one link per line.
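The preparation step can be scripted. The following is a minimal sketch, not part of the package: the helper name `prepare_workspace` and the directory names Audio, Subs and Results are assumptions chosen to match the example command, and the playlist links are supplied by the caller.

```python
from pathlib import Path


def prepare_workspace(base_dir, playlist_urls):
    """Create the three working directories and write playlists.txt.

    The directory names are illustrative; any names can be used as long
    as the same paths are passed to acbr.
    """
    base = Path(base_dir)
    dirs = {name: base / name for name in ("Audio", "Subs", "Results")}
    for path in dirs.values():
        # exist_ok=True makes the helper safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
    playlists_file = base / "playlists.txt"
    # One playlist link per line, as the tool expects.
    playlists_file.write_text("\n".join(playlist_urls) + "\n", encoding="utf-8")
    return dirs, playlists_file
```

Calling `prepare_workspace("/home/user/corpus", links)`, where `links` is your own list of playlist URLs and the path is a placeholder, yields the directories and the playlists.txt file to pass to the command.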

Arguments

All of the following arguments are required.

  1. -p URL_list

Path to the txt file with playlist links.

  2. -a directory_audio

Path where audio tracks are downloaded.

  3. -s directory_subtitles

Path where subtitles are downloaded.

  4. -r directory_results

Path where results are saved.
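For readers who want to see how such an interface behaves, here is a minimal `argparse` sketch that mirrors the four flags above. It is an illustration only and is not taken from the package source; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the four required options listed above."""
    parser = argparse.ArgumentParser(
        prog="acbr",
        description="Illustrative parser mirroring the acbr options.",
    )
    parser.add_argument("-p", dest="URL_list", metavar="URL_list",
                        required=True, help="path to the playlists txt file")
    parser.add_argument("-a", dest="directory_audio", metavar="directory_audio",
                        required=True, help="path for downloaded audio tracks")
    parser.add_argument("-s", dest="directory_subtitles",
                        metavar="directory_subtitles",
                        required=True, help="path for downloaded subtitles")
    parser.add_argument("-r", dest="directory_results",
                        metavar="directory_results",
                        required=True, help="path for results")
    return parser
```

Because every option is marked `required=True`, omitting any of them makes the parser print a usage message and exit, which matches the statement that all arguments are required.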

Usage

acbr [-p URL_list] [-a directory_audio] [-s directory_subtitles] [-r directory_results]

Example

acbr -p playlists.txt -a Audio -s Subs -r Results