/open_stt

Russian open STT dataset

License: CC BY-NC 4.0

Russian Open Speech To Text (STT/ASR) Dataset

Arguably the largest public Russian STT dataset to date:

  • ~7m utterances (1-2m with less perfect annotation, see #7);
  • ~7000 hours;
  • 855 GB (in .wav format in int16);
  • (new!) A new domain - radio;
  • (new!) A larger YouTube dataset with 1000+ additional hours;
  • (new!) A small (300 hours) YouTube dataset downloaded in maximum quality;
  • (new!) 18 hours in 3 validation sets for YouTube / books / public calls with ground truth annotation;

Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.

Planned releases:

  • 1000-10,000 additional hours of books;
  • Data quality distillation and improvement / annotation improvement;
  • EVEN MOAR DATA (give us your ideas where to find it!);
  • 1000+ additional hours of YouTube;
  • Some validation / test sets;
  • Plain benchmarks, "bad files";
  • Mp3 torrent;
  • Wav torrent;
  • Radio set;
  • ... and more!;

Dataset composition

| Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
|---|---|---|---|---|---|---|---|
| audiobook_2 | 1,149,404 | 1,511 | 162 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
| radio_2 | 651,645 | 1,439 | 154 | 7.95s / 110 | Radio | Alignment (*) | TBC, should be high |
| public_youtube1120 | 1,410,979 | 1,104 | 237 | 2.82s / 34 | YouTube videos | Subtitles | 95% / ~crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | YouTube videos | Subtitles | 95% / ~crisp |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
| public_youtube1120_hq | 369,245 | 291 | 31 | 2.84s / 37 | YouTube videos HQ sound | Subtitles | 95% / ~crisp |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
| asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | YouTube videos | Subtitles | 95% / ~crisp |
| ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
| asr_calls_2_val | 12,950 | 7.7 | 2 | 2.15s / 34 | Phone calls | Manual annotation | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
| buriy_audiobooks_2_val | 7,850 | 4.9 | 1 | 2.25s / 31 | Books | Manual annotation | 99% / crisp |
| public_youtube700_val | 7,311 | 4.5 | 1 | 2.2s / 35 | YouTube videos | Manual annotation | 99% / crisp |
| Total | 7,117,271 | 6,812 | 855 | | | | |

(*) Automatic alignment

This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.

Updates

Update 2019-06-28

New train datasets added:

  • 1,439 hours radio_2;
  • 1,104 hours public_youtube1120;
  • 291 hours public_youtube1120_hq;

New validation datasets added:

  • 8 hours asr_calls_2_val;
  • 5 hours buriy_audiobooks_2_val;
  • 5 hours public_youtube700_val;

Update 2019-05-19

Also shared a wav version via torrent.

Update 2019-05-13

Added the forgotten txt files to mp3 archives. Updating the torrent.

Update 2019-05-12

Torrent created and uploaded to academictorrents.

Update 2019-05-10

Quickly converted the dataset to MP3 thanks to the community! Waiting for our account for academic torrents to be approved. v0.4 will boast MP3 download links.

Update 2019-05-07 Help needed!

If you want to support the project, you can:

  • Help us with hosting (create a mirror) / provide a reliable node for torrent;
  • Help us with writing some helper functions;
  • Donate (each coffee pays for several full downloads) / use our DO referral link to help;

We are converting the dataset to MP3 now. Please contact us using the contacts below if you would like to help.

Downloads

Via torrent

Save us a couple of bucks, download via torrent:

  • An MP3 version of the dataset (v3), to be updated;
  • A WAV version of the dataset (v5);

You can download separate files via torrent. Try several torrent clients if some do not work.

Links

Meta data file.

| Dataset | GB, wav | GB, mp3 | Wav | Mp3 | Source | Manifest |
|---|---|---|---|---|---|---|
| audiobook_2 | 162 | 21.0 | torrent | part1 | Sources from the Internet + alignment | link |
| radio_2 | 154 | 25.7 | torrent | part1 | Radio | link |
| public_youtube1120 | 237 | 32.4 | torrent | part1 | YouTube videos | link |
| asr_public_phone_calls_2 | 66 | 7.5 | torrent | part1 | Sources from the Internet + ASR | link |
| public_youtube1120_hq | 31 | 8.6 | torrent | part1 | YouTube videos | link |
| asr_public_stories_2 | 9 | 1.1 | torrent | part1 | Sources from the Internet + alignment | link |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | torrent | part1 | TTS | link |
| public_youtube700 | 75.0 | 9.6 | torrent | part1 | YouTube videos | link |
| asr_public_phone_calls_1 | 22.7 | 2.6 | torrent | part1 | Sources from the Internet + ASR | link |
| asr_public_stories_1 | 4.1 | 0.5 | torrent | part1 | Public stories | link |
| public_series_1 | 1.9 | 0.2 | torrent | part1 | Public series | link |
| ru_RU | 1.9 | 0.2 | torrent | part1 | Caito.de dataset | link |
| voxforge_ru | 1.9 | 0.2 | torrent | part1 | Voxforge dataset | link |
| russian_single | 0.9 | 0.1 | torrent | part1 | Russian single speaker dataset | link |
| asr_calls_2_val | 2 | 0.2 | torrent | part1 | Sources from the Internet | link |
| public_lecture_1 | 0.7 | 0.1 | torrent | part1 | Sources from the Internet + manual | link |
| buriy_audiobooks_2_val | 1 | 0.15 | torrent | part1 | Books + manual | link |
| public_youtube700_val | 2 | 0.13 | torrent | part1 | YouTube videos + manual | link |
| Total | 855 | 87.5 | | | | |

Download instructions

  1. Download each dataset separately:

Via wget

wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

For multi-threaded downloads, use aria2 with the -x flag, e.g.:

aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

If necessary, merge chunks like this:

cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz

  2. Download the meta data and manifests for each dataset;
  3. Merge files (where applicable), unpack and enjoy!
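
If you prefer Python to `cat` for the merge step, it can be sketched roughly as below. This is a minimal sketch and not one of the project's scripts: the helper name `merge_chunks` is hypothetical, and it assumes that sorting the chunk names lexicographically gives the right order.

```python
import glob
import shutil
from pathlib import Path


def merge_chunks(chunk_pattern, out_path):
    """Concatenate all files matching chunk_pattern into out_path.

    Hypothetical helper mirroring `cat archive.tar.gz_* > archive.tar.gz`.
    Chunks are joined in lexicographic order of their names.
    """
    chunks = sorted(glob.glob(chunk_pattern))
    if not chunks:
        raise FileNotFoundError('no chunks match ' + chunk_pattern)
    with open(out_path, 'wb') as out_file:
        for chunk in chunks:
            with open(chunk, 'rb') as chunk_file:
                # stream block by block so large archives never sit in memory
                shutil.copyfileobj(chunk_file, out_file)
    return Path(out_path)
```

For the example above this would be called as `merge_chunks('ru_open_stt_v01.tar.gz_*', 'ru_open_stt_v01.tar.gz')`.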

Check md5sum

The table below also includes deprecated files. Compute the checksum of a downloaded file with:

md5sum /path/to/downloaded/file

type md5sum file
audio f24e21c69c03062d667caf0f055244f2 asr_public_stories_2_mp3.tar.gz
audio a6f888c53d7cbded85ab51627ef57c96 asr_public_phone_calls_1_mp3.tar.gz
audio f707e34f488c62af2e3142085ff595ad asr_public_phone_calls_2_mp3.tar.gz
audio baa491ed0b526b2a989b8c4a8897429d asr_public_stories_1_mp3.tar.gz
audio 42b9c8c2e31100d6c5b972c9ac000167 private_buriy_audiobooks_2_mp3.tar.gz
audio 7a5704721012fafa115e7316e5f6e058 public_lecture_1_mp3.tar.gz
audio 16cf820330f9f8b388395d777b2331ac public_series_1_mp3.tar.gz
audio dd048e7110c0c852c353759dad8fec0f public_youtube700_mp3.tar.gz
audio 579e9d98bd159a27d3573641edee69b0 ru_ru_mp3.tar.gz
audio 177b041594684623ec7d038613e1330d russian_single_mp3.tar.gz
audio d7ce4c4116dcc655be2b466f82c98b6e tts_russian_addresses_rhvoice_4voices_mp3.tar.gz
audio 25ea6d9e249a242ecc217acc28c8077b voxforge_ru_mp3.tar.gz
audio 97cd6b56ba1eb5088bc5643dce054028 asr_calls_2_val_mp3.tar.gz
audio 69a465e218fc1f597f7b5da836952d9d radio_2_mp3.tar.gz
audio 0cc0f50db85ec4271696b4eb03a2203c buriy_audiobooks_2_val_mp3.tar.gz
audio f5d2e3d13b47e1566ba0b021f00788cf public_youtube1120_hq_mp3.tar.gz
audio 12eb78a9ab7c3d39bbe2842b8d6550ca public_youtube1120_mp3.tar.gz
audio f6b6034e1e91d9a0a5069fc9ad2ed545 public_youtube700_val_mp3.tar.gz
manifest b0ce7564ba90b121aeb13aada73a6e30 asr_public_phone_calls_1.csv
manifest 6867d14dfdec1f9e9b8ca2f1de9ceda6 asr_public_phone_calls_2.csv
manifest 0bdd77e15172e654d9a1999a86e92c7f asr_public_stories_1.csv
manifest f388013039d94dc36970547944db51c7 asr_public_stories_2.csv
manifest 3b67e27c1429593cccbf7c516c4b582d private_buriy_audiobooks_2.csv
manifest 04027c20eb3aff05f6067957ecff856b public_lecture_1.csv
manifest 89da3f1b6afcd4d4936662ceabf3033e public_series_1.csv
manifest a81dfb018c88d0ecd5194ab3d8ff6c95 public_youtube700.csv
manifest c858f020729c34ba0ab525bbb8950d0c ru_RU.csv
manifest 0275525914825dec663fd53390fdc9a0 russian_single.csv
manifest 52f406f4e30fcc8c634f992befd91beb tts_russian_addresses_rhvoice_4voices.csv
audio 7533581bb26975212817bcacb25546d0 asr_public_stories_2.tar.gz
manifest 0cdbd085ffa6dab4bfdce7c3ed31fcfe asr_calls_2_val.csv
manifest 4e0b73e0d00374482a0f2286acf314a0 buriy_audiobooks_2_val.csv
manifest 6b9ce6828a55d2741d51bc3503345db5 public_youtube1120.csv
manifest 33040a25cad99e70a81e9e54ff8c758e public_youtube1120_hq.csv
manifest 525bd20802e529dcabf9e44345a50d0b public_youtube700_val.csv
manifest 2996fe938cdfb37dc6e359e4384c9bfe radio_2.csv
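
Checking many files by hand is tedious, so here is a rough Python sketch of the same check. It assumes you have pasted the rows of the table above into a string; the function names (`md5_of_file`, `parse_checksum_table`, `verify`) are our own for illustration and are not part of the repo's helpers.

```python
import hashlib
import os


def md5_of_file(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            digest.update(block)
    return digest.hexdigest()


def parse_checksum_table(text):
    """Map file name -> expected md5 from rows shaped 'type md5sum file'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        # keep only rows with a 32-character digest; this skips the header
        if len(parts) == 3 and len(parts[1]) == 32:
            expected[parts[2]] = parts[1].lower()
    return expected


def verify(path, expected):
    """True if the file's MD5 matches the table entry for its base name."""
    return md5_of_file(path) == expected[os.path.basename(path)]
```

A file whose name is missing from the table raises `KeyError`, which is deliberate here: a silent pass on an unknown file would defeat the check.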

End to end download scripts

You can use this script or this script with this config file. Please check the config first. You can also contribute a similar script in Python.

Annotation methodology

The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.

Audio normalization

All files are normalized for easier / faster runtime augmentations and processing as follows:

  • Converted to mono, if necessary;
  • Converted to 16 kHz sampling rate, if necessary;
  • Stored as 16-bit integers;
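
If you want to bring your own audio to the same format, the three steps can be sketched with numpy as below. This is only an illustration under stated assumptions: the function name `normalize_audio` is hypothetical, the input is assumed to be float samples in [-1, 1], and the resampling is naive linear interpolation with no anti-aliasing filter. It is not necessarily how the dataset itself was processed; for real work use a proper resampler (e.g. `scipy.signal.resample_poly` or `librosa`).

```python
import numpy as np


def normalize_audio(samples, sr, target_sr=16000):
    """Return (mono int16 array, target_sr) from float samples in [-1, 1].

    `samples` has shape (n,) or (n, channels).
    """
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim == 2:
        # downmix to mono by averaging the channels
        samples = samples.mean(axis=1)
    if sr != target_sr:
        # naive linear-interpolation resampling (no anti-aliasing filter)
        n_out = int(round(len(samples) * target_sr / sr))
        old_t = np.arange(len(samples)) / sr
        new_t = np.arange(n_out) / target_sr
        samples = np.interp(new_t, old_t, samples)
    # clip to the valid range, then scale to 16-bit integers
    samples = np.clip(samples, -1.0, 1.0)
    return (samples * 32767).astype(np.int16), target_sr
```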

On disk DB methodology

Each audio file is hashed. The hash is used to build a folder hierarchy, which keeps any single directory from holding too many files and makes file system operations faster.

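import hashlib
from pathlib import Path

# wav: 1-D int16 numpy array; root_folder: dataset root directory
target_format = 'wav'
wavb = wav.tobytes()

# the SHA-1 hex digest of the raw samples determines the storage path
f_hash = hashlib.sha1(wavb).hexdigest()

# layout: <root>/<1 hex char>/<2 hex chars>/<12 hex chars>.wav
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15]+'.'+target_format)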

Helper functions

Use helper functions from here for easier work with manifest files.

Read manifests

See example

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')

Merge, check and save manifests

See example

from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
train_manifests = [
 'path/to/manifest1.csv',
 'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
             'my_manifest.csv')

Contacts

Please contact us here or just create a GitHub issue!

Authors in alphabetical order:

  • Anna Slizhikova;
  • Alexander Veysov;
  • Dmitry Voronin;
  • Yuri Baburov;

Acknowledgements

This repo would not be possible without these people:

  • Many thanks to akreal for helping to encode the initial bulk of the data into MP3;
  • 18 hours of ground truth annotation datasets for validation are courtesy of activebc;

Kudos!

FAQ

0. Why not MP3? MP3 encoding / decoding

Encoding

Mostly we used pydub (via ffmpeg) to convert to MP3. We omitted blank files (YouTube mostly). We used the following parameters:

  • 16kHz;
  • 32 kbps;
  • Mono;

Usually 128-192 kbps is enough for music with a sampling rate of 44.1 kHz, and 64-96 kbps is enough for speech. Here, however, we have mono 16 kHz audio, usually with a single speaker, so 32 kbps was a good choice. We did not use other formats like .ogg because .mp3 is much more popular.

See example

from pydub import AudioSegment

# temp_path: source wav file; store_mp3_path: destination mp3 file
sound = AudioSegment.from_file(temp_path,
                               format="wav")

# resample to 16 kHz, downmix to mono and encode at 32 kbps
file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],
                           bitrate="32k")
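
As a back-of-the-envelope check on the 32 kbps choice, the nominal bitrates can be compared directly. This is our own arithmetic based on the parameters listed in this README (16 kHz, int16, mono), not a measurement; real archive sizes differ because of container overhead and the omitted blank files.

```python
sample_rate_hz = 16000   # from the normalization settings
bits_per_sample = 16     # int16
channels = 1             # mono
mp3_kbps = 32            # MP3 bitrate used for encoding

# uncompressed PCM bitrate in kbps
pcm_kbps = sample_rate_hz * bits_per_sample * channels / 1000
print(pcm_kbps)              # 256.0
print(pcm_kbps / mp3_kbps)   # 8.0


def mb_per_hour(kbps):
    """Megabytes per hour of audio at a given bitrate (1 MB = 10**6 bytes)."""
    return kbps * 1000 / 8 * 3600 / 1e6


print(mb_per_hour(pcm_kbps))   # 115.2
print(mb_per_hour(mp3_kbps))   # 14.4
```

So the nominal saving is about 8x, which is in the same ballpark as the 855 GB vs 87.5 GB totals in the download table.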

Decoding

It is up to you, but to save space and spare CPU during training, I would suggest the following pipeline to extract the files:

See example

# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile

def save_wav_diskdb(wav,
                    root_folder='../data/ru_open_stt/',
                    target_sr=16000):
    assert type(wav) == np.ndarray
    assert wav.dtype == np.dtype('int16')
    assert len(wav.shape)==1

    target_format = 'wav'
    wavb = wav.tobytes()

    # f_path = Path(audio_path)
    f_hash = hashlib.sha1(wavb).hexdigest()

    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15]+'.'+target_format)

    store_path.parent.mkdir(parents=True,
                            exist_ok=True)

    wavfile.write(filename=str(store_path),
                  rate=target_sr,
                  data=wav)

    return str(store_path)

root_folder = '../data/'
# placeholder: path to one mp3 file extracted from the archives
source_mp3_path = 'path/to/file.mp3'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
                       mono=True,
                       sr=target_sr)

# librosa returns float32 in [-1, 1]; clip before casting to int16
wav = (np.clip(wav, -1.0, 1.0) * 32767).astype(target_dtype)

wav_path = save_wav_diskdb(wav,
                           root_folder=root_folder,
                           target_sr=target_sr)

Why not OGG

Even though OGG is considered better for speech at higher compression, we opted for a more conventional, well-known format.

1. Issues with reading files

Maybe try this approach:

See example

import numpy as np
from scipy.io import wavfile

sample_rate, sound = wavfile.read(path)

# cast first: np.abs overflows on int16 for the value -32768
sound = sound.astype('float32')
abs_max = np.abs(sound).max()
if abs_max > 0:
    sound *= 1 / abs_max

2. Why share such dataset?

We are not altruists; life is just not a zero-sum game.

Consider the progress in computer vision, which was made possible by:

  • Public datasets;
  • Public pre-trained models;
  • Open source frameworks;
  • Open research;

STT does not enjoy the same attention from the ML community because it is data hungry and public datasets are lacking, especially for languages other than English. Ultimately this leaves the general community worse off.

3. Known issues with the dataset to be fixed

  • Blank files in the YouTube dataset. They were removed from the mp3 archive, but the meta-data has not been cleaned;
  • Some files have low values / crash with torchaudio;
  • It looks like scipy does not always write meta-data when saving wavs (or you should save an (N,1) shaped array) - this can be fixed as shown above;
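
Until the meta-data is cleaned, you may want to filter blank files yourself. Below is a minimal sketch using only the Python standard library. The function name `is_blank_wav` and the `threshold` parameter are our own assumptions; it expects 16-bit PCM wav files and may fail on files with broken headers.

```python
import array
import sys
import wave


def is_blank_wav(path, threshold=0):
    """True if no sample of a 16-bit PCM wav exceeds `threshold` in magnitude."""
    with wave.open(path, 'rb') as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError('expected 16-bit samples')
        frames = wav_file.readframes(wav_file.getnframes())
    samples = array.array('h', frames)  # 'h' = signed 16-bit integers
    if sys.byteorder == 'big':
        # wav data is little-endian; swap on big-endian machines
        samples.byteswap()
    # a file with no samples at all also counts as blank
    return all(abs(s) <= threshold for s in samples)
```

Rows whose audio file is blank can then be dropped from a manifest before training.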

License

License:

  • CC BY-NC; commercial usage is available after agreement with the dataset authors;
  • Except for radio_2, which is public domain;
  • Except for VoxForge, whose license is GNU GPL 3.0;
  • Except for the Caito.de dataset, whose license is here.

Donations

Donate (each coffee pays for several full downloads) / use our DO referral link to help.