/Speech-Dataset

One more russian speech dataset.

Primary LanguageJupyter NotebookOtherNOASSERTION

Zpoken is a Ukrainian IT company with one of the major divisions oriented on Speech Recognition technologies in English and Slavic (Ukrainian, Russian) languages.

What is this repo about

We are happy to present here our Russian Speech Dataset — Zpoken Dataset [RU]

At the current moment the dataset consists of 5 source parts: radio_source_1, radio_source_2, radio_source_3, radio_source_5, Ru-films.

All data is stored in .opus format and was converted to mono, 16 kHz sampling rate, 16-bit.

Part name Duration (h) Samples num. Average duration (s) Characters per second Characters per sample
radio_source_1 16 424.82 7 887 042 7.50 14.12 105.84
radio_source_2 2 308.46 955 904 8.69 13.53 117.62
radio_source_3 500.14 165 584 10.87 13.90 151.16
radio_source_5 655.88 216 101 10.93 16.63 181.66
Ru-films 850.88 203 972 15.02 8.76 131.57
Total | Average 20 740,18 9 428 603 7.91 13.95 106.17

All parts were scraped from open sources. Basically there were long audio files and transcriptions without timesteps. So that one of the challenges we solved is to align original transcription directly to each short audio sample. More about this problem you will be able to read in our future paper.

Download & play

We provide absolutely free to use 150 hours demos for each part. It is a randomly selected sample from the original dataset part.

Part name Duration(h) Samples num. Size (MB) Link to download
radio_source_1 50 34 356 837 Radio1_50h.zip
radio_source_2 25 16 041 430 Radio2_25h.zip
radio_source_3 25 8 933 418 Radio3_25h.zip
radio_source_5 25 10 786 441 Radio5_25h.zip
Ru-films 25 7 358 380 Ru_films_25h.zip
Total 150 77 474 2 506

They are hosted on Gdrive so we provide ./download.sh to easily get them.

Requirements

You need a gdown to run the ./download.sh

pip install gdown

Just run bash download.sh on your linux machine.

Data structure

You will find the next directory structure, after you unzip each archive.

+---<DatasetPartName>
| +---data
| | +---subfolder1 (optional)
| | | +---speech\_file1.opus
| | | +...
| | | \---speech\_file[N].opus
| | +...
| | +---subfolder[N] (optional)
| | | +---speech\_file1.opus
| | | ...
| | \ \---speech\_file[N].opus
| +---transcription.csv

Get full dataset.

If you are interested in the full version of the dataset feel free to contact us in this form. Usually we'll answer in one working day.

Future work

  • release more hours
  • optimize archive storage (Gdrive is too annoying)

License

CC-BY-4.0

Creative Commons License
Zpoken Dataset [RU] is licensed under a Creative Commons Attribution 4.0 International License.