Zpoken is a Ukrainian IT company. One of its major divisions focuses on speech recognition technologies for English and Slavic (Ukrainian, Russian) languages.

What is this repo about

We are happy to present our Russian speech dataset: Zpoken Dataset [RU].

The dataset currently consists of 5 source parts: radio_source_1, radio_source_2, radio_source_3, radio_source_5, Ru-films.

All data is stored in .opus format and was converted to mono, 16 kHz sampling rate, 16-bit.

Part name	Duration (h)	Samples num.	Average duration (s)	Characters per second	Characters per sample
radio_source_1	16 424.82	7 887 042	7.50	14.12	105.84
radio_source_2	2 308.46	955 904	8.69	13.53	117.62
radio_source_3	500.14	165 584	10.87	13.90	151.16
radio_source_5	655.88	216 101	10.93	16.63	181.66
Ru-films	850.88	203 972	15.02	8.76	131.57
Total / Average	20 740.18	9 428 603	7.91	13.95	106.17
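
The average duration column follows from the first two columns (total hours times 3600, divided by the number of samples). As a minimal sketch for cross-checking, with figures copied from the table above; the names `average_duration_s` and `PARTS` are ours and purely illustrative:

```python
def average_duration_s(hours: float, samples: int) -> float:
    """Mean sample length in seconds, from total hours and sample count."""
    return hours * 3600.0 / samples


# (duration in hours, number of samples), copied from the table above
PARTS = {
    "radio_source_1": (16424.82, 7887042),
    "radio_source_2": (2308.46, 955904),
    "radio_source_3": (500.14, 165584),
    "radio_source_5": (655.88, 216101),
    "Ru-films": (850.88, 203972),
}

if __name__ == "__main__":
    for name, (hours, samples) in PARTS.items():
        print(f"{name}: {average_duration_s(hours, samples):.2f} s")
    total_hours = sum(h for h, _ in PARTS.values())
    total_samples = sum(n for _, n in PARTS.values())
    print(f"total: {total_hours:.2f} h, {total_samples} samples")
```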

All parts were scraped from open sources. The raw material consisted of long audio files and transcriptions without timestamps, so one of the challenges we solved was aligning the original transcription to each short audio sample. You will be able to read more about this problem in our future paper.

Download & play

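We provide 150 hours of demo data in total, absolutely free to use: 50 hours from radio_source_1 and 25 hours from each of the other parts. Each demo is a randomly selected sample from the corresponding part of the original dataset.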

Part name	Duration (h)	Samples num.	Size (MB)	Link to download
radio_source_1	50	34 356	837	Radio1_50h.zip
radio_source_2	25	16 041	430	Radio2_25h.zip
radio_source_3	25	8 933	418	Radio3_25h.zip
radio_source_5	25	10 786	441	Radio5_25h.zip
Ru-films	25	7 358	380	Ru_films_25h.zip
Total	150	77 474	2 506

The archives are hosted on Google Drive, so we provide ./download.sh to fetch them easily.

Requirements

You need gdown to run ./download.sh:

pip install gdown

Then run bash download.sh on your Linux machine.
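
Once the archives are downloaded, they still need to be unpacked. A minimal sketch using only the Python standard library, assuming the .zip files sit in one directory (where download.sh actually places them is not stated here); the function name `extract_archives` and the target folder `zpoken_demo` are hypothetical:

```python
import zipfile
from pathlib import Path


def extract_archives(archive_dir: str, target_dir: str) -> list[str]:
    """Unzip every *.zip found in archive_dir into target_dir.

    Returns the names of the archives that were extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    done = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        done.append(archive.name)
    return done


if __name__ == "__main__":
    # Assumption: the archives are in the current directory.
    print(extract_archives(".", "zpoken_demo"))
```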

Data structure

After you unzip each archive, you will find the following directory structure.

+---<DatasetPartName>
| +---data
| | +---subfolder1 (optional)
| | | +---speech_file1.opus
| | | +...
| | | \---speech_file[N].opus
| | +...
| | \---subfolder[N] (optional)
| |   +---speech_file1.opus
| |   +...
| |   \---speech_file[N].opus
| \---transcription.csv
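
Given this layout, an unpacked part can be indexed with a few lines of Python. This is only a sketch: the helper names are ours, and it assumes transcription.csv is UTF-8 and comma-separated. The column layout of the CSV is not described here, so rows are returned as-is without interpreting them.

```python
import csv
from pathlib import Path


def list_audio_files(part_dir: str) -> list[Path]:
    """Return all .opus files under <part_dir>/data, with or without subfolders."""
    return sorted((Path(part_dir) / "data").rglob("*.opus"))


def read_transcription_rows(part_dir: str) -> list[list[str]]:
    """Return the raw rows of transcription.csv (column meaning is not assumed)."""
    csv_path = Path(part_dir) / "transcription.csv"
    # Assumption: UTF-8 text with the default comma delimiter.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

For example, `len(list_audio_files("<DatasetPartName>"))` can be compared with the sample counts in the demo table to confirm that an archive was extracted completely.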

Get the full dataset

If you are interested in the full version of the dataset, feel free to contact us via this form. We usually answer within one working day.

Future work

release more hours
optimize archive storage (Gdrive is too annoying)

License

CC-BY-4.0

Zpoken Dataset [RU] is licensed under a Creative Commons Attribution 4.0 International License.
