Speech Recognition for Ukrainian 🇺🇦

This repository collects information and datasets for Ukrainian automatic speech recognition (speech-to-text).

It also contains information about Ukrainian speech synthesis (text-to-speech).

You can also start a discussion.

Donate

You can support our work with a donation:

🎤 Speech-to-Text

💡 Implementations

wav2vec2-bert

wav2vec2

You can check demos out here: https://github.com/egorsmkv/wav2vec2-uk-demo

data2vec

Citrinet

ContextNet

FastConformer

Squeezeformer

Silero

VOSK

Note: VOSK models are licensed under Apache License 2.0.

DeepSpeech

M-CTC-T

whisper

Flashlight

📊 Benchmarks

All benchmarks below use the Common Voice 10 test split.
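The tables in this section report word error rate (WER), character error rate (CER), and accuracy. The accuracy values are consistent with (1 - WER) × 100. The following is a minimal, dependency-free sketch of how such numbers could be computed for a single reference/hypothesis pair. It is an illustration under those assumptions, not the evaluation script used for these benchmarks, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def accuracy_percent(wer_value: float) -> float:
    """Accuracy as it appears in the tables: (1 - WER) * 100 (assumed)."""
    return round((1 - wer_value) * 100, 2)
```

For a whole test set, the edit distances and reference lengths would normally be summed over all utterances before dividing, rather than averaging per-utterance rates.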

wav2vec2-bert

| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|---|---|---|---|---|---|---|
| Yehor/w2v-bert-2.0-uk | 0.0727 | 0.0151 | 92.73% | 0.0655 | 0.0139 | 93.45% |

wav2vec2

| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|---|---|---|---|---|---|---|
| Yehor/wav2vec2-xls-r-1b-uk-with-lm | 0.1807 | 0.0317 | 81.93% | 0.1193 | 0.0218 | 88.07% |
| Yehor/wav2vec2-xls-r-1b-uk-with-binary-news-lm | 0.1807 | 0.0317 | 81.93% | 0.0997 | 0.0191 | 90.03% |
| Yehor/wav2vec2-xls-r-300m-uk-with-lm | 0.2906 | 0.0548 | 70.94% | 0.172 | 0.0355 | 82.8% |
| Yehor/wav2vec2-xls-r-300m-uk-with-news-lm | 0.2027 | 0.0365 | 79.73% | 0.0929 | 0.019 | 90.71% |
| Yehor/wav2vec2-xls-r-300m-uk-with-wiki-lm | 0.2027 | 0.0365 | 79.73% | 0.1045 | 0.0208 | 89.55% |
| Yehor/wav2vec2-xls-r-base-uk-with-small-lm | 0.4441 | 0.0975 | 55.59% | 0.2878 | 0.0711 | 71.22% |
| robinhad/wav2vec2-xls-r-300m-uk | 0.2736 | 0.0537 | 72.64% | - | - | - |
| arampacha/wav2vec2-xls-r-1b-uk | 0.1652 | 0.0293 | 83.48% | 0.0945 | 0.0175 | 90.55% |

Citrinet

lm-4gram-500k is used as the LM.

| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|---|---|---|---|---|---|---|
| nvidia/stt_uk_citrinet_1024_gamma_0_25 | 0.0432 | 0.0094 | 95.68% | 0.0352 | 0.0079 | 96.48% |
| neongeckocom/stt_uk_citrinet_512_gamma_0_25 | 0.0746 | 0.016 | 92.54% | 0.0563 | 0.0128 | 94.37% |

ContextNet

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| theodotus/stt_uk_contextnet_512 | 0.0669 | 0.0145 | 93.31% |

FastConformer P&C

This model supports text punctuation and capitalization (P&C).

| Model | WER | CER | Accuracy, % | WER+P&C | CER+P&C | Accuracy+P&C, % |
|---|---|---|---|---|---|---|
| theodotus/stt_ua_fastconformer_hybrid_large_pc | 0.0400 | 0.0102 | 96.00% | 0.0710 | 0.0167 | 92.90% |

Squeezeformer

lm-4gram-500k is used as the LM.

| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|---|---|---|---|---|---|---|
| theodotus/stt_uk_squeezeformer_ctc_xs | 0.1078 | 0.0229 | 89.22% | 0.0777 | 0.0174 | 92.23% |
| theodotus/stt_uk_squeezeformer_ctc_sm | 0.082 | 0.0175 | 91.8% | 0.0605 | 0.0142 | 93.95% |
| theodotus/stt_uk_squeezeformer_ctc_ml | 0.0591 | 0.0126 | 94.09% | 0.0451 | 0.0105 | 95.49% |

Flashlight

lm-4gram-500k is used as the LM.

| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|---|---|---|---|---|---|---|
| Flashlight Conformer | 0.1915 | 0.0244 | 80.85% | 0.0907 | 0.0198 | 90.93% |

data2vec

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| robinhad/data2vec-large-uk | 0.3117 | 0.0731 | 68.83% |

VOSK

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| v3 | 0.5325 | 0.3878 | 46.75% |

Silero

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| snakers4/silero-models | 0.2356 | 0.0646 | 76.44% |

M-CTC-T

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| speechbrain/m-ctc-t-large | 0.57 | 0.1094 | 43% |

whisper

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| tiny | 0.6308 | 0.1859 | 36.92% |
| base | 0.521 | 0.1408 | 47.9% |
| small | 0.3057 | 0.0764 | 69.43% |
| medium | 0.1873 | 0.044 | 81.27% |
| large (v1) | 0.1642 | 0.0393 | 83.58% |
| large (v2) | 0.1372 | 0.0318 | 86.28% |

Fine-tuned versions for Ukrainian:

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| small | 0.2704 | 0.0565 | 72.96% |
| large | 0.2482 | 0.055 | 75.18% |

To fine-tune a Whisper model on your own data, use this repository: https://github.com/egorsmkv/whisper-ukrainian

DeepSpeech

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| v0.5 | 0.7025 | 0.2009 | 29.75% |

📖 Development

📚 Datasets

Compiled dataset from different open sources + Companies + Community = 188.31GB / ~1200 hours 💪

Voice of America (398 hours)

Companies

Cleaned Common Voice 10 (test set)

Noised Common Voice 10

Community

Other

⭐ Related works

Language models

Inverse Text Normalization:

Text Enhancement

📢 Text-to-Speech

Test sentence with stresses:

К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і Кам'ян+ець-Под+ільського рай+ону.

Without stresses:

Кам'янець-Подільський - місто в Хмельницькій області України, центр Кам'янець-Подільської міської об'єднаної територіальної громади і Кам'янець-Подільського району.
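The two test sentences above differ only in the `+` marks, which appear to be placed directly before each stressed vowel. Assuming that convention, the sketch below shows how stressed text could be converted to plain text, or to text with a combining acute accent (U+0301) on the stressed vowel. The function names are hypothetical and do not come from any of the TTS projects listed here.

```python
import re

ACUTE = "\u0301"  # Unicode combining acute accent


def strip_stress(text: str) -> str:
    """Remove '+' stress marks, yielding the unstressed sentence."""
    return text.replace("+", "")


def plus_to_acute(text: str) -> str:
    """Replace '+<vowel>' with '<vowel>' followed by a combining acute accent."""
    return re.sub(r"\+(\w)", lambda m: m.group(1) + ACUTE, text)


stressed = "К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни"
print(strip_stress(stressed))
print(plus_to_acute(stressed))
```

Which form a given TTS model expects (plus marks, acute accents, or no marks at all) is model-specific, so check the model's documentation before preprocessing input.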

💡 Implementations

RAD-TTS

Silero TTS

Coqui TTS

Neon TTS

Balacoon TTS

📚 Datasets

⭐ Related works

Accentors