Real-time streaming Korean speech-to-text model that can run on a CPU
Our ASR (Automatic Speech Recognition) pipeline consists of two distinct stages:
- Speech Enhancement: In this stage, the incoming audio signal is processed to reduce noise, improve clarity, and enhance speech quality. Techniques such as filtering, spectral subtraction, and deep learning-based methods may be employed. Deep learning approaches fall into two main camps: waveform-domain processing and spectrogram-domain processing. We process in the waveform domain.
- Speech Recognition: Once the speech signal has been enhanced, it is passed to the speech recognition system, which converts the processed audio into text by identifying and transcribing the spoken words. Modern ASR systems typically rely on deep neural networks to recognize and transcribe speech accurately.
Together, these two stages enable ASR systems to convert spoken language into text, making them valuable in applications such as voice assistants and transcription services.
We use the Denoiser from Facebook Research for speech enhancement and the NVIDIA NeMo framework for the Conformer-CTC model.
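Roughly, the two stages chain together like the following minimal sketch, which uses the denoiser and NeMo Python APIs directly (file paths are illustrative; main.py adds the streaming logic on top):

```python
import torch
import torchaudio
from denoiser import pretrained
from denoiser.dsp import convert_audio
import nemo.collections.asr as nemo_asr

# Stage 1: waveform-domain speech enhancement with the pretrained dns48 model
denoiser = pretrained.dns48()
wav, sr = torchaudio.load("./audio_example/0001.wav")
wav = convert_audio(wav, sr, denoiser.sample_rate, denoiser.chin)
with torch.no_grad():
    enhanced = denoiser(wav[None])[0]
torchaudio.save("enhanced.wav", enhanced.cpu(), denoiser.sample_rate)

# Stage 2: transcription with the Conformer-CTC checkpoint
asr = nemo_asr.models.EncDecCTCModelBPE.restore_from("./checkpoint/Conformer-CTC-BPE.nemo")
print(asr.transcribe(["enhanced.wav"]))
```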
Clone the Repository
git clone https://github.com/SUNGBEOMCHOI/Korean-Streaming-ASR.git
cd Korean-Streaming-ASR
Make Conda Environment
conda create -n korean_asr python==3.8.10
conda activate korean_asr
Install Ubuntu Dependencies
sudo apt-get update
sudo apt-get install -y libsndfile1 ffmpeg libffi-dev portaudio19-dev
Install Python Dependencies
- Install PyTorch, torchvision, and torchaudio (with CUDA support if you have a compatible GPU) by following the instructions on the official PyTorch website: https://pytorch.org/get-started/locally/.
- Install the remaining Python packages with pip. Open a terminal and execute the following command:
pip install -r requirements.txt
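Optionally, verify that PyTorch installed correctly and can see a GPU (prints False on a CPU-only machine, which is fine for this project):

```python
import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True only if a CUDA-capable GPU is visible
```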
Download Denoiser and ASR Models
- From the provided Google Drive link, download denoiser.th and Conformer-CTC-BPE.nemo. If you wish to train the ASR model, also download Conformer-CTC-BPE.ckpt.
- Create a folder named checkpoint and place the downloaded files in it.
Google Drive Folder: Download Here
File mode
For CPU:
python main.py --audio_path "./audio_example/0001.wav" --device cpu
For GPU:
python main.py --audio_path "./audio_example/0001.wav" --device cuda
Save Denoised Audio:
python main.py --audio_path "./audio_example/0001.wav" --device cuda --denoiser_output_save
Disable Denoiser (ASR only):
python main.py --audio_path "./audio_example/0001.wav" --device cuda --disable_denoiser
Microphone mode
python main.py --mode microphone --device cpu
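Under the hood, microphone mode needs a stream of audio chunks. A hedged sketch with PyAudio (whose system dependency, portaudio19-dev, was installed above) might look like this; the 16 kHz rate and 100 ms chunk size are illustrative assumptions, not the repository's exact settings:

```python
import pyaudio

RATE = 16000   # 16 kHz mono, a common input rate for ASR models (assumption)
CHUNK = 1600   # 100 ms of audio per read (assumption)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
try:
    while True:
        data = stream.read(CHUNK)  # raw 16-bit PCM bytes
        # ... feed `data` into the denoiser/ASR pipeline here ...
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```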
Raw Wave (input)
noise_bigmac.mp4
Clean Wave (enhanced by denoiser)
enhanced_bigmac.mp4
Text (output)
We collected data from AI Hub.
Stage 1: Speech Enhancement
We initialized the denoiser to dns48 (H = 48, trained on the DNS dataset, 18,867,937 parameters) and let the enhancement module keep a fraction of the raw (dry) signal in its output:

output = dry * noisy + (1 - dry) * enhanced

Keeping some dry signal trades a little noise reduction for a more natural-sounding result.
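In code, the blend is a simple linear mix. A minimal sketch mirroring the enhance step in facebookresearch/denoiser (the default dry value here is illustrative, not the repository's setting):

```python
import torch

def mix_dry(noisy: torch.Tensor, enhanced: torch.Tensor, dry: float = 0.05) -> torch.Tensor:
    # Keep a small fraction of the raw signal to preserve naturalness
    # at the cost of slightly less noise suppression.
    return dry * noisy + (1 - dry) * enhanced
```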
Stage 2: Speech to Text
| Name | # of Samples (train/test) |
|---|---|
| 고객응대음성 | 2,067,668 / 21,092 |
| 한국어 음성 | 620,000 / 3,000 |
| 한국인 대화 음성 | 2,483,570 / 142,399 |
| 자유대화음성(일반남녀) | 1,886,882 / 263,371 |
| 복지 분야 콜센터 상담데이터 | 1,096,704 / 206,470 |
| 차량내 대화 데이터 | 2,624,132 / 332,787 |
| 명령어 음성(노인남여) | 137,467 / 237,469 |
| Total | 10,916,423 (13,946 hours) / 1,206,588 (1,474 hours) |
For more information, see KO STT (on Hugging Face).
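For training, NeMo reads samples from a JSON-lines manifest. A hypothetical helper for turning (audio path, transcript) pairs into that format could look like this; the field names follow NeMo's manifest convention, but write_manifest itself is not part of this repository:

```python
import json
import soundfile as sf

def write_manifest(samples, out_path):
    # samples: iterable of (audio_filepath, transcript) pairs
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_path, text in samples:
            entry = {
                "audio_filepath": audio_path,
                "duration": sf.info(audio_path).duration,  # seconds
                "text": text,
            }
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```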
@inproceedings{defossez2020real,
title={Real Time Speech Enhancement in the Waveform Domain},
author={Defossez, Alexandre and Synnaeve, Gabriel and Adi, Yossi},
booktitle={Interspeech},
year={2020}
}