Real-time streaming Korean speech-to-text model that can run on a CPU
Our ASR (Automatic Speech Recognition) pipeline consists of two distinct stages:
- Speech Enhancement: In this stage, the incoming audio signal is processed to reduce noise, improve clarity, and enhance speech quality. Techniques such as filtering, spectral subtraction, and deep learning-based methods may be employed. Deep learning approaches fall into two main camps: waveform-domain processing and spectrogram-domain processing. We process in the waveform domain.
- Speech Recognition: Once the speech signal has been enhanced, it is passed to the speech recognition system, which converts the processed audio into text by identifying and transcribing the spoken words. Modern ASR systems typically rely on deep neural networks to recognize and transcribe speech accurately.
Together, these two stages enable ASR systems to convert spoken language into text, making them valuable in applications such as voice assistants and transcription services.
We use the Denoiser from Facebook Research for speech enhancement and the NVIDIA NeMo framework for the Conformer-CTC model.
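Roughly, the two stages chain together like the following minimal sketch, which uses the denoiser and NeMo Python APIs directly (file paths are illustrative; main.py adds the streaming logic on top):

```python
import torch
import torchaudio
from denoiser import pretrained
from denoiser.dsp import convert_audio
import nemo.collections.asr as nemo_asr

# Stage 1: waveform-domain speech enhancement with the pretrained dns48 model
denoiser = pretrained.dns48()
wav, sr = torchaudio.load("./audio_example/0001.wav")
wav = convert_audio(wav, sr, denoiser.sample_rate, denoiser.chin)
with torch.no_grad():
    enhanced = denoiser(wav[None])[0]
torchaudio.save("enhanced.wav", enhanced.cpu(), denoiser.sample_rate)

# Stage 2: transcription with the Conformer-CTC checkpoint
asr = nemo_asr.models.EncDecCTCModelBPE.restore_from("./checkpoint/Conformer-CTC-BPE.nemo")
print(asr.transcribe(["enhanced.wav"]))
```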
Clone the Repository
git clone https://github.com/SUNGBEOMCHOI/Korean-Streaming-ASR.git
cd Korean-Streaming-ASR
Make Conda Environment
conda create -n korean_asr python==3.8.10
conda activate korean_asr
Install Ubuntu Dependencies
sudo apt-get update
sudo apt-get install -y libsndfile1 ffmpeg libffi-dev portaudio19-dev
Install Python Dependencies
- Install PyTorch, torchvision, and torchaudio (with CUDA support if you have a compatible GPU) by following the instructions on the official PyTorch website: https://pytorch.org/get-started/locally/.
- Install the remaining Python packages with pip. Open a terminal and execute the following command:
pip install -r requirements.txt
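Optionally, verify that PyTorch installed correctly and can see a GPU (prints False on a CPU-only machine, which is fine for this project):

```python
import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True only if a CUDA-capable GPU is visible
```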
Download Denoiser and ASR Models
- From the provided Google Drive link, download denoiser.th and Conformer-CTC-BPE.nemo. If you wish to train the ASR model, also download Conformer-CTC-BPE.ckpt.
- Create a folder named checkpoint and place the downloaded files in it.
Google Drive Folder: Download Here
File mode
For CPU:
python main.py --audio_path "./audio_example/0001.wav" --device cpu
For GPU:
python main.py --audio_path "./audio_example/0001.wav" --device cuda
Save Denoised Audio:
python main.py --audio_path "./audio_example/0001.wav" --device cuda --denoiser_output_save
Disable Denoiser (ASR only):
python main.py --audio_path "./audio_example/0001.wav" --device cuda --disable_denoiser
Microphone mode
python main.py --mode microphone --device cpu
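Under the hood, microphone mode needs a stream of audio chunks. A hedged sketch with PyAudio (whose system dependency, portaudio19-dev, was installed above) might look like this; the 16 kHz rate and 100 ms chunk size are illustrative assumptions, not the repository's exact settings:

```python
import pyaudio

RATE = 16000   # 16 kHz mono, a common input rate for ASR models (assumption)
CHUNK = 1600   # 100 ms of audio per read (assumption)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
try:
    while True:
        data = stream.read(CHUNK)  # raw 16-bit PCM bytes
        # ... feed `data` into the denoiser/ASR pipeline here ...
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```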
Raw Wave (input)
noise_bigmac.mp4
Clean Wave (enhanced by denoiser)
enhanced_bigmac.mp4
Text (output)
We collected data from AI Hub.
Stage 1: Speech Enhancement
We initialized the denoiser to dns48 (H = 48, trained on the DNS dataset, 18,867,937 parameters) and let the enhancement module keep a fraction of the raw (dry) signal in its output:

output = dry * noisy + (1 - dry) * enhanced

Keeping some dry signal trades a little noise reduction for a more natural-sounding result.
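In code, the blend is a simple linear mix. A minimal sketch mirroring the enhance step in facebookresearch/denoiser (the default dry value here is illustrative, not the repository's setting):

```python
import torch

def mix_dry(noisy: torch.Tensor, enhanced: torch.Tensor, dry: float = 0.05) -> torch.Tensor:
    # Keep a small fraction of the raw signal to preserve naturalness
    # at the cost of slightly less noise suppression.
    return dry * noisy + (1 - dry) * enhanced
```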
Stage 2: Speech to Text
| Name | # of Samples (train/test) |
|---|---|
| 고객응대음성 | 2,067,668 / 21,092 |
| 한국어 음성 | 620,000 / 3,000 |
| 한국인 대화 음성 | 2,483,570 / 142,399 |
| 자유대화음성(일반남녀) | 1,886,882 / 263,371 |
| 복지 분야 콜센터 상담데이터 | 1,096,704 / 206,470 |
| 차량내 대화 데이터 | 2,624,132 / 332,787 |
| 명령어 음성(노인남여) | 137,467 / 237,469 |
| Total | 10,916,423 (13,946 hours) / 1,206,588 (1,474 hours) |
For more information, see KO STT (on Hugging Face).
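For training, NeMo reads samples from a JSON-lines manifest. A hypothetical helper for turning (audio path, transcript) pairs into that format could look like this; the field names follow NeMo's manifest convention, but write_manifest itself is not part of this repository:

```python
import json
import soundfile as sf

def write_manifest(samples, out_path):
    # samples: iterable of (audio_filepath, transcript) pairs
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_path, text in samples:
            entry = {
                "audio_filepath": audio_path,
                "duration": sf.info(audio_path).duration,  # seconds
                "text": text,
            }
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```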
@inproceedings{defossez2020real,
title={Real Time Speech Enhancement in the Waveform Domain},
author={Defossez, Alexandre and Synnaeve, Gabriel and Adi, Yossi},
booktitle={Interspeech},
year={2020}
}