klaam

Arabic speech recognition and classification using wav2vec models. This repository allows training and prediction using pretrained models.

Usage

from klaam import SpeechClassification
model = SpeechClassification()
model.classify('file.wav')

from klaam import SpeechRecognition
model = SpeechRecognition()
model.transcribe('file.wav')

There are two avilable models for recognition trageting MSA and egyptian dialect . You can set any of them using the lang attribute

 from klaam import SpeechRecognition
 model = SpeechRecognition(lang = 'msa')
 model.transcribe('file.wav')

Datasets

Dataset	Description	link
MGB-3	Egyptian Arabic Speech recognition in the wild. Every sentence was annotated by four annotators. More than 15 hours have been collected from YouTube.	requires registeration here
ADI-5	More than 50 hours collected from Aljazeera TV. 4 regional dialectal: Egyptian (EGY), Levantine (LAV), Gulf (GLF), North African (NOR), and Modern Standard Arabic (MSA). This dataset is a part of the MGB-3 challenge.	requires registeration here
Common voice	Multlilingual dataset avilable on huggingface	here.

Models

We currently support three models, all of them are avilable on transformers.

Language	Description	Source
Egyptian	Speech recognition	wav2vec2-large-xlsr-53-arabic-egyptian
Standard Arabic	Speech recognition	wav2vec2-large-xlsr-53-arabic
EGY, NOR, LAV, GLF, MSA	Speech classification	wav2vec2-large-xlsr-dialect-classification

Example Notebooks

Name	Description	Notebook
Demo	Classification and Recongition example in a few lines of code.

Training

The scripts are a modification of this repo.

classification

This script is used for the classification task on the 5 classes.

python run_classifier.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --freeze_feature_extractor \
    --num_train_epochs="50" \
    --per_device_train_batch_size="32" \
    --preprocessing_num_workers="1" \
    --learning_rate="3e-5" \
    --warmup_steps="20" \
    --evaluation_strategy="steps"\
    --save_steps="100" \
    --eval_steps="100" \
    --save_total_limit="1" \
    --logging_steps="100" \
    --do_eval \
    --do_train \

Recognition

This script is for training on the dataset for pretraining on the egyption dialects dataset.

python run_recognition.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --freeze_feature_extractor \
    --num_train_epochs="50" \
    --per_device_train_batch_size="32" \
    --preprocessing_num_workers="1" \
    --learning_rate="3e-5" \
    --warmup_steps="20" \
    --evaluation_strategy="steps"\
    --save_steps="100" \
    --eval_steps="100" \
    --save_total_limit="1" \
    --logging_steps="100" \
    --do_eval \
    --do_train \

This script can be used for Arabic common voice training

python run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="ar" \
    --output_dir=/path/to/output/ \
    --cache_dir=/path/to/cache \
    --overwrite_output_dir \
    --num_train_epochs="1" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --evaluation_strategy="steps" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --fp16 \
    --freeze_feature_extractor \
    --save_steps="10" \
    --eval_steps="10" \
    --save_total_limit="1" \
    --logging_steps="10" \
    --group_by_length \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --gradient_checkpointing \
    --do_train --do_eval \
    --max_train_samples 100 --max_val_samples 100

AnwarKamel/klaam