Accent-Bias-in-ASR-Models

As automatic speech recognition (ASR) systems become increasingly ubiquitous in daily life (smart home assistants, automatic closed captioning), the downstream effects of biased models become increasingly pronounced. ASR models have already been shown to be biased with respect to accent, dialect, and gender, but these studies are generally conducted on overly simplified academic datasets, while most applied uses of these models occur in much noisier settings, e.g., audio with multiple speakers or accents. To this end, our work investigates how the presence of multiple accents in a single audio clip affects transcription quality across accent groups, potentially amplifying bias. We evaluate transcription quality on both single-accent and multi-accent audio using word error rate, character error rate, and Jaro-Winkler distance. Our findings affirm previous work showing clear and consistent bias in single-accent audio but, interestingly, show no significant effect on transcription quality when these models are applied to multi-accent audio.

The full paper can be found here.


Data Processing

Data preprocessing scripts are contained in the data directory, which should also be used to store the dataset itself. The scripts use global variables for pathing, since the directory structure is assumed to be static; these can easily be changed to apply the scripts to other data or directory layouts.

Accent mappings

A mapping from the accent labels present in the original data to the accent groups used in our comparative analysis is stored as a JSON object in accents.json. This mapping was generated by us based on a literature review and the samples in the dataset.
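
For illustration, a minimal sketch of how such a mapping might be loaded and applied; the label-to-group pairs shown in the comment are hypothetical, not the actual contents of accents.json:

import json

# Load the label -> group mapping, e.g., {"England English": "british", ...} (hypothetical).
with open("data/accents.json") as f:
    accent_groups = json.load(f)

def map_accent(label):
    # Return the accent group for a raw label, or None if the label is unmapped.
    return accent_groups.get(label)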

download_data.py

A variation of download_data_ref_file.py that downloads audio samples only for valid rows. This script does not reference a valid-samples file, but may be preferable when computational resources are limited.

generate_agglomarative_samples.py

Generates multi-accent samples from already downloaded single-accent samples. A .tsv file of samples is provided, along with the common sample rate to which samples are converted (usually 16000 Hz). The generated samples are then saved and recorded in another .tsv file.
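
As a rough sketch of the core step (assuming librosa and soundfile are available; the file names and exact stitching logic here are illustrative, not the script's own):

import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 16000  # common sample rate shared by the generated samples

def combine_clips(paths, out_path, sr=TARGET_SR):
    # Resample every single-accent clip to the shared rate, then concatenate end to end.
    clips = [librosa.load(p, sr=sr)[0] for p in paths]
    sf.write(out_path, np.concatenate(clips), sr)

combine_clips(["us_sample.wav", "indian_sample.wav"], "multi_accent_0.wav")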

preprocess_tabular_data.py

Preprocesses the original .tsv file containing all dataset samples (excluding the actual audio files). This includes filtering out rows without valid accent labels, mapping accent labels to accent groups, and preprocessing the text of the sentence spoken in each recording.
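
A minimal sketch of this kind of preprocessing with pandas; the column names ("accents", "sentence") follow Common Voice conventions but are assumptions here, not necessarily the script's exact logic:

import json
import pandas as pd

df = pd.read_csv("validated.tsv", sep="\t")
with open("accents.json") as f:
    accent_groups = json.load(f)

df = df[df["accents"].isin(accent_groups)]               # drop rows without a valid accent label
df["accent_group"] = df["accents"].map(accent_groups)    # map raw label -> comparison group
df["sentence"] = df["sentence"].str.lower().str.strip()  # simple text normalization
df.to_csv("valid_samples.tsv", sep="\t", index=False)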

filter_for_accent.py

This script's filtering is redundant with preprocess_tabular_data.py, which also performs filtering. However, it was used to help design our experimental setup and to determine the accent groups to be used.

download_data_ref_file.py

Downloads audio files only for valid samples (those with selected accent labels), which are listed in a reference file. Valid samples are provided via a .tsv file. Samples are retrieved from the HuggingFace Hub, cross-referenced against the reference file, and saved as .wav files.
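
A minimal sketch of this flow, assuming the HuggingFace datasets library and a streamed Common Voice split; the dataset name, column layout, and reference-file name are assumptions:

import os
import pandas as pd
import soundfile as sf
from datasets import load_dataset

# Hypothetical reference file listing valid samples by their audio path.
valid = set(pd.read_csv("valid_samples.tsv", sep="\t")["path"])
ds = load_dataset("mozilla-foundation/common_voice_16_0", "en",
                  split="train", streaming=True)

for row in ds:
    if row["path"] in valid:  # keep only samples listed in the reference file
        name = os.path.splitext(os.path.basename(row["path"]))[0]
        sf.write(f"audio/{name}.wav", row["audio"]["array"],
                 row["audio"]["sampling_rate"])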


Transcription

Three scripts are provided to transcribe audio samples with different ASR models. They accept similar arguments, briefly described below, but python3 <SCRIPT>.py --help should be called for more detailed descriptions.

transcribe_cv16_whisper.py

Transcribes the dataset specified with the "--dataset" argument using OpenAI's Whisper model. Different variations of the Whisper model (e.g., large, base, base.en, etc.) can be specified with the "--model" argument.
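
For reference, a minimal sketch of the underlying call using the openai-whisper package; exactly how the script wires "--model" and "--dataset" into this is an assumption:

import whisper

# Load a Whisper variant by name (e.g., "large", "base", "base.en").
model = whisper.load_model("base")
result = model.transcribe("audio/sample.wav")
print(result["text"])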

transcribe_cv16_wav2vec2.py

Performs transcription using Meta's (formerly Facebook's) Wav2Vec2 model. Different variations of the Wav2Vec2 model (e.g., large, base) can be specified with the "--model" argument.
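
A minimal sketch of Wav2Vec2 inference with HuggingFace transformers; the checkpoint name is one plausible choice, not necessarily the script's default:

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

name = "facebook/wav2vec2-base-960h"  # swap for a "large" variant as needed
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name)

# Wav2Vec2 checkpoints expect 16 kHz input.
speech, sr = librosa.load("audio/sample.wav", sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
print(processor.batch_decode(ids)[0])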

transcribe_cv16_canary.py

This script runs transcription of the dataset using NVIDIA's Canary-1B model. The model requires the NeMo package, which can be installed along with its dependencies using the following commands:

pip install git+https://github.com/NVIDIA/NeMo.git
pip install hydra-core pytorch-lightning lhotse jiwer pyannote.audio webdataset datasets
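
Once installed, a minimal sketch of running Canary-1B through NeMo; note that the transcribe signature varies slightly across NeMo versions:

from nemo.collections.asr.models import EncDecMultiTaskModel

# Load the pretrained Canary-1B checkpoint from the HuggingFace Hub.
canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")
# The first positional argument takes a list of audio file paths; its keyword
# name (audio vs. paths2audio_files) differs between NeMo versions.
texts = canary.transcribe(["audio/sample.wav"], batch_size=16)
print(texts[0])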

Evaluation

evaluate_single_accent.py

Evaluates sample-wise transcription quality in terms of WER, CER, BERTScore precision, BERTScore recall, BERTScore F1, and Jaro-Winkler distance. The transcriptions to be evaluated are specified as a path to a .tsv file, and the model used to calculate BERTScore can also be specified if necessary.
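
A minimal sketch of the per-sample string metrics, assuming jiwer for WER/CER and jellyfish for Jaro-Winkler; BERTScore is omitted here for brevity:

import jellyfish
import jiwer

def score_sample(reference, hypothesis):
    # Per-sample metrics; lower is better for all three.
    return {
        "wer": jiwer.wer(reference, hypothesis),
        "cer": jiwer.cer(reference, hypothesis),
        # jellyfish returns similarity in [0, 1], so distance = 1 - similarity.
        "jaro_winkler": 1 - jellyfish.jaro_winkler_similarity(reference, hypothesis),
    }

print(score_sample("the quick brown fox", "the quick braun fox"))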

evaluate_multi_accent.py

Evaluates transcription quality for multi-accent audio. This script does not calculate BERTScore metrics due to their high computational cost.