NISQA: Speech Quality and Naturalness Assessment

+++ News: The NISQA model has recently been updated to NISQA v2.0. The new version offers multidimensional predictions with higher accuracy and allows for training and finetuning the model.

Speech Quality Prediction:
NISQA is a deep learning model/framework for speech quality prediction. The NISQA model weights can be used to predict the quality of a speech sample that has been sent through a communication system (e.g. a telephone or video call). Besides overall speech quality, NISQA also provides predictions for the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness to give more insight into the cause of the quality degradation.

TTS Naturalness Prediction:
The NISQA-TTS model weights can be used to estimate the Naturalness of synthetic speech generated by a Voice Conversion or Text-To-Speech system (Siri, Alexa, etc.).

Training/Finetuning:
NISQA can be used to train new single-ended or double-ended speech quality prediction models with different deep learning architectures, such as CNN or DFF -> Self-Attention or LSTM -> Attention-Pooling or Max-Pooling. The provided model weights can also be used to finetune the model on new data or for transfer learning to a different regression task (e.g. quality estimation of enhanced speech, speaker similarity estimation, or emotion recognition).

Speech Quality Datasets:
We provide a large corpus of more than 14,000 speech samples with subjective speech quality and speech quality dimension labels.

Installation
Using NISQA
NISQA Corpus
Paper and License

For more information about the deep learning model structure, the training datasets used, and the training options, see the NISQA paper and the Wiki.

Installation

To install the requirements, install Anaconda and then run:

conda env create -f env.yml

This creates a new environment named "nisqa". Activate this environment before continuing:

conda activate nisqa

Using NISQA

We provide examples for using NISQA to predict the quality of speech samples, to train a new speech quality model, and to evaluate the performance of a trained speech quality model.

Three sets of model weights are available. Load the weights that match your domain:

Model	Prediction Output	Domain	Filename
NISQA (v2.0)	Overall Quality, Noisiness, Coloration, Discontinuity, Loudness	Transmitted Speech	nisqa.tar
NISQA (v2.0) mos only	Overall Quality only (for finetuning/transfer learning)	Transmitted Speech	nisqa_mos_only.tar
NISQA-TTS (v1.0)	Naturalness	Synthesized Speech	nisqa_tts.tar
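
As a small illustration of the table above, the choice of weights can be written as a lookup. This is only a sketch: the helper name select_weights and the domain keys are hypothetical and not part of the NISQA code, and the "weights" folder is assumed from the example commands in this README.

```python
# Hypothetical helper, not part of NISQA: maps a domain to the weights
# file listed in the table above.
WEIGHTS_BY_DOMAIN = {
    "transmitted": "nisqa.tar",                    # overall quality + 4 dimensions
    "transmitted_mos_only": "nisqa_mos_only.tar",  # overall quality only
    "synthesized": "nisqa_tts.tar",                # naturalness of TTS/VC speech
}


def select_weights(domain: str, weights_dir: str = "weights") -> str:
    """Return the path of the weights file for the given domain key."""
    try:
        filename = WEIGHTS_BY_DOMAIN[domain]
    except KeyError:
        known = ", ".join(sorted(WEIGHTS_BY_DOMAIN))
        raise ValueError(f"unknown domain {domain!r}, expected one of: {known}") from None
    return f"{weights_dir}/{filename}"


print(select_weights("synthesized"))
```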

Prediction

There are three modes available to predict the quality of speech via command line arguments:

Predict a single file
Predict all files in a folder
Predict all files in a CSV table

Important: Select "nisqa.tar" to predict the quality of a transmitted speech sample and "nisqa_tts.tar" to predict the Naturalness of a synthesized speech sample.

To predict the quality of a single .wav file use:

python run_predict.py --mode predict_file --pretrained_model weights/nisqa.tar --deg /path/to/wav/file.wav --output_dir /path/to/dir/with/results

To predict the quality of all .wav files in a folder use:

python run_predict.py --mode predict_dir --pretrained_model weights/nisqa.tar --data_dir /path/to/folder/with/wavs --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results

To predict the quality of all .wav files listed in a CSV table use:

python run_predict.py --mode predict_csv --pretrained_model weights/nisqa.tar --csv_file files.csv --csv_deg column_name_of_filepaths --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results

The results are printed to the console and, if --output_dir is given, saved to a CSV file in that folder. To speed up prediction, the number of workers and the batch size of the PyTorch DataLoader can be increased with the optional --num_workers and --bs arguments. For stereo files, --ms_channel can be used to select the audio channel.
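
If the prediction is launched from another Python script, the command lines above can be assembled programmatically. The following is a minimal sketch: build_predict_command is a hypothetical helper, and it only uses the arguments shown in the commands above. It builds the argument list without running it.

```python
import shlex


def build_predict_command(mode, pretrained_model, **options):
    """Build the argument list for run_predict.py.

    Keyword options are turned into "--name value" pairs; options set to
    None are skipped, which covers the optional arguments.
    """
    cmd = ["python", "run_predict.py", "--mode", mode,
           "--pretrained_model", pretrained_model]
    for name, value in options.items():
        if value is not None:
            cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_predict_command(
    "predict_dir",
    "weights/nisqa.tar",
    data_dir="/path/to/folder/with/wavs",
    num_workers=0,
    bs=10,
    output_dir="/path/to/dir/with/results",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root, assuming the "nisqa" environment is active.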

Training

Finetuning / Transfer Learning

To use the model weights to finetune the model on a new dataset, only a CSV file with the filenames and labels is needed. The training is configured through a YAML file and can be started as follows:

python run_train.py --yaml config/finetune_nisqa.yaml

If the NISQA Corpus is used, only two arguments need to be updated in the YAML file: the data_dir, which points to the extracted NISQA_Corpus folder, and the output_dir, where the results are stored.
If you use your own dataset or want to load the NISQA-TTS model, some further updates are needed.

Your CSV file needs to contain at least three columns with the following names:
- db with the individual dataset names for each file
- filepath_deg filepath to the degraded WAV file, either absolute paths or relative to the data_dir (CSV column name can be changed in YAML)
- mos with the target labels (CSV column name can be changed in YAML)
The finetune_nisqa.yaml needs to be updated as follows:
- data_dir path to the main folder, which contains the CSV file and the datasets
- output_dir path to output folder with saved model weights and results
- pretrained_model filename of the pretrained model, either nisqa_mos_only.tar for natural speech or nisqa_tts.tar for synthesized speech
- csv_file name of the CSV with filepaths and target labels
- csv_deg CSV column name that contains filepaths (e.g. filepath_deg)
- csv_mos_train and csv_mos_val CSV column names of the target value (e.g. mos)
- csv_db_train and csv_db_val names of the datasets you want to use for training and validation. Dataset names must appear in the db column.
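
Before starting a training run, it can help to check that the CSV file matches the requirements listed above. The sketch below is not part of NISQA. It assumes the default column names db, filepath_deg and mos, and the dataset names and file paths in the example are made up for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("db", "filepath_deg", "mos")  # default column names


def check_training_csv(csv_text, train_dbs, val_dbs):
    """Validate a training CSV and return its number of rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    # Every dataset named for training/validation must occur in the db column.
    unknown = (set(train_dbs) | set(val_dbs)) - {row["db"] for row in rows}
    if unknown:
        raise ValueError(f"datasets not found in db column: {sorted(unknown)}")
    for row in rows:
        float(row["mos"])  # raises ValueError if a label is not numeric
    return len(rows)


# Hypothetical dataset names and paths, relative to data_dir.
example_csv = """db,filepath_deg,mos
my_train_db,my_train_db/deg/file_001.wav,3.4
my_train_db,my_train_db/deg/file_002.wav,1.9
my_val_db,my_val_db/deg/file_001.wav,4.2
"""

n_rows = check_training_csv(example_csv, ["my_train_db"], ["my_val_db"])
print(n_rows)
```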

See the comments in the YAML configuration file and the Wiki (not yet added) for more advanced training options. A good starting point would be to use the NISQA Corpus to get the training started with the standard configuration.

Training a new model

NISQA can also be used as a framework to train new speech quality models with different deep learning architectures. The general model structure is as follows:

Framewise model: CNN or Feedforward network
Time-Dependency model: Self-Attention or LSTM
Pooling: Average, Max, Attention or Last-Step-Pooling
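
To make the pooling stage more concrete, the sketch below applies the four pooling options to a short sequence of per-frame feature vectors and reduces it to a single vector. It is a plain-Python illustration, not the NISQA implementation. In particular, the attention scores are passed in directly here, whereas in a trained model they would be computed from the frames by a learned layer.

```python
import math


def average_pool(frames):
    """Mean over time for each feature."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def max_pool(frames):
    """Maximum over time for each feature."""
    return [max(col) for col in zip(*frames)]


def last_step_pool(frames):
    """Features of the last time step only."""
    return list(frames[-1])


def attention_pool(frames, scores):
    """Weighted mean over time; weights are a softmax of one score per frame."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shifted for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*frames)]


# Two time steps with two features each (made-up values).
frames = [[1.0, 4.0], [3.0, 2.0]]
print(average_pool(frames))
print(max_pool(frames))
print(last_step_pool(frames))
print(attention_pool(frames, scores=[0.0, 0.0]))
```

With equal scores, attention pooling reduces to average pooling. A very large score on one frame makes it behave like selecting that frame.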

The framewise and time-dependency models can be skipped, for example to train an LSTM model without CNN that uses the last time step for prediction. A second time-dependency stage can also be added, for example for an LSTM-Self-Attention structure. The model structure is controlled via the YAML configuration file. Training with the standard NISQA model configuration can be started with the NISQA Corpus as follows:

python run_train.py --yaml config/train_nisqa_cnn_sa_ap.yaml

If the NISQA Corpus is used, only data_dir (the path to the unzipped NISQA_Corpus folder) and output_dir need to be updated in the YAML file. For a custom dataset, update the YAML file as described in the finetuning section above.

Any other combination of neural networks can also be trained. For example, the train_nisqa_cnn_lstm_avg.yaml example configuration file trains a model with LSTM instead of Self-Attention.

To train a double-ended model for full-reference speech quality prediction, the train_nisqa_double_ended.yaml configuration file can be used as an example. See the comments in the YAML files and the Wiki (not yet added) for more details on different possible model structures and advanced training options.

Evaluation

Trained models can be evaluated on a given dataset as follows (can also be used as a conformance test of the model installation):

python run_evaluate.py

Before running, update the options and paths inside the Python script run_evaluate.py. If the NISQA Corpus is used, only the data_dir and output_dir paths need to be adjusted. In addition to Pearson's correlation and RMSE, the script calculates the RMSE after first-order polynomial mapping. If a CSV file with per-condition labels is provided, the script also outputs per-condition results and RMSE*. Optionally, correlation diagrams can be plotted. When run on the NISQA Corpus, the script should return the same results as in the NISQA paper.
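
The three per-file metrics mentioned above can be sketched in a few lines. This is an illustration under simple assumptions, not the code used by run_evaluate.py: the mapping is an ordinary least-squares line from predictions to subjective scores, and all errors are averaged over the number of samples N, whereas the script may normalise differently. RMSE* is not covered here.

```python
import math


def pearson(pred, target):
    """Pearson's correlation coefficient between two equally long lists."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in target))
    return cov / (sp * st)


def rmse(pred, target):
    """Root-mean-square error, averaged over all samples."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def rmse_after_linear_mapping(pred, target):
    """RMSE after mapping pred onto target with a fitted line a * pred + b."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    a = (sum((p - mp) * (t - mt) for p, t in zip(pred, target))
         / sum((p - mp) ** 2 for p in pred))
    b = mt - a * mp
    mapped = [a * p + b for p in pred]
    return rmse(mapped, target)


# Made-up example: predictions that are perfectly correlated but biased.
predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [2.0, 4.0, 6.0, 8.0]
print(pearson(predicted, subjective))
print(rmse(predicted, subjective))
print(rmse_after_linear_mapping(predicted, subjective))
```

In this example the plain RMSE is large although the correlation is perfect. The first-order mapping removes the offset and gain difference between predictions and labels, so the mapped RMSE only reflects the remaining error.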

NISQA Corpus

The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet-loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions.

For the download link and more details on the datasets and the source speech samples used, see the NISQA Corpus Wiki.

Paper and License

If you use the NISQA model or the NISQA Corpus for your research, please cite the following paper:
G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” arXiv:2104.09494 [eess.AS], 2021.
Please cite the following paper if you use the NISQA-TTS model for Naturalness prediction of synthesized speech:
G. Mittag and S. Möller, “Deep Learning Based Assessment of Synthetic Speech Naturalness,” in Proc. Interspeech 2020, 2020.
Please cite the following paper if you use the double-ended NISQA model:
G. Mittag and S. Möller, “Full-reference speech quality estimation with attentional Siamese neural networks,” in Proc. ICASSP 2020, 2020.
The older NISQA (v0.42) model version is described in the following paper:
G. Mittag and S. Möller, “Non-intrusive speech quality assessment for super-wideband speech communication networks,” in Proc. ICASSP 2019, 2019.

The NISQA code is licensed under the MIT License.

The model weights (nisqa.tar, nisqa_mos_only.tar, nisqa_tts.tar) are provided under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.

The NISQA Corpus is provided under the original terms of the source speech and noise samples used. More information can be found in the NISQA Corpus Wiki.
