SD-QA

Data File Structure

dev/
    -lang/
        -language.dev.csv
        -language.dev.txt
        -dialect/
            -language.dev.dialect-ASR.txt.jsonl.gz
            -language.dev.dialect-ASR.txt
            -metadata.csv
            -wav_lang/
                -ID.wav

test/
    -lang/
        -language.test.csv
        -language.test.txt
        -dialect/
            -language.test.dialect-ASR.txt.jsonl.gz
            -language.test.dialect-ASR.txt
            -metadata.csv
            -wav_lang/
                -ID.wav

asr_metadata/
    -dev/
        -asr_output_with_metadata_lang.csv
    -test/
        -asr_output_with_metadata_lang.csv

  • lang: language code, e.g. eng, ara
  • language.dev.csv: language-specific CSV file containing gold and ASR transcripts for all dialects
  • language.dev.txt: language-specific text file containing the gold data
  • language.dev.dialect-ASR.txt.jsonl.gz: language- and dialect-specific data file in TyDi QA format (the gold question is replaced with the transcript)
  • metadata.csv: metadata file (example ID to user ID mapping, with additional information for each dialect and language)
  • wav_lang: folder containing the audio files
  • asr_output_with_metadata_lang.csv: a single language-specific CSV file containing all metadata and transcripts, with the word error rate for each example
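
As a rough illustration of how the layout above might be traversed, the following Python sketch lists the dialect-level ASR files under a split directory and reads them line by line. The helper names find_asr_files and read_jsonl_gz are hypothetical and not part of the repository, and the glob pattern assumes the split/lang/dialect nesting shown in the tree.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterator, List


def find_asr_files(split_dir: str) -> List[Path]:
    """List every dialect-level ASR file under a split directory (dev/ or test/).

    Assumption: files sit at <split>/<lang>/<dialect>/ and end in
    '-ASR.txt.jsonl.gz', as in the directory tree above.
    """
    return sorted(Path(split_dir).glob("*/*/*-ASR.txt.jsonl.gz"))


def read_jsonl_gz(path: Path) -> Iterator[Dict]:
    """Yield one JSON object per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `for path in find_asr_files("dev"): print(path, sum(1 for _ in read_jsonl_gz(path)))` would print the number of examples per language-dialect file, provided the data has been downloaded into dev/.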

WER based evaluation on ASR outputs
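
For readers unfamiliar with the metric, here is a minimal Python sketch of word error rate: the word-level edit distance between a gold transcript and an ASR transcript, divided by the number of gold words. It assumes plain whitespace tokenization with no text normalization; the repository's own evaluation may preprocess transcripts differently, and the function name word_error_rate is hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

With this definition, a four-word gold question in which the ASR system gets one word wrong has a WER of 0.25.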

Comparative minimal answer predictions for Error analysis

Baseline-TydiQA

We train a TyDi QA baseline model for the primary task evaluation. Instead of the original training data, we use the discard_dev version, in which the SD-QA development questions are removed from the training data.
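
The idea behind a discard_dev training set can be sketched in Python as a simple filter. This is only an illustration: how the released file was actually built is not described here, and matching examples on an "example_id" field is an assumption. The function name discard_dev_examples is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set


def discard_dev_examples(
    train_examples: Iterable[Dict],
    dev_ids: Set[str],
    id_key: str = "example_id",  # assumption: examples are matched on this field
) -> Iterator[Dict]:
    """Yield only the training examples whose ID is not in the dev ID set."""
    for example in train_examples:
        if str(example[id_key]) not in dev_ids:
            yield example
```

IDs are compared as strings so that integer and string IDs from different files still match.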

Available model and training data for download:

Experimenting with a primary task baseline

Detailed steps to train a TyDi QA primary task baseline model are here

Prepare the training samples:
python3 baselines/tydiqa/baseline/prepare_tydi_data.py \
  --input_jsonl=tydiqa_data/tydiqa-v1.0-train-discard-dev.jsonl.gz \
  --output_tfrecord=tydiqa_data/train_tf/train_samples.tfrecord \
  --vocab_file=baselines/tydiqa/baseline/mbert_modified_vocab.txt \
  --record_count_file=tydiqa_data/train_tf/train_samples_record_count.txt \
  --include_unknowns=0.1 \
  --is_training=true
Prepare dev samples from all language- and dialect-specific ASR outputs:
./experiments/test_prep.sh tydiqa_data/dev tydiqa_data/dev_tf
Prepare test samples from all language- and dialect-specific ASR outputs:
./experiments/test_prep.sh tydiqa_data/test tydiqa_data/test_tf
Train:
python3 baselines/tydiqa/baseline/run_tydi.py \
  --bert_config_file=mbert_dir/bert_config.json \
  --vocab_file=baselines/tydiqa/baseline/mbert_modified_vocab.txt \
  --init_checkpoint=mbert_dir/bert_model.ckpt \
  --train_records_file=tydiqa_data/train_tf/train_samples.tfrecord \
  --record_count_file=tydiqa_data/train_tf/train_samples_record_count.txt \
  --do_train \
  --output_dir=trained_models/
Predict

Once the model is trained, we run inference on the dev/test set:

dev:

./experiments/test_predict.sh \
tydiqa_data/dev tydiqa_data/dev_predict tydiqa_data/dev_tf \
trained_models/model.ckpt discard_dev mbert_dir

test:

./experiments/test_predict.sh \
tydiqa_data/test tydiqa_data/test_predict tydiqa_data/test_tf \
trained_models/model.ckpt discard_dev mbert_dir
  • To pass your trained checkpoint to --init_checkpoint, write its actual location in place of trained_models/model.ckpt.
  • Write the location of the downloaded mBERT directory in place of mbert_dir.
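
The two substitutions above can be made less error-prone by assembling the command in a small Python helper. The sketch below only reproduces the argument order of the dev and test commands shown earlier; it does not inspect test_predict.sh. The function name build_predict_command and the data_root parameter are hypothetical.

```python
import shlex
from typing import List


def build_predict_command(
    split: str,        # "dev" or "test"
    checkpoint: str,   # location of the trained checkpoint
    mbert_dir: str,    # location of the downloaded mBERT directory
    data_root: str = "tydiqa_data",
) -> List[str]:
    """Build the test_predict.sh argument list in the order used in this README."""
    return [
        "./experiments/test_predict.sh",
        f"{data_root}/{split}",
        f"{data_root}/{split}_predict",
        f"{data_root}/{split}_tf",
        checkpoint,
        "discard_dev",
        mbert_dir,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_predict_command("dev", "trained_models/model.ckpt", "mbert_dir")))
```

The resulting list can be passed to subprocess.run once the paths have been checked.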
Evaluate

Citation

If you use SD-QA, please cite "SD-QA: Spoken Dialectal Question Answering for the Real World". You can use the following BibTeX entry:

@inproceedings{faisal-etal-21-sdqa,
  title = {{SD-QA}: {S}poken {D}ialectal {Q}uestion {A}nswering for the {R}eal {W}orld},
  author = {Faisal, Fahim and Keshava, Sharlina and ibn Alam, Md Mahfuz and Anastasopoulos, Antonios},
  url={https://arxiv.org/abs/2109.12072},
  year = {2021},
  booktitle = {Findings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings)},
  publisher = {Association for Computational Linguistics},
  month = {November},
}

We built our augmented dataset and baselines on top of TyDi QA. Please also cite the original TyDi QA paper:

@article{tydiqa,
title   = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages},
author  = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki},
journal = {TACL},
year    = {2020}
}

License

Both the code and data for SD-QA are available under the Apache License 2.0.