
Textless (ASR-transcript-free) Spoken Question Answering. The official release of the NMSQA dataset and the implementation of the paper "DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning".



This repository is under construction; please stay tuned for updates.

This repository is the official implementation of the paper "DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering" and the release of the Natural Multi-speakers Spoken Question Answering (NMSQA) dataset.




Download our NMSQA dataset

Data Preparation for Original Dataset

Preprocessed data link (including passage merging and unit-level labels): [link]

  • Directory format

    • train
    • dev
    • test
  • Files

    • For train and dev split
      • {split}-answer-span.csv: answer time span in seconds
      • meta-{split}.csv: the duration, speaker, and transcription of each utterance
      • {split}-textgrid.tar.gz: force alignment of each utterance
      • {split}_audio.tar.gz: utterance waveform files
      • {split}_hash2question.json: map the hash value to question id
    • For test split
      • lxt_sqa.tar.gz: contains all audio files in audio and transcriptions
      • meta-lxt.csv: the duration, speaker, and transcription of each utterance
      • test/test-SQuAD/test-SQuAD-answer-span.csv: the answer span in the test-SQuAD split
      • test/test-OOD/test-OOD-answer-span.csv: the answer span in the test-OOD split

    NOTE: Currently, each spoken passage is split into utterance segments. For the standard QA task, you should merge the segments back into whole passages. The suffix -1, -2, ..., -n is the segment number within a specific passage.

    • Speech Content Encoder: please see details in speeech-content-encoder.
    • Pre-process the QA labels
    python code_answer.py
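
The segment merging mentioned in the NOTE above could be sketched as follows. This is a minimal illustration, not the repository's own preprocessing code: it assumes utterance ids have the form `<passage_id>-<n>`, and the helper names `group_segments` and `segment_offsets` as well as the `durations` mapping (which might be built from the duration column of meta-{split}.csv) are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming scheme: "<passage_id>-<n>", n being the segment number.
_SEGMENT_RE = re.compile(r"^(?P<passage>.+)-(?P<index>\d+)$")


def group_segments(utterance_ids):
    """Map each passage id to its segment ids, ordered by segment number."""
    grouped = defaultdict(list)
    for utt_id in utterance_ids:
        match = _SEGMENT_RE.match(utt_id)
        if match is None:
            raise ValueError(f"unexpected utterance id: {utt_id!r}")
        grouped[match["passage"]].append((int(match["index"]), utt_id))
    # Sort numerically so that segment 10 comes after segment 2.
    return {p: [u for _, u in sorted(segs)] for p, segs in grouped.items()}


def segment_offsets(ordered_ids, durations):
    """Start time in seconds of each segment inside the merged passage."""
    offsets, elapsed = {}, 0.0
    for utt_id in ordered_ids:
        offsets[utt_id] = elapsed
        elapsed += durations[utt_id]
    return offsets
```

Adding a segment's offset to a segment-level answer time would give the position in the merged passage, assuming the segments are concatenated without gaps.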

Parquet Format & Huggingface Format dataset

It basically follows the same file format as the original SQuAD, with the following extra fields:

   "id": Same as SQuAD,
   "title": Same as SQuAD,
   "context": Same as SQuAD,
   "question": Same as SQuAD,
   "answers": {
      "answer_start": Same as SQuAD,
      "audio_full_answer_end": [], Audio answer end position in seconds
      "audio_full_answer_start": [], Audio answer start position in seconds
      "audio_full_neg_answer_end": [], Audio end position in seconds of a span that uses the same words but is not the correct answer
      "audio_full_neg_answer_start": [], Audio start position in seconds of a span that uses the same words but is not the correct answer
      "text": Same as SQuAD
   },
   "content_segment_audio_path": Segment Audio Path,
   "content_full_audio_path": Complete Audio Path,
   "content_audio_sampling_rate": Audio Sampling Rate,
   "content_audio_speaker": Audio Speaker,
   "content_segment_normalized_text": Normalized Text for generating audio,
   "question_audio_path": Question Audio Path,
   "question_audio_sampling_rate": Audio Sampling Rate,
   "question_audio_speaker": Audio Speaker,
   "question_normalized_text": Normalized Text for generating audio,

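Because the audio answer positions are given in seconds, they usually have to be converted to sample indices before slicing a waveform. The sketch below shows one way to do that for a single record. It assumes the answer fields are nested under an `answers` key as in SQuAD, and that `content_audio_sampling_rate` applies to the full passage audio; the function name `answer_sample_ranges` is hypothetical.

```python
def answer_sample_ranges(record):
    """Convert audio answer spans from seconds to (start, end) sample indices."""
    rate = record["content_audio_sampling_rate"]
    answers = record["answers"]  # assumed SQuAD-style nesting
    starts = answers["audio_full_answer_start"]
    ends = answers["audio_full_answer_end"]
    return [
        (int(round(start * rate)), int(round(end * rate)))
        for start, end in zip(starts, ends)
    ]


# Toy record with made-up values, only to show the expected shape.
example = {
    "content_audio_sampling_rate": 16000,
    "answers": {
        "audio_full_answer_start": [1.5],
        "audio_full_answer_end": [2.25],
    },
}
print(answer_sample_ranges(example))
```

The resulting index pairs can be used to slice the waveform loaded from `content_full_audio_path`.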

python train.py --exp_name [exp name] --config baseline.yaml


python evaluate.py --data_dir [data dir path] --model_path [model checkpoint dir] --output_dir [output dir path] --out_fname [output name]


Discrete unit   PLM          dev FF1   dev AOS   test FF1   test AOS
HuBERT-64       Longformer   47.8      42.4      39.0       33.0
HuBERT-128      Longformer   54.2      48.5      56.0       49.1
HuBERT-512      Longformer   55.0      49.6      17.3       12.5
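
FF1 and AOS are not defined in this README. The sketch below assumes the definitions commonly used for spoken QA: Frame-level F1 (the harmonic mean of precision and recall over the overlapping time) and Audio Overlapping Score (intersection over union of the predicted and ground-truth time spans). It is an illustration for a single prediction, not the code in evaluate.py, and the function names are hypothetical.

```python
def _overlap(pred, gold):
    """Length in seconds of the intersection of two (start, end) spans."""
    return max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))


def frame_f1(pred, gold):
    """F1 between a predicted and a ground-truth time span."""
    inter = _overlap(pred, gold)
    pred_len, gold_len = pred[1] - pred[0], gold[1] - gold[0]
    if inter == 0.0 or pred_len <= 0 or gold_len <= 0:
        return 0.0
    precision, recall = inter / pred_len, inter / gold_len
    return 2 * precision * recall / (precision + recall)


def audio_overlapping_score(pred, gold):
    """Intersection over union of a predicted and a ground-truth time span."""
    inter = _overlap(pred, gold)
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```

Under these assumptions, a corpus-level score would be the average over all questions, reported as a percentage as in the table above.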


Guan-Ting Lin (Email: daniel094144@gmail.com)
Eric Lam (Email: voidful.stack@gmail.com)


    title={DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning},
    author={Lin, Guan-Ting and Chuang, Yung-Sung and Chung, Ho-Lam and Yang, Shu-wen and Chen, Hsuan-Jui and Li, Shang-Wen and Mohamed, Abdelrahman and Lee, Hung-yi and Lee, Lin-shan},
    journal={arXiv preprint arXiv:2203.04911},