
Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".


Listen, Think, and Understand


Introduction

Illustration of CAV-MAE.

This repository contains the official implementation (in PyTorch), pretrained checkpoints, and datasets of LTU and LTU-AS. LTU and LTU-AS are the first generation of audio and speech large language models that bridge audio/speech perception with understanding. They not only achieve SOTA on multiple closed-ended audio and speech tasks, but can also answer any open-ended question based on the given audio. Please try the interactive demos to see how good they are!

[LTU Interactive Demo]

[LTU-AS Interactive Demo]


Citation

LTU (First Generation, Only Supports Audio):

LTU was accepted at ICLR 2024. See you in Vienna!

[Paper] [HuggingFace Space] [ICLR Peer Review]

Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)

@article{gong2023listen,
  title={Listen, Think, and Understand},
  author={Gong, Yuan and Luo, Hongyin and Liu, Alexander H and Karlinsky, Leonid and Glass, James},
  journal={arXiv preprint arXiv:2305.10790},
  year={2023}
}

LTU-AS (Second Generation, Supports Speech and Audio):

LTU-AS was accepted at ASRU 2023 (top 3% paper). See you in Taipei!

[Paper] [HuggingFace Space] [ASRU Peer Review]

Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)

@inproceedings{gong_ltuas,
  title={Joint Audio and Speech Understanding},
  author={Gong, Yuan and Liu, Alexander H and Luo, Hongyin and Karlinsky, Leonid and Glass, James},
  year={2023},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
}

OpenAQA (LTU) and OpenASQA (LTU-AS) Dataset

We release the training data for LTU (OpenAQA) and LTU-AS (OpenASQA). Specifically, we release the (question, answer, audio_id) tuples. The actual audio files come from existing public datasets and need to be downloaded by the users. We provide the full dataset (including all AQAs) as well as breakdowns (closed-ended and open-ended subsets, subsets of each original dataset, etc.). All links are hosted on Dropbox and support wget.

For LTU (OpenAQA)

Toy Set (Contains Raw Audio Files, for Testing Purpose Only):

For LTU: [Meta] [Audio]

OpenAQA Training (Only Audio Datasets, 5.6M AQAs in Total):

Full Dataset (2.3GB): [Download]

Breakdown Subsets: [Download]

LTU Evaluation Data: [Download]


For LTU-AS (OpenASQA)

Toy Set (Contains Raw Audio Files, for Testing Purpose Only):

For LTU-AS: [Meta] [Audio and Whisper Feature]

OpenASQA Training (Audio and Speech Datasets, 10.2M AQAs in Total):

Full Dataset (4.6GB): [Download]

Breakdown Subsets: [Download]

LTU-AS Evaluation Data: [Download]


When preparing audio files, please make sure all audio files use the same sampling rate of 16kHz.
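
A quick way to check the sampling rate before training is sketched below. It is not part of this repository: the function name find_wrong_rate is hypothetical, and because it relies only on Python's standard wave module it handles uncompressed WAV files only (FLAC or other formats would need a separate audio library).

```python
import wave
from pathlib import Path

TARGET_RATE = 16000  # 16kHz, as required above


def find_wrong_rate(audio_dir, target_rate=TARGET_RATE):
    """Return {path: rate} for every .wav under audio_dir whose rate differs from target_rate."""
    mismatched = {}
    for path in sorted(Path(audio_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != target_rate:
            mismatched[str(path)] = rate
    return mismatched
```

An empty result means every WAV file found is already at 16kHz; any listed file should be resampled before use.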

The format of the dataset is a JSON file of a list of dicts, in the following format:

[
 {
  "instruction": "What is the significance of the sound of crying in this audio clip?", % the question
  "input": "I am so sad...", % the speech content
  "audio_id": "/data/sls/audioset/dave_version/audio/LZq4Neh-oWU.flac", % the audio id
  "dataset": "as_strong_train", % the original dataset (optional)
  "task": "open-ended question", % question type (optional)
  "output": "The sound of crying suggests that there is a sad or emotional situation happening in the audio clip." % the answer
 },
  ...
]
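
As a rough sketch of how such a file could be read and sanity-checked, the snippet below loads the list and verifies that each entry carries the fields shown above. The helper name load_aqa is hypothetical, and treating "instruction", "input", "audio_id", and "output" as required is an assumption based on the example (only "dataset" and "task" are marked optional there). The % annotations in the example are explanatory only; an actual file must be plain JSON without them.

```python
import json

# Assumption: these four keys are required; "dataset" and "task" are optional.
REQUIRED_KEYS = ("instruction", "input", "audio_id", "output")


def load_aqa(json_path):
    """Load an OpenAQA/OpenASQA-style file and check that each entry has the required keys."""
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of dicts")
    for index, entry in enumerate(entries):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing keys: {missing}")
    return entries
```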

Set the Virtual Environment

Almost every use case requires a virtual environment. Note that LTU and LTU-AS need different environments because their hf-dev and peft-main packages differ. Please do not mix the venvs of LTU and LTU-AS.

Clone or download this repository as ltu-main, then:

For LTU:

cd ltu-main/src/ltu
conda create --name venv_ltu python=3.10
conda activate venv_ltu
pip install -r requirements.txt
# install customized hugging face transformer, the original transformer won't work
pip install -e hf-dev/transformers-main
# install customized hugging face peft, original peft won't work
pip install -e peft-main

For LTU-AS:

cd ltu-main/src/ltu_as
conda create --name venv_ltu_as python=3.10
conda activate venv_ltu_as
pip install -r requirements.txt
# install customized hugging face transformer, the original transformer won't work
pip install -e hf-dev/transformers-main
# install customized hugging face peft, original peft won't work
pip install -e peft-main/
# install customized openai-whisper, original whisper won't work 
pip install -e whisper/

Inference

We provide three options for inference.

Option 1. Inference via HuggingFace Space (No Code Needed)


[LTU Interactive Demo]

[LTU-AS Interactive Demo]

Option 2. Inference with API (No GPU Needed)

API supports batch inference with a simple for loop.

!pip install gradio_client

For LTU:

from gradio_client import Client

client = Client("https://yuangongfdu-ltu.hf.space/")
result = client.predict(
      "path_to_your_wav/audio.wav",  # your audio file in 16K
      "What can be inferred from the audio?",    # your question
      api_name="/predict"
)
print(result)

For LTU-AS:

# For LTU-AS
from gradio_client import Client

client = Client("https://yuangongfdu-ltu-2.hf.space/")
result = client.predict(
            "path_to_your_wav/audio.wav",  # your audio file in 16K
            "",
            "What can be inferred from the audio?",    # your question
            "7B (Default)",    # str in 'LLM size' Radio component
            api_name="/predict"
)
print(result)
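
The batch inference mentioned at the start of this option can be sketched as a plain loop around the same client.predict call. The helper name batch_ask is hypothetical; it mirrors the positional arguments of the LTU example (audio path, then question). For LTU-AS, the call would need the four positional arguments shown in the LTU-AS example instead.

```python
def batch_ask(client, audio_paths, question):
    """Ask the same question about each audio file, one API call per file."""
    answers = {}
    for audio_path in audio_paths:
        # Same arguments as the single-file LTU example: audio path, then question.
        answers[audio_path] = client.predict(audio_path, question, api_name="/predict")
    return answers


# Usage (needs network access and `pip install gradio_client`):
# from gradio_client import Client
# client = Client("https://yuangongfdu-ltu.hf.space/")
# answers = batch_ask(client, ["a.wav", "b.wav"], "What can be inferred from the audio?")
```

Passing the client in as an argument keeps the loop easy to test with a stand-in object and lets one connection be reused for all files.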

Option 3. Local Inference

For users interested in training/finetuning, we suggest starting by running inference, which helps with debugging. The bash scripts automatically download the default LTU/LTU-AS models, so you do not need to do it yourself. inference_gradio.py can be run on CPU or GPU.

For LTU:

conda activate venv_ltu
cd ltu-main/src/ltu
chmod 777 *
./inference.sh

The script may output some warnings, which can be ignored. After the script finishes, it will provide a gradio link for inference, which can be opened in a browser on any machine. You can also modify the script to run it in a local terminal.

We also provide a batch inference script, inference_batch.py.

For LTU-AS:

conda activate venv_ltu_as
cd ltu-main/src/ltu_as
chmod 777 *
./inference.sh

The script may output some warnings, which can be ignored. After the script finishes, it will provide a gradio link for inference, which can be opened in a browser on any machine.

We also provide a batch inference script, inference_batch.py. Note that this script loads pre-extracted Whisper features rather than raw WAV files. If you want to use raw audio files, please use inference_gradio.py. For how to extract Whisper features, see [here].

*GPU Issue for LTU-AS: We find that OpenAI Whisper features differ across GPUs, which affects the performance of LTU-AS because it takes Whisper features as input. In the paper, we always use features generated by old GPUs (Titan-X), but we also release a checkpoint trained with features generated by newer GPUs (A5000/A6000). Please switch the checkpoint manually to match the GPU you are running on (by default, this code uses the checkpoint for new-GPU features). A mismatch between training and inference GPUs does not completely break the model, but it does cause a performance drop.

Finetune LTU and LTU-AS

Finetune the LTU/LTU-AS Model with Toy Data

We do not provide raw audio files for OpenAQA and OpenASQA due to copyright reasons. However, for easy reproduction, we provide audio files and Whisper audio features for a small sample set (toy set). Specifically, we provide a very simple, almost one-click script to finetune the model. Once successful, you can change the toy data to your own data.

For both scripts:

  • You do not need to download the toy data; prep_train.sh will do this for you.
  • You do not need to download the pretrained model; prep_train.sh will download the default pretrained model. However, you can change which pretrained model to use in finetune_toy.sh.

For LTU:

conda activate venv_ltu
# this path matters: much of the code relies on relative paths
cd ltu-main/src/ltu/train_script
# make the scripts executable
chmod 777 *
# prepare toy data and pretrained models
./prep_train.sh
# run finetuning on the data
./finetune_toy.sh
# for (multiple) GPUs with <48GB memory use, slower
#./finetune_toy_low_resource.sh

You should see output similar to:

trainable params: 93065216 || all params: 6831480832 || trainable%: 1.3622993065290356
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 6306/6306 [00:02<00:00, 2626.08 examples/s]
{'loss': 0.6383, 'learning_rate': 1e-05, 'epoch': 0.41}                                                                                               
{'loss': 0.6052, 'learning_rate': 2e-05, 'epoch': 0.81}                                                                                               
{'train_runtime': 142.0142, 'train_samples_per_second': 44.404, 'train_steps_per_second': 0.169, 'train_loss': 0.6136090755462646, 'epoch': 0.97}    

For LTU-AS:

conda activate venv_ltu_as
# this path matters: much of the code relies on relative paths
cd ltu-main/src/ltu_as/train_script
# make the scripts executable
chmod 777 *
# prepare toy data and pretrained models
./prep_train.sh
# run finetuning on the data
./finetune_toy.sh
# for (multiple) GPUs with <48GB memory use, slower
#./finetune_toy_low_resource.sh

You should get something like:

trainable params: 48793600 || all params: 6787209216 || trainable%: 0.718905200166442
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 8769/8769 [00:04<00:00, 2088.17 examples/s]
{'loss': 0.6029, 'learning_rate': 2e-05, 'epoch': 0.29}                                                                                               
{'loss': 0.5805, 'learning_rate': 4e-05, 'epoch': 0.58}                                                                                               
{'loss': 0.5397, 'learning_rate': 6e-05, 'epoch': 0.87}                                                                                               
{'train_runtime': 175.7491, 'train_samples_per_second': 49.895, 'train_steps_per_second': 0.193, 'train_loss': 0.5713561913546394, 'epoch': 0.99} 

Finetune the LTU/LTU-AS Model with Your Own Data

For LTU, it is simple: replace --data_path '../../../openaqa/data/openaqa_toy_relative.json' in finetune_toy.sh with the path to your own data. Please make sure your own audio files are 16kHz. Absolute paths are encouraged; we use relative paths only to keep the one-click sample simple.
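
One possible way to produce such a data file is sketched below. It writes (audio path, question, answer) triples in the JSON list-of-dicts format described in the dataset section and converts audio paths to absolute paths. The function name build_dataset is hypothetical, and leaving "input" empty for clips without speech content is an assumption.

```python
import json
import os


def build_dataset(samples, out_path):
    """Write (audio_path, question, answer) triples as a list of dicts in the OpenAQA format."""
    entries = []
    for audio_path, question, answer in samples:
        entries.append({
            "instruction": question,
            "input": "",  # assumption: left empty when there is no speech content
            "audio_id": os.path.abspath(audio_path),  # absolute paths are encouraged
            "output": answer,
        })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=1, ensure_ascii=False)
    return entries
```

The resulting file path is what would then be passed to --data_path in finetune_toy.sh.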

For LTU-AS, it is a bit more complex. Our script does not load raw audio but pre-extracted Whisper features, so you first need to extract Whisper features for your own audio and then change the code in the HF transformer package to point to your Whisper feature directory (basically, you need to change [these lines of code]). For how to extract Whisper features, see [here].

Reproduce LTU and LTU-AS Training

We suggest that you first try finetuning with the toy data before doing this.

Reproducing training is similar to finetuning. The difference is that both LTU and LTU-AS training use multi-stage curriculums, so you need to start from stage 1, then run stage 2, and so on. For stage 2, change --base_model 'your_path_to_mdl/pytorch_model.bin' to the checkpoint of the model trained in stage 1, and likewise for each later stage.
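
When chaining stages, a small helper like the one below could locate the checkpoint to pass as --base_model. This is only a sketch under assumptions: it supposes the previous stage's output directory contains folders named checkpoint-<step>, each holding a pytorch_model.bin, and the function name latest_checkpoint is hypothetical. It does not verify that the file itself exists.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the pytorch_model.bin path inside the highest-numbered checkpoint-<step> folder."""
    best_step, best_dir = -1, None
    for child in Path(output_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_dir = int(match.group(1)), child
    if best_dir is None:
        raise FileNotFoundError(f"no checkpoint-<step> folder in {output_dir}")
    return best_dir / "pytorch_model.bin"
```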

For LTU:

conda activate venv_ltu
# this path matters: much of the code relies on relative paths
cd ltu-main/src/ltu/train_script
# make the scripts executable
chmod 777 *
# prepare data and pretrained models
./prep_train.sh
# run finetuning on the data
./stage1_proj_cla.sh
./stage2_all_cla.sh # need to specify the checkpoint in stage 1 training
./stage3_all_close.sh # need to specify the checkpoint in stage 2 training
./stage4_all_mix.sh # need to specify the checkpoint in stage 3 training

For LTU-AS:

conda activate venv_ltu_as
# this path matters: much of the code relies on relative paths
cd ltu-main/src/ltu_as/train_script
# make the scripts executable
chmod 777 *
# prepare data and pretrained models
./prep_train.sh
# run finetuning on the data
./finetune_toy.sh
./stage1_proj_cla.sh
./stage2_all_cla.sh # need to specify the checkpoint in stage 1 training
./stage4_all_mix_v2.sh # need to specify the checkpoint in stage 2 training

Pretrained Models

For most of the applications above, our scripts handle the model download (so you do not need to do it yourself), but we do provide more checkpoints.

Other models mentioned in the paper may be provided upon request; please create a GitHub issue to ask.

LTU Models

LTU Model | Size | Train Seq Length | Train Steps | Whisper Feature GPU | Not Answerable Questions | Link
Original in Paper (Default) | 370M | 108 | 20000 | - | Included | Download
Full-Finetuned (include LLM Parameters) | 27G | 108 | 20000 | - | Included | Download

LTU-AS Models

LTU-AS Model | Size | Train Seq Length | Train Steps | Whisper Feature GPU | Not Answerable Questions | Link
Original in Paper | 200M | 108 | 40000 | Old GPUs (Titan) | Included | Download
Long_sequence_exclude_noqa_old_gpu | 200M | 160 | 40000 | Old GPUs (Titan) | Excluded | Download
Long_sequence_exclude_noqa_new_gpu (Default) | 200M | 160 | 40000 | New GPUs (A5000/6000) | Excluded | Download
Full-Finetuned (include LLM Parameters) | 27G | 160 | 40000 | Old GPUs (Titan) | Excluded | Download

More Pretrained Models

We provide the following models to help reproduction.

1. Checkpoints of Each Stage in the Training Curriculum (with Loss Log)

These checkpoints can be used to continue training from any stage, e.g., you can train your own stage 4 model based on a stage 3 checkpoint. You can compare our loss log with yours to ensure everything is OK.

LTU: [Download Link]

Including Stage 1/2/3/4 checkpoints. Training arguments and loss logs are provided to help reproduction.

LTU-AS: [Download Link]

Including Stage 1/2/3 checkpoints. For the final stage 3 checkpoint, we provide v1 and v2 (with more joint audio and speech training data) checkpoints. Training arguments and loss logs are provided to help reproduction.

Where are the loss logs? Click one of the links above and, in the folders named "checkpoint-xxxx", find the files called trainer_state.json. You should see something like:

"log_history": [
    {
      "epoch": 0.0,
      "learning_rate": 0.0001,
      "loss": 8.7039,
      "step": 10
    },
    {
      "epoch": 0.0,
      "learning_rate": 0.0002,
      "loss": 5.5624,
      "step": 20
    },
    {
      "epoch": 0.01,
      "learning_rate": 0.0003,
      "loss": 4.1076,
      "step": 30

That is the actual loss log. We release the logs of all stages for both LTU and LTU-AS to help reproduction.
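
To compare a released log with your own run, the "log_history" entries can be pulled out with a few lines of Python. This is a minimal sketch: the function name read_loss_log is hypothetical, and it assumes each record of interest has the "step" and "loss" keys shown in the excerpt above, skipping any record that lacks them.

```python
import json


def read_loss_log(trainer_state_path):
    """Return (step, loss) pairs from the "log_history" list of a trainer_state.json file."""
    with open(trainer_state_path, "r", encoding="utf-8") as f:
        state = json.load(f)
    # Records without both keys (for example, a summary entry) are skipped.
    return [
        (record["step"], record["loss"])
        for record in state["log_history"]
        if "step" in record and "loss" in record
    ]
```

Running it on both your trainer_state.json and the released one gives two lists that can be compared step by step.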

2. LLaMA 13B Models (including 13B model script)

Our papers mostly focus on LLaMA-7B models, but we also provide LLaMA-13B checkpoints. You need to replace the model scripts (For LTU and For LTU-AS) with their 13B versions, which can be downloaded from the following links.

LTU-13B: [Download Link]

Including Stage 1/2/3/4 checkpoints. For stage 4, we provide a standard sequence length model (108) and a longer sequence model (192). We recommend using the stage4_all_mix_long_seq model.

LTU_AS-13B: [Download Link]

Including Stage 1/2/3 checkpoints. For stage 3, we provide a model trained with not-answerable QA training data and a model trained without it. We recommend using the stage3_long_seq_exclude_noqa model.

Important Code

This is a large code base, and we are unable to explain every file. Below are the parts of the code that we think are important.

  1. The LTU/LTU-AS model architectures are in LTU Architecture and LTU-AS Architecture, respectively.
  2. The training data collectors for LTU and LTU-AS are here and here, respectively.
  3. The text generation code for LTU and LTU-AS is here and here, respectively.
  4. The closed-ended evaluation code for LTU and LTU-AS is here and here, respectively.
  5. The GPT-assisted data generation code for LTU and LTU-AS is here and here, respectively.
  6. The Whisper feature extraction code for LTU-AS is here.
  7. The training shell scripts with our hyperparameters for each stage for LTU and LTU-AS are here and here, respectively.
  8. The finetuning Python scripts (which are called by the above shell scripts) for LTU and LTU-AS are here and here, respectively.

For training, the starting point is the training shell scripts here and here. These shell scripts call ltu-main/{ltu,ltu_as}/finetune.py, which in turn calls the customized Hugging Face transformers package (which contains the LTU/LTU-AS model) and the peft package.

If you have a question about the code, please create an issue.

Required Computational Resources

For LTU/LTU-AS training, we use 4 X A6000 (4 X 48GB = 192GB VRAM). The code can be run on 1 X A6000 (or similar GPUs).

To train/finetune on smaller GPUs, turn on model parallelism. We were able to run it on 4 X A5000 (4 X 24GB = 96GB), and we provide sample scripts for low-resource training for LTU and LTU-AS. Please note that they are slower than the normal training scripts.

For inference, the minimum is 2 X TitanX (2 X 12GB = 24GB) for LTU and 4 X TitanX (4 X 12GB = 48GB) for LTU-AS (as Whisper takes some memory). However, you can also run inference on CPUs.

Mirror Links

All resources are hosted on Dropbox, support wget, and should be available in most countries/areas. For those who cannot access Dropbox, a VPN is recommended. We also provide a mirror link at Tencent Cloud 腾讯微云; however, you would need to manually place the model/data in the desired location, because our automatic scripts will fail.

Contact

If you have a question, please create an issue. I usually respond promptly; if I am delayed, please ping me. For more personal or confidential requests, please email me at yuangong@mit.edu.