ARMHuBERT

(Interspeech 2023 & ICASSP 2024) Official repository for ARMHuBERT and STaRHuBERT

Primary language: Python. License: Apache-2.0.

♽ Recycle-and-Distill (Interspeech 2023)


Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation, INTERSPEECH 2023.

Kangwook Jang*, Sungnyun Kim*, Se-Young Yun, Hoirin Kim
* equal contribution

  • Attention Map Reusing: Reuse previous layer's attention map to remove key & query parameters in Transformer
  • Masking Distillation: Distill masked frames and unmasked frames separately
  • Parameters and MACs of ARMHuBERT have decreased to 28% and 30% of the teacher, HuBERT Base, respectively.
  • ARMHuBERT achieves PER of 7.72%, WER of 9.96% on the SUPERB benchmark in an E2E distillation manner.
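
To make the attention map reusing idea concrete, here is a minimal single-head sketch in plain Python. It is not the repository's implementation: the helper names (`matmul`, `attention_map`, `reused_attention_layer`) and the toy inputs are assumptions for illustration, and multi-head attention, output projections, and residual connections are left out. It only shows that a layer which reuses the previous layer's attention map needs a value projection but no query or key projection.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(x, w_q, w_k):
    """Standard single-head attention map: softmax(Q K^T / sqrt(d))."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, list(zip(*k)))  # zip(*k) is K transposed
    return [softmax([s / scale for s in row]) for row in scores]


def reused_attention_layer(x, prev_attn, w_v):
    """Attention output that reuses prev_attn, so only W_v is needed."""
    return matmul(prev_attn, matmul(x, w_v))


# Toy inputs (hypothetical): 3 frames, 2 channels, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attn = attention_map(frames, identity, identity)      # layer i computes its own map
out = reused_attention_layer(frames, attn, identity)  # layer i+1 reuses that map
```

In this sketch the second layer carries no `w_q` or `w_k` at all, which is where the parameter saving described above comes from.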

📌 Check out our model's performance on the SUPERB Leaderboard!

🤗 Checkpoints

For our model's checkpoints, go check this link!

| Model name | Parameters | Teacher | Training dataset | Link |
|---|---|---|---|---|
| ARMHuBERT-960h | 26.45M | HuBERT | LibriSpeech-960h | HF Model |
| ARMHuBERT-S-100h | 22.39M | HuBERT | LibriSpeech-100h | HF Model |
| ARMHuBERT-S-960h | 22.39M | HuBERT | LibriSpeech-960h | HF Model |
| ARMwavLM-S-100h | 22.39M | wavLM | LibriSpeech-100h | HF Model |
| ARMwavLM-S-960h | 22.39M | wavLM | LibriSpeech-960h | HF Model |
| MaskHuBERT-960h | 26.64M | HuBERT | LibriSpeech-960h | HF Model |
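
As a quick sanity check of the "28% of the teacher" figure, the parameter counts in the table can be compared against the teacher size. The student counts below are copied from the table; the teacher count of 94.68M for HuBERT Base is an assumption taken from the figure commonly reported for that model, not from this repository.

```python
# Parameter counts in millions, copied from the checkpoint table.
student_params_m = {
    "ARMHuBERT-960h": 26.45,
    "ARMHuBERT-S-960h": 22.39,
    "MaskHuBERT-960h": 26.64,
}

# Assumption: HuBERT Base has about 94.68M parameters.
TEACHER_PARAMS_M = 94.68

# Percentage of the teacher's parameters kept by each student.
ratios = {name: 100.0 * p / TEACHER_PARAMS_M for name, p in student_params_m.items()}

for name, pct in ratios.items():
    print(f"{name}: {pct:.1f}% of the teacher's parameters")
```

Under that assumption, ARMHuBERT-960h comes out at roughly 28% of the teacher, in line with the figure quoted in the summary.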

How to use this repo

Requirements

Install the necessary packages with:

$ pip install -r requirements.txt

Distillation

  1. Download the teacher model checkpoint to perform knowledge distillation, and place it under the root path, ./.

    • For HuBERT Base: link (hubert_base_ls960.pt)
    • For wavLM Base: link (wavlm_base.pt)
  2. Download the LibriSpeech dataset.

    • For 100h distillation, download train-clean-100
    • For 960h distillation, download the whole dataset: train-clean-100, train-clean-360, and train-other-500
    • For validation, download dev-clean
      • You can also validate your model on test-clean. In that case, download test-clean and modify self.eval_data in train.py.
  3. Modify the configuration file in ./conf/[model_name]/[config].yaml.

    • For example, the configuration file ./conf/armhubert/armhubert-960.yaml contains all the settings for reproducing ARMHuBERT on the LibriSpeech 960h dataset.
    • Set the path to the teacher model checkpoint at teacher_model, and the root path to the LibriSpeech dataset at libri_root.
  4. Then, run the following command:

python train.py -c ./conf/[model_name]/[config].yaml

For ARMHuBERT, python train.py -c ./conf/armhubert/armhubert-960.yaml

After training, the model checkpoints and the corresponding configuration file will be created at ./results/pretrain/.
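
Before launching a long distillation run, it can help to confirm that the LibriSpeech splits listed in step 2 are actually present under the folder you set as libri_root. The helper below is a hypothetical convenience sketch, not part of this repository; it assumes the split folders (train-clean-100, dev-clean, and so on) sit directly under libri_root, which may differ from your layout.

```python
from pathlib import Path

# Training splits required for each distillation setting (from step 2).
TRAIN_SPLITS = {
    "100h": ["train-clean-100"],
    "960h": ["train-clean-100", "train-clean-360", "train-other-500"],
}


def missing_librispeech_splits(libri_root, hours="960h", eval_split="dev-clean"):
    """Return the required split folders that are absent under libri_root.

    Assumes each split is a directory directly under libri_root.
    """
    root = Path(libri_root)
    required = TRAIN_SPLITS[hours] + [eval_split]
    return [name for name in required if not (root / name).is_dir()]
```

For example, `missing_librispeech_splits("/data/LibriSpeech", hours="100h")` returns an empty list when both train-clean-100 and dev-clean exist there (the path is a placeholder).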

Fine-tuning

  1. If you don't feel like training your model, feel free to use our checkpoints.

  2. Clone and install the S3PRL toolkit with pip install -e ".[all]" (dev mode).

  3. Copy the entire ./models/[model_name] folder into <s3prl root>/s3prl/upstream/.

  4. Add the upstream import line to <s3prl root>/s3prl/hub.py.

    from s3prl.upstream.[model_name].hubconf import *
    

    For ARMHuBERT,

    from s3prl.upstream.armhubert.hubconf import *
    
  5. Change the config file of each s3prl downstream task as follows.

    • Uncomment the learning rate scheduler
    • Scale the learning rate by 10x for the speaker identification (SID) task
  6. Run the following command to fine-tune the ARMHuBERT model.

    For automatic speech recognition (ASR) as an example:

    # Set the experiment name (-n) to whatever you want
    python run_downstream.py \
    -m train \
    -n ARMHuBERT-ASR \
    -u armhubert \
    -d asr \
    -k <path to .ckpt file in <git root>/results/pretrain/> \
    -g <path to .yaml file in <git root>/results/pretrain/>
    

    Note: Refer to the SUPERB docs for more information on usage details and data preparation.
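
The hub.py edit in step 4 can also be scripted so that it is safe to run more than once. The function below is a hypothetical helper, not something shipped with this repository or with S3PRL; it simply appends the import line from step 4 when the file does not already contain it.

```python
from pathlib import Path


def ensure_upstream_import(hub_py, model_name):
    """Append the upstream import line to hub.py unless it is already there.

    Returns True if the file was modified, False otherwise.
    """
    line = f"from s3prl.upstream.{model_name}.hubconf import *"
    path = Path(hub_py)
    text = path.read_text(encoding="utf-8")
    # Skip the edit when an identical line is already present.
    if any(existing.strip() == line for existing in text.splitlines()):
        return False
    # Make sure the new import starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + line + "\n", encoding="utf-8")
    return True
```

A call such as `ensure_upstream_import("<s3prl root>/s3prl/hub.py", "armhubert")` would then cover step 4 for ARMHuBERT, with the placeholder replaced by your actual S3PRL path.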

Result

result

We evaluate our student models on the SUPERB benchmark.

MaskHuBERT substantially improves performance on content- and semantics-related tasks (see PR, ASR, SF, and IC).

ARMHuBERT shows promising improvements when compared to MaskHuBERT in SF and SID tasks, exhibiting a similar level of performance in other tasks.

ARMHuBERT achieves a better overall score of 78.1 with fewer parameters than MaskHuBERT. This is state-of-the-art performance among end-to-end distillation approaches such as Deep-versus-wide 12-L and FitHuBERT.

The results also show that our strategy works with another Transformer backbone model, wavLM.

Try this distillation strategy with your Transformer backbone models

We have only evaluated HuBERT-based models, but this strategy can be applied identically to any speech model with a Transformer backbone, for example AST (Audio Spectrogram Transformer).

BibTeX

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{jang2023recycleanddistill,
  title={Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation},
  author={Kangwook Jang and Sungnyun Kim and Se-Young Yun and Hoirin Kim},
  booktitle={Proc. INTERSPEECH 2023},
  pages={316--320},
  year={2023}
}

🌟 STaR (ICASSP 2024)

🎉 Update (Apr 12, 2024): Our new paper, STaR, has been selected as Best Student Paper in ICASSP 2024!
🎉 Check out our model's performance on the SUPERB Leaderboard!

STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models, ICASSP 2024.

Kangwook Jang, Sungnyun Kim, Hoirin Kim

  • Speech Temporal Relation (STaR): Distill the knowledge by focusing on the pairwise temporal relation between two speech frames.
  • Temporal Gram Matrix (TGM): Propose the Temporal Gram Matrix, which aggregates channel information at two time steps.
    • Layer-wise TGM: Distill the TGM for every Transformer layer.
    • Intra-layer TGM: Modify the TGM to compute the temporal relation between the input and output of a single Transformer layer.
  • With the two TGMs combined as distillation objectives, our student model STaRHuBERT (22M & 26M) achieves SOTA performance on the SUPERB benchmark in terms of overall and generalizability scores.
  • Under further compression (9.39M & 14.1M), our approach is more robust to performance degradation than other works.
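
The Temporal Gram Matrix can be sketched in a few lines of plain Python. This is an illustrative reading of the bullets above, not the paper's exact objective: the function names are hypothetical, mean squared error is an assumed choice of distance, and any feature normalization or loss weighting is omitted. The point it shows is that the TGM is T x T regardless of channel width, so teacher and student TGMs can be compared directly even when the student is narrower.

```python
def temporal_gram(a, b=None):
    """T x T matrix whose (t, s) entry is the channel-wise inner product of a[t] and b[s].

    With b omitted this is the layer-wise TGM of one feature sequence;
    with a = layer input and b = layer output it is the intra-layer variant.
    """
    b = a if b is None else b
    return [[sum(x * y for x, y in zip(fa, fb)) for fb in b] for fa in a]


def mse(m1, m2):
    """Mean squared error between two equally sized matrices."""
    diffs = [(x - y) ** 2 for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2)]
    return sum(diffs) / len(diffs)


# Toy features (hypothetical): 3 frames each, but different channel widths.
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Layer-wise TGM: both matrices are 3 x 3, so no projection layer is needed.
layerwise_loss = mse(temporal_gram(teacher), temporal_gram(student))

# Intra-layer TGM: relate a layer's input frames to its output frames.
student_out = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
intra_tgm = temporal_gram(student, student_out)
```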

Checkpoints

For our model's checkpoints, please check the following links. All models are distilled from HuBERT base.

Distillation

We do not provide official implementation code for distillation. Nevertheless, since STaRHuBERT is built on the backbone of ARMHuBERT, you can easily re-implement our approach with this ARMHuBERT repository.

Fine-tuning

You can reproduce our results with the provided checkpoints by following the steps below. (This is almost the same as the ARMHuBERT case.)

  1. Clone and install the S3PRL toolkit with pip install -e ".[all]" (dev mode).

  2. Copy the entire ./models/starhubert folder into <s3prl root>/s3prl/upstream/.

  3. Add the upstream import line to <s3prl root>/s3prl/hub.py.

    from s3prl.upstream.starhubert.hubconf import *
    
  4. Change the config file of each s3prl downstream task as follows.

    • Uncomment the learning rate scheduler
    • Scale the learning rate by 10x for the speaker identification (SID) task
  5. Run the following command to fine-tune the STaRHuBERT model.

    For automatic speech recognition (ASR) as an example:

    # Set the experiment name (-n) to whatever you want
    python run_downstream.py \
    -m train \
    -n STaRHuBERT-ASR \
    -u starhubert \
    -d asr \
    -k <path to .ckpt file in <git root>/results/pretrain/> \
    -g <path to .yaml file in <git root>/results/pretrain/>
    

    Note: Refer to the SUPERB docs for more information on usage details and data preparation.
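
If you launch several fine-tuning runs, assembling the command in Python avoids shell quoting and line-continuation mistakes. The sketch below only mirrors the flags of the command shown above; `build_finetune_command` is a hypothetical helper and the two paths are placeholders, not files shipped with this repository.

```python
import shlex


def build_finetune_command(exp_name, upstream, task, ckpt_path, config_path):
    """Assemble the run_downstream.py call as an argument list."""
    return [
        "python", "run_downstream.py",
        "-m", "train",
        "-n", exp_name,     # experiment name, free to choose
        "-u", upstream,     # upstream name registered in hub.py
        "-d", task,         # downstream task, e.g. asr
        "-k", ckpt_path,    # path to the .ckpt file
        "-g", config_path,  # path to the .yaml file
    ]


cmd = build_finetune_command(
    "STaRHuBERT-ASR", "starhubert", "asr",
    "path/to/model.ckpt", "path/to/config.yaml",  # placeholder paths
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains run_downstream.py.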

BibTeX

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{jang2024star,
  title={STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models},
  author={Jang, Kangwook and Kim, Sungnyun and Kim, Hoirin},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={10721--10725},
  year={2024},
  organization={IEEE}
}

Contact

For any details or clarification, please reach out to