huggingface/transformers

Add data2vec 2.0

formiel opened this issue · 4 comments

Model description

Hello,

The data2vec 2.0 paper was released quite a while ago and achieved impressive performance across different modalities: speech, text, and image (results similar to or better than data2vec 1.0, but much more efficient). In particular, the audio-only model appears to be one of the best SSL speech models using a base architecture (93M parameters). Therefore, I think it would be a nice addition to the transformers library.

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Paper: https://arxiv.org/abs/2212.07525
Code: https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec
Authors: @alexeib @michaelauli @wnhsu

cc @sanchit-gandhi for reference

Thanks for requesting the model @formiel! Data2Vec2 is indeed a cool model. However, it was only trained on LibriSpeech for the audio component and ImageNet for the image part. This limits its usefulness in downstream applications, since it has only been trained on one distribution per task, which is unlikely to match the distribution of many use-cases (c.f. the ESB paper for arguments supporting this claim).

The trend we're seeing is more and more usage of Whisper (trained on 680k hours of diverse data) and Wav2Vec2-BERT (pre-trained on 4M hours of unlabelled audio data), both of which achieve superior performance when averaged over a range of different distributions.

Hence, I'd be reluctant to add the model to the library, since it's likely to be used by only a handful of people.

Let me know if that makes sense! Happy to answer any questions 🤗

Hello @sanchit-gandhi,

Many thanks for your reply! Whisper and w2v2-BERT are indeed very powerful models that are widely used to build state-of-the-art systems across a wide range of tasks. Nonetheless, I think data2vec 2.0 would still be a nice addition to the library, and would benefit more than a handful of people. Last month data2vec had more than 7k downloads and wav2vec 2.0 nearly 1.9M, even though both were also trained only on the 960 hours of LibriSpeech, demonstrating the usefulness of these base models in speech-related tasks — and data2vec 2.0 was shown to be stronger and more efficient than both.

In addition, since w2v2-BERT was trained on much larger data than data2vec 2.0, training data2vec 2.0 on similarly large-scale datasets could yield results even better than w2v2-BERT's. And even without increasing the model size or training data, the data2vec objective has been shown to help many downstream tasks such as voice conversion (https://arxiv.org/abs/2404.09385).

For these reasons I believe that the addition of data2vec 2.0 to HF would still be beneficial to the community.

Hey @formiel - thanks so much for your thorough arguments! It's certainly true that there is still ongoing usage of audio encoder models like Wav2Vec2, HuBERT, WavLM and data2vec. Looking at the fairseq code for data2vec2, I believe the architectural changes between data2vec and data2vec2 are minimal? It looks more or less identical to data2vec, which we already have in Transformers.

If this is the case, I do agree that it would be worth converting the audio-only checkpoint from fairseq to Transformers, since it wouldn't introduce any additional maintenance burden to the library and would only serve to benefit the ASR community.

We can do this by loading the data2vec2 checkpoint in fairseq, and a data2vec checkpoint in Transformers, and mapping the weights from fairseq to Transformers. You can see an example of this for data2vec here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/data2vec/convert_data2vec_audio_original_pytorch_checkpoint_to_pytorch.py
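The core of that conversion is a key-renaming pass over the fairseq state dict. Here is a minimal sketch of the idea — the prefix rules below are illustrative assumptions, not the actual data2vec2 parameter names; the real mapping has to be read off the two models' `state_dict()` keys, as the linked conversion script does:

```python
# Sketch of the fairseq -> Transformers weight-mapping step.
# NOTE: the prefixes in RENAME_RULES are hypothetical examples;
# the true key names must be taken from the actual checkpoints.

RENAME_RULES = [
    # (assumed fairseq prefix, assumed Transformers prefix)
    ("encoder.layers.", "data2vec_audio.encoder.layers."),
    ("feature_extractor.", "data2vec_audio.feature_extractor."),
    ("post_extract_proj.", "data2vec_audio.feature_projection.projection."),
]

def map_key(fairseq_key: str) -> str:
    """Rename one fairseq parameter name to its (assumed) Transformers equivalent."""
    for old, new in RENAME_RULES:
        if fairseq_key.startswith(old):
            return new + fairseq_key[len(old):]
    return fairseq_key  # keys with no matching rule pass through unchanged

def convert_state_dict(fairseq_state_dict: dict) -> dict:
    """Apply the rename rules to every tensor in the checkpoint."""
    return {map_key(k): v for k, v in fairseq_state_dict.items()}
```

In practice you would load both models, run `convert_state_dict` on the fairseq weights, call `load_state_dict` on the Transformers model with `strict=True` so any unmapped or mismatched key fails loudly, and then check that both models produce (near-)identical outputs on the same dummy input.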

Should there be any significant model changes from data2vec to data2vec2, we would need to create a new data2vec2 model class in Transformers, and copy as much of the existing code from data2vec as possible (e.g. as we did going from wav2vec2 -> data2vec). You can read motivation for the design here.

Would you like to have a go at this model addition process @formiel? More than happy to help with any questions and queries!