
WSPSR

A multi-modal audio-to-text encoder-decoder model trained on a large, weakly supervised dataset.

Overview

Relevant work

Pretrained multilingual language models

Multilingual language models such as mBERT and XLM-R are large language models based on the BERT or RoBERTa architectures that are trained on many languages at once rather than just one. mBERT and XLM-R are both encoder-only language models with vocabularies of between 110K and 250K tokens. Training on many languages has been found to increase accuracy, particularly for low-resource languages, since knowledge of one language appears to transfer to others, even when they are unrelated.

wav2vec

Wav2vec[4] and other audio models rely on a method of feature extraction in which raw audio is sampled (usually at 16,000 Hz) and converted through a Fourier transform over time into a log-mel spectrogram. The coefficients of the spectrogram are fed into a convolutional neural network with between 2 and 5 layers, which is trained to output vectors called 'features' that can be used like tokens in a traditional text-based transformer.

(Figure: audio encoding)
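
Below is a minimal sketch of the front end described above, using torchaudio to compute a log-mel spectrogram. The window, hop, and channel values are illustrative rather than taken from any particular model.

```python
# Minimal sketch of the audio front end described above: sample at 16 kHz,
# apply a short-time Fourier transform, and map the result onto a log-mel
# scale. Parameter values here are illustrative.
import torch
import torchaudio

SAMPLE_RATE = 16_000
waveform = torch.randn(1, SAMPLE_RATE * 5)  # 5 seconds of placeholder audio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,       # STFT window size in samples
    hop_length=160,  # step between successive windows
    n_mels=80,       # number of mel filterbank channels
)
log_mel = torch.log(to_mel(waveform) + 1e-10)  # shape: (1, 80, n_frames)

# Each column of log_mel is one frame; a small CNN (2-5 layers) would map
# these frames to the feature vectors that the transformer consumes.
print(log_mel.shape)
```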

An encoder/decoder audio model

Current audio transformers, such as wav2vec, are encoder-only, meaning that they must be finetuned before they can be used for a downstream task like transcription. This can cause some problems:

  • Machine learning is good at cheating: a finetuned model can latch onto quirks of the finetuning dataset rather than learning patterns that generalize
  • Finetuned models are prone to overfitting

To make matters worse, most current audio models have very small training sets, because high-quality labelled audio data is difficult to obtain. For example, Wav2Vec uses only 960 hours of audio (which may seem like a lot, but it is tiny compared with the 680,000 hours WSPSR trains on).


Enter WSPSR

Three major improvements:

  • Multimodal encoder-decoder architecture
  • Task-specific tokens fed to the decoder (see the sketch after this list)
  • Large dataset with weak supervision
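
To make the second bullet concrete, here is a rough sketch of how task-specific tokens might be assembled into the decoder's prompt. The special-token strings follow the format used in the Whisper paper, but build_decoder_prompt is an illustrative helper, not the released library's API.

```python
# Illustrative helper showing how task-specific tokens condition the decoder.
# The special-token strings follow the Whisper paper's format; this is a
# sketch, not the tokenizer interface of the released code.
def build_decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Assemble the sequence of special tokens fed to the decoder."""
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt

# Transcribe English speech vs. translate French speech into English.
print(build_decoder_prompt("en", "transcribe"))
print(build_decoder_prompt("fr", "translate"))
```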

The Model

  • Audio is resampled to 16,000 Hz
  • An 80-channel log-magnitude Mel spectrogram is computed on 25 ms windows with a stride of 10 ms
  • The input is normalized
  • Features are extracted with a small CNN and then fed to the encoder (see the sketch after this list)
  • The architecture is similar to the original encoder/decoder transformer, with some special tokens
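
The list above maps fairly directly onto code. The sketch below assumes an 80-channel log-mel input (computed as in the earlier snippet), applies a simple zero-mean/unit-variance normalization as a stand-in for the model's actual scheme, and runs a small two-layer convolutional stem; the embedding width is illustrative.

```python
# Sketch of the steps in the list above, starting from an 80-channel log-mel
# spectrogram: normalize, then apply a small CNN stem (two convolutions with
# GELU activations, the second with stride 2) to get the encoder's inputs.
# The normalization and the embedding width are illustrative choices.
import torch
import torch.nn as nn

d_model = 512  # illustrative embedding width

stem = nn.Sequential(
    nn.Conv1d(80, d_model, kernel_size=3, padding=1),
    nn.GELU(),
    nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
)

log_mel = torch.randn(1, 80, 3000)  # 30 s of audio at a 10 ms frame stride
normalized = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
encoder_input = stem(normalized)    # shape: (1, d_model, 1500)
print(encoder_input.shape)
```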

Architecture

(Figure: the WSPSR pipeline)

Pseudocode

Input: 𝒛, a sequence of Mel spectrogram frames (the audio context); 𝒙 ∈ V*, a sequence of token IDs (the text so far).
Output: 𝑷 ∈ (0, 1)^(N_V × length(𝒙)), where the t-th column of 𝑷 represents P̂_θ(x[t+1] | 𝒙[1:t], 𝒛).
Hyperparameters: ℓ_max, L_enc, L_dec, H, d_e, d_mlp ∈ ℕ
Parameters: 𝜽 includes all of the following parameters:

  • 𝑾_e ∈ ℝ^(d_e × N_V), 𝑾_p ∈ ℝ^(d_e × ℓ_max), the token and positional embedding matrices.
  • For l ∈ [L_enc]:
    • 𝑾^l_enc, multi-head attention parameters for layer l, see (4),
    • 𝜸^l_1, 𝜷^l_1, 𝜸^l_2, 𝜷^l_2 ∈ ℝ^(d_e), two sets of layer-norm parameters,
    • 𝑾^l_mlp1 ∈ ℝ^(d_mlp × d_e), 𝒃^l_mlp1 ∈ ℝ^(d_mlp), 𝑾^l_mlp2 ∈ ℝ^(d_e × d_mlp), 𝒃^l_mlp2 ∈ ℝ^(d_e), MLP parameters.
  • For l ∈ [L_dec]:
    • 𝑾^l_dec, multi-head attention parameters for layer l, see (4),
    • 𝑾^l_e/d, multi-head cross-attention parameters for layer l, see (4),
    • 𝜸^l_3, 𝜷^l_3, 𝜸^l_4, 𝜷^l_4, 𝜸^l_5, 𝜷^l_5 ∈ ℝ^(d_e), three sets of layer-norm parameters,
    • 𝑾^l_mlp3 ∈ ℝ^(d_mlp × d_e), 𝒃^l_mlp3 ∈ ℝ^(d_mlp), 𝑾^l_mlp4 ∈ ℝ^(d_e × d_mlp), 𝒃^l_mlp4 ∈ ℝ^(d_e), MLP parameters.
  • 𝑾_u ∈ ℝ^(N_V × d_e), the unembedding matrix.

(The notation follows the encoder-decoder transformer pseudocode of [2], adapted so that the encoder input is audio rather than text.)

encode the context sequence

  1. ℓ_z ← length(𝒛)
  2. for t ∈ [ℓ_z]: 𝒆_t ← conv(𝒛)[:, t] + 𝑾_p[:, t], where conv is a stem of two convolution layers with GELU activations (replacing the token embedding)
  3. 𝒁 ← [𝒆_1, 𝒆_2, ..., 𝒆_{ℓ_z}]
  4. for l = 1, 2, ..., L_enc do
    • 𝒁 ← 𝒁 + MHAttention(𝒁 | 𝑾^l_enc, Mask ≡ 1)
    • for t ∈ [ℓ_z]: 𝒁[:, t] ← layer_norm(𝒁[:, t] | 𝜸^l_1, 𝜷^l_1)
    • 𝒁 ← 𝒁 + 𝑾^l_mlp2 ReLU(𝑾^l_mlp1 𝒁 + 𝒃^l_mlp1 1^T) + 𝒃^l_mlp2 1^T
    • for t ∈ [ℓ_z]: 𝒁[:, t] ← layer_norm(𝒁[:, t] | 𝜸^l_2, 𝜷^l_2)
  5. end
    decode the primary sequence, conditioning on the context
  6. ℓ_x ← length(𝒙)
  7. for t ∈ [ℓ_x]: 𝒆_t ← 𝑾_e[:, x[t]] + 𝑾_p[:, t]
  8. 𝑿 ← [𝒆_1, 𝒆_2, ..., 𝒆_{ℓ_x}]
  9. for l = 1, 2, ..., L_dec do
    • 𝑿 ← 𝑿 + MHAttention(𝑿 | 𝑾^l_dec, Mask[t, t'] = [[t ≤ t']])
    • for t ∈ [ℓ_x]: 𝑿[:, t] ← layer_norm(𝑿[:, t] | 𝜸^l_3, 𝜷^l_3)
    • 𝑿 ← 𝑿 + MHAttention(𝑿, 𝒁 | 𝑾^l_e/d, Mask ≡ 1)
    • for t ∈ [ℓ_x]: 𝑿[:, t] ← layer_norm(𝑿[:, t] | 𝜸^l_4, 𝜷^l_4)
    • 𝑿 ← 𝑿 + 𝑾^l_mlp4 ReLU(𝑾^l_mlp3 𝑿 + 𝒃^l_mlp3 1^T) + 𝒃^l_mlp4 1^T
    • for t ∈ [ℓ_x]: 𝑿[:, t] ← layer_norm(𝑿[:, t] | 𝜸^l_5, 𝜷^l_5)
  10. end
    derive conditional probabilities and return
  11. return 𝑷 = softmax(𝑾_u 𝑿)
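
To make the pseudocode concrete, here is a compact PyTorch sketch of the same forward pass. It keeps the post-layer-norm ordering, ReLU MLP, and learned positional embeddings written above, but the widths, depths, and other sizes are illustrative and much smaller than the released model.

```python
# Compact PyTorch rendering of the pseudocode above. Sizes are illustrative.
import torch
import torch.nn as nn

def attn_block(d_e, d_mlp, n_heads, cross=False):
    """One post-layer-norm block; decoder blocks add a cross-attention step."""
    layers = dict(
        attn=nn.MultiheadAttention(d_e, n_heads, batch_first=True),
        ln1=nn.LayerNorm(d_e), ln2=nn.LayerNorm(d_e),
        mlp=nn.Sequential(nn.Linear(d_e, d_mlp), nn.ReLU(), nn.Linear(d_mlp, d_e)),
    )
    if cross:
        layers["cross"] = nn.MultiheadAttention(d_e, n_heads, batch_first=True)
        layers["ln3"] = nn.LayerNorm(d_e)
    return nn.ModuleDict(layers)

class EncDecSketch(nn.Module):
    def __init__(self, n_vocab=1000, d_e=64, d_mlp=256, n_heads=4,
                 L_enc=2, L_dec=2, l_max=512, n_mels=80):
        super().__init__()
        # step 2: two convolution layers with GELU replace the token embedding
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_e, 3, padding=1), nn.GELU(),
            nn.Conv1d(d_e, d_e, 3, stride=2, padding=1), nn.GELU())
        self.W_e = nn.Embedding(n_vocab, d_e)   # token embedding matrix
        self.W_p = nn.Embedding(l_max, d_e)     # positional embedding matrix
        self.enc = nn.ModuleList(attn_block(d_e, d_mlp, n_heads)
                                 for _ in range(L_enc))
        self.dec = nn.ModuleList(attn_block(d_e, d_mlp, n_heads, cross=True)
                                 for _ in range(L_dec))
        self.W_u = nn.Linear(d_e, n_vocab, bias=False)  # unembedding matrix

    def forward(self, z, x):
        # encode the context sequence (steps 1-5)
        Z = self.conv(z).transpose(1, 2)                    # (B, l_z, d_e)
        Z = Z + self.W_p(torch.arange(Z.size(1)))
        for b in self.enc:
            Z = b["ln1"](Z + b["attn"](Z, Z, Z)[0])         # unmasked self-attention
            Z = b["ln2"](Z + b["mlp"](Z))
        # decode the primary sequence, conditioning on the context (steps 6-10)
        X = self.W_e(x) + self.W_p(torch.arange(x.size(1)))
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        for b in self.dec:
            X = b["ln1"](X + b["attn"](X, X, X, attn_mask=causal)[0])  # masked
            X = b["ln2"](X + b["cross"](X, Z, Z)[0])        # cross-attention on Z
            X = b["ln3"](X + b["mlp"](X))
        # derive conditional probabilities (step 11)
        return torch.softmax(self.W_u(X), dim=-1)

model = EncDecSketch()
probs = model(torch.randn(1, 80, 200), torch.randint(0, 1000, (1, 12)))
print(probs.shape)  # (1, 12, 1000): next-token distribution at each position
```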

Datasets - Supervised, Unsupervised, Weakly Supervised

  • Not a lot of supervised data is available: Chan et al. only got 5,140 hours
  • Unsupervised data can be easier to find (Zhang et al. got 1,000,000 hours) but is noisier
  • Weak supervision uses data that is labeled by machines rather than by human annotators
  • WSPSR uses 680,000 hours of weakly supervised, labeled audio data.
    • 117,000 hours are in 96 non-English languages
    • 125,000 hours are X→English translation data

WSPSR's Weakly Supervised Annotation Process

  1. The model is trained on transcribed audio collected from the internet
  2. Subpar and machine-generated transcripts are automatically detected and removed (a toy sketch of such filtering follows this list)
  3. An audio language detector is used to annotate the language of each recording
  4. Deduplication and manual inspection are performed
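
As promised above, here is a toy sketch of the kind of transcript-quality and language-consistency checks steps 2 and 3 describe. The specific rules (all-caps or all-lowercase text, missing punctuation) and the detect_audio_language helper are simplified illustrations, not the actual pipeline.

```python
# Toy sketch of transcript filtering and language annotation. The heuristics
# below (all-caps / all-lowercase, missing punctuation) and the
# detect_audio_language callable are simplified stand-ins for the real,
# more involved pipeline.
import string

def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts with tell-tale signs of automatic captioning."""
    text = transcript.strip()
    no_punctuation = not any(ch in string.punctuation for ch in text)
    all_one_case = text.isupper() or text.islower()
    return no_punctuation or all_one_case

def keep_example(audio, transcript: str, transcript_lang: str,
                 detect_audio_language) -> bool:
    """Keep an (audio, transcript) pair only if it passes the basic checks."""
    if looks_machine_generated(transcript):
        return False
    # Drop pairs where the spoken language does not match the transcript.
    return detect_audio_language(audio) == transcript_lang

print(looks_machine_generated("THIS IS AN AUTO CAPTION WITHOUT PUNCTUATION"))  # True
print(looks_machine_generated("Hello there, how are you doing today?"))        # False
```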

Questions

1. How does Whisper differ from the original encoder/decoder?

2. What does it mean for data to be 'weakly supervised'?

3. Can you think of any other applications of weakly supervised data outside of textless NLP?

Critical Analysis

Low Resource Languages

This work is a great start toward transcription and translation for low-resource languages, but how do we expand the dataset to cover more languages and to include more data from languages that are currently under-represented? On the other hand, while text-based multilingual models have been shown to improve accuracy for low-resource languages, they are actually not as good as monolingual models for highly resourced languages. Would the weak-supervision process for obtaining very large datasets be a good way to train a monolingual model?

Low Quality Data

The data used in the training of Whisper is far from 'gold standard'. Much of it is, itself, machine translated or transcribed. Would improvements in data collection lead to a more powerful model?

Additional Tasks

Whisper is purpose-built as a transcription/translation model which works out of the box without any finetuning or extra training. However, these are not the only purposes for a textless NLP model. It remains to be seen whether it can be repurposed with additional training for other tasks.

Small context length

As a result of its limited sequence length, Whisper can only process audio in windows of less than 30 seconds. This may be sufficient for translation or transcription, but may not be enough for other types of NLP tasks, such as classification or summarization of longer recordings. A simple chunking workaround is sketched below.
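
Here is a minimal sketch of one common workaround: split a long recording into 30-second windows and transcribe each window independently. The transcribe_chunk argument is a placeholder for whatever model call is used; naive fixed-size chunking like this can cut words at window boundaries, which more careful pipelines avoid.

```python
# Minimal sketch: process long audio in 30-second windows. transcribe_chunk
# is a placeholder for an actual model call; fixed-size chunking can split
# words at window boundaries.
import torch

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30

def transcribe_long_audio(waveform: torch.Tensor, transcribe_chunk) -> str:
    chunk_len = SAMPLE_RATE * CHUNK_SECONDS
    pieces = []
    for start in range(0, waveform.size(-1), chunk_len):
        pieces.append(transcribe_chunk(waveform[..., start:start + chunk_len]))
    return " ".join(pieces)

# Dummy "model" that just reports each chunk's duration.
fake_audio = torch.randn(1, SAMPLE_RATE * 95)  # ~95 seconds of audio
print(transcribe_long_audio(fake_audio, lambda c: f"[{c.size(-1) / SAMPLE_RATE:.0f}s]"))
```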

Links

The link to Whisper's GitHub

The link to OpenAI's blog

An article from InfoQ about OpenAI

A YouTube video by setdex

Here is Whisper's Hugging Face page

Video

References

[1] Doddapaneni, S., Ramesh, G., Kunchukuttan, A., Kumar, P., & Khapra, M. M. (2021). A primer on pretrained multilingual language models. arXiv preprint arXiv:2107.00676.

[2] Phuong, M., & Hutter, M. (2022). Formal Algorithms for Transformers. arXiv preprint arXiv:2207.09238.

[3] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. Technical report, OpenAI. URL https://cdn.openai.com/papers/whisper.pdf.

[4] Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.