A multi-modal audio-to-text encoder-decoder model trained on a large, weakly supervised dataset
Multilingual language models such as mBERT and XLM-R [1] are large language models, based on the BERT or RoBERTa architectures, that are trained not on one language but on many. Both are encoder-only models with vocabularies of between 110K and 250K tokens. Multilingual training has been found to increase accuracy, particularly for low-resource languages, since knowledge of one language seems to transfer to others, even when they are unrelated.
Wav2vec [4] and other audio models rely on a feature-extraction front end in which raw audio is sampled (usually at 16,000 Hz) and converted, via a Fourier transform computed over sliding windows, into a log-Mel spectrogram. The spectrogram coefficients are fed into a small convolutional neural network with between 2 and 5 layers, which is trained to output vectors called 'features' that can be used like tokens in a traditional text-based transformer.
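To make the front end concrete, here is a minimal sketch of the log-Mel step using `librosa`; the file name is hypothetical, and the window, stride, and channel values are chosen to match the Whisper configuration described below rather than any particular wav2vec setup.

```python
# Minimal log-Mel front end: raw audio -> 80-channel log-Mel spectrogram.
# "speech.wav" is a hypothetical file; parameter values are illustrative.
import librosa

waveform, sr = librosa.load("speech.wav", sr=16_000)  # resample to 16 kHz

mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sr,
    n_fft=400,       # 25 ms analysis window at 16 kHz
    hop_length=160,  # 10 ms stride between windows
    n_mels=80,       # 80 Mel channels
)
log_mel = librosa.power_to_db(mel)  # log-magnitude spectrogram

print(log_mel.shape)  # (80, num_frames): one column of coefficients per frame
```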
Existing audio transformers such as wav2vec are encoder-only, meaning that they must be fine-tuned on each downstream task before they can be used. This can cause some problems:
- Machine learning is good at 'cheating': a fine-tuned model can latch onto quirks of its fine-tuning dataset instead of learning the underlying task
- Fine-tuned models are therefore prone to overfitting and tend to generalize poorly to other distributions
To make matters even worse, most current audio models have very small training sets, because high-quality labelled audio data is hard to obtain. Wav2vec, for example, uses only 960 hours of audio, which may seem like a lot but is tiny next to the hundreds of thousands of hours Whisper is trained on.
Three major improvements:
- Multimodal encoder-decoder architecture
- Task-specific tokens fed to the decoder
- Large dataset with weak supervision
- Audio is resampled to 16,000 Hz
- An 80-channel log-magnitude Mel spectrogram is computed on 25 ms windows with a stride of 10 ms
- The input is globally normalized (scaled to lie between -1 and 1, with approximately zero mean)
- Features are extracted with a small CNN stem and then fed to the encoder (see the sketch below)
- The architecture otherwise follows the original encoder-decoder transformer, with task-specific special tokens fed to the decoder
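A rough PyTorch sketch of the convolutional stem is given below. Per the paper, the stem is two convolutions with filter width 3 and GELU activations, the second with stride 2; the model width `d_model` here is illustrative (the released checkpoints range from 384 for tiny to 1280 for large).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStem(nn.Module):
    """Sketch of Whisper's encoder stem: two 1-D convolutions with GELU over the
    80-channel log-Mel spectrogram; the second convolution halves the frame rate."""
    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 80, n_frames), e.g. 3000 frames for a 30 s window
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x))
        # (batch, n_frames // 2, d_model): one feature vector per position, to which
        # sinusoidal positional encodings are added before the encoder blocks
        return x.transpose(1, 2)

stem = AudioStem()
print(stem(torch.randn(1, 80, 3000)).shape)  # torch.Size([1, 1500, 512])
```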
Input: $Z$, a sequence of Mel spectrogram coefficient vectors (one column per audio frame); $x \in V^*$, a sequence of token IDs.

Output: $P \in (0,1)^{N_V \times \mathrm{length}(x)}$, where the $t$-th column of $P$ represents $\hat{P}_\theta(x[t+1] \mid x[1:t], Z)$.

Hyperparameters: $\ell_{\max}, L, H, d_e, d_{\mathrm{mlp}} \in \mathbb{N}$

Parameters: $\theta$ includes all of the following parameters:
- $W_e \in \mathbb{R}^{d_e \times N_V}$, $W_p \in \mathbb{R}^{d_e \times \ell_{\max}}$, the token and positional embedding matrices.
- For $l \in [L_{\mathrm{enc}}]$:
  - $\mathcal{W}_l$, multi-head attention parameters for layer $l$, see Eq. (4) of [2],
  - $\gamma^1_l, \beta^1_l, \gamma^2_l, \beta^2_l \in \mathbb{R}^{d_e}$, two sets of layer-norm parameters,
  - $W^l_{\mathrm{mlp1}} \in \mathbb{R}^{d_{\mathrm{mlp}} \times d_e}$, $b^l_{\mathrm{mlp1}} \in \mathbb{R}^{d_{\mathrm{mlp}}}$, $W^l_{\mathrm{mlp2}} \in \mathbb{R}^{d_e \times d_{\mathrm{mlp}}}$, $b^l_{\mathrm{mlp2}} \in \mathbb{R}^{d_e}$, MLP parameters.
- For $l \in [L_{\mathrm{dec}}]$:
  - $\mathcal{W}_l$, multi-head attention parameters for layer $l$, see Eq. (4) of [2],
  - $\mathcal{W}^{e/d}_l$, multi-head cross-attention parameters for layer $l$, see Eq. (4) of [2],
  - $\gamma^3_l, \beta^3_l, \gamma^4_l, \beta^4_l, \gamma^5_l, \beta^5_l \in \mathbb{R}^{d_e}$, three sets of layer-norm parameters,
  - $W^l_{\mathrm{mlp3}} \in \mathbb{R}^{d_{\mathrm{mlp}} \times d_e}$, $b^l_{\mathrm{mlp3}} \in \mathbb{R}^{d_{\mathrm{mlp}}}$, $W^l_{\mathrm{mlp4}} \in \mathbb{R}^{d_e \times d_{\mathrm{mlp}}}$, $b^l_{\mathrm{mlp4}} \in \mathbb{R}^{d_e}$, MLP parameters.
- $W_u \in \mathbb{R}^{N_V \times d_e}$, the unembedding matrix.
Encode the context sequence:
- $\ell_z \leftarrow \mathrm{length}(Z)$
- for $t \in [\ell_z]$: $e_t \leftarrow \mathrm{conv}_{\times 2}(Z[:, t] \mid \mathrm{GELU}) + W_p[:, t]$ (two convolution layers with GELU in place of a token embedding lookup)
- $Z \leftarrow [e_1, e_2, \ldots, e_{\ell_z}]$
- for $l = 1, 2, \ldots, L_{\mathrm{enc}}$ do
  - $Z \leftarrow Z + \mathrm{MHAttention}(Z \mid \mathcal{W}^{\mathrm{enc}}_l, \mathrm{Mask} \equiv 1)$
  - for $t \in [\ell_z]$: $Z[:, t] \leftarrow \mathrm{layer\_norm}(Z[:, t] \mid \gamma^1_l, \beta^1_l)$
  - $Z \leftarrow Z + W^l_{\mathrm{mlp2}}\,\mathrm{ReLU}(W^l_{\mathrm{mlp1}} Z + b^l_{\mathrm{mlp1}} \mathbf{1}^\top) + b^l_{\mathrm{mlp2}} \mathbf{1}^\top$
  - for $t \in [\ell_z]$: $Z[:, t] \leftarrow \mathrm{layer\_norm}(Z[:, t] \mid \gamma^2_l, \beta^2_l)$
- end

Decode the primary sequence, conditioning on the context:
- $\ell_x \leftarrow \mathrm{length}(x)$
- for $t \in [\ell_x]$: $e_t \leftarrow W_e[:, x[t]] + W_p[:, t]$
- $X \leftarrow [e_1, e_2, \ldots, e_{\ell_x}]$
- for $l = 1, 2, \ldots, L_{\mathrm{dec}}$ do
  - $X \leftarrow X + \mathrm{MHAttention}(X \mid \mathcal{W}^{\mathrm{dec}}_l, \mathrm{Mask}[t, t'] = [\![t \le t']\!])$
  - for $t \in [\ell_x]$: $X[:, t] \leftarrow \mathrm{layer\_norm}(X[:, t] \mid \gamma^3_l, \beta^3_l)$
  - $X \leftarrow X + \mathrm{MHAttention}(X, Z \mid \mathcal{W}^{e/d}_l, \mathrm{Mask} \equiv 1)$ (cross-attention over the encoder output $Z$)
  - for $t \in [\ell_x]$: $X[:, t] \leftarrow \mathrm{layer\_norm}(X[:, t] \mid \gamma^4_l, \beta^4_l)$
  - $X \leftarrow X + W^l_{\mathrm{mlp4}}\,\mathrm{ReLU}(W^l_{\mathrm{mlp3}} X + b^l_{\mathrm{mlp3}} \mathbf{1}^\top) + b^l_{\mathrm{mlp4}} \mathbf{1}^\top$
  - for $t \in [\ell_x]$: $X[:, t] \leftarrow \mathrm{layer\_norm}(X[:, t] \mid \gamma^5_l, \beta^5_l)$
- end

Derive conditional probabilities and return:
- return $P = \mathrm{softmax}(W_u X)$
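Since Whisper's decoder follows this standard transformer recipe, a compact PyTorch sketch of a single decoder layer may help make the pseudocode concrete. It mirrors the post-layer-norm, ReLU formulation written above (from [2]); Whisper's released implementation uses pre-norm residual blocks and GELU, so treat this as illustrative rather than the exact model code.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder layer as written in the pseudocode above: masked self-attention,
    cross-attention over the encoder output, and a position-wise MLP, each followed
    by a residual add and layer norm (post-norm, as in [2])."""
    def __init__(self, d_e: int, n_heads: int, d_mlp: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_e, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_e, n_heads, batch_first=True)
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(d_e), nn.LayerNorm(d_e), nn.LayerNorm(d_e)
        self.mlp = nn.Sequential(nn.Linear(d_e, d_mlp), nn.ReLU(), nn.Linear(d_mlp, d_e))

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (batch, len_x, d_e) embedded tokens; z: (batch, len_z, d_e) encoder output
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.self_attn(x, x, x, attn_mask=causal)   # Mask[t, t'] = [[t <= t']]
        x = self.ln1(x + a)
        a, _ = self.cross_attn(x, z, z)                    # unmasked cross-attention
        x = self.ln2(x + a)
        return self.ln3(x + self.mlp(x))                   # position-wise MLP

block = DecoderBlock(d_e=512, n_heads=8, d_mlp=2048)
tokens = torch.randn(1, 10, 512)    # X: embedded token sequence
audio = torch.randn(1, 1500, 512)   # Z: encoder output for a 30 s window
print(block(tokens, audio).shape)   # torch.Size([1, 10, 512])
```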
- Not a lot of supervised audio data is available: Chan et al., for example, assembled only 5,140 hours
- Unsupervised data is easier to find (Zhang et al. collected 1,000,000 hours) but is unlabeled and noisier
- Weak supervision instead uses labels that are plentiful but not hand-verified, such as transcripts scraped from the web
- WSPSR uses 680,000 hours of weakly supervised, labeled audio data
- 117,000 hours of that data are in 96 non-English languages
- 125,000 hours are X→English translation data
- The model is trained on transcribed audio from the internet
- Subpar and machine-generated transcripts are automatically detected and removed (a sketch of such a filter follows this list)
- An audio language detector was used to annotate the spoken language of each clip
- Deduplication and manual inspection were also performed
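The paper notes that machine-generated transcripts tend to betray themselves through formatting quirks such as missing punctuation or all-upper/lower-case text. The filter below is only a loose illustration of that kind of heuristic, not the authors' actual pipeline.

```python
# Illustrative transcript filter in the spirit of the heuristics described in the
# paper; the rules and the sample corpus here are hypothetical simplifications.
def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts with telltale signs of ASR output."""
    text = transcript.strip()
    if not text:
        return True
    if not any(ch in text for ch in ".,?!"):          # no punctuation at all
        return True
    letters = [ch for ch in text if ch.isalpha()]
    if letters and (all(ch.isupper() for ch in letters)
                    or all(ch.islower() for ch in letters)):
        return True                                   # all-caps or all-lowercase
    return False

corpus = [
    "So, how did you get started in radio?",   # plausible human transcript
    "SO HOW DID YOU GET STARTED IN RADIO",     # likely ASR output
]
print([t for t in corpus if not looks_machine_generated(t)])
```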
This work is a great start toward translation for low-resource languages, but how do we expand the dataset to cover more languages, and to include more data from languages that are currently under-represented? On the other hand, while text-based multilingual models have been shown to improve accuracy for low-resource languages, they are actually not as good as monolingual models for high-resource languages. Would the weak-supervision process used to obtain very large datasets also be a good way to train a monolingual model?
The data used in the training of Whisper is far from 'gold standard'. Much of it is, itself, machine translated or transcribed. Would improvements in data collection lead to a more powerful model?
Whisper is purpose-built as a transcription/translation model which works out of the box, without any fine-tuning or extra training. However, these are not the only uses for a textless NLP model. It remains to be seen whether Whisper can be repurposed, with additional training, for other tasks.
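As a quick illustration of that out-of-the-box behavior, here is a minimal sketch using the open-source `whisper` package released alongside the paper (installable as `openai-whisper`); the audio file name is hypothetical.

```python
# Minimal zero-fine-tuning usage of the released whisper package.
# "interview.mp3" is a hypothetical file.
import whisper

model = whisper.load_model("base")  # downloads a pretrained checkpoint

# transcription in the source language
result = model.transcribe("interview.mp3", task="transcribe")
print(result["text"])

# the same checkpoint translates speech in other languages into English,
# selected purely by the task token fed to the decoder
result = model.transcribe("interview.mp3", task="translate")
print(result["text"])
```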
Because of its limited input length, Whisper processes audio in windows of at most 30 seconds. This is sufficient for transcription or translation, but may not be enough for other types of NLP tasks, such as classification or summarization of long recordings.
An article from InfoQ about OpenAI
Here is Whisper's Hugging Face page
[1] Doddapaneni, S., Ramesh, G., Kunchukuttan, A., Kumar, P., & Khapra, M. M. (2021). A primer on pretrained multilingual language models. arXiv preprint arXiv:2107.00676.
[2] Phuong, M., & Hutter, M. (2022). Formal Algorithms for Transformers. arXiv preprint arXiv:2207.09238.
[3] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. Technical report, OpenAI. URL https://cdn.openai.com/papers/whisper.pdf.
[4] Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv preprint arXiv:1904.05862.