A multi-modal audio-to-text encoder-decoder model trained on a large, weakly supervised dataset
Multilingual language models such as mBERT and XLM-R [1] are large language models, based on the BERT or RoBERTa architectures, that are trained not on one language but on many. Both are encoder-only models with vocabularies of between 110K and 250K tokens. Multilingual training has been found to increase accuracy, particularly for low-resource languages, since knowledge of one language seems to transfer to others, even when they are unrelated.
Wav2vec [4] and other audio models rely on a feature-extraction front end in which raw audio is sampled (usually at 16,000 Hz) and converted, via a Fourier transform computed over sliding windows, into a log-Mel spectrogram. The spectrogram coefficients are fed into a small convolutional neural network with between 2 and 5 layers, which is trained to output vectors called 'features' that can be used like tokens in a traditional text-based transformer.
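To make the front end concrete, here is a minimal sketch of the log-Mel step using `librosa`; the file name is hypothetical, and the window, stride, and channel values are chosen to match the Whisper configuration described below rather than any particular wav2vec setup.

```python
# Minimal log-Mel front end: raw audio -> 80-channel log-Mel spectrogram.
# "speech.wav" is a hypothetical file; parameter values are illustrative.
import librosa

waveform, sr = librosa.load("speech.wav", sr=16_000)  # resample to 16 kHz

mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sr,
    n_fft=400,       # 25 ms analysis window at 16 kHz
    hop_length=160,  # 10 ms stride between windows
    n_mels=80,       # 80 Mel channels
)
log_mel = librosa.power_to_db(mel)  # log-magnitude spectrogram

print(log_mel.shape)  # (80, num_frames): one column of coefficients per frame
```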
Existing audio transformers such as wav2vec are encoder-only, meaning that they must be fine-tuned on each downstream task before they can be used. This can cause some problems:
- Machine learning is good at 'cheating': a fine-tuned model can latch onto quirks of its fine-tuning dataset instead of learning the underlying task
- Fine-tuned models are therefore prone to overfitting and tend to generalize poorly to other distributions
To make matters even worse, most current audio models have very small training sets, because high-quality labelled audio data is hard to obtain. Wav2vec, for example, uses only 960 hours of audio, which may seem like a lot but is tiny next to the hundreds of thousands of hours Whisper is trained on.
Three major improvements:
- Multimodal encoder-decoder architecture
- Task-specific tokens fed to the decoder
- Large dataset with weak supervision
- Audio is resampled to 16,000 Hz
- An 80-channel log-magnitude Mel spectrogram is computed on 25 ms windows with a stride of 10 ms
- The input is globally normalized (scaled to lie between -1 and 1, with approximately zero mean)
- Features are extracted with a small CNN stem and then fed to the encoder (see the sketch below)
- The architecture otherwise follows the original encoder-decoder transformer, with task-specific special tokens fed to the decoder
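A rough PyTorch sketch of the convolutional stem is given below. Per the paper, the stem is two convolutions with filter width 3 and GELU activations, the second with stride 2; the model width `d_model` here is illustrative (the released checkpoints range from 384 for tiny to 1280 for large).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStem(nn.Module):
    """Sketch of Whisper's encoder stem: two 1-D convolutions with GELU over the
    80-channel log-Mel spectrogram; the second convolution halves the frame rate."""
    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 80, n_frames), e.g. 3000 frames for a 30 s window
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x))
        # (batch, n_frames // 2, d_model): one feature vector per position, to which
        # sinusoidal positional encodings are added before the encoder blocks
        return x.transpose(1, 2)

stem = AudioStem()
print(stem(torch.randn(1, 80, 3000)).shape)  # torch.Size([1, 1500, 512])
```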
Input: $Z$, a sequence of Mel spectrogram coefficient vectors (one column per audio frame); $x \in V^*$, a sequence of token IDs.

Output: $P \in (0,1)^{N_V \times \mathrm{length}(x)}$, where the $t$-th column of $P$ represents $\hat{P}_\theta(x[t+1] \mid x[1:t], Z)$.

Hyperparameters: $\ell_{\max}, L, H, d_e, d_{\mathrm{mlp}} \in \mathbb{N}$

Parameters: $\theta$ includes all of the following parameters:
- $W_e \in \mathbb{R}^{d_e \times N_V}$, $W_p \in \mathbb{R}^{d_e \times \ell_{\max}}$, the token and positional embedding matrices.
- For $l \in [L_{\mathrm{enc}}]$:
  - $\mathcal{W}_l$, multi-head attention parameters for layer $l$, see Eq. (4) of [2],
  - $\gamma^1_l, \beta^1_l, \gamma^2_l, \beta^2_l \in \mathbb{R}^{d_e}$, two sets of layer-norm parameters,
  - $W^l_{\mathrm{mlp1}} \in \mathbb{R}^{d_{\mathrm{mlp}} \times d_e}$, $b^l_{\mathrm{mlp1}} \in \mathbb{R}^{d_{\mathrm{mlp}}}$, $W^l_{\mathrm{mlp2}} \in \mathbb{R}^{d_e \times d_{\mathrm{mlp}}}$, $b^l_{\mathrm{mlp2}} \in \mathbb{R}^{d_e}$, MLP parameters.
- For $l \in [L_{\mathrm{dec}}]$:
  - $\mathcal{W}_l$, multi-head attention parameters for layer $l$, see Eq. (4) of [2],
  - $\mathcal{W}^{e/d}_l$, multi-head cross-attention parameters for layer $l$, see Eq. (4) of [2],
  - $\gamma^3_l, \beta^3_l, \gamma^4_l, \beta^4_l, \gamma^5_l, \beta^5_l \in \mathbb{R}^{d_e}$, three sets of layer-norm parameters,
  - $W^l_{\mathrm{mlp3}} \in \mathbb{R}^{d_{\mathrm{mlp}} \times d_e}$, $b^l_{\mathrm{mlp3}} \in \mathbb{R}^{d_{\mathrm{mlp}}}$, $W^l_{\mathrm{mlp4}} \in \mathbb{R}^{d_e \times d_{\mathrm{mlp}}}$, $b^l_{\mathrm{mlp4}} \in \mathbb{R}^{d_e}$, MLP parameters.
- $W_u \in \mathbb{R}^{N_V \times d_e}$, the unembedding matrix.
Encode the context sequence:
- $\ell_z \leftarrow \mathrm{length}(Z)$
- for $t \in [\ell_z]$: $e_t \leftarrow \mathrm{conv}_{\times 2}(Z[:, t] \mid \mathrm{GELU}) + W_p[:, t]$ (two convolution layers with GELU in place of a token embedding lookup)
- $Z \leftarrow [e_1, e_2, \ldots, e_{\ell_z}]$
- for $l = 1, 2, \ldots, L_{\mathrm{enc}}$ do
  - $Z \leftarrow Z + \mathrm{MHAttention}(Z \mid \mathcal{W}^{\mathrm{enc}}_l, \mathrm{Mask} \equiv 1)$
  - for $t \in [\ell_z]$: $Z[:, t] \leftarrow \mathrm{layer\_norm}(Z[:, t] \mid \gamma^1_l, \beta^1_l)$
  - $Z \leftarrow Z + W^l_{\mathrm{mlp2}}\,\mathrm{ReLU}(W^l_{\mathrm{mlp1}} Z + b^l_{\mathrm{mlp1}} \mathbf{1}^\top) + b^l_{\mathrm{mlp2}} \mathbf{1}^\top$
  - for $t \in [\ell_z]$: $Z[:, t] \leftarrow \mathrm{layer\_norm}(Z[:, t] \mid \gamma^2_l, \beta^2_l)$
- end

Decode the primary sequence, conditioning on the context:
- $\ell_x \leftarrow \mathrm{length}(x)$
- for $t \in [\ell_x]$: $e_t \leftarrow W_e[:, x[t]] + W_p[:, t]$
- $X \leftarrow [e_1, e_2, \ldots, e_{\ell_x}]$
- for $l = 1, 2, \ldots, L_{\mathrm{dec}}$ do
  - $X \leftarrow X + \mathrm{MHAttention}(X \mid \mathcal{W}^{\mathrm{dec}}_l, \mathrm{Mask}[t, t'] = [\![t \le t']\!])$
  - for $t \in [\ell_x]$: $X[:, t] \leftarrow \mathrm{layer\_norm}(X[:, t] \mid \gamma^3_l, \beta^3_l)$
  - $X \leftarrow X + \mathrm{MHAttention}(X, Z \mid \mathcal{W}^{e/d}_l, \mathrm{Mask} \equiv 1)$ (cross-attention over the encoder output $Z$)
  - for $t \in [\ell_x]$: $X[:, t] \leftarrow \mathrm{layer\_norm}(X[:, t] \mid \gamma^4_l, \beta^4_l)$
  - $X \leftarrow X + W^l_{\mathrm{mlp4}}\,\mathrm{ReLU}(W^l_{\mathrm{mlp3}} X + b^l_{\mathrm{mlp3}} \mathbf{1}^\top) + b^l_{\mathrm{mlp4}} \mathbf{1}^\top$
  - for $t \in [\ell_x]$: $X[:, t] \leftarrow \mathrm{layer\_norm}(X[:, t] \mid \gamma^5_l, \beta^5_l)$
- end

Derive conditional probabilities and return:
- return $P = \mathrm{softmax}(W_u X)$
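Since Whisper's decoder follows this standard transformer recipe, a compact PyTorch sketch of a single decoder layer may help make the pseudocode concrete. It mirrors the post-layer-norm, ReLU formulation written above (from [2]); Whisper's released implementation uses pre-norm residual blocks and GELU, so treat this as illustrative rather than the exact model code.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder layer as written in the pseudocode above: masked self-attention,
    cross-attention over the encoder output, and a position-wise MLP, each followed
    by a residual add and layer norm (post-norm, as in [2])."""
    def __init__(self, d_e: int, n_heads: int, d_mlp: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_e, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_e, n_heads, batch_first=True)
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(d_e), nn.LayerNorm(d_e), nn.LayerNorm(d_e)
        self.mlp = nn.Sequential(nn.Linear(d_e, d_mlp), nn.ReLU(), nn.Linear(d_mlp, d_e))

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (batch, len_x, d_e) embedded tokens; z: (batch, len_z, d_e) encoder output
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.self_attn(x, x, x, attn_mask=causal)   # Mask[t, t'] = [[t <= t']]
        x = self.ln1(x + a)
        a, _ = self.cross_attn(x, z, z)                    # unmasked cross-attention
        x = self.ln2(x + a)
        return self.ln3(x + self.mlp(x))                   # position-wise MLP

block = DecoderBlock(d_e=512, n_heads=8, d_mlp=2048)
tokens = torch.randn(1, 10, 512)    # X: embedded token sequence
audio = torch.randn(1, 1500, 512)   # Z: encoder output for a 30 s window
print(block(tokens, audio).shape)   # torch.Size([1, 10, 512])
```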
- Not a lot of supervised audio data is available: Chan et al., for example, assembled only 5,140 hours
- Unsupervised data is easier to find (Zhang et al. collected 1,000,000 hours) but is unlabeled and noisier
- Weak supervision instead uses labels that are plentiful but not hand-verified, such as transcripts scraped from the web
- WSPSR uses 680,000 hours of weakly supervised, labeled audio data
- 117,000 hours of that data are in 96 non-English languages
- 125,000 hours are X→English translation data
- The model is trained on transcribed audio from the internet
- Subpar and machine-generated transcripts are automatically detected and removed (a sketch of such a filter follows this list)
- An audio language detector was used to annotate the spoken language of each clip
- Deduplication and manual inspection were also performed
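The paper notes that machine-generated transcripts tend to betray themselves through formatting quirks such as missing punctuation or all-upper/lower-case text. The filter below is only a loose illustration of that kind of heuristic, not the authors' actual pipeline.

```python
# Illustrative transcript filter in the spirit of the heuristics described in the
# paper; the rules and the sample corpus here are hypothetical simplifications.
def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts with telltale signs of ASR output."""
    text = transcript.strip()
    if not text:
        return True
    if not any(ch in text for ch in ".,?!"):          # no punctuation at all
        return True
    letters = [ch for ch in text if ch.isalpha()]
    if letters and (all(ch.isupper() for ch in letters)
                    or all(ch.islower() for ch in letters)):
        return True                                   # all-caps or all-lowercase
    return False

corpus = [
    "So, how did you get started in radio?",   # plausible human transcript
    "SO HOW DID YOU GET STARTED IN RADIO",     # likely ASR output
]
print([t for t in corpus if not looks_machine_generated(t)])
```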
This work is a great start toward translation for low-resource languages, but how do we expand the dataset to cover more languages, and to include more data from languages that are currently under-represented? On the other hand, while text-based multilingual models have been shown to improve accuracy for low-resource languages, they are actually not as good as monolingual models for high-resource languages. Would the weak-supervision process used to obtain very large datasets also be a good way to train a monolingual model?
The data used in the training of Whisper is far from 'gold standard'. Much of it is, itself, machine translated or transcribed. Would improvements in data collection lead to a more powerful model?
Whisper is purpose-built as a transcription/translation model which works out of the box, without any fine-tuning or extra training. However, these are not the only uses for a textless NLP model. It remains to be seen whether Whisper can be repurposed, with additional training, for other tasks.
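As a quick illustration of that out-of-the-box behavior, here is a minimal sketch using the open-source `whisper` package released alongside the paper (installable as `openai-whisper`); the audio file name is hypothetical.

```python
# Minimal zero-fine-tuning usage of the released whisper package.
# "interview.mp3" is a hypothetical file.
import whisper

model = whisper.load_model("base")  # downloads a pretrained checkpoint

# transcription in the source language
result = model.transcribe("interview.mp3", task="transcribe")
print(result["text"])

# the same checkpoint translates speech in other languages into English,
# selected purely by the task token fed to the decoder
result = model.transcribe("interview.mp3", task="translate")
print(result["text"])
```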
Because of its limited input length, Whisper processes audio in windows of at most 30 seconds. This is sufficient for transcription or translation, but may not be enough for other types of NLP tasks, such as classification or summarization of long recordings.
An article from InfoQ about OpenAI
Here is Whisper's Hugging Face page
[1] Doddapaneni, S., Ramesh, G., Kunchukuttan, A., Kumar, P., & Khapra, M. M. (2021). A primer on pretrained multilingual language models. arXiv preprint arXiv:2107.00676.
[2] Phuong, M., & Hutter, M. (2022). Formal Algorithms for Transformers. arXiv preprint arXiv:2207.09238.
[3] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. Technical report, OpenAI. URL https://cdn.openai.com/papers/whisper.pdf.
[4] Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv preprint arXiv:1904.05862.