- 1 & 2 contain the mathematical foundation. 1 in particular is a high-quality tutorial; 2 is on the arcane side.
- 3 & 3.5 revolve around the same task: use both audio and articulatory information during training, but only audio at test time. This is a natural setting, since humans don't have built-in X-ray functions, at least not yet. Practical aspects are also covered, such as making the training of KCCA computationally tractable.
- 4 & 5 utilize neural nets. The gradient derivation in 4 is involved and deserves a careful look. 5 compares different architectures for multimodal deep learning, concluding that a CCA-autoencoder hybrid might have superior performance.
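As a refresher on the linear case underlying all of these papers, here is a minimal CCA sketch in numpy (my own illustration, not code from any of the papers; the regularization constant `reg` is an arbitrary choice added for numerical stability, not part of the classical formulation):

```python
import numpy as np

def cca(X, Y, k, reg=1e-4):
    """Linear CCA: find k pairs of projections maximizing correlation
    between the two views X (n x dx) and Y (n x dy)."""
    # Center each view
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view covariances and the cross-covariance
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is SPD)
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Whiten both views, then SVD the whitened cross-covariance;
    # singular values are the canonical correlations
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    A = inv_sqrt(Sxx) @ U[:, :k]   # projection matrix for view X
    B = inv_sqrt(Syy) @ Vt[:k].T   # projection matrix for view Y
    return A, B, s[:k]

# Tiny demo: two 3-d views sharing one latent variable in their first column
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
A, B, corrs = cca(X, Y, k=1)
```

DCCA (paper 4) replaces the identity feature maps here with neural nets and backpropagates through this same objective; KCCA swaps in kernelized features, which is where the tractability concerns in 3 & 3.5 arise.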
There is also a taxonomy of multimodal learning here.