

Multilingual captions for the HowTo100M dataset

We provide the multilingual captions for the HowTo100M dataset in the following languages:

Language code link
English en link
German de link
French fr link
Czech cs link
Swahili sw link
Russian ru link
Vietnamese vi link
Spanish es link
Chinese zh link


The how2_[lang].json file contains the captions for the HowTo100M videos in the given language. It can be read into a Python dictionary with video_id as the key. Each value of the dictionary is another dictionary with the keys ['text', 'start', 'end']. The value of 'text' is a list of all the captions for the given video_id, and 'start' and 'end' are arrays holding the corresponding start and end timestamps of those captions (in seconds).
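As a minimal sketch of the layout described above, the snippet below reads a caption file and walks through the captions of one video. It assumes the file is plain JSON and that 'text', 'start' and 'end' are parallel lists of equal length. The helper names load_captions and iter_segments, the video id, and the sample captions are hypothetical and only serve the illustration; the sample is written to a temporary file so the code runs without downloading the real data.

```python
import json
import os
import tempfile


def load_captions(path):
    """Read a how2_[lang].json file into a dict keyed by video_id."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_segments(captions, video_id):
    """Yield (start, end, text) for each caption of one video."""
    entry = captions[video_id]
    # 'text', 'start' and 'end' are assumed to be parallel lists:
    # index i in each list describes caption i.
    yield from zip(entry["start"], entry["end"], entry["text"])


# Hypothetical sample in the documented layout (invented id and captions).
sample = {
    "video_0001": {
        "text": ["add the flour", "stir well"],
        "start": [0.0, 2.5],
        "end": [2.5, 5.0],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a downloaded file such as how2_en.json.
    path = os.path.join(tmp, "how2_en.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    captions = load_captions(path)

segments = list(iter_segments(captions, "video_0001"))
for start, end, text in segments:
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
```

With a real download, pass the path of the how2_[lang].json file to load_captions and look up any video_id present in the resulting dictionary.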

HowTo100M videos

Please see here for the list of HowTo100M videos and the video metadata.

VTT in 9 Languages

The translated VTT files in 9 languages for evaluation are available here.


  title={Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models},
  author={Huang, Po-Yao and Patrick, Mandela and Hu, Junjie and Neubig, Graham and Metze, Florian and Hauptmann, Alexander G},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  url = {},


Please feel free to contact Bernie Huang if you have any questions. Thanks for your interest!