IMPORTANT NOTICE

It appears that most of the original data files (videos) have been removed from Youtube and therefore running the download scripts no longer produces a useful copy of the corpus. Unfortunately we cannot redistribute the data directly and the team has since moved on. If you want to use the corpus in your own research, your best bet is to get in touch with a group that has a copy of the corpus and see if you can share that. Thank you!

The How-2 Dataset

How-2 is a multimodal dataset which consists of around 80,000 instructional videos (about 2,000 hours) with associated English sub-titles and summaries. About 300 hours have also been translated into Portuguese using crowd-sourcing, and used during the JSALT 2018 Workshop. The How-2 training data was split into 300h and 2000h, with only the former supporting Portuguese Machine Translation. The 2000h set can be used for other tasks such as speech recognition, speech summarization, text summarization, and their multimodal extensions.

We currently have released the following packages pertaining to the How-2 data to be able to replicate our results and encourage further research:

ASR (300h): This release contains (audio) fbank+pitch features in Kaldi scp/ark format for 300 hours
E2E Summarization + ASR (2000h): This release contains (audio_2000) fbank+pitch features in Kaldi scp/ark format, along with transcript and abstractive summaries for 2000 hours
Visual features: This release contains (video) Action features in numpy arrays for MT and ASR
English Transcript: This release contains (en) English text for How2
Portuguese Machine Translations: This release contains (pt) Portuguese crowdsource text
English Abstractive Summaries: This release contains (en_sum) Summarization text
Visual features for Summarization: This release contains (video_sum) Summarization action features in numpy arrays
Object Grounding Features: This release contains (ground) Object grounding test and development set

~~Please fill the [Data Request form]~~

Please cite the following paper in all academic work that uses this dataset:

@inproceedings{sanabria18how2,
  title = {{How2:} A Large-scale Dataset For Multimodal Language Understanding},
  author = {Sanabria, Ramon and Caglayan, Ozan and Palaskar, Shruti and Elliott, Desmond and Barrault, Lo\"ic and Specia, Lucia and Metze, Florian},
  booktitle = {Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL)},
  year = {2018},
  organization={NeurIPS},
  url = {http://arxiv.org/abs/1811.00347}
}

More papers can be found in the bibliography.

~~To subscribe to the How2 mailing list click here.~~

Speech Summarization

How2 has been used for end to end speech summarization- we are releasing 43 dim fbank+pitch features to support this. See our ESPNet Recipe and paper. Please consider citing our paper on speech summarization if you utilize this data release.

@inproceedings{Sharma2022, 
author={Sharma, Roshan and Palaskar, Shruti and Black, Alan W and Metze, Florian},
booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={End-to-End Speech Summarization Using Restricted Self-Attention},
year={2022}, 
volume={},
number={},
pages={8072-8076},
doi={10.1109/ICASSP43922.2022.9747320}
}

How2 Get Help

Please use the issues ticket system (https://github.com/srvk/how2-dataset/issues) to ask questions and get clarifications.

How2 License

License information for every video can be found in the .info.json file that is being downloaded for every video. At the time of release, all videos included in this dataset were being made available by the original content providers under the standard YouTube License.

Unless noted otherwise, we are providing the contents of this repository under the Creative Commons BY-SA 4.0 (Attribution-Share-Alike) License (for data-like content) and/ or BSD-2-Clause License (for software-type content).