Detours for Navigating Instructional Videos

Official code for Detours for Navigating Instructional Videos, CVPR 2024 (highlight).

Video detours dataset

We propose the first video detours dataset. A detour instance arises when a user pauses an instructional video and asks the AI assistant a query such as "how to do this step without a blender". The assistant should return another instructional video that satisfies the query while performing the same activity as the original video.
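
To make the task definition concrete, here is a minimal Python sketch of how a single detour instance could be represented. All class and field names below are hypothetical and chosen for illustration only; they are not the schema of the released dataset files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetourQuery:
    """What the user provides (hypothetical field names)."""
    source_video_id: str   # video the user is currently watching
    pause_time_sec: float  # moment at which the user paused
    query_text: str        # e.g. "how to do this step without a blender"


@dataclass(frozen=True)
class DetourTarget:
    """What the assistant should return (hypothetical field names)."""
    target_video_id: str     # another video of the same activity
    window_start_sec: float  # start of the segment that answers the query
    window_end_sec: float    # end of that segment

    def duration(self) -> float:
        """Length of the returned window in seconds."""
        return self.window_end_sec - self.window_start_sec


query = DetourQuery("source_vid", 125.0, "how to do this step without a blender")
target = DetourTarget("target_vid", 40.0, 65.5)
print(query.query_text, "->", target.target_video_id, f"({target.duration():.1f}s)")
```

In this sketch, the retrieval task corresponds to predicting `target_video_id`, and the window localization task corresponds to predicting the start and end of the window.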

The (weakly-supervised) training dataset can be downloaded here.

The (weakly-supervised) validation dataset can be downloaded here.

The manually annotated testing dataset can be downloaded here.

HowTo100M summary dataset

We use Llama 2 to summarize HowTo100M cooking videos into distinct steps along with timestamps. These summaries are used to create the video detours dataset. The HowTo100M summary dataset can be downloaded from here.
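
As an illustration of working with timestamped step summaries, the sketch below looks up the step that is active at a given time. It assumes each step is a (start, end, text) record and that steps are sorted by start time and do not overlap; the `Step` type and `step_at` helper are hypothetical and do not reflect the actual format of the released summary files.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Step(NamedTuple):
    """One summarized step (hypothetical layout, not the released format)."""
    start_sec: float
    end_sec: float
    text: str


def step_at(steps: Sequence[Step], t: float) -> Optional[Step]:
    """Return the step whose [start_sec, end_sec) interval contains t.

    Assumes `steps` is sorted by start_sec and non-overlapping.
    Returns None when t falls outside every step.
    """
    starts = [s.start_sec for s in steps]
    i = bisect_right(starts, t) - 1  # last step starting at or before t
    if i >= 0 and t < steps[i].end_sec:
        return steps[i]
    return None


steps = [Step(0.0, 30.0, "chop onions"), Step(30.0, 75.0, "blend sauce")]
paused = step_at(steps, 42.0)
print(paused.text if paused else "no step at this time")
```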

Code usage

Installation

conda create -n detours python=3.10 -y
conda activate detours
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Video features

We use InternVideo features extracted at one feature per second. The features can be downloaded or extracted by following the TAN codebase.
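
Because features are sampled at one per second, a time window in seconds maps directly to a range of feature indices. The sketch below shows one way to do that conversion. It assumes that feature i covers the interval [i, i+1) seconds, which is an assumption about the extraction convention; the helper name `window_to_feature_range` is hypothetical and not part of this codebase.

```python
import math
from typing import Tuple


def window_to_feature_range(
    start_sec: float,
    end_sec: float,
    num_features: int,
    features_per_sec: float = 1.0,
) -> Tuple[int, int]:
    """Map a time window to a half-open [first, last) feature index range.

    Assumes feature i covers [i, i + 1) / features_per_sec seconds.
    Indices are clipped to the valid range [0, num_features].
    """
    if end_sec < start_sec:
        raise ValueError("end_sec must not be smaller than start_sec")
    first = min(max(0, math.floor(start_sec * features_per_sec)), num_features)
    last = min(num_features, math.ceil(end_sec * features_per_sec))
    return first, max(first, last)


# A 600-second video has 600 features at one feature per second.
first, last = window_to_feature_range(12.4, 20.1, num_features=600)
print(first, last)
```

The returned pair can be used to slice a per-video feature array, for example `features[first:last]`.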

Dataset download

Task	Train dataset	Validation dataset	Test dataset
Detour video retrieval	here	here	here
Detour window localization	here	here	here

Training and inference

To train the retrieval model, run:

bash submit_ds_video_retrieval.sh

Use multinode_retrieval.sh to run on multiple nodes on a SLURM cluster.

To train the localization model, run:

bash submit_ds_video_localization.sh

Use multinode_localization.sh to run on multiple nodes on a SLURM cluster.

Checkpoints

The checkpoints will be released soon.

Issues

Please open an issue in this repository (preferred for better visibility) or reach out to kumar.ashutosh@utexas.edu.

Contributing

See the CONTRIBUTING file for how to help out.

Citation

If you use the code or the method, please cite the following paper:

@misc{ashutosh2024detours,
      title={Detours for Navigating Instructional Videos}, 
      author={Kumar Ashutosh and Zihui Xue and Tushar Nagarajan and Kristen Grauman},
      year={2024},
      eprint={2401.01823},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

The majority of VidDetours is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: LLaVA is licensed under the Apache 2.0 license.
