Official code for Detours for Navigating Instructional Videos, CVPR 2024 (highlight).
We propose the first video detours dataset. A detour arises when a user pauses an instructional video and asks the AI assistant a query such as "how to do this step without a blender". The assistant should return another instructional video that satisfies the query while performing the same activity as the original video.
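To make the task definition concrete, here is a minimal sketch of how a single detour instance could be represented. The class name `DetourInstance` and every field name are hypothetical and chosen only for illustration; they are not the schema of the released dataset files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetourInstance:
    """Hypothetical record for one detour; not the released dataset schema."""

    source_video_id: str   # video the user is currently watching
    pause_time_s: float    # time (seconds) at which the user paused
    query: str             # free-form request, e.g. "how to do this step without a blender"
    target_video_id: str   # detour video that should be returned

    def __post_init__(self) -> None:
        # The user cannot pause before the video starts.
        if self.pause_time_s < 0:
            raise ValueError("pause_time_s must be non-negative")
        # A detour points to *another* video, never back to the source.
        if self.target_video_id == self.source_video_id:
            raise ValueError("target video must differ from the source video")


# Example with made-up identifiers.
example = DetourInstance(
    source_video_id="video_a",
    pause_time_s=42.0,
    query="how to do this step without a blender",
    target_video_id="video_b",
)
```

The only constraints enforced here are the ones stated in the description above: the query is issued at a pause point, and the returned video is a different video from the one being watched.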
The (weakly-supervised) training dataset can be downloaded here.
The (weakly-supervised) validation dataset can be downloaded here.
The manually annotated testing dataset can be downloaded here.
We use Llama 2 to summarize HowTo100M cooking videos into distinct steps along with timestamps. These summaries are used to create the video detours dataset. The HowTo100M summary dataset can be downloaded from here.
conda create -n detours python=3.10 -y
conda activate detours
pip install --upgrade pip # enable PEP 660 support
pip install -e .
We use InternVideo features extracted at 1 feature per second. The features can be downloaded or extracted following the TAN codebase.
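Because the features are sampled at 1 feature per second, a time window in seconds maps directly onto a range of feature indices. The sketch below illustrates that mapping; the function name `window_to_indices` and its parameters are assumptions made for this example and are not part of this repository or the TAN codebase.

```python
import math


def window_to_indices(
    start_s: float,
    end_s: float,
    num_features: int,
    features_per_second: float = 1.0,
) -> tuple[int, int]:
    """Map a time window in seconds to a half-open feature index range.

    Hypothetical helper: returns (start_idx, end_idx) such that
    features[start_idx:end_idx] covers the window [start_s, end_s).
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    # Round outward so partially covered seconds are included.
    start_idx = math.floor(start_s * features_per_second)
    end_idx = math.ceil(end_s * features_per_second)
    # Clamp to the valid index range of the feature sequence.
    start_idx = max(0, min(start_idx, num_features))
    end_idx = max(0, min(end_idx, num_features))
    return start_idx, end_idx


# A window from 12.4 s to 15.0 s in a 100-second video.
start_idx, end_idx = window_to_indices(12.4, 15.0, num_features=100)
```

With a feature array of shape `(num_features, dim)`, the slice `features[start_idx:end_idx]` then selects the features for that window. Rounding outward is a design choice of this sketch; a different convention (for example, rounding to the nearest second) may suit other uses.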
Task | Train dataset | Validation dataset | Test dataset |
---|---|---|---|
Detour video retrieval | here | here | here |
Detour window localization | here | here | here |
To train the detour video retrieval model, run:
bash submit_ds_video_retrieval.sh
Use multinode_retrieval.sh to run on multiple nodes on a SLURM cluster.
To train the detour window localization model, run:
bash submit_ds_video_localization.sh
Use multinode_localization.sh to run on multiple nodes on a SLURM cluster.
The checkpoints will be released soon.
Please open an issue in this repository (preferred for better visibility) or reach out to kumar.ashutosh@utexas.edu.
See the CONTRIBUTING file for how to help out.
If you use the code or the method, please cite the following paper:
@misc{ashutosh2024detours,
title={Detours for Navigating Instructional Videos},
author={Kumar Ashutosh and Zihui Xue and Tushar Nagarajan and Kristen Grauman},
year={2024},
eprint={2401.01823},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
The majority of VidDetours is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: LLaVA is licensed under the Apache 2.0 license.