COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Authors

Koki Maeda(3,1)*, Tosho Hirasawa(4,1)*, Atsushi Hashimoto(1), Jun Harashima(2), Leszek Rybicki(2), Yusuke Fukasawa(2), Yoshitaka Ushiku(1)

(1) OMRON SINIC X Corp. (2) Cookpad Inc. (3) Tokyo Institute of Technology (4) Tokyo Metropolitan University

*: Equal contribution. This work was done during an internship at OMRON SINIC X.

Citation

Note

@InProceedings{comkitchens_eccv2024,
   author    = {Koki Maeda and Tosho Hirasawa and Atsushi Hashimoto and Jun Harashima and Leszek Rybicki and Yusuke Fukasawa and Yoshitaka Ushiku},
   title     = {COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark},
   booktitle = {Proceedings of the European Conference on Computer Vision},
   year      = {2024},
}

Link to arXiv

Dataset Details

The COM Kitchens dataset provides cooking videos annotated with structured visual action graphs. It currently supports two benchmarks:

  • Dense Video Captioning on unedited fixed-viewpoint videos (DVC-FV)
  • Online Recipe Retrieval (OnRR)

We provide all the data for both benchmarks, together with .dat files that define the train/validation/test split.

File Structure

data
├─ ap                # captions for each action-by-person entry
├─ frames            # frames extracted from videos (split into train/valid/test)
├─ frozenbilm        # features by FrozenBiLM (used by vid2seq)
└─ main              # recipes annotated by human
    ├─ {recipe_id}      # recipe id
    │   └─ {kitchen_id} # kitchen id
    │       ├─ cropped_images                  # cropped images of bounding boxes for visual action graph
    │       ├─ frames                          # annotated frames for AP of visual action graph
    │       ├─ front_compressed.mp4            # recorded video
    │       ├─ annotations.xml                 # annotations in xml file format
    │       ├─ gold_recipe_translation_en.json # recipe annotations
    │       ├─ gold_recipe.json                # rewritten recipe (in Japanese)
    │       ├─ graph.dot                       # visual action graph
    │       ├─ graph.dot.pdf                   # visualization of visual action graph
    │       └─ obj.names
    ├─ ingredients.txt                # ingredients list in the COM Kitchens dataset
    ├─ ingredients_translation_en.txt # translated ingredients list in the COM Kitchens dataset
    ├─ train.txt                      # list of recipe id in the train split
    └─ val.txt                        # list of recipe id in the validation split
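
As a rough illustration of how this layout can be traversed, the sketch below reads a split file and walks the {recipe_id}/{kitchen_id} directories under data/main. The helper names read_split and iter_recordings are hypothetical and not part of the com_kitchens package, and the sketch assumes that train.txt and val.txt hold one recipe id per line.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_split(split_file: Path) -> List[str]:
    """Return the recipe ids in a split file (assumed: one id per line)."""
    with open(split_file, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def iter_recordings(
    main_dir: Path, recipe_ids: List[str]
) -> Iterator[Tuple[str, str, Path]]:
    """Yield (recipe_id, kitchen_id, directory) for each recording on disk."""
    for recipe_id in recipe_ids:
        recipe_dir = main_dir / recipe_id
        if not recipe_dir.is_dir():
            continue  # listed in the split file but not present locally
        # Every sub-directory of a recipe is treated as one kitchen recording.
        for kitchen_dir in sorted(p for p in recipe_dir.iterdir() if p.is_dir()):
            yield recipe_id, kitchen_dir.name, kitchen_dir
```

For example, iter_recordings(Path("data/main"), read_split(Path("data/main/train.txt"))) would enumerate the training recordings, provided the dataset has been downloaded to data/.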

Important files

gold_recipe.json

gold_recipe.json provides the recipe information, to which the visual action graph is attached.

key                  value       description
"recipe_id"          str         recipe id
"kitchen_id"         int         kitchen id
"ingredients"        List[str]   ingredients list (in Japanese)
"ingredient_images"  List[str]   path of the images of each ingredient
"steps"              List[Dict]  annotations by step
"steps/memo"         str         recipe sentence
"steps/words"        List[str]   recipe split word by word
"steps/ap_ids"       List[Dict]  Correspondence between AP and words
"actions_by_person"  List[str]   annotation of the visual action graph, including the time span and bounding boxes
{recipe_id}/{kitchen_id}/gold_recipe_translation_en.json

gold_recipe_translation_en.json provides only the translated recipe information.

key             value       description
"ingredients"   List[str]   ingredients list (in English)
"steps"         List[Dict]  annotations by step
"steps/memo"    str         recipe sentence
"steps/words"   List[str]   recipe split word by word
"steps/ap_ids"  List[Dict]  Correspondence between AP and words
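
Because both files share the "steps" key, the Japanese and English sentences can be lined up step by step. The sketch below does this under the assumption that the two files list their steps in the same order and in equal number, which this README does not state explicitly. pair_step_sentences and load_json are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Tuple


def load_json(path: Path) -> Dict[str, Any]:
    """Read a UTF-8 encoded JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pair_step_sentences(
    gold: Dict[str, Any], translation: Dict[str, Any]
) -> List[Tuple[str, str]]:
    """Pair each Japanese step sentence with its English counterpart.

    Assumption: steps are aligned by position in both files.
    """
    ja_steps, en_steps = gold["steps"], translation["steps"]
    if len(ja_steps) != len(en_steps):
        raise ValueError("step counts differ; positional alignment is unsafe")
    return [(ja["memo"], en["memo"]) for ja, en in zip(ja_steps, en_steps)]
```

Raising on a length mismatch is a deliberate choice here: silently truncating with zip would hide recordings where the alignment assumption fails.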

Download Procedure for COM Kitchens

Note

Application Form. English support will be available soon.

Quick Start

Dataset Preparation

  1. Dataset Preparation
    1. Download annotation files and videos.
  2. Preprocess
    1. Run python -m com_kitchens.preprocess.video to extract all frames from the videos.
    2. Run python -m com_kitchens.preprocess.recipe to extract all action-by-person entries for the videos.

Warning

For simplicity, the preprocessing step extracts every frame. You can save disk space by extracting only the frames that the annotation files actually use.
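
One way to decide which frames to keep is sketched below. It is not the repository's preprocessing code: it assumes the time spans have already been read from the annotation files and converted to (start, end) pairs in seconds, and that the video frame rate is known. The function name frame_indices and the stride parameter are hypothetical.

```python
from typing import Iterable, List, Tuple


def frame_indices(
    spans: Iterable[Tuple[float, float]], fps: float, stride: int = 1
) -> List[int]:
    """Return the sorted, de-duplicated frame indices covered by the spans.

    spans  -- (start, end) pairs in seconds (assumed unit)
    fps    -- frame rate of the recorded video
    stride -- keep every stride-th frame inside each span
    """
    wanted = set()
    for start, end in spans:
        first = round(start * fps)
        last = round(end * fps)
        wanted.update(range(first, last + 1, stride))
    return sorted(wanted)
```

The resulting indices could then be handed to whichever frame extractor you use, so that only those frames are written to disk.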

Online Recipe Retrieval (OnRR)

  1. Training
    1. Run sh scripts/onrr-train-xclip.sh for a quick start to training.
  2. Evaluation
    1. Run sh scripts/onrr-eval-xclip.sh {your/path/to/ckpt} to evaluate a checkpoint.

Training UniVL models in OnRR

UniVL requires S3D features extracted from the videos.

  1. Download s3d_howto100m.pth to cache/s3d_howto100m.pth or another path you configure.
  2. Run sh scripts/extract_s3d_features.sh to extract S3D features.
  3. Download the pretrained model univl.pretrained.bin to cache/univl.pretrained.bin or another path you configure.
  4. Run sh scripts/onrr-train-univl.sh to train UniVL models.

Dense Video Captioning on unedited fixed-viewpoint videos (DVC-FV)

  1. Docker Images
    1. Run make build-docker-images to build docker images.
  2. Preprocess
    1. Run sh scripts/dvc-vid2seq-prep to extract
  3. Training & Evaluation
    1. Run sh scripts/vid2seq-zs.sh to evaluate a pre-trained vid2seq model.
    2. Run sh scripts/vid2seq-ft.sh to fine-tune and evaluate a vid2seq model.
    3. Run sh scripts/vid2seq-ft-rl-as.sh to fine-tune and evaluate a vid2seq model that incorporates the action graph as both relation labels and attention supervision (RL+AS).

LICENSE

This project (other than the dataset) is licensed under the MIT License; see the LICENSE.txt file for details.