/xmic

X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization, CVPR 2024

Primary LanguagePythonMIT LicenseMIT

X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization, CVPR 2024

paper, video

alt text

Datasets

Epic-Kitchens (we used rgb frames)

Ego4D (we used fho subset)


Hand-crops (this step can be skipped to run basic version):

Epic:

  • download hand crops for epic kitchens from the following repo

  • we preprocess the provided crops by applying union on the objects that touch with hands and all visible hands. We keep the default parameters from the respective library.

  • Put in a pickle in a format: dict[segment_id][frame_idx] = (left, top, right, bottom)
    (Otherwise, the library works too long if used without preextracting and preprocessing)

  • save file with the name:hand_thd0.8_obj_thd0.01_ONLY_inter_obj_with_HANDS_v2

Download hand crops detection here for Ego4D and apply similar preprocessing: https://github.com/Chuhanxx/helping_hand_for_egocentric_videos


Splits:

All splits of shared and unique (novel) noun and verb classes are in folder anno/


Prerequisites

  • follow CoOp to install prerequisites. However, skip installation of Dassl as its modified version is already integrated into the framework and the requirements will be installed during the next step
  • Go to the Dassl folder and run:
cd x-mic/Dassl.pytorch

# Install dependencies
pip install -r requirements.txt

# Install this library (no need to re-build if the source code is modified)
python setup.py develop
  • [In case of no internet connection during training] In general, CLIP model will be downloaded automatically. However, in case if you do not have internet connection during training, download CLIP vit-b-16 manually and set the path in ‘x-mic/clip/clip’ as a default parameter in _download function “root” parameter.

Extract features for faster training and evaluation

this step also can be skipped

Full frames

Epic config: extract_EPIC_clip_vitb16_segments.yaml

To change:

DATASET.ROOT - where your dataset is located with the structure DATASET.ROOT/annotations, DATASET.ROOT/epic_kitchens_videos_256ss

and OUTPUT_DIR

Ego config: extract_EGO4D_clip_vitb16.yaml

To change:

DATASET.ROOT - where your dataset is located with the structure DATASET.ROOT/annotations, DATASET.ROOT/epic_kitchens_videos_256ss

DATA.PATH_TO_DATA_DIR: - path to annotations

DATA.PATH_PREFIX: - path to videos

DATASET.ROOT - path to videos (same as path_prefix)

and OUTPUT_DIR


Hand Crops:

Epic config: extract_EPIC_clip_vitb16_segments_handcrops.yaml

see full frames +

DATASET.DETECTION_ROOT - path to hand crop annotations

Ego4d config: extract_EGO4D_clip_vitb16_handcrops.yaml


Run the scrips:

To run the script on a subset distributed over 8 gpus:

export OMP_NUM_THREADS=64; export NCCL_ASYNC_ERROR_HANDLING=1; torchrun --standalone --nproc_per_node=8 --nnodes 1 feat_extractor_segments_distributed.py --config_name XX --split YY --distributed --seed 42

To run the script on a subset on a single gpu: python feat_extractor_segments.py --config_name XX --split YY --div 0

XX - config name without “.yaml” extension and folder

YY - train or validation

Similarly, features can be extracted with DINO and Lavila models.


Run Training and Eval

Config params:

DATA.PATH_TO_DATA_DIR - Ego4D dataset annotations location

DATA.PATH_PREFIX - Ego4D features that will be classified with adopted classifier - best results with hand cropped frames

DATA.PATH_PREFIX_DINO - Ego4D features that will be adopted - best results with hand cropped frames

DATA.PATH_PREFIX_DINO2 - Ego4D features that will be adopted. This and previous features will be combined in the adaptation module - best results with full frames

DATALOADER.FEATURES_NAME - Epic features that will be classified with adopted classifier - best results with hand cropped frames

DATALOADER.FEATURES_NAME_DINO - Epic features that will be adopted - best results with hand cropped frames

DATALOADER.FEATURES_NAME_DINO2 - Epic features that will be adopted. This and previous features will be combined in the adaptation module - best results with full frames

note that all these features can be the same. If use the model without hand crops, set DATALOADER.USE_DINO_FEATURES2 = False

Set resolution of conditioning features in DATALOADER.DINO_DIM if it’s different from 512

If only one dataset is available, disable cross-dataset evaluation by setting TEST.CROSS_DATASET.EVAL = False

Run the scrips

train X-MIC config: XMIC_vitb16.yaml

setup data or feature paths for one or two datasets

XX - name of the config file located in scripts/configs folder

With single gpu:

Epic nouns:

sh scripts/baselines/epic_gpu1.sh noun XX

Epic verbs:

sh scripts/baselines/epic_gpu1.sh verb XX

Ego4d nouns:

sh scripts/baselines/ego_gpu1.sh noun XX

Ego4d verbs:

sh scripts/baselines/ego_gpu1.sh verb XX

With 8 gpus:

Epic nouns:

sh scripts/baselines/epic_gpu8.sh noun XX

Epic verbs:

sh scripts/baselines/epic_gpu8.sh verb XX

Ego4d nouns:

sh scripts/baselines/ego_gpu8.sh noun XX

Ego4d verbs:

sh scripts/baselines/ego_gpu8.sh verb XX


Tips


Important Note

Unfortunately, after my internship all models and data were deleted due to internal refactoring. Therefore, I lost all the pretrained models, parts of code and could not make a final verification of the code.

Feel free to connect with me via email in case of any questions.

I sincerely apologise for the inconvenience it may cause.


Citation

If you use our work, please consider citing:


@inproceedings{kukleva2024xmic,
  title={X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization},
  author={Kukleva, Anna and Sener, Fadime and Remelli, Edoardo and Tekin, Bugra and Sauser, Eric and Schiele, Bernt and Ma, Shugao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}


Acknowledgements

The code is based on CoOp and Maple repos