DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [AAAI 2023]




In this paper, we present DUET, a transformer-based end-to-end zero-shot learning (ZSL) method that integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy for considering multi-model objectives.

Model Architecture


Code Path

├── cache
│   ├── AWA2
│   │   ├── attributeindex2prompt.json
│   │   └── id2imagepixel.pkl
│   ├── CUB
│   │   ├── attributeindex2prompt.json
│   │   ├── id2imagepixel.pkl
│   │   └── mapping.json
│   └── SUN
│       ├── attributeindex2prompt.json
│       ├── id2imagepixel.pkl
│       └── mapping.json
├── data
│   ├── AWA2
│   │   ├── APN.mat
│   │   ├── TransE_65000.mat
│   │   ├── att_splits.mat
│   │   ├── attri_groups_9.json
│   │   ├── kge_CH_AH_CA_60000.mat
│   │   └── res101.mat
│   ├── CUB
│   │   ├── APN.mat
│   │   ├── att_splits.mat
│   │   ├── attri_groups_8.json
│   │   ├── attri_groups_8_layer.json
│   │   └── res101.mat
│   └── SUN
│       ├── APN.mat
│       ├── att_splits.mat
│       ├── attri_groups_4.json
│       └── res101.mat
├── log
│   ├── AWA2
│   ├── CUB
│   └── SUN
├── model
│   ├── log.py
│   ├── main.py
│   ├── main_utils.py
│   ├── model_proto.py
│   ├── modeling_lxmert.py
│   ├── opt.py
│   ├── swin_modeling_bert.py
│   ├── util.py
│   └── visual_utils.py
├── out
│   ├── AWA2
│   ├── CUB
│   └── SUN
└── script
    ├── AWA2
    │   └── AWA2_GZSL.sh
    ├── CUB
    │   └── CUB_GZSL.sh
    └── SUN
        └── SUN_GZSL.sh
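
Before training, it can help to confirm that the files shown in the tree above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_files` and the `REQUIRED` table are illustrative, and only the files that appear for all three datasets in the tree are checked.

```python
from pathlib import Path

# Files that every dataset has in the tree above.
# mapping.json and the attri_groups_*.json names differ per dataset,
# so they are deliberately not checked here.
REQUIRED = {
    "data": ["APN.mat", "att_splits.mat", "res101.mat"],
    "cache": ["attributeindex2prompt.json", "id2imagepixel.pkl"],
}


def missing_files(root, dataset):
    """Return the required files for `dataset` that are absent under `root`."""
    root = Path(root)
    missing = []
    for folder, names in REQUIRED.items():
        for name in names:
            path = root / folder / dataset / name
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    # Run from the repository root.
    for dataset in ("AWA2", "CUB", "SUN"):
        gaps = missing_files(".", dataset)
        status = "ok" if not gaps else "missing: " + ", ".join(map(str, gaps))
        print(f"{dataset}: {status}")
```

Run from the repository root, this prints one line per dataset and lists any file that still needs to be downloaded or generated.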


Dependencies

  • Python 3
  • PyTorch >= 1.8.0
  • Transformers >= 4.11.3
  • NumPy
  • All experiments were performed on a single RTX 3090 Ti GPU.
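
The version requirements above can be checked programmatically. The sketch below is an illustration only, assuming the packages are installed under their usual PyPI names (`torch`, `transformers`, `numpy`); the helpers `parse` and `check` are hypothetical names, and the version parsing is deliberately simple (it looks only at the leading numeric components).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the list above; None means no minimum is stated.
MINIMUMS = {"torch": "1.8.0", "transformers": "4.11.3", "numpy": None}


def parse(ver):
    """Turn '1.13.1+cu117' into (1, 13, 1), ignoring local and pre-release suffixes."""
    parts = []
    for piece in ver.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '1.8' compares equal to '1.8.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check(minimums):
    """Map each package name to 'ok', 'missing', or 'too old (found X)'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is not None and parse(found) < parse(minimum):
            report[name] = f"too old (found {found})"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check(MINIMUMS).items():
        print(f"{name}: {status}")
```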


Data Preparation

  • Dataset: download the datasets (CUB, AWA2, SUN) and set opt.image_root to the dataset root path on your machine.
  • Data split: please download the data folder and place it in ./data/.
  • attributeindex2prompt.json should be generated and placed in ./cache/dataset/.
  • Download the pretrained vision Transformer as the vision encoder:

Code Structures

The code is organized into four parts:

  • model: the main files for the DUET network.
  • data: the data splits for the different datasets.
  • cache: cache files (attribute prompts, image pixel caches, and mappings).
  • script: the training scripts for DUET.

Train & Eval

The training script for AWA2_GZSL:

bash script/AWA2/AWA2_GZSL.sh
[--dataset {AWA2, SUN, CUB}] [--calibrated_stacking CALIBRATED_STACKING] [--nepoch NEPOCH] [--batch_size BATCH_SIZE] [--manualSeed MANUAL_SEED]
[--classifier_lr LEARNING-RATE] [--xe XE] [--attri ATTRI] [--gzsl] [--patient PATIENT] [--model_name MODEL_NAME] [--mask_pro MASK-PRO] 
[--mask_loss_xishu MASK_LOSS_XISHU] [--xlayer_num XLAYER_NUM] [--construct_loss_weight CONSTRUCT_LOSS_WEIGHT] [--sc_loss SC_LOSS] [--mask_way MASK_WAY]
[--attribute_miss ATTRIBUTE_MISS]


  • You can edit the .sh file to change these parameters.
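
Among the flags above, --calibrated_stacking presumably sets the constant used in calibrated stacking, a common technique in generalized zero-shot learning (GZSL) where a fixed value is subtracted from the scores of seen classes so that predictions are less biased toward them. The sketch below illustrates that general technique only; it is not the repository's implementation, and the function name, scores, and constant are made up for the example.

```python
def predict_with_calibrated_stacking(scores, seen_classes, gamma):
    """Return the predicted class index after penalising seen-class scores.

    scores: per-class scores for one image
    seen_classes: set of indices of classes seen during training
    gamma: calibration constant subtracted from every seen-class score
    """
    adjusted = [
        s - gamma if idx in seen_classes else s
        for idx, s in enumerate(scores)
    ]
    # Index of the highest adjusted score.
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy example: classes 0 and 1 are seen, class 2 is unseen.
scores = [2.0, 1.5, 1.8]
print(predict_with_calibrated_stacking(scores, {0, 1}, gamma=0.0))  # 0
print(predict_with_calibrated_stacking(scores, {0, 1}, gamma=0.5))  # 2
```

With no calibration the seen class 0 wins; once 0.5 is subtracted from the seen-class scores, the unseen class 2 has the highest score. A larger constant shifts more predictions toward unseen classes, so it is typically tuned per dataset.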


Please consider citing this paper if you use the code or data from our work. Thanks a lot :)

    author    = {Chen, Zhuo and Huang, Yufeng and Chen, Jiaoyan and Geng, Yuxia and Zhang, Wen and Fang, Yin and Pan, Jeff Z and Song, Wenting and Chen, Huajun},
    title     = {DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning},
    booktitle = {Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI)},
    year      = {2023}