AI Final Project

Prepare Environment

    conda create -n cvpdl_final python=3.10
    conda activate cvpdl_final
    pip install -r requirements.txt
    pip install git+https://github.com/openai/CLIP.git

Prepare Dataset

    wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
    tar xvf VOCtrainval_11-May-2012.tar
    mv VOCdevkit/VOC2012 .
    rm -r VOCdevkit/

COCO

    wget http://images.cocodataset.org/zips/train2017.zip
    wget http://images.cocodataset.org/zips/val2017.zip
    wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip

Prepare Images' Embedding

Preprocessing takes about 30 minutes for vit_b and about 60 minutes for vit_h.

    CUDA_VISIBLE_DEVICES={GPU_ID} sh scripts/preprocess.sh vit_h

COCO:

    sh scripts/preprocess_coco.sh vit_h

Do Experiments

To run an experiment with the naive prompt learning method, use --dataset to specify the dataset (VOC or COCO).

    CUDA_VISIBLE_DEVICES={GPU_ID} python prompt_learning.py --n-emb 1

I notice that using multiple tokens for one class can boost the performance:

    CUDA_VISIBLE_DEVICES={GPU_ID} python prompt_learning.py --n-emb 2

CoCoOp

    CUDA_VISIBLE_DEVICES={GPU_ID} python prompt_learning.py --n-emb {N} --trainer cocoop --batch-size 8
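The results reported for 1 to 6 tokens suggest a sweep over --n-emb. A minimal sketch of how such a sweep could be scripted, assuming it is run from the repository root. `build_command` and `sweep` are hypothetical helper names that are not part of the repository, and only flags shown in the commands above are used.

```python
import os
import subprocess
import sys
from typing import List, Optional


def build_command(n_emb: int, trainer: Optional[str] = None,
                  batch_size: Optional[int] = None) -> List[str]:
    """Build a prompt_learning.py command line (hypothetical helper)."""
    cmd = [sys.executable, "prompt_learning.py", "--n-emb", str(n_emb)]
    if trainer is not None:
        cmd += ["--trainer", trainer]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    return cmd


def sweep(gpu_id: str, dry_run: bool = True) -> List[List[str]]:
    """List (and optionally run) the CoCoOp command for n-emb = 1..6."""
    # Same effect as prefixing the command with CUDA_VISIBLE_DEVICES={GPU_ID}.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)
    commands = [build_command(n, trainer="cocoop", batch_size=8)
                for n in range(1, 7)]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, env=env, check=True)
    return commands


if __name__ == "__main__":
    # dry_run=True only prints the commands; set it to False to launch them.
    sweep(gpu_id="0", dry_run=True)
```

Calling `build_command(n)` without `trainer` and `batch_size` gives the plain prompt learning command instead of the CoCoOp one.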

Experimental Results

Fully supervised method (CoOp, epoch 200)

number of tokens  1         2         3         4         5         6
mIoU              0.645413  0.667269  0.694688  0.709082  0.720209  0.719938

Fully supervised method (CoCoOp, epoch 200)

number of tokens  1         2         3         4         5         6
mIoU              0.743866  0.764570  0.773365  0.760778  0.767652  0.777974
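For reference, mIoU is conventionally the per-class intersection over union, TP / (TP + FP + FN), averaged over classes. The sketch below computes it from a confusion matrix in plain Python. It illustrates the standard definition only and is an assumption about this repository's evaluation code, which may differ (for example in how the background class is handled). The function names are hypothetical, and the counts are made up.

```python
from typing import List, Sequence


def per_class_iou(confusion: Sequence[Sequence[int]]) -> List[float]:
    """IoU per class from a square confusion matrix.

    confusion[i][j] counts pixels whose true class is i and predicted class is j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true otherwise
        union = tp + fp + fn
        ious.append(tp / union if union > 0 else float("nan"))
    return ious


def mean_iou(confusion: Sequence[Sequence[int]]) -> float:
    """Average IoU over classes with a non-empty union."""
    valid = [v for v in per_class_iou(confusion) if v == v]  # v == v drops NaN
    return sum(valid) / len(valid) if valid else float("nan")


# Toy 2-class example with made-up counts (not taken from the experiments).
toy = [[3, 1],
       [1, 5]]
print(per_class_iou(toy))
print(mean_iou(toy))
```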

Zero Shot method (CoOp)

Class        Training  n-emb = 1  n-emb = 2  n-emb = 3  n-emb = 4  n-emb = 5  n-emb = 6
mIoU                   0.6077     0.6092     0.6580     0.6580     0.6901     0.6668
aeroplane    v         0.9048     0.9099     0.9149     0.9172     0.9206     0.9203
bicycle      v         0.7908     0.8631     0.8567     0.8668     0.8757     0.8741
bird         v         0.3244     0.2345     0.3870     0.3628     0.4020     0.3664
boat         v         0.6899     0.6774     0.6858     0.6692     0.7563     0.7279
bottle       v         0.5939     0.6188     0.6284     0.6090     0.6608     0.6711
bus          v         0.3517     0.3348     0.4393     0.4379     0.5710     0.5304
car          v         0.8240     0.8160     0.8799     0.8449     0.8662     0.8498
cat          v         0.4982     0.4550     0.5200     0.4987     0.5047     0.5276
chair        v         0.8356     0.8551     0.8565     0.8550     0.8696     0.8527
cow          v         0.1126     0.1169     0.2788     0.2918     0.2938     0.2912
diningtable  v         0.7978     0.7768     0.8315     0.8063     0.8526     0.8310
dog          v         0.3493     0.3109     0.3751     0.3581     0.4193     0.4178
horse        v         0.7815     0.7637     0.8075     0.8324     0.8377     0.7725
motorbike    v         0.7965     0.7607     0.8174     0.7880     0.8117     0.8260
person       v         0.8192     0.8338     0.8405     0.8481     0.8491     0.8106
pottedplant  v         0.6871     0.7724     0.7768     0.8326     0.8228     0.8030
sheep                  0.1100     0.1262     0.2135     0.2590     0.2709     0.2332
sofa                   0.7352     0.7168     0.8341     0.7481     0.7967     0.7840
train                  0.2013     0.2016     0.2410     0.2827     0.2404     0.1972
tvmonitor              0.7659     0.8329     0.7881     0.8054     0.8207     0.8412
Unseen                 0.4531     0.4694     0.5192     0.5238     0.5322     0.5139

Zero Shot method (CoCoOp)

Class        Training  n-emb = 1  n-emb = 2  n-emb = 3  n-emb = 4  n-emb = 5  n-emb = 6
mIoU                   0.6887     0.6921     0.7148     0.7012     0.6984     0.7144
aeroplane    v         0.8361     0.8940     0.8829     0.9173     0.9019     0.9141
bicycle      v         0.4244     0.3814     0.3752     0.4796     0.4401     0.4752
bird         v         0.8126     0.8349     0.8074     0.8157     0.8611     0.8379
boat         v         0.6920     0.6092     0.6451     0.7270     0.7298     0.7197
bottle       v         0.6226     0.7026     0.7043     0.6544     0.6798     0.6670
bus          v         0.8795     0.8834     0.8587     0.8859     0.8854     0.9090
car          v         0.6219     0.6693     0.6977     0.6651     0.6727     0.6820
cat          v         0.8866     0.9027     0.8895     0.8891     0.8817     0.8696
chair        v         0.3552     0.3733     0.3827     0.3749     0.4084     0.4000
cow          v         0.8924     0.8864     0.8642     0.8906     0.8567     0.8838
diningtable  v         0.5344     0.5108     0.5382     0.5860     0.5396     0.5985
dog          v         0.8652     0.8529     0.8555     0.8433     0.8614     0.8347
horse        v         0.8271     0.8444     0.8288     0.8549     0.8370     0.8380
motorbike    v         0.8496     0.8478     0.8392     0.8721     0.8583     0.8595
person       v         0.8194     0.8188     0.8256     0.8574     0.8328     0.8398
pottedplant  v         0.4138     0.4477     0.4136     0.4524     0.4093     0.4747
sheep                  0.7113     0.8207     0.8148     0.5913     0.7218     0.7726
sofa                   0.3127     0.3497     0.3720     0.2907     0.3297     0.2811
train                  0.2585     0.1865     0.5255     0.2599     0.2027     0.3608
tvmonitor              0.1917     0.1870     0.1750     0.3147     0.2045     0.2198
Unseen                 0.3686     0.3860     0.4718     0.3642     0.3647     0.4086

Zero Shot method (CoCoOp v2)

Class        Training  n-emb = 1  n-emb = 2  n-emb = 3  n-emb = 4  n-emb = 5  n-emb = 6
mIoU                   0.6780     0.6659     0.6945     0.7082     0.6894     0.7119
aeroplane    v         0.8905     0.8963     0.9185     0.8865     0.9195     0.8847
bicycle      v         0.3670     0.2885     0.3993     0.3787     0.3737     0.3405
bird         v         0.7846     0.8188     0.8131     0.8019     0.8009     0.7684
boat         v         0.6635     0.6512     0.7285     0.6277     0.7103     0.6656
bottle       v         0.6236     0.6740     0.6647     0.6666     0.6124     0.6549
bus          v         0.8625     0.8481     0.8926     0.8792     0.8854     0.8799
car          v         0.5975     0.6051     0.6007     0.6686     0.6009     0.6390
cat          v         0.8649     0.8698     0.8636     0.8773     0.8922     0.8615
chair        v         0.2712     0.2420     0.3284     0.2742     0.2171     0.2875
cow          v         0.8500     0.8206     0.8360     0.8128     0.8818     0.8590
diningtable  v         0.4238     0.4621     0.5236     0.5053     0.4748     0.5210
dog          v         0.8081     0.7754     0.8065     0.8292     0.8290     0.8238
horse        v         0.7983     0.8162     0.8210     0.8570     0.7896     0.8208
motorbike    v         0.8329     0.8301     0.8557     0.8627     0.8484     0.8671
person       v         0.7707     0.7902     0.8046     0.8182     0.8091     0.8183
pottedplant  v         0.4101     0.4026     0.3957     0.4088     0.4177     0.4555
sheep                  0.7405     0.7512     0.7415     0.8061     0.7711     0.8033
sofa                   0.3207     0.2846     0.2801     0.4168     0.1957     0.4101
train                  0.7280     0.4007     0.6810     0.7448     0.6708     0.8156
tvmonitor              0.1178     0.1726     0.0201     0.0819     0.1981     0.1471
Unseen                 0.4768     0.4023     0.4307     0.5124     0.4589     0.5440
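In the zero-shot results, the Unseen value appears to be the mean IoU over the four classes not marked with v under Training (sheep, sofa, train, tvmonitor). A short check using the CoOp n-emb = 1 values; the name `unseen_mean` is illustrative only and does not come from the repository.

```python
from typing import Dict, Iterable

# IoU values copied from the CoOp zero-shot results for n-emb = 1.
coop_n_emb_1: Dict[str, float] = {
    "sheep": 0.1100,
    "sofa": 0.7352,
    "train": 0.2013,
    "tvmonitor": 0.7659,
}

UNSEEN_CLASSES = ("sheep", "sofa", "train", "tvmonitor")


def unseen_mean(iou: Dict[str, float],
                classes: Iterable[str] = UNSEEN_CLASSES) -> float:
    """Average the per-class IoU over the unseen classes."""
    values = [iou[name] for name in classes]
    return sum(values) / len(values)


print(round(unseen_mean(coop_n_emb_1), 4))  # 0.4531, the reported Unseen value
```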

To Do

  • Add Seen/Unseen splits to VOC2012Dataset
    • Train on class -[1~16] ("aeroplane" ~ "pottedplant")
    • Test on class -[17~20] ("sheep", "sofa", "train", "tvmonitor")
  • Add checkpoint saver
  • Add logger
  • Record experimental results
  • Rewrite the code in a coding style you are familiar with
  • Add explicit data types to the arguments of functions
  • Add MSCOCO
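
The Seen/Unseen split from the first item could be sketched as follows. This is only an illustration, assuming classes are numbered 1 to 20 in the order used by the result tables. `split_classes` is a hypothetical helper and not the actual VOC2012Dataset interface.

```python
from typing import List, Tuple

VOC_CLASSES: List[str] = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def split_classes(n_seen: int = 16) -> Tuple[List[int], List[int]]:
    """Return 1-based class ids for the seen (train) and unseen (test) splits."""
    if not 0 < n_seen < len(VOC_CLASSES):
        raise ValueError("n_seen must leave at least one class on each side")
    ids = list(range(1, len(VOC_CLASSES) + 1))
    return ids[:n_seen], ids[n_seen:]


seen, unseen = split_classes()
print([VOC_CLASSES[i - 1] for i in unseen])  # ['sheep', 'sofa', 'train', 'tvmonitor']
```

The explicit type annotations also follow the item about adding data types to function arguments.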