This repo provides the implementation of the paper "Learning Distinct and Representative Modes for Image Captioning".
Install the dependencies:
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
pip install transformers yacs scipy
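If you want a quick sanity check that the CUDA 11.6 build of PyTorch was picked up (this check is not part of the original instructions), you can run:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"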
Prepare the data by following the instructions in VLP.
To train a model, run:
python -m modecap.train data_dir PATH_TO_DATA
To run inference with a trained model, run:
python -m modecap.inference data_dir PATH_TO_DATA model_path PATH_TO_MODEL
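The trailing arguments (data_dir, model_path) are space-separated key/value overrides. Since yacs is listed as a dependency, a plausible reading (an assumption about the argument handling, not the repo's actual code) is that they are merged into a yacs CfgNode roughly as in the sketch below; load_config and the default values are hypothetical.

```python
# Minimal sketch of how "key value" CLI overrides are typically consumed with yacs.
# This is an assumption about modecap's argument handling, not its actual code.
import sys
from yacs.config import CfgNode as CN

cfg = CN()
cfg.data_dir = ""    # used by the train and inference commands above
cfg.model_path = ""  # used by the inference command above

def load_config(argv):
    """Merge space-separated key/value pairs (e.g. `data_dir PATH_TO_DATA`) into cfg."""
    cfg.merge_from_list(argv)
    cfg.freeze()
    return cfg

if __name__ == "__main__":
    print(load_config(sys.argv[1:]))
```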