This project applies an end-to-end model structure (a ViT encoder paired with a GPT decoder) to describe images in greater detail, rather than traditional image captioning that only provides object detections or a few simple sentences.
Details: https://www.dropbox.com/scl/fi/ybvzpnkkcy7lnn9jbkkl1/report.pdf?rlkey=grunab2372z90x0uj429x2n5o&dl=1
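The encoder-decoder flow above can be sketched as a toy pipeline: split the image into patches (as a ViT does), encode each patch into a feature, then generate caption tokens autoregressively (as a GPT decoder does). This is a minimal, dependency-free illustration of the data flow only; the function names, the mean-pooling "encoder", and the deterministic token picker are placeholders, not the project's actual model code.

```python
# Toy sketch of a ViT + GPT captioning pipeline (illustrative only;
# the real model uses learned transformer weights, not these stand-ins).

def patchify(image, patch_size=4):
    """Split a 2D image (list of rows) into flat patch vectors, ViT-style."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch_size):
        for c in range(0, w, patch_size):
            patches.append([image[r + i][c + j]
                            for i in range(patch_size)
                            for j in range(patch_size)])
    return patches

def encode(patches):
    """Stand-in for the ViT encoder: one scalar feature per patch (its mean)."""
    return [sum(p) / len(p) for p in patches]

def decode(features, vocab, max_len=5):
    """Stand-in for the GPT decoder: greedy autoregressive token selection."""
    tokens = ["<bos>"]
    for step in range(max_len):
        # A real decoder would attend over `features` and the past tokens;
        # here we just pick a vocabulary entry deterministically.
        idx = int(sum(features) + step) % len(vocab)
        tokens.append(vocab[idx])
    return tokens[1:]  # drop the <bos> marker

image = [[float((r + c) % 7) for c in range(8)] for r in range(8)]
caption = decode(encode(patchify(image)), ["a", "dog", "on", "grass", "runs"])
```

An 8x8 image with 4x4 patches yields 4 patch vectors of 16 values each, and the decoder emits `max_len` tokens, mirroring the shape of the real encode-then-generate loop.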
(optional) conda create -n <name> python=3.9
pip install -r requirements.txt
Before running the following scripts, install Java manually.
bash scripts/download.sh
## Tune the configuration in scripts/train.sh before running.
bash scripts/train.sh $output_model_path
## Tune the configuration in scripts/predict.sh before running.
bash scripts/predict.sh