ViLT_multimodal

A ViLT model fine-tuned for scenic image classification with comments.

Version 1

Fine-tunes the model on labeled images. Requires a large memory allocation and may crash as the dataset grows.

Version 2

Preprocesses the data and stores the encodings in local storage. Requires a large amount of storage space.
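
As a rough illustration of this approach, the sketch below encodes every sample once and writes each encoding to its own file, which is why disk usage grows with the dataset. It is not taken from this repository: `encode`, `cache_encodings`, and `load_encoding` are hypothetical names, and `encode` is a toy stand-in for the real ViLT preprocessing step.

```python
import pickle
import tempfile
from pathlib import Path


def encode(image_path: str, comment: str) -> dict:
    # Hypothetical stand-in for the real ViLT preprocessing; returns a toy encoding.
    return {"image": image_path, "input_ids": [ord(c) for c in comment]}


def cache_encodings(samples, cache_dir: Path) -> list:
    """Encode every sample once and write each encoding to its own file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (image_path, comment, label) in enumerate(samples):
        out = cache_dir / f"{i:06d}.pkl"
        with out.open("wb") as f:
            pickle.dump({"encoding": encode(image_path, comment), "label": label}, f)
        paths.append(out)
    return paths


def load_encoding(path: Path) -> dict:
    """Read one cached encoding back from disk."""
    with path.open("rb") as f:
        return pickle.load(f)


# Made-up samples: (image path, comment, label)
samples = [("a.jpg", "sunset over the lake", 1), ("b.jpg", "parking lot", 0)]
cache_dir = Path(tempfile.mkdtemp())
paths = cache_encodings(samples, cache_dir)
first = load_encoding(paths[0])
```

Training then reads one file at a time, so memory stays flat, but every encoding has to live on disk.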

Version 3

Resolves the memory growth of Version 1 and the storage requirement of Version 2. Stable version.
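
The README does not say how this was achieved. One common way to avoid both problems is to encode each sample lazily, only when a batch asks for it, so that nothing is held in memory or written to disk ahead of time. The sketch below shows that pattern under that assumption; `encode`, `LazyEncodedDataset`, and `iter_batches` are hypothetical names, and `encode` is a toy stand-in for the real ViLT preprocessing step.

```python
from typing import Iterator, List, Sequence, Tuple

Sample = Tuple[str, str, int]  # (image path, comment, label)


def encode(image_path: str, comment: str) -> dict:
    # Hypothetical stand-in for the real ViLT preprocessing; returns a toy encoding.
    return {"image": image_path, "input_ids": [ord(c) for c in comment]}


class LazyEncodedDataset:
    """Keeps only the raw sample list; encodes a sample when it is accessed."""

    def __init__(self, samples: Sequence[Sample]) -> None:
        self.samples = list(samples)

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict:
        image_path, comment, label = self.samples[idx]
        item = encode(image_path, comment)  # computed on demand, not stored
        item["label"] = label
        return item


def iter_batches(dataset: LazyEncodedDataset, batch_size: int) -> Iterator[List[dict]]:
    """Yield batches one at a time so only one batch of encodings exists at once."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


# Made-up samples for illustration
dataset = LazyEncodedDataset([
    ("a.jpg", "sunset over the lake", 1),
    ("b.jpg", "parking lot", 0),
    ("c.jpg", "mountain trail", 1),
])
batch_sizes = [len(batch) for batch in iter_batches(dataset, batch_size=2)]
```

The trade-off is that each sample is re-encoded every epoch, which costs compute in exchange for constant memory and no extra storage.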