
Cross-Modal Food Retrieval

This is my master thesis project at Leiden University. The written report can be found here: https://theses.liacs.nl/pdf/2022-2023-GaoYaqiong.pdf

The Recipe1M dataset was used in this work.


This repository is organized as follows:

  • data

    • The data was downloaded from http://im2recipe.csail.mit.edu/dataset/download/
    • The contents of the directory DATASET_PATH should be as follows; after uncompressing, train/, val/, and test/ must contain the image files for each split (a minimal layout check is sketched after this list):
    • layer1.json
      layer2.json
      vocab.txt
      train/
      val/
      test/
      
  • preprocessing

  • dual_stream

  • single_stream

    • fine_tune: two VLP models (Oscar and ViLT) are fine-tuned on Recipe1M
    • recipevl: the recipe vision-language (RecipeVL) model is trained from scratch
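
As a quick sanity check before preprocessing, a small script along these lines can verify that DATASET_PATH has the expected layout. This is a minimal sketch, not part of the repository; check_dataset is a hypothetical helper, and only the file and directory names come from the list above.

```python
import os
import sys

# Expected contents of DATASET_PATH, per the list above.
EXPECTED_FILES = ["layer1.json", "layer2.json", "vocab.txt"]
EXPECTED_SPLITS = ["train", "val", "test"]

def check_dataset(root):
    """Print what is missing under root; return True if the layout looks complete."""
    ok = True
    for name in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(root, name)):
            print(f"missing file: {name}")
            ok = False
    for split in EXPECTED_SPLITS:
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            print(f"missing split directory: {split}/")
            ok = False
        # The splits should contain image files once the archives are uncompressed.
        elif not any(f.lower().endswith((".jpg", ".jpeg", ".png"))
                     for _, _, files in os.walk(split_dir) for f in files):
            print(f"no images found under {split}/ (not uncompressed yet?)")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_dataset(sys.argv[1]) else 1)
```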

Data Preparation

Run the scripts under preprocessing:

  • Running python bigrams.py --create saves to disk all bigrams from the corpus of recipe titles in the training set, sorted by frequency (see the sketch after this list).
  • Running python bigrams.py --no_create creates class labels from the Food-101 categories and the top bigrams; this produces the classes1M.pkl file, which is used later.
  • Running python preprocessing.py --root DATASET_PATH creates a folder /traindata containing the data used for training.
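
For intuition, the bigram steps boil down to counting adjacent word pairs over the training titles and then merging the most frequent pairs with the Food-101 categories into one label set. The sketch below illustrates this under assumptions: the tokenization, the toy titles, the Food-101 strings, and the final dictionary format are illustrative, not the repository's actual code; only the classes1M.pkl name comes from the step above.

```python
import pickle
from collections import Counter

def title_bigrams(titles):
    """Count adjacent word pairs across all recipe titles."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        counts.update(zip(words, words[1:]))  # adjacent pairs within a title
    return counts

# Toy titles in place of the real training-set titles from layer1.json.
titles = ["grilled cheese sandwich", "grilled cheese deluxe", "apple pie"]
bigrams = title_bigrams(titles)

# Sorted by frequency, as bigrams.py --create does before saving to disk.
top = [" ".join(pair) for pair, _ in bigrams.most_common(2)]
print(top)  # ['grilled cheese', 'cheese sandwich']

# The --no_create step then merges Food-101 category names with the top
# bigrams into one label set; the real classes1M.pkl format may differ.
food101 = ["apple_pie", "baby_back_ribs"]  # assumed category strings
labels = [c.replace("_", " ") for c in food101] + top
classes = {label: idx for idx, label in enumerate(dict.fromkeys(labels))}
with open("classes1M.pkl", "wb") as f:
    pickle.dump(classes, f)
```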

Training/Evaluation

Each folder contains a run.sh script with the training parameters; run sh run.sh in a terminal to train and evaluate the model.
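
For reference, cross-modal retrieval on Recipe1M is conventionally scored with median rank (medR) and Recall@K between image and recipe embeddings. The sketch below shows how those metrics are computed from paired embedding matrices; it is an illustration of the standard metrics, not this repository's evaluation code, and the random embeddings are stand-ins.

```python
import numpy as np

def retrieval_metrics(img_emb, rec_emb, ks=(1, 5, 10)):
    """medR and Recall@K for image-to-recipe retrieval.

    img_emb, rec_emb: L2-normalized (N, D) arrays where row i of each
    matrix belongs to the same recipe.
    """
    sims = img_emb @ rec_emb.T                # cosine similarities
    order = np.argsort(-sims, axis=1)         # best-matching recipe first
    # 1-based rank of the true recipe for each image query.
    ranks = 1 + np.argmax(order == np.arange(len(sims))[:, None], axis=1)
    metrics = {"medR": float(np.median(ranks))}
    for k in ks:
        metrics[f"R@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Toy usage with random unit vectors.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(retrieval_metrics(emb, emb))  # identical embeddings -> medR 1.0, R@K 1.0
```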