Exploration of techniques to solve a multi-label classification problem
- Access subset of training data and loop through each row
- Find the largest portrait and landscape images, then resize all images to the same dimensions.
- Convert to greyscale, then normalise (pixel values between 0 and 1)
- One hot encode labels + check label distribution within training set
- Preprocess captions: lowercase, tokenise, remove stop words
- Visualise the data at the different stages
- Number of labels per image distribution
- Count caption dataset size (lemmatised vs stemmed vs both together)
- Explore new models proposed in the tutorial
- ViT, DeiT, YOLO, CLIP
- Or ResNet + LSTM?
- Build models with PyTorch and check whether the papers provide pretrained models (DeiT, ViT, 1.); maybe also check out other multi-label models. Motivation for using a pretrained model: some of our classes have very few examples, so training from scratch would likely give very poor performance.
- Use untrained architecture
- use pretrained architecture
- (optional) Map pretrained model classes to our classes and check accuracy; experiment with whether the pretrained model can do multi-label prediction
- pretrained language model
- Save model to file
- Add F1 score
- check which labels have the poorest prediction rate (perhaps add extra training examples for poor predictors)
- Hyperparameter tuning
- Ablation tests based on above baselines
- baseline with and without pretraining
- remove captions
- remove images
- Add improvements
- Piggyback off of pretrained models
- cnn layers
- preprocessing
- Low-rank adaptation (LoRA) + parameter-efficient fine-tuning (PEFT)
- Split images (e.g. into crops) to increase the size of the training set
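
For the label-encoding and label-distribution steps in the list above, a minimal sketch in plain Python might look like the following. Since an image can carry several labels, each image gets a multi-hot vector rather than a strict one-hot vector. The function names and the toy data are hypothetical and only for illustration; the real dataset format is not specified in these notes.

```python
from collections import Counter


def multi_hot(label_sets, num_classes):
    """Turn each image's set of class ids into a fixed-length 0/1 vector."""
    vectors = []
    for labels in label_sets:
        row = [0] * num_classes
        for class_id in labels:
            row[class_id] = 1
        vectors.append(row)
    return vectors


def label_distribution(label_sets):
    """Count how many images carry each label."""
    return Counter(class_id for labels in label_sets for class_id in labels)


def labels_per_image(label_sets):
    """Histogram mapping 'number of labels' to 'number of images'."""
    return Counter(len(labels) for labels in label_sets)


# Hypothetical toy data: 4 images, class ids 0..3 (assumed, not from the real set).
toy_labels = [{0, 2}, {1}, {0, 1, 3}, {0}]
encoded = multi_hot(toy_labels, num_classes=4)
distribution = label_distribution(toy_labels)
per_image = labels_per_image(toy_labels)
```

Classes with very low counts in `distribution` are the ones most likely to need the pretrained-model or extra-data treatment mentioned in the list.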
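
For the caption preprocessing step, a rough sketch using only the standard library is shown below. The stop-word set here is a tiny illustrative placeholder (an assumption); a real run would more likely use a full list from a library such as NLTK or spaCy, and would add the stemming or lemmatisation being compared.

```python
import re

# Illustrative stop-word set only; assumed for this sketch, not a complete list.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "is", "with"}


def preprocess_caption(caption, stop_words=STOP_WORDS):
    """Lowercase, tokenise on alphanumeric runs, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", caption.lower())
    return [token for token in tokens if token not in stop_words]


example_tokens = preprocess_caption("A man riding a wave on top of a Surfboard.")
```

Comparing the vocabulary produced with and without stemming or lemmatisation gives the caption size counts mentioned in the list.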
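
For the F1 and "poorest prediction rate" items, one possible sketch computes F1 per label from multi-hot targets and predictions, then ranks labels from worst to best. The helper names and toy arrays are hypothetical; in practice `sklearn.metrics.f1_score` with `average=None` should give the same per-label values.

```python
def per_label_f1(y_true, y_pred):
    """Return one F1 score per label column; 0.0 when a label has no tp/fp/fn."""
    num_labels = len(y_true[0])
    scores = []
    for j in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(scores) / len(scores)


def worst_labels(scores, k=1):
    """Indices of the k labels with the lowest F1."""
    return sorted(range(len(scores)), key=lambda j: scores[j])[:k]


# Hypothetical toy targets and thresholded predictions (4 images, 3 labels).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
scores = per_label_f1(y_true, y_pred)
weakest = worst_labels(scores, k=1)
```

Macro averaging is used here because it weights rare labels equally, which matches the concern about classes with very few examples.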
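
To see why low-rank adaptation is attractive, here is a small arithmetic sketch. LoRA freezes a weight matrix of shape d x k and trains two factors of shapes d x r and r x k instead, so the trainable parameter count drops from d*k to r*(d+k). The layer size (768 x 768) and rank (8) below are assumed example values, not taken from any model chosen in these notes.

```python
def full_params(d, k):
    """Trainable parameters when fine-tuning the full d x k weight matrix."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for a rank-r update B (d x r) times A (r x k)."""
    return r * (d + k)


# Assumed example: one 768 x 768 projection with rank 8.
d, k, r = 768, 768, 8
full = full_params(d, k)
lora = lora_params(d, k, r)
reduction = full / lora
```

With these assumed values the low-rank update trains 12,288 parameters instead of 589,824 for that one matrix, a 48x reduction.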