Zero-shot classification through image-to-text matching.
https://github.com/EthanZhu90/ZSL_PP_CVPR17
Dataset: Caltech-UCSD Birds-200-2011 bird images, with a corresponding Wikipedia article for each class.
Image --> ResNet50 --> visual embedding
Wikipedia Text --> Transformer --> text embedding
visual embedding, text embedding --> MLP --> match score for each text-image pair
The MLP is trained with standard cross-entropy over the output scores; the ResNet and Transformer models are pretrained.
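
The scoring and loss steps above can be sketched roughly as follows. This is a framework-free illustration, not the repository's code. It assumes the two embeddings are concatenated before the MLP, that the MLP has a single ReLU hidden layer, and that "standard cross entropy" means a softmax over one image's scores against every class article. The names `mlp_score`, `cross_entropy`, and `init_mlp`, the toy dimensions, and the random stand-in embeddings are all hypothetical; a real setup would take the embeddings from the pretrained ResNet50 and Transformer and would use a deep learning library to compute gradients.

```python
import math
import random


def init_mlp(in_dim, hidden_dim, rng):
    """Randomly initialise a one-hidden-layer MLP (assumed architecture)."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def mlp_score(visual, text, w1, b1, w2, b2):
    """Return a scalar match score for one (image, text) pair."""
    x = visual + text  # concatenate the two embeddings (assumption)
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)  # ReLU hidden unit
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def cross_entropy(scores, target):
    """Softmax cross-entropy of the target class, computed stably."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


rng = random.Random(0)
VIS_DIM, TXT_DIM, HIDDEN, NUM_CLASSES = 8, 6, 16, 5  # toy sizes, not the real ones

params = init_mlp(VIS_DIM + TXT_DIM, HIDDEN, rng)

# Random stand-ins for the outputs of the pretrained encoders.
class_text_embeddings = [
    [rng.gauss(0.0, 1.0) for _ in range(TXT_DIM)] for _ in range(NUM_CLASSES)
]
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(VIS_DIM)]
true_class = 2

# Score the image against every class article, then compute the loss.
scores = [mlp_score(image_embedding, t, *params) for t in class_text_embeddings]
loss = cross_entropy(scores, true_class)
predicted = max(range(NUM_CLASSES), key=lambda c: scores[c])

print("scores:", [round(s, 4) for s in scores])
print("loss:", round(loss, 4), "predicted class:", predicted)
```

Under these assumptions, classifying an image amounts to scoring it against each candidate class article and taking the highest score, so the same routine can be pointed at articles for classes that had no training images.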