ru-clip is a multimodal model for computing image-text similarities and for ranking captions against images (and images against captions).
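For intuition, here is a minimal sketch of the ranking step, assuming the encoders have already produced L2-normalized image and text embeddings. The function below is illustrative only and is not part of this repository's API:

```python
import torch

def rank_captions(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """For each image, return caption indices sorted from best to worst match.

    Assumes L2-normalized embeddings of shape (n_images, dim) and
    (n_texts, dim); the encoders that produce them are hypothetical here.
    """
    # On normalized vectors, cosine similarity reduces to a dot product.
    similarity = image_emb @ text_emb.T          # (n_images, n_texts)
    # Softmax over captions gives per-image match probabilities.
    probs = similarity.softmax(dim=-1)
    return probs.argsort(dim=-1, descending=True)
```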
ru-clip (Russian Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade, but until recently it was mostly studied in computer vision as a way of generalizing to unseen object categories.
We show that extending the pre-trained ru-gpts language models with a new modality, images, yields a system that is robust and generalizes to complex categories beyond the standard training samples.
Note! This is a prototype of a Russian version of OpenAI's CLIP, following this paper.
We use a ViT-B/32 image encoder and a RuGPT3Small text encoder.
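As a sketch of how such a dual-encoder is typically wired together (a simplification with assumed linear projection heads; `image_encoder`, `text_encoder`, and the dimensions are placeholders, not this repository's actual identifiers):

```python
import torch
import torch.nn.functional as F

class DualEncoder(torch.nn.Module):
    """CLIP-style wrapper: project both modalities into a shared space."""

    def __init__(self, image_encoder, text_encoder,
                 image_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a ViT-B/32 backbone
        self.text_encoder = text_encoder    # e.g. a RuGPT3Small backbone
        self.image_proj = torch.nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = torch.nn.Linear(text_dim, embed_dim, bias=False)
        # Learnable temperature, initialized to log(1/0.07) as in the CLIP paper.
        self.logit_scale = torch.nn.Parameter(torch.tensor(2.6593))

    def forward(self, images, texts):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)
        # Scaled cosine similarities between every image and every text.
        return self.logit_scale.exp() * img @ txt.T
```

Normalizing both embeddings and learning a single temperature keeps the similarity scale stable during contrastive training.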
🤗 See the HF model cards here.
Habr post coming soon