ru-clip is a multimodal model for computing image-text similarities and for ranking captions against images (and images against captions).
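For intuition, here is a minimal sketch of the ranking step, assuming the encoders have already produced L2-normalized image and text embeddings. The function below is illustrative only and is not part of this repository's API:

```python
import torch

def rank_captions(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """For each image, return caption indices sorted from best to worst match.

    Assumes L2-normalized embeddings of shape (n_images, dim) and
    (n_texts, dim); the encoders that produce them are hypothetical here.
    """
    # On normalized vectors, cosine similarity reduces to a dot product.
    similarity = image_emb @ text_emb.T          # (n_images, n_texts)
    # Softmax over captions gives per-image match probabilities.
    probs = similarity.softmax(dim=-1)
    return probs.argsort(dim=-1, descending=True)
```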
ru-clip (Russian Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade, but until recently it was mostly studied in computer vision as a way of generalizing to unseen object categories.
We show that extending the pre-trained ru-gpts language models with a new modality, images, yields a system that is robust and generalizes to complex categories beyond the standard training samples.
Note! This is a prototype of a Russian version of OpenAI's CLIP, following this paper.
We use a ViT-B/32 image encoder and a RuGPT3Small text encoder.
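As a sketch of how such a dual-encoder is typically wired together (a simplification with assumed linear projection heads; `image_encoder`, `text_encoder`, and the dimensions are placeholders, not this repository's actual identifiers):

```python
import torch
import torch.nn.functional as F

class DualEncoder(torch.nn.Module):
    """CLIP-style wrapper: project both modalities into a shared space."""

    def __init__(self, image_encoder, text_encoder,
                 image_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a ViT-B/32 backbone
        self.text_encoder = text_encoder    # e.g. a RuGPT3Small backbone
        self.image_proj = torch.nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = torch.nn.Linear(text_dim, embed_dim, bias=False)
        # Learnable temperature, initialized to log(1/0.07) as in the CLIP paper.
        self.logit_scale = torch.nn.Parameter(torch.tensor(2.6593))

    def forward(self, images, texts):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)
        # Scaled cosine similarities between every image and every text.
        return self.logit_scale.exp() * img @ txt.T
```

Normalizing both embeddings and learning a single temperature keeps the similarity scale stable during contrastive training.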
🤗 See the HF model cards here.
Habr post coming soon