Training data

Question

christophschuhmann opened this issue 3 years ago · 2 comments

christophschuhmann commented 3 years ago

I would like to know what data ruCLIP was trained on.
We at LAION have around 6B as-yet-unreleased image-text pairs, filtered with CLIP and mCLIP. Many of them are in Russian. :)

If you'd like access, let me know.

Christoph Schuhmann
www.laion.ai

Answer 1 · 2022-01-24T15:53:51.000Z

@christophschuhmann Hello! Your LAION dataset is incredible. As a researcher, I would be interested in working with its Russian-language data.

ruCLIP was trained on datasets from open sources, datasets of the Sberbank ecosystem, and sample datasets translated using neural networks. We collected about 240M pairs, of which only 100M are in "native" Russian. The data turned out to be quite noisy, but it definitely contains a useful signal for ruCLIP.

My colleague Andrey Kuznetsov sent you an e-mail at christoph_s@freenet.de. Could you discuss the terms and conditions for using your dataset with him? We would be very grateful for your help.

Answer 2 · 2022-01-24T16:43:06.000Z

Nice to hear from you. I have not yet received an email at christoph_s@freenet.de.
Maybe it got caught in a spam filter. Could he send it again to christoph.schuhmann@laion.ai?

Waiting to hear from you :)