piskvorky/gensim-data

Russian word embedding models from RusVectores project

akutuzov opened this issue · 10 comments

Name: word2vec-ruscorpora-300
Link: http://rusvectores.org/static/models/ruscorpora_1_300_10.bin.gz
Description: Word2vec Continuous Skipgram vectors trained on the full Russian National Corpus (about 250M words). The model contains 185K words.
Related papers: https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
Preprocessing: The corpus was lemmatized and tagged with Universal PoS.
Parameters: vector size 300, window size 10
Code example:

import gensim

# Load the binary word2vec file downloaded from the link above
model = gensim.models.KeyedVectors.load_word2vec_format('ruscorpora_1_300_10.bin.gz', binary=True)
for word, similarity in model.most_similar(positive=['пожар_NOUN']):
    print(word, similarity)
пожарище_NOUN 0.618148565292
возгорание_NOUN 0.592390716076
сгорать_VERB 0.589370012283
наводнение_NOUN 0.575950324535
тушение_NOUN 0.572953224182
пожарный_NOUN 0.562128543854
поджог_NOUN 0.561940491199
сгорать::дотла_VERB 0.547737360001
поджигать_VERB 0.534844279289
незатушить_VERB 0.534272968769

OK, let's try :-)
By the way, what is the procedure for updating the resources? RusVectores rolls out new models from time to time.

@akutuzov Existing models are never updated; we only add new ones. That is the best scheme for preserving backward compatibility :)

Thanks for the detailed info. Just one thing: as I remember, RusVectores used mystem for the _POS suffixes. Can you add a function for converting word -> word_POS to the first message?

Well, it can be any tagger that supports Russian and Universal PoS tags. Do we really need to clutter the issue with the preprocessing details?

@akutuzov This would be very desirable, because the process is not obvious (as things stand, the model cannot be used without this pre-processing).

Your code example will be linked with this model and simplify life for users :)

OK. It will look somewhat like this with UDPipe. Models for various languages can be downloaded here.

def tag(word='пожар', modelfile='russian-syntagrus-ud-2.0-170801.udpipe'):
    from ufal.udpipe import Model, Pipeline
    model = Model.load(modelfile)
    # Tokenize, then tag and parse with the model defaults; emit CoNLL-U
    pipeline = Pipeline(model, 'tokenize', Pipeline.DEFAULT, Pipeline.DEFAULT, 'conllu')
    processed = pipeline.process(word)
    # Drop the CoNLL-U comment lines, which start with '#'
    output = [l for l in processed.split('\n') if not l.startswith('#')]
    # Tab-separated fields 2 and 3 (zero-based) are LEMMA and UPOS
    tagged = ['_'.join(w.split('\t')[2:4]) for w in output if w]
    return tagged

This produces Universal PoS tags straight away.
Another option is to use pymystem:

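def tag(word='пожар'):
    from pymystem3 import Mystem
    m = Mystem()
    # Take the first token returned by the analyzer
    processed = m.analyze(word)[0]
    # Use the first analysis of that token for both lemma and grammar
    lemma = processed["analysis"][0]["lex"].lower().strip()
    # The PoS tag is the first item of "gr", before any ',' or '='
    pos = processed["analysis"][0]["gr"].split(',')[0]
    pos = pos.split('=')[0].strip()
    tagged = lemma+'_'+pos
    return tagged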

With Mystem output, one will have to convert RNC tags to UPOS, using this conversion table.
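For illustration, the conversion step could look roughly like the sketch below. The mapping is partial and is an assumption covering only a few common Mystem tags. The names `MYSTEM_TO_UPOS` and `to_upos` are hypothetical, and the fallback to `X` for unlisted tags is a choice made here, not something taken from the conversion table, which remains the authoritative source.

```python
# Partial, illustrative mapping from Mystem (RNC) PoS tags to Universal PoS tags.
# Assumption: only a handful of common tags are listed here; consult the
# conversion table for the complete set.
MYSTEM_TO_UPOS = {
    'S': 'NOUN',
    'V': 'VERB',
    'A': 'ADJ',
    'ADV': 'ADV',
    'NUM': 'NUM',
    'PR': 'ADP',
    'PART': 'PART',
    'INTJ': 'INTJ',
}


def to_upos(tagged, mapping=MYSTEM_TO_UPOS, unknown='X'):
    """Convert 'lemma_MYSTEMTAG' (output of the Mystem-based tag()) to 'lemma_UPOS'."""
    lemma, sep, pos = tagged.rpartition('_')
    if not sep:
        # No '_' separator found: return the input unchanged
        return tagged
    # Tags missing from the mapping fall back to `unknown` (an assumption)
    return lemma + '_' + mapping.get(pos, unknown)
```

With this, `to_upos(tag('пожар'))` would produce a key in the `lemma_UPOS` form that the model expects, provided the Mystem tag is one of those listed.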

Thanks @akutuzov, and sorry for the wait. This repo is now released, and the ruscorpora vectors are available through our API in gensim>=3.2.0:

import gensim.downloader as api

model = api.load("word2vec-ruscorpora-300")
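Since the model's vocabulary uses `lemma_POS` keys, a small wrapper can build the key and guard against out-of-vocabulary words. This is only a sketch: `neighbours` is a hypothetical helper name, and it assumes the loaded object behaves like gensim's `KeyedVectors` (supporting `in` and `most_similar`).

```python
def neighbours(model, lemma, pos='NOUN', topn=10):
    """Return (word, similarity) pairs for a lemma, or [] if the key is not in the model."""
    # Keys in this model look like 'пожар_NOUN'
    key = '{}_{}'.format(lemma, pos)
    # KeyedVectors supports the `in` operator for vocabulary membership
    if key not in model:
        return []
    return model.most_similar(positive=[key], topn=topn)


# Usage (the model is downloaded on the first call to api.load):
# import gensim.downloader as api
# model = api.load("word2vec-ruscorpora-300")
# for word, similarity in neighbours(model, 'пожар'):
#     print(word, similarity)
```

Returning an empty list instead of raising `KeyError` is a design choice for this sketch; callers who prefer to see the error can call `most_similar` directly.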

Thanks @menshikh-iv! One small fix: in the table, I see "License not found" for this model. However, we do have a license, it is Creative Commons Attribution 4.0 International :-).
We are now updating our models and will come up with more of them before the end of the month, I think.

@akutuzov I updated the license in fa71854 :)

Sorry, I can't download the file. Could you fix the download link above?