Russian word embedding models from RusVectores project
akutuzov opened this issue · 10 comments
Name: word2vec-ruscorpora-300
Link: http://rusvectores.org/static/models/ruscorpora_1_300_10.bin.gz
Description: Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words.
Related papers: https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
Preprocessing: The corpus was lemmatized and tagged with Universal PoS.
Parameters: vector size 300, window size 10
Code example:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('ruscorpora_1_300_10.bin.gz', binary=True)
for n in model.most_similar(positive=[u'пожар_NOUN']):
    print(n[0], n[1])
пожарище_NOUN 0.618148565292
возгорание_NOUN 0.592390716076
сгорать_VERB 0.589370012283
наводнение_NOUN 0.575950324535
тушение_NOUN 0.572953224182
пожарный_NOUN 0.562128543854
поджог_NOUN 0.561940491199
сгорать::дотла_VERB 0.547737360001
поджигать_VERB 0.534844279289
незатушить_VERB 0.534272968769
OK, let's try :-)
By the way, what is the procedure for updating the resources? RusVectores rolls out new models from time to time.
@akutuzov Existing entries are not updated; a new version is simply added as a new model. That is the best scheme for keeping backward compatibility :)
Thanks for the detailed info, only one thing: as I remember, RusVectores used mystem for _POS. Can you add a function for converting word -> word_POS to the first message?
Well, it can be any tagger supporting Russian and Universal Tags, do we really need to clutter the issue with the preprocessing details?
@akutuzov This would be very desirable because this is not an obvious process (it is impossible to apply this model without pre-processing in the current case).
Your code example will be linked with this model and simplify life for users :)
OK. It will look somewhat like this with UDPipe. Models for various languages can be downloaded here.
def tag(word='пожар', modelfile='russian-syntagrus-ud-2.0-170801.udpipe'):
    from ufal.udpipe import Model, Pipeline
    model = Model.load(modelfile)
    # Tokenize the input, run the default tagger and parser, output CoNLL-U
    pipeline = Pipeline(model, 'tokenize', Pipeline.DEFAULT, Pipeline.DEFAULT, 'conllu')
    processed = pipeline.process(word)
    # Drop the CoNLL-U comment lines
    output = [l for l in processed.split('\n') if not l.startswith('#')]
    # Columns 2 and 3 hold the lemma and the Universal PoS tag
    tagged = ['_'.join(w.split('\t')[2:4]) for w in output if w]
    return tagged
This produces Universal PoS tags straight away.
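For reference, the slicing in the function above relies on the CoNLL-U column order (ID, FORM, LEMMA, UPOS, ...), so fields 2 and 3 are the lemma and the Universal PoS tag. Here is a minimal sketch of just that step, which runs without a UDPipe model file. The helper name `conllu_to_keys` is made up for illustration, and the sample line is hand-written, not actual tagger output:

```python
def conllu_to_keys(conllu_text):
    """Turn CoNLL-U text into a list of lemma_UPOS strings."""
    keys = []
    for line in conllu_text.split('\n'):
        # Skip blank lines and sentence-level comment lines.
        if not line or line.startswith('#'):
            continue
        fields = line.split('\t')
        # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
        keys.append(fields[2] + '_' + fields[3])
    return keys


# Hand-written sample for illustration only (not real UDPipe output).
sample = "# text = пожары\n1\tпожары\tпожар\tNOUN\t_\t_\t0\troot\t_\t_\n"
print(conllu_to_keys(sample))
```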
Another option is to use pymystem:
def tag(word='пожар'):
    from pymystem3 import Mystem
    m = Mystem()
    # Take the analysis of the first token only
    processed = m.analyze(word)[0]
    lemma = processed["analysis"][0]["lex"].lower().strip()
    # The PoS tag is the first item of the grammar string, before any ',' or '='
    pos = processed["analysis"][0]["gr"].split(',')[0]
    pos = pos.split('=')[0].strip()
    tagged = lemma+'_'+pos
    return tagged
With Mystem output, one will have to convert RNC tags to UPOS, using this conversion table.
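The conversion step itself could look roughly like the sketch below. It takes the `lemma_POS` string returned by the pymystem function above and swaps the tag. The dictionary is only a small illustrative subset of Mystem tags and is not the conversion table itself. The names `MYSTEM_TO_UPOS` and `to_upos_key` and the fallback to `X` for unlisted tags are my assumptions for the example:

```python
# Illustrative subset only; use the full conversion table in practice.
MYSTEM_TO_UPOS = {
    'S': 'NOUN',
    'V': 'VERB',
    'A': 'ADJ',
    'ADV': 'ADV',
    'PR': 'ADP',
}


def to_upos_key(tagged, mapping=MYSTEM_TO_UPOS, unknown='X'):
    """Convert 'lemma_MYSTEMTAG' into 'lemma_UPOS'."""
    # Split on the last underscore so lemmas containing '_' stay intact.
    lemma, sep, pos = tagged.rpartition('_')
    if not sep:
        raise ValueError('expected lemma_POS, got: ' + tagged)
    # Assumption: tags missing from the mapping fall back to `unknown`.
    return lemma + '_' + mapping.get(pos, unknown)


print(to_upos_key('пожар_S'))
```

Splitting on the last underscore is a deliberate choice here, since the tag is always the final component of the key.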
Thanks @akutuzov, and sorry for the wait. This repo is now released, and the ruscorpora vectors are available through our API in gensim>=3.2.0:
import gensim.downloader as api
model = api.load("word2vec-ruscorpora-300")
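Since the model's keys are `lemma_UPOS` strings, a query for a bare word or an unseen lemma will not be found. Here is a minimal sketch of a guarded lookup, assuming `model` is the `KeyedVectors` object loaded above. The helper name `similar_or_empty` is hypothetical, and returning an empty list for missing keys is just one possible choice:

```python
def similar_or_empty(model, lemma, upos, topn=10):
    """Query a lemma_UPOS key; return [] if the key is not in the model."""
    key = lemma + '_' + upos
    # KeyedVectors supports the `in` operator for vocabulary checks.
    if key not in model:
        return []
    return model.most_similar(positive=[key], topn=topn)
```

With the loaded model this would be called as `similar_or_empty(model, 'пожар', 'NOUN')`.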
Thanks @menshikh-iv! One small fix: in the table, I see "License not found" for this model. However, we do have a license, it is Creative Commons Attribution 4.0 International :-).
We are now updating our models and will come up with more of them before the end of the month, I think.
Sorry, I can't download the file. Could you fix the download link above?