glample/tagger

Pre-trained word embeddings for German/Spanish/Dutch/French

Closed this issue · 11 comments

Hi @glample,

Do you use any pre-trained embeddings for languages other than English? If so, where can I download them?

Thanks,
Dung Thai

Hi @dungtn @SawyerW, I want to use the publicly available word vectors trained on Google News as pre-trained word embeddings, available at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

It's a .gz file, but I don't have any idea how to use those word embeddings with my script:
python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method=adam --tag_scheme=iob

Can you please guide me? I am stuck...

Since your pre-trained word embeddings come from word2vec, you can use "gensim.models.KeyedVectors.load_word2vec_format".
You can find what you need from here: https://radimrehurek.com/gensim/models/word2vec.html

@Rabia-Noureen

I wrote a (very) simple script a little while ago to achieve exactly what you are looking to do. It is based directly on the documentation @SawyerW links to.

You can grab the script here. Just make sure you have gensim installed by running either:

easy_install -U gensim or pip install --upgrade gensim

And make sure to first unzip the .gz file.

To use it, just open the script and set binary_w2v_file_path to the location of your unzipped .bin file. By default the script saves the converted embeddings in the directory you called it from (although you can also change this in the script). Then run the script from the terminal: python convert_w2v_bin_to_glove_format.py

(I confirmed the script works with the word vectors you linked to)
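For reference, the conversion such a script performs boils down to loading the binary word2vec file with gensim and writing one "word v1 v2 ..." line per vocabulary entry (the plain-text, GloVe-style format). A minimal sketch of the writing step, assuming gensim is installed; the function name and file names here are illustrative, not the actual script's:

```python
def write_glove_format(word_vectors, out_path):
    """Write (word, vector) pairs as 'word v1 v2 ...' lines,
    i.e. plain-text embeddings with no header row."""
    with open(out_path, "w", encoding="utf-8") as f:
        for word, vec in word_vectors:
            f.write(word + " " + " ".join("%f" % v for v in vec) + "\n")

# With gensim installed, the pairs would come from the binary file, e.g.:
#   import gensim
#   model = gensim.models.KeyedVectors.load_word2vec_format(
#       "GoogleNews-vectors-negative300.bin", binary=True)
#   write_glove_format(((w, model[w]) for w in model.vocab),
#                      "w2v_glove_format.txt")
```

The only real difference from the binary format is dropping the "vocab_size dim" header line and storing the numbers as text.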

@JohnGiorgi @SawyerW Thanks a lot for your quick responses, I appreciate your suggestions. I will try the script and get back to you if I need any further help.

Thanks again

Hi @JohnGiorgi, I tried to run the script but I got this error:

[screenshot of the error]

I tried this to change the permissions:

[screenshot of the attempted permission change]

but no use... Can you please suggest some other way of handling the permissions?

@Rabia-Noureen, I have no idea why you are trying to convert the word2vec embeddings. Why don't you just use them directly?

import gensim

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(
    '/path/GoogleNews-vectors-negative300.bin.gz', binary=True)
# Replace the stored vectors with their L2-normalized versions (saves memory)
word2vec_model.init_sims(replace=True)

Then you can find all the embeddings via word2vec_model.vocab.

@SawyerW Actually I have just started with Python and I was trying to train the model. I don't have any idea where I should put this code in the NER tagger:

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(
'/path/GoogleNews-vectors-negative300.bin.gz', binary=True)
word2vec_model.init_sims(replace=True)

I would highly appreciate it if you could guide me...

Maybe it would be a good idea to post your code and show us what you want to do.

@SawyerW I am actually trying to use this project (the NER tagger) as is. To train the model I am using the IOB tag scheme and Adam as the learning method:

python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method=adam --tag_scheme=iob

Now I need to use the pre-trained word embeddings from GoogleNews-vectors-negative300.bin.gz for training.
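Once the embeddings are in plain-text format (one word plus its vector per line), the tagger's train.py should be able to pick them up via its pre-trained-embeddings option. A sketch of the invocation, assuming a --pre_emb flag and a converted file named w2v_glove_format.txt (both worth confirming against python train.py --help on your checkout; the Google News vectors are 300-dimensional):

```shell
python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb \
    --lr_method=adam --tag_scheme=iob \
    --pre_emb w2v_glove_format.txt --word_dim 300
```

Note that the binary .bin.gz file cannot be passed directly; it needs to be converted to text first, which is what the conversion discussion above is about.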

You can find all pretrained embeddings used in the paper here: #44