glample/tagger

Script for training embeddings

sa-j opened this issue · 9 comments

sa-j commented

Hi there,

Thanks for uploading the NER Tagger! I'm trying to build on the performance of your model for German. You already provided the pre-trained embeddings in issue #44 , however, I want to extend your corpus with some more text. Is it possible for you to upload the script with which the embeddings were produced?

Thank you very much!

pvcastro commented

Sorry, I'm only working with the Portuguese language, so I can't help you with scripts for German!

sa-j commented

Ok!

I'm actually looking for the original script with which the embeddings were trained on the Leipzig corpora collection & the German monolingual training data from the 2010 Machine Translation workshop (as described in the paper).

glample commented

Hi,

We trained our embeddings using the wang2vec model; you can find it here:
https://github.com/wlin12/wang2vec
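
If it helps, getting it built is just a clone and make; a minimal sketch (the repo ships a Makefile, and the compiled training binary is named word2vec):

# build wang2vec from source; produces the ./word2vec binary used below
git clone https://github.com/wlin12/wang2vec.git
cd wang2vec
make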

sa-j commented

Thank you! And do you have the preprocessing script you used to produce the texts for wang2vec? I want to reproduce the GER64 embeddings (and therefore the results) for the NER tagger exactly.

glample commented

Sorry, I don't remember the preprocessing details :/
But I think we only used the Moses tokenizer: https://github.com/moses-smt/mosesdecoder/
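
For reference, a rough sketch of what that tokenization step would look like (tokenizer.perl lives under scripts/tokenizer/ in the mosesdecoder repo; the corpus file names here are just placeholders):

# tokenize raw German text with the Moses tokenizer, -l de selects German rules
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < corpus.de.raw > corpus.de.tok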

sa-j commented

Ok, thank you! And what about the parameter settings for wang2vec, including the window size (which should presumably be different for German than for, say, English)?

./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0

Do you have them?

glample commented

Parameters can be the same for all languages. You should use -type 3 (the structured skip-ngram model). Also, -size 50 is the dimension of your embeddings, so you probably want more than that: GER64 uses 64, but higher might work even better.
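
So, a hedged version of your command with those two changes, leaving every other flag at the values you quoted (input_file and embedding_file are placeholders):

# -type 3 = structured skip-ngram, -size 64 matches the GER64 dimension
./word2vec -train input_file -output embedding_file -type 3 -size 64 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0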

sa-j commented

Which versions of the Leipzig corpora collection have you used? Excluding "web", there are four text sources ("wiki", "news", "newscrawl", "mixed"), each consisting of 30k to 1M sentences. Did you by chance use only the 1M variants from the most recent release and merge all four files?

glample commented

Sorry, I don't remember. I would just use everything.
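
If it helps, "using everything" just means concatenating the per-source sentence files into one corpus before tokenization. A rough sketch, assuming the usual Leipzig naming scheme and their index<TAB>sentence line format (both are assumptions, so check your downloads):

# merge all four 1M sources into one corpus, stripping the leading sentence index
cat deu_*_2010_1M-sentences.txt | cut -f2- > leipzig_de_all.txt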