glample/tagger

Script for training embeddings

sa-j opened this issue · 9 comments

sa-j commented

Hi there,

Thanks for uploading the NER Tagger! I'm trying to build on the performance of your model for German. You already provided the pre-trained embeddings in issue #44 , however, I want to extend your corpus with some more text. Is it possible for you to upload the script with which the embeddings were produced?

Thank you very much!

pvcastro commented

Sorry, I'm only working with the Portuguese language, so I can't help you with scripts for German!

sa-j commented

Ok!

I'm actually looking for the original script with which the embeddings were trained on the Leipzig corpora collection & the German monolingual training data from the 2010 Machine Translation workshop (as described in the paper).

glample commented

Hi,

We trained our embeddings using the wang2vec model; you can find it here:
https://github.com/wlin12/wang2vec
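
If it helps, getting it built is just a clone and make; a minimal sketch (the repo ships a Makefile, and the compiled training binary is named word2vec):

# build wang2vec from source; produces the ./word2vec binary used below
git clone https://github.com/wlin12/wang2vec.git
cd wang2vec
make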

sa-j commented

Thank you! And do you have the preprocessing script you used to produce the texts for wang2vec? I want to reproduce the GER64 embeddings (and therefore the results) for the NER tagger exactly.

glample commented

Sorry, I don't remember the preprocessing details :/
But I think we only used the Moses tokenizer: https://github.com/moses-smt/mosesdecoder/
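
For reference, a rough sketch of what that tokenization step would look like (tokenizer.perl lives under scripts/tokenizer/ in the mosesdecoder repo; the corpus file names here are just placeholders):

# tokenize raw German text with the Moses tokenizer, -l de selects German rules
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < corpus.de.raw > corpus.de.tok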

sa-j commented

Ok, thank you! And what about the parameter settings for wang2vec, including the window size (which should presumably be different for German than for, say, English)?

./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0

Do you have them?

glample commented

Parameters can be the same for all languages. You should use -type 3 (the structured skip-ngram model). Also, -size 50 is the dimension of your embeddings, so you probably want more than that: GER64 uses 64, but higher might work even better.
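
So, a hedged version of your command with those two changes, leaving every other flag at the values you quoted (input_file and embedding_file are placeholders):

# -type 3 = structured skip-ngram, -size 64 matches the GER64 dimension
./word2vec -train input_file -output embedding_file -type 3 -size 64 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0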

sa-j commented

Which versions of the Leipzig corpora collection have you used? Excluding "web", there are four text sources ("wiki", "news", "newscrawl", "mixed"), each consisting of 30k to 1M sentences. Did you by chance use only the 1M variants from the most recent release and merge all four files?

glample commented

Sorry, I don't remember. I would just use everything.
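
If it helps, "using everything" just means concatenating the per-source sentence files into one corpus before tokenization. A rough sketch, assuming the usual Leipzig naming scheme and their index<TAB>sentence line format (both are assumptions, so check your downloads):

# merge all four 1M sources into one corpus, stripping the leading sentence index
cat deu_*_2010_1M-sentences.txt | cut -f2- > leipzig_de_all.txt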