Other Languages
pepeballesterostel opened this issue · 4 comments
I would like to know whether this module can adapt to other languages, such as Spanish, to obtain metrics generated from a fine-tuned GPT-2.
Thanks!
The metrics worked fine for Spanish text, but I want to use Spanish models for a better evaluation.
For instance, to use a Spanish version of GloVe, should I change this line of code (line 93 in nlg-eval/bin/nlg-eval):
url='https://raw.githubusercontent.com/manasRK/glovegensim/42ce46f00e83d3afa028fb6bf17ed3c90ca65fcc/glove2word2vec.py',
And change it to an alternative Spanish model URL? Should I also change lines 102 and 105 of the same script?
Thank you.
The glove2word2vec.py file converts the GloVe embeddings into word2vec format, so that should still be usable. You would need to change the GloVe files from English to Spanish, so you are right about changing lines 102 and 105:
Lines 102 to 107 in 7f79930
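The conversion step itself is simple: GloVe's plain-text format only lacks the word2vec header line (vocabulary size and vector dimensionality), so swapping in Spanish GloVe vectors does not change it. A minimal sketch of what the conversion does, assuming a plain-text GloVe file (the function name and paths here are illustrative, not nlg-eval's actual API):

```python
def glove_to_word2vec(glove_path, out_path):
    """Prepend the word2vec text-format header to a GloVe text file."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    num_vectors = len(lines)
    # Each line is "<word> <v1> <v2> ...", so dimensionality is tokens - 1.
    dim = len(lines[0].split()) - 1
    with open(out_path, "w", encoding="utf-8") as f:
        # word2vec text format starts with "<vocab_size> <dim>".
        f.write(f"{num_vectors} {dim}\n")
        f.writelines(lines)
```

In practice you could also call gensim's `glove2word2vec` helper, which does the same thing; the point is that the step is language-agnostic.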
Another thing is that we use NLTK's word_tokenize tokenizer:
nlg-eval/nlgeval/word2vec/evaluate.py
Line 47 in 7f79930
nlg-eval/nlgeval/word2vec/evaluate.py
Line 63 in 7f79930
To a limited extent it will work for Spanish, but it would be better to also replace it with a Spanish tokenizer. If you plan to publish this work somewhere, it would also be good to tell your readers which tokenizer you ended up using, for comparability.
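As an illustration of the kind of drop-in replacement meant here, a minimal regex tokenizer that keeps accented Spanish letters (á, ñ, ü, ...) inside word tokens could look like the sketch below. This is only a stand-in; a proper Spanish tokenizer (e.g. spaCy's Spanish pipeline) would handle punctuation and clitics far more carefully, and the function name is hypothetical:

```python
import re

def spanish_tokenize(text):
    # Word runs of Unicode letters (covers accented characters),
    # digit runs, or single punctuation marks (including ¿ and ¡).
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text, flags=re.UNICODE)
```

Any function with this signature (string in, list of tokens out) could replace the `word_tokenize` calls in `nlgeval/word2vec/evaluate.py`.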
If you plan to use the other, non-embedding metrics from this repo, a Spanish tokenizer may be more appropriate for those as well.
Closing due to inactivity; the question has been mostly answered.