sdadas/polish-nlp-resources

mógłbym and moglbym

NicolasMathieu opened this issue · 6 comments

When I call word2vec.similar_by_word("moglbym"), it works fine.
When I call word2vec.similar_by_word("mógłbym"), I get this message from keyedvectors.py:

"word 'mógłbym' not in vocabulary"

How do you handle accents and non-English letters?

Accents and non-English letters are preserved in the vocabulary. However, all of the embeddings in this repository (except ELMo) have been trained on a lemmatized corpus, so they only contain base word forms. To use the embeddings in practice, you need to lemmatize your input first.

In the case of "mógłbym", its base form present in the vocabulary is "móc".
In the case of "moglbym", it doesn't have a base form since it's not a proper Polish word. Such words occur frequently in the training corpus, so they were included in the vocabulary in their original form.
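
A minimal sketch of that lookup, not the code used to build these embeddings. It assumes `vectors` behaves like a gensim KeyedVectors object (supports `in` and `similar_by_word`) and that `lemmatize` is a hypothetical callable returning the base form of a word; both names are placeholders for whatever you actually use.

```python
def similar_by_lemma(vectors, lemmatize, word, topn=10):
    """Lemmatize `word`, then query the embeddings with its base form.

    `vectors`: assumed to behave like gensim's KeyedVectors.
    `lemmatize`: hypothetical callable, e.g. "mógłbym" -> "móc".
    """
    lemma = lemmatize(word)
    if lemma in vectors:
        return vectors.similar_by_word(lemma, topn=topn)
    # Fall back to the surface form: frequent non-standard spellings
    # such as "moglbym" are stored in the vocabulary as-is.
    if word in vectors:
        return vectors.similar_by_word(word, topn=topn)
    # Neither the lemma nor the original form is known.
    return []
```

Returning an empty list instead of raising is a design choice of this sketch; raising KeyError, as gensim does, would work just as well.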

Thank you very much. What system are you using to lemmatize the inputs? I can download "Morpheus" and write my own system, but is there a ready-to-use one?

Additional question: if a word has two different possible lemmas, which one do you choose, and how?

For instance, the lemmatization of "Ala ma kota" gives:

0 : 1 : ('Ala', 'Ala', 'subst:sg:nom:f', ['imię'], [])
0 : 1 : ('Ala', 'Al', 'subst:sg:gen.acc:m1', ['imię'], [])
0 : 1 : ('Ala', 'Alo', 'subst:sg:gen.acc:m1', ['imię'], [])
1 : 2 : ('ma', 'mój:a', 'adj:sg:nom.voc:f:pos', [], [])
1 : 2 : ('ma', 'mieć', 'fin:sg:ter:imperf', [], [])
2 : 3 : ('kota', 'kota', 'subst:sg:nom:f', ['nazwa_pospolita'], [])
2 : 3 : ('kota', 'kot:s1', 'subst:sg:gen.acc:m2', ['nazwa_pospolita'], [])
2 : 3 : ('kota', 'kot:s2', 'subst:sg:gen.acc:m1', ['nazwa_pospolita'], ['pot.', 'środ.'])

How do you choose between the different possible lemmas?

mógłbym ==>
0 : 1 : ('mógł', 'móc', 'praet:sg:m1.m2.m3:imperf:nagl', [], [])
1 : 2 : ('by', 'by:q', 'part', [], [])
2 : 3 : ('m', 'być', 'aglt:sg:pri:imperf:nwok', [], [])

Choosing the right lemma in context is an open research problem. Most of the time I use LanguageTool, since it is relatively simple and provides lemmatizers for many languages, including Polish. Recent state-of-the-art tools use deep learning for lemmatization; for example, KRNNT is a really good Polish lemmatizer, reaching ~98% accuracy.
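
For illustration only, here is a naive context-free baseline over analyses shaped like the output quoted above. It is not what LanguageTool or KRNNT do. The function names, the `(start, end, interpretation)` tuple layout, and the `vocabulary` set are assumptions of this sketch. For each segment it keeps the first candidate lemma found in the embedding vocabulary, after dropping qualifiers such as the ":s1" in "kot:s1".

```python
def strip_qualifier(lemma):
    """Drop a ':...' qualifier from a lemma, e.g. 'kot:s1' -> 'kot'."""
    base = lemma.split(":", 1)[0]
    # Keep the original string if stripping would leave nothing.
    return base or lemma


def pick_lemmas(analyses, vocabulary):
    """Pick one lemma per segment from ambiguous analyses.

    `analyses`: iterable of (start, end, interpretation) tuples, where
    interpretation[1] is the candidate lemma (layout assumed from the
    output quoted in this thread).
    `vocabulary`: set of words known to the embeddings.
    """
    chosen = {}
    for start, _end, interp in analyses:
        lemma = strip_qualifier(interp[1])
        # First in-vocabulary candidate wins; context is ignored.
        if start not in chosen and lemma in vocabulary:
            chosen[start] = lemma
    # Segments with no in-vocabulary candidate are skipped.
    return [chosen[start] for start in sorted(chosen)]
```

Because it ignores context, this baseline picks "mój" over "mieć" for "ma" whenever both are in the vocabulary. A contextual tagger is the better choice when accuracy matters.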