morfologik/morfologik-stemming

Get all word varieties by word base

gaffkins opened this issue · 5 comments

Can I get all word varieties by word base?

For Polish, yes. This is the main purpose of the morfologik-polish subproject.

With what method? lookup returns only the base word. I need all varieties for a given base word. For example, I enter pies and what I expect is psy, psu, psem, psie...

The short answer is: the same method, but a different dictionary.
https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-stemming/src/test/java/morfologik/stemming/DictionaryLookupTest.java#L164-L176

Morfologik doesn't ship with a dictionary for synthesis -- you'll have to invert the tagging dictionary or get the polish_synth dictionary from LanguageTool. See polish.README.Polish.txt
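To make "inverting the tagging dictionary" concrete, here is a minimal sketch in plain Java. It assumes the tagging data is already available as rows of {inflected form, base form, tag}; that row layout, the class name, and the method name are all hypothetical and not part of Morfologik. The "base|tag" key shape mirrors the lookup strings used with the polish_synth dictionary in this thread. The TAG_A and TAG_B values are placeholders, not real tagset entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names belong to Morfologik.
final class InvertedDictionarySketch {

    /**
     * Turns tagging rows {inflected, base, tag} into a synthesis map
     * keyed by "base|tag", with the inflected forms as values.
     */
    static Map<String, List<String>> invert(List<String[]> taggingRows) {
        Map<String, List<String>> synthesis = new TreeMap<>();
        for (String[] row : taggingRows) {
            String inflected = row[0];
            String base = row[1];
            String tag = row[2];
            synthesis.computeIfAbsent(base + "|" + tag, k -> new ArrayList<>())
                     .add(inflected);
        }
        return synthesis;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"niemiecku", "niemiecki", "adjp"},
            new String[] {"niemiecko", "niemiecki", "adja"},
            new String[] {"psu", "pies", "TAG_A"},   // placeholder tag
            new String[] {"psem", "pies", "TAG_B"}); // placeholder tag
        System.out.println(invert(rows));
    }
}
```

A real synthesis dictionary would then be compiled from such inverted entries with Morfologik's own tools rather than kept in a map; the sketch only shows the direction of the transformation.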

Hey, this question is also relevant for me. At the moment I'm using the polish_synth dictionary from LanguageTool and the IStemmer.lookup method like this:
iStemmer.lookup("<word>|<tag>")
e.g.
iStemmer.lookup("niemiecki|adjp")
This returns "niemiecku"; if "adja" is passed as the tag, it returns "niemiecko", and so on. Is there a way to retrieve a list of all possible varieties with a single call to the lookup method?

You can look up a node corresponding to "niemiecki|" in the automaton and collect all the leaves starting from there. There are utilities to do this in a pretty simple way -- look at unit tests and grep the code, please.
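The idea of "find the node for the prefix, then collect all leaves under it" can be sketched with a small hand-written trie. This is not Morfologik's FSA API and not its dictionary encoding: the class, its methods, and the "base|tag|form" entry layout are assumptions made purely for illustration. In the real library you would use its own traversal utilities, as suggested above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an automaton; not part of Morfologik.
final class TrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if an entry ends at this node
    }

    private final Node root = new Node();

    /** Adds one entry, for example "niemiecki|adjp|niemiecku". */
    void add(String entry) {
        Node node = root;
        for (char c : entry.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    /** Walks to the node for the prefix and returns every suffix stored below it. */
    List<String> suffixesAfter(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>(); // prefix is not in the trie
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(), out);
        return out;
    }

    // Depth-first walk that records the path to each terminal node.
    private static void collect(Node node, StringBuilder path, List<String> out) {
        if (node.terminal) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

With the two entries from this thread added ("niemiecki|adja|niemiecko" and "niemiecki|adjp|niemiecku"), calling suffixesAfter("niemiecki|") yields one "tag|form" string per variety, so a single traversal replaces one lookup per tag.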