jprante/elasticsearch-plugin-bundle

baseform: fewer word forms returned than defined in the resource

Opened this issue · 2 comments

nkrot commented

Situation: The baseform resource de-lemma-utf8.txt defines several outcomes for a single input word, for example:

Zuschlage	Zuschlag
Zuschlage	zuschlagen

I would expect all outcomes to be returned, since the correct baseform depends on the part of speech.

If the resource is used case-insensitively, the number of such collisions increases and then also includes cases like:

Gefahren	Gefahr
gefahren	fahren

Would it be possible to fix the plugin to return all entries given in the resource?

Thanks

That's a bug: in the left column of de-lemma-utf8.txt, every word should occur at most once.

Part-of-speech is out of scope of the baseform token filter. For this, a wordnet-like input would be required with an NLP plugin (for POS tagging).

nkrot commented

Hopefully you agree that a single word form can map to one or more baseforms. This is the main idea of my initial post: if no PoS information is available, it is reasonable to assume any PoS and produce all possible baseforms. Here is an example of two different lemmata that share the same derived form:

leaves       leaf
leaves       leave

If the left column is supposed to contain unique words only, how should multiple outcomes be specified? Like this:

Zuschlage     Zuschlag,zuschlagen

It is also possible to do this merging at load/compile time. That would make things a little easier for users who want to update the resource.
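
To illustrate the load-time merging, here is a minimal Python sketch. It is not the plugin's code: `load_baseforms` and its `ignore_case` flag are hypothetical names, and it assumes each resource line holds a word form and its baseform separated by whitespace, with an optional comma-separated list on the right as in the `Zuschlage` line above.

```python
from collections import defaultdict


def load_baseforms(lines, ignore_case=False):
    """Merge repeated left-column entries into one list of baseforms per word form.

    Hypothetical helper: assumes 'form<whitespace>baseform[,baseform...]' per line.
    """
    table = defaultdict(list)
    for line in lines:
        parts = line.strip().split(None, 1)  # split once on the first whitespace run
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        form, bases = parts
        key = form.lower() if ignore_case else form
        for base in bases.split(","):
            base = base.strip()
            # keep insertion order and drop exact duplicates
            if base and base not in table[key]:
                table[key].append(base)
    return dict(table)


sample = [
    "Zuschlage\tZuschlag",
    "Zuschlage\tzuschlagen",
    "Gefahren\tGefahr",
    "gefahren\tfahren",
]

merged = load_baseforms(sample)                    # case-sensitive keys
folded = load_baseforms(sample, ignore_case=True)  # keys lowercased
```

With the sample lines, `merged["Zuschlage"]` holds both `Zuschlag` and `zuschlagen`, and in `folded` the lowercased key `gefahren` collects both `Gefahr` and `fahren`. A filter built on such a table could then emit every listed baseform for a token instead of just one.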