German stemmer doesn't match schlummert/schlummern or grüßend/gegrüßt/grüßen
Hello,
I'm using Snowball via Elasticsearch, which is based on Lucene. The Snowball German stemming is not matching some common forms:
- "schlummert" should match "schlummern" (infinitive) but instead is unchanged
- "grüßend" should match "grüßen" (infinitive) but instead yields "grussend"
- "gegrüßt" should match "grüßen" (infinitive) but instead yields "gegrusst"
Original Lucene bug was here: https://issues.apache.org/jira/browse/LUCENE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17217670#comment-17217670
Looks like I didn't explicitly repeat the advice from #91 here, but to achieve what you ask for we would need a way to remove these suffixes (or the prefix, in the case of ge-) that doesn't negatively affect words that happen to end in `t` or `n`, or start with `ge`, where it shouldn't be removed. If we're unable to come up with such a rule then it's better not to try to remove them (because understemming is generally less problematic than overstemming), but it would be useful to note the limitation in the algorithm description on the website.
The website does actually already note `ge-` as "almost intractable", though in the "germanic" overview page rather than the page about the German stemmer:
"the almost intractable problems of [...] prefixed and infixed `ge`"
For example, you want `ge` removed from `gegrüßt`, but we shouldn't remove it from some other words. Here are some cases I trivially found from a grep in our German word list for words starting with `ge` which are the same length and also end in a consonant and `t`:
- `gedeiht`
- `gelangt`
- `gelingt`
- `genießt`
- `gesellt`
- `gesteht`
- `gewöhnt`
- `Gedicht` (particularly unhelpful as it would get conflated with `dicht`, which has a totally different meaning)
- `Gewicht`
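To make the problem concrete, here is a minimal Python sketch of a naive prefix rule run against the words listed above. It is not part of Snowball: the function name `strip_ge_naive`, the vowel set, and the rule itself (drop a leading `ge` when the lower-cased word ends in a consonant followed by `t`) are assumptions made purely for illustration.

```python
# Hypothetical illustration only; this is not how the Snowball German stemmer works.
VOWELS = set("aeiouyäöü")  # assumed vowel set for this sketch


def strip_ge_naive(word: str) -> str:
    """Drop a leading 'ge' from words ending in consonant + 't' (naive rule)."""
    w = word.lower()  # stemmer input is expected to be lower-cased
    if len(w) > 4 and w.startswith("ge") and w.endswith("t") and w[-2] not in VOWELS:
        return w[2:]
    return w


wanted = ["gegrüßt"]
unwanted = ["gedeiht", "gelangt", "gelingt", "genießt", "gesellt",
            "gesteht", "gewöhnt", "Gedicht", "Gewicht"]

# Words the rule changes even though the prefix should stay.
false_positives = [w for w in unwanted if strip_ge_naive(w) != w.lower()]

for w in wanted + unwanted:
    print(w, "->", strip_ge_naive(w))
print("false positives:", len(false_positives), "of", len(unwanted))
```

Under these assumptions the rule handles `gegrüßt` as requested, but it also fires on every word in the list, including turning `Gedicht` into `dicht`.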
The last two are nouns so should be capitalised in text, but the current expectation is that input is lower-cased before being fed to the stemmer, so we can't use the capitalisation as a clue. Potentially that could be changed, but doing so would be somewhat disruptive for users of the stemmers so it's not a simple change to make. It would also need to deal with words which aren't nouns being capitalised at the start of a sentence, in titles, etc.
A solution doesn't have to be perfect, it just needs to not be harmful to other cases, so if there's a rule we can use to identify a significant number of cases where `ge-` should be removed without triggering in cases where it shouldn't be removed, we could use that.
Removing `-t` and `-d` also seems hard to do without removing them from words which just happen to end with these letters. Even trying a more targeted rule for your particular example of just removing `-ert` is problematic as it would conflate e.g. `hundert` and `Hund`. Similarly, just removing `-end` would affect `Tugend`.
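The suffix cases can be sketched the same way. The helper `strip_suffix_naive` below is hypothetical and only shows what blindly removing `-ert` or `-end` from lower-cased input would do; it does not reflect any rule in the Snowball German stemmer.

```python
# Hypothetical illustration only; not a rule from the Snowball German stemmer.
def strip_suffix_naive(word: str, suffix: str) -> str:
    """Remove `suffix` from the lower-cased word if it ends with it."""
    w = word.lower()  # stemmer input is expected to be lower-cased
    if w.endswith(suffix) and len(w) > len(suffix):
        return w[: -len(suffix)]
    return w


# Removing -ert shortens "schlummert", but it also reduces "hundert"
# to the same string as "Hund".
for w in ["schlummert", "hundert", "Hund"]:
    print(w, "->", strip_suffix_naive(w, "ert"))

# Removing -end shortens "grüßend", but it also truncates "Tugend".
for w in ["grüßend", "Tugend"]:
    print(w, "->", strip_suffix_naive(w, "end"))
```

With this naive rule, `hundert` and `Hund` end up as the same string and `Tugend` is cut down to `tug`, which is the overstemming described above.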