Encoded sequences can clash with separator byte and cause assertion errors
Closed this issue · 6 comments
This causes the assert to fail in TrimSuffixEncoder.java:60:
Dictionary d = Dictionary.read(Paths.get("/tmp/org/languagetool/resource/de/german_synth.dict"));
DictionaryLookup dict = new DictionaryLookup(d);
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT"); // works
dict.lookup("anfragen|DOESNOTEXIST"); // works
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT:NEB"); // AssertionError at TrimSuffixEncoder.java:60
To reproduce, you can get the german_synth.dict
from http://search.maven.org/remotecontent?filepath=de/danielnaber/german-pos-dict/1.0/german-pos-dict-1.0.jar
Stacktrace is this:
java.lang.AssertionError
at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:60)
at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)
at org.languagetool.CrashTest.test(CrashTest.java:18)
In a less stripped down case this shows as (at least I assume this is the same issue):
Caused by: java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkIndex(Buffer.java:540)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:62)
at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)
I'll look into this, thanks Daniel.
Confirmed. The encoded length of the suffix to strip clashes with separator character. Thinking on how to fix this.
Daniel, the simplest workaround for this problem is currently to recompile the dictionary from scratch and substitute the separator character for something with a higher value. Good examples would be ',' or '\t' since these fall below 'A' in ascii range and would wrap around to much longer suffixes.
I didn't invent this encoding scheme, it's a legacy from Janek Daciuk's work. A proper fix would involve some bookkeeping on the encoded string so that we don't accidentally trip on separator character (the input must not contain it, this is verified).
Fixed in 2.1.2. Will publish asap. Thanks for a detailed repro, Daniel!
Thanks for the fast fix, I can confirm it fixes the issue - also without re-building the dictionary, or do you still recommend that?
The fix will work with your existing dictionary if you can/ are willing to upgrade. If you want to stick to 2.1.1, you'll have to rebuild the dictionary with a different separator, unfortunately.