applyTopicModel/MalletLDA: exception when converting feature sequence
johann-petrak opened this issue · 2 comments
The exception is:
java.lang.IndexOutOfBoundsException: Index -1 out-of-bounds for length 921
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.util.Objects.checkIndex(Objects.java:372)
at java.base/java.util.ArrayList.get(ArrayList.java:440)
at cc.mallet.types.Alphabet.lookupObject(Alphabet.java:154)
at gate.plugin.learningframework.mallet.LFAlphabet.lookupObject(LFAlphabet.java:40)
at cc.mallet.types.FeatureSequence.toString(FeatureSequence.java:102)
at java.base/java.lang.String.valueOf(String.java:2788)
at java.base/java.lang.StringBuilder.append(StringBuilder.java:135)
at gate.plugin.learningframework.engines.EngineMBTopicsLDA.applyTopicModel(EngineMBTopicsLDA.java:196)
at gate.plugin.learningframework.LF_ApplyTopicModel.process(LF_ApplyTopicModel.java:122)
at gate.plugin.learningframework.AbstractDocumentProcessor.execute(AbstractDocumentProcessor.java:207)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:172)
at gate.creole.SerialController.executeImpl(SerialController.java:157)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:225)
at gate.creole.ConditionalSerialAnalyserController.execute(ConditionalSerialAnalyserController.java:132)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1777)
at java.base/java.lang.Thread.run(Thread.java:844)
This happens when a feature sequence created from a new document, one that was not in the training set, gets converted back to a string. An index is looked up in the alphabet using lookupObject(idx), and that index (-1, according to the trace) is not in the alphabet. So how did it get into the feature sequence in the first place?
It turns out that the Mallet TokenSequence.toFeatureSequence(Alphabet) method adds an index of -1 to the feature sequence for every unknown token if the Alphabet is set to not grow. Any code that then converts the FeatureSequence back to a String will hit the IndexOutOfBoundsException shown above.
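To make the mechanism concrete, here is a minimal sketch that reproduces the same failure pattern using only the Java standard library. `FrozenAlphabetDemo` and all of its methods are hypothetical stand-ins written for illustration. They are not Mallet's classes and only imitate the behaviour described above: a lookup that returns -1 for unknown entries, and a reverse lookup backed by an ArrayList.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an alphabet whose growth is stopped.
// This is NOT cc.mallet.types.Alphabet; it only imitates the described behaviour.
class FrozenAlphabetDemo {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    FrozenAlphabetDemo(List<String> vocabulary) {
        for (String entry : vocabulary) {
            if (!indexOf.containsKey(entry)) {
                indexOf.put(entry, entries.size());
                entries.add(entry);
            }
        }
    }

    // Unknown entries are reported as -1 instead of being added.
    int lookupIndex(String entry) {
        return indexOf.getOrDefault(entry, -1);
    }

    // Same shape as the failing call in the trace: ArrayList.get(-1) throws.
    String lookupObject(int index) {
        return entries.get(index);
    }

    // Converts tokens to indices without filtering, so -1 entries are kept.
    int[] toIndices(List<String> tokens) {
        int[] indices = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            indices[i] = lookupIndex(tokens.get(i));
        }
        return indices;
    }

    // Converts indices back to text; fails on the first -1.
    String indicesToString(int[] indices) {
        StringBuilder sb = new StringBuilder();
        for (int index : indices) {
            sb.append(lookupObject(index)).append(' ');
        }
        return sb.toString().trim();
    }
}
```

Calling `toIndices` on tokens that include an unseen word yields an array containing -1, and passing that array to `indicesToString` throws an IndexOutOfBoundsException, which matches the top of the trace.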
Not sure how best to deal with this. Ideally there would be a way to add only the known tokens to the feature sequence.
See mimno/Mallet#138
For now, I will just construct the feature sequence manually.
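A rough sketch of what constructing the sequence manually could look like: look up each token and keep only the indices of known tokens. The `KnownTokenFilter` class and its `vocabulary` map are hypothetical and stand in for a frozen alphabet. This is not the actual code in the plugin. With Mallet itself, the analogous approach would presumably be to check each lookup result for -1 before adding it to the FeatureSequence, but that is an assumption not verified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, written for illustration only.
class KnownTokenFilter {
    // vocabulary: stand-in for a frozen alphabet, mapping entry -> index.
    // Returns the indices of the known tokens in their original order;
    // unknown tokens are skipped rather than recorded as -1.
    static int[] knownIndices(List<String> tokens, Map<String, Integer> vocabulary) {
        List<Integer> kept = new ArrayList<>();
        for (String token : tokens) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                kept.add(index);
            }
        }
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.get(i);
        }
        return result;
    }
}
```

One consequence of this choice is that the resulting sequence can be shorter than the token sequence, so positions in the two no longer line up one to one.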