riedlma/topictiling

Annotator processing failed

@riedlma I'm encountering the following error when running the shell script like this (after using GibbsLDA++, which worked fine):
sh topictiling.sh -ri 5 -tmd topicmodel -tmn model-final -fp *txt -fd files_to_segment
(Note that I had to remove the quotes around the argument to -fp, otherwise I ran into the same problem as in #2, although this seems to be OS-dependent; both variants are shown below.)
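For reference, here are both invocations. I assume the difference is whether the shell expands the glob before topictiling.sh sees it, which would explain the OS-dependence:

# quoted: *txt is passed literally to the script (failed for me, same as #2)
sh topictiling.sh -ri 5 -tmd topicmodel -tmn model-final -fp "*txt" -fd files_to_segment

# unquoted: the shell may expand *txt first (this is what worked for me)
sh topictiling.sh -ri 5 -tmd topicmodel -tmn model-final -fp *txt -fd files_to_segment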

org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(407)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:391)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:296)
        at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
        at org.uimafit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:223)
        at org.uimafit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:143)
        at de.tudarmstadt.langtech.semantics.segmentation.segmenter.RunTopicTilingOnFile.<init>(RunTopicTilingOnFile.java:133)
        at de.tudarmstadt.langtech.semantics.segmentation.segmenter.RunTopicTilingOnFile.main(RunTopicTilingOnFile.java:94)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.elementData(ArrayList.java:422)
        at java.util.ArrayList.get(ArrayList.java:435)
        at de.tudarmstadt.langtech.semantics.segmentation.segmenter.annotator.TopicTilingSegmenterAnnotator.annotateSegments(TopicTilingSegmenterAnnotator.java:231)
        at de.tudarmstadt.langtech.semantics.segmentation.segmenter.annotator.TopicTilingSegmenterAnnotator.process(TopicTilingSegmenterAnnotator.java:142)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:375)
        ... 6 more

Exception in thread "main" org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
        ... (same stack trace as above)

I have 3919 documents in files_to_segment, each in its own .txt file with the entire document on a single line. I played around a bit and noticed that when I put all documents into one file with one line per document (the format GibbsLDA++ expects), the error doesn't occur. However, in that case it just segments along some (but not all) of the newlines and treats everything as one document, so this is clearly not how it is meant to be used (the readme doesn't really discuss the expected input format). If I concatenate all documents onto a single line in a single file, the error also doesn't occur, but again the segmentation doesn't work properly: it basically creates two huge segments, split somewhere in the middle. Both variants are sketched below.
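For concreteness, this is roughly how I produced the two single-file variants (a sketch; the file names are just the ones I used locally):

# one line per document, GibbsLDA++-style (no error, but everything is treated as one document)
for f in files_to_segment/*.txt; do tr '\n' ' ' < "$f"; echo; done > all_docs.txt

# everything concatenated onto a single line (no error, but only two huge segments)
cat files_to_segment/*.txt | tr '\n' ' ' > one_line.txt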

I tried this and got the same error on two systems (Arch Linux & Ubuntu), so it doesn't seem to be system-related.

Hi,

thanks for the note. This is quite hard to fix without knowing the input data that failed. Normally, documents containing newlines should work as well.

Have you tried segmenting the documents so that each line is a sentence, and then using the -s option (simple segmentation)? It expects words to be separated by whitespace and sentences by newlines; an example is sketched below.
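For example, an input file would look like this (a sketch; the file name is just a placeholder), and -s is added to the call:

# files_to_segment/doc1.txt -- one sentence per line, words separated by whitespace:
#   this is the first sentence
#   here is the second sentence
sh topictiling.sh -ri 5 -tmd topicmodel -tmn model-final -fp *txt -fd files_to_segment -s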

Best,
Martin

Thanks for the reply! That did indeed help me figure out what was wrong. My corpus consists of text generated by automatic speech recognition, so there is no punctuation to indicate sentence structure. If I just split the text into arbitrary pseudo-sentences, it does work (at least in principle; the results obviously don't make sense without real sentence boundaries). My bad, I should have read the paper more closely!
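In case it helps anyone else working with ASR transcripts, here is a minimal sketch of the splitting I ended up with (the 20-words-per-line chunk size and the file names are just examples):

# one word per line, then re-group into pseudo-sentences of 20 words each
# (xargs treats quote characters specially, which is fine for punctuation-free ASR text)
tr -s '[:space:]' '\n' < transcript.txt | xargs -n 20 > pseudo_sentences.txt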