termsuite/termsuite-core

Error with Chinese


Hi,

  • your OS : Mac OSX
  • your java runtime environment (java -version): java version "1.8.0_221"
  • the log file provided by TermSuite (under the current ./logs/ directory) : NA
  • a short description of the problem you encounter.

I am trying to use TermSuite to extract terms from documents in Chinese. I have successfully installed TreeTagger for Chinese and found a Chinese segmenter that works. However, I get an error when I run TermSuite with the following settings:

-c /input_files/pdf2txt -l zh --contextualize --context-scope 3 --context-assoc-rate MutualInformation --enable-semantic-gathering --post-filter-property documentFrequency --post-filter-th 2 --semantic-distance Jaccard --tsv /ctxt3_jac_pmi.tsv --tsv-properties "rank,pilot,isFixedExp,pattern,freq,spec,semScore,isDico,isDistrib"

These settings work for 'en' on a set of English documents, but when applied to 'zh' on a set of Chinese documents, I get the following error:

Exception in thread "main" fr.univnantes.termsuite.tools.TermSuiteCliException: An unexpected error occurred: Resource initialization error
    at fr.univnantes.termsuite.tools.CommandLineClient.launch(CommandLineClient.java:295)
    at fr.univnantes.termsuite.tools.TerminologyExtractorCLI.main(TerminologyExtractorCLI.java:203)
Caused by: fr.univnantes.termsuite.api.TermSuiteException: Resource initialization error
    at fr.univnantes.termsuite.framework.service.TermSuiteResourceManager.loadResource(TermSuiteResourceManager.java:68)
    at fr.univnantes.termsuite.framework.service.TermSuiteResourceManager.get(TermSuiteResourceManager.java:77)
    at fr.univnantes.termsuite.framework.injector.ResourceInjector.injectResources(ResourceInjector.java:25)
    at fr.univnantes.termsuite.framework.injector.EngineInjector.injectResources(EngineInjector.java:31)
    at fr.univnantes.termsuite.framework.pipeline.SimpleEngineRunner.run(SimpleEngineRunner.java:28)
    at fr.univnantes.termsuite.framework.pipeline.AggregateEngineRunner.run(AggregateEngineRunner.java:49)
    at fr.univnantes.termsuite.framework.pipeline.AggregateEngineRunner.run(AggregateEngineRunner.java:49)
    at fr.univnantes.termsuite.framework.Pipeline.run(Pipeline.java:25)
    at fr.univnantes.termsuite.api.TerminoExtractor.execute(TerminoExtractor.java:95)
    at fr.univnantes.termsuite.tools.TerminologyExtractorCLI.run(TerminologyExtractorCLI.java:137)
    at fr.univnantes.termsuite.tools.CommandLineClient.launch(CommandLineClient.java:287)
    ... 1 more
Caused by: org.apache.uima.resource.ResourceInitializationException
    at fr.univnantes.julestar.uima.resources.MultilineResource.load(MultilineResource.java:44)
    at fr.univnantes.termsuite.framework.service.TermSuiteResourceManager.loadResource(TermSuiteResourceManager.java:55)
    ... 11 more
Caused by: fr.univnantes.julestar.uima.resources.ResourceFormatException: Expected two columns at line 36.
Got: "职掌"
    at fr.univnantes.julestar.uima.resources.MapResource.doError(MapResource.java:52)
    at fr.univnantes.julestar.uima.resources.MapResource.doRow(MapResource.java:30)
    at fr.univnantes.julestar.uima.resources.TabResource.doLine(TabResource.java:15)
    at fr.univnantes.julestar.uima.resources.MultilineResource.load(MultilineResource.java:41)
    ... 12 more
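
The innermost "Caused by" suggests that one of the 'zh' resources loaded as a two-column map has a row with a single column at line 36 (the row containing 职掌). The trace does not name the file, so I can't say which resource it is. In case it helps narrow this down, here is a minimal Python sketch that scans a candidate resource file for rows that do not split into exactly two columns. It assumes the columns are tab-separated (suggested by the `TabResource` class name, but I have not confirmed it) and that blank lines and lines starting with `#` are skipped by the loader (also an assumption). The names `find_bad_rows` and `check_file` are my own, not part of TermSuite.

```python
from typing import Iterable, List, Tuple


def find_bad_rows(
    lines: Iterable[str], sep: str = "\t", expected: int = 2
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for rows that do not have `expected` columns."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        # Assumption: the loader ignores blank lines and '#' comment lines.
        if not text.strip() or text.lstrip().startswith("#"):
            continue
        # Assumption: columns are separated by a single tab character.
        if len(text.split(sep)) != expected:
            bad.append((number, text))
    return bad


def check_file(path: str) -> List[Tuple[int, str]]:
    """Scan a UTF-8 resource file (path supplied by the caller) for malformed rows."""
    with open(path, encoding="utf-8") as handle:
        return find_bad_rows(handle)
```

Calling `check_file(...)` on each Chinese resource file and looking for a hit at line 36 should point to the file the loader is rejecting, provided the assumptions above hold.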