Error with Chinese
Hi,
- your OS : Mac OSX
- your java runtime environment (java -version): java version "1.8.0_221"
- the log file provided by TermSuite (under the current ./logs/ directory) : NA
- a short description of the problem you encounter.
I am trying to use TermSuite to extract terms from Chinese documents. I have successfully installed TreeTagger for Chinese and found a working Chinese segmenter. However, when I run TermSuite with the following settings:
-c /input_files/pdf2txt -l zh --contextualize --context-scope 3 --context-assoc-rate MutualInformation --enable-semantic-gathering --post-filter-property documentFrequency --post-filter-th 2 --semantic-distance Jaccard --tsv /ctxt3_jac_pmi.tsv --tsv-properties "rank,pilot,isFixedExp,pattern,freq,spec,semScore,isDico,isDistrib"
These settings work with 'en' on a set of English documents, but with 'zh' on a set of Chinese documents I get the following error:
Exception in thread "main" fr.univnantes.termsuite.tools.TermSuiteCliException: An unexpected error occurred: Resource initialization error
    at fr.univnantes.termsuite.tools.CommandLineClient.launch(CommandLineClient.java:295)
    at fr.univnantes.termsuite.tools.TerminologyExtractorCLI.main(TerminologyExtractorCLI.java:203)
Caused by: fr.univnantes.termsuite.api.TermSuiteException: Resource initialization error
    at fr.univnantes.termsuite.framework.service.TermSuiteResourceManager.loadResource(TermSuiteResourceManager.java:68)
    at fr.univnantes.termsuite.framework.service.TermSuiteResourceManager.get(TermSuiteResourceManager.java:77)
    at fr.univnantes.termsuite.framework.injector.ResourceInjector.injectResources(ResourceInjector.java:25)
    at fr.univnantes.termsuite.framework.injector.EngineInjector.injectResources(EngineInjector.java:31)
    at fr.univnantes.termsuite.framework.pipeline.SimpleEngineRunner.run(SimpleEngineRunner.java:28)
    at fr.univnantes.termsuite.framework.pipeline.AggregateEngineRunner.run(AggregateEngineRunner.java:49)
    at fr.univnantes.termsuite.framework.pipeline.AggregateEngineRunner.run(AggregateEngineRunner.java:49)
    at fr.univnantes.termsuite.framework.Pipeline.run(Pipeline.java:25)
    at fr.univnantes.termsuite.api.TerminoExtractor.execute(TerminoExtractor.java:95)
    at fr.univnantes.termsuite.tools.TerminologyExtractorCLI.run(TerminologyExtractorCLI.java:137)
    at fr.univnantes.termsuite.tools.CommandLineClient.launch(CommandLineClient.java:287)
    ... 1 more
Caused by: org.apache.uima.resource.ResourceInitializationException
    at fr.univnantes.julestar.uima.resources.MultilineResource.load(MultilineResource.java:44)
    at fr.univnantes.termsuite.framework.service.TermSuiteResourceManager.loadResource(TermSuiteResourceManager.java:55)
    ... 11 more
Caused by: fr.univnantes.julestar.uima.resources.ResourceFormatException: Expected two columns at line 36. Got: "职掌"
    at fr.univnantes.julestar.uima.resources.MapResource.doError(MapResource.java:52)
    at fr.univnantes.julestar.uima.resources.MapResource.doRow(MapResource.java:30)
    at fr.univnantes.julestar.uima.resources.TabResource.doLine(TabResource.java:15)
    at fr.univnantes.julestar.uima.resources.MultilineResource.load(MultilineResource.java:41)
    ... 12 more
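
The innermost exception ("Expected two columns at line 36") suggests that one of the 'zh' resources is read as a two-column, tab-separated map and that one of its lines contains a single field. The trace does not name the file, so below is only a minimal sketch for scanning a candidate file for such lines. It assumes the resource is UTF-8 text with tab-separated fields; the function name `find_malformed_rows` is hypothetical, and skipping blank lines and `#` comments is an assumption rather than TermSuite's documented behaviour.

```python
from typing import List, Tuple


def find_malformed_rows(path: str, expected_columns: int = 2) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that does not split into
    exactly `expected_columns` tab-separated fields."""
    malformed: List[Tuple[int, str]] = []
    with open(path, encoding="utf-8") as handle:
        for line_number, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            # Assumption: blank lines and '#' comment lines are not data rows.
            if not line.strip() or line.lstrip().startswith("#"):
                continue
            if len(line.split("\t")) != expected_columns:
                malformed.append((line_number, line))
    return malformed
```

Calling `find_malformed_rows` on each tab-separated file in the Chinese resource set and printing the result should show whether any file has a one-column entry such as "职掌" near line 36. Because of the skipping assumption, reported line numbers count physical lines in the file and may not match the parser's own count exactly.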