TermSuite-lang (TermSuite resources)

This repository contains the resources needed by TermSuite. These resources are:

Language-specific TreeTagger tags to Multext categories/subCategories mappings (*-tt-*-mapping.xml),
Language-specific morphological analysis resources (*-compost-*.txt files),
Language-specific allowed characters (used for non-word or formula detection, *-allowed-chars.txt files),
Language-specific multi-word detection rules (in the form of UIMA Tokens Regex rules, .regex files),
Language-specific syntactic variation rules (.yaml files),
Language-specific dictionaries (*-dico.txt files),
Language-specific term frequencies in a general language corpus (GeneralLanguage.Lang files).
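
As a rough illustration of the naming conventions above, the following sketch maps a bare file name to the kind of resource it probably holds. It is not part of TermSuite: the class name ResourceKinds, the method kindOf and the labels are hypothetical, and the pattern used for general-language frequency files (GeneralLanguage.*) is an assumption about how "GeneralLanguage.Lang" expands for a given language.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: guesses the resource kind from a file name. */
public class ResourceKinds {

    // Glob patterns copied from the list above, checked in insertion order.
    private static final Map<String, String> KINDS = new LinkedHashMap<>();
    static {
        KINDS.put("*-tt-*-mapping.xml", "TreeTagger tag mapping");
        KINDS.put("*-compost-*.txt", "morphological analysis");
        KINDS.put("*-allowed-chars.txt", "allowed characters");
        KINDS.put("*.regex", "multi-word detection rules");
        KINDS.put("*.yaml", "syntactic variation rules");
        KINDS.put("*-dico.txt", "dictionary");
        // Assumption: frequency files are named GeneralLanguage.<something>.
        KINDS.put("GeneralLanguage.*", "general language frequencies");
    }

    /** Returns the label of the first matching pattern, or "unknown". */
    public static String kindOf(String fileName) {
        for (Map.Entry<String, String> entry : KINDS.entrySet()) {
            PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + entry.getKey());
            if (matcher.matches(Paths.get(fileName))) {
                return entry.getValue();
            }
        }
        return "unknown";
    }
}
```

Calling ResourceKinds.kindOf on each file of a language directory gives a quick overview of which resource kinds a language pack provides.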

Supported languages

Language	Quality of resource pack (and other comments)
french	Excellent
english	Excellent
russian	Excellent (TreeTagger is slow)
german	Good
spanish	Good
danish	Poor
chinese	Poor
latvian	Poor (no POS tagger supported natively)

First clone this repository.

$ git clone https://github.com/termsuite/termsuite-lang.git

Then pass the path to this local termsuite-lang repo to TermSuite.

Example with TermSuite Java API

If you cloned this repo while in the directory /path/to/current-dir/, the resource path will be:

TermSuitePipeline pipeline = TermSuitePipeline.create("fr")
    .setResourcePath("/path/to/current-dir/resource-lang/")
    ...

You can also package everything inside the resource-lang directory into a jar and pass the path to that jar as the TermSuite resource path.
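
One possible way to build such a jar using only the JDK is sketched below. The class name ResourceJarPackager and its method are hypothetical and not part of TermSuite. The sketch assumes that files should sit in the jar at the same relative paths they have inside resource-lang, and it writes no manifest.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: packages a resource directory into a jar. */
public class ResourceJarPackager {

    /** Copies every regular file under resourceDir into jarFile, keeping relative paths. */
    public static void packageDirectory(Path resourceDir, Path jarFile) throws IOException {
        // Collect the files first so the jar itself is never picked up by the walk.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(resourceDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out)) {
            for (Path file : files) {
                // Jar entry names always use '/' as separator, whatever the platform.
                String name = resourceDir.relativize(file).toString().replace('\\', '/');
                jar.putNextEntry(new JarEntry(name));
                Files.copy(file, jar);
                jar.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ResourceJarPackager <resource-dir> <output.jar>");
            return;
        }
        packageDirectory(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The path of the resulting jar can then be given wherever the directory path would otherwise be used.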