ziqizhang/jate

Provide API and improvement for PoS tagger with dictionary

jerrygaoLondon opened this issue · 1 comments

At the moment, we are using the OpenNLP part-of-speech (PoS) tagger, which relies on a pre-trained English maxent PoS model (1.5). This is a general-purpose model for PoS tagging.

However, for most domain-specific tasks (typically biomedical text such as the GENIA corpus), the general-purpose PoS tagger performs very poorly and achieves only very low recall. We could support dictionary-based PoS tagging in the JATE toolset with a simple setting. It would be a valuable feature: for most domain-specific problems, obtaining a large training set is a great challenge, whereas maintaining a set of manually curated dictionaries is simple and efficient, as in the approach adopted by GENIA Tagger.

We could provide an example/demo (default) setting for benchmarking various ATE algorithms over the GENIA corpus, using the biomedical PoS dictionary provided by GENIA Tagger.

The OpenNLP PoS tagger supports the use of a tag dictionary.

For more details, see GENIA Tagger.
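
To make the idea concrete, here is a minimal sketch of how a tag dictionary can constrain a general-purpose tagger. It uses only the Java standard library and is not the OpenNLP or JATE API: the class name TagDictionarySketch, its methods, and the sample entries are all hypothetical. It assumes the underlying tagger returns a non-empty list of candidate tags for each token, ordered best first, and that the dictionary lists the tags allowed for a known token.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not part of JATE or OpenNLP. */
class TagDictionarySketch {
    // Lower-cased token -> tags the dictionary allows for that token.
    private final Map<String, List<String>> allowedTags = new HashMap<>();

    /** Registers the allowed tags for a token (matched case-insensitively). */
    void addEntry(String token, String... tags) {
        if (tags.length == 0) {
            throw new IllegalArgumentException("at least one tag is required");
        }
        allowedTags.put(token.toLowerCase(Locale.ROOT), Arrays.asList(tags));
    }

    /**
     * Picks a tag for the token. rankedCandidates is assumed to be the
     * non-empty, best-first output of some general-purpose tagger.
     */
    String choose(String token, List<String> rankedCandidates) {
        List<String> allowed = allowedTags.get(token.toLowerCase(Locale.ROOT));
        if (allowed == null) {
            // Token is not in the dictionary: trust the general tagger.
            return rankedCandidates.get(0);
        }
        for (String candidate : rankedCandidates) {
            if (allowed.contains(candidate)) {
                // Best-ranked candidate that the dictionary permits.
                return candidate;
            }
        }
        // No candidate is permitted: fall back to the first dictionary tag.
        return allowed.get(0);
    }

    public static void main(String[] args) {
        TagDictionarySketch dict = new TagDictionarySketch();
        dict.addEntry("IL-2", "NN"); // made-up biomedical entry
        // The general tagger ranks CD first, but the dictionary allows only NN.
        System.out.println(dict.choose("IL-2", Arrays.asList("CD", "NN")));
    }
}
```

In this sketch the dictionary only filters the tagger's candidates, so tokens it does not cover are still tagged by the general model. A real integration would load the entries from the GENIA Tagger dictionary and hook into the OpenNLP tagger itself rather than post-processing its output.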

Maybe a nice feature, but we have decided not to implement it unless there are many requests.