Which tokenizer did you use?
janwendt opened this issue · 11 comments
Your documentation says:
The input file should contain one sentence per line, and the sentences have to be tokenized. Otherwise, the tagger will perform poorly.
Simple Question: Which tokenizer did you use?
Hi,
I only trained the model on the CoNLL datasets, which were already tokenized, so I did not have to tokenize anything. The Moses tokenizer should probably work well:
https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer
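If running the Moses perl script isn't convenient, a rough stdlib stand-in that just splits punctuation off from words can be sketched like this. This is my own approximation, not the Moses algorithm, and `simple_tokenize` is a hypothetical helper name:

```python
import re

def simple_tokenize(sentence):
    """Rough stand-in for a real tokenizer: separate punctuation
    from words. Moses handles many more cases (contractions,
    abbreviations, URLs), so prefer it for real preprocessing."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(" ".join(simple_tokenize("Mr. Smith lives in New York, doesn't he?")))
# -> Mr . Smith lives in New York , doesn ' t he ?
```

Note the crude handling of "doesn't" and "Mr." above; that's exactly the kind of case where a proper tokenizer like Moses does better.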
Yes, you can check the data here:
https://github.com/glample/tagger/tree/master/dataset
@janwendt What did you eventually do?
@mrmotallebi I am using the StanfordCoreNLP API, which does a very good job, but there are similar Python libraries (NLTK is pretty good) as well.
Post that got me to the API: https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en
I would personally recommend the Moses one; it's pretty standard and very fast.
@janwendt Do you have a demo of the input.txt passed to tagger.py?
@janwendt It isn't needed. I got it working.
@bjtu-lucas-nlp Could you please share an example of input.txt?
I have tried all kinds of combinations but still get O tags for everything in the output.
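For reference, the documentation quoted above says the tagger expects one tokenized sentence per line, with tokens separated by spaces; punctuation glued to words is a plausible cause of all-O output. A minimal sketch of producing such an input.txt, using a hypothetical regex tokenizer as a stand-in for Moses (the sentences below are made-up examples):

```python
import re

def simple_tokenize(sentence):
    # Rough stand-in for a real tokenizer (e.g. Moses):
    # split punctuation off from words.
    return re.findall(r"\w+|[^\w\s]", sentence)

sentences = [
    "John Smith works at Google in London.",
    "He moved there in 2015.",
]

# Write one tokenized sentence per line, tokens separated by spaces.
with open("input.txt", "w", encoding="utf-8") as f:
    for s in sentences:
        f.write(" ".join(simple_tokenize(s)) + "\n")
```

The resulting file contains lines like `John Smith works at Google in London .` — note the space before the final period.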