Which tokenizer did you use?
janwendt opened this issue · 11 comments
Your documentation says:
The input file should contain one sentence per line, and the sentences have to be tokenized. Otherwise, the tagger will perform poorly.
Simple Question: Which tokenizer did you use?
Hi,
I only trained the model on the CoNLL datasets, which were already tokenized, so I did not have to tokenize anything. The Moses tokenizer should probably work well:
https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer
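If running the Moses perl script isn't convenient, a rough stdlib stand-in that just splits punctuation off from words can be sketched like this. This is my own approximation, not the Moses algorithm, and `simple_tokenize` is a hypothetical helper name:

```python
import re

def simple_tokenize(sentence):
    """Rough stand-in for a real tokenizer: separate punctuation
    from words. Moses handles many more cases (contractions,
    abbreviations, URLs), so prefer it for real preprocessing."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(" ".join(simple_tokenize("Mr. Smith lives in New York, doesn't he?")))
# -> Mr . Smith lives in New York , doesn ' t he ?
```

Note the crude handling of "doesn't" and "Mr." above; that's exactly the kind of case where a proper tokenizer like Moses does better.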
Yes, you can check the data here:
https://github.com/glample/tagger/tree/master/dataset
@janwendt What did you eventually do?
@mrmotallebi I am using the StanfordCoreNLP API, which does a very good job, but there are similar Python libraries (NLTK is pretty good) as well.
Post that got me to the API: https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en
I would personally recommend the Moses one; it's pretty standard and very fast.
@janwendt Do you have a demo of the input.txt passed to tagger.py?
@janwendt It isn't needed. I got it working.
@bjtu-lucas-nlp Could you please share an example of input.txt?
I have tried all kinds of combinations but still get O tags for everything in the output.
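For reference, the documentation quoted above says the tagger expects one tokenized sentence per line, with tokens separated by spaces; punctuation glued to words is a plausible cause of all-O output. A minimal sketch of producing such an input.txt, using a hypothetical regex tokenizer as a stand-in for Moses (the sentences below are made-up examples):

```python
import re

def simple_tokenize(sentence):
    # Rough stand-in for a real tokenizer (e.g. Moses):
    # split punctuation off from words.
    return re.findall(r"\w+|[^\w\s]", sentence)

sentences = [
    "John Smith works at Google in London.",
    "He moved there in 2015.",
]

# Write one tokenized sentence per line, tokens separated by spaces.
with open("input.txt", "w", encoding="utf-8") as f:
    for s in sentences:
        f.write(" ".join(simple_tokenize(s)) + "\n")
```

The resulting file contains lines like `John Smith works at Google in London .` — note the space before the final period.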