intended meaning of arguments in `tokenization_comparison.py` (question and suggestion)

Question

intended meaning of arguments in `tokenization_comparison.py` (question and suggestion)

makrai opened this issue 4 years ago · 1 comments

Dávid, I create this issue, because I think its easier to keep track of than an e-mail,
but if you don't like this, feel free to close, and continue somehow else.
Thanks for the awesome repo!
In my understanding, tokenization_comparison compares two tokenizers (in the WordPiece sense) based on a corpus, so

--vocab-file is the gold tokenizer,
--model-dir is the "system" tokenizer, and
--input-dir is the corpus.
Am I right? If so, the kwargs might be renamed accordingly.

Answer 1 · 2020-07-22T12:39:36.000Z

Thanks for the question. This repo might not be the final place and form of that script, so I did not want to put too much work and thought into it. Still, I changed the argument descriptions a bit to make them more informative.