intended meaning of arguments in `tokenization_comparison.py` (question and suggestion)
makrai opened this issue · 1 comments
makrai commented
Dávid, I'm creating this issue because I think it's easier to keep track of than an e-mail,
but if you don't like this, feel free to close it and continue some other way.
Thanks for the awesome repo!
As I understand it, `tokenization_comparison.py`
compares two tokenizers (in the WordPiece sense) based on a corpus, so
- --vocab-file is the gold tokenizer,
- --model-dir is the "system" tokenizer, and
- --input-dir is the corpus.
Am I right? If so, the kwargs might be renamed accordingly.
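If that reading is right, the renaming could look roughly like the sketch below. It assumes the script builds its command line with `argparse`, which I have not checked. The new names (`--gold-vocab`, `--system-model-dir`, `--corpus-dir`) and the `build_parser` helper are hypothetical suggestions, not the script's actual interface. The old names are kept as aliases so existing invocations would not break.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser illustrating the suggested renaming.

    Assumption: the real script uses argparse; the new option names
    below are made up for illustration only.
    """
    parser = argparse.ArgumentParser(
        description='Compare a "system" tokenizer to a gold tokenizer on a corpus.')
    # Each option lists the suggested name first and the current name as an alias.
    parser.add_argument('--gold-vocab', '--vocab-file', dest='gold_vocab',
                        required=True,
                        help='vocabulary file of the gold (reference) tokenizer')
    parser.add_argument('--system-model-dir', '--model-dir',
                        dest='system_model_dir', required=True,
                        help='model directory of the "system" tokenizer')
    parser.add_argument('--corpus-dir', '--input-dir', dest='corpus_dir',
                        required=True,
                        help='directory of the corpus both tokenizers are run on')
    return parser
```

With this, both `--vocab-file vocab.txt` and `--gold-vocab vocab.txt` would end up in `args.gold_vocab`, so the help text could state the roles explicitly without changing how the script is called today.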
DavidNemeskey commented
Thanks for the question. This repo might not be the final place and form of that script, so I did not want to put too much work and thought into it. Still, I changed the argument descriptions a bit to make them more informative.