DavidNemeskey/emBERT

intended meaning of arguments in `tokenization_comparison.py` (question and suggestion)

makrai opened this issue · 1 comment

Dávid, I'm creating this issue because I think it's easier to keep track of than an e-mail,
but if you don't like this, feel free to close it and continue some other way.
Thanks for the awesome repo!
In my understanding, `tokenization_comparison.py` compares two tokenizers (in the WordPiece sense) on a corpus, so

  • --vocab-file is the gold tokenizer,
  • --model-dir is the "system" tokenizer, and
  • --input-dir is the corpus.

Am I right? If so, the kwargs might be renamed accordingly.
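
To make the suggestion concrete, here is a minimal sketch of how the three options could be documented if the reading above is correct. It assumes the script parses its arguments with `argparse`, which I have not checked. The function name `build_parser`, the help texts, and the example paths are all hypothetical; only the three flag names come from the script.

```python
import argparse


def build_parser():
    """Hypothetical parser; only the three flag names are taken from the script."""
    parser = argparse.ArgumentParser(
        description='Compare two WordPiece tokenizers on a corpus.')
    # Assumption: this vocabulary defines the "gold" (reference) tokenizer.
    parser.add_argument('--vocab-file', required=True,
                        help='vocabulary file of the gold (reference) tokenizer')
    # Assumption: the tokenizer loaded from this directory is the "system" one.
    parser.add_argument('--model-dir', required=True,
                        help='model directory of the system tokenizer to evaluate')
    # Assumption: both tokenizers are run on the files in this directory.
    parser.add_argument('--input-dir', required=True,
                        help='directory with the corpus used for the comparison')
    return parser


# Example invocation with made-up paths, passed explicitly instead of sys.argv.
args = build_parser().parse_args([
    '--vocab-file', 'gold_vocab.txt',
    '--model-dir', 'system_model',
    '--input-dir', 'corpus',
])
print(args.vocab_file, args.model_dir, args.input_dir)
```

Keeping the flag names and only changing the help strings would avoid breaking existing command lines; renaming the flags themselves (for example to something like `--gold-vocab`) would be the more explicit but less compatible option.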

Thanks for the question. This repo might not be the final place and form of that script, so I did not want to put too much work and thought into it. Still, I changed the argument descriptions a bit to make them more informative.