How to convert tokenizers.Tokenizer to XXTokenizerFast in transformers?
rangehow opened this issue · 3 comments
Motivation
I followed the guide build-a-tokenizer-from-scratch and got a single tokenizer.json from my corpus. Since I'm not sure whether it is compatible with the trainer, I want to convert it back to an XXTokenizerFast in transformers.
Observation
In llama2-7b-hf, the tokenizer files seem to consist of:
tokenizer.json ✅ I have
tokenizer.model ✖ I don't have this, and I'm not sure what it is used for
tokenizer_config.json ✖ I don't have this, but it doesn't look that important; I can set it manually.
Initializing a LlamaTokenizerFast from scratch through its __init__ function seems to require both tokenizer.model and tokenizer.json, but I don't have a tokenizer.model:
def __init__(
    self,
    vocab_file=None,
    tokenizer_file=None,
    clean_up_tokenization_spaces=False,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
    add_bos_token=True,
    add_eos_token=False,
    use_default_system_prompt=False,
    add_prefix_space=None,
    **kwargs,
):
After digging deeper into transformers.PreTrainedTokenizerFast._save_pretrained, I found a code snippet suggesting that fast tokenizers in transformers save only tokenizer.json, without tokenizer.model:
if save_fast:
    tokenizer_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + TOKENIZER_FILE
    )
    self.backend_tokenizer.save(tokenizer_file)
    file_names = file_names + (tokenizer_file,)
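To illustrate why a single tokenizer.json can be enough, here is a minimal sketch of the same save call made directly on a tokenizers.Tokenizer. The three-word vocabulary and the temporary path are made up for illustration; they are not part of my real setup.

```python
import json
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary, standing in for a tokenizer trained on a corpus.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
backend = Tokenizer(WordLevel(vocab=vocab, unk_token="<unk>"))
backend.pre_tokenizer = Whitespace()

# Same call as self.backend_tokenizer.save(tokenizer_file) in the snippet above.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
backend.save(path)

# The JSON file stores the model (including the vocabulary) plus the pipeline steps.
with open(path, encoding="utf-8") as f:
    top_level_keys = sorted(json.load(f).keys())

# Reloading from that one file restores a working tokenizer.
restored = Tokenizer.from_file(path)
tokens = restored.encode("hello world").tokens
```

Here the vocabulary lives inside tokenizer.json itself, so no separate model file is needed to reload the tokenizer.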
Trial
So I simply called xxTokenizerFast.from_pretrained('dir_contained_my_tokenizer.json'), and it works with the default config. I can then modify it manually and call save_pretrained to get tokenizer_config.json.
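A related route, sketched here under assumptions, is to wrap the tokenizers.Tokenizer object in the generic PreTrainedTokenizerFast class through its tokenizer_object argument and then call save_pretrained. The toy corpus, the WordLevel model, and the special-token names below are placeholders for illustration, not my actual tokenizer.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Hypothetical toy corpus standing in for the real training data.
corpus = ["hello world", "hello tokenizers", "fast tokenizers in transformers"]

# Train a small tokenizer with the tokenizers library.
raw = Tokenizer(WordLevel(unk_token="<unk>"))
raw.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["<unk>", "<s>", "</s>"])
raw.train_from_iterator(corpus, trainer=trainer)

# Wrap it so that transformers can use it; special tokens are declared by hand.
fast = PreTrainedTokenizerFast(
    tokenizer_object=raw,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
)

# save_pretrained writes tokenizer.json and tokenizer_config.json to the directory.
save_dir = tempfile.mkdtemp()
fast.save_pretrained(save_dir)
saved_files = sorted(os.listdir(save_dir))

# Reload from the directory and encode a sample sentence.
reloaded = PreTrainedTokenizerFast.from_pretrained(save_dir)
ids = reloaded("hello world")["input_ids"]
decoded_tokens = reloaded.convert_ids_to_tokens(ids)
```

In this sketch no post-processor is configured, so no BOS or EOS tokens are inserted automatically; a model-specific class such as LlamaTokenizerFast may behave differently.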
Query
I still have some questions I need help with.
- What is the role of tokenizer.model? Is it a subset of tokenizer.json?
- Is my conversion method correct, or is there a better one?
Related to #1469. We don't have a very clear guide yet; it should come!
In the meantime, checking out https://github.com/huggingface/transformers/blob/4ac645305b3ea05fe834f56a3ac6095f872c27ca/src/transformers/convert_slow_tokenizer.py#L1239 should help!