huggingface/tokenizers

How to convert tokenizers.tokenizer to XXTokenizerFast in transformers?

rangehow opened this issue · 3 comments

Motivation

I followed the guide build-a-tokenizer-from-scratch and got a single tokenizer.json from my corpus. Since I'm not sure if it is compatible with the trainer, I want to convert it back to XXTokenizerFast in transformers.

Observation

In llama2-7b-hf, tokenizer file seems consist of
tokenizer.json ✅ I have
tokenizer.model ✖ I don't have, not sure its usage
tokenizer_config.json ✖ I don't have, but this looks like not that important. I can manually set this.
Initialize a LlamaTokenizerFast from scratch through __init__ function seems to require tokenizer.model and tokenizer.json, but I don't get a tokenizer.model.

def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        clean_up_tokenization_spaces=False,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        add_bos_token=True,
        add_eos_token=False,
        use_default_system_prompt=False,
        add_prefix_space=None,
        **kwargs,
    ):

After dive deeper in transformers.PreTrainedTokenizerFast._save_pretrained, I found a code snippet in which fastTokenizer in transformers seems save tokenizer.json only without tokenizer.model

if save_fast:
            tokenizer_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + TOKENIZER_FILE
            )
            self.backend_tokenizer.save(tokenizer_file)
            file_names = file_names + (tokenizer_file,)

Trial

So I just typically use xxTokenizerFast.from_pretrained('dir_contained_my_tokenizer.json'), and it works with default config, I can modified it manually and save_pretrained to get tokenizer_config.json

Query

I still have some query needed help.

  1. What's the role of tokenizer.model? Is it a subset of tokenizer.json ?
  2. Is my conversion method correct ? or is there any better method?

related to #1469 We don't have a very clear guide. It should come!

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.