How to convert a tokenizers.Tokenizer to XXTokenizerFast in transformers?

Question

rangehow opened this issue 3 months ago · 3 comments

Motivation

I followed the guide build-a-tokenizer-from-scratch and got a single tokenizer.json from my corpus. Since I'm not sure whether it is compatible with the Trainer, I want to convert it to an XXTokenizerFast in transformers.

Observation

In llama2-7b-hf, the tokenizer seems to consist of these files:
tokenizer.json ✅ I have this
tokenizer.model ✖ I don't have this, and I'm not sure what it is used for
tokenizer_config.json ✖ I don't have this, but it doesn't look that important. I can set it manually.
Initializing a LlamaTokenizerFast from scratch through its __init__ function seems to require both tokenizer.model (vocab_file) and tokenizer.json (tokenizer_file), but I don't have a tokenizer.model.

def __init__(
    self,
    vocab_file=None,
    tokenizer_file=None,
    clean_up_tokenization_spaces=False,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
    add_bos_token=True,
    add_eos_token=False,
    use_default_system_prompt=False,
    add_prefix_space=None,
    **kwargs,
):

After digging deeper into transformers.PreTrainedTokenizerFast._save_pretrained, I found a code snippet suggesting that fast tokenizers in transformers save only tokenizer.json, without tokenizer.model:

if save_fast:
    tokenizer_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + TOKENIZER_FILE
    )
    self.backend_tokenizer.save(tokenizer_file)
    file_names = file_names + (tokenizer_file,)
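
The observation above can be checked with a small sketch that uses only the tokenizers library. It suggests why a single tokenizer.json can be sufficient for a fast tokenizer: the backend tokenizers.Tokenizer serializes its model (vocabulary and merges) together with the pipeline steps into one JSON document. The two-sentence corpus and the trainer settings here are placeholders, not the setup from the issue.

```python
import json

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Placeholder corpus (assumption): stands in for the real training data.
corpus = ["hello world", "hello tokenizers"]

# Train a very small BPE tokenizer in memory.
raw = Tokenizer(models.BPE(unk_token="<unk>"))
raw.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["<unk>"], show_progress=False)
raw.train_from_iterator(corpus, trainer=trainer)

# to_str() returns the same JSON that save("tokenizer.json") would write.
serialized_text = raw.to_str()
serialized = json.loads(serialized_text)
top_level_keys = sorted(serialized.keys())

# Rebuilding from that JSON alone gives an equivalent tokenizer.
restored = Tokenizer.from_str(serialized_text)
original_ids = raw.encode("hello world").ids
restored_ids = restored.encode("hello world").ids
```

Printing top_level_keys shows which sections the file contains; the vocabulary sits under the "model" entry.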

Trial

So I simply call xxTokenizerFast.from_pretrained('dir_contained_my_tokenizer.json'), and it works with the default config. I can then modify the config manually and call save_pretrained to get tokenizer_config.json.
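
An alternative to loading from a directory is to wrap the trained object in the generic PreTrainedTokenizerFast class through its tokenizer_object argument (or tokenizer_file for a path to tokenizer.json). The following is a minimal sketch under assumptions: the corpus, vocabulary size, and special tokens are placeholders, and a model-specific class such as LlamaTokenizerFast may expect additional settings not shown here.

```python
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Placeholder corpus and special tokens (assumptions, not from the issue).
corpus = ["hello world", "hello tokenizers", "fast tokenizers in transformers"]
special_tokens = ["<unk>", "<s>", "</s>"]

# Train a small BPE tokenizer, roughly as in the build-from-scratch guide.
raw = Tokenizer(models.BPE(unk_token="<unk>"))
raw.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=200, special_tokens=special_tokens, show_progress=False
)
raw.train_from_iterator(corpus, trainer=trainer)

# Wrap the in-memory tokenizer; no tokenizer.model file is involved.
fast = PreTrainedTokenizerFast(
    tokenizer_object=raw,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
)
ids = fast("hello world")["input_ids"]

# save_pretrained writes tokenizer.json plus the config files,
# and from_pretrained can load the directory back.
with tempfile.TemporaryDirectory() as save_dir:
    saved_files = fast.save_pretrained(save_dir)
    reloaded = PreTrainedTokenizerFast.from_pretrained(save_dir)
    reloaded_ids = reloaded("hello world")["input_ids"]
```

Passing the special tokens explicitly matters because the wrapper does not infer which vocabulary entries play the unk, bos, and eos roles.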

Query

I still have some questions I need help with.

What is the role of tokenizer.model? Is it a subset of tokenizer.json?
Is my conversion method correct, or is there a better method?

Answer 1 · 2024-03-27T15:22:27.000Z

Related to #1469. We don't have a very clear guide yet. It should come!

Answer 2 · 2024-03-27T15:22:45.000Z

In the meantime, check out https://github.com/huggingface/transformers/blob/4ac645305b3ea05fe834f56a3ac6095f872c27ca/src/transformers/convert_slow_tokenizer.py#L1239, which should help!

Answer 3 · 2024-04-27T01:48:54.000Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.