Generate HuggingFace tokenizer configuration as part of megatron2hf.py (weight conversion)
andreaskoepf opened this issue · 2 comments
The current weight conversion script doesn't generate a corresponding HuggingFace tokenizer configuration. Ideally the tokenizer configuration (`special_tokens_map.json`, `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`) should be generated as part of the megatron2hf conversion script.
As a temporary solution I created a `create_hf_tokenizer_config.py` script that generates a HF tokenizer configuration with token ids matching the Megatron-LLM tokenizers, with support for additional custom tokens.
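For reference, a minimal sketch of the idea (assuming a Llama-style SentencePiece `tokenizer.model`; the file paths and custom tokens below are placeholders, not taken from the actual script):

```python
# Hedged sketch: build a HF tokenizer config from the SentencePiece model used
# for Megatron training, appending custom tokens so their ids line up with the
# ids the Megatron-LLM tokenizer assigned (appended after the base vocab).
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer(vocab_file="tokenizer.model")  # placeholder path

# Custom tokens that were passed via vocab_extra_ids_list (placeholders);
# the order must match the Megatron side so the ids agree.
extra_tokens = ["<|im_start|>", "<|im_end|>"]
tokenizer.add_tokens(extra_tokens, special_tokens=True)

# Writes special_tokens_map.json, tokenizer_config.json and tokenizer.model.
tokenizer.save_pretrained("converted_hf_model")
```

Generating `tokenizer.json` would additionally require the fast tokenizer (`LlamaTokenizerFast`), which can be built from the same files.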
Additionally I noticed the following points:
- Unlike `_SentencePieceTokenizer`, the `_FalconTokenizer` doesn't add special tokens like `<CLS>`, `<SEP>`, `<EOD>`, `<MASK>` and also uses the standard EOS token (`<|endoftext|>`) as EOD token.
- For `_SentencePieceTokenizer` the use of custom tokens is tied to adding the special tokens (`<CLS>`, `<SEP>`, `<EOD>`, `<MASK>` are added when `new_tokens == True`) even though they might not be used (eod should always be mapped to eos (`</s>`) since it is used by `get_ltor_masks_and_position_ids()` when `reset_position_ids` or `reset_attention_mask` are `True`).
- `SentencePieceTokenizer` requires a vocab file and the test for it should not be excluded here only to do the check a few lines below.
Could you please elaborate on the second point? I think depending on the settings used when the data was tokenized (i.e. whether `new_tokens=True` or not), during training the code will either look for the `<eos>` or `<eod>` token, right? Sorry if I misunderstood something.
> Could you please elaborate on the second point?
Defining custom tokens (passed via `vocab_extra_ids_list`) currently implies the addition of the built-in special tokens `<CLS>`, `<SEP>`, `<EOD>`, `<MASK>`. Adding these built-in special tokens is not always necessary. I suggest that the ctor's `new_tokens` parameter should only control whether the built-in standard tokens are added and not influence the addition of tokens specified via `vocab_extra_ids_list`. The function `_add_special_token()` currently checks the `new_tokens` argument and it is also used for adding the entries in `vocab_extra_ids_list` ...
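As a rough sketch of the suggested decoupling (simplified, not the actual `_SentencePieceTokenizer` code; the attribute names are made up):

```python
# Sketch: new_tokens only gates the built-in special tokens, while entries
# from vocab_extra_ids_list are always registered.
class SentencePieceTokenizerSketch:
    def __init__(self, base_vocab_size, vocab_extra_ids_list=None, new_tokens=True):
        self._special_tokens = {}
        self._next_id = base_vocab_size  # new tokens are appended after the SP vocab

        if new_tokens:
            # Built-in special tokens stay optional.
            for tok in ("<CLS>", "<SEP>", "<EOD>", "<MASK>"):
                self._add_special_token(tok)

        # Custom tokens are added independently of new_tokens.
        if vocab_extra_ids_list:
            for tok in vocab_extra_ids_list.split(","):
                self._add_special_token(tok)

    def _add_special_token(self, token):
        # In this sketch the new_tokens check is moved out of the helper.
        if token not in self._special_tokens:
            self._special_tokens[token] = self._next_id
            self._next_id += 1
```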
Regarding eod: The current implementation of the `eod` property already returns `_eos_id` if `_eod_id` is `None`, so nothing needs to be changed there.
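In other words, the behaviour boils down to roughly this pattern (paraphrased, not copied from the repo):

```python
# Paraphrased: the eod property falls back to the EOS id (</s>) when no
# dedicated <EOD> token was added to the vocabulary.
class TokenizerEodSketch:
    def __init__(self, eos_id, eod_id=None):
        self._eos_id = eos_id
        self._eod_id = eod_id

    @property
    def eod(self):
        return self._eod_id if self._eod_id is not None else self._eos_id
```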
Background to EOD: The eod token appears at several locations in the Megatron code and can be used to separate documents within a sequence. For example, `GPTDataset` potentially concatenates several documents, and if EOD tokens were added via `preprocess_data.py` they can further be used for attention-masking and position-id resetting in `get_ltor_masks_and_position_ids()`.
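For illustration, a simplified stand-alone version of the position-id resetting (not the actual `get_ltor_masks_and_position_ids()` implementation, which also handles batching and attention-mask resetting):

```python
import torch

# Simplified sketch: restart position ids after each EOD token so every
# concatenated document in a packed sequence gets its own positions.
def reset_position_ids_at_eod(tokens: torch.Tensor, eod_id: int) -> torch.Tensor:
    position_ids = torch.arange(tokens.numel(), dtype=torch.long)
    prev_index = 0
    for i in (tokens == eod_id).nonzero(as_tuple=True)[0].tolist():
        # Tokens after this EOD start counting from 0 again.
        position_ids[i + 1:] -= i + 1 - prev_index
        prev_index = i + 1
    return position_ids

# Example with EOD id 2:
# tokens       = [5, 6, 7, 2, 8, 9]
# position ids = [0, 1, 2, 3, 0, 1]
print(reset_position_ids_at_eod(torch.tensor([5, 6, 7, 2, 8, 9]), eod_id=2))
```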