Generate HuggingFace tokenizer configuration as part of megatron2hf.py (weight conversion)
andreaskoepf opened this issue · 2 comments
The current weight conversion script doesn't generate a corresponding HuggingFace tokenizer configuration. Ideally the tokenizer configuration (`special_tokens_map.json`, `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`) should be generated as part of the megatron2hf conversion script.
As a temporary solution I created a `create_hf_tokenizer_config.py` script that generates a HF tokenizer configuration with token ids matching the Megatron-LLM tokenizers, with support for additional custom tokens.
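For reference, a minimal sketch of the idea (assuming a Llama-style SentencePiece `tokenizer.model`; the file paths and custom tokens below are placeholders, not taken from the actual script):

```python
# Hedged sketch: build a HF tokenizer config from the SentencePiece model used
# for Megatron training, appending custom tokens so their ids line up with the
# ids the Megatron-LLM tokenizer assigned (appended after the base vocab).
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer(vocab_file="tokenizer.model")  # placeholder path

# Custom tokens that were passed via vocab_extra_ids_list (placeholders);
# the order must match the Megatron side so the ids agree.
extra_tokens = ["<|im_start|>", "<|im_end|>"]
tokenizer.add_tokens(extra_tokens, special_tokens=True)

# Writes special_tokens_map.json, tokenizer_config.json and tokenizer.model.
tokenizer.save_pretrained("converted_hf_model")
```

Generating `tokenizer.json` would additionally require the fast tokenizer (`LlamaTokenizerFast`), which can be built from the same files.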
Additionally I noticed the following points:
- Unlike `_SentencePieceTokenizer`, the `_FalconTokenizer` doesn't add special tokens like `<CLS>`, `<SEP>`, `<EOD>`, `<MASK>` and also uses the standard EOS token (`<|endoftext|>`) as EOD token.
- For `_SentencePieceTokenizer` the use of custom tokens is tied to adding the special tokens (`<CLS>`, `<SEP>`, `<EOD>`, `<MASK>` are added when `new_tokens == True`) even though they might not be used (eod should always be mapped to eos (`</s>`) since it is used by `get_ltor_masks_and_position_ids()` when `reset_position_ids` or `reset_attention_mask` are `True`).
- `SentencePieceTokenizer` requires a vocab file and the test for it should not be excluded here only to do the check a few lines below.
Could you please elaborate on the second point? I think depending on the settings used when the data was tokenized (i.e. whether `new_tokens=True` or not), during training the code will either look for the `<eos>` or `<eod>` token, right? Sorry if I misunderstood something.
> Could you please elaborate on the second point?
Defining custom tokens (passed via `vocab_extra_ids_list`) currently implies the addition of the built-in special tokens `<CLS>`, `<SEP>`, `<EOD>`, `<MASK>`. Adding these built-in special tokens is not always necessary. I suggest that the ctor's `new_tokens` parameter should only control whether the built-in standard tokens are added and not influence the addition of tokens specified via `vocab_extra_ids_list`. The function `_add_special_token()` currently checks the `new_tokens` argument and it is also used for adding the entries in `vocab_extra_ids_list` ...
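As a rough sketch of the suggested decoupling (simplified, not the actual `_SentencePieceTokenizer` code; the attribute names are made up):

```python
# Sketch: new_tokens only gates the built-in special tokens, while entries
# from vocab_extra_ids_list are always registered.
class SentencePieceTokenizerSketch:
    def __init__(self, base_vocab_size, vocab_extra_ids_list=None, new_tokens=True):
        self._special_tokens = {}
        self._next_id = base_vocab_size  # new tokens are appended after the SP vocab

        if new_tokens:
            # Built-in special tokens stay optional.
            for tok in ("<CLS>", "<SEP>", "<EOD>", "<MASK>"):
                self._add_special_token(tok)

        # Custom tokens are added independently of new_tokens.
        if vocab_extra_ids_list:
            for tok in vocab_extra_ids_list.split(","):
                self._add_special_token(tok)

    def _add_special_token(self, token):
        # In this sketch the new_tokens check is moved out of the helper.
        if token not in self._special_tokens:
            self._special_tokens[token] = self._next_id
            self._next_id += 1
```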
Regarding eod: The current implementation of the `eod` property already returns `_eos_id` if `_eod_id` is `None`, so nothing needs to be changed there.
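In other words, the behaviour boils down to roughly this pattern (paraphrased, not copied from the repo):

```python
# Paraphrased: the eod property falls back to the EOS id (</s>) when no
# dedicated <EOD> token was added to the vocabulary.
class TokenizerEodSketch:
    def __init__(self, eos_id, eod_id=None):
        self._eos_id = eos_id
        self._eod_id = eod_id

    @property
    def eod(self):
        return self._eod_id if self._eod_id is not None else self._eos_id
```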
Background to EOD: The eod token appears at several locations in the Megatron code and can be used to separate documents within a sequence. For example, `GPTDataset` potentially concatenates several documents, and if EOD tokens were added via `preprocess_data.py` they can further be used for attention-masking and position-id resetting in `get_ltor_masks_and_position_ids()`.
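For illustration, a simplified stand-alone version of the position-id resetting (not the actual `get_ltor_masks_and_position_ids()` implementation, which also handles batching and attention-mask resetting):

```python
import torch

# Simplified sketch: restart position ids after each EOD token so every
# concatenated document in a packed sequence gets its own positions.
def reset_position_ids_at_eod(tokens: torch.Tensor, eod_id: int) -> torch.Tensor:
    position_ids = torch.arange(tokens.numel(), dtype=torch.long)
    prev_index = 0
    for i in (tokens == eod_id).nonzero(as_tuple=True)[0].tolist():
        # Tokens after this EOD start counting from 0 again.
        position_ids[i + 1:] -= i + 1 - prev_index
        prev_index = i + 1
    return position_ids

# Example with EOD id 2:
# tokens       = [5, 6, 7, 2, 8, 9]
# position ids = [0, 1, 2, 3, 0, 1]
print(reset_position_ids_at_eod(torch.tensor([5, 6, 7, 2, 8, 9]), eod_id=2))
```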