Regarding padding and batched inference for LLAMA-2 and CodeLLAMA
anmolagarwal999 opened this issue · 23 comments
System Info
Platform:
- transformers_version: "4.33.0.dev0"
- Python: 3.8
Who can help?
@ArthurZucker @younesbelkada @gante
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Regarding LLAMA-2 CHAT
I have been using LLAMA-2 13B chat
for batched inference. I have followed the steps in the TIPS section here. My question concerns which padding_side to choose. I have tried setting the padding_side to both left and right, and my observations are as follows:
- The results with padding_side = left are really very bad. The results with padding_side = right seem coherent and very good. This also seems to be backed up here.
- However, on using the model with padding_side = right, I get the warning:
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Which padding_side should be used?
Regarding CodeLLAMA
The model page for CodeLLAMA has no guidelines on how to deal with the absence of a padding token. It would be good to have some documentation covering questions such as which padding token to set, which padding_side to use, etc.
Expected behavior
Consistent behaviour, i.e., the better results should occur in the case where no warning is raised.
Hey! The warning is a general warning. Left padding is the usual recommendation, but the original Llama codebase (and Code-Llama is part of the Llama codebase) uses right padding by default. Our goal is to give similar results out of the box (hence right padding) while still allowing users to get the best results, which is why we give a recommendation on the padding side.
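To make the left/right distinction concrete, here is a minimal pure-Python sketch (no transformers dependency; the `pad_batch` helper and the token ids are made up for illustration). With left padding, the last position of every row is a real token, so a decoder-only model can continue generating directly from the prompt:

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad variable-length token-id lists to a common length.

    Returns (input_ids, attention_mask); the mask is 1 for real
    tokens and 0 for padding.
    """
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if side == "left":
            input_ids.append([pad_id] * n_pad + seq)
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:  # right padding: pads come after the prompt
            input_ids.append(seq + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [[5, 6, 7], [8, 9]]
left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
# left_ids  == [[5, 6, 7], [0, 8, 9]]
# left_mask == [[1, 1, 1], [0, 1, 1]]
```

With side="right" the shorter prompt would instead end in pad tokens, leaving a gap between the prompt and the first generated token.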
There is a guideline: CodeLlama is the same as Llama. Would it be clearer if the tip section is copied over to CodeLlama?
Would it be clearer if the tip section is copied over to CodeLlama?
Yes it would help. Should I create a PR for this?
Sure!
Hey @anmolagarwal999!
Out of curiosity, have you passed the attention mask that came out of the tokenizer to model.generate? Forgetting to pass it is a common cause of performance degradation that would explain what you're seeing :)
Hi @rafa852! Have a look at this doc section about padding sides: https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side
As for the padding token, it's common to set tokenizer.pad_token = tokenizer.eos_token
:)
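As a rough illustration of why reusing the eos token as the pad token works (pure Python, no transformers dependency; 2 is Llama-2's eos_token_id, the other token ids are made up): the attention mask, not the token id, tells the model which positions are padding.

```python
# Illustration only: 2 is Llama-2's eos_token_id; other ids are made up.
eos_token_id = 2
pad_token_id = eos_token_id  # mimics tokenizer.pad_token = tokenizer.eos_token

prompt = [5, 6, 7, eos_token_id]              # prompt ending with a real eos
input_ids = [pad_token_id] * 2 + prompt       # left-padded to length 6
attention_mask = [0] * 2 + [1] * len(prompt)  # pad positions masked out

# The model distinguishes the real eos token from padding through the
# attention mask, so reusing eos as pad does not corrupt the prompt.
```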
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is there any way to suppress the warning?
If you read the issue you will see that you can simply do tokenizer.pad_token = tokenizer.eos_token. Could you read the issue?
Is it the same as setting tokenizer.pad_token = 0?
No, you cannot set a token to an id. It is the same as tokenizer.pad_token_id = 0 if tokenizer.eos_token_id is 0.
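A toy sketch of the token-vs-id distinction (the TokenizerLike class is a made-up stand-in whose attribute names mirror the real tokenizer API; "&lt;/s&gt;" and 2 are Llama's eos token and id):

```python
class TokenizerLike:
    """Toy stand-in for a tokenizer's pad/eos attributes (illustration only)."""

    def __init__(self):
        self.eos_token, self.eos_token_id = "</s>", 2  # Llama's eos token/id
        self.pad_token, self.pad_token_id = None, None

tok = TokenizerLike()
tok.pad_token = tok.eos_token        # sets the *string* token: "</s>"
tok.pad_token_id = tok.eos_token_id  # sets the *integer* id: 2
# tok.pad_token = 0 would wrongly store an int where a string belongs.
```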
Sorry, I mean setting tokenizer.pad_token_id = 0 while tokenizer.eos_token_id != 0. 0 is actually the id of the <unk> token in the llama2 config. Would this affect the inference-time results?
It should not really affect inference; by default, that is what is used. Feel free to use the eos token instead, as it is common practice.
@ArthurZucker Hi, just to make sure I understood correctly from this issue, to run batched generation with llama 2 models is this enough?
tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-hf") # padding_side will default to "right"
tokenizer.pad_token_id = tokenizer.eos_token_id
I can't be 100% sure from reading either this issue or the tip section.
@gpucce left-padding should be used for batched inference (see this comment)
@gpucce Decoder-only models continue generating from the input prompt and can't have gaps between the end of the prompt and the start of generation. They were not trained to handle these gaps.
Encoder-decoder models convert the input prompt into an encoded vector, which is fed to a decoder. In this case, the decoder starts with an embedded input and input_ids = [bos_token_id]
. The encoder was trained to handle padding on the right, but not padding on the left.
Hi @gante and @ArthurZucker , your responses above are really helpful! Could you point me to the code how positional embedding deals with the left padding?
I am asking because if an absolute positional embedding is used, it also needs to be left padded, i.e., right shifted, so that the first position is correctly added to the first input token. For instance, the sinusoidal embedding in the vanilla Transformer and the RoPE embedding in Llama both need this type of shifting. I also found an earlier discussion here which was quite helpful as an illustration. Since I tried both left and right padding with llama2-7b-chat (curious why llama2 also works with right padding, which shouldn't be the case for decoder-only LLMs in general) and found the output quite good, I guess this type of positional shifting is implemented somewhere in the codebase, but I cannot find it. Can you point me to where it is in the code?
Do you mean this: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L132-L145 ?
Thanks! Oh, I mean: when the attention_mask generated by the tokenizer is provided as shown here, how does the RoPE positional embedding deal with the padding in the attention_mask, e.g., does it right-shift the embedding?
@ShengYun-Peng with generate
, we derive the correct position indexes from the attention_mask
, regardless of padding (here). The position ids are what's actually used to get the position embeddings.
If you call forward
(e.g. if you're building your own custom generation loop), you need to provide position_ids
if there is left-padding. If there is no padding or right-padding, we default to the correct position_ids
(here)
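A pure-Python sketch of that derivation (mirroring the cumsum-based logic linked above, where padded slots get a dummy position; the helper name is made up):

```python
def position_ids_from_mask(attention_mask):
    """Derive per-token position ids from an attention mask, mirroring
    what generate() does for decoder-only models: real tokens get
    consecutive positions starting at 0, padded slots get a dummy value
    (they are masked out of attention anyway).
    """
    position_ids = []
    for row in attention_mask:
        running, ids = 0, []
        for m in row:
            if m:
                ids.append(running)  # position counts only real tokens
                running += 1
            else:
                ids.append(1)        # dummy position for a pad slot
        position_ids.append(ids)
    return position_ids

# Left-padded batch: the first real token still gets position 0,
# so RoPE sees the prompt as if it started at the beginning.
mask = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
# position_ids_from_mask(mask) == [[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]]
```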