huggingface/transformers

Regarding padding and batched inference for LLAMA-2 and CodeLLAMA

anmolagarwal999 opened this issue · 23 comments

System Info

Platform:

  • transformers_version: "4.33.0.dev0"
  • Python: 3.8

Who can help?

@ArthurZucker @younesbelkada @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Regarding LLAMA-2 CHAT

I have been using LLAMA-2 13B chat for batched inference. I have followed the steps in the TIPS section here. My question is about which padding_side to choose. I have tried setting padding_side to both left and right, and my observations are as follows (a minimal sketch of the setup is included below):

  • The results with padding_side = left are really very bad. The results with padding_side = right seem to be coherent and very good. This also seems to be backed up by the discussion here.
  • However, on using the model with padding_side = right, I get the warning: A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Which padding_side should be used?
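
For context, a minimal sketch of the comparison described above, assuming the meta-llama/Llama-2-13b-chat-hf checkpoint and using the eos token as the pad token (both are assumptions, not details from the original report):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(model_id)
prompts = ["Write a haiku about the sea.", "Explain recursion in one sentence."]

for side in ("left", "right"):
    tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side=side)
    tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(side, tokenizer.batch_decode(outputs, skip_special_tokens=True))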

Regarding CodeLLAMA

The model page for CodeLLAMA has no guidelines on how to deal with the absence of a padding token. It would be good to have some documentation on questions such as which padding token should be set, which padding_side should be used, etc.

Expected behavior

Consistent behaviour, i.e. the better results should come from the configuration that does not raise the warning.

Hey! The warning is a general warning. Left padding is the usual recommendation, but the original Llama codebase (and Code-Llama is part of the Llama codebase) uses right padding by default. Our goal is to have similar results out of the box (hence right padding) while still letting users get the best results, which is why we give a recommendation on the padding side.
There is a guideline: CodeLlama is the same as Llama. Would it be clearer if the tip section is copied over to CodeLlama?

Would it be clearer if the tip section is copied over to CodeLlama?

Yes it would help. Should I create a PR for this?

Sure 😉

gante commented

Hey @anmolagarwal999 👋

Out of curiosity, have you passed the attention mask that came out of the tokenizer to model.generate? Not passing it is a common cause of performance degradation that would explain what you're seeing :)
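
In other words, the pitfall being hinted at looks roughly like this (variable names are illustrative):

inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# Common mistake: only the ids are passed, so pad tokens are attended to
output_bad = model.generate(inputs.input_ids, max_new_tokens=64)

# Correct: **inputs also forwards the attention_mask produced by the tokenizer
output_good = model.generate(**inputs, max_new_tokens=64)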

gante commented

Hi @rafa852 👋 Have a look at this doc section about padding sides: https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side

As for the padding token, it's common to set tokenizer.pad_token = tokenizer.eos_token :)
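
That is, something along these lines (the checkpoint name here is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # reuse the eos token for padding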

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Is there any way to suppress the warning?

If you read the issue you will see that you can simply do tokenizer.pad_token = tokenizer.eos_token. Could you read the issue?

is it the same as setting tokenizer.pad_token = 0 ?

No, you cannot set a token to an id. It is the same as tokenizer.pad_token_id = 0 if tokenizer.eos_token_id is 0

No, you cannot set a token to an id. It is the same as tokenizer.pad_token_id = 0 if tokenizer.eos_token_id is 0

Sorry, I mean setting tokenizer.pad_token_id = 0 while tokenizer.eos_token_id != 0. 0 is actually the id of the <unk> token in the llama2 config. Would this affect the inference-time results?

It should not really affect inference; by default that is what is used. Feel free to use the eos token, as it is common practice.

@ArthurZucker Hi, just to make sure I understood this issue correctly: to run batched generation with llama 2 models, is this enough?

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-hf") # padding_side will default to "right"
tokenizer.pad_token_id = tokenizer.eos_token_id

I can't be 100% sure from either this issue or the tip section.

@gpucce left-padding should be used for batched inference (see this comment)
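
Concretely, adjusting the snippet above, the only change needed is the padding side (same assumed checkpoint path; everything else stays the same):

tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-hf", padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id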

@gante thank you very much, would this be the case also for T5 models?

@gpucce nope, encoder-decoder models should use right-padding :)

@gpucce nope, encoder-decoder models should use right-padding :)

@gante can you explain a bit why this is the case?

@gpucce Decoder-only models continue generating from the input prompt and can't have gaps between the end of the prompt and the start of generation. They were not trained to handle these gaps.

Encoder-decoder models convert the input prompt into an encoded vector, which is fed to a decoder. In this case, the decoder starts with an embedded input and input_ids = [bos_token_id]. The encoder was trained to handle padding on the right, but not padding on the left.
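
To make the contrast concrete, here is a sketch with an encoder-decoder model (t5-small is chosen purely as an example); its tokenizer pads on the right by default, which is exactly what the encoder expects:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer(["translate English to German: Hello.",
                    "translate English to German: How are you today?"],
                   return_tensors="pt", padding=True)
# The encoder consumes the right-padded batch; the decoder starts from its own start token
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))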

Hi @gante and @ArthurZucker , your responses above are really helpful! Could you point me to the code how positional embedding deals with the left padding?

I am asking because if an absolute positional embedding is used, the positional embedding also needs to be left-padded, i.e., right-shifted, so that the first position is correctly added to the first input token. For instance, the sinusoid embedding in the vanilla transformer and the rope embedding in llama both need this type of shifting. I also found an earlier discussion here which was quite helpful as an illustration. Since I tried both left and right padding with llama2-7b-chat (I am curious why llama2 also works with right padding, which shouldn't be the case for decoder-only LLMs) and found the output quite good, I guess this type of positional shifting is implemented somewhere in the codebase, but I cannot find it. Can you point me to where it is in the code?

Do you mean this: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L132-L145 ?

Thanks! Oh, I mean when the attention_mask generated by the tokenizer is provided as shown here, how does the rope positional embedding deal with the padding in the attention_mask, e.g., by right-shifting the embedding?

@ShengYun-Peng with generate, we derive the correct position indexes from the attention_mask, regardless of padding (here). The position ids are what's actually used to get the position embeddings.

If you call forward (e.g. if you're building your own custom generation loop), you need to provide position_ids if there is left-padding. If there is no padding or right-padding, we default to the correct position_ids (here)
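
For reference, the pattern used there to derive position ids from the attention mask is roughly the following cumulative-sum trick (a sketch, not the exact library code):

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],    # left-padded sequence
                               [1, 1, 1, 1, 1]])   # unpadded sequence

position_ids = attention_mask.long().cumsum(-1) - 1   # count real tokens seen so far
position_ids.masked_fill_(attention_mask == 0, 1)     # pad positions get a dummy id
print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])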