snu-mllab/Context-Memory

unexpected response when using llama2-7b-chat


Hello!

I'm trying to use your pre-trained model with this command:
CUDA_VISIBLE_DEVICES=4,5,6,7 python inference.py -i -m llama-2-7b-chat --eval_name concat_recur

However, generation stops unexpectedly when I input the query:
help me list popular songs written by Taylor Swift.

The result is shown as follows:
[screenshot of the truncated model output]

It stops generating more content and outputs </s> instead.

Are there any other settings I missed?

Hello!
I just tried the query with the given command on the current GitHub code.

At the beginning of the chat, the model produces the list:
[screenshot of the model output before compression]

However, after compression, the model seems to produce an EOS token before the list:
[screenshot of the model output after compression]

Comparing the results above, the generation code does not seem to be the problem. My suspicion is that our training data (for the compression adapter) consists mainly of sentences without \n tokens, and that this causes the behavior above. To solve the problem, I think we need to design new training data.
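For example, a quick way to check this hypothesis would be something like the sketch below (the file path and field name are hypothetical; the actual dataset layout differs):

```python
# Rough check (hypothetical path/format): how many training examples
# for the compression adapter contain a newline character?
import json

total, with_newline = 0, 0
with open("train_data.jsonl") as f:  # hypothetical file, adjust to the real dataset
    for line in f:
        text = json.loads(line).get("text", "")
        total += 1
        with_newline += "\n" in text  # bool adds as 0/1
print(f"{with_newline}/{total} examples contain a newline")
```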

Thanks so much for your quick reply!

I have another question about the class LinearMask() in most of the modeling files under the "arch" directory. As shown in the figure below, the forward method of LinearMask() takes comp_mask as an input. However, the actual computation never uses this variable.

[screenshot of LinearMask() in the arch modeling files]

If this variable is not used, the linear mapping function is the same as the original function in "modeling_llama.py".
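For reference, the pattern I'm referring to looks roughly like this (my own paraphrase of the screenshot, not the exact repository code):

```python
import torch
import torch.nn.functional as F

class LinearMask(torch.nn.Linear):
    # Paraphrased sketch: comp_mask is accepted but never referenced,
    # so this reduces to the plain linear projection in modeling_llama.py.
    def forward(self, x: torch.Tensor, comp_mask=None):
        return F.linear(x, self.weight, self.bias)
```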


It is an interesting phenomenon that the compression tokens affect the generation capability.

Regarding the question about LinearMask: comp_mask is used together with LoRA. I modified the Hugging Face LoRA code at src/peft_custom/lora.py so that the forward signature becomes:

def forward(self, x: torch.Tensor, comp_mask=None):

Without LoRA, our model works the same as the original function, while LoRA is activated only for the compression tokens.
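For illustration, here is a rough sketch of the idea (not the exact code in src/peft_custom/lora.py; the shapes and hyperparameters are assumptions):

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """Sketch of comp_mask-gated LoRA (illustrative, not the repo's implementation)."""

    def __init__(self, in_features, out_features, r=8, alpha=16, bias=True):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=bias)  # base projection
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        self.scaling = alpha / r
        nn.init.zeros_(self.lora_B.weight)  # LoRA update starts as a no-op

    def forward(self, x: torch.Tensor, comp_mask=None):
        # Base projection: identical to the original linear layer in modeling_llama.py.
        out = self.base(x)
        if comp_mask is None:
            return out  # without LoRA gating, behaves like a plain linear layer
        # LoRA branch, applied only at positions flagged as compression tokens.
        lora_out = self.lora_B(self.lora_A(x)) * self.scaling
        # comp_mask: (batch, seq_len) bool tensor, True at compression-token positions.
        return out + lora_out * comp_mask.unsqueeze(-1).to(lora_out.dtype)
```

The base projection path is untouched, and the LoRA update is zeroed out everywhere except at the compression-token positions.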