Llama-2-7b-chat-hf chat template
Closed this issue · 7 comments
Meta provides a chat template for the Llama-2-7b-chat-hf model like:
<s>[INST] <<SYS>>\n{your_system_message}\n<</SYS>>\n\n{user_message_1} [/INST]
or
<s>[INST] {user_message_1} [/INST]
but the template does not seem to be used in the code.
This will lead to a higher attack success rate. Is there any code implementation that adds the template?
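For reference, the Llama-2 single-turn template quoted above can be built by hand with plain string formatting; a minimal sketch (the `build_prompt` helper and its argument names are my own, not from this repo):

```python
from typing import Optional

# Tag strings from Meta's published Llama-2 chat template.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(user_message: str, system_message: Optional[str] = None) -> str:
    """Return the single-turn Llama-2 chat prompt for one user message."""
    if system_message is not None:
        content = f"{B_SYS}{system_message}{E_SYS}{user_message}"
    else:
        content = user_message
    return f"<s>{B_INST} {content} {E_INST}"

print(build_prompt("Hello!"))
# <s>[INST] Hello! [/INST]
```

With newer versions of transformers, `tokenizer.apply_chat_template` on the Llama-2 chat tokenizer should produce the same format.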
Without using the chat template to attack the chat model, the attack bypasses the model's own safety alignment. This experiment is therefore not consistent with the baselines compared in the paper, such as GCG. Referring to the template provided by FastChat, I modified the code and found that the method proposed in this paper is almost never able to attack successfully. Have the authors conducted similar experiments?
Same here. The default config does not enable the system prompt template; after adding the template, this method fails to attack.
Thank you for bringing this up. The setting without a system prompt has been widely used in previous works [Huang2023Catastrophic, Liu2023AutoDAN], and our main paper follows this convention. Nevertheless, we explore the impact of system prompts on COLD-Attack in Section D.8 of our arXiv paper. We acknowledge that generating strong, fluent attacks under system prompts is not fully resolved, as many attack methods, such as GCG, increase ASR at the cost of fluency. More work is needed in this setting, and we have also discussed potential solutions in our paper.
@xi1ngang, I agree with @wkwk-ai here: it seems slightly unfair to attack the chat model without using the chat template, as the model was only fine-tuned and aligned under the chat template. In your response, you mention the system prompt:
we acknowledge that generating strong, fluent attacks under system prompts is not fully resolved
But the chat template is not just the system prompt, it also includes the separation between the user's query and assistant's response. For example with llama 2 it's:
<s>[INST] <<SYS>>{system prompt}<</SYS>>{user query}[/INST] {generation starts here}
About the prior works, you say:
The setting without a system prompt has been widely used in previous works [Huang2023Catastrophic]
But somebody also raised the observation/question in their repo here.
One last comment here: the code in this repo to add the system prompt seems incomplete/incorrect because it leaves out the closing [/INST] from the chat template after the user query and before the generation. In other words, the completion and jailbreak seem to be done exclusively in the user-query part of the prompt rather than in the response part that should follow [/INST].
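To make the point concrete, a hedged sketch of what the fixed prompt construction could look like: the adversarial suffix stays inside the [INST] ... [/INST] block, and the closing tag is emitted so that generation starts in the response position. The `make_attack_prompt` helper and its argument names are hypothetical, not code from this repo:

```python
def make_attack_prompt(user_query: str, adv_suffix: str, system_prompt: str = "") -> str:
    """Build a Llama-2 chat prompt with the adversarial suffix in the user turn."""
    sys_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    # Closing [/INST] comes AFTER the suffix; the model's response follows it.
    return f"<s>[INST] {sys_block}{user_query} {adv_suffix} [/INST]"

prompt = make_attack_prompt("Tell me how to ...", "!!adv suffix!!")
print(prompt)
# <s>[INST] Tell me how to ... !!adv suffix!! [/INST]
```

This way the model is evaluated under the same chat format it was aligned with, and the jailbreak target is the assistant response after [/INST] rather than a continuation of the user query.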
@bamos Thanks for pointing this out. I agree that using the chat template as the default attack setting is appropriate, as it aligns with how LLMs are post-trained. In our main experiments, we initially followed previous work and did not include any system prompt or chat template. After posting the paper and code online, we realized there was a performance gap when the system prompt was included, as noted in an issue posted in this repo. We have since mentioned this gap in Section 5 of the updated main paper. We will reintroduce [/INST] in our experiments and update the results accordingly. Closing this gap using more efficient optimization methods than GCG would be an interesting research direction going forward. A notable paper in this area is PGD Attack on LLMs.