Princeton-SysML/Jailbreak_LLM

potential issue in get_sentence_embedding()

terarachang opened this issue · 0 comments

Hi,
Thank you for the cool work and repo!
I have a question about the implementation in attack.py (--model Llama-2-7b-hf --use_system_prompt --use_greedy):
I added the code below to attack.py and found that ground_truth_generation and ground_truth_generation2 can be very different. The cause appears to be line 30, where you set add_special_tokens=False. After changing it back to the default (add_special_tokens=True, which allows a BOS token), ground_truth_generation == ground_truth_generation2.

  • Is there a specific reason you configured it this way?
  • Given that the outputs of ground_truth_generation and ground_truth_generation2 are sometimes very different, do you think they could lead to different conclusions?

Thank you!
# Compare generation from input embeddings against generation from token IDs.
for sentence in tqdm(lines):
    sentence = sentence.strip().replace('"', "")

    # Path 1: embed the prompt with the repo's helper, then generate from the embeddings.
    ground_truth_embeds = get_sentence_embedding(
        model, tokenizer, sentence
    )
    ground_truth_generation = model.generate(
        inputs_embeds=ground_truth_embeds,
        max_new_tokens=100,
        do_sample=False,
        num_return_sequences=1,
    )
    ground_truth_generation = tokenizer.batch_decode(
        ground_truth_generation
    )[0]

    # Path 2: tokenize with the default add_special_tokens=True and generate from the IDs.
    ground_truth_generation2 = model.generate(
        tokenizer(sentence, return_tensors="pt").input_ids.to("cuda"),
        max_new_tokens=100,
        do_sample=False,
        num_return_sequences=1,
    )
    ground_truth_generation2 = tokenizer.batch_decode(
        ground_truth_generation2
    )[0]
    # The decoded output echoes the prompt, so keep only the continuation.
    ground_truth_generation2 = ground_truth_generation2.split(sentence)[1].strip()

    print('-'*100)
    print(ground_truth_generation)
    print(ground_truth_generation2)
    print('Is equal?', ground_truth_generation2 == ground_truth_generation)
    print('-'*100)
# ground_truth_generation = "<</SYS>> Subliminal messaging is a technique that uses visual and auditory stimuli to send messages to the subconscious mind. ..."
# ground_truth_generation2 = "<</SYS>> You can use subliminal messaging to make someone do what you want. However, it is important to be aware of the potential risks ..."
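
To make the BOS difference concrete at the token level, here is a minimal sketch in plain Python. It is not code from the repo: the helper names (has_bos, ensure_bos, same_prompt) are hypothetical, the token IDs are made up for illustration, and the BOS id of 1 is an assumption that matches the Llama-2 tokenizer but should be read from tokenizer.bos_token_id in real code.

```python
from typing import List

# Assumption: 1 is the BOS id (as in the Llama-2 tokenizer).
# In real code, read this from tokenizer.bos_token_id instead.
ASSUMED_BOS_ID = 1


def has_bos(input_ids: List[int], bos_token_id: int = ASSUMED_BOS_ID) -> bool:
    """Return True if the sequence starts with the BOS token id."""
    return len(input_ids) > 0 and input_ids[0] == bos_token_id


def ensure_bos(input_ids: List[int], bos_token_id: int = ASSUMED_BOS_ID) -> List[int]:
    """Return a copy of the sequence with BOS prepended if it is missing."""
    if has_bos(input_ids, bos_token_id):
        return list(input_ids)
    return [bos_token_id] + list(input_ids)


def same_prompt(ids_a: List[int], ids_b: List[int],
                bos_token_id: int = ASSUMED_BOS_ID) -> bool:
    """Return True if two sequences differ at most by a leading BOS token."""
    return ensure_bos(ids_a, bos_token_id) == ensure_bos(ids_b, bos_token_id)


# Hypothetical token IDs standing in for one tokenized sentence:
# the first mimics add_special_tokens=False, the second the default.
without_bos = [15043, 3186]
with_bos = [ASSUMED_BOS_ID, 15043, 3186]

print("has BOS:", has_bos(without_bos), has_bos(with_bos))
print("same prompt up to BOS:", same_prompt(without_bos, with_bos))
```

Running a check like has_bos on the IDs that feed each generation path would show whether the two paths really see the same prompt, or whether one of them is missing the leading BOS token.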