sherdencooper/GPTFuzz

How to fuzz closed-source LLMs, and a possible bug when calling OpenAI models

chinggg opened this issue · 5 comments

Thanks for making the code publicly available. I am trying to understand the codebase to see how GPTFuzzer interacts with target LLMs. The paper shows attack results on commercial LLMs like Bard and Claude2. However, I did not find any code attacking Bard/Claude2/PaLM2 in the current repo. That is understandable, since the authors already explain in the paper: "we did not have the API accesses to some commercial models. Therefore, we conducted attacks via web inference for Claude2, PaLM2, and Bard"

The code below shows that currently only OpenAI and open-source models are supported.

args_target.model_path = args.target_model
args_target.temperature = 0.01  # some models need a strictly positive temperature
MODEL_TARGET, TOK_TARGET = prepare_model_and_tok(args_target)

def create_model_and_tok(args, model_path):
    # Note that 'moderation' is only used for classification and cannot be used for generation
    openai_model_list = ['gpt-3.5-turbo-0613', 'gpt-3.5-turbo', 'gpt-3.5-turbo-0301', 'gpt-4-0613', 'gpt-4', 'gpt-4-0301', 'moderation']
    open_sourced_model_list = ['lmsys/vicuna-7b-v1.3', 'lmsys/vicuna-33b-v1.3', 'meta-llama/Llama-2-7b-chat-hf', 'lmsys/vicuna-13b-v1.3', 'THUDM/chatglm2-6b', 'meta-llama/Llama-2-13b-chat-hf', 'meta-llama/Llama-2-70b-chat-hf', 'baichuan-inc/Baichuan-13B-Chat']
    supported_model_list = openai_model_list + open_sourced_model_list
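
Judging from the None check further below, I assume prepare_model_and_tok returns a (model, tokenizer) pair and signals an OpenAI model by leaving the tokenizer as None. A rough sketch of how I read it (the function body here is my guess, not the repo's actual code):

from transformers import AutoModelForCausalLM, AutoTokenizer

OPENAI_MODELS = {'gpt-3.5-turbo', 'gpt-4'}  # abbreviated; see the full list above

def prepare_model_and_tok(args):
    # Guess: OpenAI models are queried remotely, so no local weights or
    # tokenizer are loaded and the tokenizer slot stays None.
    if args.model_path in OPENAI_MODELS:
        return args.model_path, None
    # Open-source models appear to be loaded locally for inference.
    model = AutoModelForCausalLM.from_pretrained(args.model_path)
    tok = AutoTokenizer.from_pretrained(args.model_path)
    return model, tok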

I tried to locate the code that interacts with the LLM, and it seems that OpenAI models are called through the function openai_request, while open-source models are inferred locally.

GPTFuzz/fuzz_utils.py

Lines 417 to 425 in 0cb85c0

if TOK_TARGET == None:  # openai model
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = {executor.submit(openai_request, prompt): prompt for prompt in inputs}
        for future in concurrent.futures.as_completed(futures):
            try:
                data.append(future.result()['choices'][0]['message']['content'])
            except:
                # fall back to the raw result (e.g. the error string from openai_request)
                data.append(future.result())
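
As a side note, I believe the bare except is there because openai_request returns its fallback string when all retries fail; indexing that string with 'choices' raises a TypeError, so the raw string is appended instead. A tiny self-contained illustration of my reading:

# If the request failed, future.result() is a plain string rather than a dict.
result = "Sorry, I cannot help with this request. The system is busy now."
try:
    print(result['choices'][0]['message']['content'])
except TypeError:
    # String indices must be integers, so indexing with 'choices' fails here.
    print(result)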

But it seems that openai_request hardcodes model='gpt-3.5-turbo', and MODEL_TARGET is never used. So I think the current code will always use 'gpt-3.5-turbo' no matter which target_model is specified. If this is indeed a bug, a possible fix would be to pass an argument specifying the model when calling openai.ChatCompletion.create; I sketch this after the snippet below.

GPTFuzz/fuzz_utils.py

Lines 327 to 340 in 0cb85c0

def openai_request(prompt, temperature=0, n=1):
    response = "Sorry, I cannot help with this request. The system is busy now."
    max_trial = 50
    for i in range(max_trial):
        try:
            response = openai.ChatCompletion.create(
                model='gpt-3.5-turbo',
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt},
                ],
                temperature=temperature,
                n=n,
            )
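
A minimal sketch of the fix I have in mind (untested; the retry and return logic past the quoted lines is not shown in the snippet, so that part is my assumption):

import time

import openai

def openai_request(prompt, model='gpt-3.5-turbo', temperature=0, n=1):
    # Same fallback text as the current code, returned if every attempt fails.
    response = "Sorry, I cannot help with this request. The system is busy now."
    max_trial = 50
    for _ in range(max_trial):
        try:
            response = openai.ChatCompletion.create(
                model=model,  # previously hardcoded as 'gpt-3.5-turbo'
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt},
                ],
                temperature=temperature,
                n=n,
            )
            return response
        except openai.error.OpenAIError:
            time.sleep(1)  # simple backoff before retrying; the actual logic may differ
    return response

The call site could then forward the target, e.g. executor.submit(openai_request, prompt, MODEL_TARGET) in the ThreadPoolExecutor loop, assuming MODEL_TARGET holds the model name for API models.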

I wonder how to fuzz closed-source LLMs that do have an API available. If the model could be specified by the user, it would be possible to fuzz any closed-source LLM served through an OpenAI-compatible API by setting the OPENAI_API_BASE environment variable.
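
For example, a quick sketch (the endpoint URL and model name here are placeholders for whatever internal service is being targeted):

import os

import openai

# The legacy openai SDK also picks these up from the environment automatically;
# setting them explicitly here just makes the redirection visible.
openai.api_base = os.environ.get("OPENAI_API_BASE", "https://llm.example.internal/v1")
openai.api_key = os.environ.get("OPENAI_API_KEY", "placeholder-key")

# With the model made configurable, any OpenAI-compatible backend can be fuzzed.
reply = openai_request("Hello!", model="my-internal-chat-model")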

Thanks for your interest in our work! We interact with OpenAI models through the API, and with Bard, Claude, and PaLM2 through web inference. We tried to access Bard and Claude with third-party APIs like https://github.com/dsdanielpark/Bard-API; however, we found it unstable, and it needs frequent human-in-the-loop intervention to change the cache, which makes it unsuitable for fuzzing. For Claude, we applied for official API access, but we had not received it by the time of paper writing, so we also used web inference. We saved screenshots of all our attacks on these commercial models for reproduction; you can request them via email using the template.

I agree that the code could be modified to support other non-OpenAI commercial LLMs. Could you tell us which one you would like to fuzz and what the returned response looks like, so that we can modify our code?


Thanks for your reply. I am trying to fuzz a commercial LLM that has limited availability inside my company, so I may need to modify the code on my side to fuzz it.

In addition, do you think it is a bug that the function openai_request hardcodes model='gpt-3.5-turbo' regardless of MODEL_TARGET?


For commercial models, we only ran the fuzzing experiments on gpt-3.5 due to the cost budget and rate limits (for the other commercial models, we ran the transfer attack instead of directly fuzzing them), so I hardcoded the target model name whenever the code detected that the target was a commercial model. Yes, this is inappropriate, and I did not notice it when publishing the code. Thanks for pointing this out!

Also, we currently have collaborators polishing the code in the dev branch to make it more readable and extensible for users and for our future research. I will ask my collaborator to add a config so that users can easily adapt the code to their own API.

I made a few modifications on top of the master branch and successfully jailbroke a commercial LLM. That's amazing!
In addition, I wonder how you fuzz non-English LLMs like Baichuan. jailbreak-prompt.xlsx only contains English prompts, while your paper reports a high ASR on Baichuan, which is an LLM focused more on Chinese.

@chinggg It is nice to hear that you could successfully jailbreak a commercial LLM. For Baichuan, we only used English prompts in our experiments, although we found that Baichuan sometimes prefers to answer English jailbreak prompts in Chinese.

For jailbreaking Chinese LLMs, we have some initial experiments and results that we plan to present in the future, and I can share a few details here. Specifically, we used machine translation to convert the English templates into Chinese and applied the same fuzzing process. Here is an example:
[screenshot: ch_jailbreak — example of a Chinese jailbreak prompt produced by fuzzing]
It is worth noting that you could potentially get better jailbreak performance with high-quality translations, or with Chinese templates from other sources like xiaohongshu. For harmful questions, you could refer to CoAI's dataset. For the judgment model, since we have not yet done large-scale Chinese response labeling to train one, I would suggest using human annotators or ChatGPT-based evaluation.
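
For reference, the translation step could look something like the sketch below (using ChatGPT as the translator and the placeholder handling are only illustrative; any machine-translation system would work):

import openai

def translate_template_to_chinese(template: str) -> str:
    # Illustrative only: ask the model to translate the jailbreak template
    # while keeping the question placeholder intact, so the translated
    # template can still be filled in during fuzzing.
    resp = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {"role": "system", "content": "Translate the user's text into Chinese. Keep any placeholder such as [INSERT PROMPT HERE] verbatim."},
            {"role": "user", "content": template},
        ],
        temperature=0,
    )
    return resp['choices'][0]['message']['content']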

Please let me know if you have any other questions about our work; I would be very happy to help.