/AttackPrompt-NLP

Reproduce Attack Prompt Methodology to Test for AdvBench and AdvGLUE dataset.

Primary LanguageJupyter Notebook

Utilizing GPT-3.5 Turbo to Perform a Prompt-Based Adversarial Attack on AdvBench

Abstract

Content warning: This project contains unfiltered content generated by LLMs that may be offensive to readers.

Large Language Models (LLMs) have many applications, including transforming harmful prompts into acceptable formats. This study demonstrates the efficacy of LLMs in modifying harmful prompts so they can be accepted by other LLMs. Specifically, we evaluate a methodology proposed for the AdvGLUE dataset that changes the classification outcomes.

Our evaluation primarily focuses on the AdvBench dataset, a collection of harmful prompts. We find that the current methodology is unsuitable for this task but propose potential research directions through further modifications.

This project evaluates the success rate of our methodologies and provides empirical results on prompt-based adversarial attacks.


Introduction

LLMs have advanced rapidly in recent years, assisting with tasks such as code generation, business analytics, and everyday language processing, thanks to their training on massive datasets.

One important use case of LLMs relates to adversarial attacks, where attackers aim to manipulate inputs to produce unintended outputs. In this project, we leverage an approach proposed in AttackPrompt [Xu et al., 2023] to modify prompts by changing the meaning of harmful inputs. The objective is to make the modified prompts acceptable to the model, ensuring they return positive responses.

The AttackPrompt framework introduces a methodology for converting a prompt with a negative classification result into a positive one. We extend their work by experimenting with different levels of prompt permutation (from 0 to 8) to assist LLMs in adjusting these adversarial prompts. Specifically, we used GPT-3.5 Turbo for this task to modify and evaluate the transformed prompts.


Main Contributions

  • Application of AttackPrompt methodology on the AdvGLUE dataset to validate the framework on their datasets.
  • Testing on the AdvBench dataset, revealing limitations and identifying necessary modifications for improved performance.
  • Success and Failure Rate Analysis: Measured the success rate of prompt transformation with GPT-3.5 and analyzed its performance in adversarial tasks.
  • Identification of Future Research Opportunities: Demonstrated that the methodology requires modification to be effective for further research purposes.

Prompt Transformation Success Rate

We experimented with 11 different permutations for attack guidance and evaluated the success of prompt transformations using both GPT-3.5 Turbo and Llama-2 models.

  • GPT-3.5 Turbo: Successfully converted more than 95% of prompts, as the model typically flags harmful prompts instead of rejecting them outright.
  • Llama-2: Rejected 80% of prompts due to stricter rules against harmful content, leading to lower success rates.

Results Overview

Below is a table summarizing the success rates for different levels of prompt permutation with GPT-3.5:

Permutation Successful Failures Success Rate (%)
Level 0 568 7 98.78
Level 1 570 5 99.13
Level 2 547 28 95.13
Level 3 560 15 97.39
Level 4 559 16 97.21
Level 5 555 20 96.52
Level 6 560 15 97.39
Level 7 569 6 98.95
Level 8 566 9 98.43
Level 9 540 35 93.91
Level 10 549 26 95.47

Findings and Limitations

  • AdvBench Results:
    The methodology was less effective on the AdvBench dataset due to differences in problem formulation, as expected. While some prompts were accepted after modification, the model could not generate the desired target responses.

  • Permutation Impact:
    Our experiments showed that permutations L=9 and L=10 significantly improved the robustness of the adversarial attacks.

  • Future Research Directions:
    The limitations observed suggest opportunities for further research. Modifying the methodology could yield better results for different datasets and tasks, especially when working with stricter models like Llama-2.


Conclusion and Discussion

This project demonstrates the potential and challenges of using LLMs for adversarial prompt modification. Our findings show that GPT-3.5 Turbo outperforms Llama-2 in handling adversarial prompts, but the methodologies need refinement for broader applicability.

We also highlight that current techniques are unsuitable for AdvBench, but modified approaches could unlock new research opportunities for adversarial NLP tasks.


Acknowledgments

This project builds on the AttackPrompt framework proposed in [Xu et al., 2023]. We are grateful to the creators of AdvGLUE and AdvBench datasets for their contributions to this research.


References

  • Xu, et al. (2023). AttackPrompt: Adversarial Prompting with Large Language Models. arXiv. Link to paper