A project that demonstrates evasion attacks on LLMs in sentiment analysis tasks.
This project explores two primary evasion-attack methods:
- White-Box Attack: the SALSA attack on BERT (a minimal sketch follows this list). For more details, refer to the paper *SALSA: Salience-Based Switching Attack for Adversarial Perturbations in Fake News Detection Models*.
- Black-Box Attack: a prompt-based attack using ChatGPT (see the second sketch below). For more details, refer to the arXiv paper *An LLM can Fool Itself: A Prompt-Based Adversarial Attack*.
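Below is a minimal, hypothetical sketch of a salience-based switching attack on a BERT sentiment classifier. The leave-one-out saliency scoring, the tiny substitute table, and the `textattack/bert-base-uncased-imdb` checkpoint (with label 1 assumed to mean positive) are illustrative assumptions, not the exact SALSA procedure from the paper.

```python
# Illustrative salience-based switching attack on a BERT sentiment model.
# NOT the exact SALSA algorithm: saliency here is leave-one-out masking,
# and substitutes come from a hand-written table.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "textattack/bert-base-uncased-imdb"  # assumed IMDb-fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def label_prob(text: str, label: int) -> float:
    """Probability the classifier assigns to `label` for `text`."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, label].item()

def word_saliency(words, label):
    """Leave-one-out saliency: confidence drop when each word is masked."""
    base = label_prob(" ".join(words), label)
    drops = []
    for i in range(len(words)):
        masked = words[:i] + [tokenizer.mask_token] + words[i + 1:]
        drops.append(base - label_prob(" ".join(masked), label))
    return drops

def switch_attack(text, label, substitutes, budget=3):
    """Greedily switch the most salient words for attacker-chosen
    substitutes until the prediction flips or the budget is spent.
    Saliency is computed once up front (stale after each switch; fine
    for a sketch)."""
    words = text.split()
    sal = word_saliency(words, label)
    order = sorted(range(len(words)), key=sal.__getitem__, reverse=True)
    for i in order[:budget]:
        sub = substitutes.get(words[i].lower())
        if sub is None:
            continue
        words[i] = sub
        if label_prob(" ".join(words), label) < 0.5:
            break  # prediction flipped away from the original label
    return " ".join(words)

# Hand-written substitutes for illustration; a real attack would propose
# candidates from embedding neighbours or a masked language model.
subs = {"love": "tolerate", "amazing": "ordinary", "great": "passable"}
adv = switch_attack("I love this movie and the acting is amazing",
                    label=1, substitutes=subs)
print(adv)
```

And a sketch of the prompt-based black-box attack in the spirit of *An LLM can Fool Itself*: ChatGPT is asked to minimally rewrite a review so that a downstream sentiment classifier may flip its prediction. The prompt template and the model choice are illustrative assumptions, not the paper's exact attack prompt.

```python
# Illustrative prompt-based black-box attack: ask ChatGPT to perturb a
# review while preserving its meaning and fluency.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

ATTACK_PROMPT = (
    "The following movie review is classified as '{label}'.\n"
    "Rewrite it by changing at most two words so that the new review keeps "
    "the same meaning and fluency, but could make a sentiment classifier "
    "predict the opposite label. Return only the rewritten review.\n\n"
    "Review: {review}"
)

def generate_adversarial(review: str, label: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice
        messages=[{"role": "user",
                   "content": ATTACK_PROMPT.format(label=label, review=review)}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

adv = generate_adversarial("A heartfelt film with superb performances.", "positive")
print(adv)
```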
The project specifically targets the following large language models as victim models (a minimal query example follows the list):
- BERT: A transformer encoder model that excels at understanding context in natural language.
- Llama-3-8B: The 8-billion-parameter variant of Meta's Llama 3 family of large language models.
- ChatGPT: A conversational model based on the GPT architecture, designed to generate human-like text in response to prompts.
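As a quick sanity check, each victim model can be queried on raw reviews before attacking it. A minimal sketch for the BERT victim, assuming an IMDb-fine-tuned checkpoint:

```python
# Query the BERT victim model via the transformers sentiment pipeline.
from transformers import pipeline

clf = pipeline("sentiment-analysis",
               model="textattack/bert-base-uncased-imdb")  # assumed checkpoint
print(clf("The plot was predictable but the cast saved it."))
# output shape: [{'label': ..., 'score': ...}]
```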
- Dataset: IMDb movie reviews
- Labels: positive or negative
- Size: a subset of 1,000 reviews, used to generate the adversarial examples for this experiment (see the loading sketch below)
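A sketch of how such a subset could be drawn with the Hugging Face `datasets` library; the split and seed are assumptions.

```python
# Load IMDb and sample a 1,000-review subset for the experiment.
from datasets import load_dataset

imdb = load_dataset("imdb", split="test")           # labels: 0=negative, 1=positive
subset = imdb.shuffle(seed=42).select(range(1000))  # assumed split and seed
print(subset[0]["text"][:200], subset[0]["label"])
```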
- YouTube: https://youtu.be/EH1s5jgB8Qc
- Slides: Link