Awesome-AI-Assistant-Exploits

A curated collection of prompts that break popular AI assistants, leading to unexpected behaviour.

Contents:

Objective
Test Categories and Description
Systems under testing
Results
Contributing

Objective

To drive an open-source effort to collect information on how large-scale generative models (hidden behind an API) can be exploited to steer their responses away from the expected behavior. The systems being tested should be publicly available to ensure equitable access.

Test Categories and Description

The following test categories are currently supported. Each can have subsections based on the type of vulnerability, for example mathematics under hallucination.

Jailbreaking

Definition: Engineer prompts that make the system bypass its initial hidden prompt and do as directed by the user.
Criteria: Demonstrate the system's behavior with and without the engineered prompt, showing the bypass.

Hallucination

Definition: Engineer prompts that lead the system to generate fabricated text about real-world entities (most relevant for assistants that are tailored to work on real-world data).
Criteria: Demonstrate hallucinated information for real-world entities.

Hate Speech/Bias Generation

Definition: Engineer prompts that lead the system to generate hate speech or bias against a particular group.
Criteria: Demonstrate the generation of toxic text against a particular group.

Contradiction

Definition: Engineer prompts that lead the system to produce self-contradicting responses.
Criteria: Demonstrate self-contradiction in a multi-turn setting.

Illicit Instruction Generation

Definition: Engineer prompts that lead the system to generate text containing instructions to harm a certain entity.
Criteria: Demonstrate the generation of harmful instructions.
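The categories and criteria above could be recorded in a small data structure so that submissions are catalogued consistently. The following is only a sketch: the `Category`, `Submission`, and `check_criteria` names and all field names are hypothetical and are not part of this repository. It encodes just the two criteria that can be checked mechanically (multi-turn for contradiction, a with/without comparison for jailbreaking).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Category(Enum):
    # One member per test category described above (labels are illustrative).
    JAILBREAKING = "jailbreaking"
    HALLUCINATION = "hallucination"
    HATE_SPEECH_BIAS = "hate_speech_bias"
    CONTRADICTION = "contradiction"
    ILLICIT_INSTRUCTION = "illicit_instruction"


@dataclass
class Submission:
    # Hypothetical record layout; the project's real template may differ.
    category: Category
    system: str                       # publicly available assistant that was tested
    turns: int                        # number of user turns in the recorded conversation
    subsection: Optional[str] = None  # e.g. "mathematics" under hallucination
    has_baseline: bool = False        # behavior without the engineered prompt is also shown


def check_criteria(sub: Submission) -> List[str]:
    """Return a list of unmet criteria (empty if none were detected)."""
    problems = []
    if sub.turns < 1:
        problems.append("at least one turn must be recorded")
    if sub.category is Category.CONTRADICTION and sub.turns < 2:
        problems.append("contradiction must be demonstrated in a multi-turn setting")
    if sub.category is Category.JAILBREAKING and not sub.has_baseline:
        problems.append("jailbreaking must show behavior with and without the prompt")
    return problems
```

For example, `check_criteria(Submission(Category.CONTRADICTION, "some-assistant", turns=1))` reports the missing multi-turn demonstration. The remaining criteria need human review and are deliberately not automated here.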

Systems under testing

Results

Contributing

Please ensure that submitted prompts are not simple paraphrases or style modifications of existing exploits; such contributions will be rejected. A submission template can be found here and should be included with the submitted PR.
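Contributors may want to screen a prompt against existing entries before opening a PR. The sketch below is one possible pre-check, not the maintainers' review process: it uses Python's standard `difflib.SequenceMatcher` on normalized text, and the function names and the `0.9` threshold are assumptions chosen for illustration. It flags style modifications (case, punctuation, spacing) and small wording changes, but it will not detect a true paraphrase that uses different words.

```python
import re
from difflib import SequenceMatcher
from typing import List


def normalize(prompt: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", prompt.lower())
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized prompts."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_near_duplicates(candidate: str, existing: List[str],
                         threshold: float = 0.9) -> List[str]:
    """Return existing prompts whose similarity to the candidate meets the threshold.

    The threshold is an arbitrary illustrative value, not a project rule.
    """
    return [p for p in existing if similarity(candidate, p) >= threshold]
```

A candidate that differs from an existing entry only in capitalization or punctuation normalizes to the same string and is always flagged; anything this check passes still needs a manual comparison for paraphrases.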

vaibhavk97/Awesome-AI-Assistant-Exploits