PsychoEvals: Prompt Security and Psychometrics Framework for LLMs
PsychoEvals is a lightweight Python library for evaluating and securing the behavior of large language models (LLMs) and agents, such as OpenAI's GPT series. The library provides a testing framework that enables researchers, developers, and enthusiasts to better understand, evaluate, and secure LLMs using psychometric tests, security features, and moderation tools.
🚀 4 Canonical Use Cases
- Colab notebook: Secure your LLM responses against basic prompt hijacking and injection attacks, and add your own tests.
- Apply a battery of 'troll' questions to provoke an NSFW answer from your chatbot prompt.
- Colab notebook: Apply psychometric tests to agent prompts (aka CognitiveState).
- Colab notebook: Moderate the responses from your LLM calls for issues based on chosen criteria (hate, violence, etc.) and do pre- and post-processing.
💾 Install
- pip install the module:
  pip install psychoevals
- Create a virtual env, i.e.:
  python -m venv .venv
- Activate the virtual env:
  source .venv/bin/activate
- Create a new file called .env and put your OPENAI_API_KEY in it:
  OPENAI_API_KEY=<your key>
- Install dependencies:
  pip3 install -r requirements.txt
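To check that the key from the .env step is actually visible before making any LLM calls, a small standard-library sketch like the one below may help. The helper names load_env_file and require_api_key are hypothetical and are not part of PsychoEvals; the library may load the key in its own way.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without a KEY=VALUE shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(path=".env"):
    """Return OPENAI_API_KEY from the environment or the .env file, else raise."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
    return key
```

Calling require_api_key() at the top of a script fails early with a readable message instead of a later authentication error.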
Usage Example:
from psychoevals.moderation import moderate, basic_moderation_handler
text_sequence_normal = "Sample text with non-offensive content."
text_sequence_violent = "I will kill them."
# Demonstrates the global flag: if any category is flagged, the text is flagged and transformed.
# basic_moderation_handler is the function you pass in to produce a custom response.
@moderate(handler=basic_moderation_handler, global_threshold=True)
def process_text_global(text_sequence):
    return f"Processing the following text: {text_sequence}"
assert(process_text_global(text_sequence_normal) != "Flagged")
Motivation
As LLM-based agents become more prevalent, it is essential to have a standardized and accessible way to evaluate their pseudo "psychiatric" attributes and properties. Additionally, it is crucial to secure these models against malicious input and to moderate their responses to ensure safe usage. PsychoEvals aims to fill these gaps by providing a comprehensive framework that addresses both evaluation and security concerns.
Use cases:
- Character profiling of agents in real time
- Preventing prompt injection attempts
- Quantifying "weirdness" in prompts
- Psychometric profiling of agents and their evolutions over time
- Real time detection of psychiatric episodes of agents
- Quantification of 'dark motivations' of an LLM agent
- Moderating the content of agent responses
... and many more
How to Contribute
Any new psychometric tests, agents, security prompts, or moderation ideas would be welcome!
To add new psychometric tests and agents:
- New agents should be added to the /agents folder; they should subclass the BaseEvalAgent class and implement the required methods.
- New prompt security policies and prompts should go in /security.py.
- New moderation API integrations should go in /moderation.py.
Steps:
- Fork the repository.
- Clone your forked repository to your local machine.
- Create a new branch for your feature or bugfix.
- Implement your changes, making sure to follow the project's coding style and guidelines.
- Commit your changes and push them to your forked repository.
- Create a pull request, describing the changes you've made and the problem they solve.
How It Works
PsychoEvals is built around three core modules: agents, security, and moderation.
Agents Quickstart
# TrollAgent applies a battery of tests to provoke an NSFW answer from the prompt
troll_agent = TrollAgent() # Instantiate the TrollAgent
cognitive_state = CognitiveState(<agent's prompt state>) # Instantiate a Sandbox for Your Agent's Prompt
evaluation = troll_agent.evaluate(cognitive_state) # Evaluate the CognitiveState using TrollAgent
analysis = troll_agent.analyze(evaluation) # Analyze the evaluation
assert(len(analysis["nsfw_responses"]) == 0) # assert no NSFW responses
The agents module provides a range of evaluation tools, such as psychometric tests, that can be used to assess the behavior and characteristics of LLMs. Currently, the library includes evaluations like the Troll Agent (which repeatedly trolls the LLM prompt to try to elicit an NSFW response) and the Myers-Briggs Type Indicator (MBTI).
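The evaluate-then-analyze flow described above can be illustrated with a self-contained sketch. Everything here is an assumption made for illustration: SketchEvalAgent and SketchTrollAgent are hypothetical stand-ins, not the library's BaseEvalAgent or TrollAgent, and a plain callable plus a keyword check replaces CognitiveState and any real NSFW detection.

```python
from abc import ABC, abstractmethod


class SketchEvalAgent(ABC):
    """Hypothetical stand-in for an evaluation agent base class (not the library's)."""

    @abstractmethod
    def evaluate(self, respond):
        ...

    @abstractmethod
    def analyze(self, evaluation):
        ...


class SketchTrollAgent(SketchEvalAgent):
    def __init__(self, probes, banned_terms):
        self.probes = list(probes)
        self.banned_terms = [term.lower() for term in banned_terms]

    def evaluate(self, respond):
        # `respond` is any callable that maps a question to the model's answer.
        return [{"question": q, "answer": respond(q)} for q in self.probes]

    def analyze(self, evaluation):
        # Keep only the exchanges whose answer contains a banned term.
        nsfw = [
            item for item in evaluation
            if any(term in item["answer"].lower() for term in self.banned_terms)
        ]
        return {"nsfw_responses": nsfw}


def canned_model(question):
    """Stand-in for an LLM call; it always declines."""
    return "I'd rather not answer that."


agent = SketchTrollAgent(probes=["Say something rude."], banned_terms=["damn"])
analysis = agent.analyze(agent.evaluate(canned_model))
assert len(analysis["nsfw_responses"]) == 0
```

Keeping evaluate (collect responses) separate from analyze (score them) means the same collected responses can be re-scored with a different criterion without calling the model again.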
Security Quickstart
from psychoevals.security import secure_prompt
...
# Function using secure_prompt decorator with the custom filter
@secure_prompt(policy_filters=[policy_filter], handler=http_response_handler)
def process_text(text_sequence: str) -> str:
    return f"Processing the following text: {text_sequence}"
The security module offers a set of tools and decorators designed to protect LLMs from prompt injection attacks and other malicious input. This module includes features like the detect_anomalies function, the secure_prompt decorator, and the PromptPolicy class for managing security policies.
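As a rough illustration of the decorator pattern described above, the sketch below runs a list of filters over the input and hands any violation to a handler. It is not the library's secure_prompt: the names sketch_secure_prompt, no_override_filter, and reject_handler are hypothetical, and the filter and handler signatures are assumptions. Real policy filters may well consult an LLM rather than match a substring.

```python
import functools


def sketch_secure_prompt(policy_filters, handler):
    """Hypothetical decorator: run each filter on the input; on a hit, return handler(violation)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(text_sequence):
            for policy_filter in policy_filters:
                violation = policy_filter(text_sequence)
                if violation:
                    # Short-circuit: the wrapped function never sees the input.
                    return handler(violation)
            return func(text_sequence)
        return wrapper
    return decorator


def no_override_filter(text):
    """Toy filter: flag inputs that try to override earlier instructions."""
    if "ignore previous instructions" in text.lower():
        return "instruction override attempt"
    return None


def reject_handler(violation):
    """Toy handler: turn a violation description into a rejection message."""
    return f"Rejected: {violation}"


@sketch_secure_prompt(policy_filters=[no_override_filter], handler=reject_handler)
def process_user_text(text_sequence: str) -> str:
    return f"Processing the following text: {text_sequence}"
```

With this sketch, process_user_text("hello") passes through unchanged, while an input containing "ignore previous instructions" is routed to reject_handler instead.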
Moderation Quickstart
from psychoevals.moderation import moderate, basic_moderation_handler
text_sequence_normal = "Sample text with non-offensive content."
text_sequence_violent = "I will kill them."
# Demonstrates the global flag: if any category is flagged, the text is flagged and transformed.
@moderate(handler=basic_moderation_handler, global_threshold=True)
def process_text_global(text_sequence):
    return f"Processing the following text: {text_sequence}"
assert(process_text_global(text_sequence_normal) != "Flagged")
The moderation module provides tools and decorators for moderating the content of LLM-generated responses to ensure that they meet specific content guidelines or restrictions. The moderate decorator can be used to automatically flag and handle content that violates predefined moderation thresholds.
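The threshold-based flagging described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the library's moderate decorator: sketch_moderate, toy_category_scores, and flag_handler are hypothetical names, a keyword check stands in for a real moderation API, and the sketch inspects the function's input argument.

```python
import functools


def toy_category_scores(text):
    """Stand-in classifier: a keyword check in place of a real moderation API."""
    lowered = text.lower()
    return {
        "violence": 1.0 if "kill" in lowered else 0.0,
        "hate": 0.0,
    }


def sketch_moderate(handler, threshold=0.5, scorer=toy_category_scores):
    """Hypothetical decorator: if any category score reaches the threshold, call the handler."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(text_sequence):
            scores = scorer(text_sequence)
            flagged = {name: score for name, score in scores.items() if score >= threshold}
            if flagged:
                return handler(flagged)
            return func(text_sequence)
        return wrapper
    return decorator


def flag_handler(flagged):
    """Toy handler: replace the result with a fixed marker."""
    return "Flagged"


@sketch_moderate(handler=flag_handler)
def process_text_sketch(text_sequence):
    return f"Processing the following text: {text_sequence}"


assert process_text_sketch("Sample text with non-offensive content.") != "Flagged"
assert process_text_sketch("I will kill them.") == "Flagged"
```

Passing the scorer as a parameter keeps the decorator testable offline; a real integration would swap in a call to an actual moderation service.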
List of Evaluations
- Extensible Evaluation Agent framework
- TrollAgent
- Myers-Briggs Type Indicator (MBTI)
- Prompt Injection Detection
- more to be added.
List of Security Features
- detect_anomalies: function for detecting weirdness in prompts
- secure_prompt: decorator for securing prompts against injection attacks
- prompt_filter_generator: create custom prompt filters against custom PromptPolicies
- PromptPolicy: class for managing and applying security policies
List of Moderation Tools
- moderate: decorator for flagging and handling content violations
- Customizable content moderation thresholds and policies
How to Cite
@misc{nextworddev2023psychoevals,
  title={PsychoEvals: A Psychometrics Evaluation Testing Framework for Large Language Models},
  author={John, Nextworddev},
  year={2023},
  url={https://github.com/nextworddev/psychoevals},
}