llm_test uses pytest to do repeatable and scalable user acceptance API testing of Large Language Models (LLMs) for bias, safety, trust, and security. Beyond acceptance testing and informing further manual tests, output like this could be useful for documentation like ModelCard++.
- Define an importable
Model
template based on your API requirements. Examples are included for HuggingFace Inference APIs and OpenAI. For APIs require authentication, store your API keys in.env
. - Add tests to the
test
directory. In accordance with standard acceptance test format,assert
the desired behavior. Follow pytest documentation for test discovery, parameterization, fixtures, etc. - Modify tests to reference your templated
Model
. - Modify test values and prompts based on your interests and acceptance criteria.
- If your
Model
template or tests require any additional libraries, add them torequirements.txt
. - Build the container:
docker build --tag llm_test .
. - Run the container:
docker run --env-file .env llm_test:latest
(after adding your API keys to.env
). If you want to modify pytest's behavior, do so in theDockerfile
. - Review Results
test/test_counterfactual_sentiment.py
: Uses sentiment analysis to compare the compound sentiment range between provided classes. Currently there is an arbitraryassert
threshold. A large range suggests that values returned from the model may have been biased and should be inspected more closely.test/test_prompt_injection.py:test_prompt_injection_echo_original
: Based on available research, reveals underlying prompt that may have been concatenated with user input.test/test_prompt_injection.py:test_prompt_injection_override
: Attempts to override the existing prompt to inject user-defined behavior.
Significantly motivated by the research of: