Large Language Models, like any other ML model, are bound to make mistakes, no matter how good they are.
Typical mistakes include:
- hallucinations
- misinformation
- harmfulness, or
- disclosures of sensitive information
And the thing is, these mistakes are no big deal when you are building a demo.
However, these same mistakes are a deal breaker when you build a production-ready LLM app that real customers will interact with ❗
Moreover, it is not enough to test for all these things once. Real-world LLM apps, as opposed to demos, are iteratively improved over time. So you need to automatically run all these tests every time you push a new version of your app's source code to your repo.
So the question is:
Is there an automatic way to test an LLM app for issues like hallucinations, misinformation, or harmfulness, before releasing it to the public?
And the answer is … YES!
Giskard is an open-source testing library for LLMs and traditional ML models.
Giskard provides a scan functionality that is designed to automatically detect a variety of risks associated with your LLMs.
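To make this concrete, here is a minimal sketch of what wrapping an LLM app and scanning it with Giskard looks like. The `answer_question` function, the model name/description, and the `question` column are placeholders standing in for your own app, not part of the original post:

```python
import pandas as pd
import giskard

# Placeholder for your own app logic, e.g. a RAG chain that answers questions.
def answer_question(question: str) -> str:
    ...

# Giskard expects a function that maps a DataFrame of inputs to model outputs.
def model_predict(df: pd.DataFrame) -> list[str]:
    return [answer_question(q) for q in df["question"]]

giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="My LLM app",
    # The description is used by the scan to generate relevant test inputs.
    description="Assistant that answers user questions about my product docs.",
    feature_names=["question"],
)

# Automatically probe for hallucinations, harmfulness, prompt injection, etc.
# The scan itself uses an LLM under the hood, which is why you need an OpenAI API key.
scan_report = giskard.scan(giskard_model)
scan_report.to_html("scan_report.html")
```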
Let me show you how to combine:
- the LLM-testing capabilities of Giskard, with
- CI/CD best practices
to build an automatic testing workflow for your LLM app.
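The CI/CD piece boils down to a GitHub Actions workflow that runs the Giskard scan on every pull request. Here is a rough sketch; the workflow file name, Python version, and `scan.py` entry point are assumptions for illustration, not the repo's actual setup:

```yaml
# .github/workflows/llm-tests.yml  (illustrative file name)
name: LLM tests

on:
  pull_request:
    branches: [master]

jobs:
  giskard-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies with Poetry
        run: |
          pip install poetry
          poetry install

      - name: Run the Giskard scan
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: poetry run python scan.py  # hypothetical script that calls giskard.scan
```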
Here is how to try it yourself:

1. Install all project dependencies inside an isolated virtual environment, using Python Poetry:
   `$ make init`

2. Create an `.env` file and fill in the necessary credentials. You will need an OpenAI API key, and optionally a few Giskard Hub and Hugging Face credentials if you plan to use the Giskard Hub:
   `$ cp .env.example .env`

3. Make a change in the `hyper-parameters.yaml` file, for example update the `PROMPT_TEMPLATE` (see the illustrative snippet after this list), commit your changes, push them to your remote GitHub repo and open a pull request.

4. Check the Actions panel in your GitHub repo, and see the testing happening!

5. Once the action has completed, check the PR discussion to see the testing results, and decide whether you want to merge with master or not.
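For reference, the `PROMPT_TEMPLATE` change from step 3 could look something like this. The contents are purely illustrative; the post only tells us the file contains a `PROMPT_TEMPLATE` key:

```yaml
# hyper-parameters.yaml  (illustrative contents)
PROMPT_TEMPLATE: |
  You are a helpful assistant. Answer the question using only the provided
  context. If the answer is not in the context, say you don't know.

  Context: {context}
  Question: {question}
```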