Detect hallucinated content in generated answers for any RAG system. A live demo of this tool can be tried here.
Read more about the background of this project in our Medium post.
This API is tested with Python version 3.11 on Debian but should run on most recent Python versions and operating systems.
- Create a virtual environment:
  ```sh
  pyenv virtualenv 3.11.7 NAME && pyenv activate NAME
  ```
- Install dependencies:
  ```sh
  pip install -r requirements.txt
  ```
- Add your OpenAI API key as an environment variable:
  ```sh
  export OPENAI_API_KEY=your_key
  ```
- Run:
  ```sh
  python3 app.py
  ```
You can now view the Swagger documentation of the API in your browser under `localhost:3000`.
The endpoint `check` verifies whether a sentence is contained in the source. If the sentence is supported by the source, the endpoint returns the boolean value `result` as `true`; if the information from the sentence is not contained in the source, it returns `false`. Besides this boolean value, the endpoint returns an array `answers`, which explains for each checked sentence why it is or is not contained in the source.

As a URL parameter you can pass the threshold `semantic_similarity_threshold`; a lower threshold means higher latency and possibly better accuracy. The default value is 0.65.
```sh
curl -X 'POST' \
  'http://localhost:3000/check?semantic_similarity_threshold=0.65&model=gpt-3.5-turbo' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "source": "string",
  "sentence": "string"
}'
```
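The same call from Python, as a minimal sketch: the request mirrors the curl example above, and the response fields `result` and `answers` are the ones described in the text, but the example source and sentence are made up.

```python
import requests

# Hypothetical client call against a locally running instance.
response = requests.post(
    "http://localhost:3000/check",
    params={"semantic_similarity_threshold": 0.65, "model": "gpt-3.5-turbo"},
    json={
        "source": "Munich is the capital of Bavaria and has about 1.5 million inhabitants.",
        "sentence": "Munich has about 3 million inhabitants.",
    },
    timeout=60,
)
response.raise_for_status()
body = response.json()
print(body["result"])   # expected: False, the sentence contradicts the source
print(body["answers"])  # per-sentence explanations
```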
This repository contains two scripts designed to evaluate and enhance the accuracy of our hallucination detection system. The script `evaluation.py` validates the effectiveness of the system by comparing its predictions with the gold standard dataset, ultimately providing a measure of accuracy. The script `predictor.py` processes the test data set using the provided API to create the prediction set to validate against.
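As a rough illustration of what such a prediction run might look like (the file and column names `id`, `source`, and `sentence` are assumptions, and the actual `predictor.py` may differ), the sketch below queries the `check` endpoint for every test row and writes results in the `_result.jsonl` format described further down:

```python
import csv
import json

import requests

# Hypothetical sketch: call /check for every row of the test set and
# write the predictions as a *_result.jsonl file.
with open("data/test.csv", newline="") as f_in, \
        open("data/my_model_result.jsonl", "w") as f_out:
    for row in csv.DictReader(f_in):
        resp = requests.post(
            "http://localhost:3000/check",
            params={"semantic_similarity_threshold": 0.65, "model": "gpt-3.5-turbo"},
            json={"source": row["source"], "sentence": row["sentence"]},
            timeout=60,
        )
        resp.raise_for_status()
        body = resp.json()
        # result == true means the sentence is supported, i.e. no hallucination.
        record = {
            "id": row["id"],
            "hallucination": not body["result"],
            "prob": 0.0,  # placeholder confidence; fill in if your system produces one
        }
        f_out.write(json.dumps(record) + "\n")
```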
The test and training data is purely synthetic. It is generated from a random dump of our vector store containing BR24 articles, split into paragraphs at `<p>` tags. For the test set, 150 of those paragraphs are randomly sampled and saved to `data/test.csv`. This file is used by `create_training_data.py` to generate a question which can be answered given the paragraph. Using this question and the paragraph, GPT-3.5 Turbo generates answers to the questions. In some cases the LLM is explicitly asked to add wrong but plausible content to the answer.
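A simplified sketch of what this generation step could look like; the prompts, the 50/50 hallucination ratio, and the function name are illustrative assumptions, not the actual `create_training_data.py`:

```python
import random

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_example(paragraph: str) -> dict:
    """Create one synthetic (question, answer, label) triple from a paragraph."""
    # Ask the model for a question that the paragraph can answer.
    question = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write one question that can be answered using this paragraph:\n{paragraph}",
        }],
    ).choices[0].message.content

    # For some samples, explicitly ask for a wrong but plausible detail.
    hallucinate = random.random() < 0.5  # assumed ratio, for illustration only
    instruction = (
        "Answer the question, but add one wrong yet plausible detail."
        if hallucinate
        else "Answer the question strictly based on the paragraph."
    )
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"{instruction}\n\nParagraph:\n{paragraph}\n\nQuestion:\n{question}",
        }],
    ).choices[0].message.content

    return {"question": question, "answer": answer, "hallucination": hallucinate}
```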
Your hypothesis data should be placed in the `data` folder and be suffixed with `_result.jsonl`. Each row shall contain a JSON object with the following structure:
```json
{
  "id": "string",
  "hallucination": true,
  "prob": 0.01
}
```
To run the evaluation, simply run `python evaluate.py` after you've placed your results in the `data` folder. The evaluation script calculates the accuracy, i.e. the percentage of correctly predicted samples.
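A minimal sketch of that accuracy computation, assuming the gold labels live in `data/test.csv` with `id` and `hallucination` columns (the actual `evaluate.py` may differ):

```python
import csv
import json

# Hypothetical sketch: compare predicted labels against the gold standard
# and report the fraction of correctly predicted samples.
with open("data/test.csv", newline="") as f:
    gold = {row["id"]: row["hallucination"] == "True" for row in csv.DictReader(f)}

with open("data/my_model_result.jsonl") as f:
    predictions = [json.loads(line) for line in f]

correct = sum(1 for p in predictions if gold[p["id"]] == p["hallucination"])
print(f"Accuracy: {correct / len(predictions):.2%}")
```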