LM-Polygraph provides a battery of state-of-the-art uncertainty estimation (UE) methods for LMs in text generation tasks. High uncertainty can indicate the presence of hallucinations, and a score that estimates uncertainty can help make applications of LLMs safer.
The framework also introduces an extendable benchmark for consistent evaluation of UE techniques by researchers and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses.
git clone https://github.com/IINemo/lm-polygraph.git && cd lm-polygraph && pip install .
- Initialize the model (encoder-decoder or decoder-only) from HuggingFace or a local file. For example, bigscience/bloomz-3b:
from lm_polygraph.utils.model import WhiteboxModel
model = WhiteboxModel.from_pretrained(
"bigscience/bloomz-3b",
device="cuda:0",
)
- Specify the UE method:
from lm_polygraph.estimators import MeanPointwiseMutualInformation
ue_method = MeanPointwiseMutualInformation()
- Get predictions and their uncertainty scores
from lm_polygraph.utils.manager import estimate_uncertainty
input_text = "Who is George Bush?"
estimate_uncertainty(model, ue_method, input_text=input_text)
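To give a feel for what a PMI-style score measures, here is a minimal, self-contained sketch. It is not the library's implementation: the function names `mean_pmi` and `pmi_uncertainty` are hypothetical, the log-probabilities are made-up numbers, and the sign convention (negating the mean PMI so that larger values mean more uncertainty) is an assumption.

```python
import math
from typing import Sequence


def mean_pmi(cond_logprobs: Sequence[float], uncond_logprobs: Sequence[float]) -> float:
    """Mean pointwise mutual information between the input and each output token.

    cond_logprobs[t]   = log p(y_t | y_<t, x)  (model sees the input x)
    uncond_logprobs[t] = log p(y_t | y_<t)     (model does not see x)
    """
    if len(cond_logprobs) != len(uncond_logprobs):
        raise ValueError("both sequences must cover the same tokens")
    if not cond_logprobs:
        raise ValueError("at least one token is required")
    pmi = [c - u for c, u in zip(cond_logprobs, uncond_logprobs)]
    return sum(pmi) / len(pmi)


def pmi_uncertainty(cond_logprobs: Sequence[float], uncond_logprobs: Sequence[float]) -> float:
    """Assumed convention: low PMI (output barely depends on the input) = high uncertainty."""
    return -mean_pmi(cond_logprobs, uncond_logprobs)


# Made-up probabilities for a two-token answer.
cond = [math.log(0.8), math.log(0.6)]    # with the question in context
uncond = [math.log(0.2), math.log(0.3)]  # without the question
print(pmi_uncertainty(cond, uncond))
```

In this toy case the input makes both tokens more likely, so the mean PMI is positive and the resulting uncertainty score is negative, that is, comparatively low.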
- example.ipynb: simple examples of scoring individual queries;
- qa_example.ipynb: an example of scoring the bigscience/bloomz-3b model on the TriviaQA dataset;
- mt_example.ipynb: an example of scoring the facebook/wmt19-en-de model on the WMT14 En-De dataset;
- ats_example.ipynb: an example of scoring the facebook/bart-large-cnn model on the XSUM summarization dataset;
- colab: demo web application in Colab (bloomz-560m, gpt-3.5-turbo, and gpt-4 fit the default memory limit; other models require Colab Pro).
Uncertainty Estimation Method | Type | Category | Compute | Memory | Need Training Data? |
---|---|---|---|---|---|
Maximum sequence probability | White-box | Information-based | Low | Low | No |
Perplexity (Fomicheva et al., 2020a) | White-box | Information-based | Low | Low | No |
Mean token entropy (Fomicheva et al., 2020a) | White-box | Information-based | Low | Low | No |
Monte Carlo sequence entropy (Kuhn et al., 2023) | White-box | Information-based | High | Low | No |
Pointwise mutual information (PMI) (Takayama and Arase, 2019) | White-box | Information-based | Medium | Low | No |
Conditional PMI (van der Poel et al., 2022) | White-box | Information-based | Medium | Medium | No |
Semantic entropy (Kuhn et al., 2023) | White-box | Meaning Diversity | High | Low | No |
Sentence-level ensemble-based measures (Malinin and Gales, 2020) | White-box | Ensembling | High | High | Yes |
Token-level ensemble-based measures (Malinin and Gales, 2020) | White-box | Ensembling | High | High | Yes |
Mahalanobis distance (MD) (Lee et al., 2018) | White-box | Density-based | Low | Low | Yes |
Robust density estimation (RDE) (Yoo et al., 2022) | White-box | Density-based | Low | Low | Yes |
Relative Mahalanobis distance (RMD) (Ren et al., 2023) | White-box | Density-based | Low | Low | Yes |
Hybrid Uncertainty Quantification (HUQ) (Vazhentsev et al., 2023a) | White-box | Density-based | Low | Low | Yes |
p(True) (Kadavath et al., 2022) | White-box | Reflexive | Medium | Low | No |
Number of semantic sets (NumSets) (Kuhn et al., 2023) | Black-box | Meaning Diversity | High | Low | No |
Sum of eigenvalues of the graph Laplacian (EigV) (Lin et al., 2023) | Black-box | Meaning Diversity | High | Low | No |
Degree matrix (Deg) (Lin et al., 2023) | Black-box | Meaning Diversity | High | Low | No |
Eccentricity (Ecc) (Lin et al., 2023) | Black-box | Meaning Diversity | High | Low | No |
Lexical similarity (LexSim) (Fomicheva et al., 2020a) | Black-box | Meaning Diversity | High | Low | No |
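
The three cheapest information-based rows of the table can be illustrated with a few lines of plain Python. This is a sketch of the textbook definitions, not the code used in LM-Polygraph: the function names are hypothetical, the toy distributions are invented, and the sequence-probability score is written as a negative log-likelihood so that larger values mean more uncertainty (an assumed convention).

```python
import math
from typing import Sequence


def negative_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Sequence-level score: minus the log-probability of the whole output."""
    return -sum(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Length-normalized variant: exp of the mean negative token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) of the per-step next-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)  # skip zero-probability entries
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


# Toy 3-token generation over a 2-word vocabulary (made-up numbers).
distributions = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]
chosen = [0, 1, 0]  # index of the token actually generated at each step
logprobs = [math.log(d[i]) for d, i in zip(distributions, chosen)]

print("NLL:", negative_log_likelihood(logprobs))
print("Perplexity:", perplexity(logprobs))
print("Mean token entropy:", mean_token_entropy(distributions))
```

All three need only the probabilities from a single forward generation pass, which is consistent with their Low compute and memory ratings in the table; the sampling-based rows need several generations per input.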
To evaluate the performance of uncertainty estimation methods, consider a quick example:
HYDRA_CONFIG=../configs/polygraph_eval/polygraph_eval.yaml python ./scripts/polygraph_eval \
dataset="./workdir/data/triviaqa.csv" \
model="databricks/dolly-v2-3b" \
save_path="./workdir/output" \
seed=[1,2,3,4,5]
Use visualization_tables.ipynb to generate the summarizing tables for an experiment.
A detailed description of the benchmark is in the documentation.
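One common way to judge a UE method is a rejection curve: discard the most uncertain outputs first and check whether the average quality of the remaining ones goes up. The sketch below illustrates that idea only. Whether the benchmark script computes exactly this is an assumption, and the function name `rejection_curve` and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def rejection_curve(uncertainty: Sequence[float], quality: Sequence[float]) -> List[float]:
    """Mean quality of the outputs kept after rejecting the k most uncertain ones.

    Returns one value per k = 0 .. n-1 (at least one output is always kept).
    """
    if len(uncertainty) != len(quality):
        raise ValueError("need one quality score per uncertainty score")
    if not uncertainty:
        raise ValueError("at least one output is required")
    # Sort outputs from most confident to least confident.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    sorted_quality = [quality[i] for i in order]
    curve = []
    for k in range(len(sorted_quality)):
        kept = sorted_quality[: len(sorted_quality) - k]  # drop the k most uncertain
        curve.append(sum(kept) / len(kept))
    return curve


# Made-up scores: quality is 1.0 for a correct answer and 0.0 for a wrong one.
uncertainty = [0.1, 0.9, 0.4, 0.7]
quality = [1.0, 0.0, 1.0, 0.0]
print(rejection_curve(uncertainty, quality))
```

Here the two wrong answers also have the highest uncertainty, so the curve rises from 0.5 to 1.0 as they are rejected. A method whose scores were unrelated to correctness would give a roughly flat curve.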
docker run -p 3001:3001 -it \
-v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
--gpus all mephodybro/polygraph_demo:0.0.17 polygraph_server
The server should be available at http://localhost:3001
A more detailed description of the demo is available in the documentation.
The chat GUI implementation is based on the chatgpt-web-application project.