License: MIT Python 3.10

LM-Polygraph: Uncertainty estimation for LLMs

Installation | Basic usage | Overview | Benchmark | Demo application | Documentation

LM-Polygraph provides a battery of state-of-the-art uncertainty estimation (UE) methods for LMs in text generation tasks. High uncertainty can indicate the presence of hallucinations, and a score that estimates uncertainty can help make applications of LLMs safer.

The framework also introduces an extendable benchmark for consistent evaluation of UE techniques by researchers and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses.

Installation

git clone https://github.com/IINemo/lm-polygraph.git && cd lm-polygraph && pip install .

Basic usage

  1. Initialize the model (encoder-decoder or decoder-only) from HuggingFace or a local file. For example, bigscience/bloomz-3b:
from lm_polygraph.utils.model import WhiteboxModel

model = WhiteboxModel.from_pretrained(
    "bigscience/bloomz-3b",
    device="cuda:0",
)
  2. Specify the UE method:
from lm_polygraph.estimators import MeanPointwiseMutualInformation

ue_method = MeanPointwiseMutualInformation()
  3. Get predictions and their uncertainty scores:
from lm_polygraph.utils.manager import estimate_uncertainty

input_text = "Who is George Bush?"
estimate_uncertainty(model, ue_method, input_text=input_text)

Other examples:

  • example.ipynb: simple examples of scoring individual queries;
  • qa_example.ipynb: an example of scoring the bigscience/bloomz-3b model on the TriviaQA dataset;
  • mt_example.ipynb: an example of scoring the facebook/wmt19-en-de model on the WMT14 En-De dataset;
  • ats_example.ipynb: an example of scoring the facebook/bart-large-cnn model on the XSUM summarization dataset;
  • colab: demo web application in Colab (bloomz-560m, gpt-3.5-turbo, and gpt-4 fit the default memory limit; other models require Colab Pro).

Overview of methods

| Uncertainty Estimation Method | Type | Category | Compute | Memory | Need Training Data? |
|---|---|---|---|---|---|
| Maximum sequence probability | White-box | Information-based | Low | Low | No |
| Perplexity (Fomicheva et al., 2020a) | White-box | Information-based | Low | Low | No |
| Mean token entropy (Fomicheva et al., 2020a) | White-box | Information-based | Low | Low | No |
| Monte Carlo sequence entropy (Kuhn et al., 2023) | White-box | Information-based | High | Low | No |
| Pointwise mutual information (PMI) (Takayama and Arase, 2019) | White-box | Information-based | Medium | Low | No |
| Conditional PMI (van der Poel et al., 2022) | White-box | Information-based | Medium | Medium | No |
| Semantic entropy (Kuhn et al., 2023) | White-box | Meaning diversity | High | Low | No |
| Sentence-level ensemble-based measures (Malinin and Gales, 2020) | White-box | Ensembling | High | High | Yes |
| Token-level ensemble-based measures (Malinin and Gales, 2020) | White-box | Ensembling | High | High | Yes |
| Mahalanobis distance (MD) (Lee et al., 2018) | White-box | Density-based | Low | Low | Yes |
| Robust density estimation (RDE) (Yoo et al., 2022) | White-box | Density-based | Low | Low | Yes |
| Relative Mahalanobis distance (RMD) (Ren et al., 2023) | White-box | Density-based | Low | Low | Yes |
| Hybrid Uncertainty Quantification (HUQ) (Vazhentsev et al., 2023a) | White-box | Density-based | Low | Low | Yes |
| p(True) (Kadavath et al., 2022) | White-box | Reflexive | Medium | Low | No |
| Number of semantic sets (NumSets) (Kuhn et al., 2023) | Black-box | Meaning diversity | High | Low | No |
| Sum of eigenvalues of the graph Laplacian (EigV) (Lin et al., 2023) | Black-box | Meaning diversity | High | Low | No |
| Degree matrix (Deg) (Lin et al., 2023) | Black-box | Meaning diversity | High | Low | No |
| Eccentricity (Ecc) (Lin et al., 2023) | Black-box | Meaning diversity | High | Low | No |
| Lexical similarity (LexSim) (Fomicheva et al., 2020a) | Black-box | Meaning diversity | High | Low | No |
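
To give a feel for the cheapest information-based methods in the table, here is a minimal sketch of three of them computed from token-level probabilities. It uses the textbook definitions and plain Python; it is not the library's implementation, and the function names and input layout (a list of per-token log-probabilities, and a list of per-step probability distributions) are assumptions made for illustration.

```python
import math
from typing import Sequence


def sequence_neg_log_prob(token_logprobs: Sequence[float]) -> float:
    """Uncertainty from maximum sequence probability: -log P(y | x).

    Assumes the sequence log-probability is the sum of token log-probabilities.
    Higher values mean the model found its own output less likely.
    """
    return -sum(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Length-normalized variant: exp of the mean negative token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) of the per-step token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Hypothetical two-token generation with token probabilities 0.5 and 0.25.
    logprobs = [math.log(0.5), math.log(0.25)]
    print(sequence_neg_log_prob(logprobs))
    print(perplexity(logprobs))
    # One maximally uncertain step over two tokens, one fully certain step.
    print(mean_token_entropy([[0.5, 0.5], [1.0, 0.0]]))
```

All three need only a single forward pass and no extra data, which matches the Low/Low/No entries in the table. Sampling-based methods such as Monte Carlo sequence entropy repeat this kind of computation over several generations, hence their higher compute cost.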

Benchmark

To evaluate the performance of uncertainty estimation methods, consider a quick example:

HYDRA_CONFIG=../configs/polygraph_eval/polygraph_eval.yaml python ./scripts/polygraph_eval \
    dataset="./workdir/data/triviaqa.csv" \
    model="databricks/dolly-v2-3b" \
    save_path="./workdir/output" \
    seed=[1,2,3,4,5]

Use visualization_tables.ipynb to generate the summarizing tables for an experiment.

A detailed description of the benchmark is in the documentation.
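
As a rough intuition for what evaluating a UE method means, the sketch below builds a simple rejection curve: answers are sorted by uncertainty, the most uncertain fraction is rejected, and the mean quality of the remaining answers is reported. A useful uncertainty score should make quality rise as more answers are rejected. This is a generic illustration, not necessarily the metric the benchmark reports; the function name, the rejection rates, and the toy data are all hypothetical.

```python
from typing import List, Sequence


def rejection_curve(
    uncertainty: Sequence[float],
    quality: Sequence[float],
    rejection_rates: Sequence[float] = (0.0, 0.25, 0.5, 0.75),
) -> List[float]:
    """Mean quality of the answers kept after rejecting the most uncertain ones."""
    if len(uncertainty) != len(quality):
        raise ValueError("uncertainty and quality must have the same length")
    if not uncertainty:
        raise ValueError("at least one example is required")

    # Indices ordered from the most certain answer to the least certain one.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])

    curve = []
    for rate in rejection_rates:
        # Always keep at least one answer so the mean is defined.
        n_keep = max(1, round(len(order) * (1.0 - rate)))
        kept = order[:n_keep]
        curve.append(sum(quality[i] for i in kept) / n_keep)
    return curve


if __name__ == "__main__":
    # Toy data: 1.0 marks a correct answer, 0.0 an incorrect one.
    scores = [0.1, 0.9, 0.4, 0.7]
    correct = [1.0, 0.0, 1.0, 0.0]
    print(rejection_curve(scores, correct))
```

In this toy case the two incorrect answers also have the highest uncertainty, so the kept answers become uniformly correct once half of them are rejected.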

Demo web application

Start with Docker

docker run -p 3001:3001 -it \
    -v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
    --gpus all mephodybro/polygraph_demo:0.0.17 polygraph_server

The server should be available at http://localhost:3001.
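
If the page does not load, a quick reachability check can help tell a container that is still starting from one that failed. The sketch below only probes the root URL given above with the Python standard library; it assumes nothing about the demo's routes, and the helper name `is_server_up` is hypothetical.

```python
import urllib.error
import urllib.request


def is_server_up(url: str = "http://localhost:3001", timeout: float = 2.0) -> bool:
    """Return True if anything answers HTTP requests at `url`."""
    # Bypass any system-wide proxy settings, since the target is a local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or name resolution failure.
        return False


if __name__ == "__main__":
    print("demo server reachable:", is_server_up())
```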

A more detailed description of the demo is available in the documentation.

Acknowledgements

The chat GUI implementation is based on the chatgpt-web-application project.