Code for the TMLR paper "Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models" (arXiv).
@article{
lin2024generating,
title={Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models},
author={Zhen Lin and Shubhendu Trivedi and Jimeng Sun},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2024},
url={https://openreview.net/forum?id=DWkJCSxKU5},
note={}
}
Update: Our code for a new preprint improves the generation/data processing pipeline in this repository (as well as other things like support for greedy decoding and new baselines). Please check it out!
We provide a simple evaluation in notebook/demo.ipynb, using 500 samples and the corresponding responses.
Note that to get the automatic evaluation based on GPT, you need to update keys.json with your API keys first.
First, set the corresponding paths in _settings.py.
Use llama-13b-hf, opt-13b, or gpt-3.5-turbo for the model, and coqa, triviaqa, or nq_open for the dataset in the command below. (You need to download the LLaMA weights first.)
python -m pipeline.generate --model llama-13b-hf --dataset coqa
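
To cover every model and dataset combination listed above, the command can be generated in a loop. The following is a minimal sketch, not part of the repository: the helper name build_commands is hypothetical, and it assumes the script is run from the repository root so that pipeline.generate is importable.

```python
import itertools
import sys

# Model and dataset names as listed in this README.
MODELS = ["llama-13b-hf", "opt-13b", "gpt-3.5-turbo"]
DATASETS = ["coqa", "triviaqa", "nq_open"]


def build_commands(models=MODELS, datasets=DATASETS):
    """Return one argv list per (model, dataset) pair (hypothetical helper)."""
    return [
        [sys.executable, "-m", "pipeline.generate", "--model", m, "--dataset", d]
        for m, d in itertools.product(models, datasets)
    ]


if __name__ == "__main__":
    for cmd in build_commands():
        print(" ".join(cmd))
        # To actually launch generation, import subprocess and uncomment:
        # subprocess.run(cmd, check=True)
```

The launch line is left commented out so that the sketch only prints the commands by default; each run may take a long time and the gpt-3.5-turbo runs need API keys.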
For gpt-3.5-turbo experiments, please update keys.json with your API keys first.
Update GEN_PATHS in _settings.py for the next steps.
(You can find the exact generations we used in our paper here in "output".)
You can run dataeval/load.py to cache the results first.
(We have uploaded the cache in persist_to_disk to this link in "persist_to_disk". Once you download the cache, you should be able to run dataeval/load.py directly without cache misses.)
I use persist_to_disk to cache experiment results (i.e., the @ptd.persistf decorators and ptd.manual_cache calls).
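
For readers unfamiliar with disk caching, the general idea behind such decorators can be illustrated with the standard library alone. This is only a sketch of the concept: disk_cache and slow_square are hypothetical names, and the code does not reproduce the API or the on-disk layout of persist_to_disk.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_cache(cache_dir):
    """Hypothetical stand-in: store a function's results on disk, keyed by its arguments."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Derive a file name from the function name and its arguments.
            key_bytes = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = os.path.join(cache_dir, hashlib.sha256(key_bytes).hexdigest() + ".pkl")
            if os.path.exists(path):
                # Cache hit: load the stored result instead of recomputing.
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator


CALLS = []  # records how often the function body actually runs


@disk_cache(tempfile.mkdtemp())
def slow_square(x):
    CALLS.append(x)
    return x * x


first = slow_square(4)   # computed and written to disk
second = slow_square(4)  # read back from disk; the body does not run again
```

Because results are keyed by the arguments, downloading a pre-populated cache directory lets later calls skip the expensive computation, which is the role the uploaded cache plays here.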
Then, please refer to notebook/main.ipynb for an example.
As many may have noticed, gpt-3.5-turbo's performance has dropped a lot recently. All experiments in this manuscript were carried out (and can be replicated) using gpt-3.5-turbo-0301 instead of the latest version.