kddubey/cappr

Calculate perplexity

galonpy opened this issue · 4 comments

Hello,
I want to evaluate the log probability (really, the perplexity) of a sentence of text. Can I use your library for this?

Hello,

Yes: first apply token_logprobs with add_bos=True, end_of_prompt="", then drop the first None in each list (the first token isn't conditioned on anything, so it has no log-probability), and then apply agg_log_probs.
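
Here's a minimal sketch of that recipe, assuming the (model, tokenizer) tuple interface of cappr.huggingface.classify. The per-token mean is computed directly with NumPy (in place of agg_log_probs) to keep the perplexity arithmetic explicit:

import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import token_logprobs

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

texts = ["The quick brown fox jumps over the lazy dog."]

# Log-probability of each token conditional on the previous tokens
log_probs = token_logprobs(
    texts, (model, tokenizer), end_of_prompt="", add_bos=True
)

# The first token isn't conditioned on anything, so each list starts with None
log_probs = [token_lps[1:] for token_lps in log_probs]

# perplexity = exp(-(mean log-probability per token))
perplexities = [np.exp(-np.mean(token_lps)) for token_lps in log_probs]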

But you don't need CAPPr for this computation. See this HuggingFace tutorial: https://huggingface.co/docs/transformers/perplexity

You may want to use CAPPr if your data looks like this—

text 1: {same_prompt} {different_completion_1}
text 2: {same_prompt} {different_completion_2}
.
.
.
text n: {same_prompt} {different_completion_n}

In that case, use predict_proba or predict_proba_examples and pass normalize=False to get the log-probability of every completion. CAPPr caches same_prompt to save time.
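
For example, with the model and tokenizer loaded as above (a sketch; it assumes a single prompt string is accepted, and since predict_proba returns probabilities, np.log recovers the log-probabilities):

import numpy as np
from cappr.huggingface.classify import predict_proba

same_prompt = "The weather today is"
completions = ("sunny", "rainy", "a potato")

# normalize=False keeps each completion's probability unnormalized, i.e.,
# Pr(completion | prompt) rather than a distribution over the completions
probs = predict_proba(
    same_prompt, completions, (model, tokenizer), normalize=False
)

log_probs = np.log(probs)  # log-probability of each completion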

Thank you for the quick reply! The reason I came across your package is that I'm on a short deadline and need to use a model other than GPT-2, which I used previously. If this is something you can help with, we can add you to the acknowledgments section of this paper. Do you have other models that work with your code (or another approach you may have code for)?
https://arxiv.org/abs/2308.14132

Both CAPPr and the HuggingFace tutorial should work with any model that can be loaded via AutoModelForCausalLM.from_pretrained, e.g., GPT-J, Llama, Mistral, or BLOOM.

In the HuggingFace tutorial, change

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda"
model_id = "gpt2-large"
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

to

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_id = "your-autoregressive-lm"  # for example, "EleutherAI/gpt-j-6B"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

and you should be good to go.
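
For reference, the rest of the tutorial boils down to something like this. It's a sketch that assumes the text fits within the model's context window; the tutorial's sliding-window version handles longer texts:

import torch

text = "The quick brown fox jumps over the lazy dog."
encodings = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    # Passing input_ids as labels makes the model return the mean
    # cross-entropy loss over the shifted next-token predictions
    loss = model(encodings.input_ids, labels=encodings.input_ids).loss

perplexity = torch.exp(loss)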

@galonpy lmk if you need more help. If you're all set, feel free to close the issue