ftramer/LM_Memorization

A question about calculatePerplexity

Ethan-Chen-plus opened this issue · 3 comments

I know that the following code can calculate the loss:
[screenshot of the repo's calculatePerplexity function]
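(For reference, the function in question presumably looks like the following; this is a reconstruction from context, the standard single-model version of the snippet below, so the exact repo code may differ:)

def calculatePerplexity(sentence, model, tokenizer):
    """
    exp(loss)
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    input_ids = input_ids.to(device)
    with torch.no_grad():
        # the input tokens themselves serve as the labels
        outputs = model(input_ids, labels=input_ids)
    loss, logits = outputs[:2]
    return torch.exp(loss)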
However, why are the labels set to input_ids? After reading the paper, I think the code should perhaps be:

def calculatePerplexity(sentence, model1, model2, tokenizer):
    """
    exp(loss)
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    input_ids = input_ids.to(device)
    # generate a continuation with model2 (gen_kwargs defined elsewhere)
    # and use it as the labels for model1
    output_ids = model2.generate(input_ids, **gen_kwargs).to(device)
    with torch.no_grad():
        outputs = model1(input_ids, labels=output_ids)
    loss, logits = outputs[:2]
    return torch.exp(loss)

This would test whether the outputs of the two models are the same; if they differ, maybe one of the models has memorized the training data.

This code is just the standard way of calculating a model's loss over a given set of tokens.
Say you have a sentence of tokens (t1, ..., tn). If you feed this into the model (model(input_ids)), the model will output n prediction vectors, where the i-th prediction vector contains the model's prediction for the i-th token given the first (i-1) tokens.
Then, to calculate a loss, we feed in the same tokens as the ground truth, so the model's prediction for the i-th token gets compared with the actual i-th token of the sentence.
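As a minimal sketch of what labels=input_ids does under the hood (assuming a standard Hugging Face causal LM such as GPT-2; the library shifts the labels internally so that token i is predicted from the tokens before it):

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentence = "Hello world, this is a test."
input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)

with torch.no_grad():
    # passing labels=input_ids makes the model compute the loss itself
    outputs = model(input_ids, labels=input_ids)

# Reproduce that loss by hand: the logits at position i predict token i+1,
# so drop the last logit and the first label before the cross-entropy.
logits = outputs.logits[:, :-1, :]   # predictions for tokens 2..n
targets = input_ids[:, 1:]           # ground-truth tokens 2..n
manual_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

assert torch.allclose(manual_loss, outputs.loss, atol=1e-5)
print(torch.exp(outputs.loss))  # the perplexity the function returns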

In the paper, we just compare two models by looking at the perplexities they assign to the same sentence.
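(For instance, a minimal sketch of that comparison, assuming a single-model calculatePerplexity(sentence, model, tokenizer) helper as above; the function name, model names, and threshold below are all illustrative:)

def is_memorization_candidate(sentence, large_model, small_model, tokenizer, threshold=4.0):
    ppl_large = calculatePerplexity(sentence, large_model, tokenizer)
    ppl_small = calculatePerplexity(sentence, small_model, tokenizer)
    # a sentence the large model finds far more likely than the small
    # model (a high perplexity ratio) is worth inspecting for memorization
    return (ppl_small / ppl_large) > threshold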

Thanks a lot!
But I am still wondering about this:

    def compute_loss(self, model, inputs, return_outputs=False):
        return model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            position_ids=inputs["position_ids"],
            labels=inputs["labels"],
        ).loss

But in this repo, the code uses input_ids as the labels; I thought we might use inputs["labels"] as the labels instead.
@ftramer

Where is this code from?
The code we use is a standard way of calculating perplexity in Hugging Face Transformers.
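(For context, a hedged illustration of why the two conventions usually coincide: in standard Hugging Face causal-LM training, inputs["labels"] is itself just a copy of input_ids, produced e.g. by the stock collator; whether the quoted Trainer uses this setup is an assumption:)

from transformers import DataCollatorForLanguageModeling, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

batch = collator([tokenizer("Hello world")])
# With mlm=False, the collator copies input_ids into labels (padding
# positions, if any, are set to -100 and ignored by the loss), so
# labels=inputs["labels"] and labels=input_ids compute the same loss.
print(batch["input_ids"])
print(batch["labels"])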