kohjingyu/fromage

Computing output likelihoods with the model

vishaal27 opened this issue · 7 comments

Hi, is it possible to get the tokenwise log-likelihood scores of different outputs from the model?

The use-case would be something like:
Given an interleaved image/text input and a list of candidate output texts, we should be able to get a score for each candidate and return their ranked list, rather than generating outputs directly. This would be close to how LLMs are evaluated on multiple-choice (MCQ) tasks. An example from page 6 of the T0 paper (https://arxiv.org/pdf/2110.08207.pdf):

For tasks that involve choosing the correct completion from several options (e.g. multiple choice question answering), we follow Brown et al. (2020) and use rank classification to evaluate our model: we compute the log-likelihood of each of the target options under the fine-tuned model and select the option with the highest log-likelihood as the prediction. For simplicity, we do not apply length normalization to the log-likelihoods of the target options.
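
Concretely, the kind of usage I have in mind is something like the following (just a sketch -- score_candidate is a hypothetical helper that would return the log-likelihood of a candidate continuation under the model, not an existing function in this repo; image is assumed to be a PIL.Image.Image and model the loaded FROMAGe model):

prompt = [image, 'Q: What animal is this? A:']  # interleaved PIL.Image.Image and str
candidates = ['a lazy cat', 'a happy dog']

# Score each candidate by its log-likelihood under the model and rank them.
scores = [score_candidate(model, prompt, c) for c in candidates]
prediction = candidates[scores.index(max(scores))]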

Is it straightforward to do this with FROMAGe? I assume it would go through the model's forward function at inference (I haven't dug into this yet)?

Yes, this should be easy to compute. All you would need to do is pass in the appropriate input (i.e., interleaved image + text + the target option) as the labels argument in the forward pass. Since we use the HuggingFace endpoint (https://github.com/kohjingyu/fromage/blob/main/fromage/models.py#L264), I think output.loss will already give you the negative log-likelihood (averaged over the label tokens), so the option with the lowest loss will be the answer.
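
For reference, the general mechanics with a plain HuggingFace causal LM look roughly like this (a sketch that ignores the image embeddings; facebook/opt-125m is just a stand-in model here, chosen only because FROMAGe's lm is an OPT model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in causal LM to illustrate how the labels/loss give a log-likelihood score.
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

def option_score(prompt: str, option: str) -> float:
  # outputs.loss is the mean per-token negative log-likelihood of the labels, so negate it.
  ids = tokenizer(prompt + option, return_tensors='pt').input_ids
  with torch.no_grad():
    outputs = model(input_ids=ids, labels=ids)
  return -outputs.loss.item()

options = ['a lazy cat', 'a happy dog']
best = max(options, key=lambda o: option_score('This is a photo of ', o))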

Alternatively, you can add a line to generate_for_images_and_texts() to return the loss rather than the embeddings. This might be easier if you have some non-standard input, since you can pass in any arbitrarily interleaved list of PIL.Images and str objects. You would probably just need to specify num_words=0 in the arguments, and edit this line to return outputs.loss rather than the generated text + images:

outputs = self.model.lm(inputs_embeds=input_embs, use_cache=False, output_hidden_states=True)

Hope that makes sense!

Great, thanks -- this is super helpful!

When I try the second option you suggested, outputs.loss is None. I specify num_words=0 and return outputs.loss, but this doesn't seem to work. My input to generate_for_images_and_texts() is the entire interleaved list of PIL images and strs. Could you let me know what the issue might be? I think the problem is that the labels argument to the lm is not set in this call, but I can't figure out what exactly to set it to.

Ah I see, you're right. I think you will need to pass in a labels tensor which contains the tokenizer() outputs for the text, and -100 at the image embedding positions. Something like this:

fromage/fromage/models.py, lines 275 to 278 (commit 2c89107):

full_labels = torch.cat([
  torch.zeros(prefix_embs.shape[:2], dtype=torch.int64).to(labels.device) - 100,
  full_labels
], axis=1)

So, for example, a sequence of <image>a lazy cat<image>a happy dog would be encoded as something like [-100, 2, 102, 22414, 4758, -100, 102, 1372, 2335] (from the OPT tokenizer) if each image is encoded as a single vector, which is the case for the base model (there is another vis4 checkpoint that embeds each image as 4 vectors, in which case each image would contribute [-100, -100, -100, -100] instead). You can then compute the loss with that sequence.
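
In code, the idea is roughly the following (a sketch, not runnable as-is; the token ids are the illustrative ones above, and input_embs would be the corresponding interleaved visual + text embeddings):

# Labels for "<image>a lazy cat<image>a happy dog" with the base model (one visual
# token per image); positions set to -100 are ignored by the HuggingFace loss.
labels = torch.tensor([[-100, 2, 102, 22414, 4758, -100, 102, 1372, 2335]])

# With input_embs built from the same interleaved sequence, the loss is the
# (token-averaged) negative log-likelihood of the text tokens:
outputs = self.model.lm(inputs_embeds=input_embs, labels=labels, use_cache=False, output_hidden_states=True)
score = -outputs.loss.item()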

I think that should work, but please let me know if it doesn't.

Thanks @kohjingyu, I was able to get the log-likelihood scores using your suggested method. This is my method inside FromageModel -- could you please check if it looks about right? [Note: this is for the base model, so one token per image.]

  def get_log_lik_scores(
    self, prompts: List):
    """
    Output the log likelihoods of the given interleaved prompts.

    Args:
      prompts: List of interleaved PIL.Image.Image and strings representing input to the model.
    Returns:
      log lik score of prompt sequence.
    """
    input_embs = []
    input_ids = []
    add_bos = True

    for i, p in enumerate(prompts):
      if type(p) == Image.Image:
        # Encode as image.
        pixel_values = utils.get_pixel_values_for_model(self.model.feature_extractor, p)
        pixel_values = pixel_values.to(device=self.model.logit_scale.device, dtype=self.model.logit_scale.dtype)
        pixel_values = pixel_values[None, ...]

        visual_embs = self.model.get_visual_embs(pixel_values, mode='captioning')  # (1, n_visual_tokens, D)
        input_embs.append(visual_embs)
        id_ = torch.tensor([-100], dtype=torch.int64).to(self.model.logit_scale.device).unsqueeze(0)
        input_ids.append(id_)
      elif type(p) == str:
        text_ids = self.model.tokenizer(p, add_special_tokens=True, return_tensors="pt").input_ids.to(self.model.logit_scale.device)
        if not add_bos:
          # Remove <bos> tag.
          text_ids = text_ids[:, 1:]
        else:
          # Only add <bos> once.
          add_bos = False

        text_embs = self.model.input_embeddings(text_ids)  # (1, T, D)
        input_embs.append(text_embs)
        input_ids.append(text_ids)
      else:
        raise ValueError(f'Input prompts should be either PIL.Image.Image or str types, got {type(p)} instead.')
    input_embs = torch.cat(input_embs, dim=1)
    input_ids = torch.cat(input_ids, dim=1)

    outputs = self.model.lm(inputs_embeds=input_embs, labels=input_ids, use_cache=False, output_hidden_states=True)
    return -outputs.loss.item()

Maybe another high-level question is: Now that I can compute these scores for any interleaved sequences, do you think length normalisation of the log-likelihood scores would be an important factor for comparing across sequences?

For example, if I am doing ImageNet evaluation, I would have sequences like <image> This is a photo of a tick vs. <image> This is a photo of a Mexican hairless dog (xoloitzcuintli). Would you normalise the log-likelihood scores by the length of the classnames for a fair comparison?

That looks good! Only thing I would change is:

id_ = torch.zeros(visual_embs.shape[:2], dtype=torch.int64).to(visual_embs.device) - 100

This will generalize to the case where visual_embs is more than a single vector (one of the checkpoints we released embeds each image as 4 vectors).
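
i.e., in the image branch of get_log_lik_scores above, something like:

        visual_embs = self.model.get_visual_embs(pixel_values, mode='captioning')  # (1, n_visual_tokens, D)
        input_embs.append(visual_embs)
        # One -100 label per visual token, so this also covers checkpoints that embed
        # each image as more than one vector.
        id_ = torch.zeros(visual_embs.shape[:2], dtype=torch.int64).to(visual_embs.device) - 100
        input_ids.append(id_)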

Maybe another high-level question is: Now that I can compute these scores for any interleaved sequences, do you think length normalisation of the log-likelihood scores would be an important factor for comparing across sequences?

In my experience measuring FROMAGe on things like VQA, normalization doesn't seem to be super important, and results appear mostly similar. I think it's mostly an empirical question, so if it's not too difficult I'd just try both.
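
For what it's worth, since outputs.loss from HuggingFace is already averaged over the label tokens that contribute to the loss, both variants are easy to get from the method above (a sketch; input_ids here is the labels tensor built in get_log_lik_scores):

# Labels are shifted by one inside the LM, and -100 positions are ignored,
# so this counts the tokens that actually contribute to outputs.loss.
num_scored = (input_ids[:, 1:] != -100).sum().item()

normalized_score = -outputs.loss.item()               # mean log-likelihood per token
total_score = -outputs.loss.item() * num_scored       # summed log-likelihood (no length normalization)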

Also: would you be interested in opening a pull request to add this functionality to models.py? I think it'd be a great addition. No worries if not -- I can also do it myself if that's ok with you. Thanks for looking into this!

Great, thanks -- opened one here: #14