illuin-tech/colpali

Possible to normalize colpali_engine.models.ColPaliProcessor.score_multi_vector() to value between 0-1?

plamb-viso opened this issue · 5 comments

Forgive my ignorance, as I have not been able to look very deeply at the paper, but I am very excited about this technology. I've been using the example code to run some experiments with ColPali, and I'm having a hard time interpreting the scores returned by score_multi_vector().

For instance, in the example, a white image and a black image are passed along with some text queries. When I run that example I get scores back like 9.9, 10.9, etc., so I guessed that maybe this is roughly the lowest score you can get. When I run an example where related text appears on the passed image, I get back scores like 19.8 or 27.8. What is the range of possible scores? How can I tell when the similarity is really high?

I tried to normalize the scores into 0-1 values using a technique I'd used on classic ColBERT results, where the maximum query length is used to normalize them:

import torch

def normalize_scores(batch_queries, scores):
    # Normalize by the longest query in the batch (ColBERT-style)
    maxlen = max(len(query) for query in batch_queries)
    percentage_scores = [(score.item() / maxlen) for score in scores.view(-1)]
    percentage_scores_tensor = torch.tensor(percentage_scores).reshape(scores.shape)
    return percentage_scores_tensor

But this often returns weird results, like scores greater than 1. Essentially my question is: do you have any documentation on how to interpret the scoring, and are there any existing functions in the package that normalize the scores to 0-1 values?

Hello!
The final score depends on the length of the query, since it is a sum over all query terms of the maximum similarity (MaxSim) with any image patch.

Each (query_token, patch) term is a dot product of normalized embeddings, so it is bounded in the (-1, 1) range, and most often it is > 0.

You can thus get a normalized score by taking the mean of these per-term scores. What might put you over 1 is that we append tokens at the end of the query as a sort of query augmentation scheme, plus a "query:" prefix. If you divide the score by the number of tokens in your query + 12 or so to account for these extra tokens, I would imagine you should be good :)
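To make that concrete, here is a rough sketch of what the scoring boils down to, using toy tensors and assuming L2-normalized embeddings (this is just an illustration, not the exact implementation in colpali-engine):

import torch

# Toy shapes: 20 query tokens, 1030 image patches, 128-dim embeddings
query_embeddings = torch.nn.functional.normalize(torch.randn(20, 128), dim=-1)
image_embeddings = torch.nn.functional.normalize(torch.randn(1030, 128), dim=-1)

# (n_query_tokens, n_patches) matrix of dot products, each bounded in (-1, 1)
similarity = query_embeddings @ image_embeddings.T

# MaxSim: keep the best patch per query token, then sum over query tokens.
# The raw score therefore grows with the number of query tokens.
raw_score = similarity.max(dim=1).values.sum()

# Dividing by the number of query tokens brings it back into (-1, 1), usually (0, 1)
normalized_score = raw_score / query_embeddings.shape[0]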

In case it helps anyone else, here is how I normalized them:

import torch

def normalize_scores(batch_queries, scores):
    # True (unpadded) token count per query, taken from the attention mask
    attention_mask = batch_queries["attention_mask"].cpu()
    query_lengths = attention_mask.sum(dim=1, dtype=torch.float32).unsqueeze(1)

    # Avoid division by zero
    query_lengths[query_lengths == 0] = 1

    # Divide each row of scores by its query's token count
    normalized_scores = scores / query_lengths
    return normalized_scores

The reason for using batch_queries['attention_mask'] instead of just the query_embeddings is that the processor pads each query to the longest query in the batch, so the embedding length doesn't match the actual query length. Dividing by the padded length degrades the normalization as the batch size (and therefore the amount of padding) grows.

scores is the output of colpali-engine's score_multi_vector().
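For anyone piecing this together later, here is roughly how everything fits end to end. This is a sketch based on the colpali-engine README example; the checkpoint name, placeholder images, and query string are just illustrations, so adapt them to your setup:

import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # example checkpoint; substitute the one you use
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="cuda:0").eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Placeholder inputs (a white and a black page image, like the example)
images = [Image.new("RGB", (448, 448), "white"), Image.new("RGB", (448, 448), "black")]
queries = ["Which page shows the revenue breakdown for 2022?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)  # shape: (n_queries, n_images)
normalized = normalize_scores(batch_queries, scores.to("cpu"))  # roughly in the 0-1 range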

Nice! Yup, for sure, padding needs to be removed, GG!
Curious why you would need to re-add extra tokens though, if you count the non-zero inputs like this?

How does it work? Do the confidence values make sense?

Ah, that's a great point -- I guess summing over the attention mask already accounts for the extra tokens the model adds. I'll edit my post.

The confidence values make sense in my few trials so far: page images that I would expect to be more similar get scores that clearly set them apart from page images that lack similarity, generally by a difference of about 15-20 percentage points.

Awesome to know, thanks!