illuin-tech/colpali

What is the granularity unit of input chunks?


I don't understand the chunk granularity unit or what ndcg@5 measures. Is the input chunk a single-page image, and is ndcg@5 computed over a collection of single-page images?

In the image below, I'm confused by "a page is divided into 2.1 chunks". What does that mean? Is the result of the division still an image, or something else? Does the division apply to ColPali or to the bi-encoders?

Also, why does the benchmark use different granularities for ColPali and the bi-encoders? Is that acceptable? As far as I know, even if the ndcg@5 computation groups the text chunks by source image to treat both fairly, it seems hard to get the text chunks retrieved by the bi-encoders grouped and ordered by image (ndcg@5 is sensitive to order, and I don't know whether interleaved text chunks from different images are acceptable).

image

Hey! In that case, we're talking about the text model baselines. One chunk is a chunk created by Unstructured, which will often be something like a paragraph, a table, etc.
In all cases it's just text; ColPali has no chunks other than page level.

Since the granularities are different, we use the best-scoring chunk of each page as the page matching score (which actually favors the text models). This lets the text embedding models work with semantically coherent text sequences of limited length, which is where they work best, as is done in most modern pipelines!
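
To make the aggregation concrete, here is a rough sketch of how chunk-level scores could be collapsed to page-level scores before computing ndcg@5. This is not the benchmark's actual code: the function names, the chunk IDs, and the binary-relevance assumption are all hypothetical and only meant to illustrate the "best chunk of the page" idea.

```python
import math
from collections import defaultdict


def page_scores_from_chunks(chunk_scores, chunk_to_page):
    """Collapse chunk scores to page scores by keeping the best chunk per page.

    Both arguments are hypothetical dicts: chunk_id -> score and chunk_id -> page_id.
    """
    best = defaultdict(lambda: float("-inf"))
    for chunk_id, score in chunk_scores.items():
        page_id = chunk_to_page[chunk_id]
        best[page_id] = max(best[page_id], score)
    return dict(best)


def ndcg_at_k(page_scores, relevant_pages, k=5):
    """ndcg@k over pages, assuming binary relevance (a page is relevant or not)."""
    ranked = sorted(page_scores, key=page_scores.get, reverse=True)[:k]
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, page_id in enumerate(ranked)
        if page_id in relevant_pages
    )
    ideal_hits = min(len(relevant_pages), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: three pages, each split into one or two text chunks.
chunk_scores = {"p1_c0": 0.2, "p1_c1": 0.9, "p2_c0": 0.5, "p2_c1": 0.4, "p3_c0": 0.7}
chunk_to_page = {"p1_c0": "p1", "p1_c1": "p1", "p2_c0": "p2", "p2_c1": "p2", "p3_c0": "p3"}

page_scores = page_scores_from_chunks(chunk_scores, chunk_to_page)
score = ndcg_at_k(page_scores, relevant_pages={"p3"}, k=5)
```

With this kind of aggregation, each page appears only once in the ranking, so the text baselines and ColPali are both evaluated on a ranked list of pages, and interleaved chunks from different pages do not affect the metric.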