illuin-tech/colpali

Inconsistent results - hosted demo vs. running locally

RahulNET opened this issue · 44 comments

Firstly, thanks for this cool work combining Paligemma and Colbert!

Given a PDF file
When I try the same PDF using the HF demo app at: https://huggingface.co/spaces/manu/ColPali-demo
And I try the same PDF using my locally hosted demo instance using the provided demo code
Then, the results from HF demo app are matching expectation
But, the results from the locally hosted instance are quite poor.

Could you please suggest what could be wrong?

Thanks!

On the local system I use, the results look random on each invocation.

Interesting! There is some stochasticity in the loading, but "random" is a bit weird... What version of PEFT are you using?

You are loading the adapters, right?

yes, we are directly using the code at: https://github.com/illuin-tech/colpali/blob/main/demo/app.py

Regarding the PEFT version, I shall let you know in a few minutes.

Could this issue regarding custom_text_proj weights be related: #11 ?

Hi,
We are using PEFT version 0.11.1. (I am working with Rahul.)

Surely both issues are related, but although there is some non-determinism, I never observed large performance differences... You seem to say it goes from working to random; the most I observed with this adapter loading is a ~1% difference, so there may be something else going on...

Yes, given our query, we are getting random results for the pages, the same as mentioned by @amit-jain. This seems to be an issue when running locally in offline mode. I was able to replicate this on different systems.

Locally in online mode it doesn't do this?

As you suggested, we are testing now local with online mode. Will update here shortly.

It shouldn't change normally... I'm also trying to understand... I can understand if it's a bit inconsistent but weird if results are truly bad / random...
In all cases, I'm planning on releasing a perfectly reproducible model without the adapters (harder to download but worth it if people have problems like that)

It shouldn't change normally... I'm also trying to understand... I can understand if it's a bit inconsistent but weird if results are truly bad / random...
In all cases, I'm planning on releasing a perfectly reproducible model without the adapters (harder to download but worth it if people have problems like that).
I'm guessing torch init is somewhat hardware dependent and some setups are just too different from the ones I was using (or the HF space is using)

Yes, it seems the quantization has some hardware affinity. It would be great and helpful if you could release the full model soon instead of the QLoRA adapters.

@ManuelFay I was trying to replicate the vespa notebook, and the PEFT version is 0.11.1.
The randomness is from the point of view of retrieval (returning 2) using the above notebook on a 4-page internal PDF. I get entirely different results in different runs.

So, for example the max_sim scores returned for 2 runs

  • {'0': 14.057649612426758, '1': 15.255558013916016, '2': 13.401391983032227, '3': 14.392743110656738}
  • {'0': 27.25721549987793, '1': 25.002473831176758, '2': 25.534408569335938, '3': 26.65641212463379}
    Looking at the docs, the second run's scores make sense, but I am guessing a little bit of instability (3-4%) dramatically changes the result for the application.
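For reference, the max_sim here is the standard ColBERT-style late-interaction score: for each query token, take the maximum dot product over the page's patch embeddings, then sum over query tokens. A rough torch sketch of what is being computed (variable names are mine, not from the notebook):

import torch

def max_sim(query_emb: torch.Tensor, page_emb: torch.Tensor) -> float:
    # query_emb: (n_query_tokens, 128), page_emb: (n_patches, 128)
    sims = query_emb @ page_emb.T               # token-to-patch dot products
    return sims.max(dim=1).values.sum().item()  # max over patches, summed over query tokens

Because the score is a sum over query tokens, its absolute magnitude scales with the query length.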

At the application level it might also get accentuated by different initializations during the pre-processing phase and during the query phase. The effect would probably have been smaller if the same random initialization had been used for both.

Those scores should never be so different (doubling) except if the queries double in length basically...
Pretty sure there is something else going on than just the adapters (might be the precision?)

Having said that, what I'll do for the time being is:

  • train a model and export both the adapters and the base model with the random linear layer and upload everything for perfect reproducibility

  • upload both a bfloat16 version and a float32 version for hardware compatibility

This hopefully should fix your issues (and the model should also be better than the current version)

I am currently on vacation, this will come either Aug. 20, or a bit earlier if I have time.

At the moment, you can already just load the model and export the base model, then use that same export every time instead of PaliGemma to guarantee there is no random init on loading!
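Concretely, something like this (untested sketch, the local path is just an example):

import torch
from colpali_engine.models.paligemma_colbert_architecture import ColPali

# Load once: this is where the projection layer gets its random init
base = ColPali.from_pretrained("google/paligemma-3b-mix-448", torch_dtype=torch.bfloat16).eval()
base.save_pretrained("./colpali-base-frozen")  # example path, freezes the init to disk

# From then on, always reload the frozen base instead of PaliGemma, then the adapter
model = ColPali.from_pretrained("./colpali-base-frozen", torch_dtype=torch.bfloat16).eval()
model.load_adapter("vidore/colpali")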

Cheers and very sorry about that,
Manu

Thanks @ManuelFay

sorry about that

Please don't be. Really appreciate you helping us out.

Those scores should never be so different (doubling) except if the queries double in length basically...

I have seen scores around 45-ish as well. Mostly I have seen scores around 15-16, but with quite a bit of instability, with basically any one of the 4 pages being ranked higher.

No worries. In the meantime, I'll try what you suggest.

Those scores should never be so different (doubling) except if the queries double in length basically... Pretty sure there is something else going on than just the adapters (might be the precision?)

Having said that, what I'll do for the time being is:

  • train a model and export both the adapters and the base model with the random linear layer and upload everything for perfect reproducibility
  • upload both a bfloat16 version and a float32 version for hardware compatibility

This hopefully should fix your issues (and the model should also be better than the current version)

I am currently on vacation, this will come either Aug. 20, or a bit earlier if I have time.

At the moment, you can already just load the model and export the base model, then use that same export every time instead of PaliGemma to guarantee there is no random init on loading!

Cheers and very sorry about that, Manu

Apologies for troubling you during your vacation! And sincere thanks for helping us out!

I tried running the code in a Jupyter notebook where the model is loaded only once. For the same PDF and query, the results come out the same across several runs (with the model loaded once in memory). However, the results are still poor compared to the Hugging Face hosted demo instance.

When I look at the max_sim scores (sorted in descending order), for the first ~5-10 entries they are the same most of the time, and the remaining ones also differ by a very small margin. At times, all the scores are the same. I suspect loss of precision is causing this issue. Possibly the Hugging Face instance is using float32 instead of bfloat16, or it could be due to hardware-induced precision issues.
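One way to test this hypothesis is to force float32 end to end when loading locally, e.g. (sketch, using the same model names as the demo code):

import torch
from colpali_engine.models.paligemma_colbert_architecture import ColPali

# Same loading path as the demo, but with an explicit float32 dtype
model = ColPali.from_pretrained("google/paligemma-3b-mix-448", torch_dtype=torch.float32).eval()
model.load_adapter("vidore/colpali")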

Hey - I created this repo that reproduces this error. It seems we already know the root cause, but just sharing.
I tried with CPU/GPU locally and on my k8s cluster, and I still get "random" results.

https://github.com/marcoaleixo/colpali-fastapi

Hey - so everything should be deterministic now!
Would be awesome if you guys can confirm using this new model:
https://huggingface.co/vidore/colpali-v1.1

and the code in branch: https://github.com/illuin-tech/colpali/tree/hard-negs (optional but should get you better performance and fixes a padding issue)

The base model version is fixed!

With the original adapter model (https://huggingface.co/vidore/colpali) using mps with float, I'm able to reproduce the results from the paper on DocVQA with an acceptable error margin.

  • Paper reports nDCG@5 54.4
  • ColPali 1.0 with float vectors (doc and query) Vespa nDCG@5 53.7
  • ColPali 1.0 with binary vectors (doc) Vespa nDCG@5 50.7

If you are experiencing Vespa "random results" on custom data using this notebook, please check that you are not experiencing graceful degradation, as the timeout might have to be adjusted upwards. You might also be better off modelling it with one page per Vespa document, instead of the notebook's approach of representing one PDF in one document with the page as a tensor dimension.

from vespa.package import Schema, Document, Field, FieldSet

colpali_schema = Schema(
    name="pdf_page",
    document=Document(
        fields=[
            Field(name="url", type="string", indexing=["summary"]),
            Field(name="url_hash", type="int", indexing=["summary", "attribute"]),
            Field(name="page_number", type="int", indexing=["summary", "attribute"]),
            Field(
                name="title",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="authors",
                type="string",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="text",
                type="string",
                indexing=["index"],
                index="enable-bm25",
            ),
            Field(
                name="image",
                type="string",
                indexing=["summary"],
            ),
            Field(
                name="embedding",
                type="tensor<bfloat16>(patch{}, v[128])",
                indexing=["attribute"],
            )
        ]
    ),
    fieldsets=[FieldSet(name="default", fields=["title", "text"])]
)
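Feeding is then one Vespa document per page, roughly like this (sketch: app is an already deployed pyvespa Vespa instance, and the field values are placeholders from your own extraction pipeline):

app.feed_data_point(
    schema="pdf_page",
    data_id=f"{url_hash}-{page_number}",
    fields={
        "url": url,
        "url_hash": url_hash,
        "page_number": page_number,
        "title": title,
        "text": page_text,
        "image": base64_image,
        # mixed tensor: one 128-dim vector per patch index
        "embedding": {"blocks": {str(i): v for i, v in enumerate(page_embedding.tolist())}},
    },
)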

And with optional hybrid ranking like

from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking, GlobalPhaseRanking

colpali_profile = RankProfile(
    name="default",
    inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
    functions=[
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * attribute(embedding) , v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="bm25_score", expression="bm25(title) + bm25(text)"
        )
    ],
    first_phase=FirstPhaseRanking(expression="bm25_score"),
    second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=100),
    match_features=["max_sim", "bm25_score"],
)
colpali_hybrid_profile = RankProfile(
    name="hybrid",
    inherits="default",
    first_phase=FirstPhaseRanking(expression="bm25_score"),
    second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=100),
    match_features=["max_sim", "bm25_score"],
    global_phase=GlobalPhaseRanking(expression="reciprocal_rank_fusion(max_sim, bm25_score)"),
)
colpali_schema.add_rank_profile(colpali_profile)
colpali_schema.add_rank_profile(colpali_hybrid_profile)
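Querying with these profiles then looks roughly like this (sketch: the endpoint, query text and query_embeddings variable are placeholders, and the timeout is where the graceful-degradation adjustment mentioned above goes):

from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)  # placeholder endpoint

# query_embeddings: one 128-dim ColPali vector per query token
qt = {str(i): v for i, v in enumerate(query_embeddings.tolist())}

response = app.query(
    body={
        "yql": "select title, page_number from pdf_page where userQuery()",
        "query": "example query text",
        "ranking": "hybrid",
        "input.query(qt)": qt,
        "timeout": "5s",  # raise this if coverage drops below 100%
    }
)
for hit in response.hits:
    print(hit["relevance"], hit["fields"].get("page_number"))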

Thanks @ManuelFay, I see positive results with the new model versions. I'll test more. I was thinking that the bfloat16 -> float32 conversion for mps was problematic, but as @jobergum suggests above, that's not the case.

Thanks @jobergum. From what I understand, the coverage would be reported as less than 100% if there's a graceful degradation condition. I see 100% coverage, so we're not hitting this case as of now. But it's good to keep this in mind.

From what I understand, the coverage would be reported as less than 100% if there's a graceful degradation condition

Yes, that is correct.

Hey - so everything should be deterministic now!
Would be awesome if you guys can confirm using this new model:
https://huggingface.co/vidore/colpali-v1.1

and the code in branch: https://github.com/illuin-tech/colpali/tree/hard-negs (optional but should get you better performance and fixes a padding issue)

By performance do you mean better accuracy or throughput? I'm running the branch now with the v1.1 adapter and encoding the images is a lot faster now, from 50 seconds per batch of 4 to 15 seconds.

That's very weird... The images are encoded exactly the same way, the queries are now right padded instead of left padded (and model 1.1 is trained with this padding change basically) so speed should not change for image encoding...
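(For context, with HF processors the switch involved is of this kind; not necessarily the exact code in the branch:)

# Queries are now right padded; v1.1 is trained with this padding
processor.tokenizer.padding_side = "right"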

We will version everything before pushing this on main.

Curious to understand why your images run faster though... Old model with new code / old code with new model gives the same speeds?

I haven't tried combinations.

With git+https://github.com/illuin-tech/colpali/@hard-negs + the v1.1 model.

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from datasets import load_dataset

from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.utils.colpali_processing_utils import process_images

# Pick device and dtype (bfloat16 on CUDA, float32 on MPS/CPU)
if torch.cuda.is_available():
    device = torch.device("cuda")
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    dtype = torch.float32
else:
    device = torch.device("cpu")
    dtype = torch.float32

model_name = "vidore/colpali-v1.1"
model = ColPali.from_pretrained("google/paligemma-3b-mix-448", torch_dtype=dtype).eval()
model.load_adapter(model_name)
model.to(device)
processor = AutoProcessor.from_pretrained(model_name)

ds = load_dataset("vidore/docvqa_test_subsampled", split="test")

dataloader = DataLoader(
    ds["image"],
    batch_size=4,
    shuffle=False,
    collate_fn=lambda x: process_images(processor, x),
)

# Encode all pages and keep the per-page multi-vector embeddings on CPU
embeddings = []
for batch_doc in tqdm(dataloader):
    with torch.no_grad():
        batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
        embeddings_doc = model(**batch_doc)
        embeddings.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

This gives me about 15 seconds per batch of 4, so just under 4 seconds per image on mps on my M1. This is lower than what it used to be, but it could also be some system variance. This is over the 500 pages from the docvqa subset.


On this docvqa dataset, I do get a slight degradation in nDCG compared to v1.0, though (from 53 to 51):

{nDCG@1: 0.448,
 nDCG@10: 0.5484406432471343,
 RR@10: 0.5137984126984129,
 nDCG@5: 0.5302204114231861}

CUDA on Colab with a T4 is about 50% faster than MPS, at 10 seconds per batch of 4 (notebook).


For the perf, the base model should not be the Google version but rather our fixed base model:

model_name = "vidore/colpali-v1.1"
model = ColPali.from_pretrained("vidore/colpaligemma-3b-mix-448-base", torch_dtype=torch.bfloat16, device_map="cuda").eval()
model.load_adapter(model_name)
processor = AutoProcessor.from_pretrained(model_name)

I'll look into the speed thing! I had noticed the data loader is often CPU-bottlenecked when feeding images as a list at init, but that's probably unrelated here.

I feel what was weirder was the 50 seconds per batch of 4, to be honest @jobergum!

I tried out the colab code. I get about ~12 sec per batch of 4 with (new branch, new model / old branch, new model / old branch, old model), so perhaps it's an environment difference more than anything else?

Thanks, I will try the new base model! I'd missed this. Yes, this could maybe be something with the M1 device, or Python gc, etc. It was a long-lived kernel where I observed this, so it might be a red herring and there is no real throughput change.

I'm testing the new basemodel + adapter for accuracy now on docvqa.

BTW, do you have any nDCG numbers for docvqa with the new base model and adapter? And are these compatible with the hard-negs branch?

model_name = "vidore/colpali-v1.1"
model = ColPali.from_pretrained("vidore/colpaligemma-3b-mix-448-base", torch_dtype=type).eval()
model.load_adapter(model_name)
model.to(device)
processor = AutoProcessor.from_pretrained(model_name)

Yes, I got 0.55 nDCG@5 on DocVQA with colpali-v1.1 + the new branch.

https://huggingface.co/spaces/vidore/vidore-leaderboard ---> refresh button

@ManuelFay Sorry for polluting this conversation here, but what's the difference between the top 2 models and colpali-v1.1 (3rd) listed on the vidore leaderboard?

Differences should be explained in the model cards - but basically, one is trained from another PaliGemma ckpt (the pt version instead of mix), and the other is trained with a bigger training set (adding docmatix samples). Better models will come, but we wanted to keep 1.1 exactly the same as the original one to be able to test for non-regression!

I get different results depending on:

  • whether I torch.compile the model or not
  • the gpu itself (L4, T4)
  • different runs with the same exact setup but different allocated GPUs of the same type

They only vary between 0.535 and 0.55 nDCG@5 on DocVQA, but there really shouldn't be any randomness at any point, so I am a bit surprised; probably hardware things.
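For what it's worth, here is the kind of thing one can set to rule out software-level randomness before blaming hardware (a sketch, not something from the repo):

import os
import torch

# Must be set before the first CUDA call to get deterministic cuBLAS GEMMs
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)
torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cudnn.benchmark = False

# Whatever differences remain after this are dtype/hardware related
# (e.g. bfloat16 reduction order differing between GPU generations), not RNG state.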

Thanks for all the support @ManuelFay, with the new base model and the new adapter I get the following results combining ColPali with Vespa

Using full precision document vectors

{nDCG@1: 0.448,
 nDCG@5: 0.5440631335340895,
 RR@10: 0.5238023809523807,
 nDCG@10: 0.5627166712532925}

Using bit vectors (quantized by >0) and reducing storage by 32x

{nDCG@1: 0.436,
 nDCG@5: 0.5236435792799352,
 RR@10: 0.5055357142857142,
 nDCG@10: 0.5412033513993704}

Okay, got it: the LoRA adapters also have dropout, which we have to disable even if the base model is in eval mode.

model_name = "vidore/colpali-v1.1"
model = ColPali.from_pretrained("vidore/colpaligemma-3b-mix-448-base", torch_dtype=torch.bfloat16, device_map="cuda").eval()
model.load_adapter(model_name)
model = model.eval()
processor = AutoProcessor.from_pretrained(model_name)

I get deterministic results like this!
Thanks to everybody on this issue for the reporting; I will document and close the issue when I merge the branch :)

Also, I figured out that with MPS you need to check whether Google Chrome is spinning the GPU and using all its capacity; that was the root cause of getting 50 sec per batch instead of 15 sec per batch :)

Hi guys, are the changes from the original vidore/colpali to vidore/colpali-v1.2 and from the google/paligemma-3b-pt-448 base vision model to vidore/colpaligemma-3b-pt-448-base documented somewhere? It's not very clear to me just by looking at the model cards.

From reading this thread and the model cards I see these possible changes:

  • fp32 instead of bf16 checkpoints for hardware compatibility (base shows bf16 still, and adapter doesn't display data type at all)
  • random linear layer in both the adapter and the base model (more on this would be appreciated)
  • adapter trained for 5 epochs, with in-batch negatives and hard mined negatives (can we find traces of this in the code? if so, where?)

Also, most of the LoRAs I see are constructed with rank=alpha/2, whereas your ColPali uses rank=alpha=32. How was this chosen?

Finally, will the changes the hard-negs branch contains be documented in some kind of changelog when it is merged?

This issue might not be the best place for these questions. Would you consider opening the "discussion" section in your github settings?

Thanks!

Hey - for the moment, v1.2 is not "official" as it is based on the hard-negs branch code. This is why it is not displayed by default on the leaderboard. By the end of the week, we will version the current main branch, merge the hard-negs branch and version it as well, and document all changes in a detailed changelog.

All code for training v1.2 is in the model card of the model, but it will be detailed and expanded upon at the "official" release, once branches are cleanly merged and versioned.

Furthermore, these model checkpoints are experimental more than anything else; step-change improvements will come through better data, and we will soon release models that are better across languages, document types, harder documents... The v1.x models are just trying to push training a bit further with the same data as in the paper!

Thanks for the interest !

Hi @ManuelFay !

I will wait until the big merges before creating a branch of my own. Until then, I will play with the benchmark repo and see how 1.2 compares with the "official" ones. Talk to you soon and good luck with the merges!

I decided to modify the vidore benchmark to evaluate colpali-1.2 with my branch. While it was running I was comparing the results with the benchmark page. And at some point I could see colpali-1.2 up there and then I refreshed the browser and it disappeared. Did I hallucinate?

Anyway, here's what I got on an L40S, PyTorch 2.4.0 and CUDA 12.4 (I only kept the score lines):

NDCG@5 for vidore/colpali-v1.2 on vidore/arxivqa_test_subsampled: 0.78518
NDCG@5 for vidore/colpali-v1.2 on vidore/docvqa_test_subsampled: 0.55965
NDCG@5 for vidore/colpali-v1.2 on vidore/infovqa_test_subsampled: 0.81636
NDCG@5 for vidore/colpali-v1.2 on vidore/tabfquad_test_subsampled: 0.89177
NDCG@5 for vidore/colpali-v1.2 on vidore/tatdqa_test: 0.67537
NDCG@5 for vidore/colpali-v1.2 on vidore/shiftproject_test: 0.81018
NDCG@5 for vidore/colpali-v1.2 on vidore/syntheticDocQA_artificial_intelligence_test: 0.97393
NDCG@5 for vidore/colpali-v1.2 on vidore/syntheticDocQA_energy_test: 0.95155
NDCG@5 for vidore/colpali-v1.2 on vidore/syntheticDocQA_government_reports_test: 0.92696
NDCG@5 for vidore/colpali-v1.2 on vidore/syntheticDocQA_healthcare_industry_test: 0.95016

Safe to say it comes out on top. Can't wait to read about the improvements you've made! I will try to find out with the material available now (code, model cards, etc.). Good job!

Yeah, v1.2 is up there on the leaderboard! Changes are detailed in the model card, but essentially doing multiple epochs + a longer warmup helped with non-English language performance, as there was less forgetting!
Fun to see your nice results with the old code; normally you should use colpali-v1.2 with colpali-engine>=0.2.0 for best performance. The main branch is now merged with the hard-negs branch and versioned!

@ManuelFay: Fun to see your nice results with the old code; normally you should use colpali-v1.2 with colpali-engine>=0.2.0 for best performance.

From what I gathered, the difference between the new code and the old code is the "pixel_values" linear projection postprocessing on what you get from PaliGemma, correct? I'm not sure what this means: "guarantee deterministic projection layer initialization". Is there something I can read to help me understand?

Are there finetunes you made to the weights (and not just the code) of the original google/paligemma-3b-pt-448 model in vidore/colpaligemma-3b-pt-448-base besides the conversion from fp32 to bf16?

Thanks!

"Base" models are not finetuned, just a way to not have to randomly initialize the projection layers we add to PaliGemma everytime we load the model as I explained in the other issue you opened

I am not sure what you mean by the pixel values stuff... The fix with the padding is referenced in the changelog and detailed in the linked issue. Don't hesitate to send me a mail (I'm the corresponding author on the paper) if you need extra explanations; it might be simpler.