Florence-2 vs Grounding DINO + SAM2
radames opened this issue · 1 comment
Hello, thanks for the awesome collection of demos and code.
I wonder if you have benchmarks or comparisons of the text-grounded segmentation capabilities of Grounding DINO vs Florence-2? I've been testing both with SAM2, and my qualitative impression is that Florence-2 is more precise, matching more tokens to boundaries, and it can also detect a more diverse set of objects using just the base model, before any fine-tuning.
At the same time, I wasn't able to extract confidence scores for the individual bboxes generated by Florence-2.
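For context, here is roughly the pipeline I've been running; treat it as a sketch rather than a reference implementation (the model IDs, thresholds, checkpoint/config paths, and the example captions are just my local choices, not anything from this repo). It also shows where the asymmetry comes from: the Grounding DINO processor returns per-box scores, while Florence-2's boxes are parsed out of generated text, so there is no obvious per-box confidence to read off.

```python
import numpy as np
import torch
from PIL import Image
from transformers import (
    AutoModelForCausalLM,
    AutoModelForZeroShotObjectDetection,
    AutoProcessor,
)

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("demo.jpg").convert("RGB")  # placeholder image

# --- Grounding DINO: post-processing exposes a per-box confidence score ---
gd_id = "IDEA-Research/grounding-dino-tiny"
gd_processor = AutoProcessor.from_pretrained(gd_id)
gd_model = AutoModelForZeroShotObjectDetection.from_pretrained(gd_id).to(device)
gd_inputs = gd_processor(images=image, text="a cat. a remote control.", return_tensors="pt").to(device)
with torch.no_grad():
    gd_outputs = gd_model(**gd_inputs)
gd_results = gd_processor.post_process_grounded_object_detection(
    gd_outputs,
    gd_inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)[0]
gd_boxes = gd_results["boxes"].cpu().numpy()  # (N, 4) in xyxy
gd_scores = gd_results["scores"]              # <- per-box confidence lives here

# --- Florence-2: boxes are decoded from generated text, no per-box score ---
fl_id = "microsoft/Florence-2-base"
fl_processor = AutoProcessor.from_pretrained(fl_id, trust_remote_code=True)
fl_model = AutoModelForCausalLM.from_pretrained(fl_id, trust_remote_code=True).to(device)
task = "<CAPTION_TO_PHRASE_GROUNDING>"
fl_inputs = fl_processor(text=task + "a cat and a remote control", images=image, return_tensors="pt").to(device)
with torch.no_grad():
    generated_ids = fl_model.generate(
        input_ids=fl_inputs["input_ids"],
        pixel_values=fl_inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
generated_text = fl_processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
fl_results = fl_processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)[task]
fl_boxes = fl_results["bboxes"]  # list of [x1, y1, x2, y2]; phrases in fl_results["labels"], no scores

# --- SAM2: prompt it with whichever set of boxes you want to compare ---
sam2_model = build_sam2(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "./checkpoints/sam2.1_hiera_large.pt", device=device
)
predictor = SAM2ImagePredictor(sam2_model)
predictor.set_image(np.array(image))
masks, mask_scores, _ = predictor.predict(box=np.array(fl_boxes), multimask_output=False)
```

With Grounding DINO I can filter or rank boxes by `gd_scores` before handing them to SAM2, but I don't see an equivalent signal in the Florence-2 output.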
Hi @radames
Your observations are very thorough, and the questions you've raised are valuable.
We haven't benchmarked the two approaches implemented in this repo ourselves, but I believe each of these models currently has its own strengths.
For Grounding DINO 1.5, we can see its zero-shot detection capability is stronger than Florence-2's: it achieves 54.3 AP and 55.7 AP zero-shot on LVIS minival, while Florence-2 achieves 43.4 AP on the COCO zero-shot benchmark.
But after training on the FLD-5B dataset, Florence-2 can not only localize the main phrases of a caption but also has a strong referring capability; you can refer to the following table:
And it can also serve as a foundation model that users can fine-tune on their own specific scenarios.
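To make the distinction between phrase grounding and referring a bit more concrete, here is a minimal sketch of switching Florence-2 between the two behaviours via its task prompts. The task tokens follow the Florence-2 model card usage; the image path and text queries are just placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
image = Image.open("demo.jpg").convert("RGB")  # placeholder image

def run_task(task_prompt, text=""):
    # Florence-2 exposes every capability as text generation behind a task token.
    inputs = processor(text=task_prompt + text, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        raw, task=task_prompt, image_size=(image.width, image.height)
    )[task_prompt]

# Ground every main phrase of a caption to a box.
print(run_task("<CAPTION_TO_PHRASE_GROUNDING>", "a person riding a bicycle on the street"))

# Localize a single referring expression instead of the whole caption.
print(run_task("<REFERRING_EXPRESSION_SEGMENTATION>", "the person on the left"))
```

The first call returns a box per grounded phrase, while the second returns the region for the referred expression, and either output can then be used as a prompt for SAM2 in the same way as the Grounding DINO boxes.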