Florence-2 vs Grounding DINO + SAM2
radames opened this issue · 1 comment
Hello, thanks for the awesome collection of demos and code.
I wonder if you have benchmarks or comparisons of the text-grounded segmentation capabilities of Grounding DINO vs Florence-2? I've been testing both with SAM2, and my qualitative impression is that Florence-2 is more precise, matching more tokens to boundaries, and it can also detect a more diverse set of objects using just the base model, before any fine-tuning.
At the same time, I wasn't able to extract confidence scores for the individual bboxes generated by Florence-2.
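For context, here is roughly the pipeline I've been running; treat it as a sketch rather than a reference implementation (the model IDs, thresholds, checkpoint/config paths, and the example captions are just my local choices, not anything from this repo). It also shows where the asymmetry comes from: the Grounding DINO processor returns per-box scores, while Florence-2's boxes are parsed out of generated text, so there is no obvious per-box confidence to read off.

```python
import numpy as np
import torch
from PIL import Image
from transformers import (
    AutoModelForCausalLM,
    AutoModelForZeroShotObjectDetection,
    AutoProcessor,
)

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("demo.jpg").convert("RGB")  # placeholder image

# --- Grounding DINO: post-processing exposes a per-box confidence score ---
gd_id = "IDEA-Research/grounding-dino-tiny"
gd_processor = AutoProcessor.from_pretrained(gd_id)
gd_model = AutoModelForZeroShotObjectDetection.from_pretrained(gd_id).to(device)
gd_inputs = gd_processor(images=image, text="a cat. a remote control.", return_tensors="pt").to(device)
with torch.no_grad():
    gd_outputs = gd_model(**gd_inputs)
gd_results = gd_processor.post_process_grounded_object_detection(
    gd_outputs,
    gd_inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)[0]
gd_boxes = gd_results["boxes"].cpu().numpy()  # (N, 4) in xyxy
gd_scores = gd_results["scores"]              # <- per-box confidence lives here

# --- Florence-2: boxes are decoded from generated text, no per-box score ---
fl_id = "microsoft/Florence-2-base"
fl_processor = AutoProcessor.from_pretrained(fl_id, trust_remote_code=True)
fl_model = AutoModelForCausalLM.from_pretrained(fl_id, trust_remote_code=True).to(device)
task = "<CAPTION_TO_PHRASE_GROUNDING>"
fl_inputs = fl_processor(text=task + "a cat and a remote control", images=image, return_tensors="pt").to(device)
with torch.no_grad():
    generated_ids = fl_model.generate(
        input_ids=fl_inputs["input_ids"],
        pixel_values=fl_inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
generated_text = fl_processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
fl_results = fl_processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)[task]
fl_boxes = fl_results["bboxes"]  # list of [x1, y1, x2, y2]; phrases in fl_results["labels"], no scores

# --- SAM2: prompt it with whichever set of boxes you want to compare ---
sam2_model = build_sam2(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "./checkpoints/sam2.1_hiera_large.pt", device=device
)
predictor = SAM2ImagePredictor(sam2_model)
predictor.set_image(np.array(image))
masks, mask_scores, _ = predictor.predict(box=np.array(fl_boxes), multimask_output=False)
```

With Grounding DINO I can filter or rank boxes by `gd_scores` before handing them to SAM2, but I don't see an equivalent signal in the Florence-2 output.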
Hi @radames
Your observations are very thorough, and the questions you've raised are valuable.
We haven't benchmarked the two approaches implemented in this repo ourselves, but I believe each of these models currently has its own strengths.
For Grounding DINO 1.5, we can see its zero-shot detection capability is stronger than Florence-2's: it achieves 54.3 AP and 55.7 AP zero-shot on LVIS minival, while Florence-2 achieves 43.4 AP on the COCO zero-shot benchmark.
But after training on the FLD-5B dataset, Florence-2 can not only localize the main phrases of a caption but also has a strong referring capability; you can refer to the following table:
And it can also serve as a foundation model that users can fine-tune on their own specific scenarios.
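To make the distinction between phrase grounding and referring a bit more concrete, here is a minimal sketch of switching Florence-2 between the two behaviours via its task prompts. The task tokens follow the Florence-2 model card usage; the image path and text queries are just placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
image = Image.open("demo.jpg").convert("RGB")  # placeholder image

def run_task(task_prompt, text=""):
    # Florence-2 exposes every capability as text generation behind a task token.
    inputs = processor(text=task_prompt + text, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        raw, task=task_prompt, image_size=(image.width, image.height)
    )[task_prompt]

# Ground every main phrase of a caption to a box.
print(run_task("<CAPTION_TO_PHRASE_GROUNDING>", "a person riding a bicycle on the street"))

# Localize a single referring expression instead of the whole caption.
print(run_task("<REFERRING_EXPRESSION_SEGMENTATION>", "the person on the left"))
```

The first call returns a box per grounded phrase, while the second returns the region for the referred expression, and either output can then be used as a prompt for SAM2 in the same way as the Grounding DINO boxes.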