Florence-2, released by Microsoft in June 2024, is an advanced, lightweight foundation vision-language model open-sourced under the MIT license. The model is attractive because of its small size (0.23B and 0.77B parameters) and strong performance on a variety of computer vision and vision-language tasks. Despite its small size, it achieves results comparable to those of much larger models, such as Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale FLD-5B dataset, consisting of 126 million images and 5.4 billion comprehensive visual annotations.
| Model | Model size | Model Description |
|---|---|---|
| Florence-2-base [HF] | 0.23B | Pretrained on FLD-5B |
| Florence-2-large [HF] | 0.77B | Pretrained on FLD-5B |
| Florence-2-base-ft [HF] | 0.23B | Finetuned on a collection of downstream tasks |
| Florence-2-large-ft [HF] | 0.77B | Finetuned on a collection of downstream tasks |
Florence-2 supports many tasks out of the box:
- Caption,
- Detailed Caption,
- More Detailed Caption,
- Dense Region Caption,
- Object Detection,
- OCR,
- Caption to Phrase Grounding,
- Segmentation,
- Region Proposal,
- OCR with Region.
You can try out the model via HF Space.
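If you prefer to run the model locally, the snippet below follows the usage pattern from the Hugging Face model cards: load a checkpoint with `trust_remote_code=True` (Florence-2 ships its modeling code with the weights), select a task with its prompt token (here `<OD>` for object detection), and let the processor parse the generated tokens. The image URL is just an example; any image works.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-base"

# Florence-2's modeling code is bundled with the checkpoint, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example image; replace with any image you like.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The task is selected via a special prompt token, e.g. <OD> for object detection.
task = "<OD>"
inputs = processor(text=task, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# The processor turns the raw token sequence into boxes and labels for the task.
result = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(result)
```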
Vision tasks are diverse and vary in terms of spatial hierarchy and semantic granularity. Instance segmentation provides detailed information about object locations within an image but lacks semantic information. On the other hand, image captioning allows for a deeper understanding of the relationships between objects, but without reference to their actual locations.
Figure 1. Illustration showing the level of spatial hierarchy and semantic granularity expressed by each task. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.
The authors of Florence-2 decided that instead of training a series of separate models capable of executing individual tasks, they would unify their representation and train a single model capable of executing over 10 tasks. However, this requires a new dataset.
Florence-2's strength doesn't stem from its architecture, but from the massive dataset it was pre-trained on. The authors noted that leading computer vision datasets typically contain limited information: WIT includes only image/caption pairs, while SA-1B contains only images and their associated segmentation masks. They therefore decided to build the new FLD-5B dataset, containing a wide range of information about each image: boxes, masks, captions, and grounding. The dataset creation process was largely automated: the authors used off-the-shelf task-specific models together with a set of heuristics and quality checks to clean the results. The outcome was a dataset containing over 5 billion annotations for 126 million images, which was used to pre-train the Florence-2 model.
An illustrative example of an image and its corresponding annotations in the FLD-5B dataset. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.
FLD-5B is not yet publicly available, but the authors announced its upcoming release during CVPR 2024.
Summary of size, spatial hierarchy, and semantic granularity of top datasets. Source: Florence-2 CVPR 2024 poster.
Regardless of the computer vision task being performed, Florence-2 formulates the problem as a sequence-to-sequence task. Florence-2 takes an image and text as inputs, and generates text as output. The model has a simple structure. It uses a DaViT vision encoder to convert images into visual embeddings, and BERT to convert text prompts into text and location embeddings. The resulting embeddings are then processed by a standard encoder-decoder transformer architecture, generating text and location tokens.
Overview of Florence-2 architecture. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.
For region-specific tasks, location tokens representing quantized coordinates are added to the tokenizer's vocabulary (a short sketch follows the list below):
- Box Representation (x0, y0, x1, y1): Location tokens correspond to the box coordinates, specifically the top-left and bottom-right corners.
- Polygon Representation (x0, y0, ..., xn, yn): Location tokens represent the polygon's vertices in clockwise order.
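To make the coordinate encoding concrete, here is an illustrative sketch (not the model's actual tokenizer code) of how a bounding box could be mapped to location tokens, assuming coordinates are quantized into 1,000 bins as described in the paper:

```python
def box_to_location_tokens(box, image_width, image_height, num_bins=1000):
    """Quantize (x0, y0, x1, y1) pixel coordinates into <loc_i> tokens."""
    x0, y0, x1, y1 = box
    # Normalize each coordinate to [0, 1], then map it to an integer bin.
    qx0 = min(int(x0 / image_width * num_bins), num_bins - 1)
    qy0 = min(int(y0 / image_height * num_bins), num_bins - 1)
    qx1 = min(int(x1 / image_width * num_bins), num_bins - 1)
    qy1 = min(int(y1 / image_height * num_bins), num_bins - 1)
    return f"<loc_{qx0}><loc_{qy0}><loc_{qx1}><loc_{qy1}>"

# A box in a 640x480 image becomes four location tokens:
print(box_to_location_tokens((32, 48, 320, 240), 640, 480))
# <loc_50><loc_100><loc_500><loc_500>
```

A polygon is encoded the same way, except that all vertex coordinates are emitted in clockwise order instead of just the two corners.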
Florence-2 is smaller and more accurate than its predecessors. The Florence-2 series consists of two models: Florence-2-base and Florence-2-large, with 0.23 billion and 0.77 billion parameters, respectively. This size allows for deployment even on mobile devices. Despite its small size, Florence-2 achieves better zero-shot results than Kosmos-2 across all benchmarks, even though Kosmos-2 has 1.6 billion parameters.
Although Florence-2 supports many tasks out of the box, your task or domain might not be covered, or you may want better control over the model's output for your task. In that case, you will need to fine-tune the model.
- This post shows an example of fine-tuning Florence-2 on DocVQA.
- Finetuning notebook (a minimal training-loop sketch follows this list).
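As a rough sketch of what the fine-tuning loop looks like (closely following the blog post linked above; `train_loader` is assumed to yield batches of questions, answers, and PIL images from a DocVQA-style dataset):

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-base-ft"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for questions, answers, images in train_loader:  # assumed DocVQA-style batches
    # Questions act as the text prompt; answers are tokenized as labels.
    inputs = processor(
        text=questions, images=images, return_tensors="pt", padding=True
    ).to(device)
    labels = processor.tokenizer(
        answers, return_tensors="pt", padding=True
    ).input_ids.to(device)

    outputs = model(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        labels=labels,
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```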
| Title | Type | Brief Description | Links |
|---|---|---|---|
| Florence-2 Demo | Demo | HF Space | Link |
| Florence-2 DocVQA Demo | Demo | HF Space | Link |
| Florence-2 Finetuned Demo | Demo | HF Space | Link |
| Florence-2 Inference Notebook | Notebook | Notebook | Link |
| Florence-2 Finetuning Notebook | Notebook | Notebook | Link |
| Vision Language Models Explained | Blog article | Article | Link |
| Florence-2 Finetuning on DocVQA | Video | Video | Link |
| Florence-2 Finetuning on | Video | Video | Link |
- Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C., & Yuan, L. (2023). Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. arXiv preprint arXiv:2311.06242.
- Piotr Skalski. (Jun 20, 2024). Florence-2: Open Source Vision Foundation Model by Microsoft. Roboflow Blog.
- Fine-tuning Florence-2: Microsoft's Cutting-edge Vision Language Models.