Florence-2-Vision-Language-Model

Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.


Florence-2: Microsoft's Cutting-edge Vision Language Models

🕸 LinkedIn • 📙 Kaggle • 💻 Medium Blog • 🤗 Hugging Face




📃 Model Description

Florence-2, released by Microsoft in June 2024, is an advanced, lightweight vision-language foundation model open-sourced under the MIT license. The model is attractive because of its small size (0.23B and 0.77B parameters) and strong performance on a variety of computer vision and vision-language tasks. Despite its small size, it achieves results comparable to those of much larger models, such as Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale FLD-5B dataset, which consists of 126 million images and 5.4 billion comprehensive visual annotations.

Florence-2 model series

Model                    | Model size | Model Description
Florence-2-base [HF]     | 0.23B      | Pretrained model with FLD-5B
Florence-2-large [HF]    | 0.77B      | Pretrained model with FLD-5B
Florence-2-base-ft [HF]  | 0.23B      | Fine-tuned model on a collection of downstream tasks
Florence-2-large-ft [HF] | 0.77B      | Fine-tuned model on a collection of downstream tasks

Tasks

Florence-2 supports many tasks out of the box (see the prompt-token sketch after this list):

  • Caption,
  • Detailed Caption,
  • More Detailed Caption,
  • Dense Region Caption,
  • Object Detection,
  • OCR,
  • OCR with Region,
  • Caption to Phrase Grounding,
  • Segmentation,
  • Region Proposal.

You can try out the model via the HF Space.
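Each of these tasks is selected with a dedicated task prompt token. A minimal sketch of the mapping, with token names taken from the Hugging Face model card (verify them against the checkpoint version you use):

```python
# Task prompt tokens understood by the released Florence-2 checkpoints
# (names from the Hugging Face model card).
TASK_PROMPTS = {
    "Caption": "<CAPTION>",
    "Detailed Caption": "<DETAILED_CAPTION>",
    "More Detailed Caption": "<MORE_DETAILED_CAPTION>",
    "Dense Region Caption": "<DENSE_REGION_CAPTION>",
    "Object Detection": "<OD>",
    "OCR": "<OCR>",
    "OCR with Region": "<OCR_WITH_REGION>",
    "Caption to Phrase Grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "Referring Expression Segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "Region Proposal": "<REGION_PROPOSAL>",
}
```

Grounding-style tasks such as `<CAPTION_TO_PHRASE_GROUNDING>` additionally expect the caption text appended after the task token.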

🕸 Unified Representation

Vision tasks are diverse and vary in terms of spatial hierarchy and semantic granularity. Instance segmentation provides detailed information about object locations within an image but lacks semantic information. On the other hand, image captioning allows for a deeper understanding of the relationships between objects, but without reference to their actual locations.

Figure 1. Illustration of the level of spatial hierarchy and semantic granularity expressed by each task. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

The authors of Florence-2 decided that instead of training a series of separate models, each capable of executing an individual task, they would unify the representation and train a single model capable of executing more than ten tasks. However, this required a new dataset.

💎 Dataset

Florence-2's strength doesn't stem from its architecture, but from the massive dataset it was pre-trained on. The authors noted that leading computer vision datasets typically contain limited information: WIT includes only image/caption pairs, while SA-1B contains only images and their associated segmentation masks. They therefore built the new FLD-5B dataset, which contains a wide range of information about each image: boxes, masks, captions, and grounding. The dataset creation process was largely automated: the authors used off-the-shelf task-specific models together with a set of heuristics and quality checks to clean the obtained results. The result was a new dataset containing over 5 billion annotations for 126 million images, which was used to pre-train the Florence-2 model.

An illustrative example of an image and its corresponding annotations in the FLD-5B dataset. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

FLD-5B is not yet publicly available, but the authors announced its upcoming release during CVPR 2024.

Summary of the size, spatial hierarchy, and semantic granularity of leading datasets. Source: Florence-2 CVPR 2024 poster.

🧩 Architecture and Pre-training details

Regardless of the computer vision task being performed, Florence-2 formulates the problem as a sequence-to-sequence task: it takes an image and text as input and generates text as output. The model has a simple structure. It uses a DaViT vision encoder to convert images into visual embeddings, and BERT to convert text prompts into text and location embeddings. The resulting embeddings are then processed by a standard encoder-decoder transformer, which generates text and location tokens.

Overview of the Florence-2 architecture. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.
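As a concrete illustration, here is a minimal inference sketch using the Hugging Face transformers API (the image URL is a placeholder; trust_remote_code=True is required because the checkpoints ship custom modeling code):

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any RGB image works; this URL is a placeholder.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# The task prompt token selects the task; "<OD>" requests object detection.
prompt = "<OD>"
inputs = processor(text=prompt, images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

# Keep special tokens so the location tokens survive decoding, then let the
# processor parse them back into pixel-space boxes and labels.
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    raw_text, task="<OD>", image_size=(image.width, image.height)
)
print(result)  # e.g. {'<OD>': {'bboxes': [[x0, y0, x1, y1], ...], 'labels': [...]}}
```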

For region-specific tasks, location tokens representing quantized coordinates are added to the tokenizer's vocabulary (a quantization sketch follows this list):

  • Box Representation (x0, y0, x1, y1): Location tokens correspond to the box coordinates, specifically the top-left and bottom-right corners.
  • Polygon Representation (x0, y0, ..., xn, yn): Location tokens represent the polygon's vertices in clockwise order.
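As a minimal sketch of the quantization idea, assuming 1,000 bins per axis (matching the <loc_0>…<loc_999> tokens in the released tokenizer; the helper name is hypothetical):

```python
def box_to_loc_tokens(box, image_w, image_h, num_bins=1000):
    """Quantize a pixel-space box (x0, y0, x1, y1) into location tokens.

    Hypothetical helper sketching the scheme described in the paper,
    assuming num_bins=1000 to match the <loc_0>...<loc_999> vocabulary.
    """
    x0, y0, x1, y1 = box
    tokens = []
    for value, size in ((x0, image_w), (y0, image_h), (x1, image_w), (y1, image_h)):
        # Normalize to [0, 1], quantize, and clamp to the valid bin range.
        bin_idx = min(int(value / size * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bin_idx}>")
    return "".join(tokens)

# A 100x50 box anchored at the top-left corner of a 640x480 image:
print(box_to_loc_tokens((0, 0, 100, 50), 640, 480))
# -> <loc_0><loc_0><loc_156><loc_104>
```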

🦾 Capabilities

Florence-2 is smaller and more accurate than its predecessors. The Florence-2 series consists of two models: Florence-2-base and Florence-2-large, with 0.23 billion and 0.77 billion parameters, respectively. This size allows deployment even on mobile devices. Despite its small size, Florence-2 achieves better zero-shot results than Kosmos-2 across all benchmarks, even though Kosmos-2 has 1.6 billion parameters.


🏋🏾‍♂️ Finetuning

Even though Florence-2 supports many tasks out of the box, your task or domain might not be covered, or you may want tighter control over the model's output for your use case. In that case, you will need to fine-tune the model, as sketched below.
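A minimal fine-tuning sketch in plain PyTorch, assuming a hypothetical train_dataset that yields (prompt, image, target_text) triples; practical recipes (LR schedules, freezing the vision encoder, evaluation) are in the finetuning notebook listed under Resources:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-base-ft"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def collate(batch):
    # `batch` holds (prompt, image, target_text) triples from the hypothetical
    # train_dataset, e.g. ("<DocVQA>What is the invoice total?", pil_img, "$1,200").
    prompts, images, targets = zip(*batch)
    inputs = processor(text=list(prompts), images=list(images),
                       return_tensors="pt", padding=True)
    labels = processor.tokenizer(list(targets), return_tensors="pt",
                                 padding=True).input_ids
    return inputs, labels

loader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for epoch in range(3):
    for inputs, labels in loader:
        # The model computes a standard cross-entropy loss over the target tokens.
        outputs = model(
            input_ids=inputs["input_ids"].to(device),
            pixel_values=inputs["pixel_values"].to(device),
            labels=labels.to(device),
        )
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```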

🗂 Resources

Title                            | Type     | Brief Description | Links
Florence-2 Demo                  | Demo     | HF Space          | Link
Florence-2 DocVQA Demo           | Demo     | HF Space          | Link
Florence-2 Finetuned Demo        | Demo     | HF Space          | Link
Florence-2 Inference Notebook    | Notebook | Notebook          | Link
Florence-2 Finetuning Notebook   | Notebook | Notebook          | Link
Vision Language Models Explained | Blog     | Article           | Link
Florence-2 Finetuning on DocVQA  | Video    | Video             | Link
Florence-2 Finetuning            | Video    | Video             | Link

🔗 Citations and References

  • @article{xiao2023florence,
      title={Florence-2: Advancing a unified representation for a variety of vision tasks},
      author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
      journal={arXiv preprint arXiv:2311.06242},
      year={2023}
    }

  • Piotr Skalski. (Jun 20, 2024). Florence-2: Open Source Vision Foundation Model by Microsoft. Roboflow Blog

  • Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models