Florence-2 is an advanced vision foundation model developed by Microsoft that uses a prompt-based approach to handle a wide range of vision and vision-language tasks.
Versatile task handling: Florence-2 interprets simple text prompts to perform tasks such as image captioning, object detection, visual grounding, and segmentation; a brief usage sketch follows these points.
Multi-task learning: The model is trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images, giving it broad multi-task supervision.
Architecture: Florence-2 pairs a vision encoder with a multi-modal transformer encoder-decoder in a sequence-to-sequence setup, and it performs strongly in both zero-shot and fine-tuned settings.
Model variants: Two main variants are available: Florence-2-base (0.23B parameters) and Florence-2-large (0.77B parameters).
Performance: Florence-2 demonstrates exceptional zero-shot performance in tasks such as image captioning, visual grounding, and referring expression comprehension. For example, Florence-2-L achieved a CIDEr score of 135.6 on the COCO caption benchmark, outperforming models with significantly more parameters.
Downstream task capabilities: The model shows strong performance in object detection, instance segmentation, and semantic segmentation tasks, often surpassing previous state-of-the-art models.
Unified approach: Florence-2 casts every task as text generation: given an image and a task prompt, the model produces a textual answer, with region coordinates encoded as quantized location tokens so that detection and grounding outputs fit the same sequence format.
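For concreteness, here is a minimal inference sketch following the usage pattern documented on the Hugging Face model card. The model ID, task-prompt tokens (such as "<OD>" and "<CAPTION>"), and the example image URL are assumptions drawn from that card and should be checked against the current release.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Load the large variant; swap in "microsoft/Florence-2-base" for the smaller model.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

# Illustrative test image; any RGB image works.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The task is selected purely by the prompt token, e.g. "<OD>" for object
# detection, "<CAPTION>" for captioning, "<CAPTION_TO_PHRASE_GROUNDING>" for grounding.
task_prompt = "<OD>"
inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device, dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# The raw output is plain text containing quantized location tokens; the processor
# converts it back into task-specific structures (boxes and labels for "<OD>").
parsed = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)
print(parsed)
```

Switching tasks requires only changing the prompt token, which is what makes the single sequence-to-sequence interface cover captioning, detection, grounding, and segmentation without task-specific heads.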
Florence-2 represents a significant advancement in computer vision: a compact, versatile model that handles multiple vision-related tasks with high efficiency and accuracy.