Florence-2 Base
microsoft/Florence-2-base
published Jun 2024 · updated Aug 2025
Florence-2 Base is a vlm model that performs multiple vision tasks like captioning, object detection, and segmentation using text prompts.
specs
| Task | Vision-Language Multi-Task (captioning, object detection, OCR, segmentation, etc.) |
| Architecture | Sequence-to-sequence encoder-decoder |
| Parameters | 0.23B |
| License | MIT |
about this model
Florence-2-base is a vision-language model (VLM) that uses a prompt-based approach to perform captioning, object detection, segmentation, OCR, and other vision-language tasks through simple text instructions. Developed by Microsoft and trained on the FLD-5B dataset (5.4 billion annotations across 126 million images), the model employs a sequence-to-sequence architecture to handle multi-task learning, achieving strong results in both zero-shot and fine-tuned settings. It is released under the MIT license. gigarouter hosts this model as a managed, OpenAI-compatible API, requiring no local installation.
Zero-shot performance
| Method | #params | COCO Cap. test CIDEr | NoCaps val CIDEr | TextCaps val CIDEr | COCO Det. val2017 mAP |
|---|---|---|---|---|---|
| Flamingo | 80B | 84.3 | - | - | - |
| Florence-2-base | 0.23B | 133.0 | 118.7 | 70.1 | 34.7 |
| Florence-2-large | 0.77B | 135.6 | 120.8 | 72.8 | 37.5 |
Fine-tuned performance (Florence-2-base-ft)
When fine-tuned on a collection of downstream tasks, Florence-2-base-ft achieves a COCO Caption Karpathy test CIDEr of 140.0, NoCaps val CIDEr of 116.7, TextCaps val CIDEr of 143.9, VQAv2 test-dev accuracy of 79.7, TextVQA test-dev accuracy of 63.6, and VizWiz test-dev accuracy of 63.6.
best for
- ·Generate detailed image captions
- ·Detect objects with bounding boxes
- ·Extract OCR text from images
FAQ
It supports captioning, detailed caption, object detection, region proposal, dense region caption, OCR, OCR with region, caption-to-phrase grounding, and referring expression comprehension via different prompt tokens.
It has 0.23B parameters.
It is released under the MIT license.
Use the gigarouter OpenAI-compatible endpoint with your API key, sending an image and a task prompt as text instructions.
Florence-2 Large has 0.77B parameters (vs 0.23B) and generally achieves higher scores on benchmarks like COCO Caption and Object Detection.
We're benchmarking and onboarding Florence-2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.