skip to content
gigarouter gigarouter
models / vision-language · coming soon

Florence-2 Base

microsoft/Florence-2-base

published Jun 2024 · updated Aug 2025

Florence-2 Base is a vlm model that performs multiple vision tasks like captioning, object detection, and segmentation using text prompts.

est. price
~$0.094
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
2.6M
license
mit

specs

TaskVision-Language Multi-Task (captioning, object detection, OCR, segmentation, etc.)
ArchitectureSequence-to-sequence encoder-decoder
Parameters0.23B
LicenseMIT

about this model

Florence-2-base is a vision-language model (VLM) that uses a prompt-based approach to perform captioning, object detection, segmentation, OCR, and other vision-language tasks through simple text instructions. Developed by Microsoft and trained on the FLD-5B dataset (5.4 billion annotations across 126 million images), the model employs a sequence-to-sequence architecture to handle multi-task learning, achieving strong results in both zero-shot and fine-tuned settings. It is released under the MIT license. gigarouter hosts this model as a managed, OpenAI-compatible API, requiring no local installation.

Zero-shot performance

Method#paramsCOCO Cap. test CIDErNoCaps val CIDErTextCaps val CIDErCOCO Det. val2017 mAP
Flamingo80B84.3---
Florence-2-base0.23B133.0118.770.134.7
Florence-2-large0.77B135.6120.872.837.5

Fine-tuned performance (Florence-2-base-ft)

When fine-tuned on a collection of downstream tasks, Florence-2-base-ft achieves a COCO Caption Karpathy test CIDEr of 140.0, NoCaps val CIDEr of 116.7, TextCaps val CIDEr of 143.9, VQAv2 test-dev accuracy of 79.7, TextVQA test-dev accuracy of 63.6, and VizWiz test-dev accuracy of 63.6.

best for

FAQ

What tasks does Florence-2 Base support?

It supports captioning, detailed caption, object detection, region proposal, dense region caption, OCR, OCR with region, caption-to-phrase grounding, and referring expression comprehension via different prompt tokens.

How many parameters does Florence-2 Base have?

It has 0.23B parameters.

What is the license for Florence-2 Base?

It is released under the MIT license.

How do I call Florence-2 Base via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending an image and a task prompt as text instructions.

How does Florence-2 Base compare to Florence-2 Large?

Florence-2 Large has 0.77B parameters (vs 0.23B) and generally achieves higher scores on benchmarks like COCO Caption and Object Detection.

not yet live

We're benchmarking and onboarding Florence-2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →