Florence-2 Base

microsoft/Florence-2-base

published Jun 2024 · updated Aug 2025

Florence-2 Base is a vlm model that performs multiple vision tasks like captioning, object detection, and segmentation using text prompts.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

2.6M

license

mit

specs

Task	Vision-Language Multi-Task (captioning, object detection, OCR, segmentation, etc.)
Architecture	Sequence-to-sequence encoder-decoder
Parameters	0.23B
License	MIT

about this model

Florence-2-base is a vision-language model (VLM) that uses a prompt-based approach to perform captioning, object detection, segmentation, OCR, and other vision-language tasks through simple text instructions. Developed by Microsoft and trained on the FLD-5B dataset (5.4 billion annotations across 126 million images), the model employs a sequence-to-sequence architecture to handle multi-task learning, achieving strong results in both zero-shot and fine-tuned settings. It is released under the MIT license. gigarouter hosts this model as a managed, OpenAI-compatible API, requiring no local installation.

Zero-shot performance

Method	#params	COCO Cap. test CIDEr	NoCaps val CIDEr	TextCaps val CIDEr	COCO Det. val2017 mAP
Flamingo	80B	84.3	-	-	-
Florence-2-base	0.23B	133.0	118.7	70.1	34.7
Florence-2-large	0.77B	135.6	120.8	72.8	37.5

Fine-tuned performance (Florence-2-base-ft)

When fine-tuned on a collection of downstream tasks, Florence-2-base-ft achieves a COCO Caption Karpathy test CIDEr of 140.0, NoCaps val CIDEr of 116.7, TextCaps val CIDEr of 143.9, VQAv2 test-dev accuracy of 79.7, TextVQA test-dev accuracy of 63.6, and VizWiz test-dev accuracy of 63.6.

best for

·Generate detailed image captions
·Detect objects with bounding boxes
·Extract OCR text from images

FAQ

What tasks does Florence-2 Base support?

It supports captioning, detailed caption, object detection, region proposal, dense region caption, OCR, OCR with region, caption-to-phrase grounding, and referring expression comprehension via different prompt tokens.

How many parameters does Florence-2 Base have?

It has 0.23B parameters.

What is the license for Florence-2 Base?

It is released under the MIT license.

How do I call Florence-2 Base via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending an image and a task prompt as text instructions.

How does Florence-2 Base compare to Florence-2 Large?

Florence-2 Large has 0.77B parameters (vs 0.23B) and generally achieves higher scores on benchmarks like COCO Caption and Object Detection.

not yet live

We're benchmarking and onboarding Florence-2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit