Kosmos-2

microsoft/kosmos-2-patch14-224

published Oct 2023 · updated Nov 2023

Kosmos-2 is a multimodal large language model that grounds text to visual world, enabling tasks like phrase grounding, referring expression comprehension, and grounded image captioning.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

166.7K

license

mit

specs

Task	Multimodal Grounding and Image-to-Text
Architecture	Transformer-based Multimodal Large Language Model
Training Data	GRIT dataset (20 million grounded image-caption pairs)

about this model

Kosmos-2-patch14-224 is an image-to-text multimodal large language model that grounds textual descriptions to spatial regions in images. Built by Microsoft, it extends a language model with visual perception, enabling the generation of text that is linked to specific bounding boxes in the input image.

Capabilities

The model supports a range of grounded tasks through prompt-based control:

Multimodal grounding – phrase grounding (locating a phrase in the image) and referring expression comprehension (locating a described object).
Multimodal referring – generating a description for a given image region (referring expression generation).
Grounded visual question answering (VQA) – answering questions about an image while outputting the relevant bounding boxes.
Grounded image captioning – brief or detailed captions that include spatial references to objects in the scene.

Training data

The model was trained on the GRIT dataset, a collection of approximately 20 million grounded image-caption pairs derived from COYO-700M and LAION-2B. The dataset is released under the Microsoft Public License (ms-pl) and supports tasks such as image-to-text generation, object detection, and visual question answering.

Example output

A snowman warming himself by a fire, with bounding boxes around the snowman and the fire.

Given the image above, Kosmos-2 can generate a caption like “An image of a snowman warming himself by a fire” and return the spatial coordinates for “a snowman” and “a fire”. The model outputs both the text and the corresponding entity locations, making it suitable for applications that require visual grounding.

Integration through gigarouter

gigarouter hosts Kosmos-2-patch14-224 as a managed, OpenAI-compatible API. Developers can send an image and a prompt, and receive text responses with optional grounding information, without needing to manage model dependencies or infrastructure.

best for

·Phrase grounding – linking text phrases to image regions
·Referring expression comprehension – identifying objects described by expressions
·Grounded visual question answering – answering questions with region references

FAQ

What tasks can Kosmos-2 perform?

Kosmos-2 can perform phrase grounding, referring expression comprehension, grounded VQA, and grounded image captioning.

How is the model input formatted?

Input is an image and a text prompt. For grounding tasks, the prompt should start with <grounding> and use <phrase> tags to specify text to ground.

How can I use this model via the gigarouter API?

Access the model through the gigarouter OpenAI-compatible endpoint using an API key; send an image and prompt as per the model's input format.

What training data was used for Kosmos-2?

Kosmos-2 was trained on the GRIT dataset, which contains about 20 million grounded image-caption pairs derived from COYO-700M and LAION-2B.

What is the license for this model?

The model weights are publicly available on Hugging Face; the license is not specified in the available sources, but the GRIT dataset is under Microsoft Public License (ms-pl).

not yet live

We're benchmarking and onboarding Kosmos-2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo