Kosmos-2
microsoft/kosmos-2-patch14-224
published Oct 2023 · updated Nov 2023
Kosmos-2 is a multimodal large language model that grounds text to visual world, enabling tasks like phrase grounding, referring expression comprehension, and grounded image captioning.
specs
| Task | Multimodal Grounding and Image-to-Text |
| Architecture | Transformer-based Multimodal Large Language Model |
| Training Data | GRIT dataset (20 million grounded image-caption pairs) |
about this model
Kosmos-2-patch14-224 is an image-to-text multimodal large language model that grounds textual descriptions to spatial regions in images. Built by Microsoft, it extends a language model with visual perception, enabling the generation of text that is linked to specific bounding boxes in the input image.
Capabilities
The model supports a range of grounded tasks through prompt-based control:
- Multimodal grounding – phrase grounding (locating a phrase in the image) and referring expression comprehension (locating a described object).
- Multimodal referring – generating a description for a given image region (referring expression generation).
- Grounded visual question answering (VQA) – answering questions about an image while outputting the relevant bounding boxes.
- Grounded image captioning – brief or detailed captions that include spatial references to objects in the scene.
Training data
The model was trained on the GRIT dataset, a collection of approximately 20 million grounded image-caption pairs derived from COYO-700M and LAION-2B. The dataset is released under the Microsoft Public License (ms-pl) and supports tasks such as image-to-text generation, object detection, and visual question answering.
Example output
Given the image above, Kosmos-2 can generate a caption like “An image of a snowman warming himself by a fire” and return the spatial coordinates for “a snowman” and “a fire”. The model outputs both the text and the corresponding entity locations, making it suitable for applications that require visual grounding.
Integration through gigarouter
gigarouter hosts Kosmos-2-patch14-224 as a managed, OpenAI-compatible API. Developers can send an image and a prompt, and receive text responses with optional grounding information, without needing to manage model dependencies or infrastructure.
best for
- ·Phrase grounding – linking text phrases to image regions
- ·Referring expression comprehension – identifying objects described by expressions
- ·Grounded visual question answering – answering questions with region references
FAQ
Kosmos-2 can perform phrase grounding, referring expression comprehension, grounded VQA, and grounded image captioning.
Input is an image and a text prompt. For grounding tasks, the prompt should start with <grounding> and use <phrase> tags to specify text to ground.
Access the model through the gigarouter OpenAI-compatible endpoint using an API key; send an image and prompt as per the model's input format.
Kosmos-2 was trained on the GRIT dataset, which contains about 20 million grounded image-caption pairs derived from COYO-700M and LAION-2B.
The model weights are publicly available on Hugging Face; the license is not specified in the available sources, but the GRIT dataset is under Microsoft Public License (ms-pl).
We're benchmarking and onboarding Kosmos-2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.