blip itm large coco
Salesforce/blip-itm-large-coco
published Dec 2022 · updated Feb 2025
A popular open specialist model, with 4.6K downloads a month. Gigarouter benchmarks and hosts it as an OpenAI-compatible API.
about this model
BLIP (Bootstrapping Language-Image Pre-training) is a vision-language model. This checkpoint performs image-text matching (ITM) and retrieval, using a ViT-L backbone fine-tuned on the COCO dataset. For a given image-text pair it outputs an ITM score and a cosine similarity score, which supports both fine-grained matching and ranking.
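As a rough illustration of the two scores, here is a minimal sketch that assumes the Hugging Face `transformers` classes `BlipProcessor` and `BlipForImageTextRetrieval`, with `torch` and `Pillow` installed locally. The helper names `match_probability` and `score_pair` are hypothetical, and treating index 1 of the ITM logits as the "match" class is an assumption based on the original BLIP code.

```python
import math

MODEL_ID = "Salesforce/blip-itm-large-coco"


def match_probability(itm_logits):
    """Softmax over two logits; returns the probability of the second class.

    Assumption: index 0 is "no match" and index 1 is "match".
    """
    no_match, match = itm_logits
    shift = max(no_match, match)  # subtract the max for numerical stability
    e_no = math.exp(no_match - shift)
    e_yes = math.exp(match - shift)
    return e_yes / (e_no + e_yes)


def score_pair(image_path, text):
    """Return (ITM match probability, cosine similarity) for one image-text pair."""
    # Heavy dependencies are imported lazily so the helper above stays lightweight.
    import torch
    from PIL import Image
    from transformers import BlipForImageTextRetrieval, BlipProcessor

    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForImageTextRetrieval.from_pretrained(MODEL_ID)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=text, return_tensors="pt")

    with torch.no_grad():
        # ITM head: two logits per pair.
        itm_logits = model(**inputs).itm_score[0].tolist()
        # Without the ITM head the model returns image-text cosine similarity.
        cosine = model(**inputs, use_itm_head=False).itm_score[0, 0].item()

    return match_probability(itm_logits), cosine


if __name__ == "__main__":
    # "example.jpg" is a placeholder path, not a file shipped with the model.
    prob, cos = score_pair("example.jpg", "two cats sleeping on a couch")
    print(f"ITM match probability: {prob:.3f}, cosine similarity: {cos:.3f}")
```

In a retrieval setting, the cheaper cosine score is typically used to shortlist candidates, and the ITM score to rerank the shortlist.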
The BLIP framework unifies understanding and generation tasks, using a captioner-filter approach to bootstrap training from noisy web data. On the COCO image-text retrieval benchmark, it reported state-of-the-art results at the time of publication: text retrieval recall@1 of 82.0, recall@5 of 95.8, and recall@10 of 98.1; image retrieval recall@1 of 64.5, recall@5 of 86.0, and recall@10 of 91.7. On the VQA v2 test-dev set it scores 78.23, and on COCO image captioning it achieves a CIDEr of 133.5 and BLEU@4 of 39.9. The architecture is described in the BLIP paper (arXiv:2201.12086).
Benchmark performance (from LAVIS evaluation)
| Task | Metric | Score |
|---|---|---|
| COCO Text Retrieval | Recall@1 | 82.0 |
| COCO Text Retrieval | Recall@5 | 95.8 |
| COCO Text Retrieval | Recall@10 | 98.1 |
| COCO Image Retrieval | Recall@1 | 64.5 |
| COCO Image Retrieval | Recall@5 | 86.0 |
| COCO Image Retrieval | Recall@10 | 91.7 |
| VQA v2 | Test-dev | 78.23 |
| COCO Captioning | CIDEr | 133.5 |
| COCO Captioning | BLEU@4 | 39.9 |
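For readers unfamiliar with the retrieval metric in the table, Recall@K is the percentage of queries for which at least one correct item appears among the top K ranked candidates. The sketch below computes it on a toy similarity matrix. The function name `recall_at_k` and the data are illustrative only; this is not the LAVIS evaluation code.

```python
def recall_at_k(similarity, relevant, k):
    """Percentage of queries with at least one relevant candidate in the top k.

    similarity: list of rows, one per query, with a score for each candidate.
    relevant:   list of sets, the correct candidate indices for each query.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if relevant[query_idx] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data: 3 queries scored against 4 candidates.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1, 0.3],  # correct candidate 3 is ranked third
]
toy_relevant = [{0}, {1}, {3}]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"Recall@{k}: {recall_at_k(toy_similarity, toy_relevant, k):.1f}")
```

On COCO, each image has several reference captions, so for text retrieval the `relevant` set for an image query contains more than one caption index.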
Gigarouter hosts this model as a managed API, eliminating the need for local setup. The official BLIP repository is deprecated; the model is integrated into the LAVIS library (BSD 3-Clause license).
We're benchmarking and onboarding blip itm large coco as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.