blip itm large coco

Salesforce/blip-itm-large-coco

published Dec 2022 · updated Feb 2025

A popular open specialist model, with 4.6K downloads a month. Gigarouter is benchmarking it and plans to host it as an OpenAI-compatible API.

status

coming soon

API providers

downloads / mo

4.6K

license

bsd-3-clause

about this model

BLIP (Bootstrapping Language-Image Pre-training) is a vision-language model that performs image-text matching (ITM) and retrieval, using a large ViT-L backbone fine-tuned on the COCO dataset. It outputs an ITM score and a cosine similarity score for a given image-text pair, enabling both fine-grained matching and ranking.
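The two scores described above can be obtained locally with the Hugging Face `transformers` classes `BlipProcessor` and `BlipForImageTextRetrieval`. The following is a minimal sketch, not Gigarouter's hosted API: the helper names `match_probability` and `score_pair` are hypothetical, and it assumes index 1 of the two ITM logits is the "match" class, as in the original BLIP demo code. The heavy imports sit inside `score_pair` so that the softmax helper can be used on its own.

```python
import math


def match_probability(itm_logits):
    """Softmax over the two ITM logits [no_match, match]; returns P(match).

    Assumes index 1 is the "match" class.
    """
    no_match, match = itm_logits
    peak = max(no_match, match)  # subtract the max for numerical stability
    e_no = math.exp(no_match - peak)
    e_yes = math.exp(match - peak)
    return e_yes / (e_no + e_yes)


def score_pair(image, text, model_id="Salesforce/blip-itm-large-coco"):
    """Return (match probability, cosine similarity) for one image-text pair.

    `image` is a PIL image in RGB mode. The first call downloads the weights.
    """
    import torch
    from transformers import BlipForImageTextRetrieval, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForImageTextRetrieval.from_pretrained(model_id)
    model.eval()

    inputs = processor(image, text, return_tensors="pt")
    with torch.no_grad():
        # ITM head: two logits per pair, shape (1, 2)
        itm_logits = model(**inputs)[0]
        # Without the ITM head the model returns a similarity score, shape (1, 1)
        cosine = model(**inputs, use_itm_head=False)[0]

    return match_probability(itm_logits[0].tolist()), float(cosine[0][0])
```

In a typical retrieval setup the cheaper cosine score would shortlist candidates and the ITM probability would re-rank the shortlist, since the ITM head runs image and text through the model jointly.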

BLIP as a framework unifies vision-language understanding and generation, and uses a captioner-filter scheme to bootstrap cleaner training data from noisy web image-text pairs. In the LAVIS evaluation on COCO image-text retrieval, the model reaches a text retrieval recall@1 of 82.0, recall@5 of 95.8, and recall@10 of 98.1, and an image retrieval recall@1 of 64.5, recall@5 of 86.0, and recall@10 of 91.7. The VQA v2 test-dev score of 78.23 and the COCO captioning scores (CIDEr 133.5, BLEU@4 39.9) come from BLIP's separate VQA and captioning checkpoints; this checkpoint itself performs only matching and retrieval. The framework is described in the paper arXiv:2201.12086.

Benchmark performance (from LAVIS evaluation)

Task	Metric	Score
COCO Text Retrieval	Recall@1	82.0
COCO Text Retrieval	Recall@5	95.8
COCO Text Retrieval	Recall@10	98.1
COCO Image Retrieval	Recall@1	64.5
COCO Image Retrieval	Recall@5	86.0
COCO Image Retrieval	Recall@10	91.7
VQA v2	Test-dev	78.23
COCO Captioning	CIDEr	133.5
COCO Captioning	BLEU@4	39.9
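
For readers unfamiliar with the retrieval metrics in the table, recall@K is the percentage of queries for which at least one correct item appears among the K highest-scoring candidates. The sketch below illustrates that definition on a toy score matrix; the function name `recall_at_k` and the data are made up for illustration, and this is not the LAVIS evaluation code.

```python
def recall_at_k(scores, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    scores[i][j] is the similarity between query i and candidate j.
    relevant[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for row, rel in zip(scores, relevant):
        # Candidate indices ordered from highest to lowest score
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if rel.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy example: 3 queries, 3 candidates, one correct candidate per query
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
relevant = [{0}, {1}, {2}]

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(scores, relevant, k):.1f}")
```

On COCO each image has several reference captions, which is why `relevant` holds a set per query rather than a single index.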

Gigarouter plans to host this model as a managed API, removing the need for local setup once it is live. The official BLIP repository is deprecated; the model is now maintained as part of the LAVIS library (BSD 3-Clause license).

not yet live

We're benchmarking and onboarding blip itm large coco as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related specialist models

electra-base-discriminator

wespeaker-voxceleb-resnet34-LM

6.8M dl/mo

unidepth-v2-vitl14

6.3M dl/mo