BLIP ITM Base COCO
Salesforce/blip-itm-base-coco
published Dec 2022 · updated Feb 2025
BLIP ITM Base COCO is a vision-language model fine-tuned for image-text matching, using a ViT-B backbone and trained on the COCO dataset.
specs
| Task | Image-Text Matching |
| Architecture | BLIP with ViT-B backbone |
| Parameters | Not specified |
| License | Creative Commons Attribution 4.0 International |
best for
- Scoring the relevance between an image and a text caption
- Retrieving images that match a given text description
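The retrieval use case above comes down to scoring every candidate image against one caption and sorting by score. A minimal sketch of that ranking step follows. The names `rank_images` and `score_fn` and the toy scores are hypothetical, introduced only for illustration; in practice `score_fn` would wrap a call to the model.

```python
from typing import Callable, Iterable, List, Tuple


def rank_images(
    caption: str,
    image_ids: Iterable[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score each candidate image against the caption and return the best matches."""
    scored = [(image_id, score_fn(image_id, caption)) for image_id in image_ids]
    # Highest score first; sort() is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-in for model scores (hypothetical values, not real model output).
toy_scores = {"kitchen.jpg": 0.07, "beach.jpg": 0.91, "park.jpg": 0.40}
best = rank_images(
    "a dog running on the sand",
    toy_scores,
    lambda image_id, _caption: toy_scores[image_id],
    top_k=2,
)
print(best)  # [('beach.jpg', 0.91), ('park.jpg', 0.4)]
```

Scoring every image with the ITM head costs one forward pass per pair, so for large collections a cheaper first-pass filter before reranking may be worth considering.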
FAQ
The model expects an image and a text caption. Use the BlipProcessor to preprocess both into tensors.
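The preprocessing step can be sketched as follows, assuming the Hugging Face `transformers` library, Pillow, and PyTorch are installed. The function name `preprocess` and the file path in the usage comment are illustrative assumptions, not part of any official API.

```python
CHECKPOINT = "Salesforce/blip-itm-base-coco"


def preprocess(image_path, caption, checkpoint=CHECKPOINT):
    """Turn one image file and one caption into model-ready PyTorch tensors."""
    # Heavy dependencies are imported lazily so that merely defining this
    # function does not require them to be loaded.
    from PIL import Image
    from transformers import BlipProcessor

    # Downloads the processor config on first use, then reads it from cache.
    processor = BlipProcessor.from_pretrained(checkpoint)
    image = Image.open(image_path).convert("RGB")
    # The processor resizes and normalizes the image and tokenizes the caption.
    return processor(images=image, text=caption, return_tensors="pt")


# Example usage (the path is a placeholder):
# inputs = preprocess("photo.jpg", "two cats sleeping on a couch")
# print(inputs.keys())
```

The returned object should hold the image as `pixel_values` and the tokenized caption as `input_ids` with an `attention_mask`, and can be unpacked straight into the model call.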
Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image and text as part of the request.
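The ITM head feeds the image and text jointly through the cross-attention encoder and outputs a direct match/no-match score. Cosine similarity instead compares the separately encoded (unimodal) image and text embeddings, which is cheaper to compute but a coarser similarity measure.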
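Both scores can be read from the same checkpoint. The sketch below assumes the Hugging Face `transformers` implementation, where `BlipForImageTextRetrieval` accepts a `use_itm_head` flag, and follows the original BLIP demo's convention that index 1 of the two ITM logits is the "match" class. The helper names `match_probability` and `score_pair` are hypothetical.

```python
import math

CHECKPOINT = "Salesforce/blip-itm-base-coco"


def match_probability(no_match_logit, match_logit):
    """Softmax over the two ITM logits; returns the probability of 'match'."""
    top = max(no_match_logit, match_logit)  # subtract the max for numerical stability
    exp_no = math.exp(no_match_logit - top)
    exp_yes = math.exp(match_logit - top)
    return exp_yes / (exp_no + exp_yes)


def score_pair(image_path, caption, checkpoint=CHECKPOINT):
    """Return (ITM match probability, cosine similarity) for one image-caption pair."""
    # Lazy imports keep the pure-Python helper above usable on its own.
    import torch
    from PIL import Image
    from transformers import BlipForImageTextRetrieval, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForImageTextRetrieval.from_pretrained(checkpoint).eval()
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=caption, return_tensors="pt")

    with torch.no_grad():
        itm_logits = model(**inputs)[0]                  # ITM head: two logits per pair
        cosine = model(**inputs, use_itm_head=False)[0]  # similarity of unimodal embeddings

    no_match, match = itm_logits[0].tolist()
    return match_probability(no_match, match), cosine.item()


# Example usage (the path is a placeholder):
# itm_prob, cos_sim = score_pair("photo.jpg", "two cats sleeping on a couch")
```

The ITM probability lies between 0 and 1, while the cosine value is a raw similarity, so the two numbers are not directly comparable and would need separate thresholds.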
It is released under the Creative Commons Attribution 4.0 International license.
No, this specific checkpoint is for image-text matching only; other BLIP variants handle captioning.
We're benchmarking and onboarding BLIP ITM Base COCO as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.