BLIP ITM Base Flickr
Salesforce/blip-itm-base-flickr
published Dec 2022 · updated Feb 2025
BLIP ITM Base Flickr is a vision-language model for image-text matching, fine-tuned on Flickr30k.
specs
| Task | Image-Text Matching |
| Architecture | BLIP with ViT-B (Vision Transformer Base) backbone |
| Parameters | Not specified |
| License | Not specified |
about this model
Salesforce/blip-itm-base-flickr is an image-text matching (ITM) model that scores how well a text caption aligns with an image. It is built on the BLIP framework with a ViT-Base backbone and fine-tuned on the Flickr30k dataset.
Capabilities
The model accepts an image and a text description and returns one of two scores: an image-text matching (ITM) score from a dedicated classification head on top of the multimodal encoder, or a cosine similarity between the separately encoded image and text features. It is designed for tasks such as image-text retrieval and verifying caption accuracy. BLIP handles noisy web data by bootstrapping captions: a captioner generates synthetic captions and a filter removes noisy ones, which makes pre-training on large-scale web data more effective.
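The two score types are read differently, so a small sketch may help. It assumes the ITM head emits two logits ordered (no match, match), as in the reference BLIP implementation, and that image and text features arrive as plain lists of floats. The function names are illustrative and not part of any API.

```python
import math


def match_probability(itm_logits):
    """Turn the two ITM logits into a probability that the pair matches.

    Assumption: itm_logits is ordered (no_match, match).
    """
    no_match, match = itm_logits
    peak = max(no_match, match)  # subtract the max for numerical stability
    exp_no = math.exp(no_match - peak)
    exp_yes = math.exp(match - peak)
    return exp_yes / (exp_no + exp_yes)


def cosine_similarity(image_feat, text_feat):
    """Cosine of the angle between two feature vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(image_feat, text_feat))
    norm_image = math.sqrt(sum(a * a for a in image_feat))
    norm_text = math.sqrt(sum(b * b for b in text_feat))
    return dot / (norm_image * norm_text)
```

The match probability is bounded in [0, 1] and suits thresholding (for example, flagging captions below a chosen cutoff), whereas the cosine similarity is mainly useful for ranking many candidates against each other.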
Training and Performance
The ViT-Base backbone was pre-trained on 129 million image-text pairs and then fine-tuned on Flickr30k for retrieval. The figures below describe the BLIP framework as reported in the BLIP paper, not this checkpoint specifically: state-of-the-art results at the time of publication, with image-text retrieval improved by +2.7% in average recall@1, image captioning by +2.8% in CIDEr, and VQA by +1.6% in VQA score. The paper also reports that BLIP generalizes to video-language tasks in a zero-shot manner.
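For readers unfamiliar with the retrieval metric, recall@k is the fraction of queries whose top-k ranked candidates contain at least one correct item. The sketch below computes it from a score matrix. The data layout (`similarity[i][j]` as the score of query i against candidate j, and a set of correct candidate indices per query) is an assumption made for illustration, and the toy numbers are made up.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries with at least one correct candidate in the top k.

    similarity[i][j]: score of query i against candidate j (higher is better).
    ground_truth[i]: set of candidate indices that are correct for query i.
    """
    hits = 0
    for query_index, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(j in ground_truth[query_index] for j in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 image queries scored against 3 candidate captions.
toy_similarity = [
    [0.9, 0.1, 0.2],  # best candidate is 0, which is correct
    [0.3, 0.2, 0.8],  # best candidate is 2, but the correct one is 1
    [0.1, 0.7, 0.4],  # best candidate is 1, which is correct
]
toy_ground_truth = [{0}, {1}, {1}]

toy_recall_at_1 = recall_at_k(toy_similarity, toy_ground_truth, k=1)
```

Here two of the three queries rank a correct caption first, so recall@1 is 2/3. An "average" recall@1 is typically the mean of this value over the image-to-text and text-to-image directions.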
Usage via gigarouter
gigarouter hosts this model as a managed, OpenAI-compatible API. No local installation or GPU infrastructure is required. Developers send image and text inputs to the API endpoint and receive the matching scores in return.
best for
- Computing image-text similarity scores
- Image-text retrieval (e.g., finding matching captions for images)
- Evaluating alignment between images and textual descriptions
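For retrieval over many captions, a common pattern (and the one described in the BLIP paper) is two-stage: shortlist candidates with the cheap feature similarity, then re-rank only the shortlist with the more expensive ITM score. The sketch below shows the control flow only. `similarity_fn` and `itm_fn` are hypothetical callables standing in for real model calls, and the scores are invented.

```python
def retrieve_captions(image, captions, similarity_fn, itm_fn, k=2):
    """Shortlist k captions by similarity, then re-rank them by ITM score."""
    # Stage 1: cheap similarity over every candidate.
    shortlist = sorted(
        captions, key=lambda caption: similarity_fn(image, caption), reverse=True
    )[:k]
    # Stage 2: expensive ITM scoring over the shortlist only.
    return sorted(
        shortlist, key=lambda caption: itm_fn(image, caption), reverse=True
    )


# Stub scorers backed by lookup tables, purely for illustration.
toy_similarity = {
    "a dog on a beach": 0.8,
    "a cat on a sofa": 0.1,
    "a dog in a park": 0.7,
}
toy_itm = {
    "a dog on a beach": 0.60,
    "a cat on a sofa": 0.90,
    "a dog in a park": 0.95,
}

ranked = retrieve_captions(
    image="photo.jpg",  # placeholder; the stubs ignore it
    captions=list(toy_similarity),
    similarity_fn=lambda image, caption: toy_similarity[caption],
    itm_fn=lambda image, caption: toy_itm[caption],
    k=2,
)
```

In this toy run the sofa caption never reaches the second stage, despite its high stub ITM score, because it falls outside the similarity shortlist. That is the trade-off of the two-stage design: far fewer ITM calls in exchange for depending on the first-stage ranking.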
FAQ
It is best for image-text matching tasks, such as scoring how well a caption describes an image, and for image-text retrieval on Flickr30k data.
This is the base version with a ViT-B backbone, pre-trained on 129M images. It is smaller than the ViT-L variant and was not trained with the larger CapFilt-L captioning and filtering stage.
Input: an image and a text string. Output: image-text matching logits when the ITM head is used, or a cosine similarity score when it is not.
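To inspect these inputs and outputs locally, the checkpoint can also be loaded with the Hugging Face transformers library. The following is a minimal sketch, assuming `torch`, `transformers`, and `Pillow` are installed and the weights can be downloaded. It is separate from the hosted API, and the helper name `score_pair` is illustrative.

```python
MODEL_ID = "Salesforce/blip-itm-base-flickr"


def score_pair(image_path, caption, model_id=MODEL_ID):
    """Return (match_probability, cosine_similarity) for one image-caption pair."""
    # Heavy dependencies are imported lazily, only when the function is called.
    import torch
    from PIL import Image
    from transformers import BlipForImageTextRetrieval, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForImageTextRetrieval.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=caption, return_tensors="pt")

    with torch.no_grad():
        # With the ITM head (the default): two logits per pair.
        itm_logits = model(**inputs).itm_score
        # Without the ITM head: similarity of the projected image and text features.
        similarity = model(**inputs, use_itm_head=False).itm_score

    # Assumption: index 1 of the ITM logits is the "match" class.
    match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()
    return match_prob, similarity[0, 0].item()
```

A call such as `score_pair("beach.jpg", "A woman and a dog sitting together on a beach.")` would return both scores for that pair; the file name here is a placeholder. Loading the model inside the function keeps the sketch short, but a real application would load it once and reuse it.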
Use the gigarouter OpenAI-compatible endpoint with your API key, sending image and text inputs as per the API documentation.
The original BLIP codebase is deprecated and no longer supported; the model is now integrated into the LAVIS library. The hosted API on gigarouter remains available for inference.
We're benchmarking and onboarding BLIP ITM Base Flickr as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.