BLIP ITM Large Flickr
Salesforce/blip-itm-large-flickr
published Dec 2022 · updated Feb 2025
BLIP ITM Large Flickr is a vision-language model fine-tuned for image-text matching on the Flickr30k dataset using a ViT large backbone.
specs
| Task | Image-Text Matching |
| Architecture | BLIP with ViT large backbone |
| License | Creative Commons Attribution 4.0 International |
about this model
Salesforce/blip-itm-large-flickr is an image-text matching model that computes the alignment score between an image and a text caption, built on the BLIP (Bootstrapping Language-Image Pre-training) framework with a ViT large backbone and fine-tuned on the Flickr30k dataset.
BLIP is a unified vision-language pre-training (VLP) framework designed to transfer flexibly to both understanding and generation tasks. Earlier VLP models tended to specialize in one of the two; BLIP handles both. It also addresses the noise in web-sourced image-text pairs through a bootstrapping approach: a captioner generates synthetic captions and a filter removes noisy ones, which makes large-scale web data more useful for training.
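The bootstrapping idea described above can be sketched in a few lines. This is an illustration only, not the authors' training code: `bootstrap_pairs`, `toy_captioner`, and `toy_filter` are hypothetical names, and the sketch assumes the filter is applied to both the original web caption and the synthetic one.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interfaces, introduced only for this illustration:
#   captioner(image)     -> a synthetic caption for the image
#   keep(image, caption) -> True if the filter judges the caption to match
Captioner = Callable[[str], str]
Filter = Callable[[str, str], bool]


def bootstrap_pairs(
    web_pairs: Iterable[Tuple[str, str]],
    captioner: Captioner,
    keep: Filter,
) -> List[Tuple[str, str]]:
    """Return (image, caption) pairs that survive the filter."""
    cleaned = []
    for image, web_caption in web_pairs:
        # Assumption: both the web caption and the synthetic caption
        # are candidates, and the filter decides which ones to keep.
        for caption in (web_caption, captioner(image)):
            if keep(image, caption):
                cleaned.append((image, caption))
    return cleaned


# Toy stand-ins so the sketch runs without any model.
# Images are represented by a single keyword string.
def toy_captioner(image: str) -> str:
    return f"a photo of a {image}"


def toy_filter(image: str, caption: str) -> bool:
    return image in caption


pairs = [("dog", "buy cheap shoes online"), ("cat", "my cat on the sofa")]
cleaned = bootstrap_pairs(pairs, toy_captioner, toy_filter)
```

In this toy run the unrelated web caption for "dog" is dropped and replaced by the synthetic one, while the "cat" image keeps both its web caption and its synthetic caption.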
Architecture and Capabilities
This variant uses a ViT-Large image encoder and is fine-tuned on Flickr30k for the image-text matching (ITM) task. Given an image and a text caption, it exposes two alignment scores: an ITM score from a dedicated matching head, and a cosine similarity between the image and text embeddings. The same embeddings can be used for multimodal feature extraction. Image captioning, conditional or unconditional, is handled by separate BLIP captioning checkpoints rather than by this ITM checkpoint.
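A minimal sketch of obtaining both scores with the Hugging Face `transformers` classes `BlipProcessor` and `BlipForImageTextRetrieval` might look as follows. It assumes `torch`, `transformers`, and `Pillow` are installed and that the checkpoint can be downloaded. The helper names `match_probability` and `score_pair` are hypothetical, and treating index 1 of the two ITM logits as the "match" class is an assumption based on the convention in the original BLIP demo.

```python
import math


def match_probability(no_match_logit: float, match_logit: float) -> float:
    """Two-class softmax: probability assigned to the 'match' class."""
    m = max(no_match_logit, match_logit)  # subtract max for numerical stability
    e_no = math.exp(no_match_logit - m)
    e_yes = math.exp(match_logit - m)
    return e_yes / (e_no + e_yes)


def score_pair(image_path: str, caption: str) -> dict:
    """Score one image-caption pair with the ITM head and with cosine similarity.

    Heavy dependencies are imported lazily so the helper above
    stays usable without them.
    """
    import torch
    from PIL import Image
    from transformers import BlipForImageTextRetrieval, BlipProcessor

    name = "Salesforce/blip-itm-large-flickr"
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForImageTextRetrieval.from_pretrained(name)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    # The processor tokenizes the text and preprocesses the image.
    inputs = processor(images=image, text=caption, return_tensors="pt")

    with torch.no_grad():
        # ITM head: two logits per pair, expected shape (1, 2).
        itm_logits = model(**inputs)[0]
        # Without the ITM head: image-text similarity, expected shape (1, 1).
        cosine = model(**inputs, use_itm_head=False)[0]

    # Assumption: index 0 is "no match", index 1 is "match".
    no_match, match = itm_logits[0].tolist()
    return {
        "itm_match_probability": match_probability(no_match, match),
        "cosine_similarity": cosine[0, 0].item(),
    }
```

Calling `score_pair("photo.jpg", "a woman and a dog on a beach")` with a local image (the file name here is a placeholder) downloads the checkpoint on first use and returns both scores. For repeated scoring, loading the processor and model once outside the function would be the more practical design.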
Benchmark Performance
For the BLIP framework as a whole, rather than this checkpoint specifically, the authors report state-of-the-art results across multiple vision-language tasks, with the following gains:
- Image-text retrieval: +2.7% in average recall@1
- Image captioning: +2.8% in CIDEr
- Visual question answering: +1.6% in VQA score
The BLIP framework also demonstrates strong generalization when transferred zero-shot to video-language tasks.
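For readers unfamiliar with the retrieval metric quoted above, recall@k is the fraction of queries for which at least one correct item appears among the top k results. The sketch below computes it for one retrieval direction on toy data. The function name and data layout are illustrative choices, not the evaluation code behind the reported numbers, and reading "average" as the mean over the image-to-text and text-to-image directions is an assumption.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    hits = 0
    for query, candidates in ranked.items():
        if relevant[query] & set(candidates[:k]):
            hits += 1
    return hits / len(ranked)


# Toy example: two image queries, each with a ranked list of caption ids.
ranked = {
    "img1": ["cap_a", "cap_b", "cap_c"],
    "img2": ["cap_c", "cap_b", "cap_a"],
}
relevant = {"img1": {"cap_a"}, "img2": {"cap_b"}}

r1 = recall_at_k(ranked, relevant, k=1)  # only img1 has a hit at rank 1
r2 = recall_at_k(ranked, relevant, k=2)  # both queries have a hit in the top 2
```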
Model Variants
The BLIP framework includes multiple pre-trained and fine-tuned checkpoints. This model is the large variant (ViT-L), fine-tuned on Flickr30k for image-text retrieval. BLIP is also available through the LAVIS library.
Ethical Considerations
This model is released for research purposes. It has not been specifically designed or evaluated for all downstream applications. Users should assess accuracy, safety, and fairness before deploying, particularly in high-risk scenarios.
best for
- Scoring image-text pairs for relevance
- Retrieving images that match a given caption
- Retrieving captions that match a given image
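One common way to build retrieval on top of a pairwise scorer is to shortlist candidates with a cheap similarity score and then rerank the shortlist with a more expensive matching score. The sketch below shows that pattern with toy scorers so it runs without a model. All names (`retrieve_captions`, `tag_overlap`, `tag_jaccard`) are hypothetical; in practice the cheap scorer would be the cosine similarity and the precise scorer the ITM head, which is an assumption about usage rather than a documented pipeline for this checkpoint.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer interface: (image, caption) -> higher is better.
Scorer = Callable[[str, str], float]


def retrieve_captions(
    image: str,
    captions: Sequence[str],
    fast_score: Scorer,
    precise_score: Scorer,
    shortlist_size: int = 2,
) -> List[Tuple[str, float]]:
    """Shortlist with a cheap scorer, then rerank with a precise one."""
    shortlist = sorted(
        captions, key=lambda c: fast_score(image, c), reverse=True
    )[:shortlist_size]
    scored = [(c, precise_score(image, c)) for c in shortlist]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy stand-ins: an "image" is a space-separated string of tags.
def tag_overlap(image: str, caption: str) -> float:
    return float(len(set(image.split()) & set(caption.split())))


def tag_jaccard(image: str, caption: str) -> float:
    tags, words = set(image.split()), set(caption.split())
    return len(tags & words) / len(tags | words)


image = "dog beach woman"
captions = [
    "a city street at night",
    "a dog",
    "a woman and a dog on a beach",
]
results = retrieve_captions(image, captions, tag_overlap, tag_jaccard)
best_caption = results[0][0]
```

Retrieving images for a given caption follows the same shape with the roles of image and caption swapped.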
FAQ
It is best for image-text matching tasks, such as scoring how well a caption describes an image or retrieving relevant images/captions.
It uses a BLIP framework with a ViT large backbone, fine-tuned on Flickr30k.
It is released under the Creative Commons Attribution 4.0 International license.
Once the model is available as a hosted API, use the gigarouter OpenAI-compatible endpoint with your API key to send image and text inputs and receive matching scores.
It expects an image and a text caption; the processor tokenizes the text and preprocesses the image for the model.
We're benchmarking and onboarding BLIP ITM Large Flickr as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.