Question 1

What is the input format for this model?

Accepted Answer

The model expects an image and a text caption. Use the BlipProcessor to preprocess both into tensors.

Question 2

How do I call this model via the gigarouter API?

Accepted Answer

Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image and text as part of the request.

Question 3

What is the difference between the ITM head and cosine similarity scores?

Accepted Answer

The ITM head outputs a direct matching score, while cosine similarity uses the unimodal embeddings for a similarity measure.

Question 4

What license does this model use?

Accepted Answer

It is released under the Creative Commons Attribution 4.0 International license.

Question 5

Can this model be used for image captioning?

Accepted Answer

No, this specific checkpoint is for image-text matching only; other BLIP variants handle captioning.

Task	Image-Text Matching
Architecture	BLIP with ViT-B backbone
Parameters	Not specified
License	Creative Commons Attribution 4.0 International

BLIP ITM Base COCO

specs