BiomedCLIP

microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

published Apr 2023 · updated Jan 2025

BiomedCLIP is a zero-shot image classification model for biomedical vision-language tasks such as cross-modal retrieval, image classification, and visual question answering. It was pretrained with contrastive learning on 15 million image-text pairs from PubMed Central.

status

coming soon

API providers

downloads / mo

565.8K

license

mit

specs

Task	Zero-shot Image Classification, Cross-modal Retrieval, Visual Question Answering
Architecture	PubMedBERT text encoder + Vision Transformer (ViT-B/16) image encoder
License	MIT

about this model

BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 is a zero-shot image classification model for biomedical vision-language processing (VLP) tasks, including cross-modal retrieval, image classification, and visual question answering. It uses PubMedBERT as the text encoder and a Vision Transformer (ViT-B/16) as the image encoder, and was pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from 4.4 million scientific articles in PubMed Central. PMC-15M is two orders of magnitude larger than prior biomedical multimodal datasets such as MIMIC-CXR.
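Contrastive pretraining on figure-caption pairs can be illustrated with a small sketch. The code below assumes the standard CLIP-style symmetric cross-entropy objective, where caption i is the only correct match for image i within a batch; this page does not spell out the exact loss, so treat it as an illustration rather than the authors' training code. The function names and the `logit_scale` default are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_embs, text_embs, logit_scale=1.0 / 0.07):
    """Symmetric cross-entropy over a batch of matched image-text pairs.

    logit_scale is an illustrative value (an assumption, not from the source).
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should peak at its own caption.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each column should peak at its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: aligned pairs should score a lower loss than swapped pairs.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing this loss pulls each image embedding toward its own caption and away from the other captions in the batch, which is what later makes zero-shot classification and retrieval possible.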

Comparison of BiomedCLIP performance against prior VLP approaches across multiple biomedical benchmarks

BiomedCLIP establishes new state-of-the-art results on a broad set of standard biomedical imaging benchmarks. On radiology tasks, including RSNA pneumonia detection, it outperforms both general-purpose VLP models and radiology-specific models such as BioViL. The evaluation spans retrieval, classification, and visual question answering, supported by extensive experiments and ablation studies. The model can also serve as a privacy-preserving proxy for analyzing proprietary biomedical data.

The model was trained on English-language corpora and is therefore limited to English text inputs. It is released under the MIT license.

best for

·Cross-modal retrieval of biomedical images
·Zero-shot classification of medical images (e.g., histopathology, radiology)
·Visual question answering on biomedical figures
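For the retrieval use case, a minimal sketch of the ranking step is shown here. It assumes the text query and the candidate images have already been embedded into a shared vector space by the two encoders; the helper names and the toy vectors are hypothetical and only illustrate ranking by cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_emb, candidate_embs, k=3):
    """Return (index, score) pairs for the k candidates most similar to the query."""
    scored = [
        (cosine_similarity(query_emb, emb), idx)
        for idx, emb in enumerate(candidate_embs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy data: a text-query embedding and three image embeddings (made-up values).
query = [1.0, 0.0]
images = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = retrieve_top_k(query, images, k=2)
print([idx for idx, _ in ranking])
```

The same routine works in the other direction (image query, text candidates), since both modalities share one embedding space.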

FAQ

What is BiomedCLIP best for?

It excels at biomedical zero-shot image classification, cross-modal retrieval, and visual question answering, with state-of-the-art performance on standard benchmarks.

What are the input and output formats?

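Input: a preprocessed image and one or more text prompts, typically one per candidate label (e.g., "this is a photo of [label]"). Output: image-text similarity logits, which can be converted into classification probabilities over the candidate labels.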
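As a rough sketch of how these inputs and outputs fit together, the code below builds one prompt per label from the template and turns similarity logits into probabilities with a softmax. It assumes the encoders have already produced the embeddings; the labels, the toy vectors, the function names, and the `logit_scale` value are all hypothetical and stand in for what the real model would supply.

```python
import math


def build_prompts(labels, template="this is a photo of "):
    """Fill the prompt template once per candidate label."""
    return [template + label for label in labels]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probabilities(image_emb, text_embs, logit_scale=100.0):
    """Return (logits, probabilities) for one image against several prompts.

    logit_scale is an illustrative constant; a trained model supplies its own.
    """
    image = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        text = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(image, text)))
    # Numerically stable softmax over the candidate labels.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return logits, [e / total for e in exps]


labels = ["chest X-ray", "brain MRI", "histopathology slide"]  # example labels
prompts = build_prompts(labels)

# Made-up embeddings standing in for the image and text encoder outputs.
image_emb = [0.9, 0.1, 0.2]
text_embs = [[1.0, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.2, 1.0]]

logits, probs = zero_shot_probabilities(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)
```

The logits are the raw similarity scores; the probabilities are only meaningful relative to the set of candidate labels supplied, so changing the label list changes the distribution.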

What license does BiomedCLIP use?

It is released under the MIT license.

How can I call this model via the gigarouter API?

The model is not yet live on gigarouter. Once it is hosted, use the OpenAI-compatible endpoint with your API key; see the gigarouter documentation for details.

What was the training data?

It was trained on PMC-15M, a dataset of 15 million figure-caption pairs from 4.4 million biomedical articles.

not yet live

We're benchmarking and onboarding BiomedCLIP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336