BiomedCLIP
microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
published Apr 2023 · updated Jan 2025
BiomedCLIP is a biomedical vision-language model for zero-shot image classification, cross-modal retrieval, and visual question answering, trained with contrastive learning on 15 million image-text pairs from PubMed Central.
specs
| Task | Zero-shot Image Classification, Cross-modal Retrieval, Visual Question Answering |
| Architecture | PubMedBERT text encoder + Vision Transformer (ViT-B/16) image encoder |
| License | MIT |
about this model
BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 is a zero-shot image classification model for biomedical vision-language processing (VLP) tasks, including cross-modal retrieval, image classification, and visual question answering. It uses PubMedBERT as the text encoder and a Vision Transformer (ViT-B/16) as the image encoder. Both were pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from 4.4 million scientific articles in PubMed Central. This dataset is two orders of magnitude larger than prior biomedical multimodal datasets such as MIMIC-CXR.
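To make the contrastive pretraining described above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired embeddings. This page does not spell out the exact objective, so the formulation is an assumption, and the function names and the `temperature` value are hypothetical. The toy vectors stand in for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Row i of image_embs and row i of text_embs are assumed to be a matched
    figure-caption pair; every other combination in the batch is a negative.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: image i should select caption i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: caption j should select image j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch: aligned pairs should score a much lower loss than shuffled pairs.
aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this shape pulls each figure embedding toward its own caption and away from the other captions in the batch, which is what lets a trained model compare images and text directly.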
BiomedCLIP's authors report new state-of-the-art results on a broad set of standard biomedical imaging benchmarks. On radiology tasks, including RSNA pneumonia detection, it outperforms both general-purpose VLP models and radiology-specific models such as BioViL. The model is evaluated on retrieval, classification, and visual question-answering tasks through extensive experiments and ablation studies. It can also serve as a privacy-preserving proxy for analyzing proprietary biomedical data.
The model was trained on English-language corpora and is therefore limited to English text inputs. It is released under the MIT license.
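Zero-shot classification with a dual-encoder model of this kind can be sketched as follows: each candidate label is turned into an English prompt, the prompt and the image are embedded, and the scaled cosine similarities are passed through a softmax. The embeddings below are hypothetical toy vectors standing in for the outputs of the text and image encoders. The `zero_shot_classify` helper and the `logit_scale` value are illustrative assumptions, not part of any published API.

```python
import math

# Prompt template quoted from this page's own input example.
PROMPT_TEMPLATE = "this is a photo of {}"


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, label_embs, logit_scale=100.0):
    """Return {label: probability} from scaled image-text cosine similarities.

    label_embs maps each label to the embedding of its prompt. The
    logit_scale value is an assumption chosen for illustration.
    """
    labels = list(label_embs)
    logits = [logit_scale * cosine(image_emb, label_embs[label]) for label in labels]
    return dict(zip(labels, softmax(logits)))


labels = ["chest X-ray", "histopathology"]
prompts = [PROMPT_TEMPLATE.format(label) for label in labels]

# Hypothetical embeddings; a real pipeline would get these from the encoders.
toy_label_embs = {"chest X-ray": [0.9, 0.1, 0.0], "histopathology": [0.1, 0.9, 0.1]}
toy_image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(toy_image_emb, toy_label_embs)
predicted = max(probs, key=probs.get)
```

No label-specific training is involved. Changing the label set only means embedding a different list of prompts.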
best for
- Cross-modal retrieval of biomedical images
- Zero-shot classification of medical images (e.g., histopathology, radiology)
- Visual question answering on biomedical figures
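As a rough illustration of the cross-modal retrieval use case listed above, the sketch below ranks a small image index by cosine similarity to a text query embedding. The image IDs and vectors are hypothetical placeholders for precomputed encoder outputs, and `retrieve` is an illustrative helper rather than a library function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, image_index, top_k=2):
    """Return the IDs of the top_k images most similar to the text query."""
    scored = [(cosine(query_emb, emb), image_id) for image_id, emb in image_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:top_k]]


# Hypothetical precomputed image embeddings keyed by figure ID.
toy_index = {
    "fig_a": [1.0, 0.0],
    "fig_b": [0.7, 0.7],
    "fig_c": [0.0, 1.0],
}
# Hypothetical embedding of a text query.
toy_query = [0.9, 0.1]

top_matches = retrieve(toy_query, toy_index, top_k=2)
```

The same ranking works in the other direction: embed an image as the query and rank a set of caption embeddings.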
FAQ
BiomedCLIP excels at biomedical zero-shot image classification, cross-modal retrieval, and visual question answering, with state-of-the-art performance on standard benchmarks.
Input: a preprocessed image and a text prompt (e.g., "this is a photo of [label]"). Output: similarity logits or classification probabilities.
BiomedCLIP is released under the MIT license.
To access the model, use the OpenAI-compatible endpoint with your API key; see the gigarouter documentation for details.
BiomedCLIP was trained on PMC-15M, a dataset of 15 million figure-caption pairs from 4.4 million biomedical articles.
We're benchmarking and onboarding BiomedCLIP as a hosted, OpenAI-compatible API.