CLIP ViT-B/16 LAION-2B

laion/CLIP-ViT-B-16-laion2B-s34B-b88K

published Jan 2023 · updated Apr 2023

CLIP ViT-B/16 LAION-2B is a zero-shot image model that performs image classification and image-text retrieval, having learned visual concepts from natural language supervision.

status

coming soon

API providers

downloads / mo

402.8K

license

mit

specs

Task	Zero-shot image classification, image and text retrieval
Architecture	ViT-B/16 (Vision Transformer with patch size 16)
Parameters	~150M (ViT-B/16)
License	MIT (OpenCLIP)

about this model

laion/CLIP-ViT-B-16-laion2B-s34B-b88K is a CLIP (contrastive language-image pre-training) model for zero-shot image classification. It pairs a ViT-B/16 vision encoder with a text encoder and was trained with OpenCLIP on the 2-billion-sample English subset of LAION-5B. The model is being onboarded as a managed API on gigarouter, which will provide OpenAI-compatible endpoints for zero-shot inference.

Key Capabilities

Zero-shot image classification: assign labels to images without task-specific fine-tuning by matching image embeddings to text prompt embeddings.
Image and text retrieval: rank images by text queries or vice versa using cosine similarity between embeddings.
Downstream adaptability: can be fine-tuned for image classification, used as a frozen feature extractor for linear probing, or employed for image generation guidance and conditioning.
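The matching step behind the first two capabilities can be sketched in plain Python. This is a minimal illustration, not the OpenCLIP implementation: the 3-dimensional vectors are made-up stand-ins for real embeddings, the helper names are hypothetical, and the `logit_scale` of 100 is an assumed temperature chosen for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Pick the label whose text embedding is closest to the image embedding.

    logit_scale is an assumed temperature for this sketch.
    """
    sims = [cosine_similarity(image_emb, t) for t in text_embs]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

def rank_images(text_emb, image_embs):
    """Return image indices ordered from most to least similar to the text query."""
    sims = [cosine_similarity(text_emb, e) for e in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: sims[i], reverse=True)

# Toy embeddings (made up for illustration; real ones come from the encoders).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.1, 1.0]]
image_emb = [0.9, 0.2, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs, labels)
ranking = rank_images(text_embs[1], [[0.9, 0.2, 0.1], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]])
```

In practice the image and text embeddings would come from the model's two encoders; only the comparison logic is shown here.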

Training and Data

The model was trained by Mehdi Cherti on the JUWELS Booster supercomputer at Jülich Supercomputing Centre. Training data is the English subset of LAION-5B: roughly 2 billion uncurated image-text pairs crawled from the public internet. The dataset is intended for research; users should be aware that it may contain harmful content.

Benchmark Results

The model achieves 70.2% zero-shot top-1 accuracy on ImageNet-1k. Evaluation was performed using the LAION CLIP Benchmark suite: on VTAB+ (the Visual Task Adaptation Benchmark with additional robustness datasets) for classification, and on COCO and Flickr for retrieval. Extended benchmark results are available in the CLIP benchmark repository.
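For readers unfamiliar with the metric, top-1 accuracy is the fraction of images whose highest-scoring label is the correct one. The sketch below computes it on made-up scores, along with recall@k, a metric commonly used for retrieval; that the benchmark suite reports exactly these formulations is an assumption here, and the function names are hypothetical.

```python
def top1_accuracy(score_rows, true_labels):
    """Fraction of rows whose highest-scoring class index equals the true label."""
    correct = 0
    for scores, truth in zip(score_rows, true_labels):
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        correct += int(predicted == truth)
    return correct / len(true_labels)

def recall_at_k(score_rows, true_indices, k):
    """Fraction of queries whose correct item appears among the k best-scored items."""
    hits = 0
    for scores, truth in zip(score_rows, true_indices):
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += int(truth in top_k)
    return hits / len(true_indices)

# Toy similarity scores: one row per image (or query), one column per candidate.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
truth = [0, 1, 2, 2]

accuracy = top1_accuracy(scores, truth)
recall_2 = recall_at_k(scores, truth, k=2)
```

On this toy data the third row is the only miss, since its true item is scored lowest.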

Usage Notes

As per the original OpenAI CLIP model card, this model is a research output intended for research communities. Deployed or unconstrained production use is not recommended without thorough domain-specific testing. Use in surveillance or facial recognition is out of scope. The model is designed for English-language inputs only.

best for

· Zero-shot image classification without task-specific training
· Image and text retrieval (e.g., searching images by natural language descriptions)
· Fine-tuning for downstream image classification tasks

FAQ

What is the primary use case for this model?

It is designed for zero-shot image classification and image-text retrieval, allowing you to classify or search images using natural language without task-specific training.

What is the model's zero-shot accuracy on ImageNet?

The model achieves 70.2% zero-shot top-1 accuracy on ImageNet-1k.

What license is this model released under?

The model is released under the MIT license via OpenCLIP.

How can I call this model via the gigarouter API?

The model is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key, sending an image and text prompts for classification or retrieval.

What training data was used for this model?

It was trained on the 2-billion-sample English subset of LAION-5B.

not yet live

We're benchmarking and onboarding CLIP ViT-B/16 LAION-2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336