
MobileCLIP S2

apple/MobileCLIP-S2-OpenCLIP

published Jun 2024 · updated Feb 2025

MobileCLIP S2 is an image-text model trained with contrastive learning for fast, accurate zero-shot image classification and retrieval.

status: coming soon
API providers: 0
downloads / mo: 185.4K
license: apple-amlr

specs

Task: Zero-Shot Image Classification & Retrieval
Architecture: MobileCLIP-S2 (image encoder + text encoder)
Parameters: 35.7M (image) + 63.4M (text) = 99.1M total
License: apple-amlr

about this model

MobileCLIP-S2-OpenCLIP is a zero-shot image classification model that uses multi-modal reinforced training to achieve fast inference with high accuracy. It is part of the MobileCLIP family introduced at CVPR 2024 and is optimized for runtime performance on mobile and edge devices.

[Figure: zero-shot performance vs. latency for MobileCLIP variants compared with OpenAI and SigLIP models]

Performance

MobileCLIP-S2 is trained on 13 billion seen samples and contains 35.7M image parameters plus 63.4M text parameters. It achieves a latency of 3.6 ms (image) + 3.3 ms (text). On ImageNet-1k zero-shot classification it reaches 74.4% top-1 accuracy, with an average performance of 63.7% across 38 evaluation datasets.

Compared to the previous best CLIP model based on ViT-B/16, MobileCLIP-S2 is 2.3× faster while being more accurate and 2.1× smaller. The multi-modal reinforced training approach also yields 10×–1000× improved learning efficiency versus non-reinforced CLIP training.

Checkpoint Comparison

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|---|---|---|---|---|---|
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
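To make the size and latency trade-off easier to compare, the short sketch below sums the per-encoder figures for a few checkpoints. The numbers are copied from the comparison table; the `CHECKPOINTS` dictionary and `summarize` helper are illustrative names, not part of any MobileCLIP or gigarouter API. Adding the two latencies assumes the image and text encoders run one after the other for each query.

```python
# Values copied from the checkpoint comparison table.
# Each entry: (image params M, text params M, image latency ms, text latency ms, IN-1k top-1 %)
CHECKPOINTS = {
    "MobileCLIP-S0": (11.4, 42.4, 1.5, 1.6, 67.8),
    "MobileCLIP-S1": (21.5, 63.4, 2.5, 3.3, 72.6),
    "MobileCLIP-S2": (35.7, 63.4, 3.6, 3.3, 74.4),
    "MobileCLIP-B": (86.3, 63.4, 10.4, 3.3, 76.8),
}


def summarize(name):
    """Return combined parameter count and latency for one checkpoint."""
    img_params, txt_params, img_ms, txt_ms, top1 = CHECKPOINTS[name]
    return {
        "total_params_m": round(img_params + txt_params, 1),
        # Assumption: both encoders run sequentially, so latencies add.
        "total_latency_ms": round(img_ms + txt_ms, 1),
        "top1": top1,
    }


if __name__ == "__main__":
    for checkpoint in CHECKPOINTS:
        print(checkpoint, summarize(checkpoint))
```

For MobileCLIP-S2 this gives 99.1M total parameters, matching the specs section, and a combined latency of 6.9 ms.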

Key Details

  • Released under the apple-amlr license (see license file).
  • Training code and DataCompDR datasets are available on Hugging Face as part of the apple/mobileclip-models-datacompdr-data collection.
  • The model is not yet live on gigarouter. Once onboarded, it is planned as a managed, OpenAI-compatible API that requires no local installation.

FAQ

What is MobileCLIP S2 best used for?

MobileCLIP S2 excels at zero-shot image classification and cross-modal retrieval tasks where speed and efficiency are critical, such as mobile or edge deployments.

How does MobileCLIP S2 compare to a ViT-B/16 CLIP model in speed and accuracy?

MobileCLIP S2 is 2.3× faster and 2.1× smaller than SigLIP's ViT-B/16 model, the previous best ViT-B/16-based CLIP model, while achieving better average zero-shot performance across 38 datasets.

What license is this model released under?

The MobileCLIP-S2 model is released under the apple-amlr license (see the LICENSE_weights_data file in the Apple repository).

What are the input and output formats for the API?

The model accepts an image and one or more text prompts and outputs a similarity score for each image-text pair. The hosted API is not yet live; once it is, requests will go to the gigarouter OpenAI-compatible endpoint with your API key.
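As a rough illustration of how similarity scores become zero-shot class probabilities, the sketch below scores one image embedding against several text-prompt embeddings using cosine similarity followed by a softmax. The embeddings are toy 3-dimensional placeholders rather than real MobileCLIP outputs, the helper names are hypothetical, and the `logit_scale` of 100 is an assumption in line with common CLIP practice; none of this reflects the gigarouter API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts.

    logit_scale is an assumed temperature, not a value taken from the model.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Toy placeholder embeddings; real ones would come from the two encoders.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]

if __name__ == "__main__":
    print(best, [round(p, 4) for p in probs])
```

The prompt with the highest probability is taken as the predicted label. For retrieval, the same cosine similarities can be used to rank a gallery of images against a single text query.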

How was MobileCLIP S2 trained?

It uses multi-modal reinforced training (MMRT) with knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders, achieving 10x–1000x improved learning efficiency over non-reinforced CLIP training.

not yet live

We're benchmarking and onboarding MobileCLIP S2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
