MobileCLIP S2
apple/MobileCLIP-S2-OpenCLIP
published Jun 2024 · updated Feb 2025
MobileCLIP S2 is a compact image-text model, trained with contrastive learning, that performs fast and accurate zero-shot image classification and retrieval.
specs
| Spec | Value |
|---|---|
| Task | Zero-Shot Image Classification & Retrieval |
| Architecture | MobileCLIP-S2 (image encoder + text encoder) |
| Parameters | 35.7M (image) + 63.4M (text) = 99.1M total |
| License | apple-amlr |
about this model
MobileCLIP-S2-OpenCLIP is a zero-shot image classification model that uses multi-modal reinforced training to achieve fast inference with high accuracy. It is part of the MobileCLIP family introduced at CVPR 2024 and is optimized for runtime performance on mobile and edge devices.
Performance
MobileCLIP-S2 is trained on 13 billion seen samples and contains 35.7M image parameters plus 63.4M text parameters. It achieves a latency of 3.6 ms (image) + 3.3 ms (text). On ImageNet-1k zero-shot classification it reaches 74.4% top-1 accuracy, with an average performance of 63.7% across 38 evaluation datasets.
Compared to the previous best CLIP model based on ViT-B/16, MobileCLIP-S2 is 2.3× faster and 2.1× smaller while achieving better average zero-shot performance. The multi-modal reinforced training approach also yields 10×–1000× better learning efficiency than non-reinforced CLIP training.
Checkpoint Comparison
| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|---|---|---|---|---|---|
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
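
The per-encoder figures in the table can be combined into per-checkpoint totals. The sketch below is illustrative only: the numbers are copied from the table above, the `CHECKPOINTS` and `totals` names are hypothetical, and summing the two latencies assumes the image and text encoders run one after the other for every query (in retrieval setups, text embeddings are often computed once and cached, so the effective cost can be lower).

```python
# Rows copied from the checkpoint comparison table:
# name -> (image params M, text params M, image latency ms, text latency ms, IN-1k top-1 %)
CHECKPOINTS = {
    "MobileCLIP-S0": (11.4, 42.4, 1.5, 1.6, 67.8),
    "MobileCLIP-S1": (21.5, 63.4, 2.5, 3.3, 72.6),
    "MobileCLIP-S2": (35.7, 63.4, 3.6, 3.3, 74.4),
    "MobileCLIP-B": (86.3, 63.4, 10.4, 3.3, 76.8),
    "MobileCLIP-B (LT)": (86.3, 63.4, 10.4, 3.3, 77.2),
}


def totals(name):
    """Return (total params in M, combined latency in ms) for one checkpoint."""
    img_params, txt_params, img_ms, txt_ms, _ = CHECKPOINTS[name]
    # Assumption: both encoders run sequentially, so latencies simply add.
    return round(img_params + txt_params, 1), round(img_ms + txt_ms, 1)


for name in CHECKPOINTS:
    params, latency = totals(name)
    top1 = CHECKPOINTS[name][4]
    print(f"{name}: {params} M params, {latency} ms combined, {top1}% IN-1k top-1")
```

For MobileCLIP-S2 this reproduces the 99.1M total listed in the specs.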
Key Details
- Released under the apple-amlr license (see license file).
- Training code and DataCompDR datasets are available on Hugging Face as part of the apple/mobileclip-models-datacompdr-data collection.
- The model is hosted as a managed API on gigarouter, requiring no local installation; simply call the OpenAI-compatible endpoint.
best for
- Zero-shot image classification on custom categories
- Image-to-text and text-to-image retrieval
- Visual search and recommendation systems
FAQ
MobileCLIP S2 excels at zero-shot image classification and cross-modal retrieval tasks where speed and efficiency are critical, such as mobile or edge deployments.
MobileCLIP S2 is 2.3× faster and 2.1× smaller than the previous best ViT-B/16-based CLIP model while achieving better average zero-shot performance across 38 datasets.
The MobileCLIP-S2 model is released under the apple-amlr license (see the LICENSE_weights_data file in the Apple repository).
The model accepts image and text inputs and outputs similarity scores. For API usage, send a request to the gigarouter OpenAI-compatible endpoint with your API key.
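As a rough illustration of how similarity scores become zero-shot predictions, the sketch below compares one image embedding against several text-prompt embeddings using cosine similarity and a softmax. Everything here is an assumption for illustration: the 3-dimensional vectors are toy values rather than real MobileCLIP outputs, the function names are hypothetical, and `logit_scale=100.0` is a typical CLIP-style temperature, not a value confirmed for this model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Map each text prompt to a probability given one image embedding."""
    labels = list(text_embeddings)
    sims = [cosine_similarity(image_embedding, text_embeddings[l]) for l in labels]
    # Assumed CLIP-style temperature: scale similarities before the softmax.
    probs = softmax([logit_scale * s for s in sims])
    return dict(zip(labels, probs))


# Toy embeddings (made up for illustration, not produced by the model).
image_vec = [0.9, 0.1, 0.2]
prompts = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

probs = zero_shot_classify(image_vec, prompts)
best = max(probs, key=probs.get)
print(best, round(probs[best], 4))
```

The same similarity scores can be sorted instead of passed through a softmax to rank candidates for image-to-text or text-to-image retrieval.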
MobileCLIP S2 uses multi-modal reinforced training (MMRT) with knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders, achieving 10×–1000× better learning efficiency than non-reinforced CLIP training.
We're benchmarking and onboarding MobileCLIP S2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.