Question 1

What is MobileCLIP S2 best used for?

Accepted Answer

MobileCLIP S2 excels at zero-shot image classification and cross-modal retrieval tasks where speed and efficiency are critical, such as mobile or edge deployments.

Question 2

How does MobileCLIP S2 compare to SigLIP's ViT-B/16 in speed and accuracy?

Accepted Answer

MobileCLIP S2 is 2.3x faster and 2.1x smaller than SigLIP's ViT-B/16 while achieving better average zero-shot performance across 38 datasets.

Question 3

What license is this model released under?

Accepted Answer

The MobileCLIP-S2 model is released under the apple-amlr license (see the LICENSE_weights_data file in the Apple repository).

Question 4

What are the input and output formats for the API?

Accepted Answer

The model accepts image and text inputs and returns image-text similarity scores. For API usage, send a request authenticated with your API key to the gigarouter OpenAI-compatible endpoint.
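
As a rough illustration of what a similarity score means for a CLIP-style model, the sketch below L2-normalizes an image embedding and several text embeddings, takes their cosine similarities, scales them, and applies a softmax over the candidate labels. The toy 3-dimensional embeddings, the helper names (`l2_normalize`, `zero_shot_scores`), and the scale factor of 100 are assumptions chosen for illustration. They are not real MobileCLIP-S2 outputs and not the endpoint's actual implementation or request format.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_scores(image_emb, text_embs, scale=100.0):
    """Return softmax probabilities over candidate texts for one image.

    `scale` is an assumed logit scale, not a value taken from the model card.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(scale * cosine)
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings; a real encoder would produce much longer vectors.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = zero_shot_scores(image_emb, text_embs)
best = labels[scores.index(max(scores))]
print(best, [round(s, 4) for s in scores])
```

The label with the highest score is the zero-shot prediction. For retrieval, the same cosine similarities can be used to rank candidates without applying the softmax.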

Question 5

How was MobileCLIP S2 trained?

Accepted Answer

It uses multi-modal reinforced training (MMRT) with knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders, achieving 10x–1000x improved learning efficiency over non-reinforced CLIP training.

Task	Zero-Shot Image Classification & Retrieval
Architecture	MobileCLIP-S2 (image encoder + text encoder)
Parameters	35.7M (image) + 63.4M (text) = 99.1M total
License	apple-amlr

Model	# Seen Samples (B)	# Params (M) (img + txt)	Latency (ms) (img + txt)	IN-1k Zero-Shot Top-1 Acc. (%)	Avg. Perf. (%) on 38 datasets
MobileCLIP-S0	13	11.4 + 42.4	1.5 + 1.6	67.8	58.1
MobileCLIP-S1	13	21.5 + 63.4	2.5 + 3.3	72.6	61.3
MobileCLIP-S2	13	35.7 + 63.4	3.6 + 3.3	74.4	63.7
MobileCLIP-B	13	86.3 + 63.4	10.4 + 3.3	76.8	65.2
MobileCLIP-B (LT)	36	86.3 + 63.4	10.4 + 3.3	77.2	65.8
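
The per-encoder figures in the table can be combined into totals, as in the sketch below. The numbers are copied from the table above. The summed latency assumes the image and text encoders run one after the other, which is an assumption for illustration, since the table reports the two latencies separately. The `totals` helper is a hypothetical name.

```python
# (image params M, text params M, image latency ms, text latency ms)
checkpoints = {
    "MobileCLIP-S0": (11.4, 42.4, 1.5, 1.6),
    "MobileCLIP-S1": (21.5, 63.4, 2.5, 3.3),
    "MobileCLIP-S2": (35.7, 63.4, 3.6, 3.3),
    "MobileCLIP-B": (86.3, 63.4, 10.4, 3.3),
    "MobileCLIP-B (LT)": (86.3, 63.4, 10.4, 3.3),
}


def totals(name):
    """Return (total params in M, combined latency in ms) for one checkpoint."""
    img_params, txt_params, img_ms, txt_ms = checkpoints[name]
    # Round to one decimal to match the table's precision and hide float noise.
    return round(img_params + txt_params, 1), round(img_ms + txt_ms, 1)


for name in checkpoints:
    params_m, latency_ms = totals(name)
    print(f"{name}: {params_m}M params, {latency_ms} ms (img + txt)")
```

For MobileCLIP-S2 this gives 99.1M parameters, matching the total in the specs above, and 6.9 ms of combined latency under the sequential assumption.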
