TinyCLIP ViT-8M/16

wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M

published Dec 2023 · updated May 2024

TinyCLIP ViT-8M/16 is a zero-shot image classification model that scores image-text similarity. It is distilled from CLIP via affinity mimicking and weight inheritance.

est. price

~$0.047

/ 1k images · estimated, set at launch

API providers

downloads / mo

311.3K

license

mit

specs

Task	Zero-shot image classification, image-text similarity
Architecture	Vision Transformer (ViT-8M/16) + Text Transformer (3M)
Parameters	11M total (8M vision + 3M text)
License	MIT (see LICENSE file in repo)

about this model

TinyCLIP-ViT-8M-16-Text-3M-YFCC15M is a zero-shot image classification model that also supports cross-modal retrieval between images and text, with no task-specific fine-tuning. It is distilled from CLIP using affinity mimicking and weight inheritance, and is designed to balance accuracy against computational cost. The model reaches 41.1% zero-shot top-1 accuracy on ImageNet at 2.0 GMACs, with a throughput of 4,150 image-text pairs per second. It surpasses a CLIP ViT-B/16 baseline trained on the same YFCC-15M data by 3.5 percentage points in ImageNet zero-shot accuracy while using only 8.9% of the parameters. Distillation with weight inheritance also speeds up training by 1.4–7.8× compared with training from scratch.
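To make the zero-shot step concrete, here is a minimal sketch of how a CLIP-style model typically turns embeddings into label probabilities: cosine similarity between the image embedding and each text embedding, scaled and passed through a softmax. The toy 3-dimensional embeddings, the function names, and the `logit_scale=100.0` value are assumptions for illustration, not figures from this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one entry per candidate label.

    logit_scale=100.0 is an assumed temperature, not a value from this page.
    """
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings, made up for illustration; real ones come from the two encoders.
image_emb = [0.9, 0.1, 0.0]
text_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}
probs = zero_shot_probs(image_emb, list(text_embs.values()))
best_label = max(zip(text_embs, probs), key=lambda pair: pair[1])[0]
print(best_label, [round(p, 4) for p in probs])
```

Because the class list is just a set of text prompts, changing the candidate labels changes the classifier without any retraining, which is what "zero-shot" refers to here.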

Key Strengths

·Extreme parameter efficiency: 8M vision encoder and 3M text encoder
·High throughput: 4,150 pairs/s for rapid inference
·Competitive zero-shot accuracy relative to model size

Benchmark Results

Model Variant	ImageNet Acc@1 (%)	MACs (G)	Throughput (pairs/s)
TinyCLIP ViT-8M/16 Text-3M	41.1	2.0	4,150
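As a rough back-of-the-envelope aid, the sketch below combines the benchmark throughput above with the estimated price listed on this page. It assumes one image-text pair per image and that the benchmark throughput carries over to your setup, which it may not; the function name and defaults are illustrative only.

```python
def estimate_batch(n_images, pairs_per_second=4150.0, usd_per_1k_images=0.047):
    """Return (seconds, usd) for scoring n_images, one image-text pair each.

    Defaults are the benchmark throughput and the estimated price quoted on
    this page; neither is a guarantee for a hosted deployment.
    """
    if n_images < 0:
        raise ValueError("n_images must be non-negative")
    seconds = n_images / pairs_per_second
    usd = n_images / 1000.0 * usd_per_1k_images
    return seconds, usd

seconds, usd = estimate_batch(1_000_000)
print(f"{seconds:.0f} s, ${usd:.2f}")
```

Under these assumptions, a million images works out to roughly four minutes of compute and about $47.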

Diagram of TinyCLIP distillation method showing affinity mimicking and weight inheritance process

Comparison chart showing TinyCLIP model size versus zero-shot accuracy relative to CLIP ViT-B/32

gigarouter is onboarding this model as a managed, OpenAI-compatible API. Once it is live, no local installation or model loading will be required; you call the API endpoint for zero-shot image classification or image-text similarity tasks.

best for

·Zero-shot classification of images without fine-tuning
·Image-text similarity search and retrieval
·Lightweight CLIP model for edge or real-time applications

FAQ

What is the input format for this model?

It accepts text prompts and images. Use the CLIPProcessor to tokenize text and preprocess images into tensors.
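For local use, a minimal sketch with the Hugging Face transformers library might look like the following. It assumes `transformers`, `torch`, and `Pillow` are installed and that the checkpoint loads through `CLIPModel` and `CLIPProcessor`; the prompt template and the helper names `build_prompts` and `classify` are illustrative choices, not part of the model.

```python
def build_prompts(labels, template="a photo of a {}"):
    """Wrap each class name in a caption-style prompt (template is an assumption)."""
    return [template.format(label) for label in labels]

def classify(image_path, labels, model_id="wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"):
    """Return {label: probability} for one local image file."""
    # Imported lazily so build_prompts stays usable without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)
    image = Image.open(image_path).convert("RGB")
    # The processor tokenizes the prompts and resizes/normalizes the image.
    inputs = processor(text=build_prompts(labels), images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, len(labels)); softmax turns it into probabilities.
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    return dict(zip(labels, probs.tolist()))
```

Calling `classify("cat.jpg", ["cat", "dog"])`, where `cat.jpg` is a placeholder for any local image, downloads the checkpoint on first use and returns one probability per label.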

How many parameters does TinyCLIP ViT-8M/16 have?

The model has 11 million parameters total (8M vision encoder + 3M text encoder).

What is the zero-shot accuracy on ImageNet?

It achieves 41.1% top-1 accuracy on ImageNet zero-shot classification.

How can I call this model via the gigarouter API?

Once the model is live, call the OpenAI-compatible endpoint with your API key, sending the text prompts and the image data in the request.
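The request schema is not documented on this page, so the following is only a sketch of what preparing such a call could look like. The field names (`model`, `image`, `candidate_labels`), the model identifier string, and the helper names are hypothetical placeholders; only the bearer-token header follows the general convention of OpenAI-compatible APIs. No network call is made.

```python
import base64
import json

def build_request(image_bytes, labels, model="wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"):
    """Build a JSON body; every field name here is a hypothetical placeholder."""
    return json.dumps({
        "model": model,
        # Binary image data is base64-encoded so it can travel inside JSON.
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "candidate_labels": list(labels),
    })

def build_headers(api_key):
    """Bearer-token headers, the convention OpenAI-compatible APIs generally use."""
    return {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}

# Placeholder bytes stand in for a real image file read with open(path, "rb").
body = build_request(b"\x89PNG placeholder bytes", ["cat", "dog"])
headers = build_headers("YOUR_API_KEY")
```

Check the provider's API reference for the actual route and payload shape before relying on any of these names.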

What is the license for this model?

The model is released under the MIT license; see the LICENSE file in the official repository.

not yet live

We're benchmarking and onboarding TinyCLIP ViT-8M/16 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models


clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336