gigarouter

ALIGN Base

kakaobrain/align-base

published Feb 2023 · updated Mar 2023

ALIGN Base is a zero-shot image classification and multi-modal embedding model that uses a dual-encoder architecture with EfficientNet and BERT to align visual and text representations via contrastive learning.

status: coming soon
API providers: 0
downloads / mo: 12.3K

specs

Task: Zero-Shot Image Classification & Multi-Modal Embedding Retrieval
Architecture: Dual-encoder, EfficientNet (vision) + BERT (text)
Parameters: Not specified in the card
License: Not specified in the card

about this model

ALIGN (base model) is a zero-shot image classification and multi-modal embedding model that uses a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, trained with contrastive learning to align visual and language representations.
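As a rough illustration of what "contrastive learning" means for a dual encoder, the sketch below computes a symmetric cross-entropy loss over an image-text similarity matrix using toy vectors. It is not Kakao Brain's training code: the function names are hypothetical, the embeddings are hand-written stand-ins for encoder outputs, and the fixed temperature of 0.07 is an assumption chosen only for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss; entry i of each list is assumed to be a matched pair.

    The temperature value is an assumption made for this illustration.
    """
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    n = len(images)
    # sim[i][j] is the temperature-scaled cosine similarity of image i and text j.
    sim = [
        [sum(a * b for a, b in zip(images[i], texts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each image should score its own caption highest.
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image: each caption should score its own image highest.
    text_to_image = -sum(
        log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy stand-ins for encoder outputs (hypothetical values).
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, the matched pairing yields a much lower loss than the shuffled pairing, which is the signal that pulls matching image and text embeddings together during training.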

Developed by Kakao Brain, this model is a reproduction of Google's ALIGN architecture, trained on the open-source COYO-700M dataset rather than Google's non-public dataset of 1.8 billion image-text pairs. Despite training on a smaller dataset, Kakao Brain's ALIGN matches or exceeds Google's reported metrics. The model supports zero-shot image classification, multi-modal embedding retrieval, and cross-modality search.

Architecture and Training

The model uses a dual-encoder design: EfficientNet for vision encoding and BERT for text encoding, trained with contrastive learning. The COYO-700M dataset (which actually contains 747 million image-text pairs) was collected from CommonCrawl (October 2020 to August 2021) and filtered as follows:

  • Image-level filtering: removed images with a file size under 5KB, an aspect ratio greater than 3.0, a minimum dimension below 200 pixels, or an NSFW score above 0.5 (OpenNSFW2 and GantMan/NSFW models).
  • Text-level filtering: kept English text only (detected with cld3); removed texts with a length of 5 characters or fewer, with no noun form, with fewer than 3 or more than 256 words, with a length over 1000, that appear more than 10 times, or that contain NSFW content.
  • Deduplication: removed duplicate (image_phash, text) pairs, including against external datasets (ImageNet, MS-COCO, CC-3M/12M, Flickr-30K).
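The filtering rules above can be sketched as simple predicates. This is an illustrative sketch, not the COYO-700M pipeline: the function names and argument layout are hypothetical, "5KB" is assumed to mean 5 × 1024 bytes, "length over 1000" is assumed to mean characters, and the language, noun-form, and NSFW-text checks are omitted because they need external models.

```python
def keep_image(size_bytes, width, height, nsfw_scores):
    """Apply the image-level thresholds listed above to one image's metadata."""
    if size_bytes < 5 * 1024:              # assumption: 5KB = 5 * 1024 bytes
        return False
    if min(width, height) < 200:           # minimum dimension in pixels
        return False
    if max(width, height) / min(width, height) > 3.0:  # aspect ratio
        return False
    if any(score > 0.5 for score in nsfw_scores):      # any NSFW model above 0.5
        return False
    return True


def keep_text(text, occurrences=1):
    """Apply the length, word-count, and frequency rules to one caption."""
    words = text.split()
    if len(text) <= 5 or len(text) > 1000:  # assumption: both lengths are in characters
        return False
    if len(words) < 3 or len(words) > 256:
        return False
    if occurrences > 10:                    # caption repeated too often in the corpus
        return False
    return True


def deduplicate(pairs):
    """Keep the first occurrence of each (image_phash, text) pair, preserving order."""
    seen = set()
    unique = []
    for phash, text in pairs:
        if (phash, text) not in seen:
            seen.add((phash, text))
            unique.append((phash, text))
    return unique


# Hypothetical metadata for illustration.
ok_image = keep_image(size_bytes=20_000, width=640, height=480, nsfw_scores=[0.1, 0.2])
ok_text = keep_text("a red bicycle leaning on a wall")
unique_pairs = deduplicate([("a1", "cat"), ("a1", "cat"), ("a1", "dog")])
```

The deduplication step here only covers exact (image_phash, text) repeats within one list; matching against external datasets would need their hashes as an additional input.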

Dataset Comparison

| Feature | COYO-700M | LAION 2B | ALIGN 1.8B |
| --- | --- | --- | --- |
| Image-text similarity scores | Provided as metadata (CLIP ViT-B/32 and ViT-L/14); nothing filtered out | Provided; only examples above 0.28 threshold | Minimal, frequency-based filtering |
| NSFW filtering | On images and text | On images | Google Cloud Vision API |
| Face recognition data | Provided as metadata | None | N/A |
| Pair count | 700 million (747M actual) | 2 billion | 1.8 billion |
| Source | CommonCrawl Oct 2020–Aug 2021 | CommonCrawl 2014–2020 | N/A |
| Aesthetic score | Yes | Partial | N/A |
| Watermark score | More robust | Yes | N/A |
| Public availability | Hugging Face Hub | Hugging Face Hub | Not public |

[Figure: Comparison of zero-shot classification accuracy across datasets for ALIGN models]

The model is intended as a research output for AI researchers studying zero-shot classification, robustness, generalization, and the capabilities and biases of vision-language models.

FAQ

What is ALIGN Base best used for?

It is best for zero-shot image classification and multi-modal embedding retrieval, allowing you to classify images or search across images and text without task-specific fine-tuning.
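To make "search across images and text" concrete, here is a minimal sketch of text-to-image retrieval by cosine similarity in a shared embedding space. The three-dimensional vectors and file names are made up for illustration; real embeddings would come from the model's text and image encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return image names sorted from most to least similar to the text query."""
    scores = {name: cosine(text_embedding, vec) for name, vec in image_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical precomputed image embeddings and one text-query embedding.
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for the embedding of a caption such as "a photo of a cat"
ranking = rank_images(query, image_index)  # ["cat.jpg", "dog.jpg", "car.jpg"]
```

Zero-shot classification is the same comparison run in the other direction: one image embedding is scored against the embeddings of several candidate label prompts.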

What architecture does ALIGN Base use?

It uses a dual-encoder architecture with EfficientNet as the vision encoder and BERT as the text encoder, trained with contrastive learning.

What dataset was ALIGN Base trained on?

It was trained on the open-source COYO-700M dataset, which contains 747 million image-text pairs from CommonCrawl web pages.

How do I use ALIGN Base via the gigarouter API?

The model is not yet live on gigarouter. Once it is, you will be able to call it through the gigarouter OpenAI-compatible endpoint with your API key, sending text and image inputs for zero-shot classification or embedding retrieval.

What are the input and output formats for ALIGN Base?

Input is an image plus one or more text prompts; output is either image-text similarity logits (which a softmax converts into classification probabilities) or separate text and image embeddings.
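The hosted endpoint is not yet live, so the sketch below shows this input/output shape by running the model locally with the Hugging Face Transformers `AlignProcessor` and `AlignModel` classes. It assumes `torch`, `transformers`, and `Pillow` are installed and that the `kakaobrain/align-base` checkpoint can be downloaded; the wrapper function name and its arguments are hypothetical, and this is not the gigarouter API.

```python
def classify_image(image_path, candidate_labels, model_id="kakaobrain/align-base"):
    """Return a {label: probability} dict for one image (hypothetical helper)."""
    # Imports are local so that defining this function does not require the libraries.
    import torch
    from PIL import Image
    from transformers import AlignModel, AlignProcessor

    processor = AlignProcessor.from_pretrained(model_id)
    model = AlignModel.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    # The processor tokenizes the text prompts and preprocesses the image together.
    inputs = processor(images=image, text=candidate_labels, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_labels); take the single image's row.
    logits = outputs.logits_per_image[0]
    # Softmax turns the similarity logits into probabilities over the candidate labels.
    probs = logits.softmax(dim=0)
    return dict(zip(candidate_labels, probs.tolist()))
```

A call might look like `classify_image("photo.jpg", ["an image of a cat", "an image of a dog"])`, where the file name and labels are placeholders.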

not yet live

We're benchmarking and onboarding ALIGN Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
