Falcon Perception

tiiuae/Falcon-Perception

published Feb 2026 · updated May 2026

Falcon Perception is a 0.6B parameter early-fusion vision-language model for open-vocabulary grounding and instance segmentation that generates pixel-accurate masks from an image and a natural language query.

status

coming soon

API providers

downloads / mo

5.2K

license

apache-2.0

specs

Task	Open-vocabulary grounding and instance segmentation
Architecture	Early-fusion dense Transformer with hybrid attention (bidirectional image tokens, causal text/task tokens)
Parameters	632M
License	Apache 2.0

about this model

Falcon Perception is a 632 million parameter early-fusion vision-language model for open-vocabulary instance segmentation and grounding. Given an image and a natural language query, it returns zero or more segmented instances with pixel-accurate masks.

Architecture

The model uses a single dense Transformer with hybrid attention: bidirectional among image tokens for visual context, causal for text and task tokens conditioned on the image. For each instance it generates a structured sequence of task tokens (<|coord|>, <|size|>, <|seg|>). The <|seg|> token acts as a mask query; its hidden state is projected and dotted with upsampled image features to produce a full-resolution binary mask without autoregressive mask generation. Small specialized heads with Fourier features handle continuous spatial outputs, enabling parallel mask prediction. Falcon Perception example outputs showing segmentation masks for multiple objects.

Falcon Perception example outputs showing segmentation masks for multiple objects.

Evaluation Results

SA-Co (open-vocabulary segmentation): 68.0 Macro F1, compared to 62.3 for SAM 3.
PBench (diagnostic benchmark for compositional prompts): 57 average score (community evaluation).
Main remaining gap is presence calibration (Average MCC 0.64 vs 0.82 for SAM 3).

Limitations

False positives are more likely on hard negatives due to autoregressive dense decoding.
OCR-driven prompts are sensitive to text size and image resolution.
Dense scenes benefit strongly from high-resolution inputs.

The model is licensed under Apache 2.0. A companion model, Falcon OCR (300M parameters), achieves 80.3% on olmOCR and 88.64 on OmniDocBench for text extraction.

best for

·Natural language driven object selection in images
·Promptable instance segmentation for downstream pipelines
·Dense, crowded scenes with many variable instances

FAQ

What is Falcon Perception best used for?

It is designed for dense grounding and open-vocabulary instance segmentation, such as selecting objects in images by natural language or segmenting crowded scenes with many instances.

What is the model architecture and size?

It is a 632M parameter early-fusion dense Transformer with hybrid attention (bidirectional image tokens, causal text/task tokens) and specialized heads for coordinate, size, and mask prediction.

What license is Falcon Perception released under?

It is released under the Apache 2.0 license.

What input and output formats does the model use?

Input: a PIL image and a natural language query string. Output: a list of prediction dicts, each containing normalized center coordinates (xy), size (hw), and a COCO RLE mask at original resolution.

How can I call Falcon Perception via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, passing an image and a text query to generate masks.

not yet live

We're benchmarking and onboarding Falcon Perception as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related mask generation models

compare all →

sam3

1.7M dl/mo