Falcon Perception
tiiuae/Falcon-Perception
published Feb 2026 · updated May 2026
Falcon Perception is a 0.6B-parameter early-fusion vision-language model for open-vocabulary grounding and instance segmentation. Given an image and a natural-language query, it generates pixel-accurate masks for the matching instances.
specs
| Spec | Value |
| --- | --- |
| Task | Open-vocabulary grounding and instance segmentation |
| Architecture | Early-fusion dense Transformer with hybrid attention (bidirectional image tokens, causal text/task tokens) |
| Parameters | 632M |
| License | Apache 2.0 |
about this model
Architecture
The model uses a single dense Transformer with hybrid attention: attention is bidirectional among image tokens, which gives full visual context, and causal for text and task tokens, which are conditioned on the image. For each instance it generates a structured sequence of task tokens (<|coord|>, <|size|>, <|seg|>). The <|seg|> token acts as a mask query: its hidden state is projected and dotted with upsampled image features to produce a full-resolution binary mask, so masks are not generated autoregressively and can be predicted in parallel. Small specialized heads with Fourier features handle the continuous spatial outputs (coordinates and size).
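The two mechanisms described above can be illustrated with a small, dependency-free sketch. It is not the model's actual implementation. It assumes that image tokens form a prefix of the sequence, that image tokens do not attend to later text/task tokens, and that a mask logit above 0 counts as foreground. The function names `hybrid_attention_mask` and `mask_from_query` are hypothetical.

```python
def hybrid_attention_mask(num_image: int, num_text: int) -> list[list[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumed layout: image tokens at positions [0, num_image), followed by
    num_text text/task tokens.
    """
    total = num_image + num_text
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_image:
                # Every token (image or text/task) sees every image token.
                mask[q][k] = True
            elif q >= num_image:
                # Text/task query and text/task key: causal attention.
                mask[q][k] = k <= q
            # Image query with text/task key stays False (assumption).
    return mask


def mask_from_query(query: list[float], features: list[list[list[float]]]) -> list[list[int]]:
    """Dot a projected <|seg|> query (length D) with H x W x D image features.

    Returns an H x W binary mask; thresholding the logit at 0 is an assumption.
    """
    return [
        [1 if sum(q * f for q, f in zip(query, pixel)) > 0 else 0 for pixel in row]
        for row in features
    ]


if __name__ == "__main__":
    for row in hybrid_attention_mask(num_image=2, num_text=2):
        print(["x" if allowed else "." for allowed in row])
    feats = [[[2.0, 1.0], [0.0, 3.0]], [[1.0, 1.0], [5.0, 0.0]]]
    print(mask_from_query([1.0, -1.0], feats))
```

Because each mask is a single dot product between one query vector and the shared feature map, the masks for all instances can be computed independently, which is what makes parallel mask prediction possible.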
Evaluation Results
- SA-Co (open-vocabulary segmentation): 68.0 Macro F1, compared to 62.3 for SAM 3.
- PBench (diagnostic benchmark for compositional prompts): 57 average score (community evaluation).
- The main remaining gap is presence calibration: average MCC (Matthews correlation coefficient) is 0.64, compared to 0.82 for SAM 3.
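For readers unfamiliar with MCC, the sketch below computes the Matthews correlation coefficient from confusion counts. It assumes presence is scored as a binary decision (does the queried concept appear in the image or not); the benchmark's exact protocol and averaging are not described here, so treat this as an illustration of the metric only. The helper name `mcc` is hypothetical.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary decision.

    tp/fp/tn/fn are true/false positives/negatives. The result ranges from
    -1 (always wrong) through 0 (chance level) to +1 (always right).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # A model that over-predicts presence on negative prompts (many false
    # positives) scores lower than one with the same recall and fewer of them.
    print(round(mcc(tp=45, fp=15, tn=35, fn=5), 3))
    print(round(mcc(tp=45, fp=5, tn=45, fn=5), 3))
```

Unlike accuracy, MCC penalizes both false positives and false negatives even when positives and negatives are imbalanced, which is why it is a reasonable summary of presence calibration.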
Limitations
- False positives are more likely on hard negatives due to autoregressive dense decoding.
- OCR-driven prompts are sensitive to text size and image resolution.
- Dense scenes benefit strongly from high-resolution inputs.
best for
- Natural language driven object selection in images
- Promptable instance segmentation for downstream pipelines
- Dense, crowded scenes with many variable instances
FAQ
Falcon Perception is designed for dense grounding and open-vocabulary instance segmentation, such as selecting objects in images by natural language or segmenting crowded scenes with many instances.
Falcon Perception is a 632M-parameter early-fusion dense Transformer with hybrid attention (bidirectional image tokens, causal text/task tokens) and specialized heads for coordinate, size, and mask prediction.
Falcon Perception is released under the Apache 2.0 license.
Input: a PIL image and a natural language query string. Output: a list of prediction dicts, each containing normalized center coordinates (xy), size (hw), and a COCO RLE mask at original resolution.
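As an illustration of working with this output, the sketch below converts one prediction's normalized center and size into a pixel-space bounding box. The exact layout of the prediction dict is an assumption: here `xy` is taken to be an `(x, y)` pair and `hw` an `(h, w)` pair, both normalized to [0, 1]. The helper `to_pixel_box` is hypothetical, and decoding the COCO RLE mask is left out.

```python
def to_pixel_box(pred: dict, image_width: int, image_height: int) -> tuple[float, float, float, float]:
    """Convert a normalized center/size prediction to (x0, y0, x1, y1) pixels.

    Assumes pred["xy"] == (x_center, y_center) and pred["hw"] == (height, width),
    each normalized to [0, 1] relative to the original image.
    """
    cx, cy = pred["xy"]
    h, w = pred["hw"]
    # Scale normalized values to pixel units.
    cx_px, cy_px = cx * image_width, cy * image_height
    w_px, h_px = w * image_width, h * image_height
    # Clamp the corners so the box stays inside the image.
    x0 = max(0.0, cx_px - w_px / 2)
    y0 = max(0.0, cy_px - h_px / 2)
    x1 = min(float(image_width), cx_px + w_px / 2)
    y1 = min(float(image_height), cy_px + h_px / 2)
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    example = {"xy": (0.5, 0.5), "hw": (0.5, 0.25)}  # made-up prediction
    print(to_pixel_box(example, image_width=200, image_height=100))
```

If the real output stores these fields differently (for example as nested dicts), only the two unpacking lines need to change.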
Use the gigarouter OpenAI-compatible endpoint with your API key, passing an image and a text query to generate masks.
Falcon Perception is currently being benchmarked and onboarded as a hosted, OpenAI-compatible API.