GTA1 7B

Salesforce/GTA1-7B

published Oct 2025 · updated Oct 2025

GTA1 7B is a vision-language model optimized for GUI grounding and agent tasks, using reinforcement learning (GRPO) to produce accurate click predictions from screenshots.

est. price

~$1.341

/ 1k images · estimated, set at launch

API providers

downloads / mo

1.3K

license

mit

specs

Task	GUI Grounding & Agent
Architecture	Qwen2-VL-based
Parameters	7B

about this model

GTA1-7B is a vision-language model for GUI grounding and agent task execution, trained using reinforcement learning (GRPO) to optimize for successful clicks on interface elements rather than verbose chain-of-thought reasoning. It is part of the GTA1 family that achieves state-of-the-art results on both grounding and agent benchmarks.

Grounding Performance

Evaluated on three standard benchmarks, GTA1 (7B) outperforms all open-source models at its size and competes with much larger proprietary systems:

Model	ScreenSpot-V2	ScreenSpotPro	OSWORLD-G	OSWORLD-G-Refined
GTA1 (7B)	93.4	55.5	60.1	68.8
OpenCUA (7B)	92.3	50.0	55.3	68.3
UI-TARS-1.5* (7B)	89.7	42.0	52.8	64.2

On the ScreenSpotPro leaderboard (applications under 12B parameters), GTA1-7B achieves a micro-average of 55.5 and is ranked #10 overall. Per-application results include: Android Studio macOS 48.8, AutoCAD Windows 41.2, Blender Windows 57.7, DaVinci Resolve macOS 56.8.

Agent Task Execution

When paired with a planner (e.g., o3 or GPT-5), GTA1-7B achieves strong results on end-to-end agent benchmarks:

Agent model	OSWorld	OSWorld-Verified
GTA1-7B-2507 w/ o3	45.2	53.1
GTA1-7B-2507 w/ GPT-5	—	61.0
UI-TARS-1.5-7B	26.9	27.4

On WindowsAgentArena, GTA1-7B-2507 with o3 achieves a 47.9% success rate (100 steps), and with GPT-5 reaches 49.2%.

The model is hosted as a managed, OpenAI-compatible API on gigarouter, requiring no local installation or infrastructure.

best for

·Automated UI testing with precise click targeting
·GUI agent task execution on desktop and web environments

FAQ

What input format does GTA1 7B expect?

The model takes an image (screenshot) and a text instruction, such as "click start". It outputs a predicted click coordinate (x, y).

How is GTA1 7B trained?

It is fine-tuned using GRPO (Group Relative Policy Optimization), which rewards any click within the target bounding box rather than forcing exact center prediction.

How can I call GTA1 7B via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, passing an image and instruction in the standard chat completion format.

not yet live

We're benchmarking and onboarding GTA1 7B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit