skip to content
gigarouter gigarouter
models / vision-language · coming soon

GTA1 7B

Salesforce/GTA1-7B

published Oct 2025 · updated Oct 2025

GTA1 7B is a vision-language model optimized for GUI grounding and agent tasks, using reinforcement learning (GRPO) to produce accurate click predictions from screenshots.

est. price
~$1.341
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
1.3K
license
mit

specs

TaskGUI Grounding & Agent
ArchitectureQwen2-VL-based
Parameters7B

about this model

GTA1-7B is a vision-language model for GUI grounding and agent task execution, trained using reinforcement learning (GRPO) to optimize for successful clicks on interface elements rather than verbose chain-of-thought reasoning. It is part of the GTA1 family that achieves state-of-the-art results on both grounding and agent benchmarks.

Grounding Performance

Evaluated on three standard benchmarks, GTA1 (7B) outperforms all open-source models at its size and competes with much larger proprietary systems:

ModelScreenSpot-V2ScreenSpotProOSWORLD-GOSWORLD-G-Refined
GTA1 (7B)93.455.560.168.8
OpenCUA (7B)92.350.055.368.3
UI-TARS-1.5* (7B)89.742.052.864.2

On the ScreenSpotPro leaderboard (applications under 12B parameters), GTA1-7B achieves a micro-average of 55.5 and is ranked #10 overall. Per-application results include: Android Studio macOS 48.8, AutoCAD Windows 41.2, Blender Windows 57.7, DaVinci Resolve macOS 56.8.

Agent Task Execution

When paired with a planner (e.g., o3 or GPT-5), GTA1-7B achieves strong results on end-to-end agent benchmarks:

Agent modelOSWorldOSWorld-Verified
GTA1-7B-2507 w/ o345.253.1
GTA1-7B-2507 w/ GPT-561.0
UI-TARS-1.5-7B26.927.4

On WindowsAgentArena, GTA1-7B-2507 with o3 achieves a 47.9% success rate (100 steps), and with GPT-5 reaches 49.2%.

The model is hosted as a managed, OpenAI-compatible API on gigarouter, requiring no local installation or infrastructure.

best for

FAQ

What input format does GTA1 7B expect?

The model takes an image (screenshot) and a text instruction, such as "click start". It outputs a predicted click coordinate (x, y).

How is GTA1 7B trained?

It is fine-tuned using GRPO (Group Relative Policy Optimization), which rewards any click within the target bounding box rather than forcing exact center prediction.

How can I call GTA1 7B via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, passing an image and instruction in the standard chat completion format.

not yet live

We're benchmarking and onboarding GTA1 7B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →