Question 1

What is UI-TARS 2B SFT best used for?

Accepted Answer

It is best for automating GUI interactions by perceiving screenshots and performing human-like actions like clicking and typing on mobile, desktop, and web interfaces.

Question 2

How does UI-TARS 2B SFT compare to larger models in the family?

Accepted Answer

UI-TARS 2B SFT is the smallest model in the UI-TARS family. It achieves competitive performance on perception and grounding benchmarks (e.g., 82.3 on ScreenSpot, 27.7 on ScreenSpot Pro) while being faster and more lightweight than the 7B and 72B variants.

Question 3

What are the input and output formats for this model?

Accepted Answer

The model takes a screenshot image as input and outputs text describing the action to take (e.g., click, type) along with coordinates for grounding. It uses absolute coordinates based on the Qwen 2.5VL architecture.

Question 4

How can I call UI-TARS 2B SFT via the gigarouter API?

Accepted Answer

Use the gigarouter OpenAI-compatible endpoint with your API key. Send a request with the screenshot image and a prompt describing the task, and the model will return the predicted action and coordinates.

Question 5

What are the key innovations of UI-TARS compared to other GUI agents?

Accepted Answer

UI-TARS integrates perception, reasoning, grounding, and memory into a single VLM. Key innovations include enhanced perception via large-scale GUI data, unified action modeling across platforms, System-2 reasoning (task decomposition, reflection), and iterative training with reflective online traces on virtual machines.

Task	GUI agent (perception, grounding, and action)
Architecture	Vision-Language Model (VLM)
Parameters	2B
License	Unknown

Benchmark	Metric	Score
Multimodal Mind2Web (Cross-Task)	Element Accuracy	62.3
	Operation F1	90.0
	Step Success Rate	56.3
OSWorld (50 steps)	Success Rate	24.6
OSWorld (15 steps)	Success Rate	22.7
AndroidWorld	Success Rate	46.6

UI-TARS 2B SFT

specs

about this model

Key Capabilities and Performance

Benchmark Results

Perception

Grounding

Offline Agent Task Execution

Innovations

best for

FAQ

related vision-language models