The two modules form a feedback cycle:
- Exploration → Refinement: Hardness-driven search supplies challenging trajectories for validation
- Refinement → Exploration: Misalignment signals are converted into hardness rewards that guide future exploration
This closed loop progressively enhances both the diversity (coverage of semantically ambiguous actions) and the fidelity (instruction–execution alignment) of the synthesized data.
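The feedback cycle above can be sketched in a few lines of Python. Everything here is a hypothetical illustration, not the actual system: the action set, the hardness-update rule (an exponential moving average), and the `refine` stub standing in for instruction–execution validation are all assumptions made for the sketch.

```python
import random

def explore(actions, hardness, k, rng):
    """Exploration -> Refinement: sample k candidate trajectories,
    biased toward actions with high hardness reward (assumed scheme)."""
    weights = [hardness[a] for a in actions]
    return rng.choices(actions, weights=weights, k=k)

def refine(action, rng):
    """Stand-in validator: returns a misalignment score in [0, 1].
    A real system would check instruction-execution alignment here."""
    return rng.random()

def feedback_loop(actions, rounds=3, k=5, seed=0):
    rng = random.Random(seed)
    hardness = {a: 1.0 for a in actions}  # uniform prior over actions
    for _ in range(rounds):
        batch = explore(actions, hardness, k, rng)
        for action in batch:
            # Refinement -> Exploration: misalignment signal is
            # converted into an updated hardness reward (EMA update).
            misalignment = refine(action, rng)
            hardness[action] = 0.5 * hardness[action] + 0.5 * (1.0 + misalignment)
    return hardness

hardness = feedback_loop(["click", "scroll", "type"])
print({a: round(h, 2) for a, h in hardness.items()})
```

Under this update rule, frequently misaligned actions accumulate higher hardness and are therefore sampled more often in later rounds, which is the closed-loop behavior the text describes.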