The two modules form a feedback cycle:
- Exploration → Refinement: Hardness-driven search supplies challenging trajectories for validation
- Refinement → Exploration: Misalignment signals are converted into hardness rewards that guide future exploration
This closed loop progressively enhances both the diversity (coverage of semantically ambiguous actions) and the fidelity (instruction–execution alignment) of the synthesized data.
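The feedback cycle above can be sketched in a few lines of Python. Everything here is a hypothetical illustration, not the actual system: the action set, the hardness-update rule (an exponential moving average), and the `refine` stub standing in for instruction–execution validation are all assumptions made for the sketch.

```python
import random

def explore(actions, hardness, k, rng):
    """Exploration -> Refinement: sample k candidate trajectories,
    biased toward actions with high hardness reward (assumed scheme)."""
    weights = [hardness[a] for a in actions]
    return rng.choices(actions, weights=weights, k=k)

def refine(action, rng):
    """Stand-in validator: returns a misalignment score in [0, 1].
    A real system would check instruction-execution alignment here."""
    return rng.random()

def feedback_loop(actions, rounds=3, k=5, seed=0):
    rng = random.Random(seed)
    hardness = {a: 1.0 for a in actions}  # uniform prior over actions
    for _ in range(rounds):
        batch = explore(actions, hardness, k, rng)
        for action in batch:
            # Refinement -> Exploration: misalignment signal is
            # converted into an updated hardness reward (EMA update).
            misalignment = refine(action, rng)
            hardness[action] = 0.5 * hardness[action] + 0.5 * (1.0 + misalignment)
    return hardness

hardness = feedback_loop(["click", "scroll", "type"])
print({a: round(h, 2) for a, h in hardness.items()})
```

Under this update rule, frequently misaligned actions accumulate higher hardness and are therefore sampled more often in later rounds, which is the closed-loop behavior the text describes.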