WARC-Bench: How Small Language Models Can Be Better Executors than Giants

Research Paper: WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions, accepted at ICLR 2026

Key Innovation: We introduce a novel GUI benchmark that challenges frontier models, and we show how state-of-the-art Reinforcement Learning with Verifiable Rewards (RLVR) training can unlock frontier-level performance in compact 7B and 72B models.

The Cost-Accuracy Revolution

Conventional wisdom in AI development holds that tackling complex tasks requires massive models with hundreds of billions of parameters. These behemoths deliver impressive results but come with substantial costs in computational resources, inference latency, and deployment complexity. Our work challenges this narrative by demonstrating that small language models (SLMs), applied to the right problem and trained with the right approach, can compete with, and in some cases outperform, much larger alternatives. To show this, we developed a new GUI navigation benchmark: WARC-Bench.

Introducing WARC-Bench: A New Standard for GUI Subtasks

WARC-Bench is a comprehensive benchmark featuring 438 carefully designed tasks that evaluate AI agents on GUI subtasks: short-horizon interactions with user interface components that form the building blocks of complex web automation. Subtasks include actions like selecting dates in a calendar picker, navigating dropdown menus, scrolling through containers to extract information, or filling out multi-step forms. As interactive environments for the benchmark, we use web archive files, which are high-fidelity snapshots of real websites.

[Figure: WARC-Bench GUI benchmarks]

WARC-Bench addresses the following gaps left by other GUI benchmarks:

  • Real-world complexity: Uses Web Archive (WARC) files to create sandboxed, interactive environments with realistic websites including GitHub, Zendesk, Google Earth, and custom-built synthetic pages
  • Programmatic evaluation: Automatic verification through reward functions eliminates subjective assessment of task completion
  • Subtask focus: Targets the critical middle layer between basic UI actions and full task completion, a capability gap in existing benchmarks such as MiniWoB++[1], WebArena[2], and OSWorld[3]
  • Dynamic interactions: Evaluates agents on real-time manipulation of complex UI widgets, not just static visual understanding, which is a capability gap in benchmarks such as Mind2Web[4] and OmniACT[5]
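
As a rough illustration of what programmatic evaluation means here, the sketch below shows a binary reward check for a date-picker subtask. The `PageState` class, the `checkin` field id, and the function name are hypothetical stand-ins chosen for this example; they are not the benchmark's actual reward functions, which are not shown in this post.

```python
from dataclasses import dataclass, field


@dataclass
class PageState:
    """Hypothetical snapshot of the form fields read from the page after an episode."""
    field_values: dict[str, str] = field(default_factory=dict)


def date_selected_reward(state: PageState, field_id: str, expected_iso_date: str) -> float:
    """Return 1.0 if the named field holds the expected date, otherwise 0.0."""
    return 1.0 if state.field_values.get(field_id) == expected_iso_date else 0.0


# Example: the agent was asked to pick 14 March 2025 in a (hypothetical) "checkin" field.
final_state = PageState(field_values={"checkin": "2025-03-14"})
reward = date_selected_reward(final_state, "checkin", "2025-03-14")
```

Because the check reads the final page state rather than a screenshot or a human judgment, the same episode always receives the same score.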

Even the most advanced frontier models find WARC-Bench challenging. On the test split, Claude Sonnet 4.0 achieves the highest success rate at 64.8%, while OpenAI’s GPT-5 scores 51.3%, leaving substantial room for improvement and demonstrating the benchmark’s ability to differentiate capabilities at the cutting edge.

The Power of Small Models: ActIO-UI Model Family

We trained ActIO-UI models in two sizes, 7B and 72B parameters, starting from the base Qwen2.5-VL models and applying Supervised Fine-Tuning (SFT) followed by Reinforcement Learning. We report results on two splits of the benchmark: development and test. The development set was used to analyze erroneous behaviors of trained models to guide hyperparameter search, while the test set was completely held out.

| Model | Parameters | WARC-Bench Dev Success Rate | WARC-Bench Test Success Rate | Deployment Profile |
|---|---|---|---|---|
| Claude Sonnet 4.0 | unknown | 83.61% | 64.8% | API-only, high cost |
| GPT-5 | unknown | 69.89% | 51.3% | API-only, high cost |
| Base Qwen2.5-VL-7B | 7B | 15.54% | 4.7% | Standard GPU (16GB VRAM), low cost |
| ActIO-UI-7B-SFT | 7B | 66.54% | 27.3% | Standard GPU (16GB VRAM), low cost |
| ActIO-UI-7B-RLVR | 7B | 72.13% | 29.17% | Standard GPU (16GB VRAM), low cost |
| Base Qwen2.5-VL-72B | 72B | 61.66% | 37.3% | Multi-GPU setup, medium cost |
| ActIO-UI-72B-SFT | 72B | 75.88% | 48.3% | Multi-GPU setup, medium cost |
| ActIO-UI-72B-RLVR | 72B | 84.31% | 52.8% | Multi-GPU setup, medium cost |

For most practical applications, the 7B variant offers the best balance between capability and accessibility because it can be deployed on standard hardware with just 16GB of VRAM. Organizations can run these models on commodity hardware instead of expensive multi-GPU server configurations.
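
A back-of-the-envelope estimate helps show why the two sizes land in different hardware classes. The sketch below assumes nominal parameter counts (7B and 72B) and 16-bit weights at 2 bytes per parameter. Both are assumptions made for illustration. It counts weights only and ignores activations, KV cache, and the vision encoder, so real memory use will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Nominal sizes; actual checkpoints may contain somewhat more parameters.
for name, params in [("7B", 7e9), ("72B", 72e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights at 16-bit precision")
```

Under these assumptions the 7B weights come to about 14 GB, while the 72B weights come to about 144 GB, which is why the larger model needs a multi-GPU setup.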

RLVR: The Secret Weapon for Small Model Excellence

Our models’ strong performance is based on Reinforcement Learning with Verifiable Rewards (RLVR), a training technique that has proven effective for developing reasoning-capable AI systems. Unlike traditional supervised learning that requires expensive human annotations, RLVR uses automatically verifiable outcomes to guide model improvement.

How RLVR Works

  1. Verifiable Outcomes: For GUI tasks, we can deterministically verify whether an agent successfully completed a subtask (e.g., did the correct date get selected? did the form submit properly?)
  2. Exploration and Learning: The model explores different interaction strategies and receives immediate feedback based on task completion
  3. Iterative Refinement: Through repeated trials, the model learns optimal strategies without requiring human labelers to evaluate every attempt
  4. Synthetic Data Scaling: RLVR enables training on large-scale synthetic environments, dramatically expanding the diversity of scenarios the model encounters
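
The loop described above can be sketched in a few lines. This is a toy illustration rather than our training code: `toy_policy` stands in for the model, `toy_verifier` stands in for a task's reward function, and the action-string format is invented for the example.

```python
import random
from typing import Callable

Trajectory = list[str]  # hypothetical action strings, e.g. "click(day=14)"


def toy_policy(task: str, rng: random.Random) -> Trajectory:
    """Stand-in for a model: clicks a random day in a 31-day calendar."""
    return [f"click(day={rng.randint(1, 31)})"]


def toy_verifier(task: str, trajectory: Trajectory) -> float:
    """Deterministic check: 1.0 only if the last action hit the target day."""
    target_day = task.rsplit(" ", 1)[-1]  # the toy task text ends with the day number
    return 1.0 if trajectory[-1] == f"click(day={target_day})" else 0.0


def collect_rollouts(
    policy: Callable[[str, random.Random], Trajectory],
    verifier: Callable[[str, Trajectory], float],
    task: str,
    num_rollouts: int,
    seed: int = 0,
) -> list[tuple[Trajectory, float]]:
    """Sample several attempts at one task and score each with the verifier."""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(num_rollouts):
        trajectory = policy(task, rng)
        rollouts.append((trajectory, verifier(task, trajectory)))
    return rollouts


rollouts = collect_rollouts(toy_policy, toy_verifier, "select day 14", num_rollouts=8)
success_rate = sum(reward for _, reward in rollouts) / len(rollouts)
```

No human labeler appears anywhere in the loop. The scored rollouts are what a policy-gradient update would then consume.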

Beyond Parameter Count: What Really Drives Performance

Our research reveals several crucial insights about what makes smaller models competitive with larger alternatives:

1. Task Specialization Enables Efficiency

By focusing our models specifically on GUI subtasks rather than trying to be generalists, we achieve exceptional performance in our target domain. This specialization allows smaller models to allocate their capacity efficiently.

2. Two-Stage State-of-the-Art Training Pipeline

Our success stems from a carefully orchestrated two-stage training approach that combines the strengths of multiple techniques:

Stage 1: Supervised Fine-Tuning (SFT)

We begin by distilling knowledge from strong frontier models like Claude, creating high-quality demonstrations of GUI subtask execution. This SFT phase establishes a solid foundation, substantially improving the base Qwen2.5-VL models on the test split:

  • 7B model: 4.67% → 27.33% success rate
  • 72B model: 37.33% → 48.33% success rate

Stage 2: RLVR Enhancement

Building on the SFT checkpoints, we applied agentic RLVR training using the GRPO (Group Relative Policy Optimization) algorithm. Even with only 1,059 training tasks, RLVR yields a further gain of +4.5 percentage points over SFT on the 72B model (48.3% → 52.8% on the test split).
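
For readers unfamiliar with GRPO, its central idea is to score each rollout relative to the other rollouts sampled for the same task, instead of relying on a learned value model. The sketch below shows the commonly described group normalization (reward minus group mean, divided by group standard deviation). The exact variant and the `eps` constant are assumptions, since the formulation used for ActIO-UI is not spelled out in this post.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task with binary verifiable rewards: two succeeded, two failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts get a positive advantage and failed ones a negative advantage. When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.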

Using strong teacher models to help a student model mimic agentic behavior allowed us to train strong baselines and narrow the gap between SLM and frontier-model performance. On top of that, RLVR allows smaller models to encounter diverse scenarios without requiring large-scale human annotation.

We find that RLVR models improve over their SFT counterparts on the following fronts:

  • Enhanced visual grounding: Greater precision in identifying small interface elements like calendar dates and nested menu items
  • Better exploration: Increased use of scrolling actions and improved contextual awareness
  • Improved efficiency: Fewer overall actions, reduced redundant clicks, and more direct paths to task completion
  • Dynamic task mastery: Substantial gains in form filling, menu navigation, table manipulation, and date picker interactions

Real-World Deployment Advantages

The practical benefits of our SLM approach extend far beyond benchmark scores:

Cost Efficiency

  • Infrastructure: 7B models run on consumer-grade GPUs ($500-2000 hardware) vs. enterprise server requirements for 200B+ models ($50,000+)
  • Inference cost: Roughly 7x relative cost for 7B models vs. 70x+ for frontier-scale models, an operational saving of 10x or more
  • Energy consumption: Dramatically lower power requirements enable green AI deployment

Latency and User Experience

  • Response time: ~250ms for 7B models vs. 2.5s+ for 70B+ models—critical for interactive applications like GUI or voice agents
  • Throughput: Higher tokens per second enables better user experiences in production
  • Real-time interaction: Low latency makes GUI automation feel natural and responsive

Operational Flexibility

  • On-premise deployment: Smaller models enable air-gapped, secure environments without cloud dependencies
  • Edge computing: Deploy closer to end users for lower latency and better data privacy
  • Easier maintenance: Simpler model serving infrastructure reduces operational complexity

Conclusion: Smarter Training Beats Bigger Models

WARC-Bench and our ActIO-UI models demonstrate that with state-of-the-art training techniques like RLVR, small language models can achieve frontier-level performance at a fraction of the cost.

The future of AI isn’t just about scale; it’s about smart specialization, innovative training methods, and finding the optimal balance between capability and practicality. As we continue to refine RLVR techniques and expand WARC-Bench with more diverse tasks, we’re excited to see how the community pushes these boundaries further. The era of accessible, high-performance AI automation has arrived, and it’s smaller than we think.

Research conducted by the Orby AI team, a Uniphore company. For inquiries, contact the research team at Uniphore.