Top Rankings
Cross-model comparison (OpenClaw framework) — sorted by overall score
What we found
Model capability dominates framework design
Across four frameworks and five models, model choice accounts for a 15.4% range in performance, while framework design accounts for only a 6.8% range. Choosing the right model matters roughly twice as much as choosing the right framework.
Belief revision depends on update design, not volume
Concentrated, targeted updates cause score drops of 0.28–0.36, while updates distributed across 12 scenarios shift scores by only +1.7%. How updates are structured matters far more than how many there are.
Aggregate scores mask divergent failures
Two configurations that both score 0.833 on the same question fail on structurally opposite options: one misses a genuine conflict, the other over-flags a non-conflict. Per-option diagnostics are essential.
Three coupled challenges
Real information environments are multi-source, dynamic, and personalized. ClawArena evaluates all three jointly.
14-category taxonomy
7 dimension combinations (every non-empty subset of the three dimensions) × 2 question types (Recall, Reasoning) yield 14 categories, preventing systems from scoring well by solving only one dimension.
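The 7 × 2 = 14 arithmetic can be sketched as follows — a minimal illustration assuming the three dimensions are the multi-source, dynamic, and personalized challenges named above; the joined category labels are hypothetical, not ClawArena's official names.

```python
from itertools import combinations

# Assumed dimension names, taken from the "three coupled challenges" above.
DIMENSIONS = ["multi-source", "dynamic", "personalized"]
TYPES = ["Recall", "Reasoning"]

# Every non-empty subset of the three dimensions: 2^3 - 1 = 7 combinations.
combos = [c for r in range(1, len(DIMENSIONS) + 1)
          for c in combinations(DIMENSIONS, r)]

# Cross each combination with the two question types: 7 x 2 = 14 categories.
categories = [("+".join(combo), qtype) for combo in combos for qtype in TYPES]

print(len(combos))      # 7
print(len(categories))  # 14
```

Because every combination appears with both question types, a system that handles only one dimension (or only recall) can at best cover a small slice of the 14 cells.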
Executable checks
Shell-based verification of workspace file state. Agents must produce working artifacts, not just text answers.
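A minimal sketch of what a shell-based workspace check could look like, assuming a pass/fail convention on exit codes; the `check` helper, file names, and command strings are illustrative, not ClawArena's actual harness.

```python
import subprocess
import tempfile
from pathlib import Path

def check(workspace: Path, shell_cmd: str) -> bool:
    """Run a shell command inside the workspace; the check passes
    iff the command exits 0. (Hypothetical helper for illustration.)"""
    result = subprocess.run(shell_cmd, shell=True, cwd=workspace,
                            capture_output=True, text=True)
    return result.returncode == 0

# Demo: suppose the agent was asked to produce a report file.
ws = Path(tempfile.mkdtemp())
(ws / "report.md").write_text("# Summary\nAll sources reconciled.\n")

print(check(ws, "test -f report.md"))               # artifact exists -> True
print(check(ws, "grep -q '^# Summary' report.md"))  # has required section -> True
print(check(ws, "test -f missing.csv"))             # missing artifact -> False
```

Checking the files an agent actually wrote, rather than grading its prose, is what makes the "working artifacts, not just text answers" requirement enforceable.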
6-layer specifications
Hidden ground truth (L0) is never shown to agents. Observable layers are noisy, partial reflections of the same underlying reality.
