Top Rankings
Cross-model comparison (OpenClaw framework) — sorted by overall score
What we found
Model capability dominates framework design
Across four frameworks and five models, model choice accounts for a 15.4% range in performance, while framework design accounts for only a 6.8% range. Choosing the right model matters roughly twice as much as choosing the right framework.
Belief revision depends on update design, not volume
Concentrated, targeted updates cause score drops of 0.28–0.36, while updates distributed across 12 scenarios shift scores by only +1.7%. How updates are structured matters far more than how many there are.
Aggregate scores mask divergent failures
Two configurations that both score 0.833 on the same question fail on structurally opposite options: one misses a genuine conflict, the other over-flags a non-conflict. Per-option diagnostics are essential.
Three coupled challenges
Real information environments are multi-source, dynamic, and personalized. ClawArena evaluates all three jointly.
14-category taxonomy
7 dimension combinations (every non-empty subset of the three dimensions) × 2 question types (Recall, Reasoning) yield 14 categories, preventing systems from scoring well by solving only one dimension.
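The 7 × 2 = 14 arithmetic can be sketched as follows — a minimal illustration assuming the three dimensions are the multi-source, dynamic, and personalized challenges named above; the joined category labels are hypothetical, not ClawArena's official names.

```python
from itertools import combinations

# Assumed dimension names, taken from the "three coupled challenges" above.
DIMENSIONS = ["multi-source", "dynamic", "personalized"]
TYPES = ["Recall", "Reasoning"]

# Every non-empty subset of the three dimensions: 2^3 - 1 = 7 combinations.
combos = [c for r in range(1, len(DIMENSIONS) + 1)
          for c in combinations(DIMENSIONS, r)]

# Cross each combination with the two question types: 7 x 2 = 14 categories.
categories = [("+".join(combo), qtype) for combo in combos for qtype in TYPES]

print(len(combos))      # 7
print(len(categories))  # 14
```

Because every combination appears with both question types, a system that handles only one dimension (or only recall) can at best cover a small slice of the 14 cells.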
Executable checks
Shell-based verification of workspace file state. Agents must produce working artifacts, not just text answers.
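A minimal sketch of what a shell-based workspace check could look like, assuming a pass/fail convention on exit codes; the `check` helper, file names, and command strings are illustrative, not ClawArena's actual harness.

```python
import subprocess
import tempfile
from pathlib import Path

def check(workspace: Path, shell_cmd: str) -> bool:
    """Run a shell command inside the workspace; the check passes
    iff the command exits 0. (Hypothetical helper for illustration.)"""
    result = subprocess.run(shell_cmd, shell=True, cwd=workspace,
                            capture_output=True, text=True)
    return result.returncode == 0

# Demo: suppose the agent was asked to produce a report file.
ws = Path(tempfile.mkdtemp())
(ws / "report.md").write_text("# Summary\nAll sources reconciled.\n")

print(check(ws, "test -f report.md"))               # artifact exists -> True
print(check(ws, "grep -q '^# Summary' report.md"))  # has required section -> True
print(check(ws, "test -f missing.csv"))             # missing artifact -> False
```

Checking the files an agent actually wrote, rather than grading its prose, is what makes the "working artifacts, not just text answers" requirement enforceable.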
6-layer specifications
Hidden ground truth (L0) is never shown to agents. Observable layers are noisy, partial reflections of the same underlying reality.
