ClawEval

Composable eval runner for long-running, tool-using AI assistants. Evaluations are organized as a hierarchy: suite > episode > step > check.

Eval results (2026-03-28)

From reports/matrix/run_20260328_133737/_matrix_summary.md:

Rank	Agent	Pass Rate	Passed/Total	Avg Duration
1	hermes	94.4%	17/18	23044ms
2	openclaw	88.9%	16/18	30018ms
3	nanobot	83.3%	15/18	20805ms
4	openai_direct	83.3%	15/18	56629ms
5	picoclaw	83.3%	15/18	64774ms
6	ironclaw	66.7%	12/18	27207ms

All agents were evaluated against qwen/qwen3.5-397b-a17b via OpenRouter.

Per-suite breakdown

Suite	hermes	openclaw	nanobot	openai_direct	picoclaw	ironclaw
matrix_basic	4/4	4/4	4/4	4/4	4/4	4/4
matrix_memory	3/3	3/3	3/3	3/3	3/3	2/3
matrix_structured	3/3	3/3	2/3	3/3	3/3	2/3
matrix_longhorizon_reliability	2/2	1/2	2/2	2/2	2/2	1/2
matrix_async_simulated	3/3	3/3	3/3	2/3	2/3	2/3
matrix_tools_grounding	2/3	2/3	1/3	1/3	1/3	1/3

Eval suite descriptions

matrix_basic (4 episodes): single-turn reliability checks (PONG, echo, factual answer, non-refusal).
matrix_memory (3 episodes): multi-turn memory and state retention across a session.
matrix_reasoning (4 episodes): arithmetic, logic, word problems, plus one llm_judge quality check. Not listed in the per-suite breakdown above.
matrix_structured (3 episodes): structured output checks for exact JSON, schema compliance, and JSON array format.
matrix_async_simulated (3 episodes): exec probes, oracle event writes, and polling for dynamic readiness.
matrix_tools_grounding (3 episodes): oracle HTTP probes, deterministic fetch, and conflict resolution with uncertainty.
matrix_longhorizon_reliability (2 episodes): ten-turn secret retention and counter state drift across long sessions.

Features

Backends: OpenClaw, HTTP/OpenAI, generic command
Checks: contains, regex, equals_trim, json_pointer_equals, json_pointer_exists, latency_under_ms, within_time_window_ms, status_code_equals, eventually_contains, llm_judge
Step types: user, sleep, note, exec, http_probe, poll
Parallel execution, pass-rate stats with Wilson 95% CI
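For readers unfamiliar with the Wilson interval, here is a minimal sketch of how a 95% confidence interval on a pass rate can be computed. The function name `wilson_interval` and the choice of z = 1.96 are assumptions for illustration; this is the standard Wilson score formula, not necessarily the code ClawEval uses internally.

```rust
/// Wilson score interval for a binomial proportion.
/// Returns (lower, upper) bounds as fractions in [0, 1].
/// `wilson_interval` is a hypothetical name, not ClawEval's API.
fn wilson_interval(passed: u32, total: u32, z: f64) -> (f64, f64) {
    // With no trials there is no information, so return the widest interval.
    if total == 0 {
        return (0.0, 1.0);
    }
    let n = total as f64;
    let p = passed as f64 / n;
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    // Center is pulled toward 0.5 relative to the raw pass rate.
    let center = (p + z2 / (2.0 * n)) / denom;
    let half = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;
    ((center - half).max(0.0), (center + half).min(1.0))
}

fn main() {
    // 17 of 18 episodes passed, as in the hermes row of the results table.
    let (lo, hi) = wilson_interval(17, 18, 1.96);
    println!("17/18: [{:.1}%, {:.1}%]", lo * 100.0, hi * 100.0);
}
```

With only 18 episodes per agent the interval is wide, so small differences in pass rate between agents should be read with caution.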

Install

cargo build --release

Run

claweval run suites/openclaw_basic.json --out reports/report.json

# all options
claweval run suites/openclaw_basic.json \
  --local --profile claweval --repeats 3 --enable-llm-judge \
  --out reports/report.json

Suite format

See suites/ for examples. Each episode contains steps (for example user messages and sleeps) and checks (assertions on responses).
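The suite, episode, step, and check relationship can be pictured with a small data-structure sketch. All type and field names below are hypothetical and do not mirror the actual schema in suites/ or the types in src/. Only two step types and two checks are modeled, and the assumed semantics of `contains` and `equals_trim` are a guess from their names.

```rust
// Hypothetical types: names and fields are assumptions for illustration only.
#[allow(dead_code)]
#[derive(Debug)]
enum Step {
    User(String),
    Sleep { ms: u64 },
}

#[allow(dead_code)]
#[derive(Debug)]
enum Check {
    Contains(String),
    EqualsTrim(String),
}

impl Check {
    /// Assumed semantics: substring match, or equality after trimming whitespace.
    fn passes(&self, response: &str) -> bool {
        match self {
            Check::Contains(needle) => response.contains(needle.as_str()),
            Check::EqualsTrim(expected) => response.trim() == expected.trim(),
        }
    }
}

#[allow(dead_code)]
struct Episode {
    id: String,
    steps: Vec<Step>,
    checks: Vec<Check>,
}

#[allow(dead_code)]
struct Suite {
    name: String,
    episodes: Vec<Episode>,
}

/// An episode passes only if every check passes on the response.
fn episode_passes(episode: &Episode, response: &str) -> bool {
    episode.checks.iter().all(|c| c.passes(response))
}

fn main() {
    let suite = Suite {
        name: "example".to_string(),
        episodes: vec![Episode {
            id: "pong".to_string(),
            steps: vec![Step::User("Reply with exactly PONG".to_string())],
            checks: vec![Check::EqualsTrim("PONG".to_string())],
        }],
    };
    // Canned response standing in for a real backend call.
    let response = "PONG\n";
    for ep in &suite.episodes {
        let verdict = if episode_passes(ep, response) { "pass" } else { "fail" };
        println!("{}: {}", ep.id, verdict);
    }
}
```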

Custom backends

Use the command backend to evaluate any executable without writing Rust:

{
  "type": "command",
  "command": "myagent",
  "args": ["--session", "{session}", "--message", "{message}"]
}

Or implement AgentBackend in src/backend/mod.rs.
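To illustrate how the `{session}` and `{message}` placeholders in the config above might be expanded before the executable is launched, here is a minimal sketch. The functions `render_args` and `run_command_backend` are hypothetical, and treating trimmed stdout as the agent's reply is an assumption; the real command backend and the `AgentBackend` trait may work differently.

```rust
use std::process::Command;

/// Replace `{session}` and `{message}` in each argument template.
/// Hypothetical helper, not part of ClawEval's API.
/// Caveat: a session value that itself contains "{message}" would be
/// expanded again by the second replace.
fn render_args(templates: &[&str], session: &str, message: &str) -> Vec<String> {
    templates
        .iter()
        .map(|t| t.replace("{session}", session).replace("{message}", message))
        .collect()
}

/// Run the executable once and return its trimmed stdout.
/// Assumption: the agent prints its reply to stdout.
#[allow(dead_code)]
fn run_command_backend(
    command: &str,
    templates: &[&str],
    session: &str,
    message: &str,
) -> std::io::Result<String> {
    let output = Command::new(command)
        .args(render_args(templates, session, message))
        .output()?;
    Ok(String::from_utf8_lossy(&output.stdout).trim().to_string())
}

fn main() {
    // Same argument templates as the JSON config above.
    let templates = ["--session", "{session}", "--message", "{message}"];
    let args = render_args(&templates, "s1", "ping");
    println!("{:?}", args);
}
```

Passing each argument separately to `Command::args` avoids shell quoting problems when a message contains spaces or special characters.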
