Objective
Add an autoresearch mode to agentv-bench that runs the eval-improve loop unattended: evaluate → analyze → keep/drop → mutate → repeat. This turns agentv-bench from a human-directed optimization tool into one that can also run autonomously overnight.
Design Latitude
What changes in agentv-bench:
- Step 5 (Improve) dispatches the `mutator` subagent instead of waiting for human input
- Human checkpoints at iterations 3/6/9 are skipped
- Hill-climbing ratchet enforced: track the `best` version explicitly; mutation always reads from best
- Convergence detection: stop after N consecutive cycles with no improvement (default 3)
- Session state persisted for resumability
State files (in `.agentv/autoresearch/{session}/`):

| File | Purpose | Mutable? |
|---|---|---|
| `original.md` | Snapshot of artifact before first mutation | No |
| `best.md` | Current best-scoring version | Yes (on KEEP) |
| `state.json` | `{best_score, cycle, best_cycle, convergence_count}` | Yes |
| `results.jsonl` | One line per cycle: score, per-assertion breakdown, timestamp | Append-only |
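As a minimal sketch of how that state might be persisted (file names from the table above; the helper functions are hypothetical, not part of agentv-bench):

```python
import json
from pathlib import Path


def load_state(session_dir: Path) -> dict:
    """Read state.json, or initialize defaults for a fresh session."""
    path = session_dir / "state.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"best_score": 0.0, "cycle": 0, "best_cycle": 0, "convergence_count": 0}


def save_state(session_dir: Path, state: dict) -> None:
    """Write state via a temp file and rename, so a crash mid-write
    cannot leave a half-written state.json behind."""
    tmp = session_dir / "state.json.tmp"
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(session_dir / "state.json")
```

The write-then-rename step matters here because a fresh agent resuming the session must never see a truncated `state.json`.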
Activation: triggered when the user says "run autoresearch on this skill" or similar. Could also support an `autoresearch:` section in EVAL.yaml:

```yaml
autoresearch:
  max_cycles: 10
  convergence: 3        # stop after 3 cycles with no improvement
  artifact: ./SKILL.md  # the file being optimized
```

The loop:
1. RUN EVAL — `agentv run` (or agent-mode pipeline) with the current artifact
2. ANALYZE — dispatch the analyzer subagent on the results
3. DECIDE — if score > best_score: KEEP (copy to `best.md`), else DROP
4. MUTATE — dispatch the mutator subagent with the failure analysis
5. GOTO 1 — until convergence or max_cycles
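The steps above can be sketched in Python. This is an illustrative sketch, not the implementation: `run_eval` and `mutate` are hypothetical injected callables standing in for the eval run and the mutator subagent, and the ANALYZE step is folded into the mutator call.

```python
import json
import shutil
from pathlib import Path


def autoresearch(session: Path, artifact: Path, run_eval, mutate,
                 max_cycles: int = 10, convergence: int = 3) -> float:
    """Hill-climbing loop: evaluate, keep-or-drop, mutate from proven best."""
    best = session / "best.md"
    if not best.exists():
        shutil.copy(artifact, session / "original.md")  # immutable snapshot
        shutil.copy(artifact, best)
    best_score, stale = -1.0, 0
    for cycle in range(1, max_cycles + 1):
        score = run_eval(artifact)                       # 1. RUN EVAL
        with open(session / "results.jsonl", "a") as f:  # append-only trajectory
            f.write(json.dumps({"cycle": cycle, "score": score}) + "\n")
        if score > best_score:                           # 3. DECIDE: KEEP
            best_score, stale = score, 0
            shutil.copy(artifact, best)
        else:                                            # DROP: revert to best,
            stale += 1                                   # so mutation always
            shutil.copy(best, artifact)                  # reads from best
        if stale >= convergence:                         # convergence detection
            break
        mutate(artifact)                                 # 2+4. ANALYZE + MUTATE
    return best_score
```

Note the ratchet: on DROP the artifact is overwritten with `best.md` before the next mutation, so quality can never regress across cycles.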
Interactive/autonomous hybrid: Users can start in interactive mode (existing behavior), build confidence in their eval, then switch to autoresearch mode to run unattended.
Acceptance Signals
- agentv-bench SKILL.md documents autoresearch mode as an alternative to interactive iteration
- Given a SKILL.md + EVAL.yaml, runs N improvement cycles without human input
- Hill-climbing ratchet: artifact quality only increases or stays the same across cycles
- Stops on convergence (no improvement for N cycles) or max_cycles
- Session state survives context resets — a fresh agent can resume from `state.json` + `best.md`
- Original artifact preserved in `original.md`
- `results.jsonl` contains the per-cycle score trajectory for post-hoc analysis
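For instance, the append-only results.jsonl lends itself to trivial post-hoc analysis (a sketch; field names follow the state-file table, and `score_trajectory` is a hypothetical helper):

```python
import json
from pathlib import Path


def score_trajectory(session: Path) -> list[float]:
    """Per-cycle scores, in order, read from the append-only log."""
    lines = (session / "results.jsonl").read_text().splitlines()
    return [json.loads(line)["score"] for line in lines]
```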
Non-Goals
- Not a new skill — this is a mode of the existing agentv-bench skill
- Not a replacement for interactive mode — both coexist
- Not multi-file mutation (start with single artifact, expand later)
- Not a dashboard (use `agentv trace stats` + `agentv compare` for now)
- Does not modify the eval definition — only the artifact under test
Context
The autoresearch pattern — originated by karpathy/autoresearch and generalized by pi-autoresearch — runs autonomous experiment loops: modify → test → keep/discard → repeat. agentv-bench already has Steps 1-4 (understand → write evals → run & grade → analyze). The missing piece is autonomous Step 5 (mutate and loop without human input).
Key design principles from autoresearch research:
- Immutable eval harness — the loop mutates the artifact, never the eval. Prevents gaming.
- Hill-climbing ratchet — mutation always reads from proven best. Monotonic improvement.
- Single-file mutation surface — constrain what the agent can change. Keeps diffs reviewable.
- Session persistence — state files enable resume across context window resets.
Related
- feat(bench): mutator subagent — autonomous artifact rewriting from failure analysis #746 — mutator subagent (required dependency — provides the mutation logic)
- feat(eval-writer): eval-generator subagent — bootstrap EVAL.yaml from existing artifacts #747 — eval-generator subagent (complementary — removes cold-start friction)
- feat(eval): Ralph Loop — iterative improvement with feedback injection #699 — Ralph Loop (complementary: Ralph improves outputs within a run; autoresearch improves artifacts across runs)
- feat: Pass@k Trial Strategy for LLM Non-Determinism #214 — Pass@k trials (complementary: statistical significance testing between cycles)