feat(bench): autoresearch mode — unattended eval-improve loop with hill-climbing ratchet #748

@christso

Description

Objective

Add an autoresearch mode to agentv-bench that runs the eval-improve loop unattended: evaluate → analyze → keep/drop → mutate → repeat. This turns agentv-bench from a human-directed optimization tool into one that can also run autonomously overnight.

Design Latitude

What changes in agentv-bench:

  • Step 5 (Improve) dispatches mutator subagent instead of waiting for human input
  • Human checkpoints at iterations 3/6/9 are skipped
  • Hill-climbing ratchet enforced: track best version explicitly, mutation always reads from best
  • Convergence detection: stop after N consecutive cycles with no improvement (default 3)
  • Session state persisted for resumability

State files (in .agentv/autoresearch/{session}/):

| File | Purpose | Mutable? |
| --- | --- | --- |
| `original.md` | Snapshot of artifact before first mutation | No |
| `best.md` | Current best-scoring version | Yes (on KEEP) |
| `state.json` | `{best_score, cycle, best_cycle, convergence_count}` | Yes |
| `results.jsonl` | One line per cycle: score, per-assertion breakdown, timestamp | Append-only |
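A minimal sketch of how this state could round-trip, assuming the layout above; the session path and helper names (`load_state`, `save_state`) are illustrative, not part of agentv-bench:

```python
import json
import tempfile
from pathlib import Path

# Illustrative session directory; real sessions live under .agentv/autoresearch/{session}/
SESSION_DIR = Path(tempfile.mkdtemp()) / "demo-session"

def load_state(session_dir: Path) -> dict:
    """Resume from state.json if it exists, else start a fresh session."""
    state_path = session_dir / "state.json"
    if state_path.exists():
        return json.loads(state_path.read_text())
    return {"best_score": 0.0, "cycle": 0, "best_cycle": 0, "convergence_count": 0}

def save_state(session_dir: Path, state: dict) -> None:
    """Persist session state so a fresh agent can resume after a context reset."""
    session_dir.mkdir(parents=True, exist_ok=True)
    (session_dir / "state.json").write_text(json.dumps(state, indent=2))
```

Because `load_state` falls back to fresh defaults only when `state.json` is absent, a fresh agent picking up mid-session resumes exactly where the previous one stopped.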

Activation: Triggered when the user says "run autoresearch on this skill" or similar. The mode could also be configured via an autoresearch: section in EVAL.yaml:

```yaml
autoresearch:
  max_cycles: 10
  convergence: 3        # stop after 3 cycles with no improvement
  artifact: ./SKILL.md  # the file being optimized
```

The loop:

```
1. RUN EVAL      — agentv run (or agent-mode pipeline) with current artifact
2. ANALYZE       — dispatch analyzer subagent on results
3. DECIDE        — if score > best_score: KEEP (copy to best.md), else DROP
4. MUTATE        — dispatch mutator subagent with failure analysis
5. GOTO 1        — until convergence or max_cycles
```
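The loop above can be sketched in Python; `run_eval` and `dispatch_mutator` are hypothetical stand-ins for the agentv pipeline and the mutator subagent dispatch:

```python
import json
import shutil
from pathlib import Path

def autoresearch(session: Path, artifact: Path, run_eval, dispatch_mutator,
                 max_cycles: int = 10, convergence: int = 3) -> float:
    """Unattended eval-improve loop with a hill-climbing ratchet (sketch)."""
    session.mkdir(parents=True, exist_ok=True)
    best = session / "best.md"
    if not (session / "original.md").exists():
        shutil.copy(artifact, session / "original.md")    # preserve the original
        shutil.copy(artifact, best)
    best_score, stalled = float("-inf"), 0
    for cycle in range(1, max_cycles + 1):
        score = run_eval(artifact)                        # 1. RUN EVAL + 2. ANALYZE
        with (session / "results.jsonl").open("a") as f:  # append-only trajectory
            f.write(json.dumps({"cycle": cycle, "score": score}) + "\n")
        if score > best_score:                            # 3. DECIDE
            best_score, stalled = score, 0
            shutil.copy(artifact, best)                   # KEEP: ratchet forward
        else:
            stalled += 1                                  # DROP
        if stalled >= convergence:                        # convergence detection
            break
        shutil.copy(best, artifact)                       # mutation reads from best
        dispatch_mutator(artifact)                        # 4. MUTATE, then 5. GOTO 1
    return best_score
```

Because the artifact is reset to `best.md` before every mutation, a bad mutation can never compound: quality is monotone non-decreasing across cycles, which is exactly the ratchet guarantee.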

Interactive/autonomous hybrid: Users can start in interactive mode (existing behavior), build confidence in their eval, then switch to autoresearch mode to run unattended.

Acceptance Signals

  • agentv-bench SKILL.md documents autoresearch mode as an alternative to interactive iteration
  • Given a SKILL.md + EVAL.yaml, runs N improvement cycles without human input
  • Hill-climbing ratchet: artifact quality only increases or stays the same across cycles
  • Stops on convergence (no improvement for N cycles) or max_cycles
  • Session state survives context resets — a fresh agent can resume from state.json + best.md
  • Original artifact preserved in original.md
  • results.jsonl contains per-cycle score trajectory for post-hoc analysis
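As a sketch of the post-hoc analysis this enables (helper names are hypothetical; `agentv trace stats` and `agentv compare` remain the real tooling), the trajectory can be recovered directly from `results.jsonl`:

```python
import json
from pathlib import Path

def trajectory(results_path: Path) -> list[tuple[int, float]]:
    """Per-cycle (cycle, score) pairs from the append-only results.jsonl."""
    rows = [json.loads(line) for line in results_path.read_text().splitlines() if line]
    return [(row["cycle"], row["score"]) for row in rows]

def best_cycle(results_path: Path) -> int:
    """Cycle that produced the highest score."""
    return max(trajectory(results_path), key=lambda pair: pair[1])[0]
```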

Non-Goals

  • Not a new skill — this is a mode of the existing agentv-bench skill
  • Not a replacement for interactive mode — both coexist
  • Not multi-file mutation (start with single artifact, expand later)
  • Not a dashboard (use agentv trace stats + agentv compare for now)
  • Does not modify the eval definition — only the artifact under test

Context

The autoresearch pattern — originated by karpathy/autoresearch and generalized by pi-autoresearch — runs autonomous experiment loops: modify → test → keep/discard → repeat. agentv-bench already has Steps 1-4 (understand → write evals → run & grade → analyze). The missing piece is autonomous Step 5 (mutate and loop without human input).

Key design principles from autoresearch research:

  1. Immutable eval harness — the loop mutates the artifact, never the eval. Prevents gaming.
  2. Hill-climbing ratchet — mutation always reads from proven best. Monotonic improvement.
  3. Single-file mutation surface — constrain what the agent can change. Keeps diffs reviewable.
  4. Session persistence — state files enable resume across context window resets.
