Objective
Add an autoresearch mode to agentv-bench that runs the eval-improve loop unattended: evaluate → analyze → keep/drop → mutate → repeat. This turns agentv-bench from a human-directed optimization tool into one that can also run autonomously overnight.
Design Latitude
What changes in agentv-bench:
- Step 5 (Improve) dispatches the `mutator` subagent instead of waiting for human input
- Human checkpoints at iterations 3/6/9 are skipped
- Hill-climbing ratchet enforced: track the `best` version explicitly; mutation always reads from best
- Convergence detection: stop after N consecutive cycles with no improvement (default 3)
- Session state persisted for resumability
State files (in `.agentv/autoresearch/{session}/`):

| File | Purpose | Mutable? |
|---|---|---|
| `original.md` | Snapshot of artifact before first mutation | No |
| `best.md` | Current best-scoring version | Yes (on KEEP) |
| `state.json` | `{best_score, cycle, best_cycle, convergence_count}` | Yes |
| `results.jsonl` | One line per cycle: score, per-assertion breakdown, timestamp | Append-only |
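As a minimal sketch of how that state might be persisted (file names from the table above; the helper functions are hypothetical, not part of agentv-bench):

```python
import json
from pathlib import Path


def load_state(session_dir: Path) -> dict:
    """Read state.json, or initialize defaults for a fresh session."""
    path = session_dir / "state.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"best_score": 0.0, "cycle": 0, "best_cycle": 0, "convergence_count": 0}


def save_state(session_dir: Path, state: dict) -> None:
    """Write state via a temp file and rename, so a crash mid-write
    cannot leave a half-written state.json behind."""
    tmp = session_dir / "state.json.tmp"
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(session_dir / "state.json")
```

The write-then-rename step matters here because a fresh agent resuming the session must never see a truncated `state.json`.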
Activation: triggered when the user says "run autoresearch on this skill" or similar. Could also support an `autoresearch:` section in EVAL.yaml:

```yaml
autoresearch:
  max_cycles: 10
  convergence: 3        # stop after 3 cycles with no improvement
  artifact: ./SKILL.md  # the file being optimized
```

The loop:
1. RUN EVAL — `agentv run` (or agent-mode pipeline) with the current artifact
2. ANALYZE — dispatch the analyzer subagent on the results
3. DECIDE — if score > best_score: KEEP (copy to `best.md`), else DROP
4. MUTATE — dispatch the mutator subagent with the failure analysis
5. GOTO 1 — until convergence or max_cycles
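The steps above can be sketched in Python. This is an illustrative sketch, not the implementation: `run_eval` and `mutate` are hypothetical injected callables standing in for the eval run and the mutator subagent, and the ANALYZE step is folded into the mutator call.

```python
import json
import shutil
from pathlib import Path


def autoresearch(session: Path, artifact: Path, run_eval, mutate,
                 max_cycles: int = 10, convergence: int = 3) -> float:
    """Hill-climbing loop: evaluate, keep-or-drop, mutate from proven best."""
    best = session / "best.md"
    if not best.exists():
        shutil.copy(artifact, session / "original.md")  # immutable snapshot
        shutil.copy(artifact, best)
    best_score, stale = -1.0, 0
    for cycle in range(1, max_cycles + 1):
        score = run_eval(artifact)                       # 1. RUN EVAL
        with open(session / "results.jsonl", "a") as f:  # append-only trajectory
            f.write(json.dumps({"cycle": cycle, "score": score}) + "\n")
        if score > best_score:                           # 3. DECIDE: KEEP
            best_score, stale = score, 0
            shutil.copy(artifact, best)
        else:                                            # DROP: revert to best,
            stale += 1                                   # so mutation always
            shutil.copy(best, artifact)                  # reads from best
        if stale >= convergence:                         # convergence detection
            break
        mutate(artifact)                                 # 2+4. ANALYZE + MUTATE
    return best_score
```

Note the ratchet: on DROP the artifact is overwritten with `best.md` before the next mutation, so quality can never regress across cycles.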
Interactive/autonomous hybrid: Users can start in interactive mode (existing behavior), build confidence in their eval, then switch to autoresearch mode to run unattended.
Acceptance Signals
- agentv-bench SKILL.md documents autoresearch mode as an alternative to interactive iteration
- Given a SKILL.md + EVAL.yaml, runs N improvement cycles without human input
- Hill-climbing ratchet: artifact quality only increases or stays the same across cycles
- Stops on convergence (no improvement for N cycles) or max_cycles
- Session state survives context resets — a fresh agent can resume from `state.json` + `best.md`
- Original artifact preserved in `original.md`
- `results.jsonl` contains the per-cycle score trajectory for post-hoc analysis
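For instance, the append-only results.jsonl lends itself to trivial post-hoc analysis (a sketch; field names follow the state-file table, and `score_trajectory` is a hypothetical helper):

```python
import json
from pathlib import Path


def score_trajectory(session: Path) -> list[float]:
    """Per-cycle scores, in order, read from the append-only log."""
    lines = (session / "results.jsonl").read_text().splitlines()
    return [json.loads(line)["score"] for line in lines]
```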
Non-Goals
- Not a new skill — this is a mode of the existing agentv-bench skill
- Not a replacement for interactive mode — both coexist
- Not multi-file mutation (start with single artifact, expand later)
- Not a dashboard (use `agentv trace stats` + `agentv compare` for now)
- Does not modify the eval definition — only the artifact under test
Context
The autoresearch pattern — originated by karpathy/autoresearch and generalized by pi-autoresearch — runs autonomous experiment loops: modify → test → keep/discard → repeat. agentv-bench already has Steps 1-4 (understand → write evals → run & grade → analyze). The missing piece is autonomous Step 5 (mutate and loop without human input).
Key design principles from autoresearch research:
- Immutable eval harness — the loop mutates the artifact, never the eval. Prevents gaming.
- Hill-climbing ratchet — mutation always reads from proven best. Monotonic improvement.
- Single-file mutation surface — constrain what the agent can change. Keeps diffs reviewable.
- Session persistence — state files enable resume across context window resets.
Related
- feat(bench): mutator subagent — autonomous artifact rewriting from failure analysis #746 — mutator subagent (required dependency — provides the mutation logic)
- feat(eval-writer): eval-generator subagent — bootstrap EVAL.yaml from existing artifacts #747 — eval-generator subagent (complementary — removes cold-start friction)
- feat(eval): Ralph Loop — iterative improvement with feedback injection #699 — Ralph Loop (complementary: Ralph improves outputs within a run; autoresearch improves artifacts across runs)
- feat: Pass@k Trial Strategy for LLM Non-Determinism #214 — Pass@k trials (complementary: statistical significance testing between cycles)