memory-workbench

memory-workbench is a continuity benchmark workbench for long-running agents.

It does not ask "can the system remember more?" first. It asks the narrower operational question that actually breaks work:

Can interrupted work resume correctly, transfer safely, and stay auditable without replaying the full trace?

At A Glance

Signal	Current value
Focus	Continuity artifacts for `resume`, `handoff`, `stale-check`, and `audit`
Benchmark cases	`2`
Scripted reruns	`1`
Average replay-cost reduction	`64.6%`
Remote status	published to `origin/main`
Main remaining gap	`independent_second_operator_rerun`

Why This Repo Exists

Most agent-memory work emphasizes retrieval breadth, personalization, or token savings. This repo starts from a different failure mode:

the agent restarts and loses execution state
a second operator cannot tell what happened
the trace exists, but it is not compact, replayable, or decision-relevant
stale state is easy to reuse by mistake
audit still requires rereading raw logs

The goal is to turn memory from "stored context" into an artifact system that makes continuity measurable.

What Success Looks Like

A fresh operator should be able to:

understand the current goal and workflow state quickly
distinguish verified facts from assumptions
find the evidence behind the important decisions
detect stale state before acting
continue the next action without reopening the whole trace

If a new idea mainly improves retrieval breadth, long-term semantic coverage, or generic memory APIs, it probably belongs in another repo. If it lowers restart cost, clarifies handoff, or catches stale state before action, it belongs here.

Proof Surface

Current proof assets:

Current measured outcomes:

Case	Result
`operator-handoff-001`	replay cost `4.0 -> 1.5` minutes, `62.5%` reduction
`stale-pack-rotation-001`	replay cost `3.0 -> 1.0` minute, `66.7%` reduction
Rollup	`64.6%` average replay-cost reduction across the current proof set

Use It In Two Ways

1. Benchmark an existing memory system

Use this repo when you already have a memory system and want to test whether it preserves workflow continuity.

Choose a real interrupted workflow slice.
Export a continuity artifact from that system.
Compare raw trace only versus raw trace + continuity artifact.
Score replay cost, restart clarity, evidence traceability, and stale detection.

Start here:

2. Design better continuity artifacts

Use the cases, schema, validator, and generated proof packets here to iterate on what an actually reusable continuity artifact needs to contain.

Fast Start

If you only want the shortest path to the current proof:

Read docs/proof-surface.md.
Run python3 scripts/run_independent_rerun_check.py --json.
Inspect docs/independent-rerun-kit.md for the next operator handoff path.

If you want the full benchmark contract first:

Read evals/continuity-benchmark.md.
Read evals/operator-handoff-001-runbook.md.
Validate the existing pack with python3 scripts/validate_continuity_artifacts.py --pack packs/operator-handoff-001/continuity-pack.md --report reports/operator-handoff-001-report-2026-03-26.md.

Repo Map

Path	Why it matters
docs/product-spec.md	Product definition, boundary, and exit criteria
docs/architecture.md	Artifact model and evaluation loop
docs/evaluate-a-memory-system.md	Practical entrypoint for benchmarking another memory system
docs/target-system-eval-packet.md	Fixed intake packet for external-system evaluation
evals/continuity-benchmark.md	Scoring contract and benchmark dimensions
evals/operator-handoff-001-runbook.md	Runnable evaluator procedure
cases/operator-handoff-001.md	Handoff failure-mode case
cases/stale-pack-rotation-001.md	Stale-state failure-mode case
schemas/continuity-pack.example.yaml	Minimal continuity-pack schema
scripts/	Builders, validators, rerun harnesses, and packet generators

Runnable Commands

python3 scripts/run_independent_rerun_check.py --json
python3 scripts/build_proof_surface.py --json
python3 scripts/build_proof_refresh_bundle.py --json
python3 scripts/build_remote_sync_manifest.py --json
python3 scripts/verify_remote_sync_manifest.py --json
python3 scripts/build_public_citation_pack.py --json
python3 scripts/build_independent_rerun_kit.py --json
python3 scripts/build_upstream_issue_packet.py --json
python3 scripts/build_target_system_eval_packet.py --json
python3 scripts/rerun_stale_pack_rotation.py --json

Principles

proof before polish
continuity before feature breadth
auditable artifacts over hidden state
stale-check before action
small reproducible cases over broad claims

Current Status

The repo is now published to origin/main and the generated proof packets describe a runnable local baseline instead of a blocked pre-publish state.

The largest remaining product gap is still independent_second_operator_rerun: the proof is reproducible locally, but still needs a genuinely separate operator or later-session rerun to strengthen the claim.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
cases		cases
docs		docs
evals		evals
examples		examples
packs		packs
reports		reports
schemas		schemas
scripts		scripts
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
README.md		README.md
ROADMAP.md		ROADMAP.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

memory-workbench

At A Glance

Why This Repo Exists

What Success Looks Like

Proof Surface

Use It In Two Ways

1. Benchmark an existing memory system

2. Design better continuity artifacts

Fast Start

Repo Map

Runnable Commands

Principles

Current Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

memory-workbench

At A Glance

Why This Repo Exists

What Success Looks Like

Proof Surface

Use It In Two Ways

1. Benchmark an existing memory system

2. Design better continuity artifacts

Fast Start

Repo Map

Runnable Commands

Principles

Current Status

About

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages