feat: git-native artifact storage — configurable repo for eval run checkpoints #761

@christso

Objective

Add a git-native storage backend for eval run artifacts, inspired by entireio/cli. Eval results (metadata, transcripts, scores) are committed to a configurable branch and remote, making runs self-contained, versionable, and independent of cloud storage.

Design

Storage model

Each eval run produces one commit containing:

<runId[:2]>/<runId[2:]>/
  metadata.json        # run config, model, timestamps, scores, source commit/repo
  transcript.jsonl     # full session transcript (redacted)
  summary.json         # condensed stats for indexing

Sharding by the first two hex characters of the run ID (at most 256 buckets) keeps any single directory from accumulating every run.
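The sharding rule can be sketched as a small helper. This is only an illustration: the name `shard_path`, the optional `prefix` argument, and the assumption that run IDs are lowercase hex strings (as in the `a3b2c4d5e6f78901` example) are not part of the proposal.

```python
import re
from typing import Optional

# Assumption: run IDs are lowercase hex with at least three characters.
_HEX_RUN_ID = re.compile(r"[0-9a-f]{3,}")


def shard_path(run_id: str, prefix: Optional[str] = None) -> str:
    """Return '<runId[:2]>/<runId[2:]>', optionally under a path prefix."""
    if not _HEX_RUN_ID.fullmatch(run_id):
        raise ValueError(f"run ID must be lowercase hex, got {run_id!r}")
    sharded = f"{run_id[:2]}/{run_id[2:]}"
    if prefix:
        # Normalize stray slashes so the joined path has exactly one separator.
        return f"{prefix.strip('/')}/{sharded}"
    return sharded
```

With these assumptions, `shard_path("a3b2c4d5e6f78901")` yields `a3/b2c4d5e6f78901`, and passing `prefix=".agentv/runs"` yields `.agentv/runs/a3/b2c4d5e6f78901`.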

Branch creation logic

Branch exists?
  YES → commit to it (orphan, regular, main — doesn't matter)
  NO  → is it the repo's default branch?
          YES → error (refuse to create main/master)
          NO  → create as orphan, commit to it

Once a branch exists, all operations are identical regardless of how it was created.
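The decision tree above could be expressed as a pure function, kept separate from any git calls so it is easy to test. All names here (`BranchAction`, `DefaultBranchMissingError`, `decide_branch_action`) are hypothetical, and the sketch assumes the caller has already discovered the existing branches and the repo's default branch.

```python
from enum import Enum
from typing import AbstractSet


class BranchAction(Enum):
    COMMIT = "commit"                # branch exists: commit to it as-is
    CREATE_ORPHAN = "create_orphan"  # new non-default branch: create as orphan


class DefaultBranchMissingError(RuntimeError):
    """The configured branch is the default branch and does not exist yet."""


def decide_branch_action(
    branch: str,
    existing_branches: AbstractSet[str],
    default_branch: str,
) -> BranchAction:
    # An existing branch is used regardless of how it was created.
    if branch in existing_branches:
        return BranchAction.COMMIT
    # Refuse to create main/master on the user's behalf.
    if branch == default_branch:
        raise DefaultBranchMissingError(
            f"refusing to create default branch {branch!r}"
        )
    return BranchAction.CREATE_ORPHAN
```

Keeping the decision free of side effects means the three outcomes can be unit-tested without a repository.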

Configuration

# .agentv/config.yaml
artifacts:
  backend: git                    # "git" | "local" (default: local)
  git:
    remote: agentv-evals          # git remote name or URL
    branch: agentv/checkpoints/v1 # branch name (default)
    path: .agentv/runs            # optional subdirectory prefix (useful on shared branches like main)

  • Default backend: local (current behavior, no change)
  • git backend: commits to the configured branch, then pushes to the configured remote
  • Remote can be the same repo (origin) or a separate repo; the choice is the user's
  • path: when committing to a shared branch like main, scopes artifacts under a subdirectory to avoid polluting the root
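A minimal sketch of how the defaults above might be applied once the YAML has been parsed into a mapping. The dataclass and function names are illustrative, and treating `remote` as required for the git backend is an assumption; the proposal does not state a default remote.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

DEFAULT_BRANCH = "agentv/checkpoints/v1"


@dataclass(frozen=True)
class GitArtifactConfig:
    remote: str
    branch: str = DEFAULT_BRANCH
    path: Optional[str] = None  # optional subdirectory prefix


@dataclass(frozen=True)
class ArtifactConfig:
    backend: str = "local"
    git: Optional[GitArtifactConfig] = None


def parse_artifact_config(raw: Mapping[str, Any]) -> ArtifactConfig:
    """Apply defaults to the already-parsed contents of .agentv/config.yaml."""
    section = raw.get("artifacts") or {}
    backend = section.get("backend", "local")
    if backend not in ("git", "local"):
        raise ValueError(f"unknown artifacts.backend: {backend!r}")
    if backend == "local":
        return ArtifactConfig()
    git = section.get("git") or {}
    # Assumption: a remote must be given explicitly for the git backend.
    if not git.get("remote"):
        raise ValueError("artifacts.git.remote is required when backend is 'git'")
    return ArtifactConfig(
        backend="git",
        git=GitArtifactConfig(
            remote=git["remote"],
            branch=git.get("branch", DEFAULT_BRANCH),
            path=git.get("path"),
        ),
    )
```

An empty config resolves to the local backend, so existing setups keep their current behavior.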

Example configurations

Dedicated eval repo (recommended)

artifacts:
  backend: git
  git:
    remote: git@github.com:org/agentv-evals.git
    branch: agentv/checkpoints/v1

Same repo, orphan branch

artifacts:
  backend: git
  git:
    remote: origin
    branch: agentv/checkpoints/v1

Same repo, main branch (mixed human + machine artifacts)

artifacts:
  backend: git
  git:
    remote: origin
    branch: main
    path: .agentv/runs

Write flow

  1. Eval run completes → runner has result payload
  2. Git storage backend:
    • Branch exists → fetch latest
    • Branch doesn't exist and isn't default branch → create as orphan
    • Branch doesn't exist and is default branch → error
    • Build tree object with sharded path (under path prefix if configured)
    • Commit with message Run: <runId> and trailers (AgentV-Eval, AgentV-Model, Source-Commit)
    • Push to configured remote
  3. On conflict (concurrent runs): fetch, rebase, retry. Each run writes only to its own unique path, so the rebase never hits a content conflict and the retried push is a fast-forward.
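The commit message from step 2 can be sketched as follows. The function name is hypothetical, and which value goes into each trailer (for example, the eval file path for `AgentV-Eval`) is an assumption, since the proposal names the trailers but not their contents.

```python
from typing import Mapping


def build_commit_message(run_id: str, trailers: Mapping[str, str]) -> str:
    """Build 'Run: <runId>' followed by a blank line and 'Key: value' trailers."""
    lines = [f"Run: {run_id}", ""]
    for key, value in trailers.items():
        # Trailers are single-line 'Key: value' pairs in the last paragraph.
        if "\n" in key or "\n" in value:
            raise ValueError(f"trailer {key!r} must be a single line")
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"
```

Because the trailers sit in the final paragraph of the message, standard tooling such as `git interpret-trailers --parse` should be able to read them back.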

Read flow

  • agentv results list → git log <branch> --oneline
  • agentv results show <runId> → git show <branch>:<path>/<shard>/<id>/metadata.json
  • Dashboard / web UI reads from the git remote directly
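As a rough illustration of the mapping above, the two CLI commands could translate to git argument lists like this. The function names are hypothetical, and the sketch assumes `<branch>` is a ref that is already available locally (for example after a fetch).

```python
from typing import List, Optional


def list_command(branch: str) -> List[str]:
    """argv for `agentv results list`."""
    return ["git", "log", branch, "--oneline"]


def show_command(
    branch: str,
    run_id: str,
    prefix: Optional[str] = None,
    filename: str = "metadata.json",
) -> List[str]:
    """argv for `agentv results show <runId>`, using the <rev>:<path> syntax."""
    parts = [run_id[:2], run_id[2:], filename]
    if prefix:
        # The path prefix applies only when artifacts.git.path is configured.
        parts.insert(0, prefix.strip("/"))
    return ["git", "show", f"{branch}:{'/'.join(parts)}"]
```

Returning argument lists rather than shell strings avoids quoting problems when they are handed to a process runner.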

Cross-repo linking

Each metadata.json includes:

{
  "sourceRepo": "org/repo",
  "sourceCommit": "abc123def",
  "evalFile": "evals/my-eval.yaml",
  "runId": "a3b2c4d5e6f78901",
  "model": "claude-sonnet-4-6",
  "scores": { ... },
  "timestamp": "2026-03-25T12:00:00Z"
}

This solves the multi-repo eval problem — runs from different codebases all land in one eval results repo with provenance.
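One way to keep the provenance fields consistent is a small typed record that serializes to the JSON shape shown above. This is a sketch: the class name `RunMetadata` is hypothetical, and `scores` is treated as an opaque mapping because its structure is not specified here.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class RunMetadata:
    source_repo: str
    source_commit: str
    eval_file: str
    run_id: str
    model: str
    timestamp: str  # ISO 8601 in UTC, e.g. "2026-03-25T12:00:00Z"
    scores: Dict[str, Any] = field(default_factory=dict)  # shape left open

    def to_json(self) -> str:
        """Serialize with the camelCase keys used in metadata.json."""
        return json.dumps(
            {
                "sourceRepo": self.source_repo,
                "sourceCommit": self.source_commit,
                "evalFile": self.eval_file,
                "runId": self.run_id,
                "model": self.model,
                "scores": self.scores,
                "timestamp": self.timestamp,
            },
            indent=2,
        )
```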

Why git-native

  • No cloud dependency — works offline, self-hosted, air-gapped
  • Familiar tooling: git log, git show, git diff for querying results
  • Access control — inherits git remote permissions
  • Auditability — immutable append-only history
  • CI-friendly — runners just need git push access to the eval repo
  • Separation of concerns — eval data scales independently of source code

Why separate repo (recommended default)

  • Source repo stays lean (eval transcripts are large, append-only)
  • Different retention policies (prune old runs without touching code)
  • Scoped CI permissions (eval runners don't need code repo write access)
  • Natural home for cross-repo evals

Using main on the same repo is fully supported for teams that prefer a single repo with human-editable artifacts alongside automated results.

Implementation plan

Phase 1: Git storage backend

  1. Add artifacts.git config schema to config loader
  2. Implement GitArtifactStore class with write(runResult) and list()/get(runId) methods
  3. Branch creation logic: exists → use it, new + non-default → orphan, new + default → error
  4. Sharded path builder: runId → <id[:2]>/<id[2:]>/
  5. Commit with trailers, push to remote
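The store interface from step 2 might look like the following. Only the method names `write`, `list`, and `get` come from the plan; the `ArtifactStore` protocol, the argument types, and the in-memory stand-in (useful for testing the runner without a repository) are assumptions.

```python
from typing import Any, Dict, List, Mapping, Protocol


class ArtifactStore(Protocol):
    """Shape a git-backed or local store would both satisfy (assumed)."""

    def write(self, run_result: Mapping[str, Any]) -> None: ...
    def list(self) -> List[str]: ...
    def get(self, run_id: str) -> Mapping[str, Any]: ...


class InMemoryArtifactStore:
    """Test double keyed by the same sharded paths the git backend would use."""

    def __init__(self) -> None:
        self._files: Dict[str, Dict[str, Any]] = {}

    @staticmethod
    def _metadata_path(run_id: str) -> str:
        return f"{run_id[:2]}/{run_id[2:]}/metadata.json"

    def write(self, run_result: Mapping[str, Any]) -> None:
        self._files[self._metadata_path(run_result["runId"])] = dict(run_result)

    def list(self) -> List[str]:
        # Rebuild each run ID from '<shard>/<rest>/metadata.json'.
        return sorted(p[:2] + p[3:].split("/")[0] for p in self._files)

    def get(self, run_id: str) -> Mapping[str, Any]:
        return self._files[self._metadata_path(run_id)]
```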

Phase 2: CLI integration

  1. Wire GitArtifactStore into eval runner via backend config
  2. agentv results list — read from git branch
  3. agentv results show <runId> — read metadata/transcript from git branch

Phase 3: Concurrency & robustness

  1. Fetch-rebase-retry loop for concurrent pushes
  2. Graceful handling of missing remote, auth failures, network errors (fall back to local with warning)
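A sketch of the fetch-rebase-retry loop, with the git operations injected as callables so the control flow can be tested without a remote. `PushRejected`, `push_with_retry`, the attempt limit, and the exponential backoff are all assumptions added for illustration.

```python
import time
from typing import Callable


class PushRejected(Exception):
    """Raised by the injected push callable when the remote has moved ahead."""


def push_with_retry(
    push: Callable[[], None],
    fetch_and_rebase: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Push, rebasing onto the remote after each rejection.

    Returns the number of push attempts used; re-raises PushRejected
    once max_attempts pushes have been rejected.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            push()
            return attempt
        except PushRejected:
            if attempt >= max_attempts:
                raise
            # Another run pushed first: replay our commit on top of theirs.
            fetch_and_rebase()
            # Back off so concurrent runners do not retry in lockstep.
            sleep(base_delay * 2 ** (attempt - 1))
```

A caller could catch the final `PushRejected` (or an auth or network error) and fall back to the local backend with a warning, as item 2 describes.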

Phase 4: Dashboard integration

  1. Dashboard reads results from git remote (extends feat: self-hosted dashboard — historical trends, dataset management, YAML editor #563)

Prior art

  • entireio/cli — two-tier model with shadow branches + orphan checkpoint branch, checkpoint_remote for separate repo support
  • Git notes — similar concept but limited to annotating existing commits

Acceptance signals

  • artifacts.backend: git config option is respected
  • Branch creation follows the exists/orphan/error logic
  • Eval results written to sharded paths on the branch
  • path prefix respected when configured (for shared branches)
  • Push to configured remote after each run
  • agentv results list/show reads from the git branch
  • Concurrent runs don't corrupt the branch
  • Existing local backend unchanged (default)

Non-goals

  • Shadow branches / mid-run checkpointing (entireio's Tier 1) — not needed since we write after run completion
  • Git hooks integration — eval runs are triggered by CLI, not git commit
  • Transcript deduplication across runs — git's object dedup handles this naturally
