CLI release gate for structured AI changes.
EvalGate runs a saved JSONL dataset against a prompt and model, validates the output against a JSON schema, computes deterministic metrics, and writes artifacts you can use locally or in CI.
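To make the flow above concrete, here is a minimal sketch of the validate-then-score step. It is not EvalGate's implementation. The record fields `expected` and `output`, the `REQUIRED_KEYS` check (a simple stand-in for real JSON Schema validation), and the metric names are all assumptions made for illustration. The model call is replaced by stored outputs so the example stays deterministic.

```python
import json

# Assumption: a tiny stand-in for a JSON schema (required key -> required type).
REQUIRED_KEYS = {"label": str}


def conforms(output_text):
    """Return the parsed object if it is a JSON object with the required typed keys, else None."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(parsed.get(key), expected_type):
            return None
    return parsed


def evaluate(jsonl_text):
    """Score every non-empty JSONL line: count schema failures and exact matches."""
    total = schema_failures = correct = 0
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        total += 1
        parsed = conforms(record["output"])
        if parsed is None:
            schema_failures += 1
        elif parsed == record["expected"]:
            correct += 1
    accuracy = correct / total if total else 0.0
    return {"total": total, "schema_failures": schema_failures, "accuracy": accuracy}


# Hypothetical records: expected structured answer plus a stored raw model output.
records = [
    {"expected": {"label": "billing"}, "output": '{"label": "billing"}'},
    {"expected": {"label": "refund"}, "output": '{"label": "billing"}'},
    {"expected": {"label": "refund"}, "output": "not json"},
    {"expected": {"label": "refund"}, "output": '{"label": "refund"}'},
]
jsonl_text = "\n".join(json.dumps(record) for record in records)
result = evaluate(jsonl_text)
print(result)
```

Because the same inputs always produce the same counts, a result like this can be compared run to run, which is what makes it usable as a release gate.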
Best fit:
- classification
- structured extraction
- tagging and routing
Most valuable use case:
- a team has a prompt or model change for a structured AI feature and needs one repeatable command to decide whether it is still safe to ship
corepack enable
pnpm install
pnpm evalgate:sample
That sample uses:
Artifacts are written to:
- .artifacts/report.json
- .artifacts/summary.md
- .artifacts/junit.xml
Create a starter config:
pnpm evalgate:init
Run an eval:
pnpm evalgate run --dataset ./my-dataset.jsonl --config ./evalgate.config.json
Create a baseline from a finished run:
pnpm evalgate baseline create --from ./report.json --out ./baseline.json
Compare a report to a baseline:
pnpm evalgate compare --report ./report.json --baseline ./baseline.json
Fail on gate or regression:
pnpm evalgate run \
--dataset ./my-dataset.jsonl \
--config ./evalgate.config.json \
--baseline ./baseline.json \
--fail-on-gate \
--fail-on-regression
EvalGate always writes report.json.
By default it also writes:
- summary.md
- junit.xml
Optional:
- sarif.json, via --formats summary,junit,sarif
Useful flags:
- --output-dir ./artifacts/evalgate
- --out ./artifacts/report.json
- --formats summary,junit,sarif
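The gate and regression checks behind --fail-on-gate and --fail-on-regression can be pictured with the sketch below. It is an illustration only: the function name `check`, the metric names, the `tolerance` parameter, and the assumption that every metric is higher-is-better are hypothetical and not taken from EvalGate.

```python
def check(report_metrics, gates, baseline_metrics=None, tolerance=0.0):
    """Return a list of failure messages; an empty list means the run passes.

    Assumption: every metric is higher-is-better.
    """
    failures = []
    # Gate check: each gated metric must reach its absolute minimum.
    for name, minimum in gates.items():
        value = report_metrics.get(name)
        if value is None or value < minimum:
            failures.append(f"gate: {name}={value} below minimum {minimum}")
    # Regression check: each baseline metric may drop by at most `tolerance`.
    if baseline_metrics is not None:
        for name, base in baseline_metrics.items():
            value = report_metrics.get(name)
            if value is None or value < base - tolerance:
                failures.append(f"regression: {name}={value} vs baseline {base}")
    return failures


# Hypothetical numbers: the run clears the absolute gate but regresses against the baseline.
report_metrics = {"accuracy": 0.91, "schema_valid_rate": 1.0}
gates = {"accuracy": 0.90}
baseline_metrics = {"accuracy": 0.94, "schema_valid_rate": 1.0}

failures = check(report_metrics, gates, baseline_metrics, tolerance=0.01)
exit_code = 1 if failures else 0  # a CI step would exit with this code
print(failures, exit_code)
```

The point of keeping the two checks separate is that a run can pass an absolute threshold and still be worse than the last accepted baseline, so each condition gets its own failure message.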
Key report fields:
- schema_version
- tool_version
- provider
- model
- prompt_version
- dataset_sha256
- config_sha256
- git_sha
- git_branch
- started_at
- finished_at
- duration_ms
- failure_counts_by_type
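Fields such as dataset_sha256 let you confirm that two reports were produced from the same inputs. The sketch below assumes the field holds the hex SHA-256 digest of the raw dataset file bytes. That is a guess about how the hash is computed, and the helper names `sha256_of_file` and `dataset_matches` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path):
    """Hex SHA-256 of a file's raw bytes, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_matches(report, dataset_path):
    """True if the report's recorded hash equals the current hash of the dataset file."""
    return report.get("dataset_sha256") == sha256_of_file(dataset_path)


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "my-dataset.jsonl"
    dataset.write_bytes(b'{"input": "hello"}\n')
    # Stand-in for a parsed report.json that recorded the hash at run time.
    report = {"dataset_sha256": sha256_of_file(dataset)}
    unchanged = dataset_matches(report, dataset)

    # Editing the dataset after the run makes the recorded hash stale.
    dataset.write_bytes(b'{"input": "hello!"}\n')
    changed = dataset_matches(report, dataset)

print(unchanged, changed)
```

A check like this is useful before comparing a report to a baseline: if the hashes differ, a metric change may come from the dataset rather than from the prompt or model.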
The repo ships with one complete walkthrough:
That example shows:
- the dataset file
- the config file
- the exact command to run
- the terminal output
- the generated report artifacts
GitHub Actions example:
For OpenAI-backed runs:
export OPENAI_API_KEY=your_key_here
pnpm evalgate:sample:openai
See .env.example for environment variables.
Do not commit generated reports built from sensitive datasets. report.json, summary.md, and other artifacts can include raw inputs, outputs, and diffs.
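One way to act on this warning is a small pre-commit style check that flags artifact files before they are committed. The sketch below is an assumption-laden illustration: the `ARTIFACT_NAMES` set is built from the file names mentioned in this README, and `risky_paths` is a hypothetical helper, not part of EvalGate.

```python
from pathlib import PurePosixPath

# File names mentioned in this README as generated artifacts.
ARTIFACT_NAMES = {"report.json", "summary.md", "junit.xml", "sarif.json"}


def risky_paths(staged_paths):
    """Return the staged paths whose file name matches a known artifact name."""
    return [path for path in staged_paths if PurePosixPath(path).name in ARTIFACT_NAMES]


# Hypothetical list of staged files.
staged = [
    "src/prompt.txt",
    ".artifacts/report.json",
    "docs/summary.md",
    "evalgate.config.json",
]
flagged = risky_paths(staged)
print(flagged)
```

In practice the staged list could come from `git diff --cached --name-only`. Matching on file name alone is deliberately conservative, so it can also flag an unrelated summary.md. Ignoring the artifact output directory in version control is a simpler first line of defense.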
MIT. See LICENSE.