LLM Eval Framework

A structured evaluation framework for LLM outputs — built around the five dimensions that determine whether an AI feature is actually ready to ship.


The problem this solves

Most AI teams evaluate their models informally: someone runs a few prompts, the outputs "look good", and the feature ships. This works until it doesn't — until a hallucinated number makes it into a customer report, or a format change in a model update breaks a structured output nobody was tracking.

Systematic LLM evaluation is the difference between "the AI seems to be working" and "I can tell you exactly which dimension degraded, by how much, and when it started."

This framework gives PMs and AI teams a structured, repeatable way to:

  • Define what "good" looks like before you ship, not after
  • Score outputs consistently using an LLM-as-judge approach
  • Track quality regressions across model versions and prompt changes
  • Produce shareable reports that go into PRs, sprint reviews, and stakeholder decks

The five dimensions

These dimensions were chosen because they map to the five ways AI features actually fail in production — not in the lab.

| Dimension | What it measures | Why it matters |
| --- | --- | --- |
| Accuracy | Does the output correctly answer what was asked? | Wrong answers erode trust faster than any UX problem |
| Hallucination | Does it invent specific details not grounded in the input? | Fabricated facts in enterprise contexts cause real-world harm |
| Completeness | Does it cover all required elements of the task? | AI that answers 70% of the question looks fine in demos, frustrates in production |
| Format adherence | Does it match expected structure, tone, and length? | Format failures are the most visible failure mode for end users |
| Conciseness | Is it appropriately concise without padding or hedging? | Verbose outputs have a measurable UX cost; length control = model control |

Each dimension uses an explicit 1–5 rubric that an LLM judge can apply consistently. Vague rubrics produce noisy scores; noisy scores make regression tracking useless.
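
To make this concrete, here is one way an explicit rubric could be expressed and rendered into a judge prompt. The anchor descriptions and the `rubric_prompt` helper are illustrative assumptions for one dimension, not the framework's actual rubric text:

```python
# Hypothetical 1-5 rubric for the hallucination dimension, written so
# an LLM judge knows exactly what each score looks like.
HALLUCINATION_RUBRIC = {
    5: "Every specific claim is grounded in the input; nothing invented.",
    4: "All material claims grounded; at most trivial embellishment.",
    3: "One minor ungrounded detail that does not change the conclusion.",
    2: "Multiple ungrounded details, or one material fabricated fact.",
    1: "Output is substantially fabricated relative to the input.",
}

def rubric_prompt(dimension: str, rubric: dict[int, str]) -> str:
    """Render a rubric into judge-prompt text, worst score first."""
    lines = [f"Score the output for {dimension} on a 1-5 scale:"]
    for score in sorted(rubric):
        lines.append(f"{score}: {rubric[score]}")
    return "\n".join(lines)
```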


How it works

YAML test suite (prompts + configs)
        │
        ▼
┌──────────────────────────────────────┐
│  eval_runner.py                      │
│  1. Load test cases                  │
│  2. Run model under test             │
│  3. For each output × each dimension │
│     → LLM judge scores 1-5           │
│     → Rationale captured             │
│  4. Aggregate + compute pass rate    │
│  5. Compare vs baseline (optional)   │
└──────────────────────────────────────┘
        │
        ├── Markdown report (for PRs / Notion)
        ├── HTML dashboard (interactive, shareable)
        └── JSON results (for regression tracking)

LLM-as-judge

The evaluator uses Claude to score Claude outputs (or any other model's outputs). Each dimension is scored in a separate API call with the rubric passed explicitly — this reduces anchoring effects from multi-dimension prompts and produces more calibrated scores.

Judge model: claude-sonnet-4-20250514 at temperature 0 (deterministic scoring).
Model under test: configurable — defaults to claude-sonnet-4-20250514 but can be any model.
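
Based on that description, a single per-dimension judge call might look like the sketch below. Only the judge model name and temperature come from this README; the prompt wording, the `parse_score` helper, and the "Score: <1-5>" reply convention are illustrative assumptions, not the repo's actual code:

```python
import re

JUDGE_MODEL = "claude-sonnet-4-20250514"

def parse_score(judge_text: str) -> int:
    """Pull the first 'Score: N' value out of the judge's reply."""
    match = re.search(r"Score:\s*([1-5])", judge_text)
    if match is None:
        raise ValueError(f"No score found in judge output: {judge_text!r}")
    return int(match.group(1))

def judge_dimension(client, dimension: str, rubric: str,
                    task_prompt: str, output: str) -> int:
    """Score one output on one dimension in its own API call.

    `client` is an anthropic.Anthropic() instance, injected by the caller.
    """
    resp = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=300,
        temperature=0,  # deterministic scoring, as noted above
        messages=[{
            "role": "user",
            "content": (
                f"Rubric for {dimension}:\n{rubric}\n\n"
                f"Task prompt:\n{task_prompt}\n\n"
                f"Model output:\n{output}\n\n"
                "Reply with 'Score: <1-5>' followed by a one-line rationale."
            ),
        }],
    )
    return parse_score(resp.content[0].text)
```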

Regression tracking

Save results from any run to JSON. Pass a previous run as --baseline to see:

  • Per-dimension delta (↑ Improving / → Stable / ↓ Regressing)
  • Overall trajectory score
  • Which specific test cases flipped from PASS to FAIL

This is how you detect model-version regressions before they reach production.
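
A minimal sketch of that baseline comparison, given the parsed JSON from two runs. The schema here (per-dimension averages plus per-case pass flags) is an assumption; the framework's actual results format may differ:

```python
def compare_runs(base: dict, curr: dict) -> dict:
    """Per-dimension deltas and PASS -> FAIL flips between two runs."""
    deltas = {
        dim: round(curr["dimension_averages"][dim] - score, 2)
        for dim, score in base["dimension_averages"].items()
    }
    # Cases that passed in the baseline but fail in the current run
    flipped = [
        case_id
        for case_id, passed in base["case_results"].items()
        if passed and not curr["case_results"].get(case_id, False)
    ]
    return {"deltas": deltas, "new_failures": flipped}
```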


Quickstart

git clone https://github.com/dvmukul/llm-eval-framework.git
cd llm-eval-framework
pip install -r requirements.txt
export ANTHROPIC_API_KEY=your_key_here

# Run the PM workflow suite
python eval_runner.py --suite test_cases/pm_workflow.yaml

# Run the SaaS AI features suite
python eval_runner.py --suite test_cases/saas_product.yaml

# Save results for regression tracking
python eval_runner.py --suite test_cases/pm_workflow.yaml \
  --save-results results/baseline.json

# Compare after a prompt or model change
python eval_runner.py --suite test_cases/pm_workflow.yaml \
  --baseline results/baseline.json

# Run specific dimensions only
python eval_runner.py --suite test_cases/pm_workflow.yaml \
  --dimensions accuracy hallucination

# Run the same suite against a different model
python eval_runner.py --suite test_cases/pm_workflow.yaml \
  --model claude-haiku-4-5-20251001

Test suite format

Test suites are YAML files — no code required to add new test cases.

name: My Product Suite
description: Evaluates AI outputs for my product's core use cases.

default_system_instructions: >
  You are an AI assistant. Be concise and accurate.

test_cases:
  - id: tc_001
    name: Summarize support ticket
    prompt: "Summarize this ticket in 2 sentences: [ticket content]"
    max_tokens: 150

  - id: tc_002
    name: Classify churn risk
    prompt: "Classify this account as High/Medium/Low churn risk: [data]"
    system_instructions: Override the default system prompt for this case.
    max_tokens: 100
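
Once parsed (e.g. with `yaml.safe_load`), a suite like this could be resolved into runnable cases as sketched below. Field names follow the YAML above; the `resolve_cases` helper and the 1024-token fallback are illustrative assumptions:

```python
def resolve_cases(suite: dict) -> list[dict]:
    """Fill in per-case system instructions from the suite default."""
    default = suite.get("default_system_instructions", "")
    resolved = []
    for case in suite["test_cases"]:
        resolved.append({
            "id": case["id"],
            "name": case["name"],
            "prompt": case["prompt"],
            # Per-case override wins, as in tc_002 above
            "system_instructions": case.get("system_instructions", default),
            "max_tokens": case.get("max_tokens", 1024),
        })
    return resolved
```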

Example terminal output

══════════════════════════════════════════════════════════
  Suite: PM Workflow Suite
  Model: claude-sonnet-4-20250514
  Cases: 5
══════════════════════════════════════════════════════════

[1/5] Running model on: Competitive brief — value proposition
  📋 Evaluating: Competitive brief — value proposition
     ✓ Accuracy              4/5
     ✓ Hallucination         5/5
     ✓ Completeness          4/5
     ✓ Format adherence      3/5
     ✓ Conciseness           4/5
  → PASS  Composite: 4.0/5

[2/5] Running model on: Hallucination resistance — unverifiable claim
  📋 Evaluating: Hallucination resistance — unverifiable claim
     ✓ Accuracy              3/5
     ✗ Hallucination         2/5
     ✗ Completeness          2/5
     ✗ Format adherence      2/5
     ✓ Conciseness           4/5
  → FAIL  Composite: 2.6/5

══════════════════════════════════════════════════════════
  Results: 4/5 passed (80%)
  Avg composite score: 3.7/5
  Regression: → Stable (+0.1 vs baseline)
══════════════════════════════════════════════════════════

📄 Markdown report: reports/pm_workflow_20250401_143022.md
📊 HTML dashboard:  reports/pm_workflow_20250401_143022.html
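
The composite in the run above is consistent with a simple mean of the five dimension scores. The 3.0 pass threshold below is an assumption inferred from this sample output (2.6 fails, 4.0 passes), not a documented constant:

```python
def composite(scores: dict[str, int], threshold: float = 3.0) -> tuple[float, bool]:
    """Mean dimension score, and whether the case passes overall."""
    avg = sum(scores.values()) / len(scores)
    return round(avg, 1), avg >= threshold

case_1 = {"accuracy": 4, "hallucination": 5, "completeness": 4,
          "format_adherence": 3, "conciseness": 4}
# composite(case_1) -> (4.0, True), matching the PASS line above
```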

Design decisions

Why LLM-as-judge? Ground truth labelling is expensive and slow. LLM-as-judge with explicit rubrics achieves high agreement with human raters on structured tasks and runs in seconds. The key is the rubric — a judge that knows exactly what a "3" looks like is far more consistent than one asked to "score quality 1-5."

Why five dimensions, not one composite? A single quality score hides the failure mode. "Score: 3.2/5" is useless. "Accuracy: 4/5, Hallucination: 1/5" tells you exactly what to fix and why users are complaining.

Why separate API calls per dimension? Asking for all five scores in one prompt creates anchoring effects — the first dimension score influences the rest. Independent calls produce more calibrated results at marginal cost.

Why YAML test suites? PMs, not just engineers, should be able to write test cases. YAML is readable, diffable, and doesn't require a development environment to edit. A PM who can write a test suite is far more valuable to an AI team than one who can't.


Roadmap

  • --compare mode — side-by-side comparison of two models on the same suite
  • Custom dimension support — define your own rubric in YAML
  • CI integration — GitHub Action to run evals on every prompt change
  • Trend view — track dimension averages across multiple runs over time

About

Built by Mukul Dewangan — Senior PM specializing in AI and data products.

The hardest part of AI PM work isn't building the feature. It's defining what "good" looks like before you ship it — and knowing when it's degraded after you do. This framework is a working answer to that problem.
