LLM Eval Framework

A structured evaluation framework for LLM outputs — built around the five dimensions that determine whether an AI feature is actually ready to ship.


The problem this solves

Most AI teams evaluate their models informally: someone runs a few prompts, the outputs "look good", and the feature ships. This works until it doesn't — until a hallucinated number makes it into a customer report, or a format change in a model update breaks a structured output nobody was tracking.

Systematic LLM evaluation is the difference between "the AI seems to be working" and "I can tell you exactly which dimension degraded, by how much, and when it started."

This framework gives PMs and AI teams a structured, repeatable way to:

  • Define what "good" looks like before you ship, not after
  • Score outputs consistently using an LLM-as-judge approach
  • Track quality regressions across model versions and prompt changes
  • Produce shareable reports that go into PRs, sprint reviews, and stakeholder decks

The five dimensions

These dimensions were chosen because they map to the five ways AI features actually fail in production — not in the lab.

| Dimension | What it measures | Why it matters |
| --- | --- | --- |
| Accuracy | Does the output correctly answer what was asked? | Wrong answers erode trust faster than any UX problem |
| Hallucination | Does it invent specific details not grounded in the input? | Fabricated facts in enterprise contexts cause real-world harm |
| Completeness | Does it cover all required elements of the task? | AI that answers 70% of the question looks fine in demos, frustrates in production |
| Format adherence | Does it match expected structure, tone, and length? | Format failures are the most visible failure mode for end users |
| Conciseness | Is it appropriately concise without padding or hedging? | Verbose outputs have a measurable UX cost; length control = model control |

Each dimension uses an explicit 1–5 rubric that an LLM judge can apply consistently. Vague rubrics produce noisy scores; noisy scores make regression tracking useless.
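
To make this concrete, here is one way an explicit rubric could be expressed and rendered into a judge prompt. The anchor descriptions and the `rubric_prompt` helper are illustrative assumptions for one dimension, not the framework's actual rubric text:

```python
# Hypothetical 1-5 rubric for the hallucination dimension, written so
# an LLM judge knows exactly what each score looks like.
HALLUCINATION_RUBRIC = {
    5: "Every specific claim is grounded in the input; nothing invented.",
    4: "All material claims grounded; at most trivial embellishment.",
    3: "One minor ungrounded detail that does not change the conclusion.",
    2: "Multiple ungrounded details, or one material fabricated fact.",
    1: "Output is substantially fabricated relative to the input.",
}

def rubric_prompt(dimension: str, rubric: dict[int, str]) -> str:
    """Render a rubric into judge-prompt text, worst score first."""
    lines = [f"Score the output for {dimension} on a 1-5 scale:"]
    for score in sorted(rubric):
        lines.append(f"{score}: {rubric[score]}")
    return "\n".join(lines)
```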


How it works

YAML test suite (prompts + configs)
        │
        ▼
┌──────────────────────────────────────┐
│  eval_runner.py                      │
│  1. Load test cases                  │
│  2. Run model under test             │
│  3. For each output × each dimension │
│     → LLM judge scores 1-5           │
│     → Rationale captured             │
│  4. Aggregate + compute pass rate    │
│  5. Compare vs baseline (optional)   │
└──────────────────────────────────────┘
        │
        ├── Markdown report (for PRs / Notion)
        ├── HTML dashboard (interactive, shareable)
        └── JSON results (for regression tracking)

LLM-as-judge

The evaluator uses Claude to score Claude outputs (or any other model's outputs). Each dimension is scored in a separate API call with the rubric passed explicitly — this reduces anchoring effects from multi-dimension prompts and produces more calibrated scores.

Judge model: claude-sonnet-4-20250514 at temperature 0 (deterministic scoring).
Model under test: configurable — defaults to claude-sonnet-4-20250514 but can be any model.
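
Based on that description, a single per-dimension judge call might look like the sketch below. Only the judge model name and temperature come from this README; the prompt wording, the `parse_score` helper, and the "Score: <1-5>" reply convention are illustrative assumptions, not the repo's actual code:

```python
import re

JUDGE_MODEL = "claude-sonnet-4-20250514"

def parse_score(judge_text: str) -> int:
    """Pull the first 'Score: N' value out of the judge's reply."""
    match = re.search(r"Score:\s*([1-5])", judge_text)
    if match is None:
        raise ValueError(f"No score found in judge output: {judge_text!r}")
    return int(match.group(1))

def judge_dimension(client, dimension: str, rubric: str,
                    task_prompt: str, output: str) -> int:
    """Score one output on one dimension in its own API call.

    `client` is an anthropic.Anthropic() instance, injected by the caller.
    """
    resp = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=300,
        temperature=0,  # deterministic scoring, as noted above
        messages=[{
            "role": "user",
            "content": (
                f"Rubric for {dimension}:\n{rubric}\n\n"
                f"Task prompt:\n{task_prompt}\n\n"
                f"Model output:\n{output}\n\n"
                "Reply with 'Score: <1-5>' followed by a one-line rationale."
            ),
        }],
    )
    return parse_score(resp.content[0].text)
```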

Regression tracking

Save results from any run to JSON. Pass a previous run as --baseline to see:

  • Per-dimension delta (↑ Improving / → Stable / ↓ Regressing)
  • Overall trajectory score
  • Which specific test cases flipped from PASS to FAIL

This is how you detect model-version regressions before they reach production.
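
A minimal sketch of that baseline comparison, given the parsed JSON from two runs. The schema here (per-dimension averages plus per-case pass flags) is an assumption; the framework's actual results format may differ:

```python
def compare_runs(base: dict, curr: dict) -> dict:
    """Per-dimension deltas and PASS -> FAIL flips between two runs."""
    deltas = {
        dim: round(curr["dimension_averages"][dim] - score, 2)
        for dim, score in base["dimension_averages"].items()
    }
    # Cases that passed in the baseline but fail in the current run
    flipped = [
        case_id
        for case_id, passed in base["case_results"].items()
        if passed and not curr["case_results"].get(case_id, False)
    ]
    return {"deltas": deltas, "new_failures": flipped}
```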


Quickstart

git clone https://github.com/dvmukul/llm-eval-framework.git
cd llm-eval-framework
pip install -r requirements.txt
export ANTHROPIC_API_KEY=your_key_here

# Run the PM workflow suite
python eval_runner.py --suite test_cases/pm_workflow.yaml

# Run the SaaS AI features suite
python eval_runner.py --suite test_cases/saas_product.yaml

# Save results for regression tracking
python eval_runner.py --suite test_cases/pm_workflow.yaml \
  --save-results results/baseline.json

# Compare after a prompt or model change
python eval_runner.py --suite test_cases/pm_workflow.yaml \
  --baseline results/baseline.json

# Run specific dimensions only
python eval_runner.py --suite test_cases/pm_workflow.yaml \
  --dimensions accuracy hallucination

# Run the same suite against a different model
python eval_runner.py --suite test_cases/pm_workflow.yaml \
  --model claude-haiku-4-5-20251001

Test suite format

Test suites are YAML files — no code required to add new test cases.

name: My Product Suite
description: Evaluates AI outputs for my product's core use cases.

default_system_instructions: >
  You are an AI assistant. Be concise and accurate.

test_cases:
  - id: tc_001
    name: Summarize support ticket
    prompt: "Summarize this ticket in 2 sentences: [ticket content]"
    max_tokens: 150

  - id: tc_002
    name: Classify churn risk
    prompt: "Classify this account as High/Medium/Low churn risk: [data]"
    system_instructions: Override the default system prompt for this case.
    max_tokens: 100
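
Once parsed (e.g. with `yaml.safe_load`), a suite like this could be resolved into runnable cases as sketched below. Field names follow the YAML above; the `resolve_cases` helper and the 1024-token fallback are illustrative assumptions:

```python
def resolve_cases(suite: dict) -> list[dict]:
    """Fill in per-case system instructions from the suite default."""
    default = suite.get("default_system_instructions", "")
    resolved = []
    for case in suite["test_cases"]:
        resolved.append({
            "id": case["id"],
            "name": case["name"],
            "prompt": case["prompt"],
            # Per-case override wins, as in tc_002 above
            "system_instructions": case.get("system_instructions", default),
            "max_tokens": case.get("max_tokens", 1024),
        })
    return resolved
```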

Example terminal output

══════════════════════════════════════════════════════════
  Suite: PM Workflow Suite
  Model: claude-sonnet-4-20250514
  Cases: 5
══════════════════════════════════════════════════════════

[1/5] Running model on: Competitive brief — value proposition
  📋 Evaluating: Competitive brief — value proposition
     ✓ Accuracy              4/5
     ✓ Hallucination         5/5
     ✓ Completeness          4/5
     ✓ Format adherence      3/5
     ✓ Conciseness           4/5
  → PASS  Composite: 4.0/5

[2/5] Running model on: Hallucination resistance — unverifiable claim
  📋 Evaluating: Hallucination resistance — unverifiable claim
     ✓ Accuracy              3/5
     ✗ Hallucination         2/5
     ✗ Completeness          2/5
     ✗ Format adherence      2/5
     ✓ Conciseness           4/5
  → FAIL  Composite: 2.6/5

══════════════════════════════════════════════════════════
  Results: 4/5 passed (80%)
  Avg composite score: 3.7/5
  Regression: → Stable (+0.1 vs baseline)
══════════════════════════════════════════════════════════

📄 Markdown report: reports/pm_workflow_20250401_143022.md
📊 HTML dashboard:  reports/pm_workflow_20250401_143022.html
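
The composite in the run above is consistent with a simple mean of the five dimension scores. The 3.0 pass threshold below is an assumption inferred from this sample output (2.6 fails, 4.0 passes), not a documented constant:

```python
def composite(scores: dict[str, int], threshold: float = 3.0) -> tuple[float, bool]:
    """Mean dimension score, and whether the case passes overall."""
    avg = sum(scores.values()) / len(scores)
    return round(avg, 1), avg >= threshold

case_1 = {"accuracy": 4, "hallucination": 5, "completeness": 4,
          "format_adherence": 3, "conciseness": 4}
# composite(case_1) -> (4.0, True), matching the PASS line above
```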

Design decisions

Why LLM-as-judge? Ground truth labelling is expensive and slow. LLM-as-judge with explicit rubrics achieves high agreement with human raters on structured tasks and runs in seconds. The key is the rubric — a judge that knows exactly what a "3" looks like is far more consistent than one asked to "score quality 1-5."

Why five dimensions, not one composite? A single quality score hides the failure mode. "Score: 3.2/5" is useless. "Accuracy: 4/5, Hallucination: 1/5" tells you exactly what to fix and why users are complaining.

Why separate API calls per dimension? Asking for all five scores in one prompt creates anchoring effects — the first dimension score influences the rest. Independent calls produce more calibrated results at marginal cost.

Why YAML test suites? PMs, not just engineers, should be able to write test cases. YAML is readable, diffable, and doesn't require a development environment to edit. A PM who can write a test suite is far more valuable to an AI team than one who can't.


Roadmap

  • --compare mode — side-by-side comparison of two models on the same suite
  • Custom dimension support — define your own rubric in YAML
  • CI integration — GitHub Action to run evals on every prompt change
  • Trend view — track dimension averages across multiple runs over time

About

Built by Mukul Dewangan — Senior PM specializing in AI and data products.

The hardest part of AI PM work isn't building the feature. It's defining what "good" looks like before you ship it — and knowing when it's degraded after you do. This framework is a working answer to that problem.
