Epistemic Question
Core Question: How do we design evaluation harnesses that detect when agent behavior drifts from the original specification intent, especially when the specification itself evolves over time?
Why This Matters
The HFIS Practical Guide emphasizes:
"Use AI for evaluation, not just generation. Most teams use AI to produce drafts but never to check them."
And Agile Agentic Analytics recommends:
"Every recurring agent failure becomes a backlog item: add a deterministic check, add a rule to AGENTS.md/CLAUDE.md, add a sandbox boundary, or add an eval case."
But there's a dynamic system problem:
- Specifications evolve as business requirements change
- Agents learn from feedback and adjust behavior
- Evaluation criteria themselves might become stale
- New failure modes emerge that weren't anticipated in original evals
Result: Evaluation drift — where evals pass but outputs no longer align with true intent.
The Specification-Eval Co-Evolution Problem
Traditional software testing: Tests are deterministic. If tests pass, the code does what the tests specify.
Agentic testing paradox:
- Evals test alignment with specification, but what if the specification is incomplete or outdated?
- Agents optimize for eval pass rate, but what if evals miss critical constraints?
- Specifications evolve, but do evals evolve in sync?
Example failure mode:
```
# Original Specification (Q1 2025)
"Calculate campaign ROI using 30-day attribution window"

# Eval (Q1 2025)
✓ Uses 30-day window
✓ Calculates ROI formula correctly

# Business Reality Changes (Q3 2025)
- Multi-touch attribution now required (not just last-touch)
- Attribution window extended to 45 days for some products

# Specification Updated (Q3 2025)
"Calculate campaign ROI using product-specific attribution windows (30d for cash, 45d for retirement) with multi-touch attribution"

# Eval Status (Q3 2025)
⚠️ Still testing old spec → passes, but outputs are wrong
```
The drift is silent. No one notices until a stakeholder questions the numbers.
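This kind of silent drift can be made loud with a cheap deterministic check: pin the eval suite to a fingerprint of the spec it was authored against, and fail CI when the two diverge. A minimal sketch, assuming the spec lives as text under version control (the spec strings and the `eval_suite_is_stale` helper are hypothetical):

```python
import hashlib

def spec_hash(spec_text: str) -> str:
    """Stable fingerprint of the specification text."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()[:12]

# Hypothetical Q1 spec and the hash the eval suite was authored against.
SPEC_Q1 = "Calculate campaign ROI using 30-day attribution window"
EVAL_PINNED_HASH = spec_hash(SPEC_Q1)

# Q3: the spec changes, but the eval suite's pin does not.
SPEC_Q3 = ("Calculate campaign ROI using product-specific attribution windows "
           "(30d for cash, 45d for retirement) with multi-touch attribution")

def eval_suite_is_stale(current_spec: str, pinned_hash: str) -> bool:
    """True when the spec has changed since the evals were last reviewed."""
    return spec_hash(current_spec) != pinned_hash

print(eval_suite_is_stale(SPEC_Q3, EVAL_PINNED_HASH))  # → True: evals test an old spec
```

A failing check does not say the evals are wrong, only that no one has re-reviewed them against the current spec, which is exactly the signal the Q3 scenario lacked.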
Open Questions to Explore
- Specification versioning: How do we version-control specifications in a way that propagates to eval updates automatically?
- Eval completeness metrics: How do we know if an eval suite has "sufficient coverage" of the specification? (Analogous to code coverage, but for intent)
- Adversarial eval generation: Can we automatically generate "adversarial evals" — tests designed to catch specification ambiguities or edge cases the human author didn't anticipate?
- Eval regression detection: How do we detect when an eval has become stale relative to the current specification?
- Intent vs. implementation drift: How do we test that agent behavior aligns with intent (what we actually want) vs. specification (what we wrote down)?
- Multi-stakeholder intent: When specifications reflect compromises between conflicting stakeholder priorities, how do evals capture the trade-offs?
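The "intent coverage" metric from the completeness question above can be prototyped with nothing more than a mapping from spec constraints to the eval cases that exercise them; the constraint names and eval IDs below are hypothetical:

```python
# Hypothetical sketch of an "intent coverage" metric: each documented
# constraint in the spec is mapped to the eval cases that exercise it.
spec_constraints = {
    "attribution_window_cash_30d": ["eval_roi_cash_window"],
    "attribution_window_retirement_45d": [],          # no eval yet
    "multi_touch_attribution": [],                    # no eval yet
    "roi_formula": ["eval_roi_formula", "eval_roi_zero_spend"],
}

covered = [c for c, evals in spec_constraints.items() if evals]
coverage = len(covered) / len(spec_constraints)

print(f"intent coverage: {coverage:.0%}")             # → intent coverage: 50%
for c, evals in spec_constraints.items():
    if not evals:
        print(f"UNTESTED constraint: {c}")
```

The hard part, of course, is extracting the constraint list from a prose spec in the first place; the metric only measures what has been made explicit.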
Hypotheses to Test
Hypothesis 1: The "Eval Coverage Gap"
- Most eval suites test "happy path" and known failure modes, but miss emergent edge cases
- Testable: Measure what % of specification constraints have corresponding eval cases
Hypothesis 2: The "Silent Drift Detector"
- We can build meta-evals that compare agent outputs across time to detect drift (even when individual evals pass)
- Testable: Historical analysis — do agent outputs for the "same" task diverge over time?
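Hypothesis 2's meta-eval can be sketched as a frozen regression set: re-run the agent on the same canonical inputs periodically and flag any output that moves beyond a tolerance, even if every individual eval still passes. The baseline numbers and the `drifted` helper are illustrative:

```python
# Hypothetical "silent drift detector": compare the agent's current outputs
# on a frozen task set against a recorded baseline.
baseline_roi = {"campaign_a": 1.42, "campaign_b": 0.87}   # assumed Q1 outputs
current_roi  = {"campaign_a": 1.41, "campaign_b": 1.19}   # assumed Q3 outputs

TOLERANCE = 0.05  # relative divergence still considered "same behavior"

def drifted(baseline: dict, current: dict, tol: float) -> list:
    """Tasks whose output moved more than tol relative to baseline."""
    return [
        task for task in baseline
        if abs(current[task] - baseline[task]) / abs(baseline[task]) > tol
    ]

print(drifted(baseline_roi, current_roi, TOLERANCE))  # → ['campaign_b']
```

As Layer 3 below notes, a flag here is ambiguous on its own: it may mean the agent drifted, or that the business legitimately changed, so each flag needs human triage.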
Hypothesis 3: The "Specification-Eval Synchronization Tax"
- Teams that update specifications without updating evals experience 2-3x higher error rates
- Testable: Correlate specification change frequency with eval update frequency; measure downstream errors
Hypothesis 4: The "Adversarial Eval Advantage"
- Automatically generated adversarial evals catch more drift than human-authored evals
- Testable: A/B test teams using adversarial vs. manual evals; measure drift detection rates
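One cheap form of adversarial eval generation is boundary-value synthesis from the spec's numeric constraints, since just-outside-range inputs are what human authors most often omit. A hedged sketch, assuming a closed-range constraint (the constraint name and expected labels are hypothetical):

```python
# Hypothetical sketch: given a numeric range constraint from the spec,
# emit the boundary and just-outside values as adversarial eval cases.
def boundary_cases(name: str, lo: int, hi: int) -> list:
    """Adversarial inputs for a closed range constraint [lo, hi]."""
    return [
        (name, lo - 1, "reject"),   # just below range
        (name, lo,     "accept"),   # lower boundary
        (name, hi,     "accept"),   # upper boundary
        (name, hi + 1, "reject"),   # just above range
    ]

# Assumed spec constraint: attribution window is 30-45 days.
cases = boundary_cases("attribution_window_days", 30, 45)
for name, value, expected in cases:
    print(name, value, expected)
```

Because these cases are derived mechanically from the spec, they can be regenerated whenever the spec changes, which is what keeps their drift risk low (see Layer 5 below).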
Hypothesis 5: The "Intent Audit Trail"
- Requiring agents to document their "reasoning" (why they made specific choices) creates an auditable trail for detecting drift
- Testable: Compare drift detection when agents provide rationale vs. when they don't
Potential Research Directions
Example Framework (Hypothesis)
Multi-Layer Eval Architecture:
Layer 1: Deterministic Checks (mandatory, automated)
- Schema validation
- Range checks
- Referential integrity
- Drift risk: Low (deterministic rules rarely drift)
Layer 2: Specification Compliance (generated from spec)
- Does output satisfy all documented constraints?
- Drift risk: Medium (if spec updates, these must update)
Layer 3: Historical Consistency (regression detection)
- Does output for same input match historical baseline (within tolerance)?
- Drift risk: High (catches drift but can create false positives if business legitimately changed)
Layer 4: Intent Alignment (stakeholder validation)
- Do business stakeholders accept the output as correct?
- Drift risk: Very High (human judgment can drift independently of spec)
Layer 5: Adversarial Edge Cases (auto-generated)
- Does agent handle pathological inputs correctly?
- Drift risk: Low (generated from spec, can auto-update)
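The five layers above could be wired as an ordered pipeline where each automated layer is a predicate over the agent's output. This is an illustrative skeleton under assumed check functions, not a real framework:

```python
# Minimal sketch of the layered harness as an ordered pipeline.
# All check functions and thresholds here are hypothetical.
def schema_ok(output):      return isinstance(output.get("roi"), float)   # Layer 1
def spec_compliant(output): return output.get("window_days") in (30, 45)  # Layer 2
def consistent(output):     return abs(output["roi"] - 1.4) < 0.1         # Layer 3 (vs. baseline)

LAYERS = [
    ("deterministic", schema_ok),
    ("spec_compliance", spec_compliant),
    ("historical_consistency", consistent),
    # Layers 4-5 (stakeholder sign-off, adversarial cases) run out of band.
]

def run_harness(output: dict) -> list:
    """Return the names of layers the output fails; empty means pass."""
    return [name for name, check in LAYERS if not check(output)]

print(run_harness({"roi": 1.42, "window_days": 30}))  # → []
print(run_harness({"roi": 2.9,  "window_days": 60}))  # → ['spec_compliance', 'historical_consistency']
```

Reporting which layer failed, rather than a single pass/fail bit, is what makes drift triage tractable: a Layer 2 failure points at a spec-sync problem, a Layer 3 failure at behavioral drift or legitimate business change.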
Success Criteria for Answering This Question
We will know we've made progress when we can:
- Automatically flag "eval debt" — when specification changes haven't propagated to evals
- Measure "intent coverage" — what % of business constraints are actually tested
- Detect drift proactively — before stakeholders notice wrong outputs
- Provide tooling that makes "keep evals in sync with specs" a low-friction process
Cross-References