Epistemic Question
Core Question: How do we design evaluation harnesses that detect when agent behavior drifts from the original specification intent, especially when the specification itself evolves over time?
Why This Matters
The HFIS Practical Guide emphasizes:
"Use AI for evaluation, not just generation. Most teams use AI to produce drafts but never to check them."
And Agile Agentic Analytics recommends:
"Every recurring agent failure becomes a backlog item: add a deterministic check, add a rule to AGENTS.md/CLAUDE.md, add a sandbox boundary, or add an eval case."
But there's a dynamic system problem:
- Specifications evolve as business requirements change
- Agents learn from feedback and adjust behavior
- Evaluation criteria themselves might become stale
- New failure modes emerge that weren't anticipated in original evals
Result: Evaluation drift — where evals pass but outputs no longer align with true intent.
The Specification-Eval Co-Evolution Problem
Traditional software testing: Tests are deterministic. If tests pass, the code does what the tests specify.
Agentic testing paradox:
- Evals test alignment with specification, but what if the specification is incomplete or outdated?
- Agents optimize for eval pass rate, but what if evals miss critical constraints?
- Specifications evolve, but do evals evolve in sync?
Example failure mode:
```
# Original Specification (Q1 2025)
"Calculate campaign ROI using 30-day attribution window"

# Eval (Q1 2025)
✓ Uses 30-day window
✓ Calculates ROI formula correctly

# Business Reality Changes (Q3 2025)
- Multi-touch attribution now required (not just last-touch)
- Attribution window extended to 45 days for some products

# Specification Updated (Q3 2025)
"Calculate campaign ROI using product-specific attribution windows (30d for cash, 45d for retirement) with multi-touch attribution"

# Eval Status (Q3 2025)
⚠️ Still testing old spec → passes, but outputs are wrong
```
The drift is silent. No one notices until a stakeholder questions the numbers.
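This kind of silent drift can be made loud with a cheap deterministic check: pin the eval suite to a fingerprint of the spec it was authored against, and fail CI when the two diverge. A minimal sketch, assuming the spec lives as text under version control (the spec strings and the `eval_suite_is_stale` helper are hypothetical):

```python
import hashlib

def spec_hash(spec_text: str) -> str:
    """Stable fingerprint of the specification text."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()[:12]

# Hypothetical Q1 spec and the hash the eval suite was authored against.
SPEC_Q1 = "Calculate campaign ROI using 30-day attribution window"
EVAL_PINNED_HASH = spec_hash(SPEC_Q1)

# Q3: the spec changes, but the eval suite's pin does not.
SPEC_Q3 = ("Calculate campaign ROI using product-specific attribution windows "
           "(30d for cash, 45d for retirement) with multi-touch attribution")

def eval_suite_is_stale(current_spec: str, pinned_hash: str) -> bool:
    """True when the spec has changed since the evals were last reviewed."""
    return spec_hash(current_spec) != pinned_hash

print(eval_suite_is_stale(SPEC_Q3, EVAL_PINNED_HASH))  # → True: evals test an old spec
```

A failing check does not say the evals are wrong, only that no one has re-reviewed them against the current spec, which is exactly the signal the Q3 scenario lacked.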
Open Questions to Explore
- Specification versioning: How do we version-control specifications in a way that propagates to eval updates automatically?
- Eval completeness metrics: How do we know if an eval suite has "sufficient coverage" of the specification? (Analogous to code coverage, but for intent)
- Adversarial eval generation: Can we automatically generate "adversarial evals" — tests designed to catch specification ambiguities or edge cases the human author didn't anticipate?
- Eval regression detection: How do we detect when an eval has become stale relative to the current specification?
- Intent vs. implementation drift: How do we test that agent behavior aligns with intent (what we actually want) vs. specification (what we wrote down)?
- Multi-stakeholder intent: When specifications reflect compromises between conflicting stakeholder priorities, how do evals capture the trade-offs?
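The "intent coverage" metric from the completeness question above can be prototyped with nothing more than a mapping from spec constraints to the eval cases that exercise them; the constraint names and eval IDs below are hypothetical:

```python
# Hypothetical sketch of an "intent coverage" metric: each documented
# constraint in the spec is mapped to the eval cases that exercise it.
spec_constraints = {
    "attribution_window_cash_30d": ["eval_roi_cash_window"],
    "attribution_window_retirement_45d": [],          # no eval yet
    "multi_touch_attribution": [],                    # no eval yet
    "roi_formula": ["eval_roi_formula", "eval_roi_zero_spend"],
}

covered = [c for c, evals in spec_constraints.items() if evals]
coverage = len(covered) / len(spec_constraints)

print(f"intent coverage: {coverage:.0%}")             # → intent coverage: 50%
for c, evals in spec_constraints.items():
    if not evals:
        print(f"UNTESTED constraint: {c}")
```

The hard part, of course, is extracting the constraint list from a prose spec in the first place; the metric only measures what has been made explicit.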
Hypotheses to Test
Hypothesis 1: The "Eval Coverage Gap"
- Most eval suites test "happy path" and known failure modes, but miss emergent edge cases
- Testable: Measure what % of specification constraints have corresponding eval cases
Hypothesis 2: The "Silent Drift Detector"
- We can build meta-evals that compare agent outputs across time to detect drift (even when individual evals pass)
- Testable: Historical analysis — do agent outputs for the "same" task diverge over time?
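Hypothesis 2's meta-eval can be sketched as a frozen regression set: re-run the agent on the same canonical inputs periodically and flag any output that moves beyond a tolerance, even if every individual eval still passes. The baseline numbers and the `drifted` helper are illustrative:

```python
# Hypothetical "silent drift detector": compare the agent's current outputs
# on a frozen task set against a recorded baseline.
baseline_roi = {"campaign_a": 1.42, "campaign_b": 0.87}   # assumed Q1 outputs
current_roi  = {"campaign_a": 1.41, "campaign_b": 1.19}   # assumed Q3 outputs

TOLERANCE = 0.05  # relative divergence still considered "same behavior"

def drifted(baseline: dict, current: dict, tol: float) -> list:
    """Tasks whose output moved more than tol relative to baseline."""
    return [
        task for task in baseline
        if abs(current[task] - baseline[task]) / abs(baseline[task]) > tol
    ]

print(drifted(baseline_roi, current_roi, TOLERANCE))  # → ['campaign_b']
```

As Layer 3 below notes, a flag here is ambiguous on its own: it may mean the agent drifted, or that the business legitimately changed, so each flag needs human triage.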
Hypothesis 3: The "Specification-Eval Synchronization Tax"
- Teams that update specifications without updating evals experience 2-3x higher error rates
- Testable: Correlate specification change frequency with eval update frequency; measure downstream errors
Hypothesis 4: The "Adversarial Eval Advantage"
- Automatically generated adversarial evals catch more drift than human-authored evals
- Testable: A/B test teams using adversarial vs. manual evals; measure drift detection rates
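One cheap form of adversarial eval generation is boundary-value synthesis from the spec's numeric constraints, since just-outside-range inputs are what human authors most often omit. A hedged sketch, assuming a closed-range constraint (the constraint name and expected labels are hypothetical):

```python
# Hypothetical sketch: given a numeric range constraint from the spec,
# emit the boundary and just-outside values as adversarial eval cases.
def boundary_cases(name: str, lo: int, hi: int) -> list:
    """Adversarial inputs for a closed range constraint [lo, hi]."""
    return [
        (name, lo - 1, "reject"),   # just below range
        (name, lo,     "accept"),   # lower boundary
        (name, hi,     "accept"),   # upper boundary
        (name, hi + 1, "reject"),   # just above range
    ]

# Assumed spec constraint: attribution window is 30-45 days.
cases = boundary_cases("attribution_window_days", 30, 45)
for name, value, expected in cases:
    print(name, value, expected)
```

Because these cases are derived mechanically from the spec, they can be regenerated whenever the spec changes, which is what keeps their drift risk low (see Layer 5 below).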
Hypothesis 5: The "Intent Audit Trail"
- Requiring agents to document their "reasoning" (why they made specific choices) creates an auditable trail for detecting drift
- Testable: Compare drift detection when agents provide rationale vs. when they don't
Potential Research Directions
Example Framework (Hypothesis)
Multi-Layer Eval Architecture:
Layer 1: Deterministic Checks (mandatory, automated)
- Schema validation
- Range checks
- Referential integrity
- Drift risk: Low (deterministic rules rarely drift)
Layer 2: Specification Compliance (generated from spec)
- Does output satisfy all documented constraints?
- Drift risk: Medium (if spec updates, these must update)
Layer 3: Historical Consistency (regression detection)
- Does output for same input match historical baseline (within tolerance)?
- Drift risk: High (catches drift but can create false positives if business legitimately changed)
Layer 4: Intent Alignment (stakeholder validation)
- Do business stakeholders accept the output as correct?
- Drift risk: Very High (human judgment can drift independently of spec)
Layer 5: Adversarial Edge Cases (auto-generated)
- Does agent handle pathological inputs correctly?
- Drift risk: Low (generated from spec, can auto-update)
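The five layers above could be wired as an ordered pipeline where each automated layer is a predicate over the agent's output. This is an illustrative skeleton under assumed check functions, not a real framework:

```python
# Minimal sketch of the layered harness as an ordered pipeline.
# All check functions and thresholds here are hypothetical.
def schema_ok(output):      return isinstance(output.get("roi"), float)   # Layer 1
def spec_compliant(output): return output.get("window_days") in (30, 45)  # Layer 2
def consistent(output):     return abs(output["roi"] - 1.4) < 0.1         # Layer 3 (vs. baseline)

LAYERS = [
    ("deterministic", schema_ok),
    ("spec_compliance", spec_compliant),
    ("historical_consistency", consistent),
    # Layers 4-5 (stakeholder sign-off, adversarial cases) run out of band.
]

def run_harness(output: dict) -> list:
    """Return the names of layers the output fails; empty means pass."""
    return [name for name, check in LAYERS if not check(output)]

print(run_harness({"roi": 1.42, "window_days": 30}))  # → []
print(run_harness({"roi": 2.9,  "window_days": 60}))  # → ['spec_compliance', 'historical_consistency']
```

Reporting which layer failed, rather than a single pass/fail bit, is what makes drift triage tractable: a Layer 2 failure points at a spec-sync problem, a Layer 3 failure at behavioral drift or legitimate business change.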
Success Criteria for Answering This Question
We will know we've made progress when we can:
- Automatically flag "eval debt" — when specification changes haven't propagated to evals
- Measure "intent coverage" — what % of business constraints are actually tested
- Detect drift proactively — before stakeholders notice wrong outputs
- Provide tooling that makes "keep evals in sync with specs" a low-friction process
Cross-References