Summary
Built-in evaluation system: define test suites per agent with input/expected-output pairs, run evaluations on demand or on config change, track quality scores over time. Support multiple scorer types.
Motivation
W&B Weave, Strongly.AI, and LangSmith all ship built-in evals. MATE has 134 unit tests but no tests of agent behavior. Without evals, users can't measure whether a config change improved or degraded the agent.
Scope
- New agent_evaluations and evaluation_results tables
- Test suite definition: a list of (input, expected_output, scorer_type) cases per agent
- Scorer types: exact match, regex, contains, semantic similarity, LLM-as-judge
- Run evaluations: on demand, on config save (optional), via API
- Dashboard eval results page: pass/fail rates, score trends over time, per-case details
- CI hook: python -m mate.eval --agent <name> --suite <suite> for pipeline integration
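The two new tables could look like the sketch below. This is an illustrative assumption, not a committed design: the column names and the sqlite3 backend are placeholders chosen for the example.

```python
import sqlite3

# Hypothetical schema sketch for the two proposed tables; every column
# name here is an illustrative assumption, not part of the proposal.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_evaluations (
    id          INTEGER PRIMARY KEY,
    agent_name  TEXT NOT NULL,
    suite_name  TEXT NOT NULL,
    started_at  TEXT NOT NULL,   -- ISO-8601 timestamp
    config_hash TEXT             -- config snapshot the run was made against
);

CREATE TABLE IF NOT EXISTS evaluation_results (
    id            INTEGER PRIMARY KEY,
    evaluation_id INTEGER NOT NULL REFERENCES agent_evaluations(id),
    case_input    TEXT NOT NULL,
    expected      TEXT NOT NULL,
    actual        TEXT,
    scorer_type   TEXT NOT NULL,
    score         REAL NOT NULL,  -- 0.0-1.0; 1.0 = pass for binary scorers
    passed        INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Keeping runs (agent_evaluations) separate from per-case rows (evaluation_results) is what lets the dashboard plot score trends over time: one row per run, joined to its cases.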
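A minimal sketch of the deterministic scorer types and a suite run. Function and field names are illustrative assumptions; the semantic-similarity and LLM-as-judge scorers would require a model call and are omitted here.

```python
import re
from dataclasses import dataclass

# Deterministic scorers; each maps (expected, actual) to a score in [0.0, 1.0].
# Hypothetical names -- the real registry may differ.
SCORERS = {
    "exact": lambda expected, actual: float(expected == actual),
    "contains": lambda expected, actual: float(expected in actual),
    "regex": lambda expected, actual: float(re.search(expected, actual) is not None),
}

@dataclass
class EvalCase:
    input: str
    expected_output: str
    scorer_type: str = "exact"

def run_suite(cases, agent_fn, threshold=1.0):
    """Score each case's agent output; return (pass_rate, per_case_results).

    agent_fn is a stand-in for invoking the configured agent.
    """
    results = []
    for case in cases:
        actual = agent_fn(case.input)
        score = SCORERS[case.scorer_type](case.expected_output, actual)
        results.append({"input": case.input, "score": score,
                        "passed": score >= threshold})
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return pass_rate, results
```

Usage under the same assumptions: `run_suite(suite, agent_fn)` returns the suite-level pass rate plus per-case rows, matching the pass/fail and per-case detail views the dashboard needs.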
Acceptance Criteria