Summary
Built-in evaluation system: define test suites per agent with input/expected-output pairs, run evaluations on demand or on config change, track quality scores over time. Support multiple scorer types.
Motivation
W&B Weave, Strongly.AI, and LangSmith all ship built-in evals. MATE has 134 unit tests but no tests of agent behavior. Without evals, users can't measure whether a config change improved or degraded the agent.
Scope
- New agent_evaluations and evaluation_results tables
- Test suite definition: a list of (input, expected_output, scorer_type) cases per agent
- Scorer types: exact match, regex, contains, semantic similarity, LLM-as-judge
- Run evaluations: on demand, on config save (optional), via API
- Dashboard eval results page: pass/fail rates, score trends over time, per-case details
- CI hook: python -m mate.eval --agent <name> --suite <suite> for pipeline integration
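The two new tables could look like the sketch below. This is an illustrative assumption, not a committed design: the column names and the sqlite3 backend are placeholders chosen for the example.

```python
import sqlite3

# Hypothetical schema sketch for the two proposed tables; every column
# name here is an illustrative assumption, not part of the proposal.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_evaluations (
    id          INTEGER PRIMARY KEY,
    agent_name  TEXT NOT NULL,
    suite_name  TEXT NOT NULL,
    started_at  TEXT NOT NULL,   -- ISO-8601 timestamp
    config_hash TEXT             -- config snapshot the run was made against
);

CREATE TABLE IF NOT EXISTS evaluation_results (
    id            INTEGER PRIMARY KEY,
    evaluation_id INTEGER NOT NULL REFERENCES agent_evaluations(id),
    case_input    TEXT NOT NULL,
    expected      TEXT NOT NULL,
    actual        TEXT,
    scorer_type   TEXT NOT NULL,
    score         REAL NOT NULL,  -- 0.0-1.0; 1.0 = pass for binary scorers
    passed        INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Keeping runs (agent_evaluations) separate from per-case rows (evaluation_results) is what lets the dashboard plot score trends over time: one row per run, joined to its cases.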
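A minimal sketch of the deterministic scorer types and a suite run. Function and field names are illustrative assumptions; the semantic-similarity and LLM-as-judge scorers would require a model call and are omitted here.

```python
import re
from dataclasses import dataclass

# Deterministic scorers; each maps (expected, actual) to a score in [0.0, 1.0].
# Hypothetical names -- the real registry may differ.
SCORERS = {
    "exact": lambda expected, actual: float(expected == actual),
    "contains": lambda expected, actual: float(expected in actual),
    "regex": lambda expected, actual: float(re.search(expected, actual) is not None),
}

@dataclass
class EvalCase:
    input: str
    expected_output: str
    scorer_type: str = "exact"

def run_suite(cases, agent_fn, threshold=1.0):
    """Score each case's agent output; return (pass_rate, per_case_results).

    agent_fn is a stand-in for invoking the configured agent.
    """
    results = []
    for case in cases:
        actual = agent_fn(case.input)
        score = SCORERS[case.scorer_type](case.expected_output, actual)
        results.append({"input": case.input, "score": score,
                        "passed": score >= threshold})
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return pass_rate, results
```

Usage under the same assumptions: `run_suite(suite, agent_fn)` returns the suite-level pass rate plus per-case rows, matching the pass/fail and per-case detail views the dashboard needs.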
Acceptance Criteria