Improve evals #2100

Open
dgageot wants to merge 5 commits into docker:main from dgageot:improve-evals-5
Conversation

@dgageot
Member

@dgageot dgageot commented Mar 13, 2026

  • Validate the LLM-as-a-judge model upfront
  • Disable thinking for the LLM-as-a-judge model
  • Remove duplicated creation of the judge provider
  • Remove handoffs from evals

dgageot added 3 commits March 13, 2026 20:49
Signed-off-by: David Gageot <david.gageot@docker.com>
Signed-off-by: David Gageot <david.gageot@docker.com>
Signed-off-by: David Gageot <david.gageot@docker.com>
@dgageot dgageot requested a review from a team as a code owner March 13, 2026 19:51
Signed-off-by: David Gageot <david.gageot@docker.com>
@docker-agent docker-agent bot left a comment

Review Summary

Assessment: 🟡 NEEDS ATTENTION

This PR improves evaluation robustness by validating the judge model upfront and streamlining error handling. The changes are well-structured and achieve the stated goals of failing fast on configuration issues.
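As an illustration of the fail-fast idea, here is a minimal sketch of validating a judge before any evaluation case runs. It is not the PR's code: the `Judge` interface, `staticJudge`, `validateJudge`, and `runEvals` are all hypothetical names chosen for this example, and the "ping"/"pong" probe is an assumed way to exercise the judge once.

```go
package main

import (
	"errors"
	"fmt"
)

// Judge is a hypothetical stand-in for the judge model provider.
type Judge interface {
	// Score returns a relevance score for an answer to a question.
	Score(question, answer string) (float64, error)
}

// staticJudge is a fake judge used only for this sketch.
// An empty model name simulates a misconfigured judge.
type staticJudge struct {
	model string
	score float64
}

func (j staticJudge) Score(_, _ string) (float64, error) {
	if j.model == "" {
		return 0, errors.New("judge model is not configured")
	}
	return j.score, nil
}

// validateJudge sends one trivial probe so that configuration problems
// surface before any evaluation work starts.
func validateJudge(j Judge) error {
	if j == nil {
		return errors.New("no judge configured")
	}
	if _, err := j.Score("ping", "pong"); err != nil {
		return fmt.Errorf("judge validation failed: %w", err)
	}
	return nil
}

// runEvals validates the judge first, then scores each (question, answer)
// pair. It returns how many cases were scored.
func runEvals(j Judge, cases [][2]string) (int, error) {
	if err := validateJudge(j); err != nil {
		return 0, err
	}
	scored := 0
	for _, c := range cases {
		if _, err := j.Score(c[0], c[1]); err != nil {
			return scored, err
		}
		scored++
	}
	return scored, nil
}

func main() {
	cases := [][2]string{{"q1", "a1"}, {"q2", "a2"}}

	n, err := runEvals(staticJudge{model: "example-model", score: 1}, cases)
	fmt.Println(n, err)

	// A misconfigured judge is rejected before any case is scored.
	n, err = runEvals(staticJudge{}, cases)
	fmt.Println(n, err)
}
```

With this shape, a bad judge configuration costs one probe call instead of a full evaluation run.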

Key Finding

Medium Severity: The new error handling in CheckRelevance changes the system's fault-tolerance behavior. The fail-fast approach is intentional and catches configuration errors early, but it also means that transient errors (network blips, rate limits) now abort the entire evaluation instead of being retried or logged as warnings.
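One possible way to keep fail-fast behavior for configuration errors while tolerating transient ones is to retry only errors explicitly marked as transient. The sketch below is an assumption-laden illustration, not the behavior of CheckRelevance: `ErrTransient`, `ErrConfig`, and `retryTransient` are hypothetical names, and how real provider errors would be classified as transient is left open.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors for this sketch only.
var (
	// ErrTransient marks failures worth retrying (rate limits, network blips).
	ErrTransient = errors.New("transient judge error")
	// ErrConfig marks failures that retrying cannot fix.
	ErrConfig = errors.New("invalid judge configuration")
)

// retryTransient calls fn up to maxAttempts times. It retries only when the
// error wraps ErrTransient, doubling the backoff between attempts, and
// returns any other error immediately. It reports how many attempts ran.
func retryTransient(maxAttempts int, backoff time.Duration, fn func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, ErrTransient) {
			// Fail fast: configuration and other permanent errors are not retried.
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	attempts, err := retryTransient(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("rate limited: %w", ErrTransient)
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)

	attempts, err = retryTransient(3, 10*time.Millisecond, func() error {
		return ErrConfig
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Whether retries are worth the added complexity depends on how often evaluations hit rate limits in practice; logging the transient failure and continuing would be another option.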

Summary

  • ✅ Judge validation added upfront (good defensive programming)
  • ✅ Thinking disabled for judge providers (appropriate for structured evaluation)
  • ✅ Duplicate judge provider creation removed
  • ⚠️ Error handling now fails fast on any error, including transient ones
  • ✅ Progress display improvements
  • ✅ Handoff metrics removed

The code quality is good overall. The one concern is whether the reduced fault tolerance for transient errors is acceptable for your use case.

derekmisler previously approved these changes Mar 13, 2026
Signed-off-by: David Gageot <david.gageot@docker.com>