## Context
An audit of PR review patterns across 7 MongoDB Agent Skills identified 10 instances of consistency issues or contradictions across 4 PRs: content within a skill contradicts itself, presents different things with misleading equivalence, or includes examples that don't align narratively.
## Proposed LLM judge check
Add a consistency check that evaluates three specific areas:
### Check 1: Metadata consistency between SKILL.md and reference files
If SKILL.md contains a summary table with impact/priority levels (e.g., "HIGH", "CRITICAL", "MEDIUM"), verify these match the impact levels stated in the individual reference files.
Example (from PR 7): Impact levels in SKILL.md quick reference table (all HIGH for fundamentals) were inconsistent with individual reference files (CRITICAL, MEDIUM).
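A sketch of how this check could be mechanized. The summary-table row format (`| topic | LEVEL |`) and the `Impact: LEVEL` declaration are assumptions made for illustration, not the actual skill file layout:

```python
import re

# Hypothetical helper for Check 1. The table-row and "Impact:" formats
# below are assumptions for illustration, not the actual skill layout.

LEVELS = r"(CRITICAL|HIGH|MEDIUM|LOW)"

def table_impact_levels(skill_md: str) -> dict:
    """Extract {topic: level} from summary rows like `| schema-validation | HIGH |`."""
    return dict(re.findall(r"\|\s*([\w-]+)\s*\|\s*" + LEVELS + r"\s*\|", skill_md))

def file_impact_level(reference_md: str):
    """Find a declaration like `Impact: CRITICAL` in a reference file."""
    m = re.search(r"Impact:\s*" + LEVELS, reference_md)
    return m.group(1) if m else None

def find_mismatches(skill_md: str, reference_files: dict) -> list:
    """Return (topic, table_level, file_level) tuples that disagree."""
    table = table_impact_levels(skill_md)
    mismatches = []
    for topic, body in reference_files.items():
        file_level = file_impact_level(body)
        if topic in table and file_level not in (None, table[topic]):
            mismatches.append((topic, table[topic], file_level))
    return mismatches
```

In practice the judge would do this comparison in natural language rather than with regexes; the point is only that the check reduces to "extract both sets of levels, report any topic where they disagree."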
### Check 2: Example narrative continuity
Within a single reference file, verify that examples in a walkthrough or tutorial use consistent field and variable names throughout. Flag cases where an example introduces fields (e.g., `name`, `email`) but subsequent code in the same walkthrough uses different fields (e.g., `address`) without explanation.
Example (from PR 7): Schema drift examples use `name`/`email`, but the migration code uses an `address` field.
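One way this could be approximated mechanically. This is a heuristic sketch; extracting "fields" as quoted identifiers inside fenced code blocks is an assumption, not the judge's actual parsing:

```python
import re

# Heuristic sketch for Check 2: flag a code block whose quoted field
# names share nothing with the fields used earlier in the walkthrough.
# Extracting fields as quoted identifiers is a crude assumption.

FENCE = re.compile(r"```[\w]*\n(.*?)```", re.DOTALL)
FIELD = re.compile(r"[\"']([A-Za-z_]\w*)[\"']")

def blocks(markdown: str) -> list:
    """Return the bodies of all fenced code blocks."""
    return FENCE.findall(markdown)

def field_drift(markdown: str) -> list:
    """Indices of code blocks whose fields are disjoint from all earlier blocks."""
    seen = set()
    flagged = []
    for i, block in enumerate(blocks(markdown)):
        fields = set(FIELD.findall(block))
        if seen and fields and not (fields & seen):
            flagged.append(i)
        seen |= fields
    return flagged
```

On a walkthrough whose first block inserts `name`/`email` and whose second block sets only `address`, this flags the second block, which is the PR 7 case.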
### Check 3: "Prefer X over Y" without caveats
Flag directive statements that recommend one approach over another without noting when the less-preferred approach is actually correct.
Example (from PR 2): "Prefer `$ne: null` over `$exists: false`" without noting the semantic difference (fields with explicit `null` values behave differently).
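The semantic difference is easy to demonstrate. The sketch below simulates MongoDB's matching rules over plain dicts (a simulation, not a driver call): a missing field compares equal to `null`, so `$ne: null`, `$exists: false`, and a plain `{field: null}` filter each select a different set of documents.

```python
# Simulation of MongoDB matching semantics over plain dicts.
docs = [
    {"_id": 1, "email": "ada@example.com"},  # present, non-null
    {"_id": 2, "email": None},               # present, explicit null
    {"_id": 3},                              # field absent
]

def exists_false(doc, field):
    # {field: {$exists: false}} -> only documents missing the field
    return field not in doc

def ne_null(doc, field):
    # {field: {$ne: null}} -> field present AND not null
    # (a missing field compares equal to null, so it is excluded too)
    return doc.get(field) is not None

def eq_null(doc, field):
    # {field: null} -> explicit null OR missing field
    return doc.get(field) is None

def ids(pred):
    return {d["_id"] for d in docs if pred(d, "email")}

print(ids(exists_false))  # {3}
print(ids(ne_null))       # {1}
print(ids(eq_null))       # {2, 3}
```

Document 2 (explicit `null`) is excluded by both `$exists: false` and `$ne: null`; only the plain `{email: null}` filter matches it. This is exactly the distinction a caveat should spell out.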
## Output format
Consistency issues found:
- SKILL.md line 34 says "HIGH" impact for schema-validation, but schema-validation.md says "CRITICAL"
- walkthrough.md: lines 10-25 use field "name" but lines 40-55 switch to "address" without transition
- Line 78: "Prefer X over Y" lacks a caveat about when Y is the correct choice
## Examples from PR reviews
| PR | Issue |
| --- | --- |
| 7 | Impact levels in quick reference table inconsistent with reference files |
| 7 | Schema drift examples use `name`/`email` but migration code uses `address` field |
| 7 | Soft 1MB guideline presented with same weight as hard 16MB limit |
| 5 | Language patterns reference describes "code examples" that don't exist in the referenced file |
| 2 | "Prefer `$ne: null` over `$exists: false`" without noting semantic difference |
## Related
Part of a series of LLM judge enhancements derived from PR review pattern analysis. Requires the judge to read multiple files within a skill, so implementation should consider whether this is a separate pass or integrated into per-file scoring.