feat(design-audit): Gen 2 + Gen 3 — context-aware, measurement-grounded, ROI-ranked#33
feat(design-audit): Gen 2 + Gen 3 — context-aware, measurement-grounded, ROI-ranked#33
Conversation
Replaces hardcoded profile rubrics with auto-classification, composable markdown rubric fragments, and ground-truth measurements (axe-core + real WCAG contrast math). The LLM is restricted to subjective visual judgment; everything measurable is now measured. Architecture (src/design/audit/): - types.ts — all audit types - classify.ts — single LLM call for type/domain/framework/maturity - rubric/loader.ts — composes rubric from markdown fragments by predicate - rubric/fragments/ — 12 markdown fragments (universal + type + domain + maturity) - measure/contrast.ts — WCAG 2.1 contrast math (in-page, deterministic) - measure/a11y.ts — axe-core wrapper - measure/index.ts — gathers measurements in parallel - evaluate.ts — composes classification + rubric + measurements + LLM vision - pipeline.ts — orchestrator Key wins: - Auto-classifies pages (no more "wrong profile" failures) - Drop a markdown file in fragments/ to add a domain — no code change - Real contrast measurement: 720 elements checked on Stripe, 32 actual AA failures - Real axe violations: catches missing button text, link text, list semantics - Accessibility dimension in design system score reflects measured truth - Overall score still reflects visual quality (LLM job) — both visible Calibration preserved 5/5 vs Gen 1 baseline: Stripe 9, Apple 9, Linear 9, Anthropic 8, Airbnb 8 (overall scores unchanged) Apple a11y dim: 4/10 (real) | Linear a11y dim: 4/10 (real) CLI: - `bad design-audit --url X` runs Gen 2 by default (auto-classifies) - `bad design-audit --url X --gen 1` falls back to legacy hardcoded profiles - `bad design-audit --url X --profile marketing` overrides classification - New: `--rubrics-dir <path>` for user-supplied rubric fragments Tests: - 27 new unit tests covering rubric loader, fragment parsing, predicate evaluation, measurements-to-findings mapping, severity mapping - 662 total tests pass Build: - New copy-rubric-fragments.mjs script copies .md files to dist/ post-tsc - axe-core 4.11.2 added as runtime dependency
Gen 2 added context (auto-classification) and truth (real measurements). Gen 3 makes the output actionable: every finding gets impact/effort/blast, the report opens with "Top 5 Fixes by ROI", and findings appearing across multiple pages collapse into a single systemic finding with elevated blast. Architecture (src/design/audit/): - roi.ts (new) — pure-function ROI scoring + cross-page systemic detection - evaluate.ts — LLM prompt now asks for impact/effort/blast on each finding; measurements get derived defaults (contrast: system/effort=1, axe: component/effort=3) - measure/a11y.ts — CSP-bypass injection ladder (addScriptTag → CDP → eval) - rubric/loader.ts — fragments can declare a custom dimension; composed rubric exposes deduped dimensions[] - types.ts — DesignFinding gains impact/effort/blast/roi/pageCount; RubricFragment gains optional dimension; ComposedRubric gains dimensions[] Rubric updates (per-fragment dimensions): - domain-fintech: dimension=trust-signals - domain-crypto: dimension=trust-signals - type-docs: dimension=readability - type-ecommerce: dimension=conversion CLI integration: - Cross-page systemic detection runs after all pages audited - Top 5 fixes (by ROI) computed via topByRoi() on the deduped set - generateReport() opens with "Top Fixes (by ROI)" section showing ROI, impact, effort, blast, location, fix, and CSS for each - JSON output exposes topFixes[] Live verification: - 3-page Stripe audit collapsed a contrast issue appearing on /, /pricing into a single [appears on 2 pages] systemic finding - Stripe classified as fintech → trust-signals dimension automatically scored - Airbnb classified as ecommerce → conversion dimension automatically scored - All 5 reference sites preserve calibration (Gen 1 = Gen 2 = Gen 3 scores) Tests: - 24 new ROI unit tests (computeRoi formulas, cross-page dedup, normalization, ranking stability, edge cases) - 4 new dimension tests (frontmatter parsing, composition, dedup) - 686 total tests pass (was 662) ROI formula: (impact * blastWeight) / max(effort, 1) Blast weights: page=1, section=1.25, component=1.75, system=2.5
The Gen 3 PR shipped ROI ranking but had three rough edges that this commit
fixes, plus completes the verification work that was outstanding.
## Measurement grouping (the big one)
Contrast and a11y findings now group BEFORE becoming findings:
- Contrast: by (color, background) pair. A site with 47 elements failing
the same gray gets ONE finding ("change --text-muted, affects 47 elements")
with blast scaled by element count (≥5 = system, 2-4 = component, 1 = page).
- axe: by rule id. A site with 8 buttons missing accessible names gets ONE
button-name finding with a node count, not 8 spammy entries.
This dramatically improves Top Fixes signal-to-noise. On Stripe: previously
showed 5 copies of the same contrast issue; now shows 5 distinct color pair
mismatches each affecting multiple elements (8, 5, 4, 3, 2).
## Effort calibration anchor
New universal rubric fragment `universal-effort-anchor.md` defining the
1-10 scale for effort/impact/blast with concrete anchors (effort=1 is one
CSS value, effort=5 is editing 3 components, effort=10 is full redesign).
The LLM now has a shared definition instead of guessing.
## Gen 1 audit code deleted
Removed the legacy fallback path:
- PROFILE_RUBRICS Record (lines 22-249, ~227 lines)
- buildAuditPrompt function
- auditSinglePage function (~80 lines)
- --gen flag from cli.ts and DesignAuditOptions
- generation === 1/2 branching in runDesignAudit, reproducibility loop
cli-design-audit.ts: 2418 → 2106 lines (-312 net, after grouping additions)
## End-to-end evolve loop verification (the killer feature)
This was untested across the Gen 1 and Gen 2 reflections.
Built /tmp/bad-design-test fixture with deliberately bad HTML+CSS.
Ran:
bad design-audit --url http://localhost:8765 --evolve claude-code \
--evolve-rounds 2 --project-dir /tmp/bad-design-test
Result: 3.0 → 5.0 (+2.0) over 2 rounds.
Claude Code rewrote the actual source files (index.html + src/styles.css):
- Added a CSS variable system (--brand, --text, --radius-*, --shadow-*)
- Hero gradient text fill, eyebrow badge, dual CTAs, lede paragraph
- Cards got SVG icons + real copy + hover transforms
- Sticky header with backdrop blur, multi-column footer
- Responsive grid breakpoint
9 of the top 10 ROI findings were fixed in round 1.
This validates both the agent dispatch architecture AND the ROI ranking.
## Reproducibility on Gen 3
Stripe reproducibility test: 9.0 / 9.0 / 9.0 — stddev 0.00.
Target was ±0.5. Gen 3 grouping makes scores MORE stable than Gen 1
because there's less per-finding variance.
## Documentation
Updated docs/guides/design-audit.md with Gen 3 architecture: pipeline
explanation, ROI scoring, cross-page systemic detection, measurement
grouping, dynamic dimensions.
## Skill updates
design-evolve skill now references the topFixes array as the entry point
instead of telling the agent to sort findings by severity from scratch.
Memory entries written for Gen 3 architecture and grouping behavior.
## Tests
Updated 2 measurement tests for the new grouping behavior. Added 2 new
tests for blast scaling. 688 tests pass (was 686).
Update: Gen 3 polish — closing the gaps from the self-reviewPushed Big fix: measurement groupingTop Fixes was previously dominated by 5 copies of the same contrast issue. Now contrast and a11y findings group BEFORE becoming findings:
On Stripe: Top 5 now shows 5 distinct color pair mismatches affecting 8, 5, 4, 3, 2 elements respectively. Real signal, not duplicates. Effort calibration anchorNew Killer feature verified end-to-endThe Claude Code rewrote actual source files:
9 of the top 10 ROI findings were fixed in round 1. ROI ranking validated. Gen 1 audit code deleted
Reproducibility re-validated on Gen 3Stripe: 9.0 / 9.0 / 9.0 — stddev 0.00 (target was ±0.5). Gen 3 grouping makes scores MORE stable than Gen 1. Docs + skill + memory
Final calibration (3 generations, 6 sites)
5/5 reference sites preserved. Bad fixture improved 2 points via real source edits. Tests688 pass on stable runs (was 662 on main). 2 integration tests are pre-existing flaky (browser timeouts in |
Summary
Two generations of design-audit redesign on one branch.
Gen 2 replaced 5 hardcoded profiles with auto-classification + composable markdown rubric fragments, and added ground-truth measurements (axe-core + real WCAG 2.1 contrast math).
Gen 3 made the output actionable: every finding gets an ROI score (impact × blast / effort), the report opens with a Top 5 Fixes section, and findings appearing on multiple pages are auto-detected as systemic.
What's new
Architecture (
src/design/audit/)types.tsclassify.tsrubric/loader.tsrubric/fragments/measure/contrast.tsmeasure/a11y.tsmeasure/index.tsevaluate.tsroi.tspipeline.tsCLI surface
Reports now include
[appears on N pages]when cross-page dedup finds repeated issuestrust-signalsscore, ecommerce getsconversion, docs getreadabilityCalibration (5/5 preserved across all 3 generations)
The drop in Apple/Linear's a11y dimension reflects ground truth — Linear has 152 actual WCAG AA contrast failures and 5 axe violations. Gen 1 missed all of them. The overall score still reflects visual quality (LLM's job); the a11y dimension reflects measured truth (axe + math). Both visible in the same report.
Live verification
stripe.com,/pricing,/contact/sales) collapsed a#635bffcontrast issue appearing on 2 pages into a single[appears on 2 pages]systemic finding (top-4 by ROI)marketing/fintech/world-classat 0.99 confidence; gottrust-signalsdimension automaticallyecommerce; gotconversiondimension automaticallyTest plan
pnpm build— cleanpnpm check:boundaries— passes (77 files)pnpm test— 686 tests pass (was 635 on main; +51 new tests)bad design-audit --url <your-app>and check the Top Fixes section--gen 1 --profile marketingto verify the legacy path still works as a fallbackChanges summary
axe-core@4.11.2scripts/copy-rubric-fragments.mjs(copies .md fragments to dist/).evolve/pursuits/2026-04-06-design-audit-gen2.md,gen3.mdReversibility
Gen 1 audit code is intact behind
--gen 1. New code is additive —src/design/audit/modules are independent of the legacy file. Easy to cherry-pick or revert individual pieces.