feat(design-audit): Gen 2 + Gen 3 — context-aware, measurement-grounded, ROI-ranked#33

Merged
drewstone merged 4 commits into main from design-audit-gen2
Apr 6, 2026
Conversation

@drewstone
Contributor

Summary

Two generations of design-audit redesign on one branch.

Gen 2 replaced 5 hardcoded profiles with auto-classification + composable markdown rubric fragments, and added ground-truth measurements (axe-core + real WCAG 2.1 contrast math).

Gen 3 made the output actionable: every finding gets an ROI score (impact × blast / effort), the report opens with a Top 5 Fixes section, and findings appearing on multiple pages are auto-detected as systemic.

What's new

Architecture (src/design/audit/)

| Module | Purpose |
| --- | --- |
| types.ts | All audit types |
| classify.ts | Single LLM call: type / domain / framework / designSystem / maturity / intent |
| rubric/loader.ts | Composes rubric from markdown fragments by predicate |
| rubric/fragments/ | 12 markdown fragments (universal + type + domain + maturity) |
| measure/contrast.ts | Pure-JS WCAG 2.1 contrast math (in-page, deterministic) |
| measure/a11y.ts | axe-core wrapper with 3-tier CSP-bypass injection |
| measure/index.ts | Parallel measurement gather |
| evaluate.ts | Composes classification + rubric + measurements + LLM vision |
| roi.ts | Gen 3 — ROI scoring, cross-page systemic detection, top-N selection |
| pipeline.ts | Orchestrator |

CLI surface

# Gen 2 default — auto-classifies, no profile flag needed
bad design-audit --url https://your-app.com

# Manual profile override (Gen 1 behavior)
bad design-audit --url https://your-app.com --profile saas

# Legacy Gen 1 audit path
bad design-audit --url https://your-app.com --gen 1 --profile marketing

# User-supplied rubric fragments
bad design-audit --url https://your-app.com --rubrics-dir ~/.bad/rubrics

Reports now include

  • Auto-classification block: type/domain/maturity for every page
  • Top Fixes (by ROI) section: 5 highest-impact fixes, sorted by impact × blast / effort
  • Systemic markers: [appears on N pages] when cross-page dedup finds repeated issues
  • Dynamic dimensions: fintech sites get a trust-signals score, ecommerce gets conversion, docs get readability
  • Measurement-driven a11y dimension: real contrast/axe data drives the accessibility score, not LLM vibes

Calibration (5/5 preserved across all 3 generations)

| Site | Gen 1 | Gen 2 | Gen 3 | A11y dim | Custom dimensions |
| --- | --- | --- | --- | --- | --- |
| Stripe | 9 | 9 | 9 | 8 | trust-signals |
| Apple | 9 | 9 | 9 | 4 (real) | |
| Linear | 9 | 9 | 9 | 4 (real) | |
| Anthropic | 8 | 8 | 8 | 7 | |
| Airbnb | 8 | 8 | 8 | 8 | conversion |

The drop in Apple/Linear's a11y dimension reflects ground truth — Linear has 152 actual WCAG AA contrast failures and 5 axe violations. Gen 1 missed all of them. The overall score still reflects visual quality (LLM's job); the a11y dimension reflects measured truth (axe + math). Both visible in the same report.
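The "real WCAG 2.1 contrast math" mentioned above can be sketched as follows. This is a minimal, illustrative version of the relative-luminance and contrast-ratio formulas from WCAG 2.1; the function names are assumptions, not the actual exports of measure/contrast.ts:

```typescript
// Linearize an sRGB channel and compute relative luminance per WCAG 2.1.
function relativeLuminance(r: number, g: number, b: number): number {
  const lin = (c: number) => {
    const s = c / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// Contrast ratio = (L_lighter + 0.05) / (L_darker + 0.05), range 1:1 to 21:1.
function contrastRatio(
  fg: [number, number, number],
  bg: [number, number, number],
): number {
  const [l1, l2] = [relativeLuminance(...fg), relativeLuminance(...bg)].sort(
    (a, b) => b - a,
  );
  return (l1 + 0.05) / (l2 + 0.05);
}

// AA requires 4.5:1 for normal text, 3:1 for large text.
const failsAA = (ratio: number, largeText = false) =>
  ratio < (largeText ? 3 : 4.5);
```

Because the math is pure arithmetic over computed styles, it can run in-page and is fully deterministic, which is what makes the a11y dimension reproducible.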

Live verification

  • 3-page Stripe audit (stripe.com, /pricing, /contact/sales) collapsed a #635bff contrast issue appearing on 2 pages into a single [appears on 2 pages] systemic finding (top-4 by ROI)
  • Stripe correctly classified as marketing/fintech/world-class at 0.99 confidence; got trust-signals dimension automatically
  • Airbnb classified as ecommerce; got conversion dimension automatically
  • All 5 reference sites preserve overall scores within ±0 of Gen 1

Test plan

  • pnpm build — clean
  • pnpm check:boundaries — passes (77 files)
  • pnpm test — 686 tests pass (was 635 on main; +51 new tests)
  • Live audit against Stripe / Apple / Linear / Anthropic / Airbnb — calibration preserved
  • Cross-page systemic detection verified on 3-page Stripe audit
  • Custom dimensions verified on Stripe (trust-signals) and Airbnb (conversion)
  • Reviewer: try bad design-audit --url <your-app> and check the Top Fixes section
  • Reviewer: try --gen 1 --profile marketing to verify the legacy path still works as a fallback

Changes summary

  • 30 files changed in Gen 2 (+2515 lines)
  • 17 files changed in Gen 3 (+814 lines)
  • New runtime dep: axe-core@4.11.2
  • New build script: scripts/copy-rubric-fragments.mjs (copies .md fragments to dist/)
  • Pursuit specs: .evolve/pursuits/2026-04-06-design-audit-gen2.md, gen3.md

Reversibility

Gen 1 audit code is intact behind --gen 1. New code is additive — src/design/audit/ modules are independent of the legacy file. Easy to cherry-pick or revert individual pieces.

---

Replaces hardcoded profile rubrics with auto-classification, composable
markdown rubric fragments, and ground-truth measurements (axe-core +
real WCAG contrast math). The LLM is restricted to subjective visual
judgment; everything measurable is now measured.

Architecture (src/design/audit/):
- types.ts          — all audit types
- classify.ts       — single LLM call for type/domain/framework/maturity
- rubric/loader.ts  — composes rubric from markdown fragments by predicate
- rubric/fragments/ — 12 markdown fragments (universal + type + domain + maturity)
- measure/contrast.ts — WCAG 2.1 contrast math (in-page, deterministic)
- measure/a11y.ts   — axe-core wrapper
- measure/index.ts  — gathers measurements in parallel
- evaluate.ts       — composes classification + rubric + measurements + LLM vision
- pipeline.ts       — orchestrator
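The predicate-based composition in rubric/loader.ts can be sketched like this. The interfaces and the composeRubric name are illustrative assumptions; the real loader parses markdown fragments from disk rather than taking in-memory objects:

```typescript
// Classification produced by classify.ts (fields per the architecture above).
interface Classification {
  type: string;     // e.g. "marketing", "docs", "ecommerce"
  domain: string;   // e.g. "fintech", "crypto"
  maturity: string; // e.g. "world-class"
}

// A rubric fragment: a markdown body plus a predicate saying when it applies.
interface RubricFragment {
  name: string;
  applies: (c: Classification) => boolean;
  body: string; // markdown
}

// Compose the final rubric by concatenating every fragment whose predicate
// matches the page's classification.
function composeRubric(fragments: RubricFragment[], c: Classification): string {
  return fragments
    .filter((f) => f.applies(c))
    .map((f) => f.body)
    .join("\n\n");
}
```

Adding a domain then really is just adding a fragment whose predicate matches that domain; no code change is needed in the loader itself.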

Key wins:
- Auto-classifies pages (no more "wrong profile" failures)
- Drop a markdown file in fragments/ to add a domain — no code change
- Real contrast measurement: 720 elements checked on Stripe, 32 actual AA failures
- Real axe violations: catches missing button text, link text, list semantics
- Accessibility dimension in design system score reflects measured truth
- Overall score still reflects visual quality (LLM job) — both visible

Calibration preserved 5/5 vs Gen 1 baseline:
  Stripe 9, Apple 9, Linear 9, Anthropic 8, Airbnb 8 (overall scores unchanged)
  Apple a11y dim: 4/10 (real)  | Linear a11y dim: 4/10 (real)

CLI:
- `bad design-audit --url X` runs Gen 2 by default (auto-classifies)
- `bad design-audit --url X --gen 1` falls back to legacy hardcoded profiles
- `bad design-audit --url X --profile marketing` overrides classification
- New: `--rubrics-dir <path>` for user-supplied rubric fragments

Tests:
- 27 new unit tests covering rubric loader, fragment parsing, predicate
  evaluation, measurements-to-findings mapping, severity mapping
- 662 total tests pass

Build:
- New copy-rubric-fragments.mjs script copies .md files to dist/ post-tsc
- axe-core 4.11.2 added as runtime dependency
---

Gen 2 added context (auto-classification) and truth (real measurements).
Gen 3 makes the output actionable: every finding gets impact/effort/blast,
the report opens with "Top 5 Fixes by ROI", and findings appearing across
multiple pages collapse into a single systemic finding with elevated blast.

Architecture (src/design/audit/):
- roi.ts (new) — pure-function ROI scoring + cross-page systemic detection
- evaluate.ts — LLM prompt now asks for impact/effort/blast on each finding;
  measurements get derived defaults (contrast: system/effort=1, axe: component/effort=3)
- measure/a11y.ts — CSP-bypass injection ladder (addScriptTag → CDP → eval)
- rubric/loader.ts — fragments can declare a custom dimension; composed rubric
  exposes deduped dimensions[]
- types.ts — DesignFinding gains impact/effort/blast/roi/pageCount;
  RubricFragment gains optional dimension; ComposedRubric gains dimensions[]
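The CSP-bypass ladder in measure/a11y.ts can be abstracted as a generic fallback chain. In this sketch the three tiers (addScriptTag, CDP Runtime.evaluate, plain evaluate) are represented as opaque async injectors; the names and shapes here are assumptions, not the module's actual API:

```typescript
// An injector attempts one injection strategy and resolves true on success.
type Injector = () => Promise<boolean>;

// Try each tier in order; a CSP rejection or protocol error falls through to
// the next tier. Returns the index of the tier that succeeded.
async function injectWithFallback(ladder: Injector[]): Promise<number> {
  for (let tier = 0; tier < ladder.length; tier++) {
    try {
      if (await ladder[tier]()) return tier;
    } catch {
      // e.g. the page's Content-Security-Policy blocked inline script
    }
  }
  throw new Error("all injection tiers failed");
}
```

In the real module each tier would wrap a browser-automation call (script-tag injection, a CDP session, or direct evaluation); the ladder only decides the order and the fallback behavior.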

Rubric updates (per-fragment dimensions):
- domain-fintech: dimension=trust-signals
- domain-crypto: dimension=trust-signals
- type-docs: dimension=readability
- type-ecommerce: dimension=conversion
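A per-fragment dimension might look like the following hypothetical fragment file. The frontmatter keys (`applies`, `dimension`) and the predicate syntax are illustrative assumptions about the fragment format, not the actual schema:

```markdown
---
name: domain-fintech
applies: domain == "fintech"
dimension: trust-signals
---

## Trust signals

Evaluate regulatory disclosures, security badges, transparent pricing,
and institutional design cues appropriate to a financial product.
```

When the loader composes a rubric from matching fragments, any `dimension` values they declare are deduped into the composed rubric's dimensions[] and scored alongside the universal dimensions.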

CLI integration:
- Cross-page systemic detection runs after all pages audited
- Top 5 fixes (by ROI) computed via topByRoi() on the deduped set
- generateReport() opens with "Top Fixes (by ROI)" section showing ROI,
  impact, effort, blast, location, fix, and CSS for each
- JSON output exposes topFixes[]
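The cross-page dedup and top-N steps can be sketched as pure functions. topByRoi is named in the text; dedupeSystemic and the Finding shape are assumptions, and the real pipeline elevates the systemic finding's blast rather than just keeping the strongest ROI as done here:

```typescript
interface Finding {
  key: string; // stable identifier, e.g. "contrast:#635bff-on-white"
  page: string;
  roi: number;
  pageCount?: number;
}

// Collapse findings that share a key across pages into one systemic finding.
function dedupeSystemic(findings: Finding[]): Finding[] {
  const byKey = new Map<string, Finding>();
  for (const f of findings) {
    const seen = byKey.get(f.key);
    if (seen) {
      seen.pageCount = (seen.pageCount ?? 1) + 1;
      seen.roi = Math.max(seen.roi, f.roi); // simplified; real code boosts blast
    } else {
      byKey.set(f.key, { ...f, pageCount: 1 });
    }
  }
  return [...byKey.values()];
}

// Rank the deduped set by ROI and keep the top n for the report.
function topByRoi(findings: Finding[], n = 5): Finding[] {
  return [...findings].sort((a, b) => b.roi - a.roi).slice(0, n);
}
```

The "[appears on N pages]" marker in the report is then just the pageCount rendered next to the finding.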

Live verification:
- 3-page Stripe audit collapsed a contrast issue appearing on /, /pricing
  into a single [appears on 2 pages] systemic finding
- Stripe classified as fintech → trust-signals dimension automatically scored
- Airbnb classified as ecommerce → conversion dimension automatically scored
- All 5 reference sites preserve calibration (Gen 1 = Gen 2 = Gen 3 scores)

Tests:
- 24 new ROI unit tests (computeRoi formulas, cross-page dedup, normalization,
  ranking stability, edge cases)
- 4 new dimension tests (frontmatter parsing, composition, dedup)
- 686 total tests pass (was 662)

ROI formula: (impact * blastWeight) / max(effort, 1)
Blast weights: page=1, section=1.25, component=1.75, system=2.5
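The formula and weights above, as a small sketch (computeRoi is named in the tests; the exact signature here is an assumption):

```typescript
type Blast = "page" | "section" | "component" | "system";

// Blast weights from the commit message.
const BLAST_WEIGHT: Record<Blast, number> = {
  page: 1,
  section: 1.25,
  component: 1.75,
  system: 2.5,
};

// ROI = (impact * blastWeight) / max(effort, 1), with impact and effort on
// the 1-10 scale defined by the effort-anchor fragment.
function computeRoi(impact: number, effort: number, blast: Blast): number {
  return (impact * BLAST_WEIGHT[blast]) / Math.max(effort, 1);
}
```

Clamping effort to at least 1 keeps a zero- or mis-scored effort from producing an infinite ROI.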
---

The Gen 3 PR shipped ROI ranking but had three rough edges; this commit fixes
them and completes the verification work that was still outstanding.

## Measurement grouping (the big one)
Contrast and a11y findings now group BEFORE becoming findings:
- Contrast: by (color, background) pair. A site with 47 elements failing
  the same gray gets ONE finding ("change --text-muted, affects 47 elements")
  with blast scaled by element count (≥5 = system, 2-4 = component, 1 = page).
- axe: by rule id. A site with 8 buttons missing accessible names gets ONE
  button-name finding with a node count, not 8 spammy entries.

This dramatically improves Top Fixes signal-to-noise. On Stripe: previously
showed 5 copies of the same contrast issue; now shows 5 distinct color pair
mismatches each affecting multiple elements (8, 5, 4, 3, 2).
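The grouping step can be sketched as follows. The shapes and names are illustrative; the blast thresholds (≥5 elements → system, 2-4 → component, 1 → page) follow the rules stated above:

```typescript
interface ContrastFailure {
  fg: string;       // foreground color, e.g. "#6b7280"
  bg: string;       // background color
  selector: string; // element that failed
}

// Scale blast by how many elements share the failing color pair.
function blastForCount(n: number): "system" | "component" | "page" {
  return n >= 5 ? "system" : n >= 2 ? "component" : "page";
}

// Group raw contrast failures by (color, background) pair so that 47 elements
// failing the same gray become ONE finding with elementCount = 47.
function groupContrastFailures(failures: ContrastFailure[]) {
  const groups = new Map<string, ContrastFailure[]>();
  for (const f of failures) {
    const key = `${f.fg}|${f.bg}`;
    groups.set(key, [...(groups.get(key) ?? []), f]);
  }
  return [...groups.entries()].map(([pair, members]) => ({
    pair,
    elementCount: members.length,
    blast: blastForCount(members.length),
  }));
}
```

The axe side works the same way, keyed by rule id instead of color pair, with the node count carried on the single grouped finding.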

## Effort calibration anchor
New universal rubric fragment `universal-effort-anchor.md` defining the
1-10 scale for effort/impact/blast with concrete anchors (effort=1 is one
CSS value, effort=5 is editing 3 components, effort=10 is full redesign).
The LLM now has a shared definition instead of guessing.

## Gen 1 audit code deleted
Removed the legacy fallback path:
- PROFILE_RUBRICS Record (lines 22-249, ~227 lines)
- buildAuditPrompt function
- auditSinglePage function (~80 lines)
- --gen flag from cli.ts and DesignAuditOptions
- generation === 1/2 branching in runDesignAudit, reproducibility loop

cli-design-audit.ts: 2418 → 2106 lines (-312 net, after grouping additions)

## End-to-end evolve loop verification (the killer feature)
This was untested across the Gen 1 and Gen 2 reflections.
Built /tmp/bad-design-test fixture with deliberately bad HTML+CSS.
Ran:
  bad design-audit --url http://localhost:8765 --evolve claude-code \
    --evolve-rounds 2 --project-dir /tmp/bad-design-test

Result: 3.0 → 5.0 (+2.0) over 2 rounds.
Claude Code rewrote the actual source files (index.html + src/styles.css):
- Added a CSS variable system (--brand, --text, --radius-*, --shadow-*)
- Hero gradient text fill, eyebrow badge, dual CTAs, lede paragraph
- Cards got SVG icons + real copy + hover transforms
- Sticky header with backdrop blur, multi-column footer
- Responsive grid breakpoint
9 of the top 10 ROI findings were fixed in round 1.

This validates both the agent dispatch architecture AND the ROI ranking.

## Reproducibility on Gen 3
Stripe reproducibility test: 9.0 / 9.0 / 9.0 — stddev 0.00.
Target was ±0.5. Gen 3 grouping makes scores MORE stable than Gen 1
because there's less per-finding variance.

## Documentation
Updated docs/guides/design-audit.md with Gen 3 architecture: pipeline
explanation, ROI scoring, cross-page systemic detection, measurement
grouping, dynamic dimensions.

## Skill updates
design-evolve skill now references the topFixes array as the entry point
instead of telling the agent to sort findings by severity from scratch.
Memory entries written for Gen 3 architecture and grouping behavior.

## Tests
Updated 2 measurement tests for the new grouping behavior. Added 2 new
tests for blast scaling. 688 tests pass (was 686).
@drewstone
Contributor Author

Update: Gen 3 polish — closing the gaps from the self-review

Pushed a30e0c8 addressing the issues called out in the Gen 3 self-assessment.

Big fix: measurement grouping

Top Fixes was previously dominated by 5 copies of the same contrast issue. Now contrast and a11y findings group BEFORE becoming findings:

  • Contrast groups by (color, background) pair. A site with 47 elements failing the same gray gets ONE finding ("change --text-muted, affects 47 elements") with blast scaled by element count.
  • axe groups by rule id. 8 buttons missing accessible names → 1 finding with node count, not 8.

On Stripe: Top 5 now shows 5 distinct color pair mismatches affecting 8, 5, 4, 3, 2 elements respectively. Real signal, not duplicates.

Effort calibration anchor

New universal-effort-anchor.md rubric fragment defines the 1-10 scale with concrete anchors (effort=1 is one CSS value, effort=5 is 3 component edits, effort=10 is full redesign). LLM has a shared definition instead of guessing.

Killer feature verified end-to-end

The --evolve claude-code agent dispatch was untested across the Gen 1 and Gen 2 reflections. Now verified live:

Built /tmp/bad-design-test fixture (deliberately bad HTML+CSS).
Ran: bad design-audit --url http://localhost:8765 \
       --evolve claude-code --evolve-rounds 2 \
       --project-dir /tmp/bad-design-test

Result: 3.0/10 → 5.0/10 (+2.0) over 2 rounds.

Claude Code rewrote actual source files:

  • Added a CSS variable system (--brand, --text, --radius-*, --shadow-*)
  • Hero got gradient text fill, eyebrow badge, dual CTAs, lede paragraph
  • Cards got SVG icons + real copy + hover transforms
  • Sticky header with backdrop blur, multi-column footer
  • Responsive grid breakpoint

9 of the top 10 ROI findings were fixed in round 1. ROI ranking validated.

Gen 1 audit code deleted

PROFILE_RUBRICS, buildAuditPrompt, auditSinglePage, the --gen flag, and all generation === 1/2 branching are gone. cli-design-audit.ts: 2418 → 2106 lines.

Reproducibility re-validated on Gen 3

Stripe: 9.0 / 9.0 / 9.0 — stddev 0.00 (target was ±0.5). Gen 3 grouping makes scores MORE stable than Gen 1.

Docs + skill + memory

  • docs/guides/design-audit.md documents the Gen 3 pipeline, ROI scoring, cross-page detection, measurement grouping, dynamic dimensions
  • design-evolve skill points agents at topFixes[] as the entry point
  • Memory entries written for Gen 3 architecture and grouping behavior

Final calibration (3 generations, 6 sites)

| Site | Gen 1 | Gen 2 | Gen 3 | Notes |
| --- | --- | --- | --- | --- |
| Stripe | 9 | 9 | 9 | trust-signals dimension |
| Apple | 9 | 9 | 9 | a11y dim=4 (real measurements) |
| Linear | 9 | 9 | 9 | a11y dim=4 (real measurements) |
| Anthropic | 8 | 8 | 8 | |
| Airbnb | 8 | 8 | 8 | conversion dimension |
| /tmp/bad-design-test | n/a | n/a | 3 → 5 | evolve loop verified |

5/5 reference sites preserved. Bad fixture improved 2 points via real source edits.

Tests

688 pass on stable runs (was 635 on main). Two integration tests are pre-existing flakes (browser timeouts in afterAll), not introduced by this PR.

@drewstone drewstone merged commit d100c9e into main Apr 6, 2026
5 checks passed