feat(design-audit): Gen 2 + Gen 3 — context-aware, measurement-grounded, ROI-ranked#33

Merged
drewstone merged 4 commits into main from design-audit-gen2
Apr 6, 2026
Conversation

@drewstone
Contributor

Summary

Two generations of design-audit redesign on one branch.

Gen 2 replaced 5 hardcoded profiles with auto-classification + composable markdown rubric fragments, and added ground-truth measurements (axe-core + real WCAG 2.1 contrast math).

Gen 3 made the output actionable: every finding gets an ROI score (impact × blast / effort), the report opens with a Top 5 Fixes section, and findings appearing on multiple pages are auto-detected as systemic.

What's new

Architecture (src/design/audit/)

| Module | Purpose |
| --- | --- |
| types.ts | All audit types |
| classify.ts | Single LLM call: type / domain / framework / designSystem / maturity / intent |
| rubric/loader.ts | Composes rubric from markdown fragments by predicate |
| rubric/fragments/ | 12 markdown fragments (universal + type + domain + maturity) |
| measure/contrast.ts | Pure-JS WCAG 2.1 contrast math (in-page, deterministic) |
| measure/a11y.ts | axe-core wrapper with 3-tier CSP-bypass injection |
| measure/index.ts | Parallel measurement gather |
| evaluate.ts | Composes classification + rubric + measurements + LLM vision |
| roi.ts | Gen 3 — ROI scoring, cross-page systemic detection, top-N selection |
| pipeline.ts | Orchestrator |

CLI surface

# Gen 2 default — auto-classifies, no profile flag needed
bad design-audit --url https://your-app.com

# Manual profile override (Gen 1 behavior)
bad design-audit --url https://your-app.com --profile saas

# Legacy Gen 1 audit path
bad design-audit --url https://your-app.com --gen 1 --profile marketing

# User-supplied rubric fragments
bad design-audit --url https://your-app.com --rubrics-dir ~/.bad/rubrics

Reports now include

  • Auto-classification block: type/domain/maturity for every page
  • Top Fixes (by ROI) section: 5 highest-impact fixes, sorted by impact × blast / effort
  • Systemic markers: [appears on N pages] when cross-page dedup finds repeated issues
  • Dynamic dimensions: fintech sites get a trust-signals score, ecommerce gets conversion, docs get readability
  • Measurement-driven a11y dimension: real contrast/axe data drives the accessibility score, not LLM vibes

Calibration (5/5 preserved across all 3 generations)

| Site | Gen 1 | Gen 2 | Gen 3 | A11y dim | Custom dimensions |
| --- | --- | --- | --- | --- | --- |
| Stripe | 9 | 9 | 9 | 8 | trust-signals |
| Apple | 9 | 9 | 9 | 4 (real) | |
| Linear | 9 | 9 | 9 | 4 (real) | |
| Anthropic | 8 | 8 | 8 | 7 | |
| Airbnb | 8 | 8 | 8 | 8 | conversion |

The drop in Apple/Linear's a11y dimension reflects ground truth — Linear has 152 actual WCAG AA contrast failures and 5 axe violations. Gen 1 missed all of them. The overall score still reflects visual quality (LLM's job); the a11y dimension reflects measured truth (axe + math). Both visible in the same report.
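The "real WCAG 2.1 contrast math" mentioned above can be sketched as follows. This is a minimal, illustrative version of the relative-luminance and contrast-ratio formulas from WCAG 2.1; the function names are assumptions, not the actual exports of measure/contrast.ts:

```typescript
// Linearize an sRGB channel and compute relative luminance per WCAG 2.1.
function relativeLuminance(r: number, g: number, b: number): number {
  const lin = (c: number) => {
    const s = c / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// Contrast ratio = (L_lighter + 0.05) / (L_darker + 0.05), range 1:1 to 21:1.
function contrastRatio(
  fg: [number, number, number],
  bg: [number, number, number],
): number {
  const [l1, l2] = [relativeLuminance(...fg), relativeLuminance(...bg)].sort(
    (a, b) => b - a,
  );
  return (l1 + 0.05) / (l2 + 0.05);
}

// AA requires 4.5:1 for normal text, 3:1 for large text.
const failsAA = (ratio: number, largeText = false) =>
  ratio < (largeText ? 3 : 4.5);
```

Because the math is pure arithmetic over computed styles, it can run in-page and is fully deterministic, which is what makes the a11y dimension reproducible.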

Live verification

  • 3-page Stripe audit (stripe.com, /pricing, /contact/sales) collapsed a #635bff contrast issue appearing on 2 pages into a single [appears on 2 pages] systemic finding (top-4 by ROI)
  • Stripe correctly classified as marketing/fintech/world-class at 0.99 confidence; got trust-signals dimension automatically
  • Airbnb classified as ecommerce; got conversion dimension automatically
  • All 5 reference sites preserve overall scores within ±0 of Gen 1

Test plan

  • pnpm build — clean
  • pnpm check:boundaries — passes (77 files)
  • pnpm test — 686 tests pass (was 635 on main; +51 new tests)
  • Live audit against Stripe / Apple / Linear / Anthropic / Airbnb — calibration preserved
  • Cross-page systemic detection verified on 3-page Stripe audit
  • Custom dimensions verified on Stripe (trust-signals) and Airbnb (conversion)
  • Reviewer: try bad design-audit --url <your-app> and check the Top Fixes section
  • Reviewer: try --gen 1 --profile marketing to verify the legacy path still works as a fallback

Changes summary

  • 30 files changed in Gen 2 (+2515 lines)
  • 17 files changed in Gen 3 (+814 lines)
  • New runtime dep: axe-core@4.11.2
  • New build script: scripts/copy-rubric-fragments.mjs (copies .md fragments to dist/)
  • Pursuit specs: .evolve/pursuits/2026-04-06-design-audit-gen2.md, gen3.md

Reversibility

Gen 1 audit code is intact behind --gen 1. New code is additive — src/design/audit/ modules are independent of the legacy file. Easy to cherry-pick or revert individual pieces.

---

Replaces hardcoded profile rubrics with auto-classification, composable
markdown rubric fragments, and ground-truth measurements (axe-core +
real WCAG contrast math). The LLM is restricted to subjective visual
judgment; everything measurable is now measured.

Architecture (src/design/audit/):
- types.ts          — all audit types
- classify.ts       — single LLM call for type/domain/framework/maturity
- rubric/loader.ts  — composes rubric from markdown fragments by predicate
- rubric/fragments/ — 12 markdown fragments (universal + type + domain + maturity)
- measure/contrast.ts — WCAG 2.1 contrast math (in-page, deterministic)
- measure/a11y.ts   — axe-core wrapper
- measure/index.ts  — gathers measurements in parallel
- evaluate.ts       — composes classification + rubric + measurements + LLM vision
- pipeline.ts       — orchestrator
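The predicate-based composition in rubric/loader.ts can be sketched like this. The interfaces and the composeRubric name are illustrative assumptions; the real loader parses markdown fragments from disk rather than taking in-memory objects:

```typescript
// Classification produced by classify.ts (fields per the architecture above).
interface Classification {
  type: string;     // e.g. "marketing", "docs", "ecommerce"
  domain: string;   // e.g. "fintech", "crypto"
  maturity: string; // e.g. "world-class"
}

// A rubric fragment: a markdown body plus a predicate saying when it applies.
interface RubricFragment {
  name: string;
  applies: (c: Classification) => boolean;
  body: string; // markdown
}

// Compose the final rubric by concatenating every fragment whose predicate
// matches the page's classification.
function composeRubric(fragments: RubricFragment[], c: Classification): string {
  return fragments
    .filter((f) => f.applies(c))
    .map((f) => f.body)
    .join("\n\n");
}
```

Adding a domain then really is just adding a fragment whose predicate matches that domain; no code change is needed in the loader itself.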

Key wins:
- Auto-classifies pages (no more "wrong profile" failures)
- Drop a markdown file in fragments/ to add a domain — no code change
- Real contrast measurement: 720 elements checked on Stripe, 32 actual AA failures
- Real axe violations: catches missing button text, link text, list semantics
- Accessibility dimension in design system score reflects measured truth
- Overall score still reflects visual quality (LLM job) — both visible

Calibration preserved 5/5 vs Gen 1 baseline:
  Stripe 9, Apple 9, Linear 9, Anthropic 8, Airbnb 8 (overall scores unchanged)
  Apple a11y dim: 4/10 (real)  | Linear a11y dim: 4/10 (real)

CLI:
- `bad design-audit --url X` runs Gen 2 by default (auto-classifies)
- `bad design-audit --url X --gen 1` falls back to legacy hardcoded profiles
- `bad design-audit --url X --profile marketing` overrides classification
- New: `--rubrics-dir <path>` for user-supplied rubric fragments

Tests:
- 27 new unit tests covering rubric loader, fragment parsing, predicate
  evaluation, measurements-to-findings mapping, severity mapping
- 662 total tests pass

Build:
- New copy-rubric-fragments.mjs script copies .md files to dist/ post-tsc
- axe-core 4.11.2 added as runtime dependency
---

Gen 2 added context (auto-classification) and truth (real measurements).
Gen 3 makes the output actionable: every finding gets impact/effort/blast,
the report opens with "Top 5 Fixes by ROI", and findings appearing across
multiple pages collapse into a single systemic finding with elevated blast.

Architecture (src/design/audit/):
- roi.ts (new) — pure-function ROI scoring + cross-page systemic detection
- evaluate.ts — LLM prompt now asks for impact/effort/blast on each finding;
  measurements get derived defaults (contrast: system/effort=1, axe: component/effort=3)
- measure/a11y.ts — CSP-bypass injection ladder (addScriptTag → CDP → eval)
- rubric/loader.ts — fragments can declare a custom dimension; composed rubric
  exposes deduped dimensions[]
- types.ts — DesignFinding gains impact/effort/blast/roi/pageCount;
  RubricFragment gains optional dimension; ComposedRubric gains dimensions[]
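The CSP-bypass ladder in measure/a11y.ts can be abstracted as a generic fallback chain. In this sketch the three tiers (addScriptTag, CDP Runtime.evaluate, plain evaluate) are represented as opaque async injectors; the names and shapes here are assumptions, not the module's actual API:

```typescript
// An injector attempts one injection strategy and resolves true on success.
type Injector = () => Promise<boolean>;

// Try each tier in order; a CSP rejection or protocol error falls through to
// the next tier. Returns the index of the tier that succeeded.
async function injectWithFallback(ladder: Injector[]): Promise<number> {
  for (let tier = 0; tier < ladder.length; tier++) {
    try {
      if (await ladder[tier]()) return tier;
    } catch {
      // e.g. the page's Content-Security-Policy blocked inline script
    }
  }
  throw new Error("all injection tiers failed");
}
```

In the real module each tier would wrap a browser-automation call (script-tag injection, a CDP session, or direct evaluation); the ladder only decides the order and the fallback behavior.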

Rubric updates (per-fragment dimensions):
- domain-fintech: dimension=trust-signals
- domain-crypto: dimension=trust-signals
- type-docs: dimension=readability
- type-ecommerce: dimension=conversion
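A per-fragment dimension might look like the following hypothetical fragment file. The frontmatter keys (`applies`, `dimension`) and the predicate syntax are illustrative assumptions about the fragment format, not the actual schema:

```markdown
---
name: domain-fintech
applies: domain == "fintech"
dimension: trust-signals
---

## Trust signals

Evaluate regulatory disclosures, security badges, transparent pricing,
and institutional design cues appropriate to a financial product.
```

When the loader composes a rubric from matching fragments, any `dimension` values they declare are deduped into the composed rubric's dimensions[] and scored alongside the universal dimensions.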

CLI integration:
- Cross-page systemic detection runs after all pages audited
- Top 5 fixes (by ROI) computed via topByRoi() on the deduped set
- generateReport() opens with "Top Fixes (by ROI)" section showing ROI,
  impact, effort, blast, location, fix, and CSS for each
- JSON output exposes topFixes[]
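The cross-page dedup and top-N steps can be sketched as pure functions. topByRoi is named in the text; dedupeSystemic and the Finding shape are assumptions, and the real pipeline elevates the systemic finding's blast rather than just keeping the strongest ROI as done here:

```typescript
interface Finding {
  key: string; // stable identifier, e.g. "contrast:#635bff-on-white"
  page: string;
  roi: number;
  pageCount?: number;
}

// Collapse findings that share a key across pages into one systemic finding.
function dedupeSystemic(findings: Finding[]): Finding[] {
  const byKey = new Map<string, Finding>();
  for (const f of findings) {
    const seen = byKey.get(f.key);
    if (seen) {
      seen.pageCount = (seen.pageCount ?? 1) + 1;
      seen.roi = Math.max(seen.roi, f.roi); // simplified; real code boosts blast
    } else {
      byKey.set(f.key, { ...f, pageCount: 1 });
    }
  }
  return [...byKey.values()];
}

// Rank the deduped set by ROI and keep the top n for the report.
function topByRoi(findings: Finding[], n = 5): Finding[] {
  return [...findings].sort((a, b) => b.roi - a.roi).slice(0, n);
}
```

The "[appears on N pages]" marker in the report is then just the pageCount rendered next to the finding.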

Live verification:
- 3-page Stripe audit collapsed a contrast issue appearing on /, /pricing
  into a single [appears on 2 pages] systemic finding
- Stripe classified as fintech → trust-signals dimension automatically scored
- Airbnb classified as ecommerce → conversion dimension automatically scored
- All 5 reference sites preserve calibration (Gen 1 = Gen 2 = Gen 3 scores)

Tests:
- 24 new ROI unit tests (computeRoi formulas, cross-page dedup, normalization,
  ranking stability, edge cases)
- 4 new dimension tests (frontmatter parsing, composition, dedup)
- 686 total tests pass (was 662)

ROI formula: (impact * blastWeight) / max(effort, 1)
Blast weights: page=1, section=1.25, component=1.75, system=2.5
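The formula and weights above, as a small sketch (computeRoi is named in the tests; the exact signature here is an assumption):

```typescript
type Blast = "page" | "section" | "component" | "system";

// Blast weights from the commit message.
const BLAST_WEIGHT: Record<Blast, number> = {
  page: 1,
  section: 1.25,
  component: 1.75,
  system: 2.5,
};

// ROI = (impact * blastWeight) / max(effort, 1), with impact and effort on
// the 1-10 scale defined by the effort-anchor fragment.
function computeRoi(impact: number, effort: number, blast: Blast): number {
  return (impact * BLAST_WEIGHT[blast]) / Math.max(effort, 1);
}
```

Clamping effort to at least 1 keeps a zero- or mis-scored effort from producing an infinite ROI.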
---

The Gen 3 PR shipped ROI ranking but had three rough edges; this commit fixes
them and completes the verification work that was still outstanding.

## Measurement grouping (the big one)
Contrast and a11y findings now group BEFORE becoming findings:
- Contrast: by (color, background) pair. A site with 47 elements failing
  the same gray gets ONE finding ("change --text-muted, affects 47 elements")
  with blast scaled by element count (≥5 = system, 2-4 = component, 1 = page).
- axe: by rule id. A site with 8 buttons missing accessible names gets ONE
  button-name finding with a node count, not 8 spammy entries.

This dramatically improves Top Fixes signal-to-noise. On Stripe: previously
showed 5 copies of the same contrast issue; now shows 5 distinct color pair
mismatches each affecting multiple elements (8, 5, 4, 3, 2).
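The grouping step can be sketched as follows. The shapes and names are illustrative; the blast thresholds (≥5 elements → system, 2-4 → component, 1 → page) follow the rules stated above:

```typescript
interface ContrastFailure {
  fg: string;       // foreground color, e.g. "#6b7280"
  bg: string;       // background color
  selector: string; // element that failed
}

// Scale blast by how many elements share the failing color pair.
function blastForCount(n: number): "system" | "component" | "page" {
  return n >= 5 ? "system" : n >= 2 ? "component" : "page";
}

// Group raw contrast failures by (color, background) pair so that 47 elements
// failing the same gray become ONE finding with elementCount = 47.
function groupContrastFailures(failures: ContrastFailure[]) {
  const groups = new Map<string, ContrastFailure[]>();
  for (const f of failures) {
    const key = `${f.fg}|${f.bg}`;
    groups.set(key, [...(groups.get(key) ?? []), f]);
  }
  return [...groups.entries()].map(([pair, members]) => ({
    pair,
    elementCount: members.length,
    blast: blastForCount(members.length),
  }));
}
```

The axe side works the same way, keyed by rule id instead of color pair, with the node count carried on the single grouped finding.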

## Effort calibration anchor
New universal rubric fragment `universal-effort-anchor.md` defining the
1-10 scale for effort/impact/blast with concrete anchors (effort=1 is one
CSS value, effort=5 is editing 3 components, effort=10 is full redesign).
The LLM now has a shared definition instead of guessing.

## Gen 1 audit code deleted
Removed the legacy fallback path:
- PROFILE_RUBRICS Record (lines 22-249, ~227 lines)
- buildAuditPrompt function
- auditSinglePage function (~80 lines)
- --gen flag from cli.ts and DesignAuditOptions
- generation === 1/2 branching in runDesignAudit, reproducibility loop

cli-design-audit.ts: 2418 → 2106 lines (-312 net, after grouping additions)

## End-to-end evolve loop verification (the killer feature)
This was untested across the Gen 1 and Gen 2 reflections.
Built /tmp/bad-design-test fixture with deliberately bad HTML+CSS.
Ran:
  bad design-audit --url http://localhost:8765 --evolve claude-code \
    --evolve-rounds 2 --project-dir /tmp/bad-design-test

Result: 3.0 → 5.0 (+2.0) over 2 rounds.
Claude Code rewrote the actual source files (index.html + src/styles.css):
- Added a CSS variable system (--brand, --text, --radius-*, --shadow-*)
- Hero gradient text fill, eyebrow badge, dual CTAs, lede paragraph
- Cards got SVG icons + real copy + hover transforms
- Sticky header with backdrop blur, multi-column footer
- Responsive grid breakpoint
9 of the top 10 ROI findings were fixed in round 1.

This validates both the agent dispatch architecture AND the ROI ranking.

## Reproducibility on Gen 3
Stripe reproducibility test: 9.0 / 9.0 / 9.0 — stddev 0.00.
Target was ±0.5. Gen 3 grouping makes scores MORE stable than Gen 1
because there's less per-finding variance.

## Documentation
Updated docs/guides/design-audit.md with Gen 3 architecture: pipeline
explanation, ROI scoring, cross-page systemic detection, measurement
grouping, dynamic dimensions.

## Skill updates
design-evolve skill now references the topFixes array as the entry point
instead of telling the agent to sort findings by severity from scratch.
Memory entries written for Gen 3 architecture and grouping behavior.

## Tests
Updated 2 measurement tests for the new grouping behavior. Added 2 new
tests for blast scaling. 688 tests pass (was 686).
@drewstone
Contributor Author

Update: Gen 3 polish — closing the gaps from the self-review

Pushed a30e0c8 addressing the issues called out in the Gen 3 self-assessment.

Big fix: measurement grouping

Top Fixes was previously dominated by 5 copies of the same contrast issue. Now contrast and a11y findings group BEFORE becoming findings:

  • Contrast groups by (color, background) pair. A site with 47 elements failing the same gray gets ONE finding ("change --text-muted, affects 47 elements") with blast scaled by element count.
  • axe groups by rule id. 8 buttons missing accessible names → 1 finding with node count, not 8.

On Stripe: Top 5 now shows 5 distinct color pair mismatches affecting 8, 5, 4, 3, 2 elements respectively. Real signal, not duplicates.

Effort calibration anchor

New universal-effort-anchor.md rubric fragment defines the 1-10 scale with concrete anchors (effort=1 is one CSS value, effort=5 is 3 component edits, effort=10 is full redesign). LLM has a shared definition instead of guessing.

Killer feature verified end-to-end

The --evolve claude-code agent dispatch was untested across the Gen 1 and Gen 2 reflections. Now verified live:

Built /tmp/bad-design-test fixture (deliberately bad HTML+CSS).
Ran: bad design-audit --url http://localhost:8765 \
       --evolve claude-code --evolve-rounds 2 \
       --project-dir /tmp/bad-design-test

Result: 3.0/10 → 5.0/10 (+2.0) over 2 rounds.

Claude Code rewrote actual source files:

  • Added a CSS variable system (--brand, --text, --radius-*, --shadow-*)
  • Hero got gradient text fill, eyebrow badge, dual CTAs, lede paragraph
  • Cards got SVG icons + real copy + hover transforms
  • Sticky header with backdrop blur, multi-column footer
  • Responsive grid breakpoint

9 of the top 10 ROI findings were fixed in round 1. ROI ranking validated.

Gen 1 audit code deleted

PROFILE_RUBRICS, buildAuditPrompt, auditSinglePage, the --gen flag, and all generation === 1/2 branching are gone. cli-design-audit.ts: 2418 → 2106 lines.

Reproducibility re-validated on Gen 3

Stripe: 9.0 / 9.0 / 9.0 — stddev 0.00 (target was ±0.5). Gen 3 grouping makes scores MORE stable than Gen 1.

Docs + skill + memory

  • docs/guides/design-audit.md documents the Gen 3 pipeline, ROI scoring, cross-page detection, measurement grouping, dynamic dimensions
  • design-evolve skill points agents at topFixes[] as the entry point
  • Memory entries written for Gen 3 architecture and grouping behavior

Final calibration (3 generations, 6 sites)

| Site | Gen 1 | Gen 2 | Gen 3 | Notes |
| --- | --- | --- | --- | --- |
| Stripe | 9 | 9 | 9 | trust-signals dimension |
| Apple | 9 | 9 | 9 | a11y dim=4 (real measurements) |
| Linear | 9 | 9 | 9 | a11y dim=4 (real measurements) |
| Anthropic | 8 | 8 | 8 | |
| Airbnb | 8 | 8 | 8 | conversion dimension |
| /tmp/bad-design-test | n/a | n/a | 3 → 5 | evolve loop verified |

5/5 reference sites preserved. Bad fixture improved 2 points via real source edits.

Tests

688 pass on stable runs (was 635 on main). Two integration tests are pre-existing flakes (browser timeouts in afterAll), not introduced by this PR.

@drewstone drewstone merged commit d100c9e into main Apr 6, 2026
5 checks passed