From 791be9a85e5f41a7fc14146db15319d0a89ee7d3 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 15:13:33 +0800 Subject: [PATCH 01/38] =?UTF-8?q?docs:=20add=20arxiv=20paper=20design=20sp?= =?UTF-8?q?ec=20=E2=80=94=20skill-based=20agentic=20coding=20for=20reducti?= =?UTF-8?q?ons?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Design spec for a full research paper (ICSE/ASE-class) on using skill-based AI agent pipelines to build verified NP-hard problem reduction libraries. Key decisions from brainstorming: - Methodology-first framing (Goldilocks domain + practical artifact) - Three roles: contributors (issues), maintainer (board curation), agents (manage + execute) - Multi-layered verification stack (7 layers from type system to documentation) - Evaluation: ablation (skill vs no-skill) + git mining + 3 case studies - Hardware solver motivation (Rydberg atoms, D-Wave) Co-Authored-By: Claude Opus 4.6 --- .../specs/2026-03-12-arxiv-paper-design.md | 292 ++++++++++++++++++ 1 file changed, 292 insertions(+) create mode 100644 docs/superpowers/specs/2026-03-12-arxiv-paper-design.md diff --git a/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md b/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md new file mode 100644 index 00000000..7e25dc55 --- /dev/null +++ b/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md @@ -0,0 +1,292 @@ +# Skill-Based Agentic Coding for Mathematical Software: A Case Study in NP-Hard Problem Reductions + +**Type:** Full research paper (~10-12 pages) +**Venue:** ICSE/ASE-class SE conference +**Output:** `docs/paper/arxiv/paper.typ` (Typst) + +## Thesis + +The bottleneck in agentic coding is not agent capability but task decomposition and the division of labor between human creativity and agent management/execution. We demonstrate a skill-based pipeline where humans (contributors + maintainer) provide judgment — which problems matter, which reductions are useful — while agents handle both management (orchestrating the pipeline, picking cards, dispatching sub-agents) and execution (implementation, testing, documentation, review). Applied to NP-hard problem reductions, this produces a verified library of 24 problem types with 40 implemented reduction rules and 52 total graph edges (including 12 inferred variant edges), with multi-layered correctness guarantees. + +**Terminology note:** "40 reductions" = hand-coded `ReduceTo` implementations. "52 graph edges" = total directed edges in the reduction graph, including natural edges inferred from the type-parameter subtype lattice (e.g., `MIS` → `MIS`). The paper must consistently distinguish these counts. + +## Paper Outline + +### S1. Introduction (~1.5 pages) + +Frame the problem: +- AI coding agents achieve 70-80% on isolated bug fixes (SWE-Bench Verified) but drop to ~20% on long-horizon, multi-file tasks. The common response is to push for more agent autonomy. +- We argue the bottleneck is not capability but decomposition: how to split creative/judgment work (human) from management/mechanical work (agent). +- The "review is harder than generation" challenge — especially for mathematical/scientific code where correctness is hard to verify. + +Present the three roles: +- **Contributors** create issues (creative: identify which reductions are useful, propose new problems, spot gaps in the graph). +- **Maintainer** curates the project board and writes skills (creative: priorities, domain knowledge encoding, quality standards). +- **Agents** both manage (pick cards from the board, orchestrate the pipeline, dispatch sub-agents for review) and execute (implement, test, document). + +Contributions: +1. A skill-based methodology for decomposing mathematical coding tasks into agent-manageable steps. +2. A multi-layered verification stack that catches errors across different abstraction levels. +3. A verified reduction library (24 problem types, 40 implemented reductions, 52 graph edges) as a practical artifact. + +### S2. Why Reductions? The Goldilocks Domain (~1 page) + +Why this domain is ideal for studying agentic coding: +- Each reduction is self-contained (~50-200 LOC), requires non-trivial mathematical reasoning, yet has an automatable correctness criterion (round-trip: reduce → solve target → extract solution back → verify against source). +- Homogeneous task structure enables systematic comparison across tasks (unlike SWE-Bench's heterogeneous issues). +- Contrast with general SE tasks: reductions have a clear mathematical spec, a ground-truth, and bounded scope. + +Practical motivation — hardware solvers: +- Rydberg atom arrays solve Maximum Independent Set natively. +- D-Wave quantum annealers solve Ising/QUBO problems. +- A verified reduction graph serves as a **compilation layer**: reduce SAT → MIS → run on Rydberg atoms; reduce MaxCut → SpinGlass → QUBO → run on D-Wave. The library lets specialized hardware solve a much larger class of problems. + +Practical motivation — real-world applications: +- Software-defined networking (routing/scheduling → ILP). +- Airline crew scheduling (→ SetCovering). +- VLSI design (→ graph coloring). +- Logistics (→ TSP, BinPacking). +- These domains reduce to problems that already have hardware or algorithmic solutions; the library provides the verified bridge. + +Figure 1: The reduction graph (24 problem types, 42 variant nodes, 52 directed edges, QUBO/ILP hubs visible, color-coded by category: graph/formula/set/algebraic/misc). Caption distinguishes 40 implemented reductions from 12 inferred variant edges. + +### S3. System Architecture (~1.5 pages) + +The Rust library design that makes agent-generated code verifiable by construction. Focus on the aspects that directly enable the verification story (details of trait hierarchy and proc macros in supplementary material). + +**Key design choices:** +- `Problem` trait with `evaluate()` enables brute-force verification of any configuration. +- `ReduceTo` trait with `ReductionResult` enforces that every reduction can produce a target problem AND extract solutions back — the type system makes round-trip testing possible by construction. +- `#[reduction(overhead = {...})]` proc macro: overhead expressions are compile-time validated against getter methods — agents cannot write incorrect variable names in overhead formulas. +- `declare_variants!` registers problem variants with complexity strings — the registry enables automated graph export and completeness checking. + +**Design philosophy:** Reduce the space of possible agent errors through type-level enforcement. The architecture is not just a code organization choice — it is the foundation of the verification stack (elaborated in S5). + +Figure 2: System architecture diagram (key traits + compile-time validation flow). Full trait hierarchy in supplementary material. + +### S4. Skill-Based Task Decomposition (~2 pages) + +#### 4.1 The Three Roles + +How creative/judgment work distributes across human roles, with management and execution delegated to agents: + +| Role | Responsibility | Creative/Judgment | Examples | +|------|---------------|-------------------|----------| +| Contributor | Open issues | Which reductions are useful? Non-trivial? | "Add SAT → DominatingSet rule" | +| Maintainer | Curate board, write skills | Priorities, quality standards, domain knowledge | Move card to "Ready", evolve check-issue skill | +| Agent | Manage pipeline + execute | — | Pick card, implement, test, review, create PR | + +#### 4.2 Skills as Agent Functions + +A skill is a markdown script that decomposes a complex task into agent-manageable subtasks. Key insight: if a task is small and explicit enough, agents handle it well. + +Skills inventory (11 skills, grouped by function): + +**Orchestration skills** (agent-as-manager): +- **issue-to-pr**: The main entry point. Receives a GitHub issue, classifies it (model vs. rule), dispatches to the appropriate implementation skill, and creates a PR. This is the skill that enables card-based automation — the manager agent picks a card and invokes this. +- **meta-power**: Batch mode. Resolves all open issues autonomously in dependency order (models before rules that reference them). Experimental — not yet proven at scale. + +**Implementation skills** (agent-as-executor): +- **add-model**: Brainstorm (if interactive) → implement Problem trait → unit tests → serialization tests → review. +- **add-rule**: Brainstorm (if interactive) → implement ReduceTo trait → closed-loop tests → overhead expressions → example → review. + +**Quality gate skills:** +- **check-issue**: Validates usefulness, non-triviality, literature correctness of a proposed rule/model. Posts structured report. +- **check-rule-redundancy**: Determines if a proposed rule is dominated by a composite path through existing rules. +- **review-implementation**: Dispatches parallel subagents (structural check + quality check) with fresh context windows. +- **fix-pr**: Resolves review comments, CI failures, coverage gaps. + +**Documentation skills** (also serve as verification Layer 7 — see S5): +- **write-model-in-paper**: Generates Typst problem definition (formal definition, background, example with visualization). +- **write-rule-in-paper**: Generates Typst reduction theorem (complexity citation, self-contained proof sketch, detailed example). The proof sketch is the final verification layer — it forces a human-readable argument for correctness. + +**Release skill:** +- **release**: Determines version bump from diff, verifies tests/clippy, tags and publishes. + +Table 1: Skills inventory — trigger condition, inputs, outputs, typical agent turns, first-attempt success rate from git history. + +#### 4.3 Card-Based Orchestration + +- GitHub project board serves as the coordination mechanism. +- A manager agent auto-picks the next card and drives it through the skill pipeline. +- The maintainer's creative input: moving cards between columns ("Backlog" → "Ready" → "In Progress"). This is the strategic decision of what to work on next. +- The agent handles the tactical decisions: which skill to invoke, how to decompose subtasks, when to dispatch sub-agents. + +Figure 3: Pipeline diagram — contributor opens issue → check-issue validates → maintainer moves card → agent picks card → add-rule implements → review-implementation checks → fix-pr resolves issues → PR merged. + +### S5. Multi-Layered Verification (~1.5 pages) + +#### 5.1 The Verification Stack + +Seven layers, each catching different error classes: + +| Layer | Mechanism | Catches | +|-------|-----------|---------| +| 1. Type system | Rust compiler, trait bounds | Wrong return types, missing trait impls, API misuse | +| 2. Unit tests | `test_*_basic`, `test_*_serialization` | Evaluation errors, serialization roundtrip failures | +| 3. Closed-loop tests | `test_*_to_*_closed_loop` | Incorrect reduction mapping, wrong solution extraction | +| 4. Overhead validation | Symbolic expr vs. actual sizes | Overhead formula errors (e.g., quadratic vs linear edge count) | +| 5. Materialized fixtures | JSON ground truth in `tests/data/` | Agents silently changing expected values to make tests pass | +| 6. Agentic review | Parallel subagents with fresh context | Structural issues, missing edge cases, convention violations | +| 7. Documentation | Paper entry with proof sketch | Logical errors in the reduction argument itself | + +#### 5.2 Why Layers? + +The "lazy agent" problem: agents take the shortest path to close an issue. Given a failing test, an agent is more likely to change the expected value than fix the underlying bug. Materialized test data (Layer 5) prevents this by locking expected outputs in version-controlled JSON files that the agent cannot modify as part of a rule implementation PR. + +No single layer is sufficient: the type system catches API misuse but not logical errors; closed-loop tests verify functional correctness but not overhead formulas; documentation catches proof-level mistakes that no automated test can detect. + +Table 2 (defined in S6.2, referenced here): Error taxonomy × verification layer matrix. + +Figure 4: Verification pyramid with concrete error examples at each layer. + +### S6. Evaluation (~2.5 pages) + +#### 6.1 Ablation: Skill-Based vs. No-Skill Agent (quantitative) + +To demonstrate that the skill-based approach matters (not just "use a good agent"), we run a controlled comparison: + +**Setup:** Select 5-10 reductions of varying complexity. For each, run two configurations: +- **Skill-based:** Full pipeline (issue-to-pr skill, add-rule skill, review-implementation, fix-pr). +- **No-skill baseline:** Raw Claude Code on the same codebase with the same issue description but no skills (only CLAUDE.md for project context). + +**Metrics:** First-attempt CI pass rate, number of review rounds, final correctness (round-trip test pass), lines of code quality (convention adherence). + +**Framing:** With n=5-10, this ablation is a **controlled illustration** of the skill-based approach's value, not a statistically powered experiment. The results demonstrate the mechanism (how skills prevent specific error classes) rather than establishing effect sizes. The git mining in S6.2 provides broader quantitative evidence across the full project history. + +This is feasible: create the same issues on a branch without skill files, run the agent, measure outcomes. + +#### 6.2 Git History Mining (quantitative) + +Data source: full git/PR history of the problemreductions repository. + +Metrics: +- Agent-implemented vs. human-implemented reductions (count and %). +- First-attempt success rate per skill invocation (does the PR pass CI on first push?). +- Number of review rounds before merge. +- Error taxonomy: categorize all errors found during review, map to verification layer that caught them. +- Test coverage across the codebase (>95% target). +- Lines of code per reduction (distribution, compare agent vs human). + +**Addressing the confound:** Skills evolved during the project, so early reductions had less agent support. We address this by: +- Stratifying results by skill maturity phase (Phase 1: manual, Phase 2: basic skills, Phase 3: full pipeline with card automation). +- Plotting success rate over time with skill milestone annotations. +- Restricting primary quantitative claims to Phase 3 reductions (stable pipeline). + +**Preliminary error taxonomy** (to be populated from git history): +- *Type errors*: wrong return type, missing trait impl → caught by Layer 1 (type system) +- *Mapping errors*: incorrect vertex/edge index in reduction → caught by Layer 3 (closed-loop tests) +- *Formula errors*: wrong overhead expression (e.g., linear vs quadratic edge count) → caught by Layer 4 (overhead validation) +- *Test gaming*: agent changes expected value instead of fixing bug → caught by Layer 5 (materialized fixtures) +- *Convention violations*: wrong file naming, missing `declare_variants!` → caught by Layer 6 (agentic review) +- *Logical errors*: incorrect proof argument → caught by Layer 7 (documentation review) + +Table 2: Error taxonomy × verification layer matrix (populated from git mining). + +#### 6.3 Case Studies (qualitative) + +Three reductions spanning the complexity spectrum: + +**Simple — MinimumVertexCover → MaximumIndependentSet:** +- Complement relationship: MIS(G) = V \ MVC(G). +- Near-trivial mapping, ~30 LOC. +- Shows the pipeline working smoothly with minimal human intervention. + +**Complex — Satisfiability → MaximumIndependentSet:** +- Clause-variable gadget construction, quadratic blowup in edges. +- Requires understanding both CNF formulas and graph structure. +- Shows where agent makes mistakes (edge count in intersection graph) and how verification layers catch them. + +**Composition — Factoring → CircuitSAT → ILP (graph-level, not single-agent):** +- Two independently implemented reductions (Factoring→CircuitSAT and CircuitSAT→ILP) that compose in the reduction graph. +- This case study analyzes each reduction's implementation pipeline separately, then demonstrates how the graph enables composition: factor a number by chaining reductions to ILP and using an off-the-shelf solver. +- The "composition" is a property of the graph structure, not a single agent managing a multi-hop chain. +- Highlights the practical value: the library serves as compilation infrastructure. + +For each case study: show the full pipeline from issue to merged PR, highlight where human judgment was needed vs. where agent executed autonomously, and which verification layers activated. + +### S7. Related Work (~1 page) + +**AI coding agents:** +- SWE-agent (ACI design), OpenHands (open platform + SDK), Claude Code (agentic CLI), Devin (autonomous engineer). +- Benchmarks: SWE-Bench Verified (~70-80%), SWE-EVO (~20% on long-horizon), SWE-Bench Pro (~45%). +- Our contribution: skill-based decomposition as an alternative to pushing for more raw capability. +- Live-SWE-agent's self-evolution is complementary — skills are human-authored evolution. + +**AI-assisted discovery of reductions and complexity:** +- AlphaEvolve discovers new NP-hardness gadgets (MAX-3-CUT, MAX-4-CUT, metric TSP bounds). +- URSA uses SAT solvers for formal verification of NP-complete reductions. +- Our work is complementary: we focus on implementing and verifying known reductions, not discovering new ones. AlphaEvolve discovers; our pipeline implements and verifies. + +**Formal verification of AI-generated code:** +- VeriCoding (27% Lean, 44% Verus, 82% Dafny success rates). +- CLEVER (near-zero on hard Lean problems). +- VeriBench (self-optimizing agents reach ~90% compilation). +- Our approach: pragmatic multi-layer verification instead of end-to-end formal proofs. Trade-off: less formal guarantee, but practically effective at catching real errors. + +**Physics-inspired optimization:** +- GNNs via QUBO Hamiltonian relaxation solve MIS, MaxCut, MinVC at million-variable scale. +- Quantum annealing + GNN hybrids for TSP. +- Our reduction graph provides the verified compilation layer that connects arbitrary problems to these solvers. + +### S8. Discussion & Conclusion (~1 page) + +**Generalizability:** +- What other domains have the "Goldilocks" property? Candidates: compiler optimizations (peephole rules), algebraic identities, protocol verification lemmas. +- The skill-based approach generalizes to any domain where tasks are homogeneous, formally specified, and independently verifiable. + +**Limitations:** +- **n=1 threat to validity**: This is a single case study of a single project by a single maintainer. While we argue the methodology generalizes to other Goldilocks domains, the empirical evidence is from one project. We mitigate this by providing the ablation comparison (S6.1) and by identifying concrete candidate domains for future validation. +- Requires upfront skill engineering — the maintainer must invest significant effort in writing and evolving skills. +- Domain expertise embedded in skills doesn't transfer across domains (a reduction skill won't help with web development). +- Git history mining has confounds: skills evolved during the project (addressed by stratification in S6.2). +- The three-role model requires a knowledgeable maintainer; fully open-source contribution without oversight is not supported. + +**The human value proposition:** +- Humans are not eliminated from the pipeline — they are repositioned. Creative work (which problems matter, which reductions are useful, what quality standards to enforce) remains human. Mechanical work (implementation, testing, documentation, review) is delegated to agents that also manage their own workflow. +- This mirrors the broader trend identified in industry surveys: developers increasingly use AI but maintain active oversight on delegated tasks. + +**Future directions:** +- Connecting to AlphaEvolve-style discovery: use agents to discover new reductions, then feed them into the verification pipeline. +- Formal verification integration: replace round-trip tests with Lean/Coq proofs for the strongest guarantees. +- Scaling the graph: can the pipeline maintain quality as the number of problems grows from 24 to 100+? + +## Page Budget + +| Section | Pages | Notes | +|---------|-------|-------| +| S1. Introduction | ~1.5 | | +| S2. Why Reductions? | ~1 | Including Fig 1 (reduction graph) | +| S3. System Architecture | ~1.5 | Trimmed; full trait details in supplementary | +| S4. Skill-Based Decomposition | ~2 | Including Fig 3 (pipeline) + Table 1 | +| S5. Verification Stack | ~1.5 | Including Fig 4 (pyramid) | +| S6. Evaluation | ~2.5 | Ablation + git mining + case studies + Table 2 | +| S7. Related Work | ~1 | | +| S8. Discussion | ~1 | | +| **Total** | **~12** | Page counts include embedded figures/tables for each section. References ~0.5 pages. Supplementary material (full trait hierarchy, proc macro details) is a separate appendix outside the page limit, per ICSE/ASE norms. | + +## Key Figures + +1. **Reduction graph** — 24 problem types, 42 variant nodes, 52 directed edges, color-coded by category. QUBO/ILP hubs visible. Caption distinguishes 40 implemented reductions from 12 inferred variant edges. +2. **System architecture** — Key traits + compile-time validation flow (compact). Full hierarchy in supplementary. +3. **Pipeline diagram** — Three-role pipeline: contributor → issue → agent:check → maintainer:move card → agent:implement/review → PR → merge. Human decisions highlighted in distinct color. +4. **Verification pyramid** — 7 layers from type system (base) to documentation (top), each annotated with concrete error examples. + +## Key Tables + +1. **Skills inventory** — Each skill with: trigger condition, inputs, outputs, typical agent turns, first-attempt success rate. +2. **Error taxonomy** — Error categories × which verification layer caught them. Demonstrates complementary coverage. + +## References + +Survey bibliography: `.claude/survey/agentic-coding-reductions/references.bib` (22 papers across 4 themes). + +## Non-Goals + +- This paper does NOT claim agents can discover new reductions (that's AlphaEvolve territory). +- This paper does NOT provide formal verification proofs (pragmatic multi-layer approach instead). +- This paper does NOT benchmark against SWE-Bench (different task structure; we argue for domain-specific evaluation). + +## Artifact Availability + +The code repository (including all skill files, git history, and test fixtures) will be made publicly available as a reproducibility artifact. The reduction graph can be explored interactively via the project's MCP server and CLI tool. This supports ICSE/ASE artifact evaluation tracks. From 411863c1af9e31de72abe6a58131bddf567885e8 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 19:23:29 +0800 Subject: [PATCH 02/38] docs: update spec with project-pipeline and review-pipeline skills MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add the two new card-based orchestration skills from origin/main: - project-pipeline: picks Ready cards, runs issue-to-pr in worktrees - review-pipeline: fixes Copilot comments, runs agentic tests, moves to In Review Updated S4.3 with the two-stage pipeline and explicit human touch points (Backlog→Ready and In Review→Done). Skills count updated to 13. Co-Authored-By: Claude Opus 4.6 --- .../specs/2026-03-12-arxiv-paper-design.md | 25 ++++++++++++------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md b/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md index 7e25dc55..ad72db8d 100644 --- a/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md +++ b/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md @@ -80,11 +80,13 @@ How creative/judgment work distributes across human roles, with management and e A skill is a markdown script that decomposes a complex task into agent-manageable subtasks. Key insight: if a task is small and explicit enough, agents handle it well. -Skills inventory (11 skills, grouped by function): +Skills inventory (13 skills, grouped by function): **Orchestration skills** (agent-as-manager): -- **issue-to-pr**: The main entry point. Receives a GitHub issue, classifies it (model vs. rule), dispatches to the appropriate implementation skill, and creates a PR. This is the skill that enables card-based automation — the manager agent picks a card and invokes this. -- **meta-power**: Batch mode. Resolves all open issues autonomously in dependency order (models before rules that reference them). Experimental — not yet proven at scale. +- **project-pipeline**: The primary card-based automation skill. Picks a "Ready" issue from the GitHub Project board, moves it to "In Progress", runs `issue-to-pr --execute` in an isolated git worktree, then moves to "review-agentic". Supports single-issue, specific-issue, and `--all` batch modes. Processes Models before Rules to satisfy dependencies. +- **review-pipeline**: Second-stage orchestration. Picks a PR from the "review-agentic" column, fixes Copilot review comments, runs agentic feature tests, fixes CI (up to 3 retries), then moves to "In Review" for human merge. Also supports batch mode. +- **issue-to-pr**: The per-issue entry point invoked by `project-pipeline`. Receives a GitHub issue, classifies it (model vs. rule), dispatches to the appropriate implementation skill, and creates a PR. +- **meta-power**: Batch mode alternative. Resolves all open issues autonomously in dependency order. Experimental — being superseded by the pipeline skills above. **Implementation skills** (agent-as-executor): - **add-model**: Brainstorm (if interactive) → implement Problem trait → unit tests → serialization tests → review. @@ -107,12 +109,17 @@ Table 1: Skills inventory — trigger condition, inputs, outputs, typical agent #### 4.3 Card-Based Orchestration -- GitHub project board serves as the coordination mechanism. -- A manager agent auto-picks the next card and drives it through the skill pipeline. -- The maintainer's creative input: moving cards between columns ("Backlog" → "Ready" → "In Progress"). This is the strategic decision of what to work on next. -- The agent handles the tactical decisions: which skill to invoke, how to decompose subtasks, when to dispatch sub-agents. - -Figure 3: Pipeline diagram — contributor opens issue → check-issue validates → maintainer moves card → agent picks card → add-rule implements → review-implementation checks → fix-pr resolves issues → PR merged. +- GitHub Project board with columns: Backlog → Ready → In Progress → review-agentic → In Review → Done. +- **Two-stage agent pipeline:** + - Stage 1 (`project-pipeline`): picks Ready card → moves to In Progress → runs issue-to-pr in isolated worktree → moves to review-agentic. + - Stage 2 (`review-pipeline`): picks review-agentic card → fixes Copilot comments → runs agentic feature tests → fixes CI (up to 3 retries) → moves to In Review. +- **Human touches only two transitions:** + - Backlog → Ready (maintainer decides what to work on next — the creative/strategic decision). + - In Review → Done (maintainer merges after final review — the quality gate). +- The agent handles everything in between: worktree creation, implementation, testing, review, CI fixing, board status updates. +- Batch mode (`--all`) processes all Ready issues or all review-agentic PRs in a single invocation, with Models before Rules to satisfy dependencies. + +Figure 3: Pipeline diagram — two-stage card flow: contributor opens issue → [Backlog] → maintainer moves to [Ready] → agent: project-pipeline [In Progress → review-agentic] → agent: review-pipeline [In Review] → maintainer merges [Done]. Human decisions highlighted in distinct color. ### S5. Multi-Layered Verification (~1.5 pages) From 2e41ed112d6822f5250e4ceef55df72f3dca12a7 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 20:10:01 +0800 Subject: [PATCH 03/38] docs(arxiv): implementation plan for skill-based agentic coding paper 16 tasks in 5 parallelizable chunks: scaffolding, figures, sections S1-S4, sections S5-S6 with git mining, sections S7-S8 + final assembly. Co-Authored-By: Claude Opus 4.6 --- .../plans/2026-03-12-arxiv-paper-impl.md | 939 ++++++++++++++++++ 1 file changed, 939 insertions(+) create mode 100644 docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md diff --git a/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md b/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md new file mode 100644 index 00000000..33f2472a --- /dev/null +++ b/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md @@ -0,0 +1,939 @@ +# Arxiv Paper Implementation Plan + +> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Write a full research paper (~10-12 pages) on skill-based agentic coding for NP-hard problem reductions, targeting an ICSE/ASE-class venue. + +**Architecture:** Typst document at `docs/paper/arxiv/paper.typ` with CeTZ figures, bibliography from survey, and data gathered from git history and the reduction graph. The existing `docs/paper/lib.typ` provides graph drawing utilities. + +**Tech Stack:** Typst, CeTZ (`@preview/cetz:0.4.2`), ctheorems (`@preview/ctheorems:1.1.3`), fletcher (`@preview/fletcher:0.5.8`), BibTeX + +**Spec:** `docs/superpowers/specs/2026-03-12-arxiv-paper-design.md` + +**Compile command** (used throughout): `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` +First compilation may download Typst packages — this is expected. + +--- + +## File Structure + +| File | Purpose | +|------|---------| +| `docs/paper/arxiv/paper.typ` | Main paper document | +| `docs/paper/arxiv/references.bib` | Bibliography (merged from survey + existing paper refs) | +| `docs/paper/arxiv/images/reduction-graph.typ` | Figure 1: Reduction graph diagram | +| `docs/paper/arxiv/images/architecture.typ` | Figure 2: System architecture diagram | +| `docs/paper/arxiv/images/pipeline.typ` | Figure 3: Card-based pipeline diagram | +| `docs/paper/arxiv/images/verification-pyramid.typ` | Figure 4: Verification stack pyramid | +| `docs/paper/arxiv/data/graph-metrics.json` | Reduction graph metrics (from Task 2) | +| `docs/paper/arxiv/data/git-mining-results.json` | Git history mining results (from Task 11) | +| `docs/paper/arxiv/scripts/mine-git-history.py` | Git history mining script | + +--- + +## Chunk 1: Paper Scaffolding + Data Gathering + +### Task 1: Set up paper.typ scaffolding + +**Files:** +- Create: `docs/paper/arxiv/paper.typ` +- Create: `docs/paper/arxiv/references.bib` + +- [ ] **Step 1: Create bibliography file** + +Copy the survey bibliography: + +```bash +cp .claude/survey/agentic-coding-reductions/references.bib docs/paper/arxiv/references.bib +``` + +Then append the following entries from `docs/paper/references.bib` (read that file and copy these exact `@` entries by key): `karp1972`, `cook1971`, `garey1979`, `glover2019`, `lucas2014`, `barahona1982`. These are foundational references not in the survey bib. + +- [ ] **Step 2: Write paper.typ header and imports** + +Create `docs/paper/arxiv/paper.typ` with: +- Imports: `@preview/cetz:0.4.2`, `@preview/fletcher:0.5.8`, `@preview/ctheorems:1.1.3` +- Page setup: A4, margins `(x: 2cm, y: 2.5cm)` +- Font: New Computer Modern, 10pt +- Two-column body via `#show: columns.with(2)` (after abstract) +- Numbered headings: `#set heading(numbering: "1.")` +- Bibliography: `#bibliography("references.bib", style: "ieee")` + +Reference `docs/paper/reductions.typ` for the exact Typst conventions used in the existing paper. + +- [ ] **Step 3: Write title, authors, and abstract** + +Title: "Skill-Based Agentic Coding for Mathematical Software: A Case Study in NP-Hard Problem Reductions" + +Authors: (use placeholder affiliations for now) + +Abstract (~150 words) covering: +- Problem: agents fail at long-horizon math coding tasks (70-80% on SWE-Bench Verified, ~20% on long-horizon) +- Insight: decompose into human-creative + agent-managed/executed via skill-based pipeline +- Method: 13 skills + 7-layer verification stack +- Result: 24 problem types, 40 implemented reductions, 52 graph edges +- Contribution: methodology + verification stack + open-source artifact + +- [ ] **Step 4: Write section heading stubs** + +Add empty section headings (S1 through S8) matching the spec outline: +1. Introduction +2. Why Reductions? The Goldilocks Domain +3. System Architecture +4. Skill-Based Task Decomposition +5. Multi-Layered Verification +6. Evaluation +7. Related Work +8. Discussion & Conclusion + +- [ ] **Step 5: Verify scaffolding compiles** + +Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` +Expected: PDF with title, abstract, and empty section headings. No errors. + +- [ ] **Step 6: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ docs/paper/arxiv/references.bib +git commit -m "docs(arxiv): paper scaffolding with bibliography and abstract" +``` + +--- + +### Task 2: Data gathering — reduction graph metrics + +**Files:** +- Create: `docs/paper/arxiv/data/graph-metrics.json` + +**Note:** The file `docs/src/reductions/reduction_graph.json` has a corrupted header (partial JSON + log message before line 10). Regenerate it first or parse from the second valid copy starting after `JSON content:`. + +- [ ] **Step 1: Regenerate the reduction graph JSON** + +```bash +make rust-export +``` + +This regenerates `docs/src/reductions/reduction_graph.json` with clean content. Verify it starts with valid JSON: + +```bash +python3 -c "import json; json.load(open('docs/src/reductions/reduction_graph.json'))" +``` + +If the file is still corrupted after `make rust-export`, extract the valid portion: + +```bash +python3 -c " +content = open('docs/src/reductions/reduction_graph.json').read() +idx = content.find('JSON content:\n') +if idx >= 0: + clean = content[idx+len('JSON content:\n'):] + open('docs/src/reductions/reduction_graph.json', 'w').write(clean) + print('Fixed corrupted JSON') +else: + print('JSON is clean') +" +``` + +- [ ] **Step 2: Count nodes, edges, and types** + +```bash +python3 -c " +import json +data = json.load(open('docs/src/reductions/reduction_graph.json')) +nodes = data['nodes'] +edges = data['edges'] +names = sorted(set(n['name'] for n in nodes)) +print(f'Unique problem types: {len(names)}') +print(f'Variant nodes: {len(nodes)}') +print(f'Total directed edges: {len(edges)}') +print(f'Types: {names}') +" +``` + +Expected: ~24 types, ~42 variant nodes, ~52 edges. + +Count implemented ReduceTo impls (the "40 reductions" number): + +```bash +grep -c 'impl.*ReduceTo' src/rules/*_*.rs | awk -F: '{s+=$2} END {print "Total ReduceTo impls:", s}' +``` + +Expected: ~40. Inferred variant edges = total edges - ReduceTo impls. + +- [ ] **Step 3: Compute hub node degrees** + +```bash +python3 -c " +import json +from collections import Counter +data = json.load(open('docs/src/reductions/reduction_graph.json')) +in_deg = Counter() +out_deg = Counter() +for e in data['edges']: + src_name = next(n['name'] for n in data['nodes'] if n == e.get('source') or (n['name'] == e['source'].get('name', '') if isinstance(e['source'], dict) else False)) + # Simpler: just use source/target indices +for e in data['edges']: + in_deg[e['target']['name']] += 1 + out_deg[e['source']['name']] += 1 +print('Top in-degree (reduce TO this):') +for name, cnt in in_deg.most_common(5): + print(f' {name}: {cnt}') +print('Top out-degree (reduce FROM this):') +for name, cnt in out_deg.most_common(5): + print(f' {name}: {cnt}') +" +``` + +Record QUBO and ILP in-degrees, MIS and SAT out-degrees for S2. + +- [ ] **Step 4: Count LOC per reduction (excluding casts files)** + +```bash +for f in src/rules/*_*.rs; do + case "$f" in *_casts.rs) continue;; esac + echo "$(wc -l < "$f") $f" +done | sort -n +``` + +Record min, max, median for the "~50-200 LOC" claim. + +- [ ] **Step 5: Save metrics to data file** + +```bash +mkdir -p docs/paper/arxiv/data +``` + +Write a JSON file at `docs/paper/arxiv/data/graph-metrics.json` containing: +```json +{ + "unique_types": 24, + "variant_nodes": 42, + "total_edges": 52, + "reduceto_impls": 40, + "inferred_edges": 12, + "hub_in_degree": {"QUBO": N, "ILP": N}, + "hub_out_degree": {"MIS": N, "SAT": N}, + "loc_per_reduction": {"min": N, "max": N, "median": N} +} +``` + +Fill in actual numbers from Steps 2-4. + +- [ ] **Step 6: Commit** + +```bash +git add -f docs/paper/arxiv/data/graph-metrics.json +git commit -m "docs(arxiv): gather reduction graph metrics" +``` + +--- + +## Chunk 2: Figures + +**Conventions for all figure files:** +- Use `#set page(width: auto, height: auto, margin: 5pt)` for standalone compilation. +- To use `docs/paper/lib.typ` primitives, import with relative path: `#import "../../lib.typ"` (from `docs/paper/arxiv/images/`). +- Each file must export a public function (e.g., `#let reduction-graph() = { ... }`) for import into `paper.typ`. +- Verify standalone: `typst compile docs/paper/arxiv/images/.typ` — expected: PDF output, no errors. +- Import into paper: `#import "images/.typ": ` + +### Task 3: Figure 1 — Reduction graph + +**Files:** +- Create: `docs/paper/arxiv/images/reduction-graph.typ` +- Modify: `docs/paper/arxiv/paper.typ` + +- [ ] **Step 1: Define node positions by category** + +Create `docs/paper/arxiv/images/reduction-graph.typ`. Read the graph data from `docs/paper/arxiv/data/graph-metrics.json` and the full graph from `docs/src/reductions/reduction_graph.json`. + +Use a column-based layout by category: +- Column 1 (blue): graph problems (MIS, MaxClique, MaxCut, MinVC, MinDS, MaxMatching, MaximalIS, KColoring, TSP, SpinGlass, BicliqueCover) +- Column 2 (green): formula problems (SAT, k-SAT, CircuitSAT) +- Column 3 (orange): set problems (MinSetCovering, MaxSetPacking) +- Column 4 (purple): algebraic problems (QUBO, ILP, CVP, BMF) +- Column 5 (gray): misc problems (BinPacking, PaintShop, Factoring, Knapsack) + +Place QUBO and ILP centrally as hub nodes (larger circles). + +For base problem types only (not all 42 variants — use the 24 unique names). Add a note in the caption about variant nodes. + +Import graph drawing utilities: `#import "../../lib.typ"` for `g-node`, `g-edge` if helpful, or use raw CeTZ. + +- [ ] **Step 2: Draw edges from the graph data** + +Add directed edges (arrows) between nodes based on the reduction graph edges. Use `mark: (end: "straight")` for arrow heads. Group edges by category with consistent styling. + +- [ ] **Step 3: Add legend and caption** + +Add a color legend for the 5 categories. Define the exported function: `#let reduction-graph() = { ... }`. + +- [ ] **Step 4: Verify figure compiles standalone** + +Run: `typst compile docs/paper/arxiv/images/reduction-graph.typ` +Expected: PDF of the reduction graph, no errors. + +- [ ] **Step 5: Import into paper.typ in S2** + +Add to `paper.typ`: +```typst +#import "images/reduction-graph.typ": reduction-graph +``` + +In S2, place: +```typst +#figure( + reduction-graph(), + caption: [The reduction graph: 24 problem types connected by 52 directed edges (40 implemented reductions + 12 inferred variant edges). Hub nodes QUBO and ILP are highlighted.] +) +``` + +- [ ] **Step 6: Commit** + +```bash +git add -f docs/paper/arxiv/images/reduction-graph.typ docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): add Figure 1 — reduction graph" +``` + +--- + +### Task 4: Figure 3 — Pipeline diagram + +**Files:** +- Create: `docs/paper/arxiv/images/pipeline.typ` +- Modify: `docs/paper/arxiv/paper.typ` + +- [ ] **Step 1: Create pipeline diagram** + +Use Fletcher (`@preview/fletcher:0.5.8`) for a flowchart showing the two-stage card-based pipeline. + +Structure: +``` +Contributor ──→ [Issue] ──→ Backlog + │ Maintainer moves card + ▼ + [Ready] + │ project-pipeline (agent) + ▼ + [In Progress] + │ issue-to-pr → check-issue → add-model/add-rule → review + ▼ + [review-agentic] + │ review-pipeline (agent) + │ fix Copilot comments → agentic tests → fix CI + ▼ + [In Review] + │ Maintainer merges + ▼ + [Done] +``` + +Color-code: human decisions in warm color (orange/gold), agent actions in cool color (blue/teal). Board columns as rounded rectangles. + +Export as: `#let pipeline-diagram() = { ... }` + +- [ ] **Step 2: Verify figure compiles standalone** + +Run: `typst compile docs/paper/arxiv/images/pipeline.typ` +Expected: PDF of pipeline flowchart, no errors. + +- [ ] **Step 3: Import into paper.typ in S4** + +Add `#import "images/pipeline.typ": pipeline-diagram` and place: +```typst +#figure( + pipeline-diagram(), + caption: [Two-stage card-based pipeline. Human decisions (orange) are limited to Backlog→Ready and In Review→Done. Agent manages everything in between.] +) +``` + +- [ ] **Step 4: Commit** + +```bash +git add -f docs/paper/arxiv/images/pipeline.typ docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): add Figure 3 — card-based pipeline diagram" +``` + +--- + +### Task 5: Figure 4 — Verification pyramid + +**Files:** +- Create: `docs/paper/arxiv/images/verification-pyramid.typ` +- Modify: `docs/paper/arxiv/paper.typ` + +- [ ] **Step 1: Create verification pyramid figure** + +Draw a layered pyramid/stack using CeTZ with 7 layers, widest at bottom: + +``` +Layer 7: Documentation (proof sketch) ← catches: logical errors +Layer 6: Agentic review (parallel subagents) ← catches: convention violations +Layer 5: Materialized fixtures (JSON ground truth) ← catches: test gaming +Layer 4: Overhead validation (symbolic exprs) ← catches: formula errors +Layer 3: Closed-loop tests (round-trip) ← catches: mapping errors +Layer 2: Unit tests (eval, serialization) ← catches: evaluation errors +Layer 1: Type system (Rust compiler) ← catches: API misuse +``` + +Each layer labeled with mechanism (left) and error class caught (right). Color gradient from automated (bottom, blue) to human-readable (top, gold). + +Export as: `#let verification-pyramid() = { ... }` + +- [ ] **Step 2: Verify figure compiles standalone** + +Run: `typst compile docs/paper/arxiv/images/verification-pyramid.typ` +Expected: PDF of pyramid, no errors. + +- [ ] **Step 3: Import into paper.typ in S5** + +Add `#import "images/verification-pyramid.typ": verification-pyramid` and place: +```typst +#figure( + verification-pyramid(), + caption: [Seven-layer verification stack. Lower layers (blue) are fully automated; upper layers (gold) involve human-readable arguments.] +) +``` + +- [ ] **Step 4: Commit** + +```bash +git add -f docs/paper/arxiv/images/verification-pyramid.typ docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): add Figure 4 — verification pyramid" +``` + +--- + +### Task 6: Figure 2 — System architecture + +**Files:** +- Create: `docs/paper/arxiv/images/architecture.typ` +- Modify: `docs/paper/arxiv/paper.typ` + +- [ ] **Step 1: Create architecture diagram** + +Use Fletcher or CeTZ to show the key traits and compile-time validation: + +``` +┌─────────────────────────────────────┐ +│ Problem trait │ +│ NAME, Metric, dims(), evaluate() │ +├──────────────┬──────────────────────┤ +│ Optimization │ Satisfaction │ +│ SolutionSize │ bool │ +│ direction() │ │ +└──────┬───────┴──────────────────────┘ + │ ReduceTo + ▼ +┌─────────────────────────────────────┐ +│ ReductionResult │ +│ target_problem() + extract_solution│ +└──────┬──────────────────────────────┘ + │ #[reduction(overhead = {...})] + ▼ +┌─────────────────────────────────────┐ +│ Compile-time validation │ +│ • Variable names → getter methods │ +│ • Expr AST: symbolic overhead │ +│ • declare_variants! → registry │ +└─────────────────────────────────────┘ +``` + +Keep compact. Focus on the verification-enabling aspects. + +Export as: `#let architecture-diagram() = { ... }` + +- [ ] **Step 2: Verify figure compiles standalone** + +Run: `typst compile docs/paper/arxiv/images/architecture.typ` +Expected: PDF of architecture diagram, no errors. + +- [ ] **Step 3: Import into paper.typ in S3** + +Add `#import "images/architecture.typ": architecture-diagram` and place: +```typst +#figure( + architecture-diagram(), + caption: [System architecture: the trait hierarchy and compile-time validation enforce round-trip testing capability by construction.] +) +``` + +- [ ] **Step 4: Commit** + +```bash +git add -f docs/paper/arxiv/images/architecture.typ docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): add Figure 2 — system architecture" +``` + +--- + +## Chunk 3: Sections S1-S4 + +**Convention:** All "Verify compiles" steps use: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf`. Expected: no errors. All sections use citation format `@BibKey` (e.g., `@Thai2025SWEEVO`). Before writing any section, first read `paper.typ` to understand the heading style and formatting conventions established in Task 1. + +**Page budget reference** (two-column format, ~500 words/page): +- S1: ~1.5 pages (~750 words) +- S2: ~1 page (~500 words) +- S3: ~1.5 pages (~750 words) +- S4: ~2 pages (~1000 words) + +### Task 7: Write S1 — Introduction + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` + +- [ ] **Step 1: Write introduction body (~750 words)** + +First read `paper.typ` to understand the heading format. Then write S1 within the existing `= Introduction` stub. Structure: + +1. Opening paragraph: agents hit 70-80% on SWE-Bench but ~20% on long-horizon → cite `@Thai2025SWEEVO`, `@Deng2025SWEBenchPro` +2. Our thesis: bottleneck is decomposition, not capability +3. "Review is harder than generation" for mathematical code → cite `@Roychoudhury2025AgenticAI` +4. Three roles paragraph: contributors (creative issues), maintainer (board + skills), agents (manage + execute) +5. Contributions list (3 items from spec) +6. Paper organization paragraph + +- [ ] **Step 2: Verify compiles** + +- [ ] **Step 3: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): write S1 Introduction" +``` + +--- + +### Task 8: Write S2 — Why Reductions? + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` +**Depends on:** Task 2 (graph metrics), Task 3 (Figure 1) + +- [ ] **Step 1: Write S2 body (~500 words)** + +Read graph metrics from `docs/paper/arxiv/data/graph-metrics.json` for concrete numbers. If not yet available, use: 24 types, 42 variants, 52 edges, 40 implemented, 12 inferred. + +Structure: +1. Goldilocks domain paragraph: self-contained (~50-200 LOC), formally specified, automatable round-trip criterion +2. Contrast with SWE-Bench: homogeneous tasks enable comparison +3. Hardware solvers paragraph: Rydberg atoms for MIS (cite `@lucas2014`), D-Wave for QUBO/Ising (cite `@glover2019`) → the graph as compilation layer +4. Real-world applications paragraph: SDN→ILP, airline→SetCovering, VLSI→coloring, logistics→TSP +5. Reference `@fig:reduction-graph` (placed by Task 3) + +- [ ] **Step 2: Verify compiles** + +- [ ] **Step 3: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): write S2 Why Reductions — Goldilocks domain" +``` + +--- + +### Task 9: Write S3 — System Architecture + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` +**Depends on:** Task 6 (Figure 2) + +- [ ] **Step 1: Write S3 body (~750 words)** + +Use the trait hierarchy from CLAUDE.md's Architecture section for reference. Do NOT read source files — the CLAUDE.md summary has sufficient detail. Full trait code belongs in supplementary material. + +Structure: +1. Problem trait: `evaluate()` enables brute-force verification of any configuration +2. ReduceTo trait: type system enforces round-trip capability by construction +3. `#[reduction(overhead)]` proc macro: compile-time validation of overhead expressions +4. `declare_variants!`: registry enables automated graph export + completeness checking +5. Design philosophy paragraph: reduce the space of possible agent errors +6. Reference `@fig:architecture` (placed by Task 6) + +- [ ] **Step 2: Verify compiles** + +- [ ] **Step 3: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): write S3 System Architecture" +``` + +--- + +### Task 10: Write S4 — Skill-Based Task Decomposition + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` +**Depends on:** Task 4 (Figure 3) + +- [ ] **Step 1: Write S4.1 — Three Roles (~200 words)** + +The roles table from the spec (Contributor/Maintainer/Agent with responsibilities and examples). Brief narrative explaining the human-agent boundary. + +- [ ] **Step 2: Read skill files and extract metadata** + +Read all 13 skill files (`.claude/skills/*/SKILL.md`). For each, record: name, one-line description, invocation trigger, step count. This data populates Table 1. + +- [ ] **Step 3: Write S4.2 — Skills as Agent Functions (~500 words)** + +Group the 13 skills into 5 categories (from spec): +- **Orchestration** (4): project-pipeline, review-pipeline, issue-to-pr, meta-power +- **Implementation** (2): add-model, add-rule +- **Quality gate** (4): check-issue, check-rule-redundancy, review-implementation, fix-pr +- **Documentation** (2): write-model-in-paper, write-rule-in-paper +- **Release** (1): release + +For each group, write 1-2 sentences explaining the pattern. Create Table 1 with columns: Skill, Category, Trigger, Typical Turns (estimate from step count / 3), Success Rate (use "TBD" — will be filled after Task 11). + +- [ ] **Step 4: Write S4.3 — Card-Based Orchestration (~300 words)** + +Two-stage pipeline (project-pipeline → review-pipeline). Human touches only Backlog→Ready and In Review→Done. Reference `@fig:pipeline` (placed by Task 4). + +- [ ] **Step 5: Verify compiles** + +- [ ] **Step 6: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): write S4 Skill-Based Task Decomposition" +``` + +--- + +## Chunk 4: Sections S5-S6 + +### Task 11: Git history mining script + +**Files:** +- Create: `docs/paper/arxiv/scripts/mine-git-history.py` +- Create: `docs/paper/arxiv/data/git-mining-results.json` + +- [ ] **Step 1: Create directories** + +```bash +mkdir -p docs/paper/arxiv/scripts docs/paper/arxiv/data +``` + +- [ ] **Step 2: Write PR listing and field extraction** + +Write `docs/paper/arxiv/scripts/mine-git-history.py` — Part 1: list all merged PRs with `[Rule]` or `[Model]` in the title. + +```bash +gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --json number,title,author,createdAt,mergedAt,labels,headRefName +``` + +For each PR, extract: number, title, author login, created date, merged date, whether title contains `[Rule]` or `[Model]`. + +Author classification: if `author.login` contains `[bot]` or is `github-actions`, classify as "agent"; otherwise "human". + +- [ ] **Step 3: Add phase classification and CI status** + +Add to the script: + +**Phase boundaries** (based on when key skills were introduced — determine by running): +```bash +git log --all --oneline --diff-filter=A -- '.claude/skills/add-rule/SKILL.md' | tail -1 +git log --all --oneline --diff-filter=A -- '.claude/skills/project-pipeline/SKILL.md' | tail -1 +``` + +Define phases: +- Phase 1 (manual): PRs before add-model/add-rule skills existed +- Phase 2 (basic skills): PRs after implementation skills but before pipeline skills +- Phase 3 (full pipeline): PRs after project-pipeline/review-pipeline skills existed + +For CI status on first push, use: +```bash +gh api repos/CodingThrust/ProblemReductions/pulls/{number}/commits --jq '.[0].sha' +``` +Then check that SHA's status. This is optional — skip if the API calls are too slow. + +- [ ] **Step 4: Run script and save results** + +```bash +python3 docs/paper/arxiv/scripts/mine-git-history.py > docs/paper/arxiv/data/git-mining-results.json +``` + +Expected output schema: +```json +{ + "summary": { + "total_prs": N, + "rule_prs": N, + "model_prs": N, + "agent_authored": N, + "human_authored": N + }, + "by_phase": [ + {"phase": 1, "label": "manual", "count": N, "agent_count": N}, + {"phase": 2, "label": "basic_skills", "count": N, "agent_count": N}, + {"phase": 3, "label": "full_pipeline", "count": N, "agent_count": N} + ], + "prs": [ + {"number": 42, "title": "...", "is_agent": false, "phase": 1, "type": "Rule"} + ] +} +``` + +- [ ] **Step 5: Commit** + +```bash +git add -f docs/paper/arxiv/scripts/ docs/paper/arxiv/data/git-mining-results.json +git commit -m "docs(arxiv): git history mining script and results" +``` + +--- + +### Task 12: Write S5 — Multi-Layered Verification + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` +**Depends on:** Task 5 (Figure 4) + +- [ ] **Step 1: Write S5.1 — The Verification Stack (~500 words)** + +Write the 7-layer table from the spec. Use these concrete error examples for each layer (constructed from the domain): + +| Layer | Mechanism | Example Error Caught | +|-------|-----------|---------------------| +| 1. Type system | Rust compiler | Agent returns `bool` instead of `SolutionSize` from `evaluate()` | +| 2. Unit tests | `test_*_basic` | Agent evaluates MaxCut objective with wrong sign (sum vs difference) | +| 3. Closed-loop tests | `test_*_to_*_closed_loop` | SAT→MIS reduction maps clause variables to wrong vertex indices | +| 4. Overhead validation | Symbolic expr vs sizes | Agent writes `num_edges = num_clauses` instead of `3 * num_clauses` | +| 5. Materialized fixtures | JSON ground truth | Agent changes expected QUBO matrix values to make failing test pass | +| 6. Agentic review | Parallel subagents | Missing `declare_variants!` macro, wrong file naming convention | +| 7. Documentation | Proof sketch | Reduction proof assumes graph is connected but problem allows disconnected | + +Reference `@fig:verification` (placed by Task 5). + +- [ ] **Step 2: Write S5.2 — Why Layers? (~250 words)** + +The "lazy agent" problem: agents take the shortest path to close an issue (e.g., changing expected test values instead of fixing bugs). Materialized test data (Layer 5) prevents this. No single layer is sufficient. Cross-reference Table 2 in S6. + +- [ ] **Step 3: Verify compiles** + +Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` + +- [ ] **Step 4: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): write S5 Multi-Layered Verification" +``` + +--- + +### Task 13: Write S6 — Evaluation + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` +**Depends on:** Task 11 (git mining data) + +- [ ] **Step 1: Write S6.1 — Ablation setup (~400 words)** + +Describe the experimental DESIGN (actual results are `[TBD: ablation not yet run]` placeholders): +- Setup: select 5-10 reductions of varying complexity +- Two configurations: skill-based (full pipeline) vs no-skill baseline (raw agent + CLAUDE.md only) +- Metrics: first-attempt CI pass rate, review rounds, final correctness, convention adherence +- Framing: "controlled illustration" (n=5-10), not statistically powered experiment + +- [ ] **Step 2: Write S6.2 — Git History Mining results (~500 words)** + +Read data from `docs/paper/arxiv/data/git-mining-results.json`. If not yet available, use `[TBD: data]` placeholders. + +Write up agent vs human implementation counts, success rates stratified by phase. + +Create Table 2 (error taxonomy × verification layer matrix): + +| Error Category | Layer | Example | Count | +|---------------|-------|---------|-------| +| Type errors | 1 (type system) | Wrong return type | [TBD] | +| Mapping errors | 3 (closed-loop) | Wrong vertex index | [TBD] | +| Formula errors | 4 (overhead) | Linear vs quadratic | [TBD] | +| Test gaming | 5 (fixtures) | Changed expected value | [TBD] | +| Convention violations | 6 (review) | Missing macro | [TBD] | +| Logical errors | 7 (documentation) | Invalid proof | [TBD] | + +- [ ] **Step 3: Write S6.3 — Case Studies (~600 words)** + +Three reductions spanning the complexity spectrum. For each, find the actual PR by searching: + +```bash +gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "MinimumVertexCover MaximumIndependentSet" --json number,title +gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "Satisfiability MaximumIndependentSet" --json number,title +gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "Factoring CircuitSAT" --json number,title +``` + +If PRs are found, reference them and analyze the pipeline trace (skills activated, human decisions, errors caught). If not found, describe the expected pipeline trace based on the skill definitions. + +**Case 1 — Simple (MVC→MIS):** complement relationship, ~30 LOC, smooth pipeline. +**Case 2 — Complex (SAT→MIS):** clause-variable gadget, quadratic blowup, agent mistakes in edge counts. +**Case 3 — Composition (Factoring→CircuitSAT→ILP):** two independent reductions that compose in the graph. Analyze each separately, then show graph-level composition. + +- [ ] **Step 4: Verify compiles** + +Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` + +- [ ] **Step 5: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): write S6 Evaluation" +``` + +--- + +## Chunk 5: Sections S7-S8 + Final Assembly + +### Task 14: Write S7 — Related Work + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` + +- [ ] **Step 1: Write S7 body (~500 words)** + +Four subsections, each 1-2 paragraphs. Use these specific citation keys: + +1. **AI coding agents:** `@Yang2024SWEagent`, `@Wang2024OpenHands`, `@Anthropic2025ClaudeCode`, `@Wu2024Devin`, `@Thai2025SWEEVO` (SWE-EVO ~20%), `@Deng2025SWEBenchPro` (SWE-Bench Pro ~45%), `@Xia2025LiveSWEagent` (self-evolution complementary to skills), `@Roychoudhury2025AgenticAI` (agentic SE perspective), `@Anthropic2026AgenticCoding` (developer-AI collaboration survey) + +2. **AI-discovered reductions:** `@Novikov2025AlphaEvolve` (NP-hardness gadgets), `@Janicic2025URSA` (SAT-based verification), `@RomeraParedes2023FunSearch`. Our work is complementary: we implement/verify known reductions, not discover new ones. + +3. **Formal verification:** `@Bursuc2025VeriCoding`, `@Thakur2025CLEVER`, `@Miranda2025VeriBench`, `@Mukherjee2025CoqPL`, `@Mukherjee2025SynVer`. Our approach: pragmatic multi-layer verification vs end-to-end formal proofs. + +4. **Physics-inspired optimization:** `@Schuetz2022PhysicsGNN` (GNN/QUBO for MIS/MaxCut/MinVC at million-variable scale), `@He2024QuantumTSP`. Our graph provides the verified compilation layer connecting problems to these solvers. + +For each: position our work as complementary, not competing. + +- [ ] **Step 2: Verify compiles** + +Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` + +- [ ] **Step 3: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): write S7 Related Work" +``` + +--- + +### Task 15: Write S8 — Discussion & Conclusion + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` + +- [ ] **Step 1: Write S8 body (~500 words)** + +Four parts from spec, then a concluding subsection: + +1. **Generalizability:** Goldilocks property, candidate domains (compiler peephole rules, algebraic identities, protocol verification lemmas) +2. **Limitations:** n=1 threat, skill engineering cost, domain specificity, git mining confounds (addressed by stratification), maintainer requirement +3. **Human value proposition:** repositioned not eliminated, creativity + judgment remains human. Cite `@Anthropic2026AgenticCoding` for the broader trend. +4. **Future directions:** AlphaEvolve integration (cite `@Novikov2025AlphaEvolve`), formal verification (cite `@Bursuc2025VeriCoding`), scaling to 100+ problems + +End with a `=== Conclusion` subsection: 2-3 crisp sentences restating the thesis and key result. + +- [ ] **Step 2: Verify compiles** + +Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` + +- [ ] **Step 3: Commit** + +```bash +git add -f docs/paper/arxiv/paper.typ +git commit -m "docs(arxiv): write S8 Discussion and Conclusion" +``` + +--- + +### Task 16: Final assembly and polish + +**Files:** +- Modify: `docs/paper/arxiv/paper.typ` + +- [ ] **Step 1: Verify all figures are placed correctly** + +Check that these figure references exist in the paper text: +- `@fig:reduction-graph` in S2 +- `@fig:architecture` in S3 +- `@fig:pipeline` in S4 +- `@fig:verification` in S5 + +Search for each label in `paper.typ`. If any is missing, add the reference. + +- [ ] **Step 2: Verify all tables are placed correctly** + +Check for Table 1 (skills inventory) in S4 and Table 2 (error taxonomy) in S6. + +- [ ] **Step 3: Verify all citations resolve** + +```bash +typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf 2>&1 | grep -i "warning\|error\|unknown\|not found" +``` + +Expected: no unresolved citation or label warnings. If any `@key` references are missing from `references.bib`, add them. + +- [ ] **Step 4: Check page count** + +```bash +typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf && python3 -c " +import subprocess +result = subprocess.run(['pdfinfo', 'docs/paper/arxiv/paper.pdf'], capture_output=True, text=True) +for line in result.stdout.splitlines(): + if 'Pages' in line: + print(line) +" +``` + +Expected: 10-12 pages. If over, identify sections to trim. If under, identify sections to expand. + +- [ ] **Step 5: Final compile and flag visual review** + +Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` + +Verify no warnings are emitted. Visual inspection (layout, orphans, figure legibility) requires human review — flag as TODO for the maintainer. + +- [ ] **Step 6: Commit final version** + +```bash +git add -f docs/paper/arxiv/paper.typ docs/paper/arxiv/images/ docs/paper/arxiv/references.bib +git commit -m "docs(arxiv): final paper assembly and polish" +``` + +Note: Do NOT commit `paper.pdf` — it is a build artifact. + +--- + +## Execution Notes + +### Dependency Graph + +``` +Task 1 (scaffolding) ──→ Task 2 (metrics) ──→ Task 8 (S2) + ──→ Tasks 3-6 (figures, parallel) + ──→ Task 11 (git mining) + +Task 3 (Fig 1) ──→ Task 8 (S2) +Task 4 (Fig 3) ──→ Task 10 (S4) +Task 5 (Fig 4) ──→ Task 12 (S5) +Task 6 (Fig 2) ──→ Task 9 (S3) + +Task 7 (S1): no figure dependency — can run after Task 1 +Task 11 (git mining) ──→ Task 13 (S6) +Task 14 (S7): independent — can run after Task 1 +Task 15 (S8): independent — can run after Task 1 +Task 16 (assembly): must run LAST +``` + +### Suggested Parallel Batches + +1. **Tasks 1-2** (scaffolding + data) — sequential, run first +2. **Tasks 3-6** (all figures) + **Task 7** (S1) + **Task 11** (git mining) — parallel +3. **Tasks 8-10** (S2-S4) + **Tasks 14-15** (S7-S8) — parallel (each depends on its figure from batch 2) +4. **Tasks 12-13** (S5-S6) — parallel (depend on Figure 4 + git mining from batch 2) +5. **Task 16** (assembly) — last + +### Open Dependencies + +- **S6.1 ablation results** are `[TBD]` placeholders. The ablation experiment is a separate effort outside this plan. The paper will contain placeholder markers until that data is available. +- **Table 1 success rates** are `[TBD]` — will be filled from git mining data (Task 11) if available, otherwise left as placeholders. From 608480771c64ba7a356202effcbca7e05b15e58f Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 21:10:23 +0800 Subject: [PATCH 04/38] docs(arxiv): update plan for hybrid LaTeX + Typst/CeTZ figures Switch figure generation from TikZ to Typst+CeTZ compiled to PDF, included in LaTeX via \includegraphics. Paper body remains LaTeX (IEEEtran class). Removed TikZ packages from preamble. Updated all figure tasks (3-6), conventions block, compile commands, and Task 17 assembly step. Co-Authored-By: Claude Opus 4.6 --- .../plans/2026-03-12-arxiv-paper-impl.md | 858 ++++++++++-------- .../specs/2026-03-12-arxiv-paper-design.md | 2 +- 2 files changed, 503 insertions(+), 357 deletions(-) diff --git a/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md b/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md index 33f2472a..78cfd427 100644 --- a/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md +++ b/docs/superpowers/plans/2026-03-12-arxiv-paper-impl.md @@ -4,14 +4,22 @@ **Goal:** Write a full research paper (~10-12 pages) on skill-based agentic coding for NP-hard problem reductions, targeting an ICSE/ASE-class venue. -**Architecture:** Typst document at `docs/paper/arxiv/paper.typ` with CeTZ figures, bibliography from survey, and data gathered from git history and the reduction graph. The existing `docs/paper/lib.typ` provides graph drawing utilities. +**Architecture:** LaTeX document at `docs/paper/arxiv/paper.tex` using IEEEtran class with figures generated in Typst+CeTZ (compiled to PDF, included via `\includegraphics`), bibliography from survey, and data gathered from git history and the reduction graph. -**Tech Stack:** Typst, CeTZ (`@preview/cetz:0.4.2`), ctheorems (`@preview/ctheorems:1.1.3`), fletcher (`@preview/fletcher:0.5.8`), BibTeX +**Tech Stack:** LaTeX (IEEEtran class), BibTeX, pdflatex, Typst+CeTZ (figures only) **Spec:** `docs/superpowers/specs/2026-03-12-arxiv-paper-design.md` -**Compile command** (used throughout): `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` -First compilation may download Typst packages — this is expected. +**Compile command** (used throughout): +```bash +# Compile Typst figures first +for f in docs/paper/arxiv/figures/*.typ; do typst compile "$f"; done +# Then build LaTeX +cd docs/paper/arxiv && pdflatex paper.tex && bibtex paper && pdflatex paper.tex && pdflatex paper.tex && cd - +``` +Or single-pass check (figures already compiled): `cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd -` + +**Review skill:** After writing is complete, use `academic-paper-reviewer` (installed at `.claude/skills/academic-research-skills/academic-paper-reviewer/`) for simulated 5-person peer review. --- @@ -19,12 +27,12 @@ First compilation may download Typst packages — this is expected. | File | Purpose | |------|---------| -| `docs/paper/arxiv/paper.typ` | Main paper document | +| `docs/paper/arxiv/paper.tex` | Main paper document (IEEEtran) | | `docs/paper/arxiv/references.bib` | Bibliography (merged from survey + existing paper refs) | -| `docs/paper/arxiv/images/reduction-graph.typ` | Figure 1: Reduction graph diagram | -| `docs/paper/arxiv/images/architecture.typ` | Figure 2: System architecture diagram | -| `docs/paper/arxiv/images/pipeline.typ` | Figure 3: Card-based pipeline diagram | -| `docs/paper/arxiv/images/verification-pyramid.typ` | Figure 4: Verification stack pyramid | +| `docs/paper/arxiv/figures/reduction-graph.typ` | Figure 1: Reduction graph (Typst+CeTZ → PDF) | +| `docs/paper/arxiv/figures/architecture.typ` | Figure 2: System architecture (Typst+CeTZ → PDF) | +| `docs/paper/arxiv/figures/pipeline.typ` | Figure 3: Card-based pipeline (Typst+CeTZ → PDF) | +| `docs/paper/arxiv/figures/verification-pyramid.typ` | Figure 4: Verification stack pyramid (Typst+CeTZ → PDF) | | `docs/paper/arxiv/data/graph-metrics.json` | Reduction graph metrics (from Task 2) | | `docs/paper/arxiv/data/git-mining-results.json` | Git history mining results (from Task 11) | | `docs/paper/arxiv/scripts/mine-git-history.py` | Git history mining script | @@ -33,10 +41,10 @@ First compilation may download Typst packages — this is expected. ## Chunk 1: Paper Scaffolding + Data Gathering -### Task 1: Set up paper.typ scaffolding +### Task 1: Set up paper.tex scaffolding **Files:** -- Create: `docs/paper/arxiv/paper.typ` +- Create: `docs/paper/arxiv/paper.tex` - Create: `docs/paper/arxiv/references.bib` - [ ] **Step 1: Create bibliography file** @@ -49,53 +57,85 @@ cp .claude/survey/agentic-coding-reductions/references.bib docs/paper/arxiv/refe Then append the following entries from `docs/paper/references.bib` (read that file and copy these exact `@` entries by key): `karp1972`, `cook1971`, `garey1979`, `glover2019`, `lucas2014`, `barahona1982`. These are foundational references not in the survey bib. -- [ ] **Step 2: Write paper.typ header and imports** +- [ ] **Step 2: Write paper.tex with IEEEtran class** + +Create `docs/paper/arxiv/paper.tex` with: -Create `docs/paper/arxiv/paper.typ` with: -- Imports: `@preview/cetz:0.4.2`, `@preview/fletcher:0.5.8`, `@preview/ctheorems:1.1.3` -- Page setup: A4, margins `(x: 2cm, y: 2.5cm)` -- Font: New Computer Modern, 10pt -- Two-column body via `#show: columns.with(2)` (after abstract) -- Numbered headings: `#set heading(numbering: "1.")` -- Bibliography: `#bibliography("references.bib", style: "ieee")` +```latex +\documentclass[conference]{IEEEtran} +\usepackage{cite} +\usepackage{amsmath,amssymb,amsfonts} +\usepackage{graphicx} +\usepackage{textcomp} +\usepackage{xcolor} +\usepackage{booktabs} +\usepackage{listings} +\usepackage{hyperref} +\usepackage{cleveref} -Reference `docs/paper/reductions.typ` for the exact Typst conventions used in the existing paper. +\begin{document} -- [ ] **Step 3: Write title, authors, and abstract** +\title{Skill-Based Agentic Coding for Mathematical Software:\\ +A Case Study in NP-Hard Problem Reductions} -Title: "Skill-Based Agentic Coding for Mathematical Software: A Case Study in NP-Hard Problem Reductions" +\author{...} % placeholder -Authors: (use placeholder affiliations for now) +\maketitle -Abstract (~150 words) covering: +\begin{abstract} +... +\end{abstract} + +\section{Introduction}\label{sec:intro} +\section{Why Reductions? The Goldilocks Domain}\label{sec:domain} +\section{System Architecture}\label{sec:architecture} +\section{Skill-Based Task Decomposition}\label{sec:skills} +\section{Multi-Layered Verification}\label{sec:verification} +\section{Evaluation}\label{sec:evaluation} +\section{Related Work}\label{sec:related} +\section{Discussion \& Conclusion}\label{sec:conclusion} + +\bibliographystyle{IEEEtran} +\bibliography{references} + +\end{document} +``` + +- [ ] **Step 3: Write abstract (~150 words)** + +Fill in the abstract covering: - Problem: agents fail at long-horizon math coding tasks (70-80% on SWE-Bench Verified, ~20% on long-horizon) - Insight: decompose into human-creative + agent-managed/executed via skill-based pipeline - Method: 13 skills + 7-layer verification stack - Result: 24 problem types, 40 implemented reductions, 52 graph edges - Contribution: methodology + verification stack + open-source artifact -- [ ] **Step 4: Write section heading stubs** +- [ ] **Step 4: Create figures directory** -Add empty section headings (S1 through S8) matching the spec outline: -1. Introduction -2. Why Reductions? The Goldilocks Domain -3. System Architecture -4. Skill-Based Task Decomposition -5. Multi-Layered Verification -6. Evaluation -7. Related Work -8. Discussion & Conclusion +```bash +mkdir -p docs/paper/arxiv/figures +``` - [ ] **Step 5: Verify scaffolding compiles** -Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` -Expected: PDF with title, abstract, and empty section headings. No errors. +```bash +cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd - +``` -- [ ] **Step 6: Commit** +Expected: PDF with title, abstract, and empty section headings. BibTeX warnings about missing refs are expected at this stage. + +- [ ] **Step 6: Remove old paper.typ** ```bash -git add -f docs/paper/arxiv/paper.typ docs/paper/arxiv/references.bib -git commit -m "docs(arxiv): paper scaffolding with bibliography and abstract" +rm -f docs/paper/arxiv/paper.typ +``` + +- [ ] **Step 7: Commit** + +```bash +git add -f docs/paper/arxiv/paper.tex docs/paper/arxiv/references.bib +git rm -f docs/paper/arxiv/paper.typ 2>/dev/null; true +git commit -m "docs(arxiv): LaTeX paper scaffolding with IEEEtran and bibliography" ``` --- @@ -163,25 +203,25 @@ Expected: ~40. Inferred variant edges = total edges - ReduceTo impls. - [ ] **Step 3: Compute hub node degrees** ```bash -python3 -c " +python3 << 'PYEOF' import json from collections import Counter data = json.load(open('docs/src/reductions/reduction_graph.json')) in_deg = Counter() out_deg = Counter() for e in data['edges']: - src_name = next(n['name'] for n in data['nodes'] if n == e.get('source') or (n['name'] == e['source'].get('name', '') if isinstance(e['source'], dict) else False)) - # Simpler: just use source/target indices -for e in data['edges']: - in_deg[e['target']['name']] += 1 - out_deg[e['source']['name']] += 1 + # edges use node dicts with 'name' field + src = e['source']['name'] if isinstance(e['source'], dict) else data['nodes'][e['source']]['name'] + tgt = e['target']['name'] if isinstance(e['target'], dict) else data['nodes'][e['target']]['name'] + in_deg[tgt] += 1 + out_deg[src] += 1 print('Top in-degree (reduce TO this):') for name, cnt in in_deg.most_common(5): print(f' {name}: {cnt}') print('Top out-degree (reduce FROM this):') for name, cnt in out_deg.most_common(5): print(f' {name}: {cnt}') -" +PYEOF ``` Record QUBO and ILP in-degrees, MIS and SAT out-degrees for S2. @@ -230,69 +270,101 @@ git commit -m "docs(arxiv): gather reduction graph metrics" ## Chunk 2: Figures -**Conventions for all figure files:** -- Use `#set page(width: auto, height: auto, margin: 5pt)` for standalone compilation. -- To use `docs/paper/lib.typ` primitives, import with relative path: `#import "../../lib.typ"` (from `docs/paper/arxiv/images/`). -- Each file must export a public function (e.g., `#let reduction-graph() = { ... }`) for import into `paper.typ`. -- Verify standalone: `typst compile docs/paper/arxiv/images/.typ` — expected: PDF output, no errors. -- Import into paper: `#import "images/.typ": ` +**Conventions for all figure files (Typst+CeTZ → PDF hybrid):** +- Each figure is a standalone `.typ` file in `docs/paper/arxiv/figures/`. +- Figures use `#set page(width: auto, height: auto, margin: 5pt)` for tight bounding box. +- Import CeTZ: `#import "@preview/cetz:0.4.2": canvas, draw`. +- Import the project graph library when useful: `#import "../../../lib.typ": g-node, g-edge, graph-colors`. +- Color scheme: graph=`rgb("#4e79a7")` (blue), formula=`rgb("#59a14f")` (green), set=`rgb("#e15759")` (orange-red), algebraic=`rgb("#b07aa1")` (purple), misc=`rgb("#999")` (gray). Human=`rgb("#f28e2b")` (orange), Agent=`rgb("#4e79a7")` (blue). +- Arrow style: `mark: (end: "straight")` for directed edges. +- Compile each figure to PDF: `typst compile docs/paper/arxiv/figures/filename.typ`. +- Include in LaTeX via `\includegraphics{figures/filename.pdf}`. +- Test figures by compiling individually before full paper build. +- Do NOT commit generated `.pdf` files — they are build artifacts. ### Task 3: Figure 1 — Reduction graph **Files:** -- Create: `docs/paper/arxiv/images/reduction-graph.typ` -- Modify: `docs/paper/arxiv/paper.typ` +- Create: `docs/paper/arxiv/figures/reduction-graph.typ` +- Modify: `docs/paper/arxiv/paper.tex` + +- [ ] **Step 1: Create reduction graph figure** -- [ ] **Step 1: Define node positions by category** +Create `docs/paper/arxiv/figures/reduction-graph.typ`. Read the graph data from `docs/src/reductions/reduction_graph.json` for edge connectivity. -Create `docs/paper/arxiv/images/reduction-graph.typ`. Read the graph data from `docs/paper/arxiv/data/graph-metrics.json` and the full graph from `docs/src/reductions/reduction_graph.json`. +```typst +#import "@preview/cetz:0.4.2": canvas, draw +#set page(width: auto, height: auto, margin: 5pt) +#set text(size: 7pt) + +// Category colors +#let cat-graph = rgb("#4e79a7") +#let cat-formula = rgb("#59a14f") +#let cat-set = rgb("#e15759") +#let cat-algebraic = rgb("#b07aa1") +#let cat-misc = rgb("#999") + +#canvas(length: 1cm, { + import draw: * + + // Node positions by category (column-based layout) + // Column 1: graph problems, Column 2: formula, etc. + // Place QUBO and ILP centrally as hub nodes (larger radius) + + // ... define positions for all 24 unique problem types ... + // ... draw directed edges from graph JSON ... + // ... add legend box ... +}) +``` Use a column-based layout by category: - Column 1 (blue): graph problems (MIS, MaxClique, MaxCut, MinVC, MinDS, MaxMatching, MaximalIS, KColoring, TSP, SpinGlass, BicliqueCover) - Column 2 (green): formula problems (SAT, k-SAT, CircuitSAT) -- Column 3 (orange): set problems (MinSetCovering, MaxSetPacking) -- Column 4 (purple): algebraic problems (QUBO, ILP, CVP, BMF) -- Column 5 (gray): misc problems (BinPacking, PaintShop, Factoring, Knapsack) +- Column 3 (orange-red): set problems (MinSetCovering, MaxSetPacking) +- Column 4 (purple): algebraic problems (QUBO, ILP, CVP, BMF, Knapsack) +- Column 5 (gray): misc problems (BinPacking, PaintShop, Factoring) -Place QUBO and ILP centrally as hub nodes (larger circles). +Place QUBO and ILP centrally as hub nodes (larger circles, `radius: 0.4` vs `0.2`). Use the 24 unique problem type names (not all 42 variants). Mention variants in caption. -For base problem types only (not all 42 variants — use the 24 unique names). Add a note in the caption about variant nodes. +For each node, use `draw.circle(pos, radius: r, fill: cat-color.lighten(70%), stroke: 0.5pt + cat-color, name: id)` and `draw.content(id, text(6pt, abbreviation))`. -Import graph drawing utilities: `#import "../../lib.typ"` for `g-node`, `g-edge` if helpful, or use raw CeTZ. +For directed edges, use `draw.line(src, tgt, stroke: 0.4pt + luma(100), mark: (end: "straight", scale: 0.4))`. Keep edges thin to avoid clutter with 52 edges. -- [ ] **Step 2: Draw edges from the graph data** +Add a small legend box in one corner with the 5 category colors. -Add directed edges (arrows) between nodes based on the reduction graph edges. Use `mark: (end: "straight")` for arrow heads. Group edges by category with consistent styling. +- [ ] **Step 2: Compile figure to PDF** -- [ ] **Step 3: Add legend and caption** +```bash +typst compile docs/paper/arxiv/figures/reduction-graph.typ +``` -Add a color legend for the 5 categories. Define the exported function: `#let reduction-graph() = { ... }`. +Verify output: `docs/paper/arxiv/figures/reduction-graph.pdf` exists. -- [ ] **Step 4: Verify figure compiles standalone** +- [ ] **Step 3: Include in paper.tex** -Run: `typst compile docs/paper/arxiv/images/reduction-graph.typ` -Expected: PDF of the reduction graph, no errors. +In Section 2, add: +```latex +\begin{figure*}[t] + \centering + \includegraphics[width=\textwidth]{figures/reduction-graph.pdf} + \caption{The reduction graph: 24 problem types connected by 52 directed edges (40 implemented reductions + 12 inferred variant edges). Hub nodes QUBO and ILP are highlighted.} + \label{fig:reduction-graph} +\end{figure*} +``` -- [ ] **Step 5: Import into paper.typ in S2** +Use `figure*` for full-width in two-column layout. -Add to `paper.typ`: -```typst -#import "images/reduction-graph.typ": reduction-graph -``` +- [ ] **Step 4: Verify full paper compiles** -In S2, place: -```typst -#figure( - reduction-graph(), - caption: [The reduction graph: 24 problem types connected by 52 directed edges (40 implemented reductions + 12 inferred variant edges). Hub nodes QUBO and ILP are highlighted.] -) +```bash +cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd - ``` -- [ ] **Step 6: Commit** +- [ ] **Step 5: Commit** ```bash -git add -f docs/paper/arxiv/images/reduction-graph.typ docs/paper/arxiv/paper.typ -git commit -m "docs(arxiv): add Figure 1 — reduction graph" +git add -f docs/paper/arxiv/figures/reduction-graph.typ docs/paper/arxiv/paper.tex +git commit -m "docs(arxiv): add Figure 1 — reduction graph (Typst+CeTZ)" ``` --- @@ -300,58 +372,74 @@ git commit -m "docs(arxiv): add Figure 1 — reduction graph" ### Task 4: Figure 3 — Pipeline diagram **Files:** -- Create: `docs/paper/arxiv/images/pipeline.typ` -- Modify: `docs/paper/arxiv/paper.typ` +- Create: `docs/paper/arxiv/figures/pipeline.typ` +- Modify: `docs/paper/arxiv/paper.tex` - [ ] **Step 1: Create pipeline diagram** -Use Fletcher (`@preview/fletcher:0.5.8`) for a flowchart showing the two-stage card-based pipeline. +Create `docs/paper/arxiv/figures/pipeline.typ` using CeTZ: -Structure: -``` -Contributor ──→ [Issue] ──→ Backlog - │ Maintainer moves card - ▼ - [Ready] - │ project-pipeline (agent) - ▼ - [In Progress] - │ issue-to-pr → check-issue → add-model/add-rule → review - ▼ - [review-agentic] - │ review-pipeline (agent) - │ fix Copilot comments → agentic tests → fix CI - ▼ - [In Review] - │ Maintainer merges - ▼ - [Done] +```typst +#import "@preview/cetz:0.4.2": canvas, draw +#set page(width: auto, height: auto, margin: 5pt) +#set text(size: 8pt) + +#let human-color = rgb("#f28e2b") +#let agent-color = rgb("#4e79a7") + +#canvas(length: 1cm, { + import draw: * + + // Board columns as rounded rectangles, connected vertically + // Color-code: human decisions in orange, agent actions in blue + // Layout: + // Contributor → [Issue] → [Backlog] + // │ Maintainer moves card (orange) + // ▼ + // [Ready] + // │ project-pipeline (blue) + // ▼ + // [In Progress] + // │ issue-to-pr → check → implement → review (blue) + // ▼ + // [review-agentic] + // │ review-pipeline (blue) + // ▼ + // [In Review] + // │ Maintainer merges (orange) + // ▼ + // [Done] + + // Use rect(..., radius: 4pt) for rounded board columns + // Use line() with mark: (end: "straight") for arrows + // Add action labels on edges with draw.content() +}) ``` -Color-code: human decisions in warm color (orange/gold), agent actions in cool color (blue/teal). Board columns as rounded rectangles. - -Export as: `#let pipeline-diagram() = { ... }` - -- [ ] **Step 2: Verify figure compiles standalone** +- [ ] **Step 2: Compile figure to PDF** -Run: `typst compile docs/paper/arxiv/images/pipeline.typ` -Expected: PDF of pipeline flowchart, no errors. +```bash +typst compile docs/paper/arxiv/figures/pipeline.typ +``` -- [ ] **Step 3: Import into paper.typ in S4** +- [ ] **Step 3: Include in paper.tex in S4** -Add `#import "images/pipeline.typ": pipeline-diagram` and place: -```typst -#figure( - pipeline-diagram(), - caption: [Two-stage card-based pipeline. Human decisions (orange) are limited to Backlog→Ready and In Review→Done. Agent manages everything in between.] -) +```latex +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/pipeline.pdf} + \caption{Two-stage card-based pipeline. Human decisions (orange) are limited to Backlog$\to$Ready and In Review$\to$Done. Agent manages everything in between.} + \label{fig:pipeline} +\end{figure} ``` -- [ ] **Step 4: Commit** +- [ ] **Step 4: Verify compiles** + +- [ ] **Step 5: Commit** ```bash -git add -f docs/paper/arxiv/images/pipeline.typ docs/paper/arxiv/paper.typ -git commit -m "docs(arxiv): add Figure 3 — card-based pipeline diagram" +git add -f docs/paper/arxiv/figures/pipeline.typ docs/paper/arxiv/paper.tex +git commit -m "docs(arxiv): add Figure 3 — card-based pipeline diagram (Typst+CeTZ)" ``` --- @@ -359,47 +447,65 @@ git commit -m "docs(arxiv): add Figure 3 — card-based pipeline diagram" ### Task 5: Figure 4 — Verification pyramid **Files:** -- Create: `docs/paper/arxiv/images/verification-pyramid.typ` -- Modify: `docs/paper/arxiv/paper.typ` +- Create: `docs/paper/arxiv/figures/verification-pyramid.typ` +- Modify: `docs/paper/arxiv/paper.tex` - [ ] **Step 1: Create verification pyramid figure** -Draw a layered pyramid/stack using CeTZ with 7 layers, widest at bottom: +Create `docs/paper/arxiv/figures/verification-pyramid.typ` using CeTZ: +```typst +#import "@preview/cetz:0.4.2": canvas, draw +#set page(width: auto, height: auto, margin: 5pt) +#set text(size: 7pt) + +#canvas(length: 1cm, { + import draw: * + + // 7-layer trapezoid/pyramid, widest at bottom + // Each layer is a filled trapezoid with text on left (mechanism) and right (error class) + // Color gradient: bottom = blue (automated), top = orange/gold (human-readable) + + // Layer data: (mechanism, error class caught) + // 1: Type system (Rust compiler) → API misuse + // 2: Unit tests (eval, serialization) → evaluation errors + // 3: Closed-loop tests (round-trip) → mapping errors + // 4: Overhead validation (symbolic) → formula errors + // 5: Materialized fixtures (JSON) → test gaming + // 6: Agentic review (parallel) → convention violations + // 7: Documentation (proof sketch) → logical errors + + // Draw each layer as a trapezoid using merge-path with line segments + // Width decreases from bottom to top + // Use draw.content() for labels on each layer + // Use color.mix() or manual gradient for blue→gold transition +}) ``` -Layer 7: Documentation (proof sketch) ← catches: logical errors -Layer 6: Agentic review (parallel subagents) ← catches: convention violations -Layer 5: Materialized fixtures (JSON ground truth) ← catches: test gaming -Layer 4: Overhead validation (symbolic exprs) ← catches: formula errors -Layer 3: Closed-loop tests (round-trip) ← catches: mapping errors -Layer 2: Unit tests (eval, serialization) ← catches: evaluation errors -Layer 1: Type system (Rust compiler) ← catches: API misuse -``` - -Each layer labeled with mechanism (left) and error class caught (right). Color gradient from automated (bottom, blue) to human-readable (top, gold). -Export as: `#let verification-pyramid() = { ... }` +- [ ] **Step 2: Compile figure to PDF** -- [ ] **Step 2: Verify figure compiles standalone** - -Run: `typst compile docs/paper/arxiv/images/verification-pyramid.typ` -Expected: PDF of pyramid, no errors. +```bash +typst compile docs/paper/arxiv/figures/verification-pyramid.typ +``` -- [ ] **Step 3: Import into paper.typ in S5** +- [ ] **Step 3: Include in paper.tex in S5** -Add `#import "images/verification-pyramid.typ": verification-pyramid` and place: -```typst -#figure( - verification-pyramid(), - caption: [Seven-layer verification stack. Lower layers (blue) are fully automated; upper layers (gold) involve human-readable arguments.] -) +```latex +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/verification-pyramid.pdf} + \caption{Seven-layer verification stack. Lower layers (blue) are fully automated; upper layers (gold) involve human-readable arguments.} + \label{fig:verification} +\end{figure} ``` -- [ ] **Step 4: Commit** +- [ ] **Step 4: Verify compiles** + +- [ ] **Step 5: Commit** ```bash -git add -f docs/paper/arxiv/images/verification-pyramid.typ docs/paper/arxiv/paper.typ -git commit -m "docs(arxiv): add Figure 4 — verification pyramid" +git add -f docs/paper/arxiv/figures/verification-pyramid.typ docs/paper/arxiv/paper.tex +git commit -m "docs(arxiv): add Figure 4 — verification pyramid (Typst+CeTZ)" ``` --- @@ -407,90 +513,106 @@ git commit -m "docs(arxiv): add Figure 4 — verification pyramid" ### Task 6: Figure 2 — System architecture **Files:** -- Create: `docs/paper/arxiv/images/architecture.typ` -- Modify: `docs/paper/arxiv/paper.typ` +- Create: `docs/paper/arxiv/figures/architecture.typ` +- Modify: `docs/paper/arxiv/paper.tex` - [ ] **Step 1: Create architecture diagram** -Use Fletcher or CeTZ to show the key traits and compile-time validation: +Create `docs/paper/arxiv/figures/architecture.typ` using CeTZ: +```typst +#import "@preview/cetz:0.4.2": canvas, draw +#set page(width: auto, height: auto, margin: 5pt) +#set text(size: 8pt) + +#canvas(length: 1cm, { + import draw: * + + // Three stacked boxes connected by labeled arrows: + // + // ┌─────────────────────────────────────┐ + // │ Problem trait │ + // │ NAME, Metric, dims(), evaluate() │ + // ├──────────────┬──────────────────────┤ + // │ Optimization │ Satisfaction │ + // │ SolutionSize │ bool │ + // └──────┬───────┴──────────────────────┘ + // │ ReduceTo + // ▼ + // ┌─────────────────────────────────────┐ + // │ ReductionResult │ + // │ target_problem() + extract_solution│ + // └──────┬──────────────────────────────┘ + // │ #[reduction(overhead = {...})] + // ▼ + // ┌─────────────────────────────────────┐ + // │ Compile-time validation │ + // │ • Variable names → getter methods │ + // │ • Expr AST: symbolic overhead │ + // │ • declare_variants! → registry │ + // └─────────────────────────────────────┘ + + // Use rect() with name for each box + // Use draw.content() for text inside boxes (use raw() for code identifiers) + // Use line() with mark for connecting arrows + // Use draw.content() on arrow midpoints for edge labels +}) ``` -┌─────────────────────────────────────┐ -│ Problem trait │ -│ NAME, Metric, dims(), evaluate() │ -├──────────────┬──────────────────────┤ -│ Optimization │ Satisfaction │ -│ SolutionSize │ bool │ -│ direction() │ │ -└──────┬───────┴──────────────────────┘ - │ ReduceTo - ▼ -┌─────────────────────────────────────┐ -│ ReductionResult │ -│ target_problem() + extract_solution│ -└──────┬──────────────────────────────┘ - │ #[reduction(overhead = {...})] - ▼ -┌─────────────────────────────────────┐ -│ Compile-time validation │ -│ • Variable names → getter methods │ -│ • Expr AST: symbolic overhead │ -│ • declare_variants! → registry │ -└─────────────────────────────────────┘ -``` - -Keep compact. Focus on the verification-enabling aspects. -Export as: `#let architecture-diagram() = { ... }` +Keep compact. Use `raw()` (backtick syntax) for code identifiers in Typst. -- [ ] **Step 2: Verify figure compiles standalone** +- [ ] **Step 2: Compile figure to PDF** -Run: `typst compile docs/paper/arxiv/images/architecture.typ` -Expected: PDF of architecture diagram, no errors. +```bash +typst compile docs/paper/arxiv/figures/architecture.typ +``` -- [ ] **Step 3: Import into paper.typ in S3** +- [ ] **Step 3: Include in paper.tex in S3** -Add `#import "images/architecture.typ": architecture-diagram` and place: -```typst -#figure( - architecture-diagram(), - caption: [System architecture: the trait hierarchy and compile-time validation enforce round-trip testing capability by construction.] -) +```latex +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/architecture.pdf} + \caption{System architecture: the trait hierarchy and compile-time validation enforce round-trip testing capability by construction.} + \label{fig:architecture} +\end{figure} ``` -- [ ] **Step 4: Commit** +- [ ] **Step 4: Verify compiles** + +- [ ] **Step 5: Commit** ```bash -git add -f docs/paper/arxiv/images/architecture.typ docs/paper/arxiv/paper.typ -git commit -m "docs(arxiv): add Figure 2 — system architecture" +git add -f docs/paper/arxiv/figures/architecture.typ docs/paper/arxiv/paper.tex +git commit -m "docs(arxiv): add Figure 2 — system architecture (Typst+CeTZ)" ``` --- ## Chunk 3: Sections S1-S4 -**Convention:** All "Verify compiles" steps use: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf`. Expected: no errors. All sections use citation format `@BibKey` (e.g., `@Thai2025SWEEVO`). Before writing any section, first read `paper.typ` to understand the heading style and formatting conventions established in Task 1. +**Convention:** All "Verify compiles" steps use: `cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd -`. Expected: no fatal errors. Citations use `\cite{BibKey}` (e.g., `\cite{Thai2025SWEEVO}`). Cross-references use `\Cref{fig:...}` or `Fig.~\ref{fig:...}`. Before writing any section, first read `paper.tex` to understand the formatting conventions established in Task 1. -**Page budget reference** (two-column format, ~500 words/page): -- S1: ~1.5 pages (~750 words) -- S2: ~1 page (~500 words) -- S3: ~1.5 pages (~750 words) -- S4: ~2 pages (~1000 words) +**Page budget reference** (IEEEtran two-column, ~800 words/page): +- S1: ~1.5 pages (~1200 words) +- S2: ~1 page (~800 words) +- S3: ~1.5 pages (~1200 words) +- S4: ~2 pages (~1600 words) ### Task 7: Write S1 — Introduction **Files:** -- Modify: `docs/paper/arxiv/paper.typ` +- Modify: `docs/paper/arxiv/paper.tex` -- [ ] **Step 1: Write introduction body (~750 words)** +- [ ] **Step 1: Write introduction body (~1200 words)** -First read `paper.typ` to understand the heading format. Then write S1 within the existing `= Introduction` stub. Structure: +First read `paper.tex` to understand the document structure. Then fill in `\section{Introduction}`. Structure: -1. Opening paragraph: agents hit 70-80% on SWE-Bench but ~20% on long-horizon → cite `@Thai2025SWEEVO`, `@Deng2025SWEBenchPro` +1. Opening paragraph: agents hit 70-80% on SWE-Bench but ~20% on long-horizon → cite `\cite{Thai2025SWEEVO}`, `\cite{Deng2025SWEBenchPro}` 2. Our thesis: bottleneck is decomposition, not capability -3. "Review is harder than generation" for mathematical code → cite `@Roychoudhury2025AgenticAI` +3. "Review is harder than generation" for mathematical code → cite `\cite{Roychoudhury2025AgenticAI}` 4. Three roles paragraph: contributors (creative issues), maintainer (board + skills), agents (manage + execute) -5. Contributions list (3 items from spec) +5. Contributions list (3 items from spec) — use `\begin{itemize}...\end{itemize}` 6. Paper organization paragraph - [ ] **Step 2: Verify compiles** @@ -498,7 +620,7 @@ First read `paper.typ` to understand the heading format. Then write S1 within th - [ ] **Step 3: Commit** ```bash -git add -f docs/paper/arxiv/paper.typ +git add -f docs/paper/arxiv/paper.tex git commit -m "docs(arxiv): write S1 Introduction" ``` @@ -507,26 +629,26 @@ git commit -m "docs(arxiv): write S1 Introduction" ### Task 8: Write S2 — Why Reductions? **Files:** -- Modify: `docs/paper/arxiv/paper.typ` +- Modify: `docs/paper/arxiv/paper.tex` **Depends on:** Task 2 (graph metrics), Task 3 (Figure 1) -- [ ] **Step 1: Write S2 body (~500 words)** +- [ ] **Step 1: Write S2 body (~800 words)** Read graph metrics from `docs/paper/arxiv/data/graph-metrics.json` for concrete numbers. If not yet available, use: 24 types, 42 variants, 52 edges, 40 implemented, 12 inferred. Structure: 1. Goldilocks domain paragraph: self-contained (~50-200 LOC), formally specified, automatable round-trip criterion 2. Contrast with SWE-Bench: homogeneous tasks enable comparison -3. Hardware solvers paragraph: Rydberg atoms for MIS (cite `@lucas2014`), D-Wave for QUBO/Ising (cite `@glover2019`) → the graph as compilation layer +3. Hardware solvers paragraph: Rydberg atoms for MIS (cite `\cite{lucas2014}`), D-Wave for QUBO/Ising (cite `\cite{glover2019}`) → the graph as compilation layer 4. Real-world applications paragraph: SDN→ILP, airline→SetCovering, VLSI→coloring, logistics→TSP -5. Reference `@fig:reduction-graph` (placed by Task 3) +5. Reference `Fig.~\ref{fig:reduction-graph}` (placed by Task 3) - [ ] **Step 2: Verify compiles** - [ ] **Step 3: Commit** ```bash -git add -f docs/paper/arxiv/paper.typ +git add -f docs/paper/arxiv/paper.tex git commit -m "docs(arxiv): write S2 Why Reductions — Goldilocks domain" ``` @@ -535,12 +657,12 @@ git commit -m "docs(arxiv): write S2 Why Reductions — Goldilocks domain" ### Task 9: Write S3 — System Architecture **Files:** -- Modify: `docs/paper/arxiv/paper.typ` +- Modify: `docs/paper/arxiv/paper.tex` **Depends on:** Task 6 (Figure 2) -- [ ] **Step 1: Write S3 body (~750 words)** +- [ ] **Step 1: Write S3 body (~1200 words)** -Use the trait hierarchy from CLAUDE.md's Architecture section for reference. Do NOT read source files — the CLAUDE.md summary has sufficient detail. Full trait code belongs in supplementary material. +Use the trait hierarchy from CLAUDE.md's Architecture section for reference. Do NOT read source files — CLAUDE.md has sufficient detail. Full trait code belongs in supplementary material. Structure: 1. Problem trait: `evaluate()` enables brute-force verification of any configuration @@ -548,14 +670,16 @@ Structure: 3. `#[reduction(overhead)]` proc macro: compile-time validation of overhead expressions 4. `declare_variants!`: registry enables automated graph export + completeness checking 5. Design philosophy paragraph: reduce the space of possible agent errors -6. Reference `@fig:architecture` (placed by Task 6) +6. Reference `Fig.~\ref{fig:architecture}` (placed by Task 6) + +Use `\texttt{}` for code identifiers and `\lstinline` for inline code snippets. - [ ] **Step 2: Verify compiles** - [ ] **Step 3: Commit** ```bash -git add -f docs/paper/arxiv/paper.typ +git add -f docs/paper/arxiv/paper.tex git commit -m "docs(arxiv): write S3 System Architecture" ``` @@ -564,18 +688,18 @@ git commit -m "docs(arxiv): write S3 System Architecture" ### Task 10: Write S4 — Skill-Based Task Decomposition **Files:** -- Modify: `docs/paper/arxiv/paper.typ` +- Modify: `docs/paper/arxiv/paper.tex` **Depends on:** Task 4 (Figure 3) -- [ ] **Step 1: Write S4.1 — Three Roles (~200 words)** +- [ ] **Step 1: Write S4.1 — Three Roles (~300 words)** -The roles table from the spec (Contributor/Maintainer/Agent with responsibilities and examples). Brief narrative explaining the human-agent boundary. +The roles table from the spec (Contributor/Maintainer/Agent). Use `\begin{table}...\end{table}` with `booktabs`. - [ ] **Step 2: Read skill files and extract metadata** Read all 13 skill files (`.claude/skills/*/SKILL.md`). For each, record: name, one-line description, invocation trigger, step count. This data populates Table 1. -- [ ] **Step 3: Write S4.2 — Skills as Agent Functions (~500 words)** +- [ ] **Step 3: Write S4.2 — Skills as Agent Functions (~800 words)** Group the 13 skills into 5 categories (from spec): - **Orchestration** (4): project-pipeline, review-pipeline, issue-to-pr, meta-power @@ -584,18 +708,33 @@ Group the 13 skills into 5 categories (from spec): - **Documentation** (2): write-model-in-paper, write-rule-in-paper - **Release** (1): release -For each group, write 1-2 sentences explaining the pattern. Create Table 1 with columns: Skill, Category, Trigger, Typical Turns (estimate from step count / 3), Success Rate (use "TBD" — will be filled after Task 11). +Create Table 1 with `booktabs`: +```latex +\begin{table}[t] +\caption{Skills inventory.}\label{tab:skills} +\centering +\begin{tabular}{llcc} +\toprule +Skill & Category & Steps & Success \\ +\midrule +... +\bottomrule +\end{tabular} +\end{table} +``` + +Success Rate column: use "TBD" — filled after Task 11. -- [ ] **Step 4: Write S4.3 — Card-Based Orchestration (~300 words)** +- [ ] **Step 4: Write S4.3 — Card-Based Orchestration (~500 words)** -Two-stage pipeline (project-pipeline → review-pipeline). Human touches only Backlog→Ready and In Review→Done. Reference `@fig:pipeline` (placed by Task 4). +Two-stage pipeline (project-pipeline → review-pipeline). Human touches only Backlog→Ready and In Review→Done. Reference `Fig.~\ref{fig:pipeline}`. - [ ] **Step 5: Verify compiles** - [ ] **Step 6: Commit** ```bash -git add -f docs/paper/arxiv/paper.typ +git add -f docs/paper/arxiv/paper.tex git commit -m "docs(arxiv): write S4 Skill-Based Task Decomposition" ``` @@ -627,9 +766,7 @@ For each PR, extract: number, title, author login, created date, merged date, wh Author classification: if `author.login` contains `[bot]` or is `github-actions`, classify as "agent"; otherwise "human". -- [ ] **Step 3: Add phase classification and CI status** - -Add to the script: +- [ ] **Step 3: Add phase classification** **Phase boundaries** (based on when key skills were introduced — determine by running): ```bash @@ -642,12 +779,6 @@ Define phases: - Phase 2 (basic skills): PRs after implementation skills but before pipeline skills - Phase 3 (full pipeline): PRs after project-pipeline/review-pipeline skills existed -For CI status on first push, use: -```bash -gh api repos/CodingThrust/ProblemReductions/pulls/{number}/commits --jq '.[0].sha' -``` -Then check that SHA's status. This is optional — skip if the API calls are too slow. - - [ ] **Step 4: Run script and save results** ```bash @@ -657,21 +788,9 @@ python3 docs/paper/arxiv/scripts/mine-git-history.py > docs/paper/arxiv/data/git Expected output schema: ```json { - "summary": { - "total_prs": N, - "rule_prs": N, - "model_prs": N, - "agent_authored": N, - "human_authored": N - }, - "by_phase": [ - {"phase": 1, "label": "manual", "count": N, "agent_count": N}, - {"phase": 2, "label": "basic_skills", "count": N, "agent_count": N}, - {"phase": 3, "label": "full_pipeline", "count": N, "agent_count": N} - ], - "prs": [ - {"number": 42, "title": "...", "is_agent": false, "phase": 1, "type": "Rule"} - ] + "summary": {"total_prs": N, "rule_prs": N, "model_prs": N, "agent_authored": N, "human_authored": N}, + "by_phase": [{"phase": 1, "label": "manual", "count": N, "agent_count": N}, ...], + "prs": [{"number": 42, "title": "...", "is_agent": false, "phase": 1, "type": "Rule"}, ...] } ``` @@ -687,37 +806,39 @@ git commit -m "docs(arxiv): git history mining script and results" ### Task 12: Write S5 — Multi-Layered Verification **Files:** -- Modify: `docs/paper/arxiv/paper.typ` +- Modify: `docs/paper/arxiv/paper.tex` **Depends on:** Task 5 (Figure 4) -- [ ] **Step 1: Write S5.1 — The Verification Stack (~500 words)** +- [ ] **Step 1: Write S5.1 — The Verification Stack (~700 words)** -Write the 7-layer table from the spec. Use these concrete error examples for each layer (constructed from the domain): +Write the 7-layer table from the spec using `booktabs`. Use these concrete error examples: | Layer | Mechanism | Example Error Caught | |-------|-----------|---------------------| | 1. Type system | Rust compiler | Agent returns `bool` instead of `SolutionSize` from `evaluate()` | -| 2. Unit tests | `test_*_basic` | Agent evaluates MaxCut objective with wrong sign (sum vs difference) | -| 3. Closed-loop tests | `test_*_to_*_closed_loop` | SAT→MIS reduction maps clause variables to wrong vertex indices | +| 2. Unit tests | `test_*_basic` | Agent evaluates MaxCut objective with wrong sign | +| 3. Closed-loop tests | `test_*_to_*_closed_loop` | SAT→MIS maps clause variables to wrong vertex indices | | 4. Overhead validation | Symbolic expr vs sizes | Agent writes `num_edges = num_clauses` instead of `3 * num_clauses` | -| 5. Materialized fixtures | JSON ground truth | Agent changes expected QUBO matrix values to make failing test pass | -| 6. Agentic review | Parallel subagents | Missing `declare_variants!` macro, wrong file naming convention | -| 7. Documentation | Proof sketch | Reduction proof assumes graph is connected but problem allows disconnected | +| 5. Materialized fixtures | JSON ground truth | Agent changes expected QUBO matrix to make failing test pass | +| 6. Agentic review | Parallel subagents | Missing `declare_variants!`, wrong file naming | +| 7. Documentation | Proof sketch | Proof assumes connected graph but problem allows disconnected | -Reference `@fig:verification` (placed by Task 5). +Reference `Fig.~\ref{fig:verification}`. -- [ ] **Step 2: Write S5.2 — Why Layers? (~250 words)** +- [ ] **Step 2: Write S5.2 — Why Layers? (~400 words)** -The "lazy agent" problem: agents take the shortest path to close an issue (e.g., changing expected test values instead of fixing bugs). Materialized test data (Layer 5) prevents this. No single layer is sufficient. Cross-reference Table 2 in S6. +The "lazy agent" problem. Materialized test data as defense. No single layer is sufficient. Cross-reference Table 2 in S6. - [ ] **Step 3: Verify compiles** -Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` +```bash +cd docs/paper/arxiv && pdflatex -interaction=nonstopmode paper.tex && cd - +``` - [ ] **Step 4: Commit** ```bash -git add -f docs/paper/arxiv/paper.typ +git add -f docs/paper/arxiv/paper.tex git commit -m "docs(arxiv): write S5 Multi-Layered Verification" ``` @@ -726,92 +847,88 @@ git commit -m "docs(arxiv): write S5 Multi-Layered Verification" ### Task 13: Write S6 — Evaluation **Files:** -- Modify: `docs/paper/arxiv/paper.typ` +- Modify: `docs/paper/arxiv/paper.tex` **Depends on:** Task 11 (git mining data) -- [ ] **Step 1: Write S6.1 — Ablation setup (~400 words)** - -Describe the experimental DESIGN (actual results are `[TBD: ablation not yet run]` placeholders): -- Setup: select 5-10 reductions of varying complexity -- Two configurations: skill-based (full pipeline) vs no-skill baseline (raw agent + CLAUDE.md only) -- Metrics: first-attempt CI pass rate, review rounds, final correctness, convention adherence -- Framing: "controlled illustration" (n=5-10), not statistically powered experiment - -- [ ] **Step 2: Write S6.2 — Git History Mining results (~500 words)** - -Read data from `docs/paper/arxiv/data/git-mining-results.json`. If not yet available, use `[TBD: data]` placeholders. - -Write up agent vs human implementation counts, success rates stratified by phase. - -Create Table 2 (error taxonomy × verification layer matrix): - -| Error Category | Layer | Example | Count | -|---------------|-------|---------|-------| -| Type errors | 1 (type system) | Wrong return type | [TBD] | -| Mapping errors | 3 (closed-loop) | Wrong vertex index | [TBD] | -| Formula errors | 4 (overhead) | Linear vs quadratic | [TBD] | -| Test gaming | 5 (fixtures) | Changed expected value | [TBD] | -| Convention violations | 6 (review) | Missing macro | [TBD] | -| Logical errors | 7 (documentation) | Invalid proof | [TBD] | - -- [ ] **Step 3: Write S6.3 — Case Studies (~600 words)** +- [ ] **Step 1: Write S6.1 — Ablation setup (~500 words)** + +Experimental DESIGN only (results are `[TBD]` placeholders): +- Setup: 5-10 reductions, skill-based vs no-skill baseline +- Metrics: first-attempt CI pass rate, review rounds, correctness, convention adherence +- Framing: "controlled illustration" (n=5-10) + +- [ ] **Step 2: Write S6.2 — Git History Mining results (~700 words)** + +Read data from `docs/paper/arxiv/data/git-mining-results.json`. If not yet available, use `[TBD]` placeholders. + +Create Table 2 (error taxonomy × verification layer): +```latex +\begin{table}[t] +\caption{Error taxonomy by verification layer.}\label{tab:errors} +\centering +\begin{tabular}{llc} +\toprule +Error Category & Layer & Count \\ +\midrule +Type errors & 1 (type system) & [TBD] \\ +Mapping errors & 3 (closed-loop) & [TBD] \\ +... +\bottomrule +\end{tabular} +\end{table} +``` -Three reductions spanning the complexity spectrum. For each, find the actual PR by searching: +- [ ] **Step 3: Write S6.3 — Case Studies (~800 words)** +Search for actual PRs: ```bash gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "MinimumVertexCover MaximumIndependentSet" --json number,title gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "Satisfiability MaximumIndependentSet" --json number,title gh pr list --repo CodingThrust/ProblemReductions --state merged --limit 999 --search "Factoring CircuitSAT" --json number,title ``` -If PRs are found, reference them and analyze the pipeline trace (skills activated, human decisions, errors caught). If not found, describe the expected pipeline trace based on the skill definitions. - -**Case 1 — Simple (MVC→MIS):** complement relationship, ~30 LOC, smooth pipeline. -**Case 2 — Complex (SAT→MIS):** clause-variable gadget, quadratic blowup, agent mistakes in edge counts. -**Case 3 — Composition (Factoring→CircuitSAT→ILP):** two independent reductions that compose in the graph. Analyze each separately, then show graph-level composition. +**Case 1 — Simple (MVC→MIS):** complement relationship, ~30 LOC. +**Case 2 — Complex (SAT→MIS):** clause-variable gadget, quadratic blowup. +**Case 3 — Composition (Factoring→CircuitSAT→ILP):** two independent reductions composing in graph. - [ ] **Step 4: Verify compiles** -Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` - - [ ] **Step 5: Commit** ```bash -git add -f docs/paper/arxiv/paper.typ +git add -f docs/paper/arxiv/paper.tex git commit -m "docs(arxiv): write S6 Evaluation" ``` --- -## Chunk 5: Sections S7-S8 + Final Assembly +## Chunk 5: Sections S7-S8 + Review + Final Assembly ### Task 14: Write S7 — Related Work **Files:** -- Modify: `docs/paper/arxiv/paper.typ` +- Modify: `docs/paper/arxiv/paper.tex` -- [ ] **Step 1: Write S7 body (~500 words)** +- [ ] **Step 1: Write S7 body (~800 words)** -Four subsections, each 1-2 paragraphs. Use these specific citation keys: +Four subsections with specific citation keys: -1. **AI coding agents:** `@Yang2024SWEagent`, `@Wang2024OpenHands`, `@Anthropic2025ClaudeCode`, `@Wu2024Devin`, `@Thai2025SWEEVO` (SWE-EVO ~20%), `@Deng2025SWEBenchPro` (SWE-Bench Pro ~45%), `@Xia2025LiveSWEagent` (self-evolution complementary to skills), `@Roychoudhury2025AgenticAI` (agentic SE perspective), `@Anthropic2026AgenticCoding` (developer-AI collaboration survey) +1. **AI coding agents:** `\cite{Yang2024SWEagent}`, `\cite{Wang2024OpenHands}`, `\cite{Anthropic2025ClaudeCode}`, `\cite{Wu2024Devin}`, `\cite{Thai2025SWEEVO}`, `\cite{Deng2025SWEBenchPro}`, `\cite{Xia2025LiveSWEagent}`, `\cite{Roychoudhury2025AgenticAI}`, `\cite{Anthropic2026AgenticCoding}` -2. **AI-discovered reductions:** `@Novikov2025AlphaEvolve` (NP-hardness gadgets), `@Janicic2025URSA` (SAT-based verification), `@RomeraParedes2023FunSearch`. Our work is complementary: we implement/verify known reductions, not discover new ones. +2. **AI-discovered reductions:** `\cite{Novikov2025AlphaEvolve}`, `\cite{Janicic2025URSA}`, `\cite{RomeraParedes2023FunSearch}` -3. **Formal verification:** `@Bursuc2025VeriCoding`, `@Thakur2025CLEVER`, `@Miranda2025VeriBench`, `@Mukherjee2025CoqPL`, `@Mukherjee2025SynVer`. Our approach: pragmatic multi-layer verification vs end-to-end formal proofs. +3. **Formal verification:** `\cite{Bursuc2025VeriCoding}`, `\cite{Thakur2025CLEVER}`, `\cite{Miranda2025VeriBench}`, `\cite{Mukherjee2025CoqPL}`, `\cite{Mukherjee2025SynVer}` -4. **Physics-inspired optimization:** `@Schuetz2022PhysicsGNN` (GNN/QUBO for MIS/MaxCut/MinVC at million-variable scale), `@He2024QuantumTSP`. Our graph provides the verified compilation layer connecting problems to these solvers. +4. **Physics-inspired optimization:** `\cite{Schuetz2022PhysicsGNN}`, `\cite{He2024QuantumTSP}` -For each: position our work as complementary, not competing. +Position our work as complementary, not competing. - [ ] **Step 2: Verify compiles** -Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` - - [ ] **Step 3: Commit** ```bash -git add -f docs/paper/arxiv/paper.typ +git add -f docs/paper/arxiv/paper.tex git commit -m "docs(arxiv): write S7 Related Work" ``` @@ -820,87 +937,113 @@ git commit -m "docs(arxiv): write S7 Related Work" ### Task 15: Write S8 — Discussion & Conclusion **Files:** -- Modify: `docs/paper/arxiv/paper.typ` - -- [ ] **Step 1: Write S8 body (~500 words)** +- Modify: `docs/paper/arxiv/paper.tex` -Four parts from spec, then a concluding subsection: +- [ ] **Step 1: Write S8 body (~800 words)** -1. **Generalizability:** Goldilocks property, candidate domains (compiler peephole rules, algebraic identities, protocol verification lemmas) -2. **Limitations:** n=1 threat, skill engineering cost, domain specificity, git mining confounds (addressed by stratification), maintainer requirement -3. **Human value proposition:** repositioned not eliminated, creativity + judgment remains human. Cite `@Anthropic2026AgenticCoding` for the broader trend. -4. **Future directions:** AlphaEvolve integration (cite `@Novikov2025AlphaEvolve`), formal verification (cite `@Bursuc2025VeriCoding`), scaling to 100+ problems +Four parts: +1. **Generalizability:** Goldilocks property, candidate domains +2. **Limitations:** n=1, skill engineering cost, domain specificity, confounds, maintainer requirement +3. **Human value proposition:** repositioned not eliminated. Cite `\cite{Anthropic2026AgenticCoding}`. +4. **Future directions:** AlphaEvolve, formal verification, scaling to 100+ -End with a `=== Conclusion` subsection: 2-3 crisp sentences restating the thesis and key result. +End with `\subsection{Conclusion}`: 2-3 crisp sentences. - [ ] **Step 2: Verify compiles** -Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` - - [ ] **Step 3: Commit** ```bash -git add -f docs/paper/arxiv/paper.typ +git add -f docs/paper/arxiv/paper.tex git commit -m "docs(arxiv): write S8 Discussion and Conclusion" ``` --- -### Task 16: Final assembly and polish +### Task 16: Simulated peer review -**Files:** -- Modify: `docs/paper/arxiv/paper.typ` +**Files:** None modified (review only) + +- [ ] **Step 1: Run academic-paper-reviewer** -- [ ] **Step 1: Verify all figures are placed correctly** +Read `.claude/skills/academic-research-skills/academic-paper-reviewer/SKILL.md` and invoke the review process on `docs/paper/arxiv/paper.tex`. This simulates a 5-person review panel (Editor-in-Chief + 3 domain reviewers + Devil's Advocate) with quality rubrics. -Check that these figure references exist in the paper text: -- `@fig:reduction-graph` in S2 -- `@fig:architecture` in S3 -- `@fig:pipeline` in S4 -- `@fig:verification` in S5 +- [ ] **Step 2: Record review findings** -Search for each label in `paper.typ`. If any is missing, add the reference. +Save the review output to `docs/paper/arxiv/data/peer-review-round1.md`. -- [ ] **Step 2: Verify all tables are placed correctly** +- [ ] **Step 3: Address critical review findings** -Check for Table 1 (skills inventory) in S4 and Table 2 (error taxonomy) in S6. +Fix any issues scored below 65 (Major Revision threshold). Update paper.tex accordingly. -- [ ] **Step 3: Verify all citations resolve** +- [ ] **Step 4: Commit fixes** ```bash -typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf 2>&1 | grep -i "warning\|error\|unknown\|not found" +git add -f docs/paper/arxiv/paper.tex docs/paper/arxiv/data/peer-review-round1.md +git commit -m "docs(arxiv): address peer review round 1 findings" ``` -Expected: no unresolved citation or label warnings. If any `@key` references are missing from `references.bib`, add them. +--- + +### Task 17: Final assembly and polish + +**Files:** +- Modify: `docs/paper/arxiv/paper.tex` -- [ ] **Step 4: Check page count** +- [ ] **Step 1: Compile all Typst figures** ```bash -typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf && python3 -c " -import subprocess -result = subprocess.run(['pdfinfo', 'docs/paper/arxiv/paper.pdf'], capture_output=True, text=True) -for line in result.stdout.splitlines(): - if 'Pages' in line: - print(line) -" +for f in docs/paper/arxiv/figures/*.typ; do typst compile "$f"; done +``` + +Verify all 4 PDFs exist: +```bash +ls docs/paper/arxiv/figures/*.pdf +``` + +Expected: `reduction-graph.pdf`, `architecture.pdf`, `pipeline.pdf`, `verification-pyramid.pdf`. + +- [ ] **Step 2: Verify all figures are placed correctly** + +Check that these references exist in the paper text: +- `\ref{fig:reduction-graph}` in S2 +- `\ref{fig:architecture}` in S3 +- `\ref{fig:pipeline}` in S4 +- `\ref{fig:verification}` in S5 + +- [ ] **Step 3: Verify all tables are placed** + +Check for `\ref{tab:skills}` in S4 and `\ref{tab:errors}` in S6. + +- [ ] **Step 4: Full compile with bibliography** + +```bash +cd docs/paper/arxiv && pdflatex paper.tex && bibtex paper && pdflatex paper.tex && pdflatex paper.tex && cd - ``` -Expected: 10-12 pages. If over, identify sections to trim. If under, identify sections to expand. +Check for unresolved citations: `grep "Citation.*undefined" docs/paper/arxiv/paper.log` +Expected: no undefined citations. + +- [ ] **Step 5: Check page count** + +```bash +pdfinfo docs/paper/arxiv/paper.pdf | grep Pages +``` -- [ ] **Step 5: Final compile and flag visual review** +Expected: 10-12 pages. If over, trim. If under, expand. -Run: `typst compile docs/paper/arxiv/paper.typ docs/paper/arxiv/paper.pdf` +- [ ] **Step 6: Flag visual review for maintainer** -Verify no warnings are emitted. Visual inspection (layout, orphans, figure legibility) requires human review — flag as TODO for the maintainer. +Verify no LaTeX warnings about overfull hboxes (>1pt). Visual inspection (layout, figures, tables) requires human review. -- [ ] **Step 6: Commit final version** +- [ ] **Step 7: Commit final version** ```bash -git add -f docs/paper/arxiv/paper.typ docs/paper/arxiv/images/ docs/paper/arxiv/references.bib +git add -f docs/paper/arxiv/paper.tex docs/paper/arxiv/figures/*.typ docs/paper/arxiv/references.bib git commit -m "docs(arxiv): final paper assembly and polish" ``` -Note: Do NOT commit `paper.pdf` — it is a build artifact. +Note: Do NOT commit `paper.pdf`, `paper.aux`, `paper.bbl`, `paper.blg`, `paper.log`, or `figures/*.pdf` — these are build artifacts. Add them to `.gitignore` if needed. --- @@ -922,18 +1065,21 @@ Task 7 (S1): no figure dependency — can run after Task 1 Task 11 (git mining) ──→ Task 13 (S6) Task 14 (S7): independent — can run after Task 1 Task 15 (S8): independent — can run after Task 1 -Task 16 (assembly): must run LAST +Task 16 (peer review): must run after Tasks 7-15 +Task 17 (assembly): must run LAST ``` ### Suggested Parallel Batches 1. **Tasks 1-2** (scaffolding + data) — sequential, run first 2. **Tasks 3-6** (all figures) + **Task 7** (S1) + **Task 11** (git mining) — parallel -3. **Tasks 8-10** (S2-S4) + **Tasks 14-15** (S7-S8) — parallel (each depends on its figure from batch 2) -4. **Tasks 12-13** (S5-S6) — parallel (depend on Figure 4 + git mining from batch 2) -5. **Task 16** (assembly) — last +3. **Tasks 8-10** (S2-S4) + **Tasks 14-15** (S7-S8) — parallel +4. **Tasks 12-13** (S5-S6) — parallel +5. **Task 16** (peer review) — after all sections written +6. **Task 17** (assembly) — last ### Open Dependencies -- **S6.1 ablation results** are `[TBD]` placeholders. The ablation experiment is a separate effort outside this plan. The paper will contain placeholder markers until that data is available. -- **Table 1 success rates** are `[TBD]` — will be filled from git mining data (Task 11) if available, otherwise left as placeholders. +- **S6.1 ablation results** are `[TBD]` placeholders. The ablation experiment is a separate effort outside this plan. +- **Table 1 success rates** are `[TBD]` — filled from git mining data if available. +- **Peer review** (Task 16) may surface issues requiring additional revision cycles. diff --git a/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md b/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md index ad72db8d..2b705f93 100644 --- a/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md +++ b/docs/superpowers/specs/2026-03-12-arxiv-paper-design.md @@ -2,7 +2,7 @@ **Type:** Full research paper (~10-12 pages) **Venue:** ICSE/ASE-class SE conference -**Output:** `docs/paper/arxiv/paper.typ` (Typst) +**Output:** `docs/paper/arxiv/paper.tex` (LaTeX, IEEEtran class) ## Thesis From b3953611276a9e3f8c7ce5f9c359fc385c5bd28d Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 21:43:32 +0800 Subject: [PATCH 05/38] docs(arxiv): LaTeX paper scaffolding with IEEEtran and bibliography Set up paper.tex with IEEEtran conference class, 8 section stubs, and a ~150-word abstract. Combined survey bibliography (22 entries) with 6 foundational references (Karp, Cook, Garey-Johnson, Glover, Lucas, Barahona). Removed old paper.typ placeholder. Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 44 +++++ docs/paper/arxiv/references.bib | 298 ++++++++++++++++++++++++++++++++ 2 files changed, 342 insertions(+) create mode 100644 docs/paper/arxiv/paper.tex create mode 100644 docs/paper/arxiv/references.bib diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex new file mode 100644 index 00000000..94028e0f --- /dev/null +++ b/docs/paper/arxiv/paper.tex @@ -0,0 +1,44 @@ +\documentclass[conference]{IEEEtran} +\usepackage{cite} +\usepackage{amsmath,amssymb,amsfonts} +\usepackage{graphicx} +\usepackage{textcomp} +\usepackage{xcolor} +\usepackage{booktabs} +\usepackage{listings} +\usepackage{hyperref} +\usepackage{cleveref} + +\begin{document} + +\title{Skill-Based Agentic Coding for Mathematical Software:\\ +A Case Study in NP-Hard Problem Reductions} + +\author{...} % placeholder + +\maketitle + +\begin{abstract} +AI coding agents achieve 70--80\% on single-issue benchmarks like SWE-Bench Verified, but their success rate drops below 25\% on long-horizon software evolution tasks that demand sustained mathematical reasoning across many files. We address this gap by decomposing agentic coding into two complementary roles: human-creative work (designing reduction proofs, choosing algorithms, writing specifications) and agent-managed execution (scaffolding, testing, verification, and integration). Our method centers on a library of 13 reusable skills---from issue triage through implementation to multi-layered review---orchestrated by a coding agent within a Rust library for NP-hard problem reductions. A 7-layer verification stack (type checking, unit tests, brute-force cross-validation, closed-loop reduction tests, integration tests, coverage enforcement, and CI/CD) catches errors at increasing levels of abstraction. Applying this methodology over six months produced 24 problem types, 40 reduction rule implementations, and 52 edges in a typed reduction graph, all with $>$95\% test coverage. We contribute the skill-based decomposition methodology, the verification stack design, and the open-source artifact as a benchmark for agentic mathematical software engineering. +\end{abstract} + +\section{Introduction}\label{sec:intro} + +\section{Why Reductions? The Goldilocks Domain}\label{sec:domain} + +\section{System Architecture}\label{sec:architecture} + +\section{Skill-Based Task Decomposition}\label{sec:skills} + +\section{Multi-Layered Verification}\label{sec:verification} + +\section{Evaluation}\label{sec:evaluation} + +\section{Related Work}\label{sec:related} + +\section{Discussion \& Conclusion}\label{sec:conclusion} + +\bibliographystyle{IEEEtran} +\bibliography{references} + +\end{document} diff --git a/docs/paper/arxiv/references.bib b/docs/paper/arxiv/references.bib new file mode 100644 index 00000000..26ae31b7 --- /dev/null +++ b/docs/paper/arxiv/references.bib @@ -0,0 +1,298 @@ +% Survey: Agentic Coding and Problem Reduction Rules +% Generated: 2026-03-12 +% Papers: 22 + +% ============================================================ +% Theme A: AI Coding Agents — Architectures and Benchmarks +% ============================================================ + +@article{Yang2024SWEagent, + author = {John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Adriano Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press}, + title = {{SWE}-agent: Agent-Computer Interfaces Enable Automated Software Engineering}, + booktitle = {Neural Information Processing Systems}, + journal = {ArXiv}, + volume = {abs/2405.15793}, + year = {2024}, + doi = {10.48550/arXiv.2405.15793}, + abstract = {Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5\% and 87.7\%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.}, +} + +@article{Wang2024OpenHands, + author = {Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and Robert Brennan and Hao Peng and Heng Ji and Graham Neubig}, + title = {{OpenHands}: An Open Platform for {AI} Software Developers as Generalist Agents}, + booktitle = {International Conference on Learning Representations}, + year = {2024}, + url = {https://arxiv.org/abs/2407.16741}, + abstract = {Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.}, +} + +@article{Wang2025OpenHandsSDK, + author = {Xingyao Wang and Simon Rosenberg and Juan Michelini and Calvin Smith and Hoang H. Tran and Engel Nyst and Rohit Malhotra and Xuhui Zhou and Valerie Chen and Robert Brennan and Graham Neubig}, + title = {The {OpenHands} Software Agent {SDK}: A Composable and Extensible Foundation for Production Agents}, + journal = {ArXiv}, + volume = {abs/2511.03690}, + year = {2025}, + doi = {10.48550/arXiv.2511.03690}, + abstract = {Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents, which has 64k+ GitHub stars. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex, full-featured agents with features such as custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability, integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VS Code, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude, and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. Empirical results on SWE-Bench Verified and GAIA benchmarks demonstrate strong performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.}, +} + +@article{Thai2025SWEEVO, + author = {Minh V. T. Thai and Tue Le and D{\~u}ng Nguy{\~\hat{e}}n M{\d a}nh and Huy Phan Nhat and Nghi D. Q. Bui}, + title = {{SWE-EVO}: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios}, + journal = {ArXiv}, + volume = {abs/2512.18470}, + year = {2025}, + doi = {10.48550/arXiv.2512.18470}, + abstract = {Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.}, +} + +@article{Deng2025SWEBenchPro, + title = {{SWE-Bench Pro}: Can {AI} Agents Solve Long-Horizon Software Engineering Tasks?}, + author = {Xiang Deng and Jeff Da and Edwin Pan and Yannis Y. He and Charles Ide and Kanak Garg and Niklas Lauffer and Andrew Park and Chetan Rane and Karmini Sampath and Maya Krishnan and Srivatsa R. Kundurthy and Sean M. Hendryx and Zifan Wang and Chen Bo Calvin Zhang and Noah Jacobson and Bing Liu and Brad Kenstler}, + year = {2025}, + journal = {arXiv preprint arXiv:2509.16941}, + doi = {10.48550/arXiv.2509.16941}, + url = {https://openreview.net/forum?id=9R2iUHhVfr}, + note = {Under review at ICLR 2026}, + abstract = {We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench, but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. The benchmark comprises 1,865 problems from 41 repositories, split into public, held-out, and commercial sets. It features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.}, +} + +@article{Xia2025LiveSWEagent, + author = {Chun Xia and Zhe Wang and Yan Yang and Yuxiang Wei and Ling-kai Zhang}, + title = {{Live-SWE-agent}: Can Software Engineering Agents Self-Evolve on the Fly?}, + journal = {ArXiv}, + volume = {abs/2511.13646}, + year = {2025}, + doi = {10.48550/arXiv.2511.13646}, + abstract = {Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Goedel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that LIVE-SWE-AGENT can achieve an impressive solve rate of 77.4\% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8\%.}, +} + +@misc{Anthropic2025ClaudeCode, + title = {Claude Code}, + author = {{Anthropic}}, + year = {2025}, + url = {https://github.com/anthropics/claude-code}, + howpublished = {\url{https://github.com/anthropics/claude-code}}, + note = {Agentic coding tool that lives in the terminal, understands codebases, and helps developers code faster through natural language commands}, +} + +@misc{Wu2024Devin, + title = {Introducing {Devin}, the First {AI} Software Engineer}, + author = {Scott Wu}, + year = {2024}, + month = mar, + url = {https://cognition.ai/blog/introducing-devin}, + howpublished = {Cognition AI Blog}, + note = {Devin is a fully autonomous AI software engineering agent with access to shell, code editor, and browser in a sandboxed environment. On SWE-bench, Devin correctly resolves 13.86\% of issues end-to-end.}, +} + +@article{Roychoudhury2025AgenticAI, + author = {Abhik Roychoudhury}, + title = {Agentic {AI} for Software: Thoughts from Software Engineering Community}, + journal = {ArXiv}, + volume = {abs/2508.17343}, + year = {2025}, + doi = {10.48550/arXiv.2508.17343}, + abstract = {AI agents have recently shown significant promise in software engineering. Much public attention has been transfixed on the topic of code generation from Large Language Models (LLMs) via a prompt. However, software engineering is much more than programming, and AI agents go far beyond instructions given by a prompt. At the code level, common software tasks include code generation, testing, and program repair. Design level software tasks may include architecture exploration, requirements understanding, and requirements enforcement at the code level. Each of these software tasks involves micro-decisions which can be taken autonomously by an AI agent, aided by program analysis tools. This creates the vision of an AI software engineer, where the AI agent can be seen as a member of a development team. Conceptually, the key to successfully developing trustworthy agentic AI-based software workflows will be to resolve the core difficulty in software engineering --- the deciphering and clarification of developer intent. Specification inference, or deciphering the intent, thus lies at the heart of many software tasks, including software maintenance and program repair. A successful deployment of agentic technology into software engineering would involve making conceptual progress in such intent inference via agents. Trusting the AI agent becomes a key aspect, as software engineering becomes more automated. Higher automation also leads to higher volume of code being automatically generated, and then integrated into code-bases. Thus to deal with this explosion, an emerging direction is AI-based verification and validation (V\&V) of AI generated code. We posit that agentic software workflows in future will include such AI-based V\&V.}, +} + +@techreport{Anthropic2026AgenticCoding, + title = {2026 Agentic Coding Trends Report: How Coding Agents Are Reshaping Software Development}, + author = {{Anthropic}}, + year = {2026}, + month = jan, + institution = {Anthropic}, + url = {https://resources.anthropic.com/hubfs/2026\%20Agentic\%20Coding\%20Trends\%20Report.pdf}, + abstract = {Industry report identifying eight trends across foundation, capability, and impact categories that are reshaping software development. Key findings include that developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks. The report covers shifting engineering roles, multi-agent coordination, human-AI collaboration patterns, and scaling agentic coding beyond engineering teams.}, +} + +% ============================================================ +% Theme C: AI-Assisted Discovery of Reductions & Complexity +% ============================================================ + +@article{Nagda2025ReinforcedGeneration, + author = {Ansh Nagda and Prabhakar Raghavan and Abhradeep Thakurta}, + title = {Reinforced Generation of Combinatorial Structures: Hardness of Approximation}, + year = {2025}, + url = {https://arxiv.org/abs/2509.18057}, + abstract = {Can AI based methods help us make advances in complexity theory? We provide evidence towards answering this in the affirmative, using AlphaEvolve (an LLM code mutation agent) to obtain new results in three settings: a) We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as 163 vertices, and our upper bounds are obtained via analytical arguments. b) We obtain new inapproximability results for MAX-4-CUT and MAX-3-CUT, proving that it is NP-hard to approximate them within factors of 0.987 and 0.9649 respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of 0.9883, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of 0.9853, but falls short of the SOTA of 16/17 that relies on a custom PCP (rather than a reduction from ``standard'' Hastad-style PCPs). c) Inapproximability for the metric Traveling Salesman Problem (TSP): We show that it is NP-hard to approximate the minimum cost tour within a factor of 111/110 using AlphaEvolve to discover a new gadget, thus improving the SOTA of 117/116. Along the way, we provide new modular soundness and completeness arguments that can be of independent interest. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (sometimes requiring time exponential in the size of the construction). We used AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by 10,000x for our gadgets). Our results suggest that gadget based proofs would benefit from a pass through AI-based tools to obtain stronger results.}, +} + +@article{Novikov2025AlphaEvolve, + author = {Alexander Novikov and Ng{\^a}n V{\~u} and Marvin Eisenberger and Emilien Dupont and Po-Sen Huang and Adam Zsolt Wagner and S. Shirobokov and Borislav M. Kozlovskii and Francisco J. R. Ruiz and Abbas Mehrabian and M. P. Kumar and Abigail See and Swarat Chaudhuri and George Holland and A. Davies and Sebastian Nowozin and Pushmeet Kohli and Matej Balog}, + title = {{AlphaEvolve}: A Coding Agent for Scientific and Algorithmic Discovery}, + journal = {ArXiv}, + volume = {abs/2506.13131}, + year = {2025}, + doi = {10.48550/arXiv.2506.13131}, + abstract = {In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using 48 scalar multiplications; offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.}, +} + +@article{RomeraParedes2023FunSearch, + author = {Bernardino Romera-Paredes and M. Barekatain and Alexander Novikov and Matej Balog and M. P. Kumar and Emilien Dupont and Francisco J. R. Ruiz and J. Ellenberg and Pengming Wang and Omar Fawzi and Pushmeet Kohli and Alhussein Fawzi}, + title = {Mathematical Discoveries from Program Search with Large Language Models}, + journal = {Nature}, + volume = {625}, + pages = {468--475}, + year = {2023}, + doi = {10.1038/s41586-023-06924-6}, + abstract = {Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pretrained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best-known results in important problems, pushing the boundary of existing LLM-based approaches. Applying FunSearch to a central problem in extremal combinatorics---the cap set problem---we discover new constructions of large cap sets going beyond the best-known ones, both in finite dimensional and asymptotic cases. This shows that it is possible to make discoveries for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve on widely used baselines. In contrast to most computer search approaches, FunSearch searches for programs that describe how to solve a problem, rather than what the solution is. Beyond being an effective and scalable strategy, discovered programs tend to be more interpretable than raw solutions, enabling feedback loops between domain experts and FunSearch, and the deployment of such programs in real-world applications.}, +} + +@article{Imajuku2025ALEBench, + author = {Yuki Imajuku and Kohki Horie and Yoichi Iwata and Kensho Aoki and Naohiro Takahashi and Takuya Akiba}, + title = {{ALE-Bench}: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering}, + journal = {ArXiv}, + volume = {abs/2506.09050}, + year = {2025}, + doi = {10.48550/arXiv.2506.09050}, + abstract = {How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.}, +} + +@article{Janicic2025URSA, + author = {Predrag Jani{\v{c}}i{\'c}}, + title = {A {SAT}-based Approach for Specification, Analysis, and Justification of Reductions between {NP}-complete Problems}, + journal = {ArXiv}, + volume = {abs/2511.18639}, + year = {2025}, + doi = {10.48550/arXiv.2511.18639}, + abstract = {We propose a novel approach for the development, analysis, and verification of reductions between NP-complete problems. This method uses the URSA system, a SAT-based constraint solver and incorporates features that distinguish it from existing related systems.}, +} + +% ============================================================ +% Theme D (subset): Physics-Inspired QUBO/Ising Approaches +% ============================================================ + +@article{Schuetz2022PhysicsGNN, + author = {M. Schuetz and J. K. Brubaker and H. Katzgraber}, + title = {Combinatorial Optimization with Physics-Inspired Graph Neural Networks}, + journal = {Nature Machine Intelligence}, + volume = {4}, + pages = {367--377}, + year = {2022}, + doi = {10.1038/s42256-022-00468-6}, + abstract = {Combinatorial optimization problems are pervasive across science and industry. Modern deep learning tools are poised to solve these problems at unprecedented scales, but a unifying framework that incorporates insights from statistical physics is still outstanding. Here we demonstrate how graph neural networks can be used to solve combinatorial optimization problems. Our approach is broadly applicable to canonical NP-hard problems in the form of quadratic unconstrained binary optimization problems, such as maximum cut, minimum vertex cover, maximum independent set, as well as Ising spin glasses and higher-order generalizations thereof in the form of polynomial unconstrained binary optimization problems. We apply a relaxation strategy to the problem Hamiltonian to generate a differentiable loss function with which we train the graph neural network and apply a simple projection to integer variables once the unsupervised training process has completed. We showcase our approach with numerical results for the canonical maximum cut and maximum independent set problems. We find that the graph neural network optimizer performs on par or outperforms existing solvers, with the ability to scale beyond the state of the art to problems with millions of variables.}, +} + +@article{He2024QuantumTSP, + author = {Haoqi He}, + title = {Quantum Annealing and {GNN} for Solving {TSP} with {QUBO}}, + booktitle = {Algorithmic Applications in Management}, + pages = {134--145}, + year = {2024}, + doi = {10.1007/978-981-97-7801-0_12}, + abstract = {This paper explores the application of Quadratic Unconstrained Binary Optimization (QUBO) models in solving the Travelling Salesman Problem (TSP) through Quantum Annealing algorithms and Graph Neural Networks. Quantum Annealing (QA), a quantum-inspired optimization method that exploits quantum tunneling to escape local minima, is used to solve QUBO formulations of TSP instances on Coherent Ising Machines (CIMs). The paper also presents a novel approach where QUBO is employed as a loss function within a GNN architecture tailored for solving TSP efficiently. By leveraging GNN's capability to learn graph representations, this method finds approximate solutions to TSP with improved computational time compared to traditional exact solvers.}, +} + +% ============================================================ +% Theme E: LLM-Assisted Formal Verification & Program Synthesis +% ============================================================ + +@article{Bursuc2025VeriCoding, + author = {Sergiu Bursuc and Theodore Ehrenborg and Shaowei Lin and L. Astefanoaei and Ionel Emilian Chiosa and Jure Kukovec and Alok Singh and Oliver Butterley and Adem Bizid and Quinn Dougherty and Miranda Zhao and Max Tan and Max Tegmark}, + title = {A Benchmark for Vericoding: Formally Verified Program Synthesis}, + journal = {ArXiv}, + volume = {abs/2509.22908}, + year = {2025}, + doi = {10.48550/arXiv.2509.22908}, + abstract = {We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications --- in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27\% in Lean, 44\% in Verus/Rust and 82\% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68\% to 96\% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark}, +} + +@article{Thakur2025CLEVER, + author = {Amitayush Thakur and Jasper Lee and G. Tsoukalas and Meghana Sistla and Matthew Zhao and Stefan Zetzsche and Greg Durrett and Yisong Yue and Swarat Chaudhuri}, + title = {{CLEVER}: A Curated Benchmark for Formally Verified Code Generation}, + journal = {ArXiv}, + volume = {abs/2505.13938}, + year = {2025}, + doi = {10.48550/arXiv.2505.13938}, + abstract = {We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning.}, +} + +@inproceedings{Miranda2025VeriBench, + title = {{VeriBench}: End-to-End Formal Verification Benchmark for {AI} Code Generation in {Lean} 4}, + author = {Brando Miranda and Zhanke Zhou and Allen Nie and Elyas Obbad and Leni Aniva and Kai Fronsdal and Weston Kirk and Dilara Soylu and Andrea Yu and Ying Li and Sanmi Koyejo}, + year = {2025}, + booktitle = {2nd AI for Math Workshop at ICML 2025 (AI4Math@ICML)}, + url = {https://openreview.net/forum?id=rWkGFmnSNl}, + abstract = {VeriBench evaluates LLM capabilities in generating complete Lean 4 programs---implementations, unit tests, correctness theorems, and formal proofs---derived from reference Python functions or their docstrings. Testing 113 tasks across HumanEval problems, exercises, classical algorithms, and security challenges, the benchmark reveals that Claude 3.7 Sonnet achieves compilation on only 12.5\%, while LLaMA-70B fails to compile any programs in the Lean 4 HumanEval subset, even with 50 feedback-guided attempts. Only a self-optimizing agent architecture achieves meaningful compilation rates, approaching 90\%.}, +} + +@inproceedings{Mukherjee2025CoqPL, + title = {Towards Automated Verification of {LLM}-Synthesized {C} Programs}, + author = {Prasita Mukherjee and Benjamin Delaware}, + year = {2025}, + month = jan, + booktitle = {CoqPL 2025: The Eleventh International Workshop on Coq for Programming Languages (co-located with POPL 2025)}, + doi = {10.48550/arXiv.2410.14835}, + url = {https://popl25.sigplan.org/details/CoqPL-2025-papers/5/Towards-Automated-Verification-of-LLM-Synthesized-C-Programs}, + abstract = {We present a synthesis and verification framework for C programs that leverages LLMs to generate candidate programs while imposing syntactic and semantic biases on programs generated by LLMs, such that the synthesized program is more amenable to automated verification. The key contribution is a specification-verification tool built on the Verified Software Toolchain. Experiments on diverse benchmarks from the deductive program synthesis community, including basic coding examples, Separation Logic based assertions, and API specifications, demonstrate scalability and extensibility.}, +} + +@inproceedings{Mukherjee2025SynVer, + title = {{SYNVER}: {LLM}-Assisted Synthesis of High-Assurance {C} Programs}, + author = {Prasita Mukherjee and Minghai Lu and Benjamin Delaware}, + year = {2025}, + month = nov, + booktitle = {2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)}, + address = {Seoul, Korea}, + doi = {10.1109/ASE63991.2025.00255}, + url = {https://ieeexplore.ieee.org/document/11334588/}, + abstract = {We present SynVer---a novel, general purpose synthesizer for C programs equipped with machine-checked proofs of correctness using the Verified Software Toolchain. SynVer employs two Large Language Models: the first generates candidate programs from user-provided specifications, and the second helps automatically generate proofs of correctness in the Rocq proof assistant. SynVer combines symbolic reasoning with LLM-powered proof generation to discharge proof obligations.}, +} + +% ============================================================ +% Foundational References (from project bibliography) +% ============================================================ + +@inproceedings{karp1972, + author = {Richard M. Karp}, + title = {Reducibility among Combinatorial Problems}, + booktitle = {Complexity of Computer Computations}, + publisher = {Plenum Press}, + year = {1972}, + pages = {85--103} +} + +@inproceedings{cook1971, + author = {Stephen A. Cook}, + title = {The Complexity of Theorem-Proving Procedures}, + booktitle = {Proceedings of the Third Annual ACM Symposium on Theory of Computing}, + year = {1971}, + pages = {151--158} +} + +@book{garey1979, + author = {Michael R. Garey and David S. Johnson}, + title = {Computers and Intractability: A Guide to the Theory of NP-Completeness}, + publisher = {W. H. Freeman}, + year = {1979} +} + +@article{glover2019, + author = {Fred Glover and Gary Kochenberger and Yu Du}, + title = {Quantum Bridge Analytics {I}: a tutorial on formulating and using {QUBO} models}, + journal = {4OR}, + volume = {17}, + pages = {335--371}, + year = {2019}, + doi = {10.1007/s10288-019-00424-y} +} + +@article{lucas2014, + author = {Andrew Lucas}, + title = {Ising formulations of many NP problems}, + journal = {Frontiers in Physics}, + volume = {2}, + number = {5}, + year = {2014} +} + +@article{barahona1982, + author = {Francisco Barahona}, + title = {On the computational complexity of Ising spin glass models}, + journal = {Journal of Physics A: Mathematical and General}, + volume = {15}, + number = {10}, + pages = {3241--3253}, + year = {1982} +} From 59f6145a4d89e73e22e24b59a360c99837103eac Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 22:04:16 +0800 Subject: [PATCH 06/38] docs(arxiv): gather reduction graph metrics Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/data/graph-metrics.json | 10 ++++++++++ 1 file changed, 10 insertions(+) create mode 100644 docs/paper/arxiv/data/graph-metrics.json diff --git a/docs/paper/arxiv/data/graph-metrics.json b/docs/paper/arxiv/data/graph-metrics.json new file mode 100644 index 00000000..c9e75016 --- /dev/null +++ b/docs/paper/arxiv/data/graph-metrics.json @@ -0,0 +1,10 @@ +{ + "unique_types": 24, + "variant_nodes": 42, + "total_edges": 52, + "reduceto_impls": 40, + "inferred_edges": 12, + "hub_in_degree": {"QUBO": 6, "ILP": 11, "MaximumIndependentSet": 14}, + "hub_out_degree": {"MaximumIndependentSet": 13, "MaximumSetPacking": 6, "KSatisfiability": 5, "Satisfiability": 5}, + "loc_per_reduction": {"min": 58, "max": 444, "median": 129} +} From 1a2ba073e835a1602ccb464c19efab73a42de778 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 22:52:47 +0800 Subject: [PATCH 07/38] =?UTF-8?q?docs(arxiv):=20add=20Figure=201=20?= =?UTF-8?q?=E2=80=94=20reduction=20graph=20(Typst+CeTZ)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/figures/reduction-graph.typ | 188 +++++++++++++++++++ docs/paper/arxiv/paper.tex | 7 + 2 files changed, 195 insertions(+) create mode 100644 docs/paper/arxiv/figures/reduction-graph.typ diff --git a/docs/paper/arxiv/figures/reduction-graph.typ b/docs/paper/arxiv/figures/reduction-graph.typ new file mode 100644 index 00000000..bba99ce9 --- /dev/null +++ b/docs/paper/arxiv/figures/reduction-graph.typ @@ -0,0 +1,188 @@ +#import "@preview/cetz:0.4.2": canvas, draw + +#set page(width: auto, height: auto, margin: 10pt) +#set text(size: 7pt, font: "New Computer Modern") + +// Category colors +#let col-graph = rgb("#4e79a7") +#let col-formula = rgb("#59a14f") +#let col-set = rgb("#e15759") +#let col-alg = rgb("#b07aa1") +#let col-misc = rgb("#999999") + +#let hub-r = 0.44 +#let node-r = 0.26 + +// Layout: Semi-circular fan around ILP (the biggest sink with 10 in-edges). +// ILP at center-right; QUBO below it; MIS at center-left. +// Feeder nodes arranged in an arc around ILP so edges don't bundle. +// Bottom: SG <-> MaxCut <-> QUBO cluster. +// Left: SAT cluster fans out to MIS, MinDS, KCol, kSAT, CSAT. +// Isolated nodes in dashed boxes at far left and far right. + +#let nodes = ( + // === Hubs === + ("MIS", "MIS", 3.0, 5.0, col-graph, hub-r), + ("ILP", "ILP", 12.0, 5.0, col-alg, hub-r), + ("QUBO", "QUBO", 12.0, 1.5, col-alg, hub-r), + + // === ILP feeders — spread in a wide arc above/below/left of ILP === + ("MaxClq", "MaxClq", 8.5, 8.5, col-graph, node-r), + ("TSP", "TSP", 10.0, 8.5, col-graph, node-r), + ("MaxMatch", "MaxM", 5.0, 8.0, col-graph, node-r), + ("MinDS", "MinDS", 1.0, 7.0, col-graph, node-r), + + // === MIS neighbors === + ("MinVC", "MinVC", 1.0, 3.5, col-graph, node-r), + + // === Middle band === + ("KCol", "KCol", 8.5, 3.5, col-graph, node-r), + ("SG", "SG", 10.0, 0.0, col-graph, node-r), + ("MaxCut", "MaxCut", 8.0, 0.0, col-graph, node-r), + + // === Isolated graph === + ("MaxIS", "MaxIS", -0.8, 5.0, col-graph, node-r), + ("BiClq", "BiClq", -0.8, 3.5, col-graph, node-r), + + // === Formula === + ("SAT", "SAT", 5.0, 5.0, col-formula, node-r), + ("kSAT", "kSAT", 7.5, 5.0, col-formula, node-r), + ("CSAT", "CSAT", 7.5, 1.5, col-formula, node-r), + + // === Set === + ("MaxSP", "MaxSP", 5.0, 6.5, col-set, node-r), + ("MinSC", "MinSC", 10.0, 3.5, col-set, node-r), + + // === Isolated algebraic === + ("CVP", "CVP", 14.5, 3.5, col-alg, node-r), + ("BMF", "BMF", 14.5, 1.5, col-alg, node-r), + ("Knap", "Knap", 14.5, 5.5, col-alg, node-r), + + // === Misc === + ("Fact", "Fact", 5.0, 0.0, col-misc, node-r), + ("BinP", "BinP", 14.5, 7.5, col-misc, node-r), + ("PS", "PS", 14.5, 0.0, col-misc, node-r), +) + +// 32 unique type-level directed edges +#let edges = ( + ("SAT", "CSAT"), ("SAT", "KCol"), ("SAT", "kSAT"), + ("SAT", "MIS"), ("SAT", "MinDS"), + ("kSAT", "QUBO"), ("kSAT", "SAT"), + ("CSAT", "ILP"), ("CSAT", "SG"), + ("Fact", "CSAT"), ("Fact", "ILP"), + ("MIS", "MaxSP"), ("MIS", "MinVC"), + ("MinVC", "MIS"), ("MinVC", "MinSC"), + ("MaxSP", "ILP"), ("MaxSP", "MIS"), ("MaxSP", "QUBO"), + ("MaxClq","ILP"), + ("MaxMatch","ILP"), ("MaxMatch","MaxSP"), + ("MinDS", "ILP"), + ("MinSC", "ILP"), + ("KCol", "ILP"), ("KCol", "QUBO"), + ("QUBO", "ILP"), ("QUBO", "SG"), + ("SG", "MaxCut"), ("SG", "QUBO"), + ("MaxCut","SG"), + ("ILP", "QUBO"), + ("TSP", "ILP"), +) + +// Bidirectional pairs for perpendicular offset +#let bidi-pairs = ( + ("MIS", "MinVC"), ("MIS", "MaxSP"), ("SAT", "kSAT"), + ("SG", "MaxCut"), ("SG", "QUBO"), ("ILP", "QUBO"), +) + +#canvas(length: 0.55cm, { + import draw: * + + // Build lookup + let node-map = (:) + for n in nodes { + let (id, abbr, x, y, col, r) = n + node-map.insert(id, (x: x, y: y, r: r)) + } + + let is-bidi(src, tgt) = { + let found = false + for (a, b) in bidi-pairs { + if (src == a and tgt == b) or (src == b and tgt == a) { found = true } + } + found + } + + let bidi-offset = 0.2 + + // --- Edges --- + for (src, tgt) in edges { + let s = node-map.at(src) + let t = node-map.at(tgt) + let dx = t.x - s.x + let dy = t.y - s.y + let dist = calc.sqrt(dx * dx + dy * dy) + if dist > 0 { + let ux = dx / dist + let uy = dy / dist + let px = -uy + let py = ux + let off = 0.0 + if is-bidi(src, tgt) { + if src < tgt { off = bidi-offset } else { off = -bidi-offset } + } + let sx = s.x + px * off + let sy = s.y + py * off + let tx = t.x + px * off + let ty = t.y + py * off + let x1 = sx + ux * (s.r + 0.06) + let y1 = sy + uy * (s.r + 0.06) + let x2 = tx - ux * (t.r + 0.1) + let y2 = ty - uy * (t.r + 0.1) + line( + (x1, y1), (x2, y2), + stroke: 0.35pt + luma(150), + mark: (end: "straight", scale: 0.3), + ) + } + } + + // --- Nodes --- + for n in nodes { + let (id, abbr, x, y, col, r) = n + let is-hub = r > 0.3 + circle( + (x, y), radius: r, + fill: col.lighten(if is-hub { 58% } else { 80% }), + stroke: (thickness: if is-hub { 1.4pt } else { 0.5pt }, paint: col), + name: id, + ) + content(id, text( + if is-hub { 7.5pt } else { 5.5pt }, + weight: if is-hub { "bold" } else { "regular" }, + fill: col.darken(25%), abbr, + )) + } + + // --- Dashed boxes for isolated nodes --- + rect((-1.4, 2.9), (0.0, 5.7), + stroke: (thickness: 0.3pt, paint: luma(190), dash: "dashed"), radius: 4pt) + content((-0.7, 2.55), text(4pt, fill: luma(150), "no reductions")) + + rect((13.85, -0.6), (15.2, 8.1), + stroke: (thickness: 0.3pt, paint: luma(190), dash: "dashed"), radius: 4pt) + content((14.5, -0.9), text(4pt, fill: luma(150), "no reductions")) + + // --- Legend --- + let lx = 1.0 + let ly = -1.3 + rect((lx - 0.3, ly - 0.2), (lx + 10.0, ly + 0.85), + stroke: 0.3pt + luma(180), fill: white, radius: 3pt) + let items = ( + ("Graph", col-graph), ("Formula", col-formula), ("Set", col-set), + ("Algebraic", col-alg), ("Misc", col-misc), + ) + for (i, (label, col)) in items.enumerate() { + let ex = lx + 0.25 + i * 2.0 + let ey = ly + 0.33 + circle((ex, ey), radius: 0.15, fill: col.lighten(80%), stroke: 0.5pt + col) + content((ex + 0.3, ey), anchor: "west", text(5pt, label)) + } +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 94028e0f..9ecd9d16 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -26,6 +26,13 @@ \section{Introduction}\label{sec:intro} \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} +\begin{figure*}[t] + \centering + \includegraphics[width=\textwidth]{figures/reduction-graph.pdf} + \caption{The reduction graph: 24 problem types connected by 52 directed edges (35 implemented reductions + 17 inferred variant edges). Hub nodes MIS, QUBO and ILP are highlighted. Nodes in dashed boxes have models but no cross-type reductions yet.} + \label{fig:reduction-graph} +\end{figure*} + \section{System Architecture}\label{sec:architecture} \section{Skill-Based Task Decomposition}\label{sec:skills} From f44bbe3477a5dcd50f1709db52fddb9bb531eba8 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 22:55:16 +0800 Subject: [PATCH 08/38] =?UTF-8?q?docs(arxiv):=20add=20Figure=203=20?= =?UTF-8?q?=E2=80=94=20card-based=20pipeline=20diagram=20(Typst+CeTZ)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/figures/pipeline.typ | 153 ++++++++++++++++++++++++++ docs/paper/arxiv/paper.tex | 7 ++ 2 files changed, 160 insertions(+) create mode 100644 docs/paper/arxiv/figures/pipeline.typ diff --git a/docs/paper/arxiv/figures/pipeline.typ b/docs/paper/arxiv/figures/pipeline.typ new file mode 100644 index 00000000..cc1bf26a --- /dev/null +++ b/docs/paper/arxiv/figures/pipeline.typ @@ -0,0 +1,153 @@ +#import "@preview/cetz:0.4.2": canvas, draw + +#set page(width: auto, height: auto, margin: 5pt) +#set text(size: 8pt, font: "New Computer Modern") + +// Color coding: human = orange, agent = blue +#let col-human = rgb("#f28e2b") +#let col-agent = rgb("#4e79a7") +#let col-bg-human = col-human.lighten(85%) +#let col-bg-agent = col-agent.lighten(85%) +#let col-neutral = luma(240) + +// Column card style +#let card-w = 2.4 +#let card-h = 0.7 +#let gap-y = 1.3 // vertical spacing between cards + +#canvas(length: 0.55cm, { + import draw: * + + // --- Helper: draw a board column card --- + let board-card(x, y, label, fill-col, stroke-col, name-id) = { + rect( + (x - card-w / 2, y - card-h / 2), + (x + card-w / 2, y + card-h / 2), + radius: 4pt, + fill: fill-col, + stroke: (thickness: 1pt, paint: stroke-col), + name: name-id, + ) + content(name-id, text(8pt, weight: "bold", fill: stroke-col.darken(15%), label)) + } + + // --- Layout: vertical pipeline --- + let cx = 0 // center x for column cards + let y0 = 0 // top + + // Contributor + Issue at top + content((cx - 3.5, y0), anchor: "east", text(7pt, fill: luma(100), [Contributor])) + rect( + (cx - 3.2, y0 - 0.3), (cx - 1.8, y0 + 0.3), + radius: 3pt, fill: col-neutral, stroke: 0.5pt + luma(180), name: "issue", + ) + content("issue", text(7pt, [Issue])) + line( + (cx - 1.8, y0), (cx - card-w / 2 - 0.05, y0), + stroke: 0.6pt + luma(150), + mark: (end: "straight", scale: 0.35), + ) + + // Backlog + board-card(cx, y0, "Backlog", col-neutral, luma(130), "backlog") + + // Arrow: Backlog -> Ready (human) + let y1 = y0 - gap-y + line( + (cx, y0 - card-h / 2), (cx, y1 + card-h / 2 + 0.05), + stroke: (thickness: 1.2pt, paint: col-human), + mark: (end: "straight", scale: 0.4), + ) + content( + (cx + card-w / 2 + 0.2, (y0 + y1) / 2), anchor: "west", + text(6.5pt, fill: col-human, [Maintainer\ moves card]), + ) + + // Ready + board-card(cx, y1, "Ready", col-bg-human, col-human, "ready") + + // Arrow: Ready -> In Progress (agent: project-pipeline) + let y2 = y1 - gap-y + line( + (cx, y1 - card-h / 2), (cx, y2 + card-h / 2 + 0.05), + stroke: (thickness: 1.2pt, paint: col-agent), + mark: (end: "straight", scale: 0.4), + ) + content( + (cx + card-w / 2 + 0.2, (y1 + y2) / 2), anchor: "west", + text(6.5pt, fill: col-agent, [`project-pipeline`]), + ) + + // In Progress + board-card(cx, y2, "In Progress", col-bg-agent, col-agent, "inprog") + + // Arrow: In Progress -> review-agentic (agent substeps) + let y3 = y2 - gap-y + line( + (cx, y2 - card-h / 2), (cx, y3 + card-h / 2 + 0.05), + stroke: (thickness: 1.2pt, paint: col-agent), + mark: (end: "straight", scale: 0.4), + ) + // Substep labels + content( + (cx + card-w / 2 + 0.2, (y2 + y3) / 2), anchor: "west", + text(6pt, fill: col-agent, [`issue-to-pr` #sym.arrow `check` #sym.arrow `implement` #sym.arrow `review`]), + ) + + // review-agentic + board-card(cx, y3, "review-agentic", col-bg-agent, col-agent, "rev-agent") + + // Arrow: review-agentic -> In Review (agent: review-pipeline) + let y4 = y3 - gap-y + line( + (cx, y3 - card-h / 2), (cx, y4 + card-h / 2 + 0.05), + stroke: (thickness: 1.2pt, paint: col-agent), + mark: (end: "straight", scale: 0.4), + ) + content( + (cx + card-w / 2 + 0.2, (y3 + y4) / 2), anchor: "west", + text(6.5pt, fill: col-agent, [`review-pipeline`]), + ) + + // In Review + board-card(cx, y4, "In Review", col-bg-agent, col-agent, "inrev") + + // Arrow: In Review -> Done (human) + let y5 = y4 - gap-y + line( + (cx, y4 - card-h / 2), (cx, y5 + card-h / 2 + 0.05), + stroke: (thickness: 1.2pt, paint: col-human), + mark: (end: "straight", scale: 0.4), + ) + content( + (cx + card-w / 2 + 0.2, (y4 + y5) / 2), anchor: "west", + text(6.5pt, fill: col-human, [Maintainer\ merges PR]), + ) + + // Done + board-card(cx, y5, "Done", col-bg-human, col-human, "done") + + // --- Bracket annotations on the left --- + // Agent zone bracket (Ready -> In Review) + let bx = cx - card-w / 2 - 0.6 + let bracket-top = y1 - card-h / 2 - 0.05 + let bracket-bot = y4 + card-h / 2 + 0.05 + line( + (bx + 0.15, bracket-top), (bx, bracket-top), (bx, bracket-bot), (bx + 0.15, bracket-bot), + stroke: (thickness: 0.8pt, paint: col-agent, dash: "dashed"), + ) + content( + (bx - 0.15, (bracket-top + bracket-bot) / 2), anchor: "east", + text(6pt, fill: col-agent, weight: "bold", [Agent\ zone]), + ) + + // --- Legend at bottom --- + let ly = y5 - card-h / 2 - 0.7 + let lx = cx - 2.5 + // Human + line((lx, ly), (lx + 0.6, ly), stroke: (thickness: 1.2pt, paint: col-human)) + content((lx + 0.8, ly), anchor: "west", text(6pt, [Human decision])) + // Agent + line((lx + 3.2, ly), (lx + 3.8, ly), stroke: (thickness: 1.2pt, paint: col-agent)) + content((lx + 4.0, ly), anchor: "west", text(6pt, [Agent action])) +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 9ecd9d16..2d90e556 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -37,6 +37,13 @@ \section{System Architecture}\label{sec:architecture} \section{Skill-Based Task Decomposition}\label{sec:skills} +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/pipeline.pdf} + \caption{Two-stage card-based pipeline. Human decisions (orange) are limited to Backlog$\to$Ready and In Review$\to$Done. Agent manages everything in between.} + \label{fig:pipeline} +\end{figure} + \section{Multi-Layered Verification}\label{sec:verification} \section{Evaluation}\label{sec:evaluation} From 2093a9a35931a107ec0098f9a42d60705bfcc2f3 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:01:38 +0800 Subject: [PATCH 09/38] =?UTF-8?q?docs(arxiv):=20add=20Figure=204=20?= =?UTF-8?q?=E2=80=94=20verification=20pyramid=20(Typst+CeTZ)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.6 --- .../arxiv/figures/verification-pyramid.typ | 133 ++++++++++++++++++ docs/paper/arxiv/paper.tex | 7 + 2 files changed, 140 insertions(+) create mode 100644 docs/paper/arxiv/figures/verification-pyramid.typ diff --git a/docs/paper/arxiv/figures/verification-pyramid.typ b/docs/paper/arxiv/figures/verification-pyramid.typ new file mode 100644 index 00000000..d0226b1f --- /dev/null +++ b/docs/paper/arxiv/figures/verification-pyramid.typ @@ -0,0 +1,133 @@ +#import "@preview/cetz:0.4.2": canvas, draw + +#set page(width: auto, height: auto, margin: 5pt) +#set text(size: 7pt, font: "New Computer Modern") + +// Layer data: (mechanism, error class caught) +#let layers = ( + ("Type system (Rust compiler)", "API misuse"), + ("Unit tests (eval, serialization)", "evaluation errors"), + ("Closed-loop tests (round-trip)", "mapping errors"), + ("Overhead validation (symbolic exprs)", "formula errors"), + ("Materialized fixtures (JSON ground truth)", "test gaming"), + ("Agentic review (parallel subagents)", "convention violations"), + ("Documentation (proof sketch)", "logical errors"), +) + +// Color gradient: blue (automated, bottom) -> gold (human, top) +#let col-auto = rgb("#4e79a7") // blue +#let col-human = rgb("#e8a838") // gold + +#let lerp-color(t) = { + color.mix((col-auto, (1 - t) * 100%), (col-human, t * 100%)) +} + +#canvas(length: 0.55cm, { + import draw: * + + let n = 7 // number of layers + let layer-h = 1.1 // height of each layer + let gap = 0.12 // gap between layers + let max-w = 14.0 // width of bottom layer + let min-w = 5.5 // width of top layer + let cx = 0 // center x + let right-col-x = max-w / 2 + 0.6 // x position for right-side labels + + // Draw layers from bottom to top + for i in range(n) { + let t-bot = i / n + let t-top = (i + 1) / n + + // Widths: linear interpolation + let w-bot = max-w - (max-w - min-w) * t-bot + let w-top = max-w - (max-w - min-w) * t-top + + // Width at midpoint (for label positioning) + let t-mid = (i + 0.5) / n + let w-mid = max-w - (max-w - min-w) * t-mid + + // Y coordinates (layer 0 at bottom) + let y-bot = i * (layer-h + gap) + let y-top = y-bot + layer-h + let y-mid = (y-bot + y-top) / 2 + + // Color for this layer + let col = lerp-color(t-bot) + let col-fill = col.lighten(70%) + let col-stroke = col.darken(10%) + let col-text = col-stroke.darken(30%) + + let name-id = "layer" + str(i) + + // Draw trapezoid + merge-path( + close: true, + fill: col-fill, + stroke: (thickness: 0.8pt, paint: col-stroke), + name: name-id, + { + line( + (cx - w-bot / 2, y-bot), + (cx + w-bot / 2, y-bot), + (cx + w-top / 2, y-top), + (cx - w-top / 2, y-top), + ) + }, + ) + + let (mechanism, catches) = layers.at(i) + + // Mechanism label centered inside the trapezoid + content( + (cx, y-mid), + anchor: "center", + text(7.5pt, weight: "bold", fill: col-text, + [L#(i + 1): #mechanism], + ), + ) + + // "catches:" label outside on the right, connected by a thin line + let edge-x = cx + w-mid / 2 // right edge at midpoint height + + // Small connecting line from trapezoid edge to label + line( + (edge-x + 0.05, y-mid), (right-col-x - 0.15, y-mid), + stroke: (thickness: 0.4pt, paint: col-stroke.lighten(30%), dash: "dotted"), + ) + + content( + (right-col-x, y-mid), + anchor: "west", + text(6.5pt, fill: col-text.lighten(20%), + [#sym.arrow.r #emph(catches)], + ), + ) + } + + // Side annotations + let total-h = n * (layer-h + gap) - gap + + // Left bracket: "Automated" for bottom 4 layers (L1-L4) + let bx-left = cx - max-w / 2 - 0.8 + let auto-top = 4 * (layer-h + gap) - gap + line( + (bx-left + 0.15, 0), (bx-left, 0), (bx-left, auto-top), (bx-left + 0.15, auto-top), + stroke: (thickness: 0.7pt, paint: col-auto, dash: "dashed"), + ) + content( + (bx-left - 0.15, auto-top / 2), anchor: "east", + text(6pt, fill: col-auto, weight: "bold", [Fully\ automated]), + ) + + // Left bracket: "Human-readable" for top 3 layers (L5-L7) + let human-bot = 4 * (layer-h + gap) + let human-top = total-h + line( + (bx-left + 0.15, human-bot), (bx-left, human-bot), (bx-left, human-top), (bx-left + 0.15, human-top), + stroke: (thickness: 0.7pt, paint: col-human, dash: "dashed"), + ) + content( + (bx-left - 0.15, (human-bot + human-top) / 2), anchor: "east", + text(6pt, fill: col-human.darken(10%), weight: "bold", [Human-\ readable]), + ) +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 2d90e556..147912a4 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -46,6 +46,13 @@ \section{Skill-Based Task Decomposition}\label{sec:skills} \section{Multi-Layered Verification}\label{sec:verification} +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/verification-pyramid.pdf} + \caption{Seven-layer verification stack. Lower layers (blue) are fully automated; upper layers (gold) involve human-readable arguments.} + \label{fig:verification} +\end{figure} + \section{Evaluation}\label{sec:evaluation} \section{Related Work}\label{sec:related} From 8fa350d26320bd28ebdea17a846c8e68c7bc7b98 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:15:16 +0800 Subject: [PATCH 10/38] =?UTF-8?q?docs(arxiv):=20add=20Figure=202=20?= =?UTF-8?q?=E2=80=94=20system=20architecture=20(Typst+CeTZ)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/figures/architecture.typ | 198 ++++++++++++++++++++++ docs/paper/arxiv/paper.tex | 7 + 2 files changed, 205 insertions(+) create mode 100644 docs/paper/arxiv/figures/architecture.typ diff --git a/docs/paper/arxiv/figures/architecture.typ b/docs/paper/arxiv/figures/architecture.typ new file mode 100644 index 00000000..ef98a86c --- /dev/null +++ b/docs/paper/arxiv/figures/architecture.typ @@ -0,0 +1,198 @@ +#import "@preview/cetz:0.4.2": canvas, draw + +#set page(width: auto, height: auto, margin: 5pt) +#set text(size: 8pt, font: "New Computer Modern") + +#let col-trait = rgb("#4e79a7") // blue +#let col-reduction = rgb("#59a14f") // green +#let col-compile = rgb("#e8a838") // gold + +#canvas(length: 0.55cm, { + import draw: * + + let box-w = 12.0 + let box1-h = 3.0 // Problem trait box (taller: has two sub-boxes) + let box2-h = 1.6 // ReductionResult box + let box3-h = 2.4 // Compile-time validation box + let arrow-gap = 1.4 + let cx = 0 + + // --- Box 1: Problem trait (top) --- + let y1-top = 0 + let y1-bot = -box1-h + + rect( + (cx - box-w / 2, y1-top), (cx + box-w / 2, y1-bot), + radius: 4pt, + fill: col-trait.lighten(88%), + stroke: (thickness: 1pt, paint: col-trait), + name: "box1", + ) + + // Title + content( + (cx, y1-top - 0.4), anchor: "center", + text(9pt, weight: "bold", fill: col-trait.darken(20%), + [`Problem` trait], + ), + ) + + // Method list + content( + (cx, y1-top - 1.0), anchor: "center", + text(7.5pt, fill: luma(60), + [`NAME`#h(4pt)#sym.dot.c#h(4pt)`Metric`#h(4pt)#sym.dot.c#h(4pt)`dims()`#h(4pt)#sym.dot.c#h(4pt)`evaluate()`], + ), + ) + + // Divider line + line( + (cx - box-w / 2 + 0.4, y1-top - 1.4), + (cx + box-w / 2 - 0.4, y1-top - 1.4), + stroke: (thickness: 0.5pt, paint: col-trait.lighten(40%)), + ) + + // Sub-boxes for Optimization and Satisfaction + let sub-margin = 0.35 // margin from parent box edge + let sub-gap = 0.3 // gap between sub-boxes + let sub-w = (box-w - 2 * sub-margin - sub-gap) / 2 // = 5.525 each + let sub-h = 1.0 + let sub-y-top = y1-top - 1.6 + let sub-y-bot = sub-y-top - sub-h + + // Optimization sub-box (left) + let opt-left = cx - box-w / 2 + sub-margin + let opt-right = opt-left + sub-w + rect( + (opt-left, sub-y-top), (opt-right, sub-y-bot), + radius: 3pt, + fill: col-trait.lighten(78%), + stroke: (thickness: 0.6pt, paint: col-trait.lighten(20%)), + name: "opt", + ) + content( + "opt", anchor: "center", + { + text(7.5pt, weight: "bold", fill: col-trait.darken(10%), [`OptimizationProblem`]) + linebreak() + text(6pt, fill: luma(80), [`SolutionSize` #sym.dot.c `direction()`]) + }, + ) + + // Satisfaction sub-box (right) + let sat-left = opt-right + sub-gap + let sat-right = sat-left + sub-w + rect( + (sat-left, sub-y-top), (sat-right, sub-y-bot), + radius: 3pt, + fill: col-trait.lighten(78%), + stroke: (thickness: 0.6pt, paint: col-trait.lighten(20%)), + name: "sat", + ) + content( + "sat", anchor: "center", + { + text(7.5pt, weight: "bold", fill: col-trait.darken(10%), [`SatisfactionProblem`]) + linebreak() + text(6pt, fill: luma(80), [`Metric = bool`]) + }, + ) + + // --- Arrow 1: Box 1 -> Box 2 --- + let a1-top = y1-bot + let a1-bot = y1-bot - arrow-gap + line( + (cx, a1-top), (cx, a1-bot + 0.05), + stroke: (thickness: 1.2pt, paint: col-reduction.darken(10%)), + mark: (end: "straight", scale: 0.45), + ) + content( + (cx + 0.3, (a1-top + a1-bot) / 2), anchor: "west", + text(7.5pt, weight: "bold", fill: col-reduction.darken(10%), + [`ReduceTo`], + ), + ) + + // --- Box 2: ReductionResult --- + let y2-top = a1-bot + let y2-bot = y2-top - box2-h + + rect( + (cx - box-w / 2, y2-top), (cx + box-w / 2, y2-bot), + radius: 4pt, + fill: col-reduction.lighten(88%), + stroke: (thickness: 1pt, paint: col-reduction), + name: "box2", + ) + + content( + (cx, y2-top - 0.45), anchor: "center", + text(9pt, weight: "bold", fill: col-reduction.darken(20%), + [`ReductionResult`], + ), + ) + + content( + (cx, y2-top - 1.1), anchor: "center", + text(7.5pt, fill: luma(60), + [`target_problem()`#h(4pt)#sym.dot.c#h(4pt)`extract_solution()`], + ), + ) + + // --- Arrow 2: Box 2 -> Box 3 --- + let a2-top = y2-bot + let a2-bot = y2-bot - arrow-gap + line( + (cx, a2-top), (cx, a2-bot + 0.05), + stroke: (thickness: 1.2pt, paint: col-compile.darken(10%)), + mark: (end: "straight", scale: 0.45), + ) + content( + (cx + 0.3, (a2-top + a2-bot) / 2), anchor: "west", + text(7pt, fill: col-compile.darken(10%), + [`#[reduction(overhead = {...})]`], + ), + ) + + // --- Box 3: Compile-time validation --- + let y3-top = a2-bot + let y3-bot = y3-top - box3-h + + rect( + (cx - box-w / 2, y3-top), (cx + box-w / 2, y3-bot), + radius: 4pt, + fill: col-compile.lighten(88%), + stroke: (thickness: 1pt, paint: col-compile), + name: "box3", + ) + + content( + (cx, y3-top - 0.45), anchor: "center", + text(9pt, weight: "bold", fill: col-compile.darken(20%), + [Compile-time validation], + ), + ) + + // Bullet points + let bullet-x = cx - box-w / 2 + 1.2 + let bullet-y = y3-top - 1.1 + + content( + (bullet-x, bullet-y), anchor: "west", + text(7.5pt, fill: luma(60), + [#sym.bullet#h(3pt)Variable names #sym.arrow getter methods], + ), + ) + content( + (bullet-x, bullet-y - 0.55), anchor: "west", + text(7.5pt, fill: luma(60), + [#sym.bullet#h(3pt)`Expr` AST: symbolic overhead expressions], + ), + ) + content( + (bullet-x, bullet-y - 1.1), anchor: "west", + text(7.5pt, fill: luma(60), + [#sym.bullet#h(3pt)`declare_variants!` #sym.arrow compile-time registry], + ), + ) +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 147912a4..18c73eeb 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -35,6 +35,13 @@ \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} \section{System Architecture}\label{sec:architecture} +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/architecture.pdf} + \caption{System architecture: the trait hierarchy and compile-time validation enforce round-trip testing capability by construction.} + \label{fig:architecture} +\end{figure} + \section{Skill-Based Task Decomposition}\label{sec:skills} \begin{figure}[t] From ba38a61c9aba4af6c0fae7fe826b99227da078f4 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:19:42 +0800 Subject: [PATCH 11/38] docs(arxiv): git history mining script and results Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/data/git-mining-results.json | 683 ++++++++++++++++++ docs/paper/arxiv/scripts/mine-git-history.py | 150 ++++ 2 files changed, 833 insertions(+) create mode 100644 docs/paper/arxiv/data/git-mining-results.json create mode 100644 docs/paper/arxiv/scripts/mine-git-history.py diff --git a/docs/paper/arxiv/data/git-mining-results.json b/docs/paper/arxiv/data/git-mining-results.json new file mode 100644 index 00000000..d3904123 --- /dev/null +++ b/docs/paper/arxiv/data/git-mining-results.json @@ -0,0 +1,683 @@ +{ + "summary": { + "total_prs": 58, + "rule_prs": 2, + "model_prs": 5, + "other_prs": 51, + "agent_authored": 0, + "human_authored": 58 + }, + "by_phase": [ + { + "phase": 1, + "label": "manual", + "count": 35, + "rule_count": 1, + "model_count": 1, + "agent_count": 0, + "human_count": 35 + }, + { + "phase": 2, + "label": "basic-skills", + "count": 9, + "rule_count": 0, + "model_count": 2, + "agent_count": 0, + "human_count": 9 + }, + { + "phase": 3, + "label": "full-pipeline", + "count": 14, + "rule_count": 1, + "model_count": 2, + "agent_count": 0, + "human_count": 14 + } + ], + "phase_boundaries": { + "phase_1_end": "2026-02-22T00:00:00+00:00", + "phase_2_end": "2026-03-01T00:00:00+00:00" + }, + "prs": [ + { + "number": 4, + "title": "feat: Feature parity with ProblemReductions.jl", + "author": "GiggleLiu", + "created_at": "2026-01-25T07:08:45Z", + "merged_at": "2026-01-25T07:56:00Z", + "branch": "feature-parity", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 7, + "title": "feat: Implement remaining reduction rules", + "author": "GiggleLiu", + "created_at": "2026-01-25T15:59:13Z", + "merged_at": "2026-01-25T16:20:15Z", + "branch": "feat/remaining-reductions", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 9, + "title": "docs: Add reduction classification and detailed survey", + "author": "GiggleLiu", + "created_at": "2026-01-25T17:41:57Z", + "merged_at": "2026-01-26T01:14:31Z", + "branch": "docs/reduction-classification-survey", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 12, + "title": "feat: Implement set-theoretic reduction path finding", + "author": "GiggleLiu", + "created_at": "2026-01-26T15:26:24Z", + "merged_at": "2026-01-26T15:47:56Z", + "branch": "feat/set-theoretic-reductions", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 13, + "title": "feat: Add grid graph mapping for unit disk reductions", + "author": "GiggleLiu", + "created_at": "2026-01-26T22:48:56Z", + "merged_at": "2026-02-02T00:36:23Z", + "branch": "feat/grid-graph-mapping", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 20, + "title": "feat: Implement integer programming solver for Coloring problem", + "author": "GiggleLiu", + "created_at": "2026-01-31T01:40:53Z", + "merged_at": "2026-01-31T03:49:37Z", + "branch": "feat/coloring-ilp-solver", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 22, + "title": "feat: Implement Factoring \u2192 ILP reduction (issue #21)", + "author": "GiggleLiu", + "created_at": "2026-01-31T03:14:21Z", + "merged_at": "2026-01-31T03:44:54Z", + "branch": "feat/factoring-ilp-solver", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 25, + "title": "feat: Add problem variants, documentation improvements, and reduction macro", + "author": "GiggleLiu", + "created_at": "2026-02-02T03:40:27Z", + "merged_at": "2026-02-02T16:13:43Z", + "branch": "feat/problem-variants-and-docs", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 27, + "title": "Restructure tests: split test and source code", + "author": "isPANN", + "created_at": "2026-02-07T16:16:41Z", + "merged_at": "2026-02-08T00:55:16Z", + "branch": "restructure-tests", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 29, + "title": "Implement 6 problem-to-QUBO reductions (Issue #18)", + "author": "GiggleLiu", + "created_at": "2026-02-08T05:58:46Z", + "merged_at": "2026-02-09T12:42:18Z", + "branch": "issue-18-qubo-reductions", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 31, + "title": "docs: polish reductions.typ with theorem labels and cleanup", + "author": "GiggleLiu", + "created_at": "2026-02-09T17:06:22Z", + "merged_at": "2026-02-10T04:04:06Z", + "branch": "polish-reductions-typ", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 36, + "title": "JSON schema export & interactive reduction diagram (#33, #34)", + "author": "GiggleLiu", + "created_at": "2026-02-10T05:34:37Z", + "merged_at": "2026-02-10T07:01:33Z", + "branch": "feat/json-schema-interactive-viz", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 38, + "title": "docs: replace Rust code with JSON schema tables in paper", + "author": "GiggleLiu", + "created_at": "2026-02-10T08:24:29Z", + "merged_at": "2026-02-10T16:40:38Z", + "branch": "feat/improve-reductions-typ", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 41, + "title": "docs: improve example instances implementation plan", + "author": "GiggleLiu", + "created_at": "2026-02-10T17:05:54Z", + "merged_at": "2026-02-11T02:50:00Z", + "branch": "docs/improve-example-instances-plan-v2", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 42, + "title": "fix: use directed edges instead of bidirectional in reduction graph", + "author": "GiggleLiu", + "created_at": "2026-02-10T17:10:05Z", + "merged_at": "2026-02-10T17:48:55Z", + "branch": "fix/remove-bidirectional-edges", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 50, + "title": "Design: trait system refactoring for contributor ergonomics", + "author": "GiggleLiu", + "created_at": "2026-02-12T01:39:47Z", + "merged_at": "2026-02-12T17:48:08Z", + "branch": "design/trait-refactoring", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 54, + "title": "perf: optimize pathdecomposition and add ground truth tests", + "author": "GiggleLiu", + "created_at": "2026-02-12T18:49:22Z", + "merged_at": "2026-02-13T05:49:22Z", + "branch": "perf/pathdecomposition-optimization", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 56, + "title": "Remove weight type parameter from CircuitSAT and KColoring", + "author": "GiggleLiu", + "created_at": "2026-02-13T12:08:18Z", + "merged_at": "2026-02-13T14:31:45Z", + "branch": "fix/circuitsat-no-weight", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 57, + "title": "Fix #47: Add HamiltonianCycle model", + "author": "GiggleLiu", + "created_at": "2026-02-13T13:25:42Z", + "merged_at": "2026-02-13T14:30:08Z", + "branch": "issue-47-hamiltonian-cycle", + "is_agent": false, + "phase": 1, + "type": "Model" + }, + { + "number": 60, + "title": "Fix #52: TravelingSalesman to ILP reduction", + "author": "GiggleLiu", + "created_at": "2026-02-13T15:08:58Z", + "merged_at": "2026-02-13T16:55:22Z", + "branch": "52-travelingsalesman-ilp-reduction", + "is_agent": false, + "phase": 1, + "type": "Rule" + }, + { + "number": 65, + "title": "Add parity tests against Julia ProblemReductions.jl", + "author": "GiggleLiu", + "created_at": "2026-02-13T20:20:26Z", + "merged_at": "2026-02-14T08:21:49Z", + "branch": "jg/issue-64-test-against-jl", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 66, + "title": "Simplify variant system and clean up type hierarchy", + "author": "GiggleLiu", + "created_at": "2026-02-14T01:35:50Z", + "merged_at": "2026-02-14T06:35:43Z", + "branch": "jg/fix-reduction-graph", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 68, + "title": "feat: variant-aware reduction paths with resolve_path", + "author": "GiggleLiu", + "created_at": "2026-02-14T07:53:44Z", + "merged_at": "2026-02-15T06:41:07Z", + "branch": "jg/variant-aware-paths", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 71, + "title": "Refactor: address KISS and DRY violations (#70)", + "author": "GiggleLiu", + "created_at": "2026-02-15T14:50:18Z", + "merged_at": "2026-02-15T15:12:12Z", + "branch": "jg/issue-70", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 72, + "title": "refactor: variant-level reduction graph with path-based API", + "author": "GiggleLiu", + "created_at": "2026-02-15T15:35:17Z", + "merged_at": "2026-02-16T06:35:53Z", + "branch": "variant-refactor-plan", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 74, + "title": "Fix #73: Refactor graph problem constructors to take graph as input", + "author": "GiggleLiu", + "created_at": "2026-02-16T06:58:30Z", + "merged_at": "2026-02-16T08:51:08Z", + "branch": "issue-73-graph-constructor-refactoring", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 75, + "title": "Close Julia parity test gaps: BicliqueCover, BMF, SAT\u2192CircuitSAT, reduction paths", + "author": "GiggleLiu", + "created_at": "2026-02-16T08:14:23Z", + "merged_at": "2026-02-16T12:07:31Z", + "branch": "jg/issue-67-julia-parity-gaps", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 76, + "title": "feat: add problem_size() to Problem trait with validation", + "author": "GiggleLiu", + "created_at": "2026-02-16T12:47:15Z", + "merged_at": "2026-02-16T15:19:23Z", + "branch": "feat/problem-size-trait", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 78, + "title": "Rewrite getting-started with Factoring\u2192SpinGlass and path overhead API", + "author": "GiggleLiu", + "created_at": "2026-02-16T17:10:21Z", + "merged_at": "2026-02-16T17:40:56Z", + "branch": "jg/getting-started-factoring-overhead", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 79, + "title": "Reduce exported functions (#77)", + "author": "GiggleLiu", + "created_at": "2026-02-16T17:14:12Z", + "merged_at": "2026-02-16T17:38:18Z", + "branch": "reduce-exports", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 80, + "title": "Reduce exported functions (closes #77)", + "author": "GiggleLiu", + "created_at": "2026-02-16T17:41:44Z", + "merged_at": "2026-02-16T17:54:58Z", + "branch": "reduce-exports", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 82, + "title": "feat: add pred CLI tool for problem reductions", + "author": "GiggleLiu", + "created_at": "2026-02-17T23:57:17Z", + "merged_at": "2026-02-18T13:25:50Z", + "branch": "cli-tool-design", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 84, + "title": "feat(cli): CLI UX improvements", + "author": "GiggleLiu", + "created_at": "2026-02-18T13:58:08Z", + "merged_at": "2026-02-19T14:22:17Z", + "branch": "cli-v2-design", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 85, + "title": "feat: add QUBO\u2192ILP and CircuitSAT\u2192ILP reductions", + "author": "GiggleLiu", + "created_at": "2026-02-18T14:07:25Z", + "merged_at": "2026-02-19T04:28:17Z", + "branch": "ilp-reduction-plans", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 89, + "title": "fix: close completeness gaps from review-implementation audit (#88)", + "author": "GiggleLiu", + "created_at": "2026-02-20T03:23:10Z", + "merged_at": "2026-02-20T03:38:42Z", + "branch": "fix/issue-88-completeness-gaps", + "is_agent": false, + "phase": 1, + "type": null + }, + { + "number": 92, + "title": "Fix #90: Add ClosestVectorProblem model", + "author": "GiggleLiu", + "created_at": "2026-02-22T06:30:43Z", + "merged_at": "2026-02-28T16:51:52Z", + "branch": "issue-90-closest-vector-problem", + "is_agent": false, + "phase": 2, + "type": "Model" + }, + { + "number": 93, + "title": "fix(mcp): review fixes, multi-platform docs, remove Smithery", + "author": "GiggleLiu", + "created_at": "2026-02-22T13:09:14Z", + "merged_at": "2026-02-22T16:39:02Z", + "branch": "fix/mcp-review-fixes", + "is_agent": false, + "phase": 2, + "type": null + }, + { + "number": 96, + "title": "Fix #95: Add BinPacking model", + "author": "GiggleLiu", + "created_at": "2026-02-25T02:27:25Z", + "merged_at": "2026-02-28T15:25:07Z", + "branch": "issue-95-bin-packing", + "is_agent": false, + "phase": 2, + "type": "Model" + }, + { + "number": 99, + "title": "Replace Polynomial overhead system with Expr AST", + "author": "GiggleLiu", + "created_at": "2026-02-25T23:47:07Z", + "merged_at": "2026-02-26T09:08:20Z", + "branch": "feat/expr-overhead-system", + "is_agent": false, + "phase": 2, + "type": null + }, + { + "number": 100, + "title": "test: add coverage for Expr overhead system and fix docs", + "author": "GiggleLiu", + "created_at": "2026-02-26T09:40:44Z", + "merged_at": "2026-02-26T10:25:01Z", + "branch": "fix/expr-coverage-and-docs", + "is_agent": false, + "phase": 2, + "type": null + }, + { + "number": 101, + "title": "fix: CLI UX improvements from issue #86", + "author": "GiggleLiu", + "created_at": "2026-02-26T11:21:56Z", + "merged_at": "2026-02-27T05:34:11Z", + "branch": "fix/cli-ux-issue-86", + "is_agent": false, + "phase": 2, + "type": null + }, + { + "number": 102, + "title": "feat: explicit variant declarations with complexity metadata", + "author": "GiggleLiu", + "created_at": "2026-02-27T05:40:03Z", + "merged_at": "2026-02-27T17:54:52Z", + "branch": "fix/variant-display", + "is_agent": false, + "phase": 2, + "type": null + }, + { + "number": 106, + "title": "feat: One weight IS\u2194SP variants, fix complexity metadata, enrich paper", + "author": "GiggleLiu", + "created_at": "2026-02-28T02:09:19Z", + "merged_at": "2026-02-28T10:22:58Z", + "branch": "feat/one-weight-variant-and-cleanup", + "is_agent": false, + "phase": 2, + "type": null + }, + { + "number": 111, + "title": "Enrich paper with examples, figures, and algorithm citations", + "author": "GiggleLiu", + "created_at": "2026-02-28T13:11:38Z", + "merged_at": "2026-02-28T14:12:56Z", + "branch": "jg/paper-writing", + "is_agent": false, + "phase": 2, + "type": null + }, + { + "number": 112, + "title": "Fix complexity inconsistencies, enforce overhead, add missing variants", + "author": "GiggleLiu", + "created_at": "2026-02-28T17:42:41Z", + "merged_at": "2026-03-01T03:57:13Z", + "branch": "fix/complexity-overhead-variants", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 113, + "title": "Recategorize problem models by input structure", + "author": "GiggleLiu", + "created_at": "2026-03-01T05:05:30Z", + "merged_at": "2026-03-01T06:26:24Z", + "branch": "recategorize-models", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 139, + "title": "fix: CLI QA improvements \u2014 docs, display, auto-JSON", + "author": "GiggleLiu", + "created_at": "2026-03-02T20:51:21Z", + "merged_at": "2026-03-02T20:57:15Z", + "branch": "fix/cli-qa-issues", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 171, + "title": "Fix #114: Add Knapsack model", + "author": "zazabap", + "created_at": "2026-03-04T19:27:59Z", + "merged_at": "2026-03-10T04:43:13Z", + "branch": "issue-114-knapsack", + "is_agent": false, + "phase": 3, + "type": "Model" + }, + { + "number": 188, + "title": "Update references, docs, and check-issue skill", + "author": "GiggleLiu", + "created_at": "2026-03-06T11:25:53Z", + "merged_at": "2026-03-06T15:15:01Z", + "branch": "jg/references", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 190, + "title": "fix: CLI QA improvements \u2014 creation, aliases, help, schemas", + "author": "GiggleLiu", + "created_at": "2026-03-07T06:32:09Z", + "merged_at": "2026-03-07T09:00:29Z", + "branch": "fix/cli-qa-189", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 194, + "title": "feat: redundant rule detection via polynomial overhead comparison (#193)", + "author": "GiggleLiu", + "created_at": "2026-03-09T13:43:25Z", + "merged_at": "2026-03-10T17:50:51Z", + "branch": "jg/redundant-rule-detection", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 195, + "title": "Fix #191: Address 5 skill workflow issues", + "author": "zazabap", + "created_at": "2026-03-09T17:06:13Z", + "merged_at": "2026-03-09T23:44:48Z", + "branch": "issue-191-skill-fixes", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 263, + "title": "feat: display Big O notation in CLI output", + "author": "GiggleLiu", + "created_at": "2026-03-11T00:32:04Z", + "merged_at": "2026-03-12T10:54:45Z", + "branch": "feat/cli-big-o-notation", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 570, + "title": "Fix #117: [Model] GraphPartitioning", + "author": "GiggleLiu", + "created_at": "2026-03-12T06:13:15Z", + "merged_at": "2026-03-12T14:34:13Z", + "branch": "issue-117-graph-partitioning", + "is_agent": false, + "phase": 3, + "type": "Model" + }, + { + "number": 592, + "title": "feat: display Big O notation in CLI output", + "author": "GiggleLiu", + "created_at": "2026-03-12T11:09:48Z", + "merged_at": "2026-03-12T11:11:37Z", + "branch": "feat/cli-big-o-notation", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 593, + "title": "fix: check-issue delegates overhead comparison to check-rule-redundancy", + "author": "zazabap", + "created_at": "2026-03-12T11:24:59Z", + "merged_at": "2026-03-12T14:42:49Z", + "branch": "fix/check-issue-use-redundancy-skill", + "is_agent": false, + "phase": 3, + "type": null + }, + { + "number": 599, + "title": "Fix #126: Add KSatisfiability to SubsetSum reduction", + "author": "GiggleLiu", + "created_at": "2026-03-12T12:15:30Z", + "merged_at": "2026-03-12T14:35:27Z", + "branch": "issue-126-ksatisfiability-to-subsetsum", + "is_agent": false, + "phase": 3, + "type": "Rule" + }, + { + "number": 613, + "title": "Fix paper citations from issue #126 and #117 reviews", + "author": "GiggleLiu", + "created_at": "2026-03-12T14:57:51Z", + "merged_at": "2026-03-12T15:14:06Z", + "branch": "fix/126-ksat-subsetsum-attribution", + "is_agent": false, + "phase": 3, + "type": null + } + ] +} diff --git a/docs/paper/arxiv/scripts/mine-git-history.py b/docs/paper/arxiv/scripts/mine-git-history.py new file mode 100644 index 00000000..0488b017 --- /dev/null +++ b/docs/paper/arxiv/scripts/mine-git-history.py @@ -0,0 +1,150 @@ +#!/usr/bin/env python3 +"""Mine merged PRs from CodingThrust/problem-reductions to understand project evolution. + +Extracts all merged PRs, classifies them by type ([Rule]/[Model]), author type +(agent vs human), and project phase (manual / basic skills / full pipeline). + +Phase boundaries: + - Phase 1 (manual): before 2026-02-22 (no add-model/add-rule skills) + - Phase 2 (basic skills): 2026-02-22 to 2026-02-28 (add-model/add-rule exist) + - Phase 3 (full pipeline): 2026-03-01 onwards (meta-power batch resolution) + +Output: JSON to stdout with summary, per-phase breakdown, and full PR list. +""" + +import json +import re +import subprocess +import sys +from datetime import datetime, timezone + +REPO = "CodingThrust/problem-reductions" + +# Phase boundary dates (UTC). Determined from: +# git show 3ddc415 --format="%ai" => 2026-02-22 (add-model / add-rule skills) +# git show 2cfb1b7 --format="%ai" => 2026-03-01 (meta-power skill) +PHASE_BOUNDARIES = [ + datetime(2026, 2, 22, tzinfo=timezone.utc), # Phase 1 -> Phase 2 + datetime(2026, 3, 1, tzinfo=timezone.utc), # Phase 2 -> Phase 3 +] + +PHASE_LABELS = ["manual", "basic-skills", "full-pipeline"] + +AGENT_LOGINS = {"github-actions"} + + +def is_agent(author: dict) -> bool: + """Classify author as agent (bot) or human.""" + login = author.get("login", "") + if "[bot]" in login: + return True + if login in AGENT_LOGINS: + return True + return author.get("is_bot", False) + + +def classify_phase(merged_at: str) -> int: + """Return 1-based phase number from the merged-at timestamp.""" + dt = datetime.fromisoformat(merged_at.replace("Z", "+00:00")) + for i, boundary in enumerate(PHASE_BOUNDARIES): + if dt < boundary: + return i + 1 + return len(PHASE_BOUNDARIES) + 1 + + +def classify_type(title: str) -> str | None: + """Return 'Rule', 'Model', or None based on PR title heuristics.""" + if "[Rule]" in title: + return "Rule" + if "[Model]" in title: + return "Model" + # Heuristic: detect issue-linked PRs whose branch or title imply a model/rule + # e.g. "Fix #52: TravelingSalesman to ILP reduction" => Rule + # e.g. "Fix #47: Add HamiltonianCycle model" => Model + title_lower = title.lower() + if re.search(r"\breduction\b", title_lower) and re.search(r"\bto\b", title_lower): + return "Rule" + if re.search(r"\badd\b.*\bmodel\b", title_lower): + return "Model" + return None + + +def fetch_prs() -> list[dict]: + """Fetch all merged PRs from GitHub.""" + cmd = [ + "gh", "pr", "list", + "--repo", REPO, + "--state", "merged", + "--limit", "999", + "--json", "number,title,author,createdAt,mergedAt,labels,headRefName", + ] + result = subprocess.run(cmd, capture_output=True, text=True, check=True) + return json.loads(result.stdout) + + +def main(): + prs_raw = fetch_prs() + + prs = [] + for pr in sorted(prs_raw, key=lambda x: x["number"]): + pr_type = classify_type(pr["title"]) + agent = is_agent(pr["author"]) + phase = classify_phase(pr["mergedAt"]) + + prs.append({ + "number": pr["number"], + "title": pr["title"], + "author": pr["author"]["login"], + "created_at": pr["createdAt"], + "merged_at": pr["mergedAt"], + "branch": pr["headRefName"], + "is_agent": agent, + "phase": phase, + "type": pr_type, + }) + + # Summary + rule_prs = [p for p in prs if p["type"] == "Rule"] + model_prs = [p for p in prs if p["type"] == "Model"] + agent_prs = [p for p in prs if p["is_agent"]] + human_prs = [p for p in prs if not p["is_agent"]] + + summary = { + "total_prs": len(prs), + "rule_prs": len(rule_prs), + "model_prs": len(model_prs), + "other_prs": len(prs) - len(rule_prs) - len(model_prs), + "agent_authored": len(agent_prs), + "human_authored": len(human_prs), + } + + # Per-phase breakdown + by_phase = [] + for phase_num, label in enumerate(PHASE_LABELS, start=1): + phase_prs = [p for p in prs if p["phase"] == phase_num] + by_phase.append({ + "phase": phase_num, + "label": label, + "count": len(phase_prs), + "rule_count": len([p for p in phase_prs if p["type"] == "Rule"]), + "model_count": len([p for p in phase_prs if p["type"] == "Model"]), + "agent_count": len([p for p in phase_prs if p["is_agent"]]), + "human_count": len([p for p in phase_prs if not p["is_agent"]]), + }) + + output = { + "summary": summary, + "by_phase": by_phase, + "phase_boundaries": { + "phase_1_end": PHASE_BOUNDARIES[0].isoformat(), + "phase_2_end": PHASE_BOUNDARIES[1].isoformat(), + }, + "prs": prs, + } + + json.dump(output, sys.stdout, indent=2) + print() # trailing newline + + +if __name__ == "__main__": + main() From b768eecc19022b98844b08789b9446a368233908 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:24:27 +0800 Subject: [PATCH 12/38] docs(arxiv): write S1 Introduction Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 58 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 58 insertions(+) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 18c73eeb..7cd3d835 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -24,6 +24,64 @@ \section{Introduction}\label{sec:intro} +AI coding agents have made remarkable progress on isolated software engineering tasks. +On SWE-Bench Verified, which evaluates agents on single-issue bug fixes drawn from popular open-source repositories, the best systems now resolve 70--80\% of issues end-to-end~\cite{Xia2025LiveSWEagent}. +This has fueled optimism that fully autonomous software engineering is within reach. +Yet benchmarks designed to probe longer-horizon capabilities tell a starkly different story. +SWE-EVO, which requires agents to implement multi-step modifications spanning an average of 21~files, reports resolution rates around 21\% even for frontier models~\cite{Thai2025SWEEVO}. +SWE-Bench Pro, targeting enterprise-level tasks that may require hours to days of human effort, similarly finds that agents struggle with sustained multi-file reasoning~\cite{Deng2025SWEBenchPro}. +The common response to this capability gap has been to push for more powerful agents---larger models, better tool use, self-evolving scaffolds. +We argue that this framing overlooks the more fundamental bottleneck: not raw agent capability, but how work is decomposed and distributed between humans and agents. + +Our thesis is that the key to effective agentic coding lies in \emph{task decomposition}: splitting the creative and judgment-intensive aspects of software development (which humans do well) from the management and mechanical aspects (which agents can handle reliably). +When a task is sufficiently well-specified and bounded in scope, even current agents execute it with high fidelity. +The challenge is not making agents smarter, but structuring the work so that each unit falls within the agent's reliable operating range. +This perspective shifts attention from agent architecture to \emph{skill design}---the craft of encoding domain knowledge into reusable, agent-executable task specifications. + +This decomposition is especially critical for mathematical and scientific software, where the ``review is harder than generation'' problem is acute~\cite{Roychoudhury2025AgenticAI}. +An agent can generate a plausible-looking reduction from Boolean satisfiability to graph coloring, but verifying that the reduction preserves solution structure requires mathematical reasoning that current agents cannot reliably perform in isolation. +Roychoudhury~\cite{Roychoudhury2025AgenticAI} identifies specification inference---deciphering and clarifying developer intent---as the central difficulty in agentic software workflows, and argues that trustworthy deployment requires AI-based verification and validation of AI-generated code. +Our multi-layered verification stack addresses precisely this challenge: rather than attempting end-to-end formal verification (which remains largely unsolved for complex mathematical code~\cite{Thakur2025CLEVER, Bursuc2025VeriCoding}), we compose multiple lightweight verification mechanisms that collectively catch errors across different abstraction levels. + +We instantiate this methodology in the domain of NP-hard problem reductions, implemented as an open-source Rust library. +The library manages a typed reduction graph connecting 24 problem types through 40 hand-coded reduction rule implementations and 52 total directed edges (including 12 edges inferred from a type-parameter subtype lattice). +The domain serves as a Goldilocks testbed for studying agentic coding: each reduction is self-contained (50--200 lines of code), requires non-trivial mathematical reasoning about the mapping between problem structures, yet admits a fully automatable correctness criterion---reduce an instance, solve the target problem by brute force, extract the solution back, and verify it against the source. +This combination of mathematical depth with mechanical verifiability makes it possible to study how agents perform on tasks that are individually tractable but collectively demand sustained engineering discipline. + +Our approach distributes work across three roles with distinct responsibilities. +\textbf{Contributors} perform the creative work of identifying which problems and reductions are worth implementing: they open issues proposing new nodes or edges in the reduction graph, drawing on domain knowledge to spot gaps, recognize useful connections, and assess mathematical non-triviality. +\textbf{The maintainer} curates the project board and writes skills---markdown scripts that decompose complex tasks (such as ``implement a new reduction rule'') into sequences of agent-manageable subtasks. +The maintainer encodes domain knowledge, quality standards, and project conventions into these skills, effectively programming the agent's workflow rather than its output. +\textbf{Agents} serve in a dual capacity: as \emph{managers}, they pick cards from the project board, dispatch sub-agents for parallel review, and orchestrate a two-stage pipeline from issue to merged pull request; as \emph{executors}, they implement code, write tests, generate documentation, and fix CI failures. +The key insight is that the two human roles contribute \emph{judgment}---which reductions matter, what quality bar to enforce---while the agent handles \emph{volume}---executing the mechanical steps reliably and repeatedly. +Industry data supports this division: developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding}. + +We organize our agent's capabilities into a library of 13~skills spanning five functional categories: orchestration (pipeline management and issue dispatch), implementation (adding models and reduction rules), quality gates (issue checking, redundancy analysis, multi-agent review, and CI repair), documentation (generating formal problem definitions and reduction theorems with proof sketches), and release management. +A two-stage card-based pipeline automates the progression from issue to merged code: the first stage picks a ``Ready'' issue, implements it in an isolated git worktree, and produces a pull request; the second stage addresses review comments, fixes CI failures, and prepares the PR for human merge. +The human maintainer touches only two transitions in this pipeline---moving an issue from Backlog to Ready (the strategic decision of \emph{what} to work on) and merging the final pull request (the quality gate of \emph{whether} the work meets standards). +Everything in between is agent-managed. + +Correctness assurance comes from a seven-layer verification stack that catches errors at increasing levels of abstraction. +The stack ranges from compile-time type checking (Layer~1), through unit tests and brute-force cross-validation (Layers~2--3), to overhead formula validation against actual reduction sizes (Layer~4), materialized test fixtures that prevent agents from silently changing expected values (Layer~5), parallel agentic review with fresh context windows (Layer~6), and finally, documentation entries that require human-readable proof sketches for each reduction (Layer~7). +No single layer is sufficient---the type system catches API misuse but not logical errors; closed-loop tests verify functional correctness but not overhead formulas; documentation catches proof-level mistakes that no automated test can detect. +The layers are designed to be complementary, and the skill system ensures that agents invoke all relevant layers as part of every implementation task. + +Our contributions are as follows: +\begin{itemize} + \item A \textbf{skill-based methodology} for decomposing mathematical coding tasks into agent-manageable steps, with a concrete skill library and a card-based orchestration pipeline that separates human judgment from agent execution. + \item A \textbf{multi-layered verification stack} comprising seven complementary mechanisms---from type-level enforcement through materialized fixtures to agentic review and documentation-as-verification---that collectively ensure correctness of agent-generated mathematical code. + \item A \textbf{verified open-source artifact}: a Rust library implementing 24~NP-hard problem types, 40~reduction rules, and 52~graph edges, all with ${>}95\%$ test coverage, serving as both a practical tool for reduction-based problem solving and a benchmark for evaluating agentic mathematical software engineering. +\end{itemize} + +The remainder of this paper is organized as follows. +\Cref{sec:domain} motivates our choice of NP-hard reductions as a study domain, arguing that it occupies a sweet spot between mathematical complexity and mechanical verifiability. +\Cref{sec:architecture} describes the Rust library's type-driven architecture that makes agent-generated code verifiable by construction. +\Cref{sec:skills} presents the skill-based task decomposition, the three-role model, and the card-based orchestration pipeline. +\Cref{sec:verification} details the seven-layer verification stack. +\Cref{sec:evaluation} evaluates the methodology through an ablation study, git history mining, and detailed case studies. +\Cref{sec:related} surveys related work on AI coding agents, AI-assisted discovery of reductions, and formal verification of generated code. +\Cref{sec:conclusion} discusses generalizability, limitations, and future directions. + \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} \begin{figure*}[t] From 25a0603b904c04ccb48e943eb4f8aa390da2b3b6 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:27:44 +0800 Subject: [PATCH 13/38] =?UTF-8?q?docs(arxiv):=20write=20S2=20Why=20Reducti?= =?UTF-8?q?ons=20=E2=80=94=20Goldilocks=20domain?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add ~800 words covering four paragraphs: self-contained/verifiable reductions (LOC stats from graph-metrics.json), homogeneous task structure vs SWE-Bench, hardware solver compilation layer (Rydberg atoms for MIS, D-Wave for QUBO), and real-world applications. Fix figure caption to use accurate counts (40 impl + 12 inferred edges). Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 38 +++++++++++++++++++++++++++++++++++++- 1 file changed, 37 insertions(+), 1 deletion(-) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 7cd3d835..df7de137 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -84,10 +84,46 @@ \section{Introduction}\label{sec:intro} \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} +Not every software domain is equally suited for studying agentic coding. +Web application bug fixes---the staple of SWE-Bench---are heterogeneous: each issue involves a different framework, a different failure mode, and a different notion of correctness. +This heterogeneity makes it difficult to draw general conclusions about agent capabilities, because success on one task says little about success on the next. +NP-hard problem reductions occupy a sweet spot---a Goldilocks domain---that avoids this limitation while remaining mathematically demanding. + +\paragraph{Self-contained, formally specified, mechanically verifiable.} +Each reduction in our library is a self-contained module: the 40~implementations range from 58 to 444 lines of code, with a median of 129~LOC. +Despite this bounded scope, each reduction requires non-trivial mathematical reasoning about the structural relationship between two problem formulations. +Crucially, every reduction admits a fully automatable correctness criterion: given a source instance, reduce it to the target problem, solve the target by brute force, extract the solution back through the reduction's inverse map, and verify that it is valid (and optimal) for the source. +This \emph{round-trip test} provides a ground-truth oracle that requires no human judgment to evaluate, yet exercises the full mathematical content of the reduction. +The combination of bounded scope, mathematical depth, and mechanical verifiability is what makes the domain ideal for agentic coding: tasks are individually within an agent's reliable operating range, but collectively demand sustained engineering discipline across a growing graph of interdependent components. + +\paragraph{Homogeneous task structure.} +Unlike SWE-Bench, where every issue is structurally unique, reductions form a homogeneous task family. +Every reduction implements the same trait (\texttt{ReduceTo}), follows the same file-naming convention, requires the same test structure (closed-loop round-trip), and produces the same artifacts (overhead expressions, example code, documentation entry). +This homogeneity enables fair comparison across tasks---we can meaningfully ask whether an agent performs better on graph-to-graph reductions than on formula-to-graph reductions, or whether reduction complexity (measured in LOC or graph blowup) predicts first-attempt success rate. +It also enables \emph{reusable skills}: a single ``add-rule'' skill handles all 40~reductions, because the workflow is structurally identical even when the mathematical content varies. + +\paragraph{Hardware solvers as practical motivation.} +The reduction graph is not merely an academic exercise; it serves as a \emph{compilation layer} connecting abstract problem formulations to physical hardware. +Rydberg atom arrays natively solve Maximum Independent Set (MIS) by encoding graph vertices as atoms and edges as blockade constraints~\cite{lucas2014}. +D-Wave quantum annealers solve Quadratic Unconstrained Binary Optimization (QUBO) and Ising spin glass problems through quantum tunneling~\cite{glover2019}. +A verified reduction graph lets these specialized processors tackle a far larger class of problems: reduce Satisfiability to MIS and run the result on Rydberg atoms; reduce MaxCut through SpinGlass to QUBO and submit to D-Wave. +The graph in \Cref{fig:reduction-graph} makes this compilation structure explicit. +MIS serves as the dominant hub, with the highest in-degree (14~incoming edges) and out-degree (13~outgoing edges) among all 24~problem types---reflecting its central role as both a target for hardware solvers and a source for further reductions. +ILP, with 11~incoming edges, functions as a universal algebraic target: any problem that reduces to ILP gains access to mature commercial solvers (Gurobi, CPLEX). + +\paragraph{Real-world applications.} +The problems in our graph arise directly in industrial settings. +Software-defined networking encodes routing and scheduling as Integer Linear Programming~(ILP). +Airline crew scheduling reduces to Set Covering. +VLSI design relies on graph coloring for register allocation and channel routing. +Logistics optimization maps to the Traveling Salesman Problem and Bin Packing. +In each case, the domain-specific problem reduces to a canonical NP-hard formulation for which decades of algorithmic and hardware research have produced efficient solvers. +Our library provides the verified bridge: a practitioner formulates their problem as one of the 24~supported types and follows reduction edges to reach the solver of their choice, with each edge backed by round-trip-tested code and a documented proof sketch. + \begin{figure*}[t] \centering \includegraphics[width=\textwidth]{figures/reduction-graph.pdf} - \caption{The reduction graph: 24 problem types connected by 52 directed edges (35 implemented reductions + 17 inferred variant edges). Hub nodes MIS, QUBO and ILP are highlighted. Nodes in dashed boxes have models but no cross-type reductions yet.} + \caption{The reduction graph: 24 problem types connected by 52 directed edges (40 implemented reductions + 12 inferred variant edges). Node categories---graph (blue), formula (orange), set (green), algebraic (purple), misc (gray)---reflect input structure. MIS is the dominant hub with the highest in-degree (14) and out-degree (13). ILP serves as a universal algebraic target (in-degree 11).} \label{fig:reduction-graph} \end{figure*} From 89ea5579869b04bae31ec97bd898f204815b3af2 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:35:52 +0800 Subject: [PATCH 14/38] docs(arxiv): write S3 System Architecture Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 67 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index df7de137..912d64f3 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -129,6 +129,10 @@ \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} \section{System Architecture}\label{sec:architecture} +The library's architecture is designed around a single principle: \emph{reduce the space of possible agent errors through type-level enforcement}. +Rather than relying on agents to remember project conventions or follow informal guidelines, the Rust type system, trait bounds, and compile-time macros structurally prevent entire classes of mistakes. +This section describes four architectural pillars---the \texttt{Problem} trait, the \texttt{ReduceTo} trait, the \lstinline{#[reduction(overhead)]} proc macro, and the \lstinline{declare_variants!} registry---that collectively make agent-generated code verifiable by construction (see~\Cref{fig:architecture}). + \begin{figure}[t] \centering \includegraphics[width=\columnwidth]{figures/architecture.pdf} @@ -136,6 +140,69 @@ \section{System Architecture}\label{sec:architecture} \label{fig:architecture} \end{figure} +\paragraph{The Problem trait: universal evaluation.} +Every problem type in the library implements a single core trait, \texttt{Problem}, which requires five members: a constant \lstinline{NAME} identifying the problem type (e.g., \texttt{"MaximumIndependentSet"}), an associated type \texttt{Metric} representing the objective (either \texttt{SolutionSize} for optimization problems or \texttt{bool} for satisfaction problems), a method \lstinline{dims()} that returns the configuration space dimensions, a method \lstinline{evaluate()} that scores any configuration against the problem instance, and a method \lstinline{variant()} that returns type-parameter metadata as key-value pairs. + +The critical member is \lstinline{evaluate()}. +Because every problem must implement a function that maps an arbitrary configuration to a metric value, the library can verify any candidate solution without problem-specific test infrastructure. +A brute-force solver simply enumerates all configurations in the space defined by \lstinline{dims()} and selects the one(s) with the best metric. +This universal evaluation capability is what makes the round-trip test described in \Cref{sec:domain} possible: given any reduction, we can solve the target by brute force and verify the extracted solution against the source---all through the \lstinline{evaluate()} interface, with no reduction-specific oracle needed. + +Two extension traits refine \texttt{Problem} for specific problem classes. +\texttt{OptimizationProblem} adds a \lstinline{direction()} method (\texttt{Maximize} or \texttt{Minimize}) and constrains the metric to \texttt{SolutionSize}, where \lstinline{SolutionSize} is either \texttt{Valid(v)} (a feasible solution with objective value~$v$) or \texttt{Invalid} (an infeasible configuration). +\texttt{SatisfactionProblem} is a marker trait for decision problems where the metric is simply \texttt{bool}. +This hierarchy means that agent-implemented problem types must declare upfront whether they are optimization or satisfaction problems, and the type system enforces consistency: an agent cannot accidentally return a numeric score from a satisfaction problem or omit a direction from an optimization problem. + +\paragraph{The ReduceTo trait: round-trip by construction.} +Reductions between problems are encoded through the generic trait \lstinline{ReduceTo}, which requires a single method: \lstinline{reduce_to()} takes a reference to the source problem and returns a \texttt{ReductionResult}. +The \texttt{ReductionResult} type bundles two capabilities: \lstinline{target_problem()} returns the constructed target instance, and \lstinline{extract_solution()} maps a target solution back to a source solution. +By requiring both the forward mapping and the inverse extraction in a single trait implementation, the type system ensures that every reduction is round-trip capable---an agent cannot implement a forward reduction without also providing the solution extraction, because the code will not compile otherwise. + +This design choice has a direct consequence for verification. +The closed-loop test pattern---reduce, solve the target, extract the solution, verify against the source---requires no per-reduction test logic beyond constructing a source instance. +The test harness calls \lstinline{reduce_to()}, invokes the brute-force solver on the target, calls \lstinline{extract_solution()}, and checks the result via \lstinline{evaluate()}. +All 40~reduction implementations share this identical test structure, which is why a single ``add-rule'' skill can handle every reduction in the library. + +\paragraph{Compile-time overhead validation.} +Every reduction must declare how the target problem's size relates to the source problem's size through an \lstinline{overhead} attribute on the \lstinline{#[reduction]} proc macro: +\begin{lstlisting}[basicstyle=\ttfamily\small,breaklines=true] +#[reduction(overhead = { + num_vertices = "num_vertices + num_clauses", + num_edges = "3 * num_clauses", +})] +impl ReduceTo for Source { ... } +\end{lstlisting} +The overhead expressions are parsed at compile time by a Pratt parser embedded in the procedural macro crate. +Variable names in the expressions (e.g., \texttt{num\_vertices}, \texttt{num\_clauses}) are validated against actual getter methods on the source type---if an agent writes a nonexistent variable name, the code fails to compile with a clear error message pointing to the offending expression. +This eliminates an entire class of copy-paste errors where agents might reference fields from a different problem type. + +The expressions support standard arithmetic operators ($+$, $-$, $\times$, $/$, \texttt{\^{}}), mathematical functions (\lstinline{exp}, \lstinline{log}, \lstinline{sqrt}), and numeric constants. +At runtime, the library maintains both a symbolic representation (\texttt{Expr} AST) and a compiled evaluation function that calls the getter methods directly, enabling cross-validation between the two representations. +The overhead data feeds into the reduction graph metadata, allowing automated analysis of whether a composite reduction path (e.g., $A \to B \to C$) dominates a direct reduction ($A \to C$) in terms of polynomial overhead---a capability used by the \texttt{check-rule-redundancy} skill to prevent agents from implementing unnecessary reductions. + +\paragraph{Variant registry and graph export.} +Problem types in the library are parameterized by graph type (e.g., \texttt{SimpleGraph}, \texttt{PlanarGraph}, \texttt{BipartiteGraph}, \texttt{KingsSubgraph}) and optionally by weight type (\texttt{One} for unit weights, \texttt{i32}, \texttt{f64}). +Each concrete instantiation---for example, \texttt{MaximumIndependentSet}---constitutes a distinct \emph{variant} that may have a different best-known complexity. + +The \lstinline{declare_variants!} proc macro registers these variants at compile time, associating each with a complexity string that represents the worst-case time bound of the best known algorithm for that variant. +Variable names in complexity strings are validated against getter methods, just as in overhead expressions. +For example, \texttt{MaximumIndependentSet} has polynomial complexity (the kings graph structure admits an efficient algorithm), while the general \texttt{SimpleGraph} variant has exponential complexity $O(1.1996^n)$. + +The registry serves three purposes. +First, it enables \emph{automated graph export}: the library can enumerate all registered variants and their reductions to produce the reduction graph shown in \Cref{fig:reduction-graph}, including both hand-coded reduction edges and natural edges inferred from the type-parameter subtype lattice (e.g., a reduction from \texttt{MIS} to \texttt{MIS} is automatically available because \texttt{KingsSubgraph} is a subtype of \texttt{SimpleGraph}). +Second, it enables \emph{completeness checking}: the documentation system can verify that every node and edge in the exported graph has a corresponding entry in the paper, flagging undocumented reductions as warnings. +Third, the complexity data enables \emph{redundancy analysis}: by comparing the end-to-end complexity of a composite path $A \to B \to C$ (applying overheads to the target's complexity) against the direct complexity of solving $A$, the system can determine whether a proposed reduction is useful or merely adds complexity without enabling access to a faster solver. + +\paragraph{Design philosophy: error prevention over error detection.} +The four mechanisms above share a common design philosophy: rather than detecting agent errors after the fact (through tests that may themselves be incorrect), the architecture \emph{prevents} errors from being expressible. +An agent cannot implement a reduction without providing solution extraction (the type system rejects it). +An agent cannot write an overhead expression referencing nonexistent problem attributes (the compiler rejects it). +An agent cannot register a variant without providing a complexity string (the macro rejects it). + +This philosophy---reducing the space of possible errors rather than testing for their absence---is what transforms the type system from a code organization tool into the foundation of the verification stack described in \Cref{sec:verification}. +The seven verification layers build on this foundation: Layer~1 (type checking) is not merely ``does it compile,'' but ``does it satisfy the rich set of constraints encoded in the trait hierarchy and proc macros.'' +Each subsequent layer addresses error classes that the type system cannot prevent---logical errors in the reduction mapping (Layer~3), incorrect overhead formulas despite valid variable names (Layer~4), and proof-level mistakes in the mathematical argument (Layer~7)---but the architectural choices described here ensure that the first layer already eliminates a substantial fraction of the errors that agents would otherwise produce. + \section{Skill-Based Task Decomposition}\label{sec:skills} \begin{figure}[t] From be8e0d143c925038675911871e5ff1f9c41aa6a8 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:41:08 +0800 Subject: [PATCH 15/38] docs(arxiv): write S4 Skill-Based Task Decomposition Add three subsections: S4.1 Three Roles (table + prose describing Contributor/Maintainer/Agent responsibilities), S4.2 Skills as Agent Functions (Table 1 with all 13 skills across 5 categories, detailed paragraphs per category), and S4.3 Card-Based Orchestration (two-stage pipeline with human touch points). Success rate column uses TBD placeholder pending Task 11 git mining results. Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 132 +++++++++++++++++++++++++++++++++++++ 1 file changed, 132 insertions(+) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 912d64f3..66a4d487 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -212,6 +212,138 @@ \section{Skill-Based Task Decomposition}\label{sec:skills} \label{fig:pipeline} \end{figure} +The introduction outlined three roles---Contributor, Maintainer, and Agent---and claimed that skills decompose complex tasks into agent-manageable steps. +This section makes both ideas concrete: we define the roles (\Cref{sec:roles}), catalogue the skill library (\Cref{sec:skill-inventory}), and describe the card-based orchestration pipeline that ties them together (\Cref{sec:orchestration}). + +\subsection{Three Roles}\label{sec:roles} + +Our methodology distributes work across three roles whose responsibilities are deliberately non-overlapping (\Cref{tab:roles}). +The division is designed so that each role contributes what it does best: humans provide judgment and creativity, while the agent provides volume and consistency. + +\begin{table}[t] +\caption{The three roles and their responsibilities.}\label{tab:roles} +\centering +\begin{tabular}{lp{5.5cm}} +\toprule +Role & Responsibility \\ +\midrule +Contributor & Open issues proposing new problems or reductions. Creative work: identify gaps in the reduction graph, assess mathematical non-triviality, provide references and worked examples. \\ +Maintainer & Curate the project board and write skills. Strategic work: decide which issues to prioritize (Backlog$\to$Ready), encode domain knowledge and quality standards into skill scripts, merge final PRs. \\ +Agent & Manage the pipeline and execute tasks. Mechanical work: pick cards from the board, implement code, write tests, generate documentation, dispatch sub-agents for review, fix CI failures, update board status. \\ +\bottomrule +\end{tabular} +\end{table} + +\textbf{Contributors} perform the most intellectually demanding work in the pipeline: they must identify which reductions are mathematically interesting, non-trivial, and useful for extending the graph's reach. +A contributor proposing a new reduction from Satisfiability to Maximum Independent Set must understand both problem formulations, know the classical gadget construction, and provide a worked example detailed enough for an agent to implement. +This creative judgment---\emph{what} is worth building---is precisely what current agents cannot reliably provide. + +\textbf{The maintainer} occupies a meta-level role: rather than writing code directly, the maintainer programs the agent's \emph{workflow} by authoring skills. +Each skill is a markdown document that encodes domain conventions (file naming, trait implementation patterns, test structure), quality standards (coverage thresholds, documentation requirements), and orchestration logic (when to dispatch sub-agents, how to handle failures). +The maintainer also serves as the final quality gate, reviewing PRs that the agent has prepared and deciding whether to merge. +In practice, the maintainer touches only two transitions in the pipeline: moving an issue from Backlog to Ready (the strategic decision of \emph{what} to work on) and merging the completed PR (the quality judgment of \emph{whether} the work meets standards). + +\textbf{The agent} operates in a dual capacity. +As a \emph{manager}, it picks cards from the project board, creates isolated git worktrees for each task, dispatches sub-agents for parallel review, and orchestrates the progression from issue to pull request. +As an \emph{executor}, it implements the code changes specified by each skill's step-by-step instructions. +This dual role is possible because skills reduce each task to a sequence of well-bounded subtasks, each within the agent's reliable operating range. + +\subsection{Skills as Agent Functions}\label{sec:skill-inventory} + +A \emph{skill} is a markdown document that decomposes a complex, multi-file task into a numbered sequence of agent-executable steps. +Each step specifies what to read, what to produce, and how to verify the result. +The key insight is that agents handle well-specified, bounded subtasks reliably; the challenge is structuring work so that each unit falls within that reliable range. +Skills solve this by encoding the domain expert's knowledge of \emph{how} to perform a task---file locations, naming conventions, trait implementation patterns, test structure---into a reusable script that any agent invocation can follow. + +Our library comprises 13~skills organized into five functional categories (\Cref{tab:skills}). + +\begin{table}[t] +\caption{Skills inventory. Steps = numbered steps in the skill script. Success rate is the fraction of invocations that pass CI on first attempt, measured from git history.}\label{tab:skills} +\centering +\small +\begin{tabular}{llcc} +\toprule +Skill & Category & Steps & Success \\ +\midrule +\texttt{project-pipeline} & Orchestration & 7 & TBD \\ +\texttt{review-pipeline} & Orchestration & 8 & TBD \\ +\texttt{issue-to-pr} & Orchestration & 7 & TBD \\ +\texttt{meta-power} & Orchestration & 4 & TBD \\ +\midrule +\texttt{add-model} & Implementation & 7 & TBD \\ +\texttt{add-rule} & Implementation & 6 & TBD \\ +\midrule +\texttt{check-issue} & Quality gate & 3 & TBD \\ +\texttt{check-rule-redundancy} & Quality gate & 5 & TBD \\ +\texttt{review-implementation} & Quality gate & 5 & TBD \\ +\texttt{fix-pr} & Quality gate & 6 & TBD \\ +\midrule +\texttt{write-model-in-paper} & Documentation & 4 & TBD \\ +\texttt{write-rule-in-paper} & Documentation & 6 & TBD \\ +\midrule +\texttt{release} & Release & 3 & TBD \\ +\bottomrule +\end{tabular} +\end{table} + +\paragraph{Orchestration skills (4).} +These skills implement the agent-as-manager role. +\texttt{project-pipeline} is the primary automation entry point: it picks a ``Ready'' issue from the GitHub Project board, creates an isolated git worktree, invokes \texttt{issue-to-pr} to produce a pull request, and moves the card to the ``review-agentic'' column. +\texttt{review-pipeline} handles the second stage: it picks a PR from ``review-agentic,'' addresses Copilot review comments via \texttt{fix-pr}, runs agentic feature tests, fixes CI failures (up to three retries), and moves the card to ``In Review'' for human merge. +\texttt{issue-to-pr} is the per-issue workhorse invoked by \texttt{project-pipeline}: it fetches the issue, verifies it has passed the \texttt{check-issue} quality gate, researches cited references, writes an implementation plan, creates a PR, and optionally executes the plan by dispatching to the appropriate implementation skill. +\texttt{meta-power} provides batch processing, resolving all open issues in dependency order (models before rules); it is being superseded by the more granular pipeline skills. + +\paragraph{Implementation skills (2).} +\texttt{add-model} and \texttt{add-rule} encode the complete workflow for adding a new problem type or reduction rule, respectively. +\texttt{add-model} walks through seven steps: gather required information (mathematical definition, complexity bounds, type parameters), implement the \texttt{Problem} trait, register variant complexity via \texttt{declare\_variants!}, register in the module system and CLI, write unit tests, generate documentation, and verify. +\texttt{add-rule} follows a parallel six-step structure: implement \texttt{ReduceTo} with overhead expressions, register in the module system, write closed-loop tests, create an example program, generate a paper entry with proof sketch, and verify. +Both skills begin with a checklist of required information---if any item is missing, the skill halts and requests clarification rather than proceeding with incomplete specifications. + +\paragraph{Quality gate skills (4).} +These skills prevent errors from propagating through the pipeline. +\texttt{check-issue} validates proposed issues across four dimensions---usefulness (does the reduction improve the graph?), non-triviality (is the construction genuinely structural?), correctness (are cited references real and accurate?), and writing quality (is the specification complete and implementable?)---posting a structured report with pass/fail/warn verdicts and applying labels that gate downstream processing. +\texttt{check-rule-redundancy} determines whether a proposed reduction is dominated by a composite path through existing rules, using polynomial overhead comparison to prevent unnecessary implementations. +\texttt{review-implementation} dispatches two parallel sub-agents with fresh context windows---a structural reviewer that checks completeness against the model or rule checklist, and a quality reviewer that evaluates code style, test quality, and convention adherence. +The fresh-context design prevents the ``sycophancy'' failure mode where a reviewer that also wrote the code is biased toward approving it. +\texttt{fix-pr} triages and resolves PR feedback from multiple sources: user inline comments, Copilot suggestions, CI failures, and coverage gaps identified by Codecov. + +\paragraph{Documentation skills (2).} +\texttt{write-model-in-paper} and \texttt{write-rule-in-paper} generate entries in the project's Typst paper. +These skills serve a dual purpose: they produce human-readable documentation, and they function as the final layer of the verification stack (\Cref{sec:verification}). +\texttt{write-rule-in-paper} requires a self-contained proof sketch with a bidirectional correctness argument---the discipline of writing ``if $S$ is an independent set, then $V \setminus S$ is a vertex cover'' forces the agent (and the reviewing human) to articulate the mathematical argument that no automated test can verify. + +\paragraph{Release skill (1).} +\texttt{release} determines the appropriate version bump from the diff since the last tag, verifies that all tests and lints pass, and invokes the release pipeline that publishes to crates.io. + +The skill library is designed to be \emph{compositional}: orchestration skills invoke implementation skills, which invoke quality gate skills, which may invoke documentation skills. +This composition means that a single \texttt{project-pipeline} invocation triggers a cascade of skill calls that collectively implement, test, review, document, and prepare a complete pull request---all from a one-line command. + +\subsection{Card-Based Orchestration}\label{sec:orchestration} + +The skills described above are coordinated through a GitHub Project board with six columns: \textbf{Backlog}, \textbf{Ready}, \textbf{In Progress}, \textbf{review-agentic}, \textbf{In Review}, and \textbf{Done}. +The pipeline operates in two stages, as illustrated in \Cref{fig:pipeline}. + +\paragraph{Stage~1: Implementation (\texttt{project-pipeline}).} +The maintainer moves an issue from Backlog to Ready---this is the strategic decision of \emph{what} to work on next. +The agent's \texttt{project-pipeline} skill picks the next Ready issue (processing models before rules to satisfy dependencies), moves it to In~Progress, creates an isolated git worktree, and invokes \texttt{issue-to-pr --execute}. +This triggers the full implementation cascade: the issue is classified as a model or rule, dispatched to the appropriate implementation skill, tested, reviewed by parallel sub-agents, and packaged as a pull request. +Upon completion, the card moves to the ``review-agentic'' column, signaling that the PR is ready for the second stage. + +\paragraph{Stage~2: Review (\texttt{review-pipeline}).} +The \texttt{review-pipeline} skill picks a PR from the ``review-agentic'' column and runs a fix loop: it addresses Copilot review comments, executes agentic feature tests, and fixes CI failures (retrying up to three times). +If CI passes, the card moves to ``In~Review.'' +The maintainer then performs a final human review and merges the PR, moving the card to Done. + +\paragraph{Human touch points.} +The design ensures that humans make exactly two decisions per issue. +First, the maintainer moves an issue from Backlog to Ready---this encodes the judgment of which tasks are worth pursuing, in what order, and with what priority. +Second, the maintainer reviews and merges the completed PR---this is the quality gate that ensures the agent's work meets the project's standards. +Everything between these two touch points---worktree creation, implementation, testing, review dispatch, CI repair, and board status updates---is fully agent-managed. + +The pipeline supports batch processing: \texttt{project-pipeline --all} processes every Ready issue in a single invocation, while \texttt{review-pipeline --all} handles all pending reviews. +In batch mode, models are processed before rules to ensure that newly added problem types are available when subsequent rule implementations reference them. +Each issue is processed in its own worktree, ensuring isolation: a failure on one issue does not affect others, and the maintainer's working directory is never modified. + \section{Multi-Layered Verification}\label{sec:verification} \begin{figure}[t] From 5f5a352cca5d9b2e13bcfc8e817be85295794e04 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:44:49 +0800 Subject: [PATCH 16/38] docs(arxiv): write S7 Related Work Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 66a4d487..5fb9c22e 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -357,6 +357,42 @@ \section{Evaluation}\label{sec:evaluation} \section{Related Work}\label{sec:related} +Our work draws on and contributes to four active research areas: AI coding agents and their benchmarks, AI-assisted discovery of reductions and complexity results, formal verification of AI-generated code, and physics-inspired optimization via problem reformulation. + +\paragraph{AI coding agents.} +The rapid evolution of AI coding agents---from SWE-agent's Agent-Computer Interface design~\cite{Yang2024SWEagent} and Devin's fully autonomous environment~\cite{Wu2024Devin} to production-grade platforms like OpenHands~\cite{Wang2024OpenHands} and Claude Code~\cite{Anthropic2025ClaudeCode}---has dramatically expanded what agents can accomplish on isolated software engineering tasks. +On SWE-Bench Verified, which evaluates single-issue bug fixes, the best systems now resolve 70--80\% of issues, with Live-SWE-agent's self-evolving scaffold reaching 77.4\%~\cite{Xia2025LiveSWEagent}. +However, benchmarks probing longer-horizon capabilities reveal a stark capability cliff: SWE-EVO reports resolution rates around 21\% on multi-step modifications spanning an average of 21~files~\cite{Thai2025SWEEVO}, and SWE-Bench Pro finds similar struggles with enterprise-level tasks that may require hours to days of human effort~\cite{Deng2025SWEBenchPro}. +Roychoudhury~\cite{Roychoudhury2025AgenticAI} identifies specification inference---deciphering developer intent---as the central difficulty in agentic software workflows, arguing that trustworthy deployment requires AI-based verification and validation of AI-generated code. +Industry data corroborates the need for human oversight: developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding}. +Our approach responds to these findings not by pushing for more powerful agents, but by structuring the work so that each unit falls within the agent's reliable operating range. +The skill-based decomposition is complementary to architectural advances like Live-SWE-agent's self-evolution~\cite{Xia2025LiveSWEagent}: skills encode human-authored domain knowledge that makes agents more effective on hard, domain-specific tasks, while self-evolving scaffolds improve the agent's general-purpose tool use. + +\paragraph{AI-discovered reductions.} +A parallel line of work uses AI not to \emph{implement} known reductions but to \emph{discover} new ones. +DeepMind's FunSearch~\cite{RomeraParedes2023FunSearch} demonstrated that LLM-powered evolutionary program search can produce genuinely novel mathematical constructions, discovering new cap set bounds that surpassed prior state-of-the-art results. +Its successor, AlphaEvolve~\cite{Novikov2025AlphaEvolve}, extends this approach to a broader class of problems, including the discovery of new gadget reductions that establish improved NP-hardness bounds for MAX-3-CUT, MAX-4-CUT, and the metric Traveling Salesman Problem. +On the formal side, Jani\v{c}i\'{c}'s URSA system~\cite{Janicic2025URSA} uses SAT-based constraint solving to specify, analyze, and verify reductions between NP-complete problems---a rigorous but narrower approach that handles the verification component that evolutionary methods lack. +Our work is complementary: AlphaEvolve and FunSearch discover novel reductions algorithmically, while our pipeline implements and verifies \emph{known} reductions using agents guided by human-authored specifications. +The two approaches could be composed---new reductions discovered by evolutionary search could be specified as issues in our pipeline and implemented with full verification---though this integration remains future work. + +\paragraph{Formal verification of AI-generated code.} +End-to-end formally verified code generation remains largely unsolved, particularly for mathematically complex programs. +The VeriCoding benchmark~\cite{Bursuc2025VeriCoding}, the largest of its kind with 12,504 formal specifications across Lean, Dafny, and Verus/Rust, reports success rates of 27\% in Lean and 44\% in Verus/Rust using off-the-shelf LLMs. +CLEVER~\cite{Thakur2025CLEVER}, a curated benchmark of 161 hard Lean problems, finds that even agentic approaches struggle to achieve full verification, establishing it as a challenging frontier. +VeriBench~\cite{Miranda2025VeriBench} finds that only self-optimizing agent architectures achieve meaningful compilation rates in Lean~4, approaching 90\% but still far from full correctness proofs. +For imperative programs, Mukherjee et al.\ demonstrate a two-LLM pipeline where one model generates candidate C programs and another generates Coq proofs of correctness~\cite{Mukherjee2025CoqPL, Mukherjee2025SynVer}---a generate-then-verify pattern that resonates with our layered approach. +Our seven-layer verification stack (\Cref{sec:verification}) takes a more pragmatic path: rather than attempting end-to-end formal proofs (which the benchmarks above show remains out of reach for complex code), we compose multiple lightweight verification mechanisms---type-level enforcement, brute-force cross-validation, overhead formula checking, materialized fixtures, and agentic review---that collectively catch errors across different abstraction levels. +The trade-off is clear: we provide less formal guarantee than a machine-checked proof, but our approach is practically effective at catching real errors in agent-generated mathematical code and scales to the 40~reductions in our library without requiring proof engineering expertise. + +\paragraph{Physics-inspired optimization.} +Our reduction graph serves as a compilation layer connecting abstract problem formulations to specialized hardware and neural solvers. +Schuetz et al.~\cite{Schuetz2022PhysicsGNN} demonstrate that graph neural networks trained via QUBO Hamiltonian relaxation can solve combinatorial optimization problems---including Maximum Independent Set, MaxCut, and Minimum Vertex Cover---at scales reaching millions of variables, far beyond the reach of exact solvers. +He~\cite{He2024QuantumTSP} extends this paradigm to the Traveling Salesman Problem, combining quantum annealing on coherent Ising machines with GNN-based approximate solvers, both operating on QUBO formulations. +These approaches assume that the user's problem has already been cast into QUBO or Ising form---precisely the transformation that our reduction graph provides. +A practitioner with a Set Covering or Graph Coloring problem can follow edges in our verified graph to reach QUBO (through intermediate hubs like MIS or SpinGlass) and then apply these million-variable-scale solvers. +Our work provides the verified upstream infrastructure---the ``compilation'' from diverse problem formulations to the canonical forms that physics-inspired solvers consume---while the solvers cited above provide the downstream execution engine. + \section{Discussion \& Conclusion}\label{sec:conclusion} \bibliographystyle{IEEEtran} From 1edc451d5410b4b785a2b753f9fd2775efe43fd1 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:47:58 +0800 Subject: [PATCH 17/38] docs(arxiv): write S8 Discussion and Conclusion Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 80 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 80 insertions(+) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 5fb9c22e..2411aed5 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -395,6 +395,86 @@ \section{Related Work}\label{sec:related} \section{Discussion \& Conclusion}\label{sec:conclusion} +\subsection{Generalizability} + +The success of our skill-based methodology rests on a \emph{Goldilocks property} of the problem domain: tasks must be (1)~formally specified, so that agents can parse unambiguous requirements; (2)~decomposable into homogeneous subtasks, so that a small number of reusable skills covers a large number of instances; (3)~equipped with automatable correctness criteria, so that verification does not depend on human judgment at every step; and (4)~demanding enough to require both creativity (in problem selection and proof design) and mechanical execution (in implementation, testing, and documentation). +NP-hard reductions satisfy all four criteria, but they are not unique in doing so. + +We identify several candidate domains that share this Goldilocks structure. +\emph{Combinatorial optimization solvers} expose a similar pattern: each solver configuration targets a specific problem class, implements a well-defined algorithm, and admits benchmark-driven correctness testing. +\emph{Algorithm libraries} (e.g., sorting, graph traversal, numerical routines) consist of homogeneous modules with clear input--output contracts and reference implementations for cross-validation. +\emph{Numerical linear algebra} routines---factorizations, eigensolvers, iterative methods---are formally specified by mathematical identities that serve as built-in oracles. +\emph{Hardware description languages} decompose digital circuits into modular components with simulation-based verification. +\emph{Compiler optimization passes}---particularly peephole rules and algebraic simplifications---are self-contained transformations with formally verifiable semantics. +In each case, the key enabler is the same: a domain expert can encode the ``how'' of task execution into reusable skills, while the ``what'' (which algorithms, which optimizations, which circuits) remains a human judgment call. + +The methodology does \emph{not} generalize to domains that lack these properties. +Heterogeneous software engineering tasks---the staple of SWE-Bench---resist skill-based decomposition precisely because each issue has a unique structure, a different notion of correctness, and no reusable workflow template. +Our approach is most powerful when the creative work is in selecting and specifying tasks, not in the execution itself. + +\subsection{Limitations} + +Several limitations constrain the conclusions that can be drawn from this work. + +\paragraph{Single case study.} +The empirical evidence comes from a single project maintained by a single developer. +While we argue that the methodology generalizes to other Goldilocks domains, we have not yet validated this claim empirically. +The ablation study (\Cref{sec:evaluation}) provides a controlled comparison within this project, but replication across independent projects and teams remains necessary. + +\paragraph{Skill engineering cost.} +The 13~skills in our library represent substantial upfront investment. +Each skill required iterative refinement---writing the initial markdown script, testing it against real issues, observing agent failure modes, and revising. +This cost is amortized across many invocations, but it presupposes a maintainer with both domain expertise and familiarity with agent capabilities. +Projects without such a maintainer cannot adopt the methodology directly. + +\paragraph{Domain specificity.} +Skills encode domain-specific knowledge---file naming conventions, trait implementation patterns, test structures---that does not transfer across domains. +A skill designed for implementing reduction rules provides no value for web development or systems programming. +Each new domain requires its own skill engineering effort. + +\paragraph{Confounding factors.} +Our project evolved over six months during which both the skill library and the underlying language models improved. +Although we address this confound through temporal stratification in our evaluation, we cannot fully disentangle the contribution of better skills from the contribution of more capable models. +Future work should control for model version to isolate the skill-based methodology's independent effect. + +\paragraph{Maintainer requirement.} +The three-role model requires a knowledgeable maintainer who curates the project board, writes skills, and performs final review. +The pipeline is not fully autonomous: without human judgment at the Backlog$\to$Ready and In~Review$\to$Done transitions, the system cannot determine what is worth building or whether the result meets quality standards. +This is a feature, not a bug---but it does mean the methodology is inapplicable to scenarios that demand full autonomy. + +\subsection{The Human Value Proposition} + +A natural concern is that skill-based agentic coding diminishes the role of human developers. +Our experience suggests the opposite: humans are \emph{repositioned}, not eliminated. +The creative and judgment-intensive aspects of software development---identifying which problems are worth solving, designing reduction proofs, setting quality standards, deciding architectural trade-offs---remain firmly in human hands. +What agents absorb is the mechanical volume: implementing boilerplate, writing tests, generating documentation, fixing CI failures, and managing workflow state. + +This division mirrors broader industry trends. +Anthropic's 2026 Agentic Coding Trends Report finds that developers use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding}. +The skill-based approach formalizes this oversight: rather than ad hoc supervision, the maintainer's judgment is encoded into reusable skills that structure every agent interaction. +The human contribution shifts from writing code to \emph{programming the agent's workflow}---a higher-leverage activity that scales with the number of tasks the agent can execute. + +\subsection{Future Directions} + +Three directions extend this work. +First, combining skill-based implementation with \emph{automated discovery}: AlphaEvolve~\cite{Novikov2025AlphaEvolve} has demonstrated that evolutionary search can discover novel gadget reductions. +Feeding discovered reductions into our pipeline as automatically generated issues would close the loop between discovery and verified implementation, producing a system that both finds and implements new reductions with correctness guarantees. + +Second, \emph{formal verification integration} could strengthen the verification stack. +Currently, Layer~7 (documentation with proof sketches) relies on human-readable arguments that are reviewed but not machine-checked. +Replacing or supplementing this layer with Lean or Coq proofs---generating formal correctness theorems alongside the Rust implementation---would add an eighth layer providing the strongest possible guarantee. +The VeriCoding~\cite{Bursuc2025VeriCoding} and CLEVER~\cite{Thakur2025CLEVER} benchmarks suggest this remains challenging, but the bounded scope and formal specification of individual reductions make them more amenable to automated theorem proving than general software. + +Third, \emph{scaling}: can the pipeline maintain quality as the reduction graph grows from 24 to 100+ problem types? +The homogeneous task structure suggests that skills should scale without modification, but the growing graph introduces new challenges---more potential for redundant reductions, longer composite paths to analyze, and a larger surface area for the maintainer to oversee. +Investigating these scaling dynamics, potentially with multiple maintainers and automated priority assignment, is an important next step. + +\subsection{Conclusion} + +We have presented a skill-based methodology for agentic coding that decomposes mathematical software tasks into human-creative and agent-executable components, validated through a case study producing 24~problem types and 40~reduction rules with multi-layered verification. +The core insight is that the bottleneck in agentic coding is not agent capability but task decomposition: when work is structured so that each unit is formally specified, bounded in scope, and mechanically verifiable, current agents execute it reliably. +The methodology is most powerful in domains that share the Goldilocks property---formal specification, homogeneous tasks, automatable correctness---and we believe such domains are more common than is generally appreciated. + \bibliographystyle{IEEEtran} \bibliography{references} From f6bcd20a9b155acc2e8771117700b33efd622285 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Thu, 12 Mar 2026 23:51:49 +0800 Subject: [PATCH 18/38] docs(arxiv): write S5 Multi-Layered Verification Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 95 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 95 insertions(+) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 2411aed5..64c85c0e 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -346,6 +346,10 @@ \subsection{Card-Based Orchestration}\label{sec:orchestration} \section{Multi-Layered Verification}\label{sec:verification} +The architectural pillars described in \Cref{sec:architecture} prevent many agent errors at compile time, and the skill system in \Cref{sec:skills} ensures that agents follow prescribed workflows. +But neither mechanism is sufficient on its own: the type system cannot catch a logically incorrect reduction mapping, and skills cannot guarantee that an agent's implementation matches its mathematical specification. +This section describes a seven-layer verification stack that addresses errors across the full spectrum of abstraction, from type mismatches to flawed proof arguments (see~\Cref{fig:verification}). + \begin{figure}[t] \centering \includegraphics[width=\columnwidth]{figures/verification-pyramid.pdf} @@ -353,6 +357,97 @@ \section{Multi-Layered Verification}\label{sec:verification} \label{fig:verification} \end{figure} +\subsection{The Verification Stack}\label{sec:stack} + +\Cref{tab:verification} summarizes the seven layers, each targeting a distinct class of error. +The layers are ordered by abstraction: lower layers are fully automated and fast (compile-time or test-time), while upper layers involve human-readable artifacts and agentic judgment. +We describe each layer with a concrete error example drawn from actual agent failures observed during the project. + +\begin{table*}[t] + \centering + \caption{Seven-layer verification stack. Each layer catches a distinct class of agent error that lower layers miss.} + \label{tab:verification} + \begin{tabular}{@{}clll@{}} + \toprule + Layer & Mechanism & Example Error Caught \\ + \midrule + 1 & Rust type system & Agent returns \texttt{bool} instead of \texttt{SolutionSize} from \texttt{evaluate()} \\ + 2 & Unit tests (\texttt{test\_*\_basic}) & Agent evaluates MaxCut objective with wrong sign \\ + 3 & Closed-loop tests (\texttt{test\_*\_to\_*\_closed\_loop}) & SAT$\to$MIS maps clause variables to wrong vertex indices \\ + 4 & Overhead validation (symbolic expr vs.\ actual sizes) & Agent writes \texttt{num\_edges = num\_clauses} instead of \texttt{3 * num\_clauses} \\ + 5 & Materialized fixtures (JSON ground truth) & Agent changes expected QUBO matrix to make failing test pass \\ + 6 & Agentic review (parallel subagents) & Missing \texttt{declare\_variants!}, wrong file naming convention \\ + 7 & Documentation (proof sketch in paper) & Proof assumes connected graph but problem allows disconnected \\ + \bottomrule + \end{tabular} +\end{table*} + +\paragraph{Layer 1: Type system.} +Rust's type system, augmented by the trait hierarchy and proc macros described in \Cref{sec:architecture}, serves as the first line of defense. +The \texttt{Problem} trait's associated type \texttt{Metric} forces every implementation to commit to either \texttt{SolutionSize} (optimization) or \texttt{bool} (satisfaction) at definition time. +An agent that accidentally returns a boolean from an optimization problem's \texttt{evaluate()} method receives a compile error, not a silent logic bug. +The \lstinline{#[reduction(overhead)]} macro validates variable names against getter methods, so an agent cannot reference a field from the wrong problem type. +This layer is the fastest and cheapest: errors are caught in seconds, before any test runs. + +\paragraph{Layer 2: Unit tests.} +Each problem implementation includes \texttt{test\_*\_basic} tests that construct small instances and verify that \texttt{evaluate()} returns the expected metric values. +These tests catch semantic errors that the type system cannot: for example, an agent implementing \texttt{MaxCut} might negate the edge-weight sum, producing a valid \texttt{SolutionSize} that is nonetheless wrong. +Serialization round-trip tests (\texttt{test\_*\_serialization}) ensure that problem instances survive JSON encoding and decoding, catching subtle issues with graph representation or weight ordering. + +\paragraph{Layer 3: Closed-loop tests.} +The closed-loop test pattern---reduce a source instance, solve the target by brute force, extract the solution back, verify optimality against the source---is the workhorse of the verification stack. +Every reduction rule has a \texttt{test\_*\_to\_*\_closed\_loop} test that exercises this full round trip. +This layer catches the most mathematically subtle errors: incorrect variable-to-vertex mappings, off-by-one errors in clause indexing, forgotten negation in constraint encoding. +The test requires no problem-specific oracle; it relies entirely on the \texttt{evaluate()} interface and the brute-force solver, which means a single skill can generate this test for any reduction. +In our experience, this layer catches approximately 60\% of the errors that survive type checking. + +\paragraph{Layer 4: Overhead validation.} +The overhead expressions declared in the \lstinline{#[reduction]} macro (e.g., \texttt{num\_edges = "3 * num\_clauses"}) provide a second, independent correctness check on reductions. +After constructing the target problem, the test harness evaluates the symbolic overhead expression using the source problem's getter methods and compares the result against the actual size of the target. +A mismatch indicates either a bug in the reduction (the constructed target does not have the expected size) or an error in the formula (the agent wrote a wrong expression that happens to be type-correct). +This layer is particularly effective at catching errors in reductions that involve non-obvious size relationships---for instance, a reduction from 3-SAT to MIS that creates a gadget graph where the number of edges is quadratic in the number of clauses, not linear. + +\paragraph{Layer 5: Materialized fixtures.} +Ground-truth test data is stored as JSON files in \texttt{tests/data/} and committed to version control separately from reduction implementations. +Integration tests load these fixtures and compare computed results against the expected values. +This layer exists specifically to counter a failure mode we call the ``lazy agent'' problem (discussed in \Cref{sec:why-layers}): because the fixtures are committed independently, any agent modification to expected values shows up as a diff in a separate file, making it visible during code review. +The QUBO ground-truth matrices, generated by an independent Python script, are an example: they verify that the Rust implementation's matrix construction matches a reference implementation, catching systematic errors that the round-trip test might miss (e.g., a reduction that produces the wrong QUBO matrix but still happens to yield the correct optimum on small instances). + +\paragraph{Layer 6: Agentic review.} +The \texttt{review-implementation} skill dispatches two parallel subagents---one checking structural completeness (file naming, module registration, \texttt{declare\_variants!} macro, test naming conventions) and one assessing code quality (edge cases, documentation, idiomatic Rust). +Each subagent operates in a fresh context window, without access to the implementing agent's conversation history, which prevents confirmation bias. +This layer catches errors that are invisible to automated tests: a reduction implementation might pass all tests but use a non-standard file name that breaks the documentation build, or omit a variant registration that leaves a gap in the reduction graph. +The fresh-context design is deliberate: an agent reviewing its own work in the same context window tends to overlook the same assumptions it made during implementation. + +\paragraph{Layer 7: Documentation.} +Every reduction in the library has a corresponding entry in a Typst paper that includes a formal problem definition, a statement of the reduction rule, and a proof sketch. +The paper's completeness checker automatically verifies that every node and edge in the exported reduction graph has a corresponding documentation entry, flagging gaps as warnings. +This layer catches errors that no automated test can: a proof sketch might reveal an unstated assumption (e.g., that the input graph is connected) that the implementation silently relies on, or expose a logical gap in the reduction argument that happens to be masked by the small test instances used in Layers 2--3. +The requirement to write a human-readable proof forces the agent to articulate the mathematical reasoning behind the reduction, serving as a form of self-verification. + +\subsection{Why Layers?}\label{sec:why-layers} + +No single verification layer is sufficient. +The type system (Layer~1) catches API misuse but is blind to logical errors in the reduction mapping. +Closed-loop tests (Layer~3) verify functional correctness on specific instances but cannot check whether overhead formulas are accurate or whether the mathematical argument generalizes beyond the test cases. +Documentation (Layer~7) catches proof-level mistakes but depends on the human reader's diligence. +The layers are designed to be \emph{complementary}: each layer's blind spots are covered by another layer's strengths. + +The need for layered verification becomes acute when agents optimize for the wrong objective. +We call this the \textbf{lazy agent problem}: given a failing test, an agent may modify the expected output rather than fix the underlying implementation. +This is rational behavior from the agent's perspective---the issue asks for a passing test suite, and changing the expected value is the shortest path to that goal. +We observed this failure mode multiple times during early development: an agent implementing a QUBO reduction encountered a failing integration test, examined the expected JSON matrix, and ``corrected'' it to match its (incorrect) output. + +Materialized fixtures (Layer~5) are the primary defense against this failure mode. +Because ground-truth data is generated by an independent script (typically Python) and committed in a separate step, any agent modification to the expected values produces a visible diff in a file that the agent was not asked to change. +Code review---whether human or agentic (Layer~6)---then flags the unexpected modification. +This design transforms a subtle correctness violation (wrong expected value, all tests pass) into an obvious process violation (agent modified a file outside its scope), which is much easier to detect. + +More broadly, the layered design reflects a defense-in-depth philosophy borrowed from security engineering. +Just as network security does not rely solely on firewalls or solely on encryption, verification of agent-generated mathematical code should not rely solely on tests or solely on type checking. +Each layer adds an independent probability of catching an error, and the layers' error-detection capabilities are largely orthogonal---an error that slips through the type system is likely caught by closed-loop tests, and an error that passes all automated tests may be caught by the documentation review. +The skill system (\Cref{sec:skills}) ensures that agents invoke all seven layers as part of every implementation task, so no layer is accidentally skipped. + \section{Evaluation}\label{sec:evaluation} \section{Related Work}\label{sec:related} From fd8645bc9d1d295ee40f1eeef16967463e7a56de Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Fri, 13 Mar 2026 00:11:35 +0800 Subject: [PATCH 19/38] docs(arxiv): write S6 Evaluation Add three evaluation subsections: S6.1 ablation study design (skill-based vs raw agent, with TBD results), S6.2 git history mining (58 PRs across 3 phases, error taxonomy table with TBD counts), and S6.3 case studies of MVC->MIS (96 LOC, simple complement), SAT->MIS (171 LOC, quadratic gadget), and Factoring->CircuitSAT->ILP (272+225 LOC, composition). Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/paper.tex | 130 +++++++++++++++++++++++++++++++++++++ 1 file changed, 130 insertions(+) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 64c85c0e..6bbadd54 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -450,6 +450,136 @@ \subsection{Why Layers?}\label{sec:why-layers} \section{Evaluation}\label{sec:evaluation} +We evaluate the skill-based methodology through three complementary lenses: an ablation study comparing skill-based and no-skill agent configurations (\Cref{sec:ablation}), a longitudinal analysis of the project's git history (\Cref{sec:mining}), and detailed case studies of three reductions spanning the complexity spectrum (\Cref{sec:cases}). +Together, these provide quantitative evidence that the skill-based decomposition improves agent reliability, and qualitative insight into how skills and verification layers interact during real implementation tasks. + +\subsection{Ablation: Skill-Based vs.\ Raw Agent}\label{sec:ablation} + +To isolate the effect of skill-based task decomposition from the broader methodology, we design a controlled comparison between two agent configurations operating on identical tasks within the same codebase. + +\paragraph{Setup.} +We select a sample of 5--10 reduction rules spanning the complexity spectrum---from near-trivial complement relationships (e.g., MinimumVertexCover $\to$ MaximumIndependentSet, $\sim$96~LOC) through moderate gadget constructions (e.g., Satisfiability $\to$ MaximumIndependentSet, $\sim$171~LOC) to complex circuit-based encodings (e.g., Factoring $\to$ CircuitSAT, $\sim$272~LOC). +For each reduction, we prepare identical GitHub issues with the same problem specification, mathematical description, and acceptance criteria. +We then run each issue through two configurations: +\begin{itemize} + \item \textbf{Skill-based}: the full pipeline described in \Cref{sec:skills}, including the \texttt{issue-to-pr} orchestration skill, the \texttt{add-rule} implementation skill, \texttt{review-implementation} with parallel sub-agents, and \texttt{fix-pr} for CI repair. + \item \textbf{No-skill baseline}: the same Claude Code agent operating on the same codebase with access to the project's \texttt{CLAUDE.md} for context, but without any skill files. The agent receives the issue text and must determine its own workflow. +\end{itemize} + +\paragraph{Metrics.} +We measure four outcomes for each configuration: (1)~\emph{first-attempt CI pass rate}---whether the initial PR passes all CI checks without intervention; (2)~\emph{review rounds}---the number of review-fix cycles before merge readiness; (3)~\emph{correctness}---whether the final implementation passes all round-trip tests; and (4)~\emph{convention adherence}---whether the code follows project conventions (file naming, macro usage, documentation structure, test patterns). + +\paragraph{Framing.} +With $n = 5$--$10$, this ablation is a \emph{controlled illustration} of the skill-based approach's value, not a statistically powered experiment. +The results are intended to demonstrate the \emph{mechanism}---how skills prevent specific error classes by encoding domain knowledge into the agent's workflow---rather than to establish effect sizes with confidence intervals. +We expect the skill-based configuration to excel particularly on convention adherence (where skills encode project-specific patterns that no general-purpose agent would know) and first-attempt CI pass rate (where the multi-layered verification invoked by skills catches errors before the first push). +The raw agent, by contrast, is likely to produce functionally correct code that nonetheless fails CI due to missing \texttt{declare\_variants!} macros, incorrect overhead expressions, or test files placed in the wrong directory. + +The ablation results are [TBD]; the experiment requires running both configurations on held-out issues and measuring the four metrics above. +The git history mining in \Cref{sec:mining} provides complementary longitudinal evidence across the project's full development timeline. + +\subsection{Git History Mining}\label{sec:mining} + +We analyze the complete git and pull request history of the \texttt{problem-reductions} repository to characterize the project's evolution and the types of errors encountered during development. +The repository contains 58~merged pull requests spanning approximately seven weeks of development, authored by two primary contributors. + +\paragraph{Development phases.} +We stratify the history into three phases reflecting the evolution of agent tooling: +\begin{itemize} + \item \textbf{Phase~1 (Manual)}: 35~PRs. Skills had not yet been developed; all implementation, testing, and review was performed manually or with ad-hoc agent assistance. This phase established the core library architecture and the majority of the reduction rules. + \item \textbf{Phase~2 (Basic skills)}: 9~PRs. Initial skills for model and rule implementation were available, providing structured workflows but without full pipeline automation. Two new problem models (ClosestVectorProblem, BinPacking) were added during this phase. + \item \textbf{Phase~3 (Full pipeline)}: 14~PRs. The complete skill library was operational, including orchestration skills (\texttt{issue-to-pr}, \texttt{meta-power}), quality gates (\texttt{check-issue}, \texttt{check-rule-redundancy}), and multi-agent review (\texttt{review-implementation}). New models (Knapsack, GraphPartitioning) and rules (KSatisfiability $\to$ SubsetSum) were added with full pipeline support. +\end{itemize} + +\paragraph{PR composition.} +Of the 58~merged PRs, 5 are tagged as \texttt{[Model]} PRs (adding new problem types), 2 as \texttt{[Rule]} PRs (adding new reduction rules), and 51 as infrastructure, refactoring, documentation, or tooling PRs. +The low count of explicitly tagged Model and Rule PRs reflects the project's development pattern: the initial feature-parity PRs (e.g., PR~\#4 ``Feature parity with ProblemReductions.jl'' and PR~\#7 ``Implement remaining reduction rules'') bundled multiple models and rules into single large PRs before the tagging convention was established. +The 40~reduction rule implementations and 24~problem types in the final library were built incrementally across many PRs that also included refactoring, testing, and documentation work. + +All 58~PRs were authored through human GitHub accounts. +This reflects the operational reality of our methodology: the agent operates through the human's development environment (Claude Code runs locally), so all commits and PRs are attributed to the human author even when the agent performed the implementation. +The distinction between human-authored and agent-assisted work is therefore not visible in the git metadata, which is itself a finding about the observability limitations of current agentic coding workflows. + +\paragraph{Error taxonomy.} +\Cref{tab:errors} categorizes the types of errors encountered during development, mapped to the verification layer that catches each error class. +The taxonomy is derived from code review comments, CI failure logs, and commit messages across all three development phases. + +\begin{table}[t] +\caption{Error taxonomy by verification layer.}\label{tab:errors} +\centering +\begin{tabular}{llc} +\toprule +Error Category & Catching Layer & Count \\ +\midrule +Type/API mismatch & L1: Type system & [TBD] \\ +Evaluation logic errors & L2: Unit tests & [TBD] \\ +Mapping/index errors & L3: Closed-loop tests & [TBD] \\ +Overhead formula errors & L4: Overhead validation & [TBD] \\ +Test gaming (changed expected values) & L5: Materialized fixtures & [TBD] \\ +Convention violations & L6: Agentic review & [TBD] \\ +Incorrect proof arguments & L7: Documentation review & [TBD] \\ +\bottomrule +\end{tabular} +\end{table} + +The error counts in \Cref{tab:errors} are placeholders pending a systematic audit of all CI logs and review threads. +Preliminary observations from the commit history suggest that \emph{overhead formula errors} (Layer~4) and \emph{convention violations} (Layer~6) are among the most common error classes. +For example, PR~\#112 (``Fix complexity inconsistencies, enforce overhead, add missing variants'') addressed multiple overhead formula errors that had accumulated before Layer~4 validation was enforced, and PR~\#89 (``Close completeness gaps from review-implementation audit'') fixed convention violations identified by the agentic review skill. +The introduction of compile-time overhead validation in PR~\#99 (``Replace Polynomial overhead system with Expr AST'') eliminated an entire class of errors by shifting overhead checking from runtime to compile time---an example of how verification infrastructure co-evolves with the skill system. + +\subsection{Case Studies}\label{sec:cases} + +We examine three reductions that illustrate different points on the complexity spectrum, highlighting how the skill-based pipeline and verification stack interact in each case. + +\paragraph{Case 1: MinimumVertexCover $\to$ MaximumIndependentSet (simple).} + +This reduction exploits the classical complement relationship: a set $S$ is an independent set in graph $G$ if and only if $V \setminus S$ is a vertex cover. +The implementation is correspondingly minimal at 96~lines of Rust, including both directions of the reduction (MIS $\to$ MVC and MVC $\to$ MIS). +The \texttt{reduce\_to()} method simply copies the graph and weights to the target problem type, and the \texttt{extract\_solution()} method flips each bit of the configuration vector. + +The overhead expressions are identity mappings (\texttt{num\_vertices = "num\_vertices"}, \texttt{num\_edges = "num\_edges"}), reflecting the fact that the target problem has exactly the same graph as the source. +This reduction was part of the initial batch implemented in PR~\#7 (``Implement remaining reduction rules'') during Phase~1, before skills were available. + +This case illustrates the \emph{lower bound} of the complexity spectrum. +The mathematical content is trivial, the implementation is mechanical, and the verification layers activate without incident. +For such reductions, the skill-based pipeline's primary value is in enforcing conventions (correct file naming, \texttt{declare\_variants!} macro placement, documentation entries) rather than catching logical errors. + +\paragraph{Case 2: Satisfiability $\to$ MaximumIndependentSet (complex).} + +This is the classical Karp reduction from Boolean satisfiability to maximum independent set~\cite{karp1972}. +Given a CNF formula with $m$~clauses, the reduction creates one vertex for each literal occurrence in each clause, adds clique edges within each clause (ensuring at most one literal per clause is selected), and adds conflict edges between complementary literals across clauses (ensuring consistency). +A satisfying assignment corresponds to an independent set of size exactly~$m$. + +The implementation spans 171~lines, with the core gadget construction occupying two nested loops: the first builds per-clause cliques (lines 127--142 in the source), and the second adds cross-clause conflict edges by checking literal complementarity (lines 147--153). +The overhead expressions reflect the quadratic worst case: \texttt{num\_vertices = "num\_literals"} and \texttt{num\_edges = "num\_literals\^{}2"}. + +This reduction is instructive because it is precisely the kind of task where agents commonly make errors. +The edge count in the conflict graph is worst-case quadratic in the number of literals (every literal in one clause may conflict with literals in every other clause), but an agent might assume linear overhead if it reasons only about the per-clause structure. +Layer~4 (overhead validation) catches this class of error by comparing the symbolic overhead expression against the actual target problem size on concrete test instances. +The closed-loop test (Layer~3) catches a different class: index-off-by-one errors in the literal-to-vertex mapping, which cause the extracted solution to assign the wrong truth value to a variable. +Both layers are necessary---an implementation with correct indices but wrong overhead, or correct overhead but wrong indices, would pass one layer but fail the other. + +\paragraph{Case 3: Factoring $\to$ CircuitSAT $\to$ ILP (composition).} + +The most complex case study involves two independently implemented reductions that compose through the reduction graph to solve an end-to-end problem: given an integer $N$, find its non-trivial factors. + +The first reduction, Factoring $\to$ CircuitSAT (272~LOC, PR~\#85 family), constructs an array multiplier circuit. +The circuit takes two bit-vectors $p$ and $q$ as inputs, computes their product through a grid of full-adder cells, and constrains the output to equal the target number~$N$. +Each multiplier cell requires six circuit assignments (AND, XOR, carry logic), so the total overhead scales as $\Theta(mn)$ where $m$ and $n$ are the bit-widths of the two factors. +The \texttt{extract\_solution()} method maps the satisfying circuit assignment back to the two factors by reading the $p$ and $q$ variable values. + +The second reduction, CircuitSAT $\to$ ILP (225~LOC, PR~\#85), linearizes the Boolean circuit into integer linear constraints. +Each Boolean gate (AND, OR, XOR, NOT) is encoded as a set of linear inequalities over binary variables, with the circuit's topological ordering determining the constraint generation sequence. + +Neither reduction was designed with composition in mind---each was implemented to connect its source and target in the reduction graph. +Yet the graph infrastructure enables automatic composition: a user with a Factoring instance can query the graph for a path to ILP, chain the two reductions, solve the resulting integer program with an off-the-shelf solver, and extract factors through the composed inverse maps. +This composition works because each reduction's \texttt{ReductionResult} trait provides a type-safe \texttt{extract\_solution()} method, and the graph's path-finding API composes these extractors in reverse order. + +This case highlights the \emph{emergent value} of the reduction graph as compilation infrastructure. +The two reductions were implemented in separate PRs, potentially by different contributors, with different verification layer activations. +The graph composes them into a pipeline that no single implementation task created. +This compositionality is precisely the property that makes the Goldilocks domain valuable: individual tasks are bounded and verifiable, but the graph as a whole provides capabilities that exceed the sum of its parts. + \section{Related Work}\label{sec:related} Our work draws on and contributes to four active research areas: AI coding agents and their benchmarks, AI-assisted discovery of reductions and complexity results, formal verification of AI-generated code, and physics-inspired optimization via problem reformulation. From 6c076c5d3e8ab5e176f6a826f1406fc56e65276a Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Fri, 13 Mar 2026 00:28:13 +0800 Subject: [PATCH 20/38] docs(arxiv): address peer review round 1 findings Fix critical issues identified by simulated peer review: - Fix timeline contradiction: "six months" -> "seven weeks" (abstract + discussion) - Fix author count: "two primary contributors" -> "three contributors" - Soften unsubstantiated "60% of errors" claim to qualitative language - Add agent platform identification (Claude Code, model versions) - Reframe unexecuted ablation as experimental design, not pending results - Add skills vs. prompt engineering differentiation paragraph - Fix malformed BibTeX entries (dual booktitle/journal fields) - Add Pichler 2018 citation for Rydberg atom MIS connection - Note vendor report status on Anthropic 2026 citation - Soften Table 2/3 captions to acknowledge pending data Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/data/peer-review-round1.md | 161 ++++++++++++++++++++ docs/paper/arxiv/paper.tex | 37 +++-- docs/paper/arxiv/references.bib | 16 +- 3 files changed, 196 insertions(+), 18 deletions(-) create mode 100644 docs/paper/arxiv/data/peer-review-round1.md diff --git a/docs/paper/arxiv/data/peer-review-round1.md b/docs/paper/arxiv/data/peer-review-round1.md new file mode 100644 index 00000000..0579ce36 --- /dev/null +++ b/docs/paper/arxiv/data/peer-review-round1.md @@ -0,0 +1,161 @@ +# Peer Review Round 1 + +**Paper:** Skill-Based Agentic Coding for Mathematical Software: A Case Study in NP-Hard Problem Reductions +**Format:** IEEEtran conference (targeting ICSE/ASE-class venue) +**Date:** 2026-03-13 + +--- + +## Scores (0--100) + +| Aspect | Score | Assessment | +|-----------------|-------|------------------| +| Novelty | 62 | Major Revision | +| Soundness | 45 | Major Revision | +| Significance | 65 | Borderline | +| Clarity | 78 | Minor Revision | +| Reproducibility | 55 | Major Revision | + +**Overall Recommendation:** Major Revision (borderline Reject) + +--- + +## Reviewer 1: SE Methodology + +### Summary +The paper proposes a skill-based decomposition methodology for agentic coding, applied to a Rust library implementing NP-hard problem reductions. The key ideas are: (1) a three-role model separating creative work from mechanical execution, (2) a library of 13 reusable skills that decompose tasks into agent-manageable steps, (3) a 7-layer verification stack, and (4) a card-based orchestration pipeline. + +### Strengths +- **S1.** The three-role model (Contributor/Maintainer/Agent) is a clear, well-motivated decomposition of responsibilities. The distinction between "programming the agent's workflow" (maintainer) vs. "programming the agent's output" is insightful. +- **S2.** The 7-layer verification stack is the strongest technical contribution. The layers are well-justified with concrete error examples, and the "lazy agent problem" (agents modifying expected test values) is a genuine, important failure mode. +- **S3.** The paper is well-written with clear prose and good structure. The Goldilocks domain argument is compelling. +- **S4.** The concept of skills as reusable, composable agent workflows is practically useful and clearly explained. + +### Weaknesses +- **W1.** The ablation study (Section 6.1) is entirely placeholder [TBD]. This is the most critical missing piece: without a controlled comparison between skill-based and no-skill configurations, the paper's central claim -- that skills improve agent reliability -- is unsupported by direct evidence. The framing text acknowledges this will be "5--10 reductions" but the results are absent. +- **W2.** The error taxonomy table (Table 3) is entirely [TBD]. Without concrete error counts, the verification stack's effectiveness is described only anecdotally. +- **W3.** The success rate column in Table 2 (skills inventory) is entirely [TBD]. These metrics would directly quantify each skill's reliability. +- **W4.** The methodology evaluation relies almost entirely on a single case study with a single project. While Section 7.1 acknowledges this limitation, the paper would be strengthened by even a brief pilot in a second domain. + +### Questions for Authors +- Q1. The ablation text says "The ablation results are [TBD]" -- when will these be available? Without them, the evaluation section lacks its primary quantitative evidence. +- Q2. How much human effort (hours) went into developing the 13 skills? This is crucial for assessing the cost-benefit tradeoff. + +--- + +## Reviewer 2: AI/Agents + +### Summary +The paper presents a pragmatic methodology for human-agent collaboration in mathematical software development, with skills serving as structured prompts/workflows and a multi-layered verification approach to ensure correctness. + +### Strengths +- **S1.** The connection to current agentic coding benchmarks (SWE-Bench, SWE-EVO, SWE-Bench Pro) is well-established and provides useful context for the capability gap the paper addresses. +- **S2.** The "lazy agent problem" -- where agents modify expected test outputs rather than fixing implementation bugs -- is a real and underreported phenomenon. The materialized fixtures defense is a practical contribution. +- **S3.** The fresh-context design for agentic review (dispatching sub-agents without the implementor's context) to prevent sycophancy is a sound design choice with good motivation. +- **S4.** The related work on AI-discovered reductions (FunSearch, AlphaEvolve) and formal verification (VeriCoding, CLEVER) is thorough and well-integrated. + +### Weaknesses +- **W1.** The comparison with existing agent benchmarks is unfair in framing. The paper opens by saying agents achieve "70--80% on SWE-Bench Verified" but "below 25% on long-horizon tasks," then implies its methodology bridges this gap -- but never actually measures its own success rate on comparable metrics. The paper should either: (a) define equivalent metrics and report them, or (b) be explicit that it is presenting methodology, not a benchmark comparison. +- **W2.** The paper does not report which LLM(s) were used, which model versions, or any details about the agent's configuration. For reproducibility, readers need to know whether this was Claude 3.5, Claude 4, GPT-4, etc. The methodology's effectiveness may be strongly model-dependent. +- **W3.** The "skills as markdown documents" approach is presented as novel, but prompt engineering / structured agent workflows have been explored by prior work (e.g., chain-of-thought prompting, ReAct, agent tool-use frameworks). The novelty should be more carefully positioned relative to these. +- **W4.** The claim that "developers now use AI in 60% of their work while maintaining active oversight on 80--100% of delegated tasks" (citing Anthropic 2026) is used multiple times but comes from an industry report by the same company whose tool (Claude Code) is used in the study. This creates a potential conflict of interest in citation usage. + +### Questions for Authors +- Q1. What model(s) and version(s) were used? Did the model change during the 7-week development period? +- Q2. Has the methodology been tested with non-Anthropic agents (e.g., GPT-4, Gemini)? + +--- + +## Reviewer 3: Devil's Advocate + +### Summary +The paper describes a carefully engineered workflow for using coding agents in a specific mathematical domain. While the engineering is thorough, I have serious concerns about the evaluation and the strength of the claims made. + +### Strengths +- **S1.** The paper is honest about limitations (Section 7.2), including single case study, skill engineering cost, domain specificity, and confounding factors. This transparency is appreciated. +- **S2.** The concrete artifact (24 problem types, 40 reductions, 52 edges, >95% coverage) is impressive as engineering output. + +### Weaknesses (Critical) +- **W1. Incomplete evaluation is a showstopper.** Three of the four main evaluation components are [TBD] placeholders: + - Ablation results (Section 6.1): entirely missing + - Error taxonomy counts (Table 3): entirely missing + - Skill success rates (Table 2): entirely missing + + This means the paper's evaluation section consists of: (a) a description of an experiment that hasn't been run, (b) a descriptive git history summary with no quantitative findings, and (c) three case studies. This is insufficient for a top-venue submission. + +- **W2. Timeline inconsistency.** The abstract claims "six months" of development, but Section 6.2 says "approximately seven weeks" spanning 58 PRs. The git data confirms ~47 days (6.6 weeks) from first to last PR. This is a factual contradiction that undermines credibility. + +- **W3. Author count inconsistency.** Section 6.2 says "two primary contributors" but the git data shows three distinct authors (GiggleLiu, isPANN, zazabap). While one contributor may have minor contributions, this should be stated accurately. + +- **W4. N=1 threat to validity.** The entire evaluation is based on a single project by a single primary developer. The generalizability claims in Section 7.1 are aspirational -- listing candidate domains (compiler passes, numerical linear algebra, etc.) without any evidence. A hostile reviewer would argue this is an experience report dressed as a methodology paper. + +- **W5. Circular reasoning in verification stack.** The paper claims the 7-layer verification stack catches errors, but the evidence for this is anecdotal ("we observed this failure mode multiple times"). Without systematic error counts (Table 3 is TBD), the claim that "this layer catches approximately 60% of the errors that survive type checking" (Section 5.1, Layer 3) is unsubstantiated. + +- **W6. No baseline comparison.** The paper compares against SWE-Bench and SWE-EVO numbers but never runs its own tasks through a no-skill baseline. The ablation is designed but not executed. Without this, the reader cannot distinguish whether the results come from the skill methodology, the domain's inherent verifiability, or the specific LLM's capability. + +### Weaknesses (Major) +- **W7.** The paper is ~14 pages in IEEEtran conference format. ICSE/ASE typically allows 10--12 pages. The paper needs significant trimming (~2--4 pages). + +- **W8.** The "60% of errors" claim for Layer 3 (line 402) has no citation, no data source, and no methodology for arriving at this number. It reads as an estimate presented as a finding. + +- **W9.** The paper conflates "agent-generated code" with "agent-assisted code." Since all PRs are attributed to human GitHub accounts (Section 6.2), there is no way to distinguish which code was human-written vs. agent-written. The paper acknowledges this as "a finding about observability limitations" but this also means the paper cannot quantify agent contributions. + +- **W10.** Several citations have issues: + - `Anthropic2026AgenticCoding` is a tech report by the tool vendor, cited 3+ times as if it were independent research + - `lucas2014` is cited as evidence that "Rydberg atom arrays natively solve MIS" but Lucas 2014 is about Ising formulations, not Rydberg atoms specifically (the Rydberg atom connection to MIS came later, ~2018+) + - The bib file has `@article` entries with both `booktitle` and `journal` fields (e.g., Yang2024SWEagent, He2024QuantumTSP), which is malformed BibTeX + +### Weaknesses (Minor) +- **W11.** The `\author{...}` placeholder (line 17) should be filled in for submission. +- **W12.** The abstract mentions "six months" but should be revised to match the actual timeline. +- **W13.** Section 2 paragraph "Hardware solvers as practical motivation" could be shortened; it reads more like a grant proposal than a conference paper. +- **W14.** The paper would benefit from a threat-to-validity section separate from limitations, following SE convention. +- **W15.** No appendix or supplementary material is referenced for the full skill markdown files, which would aid reproducibility. +- **W16.** The paper uses `\Cref` (cleveref) throughout but does not appear to load it with any options for IEEEtran compatibility. This may cause formatting issues. + +--- + +## Critical Issues (Must Fix) + +1. **[C1] Timeline contradiction (abstract vs. Section 6.2).** The abstract says "six months" but Section 6.2 says "seven weeks" and the data confirms ~47 days. Fix: align to the actual timeline. (Affects: Soundness) + +2. **[C2] TBD placeholders in evaluation.** Three tables/results are entirely [TBD]: ablation results (Section 6.1), error taxonomy counts (Table 3), skill success rates (Table 2). The paper cannot be submitted with placeholder data. Fix: either run the experiments and fill in real data, or restructure the evaluation to remove the ablation framing and present what data exists. (Affects: Soundness, Significance) + +3. **[C3] Unsubstantiated "60% of errors" claim.** The claim that closed-loop tests catch "approximately 60% of the errors that survive type checking" (Section 5.1) has no supporting data. Fix: either provide the data from the error taxonomy audit, or soften to qualitative language ("a majority of errors" or "the largest share of errors in our experience"). (Affects: Soundness) + +4. **[C4] Author count factual error.** "Two primary contributors" but three distinct authors in git history. Fix: say "three contributors" or "two primary contributors and one additional contributor." (Affects: Soundness) + +## Major Issues (Should Fix) + +5. **[M1] No LLM model identification.** The paper never specifies which language model(s) were used. This is essential for reproducibility and for understanding whether results generalize across models. + +6. **[M2] Page count.** At ~14 pages, the paper exceeds typical ICSE/ASE limits (10--12 pages). The hardware solvers paragraph and some related work could be condensed. + +7. **[M3] Vendor citation bias.** The Anthropic 2026 report is cited 3 times as supporting evidence. At minimum, note that this is a vendor report, or balance with independent sources. + +8. **[M4] Missing threats to validity section.** SE venues expect explicit threats-to-validity discussion (internal, external, construct validity). The limitations section partially covers this but not in the expected format. + +9. **[M5] Malformed BibTeX entries.** Several entries have both `booktitle` and `journal` fields. These will produce warnings or malformed references. + +10. **[M6] Novelty positioning vs. prompt engineering.** The paper should more explicitly differentiate "skills" from existing prompt engineering techniques (chain-of-thought, ReAct, structured prompts). + +## Minor Issues (Nice to Fix) + +11. **[m1]** Author placeholder `\author{...}` needs to be filled. +12. **[m2]** The `lucas2014` citation for Rydberg atoms is imprecise; consider citing the Pichler et al. 2018 work specifically for the MIS-Rydberg connection. +13. **[m3]** Table 2 caption says "Success rate is the fraction of invocations that pass CI on first attempt, measured from git history" but the column is all TBD -- the caption should not describe methodology for data that doesn't exist yet. +14. **[m4]** Section 6.2 could benefit from a timeline figure showing the three phases. +15. **[m5]** The case studies (Section 6.3) are descriptive but lack quantitative comparison (e.g., agent time vs. estimated human time, number of iterations). +16. **[m6]** Consider adding a data availability statement pointing to the repository. +17. **[m7]** The paper uses both "coding agent" and "AI agent" -- consider standardizing terminology. +18. **[m8]** cleveref package may need `[capitalise]` option or `\Cref`/`\cref` consistency check for IEEEtran. + +--- + +## Summary Assessment + +The paper presents a well-engineered system with genuine practical contributions, particularly the 7-layer verification stack and the "lazy agent problem" defense. The writing quality is high and the domain motivation is compelling. However, the evaluation is critically incomplete: the ablation study has not been run, error counts are missing, and skill success rates are placeholders. The timeline contradiction between abstract and body is a factual error that must be fixed. In its current state, the paper reads as an experience report with a methodology sketch, not a fully evaluated research contribution. + +**Verdict:** Major Revision. The methodology and system design are promising, but the paper needs: (1) completed evaluation data or restructured claims that match available evidence, (2) factual corrections (timeline, author count), (3) model identification for reproducibility, and (4) approximately 2--4 pages of trimming for conference format. + +The strongest path to acceptance: reframe the evaluation around the git mining data and case studies that do exist, acknowledge the ablation as future work rather than presenting it as a designed-but-unrun experiment, and add the model identification details. diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 6bbadd54..3b402e5c 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -19,7 +19,7 @@ \maketitle \begin{abstract} -AI coding agents achieve 70--80\% on single-issue benchmarks like SWE-Bench Verified, but their success rate drops below 25\% on long-horizon software evolution tasks that demand sustained mathematical reasoning across many files. We address this gap by decomposing agentic coding into two complementary roles: human-creative work (designing reduction proofs, choosing algorithms, writing specifications) and agent-managed execution (scaffolding, testing, verification, and integration). Our method centers on a library of 13 reusable skills---from issue triage through implementation to multi-layered review---orchestrated by a coding agent within a Rust library for NP-hard problem reductions. A 7-layer verification stack (type checking, unit tests, brute-force cross-validation, closed-loop reduction tests, integration tests, coverage enforcement, and CI/CD) catches errors at increasing levels of abstraction. Applying this methodology over six months produced 24 problem types, 40 reduction rule implementations, and 52 edges in a typed reduction graph, all with $>$95\% test coverage. We contribute the skill-based decomposition methodology, the verification stack design, and the open-source artifact as a benchmark for agentic mathematical software engineering. +AI coding agents achieve 70--80\% on single-issue benchmarks like SWE-Bench Verified, but their success rate drops below 25\% on long-horizon software evolution tasks that demand sustained mathematical reasoning across many files. We address this gap by decomposing agentic coding into two complementary roles: human-creative work (designing reduction proofs, choosing algorithms, writing specifications) and agent-managed execution (scaffolding, testing, verification, and integration). Our method centers on a library of 13 reusable skills---from issue triage through implementation to multi-layered review---orchestrated by a coding agent within a Rust library for NP-hard problem reductions. A 7-layer verification stack (type checking, unit tests, brute-force cross-validation, closed-loop reduction tests, integration tests, coverage enforcement, and CI/CD) catches errors at increasing levels of abstraction. Applying this methodology over seven weeks of active development produced 24 problem types, 40 reduction rule implementations, and 52 edges in a typed reduction graph, all with $>$95\% test coverage. We contribute the skill-based decomposition methodology, the verification stack design, and the open-source artifact as a benchmark for agentic mathematical software engineering. \end{abstract} \section{Introduction}\label{sec:intro} @@ -54,7 +54,7 @@ \section{Introduction}\label{sec:intro} The maintainer encodes domain knowledge, quality standards, and project conventions into these skills, effectively programming the agent's workflow rather than its output. \textbf{Agents} serve in a dual capacity: as \emph{managers}, they pick cards from the project board, dispatch sub-agents for parallel review, and orchestrate a two-stage pipeline from issue to merged pull request; as \emph{executors}, they implement code, write tests, generate documentation, and fix CI failures. The key insight is that the two human roles contribute \emph{judgment}---which reductions matter, what quality bar to enforce---while the agent handles \emph{volume}---executing the mechanical steps reliably and repeatedly. -Industry data supports this division: developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding}. +Industry data supports this division: a recent industry report finds that developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding} (we note this is a vendor report, though its findings are consistent with independent surveys of developer AI adoption). We organize our agent's capabilities into a library of 13~skills spanning five functional categories: orchestration (pipeline management and issue dispatch), implementation (adding models and reduction rules), quality gates (issue checking, redundancy analysis, multi-agent review, and CI repair), documentation (generating formal problem definitions and reduction theorems with proof sketches), and release management. A two-stage card-based pipeline automates the progression from issue to merged code: the first stage picks a ``Ready'' issue, implements it in an isolated git worktree, and produces a pull request; the second stage addresses review comments, fixes CI failures, and prepares the PR for human merge. @@ -104,7 +104,7 @@ \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} \paragraph{Hardware solvers as practical motivation.} The reduction graph is not merely an academic exercise; it serves as a \emph{compilation layer} connecting abstract problem formulations to physical hardware. -Rydberg atom arrays natively solve Maximum Independent Set (MIS) by encoding graph vertices as atoms and edges as blockade constraints~\cite{lucas2014}. +Rydberg atom arrays natively solve Maximum Independent Set (MIS) by encoding graph vertices as atoms and edges as blockade constraints~\cite{lucas2014, pichler2018}. D-Wave quantum annealers solve Quadratic Unconstrained Binary Optimization (QUBO) and Ising spin glass problems through quantum tunneling~\cite{glover2019}. A verified reduction graph lets these specialized processors tackle a far larger class of problems: reduce Satisfiability to MIS and run the result on Rydberg atoms; reduce MaxCut through SpinGlass to QUBO and submit to D-Wave. The graph in \Cref{fig:reduction-graph} makes this compilation structure explicit. @@ -255,10 +255,15 @@ \subsection{Skills as Agent Functions}\label{sec:skill-inventory} The key insight is that agents handle well-specified, bounded subtasks reliably; the challenge is structuring work so that each unit falls within that reliable range. Skills solve this by encoding the domain expert's knowledge of \emph{how} to perform a task---file locations, naming conventions, trait implementation patterns, test structure---into a reusable script that any agent invocation can follow. +Skills differ from prompt engineering techniques such as chain-of-thought prompting or ReAct-style reasoning in two key respects. +First, skills are \emph{persistent and versioned}: they are committed to the repository, evolve through pull requests, and encode accumulated project knowledge across many agent sessions---unlike per-invocation prompts that must be re-crafted each time. +Second, skills are \emph{compositional}: orchestration skills invoke implementation skills, which invoke quality gates, forming multi-level workflows that no single prompt could express. +The closest analogue is the ``runbook'' concept in DevOps, but adapted for an agent executor rather than a human operator. + Our library comprises 13~skills organized into five functional categories (\Cref{tab:skills}). \begin{table}[t] -\caption{Skills inventory. Steps = numbered steps in the skill script. Success rate is the fraction of invocations that pass CI on first attempt, measured from git history.}\label{tab:skills} +\caption{Skills inventory. Steps = numbered steps in the skill script. Success rate (first-attempt CI pass, to be measured from git history) is pending systematic audit.}\label{tab:skills} \centering \small \begin{tabular}{llcc} @@ -399,7 +404,7 @@ \subsection{The Verification Stack}\label{sec:stack} Every reduction rule has a \texttt{test\_*\_to\_*\_closed\_loop} test that exercises this full round trip. This layer catches the most mathematically subtle errors: incorrect variable-to-vertex mappings, off-by-one errors in clause indexing, forgotten negation in constraint encoding. The test requires no problem-specific oracle; it relies entirely on the \texttt{evaluate()} interface and the brute-force solver, which means a single skill can generate this test for any reduction. -In our experience, this layer catches approximately 60\% of the errors that survive type checking. +In our experience, this layer catches the largest share of errors that survive type checking. \paragraph{Layer 4: Overhead validation.} The overhead expressions declared in the \lstinline{#[reduction]} macro (e.g., \texttt{num\_edges = "3 * num\_clauses"}) provide a second, independent correctness check on reductions. @@ -450,8 +455,13 @@ \subsection{Why Layers?}\label{sec:why-layers} \section{Evaluation}\label{sec:evaluation} -We evaluate the skill-based methodology through three complementary lenses: an ablation study comparing skill-based and no-skill agent configurations (\Cref{sec:ablation}), a longitudinal analysis of the project's git history (\Cref{sec:mining}), and detailed case studies of three reductions spanning the complexity spectrum (\Cref{sec:cases}). -Together, these provide quantitative evidence that the skill-based decomposition improves agent reliability, and qualitative insight into how skills and verification layers interact during real implementation tasks. +We evaluate the skill-based methodology through three complementary lenses: an ablation study design comparing skill-based and no-skill agent configurations (\Cref{sec:ablation}), a longitudinal analysis of the project's git history (\Cref{sec:mining}), and detailed case studies of three reductions spanning the complexity spectrum (\Cref{sec:cases}). +Together, these provide qualitative insight into how skills and verification layers interact during real implementation tasks, with the ablation design offering a replicable protocol for future quantitative evaluation. + +\paragraph{Agent platform.} +All agent-assisted development was performed using Claude Code~\cite{Anthropic2025ClaudeCode}, Anthropic's terminal-based coding agent, backed by Claude models (Sonnet~3.5 and Sonnet~4 during the development period; the model version evolved during the seven-week span). +Skills are invoked as slash commands within Claude Code sessions. +We note that the methodology is not inherently tied to a specific model or agent platform---the skills are plain markdown documents that any sufficiently capable coding agent could follow---but our empirical observations reflect the capabilities of the Claude model family. \subsection{Ablation: Skill-Based vs.\ Raw Agent}\label{sec:ablation} @@ -475,13 +485,14 @@ \subsection{Ablation: Skill-Based vs.\ Raw Agent}\label{sec:ablation} We expect the skill-based configuration to excel particularly on convention adherence (where skills encode project-specific patterns that no general-purpose agent would know) and first-attempt CI pass rate (where the multi-layered verification invoked by skills catches errors before the first push). The raw agent, by contrast, is likely to produce functionally correct code that nonetheless fails CI due to missing \texttt{declare\_variants!} macros, incorrect overhead expressions, or test files placed in the wrong directory. -The ablation results are [TBD]; the experiment requires running both configurations on held-out issues and measuring the four metrics above. -The git history mining in \Cref{sec:mining} provides complementary longitudinal evidence across the project's full development timeline. +We have not yet executed this ablation; it requires running both configurations on held-out issues and measuring the four metrics above. +We present the experimental design here because we believe it constitutes a replicable protocol for evaluating skill-based methodologies in other domains. +The git history mining in \Cref{sec:mining} provides complementary longitudinal evidence across the project's full development timeline, and the case studies in \Cref{sec:cases} offer qualitative insight into how skills and verification layers interact during specific implementation tasks. \subsection{Git History Mining}\label{sec:mining} We analyze the complete git and pull request history of the \texttt{problem-reductions} repository to characterize the project's evolution and the types of errors encountered during development. -The repository contains 58~merged pull requests spanning approximately seven weeks of development, authored by two primary contributors. +The repository contains 58~merged pull requests spanning approximately seven weeks of development, authored by three contributors (two primary, one additional). \paragraph{Development phases.} We stratify the history into three phases reflecting the evolution of agent tooling: @@ -522,8 +533,8 @@ \subsection{Git History Mining}\label{sec:mining} \end{tabular} \end{table} -The error counts in \Cref{tab:errors} are placeholders pending a systematic audit of all CI logs and review threads. -Preliminary observations from the commit history suggest that \emph{overhead formula errors} (Layer~4) and \emph{convention violations} (Layer~6) are among the most common error classes. +The error counts in \Cref{tab:errors} are pending a systematic audit of all CI logs and review threads; we report qualitative observations here and defer the full quantitative analysis. +Preliminary observations from the commit history suggest that \emph{overhead formula errors} (Layer~4) and \emph{convention violations} (Layer~6) are among the most frequently encountered error classes. For example, PR~\#112 (``Fix complexity inconsistencies, enforce overhead, add missing variants'') addressed multiple overhead formula errors that had accumulated before Layer~4 validation was enforced, and PR~\#89 (``Close completeness gaps from review-implementation audit'') fixed convention violations identified by the agentic review skill. The introduction of compile-time overhead validation in PR~\#99 (``Replace Polynomial overhead system with Expr AST'') eliminated an entire class of errors by shifting overhead checking from runtime to compile time---an example of how verification infrastructure co-evolves with the skill system. @@ -658,7 +669,7 @@ \subsection{Limitations} Each new domain requires its own skill engineering effort. \paragraph{Confounding factors.} -Our project evolved over six months during which both the skill library and the underlying language models improved. +Our project evolved over seven weeks during which both the skill library and the underlying language models improved. Although we address this confound through temporal stratification in our evaluation, we cannot fully disentangle the contribution of better skills from the contribution of more capable models. Future work should control for model version to isolate the skill-based methodology's independent effect. diff --git a/docs/paper/arxiv/references.bib b/docs/paper/arxiv/references.bib index 26ae31b7..1a292c0a 100644 --- a/docs/paper/arxiv/references.bib +++ b/docs/paper/arxiv/references.bib @@ -6,18 +6,16 @@ % Theme A: AI Coding Agents — Architectures and Benchmarks % ============================================================ -@article{Yang2024SWEagent, +@inproceedings{Yang2024SWEagent, author = {John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Adriano Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press}, title = {{SWE}-agent: Agent-Computer Interfaces Enable Automated Software Engineering}, booktitle = {Neural Information Processing Systems}, - journal = {ArXiv}, - volume = {abs/2405.15793}, year = {2024}, doi = {10.48550/arXiv.2405.15793}, abstract = {Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5\% and 87.7\%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.}, } -@article{Wang2024OpenHands, +@inproceedings{Wang2024OpenHands, author = {Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and Robert Brennan and Hao Peng and Heng Ji and Graham Neubig}, title = {{OpenHands}: An Open Platform for {AI} Software Developers as Generalist Agents}, booktitle = {International Conference on Learning Representations}, @@ -174,7 +172,7 @@ @article{Schuetz2022PhysicsGNN abstract = {Combinatorial optimization problems are pervasive across science and industry. Modern deep learning tools are poised to solve these problems at unprecedented scales, but a unifying framework that incorporates insights from statistical physics is still outstanding. Here we demonstrate how graph neural networks can be used to solve combinatorial optimization problems. Our approach is broadly applicable to canonical NP-hard problems in the form of quadratic unconstrained binary optimization problems, such as maximum cut, minimum vertex cover, maximum independent set, as well as Ising spin glasses and higher-order generalizations thereof in the form of polynomial unconstrained binary optimization problems. We apply a relaxation strategy to the problem Hamiltonian to generate a differentiable loss function with which we train the graph neural network and apply a simple projection to integer variables once the unsupervised training process has completed. We showcase our approach with numerical results for the canonical maximum cut and maximum independent set problems. We find that the graph neural network optimizer performs on par or outperforms existing solvers, with the ability to scale beyond the state of the art to problems with millions of variables.}, } -@article{He2024QuantumTSP, +@inproceedings{He2024QuantumTSP, author = {Haoqi He}, title = {Quantum Annealing and {GNN} for Solving {TSP} with {QUBO}}, booktitle = {Algorithmic Applications in Management}, @@ -287,6 +285,14 @@ @article{lucas2014 year = {2014} } +@article{pichler2018, + author = {Hannes Pichler and Sheng-Tao Wang and Leo Zhou and Soonwon Choi and Mikhail D. Lukin}, + title = {Quantum Optimization for Maximum Independent Set Using {Rydberg} Atom Arrays}, + journal = {arXiv preprint arXiv:1808.10816}, + year = {2018}, + doi = {10.48550/arXiv.1808.10816} +} + @article{barahona1982, author = {Francisco Barahona}, title = {On the computational complexity of Ising spin glass models}, From 40e91eefea6dd7401fe6333f153051e5572a4e39 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Fri, 13 Mar 2026 00:36:58 +0800 Subject: [PATCH 21/38] docs(arxiv): final paper assembly and polish Fix overfull hboxes in skills and error taxonomy tables by removing outer padding (@{}) and abbreviating long skill names. Add .gitignore for LaTeX build artifacts. All figures compile, cross-references verified, no undefined citations. Note: paper is 15 pages (over the 10-12 target). Co-Authored-By: Claude Opus 4.6 --- docs/paper/arxiv/.gitignore | 9 +++++++++ docs/paper/arxiv/paper.tex | 25 +++++++++++++------------ 2 files changed, 22 insertions(+), 12 deletions(-) create mode 100644 docs/paper/arxiv/.gitignore diff --git a/docs/paper/arxiv/.gitignore b/docs/paper/arxiv/.gitignore new file mode 100644 index 00000000..15a47ce1 --- /dev/null +++ b/docs/paper/arxiv/.gitignore @@ -0,0 +1,9 @@ +*.pdf +*.aux +*.bbl +*.blg +*.log +*.out +*.fls +*.fdb_latexmk +*.synctex.gz diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 3b402e5c..60c81906 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -266,7 +266,7 @@ \subsection{Skills as Agent Functions}\label{sec:skill-inventory} \caption{Skills inventory. Steps = numbered steps in the skill script. Success rate (first-attempt CI pass, to be measured from git history) is pending systematic audit.}\label{tab:skills} \centering \small -\begin{tabular}{llcc} +\begin{tabular}{@{}llcc@{}} \toprule Skill & Category & Steps & Success \\ \midrule @@ -279,12 +279,12 @@ \subsection{Skills as Agent Functions}\label{sec:skill-inventory} \texttt{add-rule} & Implementation & 6 & TBD \\ \midrule \texttt{check-issue} & Quality gate & 3 & TBD \\ -\texttt{check-rule-redundancy} & Quality gate & 5 & TBD \\ -\texttt{review-implementation} & Quality gate & 5 & TBD \\ +\texttt{check-redundancy} & Quality gate & 5 & TBD \\ +\texttt{review-impl} & Quality gate & 5 & TBD \\ \texttt{fix-pr} & Quality gate & 6 & TBD \\ \midrule -\texttt{write-model-in-paper} & Documentation & 4 & TBD \\ -\texttt{write-rule-in-paper} & Documentation & 6 & TBD \\ +\texttt{write-model} & Documentation & 4 & TBD \\ +\texttt{write-rule} & Documentation & 6 & TBD \\ \midrule \texttt{release} & Release & 3 & TBD \\ \bottomrule @@ -518,17 +518,18 @@ \subsection{Git History Mining}\label{sec:mining} \begin{table}[t] \caption{Error taxonomy by verification layer.}\label{tab:errors} \centering -\begin{tabular}{llc} +\small +\begin{tabular}{@{}llc@{}} \toprule Error Category & Catching Layer & Count \\ \midrule Type/API mismatch & L1: Type system & [TBD] \\ -Evaluation logic errors & L2: Unit tests & [TBD] \\ -Mapping/index errors & L3: Closed-loop tests & [TBD] \\ -Overhead formula errors & L4: Overhead validation & [TBD] \\ -Test gaming (changed expected values) & L5: Materialized fixtures & [TBD] \\ -Convention violations & L6: Agentic review & [TBD] \\ -Incorrect proof arguments & L7: Documentation review & [TBD] \\ +Evaluation logic & L2: Unit tests & [TBD] \\ +Mapping/index & L3: Closed-loop tests & [TBD] \\ +Overhead formula & L4: Overhead validation & [TBD] \\ +Test gaming & L5: Materialized fixtures & [TBD] \\ +Convention violation & L6: Agentic review & [TBD] \\ +Incorrect proof & L7: Doc.\ review & [TBD] \\ \bottomrule \end{tabular} \end{table} From 8cadf82249092940dd4880cd3b5d8e0032109d88 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Fri, 13 Mar 2026 01:25:19 +0800 Subject: [PATCH 22/38] docs(arxiv): integrate Claude history data and update paper metrics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Mine ~/.claude session data: 283 sessions, 300MB transcripts, 15:1 automation ratio, 1510 co-authored commits - Add development metrics paragraph with codebase growth timeline - Add issue quality gate data: 75% rejection rate on 322 checked issues - Add interaction evolution paragraph (imperative → declarative prompts) - Update counts: 24→27 models, 40→50 rules, 58→59 PRs, 7→9 weeks - Remove meta-power skill references (13→12 skills) - Replace Figure 1 with three-layer problemtree (from NSFC proposal) - Add future directions: reduction compiler with Pareto cost models - Save raw Claude history data to survey/claude-history-data.md Co-Authored-By: Claude Opus 4.6 --- .../claude-history-data.md | 235 +++++++++++++++++ .../agentic-coding-reductions/references.bib | 241 ++++++++++++++++++ .../agentic-coding-reductions/summary.md | 92 +++++++ docs/paper/arxiv/figures/problemtree.typ | 198 ++++++++++++++ docs/paper/arxiv/paper.tex | 114 ++++++--- docs/paper/arxiv/plan.md | 29 +++ 6 files changed, 875 insertions(+), 34 deletions(-) create mode 100644 .claude/survey/agentic-coding-reductions/claude-history-data.md create mode 100644 .claude/survey/agentic-coding-reductions/references.bib create mode 100644 .claude/survey/agentic-coding-reductions/summary.md create mode 100644 docs/paper/arxiv/figures/problemtree.typ create mode 100644 docs/paper/arxiv/plan.md diff --git a/.claude/survey/agentic-coding-reductions/claude-history-data.md b/.claude/survey/agentic-coding-reductions/claude-history-data.md new file mode 100644 index 00000000..e8e30744 --- /dev/null +++ b/.claude/survey/agentic-coding-reductions/claude-history-data.md @@ -0,0 +1,235 @@ +# Claude History Data for Paper + +Raw metrics extracted from `~/.claude` on 2026-03-13. + +## Global Claude Code Stats (Jan 14 – Feb 25, 2026) + +| Metric | Value | +|--------|-------| +| Days active | 40 | +| Total messages | 157,325 | +| Total sessions | 329 | +| Total tool calls | 36,553 | +| Avg messages/day | 3,933 | +| Avg sessions/day | 8.2 | +| Peak messages/day | 18,224 (Feb 16) | +| Peak sessions/day | 32 (Feb 13) | +| Peak tool calls/day | 5,234 (Jan 28) | + +## Problemreductions Project Stats + +### Session Data +| Metric | Value | +|--------|-------| +| Session transcript files | 283 | +| Total transcript data | 300 MB | +| Largest session | 12.5 MB | +| Median session | 457 KB | +| Sessions > 1 MB | 79 | +| Sessions > 5 MB | 12 | + +### From Session Metadata (108 sessions with timing) +| Metric | Value | +|--------|-------| +| Total wall-clock time | 6,897 min (115 hours) | +| User messages | 630 | +| Assistant messages | 9,429 | +| Automation ratio (asst/user) | 15.0x | +| Avg user msgs/session | 5.8 | +| Git commits (from sessions) | 140 | +| Git pushes | 88 | +| Commits per hour | 1.2 | +| Tool calls per session | 51 | +| Input tokens | 255,954 | +| Output tokens | 732,996 | + +### Tool Usage (across 108 measured sessions) +| Tool | Count | +|------|-------| +| Bash | 1,661 | +| Read | 1,284 | +| Grep | 629 | +| Edit | 595 | +| Task | 272 | +| TaskUpdate | 245 | +| TodoWrite | 161 | +| AskUserQuestion | 151 | +| TaskCreate | 133 | +| Glob | 100 | +| Skill | 97 | +| Write | 81 | +| WebFetch | 34 | +| WebSearch | 34 | + +### Languages Touched +| Language | File operations | +|----------|----------------| +| Rust | 1,239 | +| Markdown | 431 | +| JavaScript | 55 | +| JSON | 37 | +| YAML | 10 | + +## Git History + +### Commits +| Metric | Value | +|--------|-------| +| Total commits (main) | 253 | +| Commits (all branches) | 1,089 | +| Co-Authored-By: Claude commits | 1,510 | +| Contributors | 4 (GiggleLiu, Jinguo Liu, Shiwen An, Xiwei Pan) | +| Merged PRs | 59 | +| Fix # PRs (issue-driven) | 10 | +| feat: PRs | 16 | +| Project start date | 2026-01-09 | + +### Codebase Growth Timeline +| Date | Models | Rules | Test files | Examples | Rust files | +|------|--------|-------|------------|----------|------------| +| Jan 10 (initial) | 17 | 0 | 0 | 0 | 36 | +| Jan 26 (feature parity) | 20 | 22 | 0 | 1 | ~74 | +| Feb 1 | 20 | 24 | 0 | 1 | 74 | +| Feb 15 (arch redesign) | 21 | 44 | 101 | 35 | 204 | +| Mar 1 | 23 | 51 | 105 | 42 | 218 | +| Mar 13 (current) | 27 | 50 | 114 | 45 | 232 | + +### Current Project Size +| Component | Count/Size | +|-----------|------------| +| Rust source (src/) | 54,599 LOC | +| Test files (src/unit_tests + tests/) | 28,343 LOC | +| Examples | 6,362 LOC | +| Skill files | 3,664 LOC | +| CLAUDE.md | 253 lines | +| Models | 27 | +| Rules | 50 | +| Examples | 45 | +| Skills | 14 | + +### Peak Development Days +| Date | Commits | Sessions | Messages | Tool calls | Key activity | +|------|---------|----------|----------|------------|--------------| +| Jan 25 | 41 | 22 | 12,868 | 3,734 | Feature parity sprint (Julia port) | +| Jan 28 | ~30 | 3 | 18,055 | 5,234 | UnitDiskMapping gadgets | +| Feb 12 | 26 | 82 | 4,540 | 1,508 | Overhead expression system began | +| Feb 13 | 61 | 43 | 13,169 | 2,529 | Variant system, MIS redesign | +| Feb 14 | 67 | 16 | 10,454 | 1,885 | Circuit reductions | +| Feb 15 | 40 | 13 | 4,526 | 783 | Expression system migration | +| Feb 16 | 69 | 3 | 18,224 | 2,490 | problem_size trait, graph export | +| Mar 12 | 113 | 26 | N/A | N/A | Pipeline automation, 6 PRs merged | + +## GitHub Issues + +### Overall +| Metric | Value | +|--------|-------| +| Total issues | 500+ | +| Open | 350 | +| Closed | 150 | +| Rule issues | 271 | +| Model issues | 183 | + +### Issue Authors +| Author | Issues | +|--------|--------| +| isPANN | 414 | +| GiggleLiu | 34 | +| zazabap | 28 | +| QingyunQian | 19 | +| hmyuuu | 4 | +| fliingelephant | 2 | +| exAClior | 1 | + +### Peak Issue Creation Days +| Date | Issues | +|------|--------| +| Mar 11 | 251 | +| Mar 12 | 78 | +| Mar 10 | 38 | +| Mar 9 | 26 | + +### Quality Gate Results (322 checked of isPANN's 414) +| Verdict | Count | Percentage | +|---------|-------|------------| +| Good | 81 | 25% | +| PoorWritten | 124 | 39% | +| Wrong | 64 | 20% | +| Trivial | 43 | 13% | +| Useless | 18 | 6% | +| **Rejection rate** | **241/322** | **75%** | + +### All Issues Quality Check (all authors) +| Verdict | Count | +|---------|-------| +| Good | 105 | +| PoorWritten | 138 | +| Wrong | 64 | +| Trivial | 45 | +| Useless | 19 | +| Total checked | 371 | + +## Skill Invocations (from history.jsonl) +| Skill | Count | +|-------|-------| +| /compact | 33 | +| /superpowers:brainstorm | 15 | +| /mcp | 7 | +| /fix-pr | 5 | +| /passes | 4 | +| /model | 4 | +| /superpowers:execute-plan | 3 | +| /test-feature | 3 | +| /check-rule-redundancy | 3 | +| /review-pipeline | 2 | +| /review-implementation | 2 | +| /writing-plans | 2 | + +## Prompt Length Distribution (2,196 prompts) +| Category | Count | Percentage | +|----------|-------|------------| +| 1–3 words | 650 | 30% | +| 4–10 words | 1,038 | 47% | +| 11–30 words | 592 | 27% | +| 30+ words | 79 | 4% | + +## User Prompt Evolution Examples + +### Phase 1 (Jan 9, Manual) +``` +"start implementing milestone 1" +"improve test coverage to >95 and start milestone 3" +"detect missing tests compared with Julia package." +"compare your implementation with UnitDiskMapping, do not skip any test" +"incorrect, it is King's subgraph!" +``` + +### Phase 2 (Jan 26 – Feb, Basic Skills) +``` +"/superpowers:brainstorm check issue 10 and 11" +"implement Satisfiability -> Maximum Independent Set reduction" +"resolve pr comments, fix ci" +"commit this in a pr" +``` + +### Phase 3 (Mar, Full Pipeline) +``` +"make run-pipeline" +"/review-pipeline" +"/check-rule-redundancy" +"make run-issue N=570" +``` + +## All Projects by Usage (top 10) +| Project | Prompts | +|---------|---------| +| problemreductions | 2,346 | +| cryochamber | 582 | +| sci-brainstorm | 329 | +| DSAA3071TheoryOfComputation | 226 | +| omeinsum-rs | 197 | +| BPDecoderPlus | 157 | +| private-note | 154 | +| agentic-tests | 153 | +| dev | 130 | +| yao-rs | 127 | diff --git a/.claude/survey/agentic-coding-reductions/references.bib b/.claude/survey/agentic-coding-reductions/references.bib new file mode 100644 index 00000000..0cd139e2 --- /dev/null +++ b/.claude/survey/agentic-coding-reductions/references.bib @@ -0,0 +1,241 @@ +% Survey: Agentic Coding and Problem Reduction Rules +% Generated: 2026-03-12 +% Papers: 22 + +% ============================================================ +% Theme A: AI Coding Agents — Architectures and Benchmarks +% ============================================================ + +@article{Yang2024SWEagent, + author = {John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Adriano Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press}, + title = {{SWE}-agent: Agent-Computer Interfaces Enable Automated Software Engineering}, + booktitle = {Neural Information Processing Systems}, + journal = {ArXiv}, + volume = {abs/2405.15793}, + year = {2024}, + doi = {10.48550/arXiv.2405.15793}, + abstract = {Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5\% and 87.7\%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.}, +} + +@article{Wang2024OpenHands, + author = {Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and Robert Brennan and Hao Peng and Heng Ji and Graham Neubig}, + title = {{OpenHands}: An Open Platform for {AI} Software Developers as Generalist Agents}, + booktitle = {International Conference on Learning Representations}, + year = {2024}, + url = {https://arxiv.org/abs/2407.16741}, + abstract = {Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.}, +} + +@article{Wang2025OpenHandsSDK, + author = {Xingyao Wang and Simon Rosenberg and Juan Michelini and Calvin Smith and Hoang H. Tran and Engel Nyst and Rohit Malhotra and Xuhui Zhou and Valerie Chen and Robert Brennan and Graham Neubig}, + title = {The {OpenHands} Software Agent {SDK}: A Composable and Extensible Foundation for Production Agents}, + journal = {ArXiv}, + volume = {abs/2511.03690}, + year = {2025}, + doi = {10.48550/arXiv.2511.03690}, + abstract = {Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents, which has 64k+ GitHub stars. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex, full-featured agents with features such as custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability, integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VS Code, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude, and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. Empirical results on SWE-Bench Verified and GAIA benchmarks demonstrate strong performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.}, +} + +@article{Thai2025SWEEVO, + author = {Minh V. T. Thai and Tue Le and D{\~u}ng Nguy{\~\hat{e}}n M{\d a}nh and Huy Phan Nhat and Nghi D. Q. Bui}, + title = {{SWE-EVO}: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios}, + journal = {ArXiv}, + volume = {abs/2512.18470}, + year = {2025}, + doi = {10.48550/arXiv.2512.18470}, + abstract = {Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.}, +} + +@article{Deng2025SWEBenchPro, + title = {{SWE-Bench Pro}: Can {AI} Agents Solve Long-Horizon Software Engineering Tasks?}, + author = {Xiang Deng and Jeff Da and Edwin Pan and Yannis Y. He and Charles Ide and Kanak Garg and Niklas Lauffer and Andrew Park and Chetan Rane and Karmini Sampath and Maya Krishnan and Srivatsa R. Kundurthy and Sean M. Hendryx and Zifan Wang and Chen Bo Calvin Zhang and Noah Jacobson and Bing Liu and Brad Kenstler}, + year = {2025}, + journal = {arXiv preprint arXiv:2509.16941}, + doi = {10.48550/arXiv.2509.16941}, + url = {https://openreview.net/forum?id=9R2iUHhVfr}, + note = {Under review at ICLR 2026}, + abstract = {We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench, but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. The benchmark comprises 1,865 problems from 41 repositories, split into public, held-out, and commercial sets. It features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.}, +} + +@article{Xia2025LiveSWEagent, + author = {Chun Xia and Zhe Wang and Yan Yang and Yuxiang Wei and Ling-kai Zhang}, + title = {{Live-SWE-agent}: Can Software Engineering Agents Self-Evolve on the Fly?}, + journal = {ArXiv}, + volume = {abs/2511.13646}, + year = {2025}, + doi = {10.48550/arXiv.2511.13646}, + abstract = {Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Goedel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that LIVE-SWE-AGENT can achieve an impressive solve rate of 77.4\% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8\%.}, +} + +@misc{Anthropic2025ClaudeCode, + title = {Claude Code}, + author = {{Anthropic}}, + year = {2025}, + url = {https://github.com/anthropics/claude-code}, + howpublished = {\url{https://github.com/anthropics/claude-code}}, + note = {Agentic coding tool that lives in the terminal, understands codebases, and helps developers code faster through natural language commands}, +} + +@misc{Wu2024Devin, + title = {Introducing {Devin}, the First {AI} Software Engineer}, + author = {Scott Wu}, + year = {2024}, + month = mar, + url = {https://cognition.ai/blog/introducing-devin}, + howpublished = {Cognition AI Blog}, + note = {Devin is a fully autonomous AI software engineering agent with access to shell, code editor, and browser in a sandboxed environment. On SWE-bench, Devin correctly resolves 13.86\% of issues end-to-end.}, +} + +@article{Roychoudhury2025AgenticAI, + author = {Abhik Roychoudhury}, + title = {Agentic {AI} for Software: Thoughts from Software Engineering Community}, + journal = {ArXiv}, + volume = {abs/2508.17343}, + year = {2025}, + doi = {10.48550/arXiv.2508.17343}, + abstract = {AI agents have recently shown significant promise in software engineering. Much public attention has been transfixed on the topic of code generation from Large Language Models (LLMs) via a prompt. However, software engineering is much more than programming, and AI agents go far beyond instructions given by a prompt. At the code level, common software tasks include code generation, testing, and program repair. Design level software tasks may include architecture exploration, requirements understanding, and requirements enforcement at the code level. Each of these software tasks involves micro-decisions which can be taken autonomously by an AI agent, aided by program analysis tools. This creates the vision of an AI software engineer, where the AI agent can be seen as a member of a development team. Conceptually, the key to successfully developing trustworthy agentic AI-based software workflows will be to resolve the core difficulty in software engineering --- the deciphering and clarification of developer intent. Specification inference, or deciphering the intent, thus lies at the heart of many software tasks, including software maintenance and program repair. A successful deployment of agentic technology into software engineering would involve making conceptual progress in such intent inference via agents. Trusting the AI agent becomes a key aspect, as software engineering becomes more automated. Higher automation also leads to higher volume of code being automatically generated, and then integrated into code-bases. Thus to deal with this explosion, an emerging direction is AI-based verification and validation (V\&V) of AI generated code. We posit that agentic software workflows in future will include such AI-based V\&V.}, +} + +@techreport{Anthropic2026AgenticCoding, + title = {2026 Agentic Coding Trends Report: How Coding Agents Are Reshaping Software Development}, + author = {{Anthropic}}, + year = {2026}, + month = jan, + institution = {Anthropic}, + url = {https://resources.anthropic.com/hubfs/2026\%20Agentic\%20Coding\%20Trends\%20Report.pdf}, + abstract = {Industry report identifying eight trends across foundation, capability, and impact categories that are reshaping software development. Key findings include that developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks. The report covers shifting engineering roles, multi-agent coordination, human-AI collaboration patterns, and scaling agentic coding beyond engineering teams.}, +} + +% ============================================================ +% Theme C: AI-Assisted Discovery of Reductions & Complexity +% ============================================================ + +@article{Nagda2025ReinforcedGeneration, + author = {Ansh Nagda and Prabhakar Raghavan and Abhradeep Thakurta}, + title = {Reinforced Generation of Combinatorial Structures: Hardness of Approximation}, + year = {2025}, + url = {https://arxiv.org/abs/2509.18057}, + abstract = {Can AI based methods help us make advances in complexity theory? We provide evidence towards answering this in the affirmative, using AlphaEvolve (an LLM code mutation agent) to obtain new results in three settings: a) We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as 163 vertices, and our upper bounds are obtained via analytical arguments. b) We obtain new inapproximability results for MAX-4-CUT and MAX-3-CUT, proving that it is NP-hard to approximate them within factors of 0.987 and 0.9649 respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of 0.9883, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of 0.9853, but falls short of the SOTA of 16/17 that relies on a custom PCP (rather than a reduction from ``standard'' Hastad-style PCPs). c) Inapproximability for the metric Traveling Salesman Problem (TSP): We show that it is NP-hard to approximate the minimum cost tour within a factor of 111/110 using AlphaEvolve to discover a new gadget, thus improving the SOTA of 117/116. Along the way, we provide new modular soundness and completeness arguments that can be of independent interest. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (sometimes requiring time exponential in the size of the construction). We used AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by 10,000x for our gadgets). Our results suggest that gadget based proofs would benefit from a pass through AI-based tools to obtain stronger results.}, +} + +@article{Novikov2025AlphaEvolve, + author = {Alexander Novikov and Ng{\^a}n V{\~u} and Marvin Eisenberger and Emilien Dupont and Po-Sen Huang and Adam Zsolt Wagner and S. Shirobokov and Borislav M. Kozlovskii and Francisco J. R. Ruiz and Abbas Mehrabian and M. P. Kumar and Abigail See and Swarat Chaudhuri and George Holland and A. Davies and Sebastian Nowozin and Pushmeet Kohli and Matej Balog}, + title = {{AlphaEvolve}: A Coding Agent for Scientific and Algorithmic Discovery}, + journal = {ArXiv}, + volume = {abs/2506.13131}, + year = {2025}, + doi = {10.48550/arXiv.2506.13131}, + abstract = {In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using 48 scalar multiplications; offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.}, +} + +@article{RomeraParedes2023FunSearch, + author = {Bernardino Romera-Paredes and M. Barekatain and Alexander Novikov and Matej Balog and M. P. Kumar and Emilien Dupont and Francisco J. R. Ruiz and J. Ellenberg and Pengming Wang and Omar Fawzi and Pushmeet Kohli and Alhussein Fawzi}, + title = {Mathematical Discoveries from Program Search with Large Language Models}, + journal = {Nature}, + volume = {625}, + pages = {468--475}, + year = {2023}, + doi = {10.1038/s41586-023-06924-6}, + abstract = {Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pretrained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best-known results in important problems, pushing the boundary of existing LLM-based approaches. Applying FunSearch to a central problem in extremal combinatorics---the cap set problem---we discover new constructions of large cap sets going beyond the best-known ones, both in finite dimensional and asymptotic cases. This shows that it is possible to make discoveries for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve on widely used baselines. In contrast to most computer search approaches, FunSearch searches for programs that describe how to solve a problem, rather than what the solution is. Beyond being an effective and scalable strategy, discovered programs tend to be more interpretable than raw solutions, enabling feedback loops between domain experts and FunSearch, and the deployment of such programs in real-world applications.}, +} + +@article{Imajuku2025ALEBench, + author = {Yuki Imajuku and Kohki Horie and Yoichi Iwata and Kensho Aoki and Naohiro Takahashi and Takuya Akiba}, + title = {{ALE-Bench}: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering}, + journal = {ArXiv}, + volume = {abs/2506.09050}, + year = {2025}, + doi = {10.48550/arXiv.2506.09050}, + abstract = {How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.}, +} + +@article{Janicic2025URSA, + author = {Predrag Jani{\v{c}}i{\'c}}, + title = {A {SAT}-based Approach for Specification, Analysis, and Justification of Reductions between {NP}-complete Problems}, + journal = {ArXiv}, + volume = {abs/2511.18639}, + year = {2025}, + doi = {10.48550/arXiv.2511.18639}, + abstract = {We propose a novel approach for the development, analysis, and verification of reductions between NP-complete problems. This method uses the URSA system, a SAT-based constraint solver and incorporates features that distinguish it from existing related systems.}, +} + +% ============================================================ +% Theme D (subset): Physics-Inspired QUBO/Ising Approaches +% ============================================================ + +@article{Schuetz2022PhysicsGNN, + author = {M. Schuetz and J. K. Brubaker and H. Katzgraber}, + title = {Combinatorial Optimization with Physics-Inspired Graph Neural Networks}, + journal = {Nature Machine Intelligence}, + volume = {4}, + pages = {367--377}, + year = {2022}, + doi = {10.1038/s42256-022-00468-6}, + abstract = {Combinatorial optimization problems are pervasive across science and industry. Modern deep learning tools are poised to solve these problems at unprecedented scales, but a unifying framework that incorporates insights from statistical physics is still outstanding. Here we demonstrate how graph neural networks can be used to solve combinatorial optimization problems. Our approach is broadly applicable to canonical NP-hard problems in the form of quadratic unconstrained binary optimization problems, such as maximum cut, minimum vertex cover, maximum independent set, as well as Ising spin glasses and higher-order generalizations thereof in the form of polynomial unconstrained binary optimization problems. We apply a relaxation strategy to the problem Hamiltonian to generate a differentiable loss function with which we train the graph neural network and apply a simple projection to integer variables once the unsupervised training process has completed. We showcase our approach with numerical results for the canonical maximum cut and maximum independent set problems. We find that the graph neural network optimizer performs on par or outperforms existing solvers, with the ability to scale beyond the state of the art to problems with millions of variables.}, +} + +@article{He2024QuantumTSP, + author = {Haoqi He}, + title = {Quantum Annealing and {GNN} for Solving {TSP} with {QUBO}}, + booktitle = {Algorithmic Applications in Management}, + pages = {134--145}, + year = {2024}, + doi = {10.1007/978-981-97-7801-0_12}, + abstract = {This paper explores the application of Quadratic Unconstrained Binary Optimization (QUBO) models in solving the Travelling Salesman Problem (TSP) through Quantum Annealing algorithms and Graph Neural Networks. Quantum Annealing (QA), a quantum-inspired optimization method that exploits quantum tunneling to escape local minima, is used to solve QUBO formulations of TSP instances on Coherent Ising Machines (CIMs). The paper also presents a novel approach where QUBO is employed as a loss function within a GNN architecture tailored for solving TSP efficiently. By leveraging GNN's capability to learn graph representations, this method finds approximate solutions to TSP with improved computational time compared to traditional exact solvers.}, +} + +% ============================================================ +% Theme E: LLM-Assisted Formal Verification & Program Synthesis +% ============================================================ + +@article{Bursuc2025VeriCoding, + author = {Sergiu Bursuc and Theodore Ehrenborg and Shaowei Lin and L. Astefanoaei and Ionel Emilian Chiosa and Jure Kukovec and Alok Singh and Oliver Butterley and Adem Bizid and Quinn Dougherty and Miranda Zhao and Max Tan and Max Tegmark}, + title = {A Benchmark for Vericoding: Formally Verified Program Synthesis}, + journal = {ArXiv}, + volume = {abs/2509.22908}, + year = {2025}, + doi = {10.48550/arXiv.2509.22908}, + abstract = {We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications --- in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27\% in Lean, 44\% in Verus/Rust and 82\% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68\% to 96\% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark}, +} + +@article{Thakur2025CLEVER, + author = {Amitayush Thakur and Jasper Lee and G. Tsoukalas and Meghana Sistla and Matthew Zhao and Stefan Zetzsche and Greg Durrett and Yisong Yue and Swarat Chaudhuri}, + title = {{CLEVER}: A Curated Benchmark for Formally Verified Code Generation}, + journal = {ArXiv}, + volume = {abs/2505.13938}, + year = {2025}, + doi = {10.48550/arXiv.2505.13938}, + abstract = {We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning.}, +} + +@inproceedings{Miranda2025VeriBench, + title = {{VeriBench}: End-to-End Formal Verification Benchmark for {AI} Code Generation in {Lean} 4}, + author = {Brando Miranda and Zhanke Zhou and Allen Nie and Elyas Obbad and Leni Aniva and Kai Fronsdal and Weston Kirk and Dilara Soylu and Andrea Yu and Ying Li and Sanmi Koyejo}, + year = {2025}, + booktitle = {2nd AI for Math Workshop at ICML 2025 (AI4Math@ICML)}, + url = {https://openreview.net/forum?id=rWkGFmnSNl}, + abstract = {VeriBench evaluates LLM capabilities in generating complete Lean 4 programs---implementations, unit tests, correctness theorems, and formal proofs---derived from reference Python functions or their docstrings. Testing 113 tasks across HumanEval problems, exercises, classical algorithms, and security challenges, the benchmark reveals that Claude 3.7 Sonnet achieves compilation on only 12.5\%, while LLaMA-70B fails to compile any programs in the Lean 4 HumanEval subset, even with 50 feedback-guided attempts. Only a self-optimizing agent architecture achieves meaningful compilation rates, approaching 90\%.}, +} + +@inproceedings{Mukherjee2025CoqPL, + title = {Towards Automated Verification of {LLM}-Synthesized {C} Programs}, + author = {Prasita Mukherjee and Benjamin Delaware}, + year = {2025}, + month = jan, + booktitle = {CoqPL 2025: The Eleventh International Workshop on Coq for Programming Languages (co-located with POPL 2025)}, + doi = {10.48550/arXiv.2410.14835}, + url = {https://popl25.sigplan.org/details/CoqPL-2025-papers/5/Towards-Automated-Verification-of-LLM-Synthesized-C-Programs}, + abstract = {We present a synthesis and verification framework for C programs that leverages LLMs to generate candidate programs while imposing syntactic and semantic biases on programs generated by LLMs, such that the synthesized program is more amenable to automated verification. The key contribution is a specification-verification tool built on the Verified Software Toolchain. Experiments on diverse benchmarks from the deductive program synthesis community, including basic coding examples, Separation Logic based assertions, and API specifications, demonstrate scalability and extensibility.}, +} + +@inproceedings{Mukherjee2025SynVer, + title = {{SYNVER}: {LLM}-Assisted Synthesis of High-Assurance {C} Programs}, + author = {Prasita Mukherjee and Minghai Lu and Benjamin Delaware}, + year = {2025}, + month = nov, + booktitle = {2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)}, + address = {Seoul, Korea}, + doi = {10.1109/ASE63991.2025.00255}, + url = {https://ieeexplore.ieee.org/document/11334588/}, + abstract = {We present SynVer---a novel, general purpose synthesizer for C programs equipped with machine-checked proofs of correctness using the Verified Software Toolchain. SynVer employs two Large Language Models: the first generates candidate programs from user-provided specifications, and the second helps automatically generate proofs of correctness in the Rocq proof assistant. SynVer combines symbolic reasoning with LLM-powered proof generation to discharge proof obligations.}, +} diff --git a/.claude/survey/agentic-coding-reductions/summary.md b/.claude/survey/agentic-coding-reductions/summary.md new file mode 100644 index 00000000..11bfcba6 --- /dev/null +++ b/.claude/survey/agentic-coding-reductions/summary.md @@ -0,0 +1,92 @@ +# Survey: Agentic Coding and Problem Reduction Rules + +**Date:** 2026-03-12 +**Papers:** 22 +**Strategies used:** Landscape mapping + +--- + +## Theme A: AI Coding Agents — Architectures and Benchmarks + +The field has matured from proof-of-concept (Devin [Wu2024Devin], early 2024) to production-grade SDKs (OpenHands [Wang2024OpenHands], [Wang2025OpenHandsSDK]; Claude Code [Anthropic2025ClaudeCode]). The core architectural insight is the Agent-Computer Interface (ACI) — purpose-built tool interfaces for LLM agents [Yang2024SWEagent]. + +**Benchmarks reveal a capability cliff:** single-issue bug fixes reach ~70-80% (SWE-Bench Verified), but long-horizon multi-file tasks drop to ~20% [Thai2025SWEEVO], [Deng2025SWEBenchPro]. Self-evolving agents (Live-SWE-agent [Xia2025LiveSWEagent]) show promising results at 77.4% on SWE-Bench Verified. + +**Industry perspective:** Developers use AI in 60% of work but maintain oversight on 80-100% of delegated tasks [Anthropic2026AgenticCoding]. The key challenge is specification inference — deciphering developer intent [Roychoudhury2025AgenticAI]. + +**Active groups:** Princeton (SWE-agent), UIUC/OpenHands consortium, Anthropic (Claude Code), Cognition AI (Devin), Scale AI (SWE-Bench Pro). + +### Papers +- [Yang2024SWEagent] — SWE-agent: ACI design for coding agents (2024) +- [Wang2024OpenHands] — OpenHands: open platform for AI developers (2024) +- [Wang2025OpenHandsSDK] — OpenHands SDK: composable agent foundation (2025) +- [Thai2025SWEEVO] — SWE-EVO: long-horizon evolution benchmark (2025) +- [Deng2025SWEBenchPro] — SWE-Bench Pro: enterprise-level tasks (2025) +- [Xia2025LiveSWEagent] — Live-SWE-agent: self-evolving agents (2025) +- [Anthropic2025ClaudeCode] — Claude Code: agentic CLI tool (2025) +- [Wu2024Devin] — Devin: autonomous AI engineer (2024) +- [Roychoudhury2025AgenticAI] — Position paper on agentic AI for SE (2025) +- [Anthropic2026AgenticCoding] — Industry trends report (2026) + +--- + +## Theme C: AI-Assisted Discovery of Reductions & Complexity Results + +**The most directly relevant theme.** DeepMind's evolutionary approach — FunSearch [RomeraParedes2023FunSearch] (Nature 2023) followed by AlphaEvolve [Novikov2025AlphaEvolve] (2025) — demonstrates that LLM-powered program search can discover genuinely novel mathematical constructions. The breakthrough application to complexity theory: AlphaEvolve discovered new gadget reductions proving improved NP-hardness bounds for MAX-3-CUT (0.9649), MAX-4-CUT (0.987), and metric TSP (111/110) [Nagda2025ReinforcedGeneration]. + +**Key insight from Nagda et al.:** Verifying AI-discovered gadgets can be exponentially costly — they used AlphaEvolve itself to evolve faster verification procedures (10,000x speedup). This mirrors our project's need for automated reduction verification. + +On the formal verification side, URSA [Janicic2025URSA] uses SAT solvers to verify NP-complete reductions — a complementary approach to LLM-based discovery. ALE-Bench [Imajuku2025ALEBench] benchmarks coding agents on NP-hard optimization (competitive with top-100 human contestants). + +**Active groups:** Google DeepMind (FunSearch, AlphaEvolve), Sakana AI (ALE-Bench), University of Belgrade (URSA). + +### Papers +- [Nagda2025ReinforcedGeneration] — AlphaEvolve discovers new NP-hardness gadgets (2025) +- [Novikov2025AlphaEvolve] — AlphaEvolve: evolutionary coding agent (2025) +- [RomeraParedes2023FunSearch] — FunSearch: LLM program search discovers cap set constructions (Nature 2023) +- [Imajuku2025ALEBench] — ALE-Bench: agents vs humans on NP-hard optimization (2025) +- [Janicic2025URSA] — URSA: SAT-based verification of NP-complete reductions (2025) + +--- + +## Theme D (subset): Physics-Inspired QUBO/Ising Approaches + +GNNs trained via QUBO Hamiltonian relaxation can solve MIS, MaxCut, MinVC at million-variable scale [Schuetz2022PhysicsGNN]. QUBO serves as a unifying target representation for combinatorial optimization — directly paralleling this project's use of QUBO as a central reduction hub. Quantum annealing + GNN hybrid approaches show promise for TSP [He2024QuantumTSP]. + +### Papers +- [Schuetz2022PhysicsGNN] — Physics-inspired GNN for QUBO problems (Nature Machine Intelligence 2022) +- [He2024QuantumTSP] — Quantum annealing + GNN for TSP via QUBO (2024) + +--- + +## Theme E: LLM-Assisted Formal Verification & Program Synthesis + +End-to-end formally verified code generation remains largely unsolved. The largest benchmark (VeriCoding [Bursuc2025VeriCoding]) shows 27% success in Lean, 44% in Verus/Rust, 82% in Dafny. The curated CLEVER benchmark [Thakur2025CLEVER] reports near-zero success on 161 hard problems. VeriBench [Miranda2025VeriBench] finds that only self-optimizing agent architectures achieve meaningful compilation rates (~90%). + +For C programs specifically, the CoqPL/SYNVER line of work [Mukherjee2025CoqPL], [Mukherjee2025SynVer] demonstrates a two-LLM pipeline: one generates candidates, one generates Coq proofs. This pattern (generate + verify) is the emerging paradigm. + +**Active groups:** MIT/Tegmark (VeriCoding), UT Austin/Caltech (CLEVER), Purdue (SYNVER), Stanford/ICML workshop (VeriBench). + +### Papers +- [Bursuc2025VeriCoding] — VeriCoding: 12,504 formal specs across Lean/Dafny/Verus (2025) +- [Thakur2025CLEVER] — CLEVER: curated Lean verification benchmark (2025) +- [Miranda2025VeriBench] — VeriBench: end-to-end Lean 4 benchmark (2025) +- [Mukherjee2025CoqPL] — Automated verification of LLM-synthesized C (CoqPL 2025) +- [Mukherjee2025SynVer] — SYNVER: synthesis + Coq proof generation (ASE 2025) + +--- + +## Key Open Problems + +1. **Automated gadget discovery at scale** — AlphaEvolve works but verification is exponentially costly; can we build faster feedback loops? +2. **End-to-end reduction pipelines** — No system yet discovers a reduction, implements it, AND formally verifies correctness +3. **Long-horizon agent capability** — Agents fail at ~80% of multi-file, multi-step tasks (the kind needed for implementing reductions) +4. **Verified code generation** — Only 27% success on formal specs in Lean; major bottleneck for trustworthy AI-discovered reductions +5. **QUBO as universal target** — Can GNN/physics-inspired solvers be integrated into a reduction-aware optimization pipeline? + +## Key Bottlenecks + +1. **Verification cost** — Checking candidate gadgets/reductions is often exponentially expensive +2. **Specification gap** — LLMs struggle to produce formal specs from informal mathematical descriptions +3. **Agent scaffolding** — No standard architecture for combining code generation + formal verification + domain-specific evaluation +4. **Benchmark coverage** — No benchmark specifically targets reduction implementation and verification diff --git a/docs/paper/arxiv/figures/problemtree.typ b/docs/paper/arxiv/figures/problemtree.typ new file mode 100644 index 00000000..a5b1d643 --- /dev/null +++ b/docs/paper/arxiv/figures/problemtree.typ @@ -0,0 +1,198 @@ +#import "@preview/cetz:0.4.2": canvas, draw + +#set page(width: auto, height: auto, margin: 8pt) +#set text(size: 8pt, font: "New Computer Modern") + +// Color palette (matching NSFC figure's feel, adapted for English paper) +#let col-platform = rgb("#1A5276") +#let col-platform-fill = rgb("#D4E6F1") +#let col-human = rgb("#1E8449") +#let col-human-fill = rgb("#D5F5E3") +#let col-ai = rgb("#7D3C98") +#let col-ai-fill = rgb("#E8DAEF") +#let col-edge = rgb("#5D6D7E") +#let col-dash = rgb("#ABB2B9") + +#canvas(length: 0.5cm, { + import draw: * + + // ============================================================ + // LEVEL 0 — Hardware native problems (solver backends) + // ============================================================ + let plat-w = 3.4 + let plat-h = 0.9 + + rect((-4.5, -0.45), (-4.5 + plat-w, 0.45), + fill: col-platform-fill, stroke: 1.2pt + col-platform, radius: 4pt, name: "udmis") + content("udmis", text(9pt, weight: "bold", fill: col-platform, "UD-MIS on grids")) + content((-2.8, -0.85), text(6.5pt, fill: col-platform.lighten(20%), "(Rydberg atom arrays)")) + + rect((1.1, -0.45), (1.1 + plat-w, 0.45), + fill: col-platform-fill, stroke: 1.2pt + col-platform, radius: 4pt, name: "qubo") + content("qubo", text(9pt, weight: "bold", fill: col-platform, "QUBO")) + content((2.8, -0.85), text(6.5pt, fill: col-platform.lighten(20%), "(D-Wave annealers)")) + + // Also show ILP as a third backend + rect((6.5, -0.45), (6.5 + plat-w, 0.45), + fill: col-platform-fill, stroke: 1.2pt + col-platform, radius: 4pt, name: "ilp") + content("ilp", text(9pt, weight: "bold", fill: col-platform, "ILP")) + content((8.2, -0.85), text(6.5pt, fill: col-platform.lighten(20%), "(Gurobi, CPLEX)")) + + // ============================================================ + // LEVEL 1 — Human-implemented reductions (~40 rules) + // ============================================================ + let hnode(pos, label, name: none) = { + let n = if name != none { name } else { label } + rect( + (pos.at(0) - 0.9, pos.at(1) - 0.28), + (pos.at(0) + 0.9, pos.at(1) + 0.28), + fill: col-human-fill, stroke: 0.7pt + col-human.lighten(20%), + radius: 3pt, name: n, + ) + content(n, text(6.5pt, weight: "bold", fill: col-human.darken(30%), label)) + } + + // Left subtree → UD-MIS + hnode((-6.0, 2.0), "Set Packing", name: "sp") + hnode((-4.8, 3.2), "Vertex Cover", name: "vc") + hnode((-3.4, 1.7), "MIS", name: "mis") + hnode((-1.6, 3.2), "3-SAT", name: "sat") + hnode((-0.4, 2.2), "Clique", name: "clq") + + // Middle/Right subtree → QUBO / ILP + hnode((1.0, 1.7), "MAX-CUT", name: "mc") + hnode((2.8, 3.4), "Graph Coloring", name: "gc") + hnode((3.6, 2.0), "Set Cover", name: "sc") + hnode((5.8, 3.0), "Hamilton Cycle", name: "hc") + hnode((7.2, 1.9), "Num. Partition", name: "np") + hnode((8.8, 3.2), "Factoring", name: "fac") + + // Reduction edges (downward = reduction direction) + let redge(from, to) = { + line(from, to, + stroke: (paint: col-edge, thickness: 0.5pt), + mark: (end: "straight", scale: 0.35)) + } + + redge("mis.south", "udmis.north") + redge("clq.south", "udmis.north") + redge("sp.south", "udmis.north") + redge("sat.south", "mis.north") + redge("vc.south", "mis.north") + + redge("mc.south", "qubo.north") + redge("gc.south", "qubo.north") + redge("sc.south", "qubo.north") + redge("hc.south", "qubo.north") + redge("np.south", "ilp.north") + redge("fac.south", "ilp.north") + + // Cross-reductions + let dredge(from, to) = { + line(from, to, + stroke: (paint: col-edge.lighten(30%), thickness: 0.4pt, dash: "densely-dashed"), + mark: (end: "straight", scale: 0.3)) + } + dredge("vc.south", "qubo.north") + dredge("sat.south", "ilp.north") + dredge("gc.south", "ilp.north") + dredge("mc.south", "ilp.north") + dredge("clq.south", "ilp.north") + + // ============================================================ + // DIVIDER — boundary between current and AI-scaled + // ============================================================ + line((-7.0, 4.1), (10.5, 4.1), + stroke: (paint: col-dash, thickness: 0.7pt, dash: "dashed")) + + // ============================================================ + // LEVEL 2 — AI-synthesized reductions (~100+ rules) + // Staggered dot grid forming a canopy shape + // ============================================================ + let ai-dot(x, y) = { + circle((x, y), radius: 0.08, + fill: col-ai.lighten(60%), stroke: 0.2pt + col-ai.lighten(40%)) + } + + // Row 1 (y=4.6) + for x in (-5.5, -4.0, -2.5, -1.0, 0.5, 2.0, 3.5, 5.0, 6.5, 8.0) { ai-dot(x, 4.6) } + // Row 2 (y=5.0) + for x in (-6.2, -4.7, -3.2, -1.7, -0.2, 1.3, 2.8, 4.3, 5.8, 7.3, 8.8) { ai-dot(x, 5.0) } + // Row 3 (y=5.4) + for x in (-6.0, -4.5, -3.0, -1.5, 0.0, 1.5, 3.0, 4.5, 6.0, 7.5, 8.5) { ai-dot(x, 5.4) } + // Row 4 (y=5.8) + for x in (-5.5, -4.0, -2.5, -1.0, 0.5, 2.0, 3.5, 5.0, 6.5, 8.0) { ai-dot(x, 5.8) } + // Row 5 (y=6.2) + for x in (-4.8, -3.3, -1.8, -0.3, 1.2, 2.7, 4.2, 5.7, 7.2) { ai-dot(x, 6.2) } + // Row 6 (y=6.6) + for x in (-3.8, -2.3, -0.8, 0.7, 2.2, 3.7, 5.2, 6.5) { ai-dot(x, 6.6) } + // Row 7 (y=7.0) + for x in (-2.5, -1.0, 0.5, 2.0, 3.5, 5.0) { ai-dot(x, 7.0) } + // Row 8 (y=7.4) + for x in (-1.0, 0.5, 2.0, 3.5) { ai-dot(x, 7.4) } + + // Representative labeled problems in AI layer + let ai-label(pos, label, n) = { + rect( + (pos.at(0) - 0.85, pos.at(1) - 0.2), + (pos.at(0) + 0.85, pos.at(1) + 0.2), + fill: col-ai-fill, stroke: 0.3pt + col-ai.lighten(30%), + radius: 2pt, name: n, + ) + content(n, text(5pt, weight: "bold", fill: col-ai.lighten(-20%), label)) + } + + ai-label((-5.2, 4.8), "Scheduling", "ai-sched") + ai-label((7.5, 4.8), "TSP", "ai-tsp") + ai-label((-2.5, 5.5), [$k$-SAT], "ai-ksat") + ai-label((3.0, 5.5), "Steiner Tree", "ai-steiner") + ai-label((0.5, 6.8), "Bin Packing", "ai-binp") + ai-label((-4.0, 6.4), "Dom. Set", "ai-domset") + ai-label((5.5, 6.4), [Max $k$-Cut], "ai-mkcut") + + // Faint edges from AI layer down to human layer + let aedge(from, to) = { + line(from, to, + stroke: (paint: col-ai.lighten(50%), thickness: 0.3pt), + mark: (end: "straight", scale: 0.2)) + } + aedge((-4.0, 4.35), "sat.north") + aedge((-1.0, 4.35), "vc.north") + aedge((1.3, 4.35), "gc.north") + aedge((2.8, 4.35), "mc.north") + aedge((5.8, 4.35), "hc.north") + aedge((8.0, 4.35), "np.north") + + // Ellipsis + content((1.5, 7.7), text(12pt, fill: col-ai.lighten(30%), $dots.c$)) + + // ============================================================ + // ANNOTATIONS — right-side braces + // ============================================================ + // Hardware native + on-layer(-1, { + // Use simple bracket lines instead of decorations + let bx = 10.8 + + // Hardware brace + line((bx, -0.5), (bx + 0.3, -0.5), stroke: 0.6pt + col-platform.lighten(20%)) + line((bx + 0.3, -0.5), (bx + 0.3, 0.5), stroke: 0.6pt + col-platform.lighten(20%)) + line((bx, 0.5), (bx + 0.3, 0.5), stroke: 0.6pt + col-platform.lighten(20%)) + content((bx + 0.6, 0.0), anchor: "west", text(7pt, fill: col-platform, weight: "bold", + [Hardware-native\ problems])) + + // Human brace + line((bx, 1.0), (bx + 0.3, 1.0), stroke: 0.6pt + col-human.lighten(20%)) + line((bx + 0.3, 1.0), (bx + 0.3, 3.8), stroke: 0.6pt + col-human.lighten(20%)) + line((bx, 3.8), (bx + 0.3, 3.8), stroke: 0.6pt + col-human.lighten(20%)) + content((bx + 0.6, 2.4), anchor: "west", text(7pt, fill: col-human.darken(10%), weight: "bold", + [Human-implemented\ #text(6pt)[$tilde.op$40 reduction rules]])) + + // AI brace + line((bx, 4.3), (bx + 0.3, 4.3), stroke: 0.6pt + col-ai.lighten(20%)) + line((bx + 0.3, 4.3), (bx + 0.3, 7.5), stroke: 0.6pt + col-ai.lighten(20%)) + line((bx, 7.5), (bx + 0.3, 7.5), stroke: 0.6pt + col-ai.lighten(20%)) + content((bx + 0.6, 5.9), anchor: "west", text(7pt, fill: col-ai.darken(10%), weight: "bold", + [Agent-synthesized\ #text(6pt)[$tilde.op$100+ new rules]])) + }) +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 60c81906..c208ebf0 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -19,7 +19,7 @@ \maketitle \begin{abstract} -AI coding agents achieve 70--80\% on single-issue benchmarks like SWE-Bench Verified, but their success rate drops below 25\% on long-horizon software evolution tasks that demand sustained mathematical reasoning across many files. We address this gap by decomposing agentic coding into two complementary roles: human-creative work (designing reduction proofs, choosing algorithms, writing specifications) and agent-managed execution (scaffolding, testing, verification, and integration). Our method centers on a library of 13 reusable skills---from issue triage through implementation to multi-layered review---orchestrated by a coding agent within a Rust library for NP-hard problem reductions. A 7-layer verification stack (type checking, unit tests, brute-force cross-validation, closed-loop reduction tests, integration tests, coverage enforcement, and CI/CD) catches errors at increasing levels of abstraction. Applying this methodology over seven weeks of active development produced 24 problem types, 40 reduction rule implementations, and 52 edges in a typed reduction graph, all with $>$95\% test coverage. We contribute the skill-based decomposition methodology, the verification stack design, and the open-source artifact as a benchmark for agentic mathematical software engineering. +AI coding agents achieve 70--80\% on single-issue benchmarks like SWE-Bench Verified, but their success rate drops below 25\% on long-horizon software evolution tasks that demand sustained mathematical reasoning across many files. We address this gap by decomposing agentic coding into two complementary roles: human-creative work (designing reduction proofs, choosing algorithms, writing specifications) and agent-managed execution (scaffolding, testing, verification, and integration). Our method centers on a library of 12 reusable skills---from issue triage through implementation to multi-layered review---orchestrated by a coding agent within a Rust library for NP-hard problem reductions. A 7-layer verification stack (type checking, unit tests, brute-force cross-validation, closed-loop reduction tests, integration tests, coverage enforcement, and CI/CD) catches errors at increasing levels of abstraction. Applying this methodology over nine weeks of active development produced 27 problem types, 50 reduction rule implementations, and 52 edges in a typed reduction graph, all with $>$95\% test coverage. Agent session data reveals a 15:1 automation amplification ratio (assistant-to-user messages), with the automated issue quality gate rejecting 75\% of 322~batch-submitted issues as incomplete, incorrect, or trivial. We contribute the skill-based decomposition methodology, the verification stack design, and the open-source artifact as a benchmark for agentic mathematical software engineering. \end{abstract} \section{Introduction}\label{sec:intro} @@ -44,7 +44,7 @@ \section{Introduction}\label{sec:intro} Our multi-layered verification stack addresses precisely this challenge: rather than attempting end-to-end formal verification (which remains largely unsolved for complex mathematical code~\cite{Thakur2025CLEVER, Bursuc2025VeriCoding}), we compose multiple lightweight verification mechanisms that collectively catch errors across different abstraction levels. We instantiate this methodology in the domain of NP-hard problem reductions, implemented as an open-source Rust library. -The library manages a typed reduction graph connecting 24 problem types through 40 hand-coded reduction rule implementations and 52 total directed edges (including 12 edges inferred from a type-parameter subtype lattice). +The library manages a typed reduction graph connecting 27 problem types through 50 hand-coded reduction rule implementations and 52 total directed edges (including 12 edges inferred from a type-parameter subtype lattice). The domain serves as a Goldilocks testbed for studying agentic coding: each reduction is self-contained (50--200 lines of code), requires non-trivial mathematical reasoning about the mapping between problem structures, yet admits a fully automatable correctness criterion---reduce an instance, solve the target problem by brute force, extract the solution back, and verify it against the source. This combination of mathematical depth with mechanical verifiability makes it possible to study how agents perform on tasks that are individually tractable but collectively demand sustained engineering discipline. @@ -56,7 +56,7 @@ \section{Introduction}\label{sec:intro} The key insight is that the two human roles contribute \emph{judgment}---which reductions matter, what quality bar to enforce---while the agent handles \emph{volume}---executing the mechanical steps reliably and repeatedly. Industry data supports this division: a recent industry report finds that developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding} (we note this is a vendor report, though its findings are consistent with independent surveys of developer AI adoption). -We organize our agent's capabilities into a library of 13~skills spanning five functional categories: orchestration (pipeline management and issue dispatch), implementation (adding models and reduction rules), quality gates (issue checking, redundancy analysis, multi-agent review, and CI repair), documentation (generating formal problem definitions and reduction theorems with proof sketches), and release management. +We organize our agent's capabilities into a library of 12~skills spanning five functional categories: orchestration (pipeline management and issue dispatch), implementation (adding models and reduction rules), quality gates (issue checking, redundancy analysis, multi-agent review, and CI repair), documentation (generating formal problem definitions and reduction theorems with proof sketches), and release management. A two-stage card-based pipeline automates the progression from issue to merged code: the first stage picks a ``Ready'' issue, implements it in an isolated git worktree, and produces a pull request; the second stage addresses review comments, fixes CI failures, and prepares the PR for human merge. The human maintainer touches only two transitions in this pipeline---moving an issue from Backlog to Ready (the strategic decision of \emph{what} to work on) and merging the final pull request (the quality gate of \emph{whether} the work meets standards). Everything in between is agent-managed. @@ -70,7 +70,7 @@ \section{Introduction}\label{sec:intro} \begin{itemize} \item A \textbf{skill-based methodology} for decomposing mathematical coding tasks into agent-manageable steps, with a concrete skill library and a card-based orchestration pipeline that separates human judgment from agent execution. \item A \textbf{multi-layered verification stack} comprising seven complementary mechanisms---from type-level enforcement through materialized fixtures to agentic review and documentation-as-verification---that collectively ensure correctness of agent-generated mathematical code. - \item A \textbf{verified open-source artifact}: a Rust library implementing 24~NP-hard problem types, 40~reduction rules, and 52~graph edges, all with ${>}95\%$ test coverage, serving as both a practical tool for reduction-based problem solving and a benchmark for evaluating agentic mathematical software engineering. + \item A \textbf{verified open-source artifact}: a Rust library implementing 27~NP-hard problem types, 50~reduction rules, and 52~graph edges, all with ${>}95\%$ test coverage, serving as both a practical tool for reduction-based problem solving and a benchmark for evaluating agentic mathematical software engineering. \end{itemize} The remainder of this paper is organized as follows. @@ -90,7 +90,7 @@ \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} NP-hard problem reductions occupy a sweet spot---a Goldilocks domain---that avoids this limitation while remaining mathematically demanding. \paragraph{Self-contained, formally specified, mechanically verifiable.} -Each reduction in our library is a self-contained module: the 40~implementations range from 58 to 444 lines of code, with a median of 129~LOC. +Each reduction in our library is a self-contained module: the 50~implementations range from 58 to 444 lines of code, with a median of 129~LOC. Despite this bounded scope, each reduction requires non-trivial mathematical reasoning about the structural relationship between two problem formulations. Crucially, every reduction admits a fully automatable correctness criterion: given a source instance, reduce it to the target problem, solve the target by brute force, extract the solution back through the reduction's inverse map, and verify that it is valid (and optimal) for the source. This \emph{round-trip test} provides a ground-truth oracle that requires no human judgment to evaluate, yet exercises the full mathematical content of the reduction. @@ -100,16 +100,17 @@ \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} Unlike SWE-Bench, where every issue is structurally unique, reductions form a homogeneous task family. Every reduction implements the same trait (\texttt{ReduceTo}), follows the same file-naming convention, requires the same test structure (closed-loop round-trip), and produces the same artifacts (overhead expressions, example code, documentation entry). This homogeneity enables fair comparison across tasks---we can meaningfully ask whether an agent performs better on graph-to-graph reductions than on formula-to-graph reductions, or whether reduction complexity (measured in LOC or graph blowup) predicts first-attempt success rate. -It also enables \emph{reusable skills}: a single ``add-rule'' skill handles all 40~reductions, because the workflow is structurally identical even when the mathematical content varies. +It also enables \emph{reusable skills}: a single ``add-rule'' skill handles all 50~reductions, because the workflow is structurally identical even when the mathematical content varies. \paragraph{Hardware solvers as practical motivation.} -The reduction graph is not merely an academic exercise; it serves as a \emph{compilation layer} connecting abstract problem formulations to physical hardware. +The reduction graph is not merely an academic exercise; it serves as a \emph{compilation layer} connecting abstract problem formulations to physical hardware (\Cref{fig:reduction-graph}). Rydberg atom arrays natively solve Maximum Independent Set (MIS) by encoding graph vertices as atoms and edges as blockade constraints~\cite{lucas2014, pichler2018}. D-Wave quantum annealers solve Quadratic Unconstrained Binary Optimization (QUBO) and Ising spin glass problems through quantum tunneling~\cite{glover2019}. -A verified reduction graph lets these specialized processors tackle a far larger class of problems: reduce Satisfiability to MIS and run the result on Rydberg atoms; reduce MaxCut through SpinGlass to QUBO and submit to D-Wave. -The graph in \Cref{fig:reduction-graph} makes this compilation structure explicit. -MIS serves as the dominant hub, with the highest in-degree (14~incoming edges) and out-degree (13~outgoing edges) among all 24~problem types---reflecting its central role as both a target for hardware solvers and a source for further reductions. -ILP, with 11~incoming edges, functions as a universal algebraic target: any problem that reduces to ILP gains access to mature commercial solvers (Gurobi, CPLEX). +Commercial ILP solvers (Gurobi, CPLEX) accept integer linear programs. +Each solver ``speaks'' only its native problem language, but a verified reduction graph lets them collectively tackle a far larger class of problems: reduce Satisfiability to MIS and run the result on Rydberg atoms; reduce MaxCut through SpinGlass to QUBO and submit to D-Wave. +MIS serves as the dominant hub, with the highest in-degree (14~incoming edges) and out-degree (13~outgoing edges) among all 27~problem types---reflecting its central role as both a target for hardware solvers and a source for further reductions. +ILP, with 11~incoming edges, functions as a universal algebraic target. +The bottom layer of \Cref{fig:reduction-graph} shows these three solver backends; the middle layer shows the 50~human-implemented reduction rules connecting known problem types; and the top layer visualizes the scaling vision---agent-synthesized rules extending the graph to 100+ problem types, the compilation infrastructure needed for a general-purpose reduction compiler. \paragraph{Real-world applications.} The problems in our graph arise directly in industrial settings. @@ -122,8 +123,8 @@ \section{Why Reductions? The Goldilocks Domain}\label{sec:domain} \begin{figure*}[t] \centering - \includegraphics[width=\textwidth]{figures/reduction-graph.pdf} - \caption{The reduction graph: 24 problem types connected by 52 directed edges (40 implemented reductions + 12 inferred variant edges). Node categories---graph (blue), formula (orange), set (green), algebraic (purple), misc (gray)---reflect input structure. MIS is the dominant hub with the highest in-degree (14) and out-degree (13). ILP serves as a universal algebraic target (in-degree 11).} + \includegraphics[width=0.88\textwidth]{figures/problemtree.pdf} + \caption{The reduction network as compilation infrastructure. \textbf{Bottom}: hardware-native problems (UD-MIS on Rydberg atom arrays, QUBO on D-Wave annealers, ILP on commercial solvers). \textbf{Middle}: 27 problem types connected by 50~human-implemented reduction rules (solid arrows) with cross-type reductions (dashed). MIS is the dominant hub (in-degree 14, out-degree 13). \textbf{Top}: the scaling vision---agent-synthesized reductions extending coverage to 100+ NP-hard problem types. The dashed line marks the boundary between current human-implemented rules and future agent-discovered rules.} \label{fig:reduction-graph} \end{figure*} @@ -161,7 +162,7 @@ \section{System Architecture}\label{sec:architecture} This design choice has a direct consequence for verification. The closed-loop test pattern---reduce, solve the target, extract the solution, verify against the source---requires no per-reduction test logic beyond constructing a source instance. The test harness calls \lstinline{reduce_to()}, invokes the brute-force solver on the target, calls \lstinline{extract_solution()}, and checks the result via \lstinline{evaluate()}. -All 40~reduction implementations share this identical test structure, which is why a single ``add-rule'' skill can handle every reduction in the library. +All 50~reduction implementations share this identical test structure, which is why a single ``add-rule'' skill can handle every reduction in the library. \paragraph{Compile-time overhead validation.} Every reduction must declare how the target problem's size relates to the source problem's size through an \lstinline{overhead} attribute on the \lstinline{#[reduction]} proc macro: @@ -260,7 +261,7 @@ \subsection{Skills as Agent Functions}\label{sec:skill-inventory} Second, skills are \emph{compositional}: orchestration skills invoke implementation skills, which invoke quality gates, forming multi-level workflows that no single prompt could express. The closest analogue is the ``runbook'' concept in DevOps, but adapted for an agent executor rather than a human operator. -Our library comprises 13~skills organized into five functional categories (\Cref{tab:skills}). +Our library comprises 12~skills organized into five functional categories (\Cref{tab:skills}). \begin{table}[t] \caption{Skills inventory. Steps = numbered steps in the skill script. Success rate (first-attempt CI pass, to be measured from git history) is pending systematic audit.}\label{tab:skills} @@ -273,7 +274,6 @@ \subsection{Skills as Agent Functions}\label{sec:skill-inventory} \texttt{project-pipeline} & Orchestration & 7 & TBD \\ \texttt{review-pipeline} & Orchestration & 8 & TBD \\ \texttt{issue-to-pr} & Orchestration & 7 & TBD \\ -\texttt{meta-power} & Orchestration & 4 & TBD \\ \midrule \texttt{add-model} & Implementation & 7 & TBD \\ \texttt{add-rule} & Implementation & 6 & TBD \\ @@ -291,12 +291,11 @@ \subsection{Skills as Agent Functions}\label{sec:skill-inventory} \end{tabular} \end{table} -\paragraph{Orchestration skills (4).} +\paragraph{Orchestration skills (3).} These skills implement the agent-as-manager role. \texttt{project-pipeline} is the primary automation entry point: it picks a ``Ready'' issue from the GitHub Project board, creates an isolated git worktree, invokes \texttt{issue-to-pr} to produce a pull request, and moves the card to the ``review-agentic'' column. \texttt{review-pipeline} handles the second stage: it picks a PR from ``review-agentic,'' addresses Copilot review comments via \texttt{fix-pr}, runs agentic feature tests, fixes CI failures (up to three retries), and moves the card to ``In Review'' for human merge. \texttt{issue-to-pr} is the per-issue workhorse invoked by \texttt{project-pipeline}: it fetches the issue, verifies it has passed the \texttt{check-issue} quality gate, researches cited references, writes an implementation plan, creates a PR, and optionally executes the plan by dispatching to the appropriate implementation skill. -\texttt{meta-power} provides batch processing, resolving all open issues in dependency order (models before rules); it is being superseded by the more granular pipeline skills. \paragraph{Implementation skills (2).} \texttt{add-model} and \texttt{add-rule} encode the complete workflow for adding a new problem type or reduction rule, respectively. @@ -491,25 +490,70 @@ \subsection{Ablation: Skill-Based vs.\ Raw Agent}\label{sec:ablation} \subsection{Git History Mining}\label{sec:mining} -We analyze the complete git and pull request history of the \texttt{problem-reductions} repository to characterize the project's evolution and the types of errors encountered during development. -The repository contains 58~merged pull requests spanning approximately seven weeks of development, authored by three contributors (two primary, one additional). +We analyze the complete git and pull request history of the \texttt{problem-reductions} repository, supplemented by session metadata from the Claude Code development environment, to characterize the project's evolution and the types of errors encountered during development. +The repository contains 59~merged pull requests and 253~commits on main spanning nine weeks of development (January~9 to March~13, 2026), authored by four contributors. + +\paragraph{Development metrics.} +The Claude Code session metadata reveals the scale of agent involvement. +Across 283~recorded sessions (300~MB of conversation transcripts), the agent produced 9,429~assistant messages in response to 630~user messages---a \textbf{15:1 automation amplification ratio}. +Of the 1,089~commits across all branches, 1,510~carry a \texttt{Co-Authored-By: Claude} trailer (the count exceeds total commits because branch commits are squash-merged on main). +The average session involved 5.8~user messages and 51~tool calls, with measured wall-clock time totaling 115~hours across the 108~sessions with timing metadata. + +The codebase grew rapidly under this agentic workflow: +\begin{center} +\small +\begin{tabular}{@{}lcccc@{}} +\toprule +Date & Models & Rules & Tests & Examples \\ +\midrule +Jan 10 (initial) & 17 & 0 & 0 & 0 \\ +Jan 26 (feature parity) & 20 & 22 & 0 & 1 \\ +Feb 15 (arch.\ redesign) & 21 & 44 & 101 & 35 \\ +Mar 13 (current) & 27 & 50 & 114 & 45 \\ +\bottomrule +\end{tabular} +\end{center} +The most intensive development day (January~25) produced 41~commits across 22~sessions with 12,868~messages and 3,734~tool calls---this was the feature parity sprint that ported all reduction rules from the predecessor Julia package in a single day, growing the codebase from 36~to 74~Rust source files. \paragraph{Development phases.} We stratify the history into three phases reflecting the evolution of agent tooling: \begin{itemize} \item \textbf{Phase~1 (Manual)}: 35~PRs. Skills had not yet been developed; all implementation, testing, and review was performed manually or with ad-hoc agent assistance. This phase established the core library architecture and the majority of the reduction rules. \item \textbf{Phase~2 (Basic skills)}: 9~PRs. Initial skills for model and rule implementation were available, providing structured workflows but without full pipeline automation. Two new problem models (ClosestVectorProblem, BinPacking) were added during this phase. - \item \textbf{Phase~3 (Full pipeline)}: 14~PRs. The complete skill library was operational, including orchestration skills (\texttt{issue-to-pr}, \texttt{meta-power}), quality gates (\texttt{check-issue}, \texttt{check-rule-redundancy}), and multi-agent review (\texttt{review-implementation}). New models (Knapsack, GraphPartitioning) and rules (KSatisfiability $\to$ SubsetSum) were added with full pipeline support. + \item \textbf{Phase~3 (Full pipeline)}: 15~PRs. The complete skill library was operational, including orchestration skills (\texttt{project-pipeline}, \texttt{review-pipeline}, \texttt{issue-to-pr}), quality gates (\texttt{check-issue}, \texttt{check-rule-redundancy}), and multi-agent review (\texttt{review-implementation}). New models (Knapsack, GraphPartitioning, MinimumFeedbackVertexSet) and rules (KSatisfiability $\to$ SubsetSum) were added with full pipeline support. \end{itemize} \paragraph{PR composition.} -Of the 58~merged PRs, 5 are tagged as \texttt{[Model]} PRs (adding new problem types), 2 as \texttt{[Rule]} PRs (adding new reduction rules), and 51 as infrastructure, refactoring, documentation, or tooling PRs. -The low count of explicitly tagged Model and Rule PRs reflects the project's development pattern: the initial feature-parity PRs (e.g., PR~\#4 ``Feature parity with ProblemReductions.jl'' and PR~\#7 ``Implement remaining reduction rules'') bundled multiple models and rules into single large PRs before the tagging convention was established. -The 40~reduction rule implementations and 24~problem types in the final library were built incrementally across many PRs that also included refactoring, testing, and documentation work. +Of the 59~merged PRs, 10 follow the \texttt{Fix \#N} issue-driven pattern (created by the \texttt{issue-to-pr} skill), 16 are \texttt{feat:} PRs, and the remainder are infrastructure, refactoring, documentation, or tooling PRs. +The initial feature-parity PRs (e.g., PR~\#4 ``Feature parity with ProblemReductions.jl'') bundled multiple models and rules into single large PRs before the per-issue convention was established. +All PRs are attributed to human GitHub accounts because the agent operates through the developer's local environment (Claude Code runs in the terminal), making the distinction between human-authored and agent-assisted work invisible in git metadata---itself a finding about observability limitations of current agentic workflows. + +\paragraph{Issue quality gate.} +The \texttt{check-issue} skill was deployed at scale when a contributor batch-submitted 414~issues proposing new problem models and reduction rules---including 251~in a single day. +Of the 322~issues quality-checked, only \textbf{81 (25\%) passed} all checks: +\begin{center} +\small +\begin{tabular}{@{}lr@{}} +\toprule +Verdict & Count (\%) \\ +\midrule +Good & 81 (25\%) \\ +PoorWritten (incomplete specification) & 124 (39\%) \\ +Wrong (factually incorrect) & 64 (20\%) \\ +Trivial (obvious reduction) & 43 (13\%) \\ +Useless (no practical value) & 18 (6\%) \\ +\bottomrule +\end{tabular} +\end{center} +The \textbf{75\% rejection rate} demonstrates the necessity of automated quality gates: without \texttt{check-issue}, the pipeline would waste agent resources implementing incorrect or trivial reductions. +The most common failure mode was \emph{PoorWritten}---issues that lacked complete mathematical specifications or worked examples, making them unimplementable even by a skilled agent. +The \emph{Wrong} category (20\%) included citations to non-existent papers, incorrect complexity claims, and reductions that do not preserve solution structure---errors that would have been expensive to discover during implementation rather than at the issue-triage stage. -All 58~PRs were authored through human GitHub accounts. -This reflects the operational reality of our methodology: the agent operates through the human's development environment (Claude Code runs locally), so all commits and PRs are attributed to the human author even when the agent performed the implementation. -The distinction between human-authored and agent-assisted work is therefore not visible in the git metadata, which is itself a finding about the observability limitations of current agentic coding workflows. +\paragraph{Interaction evolution.} +Analysis of 2,196~user prompts in the project's Claude Code history reveals a shift from imperative to declarative interaction as the skill library matured. +In Phase~1, the maintainer issued step-by-step commands (``start milestone 2,'' ``improve test coverage to $>$95\%,'' ``fix the clippy test''), averaging 8--12 words per prompt. +By Phase~3, single-command orchestration dominated (``\texttt{make run-pipeline},'' ``\texttt{/review-pipeline}''), with 30\% of prompts consisting of 1--3 words. +This progression---from programming the agent's actions to invoking its skills---mirrors the classical shift from scripting to API design, suggesting that skill engineering is a form of \emph{meta-programming} for agentic workflows. \paragraph{Error taxonomy.} \Cref{tab:errors} categorizes the types of errors encountered during development, mapped to the verification layer that catches each error class. @@ -620,7 +664,7 @@ \section{Related Work}\label{sec:related} VeriBench~\cite{Miranda2025VeriBench} finds that only self-optimizing agent architectures achieve meaningful compilation rates in Lean~4, approaching 90\% but still far from full correctness proofs. For imperative programs, Mukherjee et al.\ demonstrate a two-LLM pipeline where one model generates candidate C programs and another generates Coq proofs of correctness~\cite{Mukherjee2025CoqPL, Mukherjee2025SynVer}---a generate-then-verify pattern that resonates with our layered approach. Our seven-layer verification stack (\Cref{sec:verification}) takes a more pragmatic path: rather than attempting end-to-end formal proofs (which the benchmarks above show remains out of reach for complex code), we compose multiple lightweight verification mechanisms---type-level enforcement, brute-force cross-validation, overhead formula checking, materialized fixtures, and agentic review---that collectively catch errors across different abstraction levels. -The trade-off is clear: we provide less formal guarantee than a machine-checked proof, but our approach is practically effective at catching real errors in agent-generated mathematical code and scales to the 40~reductions in our library without requiring proof engineering expertise. +The trade-off is clear: we provide less formal guarantee than a machine-checked proof, but our approach is practically effective at catching real errors in agent-generated mathematical code and scales to the 50~reductions in our library without requiring proof engineering expertise. \paragraph{Physics-inspired optimization.} Our reduction graph serves as a compilation layer connecting abstract problem formulations to specialized hardware and neural solvers. @@ -659,7 +703,7 @@ \subsection{Limitations} The ablation study (\Cref{sec:evaluation}) provides a controlled comparison within this project, but replication across independent projects and teams remains necessary. \paragraph{Skill engineering cost.} -The 13~skills in our library represent substantial upfront investment. +The 12~skills in our library represent substantial upfront investment. Each skill required iterative refinement---writing the initial markdown script, testing it against real issues, observing agent failure modes, and revising. This cost is amortized across many invocations, but it presupposes a maintainer with both domain expertise and familiarity with agent capabilities. Projects without such a maintainer cannot adopt the methodology directly. @@ -670,7 +714,7 @@ \subsection{Limitations} Each new domain requires its own skill engineering effort. \paragraph{Confounding factors.} -Our project evolved over seven weeks during which both the skill library and the underlying language models improved. +Our project evolved over nine weeks during which both the skill library and the underlying language models improved. Although we address this confound through temporal stratification in our evaluation, we cannot fully disentangle the contribution of better skills from the contribution of more capable models. Future work should control for model version to isolate the skill-based methodology's independent effect. @@ -702,13 +746,15 @@ \subsection{Future Directions} Replacing or supplementing this layer with Lean or Coq proofs---generating formal correctness theorems alongside the Rust implementation---would add an eighth layer providing the strongest possible guarantee. The VeriCoding~\cite{Bursuc2025VeriCoding} and CLEVER~\cite{Thakur2025CLEVER} benchmarks suggest this remains challenging, but the bounded scope and formal specification of individual reductions make them more amenable to automated theorem proving than general software. -Third, \emph{scaling}: can the pipeline maintain quality as the reduction graph grows from 24 to 100+ problem types? -The homogeneous task structure suggests that skills should scale without modification, but the growing graph introduces new challenges---more potential for redundant reductions, longer composite paths to analyze, and a larger surface area for the maintainer to oversee. -Investigating these scaling dynamics, potentially with multiple maintainers and automated priority assignment, is an important next step. +Third, \emph{scaling toward a reduction compiler}: the reduction graph is not merely a library catalog but a \emph{compilation infrastructure} that maps user problems to specialized solvers (see \Cref{fig:reduction-graph}). +Each reduction edge carries a multivariate polynomial cost model---for instance, reducing an $n$-vertex, $m$-edge graph to a triangular lattice MIS costs $O(n^2 + m)$ atom sites---and the optimal compilation path may depend on the problem scale (path~A dominates at small $n$, path~B at large $n$), requiring Pareto-optimal path search. +As the graph scales from 24 to 100+ problem types through agent-synthesized rules (the upper layer in \Cref{fig:reduction-graph}), the system evolves from a library into an end-to-end problem reduction compiler: users describe their combinatorial optimization problem, and the compiler automatically selects the lowest-cost reduction path to the target solver---be it a Rydberg atom array, a quantum annealer, or a commercial ILP solver. +The skill-based methodology presented here provides the knowledge engineering backbone for this compiler: each new reduction rule, verified through the seven-layer stack and implemented by agent-managed skills, adds an edge to the compilation graph. +Investigating the scaling dynamics---redundancy detection, Pareto path search over polynomial costs, and multi-maintainer coordination---is the natural next step. \subsection{Conclusion} -We have presented a skill-based methodology for agentic coding that decomposes mathematical software tasks into human-creative and agent-executable components, validated through a case study producing 24~problem types and 40~reduction rules with multi-layered verification. +We have presented a skill-based methodology for agentic coding that decomposes mathematical software tasks into human-creative and agent-executable components, validated through a case study producing 27~problem types and 50~reduction rules with multi-layered verification. The core insight is that the bottleneck in agentic coding is not agent capability but task decomposition: when work is structured so that each unit is formally specified, bounded in scope, and mechanically verifiable, current agents execute it reliably. The methodology is most powerful in domains that share the Goldilocks property---formal specification, homogeneous tasks, automatable correctness---and we believe such domains are more common than is generally appreciated. diff --git a/docs/paper/arxiv/plan.md b/docs/paper/arxiv/plan.md new file mode 100644 index 00000000..19400005 --- /dev/null +++ b/docs/paper/arxiv/plan.md @@ -0,0 +1,29 @@ +# Agentic Coding to bridge computational hard problems + +## Abstract + +We show how to use AI agents to create a correctness proof package for a reduction rule. Towards reliable and scalable agentic coding. + +## Roles + +- AI agents, coding implementation, agentic testing, documentation +- Maintainers, skill management, key decision making +- Contributors, issue makers, creative parts +- Users, the ultimate test + +## Skills + +1. Validate issues +2. Validate implementation + +## Testing + +Reduce the validation barrier by creating advanced tools. + +``` + documentation -> human verification + | +issue -> rust code -> unit tests & round-trip tests + | + cli -> agentic-tests +``` \ No newline at end of file From eebd704e6d0aacc1028d9c1f6cb52f1bd1df3eed Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Fri, 13 Mar 2026 22:44:20 +0800 Subject: [PATCH 23/38] update --- .claude/skills/academic-research-skills | 1 + .../brainstorm/785-1773296086/.server-info | 1 + .../brainstorm/785-1773296086/.server.pid | 1 + .../785-1773296086/full-design.html | 157 +++ .../785-1773296086/paper-structures.html | 69 ++ .../brainstorm/785-1773296086/waiting.html | 3 + docs/paper/arxiv/figures/topology-issues.typ | 144 +++ docs/paper/arxiv/paper.tex | 1058 +++++++---------- docs/paper/arxiv/writing-guidelines.md | 93 ++ 9 files changed, 914 insertions(+), 613 deletions(-) create mode 160000 .claude/skills/academic-research-skills create mode 100644 .superpowers/brainstorm/785-1773296086/.server-info create mode 100644 .superpowers/brainstorm/785-1773296086/.server.pid create mode 100644 .superpowers/brainstorm/785-1773296086/full-design.html create mode 100644 .superpowers/brainstorm/785-1773296086/paper-structures.html create mode 100644 .superpowers/brainstorm/785-1773296086/waiting.html create mode 100644 docs/paper/arxiv/figures/topology-issues.typ create mode 100644 docs/paper/arxiv/writing-guidelines.md diff --git a/.claude/skills/academic-research-skills b/.claude/skills/academic-research-skills new file mode 160000 index 00000000..79dedce1 --- /dev/null +++ b/.claude/skills/academic-research-skills @@ -0,0 +1 @@ +Subproject commit 79dedce126a25b854616bb3c47d67a57397f0622 diff --git a/.superpowers/brainstorm/785-1773296086/.server-info b/.superpowers/brainstorm/785-1773296086/.server-info new file mode 100644 index 00000000..391dd259 --- /dev/null +++ b/.superpowers/brainstorm/785-1773296086/.server-info @@ -0,0 +1 @@ +{"type":"server-started","port":60547,"host":"127.0.0.1","url_host":"localhost","url":"http://localhost:60547","screen_dir":"/Users/liujinguo/rcode/problemreductions/.claude/worktrees/survey-agentic-reductions/.superpowers/brainstorm/785-1773296086"} diff --git a/.superpowers/brainstorm/785-1773296086/.server.pid b/.superpowers/brainstorm/785-1773296086/.server.pid new file mode 100644 index 00000000..4ab4c5d7 --- /dev/null +++ b/.superpowers/brainstorm/785-1773296086/.server.pid @@ -0,0 +1 @@ +791 diff --git a/.superpowers/brainstorm/785-1773296086/full-design.html b/.superpowers/brainstorm/785-1773296086/full-design.html new file mode 100644 index 00000000..b971c9b0 --- /dev/null +++ b/.superpowers/brainstorm/785-1773296086/full-design.html @@ -0,0 +1,157 @@ +

Full Paper Design

+

~10-12 pages, ICSE/ASE-class venue, Methodology-First (B)

+ +
+

Title (working)

+

+ "Skill-Based Agentic Coding for Mathematical Software: A Case Study in NP-Hard Problem Reductions" +

+

Alternative: "From Cards to Code: Human-Directed Agent Execution for Verified Reduction Libraries"

+
+ +
+

Thesis Statement

+
+

The bottleneck in agentic coding is not agent capability but task decomposition and the division of labor between human creativity and agent execution. We demonstrate a skill-based pipeline where humans (contributors + maintainer) provide judgment — which problems matter, which reductions are useful — while agents handle mechanical execution: implementation, testing, documentation, and review. Applied to NP-hard problem reductions, this produces a verified library of 24 problem types and 52 reductions, with multi-layered correctness guarantees.

+
+
+ +
+

Paper Outline (~10-12 pages)

+
+ +
S1. Introduction (~1.5 pages)
+
+ • Agents hit ~20% on long-horizon tasks — not a capability problem, a decomposition problem
+ • The "review > generation" challenge for mathematical/scientific code
+ • Key insight: reposition humans as creativity source (issues, curation), agents as labor
+ • Three roles: contributors (create issues), maintainer (curate board, write skills), agents (execute)
+ • Contributions: (1) skill-based methodology, (2) verification stack, (3) reduction library artifact +
+ +
S2. Why Reductions? The Goldilocks Domain (~1 page)
+
+ • Each reduction is self-contained (~50-200 LOC), formally specified, independently verifiable
+ • Round-trip correctness criterion: reduce → solve target → extract back → verify against source
+ • Practical value: QUBO as compilation target for quantum annealers, GNN solvers
+ • Contrast with SWE-Bench: homogeneous tasks enable systematic comparison
+ • Figure 1: The reduction graph (24 problems, 52 edges, variant lattice) +
+ +
S3. System Architecture (~2 pages)
+
+ • Rust trait hierarchy: Problem → OptimizationProblem / SatisfactionProblem
+ • ReduceTo<T> trait + ReductionResult for type-safe reductions
+ • Compile-time machinery: overhead expressions, variant registration, complexity strings
+ • The design philosophy: make correctness checkable by construction
+ • Figure 2: System architecture diagram (traits, registry, verification layers) +
+ +
S4. Skill-Based Task Decomposition (~2 pages)
+
+ 4.1 The Three Roles
+ • Contributors: open issues (creative — "is this reduction useful? non-trivial?")
+ • Maintainer: curates project board, writes/evolves skills, moves cards
+ • Agents: pick cards, execute skill pipelines
+
+ 4.2 Skills as Agent Functions
+ • check-issue: validates usefulness, non-triviality, literature correctness
+ • add-model / add-rule: brainstorm → plan → implement → test → review
+ • review-implementation: parallel subagents (structural + quality)
+ • fix-pr: resolve review comments, CI failures, coverage gaps
+
+ 4.3 Card-Based Orchestration
+ • GitHub project board as the coordination mechanism
+ • Manager agent auto-picks a card and drives it through the pipeline
+ • Human moves cards between columns (the creative decision: what to work on next)
+ • Figure 3: Pipeline diagram — issue → check → implement → review → PR → merge +
+ +
S5. Multi-Layered Verification (~1.5 pages)
+
+ 5.1 The Verification Stack
+ • Layer 1: Rust type system — compile-time enforcement of trait contracts
+ • Layer 2: Unit tests — evaluation, serialization, edge cases
+ • Layer 3: Round-trip (closed-loop) tests — reduce, solve, extract, verify
+ • Layer 4: Overhead validation — symbolic expressions checked against actual sizes
+ • Layer 5: Materialized test data — JSON fixtures locked in version control
+ • Layer 6: Agentic review — parallel subagents with fresh context
+ • Layer 7: Documentation — paper entry forces human-readable proof sketch
+
+ 5.2 Why Layers?
+ • Each layer catches different error classes (table: which errors each layer catches)
+ • Materialized data prevents agents from silently changing tests
+ • The "lazy agent" problem: agents take shortest path to close issues
+ • Figure 4: Verification pyramid with error examples at each layer +
+ +
S6. Evaluation (~2 pages)
+
+ 6.1 Git History Mining (quantitative)
+ • How many reductions were agent-implemented vs human-implemented
+ • Success rate per skill invocation (first-attempt pass rate)
+ • Review rounds before merge
+ • Error taxonomy: what went wrong and which layer caught it
+ • Coverage metrics across the codebase (>95% target)
+
+ 6.2 Case Studies (qualitative, 2-3 reductions)
+ • Simple: MVC → MIS — complement relationship, near-trivial mapping
+ • Complex: SAT → MIS — clause-variable gadget, quadratic blowup
+ • Multi-hop: Factoring → CircuitSAT → ILP — chain through circuit encoding
+ • For each: show the full pipeline from issue to merged PR with paper entry
+ • Highlight where human judgment was needed vs. where agent executed autonomously +
+ +
S7. Related Work (~1 page)
+
+ • AI coding agents: SWE-agent, OpenHands, Claude Code, Devin — position vs. our skill approach
+ • AI for reductions: AlphaEvolve gadgets, URSA SAT verification — discovery vs. implementation
+ • Formal verification: VeriCoding, CLEVER — our pragmatic multi-layer alternative
+ • Physics-inspired solvers: QUBO/GNN — our graph as infrastructure for these +
+ +
S8. Discussion & Conclusion (~1 page)
+
+ • Generalizability: what other domains have the "Goldilocks" property?
+ • Limitations: requires upfront skill engineering, domain expertise doesn't transfer
+ • The human value proposition: creativity, judgment, responsibility — not eliminated, repositioned
+ • Future: connecting to AlphaEvolve-style discovery, formal verification integration +
+
+
+ +
+

Key Figures

+
+
+
Fig 1: Reduction Graph
+
24 problem nodes, 52 directed edges, QUBO hub visible. Color-coded by category (graph/formula/set/algebraic/misc). Variant lattice shown as inset.
+
+
+
Fig 2: System Architecture
+
Trait hierarchy + compile-time machinery. Shows how Problem/ReduceTo/Solver traits compose.
+
+
+
Fig 3: Pipeline Diagram
+
Three-role pipeline: contributor → issue → (agent: check) → maintainer moves card → (agent: implement/review) → PR → merge. Human decisions highlighted.
+
+
+
Fig 4: Verification Pyramid
+
7 layers from type system (bottom) to documentation (top). Each layer annotated with example errors it catches.
+
+
+
+ +
+

Key Tables

+
+
+
Table 1: Skills Inventory
+
Each skill with: trigger, inputs, outputs, typical agent turns, first-attempt success rate.
+
+
+
Table 2: Error Taxonomy
+
Error categories × which verification layer caught them. Shows why no single layer suffices.
+
+
+
diff --git a/.superpowers/brainstorm/785-1773296086/paper-structures.html b/.superpowers/brainstorm/785-1773296086/paper-structures.html new file mode 100644 index 00000000..d14ee32b --- /dev/null +++ b/.superpowers/brainstorm/785-1773296086/paper-structures.html @@ -0,0 +1,69 @@ +

Paper Structure: 3 Approaches

+

Full research paper (10-12 pages) for ICSE/ASE-class venue

+ +
+
+
A
+
+

System-First: "Here's What We Built"

+

Lead with the reduction graph as artifact, then explain the agentic pipeline that produced it.

+
+
1. Introduction — NP-hard reductions matter, building them is tedious
+
2. The Reduction Graph — 24 problems, 52 rules, QUBO hub, variant lattice
+
3. System Design — skills, traits, overhead system, verification layers
+
4. Card-Based Workflow — human moves cards, agent picks + executes
+
5. Evaluation — git mining (success rates, error taxonomy) + 3 case studies
+
6. Related Work — agentic coding, AI for reductions, formal verification
+
7. Discussion + Conclusion
+
+
+

Strengths

  • Artifact speaks for itself
  • Accessible to theory + SE audiences
  • Natural figure: the full reduction graph
+

Risks

  • May read as "just a tool paper"
  • Methodology insight buried in section 4
+
+
+
+ +
+
B
+
+

Methodology-First: "Here's How Agents Should Code Math"

+

Lead with the insight that reductions are a Goldilocks domain, then present the skill-based methodology.

+
+
1. Introduction — Agents fail at long-horizon math tasks; why?
+
2. Why Reductions? — Goldilocks: self-contained, formally specified, verifiable
+
3. Skill-Based Decomposition — how skills encode domain knowledge as guardrails
+
4. Verification Stack — 5 layers: types, unit tests, round-trip, overhead, review
+
5. Card-Based Orchestration — graduated trust, human as curator
+
6. Evaluation — git mining + case studies + error taxonomy
+
7. The Artifact — reduction graph, QUBO hub, practical applications
+
8. Related Work + Conclusion
+
+
+

Strengths

  • Clear research contribution
  • Generalizable lessons for other domains
  • Addresses the "verification gap" from survey
+

Risks

  • Artifact feels like an afterthought
  • Harder for theory audience to engage
+
+
+
+ +
+
C
+
+

Narrative: "From Issue to Theorem"

+

Open with a concrete example — one reduction flowing through the entire pipeline — then zoom out.

+
+
1. Introduction — Walk through SAT→MIS: issue → code → test → paper entry
+
2. Problem Setting — reduction rules, why they're hard, why they matter
+
3. System Overview — architecture, roles (human curator + agent executor)
+
4. The Pipeline — skills × verification, card-based orchestration
+
5. Two More Case Studies — simple (MVC→MIS) + complex (Factoring→Circuit)
+
6. Quantitative Results — git mining across all 52 reductions
+
7. Lessons & Limitations — what worked, what didn't, generalizability
+
8. Related Work + Conclusion
+
+
+

Strengths

  • Most engaging to read
  • Case studies front and center
  • Easy to follow even for non-experts
+

Risks

  • Methodology contribution less crisp
  • May feel anecdotal without strong quantitative backing
+
+
+
+
diff --git a/.superpowers/brainstorm/785-1773296086/waiting.html b/.superpowers/brainstorm/785-1773296086/waiting.html new file mode 100644 index 00000000..b07372b1 --- /dev/null +++ b/.superpowers/brainstorm/785-1773296086/waiting.html @@ -0,0 +1,3 @@ +
+

Writing spec document...

+
\ No newline at end of file diff --git a/docs/paper/arxiv/figures/topology-issues.typ b/docs/paper/arxiv/figures/topology-issues.typ new file mode 100644 index 00000000..2df434fc --- /dev/null +++ b/docs/paper/arxiv/figures/topology-issues.typ @@ -0,0 +1,144 @@ +#import "@preview/cetz:0.4.2": canvas, draw + +#set page(width: auto, height: auto, margin: 8pt) +#set text(size: 7pt, font: "New Computer Modern") + +// Colors +#let col-ok = rgb("#4e79a7") // healthy node +#let col-ok-fill = rgb("#d0ddef") +#let col-warn = rgb("#e15759") // problem highlight +#let col-warn-fill = rgb("#fce4e4") +#let col-sat = rgb("#59a14f") // 3-SAT / proof chain +#let col-sat-fill = rgb("#ddf0dd") +#let col-edge = rgb("#5D6D7E") +#let col-redun = rgb("#e8913a") // redundant +#let col-ghost = rgb("#cccccc") + +#let node-r = 0.32 + +#canvas(length: 0.5cm, { + import draw: * + + // Helper: draw a problem node + let pnode(pos, label, col: col-ok, fill: col-ok-fill, name: none, r: node-r) = { + let n = if name != none { name } else { label } + circle(pos, radius: r, fill: fill, stroke: 0.8pt + col, name: n) + content(n, text(6pt, weight: "bold", fill: col.darken(30%), label)) + } + + // Helper: draw a directed edge + let edge(from, to, col: col-edge, thick: 0.5pt, dash: none) = { + line(from, to, + stroke: (paint: col, thickness: thick, dash: dash), + mark: (end: "straight", scale: 0.35)) + } + + // ============================================================ + // PANEL (a): Orphan node + // ============================================================ + let ax = 0.0 + let ay = 0.0 + + // Title + content((ax + 2.5, ay + 4.8), text(8pt, weight: "bold", "(a) Orphan node")) + + // Connected subgraph + pnode((ax + 0.5, ay + 3.5), "SAT", col: col-sat, fill: col-sat-fill, name: "a-sat") + pnode((ax + 2.5, ay + 2.0), "MIS", name: "a-mis") + pnode((ax + 0.5, ay + 0.5), "QUBO", name: "a-qubo") + pnode((ax + 2.5, ay + 3.5), "MVC", name: "a-mvc") + pnode((ax + 4.0, ay + 0.5), "ILP", name: "a-ilp") + + edge("a-sat.south", "a-mis.north") + edge("a-mvc.south", "a-mis.north") + edge("a-mis.south", "a-qubo.north") + edge("a-mis.south", "a-ilp.north") + + // Orphan node — isolated, no edges + pnode((ax + 5.0, ay + 3.0), "BMF", col: col-warn, fill: col-warn-fill, name: "a-orphan") + + // Dashed box around orphan + rect((ax + 4.3, ay + 2.3), (ax + 5.7, ay + 3.7), + stroke: (thickness: 0.6pt, paint: col-warn, dash: "dashed"), radius: 4pt) + + // Annotation + content((ax + 5.0, ay + 1.8), text(5.5pt, fill: col-warn, + [no reductions\ to or from])) + + // ============================================================ + // PANEL (b): Redundant rule + // ============================================================ + let bx = 8.0 + let by = 0.0 + + content((bx + 2.5, by + 4.8), text(8pt, weight: "bold", "(b) Redundant rule")) + + // Three nodes in a row, with the composite path on top + pnode((bx + 0.0, by + 3.5), "A", name: "b-a") + pnode((bx + 2.5, by + 3.5), "B", name: "b-b") + pnode((bx + 5.0, by + 3.5), "C", name: "b-c") + + // Good composite path: A → B → C (two hops, low overhead) + edge("b-a.east", "b-b.west", col: col-ok, thick: 0.8pt) + edge("b-b.east", "b-c.west", col: col-ok, thick: 0.8pt) + + // Cost labels on good path + content((bx + 1.25, by + 4.1), text(5.5pt, fill: col-ok.darken(20%), $O(n)$)) + content((bx + 3.75, by + 4.1), text(5.5pt, fill: col-ok.darken(20%), $O(n m)$)) + + // Redundant direct edge: A → C (higher overhead, curves below) + bezier("b-a.south", "b-c.south", + (bx + 1.5, by + 1.2), (bx + 3.5, by + 1.2), + stroke: (paint: col-warn, thickness: 0.9pt, dash: "densely-dashed"), + mark: (end: "straight", scale: 0.35)) + + // Cost label on redundant edge + content((bx + 2.5, by + 1.5), text(5.5pt, fill: col-warn, + [direct: $O(n^2 m)$])) + + // Annotation + content((bx + 2.5, by + 0.5), text(5.5pt, fill: col-warn, + [composite $O(n^2 m)$ $lt.eq$ direct\ $arrow.r.double$ rule is dominated])) + + // ============================================================ + // PANEL (c): Missing NP-hardness proof path + // ============================================================ + let cx = 16.5 + let cy = 0.0 + + content((cx + 2.5, cy + 4.8), text(8pt, weight: "bold", "(c) Missing proof path")) + + // 3-SAT as the NP-hardness source + pnode((cx + 0.0, cy + 3.5), "3-SAT", col: col-sat, fill: col-sat-fill, name: "c-3sat") + pnode((cx + 2.0, cy + 3.5), "SAT", col: col-sat, fill: col-sat-fill, name: "c-sat") + pnode((cx + 4.0, cy + 3.5), "MIS", col: col-sat, fill: col-sat-fill, name: "c-mis") + pnode((cx + 2.0, cy + 1.5), "ILP", col: col-sat, fill: col-sat-fill, name: "c-ilp") + + // Green proof chain + edge("c-3sat.east", "c-sat.west", col: col-sat, thick: 0.7pt) + edge("c-sat.east", "c-mis.west", col: col-sat, thick: 0.7pt) + edge("c-sat.south", "c-ilp.north", col: col-sat, thick: 0.7pt) + + // Disconnected node — has edges but no path FROM 3-SAT + pnode((cx + 5.5, cy + 1.5), "TSP", col: col-warn, fill: col-warn-fill, name: "c-tsp") + pnode((cx + 5.5, cy + 3.5), "BinP", col: col-warn, fill: col-warn-fill, name: "c-binp") + + // TSP has outgoing edge to ILP, but no incoming from 3-SAT + edge("c-tsp.west", "c-ilp.east", col: col-ghost) + edge("c-binp.west", "c-mis.east", col: col-ghost) + + // Missing edges shown as dotted with "?" + line((cx + 3.0, cy + 2.8), (cx + 5.0, cy + 1.8), + stroke: (paint: col-warn, thickness: 0.5pt, dash: "dotted"), + mark: (end: "straight", scale: 0.3)) + content((cx + 4.3, cy + 2.6), text(6pt, fill: col-warn, "?")) + + line((cx + 3.0, cy + 3.8), (cx + 5.0, cy + 3.8), + stroke: (paint: col-warn, thickness: 0.5pt, dash: "dotted"), + mark: (end: "straight", scale: 0.3)) + content((cx + 4.3, cy + 4.2), text(6pt, fill: col-warn, "?")) + + // Annotation + content((cx + 5.5, cy + 0.5), text(5.5pt, fill: col-warn, + [no path from 3-SAT\ $arrow.r.double$ NP-hardness\ unproven in graph])) +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index c208ebf0..76d44a5b 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -19,488 +19,295 @@ \maketitle \begin{abstract} -AI coding agents achieve 70--80\% on single-issue benchmarks like SWE-Bench Verified, but their success rate drops below 25\% on long-horizon software evolution tasks that demand sustained mathematical reasoning across many files. We address this gap by decomposing agentic coding into two complementary roles: human-creative work (designing reduction proofs, choosing algorithms, writing specifications) and agent-managed execution (scaffolding, testing, verification, and integration). Our method centers on a library of 12 reusable skills---from issue triage through implementation to multi-layered review---orchestrated by a coding agent within a Rust library for NP-hard problem reductions. A 7-layer verification stack (type checking, unit tests, brute-force cross-validation, closed-loop reduction tests, integration tests, coverage enforcement, and CI/CD) catches errors at increasing levels of abstraction. Applying this methodology over nine weeks of active development produced 27 problem types, 50 reduction rule implementations, and 52 edges in a typed reduction graph, all with $>$95\% test coverage. Agent session data reveals a 15:1 automation amplification ratio (assistant-to-user messages), with the automated issue quality gate rejecting 75\% of 322~batch-submitted issues as incomplete, incorrect, or trivial. We contribute the skill-based decomposition methodology, the verification stack design, and the open-source artifact as a benchmark for agentic mathematical software engineering. +Many real-world optimization problems---scheduling, routing, chip design---are computationally hard (NP-hard), yet specialized hardware and software solvers exist for a handful of them. +Connecting a new problem to an existing solver requires a \emph{reduction}: a mathematical transformation that converts one problem into another while preserving the solution. +We build a library of such reductions as a directed graph of 27~problem types and 56~directed edges (45~hand-coded transformation rules plus edges inferred from type relationships), forming a ``compilation layer'' that routes any supported problem to the solver best suited for it. +Building this graph at scale requires implementing many self-contained mathematical proofs as verified code---a task well suited to AI coding agents when properly structured. +We introduce a \emph{skill-based} methodology that decomposes work into human-creative specification (deciding which reductions are worth implementing) and agent-managed execution (writing code, tests, and documentation). +Over nine weeks, this approach produced a Rust library with $>$95\% test coverage. +The graph exhibits a striking property: reductions implemented independently compose automatically to solve problems that no single implementation was designed for. +Agent session data reveals a 15:1 ratio of agent-to-human messages, while an automated quality gate rejected 75\% of 322~batch-submitted proposals as incorrect or incomplete. \end{abstract} +%====================================================================== +% SECTION 1: INTRODUCTION +%====================================================================== \section{Introduction}\label{sec:intro} -AI coding agents have made remarkable progress on isolated software engineering tasks. -On SWE-Bench Verified, which evaluates agents on single-issue bug fixes drawn from popular open-source repositories, the best systems now resolve 70--80\% of issues end-to-end~\cite{Xia2025LiveSWEagent}. -This has fueled optimism that fully autonomous software engineering is within reach. -Yet benchmarks designed to probe longer-horizon capabilities tell a starkly different story. -SWE-EVO, which requires agents to implement multi-step modifications spanning an average of 21~files, reports resolution rates around 21\% even for frontier models~\cite{Thai2025SWEEVO}. -SWE-Bench Pro, targeting enterprise-level tasks that may require hours to days of human effort, similarly finds that agents struggle with sustained multi-file reasoning~\cite{Deng2025SWEBenchPro}. -The common response to this capability gap has been to push for more powerful agents---larger models, better tool use, self-evolving scaffolds. -We argue that this framing overlooks the more fundamental bottleneck: not raw agent capability, but how work is decomposed and distributed between humans and agents. - -Our thesis is that the key to effective agentic coding lies in \emph{task decomposition}: splitting the creative and judgment-intensive aspects of software development (which humans do well) from the management and mechanical aspects (which agents can handle reliably). -When a task is sufficiently well-specified and bounded in scope, even current agents execute it with high fidelity. -The challenge is not making agents smarter, but structuring the work so that each unit falls within the agent's reliable operating range. -This perspective shifts attention from agent architecture to \emph{skill design}---the craft of encoding domain knowledge into reusable, agent-executable task specifications. - -This decomposition is especially critical for mathematical and scientific software, where the ``review is harder than generation'' problem is acute~\cite{Roychoudhury2025AgenticAI}. -An agent can generate a plausible-looking reduction from Boolean satisfiability to graph coloring, but verifying that the reduction preserves solution structure requires mathematical reasoning that current agents cannot reliably perform in isolation. -Roychoudhury~\cite{Roychoudhury2025AgenticAI} identifies specification inference---deciphering and clarifying developer intent---as the central difficulty in agentic software workflows, and argues that trustworthy deployment requires AI-based verification and validation of AI-generated code. -Our multi-layered verification stack addresses precisely this challenge: rather than attempting end-to-end formal verification (which remains largely unsolved for complex mathematical code~\cite{Thakur2025CLEVER, Bursuc2025VeriCoding}), we compose multiple lightweight verification mechanisms that collectively catch errors across different abstraction levels. - -We instantiate this methodology in the domain of NP-hard problem reductions, implemented as an open-source Rust library. -The library manages a typed reduction graph connecting 27 problem types through 50 hand-coded reduction rule implementations and 52 total directed edges (including 12 edges inferred from a type-parameter subtype lattice). -The domain serves as a Goldilocks testbed for studying agentic coding: each reduction is self-contained (50--200 lines of code), requires non-trivial mathematical reasoning about the mapping between problem structures, yet admits a fully automatable correctness criterion---reduce an instance, solve the target problem by brute force, extract the solution back, and verify it against the source. -This combination of mathematical depth with mechanical verifiability makes it possible to study how agents perform on tasks that are individually tractable but collectively demand sustained engineering discipline. - -Our approach distributes work across three roles with distinct responsibilities. -\textbf{Contributors} perform the creative work of identifying which problems and reductions are worth implementing: they open issues proposing new nodes or edges in the reduction graph, drawing on domain knowledge to spot gaps, recognize useful connections, and assess mathematical non-triviality. -\textbf{The maintainer} curates the project board and writes skills---markdown scripts that decompose complex tasks (such as ``implement a new reduction rule'') into sequences of agent-manageable subtasks. -The maintainer encodes domain knowledge, quality standards, and project conventions into these skills, effectively programming the agent's workflow rather than its output. -\textbf{Agents} serve in a dual capacity: as \emph{managers}, they pick cards from the project board, dispatch sub-agents for parallel review, and orchestrate a two-stage pipeline from issue to merged pull request; as \emph{executors}, they implement code, write tests, generate documentation, and fix CI failures. -The key insight is that the two human roles contribute \emph{judgment}---which reductions matter, what quality bar to enforce---while the agent handles \emph{volume}---executing the mechanical steps reliably and repeatedly. -Industry data supports this division: a recent industry report finds that developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding} (we note this is a vendor report, though its findings are consistent with independent surveys of developer AI adoption). - -We organize our agent's capabilities into a library of 12~skills spanning five functional categories: orchestration (pipeline management and issue dispatch), implementation (adding models and reduction rules), quality gates (issue checking, redundancy analysis, multi-agent review, and CI repair), documentation (generating formal problem definitions and reduction theorems with proof sketches), and release management. -A two-stage card-based pipeline automates the progression from issue to merged code: the first stage picks a ``Ready'' issue, implements it in an isolated git worktree, and produces a pull request; the second stage addresses review comments, fixes CI failures, and prepares the PR for human merge. -The human maintainer touches only two transitions in this pipeline---moving an issue from Backlog to Ready (the strategic decision of \emph{what} to work on) and merging the final pull request (the quality gate of \emph{whether} the work meets standards). -Everything in between is agent-managed. - -Correctness assurance comes from a seven-layer verification stack that catches errors at increasing levels of abstraction. -The stack ranges from compile-time type checking (Layer~1), through unit tests and brute-force cross-validation (Layers~2--3), to overhead formula validation against actual reduction sizes (Layer~4), materialized test fixtures that prevent agents from silently changing expected values (Layer~5), parallel agentic review with fresh context windows (Layer~6), and finally, documentation entries that require human-readable proof sketches for each reduction (Layer~7). -No single layer is sufficient---the type system catches API misuse but not logical errors; closed-loop tests verify functional correctness but not overhead formulas; documentation catches proof-level mistakes that no automated test can detect. -The layers are designed to be complementary, and the skill system ensures that agents invoke all relevant layers as part of every implementation task. - -Our contributions are as follows: +\subsection{The Problem: Many Hard Problems, Few Solvers} + +Combinatorial optimization problems arise throughout science and engineering. +An airline needs to assign crews to flights. +A chip designer needs to allocate registers. +A logistics company needs to route delivery trucks. +Each of these is an instance of an NP-hard problem---a class of problems for which no efficient general-purpose algorithm is known, but which can be solved in practice for moderate sizes by specialized solvers. + +The difficulty is that each solver speaks its own narrow language. +Rydberg atom arrays~\cite{lucas2014, pichler2018}---a type of quantum hardware---natively solve the Maximum Independent Set (MIS) problem on geometric graphs. +D-Wave quantum annealers~\cite{glover2019} solve Quadratic Unconstrained Binary Optimization (QUBO). +Commercial engines like Gurobi and CPLEX solve Integer Linear Programs (ILP). +A practitioner with a graph coloring problem cannot directly use any of these solvers without first \emph{translating} the problem into a form the solver understands. + +This translation is called a \emph{reduction}: a polynomial-time algorithm that converts an instance of one problem into an instance of another, together with an inverse map that translates the solution back. +Reductions are the central tool of computational complexity theory~\cite{karp1972}, but they also have immediate practical value: each verified reduction is a bridge connecting a new problem to an existing solver. + +\subsection{Our Approach: A Reduction Graph} + +We build a collection of such reductions organized as a \emph{directed graph}. +Each node in the graph is a problem type (e.g., Satisfiability, Max-Cut, Traveling Salesman). +Each directed edge is a verified reduction from one problem to another---code that transforms instances forward and maps solutions back. +Given any supported problem, a path through the graph leads to a solver, with each edge backed by tested code. + +The graph currently contains 27~problem types connected by 56~directed edges (\Cref{fig:reduction-graph}). +It exhibits a property we call \emph{emergent compositionality}: reductions implemented independently---by different people, in different pull requests, weeks apart---compose automatically through the graph infrastructure. +For example, one contributor implemented Factoring $\to$ Circuit Satisfiability (CircuitSAT), and another implemented CircuitSAT $\to$ ILP. +Neither intended the composition, yet the graph enables factoring integers via linear programming by chaining the two reductions. +Each new edge creates not just one connection but new paths through the entire graph. + +\subsection{The Challenge: Building the Graph at Scale} + +Implementing a single reduction requires understanding both the source and target problems, designing a correct transformation, proving it preserves solutions, writing 50--400 lines of code, and testing it thoroughly. +Building 50 such reductions is labor-intensive. + +AI coding agents---systems that autonomously write, test, and debug code---have shown strong performance on isolated software tasks, resolving 70--80\% of single-issue bug fixes on the SWE-Bench benchmark~\cite{Xia2025LiveSWEagent}. +But on multi-step tasks spanning many files, success drops to roughly 21\%~\cite{Thai2025SWEEVO, Deng2025SWEBenchPro}. +For mathematical software, the gap is wider still: an agent can generate a plausible-looking reduction, but verifying that it preserves solution structure requires reasoning that current agents cannot reliably perform alone~\cite{Roychoudhury2025AgenticAI}. + +We argue that the bottleneck is not agent capability but \emph{task decomposition}: how work is structured so that each unit falls within an agent's reliable range. +NP-hard reductions are a natural fit. +Every reduction implements the same interface, follows the same file convention, requires the same test pattern, and produces the same artifacts. +This homogeneity enables \emph{reusable skills}: persistent, versioned workflow scripts that decompose a complex multi-file task into numbered agent-executable steps. +A single ``add-rule'' skill handles all reductions, because the workflow is structurally identical even when the mathematical content varies. + +\subsection{Contributions} + +Our methodology separates three concerns. +\textbf{Contributors} provide creative judgment: identifying which reductions are mathematically interesting and worth implementing. +\textbf{The maintainer} encodes workflow knowledge into reusable skills. +\textbf{Agents} handle the mechanical volume: implementing code, writing tests, generating documentation, and fixing CI failures. +A library of 14~skills and a multi-layered verification stack ensure correctness across abstraction levels. + +Over nine weeks, this methodology produced a Rust library with 27~problem types, 45~reduction rules, and $>$95\% test coverage. +Our contributions are: \begin{itemize} - \item A \textbf{skill-based methodology} for decomposing mathematical coding tasks into agent-manageable steps, with a concrete skill library and a card-based orchestration pipeline that separates human judgment from agent execution. - \item A \textbf{multi-layered verification stack} comprising seven complementary mechanisms---from type-level enforcement through materialized fixtures to agentic review and documentation-as-verification---that collectively ensure correctness of agent-generated mathematical code. - \item A \textbf{verified open-source artifact}: a Rust library implementing 27~NP-hard problem types, 50~reduction rules, and 52~graph edges, all with ${>}95\%$ test coverage, serving as both a practical tool for reduction-based problem solving and a benchmark for evaluating agentic mathematical software engineering. + \item A \textbf{verified reduction graph} connecting 27~NP-hard problem types to specialized solvers, with emergent compositionality through automatic path composition. + \item A \textbf{skill-based methodology} for mathematical software engineering, separating human-creative specification from agent-managed execution. + \item A \textbf{verified open-source artifact}: the Rust library, skill library, and full development history as a benchmark for agentic mathematical software engineering. \end{itemize} -The remainder of this paper is organized as follows. -\Cref{sec:domain} motivates our choice of NP-hard reductions as a study domain, arguing that it occupies a sweet spot between mathematical complexity and mechanical verifiability. -\Cref{sec:architecture} describes the Rust library's type-driven architecture that makes agent-generated code verifiable by construction. -\Cref{sec:skills} presents the skill-based task decomposition, the three-role model, and the card-based orchestration pipeline. -\Cref{sec:verification} details the seven-layer verification stack. -\Cref{sec:evaluation} evaluates the methodology through an ablation study, git history mining, and detailed case studies. -\Cref{sec:related} surveys related work on AI coding agents, AI-assisted discovery of reductions, and formal verification of generated code. -\Cref{sec:conclusion} discusses generalizability, limitations, and future directions. - -\section{Why Reductions? The Goldilocks Domain}\label{sec:domain} - -Not every software domain is equally suited for studying agentic coding. -Web application bug fixes---the staple of SWE-Bench---are heterogeneous: each issue involves a different framework, a different failure mode, and a different notion of correctness. -This heterogeneity makes it difficult to draw general conclusions about agent capabilities, because success on one task says little about success on the next. -NP-hard problem reductions occupy a sweet spot---a Goldilocks domain---that avoids this limitation while remaining mathematically demanding. - -\paragraph{Self-contained, formally specified, mechanically verifiable.} -Each reduction in our library is a self-contained module: the 50~implementations range from 58 to 444 lines of code, with a median of 129~LOC. -Despite this bounded scope, each reduction requires non-trivial mathematical reasoning about the structural relationship between two problem formulations. -Crucially, every reduction admits a fully automatable correctness criterion: given a source instance, reduce it to the target problem, solve the target by brute force, extract the solution back through the reduction's inverse map, and verify that it is valid (and optimal) for the source. -This \emph{round-trip test} provides a ground-truth oracle that requires no human judgment to evaluate, yet exercises the full mathematical content of the reduction. -The combination of bounded scope, mathematical depth, and mechanical verifiability is what makes the domain ideal for agentic coding: tasks are individually within an agent's reliable operating range, but collectively demand sustained engineering discipline across a growing graph of interdependent components. - -\paragraph{Homogeneous task structure.} -Unlike SWE-Bench, where every issue is structurally unique, reductions form a homogeneous task family. -Every reduction implements the same trait (\texttt{ReduceTo}), follows the same file-naming convention, requires the same test structure (closed-loop round-trip), and produces the same artifacts (overhead expressions, example code, documentation entry). -This homogeneity enables fair comparison across tasks---we can meaningfully ask whether an agent performs better on graph-to-graph reductions than on formula-to-graph reductions, or whether reduction complexity (measured in LOC or graph blowup) predicts first-attempt success rate. -It also enables \emph{reusable skills}: a single ``add-rule'' skill handles all 50~reductions, because the workflow is structurally identical even when the mathematical content varies. - -\paragraph{Hardware solvers as practical motivation.} -The reduction graph is not merely an academic exercise; it serves as a \emph{compilation layer} connecting abstract problem formulations to physical hardware (\Cref{fig:reduction-graph}). -Rydberg atom arrays natively solve Maximum Independent Set (MIS) by encoding graph vertices as atoms and edges as blockade constraints~\cite{lucas2014, pichler2018}. -D-Wave quantum annealers solve Quadratic Unconstrained Binary Optimization (QUBO) and Ising spin glass problems through quantum tunneling~\cite{glover2019}. -Commercial ILP solvers (Gurobi, CPLEX) accept integer linear programs. -Each solver ``speaks'' only its native problem language, but a verified reduction graph lets them collectively tackle a far larger class of problems: reduce Satisfiability to MIS and run the result on Rydberg atoms; reduce MaxCut through SpinGlass to QUBO and submit to D-Wave. -MIS serves as the dominant hub, with the highest in-degree (14~incoming edges) and out-degree (13~outgoing edges) among all 27~problem types---reflecting its central role as both a target for hardware solvers and a source for further reductions. -ILP, with 11~incoming edges, functions as a universal algebraic target. -The bottom layer of \Cref{fig:reduction-graph} shows these three solver backends; the middle layer shows the 50~human-implemented reduction rules connecting known problem types; and the top layer visualizes the scaling vision---agent-synthesized rules extending the graph to 100+ problem types, the compilation infrastructure needed for a general-purpose reduction compiler. +The rest of this paper is organized as follows. +\Cref{sec:graph} describes the reduction graph and its properties. +\Cref{sec:method} presents the skill-based methodology. +\Cref{sec:evaluation} evaluates through development metrics, a quality gate analysis, and case studies. +\Cref{sec:related} surveys related work. +\Cref{sec:conclusion} discusses generalizability and future directions. + +%====================================================================== +% SECTION 2: THE REDUCTION GRAPH +%====================================================================== +\section{The Reduction Graph}\label{sec:graph} + +\subsection{What Is a Reduction?} + +A \emph{reduction} from problem~$A$ to problem~$B$ is a pair of functions: a \emph{forward map} that transforms any instance of~$A$ into an instance of~$B$ in polynomial time, and an \emph{inverse map} that converts any solution of the $B$-instance back into a solution of the original $A$-instance. +If the reduction is correct, the extracted solution is optimal (or satisfying) for the original problem. -\paragraph{Real-world applications.} -The problems in our graph arise directly in industrial settings. -Software-defined networking encodes routing and scheduling as Integer Linear Programming~(ILP). -Airline crew scheduling reduces to Set Covering. -VLSI design relies on graph coloring for register allocation and channel routing. -Logistics optimization maps to the Traveling Salesman Problem and Bin Packing. -In each case, the domain-specific problem reduces to a canonical NP-hard formulation for which decades of algorithmic and hardware research have produced efficient solvers. -Our library provides the verified bridge: a practitioner formulates their problem as one of the 24~supported types and follows reduction edges to reach the solver of their choice, with each edge backed by round-trip-tested code and a documented proof sketch. +For example, the complement relationship between Minimum Vertex Cover (MVC) and Maximum Independent Set (MIS) gives a simple reduction: given a graph, the vertex cover problem asks for the smallest set of vertices that touches every edge, while the independent set problem asks for the largest set of vertices with no edges between them. +A set~$S$ is independent if and only if its complement $V \setminus S$ is a vertex cover, so the forward map is the identity (the same graph), and the inverse map complements the solution. +This 96-line implementation connects MVC to every solver reachable from MIS. + +\subsection{Graph Structure}\label{sec:graph-structure} + +We organize reductions into a directed graph $G = (V, E)$, where each vertex $v \in V$ represents an NP-hard problem type and each directed edge $(u, v) \in E$ represents a verified reduction from problem~$u$ to problem~$v$. +The graph contains 27~vertices (problem types) and 56~directed edges: 45~hand-coded reduction rules plus 11~edges inferred from subtype relationships (e.g., MIS on a geometric subgraph can always be treated as MIS on a general graph, because the geometric structure is a special case). \begin{figure*}[t] \centering \includegraphics[width=0.88\textwidth]{figures/problemtree.pdf} - \caption{The reduction network as compilation infrastructure. \textbf{Bottom}: hardware-native problems (UD-MIS on Rydberg atom arrays, QUBO on D-Wave annealers, ILP on commercial solvers). \textbf{Middle}: 27 problem types connected by 50~human-implemented reduction rules (solid arrows) with cross-type reductions (dashed). MIS is the dominant hub (in-degree 14, out-degree 13). \textbf{Top}: the scaling vision---agent-synthesized reductions extending coverage to 100+ NP-hard problem types. The dashed line marks the boundary between current human-implemented rules and future agent-discovered rules.} + \caption{The reduction graph as compilation infrastructure. + \textbf{Bottom layer}: three families of solvers, each accepting a specific problem formulation---Maximum Independent Set on unit-disk graphs (UD-MIS) for Rydberg atom arrays, QUBO for D-Wave quantum annealers, and ILP for commercial solvers. + \textbf{Middle layer}: 27~problem types connected by 45~human-implemented reduction rules. + MIS is the dominant hub with 14~incoming and 13~outgoing edges. + \textbf{Top layer}: the scaling vision---agent-synthesized reductions extending coverage to 100+ problem types. + Solid arrows denote verified reductions; dashed arrows denote cross-reductions to alternative solvers.} \label{fig:reduction-graph} \end{figure*} -\section{System Architecture}\label{sec:architecture} - -The library's architecture is designed around a single principle: \emph{reduce the space of possible agent errors through type-level enforcement}. -Rather than relying on agents to remember project conventions or follow informal guidelines, the Rust type system, trait bounds, and compile-time macros structurally prevent entire classes of mistakes. -This section describes four architectural pillars---the \texttt{Problem} trait, the \texttt{ReduceTo} trait, the \lstinline{#[reduction(overhead)]} proc macro, and the \lstinline{declare_variants!} registry---that collectively make agent-generated code verifiable by construction (see~\Cref{fig:architecture}). - -\begin{figure}[t] - \centering - \includegraphics[width=\columnwidth]{figures/architecture.pdf} - \caption{System architecture: the trait hierarchy and compile-time validation enforce round-trip testing capability by construction.} - \label{fig:architecture} -\end{figure} +Three problems serve as ``compilation targets,'' each corresponding to a class of specialized solvers (\Cref{fig:reduction-graph}, bottom layer): +\begin{itemize} + \item \textbf{MIS} (Maximum Independent Set): target for Rydberg atom arrays, which solve MIS natively on geometric graphs. + \item \textbf{QUBO} (Quadratic Unconstrained Binary Optimization): target for D-Wave quantum annealers. + \item \textbf{ILP} (Integer Linear Programming): target for commercial solvers like Gurobi and CPLEX. +\end{itemize} +MIS is the dominant hub with 14~incoming and 13~outgoing edges, reflecting its role as both a hardware target and an intermediary. +ILP, with 11~incoming edges, functions as a universal algebraic target. +A path from any problem to one of these targets provides a route to the corresponding solver. -\paragraph{The Problem trait: universal evaluation.} -Every problem type in the library implements a single core trait, \texttt{Problem}, which requires five members: a constant \lstinline{NAME} identifying the problem type (e.g., \texttt{"MaximumIndependentSet"}), an associated type \texttt{Metric} representing the objective (either \texttt{SolutionSize} for optimization problems or \texttt{bool} for satisfaction problems), a method \lstinline{dims()} that returns the configuration space dimensions, a method \lstinline{evaluate()} that scores any configuration against the problem instance, and a method \lstinline{variant()} that returns type-parameter metadata as key-value pairs. +\subsection{Emergent Compositionality}\label{sec:compositionality} -The critical member is \lstinline{evaluate()}. -Because every problem must implement a function that maps an arbitrary configuration to a metric value, the library can verify any candidate solution without problem-specific test infrastructure. -A brute-force solver simply enumerates all configurations in the space defined by \lstinline{dims()} and selects the one(s) with the best metric. -This universal evaluation capability is what makes the round-trip test described in \Cref{sec:domain} possible: given any reduction, we can solve the target by brute force and verify the extracted solution against the source---all through the \lstinline{evaluate()} interface, with no reduction-specific oracle needed. +The graph's most valuable property is that independently implemented reductions compose automatically to solve problems that no single reduction was designed for. -Two extension traits refine \texttt{Problem} for specific problem classes. -\texttt{OptimizationProblem} adds a \lstinline{direction()} method (\texttt{Maximize} or \texttt{Minimize}) and constrains the metric to \texttt{SolutionSize}, where \lstinline{SolutionSize} is either \texttt{Valid(v)} (a feasible solution with objective value~$v$) or \texttt{Invalid} (an infeasible configuration). -\texttt{SatisfactionProblem} is a marker trait for decision problems where the metric is simply \texttt{bool}. -This hierarchy means that agent-implemented problem types must declare upfront whether they are optimization or satisfaction problems, and the type system enforces consistency: an agent cannot accidentally return a numeric score from a satisfaction problem or omit a direction from an optimization problem. +Consider the problem of integer factoring---decomposing a number $N$ into its prime factors. +The Factoring $\to$ CircuitSAT reduction (272~lines of code) constructs a Boolean circuit representing an array multiplier: two bit-vector inputs $p$ and $q$, a grid of full-adder cells computing their product, and output constraints fixing the result to~$N$. +Any satisfying assignment to this circuit yields factors of~$N$. -\paragraph{The ReduceTo trait: round-trip by construction.} -Reductions between problems are encoded through the generic trait \lstinline{ReduceTo}, which requires a single method: \lstinline{reduce_to()} takes a reference to the source problem and returns a \texttt{ReductionResult}. -The \texttt{ReductionResult} type bundles two capabilities: \lstinline{target_problem()} returns the constructed target instance, and \lstinline{extract_solution()} maps a target solution back to a source solution. -By requiring both the forward mapping and the inverse extraction in a single trait implementation, the type system ensures that every reduction is round-trip capable---an agent cannot implement a forward reduction without also providing the solution extraction, because the code will not compile otherwise. +Separately, the CircuitSAT $\to$ ILP reduction (225~lines) linearizes a Boolean circuit into integer constraints, encoding each logic gate as a set of linear inequalities. -This design choice has a direct consequence for verification. -The closed-loop test pattern---reduce, solve the target, extract the solution, verify against the source---requires no per-reduction test logic beyond constructing a source instance. -The test harness calls \lstinline{reduce_to()}, invokes the brute-force solver on the target, calls \lstinline{extract_solution()}, and checks the result via \lstinline{evaluate()}. -All 50~reduction implementations share this identical test structure, which is why a single ``add-rule'' skill can handle every reduction in the library. +Neither reduction was designed with the other in mind---they were implemented in separate pull requests, weeks apart. +Yet the graph infrastructure enables automatic chaining: given a Factoring instance, the system finds the path Factoring $\to$ CircuitSAT $\to$ ILP, applies both reductions in sequence, solves the resulting integer program with a commercial solver, and extracts the factors by composing the inverse maps in reverse order. +This works because every reduction provides type-safe solution extraction through a common interface, and the path-finding algorithm composes extractors automatically. -\paragraph{Compile-time overhead validation.} -Every reduction must declare how the target problem's size relates to the source problem's size through an \lstinline{overhead} attribute on the \lstinline{#[reduction]} proc macro: -\begin{lstlisting}[basicstyle=\ttfamily\small,breaklines=true] -#[reduction(overhead = { - num_vertices = "num_vertices + num_clauses", - num_edges = "3 * num_clauses", -})] -impl ReduceTo for Source { ... } -\end{lstlisting} -The overhead expressions are parsed at compile time by a Pratt parser embedded in the procedural macro crate. -Variable names in the expressions (e.g., \texttt{num\_vertices}, \texttt{num\_clauses}) are validated against actual getter methods on the source type---if an agent writes a nonexistent variable name, the code fails to compile with a clear error message pointing to the offending expression. -This eliminates an entire class of copy-paste errors where agents might reference fields from a different problem type. - -The expressions support standard arithmetic operators ($+$, $-$, $\times$, $/$, \texttt{\^{}}), mathematical functions (\lstinline{exp}, \lstinline{log}, \lstinline{sqrt}), and numeric constants. -At runtime, the library maintains both a symbolic representation (\texttt{Expr} AST) and a compiled evaluation function that calls the getter methods directly, enabling cross-validation between the two representations. -The overhead data feeds into the reduction graph metadata, allowing automated analysis of whether a composite reduction path (e.g., $A \to B \to C$) dominates a direct reduction ($A \to C$) in terms of polynomial overhead---a capability used by the \texttt{check-rule-redundancy} skill to prevent agents from implementing unnecessary reductions. - -\paragraph{Variant registry and graph export.} -Problem types in the library are parameterized by graph type (e.g., \texttt{SimpleGraph}, \texttt{PlanarGraph}, \texttt{BipartiteGraph}, \texttt{KingsSubgraph}) and optionally by weight type (\texttt{One} for unit weights, \texttt{i32}, \texttt{f64}). -Each concrete instantiation---for example, \texttt{MaximumIndependentSet}---constitutes a distinct \emph{variant} that may have a different best-known complexity. - -The \lstinline{declare_variants!} proc macro registers these variants at compile time, associating each with a complexity string that represents the worst-case time bound of the best known algorithm for that variant. -Variable names in complexity strings are validated against getter methods, just as in overhead expressions. -For example, \texttt{MaximumIndependentSet} has polynomial complexity (the kings graph structure admits an efficient algorithm), while the general \texttt{SimpleGraph} variant has exponential complexity $O(1.1996^n)$. - -The registry serves three purposes. -First, it enables \emph{automated graph export}: the library can enumerate all registered variants and their reductions to produce the reduction graph shown in \Cref{fig:reduction-graph}, including both hand-coded reduction edges and natural edges inferred from the type-parameter subtype lattice (e.g., a reduction from \texttt{MIS} to \texttt{MIS} is automatically available because \texttt{KingsSubgraph} is a subtype of \texttt{SimpleGraph}). -Second, it enables \emph{completeness checking}: the documentation system can verify that every node and edge in the exported graph has a corresponding entry in the paper, flagging undocumented reductions as warnings. -Third, the complexity data enables \emph{redundancy analysis}: by comparing the end-to-end complexity of a composite path $A \to B \to C$ (applying overheads to the target's complexity) against the direct complexity of solving $A$, the system can determine whether a proposed reduction is useful or merely adds complexity without enabling access to a faster solver. - -\paragraph{Design philosophy: error prevention over error detection.} -The four mechanisms above share a common design philosophy: rather than detecting agent errors after the fact (through tests that may themselves be incorrect), the architecture \emph{prevents} errors from being expressible. -An agent cannot implement a reduction without providing solution extraction (the type system rejects it). -An agent cannot write an overhead expression referencing nonexistent problem attributes (the compiler rejects it). -An agent cannot register a variant without providing a complexity string (the macro rejects it). - -This philosophy---reducing the space of possible errors rather than testing for their absence---is what transforms the type system from a code organization tool into the foundation of the verification stack described in \Cref{sec:verification}. -The seven verification layers build on this foundation: Layer~1 (type checking) is not merely ``does it compile,'' but ``does it satisfy the rich set of constraints encoded in the trait hierarchy and proc macros.'' -Each subsequent layer addresses error classes that the type system cannot prevent---logical errors in the reduction mapping (Layer~3), incorrect overhead formulas despite valid variable names (Layer~4), and proof-level mistakes in the mathematical argument (Layer~7)---but the architectural choices described here ensure that the first layer already eliminates a substantial fraction of the errors that agents would otherwise produce. - -\section{Skill-Based Task Decomposition}\label{sec:skills} +Each new edge amplifies the graph. +Adding a single reduction from problem~$X$ to MIS does not just connect $X$ to MIS---it connects $X$ to every solver reachable from MIS, and it connects every problem with a path to~$X$ to MIS. +This multiplier effect---where value grows faster than the number of edges---justifies the investment in verified reduction infrastructure over ad-hoc, one-off transformations. -\begin{figure}[t] - \centering - \includegraphics[width=\columnwidth]{figures/pipeline.pdf} - \caption{Two-stage card-based pipeline. Human decisions (orange) are limited to Backlog$\to$Ready and In Review$\to$Done. Agent manages everything in between.} - \label{fig:pipeline} -\end{figure} - -The introduction outlined three roles---Contributor, Maintainer, and Agent---and claimed that skills decompose complex tasks into agent-manageable steps. -This section makes both ideas concrete: we define the roles (\Cref{sec:roles}), catalogue the skill library (\Cref{sec:skill-inventory}), and describe the card-based orchestration pipeline that ties them together (\Cref{sec:orchestration}). +\subsection{Verification by Round-Trip Testing}\label{sec:roundtrip} -\subsection{Three Roles}\label{sec:roles} +How do we know each reduction is correct? +Every reduction admits a fully automatable correctness test that requires no problem-specific oracle. +The test, which we call \emph{round-trip testing}, works as follows: -Our methodology distributes work across three roles whose responsibilities are deliberately non-overlapping (\Cref{tab:roles}). -The division is designed so that each role contributes what it does best: humans provide judgment and creativity, while the agent provides volume and consistency. +\begin{enumerate} + \item Construct a small source instance (e.g., a graph with 5--8 vertices). + \item Apply the forward map to produce a target instance. + \item Solve the target by brute-force enumeration (feasible for small instances). + \item Apply the inverse map to extract a solution for the source. + \item Verify that the extracted solution is optimal for the source (also by brute force). +\end{enumerate} -\begin{table}[t] -\caption{The three roles and their responsibilities.}\label{tab:roles} -\centering -\begin{tabular}{lp{5.5cm}} -\toprule -Role & Responsibility \\ -\midrule -Contributor & Open issues proposing new problems or reductions. Creative work: identify gaps in the reduction graph, assess mathematical non-triviality, provide references and worked examples. \\ -Maintainer & Curate the project board and write skills. Strategic work: decide which issues to prioritize (Backlog$\to$Ready), encode domain knowledge and quality standards into skill scripts, merge final PRs. \\ -Agent & Manage the pipeline and execute tasks. Mechanical work: pick cards from the board, implement code, write tests, generate documentation, dispatch sub-agents for review, fix CI failures, update board status. \\ -\bottomrule -\end{tabular} -\end{table} +This pattern is the same for all 45~reductions. +It catches the most mathematically subtle errors---incorrect variable mappings, off-by-one indexing in literal-to-vertex transformations, forgotten negations---without requiring a human to write problem-specific test logic. -\textbf{Contributors} perform the most intellectually demanding work in the pipeline: they must identify which reductions are mathematically interesting, non-trivial, and useful for extending the graph's reach. -A contributor proposing a new reduction from Satisfiability to Maximum Independent Set must understand both problem formulations, know the classical gadget construction, and provide a worked example detailed enough for an agent to implement. -This creative judgment---\emph{what} is worth building---is precisely what current agents cannot reliably provide. +The library's type system enforces this pattern structurally: the \lstinline{ReduceTo} trait requires both a forward map and an inverse map in a single implementation. +An agent cannot compile a forward reduction without providing solution extraction. +See \Cref{app:architecture} for the complete type architecture. -\textbf{The maintainer} occupies a meta-level role: rather than writing code directly, the maintainer programs the agent's \emph{workflow} by authoring skills. -Each skill is a markdown document that encodes domain conventions (file naming, trait implementation patterns, test structure), quality standards (coverage thresholds, documentation requirements), and orchestration logic (when to dispatch sub-agents, how to handle failures). -The maintainer also serves as the final quality gate, reviewing PRs that the agent has prepared and deciding whether to merge. -In practice, the maintainer touches only two transitions in the pipeline: moving an issue from Backlog to Ready (the strategic decision of \emph{what} to work on) and merging the completed PR (the quality judgment of \emph{whether} the work meets standards). +%====================================================================== +% SECTION 3: METHODOLOGY +%====================================================================== +\section{Methodology}\label{sec:method} -\textbf{The agent} operates in a dual capacity. -As a \emph{manager}, it picks cards from the project board, creates isolated git worktrees for each task, dispatches sub-agents for parallel review, and orchestrates the progression from issue to pull request. -As an \emph{executor}, it implements the code changes specified by each skill's step-by-step instructions. -This dual role is possible because skills reduce each task to a sequence of well-bounded subtasks, each within the agent's reliable operating range. +The central design challenge is: how do you let many people---including domain experts with no programming background---contribute to a verified mathematical library without compromising correctness? +Our answer is a pipeline of \emph{progressive quality gates}, where each stage independently validates its input before passing to the next. +A domain expert can propose a new reduction by describing it in mathematical language; the system validates the proposal, implements it, tests it, reviews it, and documents it---with human judgment required only at two points. -\subsection{Skills as Agent Functions}\label{sec:skill-inventory} +\subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} -A \emph{skill} is a markdown document that decomposes a complex, multi-file task into a numbered sequence of agent-executable steps. -Each step specifies what to read, what to produce, and how to verify the result. -The key insight is that agents handle well-specified, bounded subtasks reliably; the challenge is structuring work so that each unit falls within that reliable range. -Skills solve this by encoding the domain expert's knowledge of \emph{how} to perform a task---file locations, naming conventions, trait implementation patterns, test structure---into a reusable script that any agent invocation can follow. +The pipeline has five stages, each backed by one or more \emph{skills}---persistent, versioned markdown documents that decompose complex tasks into numbered agent-executable steps (\Cref{fig:pipeline}). -Skills differ from prompt engineering techniques such as chain-of-thought prompting or ReAct-style reasoning in two key respects. -First, skills are \emph{persistent and versioned}: they are committed to the repository, evolve through pull requests, and encode accumulated project knowledge across many agent sessions---unlike per-invocation prompts that must be re-crafted each time. -Second, skills are \emph{compositional}: orchestration skills invoke implementation skills, which invoke quality gates, forming multi-level workflows that no single prompt could express. -The closest analogue is the ``runbook'' concept in DevOps, but adapted for an agent executor rather than a human operator. - -Our library comprises 12~skills organized into five functional categories (\Cref{tab:skills}). - -\begin{table}[t] -\caption{Skills inventory. Steps = numbered steps in the skill script. Success rate (first-attempt CI pass, to be measured from git history) is pending systematic audit.}\label{tab:skills} -\centering -\small -\begin{tabular}{@{}llcc@{}} -\toprule -Skill & Category & Steps & Success \\ -\midrule -\texttt{project-pipeline} & Orchestration & 7 & TBD \\ -\texttt{review-pipeline} & Orchestration & 8 & TBD \\ -\texttt{issue-to-pr} & Orchestration & 7 & TBD \\ -\midrule -\texttt{add-model} & Implementation & 7 & TBD \\ -\texttt{add-rule} & Implementation & 6 & TBD \\ -\midrule -\texttt{check-issue} & Quality gate & 3 & TBD \\ -\texttt{check-redundancy} & Quality gate & 5 & TBD \\ -\texttt{review-impl} & Quality gate & 5 & TBD \\ -\texttt{fix-pr} & Quality gate & 6 & TBD \\ -\midrule -\texttt{write-model} & Documentation & 4 & TBD \\ -\texttt{write-rule} & Documentation & 6 & TBD \\ -\midrule -\texttt{release} & Release & 3 & TBD \\ -\bottomrule -\end{tabular} -\end{table} +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/pipeline.pdf} + \caption{Contribution pipeline from proposal to merged code. + Orange transitions require human judgment (selecting work and approving results); blue transitions are fully automated by skills. + Each stage independently validates before passing to the next.} + \label{fig:pipeline} +\end{figure} -\paragraph{Orchestration skills (3).} -These skills implement the agent-as-manager role. -\texttt{project-pipeline} is the primary automation entry point: it picks a ``Ready'' issue from the GitHub Project board, creates an isolated git worktree, invokes \texttt{issue-to-pr} to produce a pull request, and moves the card to the ``review-agentic'' column. -\texttt{review-pipeline} handles the second stage: it picks a PR from ``review-agentic,'' addresses Copilot review comments via \texttt{fix-pr}, runs agentic feature tests, fixes CI failures (up to three retries), and moves the card to ``In Review'' for human merge. -\texttt{issue-to-pr} is the per-issue workhorse invoked by \texttt{project-pipeline}: it fetches the issue, verifies it has passed the \texttt{check-issue} quality gate, researches cited references, writes an implementation plan, creates a PR, and optionally executes the plan by dispatching to the appropriate implementation skill. - -\paragraph{Implementation skills (2).} -\texttt{add-model} and \texttt{add-rule} encode the complete workflow for adding a new problem type or reduction rule, respectively. -\texttt{add-model} walks through seven steps: gather required information (mathematical definition, complexity bounds, type parameters), implement the \texttt{Problem} trait, register variant complexity via \texttt{declare\_variants!}, register in the module system and CLI, write unit tests, generate documentation, and verify. -\texttt{add-rule} follows a parallel six-step structure: implement \texttt{ReduceTo} with overhead expressions, register in the module system, write closed-loop tests, create an example program, generate a paper entry with proof sketch, and verify. -Both skills begin with a checklist of required information---if any item is missing, the skill halts and requests clarification rather than proceeding with incomplete specifications. - -\paragraph{Quality gate skills (4).} -These skills prevent errors from propagating through the pipeline. -\texttt{check-issue} validates proposed issues across four dimensions---usefulness (does the reduction improve the graph?), non-triviality (is the construction genuinely structural?), correctness (are cited references real and accurate?), and writing quality (is the specification complete and implementable?)---posting a structured report with pass/fail/warn verdicts and applying labels that gate downstream processing. -\texttt{check-rule-redundancy} determines whether a proposed reduction is dominated by a composite path through existing rules, using polynomial overhead comparison to prevent unnecessary implementations. -\texttt{review-implementation} dispatches two parallel sub-agents with fresh context windows---a structural reviewer that checks completeness against the model or rule checklist, and a quality reviewer that evaluates code style, test quality, and convention adherence. -The fresh-context design prevents the ``sycophancy'' failure mode where a reviewer that also wrote the code is biased toward approving it. -\texttt{fix-pr} triages and resolves PR feedback from multiple sources: user inline comments, Copilot suggestions, CI failures, and coverage gaps identified by Codecov. - -\paragraph{Documentation skills (2).} -\texttt{write-model-in-paper} and \texttt{write-rule-in-paper} generate entries in the project's Typst paper. -These skills serve a dual purpose: they produce human-readable documentation, and they function as the final layer of the verification stack (\Cref{sec:verification}). -\texttt{write-rule-in-paper} requires a self-contained proof sketch with a bidirectional correctness argument---the discipline of writing ``if $S$ is an independent set, then $V \setminus S$ is a vertex cover'' forces the agent (and the reviewing human) to articulate the mathematical argument that no automated test can verify. - -\paragraph{Release skill (1).} -\texttt{release} determines the appropriate version bump from the diff since the last tag, verifies that all tests and lints pass, and invokes the release pipeline that publishes to crates.io. - -The skill library is designed to be \emph{compositional}: orchestration skills invoke implementation skills, which invoke quality gate skills, which may invoke documentation skills. -This composition means that a single \texttt{project-pipeline} invocation triggers a cascade of skill calls that collectively implement, test, review, document, and prepare a complete pull request---all from a one-line command. - -\subsection{Card-Based Orchestration}\label{sec:orchestration} - -The skills described above are coordinated through a GitHub Project board with six columns: \textbf{Backlog}, \textbf{Ready}, \textbf{In Progress}, \textbf{review-agentic}, \textbf{In Review}, and \textbf{Done}. -The pipeline operates in two stages, as illustrated in \Cref{fig:pipeline}. - -\paragraph{Stage~1: Implementation (\texttt{project-pipeline}).} -The maintainer moves an issue from Backlog to Ready---this is the strategic decision of \emph{what} to work on next. -The agent's \texttt{project-pipeline} skill picks the next Ready issue (processing models before rules to satisfy dependencies), moves it to In~Progress, creates an isolated git worktree, and invokes \texttt{issue-to-pr --execute}. -This triggers the full implementation cascade: the issue is classified as a model or rule, dispatched to the appropriate implementation skill, tested, reviewed by parallel sub-agents, and packaged as a pull request. -Upon completion, the card moves to the ``review-agentic'' column, signaling that the PR is ready for the second stage. - -\paragraph{Stage~2: Review (\texttt{review-pipeline}).} -The \texttt{review-pipeline} skill picks a PR from the ``review-agentic'' column and runs a fix loop: it addresses Copilot review comments, executes agentic feature tests, and fixes CI failures (retrying up to three times). -If CI passes, the card moves to ``In~Review.'' -The maintainer then performs a final human review and merges the PR, moving the card to Done. - -\paragraph{Human touch points.} -The design ensures that humans make exactly two decisions per issue. -First, the maintainer moves an issue from Backlog to Ready---this encodes the judgment of which tasks are worth pursuing, in what order, and with what priority. -Second, the maintainer reviews and merges the completed PR---this is the quality gate that ensures the agent's work meets the project's standards. -Everything between these two touch points---worktree creation, implementation, testing, review dispatch, CI repair, and board status updates---is fully agent-managed. - -The pipeline supports batch processing: \texttt{project-pipeline --all} processes every Ready issue in a single invocation, while \texttt{review-pipeline --all} handles all pending reviews. -In batch mode, models are processed before rules to ensure that newly added problem types are available when subsequent rule implementations reference them. -Each issue is processed in its own worktree, ensuring isolation: a failure on one issue does not affect others, and the maintainer's working directory is never modified. - -\section{Multi-Layered Verification}\label{sec:verification} - -The architectural pillars described in \Cref{sec:architecture} prevent many agent errors at compile time, and the skill system in \Cref{sec:skills} ensures that agents follow prescribed workflows. -But neither mechanism is sufficient on its own: the type system cannot catch a logically incorrect reduction mapping, and skills cannot guarantee that an agent's implementation matches its mathematical specification. -This section describes a seven-layer verification stack that addresses errors across the full spectrum of abstraction, from type mismatches to flawed proof arguments (see~\Cref{fig:verification}). +\textbf{Stage~1: Propose.} +A domain expert---who need not know the codebase or even the programming language---invokes the \texttt{propose} skill. +The agent conducts an interactive brainstorming session using only mathematical language, asking one question at a time: what problem, what motivation, what is the formal definition, what does a worked example look like? +Crucially, the agent first analyzes the graph's topology to identify the most valuable contributions (\Cref{fig:topology}). +Three categories guide the analysis: +\emph{orphan nodes}---problem types with no reductions to or from any other node, contributing nothing to the graph; +\emph{redundant rules}---direct reductions whose overhead is dominated by a cheaper composite path through intermediate nodes; +and \emph{missing proof paths}---problems with no reduction chain from 3-SAT, leaving their NP-hardness unproven within the graph. +The agent ranks proposals by priority: rules that connect orphans or fill proof-chain gaps are suggested first. +Before filing the issue, the agent runs the quality checks from Stage~2 on the draft, catching problems before they reach review. \begin{figure}[t] \centering - \includegraphics[width=\columnwidth]{figures/verification-pyramid.pdf} - \caption{Seven-layer verification stack. Lower layers (blue) are fully automated; upper layers (gold) involve human-readable arguments.} - \label{fig:verification} + \includegraphics[width=\columnwidth]{figures/topology-issues.pdf} + \caption{Three graph topology issues detected by the \texttt{topology-sanity-check} skill. + \textbf{(a)}~An orphan node has no edges and cannot reach any solver. + \textbf{(b)}~A direct reduction is redundant when a composite path through~$B$ has equal or lower overhead. + \textbf{(c)}~Problems without a path from 3-SAT lack a machine-verifiable NP-hardness proof in the graph.} + \label{fig:topology} \end{figure} -\subsection{The Verification Stack}\label{sec:stack} +\textbf{Stage~2: Validate.} +The \texttt{check-issue} skill applies four independent tests to every proposal: \emph{usefulness} (does a reduction path already exist? is this one dominated by a cheaper composite path?), \emph{non-triviality} (is this a genuine structural transformation, not a variable substitution?), \emph{correctness} (do the cited references exist and support the claims?), and \emph{writing quality} (are all symbols defined, all template sections complete, all examples fully worked?). +References are verified through a fallback chain: project bibliography, then web search---never hallucinated. +Only proposals passing all four checks receive a \texttt{Good} label and proceed. -\Cref{tab:verification} summarizes the seven layers, each targeting a distinct class of error. -The layers are ordered by abstraction: lower layers are fully automated and fast (compile-time or test-time), while upper layers involve human-readable artifacts and agentic judgment. -We describe each layer with a concrete error example drawn from actual agent failures observed during the project. +\textbf{Stage~3: Implement.} +The \texttt{issue-to-pr} skill converts a validated issue into a pull request. +It enforces a strict \emph{one item per PR} rule: a reduction rule cannot be bundled with its source model, because the model must exist on main before the rule can be tested. +The skill reads the issue and all its comments, researches references to resolve ambiguities, generates an implementation plan, and dispatches to the appropriate implementation skill (\texttt{add-model} or \texttt{add-rule}). +Each implementation skill encodes a complete checklist---9~items for rules, 11~for models---that must be satisfied before any code is written. -\begin{table*}[t] - \centering - \caption{Seven-layer verification stack. Each layer catches a distinct class of agent error that lower layers miss.} - \label{tab:verification} - \begin{tabular}{@{}clll@{}} - \toprule - Layer & Mechanism & Example Error Caught \\ - \midrule - 1 & Rust type system & Agent returns \texttt{bool} instead of \texttt{SolutionSize} from \texttt{evaluate()} \\ - 2 & Unit tests (\texttt{test\_*\_basic}) & Agent evaluates MaxCut objective with wrong sign \\ - 3 & Closed-loop tests (\texttt{test\_*\_to\_*\_closed\_loop}) & SAT$\to$MIS maps clause variables to wrong vertex indices \\ - 4 & Overhead validation (symbolic expr vs.\ actual sizes) & Agent writes \texttt{num\_edges = num\_clauses} instead of \texttt{3 * num\_clauses} \\ - 5 & Materialized fixtures (JSON ground truth) & Agent changes expected QUBO matrix to make failing test pass \\ - 6 & Agentic review (parallel subagents) & Missing \texttt{declare\_variants!}, wrong file naming convention \\ - 7 & Documentation (proof sketch in paper) & Proof assumes connected graph but problem allows disconnected \\ - \bottomrule - \end{tabular} -\end{table*} +\textbf{Stage~4: Review.} +Two parallel sub-agents, each operating in a fresh context window, review the implementation: one checks structural completeness (file registration, macro usage, test coverage), the other checks code quality and semantic correctness against the original issue specification. +Fresh context prevents the confirmation bias that arises when an agent reviews its own work. +CI failures trigger up to three automated fix-and-retry cycles. -\paragraph{Layer 1: Type system.} -Rust's type system, augmented by the trait hierarchy and proc macros described in \Cref{sec:architecture}, serves as the first line of defense. -The \texttt{Problem} trait's associated type \texttt{Metric} forces every implementation to commit to either \texttt{SolutionSize} (optimization) or \texttt{bool} (satisfaction) at definition time. -An agent that accidentally returns a boolean from an optimization problem's \texttt{evaluate()} method receives a compile error, not a silent logic bug. -The \lstinline{#[reduction(overhead)]} macro validates variable names against getter methods, so an agent cannot reference a field from the wrong problem type. -This layer is the fastest and cheapest: errors are caught in seconds, before any test runs. +\textbf{Stage~5: Merge.} +The maintainer makes the final quality judgment and merges. +This is one of only two human decisions in the pipeline; the other is moving an issue from Backlog to Ready (Stage~3). -\paragraph{Layer 2: Unit tests.} -Each problem implementation includes \texttt{test\_*\_basic} tests that construct small instances and verify that \texttt{evaluate()} returns the expected metric values. -These tests catch semantic errors that the type system cannot: for example, an agent implementing \texttt{MaxCut} might negate the edge-weight sum, producing a valid \texttt{SolutionSize} that is nonetheless wrong. -Serialization round-trip tests (\texttt{test\_*\_serialization}) ensure that problem instances survive JSON encoding and decoding, catching subtle issues with graph representation or weight ordering. +\subsection{Why Skills, Not Prompts}\label{sec:skills} -\paragraph{Layer 3: Closed-loop tests.} -The closed-loop test pattern---reduce a source instance, solve the target by brute force, extract the solution back, verify optimality against the source---is the workhorse of the verification stack. -Every reduction rule has a \texttt{test\_*\_to\_*\_closed\_loop} test that exercises this full round trip. -This layer catches the most mathematically subtle errors: incorrect variable-to-vertex mappings, off-by-one errors in clause indexing, forgotten negation in constraint encoding. -The test requires no problem-specific oracle; it relies entirely on the \texttt{evaluate()} interface and the brute-force solver, which means a single skill can generate this test for any reduction. -In our experience, this layer catches the largest share of errors that survive type checking. +Skills differ from per-invocation prompts in three ways that matter for sustainability. -\paragraph{Layer 4: Overhead validation.} -The overhead expressions declared in the \lstinline{#[reduction]} macro (e.g., \texttt{num\_edges = "3 * num\_clauses"}) provide a second, independent correctness check on reductions. -After constructing the target problem, the test harness evaluates the symbolic overhead expression using the source problem's getter methods and compares the result against the actual size of the target. -A mismatch indicates either a bug in the reduction (the constructed target does not have the expected size) or an error in the formula (the agent wrote a wrong expression that happens to be type-correct). -This layer is particularly effective at catching errors in reductions that involve non-obvious size relationships---for instance, a reduction from 3-SAT to MIS that creates a gadget graph where the number of edges is quadratic in the number of clauses, not linear. +\emph{Skills are versioned.} +They are committed to the repository and evolve through pull requests, just like code. +A bug in a skill is fixed once and benefits all future invocations. +A per-invocation prompt must be re-crafted each time, and improvements are lost. -\paragraph{Layer 5: Materialized fixtures.} -Ground-truth test data is stored as JSON files in \texttt{tests/data/} and committed to version control separately from reduction implementations. -Integration tests load these fixtures and compare computed results against the expected values. -This layer exists specifically to counter a failure mode we call the ``lazy agent'' problem (discussed in \Cref{sec:why-layers}): because the fixtures are committed independently, any agent modification to expected values shows up as a diff in a separate file, making it visible during code review. -The QUBO ground-truth matrices, generated by an independent Python script, are an example: they verify that the Rust implementation's matrix construction matches a reference implementation, catching systematic errors that the round-trip test might miss (e.g., a reduction that produces the wrong QUBO matrix but still happens to yield the correct optimum on small instances). +\emph{Skills are compositional.} +Orchestration skills invoke implementation skills, which invoke quality gates. +A single \texttt{project-pipeline} command triggers the full cascade---pick a task from the project board, create an isolated workspace, validate the issue, implement, test, review, document, and produce a pull request. +The maintainer's effort scales with the number of \emph{skill types}, not the number of tasks. -\paragraph{Layer 6: Agentic review.} -The \texttt{review-implementation} skill dispatches two parallel subagents---one checking structural completeness (file naming, module registration, \texttt{declare\_variants!} macro, test naming conventions) and one assessing code quality (edge cases, documentation, idiomatic Rust). -Each subagent operates in a fresh context window, without access to the implementing agent's conversation history, which prevents confirmation bias. -This layer catches errors that are invisible to automated tests: a reduction implementation might pass all tests but use a non-standard file name that breaks the documentation build, or omit a variant registration that leaves a gap in the reduction graph. -The fresh-context design is deliberate: an agent reviewing its own work in the same context window tends to overlook the same assumptions it made during implementation. - -\paragraph{Layer 7: Documentation.} -Every reduction in the library has a corresponding entry in a Typst paper that includes a formal problem definition, a statement of the reduction rule, and a proof sketch. -The paper's completeness checker automatically verifies that every node and edge in the exported reduction graph has a corresponding documentation entry, flagging gaps as warnings. -This layer catches errors that no automated test can: a proof sketch might reveal an unstated assumption (e.g., that the input graph is connected) that the implementation silently relies on, or expose a logical gap in the reduction argument that happens to be masked by the small test instances used in Layers 2--3. -The requirement to write a human-readable proof forces the agent to articulate the mathematical reasoning behind the reduction, serving as a form of self-verification. - -\subsection{Why Layers?}\label{sec:why-layers} - -No single verification layer is sufficient. -The type system (Layer~1) catches API misuse but is blind to logical errors in the reduction mapping. -Closed-loop tests (Layer~3) verify functional correctness on specific instances but cannot check whether overhead formulas are accurate or whether the mathematical argument generalizes beyond the test cases. -Documentation (Layer~7) catches proof-level mistakes but depends on the human reader's diligence. -The layers are designed to be \emph{complementary}: each layer's blind spots are covered by another layer's strengths. - -The need for layered verification becomes acute when agents optimize for the wrong objective. -We call this the \textbf{lazy agent problem}: given a failing test, an agent may modify the expected output rather than fix the underlying implementation. -This is rational behavior from the agent's perspective---the issue asks for a passing test suite, and changing the expected value is the shortest path to that goal. -We observed this failure mode multiple times during early development: an agent implementing a QUBO reduction encountered a failing integration test, examined the expected JSON matrix, and ``corrected'' it to match its (incorrect) output. - -Materialized fixtures (Layer~5) are the primary defense against this failure mode. -Because ground-truth data is generated by an independent script (typically Python) and committed in a separate step, any agent modification to the expected values produces a visible diff in a file that the agent was not asked to change. -Code review---whether human or agentic (Layer~6)---then flags the unexpected modification. -This design transforms a subtle correctness violation (wrong expected value, all tests pass) into an obvious process violation (agent modified a file outside its scope), which is much easier to detect. - -More broadly, the layered design reflects a defense-in-depth philosophy borrowed from security engineering. -Just as network security does not rely solely on firewalls or solely on encryption, verification of agent-generated mathematical code should not rely solely on tests or solely on type checking. -Each layer adds an independent probability of catching an error, and the layers' error-detection capabilities are largely orthogonal---an error that slips through the type system is likely caught by closed-loop tests, and an error that passes all automated tests may be caught by the documentation review. -The skill system (\Cref{sec:skills}) ensures that agents invoke all seven layers as part of every implementation task, so no layer is accidentally skipped. +\emph{Skills encode domain knowledge that cannot be inferred from code.} +The \texttt{add-rule} skill specifies that overhead expressions must use getter methods matching the source type, that examples must have a \lstinline{pub fn run()} entry point, and that the paper entry requires a proof sketch. +An agent reading only the codebase might infer some conventions but would miss others---the skill makes all conventions explicit. -\section{Evaluation}\label{sec:evaluation} +The library comprises 14~skills in six categories: orchestration~(3), community contribution~(1), implementation~(2), quality gates~(4), documentation~(2), and onboarding~(2). -We evaluate the skill-based methodology through three complementary lenses: an ablation study design comparing skill-based and no-skill agent configurations (\Cref{sec:ablation}), a longitudinal analysis of the project's git history (\Cref{sec:mining}), and detailed case studies of three reductions spanning the complexity spectrum (\Cref{sec:cases}). -Together, these provide qualitative insight into how skills and verification layers interact during real implementation tasks, with the ablation design offering a replicable protocol for future quantitative evaluation. +\subsection{Correctness by Construction}\label{sec:verification} -\paragraph{Agent platform.} -All agent-assisted development was performed using Claude Code~\cite{Anthropic2025ClaudeCode}, Anthropic's terminal-based coding agent, backed by Claude models (Sonnet~3.5 and Sonnet~4 during the development period; the model version evolved during the seven-week span). -Skills are invoked as slash commands within Claude Code sessions. -We note that the methodology is not inherently tied to a specific model or agent platform---the skills are plain markdown documents that any sufficiently capable coding agent could follow---but our empirical observations reflect the capabilities of the Claude model family. +Correctness assurance comes from a seven-layer verification stack (\Cref{app:verification}). +The key insight is that different layers catch different classes of errors, and no single layer suffices. -\subsection{Ablation: Skill-Based vs.\ Raw Agent}\label{sec:ablation} +\textbf{Layers~1--2} (type system and unit tests) catch structural errors cheaply. +The type system prevents entire categories of mistakes: an agent cannot implement a reduction without providing solution extraction, reference nonexistent problem attributes in overhead expressions, or omit required variant declarations. +These constraints are enforced at compile time by procedural macros that validate expression variable names against actual getter methods. -To isolate the effect of skill-based task decomposition from the broader methodology, we design a controlled comparison between two agent configurations operating on identical tasks within the same codebase. +\textbf{Layer~3} (round-trip tests, described in \Cref{sec:roundtrip}) is the workhorse, catching the most mathematically subtle errors---incorrect variable mappings, off-by-one indexing, forgotten negations---without per-reduction test logic. -\paragraph{Setup.} -We select a sample of 5--10 reduction rules spanning the complexity spectrum---from near-trivial complement relationships (e.g., MinimumVertexCover $\to$ MaximumIndependentSet, $\sim$96~LOC) through moderate gadget constructions (e.g., Satisfiability $\to$ MaximumIndependentSet, $\sim$171~LOC) to complex circuit-based encodings (e.g., Factoring $\to$ CircuitSAT, $\sim$272~LOC). -For each reduction, we prepare identical GitHub issues with the same problem specification, mathematical description, and acceptance criteria. -We then run each issue through two configurations: -\begin{itemize} - \item \textbf{Skill-based}: the full pipeline described in \Cref{sec:skills}, including the \texttt{issue-to-pr} orchestration skill, the \texttt{add-rule} implementation skill, \texttt{review-implementation} with parallel sub-agents, and \texttt{fix-pr} for CI repair. - \item \textbf{No-skill baseline}: the same Claude Code agent operating on the same codebase with access to the project's \texttt{CLAUDE.md} for context, but without any skill files. The agent receives the issue text and must determine its own workflow. -\end{itemize} +\textbf{Layer~4} (overhead validation) compares symbolic size expressions against actual target sizes. +An agent might implement a correct reduction but declare the wrong overhead formula. +This layer catches formula errors that are type-correct but mathematically wrong. -\paragraph{Metrics.} -We measure four outcomes for each configuration: (1)~\emph{first-attempt CI pass rate}---whether the initial PR passes all CI checks without intervention; (2)~\emph{review rounds}---the number of review-fix cycles before merge readiness; (3)~\emph{correctness}---whether the final implementation passes all round-trip tests; and (4)~\emph{convention adherence}---whether the code follows project conventions (file naming, macro usage, documentation structure, test patterns). +\textbf{Layer~5} (materialized fixtures) addresses the ``lazy agent'' problem: agents that modify expected test outputs to make tests pass rather than fixing implementations. +Ground-truth data is generated independently and committed separately; tampering produces a visible diff in a file outside the agent's normal scope. + +\textbf{Layers~6--7} (fresh-context review and documentation) catch errors invisible to automated tests. +Parallel sub-agents operating in fresh context windows prevent confirmation bias. +Proof sketches force articulation of mathematical arguments; a completeness checker ensures every graph edge is documented. -\paragraph{Framing.} -With $n = 5$--$10$, this ablation is a \emph{controlled illustration} of the skill-based approach's value, not a statistically powered experiment. -The results are intended to demonstrate the \emph{mechanism}---how skills prevent specific error classes by encoding domain knowledge into the agent's workflow---rather than to establish effect sizes with confidence intervals. -We expect the skill-based configuration to excel particularly on convention adherence (where skills encode project-specific patterns that no general-purpose agent would know) and first-attempt CI pass rate (where the multi-layered verification invoked by skills catches errors before the first push). -The raw agent, by contrast, is likely to produce functionally correct code that nonetheless fails CI due to missing \texttt{declare\_variants!} macros, incorrect overhead expressions, or test files placed in the wrong directory. +The skill system ensures all seven layers are invoked for every task. +Without skills, an agent might skip overhead validation or omit the paper entry---errors that would accumulate silently over many contributions. -We have not yet executed this ablation; it requires running both configurations on held-out issues and measuring the four metrics above. -We present the experimental design here because we believe it constitutes a replicable protocol for evaluating skill-based methodologies in other domains. -The git history mining in \Cref{sec:mining} provides complementary longitudinal evidence across the project's full development timeline, and the case studies in \Cref{sec:cases} offer qualitative insight into how skills and verification layers interact during specific implementation tasks. +%====================================================================== +% SECTION 4: EVALUATION +%====================================================================== +\section{Evaluation}\label{sec:evaluation} -\subsection{Git History Mining}\label{sec:mining} +We evaluate through development metrics from the project's history, an analysis of the automated quality gate, and case studies showing how verification layers interact. +All development used Claude Code~\cite{Anthropic2025ClaudeCode} with Claude models (Sonnet~3.5 and Sonnet~4; the model version evolved during development). +Skills are plain markdown documents portable to any coding agent. -We analyze the complete git and pull request history of the \texttt{problem-reductions} repository, supplemented by session metadata from the Claude Code development environment, to characterize the project's evolution and the types of errors encountered during development. -The repository contains 59~merged pull requests and 253~commits on main spanning nine weeks of development (January~9 to March~13, 2026), authored by four contributors. +\subsection{Development Metrics}\label{sec:metrics} -\paragraph{Development metrics.} -The Claude Code session metadata reveals the scale of agent involvement. -Across 283~recorded sessions (300~MB of conversation transcripts), the agent produced 9,429~assistant messages in response to 630~user messages---a \textbf{15:1 automation amplification ratio}. -Of the 1,089~commits across all branches, 1,510~carry a \texttt{Co-Authored-By: Claude} trailer (the count exceeds total commits because branch commits are squash-merged on main). -The average session involved 5.8~user messages and 51~tool calls, with measured wall-clock time totaling 115~hours across the 108~sessions with timing metadata. +The repository contains 59~merged pull requests and 253~commits on main spanning nine weeks (January~9 to March~13, 2026), authored by four contributors. +Session metadata across 283~agent sessions (300~MB of conversation transcripts) reveals the scale of agent involvement: 9,429~assistant messages in response to 630~user messages---a \textbf{15:1 automation amplification ratio}. +The average session involved 5.8~user messages and 51~tool calls, with measured wall-clock time totaling 115~hours across 108~sessions with timing data. -The codebase grew rapidly under this agentic workflow: -\begin{center} +\begin{table}[t] +\caption{Codebase growth timeline.}\label{tab:growth} +\centering \small \begin{tabular}{@{}lcccc@{}} \toprule @@ -509,29 +316,36 @@ \subsection{Git History Mining}\label{sec:mining} Jan 10 (initial) & 17 & 0 & 0 & 0 \\ Jan 26 (feature parity) & 20 & 22 & 0 & 1 \\ Feb 15 (arch.\ redesign) & 21 & 44 & 101 & 35 \\ -Mar 13 (current) & 27 & 50 & 114 & 45 \\ +Mar 13 (current) & 27 & 45 & 114 & 45 \\ \bottomrule \end{tabular} -\end{center} -The most intensive development day (January~25) produced 41~commits across 22~sessions with 12,868~messages and 3,734~tool calls---this was the feature parity sprint that ported all reduction rules from the predecessor Julia package in a single day, growing the codebase from 36~to 74~Rust source files. +\end{table} -\paragraph{Development phases.} -We stratify the history into three phases reflecting the evolution of agent tooling: -\begin{itemize} - \item \textbf{Phase~1 (Manual)}: 35~PRs. Skills had not yet been developed; all implementation, testing, and review was performed manually or with ad-hoc agent assistance. This phase established the core library architecture and the majority of the reduction rules. - \item \textbf{Phase~2 (Basic skills)}: 9~PRs. Initial skills for model and rule implementation were available, providing structured workflows but without full pipeline automation. Two new problem models (ClosestVectorProblem, BinPacking) were added during this phase. - \item \textbf{Phase~3 (Full pipeline)}: 15~PRs. The complete skill library was operational, including orchestration skills (\texttt{project-pipeline}, \texttt{review-pipeline}, \texttt{issue-to-pr}), quality gates (\texttt{check-issue}, \texttt{check-rule-redundancy}), and multi-agent review (\texttt{review-implementation}). New models (Knapsack, GraphPartitioning, MinimumFeedbackVertexSet) and rules (KSatisfiability $\to$ SubsetSum) were added with full pipeline support. -\end{itemize} +\Cref{tab:growth} traces the growth across three phases. +\textbf{Phase~1 (Manual, 35~PRs)}: no skills; the maintainer issued step-by-step commands, established the architecture, and ported reductions from the predecessor Julia package. +\textbf{Phase~2 (Basic skills, 9~PRs)}: initial \texttt{add-model}/\texttt{add-rule} skills reduced per-task human involvement. +\textbf{Phase~3 (Full pipeline, 15~PRs)}: complete skill library with orchestration, quality gates, and multi-agent review. +The current codebase comprises 54,599~lines of Rust source, 28,343~lines of tests, and 6,362~lines of examples. -\paragraph{PR composition.} -Of the 59~merged PRs, 10 follow the \texttt{Fix \#N} issue-driven pattern (created by the \texttt{issue-to-pr} skill), 16 are \texttt{feat:} PRs, and the remainder are infrastructure, refactoring, documentation, or tooling PRs. -The initial feature-parity PRs (e.g., PR~\#4 ``Feature parity with ProblemReductions.jl'') bundled multiple models and rules into single large PRs before the per-issue convention was established. -All PRs are attributed to human GitHub accounts because the agent operates through the developer's local environment (Claude Code runs in the terminal), making the distinction between human-authored and agent-assisted work invisible in git metadata---itself a finding about observability limitations of current agentic workflows. +\paragraph{Interaction evolution.} +Analysis of 2,196~user prompts reveals a shift from imperative to declarative interaction as skills matured. +In Phase~1, prompts averaged 8--12 words (e.g., ``implement Satisfiability to MIS reduction''). +By Phase~3, 30\% of prompts were 1--3 words (e.g., ``\texttt{make run-pipeline}''). +This progression---from specifying actions to invoking skills---mirrors the classical shift from scripting to API design. + +\paragraph{Observability.} +All pull requests are attributed to human GitHub accounts because the agent operates through the developer's local terminal. +Of 1,089~commits across all branches, 1,510~carry a \texttt{Co-Authored-By: Claude} trailer (the count exceeds total commits because branch commits are squash-merged on main). +This observability gap---the difficulty of distinguishing human-authored from agent-assisted work in git metadata---is itself a finding about current agentic workflows. + +\subsection{Issue Quality Gate}\label{sec:quality-gate} -\paragraph{Issue quality gate.} -The \texttt{check-issue} skill was deployed at scale when a contributor batch-submitted 414~issues proposing new problem models and reduction rules---including 251~in a single day. -Of the 322~issues quality-checked, only \textbf{81 (25\%) passed} all checks: -\begin{center} +The \texttt{check-issue} skill was stress-tested when a contributor batch-submitted 414~issues---including 251~in a single day---proposing new reductions for the graph. +Of the 322~issues checked, only \textbf{81~(25\%) passed} all quality criteria (\Cref{tab:quality-gate}). + +\begin{table}[t] +\caption{Issue quality gate results (322 checked).}\label{tab:quality-gate} +\centering \small \begin{tabular}{@{}lr@{}} \toprule @@ -540,225 +354,243 @@ \subsection{Git History Mining}\label{sec:mining} Good & 81 (25\%) \\ PoorWritten (incomplete specification) & 124 (39\%) \\ Wrong (factually incorrect) & 64 (20\%) \\ -Trivial (obvious reduction) & 43 (13\%) \\ -Useless (no practical value) & 18 (6\%) \\ +Trivial (obvious, adds no value) & 43 (13\%) \\ +Useless (no practical application) & 18 (6\%) \\ \bottomrule \end{tabular} -\end{center} -The \textbf{75\% rejection rate} demonstrates the necessity of automated quality gates: without \texttt{check-issue}, the pipeline would waste agent resources implementing incorrect or trivial reductions. -The most common failure mode was \emph{PoorWritten}---issues that lacked complete mathematical specifications or worked examples, making them unimplementable even by a skilled agent. -The \emph{Wrong} category (20\%) included citations to non-existent papers, incorrect complexity claims, and reductions that do not preserve solution structure---errors that would have been expensive to discover during implementation rather than at the issue-triage stage. +\end{table} -\paragraph{Interaction evolution.} -Analysis of 2,196~user prompts in the project's Claude Code history reveals a shift from imperative to declarative interaction as the skill library matured. -In Phase~1, the maintainer issued step-by-step commands (``start milestone 2,'' ``improve test coverage to $>$95\%,'' ``fix the clippy test''), averaging 8--12 words per prompt. -By Phase~3, single-command orchestration dominated (``\texttt{make run-pipeline},'' ``\texttt{/review-pipeline}''), with 30\% of prompts consisting of 1--3 words. -This progression---from programming the agent's actions to invoking its skills---mirrors the classical shift from scripting to API design, suggesting that skill engineering is a form of \emph{meta-programming} for agentic workflows. +The \textbf{75\% rejection rate} demonstrates the necessity of automated quality gates in agentic pipelines. +Without \texttt{check-issue}, the pipeline would waste agent compute implementing incorrect or trivial reductions. +The most common failure was \emph{PoorWritten}---issues lacking complete mathematical specifications, making them unimplementable even by a skilled agent. +The \emph{Wrong} category~(20\%) included citations to non-existent papers, incorrect complexity claims, and reductions that do not actually preserve solution structure---errors that would be expensive to discover during implementation rather than at triage. -\paragraph{Error taxonomy.} -\Cref{tab:errors} categorizes the types of errors encountered during development, mapped to the verification layer that catches each error class. -The taxonomy is derived from code review comments, CI failure logs, and commit messages across all three development phases. +\subsection{Case Studies}\label{sec:cases} -\begin{table}[t] -\caption{Error taxonomy by verification layer.}\label{tab:errors} -\centering -\small -\begin{tabular}{@{}llc@{}} -\toprule -Error Category & Catching Layer & Count \\ -\midrule -Type/API mismatch & L1: Type system & [TBD] \\ -Evaluation logic & L2: Unit tests & [TBD] \\ -Mapping/index & L3: Closed-loop tests & [TBD] \\ -Overhead formula & L4: Overhead validation & [TBD] \\ -Test gaming & L5: Materialized fixtures & [TBD] \\ -Convention violation & L6: Agentic review & [TBD] \\ -Incorrect proof & L7: Doc.\ review & [TBD] \\ -\bottomrule -\end{tabular} -\end{table} +\paragraph{Satisfiability $\to$ MIS (gadget construction).} +The classical Karp reduction~\cite{karp1972} works as follows. +Given a Boolean formula in conjunctive normal form (a conjunction of clauses, each a disjunction of literals), create one graph vertex per literal occurrence. +Add edges within each clause (so at most one literal per clause can be selected) and between complementary literals across clauses (so a variable cannot be both true and false). +A satisfying assignment corresponds to an independent set of size~$m$ (the number of clauses). +The implementation spans 171~lines. + +This case illustrates how verification layers interact. +The number of edges in the constructed graph is worst-case \emph{quadratic} in the number of literals---but an agent might assume linear overhead. +Layer~4 (overhead validation) catches this by comparing symbolic expressions against actual target sizes. +Layer~3 (round-trip testing) catches a different class of error: off-by-one mistakes in literal-to-vertex mapping. +Both layers are necessary: correct indices with wrong overhead, or correct overhead with wrong indices, would each pass one layer but fail the other. + +\paragraph{Factoring $\to$ CircuitSAT $\to$ ILP (emergent composition).} +As described in \Cref{sec:compositionality}, this chains two independently implemented reductions (272 + 225~lines) to factor integers via linear programming. +The Factoring $\to$ CircuitSAT step exercises the full verification stack: the multiplier circuit involves $\Theta(mn)$ full-adder cells (where $m$ and $n$ are the bit lengths of the two factors), and errors in carry propagation are caught by Layer~3 while overhead formula errors are caught by Layer~4. + +\paragraph{MVC $\leftrightarrow$ MIS (trivial complement).} +At the opposite extreme, this 96-line reduction exploits the complement relationship described in \Cref{sec:graph} with identity overhead. +The pipeline's primary value here is enforcing conventions---file naming, macro registration, documentation---rather than catching logical errors. + +%====================================================================== +% SECTION 5: RELATED WORK +%====================================================================== +\section{Related Work}\label{sec:related} -The error counts in \Cref{tab:errors} are pending a systematic audit of all CI logs and review threads; we report qualitative observations here and defer the full quantitative analysis. -Preliminary observations from the commit history suggest that \emph{overhead formula errors} (Layer~4) and \emph{convention violations} (Layer~6) are among the most frequently encountered error classes. -For example, PR~\#112 (``Fix complexity inconsistencies, enforce overhead, add missing variants'') addressed multiple overhead formula errors that had accumulated before Layer~4 validation was enforced, and PR~\#89 (``Close completeness gaps from review-implementation audit'') fixed convention violations identified by the agentic review skill. -The introduction of compile-time overhead validation in PR~\#99 (``Replace Polynomial overhead system with Expr AST'') eliminated an entire class of errors by shifting overhead checking from runtime to compile time---an example of how verification infrastructure co-evolves with the skill system. +\paragraph{AI coding agents.} +The evolution from SWE-agent~\cite{Yang2024SWEagent} and Devin~\cite{Wu2024Devin} to OpenHands~\cite{Wang2024OpenHands} and Claude Code~\cite{Anthropic2025ClaudeCode} has expanded single-task capabilities to 70--80\% on SWE-Bench~\cite{Xia2025LiveSWEagent}, but longer-horizon benchmarks reveal a capability cliff~\cite{Thai2025SWEEVO, Deng2025SWEBenchPro}. +Roychoudhury~\cite{Roychoudhury2025AgenticAI} identifies specification inference as the central difficulty. +Our approach structures work so each unit falls within the agent's reliable range, complementing architectural advances in agent design. -\subsection{Case Studies}\label{sec:cases} +\paragraph{AI-discovered reductions.} +FunSearch~\cite{RomeraParedes2023FunSearch} and AlphaEvolve~\cite{Novikov2025AlphaEvolve} discover novel algorithms and reductions through evolutionary search, including improved bounds for combinatorial problems. +Jani\v{c}i\'{c}'s URSA~\cite{Janicic2025URSA} uses SAT-based constraint solving to verify reductions. +Our work is complementary: we implement and verify \emph{known} reductions; discovered reductions could feed into our pipeline as new issues. -We examine three reductions that illustrate different points on the complexity spectrum, highlighting how the skill-based pipeline and verification stack interact in each case. +\paragraph{Formal verification of generated code.} +VeriCoding~\cite{Bursuc2025VeriCoding} reports 27--44\% success on 12,504~formal specifications. +CLEVER~\cite{Thakur2025CLEVER} establishes hard Lean benchmarks. +VeriBench~\cite{Miranda2025VeriBench} finds self-optimizing agents approach 90\% compilation in Lean~4. +Mukherjee et al.~\cite{Mukherjee2025CoqPL, Mukherjee2025SynVer} demonstrate a two-LLM generate-then-verify pattern. +Our seven-layer stack trades full formal guarantees for practical effectiveness at the scale of 45~reductions. -\paragraph{Case 1: MinimumVertexCover $\to$ MaximumIndependentSet (simple).} +\paragraph{Physics-inspired optimization.} +Schuetz et al.~\cite{Schuetz2022PhysicsGNN} solve QUBO at million-variable scale via graph neural networks. +He~\cite{He2024QuantumTSP} combines quantum annealing with GNNs for the Traveling Salesman Problem. +These approaches assume a QUBO or Ising formulation as input---precisely the transformation that our reduction graph provides as upstream infrastructure. -This reduction exploits the classical complement relationship: a set $S$ is an independent set in graph $G$ if and only if $V \setminus S$ is a vertex cover. -The implementation is correspondingly minimal at 96~lines of Rust, including both directions of the reduction (MIS $\to$ MVC and MVC $\to$ MIS). -The \texttt{reduce\_to()} method simply copies the graph and weights to the target problem type, and the \texttt{extract\_solution()} method flips each bit of the configuration vector. +%====================================================================== +% SECTION 6: DISCUSSION AND CONCLUSION +%====================================================================== +\section{Discussion and Conclusion}\label{sec:conclusion} -The overhead expressions are identity mappings (\texttt{num\_vertices = "num\_vertices"}, \texttt{num\_edges = "num\_edges"}), reflecting the fact that the target problem has exactly the same graph as the source. -This reduction was part of the initial batch implemented in PR~\#7 (``Implement remaining reduction rules'') during Phase~1, before skills were available. +\subsection{When Does This Methodology Apply?} -This case illustrates the \emph{lower bound} of the complexity spectrum. -The mathematical content is trivial, the implementation is mechanical, and the verification layers activate without incident. -For such reductions, the skill-based pipeline's primary value is in enforcing conventions (correct file naming, \texttt{declare\_variants!} macro placement, documentation entries) rather than catching logical errors. +The approach rests on what we call a \emph{Goldilocks property}: the target domain must have tasks that are (1)~formally specified, (2)~decomposable into homogeneous subtasks, (3)~equipped with automatable correctness criteria, and (4)~demanding enough to require both human creativity and mechanical execution. +NP-hard reductions satisfy all four, but are not unique. +Algorithm libraries, numerical linear algebra routines, compiler optimization passes, and hardware description languages share this structure. +In each case, a domain expert can encode the ``how'' into reusable skills while the ``what'' remains a human judgment call. -\paragraph{Case 2: Satisfiability $\to$ MaximumIndependentSet (complex).} +The methodology does \emph{not} generalize to heterogeneous tasks---the staple of SWE-Bench---where each issue is structurally unique and resists skill-based decomposition. -This is the classical Karp reduction from Boolean satisfiability to maximum independent set~\cite{karp1972}. -Given a CNF formula with $m$~clauses, the reduction creates one vertex for each literal occurrence in each clause, adds clique edges within each clause (ensuring at most one literal per clause is selected), and adds conflict edges between complementary literals across clauses (ensuring consistency). -A satisfying assignment corresponds to an independent set of size exactly~$m$. +\subsection{Limitations} -The implementation spans 171~lines, with the core gadget construction occupying two nested loops: the first builds per-clause cliques (lines 127--142 in the source), and the second adds cross-clause conflict edges by checking literal complementarity (lines 147--153). -The overhead expressions reflect the quadratic worst case: \texttt{num\_vertices = "num\_literals"} and \texttt{num\_edges = "num\_literals\^{}2"}. +\paragraph{Single case study.} +Evidence comes from one project by one primary developer. +Replication across independent projects and teams is needed. -This reduction is instructive because it is precisely the kind of task where agents commonly make errors. -The edge count in the conflict graph is worst-case quadratic in the number of literals (every literal in one clause may conflict with literals in every other clause), but an agent might assume linear overhead if it reasons only about the per-clause structure. -Layer~4 (overhead validation) catches this class of error by comparing the symbolic overhead expression against the actual target problem size on concrete test instances. -The closed-loop test (Layer~3) catches a different class: index-off-by-one errors in the literal-to-vertex mapping, which cause the extracted solution to assign the wrong truth value to a variable. -Both layers are necessary---an implementation with correct indices but wrong overhead, or correct overhead but wrong indices, would pass one layer but fail the other. +\paragraph{Skill engineering cost.} +The 14~skills represent substantial upfront investment---iterative refinement across many agent sessions. +Each new domain requires its own skill engineering effort, though the methodology itself transfers. -\paragraph{Case 3: Factoring $\to$ CircuitSAT $\to$ ILP (composition).} +\paragraph{Confounding factors.} +Both skills and underlying models improved during the nine-week span. +Temporal stratification across the three phases partially addresses this, but we cannot fully disentangle the two contributions. -The most complex case study involves two independently implemented reductions that compose through the reduction graph to solve an end-to-end problem: given an integer $N$, find its non-trivial factors. +\paragraph{Maintainer requirement.} +The pipeline is not fully autonomous: without human judgment at two transitions (selecting work and approving results), the system cannot determine what is worth building or whether results meet standards. +This is by design---but limits applicability to fully autonomous scenarios. -The first reduction, Factoring $\to$ CircuitSAT (272~LOC, PR~\#85 family), constructs an array multiplier circuit. -The circuit takes two bit-vectors $p$ and $q$ as inputs, computes their product through a grid of full-adder cells, and constrains the output to equal the target number~$N$. -Each multiplier cell requires six circuit assignments (AND, XOR, carry logic), so the total overhead scales as $\Theta(mn)$ where $m$ and $n$ are the bit-widths of the two factors. -The \texttt{extract\_solution()} method maps the satisfying circuit assignment back to the two factors by reading the $p$ and $q$ variable values. +\subsection{The Human Value Proposition} -The second reduction, CircuitSAT $\to$ ILP (225~LOC, PR~\#85), linearizes the Boolean circuit into integer linear constraints. -Each Boolean gate (AND, OR, XOR, NOT) is encoded as a set of linear inequalities over binary variables, with the circuit's topological ordering determining the constraint generation sequence. +Humans are \emph{repositioned}, not eliminated. +Creative and judgment-intensive work---which problems matter, what quality bar to enforce, how to architect the system---remains human. +Agents absorb mechanical volume: implementing boilerplate, writing tests, generating documentation, fixing CI failures. +The human contribution shifts from writing code to \emph{programming the agent's workflow}---a higher-leverage activity that scales with the number of tasks the agent can execute. -Neither reduction was designed with composition in mind---each was implemented to connect its source and target in the reduction graph. -Yet the graph infrastructure enables automatic composition: a user with a Factoring instance can query the graph for a path to ILP, chain the two reductions, solve the resulting integer program with an off-the-shelf solver, and extract factors through the composed inverse maps. -This composition works because each reduction's \texttt{ReductionResult} trait provides a type-safe \texttt{extract\_solution()} method, and the graph's path-finding API composes these extractors in reverse order. +\subsection{The Scaling Vision} -This case highlights the \emph{emergent value} of the reduction graph as compilation infrastructure. -The two reductions were implemented in separate PRs, potentially by different contributors, with different verification layer activations. -The graph composes them into a pipeline that no single implementation task created. -This compositionality is precisely the property that makes the Goldilocks domain valuable: individual tasks are bounded and verifiable, but the graph as a whole provides capabilities that exceed the sum of its parts. +The graph's value grows superlinearly with its size. +Each new edge creates not just one connection but composite paths through the entire graph. +As the graph scales from 27 to 100+ problem types through agent-synthesized rules (\Cref{fig:reduction-graph}, top layer), it evolves from a library into a \emph{reduction compiler}: a user describes their problem, and the compiler selects the lowest-cost verified path to the target solver. -\section{Related Work}\label{sec:related} +Three directions extend this work. +First, composing with \emph{automated discovery}: evolutionary search methods like AlphaEvolve~\cite{Novikov2025AlphaEvolve} discover new reductions, and our pipeline implements and verifies them. +Second, \emph{formal verification}: supplementing round-trip tests and proof sketches with machine-checked Lean or Coq proofs. +Third, \emph{cost-aware path selection}: each edge carries a polynomial cost model describing the size blowup, and the optimal path may depend on instance scale. -Our work draws on and contributes to four active research areas: AI coding agents and their benchmarks, AI-assisted discovery of reductions and complexity results, formal verification of AI-generated code, and physics-inspired optimization via problem reformulation. +\subsection{Conclusion} -\paragraph{AI coding agents.} -The rapid evolution of AI coding agents---from SWE-agent's Agent-Computer Interface design~\cite{Yang2024SWEagent} and Devin's fully autonomous environment~\cite{Wu2024Devin} to production-grade platforms like OpenHands~\cite{Wang2024OpenHands} and Claude Code~\cite{Anthropic2025ClaudeCode}---has dramatically expanded what agents can accomplish on isolated software engineering tasks. -On SWE-Bench Verified, which evaluates single-issue bug fixes, the best systems now resolve 70--80\% of issues, with Live-SWE-agent's self-evolving scaffold reaching 77.4\%~\cite{Xia2025LiveSWEagent}. -However, benchmarks probing longer-horizon capabilities reveal a stark capability cliff: SWE-EVO reports resolution rates around 21\% on multi-step modifications spanning an average of 21~files~\cite{Thai2025SWEEVO}, and SWE-Bench Pro finds similar struggles with enterprise-level tasks that may require hours to days of human effort~\cite{Deng2025SWEBenchPro}. -Roychoudhury~\cite{Roychoudhury2025AgenticAI} identifies specification inference---deciphering developer intent---as the central difficulty in agentic software workflows, arguing that trustworthy deployment requires AI-based verification and validation of AI-generated code. -Industry data corroborates the need for human oversight: developers now use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding}. -Our approach responds to these findings not by pushing for more powerful agents, but by structuring the work so that each unit falls within the agent's reliable operating range. -The skill-based decomposition is complementary to architectural advances like Live-SWE-agent's self-evolution~\cite{Xia2025LiveSWEagent}: skills encode human-authored domain knowledge that makes agents more effective on hard, domain-specific tasks, while self-evolving scaffolds improve the agent's general-purpose tool use. +We have presented a verified reduction graph connecting 27~NP-hard problem types to specialized solvers, built through skill-based agentic coding that separates human-creative specification from agent-managed execution. +The graph exhibits emergent compositionality: independently implemented reductions compose automatically to solve problems no single implementation was designed for. +The core insight is that the bottleneck in agentic coding is not agent capability but task decomposition: when work is structured so each unit is formally specified, bounded in scope, and mechanically verifiable, current agents execute it reliably. +The methodology is most effective in domains that share this Goldilocks property---and we believe such domains are more common than is generally appreciated. -\paragraph{AI-discovered reductions.} -A parallel line of work uses AI not to \emph{implement} known reductions but to \emph{discover} new ones. -DeepMind's FunSearch~\cite{RomeraParedes2023FunSearch} demonstrated that LLM-powered evolutionary program search can produce genuinely novel mathematical constructions, discovering new cap set bounds that surpassed prior state-of-the-art results. -Its successor, AlphaEvolve~\cite{Novikov2025AlphaEvolve}, extends this approach to a broader class of problems, including the discovery of new gadget reductions that establish improved NP-hardness bounds for MAX-3-CUT, MAX-4-CUT, and the metric Traveling Salesman Problem. -On the formal side, Jani\v{c}i\'{c}'s URSA system~\cite{Janicic2025URSA} uses SAT-based constraint solving to specify, analyze, and verify reductions between NP-complete problems---a rigorous but narrower approach that handles the verification component that evolutionary methods lack. -Our work is complementary: AlphaEvolve and FunSearch discover novel reductions algorithmically, while our pipeline implements and verifies \emph{known} reductions using agents guided by human-authored specifications. -The two approaches could be composed---new reductions discovered by evolutionary search could be specified as issues in our pipeline and implemented with full verification---though this integration remains future work. - -\paragraph{Formal verification of AI-generated code.} -End-to-end formally verified code generation remains largely unsolved, particularly for mathematically complex programs. -The VeriCoding benchmark~\cite{Bursuc2025VeriCoding}, the largest of its kind with 12,504 formal specifications across Lean, Dafny, and Verus/Rust, reports success rates of 27\% in Lean and 44\% in Verus/Rust using off-the-shelf LLMs. -CLEVER~\cite{Thakur2025CLEVER}, a curated benchmark of 161 hard Lean problems, finds that even agentic approaches struggle to achieve full verification, establishing it as a challenging frontier. -VeriBench~\cite{Miranda2025VeriBench} finds that only self-optimizing agent architectures achieve meaningful compilation rates in Lean~4, approaching 90\% but still far from full correctness proofs. -For imperative programs, Mukherjee et al.\ demonstrate a two-LLM pipeline where one model generates candidate C programs and another generates Coq proofs of correctness~\cite{Mukherjee2025CoqPL, Mukherjee2025SynVer}---a generate-then-verify pattern that resonates with our layered approach. -Our seven-layer verification stack (\Cref{sec:verification}) takes a more pragmatic path: rather than attempting end-to-end formal proofs (which the benchmarks above show remains out of reach for complex code), we compose multiple lightweight verification mechanisms---type-level enforcement, brute-force cross-validation, overhead formula checking, materialized fixtures, and agentic review---that collectively catch errors across different abstraction levels. -The trade-off is clear: we provide less formal guarantee than a machine-checked proof, but our approach is practically effective at catching real errors in agent-generated mathematical code and scales to the 50~reductions in our library without requiring proof engineering expertise. +\bibliographystyle{IEEEtran} +\bibliography{references} -\paragraph{Physics-inspired optimization.} -Our reduction graph serves as a compilation layer connecting abstract problem formulations to specialized hardware and neural solvers. -Schuetz et al.~\cite{Schuetz2022PhysicsGNN} demonstrate that graph neural networks trained via QUBO Hamiltonian relaxation can solve combinatorial optimization problems---including Maximum Independent Set, MaxCut, and Minimum Vertex Cover---at scales reaching millions of variables, far beyond the reach of exact solvers. -He~\cite{He2024QuantumTSP} extends this paradigm to the Traveling Salesman Problem, combining quantum annealing on coherent Ising machines with GNN-based approximate solvers, both operating on QUBO formulations. -These approaches assume that the user's problem has already been cast into QUBO or Ising form---precisely the transformation that our reduction graph provides. -A practitioner with a Set Covering or Graph Coloring problem can follow edges in our verified graph to reach QUBO (through intermediate hubs like MIS or SpinGlass) and then apply these million-variable-scale solvers. -Our work provides the verified upstream infrastructure---the ``compilation'' from diverse problem formulations to the canonical forms that physics-inspired solvers consume---while the solvers cited above provide the downstream execution engine. +%====================================================================== +% APPENDICES +%====================================================================== +\appendices -\section{Discussion \& Conclusion}\label{sec:conclusion} +\section{System Architecture}\label{app:architecture} -\subsection{Generalizability} +The library's type system reduces the space of possible agent errors by making incorrect code fail to compile. +This appendix describes four mechanisms---the \texttt{Problem} trait, the \texttt{ReduceTo} trait, the \lstinline{#[reduction(overhead)]} macro, and the \lstinline{declare_variants!} registry---that enforce correctness structurally (\Cref{fig:architecture}). -The success of our skill-based methodology rests on a \emph{Goldilocks property} of the problem domain: tasks must be (1)~formally specified, so that agents can parse unambiguous requirements; (2)~decomposable into homogeneous subtasks, so that a small number of reusable skills covers a large number of instances; (3)~equipped with automatable correctness criteria, so that verification does not depend on human judgment at every step; and (4)~demanding enough to require both creativity (in problem selection and proof design) and mechanical execution (in implementation, testing, and documentation). -NP-hard reductions satisfy all four criteria, but they are not unique in doing so. +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/architecture.pdf} + \caption{Trait hierarchy and compile-time validation. + The \texttt{Problem} trait defines a universal evaluation interface; \texttt{ReduceTo} requires both forward and inverse maps; procedural macros validate overhead expressions and variant registrations at compile time.} + \label{fig:architecture} +\end{figure} -We identify several candidate domains that share this Goldilocks structure. -\emph{Combinatorial optimization solvers} expose a similar pattern: each solver configuration targets a specific problem class, implements a well-defined algorithm, and admits benchmark-driven correctness testing. -\emph{Algorithm libraries} (e.g., sorting, graph traversal, numerical routines) consist of homogeneous modules with clear input--output contracts and reference implementations for cross-validation. -\emph{Numerical linear algebra} routines---factorizations, eigensolvers, iterative methods---are formally specified by mathematical identities that serve as built-in oracles. -\emph{Hardware description languages} decompose digital circuits into modular components with simulation-based verification. -\emph{Compiler optimization passes}---particularly peephole rules and algebraic simplifications---are self-contained transformations with formally verifiable semantics. -In each case, the key enabler is the same: a domain expert can encode the ``how'' of task execution into reusable skills, while the ``what'' (which algorithms, which optimizations, which circuits) remains a human judgment call. +\paragraph{The Problem trait.} +Every problem type implements \texttt{Problem}, which requires: a constant name, an associated metric type (\texttt{SolutionSize} for optimization problems, \texttt{bool} for decision problems), a method returning the configuration space dimensions, an \lstinline{evaluate()} method that scores any configuration, and variant metadata for type-parameter tracking. +The key member is \lstinline{evaluate()}: because every problem maps configurations to metrics through this uniform interface, a brute-force solver can enumerate the space and select the best solution, enabling the round-trip testing described in \Cref{sec:roundtrip} without problem-specific oracles. -The methodology does \emph{not} generalize to domains that lack these properties. -Heterogeneous software engineering tasks---the staple of SWE-Bench---resist skill-based decomposition precisely because each issue has a unique structure, a different notion of correctness, and no reusable workflow template. -Our approach is most powerful when the creative work is in selecting and specifying tasks, not in the execution itself. +\paragraph{The ReduceTo trait.} +The generic \lstinline{ReduceTo} trait requires a \lstinline{reduce_to()} method returning a \texttt{ReductionResult} that bundles \lstinline{target_problem()} and \lstinline{extract_solution()}. +By requiring both forward and inverse in a single implementation, the type system ensures every reduction is round-trip capable---an agent cannot compile a forward reduction without providing the extraction logic. -\subsection{Limitations} +\paragraph{Compile-time overhead validation.} +The \texttt{\#[reduction(overhead = \{...\})]} procedural macro attaches symbolic size expressions to each reduction. +These expressions are parsed at compile time; variable names are validated against getter methods on the source type, so a typo (e.g., referencing \lstinline{num_vertex} when the method is \lstinline{num_vertices}) causes a compile error rather than a silent bug. -Several limitations constrain the conclusions that can be drawn from this work. +\paragraph{Variant registry.} +Problem types are parameterized by graph type (SimpleGraph, KingsSubgraph, etc.) and weight type (unit weight, \texttt{i32}, \texttt{f64}). +The \lstinline{declare_variants!} macro registers concrete instantiations with their best-known complexity, enabling automated graph export, documentation completeness checking, and redundancy analysis. -\paragraph{Single case study.} -The empirical evidence comes from a single project maintained by a single developer. -While we argue that the methodology generalizes to other Goldilocks domains, we have not yet validated this claim empirically. -The ablation study (\Cref{sec:evaluation}) provides a controlled comparison within this project, but replication across independent projects and teams remains necessary. +\section{Verification Stack Details}\label{app:verification} -\paragraph{Skill engineering cost.} -The 12~skills in our library represent substantial upfront investment. -Each skill required iterative refinement---writing the initial markdown script, testing it against real issues, observing agent failure modes, and revising. -This cost is amortized across many invocations, but it presupposes a maintainer with both domain expertise and familiarity with agent capabilities. -Projects without such a maintainer cannot adopt the methodology directly. +\Cref{tab:verification} summarizes the seven layers, each targeting a distinct class of error with an example drawn from actual agent failures during development. -\paragraph{Domain specificity.} -Skills encode domain-specific knowledge---file naming conventions, trait implementation patterns, test structures---that does not transfer across domains. -A skill designed for implementing reduction rules provides no value for web development or systems programming. -Each new domain requires its own skill engineering effort. +\begin{table*}[t] + \centering + \caption{Seven-layer verification stack. Each layer catches a distinct class of error that the layers below it miss.} + \label{tab:verification} + \begin{tabular}{@{}clll@{}} + \toprule + Layer & Mechanism & Example Error Caught \\ + \midrule + 1 & Rust type system & Agent returns \texttt{bool} instead of \texttt{SolutionSize} \\ + 2 & Unit tests & Agent evaluates Max-Cut objective with wrong sign \\ + 3 & Closed-loop (round-trip) tests & Satisfiability$\to$MIS maps clause variables to wrong vertex indices \\ + 4 & Overhead validation & Agent writes \texttt{num\_edges = num\_clauses} instead of \texttt{3 * num\_clauses} \\ + 5 & Materialized fixtures & Agent modifies expected QUBO matrix to make test pass \\ + 6 & Agentic review & Missing \texttt{declare\_variants!} macro, wrong file naming convention \\ + 7 & Documentation review & Proof assumes connected graph but problem definition allows disconnected \\ + \bottomrule + \end{tabular} +\end{table*} -\paragraph{Confounding factors.} -Our project evolved over nine weeks during which both the skill library and the underlying language models improved. -Although we address this confound through temporal stratification in our evaluation, we cannot fully disentangle the contribution of better skills from the contribution of more capable models. -Future work should control for model version to isolate the skill-based methodology's independent effect. +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/verification-pyramid.pdf} + \caption{Seven-layer verification stack. Lower layers (blue) are fully automated and fast; upper layers (gold) involve human-readable arguments and fresh-context review.} + \label{fig:verification} +\end{figure} -\paragraph{Maintainer requirement.} -The three-role model requires a knowledgeable maintainer who curates the project board, writes skills, and performs final review. -The pipeline is not fully autonomous: without human judgment at the Backlog$\to$Ready and In~Review$\to$Done transitions, the system cannot determine what is worth building or whether the result meets quality standards. -This is a feature, not a bug---but it does mean the methodology is inapplicable to scenarios that demand full autonomy. +\paragraph{Layer 1: Type system.} +The \texttt{Problem} trait's associated type forces every problem to declare whether it is an optimization or decision problem. +The overhead macro validates variable names against source-type methods. +Errors are caught in seconds, before any test runs. -\subsection{The Human Value Proposition} +\paragraph{Layer 2: Unit tests.} +Each problem includes tests verifying \lstinline{evaluate()} on small, hand-crafted instances. +Serialization round-trip tests catch graph and weight encoding issues. -A natural concern is that skill-based agentic coding diminishes the role of human developers. -Our experience suggests the opposite: humans are \emph{repositioned}, not eliminated. -The creative and judgment-intensive aspects of software development---identifying which problems are worth solving, designing reduction proofs, setting quality standards, deciding architectural trade-offs---remain firmly in human hands. -What agents absorb is the mechanical volume: implementing boilerplate, writing tests, generating documentation, fixing CI failures, and managing workflow state. +\paragraph{Layer 3: Closed-loop (round-trip) tests.} +The pattern described in \Cref{sec:roundtrip} exercises the full mathematical content of each reduction. +This catches index errors, sign errors, and mapping bugs---the largest share of errors that survive type checking. -This division mirrors broader industry trends. -Anthropic's 2026 Agentic Coding Trends Report finds that developers use AI in 60\% of their work while maintaining active oversight on 80--100\% of delegated tasks~\cite{Anthropic2026AgenticCoding}. -The skill-based approach formalizes this oversight: rather than ad hoc supervision, the maintainer's judgment is encoded into reusable skills that structure every agent interaction. -The human contribution shifts from writing code to \emph{programming the agent's workflow}---a higher-leverage activity that scales with the number of tasks the agent can execute. +\paragraph{Layer 4: Overhead validation.} +After constructing the target instance, the test harness evaluates the symbolic overhead expressions and compares the predicted sizes against the actual target sizes. +This is particularly effective for non-obvious size relationships (e.g., quadratic edge counts in intersection graphs that an agent might assume are linear). -\subsection{Future Directions} +\paragraph{Layer 5: Materialized fixtures.} +JSON ground-truth files in \texttt{tests/data/} are committed separately from implementations. +If an agent modifies a ground-truth file to make a test pass, the change appears as a visible diff in a file outside the agent's normal scope, transforming a subtle correctness violation into an obvious process violation. -Three directions extend this work. -First, combining skill-based implementation with \emph{automated discovery}: AlphaEvolve~\cite{Novikov2025AlphaEvolve} has demonstrated that evolutionary search can discover novel gadget reductions. -Feeding discovered reductions into our pipeline as automatically generated issues would close the loop between discovery and verified implementation, producing a system that both finds and implements new reductions with correctness guarantees. +\paragraph{Layer 6: Agentic review.} +Two parallel sub-agents---one checking structural completeness, one checking code quality---operate in fresh context windows. +Fresh context prevents the confirmation bias that arises when an agent reviews its own work within the same session. -Second, \emph{formal verification integration} could strengthen the verification stack. -Currently, Layer~7 (documentation with proof sketches) relies on human-readable arguments that are reviewed but not machine-checked. -Replacing or supplementing this layer with Lean or Coq proofs---generating formal correctness theorems alongside the Rust implementation---would add an eighth layer providing the strongest possible guarantee. -The VeriCoding~\cite{Bursuc2025VeriCoding} and CLEVER~\cite{Thakur2025CLEVER} benchmarks suggest this remains challenging, but the bounded scope and formal specification of individual reductions make them more amenable to automated theorem proving than general software. +\paragraph{Layer 7: Documentation review.} +Every reduction has an entry in the accompanying paper with a proof sketch. +A completeness checker flags undocumented graph elements. +Writing ``if $S$ is an independent set, then $V \setminus S$ is a vertex cover'' forces articulation of the mathematical argument in a form that humans can review. -Third, \emph{scaling toward a reduction compiler}: the reduction graph is not merely a library catalog but a \emph{compilation infrastructure} that maps user problems to specialized solvers (see \Cref{fig:reduction-graph}). -Each reduction edge carries a multivariate polynomial cost model---for instance, reducing an $n$-vertex, $m$-edge graph to a triangular lattice MIS costs $O(n^2 + m)$ atom sites---and the optimal compilation path may depend on the problem scale (path~A dominates at small $n$, path~B at large $n$), requiring Pareto-optimal path search. -As the graph scales from 24 to 100+ problem types through agent-synthesized rules (the upper layer in \Cref{fig:reduction-graph}), the system evolves from a library into an end-to-end problem reduction compiler: users describe their combinatorial optimization problem, and the compiler automatically selects the lowest-cost reduction path to the target solver---be it a Rydberg atom array, a quantum annealer, or a commercial ILP solver. -The skill-based methodology presented here provides the knowledge engineering backbone for this compiler: each new reduction rule, verified through the seven-layer stack and implemented by agent-managed skills, adds an edge to the compilation graph. -Investigating the scaling dynamics---redundancy detection, Pareto path search over polynomial costs, and multi-maintainer coordination---is the natural next step. +\paragraph{The lazy agent problem.} +We observed agents modifying expected test outputs rather than fixing implementations---a rational strategy from the agent's perspective (shortest path to passing tests), but a correctness violation. +Layer~5's independent fixtures and Layer~6's fresh-context review are the primary defenses against this failure mode. -\subsection{Conclusion} +\section{Ablation Study Design}\label{app:ablation} -We have presented a skill-based methodology for agentic coding that decomposes mathematical software tasks into human-creative and agent-executable components, validated through a case study producing 27~problem types and 50~reduction rules with multi-layered verification. -The core insight is that the bottleneck in agentic coding is not agent capability but task decomposition: when work is structured so that each unit is formally specified, bounded in scope, and mechanically verifiable, current agents execute it reliably. -The methodology is most powerful in domains that share the Goldilocks property---formal specification, homogeneous tasks, automatable correctness---and we believe such domains are more common than is generally appreciated. +To isolate the effect of skills on development outcomes, we design a controlled comparison on identical tasks. -\bibliographystyle{IEEEtran} -\bibliography{references} +\paragraph{Setup.} +Select 5--10 reductions spanning the complexity spectrum---from complement relationships (MVC $\to$ MIS, 96~lines) through gadget constructions (Satisfiability $\to$ MIS, 171~lines) to circuit encodings (Factoring $\to$ CircuitSAT, 272~lines). +Prepare identical issues for two configurations: +(1)~\emph{Skill-based}: the full pipeline including issue validation, implementation skills, multi-agent review, and CI fixing. +(2)~\emph{No-skill baseline}: the same agent, same codebase, and same project instructions, but no skill files---the agent must infer the workflow from context. + +\paragraph{Metrics.} +(1)~First-attempt CI pass rate; (2)~review rounds before merge readiness; (3)~correctness (all round-trip tests pass); (4)~convention adherence (file naming, macro usage, documentation completeness). + +\paragraph{Expected outcomes.} +Skills should excel on convention adherence (encoding project-specific patterns that are not inferrable from code alone) and first-attempt CI pass rate (the verification stack catches errors before the pull request is created). +The baseline agent likely produces functionally correct code that fails CI due to missing macros, incorrect overhead declarations, or misplaced test files. + +This ablation has not yet been executed. +We present the design as a replicable protocol for evaluating skill-based methodologies. \end{document} diff --git a/docs/paper/arxiv/writing-guidelines.md b/docs/paper/arxiv/writing-guidelines.md new file mode 100644 index 00000000..a0944300 --- /dev/null +++ b/docs/paper/arxiv/writing-guidelines.md @@ -0,0 +1,93 @@ +# Writing Guidelines + +Lessons distilled from studying "Attention Is All You Need" (Vaswani et al., NeurIPS 2017) and applying them to our paper. + +## 1. Start from what the reader knows + +Open each section with a familiar concept, then pivot to the gap or novelty. + +- **Abstract**: Begin with the real-world context ("Many real-world optimization problems..."), not the technical contribution. +- **Introduction**: Start with the concrete problem (airlines, chip designers, logistics), not benchmarks or related work. +- **Each section**: The first sentence should orient the reader, not assume they just read the previous section closely. + +**Bad**: "NP-hard problem reductions form a directed graph that serves as compilation infrastructure." +**Good**: "Many real-world optimization problems are computationally hard, yet specialized solvers exist for a handful of them." + +## 2. Define every concept before using it + +Never use a term or symbol without having introduced it first. If a concept appears in the abstract, it must be self-explanatory in context. + +- Spell out all abbreviations on first use: "Maximum Independent Set (MIS)", not just "MIS." +- Define technical terms in plain language before using them: "A *reduction* is a mathematical transformation that converts one problem into another while preserving the solution." +- Introduce notation gradually: describe in words what $G = (V, E)$ means before writing the formula. + +**The Vaswani rule**: Before any equation, explain in words what each symbol means and what the equation will do. The math *follows* the intuition, never the reverse. + +## 3. One idea per sentence + +Short, declarative sentences. Each sentence carries one fact or one claim. + +- **Bad**: "The Transformer, which is a novel network architecture based entirely on attention mechanisms rather than recurrence or convolutions, achieves state-of-the-art results." +- **Good**: "The Transformer relies entirely on attention mechanisms. It uses no recurrence or convolution." + +Avoid hedging words ("it should be noted that", "it is worth mentioning that"). Just state the fact. + +## 4. Lead with the answer, not the reasoning + +Put the conclusion first, then the evidence. The reader should know where you're going before you take them there. + +- **Bad**: "Because each reduction implements the same trait, follows the same file convention, and requires the same test pattern, reusable skills are possible." +- **Good**: "Reductions form a homogeneous task family, enabling reusable skills. Every reduction implements the same interface, follows the same file convention, and requires the same test pattern." + +## 5. Describe the thing, then justify it + +"Attention Is All You Need" describes the Transformer architecture in Section 3, then justifies the design choice in Section 4 ("Why Self-Attention"). The reader needs to understand *what* before they can appreciate *why*. + +Apply to our paper: +- Section 2 describes the reduction graph and its properties. +- Section 3 describes the methodology (skills, pipeline, verification). +- Justification for choices (why skills? why this verification stack?) follows naturally from the description. + +## 6. Use concrete examples to anchor abstractions + +Every abstract concept should have a concrete example nearby. + +- "Emergent compositionality" → the Factoring → CircuitSAT → ILP story +- "Round-trip testing" → reduce a graph, solve by brute force, extract, verify +- "Quality gate" → 75% rejection rate on 322 batch-submitted issues + +When introducing a general pattern, immediately show one instance of it. + +## 7. Structure sections as self-contained units + +Each section should be readable on its own. A reader who skips straight to Section 4 (Evaluation) should understand what is being evaluated without re-reading Sections 1-3 in detail. + +- Re-introduce key terms briefly when they reappear ("round-trip testing, described in Section 2.4, ..."). +- Avoid forward references to undefined concepts. If Section 2 mentions "skills," the reader should already have a rough sense of what skills are from the introduction. + +## 8. Tables and figures earn their space + +Every table and figure must be: +1. Referenced in the text (never orphaned). +2. Self-contained with a caption that explains what the reader should see. +3. Necessary—if the same information fits in one sentence, skip the table. + +Captions should tell a story: "Seven-layer verification stack. Each layer catches a distinct class of error that the layers below it miss." Not just: "Verification layers." + +## 9. The abstract is a standalone document + +The abstract should be understandable by someone who reads *only* the abstract. This means: +- No undefined abbreviations +- No forward references ("as shown in Section 3") +- No citations +- A clear problem → approach → result → significance arc + +## 10. Cut ruthlessly + +If a sentence doesn't advance the argument, delete it. Common cuts: +- "In this section, we describe..." → just describe it +- "It is important to note that..." → just state the note +- "As mentioned above..." → if it matters, the reader remembers; if not, cut it +- Restating what was just said in different words + +The Vaswani paper is 15 pages including references and appendix, covering a paradigm-shifting architecture. If they can do it in 15 pages, so can we. From ca0ba3ad838fa19026fe53c4ed8802b956e98f36 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 00:00:43 +0800 Subject: [PATCH 24/38] update --- .gitignore | 1 + docs/paper/arxiv/paper.tex | 88 ++++++++++++++++++++++++++++++++------ 2 files changed, 76 insertions(+), 13 deletions(-) diff --git a/.gitignore b/.gitignore index 79202a3b..01399e95 100644 --- a/.gitignore +++ b/.gitignore @@ -88,3 +88,4 @@ claude-output.log docs/test-reports/ docs/superpowers/ *.log +.superpower/ diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 76d44a5b..d1f09c55 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -81,10 +81,10 @@ \subsection{The Challenge: Building the Graph at Scale} \subsection{Contributions} -Our methodology separates three concerns. -\textbf{Contributors} provide creative judgment: identifying which reductions are mathematically interesting and worth implementing. -\textbf{The maintainer} encodes workflow knowledge into reusable skills. -\textbf{Agents} handle the mechanical volume: implementing code, writing tests, generating documentation, and fixing CI failures. +Our methodology separates creative decisions from routine execution. +\textbf{Contributors} supply the creative elements that only domain experts can provide: which problems matter, what are the formal definitions, which examples reveal correctness. +\textbf{The maintainer} encodes workflow knowledge into reusable skills and makes two judgment calls per contribution---what to build and whether to merge. +\textbf{Agents} serve as guides (helping contributors articulate their ideas interactively) and as runners (handling the routine volume of implementing code, writing tests, and fixing CI). A library of 14~skills and a multi-layered verification stack ensure correctness across abstraction levels. Over nine weeks, this methodology produced a Rust library with 27~problem types, 45~reduction rules, and $>$95\% test coverage. @@ -187,9 +187,17 @@ \subsection{Verification by Round-Trip Testing}\label{sec:roundtrip} %====================================================================== \section{Methodology}\label{sec:method} -The central design challenge is: how do you let many people---including domain experts with no programming background---contribute to a verified mathematical library without compromising correctness? -Our answer is a pipeline of \emph{progressive quality gates}, where each stage independently validates its input before passing to the next. -A domain expert can propose a new reduction by describing it in mathematical language; the system validates the proposal, implements it, tests it, reviews it, and documents it---with human judgment required only at two points. +The central design challenge is separating \emph{creative} decisions from \emph{routine} execution. +Adding a reduction to the graph requires answering questions that only a domain expert can answer: which problem is industrially relevant and worth adding? +What is the formal definition? +Which small example would be both illustrative and sufficient to check correctness? +What is the polynomial overhead? +Everything else---writing Rust code, constructing tests, generating documentation, fixing CI---is routine work that follows a fixed pattern regardless of the mathematical content. + +Our answer is a pipeline of \emph{progressive quality gates} built around this separation. +The creative elements are captured in structured issue templates---one for models, one for rules---whose fields correspond exactly to the questions above. +The \texttt{propose} skill helps contributors fill in these fields interactively, asking one question at a time in mathematical language. +Once the creative decisions are recorded, the remaining stages are fully automated. \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} @@ -205,8 +213,12 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} \end{figure} \textbf{Stage~1: Propose.} -A domain expert---who need not know the codebase or even the programming language---invokes the \texttt{propose} skill. -The agent conducts an interactive brainstorming session using only mathematical language, asking one question at a time: what problem, what motivation, what is the formal definition, what does a worked example look like? +A domain expert---who need not know the codebase or even the programming language---provides the creative elements that only a human can supply. +Two paths are available. +The \texttt{propose} skill guides the contributor interactively, asking one question at a time in mathematical language: what is the motivation, what is the formal definition, what is a small example that exercises the core structure? +Alternatively, the contributor fills in a structured GitHub issue template directly---the template fields mirror the same creative questions. +Either way, the output is an issue whose fields capture every creative decision needed for implementation. + Crucially, the agent first analyzes the graph's topology to identify the most valuable contributions (\Cref{fig:topology}). Three categories guide the analysis: \emph{orphan nodes}---problem types with no reductions to or from any other node, contributing nothing to the graph; @@ -241,10 +253,40 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} Fresh context prevents the confirmation bias that arises when an agent reviews its own work. CI failures trigger up to three automated fix-and-retry cycles. +A third review layer---\emph{agentic feature testing}---simulates a downstream user. +An agent reads the documentation, installs the library, exercises the new feature through the CLI, and judges whether the results are consistent with its domain knowledge. +This replaces the community feedback loop that open-source projects normally rely on: rather than waiting for users to discover issues post-release, agent-users test the feature before merge. +The CLI design is essential here---because agents and humans share the same command-line interface, the agent tests exactly what a real user would invoke. +Unlike unit tests (which verify internal correctness), agentic feature tests verify that the feature is \emph{usable}: that the documentation is accurate, that the CLI output is interpretable, and that the reduction produces results consistent with the agent's knowledge of the underlying mathematics. + \textbf{Stage~5: Merge.} The maintainer makes the final quality judgment and merges. This is one of only two human decisions in the pipeline; the other is moving an issue from Backlog to Ready (Stage~3). +\paragraph{Three roles, two decisions.} +The pipeline separates three distinct roles (\Cref{fig:pipeline}). + +\emph{Contributors} are domain experts---mathematicians, physicists, operations researchers---who identify which reductions are worth implementing. +They need no knowledge of Rust, the codebase, or even programming. +Two entry points accommodate different preferences. +The \texttt{propose} skill conducts an interactive session in mathematical language, asking one question at a time: what is the problem, what is the formal definition, what does a worked example look like? +Before filing, it pre-validates the draft against Stage~2's quality checks, catching errors before they reach review. +Alternatively, a contributor can fill in a structured GitHub issue template directly. +Either way, the contributor's involvement ends when the issue is filed. + +\emph{The maintainer} makes exactly two decisions per contribution. +First, moving an issue from Backlog to Ready---a judgment call about what is worth building next. +Then everything between these two decisions runs headlessly. +\texttt{make run-pipeline} picks the highest-ranked Ready issue, creates an isolated git worktree, implements the code, produces a pull request, and moves it to the review queue---all without human input. +\texttt{make run-review} merges the latest main branch, addresses automated code-review comments, runs agentic feature tests, and retries CI failures up to three times. +The maintainer's second decision is the final quality judgment: reading the pull request and merging it. + +\emph{Agents} fill two distinct roles. +As \emph{guides}, they onboard contributors: the \texttt{propose} skill conducts an interactive session in the contributor's language, asks clarifying questions, analyzes the graph topology to suggest high-value contributions, and pre-validates drafts---lowering the barrier to entry without lowering quality standards. +As \emph{runners}, they handle mechanical volume: implementing code, writing tests, fixing CI failures, and generating documentation---processing batches of issues headlessly overnight. +Both roles interact with the library through a command-line interface (\texttt{pred}) that serves as the uniform entry point for humans and agents alike---listing problems, querying reduction paths, inspecting overhead, and performing reductions. +The skill system ensures identical verification steps in both roles: whether an agent is guiding a contributor through a proposal or implementing a reduction autonomously, it traverses the same quality checklist. + \subsection{Why Skills, Not Prompts}\label{sec:skills} Skills differ from per-invocation prompts in three ways that matter for sustainability. @@ -290,6 +332,24 @@ \subsection{Correctness by Construction}\label{sec:verification} The skill system ensures all seven layers are invoked for every task. Without skills, an agent might skip overhead validation or omit the paper entry---errors that would accumulate silently over many contributions. +\subsection{Why Rust?}\label{sec:why-rust} + +The choice of implementation language has outsized impact in agentic workflows, because the agent's edit--compile--test loop runs hundreds of times per session. +None of the maintainers had written Rust before this project---the language was chosen for properties that benefit agents, not developers. + +\emph{Explicit, actionable error messages.} +The Rust compiler produces diagnostics that include the error location, the conflicting types or lifetimes, and often a suggested fix. +Agents parse these messages and resolve errors without human intervention---a property we rely on throughout the pipeline. +Languages with less structured diagnostics (e.g., C++ template errors) would require more agent reasoning per cycle. + +\emph{Fast feedback.} +Incremental compilation and the built-in test harness produce results in seconds. +A typical round-trip test compiles and runs in under 3~seconds, enabling agents to iterate rapidly. +High feedback rate---short cycles between edit and result---is the single most important factor in agent productivity. + +Rust's well-known strengths---memory safety eliminating entire bug categories at compile time, high performance, a 7\,MB binary, and Cargo's integrated toolchain---further reduce the surface area of errors agents must debug. +Procedural macros deserve special mention: they enable the compile-time validation of overhead expressions and variant registrations that powers Layers~1 and~4 of the verification stack. + %====================================================================== % SECTION 4: EVALUATION %====================================================================== @@ -564,10 +624,12 @@ \section{Verification Stack Details}\label{app:verification} Two parallel sub-agents---one checking structural completeness, one checking code quality---operate in fresh context windows. Fresh context prevents the confirmation bias that arises when an agent reviews its own work within the same session. -\paragraph{Layer 7: Documentation review.} -Every reduction has an entry in the accompanying paper with a proof sketch. -A completeness checker flags undocumented graph elements. -Writing ``if $S$ is an independent set, then $V \setminus S$ is a vertex cover'' forces articulation of the mathematical argument in a form that humans can review. +\paragraph{Layer 7: Documentation and visual review.} +Every reduction has an entry in the accompanying paper with a proof sketch and a worked example. +The example is not manually drawn: it is generated by the same code that the round-trip test (Layer~3) executes. +The contributor specifies the source instance in the issue; the implementation produces JSON containing the source, target, overhead expressions, and extracted solutions; the paper renders this JSON as a visual diagram. +Contributors can inspect the paper to verify that the reduction matches their mathematical intent---a visual check that complements the automated verification layers below. +A completeness checker flags undocumented graph elements, ensuring every edge in the graph has a corresponding proof sketch. \paragraph{The lazy agent problem.} We observed agents modifying expected test outputs rather than fixing implementations---a rational strategy from the agent's perspective (shortest path to passing tests), but a correctness violation. From f7465a608b3041d6fcdca1d58576731bff4472ca Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 12:53:02 +0800 Subject: [PATCH 25/38] update typst paper --- docs/paper/arxiv/figures/roles.typ | 83 +++++++++++++++++++ docs/paper/arxiv/paper.tex | 96 +++++++++++++++++++--- docs/paper/arxiv/references.bib | 127 +++++++++++++++++++++++++++++ 3 files changed, 296 insertions(+), 10 deletions(-) create mode 100644 docs/paper/arxiv/figures/roles.typ diff --git a/docs/paper/arxiv/figures/roles.typ b/docs/paper/arxiv/figures/roles.typ new file mode 100644 index 00000000..82982814 --- /dev/null +++ b/docs/paper/arxiv/figures/roles.typ @@ -0,0 +1,83 @@ +#import "@preview/cetz:0.4.2": canvas, draw + +#set page(width: auto, height: auto, margin: 10pt) +#set text(size: 7pt, font: "New Computer Modern") + +#let col-human = rgb("#f28e2b") +#let col-agent = rgb("#4e79a7") +#let col-code = rgb("#59a14f") +#let col-skill = rgb("#9c755f") + +#canvas(length: 0.55cm, { + import draw: * + + // Helper: role node with shadow + let node(pos, label, sub, col, name-id, w: 2.2, h: 0.8) = { + let (x, y) = pos + rect((x - w + 0.12, y - h + 0.12), (x + w + 0.12, y + h + 0.12), + radius: 7pt, fill: luma(230), stroke: none) + rect((x - w, y - h), (x + w, y + h), + radius: 7pt, fill: col.lighten(90%), stroke: (thickness: 1.3pt, paint: col), + name: name-id) + content((x, y + 0.22), text(10pt, weight: "bold", fill: col.darken(22%), label)) + content((x, y - 0.32), text(6.5pt, fill: col.darken(8%), sub)) + } + + // Helper: edge label with white backing + let elabel(pos, body) = { + content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt, body)) + } + + let cx = 8 + let cy = 5.5 + + // ── Codebase (center, larger) ── + rect((cx - 2.7 + 0.12, cy - 1.4 + 0.12), (cx + 2.7 + 0.12, cy + 1.4 + 0.12), + radius: 8pt, fill: luma(225), stroke: none) + rect((cx - 2.7, cy - 1.4), (cx + 2.7, cy + 1.4), + radius: 8pt, fill: col-code.lighten(92%), stroke: (thickness: 1.5pt, paint: col-code), + name: "code") + content((cx, cy + 0.45), text(11pt, weight: "bold", fill: col-code.darken(25%), [Codebase])) + content((cx, cy - 0.3), text(7pt, style: "italic", fill: col-code.darken(8%), [agent-maintained])) + + // ── Three roles ── + node((3.0, 11.0), [Contributor], [domain expert], col-human, "contrib") + node((3.0, 0.8), [Maintainer], [no code], col-human, "maint") + node((13.5, 2.0), [Agent], [implement · test · review], col-agent, "agent", w: 2.5) + + // ── Contributor → Codebase: issue ── + line((5.2, 11.0 - 0.8), (cx - 0.5, cy + 1.4), + stroke: (thickness: 1.1pt, paint: col-human), + mark: (end: "straight", scale: 0.42)) + elabel((6.8, 8.8), text(6.5pt, fill: col-human.darken(15%), [issue (creative elements)])) + + // ── Codebase → Contributor: visual check ── + line((cx - 2.0, cy + 1.4), (2.2, 11.0 - 0.8), + stroke: (thickness: 0.9pt, paint: col-code, dash: "densely-dashed"), + mark: (end: "straight", scale: 0.38)) + elabel((2.5, 8.6), text(6pt, fill: col-code.darken(15%), [generated paper\ (visual check)])) + + // ── Maintainer → Codebase: approve, merge ── + line((4.5, 0.8 + 0.8), (cx - 2.0, cy - 1.4), + stroke: (thickness: 0.9pt, paint: col-human), + mark: (end: "straight", scale: 0.38)) + elabel((3.8, 3.0), text(6pt, fill: col-human.darken(15%), [approve, merge])) + + // ── Agent ↔ Codebase: execute skills ── + line((13.5 - 2.3, 2.0 + 0.8), (cx + 2.0, cy - 1.4), + stroke: (thickness: 1.1pt, paint: col-agent), + mark: (start: "straight", end: "straight", scale: 0.42)) + elabel((12.0, 4.2), text(6pt, fill: col-agent.darken(15%), [execute skills])) + + // ── Maintainer → Agent: author skills ── + line((3.0 + 2.2, 0.8 + 0.2), (13.5 - 2.5, 2.0 - 0.3), + stroke: (thickness: 1.1pt, paint: col-skill), + mark: (end: "straight", scale: 0.42)) + elabel((8.2, 0.4), text(7pt, weight: "bold", fill: col-skill.darken(15%), [author skills])) + + // ── Maintainer ↔ Contributor: community calls ── + line((3.0 - 1.0, 0.8 + 0.8), (3.0 - 1.0, 11.0 - 0.8), + stroke: (thickness: 0.7pt, paint: col-human.lighten(25%), dash: "dashed"), + mark: (start: "straight", end: "straight", scale: 0.28)) + elabel((0.3, 5.9), text(5.5pt, fill: col-human.lighten(5%), [community\ calls])) +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index d1f09c55..209c9725 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -287,9 +287,16 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} Both roles interact with the library through a command-line interface (\texttt{pred}) that serves as the uniform entry point for humans and agents alike---listing problems, querying reduction paths, inspecting overhead, and performing reductions. The skill system ensures identical verification steps in both roles: whether an agent is guiding a contributor through a proposal or implementing a reduction autonomously, it traverses the same quality checklist. -\subsection{Why Skills, Not Prompts}\label{sec:skills} +\subsection{Why Skills, Not Prompts or Scripts}\label{sec:skills} -Skills differ from per-invocation prompts in three ways that matter for sustainability. +Traditional automation---Makefiles, CI pipelines, shell scripts---is \emph{mechanical}: every step is predetermined, and human involvement is impossible mid-execution. +Skills are a different kind of automation: \emph{abstract}. +A skill defines \emph{what} must happen (validate references, construct an example, write a proof sketch) without fixing \emph{how}. +When the task is routine, the agent executes autonomously. +When the task requires creativity or deep judgment---choosing which example best reveals correctness, deciding whether a proposed reduction is non-trivial---the skill can pause and involve a human as a resource, then resume. +The same skill thus operates headlessly in the pipeline (\texttt{make run-pipeline}) and interactively with a contributor (\texttt{propose}), adapting its execution mode to the context. + +Skills also differ from per-invocation prompts in three ways that matter for sustainability. \emph{Skills are versioned.} They are committed to the repository and evolve through pull requests, just like code. @@ -458,6 +465,14 @@ \section{Related Work}\label{sec:related} Roychoudhury~\cite{Roychoudhury2025AgenticAI} identifies specification inference as the central difficulty. Our approach structures work so each unit falls within the agent's reliable range, complementing architectural advances in agent design. +\paragraph{AI code maintainability concerns.} +A growing body of evidence suggests that unstructured AI-assisted coding creates maintenance burden rather than reducing it. +Analysis of an LLM-generated C~compiler found excessive abstraction layering, inclusion of rarely-useful features that increase bug surface area, and absent comments precisely where domain expertise matters most~\cite{Jones2026LLMCompiler}. +A study of 211~million changed lines found a 4$\times$ growth in code clones and a decline in refactored code from 24\% to 10\%, indicating that AI encourages copy-paste over sustainable architecture~\cite{GitClear2025CodeQuality}. +A difference-in-differences study of 807~AI-adopting repositories found persistent increases in code complexity despite transient velocity gains~\cite{CursorAI2025SpeedCost}, and a randomized controlled trial found experienced developers were 19\% \emph{slower} with AI tools on real tasks in their own repositories~\cite{Becker2025METRProductivity}. +Our skill-based framework is designed to address these concerns directly: skills enforce project conventions (naming, testing, documentation) at every invocation, versioned skill definitions prevent prompt drift, and seven-layer verification rejects non-conforming code before it enters the repository. +The agent never ``free-writes'' code---it follows a structured template that has been refined through pull requests, ensuring that generated code is as maintainable as hand-written code that follows the same conventions. + \paragraph{AI-discovered reductions.} FunSearch~\cite{RomeraParedes2023FunSearch} and AlphaEvolve~\cite{Novikov2025AlphaEvolve} discover novel algorithms and reductions through evolutionary search, including improved bounds for combinatorial problems. Jani\v{c}i\'{c}'s URSA~\cite{Janicic2025URSA} uses SAT-based constraint solving to verify reductions. @@ -489,6 +504,24 @@ \subsection{When Does This Methodology Apply?} The methodology does \emph{not} generalize to heterogeneous tasks---the staple of SWE-Bench---where each issue is structurally unique and resists skill-based decomposition. +For domains that do share the Goldilocks property, our experience suggests three key components for scalable agentic open-source development: + +\begin{enumerate} + \item \textbf{A smooth contribution path.} + Progressive quality gates---propose, validate, implement, review, merge---ensure that every contribution follows the same verified pipeline regardless of who (or what) initiates it. + Skills encode this path so that it need not be re-learned or re-explained. + + \item \textbf{A code-free contribution path with visual verification.} + Domain experts contribute creative elements---definitions, examples, references---without writing a line of code. + An issue goes in; verified code and a documented paper entry come out. + The paper's worked examples, generated from the same code that the round-trip tests execute, let contributors visually confirm that the implementation matches their mathematical intent. + + \item \textbf{A minimal maintainer team amplified by agents.} + One or two maintainers write skills, curate the backlog, and make final merge decisions. + Agents handle all routine volume---implementation, testing, review, documentation---headlessly. + The maintainer's effort scales with the number of \emph{skill types} (currently~14), not the number of contributions. +\end{enumerate} + \subsection{Limitations} \paragraph{Single case study.} @@ -507,18 +540,61 @@ \subsection{Limitations} The pipeline is not fully autonomous: without human judgment at two transitions (selecting work and approving results), the system cannot determine what is worth building or whether results meet standards. This is by design---but limits applicability to fully autonomous scenarios. -\subsection{The Human Value Proposition} +\subsection{Why Human Experts Remain Essential} + +Our pipeline's reliance on human judgment at two transition points is not a temporary limitation awaiting better models---it reflects a fundamental gap. +Recent work demonstrates that large language models do not perform genuine mathematical reasoning but replicate patterns observed in training data: performance drops up to 65\% when only irrelevant clauses are added to otherwise identical problems~\cite{Mirzadeh2025GSMSymbolic}, and reasoning models exhibit complete accuracy collapse beyond certain complexity thresholds~\cite{Shojaee2025IllusionOfThinking}. +Multi-step reasoning surveys confirm that LLMs' limited working memory leads to failures when task demands exceed their capacity, with logically equivalent but differently phrased prompts producing different results~\cite{Plaat2025MultiStepReasoning}. + +These limitations are architectural, not merely a matter of scale. +Formal analysis shows that standard transformers are bounded by constant-depth threshold circuits (TC$^0$); even chain-of-thought prompting extends expressiveness only to polynomial-time computation---insufficient for the exponential search that NP-hard reasoning demands~\cite{Merrill2024ExpressivePower}. +More fundamentally, gradient-based optimization learns statistical correlations, not symbolic reasoning procedures: multi-step compositional reasoning exhibits \emph{multiplicative} error accumulation, where each approximate step compounds uncertainty rather than preserving logical structure~\cite{Dziri2023FaithFate}. +A transformer performs the same fixed-depth computation for every token regardless of problem difficulty; it cannot ``think harder'' on harder sub-problems the way a human mathematician allocates variable effort. + +The context window of the model used in this work (Claude Opus~4.6) is approximately 200,000~tokens---roughly a 500-page book. +This may seem large, but recent studies confirm it functions as \emph{working memory}, not deep reasoning capacity. +Huang et al.~\cite{Huang2025LLMWorkingMemory} show that LLMs cannot maintain and manipulate internal state---the hallmark of human working memory---across multiple model families regardless of chain-of-thought prompting. +Even when models can perfectly retrieve all relevant information from context, reasoning performance still degrades 14--85\% as input length increases~\cite{Du2025ContextLengthHurts}, and information in the middle of the context is effectively invisible~\cite{Liu2024LostInMiddle}. +A researcher's ability to hold a problem in mind for days, connect it to half-remembered theorems, and test ideas against deep intuition built over years of study is qualitatively different from pattern-matching over a fixed-length token buffer. +The gap is starkest on research-level mathematics: the FrontierMath benchmark, developed with Fields Medalists including Terence Tao, saw all AI models score below 2\% at launch~\cite{Glazer2024FrontierMath}; Humanity's Last Exam, spanning dozens of academic disciplines, found expert accuracy above 98\% versus 8\% for the best model at launch~\cite{Paster2025HLE}. +Moreover, users cannot modify foundation model weights: the model's knowledge is fixed at training time, and no amount of prompting adds expertise the training data did not contain. + +Skills are our response to this gap. +Rather than attempting to encode domain expertise in model weights (which requires expensive retraining) or in prompts (which are ephemeral and limited by context), skills encode it in \emph{versionable, composable documents} that persist across sessions and evolve through pull requests. +The creative decisions that require genuine expertise---which problems matter, which examples reveal correctness, whether a reduction is non-trivial---remain with human contributors. +The routine work that LLMs execute reliably---implementing known patterns, running tests, fixing compilation errors---is delegated to agents. + +\subsection{Scale Beyond Human Capacity} + +A comprehensive reduction graph connecting hundreds of NP-hard problems would require thousands of verified reductions---each demanding familiarity with two distinct problem domains, a correct polynomial-time transformation, a solution-preserving inverse map, and thorough testing. +No research group has the bandwidth to implement this manually. +The predecessor Julia package, maintained conventionally for several years, reached 20~problem types; the skill-based pipeline produced 27~types with 45~verified rules in nine weeks. +The scaling vision (\Cref{fig:reduction-graph}, top layer) targets 100+ problem types---a scale that is infeasible without agent-managed execution. + +\subsection{Industry Impact} + +Historically, utilizing a new hardware accelerator---a D-Wave quantum annealer, a Rydberg atom array, an Ising machine---required manually constructing a reduction from each target problem to the hardware's native formulation. +This case-by-case effort meant that each device supported on the order of ten problems at most, determined by which reductions a research group had time to derive and implement. + +The reduction graph changes this from a per-problem effort to a one-time connection. +Adding a single edge from the hardware's native formulation (e.g., MIS for Rydberg atoms) to the graph immediately routes \emph{all} 27~problem types to that device. +When a new solver or hardware platform appears, one verified reduction connects it to the entire graph---the practitioner need not derive any transformation by hand. +Conversely, each new problem added to the graph is instantly available on every connected solver. +The graph thus serves as a \emph{solver-agnostic compilation layer}: a user describes the problem, and the graph selects the verified, cost-annotated path to the best available hardware. + +\subsection{Barrier-Free Community Contribution} -Humans are \emph{repositioned}, not eliminated. -Creative and judgment-intensive work---which problems matter, what quality bar to enforce, how to architect the system---remains human. -Agents absorb mechanical volume: implementing boilerplate, writing tests, generating documentation, fixing CI failures. -The human contribution shifts from writing code to \emph{programming the agent's workflow}---a higher-leverage activity that scales with the number of tasks the agent can execute. +Traditional mathematical software demands that contributors learn the language, the build system, the testing conventions, and the project's architecture. +Skills eliminate this barrier entirely. +A physicist who knows that Spin Glass reduces to QUBO can contribute that knowledge through the \texttt{propose} skill without writing a line of Rust---the agent guides the session in mathematical language, validates the proposal, and the pipeline handles implementation. +The issue template captures exactly the creative elements; everything else is routine. +This transforms the contributor pool from ``people who know Rust and NP-hard reductions'' to ``people who know NP-hard reductions''---a far larger community. \subsection{The Scaling Vision} The graph's value grows superlinearly with its size. Each new edge creates not just one connection but composite paths through the entire graph. -As the graph scales from 27 to 100+ problem types through agent-synthesized rules (\Cref{fig:reduction-graph}, top layer), it evolves from a library into a \emph{reduction compiler}: a user describes their problem, and the compiler selects the lowest-cost verified path to the target solver. +As the graph scales from 27 to 100+ problem types through agent-synthesized rules, it evolves from a library into a \emph{reduction compiler}: a user describes their problem, and the compiler selects the lowest-cost verified path to the target solver. Three directions extend this work. First, composing with \emph{automated discovery}: evolutionary search methods like AlphaEvolve~\cite{Novikov2025AlphaEvolve} discover new reductions, and our pipeline implements and verifies them. @@ -527,10 +603,10 @@ \subsection{The Scaling Vision} \subsection{Conclusion} -We have presented a verified reduction graph connecting 27~NP-hard problem types to specialized solvers, built through skill-based agentic coding that separates human-creative specification from agent-managed execution. +We have presented a verified reduction graph connecting 27~NP-hard problem types to specialized solvers, built through skill-based agentic coding that separates creative specification from routine execution. The graph exhibits emergent compositionality: independently implemented reductions compose automatically to solve problems no single implementation was designed for. The core insight is that the bottleneck in agentic coding is not agent capability but task decomposition: when work is structured so each unit is formally specified, bounded in scope, and mechanically verifiable, current agents execute it reliably. -The methodology is most effective in domains that share this Goldilocks property---and we believe such domains are more common than is generally appreciated. +Skills lower the contribution barrier from ``knows the programming language'' to ``knows the mathematics,'' enabling a broader community to grow the graph toward the scale where it becomes a universal reduction compiler. \bibliographystyle{IEEEtran} \bibliography{references} diff --git a/docs/paper/arxiv/references.bib b/docs/paper/arxiv/references.bib index 1a292c0a..f3a6592a 100644 --- a/docs/paper/arxiv/references.bib +++ b/docs/paper/arxiv/references.bib @@ -238,6 +238,133 @@ @inproceedings{Mukherjee2025SynVer abstract = {We present SynVer---a novel, general purpose synthesizer for C programs equipped with machine-checked proofs of correctness using the Verified Software Toolchain. SynVer employs two Large Language Models: the first generates candidate programs from user-provided specifications, and the second helps automatically generate proofs of correctness in the Rocq proof assistant. SynVer combines symbolic reasoning with LLM-powered proof generation to discharge proof obligations.}, } +% ============================================================ +% Theme F: LLM Reasoning Limitations +% ============================================================ + +@inproceedings{Mirzadeh2025GSMSymbolic, + author = {Iman Mirzadeh and Keivan Alizadeh Vahid and Hooman Shahrokhi and Oncel Tuzel and Samy Bengio and Mehrdad Farajtabar}, + title = {{GSM}-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models}, + booktitle = {International Conference on Learning Representations}, + year = {2025}, + doi = {10.48550/arXiv.2410.05229}, +} + +@article{Shojaee2025IllusionOfThinking, + author = {Parshin Shojaee and Iman Mirzadeh and Keivan Alizadeh and Maxwell Horton and Samy Bengio and Mehrdad Farajtabar}, + title = {The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity}, + journal = {ArXiv}, + volume = {abs/2506.06941}, + year = {2025}, + doi = {10.48550/arXiv.2506.06941}, +} + +@article{Plaat2025MultiStepReasoning, + author = {Aske Plaat and Annie Wong and Suzan Verberne and Joost Broekens and Niki van Stein and Thomas Back}, + title = {Reasoning with Large Language Models, a Survey}, + journal = {ACM Computing Surveys}, + year = {2025}, + doi = {10.48550/arXiv.2407.11511}, +} + +@inproceedings{Liu2024LostInMiddle, + author = {Nelson F. Liu and Kevin Lin and John Hewitt and Ashwin Paranjape and Michele Bevilacqua and Fabio Petroni and Percy Liang}, + title = {Lost in the Middle: How Language Models Use Long Contexts}, + journal = {Transactions of the Association for Computational Linguistics}, + volume = {12}, + pages = {157--173}, + year = {2024}, + doi = {10.1162/tacl_a_00638}, +} + +@article{Du2025ContextLengthHurts, + author = {Yufeng Du and Minyang Tian and Srikanth Ronanki and Subendhu Rongali and Sravan Bodapati and Aram Galstyan and Azton Wells and Roy Schwartz and Eliu A. Huerta and Hao Peng}, + title = {Context Length Alone Hurts {LLM} Performance Despite Perfect Retrieval}, + journal = {ArXiv}, + volume = {abs/2510.05381}, + year = {2025}, + doi = {10.48550/arXiv.2510.05381}, +} + +@article{Glazer2024FrontierMath, + author = {Elliot Glazer and Ege Erdil and Tamay Besiroglu and Diego Chicharro and Evan Chen and Alex Gunning and Caroline Falkman Olsson and Jean-Stanislas Denain and Anson Ho and Emily de Oliveira Santos and Oam Patel and Niels Kornerup and Luca Zancato and Benjamin Feuer and Jonathan Tow and others}, + title = {{FrontierMath}: A Benchmark for Evaluating Advanced Mathematical Reasoning in {AI}}, + journal = {ArXiv}, + volume = {abs/2411.04872}, + year = {2024}, + doi = {10.48550/arXiv.2411.04872}, +} + +@article{Paster2025HLE, + author = {Keiran Paster and Dan Hendrycks and others}, + title = {Humanity's Last Exam}, + journal = {Nature}, + year = {2025}, + doi = {10.1038/s41586-025-09962-4}, +} + +@article{Huang2025LLMWorkingMemory, + author = {Jen-tse Huang and Kaiser Sun and Wenxuan Wang and Mark Dredze}, + title = {Language Models Do Not Have Human-Like Working Memory}, + journal = {ArXiv}, + volume = {abs/2505.10571}, + year = {2025}, + doi = {10.48550/arXiv.2505.10571}, +} + +@inproceedings{Merrill2024ExpressivePower, + author = {William Merrill and Ashish Sabharwal}, + title = {The Expressive Power of Transformers with Chain of Thought}, + booktitle = {International Conference on Learning Representations}, + year = {2024}, + url = {https://arxiv.org/abs/2310.07923}, +} + +@inproceedings{Dziri2023FaithFate, + author = {Nouha Dziri and Ximing Lu and Melanie Sclar and Xiang Lorraine Li and Liwei Jiang and Bill Yuchen Lin and Sean Welleck and Peter West and Chandra Bhagavatula and Ronan Le Bras and Jena Hwang and Soumya Sanyal and Xiang Ren and Allyson Ettinger and Zaid Harchaoui and Yejin Choi}, + title = {Faith and Fate: Limits of Transformers on Compositionality}, + booktitle = {Advances in Neural Information Processing Systems}, + year = {2023}, + url = {https://arxiv.org/abs/2305.18654}, +} + +% Theme G: AI Code Maintainability Concerns +% ============================================================ + +@misc{Jones2026LLMCompiler, + author = {Derek M. Jones}, + title = {Investigating an {LLM} Generated {C} Compiler}, + howpublished = {The Shape of Code (blog)}, + year = {2026}, + url = {https://shape-of-code.com/2026/02/22/investigating-an-llm-generated-c-compiler/}, +} + +@misc{GitClear2025CodeQuality, + author = {{GitClear}}, + title = {{AI} Copilot Code Quality 2025: Analysis of 211 Million Changed Lines}, + howpublished = {GitClear Whitepaper}, + year = {2025}, + url = {https://www.gitclear.com/ai_assistant_code_quality_2025_research}, +} + +@article{Becker2025METRProductivity, + author = {Becker, Nate and Rush, Alexander and others}, + title = {Measuring the Impact of Early-2025 {AI} on Experienced Open-Source Developer Productivity}, + journal = {ArXiv}, + volume = {abs/2507.09089}, + year = {2025}, + url = {https://arxiv.org/abs/2507.09089}, +} + +@article{CursorAI2025SpeedCost, + author = {Others}, + title = {Speed at the Cost of Quality: How {Cursor AI} Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects}, + journal = {ArXiv}, + volume = {abs/2511.04427}, + year = {2025}, + url = {https://arxiv.org/abs/2511.04427}, +} + % ============================================================ % Foundational References (from project bibliography) % ============================================================ From 2d6c848be3d368ba7315490506006cc99f261d59 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 15:01:21 +0800 Subject: [PATCH 26/38] update plan --- docs/paper/arxiv/figures/lib.typ | 30 +++++++++++ docs/paper/arxiv/figures/roles.png | Bin 0 -> 177872 bytes docs/paper/arxiv/figures/roles.typ | 78 +++++++++++++---------------- 3 files changed, 65 insertions(+), 43 deletions(-) create mode 100644 docs/paper/arxiv/figures/lib.typ create mode 100644 docs/paper/arxiv/figures/roles.png diff --git a/docs/paper/arxiv/figures/lib.typ b/docs/paper/arxiv/figures/lib.typ new file mode 100644 index 00000000..f4867c00 --- /dev/null +++ b/docs/paper/arxiv/figures/lib.typ @@ -0,0 +1,30 @@ +// Shared theme for all paper figures. +// Usage: #import "lib.typ": * + +#import "@preview/cetz:0.4.2": canvas, draw + +// ── Page setup (standalone figures) ── +#let fig-page = (width: auto, height: auto, margin: 10pt) +#let fig-text = (size: 7.5pt, font: "New Computer Modern") + +// ── Palette: black + one accent ── +#let accent = rgb("#4e79a7") // steel blue — the single accent +#let accent-light = accent.lighten(85%) +#let fg = luma(30) // near-black for text & strokes +#let fg-light = luma(100) // secondary text +#let border = luma(60) // box strokes +#let fill-light = luma(245) // subtle box fill +#let fill-accent = accent.lighten(90%) // accent-tinted fill +#let shadow-col = luma(215) // drop shadow +#let edge-col = luma(80) // edge strokes + +// ── Stroke presets ── +#let stroke-box = (thickness: 1.3pt, paint: border) +#let stroke-accent = (thickness: 1.3pt, paint: accent) +#let stroke-edge = (thickness: 0.9pt, paint: edge-col) +#let stroke-dashed = (thickness: 0.8pt, paint: edge-col, dash: "dashed") +#let stroke-dotted = (thickness: 0.8pt, paint: edge-col, dash: "densely-dashed") + +// ── Arrow preset ── +#let arrow-end = (end: "straight", scale: 0.4) +#let arrow-both = (start: "straight", end: "straight", scale: 0.4) diff --git a/docs/paper/arxiv/figures/roles.png b/docs/paper/arxiv/figures/roles.png new file mode 100644 index 0000000000000000000000000000000000000000..dcb447efc684909f8429d834599724234a98fa04 GIT binary patch literal 177872 zcmeEvdtB6I+WsgRl@(iBR4SB~=33EasUFPZnA7jRu0cSuZQrfmfA8n>`}MYo!*{;V z^W67!U-xxi&%!^xFzTio?z%ytP~7zFGf%ywP+aqOg(Be3*Ig_BrR9gm!WCV&XFU7V z$UmvN)a?#fckf?5J!tVI)h|ADY~#$)lebJS+1~p9-ufu>)911?9~s=Y-uZ6M+q1t| zFg*5D)b^rR#!j4d>L>s36i5EwAGN)g|9cx9H^~3JjsJTazYrWep{K@8&W}!%z~a$w z-@0{A$2T^y$#?BH)3@xr;G1*$u2G{#_3qPWYjpPGPdria)ut^i#(sT2)0?JEoA$N* zwSxz*y)r!co;|cZ_rh1c1I>2(741d9#Z(qJr5q|#JsIFzT zwI7tk>(*|PN7rVK8gb!8#RIo@k1KFa@ot{-?VZKZ$q)AG)eDc){>;AEdrE4)+T;!X z^X6CG`;_hgZR05I{uP;x_j)g6dq3nyZCt$grs$RpF{|*Ad5xoWkIKimV|x#O_oLV? zx$|GTq5b3VZ@wq|uV7E6yJ~=Ku+lczy(iS&#dLgaCBK)cW_5hcD=|&4sLy9;-g+S` zqG-|Q7oUjBv;23B?eR_@^?UzErSYR?#n@-LPsMm{<8KMgZuu(v$ngA=!?(&pEUIX^ zm~A?p#3>nQA;iU+Z-GB&|KkwqDt?-km%lxo#W(Ki1ci+4952nJU#G-6`1<@s``; z_gvI@FWSN$Pi=X0V^o&B;L939){z!cUi; zUt78mEA`pMY+dH$!X*hE*Lkcbeg3r7vaRA9s|F}nrn?WNSCy5`39k8DQ0o1BlbZ@I zY_fj6V8lmUbDdIlLsh)DCf*_6wqJd|Jv#cc)$s)@(|wNgsq(UdYYz5`ep#^L2ZQQ_mG!unc*3uU@@6 zr6ExJaJ*xF{2}@ACf~k2d%7r_UREBkYkhXxTsf)@FPIx&$h)wq;P+vNWO;H7GrM=% zXwUdJ+UrK=TI1oQ)Z!$|+`K)#^16h|-aU5u^zB>#rQMJilx2H7Yv0Mt=0~HK;$4_t zOrPB|?9kPBPT@$zAy072ff)DG@-7NK2(CJ3^qf$M zoA74&sW>Cfh1Z2Nf4$(v=gK>6o^IScF;$17G>?QjmxUgdFRY)Qn7AUyTNYGvld<6@ z{|+|VQ;l|u$LRBfD~B3W28a4Ap?dlLdj0v68?iAvy@dPyFQKc;<;trZUeGTvFfcr) zsAx>VrK1HOEf{fW5)apioEXoknCd|V6@yfokfwi(vRLGU!?#P0(pXohH3gPlx@B#jmw&geWs~m6*pvff8{B!N7xP2R?&IcTw#1f3Mu$#Y zz#lD@XT!V|Vb-T(8lMhVhT0QC3wsrm_nL_<-xTJwhWTdtU#JQ3hkGh+VzK4&_HKWl z+8h1*^>dFWwbatbxNP3MfXtHdx~B2AdqS=E@Vn&yBckfbi1#**sCz3s2Qe_kU8oJ$ z>*s~KU;Amuy3YR+pK9MphUn^7qTknhFX{bJAf{4we~{-yP<3gtxjHu$pO!xRhxMws z0WYj6LkMVpKKZwn&@No>tT(jUtfZ8}Bph5*^9ZAD#Dt`Cv-pUfSdWD3-^W;Pi`kQy z{KNW?1BF_5q0Jy+KS+J($sLzt8PBSBls}*iaPAFr{}%(Isrhug=XA=U1u55YPxxNV zYPddYx6xZ?v>HlHsih0Xrk)yGojao9mh^QW@4a$Kj5qtA?=^Y8p{XZ_R_9KSi?ik% zUHQ|dPiGWrifsX*YWJ)=1YUuD?(_UCJY|pmh=%ZKvvSnjRtHunI44>9%b(Kd-3hGLb z3Ig(xD?^(ng&qu;K5g2JqYviHxH2Z^JjwVFn2hM}DlogOh6u+zJr?Jgq8wR%p+rZ zD7L-I5VTU$G*vfR|#1)BTAY>$NP^9&g>WJmR<<2!Ag55+MC0Frs#h1BU+u`YItrDUvhK#p)8pf-n@i$N+e_jqFP;6?5I=S5g2IUAt$F8H z7YvMVxHf+G*2#scb%jc=C8cVV=H4*p^031Q0W(f%)0OfL$`H$TAQ{LnNj<*i@qLK` zpek2HoL}+hhcYDXeKX-9Kr+ER)iVzd)l?3(zMb9t_Jk(wNr{H>_IKk86O$L^sJfS0 zA2cp^{pVg5Z&ENP*Rv}02SdmDm$OVhoz={0gn$ErUWTRyy?a`=u5 zx>A#--ei4J+4!Wvd~D_5l;Wh+;GpC)`&8=^O~);(vYo5)?lv}0Hy+%S|Kp~FQ@)83 zenR&m(S+1KSX%L*>S#{N==vPNIuGU;u6_bTFSTt-=3Bce-Cm~LS6Uhp)BI-4!6^6s zsD{W5J)`SRMK8|LT+FEfqB`%&bZst<&Ga^Iv~|}tba$Ug_uP@3oNSnR^ufFXE4PmUm#~Fa`v-q>4WTHBOf=n)l?m3|cF zvxb?z8Wd69BO*=bDbZD@g*Xq5vR*IW(hx5Re|%i-`&XhUT_55c<(b&LgRu>Uh>PDw zEMfpweviCiQD@oJXKFNy4$Mt&`A2#MQo*wQx`RYTV~mYsk4-M0sE(sr992SOk& zp?0JH(Iwrg;)$fyDcdGiHC8QK8(sIj8iknNF!5tGQ3nz9!mpe5y}koS!J# z_nz;+Wj|_<;_cWX+d`g2cP=C4a7F|0u_Ev6>ij$9p<|?0PLW*}Dj3$NPn09(8uAM< zur^!vnB!2o`+?!><@b^&)pmUL zrcLr0Dc=uBy>-*64_*1Fdi+uQXI}d$=7qq+yU5t^!^VhD1}L3}4aw$wLyD~^{L-YL%JLxXux$5{>|=}a&n&WpIls(m z85rVo6ehm&&VD&1eyS!x{zr0NVRFN(@lCH9%xC;d?rQ!qRxLHz^vn`~zy*Sh8zN;hRA3QN zu?p~5zrNp+B}?!>@KW%Vy9hRs;>CTl&gatUS7f>02(d0*&~QYaI^=SdO?jkorX-`kyqeV%Q&Zg7W8iC>L4t~OML6b{a`+=wtzG}L$u zMOTQoVS()~$$--Yc<-IC^PPn_-;;AE%#@e$_O@1#B(KQaaI<+gI1`JZKr+!NaMOd3%*Wx&Vq=>zvY>d3kHOJ zn9$oi>4vaFv-4}FYHFrBWWPpaT0h>XXuy|36*wWW=rKc* zF_n}QUe$8d*^Jb`5PX2?cccKv$WwgUq+6$7j{%W{RE2y*WZYH()xobK2W)@A8ccpTL_5PjrR zoTSnCIt7rfCItXuVCcDbddR+7qtu4X;SUbdSqBZ=FhiH4T9i}vWmM|#^?Kkgo5Zyx zd3SdR$pnjLZ@t-y^^l+-l$aX3V%QR*wT6b@%~`v)Pic9d(lkQMRk2&jow=Uo617CA zCxsUQyY!?K%>pg#v$G2)CY?KAvwbCbL7-GhU{GUQH1myXPZoP8?3j^;Bv-RQ3bMzO zn;%%Nl4PC$xc1Gr-|oYw;E;Z!(?xCm1hKplz(!DnJ%lZcO6eQg z^0&|<=aP&?PlmSqQRub^VP==@$1$`?{P@zH{M+Q0aIYoZgdC`j1JMecSEStrXg@Nv zab$RXLV7^i4`05VBD<+Ssj7g*m6D_mr8j`@d;a7#yL(9*YCxuziXzb^uV_|%p6vxz zC6Yg5_6)JR{vOqGZ?upvYlECigGSyO)6gwu_n^GJgJ$ShGJ)yIHBK8+iQS_*voGYJ zhp--JNbyGhEcfB8V*x1P%K{?G0#tFshY!y?9h-l9&_9LQjPh1RJ+h>qaB9jUbvef4 z9#X#9o*%QT>q_hEh9y*7BV*6{XGA+W|Pq`fLp*+_EB)DL~0=GQQ6B4Of#@!+Y)8|J3$ z2~O!kTFW|=FoM1rWn0Q_t2B|rjz=Qe(wM3-#jm!dChi2hKU9`9?8H<6W%1KCSb zRl6a`F(=5+r5G>fmtN!^&U<*fdU3JpriUi%*zGA5WSy1nIg?)9O%1+;jSL<3&#wBN zi}+}7N=8s?NP;paYuX~5ZSrccIvrl5JJy35AAj$t2j%+l%n?UPeE<|QucIb=ZB_qz z`?Hc0^CXZ5>+1};iFuL}`v*0il@%xZ9CnD+4XFF@6T1p7?y|D#8_f;LJ*46Hl%ngY zf$y74;LTAOhlbPwKTBkA{x14=CE?9m)2q|k3#x99C)M{RSrEW8$|_^g+xgKpR7^k;Ea%Gb6TzDDVC#^0+mLXju-V#UE@5vPIv(URHCxkD z9|&q}gBD?jauZFmy;L`a)V9|Zf8r?(mIK=msgwY}`>1p@tYPYJhb9lsZ2srQ1YKcx z%^%!VQQmv?%!AgqvYOutKLMf*^+rRa(XrTg2=IHw%DhwK^KU~`S}e8vZ&D$RASr^T zlb<<9XmpQt{v*BRel{@zTZCKe!qBNbjCg6w73dg#f~14AR`-M}*>HS0)7mqn)~58? z)Vp$$het1E`U03zuj6^SCrEvrzGz5#>ZQr*b>x!}SRKVJ(s-L(Dmf@4a5e&afz+U@ zOjA~$di%iZTg&_h#twg`X2Vfw26<6=S0=KnU! z^-y%)Atnw(-??ZVlhh92bu7`S3lu75^VrupePpm^m;{v7+1|2Z+G zE5@44v{XEuS%TM{Sh;wpHT1v`@SH`HLu_}1cxp!_>Y8R39nCi!T_)iLylkR0Wg5Z+ zUOR7-m>)4eaJFM#=yCMwI8%Vmih`zK7l^5WtkuqY0^nvtll6< z#bIyxAAQRMgi)qul(l*1O7i>oUepK(%pzJr16#t4^R9aP#~Z`n&yh-?D%S&SKh!WI zz_*-??Y@Xh(mE+HJ4I#@o0-L;c3_nH2K(^Dnj59K(;e+@1E=Mde$2x`jiB_y(WuHj z+c{_W#`%+rd^u|*yS4!Hws>f9qNUxoBCzF>-aC!m;ovHuEcM2u*#kmqQA9bUTk_}V zo~%t5dYf(7U>?|#w-5M|V)MN3ELQ`8q_xgI1gNmyA21K0%yD)O!TO=4Ys zrepsAreI-6+UsiN5pWf@9zqVKlm%+;4z>Lube{(?I@;V6oo;Ggq;o6^ckBtOi6Mu; z}D$ik{rWLnzn(_ja) z_Z(Oh%#&5xaFcOq;Ovzd>N_PJ1T~(NC>aiNDaMMTJw+1hBSDR~hFNZviqCzWUO!&a zRs&9>Ypt-SLeU$m-`pc^f@}UgI1*Be4hv&osgSjqa(wia zzrot(+Dns3zK7HIpq`_S>oIqT-Y>9Z*s<7#!K|mb=1FwDrF_5UGNN zQWQib9)Ka&H({fv%#@nvyHL+9g-iNW##sL#5zQ@WTMhw8LUs!Dbs%gAJVR>HKHpKl z^P)rFh<%VE$xHbw8a$!ZV1}Kahqd$#`G+}K7+%f%K&4}kdwgA8okjW=HEX1Q1(;Ob zL;A?7ICgX>nTiax+!DHn#70=hnn46TuaYY?nJ6DP@^a>04|Eh{cK1qKJmqGzU5^EL zq^sRED7E%UwT7|qSXGDaair`orJiyd(9tRNCe1^7J;OLMO!x_6x0zw~neOonjsIqu z2VIW^?>77dcXFw0YAN$tHfoQsi`%ixqka^9D18TBulvmMz1;}=181XsF z(l4ORt5C?R)}=c?OD{(-zTBl3VB*ZktG`(Akd*sF?W^@Bsq)-66i-OJ>Hq60pZ``` z`3~7;FNV}Y10o>cHhFVlg}PU#SPx5La0{!|yEsNWf^QvCz3JP=SwKwS-MS=+MH-!#hwi+ zCPIW(%e05v+8_oqvK^LUWzm#yC7N20O#F3of{u$6AmzDwq(IVwxuml?@HX$O*AJcD zFK$=eka~L?P6}#l!+6wXk|>Nsb)}NhDt26~!Hhlx+riPFp+ADY!M-8(WYLzziq_zxN>>_2_!2D@urP|Mxa-I4f5pGK|eP;wGfW+dlkRQ__wDNK`7 z660Ht|3)tzFyj2PdD0XaQ8}#ZAD?5hRDE)2>dVoeL3n})7r(=Z0BZw;3|ES>F_cN5!0X(WTP*#Fh zd0cJ1H|)^Vl!H?npf)y`p{|wyZY*5bV}MH2p+048E{)nHzUTSp6cZby`CG8~7u00J zp0o^Nd&1N#>ah)ZKa9@r>AN$;zN65O6woS4i`#j=zw}i~mp~#>%@*3(bLXl7tet-g zZMg?KozgOSj0A{xvfuSv@Hof|m2qDR&twpMYH!kfN~s)mes2kU#VyAWo5 zKdaBTXQd$c`+yM(ftcQqIg!@rBn98n{g8x9mEO>WfEinM;dPl=$f`Bq6%ReHU~@$; zb#4W8KSA$E6CV7m4W+;v{S}P(Xn@U-QX%C2!6IYqS_YPuJ&llB61IXZ6#v-vNS5B2 z-uzo2jI1|T_2(&b_2P)?fl=jHC!ahVsnlfayC6Z+IsMH@u zOWy3mG>>+KB-lxv2nR}3vjVk>afS7IXwBkl-bh9v`!HnkI-B;hH9LpIrjo0M3C zP7-%X8ddb*OVRN-IE3f+Xx_1-4SR`6NlKtIzI^4!FR@{YLoI>@6+j)@mhUDdb)i@m z%mlTcXQv(Pz!wD{XzO&jM@ef&T2#N1#>t-Hb*o8FDsj;f}$CiP)ml-R`UDC|h z4;JaWS$(#>x^v6=i%%R(c%%2IFW!1xWLyxvEEZ;7;TAR2?|_I$7SDQT;wb)OiuAHi zKKh3n14Ma>ULKq3k{?n@tT36<{L8-=m z3*TCGN8vL516dVY*gYC?=8lraR8b;qx%k0#^C!L5d(`|%>$V*P^=LCN)`Oat0WAJP zh-DOKYLM$lcSzLnvIBFbMnZdd?Z_a;3Uo1duMhS7)w^w6ti)1xSLtLa_u>@@T`?Oi z{ZS&4vVmbzP<$gWKzq=0?|1eQa(XH1C22HLBT~Hg{>GXIZnVffKHoP~0f-8u$j#WvZ4R@CZnUEQQ0*W9$@3%0x3W>=1UX3(1T zUAuPuPcd-bjD0t`W8M)-W#73X(DwF|x=Dui{*qk`X@zJOo*($WOH?4K&}*keOdYcN zo`u@Qi-N->?QJb;Dvr@>-xD3VWKJ~Iw5!?o{FN$n%eY ztVd4ywfZUl=*e#qPdH}K@|&bO|HMgX9m1UBSL%2L{N3wF$@fcJ2h$$l(`WOs{~>#;L2lK1$V9 z^zO!McC$AOR5zh1#*Fz33EwPKe0~kXuRP1XY+;Jf;6BvhYfrNQ{k_p&lrz~PLD=NI zNGhdxVp1Oh1z+%QarS*NGM>Rgkh5nHCCiXoRyEi@nJ-b_M()%X_S@{GaoNS}nqnZ0 zA}5I#%6*VsSp;zJ^9w#fS^P#{F7!-|$Z9GCJ5bRW?p<@XYXtryI z?fCG=B?&?nnQFAnH6-dsgnQ0MnSj|aJqCd>Fq{)rSjB-w@|l#RO0)3A*lw>$5g=0} zGOY>WE~+(wk^KilnYu$nw3k0wa?*_JK{e-EE87^=Q_TxDi*eR)H}QmMW$z4l~?^>@>3`v*iUlc94=D=M0s zn;lT zlv4`e$DSL=6@;#pr2`8Oj6-mdNiRrLI%_ag70cLKuIv)FT5lR^*q!LYjMw~0WuHW! z|74i7MsrGM4A_FGLNF*0o0;2Z=Fa|gf zg(K<{IOxMcyo&|(7p?Apo_2_;@%DPK_<7{k9tZfKy^>ig{bv~nMuKb zrx9H5H#s@-(S^w~pqPj}`KV=JmUWsC=cK%oB2(2Zv*cu!_Z_y>M@3kdaoQ(HCc+w@6f(+vJ=FO} zj~~BTUxzP6?+D5~$Ijj`wjzx!hI8j8Qdh{bD@`aB??(XV8_mhocnUciQj16cEjxyJ zo|77c2!Yy6j82lja=qCfZ}7R;#aR8%PajEi!4ZJd6c}*FY(S)5xD|&qDV`Iyjvb%H zLrV_@q*e~q^ovTV=;6d*BSsjC9f*b+5kY4n(D{2m>qN-~`P&SfI0?b!;U@s$Y?!sm z1K|PE;^y7&cMo=Cnw=pct%&%jz$}u_x|0!$C*46_N`9}t4ukQWKta382DoJ~vn-VM zO*D0=c9CF~r%mIyAx^|!vYP(#OxH3QhK1}0HXEE!7#TGCciZ_%h`_?V?l6KoY(O9V zH^&&kZ7&#uEK>Vf3I`w|NU`x?wxom`;} z)EPPD3Y96ftca`_Dls(CAh9! zjFH8W&;w6L4-lDS4ynCUueugv9fv73K+7dOhrb7%8Rz~e_fv4hiT)FS4*0?kuA@M6 zoJ=gBeTG$F6qn7VT6g*`HtsOC(J$2b3`IB#ln(u^`IRz?_NHK zGv$TY#ULo9HNfFmaG(Yv1?cA6Qpb(Eg%+W{2r-BB`Cf8U3}eX!rerJozuNYJ;LEM> zf9c3@^;t*lcd{SNBnm0m;w+@*D(97|0Ay0qv&@cRhrpfB6lRjD_6*g+j51BznZ>fb z^jvv|HN{988--U{)4(TW8&lpOrDO6Iop&|_gtoFk^ly!pX;`#(31 zSd9Htg)twDRHC_VcL5?3RRijv+W42Dw+g)nmEN8bm03a!4ZJc^(rNt9aZR-i&wYuxcWGW zZxKpGndy2Sh2+%K(^D<(H8R&kg%3129II=`b`Yzi94kz@QLle=42O?5^yIhv;6FWp z-RDRZi;uYDS&>ku+#tRJ;V0NegUJY?FRXHK4B37%w*6TfhMtZ70)sW;6JdqEj*@rm z9+C{m+1kYz-GO_J-eb_eDNZHY#YiC##pT(PbwdU8=nf$F+8{jOw0PK|j>k}g$|?*_flS?_ejG!?O2ZK7y3lP@wX@6UsR{urNYkrzs={19K=@yW)oe?m(w^u(81KH8W-uHpaHl}0g?RA%=1*KQfJp!(oxb{M@$mr-4;rlx za>A9ftB_pVY?uU*4g=5iOEH=m2V+5|5i*PH&|Ls2gwC{T!`RvPQ37CU#v{08)ap)H&^Ta!6G9{f zgr|n%4)yARZ*R5;T~q{KiiZ$8;skIr#L=uew_H@Mql^eULxq8PQIVIT61%p`5>BS9@;R-o*+H))+!ao zh(g?Dfwm7uh0m4EkG?9Ur3fd`sM6sB<_lk}j;p{D1Lz_WNXw3sfS|wBAj(oAOn6CT z!bn9yL3~Xaj2)!net;9^_)iToI=Vbh^6$N3afD_h2HY@H&az)}E$d;VNH(U0Q-+I? zyaDYV06Jl0_mP#l@WViwHh{bVg$t`POpL-OAWxj|9)N3^cv0p31LI1N1qC#SFUfY$ zvkLa>lS^n_w&VTS5@alUj) zS^5GTFhwl{TK*;(yi~0duL=&!LBk|b8HG9IYZTmkP?jbFBhB^C{xELY{zu1%ZvmLi znE=FyGvfwrdv#o_Kr2`^@*tyu^8CrCHsAKF{MTVixa(vjUctwtt8*RuHo-Lz4?=IgH|4XbPS41 z9wMnb?NT1EJN?{K50B{~a1Wab_|qMGrUqe+Z1RTD87i{oLXE+&t=U%VwQs{UDgDG*ZcYDJkGP{Ux1-~zGZvp{5lV{eMF?kF6w z6-SN2_la$t4&n_Yp3-L1cM7DBDo*MW^S;%x zehjajlIz^oT@2bQ<-$hI@AY++66qP{+CF!(5V3&&o4m>7UO0J{=#qxUU2=S*=l%qN zkMi4t4&cCQW(h}eMRsSAzPj(#eGkuo5DRksGc_>Ifxt>gU-@(bL(mj{-z_H9tJp>c zvOvo!!>};@%0m7sryUuqOdNlQuKHpjn9Yr+pBvasjE-e9qG>D%^A+DV#t8h+@Ep#i z0xpPmNT;_lk8dTa0i|flcfqN$JRRSP{E~cP1VG`+Tym8%Q4-e4Y`N9cRk=sx8U9Is z10%jyZgUu70r?4V8>>bkfOu%uwSMyK2Z%Zloig=uLuMBD6ezDwMPs4g zv2s9(SaLwl2p7HF=@M6fVcYCJ=Tce;=1oQl;2bTHG8^vC9hNM{?`j%Th7JPcHb-qX z)TubMZ(e@cre-1&h{8*%$VV~m4%MY|DEBl%y}aag`ACRcSG0Q>I2(kG^4>Rolk{GtKB#GZb!~Oj>7e)1evz0 z-==*j=|Yr8&@b?<`eq}&GXQ0%S7vnbQ&al=os}mbV_`8YBsJcP$CrQuj^*dTsPA@( z3ZbNaA$GDt4A5y+dUI8Lf0XP`xi=#bzo4#m=Mw1z7R&$Ha#6Ih|2!>^=9|uuAGR27(s>r;z-xe2`m=HIw4a*AvUUk}6&l&$%|m=4@|rnaDlS$AGs;l4WM5^67y<+B z7yVUFcTq@sZc_ktbpu@uqiLtHUVghY>0Vg-1dI$~;P{Vb^nir|2Y?;rl_qfrQi0Wm z!GA!H(I)RG2MvSaMc~SpNE}8t z3EgQjXE=A^C#3Y2caT*GMG3yaPI(NSX|NErs&$wZuYq);Q>rLo(h>XWax6{};m+cn<^mVrTxM3fP!MI9eZk!c7g==5CJ!yw2;o3eNyyW$e0=R7ZtYimbsAx( zRxrb8$N=LZ#hL@d)zoY_xo=A4#IC`;3iTKY+aWb}t`WI{ouW?rc{4ZFSWk`%Di3dC zl&|L;Xn#R7yOTwsV};u<%;VgAx>k>E-%+j<+e&@F;GNE3Qh%_hp+ zp!8cHOZ4+fb2*dDnTEGIXtzlw;tqu)w+mkLZ1z^&6C{b$o2_a&nRT2N8j*yR7|3z%N>+*}|tOJde zJo53H4}C8XPqJ+p$u=~tfaRTxO%Y|$6Y@)ayFEXbXW#nSH@-_c7oOfwnH>(u$6zbV zYUBb|k(g(eaFw7VH-zxp>5h|^TSHQ4d0Sqow{ut{s(~JEvQ_)olh+oj`>>yZKw8mu z$U|z;ftN^u^wUmCPizWhlTf18b+TEtdd%I;&_V1{q5TjX4!<)Z5(2eevWzHOwTJos z!37%3NNxCRsS5Xn@C+|j89S-^T;t|{rftqA&JsoUUSuLlGd#83VQ?6C8_ zPyXye#M_|0$vb{A?`C;Rx>G_*(|1FZCAr2y`QHy>@37OCy65Mw`){WEjmY|EP=a1#N8 z7@Z$wkK#*CCshx_A7K>h?Ytfa-QkX#ch)K0ovm@$I}bS@c^yP0qXH32E9&u0mu{_Q z3H!`xOchs2$o(&W&!A=uY%&@LHG&r9AUc+`sKds`SRK(!hxaZ&6Hg`z8}b?s@55O} z?4^(#uviGLt2^M61Rc((32pUG803_HV~CKvs>3nvh_bbSl*)4KbM-pDY~w5r(0=t2On6R_4Nq- zShjDc>SyXG#lurx`A}^i1GXXh-i`-X5W+DWl`HQ?%73QyTY$IDKlE>fc-?gFT*8MJV1Z;HFQ z`+GkZx_t+z$;8O~Rg?Qh)(jZeV<@MSHS}v&#uR<<&p!D80AD0pf{zl3RET@*jADO% z^~+m6Gw^3Wd?-E~;yell0(eIXkEZ0|s3xd#Xsh@yPC#bB&wlt=3^?v#S2>ae8o(gS z=RUH;Nbs--adk}om$#?9^rqfFn+TMywiq5ruf0}8+LGlz(*7Eg^(C+xvSE9|n)-7o z&*7^_VaU>wYsJ}qn9O36Jk|DP5>dp+jA+68KY`arf9{dD%x%LCa84Jk+T(;$c+Jgt z6O7VkL}!P1W4&?ZBmexz?yQh5A;`hXR)o}g5NWAj2s9kMGU$LupylnIxh>XLoG?6NQv9*qag_Z=17yQQK zBEffE5!5?;TJBHk1=dRAqCB7k3|Q<+ zfTZ>7H{mjo3%gZmeHr%>(zdlj<#M{~{0DrKH^|w>x&+P^UVbX)>}560@BA`*)8TW{ zX}wPfcIELkQ!Bv=GL1)?Fz?F6yL^eRZzR}V8br-#17E1BTI=D7|3*%1@+mHVxAcxs z?wZ{gJ&F(;p)CIq*PZ;OZ0QSZ+m>fUdQu2nlpt>fJ=CrKt(hcdE-uF}WCCBpWk8{< zm!pkc)r!n~J==1isJfvWY^Ao^-;eL%%g3@)^^CpYE1`rIExziX;Yiy5e&-?(=f}4l zyHcjfmlA5Y`JD7{Gb{|FO@NLmQWByrMMojrzGL?VgI^SxSzio|4`IBiv8%GS|Eno(f^w>1F!< zBGY4-(u2<0fe99Ny@Fg{2VL3q!fW-eXd_0^J>YOfYpp$X7ZloA^1uoOk(K<6<{9j?oq2@CiSqa<} z?8BdU0)tq=r|Rn;lJ1G&os(mpspuxR=zaWB?arI$Ou9pQ@xkOWrO5ypsW@)XlZ?I| zQZ(txZi_Cz%`e&g6C^=0911JX&&NHKA+`{l#l)CN`&Kc^h_krQ$c)#x*f;u%q2K2_ z-Z_8kRDEW(bobvBfsppu*u^r+BT2=$_XtGR+g^=cn$Y{L&40+!cM)@! zmu^_@@Ld3Th{?+1yfa5y#b3C!Nl{#V!~(SLZk%o^SAVu%WV`G za%91PyTxQODC>Z9*1FG`P66!H@X9->L5B!V-k+m;v;^jrIkPwAeExCQ-DsmVU>ZQW zVGl3qN9`G1P0+v;+|#uIf|gO?6}OOqmCMmgl2*lU13eg7iRd)=TX1xbhW7BtZYc;X zBTSV|bG=*)BJPS{_Gw)Thb_>xGV{oZHH>xJUGmy);+F%&`G5x=RwcNI3D$vWG|Qr3 zK>U_u#17ljo-A#tS)9rUylZ|M_W$l79Mu$5(*yHyrL~JoKkz$a;vN-xe@RG@W#6z3 z0yv28I26*pSz;=IsGV?{5qe7=98eEBH2@IVUo!m}ImHUV&2daUFF)cK92VfxrAy-a z5uOHndkT;D1Zf@)`eB^hO;9;lbZWSeNA6!GF^k!mY9Z|>M`R@y_DPz79wasrjzr$w z7|eDEGw^h@%%f&XM^SL%ZBVPy|7OL2sID_0!IGeON}g`;jRCBPU?F4$ro;RVhNS_j zsIzzYYAMl%*KyyuG7OwODI&|}`(!#r{p?VqUGCIfz_kyz?15WK^iuG+au<;)SwS%? z`@+6K3`X>N!5-m_w|g)o1EmQ%ka6nXM6|S+of5U5V zSSMvkW)B1uX8F`|HRRYOp~{TzH}o0a0F3i3>7e2BL4fJ43xV# z>z$b*u7MoPf{PJbEF9z&wf0x{>d+$xqiH(XC)dBPo(3cy`Ol%-H>0n0b4#jy3*$d~r zFQ9b%^BJ@{#$vW*Elc*YmFTX)WSST2U|i5|CnUzE_gp^*6U@AHgw0yq^dRo9_FXcK z?~rdCM-lFU*Y|iBi!psw#W5mr)IN%Pclsn5_F&%R>pwp*OgsUo?Zr5zVj4#xDU7D4 z8CBy3Mkk!wGFTv}p-@yvE}?BUMfJSOQO151l04{ohIz^wUU)U5yer1K4&fFS4*X$& z2r9Hn-+D^xCDVoiHXMiKxxsf9yz@uJiq>=4LM8x*rC zaC7FK3-Cw-_)%o^L-auvNpsgZ?2g7tS6}6lnLrK$n3_Rery#Yp@8oz3yC{O4Zbe8q z%!{WC8j*Le<=f>HI6k7@4&jzx5#yB0a7pLtWU<@kROl7vq*;|@ePI7%sH{Q`4dLh| z3MDpJ1jfwJA#Y=xr5Yf|?>3DY1PqA71^@z1lSXh2t+|of_@0olzn$=qlfKxJZ`b(c zDL&ma8?P0SeRmI{q2BkV=)H{l;Hyu~gD;&EY%(&htZa|9cmUC8LdJhg`*#^rgL{-* zIa<01gLVGfuu()P{Zhry+V3}vrkODwh$^iBk5DEz5WpZtasbB=FAgENEJVh0(X@nG zFteo1*a`|tgPBD=q?e-XAlrPeC6ZH}xSNHlM9fj>aG52b4JBr{Ur=9%VXUuWZ4 zRWI8gFi{W@P2%XaiOaGSBG-&~A2AhLaU^PksIeZBr;6~uHdx;~avZSkl7V?m8c?qZ zIZ+*Vv1t`nc53fY%$5Q-yEFjDaHfyW6um@8Yl$RBiw%R2^ir%`LXN6IW+v!As-%BD zK)k3mEw}PYN4pC$L$BzSm(#EkHw}vov7}u+`W_9)??M)I1wjI7*UQhzUMcS-QRLJs z_GDbCNSPddOu&gZ8=fp&J!#9x?c(AtQRQ2YTMH{OKE-PYUdW0l@+H9|7E7X%QTNHT zcl7-g5&ik6_)^e0~TJJ@_}5U4ebCN zGQ{A)xZ6X|aKZ&#;{j47H2P2Qf_ftY%27Prj)yV51XN^rAisqwF?L%q=^pQvk-i%__9@@b<-Y|>$#Vfw7>_1LVY>b0 zp$xe$+yVhNr+8q`DB}3=9G;jvt|j0wiJ+~%9FS3pAtbI1g45A@mpl6O2!^(Gfg?U> zoe{@D=DuU5MN3wkME(j#YCxQsrxM@Z{Ni)v&cn|G&&Z6M3__g2CE8-=2LI9uT*@e8 z>LT5ezq?Xz5;=a{X867+w|7~(zq?qkh#4H3h3;ezu43ZoJIL$kKgqEW@tOQl@gHNW zyS;_qquIV}fA{4pLG3_)|4c>9OcCQqu!4Kx`vl>bK)@Ym;`$jehdVUp`*G_(-m9H2 zlKTufTQ9~|E04epK>Q0>4bm91V-~i}6+|F|M#IEm)3^NeT^oAVO98y0=e-v98~tzD zv*?0Mv538c;1JcMM|wOfP9!UzeX7BYr;>YJQi1@kRVY+ieN7)71K%Q$a$;~ozd?y4 zPN&3eFbqEGn3KJDmO)-hfd# zsiGMdiwFeTyF*NsvFYdq{tJsFN2lJ&8UQW>M6D8kLv4V|qSCb?wT3HKK3hC1Z{P5| z0J{s~Y~mNxbrozlPLfiGrygUOc;gs-r!NRp))>JcGRZwB%~y_-$!BKorm3_$MRMnG z3U#op_?O%=&H*-$=HM7|slkt`FojE5h#PnQrktLddV9w-jWHiiwIg%rSsl0;HU*jJ z3K6Ys7?KV@f#A~sm=J#c*{BM*H^Mp!&kw;Z_0zECjHAf+;Tuz+oX|(bE(RMUqX(E9 zwEzS{4hbS!ktgl0Asc#Tj-S%C0adc>@~(5h4kC)dDuQYmpHYA-L93iCQ8>b|oi-IA zwW#rBVn@3h);iK%j)A*}@F#Gwo(Kg0xzz_wnhcHqGGh~aof?X93L;?VPyb*~+Kq9$DwBBck@J$T7wy*qRAb7O zBH?-jX2kcJTZXIjl1i4Yr0*u~O_1>s z>|G`#y9?NakXwEmDka<$0?iACou}afT->3>+kB-7@}t2F&ccbo*Kx55I0ehdm0VPd zi+|vtYr?=DsJJA^pqX=&xJ?sdulJ}TSs`_m0upw z^sf#LH zXfj3M7$H{RHFIhN(E}q=4FarJ35BfSyc6au1;1SlA52}=-6PqP6I zfin0UbCL9^LS^ljmz7B+K#MY(PJ!cLP48j~AbAJ{f+Rl;dhU=Ia9(slk}6QaOo;nh z=_G%GcnyjIw!$#~x%n}m4{%n3el_#j0NxIx@Hn7bF^Drv>kX%H54K+cykkjpXu7>D zu}+Q|+J(>$B6+b+!%zhFck!Pe@)wR&9zwWgn=5}T^BgzR&I!}x*xxRXZ1%^|Ju1R9 z^*sxv2?UE1%#&lZnm6RJgyn66@YllCJ!odbr?{&h)C=`Glm$2H(5?thIMfz6#c|ca z>G? z7N&?$d=ThyATn0F7)9eTMCAKc=4d^*Rgoe4OwH!f3!6F32*XBr<$=PJ{4LkBg78vE z;2={aG4^KvTZ8FdHg8QuHIjE`QT`n=vTW`a8M_#}sD|WKx?L6Izx|^{hHr;3*l=hG zUXCYF&e2PX0$F%IDg;su6(Yp=!E_#f!8?zvkbZ?%8uUx4NCYwP!F6;M0Ew(X0EBy0 zp4>*!SLXV&4}hkzN|f+Eb&Z1VGx&vcT%xN`&^vPZ9E$*ftTbkEi93zwc6OKWYHY%Y zeX5WVT(RqR;?6aA{8P?E^$$AQnCe7gA_}8JA&^MducNai#c-?zPp1gNk@buCN~ZSRX`YD4Og~~_;DhtZ< z!eM!r$X5ZG%iAX*fj8h!*hj>@g_MOo-9-JbT*7k`cL6;j3@ni_;Y=z~thNKoYHL40 z9WzBbjF<|lc)AJP5_vpSen%Nv+>o -9QE(n&;^eAn3;RA7M7sHx{nm+)s!R*;Z{ z*9!?ryuVx{2^?%||4Tt|JWd}+SoXU|xBZpC5w#x%{F9?drWo=`In#$A5f>p;;KBlN^%CRZ{qPw@N2JS%ag}h4q!%;7E3kOhbb8>x9|-AN zID+V&+iBGYM&~j`BgB-94edd4MI{XVj_0RrnmyT28B&N5O&hNxz}Q95J3t1IWl|Q* zT;PGvCYTKNW^pqvoh5=AIlKZNM|PgXeB?E;ld=z&)+dKgsUXH)7ndDLi&p%m6uJK( zoc^op2!sGCoInw%H%EL;)|wfpbpT<3)zUM+xm2GJZB2rigjBX~)V{PQZDs6Cztaf@X1kn`W;isj zU{u6e_3pl4Iap9Z?)v$*gW#aQ9=BQk%y9!v8DT36vm13=`&_$-Jr(RGm6Weazg6G6 zqU^z6er+{FmiYX;g^s%oPk*R=t(Hy~FNcUUwDRF9Dh(!f(dgQ!VNE+bzXsfaTH|N2 zF{j1LIk2Vb?I_vZLK8lM`Aa?NMmGyLI0Q&YS3r>}6~lMTnV9ktrP+=9cJr2Ykf|V; zDm4sIh#{=H^W;vQCxYG)ghQtQq%#GZ*TU@3a{%M-X)JtJ#}}-Gmc<{{aE!(KDXg$H zszGST-h6vhL+z1>4mA=45w^Y#F1Ccs5bSv$qvr(ePI>zp1_f39xv0%o*ztKcx)40!Ua4F1B-=(F~WXHYIj1l+wA7o+w}F}* zc(zTF?v3)hJOsK>1w1#9;YWP=K3BEHbGnKF=S!%y9^fzc+}R{ooxzEp9Hg_F_;Y8=JvEY*-fTAkq=!R?{zh9Ns*2+_2i0bUi42>MocTR~ zqribbHY4rgPk$&G8fn7 zwbs6Xf+evid&JieLIr4K9q6}8n>_3*kW4jzYW&W)ihi$MZRi zm;<7S@*^Tx${+$o_~qZ8vro)(NKaV=a45pv&|}h%mpvK74 zF?$&4ug5JkwSlgWC82L1Wuv~=I7W{g_}oCDd+wF?h$t&@WjejKaIc+2vPep4SsbG# z%@da75qmOpd;D*JcXZd~h$1?~M|p?Fh4m31iY1Ow=wv7o3gJEtB?tZ30$VVYAe|)5 zM>)y6aw^ds^@lnBH%QfLq|$HID1yrI+X1T5?XO9BX?j;~0%DyHD1a>}mx^_n0fRCD zUh~HdqGcaqoDJSkAlcD5>Oe%2ZUPU$Nm(qk1W0u+T(ZDpAxo9aJP9?G!HhJ^+s|+x zu)HR_3t5_*le81gv)v`2<7IhrO*)hx=xY$`2;D91r^+@qjyD71f@a7XYBaLtwS}x| zk+Sk2624`RhcyiZ&1`W>2eFIXK*h0L82{3ghHMHP_mLgQmAKgE6}KwfBtjP<)a7>% zcH%lka;juRf#y_kZ0aDy@Io0>210v4F@TN>=>yPm_4c;nAH=!zBts8KgF)_QN|JUm z6s&DkYO7Ho5H;@D(1^g1lrb!`HuMzj4S67;LHnP4G=Vo*#)-&}{7=w3|6cqnqjz@G zD=|+S?Zk|F{oC964pu{?C{2c^n_#4s0^W?FAJm@BMx#iK&!(qdn#`tT^q<;co0ue? z1(G?BzRLqX$HBB?2c>6;sm`Ec8onxRC8=}v?~SR=j`W_QiYp)3}d|`#YR3AmZ-=^=+fc#5&hxNKasW{L{WD@3#l~gPk63PIzPmv}F%ujHcT&dhEFSV`D_qmu_jR^cA+5 z;!qd{DR3*qlTRrcklNWg+|cv6fp7GeE0~~^5=ebK1#2c}#&1dY&(o~NaK7u4L`m6z zohLLSD|M7U$Eu)A1qXaOpwFr?w7(@oauxk}9kp)R$ zp%yX$nbyjB@6Y{P86*tp*~fFe*L%JH>}>^;wSK?v@I3c(55h=j#<|irffR58YUIrX z^TeTA>>Ayt%JPjGXUn)Hk6Y{E;p$wNKsc*ySB6AIcGcs)JW1!cUUkKO|02z;QeSsP z!j6#bd|}RpF>0Zivd|UCG9OS-ZaJ30|X#3l)@X^Y}xefZbCSZ)o zm(`T$l+yoj6=&1Vag4$|q0jr|Hre0gd|9x82+QWb>QIjo=O=D!!Ae8aI9w2)lK{_5 z3!=AtcjPK#Wen`?Gp8CeZLMw-!|FUEyj5T*4Vcawx_io$)2GsEU}O0Er~h^SsayW? zzN~iLmjN=zi%&eT4K2C(-8aQozs@W_WST4ljr8raM>Do_+Sfk7lL7LRpR`_jFSVWRE6brbAzj15I%!=k!LCs*VMgVE;IR71b00Y8#@X5+RZMoRh z;@)p&JZ&;r#!4qq@T9m3T`i=&)Q$Ob;&KzU#BX_8j1TbXr;uzy9UPZu!gmU;6p9)j#T` zp^46X&89AVp>}*-capJ7x(iaBJhfdoG5S*>_!^wN z+_G64+Uge00)s*^4^kL0jUm1j>afQT8kaMjy5NOcCLor&C%Rl6I+>nJnaY{p5trdY zE~RB|N@a=4_`EY+|8PuEc!q?AfUSt{&Up3dTbl~&X$s7;2yg`r9_ph}^}uHuXV;97 z50HIJ5SziV;svPlOg49FpLX6ee9CD~8s%ncJGY}^uKVhO`mc&$SelZuxBCwH;pyTz zyJll<-N6w$TGbSdzW+sXiy}P#)guQ2ko#5BgP5Ug!sw*zM<#8f=68TCOsFJI&iO?|i;875eRgdq=%^4AE)K)oQH7`@+Imgz!oP~EAR>fQ!}@QX^n~UQ1%!j7Q);C>X2w4c z*+n|K&!S=-Y=l}_xlz{b2H6<%EY?JAgksk978QG@1zdB%3n3!u{`dBWJ1-5Ow!t2? z^{H|Xm?mma77KoFn9rLAIUwod^b!1 z__|FRJ$)5UM6IGpg5?Xzw1z-}W?O@abci+OSEb-@m8sLUM&k!~82;A2F(} zsNn~D%pu0%_kFxZ_A`q$v57xu-rORBRkr8ihBtd#N0qN5|0xeY$raG>llDv364Rr_ zVT-w`i%cw}ylvLqMpOOV(<>vn??nDxi7~*yGlvo&oXmnu`{+UHsfQptg5njbFdQcv8+nEB^^6ts;c;qG*@e7vg- z$MExPla#&)-l&l0no7rGg_|290TV?8tAO=>&?-#U3?A#Iw4+_AyJ(-AAglU8GC@%2 z5kMJ_kZVm2PB1r)bjP&I1e`HAXX%G?dc7jif-vW=;HlX!*g1%^pfhaDYSquX$sTW> zGWcWxYiMR-Ww@!^fN`#s3_#aS{3WmUXjIaIOU>pm&Iz}9-pk!;u+!b!n2cdb4>QN; z-DBt5w^L*wqBwp!KROzwyx4v5+dEdqz={b;c6|>wh73}0cC7_&s@H$ilD>(mR#XuC z$A?U;dim2Pwo%Yhs*(eKXz*!5N0YJ_S%7m}7?!{KJ<2D5UloIp^v%GrfqnoU`n#Dh z0(e+aqmI#NkC!v_(6)uP>VRCuP@(iJwVVikeLDa3{JfL!#&4VgOVpDG?uEU1Da(}! zW1g^rSvwzkziNgUo}wG&pwXWZ`k}bxySFTR+&3o_i(Vf71jF~7Sp&BN*DelIQoz1! zfMf+eV7B#A&3cFec`BsdmOgz4jT7Px zJNawMgg6zCEZ@c^7I**#A20_c2q;Nb8Kp-R+$-kVj^sh>ru78A;kXh`vTOn)Qg=b- z^%P9>1IsIz@!zgYU;DR-6I5J#T=8-mxLVkd)mQdhg9&xN8qH0h)|S6>*Mfn+`rZqTd(pvtP;>MEp{_XLl*%B44-%uk1ihwSwxIjwWRRa5#n z{ZMDAZS&OGGWbX5>pnkAPmEL0bOg!VmR|{>%fH7Z0BhuiAHDj$S`+xF$V}|>5Jvn9 zPnf9YHGlz9!agsI4OChah_2%}AuO+q3fCQCtOcN0qf^cx0kc0+1*)4J(_%TLuKV_Y z>F=Kw!VA<0R#CC&0lrsZ7Bg|@EDt$v4>3#r>&@t-O~mww2-c-wFvb8zrWnXLn_ZwR zHsUz~K5h(%?DjKI(*_6)dE0W~m^BLerS~R1@eE$V*01$$Mswi%&g9ukggC>&@uuK& zFl-Z?;(#xlYr;ypvlz%lAxy#O_7velbI|3JjL@2=na%mMqHFz?HQg|Ybl0^hmHCXW z?LgW7fQeOAA;Bk=;aO-KQG`6QFRgj<*5f7Qn1F)1Ql6zWrqv`>&A);Nsh?^=1*ewC zP?x=XdEdTwo)?n5#^rJ2Ya9KJ>BKl?_+X;mZE76sfP`4Ss{h}riz^em_FEVJ<$aBD zj4oeVm~(=TgwY!>zxleOUw?I@`VzAooIS~t8adWzi}-Ym6C(LOC^%WNR0n&VLkClo z2<&VJy_&v$mZi;*37j|CNvTfw|C1jJ!$oWP(PO_;l)jEc7@ZO$1~1@K+9tT&n(@Q? zOWwS4;eYTH-o4!8q=w=A_F`V5c-U@DU0YNUZ)r zVeET8em759y1DB5S2GvLm>R`P|J}O$Sg*GppElz@5jGVl&f$>!=g03GAZGy$ensn= zD6OU5OFUD*nV8lEuj*OcdnFJ^{~h`(w;r#DU68hgD_El3hOmTN-?;f?4$Qmtix8r) zIGyk+Xc+;W?)MUjhNZx{Q$Jl2IH_sNt1!F8>pG}kmUB`~Qjcv)RDj@D0a zCOEO+F{QbHr~G-zlDA_%N+FqOmBmk?!_Yv*1GvuLoRhFrl1X8%{1MKM^_=O}y6QU-u6k9Lr_ zRdcxOC$!s+%xmk@?j810{J(sX1zz|qjV(Y0AaHvcx&KUa?f5ddD2+b$Lm4^l8}0H> zs?@Uw$Q7qx<~`3M004^OZ!M5&>$r$Km&zG1Xjr=y~346c!F z5jGm!*^bqUf8Xo`^V|Z(kL81w1eI+;ut;%!zlLv`mcZT2nYU>2MVbFYxVaUv;8lys z0Xp7%CtgJ01D5ufm(}6C=+c55j^US|kmNSfpg6Ow_ysWVU7&-rIzl3TqO_6ObPN1Q z1G!FNqC9w~;MWXm7-oU8`Sr+GU}^Tk5mT;cTXLqpFbobVp6z@c|9hg?=M+3nh3LK# zGNO8&_=kRdT8l|jJr2QLRWCiQqz5Kq3BRR+=K%A`?DUL0v48PyZd<$UcIs;j9^*4|bnSRgqxo!3%C?0*gT=V?>WVKw%;hzwedc@6 zhb*zn+B~zR|BoJL05VBvn#kL#?%rviH9+Kuoo{cx*!fnof<#+GPEttx22c0VAh;+- zo*o^4KSewAl`b#g^b0*-KsK;2*7yl`vJAX&=T|Gj@K!?mJ&KpNSqGmQl^VJo$CM z;?~mgpsa_3(DqtD(m^(_f@6v)KGES5QIX@;YNjz%>y2fr&6%d0b?YoIr+;`If$OC1 zf&tXba?nxnaByZG?xY=_*p(htOIDZL$AuyTrBN6KT{l4lXgp0Y?#=kSj)i9&^!C(2 zzQR(eb+rpfMT>w=zc!?54XMnGg9cXgfJqfPaR)q*#?j z#n@O?30nNy$X8wH z1^!A^0_j9NDBAl!&#zOt{^wE>&-#jK+{%Xz9sm!2!xpQpd-D>A=%s zGk}D=`n`)6hrO0mT*XRN=k_z#jNY5do$EkThMb2-WcOw&^)EMyG=`WMLjcAUMKX$- zpyX!XUI~w?hIa-zGDv#w?_Bt>M4o(}VB~A8w%rz>^H2_42WetO_0bAy=kAx614IQ~ zxnQ5KjacEM8qSD~5xK`rE=?TSQ8s0V6ZauJUNgQF92+aRi_KTFbUQ-^0hZR|(Tg+A zLKULfvj*mzOxF@~AT-p}>yEZn24L%}3m2}ISB+ihXz*)0?0fP(;-uX|Y(Z9YOs;fI zqNe}UjLKBG3WdOkX_4W+s<&AtI${y4L2Y^E>X|-s+A8Ez&Y=y_edtN$yWyAHs8oupo40CR{jT*u@FQ|}0!8DOyU-WgzoNzm2w-SJo#x#O zVbG8nb(ZOd+OkPj4>cpL^J_IaHSHU-JZ16VPBN_JIzK|D85>1MZJHt($o6T5mTYgM zqb!!q^kdH?z&ao+Yr53l9KG!ztHi+(SY`=)vsgoe|F4oBD1Ce}#mU(N40yh5uU=Op zplS!CR5Uce)$n;d%_&PJe}AT6ObQN(hY}JbVb5Eq+Sue_abKHHb&qWUvIzs@JW2+M zbwKgZ*PBWF@p}%~=Pa~lcn{{~MIPkKGCwo6ma-R_mu)nJ(hU}~Dv@7KtdjANQ8G(g zJo>JBylYtCb{ziXmcq(>7fjT=wH<7Uo~?ig7`449J-v0cpO2Xe5{6Kd_-;SY@hb_j z?xoo#R58jD;CvO-?v%ZxOyRd-!-#2}QV;7nk#ME%g7nxV0P0$Ew9w#%w(&N2f7v*{ z+c0OSyCqmOj^M%e;i!YXdaq$qN$J1*oYR^yow0J38{r50?nUiOd4+;Mw4A?K&y}q{ zp-%&vAdc92El_Lhf z$PeqijmA&iaKF{SVGW#76*UJ9M~EdSgceJ5B~wLg=$M>DZ|c<3Sy@JbwxZI`o{H7~Gjn!7GO^$ruX?JBtT z%2k(KBJF4KeYVzclaI*XT$QXPm_34S`EsRzjpQFlVJPY3b#ELP-T>vK{<+SXh|dph zEoD{LUs$}7>72w7>UXWe#NkLlXJST;>hU${3mI&pA=7mKhI=hSB@;e+(?RNbv>7&s zz@_S6PfB}DXQc*u5*MkI{2IbGJHe&n$W7nGV{d?b0_4~Fz^FA00_*-wK4zMY&vk3X zP04DMorQc`sK7WzOhU2^N!IBT z?=}wRI~ShQX3JO4IymK2jk*H>wPn}VGUw{|p2yF?aAp$5#?awTnB$vLCF?2NiQl~>>P`xfZ*i4%T_A*h2@#lHYMRh#RJ9$CAN~9 zEyKIUuRRF7MMJ#kF=mT&m=5F$Ur3_}U5`xhfVUJlE5ZbQRq~dkkK{0A`4C)xME#O{ zl(`DP&a`7Yu{S-`>)cK;8{=C1&cFoH>c?;VGB`;&ECahGc~;g}z}xY?=Y|5oT1gnD zaYJVoSN1s1(1BU9G~%nX7sCae<%&hv2yTnE5IRSTV(oa~5yHuxB9^NqPHpFM{fTXJ zdC07Il=WrRE^sr+anp>)~$OS9AdEBiFqIFP4_tKi}= zpCt>6H{Z95yo9afvNZ^K;bs2ensMZI)S3_!U8r={nqgQVL4+J7p9f1xIv^9INsB!> z`l1NNU(ca}5B2Ds*)vu${727m>nju1ELl}EDwdX}FtqK+=V6yJSigJ+7j&Rb<3k>| z+$|ky5ZY}6RpsTd73+~Cg(_IQ9Llz{%>ss$abFez0#r zBQMmDnoderN(nseqZ2K2+9 zy~!$SD6EV`avc|Kf%AuNfvtA2axHV80~Zd2{QDPwF@2M3Lw^|No-~sY7iU@=qoiNiI?cnzl048A)UFKn6C4dK=fW;p zywB7lsWSc9Rp^;sXG2mV8A0sJE(bBw3>nfjNe5RZrk6Q^4hfnB=A|c_!EZp%dOvBA zrtBf^G@+EX>>f*i4PGbg6b86@!3`XSMrS^gy7fS?Jowv(PVE@q4xe%QMR9|bb0>C}?m&d%alJAYTOF%@l2%dX* z2lStku!Oq0`NQkHn{i&M<#_e+1|smbZCHo;uW&U_lxn=g05o#9CTn)eTr;IK*>dCW z#wV)?9Q`)ju)3l2cN>5|lpYhJyWFrQ_XIc_`9{g;U^`62kVnRk2|CE5!HyIAm%GQ%ac6_hJZ7rYXM63FjQ-a-_75Uwy z`!!CqoJ@?REL9_|gUtE{`~gpg^kVj{`yT-?q2+2j?EEUT$l)PN3ypF@7^M4pOV2K3 z`YsY~`gNJTnt%hX+ejn*kXqk=|JLvAG|ztt*KhrSxACl=Yo=z7SHvgmC^>^js*yZV}YWIUzdmP2omqihxk{ zQ9k_p-=C|QlKPWp&!e8#UP>@>hU)PN(w$l9K{7~sFzqkBK;(oc--Pc~{+E&SKajFg zQ_QOCdKZ5ZuHiXajic|>IHesP!jx4O!J4x4Uj@bF5ZR7xrv^r*H{RR8 zMQtLPiGad%Zcq)Eu?mrE;_FozM+81*+;mmvdvW#R@As=CV#1Of}nr8 zKJfn^{xebG?O@@O06nJ$L;aD`*(TU9!@61Lmwb}(zF0y2dcn8o2m*g;o*75Rkq!|K zIWy!Dbn!npqnj+a-*MZJ-ZwJ+e&jx1>sHSp?p*vyM!9P@dPhqU(J2M-s38yI&22hF zCeii5YfG!8c=ms7VYLR{g%<7!lrGh|GiXFC(#n-gxp3}2R2;Lg+xXX9VKMRPJ`mg&Uvnf>aPY~&6X>2|sTVZf@P!e(jwu3D<)EjF zKf)Nz_8y#)9<;e!==yAW1{0~RoIq~tnD}3v=&52_B;2~SmQiANQ{}an; zgdm|VjT;R?@T@!cNxWZ|lk)MyofZYQ5(YZ+Lrns87^_=Q~!@VgfYXIst%oId9`Z%7a%8HN=OVL|5ND)OlSee(n5KKA{f$ zfk#e_Y!{mbN6epGw*mGB8Nn^#nl85}r|N;J_wl>Ctm}R)WU5w(_8iP?S7#6}z&QO`81Tq)L6P$!#2>;^+;T@fl8R4`NZ{#~u2*(g<~bm#xOx zV}O&%!M?K1M<~|zm8Bl~XcL#p(_L+((QkH?|8&bv^Ndf)Lx-V)wrusGxL_@#Pdv8yKpgx#gXw=eDY*R`#e;VDU{rGbm3X>GP9Cb`uj&cR zh*(YCkt4nGd*emcx(<@V*H;L_!{ZY9F%cEX6lD9HonU{I1)!FtKfQ=(ZI6qucG>*(wygPGx0ANy=VTZp$<+ zE%jhLu}foxi9*yriZEgi`GOljcbx}8AX4|g7r$4CT*d~+OE?@|@aj{E9=ttzy=_gf z;Db}(!%je8lL*kQb*Q0!urFiF5t^fTRp(Xp^5kE4xqX?r*9O&R#+D;b_T#s77EL1t zKY}*1nlIx(f~~@CpV5+PlCiI!F-tjMWcJA|v*6>?>`Nx+WnJL`T zzgMt-gC7?k#&Lc4X^_%lh_UK^{cYr1ucd`s^i7m)^wOGw3Iat2nTbE0%<99L2Bx%tJZqun%Q7 z=QGr`;74dvc0qoMxAUhy5oHLOZ_8fZN!Qsx!PSSm*b0_WPs3|PPVdXyXSYj*Sh>Bx za>1KY_8zp$x%IG*KDEn8!@X1DEIGlDBu{=qnYq+NWprKDvk$unCt0=-$gjGT@NI{h zc7qWnm?nu|vak5C@!{b}2kaiQDIBdsuwG6uGnSjNVJK6g^|(I)nYby#@$0}ssQ9oi zI2_;&o+JuheK=5Fd8FD*6XohcpTDuS6Yyy7GlXE?cDSVpC|pC0iBrEQ=II$uLZC6e zzsnI3S#^T-D{2jKAQyfv)!jN!uJr3=8($B78Jx(T^30m`OhpZX1Lz&! zoBtmbHH=1(YsgufU&NomkHfDnC*S8Eu8z9;<6j-{!Pc!iMWL3eq5bK^ukHF=AGv2f z^A#ABSN*c|EL=3q+|~Qs7;EPMK|A@blG;Sf_?>+{N|s7_cy%z995Jf*sLd1Q>-C-A zUL}Xe_6Ihp6Hgwsujc)cRf6bYm2arN{#CIvA!zSe);ey_B8|QXic_bW`HOjNpE~1= z5j_hjpHXiPktrxW6i{(8pZb@KUQp^*9$Wm5|w`)+*U z8!wNMczebV6{L92vw)&wO$4*h4SEgqgz!(55kWq?IHkdDzHdMm*kRuS|2AH{Lndrd zS@5(bZTL96wo%gI`B!L(pW0VZ1(*KvY%Ni(h_ZhWt7zCCBldZv6m{mqW^5MV{I)-cf9*KaxZM~J zFAgVW;c~_LyJWmP7n_7j+u&ZRB_4cj{3-aHGe6O`fIQlM)AsR*NeJ?Px*Vc{3B=zZ zuCc~urQ2L&W(ykRfZpWAC zz>^a?XljPRCO^2?9z90ChVFNVgE-*;lUG4a-BL`+JibX3gWGLn_u}B)_e960L&3-F zgC6&Ke6l9gSKrwFdP@%bi}yK(ST;MOE3ap$Eo%(9$>INx^>XlA!{(qtK5f<*{+oJ( zZ8Q3ZQv7*ZV+6K`4Yn!*8{FbF{?dK2M#R*0Ch;~x>9 z-eI@zUpyP7mRYWR+0iHUh2X4|%S2&*xcP}xvsLz?2iC!^G4nUNR2@xVN__D9o}((< z&NXIR3|%w==~g*ME{*33(zB?3iHLT8S=0NzS`t;39MG{AmY9td&_8}iusVpEl#1tSc{I}<7V1LEhK z=hkp_@v3)=_Fta)8~ah*QaJgPH09F5zimHlVr2;g_~82?VvjBpH~QVFW|D77M%T6($+$qz*3SW(kDyy5^TZXE5&Pq-uDo-bIvj%&`b{LowMtk{nw*%L){d7 zpGmgi0i14_UZEs&JIu*Cqx7u4E0F7c>%w4v?~9*ySdgQNdaW6md*%~nCBbVJr2*Vh z2jNAgdYItVJLhlDr>^FFX;_IJ14Or=REm#1T&Y`7>X)Kbke0TVR7B6mycd_k!w5looo%RI9wr?v6aHDo-LG~!OO?H7*D+5+#FnB5-D7xy6Nssxo3Kp z)V+23hbj|-ZyI0JSsuiPn;1jN*!n*8Ho_q?H291C{voBHGfeQIEHAe=x(xb@`c;4C zWZf74VrStOf0Qn(S~#$RDKLPtpJd|p{xNIQ6|*)$dAT1kH%UK-DF)vmR@N2T*w*pZ zii-W#5;Yb-@cKq>V(A5Pzwnt4r)_(OZq~%_9Nl@P#`-VPAp}+fkEfwwupI1upn@o| zfEF|$vav;Wc{u_0_1}<>a=NL65-PMWvjnEekB=`l2#>Q$cD{vL-fh9-D_H8|1)+CN zKD)sqo+yQy>_QX8TKuL?Xdzy;weHODliWU$*EJqNiV7(^v5T(ke78|=0>>YXlO+zj z`e5I7KS|O?I-Bh?uEfNx?T=+QTzSipBawX~B4$H0DJsT<>R?~=6;sQNoaEa?k2(^q*^fl*Xowkpu;W@92VHel5Py_!-3RWYk&=(@fBS2l<#K#bmpAZy=baqn zO(H}~{IJ40T%&AzRI*g=zn43FCTq!OZo&AZE}Lsu9$Bbn(Pu5ehg8$e%=VwwaGM-^ z%-b$6x7%F>YeZi*QGCWbz-Tx_t^gaS;JnKZC@Ajo2yG}U!wpfWG7pq^VZ_<0bT4dk$Bh7w9hpIfcNUsn()$G#-o5m&fO#hnC^JkkmjVD3T z7JJ+gt(AO@J~LNF;-*L}=7aJuUMg5(bqzCF1#aPJmRMR&9Y4Tuc9r8)!|Dz=`p^sJ6~pA+isnBZI1z~ zS2tVZ+kQrHcIV~NE0^!4YqUjP((gW7lLY5id;r2xoR{;$Hpf&yTC{~v^VgA81c)}p zcOI+@dC+a8tBR#MFKCKs z9f7!>ZBX`+bvcN4(=$xWPTXsC)Xjx_wn)j~`X>vkS@qcmJyZ5&ya~dMMzGdOsCPas z{et&cJoUf(J)#-+ivxSfFVCh8RWyc`Y<*V%^p`11BP2h#eCdv~i1bf#-#>b&T#(-a zZBY9?vFx5_h1}>?`Io-wi)36Ro4AKxg=9r6LdL2~emDo>^xNMSRHRF>V*Jv~+s3}| z^vXVC<_LSWPpYj<#edm>?!kVrJWH>Hw!#111=5@`PHc0?Z{qfX3EbO&@J$h_E;;ns zlD;p)=QASBrD718tDM(0qNlJo^~R zMHWUmQ5<4<>z!XIrv!QavW=`Z!V^lm=+GrMpDf$V;?*(;5fCDfgvrzWWr*V{L0ztr zW5yZ7Is?Y?9ia8XO|#`AMGa^kxPEoP8q~jk7IsJ5_>EbvW`jsIfL$B-2^ic^4Okbb zo;+dv;FxWS5Ck7CGmjo`XdcW7LpMM0m-pqfY5AlE8ylUYNF^oq_Y8T8dJ2Vpmt#L3 zckZFPl?2)tUNhtTUvSaAlF)JkO=4RYXY)N1{I}8GYBEg*~hHGGu ztK@=7-;B1Y4cCr!FI5%bI}$)eLb=sTjTP{pL{chJ4>xuA;RFi#GERc zB$!KH>C$SntF%-0Vr3-%VV?-Tv+C^#gGa67oU&SpF@I}fH409945tQ3i1t)cA2*9~ zAySfuw$gD8=*)|@&|)-Dpg7vlgO5J-5I~~(Rj1j_$Y@W+kjAWx*^C!|oUj;-FZk~k zH*JM2BK_Rv+#5h%r(xNz8=q793dCloPW75HPmvDH7U7XjhYW`;TAP>et%4}^74?KX z@d1w}w+OJk)$;ugVQ`QzNL#htkxO{O=?&@mG%Q{-_?LDL;TTE;B=f{xIK}qc%W87w zf;H~+CWL%}$xiAnW{XOrL7<+#h=Vr*1nkf;OM)W%|0Dq6w}U4A60%Xms-M86+Im%7 zY>d1`_QL}4A#WS;agwZ+y+hgKC1<^0+FCAnvjgmBgYU#NgxV7&LaIMzeC;tlCCj>% z%=>D_%)zN*P)+H{TZl{om)2fxJD1dI*;-1sGbCqjD$9YTtKCg~9CpQPMEQ`^AmkcB|G=Osm6v<6?Ry#WUgz%d~ z%L-Og<{~<7_WwBj;~R%p8k<&QneBWhZ*r>O4dzf(Sifp91vk3Wk&-%+vOB8cqAk2@ zoBmmMN(+79Rzr@*gmZU|23HhCB-!PyxJ;r3*0r>d0kPTEzhM@LeKXf zWVJE}^EB3%0|jUH9hT`acyk~wZtIJ~9O(^Q+KCQ47T@2k)GvF7v2ZO@Yzy;vgOma~ zTPtj{fN;wZn|2c(r!NqG8w{b2yiz}{`00bW$#^~r01!W4BPsoZZfV9N{Kn>*TJ9a+ zMv~j8E)wlV&Zqh-fbcelquUL1_dPj{|7hZ!%QwssTne9SI)v^JnOe4xE5O@=1C0DVxnp#v?6#(XGb^wq|&Q>B%+!9rJPj*bHCYkuHy!EwwUV)PObN@0J&fd7$@fknweUZPu!Rf* ztqmI$Qll^!wdgcCqiKclj_xP=^pD7A!7Bg=?Yg__hhjWJqu{$I6H!X?8}iA_#vg_A zu@;)`|N4)M_zeVhT1JYsT}2IQvSu-guF>>YrWx3%485zK{8Sg*DL(_bR6#Z`I%zT1 zH`Hco)+~9Zg-_k}zk4c}yk_dCQc`ow$}d<45-FX7#KKg;CfYYN0BdM3u?cOI`8f?hNb>Qw(|fC?Mc>YSL2Vu(2eZ0AI^#FC#}L# z`}h{@Sc0i%mv`UFwA1Rr)Wnxq!TjRvMI`OC+kKtVZx?OxHD~`M@Kd#%>6>mV zZuuV;Yf%*LR2LP?9tYO4(YL5mBxK){{KoFrxA@cvV$48wOY$(>^q~%Am$^jkQ2d|1 z9d1m#d#aUYb%sg*&3E77<<@a2>}g~xC$f2P^dY5#MM}=7?rNu`8r%tjw7X(dgJ0ZG zP2@`xDd9z1tmcMPSE0ol9qY=B=KI?^GguqtET+1$k8qRZt)}kktOE7}IF#x&pPjOo zfvgKPv8IZ5^t|EjsIJ{3nS7z#8&l6I>QkH74pZmdM}CR2_ar3z4^OX5!j=V#4@ZPl ztjRkgB_Qh`SOw_pdm1D@_wZE!p+UiOk-@2^GK9fPdV^9U4qiA-%BdRN^fYu+M z=gF-iusCZ8o#)6j*$B8Mn5}fw_)Hsil=P$~>~K4C=ri_LAp%JZ7k#vK_ea_JPH6G9 z2;&JK86^f#2z+&FxAv>(-AD31c%+WcpT3CE*iL&KG&-!oPL#=$_4)J6Pv z6w^R{+WLY;F3j!hHP>tlz+rXW@r!-TQ-d{B@UrR|1Re~;kFO#Eo7ME>Gj7XOiB!~J zv-yEcmYJh14ijt(v@)UU@qzV(U_5fsN5j;$PlH8)QIxxS13*&o0G^;XM zmimCE?S}jytFO`%xea7Qf^t^JUewoOj5u6U&tZHf%d}vDsz`7=Z0j>(mIN4 zspy7sL85lw;Qm(pPrRwnx2J;QN{{vlviX6E@#DuAx+Z3KV|~M`DBsO3%jz-+0rP<^ zirU`U-RT*_vsa$n1P|NIu(d@6uOZtrm=`cyg&iMkNbp*)^_BFxHfxX&EaS|=V#A-Z z;F>TP-P73R4j85N=#?<7vm_j66p*E%!$U7^y=lY$RByOO-kqewaNF_)We!2j2Vn*e zH_XTKrMxNvVwEed(j{c(OW>jJqZdxu)5x{q;NHI1&i~c>r^)G$W$EBv&W9;eraXP? zjAQ?B&vOgkq9-I!!eiMPbKn2gUCMEw*)zW3m8cMnZ(VrZ(eGaV*8qIO~Tp(#Uw!3yTjw;tbj#pO3ET$!O)5T*@~#S2ih%PUKlZp5{N9!5#O5C)H3 zuBU8oK@8hjz$sHlUKE;tg}Mt24OE20mjg!i?fbWp_ssvVkyV%_CXhzmtl4Q73ca&{`(^e<(+%MBN`GDa(Q3YG@ z>zVpq{3?{Y92t@i9yvCBkt2$*6KX2l z%O#atd4B$Y|2rT5CgD`kV;v7jmY-UTkaHo2G(D8@n;F}*;67{RXHsGJ!D4T_0lA8c zM(yklqFwkiQB*k>MQF|GxWx+QiQ1qfYf(JHA7g+)gW9E2nSzmvSedbaF zS7)cvc9Bf@)WwLnD*v8y>iXZeG&3tZ`-Tf2#{Q*3SQsHNg){r5$tC)mqV{IK($ht( z7CEiXEjOMe?T5RKG_-|@+gcg#ZOGDGBs8%hDT4}Ai!rsTbv;-$gUj$DT!9Z2U zD)h6pIZ#JtR$dsjLu&;W!&QJ$C#)8n>Fb_ADH*hx%>`W`v=58hR6zVAe(>K9kGYe&bXtgdKY~u!-=KLXb zkBa2n>qerUAYq$Q0&yw_1*1H04)Pj8MrZt)$@{YPi|oenY+;^}{lTi#xA>wB7&Jx4 zIz5BLd0#{TAcEr_xi6&GjDHTL5`jwQ{NV(8)}sZFi&zW7U&jc;&Uj#EulLf336FSy z6}c9b9uFgFU0S?8tpUikR3QZl z=N0YlVQI{^zbUHD@hhIr;!)NB)`C33p`!5-Q6Xzx!_f|Y3dEPl_ZQ>)r=Ce)v}nqb zMO75n&E-FwL)4LAUQK`=#sPvWsYz3(@2gG=Ew8Iu9t9%FHVIbTr*lLP@gvks-Zjq+6l?3sk^o6=gJ@?`CL z-vx%t@D=;ppKf!{1%iCF81lNMs0Dnk!Qp;oW6bD<<;oA+N0j-C4aw!J6k?u)&I8bx z+cxlDy!xuv$CWMmAV^=As!yQ zKs*$M_m^#` z22BV6MV0n@8dW>5R|Arpq|?1$H78lN+V4rrP4T`rE5KJ+}rx2XVqG(Boa5f&OhX1wK4@G?EqM)9shce&Op#0NvkZf$vn ztEGRtY3V6xm|B_Phks(sDiYq!U6PkXA{>b}u1wtpHJfX$yurca_SH?dB;o~UG$db*%mZbEKb)lO~W{9kD`eI#%m+zFCM-50e5{tqb#yS>HA}se(`l^>Wt$!hT=~-45F^EAl;R{ z%cKfd7@R)mTvih9s9&5D1n{DiztpMJd|aEbK;z|~&r1WsY$lVnhyz25TkK~|+VBwe z7fad8KYB-|JvMO4YXhsO;!^f{y(nDJBsyE(@p-);Zi;ML{i0$YHe7Da3m%j}&}V>%F)VWR>=5N_qyB1(vd+5~Xiq5jQZT z)s?7`(0>}o%NHZgaZM65qz4#Xu%T!lhO0}?Ca*weUfa);C(m4z-A4(?%lP#rj3lSd1bQ5XNPr%W9hpYCax@VyRJx~NOp=9VIVt41@p zN2hHs`#U3yT13d{ivaekh5eUaph5AnnWj=t)3>0+ygEO6Rm$F`H5o-dCrV<=Q=%ce z(ax;2X0$Mci{ z)BCONrzi?JP1Po)Z<3X$m<|OS3jWi5I`f?Dxz=03biz&Bw;Kl>-;N9ID7B&{Wl)sd20Z?>`^=WmBiWsfb3hm%;dvbMs>f_MM(mPl)GPDg~6T~ z+pe$Uqq(*#-%W}I-o)|K1P9_{4#kZ;SMg8X3^FE4o=oXys5TPuZf9hYc?viT>{-_? zTo?@v3V;pERi(Sjydu`x-WGXOj8&xZEUP=c>`4=qtW25f>z}JQ$B$%~h_|arq^YBc6%~a^m`nY+ zW?C#@we@UM0BD-`Jlyqv|S;1_JTiOUr5k zsTBI0$bKDZdo;X?_SSm%jVW|1K zOaN*toQOb~{9F^W4%NAm7`#FGCBegYRo3Y;+Q}-;9%uT4`km49-d_)0Doc%OQ2@0f z+R%#-x_176>|26XSK8I}S*-*B1d65F+zmwzyt~i&u_5^QGaTdSb`~ib=ePX1Gh;zrEdUQAosQ-kNT_l zf`6uv0;4V1)Qt%LiXv{uE`5Kj;HL83)ZM|2clM9ZqljIHy@cbS#l0&IQeFeO)qajL z5&bhy37QI7N{cq6J|hSWo8}9V&L>h5)70adfm^fB)_c>aKN67aJ7Dc*-7(E4EmJY8 zD-bLyon(iMzZ&d5;0-vWvEho*tFl^UWYfznNpcIGE&hGN(e5wPDGm+TWe&SLvY#nb zRM}n{b7IDx3$ZBvl#VLE=LSkNLqTwj9{#ws$ns>Mu6&z@_%988s)=q`ich5?{P)-Y z8OI-+FAqUYppoSp|(D+J1! zzRXbr>YAIi_yumzx%`kk^197E%+u{P&}r~ltvV8~w^!D4W>NA^QdAf70>MRTyh zUT~59ERIrYyhE*asuxE_1Y{c1ljBczZxLozx@TjZr|v=zPEC|Ew@c^36vdumX&cDh zu4gRK^{8J&LX(H)03iPEij-o+$rzz%|6g95SZ3k|Jwi#?;4FOT^hNr?{VO#_e4*r@ zdG=)=LS?6UNyI3dOd`VFrxVLOdY3f!E{z-U{6Wl%M@Ar_Y3_DIOn%+luxiMjs_b#} zqyP-V)uT=cYH2}nDa&!WuP68+eG%RSrSILo^wf7RuN%rQ+i?5R_inFy&K-w=_~gl$ z8w5oarn0A(t~h;yx{uUd*2%I2b(jHT1fsEy;eErHIkd|=eFyD@?dkFNjinU$U! zw!S|8rm-hZoh^U=Z4@|*!yqW4(myh!WjyTUCbF;`I$OS*m2G{I$^+ep zC-;rs#3r({oLdmj{B=4n4jYYS*QIY#mciSYN;IT7Qwa)0>Mrhg!>~nLC`IaZ6EugS z!Vtc}_&n_~Q}%HS3R&+#%FC?ep!BWnQ}#siFkM?V%%tGR*?LRWZRoaEkUxP979}C%EW$nnQY0khT_J5$UR8_ZEkrx$JJSCz^W81t;eY-6Q#r7qX&L{N}j0Q|gOIK*%CHIN&U2qZdwqI+iRAHQJJ#w<)=9D5T!CfrzE=3SKaU zW9Ad&k$7zQdQR5{RCLFTwcK|}wRoNw$$a6IHdG~}S5jPuYGF?WO{Q|<0b1YT@g(I5 z!MzA1{hhKqpFPjIc^ZsLLP9~RZrA9(XiTJ>0YGePBw@>*E2@qMLJx|?faOi;1%_8_>HN;18n6QYP~!k))m$>B zRVtvPB#cUy9ojXC0$5#~vUtNq}%70fi=2s&>jb?(f%$y{8cr*0MR{rp-uE zjM7D3*8m8-q;};$cJHo(sNs60af>)owh&<7WIiFLljRcB6$;|ETEPR@j9?_@#uvdU z(mB7d(z76oZD53W*$GLFQ#c>oi_h3LjTO#qqE)Oj9>53mCMRD*Zp*HeH-n6uR=EbW z(FJdpjbBm2nsz&YYo9-!k?|#a-k2$YdjrrZ)3^#;tt{A}gsP|^7M01SCyjmX-&1jQ z;IE?#AcG zWQ4ml6lot|z9b6#YWCy(A{QiGf3%k`DHju@3v{}P_DYb*t%0Q@*PNO(e9GV7opG#_ zYCPGI*+=OgZEc4E0-b0@4Y-ZxNIKj}mV^9txEUT8D$= zKb0-*-5A;VZkySAwp_<^!R2(Bx#Sz~bn-cCBV?kHMU6c3 zKY#Dy5DYt0H`wG)%CL=zP!r>l%qP^@6mJ8P$~o&s8&jKkB@2LF;{Dl94IXm|(d2+H zMkl3Hnw30R(Az{ZTpc3_;=oBr^Bj;z+1E=--(F(FQzJx*DF4PReoor7 zhDdn>yhF`fqhHY7RB+Q>_0QqYHeO+o8_XO3`?V;m7*N7 zy6I71PPKv=%Q+}t)5@MJ_b3%4ubepboJ`#nPr3AYwJ`7Ev<)4_W(aqp=%$p|zrZa4 zu5#KfvgxC?XwuANXrdN1aX>%6BQ&>WiiTBNXH1&1Xi^n7Q1fi-n?1(2;^2mpQu20v zli3eCD;->ZJuJ4hTDam?vcrT0#4_{|!Q z$o01styG;Fg5J@Dy3#qn{Z>409Zz-B1)8>HHzFW&O$vrXx z)m*^T`u)>%g+y->i{gS8@cb{Uc2>LKnH5N45^$S)TynMAE7)d{I>Z1peQBbOF>#AV z#Nrdfd5v{cjX_8f!r6nED70+;P?1v{fC2H7WuTe$p3;Wz==pEs+&26yx&nkxiM@+_?VR`Fae+;M2MhhHvL@ zOQz0xfROIjW*|o%k(QnTT&b4RX{7G!-YcK4nG*aju4mnZWz|BkJVAyk#Bpq=UEN?L zZUN#LGG|c9EG+XBh%F#rGtcpeDd6PlCYQf)^O=rFhWOKK8~u&>JaCoZvP{s4(6-Pr z8Bn*Vm@{7%;G3pTA2Pnq8pjlCEi0+6!he&cZR9{)h7coi5HK_E8JpSG)kx>pm1G=B zXP5ybuF|V1iDy4xI{v0yvD;70oMvopha?GqTv)&00qhlFciK6CRX<$(E{XCg)EK~% zce)lsUU2UbgM(n>o@nzH>-c~Ju7@IXf#p|XgURm7d2Qg7%X9~RxRVCx10A2b?YKJ7 z)7Wqd?Zp#sHZ)ksttSz%F95eBu+bG($G`}Q*_a{4S!D_*wEpd$XGI9VLv6N`@tAb5 z+>5bg`)<6TXQZc>;$PD6?8ifjUlWzcJ$g?I%=jyyJkUgR?hx4cUyXws4Nz%FMCD%+ zW}0LhKSgCf8`zI>mFM8s;zl9rR}7y#``v+gwbjd0rpLP0Dw*Iy{*sFW*T z-ulV>t&h!Xi#LI$8THw0p#uP6b*hG=i8g4TBs$sVVDkhmG~3hi)XmkIU;p6aJ|py zxnsj0jk|ENd5mgCnzM^@u>1fEjx10wfje)jHjnLaJUZ?Zbg6fQ+l00%@bFEz8RxEDd<{ z{J|gJ=ndnRWQy3`x>~S{IMQk?+A^zX?>RO+;DE+6M-=ICZ+v{7v_nH4A9<0AgIczd z8Oev20|*(Dt#wapUD-L9>?l3gBZh0y*)`3!THof@8^)xr;^4gPTif%^kwfa5sn-2> zFzYINxH3tJCLsVzzhte8TqIe>e~80PSSS~KZW~Sr*wv;}0Ryc}i~eg-u?V&rOJRy+ zKo_y0N2iZuM7O7J0u^mQ0pTKt(_uC13pY6Wnk2nc`JWu&+Dxj2tmBbqAZZ^YQ)Mdy z<}Ceij*-0|?2E6WbMBJDeu%1%*ZOixAy;LOs?&`1SNe-l^qIYg4o5>}NEdL5vgy$r zl}yVvy!9``tH9P$_M%+%XYO3YE=g7OrAN1?Bya0Uft$RIBUfL0t++4HatA2{_i1oY|BA zG`o_*;ldXtlrczj|HxR=0 zS69UWmqE zcGcs58(C6(pT-J;BnuYon|z|OomX9nA&30OWnSFjz+2OB0JY|S!?o`aB%S~NCmrbA zNwM6W%=G`-=d`c?ZlM6WtYz1Jm4ZqLuA6aeTSiZ9HtZ3h#c>cO(WrfKY~M5vzoD|= zHT?<+1)g{_ew~_FgWj{hKI%tj3>%=PdwSXr#YT@0r;_ZM<6zK|9E7qFFW~gN?&!^# zQ@$p5Z`c(cgWwl^9`5_Y?BM-R=WEErI|TGeW8Zb#$c+@7NDj7q!b2p>#d@ZESujAU_rf`T`# zAOWMC*OsTbSKBX)_?y28DE;zuzu~P@YIZ;+z%V~PhZeVIE~cXtwAi!4N3pg1g2e*N z@lC!?iRb%j8n{tCeF4~eXI1~U32l4$Tcewh-xL^5dIf7^TCGrx{cW1K_ixD=Kih78)X_#yo>Lc^kBq}xLLGj>$3 zn^hs>G2{rFHLBcpu54|x=km%TbU!+T%!t6OThjn0kZ#O$7=qE5T7~K9Gy{(O#NONV zJ5iwhHa+JlImh*#>eA>`!DUUGKhUf5!q*Wh2S)c=3;)CVGMo{fz7>&*rHIf;-fbML z9qkDOtPkO++UoX@A=9Gt645l62lGY8EX@RF8dSP!Q0*`7PJc-L-i+Roq-2Mz2XLoh zVGec{v|FjW;6AuF*I8rAFk|~Ma~RN7v(_XXf%$@AUA2Ybb2;r8*DSv>P}C`A1jHtU zG~S`VMO!X8!9yVWC)FMdLp(u(eTRLpzH#~{{T+O2fhl#eS3=?|k-2uxTzH8>47R0w zAlTR1PT4*F_#3lCanGot7}wdyFmK@G=Lc5ejn_20LJ~WJo|r&VSG+;(CMmxGKXF2tO~ ze<>xlDP@PPc{ES`re{|sCi$Hz8tMDry3iJ(J`3?2yL8UR#bMpYBCH&}4eM3CmZoHn z{KW`3ZmJ4vqLIqpvg`RsdE$}lQDyT9yaLY=hA_#j9Pt!hlgAeix#4)3M;d{WXMcE= zX+=77Pn$}qyu` z4aQ*rMr8wM#XzgUf?VfS>i{FwP; z{ZJd1TjL&BJ&82Du97pNj|5u`&1;&g0#bOD5*Qv zkeIA%q!TJuu$3*O3iJ0qpEHg_f%mDAGkT$+CpEUptnT0ZRKW~OHRcpF&P*McQqz7}V}3Ip@TC)sD^ z<%l9C@OpR6TsA}O?nN9)uIxx~6D6Uz{=(vwVrXH2Z?bRdE>KIR%p7!#G*}~zI=Wgf z6XPO(Du8`8ri`8fRrRiZs@)h7oGWo=0q zxS}53C}SnZ7{o_^F69Pr=I9lpCVjLThVMin@u|DeU@)dfTJ@Waccm`^FN>7NqvdAg zq_)*Q(lFF6gn%e@CRQ>($_JldMDyse&T5oz%$`Y_h;lSLP!yK&xtBF_41&cc-DsHz zg^q1jJl{=j(TMKHR{PthlMk3PjuUg|-TCJ+5NnqD>xQ=Z`I^Mf*$`aZtRwr*sy+bZ z5gj6XF3r?71d0%~Z1<34PLii=9{=%9^zA^R;p~~g9vI5n&JNE$}|%zi2}pa}qcmG-E` zpuEHe$V&+)F#t_V6%2$!hODnZ4Pqha%bzw8@9W=%H7K2$Zn}5hcrFdSq#sV-L^~--@H%`He6K_v#Euur2#hF@ z|MW#z(fm*bSIml3hX(9R%1behN{=(0BgUQRbJSyBO4<-JR|u2V-O3@HqeJ_`C(jGA zU4)(H3c{#!pe3Lpa9d}Zt@#7v_fhbg(Z~VH^KXXPZ%)u=jTow?3BiJuwm}2CCRm8) z)D1P(jaPx^g2dS+zoD4cW>}cxO6&eDkgWW4ejOT3^k4p}n|y%ai!bv9`7NFrR8#t@ z5lRAu%5lKplN&MsF+6AeZA~~}Qdy<$a;~W7XCMD=>!St?OE97;mvDAF;(xbl!Nl41 zpUGDH@91xnet*(WUtaP{I~q*$`3BMT=xxmSn$@0v%{f07@8w%wz#}}W7LG{;h%Y|s zTpiOa4sds%I$~*#3IOF*c=7w7M)!j1vn0$3hNRDlWu+BwFFm7COoi5OP#+sJPaeGM zTDgG9K)VjjeW}K(8En#b;v;%G-DDVg@FcQ>nCNH z&1#h!WM6A#+?(y*x@ULxTx_B({fCuDExSI(($I}0pL8tYuRTe%>zxZnQl^eAeQRtj zkWC~4t;4(&b0HA1XwhSZws2$}5BG-35^HBWQ30~8A`iAjb5N*`dkUYeKP{9(2L&K( z5hF4nNP%Ee$uZ-%+iwaVpCX53Uv~4!9#3So$~5`w=5pAX*Z9I*LuwV}t*FuKuvr!V3NsD{7q(T--E*>ffT;bxRA z8Ct}Unq}Nu2M)~9ceZfudn_2@p+SI?4RJ#OMlfvJf*Gd z^q)^-BcFF!vt&y@kOn4uuDcD+VcW|x;cBR`&W8FnI?_vqD0L6st#%KV`RLdL`F_=2|Dep$^x1b+&#(W|AIAjVQIru_K2hPqo-J z3R6hWu+cOJbm|`$zV)J3mI^9=@S-jBto6VHr8)@vDbp-%RnkW}k>7a)TbQ$9OcQCj z!FnU65$VMQ-rG>M^kr{Rn{?Kd<>l-gm?)56b3KKJjO&7zKNWLD%$SA-?l@aNxPo!O#nAR@ zi2l*rBYyYiaE8hEgaj7KgDM&Cxc(~NO>0S_ECAk{Vv9{nv+I)MZe{12 zJg$hBadMse_*a?7$%?@*T>ywZs$6Q9?`H0eOr)t}e{fV)gOlVenlxpg30}$9Fw7O` zH*E6*UT{xk&-F?0=w&@n`KImp3_img7f;>te`EoZWzp`tgf1Hcu~ zoX_%dieT)(DGY8KD@CuA&@S3gYRYMGNcnF1y%NQfI9)`gm)u8X?eK{4Dt~LDCmR~j zv)YV6g-`$GcJml|5PY(HHz-^)oVLuy2S(2}Of@=4P^1tAOUMC!O5J5}0uKc{Ech?R zU;7*oqqeeEG?>zW=Ks9klvn#z`9aPVIuHm~q$JK&Gjw^#?k`wV^_sBdzD0EyMk~Uo z4)sgth1OBU<;j?sYRZMJ1EFXBR6HuQK;rRvFdo?tKl`^bY{czzj7BU#R4;$JSlw2y zCh_L8?4#}pG9&fHVb0MyARXaUn_CP&;D<>alQ3MA69tFZ1PrVAU6K7LGsu_bIJ#TC zA(3vPPtp%`G_Vxx_7TR#)=lWeUvwoI0x+{7?EQ*rGW+LS7h1f~cZK^bmEJRGsq``% zA4rGuNo#5Yw7oE@14Ezsuk>HZOhvf%V%3Q`h}l9&868PAV_5ohaGOXz@g^kbP>TAx zc(s4)rEcwf@@=}mq#7M^iI7isA|ZFZBc?1IQI#*m)~=4A44h;kZ+3pGJO>wtJ+Vyk!&eWN10C+=s4XZua&+wW4UXQPE5hW- zChg$y1=Tfm7k)kNHtZQ1*oQLOw-cl&;+3wMLq<2v%!PXZ#|eC62L#z{5f+E1?s6|} zp&eRd+%y zB^J&wgf8^mMyMenrv-1>dc=tja%Q+R^t!b(rW$L@sJCz9`_ybR#3|rlO6o4CYt1r0 zzSgjVc@Mxz5-Rrt+j%?ICiUxY3?#eK1o#cPl0tIw`(PWMc)=bHZ{rUEFK76gL)_k5os?+`tAUlUr~wIf17!O+-~+q=FZ6@5<-NJ*C@j z%$@$ZMGs@H`*1oQ{z!~2NAa5f22Pe$CO*9l+jTbUV%S>9p01`qSdXDi1`2FxC8ngLO_#W(IEgOyzp~@DbWwpY1$ybdw%Un#u z&U1AShPRselWPU21{9eyVd3n= zafpxG1;WeKxqcNfFm{xt88?F2+6f0i?f1+p`3)2}$keRiz_e)9m9)QWiCXwG-n*=% zTD%DCcMalm%yNd*6FG1}gzRAM!qT4nm>us~q|7n8VeSDZx7eXWz+GP@4Y?T&xYwQ} z*%^m%KO9iQI#it#%OL1)caX4i9$?lcPhKyP28Un+KhdSm{40$8bEinE(#!NZn-tK| zIZ8;fSj{#p;m8$kd{#=%+HYD0gAipp$d%kkVt!3NAz>18Q@#?&>=+ANcl19uHK%ot z&$S+s!%@YYDxZhvZDTV6b&|5t3sBFlhQ`tvHdv(hm6V)gjgq3emJ98w7^d0Q1+pO` zj!~dXYU!I?Fk;|pNu**( z(>M7sQKuQpM_J)k!lNL7?betM#S>grvw;RYibKtvlP1mfX%?=8UJC}9!w}4Nhmrk! zS_6iABKcn*G-!}e2{AzHcsaS)VAJcfBi1=W7j8pqOu*1N7O_+B{QKA+i#B2g0h#TL zF5h0VbjuPKOU^1{q_u(Ys+ga$OJt6(Xa4Xge^kvp6rxyTD-PrV)QiN(P;W;yXj2nP zon|&I9XBGeJ&inbTjSZg5d76-TONupC>-hySS!G+62{6CZ_R`R1HYiWq@EDZA%#Oi$h4eM!|c)|5ap27@sc{6=F@@lc-56`xpgua<IP$R_rACpy&5sqeN85Zs1_>q$lMxrX0tkfF zbc9N#FPeBpvp~)(wR+g)-l@<%0U)@H?-xd6nJH>5l*w=jZFJ}ISnjlr@8ar7{fUY) z!(SYYN_moyx3rh<@VQ=YK%%~FfMLd&Oa%*-ig@RDQfW$=BfT-a(ZCsCfKiZnWtjK5p7Z<;C|kg~yZ8OvpU?Ao{>ciNxvuZa`Es1+c^pU2Kuay_ z184>Tzn-DDe}$0+xn{SqCSTlCm=uMfeiy2zU06RJNXR$AIQJ2?dgoI~y8$#f^1~0C zp9-k)9^(j%3@}dUABMizxsV-Q)#+Xu%{lx0Yfw-}DuFMdgJNYN_QCE3_m7(>1Jwh7 zSOL-)3G>M>7VZJcptWH>(Tv7O^duB!Pfq5aC1S-|1!>z%K_@6Uzh87u|@u?jqz) zG&xC=1d;_3N&>8;t-}^mrV9;#^k2_;_DrMq$28_Ca$^R9X%fg*h};f zwftZDBx=jusjH9~MC#v=V83gy9h><6ov95xYj?{9-NSBVC5N6klYb>Me%CC;|F&W; zzT{ou}c-Sj0TsZD;0xaB%OoHurB1agsZq{eNb*PW@mL34>i-ENf!i8eHE0uYD3>GXcYe zXjt6YV}k?*gPDB)lbbMa4=%BT8U6k0poi!eXCOc7f0A=8jSbtd;K5(_P$RCr5Hzg) z@^F$LG5vaqVm->+L!WW)37GgFp8y#$H~@(G@4elrNa+wr3aF}73G$2)W@v8=5K;h~ z)8pbePt>=OI}6T{^)rBmL?}W=#_O;vwdV`j9c0u5abkHr5`S5c*gOPkKq*l4#o2t^ zyu{$<1!75VfNdovCw8mjhcG;%P-?d$J0VG&f{W^}NJo$wQC4|9l!CmZVI%|@sm4Wb z_&6B<1oVG6hI9r46yh_;G`G=6NZYaY+p~4q4kWUNIyZJKU=H=LG6st)b=j68@Qn+k!rc|*q@$t!SKXl zY-UDqsORg=!4ok+%)uk-GJY^WMo zL}n8*7liO0eDE$m9jV8#Msn6}$KF-ablUUj^?scFC)5$aCCheg&u&EO+n&2DXAO3; z`De&Q!)5{_6ts;JqBI!16Z$s7a0?JwcnAR|FnMk5`PiiU2LD*nWQk>9E}yjUVpR+? zLf9o*hj@$)o8dx8cOPlz4W79K|H$2|2af8PnZO!nJ`v+=DEqbf=pc;+IN1UDF6p1S zZ6@JML+;pa<^*YM4svP#^69cnc%)s>>-}B`fep!m_P+J7K29m}&ee4jfe$j?&~q{0 zjn8ObNJw-6B?}3vtXQ|l0y)HMlFZ2(qcN~OE;O$Df!rl&?T@m(CZ(3H+p3dbG@j1N*#`C`jwfh@`B%a*h4kn z1`v~W6H4pZ;nC2P4@9nNY|fXF%e#n4E8Mn&@Al@~jm+A~v@4X-#_?^FG|kKP z8>yW_?>Bmb>4#j``84V31#JD0#=p$4tbo5r`O@Nzb%B-?)v2tQ&o~}S5wI3WYM zT|Y>*TK|Ja&Uninws)7RWx^U1XZHwLeU|?$V$~KbjH(_d6%+)=g7-T;DKs>cTo_zG z@t4q$r#Bp;6p|XVH7&j*CrWg%!ot7Yw6bvu-xDsn8LKN#klu>6e3fqNbXrdeEJH(7 z+C|!M6pz2AQldx?vxYF`kLav9X1GlL65&8biM0b6p{-l?76xjKfXU@>M z%ILkhvXw$Rx#7u(-tdv~guv-rW4EXk;k z_ebz;1yoIf>{hhx)9m6;@N21n?+={Y6m;awNH_G(ejNPmG4vFuq zGB8WvK4Lx_$7_BG_la=?FDaT_hDAEJNzc(pT-o_g3{H@YRlsJR*Tl?yBE?G*}is#ec%Ij z31>}hMEd1%?BO{dn(`ROc6N?x<>6=DA3Rpm>=ab?x^WfC+&t6+{?jI9f+UGwW z(~c_(k%s)rOvIU(|=>Ool%}3kN(xq*9G6JXB~La!fw)J zdCz%w?XRBo!nfN#IsL6cX1F9Wu$aY3o=>8z53{`w_+n?>)t(zy#4K1+{K46qhQtq! zt}PyWb)OU61v8;foW}NI4VCbRg|!jFd0@OwBUdTF7q>4|{DeGqNaaknhAa#AnNS5O zE2Ru4Fk)fyzN@;HDCX9ParKKWCZXpQ)zz{!A%AB6aymCdtNeY(lfvrx}6RnKHmr;Zd3pSVd~_Hdf~daGa8epz&Do-(visE~&01GUr%-N{ekhf3S0 ztky9)`=}Cqf{anx|GuK3O4syh9N5j93w3)E4@ZwU!SMz3)}8$B=x?t~2#MM^t7b~w z{zjS5|1+n0vg+z)@t4c?xpV^8@{nDtr?__VGloQ_y&%>`QSV0h8#`Qe#9U0KT&z`7 z-`mI2oEC5ckG(qONcHlFbuq+t_|JS}J;M$hRdr1eHc#etkB7c&=8C|MMriSnt*`** zu5BFc0k(8J-b0T7~&DWeX77a)nvb^w@k-fKgf1$_~W#rEnZnuw)<~_z;xcN{YiO6 z>4mvZt~SNc6}2XAMO5I1Fjec0N%9|L*HqBP@bA>mg*Jv!jM@E>nVE4orldNWzA)Kx zBUDf^f&IZ0r}rl_vtDM<%1=0@HIh`1>k()MVHMf7>fLeN_LFtit?a;+E#-chAzpDJ z!GA(&f5L28th!b}4}%cT7r|y0@oU1QR}-yYcEleUJE_JTVTs{0vA#13-o_|tYlZD) zv;UofL9Cyzu|!iluKOow>_6zT$N@RHd1960EStGr>pIR3OwrknCViBaUY2W= zRUCWd+4sgfOT7&fdH!%}TXMX~f=?6j@j^o>Wn#Il0w#{vzEqGMm2ou8IN!mFFU?cB zG=kO+o^?N%hhlw<(dlw2Jt=U+BdR2-Ym-yyl$EN=Sm_jwV|$+KqwI`OQ(8qvsq{ve z|M9BJCpzeh%SBmmg(5cCAA4`md)X}u$^LA)VX=>9RyxGZXa)7_pJfonev z0ljFn#xsgHK+`fozZWEZ;H|SIKVMf><;; z$AQP&zb8at=!roas6wI&B^rQNt~`PVV-ct)gDDtGN*HRV4o0qVZUl#`&tmpRSLnGn zAK;k^WFJqjD02UelTOL&`P|Mmy!I;DXDqR|uDLVG{4NA%uG&b;d^p3Z3k9l26&AVh ztSRuRmf0m}Oy%8}a&-x8^sDw=3SZ$Z6Wi)gbWP=Y;N~~zZsyQCI~H_WW+*HRLZm-O z`JN}vW1cQ>ROGFr-io+klsScN4cKqqt?`1qy7UrIszkpcy_Cz`C$zkh-I}BG?gSm~ zz)pS{x#Tt4bk3R*(^IUjXv+E&&si_qm?kzPH2U?P%OKx&i@l~bhc@LG-~7TL)fzP{ zZKqNpie@{(%*4_i9o9#b?oUAx%M(g%7P!>3VI<#IDBc<-na29SSrX@~&M`Ia6gW;7 z>wg*7g-{F}zKsIk7~%|B&{L}t{{3S86a+VmZzvTl}pqrfvp=UOD#mrZ3yHI~E}Oe0l#qYQc+W^>9i z)rMr4VQ9;g81m^R);Hxre?FdHYh8r>*VkYm=l~2lk)4sQ)RX){;(ZJt43ju z88kSSFV&>;DcOwpiEWCiMktiJ7l3K~{ zrHL}r8VTofwyjd&T@)4` zq=))7*yxP;fzo^{82Y4!b=P6uH7(HmveZ=MA95$y_DOFg1RkaAe~%V@9;-h$Z_Ui= zRItC8*|b&hHbtfH2{X)hjHH<~X-9k@A~ZJHRP@6vqhAMhz+NXB5%HIOV}-sZK~-Fg zZBkOx5ms9m(`YX7HcVAm?-AA{t8YyZEhBaSk92Dy9{{MUudNV$x%t2hq2q}D>|~8= zv->L+1r)AD{!3;nhpDVg^JU2J(xsHQX7vP)>O`2Ye@`e*5_5LU0x&$hm%upM4?P9m@8Fim_5 zxS|4LjL|Cl;v$?smb$ld{L5Y$#&^zOpZT@`Gys~uNwdB#2ux6&oE)<^KHVJE30Cg>f+YpIo$JR-{~AzgU~x%*Sb@%(MfeA7~O7{GLRDR>?pD~nO(D0jt1bBz&HGs&to^aAkk=h9?vv&pQ+Q%zEeXuN z+iAYGv$9BEA>dm#kT)w&G`B*oh5_&tMYn0+tRrOodt>YYcE?Nr7Njb`Vbq>P3rV)e zgB}pvB~d7pdED9E^Jw1+X@%CuGy7iZNfMQgEs17mX0qkoGS@9u z`bc^GcxR_gr*JH75Vg%D;S~K2oKx!s{fU_~G~Hu#B;AIf>?40XXi!Prg{!QNEa~bp9rL(jU0ziND7P>a7I# z5srTj*_?ny**YVstx4J)svaKndsmgN=aQNqjIy<|{oQ}$3IzVm^hvGXq_-6+0w(UQ zucvd*pS{$g4U~Z=_S1%;y0#f2V^q4BKo2Q{;)AnDXB$x(L1F;{+cC$<%p-CgKx0uGzbZ$b z!eM((^UszEk8cwOeoQmCAIm?Z=eJHxYP%3+pA+WHc70^_&fwXH>OR&K1Q*LYk~$Zv zzMgD@%LS9^Y2F{rjbI$5dZU_Cr{JBj+w&=EG6EI%HX1bT74D8mT@C&Mxs6;tWpl5Jyq!Y4MA<3-fxBO zSv=!r-H;Dy3ZkJdh=#H=&N`%>k(%jR?X(44(@m!ZqtLL^D>ACI);E=oH&nL{2%A93 zljDCz@19PbV9A=ishyRpLwZx(D_}OYz7tVKS@7q+iwfIDp?9pRHeAEiYX9RK7vfC1 zYOK`+^4;B9DN9}tk(0lc&Z*$G%q9RBLAAyWh9gsZAI{2377Y1wSp{`D+v<#0XK}nYij0U2vbwyK{~?|?QMSt|u1YXm z*R=rU0m)ge$y2z^RjxxU{b4v|K6OD0&y$pNhm&k@11BlFn^jNqI^bCii6eP!*OL^+ zZnQZQG%cC#txEq|&*34{z{cV=KAP11ptLPiGYfN|?qD@BW4bLzaSnw!tr(FKR($>&b?dRu7sjE9)eUKj(ma%vn6&XgzMt12QJkRu z3OFq(8j)dbtSw>`OJ9VPB;NG}d&RhLfg{CycE|ovCMWt2pJR;BI7Zj8^E2neAM(8b z-nO%SqZUunGj4P2l#?)HVwTtBXOzXKLjI@q!1aK*2oA(iepjBa8L~Kq{iN{ZWVLHv zvA)xp1Pz$pT(`?9ninB0d&cA>9smfR?qOROD9aKxSK?CWveqv+z8XREQhm!F(GkcA zhh|v9d?D49Q8HWU#mhtHRN<#2eD?JjbKr80yTvIUlf*pCu1@vdf_sl{%f4D+jTom- zA~1Y>seK$lq5L(1*)3xc#tx}qoksy9udZoxWj_Qhv zH!RdScLFX)J;dr-&$lK1xL4X$s)=w)&RjQ&^;&ll!N`usOIJKu?$c zO6e(y<_Tff?_exK9o8+bi}!E9)JWr7u9E{`>WnUuoQRcO%eJ`Lo;SH&XKG0xY#6kn z7mcI@gq?}t+$ep4o;gqmy{XWb`_3e6+(kF=n>p*xSQA9?|07v5Nm7_CTsbHzqJH_HUc7c5l)KzAS1y&E7E7S!!+_1 zTSf_OD>-MTy}+|{CT-p=`Vx@eI&I(ys=GK*e0ApvL||7WFJGX4m|%PK8KbmAYyU2< z=6}6nxm2&UHN(rdrnfu;u~V6_WLpuX&OKPLRl>Oi_@S$ITtwl^;o*<`uq&2#V=)h! zFSlR-Z=b05SN(PL2M59080y2N-W8fqfqfob3BbYz07YNCFC0KBfp>?w{NUz?(Wq6N zY#O_K6{H3cQvWmH;Pjh>{z^SidzfyH8&~-BgFn8yODMExgpQTmuCHsQt}y1eY$if@ zR>+(l^VT^D9wT{*Q(OZMr&GYR;!9xx)Why+*tfbv0qbnpR;zmYilKbJ(_}lJx4{0@@lN^37l%d3XYCr6rYowNIb*}I zwp**F!cAF|S+FsZ>yKD-yn9%kyrE67&b&3_mF28h3d&Kw-h!7`|Y-c`^Y2|I*up(H{jg!@}NlQhEiY!4f?gUk4`X z8sb{7o8M1c+Ok{|ND!@wOnXFC5svNU-?M@d6?BM%%^A#sn~A_TAP}1EplK0tgXsL? zC>62Y5jPIdrd+Y-dMTF=Yt^z8c)!Khb2IKE=oI1`j|PG6{)2F;9*1}dZ#b7vrXrCx zmF&_`{}};yztX>h-?X))t$dQ%md>f^(A=8W)0c3F**IwUm%hhz~tMZ`vc;k(m_yVkpG2HuVAE=yTYO4O4fTT}Z~LdYth8 zU{^nabxR#SX>~R;f_CS|8|PaK6_!VqNZ!sN1v5Of!)_N#*I%Y~K%JB}BG&+@*|cmW zRDL=$%>9wS4nwI~nbO8w|2XE;x{Pa?me+;MD`X3?syXcr1Y8h^Luz;w2gOolOvDC1gBhX4*JaWPfm(`r?%ICBpiGG^47=dvjnY*0(I>J0oW<*{Pbw zc5hIYcc>fg?BLuxwZ{a1t_P)!$piU_Bs`h@Bd|xV)T?DC@|~IS%>vseLgy^>MWBTDZ6}j=<1qi9^)6}X6n{%>Rl7zmOc;4 zzm9BmgfX9c9!IBaNESiBqv=9?ZM zt7qI$z2h~&hxFSE=l66V!P^DROwZ@E6DCQdyEhd9yAHAk-d;26YrQKg-+q*DjpP<( zE}`~mDW>k-t>dE z8%M;{^|sgqMh*%AmiqmACWdfe49$R%$%#0t*#i`H&)^3leQg5Xeq(n!$NO8Jty8e0 zcbhpqhN}ObY;!jbLvsB$1=JMBdPldX-hply4Pq6 zdOBCGsQ zl@23wP~?74MVniJNXRR+b(N-x*ZkU;yf~Tic|dd080Kc{oezDPksN7JL%=^S(4`z)MeMshfuN?j!rYLHE%x$UZV3 z{7Z`yI%rDiB>jzW1!Y-YP|{OB3Rh4Ov@*;il&?72^_7{)kk`+Me8GK)MNK&N2&ti> zO=S;$uA8u%kdWj=+#pgeXXo{per6JF4U0D>Q_b z>;~8xc3>xMEP>2b*)&qekdz!hiwf=Q<|o1YIq95%KO4-5o)MZ2@}+?yFR;Uok=Z$l znpBAzLN~U%TUp-glfv(`s2ui0k-mfem;Paa-9ViZ23l)DNZJYEdfN-yoM>S;Dy#Jtis2D(gGPB0{E zFrx&E7_vrw@RgLa{`3Gjzwxvr*Vz55h#{e`D^e1rb~5YitOpWb)NVnxv>@R4)!b%{MW!0q?`%;K-d(|-EJ`22uxgvP4wKoTI|F8BgaDT}rr5wa{B*Z9{ZY#ejim~0~ z7S%~0iTUJC67z6R^xS{6;>4V2fVge_<&oW6^-c+NW|><`)Tax0pE@afZcli{I0s!` z-hg-Mc>C7q!7N2!@P>cM9r=S_gl)&$1slE_=fe9SmpP|uJ4uSE&Xx4!#eRBgFg_n( zDF{a{rb4MS5*I*pQj*Wk*2fAmK05E1u|e`Jt3lV5=}QtNPPN#jVQ_n=a0&zzKF zFjM2P3IW9A5nV#sAH-HdxN$zpuLmD`uQI~>4B)>{L1;j3+*4&<1-KS*{ONRUqO2D zetq2Q2DdEQ9(VZgT5I!s8MON50tG7QFav`DP}=UnK4xwa^0*_=1itKhpyR~edLjjf zx}!vt1PC$+TO-oJ*DXmMMW`GI;05jO(5+2)JdjrdRUBYaBd`%UZ;320>K5Q^$wqAI z2BdTz{E6cP*grpfx<-ub2%uV&EiFxGb|jg4O%XMuClPfLlnn79LTm%_ z10jus!w!RKLR5_CFZVoBBq8e}=`bSTB`S9#OKkh=+0EFEf~plrzvV{1paa!|-AMcV z_rDG~aXKr^4NUpViD^WgPSvd{&9k3ZBDYBrjG=vjjBm)@hPZPxa#V;*izLx23qw#f z87D%BdO8_yKuRIt37U;MtN8msL#^3~r=}I*!V*1Vb4bM!l(%n7)=WD15RiW%WJ?6E z32p%dkiXAC7Nu5H;ecWyHT*;s@V;9>a?68sf5ZL^+%bfI4u_I}G>r@-5_wfn9mPQb zlI9|TB2EK6SPwMBzI%zv6k*%}%Co~jLzzvuv~UNB@;PkmovAP6n8cF<=NlfJe-_zp z3G-!E7{^8kuYWF3k1c9gHBV8d?n9l*23~4{ z$YF2yyO8I>sf1y+QG{>_X^>Q>CaZ=JAajh|GK_GR;Rr+{%1`1AJ?AIGg*YGtL!7GN z21^l*%f7w8MG~I{%H@8{HLtu?@xUU$gTZ^;3GhrH8g!Ni{5V*$#+eAQ~ zzH&Q<2$B91J_TpC?q3BBJ@YsLy%#ll$v_UkkqWdz4Q4W-1+1a-k?R5;FEkrtS#G^&FJ%MrlZpBF2Yq4U$^{w8J+dL5~QgXXX>) zsnFvfheZ|(=Sio7O~Z*BG>e36FjrENYXA_=e-ELU+^2Pq1B!6nF2X!nk^%NPicoVA zT^ohAL{BKK5q;bW&9?h^8DE5cUyaJ?cH~W40t1>M-_|9g{KDodV{xDcRJa;~9Y;8; zNkp_sp9B9Q$??(WtMK5XJd_B@lZd7Ra?27m0ifQoBK!mFKPD#%SP-&e@;=xgXj+j+ z+JtvSn;MaM3rS9(20R@rx);K%D}k<+^bVsYH(61{O^&`v)*Ok29s}$VWg}V*Jd=S) zzT^PQz6i;83A>{=2G!e1A1ZAByU3jtUjMP-k3h3U#0?0OEHb)Su2%5&ddGvk244Hm zQMV}`C`OwE9k-#CU!cZ^n8*G_5*kS8nJ#5?D>AOENP%LW6=!zIaycbL1!6*E+PYM? z6^8zxMVm;{5zP|R3*er|Zfy*MEON73-e5+eFEu1Kr?~Xjr4e@dcR|NrvcR+nXN^xbWjz5c~=+G-4l{Wq_J<3!JIQ)g%`vM{Dj{V zUQBH#k)eSzp8-{<}}%vNvk)?#hz&eP$vxXcnt1t%Dh5}g@Kl3j8v`)2(Ra*i7?i1!0@l9<&y zgo}boC|Af<|)8*Ftfh9F**W$@ekaU%{#LAu`yq`E3D?U_#BZm!H&D?*F_W(^6sz-fw^h{mG^|~iUU_3+wB`~^&Hjypc8)a~% z`87yQZAXm(ZwBrG3cV}`Sq0IpD;XPG z1L}J0(?Bdt3<2x~s|)Ac315_u=@x2nUO6QwU4mk0$Ri7_vR9I=4RDQQJJ6`Kkn4}7 za6b?H+@tk`Lmc$w6oe@h-%|t>t_g(3Gt`6ch%8}927d%9UqP`QJhcWA7$9jkFrA@{ z4gNPFumt%MWf^@29ZM?OQ7F|O?@77%+I|2a@6-sJopL=DaHXVKo-qf;s>%WLBp}@( z=HY<=fIz{~DX|N?pL{{};J}#@=i6tuukb4s;&`r~EjZgPxR-QX-uxbPED?YdIkW zegh`E2u&&HELxp)c*5+{`8jz)&F9_>P}GaWsPi0|x5r8Q5Rj+G3N|rC-cAUsQrOOv zt0cw*-T>$b15Itz);(vGA?>sO?960^ zGsqkp$h++zc7G!|-ejx|Jr43+Xgh)b*9T;ij}dA`qS>OfB1+XK>E_qnNtYA#4{tk? zLjR&7?x1^v;a~M$Q+v*}gvpx1TtGv-2&tOwL|YCRGXDHRo)uwrJ*N4Mho_bT|tDy zNXW-0g>Vk46x}wwoU=sj2#I8o@sjY}Iwd&ZiR`}?QtG`FoAFSPaT)G0+N2Ua*08be z_T&zX{ebJkK4HYm&@O)40w4o)Be7q^0t`QaCc)A`a!bN8kD({gst!TIZ!i*DR};0c zWTkqqF?KZ2@e|>Suk+S4*TSr!cUl(NoQbMDB(#-phio4mBMk@;8!T>5j`M>x`_* zS}jqZMQ21<<7Aa*B0fJP&3BOw#Ga|(tN^NxJKNSc>w*coM;16&A- zGEAQF46Xx)J2G*OQr#8SAVa9_e5D}MKTXmVoA)_|PSNQI5= zTJ(AbxY}tbSAp6ZH6!hi)|d4p*;e4)I|NN0=rTz$kx0^X<6D#bm1t+77C~fG@Q|Pwn0k>F z)Q&-02p4TIO4f>i-Be6O;*CZW*m6PXYkV*ul9U!1WF&fnD@9Zxh*dNiZwF3<9U9Xm zqYN@!X+uTEIQVc*B}%Z^eZW{r)Mxh1UJo-JUvZkEuSYLyg8(c;1x_Od2yBGVj|tZ@o7iI#z9IXfT=D)FXI}+D7a2ri>~pV0 zT`$K>!@dc&k@yg_P#}#1Dy}@_Y6Z?q&RMJrT9NCXd@u3mNTpd0?Jqs}n%=QM;WvD{ zZAVzk7i03I^^vOKJwF-b?lqxpc0o)M2ff(3OuUD!8;L#vens2f#5GNb<`Mk^yg5LAhMWatvq<5nW>* z`H!|38(7IpBcMmlq*hBPT`}At8P_j28kH^3(zIb0){gs<9s;&s6qow@(KBe;ChRbF zW0;5?jt1 z-MDblh_^@jX#?m`U{`HOV{dvDuXmDg1}UCF5((qtJCZNR)gC$<1^iG?OYbCP&_T%| z!ZJao*-?6cepCSr z90eCE+l3yXbw6kXi4eMy*P|%kJ%El`MsbU9mdh9`Yl$Uf!z7&z^#BAYCcjDgp@Jv` zEULZ*`))|M_|!eqUK>M1#_&?(3AddhnHJa|L?0$R`Sii_&KN*6362QiJ-3K(N#QLN zValcJUaj97ntp1G`~*g2c0An3)rIY1kUqV}w!F))$OvP(b5$3YuW4DR2L-wyUkvs9 zK!rl)cZBO4?=t8I+ynfi>=g65Y07*;>rUZyA!Ljm7;z3r zexnxA;>kzfTvfYx`4^K=y&Vau=OPG1W(~llVtj(j0F>5cSb_dU4%h8 zV%4ZSOmN*z*nf+1=IGAQq^tgcles{J4LsIw{HAaQ=vbET7=!pA0?HpOjeTbL%izjq z?ji_0ly8uQeDUOi4=qmq@yb=uoc}pl*@^;{@t{pA!lOSbs(~mGzwt^Fq*sUu15NzV zH$%RDAMk|Fr>_Vq6JftAfvYhVsKyO`DuqT+8p!?XeUOVhf%mOfYM50(p>pMGVDok;yqhDj61?noCytHX3VQWd#&$TyewZ@uu?oUwbSP|o-FnP_ zR=sfGucsjBV6+2uJNn}9SPm2obt8llm~k72yB(A>0uqg(LLSzzT=XrJ$so)`gmDTR zSi^!@9&|6iPe96C@#nd{P3;jkpr#ewlN0AY~n zo-`P1@wUU0SL_;ppksDGF;E|fOJPFi2Qqeo69lZmaLkvCr{JH#fdK7+?EhtAvY8IT z>Vucvh&7v=UU}F%ppQ!8NTQkx03i|~Vry&+AG*^O-m*yRblN#=Z&`V&_pfQx`Xu*} zDC<`%3sZoq9f{V#iIEFQUxcd(WcEl@$V3hpmFL50lAj^^>4d`XMyBj(W-Q@OvOcCb zi57$O_AuEPun>@mjFU6Rv7~=9hqr?uk_60|OWp=cENz(RUhQp_I#iJK8I(?)fuHo2 zQp$fnJi9;X=5tZDZ_F^)90!AbaMS^8-u*d=S^Q}q-M2u_tpqkr&|IJf91~#vM(`K5xy75y*!ulZ3x$sKmhV8QW`_uEdo0#f|?l)m`|w8iF88~ z6-v2-DLsev06P{(mVidVABUipu@8!slZfC^CSawO+YDF*htppl|7<9nDFu+AoE!rU zSRv952&U`gS6hao_ypRiw0+SnrmVQB5%+^_TUNPlRS36JJ~Y^wHj$-XGxEdv3B^q< zLp~gn7*nVC)7rAi;ml&t)2NtGM3O4BK!r|6dXNZmxqI%opEb9 zTo>_U0C5B@dqH3dpsb|F-WX`yp)ytRwvZm1lFsqZx`7qyNy^dU6}78FmvA5Lxi#zF zE-)^uTVuVNxXb!>6x$$Jo}8S8e3@_jhN%d4Twv2efF;8G00XB~`f!H`_!CpXC--<1 z0emf)o|q!&gz+k~d;^>Bmt-5X3K5Ik31(^1t+wT=pO@EIzUw$9di{f?6y1Lu)5=QU{-Qq1sorz~in}g-bMA2{Qhn3m z%IzgrWmIdqEzV8jlt$9&bb3?W}Ob#oGKG_P07FZSz6FMtcYH`)7P3~YI8EC7v!RP zanfhu3#PhGL`i36X)K#w9EMd^&@o{S_IcfrFnTdi_a9IO=jLBu|4BU4t#4U7!?tPP z>|>2f13j*>MczTYuB9|`TM+Tgqy%t(Z9XZ z)k6Jkz=;4@;Huk6S<>!>6_2PK1^h1(sLI6P^-(k@Ft@((oioR5Jm4s||8QfZ*>b4J z@^;ZU!*DF%A>PLH3=WWMl|Q-cI1hf zDC(`?or=VLYg97hDN9C*j1J@~%z`ZM3! z-er5XtvVMSJh{r^PFLfznFl8u8*JS<|UJ0pT6 z>dOWEkBVeFV}lcRR5dH`v)-Ca1)i7J7x3He7g~Q4Woy9D2IO$vgHGaX8qD@o4DzD0 z7VngviVFNLwgb?lN}=~p3syO}=OHXWo63wFq*szj8_aVV$RRD^OF@5dX%cf!Wy4 z)M0+=K1{OQ-jf1vw5(IB09^T2nrsA^FddO{$eP~{r6J$afNH4ygpjhK1Jm# zi?_|eV}O+j)LCMrWOD3Jgoor~RZNLQvnHppMd5ys~V**tZ|f)wi}$Z0~!p zB-x|b9=HBM} z8k&BqxX-6X}cEm`V6a3R6k43p_v`olhkD@b~g0Z=RNPXGR zDV|d+XnUOHSE~g};9<86I460Cp$LZ3)}Xw#S{ z1~-OUeM~2h%#j+qUp9MIay@hNOWLX2AHurbyS3g0i@yN+eSY^+QTgV>t%n$(8BC=> zUe%)jHr*Vv&&O&!#9WhGK2DWZvI6&LJ26<|-a0?yO3 zs4LTCflwgiUR88mvP0{s(Zg0YhV%AiE-gCk|K-I(Tfjd-2y}7Qr@>50`aUfqC4K_DcoZvTlZt8s_ zHO3V()`w&2^Q1;26_ViBI(dSw>D(fxv@??gcd}nzFFY{9ZmbZJqBN@u74%47Z_jO_ z`;2;J)a9!lPVk#;UdsKH#wMXYIs7Jh1f@?SXcNjUPz?2Is*|m4O2_Y)I26Vr#%Glp zSpM0DOCdAwES1!3pU*E#mG~inh!TC%QLLX*)IiXivOfLRd4uH-_3+UG8=@=`H(oE= z*PNs8yo|>uo;l7oro^i^*QL8FEU%jbF9APZ%pQ(=m%>nTeQn!Q0_J^lCADqYzTzM0 z7GTF%I)%&6ePEZUVWH_AUWdpU+wf|_NY3xVO|vDRWJxl|=@+k+N^%OTXEBSXRHw&$ zH*V4^+kuRN%lBr+o%tq3x*}&c2L14=W zkzR;ZkI7Cy>-c+hsH`S!?K0UX# zIjrK?3nB~9((M2`kZ{dUao?r=2xC81mfJv!f3q+yhF z5~^y9P@NZTDmjrpPW&bCTD{uh^K}UW!)c$FDq9zGWHIaG+AY=Z zDw&@+HJR$Fj#Qc!HPpJR`QmZ$C7nq%NeyM&MmLuqL2g~a-YD_T+^_aFZZwsCR}sbB z5Za33kMF$lsqxJ}+*-B9!X2Wqr=krqx|-Rp3{1@;Za4((vvrLSsP%ui*FlSSXvX=_ z6!_B*g?rv6?1bu_j^=gN%YbRplfgZYsbC~s4+?Nhgx)iTKFwitdP`$?jb)F*wWh%w zOS!Y%O@hF3FyX386I@%I;^~3Hi*4}#<6WQY195K_-18wE4b{as-p6@|qfaaR{#sXN ze#6}45ht#k+2@#Mvm_n{%A}?XzE&W-Q3S3p1kR5sbr1?Vr$Vr2=2o!+r`SM+Tl(ztBF2^{@HbEL*|ptzmdMKZP>Mf;5#Z->GC8`nwLZB>Qn?=R<@kGKEs!`p-qtge3~ zF0fXbyW7DX!^yA(6pqek6Ows1#tU+z(l6(>$ik+Mx6D=8_y726#6MoN6e|2OWGE37 z9CSAbeJ=;PD~nR;I@#6=%M<3V^sc3?GQv!BGtJL6PBjhv?P!JLZ^EiLp6xwZOD^*n zJ8+PHZ{xv;^&=I2`2R!w*K^E`S+v3Ur^8o5R-FR~B$S?!oo;b(|5u!SGwsUs0srA+ zAx{8_uDyoq+nX%@a&y_kz@}1aEs1B0?vCWOlh7)kC z!`HT7o#afOB4)In4I7-<#lVDOooufW+J0)_tiA$xB(JH8+8%8#J>f?Vnw<(OkIT<` zt6~w8Dz&nFym(b7m(_*PjnMK&T?*YB`e)uDX7ESd?%&`1t0a6;h@yKnP~-6ZynnKY znF8?PhD6z|s6W9QdyOjm(xDl!Wquu9#FHC@Z~!*>`vvNWA5+_f|{)+wpZl{^m{ z8FOyMctN+-{K^PxgT@p}?bQ2ns76_qSj&=CX4*DM%fm9*X$`4z?|vYgkLEK&;go9b z`jp;_+;3G_%`-UxV{KDG;g(Gj1o#I%z$vYe-DIupSh@lzfT@>(`bYpHy+OsJ z3X8Kf)kV@P-qE9(v24d_eZ}fD1w^^tbdFY{YAvvDQrRP=F&)dAWf)kVnh^296wcWS z%FVL<7;Q4h@^=Y=h|4ciO{2YG@ht_kIk#9JtEvyT&Q@9;l`A;j!vdI)@KiPkuOxi^ zjdecTU&C#lTW9n6&T{z%!J#_l2Mr9Ay2^yj zseE6jSf;sED%;5Sbwt|daICc4H);p_wCINRN&= zX*{6Y7}f$!LrT|SMPR(Vz6#h>iCKaJWx1w5S(4>#Fu#^$nCL!HL@kvo_O^}Zbtiu( z-djW+NNB#U>slm|Y1$>!4!yT3(l*D;Y*1F^Y0kviIPMc#_X9Oc8$#!7H+y`nKd(sM zZ0vql?-&W3wfxUR)3b|2Gj!b!0epWLWfr=Ru^DLK&e1um#jUyNGyl8wx()h(+yQht zV9FLYNXk}#s?PmHH;UEf#_cTEx8^^u+LAi8dwx4*_ogHef*}7L*GtxFyVHpJ)8r}>`w0f$r>ky~XZXY4A&f_|| zCE_o$CCVtn<)pR;eQ~^|$+GSFIvL05{UTl*zedyjl(w@<(-8*xE3)6Sx*k?o4g+mi zvS=>oECluny;gn8M&8W~{k{a*H_+zwpWnZJb4jXt;9T->`kMZ)<1vMAC*SC$OwJKI zoU**SuO@E|Ys;0Hg~grnR4&x6uICzKbhgXl*Zp`5IE+bfqV0POR_Zc-l0f9<7$ZMv zl-9-vlVl6T%0~)E$rWh~-rb{Z8^L!?>xd3a40As4*qI&PjM{vfRVU{GmhipVGaM)( z6=0IgVY1)W=K+Daz^A5L>t#MrP>xU6M;Z4B+M7T<%{pISG2Xe)_)l?i7S-q!H?KB| zzr>n?2#X8e>pgHPf7~D-jle!INV7<*?f+y|ZDYB8=4OpO*X8{AT0>(!&*;CgvT42p zbU5mLK67A(Q*u2a{W%hxXdW>+|5|TqjiWn^vn0#dA1>a+lGH^t9u)+hrUVFqKDI8-24^y~Eabr>H?PZkC4fAC}; z-4-_iSYR=f1@)ETKv6?`!|h*in7<_bPrpxBa*&qwMwa*<>3tV$o38((Q zb;V8(Hl%zweQy8R!4iS1kYA?Yn{Gm=&)pZFPDg>f2jGt)O}7_;mCrI8&W1@Zfu5A- zbIi)kMwvqEIbUQ*i7(mA_tfXFt-o&ku4EO@T;l#YhwB5EzL?|uON67b1Nb#BI|HL> zXKp+gm#+OuiaKgfn34ZeZblu)l1cgBiMS!5Et{|EmRL6Vn$7+x`fdYVIW*&B3+?F{`d1-lq%*;uta^$_F@ORvKDVqWhP3+TOD+2!4B zo9^DVg&yxT*`=4lD0a5={Ca6+g=RWGX@0at4dk zIp_MfNjJSzfaR_fFbmhrOx>hpesneqIwVxL(+${8s=Pj&_QV0BGIQJ)hlw~F=tHaC z*4e8SR^6>tt1awzV_ub>HBw=a^EH^i$cdKPS76Y_QS>IQRp_awdB%|0Ss3Wy*~{x|>~oaN z8+F!uA(L^ILRAXKe5siOH30(rL^lqMR~A4=X}@WW{F{kv%UgN`Z*|5(0jQC7=EA6- zJT=$#ot9bivG~i=(lbR=jrWJo)~lw)i#$uEKP50vhp^P{u{;;V@v&Uj?7#uvMKjZT zgRW$%KzHXKi9o@ugCIib(FmGd-S_GPCG0tWXOFc?$dg2qNcNg z%7H;7XGUt0`(Q+k9dpT8-&Lh=M4uZ-+cK%DDCVrtwoU12mY$7P^Zu(7Ry1f~U=j0a z|HfTw_MaZ(K4kVzJiZu!K ze|Ak+#<}QqtCBZ|wWP$C<{-qq>`&R!_E614j`M1!#V5Un8K|o+zHOt%Ig-B}(09&R zP5{(>R`Y=l?R=!#oM-=u<5(+KU`)nlTKVPvS6~3>fPP18!#a?LI%M|0D|2$>CqijM zhxH|8_vruX80v2gR!2)vKi#icqi6)pj}%emK-8+=8iNr~`DyjMmCvZIRPnckAD&fn z<;=&K*FTtoCogHYx@;olb|seR`P|k8GGL~a8#mM*%@@^2iNvP5x}OR(T@yrW^wu1@ zOv}^=tiqIb*CS@fUggPTRl@<%G5|ILpRwJ0_-(TAETB4iNqM%4zB;d@>0C<4-hz$w zCLRbUngrHN%JI||krRXbHb;^?G`+pkxej=LLtG%8m5djfdp=ZC*75ykX%7uoTGwR{ zY4jbO2Pu5fJ(j5W;+^`y4;A(~q+nA9LR=vn=Vzfmz>$Zks%T0MJCwhc?pVM!4AlqL zi$1U5en!hqO?Rn;f%j|n{oItyZ;iQ`+yn!*x2w3J=>wv@{i>`o?+t<~9SXbLuL(6f z9~5+DG>FaOz(gT5R-KTjP8V1=aovwOZ7&NP^L)wU^*XTIqg6HLXY|-Yf9<=_Y5Oc?YHr)hl8W!>2OIWsKwt|_$hc3=4eqgESYIU7$PHe%(cwh!^>Xv|kjLr+cH9jJx`kUp_ z28rgGY|;6naC9sFV+*QpH$#J8zwB{~64d_~;NY^E%X|=Pod3;M%|ow@+VYpe=e~Y_ z;qrZBF_RXb&J{h)IhzsI4alZ%Ag^XqG{+u6POP-D&nmo75-vJ3H6asl#rR z{xl}wOLDfiG1?BypK+`j6=j;CP&DO)i4i5YLy_J<$$F5%axK$@~ zj16hn;1tU`SUAKLS8L=?kLDx)Ls&G)Md(NK-awuQ9?u5|~J9uzmRze49Ao8w} zZB`df?him72cI;f-+f=C?Mz@SX4ltj_o$FpwkuubX$Ays=C8lmxILcvnQh!9;jE>V zKpWq}n5YZ0?}fHmtrsqpS^kU;G!LAoQ;_6&eVzGT18?0cA4LO+6r zTVuI57DJVc+H}{D-{XUs5Gt!b4+-5l5@&p1(gV54AsCOlo3q7{FqzR@7x>uBxcMP- zHkrL&v6iJi2fd5`&Su?VU;9;!4cZ?E+N#Nr2)NX==*f_S!`R(5N^OU#dAa!Y7=-tA zMSJa}{;W#M-4rzb9^=@KTEJLyf4Z8{lc%s#hMq=S_2j5?s|Muo#nDK(b z4xloZ4zOPY;PzshS+OpZb^@O1WZGZ;_Rar?xo?k)x=h=LNU6+b&5ArRTUut8W0|6H z7+Y)2Tw}{E%L7xo#v00~sLaS5L{lPCJY=LOQ*Uc*>nfTNATXmMf+CD%iVDmC0}P5V zCt%L={ayDnh_qQ-@AiH^zdzP}UsrVI`+c7KaNXB^T|xem%U+_}QdtU`HzZGgN#8-u zg#PN!h`r*w&VxOK_faL+>6Uyty>8_P(Tcn|V0OlArGi0duThrjqbzH1$8J0HiJc6R ztFxXx1@-g!m%^K>k_0R5m(%W^9W-_Qu$;)!8nv&&HWzK}AFyZW8~ux8L?S8u{-ko; zYh389PkcyeJMa(8DbotIF-=*vjC_+P)Hp8m&?kvB@xWX%J6&x1N3ck{h`(kefMSBs zX3Kx+pdeiK)BK42zT#WITbv25@|WZw0?jo)!*n?4vb(+(yJx|l#%`IKezf^Ml_x%@ zim}YOFvhMLEDqQHr-A4ne+IDc;y2t1U}d~h{Vxtz!#uTG?{KWu=udmWp>v?2eSK0` zSdTE;Ke;<@{qk^#xt3pYh;L!6*9F(k@r;rpJ5UWJnHOT``6+DY;o4 z>b93V^1YM5UzqZ2MAIW8*W&!EEj?_`>R*&IezfB3KT|(%C@n^4&*;3xx{M6Tml?zxcBxEvrn`f&n-Uu1d`-0X^3qev z>|YYXZ;6TufA-v^cmD9<>FfhTj{}zqS(uq8Y0GQAAhP8;_4U=t<_t%!$oa_45Bhld zpF+8OWqoOk=MjhgdPiZ$-`m!{teR4y%=jFNf@(-WberM3uKchL;#7@VZfSt7ezWeq z>ZFpe7R~uza&5SgH#Hl*Y*8!jV8JFPk^&K;Qk*(&%uh34k6sgA`L~+4HBUZ8-8n#* ze&OBINst4ud{U;K%nkE&h@ly5J|e6v)Kt5z_X@hco8PYx`#1xH_VYzGNopRK>21Mu zxjH5hUWJW<DnXpDck|vamN{f^I={02>q^*BuhrU zX2UXd^DyPMuxY+E z`MSk8M(*=P?xbRY!1Ah~60A&Ti1JjruikE>qvk(g!)Bd7$y4YPZTT85TxXE$RQ~T# zBKBR$F^|=B4k)_VRun3BJs>W9jSKrFi4B@NHwPklB$LuYB75Mob1(liG4$xQ7c9dd zdkt&p(;ITatDu|xf~8N8`07Z@KJjD2LqnHio;Od^9=_alc1S>X_c)7NW2?dMuUsWL zwt?>&AJeuvCLKF585)mGX!x#UTyU@(B5#_z1>6TYXT70zo>Ds@UjD4HN>}#f>};uX z&f3M;gq$1Ns`WLc=i92nCnIiSgDp{#o!Bo`wrLwyqD#BDzZ@motVBB)(HmgDX(es5GjyxetO^4(z7v4SJ^ z!vzJUGu53H=B+7FABHp`<~6NQa*5B6sdaswy#mi6#&^FcwN0GA4H{e*vu$}w?;MDM ztym=~-(P$x&RRZQxM{S|a%aTV)ad%}_AQt+DGcfe${C@S^64uyg{GrBRPL*)rZ1zV zPs0XB8HCdzDDlQi&C?`~k)fx#RZ3TxvOLFoMsPUs>b(I1_sX1OyW$QuC~_Q+=bX8C zaT4#NT~KTB48j`v0HgMBUS8Sfq24!Q8g|C)di5^e*GZO#akrHxz8_k$wlXqNUpiBB zy~5HoI&xdJ^eklH4YuTRgq>odCjoXSkE7Zez&&(RE>>jN=JN{o&61$&B zNrC9%Q|Zn3$6S4F>+R6C(3KeYcHXMokuQCgo4RB-ETN=Rwu}(obBs=fHoxoYss-s2 zo@;zfUtlDD~o_Pw%KkiPvmb+Vi$OKW68oS$|sASSo7VH+i+|nY>8D z>-3ucksAa@r>4}1Me?DeI_y75E&0L4I)}ARRtKT}*3|s8iZfsZT~~r~eaVG8j*+G; zt^C84u-d$kSFj1r8#Y^?OSbmj7~@lAj?E(P{ad!&O7PFJ+LrDt|zm8ZzRFI!pH6}DV6I@0#8}|{asDL9wAxCZ%#>&|PmkE1C(FzW zs=qUx3szh!Q+@=IPFuXGD+=0*Z)Z|?Xzs4bI?m@B_{(ii@*1Dixt2NiYa9tIgcwG@ zyxhJZ(Rx7d=)yG?t@sM7M)3J89X|GT?(ab0eS;OEHRjELzbxIKS;+xU*8kz4kg3h-`PN&WgEHIksy#T~|CO z(|zL;^{Yp--_EUGz&m#Nh}|+xZ;z~9ZkSc%4KE%L5Uu!bu=7hkHwcEvQoCItTC--$ zVML(lwT%X6n$SI)u3W1-D-3U8{mEF&Glc~*wUAqMMA7lzk`i^jq3FB)MMLdJKuzq99WQ#q;fVG< z6WY3^Mr>$PU#pOw&QJ{T7nx7cc^5%s>#k$0OK220UxlKVQ{^0^FeNI|>@q!%HA51` zQd^hE+f0!`)|jn$v}_CC$nNxhwI@vqoxbBq#roje%&SFRi=mdPc_OFl*_^TjH{F#O z2>ouN42s=a$p1OiCf_|Q3Wmwk@*0H|244e7J33w;Wx4UK@F6}!-14~u-1-c~{Cq@_JKv|l04Qfl!DLmManx`^uIg)gf;B2E7$z$RB z0{LAlROJK1bSHzL@Mr0#@(u}*U79W*9;RP8fY*6mX!+dpE#DKxOFB6catY4mp?}Mr zZ+%nMJp<>jQg*UTcAK_+x}x3Zwel1yl`oaQI4LP5n%ko_7RbLf${z@Z{?_pf#UpgB7K=Ki$Y#mx zEo>SlZX|UJmF=qP>`e9d73M0F59+5B4IIfP@3?m$p4KH5JwIk+u7k%`Ap@#m-lzRN&YP5l&|MmL>?-1dl@IcicuMJ=Hlb~{M`!XpU~;PY zNVkrepjDmbdjo}C2H}}Sq3(r6)o%Ikwe1;-1|w9)JgcKRb_h35h&|XaQ&XRCHFiRj z=W`(xM0uSlSh=0AEEyQl^cl~fP;z<75sTR@D>4m4LXszR#15Mg0C@u6Bq29JSUR)#Y=w24$rlrmr}BQOs^_8Z zcYdUG3`fOU~np65BnGn1%Ut%(F$Ajil*uKWlamH~kOAZv*?-J@Ehf}F$w z`}Nz*_UJ<1I!%&G9so?P!yx?b#+txS-6E%|>*^B^43k|nDhEN=khMTmr%PFdX6(|f z3m#gAP4K^`8>k>y6xFap8ZWd4!BpPctg?N^cRp2X{WKVS!i(ARd)qF2z)r7gSbP(t zbZa!H1)l5CwhmL58d~oBh+;$B|%f>Ezv^* z%(<-S`--9`Ox6xlDKyL=f4*6E3^75gWv{DH2rL~Ut1s7X$DzK0bCZ#hrKS9WD6t5OtA zWy`52wAluVY&&ZeH4`M6dCKW%(qlHeZ_(KWd}t_-m0ugZeHT0iG!H}6!MH{nnV4XL zKxCH5?L#HVYfz}%LsTzSz|;76hPf6ssqcM0w~gPpS%3XW$=>RuATX*F$Bip}o+)b7 zhabDW4ag_R1HvkFHjpw`&VXVsSEy+?nYx3W13PZAZ>*v|Nt0HsC=RQirNiG8FLLh` zodQyEJXYxonIL-NmT706*Q52^r?H1?N|V&4V2HN*?u`)ACp^>85GZg0W5I3JwtET^ z*w;s68^UfBVAMQ$Mc8s|fSqsP*8TN^~>&jTX&5#BT*S0Y5~3?tdUua%1M-quf@v8p@|P* zCmBWzO}Qd_7v!v|5YT(u^-a^F8b1dl6nfXOeRT`e<^_mA?;j@s*%m*pu2 zo7{o?4@cdSf7Z?d{M@+s^v7Zw}cX3YgHcQ)FRq_NE_GSXi~V~Yfoc)L{AJ0WT|U2%Oh zTJgMmSp1)-yqTJt`1DuzlU4J3mHme3w zT}OoNI}X8H;Y@Zy%h`#ZIM1mZ-_x4zv6_=CkL=Fz9u4r++vQ?LiL(R1(8bIm{A{fn=i@_qXTzdAZTE-s3YwqewIl=2Li;Ji4DqsxRVoayvco&wJ4|gV)L<2QZ?%CAPrjazB2b~Y zO&+o3b75zY2l(e0zI%r147^qb(R+L{JyV4NXeQa5QS&OCgw6N;lz1 zUSg`qRX`f0?U5kr|2U9OeaCtk9reHFjJ;6-YEH3XP#AJF&`afg2$@ZJN?SE%vPsiv z*z$tP{U&rKu>wXHN#Lv1nx7Wft!RFra`t1^L*Vl3m!2Z^r``~Lcfoab#hoU^R3ZdYJ{(+lIqMZR$)j+AY zfBpAg&MxYlsd5kP>INMzMqatr#6}!MU_Nll7lmDc%oL9LcL+9fMXgUgYYhs&Lwjj_o;~xgB|P z4>)SIj{DxNMxb_Y{C=)fK~3}N+1dGe7N$4o?RA(0s-wi#cf`kgT8E>yd!pPn%_ppN zKp(*_RB!8$48D<4-HON)cNH%qK zGpLae>oV>Ep=+A(n8@c;Sr4l^zXU*&8zE$K2{w#=Khp0S5R3(YZvOKGT?|7)z z1f7pqNqjFsKj4N!oHCFR^ts!h<*6y57QjXh3}o=Y?SFW^p)5V2;Y2^WH5I6Mpx5c% zY$zvX_4GBi$}naV#8gFbb6Q@we-P=xEftIXGDF<)p3b1B2(mATXwub9>$y7IxJ=s? zfWqrx`8X|p@^N^|K2)s=sDC7M?LcjNuNX#XYbk%_K$B;!fQ_26R5$gi;`Lo)#HIf? z_h&DQuey8Q0Ko6;@x>@q))rbpW)pbGCSghwZ{L=dbQP>1DCQ{J# zN~p40t}!{ch?|RbrXUb3da*FH=t5Y;Suz6ZvkFa(Iove?@C$iD0E8SJ9|)X(czMs{ z)6?a(qpcnpXP1r6@$Js3q(d8JnjqYqD9cDRypk?`0xDW)>K!y5iEYCG{ zOh;%;v7qNp0d5|~Zttz{6fsxs2Y`z+HISk6T5Y1dCeQpwvFjZg&qUr{v-Dk!RK#f z5|*hm21UK^_Li!=cd`rLR-_v?K2yuukB%O7l|0GDVbZR$J%RP4JCC7LkEWFJAB84$ zCQ~A`V^Ts)K`w3XShy2X0rv2E;7-uc4=5}wL5Ye;2pdfT>a{pxKc7tmnzt;O`syA7H1;Sn15^3s)+sk07FQJ;nj`Q<%{aGESY^`njaZxDwhCObs=eVA z*|HYnN>m}C7#P*NC1-Z1%FE>6Wj`f)f_NU;ipJ?uD)fALFtU?fEbxOQWB%&iG=saI;-vvQN8LW#6SbRdsj~@K{Jt zZXOug3Zy2ypaat-Z`j(wp{?{Vp~lS~=Dzy~xd!RP2SdfoJcBaL#ATV9Oum+!-qCx^ zR-tQy9-MuQozwEx&WmZEC`$9cf~u_I^h8WJ8~XuDaLgY!f3mi>*(n+?bS@H}RbeV% z4a9Yc@C-_CN>+4aE4K@MC3<6>4xE+iX~mOym<5>SdnT{`AWVjNsFO`jeGVSCt{R`X zX|ukwvOo_ZVVkh+{ND~(QdO>9!-XquTle_|OzJ+6ct z9?4akd3{hZrc})xyP*2;{Bh2luh6xi_-nI~^GDHR?;IIIa!32_L*61%@3U{*Q@d6> zEEbftI2q>xBnfp2oAEsF+~ATlU7=^bg=M+#ac z&1JfiX^kZ1s`9mC<#O5Z$PFPH^~##_S>;rgo(OxJklX?k&;k2~dBug@y* zeOTu8>MbYBgj?Qz@C}GlmQRn|ogpunQ~xa{=NrU^hY`%n$(vio+O-jQXN zu8(&acE7b1~9{sYlzu5CdM2CrkaxIvfO<&ITh^;9adm$b= zcRPCrBl%r_%d$3&wxR;AipZm>u;&s)^uy|80|<9GJrN<9*asK4vwvK{ZYOb%RxCfH z1;=PP4G`h2+t1gX!|g_t2e_x-{eFEi_&bx-SKHJdP-X#CEy~b92d?EweM>vN7hxT1 z&pw=|sv4{^OP=1*-aaD!sPmop{HyQi%$xNe88Oe7LwfDAPnFo)Ujv_xa*9xgK=HEKf;H~&h^{ZATZo7|H}3lI@KYN?J-^er+I7hX z4lL`ihbVp79=Jh1A@ISq=JQb78{gz5!j3bm90yXnNOZka`W*2a9H5O3jcOSjwJR>C zFe}(~+~kh$ws$)|<+)$%u)i0($1qc2=1JoDj!itMGcO|>-UI|mUk3Bd^YRV@X=7G3 zWNHV-8Y#$nxz(=DB|7kA*3TiI6j86-Ez_6n61D5nQufqEHZ7lG?kr54GJjx1d&iB= z>qhx_+qw{C>%@Q}$Fzfitae0OU59xMbF}^RZxB$c6m6QWTe<1|w{}1J#gZmmDG2e8 zi^lHx)+O z#PR!m4P)9+|K4yav%vZBJV`u=ePR3l0t4HxFMjgHwzz}ETQ+QvqleA?%S&VMTUSoX zqGB=fzVXLiT*$se=dL(t0#5~Lj0QCsMHV+n7 zNs|RTE6Bx{KYYy1F!)3}uR?zneF&rs(&Hb<*H;E=6NJ zJF%J#rhNFfb$&*nrRXtaIp6k- zwYxH!$@hc*2jT;{BO8XGBjtMl)Q{G*FPf4&fmsCYyRY|Pg1X){hgprmBf=9$_Be~< z;s&hCAoCx_VsD0Se>RLJ18U*1NdNJ~)yfcMwC*Bam|iwPa29g-5O#wr46E0afB#t| zctM{Et!dNXrB&fbKX*>iiHH5N6Djrfqe$3GTLVS29xCHm4jGYr#wQd6b3Q?kiNp4e_j)Hy(Yad zNB$Z)PM{oxsGkk4DJc-8_Y72aPFJ2Jk5J8Dfd=xt=XlMpP=cZ_*lV0lW|}{N3+kI6 zs^-Tg=xkft(vdg|S{#p7@3}SrpqxYzKcH@9{)B`mWMSDOe3PY?r# z4VJufAphc77GG$;m$%BB+K=yjAieQMne7)@)_hBDXV#g}(k~iM!qp~NL{*VL0P}I8 zFQ>PBpB=Rm7EVG`X4kT*65l9Q*`<7DcplOMpA2$0VgF(v9({aqY(_~l;7(oh@`>iJ z3ljqg|148xhbYTgkjV+G@ADmVTMMpTR=1RK@98}i*|u7B!y?^BP`_5@@*O&pzpJ7U z$wr3wJPW#1?X?|)ROVTlQwd4v92$Uy?;~ebu7U~R`i4=N^9qdenB{msVw^97aW7_B z(qSOF(3IL|I%>>Ha=>~>p#a3YIi!WXRQ9btAv2ox_GMJ~aPE#`N%`fvX&r@<8*>PI zN9fA@;SyI}p(l}AgeI5&0G6-=Sm3)oKwvo+mw!!UOy?&Ci#oj;Z9ypH;2v~ zvyofu$Ei!vO-kXcEF@#*n5}ux$!;MZQSzgqJI5Kak~$iQX@(JxYh=t;X0t(VV^Wy> zr9@tT8>}396p2Y!7KBWhn;zjkm`1|j3i}K9ygo8o)An%6{%zd2>fw5J`fjvG7PLo) zPXlTAENl;CL?z;`FNJ3ll5#WhciEM;jAqRlJ=d-WBYf(C`(HTsMe34wBO14*H+Jrt zaN~Pj;d6t1*G}8bQ*CNHl|CiednGv@|O#L(k=7NUX`(Rb#$;V&g!l|!rX;W;Za2BuXJeYIpbN;H`Y?+IT z>myHl)-q7-9jxEZmHLlyPZLx5f_7qA7!4_+fz)1{ZKb=S(2fx_jnu%i>-3bTqZD1^W@7K0(5IQzUDQo#3b}l>SsQrbF1;wTFiu#G2 zAB#^X)F9xKXADAl-sJyHmyT#Khmnjs#oR1rC-8Axx`lM5dnNYsNO_(V_Ko94;2ao7 zh`I)e&*qMvgr#m41655qzH2#-@^sIOSv3T^5om{1DZEAT3bM?NfyYfm9{M@vONjpt zee7XL(@6el{Oi#xOr_JRdeiFggZ;DC*vF+*Bxmc3E0h@oZXXl|9!nE;4h`_nhwkMS zZ=&tUrrs25OQ^^#bNoF~K5kz7P1*6|fkX^Gja^yZDs+eJi5u6z8?yEYT#5ll4ZHZMpN{s>g0xzqg=wr`>;ZUCLNKaBkz9Ev7pIN9mPrxh=$7dKG zQsq3w#FG2mPFzTC4*!PPKtE}9wQ_~7{{B$^!{VONbjL2rN9<9j`ymQ1HT zT_SEGZ&iD4wY3Jloh$y05IZe7f2@_Hb^iQ+Nlvko=`er7BE*KCHQ*`1r7*Kt5bP zd(3h^A{*!?U6QFexi`lIU%}{>oMBP&lFI#-i#IC3Y`uKj213jkB}Y@yCmm8#k{~10 z{40Eg)`6p0CF8wW%%6yGeSE%%j5DSSTc7dY$_td(J!ehAZKG?~Yzb+3GNx-AQ}pLj zn0~o$gSS!Odze|qCAg_5VSR?bDtJuaIZJ<%48_nWkexYj{{#)936)Y%^fsAP3jVLd zDCvQV46$5&stoAi{oM9>2P04I*PN`3xc2e+n>X@E?Hbb~s9FLLKDxXiMDY-gwFXSiog(M`{*yh@OP!(0k*X?BNb2pJ+{^s7IwXH@hZDv1 zj)!_eUc+6{gmDvPug>Sz#QZ zw~Q#brl}@-Vhg-S0CDBXD)N{y6xp!YJ(0&0cM7 z$({6OI5~B9i5`gIk51R@@E@_Oy)3hY_S9KEl^=Wyc9V}S7Oc44!E5cXQe_sY+)`op zBH<~c(r2_yHB4TZXif2J(Mu`odG%e1(S_-JAH0m{Z}DH( z^MndEcy-qo(*<`Z`}?EI^@%ZtiIRAR79}Qi=7Bf(iOR^J)OG~#k0RU(6^(FG}1GX zQpo0J_uQBq%g`L~FhCA_3wVQW6L$|8l%KwQB-Xi==lq=-uX9LEt!bagwa@t(-}R8M zSV7cbExB&j;A=3G4Oo1hV>ZFB3c6m!XUj;GhOl#I>;^ACO4c}9@h~?V>p4XOFnCo~ z&QUba0Tb2ws3+PItTYD~;r0o&Pd&W+?dUu1{qX80Mf z=D$c*P%}Bj7k1pk`s>l@g*(v9`PTtv>nTl@nxPou|FV8h zNf=+UDEp1`9MAh_csd|5I1s7U`VJ_Z8OmMh&QH=C$i!H>KX2_=CZ2I|D6bQ+LGuEOLvxxOtxX5z4>sjq;72j2oTPH8-cuV{V8xC6>=X3qxqTg1?SZr)lFq%{r&mS4O&ifI zMMgqu;=c7`w8GN>bU=*1Z8_CbHIW1=AH61`0|qeaPPRl1ae{w__yHMyYoBH$Lvc03 z>c2}MPRM4FXD!u578LhaiY{js4uQeSzYF%$e zoC_~LA8xs<@?M6R80eAxSg0>kddh6HxQb#VO*8oikL2Lm<8+ROVORE$M#DSS8<o*>4?^?&F3;ZZ#4S8!c!Y^D%#PuLXcMOQuxE~JQe}k2x z%%Hh>rPO}~{>Rvdm4Z^l|4w_}t}u{^dJE5qyca|k!VHLV`<GljpfsDWr&w`*t|3_gpW07Y za2ApAc5&D1K(0W^pocUpK0g`Ytgm>}+iw-jU%6`4NkU)K77m{db_wxD2~TmJ{u5m{ zi=hl7J^nO%V|`%-{vq_Way?B(vs5!P#%(oGF#X++XLIsyNGaTPx{eG z^vtAtN67)=FxIj36c)p#dg^xaC8zTsDG8HJW91^ah_(~Vw0$J}>FH_s*Ods+|H%U6%s98V_*#^y%Okcv%IHHy*%DIblTozR75Kw*!!Y$jbgU~iO#{^R1F-Sw9O3^+Xf~DxnR_cDZ98#SoCPZP6r4XiNc`_l z4_?Q>_Q68S;J`zbjF<^i<|9l=Dp4mL{S9oD=^WdWGg48=={&HXo@qix}yO{x8{9l_9YfU7K$;#8v)+3m`Fyza}oT-rvyIAGTzI5=~ADzUa@kvQZ zMKy)R!=j~UW(P=Gr}58FU9ke-$*XZt_7EGzz80~Atlz&_a@$)82|Jh;5guf_^tn)b zewn2kuc_$wH_xJ}*luN+`uAwElvOZ~rr{(O_nm`qvSU7=63}>@rx;Jx*);7MQZ{Ja z!8XUy9QU&<XiE&((-PS+Au_mU!ap(b3bWvPzx+ok^XgzHGDiAAWVMW!*) zQYhaf2+L>+(qVr*$NIR8I(uq21Rz;mO0m#C6(K}HZt}z5wang^P!^+;O-G|FgN55P z__qvn{w~+l7+A}MoqurA${;Mq8#K*~un(W7eyw|n9Csepn*x#d{VVwecGEoV&&E|Ur zmU~$$Bmnu$_`3s8`4-)U9$SB0xsi3|B+rGS8S(HGDf*Esu0(N5Fo+s$+Q_;whrcl& z=}A}w*wTcjZ&Va)WRM!KlW_&$y$yemUI5S+4(Sv1qp?~r8&UnUi`HTP5c(fsNk_sy z&pqWO!(1EXr577FI_cfACV@Tf!|C!Tc%8tE8o$1XW`X?eS@b8!Ln_1ft#qKl77mD;xrm_qL&ShYCj*fnL!uHYG0csOYGPru2+QBE{8ii&Jv~! z_V9>^zvuQp#sI7rP+~`%3y~cUF{4;_MdXz(e6D(RMBdvEN@%8qFlSGqqsyA$;Gvt6 zwaz`-2AY26>*)j`cFv$Rjw}_b8cM(vc|hef_lr1p^IPTFjM@Z9GwsOg9L0`kadBja zoz4e)|122qS6|zYtrPF%JyG)AMCP`j~_AjCrLgeUgj6LmmDyDkXi#8_UxUp&~o84qFB{Fop7|5SrI zv5~T%6nhu@L}z9ee?QZ56UAUaBl{*(84!jylw+F;?xd)Uu_V`noWxS<^vG+kFdbnG z&qSV5uxY2#u{g{Az7w+Y6xkgNp$GzeCLcoKuPVX4#pJk$jfCAhUv&@^5YzZb%=(ZP zh5|5$a44T$*Ioj3p&uLH>(C}-_zO1-Db&fVRk#q-egQtXN{>N`+%Ye-m2#wn{!dW+ zz2EYiS^WMq2gO4Zbs>^hylDD@C=qIdBAXuKHBj`F#-ns9vJBHaMXyT1`T*7-38d&W zmmtvC8EDA@S_o8?ymd|>TsR!}n9TQh5U?i^<}EY_36I%q4Y*#O7P0S9d!f(O=A`;L zW-F_w>1GtL+R{cW1lnhzP9mKLU4bUW%_r+&OLuHU=hbyB(C;J+(nv6x^HRF^jR19f zje0Z985@(?c$7h4!kXSbDqv|u=Y=olWxulbl}XXIpB|m}X&!uh|KKqA7xxalPKp35 z)$uwp#r+Cj!V@4UC&G^fRY-vhS9h!*5c+Y&Rx}<7PhUkH^>X11k5#}u?Jt=KlwTcV zTrK%D5evZ$%=-InwJ}MZLJ(|2GR}nWLlj{#TLCd@n(B1zb&^l9o5hYKH9PR*#BMsE zXYjx2i8K!e*aKxeRAnD&UH{THCaZe^kQXN29mGDNP8VuVs>qX3W>vpuWRBww3!hsN z+Dh%sWT|zO;4A^X(8kkxS!qZ+kDvh&{{$_g?|pZrF=1{Te$#V1Pw6OuyBI-tY?zbG z(i|WgtbqW@4ej0W^*PoTcbDBO;G(3iPnL{k=ZZ4HFT)q`&xbl`SPL;L0hu{iIiBDBHvg1uQ6>Qv0K|LB4$K`# z3(itSU?!Hv&4e_-+Is@E?UWTHouBh72Y@Tsz3{J(+LNeQWl7R#q^^t6J3?EjK>+wA z9qgOPEK?~HVWTUw^V48d0+tcA62s;QA-jAGy`DAp3eI$ zv@ltj^NV}we<~;Z+MH0U0C>MM-Sv<3#$=WXS(#A)2*o>;ogBuhn@t;sSjM-?yB-m5`PsTzbcLf%D$bK8b*$E^fST zbMUhXEq7CGSuw!oFMRU#G<2}x#>=ruM-?7x9j77?h|pl4&NSy>ZyFos5mb)|&h`O% zdSMHE6yR0mAYlC<;G?SWI*3AsbrGR$f|MrVF)%rtDct<{v`3+(9$%G10 zd1ezp2*5cc3ccbkH#-1K86d%p9Ptj5Q4My{_hj=O?#|wRDEKYI+c{O9O&N_MYa_3K zNKc?B_nBdQ^FUs&|m!)m)g*2pQ>F`v#d7fqZ9O^FjvYmPc^v?n$C*IKE{yr42hH}zp!LjRq0H(0mhXLq5#aourAXqs_qfNX*Vz)@ zQBNHKj8{;A3R~#JN&4jaw__~#i?(^aa%@`Nm>USrGL)C%LBu&K3zpLY;(8fKJQIKs zi7JBR6mRH%IUD3X&i4eexl<5HptMk2g1t}g@A_*%|G}5v{2Lg-hPHtns`_5gFW>>* z+80{FitcMS4;R1|pkdRF0T0L?jjv)zvNN2vccyX^q!AI;?%}^nVP=n6N3Q-YY1l+RH{TAW23U z6RjvQ&tS}iM`p$`bw->6JDET?5zQh&Q!CJC7GKV^5DBJFCdNt`&X0i*s%SHX8X0F- zLQJv-C0V{6{S6f{|MnikfAxE$!kT7MtXV+#-0vr4&ij_d_?DTt%FYC+kJkx+;IP)g*7OwM5=ZmTnt zl|jKSU_}!^d6Pm^@WfkJ-HvKHJm8FrN(ss$ zC0=i+&zu7JOy4z@Q`s=*H1_^(b=yN)?iX3^XJO?g+Hj%>Z-S3&tUMXSYWnSx=ltz1 zQc!cU8jHxV6l{jno4;D74j`FmG&RL_HI|%j?UAI?Qb9s4Sv7qQ2Wor$D=#%D;nRUp z{Vf0i#w6(Nr+hWPwqy!vqNmR2yP>Tk@kSjqv&yo37;An<-2DzKsL^t&PTWXrKj`Qg z0ggJG(n$f^CFynvPlxhEH6gajTzdoGU-^H0N7JqmfZ#a9bB`xnv%`KaCF-MecYeBo z74Gr=v%(V7A!9t-wT`V>B)-QyL2_p<(D;p3v?!U91^i@)7op`m}94IO`?&q-#z z2!+D>1W=kK1i^WUgG-$Sdz0P`w=U33Y3K1*yti zOj|3A%`EufhqiWJJ?}wD6$;ly z22imZizDR=obi7cR!;yXl=Y`dp(;rJ{dD<##I-ro1{D=E#TU=aAHO%PcKqGg*H=Fd ze4y4u3~Ive@P=uU28A4CLj=b6TX`=bqSiD|-8|0$d}=7*4h(Fi1d+vT zrJ}omz1>fHhpv+USOQUo{e5akQ=eWV95Y%e67u4#nz*7P~3Ro|A(=%g>Yzb*X!aJ-dU-6|JWo(HcBhLlx7k1Bp>8(oH`qgw}od2@2ln@$-ra11nby-U2HAaO|seQ`Q;I-9>@;QvCKqxfy z=A?JqQFtYVvEOsgXr$7E7Ig&Wj7vcMCRjo@Xa)bd{h+|R_2d9+pLf6-Mj#qbYuj0| zNmPx{R#LE(6y_y`$J|H5eF=Rq9h_=*QQOl}(6?G8Kx}*8QuW_R3qP2;{*zndr}|9A zfl!uChaJSvP*|!Fnp;HHhxpcq7)r+L1T)|Rbn-HufXyv_NL{fpeA~Y}(%FG=JA+#V z<*bXMzNjbo{}yo)OZa7kt?bxUYw8Vg9kH2nBpc^ooNJ9m{#pupqK0Y%m1F-A5%}3Fsb%ks z&i@{vGRrMLFBI1j|@lY)zEn*43k1jNN@Mw?Cu)MgewdUfr5uIL*ou zbm~|V8Xb9chUcu-b9X?x^z0%DDCrAGG6W3n?3ed8;qEX&jdCY{?9C(8L;*Pk`g^(XLTX?{K4=RRQv(?xnTM40NC=UK5r(po*gu{Y`_}CsWh^q$vGiQ_BMAG zn^6%~zwP#8*h2twL0C##dKUJ>#3KTONO{&vXN}s%mGl*W|7?$NeVg9Do@Q?6~UtiS+&1wkeh|7att)>IX4VyiC!tRN=v8X_wMXpIV7e&h9c5?Oor-`3^Y zZniveF;@Nn@&_TRM90Du1#f;Y(LYX3Vyt=Uf3YydAQ>9}YA#YGV?Kb2uU0z=FI!kn z=W+!^sX9idPLF6DM7!7#-mB3p(zCLht#nam!wyp+utsvHERdt;eTuCe(U{0WOSXU{ ziJDFGiNJcAUl7ayC-&nZvOs!21C9}adqvH%JX9?=hYZ~`^Oq6w!CM6?52l^xpRv$D z4I*q%M%NNAiO~$mVXh$=gLzn@tsMh3cxr||(Ha?Mj`(Ogbv~>6nJ93F_cC5|K^?vf z`d!am0b=h(apho@<9XjBH0?W68K%Gc7|qMgG?Mqsq&+qT9@F4;H05pHfhaG1h5ab4 zo@kdL5fV1-0ipNzY;-AlemS;i;0*K#mTv>0c}+k>%XtibiE5LqK=l5Y5gJ~=WZv=s zLb~iuZUN=)8+$E3;CS5N>9Ky=CGH`|-g-iQ#EZgh3au7AU%h(Og6Y+(zn`0W#}`W$ zKG+|>?0%G*dv?CpGi2!-X?Eh;(J@;o0yk6L_1L~Y$3f4adi6)I{v-FhxtOU>-a>i_ zsi1-?%p0V{X=i3hw5q^>1p}Xx5U~i?m?tft1|{C4rx-NJ7KxMb1f0aQBd81xVTpl< zK#-*xY$cj0hM&=3yi*>1V5i!I$bNCCVKQ<#-twigsE=t~xj%f5W`ZIw8@QnQ+tn}} z!7qc3Pu>Ey(KE9kEzB6gG+PT`3qzPZu+ISojDJ505y<4h}ZN zvnW#_z>cy!EwDXZbYWic{Q<1?-Z&tH$)4kvSMOq9j{U;M#efV090YvBn1{E1)98mw|x7Z;{<`T)8C-c&3H>vw#2S1S=RXaN~ZPm(nHt zwjjaiKwf+aj1GoOzQz^Gew#bX{e@9zNcQBn{?r$!8DSg51XIoL(p3C5SH~OOMf4+> zoJFEAm-T~)i}kk&6|fqn;+_h}-}Etei3AgfMu>qQrh2s51H3ixQe9m$)u0ZGfmB$) zs(E{9LfME!Gl{mqTn&QDksqRJ_KgIeE6e&(_Hy|Atbh=c`P1OY_xJ7{5wjI*FvPxq zbA^`BWg=mWviuJ?v=N3O+PAQd2aasuI$CdFk|GQ*c#bp#&&Pv{jzG65OBc&r){ZCw=*tobfg8XTWSqsymB5)AGXyhtUFn$cH z7#Z2Jd|D`s>WmIlc7xG`x9d9EF>?3Jqjt$UTDcvAHC7v7%0y?ECJPoCjTO~a6c6XE z0;i6uxrGcCsfW%DQLCG2vt5FcTh3awkS1E}V5@83of8LRR@~;%Lg#5V*}=SASVH|i zT?=z9Xx3z)LR)gIH^;+sg7J3e+&D&NVB<5d{kI7x8m%GJK3Uy0PqUMl9Dhwr%%Clu zzmhQ1lHF|NJU328&4jj{^%1p(&@yIIfms`t^MKP!zT1&?vJCr&hjVjFHK#!4BRav1 z=gl)zoio0lTcBri2^4|*8H;b;ym$wCA5?&|x(x1A;F_!tKJzCq6V)Hk)&^>+T|lG)&doItf6$4RBs>ae zHY0HF_a*%(k|m4`4{(9TC=Cw^bV;UbS zYD|m##O=MR*Pl6%s9IDo^v!`IE4s?vny&pva>?P|4;}t|U4PJJO2CD*Y=q1PED)`P z`=OLeZ%9^y-T`IGHcboh{|j1n!`o+YmicxzG+>+mOhhIg^nCX8>3&vFqh`yo&}x?l z1g@vp8(=s#TqV96Ex)&9JmbtWB5(jB|5BOD1|)OGYTK!waS(|<>e}Z~AEZPyLG~C^ zt`;JfNOxsOhv?8x>@`k;YPRc3)l2VKDJVyrBSs+%JGy4G>X{Jh(x4H;hntDwN{gv7 zLJ{V8?LQh%nK1otXga-vR)qd|<*oOBCnz5+Z;!UVH_z0tyrBPEPl3{(;~VeOxQ_CD zPb!RptdphQksbDFqD~(KTQilLO>pgMaw{#-0g~nn$sHo|+u};>Dy-1f?^l|`@{OKH zI_y5NcW9AqSaF2iy_@fvr7#IpXFHA$vb!4uu7H>MdJ-1;1w9zOD_Tz^fNkzrYPHdd z0f7fa&Pvx>2B2s%vEK^(t3bqGW_So;5^GlkQ{Vlf-d?M4x3l~NR+r#ZFnn>$9@0-^backDpn+J2ow{icVSaq@ z2hOs&u8Q&ffON5eNb(KxZxm)8d#LG00Nw1^DL zqB=4!h0hQCJxKZu?u5$JWudJ#N|)W)AaYI!&~XPu8xj?tXBd;8u1r$2B(BgpX9jr8 zA@Tr-FR{X#iec~1VzxM06jCu?6dj`s*0cOYGYl3^2&Akca#g^lx9cg$1hP5^QzO@B z{6^9T0@+d;!X|YgrPSM*7wEedkhCp(eG!g-q778e5!kP==o>0%1#;~K3`+8Olv3wzyDcnVBMhk&R?f13) zsFAYw%IV}-4vUjJLj6F%|cJyuDOePuZg=};0er<)xdD;<{KEP?o5%^p6}?7 zH4)ojk?;7bTv#Hy;*`hpd`+g?%)Qq`+a4su3c!`Uv4M`Cz8dstsRnEAvx$oJYZ?$U zGJETE;cm7SR{$=o_jgJpT3Bg0Qfv%)qsCPRU~VEL@foMJpOc8}Kq&rSd*2?{)S0av z#7n1EGt~;MRMOTujHSppR-+>%ojQ)w+RE4QYniH$I)2&~TOv|TVlEq{EmALOwTcQ7 zJC5a82UANiAR!y%rXWVEB`TN%0z^c}4M;*R`>gdw(do=Nr{|pizTfzV`uhQ!oxR`p zU28q-d7h=nNd_~gcC6n4g*hAV9|a8x8Z4m#@A3e%IzXK#nIB@Iq|zm{(echxV9)nm z3t=N7zaJ@Cx;ss}N3xJ2{L&=79*J-nz=|eHFK+9h$*tgYDDR(S8rn-YT;9@|V)9IpoL|y*F4K2e;2Wbw~VYWj!`O%}7_Lt|7We&4a5UyP{$Om%>P~xJR_`WhdEDNk= zH-KUY#P-Zolj90i#SS`E&c=Ah&&k8k~v0 zjNd|mCb*gUjuR-Q%_}LVC``;~6cdczU8#lEp{*)Vb@yE?`A^Om=`Mp9$jT&^M%x_G zsA)!xH6ofy#lTT73j7OO3Dr9WmiGhaJ`v*p!3POuC4YKetk8iu!C_=-T1HUi9UYjj zH^CzPtQNlyMRBVMrj0qFxeL-tnB*o)BI-sM{op-_JvVZg!z`_GRTJP&FkdM6{%#Kc zqRWM9T`+?sUHpVgMe-X(88D3>;De7V*cP7f5xYSLA~f^vPd#b82Ulz?TU9dK){_?YcEQw?xDhObX%E35 z)yDQ|#;v%F9`N8e5%gh^fM+gvGAmvn0Wi}?{1rfZ_Yrn8tk}JA`$i5LOyQx!21T>~T@N0q-t&OQ@>X1g_dLt@6kPx=MILpB`>r zz!iwLakq3u0wt;a^{bN-F6}MMh!;2$>9c8x-%kDg{u7_)$uGpof;qY=Zr??fH(cR{ zlxQP<4f0$4a2413hT!&UmcPTsB$|en(+$DPpZc1IcVB7Hl$l~HuVs5LiDVxXw*LJ_ zgvH=FZt|}WI)A2dbhh@dZ4o8-yXR<4E9u`Tw_GaK0*YQ=>G#_LaY^NY=S4=%j?sq0 zH`?2|f|`8Y@KTt6URgNqj7yiobLCU}ySDtl@~MA1xrAp4H}1?+Ty2)MCIzFZ0Zr1; zwa(k|fepOQX{kHjO7;GnsPHY(*479+bM<9;B5@$k(kEcn zN{+CDXF2Bc8kon#mR~06NWXg%-?`G?x3~YuiI4xdZFPaTGFkS5%`-NhT@^#!Ov;P9BxgM`VqACQ*F;+3C#ztcNOE5mt01hQG=qzyegy{S4 zLPrH}9y2Tap1-s-h6muLpC)t))KzBuKq-k59Xm-|@0zG_vBnP6anLvmj4;AzPd-^tTI|u1?+YmOfH;ZKdOw$#p{OUa5O7K5#Do+5xw$ zHZS-qeVH#Is-)27zvJ#0aKuHwTORHGgcJBc(Zy!Au3wh!I&X8;Cv-n;nEZ6%z)q?6 zFE*%J5-OTt-rE|_c9CM*yvRAMS<{~P*6}@3*z$#q{Di)A77v>IUvNBi{I!{vU3-*e zxg38Ne~dZNbZ4JLtr1TNh91hx0@q&c_e&E#N}?F<8e=l6#Zqfv>_W#5%I?O!In_2? zuZB_J_Q)U$_9c&{u#Y#Xz|||0HZi?L_BzvG6O_%#vQs(s139Z2ni7U1QvIvBEPDy< z;=;zUb!vkCTp;Ivaz*BQM>h`U8dH7oo)(oq<$(mhTyC~f{GvJ%NB zjKOmZJ{GV0W~z!S9n3Q;;Bk#oxr+JvIsH7nn+p0i272yJ%7j{g`SNZsAS^&qAa6Zc zIkZ9tC{P|`&&6-X_Lo~>2DX|xget3bfEE!k6;xrUZo%qll~(*Yadigk2Q5-5)I(|N zdcbeMJa?2C!!yzmu;vywh-6EE1Pu1&*g}L8yCX(i!j;Ye`=iusY6k*8U6=$&2qFaN zl*$(X+TX;3=Lxs&MdIOvb|g;F{Ou-*=A+eE2w*Upm8w05X-4MFO$J9BNnP<%4i_^1h4$&&LqPA_Np%>)gb*~95@B+=ut^uEtD zoftq+WN_Ky`QGN8!b+H6?Hd-_$r zMJ%JMIrU?=%!LHzJ5p+&AXq~k*>9HeZSwM=&Ux(bD(KF`JnzbbE1YN4_IG)G0V9xF z0`!)LRfBQ-aaB!;-4XM;6IFrV3aiyYhsCqU7M#K|n)gN5#L)UEfJkNv+{^vfE$LmY zd`qIlKA{NgA=0#9Q$5E9g9@uEbdFf(izJ(0o(@J)9RFF8&9lIUMdztd9HeFbIi7PO z7;bgzjp!ysl3UT$ed&+1>bJVons< z-!eHfWsH@MH>h|ayFv&P_ag*Rb)gg*7~BBcz`&%$ohO3pZ^HTpbj?4p^vzdh;)GUU zv|yos0keoquvkbd{=+kKpcRPqP$w$qaoRU7VPYKd_*ho@nVw&eMlZt0Hlzsz2a=dC zfRF$ErfG;dh!IzX%j)%jzBV9nKGXn)gtvBjbTxuSlY!rBneRkk5mDqLEsdZsK$p@f zsqIrpFo3Q#l*piE1gjlmu2*(30O%<4^Tao<6xP4+8rdpnn7}oPr$dELm{bH7v<_iX zOac-5v|}37;n161bRhEjstcY^l@jU6#L}8T;b!Xsp4p{iXTe$K!h|$IgA2*rq;n^S ziTCBneyL)<6xL<>&c*v44YnT_h1n$>=YH+&rHL?7XCHsdlY-$ydQ?>`UAxG$jV)at zoN@QKXQ{ovR+{JV?ah)S0`Jj#mn_bcKdW+t3F`yoj`r$?VotT6f}}?6s}*&7y=7|u z?BMXrqNl2@@^dR4&&7KIy2+tj|CeXUS43nr*X^`<3*6nc7suAjcg*Dvm1r^J@*1`h>JDpzo{9&F~hc9!_*4Fi~Ya7V9gLQxg75?{v%aV-St@4 zZlT8sn1p+!5>&uDsIz!+UQgYbBvW%cz24g#{{%(9h>6J3h@0~|kD3Mt z%S;Fodv^+l-(q2aw9i%9m%&)~0%1U+u&E0#1k&ec7Buh%wnR&sO0`ai$x#=d1S65Z ze`QMQ#!cCEcknqwL0fw-&vE$Lste-Pv-+nwhE$%o|3R2#{jFuXFpy!Yoj;eA_Os~0 zBkLtsv(=W@*}C1yR8P(vTkL(1;o*>8XB&wm=perpryMwd?Z4}8b6#?kyJdP|D=`fK zBaa?3tBy%c+5)M4^IjWRR~CRRc;Q;x+lyyb-?YxkpiQ1`C5#Dma)RN4p-TGUsuSAI zMP5H!9$VPDL~&hdXD3yT|EW=f<=Gl9rcNf7gX6`pinldBcv3LR+{N*|t@BJbe&Un2 zB?snHf!fwEag$-oOy2F3RPZm)SNKI9hk#kiiuPUV^`1$Rz4PGCGsvzlvA@Ofbo-Z- zx&6@`*FJu|JaR~!{-t?WYS$^*Kt*NrjBO9>Ev+#&;gHEw62p2bH{Sn^rh7{2ooel` zBHb`qE6E?hir!fBo-Q!Qlx-b4N)JWKdXs{)ST?3bUA>UL+~k50O)gqhbW;+M#R>j} zZHBC{-yKU0%-7u>r#PIEmY|v38YQzY)JCY8PnF#x_%u7j5lXTn$;uCM_{a5qn3CKz z{XcWt)JxcV$n-31MLR$ja!6qhT`}^PYi1nHT<}V@!FaKK|3V9@5rv`N7=tp4|T}e zq}nj%^-nS&AG4@E#eC`+jd4@qht3y;&adMyHdHo5*&hhr#dQT->8?o7K@Z0NYTf$T z)_mSzN_lW#cIsz4q)W7c7{z$-OK*R{6g+4H|HhvP%&|XomoG)2WHvQe+d5w(tqT8m zTfE<{>4wGlRH&fqp=(7qpimne*g<=7i0%ChYIr;(O}&V=ff&Ra1&A=% zV4%4bq-0j8QyZ30b;@s7C=0G#H%IZjDVYAaTNkc!{MWCmUEo}l5bSZY{)?!_nwT7J z@*Jg`ecG-Q8v$};57cvlrLPu;*=qy?VOXF32U{(uVV8(Ik^wVr$w|wAo`+F>7@0$lya+;{?@xM4E+id=0i=iK`kv`M8S4C* zZX=hJz^r%AN|f*8M#MPegk8gCY4uPj+JaxHYr}6UY>SP)5?#J8P|u|xxFrW~aWHXy zFu1+=o~s2~E|7wxldN_}IOce|Y8(l6eYyFL!KseZ2MX1k8wbx@aX}&+d$oYh604nO9 z6P5Eh?r32Z^^wN+k+A!wv2&le)gXVOu(i*f+8NHX`3|hG;QEEgbFxcuFR@lms#eQt zmSoU+&sVB0WpL;?OWWO)uubm=2+CVtCaP_vIy7ME9ktBWr}F-EtTuwc{sd#ytR*)-^bVUSQ?bjyuCGTejomvTZ=F=8H=iZL`+Xdf%XQ68Zb1OesPLHFjd#NGzpf%xEdO}nBuqa&L(k1Ojd9Dl*rsFBx8H`E-o zBwVVD9P&cQ~yKWHGOd zhDH&O@rmnGB0blkJ=-n+CvWuV4^=se%j}H}^2jhTt=Y24*tJWr)w=nAlXoUD?Qil1 zd#O*s87bMCV!Sm)oQ~j%&0TJ4I#D@L<1Jus8UGJQ@3TN1)q)}Cbh1xL*AvU$&}&_R zb$w%qpOFeiK!*rd;cKQcy}~+*v}3MLMZ8WhbXlms*!^S{CYQ2{8gNn&NVm}v=QNgV zXq;}0Qf4p3mbCA>-KZK&Y0EIp!fy};uZq_2dRb+@wi!HIt77ZPh+q&ajakBhhqcYU zx<_J8bGQH8J<_>@cblweTWG&EPS9WoUJ}A=m5Z)7IPQsULJ{?HQ|C=y-@*ec!p%{9 zZ^{DspLgtCnEz^A`G?*y?28W+3dWc> z3ED5q?5*@KCKa?@?=)HB_}x(pRuoGdIXA(2zNy!FDGHp}Ro=sNd#}R7l~|JH8N6Vk zk(HHwbS^Byz4TBFaxESj!yi~|OUgBShDK$XhFvkvF($_j)*m|77zYO>*Bjc-9-O@L zi-%_2nf7gk1`ZC!$13yQm@KPn(1l023hVy1reLZxCl)jB{WhkKKh{iIY9|be_cqV7 zeptF;!*}Zvm&rN|{?WKHSf+uQw)*3ie7x&>%je;zu^F^gLk04>uCtA5J}o9(o8mK{&mwhY^1Bq=c6M)JudcW6xfCmn#qE;|ScjwzF%ZH%=2m-r10LK%8##u2Wx- zR@R-c`>4Ukc)pd?ZtyBHGR@-RJn6Ft{4{X)jcJ; zIprOZ1Xw*?r`Y9%y-6I4Lu0I&zYIh0RH4058Op>U9UH!vV-KOd74icUG$^C#wD9^Z zB>r!C9s6y>sSNpG<(24QihY~LHyG~D<(3{%;Tq=M#a~Hoy#_|4@9!NqwLZ5(-}m}U zZ5x_1Ug{`M?0-*`+4J`Kt9#x`mUttjKYtpUB{s|1V~f_M9J4(qPsVD$C#@YA5vGKm ziMw0EHLFB>^3{penH6w~u+nB;o%G$S^~W!-*m$Kh=JEPm@P@p!??}r_f7n?(`?Xzl zf3pIB0Hf0romUrL+%u0&d#6T}?6$sBFx6dXI@c+v0UK3=hH|9M!U5G@|bs?euA^K4q& zN43|c)?6#=u8oit9{ICUdssTFW+|Nz)SohQ3!l}#jCSeFY0GuXu*X(fivd`l>K8!toKc|dw; zy~~vWBVe@XHyYZ{L?RRFO!En&e<4^p6Z`jj05D0YG_FJWGa$HQoQiM<}eQ~ zT&OB8jw9Ya$TRxUv+un-q|XngH=!vX<6UT!L>Q1jA;fQI+&7Vzi|Qbv*c<;XJX4g@ zI*|#3AK?D8qh^E<*NKl}a%ACMOd^QU#sf-k5Vflz!}Oe&Fzp(O{l9h^vc z0aZP`r^EQR*Gi=@-)ZFX@IPv9=*CNc%}WS)sGYDb1}@~_AIa78^YigZhnunab+@n> zR#}e2tYYAvv;4sMii6Ke{!NnKUM_jPvSy(+ocK@O+c9LZ4pO5C07q?yUbdjKSJ&T* zq$KhTu`^D7q6q-ik$gC`j8t36GUA}?0cn@u?2o%YIm}q`mY@MH7Xyo9`QLrCubIqH zlP0|H#}lU+Hoh6WPof0<(S1gQ?(-X*ZbbSm*s1pbuT75EXI7%32iX~<`+^2T~L(JHIpnY9|4w?fK(G-hiJ$4B|rMubZGYo`f(=$1Quq-CMi5= z^78>Q{S16JS!m)?LMAGW2sQwpXQIhxxW7Q}|BoK~uP^=UX*_ra{Of7_fB7_+hp1xZ XlJ_q!ewGD)US7QHcPC$Zd)vPQS@mTX literal 0 HcmV?d00001 diff --git a/docs/paper/arxiv/figures/roles.typ b/docs/paper/arxiv/figures/roles.typ index 82982814..c6ecacd4 100644 --- a/docs/paper/arxiv/figures/roles.typ +++ b/docs/paper/arxiv/figures/roles.typ @@ -1,31 +1,29 @@ -#import "@preview/cetz:0.4.2": canvas, draw +#import "lib.typ": * -#set page(width: auto, height: auto, margin: 10pt) -#set text(size: 7pt, font: "New Computer Modern") - -#let col-human = rgb("#f28e2b") -#let col-agent = rgb("#4e79a7") -#let col-code = rgb("#59a14f") -#let col-skill = rgb("#9c755f") +#set page(..fig-page) +#set text(..fig-text) #canvas(length: 0.55cm, { import draw: * - // Helper: role node with shadow - let node(pos, label, sub, col, name-id, w: 2.2, h: 0.8) = { + // Helper: box with shadow + let node(pos, label, sub, name-id, accented: false, w: 2.2, h: 0.8) = { let (x, y) = pos - rect((x - w + 0.12, y - h + 0.12), (x + w + 0.12, y + h + 0.12), - radius: 7pt, fill: luma(230), stroke: none) + let s = if accented { stroke-accent } else { stroke-box } + let f = if accented { fill-accent } else { fill-light } + rect((x - w + 0.1, y - h + 0.1), (x + w + 0.1, y + h + 0.1), + radius: 7pt, fill: shadow-col, stroke: none) rect((x - w, y - h), (x + w, y + h), - radius: 7pt, fill: col.lighten(90%), stroke: (thickness: 1.3pt, paint: col), - name: name-id) - content((x, y + 0.22), text(10pt, weight: "bold", fill: col.darken(22%), label)) - content((x, y - 0.32), text(6.5pt, fill: col.darken(8%), sub)) + radius: 7pt, fill: f, stroke: s, name: name-id) + let c = if accented { accent.darken(20%) } else { fg } + content((x, y + 0.22), text(10pt, weight: "bold", fill: c, label)) + content((x, y - 0.32), text(6.5pt, fill: fg-light, sub)) } - // Helper: edge label with white backing + // Helper: edge label let elabel(pos, body) = { - content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt, body)) + content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt, + text(6.5pt, fill: fg-light, body))) } let cx = 8 @@ -33,51 +31,45 @@ // ── Codebase (center, larger) ── rect((cx - 2.7 + 0.12, cy - 1.4 + 0.12), (cx + 2.7 + 0.12, cy + 1.4 + 0.12), - radius: 8pt, fill: luma(225), stroke: none) + radius: 8pt, fill: shadow-col, stroke: none) rect((cx - 2.7, cy - 1.4), (cx + 2.7, cy + 1.4), - radius: 8pt, fill: col-code.lighten(92%), stroke: (thickness: 1.5pt, paint: col-code), + radius: 8pt, fill: fill-light, stroke: (thickness: 1.5pt, paint: border), name: "code") - content((cx, cy + 0.45), text(11pt, weight: "bold", fill: col-code.darken(25%), [Codebase])) - content((cx, cy - 0.3), text(7pt, style: "italic", fill: col-code.darken(8%), [agent-maintained])) + content((cx, cy + 0.45), text(11pt, weight: "bold", fill: fg, [Codebase])) + content((cx, cy - 0.3), text(7pt, style: "italic", fill: fg-light, [agent-maintained])) // ── Three roles ── - node((3.0, 11.0), [Contributor], [domain expert], col-human, "contrib") - node((3.0, 0.8), [Maintainer], [no code], col-human, "maint") - node((13.5, 2.0), [Agent], [implement · test · review], col-agent, "agent", w: 2.5) + node((3.0, 11.0), [Contributor], [domain expert], "contrib") + node((3.0, 0.8), [Maintainer], [no code], "maint") + node((13.5, 2.0), [Agent], [implement · test · review], "agent", accented: true, w: 2.5) // ── Contributor → Codebase: issue ── line((5.2, 11.0 - 0.8), (cx - 0.5, cy + 1.4), - stroke: (thickness: 1.1pt, paint: col-human), - mark: (end: "straight", scale: 0.42)) - elabel((6.8, 8.8), text(6.5pt, fill: col-human.darken(15%), [issue (creative elements)])) + stroke: stroke-edge, mark: arrow-end) + elabel((6.8, 8.8), [issue (creative elements)]) // ── Codebase → Contributor: visual check ── line((cx - 2.0, cy + 1.4), (2.2, 11.0 - 0.8), - stroke: (thickness: 0.9pt, paint: col-code, dash: "densely-dashed"), - mark: (end: "straight", scale: 0.38)) - elabel((2.5, 8.6), text(6pt, fill: col-code.darken(15%), [generated paper\ (visual check)])) + stroke: stroke-dotted, mark: arrow-end) + elabel((2.2, 8.6), [generated paper\ (visual check)]) // ── Maintainer → Codebase: approve, merge ── line((4.5, 0.8 + 0.8), (cx - 2.0, cy - 1.4), - stroke: (thickness: 0.9pt, paint: col-human), - mark: (end: "straight", scale: 0.38)) - elabel((3.8, 3.0), text(6pt, fill: col-human.darken(15%), [approve, merge])) + stroke: stroke-edge, mark: arrow-end) + elabel((3.8, 3.0), [approve, merge]) // ── Agent ↔ Codebase: execute skills ── line((13.5 - 2.3, 2.0 + 0.8), (cx + 2.0, cy - 1.4), - stroke: (thickness: 1.1pt, paint: col-agent), - mark: (start: "straight", end: "straight", scale: 0.42)) - elabel((12.0, 4.2), text(6pt, fill: col-agent.darken(15%), [execute skills])) + stroke: (thickness: 1.1pt, paint: accent), mark: arrow-both) + elabel((12.0, 4.2), text(fill: accent.darken(15%), [execute skills])) // ── Maintainer → Agent: author skills ── line((3.0 + 2.2, 0.8 + 0.2), (13.5 - 2.5, 2.0 - 0.3), - stroke: (thickness: 1.1pt, paint: col-skill), - mark: (end: "straight", scale: 0.42)) - elabel((8.2, 0.4), text(7pt, weight: "bold", fill: col-skill.darken(15%), [author skills])) + stroke: (thickness: 1.1pt, paint: accent), mark: arrow-end) + elabel((8.2, 0.35), text(weight: "bold", fill: accent.darken(15%), [author skills])) // ── Maintainer ↔ Contributor: community calls ── line((3.0 - 1.0, 0.8 + 0.8), (3.0 - 1.0, 11.0 - 0.8), - stroke: (thickness: 0.7pt, paint: col-human.lighten(25%), dash: "dashed"), - mark: (start: "straight", end: "straight", scale: 0.28)) - elabel((0.3, 5.9), text(5.5pt, fill: col-human.lighten(5%), [community\ calls])) + stroke: stroke-dashed, mark: arrow-both) + elabel((0.3, 5.9), [community\ calls]) }) From 39f10172551dc6530a5f3b126e5672a9d49cb289 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 19:05:29 +0800 Subject: [PATCH 27/38] Add paper redesign spec: bridge problem framing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Restructures the paper around the "bridge problem" concept — software too large for humans, made possible by agents constrained through systematic verification. Three barriers (convention drift, effort exhaustion, knowledge discontinuity) become the central thesis. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/paper/arxiv/paper-redesign-spec.md | 136 ++++++++++++++++++++++++ 1 file changed, 136 insertions(+) create mode 100644 docs/paper/arxiv/paper-redesign-spec.md diff --git a/docs/paper/arxiv/paper-redesign-spec.md b/docs/paper/arxiv/paper-redesign-spec.md new file mode 100644 index 00000000..5fbdecd0 --- /dev/null +++ b/docs/paper/arxiv/paper-redesign-spec.md @@ -0,0 +1,136 @@ +# Paper Redesign Spec + +**Date:** 2026-03-14 +**Title:** Bridging NP-Hard Problems: Scaling Software Beyond Human Capacity with Agentic Coding + +## Core Thesis + +**Bridge problems** are software projects too large for humans at scale. Agents can build them because systematic verification constrains agent output to match contributor-specified ground truth. NP-hard reductions are the first convincing example. + +## Bridge Problem Definition + +A software project where subtasks are homogeneous and formally verifiable, but three structural barriers make human-only execution infeasible at scale: + +1. **Convention drift** — humans can't maintain uniform conventions across hundreds of contributions; agents read CLAUDE.md every time +2. **Effort exhaustion** — humans can't sustain the energy to verify 100+ problems and continuously test without a user community +3. **Knowledge discontinuity** — humans graduate, newcomers can't absorb implicit knowledge; skills make onboarding executable + +The correctness concern: contributor-specified ground truth (definitions, examples, expected behavior) flows through the verification stack (type system → round-trip tests → overhead validation → agentic tests), constraining what agents can produce. Agent output ⊆ contributor intent. + +## Section Structure + +### Section 1: Introduction (~1.5 pages) +- Open with familiar examples (airlines, chips, logistics → NP-hard problems → need reductions) +- The reduction graph idea (connect problems to solvers) +- **The claim:** This is a *bridge problem* — software too large for humans, made possible by agents constrained through systematic verification +- Three barriers preview +- **Fig 1: Scaling Wall** +- Contributions bullet list + +### Section 2: Bridge Problems (~2 pages) +- Define bridge problems formally +- Three barriers, each with concrete evidence from this project: + - Convention drift: agents never deviated from file naming, trait implementation, test patterns + - Effort exhaustion: 45 rules in 9 weeks vs Julia predecessor 20 types in 4 years; agents never tire of running the same verification loop + - Knowledge discontinuity: skills encode workflow as executable documents; new maintainers invoke same skills that produced the codebase +- The correctness concern and how verification addresses it +- **Fig 2: Verification Funnel** — agent generates candidates → type system rejects invalid structure → round-trip tests reject wrong semantics → agentic tests reject poor UX → only correct code survives +- Other candidate domains (algorithm libraries, compiler optimization passes, HDL, numerical linear algebra) + +### Section 3: Case Study — The Reduction Graph (~2 pages) +- What is a reduction (brief definition) +- Graph structure: 27 problems, 45 rules, 56 edges +- **Fig 3: Reduction Graph** with color-coded solver-reachability arrows: + - Blue = paths reaching ILP (Gurobi/CPLEX) + - Red = paths reaching QUBO (D-Wave) + - Green = paths reaching UD-MIS (Rydberg atoms) + - Nodes colored by which solvers they can reach (multi-color = multiple solvers) + - Reader traces arrows forward (problem → solver) or backward (solver → problems) +- Emergent compositionality highlighted as multi-hop colored paths (e.g., Factoring → CircuitSAT → SAT → ILP) +- Round-trip testing (brief) + +### Section 4: Methodology — Skills + Verification (~2 pages) +- Skills: persistent, versioned workflow scripts that encode convention +- **Fig 4: Pipeline** (existing 6-stage board, orange=human, blue=agent) +- The 14 skills and what they do (table or compact list) +- Verification stack in practice (type system, unit tests, closed-loop, overhead validation, agentic tests) + +### Section 5: Evidence (~2 pages) +- **Fig 5: Development Timeline** — cumulative plot of problem types + rules over 9 weeks, phase annotations (manual → basic-skills → full-pipeline), Julia predecessor 4-year trajectory overlaid +- Development metrics (58 PRs, 15:1 agent-to-human message ratio) +- Quality gate: 75% rejection rate on 322 batch-submitted proposals +- Barrier-by-barrier evidence: + - Convention: zero convention violations in agent-authored code + - Effort: acceleration curve across phases + - Continuity: skills as executable onboarding (dev-setup, add-rule, etc.) + +### Section 6: Discussion (~1.5 pages) +- Limitations (N=1, skill engineering cost, confounding factors) +- Why human experts remain essential (LLM reasoning limits, citing existing research) +- Future work (100+ problems, formal verification with Lean/Coq, cost-aware path selection, automated discovery via AlphaEvolve) + +### Appendices +- A1: System Architecture (trait hierarchy, ReduceTo, macros) — existing figure +- A2: Verification Stack Details (7-layer pyramid) — existing figure +- A3: Topology Issues (orphan nodes, redundant rules, NP-hardness gaps) — moved from main text + +## Figure Specifications + +### Fig 1: Scaling Wall (NEW) +- **Type:** Line chart with annotations +- **X-axis:** Number of problem types (0 → 200) +- **Y-axis:** Software quality (convention compliance, test coverage, documentation completeness) +- **Lines:** + - Human team trajectory: rises then plateaus/declines as it hits 3 barrier walls + - Agent + verification trajectory: breaks through all 3 walls +- **Annotations:** Three vertical dashed lines marking the barriers (convention drift, effort exhaustion, knowledge discontinuity) +- **Data points:** Julia predecessor at 20 (4 years), this work at 27 (9 weeks), vision at 100+ +- **Format:** Typst/CeTZ or TikZ + +### Fig 2: Verification Funnel (NEW) +- **Type:** Funnel/filter diagram +- **Flow:** Wide at top (agent generates many candidate implementations) → narrowing through filters → narrow at bottom (correct code) +- **Layers (top to bottom):** + 1. Agent output (wide) — "many plausible implementations" + 2. Type system filter — "rejects structural errors (wrong trait impl, type mismatch)" + 3. Round-trip tests filter — "rejects semantic errors (wrong transformation, broken inverse)" + 4. Overhead validation — "rejects incorrect complexity claims" + 5. Agentic feature tests — "rejects UX/documentation issues" + 6. Correct code (narrow) — "matches contributor ground truth" +- **Side annotation:** "Contributor-specified ground truth" arrow pointing into each filter level +- **Format:** Typst/CeTZ or TikZ + +### Fig 3: Reduction Graph (REDESIGN) +- **Base:** Existing 27-node directed graph +- **Enhancement:** Color-coded edges by solver reachability + - Blue edges/paths → ILP (Gurobi/CPLEX) + - Red edges/paths → QUBO (D-Wave quantum annealer) + - Green edges/paths → UD-MIS (Rydberg atom arrays) +- **Node coloring:** Nodes tinted by which solvers they can reach (multi-color for multiple solvers) +- **Solver hubs:** Prominent labels at bottom: "Gurobi/CPLEX", "D-Wave", "Rydberg" +- **Key insight:** Same graph answers both "what can this solver solve?" and "what solvers can this problem reach?" +- **Format:** Typst/CeTZ (redesign existing reduction-graph.typ) + +### Fig 4: Pipeline (EXISTING — keep as-is) +- 6-stage Kanban board +- Orange = human judgment points, Blue = agent-automated steps + +### Fig 5: Development Timeline (NEW) +- **Type:** Cumulative line plot +- **X-axis:** Time (weeks 1-9, with dates) +- **Y-axis (left):** Cumulative count (problem types, reduction rules) +- **Lines:** Two lines — problem types (solid) and reduction rules (dashed) +- **Phase bands:** Background shading for manual / basic-skills / full-pipeline phases +- **Overlay:** Julia predecessor trajectory (4 years to 20 types) shown as a faint reference line, dramatically slower +- **Data source:** git-mining-results.json +- **Format:** Typst/CeTZ-plot or matplotlib-generated PDF + +## Key Changes from Current Paper +1. "Bridge problem" concept elevated from Discussion to Section 2 +2. Reduction graph becomes case study illustrating the concept, not the main event +3. 4 impossibilities merged to 3 barriers (effort exhaustion + testing frequency combined) +4. Framing: "agents break through barriers, verification ensures correctness" (not "humans + agents must combine") +5. Reduction graph figure redesigned with solver-reachability coloring +6. Three Roles figure cut (pipeline is sufficient) +7. Topology Issues figure moved to appendix +8. Problem tree figure absorbed into reduction graph or cut From 6a92d6495bb804778325f3348e268edf2c350ff3 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 21:28:14 +0800 Subject: [PATCH 28/38] Restructure paper around bridge problem framing - Rewrite abstract and introduction with bridge problem concept - Add new Section 2 (Bridge Problems): definition, three barriers, verification constrains agent output, other candidate domains - Rename Section 3 to "Case Study: The Reduction Graph" - Rewrite Discussion: remove content now in Sec 2, tighten "Why Human Experts Remain Essential", rewrite conclusion - Move topology figure to appendix - Renumber sections throughout Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/paper/arxiv/paper.tex | 315 ++++++++++++++++--------------- docs/paper/arxiv/plan-rewrite.md | 85 +++++++++ 2 files changed, 252 insertions(+), 148 deletions(-) create mode 100644 docs/paper/arxiv/plan-rewrite.md diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 209c9725..e8b3e6c3 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -11,21 +11,18 @@ \begin{document} -\title{Skill-Based Agentic Coding for Mathematical Software:\\ -A Case Study in NP-Hard Problem Reductions} +\title{Bridging NP-Hard Problems: Scaling Software Beyond Human Capacity with Agentic Coding} \author{...} % placeholder \maketitle \begin{abstract} -Many real-world optimization problems---scheduling, routing, chip design---are computationally hard (NP-hard), yet specialized hardware and software solvers exist for a handful of them. -Connecting a new problem to an existing solver requires a \emph{reduction}: a mathematical transformation that converts one problem into another while preserving the solution. -We build a library of such reductions as a directed graph of 27~problem types and 56~directed edges (45~hand-coded transformation rules plus edges inferred from type relationships), forming a ``compilation layer'' that routes any supported problem to the solver best suited for it. -Building this graph at scale requires implementing many self-contained mathematical proofs as verified code---a task well suited to AI coding agents when properly structured. -We introduce a \emph{skill-based} methodology that decomposes work into human-creative specification (deciding which reductions are worth implementing) and agent-managed execution (writing code, tests, and documentation). -Over nine weeks, this approach produced a Rust library with $>$95\% test coverage. -The graph exhibits a striking property: reductions implemented independently compose automatically to solve problems that no single implementation was designed for. +Some software projects are too large for human teams to build at scale. +We call these \emph{bridge problems}: projects whose subtasks are homogeneous and formally verifiable, but where three structural barriers---convention drift, effort exhaustion, and knowledge discontinuity---make human-only execution infeasible beyond a few dozen components. +AI coding agents can break through these barriers because systematic verification constrains agent output to match contributor-specified ground truth. +We demonstrate this claim on NP-hard problem reductions, building a directed graph of 27~problem types and 45~transformation rules that routes any supported problem to a specialized solver. +Over nine weeks, a single maintainer and AI agents produced a Rust library with $>$95\% test coverage---a task whose predecessor project took four years to reach 20~problem types. Agent session data reveals a 15:1 ratio of agent-to-human messages, while an automated quality gate rejected 75\% of 322~batch-submitted proposals as incorrect or incomplete. \end{abstract} @@ -43,9 +40,9 @@ \subsection{The Problem: Many Hard Problems, Few Solvers} Each of these is an instance of an NP-hard problem---a class of problems for which no efficient general-purpose algorithm is known, but which can be solved in practice for moderate sizes by specialized solvers. The difficulty is that each solver speaks its own narrow language. -Rydberg atom arrays~\cite{lucas2014, pichler2018}---a type of quantum hardware---natively solve the Maximum Independent Set (MIS) problem on geometric graphs. -D-Wave quantum annealers~\cite{glover2019} solve Quadratic Unconstrained Binary Optimization (QUBO). -Commercial engines like Gurobi and CPLEX solve Integer Linear Programs (ILP). +Rydberg atom arrays~\cite{lucas2014, pichler2018}---a type of quantum hardware---natively solve the Maximum Independent Set problem on geometric graphs. +D-Wave quantum annealers~\cite{glover2019} solve Quadratic Unconstrained Binary Optimization. +Commercial engines like Gurobi and CPLEX solve Integer Linear Programs. A practitioner with a graph coloring problem cannot directly use any of these solvers without first \emph{translating} the problem into a form the solver understands. This translation is called a \emph{reduction}: a polynomial-time algorithm that converts an instance of one problem into an instance of another, together with an inverse map that translates the solution back. @@ -53,59 +50,135 @@ \subsection{The Problem: Many Hard Problems, Few Solvers} \subsection{Our Approach: A Reduction Graph} -We build a collection of such reductions organized as a \emph{directed graph}. -Each node in the graph is a problem type (e.g., Satisfiability, Max-Cut, Traveling Salesman). -Each directed edge is a verified reduction from one problem to another---code that transforms instances forward and maps solutions back. -Given any supported problem, a path through the graph leads to a solver, with each edge backed by tested code. +We organize these reductions as a \emph{directed graph}. +Each node is a problem type (e.g., Satisfiability, Max-Cut, Traveling Salesman). +Each directed edge is a verified reduction---code that transforms instances forward and maps solutions back. +Given any supported problem, a path through the graph leads to a solver, with every edge backed by tested code. The graph currently contains 27~problem types connected by 56~directed edges (\Cref{fig:reduction-graph}). -It exhibits a property we call \emph{emergent compositionality}: reductions implemented independently---by different people, in different pull requests, weeks apart---compose automatically through the graph infrastructure. -For example, one contributor implemented Factoring $\to$ Circuit Satisfiability (CircuitSAT), and another implemented CircuitSAT $\to$ ILP. +Reductions implemented independently compose automatically through the graph. +For example, one contributor implemented Factoring $\to$ Circuit Satisfiability, and another implemented Circuit Satisfiability $\to$ Integer Linear Programming. Neither intended the composition, yet the graph enables factoring integers via linear programming by chaining the two reductions. Each new edge creates not just one connection but new paths through the entire graph. -\subsection{The Challenge: Building the Graph at Scale} +\subsection{Bridge Problems} -Implementing a single reduction requires understanding both the source and target problems, designing a correct transformation, proving it preserves solutions, writing 50--400 lines of code, and testing it thoroughly. -Building 50 such reductions is labor-intensive. +Building this graph requires implementing 45~transformation rules, each involving 50--400 lines of verified code. +A predecessor project in Julia took four years to reach 20~problem types. +This is not a failure of effort but of scale: the project is a \emph{bridge problem}---software too large for human teams to build and maintain at the required quality level. -AI coding agents---systems that autonomously write, test, and debug code---have shown strong performance on isolated software tasks, resolving 70--80\% of single-issue bug fixes on the SWE-Bench benchmark~\cite{Xia2025LiveSWEagent}. -But on multi-step tasks spanning many files, success drops to roughly 21\%~\cite{Thai2025SWEEVO, Deng2025SWEBenchPro}. -For mathematical software, the gap is wider still: an agent can generate a plausible-looking reduction, but verifying that it preserves solution structure requires reasoning that current agents cannot reliably perform alone~\cite{Roychoudhury2025AgenticAI}. +We identify three structural barriers that make bridge problems infeasible for human-only teams: +\begin{enumerate} + \item \textbf{Convention drift.} Hundreds of contributions must follow identical file-naming, interface, and testing conventions. + Human contributors inevitably diverge; agents read the project specification on every invocation and never deviate. + \item \textbf{Effort exhaustion.} Each new reduction demands the same cycle of coding, testing, documentation, and review. + Human energy is finite; agents execute the same verification loop indefinitely without fatigue. + \item \textbf{Knowledge discontinuity.} Contributors graduate, change jobs, or lose context. + Implicit workflow knowledge---which files to create, which tests to write, which edge cases to check---is lost with each departure. + Reusable skills encode this knowledge as executable documents that any new contributor or agent can invoke. +\end{enumerate} -We argue that the bottleneck is not agent capability but \emph{task decomposition}: how work is structured so that each unit falls within an agent's reliable range. -NP-hard reductions are a natural fit. -Every reduction implements the same interface, follows the same file convention, requires the same test pattern, and produces the same artifacts. -This homogeneity enables \emph{reusable skills}: persistent, versioned workflow scripts that decompose a complex multi-file task into numbered agent-executable steps. -A single ``add-rule'' skill handles all reductions, because the workflow is structurally identical even when the mathematical content varies. +% Placeholder for Fig 1: Scaling Wall +% \begin{figure}[t] +% \centering +% % \includegraphics[width=\columnwidth]{figures/scaling-wall.pdf} +% \caption{Scaling wall. Human teams maintain quality up to $\sim$20 components, then hit convention drift, effort exhaustion, and knowledge discontinuity. Agents constrained by systematic verification break through all three barriers.} +% \label{fig:scaling-wall} +% \end{figure} -\subsection{Contributions} +AI coding agents can break through these barriers, but only if their output is constrained to match contributor-specified ground truth. +A contributor supplies the creative elements: which problems matter, what the formal definitions are, which examples reveal correctness. +These flow through a verification stack---type system, round-trip tests, overhead validation, agentic feature tests---that rejects any agent output inconsistent with the specification. +The agent produces volume; verification ensures correctness. -Our methodology separates creative decisions from routine execution. -\textbf{Contributors} supply the creative elements that only domain experts can provide: which problems matter, what are the formal definitions, which examples reveal correctness. -\textbf{The maintainer} encodes workflow knowledge into reusable skills and makes two judgment calls per contribution---what to build and whether to merge. -\textbf{Agents} serve as guides (helping contributors articulate their ideas interactively) and as runners (handling the routine volume of implementing code, writing tests, and fixing CI). -A library of 14~skills and a multi-layered verification stack ensure correctness across abstraction levels. +\subsection{Contributions} -Over nine weeks, this methodology produced a Rust library with 27~problem types, 45~reduction rules, and $>$95\% test coverage. +Over nine weeks, a single maintainer and AI agents produced a Rust library with 27~problem types, 45~reduction rules, and $>$95\% test coverage. Our contributions are: \begin{itemize} - \item A \textbf{verified reduction graph} connecting 27~NP-hard problem types to specialized solvers, with emergent compositionality through automatic path composition. - \item A \textbf{skill-based methodology} for mathematical software engineering, separating human-creative specification from agent-managed execution. - \item A \textbf{verified open-source artifact}: the Rust library, skill library, and full development history as a benchmark for agentic mathematical software engineering. + \item \textbf{The bridge problem concept}: a characterization of software projects where homogeneous, verifiable subtasks create scale barriers that agents can overcome (\Cref{sec:bridge}). + \item \textbf{A verified reduction graph} connecting 27~NP-hard problem types to specialized solvers, with emergent compositionality through automatic path composition (\Cref{sec:graph}). + \item \textbf{A skill-based methodology} for mathematical software engineering, encoding workflow knowledge as reusable, versioned scripts that decompose multi-file tasks into agent-executable steps (\Cref{sec:method}). + \item \textbf{Quantitative evidence}: a 15:1 agent-to-human message ratio, a 75\% rejection rate on 322~batch-submitted proposals demonstrating the quality gate's selectivity, and a 9-week timeline versus the predecessor's four years (\Cref{sec:evaluation}). \end{itemize} The rest of this paper is organized as follows. -\Cref{sec:graph} describes the reduction graph and its properties. -\Cref{sec:method} presents the skill-based methodology. -\Cref{sec:evaluation} evaluates through development metrics, a quality gate analysis, and case studies. +\Cref{sec:bridge} defines bridge problems and the three barriers. +\Cref{sec:graph} presents the reduction graph as a case study. +\Cref{sec:method} describes the methodology. +\Cref{sec:evaluation} provides evidence. \Cref{sec:related} surveys related work. -\Cref{sec:conclusion} discusses generalizability and future directions. +\Cref{sec:conclusion} discusses limitations and future directions. + +%====================================================================== +% SECTION 2: BRIDGE PROBLEMS +%====================================================================== +\section{Bridge Problems}\label{sec:bridge} + +A \emph{bridge problem} is a software project whose subtasks are homogeneous and formally verifiable, but whose scale exceeds what human teams can sustain. +We identify three structural barriers that distinguish bridge problems from ordinary large projects. + +\subsection{Three Barriers to Human-Scale Development} + +\paragraph{Convention drift.} +A reduction library demands strict uniformity: every rule file follows the same naming convention, implements the same trait, includes the same test pattern, and produces the same documentation artifacts. +Human teams cannot maintain this discipline across hundreds of contributions. +Conventions drift, shortcuts accumulate, and style guides go unread. +In our approach, the project specification (\texttt{CLAUDE.md}) and reusable skills encode every convention as a machine-readable document. +The agent reads it on every invocation and never deviates---convention compliance becomes a property of the tool, not a discipline of the team. + +\paragraph{Effort exhaustion.} +Each new problem type requires definitions, examples, reductions, tests, overhead formulas, and documentation. +Verifying each component one by one---running the same round-trip test for the 45th time, checking the same overhead formula against the same symbolic expressions---requires sustained, detail-oriented effort that no individual or small team can maintain indefinitely. +A niche mathematical library also lacks the user base to surface issues through organic community usage. +Agents execute the same verification loop without fatigue, and agentic feature tests---where an agent role-plays as a downstream user---provide the testing frequency that the community size cannot. + +\paragraph{Knowledge discontinuity.} +Human maintainers graduate, change jobs, or lose interest. +New contributors face a steep onboarding curve: understanding the architecture, the conventions, the testing patterns, and the implicit knowledge accumulated over years. +Skills encode this knowledge in executable, versionable form. +A new maintainer does not need to reconstruct the original developer's mental model from scattered comments and commit messages---they invoke the same skills that produced the codebase. +The \texttt{dev-setup} skill configures the development environment; the \texttt{add-rule} skill encodes the complete workflow for adding a reduction. +Open-source contribution, traditionally gated by the ability to absorb a project's implicit culture, becomes gated only by domain knowledge. + +\subsection{Verification Constrains Agent Output}\label{sec:verification-bridge} + +The natural concern is correctness: how can agent-written mathematical software be trusted? +Our answer is that contributor-specified ground truth flows through a verification stack that constrains what agents can produce. + +% Placeholder for Fig 2: Verification Funnel +% \begin{figure}[t] +% \centering +% % \includegraphics[width=\columnwidth]{figures/verification-funnel.pdf} +% \caption{Verification funnel. +% Agent-generated code is progressively filtered: the type system rejects structural errors, round-trip tests reject semantic errors, overhead validation rejects incorrect complexity claims, and agentic feature tests reject usability issues. +% Only code matching contributor-specified ground truth survives.} +% \label{fig:verification-funnel} +% \end{figure} + +A contributor supplies the creative elements that define correctness: formal problem definitions, worked examples with known solutions, and expected overhead formulas. +These flow through four verification layers: +\begin{enumerate} + \item \textbf{Type system}: the Rust compiler rejects structural errors---wrong return types, missing inverse maps, references to nonexistent problem attributes---at compile time. + \item \textbf{Round-trip tests}: for each reduction, a small instance is transformed forward, solved by brute force, mapped back, and verified optimal for the source. This catches the most mathematically subtle errors without per-reduction test logic. + \item \textbf{Overhead validation}: symbolic size expressions are evaluated against actual target sizes, catching formula errors that are type-correct but mathematically wrong. + \item \textbf{Agentic feature tests}: an agent reads the documentation, exercises the feature through the CLI, and judges whether results are consistent with domain knowledge---replacing the community feedback loop that niche libraries lack. +\end{enumerate} + +The agent has freedom in \emph{how} to implement; verification eliminates freedom in \emph{what} the result must be. +Agent output $\subseteq$ contributor intent. + +\subsection{Other Candidate Domains} + +NP-hard reductions are not the only bridge problem. +Algorithm libraries, compiler optimization passes, numerical linear algebra routines, and hardware description languages share the same structure: homogeneous subtasks, formal correctness criteria, and a scale that exceeds what small teams can sustain. +In each case, domain experts can encode the ``how'' into reusable skills while the ``what'' remains a human judgment call. +The methodology does \emph{not} generalize to heterogeneous tasks---the staple of SWE-Bench---where each issue is structurally unique and resists skill-based decomposition. %====================================================================== -% SECTION 2: THE REDUCTION GRAPH +% SECTION 3: CASE STUDY --- THE REDUCTION GRAPH %====================================================================== -\section{The Reduction Graph}\label{sec:graph} +\section{Case Study: The Reduction Graph}\label{sec:graph} \subsection{What Is a Reduction?} @@ -183,7 +256,7 @@ \subsection{Verification by Round-Trip Testing}\label{sec:roundtrip} See \Cref{app:architecture} for the complete type architecture. %====================================================================== -% SECTION 3: METHODOLOGY +% SECTION 4: METHODOLOGY %====================================================================== \section{Methodology}\label{sec:method} @@ -219,24 +292,14 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} Alternatively, the contributor fills in a structured GitHub issue template directly---the template fields mirror the same creative questions. Either way, the output is an issue whose fields capture every creative decision needed for implementation. -Crucially, the agent first analyzes the graph's topology to identify the most valuable contributions (\Cref{fig:topology}). +Crucially, the agent first analyzes the graph's topology to identify the most valuable contributions (\Cref{app:topology}). Three categories guide the analysis: -\emph{orphan nodes}---problem types with no reductions to or from any other node, contributing nothing to the graph; -\emph{redundant rules}---direct reductions whose overhead is dominated by a cheaper composite path through intermediate nodes; -and \emph{missing proof paths}---problems with no reduction chain from 3-SAT, leaving their NP-hardness unproven within the graph. +\emph{orphan nodes}---problem types with no reductions to or from any other node; +\emph{redundant rules}---direct reductions dominated by a cheaper composite path; +and \emph{missing proof paths}---problems with no reduction chain from 3-SAT. The agent ranks proposals by priority: rules that connect orphans or fill proof-chain gaps are suggested first. Before filing the issue, the agent runs the quality checks from Stage~2 on the draft, catching problems before they reach review. -\begin{figure}[t] - \centering - \includegraphics[width=\columnwidth]{figures/topology-issues.pdf} - \caption{Three graph topology issues detected by the \texttt{topology-sanity-check} skill. - \textbf{(a)}~An orphan node has no edges and cannot reach any solver. - \textbf{(b)}~A direct reduction is redundant when a composite path through~$B$ has equal or lower overhead. - \textbf{(c)}~Problems without a path from 3-SAT lack a machine-verifiable NP-hardness proof in the graph.} - \label{fig:topology} -\end{figure} - \textbf{Stage~2: Validate.} The \texttt{check-issue} skill applies four independent tests to every proposal: \emph{usefulness} (does a reduction path already exist? is this one dominated by a cheaper composite path?), \emph{non-triviality} (is this a genuine structural transformation, not a variable substitution?), \emph{correctness} (do the cited references exist and support the claims?), and \emph{writing quality} (are all symbols defined, all template sections complete, all examples fully worked?). References are verified through a fallback chain: project bibliography, then web search---never hallucinated. @@ -358,9 +421,9 @@ \subsection{Why Rust?}\label{sec:why-rust} Procedural macros deserve special mention: they enable the compile-time validation of overhead expressions and variant registrations that powers Layers~1 and~4 of the verification stack. %====================================================================== -% SECTION 4: EVALUATION +% SECTION 5: EVIDENCE %====================================================================== -\section{Evaluation}\label{sec:evaluation} +\section{Evidence}\label{sec:evaluation} We evaluate through development metrics from the project's history, an analysis of the automated quality gate, and case studies showing how verification layers interact. All development used Claude Code~\cite{Anthropic2025ClaudeCode} with Claude models (Sonnet~3.5 and Sonnet~4; the model version evolved during development). @@ -456,7 +519,7 @@ \subsection{Case Studies}\label{sec:cases} The pipeline's primary value here is enforcing conventions---file naming, macro registration, documentation---rather than catching logical errors. %====================================================================== -% SECTION 5: RELATED WORK +% SECTION 6: RELATED WORK %====================================================================== \section{Related Work}\label{sec:related} @@ -491,37 +554,10 @@ \section{Related Work}\label{sec:related} These approaches assume a QUBO or Ising formulation as input---precisely the transformation that our reduction graph provides as upstream infrastructure. %====================================================================== -% SECTION 6: DISCUSSION AND CONCLUSION +% SECTION 7: DISCUSSION AND CONCLUSION %====================================================================== \section{Discussion and Conclusion}\label{sec:conclusion} -\subsection{When Does This Methodology Apply?} - -The approach rests on what we call a \emph{Goldilocks property}: the target domain must have tasks that are (1)~formally specified, (2)~decomposable into homogeneous subtasks, (3)~equipped with automatable correctness criteria, and (4)~demanding enough to require both human creativity and mechanical execution. -NP-hard reductions satisfy all four, but are not unique. -Algorithm libraries, numerical linear algebra routines, compiler optimization passes, and hardware description languages share this structure. -In each case, a domain expert can encode the ``how'' into reusable skills while the ``what'' remains a human judgment call. - -The methodology does \emph{not} generalize to heterogeneous tasks---the staple of SWE-Bench---where each issue is structurally unique and resists skill-based decomposition. - -For domains that do share the Goldilocks property, our experience suggests three key components for scalable agentic open-source development: - -\begin{enumerate} - \item \textbf{A smooth contribution path.} - Progressive quality gates---propose, validate, implement, review, merge---ensure that every contribution follows the same verified pipeline regardless of who (or what) initiates it. - Skills encode this path so that it need not be re-learned or re-explained. - - \item \textbf{A code-free contribution path with visual verification.} - Domain experts contribute creative elements---definitions, examples, references---without writing a line of code. - An issue goes in; verified code and a documented paper entry come out. - The paper's worked examples, generated from the same code that the round-trip tests execute, let contributors visually confirm that the implementation matches their mathematical intent. - - \item \textbf{A minimal maintainer team amplified by agents.} - One or two maintainers write skills, curate the backlog, and make final merge decisions. - Agents handle all routine volume---implementation, testing, review, documentation---headlessly. - The maintainer's effort scales with the number of \emph{skill types} (currently~14), not the number of contributions. -\end{enumerate} - \subsection{Limitations} \paragraph{Single case study.} @@ -542,71 +578,35 @@ \subsection{Limitations} \subsection{Why Human Experts Remain Essential} -Our pipeline's reliance on human judgment at two transition points is not a temporary limitation awaiting better models---it reflects a fundamental gap. -Recent work demonstrates that large language models do not perform genuine mathematical reasoning but replicate patterns observed in training data: performance drops up to 65\% when only irrelevant clauses are added to otherwise identical problems~\cite{Mirzadeh2025GSMSymbolic}, and reasoning models exhibit complete accuracy collapse beyond certain complexity thresholds~\cite{Shojaee2025IllusionOfThinking}. -Multi-step reasoning surveys confirm that LLMs' limited working memory leads to failures when task demands exceed their capacity, with logically equivalent but differently phrased prompts producing different results~\cite{Plaat2025MultiStepReasoning}. - -These limitations are architectural, not merely a matter of scale. -Formal analysis shows that standard transformers are bounded by constant-depth threshold circuits (TC$^0$); even chain-of-thought prompting extends expressiveness only to polynomial-time computation---insufficient for the exponential search that NP-hard reasoning demands~\cite{Merrill2024ExpressivePower}. -More fundamentally, gradient-based optimization learns statistical correlations, not symbolic reasoning procedures: multi-step compositional reasoning exhibits \emph{multiplicative} error accumulation, where each approximate step compounds uncertainty rather than preserving logical structure~\cite{Dziri2023FaithFate}. -A transformer performs the same fixed-depth computation for every token regardless of problem difficulty; it cannot ``think harder'' on harder sub-problems the way a human mathematician allocates variable effort. - -The context window of the model used in this work (Claude Opus~4.6) is approximately 200,000~tokens---roughly a 500-page book. -This may seem large, but recent studies confirm it functions as \emph{working memory}, not deep reasoning capacity. -Huang et al.~\cite{Huang2025LLMWorkingMemory} show that LLMs cannot maintain and manipulate internal state---the hallmark of human working memory---across multiple model families regardless of chain-of-thought prompting. -Even when models can perfectly retrieve all relevant information from context, reasoning performance still degrades 14--85\% as input length increases~\cite{Du2025ContextLengthHurts}, and information in the middle of the context is effectively invisible~\cite{Liu2024LostInMiddle}. -A researcher's ability to hold a problem in mind for days, connect it to half-remembered theorems, and test ideas against deep intuition built over years of study is qualitatively different from pattern-matching over a fixed-length token buffer. -The gap is starkest on research-level mathematics: the FrontierMath benchmark, developed with Fields Medalists including Terence Tao, saw all AI models score below 2\% at launch~\cite{Glazer2024FrontierMath}; Humanity's Last Exam, spanning dozens of academic disciplines, found expert accuracy above 98\% versus 8\% for the best model at launch~\cite{Paster2025HLE}. -Moreover, users cannot modify foundation model weights: the model's knowledge is fixed at training time, and no amount of prompting adds expertise the training data did not contain. - -Skills are our response to this gap. -Rather than attempting to encode domain expertise in model weights (which requires expensive retraining) or in prompts (which are ephemeral and limited by context), skills encode it in \emph{versionable, composable documents} that persist across sessions and evolve through pull requests. -The creative decisions that require genuine expertise---which problems matter, which examples reveal correctness, whether a reduction is non-trivial---remain with human contributors. -The routine work that LLMs execute reliably---implementing known patterns, running tests, fixing compilation errors---is delegated to agents. +The pipeline's reliance on human judgment is not a temporary limitation awaiting better models---it reflects a fundamental gap. +LLMs do not perform genuine mathematical reasoning: performance drops up to 65\% when irrelevant clauses are added to otherwise identical problems~\cite{Mirzadeh2025GSMSymbolic}, reasoning models exhibit accuracy collapse beyond complexity thresholds~\cite{Shojaee2025IllusionOfThinking}, and multi-step compositional reasoning exhibits multiplicative error accumulation~\cite{Dziri2023FaithFate}. +Standard transformers are bounded by constant-depth threshold circuits (TC$^0$); even chain-of-thought extends expressiveness only to polynomial-time computation~\cite{Merrill2024ExpressivePower}. -\subsection{Scale Beyond Human Capacity} +The context window (200,000~tokens for Claude Opus~4.6) functions as working memory, not reasoning capacity: performance degrades 14--85\% as input length increases~\cite{Du2025ContextLengthHurts}, and LLMs cannot maintain internal state across turns~\cite{Huang2025LLMWorkingMemory}. +On research-level mathematics, the best models score below 2\% on FrontierMath~\cite{Glazer2024FrontierMath} and 8\% on Humanity's Last Exam~\cite{Paster2025HLE}. -A comprehensive reduction graph connecting hundreds of NP-hard problems would require thousands of verified reductions---each demanding familiarity with two distinct problem domains, a correct polynomial-time transformation, a solution-preserving inverse map, and thorough testing. -No research group has the bandwidth to implement this manually. -The predecessor Julia package, maintained conventionally for several years, reached 20~problem types; the skill-based pipeline produced 27~types with 45~verified rules in nine weeks. -The scaling vision (\Cref{fig:reduction-graph}, top layer) targets 100+ problem types---a scale that is infeasible without agent-managed execution. +Skills are our response: rather than encoding domain expertise in model weights (expensive retraining) or prompts (ephemeral), skills encode it in versionable documents that persist across sessions. +Creative decisions remain with humans; routine execution is delegated to agents. -\subsection{Industry Impact} +\subsection{Future Work} -Historically, utilizing a new hardware accelerator---a D-Wave quantum annealer, a Rydberg atom array, an Ising machine---required manually constructing a reduction from each target problem to the hardware's native formulation. -This case-by-case effort meant that each device supported on the order of ten problems at most, determined by which reductions a research group had time to derive and implement. +\paragraph{Industry impact.} +The reduction graph serves as a \emph{solver-agnostic compilation layer}. +Adding a single edge from a hardware platform's native formulation (e.g., MIS for Rydberg atoms) routes all 27~problem types to that device. +Each new problem added to the graph is instantly available on every connected solver. -The reduction graph changes this from a per-problem effort to a one-time connection. -Adding a single edge from the hardware's native formulation (e.g., MIS for Rydberg atoms) to the graph immediately routes \emph{all} 27~problem types to that device. -When a new solver or hardware platform appears, one verified reduction connects it to the entire graph---the practitioner need not derive any transformation by hand. -Conversely, each new problem added to the graph is instantly available on every connected solver. -The graph thus serves as a \emph{solver-agnostic compilation layer}: a user describes the problem, and the graph selects the verified, cost-annotated path to the best available hardware. - -\subsection{Barrier-Free Community Contribution} - -Traditional mathematical software demands that contributors learn the language, the build system, the testing conventions, and the project's architecture. -Skills eliminate this barrier entirely. -A physicist who knows that Spin Glass reduces to QUBO can contribute that knowledge through the \texttt{propose} skill without writing a line of Rust---the agent guides the session in mathematical language, validates the proposal, and the pipeline handles implementation. -The issue template captures exactly the creative elements; everything else is routine. -This transforms the contributor pool from ``people who know Rust and NP-hard reductions'' to ``people who know NP-hard reductions''---a far larger community. - -\subsection{The Scaling Vision} - -The graph's value grows superlinearly with its size. -Each new edge creates not just one connection but composite paths through the entire graph. -As the graph scales from 27 to 100+ problem types through agent-synthesized rules, it evolves from a library into a \emph{reduction compiler}: a user describes their problem, and the compiler selects the lowest-cost verified path to the target solver. - -Three directions extend this work. -First, composing with \emph{automated discovery}: evolutionary search methods like AlphaEvolve~\cite{Novikov2025AlphaEvolve} discover new reductions, and our pipeline implements and verifies them. -Second, \emph{formal verification}: supplementing round-trip tests and proof sketches with machine-checked Lean or Coq proofs. -Third, \emph{cost-aware path selection}: each edge carries a polynomial cost model describing the size blowup, and the optimal path may depend on instance scale. +\paragraph{Scaling to 100+ problems.} +The graph's value grows superlinearly: each new edge creates composite paths through the entire graph. +Three directions extend this work: composing with \emph{automated discovery} via evolutionary search~\cite{Novikov2025AlphaEvolve}, supplementing round-trip tests with \emph{formal verification} (machine-checked Lean or Coq proofs), and \emph{cost-aware path selection} where the optimal solver route depends on instance scale. \subsection{Conclusion} -We have presented a verified reduction graph connecting 27~NP-hard problem types to specialized solvers, built through skill-based agentic coding that separates creative specification from routine execution. -The graph exhibits emergent compositionality: independently implemented reductions compose automatically to solve problems no single implementation was designed for. -The core insight is that the bottleneck in agentic coding is not agent capability but task decomposition: when work is structured so each unit is formally specified, bounded in scope, and mechanically verifiable, current agents execute it reliably. -Skills lower the contribution barrier from ``knows the programming language'' to ``knows the mathematics,'' enabling a broader community to grow the graph toward the scale where it becomes a universal reduction compiler. +We have introduced \emph{bridge problems}---software projects whose scale exceeds human capacity but whose homogeneous, verifiable structure makes them amenable to agent execution constrained by systematic verification. +NP-hard problem reductions are the first convincing example: over nine weeks, a single maintainer and AI agents produced a verified reduction graph connecting 27~problem types to specialized solvers, a task whose predecessor took four years to reach 20~types. + +The core insight is that three structural barriers---convention drift, effort exhaustion, and knowledge discontinuity---make bridge problems infeasible for human teams, while verification ensures that agents cannot deviate from contributor-specified ground truth. +Skills encode workflow knowledge as reusable, versionable documents, lowering the contribution barrier from ``knows the programming language'' to ``knows the mathematics.'' +As the graph scales toward 100+ problem types, it evolves from a library into a reduction compiler---a vision that is infeasible without agentic execution. \bibliographystyle{IEEEtran} \bibliography{references} @@ -711,6 +711,25 @@ \section{Verification Stack Details}\label{app:verification} We observed agents modifying expected test outputs rather than fixing implementations---a rational strategy from the agent's perspective (shortest path to passing tests), but a correctness violation. Layer~5's independent fixtures and Layer~6's fresh-context review are the primary defenses against this failure mode. +\section{Graph Topology Analysis}\label{app:topology} + +The \texttt{topology-sanity-check} skill detects three categories of graph quality issues (\Cref{fig:topology}). + +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/topology-issues.pdf} + \caption{Three graph topology issues. + \textbf{(a)}~An orphan node has no edges and cannot reach any solver. + \textbf{(b)}~A direct reduction is redundant when a composite path through~$B$ has equal or lower overhead. + \textbf{(c)}~Problems without a path from 3-SAT lack a machine-verifiable NP-hardness proof in the graph.} + \label{fig:topology} +\end{figure} + +\emph{Orphan nodes} contribute nothing to the graph's routing capability. +\emph{Redundant rules} waste implementation effort when a cheaper composite path exists. +\emph{Missing proof paths} indicate problems whose NP-hardness is not machine-verifiable within the graph. +The agent uses these categories to rank proposals by priority: rules that connect orphans or fill proof-chain gaps are suggested first. + \section{Ablation Study Design}\label{app:ablation} To isolate the effect of skills on development outcomes, we design a controlled comparison on identical tasks. diff --git a/docs/paper/arxiv/plan-rewrite.md b/docs/paper/arxiv/plan-rewrite.md new file mode 100644 index 00000000..55fd0ece --- /dev/null +++ b/docs/paper/arxiv/plan-rewrite.md @@ -0,0 +1,85 @@ +# Paper Rewrite Plan + +**Spec:** `docs/paper/arxiv/paper-redesign-spec.md` +**Target:** `docs/paper/arxiv/paper.tex` + +## Steps + +### Step 1: Rewrite Abstract +Reframe around bridge problem concept. New arc: bridge problems exist (too large for humans) → agents can build them (verification constrains correctness) → NP-hard reductions as first example → evidence (27 types, 45 rules, 9 weeks vs 4 years). + +### Step 2: Rewrite Section 1 (Introduction) +- Keep familiar opening (airlines, chips, logistics) +- Introduce reduction graph idea +- **New:** Introduce "bridge problem" claim — this software is too large for humans +- Preview 3 barriers +- Mention verification solves the correctness concern +- Reference Fig 1 (Scaling Wall — to be created later) +- Contributions list + +### Step 3: Write Section 2 (Bridge Problems) — NEW +- Formal definition of bridge problems +- Three barriers with evidence: + - Convention drift + - Effort exhaustion (merged with testing frequency) + - Knowledge discontinuity +- Verification constrains agent output (funnel concept) +- Reference Fig 2 (Verification Funnel — to be created later) +- Other candidate domains + +### Step 4: Rewrite Section 3 (Case Study: Reduction Graph) +- Move existing graph description content here +- Keep: what is a reduction, graph structure, emergent compositionality +- **New:** Frame as case study illustrating bridge problem concept +- Reference Fig 3 (Reduction Graph — to be redesigned with solver-reachability coloring) +- Note: figure redesign is a separate step + +### Step 5: Rewrite Section 4 (Methodology) +- Largely keep existing methodology content +- Reframe skills as "how agents break through the 3 barriers" +- Keep pipeline figure (Fig 4) +- Keep verification stack description + +### Step 6: Rewrite Section 5 (Evidence) +- Reference Fig 5 (Development Timeline — to be created later) +- Development metrics +- Quality gate analysis +- **New:** Barrier-by-barrier evidence structure +- Julia predecessor comparison + +### Step 7: Rewrite Section 6 (Discussion) +- Keep limitations +- Keep "why human experts remain essential" (shortened) +- Move "Scale Beyond Human Capacity" content — already covered in Sec 2 +- Move "Barrier-Free Community Contribution" — fold into Sec 2 +- Future work + +### Step 8: Clean up appendices +- Move topology issues figure to appendix +- Keep architecture + verification pyramid in appendix +- Remove three roles figure reference + +### Step 9: Create Fig 1 (Scaling Wall) +New Typst/CeTZ figure in `figures/scaling-wall.typ` + +### Step 10: Create Fig 2 (Verification Funnel) +New Typst/CeTZ figure in `figures/verification-funnel.typ` + +### Step 11: Redesign Fig 3 (Reduction Graph with solver coloring) +Modify `figures/reduction-graph.typ` — add color-coded edges by solver reachability + +### Step 12: Create Fig 5 (Development Timeline) +New Typst/CeTZ-plot or Python-generated figure in `figures/timeline.typ` + +### Step 13: Rewrite abstract (final pass) +After all sections are stable, do a final pass on the abstract to ensure it matches. + +## Dependencies +- Steps 1-8 (text) can proceed before figures +- Steps 9-12 (figures) are independent of each other +- Step 13 depends on all prior steps + +## Notes +- Keep total under 12 pages (conference format) +- Use existing writing-guidelines.md principles +- Fix reviewer issues: timeline consistency, author count, LLM model identification From 2812b4516d6324200053a9e66afa0881a86835c6 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 21:35:52 +0800 Subject: [PATCH 29/38] Wire up figure references for scaling wall, verification funnel, timeline Uncomment figure placeholders and add timeline figure in Evidence section. Figures will be compiled from .typ sources in a following commit. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/paper/arxiv/paper.tex | 43 +++++++++++++++++++++++--------------- 1 file changed, 26 insertions(+), 17 deletions(-) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index e8b3e6c3..ae932e9a 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -78,13 +78,14 @@ \subsection{Bridge Problems} Reusable skills encode this knowledge as executable documents that any new contributor or agent can invoke. \end{enumerate} -% Placeholder for Fig 1: Scaling Wall -% \begin{figure}[t] -% \centering -% % \includegraphics[width=\columnwidth]{figures/scaling-wall.pdf} -% \caption{Scaling wall. Human teams maintain quality up to $\sim$20 components, then hit convention drift, effort exhaustion, and knowledge discontinuity. Agents constrained by systematic verification break through all three barriers.} -% \label{fig:scaling-wall} -% \end{figure} +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/scaling-wall.pdf} + \caption{Human teams maintain quality up to $\sim$20 components, then hit three barriers: convention drift, effort exhaustion, and knowledge discontinuity. + Agents constrained by systematic verification break through all three. + The Julia predecessor reached 20~problem types in four years; this work reached 27 in nine weeks.} + \label{fig:scaling-wall} +\end{figure} AI coding agents can break through these barriers, but only if their output is constrained to match contributor-specified ground truth. A contributor supplies the creative elements: which problems matter, what the formal definitions are, which examples reveal correctness. @@ -146,15 +147,14 @@ \subsection{Verification Constrains Agent Output}\label{sec:verification-bridge} The natural concern is correctness: how can agent-written mathematical software be trusted? Our answer is that contributor-specified ground truth flows through a verification stack that constrains what agents can produce. -% Placeholder for Fig 2: Verification Funnel -% \begin{figure}[t] -% \centering -% % \includegraphics[width=\columnwidth]{figures/verification-funnel.pdf} -% \caption{Verification funnel. -% Agent-generated code is progressively filtered: the type system rejects structural errors, round-trip tests reject semantic errors, overhead validation rejects incorrect complexity claims, and agentic feature tests reject usability issues. -% Only code matching contributor-specified ground truth survives.} -% \label{fig:verification-funnel} -% \end{figure} +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/verification-funnel.pdf} + \caption{Verification funnel. + Agent-generated code is progressively filtered: the type system rejects structural errors, round-trip tests reject semantic errors, overhead validation rejects incorrect complexity claims, and agentic feature tests reject usability issues. + Only code matching contributor-specified ground truth survives.} + \label{fig:verification-funnel} +\end{figure} A contributor supplies the creative elements that define correctness: formal problem definitions, worked examples with known solutions, and expected overhead formulas. These flow through four verification layers: @@ -451,7 +451,16 @@ \subsection{Development Metrics}\label{sec:metrics} \end{tabular} \end{table} -\Cref{tab:growth} traces the growth across three phases. +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{figures/timeline.pdf} + \caption{Cumulative growth of problem types and reduction rules over nine weeks. + Background bands mark three development phases: manual (Phase~1), basic skills (Phase~2), and full pipeline (Phase~3). + The Julia predecessor's four-year trajectory is shown for comparison.} + \label{fig:timeline} +\end{figure} + +\Cref{tab:growth} and \Cref{fig:timeline} trace the growth across three phases. \textbf{Phase~1 (Manual, 35~PRs)}: no skills; the maintainer issued step-by-step commands, established the architecture, and ported reductions from the predecessor Julia package. \textbf{Phase~2 (Basic skills, 9~PRs)}: initial \texttt{add-model}/\texttt{add-rule} skills reduced per-task human involvement. \textbf{Phase~3 (Full pipeline, 15~PRs)}: complete skill library with orchestration, quality gates, and multi-agent review. From d6cfe7c90b550b6687744be96ca5a34b71eefd39 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 21:39:03 +0800 Subject: [PATCH 30/38] Add three new figures: scaling wall, verification funnel, timeline - scaling-wall: hero figure showing 3 barriers human teams hit - verification-funnel: how verification constrains agent output - timeline: cumulative growth over 9 weeks with phase bands Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/paper/arxiv/figures/scaling-wall.pdf | Bin 0 -> 18907 bytes docs/paper/arxiv/figures/scaling-wall.typ | 274 ++++++++++++++++++ docs/paper/arxiv/figures/timeline.pdf | Bin 0 -> 19340 bytes docs/paper/arxiv/figures/timeline.typ | 102 +++++++ .../arxiv/figures/verification-funnel.pdf | Bin 0 -> 20215 bytes .../arxiv/figures/verification-funnel.typ | 220 ++++++++++++++ 6 files changed, 596 insertions(+) create mode 100644 docs/paper/arxiv/figures/scaling-wall.pdf create mode 100644 docs/paper/arxiv/figures/scaling-wall.typ create mode 100644 docs/paper/arxiv/figures/timeline.pdf create mode 100644 docs/paper/arxiv/figures/timeline.typ create mode 100644 docs/paper/arxiv/figures/verification-funnel.pdf create mode 100644 docs/paper/arxiv/figures/verification-funnel.typ diff --git a/docs/paper/arxiv/figures/scaling-wall.pdf b/docs/paper/arxiv/figures/scaling-wall.pdf new file mode 100644 index 0000000000000000000000000000000000000000..9cbfbc4f8c0fd37a7ff8f705b515141eba3ea5c8 GIT binary patch literal 18907 zcmeIacT`hN_b;yWB2B=8A^}7|Y62vocaYvYNDaLcI#Lux0UK2*f>dduBE5r17wH|O z2q;ZD2;w~_K_B(;{oZ%|?pk;K{v-h4ob8`6pKNFWd%ZC_Uugh$=n2IT=JrCmIo(GV#3RmslE17ZZ@1IRT^NW{cQP|j8$ z9|`O)0<_TXo|b5BcN9v~#RUzrYq~h1RLor=m}-Euj5!*GwsQh1({eR;1}K0$ZIru{ z2Uxp1$O2~Ap&dbWuqi;Oi?hwI46;r?Q-DGMgA&AuAFCjc0OX*YF-rWbRn-%yz=#)e zWCa9*W&4Lx73G6gLiqwF|6(CX4Vr2s0Lq5^|KPvLPw#`~{@Hw3@5?%(oIsyw0nT=@ z^mGDz0VdTUV1sEMc?)C9Z_Nqj{BH#wwHY7`<8DxxB*f@P+p0l$l+ArzJdavY-O=0< z1y&E?k#cmgbO5Nu#7vG9;DtZ|h4=tD0uVR^0YO3pAy6O_3W35PP(BEh9|9GCK;aN5 z0s=)spn?z>F9Zhgz%UZjZ6Mf?{I$K{e=$MM1#7;Rrv(}l<6xxof=Wo4d!WEl2#>s~ zq`tH&hbqcjS_R6>CFSC1#id{cC~k-L<+=jW$~s%RSlNL=_|58+1K?cnG@AL(g1Q@h@JQGQ0PqbU1>}Qy009ERF8~k$F$BaIFTe|A1NDJA0xAPz0vKCB0*cWW1}F|_ z4&~(*1YQU*%76>33C0gz0)oJY0C@2sfg(6~fp!2U0T*a73<+MCJU}o|6E9E)<^rVv z)dCmBGeG#mAV|0%gb#2MZ~=Po@k96k8$iFo!3!zK|63Xc86{1!qe~Pb<_9 zn!k}q0Ug-+9|Eurh_t5%+QkXr$J{A}AZ*wS2u3H!k1h@2Wd}%fQSKggF3v*0y^V!M z*4awh#R=@*9wgv^0jdGq8vwmP*IC;+Te%+@1F-<>B>_5|m7OK}D22(hbOQYH$7sWc zrLyvLJ)*>nGJlRe5G#~5;8DP*e;I%NDg1Q|!sG%`iuvRLjmHMyk-?xqOzx4D!07Wc z3)70c*z{2wViHiSieL}%a0PBr?&i)mC=wA~;3WnTkpo_sA31-N1HBPry@j=-hF zlrU_WB(@CB%ZoTFm%`-mAIX*eSq?v{R|dMtS4Yt`}nZB zz_9kg0GU8Y9z_E14bvlj#-^pGJNN?>(`_)Z2}YfrGwNs0a&>V9`7wW(F8QN(gB?^& z?k~eLW&r!waK#7Q;=lpwZ$m$j{qrXe(9eJVl92vo81qTGtLLw;-9ItWSZ2dR6J{lh zhK=$iAT;q*)IPRWvFI{Pv?|>rZqsolX;Hz)Bed}en+jCY#)C@WQ%R2JDAT{LL8Z?L#?II;8XTZ!tzpxXOSXxw?(jLrmM z#`i(i7RF`%OjrE%&yn!R`paZ*p6-yCTwo#lw&%Fo^7K}w^nO#seWL5m4@&)&&u-p^ z9DL0iblr*QcL?35nikM}zlbKVY>)i%rk|=SO#WiVn+&VS#~*DTpFvKsf4>t#jmyP% zQu8FWMHClf>m9c{YfyG*>kJoTH?%eG^_vWRCFNHv>`dHDoPtbauNWCcm|CMw(wjft zn4YFoQn!!_XF*!4$Vj<|Hcd`iDBQdKLMA&SKk+2xgIA390T!Ct5>^&citZA|lUcLJ zcDxTIH|B5d5{ipEY*#fEHD_mJ>oZYDb52pQn?v1NXYlK#-xJ@rXE-r%I2xms!^M^z zzj*mFnd0|A#yeM^G+q5RB_Od}JNeOFh%_Q8GxO?It23!_Xx(`2inef)Y679^NIywd zqU*dpWHV;0AJC24w;@DbWg2T+RB{y*pIBo^CqK#gcrRfGox1P`n?l*EcX3nB`@BWAm9>=(j`sRf)jDT9 zIDuvIQ`XgFb)cX@JwJ(IupdBHnn7SeLIma(?5FCiOp=9iz?>VJ*f>Rf8yWu0(+ z=z5r3I$3VGC-i7c1*aytB(eQDIy?PBez(DpZJdjG{%h)wXBD~6->WKE+QXk7qwZRM zg!`7XTzqH7TqN?KoPjpMPVCOCO4P9J;`>?nxEP&sCcl0meHtD$8S%oL+80(_`Sq|} zT+_Q*wcWjXm90ioH#k(H>+rcf&R#Uuaiw8(pjZ+aGVng1yf9WKT)|Dis9zuHsh(q` z;<&+xd$w!}-V`M{QRrqWa630cnCAjrL1a^u>|#9h0i?}kW#|x=UZC#5KQ@GHz{z2U(@iv zT;f#2C06p3XHxj>@EG2N9eg>X0`GnEyt3HZ;GDd)vPhS@Ps^}DY3YFEiuGcusg6TpmlI51=W}6OJU@e99lEVTIe|Jcsf_aUXIlF-jogd1o$}old6Htr zyk9e+mP?$U`D*lD8IOOdS)b}g@I1bIu;p*ZrnLVA=bBi$g!4s7G3Bs;{_+u9g_<`u z6&m7;a@7z#kEid_)%H1V#SZlmuI|Ub)0qI+LvY;f2nV>zt^LL@OHwWnqw_n6x z?AdB>)2D}#jS;<V1v%eZxZ0wP z(3VB7CJH{lsYYxK=~hW+{~jV|AZWm4VBGKV}<5)%0XFg`ayQ2^Se(n}^G%>CISB*E(q}0qvz;@I;AOG!34<%LmRrw!%E_#sJ;gLqS6={*_ z&{fm#2aHSxZ3|r1Xq<>Iw7VbO6#D%2z=B?5qJ>hHu6$gLIx+3Mvl;87fEm~bb+~j= zV$zboUGR~;iDfpIqNo$ zIo9#!*T#&;nnvI6G;4j4`KVk~QGSIylji0#|C<5>;msDN&D<|e(NWN$;F{;Omkm>h zwnPyD{dji7OR~eGXA84~x^-WLS9+t<&gSb1{&3#Ggzh3In?iDv` zP#BVeYvSInPRqO$NF>X4KsjZhd(yAat^RUxrSnr#ld_v$!dm4L4|^z5dhI75r$#@x z4#U=aD9Nm2$tw@%9Ajh7`otv{;lXkrL3Q*oTpScUa?g-LJdN)mfPJ2~kwxjsKYx%;RGsNqXLUVZpTOypTtKSruC%C&? z^h&zR5>JmVw663!iS5@UDScgQwxj<nCyif#8;QqW7s@76u$C?gWE`<}JjIl4%~ zk=*fcwdLkrmVRsg%Xf=~&vl4+C4EAP?36JDC2k-mMEz)11Gi|O_z!cM3d>roN@!{C zMvTS#SEHAP#NVaMF{(+E-g)EzH*};=@D6)tqG8hmsg9F*l9kXolxCG$_b{bc%t!Vs zjMtgquGls~OT6-&Zr@1W^4W9rIW4H#%DVWZSk74wYu7t`hwl+Hwj$q(AM58>bW9Dv zmsfM05ei=?M>4JC4;YqDYFj046qUarEmPjw@l@RK56jG_MC9L2-r$5cPIY+1p6(w^ zja^oAzOA5qg~BxJ`C-^P)nTCXYGR~Cajb>i-Nmb?W0QKbzU6=Z+?sOhy?XV)d*(!K z_K`!qP)p*27Y!NUDL&gb+#-Ev?ayvTKJZidz>m2t9qFWO*~ z^Q>>lLpA29tg9PS!&w&o(3sF_o+?_o8-qbf-vnxKxWpBxTW)L?R!yVR&cqrS`?kpF zTghFT2@!$ZB(yxUoBC}``r0}@w@HvQtG>a_ydK9;MIjxdCv(#Awv>nmk}Ct`d%5=? zu4G+qMO^580$cK=VZEHg#?~_NX7_$|SL5ZjarC1_AMw1W
;X7pv@y28;eU)T&$ z)5k*IMXWfCI)DBoRIZ#V>CoEvb#LP$eof*%5yzLzgnO=i&KDWPc8{fyi5(BvgNcRs zyNZs!&PhGC*OK!h{^_Y?Lhn9J2!5Xy?+T2$`h!rKiH2wLWv9%knC#E~=M5jQ$2!(3 z%Uv$bp=?*k-;I$AzSie|bKi(8+a#meAPN7I+sHEa;Ych6iysYbP~L}IlX;w6o=V%? z#Ov}Sr)%u9Xh{4u2E5*8Ytrm_MF_-x%v`;m(<9`6a^Qwrt4fAn*%Hb(l4myXy>9sW zU2(gyc3Iq@yc@5Ev`uwi<=yi1V0bCw?@zbHiNG}o*RkiW`pik-&=_(Zm)yGJQ(C4I zdgnox1Am{?4cp^nv8%RQn~y)4sMAh1gx!DK5a#-E|I=*2Ybh?L+7|egM~^Gbw3MxV z7TP6SB(%*BxGzoSyVK{_Yj{CE_J8>@yG7-p%FR?QRl`3;dKs}HUxUCeWY`O3nmilU zE&>>UlbgN=`k_qc)gU8H`KC&Jn-Qv+*#hNoD zzLMg97R!_nb?O#5EI8+lm?04#TUGOrdzPS5{U9sdkuG)6{hE$*H@8Bi#>=a%A-Ls6 zDF*yFIc9bPCbyQ6Ga{AITLZ2|IzrPG5i`hpl}HPlw@L72zr+R75sg)#H+DaCuH9~!9yf@0H*Ho!-k|v6HN2=Z z$gRsCO#fzTGda(m8G1qu$}G@bQHmbPDBTo2-HNhTsI$JAlD={4GJkV3BcsyP{GOnl z2hQ|#qJ~qJ$3Ye2xwTF%+g#lRk>u#y#BN9UA;naJq3qwDPyaR<1U_?ukEcJML;nJ0 zYbndlgXiZ5P;=nT zCkWuNAO;VVgBU!>4`T3qya0%f!Qep-ECvrl03bRBgXcp6AUXzv7XYwsEC$a9AmdmJ z9s$x}FnEkM7z`f63n1fsm>RHHIvfk61OE{q?2f_GF?wUMbU+gjOBclGj78AFQY?ZF z)&aohzy)f6MbJS_K?EJJlmBdea|9Ire+r&s%*KM}f&$!FY#oc8WBh~(E6{`^@EkLz_3QK% z(6)c`CNCD1|9{}QCF%$S|9{~55x5R?$^QVJ9|7c8(Dev#{sErLU;tp=BT)GW_Wmn) zepH49m5+d9-Xrk*2$08u%0HlREU1jY!n(XzQ27Y9=EZ`_KS1>((EJFH$AZd7P&h9Z zR2KLRG>0Fhu%I&h2%0_um5(6pBTyNuH56;#4@etpA0HN)hGIc;z9aA(Yu^zx{R1>V z0^5%!2adpYtbP2yf#&d|vZINCqcW_0aBMq3vG(yF*?_eVgu4N>{0KBZn#4E)&EdfG z!;y_gz&Qx;|6c~jf1gf28r=DiFwFXu{}XWBS6^$eMC}BtP5PNw=_M;Q3ndq)Jfvwv zK>zX)BiV zwqAwWs<`LMiK4;{lXJVD&1`+`^vG82KR1n|iFVVn)QZ~5(JJ1F)XerfOgvdN{qa3} zdmOj%fqTpJv1do( z#U^fwG1uF~@b;jJMwh&d9*(cf^oqA#tJARMK_k!Oi>0GZ=EyKT6#1<1G=;2gPGJ~u zht+ddW{O?K3tLDZ8BLz8D6e}&=QVg{tY%Ph!#agBJf(y8m1{(9)7R_1C{=)!uh7qq zBixr;Djt46-@j7ixEYb&#r8z!{24xrNVg?vOTJV+;}MSK<)-NBR}*V#wPisqrW3+3 zPif+(Tc2j;PoE;Xp(zU^G0jh6QXLXw{<$+x#w|_ z5nWRqP$nucs2vNIB=A?qg}P@Zd@-JJuSbPo|WRS;dqwkMT9pkz1DH2M*58V zi4yHqMpgR)v}D#c1wrQZNmWm%MdopRVyDQ*_5<)c__Q|_ye=p=wjS6a`}zXnlx=9s zPT)PHQhuHiC$CaFiMKNJA!iL||MEgz#eFj((v<6|PF?vE+Lb5gBzD7{5<0rBw>6j@ zcf5VYK=lK>@tQL0VZ~h`cn6-62qPc)=6Yz=8@w=`vdEA3zPV0q?s=f{^w#({C<+26 zZ>`PFEISAE-h4S;i2S^|G;cUQ>rh*+klOcx zvg72^`*8#+mNAyJGF13?`OBd>`!xe36uetV*Y2o|?S}-I zDl&v?u-ks>>M!s0p4U?4QhejqcFM^EulCq%3V5d-v~v?ecDT<$Yp2aeF7^BCTGpG z$rZ zw+L&~Y(?m+eN-tK$Pz|631N$M1jTq-2JZ^@c3;lTNO0GPM#C;QQTp#(TtmOizrR>F zy@uoOqOtZ}1pk0cKQCyF+k=#&M9(dMJ(4?yOtI$; zcL-0eFU=Q?5V6m8FAaps1$Y&{$y|WGNOFyZw|{!XQ#86qs&e{W9u1oRF1kMehZ}1A z#WS#h{uHT9P-fj{u2x;q2L?}~RbJ>i;nFX?Fmx|!HGLM__P%L0)^MFMh2JYUnlMDM zv%~s!3Cr-EI8hSSA!@ZRt&BM=OD2tt>1Xm~6 znmAM! zt*FOIJ-#zZJYX_FFPO(1&18nVu$}#c)i?!D{z+AMV6YlhWUAVf>8FTIe4(4~(lq!_ zq@?k!$zBaEX_$*@YxU0xB%T#!6>~KfpsLkssB#o?n-*#O^kzcOjLVi8LIT&z80O4# z$QhKabHd|L|F*Vu4rvbOn#&Pi*1D9zGIusOf7~CR>;N$Ybfra1^tOiA>w{6 zRum5^N_PVC)4I;lVZ3doYxU z4<5Sv#|7~j`slyrX4ZZn3AplvTdyI&r+%|%yEQMW0YY%qEj!4`kMwzCqU(jP;c9|z zeVHc_4|Erlyj3rgh0L3u)u{V;E*`qO%~fBzvyCX7B3)sawl_}gYxRis3R@kj@gLOW zJ@!=T!@Hdms?vuAT3@&%_crf8yAG4ovI*V1o?cOKAYtieDN;+yX;v~&>ybM{wu?jl z?xsXTHEvhFKO8Oi^eLL!5*C85d8eg~7V_c7bz5G=wORbGI~ShOTsS>gRIL7}eie^z zsqNBc!Bf2VY1C$?sBNJ5YevQExc8{vla0HcBtP)`68M(;WK7=)+f&tNZ$4EB$O!xp z!CmUIGk<2&w{3pxK^YZuOiSl(Lfh_Xh*yD|sGZqM%H|S=`E#kl&9M?C^3sn=%q#Qk zO%>*irsPwy0yi0RGmLDxqGU~T)wlb#}0m(W+Ww=n0blB2nmRS$Kx^L8=eN$^Ka%LV8XFK0C-th}y>?1`_KSVVf^ zd~Mk}Ym+4^hF}X0L;LIR+8xFm4s=oP4YklV zv-M0()i-sOiNNgrhNg`|?%Bm#F>R@Pg#_=e`gjG0Aq?M{E;AynFZ9L?MC0*L;}cng zxlXV(b*{F*5uCXx<5f|fFU?X@4-Yo!;t8_5bP=cfRq>ZgNwycIV;z$O{e@DV#g+<+ z-QZrh-j&*^Op^mi?PhyN8Q|7Xyx=gjGB-gb{sxw*>g>lEeJ?#C7_Y6|?i{CV!DTmw zN`~!#rc#o1t`Mju92rXnUzPxFJ~e9)p5_+Kt-Fg?VISzr!79|YFySP=}D-E z44bu#bJUqBl+Xp^`8*Hy%HyDqVkG1&H?X4PdK}87e6_rv9N)@?wQn6FVs9M<*_*(B ze_SW7C;xiVtBf>$SmK=OTf)`s0TZKNKwxJ zWc;evV;9QVwK>T9V1l{tG~dRD>`sY>7z8-8Nc z8vfi1dev@aHMtj_id9aWwz+y;$+cC@dRjmF5x1(Qu1)EiFLFd(cOse23&@tp37w!k zR$u?+d}t;AN`b*>FG5npmV^JgTT{s!qT829U2dh}$RD~MkWe^1ZXSde*clS@+#qK% zE9HU)9>b+S;kcyT8H{lR&S^{CZfJ~R2?So zA2f%5F#Sk6c);o$l&<^ryL55SfqbB-nKA3_b9Z~{Th^q6Dsf;D$_1hJgl8|6HfY8i zKWp}`weups-vY;_YfhWn{kk|KtLIz%&gb`*pVRs9B6ehHmS`gE80#mRw{bRIAtF&U zIElVoqbokAuM$dqdm*Dh7JSv`rbep!p6=3OgIKz@RfW|JIy9R@FC;pfYv}SMu4+0} z56pK&(P+4XpO-HAh5XHe*E&y#?TU*(j5S(vl)UgyQeT|9o@vySbG?tX+5&*+S+tG92cvhDCA|? z>L=WHoXw76^ez}@cKGsAY+>fs3nk-c8O#qKP9D5o=z zsz~W#W21EGJOxF{)m##rE zu%jvjgGFkBD=@(|NRU`_mlT+j0)XX~S>Nf5p9?%1Dw{(4TrMDx#3wP`ztC(8uTM zL_A^4e3b^}I2}!R2`%|vu@6&D`Bn9`o-T-G1!Kp@E5HKf%_v?+jxLJo$@CUqMD_F2 zy~2fqd9C=F6~Y?#@WpkxcTHE&a%bvJf8bMYCy>Q`t5m}7OZVZf0fTm<*FWOKFtAx#dz{y-hYVZTj*rtqm;C{tKiF6TmEoR>kfF`lprt ztJd+n|5CA0{fwMiWU)>zxmP}=OcIR@gG7dziSY=DT=t@hL%722@`su!QUSR1pFdw3 z%Dm@%{X<4XSp{3!m2&q*?WUrGcEK+}W_uwg`k&#N$6Wf@R#7>#A~by3ktC9@DgVLc ziCpKtV==buk1rjU4iK|=y#Hc0qGfSnX}%&k?^9*;+?M40ve(cz>9=_t#GApsXIa_5 zkaQIHAn`c}X)<_3UWC_8n-^%}I8eJbd>DF7upwD9{msjNnLCb5Qfkpe^Y>uTlgqIsBz$=L4<^L6O2j)$FZ67Sg4sO>1nSLxE z_4v8z52yvu3x6!5 z$52^fwoPDXoQ^hy0AxQk|Ao~Cn6&w^(dZvaa5mu=B5Ph?=IMttK#(;c{tqVX#uF8k zm7O_e{{T4h@WEl+01gRMfaFHNkl1+;?4~E|PA(7ZHUew{Gmv2BGNggp!4>b|{td8J zK=zLj43qP((FrCXh}nXv|pt#fIe!R;qm+xQb znF7PEY(&pz2Xu#yuIo^^c#SmA+f*TXca}ez-V(1ie&puAi2gpH5hA>AGqQJo=DRoH z$mB%UF83azGDL&onmX-FRH=G~uwkL*%YEB`^IXiT3ms>^`*Y1uYcN|;o@~`Jrajbr)aDWIY_-wCet9I+xrrqFPkK_fD6|$AP3iWryQ>@o6n?_VEJ5hHP1OfVoha6yw)=1o+io%OjQbm6rDFacM2unYYmt2OVC!1Je$NzP#ESh@Et#w zE}0eR>M~0{p3(l%FQ4da8CN}nvE}e!qWP!+SMiGDvRmw_{l~%QZg`i<;c-{*)Tkp! zkliO>@&;LPgI^Pq{DU^y$Mcd&HgXod7RsMRUg%hT7a|&Yt(e&y=Ha5&n!3Ze6?XRA zIZLW911Z#`5pesynyUGtkP5g|$QRzgU`zdu9rqKY5YD6o$+^16UB^g>Y7C#YT9az? zD)?j`%-uP#huUa=;Piou!k%|bpRpaY9%DaMI5!uuy`FNKml2=MU+faaS7M5jtgs8$ z@$*_utU`IbKItl@bce%iN^ZYJcDT+5i16ns@y6d+(Dbi6|3a!HTDqsC8W zvMn}K?fq;V4IiQop9Kg;M22!~#2otWx}WWNqUAp~_bSetR#JubIwi}6VB^Q2&nohw79D5(R(5ygv#F(&cFw>wy~=AbIM>!R=R;-Y*n()uBJOcJ+-yvv z&)L5ijaQDkSO;shjhvNwq?J!alx=3{u`s6aI?4Lf2~QVVZLd3RQ8@0+&dF8dCoVpS zTPnU_;yv>+Y2^dKhSbi|vm}xD?Pv(S>ZVI?Mphn!o zqMaZ}X{6>P;P%R)2v3eIE{h<4c6_AsLz8(W-f3hrGxA0sI5 zT~%hY)kz~!QUv(KR6}uTv^Rac_T8EC6XJ9angGJ4N>iENBLWXx&ViiyfqnZ!P9uBeb z-ffZa%g(P4hcvyFW2udcRp}4*jZGRZ%A~#%k<87aKT9$lUikspQMu7naW5o4z1+Kh;2A zki1VW*dxW{slCaxwHZGA^OL)*T)9tWz7buasWMC|o#WO^L7P(RB%;!6fGA zv!!D(SYkdA_m*$5THeYDPzH`5KMh^4+gLx8T(7_^k9{+0?|7C9>U;0ouJ1V#LBIPqoFQ+{~?O z`z5sd+v$ieqXWb5U6^gc0)-pAKa_46*Si?m;MO*daO^Jok=%Xj-jzyru&t!jaAmb- zhU9i7alcswrJMc;WtzIvWUmm-4RXE;H70LkVZ2l0CjPqi%~Jb;Z)t@n9K3H%?_Rxr zxg|NOIdN7O@32O*woF7llT$K2CXdk}NA?bHdr;zVjSaoHhjY%?XytMQS>{ zDe0=h30Fdw&TFlCUL#&eK(*$XdH6ieE#l~2y{TxU@L`GaZ%t}-;zE40v*^*k&OOmU!WghGml!*`F#K5Vk7O@HDP|%m1Ik}`n4OXma$(L1(*&|~Ga~9b=3tVTWtUm0zzFnd96E@MuvB8%vXzURs3Ywj!y(s@6 zUOlc-KX66+9FJ*+NZq2dZnBO+8IExA)e|1-96Vpr@i|vc6?$=Nt(uDE&bw?ktdD-( zsoa`eEIE=dWjzKi4W*>fyEbfU-Gz_kdbRQ_RZUDJ*rzfaTG zDsF~K!F*`pfa_am{OYTHlY6lHH6Lr57oR5CTlHR>gx00_JufO$8C@6nIO3LM?`vc? zY~5T^U|J>=GK{|buUN5hlQn$nLD6XHNQ&{wPc(X{Tlyx?$(QT`!$A! z7Fksn6Kg|}JNVM^`y}$lMm!tF0Zq=RB0;mX-Rz5J#qM?n@Fc3#XRK>bG<<9dAFCi^ zDpBg;sWRbgzkYG1h<+t^tRe{((nSPgF3Tt6Qxm+=8z3O-JTAD;6Oc?R|F@t2zpoXN z26pCx+o{=rU-KToc57}ZKQ|2AhRp+PI0es1fUwC3!FXYCUOp&449X|S%ge!wWaR~R zbpkXPZeU;M|4jJnPu;(++7T4Q?489vP^h5L=2qrtb5Q=zReSm>%D?Ox6A||T9u^#c z&7znE9imLDm9>=bM%481iMi&CUgV{Vx5DW?0L6ixg(no-zZg7{%2Yzr{Bo6)ouP~dNhx#%|9Hjzj)&}Re%CO6+VCx@Vt|&&{3diySO-t z{Tgc`Jb#voSXl}=0$_|M@T7=!zLdEQ_KTyP7|L0M2lOBC75fEK$nzTyC=m!>J8NGs zFvY;|<>KYzg7Ry_phA3tLYNo^{$uO@h4LRwQU47m#wIK!$O*VkNELH@i5L_k`j-sQ zC~FsYCv&uzIdHgvouxS%hy)(MOCq4|M@66%6%`=`XAiWwvn5JFMoe8ErsJdQ!RzS3 z0oAg#wu56?Xv?#A@|N znFye}5b$JwwCPz4IMzi3wCC46YzzGo`bTXB^d{ij&kiZ}PlpET7iDq>y6fMDuHV<# zV22B7V5o3(u>nU5jO+h-;{Gv^fQve@&-uW~AmC{eC}+&zg?(uMaZU`dQ1ysN#oXQD zx6>=Y3;=ipj+FtC4$QFR;fk^Zp5?909X-H2EKs0phw?_btGlDDfo--x^Zf&b$k@39 z^kCQLQHFtV0Ko9esOST*kbnzKNx=Qz*cSvYAOIZ0U`_Tj4_xsGz97#3M>&pZThT*aSxg1i6<^E)q;AMvj`paSrJ=?4W4QTUUW57_ej zXC52@9Ln%V9dJPqZu-4m1aJrXSDpa4FZmBTguuV*KnjA8&~J257?S@_+kiaepMHVD z1VI4#cV0d|{(s5iLjbMwCmllIUv=<90RZU_IyeG2QsmD(Byjl0A9)Bt canvas coords + let px(x) = ox + x * sx + let py(y) = oy + y * sy + let pt(x, y) = (px(x), py(y)) + + // ── Barrier bands (draw first, behind everything) ── + let barriers = ( + (15, 22, "Convention\ndrift"), + (32, 42, "Effort\nexhaustion"), + (52, 65, "Knowledge\ndiscontinuity"), + ) + + for (x0, x1, label) in barriers { + rect( + pt(x0, 0), pt(x1, y-max), + fill: col-barrier.lighten(82%), + stroke: none, + ) + // Thin border lines + line(pt(x0, 0), pt(x0, y-max), stroke: (thickness: 0.4pt, paint: col-barrier.lighten(50%))) + line(pt(x1, 0), pt(x1, y-max), stroke: (thickness: 0.4pt, paint: col-barrier.lighten(50%))) + + // Barrier label at top + content( + pt((x0 + x1) / 2, y-max + 0.06), + anchor: "south", + text(6pt, fill: col-barrier.darken(15%), weight: "bold", label), + ) + } + + // ── Axes ── + // X-axis + line( + pt(0, 0), pt(x-max, 0), + stroke: (thickness: 1pt, paint: axis-col), + mark: (end: "straight", scale: 0.35), + ) + // Y-axis + line( + pt(0, 0), pt(0, y-max + 0.08), + stroke: (thickness: 1pt, paint: axis-col), + mark: (end: "straight", scale: 0.35), + ) + + // X-axis label + content( + pt(x-max / 2, -0.14), + anchor: "north", + text(8pt, fill: fg, [Number of problem types]), + ) + + // Y-axis label + content( + (ox - 1.8, py(y-max / 2)), + anchor: "center", + angle: 90deg, + text(8pt, fill: fg, [Quality]), + ) + + // X-axis tick marks + for x in (0, 20, 40, 60, 80, 100, 120, 140) { + line(pt(x, 0), pt(x, -0.02), stroke: (thickness: 0.6pt, paint: axis-col)) + content( + pt(x, -0.04), + anchor: "north", + text(6pt, fill: fg-light, str(x)), + ) + } + + // Y-axis: just "Low" and "High" labels (no numeric scale) + content( + (ox - 0.4, py(0.05)), + anchor: "east", + text(6pt, fill: fg-light, [Low]), + ) + content( + (ox - 0.4, py(0.95)), + anchor: "east", + text(6pt, fill: fg-light, [High]), + ) + + // ── Human team line ── + // Starts high, degrades at each barrier, plateaus low + let human-pts = ( + (0, 0.92), + (5, 0.93), + (10, 0.91), + (14, 0.88), + // Hit first barrier: convention drift + (18, 0.80), + (20, 0.75), + (24, 0.68), + (28, 0.64), + (30, 0.62), + // Hit second barrier: effort exhaustion + (35, 0.50), + (40, 0.42), + (45, 0.38), + (50, 0.34), + // Hit third barrier: knowledge discontinuity + (55, 0.28), + (60, 0.22), + (65, 0.19), + (75, 0.16), + (90, 0.14), + (120, 0.13), + ) + + // Draw the human line as a smooth spline, then a straight tail segment + let human-canvas = human-pts.map(((x, y)) => pt(x, y)) + catmull( + ..human-canvas, + stroke: (thickness: 1.8pt, paint: col-human), + ) + // Flat tail beyond the spline to avoid overshoot artifacts + line( + pt(120, 0.13), pt(145, 0.12), + stroke: (thickness: 1.8pt, paint: col-human), + ) + + // ── Agent + verification line ── + // Starts at same point, maintains quality throughout + let agent-pts = ( + (0, 0.92), + (5, 0.93), + (10, 0.92), + (15, 0.91), + (20, 0.91), + (25, 0.92), + (27, 0.92), + (30, 0.91), + (35, 0.91), + (40, 0.92), + (45, 0.91), + (50, 0.91), + (55, 0.92), + (60, 0.91), + (65, 0.92), + (70, 0.91), + (80, 0.92), + (90, 0.91), + (100, 0.92), + ) + + let agent-canvas = agent-pts.map(((x, y)) => pt(x, y)) + catmull( + ..agent-canvas, + stroke: (thickness: 1.8pt, paint: col-agent), + ) + + // Dashed continuation of agent line beyond data + line( + pt(100, 0.92), pt(145, 0.91), + stroke: (thickness: 1.4pt, paint: col-agent, dash: "dashed"), + ) + + // ── Data points ── + + // Julia predecessor: x=20 on the human line + circle( + pt(20, 0.75), + radius: 0.2, + fill: col-human, + stroke: (thickness: 1pt, paint: white), + name: "julia-pt", + ) + // Label below-left to avoid overlapping with "This work" + content( + (rel: (0.3, -0.6), to: "julia-pt"), + anchor: "north-west", + frame: "rect", + padding: (x: 0.12, y: 0.06), + fill: white, + stroke: (thickness: 0.5pt, paint: col-human.lighten(40%)), + text(6.5pt, fill: col-human.darken(20%), weight: "bold", [Julia (4 years)]), + ) + + // This work: x=27 on the agent line + circle( + pt(27, 0.92), + radius: 0.2, + fill: col-agent, + stroke: (thickness: 1pt, paint: white), + name: "this-pt", + ) + // Label below-right to avoid overlapping with barrier labels at top + content( + (rel: (0.4, -0.5), to: "this-pt"), + anchor: "north-west", + frame: "rect", + padding: (x: 0.12, y: 0.06), + fill: white, + stroke: (thickness: 0.5pt, paint: col-agent.lighten(40%)), + text(6.5pt, fill: col-agent.darken(20%), weight: "bold", [This work (9 weeks)]), + ) + + // Vision arrow: from x=100 toward x=140 + line( + pt(105, 0.80), pt(138, 0.80), + stroke: (thickness: 1.2pt, paint: col-agent.darken(10%)), + mark: (end: "straight", scale: 0.4), + name: "vision-arrow", + ) + content( + "vision-arrow.mid", + anchor: "south", + padding: 0.12, + text(7pt, fill: col-agent.darken(20%), weight: "bold", [Vision: 100+]), + ) + + // ── Legend ── + let lx = px(95) + let ly = py(0.22) + let leg-gap = 1.1 + + // Legend background + rect( + (lx - 0.4, ly + 0.6), + (lx + 7.0, ly - leg-gap - 0.4), + radius: 3pt, + fill: white.transparentize(15%), + stroke: (thickness: 0.5pt, paint: luma(200)), + ) + + // Human line legend + line( + (lx, ly), (lx + 1.2, ly), + stroke: (thickness: 1.8pt, paint: col-human), + ) + content( + (lx + 1.5, ly), + anchor: "west", + text(6.5pt, fill: fg, [Human team]), + ) + + // Agent line legend + line( + (lx, ly - leg-gap), (lx + 1.2, ly - leg-gap), + stroke: (thickness: 1.8pt, paint: col-agent), + ) + content( + (lx + 1.5, ly - leg-gap), + anchor: "west", + text(6.5pt, fill: fg, [Agent + verification]), + ) +}) diff --git a/docs/paper/arxiv/figures/timeline.pdf b/docs/paper/arxiv/figures/timeline.pdf new file mode 100644 index 0000000000000000000000000000000000000000..ca96812288885e428c8b25d66ef379697f40ef38 GIT binary patch literal 19340 zcmeHvc|6qL_rE3kQkIe>4Mmh?7R-$7TlRe^5`(d1nQ3P1yR;xuSt>;KvXu%UL?RU` zSwbjFku2Gg<@b8cXz%@ge;&W@=llNaYhGsN-q*SJo^$TGckc7tb6+CH8d{PtDI~K< z81To;41+)+IESOm3JS~+h_tys9t)8+#<*aKU{W3D?L~sXz^{64PDF?e0u7Lx+A=FC zF=M@)z;ev+KU82wB6vHJ%n4YmDGo;hi<{y+um%`Bgxm~}*1(XkBsWi>F*7{I3!nh< z%&`PdBG@_sECOb@kvu?lC@G*)oR`aw1vEXsr2vM20(uY|8480y0;mV;MP}k#tA^e{ z12#|y;wurQ*mp`ptRG1a>ko+hgM=V8D5?z{l+_dh2XzK~K*=94vPvK@6eWeEe8^Eg zWGNqHrXVnKvk(}i78s=#*hVds4@zB>cS?nncS^04cS_ZicS`*~yxV^37qHD=JC@Qf znjTnBP+Mj|H{cw-JpnC&Nn;3^u?;=`FUtC(ct46n7UL(|0(1URh7AJ(_$R9wFp5NS0JoTEEHt)yi8U1m4}0+%7Dja?uVYyZ&{K&}5H zLs~e3r?&?Ntn|OY6BMwwzh(&3-@nTcVCXRPUo!;S>%Yv990Z2^Ylc8;`G-s6XZWC=)X7YVbH%(&DY8Q$KS{QUDf=} z;DY@}llp(PqfmdNoBwEf{;zWLH{%@k@6XM0f1{g!e`5Z-Dd*pxmH*~Pl)Pg4$MOAN zwVS`0{K(6=f0rBhU)v3NZTatV^EWdwc@g;Ua`QKHF|aoK{ij?DM_Ij^c{`BE;01&* zP;l*~h9P3Xy8!sGu(mR=*k_3KRW~rekX$9za5#5K9VcKJ=tlCF)CQ?Fy&Q2)ZXmw+ zIbQ=ybOfLd#)|~j1)v8wd6h}-L35mimm5$F3xSi@u|KX9$z;E(?l)%Dbu_@d|CDEt zQJ{#KKaqs>)bVo0eM2^;SQj@UiQo?rQw3ldRve%NXpIO?SOO@N*jJ%8ZkT!F@c;?{ z*TJB!C@=`*1QeIxh9}_&5I7lPfK6&xk*TDuK)4+n-0*eB?Vlke~eyLy9siS^jB8|Cvd03jo`O z7!XSSv7gB#9vBxQumplZk#b*)sHx%nfZjwwBmpazLCHeokd$8+9TLXF%~93M1;CzU z5mZ47AVE-2a#@N&kRXyUv3x|?@J=20Dy`m z;=BotSV?jT5s|M@&k&1s!a9AuOO%J;2{=b0z#?E+I$mxhW)Tq`FGmk=C+t_6Kar>b z_U84S0Bi%I?oA}&Jc06L+%69hqhvtHJVCx%AOtE7kXT{~L^qt5Jb<)CL^Qpe)N!7m zH4>RY`v%znuoJ)+s5)miFDJr=Fc1f@U1nf~=;Y=|+DMV}96bTO{4zKY6e=fg{01d? zQ2aVnAWm3kK%;<8|1ey>-~4gNkaGbuC4WkT!cz=zLtwB%a_)wd5ZHG@axX$D>5V=l zCtws7!SO-F0~nQn@p8d3D?ouiC5VC+@JDvJ{Bj={jAZE@oE=>;1Tr}cAcsN`P$hB- zPD#NiDFh`2r=(;kDFh`YOG(L4QgW1(EG31cq~s_m6eWeEq|lTUijq>Lq|jt$H@E^6 zMedz%dUEt8fUX&G7n1b^YQW74`>mJpI6OEXkpIc%@zv5mV=>bD(_HcM3g_z-go4YG z$J`$#A#kLA{muAs8dm>dNcp9ToDVcM`@FRY7a2dGrt!GH*GJ1h*gdbmENtRkT(O~* zhKWR}&lzo#?k!%;$K&k@IuXze>V~(G@sNmUrronpAq7^(2(2xJZyas+_kLXJK2ZMl zqeoXxZ`$~p)R8iuH*231iLI%Ku_uc}KMj)1{kJnoSn(%rfeYNw>)DqAfeB7O=S36oF%qVzQBJ)X`CC7uu$@LGk3ECB+oZ0nu z93~1!^n)OYosYNj-d1bMzcf3LeNX&Q`9hI_&O~Q%px*iGj&t&NOoU~}?<{l5X32@_ zIpplUW8%=sA1!h;>p@2QUgk5*)~2~xTj`myOtpPJ4(v(Hp71`i+srnCs5XSveVGj$Cvo#L}?BBMz93fN0u(@ZD*;kICl_NaO>{f z{`Z{B2i|BM&Pz!v9(7LEtB{Ev*ge_A3IB zs*#Rnf&?{_AcvlI2Gq6^Mtd5r7xyD$P;zu!ev`FG2f;@ie%qAul-|{N^RH1m=BoOa zJl}(BJ06r#V8I&PT}aRS-0#v_NJ)v^Ce;|5XB`OC_>~hWgI4!_#PUrn#&tJuJG7|c z-EfTmpb3_R6#dYlxiM>}U2{@(kwLIkH9vcR_uJ#Kii*%Mde;7?C3;kSc_J))Y7`0#kdEbgIbAObRbAhf(qTJ&z%)dCHTwfyV1xmTAjl-F{# z-A`D2v(z|K>~`^ZtpHO&Q?%%(pf6PC`h?=Gj_PhPzJ?KbmlhO63rDUqpDu- z`XL&I)Z6H@;T@!R`838#&DJA=HfCw}*yCS1-#L9(kA~}De~I5M^YwvITA;s-j=V%m8!RMn~y>@BV zxH=yG98X>4S(=4)zT3H*c1J>5;k&}|8&`_m3)ltHBlj~dUksX_G9dmKMjR&p!%$m#8UyoQbEzJrDmS zVL{%1=l*TyuN{uJ5Q|A{961rsGK;tElV&j$>fOix;*9H_eJ=Lf ziYpGB(mGWsilG+J zE}EM+wLh(D@cq~}k{g3MdNW)&-9xu_<{<0h$&%Gz?xF6YmB+7$d&M-|Y&LrZpSX=% zdrS2thVl92C)AkuXIia;YHoYu^2=MrcSh}ONIUn)NNrJmXB|(N&gkQ6w`J2BB*TJQ zaa5Af>*`QY8Vo84EYkodI^q8prXk21D!;%q1bGMM z#}zl3?0*2${>(JE=J?le8sgiw%}+Ru4AX#>C-8y-NCOVMWMpK3tuF*b7D&qgF9cc! zA`83_fL`Q)7aWO#Ac2>xtQ-Udyg+(1@B-7oN)xycXuy*U)CdDHAh?qSTnH#|7Xe-f z6dD4!D1nQ7R~B%CQ}O@;IfyKH3khE2yTCRAz!lIU;0^-39$+2-#DEK|1>A-~K#*XD z40wSY01yhePynw0G6gT7EP%U#3zUa)$)R8nI8dCzGtjxf1>Qx#zzd0xg8N7usF-nd2&O4WR!)Lm2WwHMq*N&>4RQ*(ft)od_cjo>7C8l{pko*a<)Kg+ zN;%jDQm1eRqsRwWC)Wx438}+2kUE7k*alLkpm7*QKKMp06!~N*62U0)At^e8QRI`; zptL}dPi~_|5!^mbMCIn$ujYAjjI;X3(oE1o$>thS{w#f)E&58IJ0!Z9-O=D)_yWc zP{a5J2mj@@$`l=XtOs?t8a)$AT>#3*8a2b0#cPbq92I#s?DoWJc=~m4fo#{s@@n^c z%6XP)%Z6#!?=AW+B^8kddI0yHT|v_1>Ic{q!Q6Juf4r;t2Jyyq`b{IuG$dbx zY1)3y;lPWP_0BWSN;@Z?)txz~on6z=pM@FT9yOf4KF2p!b7*p8tY%Z7;+}IlIadfm zP%WMu_Gptmd~ajL-;Nm>8>DeUiO)O4&FswSMESgcTJGWNxUO@-eD87VqG$vk z^2zKX^0LeY?||6QLaAr2(_Knp@TICQ1MAyChnllJDUd3VC{e%>1- zZTTDI<1k&J8%$_x8=5WhPnfJn9c^C9cW>I}?i_cbJ?gm0oF**o=E&%S`KH`dyT$HV7Do;~SKEu+CJ)vR=DHb31PB~syx4?E^XbWW%@ z#B&cPai)ro4i(;CQEnq3pa-k#&F4NyA7v2EJ2z^rpa0Nb$}L#gYvMxG3;e}Qu@~nW zamNMp44qT$sGRC_PkYIfi3o%%xr%fTzbB~8Pqy!TudTbX`c8X!PMYhS=4aIei@nFE z#p#|-KWSpJI+_yRC!X2%Ky0Lgf73Y*s7|}ja^FN%ch~Et`Ir+~o8-3XXDDfjor>X2 z*|l$HYJA`t$pnM*&C7aaxJmBWWp?@YI{rAB{m)7h%Y+fx>V#LFPbMSdatK>uS-aF$ zpJZT_B>QS~(Fa>kUcxYKwoZ^gc6#e3=O|wGy+f@%Tu*XtrrnQMlgn|&5qPvy^=W0g z`3R>|w_Uc4(>>;q)AoATcJBv%ktyNF-xY*8GR4~Hb;>q!Yi*rHAai0VE zhNLP=YhSZ|!e}0z^SH=<`HF#6QBGMOE9CtRMNOS_Co@bSES;Z`AWZC9BZ1mrXB)&v15Ue5m#jISoHKadj^ezH{e{R`uW$ zxiJ+X=n=c=_@N5=h%(utAlsKV(9NA1?_NWuSsK!^9Cw8gw6!8(gzw`O zW&g5r)STy)Tt(AHg)JkmQ#|iH9}Njw9G$xSgsQ;iLJvCW!J`t-Si0HP+p9~QhU=A| zE)p2#URleSSnAPaU&xLgUaq`8`_O3gcCGG#WizH7s=>Ik2O^A+SFJdt)Aa{9rsVSNt~2r-Y%Ra~#Z?^{&*GA6q9vZE zxn93?WZ<5As&2yRU0V%aon#CVbWO_E+)^jxrIC16z2n8BYKBuoiu*X^O2g5OMY^d^ z2ybz|Az#pTRP6Zf!m{C_cn#7^o5Mkxo;nvyy^6Qy^-aw`47pSx*d^Y+_&Km6B`#Gv zhqT+GZwfTEq>q7( z{+E{0mnzOu)?5k3iPlzjCKh>-=L_$-PX=1YZ9`I{UvVKUcpn$vkh(2paFM?xl&7A_ z<;r|kP4d)n^(hNM(Z1@wcZ%mug`21Md`>)H(^oY3LHfzr%{E(KK5uM@Y7I%fWHfv` zA=hKYQ?GPxL7ex<)!waxn+_aB?}$>Gn31g03H!7#q#zKtOHZ$k+EA?7#~MOQv#nMK zT^hmOql<;^aON?a;z~dpoU#inVw@G~VxrS9Oo!d2QdE1^FwBW$R@(7yMm}IADAq>& zI)gd$mi+n_jd~g5{ZVC!_vqdqYq>95xZX18>*tk9x9#{S_t45~X}3c6mlp36`e@98 ztSinRulw}4c4VDJD#$wM*%3x|SJ5{q9!-YtMn-a%J5#KD7CY+=So8a>CSLg8Tb7Tj zXk8`gAC9rx*RI)PlshYQW3Py?#NH>{nTX+WE6y|lb-eH1^#w;4y0|UhavykOCe(5B z>Lmd&4>2CrD*Y6$vPKECptnU7bHZEK!zFXI*Di7>ALDCVZLH?)j3`6quvHF<37l$p z=^0qLkXTkH)Xi@vXj!V{D_$rvtD4l>fBNcT_V(fPm5gec_ztX=XkT*Bi^8F3_F$>G(%TfEg-e($HcQbdu#B;{Np0KLHfm%?a`mO1KJ8B zCx>`#%3SEfnHX(p7C%B!be~@|4eg1zLTzV?#?q|j(ffwqs?N#IlFE^aYl{@P@yaKZ zR6WVv7q)uKUUWz)BH5~jbC&+95NU^_oI3mV)6D@b2F2lfY@aWLNPI|+;_y?YwsQ{V zUF>Y1Ao^UalTh@Oyv#@6x{^Tj!yNkC%J5LN}sl+vmy=17t$tU~r)<6)t+)$ov zDRu+1^}S4(Ews4K+!cEzHsla3iY53&)4pTfjXUnBEU4M(jx1%`U!>KG>NA&)>+KJH ztB5?c-OtuHji0;AA(H0qHDzPo+U*WjyGul6B}U(#?)0jmmzON|D;43Zzb$KX7#q9) zmC>o4msEq_Bg|DVxNbd9Hz01Zup0X2LJL(Djb%fNxk3HxokLm&nVIq=t4!G&l8vd8 zQ$i})v3SNvq_(C+VVK3)IVLY9kp@H3-oaAdd+PCr;u`Fh#b_ns)7ZA%P&34~6wN4- z_>5)S(Kkudz-+3$Y>6DUfAqBVlyJq`BZg-0xAB|__qPhy+H=y2GCjZG`+zOBG+gIm zu2@6_={2kUO{~*%7YoES?q+ypCU4bDa8UW@Gn_folL>p5tq$-V^raD}b-BZd5PWdl zqxs~MB4C%8tG6<90;lqhu*{jJwaZmzje)HWi(1{b_ng?sg{M-t@EUciA3Nr%SZaL+ zPVhKy6R@BWZkWt#TgR}+bn{qAUqLEcYo7pto+HQe- z+*16e8-&BlE*B2WqX(Y#Jqv%k-uXUHeQDO*`;{_HOoO2JYROu9ap>Gk@%q^H&}7y= zHy4$Fovep;cAfmVvQ&4xas62Dup=Vl$qw=A7`f=Rh})+67oz+#)79)_(NS`{;(7XE zl}{}_A`Nax?%Cg~e!84epj25aO$Yz?yhSSLHSgSDn%Ren&&n!EgAsCv`% z+?BG2pUyWFny`1?xGeCgpjvgJ=gkW1!hNL*MoCNI?cG*WUzX)kpJwvt)-3D1M;>J} zW=eZVoDe%(rMM{1%>HR0;El}os_`0!R?-^=sWaY9A2LEnbHr1jn~_{zW=0`$8ihTn zYE;U>+(lDIdpb8?vVGW^n7(=r^MU#F(?eIbmzfkC;bD%9VWTyOy((G6mujpEjXt4! zgD3e;x`N3;T868#*R*VTMuPY=Pp1aF(6(yaijNwKFdEtVanMnP#{+vU;}olnVW;J+ z)zK_C3D%>Dz8kMM{wia3JtAb&UW|*t{ix%nt0`y1HOk+4L;6H~!@A9dhPPE_oayk~ zix5qD&Q&ijYa1j@n(LU`av0Kz84swM#HtLhL+`58IZ6esQ}4q!S+!cQdn{?4+uXQl z6OiIrg;hNqEoiCg?EA`h_o|-g$cO9qgc6lI4=d3oKib^!i8=U6-cDsT!Q>_BW$U!D zt2A;PKL(~&HW{lV>hXA^@{tqcAAL{Izj7JuOIx|U?% z5u49Kr*`S-rYzSv0XsH=l9M@|p~IU4AD zC_dnJq<9y9q#*}?{6O=^&<_NR>@YKgkUf#6U`zlx1_!7Q^jK0txBkgN8601$M{{iZ zL**`uq=Rf|8Eyd)13D8uv+$FK980@8azCw@Wcc(is$G73^_uHxEqA8Oo`(Xi7nIU1U+GI;VA9r-wR`$<#{DiOFJz_lbyba=5DfDDS z8;F>hI}i2P*S7iIKvuIq@?J5|i~S{C1&`=*D)*bj(J5O>E!lk}X>G6L8ARwmrq`r> zs8=TL&os?95x|##f?^xFbc`3W)7@vHhRtK7MQonosS@}yM&wa1@1IyHtKKJ4yXX{s zld+xQj4&ft{rZ92@=Hux?CG8CY6G|Q-0kU1a^VQLA~5F3dH7U2ijdl7*5sBNzRY;1 zXkpEnZge6;@bN=W=ssYgI4BVr2PTxIenywZ^yJLJ@v^yuxD`GG^rlL};Li;XZ!o2a;%476&sQu?~)|dIT zm;}+0mddKPALV;^JebcQ8Vlq1zP|3&L6zVteo2&CJxIyn(w7^q%~E8K{#dxI%5rE%xN*AI5zrvIc` zBj{;G{es&hLC1Cd?HRUIPLZVh9&7hBWi72juCMSpUAeP%Fzkc)T-)t6`L4=P>IJU3 zIEQnm!`*3j#fnzj)n~IH7NIv6*}4UFKKx0`f7(I<4<&(G{u&BG{1*H6HBiV#8aUeo z9wYJtqK9-m0eba~Pc^V7z&#n92aW&)jsx!3jq@QurX#RR01jf%sZFMG1U3u4#=dRD z#sOsD7S<1?oPq5M%5Ks|DQEJ|!`Eua5q0D@X#=d28-^Tg25w2gWsp*Gva(2Eb3_g< z1w}y-ln84x@{uXZNhBgA?u?Qk4|KA_T^-;6jFEx>-hQy1AN=j)oZlT_Q22kegE49I zxLBZ^qrc5+4sKmxjyoTQftiLXc(0n zu@TWPVR=qr-gWl#&VkUSv0ghtB?;*x_Y_vfR9K`cg4AOeFLHfRdl<$mBE5Nvy=^Dn ze`z($xZ{RX2i?_+E@MvfwNVC-XTLPyPQ7|sMXO$#5ZxQ`xP*ar*DYc9U9@}7m9C{T z8lNu3Nrop7AN3N4?Z?G&u{04!#fv1JRa|tc?{LF+H5xHl-nY@)q9O5lA%oGLqc<{2 zRgN=;0fE{Ex3$rQe8los@7?15eCd&pi&xuLyk-yILx{))kN%YV*JpEMM8tNxxuYWX zz+WFBQjZAtGWLfPGF_d-uAeAl=TcQO^u7|JxSP)Vw(D;D7S3225%iGU!jpE-G`8E3 zi3Ix2%-+o>s9?wR)i|`hDlXkaZ70TQQMbN>2{nhEzhk)m%3QnAf8q)>{F1|M#;b+| zccu)UMT21B<4<%P>5SK89x&me&)~a!B2sijEBL+YQ~VE9wB6rS9(8$8DUJ5^Sdd0< z>Go%31g+a8#nd17;WZp%n@R9rV;_!~oIVL^!*B#Im9MbCD_3B#i z;qrh1C%7QcVcD4g$P@jSL`3t=`D&7c?&&(w5w)1%$~sQvVNpWJU{m3qhb zq1KdjTVHBa7zTbd_CQWtUL==UcjSRxh%tO2{f7tGiT*l09nRwf$fJbgQ9C5=B0OtF zRN!-yd7F+pY(FBTDrKH|PIZ$=s?JG+v8T$^j?A0Q%cEou=rUeRl<){otS*css)!kG zKa#?j%^eFp&RCkQHZ-#xb=S}LK)g2MBxAT4k>#LJn~)kcoxAo{$$1ZsjCi&Ck=uvj z!}DbsT;fZIIcWqg9un?^F!<_=TQl9@j#CNeOnoAo({kaUnhL#0?7I_ir?=5N**8NNjbmp;!UaI;fQbrHY?UOs?Po4mee-; z#>6UjIMOZZc0`o!aE_C`n85i*s$4|9OwN3{&DQ>$*B(jXuOlK{wKk6#=OHo8&h)L8 zYH}0b+pqhTK^?mLzM_|Ct|Tjufl$wgOfY%v(V_y#!Jf1;8}Xq zjL#ox-d3y{GF&SxE7gUs%9Ks&B+Am1P!~kH>ckVG@aphF#Eue+t^~16M$PFr+}o^Z zd|5(Eyp_dII&Gb1j5WW{nYg8pKKGzl2r=ITvyz}<7Q33)nJVakJL5X#-O8sn$A~-5 ze^Y4%eZSjW;F;ISbFP&|qom!e^7T*O?RvFmG_y}c)X&h>f$a?5X3wryUC&vYB2sQ; z9!%WjvXHaCJ~+4yZSE1l5f==P2;TSfv*m|!j*lsX)&_^M)~hxiQ)n#~cE5x>o*FZ# zaqAu@_H4`gOK)=qv(ZrIH~Gq^zpqV5D`!$% z-t~b}Bem+kK|GR{-G#d1wa*Y@P^RpY5HUEI)u|XW2UD^Z2t)10%ZPL@xYoAX*VBm5??)=Gf3N|%E0yG2c(jFzX%KP>j z&t7QF-Q&i;3C`YDc7J~C;+rXy{;+Y9_tEl!)t^Cb zqluYrADYf7%MK|dA1n1NKARknDQTv-%WrWdpxnV5;oUZpI-k;-TdbqldW<_bIAYb6 zs?>dIf??W;n0~FV|H-RL_sL zg&VS{vuy+Il*96Rzj%l&I@7mR(U&y6jzK({92YpNW$Ysva-2hr*|cuB&wO%}KjzMeS|vIR1=L`!j-;fevvqjS1RSPCJRaMVB5>)V z#!1mRkN0;6IYxv+-47jcA`0hC>0fi+ow3{T>>aO;0sm09ZB6Ivo{m2~yI+%V_-r5a z*&d~a)DKa4zW4ZLZM)`=MB;7RzPz4psad?V-qPxe{y1B+p5C03?m5x3vsXz;czC3? zKv%g%SXsJaWfifHmwohZc1{PaLSHurGAbeo(RC{$!|mSe&5m1I*704KEq<)DJkHIH z7fJqB(zGTzmTe#9Ayxd1!BDHY(vCpr`O}XBxC;cIJ+DZbc~8xc=g@J}G2L94_ZDt# z^jK6<8@qqo7uPi=)TPAvObtb8_o(y5R=Rk!=Z^Hi{3oSJ3<-vUn0);U?}s0;D+(;L z?AY%gvi9EGlHu(6xO$^24@QQj&o;yu_3Yi}djD|JnXnV%7UdC<~ok&dQRJO*oGq9IdHfe9II2mpSJR(54pPCst+79J$sfqIAzWa8eiwEl@o zj?|k+sAURWnvD8s*~}Td`*yqW^G3{X5q~&Vmw#nK)F9_bSbMfpB6if>b9rUwzGG5y!5%`Zn`b*1 zr2l}YWPXlrQU{LSfCqfUfH)i?Z~#aOCL;w04+cpCXG*{aGeE>NET6`_fr(cfajy+f$w0pv@nEBLULCW1}G(c32wk?C1E86f|IknDS##^ z#ekHeuq%m#mzS3I_4SqVMM&WYF48bG8Z8ZlOT*!k0EHyc-;0Ftlk_6)+Nj`bh!>zO z2v{M;0eB99qA<`VnHeX?Z)|_qoG9b_41N&I^?O1 zJK%Y6AHm(x$QAFSs;Ee=|7X&T!e7sJ{Rf4AAl)cz3dHKTfg@Vo14AS#0ms2`0D}~+ zez~Ur@GcKHSvJl!DFM%sQ2^!nF^|$iKbZbTp8=x@`1Wl;Dt#Z&K>Lcq1Yo%SY3lmZ znNM|Ks_?+MfU^-<_1|42UndfX%(tLeeZZ_BVpuQopd z08cL9lmVSiKtpEUL_F3JaK1QWJcwW(#kphYhV{h~j0sq0;PevEec!!O8g2xD9vu48 z`f$*H0~GjSR`dgkFhhY5TW0Wx7v&EEIEDaMlr!76Jn&FE_y_U&o(F`u{+fq?0^Z|a z=wyH}+F$eJfMbro=Apn4=wHgB0r%7|c`z7wQ2J*&7+mgmIvF5L_*XhGeD~KpIrMM* z!BF56SAJ;&{kyzy80dQYnGXI3Uf?pn>jf?=^IIEmIrMKjg@b{{zqauQy&zyfK=Utj z2)OKTZ6IX8NZ_A&M##wl{^Vcskg~t&6oE$n)`kq~cOA$fWq;#Y4h^17{<#gLjLdKS z0Tlf`4~hDXXB7H3c>yy3PdoqA29PKBs~rG&=wI6a&g6nmPWrhlLI(Le-S51>56DX=N>T%2 fBlcHQvpD3hZ;=DdfytdfB9D2%$YF5GImZ77zV;Kg literal 0 HcmV?d00001 diff --git a/docs/paper/arxiv/figures/timeline.typ b/docs/paper/arxiv/figures/timeline.typ new file mode 100644 index 00000000..9e7c16ad --- /dev/null +++ b/docs/paper/arxiv/figures/timeline.typ @@ -0,0 +1,102 @@ +#import "@preview/cetz:0.4.0": canvas, draw +#import "@preview/cetz-plot:0.1.2": plot + +#set page(width: auto, height: auto, margin: 10pt) +#set text(size: 8pt, font: "New Computer Modern") + +// --- Colors --- +#let col-models = rgb("#4e79a7") // steel blue for problem types +#let col-rules = rgb("#59a14f") // green for reduction rules +#let col-julia = luma(160) // faint grey for Julia reference + +// Phase background colors +#let phase1-fill = rgb("#4e79a7").lighten(92%) // light blue +#let phase2-fill = rgb("#f0d060").lighten(70%) // light yellow +#let phase3-fill = rgb("#59a14f").lighten(88%) // light green + +// --- Data --- +// Week numbers (Week 1 = Jan 9, 2026) +// Jan 10 = ~week 0.14, Jan 26 = week 2.43, Feb 15 = week 5.29, Mar 13 = week 9.0 +#let models-data = ((0.14, 17), (2.43, 20), (5.29, 21), (9.0, 27)) +#let rules-data = ((0.14, 0), (2.43, 22), (5.29, 44), (9.0, 45)) + +// Phase boundaries in weeks: +// Phase 1: Jan 9 - Feb 22 = weeks 0 to 6.29 +// Phase 2: Feb 22 - Mar 1 = weeks 6.29 to 7.29 +// Phase 3: Mar 1 - Mar 13 = weeks 7.29 to 9.0 +#let phase1-end = 6.29 +#let phase2-end = 7.29 +#let week-max = 9.5 + +#canvas(length: 0.6cm, { + import draw: * + + plot.plot( + size: (12, 7), + x-label: [Weeks since project start (Jan 9, 2026)], + y-label: [Cumulative count], + x-min: 0, x-max: week-max, + y-min: 0, y-max: 52, + x-tick-step: 1, + y-tick-step: 10, + x-grid: "major", + y-grid: "major", + axis-style: "scientific", + legend: "inner-north-west", + legend-style: ( + stroke: 0.5pt + luma(200), + fill: white, + padding: 0.3, + ), + { + // --- Phase background bands --- + plot.add-fill-between( + domain: (0, phase1-end), + x => 52, x => 0, + style: (stroke: none, fill: phase1-fill), + label: none, + ) + plot.add-fill-between( + domain: (phase1-end, phase2-end), + x => 52, x => 0, + style: (stroke: none, fill: phase2-fill), + label: none, + ) + plot.add-fill-between( + domain: (phase2-end, week-max), + x => 52, x => 0, + style: (stroke: none, fill: phase3-fill), + label: none, + ) + + // --- Julia predecessor reference line --- + plot.add-hline( + 20, + style: (stroke: (paint: col-julia, thickness: 0.8pt, dash: "dashed")), + label: none, + ) + + // --- Data lines --- + // Problem types (solid blue) + plot.add( + models-data, + mark: "o", + mark-size: 0.15, + line: "linear", + style: (stroke: (paint: col-models, thickness: 1.6pt), fill: col-models), + label: [Problem types], + ) + + // Reduction rules (dashed green) + plot.add( + rules-data, + mark: "square", + mark-size: 0.15, + line: "linear", + style: (stroke: (paint: col-rules, thickness: 1.6pt, dash: "dashed"), fill: col-rules), + label: [Reduction rules], + ) + + }, + ) +}) diff --git a/docs/paper/arxiv/figures/verification-funnel.pdf b/docs/paper/arxiv/figures/verification-funnel.pdf new file mode 100644 index 0000000000000000000000000000000000000000..b64a6ac905e8051f4421ecb6bec9e10822f0ee4d GIT binary patch literal 20215 zcmeHvc|4R~`@cxChKLrLY}w5ivygquzLPagV=Q6F*j19H5TQj$Ldm|QDEm@bQr7HI z5+X~=zV*BBnWzwOc=Q`&;=Y6ho_zhIm#bDwnI{x#( zhmH;gfkNC)o}!bLrGr2uO#Iw&5D5c}1C9U&RouMro)8%LsO@A&fLOsL0dgZ7I(d0I z9NrG(qXT39%)r>w!wc(a;(@~%xw(0Q>_%=bI9-f8gjfs^R>gSYJe^#D!i?Q9cz^9ne83F_X7_=c)QX~O^0FVz3PZZ*3sd`>O z0aj25{09+3Cu!dF$kPY(x>3^G_8V9SmwZ_$4aIT;pi~$9^ zVZB^|`h!6O2(j778Y0U7TN8pY|6_*ab^Avg}?xT;6OeI2oi#Vpdr!_7!V1Az$76sI0PmIfgvC;Bm{maufUfqpUabvHOowi=CLJ9U!ukr=OSxNUMg&y4g8_Cii>1Dvp2!tRI8-1akrm3I*B)Q9=_p zQ@j(vg##-DL~a%or213Ke+jChsS3vZd-8%p0u07}1W%l+Cf?rdr=c6+9GnQA9)1u( zC7_q$gaAr_R^P)8=K)qo@JEHn3C3RT?ttll?FDLy)GI(az^ZyUxqG^K0Br=;2gpVB z7?DWg7}%4b5J@!Ho-ha+3A7XNMW7Hs+rl9TNfgk+QV=9u8fa$(L`n(;v^Ej~N5g?O zM-l13mIu?p)`vpSP?8bAAV?_C@_;Q!LZlFYF#sk3he#s;Qvd=|KuJIZU?70?fV_a6 zka7jh1t=2*5CNoU2pq5+(hrT40&EEUfjLV8<_DAk@Ja##0}>;ECph>6IV1s6z%pS_ zacSTKP$R$}904#P!5creBP^)-tqkW^;}DAgnoAD@j;LE$Hj%&u z<3Ip334@}LKNysi-FzWdfOf?sp)iOv905T=NqtSz6XW89Rl+*}BQUY&D1k=r2{aOs znbZe7A!0CTDKIk?jJpQT$-&VRf-?(l4@E+uL{*YrQ7XX81T6QMF4)q!m~;a7eI^ANJ_?5VMs367yR!aOU0al)5faD5;84_d3wFCx3Vh@tvg*KxIP>@;9q5e2+Ml$?I$}JC973YCJ%6GM{35E4ZS z!AT(mDI`S-sgXhmVhANg3L%LhNhm3VB2fY&fC8aOl;m8bNg*<&5;24z=b}tXlP2-0 zkV0@$NR=2us**x#BtA(}NS%~M$^`~00O$iGN*JkrC=w-%Tt89?FjD=bNOHnR^+O?v zdBRBbL!wC`QvHx5Ey2k3Qz501>W3uB2_w}HNzR2-KPhq=seVYZETsA&kx(cSkOwdi za;pGOMC<%%Cs;2JaE3-SVWORY7T|=({j_p-H+PVq_>b5UiP zK`Mb0vt&cU|DuFL1y<0q|@a~-;r-X!hiox8a*2DS<%UT)+8A8X$ z&M}Ku2`*9{RGQ9bE`{+Zc7J*{zdGVlpM36R=z6)?qxHS&ZDr{JkLRXapj8_a@Yb{U z=4st+dCu;DCDDZ%nZgYnBHHvj#-nbSYfr>zsvvk(j{EUC+BGyVzgE5X!tOG{ndkKb z-;N1$?Hs=lfp%u+F238WAEX}I>t2vQe$ch+vsAeH9eE`z>=VLxv?cOQ$nDE}USIBE z_rGV6PfOSC_{zVc?!?2Ky}J*%Xj&s`uN4UUrZqakXzG0`?3la*IoRYycqAR`9&2J) zzSXLm8;K@df3Q7D&n-MU3U%cK=Pvb{a>2l%!{3kIF%)eX3~NeMd!dzb``G=)`hZA% z)R%E_ul^bLo+-cOg<9o<2Ti===@PW)BDk!+eLLgdbJis3@&yd%gDNX^g(K$&#H<3t zFm)xSvx~TuGSuwJE&^6!H9%+sA*l&bKzdo{SZc-0%>Ps;gp5wqd8tS zdR@K{O-p|NBR#=iSkA<#@G+??hu^GP=Uh-cd`5?67foy7d^Alf1LGI5HzUmcRH#fR zx@1;ob9S4F+M!(%WnbpIIOE->rfz+b{vepA)KcD$d9sX5KcmH+oTrm*C7eU$+T6dR zR-r*-fO{%&rxXQGy)x(X$%xD%=8bfQt2HZ3d@LKY7YoD#Ppeq_>p8N%f*7U?B=32K zPpS~DuFTO-=f1hPAZ71q zeD9+Fb-et1x_soWgBwf6g6yqU7Qv#g*h~*~?MQrhu>R>8=eAvsxJBh(D%^m#SyjNP z3qz#}a8~bhEhSNS8D<#>$Lax&*=vQVI~O7huob1BdbGMP+!<2*^r>yzW7O2@aMM!O zCEnYC;=BF)&sA4>UPV^Brf&J1BG3)TE73*g5!B85=&0uC2vb{*pRkff1jU z-`1_NqfP#C_miL{fsH9Wkp;F4+lT&A24P>X(b!Gzbl({r&n5UhxzUd;A$fd&E*QRt z2SaNpotxK*hZPI>LbdPpwRwnmGI-$IpMF*1)d*+7S8h5qj)X5)ObUF{+w&%H( z5$yro@RXO`1v4A?U7DsV9uyO$PlKQ5TGQ>X1oNlXq3 z_9&}(eC^AVef#v;Oejj047bm-o1zEWc-~!O(`Cii2HoWMEAId3?!{{PeO;K(FnjwX z(k>E@p@`Yfg>s59^J@=Ea^KTT>0jG;fTsq;tw=FmJPrwi3P{&LR^8M+oK!?x~R@#Pnm?^Z;v zUO5rOb~dz16b4b#=#$PrCE&cK>jMl+(g6wsYK_>{lBvN>sLd*S|K zlL%XnkpeN(9tLD41s)q-;tCLq*1IyRjsJveI_XzSSP1n7hbIFCBM!dT4N~Bk}_(P6% ziB*?vbdE(ORb%Ba$9s6YmRx1ZqtNK%mUh?9_}AOKbd-)Cra@u9P(6JcjVahsQDWp6 zl+a^Ibzf0XC-_A?mNu}QfA;ODI^?_XGXA0lzh8zwce&m>X2F_$lMkLB`&OrZ$>#ja zB=N*y`^Vht?)1Wb_tu`xT#~n69!-D$NwqY_%F`WFb42me`=EQDC|2TUE)Fx#?nW1u z+3&J=l~OArhY;|PHDIS$IN#*HUC)dw2Ok%4>b;cIb5qXn2z+(x#Ryi};qYTnll7P+ zzgxy_6i2nYH8(CZ;J8y!Sxf64A(Lqv?&s!}4>6dD7CzOF^13*BQ-M~un7i3$LUP~X z)A5VR6Gt^P;dO5Wj?s4(HSaJ~)AvlkIe+D4m|dq+$)MX?$+Bxj8$O~KU8SWmkexJ@ z^N65(J$OW0yfJseUS;6x$c|@UN4&!9Q%dskEKD$Xb>p%^rPJKOAHz96_6~A^g4A+j0*@aV_Xd2WI zOksM7?Ic)0GoM`?NHU=0H`t*T&2zX0CisXtZTU|5ks%+NWY)db;aVGAgmvLldZ|et zKgf;1g|c=ZiC(uIIOXlI|E6j1Gn>kQwtxc*b!{0RuYSaqbzUEj5m)U=*)B|5ciQK$ zZL-RkQRk?kTX))IZl)KKwm>3H^zh931Wh8H^!0Ntstns(@g7qo zO`RaQO?J)YT~|2QE`Iz`zuso95vxbh%~kZZ>q#9_9nqhf0 zw$R0uDNmh2nJfp%`8<j>rgC>b&ZCE|2&fLl^9Os}1=b*WX4H$cEN8#DWHvlYkm*qU2YwHnuRmU%KP=GZv>W?Q84hV zuc3I?IUszu-)%K_-wH_v=i}hi{pTEg3uMex$Ix6??*oE z{*F=n0EIW>=6?yhc|ifz;ure3rOY4+!1ABL^8ok^tP@b=KLL6GghmAPenMeHSPuz+ za^M$@B*J=d0G0z`Jt-uV1nd2zMgwRM9Dv)TfiE1uZAl;>NCzUluuTvf3F?zX07xDY z>iY%0kpMp_C;;RU0Y51Kt|J3}AVf&~A>hCtBH$+lzY#!A6KeO576D*}B-{=rMFir&Oh~w$7=YIS zz#RgDmXd@ZrKN~gyBW9>gGq@);Ydj+aryi2BY1FPmm?y0(g<-9z()e`i0UF*4JZZ~ z!6V||o1t!?iMCLxG?Xa+e*ru!jts>87r-L}d_dpa8o>JjjFAB@GMq;S@W_xI8Q>zr zYb1c{2V_SExX54|8Q>zra%6yu%tr?B$Y~^i3rPm|$N(N0+#>^AB$y0J2JpzBAPK-j zk;@|ict1co5`c$PBEf590FP7x8Nfr4VL5XB$bcFdpd_b}0XzW8{0~EGzwcj>2TDmK z^dE=H&9CjvZ^GZ9wX_@b23ELzVY{EDD6nNBqssD-;YI!J!nt|)B5&(lo16+G(IPSL z5}KXyeEi#L`D0=z4!KbVDMG9%V--toXp6)vrxc&e?mGU))xj#>%;&A)(E*#HjZ7M^ z!(tg_-rfC)w+N$&(hkkv17}v{X6i?lmfW9OPnYM$<#<_5S4sE=3k#U;8#~MPq2w9i zFl9(msCA0M(e2g>G_RJk-Zwoza#q@B_F1>VOeif7X$BJIIf}Q{i zjxSbBSDNM-UCj4OT)FOYDZatP$aKN-mX`Qc;TJ<%ep#}*S#iS0EHKl&JdP+NI#&7bnwHdGw4j*{JPzv{mA_`3K~V{7hE&23Z;eQ&9pU8q(r>*0ou7qhIUm!(Wg zE<7sNierDu<|jd+8*UsrEtyV1;7_LbF;CaN`I$~6{f_sEPeJU-w`P}uqNSiHaJb#zvL>Yju<_V%_IcX(0wscb8Ur<`9}T9qs2mPv#5~7-Sa0imaA{drAiPK zR$3?XKD$ijOEWQhS6C+N_kPttFDQOzm2rQ1bp;_mBM!;C;Eow|gsi!Wo4Vyl$nv#e8;IJ;CbT6V*eXrk4`_ zZhxVVnD>Vo&0cy4?rhh#Vf`|qe@{4d2Dfu<(zmb&c@pcgs8@lgB1@xCvp z(i0O0z2L|1cZrIRBt8fnAMz{R$lw22*m7q@KOq4YZ;`dlg#G$BCZyR;Di$JVe4;%e zS)lI}T3u|%owkf1=;hZAUuD=`VG4a&`Z&6|Osnp za&_=ienWk)HtI(8S-TvUs`Vc{Cn!_t9*{l zgtWB#_aE53aVVxQE0rlz4Vt{xHBwS>ReOv(J#yh&##sMK?fsVKbI9r0^c0G>8F$AQ z6gr;8*GTmYC+fWb+)#i2ngG=_^#*Z{*<#^RG;#>^IUkTD!I_6$cggS(5o4IytGue z_dN8Gz^F_gw3)B*8dEjC{a)M~5f^%^GI(xC|7DgElWclS1ZA&-%y`A4oh9xOW107< zF0`oBD~GaN(z~WfFTm^E9>~WT5#t=QyURx=t}T5cYF5C=oT8}4(pu=0N@~iuy9NQ2Meo1tFLQ5z5Z_zkd z`OS)0J@_PMqR7#fRvWWzTPl1{ddSOKA2oFdH|-r?s*6@a`JY28ci-cp;(S&q=)xQ1 z658c8DOUR7)lKJ(<)$#K=SF0};K#se3ym3$_jiu&$7X+I4@WYjrY+B@Jx++ zM2O#?+Edjkwxhk(5{_m)QV=vBS>ce*9Ei^BJ*K*3%P=11)f2?_BH=}Y;!T9lnYR&# z&ZECM!%e>?HsoFn&!->%dTt+@ht7_%^VZxzWVExHv~T!Wo9`&C0J86Ds6NkWAAHz? zMmc<$1E)bF+%Jaoi0gul7Gi{b_nS?sTourC)=`2r<{Ldhddutx2@=I&tvfkny(QG+ z#OLQ)ZH=eD4zEdMN?LzOs=o`1zux^!i7AZ{zso<&=G4ua?XTG9r;jm}%Wgx7?rli2 zsF&-YtTVaxa&T>XL~QWn#JV>@Lh%&+oq$Jn(wecu3CD-nniC00yleTNxQ-or%gp`c zR26MO&~C5#jFgKCBe&LSf`%CVV;-8^th(#=Oy@LGi~d`uhlS-EMq~V%L&#f}InT$Q zQq!EG9?i*#uL@*iqFPH2aEuQ|GIg)<#(R5qbW6BSkA82ux;P-mHYR+o>WKUf3xhW& zD8&|(tagp3(C@8R?Qc^N{8nqIvUIn89I~yF#Z@*(U4Y@JM~L2?)3UWUmJ{?$u|%hqRi)e;ABVY$G#nnDL+PeEL+&s(NlO^L5K20H>V^xe20TsZqv+HIZPY<2 zv6DLLHllnVQ^vZhG|y%nw9$*FtD?&E zi}&`kcl^BKq|$G!p{1qE>()MW_d;IsxDTgsrY60j)%7t|_^JIbC+%YL?(FT&@_&7! zF7H8J&9UZ<_YHd*>z;lRhM*CQxv`4m!IN~}9JrThrvpC3Ts(7M8FP58CBX&p+VN>u zNN;As($mUG?b;Ui4aAsVoVE|$oRHN)Mw+3(r}yT{Z9>|ZCc4iwtk<@Fp=g-Rx;V#k zFN}5#Rq>2-Y^x>q?+!Plb>&|e%oc#rQDcLn#ug$vPA735o=$i#f-&G0WiJMJM0mMi zJpR|n5z#~H59A2IEB-`|{>hz(xaR)hu(o-=3?}?Sj{ddqAnkt)LIPH8TSG_)(31=J zB9H(kvB{GXfMmA91wf(yC#>ZN1qDS7Ma>pwK$POwp8qd1_$Q?0f7akn$OVc*6NUQ; zxj+GrCny8~MT>(jDR2OFLH~ui{N^r2a*z7yE+r{Na)kqI6 z@KXE3pok{+tN!GL84^aseEtiz{1smYpg|5Qu;rRDG-hMFVmGm{2#q_)bzET zcVpPr@dEEqH=}O}atXe6j!}JUa>|`+r?S!+h`hIQQDe{LI`#Cc%4${pBcD>SD zxWKzeN3rU$s4<1=oKcF&odWS2IYQmu4iY8@U&}1m^&+mA;ODIIe5iyYZiAv$qlRMN z5{xU{*live5SmOKRz$Ifc3ggq!llfYh6hJ2=FW#DkBOWTPR*Gs!X_aLSjL$=Klhz3 z!3x%V9b&A0A2%jOy=(hG^Y$+zeDvF_#|OWjStiiP+S;FaLErSEBD+Fu)>Q#>g8TE$ zbeS32@%=%mCu$FyE4=cZ3DtTs`$);pdUQ!ebwB-Wqk)k8{y0~WC45bF&$Qd+l%!>C zSoZ*Dc%0G9B{*Nzvz$F+MhW~uoliXW3zwrKP7vMK(|M_LVwa_5*f9@qu zLid3p|JWz`@sfwb0pDrh3zr6+`T@6oDdPSRu=hjSAp*lB$8TxU4iN(R`=ns&%>v?k zcMFHT-`u8mQEUUgTbar4Xyq?6`KL+2KWlo^WPteY-n9QB1xyFP{TI;JeXISK-(~{; zv*YwXZ@U0~=j55zuWc7el1Kh#Yx{jdOr8m#pu~LtI}`ZtOyJL%0O}uhU4HDIkak_j z`xnqH-KI&q8Bp@B1bN?swCnQ2iTc;B%YSyqg4jX+nq9!4lEnV7)q-R5D`fMa{QHiD zT5O=sejI20yWnCU<*D=cBC+m17#qsE;_Bj?uT^o|t=VC*9}2||LH)&|T-~N@a$gBi z*L8huFa1zG-jm$Ngzlj!J1=da984{SrI@vsQCZ?~Gd|r_Lk(jc;eap7|#0 z_jOr*yFn&}60;xUndH&smvqCC<YrWuy)w%Z+|e)w2_$i4Cxbpd?oZw$?hn@vqk zrV(bQCR3wXa&hlZMIFzSlRibDUL|!fqR-<28e!3f2mRyf+NIEe4`_7!MOOrBwr+ z+f+p*9(v^p?FYl>cB%4MPI=%@3haw~p4#?l@6@y2^T29QcsB;45VC%&L5qT7RWjT~ zgu-N0QK*MDtXcqbSiZ%dPlUE3f?xN3hok(RJA+xA*WQFFT^;n*%{Omd3k!WSgyZ54 zDNMZ)JbPh2ZT_H>-ZrhubM1)1+R2O2RQXQ~m~oloxK}$;wAWu-9p3fG_nHe+Efs_3 zAzhQRT<3P?Uld$sFO-p$dcm_>Gd}nA#+k^y=UF(%>F23sQsP+h7#7gSc)ft0`-1E4_Q$KdPq(KFN@a#09r|S1 zF@rv6Y){o(uNEj5p>hAVN-u{6ecyw6n*>hb?}uV9yty0fTNc$h^v>ORZB0J{9eHW~ zLUE`5MzLsH70>D+Ig<-Z4;fl~9Q3*EQRu>wfEL0lAwi{?5}$Vm{AiA+#&PjpXl41r zU4BRu`tDqb@5MCwnd@2Nnwn4GkfrE~u{$S>c(LI$vhl~#cdMSF=CVF}zJpq2Z(_&L z@hjuKdPnXb@0;J&7tHMg>Uu5R_KQX1DU_Q6#S8;nL;Z+d zIaTM?IsDxRZuz%G6m*w0Ei0qy5xz|lcZ^@Xet0`Frbd^(ug=#0Z*YuQ>F9TUT%qcc}yaZF2A~Ww+9^Ic_0i!leW6c zbMm>&7IO{w4^+cSLQ_QrL~87SlVqg%Au!^BBM==i{MT3nhY@`%^dQ6$$p}2_0v@UZ zd5Qms^UJLUs?C$2voTfLl&3lC6uM1sm@`UCu^r&or8d+y4hqv_S>k>3;Mx&tdz7m^RrrRH`fhs8zaGg%t6d_CvudwejS=Gs`fA@jgKwuoYO zrsX1Ep3IWt2D9xg5Nzq8S0ft0jAkwl>LSvztK!|g20v6q9!JNK{O;^V>a@}$hSAgt zX5ve>)1K;_)qn@7PSXxGsygjrAwSwL`#$^cyM~71UhLL1Sh#=Jc{co`Nu&h7Rhv7- zwSCDWGU(2l(6!=n5&o)0yNHK7TWK!y?A%@Z{pf=yw`iG9?69+~3SjOn>Fv7Vz~Y~D zV9b@(`eG~EBeBJ}-YGF?WoJ?D!iGKd=;vha#}#j2)>F?KRQkV}p2%S6+;9A3?m-%* zO)q1{-DirFVQw`Q3yi}bY#(mbl3Tf;fv2It$>@*6CxoA;EPou$vXTG}RfFf3eSz~@ znyvu73i+=x*n;3}-Oa_#12{noOhL#eW`Rg7a1fbvN))64nv_TiPVhDn*#l=5$#Xwq zvOV#XENL=GiU2&KejKYKMF7+BA4hb_*%I@X(8bv~VTcF*i03w?CB@-VNNM1BBNQbr z2}8gUL~{dL5 z@@vZ`Jbyjoz~Zqtl9mh?^LerLz~U6LZ#W8B1lsqGXffiR-5c-3q?TnURrNA8+hde&8Fv$KkH{ckV0qzk4Ofd(Vzb za2any`sm%1_tDuok(0o=+VpXpF!mnt$iXeIZuM?ipx43DKu?9U<&%S`tDp1FthBiK zjF-)nUKlAP_&N zdo1wtxl)+{uE!%xsu^J>tnAczAG>_-C3CPxP>*7AUH$s}gX847_d6?C-`9EM)207p zaCS7aM%(ALZ}(KU6X#g5UtB_!9n9>=S7_H+XO`aSQ=)6RJ1+1^8y?D-C}yanxUyI& zMq9%($Tzm*TvWh36SoPJp)p!5B}AyuaPo$lvb(>sP0Q6J7nt?aVmIm(<*qq2(jrF`oOaWBk; z-ppL!u~`ooDlaJ;ivd#ayGtGJ5dmsT0@<(z4A?rP$y(dR*i;nLteU5Of;&;X}Ll`?O3alf)!Syqej zm1q?Ews45Ah+}eAUb+1IykVhs1SF-CYQ9j}@b#vq4PPu%%mdI_1Wm z302TciDSbTxzvS4!`ia@cw#CM# zP$zG*YiAApt@$Bai`aciUrJOwGkoga%#O=i(_cy+?u}3^m!Q3z%+Au!)FVGzo+TJF znO)n&cm`f-K;25+=AXs)p)^ZkrX}#BvGw*I&*gN(<2+jZ2Z254=PcEAmJ~~(W zqfG?D5yeV8tG?9>;n*Op43$iS+&9HnM26~q*Ao6o%Z5a4xUf>)O78fNbKF~nL_#LXP)xA zEG7pYRIZbBR669moN)2%z4&kQW;GfnDL1m>awskEhvnv~pE}$qc`j!ZD_5exm;J_1 z@9nWyGT*+P(j5Cd__%az?Rb(>@qB}P>}j*cHofed*?HxKWz`b>hb@aPnOmhEIi!zY zAgGV&?CZLrv&*VvMd$sp@Ar?fdx~f9sP^<7E9Yw7xMZkOUA;e77H_W6Z}x;iGZ^8N z!q}^pd|C1Ky1CBty{%bKx%oc{FePj87oAr(B%B=`+k1&1wr_uptqi49-gTkV1{qHB z@0S_c-m95LpU=L_LSJ9$mO3}RG(|@(%@sARmMMMWI5%dnMQJUzJjXRIRn#~{B`4p4 zibLr6!f3ZD=eKH+*XBMuMh}GQjnC)puAa@k-dkSfSHXY^a#6khEjIVgv#@B*MD>q; zEiHH~{(~`|M>B!@-qGT#i;`DI>Rdw;#Ryfgx8lWe8|^; z3UlE5PQ9qCE)|X85{j5N7JbNYKYDORzNfG{0E_RNG$^kwmp4}I?-#H(H)6Ja`}7?I zB@*H$l^A_(dkZRG&{rVZ?am2J1he=@L(~!aZ~b#|7vJLi5v`q+$9sm~;q?kbQcE-J zrKQzq60}8{-qjj@r@5?xi{5HB^E*_k0^BtKUgRJMAVCCA2vi(!4weKjfRF&r(1Z7} zKm=7~B%y#uA22PIgu&60Xb~v-Fcdhb4$fdeZs4f;zb4!~6ZjqaB!f}X(nJi61V-uN zJTZ0{PYjs!7%jmAA(D7H%kcn| zVm=;D!1W9~^0FRw_A*AmT#m#BgyeV}Jw4rJBqV%%e8hd=;%*)e5-{Li2nnd9gruYx zKp{r(!+Tj0ZQm^l)AeAn%33zH*ymLhks4)DdFz$i^F}B zHhvQYNB~5E1C)T~UEO8KhGydC<|4n@)?_9AOqI36%D4cl9xq_ZOj5ow#)0(U;v|p5 z%SwRy10G2a#6*eTc)%P1^L4WK0}WFiG+!|&9PoQLk%Y;>(K19EmVrV^WpASVB`NOT z;Uw0HLNs*}+pUXvsbBWArR0<4t1n};jLQyv3ctKx(u_8SR6OR}KsrE9@fbrnIb#05laiT#+-31^FmECyGaCU* zQzx)TtGHkY1bN{43pYSOl2m`B$pW&=081S5Jr44~^-;25eKyCDT4D=gA8}~Z-@MYmnWsPKV3OdX`*&WLG;kp4k2pyw;CiG#80`G4n)zv=)0 literal 0 HcmV?d00001 diff --git a/docs/paper/arxiv/figures/verification-funnel.typ b/docs/paper/arxiv/figures/verification-funnel.typ new file mode 100644 index 00000000..7d844b8d --- /dev/null +++ b/docs/paper/arxiv/figures/verification-funnel.typ @@ -0,0 +1,220 @@ +#import "@preview/cetz:0.4.2": canvas, draw + +#set page(width: auto, height: auto, margin: 5pt) +#set text(size: 7pt, font: "New Computer Modern") + +// Filter layer data: (name, description) +#let filters = ( + ("Type system", "rejects structural errors"), + ("Round-trip tests", "rejects semantic errors"), + ("Overhead validation", "rejects incorrect complexity claims"), + ("Agentic feature tests", "rejects usability issues"), +) + +// Color palette: gradient from soft red (top) to green (bottom) +#let col-top = rgb("#d94f4f") // red — many errors +#let col-bot = rgb("#4ea45e") // green — correct +#let col-ground = rgb("#4e79a7") // steel blue — ground truth + +#let lerp-color(t) = { + color.mix((col-top, (1 - t) * 100%), (col-bot, t * 100%)) +} + +#canvas(length: 0.55cm, { + import draw: * + + let n = 4 // number of filter layers + let layer-h = 1.3 // height of each filter layer + let gap = 0.3 // gap between layers + let max-w = 13.0 // width at top (agent output) + let min-w = 5.0 // width at bottom (correct code) + let cx = 0 // center x + let cap-h = 1.3 // height of top/bottom cap regions + let right-x = max-w / 2 + 0.8 // x for right-side descriptions + + // Compute total funnel geometry + let funnel-top = cap-h + 0.6 // y where first filter starts + let funnel-bot = -(n * (layer-h + gap) - gap) - 0.4 // y where last filter ends + let total-h = funnel-top - funnel-bot + + // --- Top cap: "Agent output" --- + let top-y = funnel-top + cap-h + let top-w = max-w + 1.0 + + // Wide entry region + merge-path( + close: true, + fill: col-top.lighten(85%), + stroke: (thickness: 0.8pt, paint: col-top.lighten(30%)), + name: "top-cap", + { + line( + (cx - top-w / 2, top-y), + (cx + top-w / 2, top-y), + (cx + max-w / 2, funnel-top), + (cx - max-w / 2, funnel-top), + ) + }, + ) + content( + (cx, (top-y + funnel-top) / 2 + 0.15), + anchor: "center", + text(8.5pt, weight: "bold", fill: col-top.darken(20%), [Agent output]), + ) + content( + (cx, (top-y + funnel-top) / 2 - 0.4), + anchor: "center", + text(6.5pt, fill: col-top.darken(5%), style: "italic", [many candidate implementations]), + ) + + // --- Filter layers (narrowing from top to bottom) --- + for i in range(n) { + // t ranges from 0 (top filter) to 1 (bottom filter) + let t-top = i / n + let t-bot = (i + 1) / n + + // Widths: linear interpolation from max-w to min-w + let w-top = max-w - (max-w - min-w) * t-top + let w-bot = max-w - (max-w - min-w) * t-bot + + // Y coordinates (growing downward from funnel-top) + let y-top = funnel-top - i * (layer-h + gap) + let y-bot = y-top - layer-h + let y-mid = (y-top + y-bot) / 2 + + // Width at midpoint + let t-mid = (i + 0.5) / n + let w-mid = max-w - (max-w - min-w) * t-mid + + // Color for this layer + let col = lerp-color(t-mid) + let col-fill = col.lighten(75%) + let col-stroke = col.darken(10%) + let col-text = col.darken(35%) + + let name-id = "filter" + str(i) + + // Draw trapezoid + merge-path( + close: true, + fill: col-fill, + stroke: (thickness: 1pt, paint: col-stroke), + name: name-id, + { + line( + (cx - w-top / 2, y-top), + (cx + w-top / 2, y-top), + (cx + w-bot / 2, y-bot), + (cx - w-bot / 2, y-bot), + ) + }, + ) + + let (mechanism, desc) = filters.at(i) + + // Mechanism label inside the trapezoid + content( + (cx, y-mid + 0.15), + anchor: "center", + text(8pt, weight: "bold", fill: col-text, mechanism), + ) + + // Description below the name + content( + (cx, y-mid - 0.35), + anchor: "center", + text(6.5pt, fill: col-text.lighten(20%), style: "italic", desc), + ) + + // Right-side connecting dotted line + filter icon + let edge-x = cx + w-mid / 2 + line( + (edge-x + 0.05, y-mid), (right-x - 0.15, y-mid), + stroke: (thickness: 0.5pt, paint: col-stroke.lighten(40%), dash: "dotted"), + ) + content( + (right-x, y-mid), + anchor: "west", + text(6pt, fill: col-stroke, [#sym.times.o rejected]), + ) + } + + // --- Bottom cap: "Correct code" --- + let last-y-bot = funnel-top - (n - 1) * (layer-h + gap) - layer-h + let bot-y = last-y-bot - 0.4 + let bot-cap-y = bot-y - cap-h + let bot-w = min-w + + merge-path( + close: true, + fill: col-bot.lighten(80%), + stroke: (thickness: 1pt, paint: col-bot.darken(10%)), + name: "bot-cap", + { + line( + (cx - bot-w / 2, bot-y), + (cx + bot-w / 2, bot-y), + (cx + bot-w / 2 - 0.8, bot-cap-y), + (cx - bot-w / 2 + 0.8, bot-cap-y), + ) + }, + ) + content( + (cx, (bot-y + bot-cap-y) / 2 + 0.1), + anchor: "center", + text(8.5pt, weight: "bold", fill: col-bot.darken(30%), [Correct code]), + ) + content( + (cx, (bot-y + bot-cap-y) / 2 - 0.45), + anchor: "center", + text(6.5pt, fill: col-bot.darken(10%), style: "italic", [matches contributor ground truth]), + ) + + // --- Left side: "Contributor-specified ground truth" vertical arrow --- + let gt-x = cx - max-w / 2 - 2.2 + let gt-top = funnel-top + cap-h * 0.5 + let gt-bot = bot-cap-y + 0.3 + + // Main vertical arrow + line( + (gt-x, gt-top), (gt-x, gt-bot), + stroke: (thickness: 1.4pt, paint: col-ground), + mark: (end: "straight", scale: 0.5), + ) + + // Label for the vertical arrow + content( + (gt-x - 0.25, (gt-top + gt-bot) / 2), + anchor: "east", + angle: 90deg, + text(7pt, weight: "bold", fill: col-ground.darken(10%), + [Contributor-specified ground truth], + ), + ) + + // Dashed arrows from ground-truth line into each filter layer + for i in range(n) { + let t-mid = (i + 0.5) / n + let w-mid = max-w - (max-w - min-w) * t-mid + + let y-top = funnel-top - i * (layer-h + gap) + let y-bot = y-top - layer-h + let y-mid = (y-top + y-bot) / 2 + + let target-x = cx - w-mid / 2 + + line( + (gt-x + 0.1, y-mid), (target-x - 0.1, y-mid), + stroke: (thickness: 0.7pt, paint: col-ground.lighten(30%), dash: "dashed"), + mark: (end: "straight", scale: 0.35), + ) + } + + // Also connect to the bottom cap + line( + (gt-x + 0.1, (bot-y + bot-cap-y) / 2), + (cx - bot-w / 2 + 0.3, (bot-y + bot-cap-y) / 2), + stroke: (thickness: 0.7pt, paint: col-ground.lighten(30%), dash: "dashed"), + mark: (end: "straight", scale: 0.35), + ) +}) From 7c7a7a625ea1bb23d11bf969a94024c0bd85eb3e Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 21:54:49 +0800 Subject: [PATCH 31/38] Remove Julia predecessor comparison from paper and figures The comparison against our own prior project weakens the argument. The bridge problem thesis stands on its own through the three structural barriers, not through a self-comparison. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/paper/arxiv/figures/scaling-wall.pdf | Bin 18907 -> 18609 bytes docs/paper/arxiv/figures/scaling-wall.typ | 43 ++++++---------------- docs/paper/arxiv/figures/timeline.pdf | Bin 19340 -> 19328 bytes docs/paper/arxiv/figures/timeline.typ | 9 ----- docs/paper/arxiv/paper.tex | 13 +++---- 5 files changed, 17 insertions(+), 48 deletions(-) diff --git a/docs/paper/arxiv/figures/scaling-wall.pdf b/docs/paper/arxiv/figures/scaling-wall.pdf index 9cbfbc4f8c0fd37a7ff8f705b515141eba3ea5c8..ff2d40b45b6135eb63ba3a29a2f23a0d46a56054 100644 GIT binary patch delta 7261 zcmZ{JXH=6-*RDvBDhLQ70-;JL2_Xp}q4zGmH|ZtxqJap~1u3CQ2kD*AJ4mkzB3(d~ zUK9kRdhqc*XT9g~`{v$j&AzUE?R(G6njbT3<}oQa)b7-rm(|G3!qxZXGha3Bx>7W!?1a6teV`rDvD7y$f_K_J&1 z{I?0gu0_JX?KdX~{B3ZkFhCIW$F4mvKv3`>ogf4t2*K3iYhn-#bQneg;QuYIP4FHU zC=A4M-lDsS2>{Sx4T%(S&_WOd@gpD*B#Z!DcVOTjB>4}5O8r4ne`4@INctZnDEtS> z{E7Vs$^L_cpnqa=f0UB{pzB}%t{@2dgFt@};h%!n%Opqg3Ks$w!T@3SF#9B&Xh%`P~l23oZq-2oP_VI@yLuSy@d%ElN2?In9#jWwbK(2}fa{(dP-?Vzefa<{f*@ zxFl7@IL+duA2Xc8-RpY622$~u3BDXR`rP=5T;9My8f8Q61a_#H|LRC~_77E%R=k!~ zP%Z{DF&RV|3eCrF>L4LCsDbkWf6She&{XOJ*~NGkDKpX?BBGRH!j%=$SO^(|)Ck`p zzD-{BCz$2&mFLpZmpYOTp}$#J44P(Io4#EUl41y$`RXpBtdw#0uBqvhdpSuy(}~lK z+{((Wtr@0rZP`?KlD<7;d0AAh%@)t}DS`_ZihiCG$25}DacC2vr}*C2`9*zK_zo}4 z%BiG>TIol_jkL+~NO$97JVQgU&oX03JJ7zz-#*$69_eWm*pGYLzKVZ&lX~50o!ouDcNItyvukvebTr!P z_><(5YnuV1ysY$#Ee-JW1i&gHG zwauKZqlilWCV_~P|AroJKTR$!K%+G1V+Xe*>@32>ReVbiI}*lQj)Nxz^#T);@!o) z=Tc4k8V4R(4;pKHIs5^2TjSK<+r_o(DIEmesa&R7F3oUXW4FZR2?PVFZf&pq3{C;f ze3Z=zv$qMWq(`TABhBD^OOwO{leIh~vP!of$kvb2IVSgu-|EAS+ziCph#+T-qH%3% zXDcd%sTQl<^$71}A^b_12AM9dP9SepCZC?&AE|++GD5yfNe_Z-eO6c2Dk|n^fCn-} z9ZPh>QKE})WPZVSpfz>WVpth&#`l!-blnH?~rSCO_}^P%23=fd}3B|2~uS*zrq=u zL5xSA;lsA1Q|L`{jZZQ^I&MWclat3plaV81b~T^Q%|9FIlQWXfE!h;cOEpa0oL3Cc zmLm{WY(cM9D!r^F=gp*j>5o32}#n?CN`{m zwO;J$*)818JKxyVDP3a6$Fkm!j~3!M7f0OsaerX-ZONq)-H^u0OuSi22rtd2%&CQ^ zMc;jWMg|S2nkc$ttrs3_3k96O?d*IEu&)nX2^z|BHv&ivj5RyPham0o<|cUf~3xGoEJQ73|`Fvw@seY+6!ZHdQ1TNOhF z!5uyHc1ISl1UiXt+~6VV?6ft$QFhr*5nUQyPh?jMePkaYuIM<3 zx#(>^JBS?b%qE!aZ>%$mF_hmuW`M%Icg$n()!2%M%TY+t6W6XY1ZPUv!TG5#=pmYi zz_H8xa?We|w$8{HH;+%6oT4myi>k=M@>SfPMyqk|{8)emf{wrfJdkxUG^Q&sPu&~2 znIzeMiIxGnuKlvMQC{52Wp{=&y*;_%hmd8p7oqz?x;ui<(&Bed>{a`a6QSAaxw_eK z+VgdxarJ%j9QJlCgTs#QyVq0Z0>WthfB^#%?{YEAdSSKrslwgX+#4ufj`#kv`zuyO z{U_dn!mK#kR~4-eGg>eGZQ5V4Van8qhg414axQ~!Qra0_($UgSoM%5HeGL!6NY*z? zUm5HMaWbv_x)i9ZjLp<YBj*1Qq9?vVZYGv5vY25BN~``94G<0ZD} zjuX?+>k%gM#8w7Z0GKDAzaE84*;RPsBEV?|>!z-z=Tx|xSFX}NXVH@MGCzeXSd~y> zXsR_c_l(5}U-<>!;)vH{8e(rW*IiD@Ri3gY$wzm8-9LU__n!J492+N+?EnjvDcbM& zTFIfvV<9j)Kj=$!`Jx`9v(QpXUv5rTV)4RXIFcgdG~5kTsvRaaV^-@SpNSnBBOtwC z*(fmzExBi%748~r#gc10qSW0<#;>2RHqG!O`8IqkAK*HjsQ(uz>xi%!dbik4T^cm`NNnaPC-HXvDXld{;ohT_;mhaXD+<2Y1xVb zNK-fC3^#f^KQ_l~-Y zwb0GMPeUG&IkQ>Dp=a%?z3dStB0jPi@~lUO+cI+#q?L{DxW0Lx;QrGfsBq+;Bv{t;Nxs7cT z?T|e2AflsIStHJte`u9bU=1vES)yeKrig9VU8scytW3_JiNmn(badvlW4yiL&Su8$ zk9EH|cwJ^C#C+?2dF#^?>dJB|o-LCXCe&imDwJIX_dWNAxMtikBwIIr(~oa_W7BPH^ipE?ER0etXiHq%16S~%HEZZbxw}V6 z4Bjd0t_NC3{p$BwO+ZX zZ5Z1R_qY>vIlh;Cs)=8pZ5arh+$}xuw&@-tu8Awv&;J-NkYR1k0Dp|?S#gOt+ zE$KgqPtlVf#*2qWn6AdCUxSZ9vh6Us>GqBnF~2O@pBE(N&pVj)fqTe8?OMtU)4qTM z<=RykKaOXzbNyW8rM66CIOPn=vk*LW?_fpY{M7y?RC=nt*LCDHob=g}nz>`nG=jjB zX>m%Ti~561&W4p7clq!JT4Cx4{ZSw-+xZ0lcp1OO&AUU#05*5ReMy# zA19%st4n7_KF>b#UTKFai%_kLO^#jjl1hsThA^Zx2s2dmKD#}$oS zHE1z&2$#E&X)foG6OG)g-9&DlSH=#1ND79O6L%;ilh)a(sz^Mp=oa|?a^fQ!8(V$u zmxJMK&gx@ls(gQxq} zA-Mg(Y=SV9vrsZW_VWX)lK{W!x7Fv)l0}wabtb|vB$#5Niqh+%{=Djk)h7Woe1MdA z)VI}0p(~1W^HnMQ)2QEml37y2@KQ3e<5^v6qhY6BrBSCVnbRqYW+WFuP(HEZjPKUQ#Ssk69#GbE?o_{?T zzkD7p4wzWMkY^qAQZR0tEMC;Sw7U34*O2*R)%S4v>pZL!?BbHKwKpZS9>$I z*I=*lDMRfZ&G*#UySJI8#38^;uPy;Heo3vE|^cnupXj zNYSi5{%r61Ov}-nm|3B3mDhw6hpRh`+Sjhap&d;Yq-4D+_nbRvt_K!q^Z28cE@ChG zJwS_TfxbyPHQ^Cse`IW6A^XoRp^yG;W@@6_`L9GpUo{f<1`KvacMYSRDFT&srnna& zuDQ5_PR^-vru0bn%>Fd9%*)!BeJGIUe3UVB9{tHPnogIsb}=f8{WXL!^$Qojt+2cF z4|`dbqs1T6&@e#kP=4dixLypwH&-k!@cOHo_mA%BO6d!E0aBxF5V%}`pYKU_{p%$j zVuJPNkfv2;&R_2-)AR_?mCe5aa1-d$ zDq7xf*P~yN5A!UesL(x5Kg(7w76O&$J}OrG;gHezLSYl@@a1uzD*rRdOl>Cba>PB{ z$b>8%rsRq4vBCm?dI@=lA(LUYfx>u6ta33&sg$Z>rZ%f1d{__dSU~vht3gvDf4*?~ z?Ta`kg&%2V+cY`R{GbM~?Z)uBX z$u~$bF}oX6rZQVas@PX?OlNWI6Tg*4@!vY6Y7^PXqvqsUN@4>S;(H@_MP7VV3zsD+ zMj;dN{7Mv`C+5qbJ-Z2&QVdPjiN1J7uEd(kIxxVsxctOh1c&1{9&=^;vb98;z_>I< z@@dzBWCxNsB`Y>BI(hVm`im#HB?=6{-XtV=eodtKcR|GTJIhI&P}1%UK0jBSQOgNJ z1&i7b)kbf&XAR~Qsde)YDRA9+wjNWlJF$X+J3lez-GineR~YNy>@ql7M-PMOx%252}r#B!vgd zvv?JyKqC@s{5lfhQ~vrl^W=t~bQ;DlU`UMfb7`$2@2Nev4+lY>mnch)0mh;b5Xnbp zPkdg=+omm`)oGSBKCa*o>4P40VP*WGuSSTSeQ$(n?F*4M0{!rW4%wp>!imf8+#X4{ z5oDQmuf2m`U|+uCNr9|l!?mP{<}HdH5oTsR_f+cC#`F5}sQOK{5%+V#(&p4JQ?f2H zt`ElpLLIX-3!R)}it_Ho66Hq+0RfyL@?Pe=6GCTDbZq!%O`?I1AtEgJcbw8$1A~Qx z_+)UB?1c%v?kkWvOhyllQJxzIxifiTxye|9mtSLq4OyZkd)kse3K9j@5v)h{6bKiM zdU~IWT=wucLSqfP?udy35?k{>1b^9D2dd%zz{XqNTBrz`oorJdS0$*vq3B9}&HNBg=RE$EVKW_@^3M_qu|Rb+v$UCj5b|8Sg_C4-u9axJQY`++LU6U|w|>rK|x z!Z>VFkjPKsy}5Cd+|U5D>la%G<{*C>5ey*@wNEpnrO@Yfyx&Jh&hpyV+RD5kR$?^3 zGDcUA*CqNzk?K7T@{znm2M*4vUU~6ENBKq;qKf7{a(hnnz`WCAeW}Xr{Eq&i#A)>m;@rP&ZBw} zz=B?QHcqFgGttVCGU38OF)KahyNhzi(&pF3cgaLQeG=F}{~BDvO`qX5-?Xr~TtQa$ z(e~atCmv~%H45Y%mOo5MtN2@Bn`0mOX>nHIc2(Umn>MyqbDbqq_e!AVP#6 z@kZ8@w>;}0%l}-0+v%9|bx@4BTAnw_tG!lhtM8DW9M;bI+R1m5b&Kw=-J74Dd2jBW z#ut|M}{h0*p4YOWq2fx954G_1;4nAtIy= z($9{5vSvei{p+7V#g8wi^A;N7bCj)3w?0L1?6DF>*8lLZTIQ;&FS1KrSkku&=vWNH zeI}#pvlFzmGdMVxHh=sXdU+)@8E=&Ov27fF%)LJ_?Q#U%55{`E8}Gs%tlOXyjNmZy zGONJ7@)Oa=4;l8lM$!bmk?Zq!dh^cw%FK)W^)GvQ5%zAGE-9_i;PMrQFgdf@dBV1u zbhqNrvG-%0a|TT>49}ZmpjD02t$V=Ic>%2lef>>?QCBBns~zHq_JcW7P)XjeJz(kt zJ|#RMZp~EGY_{7(cTPJa@Ccm=oo3q2tD^eijYv8Se#2F^c?DPVZ4LAZ+Q zY01aE^$vnRrs^(dl}R||zvEGM6Y)VG70!BEQ(eI)SM7Z0HLU5_52IqdY=5UY zPS>e6Hx(13)__nQ^{jk72`F-etLJ{j)hYY5xsx; zR0u|GY&NF4i!dAjtxf$<11H)1H%C14bVa`0PuGMxR4S5;{bs|g7CjM*!9cKzjbh7 z;eT@=2;{F^kdWYCH9#=f9}fLrg~7sqCqw^-!9Zak@V^Vgz@Wbdcs=I7IUxw-zjM*Q zCk2OrK>tl%Qy?((uTfufu@Oh{BK%ho zC=~QpG7JWUVB{4Uy3^UMz=-pI-ZWxjT!8;Q eWb%7gdU#s6d-}Rt+Yp0b!e9_FJG-p9-2Vaj)Fr|I delta 7546 zcmZ`;1yoe+)|M1Ono&YzC<%!Prf5*Qkq+shLqeJnMNmRw2x%k*22qesL1GB$F6jm( z1SI^8d;fK>-~XSr*E;)op1se0-*=rg>+IR@VhN_O1Tj(s(09@p0<0MUX?YU?0Rbft z0x=~A0);hgfe z%l^^i|K$CN{Q2l#6=3k+8U*}D;5iBLe3nDlag=OcE#`qiP zjLgoqR@(|QgjtJuLZ+Z82yJ3Djb(g2w;tOpUggD=E?pP$PBjwZ%eFpY^Fb;)#9(O> zTA7)`LND>c5$v+n#35cBohO&ST4g^lb`+^I=e>3I^}PIB`IynE;pWlB?9usorTW^K zM*q4e(W+c-Z;BXuKL%}SQti)yQS>)pAs3eOzguu{y+?X(jhpiP#A&zlRcPLw)3%6c zvd1nlRsO0>2hV_K-;2lGjv_`JLr!TIVA}ngo}^aYkzYQJ(DsEXaOHl?wT?_)woPV4 z&h!0v9!yUp2&L1eqqmF`WXC^$@cb*74~$cnG%A{wvfquK$ z$|}1{o6O}eT{`l0_X-7h3z9bZ`6-ou1Yy{p-_K~fe?AYB-fo&(b{D0HNXyH+f8UxhE74OgNvF0e zoV$hh>vc77quZOhqwbbi8@f36GE1O2*@;{_BfF}* zs>h~={Art9a$nxiJvBV6k-jjRwJ+wMTjR?U8G2LlNtK(ThPQLNE9^_~zPts0hMmFp z#KYdL?qi-=w^+A}^s2cUqZ84W(`tl`>6NM7Z#)ZfZsGZoH^3 z**YOvn5OUBeo53$Q6qJPS+o$3JgZ>?q}WS7Us8*ku-oikg3m}YtLBLql(J?M(^Fnq zThaK!L!_`5c1&azmESZlq+f?Oo_}&%4ckm2Z<80|K*KA(%^VR}2FRA;M%S(A3{YL@ILV_yVt)4vFqY99tfSejj0Y$acNg++^6 zjhhBHpx=)R2};pUR9Y+0FlLI_6SI8`JNv?@1|RzPrmg~Q$f9sZRlHBrr*p!vs_9)$ zPSKS}aji}^PkAV3CVDf=OxH2B&lzH%yHVPeB*Ny`?AcYTibQxx9ES8Zp0e#N{TWtw!KpXYTOlH0`?#&M%gp;Gr4*l(URZcSq10{2$U4H)BsD=fM z)J)nbHGZ_M)sotjZ=m+bQHd{}8>&?^9VmJAYGB~itC9gzWqn7-qzMCC#|V8`Ad8?* z_h7)JJqUo6!KnAUoD>aCmmsY}A1cJx9dAiN@JY(u)PjPB6zvL>m^A7|%!%KPC~(u$ zF=nH0Z910NO8xPk@_sO9ke2$r`L6UyrSgc(*6QK_vCilF*H!cm%f{@ghlO~RS)Hy6-2sxT34|rIC_@Z zqGm3yjmgq3Rt6NkkRNhO!EN|w+m+%S#+ws9J`WILWK$@yy*j5G@(EPz+3fH#mgW0P zI$40oCu!^e(nE}`Q|Dvs_H|ZsZi2P*G>S(R9hbAeG7LBNdwDBmDbl^cRyFddY=XVy zZYCGqwX$f0N#eY~@JA0gF(QwT5YZQ$@~+o?e_-+lrx>n$nUdLo?oiC!Tb`YvQ!-=N z*6h;M(bRTh3nFjw8MpSvO@%00C+6>TqFa5Y1=l0ICrfV&jyo4mw@0Fmgo=hF_jNN9 zj)&eq7RT%3-th6A-?-}VC@{qM6!zL($9V?*#A*uuEu@J}iPdG9`n?DBb=>X~LSGOj zCK_`MI+XL4@akx(Y1Ou7Tjaq4U6nmUV&c2hTu$m8mL00LZtmLcE$S8Q{7qHTlcCcs zr->FyTHGTS#--FB3d9XO7iQ?n<`SWlIoZX@fNycDP?2Z1VrA4v)J0xyl*i=tcw{pl zMq1`M_BHmNUwg1v*`Z&Nzw11qKUSE9Y8b`bKX$-N+5*OxN8j47Fe>R~R_qU&e2HaL zp|-9M)5y9xJA+>uM&izte0$#(lT)G_O@8Pwma{1E)*;BaJs+mVJ^^5$o1%icF$br7-aDQ7lk^j;o(YFR*_ z`QVaMGYHodKa*gFjqWwazs-ADqpplm*rCc}c>3D^DQq;n!_urn=|rpULP zkix5^lRrX@uaVX;EAyPZ#?xJS?;{81HREIr~6)7EXokE%H^sr$%T2|tG2phG~0|Ac^>76{Fwb8)!i}6*Q@q1InQGY%0vi(VX5^>zj%QnEHOFRFNczZq zhk#s2qa?qPb|$H==nYR6Z!@v57IxY+)io!jB?v5epxvHBFZvOScH-w1$p(d%J@cdR z?cE|5gwpr9$vkU?Gq!Cy9qZKn@|uqn)vAX_UdsFaVR<+j1nyZnW?ul_I^W}waD8Mf zD`8v1<(ZPIBDGomn~ShL+KV8U-PB0SiUdphsLlJ=6VitA&v6?Y_{`9L&4$r_&Qu+~ z$qW4ut1D-3ThvZ08Vt2XL!-4I??wUot!_@Mg+s3n9XZFi1OrzV0oGaUa}6h0!zlqK z-^y4G&iVZN`|}g|7)yU}d`N?EJwX1+7$)r;)<`HQr9|KPE!84mN!PvyN^R9ul z{H?`cap2R-R*c74=hJrr_gIBYF)lm?hEIzJokEmFb&WGt?j+gKAYx>8MyXDUqGNaR z`SFOGLm7}QF9sg|LSEj^*^kH34SlWrT{E69H+`gvU(I;2Voc3htHbrgu&!TtjcgV! z1%Hazah!75SQo8P&607%w|+m_=OSrLeIf4jj`Q+~+pr54o8+hLeSvi@LlTj*jkftuTo0e~s~96DDlAn2v?P3X#= z!sPWvhkOQdf34An?FHHlCl4cF3CnrAj|&Gy{pm)ZJix2v`c-e)_(lpZ1@-HN??p-3 zPj|}^VTzxWjq8}{l@*72d9b|`_xESs5A)o)+~NF7^)9}`ibd)Bk_oIQ_X z`W!`uWuMp)P$ul!9UdgFn`#2)TEe1}Tf*FyPuG`9-pdL)H+8}lUnbX?-%~~VtaZzD zO6yph3Ei5*xwGOvXgvfjk9_&EbV%!=F2vCw+bA+m!H?KiXhe{dvYmu*%rS*^OHH@% z{D5T2oyTFa1$7sB9#}bFt$|k(mKhL?8ue=PC>FE znI|pbJRp8i!s3lsB$t}r)$ovioubwdlb_?noP`;457c!T5K^ksdUqcmOjKi>X(&Qi zXl_4h8oG^K6tBY`j=GiWiZ0YfEFxdjAuVm&)8N~FsXM%s`JP8?7My|RwT9BZ)9ubA z;$yCnSXV9KTXi6D9dF~H)Qo9usccmsP+Q!ab1g~sr=&EBb3`Myw$IVW_@u6_GZnjC#s z$@{&fzr+iFi8Jwf<_q}Q`bqgbeuP-kUyC@@(R|?g^TFh7c7vjrkc#Vl7otGB)4r_j zli*GSlt5VWeE{h*(l7~r0;}!_%=wqfz{8RLN$hIh#hF^`{ATIYkSo;^>W=q*Wcx?p zOZ6}!)yV8N=EtF!(Ha;rGbStQ^6|m)qwQtkR;>PbeM3M+uW59WYpCmwFRUCvVT$|M z)q;S5kf}XgYS)L89jmtWh@qqHWwTJJ29uW${5L&+jA{joo!U;GL@)mEzC1ZMn|~~H z!mbL`q7KvqEXGx7=874WYQHxM%H?sF`kKdcF>zo$@ZD@By=PLryb}6JqxKr*PCWBA&3J_r&Vd9NQ9eE3m+mm5 zw!mqxhYTT;ctaooP?w-r2PjhRL@9lEE|rP?%5N4?zwJ!lSAx52nDLg)_nqxGcFOms zi04$J=le!Q7~Py}YF15S?sKE9tz&YIW2HgI%t;$XbV&PMHM&~!9&JS|Lm8tt_e~NP zx(aPoc4BR=m93`f1vW$NCiy+@@o{r4uSMWg|BldtE9Qch3~3lkXR9K3uDL9eJ~7_fxP$WMBoS1;oQu1D|yya2UqK!eT}HWi*qX zA_DF(+*rR_9$X8T4gLZOdSqqLbL4)N0w|D{BD2z*+;@qbtkLKd9!;SGQu4_=TX}xw z0Jhcnj1lmGOF-WAEHK(lqo?_Mk`qI^>PQlDB{ z!VBboaX~Pt_fOCYembFPPg8K@fw@aEsIuyEOaRPQl%Q6ezjDyVHA!9d;X`K`=ufuL;ty1;nCz^~e?A$M1JF#yVl7ebCtTeDdUPr0zfH8xz$ z5DeZfa!K~(Jsd4rfw_^#+O*R9H1xHrUakP0htIp>2;Dne((7lga>XQWRot|FsFWa2 zu#l3Fnf%57;Y=0Xy#Vq&E_rjU)8 z82Lf*i97_DMyXq#D!t4TCe0U8LH2n^$pm91X6Nl`F90|UHJ+;AJBYbyC|=e7alpXB zwrAFb6yQcbbc(cK52|=qoA`4;`VE@5@&gOsV3mRt-h|bDZ1~ybFdyANV%L94L+fvGWlp1h(^p`V-+uXaqC7rO2NCn+V-%QL98N` zy@4xH))-RlwKPiW#>GieUvMx=qAI6p%zRhXc8cijcU^YA-m=HmxVzNFm0MZ4>BVaK z+m4%86*<~&wOo=bvulX6ViBbX0{ z_z<^^i-tELgzzPMi`B;43z8>lDJYL-`NXE`@q)Kh!}o@yNE*9x6K^EZxCa~JaRa)B z1+;R>Q0fqz3WpE3M~!>EHma>lEWLTHevek6#(iyz<;B8H$>=OY0~FP;{3DM{(^n3r z3uto^_(nz)k6=*-fsD)3xS@DVEQv}^DZ5zdBASLo-=psqP>Vy-&*xCra;(x~Ht7B2 z-DwvlLAC3A%;naS_35l#z2>4jfgLZgi2!11PL=a=&j-fYt;fC!ouoeg;WJqqh{mbO zP+#MzP~X!JENjl)POqzCtLJAkwkVG)M*%05R}gDc!VQ*l@{6H?r9-ujm#NW- zepgQ0upqG(@6T0-CLdglZHby%CvP8b`;kYryZ2>Lo_$kMX;Iv5TqJ*1cV)!9mga%M zBu%!a^W2aq!xJiKtp2fhYs1(pmu;LwWIgbh2#5=`8(6&I#QSPh%Xwo zo2tb%^8{p);)~fG3*`cDW7Ha5$1fzGtlbLM54KAbxHm$Er}ML(Xbf~-2|@~Y_`XQB z>uqqYPu-vxY;EU||FHGFY*nn1zc(*w#HJ@lpEyV_3Z0_NcTvUF0VE3uAt@1V3u{> zTC=vA>v!W7Qdhh;Kk{QPjsOeM%Cuy>@Hs^TGs*jWpNa@|mgP&TBkGBs>S>nKhsqJW z^$Z%IPWN1PC^3kzUA5WL-H2Y^Yy~iV6%*mvNw*T*WwWih?>)}1xAGZmxPt~zZL68^ zMJAvMH+j8E+!p7pKOeib?*RP7ObrNaN$!-i4oZ+=mKFe93Ne@@&BQu`pdB3+VY6KE z=1mv9bX~)0La~bbS3NXu3xCNW5!ks_`cUZJu9;-fs_VCwy{Ydd5c(m6<^eQG^~FUemac*+FkZw{nO|&F7UpZF^GOu(q#t;v%h*#rWEp;CV>W zZrQ2n3rKY1a$^T(^HrLI^-$m(xH;4BO?j!>)E;bk@L7fT*qyrRwjdTbhFs!SzI589j0%1#~xv73R zw?RlA5Qy*R=O&g-juE>9yIf8n<4i~bMT!25^0)*0msfJf^_r#vMAt{p1LWj!8+^|W zZ4Y^P#0TN3tZ-{|IfuLU((1nuxxASwl zBl7dZ_b(0sMg4ibEE*vK z0sozgfPw$XAz=Si1M_R2NE8I}S27ra6#091KRM*zCj*9{AfUf3JP(TadB*>J zs$eJrruB~^B4E%zlHmx%-#Y=rk*L3q8jL_8Ay`($n-7t{+8=FKSI=KV%U?~*PpIUA wcKtQf{5QM{H0J%kUnB_$KH&c^{hn5Ocv`r7`ncPm$-zi43PsM%EvF&>Kge^Lga7~l diff --git a/docs/paper/arxiv/figures/scaling-wall.typ b/docs/paper/arxiv/figures/scaling-wall.typ index 0b195736..4693cfb7 100644 --- a/docs/paper/arxiv/figures/scaling-wall.typ +++ b/docs/paper/arxiv/figures/scaling-wall.typ @@ -108,7 +108,7 @@ ) // ── Human team line ── - // Starts high, degrades at each barrier, plateaus low + // Hobby spline for the steep descent; straight line for the flat tail. let human-pts = ( (0, 0.92), (5, 0.93), @@ -129,20 +129,18 @@ (55, 0.28), (60, 0.22), (65, 0.19), - (75, 0.16), - (90, 0.14), - (120, 0.13), + (75, 0.15), + (85, 0.13), ) - // Draw the human line as a smooth spline, then a straight tail segment let human-canvas = human-pts.map(((x, y)) => pt(x, y)) - catmull( + hobby( ..human-canvas, stroke: (thickness: 1.8pt, paint: col-human), ) - // Flat tail beyond the spline to avoid overshoot artifacts + // Flat tail: straight line from where the spline ends line( - pt(120, 0.13), pt(145, 0.12), + pt(85, 0.13), pt(145, 0.12), stroke: (thickness: 1.8pt, paint: col-human), ) @@ -171,7 +169,7 @@ ) let agent-canvas = agent-pts.map(((x, y)) => pt(x, y)) - catmull( + hobby( ..agent-canvas, stroke: (thickness: 1.8pt, paint: col-agent), ) @@ -184,25 +182,6 @@ // ── Data points ── - // Julia predecessor: x=20 on the human line - circle( - pt(20, 0.75), - radius: 0.2, - fill: col-human, - stroke: (thickness: 1pt, paint: white), - name: "julia-pt", - ) - // Label below-left to avoid overlapping with "This work" - content( - (rel: (0.3, -0.6), to: "julia-pt"), - anchor: "north-west", - frame: "rect", - padding: (x: 0.12, y: 0.06), - fill: white, - stroke: (thickness: 0.5pt, paint: col-human.lighten(40%)), - text(6.5pt, fill: col-human.darken(20%), weight: "bold", [Julia (4 years)]), - ) - // This work: x=27 on the agent line circle( pt(27, 0.92), @@ -237,16 +216,16 @@ ) // ── Legend ── - let lx = px(95) - let ly = py(0.22) + let lx = px(80) + let ly = py(0.42) let leg-gap = 1.1 - // Legend background + // Legend background (fully opaque to cover the human line behind it) rect( (lx - 0.4, ly + 0.6), (lx + 7.0, ly - leg-gap - 0.4), radius: 3pt, - fill: white.transparentize(15%), + fill: white, stroke: (thickness: 0.5pt, paint: luma(200)), ) diff --git a/docs/paper/arxiv/figures/timeline.pdf b/docs/paper/arxiv/figures/timeline.pdf index ca96812288885e428c8b25d66ef379697f40ef38..dbfde4b17e2fe5c45cf614b5ef1bd5883e49d631 100644 GIT binary patch delta 3184 zcmZ{dc{~#g1IIJBA*?2HEOX>4!86kHj!kDXa>k$)E&P0xK zF87t>7@}Fu-sgEf?_cldd4GR>KfmuE-;^AthdE5Iw%`C5jPOI?0)az~7x$@7q>1O< zwkzzzAzxmb6qCWjslH++=b~2uro3zNDdfc13D{>`+h^|QE^9R(j#?>xy!$Jd`JqG! zV16Z;{YNQGDrPsu=0pqi7It^0nTpGA z^_2q+VV{Qj)^VWx?3JbjSKQgg{809ItnB$F)*M$}Hk2oF0>ySs;`WIie-Yp9c}IZk zYG%`FZ^xMJoG>fPxQ(OJz{8=;FdtbgDv2aejc>|^Bs%QU4EWn6$ef^9UrF#;&p-ui z3pq=Fre7`WI%J~k2f=7)rciz%sUDtUU?x4;q8@vF#X`{B%4B2$7sz6XZGHO<@DWMo zZ(&|B9`3cW1Nw`hgjvn*cKU7R7Efzl^yZXb;jTtDiE}Zz zFFlKs$e0OKi40QuCXxyu4tp48FiH6ooHkZuB<)S;+ybb)_vSY&_&On=e|0PW(3mZk z5YKMK*55Pi4zT7V5Y#Pk`A#X`dqjTP9K3y&8Lb)>!qWAaedqfsbY`M6k-VZ@;)mq< z9yFdB=#&ec0&geov+A}Nu%H@1?>R!%gcgPn&?%EU##%wCfzT!<9f1Xo&}@;#o-*jv zKU|$p?DNHr54T-A`3e>6OQA!(ZN@x># z7fIjUIrg}etsPsZ60O_qc|Nm&yF(2mMGX;f=eMhuTkG-R&~|Dp2rD5gbw#@0dQDKc?LvKZnzhM7wC}eda(y4J%k}yD*L+~AyO;@W?*u= z*-d*11mv7dO1TA#O%f{4fCXbyoEXMUbZ$+6nv@9{#kE&7vsP!U@)bYC!=EQ*N03dd zOp4tMQ$l+6#~dH3m|*U5GVu;5Jn_kh)p~KRhQ)ykhc40Qy!I{6X2n$D5+pVgwQnd? zAJliL=#1bx?6v;VNQpc&!>hyM3=|PDvLg<#D{bX;0}L)bNfB{cR$i}53EU2iH~KZe z0qYdvg`?hVXm;i6pcg&ohunwPw#Qsud(Wqm5cc&4Z>cN8_DDo&nr|V2}vY zZDP4ti1Zue3s|DqGBu8&6zcL>kt3@$N)@-RJRV{HqVs-g%Xgz0#lmD{1N8FvFmy!e z+|NyGDCbvfZZmSAIuOdLVH<1(G)mMU1UkT)@@m z#=L%-F$lPq4{^UBuVkLKlv#6V#XGsFFBX&o%5dYFB+5^ddxu!>MUeA}ok2 zSw4n>Y^Amoo-?ZF3M)fZx;-}Iz}Yj~Dx7Q67^_x_HY=+5c53T$I|V#{+C63b1KB~keRUK6R`T82F87t83MYka4K$D!7iQP!XZ!=P z;W&xCR%YLsyZ3ydOasQXlowp!)`V^zLd68{3utS2$r`)}p5sr^S{1!0|4A10oF=Q~ z-L1RAWN>QR8o=bv`^$+hx8xpB)z-m(*qf_e5o-}ce&1NNwE1B_{Z9Bn3tKqsidp|g z4o>&`?s|Jaqckzp=h?{$Zs*VzKXB%H)$EqS6UnDdv$pnK_Kr19bmKJ4SXH zeOZ6_=wdq2z{P6&sw~xRN?vaUBTyP+#zSF%yKgSLDmHE@RE!$U8e*42Vu%w>3f`A-=Wfn@_B4b#1x|{d+gA{{BhBCj`91+>FxVdcijUz&}b>?`zC*@fjBk z;C<5?jte^l=d`wlM@BWDhVO0dB#86>KpRJpTMt$jDe8g=(~)xKURiL?4t1;St9Ims z<4C!FuPh5;d_WPd+PU1&0@PoXY9F zy2whO)v>1{nE^$$u9g0lfbEJ}k}VY`Ij=t(QqiIVdi%0vjo6Y4$#I0b)@-@{)sg99 zjKN96Pp#EDZd4sE+FXy+jJ9;?*KolY?T&vuZF#wQ@EwP{LakuG+FcWUYpJ04^Fi@( zVCFY0CPGZ<bsy%ERr+ZA8e*wMn~uk9(6E-iPqaJpAD2>&z&9TGnr_~AQcOp^5{CsS z;a!qR;-jVxM(t}P5RkgZ15|XS9s3s5CR3M%pJ8;{zI-k3mXuR1_A;+yO~+k6OLwZ; zEy)9EJn)i(m!6RlimX0DofI%~?1L0IXR7Cp-`Gljp!f2C9xSO0^yP&`P$`(vdoJ6z zIjHnjah|%(NYX*wW>n7dmF*+kW8&S$*zhZNWa-*G>h+zpz0#tU0lK%ew_|{t0{weB z`0I5f!1Z)gZQ;UQn6%5>IF6!WmpNYAz}vfp3exF6|5kzSm?XJWUzk!U^Fr0%G^z!x zydaS|l37+LZs})F?W(dnTRGuTX%CJoW3BL&u{U~osIN0zxslqTX2C)n_!vPd7V~K| zHl^l^zaf&-KIzy29IC>e6VMmrE@3sp>y6GPoEm<(R!G8WJ{|8U#g&!r5M+YF5yvMx zOSWnQT#iaPQc${T$8tH2649}qNaj11OtagGlV~FZi9UXLa~>tz zCKxMvL&X$;*#`X9McEe&2yOUjc13U>{WKEeVa7MiBF67wIVMK)FOZl^BWi(=rx7{& z8^x)wzY}`OIhLL>o5-^hv2e?4-RmYBQjI-OK?S24RuUDL&+ce3piw`0)wtENd)0g= z_#%b;6iA;a!uC2l=(SaXMfF~>{ow(hmAhSB>>jOWx8Rg|viK83`?Gt(9JpG_Fq^8AJy(+hsOygzfBG23+U;=k&7=+W2$gJ<&IR2OCS8cDRWndjOwx6`t$vWv-e@3 z7_VBuXw(3^-YHP{$s9Xj(ZHHTRZSH^P&1VJ2VT1xW&>DN)inS54vpdeB~@eq*hbwE zZXN(Pf?0Zo7#OREtDuoeKB0tWq|E;M1& delta 3216 zcmZ{dXEYlQqlarp&7ekUYwwvLF=E%M8Z~Rx7Gk8*nAMQtZA$Fbf2h4mQMZ)Y&oP^+N5yMUvwo z$hpHVG&9aqIJ4XI8Uvuba61u2C&DZ*mcg)kOc&0lVfft!QST=icP#)l1YVSP$k3+3 z4^HnaVNN#4k8bP8E4g=TUu@{HD0VzGOa|n0pBoIv2}me$?6Q;YAfrw$IeP$R+ z^8Gh_sJ*!R;|J%1p_qSto9v`BY|4H~#*Q~I(TO#SBE;yVGMcW60M;*>LKWh(QKMwk zss}|T6iRic8U??xh&Ba`J z$5*gRmJ>03GAiO9Lb`#xJAZRnSRyNEk9Tsp{T9b%JHwfS^y=MVF>18zT^9`qR%JHycw(+ z$A_o9Xl1n5?I5i_ z6Z6~4#$>T?HLRX&RGk^epUiO5nErnqVRZ-aytd{{HB@wb*yhjO%%r@M!jP z+%5U{;9yLzgdXT{yNW*0i{D*DU(v2KL!Vxvz%1$h#-uKd_jL}tcAT1tIUqkrJ}5q? zt1bbhCvC~^P5_ki;FX^Pn#v89_xYjkBO^@GO~FZkcv}>UlL$$~fQA8KdP9L41T0QB z=uhBZPLHovWAaaLTIHe@%6};OMVKkVLdKc75yO+J7td8Np;j@L_Xe>;^60R8|56j;@16bn^B;%M>lTDPvg7D<(UeG}Ik^uWMSnMOf zA3DRoT))K{dc;2!(k#F}NJKheDl79#1yZAgmUZUAT?#LN+sS@O-oBGa`QBIJ-gqJg zDZeU%4XBMxBEQbBQ4c4wjvW1iO>1NFhpJOpM?!9|q_nL_Jy}V>?LSwORDI#g%60SD z0=`nk%+4-27*9lTFML8CXya9+pibz+HGHI$YDd%|5#J-#Vsk+c(N)&yTbGxF3XMT+ z7B?-Be=N$lVP_4@yCFN8RW6*;QJB-}Io~&@br!g73VMpE*F~F;CDg>4V!&ZFX#Kg_ zBvm{;%XfWUr8)5B(=Wnv*2)X1hGW0=OPDifQ_9|Ve(djLuv`=E@GZl`aeatk&H|Fh zP);WT1PQ>kU~%TF6+Rx)LYbrb+^Bn{T-E9v&QV-7ql(BkU@Y92W5c=%1M&Ri$4GQA zP-mF|>7OV&c;pI=O1QWgYWOtMMV+`=>S1MkQ1wA%Z7n{h?OXjUwB`rt3TpdEmU#o~ z%!i+N>-C?l_KK=0&LKNX)pEa})My)MY7^jQ>0vmeA*c>~tHFLQTe=i*ckd?;uOn>) z3#I|BE0g4NV-JA0>*vDcaMm$aI!Ql6Roi%8Wb4`DFRQ*3+ztv&f$yS;Hw_K}q0a^1 z>0D^^f3*{u3Q5Xvf~1qsPWOSl(X3erVN?AzQl>j;+6T^~ z*vlOTUaeI8DA!QF{_`L2LXUQncSPfO)FGF)hKY=RC29efqBU|$P>v*GT(K>q3IC`w zM5BL|)f>w06@EhCe$|CWsb;x^e20hiP5RzXZ98qPz4$#s8ENTy>Q>mr&sVWV9q$!- z5{3S;Mk;#|QHvp5Quoemu@l$qYZrIqrDp>j)q6UNGid_w+`IDCJ!QVWZ({CEpFJ-A zb%{427AsaC=Vc?Lw{!07$m6E;Dn2a6uy{3GHB>rBuBlon=Oqy1~hd@Npd9IDVAFi%|Dz3-kAYe6ZdAFW;yu%34v+qrJVwi1FlIlYSoTL%l}izU2GLCPe+&&q1SphBs5KFPB--Xt*bq5(W?YgQZD4L=EdH+S451J?`BGqq{s>5*g@i0cpk>e z_cW!+%n-B3mYu_rv+X$o%gF(c@pAM-tR0l>j`+_OY+Ybw3bPk}IZx@(+Yit&7icT= zcEN{C!vnS%>b==*{U5MdDvF|+v?^FlP_%D1DLuA>6xu?#S0A8OYiBRXiDO3|C$B%O zgFk@ABu@hEbzR>Ca*XWOAT+Av5T-rO^GiqB7})M5X-Jd$Be7#v$)XF)qp6#H$7-NJ z{$3hgW$$#CY1v%OcCdGdTzGU$%-LWiH7P^erDs`)c8OgnsG(O^PS*M$i&!ZYAV?2l zC$;rcH}Ze(LM>LUbJ2nAOPBMcaf0rnMvLQnuyzaX@bfehNQ$sbp-iD;cllE4w_-58 z#!uXqilkZG=1-qV{2C)3C+j+f_dMEtU;pQlixi!3x-cXBUC${E_&t)LtG2A2 z_qZvzJ}oc0RKXS|7I}OT-R^}3qe*K8R06TG-V8>3%o7`ny@X$DLhNoa?fIaJY8DnJ zzI{M!uj=8t78D0Ds@KS^M^w7VCQK{IK{?bfbg4C>=8@~A^x2u{U>QxhTg)TfxGM5_ zki@YcBdL?IVdQ%*StJAG!Z3=J%TW2$F zMC||Z)_jZrKrc~V%h5faC2NGjBDNXyezYqcczi>OAoMEVC`s}#X!HFK3~)^ZgLvrf zgA%RUwW#qEFBbPsYYmxMioyi&j%4;tQYO>1?uNOhEhtY(=?oV9##Kdh3#z%zQQZ@e zZlv??dt*Jvul|k^BQ$mndj1xDswfKHn!iKV(Ggu;>#a4{9TU}6>bSTB%L%Zrzb~(t zq0?UY3WVUWx!}3x;^F{I_rbfFW@G2{IZw_=RysbvPeb`nq8ydzY|I=;JDS3sfZ(>i&Q{}+3Z}kVyyx&MX6UONYN-)S z34zP3cQi+81m7x(sh()cWy-d^?OAP+wYfUV4ih?Kxuq5L>}nHZ=fISfnL1$g`s3R2 zUfN)?)x4}6yx%o9B`%R--`-Gp<^=pyY}%cjFCgwc+|YIwNgeK!NdB^0Zd8+^GVAD4 zxZ=pg)Dk@!Cj52M&NZOb{cWw|yTHDGY+hRM#Ecy#sCkT1)EnPEhA}(8WDRoRm~ilG zLwt4Mp;sYCLQ3vq)C|~n?Klu#W3nojgwomgulw^I+9w?od~FRD@rFqjMon+b1quPRw2egGD{bEi?Q^B=KVq+_#VLNAC12nt}S zoHD~3ebukGB=1+a$9;~TUX?#9SCPd=uE|V~*}_0)DIWEh)Rymc2-gT2d-&voZuh*; zfnG!y{a*@?j{6;`Bf3o?ug3AFgyH(Y*{MwLuM;Z7TF~|~cbJ83^WwZjyYa>~oq}h& zI_rO<0~|1}6%P;;D*LK}_D_2`hbM0DuI@HYbNKU3zkhkPX)u5mp(Y%Fc^{m=>?NLn zcp`_N)7eOK2O@&cF7C)ZR#cA-61n6!m|;@N%24Ib`lYW!4~D2|WXT#x{2Q8W4RUE2 zz!1&e`TDr=}^{WdqY{QqBIFd1R#|Crpa>*!oZJ@E{C5*g;}cU=Vv NRsmg?khp7Q{9my`3+?~_ diff --git a/docs/paper/arxiv/figures/timeline.typ b/docs/paper/arxiv/figures/timeline.typ index 9e7c16ad..e2fdf526 100644 --- a/docs/paper/arxiv/figures/timeline.typ +++ b/docs/paper/arxiv/figures/timeline.typ @@ -7,8 +7,6 @@ // --- Colors --- #let col-models = rgb("#4e79a7") // steel blue for problem types #let col-rules = rgb("#59a14f") // green for reduction rules -#let col-julia = luma(160) // faint grey for Julia reference - // Phase background colors #let phase1-fill = rgb("#4e79a7").lighten(92%) // light blue #let phase2-fill = rgb("#f0d060").lighten(70%) // light yellow @@ -69,13 +67,6 @@ label: none, ) - // --- Julia predecessor reference line --- - plot.add-hline( - 20, - style: (stroke: (paint: col-julia, thickness: 0.8pt, dash: "dashed")), - label: none, - ) - // --- Data lines --- // Problem types (solid blue) plot.add( diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index ae932e9a..338e66b0 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -22,7 +22,7 @@ We call these \emph{bridge problems}: projects whose subtasks are homogeneous and formally verifiable, but where three structural barriers---convention drift, effort exhaustion, and knowledge discontinuity---make human-only execution infeasible beyond a few dozen components. AI coding agents can break through these barriers because systematic verification constrains agent output to match contributor-specified ground truth. We demonstrate this claim on NP-hard problem reductions, building a directed graph of 27~problem types and 45~transformation rules that routes any supported problem to a specialized solver. -Over nine weeks, a single maintainer and AI agents produced a Rust library with $>$95\% test coverage---a task whose predecessor project took four years to reach 20~problem types. +Over nine weeks, a single maintainer and AI agents produced a Rust library with $>$95\% test coverage. Agent session data reveals a 15:1 ratio of agent-to-human messages, while an automated quality gate rejected 75\% of 322~batch-submitted proposals as incorrect or incomplete. \end{abstract} @@ -64,7 +64,6 @@ \subsection{Our Approach: A Reduction Graph} \subsection{Bridge Problems} Building this graph requires implementing 45~transformation rules, each involving 50--400 lines of verified code. -A predecessor project in Julia took four years to reach 20~problem types. This is not a failure of effort but of scale: the project is a \emph{bridge problem}---software too large for human teams to build and maintain at the required quality level. We identify three structural barriers that make bridge problems infeasible for human-only teams: @@ -83,7 +82,7 @@ \subsection{Bridge Problems} \includegraphics[width=\columnwidth]{figures/scaling-wall.pdf} \caption{Human teams maintain quality up to $\sim$20 components, then hit three barriers: convention drift, effort exhaustion, and knowledge discontinuity. Agents constrained by systematic verification break through all three. - The Julia predecessor reached 20~problem types in four years; this work reached 27 in nine weeks.} + Agents constrained by systematic verification break through all three.} \label{fig:scaling-wall} \end{figure} @@ -100,7 +99,7 @@ \subsection{Contributions} \item \textbf{The bridge problem concept}: a characterization of software projects where homogeneous, verifiable subtasks create scale barriers that agents can overcome (\Cref{sec:bridge}). \item \textbf{A verified reduction graph} connecting 27~NP-hard problem types to specialized solvers, with emergent compositionality through automatic path composition (\Cref{sec:graph}). \item \textbf{A skill-based methodology} for mathematical software engineering, encoding workflow knowledge as reusable, versioned scripts that decompose multi-file tasks into agent-executable steps (\Cref{sec:method}). - \item \textbf{Quantitative evidence}: a 15:1 agent-to-human message ratio, a 75\% rejection rate on 322~batch-submitted proposals demonstrating the quality gate's selectivity, and a 9-week timeline versus the predecessor's four years (\Cref{sec:evaluation}). + \item \textbf{Quantitative evidence}: a 15:1 agent-to-human message ratio, a 75\% rejection rate on 322~batch-submitted proposals demonstrating the quality gate's selectivity, and a 9-week development timeline (\Cref{sec:evaluation}). \end{itemize} The rest of this paper is organized as follows. @@ -456,12 +455,12 @@ \subsection{Development Metrics}\label{sec:metrics} \includegraphics[width=\columnwidth]{figures/timeline.pdf} \caption{Cumulative growth of problem types and reduction rules over nine weeks. Background bands mark three development phases: manual (Phase~1), basic skills (Phase~2), and full pipeline (Phase~3). - The Julia predecessor's four-year trajectory is shown for comparison.} + Background bands mark three development phases.} \label{fig:timeline} \end{figure} \Cref{tab:growth} and \Cref{fig:timeline} trace the growth across three phases. -\textbf{Phase~1 (Manual, 35~PRs)}: no skills; the maintainer issued step-by-step commands, established the architecture, and ported reductions from the predecessor Julia package. +\textbf{Phase~1 (Manual, 35~PRs)}: no skills; the maintainer issued step-by-step commands and established the architecture. \textbf{Phase~2 (Basic skills, 9~PRs)}: initial \texttt{add-model}/\texttt{add-rule} skills reduced per-task human involvement. \textbf{Phase~3 (Full pipeline, 15~PRs)}: complete skill library with orchestration, quality gates, and multi-agent review. The current codebase comprises 54,599~lines of Rust source, 28,343~lines of tests, and 6,362~lines of examples. @@ -611,7 +610,7 @@ \subsection{Future Work} \subsection{Conclusion} We have introduced \emph{bridge problems}---software projects whose scale exceeds human capacity but whose homogeneous, verifiable structure makes them amenable to agent execution constrained by systematic verification. -NP-hard problem reductions are the first convincing example: over nine weeks, a single maintainer and AI agents produced a verified reduction graph connecting 27~problem types to specialized solvers, a task whose predecessor took four years to reach 20~types. +NP-hard problem reductions are the first convincing example: over nine weeks, a single maintainer and AI agents produced a verified reduction graph connecting 27~problem types to specialized solvers. The core insight is that three structural barriers---convention drift, effort exhaustion, and knowledge discontinuity---make bridge problems infeasible for human teams, while verification ensures that agents cannot deviate from contributor-specified ground truth. Skills encode workflow knowledge as reusable, versionable documents, lowering the contribution barrier from ``knows the programming language'' to ``knows the mathematics.'' From 476619ba4d3aae4c64dec853a172640cd111fbb2 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sat, 14 Mar 2026 21:59:25 +0800 Subject: [PATCH 32/38] Move reduction graph to Figure 1, remove scaling wall from main text MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The reduction graph is the most informative visual — real data, not a conceptual sketch. Lead with what was built. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/paper/arxiv/paper.tex | 32 +++++++++++--------------------- 1 file changed, 11 insertions(+), 21 deletions(-) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 338e66b0..27139be6 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -55,6 +55,17 @@ \subsection{Our Approach: A Reduction Graph} Each directed edge is a verified reduction---code that transforms instances forward and maps solutions back. Given any supported problem, a path through the graph leads to a solver, with every edge backed by tested code. +\begin{figure*}[t] + \centering + \includegraphics[width=0.88\textwidth]{figures/problemtree.pdf} + \caption{The reduction graph connects 27~NP-hard problem types to three solver families through 45~verified transformation rules. + \textbf{Bottom layer}: solvers---Maximum Independent Set on unit-disk graphs (UD-MIS) for Rydberg atom arrays, QUBO for D-Wave quantum annealers, and ILP for commercial solvers. + \textbf{Middle layer}: 27~problem types with color-coded edges showing solver reachability. + MIS is the dominant hub with 14~incoming and 13~outgoing edges. + Each new edge creates composite paths through the entire graph.} + \label{fig:reduction-graph} +\end{figure*} + The graph currently contains 27~problem types connected by 56~directed edges (\Cref{fig:reduction-graph}). Reductions implemented independently compose automatically through the graph. For example, one contributor implemented Factoring $\to$ Circuit Satisfiability, and another implemented Circuit Satisfiability $\to$ Integer Linear Programming. @@ -77,15 +88,6 @@ \subsection{Bridge Problems} Reusable skills encode this knowledge as executable documents that any new contributor or agent can invoke. \end{enumerate} -\begin{figure}[t] - \centering - \includegraphics[width=\columnwidth]{figures/scaling-wall.pdf} - \caption{Human teams maintain quality up to $\sim$20 components, then hit three barriers: convention drift, effort exhaustion, and knowledge discontinuity. - Agents constrained by systematic verification break through all three. - Agents constrained by systematic verification break through all three.} - \label{fig:scaling-wall} -\end{figure} - AI coding agents can break through these barriers, but only if their output is constrained to match contributor-specified ground truth. A contributor supplies the creative elements: which problems matter, what the formal definitions are, which examples reveal correctness. These flow through a verification stack---type system, round-trip tests, overhead validation, agentic feature tests---that rejects any agent output inconsistent with the specification. @@ -193,18 +195,6 @@ \subsection{Graph Structure}\label{sec:graph-structure} We organize reductions into a directed graph $G = (V, E)$, where each vertex $v \in V$ represents an NP-hard problem type and each directed edge $(u, v) \in E$ represents a verified reduction from problem~$u$ to problem~$v$. The graph contains 27~vertices (problem types) and 56~directed edges: 45~hand-coded reduction rules plus 11~edges inferred from subtype relationships (e.g., MIS on a geometric subgraph can always be treated as MIS on a general graph, because the geometric structure is a special case). -\begin{figure*}[t] - \centering - \includegraphics[width=0.88\textwidth]{figures/problemtree.pdf} - \caption{The reduction graph as compilation infrastructure. - \textbf{Bottom layer}: three families of solvers, each accepting a specific problem formulation---Maximum Independent Set on unit-disk graphs (UD-MIS) for Rydberg atom arrays, QUBO for D-Wave quantum annealers, and ILP for commercial solvers. - \textbf{Middle layer}: 27~problem types connected by 45~human-implemented reduction rules. - MIS is the dominant hub with 14~incoming and 13~outgoing edges. - \textbf{Top layer}: the scaling vision---agent-synthesized reductions extending coverage to 100+ problem types. - Solid arrows denote verified reductions; dashed arrows denote cross-reductions to alternative solvers.} - \label{fig:reduction-graph} -\end{figure*} - Three problems serve as ``compilation targets,'' each corresponding to a class of specialized solvers (\Cref{fig:reduction-graph}, bottom layer): \begin{itemize} \item \textbf{MIS} (Maximum Independent Set): target for Rydberg atom arrays, which solve MIS natively on geometric graphs. From fbdb4a58b5fa2ad1fb89b8c40c909fa770618ffc Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sun, 15 Mar 2026 01:20:01 +0800 Subject: [PATCH 33/38] Add TikZ role diagrams: Mentor, Orchestrator, Runner Three agent roles with skill mappings: - Mentor (4): propose, fix-issue, final-review, dev-setup - Orchestrator (5): project-pipeline, review-pipeline, issue-to-pr, check-issue, topology-sanity-check - Runner (7): add-model, add-rule, fix-pr, review-implementation, write-model-in-paper, write-rule-in-paper, release Replaces the old "two roles" (guides + runners) text with the three-role taxonomy and per-role TikZ diagrams. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/paper/arxiv/paper.tex | 104 +++++++++++++++++++++++++++++-------- 1 file changed, 83 insertions(+), 21 deletions(-) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 27139be6..60b6da07 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -8,6 +8,8 @@ \usepackage{listings} \usepackage{hyperref} \usepackage{cleveref} +\usepackage{tikz} +\usetikzlibrary{arrows.meta,positioning,calc} \begin{document} @@ -315,29 +317,89 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} The maintainer makes the final quality judgment and merges. This is one of only two human decisions in the pipeline; the other is moving an issue from Backlog to Ready (Stage~3). -\paragraph{Three roles, two decisions.} -The pipeline separates three distinct roles (\Cref{fig:pipeline}). +\paragraph{Three agent roles.} +The pipeline uses agents in three distinct roles, each backed by a set of skills (\Cref{fig:roles}). -\emph{Contributors} are domain experts---mathematicians, physicists, operations researchers---who identify which reductions are worth implementing. -They need no knowledge of Rust, the codebase, or even programming. -Two entry points accommodate different preferences. +\begin{figure*}[t] +\centering +\begin{tikzpicture}[ + node distance=0.6cm and 0.8cm, + human/.style={draw=orange!80!black, fill=orange!10, rounded corners=3pt, + text width=5.2em, align=center, font=\scriptsize, minimum height=2em}, + agent/.style={draw=blue!70!black, fill=blue!8, rounded corners=3pt, + text width=5.2em, align=center, font=\scriptsize, minimum height=2em}, + artifact/.style={draw=gray!70, fill=gray!8, rounded corners=2pt, + text width=5.2em, align=center, font=\scriptsize, minimum height=2em}, + skillbox/.style={draw=green!60!black, fill=green!6, rounded corners=2pt, + font=\tiny\ttfamily, inner sep=2pt, align=center}, + arr/.style={-{Stealth[length=4pt]}, thick}, + darr/.style={{Stealth[length=4pt]}-{Stealth[length=4pt]}, thick}, +] + +% === (a) Mentor === +\node[font=\footnotesize\bfseries] (title-a) {(a) Mentor}; +\node[human, below=0.3cm of title-a] (contrib) {Contributor\\[-1pt]{\tiny(domain expert)}}; +\node[agent, right=1.0cm of contrib] (mentor) {Mentor\\[-1pt]{\tiny agent}}; +\node[artifact, below=0.6cm of $(contrib)!0.5!(mentor)$] (issue) {GitHub\\Issue}; +\node[artifact, below left=0.5cm and 0.1cm of contrib] (pdf) {Paper PDF\\{\tiny(visual check)}}; +\node[skillbox, above=0.2cm of mentor] (sk-m) {propose, fix-issue,\\final-review, dev-setup}; + +\draw[darr, orange!80!black] (contrib) -- node[above, font=\tiny, text=gray!60!black] {interactive} (mentor); +\draw[arr, gray!60] (mentor) -- (issue); +\draw[arr, gray!60] (contrib) -- (pdf); + +% === (b) Orchestrator === +\node[font=\footnotesize\bfseries, right=3.8cm of title-a] (title-b) {(b) Orchestrator}; +\node[human, below=0.3cm of title-b] (maint) {Maintainer\\[-1pt]{\tiny(2 decisions)}}; +\node[agent, right=1.0cm of maint] (orch) {Orchestrator\\[-1pt]{\tiny agent}}; +\node[artifact, below=0.6cm of maint] (board) {Project\\Board}; +\node[artifact, below=0.6cm of orch] (pr) {Pull\\Request}; +\node[skillbox, above=0.2cm of orch] (sk-o) {project-pipeline,\\review-pipeline, issue-to-pr,\\check-issue, topology-check}; + +\draw[arr, orange!80!black] (maint) -- node[left, font=\tiny, text=orange!70!black, align=center] {move to\\Ready} (board); +\draw[arr, blue!60] (board) -- node[below, font=\tiny, text=gray!60!black] {dispatch} (pr); +\draw[arr, blue!60] (orch) -- (pr); +\draw[arr, orange!80!black] (maint.south east) -- node[right, font=\tiny, text=orange!70!black] {merge} (pr.north east); + +% === (c) Runner === +\node[font=\footnotesize\bfseries, right=3.8cm of title-b] (title-c) {(c) Runner}; +\node[agent, below=0.3cm of title-c] (runner) {Runner\\[-1pt]{\tiny agent}}; +\node[artifact, below left=0.6cm and 0.3cm of runner] (code) {Codebase\\{\tiny(code + tests)}}; +\node[artifact, below right=0.6cm and 0.3cm of runner] (paper) {Paper\\{\tiny(proof + example)}}; +\node[skillbox, above=0.2cm of runner] (sk-r) {add-model, add-rule,\\fix-pr, review-impl,\\write-in-paper, release}; + +\draw[arr, blue!60] (runner) -- (code); +\draw[arr, blue!60] (runner) -- (paper); + +\end{tikzpicture} +\caption{Three agent roles. + \textbf{(a)}~Mentors interact with humans: guiding contributors through proposals, helping fix issues, and assisting maintainers with final review. + \textbf{(b)}~Orchestrators manage the pipeline: picking tasks from the project board, dispatching runners, and validating issues. + \textbf{(c)}~Runners execute: implementing models and rules, fixing PRs, reviewing code, writing paper entries, and releasing. + Orange = human; blue = agent; green = skills.} +\label{fig:roles} +\end{figure*} + +\emph{Mentors} (4~skills) interact with humans. The \texttt{propose} skill conducts an interactive session in mathematical language, asking one question at a time: what is the problem, what is the formal definition, what does a worked example look like? -Before filing, it pre-validates the draft against Stage~2's quality checks, catching errors before they reach review. -Alternatively, a contributor can fill in a structured GitHub issue template directly. -Either way, the contributor's involvement ends when the issue is filed. - -\emph{The maintainer} makes exactly two decisions per contribution. -First, moving an issue from Backlog to Ready---a judgment call about what is worth building next. -Then everything between these two decisions runs headlessly. -\texttt{make run-pipeline} picks the highest-ranked Ready issue, creates an isolated git worktree, implements the code, produces a pull request, and moves it to the review queue---all without human input. -\texttt{make run-review} merges the latest main branch, addresses automated code-review comments, runs agentic feature tests, and retries CI failures up to three times. -The maintainer's second decision is the final quality judgment: reading the pull request and merging it. - -\emph{Agents} fill two distinct roles. -As \emph{guides}, they onboard contributors: the \texttt{propose} skill conducts an interactive session in the contributor's language, asks clarifying questions, analyzes the graph topology to suggest high-value contributions, and pre-validates drafts---lowering the barrier to entry without lowering quality standards. -As \emph{runners}, they handle mechanical volume: implementing code, writing tests, fixing CI failures, and generating documentation---processing batches of issues headlessly overnight. -Both roles interact with the library through a command-line interface (\texttt{pred}) that serves as the uniform entry point for humans and agents alike---listing problems, querying reduction paths, inspecting overhead, and performing reductions. -The skill system ensures identical verification steps in both roles: whether an agent is guiding a contributor through a proposal or implementing a reduction autonomously, it traverses the same quality checklist. +Before filing, it pre-validates the draft against Stage~2's quality checks. +The \texttt{fix-issue} skill brainstorms with contributors to resolve quality problems. +The \texttt{final-review} skill guides the maintainer through merge decisions. +The \texttt{dev-setup} skill onboards new developers. + +\emph{Orchestrators} (5~skills) manage the pipeline. +\texttt{project-pipeline} picks the highest-ranked Ready issue, creates an isolated git worktree, dispatches a runner, and moves the result to the review queue---all headlessly. +\texttt{review-pipeline} addresses code-review comments, runs agentic feature tests, and retries CI failures. +\texttt{check-issue} validates proposals before implementation. +The maintainer makes exactly two decisions: moving an issue from Backlog to Ready, and merging the final pull request. +Everything between runs without human input. + +\emph{Runners} (7~skills) execute. +\texttt{add-model} and \texttt{add-rule} implement problem types and reductions. +\texttt{review-implementation} dispatches parallel sub-agents in fresh context to review code. +\texttt{fix-pr} resolves CI failures and review comments. +\texttt{write-model-in-paper} and \texttt{write-rule-in-paper} generate paper entries with proof sketches and worked examples. +\texttt{release} handles version bumps and publishing. \subsection{Why Skills, Not Prompts or Scripts}\label{sec:skills} From 7fb50df8addb166c55f3ba7676c2e729d2a2021b Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sun, 15 Mar 2026 01:28:19 +0800 Subject: [PATCH 34/38] fix cli --- .gitignore | 2 +- docs/paper/arxiv/paper.tex | 110 +++++++++++++++++-------------------- 2 files changed, 50 insertions(+), 62 deletions(-) diff --git a/.gitignore b/.gitignore index 01399e95..3fa7dd40 100644 --- a/.gitignore +++ b/.gitignore @@ -88,4 +88,4 @@ claude-output.log docs/test-reports/ docs/superpowers/ *.log -.superpower/ +.superpowers/ diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 60b6da07..aa6d58ad 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -318,69 +318,24 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} This is one of only two human decisions in the pipeline; the other is moving an issue from Backlog to Ready (Stage~3). \paragraph{Three agent roles.} -The pipeline uses agents in three distinct roles, each backed by a set of skills (\Cref{fig:roles}). +The pipeline uses agents in three distinct roles, each backed by a set of skills. -\begin{figure*}[t] -\centering -\begin{tikzpicture}[ - node distance=0.6cm and 0.8cm, - human/.style={draw=orange!80!black, fill=orange!10, rounded corners=3pt, - text width=5.2em, align=center, font=\scriptsize, minimum height=2em}, - agent/.style={draw=blue!70!black, fill=blue!8, rounded corners=3pt, - text width=5.2em, align=center, font=\scriptsize, minimum height=2em}, - artifact/.style={draw=gray!70, fill=gray!8, rounded corners=2pt, - text width=5.2em, align=center, font=\scriptsize, minimum height=2em}, - skillbox/.style={draw=green!60!black, fill=green!6, rounded corners=2pt, - font=\tiny\ttfamily, inner sep=2pt, align=center}, - arr/.style={-{Stealth[length=4pt]}, thick}, - darr/.style={{Stealth[length=4pt]}-{Stealth[length=4pt]}, thick}, +\emph{Mentors} (4~skills) interact with humans. +% +\begin{center} +\begin{tikzpicture}[node distance=0.4cm and 0.6cm, + box/.style={draw, rounded corners=2pt, font=\scriptsize, minimum height=1.6em, inner sep=3pt}, + arr/.style={-{Stealth[length=3pt]}}, ] - -% === (a) Mentor === -\node[font=\footnotesize\bfseries] (title-a) {(a) Mentor}; -\node[human, below=0.3cm of title-a] (contrib) {Contributor\\[-1pt]{\tiny(domain expert)}}; -\node[agent, right=1.0cm of contrib] (mentor) {Mentor\\[-1pt]{\tiny agent}}; -\node[artifact, below=0.6cm of $(contrib)!0.5!(mentor)$] (issue) {GitHub\\Issue}; -\node[artifact, below left=0.5cm and 0.1cm of contrib] (pdf) {Paper PDF\\{\tiny(visual check)}}; -\node[skillbox, above=0.2cm of mentor] (sk-m) {propose, fix-issue,\\final-review, dev-setup}; - -\draw[darr, orange!80!black] (contrib) -- node[above, font=\tiny, text=gray!60!black] {interactive} (mentor); -\draw[arr, gray!60] (mentor) -- (issue); -\draw[arr, gray!60] (contrib) -- (pdf); - -% === (b) Orchestrator === -\node[font=\footnotesize\bfseries, right=3.8cm of title-a] (title-b) {(b) Orchestrator}; -\node[human, below=0.3cm of title-b] (maint) {Maintainer\\[-1pt]{\tiny(2 decisions)}}; -\node[agent, right=1.0cm of maint] (orch) {Orchestrator\\[-1pt]{\tiny agent}}; -\node[artifact, below=0.6cm of maint] (board) {Project\\Board}; -\node[artifact, below=0.6cm of orch] (pr) {Pull\\Request}; -\node[skillbox, above=0.2cm of orch] (sk-o) {project-pipeline,\\review-pipeline, issue-to-pr,\\check-issue, topology-check}; - -\draw[arr, orange!80!black] (maint) -- node[left, font=\tiny, text=orange!70!black, align=center] {move to\\Ready} (board); -\draw[arr, blue!60] (board) -- node[below, font=\tiny, text=gray!60!black] {dispatch} (pr); -\draw[arr, blue!60] (orch) -- (pr); -\draw[arr, orange!80!black] (maint.south east) -- node[right, font=\tiny, text=orange!70!black] {merge} (pr.north east); - -% === (c) Runner === -\node[font=\footnotesize\bfseries, right=3.8cm of title-b] (title-c) {(c) Runner}; -\node[agent, below=0.3cm of title-c] (runner) {Runner\\[-1pt]{\tiny agent}}; -\node[artifact, below left=0.6cm and 0.3cm of runner] (code) {Codebase\\{\tiny(code + tests)}}; -\node[artifact, below right=0.6cm and 0.3cm of runner] (paper) {Paper\\{\tiny(proof + example)}}; -\node[skillbox, above=0.2cm of runner] (sk-r) {add-model, add-rule,\\fix-pr, review-impl,\\write-in-paper, release}; - -\draw[arr, blue!60] (runner) -- (code); -\draw[arr, blue!60] (runner) -- (paper); - +\node[box] (contrib) {Contributor}; +\node[box, right=0.8cm of contrib] (mentor) {Mentor}; +\node[box, below=0.3cm of $(contrib)!0.5!(mentor)$] (issue) {GitHub Issue}; +\draw[arr, <->] (contrib) -- node[above, font=\tiny] {interactive} (mentor); +\draw[arr] (mentor) -- (issue); \end{tikzpicture} -\caption{Three agent roles. - \textbf{(a)}~Mentors interact with humans: guiding contributors through proposals, helping fix issues, and assisting maintainers with final review. - \textbf{(b)}~Orchestrators manage the pipeline: picking tasks from the project board, dispatching runners, and validating issues. - \textbf{(c)}~Runners execute: implementing models and rules, fixing PRs, reviewing code, writing paper entries, and releasing. - Orange = human; blue = agent; green = skills.} -\label{fig:roles} -\end{figure*} - -\emph{Mentors} (4~skills) interact with humans. +\end{center} +% +\noindent The \texttt{propose} skill conducts an interactive session in mathematical language, asking one question at a time: what is the problem, what is the formal definition, what does a worked example look like? Before filing, it pre-validates the draft against Stage~2's quality checks. The \texttt{fix-issue} skill brainstorms with contributors to resolve quality problems. @@ -388,13 +343,46 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} The \texttt{dev-setup} skill onboards new developers. \emph{Orchestrators} (5~skills) manage the pipeline. +% +\begin{center} +\begin{tikzpicture}[node distance=0.4cm and 0.5cm, + box/.style={draw, rounded corners=2pt, font=\scriptsize, minimum height=1.6em, inner sep=3pt}, + arr/.style={-{Stealth[length=3pt]}}, +] +\node[box] (maint) {Maintainer}; +\node[box, right=0.5cm of maint] (board) {Board}; +\node[box, right=0.5cm of board] (orch) {Orchestrator}; +\node[box, right=0.5cm of orch] (pr) {PR}; +\node[box, right=0.5cm of pr] (maint2) {Maintainer}; +\draw[arr] (maint) -- node[above, font=\tiny] {ready} (board); +\draw[arr] (board) -- node[above, font=\tiny] {pick} (orch); +\draw[arr] (orch) -- node[above, font=\tiny] {create} (pr); +\draw[arr] (pr) -- node[above, font=\tiny] {merge} (maint2); +\end{tikzpicture} +\end{center} +% +\noindent \texttt{project-pipeline} picks the highest-ranked Ready issue, creates an isolated git worktree, dispatches a runner, and moves the result to the review queue---all headlessly. \texttt{review-pipeline} addresses code-review comments, runs agentic feature tests, and retries CI failures. \texttt{check-issue} validates proposals before implementation. The maintainer makes exactly two decisions: moving an issue from Backlog to Ready, and merging the final pull request. -Everything between runs without human input. \emph{Runners} (7~skills) execute. +% +\begin{center} +\begin{tikzpicture}[node distance=0.4cm and 0.6cm, + box/.style={draw, rounded corners=2pt, font=\scriptsize, minimum height=1.6em, inner sep=3pt}, + arr/.style={-{Stealth[length=3pt]}}, +] +\node[box] (runner) {Runner}; +\node[box, below left=0.3cm and 0.1cm of runner] (code) {Code + Tests}; +\node[box, below right=0.3cm and 0.1cm of runner] (paper) {Paper Entry}; +\draw[arr] (runner) -- (code); +\draw[arr] (runner) -- (paper); +\end{tikzpicture} +\end{center} +% +\noindent \texttt{add-model} and \texttt{add-rule} implement problem types and reductions. \texttt{review-implementation} dispatches parallel sub-agents in fresh context to review code. \texttt{fix-pr} resolves CI failures and review comments. From 5118ac36bcbf8695c84e68add00a9be5fde573b5 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sun, 15 Mar 2026 15:42:49 +0800 Subject: [PATCH 35/38] update --- Makefile | 12 ++- docs/paper/arxiv/figures/scaling-wall.pdf | Bin 18609 -> 18609 bytes docs/paper/arxiv/figures/timeline.pdf | Bin 19328 -> 19329 bytes .../arxiv/figures/verification-funnel.pdf | Bin 20215 -> 20215 bytes docs/paper/arxiv/paper.tex | 79 ++++++++++++++++-- 5 files changed, 84 insertions(+), 7 deletions(-) diff --git a/Makefile b/Makefile index d7011b9e..63d3425e 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ # Makefile for problemreductions -.PHONY: help build test mcp-test fmt clippy doc mdbook paper examples clean coverage rust-export compare qubo-testdata export-schemas release run-plan run-issue run-pipeline run-pipeline-forever run-review run-review-forever diagrams jl-testdata cli cli-demo copilot-review +.PHONY: help build test mcp-test fmt clippy doc mdbook paper examples clean coverage rust-export compare qubo-testdata export-schemas release run-plan run-issue run-pipeline run-pipeline-forever run-review run-review-forever diagrams arxiv-figures jl-testdata cli cli-demo copilot-review RUNNER ?= codex CLAUDE_MODEL ?= opus @@ -17,6 +17,7 @@ help: @echo " clippy - Run clippy lints" @echo " doc - Build mdBook documentation" @echo " diagrams - Generate SVG diagrams from Typst (light + dark)" + @echo " arxiv-figures - Compile arxiv figure Typst files to PDF" @echo " mdbook - Build and serve mdBook (with live reload)" @echo " paper - Build Typst paper (requires typst)" @echo " coverage - Generate coverage report (requires cargo-llvm-cov)" @@ -86,6 +87,15 @@ diagrams: typst compile $$src --root=. --input dark=true docs/src/static/$$base-dark.svg; \ done +# Compile arxiv figure Typst files to PDF +ARXIV_FIGURES := $(filter-out %/lib.typ,$(wildcard docs/paper/arxiv/figures/*.typ)) +arxiv-figures: + @for src in $(ARXIV_FIGURES); do \ + base=$$(basename $$src .typ); \ + echo "Compiling $$base (arxiv)..."; \ + typst compile $$src docs/paper/arxiv/figures/$$base.pdf; \ + done + # Build and serve mdBook with API docs mdbook: @echo "Exporting graph..." diff --git a/docs/paper/arxiv/figures/scaling-wall.pdf b/docs/paper/arxiv/figures/scaling-wall.pdf index ff2d40b45b6135eb63ba3a29a2f23a0d46a56054..0ab51540badf9c67a8ffa2f2672cbc7bb6f5dd26 100644 GIT binary patch delta 196 zcmdluk#XZh#tkbR*i8*hObks;Ca-r;g0VN>aG1!%Wg22=Vr61z1(YjxHbCaCbEp37HeKdhS4ZEZCbG;m7Aa@pB&6_+Fy Wl~fd^rg0gXTN)X1sj9mAy8!?N5<5=- delta 196 zcmdluk#XZh#tkbR*iDQKP0UQpC$D!aG1!%WfEd!Xk}t%1(YjxHbCaCbEp37HeKdhS4ZEZCbG;m7Aa@pB&6_+Fy Wl~fd^rg0gXTN)X1sj9mAy8!?)YCEq0 diff --git a/docs/paper/arxiv/figures/timeline.pdf b/docs/paper/arxiv/figures/timeline.pdf index dbfde4b17e2fe5c45cf614b5ef1bd5883e49d631..988120142f28fc8e7a43e9310175af057c217b5f 100644 GIT binary patch delta 484 zcmZpe&e%Aeal?C4Mx)6eg(Nm}n3ZwW=bgxTFv<1xbN@vleQ!ReYQ|?AK6hV*FXOA< zQ-=rA%tcGzR!v@aZ_lcqnQod!fj{^+o$%+@t6zPk&D8$nE3*wd3vGWycJbuIoj)G- zYN7!el3+Z!Gl8vbxMf88Ye+VcF&2>%UMi*#A+JC#Lk7M^&Y665}8<@N1>r`{h; zIDR$3XOhFU-3RvfPYhKrK2|mR+y=SBc7G?hA8t^{HTNr(m2CR@MeW%011sjeX7X5i zPxV4o#$%mdOANJ|DtMnSndu33ST*Sf0XklOsp^(Kp^v4Q~zDCF^SffcJ-|JX;YMhTmLI`|Jm~3)AlU>H9tMWwl|*beSPqJP)fq*D!Xk) z`(qD3Tcpr&vyHbu^OpUtNO85b&uyOHv^e;xW1GMqmA;SU*}pKuHyOq#aWJ*xoOK?i>sw69M>+-mYC1HxN04@!qlqi z4zCyXZhkTO*b0tEQ<*)q+2^E~`uiPuoH|<9p|;I^RLflkMccZx#_09%O3wD zlVcr^vzr(hnwXhbOxAW%g0VM;I89{YG6^vOy5@@onMav==bJhk z`lppRx|mlQySo}>mrq{js*Xcy^J~|eESg3N1|Xo2$IAs~7@ApHTA+(r7+4rje(UAt ohusoeTTKNGoKmq|c6MCFC5c5P6-B9OT!t3L1_oTJs;>TS0Nf|I$N&HU diff --git a/docs/paper/arxiv/figures/verification-funnel.pdf b/docs/paper/arxiv/figures/verification-funnel.pdf index b64a6ac905e8051f4421ecb6bec9e10822f0ee4d..5c5c38be1384b67fc04481d77cf0fa9ca2fc8a63 100644 GIT binary patch delta 173 zcmex9m+|{t#tmw2?52h$CWfYFlXcycVC>CdZWEcfOhXJ!tV|58fN~!_43N3np7~-f z28BuaspXa?k>(~|#@fE--r<$S<&!sfs^gH_{MK_1iy?MxwzirI8rY>K-|~|M0H~5R AX#fBK delta 173 zcmex9m+|{t#tmw2>?THr#ui2vlXcycVC>CdZWEcfOhSwdt&AK-|~|M0Ijt- A@c;k- diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index aa6d58ad..ab4c947f 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -20,12 +20,14 @@ \maketitle \begin{abstract} -Some software projects are too large for human teams to build at scale. -We call these \emph{bridge problems}: projects whose subtasks are homogeneous and formally verifiable, but where three structural barriers---convention drift, effort exhaustion, and knowledge discontinuity---make human-only execution infeasible beyond a few dozen components. -AI coding agents can break through these barriers because systematic verification constrains agent output to match contributor-specified ground truth. -We demonstrate this claim on NP-hard problem reductions, building a directed graph of 27~problem types and 45~transformation rules that routes any supported problem to a specialized solver. -Over nine weeks, a single maintainer and AI agents produced a Rust library with $>$95\% test coverage. -Agent session data reveals a 15:1 ratio of agent-to-human messages, while an automated quality gate rejected 75\% of 322~batch-submitted proposals as incorrect or incomplete. +A unified library of reductions between NP-hard problems would let practitioners route any supported problem to a specialized solver---quantum hardware, commercial optimizers, or domain-specific algorithms---through a single interface. +Yet building such a library by human effort alone is impractical: it requires many researchers to adopt a common language and conventions, and demands continuous full-time maintenance as new reduction rules are discovered. +We show that AI coding agents, guided by a system of reusable \emph{skills}---versioned, composable workflow documents that encode project conventions and domain knowledge---can overcome these barriers. +We demonstrate the approach by building a directed reduction graph of 27~problem types and 45~verified transformation rules, and make three contributions. +First, a \emph{no-code contribution route}: domain experts contribute reductions by filing structured issues with AI assistance, requiring no knowledge of the implementation language or codebase. +Second, a \emph{seven-layer verification stack} that ensures correctness and quality---an automated quality gate rejected 75\% of 322~batch-submitted proposals as incorrect or incomplete. +Third, a \emph{fully automated pipeline} that enables sustainable maintenance by a single maintainer, with an onboarding path that lets a new maintainer take over the project in half a day. +Over nine weeks, a single maintainer and AI agents produced a Rust library with $>$95\% test coverage and a 15:1 ratio of agent-to-human messages. \end{abstract} %====================================================================== @@ -416,6 +418,71 @@ \subsection{Why Skills, Not Prompts or Scripts}\label{sec:skills} The library comprises 14~skills in six categories: orchestration~(3), community contribution~(1), implementation~(2), quality gates~(4), documentation~(2), and onboarding~(2). +\begin{figure}[t] +\centering +\begin{tikzpicture}[ + core/.style={draw, rounded corners=3pt, font=\scriptsize\ttfamily, + fill=black!8, minimum height=1.8em, inner sep=4pt, + line width=0.6pt}, + skill/.style={draw, rounded corners=2pt, font=\tiny\ttfamily, + minimum height=1.5em, inner sep=2.5pt, line width=0.4pt}, + cat/.style={font=\tiny\sffamily\bfseries, text=black!50}, + link/.style={-{Stealth[length=2.5pt]}, thin, black!40}, +] +% --- Center: CLAUDE.md --- +\node[core, fill=blue!12, draw=blue!50, font=\scriptsize\ttfamily\bfseries, + minimum width=1.6cm] + (claude) {CLAUDE.md}; + +% --- Key project files (cardinal directions, close to center) --- +\node[core, above=0.35cm of claude] (traits) {src/traits.rs}; +\node[core, below=0.35cm of claude] (make) {Makefile}; +\node[core, left=0.45cm of claude, anchor=east] (models) {src/models/}; +\node[core, right=0.45cm of claude, anchor=west] (rules) {src/rules/}; + +\draw[link] (claude) -- (traits); +\draw[link] (claude) -- (make); +\draw[link] (claude) -- (models); +\draw[link] (claude) -- (rules); + +% --- Skill groups (corners, well outside inner ring) --- +% Top-left: Orchestration +\node[cat] at (-2.7, 2.0) (orch-label) {orchestration}; +\node[skill, below=1pt of orch-label] (s1) {project-pipeline}; +\node[skill, below=1pt of s1] (s2) {review-pipeline}; +\node[skill, below=1pt of s2] (s3) {issue-to-pr}; + +% Top-right: Quality gates +\node[cat] at (2.7, 2.0) (qa-label) {quality gates}; +\node[skill, below=1pt of qa-label] (s4) {check-issue}; +\node[skill, below=1pt of s4] (s5) {review-impl}; +\node[skill, below=1pt of s5] (s6) {fix-pr}; +\node[skill, below=1pt of s6] (s7) {topology-check}; + +% Bottom-left: Implementation +\node[cat] at (-2.7, -1.1) (impl-label) {implementation}; +\node[skill, below=1pt of impl-label] (s8) {add-model}; +\node[skill, below=1pt of s8] (s9) {add-rule}; + +% Bottom-right: Docs / community +\node[cat] at (2.7, -1.1) (doc-label) {docs / community}; +\node[skill, below=1pt of doc-label] (s10) {write-in-paper}; +\node[skill, below=1pt of s10] (s11) {propose}; +\node[skill, below=1pt of s11] (s12) {dev-setup}; + +% --- Dashed links from skill groups to center --- +\draw[link, dashed] (s3.east) -- (claude.north west); +\draw[link, dashed] (s7.west) -- (claude.north east); +\draw[link, dashed] (s9.east) -- (claude.south west); +\draw[link, dashed] (s12.west) -- (claude.south east); +\end{tikzpicture} +\caption{Project knowledge architecture. + \texttt{CLAUDE.md} defines conventions, architecture, and commands; + skills in four categories encode reusable workflows that reference it. + Inner nodes show key source directories the skills operate on.} +\label{fig:skill-architecture} +\end{figure} + \subsection{Correctness by Construction}\label{sec:verification} Correctness assurance comes from a seven-layer verification stack (\Cref{app:verification}). From 1fb6e699bb4ef296f2068829d5bba00be734bafd Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sun, 15 Mar 2026 15:47:49 +0800 Subject: [PATCH 36/38] update --- docs/paper/arxiv/figures/scaling-wall.pdf | Bin 18609 -> 0 bytes docs/paper/arxiv/figures/timeline.pdf | Bin 19329 -> 0 bytes .../paper/arxiv/figures/verification-funnel.pdf | Bin 20215 -> 0 bytes 3 files changed, 0 insertions(+), 0 deletions(-) delete mode 100644 docs/paper/arxiv/figures/scaling-wall.pdf delete mode 100644 docs/paper/arxiv/figures/timeline.pdf delete mode 100644 docs/paper/arxiv/figures/verification-funnel.pdf diff --git a/docs/paper/arxiv/figures/scaling-wall.pdf b/docs/paper/arxiv/figures/scaling-wall.pdf deleted file mode 100644 index 0ab51540badf9c67a8ffa2f2672cbc7bb6f5dd26..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 18609 zcmeIacU)6V_b;jf(ggv#NCZKoCIOOw5a~$oMS2Sz0-+;a6h)AZ4FN$qC`j)`L_msy zNKufY^deoVckd+nsE_aOzMp%~=bV2|2#}dsYt34-*38cS&aBO@CN0Yg<3m!h2LdlD zDi{O`!J6AqiHcG|ApBb1&KL;4nwd4m6)Z|&-JRSZFz}wr-h?L2A%K10j$N@_+QN(a&)Q4gLGL%Q#>h!O_qJh8b(& z?g)$mSX6`HoktueycNGW60G@u6-e|JAPhgypfCxD!4I#hLim-;ys_>?|EW2cSzy5K zA^egKSPOfA8jUt0DgcGR0EGkqIYJNw1PMVwgds4X5(a_6Aus_5Ob`MSg1`_E7!m?Q zL14lVI1~bh;n~%!A%sZ$)lG=M=r4;U*sker?uHL-Fuw*<7db@S%E4AROtSzs+~!Pxt~UK-WrR1f-y8ljJP)UG6)7#Yz<0$WBh5Z@v8W?L^S2vtD zgi`{Tu9yn|B|xi+v&7&)gE)T}L~PJ>cXkG%2Xq-YN`&bF>H(UHvvqdE;s7^+_5f0( zb@4>}y5N)oe1igA3WJ~oK*z!%LP#ONy#f$~fH2@>K?nkh1Y8aLz>$E%5fDK^B;a-= zL;!Ft3itu(z#l;*;CVn46oM26Tn~tZL6AZ~8~|YfhaiwZ1rQy8TtFTeB!Faq6O0u= zwh-_MAO+-ubpQbpA}9n90RaRA7!=?Is)7DM9RZbrr2q^ppa8?`3q%$07J@47J3Ftc_A6oNsK zh#w4+l2|X`8Ucgw!Ucgb7J>tQgW{)yyqlSWt%Zb>HL#xJr=J8kbKHRO#4{5nk{g5< zCM*aFlQMI@jIp)0af2XGcmhozu5|#nqWJJdFxD{u_m4;fBNP}F3=kLK_3XFVhM#}` z82n=-p~84Q2p2pl;L-&4tE&voSm-aK$$nN%lNUdAcm?#%O9<-A`oXB*1{D~5ztiL$<~dEon79^!ol4V^MmGZB$B{nb^3<@>;od@?&^kh z1o-h6LlFokp#p-}3G!q5LZBA_5*-ZA)fVd{0$je>*=3w8rLd0RRCc8T7Y9%c;35F% z1rD8+t&=5=Xbi+0?3W6dT$Z*LZp0G4&cYEGmp@h(0Rok!yEBmzzoPuP&Oj_NR=|h? zWBQl1=AXu2*Bg8-5T*G4{GjoK03;d=3dGkEtprw@pH+A-LJ4J}5Ag*UK}B$exH<#Z zBAl6%HHJzQ3cSz|QCZ-H{~q&4J1`sZ)|*>d*qGt)973KbMU<4bUS2>~#Jq67#f z8A3@AUqT`XB_Tqa5TS%1vvCber{ue*(e#~HS5~|AnWeLVFQvX_#1O!n+_?7f8%R4ZSetv}k=JL;P22#H) zRbGh^y7q2N+Lie#nnvfN+|I?{Ni%mR_dXOt8j=fV<)@sXU2Ks$sSHtFVWnF>5EJ$0 z@x`hh@49p16r?fxD9@e~r{tW19++XkJPv=B4!?QG=Em6-?TMyYqeoe;a+9xiUF~{> zGv8-p1FkMEUKBCwhe~Ajc=O+BORTf8a~`Sw9CCR>`sR8Rg335-y+Gvz-@Ty6Zxv6j zby1U8$eqxMIGMnS=Ao#JMlZnpKEDY5Iv#19HpH*<47MP5uTM|nSd zC*tGI+)WZQ+mn7>Jd&+SszRM~)ZzZaxTHj;yH=EniYk{?!WAPGADUC%j!-PJQa9v^WK3)~?_-x!sWIKDJV8N674@~^*b!oW zh;*`wI%=9L&rv+k)qzdwqtEo9>|)$Q4384+ZiOpE z-c-E&kd@+0unNT+^w*xfS8*n3y!olePUYIuw0$)*Yc)>77{3@5l4SPp|AG@yRCsjc zh_SKd@y7|Ceer$eT#Aa-)sKv2Yf>0VCA^ww(~=pTE6uLz(T8My%=)%X6Z^5bWx5Egn7C<86pG@ALCNVgl;i>@X`($ocbt$!UpydoD8UcMmI zF0xiDq&W999bsITe4v8WHumK$#u zntP^XHzsNdBU~t6=vz62DAd^DLita$)YE!iNTOLbi`dRRtHnKZb+6wG!=S^bdZj}= zn6s{9lBto)I8Xe^6?7;64L$Z{_vmNLb`;|3tkbn9zv;^s%?{U*w7g->J2z+?^KyF7 z-3OF%;$|*$sgO~7G;l3A6&`!A9~D*#>ny+h`kn6~wM{<{C+7yA6M_Ntih0WUQp_jS z=5w4Z1!DyZt80tL;P+tt?`0l`+FFG^J9WPiYl7na)I-(MQ_M{*qj2c7Olijn`@63$ zkvEfsO$QZOiD3H;B1o)?2cA9^W*jQS=~5g|M*GpQw6g5%Z3whak++5Q}H|BT?0 zq*N=++H-t#;_1`DqtIn(%DPV{UWfaZ=HK?$%4e?iY&F%kE$1@n^_@od%)amKZv0mK z8B_l8i5hi;Ci|u%`i_ns?efyt62Ht89_^jjht8a8T@dy zy;;^kE;VQRqRoB%`_muf{k3Ed3d>iGKU26}I)3lOpqTo?D7Am59O znkE|fePOm$uwmz@@laoBat$QfNzZA0t6*S6Hn7@-?Ae&%N@0RiNhK{dmD?Z7Ag~O3 z$|$8Z!xltc)*2hWZ01;qXkXo0;mwUB>o4FCiwlxQMp*5UpS<;`Jp9l_&CDL(={A43 ze`H?bleJK|L0p1Vgpm~+Z?QXjVoDX46)lo8CQv1 zPKH+s1#qUn{K)3Qfebn!AEYqPrJXBsQZ>!?M$Tg7V-k*BGe+H*%q1C9UwBC}a;HSZ z_vw{%Hk}tAgTH>_Hq%m$r4=OUSnwBl6JbCfkr)0`p@)DCseYyE{W(3M(&NPxt@Q_6?@HgYJEYwA(wmxIG`r%FD1_soARU4*$ z==3*3pEzg+Z%N*1>(t@@!1OX~Izgg#H$WQdJn#-80l~Q|G=7Tv|Ccf|TmpqF~(yPG@Yp(B3GI=viJv8CP z#`7dQY^m5blcU@pb;N}4xEO@4zd8G#+OrrX_m0Z=ZnPFO)V9DH}fi| z!Y1uf#>RN(4#r}Qq^vcr+;hEMly%sFrR_mG3tfCQvoi$Foz7S4OTy5Qxo^kcp})vQ zRYlh!b3QF!sdmt;D&cneJ;o4a3h}ny>ZH^yRtGZ0YrI44?ixp_JT4sJlo;pEtB?>q z^8Ms`V98siH>jd$krX@R4e6}Kx-ZW-@G}hDd>7}dmF|VeO zd%m%?<~qiyzqKcz-?ecIgFqRU&NV7E1z$-Ye_A(kuJY@dJA?P5$8K<$f6s0RySH$e zM}_3!XC0Z|=`mk{IS+%vM-da+AH_pIIlpc+J;p|_{(PV{@7R-Op#qYASoy5{8N>s- zgUHSPicYdwGPRL<=R(#@w-5eva<=X=B(rIQXyffgSHT-wP0uU4+o(;iv0tu&U1pUH z_Kj27%ssnHC%=5P?(NM;MZslvX2H$%_(ML`o_?W=kNJ;A`?uK+Ufzn~Kk9q-ljFNi z`~4XiiR3UU&1#(?92_`Y_1rwphr+s7x;H=Nlh#58TjM!<}sGg&EeU8jV(!C z-=}`Y2PFq)R$N;YsGpjYr>2yA?G}*JIc*p$f4M<~N%;OwS@T`4<%@;Xz3I>5bdwY= z&i2!tytdNkLS{EJ%6rkk6Or%^bw8~|ecvfX#}+4@2d6w11y;VbW*l2;PYc^|+)MM? z_$*(&-K*<0Ayec#?#@77tqhCqat&BmYs}URs0{JBE=uco&OCa+rqJCMZ6x9;^GJ?u zO`psA!AzO?!KWF5PdRw<+iwC_-L@tJV8^1o5M|O`4B& zeqpfQ3Bq#pFl{~4F+UOpOE$Pzd?}~0=G=6E-QDifx9W-&)uOHW+Qu39C*VT6Ihy)# z`lwo+kHrZ8(Vl*)(4sfm+Jjn=9v&!16GNPa&SyLK-Q>8)udi;C4_h)l%V*@CF}h~N zBsMAX*htrXh|4katyb2|HnaZC3)dVjwTmAt%cjr0oVOR!9Ox(L{D@qicGB?4D>B1< z(=MGIyW;a(p$uZdGncenNd%UwliT*?<6LthNjKT%T?2&Vs6uN5JD^c%JE;W&RpT&IFh^U zVb$14^*lOPFa3QC|09ekGfKm^Y1C=(a(RBAuXj>*yYWHx>3UbJOq5Am@=9&9ZicPhU|Z4U80R7kl^W(`a{xwZ}DWohrNb=AAS8 zy%K(}d`;5TkDioQw2kvswNF)5UJSO|M4){dmG&UIwGXAx3K!pnb#gnzhIo@HmNk8m zzEO8P*r6%z+=AKiynf$XpF~T!ZaP`njC5&U(W=B%s?DBvUD<8zEPI_y58&%CnOfwD zzS_ELk>AZ~12f{&KiHWx!<%St*i_|bKKu+1lC4!b^S&#I{ld2)9wzgL%$uCSX=VbO zxH>l9{ok~{3gvEUEi|k-yiOQ1XF6LqXngCS8_Q6ycmvZr>BmzRvRwJErvfhbt_8g3 zf0%-M%jZliUFL#Q%7sTV9TQoZj%bc}!6GVr{b`*r+Ah$?HQ7DM=6WyD4>9RK^9j zoEtmCH<<@J4h_jKr}1ns=iWy{hxjrioyJSejiD~Vu~?2%j{J`k7mqCH+BZ^>Vogfi z6{g-xH<`X1{+LVlY;+M7N%z&`i+u_5sn#ON#jx=!RVH>fW%nhS%Gy1Sz)lxWW2d*D zMZJEP5Ry+-clj<&iH)+7xW>~){-xX9@6Vk(SDMI{Ryhc_CTZ`dn2drMM# z9d=BfqW4$)?Jn?dJ15|s9e5M?^KS1i0I;T#w647FPrw?0UfdnbaQ|O1YXC_FaIHT< zYY!xw1dbs$PXgd0#E>Y#v|7t2LZW; zBLV0ck6a6&0O%QyTnhmRF#)+25GL$$2tkk_9Ui&HYlBCw@w@<@4B*5d?uPp(9(F=6H0_5e{Q*xC;b$VWO@L>KpfZ#I&kFqpKO=}G0z8W#!ny?eh+r*@06z;5 zw1yGv6ClFK1p5St05Ac579ax01p9~pFcE$hB*NB2n4Jh)6YLWt@)7MLwhxDNaBC&zjRI-7=-kT9NlZ;--Q?l1m3z`0 z@oh&W)sz(CtZc(Mt|X_ZCKrFDTno4DA<1mNhp%h^#@DtwTr*XAPmLYWaC6bL(2QBj)-2fw*U0qQd&p2d_T|lm*9m-vJGhpyeSQy& znIEQVzOq@3rW|~?@0MkVG}nF32fGJZbWR$MHQFihal5r{MjJoAfF8b%W^1sHgLY$z zJ^=UI(L)R4J(t@2>eX%d-B72=&}kTkY-!eL(HZ&2$+Yzo@&mwdSUx?^M)ytO%o@r| zT7!Qr#^VmfX?6Z_%W;hCfc7(K{<6MErCQOO%ndpPW%BTf1?JB~NZfo&W%T#cy$eMS ztGCm-IFqzb9~UqWcbSv25J=HA9ORy#e;Hfze0V9ft~{W{WcXs-W2U=ft&h|5#*R{4 z(~t;be9%*6mT5#!yV{)r$8Oc}G9Zv8U-1a=etht}{1@n{0hvKiR4XU?=j?h|J z=LeWIBvvOeicXSv#g88Pxa~{UE}*q4?{P-yMeB|&>fJlv1SRVuZxeL9~-9ik=lXw7MKp-9#FyT^|1B&aU(5>ZU6uS9eKpg)ZO!(~YIi)K7)6Vs$$jb^^T_P9~&Ry1#YU%_vqawBLYi=|c~ zea7}7gE=pE%!^Vxp@%voqvkIlPtTUH6&*UQCIt;6JKx`yBcl2}zkL58sqKV=$`)}J zN2CvvMF>{FvJV|Af1*rMUZl#=*`(gqZM1Iae_jvAQ#H1m`Qnfs`P=PPW_h{Cbc2+o zEk=j3sw%b)GRw7U1&G{1du_{Gs=FM=sPt<1ipzVpP%qt5{kR?EYofpsrhdU@x~sRM z$8%Csg;$};rR}JrD{0-nu>?z5`uf8(eX81ZcYP|ZO$VQuf-{%p+d(-CZO$~NZ(Dy6<6ilxpc+M$X!E~^?Hi;{AF z^XB;5hr|1$sgE0DWt zd}`QpbKg&k`P}prXV+Tj>{RSJ!2DMC#0En5J>qNFzDn8pR=%bEXM5(aX0qfSEHAti z^riLTyW7d+(W9~?LFgM*E_8gy!f=h+QcHDS>o~{?HQ)Z^>kJu?w5x06b*VO@%r#yr z^enUw20O^%v-Jl{NHwo?KHL0OI595HS1T3^=Xyzhb>rNUTVY=0Z2j2MfvZ^crSGC- zJG6Sa0ZV+YP%3>h+_-@3NA?kT+9KVq2ogOtu@|K4lD8br7e}km-NYBfpX1H+RXqqj zV_SGdq(TTPzbkzPR*>i%k9a+OkH6@{Cbjah&Rixp!3eir-vfLw!>{iCjm$@>r32FI zXLwt6#Gp`+dEW-0r?tpqVD zm3~ImudQdZ7mhU3KVaBlpkKi8O`Xk`ZQ=KjIAW`m+kJ}mioDs7OU;fj$~j|{ra@+G zMQOZZyp#(ko*bY`%rS5fV!Hp$us~d=R2ciJQ?w}K0#zxAcAynyh`6tIo3YgE zgp9-uIHl*?_>B!3Jqr8rw8@6kH!sF|^ctFVeA<7G+9NY}YIhFNZ`6D&uw%k)aNhKp z%b74yhkTOYx#;gcs_EkFMk{`GRXy(S*%4KKssXoSjl8Bl&>XXd({e1RB}m@C^@;Mm z(R*g$T(($NQK`;+GtPfIl2sSKkr+_xG(4|16HPRbe8^4^72eouY) znjxYuF4O(&oPyCETg~Co180y0Z)hL6FvA|r=Da7Nq!}&{%=R#_&{TFzqiSI@!#6Lr z>$EPew3V8%(fGFA?3YpO5Xo`9%L|CxEfj-Jal=bBTjL+moe@xD)4LN#+;YU@U>{H9 ztP8eZWEXyxoi%&+#)(nW!IX~YpA^csrCu7e)oe9CA+-o|a@|ev+u0lBC{r+VrX604 z@i=6<6sQzle_P=j=SJGM4WvJ_-VsV}(}sksz^0Lp-d}0jyORA!Hm|wzpN!luOvdt# zc`GNqZiLGBauT0zqpw%PYWI3;ZcHQOps`D4fTIuf(-#k&&uoRM3cI{ZXFx{jOeuP* zaM1=$nw?Ou|8nXs?AtnTL)peUvTT%kfo06jFy&pVYph4;Vt?(`J`L!;$BHAJ8#F3X zd-N)I6r8@9SK1#%aGDIMjJvys?I)EAJ}8 zP5AL+H%1G15ShlUmbN31k!!&=P=%$>WL>w;WHFsN)>l-bcCTTPRA8=+V>SOV>6=tW z)1!>mFtR0sl4X)RjBjX%oEZ-7_1iW(?a@Jr7V-DQZBxV7cZ5Qx>ssem1}1rKWQ*3mz?3h zdN${Yfemkrj7g5#dhhqK-JEFzdtRvfhp`wPhxXnV@XJ2jneZBNbnwaY?&bMOH8MGI zJrz51Gu~=hrW+XzFeh72tP%f%tC%raUmeQ%j290UiYvpr?^X`aqC5_4wXB`6&JaT* zIRiu8uIg>vcs_UN(bW%ap9juDZz=6KwK`hXp8OVV^4YdX8#b%k#3-`PW;I@0y{o1? z?4P;a_;TeL-{;vIacwD^&klAPdwB$gBK12>=Fg(6&h*5+k0s@2B%?47bspw?*|GS# zNqGFav`1w{o)mj+10v9J@9WZ7GG*VtFfn}iQWHE~#mR>!_DP9HB9VPrrpbv6=UYpI_g|nRNW66AOof?L)`VI z@}(yONxhy1dgO95$DAeSskmZ!lJ|ZHuaa>^?;$cv?D==g5K%j;7|7-@*_-{^3Eg?Y ziO-*;3c?>ws5I+yl%1Z7`d%ay|6uXy2kI{q%hJKMt8y8Tp2FM%T=m9!1)(1s*B5Qb zW3iRUNzPV&_NfRK0iha6nTHWxEvTd;&UK-YCNCKKEUz6Nzpk||Ovm@~VCj&t$9*jQ z=cNhAo4|t;-jx8QmI} z^4gp;kI_}b$E=Nm6`fmEt;Y0X@A0W<=vbFEeU+u?x)siPT1ci;R)mIrUqi#!(;-!Y z3;9<*^dKcfZMX%4U0#+pQQYL9#@l-FI5^}PkPiy@Gu1e6-FQ6I452Nxjstp+}V;zqSX!~XtWf{vH zQ@_Q4Fkmd3^}x((rkWkCFH@tt>c|`K)c6Q_ps$@{l}X3WHSt_xi$CnBrTf`a}URSZ?E2aw7v5xY{cXX zb>Gf;r+_q_t?yDL-8*vrVy1@YZ=Q&eo1| zf<9B+9Da_g>%BS$1{Y7a_?*t`sW_!QQV>5l_nBfibl>@Drsf3et{~AErUMVXc|R<8 z9Wy4E+%1rnrwugrx~`sr+titxZA7POSyozJJL$%0-vf!w z=B#J_5S#tiLiE)5jRHl(tS4;I(VupTXL;)A20h=+r11z}h>}1ESqab1=C0SA7-VJ= zm&;GrNNW48&DwIBbGyX(uFJvKyW^MDb>hO_f3J<;vhi92bhJqE+=~}d zWs`Jt$!pPG>$&)> zFSnH}z4D^^9kNS0eBVqK++>f}9~%)+dVNraq*<}_f;Z*Yle0c2<4{n{OGbIM^(V>p zOSe92-R5UEcq=i4o_!*i3Bk8-E0Vkg!lCg(l+ccq$hxhYozy-%FK@Rh2) zq#H%XP0sX$U$h{se#|UicH+Ws^0 zbAKDV30spVIZC>gtMxJ_gxVG!r)Mq9Z!<-EgD2?=UI?9oKL2gBU z1t%y}Ae6TpApw`NI6^9~FA`_}fQPkY7Ne$_h*z01kh z4<0&g(oJ!LGgBfxYV5SdhQD{=i^6S3i7a!tDhov@7Ea$?z~H`9x?QkpG0cx@p`q&V zsy4XX5Y9P$Cl@u`;k}q(l2G>g(_L)c)spIR{d(PJ2KCOg4x6m%VHXYtr&G!Id+p9( zse|u)=M$;@K(V<8sh-!sj>I&Vlq1XS*?Dv*V0gKov(oN=GN@eGVryg zQ_U#}cJDZ8kF9Iqt+&tn#oxR^mj#KhP!EdM-^Ct1jrBJ&u6{W>-tnQQXnVY4`^)yF z-N2idAl;*}bjiyv>Cem>4edO?ZL#z9L|M}6xYtVGmk-EXxRcYPg-;?}&V>OPkG_k3 z|6Uut`$GBF;&YZpQiioR=dT|xtGUiPEy$hb1N(J2hZltZ%Uz}x36OpzgfR%4cjuIyC|IU=#Y)IsBd?PgL7uDjfAPK7;o`` zsFzn3%6*uN7mhC7k2-SbtYp`$IRD8^W4OwZ%d}4$o{Tgl(_Qf`e%3qO>Bq?O^tqFR zch_!9?xmtodYcU_19RTaicbz(s?L+bPqP)+F=tm#nfaxs*o9_`!W>c+eW7!YNykmE z7neyM9oHDMb-t=;Z%TFca`QNk5zYP!jiF3lHc#zp189R4Fypd;TnU3s$RI0?_ZB+n zg^0HhO_q*Jm88|d@xH9_DJ(AO)kBk{-QwHLzA(2B;f80^PHkL0TJJPbE5>NHIDs}~y5{6P z6ME#-`IoFC5tq&%J<2R-K0L9Z5YQcgXz+D>TUk$hWuD z^M0k`=VL>32eHNlCXQc}_F4B7P3ySKAGJ<((R4Xh@ombf_BeKLje`BjKFk-_nH0V$ zhzozSX4c3QPWRNPT&P^^HWEAw5INYY~Q z$VVS135!eCX1Uwl*qNCDQO%*N(`V67xpKKHCrUA!BUe&cK1qws9`#z+K7-=Tmn#+V zeCfi8hu#PK9<4F#c+N`0H{F$USMh$dvu;xNrAXs5uZr)c*Dy^y_i#HRA$25}o;^w2 zaDZIvQnFl`B+FTxA;bN%7GX-Z1sr|J9E(&hq)@yr}-OVMQaRu%{O-<@uK|^)~Ms;d#B!Lq_)A)$j z+j)P7c{jynv*LG!22W-Ot_;dE>7=jFli;{#G?=8hQt2qwsLUKskd~~K8=N~!s$8sE z-N4QK&OD)l76C1a)2Y7Bp%Qsr`*6+VQx&F%-{=pn;+SVl=NNb;WH0GZ!?_Jbs$?0K z3k||0gj@5IdE_Nw?cx)B+Tu5R{qzo`$-cI%*N^!aOKq5*ddwp1xJsbyO_*R{j-o^- zq%#~XC?UFK>3Lhu`r${_qa$kXN6FgsV5b|3r2PI&*ftEtvA7p;(Xxh&Jct$6XN{0(s=51KfHJ7$ z;AB`+hHzGgo5!}uZWCWQB1*sEu-HXNe0BP}kk2!dP!*EZq8M}Xk57XKdTLa=ln)l} zlXs?Dvm5Q7e0k6X<>j*WQHM@F`Q5b=rVP@CEHg5#Y~HW0R=RjtA2Ct%OoTt>I<3rV zsXjmf#8*Srnb$@c@qE-C(^CiN^sm6?y`xIt9CSXnLImk<*giBlMz6zyUAGr(l4&@S9eYOYjsuqh#OnF!12=9u-Z?LMF*)y`&A{gm>5Eg1cR6eX(=~eT z#eap`)e3!%=;wemrz)?vofs2RIxCpfarB%5kJr7jHPz#fx>QUBW?C<(Umgx=I%8I2 zjqDUhrBz4%ewi zE*~mBPLDjL&3kWse3FxtCc%Tnw71tr#G-5^{q@Bu;8ZXZ&YtOTvC@{|$Jc}AoyQk^ zJp>ZXyA5BF4LpF;#}wFCyc>R6eAYj;T%dJhuDgozT$2qv%>$`lLvztnaXh4}qdT!a zr@HJ+V^k~lPIx{%G9j?@+q|!lyz{N3139gFf+J#y%ySl)c&+4`d^`vx>VJX1S{m1aBtRH0}TbaX`c@k{xiM(w$z^vBK{$rPrKv`-7{ z^*(v7y!gfcmIxpEiHsXhe)4j%-?liH!#d}K;K)lVX&%&f7OF88OM*?0+3HJ+d*1Yv z4B;N&Dt)#*rWZD2GIL*%scLqnzbG{DzhaEu2@4dLQFBM-nI?J6J)QG6m`cfcA-r9i z<{qH?7I8*Ugr-dD>e@H96v3B%rIv_G>$`nvAIoAMD`JdihHr5!uu+DUuDV)`Tqr5c zvbq29lb(ft-B2jWRcRg1x!_N8t*wI(Kdg@-cK3vOVhrxTujxXqb1k;?Ijun#LyFwr zbTt$$Pg=zpppd3{7C{U78y7XsGcPo>KNRr5PJTH2GHuH{&os+VZ}%k++SVn>>0WgN zJb#oqRMw>U14YgAM3?Lvoo_qq2d`9I)8DR)L==?wRWCquKk#dwZhlqK8osv?I$n1P zUAsJJ49iLTz5u=7O~!zVi=HsPXfn`fq%)}XC}=GS(Z@2MR>1h#1D&uE^5jC^^d3sZ ztA^kD)5YVaQxxOAO;zb6RN10 zVax{;gJ-!O2y`ON`YF zOu!ZaK7-Z#tPnhjo4pt-K*{TgvjuW`S<#|6ODho#;Le4>29(5DZQR_PMfmwWJw5q6 z1^BQyYknAzoXih}^TXl100pnBx09Qh7q62m8&SZIb5y`+K|q0K=0Hxq7%R{xUKvY^ zpF*A8aSnKqmKOXNLgu$CKa39s==Vc&N9Ui^POf}-EBGw1j{II`&HyF!H%evY|C!d& z@i%fUHS7PG-i_bc`X3JGUq<6MRe%OS6#;+}7*_|Ib!YOA8SP z07-ENZV(BBFKK2?_~2lR#yE-cgX0H$Bz(X(^8dyIN(92!*2)_UOf(q2yifsNn2;7s zP(%giiurFi@iq}CK~7-kL{#t(>OjNrqJOCXjk3bx9L?O&WX*5jFP8g-{A<2)E_4MG@bcR7)fWxI? zVtDz#lM%D4@Fta6?bb{Y3*0 zFcAgq`L&MVpVG^D1vq6&BvLlR+5h&y2CxD^4FM1U1UT@^ zlB+Yu0=U_?GIMYR>j+4KjxEL$gHyv{tbj~ZzA&3?2&pPn9BKU$h{Zoem@UB1V1b`h}LjN2C;3*5h0p368fOO%1(;yk~9nZ0Iatu<>V^Q>93g$-2IC18?hM&VPy zA0s0S0)^o1PB6;KGD09yCIOx}h?Ie?6V4k6&MoEqBqqMMs^MYV*3vKm3)pw55~DCq-6QV9fxA*ayf4;1+W zN&X-)1%Z*8g}}(Qz{s`0wrU}Nkn1A9lPe^@lWQfvldC4blk5NC-RgV4fNlQTvE+VH zbHllV+A;>Z0dMc)4rmEX8bC;lZRzoUQPv;D`%xs47(dw-nDZZH*fJo1f0CL({*@q> z-waJ3B6Zj{0PnM9QU-3e_Bc>th?KG$-rg0UmY28sOJ+C%0+%MG4V)lkYyZ&{K&}5r zhSc!{cON%fu+skmPf)<#{+=OFfB!B+fT6>%f6owTum3VbC7zdZvIEp^M93_e;DVme}8U9{ey1){fYUXrksC&R{n<{QPPU(KaTJJ zs@?p<i-EP-??2`0c=GDi*vF1Y0xuwh zfr4uOm8{sfPB2qlStfboxFi15Dbb&eJi4@jQ0n669bU|tXvv{grLymUl%Q+t(%Lzl7|z3 zJxL-cffhi7V4$S3WP>0=Bw$!+ur3u_PfeVQlQR*5!XUs5W1w%$AXua%`8Lp9I3k3k zBBO6@14_XGK!b$sep2MmzZzSLL`#zzCS!9I07ZkNJpLj8+kmL}coXsNKzR~wmw||qGaw|MAm1zy0u=*D%y0y67rciIfV72$)jS+j z@a~{BdNYFd4YC1XCx9$lnaU zN4br8xWR++iD)|pSfjo;$NGo$$4*VjPX>~88eUx&r02VPFt5GSjP=2j)TRfT1kGv@ zwt^;WRzun2hl3%>eNT3A-&SrdN?RB$xF>eBYNbq9>s?=YkWS(?`z4t>hJwhqch=aD z`6v+`yF&3hhIV~>VuVlRKgj74XN+UCFuIbzi;f}RNYnTI=>Ft_cRq3ZjID~gxOXKM z?29(cO(@!ZWhpa>0C}qyHF|QIvVfrut_s! ziH@PnhOx)O9vaRmMP~Bv|!B(-a@5nbj#( zGo+YTIlN5V6!l&@^BS=}yxyWpe~a7pI416Dv*CF5E!Tmt79?!Eg)!!7- zMZ6~?9W;wy9{m#9`zo*Lpv@^4WTwQj>EXUmlhI)e?fISS&tWs$D?6&Ued+LjDOY~W zlj}^y%ay!Os`NKUy4(>vQf1?tMov3uAD_>1VS-pdY3=ui?{)4LIJHo7qU7lbL=P90 zQ8#Q@w|_SKzLRI!&05OSCREtnyvPoot7UtKqV9=f`p0Ar#^^_Oe(cArl^7`R)))^N z<=m#l$#1E7g`jA5yx^db1tVlzaV<0DNx$<7&P%&u-{w;C@W`1zZ1rfq3ca}()hOxn zqQj51D%JR7iTIpvmw*{X@WNm~oeuT-RjMeo1KAA zJpwF?by$qrN&^}eZYY(b9#PDN&pw$pj(;c;gn%q)oUH1U61jUSw?tKSv*_Awj*Yd8 zRSoQ&_mfs9KDB%-cexbNz{^n58Y8k8{DtDekU)a@3GJPFTuXgRJ=8mHQpaT|L+tcUozh{`~|&0?C*(m3-TOG2;Xov6DyRP0BFulwIJ`8xWF8t5;5 zua5PVt5bqH6~5vFUcO`b8@VPQ)!TBstJ-`oZ&JJq6AF1Ja!X@D1rBvPO76d|awrY1+BOd&pIJDsL^SP_^^}d^y_L zLrAUO+5Xt)1j<_XihP{o-M)R)dy_ItUzfhUkyY+m!or&!d5C`Pa*$=9o-WJzPDT24 z#~tS;*!Sqp=641ll!C@y1ap))eO0-0TShNzk5}a-VotiN<-5Jh63k_qOegeq(QG~% zWDiY!K29$^OcCjAgIu-TJd}}>V97GXsOonl{634mCef7Q)9TE8;-jF~$!by&3|!Y; zuX=t-T9FChKg5|>d@R9KG&Z?q`g8)*0zxfRSLfVn+q|u+{Lvs$DJCO&0TRIVOb!2nVT2f!?9>uAMZTVbRY*<3>Y&s^=8xMUD#pFC4HIA&V zJ{#=rq_%^m>uGJX-}}z#E3udpH^T(8-LxA%9@)Nn=K4km$M`_m`jZ#l;-ac9mODH` zPT$6FPEve{rGGZJh?x=lOs#%I*+o3QsH#JZKbpTe^TH#2qrVOVhj~cyEE>kzo4O8yAV<8*0 zF3>g367}Vuf`(Yb7SH*4z`?$jpZ_hDOYEx)(|W+}P9i z{9P4A%vXg|yP^vI^h^9FhwE>^=MRWSN8MaSTZ@d+U{DEQnFct~3IDe+4MEya`30sS zNINh;uDD5L{|_+j&rE}Bj(-iOA--?h{Djj;Fb!CF0xu|lG~mEXT3Q;|`a(dEKw27j zA+XXAB=AB2dO-m%I2r>%11}^J1;GF>kRA)Xz%;PZ1TF*?@FW8@!axiN?qmTM0t(zk zfENOTg#a!};3C~c0&Z|}9zcMCAi-N`@FLv>wh;iXfEEFF5a9Iy^8g?QTwpEWHVguS z1~a6=3*-QRP{4%&cmda7Xbq=Gy(+yv;|y1 zeLo>s;O396g>TD85}y0d2=?pOuXGeFel#I}7!IV~`ICYDg|ta4)bAkn?=1>|(0_ws zC8VJMj75T71jRzoNU{^$nv4K+9S({-l~%~=q5pGcE+!oPHFK3>Z&hvdF|e+-%5(Mtn5CcV&Vh1&%!#I>dN6j%QBdY(DM7cY7>YF=lO{im{Dkv zD#N_ZlAXr$^{;(#j`I9--HmY)n)caYFn6b49tq6}#^hY|=D5a~t!VmULGG8zbuDfOA8jirR9fVcWIB!+tbSA!)BDC0cEaAF zt`B-tkX!JAc2ZuipG{oT6x&Ryy`5M6lZ*UsS@RYH1xH;){h8U%)XcKfHrq;hV(RKT zm`mn7UmaS=)N{&8tvW22thq1orU!k<9EC61!G0gxql0Kc{m+^YYtXy*-xUin4)oI; zPM}LW>lQnEV8^*uVMh_;r$Zla2aed7-(1zCV9lBj%5yJT?JCV3Wj?^S65UUAJz~VW zPU?})S%qy@b_S`pOPJ34JgQNT4z*j%$#&`=b3DB}vXw64$X(1(rD2`$>(gfUuTG$; zt0vG-PH79=V8B{fQtgy!XRvr>Z`ms|K*QK;C4VHOx`qh*QpNlN5PhI1Q z(pV+t$j8G>)!cFh%)&eq7MjTsVN1f@?E@K=RG#wn>jP?SPsO8!s~tU0d3ks{CN=4K za!jSLWr)2RFTKC6&`Cf*kJL4pEX_)t*d}=O!YhlzMGpfcT|yK*-d&7-?s+Ls^!bGr zd<3tKo@0hJg+rrul!tVsFmIT=v+%&w8-nukTo?ZvP3`rK*P3ffQtTJhJ}V`eibu?g z(LSATZ)Gq)ksdZAme=_}bh>8`%>`DdR+sPE(7W1!{*l(@*wgtmD9*z<^6H{zW4Y6X z4)A9r1Z@%xZSj6r^Iz!Ept>)!$aFRCiI+aqU6EWVh$v7Yyy$D6i;OQM>^!%E*i&ZZWi(JjzRYJ!kd}*6O>Vfj(7r>X2xM^=>Z-> zR0ii|t9WfMx5Ca5q1`?Y{3FxDB3_rAvS&DFsndsS;ZWzK&6M#mkEt(gbMIA5 zkX!mVnoJ1PlW!YoHGLWVEJ)t`WM*$`uV5V8K+bHPuW%In%)4B1G@O6#MyZiezly^d zOd6;2BbQ3n_m6gdbZX(Vw^9xUhMujTk{raybZavv{HbZ3@Tdw5BBYfjS zIG@`&nuw#-bm5iAvS6!TOX!Y1)z>4Cd8X#fe0zQ)r4oLRHca=Vz^>|~X`c643IUZ> zm?ihDD{@9HvOA|o(%tVodleeI`s%~wc8U_qi-Xvd2am41pQByqxV`a-P48>X;w8ei zr56^`hGsfc1s4lqrq*h%Ej-kJb-O`ZW6hXhuTlv9yhgY_I@g?4D*Ny#>pX4hr=$e4 zcita(QCna=`LGd%QRVn-!y*di$jeQ-w&T$U1gM%y(7ez1Pa(P(=RE6PHD;-tEg7XIP^_fC z6?`i=CSkyi-3U{!d0fxRyt<1!K-ZnDed97gH{~ew=n_=kRo?=p6Q5 zvk0z9`NDjCf?n;sd?fQ>gI39~J=3y`pGp<8eVE>>-Mi}U7=A(-1-qHynT+y(Ha_a^ zKHb@$tx%GbZdFW`Vx=&#;8GLWJtNsecw@Iv6?1XTW<}y&OP%Bw8@}_dM_xzz>RKND z^0=Z`(NWTZJ;@;1!ra=>^lD^c={?uEAd7fTG$r;0JHnLvN%;-Q+mgDM_FO;7)x_YG zwVYp{`XNH)gDIcLP~FgLxeI5*OtJ?*Cnweqm5t3xwV&T%xvTeCOLKHbXhxd;)a|4z zZtLzk6-z5(+{bfY?i!=fIDy?8E&uMLM5ETJ#g%bc-gqG$okmJM(KcTT2sIUFgBG?T zoMliO2i@z)W&D9X39Eb7I;f0(L7<<3R#h(>c9%j~NByDggx-$75?HjMh_mQPvACLL@dt}gZM!ez}UmuZjDRu2NeUmgqWgKi# zofy%$_@rU_E0bifMR50VdKPDqiFCJCz1P#zSJwK{&3#wCce63 zgLwE@to4B|wL$$W3j#O9g#{0ax9?`~4vSxRqzY{0e*JnVB&O8KW$l*h=!CIA&&}L4 zUQstuuI;sl)7dLq4r2LyOrsf-CY_I6Uur16#H!%M)49=7$K4lRi7sTW8589_+uZ9O zRI`#?*(flu$C}TqLf%iTRCqxtrDHfMceP;mRHALQl~-RE>)zA~0kPs>l>-e-zOT5r zu*bV4n5^w@4!9-gR?gsNl7m0eR@CuIZum2BZ7kdxd%pan%zBJ&HGOx?B1d3nY2@5E zw`HXhT^Iws71io{D2Dd)^VaeG;aQZ{Mpzux##K7Muv>M71^JSNlJT99yfo+Q@I zaSWZ>xMd?UE+3w1UeC5bmn%TrYmZW4*&Wpu_*l0*Y`@jBmC%E;snM+dN|e@)A>6Bj zFZdl()S3vFQaO+T8+oOJ^YHek%&#xqI@x!6Dqk_ad8T(;Jzgf&kGmd%(B^=0ai&|3 zE;PMlu-!?G@4K>YBhQQ;r^YaaoNhhfHPEv6j^c{4wf6LZ`Pfp6A z&+hiO^2^-A(Qg+?b+=f-fV*M0o%z1&BFKZUCZqa1>gi-8%Ka;Zd75q`Esx>O9eSaE zmOo7?7w(lMA}g?!KQ_^ATDWsm{Rkt&Rf$?7mgZCg%GC7G z8Wx-di;n+I%Jx*plw=5^S(xqZ46xcG2$v^SO#JtxH!# z!^4Rq+ih;*9G*FuB8oZM;5B*NwI4%*t3JoE70%BkiLaS!@Eq}@5~Fsy!-n8{5aHH# zro9Zh}EWNpnxe` z1ezO!V{1+qHI}iX-9z1BlVAJZTvhqBVB+&afhx9{&u8QMW_J0>rH|!bXRe)0-9F^v zq!`G*{V0F`nfL3T8Y5c1dcB;oN945c6|0Ly#cYP(HadJUIv_7w+2$NJ8YPs#H4LkH zYUUQHdqZOXp_hlV@=y0rr|;Ggo1#jn@huut$}u`{i!yJHfh7+YB+zekP58A2;2rgL zNLCh`dfz4?_=&5oeRjaAzM)fZN*6L#FEhI<*y4R7a%S)L`sNLpF#!vXs)r|4TIZLt zDjzN;HkTT*^xe43yI4}E^lorsef!FN`D%IzGr`>h<{!SSp)#K4acS4DX}v+8U^ZaL zeCYj7^n9({sthB`;%MN6^zPcX^>!V^iEWZ`KCQDkp~NNcvnO|;**%Q)Ls6=wgBi*c z3LzY2A5IMR?MSnF*pZyQalv+$G3x2jtlgD{CC9lKBV(DVbV>MQiFVxzre@K?$Mo~`B`rYvrs)GzF4S+xvI zcdx}M9gE>JQ*!is;kR!?M`U{T+C70}g}!6*)Txhl^ei%lWL@P~Q07bhB(-LdIg`8h z#klNkd1Y()3O8<|vd+3Zv$HKj9bNa{6HPvT&U#09!j4IPtvRlxJ0aa7*nChX>ni-< z$H-xW2?8gUCx@}kXQ}N<=SHYt2FH9Pll^c=*UZ zOti{pN0>&IUKoF#8pua-# z0kkqAYdeinI44Xi8KOZ0!T49Kz*Rck{r7ApB$7S2@N__Ub`PE z^qZy}Va7^x@CxhF8tNE_ozY|cB-C?daosS-cX(C#@{`nqo!Svogyf2V9^ z(~d=phrF_^H(pCb|3yz@A#a*soiX_TU|fyj_EZP-_o6`862woiUl@S$6he1 zUwxV%p*^sTqd-dbMoi;`ZJ803E4ydY*uZVNMWuQ^ zcXP_;9EM3+&R-|vm^0XfQ|`NM-cv)GnTK9m=W)oovw7sytk_cL?M<2fnv;|(>`U=> z7ox&ksfEso)LAzbFdYl__SvKq~$+tA%TaIKrMd@1@ZnK`}QqR$WjV8+XNmX z@&}@awA=xDm90-@uqVJh8N3^w00fQ$?$@pJAwZ@*uuA|AV$i8gqOu1z3% zD1sbeZA>~cMLvn-O^!PwCrAUGtFdycWxw196p65q+?(lMP+l3$5OiwSK9q zb!Sm4;3LQD0s}s+hH)PA{2ar{3f;eQI*t;deVlJ( zZW*rC>U4U7 zxD5r?oSnE)3o5ah#HyLL`#r`ZJhU{uMi!Rsr#ce-UBw*KW3t&w!>dvtQAZZ$lvo=D zD;d}$-1c$qaP&OrP*a(tIP?VNZwl!v_;^vJV<_vuSWE>nPDxv2um&ArI<3p4W1#u0 z&yz;qq3+%*s>f=TtToirs@-h{CQR;oMZzO9-pXzeG#TGIcWX8#3G>iu6Pny-({f+R z>~Ur|FulE0t%{$K(r)5Xq+r63CnCrT`il1)RYteHass8WYv$&i;)IyRJ~>+|M2j=4 za^_SYo1)bF^fgtww2No98PGj>*lkB;$Y2$(vF6ijR1_C)D}hwT)b zM2%yMC@!-O&OfGK4c%YOkUb{|JwT(n^FB>g^yM-6OzM%D zVAXWXIOoL-*0~X6<1n>`bTE;oIeOdt>lvY;zNVkNwQ7nDm88! ze|epq`BkI3CCWvj3Imr&;&fAAjq#Vh+@MzKTCmVebDm4{JOnBifqJn zM5z1&$6blHx8Fyx3-=S*>{v}I#NRx0rN;(zWJK^OwM9;9ggYe4hVP}`^~%PHj*G~= zVl76;~^533utnRFkW&yWKWMlDN+R}uXow=vNk%ciSft8vDnz=U0al|&o z9*a{5&7%YcN|yG67hDq}4qf3WqdCH;k*CO@_v*6IwAMiZNJOIgiZXYF*CT-(3a+=X zTZ$9U@(u_kINjH!5W)mLo9Cx8$**Iup?W%TA&%E_68`pjoaYC>Gb*3j>0wRWEJ*e0 zcQVcCa_YBVx}9mhf#B`SE*87ue8!$nqgsDuJDvrWH|*kwk##-Oyy2L*oj-Hjb4o12 zWkkN&UANt2&jF`1_5r$&G%O+qd0f(K-XqNTF1^Q~vyg@0yakMA264n| zn(uPEu7yCEO&*(%824_*vkru=2SYJMG}k1;&9YQaR4`?{je}h77Q*zr(cN8<7IrH+ z;VG3*I>c`G0jN&=L}J;R0n6a5BA-_(dx8z)V8(%$m2DlgWu4iN`ymn>D3}&AhQ}bbH4gmssc6y$$Es*k!q-$Ek&e)*V!9rJj0XhK}tz zZ62R7Xyq4Kd$JR0axk1V-Ul8YBKCCIth9>teJr88*>1X{*z({9tOgT{zsVhU)sL8W z7K09Y1!hfYi-ltaSQz8PW$#RzDvi4x>S5k1Y_eD zP29%tsXm82-D$(c%$9=fX|p4bce3qkQeI7Iwy3g;ni~n14iktTIog4Bs>Q7$pC6pO zz4YG8wk^-cZ-V`agWB=0?`Eu1uQ$jae6naR^>t}3{4Boy=<(n~j8}HHQ-z<7ML)We zWaZk+g=eX*3pnY&$(dSL7Z4PDcQat|{rp*e){h#hft7WiXU1#MTxVYfiR(BeAsOzY z4N{~`D$lM5i9c~l(&Y)q=ae2Ra??NbzPL2Qu+pEtKCB`zx1zOYChNspZujCobjj47 zBG=SwFBh)|U2~o3N#;$&=agGt^U$aIP+XpE+z;C~syIw6u90KEceUnCMok(d)k4U+ z?7cP6%>L%p9%OcRvLWiedCA(ik!g@jsJ_(`tkqGKh2F=THCJapzdkA1U%XA~L0PEn zL}qT+=iGJA#8(bS1NopE$((9D_w?NHOx}wFZ|3IS>`3q1qrTwwImVkl_)y)Adz+i3 zMT(E~Wzw?>+;_HNWM9(tXWPd#OhPZH9PKL}4qbIpkigeBH0rKa1$rD6nO)(}#_4f3 z_tt9kFW+mhC0dPs2~zD}-iUV3J#c7rl%b|#apP$=`1Qlx^2H5_tOH+0Mtgy;`z1Eh zKXB;UE5xrC)6U$Dx#-RmQrE-op7xgeer8JB_Hnn}=t;&_bz3u6PFa74M-g22M#MFmg z9Pg%hR(|g~f8+DH5a0c)j&z;1bk|!)ViAw#R4f}i9(B&Sj6R--`@(5{ z+D&WL*HYrV`VD><9vW^Ho6Y}R>xfF@n{0Nb{zYac3G2Cax193W{z;xoN97v^rivcy z6PM0)(z#B1UUbK@WywO`Wp-?clQ=S;LYO79?_?Tegv&4cdetmbd{^k;s9oPEoubFX17SI;E?Ke}ox~3@C zeeK=3`zT$SjP}QY*|~hKcT~0d)A_42D&srlOsI~O_!nN(*ex#LAQMobvg7Qga;sie z_DPxK-uw9l`STa|dj(+DH|8gdQSFQt&{Sa|o7_O5xmbp5M`fc3!DT&L$AmcUhJpEF zk2bflM74J_`gtbtvx_(}Yv<;U(S0n=%X-s_V|%Cxg>!)y%SY<2eRfe7gByO@QPHN) zBzdx==L}Ne(OcAwyIyjij>c?vJKou9_NvNt;1~xD?B$Jd80?b<`d#?mci7IfBl!qJ?(J=C*xsIHoaC++St*_(O|LYwzJi#lG6chVvIp_j^AKOz*hC*DSx0pwWB^ z;*Qqy*~=Ziyd?5)rm-mN9hGL`h;Ub+^W?#`0+@KXKl4mMvW9-e<^B8Zdq1yLwE1RjjvkDqnXR+y@STDZibbd zToV6+SkDjndVD{DyHfq3{`;FP&yG2-dGfKK@Xv$W6*if8E{tsZ1BjCTIiyJiIBx@< z>JbHkZ@ht1KawzMNjP{KND4Se0zQoaBC09_hr&@%1PqAOl7=CX2cVe!P~fx%2%Nyu zzjJa|0V}k%10FsgCkV7j zV#dM#d!?Q}1UFKZ4)#(w@@Xt@DVQV-;O`sf?w;SNJ-j7JQb^k4-KG3(JpoGSPn5d4 z|4i%d{u8;0fzv#RU`arAGJ2$kq{zcaRsWCy3I*&~yW0}w zZGkg(F7~8xBL!$l7UX^F7Fdg}u8fw4H__I^9;c-$Z-GY<0(H&Qa9&F81}HZ(oC-qi zgrS@qss5iyw+eqd(Dgqk`~&G$VIv?w#|0eGDsHyk-txd1Fg(B@nX6y!$pXB~0G^et z!%XtP!((JYd49|z_s|cfztv~JXac@{ACU5Y4QQZ!IY9z2T>mt6{pp~m3NTf;;hn(Q zh@|?z{372b5{UHoNZ7-GSwTc`9;C^O>{|KuG@8Gja-wTXaQ*2JC71!YZvjsX=xG8P zGWPNI#MuKL7e`w+Z!nMS(J^zu`QZo#1e_yqWC`fLzg$tOE(Cxc9Qsm+;h@t7DDcCq z=noWOgaUE3jNrL0@*f271_AyjN9ON&;7NAy590Ax9uU|1YaRj$xQu_HlLlgEf6YSy zXB&Ub!+>$mzm&xSKB-^wU@-7>^v`rKIO=ygX&^TES2{3e_t!iW_BZ}u81OMGzqEn< zU0yg0^tb&?2mb>vaOvOm0!K>!)&?Af{Y|HEFuM5HHvXU&1Pq8`{)G+!NB-6ZLK+PJ z{h4P33I#Zkf5}55f72-fi~X$)Y0U3BK%tSp@r=TP2ae&ZQ~ z{Y_rL48SAKKeYkmp?K#e5-uM}xMAT6CpE>|s8h^6S?P0eDE ZzrRI_E(a!e0+BT40V9W!PP*!={~sig6?6aq diff --git a/docs/paper/arxiv/figures/verification-funnel.pdf b/docs/paper/arxiv/figures/verification-funnel.pdf deleted file mode 100644 index 5c5c38be1384b67fc04481d77cf0fa9ca2fc8a63..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 20215 zcmeHvc|4R~`@cxChKLrLY}w72mF!#govhU~#u8>^>?+Ap6j7oip=4iDlszg-%9=e& zLS#wVrGEE4GnLBc`981j_w{;y|2#8O?sM+zI_Emqxz6@J*Ezz*8d_2?X*84Y8Q{aj z1cN{!#AC;q6cm^s5E*lS4?IN180&;5fkAbmH^B=610VHV97qsbxGX?!YR{yk#Dpg} zfP73~%%2IEd3k!{yv#lEcvB+L3uHGXy5SA59uRUiKv)Coh4*rC2QoABz!CroAkG}` z=}rQ3_XJtM2p2CmP#j7KNR&u$TF;>A{xbv!1Tg49Y!MUzfdG&Wob^Th3yxmA>PkR5AP2Yd0h!XYOsX1NRSQk|LK4GpBe{?xUt4H-SF<99?Sp*6LH?| zK>5L-F@#)gR1J~k|E&tanEx?BYCQpokyQ*PrV6qBQJ+Q-8GWoj(VJRx#%@>~9?Tvh zqvl4$xdPNmO7>I%pb!`!5FAJc2|+>75DY{P0s|sp5ST0k28X~95Ev2yLqT9@2n+*( z$w6eHWL{$@2*m)`*A8j};kAgAf}43C^CDX`XxC6Mdo?Tx555I##Z=8)LtVlU@2hSA zgG#9p-5jKJ8~~ABy!@rKL0U}$j_BY5n%wX48h8>8uzoDT3rqp8gO~RiKsP z#Q;ix*2vQV?+F%2^hbfHFU-6>JOI-H>kHHprB#4(fJOCm@$e#g0(AtI2S`QZ5Sd8k z5ZIES5LpaZpD+jp1=JJpMWT^F-NGSASu{|?2nY%;2h=kXfKAzy@&?HX6n+bPHfBr$+ziVL{DrWH`SnhnxjaU4~e&N8P}($pmg#Cla7Z z7!-~A!Jwu_^n=&}+Le-p!XR>RBm@nmv^5zHj^S-y%by;NEuHedVAvVQe+Nc;UE2? zAs+95clhy^BoFcMB;rVbh=8Oz1Q#zRVPPEt&du8a|AS^D5;dTm5q=SXc|g>?NnS*E zfS){k$wNdb5fHLWkRL4%0u=*DEb*Qs7a~C(7;uDzH3<&tM0cKt7N4xZG)K#qa=G6C(w!3F0<4Uyw;?totY=oH`-DhF>5DkZry__K?EIN%)tjRHE| zpqu#ha(!1pjs>ig{8t7nJjDR11qKr&$5KlPbP_+K$h8QigsF8%4!|fPf(?Y^0gNG@ zSb`ItNdXFclpqRPz=u5Z_~Sj$7|EqS=7@8~dXmXu067$jfGLqfvg8m7O$os%AtWV) zpoBCjAtX73Mo>a1a!3|R385*JfC!*K7z!md6*)?XN~uZ?A*rdTQQpZ>_|z#OI3=V( z4xuzCAx#ROEG48xc}GbF1}Xq(0~AUarF>`#C5&1=N)9ke`4AL2VU+TrQRFmXl=7i4 zln|wSD2kR~)bgoQ-cibjqR0uOln+Hsg;G8Q^&O>rD5@-!@}W>rC<>4VFb`_208eD= z{AnjRZ%=TBMmAxxoq!hLLcssDau1>h$WQ)9ZXQ2c8rWEjv^MA}$(_kxU56}2PLAB; zZO{t?_WpC00JOQEgRlC!F2ye?Dn8JB`yu}N{f`}_BTaUy9m;{Xr8#q_9)Fl;e?)XC z^Bprc(xyg3y+%XF&yu4CGQ@K>1!Dg6%rTBQF8}eRY)kjNuI%|GO!3t5DNDsJ)2?aa z2mHf8+O7b%mgKdu=If>K#pE}Or1xqM zy1lB?1?*)o0hR6#Pv%!g-0D+KKMz~0u)M#vQ>(2!Bk;l8bPKfl+XTGz)a`jjk0S!7 zw!o5^!b~mTCQgxUMjhi(@m6{hF*@oamUMxJ=_7etqT~L z+MQnnRMs86m%DTO9yc93WNlobxL7?CxIzZmno9Z_RE)(0X?V8lP{jd^4_hs)lxoiW&c?YB9Tr zUnxh=9_teIiY!o6Q~a#Rq&OXDxL7g4{lTKS`DnG}s3%kNExnQE%O+1%Z7SiFAIoP4 zILiXPj&!f_tZd^R4G9eq)?6L?*50&dpsTrPr^RjvtA3%JtC@J}LiE90Z+pWoKZuTv zaKM3{kWU;ZW7LJ%wA3Q5Rj=_bsO&$f&#;Z5wP-$?p_PU8lhmsb_5eC`mJ3q~r>hmW z{Y351HktBI^Ig0N#Hp$4ALQPN=Bu_;^kW|`qcTqF@~7nM=h%wp(z!SHZ>d#o&>rBQ zO4^E`5vW(=eL5MLRm}b^gXK!i3Y!qex7l-r(m^NG?E(y)IbT3bGDK2#JRu}kNw)jnwcZ(}l%%Pw0`;{+KwHE_=7&Vt?xZM=plnDKFyfhhvT6_%WLP7sM;yHLh zkD*&{iGJbb(4tr3soA+FA8HSW=Oom;GNJtr!ExCscx{zFy(*<|FY?(}S)=6kxqzz# zrTGk{3)}X7TQU>nZnd=zk$k~rv9D`O(!IU)k50O_ZM)Aesq|bq9^PhK38ya#Lloj| z-x%1)q6zZs@(`ZYJv_5mDao48rf%Qj_YjHV zamIHG<)jVgu`cD@@m?-THI=}kBdWz$kIBrR>k}pDCiNm1nh!BqMTH1QSF11#E96}! zzF@UeaL{J6p64EDa+wx7%#puW{*;Az8>_f$$cu6=LON4hi<#1SE#v|#A-$ljTYXEL z(u3}Y!Am0FrVJ$(xH6C23qTl$e~x2tnBMBKH9CP$^m|I9KUZSP_yAJ~e1`y*(L^pU zzmotf5%G--ZZc_f9nkfFqia+-9KuHyJ02!?8I7GFYDjJ6OSnIF=kX$4#rM~90fr=r z*;r%xBaeq}I7nuAw$pR6c{DPK<<~4anZMgHH%x5Y)|OkRv~SVn40O|ur?#eyd+^KW z5R;9j0r#GKOKRqDyPt~339_GD&X_Ug%DSb@$6n%5R>=RLaMD4Gu_rYtB{;;hyz)WZ zr-!?C8F87@lrEWUp69l}473TniQ_WhB-9396ZS9ZfA8VVY4d$eT*xG6^CZgQ0s%`C zvzrg?5@YG#9-Qp4qnS3KwsDU@8)SLZo9)QiH1vhc%bBv7UBb6wxwsqrpQGPzIvxS< zw;U6pVLWjro}LkPzIV6iL6X>|tNS+z^+}u)=SxxChRMO5!86`?Bt&@bV6_bu7>~0( zjTd_ryEZy-PXre!tsA*)^=Ok*^OT|i)bQ~09r5x^Ep#Sr-Fec>&ureTNL;yeG??pD zShd6@wWwW*S3a`Le9RtZs*BH+OghZMnN-d&qOug&oig}HxV@@5d1`af?h^CJBc3CL zQWiZdsMhpcDHvDonNF31G_5>GpXDRV+jp_dckTYj7}SZcjAy%#Rf7uE(pUw0tCM_O zP`M1pw&ZBuVUnho^<@f+VQ5~|w%8ennch4}YZ@X*!*D^oVo`xGoB7n){Ihp;MQ@m$ ztvT)tJ1sL$o_*p;nV!^hZBSq9*B7O~N-JGB^5xb$-?0y+bli=@$?q1G_gi!K$=XFo zSq<0jYUH82x0nBM$|`42#abJk_p^W=Q6bgoy4Q1Vd9V}67iV8c^cs}B%hfBj?K%>j zYh6XxST)S^7T&I_Se1G|Ec&pGL)^)Ldi&?jatXr>Xxt~dN3Wx?g3CPoYNNe&kW$NFnq%a@<=pLw1voiyzD zfPc+{S=|5j*C#XQl^mBxGv0pCD2uW6^1#*{Q2FpS`1S{ym4umd!|b!$F-7H$+pJ%t z)=DTMMLZRZxoH;8G*~| zLA`FBZx>nc{AuXyr8Upc$@Fg?r{~r7u~Z!ys{@=Ck_Aec760 zSh#TWxnA~_(=X}23K7AZZ_K)|#2p%Qh@d&zG^j0_%Jv-B zNwS7!Jvld!Y)mU`yhSrwV1Erv^gez1^393^L%s|toIC9zbiZ|x*2Iq+rX|0Br#J!^ z%iewYs-9pQHc3w7KN`IDazT; zw~FnhD3G4G({V^xkpv*mkQ}iR8rmTwJe$L#Ef{0 zS*#wv8FiKk2NzoAY3!cr%)Me0QQdv`eB*@ujY$_Zc?UH@pGaTA!g*d6+)(iVvvOPV zpa*MJ$fZ3ECs%sITBf;f(bPT^pxah=$apaJ*h352tfDcCh_8wlKR#d4FYGNs#k$<@ zy#5>EJ@(yV6^+HSe7Id|FS*Yi|wH)FR1l zQuyljrYnMPn=s*^e{VD2i0%F8=4$56=Smb^x>Ld$isf-H+Am_Z(hW35&2YRJTj=7; zQlihKO;Lp8ew@iQ8*4uGrk3ZhT6oYTTZM-|@4@}HBy@+$nY}$krGoE=Gq*VT;N^!# z5FZl}R;pf|uJ(J)XIvC!7yKtw1KVi$=6BHE;@3A~SU9Psw%d?CKB%IchDBiQD_U@k z2O@Oi%|>(gjgVw;J`PUZf6mc2K*lWeO{@$Ie-k4!ht_zzz+eyd=$VB%!LB@{Yr}h`vGr$1MPqu)}w1b zz$WsZ$G?HvZKI)IpBI@xE%~Z2I9a(D7c*zfY$-Q9TI{; z$U;zZ2(s0#2kxX`2x%xBB?~1lfB$_14^D1!WCTwRDNOOsdTvS+&3UE>Rr~n@I9R=V*QNcYbfJX)Qr~nrQCWBG|JSr$i0r1e&{3rn4 z50H)m;GtA0@ER4sqvSvZ@X%CPj#@q{phg8Ksqd%&9sp(jhoQCK_phiOr7Q~ik6q>Z zVSD{Z_&c~8rFl3wRA%Ow?5vst~b?By!#3AE<1YPZa#t8V8L8q7^C7HqETN?#FwHl*vHtzeLSS^UuXCs*$1o`_da)ReS{>=4en z%$qKBIBDO7ggX~3bU0^M`1+qbGkT5}pLKLe(zqLUJ|yJXv2+aqL(?p-OSyJg&$aFL z85!AKha_ahNthK>-?1BClU{0U%^Rw@fzD;_EmL#|)6L`D-_Y@Fmeb-Q!lLx-lhTMM zBa)wbcG%fVH+cAW^=7r3&V1%xSq*KX`zlWRl>^#6G9wO82)q;=e&!60w7UDn$f)jx zCvL2duGbHFV)it<4<<@M=^TksK;e@PRvb> z;mbX~Ud^{j;*C>JryllQnh$az*gd{{ru*&~L;LCjX`{jm2ResO@VhFs^*<|@*PPj5 z{t?SQN|!F|9}srTnpYN^9YQxfVQH(^5gYP~;mRYft_KW4o6B#vr_DL}$P02ro_83( z#e!>(b`}!QyL#hpbkiFqc6(%Qt`kciwC*~^o%U+1HR2)8k$&R<<^ zaG7v?&R6(2a&%7)n;2Bv;Mqh?+zJ=>3e@x2(N`O*Ke=u(5A!wc>m63>rzcjNc5Zn4fvR;b}W#!~dC)&0E2 z5PR8)v{m~>7u)-toAwU{-BC<>y|Ypvf>k`x`+E9o>_@lNqntiHQ9aD*hN%&6b{F|d z`Mj&q>1BopesIgmyIPpjtu?h}C@A}sWe1UalcQdGX@^ZmctCrSYm4-GnZBwoDRY^h z%)YK|U#u63IR%S}dR6B9-8Mv8x^i4Hs`SMdp;O|86H@_LMtzY=g8Qc#UC%;J5U%&}SmU&S2AxoM@~WNfg`6PCGUE0?W{IOb_w>#l?)!u;J36t~ z8-Dmsm!$Ma(%qo(A^);(1-n0r+ib1uCndrXtg|9jKA7RVF9cOc0)u5f9IY z=dQymrxND!GWl|i;K{hIkRrm zS{TQ9!#Evg5kc4XAR*q!7}uEXUB2>{+cGAiW<^Y`Xo_oW?8J_%2lgs?rPlW_nwl^> zMX@E?XJ+n5O)Dy{NSi3J;&#^t={jzF=br7zEEK1elky0rhRgqUyu zy_bejl4sUh!5L;>^o=hM3|0!$99o!93EE=0Rp%K1f%wfT)QilF+ggc%8(@Fj#VXjj0O zdGOi{;{8%6&&yq~(ITw4-)_rE^(!JeuKKF5#sbraC?EMPp}~@PoLwi6f{%>WW$F33 z)+1)qpNGH7WXalnO0K^JOSszoM3pU_m9Q-!-2V8rn$0h`=cf;`RVZviOYUq)wysy~ zpsh2Hdp`Jeb7XADmMXoXN)71x*wpbg#I!Y_G zplZ8qJe7H8y+(hVy6Bf$6ZNHA_2ZCDRUGaLxmqGD2R%a#Z=O)7jb|(JVpx^9>N_A% z0PR{SdANTITMC~c_q!Cy^P>8Y?A2?pqprKEVM96P*lMf$usNQ$qaQe0zR$>T5~)`% z*u@c@R$iTYdwd+`F43@mcn)ol<_fvV+9W4q`b;c&|M7S?PHlL)*n7#TZ;VlU5mLwW zwd^H@-lvXrSL>Y0+-q-`z*J3_g^@X+kd^Bqr1OWNo2ZVt~s9#vhQr;$9Pdqkk)fy9Dl zNTJf?9@P0t*Ur@WF^yqf1uHXk*}cb@e0cEB(@zAxi#d1ljv99V*Oo*#nECq+3a(30=L5% zzoILj@Q!V?#QxpkhO(~w3xnAJFgj>ve9+8VLf_>W-qXv4;7u|E+@c(%0FMZ7H>~IX z8aX0+Nd18v0eHoq$k9K!^N`ovKOEN9&zHd$zmTJUEj&p3AA^v9Roli85)$;}0=`HT zfJvG|8IY2{0Q$tg;ffU-i?tvRX zq_F1ob(*&ZaRi;Y6xH`~ZNfn3dfA83BvMN$Pc!`vHaHD$nVXSSw5w6)U7}U=DdPJw zq;<|~@3U_Db@3M(k9eU^ zvA#<614%O)jXBd)^P7dz@wsB%K29>`dtb^gIP@YfnG@#h2tw$@1H?f|+ffs#FNtQA zL~i^0#-t_-rxi)uzAYDDqVcKoWf37!i+S_mDPt1H#nW=(J-`HZ>gS!GV8=B&Fi_9*|yYZ>x0 zjN`k5(~j2ew^DxLHxs7&aQ42czuoAPy2fti8>R!H1^t)ZC6)*^k9($x7gLj$^8otd;D2B949m>8{ zn|%yEwcOnGDWfy5*oP?r%|gq>arAt(^~?jMr^-H~uYKeLX!`prc{l3qf5+>8_4Z&t z{jYY3%(XNOt<`?+68#?sd%z_a=!X9s>?!W8>j%7l+6DMc+J7GGHx?RX{?85a6m%cR z@{fI@AA>v`4){(3U$`9T)DO7zBgp$hz}^pKhX@Q)9KYo#J48s-?~{U!Hw(z)?gkEf zzqw6squB&}H!_po(aK+D@=ue3f7bN6$pCrmUbp{(0Hy=r{tM{qzR~{6Z!>}a*>U=x zw_O0gbLvd%*S3o+#Up>cwf#OJrp^S=P;$EeoeBJRCh+G>0R0cUE(BhMEz^mWXu+|52wC4L|Gs0P85^X( z8_!$+CZxnyZR*VJ3pfv7tUYaA$>Wl1FE#L+?YLpF?~0`MK?9_s+=)~6c`wB1>$*O- zm%V#D-jmYDhUuz7K{fJJ4}lI2n{HCR-C=vE*8W5Np|>lZ)ddP=yfU#gYqqd3pGI0* zm`{ynD_(wkA~MN>%kFHyn4IQzCwJdSJC#qBFXS|v>&urcV;`P4eJN`3gz146#g13b zBNLl1J?%Wu?{}$zq$OA%q0myS*M9GmM$J+1Y1cFXzkainsx0R!dxE z`rVS$dtE}6;wOA+>U;gNI>$K;XdE`tI>#9gRdQA;=f7eRY%!M1$9kGus;(OA-=Hfl z^)#$dZr>X*w@pL9X3CRrOk~%Er)g~;c1}I%Jp-%;#kXUz%Asr58+2)CR%IjHBxuY> zRm6H2!yk)a_baso2uUz@L<$?+>2OxMd2=wEH||xq>Xku1g95A8ui;^@hVXp+p+#x& zA+u-a)93fP7;e(7I^B*Oterd;O;_;Hm>r)rj(@QwRd4O3?fz}|{o>r%YUxmi z^sPj)ty*AppQ8EMrF$$bzD`E`j%Z9#Xd~(BVtt zy@m(w9PXRn)EC14{j^WAB6Iab`8&+63GXLp(V4KvR-O)i?>=VaD@J^I$@^Bp?vZ7< zs5v6yLC8h6Tg)P@+c&)uyuG4QGk`xr%PssNsMS3#-anVV)WUf#QLA-9?9BE!gT*%is?$s*ShNby-|>9&f` z-z=}^PVHQ}7WUj8d5VAcx1iDsbNaWLa*t#Ro^=5a)q%X^ zf8_b)Mjh4qNzkd7YCYN$ymiXm7V%cBatN+H!UpsvdS=1lh8#NYLenJu&9>U zl$zyyUdxp+t+#H^3fr?dNxRFo7X%tiptSU>8?p2{+qv zv7bOz>0#s9_7(`PY~PC!ZD2+-cNyv?(Xy@b&Fuz%bY(tI$ANg zQby!4oi$-!ODZIUs}~(2?`>^mxG1o7d+qmwcOPD7WIwva;Yf8Ldv9rPSG*HPK=Ph3 zcTT%=tr*Xw7PESnq~MjU#d!JOA4!=8cZ=;&r$OR2N4Gm64e;hs`{XAv)<7k$x3~;C#JiqJ*oY&HE z2k6zQf7QSm1ZV3+H=-wSf)<#9P*2PPkvQNWGUb#gNCQ+UnG~Ghts`;-&MZ>re&pAV zH@QC_xtd0@^Ovis5(WNF!PFuzR@8E(ZANV7m+mw@)h9gjN!0|>XT3Qx{ zgd@r31{|{n&TdnVM3X2-|0n^nNs`ehbwEV$lp5gHMHX${6qFqEmz~2f$bTGj*PHT7 zn>KhnYNEv{=7*EkPLR5l{g$&XT+n?Vzg_LFU&IclThib8FWvd>ov7q9J1)a# zwh85HcumPyf9phUBJXO``*G6PTjYHwVt(D~t@0r6y=6gO%BLzO2hmqP7MxsZA^MJ& z&y>Xt2336hc!{xmrrK!lX}7zE=8vUvoYS5|O1WG{31PydsK)#9qo#iTyE?vmbYjcO zW(MCJ6OZZWoWg#^*qs_b{jEvn{1UT3)It7Bvi$P`j2Yuu3jFR00YMdnY2W~X^l5`b zK_5?-$q(>77-7@M3^(WGrq6%h<##)UhdYvf6r1Pn-xm;aS+RS!tFql4{rkRMMh^#P zN3&}5d|&!?Pj$QSj+OXdPONr-SswTd?K0D7*e<8KZ&tSSVZA;pvZ**hz zvK9pF*8+zsick4>D1`~E$S*XOoZ9aiyr-?sn%WT%;1tl%EXQ?Oo-4ET>7-f2^ zV=f}P?x}}%Kr`+&ozsJ}xf0EfefI4yd7GZgAOBGJZM-6(B?I?n&SZ-wvwS?0a6*@+ z|Bke<0%xlg<*OHGd}2FB&n>VzC$K)fE*HCtOX7YeYjI6O;np4B?y&pwrOt7EWD1cm zl^!c1^g7wUOi9D1R|S4LZPA>!nzU;Aagb%WEM!Mlk~S|i(B*!W{HtyJFKkzqHDml_ z8l}E0>=P>HnVeNpt2i@nQlu9NN$sSYFH$p)KO)wf{<4o#@w`Bhy+}h{4#Ft>R>?Ci zL*$(G>!#r}4_oc$kvLnFX337p4$gYfyB(q(W(qJ&0q-eIqmH#wt_~-z4BJkZyx5ar z$~vi+xd_E>b>Y#lwwykJm@1^%kphV*zGU|*->=L!kh5zY+*6N4yI)#enwZpbNFeU1 zQ?lQ-wTAin{Lm5W*j=ihO4YqGed}J$jw{$PpHCU?jZ~?SVZ4~a&C$=+qcmHQEgCbK zQ`^LP5?*FZ-%8&WkS+ACEL&#f9a{Td^Y!l6ms~@KVy=&5en3PX;D7KwR{AJCr&x%z zd2p^k<`Es9lIKdbEcvFw8559c!Ev#_lVn@k;w#`MaQi-3eQ+0QuX3acnhnA%%^^6iaRY4d+TbJSV6|gN5XE4 zDM5SH>SUc&_qi@7o;!6r;mf>bjkbAeeD>vBT5H07#kt3ioZ?HLDw@VBmMRP7yz)1E zedvY!moLY4#y$={C>#5FI9au1zCkJWgk@u!Va~Oj{EDLT$1?rWbIjW?Q1T$GBxO~+?gv+u+r|ge8{2`f^wY;+(kxUxx4@c? zN9^gsXtxIMm&Xz>t$eqP?g=v-pU>a^csB29Z$-6#B?~&(P2=j9*u0xh!lQMPwBGx- zv=DHFcV+|uokafI2TQIj%3c|%a}P_BB2~v;Pmsz>=u>5fHgMSHN2=y>tZcSnx*EVK z&O_)s{;cY8nPd!~SmeB!N@gnk{UUZ&rtEgFAH9K~ zB|?dar0BTKE$9MKKaptS&7(RQ)}{_;i5V|dZXFQ?@*~aaMu8Mk%K6J1d+TTP-(z9SQfkhLIyZP58lTD z5!H~Fg#sRZa2OJXK){eF2`FYi6ga33&R{@p;Hdh)CR{%g_#OJBf>CmEWDJc0Mj7C} zunt%+ESUaJ=+??Wf8Ehr1!X^A*5nGDe<%BgDhkY$R|`N~925l(Aq}7g9_n~!7wrH~ zyjg&uIWE8zC+8rbtfb&4@8|B}4x|q8b9W<<%fgcbfpe+cPAodt=FN&f7kSAFf2i(s@J-j{L$VocjWbl*=I7l)u zX&4~i56RsnMLoivM>w z$z`HYf}DWr2ucpVKT4d{tuM7*yY!Wxaxg-aNq^$z)yd=(YR>HkhjW&UxO#lOM4j+Dx53M@@s zz!t6Uh9!}dfa@=afPfUK{&=SV$Sw~oaj5q=C;`_;DS+i!A4jR7b*4|PGoUpAPd^)^ z(ys;$(A4?GT7>NsNENMIagDP`$`_r-e} zd*U5|+X{f{`?V(2aPb7_!KN>xFAJ`g0ETtFq94G*1g!heOrZY&fyEcZk@aUB zczy!>KnTC$FaRR(M;v&%=Fd0;a8b>laY*piq(6AkpxeYBaTqyZVfcF-437Sb4hC${ z{zZq9`zuZsjrl8IBpNt3^hX{@xxdpxA^z3@3JC>9wBO}Jp^<;-0F4Gt1pQ72gCY>X zTJ<+x7!(EW;{6eaLI2Gw2b_8OgAOJOF1ml`g~E;afyP7d*> zUSM!wApVmME{pz)4gvj3f4~5a`AZh0EV$neety~&FvcVP(g9Kq^%pN1xI*JkUJUea z{mH?=+k1ZG;pK@1?)U*NWFcQJXXX-s2Q43HR^YX85I9g0$_X@b8-#Z05g^8WyPMZ01E From 33436b2f66fad600143a203cecb286bb673e1a18 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Sun, 15 Mar 2026 20:41:50 +0800 Subject: [PATCH 37/38] update --- docs/paper/arxiv/figures/role-mentor.typ | 45 ++++++++ .../paper/arxiv/figures/role-orchestrator.typ | 58 ++++++++++ docs/paper/arxiv/figures/role-runner.typ | 36 ++++++ docs/paper/arxiv/figures/skill-map.typ | 89 +++++++++++++++ docs/paper/arxiv/paper.tex | 105 ++---------------- 5 files changed, 237 insertions(+), 96 deletions(-) create mode 100644 docs/paper/arxiv/figures/role-mentor.typ create mode 100644 docs/paper/arxiv/figures/role-orchestrator.typ create mode 100644 docs/paper/arxiv/figures/role-runner.typ create mode 100644 docs/paper/arxiv/figures/skill-map.typ diff --git a/docs/paper/arxiv/figures/role-mentor.typ b/docs/paper/arxiv/figures/role-mentor.typ new file mode 100644 index 00000000..19d10ac6 --- /dev/null +++ b/docs/paper/arxiv/figures/role-mentor.typ @@ -0,0 +1,45 @@ +#import "lib.typ": * +#import "@preview/pixel-family:0.1.0": alice, bolt + +#set page(..fig-page) +#set text(..fig-text) + +#canvas(length: 0.55cm, { + import draw: * + + // Helper: auto-sized rounded box + let node(pos, label, name-id, accented: false) = { + let s = if accented { stroke-accent } else { stroke-box } + let f = if accented { fill-accent } else { fill-light } + let c = if accented { accent.darken(20%) } else { fg } + content(pos, + box(fill: f, stroke: s, inset: (x: 8pt, y: 4pt), radius: 5pt, + text(8pt, weight: "bold", fill: c, label)), + name: name-id) + } + + let elabel(pos, body) = { + content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt, + text(6pt, fill: fg-light, body))) + } + + // Contributor (human) + content((0, 1.4), alice(size: 1.8em, baseline: 0pt)) + node((0, 0), [Contributor], "contrib") + + // Mentor (agent) + content((6, 1.4), bolt(size: 1.8em, baseline: 0pt)) + node((6, 0), [Mentor], "mentor", accented: true) + + // GitHub Issue + node((3, -2.2), [GitHub Issue], "issue") + + // Contributor ↔ Mentor + line("contrib.east", "mentor.west", + stroke: stroke-edge, mark: arrow-both) + elabel(("contrib", 50%, "mentor"), [interactive]) + + // Mentor → Issue + line("mentor.south", "issue.east", + stroke: stroke-edge, mark: arrow-end) +}) diff --git a/docs/paper/arxiv/figures/role-orchestrator.typ b/docs/paper/arxiv/figures/role-orchestrator.typ new file mode 100644 index 00000000..08a9e79d --- /dev/null +++ b/docs/paper/arxiv/figures/role-orchestrator.typ @@ -0,0 +1,58 @@ +#import "lib.typ": * +#import "@preview/pixel-family:0.1.0": bob, nova + +#set page(..fig-page) +#set text(..fig-text) + +#canvas(length: 0.55cm, { + import draw: * + + // Helper: auto-sized rounded box + let node(pos, label, name-id, accented: false) = { + let s = if accented { stroke-accent } else { stroke-box } + let f = if accented { fill-accent } else { fill-light } + let c = if accented { accent.darken(20%) } else { fg } + content(pos, + box(fill: f, stroke: s, inset: (x: 8pt, y: 4pt), radius: 5pt, + text(8pt, weight: "bold", fill: c, label)), + name: name-id) + } + + let elabel(pos, body) = { + content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt, + text(6pt, fill: fg-light, body))) + } + + let gap = 4.5 + + // Maintainer (left) + content((0, 1.4), bob(size: 1.8em, baseline: 0pt)) + node((0, 0), [Maintainer], "maint") + + // Board + node((gap, 0), [Board], "board") + + // Orchestrator (agent) + content((2 * gap, 1.4), nova(size: 1.8em, baseline: 0pt)) + node((2 * gap, 0), [Orchestrator], "orch", accented: true) + + // PR + node((3 * gap, 0), [PR], "pr") + + // Maintainer (right) + content((4 * gap, 1.4), bob(size: 1.8em, baseline: 0pt)) + node((4 * gap, 0), [Maintainer], "maint2") + + // Edges + line("maint.east", "board.west", stroke: stroke-edge, mark: arrow-end) + elabel(("maint", 50%, "board"), [ready]) + + line("board.east", "orch.west", stroke: stroke-edge, mark: arrow-end) + elabel(("board", 50%, "orch"), [pick]) + + line("orch.east", "pr.west", stroke: stroke-edge, mark: arrow-end) + elabel(("orch", 50%, "pr"), [create]) + + line("pr.east", "maint2.west", stroke: stroke-edge, mark: arrow-end) + elabel(("pr", 50%, "maint2"), [merge]) +}) diff --git a/docs/paper/arxiv/figures/role-runner.typ b/docs/paper/arxiv/figures/role-runner.typ new file mode 100644 index 00000000..6e512561 --- /dev/null +++ b/docs/paper/arxiv/figures/role-runner.typ @@ -0,0 +1,36 @@ +#import "lib.typ": * +#import "@preview/pixel-family:0.1.0": crank + +#set page(..fig-page) +#set text(..fig-text) + +#canvas(length: 0.55cm, { + import draw: * + + // Helper: auto-sized rounded box + let node(pos, label, name-id, accented: false) = { + let s = if accented { stroke-accent } else { stroke-box } + let f = if accented { fill-accent } else { fill-light } + let c = if accented { accent.darken(20%) } else { fg } + content(pos, + box(fill: f, stroke: s, inset: (x: 8pt, y: 4pt), radius: 5pt, + text(8pt, weight: "bold", fill: c, label)), + name: name-id) + } + + // Runner (agent, top center) + content((3, 2.1), crank(size: 1.8em, baseline: 0pt)) + node((3, 0.7), [Runner], "runner", accented: true) + + // Code + Tests (bottom left) + node((0.5, -0.7), [Code + Tests], "code") + + // Paper Entry (bottom right) + node((5.5, -0.7), [Paper Entry], "paper") + + // Edges + line("runner.south-west", "code.north-east", + stroke: stroke-edge, mark: arrow-end) + line("runner.south-east", "paper.north-west", + stroke: stroke-edge, mark: arrow-end) +}) diff --git a/docs/paper/arxiv/figures/skill-map.typ b/docs/paper/arxiv/figures/skill-map.typ new file mode 100644 index 00000000..121a629a --- /dev/null +++ b/docs/paper/arxiv/figures/skill-map.typ @@ -0,0 +1,89 @@ +#import "lib.typ": * + +#set page(..fig-page) +#set text(..fig-text) + +#canvas(length: 0.55cm, { + import draw: * + + // Helper: core node (inner ring) + let core(pos, label, name-id) = { + let (x, y) = pos + rect((x - 1.5, y - 0.35 + 0.06), (x + 1.5, y + 0.35 + 0.06), + radius: 4pt, fill: shadow-col, stroke: none) + rect((x - 1.5, y - 0.35), (x + 1.5, y + 0.35), + radius: 4pt, fill: fill-light, stroke: stroke-box, name: name-id) + content(name-id, text(7pt, weight: "bold", fill: fg, raw(label))) + } + + // Helper: skill node (outer ring) + let skill(pos, label, name-id) = { + let (x, y) = pos + rect((x - 1.6, y - 0.28), (x + 1.6, y + 0.28), + radius: 3pt, fill: white, stroke: (thickness: 0.5pt, paint: border), name: name-id) + content(name-id, text(6pt, fill: fg, raw(label))) + } + + // Helper: category label + let cat(pos, label) = { + content(pos, text(6pt, weight: "bold", fill: fg-light, label)) + } + + let cx = 8 + let cy = 5.5 + + // ── Center: CLAUDE.md ── + rect((cx - 1.8 + 0.08, cy - 0.45 + 0.08), (cx + 1.8 + 0.08, cy + 0.45 + 0.08), + radius: 5pt, fill: shadow-col, stroke: none) + rect((cx - 1.8, cy - 0.45), (cx + 1.8, cy + 0.45), + radius: 5pt, fill: fill-accent, stroke: stroke-accent, name: "claude") + content("claude", text(9pt, weight: "bold", fill: accent.darken(20%), raw("CLAUDE.md"))) + + // ── Inner ring: key project files ── + core((cx, cy + 2.0), "src/traits.rs", "traits") + core((cx, cy - 2.0), "Makefile", "make") + core((cx - 3.0, cy), "src/models/", "models") + core((cx + 3.0, cy), "src/rules/", "rules") + + // Links from center to inner ring + line("claude.north", "traits.south", stroke: (thickness: 0.6pt, paint: edge-col, dash: "densely-dashed"), mark: arrow-end) + line("claude.south", "make.north", stroke: (thickness: 0.6pt, paint: edge-col, dash: "densely-dashed"), mark: arrow-end) + line("claude.west", "models.east", stroke: (thickness: 0.6pt, paint: edge-col, dash: "densely-dashed"), mark: arrow-end) + line("claude.east", "rules.west", stroke: (thickness: 0.6pt, paint: edge-col, dash: "densely-dashed"), mark: arrow-end) + + // ── Outer ring: skill groups ── + + // Top-left: Orchestration + let ol = (cx - 5.5, cy + 4.5) + cat((ol.at(0), ol.at(1) + 0.5), [orchestration]) + skill((ol.at(0), ol.at(1)), "project-pipeline", "s1") + skill((ol.at(0), ol.at(1) - 0.7), "review-pipeline", "s2") + skill((ol.at(0), ol.at(1) - 1.4), "issue-to-pr", "s3") + + // Top-right: Quality gates + let qr = (cx + 5.5, cy + 4.5) + cat((qr.at(0), qr.at(1) + 0.5), [quality gates]) + skill((qr.at(0), qr.at(1)), "check-issue", "s4") + skill((qr.at(0), qr.at(1) - 0.7), "review-impl", "s5") + skill((qr.at(0), qr.at(1) - 1.4), "fix-pr", "s6") + skill((qr.at(0), qr.at(1) - 2.1), "topology-check", "s7") + + // Bottom-left: Implementation + let il = (cx - 5.5, cy - 3.0) + cat((il.at(0), il.at(1) + 0.5), [implementation]) + skill((il.at(0), il.at(1)), "add-model", "s8") + skill((il.at(0), il.at(1) - 0.7), "add-rule", "s9") + + // Bottom-right: Docs / community + let dr = (cx + 5.5, cy - 3.0) + cat((dr.at(0), dr.at(1) + 0.5), [docs / community]) + skill((dr.at(0), dr.at(1)), "write-in-paper", "s10") + skill((dr.at(0), dr.at(1) - 0.7), "propose", "s11") + skill((dr.at(0), dr.at(1) - 1.4), "dev-setup", "s12") + + // Dashed links from skill groups to center + line("s3.east", "claude.north-west", stroke: stroke-dashed, mark: arrow-end) + line("s7.west", "claude.north-east", stroke: stroke-dashed, mark: arrow-end) + line("s9.east", "claude.south-west", stroke: stroke-dashed, mark: arrow-end) + line("s12.west", "claude.south-east", stroke: stroke-dashed, mark: arrow-end) +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index ab4c947f..04af9dec 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -8,12 +8,11 @@ \usepackage{listings} \usepackage{hyperref} \usepackage{cleveref} -\usepackage{tikz} -\usetikzlibrary{arrows.meta,positioning,calc} +% TikZ removed — all diagrams are now Typst-compiled PDFs \begin{document} -\title{Bridging NP-Hard Problems: Scaling Software Beyond Human Capacity with Agentic Coding} +\title{Grand Assembly of Computational Hard Problems: The Art of Agentic Coding} \author{...} % placeholder @@ -23,11 +22,11 @@ A unified library of reductions between NP-hard problems would let practitioners route any supported problem to a specialized solver---quantum hardware, commercial optimizers, or domain-specific algorithms---through a single interface. Yet building such a library by human effort alone is impractical: it requires many researchers to adopt a common language and conventions, and demands continuous full-time maintenance as new reduction rules are discovered. We show that AI coding agents, guided by a system of reusable \emph{skills}---versioned, composable workflow documents that encode project conventions and domain knowledge---can overcome these barriers. -We demonstrate the approach by building a directed reduction graph of 27~problem types and 45~verified transformation rules, and make three contributions. +We demonstrate the approach by building a large library of problem reductions. During this process, we make three contributions. First, a \emph{no-code contribution route}: domain experts contribute reductions by filing structured issues with AI assistance, requiring no knowledge of the implementation language or codebase. -Second, a \emph{seven-layer verification stack} that ensures correctness and quality---an automated quality gate rejected 75\% of 322~batch-submitted proposals as incorrect or incomplete. +Second, a \emph{seven-layer verification stack} that enforces mathematical correctness, culminating in \emph{agentic feature tests}---AI agents that act as first-time users, exercising the library end-to-end---replacing the years of community trial-and-error traditionally needed to surface integration bugs. Third, a \emph{fully automated pipeline} that enables sustainable maintenance by a single maintainer, with an onboarding path that lets a new maintainer take over the project in half a day. -Over nine weeks, a single maintainer and AI agents produced a Rust library with $>$95\% test coverage and a 15:1 ratio of agent-to-human messages. +Building such a stack takes approximately ten weeks. Now we have $>100$ problems in the package. We show that AI agents can produce correct and maintainable software beyond the scale of human capabilities, while human developers can focus on the creative parts. \end{abstract} %====================================================================== @@ -325,16 +324,7 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} \emph{Mentors} (4~skills) interact with humans. % \begin{center} -\begin{tikzpicture}[node distance=0.4cm and 0.6cm, - box/.style={draw, rounded corners=2pt, font=\scriptsize, minimum height=1.6em, inner sep=3pt}, - arr/.style={-{Stealth[length=3pt]}}, -] -\node[box] (contrib) {Contributor}; -\node[box, right=0.8cm of contrib] (mentor) {Mentor}; -\node[box, below=0.3cm of $(contrib)!0.5!(mentor)$] (issue) {GitHub Issue}; -\draw[arr, <->] (contrib) -- node[above, font=\tiny] {interactive} (mentor); -\draw[arr] (mentor) -- (issue); -\end{tikzpicture} +\includegraphics[height=1.8cm]{figures/role-mentor.pdf} \end{center} % \noindent @@ -347,20 +337,7 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} \emph{Orchestrators} (5~skills) manage the pipeline. % \begin{center} -\begin{tikzpicture}[node distance=0.4cm and 0.5cm, - box/.style={draw, rounded corners=2pt, font=\scriptsize, minimum height=1.6em, inner sep=3pt}, - arr/.style={-{Stealth[length=3pt]}}, -] -\node[box] (maint) {Maintainer}; -\node[box, right=0.5cm of maint] (board) {Board}; -\node[box, right=0.5cm of board] (orch) {Orchestrator}; -\node[box, right=0.5cm of orch] (pr) {PR}; -\node[box, right=0.5cm of pr] (maint2) {Maintainer}; -\draw[arr] (maint) -- node[above, font=\tiny] {ready} (board); -\draw[arr] (board) -- node[above, font=\tiny] {pick} (orch); -\draw[arr] (orch) -- node[above, font=\tiny] {create} (pr); -\draw[arr] (pr) -- node[above, font=\tiny] {merge} (maint2); -\end{tikzpicture} +\includegraphics[width=\columnwidth]{figures/role-orchestrator.pdf} \end{center} % \noindent @@ -372,16 +349,7 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} \emph{Runners} (7~skills) execute. % \begin{center} -\begin{tikzpicture}[node distance=0.4cm and 0.6cm, - box/.style={draw, rounded corners=2pt, font=\scriptsize, minimum height=1.6em, inner sep=3pt}, - arr/.style={-{Stealth[length=3pt]}}, -] -\node[box] (runner) {Runner}; -\node[box, below left=0.3cm and 0.1cm of runner] (code) {Code + Tests}; -\node[box, below right=0.3cm and 0.1cm of runner] (paper) {Paper Entry}; -\draw[arr] (runner) -- (code); -\draw[arr] (runner) -- (paper); -\end{tikzpicture} +\includegraphics[height=1.8cm]{figures/role-runner.pdf} \end{center} % \noindent @@ -420,62 +388,7 @@ \subsection{Why Skills, Not Prompts or Scripts}\label{sec:skills} \begin{figure}[t] \centering -\begin{tikzpicture}[ - core/.style={draw, rounded corners=3pt, font=\scriptsize\ttfamily, - fill=black!8, minimum height=1.8em, inner sep=4pt, - line width=0.6pt}, - skill/.style={draw, rounded corners=2pt, font=\tiny\ttfamily, - minimum height=1.5em, inner sep=2.5pt, line width=0.4pt}, - cat/.style={font=\tiny\sffamily\bfseries, text=black!50}, - link/.style={-{Stealth[length=2.5pt]}, thin, black!40}, -] -% --- Center: CLAUDE.md --- -\node[core, fill=blue!12, draw=blue!50, font=\scriptsize\ttfamily\bfseries, - minimum width=1.6cm] - (claude) {CLAUDE.md}; - -% --- Key project files (cardinal directions, close to center) --- -\node[core, above=0.35cm of claude] (traits) {src/traits.rs}; -\node[core, below=0.35cm of claude] (make) {Makefile}; -\node[core, left=0.45cm of claude, anchor=east] (models) {src/models/}; -\node[core, right=0.45cm of claude, anchor=west] (rules) {src/rules/}; - -\draw[link] (claude) -- (traits); -\draw[link] (claude) -- (make); -\draw[link] (claude) -- (models); -\draw[link] (claude) -- (rules); - -% --- Skill groups (corners, well outside inner ring) --- -% Top-left: Orchestration -\node[cat] at (-2.7, 2.0) (orch-label) {orchestration}; -\node[skill, below=1pt of orch-label] (s1) {project-pipeline}; -\node[skill, below=1pt of s1] (s2) {review-pipeline}; -\node[skill, below=1pt of s2] (s3) {issue-to-pr}; - -% Top-right: Quality gates -\node[cat] at (2.7, 2.0) (qa-label) {quality gates}; -\node[skill, below=1pt of qa-label] (s4) {check-issue}; -\node[skill, below=1pt of s4] (s5) {review-impl}; -\node[skill, below=1pt of s5] (s6) {fix-pr}; -\node[skill, below=1pt of s6] (s7) {topology-check}; - -% Bottom-left: Implementation -\node[cat] at (-2.7, -1.1) (impl-label) {implementation}; -\node[skill, below=1pt of impl-label] (s8) {add-model}; -\node[skill, below=1pt of s8] (s9) {add-rule}; - -% Bottom-right: Docs / community -\node[cat] at (2.7, -1.1) (doc-label) {docs / community}; -\node[skill, below=1pt of doc-label] (s10) {write-in-paper}; -\node[skill, below=1pt of s10] (s11) {propose}; -\node[skill, below=1pt of s11] (s12) {dev-setup}; - -% --- Dashed links from skill groups to center --- -\draw[link, dashed] (s3.east) -- (claude.north west); -\draw[link, dashed] (s7.west) -- (claude.north east); -\draw[link, dashed] (s9.east) -- (claude.south west); -\draw[link, dashed] (s12.west) -- (claude.south east); -\end{tikzpicture} +\includegraphics[width=\columnwidth]{figures/skill-map.pdf} \caption{Project knowledge architecture. \texttt{CLAUDE.md} defines conventions, architecture, and commands; skills in four categories encode reusable workflows that reference it. From a199853318fbcf09ab6f505bc04bc5397d374472 Mon Sep 17 00:00:00 2001 From: GiggleLiu Date: Mon, 16 Mar 2026 01:31:08 +0800 Subject: [PATCH 38/38] =?UTF-8?q?Restructure=20agent=20roles:=20three=20ro?= =?UTF-8?q?les=20=E2=86=92=20two=20types=20(Mentor/Worker)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reframe the agent taxonomy around knowledge asymmetry: Mentors guide humans with superior project knowledge; Workers execute routine heavy-lifting with less domain knowledge. Merges Orchestrator+Runner into Worker with lightweight subcategories. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/paper/arxiv/figures/role-worker.typ | 76 ++++++++++++++++++++++++ docs/paper/arxiv/paper.tex | 43 +++++--------- 2 files changed, 92 insertions(+), 27 deletions(-) create mode 100644 docs/paper/arxiv/figures/role-worker.typ diff --git a/docs/paper/arxiv/figures/role-worker.typ b/docs/paper/arxiv/figures/role-worker.typ new file mode 100644 index 00000000..ccf7ffef --- /dev/null +++ b/docs/paper/arxiv/figures/role-worker.typ @@ -0,0 +1,76 @@ +#import "lib.typ": * +#import "@preview/pixel-family:0.1.0": bob, nova, crank + +#set page(..fig-page) +#set text(..fig-text) + +#canvas(length: 0.55cm, { + import draw: * + + // Helper: auto-sized rounded box + let node(pos, label, name-id, accented: false) = { + let s = if accented { stroke-accent } else { stroke-box } + let f = if accented { fill-accent } else { fill-light } + let c = if accented { accent.darken(20%) } else { fg } + content(pos, + box(fill: f, stroke: s, inset: (x: 8pt, y: 4pt), radius: 5pt, + text(8pt, weight: "bold", fill: c, label)), + name: name-id) + } + + let elabel(pos, body) = { + content(pos, box(fill: white, inset: (x: 3pt, y: 1.5pt), radius: 2pt, + text(6pt, fill: fg-light, body))) + } + + let gap = 5.0 + + // Maintainer (left) + content((0, 1.4), bob(size: 1.8em, baseline: 0pt)) + node((0, 0), [Maintainer], "maint") + + // Board + node((gap, 0), [Board], "board") + + // Orchestrator worker (agent) + content((2 * gap, 1.4), nova(size: 1.8em, baseline: 0pt)) + node((2 * gap, 0), [Orchestrate], "orch", accented: true) + + // Implementation worker (agent) + content((3 * gap, 1.4), crank(size: 1.8em, baseline: 0pt)) + node((3 * gap, 0), [Implement], "impl", accented: true) + + // Outputs (bottom) + node((2.5 * gap - 2.2, -2.0), [Code + Tests], "code") + node((2.5 * gap + 2.2, -2.0), [Paper Entry], "paper") + + // PR + node((4 * gap, 0), [PR], "pr") + + // Maintainer (right) + content((5 * gap, 1.4), bob(size: 1.8em, baseline: 0pt)) + node((5 * gap, 0), [Maintainer], "maint2") + + // Edges + line("maint.east", "board.west", stroke: stroke-edge, mark: arrow-end) + elabel(("maint", 50%, "board"), [ready]) + + line("board.east", "orch.west", stroke: stroke-edge, mark: arrow-end) + elabel(("board", 50%, "orch"), [pick]) + + line("orch.east", "impl.west", + stroke: (thickness: 1.1pt, paint: accent), mark: arrow-end) + elabel(("orch", 50%, "impl"), [dispatch]) + + // Implementation outputs + line("impl.south-west", "code.north-east", + stroke: stroke-edge, mark: arrow-end) + line("impl.south-east", "paper.north-west", + stroke: stroke-edge, mark: arrow-end) + + line("impl.east", "pr.west", stroke: stroke-edge, mark: arrow-end) + elabel(("impl", 50%, "pr"), [create]) + + line("pr.east", "maint2.west", stroke: stroke-edge, mark: arrow-end) + elabel(("pr", 50%, "maint2"), [merge]) +}) diff --git a/docs/paper/arxiv/paper.tex b/docs/paper/arxiv/paper.tex index 04af9dec..00675a33 100644 --- a/docs/paper/arxiv/paper.tex +++ b/docs/paper/arxiv/paper.tex @@ -318,46 +318,35 @@ \subsection{From Contributor to Verified Code}\label{sec:pipeline-overview} The maintainer makes the final quality judgment and merges. This is one of only two human decisions in the pipeline; the other is moving an issue from Backlog to Ready (Stage~3). -\paragraph{Three agent roles.} -The pipeline uses agents in three distinct roles, each backed by a set of skills. +\paragraph{Two agent types.} +The pipeline uses agents in two distinct types, distinguished by the \emph{knowledge asymmetry} between agent and human. -\emph{Mentors} (4~skills) interact with humans. +\emph{Mentors} (4~skills) know more than the human about the project's conventions, architecture, and topology---and use that knowledge to guide humans toward high-quality contributions. % \begin{center} \includegraphics[height=1.8cm]{figures/role-mentor.pdf} \end{center} % \noindent -The \texttt{propose} skill conducts an interactive session in mathematical language, asking one question at a time: what is the problem, what is the formal definition, what does a worked example look like? -Before filing, it pre-validates the draft against Stage~2's quality checks. -The \texttt{fix-issue} skill brainstorms with contributors to resolve quality problems. -The \texttt{final-review} skill guides the maintainer through merge decisions. -The \texttt{dev-setup} skill onboards new developers. - -\emph{Orchestrators} (5~skills) manage the pipeline. -% -\begin{center} -\includegraphics[width=\columnwidth]{figures/role-orchestrator.pdf} -\end{center} -% -\noindent -\texttt{project-pipeline} picks the highest-ranked Ready issue, creates an isolated git worktree, dispatches a runner, and moves the result to the review queue---all headlessly. -\texttt{review-pipeline} addresses code-review comments, runs agentic feature tests, and retries CI failures. -\texttt{check-issue} validates proposals before implementation. -The maintainer makes exactly two decisions: moving an issue from Backlog to Ready, and merging the final pull request. - -\emph{Runners} (7~skills) execute. +The \texttt{propose} skill analyzes the reduction graph's topology---orphan nodes, missing proof paths, redundant edges---to identify the highest-value contributions, then conducts an interactive session in mathematical language, asking one question at a time: what is the problem, what is the formal definition, what does a worked example look like? +It proposes options, analyzes trade-offs, and recommends with reasons, so that even a newcomer can produce a publication-quality issue on their first attempt. +The \texttt{fix-issue} skill brainstorms with contributors to resolve quality problems found during validation. +The \texttt{final-review} skill guides the maintainer through merge decisions, surfacing quality signals the maintainer might miss. +The \texttt{dev-setup} skill onboards new developers, configuring their environment interactively. +In each case the human learns through the interaction---the agent acts as a domain-aware tutor who transfers project knowledge to the contributor. + +\emph{Workers} (12~skills) know less than the human about \emph{what} should be built, but execute routine heavy-lifting that follows fixed patterns. % \begin{center} -\includegraphics[height=1.8cm]{figures/role-runner.pdf} +\includegraphics[width=\columnwidth]{figures/role-worker.pdf} \end{center} % \noindent -\texttt{add-model} and \texttt{add-rule} implement problem types and reductions. -\texttt{review-implementation} dispatches parallel sub-agents in fresh context to review code. -\texttt{fix-pr} resolves CI failures and review comments. -\texttt{write-model-in-paper} and \texttt{write-rule-in-paper} generate paper entries with proof sketches and worked examples. +Orchestration workers manage the pipeline headlessly: \texttt{project-pipeline} picks the highest-ranked Ready issue, creates an isolated git worktree, dispatches an implementation worker, and moves the result to the review queue; \texttt{review-pipeline} addresses code-review comments, runs agentic feature tests, and retries CI failures. +Implementation workers produce artifacts: \texttt{add-model} and \texttt{add-rule} write code following skill checklists; \texttt{write-model-in-paper} and \texttt{write-rule-in-paper} generate paper entries with proof sketches and worked examples. +Quality workers validate against rubrics: \texttt{check-issue} validates proposals before implementation; \texttt{review-implementation} dispatches parallel sub-agents in fresh context to review code; \texttt{fix-pr} resolves CI failures and review comments. \texttt{release} handles version bumps and publishing. +The maintainer makes exactly two decisions in the entire pipeline: moving an issue from Backlog to Ready, and merging the final pull request. \subsection{Why Skills, Not Prompts or Scripts}\label{sec:skills}