
Add rate limit model fallback and PTY resilience#4

Open
bh13731 wants to merge 20 commits into lightsparkdev:main from bh13731:rate-limit-model-fallback

Conversation


@bh13731 bh13731 commented Apr 1, 2026

Summary

  • Adds automatic model fallback when Claude API rate limits (429) or overload errors (529) are hit
  • Fallback chain: opus -> sonnet[1m] -> sonnet -> haiku, with per-model cooldown tracking
  • StuckJobWatchdog scans tmux sessions for rate limit errors and auto-restarts agents with fallback models
  • PtyManager gracefully handles posix_spawnp failures by polling tmux session exit instead of immediately failing the agent
  • New API endpoints for viewing and managing rate limit state

Changes

  • ModelClassifier.ts: Rate limit cooldown system with DB persistence, getFallbackModel() chain, resolveModel() now checks rate limits before assigning
  • StuckJobWatchdog.ts: New Check 5 scans tmux sessions every 30s for rate limit strings, kills stuck agents after 3min and re-queues with fallback model
  • models.ts API: GET/POST/DELETE /api/models/rate-limits for manual rate limit management
  • PtyManager.ts: When PTY attach fails but tmux is alive, poll for session exit instead of failing immediately
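A rough sketch of how the fallback chain and cooldown tracking fit together (names like `markRateLimited` and `resolveAvailableModel`, and the 5-minute cooldown window, are illustrative, not the actual `ModelClassifier` API):

```typescript
// Fallback chain from the PR: opus -> sonnet[1m] -> sonnet -> haiku.
const FALLBACK_CHAIN = ["opus", "sonnet[1m]", "sonnet", "haiku"] as const;
type Model = (typeof FALLBACK_CHAIN)[number];

// Assumed cooldown window; the real implementation persists this to the DB.
const COOLDOWN_MS = 5 * 60 * 1000;
const cooldownUntil = new Map<Model, number>();

function markRateLimited(model: Model, now = Date.now()): void {
  cooldownUntil.set(model, now + COOLDOWN_MS);
}

function isRateLimited(model: Model, now = Date.now()): boolean {
  return (cooldownUntil.get(model) ?? 0) > now;
}

// Walk the chain from the requested model; return the first model not in
// cooldown, or null when every model is rate-limited.
function resolveAvailableModel(requested: Model, now = Date.now()): Model | null {
  const start = Math.max(0, FALLBACK_CHAIN.indexOf(requested));
  for (let i = start; i < FALLBACK_CHAIN.length; i++) {
    if (!isRateLimited(FALLBACK_CHAIN[i], now)) return FALLBACK_CHAIN[i];
  }
  return null;
}
```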

Test plan

  • npm test passes (239 tests)
  • npx tsc -p tsconfig.server.json --noEmit passes
  • Manually tested: marked opus as rate-limited via API, verified agents dispatched with sonnet fallback
  • Verified watchdog detects rate limit strings in tmux output and re-queues jobs

Made with Cursor

bh13731 and others added 20 commits March 31, 2026 08:55
Set up the test infrastructure for integration testing:
- Add vitest, supertest, and @types/supertest as dev dependencies
- Create vitest.config.ts with --experimental-sqlite support
- Create shared test helpers: setupTestDb/cleanupTestDb for fresh :memory:
  databases, createSocketMock for capturing emitted events, and factory
  helpers for projects/workflows/jobs
- Add 11 smoke tests proving DB isolation, socket mocking, fixture
  factories, and WorkflowManager importability all work
- Add "test" and "test:watch" scripts to package.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
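A minimal illustration of the fixture-factory style described above; the field names and defaults here are assumptions, not the real job schema:

```typescript
// Monotonic counter gives each fixture a unique id per test run.
let seq = 0;

// Factory with overridable defaults, in the style of the test helpers.
function makeJob(
  overrides: Partial<{ id: string; status: string; model: string }> = {},
) {
  seq += 1;
  return { id: `job-${seq}`, status: "queued", model: "sonnet", ...overrides };
}
```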
…nt cycle)

Ports the structured plan/review/implement cycle from autonomous-coding-agents
into Hurlicane as a first-class feature. Each workflow runs Claude as implementer
and Codex as reviewer in a repeating cycle until all milestones are complete.

Key changes:
- New workflows table + job columns (workflow_id, workflow_cycle, workflow_phase)
- WorkflowManager: event-driven orchestrator modelled after DebateManager
- WorkflowPrompts: phase-specific prompts using notes as shared artifacts
- Single shared worktree per workflow (created once at start, all phases share
  one branch so changes accumulate linearly across cycles)
- Auto PR creation + worktree cleanup on workflow completion via gh CLI
- Codex review phase (cycle 2+) does full code review via git diff before
  updating the plan — adds Fix: milestones for any quality issues found
- isAutoExitJob() helper replaces debate_role checks across 8 call sites so
  workflow phase jobs also exit cleanly without calling finish_job
- PtyManager: PTY attach failure is a warning (not job failure) for auto-exit
  jobs since tailing already captures output
- REST API: GET/POST /api/workflows, cancel, resume endpoints
- UI: Autonomous Agents button, WorkflowForm, WorkflowDetailModal with
  milestone progress bar, plan/worklog/jobs tabs, View PR link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
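The `isAutoExitJob()` consolidation can be sketched like this (the `Job` shape is reduced to the two fields the check plausibly needs; the exact signature is an assumption):

```typescript
interface Job {
  debate_role?: string | null;
  workflow_phase?: string | null;
}

// Debate jobs and workflow phase jobs both exit on their own without
// calling the finish_job MCP tool, so callers should not treat a missing
// finish_job call as a failure for them.
function isAutoExitJob(job: Job): boolean {
  return Boolean(job.debate_role) || Boolean(job.workflow_phase);
}
```

Centralizing the check in one helper is what lets all 8 call sites stay in sync when a new auto-exit job type is added.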
…tree add

Without mkdir -p, worktree creation silently fails on fresh installs
or when the directory has been removed, causing a fallback to work_dir
directly and letting WorkQueueManager create per-phase worktrees again.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
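A sketch of the fix, with the exec step injectable so it can be stubbed in tests (`ensureWorktree` is an illustrative name, not the actual function):

```typescript
import { mkdirSync } from "node:fs";
import { dirname } from "node:path";
import { execFileSync } from "node:child_process";

function ensureWorktree(
  worktreePath: string,
  branch: string,
  exec: (cmd: string, args: string[]) => void = (c, a) => { execFileSync(c, a); },
): void {
  // mkdir -p the parent first so `git worktree add` cannot silently fail
  // when .orchestrator-worktrees/ does not exist yet.
  mkdirSync(dirname(worktreePath), { recursive: true });
  exec("git", ["worktree", "add", "-b", branch, worktreePath]);
}
```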
…and query type annotations

- Add work_dir and max_turns to Job interface in types.ts (columns exist in DB but were missing from type)
- Add eye:pr:* and eye:pr-review:* events to ServerToClientEvents (used by SocketManager but not typed)
- Add explicit :any annotations to db.prepare().all().map() callbacks in queries.ts (noImplicitAny)
- Use Map<string, Job> generic for jobMap constructors to fix tuple inference (fixes template_id and AgentWithJob errors)
- Complete missing fields in Job stub literals (debate_loop, workflow_id, etc.)
- Remove invalid shell: true from execSync in WorkflowManager (ExecSyncOptions.shell is string-only; execSync uses a shell by default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
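The Map generic fix reduces to this pattern (the `Job` interface here is trimmed to two fields for illustration):

```typescript
interface Job {
  id: string;
  template_id?: string | null;
}

const rows: Job[] = [{ id: "a" }, { id: "b", template_id: "t1" }];

// With the explicit Map<string, Job> type arguments, the map() callback's
// return is contextually typed as the tuple [string, Job], so .get() stays
// typed as Job | undefined instead of a collapsed union element type.
const jobMap = new Map<string, Job>(rows.map((j) => [j.id, j]));
```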
Recovery now scans for running workflows whose current-phase job is
done but no next-phase job was spawned — which happens when the server
restarts between finish_job and onJobCompleted. The gap detector
correctly advances assess→review, review→implement, implement→review
without double-spawning.

Also fix the worktree checkbox label to accurately describe shared-branch
behavior (not per-phase isolation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
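The transition table the gap detector advances through, reduced to a sketch (the `Phase` union and `nextPhase` name are illustrative):

```typescript
type Phase = "assess" | "review" | "implement";

// assess -> review -> implement -> review, cycling between review and
// implement until all milestones are complete.
function nextPhase(current: Phase): Phase {
  switch (current) {
    case "assess":
      return "review";
    case "review":
      return "implement";
    case "implement":
      return "review";
  }
}
```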
…repos

Two bugs caused the ea-ops-copilot workflow worktree to be deleted 5 min
after creation, making all subsequent phase jobs fail with ENOENT:

1. cleanupOrphanedWorktrees() deleted any .orchestrator-worktrees/ dir
   not in the DB — but workflow shared worktrees (wf-*) were never
   inserted into the worktrees table. Fix: only remove directories that
   appear in `git worktree list` for the current (Hurlicane) repo.
   Other repos' worktrees share the same parent dir and must not be touched.

2. startWorkflow() didn't register the shared worktree in the DB.
   Fix: attempt insertWorktree() after creation as a secondary guard
   (wrapped in try/catch since job_id FK may reject the sentinel value).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ete shared interfaces

Add Pr interface to shared types and use PrReview/PrReviewMessage for
all four PR-related socket events (eye:pr:new, eye:pr-review:new,
eye:pr-review:update, eye:pr-review:message) in ServerToClientEvents
and SocketManager emit functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…anup

Bug 1 (stuck workflow on restart) — two gaps remained:
- Gap detector used workflow.current_cycle to find the assess job, but
  onJobCompleted bumps current_cycle to 1 *before* spawnPhaseJob, so after
  a crash the assess job (cycle=0) was never found. Fix: look at cycle 0
  when current_phase='assess', regardless of current_cycle.
- created_at comparison was unguarded against undefined; added ?? 0 fallback.

Bug 2 (worktree deleted by cleanup) — previous fix was incomplete:
- insertWorktree(workflow.id) silently failed due to FK constraints (FK=ON),
  so Hurlicane-targeting workflow worktrees were still unprotected. Removed
  the broken sentinel approach entirely.
- cleanupOrphanedWorktrees() now builds the protected set from both the
  per-job worktrees table AND active/blocked workflow worktree_paths, so
  workflow shared worktrees are never treated as orphans regardless of which
  repo they belong to.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
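The final cleanup rule can be sketched as a pure function (`findOrphans` is an illustrative name): a directory is an orphan only if it appears in neither the per-job worktrees table nor any active/blocked workflow's worktree_path.

```typescript
function findOrphans(
  onDisk: string[],            // directories found under .orchestrator-worktrees/
  jobWorktrees: string[],      // paths from the per-job worktrees table
  workflowWorktrees: string[], // worktree_paths of active/blocked workflows
): string[] {
  const protectedPaths = new Set([...jobWorktrees, ...workflowWorktrees]);
  return onDisk.filter((p) => !protectedPaths.has(p));
}
```

Because the protected set is built from both sources, workflow shared worktrees (wf-*) survive cleanup regardless of which repo they belong to, with no FK-constrained sentinel row needed.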
Workflows could get stuck mid-transition if onJobCompleted threw after
finish_job but before spawnPhaseJob — recovery at startup catches this,
but jobs completing *after* startup had no periodic catch. Now
startWorkflowGapDetector() polls every 60s so any stuck workflow
recovers within a minute regardless of when it got stuck.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
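A minimal sketch of the periodic poll, assuming a scan callback that advances any stuck workflow (`detectAndFixGaps` is a stand-in, not the real function):

```typescript
const GAP_DETECTOR_INTERVAL_MS = 60_000;

function startWorkflowGapDetector(
  detectAndFixGaps: () => void,
  intervalMs = GAP_DETECTOR_INTERVAL_MS,
): ReturnType<typeof setInterval> {
  const timer = setInterval(detectAndFixGaps, intervalMs);
  timer.unref?.(); // don't keep the process alive just for the poll
  return timer;
}
```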
…bCompleted()

Add _resetForTest() exports to WorkflowManager and DebateManager that clear
module-level _processedJobs dedup Sets. Add resetManagerState() helper to
test helpers. Create workflow-dedup.test.ts with 6 tests proving:
- dedup guard blocks duplicate job ID within a single test
- _resetForTest() enables same job ID across separate tests (per-test independence)
- assess completion spawns review job with correct DB state + socket emissions
- failed phase job marks workflow as blocked
- non-workflow jobs are silently ignored

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
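The module-level dedup guard plus the test-only reset hook, in sketch form (`onJobCompleted` here is a stand-in for the manager's real handler; the `force` param matches the gap-detector bypass added later in this PR):

```typescript
// Module-level Set survives across calls, so the same job completion is
// processed at most once per process lifetime.
const _processedJobs = new Set<string>();

function onJobCompleted(jobId: string, force = false): boolean {
  if (!force && _processedJobs.has(jobId)) return false; // deduped
  _processedJobs.add(jobId);
  // ...advance the workflow here...
  return true;
}

// Exported for tests only: clears module state so each test starts clean.
function _resetForTest(): void {
  _processedJobs.clear();
}
```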
Agents that lost their MCP connection outside of wait_for_jobs had no
recovery path except a 90-minute idle timeout. This caused agents to
enter sleep-retry loops burning turn budget while the server was already
back up.

- Track MCP-disconnected agents in a new disconnectedAgents map
- Add watchdog Check 3: kill and restart disconnected agents after 2min
- Extend idle timeout from tmux-only to all agent types, reduce 90→20min
- Call workflowOnJobCompleted from watchdog so workflows advance immediately
- Add force param to WorkflowManager.onJobCompleted so gap detector can
  bypass _processedJobs dedup and actually recover stuck workflows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
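Watchdog Check 3 reduces to a selection over the disconnect map (names are illustrative; the real check also kills and restarts the selected agents):

```typescript
const DISCONNECT_GRACE_MS = 2 * 60 * 1000; // 2min grace before restart
const disconnectedAgents = new Map<string, number>(); // agentId -> disconnect time

// Agents whose MCP connection dropped more than the grace period ago are
// selected for kill-and-restart on the next watchdog tick.
function agentsToRestart(now = Date.now()): string[] {
  return [...disconnectedAgents.entries()]
    .filter(([, since]) => now - since > DISCONNECT_GRACE_MS)
    .map(([id]) => id);
}
```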
…e stopping conditions

Introduces stop_mode ('turns' | 'budget' | 'time' | 'completion') and stop_value
columns on jobs and workflows (per-phase). Adds token accumulator columns on agents.
Backfills existing rows with stop_mode='turns' and stop_value=max_turns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
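The new columns can be typed roughly like this (the `StopConfig` shape is an assumption; units follow the UI commit later in this PR — turn count, dollar amount, or minutes):

```typescript
type StopMode = "turns" | "budget" | "time" | "completion";

interface StopConfig {
  stop_mode: StopMode;
  stop_value: number | null; // turns, USD, or minutes; null for 'completion'
}

// Backfill rule from the commit: existing rows become turn-limited with
// their old max_turns as the stop value.
function backfill(maxTurns: number): StopConfig {
  return { stop_mode: "turns", stop_value: maxTurns };
}
```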
CostEstimator maps model + token counts to estimated USD using a pricing lookup
table. AgentRunner now extracts usage blocks from Claude/Codex stream events and
accumulates token counts on the agent row. Also uses effectiveMaxTurns() to set
--max-turns based on stop_mode (1000 safety cap for budget/time/completion).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
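The pricing-lookup shape, sketched with illustrative per-million-token rates (the real CostEstimator's model keys and prices may differ):

```typescript
// USD per million tokens; values here are assumptions for illustration.
const PRICING: Record<string, { input: number; output: number }> = {
  opus: { input: 15, output: 75 },
  sonnet: { input: 3, output: 15 },
  haiku: { input: 0.8, output: 4 },
};

function estimateCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICING[model];
  if (!p) return 0; // unknown model: treat as free rather than over-kill
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```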
HealthMonitor tick() now checks agents with stop_mode='budget' or 'time':
- Budget: estimates cost from accumulated tokens, warns at 80%, kills at 100%
- Time: compares elapsed time against limit, warns at 80%, kills at 100%
- Agents are stopped gracefully (marked done, not failed) since hitting a
  configured limit is expected behavior, not an error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
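The warn/kill thresholds reduce to one comparison helper (`checkLimit` is an illustrative name; the same rule applies to budget dollars and elapsed minutes):

```typescript
type LimitAction = "ok" | "warn" | "stop";

// Warn at 80% of the configured limit; stop gracefully at 100%.
function checkLimit(used: number, limit: number): LimitAction {
  if (limit <= 0) return "ok"; // no limit configured
  const ratio = used / limit;
  if (ratio >= 1) return "stop";
  if (ratio >= 0.8) return "warn";
  return "ok";
}
```

Returning a graceful "stop" rather than a failure matches the commit's point: hitting a configured limit is expected behavior, not an error.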
…and MCP tools

WorkflowManager now propagates per-phase stop config (stop_mode, stop_value) to
phase jobs in spawnPhaseJob, startWorkflow, and resumeWorkflow.

API routes (jobs.ts, workflows.ts) accept stopMode/stopValue from clients.
MCP create_job tool gains stop_mode and stop_value parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…WorkflowForm

New StopModePicker component provides a segmented button group for selecting
between Turns, Budget, Time, and Run to Completion modes with contextual
input fields (number for turns, dollar amount for budget, minutes for time).

WorkflowForm: replaces 3 max_turns number inputs with 3 StopModePicker instances.
JobForm: adds a StopModePicker (defaults to 'completion' for ad-hoc jobs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… agent

Without this, a workflow phase job killed by budget or time limit would
leave the workflow stuck until the periodic gap detector caught it (~60s).
Now killAgentGracefully calls debateOnJobCompleted and workflowOnJobCompleted
immediately, same as the normal exit path in AgentRunner.handleJobCompletion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ditions-turns-budge-0d3f3318

feat: flexible stopping conditions (turns / budget / time / completion)
The jobsCache Map in jobs.ts cached GET /api/jobs responses for 1.5s
with no invalidation on job mutation. This caused cross-test state
leakage (stale rows from a previous test's DB appeared in subsequent
fresh-DB tests) and was also a production correctness bug where create
or update operations would not be reflected in list responses for up
to 1.5s.

Remove the cache entirely — it was not providing meaningful benefit
relative to the correctness risk.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When Claude API rate limits (429) or overload errors (529) are hit,
the system now automatically detects and recovers:

ModelClassifier: Add a model fallback chain (opus -> sonnet[1m] ->
sonnet -> haiku). Track per-model rate limit cooldowns in both memory
and DB. resolveModel() checks rate limits before assigning models and
falls through the chain to find an available one.

StuckJobWatchdog: Add Check 5 that periodically scans tmux sessions
for rate limit error strings. When an agent has been stuck for >3min,
marks the model as rate-limited, kills the session, and re-queues the
job with a fallback model.
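The scan step can be sketched as a pattern match over captured pane text; the patterns below are assumptions, not the watchdog's actual string list:

```typescript
// Illustrative signals: HTTP 429/529 codes plus common phrasings.
const RATE_LIMIT_PATTERNS = [/\b429\b/, /\b529\b/, /rate limit/i, /overloaded/i];

function looksRateLimited(paneText: string): boolean {
  return RATE_LIMIT_PATTERNS.some((re) => re.test(paneText));
}
```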

models API: Add GET/POST/DELETE /api/models/rate-limits endpoints for
viewing and managing rate limit state manually.

PtyManager: When node-pty fails to attach (posix_spawnp) but the tmux
session is alive, set up a fallback poll to detect session exit instead
of immediately marking the agent as failed. This lets agents complete
via MCP finish_job despite the PTY viewer being unavailable.

Made-with: Cursor
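The PtyManager fallback can be sketched like this, with the per-tick decision split out so it is testable without tmux (`pollSessionExit` and `pollStep` are illustrative names; `tmux has-session -t` is the real CLI check):

```typescript
import { execFileSync } from "node:child_process";

// `tmux has-session -t <name>` exits 0 iff the session exists.
function tmuxSessionAlive(session: string): boolean {
  try {
    execFileSync("tmux", ["has-session", "-t", session], { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

// One poll step: returns true when polling should stop (session gone),
// after notifying the caller. The agent may already have completed via
// MCP finish_job; this just observes the exit.
function pollStep(alive: boolean, onExit: () => void): boolean {
  if (!alive) {
    onExit();
    return true;
  }
  return false;
}

function pollSessionExit(
  session: string,
  onExit: () => void,
  isAlive: (s: string) => boolean = tmuxSessionAlive,
  intervalMs = 5_000, // assumed poll interval
): ReturnType<typeof setInterval> {
  const timer = setInterval(() => {
    if (pollStep(isAlive(session), onExit)) clearInterval(timer);
  }, intervalMs);
  return timer;
}
```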
