
Add rate limit model fallback and PTY resilience#4

Open
bh13731 wants to merge 20 commits into lightsparkdev:main from bh13731:rate-limit-model-fallback

Conversation


@bh13731 bh13731 commented Apr 1, 2026

Summary

  • Adds automatic model fallback when Claude API rate limits (429) or overload errors (529) are hit
  • Fallback chain: opus -> sonnet[1m] -> sonnet -> haiku, with per-model cooldown tracking
  • StuckJobWatchdog scans tmux sessions for rate limit errors and auto-restarts agents with fallback models
  • PtyManager gracefully handles posix_spawnp failures by polling tmux session exit instead of immediately failing the agent
  • New API endpoints for viewing and managing rate limit state

Changes

  • ModelClassifier.ts: Rate limit cooldown system with DB persistence, getFallbackModel() chain, resolveModel() now checks rate limits before assigning
  • StuckJobWatchdog.ts: New Check 5 scans tmux sessions every 30s for rate limit strings, kills stuck agents after 3min and re-queues with fallback model
  • models.ts API: GET/POST/DELETE /api/models/rate-limits for manual rate limit management
  • PtyManager.ts: When PTY attach fails but tmux is alive, poll for session exit instead of failing immediately
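A rough sketch of how the fallback chain and cooldown tracking fit together (names like `markRateLimited` and `resolveAvailableModel`, and the 5-minute cooldown window, are illustrative, not the actual `ModelClassifier` API):

```typescript
// Fallback chain from the PR: opus -> sonnet[1m] -> sonnet -> haiku.
const FALLBACK_CHAIN = ["opus", "sonnet[1m]", "sonnet", "haiku"] as const;
type Model = (typeof FALLBACK_CHAIN)[number];

// Assumed cooldown window; the real implementation persists this to the DB.
const COOLDOWN_MS = 5 * 60 * 1000;
const cooldownUntil = new Map<Model, number>();

function markRateLimited(model: Model, now = Date.now()): void {
  cooldownUntil.set(model, now + COOLDOWN_MS);
}

function isRateLimited(model: Model, now = Date.now()): boolean {
  return (cooldownUntil.get(model) ?? 0) > now;
}

// Walk the chain from the requested model; return the first model not in
// cooldown, or null when every model is rate-limited.
function resolveAvailableModel(requested: Model, now = Date.now()): Model | null {
  const start = Math.max(0, FALLBACK_CHAIN.indexOf(requested));
  for (let i = start; i < FALLBACK_CHAIN.length; i++) {
    if (!isRateLimited(FALLBACK_CHAIN[i], now)) return FALLBACK_CHAIN[i];
  }
  return null;
}
```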

Test plan

  • npm test passes (239 tests)
  • npx tsc -p tsconfig.server.json --noEmit passes
  • Manually tested: marked opus as rate-limited via API, verified agents dispatched with sonnet fallback
  • Verified watchdog detects rate limit strings in tmux output and re-queues jobs

Made with Cursor

bh13731 and others added 20 commits March 31, 2026 08:55
Set up the test infrastructure for integration testing:
- Add vitest, supertest, and @types/supertest as dev dependencies
- Create vitest.config.ts with --experimental-sqlite support
- Create shared test helpers: setupTestDb/cleanupTestDb for fresh :memory:
  databases, createSocketMock for capturing emitted events, and factory
  helpers for projects/workflows/jobs
- Add 11 smoke tests proving DB isolation, socket mocking, fixture
  factories, and WorkflowManager importability all work
- Add "test" and "test:watch" scripts to package.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
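A minimal illustration of the fixture-factory style described above; the field names and defaults here are assumptions, not the real job schema:

```typescript
// Monotonic counter gives each fixture a unique id per test run.
let seq = 0;

// Factory with overridable defaults, in the style of the test helpers.
function makeJob(
  overrides: Partial<{ id: string; status: string; model: string }> = {},
) {
  seq += 1;
  return { id: `job-${seq}`, status: "queued", model: "sonnet", ...overrides };
}
```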
…nt cycle)

Ports the structured plan/review/implement cycle from autonomous-coding-agents
into Hurlicane as a first-class feature. Each workflow runs Claude as implementer
and Codex as reviewer in a repeating cycle until all milestones are complete.

Key changes:
- New workflows table + job columns (workflow_id, workflow_cycle, workflow_phase)
- WorkflowManager: event-driven orchestrator modelled after DebateManager
- WorkflowPrompts: phase-specific prompts using notes as shared artifacts
- Single shared worktree per workflow (created once at start, all phases share
  one branch so changes accumulate linearly across cycles)
- Auto PR creation + worktree cleanup on workflow completion via gh CLI
- Codex review phase (cycle 2+) does full code review via git diff before
  updating the plan — adds Fix: milestones for any quality issues found
- isAutoExitJob() helper replaces debate_role checks across 8 call sites so
  workflow phase jobs also exit cleanly without calling finish_job
- PtyManager: PTY attach failure is a warning (not job failure) for auto-exit
  jobs since tailing already captures output
- REST API: GET/POST /api/workflows, cancel, resume endpoints
- UI: Autonomous Agents button, WorkflowForm, WorkflowDetailModal with
  milestone progress bar, plan/worklog/jobs tabs, View PR link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
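The `isAutoExitJob()` consolidation can be sketched like this (the `Job` shape is reduced to the two fields the check plausibly needs; the exact signature is an assumption):

```typescript
interface Job {
  debate_role?: string | null;
  workflow_phase?: string | null;
}

// Debate jobs and workflow phase jobs both exit on their own without
// calling the finish_job MCP tool, so callers should not treat a missing
// finish_job call as a failure for them.
function isAutoExitJob(job: Job): boolean {
  return Boolean(job.debate_role) || Boolean(job.workflow_phase);
}
```

Centralizing the check in one helper is what lets all 8 call sites stay in sync when a new auto-exit job type is added.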
…tree add

Without mkdir -p, worktree creation silently fails on fresh installs
or when the directory has been removed, causing a fallback to work_dir
directly and letting WorkQueueManager create per-phase worktrees again.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
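A sketch of the fix, with the exec step injectable so it can be stubbed in tests (`ensureWorktree` is an illustrative name, not the actual function):

```typescript
import { mkdirSync } from "node:fs";
import { dirname } from "node:path";
import { execFileSync } from "node:child_process";

function ensureWorktree(
  worktreePath: string,
  branch: string,
  exec: (cmd: string, args: string[]) => void = (c, a) => { execFileSync(c, a); },
): void {
  // mkdir -p the parent first so `git worktree add` cannot silently fail
  // when .orchestrator-worktrees/ does not exist yet.
  mkdirSync(dirname(worktreePath), { recursive: true });
  exec("git", ["worktree", "add", "-b", branch, worktreePath]);
}
```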
…and query type annotations

- Add work_dir and max_turns to Job interface in types.ts (columns exist in DB but were missing from type)
- Add eye:pr:* and eye:pr-review:* events to ServerToClientEvents (used by SocketManager but not typed)
- Add explicit :any annotations to db.prepare().all().map() callbacks in queries.ts (noImplicitAny)
- Use Map<string, Job> generic for jobMap constructors to fix tuple inference (fixes template_id and AgentWithJob errors)
- Complete missing fields in Job stub literals (debate_loop, workflow_id, etc.)
- Remove invalid shell: true from execSync in WorkflowManager (ExecSyncOptions.shell is string-only; execSync uses a shell by default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
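The Map generic fix reduces to this pattern (the `Job` interface here is trimmed to two fields for illustration):

```typescript
interface Job {
  id: string;
  template_id?: string | null;
}

const rows: Job[] = [{ id: "a" }, { id: "b", template_id: "t1" }];

// With the explicit Map<string, Job> type arguments, the map() callback's
// return is contextually typed as the tuple [string, Job], so .get() stays
// typed as Job | undefined instead of a collapsed union element type.
const jobMap = new Map<string, Job>(rows.map((j) => [j.id, j]));
```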
Recovery now scans for running workflows whose current-phase job is
done but no next-phase job was spawned — which happens when the server
restarts between finish_job and onJobCompleted. The gap detector
correctly advances assess→review, review→implement, implement→review
without double-spawning.

Also fix the worktree checkbox label to accurately describe shared-branch
behavior (not per-phase isolation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
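The transition table the gap detector advances through, reduced to a sketch (the `Phase` union and `nextPhase` name are illustrative):

```typescript
type Phase = "assess" | "review" | "implement";

// assess -> review -> implement -> review, cycling between review and
// implement until all milestones are complete.
function nextPhase(current: Phase): Phase {
  switch (current) {
    case "assess":
      return "review";
    case "review":
      return "implement";
    case "implement":
      return "review";
  }
}
```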
…repos

Two bugs caused the ea-ops-copilot workflow worktree to be deleted 5 min
after creation, making all subsequent phase jobs fail with ENOENT:

1. cleanupOrphanedWorktrees() deleted any .orchestrator-worktrees/ dir
   not in the DB — but workflow shared worktrees (wf-*) were never
   inserted into the worktrees table. Fix: only remove directories that
   appear in `git worktree list` for the current (Hurlicane) repo.
   Other repos' worktrees share the same parent dir and must not be touched.

2. startWorkflow() didn't register the shared worktree in the DB.
   Fix: attempt insertWorktree() after creation as a secondary guard
   (wrapped in try/catch since job_id FK may reject the sentinel value).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ete shared interfaces

Add Pr interface to shared types and use PrReview/PrReviewMessage for
all four PR-related socket events (eye:pr:new, eye:pr-review:new,
eye:pr-review:update, eye:pr-review:message) in ServerToClientEvents
and SocketManager emit functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…anup

Bug 1 (stuck workflow on restart) — two gaps remained:
- Gap detector used workflow.current_cycle to find the assess job, but
  onJobCompleted bumps current_cycle to 1 *before* spawnPhaseJob, so after
  a crash the assess job (cycle=0) was never found. Fix: look at cycle 0
  when current_phase='assess', regardless of current_cycle.
- created_at comparison was unguarded against undefined; added ?? 0 fallback.

Bug 2 (worktree deleted by cleanup) — previous fix was incomplete:
- insertWorktree(workflow.id) silently failed due to FK constraints (FK=ON),
  so Hurlicane-targeting workflow worktrees were still unprotected. Removed
  the broken sentinel approach entirely.
- cleanupOrphanedWorktrees() now builds the protected set from both the
  per-job worktrees table AND active/blocked workflow worktree_paths, so
  workflow shared worktrees are never treated as orphans regardless of which
  repo they belong to.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
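The final cleanup rule can be sketched as a pure function (`findOrphans` is an illustrative name): a directory is an orphan only if it appears in neither the per-job worktrees table nor any active/blocked workflow's worktree_path.

```typescript
function findOrphans(
  onDisk: string[],            // directories found under .orchestrator-worktrees/
  jobWorktrees: string[],      // paths from the per-job worktrees table
  workflowWorktrees: string[], // worktree_paths of active/blocked workflows
): string[] {
  const protectedPaths = new Set([...jobWorktrees, ...workflowWorktrees]);
  return onDisk.filter((p) => !protectedPaths.has(p));
}
```

Because the protected set is built from both sources, workflow shared worktrees (wf-*) survive cleanup regardless of which repo they belong to, with no FK-constrained sentinel row needed.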
Workflows could get stuck mid-transition if onJobCompleted threw after
finish_job but before spawnPhaseJob — recovery at startup catches this,
but jobs completing *after* startup had no periodic catch. Now
startWorkflowGapDetector() polls every 60s so any stuck workflow
recovers within a minute regardless of when it got stuck.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
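A minimal sketch of the periodic poll, assuming a scan callback that advances any stuck workflow (`detectAndFixGaps` is a stand-in, not the real function):

```typescript
const GAP_DETECTOR_INTERVAL_MS = 60_000;

function startWorkflowGapDetector(
  detectAndFixGaps: () => void,
  intervalMs = GAP_DETECTOR_INTERVAL_MS,
): ReturnType<typeof setInterval> {
  const timer = setInterval(detectAndFixGaps, intervalMs);
  timer.unref?.(); // don't keep the process alive just for the poll
  return timer;
}
```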
…bCompleted()

Add _resetForTest() exports to WorkflowManager and DebateManager that clear
module-level _processedJobs dedup Sets. Add resetManagerState() helper to
test helpers. Create workflow-dedup.test.ts with 6 tests proving:
- dedup guard blocks duplicate job ID within a single test
- _resetForTest() enables same job ID across separate tests (per-test independence)
- assess completion spawns review job with correct DB state + socket emissions
- failed phase job marks workflow as blocked
- non-workflow jobs are silently ignored

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
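The module-level dedup guard plus the test-only reset hook, in sketch form (`onJobCompleted` here is a stand-in for the manager's real handler; the `force` param matches the gap-detector bypass added later in this PR):

```typescript
// Module-level Set survives across calls, so the same job completion is
// processed at most once per process lifetime.
const _processedJobs = new Set<string>();

function onJobCompleted(jobId: string, force = false): boolean {
  if (!force && _processedJobs.has(jobId)) return false; // deduped
  _processedJobs.add(jobId);
  // ...advance the workflow here...
  return true;
}

// Exported for tests only: clears module state so each test starts clean.
function _resetForTest(): void {
  _processedJobs.clear();
}
```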
Agents that lost their MCP connection outside of wait_for_jobs had no
recovery path except a 90-minute idle timeout. This caused agents to
enter sleep-retry loops burning turn budget while the server was already
back up.

- Track MCP-disconnected agents in a new disconnectedAgents map
- Add watchdog Check 3: kill and restart disconnected agents after 2min
- Extend idle timeout from tmux-only to all agent types, reduce 90→20min
- Call workflowOnJobCompleted from watchdog so workflows advance immediately
- Add force param to WorkflowManager.onJobCompleted so gap detector can
  bypass _processedJobs dedup and actually recover stuck workflows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
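Watchdog Check 3 reduces to a selection over the disconnect map (names are illustrative; the real check also kills and restarts the selected agents):

```typescript
const DISCONNECT_GRACE_MS = 2 * 60 * 1000; // 2min grace before restart
const disconnectedAgents = new Map<string, number>(); // agentId -> disconnect time

// Agents whose MCP connection dropped more than the grace period ago are
// selected for kill-and-restart on the next watchdog tick.
function agentsToRestart(now = Date.now()): string[] {
  return [...disconnectedAgents.entries()]
    .filter(([, since]) => now - since > DISCONNECT_GRACE_MS)
    .map(([id]) => id);
}
```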
…e stopping conditions

Introduces stop_mode ('turns' | 'budget' | 'time' | 'completion') and stop_value
columns on jobs and workflows (per-phase). Adds token accumulator columns on agents.
Backfills existing rows with stop_mode='turns' and stop_value=max_turns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
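The new columns can be typed roughly like this (the `StopConfig` shape is an assumption; units follow the UI commit later in this PR — turn count, dollar amount, or minutes):

```typescript
type StopMode = "turns" | "budget" | "time" | "completion";

interface StopConfig {
  stop_mode: StopMode;
  stop_value: number | null; // turns, USD, or minutes; null for 'completion'
}

// Backfill rule from the commit: existing rows become turn-limited with
// their old max_turns as the stop value.
function backfill(maxTurns: number): StopConfig {
  return { stop_mode: "turns", stop_value: maxTurns };
}
```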
CostEstimator maps model + token counts to estimated USD using a pricing lookup
table. AgentRunner now extracts usage blocks from Claude/Codex stream events and
accumulates token counts on the agent row. Also uses effectiveMaxTurns() to set
--max-turns based on stop_mode (1000 safety cap for budget/time/completion).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
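The pricing-lookup shape, sketched with illustrative per-million-token rates (the real CostEstimator's model keys and prices may differ):

```typescript
// USD per million tokens; values here are assumptions for illustration.
const PRICING: Record<string, { input: number; output: number }> = {
  opus: { input: 15, output: 75 },
  sonnet: { input: 3, output: 15 },
  haiku: { input: 0.8, output: 4 },
};

function estimateCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICING[model];
  if (!p) return 0; // unknown model: treat as free rather than over-kill
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```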
HealthMonitor tick() now checks agents with stop_mode='budget' or 'time':
- Budget: estimates cost from accumulated tokens, warns at 80%, kills at 100%
- Time: compares elapsed time against limit, warns at 80%, kills at 100%
- Agents are stopped gracefully (marked done, not failed) since hitting a
  configured limit is expected behavior, not an error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
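The warn/kill thresholds reduce to one comparison helper (`checkLimit` is an illustrative name; the same rule applies to budget dollars and elapsed minutes):

```typescript
type LimitAction = "ok" | "warn" | "stop";

// Warn at 80% of the configured limit; stop gracefully at 100%.
function checkLimit(used: number, limit: number): LimitAction {
  if (limit <= 0) return "ok"; // no limit configured
  const ratio = used / limit;
  if (ratio >= 1) return "stop";
  if (ratio >= 0.8) return "warn";
  return "ok";
}
```

Returning a graceful "stop" rather than a failure matches the commit's point: hitting a configured limit is expected behavior, not an error.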
…and MCP tools

WorkflowManager now propagates per-phase stop config (stop_mode, stop_value) to
phase jobs in spawnPhaseJob, startWorkflow, and resumeWorkflow.

API routes (jobs.ts, workflows.ts) accept stopMode/stopValue from clients.
MCP create_job tool gains stop_mode and stop_value parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…WorkflowForm

New StopModePicker component provides a segmented button group for selecting
between Turns, Budget, Time, and Run to Completion modes with contextual
input fields (number for turns, dollar amount for budget, minutes for time).

WorkflowForm: replaces 3 max_turns number inputs with 3 StopModePicker instances.
JobForm: adds a StopModePicker (defaults to 'completion' for ad-hoc jobs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… agent

Without this, a workflow phase job killed by budget or time limit would
leave the workflow stuck until the periodic gap detector caught it (~60s).
Now killAgentGracefully calls debateOnJobCompleted and workflowOnJobCompleted
immediately, same as the normal exit path in AgentRunner.handleJobCompletion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ditions-turns-budge-0d3f3318

feat: flexible stopping conditions (turns / budget / time / completion)
The jobsCache Map in jobs.ts cached GET /api/jobs responses for 1.5s
with no invalidation on job mutation. This caused cross-test state
leakage (stale rows from a previous test's DB appeared in subsequent
fresh-DB tests) and was also a production correctness bug where create
or update operations would not be reflected in list responses for up
to 1.5s.

Remove the cache entirely — it was not providing meaningful benefit
relative to the correctness risk.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When Claude API rate limits (429) or overload errors (529) are hit,
the system now automatically detects and recovers:

ModelClassifier: Add a model fallback chain (opus -> sonnet[1m] ->
sonnet -> haiku). Track per-model rate limit cooldowns in both memory
and DB. resolveModel() checks rate limits before assigning models and
falls through the chain to find an available one.

StuckJobWatchdog: Add Check 5 that periodically scans tmux sessions
for rate limit error strings. When an agent has been stuck for >3min,
marks the model as rate-limited, kills the session, and re-queues the
job with a fallback model.
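The scan step can be sketched as a pattern match over captured pane text; the patterns below are assumptions, not the watchdog's actual string list:

```typescript
// Illustrative signals: HTTP 429/529 codes plus common phrasings.
const RATE_LIMIT_PATTERNS = [/\b429\b/, /\b529\b/, /rate limit/i, /overloaded/i];

function looksRateLimited(paneText: string): boolean {
  return RATE_LIMIT_PATTERNS.some((re) => re.test(paneText));
}
```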

models API: Add GET/POST/DELETE /api/models/rate-limits endpoints for
viewing and managing rate limit state manually.

PtyManager: When node-pty fails to attach (posix_spawnp) but the tmux
session is alive, set up a fallback poll to detect session exit instead
of immediately marking the agent as failed. This lets agents complete
via MCP finish_job despite the PTY viewer being unavailable.

Made-with: Cursor
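The PtyManager fallback can be sketched like this, with the per-tick decision split out so it is testable without tmux (`pollSessionExit` and `pollStep` are illustrative names; `tmux has-session -t` is the real CLI check):

```typescript
import { execFileSync } from "node:child_process";

// `tmux has-session -t <name>` exits 0 iff the session exists.
function tmuxSessionAlive(session: string): boolean {
  try {
    execFileSync("tmux", ["has-session", "-t", session], { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

// One poll step: returns true when polling should stop (session gone),
// after notifying the caller. The agent may already have completed via
// MCP finish_job; this just observes the exit.
function pollStep(alive: boolean, onExit: () => void): boolean {
  if (!alive) {
    onExit();
    return true;
  }
  return false;
}

function pollSessionExit(
  session: string,
  onExit: () => void,
  isAlive: (s: string) => boolean = tmuxSessionAlive,
  intervalMs = 5_000, // assumed poll interval
): ReturnType<typeof setInterval> {
  const timer = setInterval(() => {
    if (pollStep(isAlive(session), onExit)) clearInterval(timer);
  }, intervalMs);
  return timer;
}
```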
