Add rate limit model fallback and PTY resilience #4
Open
bh13731 wants to merge 20 commits into lightsparkdev:main
Set up the test infrastructure for integration testing:
- Add vitest, supertest, and @types/supertest as dev dependencies
- Create vitest.config.ts with --experimental-sqlite support
- Create shared test helpers: setupTestDb/cleanupTestDb for fresh :memory: databases, createSocketMock for capturing emitted events, and factory helpers for projects/workflows/jobs
- Add 11 smoke tests proving DB isolation, socket mocking, fixture factories, and WorkflowManager importability all work
- Add "test" and "test:watch" scripts to package.json
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nt cycle)
Ports the structured plan/review/implement cycle from autonomous-coding-agents into Hurlicane as a first-class feature. Each workflow runs Claude as implementer and Codex as reviewer in a repeating cycle until all milestones are complete.
Key changes:
- New workflows table + job columns (workflow_id, workflow_cycle, workflow_phase)
- WorkflowManager: event-driven orchestrator modelled after DebateManager
- WorkflowPrompts: phase-specific prompts using notes as shared artifacts
- Single shared worktree per workflow (created once at start, all phases share one branch so changes accumulate linearly across cycles)
- Auto PR creation + worktree cleanup on workflow completion via gh CLI
- Codex review phase (cycle 2+) does full code review via git diff before updating the plan, and adds Fix: milestones for any quality issues found
- isAutoExitJob() helper replaces debate_role checks across 8 call sites so workflow phase jobs also exit cleanly without calling finish_job
- PtyManager: PTY attach failure is a warning (not job failure) for auto-exit jobs since tailing already captures output
- REST API: GET/POST /api/workflows, cancel, resume endpoints
- UI: Autonomous Agents button, WorkflowForm, WorkflowDetailModal with milestone progress bar, plan/worklog/jobs tabs, View PR link
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
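The `isAutoExitJob()` helper named above could look roughly like the sketch below. The `JobLike` shape and the exact condition are assumptions for illustration; the commit only states that the helper replaces `debate_role` checks so that workflow phase jobs are covered too.

```typescript
// Hypothetical minimal job shape; the real Job interface has many more fields.
interface JobLike {
  debate_role?: string | null;
  workflow_phase?: string | null;
}

// Assumption: a job auto-exits (no finish_job call needed) when it is either
// a debate job or a workflow phase job.
function isAutoExitJob(job: JobLike): boolean {
  return Boolean(job.debate_role) || Boolean(job.workflow_phase);
}
```

Centralizing the check in one predicate means a new auto-exit job kind only has to be added in one place rather than at every call site.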
…tree add Without mkdir -p the worktree creation silently fails on fresh installs or when the directory has been removed, falling back to work_dir directly and allowing WorkQueueManager to create per-phase worktrees again. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and query type annotations
- Add work_dir and max_turns to Job interface in types.ts (columns exist in DB but were missing from type)
- Add eye:pr:* and eye:pr-review:* events to ServerToClientEvents (used by SocketManager but not typed)
- Add explicit :any annotations to db.prepare().all().map() callbacks in queries.ts (noImplicitAny)
- Use Map<string, Job> generic for jobMap constructors to fix tuple inference (fixes template_id and AgentWithJob errors)
- Complete missing fields in Job stub literals (debate_loop, workflow_id, etc.)
- Remove invalid shell: true from execSync in WorkflowManager (ExecSyncOptions.shell is string-only; execSync uses a shell by default)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Recovery now scans for running workflows whose current-phase job is done but for which no next-phase job was spawned. This happens when the server restarts between finish_job and onJobCompleted. The gap detector correctly advances assess→review, review→implement, implement→review without double-spawning. Also fix the worktree checkbox label to accurately describe shared-branch behavior (not per-phase isolation). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
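A minimal sketch of the gap check described above, assuming a simplified `PhaseJob` record and a `created_at` comparison to decide whether the next phase was already spawned. `phaseToSpawn` and the field names are hypothetical and do not reflect the actual WorkflowManager code.

```typescript
type WorkflowPhase = "assess" | "review" | "implement";

// Phase transitions as stated in the commit message.
const NEXT_PHASE: Record<WorkflowPhase, WorkflowPhase> = {
  assess: "review",
  review: "implement",
  implement: "review",
};

// Hypothetical, simplified job record.
interface PhaseJob {
  phase: WorkflowPhase;
  status: "queued" | "running" | "done" | "failed";
  created_at?: number;
}

// Returns the phase that still needs a job, or null when there is no gap.
function phaseToSpawn(current: PhaseJob, allJobs: PhaseJob[]): WorkflowPhase | null {
  if (current.status !== "done") return null; // phase is still in progress
  const next = NEXT_PHASE[current.phase];
  // A later job for the next phase means the transition already happened.
  const alreadySpawned = allJobs.some(
    (j) => j.phase === next && (j.created_at ?? 0) > (current.created_at ?? 0),
  );
  return alreadySpawned ? null : next;
}
```

Checking for an existing later job before spawning is what prevents double-spawning when recovery runs against a workflow that was never actually stuck.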
…repos
Two bugs caused the ea-ops-copilot workflow worktree to be deleted 5 min after creation, making all subsequent phase jobs fail with ENOENT:
1. cleanupOrphanedWorktrees() deleted any .orchestrator-worktrees/ dir not in the DB, but workflow shared worktrees (wf-*) were never inserted into the worktrees table. Fix: only remove directories that appear in `git worktree list` for the current (Hurlicane) repo. Other repos' worktrees share the same parent dir and must not be touched.
2. startWorkflow() didn't register the shared worktree in the DB. Fix: attempt insertWorktree() after creation as a secondary guard (wrapped in try/catch since job_id FK may reject the sentinel value).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ete shared interfaces Add Pr interface to shared types and use PrReview/PrReviewMessage for all four PR-related socket events (eye:pr:new, eye:pr-review:new, eye:pr-review:update, eye:pr-review:message) in ServerToClientEvents and SocketManager emit functions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…anup
Bug 1 (stuck workflow on restart). Two gaps remained:
- Gap detector used workflow.current_cycle to find the assess job, but onJobCompleted bumps current_cycle to 1 *before* spawnPhaseJob, so after a crash the assess job (cycle=0) was never found. Fix: look at cycle 0 when current_phase='assess', regardless of current_cycle.
- created_at comparison was unguarded against undefined; added ?? 0 fallback.
Bug 2 (worktree deleted by cleanup). The previous fix was incomplete:
- insertWorktree(workflow.id) silently failed due to FK constraints (FK=ON), so Hurlicane-targeting workflow worktrees were still unprotected. Removed the broken sentinel approach entirely.
- cleanupOrphanedWorktrees() now builds the protected set from both the per-job worktrees table AND active/blocked workflow worktree_paths, so workflow shared worktrees are never treated as orphans regardless of which repo they belong to.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
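The protected-set approach for Bug 2 can be sketched as a pure function. `findOrphanedWorktrees`, `WorkflowRow`, and the status strings are assumptions for illustration; in particular, treating `"running"` as the "active" status is a guess.

```typescript
// Hypothetical, simplified workflow row.
interface WorkflowRow {
  status: string;
  worktree_path: string | null;
}

// Returns the on-disk worktree directories that nothing in the DB still claims.
function findOrphanedWorktrees(
  onDisk: string[],
  jobWorktreePaths: string[], // paths from the per-job worktrees table
  workflows: WorkflowRow[],
): string[] {
  const protectedPaths = new Set<string>(jobWorktreePaths);
  for (const wf of workflows) {
    // Assumption: "running" and "blocked" workflows still need their worktree.
    const stillNeeded = wf.status === "running" || wf.status === "blocked";
    if (stillNeeded && wf.worktree_path) protectedPaths.add(wf.worktree_path);
  }
  return onDisk.filter((path) => !protectedPaths.has(path));
}
```

Deriving protection from workflow rows avoids the foreign-key problem entirely, since nothing has to be inserted into the per-job table.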
Workflows could get stuck mid-transition if onJobCompleted threw after finish_job but before spawnPhaseJob. Recovery at startup catches this, but workflows that got stuck *after* startup had no periodic check. Now startWorkflowGapDetector() polls every 60s so any stuck workflow recovers within a minute regardless of when it got stuck. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…bCompleted()
Add _resetForTest() exports to WorkflowManager and DebateManager that clear module-level _processedJobs dedup Sets. Add resetManagerState() helper to test helpers. Create workflow-dedup.test.ts with 6 tests proving:
- dedup guard blocks duplicate job ID within a single test
- _resetForTest() enables same job ID across separate tests (per-test independence)
- assess completion spawns review job with correct DB state + socket emissions
- failed phase job marks workflow as blocked
- non-workflow jobs are silently ignored
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
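A minimal sketch of a module-level dedup guard with a test-only reset, assuming a simplified `onJobCompleted` signature. The `handle` callback is hypothetical, and the `force` flag mirrors the parameter added in a later commit of this PR.

```typescript
// Module-level set of job IDs that have already been processed.
const _processedJobs = new Set<string>();

// Runs `handle` at most once per job ID unless `force` is set.
// Returns true when the handler actually ran.
function onJobCompleted(
  jobId: string,
  handle: (jobId: string) => void, // hypothetical stand-in for the real phase logic
  force = false,
): boolean {
  if (!force && _processedJobs.has(jobId)) return false;
  _processedJobs.add(jobId);
  handle(jobId);
  return true;
}

// Test-only hook: clears dedup state so tests can reuse job IDs.
function _resetForTest(): void {
  _processedJobs.clear();
}
```

Because the set lives at module scope, it survives across test cases in the same process, which is why an explicit reset is needed for per-test independence.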
Agents that lost their MCP connection outside of wait_for_jobs had no recovery path except a 90-minute idle timeout. This caused agents to enter sleep-retry loops burning turn budget while the server was already back up.
- Track MCP-disconnected agents in a new disconnectedAgents map
- Add watchdog Check 3: kill and restart disconnected agents after 2min
- Extend idle timeout from tmux-only to all agent types, reduce 90→20min
- Call workflowOnJobCompleted from watchdog so workflows advance immediately
- Add force param to WorkflowManager.onJobCompleted so gap detector can bypass _processedJobs dedup and actually recover stuck workflows
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
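The disconnect tracking could be sketched as below. The helper names (`markDisconnected`, `markReconnected`, `agentsToRestart`) and the explicit `now` parameter are assumptions made to keep the sketch testable. Only the `disconnectedAgents` map and the 2-minute threshold come from the commit message.

```typescript
const DISCONNECT_GRACE_MS = 2 * 60 * 1000; // 2 minutes, per the commit message

// agentId -> timestamp (ms) at which the MCP disconnect was first observed
const disconnectedAgents = new Map<string, number>();

function markDisconnected(agentId: string, now: number): void {
  // Keep the earliest timestamp so repeated disconnect events don't reset the clock.
  if (!disconnectedAgents.has(agentId)) disconnectedAgents.set(agentId, now);
}

function markReconnected(agentId: string): void {
  disconnectedAgents.delete(agentId);
}

// Agents that have been disconnected for at least the grace period.
function agentsToRestart(now: number): string[] {
  const due: string[] = [];
  for (const [agentId, since] of disconnectedAgents) {
    if (now - since >= DISCONNECT_GRACE_MS) due.push(agentId);
  }
  return due;
}
```

A watchdog tick would call `agentsToRestart(Date.now())` and then kill and restart each returned agent.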
…e stopping conditions
Introduces stop_mode ('turns' | 'budget' | 'time' | 'completion') and stop_value
columns on jobs and workflows (per-phase). Adds token accumulator columns on agents.
Backfills existing rows with stop_mode='turns' and stop_value=max_turns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CostEstimator maps model + token counts to estimated USD using a pricing lookup table. AgentRunner now extracts usage blocks from Claude/Codex stream events and accumulates token counts on the agent row. Also uses effectiveMaxTurns() to set --max-turns based on stop_mode (1000 safety cap for budget/time/completion). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
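A hedged sketch of the two pieces described above. The pricing table holds placeholder numbers for a made-up model name, not real prices, and `estimateCostUsd` is a hypothetical name. `effectiveMaxTurns` follows the commit message: the configured value for `turns` mode, otherwise the 1000-turn safety cap.

```typescript
type StopMode = "turns" | "budget" | "time" | "completion";

const SAFETY_CAP_TURNS = 1000;

// 'turns' mode uses the configured value; other modes rely on the safety cap.
// Assumption: a missing stop_value in 'turns' mode also falls back to the cap.
function effectiveMaxTurns(stopMode: StopMode, stopValue: number | null): number {
  return stopMode === "turns" && stopValue !== null ? stopValue : SAFETY_CAP_TURNS;
}

interface ModelPricing {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

// Placeholder values for illustration only; NOT real pricing.
const PRICING = new Map<string, ModelPricing>([
  ["example-model", { inputPerMTok: 1, outputPerMTok: 2 }],
]);

// Returns the estimated cost in USD, or null for a model with no pricing entry.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number | null {
  const price = PRICING.get(model);
  if (price === undefined) return null;
  return (inputTokens / 1e6) * price.inputPerMTok + (outputTokens / 1e6) * price.outputPerMTok;
}
```

Returning null for unknown models keeps a missing pricing entry from being silently treated as zero cost.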
HealthMonitor tick() now checks agents with stop_mode='budget' or 'time':
- Budget: estimates cost from accumulated tokens, warns at 80%, kills at 100%
- Time: compares elapsed time against limit, warns at 80%, kills at 100%
- Agents are stopped gracefully (marked done, not failed) since hitting a configured limit is expected behavior, not an error.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
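The 80% / 100% thresholds apply the same way to both budget and time, so they can be sketched as one helper. `checkLimit` and `LimitAction` are hypothetical names; `used` and `limit` just need to share a unit (USD for budget, minutes or milliseconds for time).

```typescript
type LimitAction = "ok" | "warn" | "stop";

const WARN_RATIO = 0.8;

// Compares consumption against a configured limit.
function checkLimit(used: number, limit: number): LimitAction {
  // Assumption: a non-positive limit means "no limit configured".
  if (limit <= 0) return "ok";
  const ratio = used / limit;
  if (ratio >= 1) return "stop"; // 100% reached: stop gracefully (done, not failed)
  if (ratio >= WARN_RATIO) return "warn"; // 80% reached: emit a warning
  return "ok";
}
```

For example, a `budget` agent with a 5 USD limit would warn once its estimated cost reaches 4 USD and be stopped at 5 USD.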
…and MCP tools WorkflowManager now propagates per-phase stop config (stop_mode, stop_value) to phase jobs in spawnPhaseJob, startWorkflow, and resumeWorkflow. API routes (jobs.ts, workflows.ts) accept stopMode/stopValue from clients. MCP create_job tool gains stop_mode and stop_value parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…WorkflowForm New StopModePicker component provides a segmented button group for selecting between Turns, Budget, Time, and Run to Completion modes with contextual input fields (number for turns, dollar amount for budget, minutes for time). WorkflowForm: replaces 3 max_turns number inputs with 3 StopModePicker instances. JobForm: adds a StopModePicker (defaults to 'completion' for ad-hoc jobs). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… agent Without this, a workflow phase job killed by budget or time limit would leave the workflow stuck until the periodic gap detector caught it (~60s). Now killAgentGracefully calls debateOnJobCompleted and workflowOnJobCompleted immediately, same as the normal exit path in AgentRunner.handleJobCompletion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ditions-turns-budge-0d3f3318 feat: flexible stopping conditions (turns / budget / time / completion)
The jobsCache Map in jobs.ts cached GET /api/jobs responses for 1.5s with no invalidation on job mutation. This caused cross-test state leakage (stale rows from a previous test's DB appeared in subsequent fresh-DB tests) and was also a production correctness bug where create or update operations would not be reflected in list responses for up to 1.5s. Remove the cache entirely — it was not providing meaningful benefit relative to the correctness risk. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When Claude API rate limits (429) or overload errors (529) are hit, the system now automatically detects and recovers:
- ModelClassifier: Add a model fallback chain (opus -> sonnet[1m] -> sonnet -> haiku). Track per-model rate limit cooldowns in both memory and DB. resolveModel() checks rate limits before assigning models and falls through the chain to find an available one.
- StuckJobWatchdog: Add Check 5 that periodically scans tmux sessions for rate limit error strings. When an agent has been stuck for >3min, marks the model as rate-limited, kills the session, and re-queues the job with a fallback model.
- models API: Add GET/POST/DELETE /api/models/rate-limits endpoints for viewing and managing rate limit state manually.
- PtyManager: When node-pty fails to attach (posix_spawnp) but the tmux session is alive, set up a fallback poll to detect session exit instead of immediately marking the agent as failed. This lets agents complete via MCP finish_job despite the PTY viewer being unavailable.
Made-with: Cursor
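The fallback chain and cooldown check can be sketched as follows. This is an in-memory illustration only: the DB persistence mentioned above is omitted, and the function signatures, the explicit `now` parameter, and the handling of models outside the chain are assumptions rather than the actual ModelClassifier API.

```typescript
// Fallback order as stated in the commit message.
const FALLBACK_CHAIN: readonly string[] = ["opus", "sonnet[1m]", "sonnet", "haiku"];

// model -> timestamp (ms) until which the model is considered rate-limited
const rateLimitedUntil = new Map<string, number>();

function markRateLimited(model: string, cooldownMs: number, now: number): void {
  rateLimitedUntil.set(model, now + cooldownMs);
}

function isRateLimited(model: string, now: number): boolean {
  const until = rateLimitedUntil.get(model);
  return until !== undefined && now < until;
}

// Returns the first available model at or after `preferred` in the chain,
// or null when every candidate is cooling down.
function resolveModel(preferred: string, now: number): string | null {
  const start = FALLBACK_CHAIN.indexOf(preferred);
  // Assumption: a model outside the chain has no fallback.
  if (start === -1) return isRateLimited(preferred, now) ? null : preferred;
  for (const candidate of FALLBACK_CHAIN.slice(start)) {
    if (!isRateLimited(candidate, now)) return candidate;
  }
  return null;
}
```

Starting the search at the preferred model's position means a job that asked for `sonnet` is never upgraded to `opus`, only moved further down the chain.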
Summary
Changes
- getFallbackModel() chain, resolveModel() now checks rate limits before assigning
- GET/POST/DELETE /api/models/rate-limits for manual rate limit management
Test plan
- npm test passes (239 tests)
- npx tsc -p tsconfig.server.json --noEmit passes
Made with Cursor