feat: centralize CLI models + refresh README for open source launch#435
feat: centralize CLI models + refresh README for open source launch#435khaliqgant wants to merge 91 commits intomainfrom
Conversation
- Add --local option to `up` command for future local Relaycast binary - Dashboard auto-creates its own RelayAdapter when projectRoot is provided - Simplify startDashboard call (no relayAdapter param needed from CLI) - Stub logs fallback message until co-founder ships Relaycast binary Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts from main's version bumps (2.3.14) and new features (agent heartbeat, exit detection, workflows) while preserving the PR's package consolidation and broker SDK refactor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| if (!apiKey) { | ||
| const raw = await readFile(cachePath, "utf-8"); | ||
| const creds = JSON.parse(raw); | ||
| apiKey = creds.api_key; | ||
| } |
There was a problem hiding this comment.
🔴 createRelaycastClient throws unhandled ENOENT when cache file doesn't exist
When no API key is provided via options or RELAY_API_KEY env var, createRelaycastClient calls readFile(cachePath, "utf-8") without a try/catch. If the cache file ~/.agent-relay/relaycast.json doesn't exist (common for first-time users or CI environments), this throws an unhandled ENOENT error instead of the intended "API key not found" error message on line 40.
Root Cause and Impact
At packages/broker-sdk/src/relaycast.ts:34, the readFile call is not wrapped in a try/catch. The code assumes the file exists and is valid JSON. If the file is missing, the function throws Error: ENOENT: no such file or directory instead of falling through to the descriptive error on line 40: "Relaycast API key not found in options, RELAY_API_KEY, or cache file".
This affects any caller of createRelaycastClient without an API key, including the MCP adapter's getInbox() and getHealth() methods at packages/mcp/src/client-adapter.ts:638 and packages/mcp/src/client-adapter.ts:680.
Actual behavior: Throws ENOENT: no such file or directory, open '/home/user/.agent-relay/relaycast.json'
Expected behavior: Throws "Relaycast API key not found in options, RELAY_API_KEY, or cache file"
| if (!apiKey) { | |
| const raw = await readFile(cachePath, "utf-8"); | |
| const creds = JSON.parse(raw); | |
| apiKey = creds.api_key; | |
| } | |
| if (!apiKey) { | |
| try { | |
| const raw = await readFile(cachePath, "utf-8"); | |
| const creds = JSON.parse(raw); | |
| apiKey = creds.api_key; | |
| } catch { | |
| // Cache file missing or invalid — fall through to the error below | |
| } | |
| } |
Was this helpful? React with 👍 or 👎 to provide feedback.
| @@ -194,9 +204,36 @@ export class AgentRelayClient { | |||
| return result; | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
🟡 spawnPty sends hardcoded runtime "pty" but spawnAgent sends "headless_claude" — inconsistent runtime values
The spawnPty method at packages/broker-sdk/src/client.ts:172 hardcodes runtime: "pty" in the AgentSpec, while the spawnAgent method at line 196 hardcodes runtime: "headless_claude". However, the Rust binary's AgentSpec struct uses an AgentRuntime enum. The spawnPty method always sends "pty" regardless of the CLI being spawned.
Detailed Explanation
At packages/broker-sdk/src/client.ts:168-187, the spawnPty method constructs an AgentSpec with runtime: "pty". This is correct for the general case. However, the new fields model, cwd, team, shadow_of, and shadow_mode are passed through from SpawnPtyInput to AgentSpec correctly.
The concern is that the spawnAgent method at line 190-204 hardcodes runtime: "headless_claude" — this means any agent spawned via spawnAgent is always treated as headless Claude, even if a different CLI is specified. This was pre-existing behavior but is now more visible since the MCP adapter's toAgentInfo function at packages/mcp/src/client-adapter.ts:281 maps headless_claude to cli: 'claude' and everything else to cli: 'pty', which means non-Claude agents spawned via spawnAgent would be incorrectly labeled.
This is a pre-existing issue that becomes more impactful with the migration since more code paths now use these methods.
Was this helpful? React with 👍 or 👎 to provide feedback.
packages/broker-sdk/src/relay.ts
Outdated
| async waitForAgentReady(name: string, timeoutMs = 60_000): Promise<Agent> { | ||
| const client = await this.ensureStarted(); | ||
| const existing = this.knownAgents.get(name); | ||
| if (existing && this.readyAgents.has(name)) { | ||
| return existing; | ||
| } | ||
|
|
||
| return new Promise<Agent>((resolve, reject) => { | ||
| let settled = false; | ||
| let timeout: ReturnType<typeof setTimeout> | undefined; | ||
|
|
||
| const cleanup = () => { | ||
| unsubscribe(); | ||
| if (timeout) { | ||
| clearTimeout(timeout); | ||
| timeout = undefined; | ||
| } | ||
| }; | ||
|
|
||
| const resolveWith = (agent: Agent) => { | ||
| if (settled) return; | ||
| settled = true; | ||
| cleanup(); | ||
| resolve(agent); | ||
| }; | ||
|
|
||
| const rejectWith = (error: Error) => { | ||
| if (settled) return; | ||
| settled = true; | ||
| cleanup(); | ||
| reject(error); | ||
| }; | ||
|
|
||
| const unsubscribe = client.onEvent((event) => { | ||
| if (event.kind !== "worker_ready" || event.name !== name) { | ||
| return; | ||
| } | ||
| let agent = this.knownAgents.get(event.name); | ||
| if (!agent) { | ||
| agent = this.makeAgent(event.name, event.runtime, []); | ||
| this.knownAgents.set(event.name, agent); | ||
| } | ||
| this.readyAgents.add(event.name); | ||
| resolveWith(agent); | ||
| }); | ||
|
|
||
| timeout = setTimeout(() => { | ||
| rejectWith(new Error(`Timed out waiting for worker_ready for '${name}' after ${timeoutMs}ms`)); | ||
| }, timeoutMs); | ||
|
|
||
| const known = this.knownAgents.get(name); | ||
| if (known && this.readyAgents.has(name)) { | ||
| resolveWith(known); | ||
| } | ||
| }); | ||
| } |
There was a problem hiding this comment.
🟡 waitForAgentReady has a TOCTOU race between initial check and event subscription
In waitForAgentReady, the initial early-return check at lines 337-339 runs before the Promise constructor. Inside the Promise, the event listener is subscribed at line 368, and then a second check is performed at lines 385-388. However, a worker_ready event can arrive between the ensureStarted() call (line 336) and the onEvent subscription (line 368), causing the method to miss the event and time out.
Detailed Explanation of the Race Window
The sequence is:
ensureStarted()— starts the client, which begins receiving events- Check
knownAgents+readyAgents(lines 337-339) — agent not ready yet - Enter Promise constructor
- RACE WINDOW: Between steps 2 and 5, the event loop can process a
worker_readyevent that updatesreadyAgentsvia the main event handler atrelay.ts:506-514 - Subscribe to
onEvent(line 368) — too late, event already processed - Re-check
readyAgents(lines 385-388) — this catches the race!
Actually, looking more carefully, the re-check at lines 385-388 does catch this race condition because it runs synchronously within the same Promise constructor after subscribing. Since the worker_ready event handler at line 506-514 updates readyAgents synchronously, and the re-check at 385-388 reads it synchronously, this is safe.
However, there's still a subtle issue: the check at lines 337-339 and the check at lines 385-388 are identical duplicates. The first check (337-339) is redundant because the second check (385-388) handles the same case. This is not a bug per se, but the duplicate check is misleading — it suggests the author was trying to handle a race condition but the actual protection comes from the second check only.
Was this helpful? React with 👍 or 👎 to provide feedback.
CLI Refactoring: - Split 5361-line src/cli/index.ts into 8 modular command files + 14 shared lib files - Wire bootstrap.ts as new CLI entry point - Remove legacy patterns: tmux, socket/daemon, agents.json, storage adapter - Remove create-agent command (replaced by spawn) - Migrate all spawnPty calls to broker-sdk AgentRelayClient - Rename Rust broker binary from agent-relay to agent-relay-broker - 66 tests with full DI pattern across all modules Broker-SDK Improvements: - Per-agent cwd on spawnPty - Agent.waitForReady() promise - Structured data field in sendMessage - Reason parameter on release() - Model as first-class spawn option - Agent.status computed getter (spawning/ready/idle/exited) - Agent.onOutput() per-agent streaming hook - relay.system() convenience handle SDK Rename: - Rename packages/broker-sdk to packages/sdk (@agent-relay/sdk) - Delete empty packages/sdk-ts - Update all imports, build scripts, CI workflows Workflow Runner Fixes: - Add trajectories and idleThresholdSecs to schema.json - Fix README template table patterns - Extract shared execute/resume logic into runWorkflowCore - Remove global process.env.RELAY_API_KEY mutation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts: # packages/broker-sdk/src/logs.ts # packages/broker-sdk/tsconfig.json # packages/sdk/src/__tests__/idle-nudge.test.ts # packages/sdk/src/__tests__/unit.test.ts # packages/sdk/src/examples/workflow-superiority.ts # packages/sdk/src/relay.ts # packages/sdk/src/workflows/builtin-templates/competitive.yaml # packages/sdk/src/workflows/runner.ts
Tests for gc command and tmux discovery were referencing deleted functionality from the legacy cleanup wave. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docs (22 files): - Replace create-agent with spawn across all docs - Rename agent-relay binary refs to agent-relay-broker - Update imports from 'agent-relay' to '@agent-relay/sdk' - Remove tmux references and prerequisites - Update socket/legacy references to file-based protocol CLI: - Remove obsolete doctor command (SQLite/JSONL diagnostics no longer needed) - Clean up doctor from monitoring module, bootstrap, and all tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o cli-uses-broker
| const allLines = content.split('\n'); | ||
| const tailLines = allLines.slice(-lines); | ||
| return tailLines.join('\n').trim(); | ||
| fh = await open(filePath, "r"); |
Check failure
Code scanning / CodeQL
Insecure temporary file High
Copilot Autofix
AI about 2 hours ago
Copilot could not generate an autofix suggestion
Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or if the problem persists contact support.
| } else { | ||
| machineId = `${deps.getHostname()}-${deps.randomHex(8)}`; | ||
| fs.mkdirSync(dataDir, { recursive: true }); | ||
| fs.writeFileSync(machineIdPath, machineId); |
Check failure
Code scanning / CodeQL
Potential file system race condition High
| const modelSplit = architectCli.split(':'); | ||
| const cli = modelSplit[0]; | ||
| const model = modelSplit[1]; | ||
| const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$'); |
Check failure
Code scanning / CodeQL
Incomplete string escaping or encoding High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 14 hours ago
In general, the problem arises because the code tries to escape one special character ($) in a string coming from deps.getAgentOutboxTemplate() without considering backslashes or the real context in which the string needs escaping. The correct fix is to (a) avoid ad‑hoc partial escaping and (b) either use a well-tested escaping helper for the actual target context (regex, replacement string, shell, etc.) or, if the string is only used literally, not escape it at all.
Within this snippet, outboxPath is only used inside a prompt string shown to an agent: `Use relay protocol with outbox path ${outboxPath}. ...`. There is no interpretation of $ or \ in such plain text, so escaping $ is unnecessary and misleading. The minimal and safest behavior-preserving fix is to stop doing incomplete escaping and just use the template as-is. That means replacing:
const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$');with:
const outboxPath = deps.getAgentOutboxTemplate();This removes the broken custom escaping without changing any visible semantics, because the prompt will now show the actual template string instead of an odd $→\$-mangled form. No new imports or helper functions are needed; we just simplify the assignment on line 91.
| @@ -88,7 +88,7 @@ | ||
| const modelSplit = architectCli.split(':'); | ||
| const cli = modelSplit[0]; | ||
| const model = modelSplit[1]; | ||
| const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$'); | ||
| const outboxPath = deps.getAgentOutboxTemplate(); | ||
| const prompt = | ||
| 'You are the Architect, a cross-project coordinator.\n\n' + | ||
| `Connected Projects:\n${formatProjectContext(valid)}\n\n` + |
| deps.log(` - ${credential.provider}`); | ||
| } | ||
|
|
||
| fs.writeFileSync(credentialsPath, JSON.stringify(credentials, null, 2)); |
Check warning
Code scanning / CodeQL
Network data written to file Medium
| } | ||
|
|
||
| if (process.platform === 'win32') { | ||
| await execFileAsync('cmd', ['/c', 'start', '', url]); |
Check warning
Code scanning / CodeQL
Indirect uncontrolled command line Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 14 hours ago
General fix: avoid invoking cmd with user-controlled arguments. Use an API that does not require shell metacharacter parsing, or change the invoked program to something that directly accepts a URL argument without indirect shell semantics. For opening URLs on Windows, a common pattern is to call explorer.exe with the URL; this runs without an intermediate cmd layer and still opens the user’s default browser.
Concrete best fix here, without changing external behavior:
- Keep using
execFileAsync(already a safe API). - Change the Windows branch of
defaultOpenExternalto executeexplorer.exedirectly instead ofcmd /c start "" url. - This preserves the “open in default browser” behavior while removing the dependency on
cmd’s argument parsing. - No changes are needed elsewhere;
deps.openExternal’s behavior is the same from the caller’s perspective.
Specific change:
-
In
src/cli/commands/cloud.ts, indefaultOpenExternal, replace:if (process.platform === 'win32') { await execFileAsync('cmd', ['/c', 'start', '', url]); return; }
with:
if (process.platform === 'win32') { await execFileAsync('explorer.exe', [url]); return; }
No new imports or helper functions are needed; we reuse the existing execFileAsync defined with promisify(execFile).
| @@ -67,7 +67,7 @@ | ||
| } | ||
|
|
||
| if (process.platform === 'win32') { | ||
| await execFileAsync('cmd', ['/c', 'start', '', url]); | ||
| await execFileAsync('explorer.exe', [url]); | ||
| return; | ||
| } | ||
|
|
Delete all 20 MCP integration test files (now obsolete after @agent-relay/mcp removal). Update test runner to SDK-only, simplify package.json test scripts. Fix e2e permission denied by chmod+x index.js in postbuild. Fix stress test crash from missing packages/resiliency import. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add auto-restart supervisor, metrics collection, and crash insights modules to the Rust broker. Expose new types and methods via both TypeScript and Python SDKs. Remove references to deleted @agent-relay/storage package from CI workflows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o cli-uses-broker
The broker had two redundant modes: init (stdio, full-featured) and listen (HTTP API, stripped-down). The dashboard needs to proxy /api/spawn to the broker, but init mode had no HTTP API. This adds --api-port to init mode so the HTTP API runs alongside stdio, then removes the redundant listen mode entirely. Changes: - Add --api-port flag to InitCommand (default 0 = disabled) - Start HTTP API server in run_init when api_port > 0 - Handle spawn/release/list API requests in init's main select loop - Update CLI to pass --api-port 3889 when dashboard is active - Simplify dashboard relay URL resolution (always http://localhost:3889) - Remove getDashboardSpawnEnv standalone-mode hack - Remove ListenCommand, Commands::Listen, and run_listen() (~330 lines) - Move spawn_env_vars to spawner.rs (still used by wrap mode) - Update integration tests from listen to init --api-port - Update ARCHITECTURE.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
binaryArgs like --api-port were prepended before the 'init' subcommand, causing clap to reject the unknown flag at the top level. Move them after the subcommand and its standard args so clap parses them correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o cli-uses-broker
Add WebSocket endpoint to broker HTTP API that streams dashboard-relevant events (agent lifecycle, message delivery, worker status) via a tokio broadcast channel. This enables the dashboard to receive real-time updates without polling the Relaycast REST API. - Add axum ws feature to Cargo.toml - Add broadcast channel piped through stdout writer task - Add /ws route with WebSocketUpgrade handler - Filter events by kind (relay_inbound, agent_spawned, etc.) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add axum ws feature and broadcast channel to broker - Add /ws WebSocket endpoint streaming dashboard-relevant events - Handle broadcast::Lagged errors gracefully (skip, don't disconnect) - Add 30s ping keepalive to detect dead WS clients - Fix lint-staged: models.generated.ts → cli-registry.generated.ts - Add npm_link to CLIRegistry codegen output Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tration retry - Add `task` field to HTTP API spawn handler so dashboard-spawned agents receive their task via initial_task injection - Implement actual spawn logic for Relaycast WS `agent.spawn_requested` events (previously ignored with a warning) - Add retry_agent_registration helper with 3 attempts for transient pre-registration failures (rate limits, transport errors) - Extract HTTP API handlers into src/listen_api.rs (~340 lines) - Move RelaycastAgentEvent, parse_relaycast_agent_event, and registration helpers into src/relaycast_ws.rs - Reduce main.rs from 5031 to 4633 lines Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Optional X-API-Key auth on /api/* and /ws, /health public - 401 uses RelayError envelope shape - /api/threads endpoint with thread grouping and unread counts - Agent state publishing to Relaycast WS (spawned/idle/exited/stuck)
- Update message_bridge.rs and main.rs to satisfy contract fixtures - All contract fixture tests now passing
- 5 fixture JSON files (identity, errors, events, health, replay) - Red contract tests for SDK + broker conformance - broker-sdk TypeScript contract test stubs
- ReplayBuffer with VecDeque ring buffer, capacity 1000 - AtomicU64 monotonic seq counter - replay_since() for WS reconnect catch-up
# Conflicts: # src/listen_api.rs # src/main.rs
# Conflicts: # .trajectories/index.json # packages/sdk/src/__tests__/contract-fixtures.test.ts # src/main.rs # src/message_bridge.rs
# Conflicts: # packages/sdk/src/__tests__/contract-fixtures.test.ts # src/main.rs
- Resolve 8 merge conflict blocks in main.rs test module - Fix listen_api_health to return rich /health shape (agentCount, etc.) - Add /api/events/replay route for replay buffer endpoint - Add terminal guard for delivery_ack timeout handling - Support degraded health status on startup 429 190 lib tests + 179 bin tests all passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This PR prepares the relay repo for open source launch with:
Centralized CLI Registry
packages/shared/cli-registry.yaml@agent-relay/configModels.Claude.OPUS,Models.Codex.CODEX_5_3, etc.)README Refresh
Cleanup
scripts/migrate-to-broker-sdk.ts(migration complete)TRAIL_GIT_AUTH_FIX.mdto.trajectories/completed/.worktrees/to.gitignoreTest plan
npm run codegento confirm generation works🤖 Generated with Claude Code