Skip to content

Comments

feat: centralize CLI models + refresh README for open source launch#435

Open
khaliqgant wants to merge 91 commits intomainfrom
cli-uses-broker
Open

feat: centralize CLI models + refresh README for open source launch#435
khaliqgant wants to merge 91 commits intomainfrom
cli-uses-broker

Conversation

@khaliqgant
Copy link
Collaborator

@khaliqgant khaliqgant commented Feb 18, 2026

Summary

This PR prepares the relay repo for open source launch with:

Centralized CLI Registry

  • Single source of truth for CLI versions and models in packages/shared/cli-registry.yaml
  • Auto-generated TypeScript enums in @agent-relay/config
  • SDK re-exports model enums (Models.Claude.OPUS, Models.Codex.CODEX_5_3, etc.)
  • GitHub Action to auto-regenerate on YAML changes
  • Version checking script for npm/PyPI updates

README Refresh

  • Workflows section: YAML format, CLI commands, built-in templates
  • Cloud section: Multi-agent orchestration, isolated sandboxes, 24/7 durable workflows
  • Integrations: GitHub CI auto-fix, Linear, Slack
  • Architecture section: Component diagram explaining Relay Broker, Relaycast, Dashboard, Cloud, Trajectories
  • Swarm patterns: 20+ patterns organized by category (Core, Data, Quality, Adversarial, Resilience)

Cleanup

  • Remove scripts/migrate-to-broker-sdk.ts (migration complete)
  • Move TRAIL_GIT_AUTH_FIX.md to .trajectories/completed/
  • Add .worktrees/ to .gitignore

Test plan

  • Verify README renders correctly on GitHub
  • Verify model enums export correctly from SDK
  • Run npm run codegen to confirm generation works

🤖 Generated with Claude Code

khaliqgant and others added 4 commits February 18, 2026 11:09
- Add --local option to `up` command for future local Relaycast binary
- Dashboard auto-creates its own RelayAdapter when projectRoot is provided
- Simplify startDashboard call (no relayAdapter param needed from CLI)
- Stub logs fallback message until co-founder ships Relaycast binary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 7 additional findings.

Open in Devin Review

Resolve conflicts from main's version bumps (2.3.14) and new features
(agent heartbeat, exit detection, workflows) while preserving the PR's
package consolidation and broker SDK refactor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

View 7 additional findings in Devin Review.

Open in Devin Review

Comment on lines 33 to 37
if (!apiKey) {
const raw = await readFile(cachePath, "utf-8");
const creds = JSON.parse(raw);
apiKey = creds.api_key;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 createRelaycastClient throws unhandled ENOENT when cache file doesn't exist

When no API key is provided via options or RELAY_API_KEY env var, createRelaycastClient calls readFile(cachePath, "utf-8") without a try/catch. If the cache file ~/.agent-relay/relaycast.json doesn't exist (common for first-time users or CI environments), this throws an unhandled ENOENT error instead of the intended "API key not found" error message on line 40.

Root Cause and Impact

At packages/broker-sdk/src/relaycast.ts:34, the readFile call is not wrapped in a try/catch. The code assumes the file exists and is valid JSON. If the file is missing, the function throws Error: ENOENT: no such file or directory instead of falling through to the descriptive error on line 40: "Relaycast API key not found in options, RELAY_API_KEY, or cache file".

This affects any caller of createRelaycastClient without an API key, including the MCP adapter's getInbox() and getHealth() methods at packages/mcp/src/client-adapter.ts:638 and packages/mcp/src/client-adapter.ts:680.

Actual behavior: Throws ENOENT: no such file or directory, open '/home/user/.agent-relay/relaycast.json'
Expected behavior: Throws "Relaycast API key not found in options, RELAY_API_KEY, or cache file"

Suggested change
if (!apiKey) {
const raw = await readFile(cachePath, "utf-8");
const creds = JSON.parse(raw);
apiKey = creds.api_key;
}
if (!apiKey) {
try {
const raw = await readFile(cachePath, "utf-8");
const creds = JSON.parse(raw);
apiKey = creds.api_key;
} catch {
// Cache file missing or invalid — fall through to the error below
}
}
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@@ -194,9 +204,36 @@ export class AgentRelayClient {
return result;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 spawnPty sends hardcoded runtime "pty" but spawnAgent sends "headless_claude" — inconsistent runtime values

The spawnPty method at packages/broker-sdk/src/client.ts:172 hardcodes runtime: "pty" in the AgentSpec, while the spawnAgent method at line 196 hardcodes runtime: "headless_claude". However, the Rust binary's AgentSpec struct uses an AgentRuntime enum. The spawnPty method always sends "pty" regardless of the CLI being spawned.

Detailed Explanation

At packages/broker-sdk/src/client.ts:168-187, the spawnPty method constructs an AgentSpec with runtime: "pty". This is correct for the general case. However, the new fields model, cwd, team, shadow_of, and shadow_mode are passed through from SpawnPtyInput to AgentSpec correctly.

The concern is that the spawnAgent method at line 190-204 hardcodes runtime: "headless_claude" — this means any agent spawned via spawnAgent is always treated as headless Claude, even if a different CLI is specified. This was pre-existing behavior but is now more visible since the MCP adapter's toAgentInfo function at packages/mcp/src/client-adapter.ts:281 maps headless_claude to cli: 'claude' and everything else to cli: 'pty', which means non-Claude agents spawned via spawnAgent would be incorrectly labeled.

This is a pre-existing issue that becomes more impactful with the migration since more code paths now use these methods.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 15 additional findings in Devin Review.

Open in Devin Review

Comment on lines 335 to 390
async waitForAgentReady(name: string, timeoutMs = 60_000): Promise<Agent> {
const client = await this.ensureStarted();
const existing = this.knownAgents.get(name);
if (existing && this.readyAgents.has(name)) {
return existing;
}

return new Promise<Agent>((resolve, reject) => {
let settled = false;
let timeout: ReturnType<typeof setTimeout> | undefined;

const cleanup = () => {
unsubscribe();
if (timeout) {
clearTimeout(timeout);
timeout = undefined;
}
};

const resolveWith = (agent: Agent) => {
if (settled) return;
settled = true;
cleanup();
resolve(agent);
};

const rejectWith = (error: Error) => {
if (settled) return;
settled = true;
cleanup();
reject(error);
};

const unsubscribe = client.onEvent((event) => {
if (event.kind !== "worker_ready" || event.name !== name) {
return;
}
let agent = this.knownAgents.get(event.name);
if (!agent) {
agent = this.makeAgent(event.name, event.runtime, []);
this.knownAgents.set(event.name, agent);
}
this.readyAgents.add(event.name);
resolveWith(agent);
});

timeout = setTimeout(() => {
rejectWith(new Error(`Timed out waiting for worker_ready for '${name}' after ${timeoutMs}ms`));
}, timeoutMs);

const known = this.knownAgents.get(name);
if (known && this.readyAgents.has(name)) {
resolveWith(known);
}
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 waitForAgentReady has a TOCTOU race between initial check and event subscription

In waitForAgentReady, the initial early-return check at lines 337-339 runs before the Promise constructor. Inside the Promise, the event listener is subscribed at line 368, and then a second check is performed at lines 385-388. However, a worker_ready event can arrive between the ensureStarted() call (line 336) and the onEvent subscription (line 368), causing the method to miss the event and time out.

Detailed Explanation of the Race Window

The sequence is:

  1. ensureStarted() — starts the client, which begins receiving events
  2. Check knownAgents + readyAgents (lines 337-339) — agent not ready yet
  3. Enter Promise constructor
  4. RACE WINDOW: Between steps 2 and 5, the event loop can process a worker_ready event that updates readyAgents via the main event handler at relay.ts:506-514
  5. Subscribe to onEvent (line 368) — too late, event already processed
  6. Re-check readyAgents (lines 385-388) — this catches the race!

Actually, looking more carefully, the re-check at lines 385-388 does catch this race condition because it runs synchronously within the same Promise constructor after subscribing. Since the worker_ready event handler at line 506-514 updates readyAgents synchronously, and the re-check at 385-388 reads it synchronously, this is safe.

However, there's still a subtle issue: the check at lines 337-339 and the check at lines 385-388 are identical duplicates. The first check (337-339) is redundant because the second check (385-388) handles the same case. This is not a bug per se, but the duplicate check is misleading — it suggests the author was trying to handle a race condition but the actual protection comes from the second check only.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

khaliqgant and others added 11 commits February 19, 2026 23:07
CLI Refactoring:
- Split 5361-line src/cli/index.ts into 8 modular command files + 14 shared lib files
- Wire bootstrap.ts as new CLI entry point
- Remove legacy patterns: tmux, socket/daemon, agents.json, storage adapter
- Remove create-agent command (replaced by spawn)
- Migrate all spawnPty calls to broker-sdk AgentRelayClient
- Rename Rust broker binary from agent-relay to agent-relay-broker
- 66 tests with full DI pattern across all modules

Broker-SDK Improvements:
- Per-agent cwd on spawnPty
- Agent.waitForReady() promise
- Structured data field in sendMessage
- Reason parameter on release()
- Model as first-class spawn option
- Agent.status computed getter (spawning/ready/idle/exited)
- Agent.onOutput() per-agent streaming hook
- relay.system() convenience handle

SDK Rename:
- Rename packages/broker-sdk to packages/sdk (@agent-relay/sdk)
- Delete empty packages/sdk-ts
- Update all imports, build scripts, CI workflows

Workflow Runner Fixes:
- Add trajectories and idleThresholdSecs to schema.json
- Fix README template table patterns
- Extract shared execute/resume logic into runWorkflowCore
- Remove global process.env.RELAY_API_KEY mutation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	packages/broker-sdk/src/logs.ts
#	packages/broker-sdk/tsconfig.json
#	packages/sdk/src/__tests__/idle-nudge.test.ts
#	packages/sdk/src/__tests__/unit.test.ts
#	packages/sdk/src/examples/workflow-superiority.ts
#	packages/sdk/src/relay.ts
#	packages/sdk/src/workflows/builtin-templates/competitive.yaml
#	packages/sdk/src/workflows/runner.ts
Tests for gc command and tmux discovery were referencing deleted
functionality from the legacy cleanup wave.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docs (22 files):
- Replace create-agent with spawn across all docs
- Rename agent-relay binary refs to agent-relay-broker
- Update imports from 'agent-relay' to '@agent-relay/sdk'
- Remove tmux references and prerequisites
- Update socket/legacy references to file-based protocol

CLI:
- Remove obsolete doctor command (SQLite/JSONL diagnostics no longer needed)
- Clean up doctor from monitoring module, bootstrap, and all tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
const allLines = content.split('\n');
const tailLines = allLines.slice(-lines);
return tailLines.join('\n').trim();
fh = await open(filePath, "r");

Check failure

Code scanning / CodeQL

Insecure temporary file High

Insecure creation of file in
the os temp dir
.

Copilot Autofix

AI about 2 hours ago

Copilot could not generate an autofix suggestion

Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or if the problem persists contact support.

} else {
machineId = `${deps.getHostname()}-${deps.randomHex(8)}`;
fs.mkdirSync(dataDir, { recursive: true });
fs.writeFileSync(machineIdPath, machineId);

Check failure

Code scanning / CodeQL

Potential file system race condition High

The file may have changed since it
was checked
.
const modelSplit = architectCli.split(':');
const cli = modelSplit[0];
const model = modelSplit[1];
const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$');

Check failure

Code scanning / CodeQL

Incomplete string escaping or encoding High

This does not escape backslash characters in the input.

Copilot Autofix

AI about 14 hours ago

In general, the problem arises because the code tries to escape one special character ($) in a string coming from deps.getAgentOutboxTemplate() without considering backslashes or the real context in which the string needs escaping. The correct fix is to (a) avoid ad‑hoc partial escaping and (b) either use a well-tested escaping helper for the actual target context (regex, replacement string, shell, etc.) or, if the string is only used literally, not escape it at all.

Within this snippet, outboxPath is only used inside a prompt string shown to an agent: `Use relay protocol with outbox path ${outboxPath}. ...`. There is no interpretation of $ or \ in such plain text, so escaping $ is unnecessary and misleading. The minimal and safest behavior-preserving fix is to stop doing incomplete escaping and just use the template as-is. That means replacing:

const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$');

with:

const outboxPath = deps.getAgentOutboxTemplate();

This removes the broken custom escaping without changing any visible semantics, because the prompt will now show the actual template string instead of an odd $\$-mangled form. No new imports or helper functions are needed; we just simplify the assignment on line 91.

Suggested changeset 1
src/cli/lib/bridge.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/cli/lib/bridge.ts b/src/cli/lib/bridge.ts
--- a/src/cli/lib/bridge.ts
+++ b/src/cli/lib/bridge.ts
@@ -88,7 +88,7 @@
       const modelSplit = architectCli.split(':');
       const cli = modelSplit[0];
       const model = modelSplit[1];
-      const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$');
+      const outboxPath = deps.getAgentOutboxTemplate();
       const prompt =
         'You are the Architect, a cross-project coordinator.\n\n' +
         `Connected Projects:\n${formatProjectContext(valid)}\n\n` +
EOF
@@ -88,7 +88,7 @@
const modelSplit = architectCli.split(':');
const cli = modelSplit[0];
const model = modelSplit[1];
const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$');
const outboxPath = deps.getAgentOutboxTemplate();
const prompt =
'You are the Architect, a cross-project coordinator.\n\n' +
`Connected Projects:\n${formatProjectContext(valid)}\n\n` +
Copilot is powered by AI and may make mistakes. Always verify output.
deps.log(` - ${credential.provider}`);
}

fs.writeFileSync(credentialsPath, JSON.stringify(credentials, null, 2));

Check warning

Code scanning / CodeQL

Network data written to file Medium

Write to file system depends on
Untrusted data
.
}

if (process.platform === 'win32') {
await execFileAsync('cmd', ['/c', 'start', '', url]);

Check warning

Code scanning / CodeQL

Indirect uncontrolled command line Medium

This command depends on an unsanitized
command-line argument
.

Copilot Autofix

AI about 14 hours ago

General fix: avoid invoking cmd with user-controlled arguments. Use an API that does not require shell metacharacter parsing, or change the invoked program to something that directly accepts a URL argument without indirect shell semantics. For opening URLs on Windows, a common pattern is to call explorer.exe with the URL; this runs without an intermediate cmd layer and still opens the user’s default browser.

Concrete best fix here, without changing external behavior:

  • Keep using execFileAsync (already a safe API).
  • Change the Windows branch of defaultOpenExternal to execute explorer.exe directly instead of cmd /c start "" url.
  • This preserves the “open in default browser” behavior while removing the dependency on cmd’s argument parsing.
  • No changes are needed elsewhere; deps.openExternal’s behavior is the same from the caller’s perspective.

Specific change:

  • In src/cli/commands/cloud.ts, in defaultOpenExternal, replace:

    if (process.platform === 'win32') {
      await execFileAsync('cmd', ['/c', 'start', '', url]);
      return;
    }

    with:

    if (process.platform === 'win32') {
      await execFileAsync('explorer.exe', [url]);
      return;
    }

No new imports or helper functions are needed; we reuse the existing execFileAsync defined with promisify(execFile).


Suggested changeset 1
src/cli/commands/cloud.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/cli/commands/cloud.ts b/src/cli/commands/cloud.ts
--- a/src/cli/commands/cloud.ts
+++ b/src/cli/commands/cloud.ts
@@ -67,7 +67,7 @@
   }
 
   if (process.platform === 'win32') {
-    await execFileAsync('cmd', ['/c', 'start', '', url]);
+    await execFileAsync('explorer.exe', [url]);
     return;
   }
 
EOF
@@ -67,7 +67,7 @@
}

if (process.platform === 'win32') {
await execFileAsync('cmd', ['/c', 'start', '', url]);
await execFileAsync('explorer.exe', [url]);
return;
}

Copilot is powered by AI and may make mistakes. Always verify output.
khaliqgant and others added 7 commits February 20, 2026 12:34
Delete all 20 MCP integration test files (now obsolete after @agent-relay/mcp
removal). Update test runner to SDK-only, simplify package.json test scripts.
Fix e2e permission denied by chmod+x index.js in postbuild. Fix stress test
crash from missing packages/resiliency import.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add auto-restart supervisor, metrics collection, and crash insights
modules to the Rust broker. Expose new types and methods via both
TypeScript and Python SDKs. Remove references to deleted
@agent-relay/storage package from CI workflows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
khaliqgant and others added 14 commits February 23, 2026 10:09
The broker had two redundant modes: init (stdio, full-featured) and
listen (HTTP API, stripped-down). The dashboard needs to proxy /api/spawn
to the broker, but init mode had no HTTP API. This adds --api-port to
init mode so the HTTP API runs alongside stdio, then removes the
redundant listen mode entirely.

Changes:
- Add --api-port flag to InitCommand (default 0 = disabled)
- Start HTTP API server in run_init when api_port > 0
- Handle spawn/release/list API requests in init's main select loop
- Update CLI to pass --api-port 3889 when dashboard is active
- Simplify dashboard relay URL resolution (always http://localhost:3889)
- Remove getDashboardSpawnEnv standalone-mode hack
- Remove ListenCommand, Commands::Listen, and run_listen() (~330 lines)
- Move spawn_env_vars to spawner.rs (still used by wrap mode)
- Update integration tests from listen to init --api-port
- Update ARCHITECTURE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
binaryArgs like --api-port were prepended before the 'init' subcommand,
causing clap to reject the unknown flag at the top level. Move them
after the subcommand and its standard args so clap parses them correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add WebSocket endpoint to broker HTTP API that streams dashboard-relevant
events (agent lifecycle, message delivery, worker status) via a tokio
broadcast channel. This enables the dashboard to receive real-time updates
without polling the Relaycast REST API.

- Add axum ws feature to Cargo.toml
- Add broadcast channel piped through stdout writer task
- Add /ws route with WebSocketUpgrade handler
- Filter events by kind (relay_inbound, agent_spawned, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add axum ws feature and broadcast channel to broker
- Add /ws WebSocket endpoint streaming dashboard-relevant events
- Handle broadcast::Lagged errors gracefully (skip, don't disconnect)
- Add 30s ping keepalive to detect dead WS clients
- Fix lint-staged: models.generated.ts → cli-registry.generated.ts
- Add npm_link to CLIRegistry codegen output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tration retry

- Add `task` field to HTTP API spawn handler so dashboard-spawned agents
  receive their task via initial_task injection
- Implement actual spawn logic for Relaycast WS `agent.spawn_requested`
  events (previously ignored with a warning)
- Add retry_agent_registration helper with 3 attempts for transient
  pre-registration failures (rate limits, transport errors)
- Extract HTTP API handlers into src/listen_api.rs (~340 lines)
- Move RelaycastAgentEvent, parse_relaycast_agent_event, and
  registration helpers into src/relaycast_ws.rs
- Reduce main.rs from 5031 to 4633 lines

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
endpoint: string,
init: RequestInit
): Promise<Response> {
const response = await fetch(`${cloudUrl}${endpoint}`, init);

Check warning

Code scanning / CodeQL

File data in outbound network request Medium

Outbound network request depends on
file data
.
khaliqgant and others added 15 commits February 23, 2026 22:31
- Optional X-API-Key auth on /api/* and /ws, /health public
- 401 uses RelayError envelope shape
- /api/threads endpoint with thread grouping and unread counts
- Agent state publishing to Relaycast WS (spawned/idle/exited/stuck)
- Update message_bridge.rs and main.rs to satisfy contract fixtures
- All contract fixture tests now passing
- 5 fixture JSON files (identity, errors, events, health, replay)
- Red contract tests for SDK + broker conformance
- broker-sdk TypeScript contract test stubs
- ReplayBuffer with VecDeque ring buffer, capacity 1000
- AtomicU64 monotonic seq counter
- replay_since() for WS reconnect catch-up
# Conflicts:
#	src/listen_api.rs
#	src/main.rs
# Conflicts:
#	.trajectories/index.json
#	packages/sdk/src/__tests__/contract-fixtures.test.ts
#	src/main.rs
#	src/message_bridge.rs
# Conflicts:
#	packages/sdk/src/__tests__/contract-fixtures.test.ts
#	src/main.rs
- Resolve 8 merge conflict blocks in main.rs test module
- Fix listen_api_health to return rich /health shape (agentCount, etc.)
- Add /api/events/replay route for replay buffer endpoint
- Add terminal guard for delivery_ack timeout handling
- Support degraded health status on startup 429

190 lib tests + 179 bin tests all passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant