feat(resilience): Agent Immortality Protocol - agents that never die (#568) #590
nikolasdehor wants to merge 1 commit into SynkraAI:main
…g for agents (SynkraAI#568) Implements the complete agent immortality protocol with heartbeat, snapshots, crash detection, auto-revival, behavioral fingerprint, cascade protection, and a composite health score. 126 unit tests.
@nikolasdehor is attempting to deploy a commit to the Pedro Valério Lopez's projects Team on Vercel. A member of the Team first needs to authorize it.
Walkthrough
This PR introduces the Agent Immortality Protocol, a resilience mechanism enabling agent self-healing and state recovery through heartbeat monitoring, periodic state snapshots, crash detection, and auto-revival capabilities. A compatibility wrapper is provided for backward compatibility, alongside a comprehensive test suite covering all functionality.
Sequence Diagram
sequenceDiagram
participant Agent
participant Protocol as AgentImmortalityProtocol
participant Monitor as Heartbeat Monitor
participant Snapshots as Snapshot Persistence
participant Disk as Disk Storage
Agent->>Protocol: registerAgent(agentId)
Protocol->>Protocol: Initialize agent state
Protocol->>Monitor: startMonitoring(agentId)
Monitor->>Monitor: Start interval check
loop Periodic Heartbeats
Agent->>Protocol: heartbeat(agentId, stateData)
Protocol->>Protocol: Update fingerprint & health
Protocol->>Snapshots: createSnapshot(agentId)
Snapshots->>Disk: Persist snapshot
end
loop Monitor Check Interval
Monitor->>Protocol: Check last heartbeat
alt Heartbeat missed beyond grace
Protocol->>Protocol: Mark agent as DEAD
Protocol->>Protocol: Emit death-detected event
Protocol->>Protocol: Queue auto-revival
Protocol->>Snapshots: getLatestSnapshot(agentId)
Snapshots->>Disk: Load snapshot
Protocol->>Agent: reviveAgent(agentId)
Agent->>Agent: Restore from snapshot
Agent->>Protocol: heartbeat (resumed)
else Recent heartbeat
Protocol->>Protocol: Update health score
end
end
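The monitor loop in the diagram can be sketched as a pure classification step. This is a minimal illustration only: the function name classifyAgent, the intervalMs and graceMs options, and the exact thresholds are assumptions made for the example, not the PR's actual API. The status names mirror the ALIVE, SUSPECT, and DEAD values referenced elsewhere in this review.

```javascript
// Hypothetical sketch: names and thresholds are illustrative, not the PR's API.
const Status = Object.freeze({ ALIVE: 'alive', SUSPECT: 'suspect', DEAD: 'dead' });

// Classify an agent from the time elapsed since its last heartbeat.
// - within one interval: ALIVE
// - past the interval but inside the grace window: SUSPECT
// - past interval + grace: DEAD (the point where revival would be queued)
function classifyAgent(lastHeartbeat, now, { intervalMs, graceMs }) {
  const silence = now - lastHeartbeat;
  if (silence <= intervalMs) return Status.ALIVE;
  if (silence <= intervalMs + graceMs) return Status.SUSPECT;
  return Status.DEAD;
}
```

Keeping the decision free of timers makes it easy to unit test without fake clocks; the interval callback would only need to call it and act on the result.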
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 5
🧹 Nitpick comments (1)
.aios-core/core/resilience/agent-immortality.js (1)
1-2: Prefer the project's absolute import form for this shim.
This wrapper hardcodes a repo-relative hop into .aiox-core, so the compatibility layer depends on the current directory layout. Re-export from the package's absolute internal path instead of ../../../....
As per coding guidelines, "Use absolute imports instead of relative imports in all code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aios-core/core/resilience/agent-immortality.js around lines 1 - 2, The shim currently uses a relative require in module.exports (require('../../../.aiox-core/core/resilience/agent-immortality')) which couples it to repo layout; change the export to re-export the package's absolute internal path instead (use the package's absolute import for the internal module) so module.exports = require('<package-name>/core/resilience/agent-immortality') (replace <package-name> with the actual package identifier) to follow the project's absolute-import guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 531d113d-b04f-4ff2-915a-4e0f3c954dcd
📒 Files selected for processing (4)
.aios-core/core/resilience/agent-immortality.js
.aiox-core/core/resilience/agent-immortality.js
.aiox-core/install-manifest.yaml
tests/core/resilience/agent-immortality.test.js
declareDependency(agentId, dependsOnId) {
  this._assertAgentExists(agentId);
  this._assertAgentExists(dependsOnId);

  if (agentId === dependsOnId) {
    throw new Error('An agent cannot depend on itself');
  }

  const deps = this._dependencies.get(agentId) ?? [];
  if (!deps.includes(dependsOnId)) {
    deps.push(dependsOnId);
    this._dependencies.set(agentId, deps);
Reject transitive dependency cycles here.
declareDependency('A', 'B') currently allows the reverse edge when B already depends on A. In that state, _findDependents('A') returns ['B', 'A'], so the target becomes its own dependent and getCascadeRisk() overstates cascade risk.
Suggested guard
   if (agentId === dependsOnId) {
     throw new Error('An agent cannot depend on itself');
   }
+  if (this._findDependents(agentId).includes(dependsOnId)) {
+    throw new Error(
+      `Declaring "${agentId}" -> "${dependsOnId}" would create a dependency cycle`
+    );
+  }
   const deps = this._dependencies.get(agentId) ?? [];
As per coding guidelines, "Check for proper input validation on public API methods".
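A self-contained sketch of the guard above, for illustration. The DependencyGraph class and its findDependents traversal are assumptions written for this example; the PR's _findDependents may be implemented differently. The only behavior relied on is the one described in the comment: the traversal returns every agent that depends on the target, directly or transitively.

```javascript
// Hypothetical stand-in for the protocol's dependency bookkeeping.
class DependencyGraph {
  constructor() {
    // agentId -> array of agent ids it depends on
    this._dependencies = new Map();
  }

  // Every agent that depends on targetId, directly or transitively.
  // The `found` set also guarantees termination if a cycle ever exists.
  findDependents(targetId) {
    const found = new Set();
    const queue = [targetId];
    while (queue.length > 0) {
      const current = queue.shift();
      for (const [id, deps] of this._dependencies) {
        if (deps.includes(current) && !found.has(id)) {
          found.add(id);
          queue.push(id);
        }
      }
    }
    return [...found];
  }

  declareDependency(agentId, dependsOnId) {
    if (agentId === dependsOnId) {
      throw new Error('An agent cannot depend on itself');
    }
    // If dependsOnId already depends on agentId, the new edge closes a cycle.
    if (this.findDependents(agentId).includes(dependsOnId)) {
      throw new Error(
        `Declaring "${agentId}" -> "${dependsOnId}" would create a dependency cycle`
      );
    }
    const deps = this._dependencies.get(agentId) ?? [];
    if (!deps.includes(dependsOnId)) {
      deps.push(dependsOnId);
      this._dependencies.set(agentId, deps);
    }
  }
}
```

With B depending on A and C depending on B, declaring that A depends on C is rejected, because C is already a transitive dependent of A.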
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aiox-core/core/resilience/agent-immortality.js around lines 599 - 610,
declareDependency currently allows creating transitive cycles (e.g., adding A->B
when B already depends on A); before pushing dependsOnId into this._dependencies
for agentId, call the internal traversal (e.g., _findDependents(agentId)) to
see if it already includes dependsOnId and if so throw an Error like "Declaring
this dependency would create a cycle"; update declareDependency to perform this
check and refuse the mutation so getCascadeRisk and dependents remain correct.
async saveState() {
  this._saveQueue = this._saveQueue.then(async () => {
    const filePath = path.resolve(this.projectRoot, this.config.stateFile);
    const dir = path.dirname(filePath);

    if (!fs.existsSync(dir)) {
      fs.mkdirSync(dir, { recursive: true });
    }

    const data = {
      schemaVersion: this.config.schemaVersion,
      savedAt: new Date().toISOString(),
      agents: {},
      dependencies: {},
    };

    for (const [id, agent] of this.agents.entries()) {
      data.agents[id] = {
        id: agent.id,
        status: agent.status,
        registeredAt: agent.registeredAt,
        lastHeartbeat: agent.lastHeartbeat,
        lastSnapshot: agent.lastSnapshot,
        errorCount: agent.errorCount,
        snapshotCount: agent.snapshots.length,
        revivalCount: agent.revivalHistory.length,
        healthScore: this._calculateHealthScore(agent),
      };
    }

    for (const [id, deps] of this._dependencies.entries()) {
      data.dependencies[id] = [...deps];
    }

    fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf-8');
  });

  await this._saveQueue;
A single saveState() failure bricks later persistence.
saveState() assigns the raw chained promise back to _saveQueue and awaits it, but never resets the queue on rejection. After one filesystem error, _saveQueue stays rejected forever; every later saveState() and _persistSnapshot() call chains off that rejected promise and is skipped.
One way to keep the queue usable after a failed write
 async saveState() {
-  this._saveQueue = this._saveQueue.then(async () => {
+  const op = this._saveQueue.catch(() => {}).then(async () => {
     const filePath = path.resolve(this.projectRoot, this.config.stateFile);
     const dir = path.dirname(filePath);
     ...
     fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf-8');
   });
-
-  await this._saveQueue;
+  this._saveQueue = op.catch(() => {});
+  await op;
 }
As per coding guidelines, "Verify error handling is comprehensive with proper try/catch and error context".
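The same idea can be isolated into a small serial queue, shown here as a sketch. SaveQueue and enqueue are hypothetical names for this example and do not exist in the PR. The queue tail always stores a promise whose rejection has been swallowed, while the caller still receives the original promise and therefore still sees the error.

```javascript
// Hypothetical helper: a serial task queue that survives failed tasks.
class SaveQueue {
  constructor() {
    this._queue = Promise.resolve();
  }

  // Run `task` after all previously enqueued tasks have settled.
  enqueue(task) {
    const op = this._queue.then(task);
    // Store a never-rejecting tail so later tasks are not skipped.
    this._queue = op.catch(() => {});
    // Return the raw promise so the caller can observe the failure.
    return op;
  }
}
```

A failed write then rejects only its own enqueue() call; the next call still runs in order.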
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aiox-core/core/resilience/agent-immortality.js around lines 663 - 700,
saveState currently assigns a rejected promise to this._saveQueue which leaves
the queue permanently rejected; fix saveState by wrapping the chained async
callback (the function passed to this._saveQueue.then) in a try/catch: perform
the directory creation, data prepare and fs.writeFileSync inside try, and in
catch reset the queue to a resolved promise (e.g. this._saveQueue =
Promise.resolve()) so future calls to saveState/_persistSnapshot can continue,
then rethrow or log the error to preserve error visibility; reference the
saveState method and the this._saveQueue field when making this change.
const data = {
  schemaVersion: this.config.schemaVersion,
  savedAt: new Date().toISOString(),
  agents: {},
  dependencies: {},
};

for (const [id, agent] of this.agents.entries()) {
  data.agents[id] = {
    id: agent.id,
    status: agent.status,
    registeredAt: agent.registeredAt,
    lastHeartbeat: agent.lastHeartbeat,
    lastSnapshot: agent.lastSnapshot,
    errorCount: agent.errorCount,
    snapshotCount: agent.snapshots.length,
    revivalCount: agent.revivalHistory.length,
    healthScore: this._calculateHealthScore(agent),
  };
}

for (const [id, deps] of this._dependencies.entries()) {
  data.dependencies[id] = [...deps];
}
loadState() doesn't restore anything the protocol needs for recovery.
The serialized payload only keeps summary metadata, and loadState() just returns that JSON without rebuilding this.agents, this._dependencies, snapshots, revival history, fingerprint baselines, or per-agent config. After a process restart, reviveAgent() still has no in-memory snapshot/history to work with, so the advertised disk-backed recovery path never actually comes back online.
Also applies to: 707-719
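One possible shape for a round trip that does restore the in-memory structures is sketched below. It assumes the saved payload carries the full snapshots and revivalHistory arrays rather than only their counts, which the current saveState() does not do; serializeState and rehydrateState are hypothetical helper names for this example, not functions in the PR.

```javascript
// Hypothetical helpers; assumes full per-agent arrays are persisted.
function serializeState(agents, dependencies) {
  return JSON.stringify({
    agents: Object.fromEntries(agents),
    dependencies: Object.fromEntries(dependencies),
  });
}

// Rebuild Map-based state from the JSON payload, defaulting any
// missing arrays so later revive/snapshot lookups do not crash.
function rehydrateState(json) {
  const data = JSON.parse(json);
  const agents = new Map(
    Object.entries(data.agents ?? {}).map(([id, agent]) => [
      id,
      {
        ...agent,
        snapshots: agent.snapshots ?? [],
        revivalHistory: agent.revivalHistory ?? [],
      },
    ])
  );
  const dependencies = new Map(
    Object.entries(data.dependencies ?? {}).map(([id, deps]) => [id, [...deps]])
  );
  return { agents, dependencies };
}
```

In the protocol itself, loadState() would assign the two Maps back to this.agents and this._dependencies, after which reviveAgent() can find the latest snapshot in memory again.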
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aiox-core/core/resilience/agent-immortality.js around lines 672 - 696, The
saved JSON only contains summary metadata and loadState() must rebuild in-memory
structures so recovery works; modify loadState to, when reading the saved
payload, rehydrate this.agents (creating Agent objects or restoring their full
state fields including snapshots array, revivalHistory array, lastSnapshot,
registeredAt, lastHeartbeat, errorCount, revivalCount, healthScore, and
per-agent config/fingerprint baselines), repopulate this._dependencies from
data.dependencies, and restore any snapshot storage indexes/refs so
reviveAgent(agentId) can find snapshots and history; update any helper methods
(e.g., _calculateHealthScore) to accept restored agents and ensure reviveAgent,
snapshot handling, and fingerprint baseline lookup use the rehydrated agent
instances instead of the summary JSON.
- path: core/resilience/agent-immortality.js
  hash: sha256:89ae4bac066088e76071cfc9b391418e9eba804bcc2b2f943edb1ce38974735c
  type: core
  size: 37573
The compatibility shim is missing from the install manifest.
This manifest adds core/resilience/agent-immortality.js, but the new retrocompat file at .aios-core/core/resilience/agent-immortality.js is not listed anywhere in the generated manifest. Brownfield upgrades driven by this file will install the canonical module without the shim, so existing imports still break.
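The fields a manifest entry needs can be derived from the file contents, as in the rough sketch below. manifestEntry is a hypothetical helper written for this example; the project most likely regenerates the manifest with its own tooling, which should be preferred. The field names and the sha256: prefix follow the entry shown above.

```javascript
const crypto = require('crypto');

// Hypothetical helper: build a manifest entry from a path and file contents.
function manifestEntry(relativePath, contents, type = 'core') {
  const buffer = Buffer.from(contents);
  const hash = crypto.createHash('sha256').update(buffer).digest('hex');
  return {
    path: relativePath,
    hash: `sha256:${hash}`,
    type,
    size: buffer.length, // size in bytes, not characters
  };
}
```

For a real file, the contents would come from fs.readFileSync(filePath), which already returns a Buffer.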
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aiox-core/install-manifest.yaml around lines 1039 - 1042, The install
manifest is missing the compatibility shim entry for the retrocompat file; add a
manifest entry for ".aios-core/core/resilience/agent-immortality.js" (matching
the actual shim file you added) alongside the existing
"core/resilience/agent-immortality.js" entry, supplying the correct sha256 hash,
size and type so brownfield upgrades will install the shim path and preserve
existing imports.
it('should emit cascade-risk for high/critical risks', () => {
  protocol.registerAgent('agent-1');
  protocol.registerAgent('agent-2');
  protocol.declareDependency('agent-2', 'agent-1');

  const agent = protocol.agents.get('agent-1');
  agent.status = AgentStatus.DEAD;

  const handler = jest.fn();
  protocol.on(Events.CASCADE_RISK, handler);
  protocol.getCascadeRisk('agent-1');
  // 1 dependent + dead = high (not critical)
  // high emits cascade-risk
});
Several of these tests are currently no-ops.
The cascade-risk case never asserts that the handler fired, and the crash-detection/revival cases only advance timers or read locals without checking status or emitted events. Those tests will stay green even if the core failure/recovery path regresses.
Example assertions to make these cases observable
   protocol.on(Events.CASCADE_RISK, handler);
   protocol.getCascadeRisk('agent-1');
-  // 1 dependent + dead = high (not critical)
-  // high emits cascade-risk
+  expect(handler).toHaveBeenCalledTimes(1);
+  expect(handler.mock.calls[0][0]).toMatchObject({
+    agentId: 'agent-1',
+    riskLevel: 'high',
+  });
@@
   jest.advanceTimersByTime(2000);
-
-  const agent = protocol.agents.get('agent-1');
-  // The agent may be in the SUSPECT state
+  expect(protocol.agents.get('agent-1').status).toBe(AgentStatus.SUSPECT);
@@
   await Promise.resolve();
   await Promise.resolve();
+  expect(revivalHandler).toHaveBeenCalledTimes(1);
+  expect(protocol.agents.get('agent-1').status).toBe(AgentStatus.ALIVE);
As per coding guidelines, "Verify test coverage exists for new/modified functions".
Also applies to: 1020-1059
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/core/resilience/agent-immortality.test.js` around lines 974 - 987, The
test is a no-op because it never asserts that the cascade event or status
changes actually occurred; update the spec around protocol.registerAgent,
protocol.declareDependency, protocol.getCascadeRisk, and the Events.CASCADE_RISK
listener to assert observable behavior: attach the handler via
protocol.on(Events.CASCADE_RISK, handler) and
expect(handler).toHaveBeenCalledTimes(1) (and/or toHaveBeenCalledWith(...)
validating payload), and for crash-detection/revival cases advance timers and
then assert agent.status (AgentStatus.DEAD or AgentStatus.ALIVE) and any emitted
Events (e.g., revive/crash events) were fired so the test fails if core
failure/recovery paths regress.