
feat(resilience): Agent Immortality Protocol: agents that never die (#568) #590

Open
nikolasdehor wants to merge 1 commit into SynkraAI:main from nikolasdehor:feat/agent-immortality-protocol

Conversation

nikolasdehor (Contributor) commented Mar 12, 2026

Summary

  • Heartbeat monitoring with failure detection
  • State snapshots for recovery
  • Crash detection with automatic revival
  • Behavioral fingerprint and health score
  • Cascade protection to prevent chained failures

Tests

  • 126 unit tests passing

Reopens PR #576 (closed accidentally). Resolves issue #568.

Summary by CodeRabbit

  • New Features
    • Added automatic agent recovery system with crash detection and periodic state snapshots enabling self-healing capabilities
    • Introduced comprehensive health monitoring, behavioral anomaly detection, and cascade failure protection

…g for agents (SynkraAI#568)

Implements the complete agent immortality protocol with heartbeat,
snapshots, crash detection, auto-revival, behavioral fingerprinting,
cascade protection, and a composite health score. 126 unit tests.

vercel bot commented Mar 12, 2026

@nikolasdehor is attempting to deploy a commit to the Pedro Valério Lopez's projects Team on Vercel.

A member of the Team first needs to authorize it.

github-actions bot added the following labels on Mar 12, 2026:

  • area: agents (Agent system related)
  • area: workflows (Workflow system related)
  • squad
  • mcp
  • type: test (Test coverage and quality)
  • area: core (Core framework (.aios-core/core/))
  • area: installer (Installer and setup (packages/installer/))
  • area: synapse (SYNAPSE context engine)
  • area: cli (CLI tools (bin/, packages/aios-pro-cli/))
  • area: pro (Pro features (pro/))
  • area: health-check (Health check system)
  • area: docs (Documentation (docs/))
  • area: devops (CI/CD, GitHub Actions (.github/))

coderabbitai bot commented Mar 12, 2026

Walkthrough

This PR introduces the Agent Immortality Protocol, a resilience mechanism enabling agent self-healing and state recovery through heartbeat monitoring, periodic state snapshots, crash detection, and auto-revival capabilities. A compatibility wrapper is provided for backward compatibility, alongside a comprehensive test suite covering all functionality.

Changes

Cohort: Agent Immortality Protocol Implementation
File(s): .aiox-core/core/resilience/agent-immortality.js
Summary: Core module introducing the AgentImmortalityProtocol class with agent lifecycle management, heartbeat monitoring, snapshot creation and persistence, crash detection, auto-revival, behavioral fingerprinting with anomaly detection, health scoring, cascade protection via dependency tracking, and event emission for key milestones.

Cohort: Compatibility Wrapper
File(s): .aios-core/core/resilience/agent-immortality.js
Summary: Retro-compatibility module that re-exports the canonical implementation from .aiox-core/core/resilience/agent-immortality, enabling existing import paths to resolve correctly without code duplication.

Cohort: Manifest Update
File(s): .aiox-core/install-manifest.yaml
Summary: Timestamp refresh, registration of new core/resilience/agent-immortality.js file, removal of development/tasks/review-prs.md entry, and size value updates across manifest entries.

Cohort: Test Suite
File(s): tests/core/resilience/agent-immortality.test.js
Summary: Comprehensive unit test coverage exercising exports, constructor behavior, agent lifecycle, monitoring, heartbeats, snapshots, revival mechanisms, health scoring, fingerprinting, anomaly detection, cascade protection, persistence, and error handling.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant Protocol as AgentImmortalityProtocol
    participant Monitor as Heartbeat Monitor
    participant Snapshots as Snapshot Persistence
    participant Disk as Disk Storage

    Agent->>Protocol: registerAgent(agentId)
    Protocol->>Protocol: Initialize agent state
    
    Protocol->>Monitor: startMonitoring(agentId)
    Monitor->>Monitor: Start interval check
    
    loop Periodic Heartbeats
        Agent->>Protocol: heartbeat(agentId, stateData)
        Protocol->>Protocol: Update fingerprint & health
        Protocol->>Snapshots: createSnapshot(agentId)
        Snapshots->>Disk: Persist snapshot
    end
    
    loop Monitor Check Interval
        Monitor->>Protocol: Check last heartbeat
        alt Heartbeat missed beyond grace
            Protocol->>Protocol: Mark agent as DEAD
            Protocol->>Protocol: Emit death-detected event
            Protocol->>Protocol: Queue auto-revival
            Protocol->>Snapshots: getLatestSnapshot(agentId)
            Snapshots->>Disk: Load snapshot
            Protocol->>Agent: reviveAgent(agentId)
            Agent->>Agent: Restore from snapshot
            Agent->>Protocol: heartbeat (resumed)
        else Recent heartbeat
            Protocol->>Protocol: Update health score
        end
    end
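
The detection half of the flow above can be sketched as a small standalone monitor. This is only an illustration under stated assumptions: `HeartbeatMonitor`, `graceMs`, `now`, and `check` are hypothetical names chosen for the sketch, not the API of AgentImmortalityProtocol, and revival itself is left to whoever listens for the event.

```javascript
// Minimal sketch of heartbeat-based death detection (hypothetical names,
// not the API of AgentImmortalityProtocol).
const { EventEmitter } = require('events');

class HeartbeatMonitor extends EventEmitter {
  constructor({ graceMs = 5000, now = () => Date.now() } = {}) {
    super();
    this.graceMs = graceMs;
    this.now = now; // injectable clock so the check is easy to test
    this.agents = new Map(); // agentId -> { status, lastHeartbeat, snapshot }
  }

  registerAgent(agentId) {
    this.agents.set(agentId, {
      status: 'alive',
      lastHeartbeat: this.now(),
      snapshot: null,
    });
  }

  heartbeat(agentId, stateData) {
    const agent = this.agents.get(agentId);
    if (!agent) throw new Error(`Unknown agent: ${agentId}`);
    agent.lastHeartbeat = this.now();
    agent.status = 'alive';
    // Keep a deep copy so later mutations by the caller do not leak in.
    agent.snapshot = JSON.parse(JSON.stringify(stateData ?? null));
  }

  // One pass of the "Monitor Check Interval" loop from the diagram.
  check() {
    for (const [agentId, agent] of this.agents) {
      const silentFor = this.now() - agent.lastHeartbeat;
      if (agent.status === 'alive' && silentFor > this.graceMs) {
        agent.status = 'dead';
        this.emit('death-detected', { agentId, snapshot: agent.snapshot });
      }
    }
  }
}
```

In a real process `check()` would typically be driven by `setInterval`; injecting the clock keeps the sketch deterministic under test.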

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title clearly describes the main change: introduction of an Agent Immortality Protocol for resilience. It accurately reflects the primary addition of a comprehensive agent lifecycle management system with heartbeat monitoring, state recovery, and health scoring.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (1)
.aios-core/core/resilience/agent-immortality.js (1)

1-2: Prefer the project's absolute import form for this shim.

This wrapper hardcodes a repo-relative hop into .aiox-core, so the compatibility layer depends on the current directory layout. Re-export from the package's absolute internal path instead of ../../../....

As per coding guidelines, "Use absolute imports instead of relative imports in all code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.aios-core/core/resilience/agent-immortality.js around lines 1 - 2, The shim
currently uses a relative require in module.exports
(require('../../../.aiox-core/core/resilience/agent-immortality')) which couples
it to repo layout; change the export to re-export the package's absolute
internal path instead (use the package's absolute import for the internal
module) so module.exports =
require('<package-name>/core/resilience/agent-immortality') (replace
<package-name> with the actual package identifier) to follow the project's
absolute-import guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.aiox-core/core/resilience/agent-immortality.js:
- Around line 672-696: The saved JSON only contains summary metadata and
loadState() must rebuild in-memory structures so recovery works; modify
loadState to, when reading the saved payload, rehydrate this.agents (creating
Agent objects or restoring their full state fields including snapshots array,
revivalHistory array, lastSnapshot, registeredAt, lastHeartbeat, errorCount,
revivalCount, healthScore, and per-agent config/fingerprint baselines),
repopulate this._dependencies from data.dependencies, and restore any snapshot
storage indexes/refs so reviveAgent(agentId) can find snapshots and history;
update any helper methods (e.g., _calculateHealthScore) to accept restored
agents and ensure reviveAgent, snapshot handling, and fingerprint baseline
lookup use the rehydrated agent instances instead of the summary JSON.
- Around line 599-610: declareDependency currently allows creating transitive
cycles (e.g., adding A->B when B already depends on A); before pushing
dependsOnId into this._dependencies for agentId, call the internal traversal
(e.g., _findDependents(agentId)) to see if it already includes dependsOnId and
if so throw an Error like "Declaring this dependency would create a cycle";
update declareDependency to perform this check and refuse the mutation so
getCascadeRisk and dependents remain correct.
- Around line 663-700: saveState currently assigns a rejected promise to
this._saveQueue which leaves the queue permanently rejected; fix saveState by
wrapping the chained async callback (the function passed to
this._saveQueue.then) in a try/catch: perform the directory creation, data
prepare and fs.writeFileSync inside try, and in catch reset the queue to a
resolved promise (e.g. this._saveQueue = Promise.resolve()) so future calls to
saveState/_persistSnapshot can continue, then rethrow or log the error to
preserve error visibility; reference the saveState method and the
this._saveQueue field when making this change.

In @.aiox-core/install-manifest.yaml:
- Around line 1039-1042: The install manifest is missing the compatibility shim
entry for the retrocompat file; add a manifest entry for
".aios-core/core/resilience/agent-immortality.js" (matching the actual shim file
you added) alongside the existing "core/resilience/agent-immortality.js" entry,
supplying the correct sha256 hash, size and type so brownfield upgrades will
install the shim path and preserve existing imports.

In `@tests/core/resilience/agent-immortality.test.js`:
- Around line 974-987: The test is a no-op because it never asserts that the
cascade event or status changes actually occurred; update the spec around
protocol.registerAgent, protocol.declareDependency, protocol.getCascadeRisk, and
the Events.CASCADE_RISK listener to assert observable behavior: attach the
handler via protocol.on(Events.CASCADE_RISK, handler) and
expect(handler).toHaveBeenCalledTimes(1) (and/or toHaveBeenCalledWith(...)
validating payload), and for crash-detection/revival cases advance timers and
then assert agent.status (AgentStatus.DEAD or AgentStatus.ALIVE) and any emitted
Events (e.g., revive/crash events) were fired so the test fails if core
failure/recovery paths regress.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 531d113d-b04f-4ff2-915a-4e0f3c954dcd

📥 Commits

Reviewing files that changed from the base of the PR and between f74e3e7 and f943268.

📒 Files selected for processing (4)
  • .aios-core/core/resilience/agent-immortality.js
  • .aiox-core/core/resilience/agent-immortality.js
  • .aiox-core/install-manifest.yaml
  • tests/core/resilience/agent-immortality.test.js

Comment on lines +599 to +610
  declareDependency(agentId, dependsOnId) {
    this._assertAgentExists(agentId);
    this._assertAgentExists(dependsOnId);

    if (agentId === dependsOnId) {
      throw new Error('An agent cannot depend on itself');
    }

    const deps = this._dependencies.get(agentId) ?? [];
    if (!deps.includes(dependsOnId)) {
      deps.push(dependsOnId);
      this._dependencies.set(agentId, deps);

⚠️ Potential issue | 🟠 Major

Reject transitive dependency cycles here.

declareDependency('A', 'B') currently allows the reverse edge when B already depends on A. In that state, _findDependents('A') returns ['B', 'A'], so the target becomes its own dependent and getCascadeRisk() overstates cascade risk.

Suggested guard
   if (agentId === dependsOnId) {
     throw new Error('An agent cannot depend on itself');
   }
+  if (this._findDependents(agentId).includes(dependsOnId)) {
+    throw new Error(
+      `Declaring "${agentId}" -> "${dependsOnId}" would create a dependency cycle`
+    );
+  }
 
   const deps = this._dependencies.get(agentId) ?? [];

As per coding guidelines, "Check for proper input validation on public API methods".
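
The guard can be tried in isolation with a toy graph. The sketch below assumes that `_findDependents(id)` returns every agent that depends on `id`, directly or transitively, which is what the example in the comment implies; the `DependencyGraph` class itself is hypothetical and omits the existence checks of the real method.

```javascript
// Minimal sketch of cycle rejection in a dependency graph (hypothetical
// DependencyGraph class; only the method names mirror the review comment).
class DependencyGraph {
  constructor() {
    this._dependencies = new Map(); // agentId -> array of ids it depends on
  }

  // Every agent that depends on `targetId`, directly or transitively.
  _findDependents(targetId) {
    const found = new Set();
    const queue = [targetId];
    while (queue.length > 0) {
      const current = queue.shift();
      for (const [id, deps] of this._dependencies) {
        if (deps.includes(current) && !found.has(id)) {
          found.add(id);
          queue.push(id);
        }
      }
    }
    return [...found];
  }

  declareDependency(agentId, dependsOnId) {
    if (agentId === dependsOnId) {
      throw new Error('An agent cannot depend on itself');
    }
    // If dependsOnId already depends on agentId, the new edge closes a loop.
    if (this._findDependents(agentId).includes(dependsOnId)) {
      throw new Error(
        `Declaring "${agentId}" -> "${dependsOnId}" would create a dependency cycle`
      );
    }
    const deps = this._dependencies.get(agentId) ?? [];
    if (!deps.includes(dependsOnId)) {
      deps.push(dependsOnId);
      this._dependencies.set(agentId, deps);
    }
  }
}
```

With B depending on A and C depending on B, declaring A -> C is rejected because C is already a transitive dependent of A, while declaring C -> A is still allowed.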

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.aiox-core/core/resilience/agent-immortality.js around lines 599 - 610,
declareDependency currently allows creating transitive cycles (e.g., adding A->B
when B already depends on A); before pushing dependsOnId into this._dependencies
for agentId, call the internal traversal (e.g., _findDependents(agentId)) to
see if it already includes dependsOnId and if so throw an Error like "Declaring
this dependency would create a cycle"; update declareDependency to perform this
check and refuse the mutation so getCascadeRisk and dependents remain correct.

Comment on lines +663 to +700
  async saveState() {
    this._saveQueue = this._saveQueue.then(async () => {
      const filePath = path.resolve(this.projectRoot, this.config.stateFile);
      const dir = path.dirname(filePath);

      if (!fs.existsSync(dir)) {
        fs.mkdirSync(dir, { recursive: true });
      }

      const data = {
        schemaVersion: this.config.schemaVersion,
        savedAt: new Date().toISOString(),
        agents: {},
        dependencies: {},
      };

      for (const [id, agent] of this.agents.entries()) {
        data.agents[id] = {
          id: agent.id,
          status: agent.status,
          registeredAt: agent.registeredAt,
          lastHeartbeat: agent.lastHeartbeat,
          lastSnapshot: agent.lastSnapshot,
          errorCount: agent.errorCount,
          snapshotCount: agent.snapshots.length,
          revivalCount: agent.revivalHistory.length,
          healthScore: this._calculateHealthScore(agent),
        };
      }

      for (const [id, deps] of this._dependencies.entries()) {
        data.dependencies[id] = [...deps];
      }

      fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf-8');
    });

    await this._saveQueue;

⚠️ Potential issue | 🟠 Major

A single saveState() failure bricks later persistence.

saveState() assigns the raw chained promise back to _saveQueue and awaits it, but never resets the queue on rejection. After one filesystem error, _saveQueue stays rejected forever; every later saveState() and _persistSnapshot() call chains off that rejected promise and is skipped.

One way to keep the queue usable after a failed write
 async saveState() {
-  this._saveQueue = this._saveQueue.then(async () => {
+  const op = this._saveQueue.catch(() => {}).then(async () => {
     const filePath = path.resolve(this.projectRoot, this.config.stateFile);
     const dir = path.dirname(filePath);
     ...
     fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf-8');
   });
-
-  await this._saveQueue;
+  this._saveQueue = op.catch(() => {});
+  await op;
 }

As per coding guidelines, "Verify error handling is comprehensive with proper try/catch and error context".
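
The same pattern, stripped of file I/O, can be sketched as a tiny queue. `SaveQueue` and `enqueue` are hypothetical names for this illustration; the idea is only that the stored tail of the chain never rejects, while each caller still receives the real outcome of its own task.

```javascript
// Minimal sketch of a serialized write queue that stays usable after a
// failure (hypothetical SaveQueue class, not the PR's implementation).
class SaveQueue {
  constructor() {
    this._queue = Promise.resolve();
  }

  // Runs `task` after all previously queued tasks have settled.
  enqueue(task) {
    // Swallow the previous outcome so one failure cannot skip later tasks.
    const op = this._queue.catch(() => {}).then(() => task());
    // The stored tail never rejects; the caller still sees the real error.
    this._queue = op.catch(() => {});
    return op;
  }
}
```

A failed task rejects only the promise returned to its own caller, so a later `enqueue` still runs in order.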

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.aiox-core/core/resilience/agent-immortality.js around lines 663 - 700,
saveState currently assigns a rejected promise to this._saveQueue which leaves
the queue permanently rejected; fix saveState by wrapping the chained async
callback (the function passed to this._saveQueue.then) in a try/catch: perform
the directory creation, data prepare and fs.writeFileSync inside try, and in
catch reset the queue to a resolved promise (e.g. this._saveQueue =
Promise.resolve()) so future calls to saveState/_persistSnapshot can continue,
then rethrow or log the error to preserve error visibility; reference the
saveState method and the this._saveQueue field when making this change.

Comment on lines +672 to +696
      const data = {
        schemaVersion: this.config.schemaVersion,
        savedAt: new Date().toISOString(),
        agents: {},
        dependencies: {},
      };

      for (const [id, agent] of this.agents.entries()) {
        data.agents[id] = {
          id: agent.id,
          status: agent.status,
          registeredAt: agent.registeredAt,
          lastHeartbeat: agent.lastHeartbeat,
          lastSnapshot: agent.lastSnapshot,
          errorCount: agent.errorCount,
          snapshotCount: agent.snapshots.length,
          revivalCount: agent.revivalHistory.length,
          healthScore: this._calculateHealthScore(agent),
        };
      }

      for (const [id, deps] of this._dependencies.entries()) {
        data.dependencies[id] = [...deps];
      }


⚠️ Potential issue | 🟠 Major

loadState() doesn't restore anything the protocol needs for recovery.

The serialized payload only keeps summary metadata, and loadState() just returns that JSON without rebuilding this.agents, this._dependencies, snapshots, revival history, fingerprint baselines, or per-agent config. After a process restart, reviveAgent() still has no in-memory snapshot/history to work with, so the advertised disk-backed recovery path never actually comes back online.

Also applies to: 707-719
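
A round trip that actually restores in-memory state might look like the sketch below. The helper names `serializeState` and `rehydrateState` are hypothetical, and the field names are taken from the payload quoted above; how the real module stores fingerprint baselines or per-agent config is not shown in this review, so those are left out.

```javascript
// Minimal sketch of a save/load round trip that restores in-memory state
// (hypothetical helpers; field names follow the payload quoted above).
function serializeState(agents, dependencies) {
  const data = { agents: {}, dependencies: {} };
  for (const [id, agent] of agents) {
    // Persist the full arrays, not just their lengths.
    data.agents[id] = {
      ...agent,
      snapshots: [...agent.snapshots],
      revivalHistory: [...agent.revivalHistory],
    };
  }
  for (const [id, deps] of dependencies) {
    data.dependencies[id] = [...deps];
  }
  return JSON.stringify(data);
}

function rehydrateState(json) {
  const data = JSON.parse(json);
  const agents = new Map();
  const dependencies = new Map();
  for (const [id, saved] of Object.entries(data.agents ?? {})) {
    agents.set(id, {
      ...saved,
      snapshots: saved.snapshots ?? [],
      revivalHistory: saved.revivalHistory ?? [],
    });
  }
  for (const [id, deps] of Object.entries(data.dependencies ?? {})) {
    dependencies.set(id, [...deps]);
  }
  return { agents, dependencies };
}
```

The key difference from the quoted payload is that the snapshot and revival arrays are written out in full, so a restarted process has something to revive from.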

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.aiox-core/core/resilience/agent-immortality.js around lines 672 - 696, The
saved JSON only contains summary metadata and loadState() must rebuild in-memory
structures so recovery works; modify loadState to, when reading the saved
payload, rehydrate this.agents (creating Agent objects or restoring their full
state fields including snapshots array, revivalHistory array, lastSnapshot,
registeredAt, lastHeartbeat, errorCount, revivalCount, healthScore, and
per-agent config/fingerprint baselines), repopulate this._dependencies from
data.dependencies, and restore any snapshot storage indexes/refs so
reviveAgent(agentId) can find snapshots and history; update any helper methods
(e.g., _calculateHealthScore) to accept restored agents and ensure reviveAgent,
snapshot handling, and fingerprint baseline lookup use the rehydrated agent
instances instead of the summary JSON.

Comment on lines +1039 to +1042
  - path: core/resilience/agent-immortality.js
    hash: sha256:89ae4bac066088e76071cfc9b391418e9eba804bcc2b2f943edb1ce38974735c
    type: core
    size: 37573

⚠️ Potential issue | 🟠 Major

The compatibility shim is missing from the install manifest.

This manifest adds core/resilience/agent-immortality.js, but the new retrocompat file at .aios-core/core/resilience/agent-immortality.js is not listed anywhere in the generated manifest. Brownfield upgrades driven by this file will install the canonical module without the shim, so existing imports still break.
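
For reference, the fields of such an entry can be derived from the file itself. The sketch below is an assumption-laden illustration: the project presumably has its own manifest generator, and `buildManifestEntry`, its parameters, and the `'core'` default are hypothetical. Only the entry shape (path, hash, type, size) comes from the excerpt above.

```javascript
// Minimal sketch: compute the fields of a manifest entry for a file.
// The entry shape (path, hash, type, size) follows the excerpt above;
// the helper name and the 'core' default are assumptions.
const crypto = require('crypto');
const fs = require('fs');

function buildManifestEntry(absoluteFile, manifestPath, type = 'core') {
  const content = fs.readFileSync(absoluteFile); // Buffer, so size is in bytes
  const digest = crypto.createHash('sha256').update(content).digest('hex');
  return {
    path: manifestPath,
    hash: `sha256:${digest}`,
    type,
    size: content.length,
  };
}
```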

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.aiox-core/install-manifest.yaml around lines 1039 - 1042, The install
manifest is missing the compatibility shim entry for the retrocompat file; add a
manifest entry for ".aios-core/core/resilience/agent-immortality.js" (matching
the actual shim file you added) alongside the existing
"core/resilience/agent-immortality.js" entry, supplying the correct sha256 hash,
size and type so brownfield upgrades will install the shim path and preserve
existing imports.

Comment on lines +974 to +987
    it('should emit cascade-risk for high/critical risks', () => {
      protocol.registerAgent('agent-1');
      protocol.registerAgent('agent-2');
      protocol.declareDependency('agent-2', 'agent-1');

      const agent = protocol.agents.get('agent-1');
      agent.status = AgentStatus.DEAD;

      const handler = jest.fn();
      protocol.on(Events.CASCADE_RISK, handler);
      protocol.getCascadeRisk('agent-1');
      // 1 dependent + dead = high (not critical)
      // high emits cascade-risk
    });

⚠️ Potential issue | 🟠 Major

Several of these tests are currently no-ops.

The cascade-risk case never asserts that the handler fired, and the crash-detection/revival cases only advance timers or read locals without checking status or emitted events. Those tests will stay green even if the core failure/recovery path regresses.

Example assertions to make these cases observable
       protocol.on(Events.CASCADE_RISK, handler);
       protocol.getCascadeRisk('agent-1');
-      // 1 dependent + dead = high (not critical)
-      // high emits cascade-risk
+      expect(handler).toHaveBeenCalledTimes(1);
+      expect(handler.mock.calls[0][0]).toMatchObject({
+        agentId: 'agent-1',
+        riskLevel: 'high',
+      });
@@
       jest.advanceTimersByTime(2000);
-
-      const agent = protocol.agents.get('agent-1');
-      // The agent may be in the SUSPECT state
+      expect(protocol.agents.get('agent-1').status).toBe(AgentStatus.SUSPECT);
@@
       await Promise.resolve();
       await Promise.resolve();
+      expect(revivalHandler).toHaveBeenCalledTimes(1);
+      expect(protocol.agents.get('agent-1').status).toBe(AgentStatus.ALIVE);

As per coding guidelines, "Verify test coverage exists for new/modified functions".

Also applies to: 1020-1059
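
The difference between a no-op test and an observable one can be shown without Jest. In the sketch below, `CascadeTracker` is a hypothetical stand-in for the protocol and `recordEvents` is a hypothetical helper; the real payload and risk rules may differ. Only the event name and the "1 dependent + dead = high" rule come from the test under review.

```javascript
// Minimal sketch of an observable event test without Jest (hypothetical
// CascadeTracker stand-in; the real protocol's payload may differ).
const { EventEmitter } = require('events');

class CascadeTracker extends EventEmitter {
  constructor() {
    super();
    this.dependents = new Map(); // agentId -> ids that depend on it
    this.dead = new Set();
  }

  getCascadeRisk(agentId) {
    const count = (this.dependents.get(agentId) ?? []).length;
    const riskLevel = this.dead.has(agentId) && count > 0 ? 'high' : 'low';
    if (riskLevel === 'high') {
      this.emit('cascade-risk', { agentId, riskLevel });
    }
    return riskLevel;
  }
}

// Record every emitted payload so the test can assert on it afterwards.
function recordEvents(emitter, eventName) {
  const calls = [];
  emitter.on(eventName, (payload) => calls.push(payload));
  return calls;
}
```

A test built on `recordEvents` fails as soon as the event stops firing or its payload changes, which is the property the original spec lacks.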

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/resilience/agent-immortality.test.js` around lines 974 - 987, The
test is a no-op because it never asserts that the cascade event or status
changes actually occurred; update the spec around protocol.registerAgent,
protocol.declareDependency, protocol.getCascadeRisk, and the Events.CASCADE_RISK
listener to assert observable behavior: attach the handler via
protocol.on(Events.CASCADE_RISK, handler) and
expect(handler).toHaveBeenCalledTimes(1) (and/or toHaveBeenCalledWith(...)
validating payload), and for crash-detection/revival cases advance timers and
then assert agent.status (AgentStatus.DEAD or AgentStatus.ALIVE) and any emitted
Events (e.g., revive/crash events) were fired so the test fails if core
failure/recovery paths regress.
