
Fix: support/summarizer model lifecycle #430

Open
giveen wants to merge 11 commits into aliasrobotics:main from giveen:support-model-fix

Conversation

@giveen

@giveen giveen commented Apr 3, 2026

Purpose: Fix support/summarizer model lifecycle and improve auto‑compaction so the support agent no longer holds server context, avoids repeating exhausted approaches, and doesn’t crash the runner.
What Changed

Flush support model: Added async def cleanup() to OpenAIChatCompletionsModel to close the underlying async client, remove the instance from ACTIVE_MODEL_INSTANCES, and clear message_history.

Call cleanup after summarization: Updated _ai_summarize_history to call model_inst.cleanup() in a finally block so temporary summary/support model instances are best-effort flushed after Runner.run. See memory.py.
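The cleanup-in-`finally` pattern described above can be sketched as follows. `OpenAIChatCompletionsModel`, `ACTIVE_MODEL_INSTANCES`, and `message_history` are names from this PR; the bodies here are illustrative stand-ins (the real client is an async OpenAI client, faked here for a self-contained example).

```python
# Sketch of the cleanup-after-summarization pattern; not the actual PR code.
import asyncio

ACTIVE_MODEL_INSTANCES = set()

class FakeAsyncClient:
    """Stand-in for the real async HTTP client."""
    def __init__(self):
        self.closed = False
    async def close(self):
        self.closed = True

class OpenAIChatCompletionsModel:
    def __init__(self):
        self._client = FakeAsyncClient()
        self.message_history = []
        ACTIVE_MODEL_INSTANCES.add(id(self))

    async def cleanup(self):
        """Best-effort flush: close the async client, deregister, clear history."""
        try:
            await self._client.close()
        finally:
            ACTIVE_MODEL_INSTANCES.discard(id(self))
            self.message_history.clear()

async def summarize_with_temp_model():
    model_inst = OpenAIChatCompletionsModel()
    try:
        model_inst.message_history.append({"role": "user", "content": "..."})
        # ... Runner.run(model_inst, ...) would happen here ...
    finally:
        # Flush even if summarization raised, as _ai_summarize_history does.
        await model_inst.cleanup()
    return model_inst

model = asyncio.run(summarize_with_temp_model())
print(model._client.closed, len(model.message_history), len(ACTIVE_MODEL_INSTANCES))
# → True 0 0
```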
Compaction & resume hardening (context): earlier related fixes included moving _last_user_input assignment earlier, injecting compacted summary back into message_history to prevent repeat work, and improving _format_history_for_summary to preserve tool args and larger tool outputs so the summarizer can list “Exhausted Approaches”. See cli.py, openai_chatcompletions.py, and memory.py.
Why

CAI_SUPPORT_MODEL / auto‑compaction path was leaving ephemeral summary/support model instances with open HTTP clients and retained message history, causing downstream LLM servers (e.g., llama-support) to accumulate large context (n_ctx_slot) and consume slots/tokens. Flushing prevents slot/context growth and avoids repeated retries of exhausted approaches.

Config checks

export CAI_SUPPORT_MODEL=llama-support
export CAI_SUPPORT_INTERVAL=3

My current setup for testing is as follows:

llama.cpp and LiteLLM (proxy)

.env

OPENAI_API_BASE="http://192.168.0.165:4000/v1"
CAI_MODEL="openai/reasoner"
CAI_SUPPORT_MODEL="openai/support"
CAI_SUPPORT_INTERVAL=25
OPENAI_API_KEY="sk-asdfasdfs"
PROMPT_TOOLKIT_NO_CPR=1
CAI_STREAM=false
CAI_AGENT_TYPE="redteam_agent"
CAI_GUARDRAILS="true"
CAI_TOOL_TIMEOUT=120
CAI_MEMORY="episodic"
CAI_MEMORY_ONLINE="true"
CAI_MEMORY_ONLINE_INTERVAL=5

Main model setup with llama.cpp

ExecStart=/mnt/storage/llama.cpp/build/bin/llama-server \
    --model /mnt/storage/models/qwen3.5-27b-opus-distill-Q6_K.gguf \
    --model-draft /mnt/storage/models/Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --n-gpu-layers-draft 99 \
    --ctx-size 65536 \
    --threads 16 \
    --batch-size 2048 \
    --ubatch-size 512 \
    --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --sleep-idle-seconds 300 \
    --host 0.0.0.0 \
    --port 8080 \
    --cont-batching \
    --jinja

Support model setup with llama.cpp

ExecStart=/mnt/storage/llama.cpp/build/bin/llama-server \
    --model /mnt/storage/models/Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    --port 8081 \
    --slot-save-path /tmp \
    --host 0.0.0.0 \
    --flash-attn on

With this setup, every 25 turns (user input plus LLM input/output), the support model summarizes the work completed, lists what succeeded or failed, and suggests methods to try next.

This summary is then saved to memory.

Context is cleared for both the main model and the summarizer.

The summary is loaded back into the main model as its prompt, and the agent continues its work. This prevents local LLMs from running out of context space and shutting down under load.

Coded with Claude Sonnet 4.6 - High thinking mode

giveen added 11 commits April 2, 2026 14:25
…he main agent context

Problem
-------
CAI_SUPPORT_MODEL and CAI_SUPPORT_INTERVAL were documented environment
variables that had no runtime implementation.  The support/reasoner agent
was constructed using CAI_SUPPORT_MODEL but was never invoked automatically.
CAI_SUPPORT_INTERVAL existed only in docs/config tables with no scheduler
reading it.  As a result, users with a limited context window (e.g. 32k on a
local llama.cpp setup) had no way to automatically keep the main model's
message_history from overflowing during a long pentest.

Solution
--------
Added an auto-compact scheduler block immediately after turn_count += 1 in
the main single-agent run loop (run_cai_cli).

When both CAI_SUPPORT_MODEL and CAI_SUPPORT_INTERVAL are set the scheduler:

1. Fires every CAI_SUPPORT_INTERVAL turns (modulo check).
2. Calls COMPACT_COMMAND_INSTANCE._perform_compaction(model_override=
   CAI_SUPPORT_MODEL) which:
   - Sends the full message_history to the support model for summarisation.
   - Clears message_history entirely (hard context reset).
   - Saves the summary to /memory as a .md file.
   - Stores the summary in COMPACTED_SUMMARIES under the agent name.
3. Re-syncs the local agent variable from AGENT_MANAGER.get_active_agent()
   so the run loop continues with the freshly reloaded agent instance whose
   system prompt already contains the injected summary (the system prompt
   template calls get_compacted_summary() dynamically on every turn so no
   extra wiring was needed).
4. Prints a visible yellow/green indicator so users can see when compaction
   fires and confirm the context window has been reset.
5. Silently swallows errors (only logs when CAI_DEBUG=2) so a failing support
   model never crashes the main session.
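The interval trigger in step 1 can be sketched with a plain modulo check on the turn counter. This is a minimal illustration under the env-var names from this commit; the exact guard logic inside run_cai_cli may differ in detail.

```python
# Minimal sketch of the CAI_SUPPORT_INTERVAL modulo check (illustrative).
import os

os.environ["CAI_SUPPORT_MODEL"] = "openai/support"
os.environ["CAI_SUPPORT_INTERVAL"] = "4"

def should_compact(turn_count: int) -> bool:
    """Fire only when both env vars are set and turn_count hits the interval."""
    support_model = os.getenv("CAI_SUPPORT_MODEL")
    interval_raw = os.getenv("CAI_SUPPORT_INTERVAL")
    if not (support_model and interval_raw):
        return False
    try:
        interval = int(interval_raw)
    except ValueError:
        return False
    return interval > 0 and turn_count % interval == 0

fired = [t for t in range(1, 13) if should_compact(t)]
print(fired)  # → [4, 8, 12]
```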

Usage
-----
  CAI_SUPPORT_MODEL="openai/support"  # lighter model on litellm proxy
  CAI_SUPPORT_INTERVAL=4              # compact every 4 turns
After auto-compact fired the main run loop fell back to get_user_input()
and waited for the human, so the agent stopped working mid-task.

Root cause: no mechanism existed to replay the current task into the next
iteration after message_history was cleared.

Fix:
- Add _post_compact_input (str | None) variable initialised to None at
  session start.
- Capture _last_user_input from user_input just before turn_count += 1.
- After a successful auto-compact set _post_compact_input to
  _last_user_input (or 'Continue the current task.' if it was blank).
- At the top of the input-gathering block, consume _post_compact_input
  before calling get_user_input() so the agent immediately re-runs with
  its previous task prompt and the fresh compacted context.
- Print startup banner confirming CAI_SUPPORT_MODEL + CAI_SUPPORT_INTERVAL
  are loaded so users can verify env vars are picked up.
- Show a per-turn countdown (dim cyan) so it is visible the interval is
  counting correctly toward the next compact.
- Remove the CAI_DEBUG=2 gate on error output — auto-compact errors are
  now always printed so silent failures can no longer mask a broken
  support model endpoint or compaction issue.
CAI_SUPPORT_INTERVAL previously counted outer while-loop iterations
(one per user input).  In agentic/continue mode the agent makes many
tool-call rounds per single user input, so the interval never fired
unless the user typed N separate messages.

Fix: count assistant messages in message_history as a proxy for LLM
API calls.  This fires after every N responses from the main model
regardless of how many came from a single outer iteration.

- Startup banner updated: 'every N LLM responses' instead of 'turns'
- Countdown shows [{current}/{threshold}] response counts
- Fire condition: llm_call_count >= support_interval (fires as soon
  as threshold is reached; resets naturally when history is cleared
  after compact)
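The response-count proxy can be sketched as a simple scan over message_history. Roles and shapes here follow the usual chat-completions message format; the threshold comparison mirrors the fire condition described above.

```python
# Sketch: count assistant messages as a proxy for LLM API calls (illustrative).
def llm_call_count(message_history):
    return sum(1 for m in message_history if m.get("role") == "assistant")

history = [
    {"role": "user", "content": "scan the host"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "1"}]},
    {"role": "tool", "content": "open ports: 22, 80"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "2"}]},
    {"role": "tool", "content": "http title: Login"},
    {"role": "assistant", "content": "Found a login page."},
]

support_interval = 3
fires = llm_call_count(history) >= support_interval
print(llm_call_count(history), fires)  # → 3 True
```

Because compaction clears message_history, the count resets to zero naturally after each fire, as the commit message notes.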
Both get_response (token counting) and _fetch_response (actual API call)
were prepending self.message_history to every request. But cli.py already
passes the full conversation history via history_context as conversation_input
to Runner.run, which threads it through as original_input to these methods.

Result: every historical message was sent TWICE in every API call, doubling
the effective context size. After auto-compact cleared message_history, the
duplication between runner-accumulated generated_items and message_history
rebuilt the doubled context within a single Runner.run invocation (after
just 3-4 tool calls), explaining why n_tokens never dropped post-compact.

Fix: remove the prepend loops. The runner's input parameter already contains
the full conversation (original_input + generated_items), so converted_messages
is built from input alone. message_history continues to serve its role as
cross-turn persistence (populated via add_to_message_history, consumed by
cli.py as history_context for the next Runner.run call).

Expected effect: halved token counts in normal operation; post-compact first
call starts at ~system_prompt + 1 user message and grows linearly with tool
calls rather than doubling.
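The duplication can be illustrated in a few lines: if the runner's input already contains the full conversation, prepending message_history on top of it sends every historical message twice. This is a toy reproduction of the bug shape, not the PR's code.

```python
# Toy illustration of the double-context bug and the fix (illustrative).
message_history = [
    {"role": "user", "content": "task"},
    {"role": "assistant", "content": "step 1"},
]
# cli.py already passes the full conversation as the runner input:
runner_input = message_history + [{"role": "user", "content": "continue"}]

buggy = message_history + runner_input  # old behavior: history included twice
fixed = list(runner_input)              # fix: build converted_messages from input alone

print(len(buggy), len(fixed))  # → 5 3
```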
Previously CAI_SUPPORT_INTERVAL only checked at the end of each outer
while-loop iteration (once per user input). In agentic sessions the agent
makes many successive tool calls inside a single Runner.run invocation,
so the check would only fire after the runner returned — too late, or
never if the runner was still running.

Fix:
- Add ContextCompactedError exception class.
- Add a count-based trigger at the top of _auto_compact_if_needed (which
  fires on EVERY LLM API call, inside both get_response and stream_response).
  When assistant-message count >= CAI_SUPPORT_INTERVAL:
    1. Set the compact model to CAI_SUPPORT_MODEL temporarily.
    2. Summarise via _ai_summarize_history (awaited in-situ, no asyncio.run).
    3. Store summary in COMPACTED_SUMMARIES so get_system_prompt picks it
       up on the next turn without needing agent reload.
    4. Clear message_history + reset CAI_CONTEXT_USAGE.
    5. Raise ContextCompactedError to abort the current runner invocation.
- cli.py catches ContextCompactedError in both streaming and non-streaming
  runner call sites:
    - Sets _post_compact_input = _last_user_input so the task is replayed.
    - Re-syncs the local agent reference via AGENT_MANAGER.
    - Continues the outer while-loop, restarting with a clean context window.

The existing outer-loop CLI check (counting assistant messages in history
after the runner finishes) is kept as a belt-and-suspenders fallback.
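The compact-and-restart control flow can be sketched as below. `ContextCompactedError`, `COMPACTED_SUMMARIES`, and the count-based trigger are named in this commit; the function body is an illustrative stand-in for `_auto_compact_if_needed` (summarization is faked with a string).

```python
# Sketch of the raise/catch restart path (illustrative stand-ins throughout).
class ContextCompactedError(Exception):
    """Raised mid-run to abort the runner after a hard context reset."""

COMPACTED_SUMMARIES = {}
message_history = [{"role": "assistant", "content": f"step {i}"} for i in range(5)]

def auto_compact_if_needed(agent_name, support_interval=3):
    asst = sum(1 for m in message_history if m["role"] == "assistant")
    if asst >= support_interval:
        COMPACTED_SUMMARIES[agent_name] = ["summary of steps 0-4"]  # summarize
        message_history.clear()                                     # hard reset
        raise ContextCompactedError(f"compacted after {asst} responses")

restarted = False
try:
    auto_compact_if_needed("redteam_agent")
except ContextCompactedError:
    restarted = True  # CLI would set _post_compact_input and continue the loop

print(restarted, len(message_history))  # → True 0
```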
…ompaction prompt

After compaction, message_history.clear() wiped all context and the stored
summary in COMPACTED_SUMMARIES was never re-injected, so the next runner turn
started completely blank and the agent would repeat already-tried approaches.

Fix: immediately after clearing, push a user+assistant exchange containing the
summary into message_history so it flows through history_context on the next
iteration as normal conversation context.

Also: add an explicit 'Exhausted Approaches — DO NOT RETRY' section (§9) and
'Recommended Next Steps' section (§10) to the compaction prompt so the support
model produces a checklist of dead ends the main agent must not revisit.
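The re-seed step can be sketched as a user/assistant exchange pushed into the freshly cleared history, so it flows through history_context on the next iteration. The wording of the injected messages here is illustrative.

```python
# Sketch: re-inject the compacted summary after clearing (illustrative wording).
message_history = [{"role": "assistant", "content": "old turn"}]
summary = "§9 Exhausted Approaches: anonymous FTP login failed; ..."

message_history.clear()
message_history.extend([
    {"role": "user", "content": "Context was compacted. Summary of prior work follows."},
    {"role": "assistant", "content": summary},
])

print(len(message_history))  # → 2
```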
Three root causes identified and fixed:

1. _format_history_for_summary dropped ALL tool outputs >500 chars with
   '[Long output truncated]', meaning nmap/gobuster/curl results — the
   exact evidence the summary model needs to write the 'Exhausted
   Approaches' section — were silently discarded. Increased limit to
   2000 chars (first chars of each output), bumped the message cap from
   50 to 200 blocks, and fixed the assistant tool-call extractor which
   only handled object-style tool_calls (not the dict-style format used
   in message_history), causing every command ever run to disappear.

2. _post_compact_input was set to the raw original user task (e.g.
   'hack the box machine Cap'). That becomes the last message the LLM
   reads, overriding the memory acknowledgement and making the agent
   treat it as a brand-new task. Now injects an explicit anti-repetition
   instruction alongside the original task text.

3. (Previous fix) Summary prompt now includes §9 Exhausted Approaches
   and §10 Recommended Next Steps — this only works if the summary model
   actually sees the scan/tool data, which fix #1 now guarantees.
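The extractor fix in item 1 can be sketched as handling both shapes of tool_calls entries. This is an illustrative stand-in for the logic in `_format_history_for_summary`; the dict and object shapes mirror the two formats the commit message describes.

```python
# Sketch: handle both dict-style and object-style tool_calls (illustrative).
from types import SimpleNamespace

def extract_tool_calls(msg):
    calls = []
    for tc in msg.get("tool_calls") or []:
        if isinstance(tc, dict):              # dict-style, as in message_history
            fn = tc.get("function", {})
            calls.append((fn.get("name"), fn.get("arguments")))
        else:                                  # object-style SDK tool calls
            calls.append((tc.function.name, tc.function.arguments))
    return calls

dict_msg = {"role": "assistant",
            "tool_calls": [{"function": {"name": "nmap",
                                         "arguments": "-sV 10.0.0.5"}}]}
obj_msg = {"role": "assistant",
           "tool_calls": [SimpleNamespace(function=SimpleNamespace(
               name="gobuster", arguments="dir -u http://10.0.0.5"))]}

print(extract_tool_calls(dict_msg) + extract_tool_calls(obj_msg))
# → [('nmap', '-sV 10.0.0.5'), ('gobuster', 'dir -u http://10.0.0.5')]
```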
Copilot AI review requested due to automatic review settings April 3, 2026 15:11

Copilot AI left a comment


Pull request overview

Fixes the support/summarizer model lifecycle to avoid leaking client resources/history, and hardens auto-compaction so the CLI can restart cleanly after mid-run compaction.

Changes:

  • Added OpenAIChatCompletionsModel.cleanup() and invoked it after summarization to best-effort close async clients and clear temporary model state.
  • Updated OpenAI Chat Completions request building to avoid double-including history when CLI already passes it in conversation_input.
  • Added a ContextCompactedError restart path and support-interval-based auto-compaction/replay behavior in the CLI/model.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/cai/sdk/agents/models/openai_chatcompletions.py Adds explicit model cleanup and a support-interval auto-compact trigger (raising ContextCompactedError) and avoids double-prepending history.
src/cai/repl/commands/memory.py Improves summary prompt/formatting and ensures temporary summarizer models are cleaned up in a finally block.
src/cai/cli.py Catches ContextCompactedError to restart Runner and replays the last user task; prints auto-compact status and includes a CLI-side interval compaction block.


Comment on lines +3598 to +3605
except Exception as _ce:
    _console.print(f"[red]Auto-compact error: {_ce}[/red]")
# Always abort the current runner invocation so the outer loop
# can restart with our freshly cleared context.
raise ContextCompactedError(
    f"Context compacted after {_asst_count} LLM responses "
    f"(threshold {_support_interval})"
)

Copilot AI Apr 3, 2026


In the CAI_SUPPORT_INTERVAL trigger, ContextCompactedError is raised unconditionally after the summarization attempt, even if summarization/compaction fails (e.g., import/runtime error). In that failure case, message_history may remain unchanged and the next Runner restart will immediately hit the same condition again, potentially causing an infinite restart loop. Consider only raising ContextCompactedError after successfully clearing/reseeding history (or add a guard/backoff flag so failures fall back to normal execution and don’t retrigger immediately).

# many tool-call rounds per single user input — are handled correctly.
_support_model = os.getenv("CAI_SUPPORT_MODEL")
_support_interval_raw = os.getenv("CAI_SUPPORT_INTERVAL")
if _support_model and _support_interval_raw:

Copilot AI Apr 3, 2026


This CLI-level CAI_SUPPORT_INTERVAL auto-compact runs whenever CAI_SUPPORT_MODEL/CAI_SUPPORT_INTERVAL are set, but it does not check CAI_AUTO_COMPACT. As a result, setting CAI_AUTO_COMPACT=false will still trigger count-based compaction from the CLI loop. Consider honoring the same disable flag here (or introducing a dedicated flag for support-interval compaction) so users can reliably turn auto-compaction off.

Suggested change

-if _support_model and _support_interval_raw:
+_auto_compact_raw = os.getenv("CAI_AUTO_COMPACT", "true")
+_auto_compact_enabled = _auto_compact_raw.strip().lower() not in {
+    "0", "false", "no", "off"
+}
+if _auto_compact_enabled and _support_model and _support_interval_raw:

Comment on lines +3566 to +3569
if self.agent_name not in COMPACTED_SUMMARIES:
    COMPACTED_SUMMARIES[self.agent_name] = []
    APPLIED_MEMORY_IDS[self.agent_name] = []
COMPACTED_SUMMARIES[self.agent_name] = [_summary]

Copilot AI Apr 3, 2026


When applying the new in-memory summary, COMPACTED_SUMMARIES[self.agent_name] is overwritten but APPLIED_MEMORY_IDS is only initialized for the first-time case. If the agent already has entries, the old APPLIED_MEMORY_IDS[self.agent_name] values remain and no longer correspond to the newly applied summary, which can make /memory status misleading. Consider clearing/updating APPLIED_MEMORY_IDS[self.agent_name] alongside COMPACTED_SUMMARIES when overwriting the summary (even if no memory_id is available).

Suggested change

-if self.agent_name not in COMPACTED_SUMMARIES:
-    COMPACTED_SUMMARIES[self.agent_name] = []
-    APPLIED_MEMORY_IDS[self.agent_name] = []
-COMPACTED_SUMMARIES[self.agent_name] = [_summary]
+COMPACTED_SUMMARIES[self.agent_name] = [_summary]
+APPLIED_MEMORY_IDS[self.agent_name] = []
