
Fix: support/summarizer model lifecycle #430

Open
giveen wants to merge 11 commits into aliasrobotics:main from giveen:support-model-fix

Conversation

@giveen

@giveen giveen commented Apr 3, 2026

Purpose: Fix support/summarizer model lifecycle and improve auto‑compaction so the support agent no longer holds server context, avoids repeating exhausted approaches, and doesn’t crash the runner.
What Changed

Flush support model: Added async def cleanup() to OpenAIChatCompletionsModel to close the underlying async client, remove the instance from ACTIVE_MODEL_INSTANCES, and clear message_history.

Call cleanup after summarization: Updated _ai_summarize_history to call model_inst.cleanup() in a finally block so temporary summary/support model instances are best-effort flushed after Runner.run. See memory.py.
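The cleanup-in-`finally` pattern described above can be sketched as follows. `OpenAIChatCompletionsModel`, `ACTIVE_MODEL_INSTANCES`, and `message_history` are names from this PR; the bodies here are illustrative stand-ins (the real client is an async OpenAI client, faked here for a self-contained example).

```python
# Sketch of the cleanup-after-summarization pattern; not the actual PR code.
import asyncio

ACTIVE_MODEL_INSTANCES = set()

class FakeAsyncClient:
    """Stand-in for the real async HTTP client."""
    def __init__(self):
        self.closed = False
    async def close(self):
        self.closed = True

class OpenAIChatCompletionsModel:
    def __init__(self):
        self._client = FakeAsyncClient()
        self.message_history = []
        ACTIVE_MODEL_INSTANCES.add(id(self))

    async def cleanup(self):
        """Best-effort flush: close the async client, deregister, clear history."""
        try:
            await self._client.close()
        finally:
            ACTIVE_MODEL_INSTANCES.discard(id(self))
            self.message_history.clear()

async def summarize_with_temp_model():
    model_inst = OpenAIChatCompletionsModel()
    try:
        model_inst.message_history.append({"role": "user", "content": "..."})
        # ... Runner.run(model_inst, ...) would happen here ...
    finally:
        # Flush even if summarization raised, as _ai_summarize_history does.
        await model_inst.cleanup()
    return model_inst

model = asyncio.run(summarize_with_temp_model())
print(model._client.closed, len(model.message_history), len(ACTIVE_MODEL_INSTANCES))
# → True 0 0
```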
Compaction & resume hardening (context): earlier related fixes included moving _last_user_input assignment earlier, injecting compacted summary back into message_history to prevent repeat work, and improving _format_history_for_summary to preserve tool args and larger tool outputs so the summarizer can list “Exhausted Approaches”. See cli.py, openai_chatcompletions.py, and memory.py.
Why

CAI_SUPPORT_MODEL / auto‑compaction path was leaving ephemeral summary/support model instances with open HTTP clients and retained message history, causing downstream LLM servers (e.g., llama-support) to accumulate large context (n_ctx_slot) and consume slots/tokens. Flushing prevents slot/context growth and avoids repeated retries of exhausted approaches.

Config checks

export CAI_SUPPORT_MODEL=llama-support
export CAI_SUPPORT_INTERVAL=3

My current setup for testing is as follows:

llama.cpp and LiteLLM (proxy)

.env

OPENAI_API_BASE="http://192.168.0.165:4000/v1"
CAI_MODEL="openai/reasoner"
CAI_SUPPORT_MODEL="openai/support"
CAI_SUPPORT_INTERVAL=25
OPENAI_API_KEY="sk-asdfasdfs"
PROMPT_TOOLKIT_NO_CPR=1
CAI_STREAM=false
CAI_AGENT_TYPE="redteam_agent"
CAI_GUARDRAILS="true"
CAI_TOOL_TIMEOUT=120
CAI_MEMORY="episodic"
CAI_MEMORY_ONLINE="true"
CAI_MEMORY_ONLINE_INTERVAL=5

Main model setup with llama.cpp

ExecStart=/mnt/storage/llama.cpp/build/bin/llama-server \
    --model /mnt/storage/models/qwen3.5-27b-opus-distill-Q6_K.gguf \
    --model-draft /mnt/storage/models/Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --n-gpu-layers-draft 99 \
    --ctx-size 65536 \
    --threads 16 \
    --batch-size 2048 \
    --ubatch-size 512 \
    --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --sleep-idle-seconds 300 \
    --host 0.0.0.0 \
    --port 8080 \
    --cont-batching \
    --jinja

Support model setup with llama.cpp

ExecStart=/mnt/storage/llama.cpp/build/bin/llama-server \
    --model /mnt/storage/models/Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    --port 8081 \
    --slot-save-path /tmp \
    --host 0.0.0.0 \
    --flash-attn on

With this setup, every 25 turns (user input plus LLM input/output), the support model summarizes the work completed, lists what succeeded or failed, and suggests methods to try next.

This summary is then saved to memory.

Context is cleared for both the main model and the summarizer.

The summary is loaded back into the main model as its prompt, and the agent continues its work. This prevents local LLMs from running out of context space and shutting down under load.

Coded with Claude Sonnet 4.6 - High thinking mode

giveen added 11 commits April 2, 2026 14:25
…he main agent context

Problem
-------
CAI_SUPPORT_MODEL and CAI_SUPPORT_INTERVAL were documented environment
variables that had no runtime implementation.  The support/reasoner agent
was constructed using CAI_SUPPORT_MODEL but was never invoked automatically.
CAI_SUPPORT_INTERVAL existed only in docs/config tables with no scheduler
reading it.  As a result, users with a limited context window (e.g. 32k on a
local llama.cpp setup) had no way to automatically keep the main model's
message_history from overflowing during a long pentest.

Solution
--------
Added an auto-compact scheduler block immediately after turn_count += 1 in
the main single-agent run loop (run_cai_cli).

When both CAI_SUPPORT_MODEL and CAI_SUPPORT_INTERVAL are set the scheduler:

1. Fires every CAI_SUPPORT_INTERVAL turns (modulo check).
2. Calls COMPACT_COMMAND_INSTANCE._perform_compaction(model_override=
   CAI_SUPPORT_MODEL) which:
   - Sends the full message_history to the support model for summarisation.
   - Clears message_history entirely (hard context reset).
   - Saves the summary to /memory as a .md file.
   - Stores the summary in COMPACTED_SUMMARIES under the agent name.
3. Re-syncs the local agent variable from AGENT_MANAGER.get_active_agent()
   so the run loop continues with the freshly reloaded agent instance whose
   system prompt already contains the injected summary (the system prompt
   template calls get_compacted_summary() dynamically on every turn so no
   extra wiring was needed).
4. Prints a visible yellow/green indicator so users can see when compaction
   fires and confirm the context window has been reset.
5. Silently swallows errors (only logs when CAI_DEBUG=2) so a failing support
   model never crashes the main session.
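The interval trigger in step 1 can be sketched with a plain modulo check on the turn counter. This is a minimal illustration under the env-var names from this commit; the exact guard logic inside run_cai_cli may differ in detail.

```python
# Minimal sketch of the CAI_SUPPORT_INTERVAL modulo check (illustrative).
import os

os.environ["CAI_SUPPORT_MODEL"] = "openai/support"
os.environ["CAI_SUPPORT_INTERVAL"] = "4"

def should_compact(turn_count: int) -> bool:
    """Fire only when both env vars are set and turn_count hits the interval."""
    support_model = os.getenv("CAI_SUPPORT_MODEL")
    interval_raw = os.getenv("CAI_SUPPORT_INTERVAL")
    if not (support_model and interval_raw):
        return False
    try:
        interval = int(interval_raw)
    except ValueError:
        return False
    return interval > 0 and turn_count % interval == 0

fired = [t for t in range(1, 13) if should_compact(t)]
print(fired)  # → [4, 8, 12]
```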

Usage
-----
  CAI_SUPPORT_MODEL="openai/support"  # lighter model on litellm proxy
  CAI_SUPPORT_INTERVAL=4              # compact every 4 turns
After auto-compact fired the main run loop fell back to get_user_input()
and waited for the human, so the agent stopped working mid-task.

Root cause: no mechanism existed to replay the current task into the next
iteration after message_history was cleared.

Fix:
- Add _post_compact_input (str | None) variable initialised to None at
  session start.
- Capture _last_user_input from user_input just before turn_count += 1.
- After a successful auto-compact set _post_compact_input to
  _last_user_input (or 'Continue the current task.' if it was blank).
- At the top of the input-gathering block, consume _post_compact_input
  before calling get_user_input() so the agent immediately re-runs with
  its previous task prompt and the fresh compacted context.
- Print startup banner confirming CAI_SUPPORT_MODEL + CAI_SUPPORT_INTERVAL
  are loaded so users can verify env vars are picked up.
- Show a per-turn countdown (dim cyan) so it is visible the interval is
  counting correctly toward the next compact.
- Remove the CAI_DEBUG=2 gate on error output — auto-compact errors are
  now always printed so silent failures can no longer mask a broken
  support model endpoint or compaction issue.
CAI_SUPPORT_INTERVAL previously counted outer while-loop iterations
(one per user input).  In agentic/continue mode the agent makes many
tool-call rounds per single user input, so the interval never fired
unless the user typed N separate messages.

Fix: count assistant messages in message_history as a proxy for LLM
API calls.  This fires after every N responses from the main model
regardless of how many came from a single outer iteration.

- Startup banner updated: 'every N LLM responses' instead of 'turns'
- Countdown shows [{current}/{threshold}] response counts
- Fire condition: llm_call_count >= support_interval (fires as soon
  as threshold is reached; resets naturally when history is cleared
  after compact)
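The response-count proxy can be sketched as a simple scan over message_history. Roles and shapes here follow the usual chat-completions message format; the threshold comparison mirrors the fire condition described above.

```python
# Sketch: count assistant messages as a proxy for LLM API calls (illustrative).
def llm_call_count(message_history):
    return sum(1 for m in message_history if m.get("role") == "assistant")

history = [
    {"role": "user", "content": "scan the host"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "1"}]},
    {"role": "tool", "content": "open ports: 22, 80"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "2"}]},
    {"role": "tool", "content": "http title: Login"},
    {"role": "assistant", "content": "Found a login page."},
]

support_interval = 3
fires = llm_call_count(history) >= support_interval
print(llm_call_count(history), fires)  # → 3 True
```

Because compaction clears message_history, the count resets to zero naturally after each fire, as the commit message notes.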
Both get_response (token counting) and _fetch_response (actual API call)
were prepending self.message_history to every request. But cli.py already
passes the full conversation history via history_context as conversation_input
to Runner.run, which threads it through as original_input to these methods.

Result: every historical message was sent TWICE in every API call, doubling
the effective context size. After auto-compact cleared message_history, the
duplication between runner-accumulated generated_items and message_history
rebuilt the doubled context within a single Runner.run invocation (after
just 3-4 tool calls), explaining why n_tokens never dropped post-compact.

Fix: remove the prepend loops. The runner's input parameter already contains
the full conversation (original_input + generated_items), so converted_messages
is built from input alone. message_history continues to serve its role as
cross-turn persistence (populated via add_to_message_history, consumed by
cli.py as history_context for the next Runner.run call).

Expected effect: halved token counts in normal operation; post-compact first
call starts at ~system_prompt + 1 user message and grows linearly with tool
calls rather than doubling.
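The duplication can be illustrated in a few lines: if the runner's input already contains the full conversation, prepending message_history on top of it sends every historical message twice. This is a toy reproduction of the bug shape, not the PR's code.

```python
# Toy illustration of the double-context bug and the fix (illustrative).
message_history = [
    {"role": "user", "content": "task"},
    {"role": "assistant", "content": "step 1"},
]
# cli.py already passes the full conversation as the runner input:
runner_input = message_history + [{"role": "user", "content": "continue"}]

buggy = message_history + runner_input  # old behavior: history included twice
fixed = list(runner_input)              # fix: build converted_messages from input alone

print(len(buggy), len(fixed))  # → 5 3
```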
Previously CAI_SUPPORT_INTERVAL only checked at the end of each outer
while-loop iteration (once per user input). In agentic sessions the agent
makes many successive tool calls inside a single Runner.run invocation,
so the check would only fire after the runner returned — too late, or
never if the runner was still running.

Fix:
- Add ContextCompactedError exception class.
- Add a count-based trigger at the top of _auto_compact_if_needed (which
  fires on EVERY LLM API call, inside both get_response and stream_response).
  When assistant-message count >= CAI_SUPPORT_INTERVAL:
    1. Set the compact model to CAI_SUPPORT_MODEL temporarily.
    2. Summarise via _ai_summarize_history (awaited in-situ, no asyncio.run).
    3. Store summary in COMPACTED_SUMMARIES so get_system_prompt picks it
       up on the next turn without needing agent reload.
    4. Clear message_history + reset CAI_CONTEXT_USAGE.
    5. Raise ContextCompactedError to abort the current runner invocation.
- cli.py catches ContextCompactedError in both streaming and non-streaming
  runner call sites:
    - Sets _post_compact_input = _last_user_input so the task is replayed.
    - Re-syncs the local agent reference via AGENT_MANAGER.
    - Continues the outer while-loop, restarting with a clean context window.

The existing outer-loop CLI check (counting assistant messages in history
after the runner finishes) is kept as a belt-and-suspenders fallback.
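The compact-and-restart control flow can be sketched as below. `ContextCompactedError`, `COMPACTED_SUMMARIES`, and the count-based trigger are named in this commit; the function body is an illustrative stand-in for `_auto_compact_if_needed` (summarization is faked with a string).

```python
# Sketch of the raise/catch restart path (illustrative stand-ins throughout).
class ContextCompactedError(Exception):
    """Raised mid-run to abort the runner after a hard context reset."""

COMPACTED_SUMMARIES = {}
message_history = [{"role": "assistant", "content": f"step {i}"} for i in range(5)]

def auto_compact_if_needed(agent_name, support_interval=3):
    asst = sum(1 for m in message_history if m["role"] == "assistant")
    if asst >= support_interval:
        COMPACTED_SUMMARIES[agent_name] = ["summary of steps 0-4"]  # summarize
        message_history.clear()                                     # hard reset
        raise ContextCompactedError(f"compacted after {asst} responses")

restarted = False
try:
    auto_compact_if_needed("redteam_agent")
except ContextCompactedError:
    restarted = True  # CLI would set _post_compact_input and continue the loop

print(restarted, len(message_history))  # → True 0
```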
…ompaction prompt

After compaction, message_history.clear() wiped all context and the stored
summary in COMPACTED_SUMMARIES was never re-injected, so the next runner turn
started completely blank and the agent would repeat already-tried approaches.

Fix: immediately after clearing, push a user+assistant exchange containing the
summary into message_history so it flows through history_context on the next
iteration as normal conversation context.

Also: add an explicit 'Exhausted Approaches — DO NOT RETRY' section (§9) and
'Recommended Next Steps' section (§10) to the compaction prompt so the support
model produces a checklist of dead ends the main agent must not revisit.
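The re-seed step can be sketched as a user/assistant exchange pushed into the freshly cleared history, so it flows through history_context on the next iteration. The wording of the injected messages here is illustrative.

```python
# Sketch: re-inject the compacted summary after clearing (illustrative wording).
message_history = [{"role": "assistant", "content": "old turn"}]
summary = "§9 Exhausted Approaches: anonymous FTP login failed; ..."

message_history.clear()
message_history.extend([
    {"role": "user", "content": "Context was compacted. Summary of prior work follows."},
    {"role": "assistant", "content": summary},
])

print(len(message_history))  # → 2
```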
Three root causes identified and fixed:

1. _format_history_for_summary dropped ALL tool outputs >500 chars with
   '[Long output truncated]', meaning nmap/gobuster/curl results — the
   exact evidence the summary model needs to write the 'Exhausted
   Approaches' section — were silently discarded. Increased limit to
   2000 chars (first chars of each output), bumped the message cap from
   50 to 200 blocks, and fixed the assistant tool-call extractor which
   only handled object-style tool_calls (not the dict-style format used
   in message_history), causing every command ever run to disappear.

2. _post_compact_input was set to the raw original user task (e.g.
   'hack the box machine Cap'). That becomes the last message the LLM
   reads, overriding the memory acknowledgement and making the agent
   treat it as a brand-new task. Now injects an explicit anti-repetition
   instruction alongside the original task text.

3. (Previous fix) Summary prompt now includes §9 Exhausted Approaches
   and §10 Recommended Next Steps — this only works if the summary model
   actually sees the scan/tool data, which fix #1 now guarantees.
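The extractor fix in item 1 can be sketched as handling both shapes of tool_calls entries. This is an illustrative stand-in for the logic in `_format_history_for_summary`; the dict and object shapes mirror the two formats the commit message describes.

```python
# Sketch: handle both dict-style and object-style tool_calls (illustrative).
from types import SimpleNamespace

def extract_tool_calls(msg):
    calls = []
    for tc in msg.get("tool_calls") or []:
        if isinstance(tc, dict):              # dict-style, as in message_history
            fn = tc.get("function", {})
            calls.append((fn.get("name"), fn.get("arguments")))
        else:                                  # object-style SDK tool calls
            calls.append((tc.function.name, tc.function.arguments))
    return calls

dict_msg = {"role": "assistant",
            "tool_calls": [{"function": {"name": "nmap",
                                         "arguments": "-sV 10.0.0.5"}}]}
obj_msg = {"role": "assistant",
           "tool_calls": [SimpleNamespace(function=SimpleNamespace(
               name="gobuster", arguments="dir -u http://10.0.0.5"))]}

print(extract_tool_calls(dict_msg) + extract_tool_calls(obj_msg))
# → [('nmap', '-sV 10.0.0.5'), ('gobuster', 'dir -u http://10.0.0.5')]
```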
Copilot AI review requested due to automatic review settings April 3, 2026 15:11

Copilot AI left a comment


Pull request overview

Fixes the support/summarizer model lifecycle to avoid leaking client resources/history, and hardens auto-compaction so the CLI can restart cleanly after mid-run compaction.

Changes:

  • Added OpenAIChatCompletionsModel.cleanup() and invoked it after summarization to best-effort close async clients and clear temporary model state.
  • Updated OpenAI Chat Completions request building to avoid double-including history when CLI already passes it in conversation_input.
  • Added a ContextCompactedError restart path and support-interval-based auto-compaction/replay behavior in the CLI/model.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/cai/sdk/agents/models/openai_chatcompletions.py Adds explicit model cleanup and a support-interval auto-compact trigger (raising ContextCompactedError) and avoids double-prepending history.
src/cai/repl/commands/memory.py Improves summary prompt/formatting and ensures temporary summarizer models are cleaned up in a finally block.
src/cai/cli.py Catches ContextCompactedError to restart Runner and replays the last user task; prints auto-compact status and includes a CLI-side interval compaction block.


Comment on lines +3598 to +3605
except Exception as _ce:
    _console.print(f"[red]Auto-compact error: {_ce}[/red]")
# Always abort the current runner invocation so the outer loop
# can restart with our freshly cleared context.
raise ContextCompactedError(
    f"Context compacted after {_asst_count} LLM responses "
    f"(threshold {_support_interval})"
)

Copilot AI Apr 3, 2026


In the CAI_SUPPORT_INTERVAL trigger, ContextCompactedError is raised unconditionally after the summarization attempt, even if summarization/compaction fails (e.g., import/runtime error). In that failure case, message_history may remain unchanged and the next Runner restart will immediately hit the same condition again, potentially causing an infinite restart loop. Consider only raising ContextCompactedError after successfully clearing/reseeding history (or add a guard/backoff flag so failures fall back to normal execution and don’t retrigger immediately).

# many tool-call rounds per single user input — are handled correctly.
_support_model = os.getenv("CAI_SUPPORT_MODEL")
_support_interval_raw = os.getenv("CAI_SUPPORT_INTERVAL")
if _support_model and _support_interval_raw:

Copilot AI Apr 3, 2026


This CLI-level CAI_SUPPORT_INTERVAL auto-compact runs whenever CAI_SUPPORT_MODEL/CAI_SUPPORT_INTERVAL are set, but it does not check CAI_AUTO_COMPACT. As a result, setting CAI_AUTO_COMPACT=false will still trigger count-based compaction from the CLI loop. Consider honoring the same disable flag here (or introducing a dedicated flag for support-interval compaction) so users can reliably turn auto-compaction off.

Suggested change

-if _support_model and _support_interval_raw:
+_auto_compact_raw = os.getenv("CAI_AUTO_COMPACT", "true")
+_auto_compact_enabled = _auto_compact_raw.strip().lower() not in {
+    "0", "false", "no", "off"
+}
+if _auto_compact_enabled and _support_model and _support_interval_raw:

Comment on lines +3566 to +3569
if self.agent_name not in COMPACTED_SUMMARIES:
    COMPACTED_SUMMARIES[self.agent_name] = []
    APPLIED_MEMORY_IDS[self.agent_name] = []
COMPACTED_SUMMARIES[self.agent_name] = [_summary]

Copilot AI Apr 3, 2026


When applying the new in-memory summary, COMPACTED_SUMMARIES[self.agent_name] is overwritten but APPLIED_MEMORY_IDS is only initialized for the first-time case. If the agent already has entries, the old APPLIED_MEMORY_IDS[self.agent_name] values remain and no longer correspond to the newly applied summary, which can make /memory status misleading. Consider clearing/updating APPLIED_MEMORY_IDS[self.agent_name] alongside COMPACTED_SUMMARIES when overwriting the summary (even if no memory_id is available).

Suggested change

-if self.agent_name not in COMPACTED_SUMMARIES:
-    COMPACTED_SUMMARIES[self.agent_name] = []
-    APPLIED_MEMORY_IDS[self.agent_name] = []
-COMPACTED_SUMMARIES[self.agent_name] = [_summary]
+COMPACTED_SUMMARIES[self.agent_name] = [_summary]
+APPLIED_MEMORY_IDS[self.agent_name] = []
