A self-extending AI agent with persistent memory, sandboxed tool creation, budget-aware model routing, and cloud access via MCP.
Built with LangGraph, Supabase/pgvector, Docker, and FastMCP. Developed across 8 disciplined sprints, each adding a distinct capability layer.
Most LangGraph agent repos are single-feature demos: a memory example here, a tool-calling example there. cairn is an integrated system where every piece works together, designed for individual developers who want to build and run a personal AI agent without enterprise infrastructure.
- Self-Extending Metatool System with Human Approval: The agent can write its own tools, test them in a Docker sandbox, and register them in a database as "pending"; each new tool then requires explicit human review and approval via CLI before it goes live. Self-extending agents exist (see Related Projects), but most auto-promote new capabilities. cairn's full pipeline (create → sandbox test → DB registration → human review → approval → dynamic import) prioritizes safety over convenience.
- 4-Tier Model Routing with Budget Caps: YAML-driven rules route tasks to the cheapest capable model (Qwen 3 8B → Qwen 3 32B → Claude Sonnet → Claude Extended). Daily spend tracking auto-downgrades to local models when you hit your budget. No LiteLLM dependency, just a simple, readable routing config.
- Daily Research Digest Pipeline: A recurring daemon task scrapes configurable news sources, pre-filters items by embedding similarity against your SCMS project memories, reranks them with a cross-encoder (cairn-rank), then summarizes with a local 32B model. Three scoring layers (embedding similarity, cross-encoder relevance, and LLM scoring) are tracked against human judgment in an evaluation pipeline. Approved items become permanent memories that improve future filtering, closing the feedback loop. Runs for ~$0.03/month.
- MCP Server with OAuth 2.1: Your agent's memory and task queue are accessible from claude.ai, Claude Desktop, Claude Code, and mobile via a Railway-deployed MCP server with full OAuth 2.1 (DCR + PKCE). One of the few Python FastMCP + OAuth reference implementations available.
- Persistent Memory (SCMS): Supabase + pgvector with 1536-dimensional embeddings and HNSW cosine search. The agent remembers across sessions: projects, decisions, learnings, and context. Not a demo; this is the actual persistence layer the whole system runs on.
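The embedding pre-filter used by the digest pipeline can be illustrated without any vector database. This is a minimal sketch, not cairn's implementation: the `prefilter` name, the item/memory shapes, and the 0.5 threshold are all assumptions for illustration (cairn stores vectors in pgvector and searches with HNSW cosine indexes).

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def prefilter(items: list[dict], memories: list[list[float]],
              threshold: float = 0.5) -> list[dict]:
    """Keep digest items whose best similarity against any stored
    project-memory embedding clears the threshold, best-first."""
    scored = []
    for item in items:
        best = max((cosine(item["embedding"], m) for m in memories), default=0.0)
        if best >= threshold:
            scored.append((best, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]
```

Items that survive this cheap gate would then go on to the more expensive cross-encoder and LLM scoring stages.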
```
┌───────────────────────────────────────────────────────────┐
│                       CLI (main.py)                       │
│   Modes: single task, interactive REPL, daemon, digest    │
└──────────────┬───────────────────────────────┬────────────┘
               │                               │
               ▼                               ▼
┌──────────────────────────┐     ┌─────────────────────────┐
│   LangGraph StateGraph   │     │   Daemon (daemon.py)    │
│                          │     │ Polling loop + croniter │
│ START                    │     │ Recurring cron tasks    │
│  → CLASSIFY (no LLM)     │     │ Daily digest pipeline   │
│    → PLAN (LLM)          │     └─────────────────────────┘
│      → ACT (LLM+tools)   │
│        → REFLECT (LLM)   │
│           ...or END      │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────┐
│                  Tool Layer (16+ tools)                  │
│ Data:    web_search, url_reader, arxiv_search,           │
│          github_search                                   │
│ Files:   file_reader, file_writer, note_taker            │
│ Code:    code_executor (Docker sandbox)                  │
│ SCMS:    scms_search, scms_store                         │
│ Project: create_project, update_project, archive_project │
│ Meta:    create_tool, test_tool, list_pending_tools      │
└──────────┬───────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────┐  ┌──────────────────┐  ┌───────────────┐
│ Supabase/pgvector   │  │ Docker Sandbox   │  │ Model Router  │
│ 5 tables + RPC      │  │ 256MB, no net    │  │ YAML rules    │
│ 1536-dim embeddings │  │ 60s timeout      │  │ 4 tiers       │
└─────────────────────┘  └──────────────────┘  └───────────────┘

        ┌────────────────────────────────────────────┐
        │           MCP Server (Railway)             │
        │      FastMCP + OAuth 2.1 (DCR + PKCE)      │
        │  16 tools · claude.ai / Desktop / mobile   │
        └────────────────────────────────────────────┘
```
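The Classify → Plan → Act → Reflect loop in the diagram can be sketched framework-free. This is a deliberately simplified illustration of the control flow, not cairn's LangGraph implementation; the field names, the one-keyword classifier, and the canned plan are all assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    task: str
    task_type: str = ""
    plan: list[str] = field(default_factory=list)
    results: list[str] = field(default_factory=list)
    done: bool = False

def classify(state: AgentState) -> AgentState:
    # No LLM call: deterministic keyword matching, as in cairn's classifier
    state.task_type = "research" if "search" in state.task.lower() else "recall"
    return state

def plan(state: AgentState) -> AgentState:
    # An LLM generates the plan in cairn; canned here for illustration
    state.plan = [f"step for: {state.task}"]
    return state

def act(state: AgentState) -> AgentState:
    # Tool execution: consume the next plan step
    state.results.append(f"ran {state.plan.pop(0)}")
    return state

def reflect(state: AgentState) -> AgentState:
    # Decide whether to loop back to ACT or END
    state.done = not state.plan
    return state

def run(task: str) -> AgentState:
    state = plan(classify(AgentState(task)))
    while not state.done:
        state = reflect(act(state))
    return state
```

In the real system each node is a LangGraph `StateGraph` node and REFLECT's continuation decision is a conditional edge back to ACT or on to END.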
- Python 3.12+
- uv package manager
- Ollama for local LLM inference
- Docker (recommended for code execution; a restricted subprocess fallback is available with `--allow-subprocess`)
- A Supabase project (free tier works)
```bash
git clone https://github.com/reallyreallyryan/cairn.git
cd cairn

# Install dependencies
uv sync

# Pull local models
ollama pull qwen3:32b
ollama pull qwen3:8b

# Set up Supabase:
# 1. Create a project at supabase.com
# 2. Enable the pgvector extension (Database > Extensions)
# 3. Run the migrations in order: scms/migrations/001_initial.sql through 005
# 4. Copy your Project URL + anon key

# Build the Docker sandbox image (required for code execution and metatool testing)
docker build -t cairn-sandbox -f sandbox/Dockerfile .

# Configure
cp .env.example .env
# Edit .env with your Supabase URL, API keys, etc.

# Start Ollama (auto-started by cairn if needed, or start manually)
ollama serve
```
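The sandbox limits shown in the architecture overview (256 MB memory, no network, 60 s timeout) map naturally onto `docker run` flags. This sketch only builds the argv; the exact flags and image entrypoint that cairn's `sandbox/manager.py` uses may differ, and the `--pids-limit` value here is an assumption:

```python
def sandbox_cmd(code: str, image: str = "cairn-sandbox") -> list[str]:
    """Build a docker run argv enforcing sandbox limits:
    no network, a hard memory cap, and a bounded runtime."""
    return [
        "docker", "run", "--rm",
        "--network", "none",   # no network access inside the sandbox
        "--memory", "256m",    # hard memory cap
        "--pids-limit", "64",  # assumed value: contain fork bombs
        image,
        "timeout", "60",       # 60 s wall-clock limit for the payload
        "python", "-c", code,
    ]
```

The caller would execute this with `subprocess.run(sandbox_cmd(code), capture_output=True, timeout=70)`, keeping a slightly larger outer timeout as a second line of defense.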
```bash
# Run your first task
uv run python main.py "What projects am I working on?"

# Single task
python main.py "Search the web for LangGraph tutorials"

# Interactive mode
python main.py -i

# Use cloud model explicitly
python main.py --model cloud "Architect a REST API for task management"

# Daily research digest
python main.py --digest                 # Run manually
python main.py --review-digest          # Review & approve/reject items into memory
python main.py --digest-status          # Check last run stats
python main.py --digest-eval            # Run evaluation report on approval/rejection history
python main.py --compile-digest         # Compile approved items into deep-dive + briefing docs
python main.py --compile-digest --compile-since 2026-03-21   # Filter by date

# Task queue & daemon
python main.py --queue "Research MCP best practices" --priority 2
python main.py --daemon                 # Background task processing

# Metatool management
python main.py --pending-tools          # List tools awaiting approval
python main.py --review-tool <id>       # Review tool code + sandbox test results
python main.py --approve-tool <id>      # Approve for production use
```

cairn was built incrementally across 8 sprints. Each sprint added a distinct capability layer, and each sprint brief was handed to Claude Code for implementation.
| Sprint | Focus | What Was Added |
|---|---|---|
| 1 | Foundation | SCMS + pgvector memory, 10 tools, CLI with single task and interactive modes |
| 2 | Intelligence | Plan→Act→Reflect loop, keyword classifier, multi-step planning, decision logging |
| 3 | Security | Docker sandbox, metatool system, human approval workflow, dynamic tool loading |
| 4 | Autonomy | Model routing, budget tracking, task queue, daemon mode, notifications |
| 5a | Cloud Access | MCP server, OAuth 2.1, Railway deployment, OpenAI embedding migration |
| 5b | Digest Pipeline | Daily research digest, 4-tier routing (Qwen 3 upgrade), local 32B summarization |
| 5c | Hardening | Classifier default fix (multi → research), archived DB status, MCP ToolAnnotations, httpx session fix, ddgs migration |
| 6 | Digest Relevance | Embedding-based pre-filter for digest pipeline, per-source similarity thresholds, cold-start bypass |
| 7 | Digest Hardening | Few-shot calibration from approval history, digest dedup on ingest, evaluation pipeline with threshold analysis, digest sources expanded to 16, digest_eval MCP tool |
| 8 | Security + Reranking | gitleaks pre-commit hook, cairn-rank cross-encoder integration into digest pipeline, three-layer scoring eval |
| 8b | Digest Compiler | Compile approved digest items into deep-dive and briefing documents with full article fetching and LLM summarization |
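Sprint 5a's OAuth 2.1 support requires PKCE. The verifier/challenge derivation is standardized in RFC 7636 and small enough to sketch; this shows only that piece, not cairn's full DCR flow, and the function name is illustrative:

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Generate a PKCE code_verifier and its S256 code_challenge per
    RFC 7636: challenge = BASE64URL(SHA256(ASCII(verifier))), unpadded."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge
```

The client sends the challenge with the authorization request and the verifier with the token request; the server recomputes the challenge and rejects the exchange on mismatch.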
```
├── agent/                  # LangGraph agent
│   ├── graph.py            # StateGraph: classify → plan → act → reflect
│   ├── classifier.py       # Keyword-based task classification (no LLM call)
│   ├── classify.py         # CLASSIFY node: task type, project detection, SCMS context
│   ├── plan.py             # PLAN node: LLM plan generation, step parsing
│   ├── act.py              # ACT node: tool execution, fallback dispatch
│   ├── reflect.py          # REFLECT node: result evaluation, continuation logic
│   ├── utils.py            # Shared utilities: get_llm(), clean_output()
│   ├── nodes.py            # Re-exports from split modules (backward compat)
│   ├── state.py            # AgentState TypedDict
│   ├── model_router.py     # Complexity → tier → budget check → LLM instance
│   ├── daemon.py           # Background task queue processor
│   ├── digest.py           # Daily research digest orchestrator
│   ├── evaluation.py       # Digest evaluation: metrics, thresholds, reports
│   ├── compile_digest.py   # Digest compiler: article fetching + LLM summaries
│   ├── notifications.py    # macOS + file log notifications
│   └── tools/              # 16+ tools (web, files, code, SCMS, project, metatool)
│       ├── web_search.py
│       ├── url_reader.py
│       ├── arxiv_search.py
│       ├── github_search.py
│       ├── file_reader.py
│       ├── file_writer.py
│       ├── note_taker.py
│       ├── code_executor.py
│       ├── scms_tools.py
│       ├── project_tools.py  # create_project, update_project, archive_project
│       ├── metatool.py
│       └── custom/           # Agent-created tools (after human approval)
├── mcp_server/             # FastMCP server for cloud access
│   ├── server.py           # 16 MCP tools, OAuth 2.1
│   └── config.py
├── config/                 # YAML configs
│   ├── model_routing.yaml
│   ├── sandbox_policy.yaml
│   └── digest_sources.yaml
├── scms/                   # Shared Context Memory Store
│   ├── client.py           # SCMSClient: CRUD + semantic search
│   ├── embeddings.py       # OpenAI text-embedding-3-small
│   └── migrations/         # Supabase SQL migrations (001–005)
├── sandbox/                # Docker sandbox
│   ├── Dockerfile
│   └── manager.py          # Container lifecycle, code injection, cleanup
├── tests/                  # Integration tests (pytest)
│   ├── test_project_crud.py
│   ├── test_metatool_loading.py
│   ├── test_digest_dedup.py
│   ├── test_digest_fewshot.py
│   ├── test_evaluation.py
│   └── test_rerank.py
└── main.py                 # CLI entry point
```
```bash
uv run pytest
```

Tests use a mocked SCMS client, so no Supabase or Docker is required to run them.
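The mocked-client pattern the test suite relies on can be sketched with `unittest.mock`. The `semantic_search` method name and return shape below are assumptions for illustration, not necessarily `SCMSClient`'s real interface:

```python
from unittest.mock import MagicMock

def recall_projects(client) -> list[str]:
    """Toy function under test: asks the memory store for project memories."""
    rows = client.semantic_search("projects", top_k=5)
    return [row["content"] for row in rows]

def test_recall_projects_with_mocked_scms():
    # No Supabase connection: the mock stands in for SCMSClient
    client = MagicMock()
    client.semantic_search.return_value = [
        {"content": "cairn"},
        {"content": "cairn-rank"},
    ]
    assert recall_projects(client) == ["cairn", "cairn-rank"]
    client.semantic_search.assert_called_once_with("projects", top_k=5)
```

Because the agent code takes the client as a parameter rather than constructing it internally, every integration test can swap in a mock the same way.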
Key choices and their tradeoffs:
- Keyword classifier over LLM classifier: Task classification uses deterministic keyword matching, not an LLM call. Faster, cheaper, predictable. Ambiguous tasks fall back to "research" with research-focused tools.
- Supabase over SQLite: pgvector for semantic search, cloud accessibility from the MCP server, and a single source of truth. Requires network connectivity, but enables the entire cloud-access story.
- Flat cost estimates over token tracking: Simple $0/$0.01/$0.03 per-call tiers rather than token-level metering. Sufficient for cost tracking; token-level metering is deferred to future work.
- Human approval for agent-created tools: The metatool pipeline requires explicit CLI approval. No auto-promotion, ever. This is a deliberate safety decision.
- Two-stage tool promotion: Sandbox-built tools go live in the daemon/CLI after human approval (stage 1). Promoting a tool to the MCP server for cloud clients requires Claude Code review and a Railway redeploy (stage 2). No tool reaches claude.ai or Claude Desktop without passing both gates.
- Local-first model routing: The default tier is local (free). Cloud models are used only when the routing rules determine a task needs them, and budget exhaustion auto-downgrades back to local.
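The budget-aware, local-first routing described above can be condensed into a few lines. The tier names and flat per-call costs come from this README; the 0-3 complexity scale, the rule shapes, and the downgrade target are assumptions standing in for cairn's `model_routing.yaml`:

```python
TIERS = [  # cheapest-first, mirroring the four tiers above
    {"name": "qwen3:8b",        "local": True,  "cost": 0.00},
    {"name": "qwen3:32b",       "local": True,  "cost": 0.00},
    {"name": "claude-sonnet",   "local": False, "cost": 0.01},
    {"name": "claude-extended", "local": False, "cost": 0.03},
]

def route(complexity: int, spent_today: float, daily_cap: float = 5.00) -> str:
    """Pick the cheapest capable tier for a task of the given complexity
    (assumed 0-3 scale). Cloud tiers are refused once today's spend plus
    this call would exceed the daily cap, auto-downgrading to local 32B."""
    tier = TIERS[max(0, min(complexity, len(TIERS) - 1))]
    if not tier["local"] and spent_today + tier["cost"] > daily_cap:
        return TIERS[1]["name"]  # budget exhausted: fall back to local
    return tier["name"]
```

Keeping this logic in a plain config plus a small function is the "No LiteLLM dependency" tradeoff: less capable than a full proxy, but readable at a glance.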
cairn is designed to be cheap to run daily:
| Operation | Model | Cost |
|---|---|---|
| Simple recall / notes | Qwen 3 8B (local) | $0.00 |
| Summarization / digest | Qwen 3 32B (local) | $0.00 |
| Research / multi-step | Claude Sonnet (cloud) | ~$0.01/task |
| Complex technical | Claude Sonnet extended | ~$0.03/task |
| Daily digest (full run) | Local + embedding | ~$0.001/day |
| Daily budget cap | Configurable | Default $5.00 |
The daily digest pipeline runs almost entirely on local models. The only cloud cost is embedding approved items via OpenAI text-embedding-3-small (~$0.03/month).
- Improve digest relevance scoring (embedding pre-filter + few-shot calibration from approval/rejection history)
- Evaluation pipeline using digest approval/rejection data
- Cross-encoder reranking via cairn-rank (three-layer scoring comparison)
- Security hardening (gitleaks pre-commit hook)
- Memory deduplication and aging
- 24/7 daemon deployment on Railway
- Multi-agent collaboration patterns
cairn exists in a growing ecosystem of autonomous agent tools. These projects explore overlapping ideas at different scales:
- OpenClaw: Personal AI assistant with 214k+ stars. Connects to messaging platforms with self-extending skills. Different architecture (gateway vs. research agent), and it auto-promotes new capabilities without human approval.
- NVIDIA OpenShell: Enterprise sandbox for self-evolving agents with policy controls; requires DGX/RTX hardware. cairn targets the same safety-first philosophy at a scale that runs on a laptop with Ollama.
- LangGraph: The state-machine framework cairn is built on. cairn's Classify→Plan→Act→Reflect loop is one opinionated implementation of LangGraph's primitives.
- LiteLLM: Production LLM proxy with routing and budget management at enterprise scale. cairn's model router is a lightweight alternative for solo developers who want the same idea in a YAML file.
- e2b: Cloud sandboxing for AI code execution. cairn uses a simpler local Docker sandbox with resource limits.
Issues and PRs welcome. See CONTRIBUTING.md for details.
If you build something interesting with cairn, I'd love to hear about it.
MIT β see LICENSE for details.