
cairn

A self-extending AI agent with persistent memory, sandboxed tool creation, budget-aware model routing, and cloud access via MCP.

Built with LangGraph, Supabase/pgvector, Docker, and FastMCP. Developed across 8 disciplined sprints, each adding a distinct capability layer.


What Makes This Different

Most LangGraph agent repos are single-feature demos: a memory example here, a tool-calling example there. cairn is an integrated system where every piece works together, designed for individual developers who want to build and run a personal AI agent without enterprise infrastructure.

  • Self-Extending Metatool System with Human Approval: The agent can write its own tools, test them in a Docker sandbox, register them in a database as "pending," and require explicit human review and approval via CLI before they go live. Self-extending agents exist (see Related Projects), but most auto-promote new capabilities. cairn's full pipeline (create → sandbox test → DB registration → human review → approval → dynamic import) prioritizes safety over convenience.

  • 4-Tier Model Routing with Budget Caps: YAML-driven rules route tasks to the cheapest capable model (Qwen 3 8B → Qwen 3 32B → Claude Sonnet → Claude Extended). Daily spend tracking auto-downgrades to local models when you hit your budget. No LiteLLM dependency; just a simple, readable routing config.

  • Daily Research Digest Pipeline: A recurring daemon task scrapes configurable news sources, pre-filters items by embedding similarity against your SCMS project memories, reranks with a cross-encoder (cairn-rank), then summarizes with a local 32B model. Three scoring layers (embedding similarity, cross-encoder relevance, and LLM scoring) are tracked against human judgment in an evaluation pipeline. Approved items become permanent memories that improve future filtering, closing a feedback loop. Runs for ~$0.03/month.

  • MCP Server with OAuth 2.1: Your agent's memory and task queue are accessible from claude.ai, Claude Desktop, Claude Code, and mobile via a Railway-deployed MCP server with full OAuth 2.1 (DCR + PKCE). One of the few Python FastMCP + OAuth reference implementations available.

  • Persistent Memory (SCMS): Supabase + pgvector with 1536-dimensional embeddings and HNSW cosine search. The agent remembers across sessions: projects, decisions, learnings, and context. Not a demo; this is the actual persistence layer the whole system runs on.
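The embedding pre-filter in the digest pipeline boils down to a cosine-similarity gate against project memories. The sketch below is an illustrative reimplementation, not cairn's actual code; the names `prefilter` and `SIM_THRESHOLD` are hypothetical stand-ins for the per-source thresholds in digest_sources.yaml:

```python
import math

SIM_THRESHOLD = 0.35  # hypothetical per-source threshold


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prefilter(item_embedding: list[float],
              memory_embeddings: list[list[float]]) -> tuple[bool, float]:
    """Keep a digest item if it is similar enough to any project memory."""
    best = max((cosine(item_embedding, m) for m in memory_embeddings),
               default=0.0)  # cold start: no memories yet
    return best >= SIM_THRESHOLD, best
```

Because each approved item is stored back into SCMS, the `memory_embeddings` pool grows over time, which is what makes the filter improve with use.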

Architecture

┌──────────────────────────────────────────────────────────┐
│                       CLI (main.py)                      │
│   Modes: single task, interactive REPL, daemon, digest   │
└──────────────┬───────────────────────────┬───────────────┘
               │                           │
               ▼                           ▼
┌───────────────────────────┐    ┌──────────────────────────┐
│   LangGraph StateGraph    │    │    Daemon (daemon.py)    │
│                           │    │  Polling loop + croniter │
│  START                    │    │  Recurring cron tasks    │
│    → CLASSIFY (no LLM)    │    │  Daily digest pipeline   │
│    → PLAN    (LLM)        │    └──────────────────────────┘
│    → ACT     (LLM+tools)  │
│    → REFLECT (LLM)        │
│    ...or END              │
└──────────────┬────────────┘
               │
               ▼
┌───────────────────────────────────────────────────────────┐
│                   Tool Layer (16+ tools)                  │
│  Data:    web_search, url_reader, arxiv_search,           │
│           github_search                                   │
│  Files:   file_reader, file_writer, note_taker            │
│  Code:    code_executor (Docker sandbox)                  │
│  SCMS:    scms_search, scms_store                         │
│  Project: create_project, update_project, archive_project │
│  Meta:    create_tool, test_tool, list_pending_tools      │
└──────────────┬────────────────────────────────────────────┘
               │
               ▼
┌─────────────────────┐  ┌──────────────────┐  ┌──────────────┐
│  Supabase/pgvector  │  │  Docker Sandbox  │  │ Model Router │
│  5 tables + RPC     │  │  256MB, no net   │  │ YAML rules   │
│  1536-dim embeddings│  │  60s timeout     │  │ 4 tiers      │
└─────────────────────┘  └──────────────────┘  └──────────────┘

               ┌───────────────────────────────────────────┐
               │  MCP Server (Railway)                     │
               │  FastMCP + OAuth 2.1 (DCR + PKCE)         │
               │  16 tools · claude.ai / Desktop / mobile  │
               └───────────────────────────────────────────┘
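In LangGraph terms, the REFLECT → (ACT | END) loop is a conditional edge. A minimal pure-Python sketch of that routing decision follows; the state keys are illustrative guesses, not cairn's actual AgentState fields:

```python
from typing import TypedDict


class AgentState(TypedDict):
    """Illustrative subset of the agent state; field names are assumptions."""
    plan_steps: list[str]
    current_step: int
    max_steps: int


def route_after_reflect(state: AgentState) -> str:
    """Decide whether the graph loops back to ACT or terminates.

    With langgraph, a function like this would be passed to
    builder.add_conditional_edges("reflect", route_after_reflect, ...).
    """
    if state["current_step"] >= state["max_steps"]:
        return "END"  # safety cap on loop iterations
    if state["current_step"] < len(state["plan_steps"]):
        return "act"  # more plan steps remain
    return "END"      # plan exhausted
```

The safety cap matters: a reflect node that can always route back to act needs a hard iteration limit, or a confused plan loops forever.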

Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager
  • Ollama for local LLM inference
  • Docker (recommended for code execution; restricted subprocess fallback available with --allow-subprocess)
  • A Supabase project (free tier works)

Setup

git clone https://github.com/reallyreallyryan/cairn.git
cd cairn

# Install dependencies
uv sync

# Pull local models
ollama pull qwen3:32b
ollama pull qwen3:8b

# Set up Supabase:
# 1. Create a project at supabase.com
# 2. Enable the pgvector extension (Database > Extensions)
# 3. Run the migrations in order: scms/migrations/001_initial.sql through 005
# 4. Copy your Project URL + anon key

# Build the Docker sandbox image (required for code execution and metatool testing)
docker build -t cairn-sandbox -f sandbox/Dockerfile .

# Configure
cp .env.example .env
# Edit .env with your Supabase URL, API keys, etc.

# Start Ollama (auto-started by cairn if needed, or start manually)
ollama serve

# Run your first task
uv run python main.py "What projects am I working on?"

Usage

# Single task
python main.py "Search the web for LangGraph tutorials"

# Interactive mode
python main.py -i

# Use cloud model explicitly
python main.py --model cloud "Architect a REST API for task management"

# Daily research digest
python main.py --digest              # Run manually
python main.py --review-digest       # Review & approve/reject items into memory
python main.py --digest-status       # Check last run stats
python main.py --digest-eval         # Run evaluation report on approval/rejection history
python main.py --compile-digest      # Compile approved items into deep-dive + briefing docs
python main.py --compile-digest --compile-since 2026-03-21  # Filter by date

# Task queue & daemon
python main.py --queue "Research MCP best practices" --priority 2
python main.py --daemon              # Background task processing

# Metatool management
python main.py --pending-tools       # List tools awaiting approval
python main.py --review-tool <id>    # Review tool code + sandbox test results
python main.py --approve-tool <id>   # Approve for production use
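The metatool commands above walk a tool through a small status state machine. The transitions below are a sketch of that lifecycle; the status and event names other than "pending"/"approved" are guesses at the schema, not cairn's actual values:

```python
# Hypothetical lifecycle transitions for agent-created tools.
TRANSITIONS = {
    ("created", "sandbox_pass"): "pending",   # sandbox test succeeded
    ("created", "sandbox_fail"): "rejected",  # never reaches human review
    ("pending", "approve"): "approved",       # human ran --approve-tool
    ("pending", "reject"): "rejected",        # human declined
}


def advance(status: str, event: str) -> str:
    """Apply one lifecycle event; any unknown transition is refused."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {status} + {event}") from None
```

Only "approved" tools get dynamically imported into the running agent; everything else stays inert on disk.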

Sprint History

cairn was built incrementally across 8 sprints. Each sprint added a distinct capability layer, and each sprint brief was handed to Claude Code for implementation.

| Sprint | Focus | What Was Added |
|--------|-------|----------------|
| 1 | Foundation | SCMS + pgvector memory, 10 tools, CLI with single-task and interactive modes |
| 2 | Intelligence | Plan→Act→Reflect loop, keyword classifier, multi-step planning, decision logging |
| 3 | Security | Docker sandbox, metatool system, human approval workflow, dynamic tool loading |
| 4 | Autonomy | Model routing, budget tracking, task queue, daemon mode, notifications |
| 5a | Cloud Access | MCP server, OAuth 2.1, Railway deployment, OpenAI embedding migration |
| 5b | Digest Pipeline | Daily research digest, 4-tier routing (Qwen 3 upgrade), local 32B summarization |
| 5c | Hardening | Classifier default fix (multi→research), archived DB status, MCP ToolAnnotations, httpx session fix, ddgs migration |
| 6 | Digest Relevance | Embedding-based pre-filter for the digest pipeline, per-source similarity thresholds, cold-start bypass |
| 7 | Digest Hardening | Few-shot calibration from approval history, digest dedup on ingest, evaluation pipeline with threshold analysis, digest sources expanded to 16, digest_eval MCP tool |
| 8 | Security + Reranking | gitleaks pre-commit hook, cairn-rank cross-encoder integration into the digest pipeline, three-layer scoring eval |
| 8b | Digest Compiler | Compile approved digest items into deep-dive and briefing documents with full article fetching and LLM summarization |

Project Structure

├── agent/                # LangGraph agent
│   ├── graph.py          # StateGraph: classify → plan → act → reflect
│   ├── classifier.py     # Keyword-based task classification (no LLM call)
│   ├── classify.py       # CLASSIFY node: task type, project detection, SCMS context
│   ├── plan.py           # PLAN node: LLM plan generation, step parsing
│   ├── act.py            # ACT node: tool execution, fallback dispatch
│   ├── reflect.py        # REFLECT node: result evaluation, continuation logic
│   ├── utils.py          # Shared utilities: get_llm(), clean_output()
│   ├── nodes.py          # Re-exports from split modules (backward compat)
│   ├── state.py          # AgentState TypedDict
│   ├── model_router.py   # Complexity → tier → budget check → LLM instance
│   ├── daemon.py         # Background task queue processor
│   ├── digest.py         # Daily research digest orchestrator
│   ├── evaluation.py     # Digest evaluation: metrics, thresholds, reports
│   ├── compile_digest.py # Digest compiler: article fetching + LLM summaries
│   ├── notifications.py  # macOS + file log notifications
│   └── tools/            # 16+ tools (web, files, code, SCMS, project, metatool)
│       ├── web_search.py
│       ├── url_reader.py
│       ├── arxiv_search.py
│       ├── github_search.py
│       ├── file_reader.py
│       ├── file_writer.py
│       ├── note_taker.py
│       ├── code_executor.py
│       ├── scms_tools.py
│       ├── project_tools.py  # create_project, update_project, archive_project
│       ├── metatool.py
│       └── custom/       # Agent-created tools (after human approval)
├── mcp_server/           # FastMCP server for cloud access
│   ├── server.py         # 16 MCP tools, OAuth 2.1
│   └── config.py
├── config/               # YAML configs
│   ├── model_routing.yaml
│   ├── sandbox_policy.yaml
│   └── digest_sources.yaml
├── scms/                 # Shared Context Memory Store
│   ├── client.py         # SCMSClient: CRUD + semantic search
│   ├── embeddings.py     # OpenAI text-embedding-3-small
│   └── migrations/       # Supabase SQL migrations (001–005)
├── sandbox/              # Docker sandbox
│   ├── Dockerfile
│   └── manager.py        # Container lifecycle, code injection, cleanup
├── tests/                # Integration tests (pytest)
│   ├── test_project_crud.py
│   ├── test_metatool_loading.py
│   ├── test_digest_dedup.py
│   ├── test_digest_fewshot.py
│   ├── test_evaluation.py
│   └── test_rerank.py
└── main.py               # CLI entry point

Testing

uv run pytest

Tests use a mocked SCMS client; no Supabase or Docker is required to run them.
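A test in that style might look like the sketch below; the tool function and its signature are illustrative stand-ins, not copied from the real tests in tests/:

```python
from unittest.mock import MagicMock


def scms_search(client, query: str) -> list[str]:
    """Toy stand-in for an SCMS search tool: returns memory contents."""
    rows = client.search(query=query, limit=5)
    return [row["content"] for row in rows]


def test_scms_search_returns_contents():
    # MagicMock replaces the real SCMSClient, so no Supabase is needed.
    client = MagicMock()
    client.search.return_value = [{"content": "cairn uses pgvector"}]

    assert scms_search(client, "pgvector") == ["cairn uses pgvector"]
    client.search.assert_called_once_with(query="pgvector", limit=5)
```

Mocking at the client boundary keeps the suite fast and lets it assert on both return values and the exact query the tool issued.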

Design Decisions

Key choices and their tradeoffs:

  • Keyword classifier over LLM classifier: Task classification uses deterministic keyword matching, not an LLM call. It is faster, cheaper, and predictable, and it falls back to "research" with research-focused tools for ambiguous tasks.
  • Supabase over SQLite: pgvector for semantic search, cloud accessibility from the MCP server, and a single source of truth. Requires network connectivity but enables the entire cloud access story.
  • Flat cost estimates over token tracking: Simple $0/$0.01/$0.03 per-call tiers rather than token-level metering. Sufficient for budget enforcement; token-level tracking is deferred to future work.
  • Human approval for agent-created tools: The metatool pipeline requires explicit CLI approval. No auto-promotion, ever. This is a deliberate safety decision.
  • Two-stage tool promotion: Sandbox-built tools go live in the daemon/CLI after human approval (stage 1). Promoting a tool to the MCP server for cloud clients requires Claude Code review and a Railway redeploy (stage 2). No tool reaches claude.ai or Claude Desktop without passing both gates.
  • Local-first model routing: The default tier is local (free). Cloud models are used only when routing rules determine a task needs them, and budget exhaustion auto-downgrades to local.
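The keyword classifier in the first bullet can be approximated as a lookup over keyword sets. The categories and keywords below are illustrative; cairn's real lists live in agent/classifier.py:

```python
# Illustrative keyword → task-type mapping; not cairn's actual keyword lists.
KEYWORDS = {
    "code": {"implement", "debug", "refactor", "script"},
    "recall": {"remember", "what did", "projects"},
    "research": {"search", "find", "compare", "latest"},
}


def classify(task: str) -> str:
    """Deterministic classification: first category with a keyword hit.

    Ambiguous tasks fall back to "research", matching the post-Sprint-5c
    default described in the sprint history.
    """
    text = task.lower()
    for task_type, keys in KEYWORDS.items():
        if any(k in text for k in keys):
            return task_type
    return "research"  # safe default: research-focused tools
```

Because this is a pure string scan, classification costs zero tokens and is trivially unit-testable, which is exactly the tradeoff the bullet describes.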

Cost

cairn is designed to be cheap to run daily:

| Operation | Model | Cost |
|-----------|-------|------|
| Simple recall / notes | Qwen 3 8B (local) | $0.00 |
| Summarization / digest | Qwen 3 32B (local) | $0.00 |
| Research / multi-step | Claude Sonnet (cloud) | ~$0.01/task |
| Complex technical | Claude Sonnet extended | ~$0.03/task |
| Daily digest (full run) | Local + embedding | ~$0.001/day |
| Daily budget cap | Configurable | Default $5.00 |

The daily digest pipeline runs almost entirely on local models. The only cloud cost is embedding approved items via OpenAI text-embedding-3-small (~$0.03/month).
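The budget math is flat-rate accounting, so the downgrade check fits in a few lines. The tier names and the $5.00 default mirror the table above; the function name and structure are hypothetical, not cairn's model_router.py:

```python
# Flat per-call cost tiers, mirroring the cost table above.
TIER_COST = {
    "local-8b": 0.00,
    "local-32b": 0.00,
    "sonnet": 0.01,
    "sonnet-extended": 0.03,
}
DAILY_CAP = 5.00  # configurable default


def pick_tier(requested: str, spent_today: float) -> str:
    """Downgrade to local when the requested tier would bust the daily cap."""
    if spent_today + TIER_COST[requested] > DAILY_CAP:
        return "local-32b"  # budget exhausted: fall back to a free local model
    return requested
```

At ~$0.01 per Sonnet task, the default cap allows roughly 500 cloud calls a day before the router downgrades everything to local.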

Roadmap

  • Improve digest relevance scoring (embedding pre-filter + few-shot calibration from approval/rejection history)
  • Evaluation pipeline using digest approval/rejection data
  • Cross-encoder reranking via cairn-rank (three-layer scoring comparison)
  • Security hardening (gitleaks pre-commit hook)
  • Memory deduplication and aging
  • 24/7 daemon deployment on Railway
  • Multi-agent collaboration patterns

Related Projects

cairn exists in a growing ecosystem of autonomous agent tools. These projects explore overlapping ideas at different scales:

  • OpenClaw: Personal AI assistant with 214k+ stars. Connects to messaging platforms with self-extending skills. Different architecture (gateway vs. research agent), and it auto-promotes new capabilities without human approval.
  • NVIDIA OpenShell: Enterprise sandbox for self-evolving agents with policy controls; requires DGX/RTX hardware. cairn targets the same safety-first philosophy at a scale that runs on a laptop with Ollama.
  • LangGraph: The state machine framework cairn is built on. cairn's Classify→Plan→Act→Reflect loop is one opinionated implementation of LangGraph's primitives.
  • LiteLLM: Production LLM proxy with routing and budget management at enterprise scale. cairn's model router is a lightweight alternative for solo developers who want the same idea in a YAML file.
  • e2b: Cloud sandboxing for AI code execution. cairn uses a simpler local Docker sandbox with resource limits.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md for details.

If you build something interesting with cairn, I'd love to hear about it.

License

MIT β€” see LICENSE for details.
