CoSA is a modular framework for building, training, and deploying specialized LLM-powered agents. It provides the infrastructure for Lupin, a voice-first conversational AI system with trust-aware human-in-the-loop decision making.
CoSA implements a collection of targeted agents, each specialized for specific tasks:
- Text generation and completion
- Mathematics and calculations
- Calendar management and scheduling
- Weather reporting
- Todo list management
- Code execution and debugging
- Hybrid TTS Streaming: Fast, reliable text-to-speech with no word truncation
- And more...
The system includes two high-performance TTS solutions optimized for different use cases:
Architecture: OpenAI TTS → FastAPI → WebSocket → Client
- Server: `stream_tts_hybrid()` forwards OpenAI chunks via WebSocket
- Client: Collects all chunks, then plays the complete audio file
- Benefits: 50% faster than complete file approach, zero truncation, universal compatibility
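The collect-then-play pattern above can be sketched in pure Python with asyncio; the names `fake_tts_chunks` and `hybrid_client` are illustrative stand-ins, not CoSA APIs:

```python
import asyncio

async def fake_tts_chunks():
    # Stand-in for the OpenAI TTS stream; chunk boundaries are arbitrary
    for chunk in [b"hel", b"lo ", b"wor", b"ld"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield chunk

async def hybrid_client(chunk_source):
    """Collect every chunk before playback, so no word is ever truncated."""
    buffer = bytearray()
    async for chunk in chunk_source:
        buffer.extend(chunk)
    # Only now would the complete audio be handed to the player
    return bytes(buffer)

audio = asyncio.run(hybrid_client(fake_tts_chunks()))
```

The trade-off is clear in the sketch: nothing plays until the last chunk arrives, which is why this mode favors reliability over latency.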
Architecture: ElevenLabs Streaming API → FastAPI → WebSocket → Client
- Server: Direct WebSocket streaming with progressive chunk delivery
- Client: Immediate playback of audio chunks as received
- Benefits: Ultra-low latency, real-time streaming, significantly faster than hybrid mode
- Use Case: Interactive conversations requiring immediate audio response
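For contrast, a sketch of the immediate-playback pattern; names are again hypothetical, and a real client would hand each chunk to an audio device rather than a callback:

```python
import asyncio

async def fake_elevenlabs_stream():
    # Stand-in for the ElevenLabs streaming API
    for chunk in [b"in", b"st", b"ant"]:
        await asyncio.sleep(0)
        yield chunk

async def streaming_client(chunk_source, play):
    """Play each chunk the moment it arrives; no buffering step."""
    async for chunk in chunk_source:
        play(chunk)

played = []
asyncio.run(streaming_client(fake_elevenlabs_stream(), played.append))
```

Playback begins on the first chunk, which is what makes this mode suitable for interactive conversation.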
Endpoints:
- `/api/get-audio`: Hybrid OpenAI approach for reliability
- `/api/get-audio-elevenlabs`: Instant ElevenLabs streaming for speed
- `/agents`: Individual agent implementations
  - `agent_base.py`: Abstract base class for all agents
  - `llm.py`, `llm_v0.py`: LLM service integration (legacy)
  - `/v010`: Current agent architecture with Pydantic XML processing
    - `/io_models/`: Pydantic XML models and utilities
      - `xml_models.py`: Core XML response models with template generation
      - `utils/prompt_template_processor.py`: Dynamic template processing
  - `/v1`: New modular LLM client architecture
    - `llm_client.py`: Unified client for all LLM providers
    - `llm_client_factory.py`: Factory pattern for client creation
    - `token_counter.py`: Cross-provider token counting
- Specialized agents for math, calendaring, weather, etc.
- `/app`: Core application components
  - `configuration_manager.py`: Settings management with inheritance
  - `util_llm_client.py`: Client for LLM service communication
- `/memory`: Data persistence and memory management
- `/rest`: REST API infrastructure
  - Queue management, WebSocket routers, authentication
  - Producer-consumer pattern with event-driven processing
- `/tools`: External integrations and tools
  - `search_gib.py`: Internal search capabilities
  - `search_kagi.py`: Integration with the Kagi search API
- `/training`: Model training infrastructure
  - `peft_trainer.py`: PEFT (Parameter-Efficient Fine-Tuning) implementation
  - `quantizer.py`: Model quantization for deployment
  - `xml_coordinator.py`: Structured XML training data generation/validation
- `/utils`: Shared utility functions
- Python 3.9+
- PyTorch
- Transformers library
- Hugging Face account (for model access)
For a complete list of dependencies, see the requirements.txt file.
```bash
# Clone the repository
git clone git@github.com:deepily/cosa.git
cd cosa

# Install dependencies
pip install -r requirements.txt
```

CoSA is designed to be used as a submodule/subtree within the parent "Lupin" project (formerly genie-in-the-box), but it can also be used independently for agent development.
TBD: Usage examples and API documentation will be provided in future updates.
CoSA includes tools for fine-tuning and deploying LLM models using Parameter-Efficient Fine-Tuning (PEFT):
```bash
# Example: Fine-tune a model using PEFT
python -m cosa.training.peft_trainer \
    --model "mistralai/Mistral-7B-Instruct-v0.2" \
    --model-name "Mistral-7B-Instruct-v0.2" \
    --test-train-path "/path/to/training/data" \
    --lora-dir "/path/to/output/lora" \
    --post-training-stats
```

For detailed instructions on using the PEFT trainer, including all available options, data format requirements, and advanced features like GPU management, please refer to the PEFT Trainer README.
Here's how the CoSA (Collection of Small Agents) framework works:
```
FastAPI Server (fastapi_app/main.py) - CURRENT
|
├── WebSocket endpoints
├── REST API endpoints
└── Async handlers

Flask Server (app.py) - DEPRECATED/REMOVED
├── /push endpoint (migrated to FastAPI)
├── /api/upload-and-transcribe-* (migrated)
└── Socket.IO connections (replaced with WebSockets)
```
```
User Request (voice/text)
|
v
MultiModalMunger (preprocessing)
|
v
TodoFifoQueue.push_job()
├── Check for similar snapshots
├── Parse salutations
├── Get question gist (via Gister)
└── Route to agent via LLM
|
v
Agent Router (LLM-based)
├── "agent router go to calendar"      → CalendaringAgent
├── "agent router go to math"          → MathAgent
├── "agent router go to todo list"     → TodoListAgent
├── "agent router go to date and time" → DateAndTimeAgent
├── "agent router go to weather"       → WeatherAgent
└── "agent router go to receptionist"  → ReceptionistAgent
```
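Assuming the router LLM emits exactly the command strings listed above, the dispatch step can be sketched as a simple lookup; the table and `route()` helper are illustrative, not the framework's actual code:

```python
# Hypothetical routing table mirroring the router LLM's command strings
ROUTING_TABLE = {
    "agent router go to calendar": "CalendaringAgent",
    "agent router go to math": "MathAgent",
    "agent router go to todo list": "TodoListAgent",
    "agent router go to date and time": "DateAndTimeAgent",
    "agent router go to weather": "WeatherAgent",
    "agent router go to receptionist": "ReceptionistAgent",
}

def route(llm_command: str) -> str:
    """Map the router LLM's output to an agent class name.

    Unrecognized commands fall back to the receptionist (an assumption
    made for this sketch, not documented framework behavior).
    """
    return ROUTING_TABLE.get(llm_command.strip().lower(), "ReceptionistAgent")
```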
```
TodoFifoQueue (pending jobs)
|
v
RunningFifoQueue.enter_running_loop()
├── Pop from TodoQueue
├── Execute job (Agent or SolutionSnapshot)
└── Route to appropriate queue:
    ├── DoneQueue (successful)
    └── DeadQueue (errors)
```
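One iteration of that loop can be sketched with the standard-library `queue` module; the names are hypothetical, and the real loop operates on Agent and SolutionSnapshot objects rather than bare callables:

```python
from queue import Empty, SimpleQueue

def running_loop_step(todo: SimpleQueue, done: list, dead: list) -> bool:
    """One pass: pop a job, execute it, route the result by outcome."""
    try:
        job = todo.get_nowait()
    except Empty:
        return False                 # nothing pending; caller may exit the loop
    try:
        done.append(job())           # a job here is any callable (e.g. do_all)
    except Exception as exc:
        dead.append(exc)             # failures land in the dead queue
    return True

todo, done, dead = SimpleQueue(), [], []
todo.put(lambda: "forecast: sunny")
todo.put(lambda: 1 / 0)              # a job that fails
while running_loop_step(todo, done, dead):
    pass
```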
```
AgentBase (abstract)
|
├── run_prompt()    → LlmClient → LLM Service
├── run_code()      → RunnableCode → Python exec()
└── run_formatter() → RawOutputFormatter
|
v
do_all() orchestrates the complete flow
```
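The template-method shape above can be sketched as follows; the `EchoAgent` subclass is purely illustrative and not part of CoSA:

```python
from abc import ABC, abstractmethod

class AgentBaseSketch(ABC):
    """Template method: do_all() fixes the order, subclasses fill in the steps."""

    @abstractmethod
    def run_prompt(self) -> str: ...

    @abstractmethod
    def run_code(self, prompt_result: str) -> str: ...

    def run_formatter(self, raw: str) -> str:
        # Default formatting hook; subclasses may override
        return raw.strip()

    def do_all(self) -> str:
        # The fixed orchestration: prompt → code → formatter
        return self.run_formatter(self.run_code(self.run_prompt()))

class EchoAgent(AgentBaseSketch):
    def run_prompt(self) -> str:
        return "hello"

    def run_code(self, prompt_result: str) -> str:
        return f"  {prompt_result.upper()}  "
```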
ConfigurationManager
- Singleton pattern
- Manages `lupin-app.ini` settings (formerly `gib-app.ini`)
- Environment variable overrides
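A minimal sketch of the singleton-with-env-overrides idea; this is not the real class, and CoSA's implementation also handles inheritance between config blocks:

```python
import os

class ConfigurationManagerSketch:
    """Singleton: one shared instance, file-based defaults, env-var overrides."""
    _instance = None

    def __new__(cls, defaults=None):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._settings = dict(defaults or {})
        return cls._instance  # later constructor calls return the same object

    def get(self, key, default=None):
        # Environment variables win over file-based settings
        return os.environ.get(key.upper(), self._settings.get(key, default))

cfg = ConfigurationManagerSketch({"tts_mode": "hybrid"})
```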
LlmClient/LlmClientFactory
- Unified interface for multiple LLM providers
- Supports OpenAI, Groq, Google, Anthropic
- Handles streaming/non-streaming modes
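The factory idea can be sketched as below; the vendor set is taken from the list above, and the `Sketch` suffix flags that these are illustrative classes, not CoSA's:

```python
from dataclasses import dataclass

@dataclass
class LlmClientSketch:
    vendor: str
    model: str
    streaming: bool = False

class LlmClientFactorySketch:
    """One creation path for every supported vendor."""
    SUPPORTED = {"openai", "groq", "google", "anthropic"}

    @classmethod
    def create(cls, vendor: str, model: str, streaming: bool = False) -> LlmClientSketch:
        if vendor not in cls.SUPPORTED:
            raise ValueError(f"Unsupported vendor: {vendor}")
        return LlmClientSketch(vendor, model, streaming)
```

Callers depend only on the factory, so adding a provider means extending one registry rather than touching every call site.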
SolutionSnapshot
- Serializes successful agent runs
- Stores code, prompts, responses
- Enables solution reuse
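A round-trippable snapshot can be sketched as a dataclass; the fields are reduced to three for illustration, while the real snapshot stores considerably more:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class SolutionSnapshotSketch:
    """Minimal serializable record of a successful agent run."""
    question: str
    code: str
    response: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "SolutionSnapshotSketch":
        return cls(**json.loads(payload))

snap = SolutionSnapshotSketch("What is 2+2?", "print(2+2)", "4")
restored = SolutionSnapshotSketch.from_json(snap.to_json())
```

Because the record round-trips losslessly, a matching future question can replay the stored code instead of re-invoking the LLM.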
Memory Components
- `InputAndOutputTable`: Logs all I/O
- `EmbeddingManager`: Manages embeddings (singleton)
- `GistNormalizer`: Text preprocessing (singleton)
- `SolutionSnapshotManager`: Manages saved solutions
1. User: "What's the weather today?"
2. FastAPI receives request
3. MultiModalMunger processes input
4. TodoFifoQueue:
- Checks for similar snapshots
- No match found
- Routes to weather agent via LLM
5. WeatherAgent created and queued
6. RunningFifoQueue executes:
- Calls agent.do_all()
- Agent queries weather API
- Formats response
7. Results sent to DoneQueue
8. Audio response generated via TTS
9. Response sent to user
- Singleton: ConfigurationManager, EmbeddingManager, GistNormalizer
- Abstract Factory: LlmClientFactory
- Template Method: AgentBase.do_all()
- Queue-based Architecture: Async job processing
- Serialization: SolutionSnapshot for persistence
The framework elegantly handles voice/text input, routes to specialized agents, executes code dynamically, and maintains a memory of successful solutions for reuse.
Please refer to CLAUDE.md for detailed code style and development guidelines.
For current research and planning documents, see the RND directory, which includes:
- LLM Client Architecture Refactoring Plan: Comprehensive plan for improving the v010 LLM client architecture
- LLM Client Refactoring Progress: Progress tracker for the LLM client refactoring project
- LLM Refactoring Analysis: Analysis of LLM component refactoring needs
- Agent Migration v000 to v010 Plan: Migration strategy for agent architecture
- Screen Reader Agent Implementation Plan: Plan for screen reader accessibility agent
- Agent Factory Testing Plan: Testing strategy for agent factory components
- CI Testing Implementation Plan: Continuous integration testing setup
- LLM Prompt Format Analysis: Analysis of prompt formatting approaches
- Prompt Templating Strategies: Strategies for prompt template management
- Python Package Distribution Plan: Plan for package distribution strategy
- Versioning and CI/CD Strategy: Version management and deployment strategy
- Universal Prediction Engine (UPE) — 7 prediction slices with response_type filtering to prevent cross-type contamination
- Bayesian Beta-Bernoulli Trust Model — Per-agent trust learning with conjugate prior updates
- Thompson Sampling — Exploration-exploitation balance for auto-approve vs. escalate decisions
- Conformal Prediction — Calibrated confidence intervals with statistical guarantees
- LanceDB Preference Embeddings — Semantic similarity search with `response_type` filtering and MC option validation
- L1-L5 Trust Escalation — Five trust levels from "always ask" to "full autonomy" with circuit breaker pattern
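The Beta-Bernoulli update and Thompson draw described above can be sketched as follows; the threshold, per-agent bookkeeping, and escalation logic are omitted, and all names here are illustrative:

```python
import random

class TrustModelSketch:
    """Per-agent trust as a Beta(alpha, beta) posterior over approval rate."""

    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        self.alpha, self.beta = alpha, beta  # Beta(1, 1) = uniform prior

    def update(self, approved: bool) -> None:
        # Conjugate update: an approval bumps alpha, a rejection bumps beta
        if approved:
            self.alpha += 1
        else:
            self.beta += 1

    def thompson_sample(self) -> float:
        # Draw from the posterior; auto-approve when the draw clears a
        # threshold, otherwise escalate to the human (threshold not shown)
        return random.betavariate(self.alpha, self.beta)

model = TrustModelSketch()
for _ in range(20):
    model.update(True)   # twenty approvals
model.update(False)      # one rejection
```

Sampling, rather than using the posterior mean directly, is what gives Thompson sampling its exploration: an under-observed agent still occasionally draws high and gets a chance to earn trust.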
- Hot-Swap Config — Running dev server toggles between config blocks at runtime via `/api/init?config_block_id=...`
- `GET /api/server-info` — Unauthenticated introspection endpoint (config block, masked DB URL, environment)
- `swap_database()` — Runtime database environment switching (development/testing/production)
- Database Disambiguation — `lupin_db` split into `lupin_db_dev` and `lupin_db_prod`
- Unified `~/.lupin/config` — Three credential stores collapsed into one file
- Fail-hard on missing config — Removed all legacy fallbacks; `FileNotFoundError` with migration instructions
- Strict Project Detection — `KNOWN_PROJECTS` registry + `is_known_project()` for MCP validation
- `user_initiated_message` type for voice input routing
- `QualifierClassification` model + `display_qualifier_widget` notification field
- Programmatic session ID regex tightened to require a hyphen
- Dead event cleanup — Removed `active_conversation_changed` (emitted but never subscribed)
- SWE Team Agent — 4-phase agentic software development with trust-aware decision proxy
- Everyday Calculator Agent — Natural language calculator with MathAgent fallback
- CRUD for DataFrames Agent — Voice-controlled create/read/update/delete for Pandas DataFrames
- Notification Proxy Agent — Phi-4 LLM fuzzy script matching for automated interactive testing
- Agentic Job System — Background execution engine for long-running Claude Agent SDK tasks
- Deep Research + Podcast Generator — Research-to-podcast chained pipeline
- Dry-Run Mode — Test agentic jobs without API costs
- `job_state_transition` events for real-time job status via WebSocket
- +905 unit tests across trust engine, session bridge, hooks, credentials, prediction engine
- WebSocket tests: 50/50 passing
- Integration tests: 136 passed (comprehensive auth, admin, queue filtering)
- Interactive proxy tests: 12 scenarios across Calculator, CRUD, and Expediter agents
- v0.1.4 — cosa-voice MCP Server, Runtime Argument Expeditor, batch voice questions
- v0.1.3 — CJ Flow agentic job system, JWT WebSocket auth, unified LoRA training
- v0.1.2 — LanceDB migration with 100% feature parity
- v0.1.1 — WebSocket FastAPI test suite
- v0.1.0 — Complete Flask elimination, FastAPI-only architecture
- Pydantic XML Migration — All 8 agents migrated with 4 core models and 3-tier strategy
- Design by Contract Documentation — 100% coverage across all 73 Python modules
- Modular LLM Client Architecture — Vendor-agnostic support for OpenAI, Groq, Anthropic, Google
- Producer-Consumer Queue — 6,700x performance improvement via event-driven processing
- WebSocket User Routing — Persistent user-centric event routing with multi-session support
This project is licensed under the terms specified in the LICENSE file.