CUCo is a training-free, agent-driven framework that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches, reducing end-to-end latency by up to 1.57x over host-driven baselines.
CUCo consists of three intertwined components:

- **Design Space Specification** — A structured, declarative set of communication primitives (backend, placement, sync scope, issuer granularity, chunk size) that grounds agent reasoning in valid collective semantics.
- **Fast-Path Agent** — A correctness-first pipeline that converts host-driven NCCL code into device-initiated (GIN/LSA) equivalents through a three-step process: CUDA code analysis, host-to-device transformation via an LLM-judge loop, and evolve-block annotation.
- **Slow-Path Agent** — An LLM-driven evolutionary search that optimizes the fast-path baseline through island-based populations, phase-dependent explore/exploit mutation, cascaded evaluation, and a shared candidate database with meta-summarization.
Given a host-driven CUDA+NCCL kernel, CUCo's fast-path agent first analyzes the communication pattern, converts host-side collectives to device-initiated GIN/LSA primitives, and annotates mutable regions with EVOLVE-BLOCK markers. The slow-path agent then treats the annotated kernel as generation 0 and runs an evolutionary search: each generation, an LLM mutates the code within the evolve blocks, the candidate is compiled, run, and scored, and the result feeds back into the next iteration. Over 10-20 generations, this loop discovers optimizations like compute-communication overlap, kernel fusion, and pipelined transfers that are difficult to find manually.
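The generation loop above can be sketched as follows. This is a simplified stand-in, not CUCo's implementation: `mutate` and `evaluate_candidate` are stubs for the LLM mutation and compile-run-score steps, and the single shared list stands in for the island-based candidate database.

```python
import random

def mutate(code: str) -> str:
    """Stand-in for the LLM mutation step; in CUCo, only the code
    inside EVOLVE-BLOCK markers may be rewritten."""
    return code + f"  // variant-{random.randint(0, 9999)}"

def evaluate_candidate(code: str) -> float:
    """Stand-in for compile + run + score; returns a fitness value."""
    return random.random()

def evolve(seed_code: str, num_generations: int = 10):
    best_code, best_score = seed_code, evaluate_candidate(seed_code)
    database = [(best_code, best_score)]                 # shared candidate database
    for gen in range(num_generations):
        parent_code, _ = max(database, key=lambda c: c[1])  # exploit best so far
        child = mutate(parent_code)                          # LLM edits evolve blocks
        score = evaluate_candidate(child)                    # compile, run, score
        database.append((child, score))                      # feed back into search
        if score > best_score:
            best_code, best_score = child, score
    return best_code, best_score
```

The real slow-path agent additionally balances exploration against exploitation per phase and discards weak candidates early via cascaded evaluation.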
CUCo was evaluated on four representative workloads spanning different compute-communication patterns. In each case, CUCo's evolved kernels significantly outperform the host-driven NCCL baselines.
| Workload | Compute-Communication Pattern |
|---|---|
| DeepSeek-V3 MoE | Dispatch-Compute-Combine |
| KV Cache Transfer | Prefill-Decode Pipeline |
| Flash Attention | Attention with AllGather |
| GEMM + AllGather | Matmul with Collective |
| Guide | Description |
|---|---|
| Getting Started | Installation, first run, end-to-end walkthrough |
| Architecture | System design, module map, data flow |
| Adding a New Workload | Step-by-step guide to onboard your own kernel |
| Fast-Path Agent | Host-to-device transformation pipeline |
| Slow-Path Agent | Evolutionary search deep dive |
| Configuration Reference | All config parameters (EvolutionConfig, TransformConfig, etc.) |
| LLM Backends | Provider setup (Anthropic, Bedrock, OpenAI, Gemini, DeepSeek) |
| Writing Evaluations | Custom evaluate.py for your workload |
| Visualization | Web UI, plotting tools, database queries |
While the included example uses CUDA and NCCL device APIs, CUCo's core framework is workload-agnostic. Run `cuco_init /path/to/kernel.cu` to scaffold a new workload with all required files pre-configured for your cluster. The evaluation script (`evaluate.py`), prompt customization (`run_evo.py`), and API documentation file are all user-defined — you can adapt CUCo for any kernel, library, or optimization target where an LLM can generate code and a script can score it. See Adding a New Workload for details.
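As a sketch of what a user-defined evaluation script could look like — the function names, the `nvcc` invocation, and the assumption that the candidate binary prints a single latency in milliseconds are all illustrative, not CUCo's documented interface:

```python
import subprocess

def build(kernel_path: str, binary_path: str = "./candidate") -> bool:
    """Compile the candidate; failed builds are discarded cheaply."""
    try:
        result = subprocess.run(
            ["nvcc", "-O3", kernel_path, "-o", binary_path],
            capture_output=True,
        )
    except FileNotFoundError:   # nvcc not on PATH
        return False
    return result.returncode == 0

def run_and_score(binary_path: str = "./candidate") -> float:
    """Run the binary and convert its reported latency (ms) to a
    fitness value where higher is better."""
    result = subprocess.run([binary_path], capture_output=True, text=True)
    if result.returncode != 0:
        return 0.0
    latency_ms = float(result.stdout.strip())
    return 1000.0 / latency_ms   # fitness: inverse latency

def evaluate(kernel_path: str) -> float:
    if not build(kernel_path):
        return 0.0               # failed builds get zero fitness
    return run_and_score()
```

The key contract is that evaluation returns a single scalar the search can maximize; anything a script can score — latency, throughput, correctness-weighted fitness — fits this shape.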
```
cuco/                     Core framework
├── core/                 Evolution runner, sampler, novelty judge, summarizer
├── database/             Candidate database, complexity analysis, island management
├── edit/                 Diff/full-rewrite application, async editing
├── llm/                  LLM client, model backends (Anthropic, OpenAI, Gemini, DeepSeek)
├── prompts/              Mutation prompt templates (base, diff, full, cross, novelty, meta)
├── transform/            Fast-path agent: CUDA analyzer, host-to-device transformer
├── plots/                Visualization utilities (lineage trees, pareto fronts, improvement plots)
├── webui/                Interactive evolution visualization UI
├── launch/               Local and Slurm launch backends
├── templates/            Templates for evaluate.py, .gitignore (used by cuco_init)
├── site_config.py        Cluster auto-detection and ~/.cuco/site.yaml management
├── init_workload.py      Workload scaffolding logic (used by cuco_init)
├── run_workload.py       Workload launcher logic (used by cuco_run)
├── cuco_init             CLI: scaffold new workloads or run cluster setup
├── cuco_run              CLI: launch evolution for a named workload
├── cuco_launch           Entry point for launching evolution runs
└── cuco_visualize        Entry point for the visualization UI
workloads/
└── ds_v3_moe/            DeepSeek-V3 MoE dispatch-compute-combine workload
    ├── ds_v3_moe.cu      Seed CUDA kernel (host-driven baseline)
    ├── evaluate.py       Build, run, and fitness evaluation logic
    ├── run_evo.py        Launch slow-path evolutionary search
    ├── run_transform.py  Launch fast-path host-to-device transformation
    ├── nccl_api_docs.py  NCCL device API documentation for agent context
    └── results_ds_v3_moe/  Evolution results (generations, scores, logs)
pyproject.toml            Package configuration and dependencies
uv.lock                   Locked dependency versions
```
- Python >= 3.10
- CUDA 13.1+ with NCCL 2.28.9+ (for device-initiated communication)
- NVIDIA GPUs with NVLink (intra-node) or RoCE (inter-node)
- LLM API credentials (Anthropic Bedrock, OpenAI, etc.)
```bash
# Clone the repository
git clone https://github.com/UT-InfraAI/cuco.git
cd cuco

# Create virtual environment and install
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

Or with uv (recommended):

```bash
uv venv
source .venv/bin/activate
uv sync
```

Create a `.env` file in the repository root with your LLM API credentials:

```
AWS_DEFAULT_REGION=us-east-1
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
```

CUCo provides `cuco_init` to scaffold a new workload from any seed CUDA kernel:
```bash
# One-time cluster setup (auto-detects CUDA, NCCL, MPI, GPUs)
cuco_init --setup

# Scaffold a new workload from your seed kernel
cuco_init /path/to/my_kernel.cu

# Run evolution (50 generations by default)
cuco_run my_kernel --generations 50
```

`cuco_init` creates a ready-to-run workload directory under `workloads/` with all required files (`evaluate.py`, `run_evo.py`, `run_transform.py`, etc.) pre-configured using your cluster settings from `~/.cuco/site.yaml`. See Adding a New Workload for details.
The fast-path agent converts a host-driven NCCL program into a device-initiated equivalent:
```bash
cd workloads/ds_v3_moe
python run_transform.py
```

This runs the three-step pipeline (CUDA analysis, host-to-device transformation, evolve-block annotation) and outputs the transformed kernel to `_transform_host_output/`.
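The evolve-block annotations delimit the regions the slow-path agent is allowed to mutate. Assuming hypothetical comment-based `EVOLVE-BLOCK-START`/`EVOLVE-BLOCK-END` markers (CUCo's actual syntax may differ), extracting the mutable regions from a kernel source could look like:

```python
import re

# Hypothetical marker syntax; illustrative only.
BLOCK_RE = re.compile(
    r"//\s*EVOLVE-BLOCK-START\n(.*?)//\s*EVOLVE-BLOCK-END",
    re.DOTALL,
)

def extract_evolve_blocks(source: str) -> list[str]:
    """Return the code regions the mutation step may rewrite."""
    return [m.strip() for m in BLOCK_RE.findall(source)]

kernel = """
__global__ void combine(float* buf) {
  // EVOLVE-BLOCK-START
  float v = buf[threadIdx.x];   // mutable region
  buf[threadIdx.x] = v * 2.0f;
  // EVOLVE-BLOCK-END
}
"""
```

Restricting mutation to marked regions keeps the kernel's interface and launch configuration fixed while the search rewrites only the communication and compute logic inside.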
The slow-path agent optimizes the transformed kernel through LLM-driven evolution:
```bash
cd workloads/ds_v3_moe
python run_evo.py --num_generations=18
```

Evolution results (candidate programs, scores, logs) are saved to `results_ds_v3_moe/`.
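The results database is a plain SQLite file and can be inspected directly. The table and column names below (`candidates`, `id`, `generation`, `score`) are assumptions about the schema, not CUCo's documented interface:

```python
import sqlite3

def top_candidates(db_path: str, n: int = 5):
    """List the n best-scoring candidates from a results database.
    Assumes a hypothetical `candidates(id, generation, score)` table."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id, generation, score FROM candidates "
            "ORDER BY score DESC LIMIT ?",
            (n,),
        ).fetchall()
    finally:
        conn.close()
    return rows
```

For the supported tooling, see the Visualization guide.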
Launch the interactive web UI to explore the evolution tree:
```bash
cuco_visualize --db workloads/ds_v3_moe/results_ds_v3_moe/evolution_db.sqlite
```

```bibtex
@misc{hu2026cucoagenticframeworkcompute,
  title={CUCo: An Agentic Framework for Compute and Communication Co-design},
  author={Bodun Hu and Yoga Sri Varshan V and Saurabh Agarwal and Aditya Akella},
  year={2026},
  eprint={2603.02376},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2603.02376},
}
```

Apache 2.0