CUCo: An Agentic Framework for Compute and Communication Co-design


CUCo is a training-free, agent-driven framework that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches, reducing end-to-end latency by up to 1.57x over host-driven baselines.

Overview

CUCo consists of three intertwined components:

  1. Design Space Specification — A structured, declarative set of communication primitives (backend, placement, sync scope, issuer granularity, chunk size) that grounds agent reasoning in valid collective semantics.

  2. Fast-Path Agent — A correctness-first pipeline that converts host-driven NCCL code into device-initiated (GIN/LSA) equivalents through a three-step process: CUDA code analysis, host-to-device transformation via an LLM-judge loop, and evolve-block annotation.

  3. Slow-Path Agent — An LLM-driven evolutionary search that optimizes the fast-path baseline through island-based populations, phase-dependent explore/exploit mutation, cascaded evaluation, and a shared candidate database with meta-summarization.
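For intuition, the design space in (1) can be viewed as a small set of enumerable axes. The axis names and values below are hypothetical placeholders for illustration, not CUCo's actual specification format:

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical axes mirroring the primitives listed above; the real
# specification format is defined by CUCo, not by this sketch.
BACKENDS = ["nccl_host", "gin", "lsa"]               # communication backend
PLACEMENTS = ["before_compute", "overlapped"]        # placement relative to compute
SYNC_SCOPES = ["block", "grid", "multi_gpu"]         # synchronization scope
ISSUERS = ["one_thread", "one_warp", "one_block"]    # issuer granularity
CHUNK_SIZES = [64 * 1024, 256 * 1024, 1024 * 1024]   # chunk size in bytes

@dataclass(frozen=True)
class CommSpec:
    backend: str
    placement: str
    sync_scope: str
    issuer: str
    chunk_bytes: int

def enumerate_design_space():
    """Yield every point in the (toy) design space."""
    for axes in product(BACKENDS, PLACEMENTS, SYNC_SCOPES, ISSUERS, CHUNK_SIZES):
        yield CommSpec(*axes)

space = list(enumerate_design_space())
print(len(space))  # 3 * 2 * 3 * 3 * 3 = 162
```

Grounding the agent in a structured space like this keeps mutations within valid collective semantics instead of letting the LLM invent arbitrary communication code.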

How It Works

CUCo Workflow

Given a host-driven CUDA+NCCL kernel, CUCo's fast-path agent first analyzes the communication pattern, converts host-side collectives to device-initiated GIN/LSA primitives, and annotates mutable regions with EVOLVE-BLOCK markers. The slow-path agent then treats the annotated kernel as generation 0 and runs an evolutionary search: each generation, an LLM mutates the code within the evolve blocks, the candidate is compiled, run, and scored, and the result feeds back into the next iteration. Over 10-20 generations, this loop discovers optimizations like compute-communication overlap, kernel fusion, and pipelined transfers that are difficult to find manually.
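The generational loop described above can be sketched abstractly. The LLM mutation and compile-run-score steps are replaced here by stand-in functions purely to show the control flow; none of these names come from CUCo's code:

```python
import random

def mutate(code: str, rng: random.Random) -> str:
    """Stand-in for the LLM mutation step: edits only the evolve block
    (here, a single CHUNK parameter)."""
    chunk = int(code.split("CHUNK=")[1].split()[0])
    new_chunk = chunk * rng.choice([2, 4])
    return code.replace(f"CHUNK={chunk}", f"CHUNK={new_chunk}")

def score(code: str) -> float:
    """Stand-in for compile-run-score: pretend larger chunks pipeline
    better, so latency falls as CHUNK grows. Lower is better."""
    chunk = int(code.split("CHUNK=")[1].split()[0])
    return 1.0 / chunk

rng = random.Random(0)
best = "// EVOLVE-BLOCK CHUNK=1 ..."
best_score = score(best)

for generation in range(10):
    candidate = mutate(best, rng)
    candidate_score = score(candidate)
    if candidate_score < best_score:  # keep the candidate only if it improves
        best, best_score = candidate, candidate_score

print(best_score)
```

The real slow-path agent layers island populations, explore/exploit phases, and cascaded evaluation on top of this basic select-mutate-score skeleton.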

Key Results

CUCo was evaluated on four representative workloads spanning different compute-communication patterns. In each case, CUCo's evolved kernels significantly outperform the host-driven NCCL baselines.

  • DeepSeek-V3 MoE: Dispatch-Compute-Combine
  • KV Cache Transfer: Prefill-Decode Pipeline
  • Flash Attention: Attention with AllGather
  • GEMM + AllGather: Matmul with Collective

Documentation

  • Getting Started: Installation, first run, end-to-end walkthrough
  • Architecture: System design, module map, data flow
  • Adding a New Workload: Step-by-step guide to onboard your own kernel
  • Fast-Path Agent: Host-to-device transformation pipeline
  • Slow-Path Agent: Evolutionary search deep dive
  • Configuration Reference: All config parameters (EvolutionConfig, TransformConfig, etc.)
  • LLM Backends: Provider setup (Anthropic, Bedrock, OpenAI, Gemini, DeepSeek)
  • Writing Evaluations: Custom evaluate.py for your workload
  • Visualization: Web UI, plotting tools, database queries

Extensibility

While the included example uses CUDA and NCCL device APIs, CUCo's core framework is workload-agnostic. Run cuco_init /path/to/kernel.cu to scaffold a new workload with all required files pre-configured for your cluster. The evaluation script (evaluate.py), prompt customization (run_evo.py), and API documentation file are all user-defined — you can adapt CUCo for any kernel, library, or optimization target where an LLM can generate code and a script can score it. See Adding a New Workload for details.
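To make "a script can score it" concrete, a minimal evaluation script might look like the sketch below. The function name, return shape, and the use of Python's compile/exec as a stand-in "build" step are illustrative assumptions; the real contract is defined by CUCo's evaluate.py template:

```python
import time

def evaluate(candidate_src: str) -> dict:
    """Toy stand-in for a workload's evaluate.py: 'build' the candidate
    (here, compiling Python source instead of invoking nvcc), run it,
    and return a fitness record. A real evaluate.py would build and
    launch on GPUs and verify outputs against a reference."""
    try:
        code = compile(candidate_src, "<candidate>", "exec")
    except SyntaxError as err:
        return {"ok": False, "score": float("-inf"), "log": str(err)}

    start = time.perf_counter()
    try:
        exec(code, {})
    except Exception as err:
        return {"ok": False, "score": float("-inf"), "log": str(err)}
    latency = time.perf_counter() - start

    # Higher score = better; here simply inverse latency.
    return {"ok": True, "score": 1.0 / max(latency, 1e-9), "log": ""}

print(evaluate("sum(range(1000))")["ok"])  # a valid candidate -> True
print(evaluate("def broken(:")["ok"])      # a build failure -> False
```

Because failures return a score of negative infinity, broken candidates are automatically outcompeted without aborting the evolutionary search.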

Repository Layout

cuco/                   Core framework
├── core/               Evolution runner, sampler, novelty judge, summarizer
├── database/           Candidate database, complexity analysis, island management
├── edit/               Diff/full-rewrite application, async editing
├── llm/                LLM client, model backends (Anthropic, OpenAI, Gemini, DeepSeek)
├── prompts/            Mutation prompt templates (base, diff, full, cross, novelty, meta)
├── transform/          Fast-path agent: CUDA analyzer, host-to-device transformer
├── plots/              Visualization utilities (lineage trees, pareto fronts, improvement plots)
├── webui/              Interactive evolution visualization UI
├── launch/             Local and Slurm launch backends
├── templates/          Templates for evaluate.py, .gitignore (used by cuco_init)
├── site_config.py      Cluster auto-detection and ~/.cuco/site.yaml management
├── init_workload.py    Workload scaffolding logic (used by cuco_init)
├── run_workload.py     Workload launcher logic (used by cuco_run)
├── cuco_init           CLI: scaffold new workloads or run cluster setup
├── cuco_run            CLI: launch evolution for a named workload
├── cuco_launch         Entry point for launching evolution runs
└── cuco_visualize      Entry point for the visualization UI
workloads/
└── ds_v3_moe/          DeepSeek-V3 MoE dispatch-compute-combine workload
    ├── ds_v3_moe.cu    Seed CUDA kernel (host-driven baseline)
    ├── evaluate.py     Build, run, and fitness evaluation logic
    ├── run_evo.py      Launch slow-path evolutionary search
    ├── run_transform.py Launch fast-path host-to-device transformation
    ├── nccl_api_docs.py NCCL device API documentation for agent context
    └── results_ds_v3_moe/ Evolution results (generations, scores, logs)
pyproject.toml          Package configuration and dependencies
uv.lock                 Locked dependency versions

Setup

Prerequisites

  • Python >= 3.10
  • CUDA 13.1+ with NCCL 2.28.9+ (for device-initiated communication)
  • NVIDIA GPUs with NVLink (intra-node) or RoCE (inter-node)
  • LLM API credentials (Anthropic Bedrock, OpenAI, etc.)

Installation

# Clone the repository
git clone https://github.com/UT-InfraAI/cuco.git
cd cuco

# Create virtual environment and install
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Or with uv (recommended):

uv venv
source .venv/bin/activate
uv sync

Configuration

Create a .env file in the repository root with your LLM API credentials:

AWS_DEFAULT_REGION=us-east-1
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...

Usage

Quick Start: Add Your Own Workload

CUCo provides cuco_init to scaffold a new workload from any seed CUDA kernel:

# One-time cluster setup (auto-detects CUDA, NCCL, MPI, GPUs)
cuco_init --setup

# Scaffold a new workload from your seed kernel
cuco_init /path/to/my_kernel.cu

# Run evolution (50 generations by default)
cuco_run my_kernel --generations 50

cuco_init creates a ready-to-run workload directory under workloads/ with all required files (evaluate.py, run_evo.py, run_transform.py, etc.) pre-configured using your cluster settings from ~/.cuco/site.yaml. See Adding a New Workload for details.

Fast-Path Agent (Host-to-Device Transformation)

The fast-path agent converts a host-driven NCCL program into a device-initiated equivalent:

cd workloads/ds_v3_moe
python run_transform.py

This runs the three-step pipeline (CUDA analysis, host-to-device transformation, evolve-block annotation) and outputs the transformed kernel to _transform_host_output/.
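The evolve-block annotations delimit which regions the slow-path agent is allowed to mutate. As a rough illustration, assuming a marker syntax like `// EVOLVE-BLOCK-START` / `// EVOLVE-BLOCK-END` (the exact markers are defined by CUCo and may differ), the mutable regions can be extracted with a simple scan:

```python
import re

# Hypothetical annotated kernel; CUCo's actual marker format may differ.
KERNEL = """\
__global__ void moe_dispatch(float* buf, int n) {
    // EVOLVE-BLOCK-START
    int chunk = 256 * 1024;  // mutable: chunk size, issuer, overlap
    // EVOLVE-BLOCK-END
    for (int i = threadIdx.x; i < n; i += blockDim.x) buf[i] *= 2.0f;
}
"""

def evolve_blocks(src: str) -> list[str]:
    """Return the source regions between START/END markers."""
    pattern = re.compile(
        r"// EVOLVE-BLOCK-START\n(.*?)// EVOLVE-BLOCK-END", re.DOTALL)
    return pattern.findall(src)

blocks = evolve_blocks(KERNEL)
print(len(blocks))  # 1
```

Everything outside the markers (kernel signature, correctness-critical setup) stays frozen, so mutations cannot break the program's interface.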

Slow-Path Agent (Evolutionary Search)

The slow-path agent optimizes the transformed kernel through LLM-driven evolution:

cd workloads/ds_v3_moe
python run_evo.py --num_generations=18

Evolution results (candidate programs, scores, logs) are saved to results_ds_v3_moe/.

Visualization

Launch the interactive web UI to explore the evolution tree:

cuco_visualize --db workloads/ds_v3_moe/results_ds_v3_moe/evolution_db.sqlite

Citation

@misc{hu2026cucoagenticframeworkcompute,
      title={CUCo: An Agentic Framework for Compute and Communication Co-design}, 
      author={Bodun Hu and Yoga Sri Varshan V and Saurabh Agarwal and Aditya Akella},
      year={2026},
      eprint={2603.02376},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2603.02376}, 
}

License

Apache 2.0
