Fetches a bug from Azure DevOps, runs multi-perspective AI root cause analysis, evaluates and verifies each result, then uses semantic consensus to select the best answer. Supports cross-model voting across OpenAI, Anthropic, and Google models via the GitHub Copilot API.
- Multi-Agent RCA — 3 perspective agents (Flow, Signal, Data), each using a different model by default (GPT-4o, Claude Sonnet, Gemini) for maximum diversity. Override with `--model` to use a single model for all agents.
- Multi-Run Mode — Run the same model N times with varied temperatures to reduce randomness.
- Cross-Model Vote — Run different models (e.g. GPT-4o, Claude Sonnet, Gemini) and let them compete; a consensus LLM picks the winner.
- 5-Dimension Scoring — Each RCA is scored on Specificity, Causality, Evidence, Uncertainty Awareness, and Signal Usage (max 14 points).
- Verification — Each RCA is checked against extracted context (stack traces, files, repro steps) and signals; fabrications are flagged.
- Semantic Consensus — An LLM groups candidates by root cause theme, then selects the best group using a weighted formula (group size × 2 + avg score + verified count × 2).
- Bug Type Classification — Automatically classifies bugs as `runtime_error`, `config_data`, or `feature_request` and adapts evaluation thresholds accordingly.
- Table Signal Extraction — Detects blank cells, posting groups, GL accounts, and field headers from embedded HTML tables and Excel attachments.
- Excel Attachment Parsing — Downloads and parses `.xlsx` attachments from Azure DevOps using openpyxl.
- Memory Store — Verified, consensus-backed RCAs are stored to disk for future similarity lookups.
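For illustration, the weighted formula from the Semantic Consensus bullet reduces to a one-liner (the helper name is hypothetical; the real implementation lives in `consensus.py`):

```python
def consensus_score(group_size: int, avg_score: float, verified_count: int) -> float:
    # Weighted formula from the feature list: larger agreeing groups and
    # more verified members outweigh a slightly higher average score.
    return group_size * 2 + avg_score + verified_count * 2
```

For example, a group of 2 candidates with an average score of 7.5 and both members verified scores 2*2 + 7.5 + 2*2 = 15.5.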
```
Fetch bug from ADO → Preprocess & extract signals → Classify bug type
→ Extract verifiable context → Retrieve similar past cases from memory
→ Multi-agent / Multi-run / Cross-model RCA
→ Evaluate each (5-dimension scoring)
→ Verify each against context + signals
→ LLM semantic consensus with weighted scoring
→ Store verified result to memory
```
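The stages above can be sketched as a single orchestration function. This is a hypothetical outline, not the actual module API; the real steps live in `fetch_bug.py`, `preprocess.py`, and friends, so each stage is passed in as a callable to keep the sketch self-contained:

```python
def run_pipeline(work_item_id, *, fetch, extract_signals, classify,
                 extract_context, recall, analyze, evaluate, verify,
                 pick_winner, store):
    bug = fetch(work_item_id)                        # Fetch bug from ADO
    signals = extract_signals(bug)                   # Preprocess & extract signals
    bug_type = classify(bug, signals)                # Classify bug type
    context = extract_context(bug)                   # Extract verifiable context
    past_cases = recall(bug)                         # Similar past cases from memory
    candidates = analyze(bug, signals, context, bug_type, past_cases)
    for c in candidates:
        c["score"] = evaluate(c)                     # 5-dimension scoring
        c["verified"] = verify(c, context, signals)  # Check against context + signals
    winner = pick_winner(candidates)                 # Weighted semantic consensus
    if winner["verified"]:
        store(winner)                                # Persist to RCA memory
    return winner
```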
- Copy `.env.example` to `.env` and fill in your values: `cp .env.example .env`
- Install dependencies: `pip install -r requirements.txt`
- Configure `.env`:
| Variable | Description |
|---|---|
| `ADO_TOKEN` | Azure DevOps Personal Access Token (needs Work Items read scope) |
| `ADO_ORG` | Azure DevOps organization name |
| `ADO_PROJECT` | Azure DevOps project name |
| `GITHUB_TOKEN` | GitHub PAT (for calling AI models via the GitHub Copilot API) |
| `API_ENDPOINT` | API endpoint (default: `https://api.githubcopilot.com`) |
| `MODEL` | Default model (e.g. `gpt-4o`, `claude-sonnet-4`, `gemini-2.5-pro`) |
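A filled-in `.env` might look like this (all values below are placeholders):

```
ADO_TOKEN=xxxxxxxxxxxxxxxxxxxx
ADO_ORG=contoso
ADO_PROJECT=ERP-Platform
GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxx
API_ENDPOINT=https://api.githubcopilot.com
MODEL=gpt-4o
```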
```bash
# Multi-agent mode (default — 3 agents × 3 different models)
python main.py 596528

# Force all agents to use the same model
python main.py 596528 --model claude-sonnet-4

# Multi-run mode (5 identical runs, same model)
python main.py 596528 --runs 5

# Cross-model vote (run multiple models and let them compete)
python main.py 596528 --models gpt-4o claude-sonnet-4 gpt-4.1

# List all available models
python main.py --list-models
```

| Flag | Description |
|---|---|
| `work_item_id` | Azure DevOps work item ID to analyze |
| `--model`, `-m` | Force a single model for all agents (default: diverse rotation) |
| `--runs`, `-r` | Number of multi-run passes (0 = multi-agent mode) |
| `--models` | Space-separated list of models for cross-model voting |
| `--list-models` | List all available models and exit |
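The flag set above maps naturally onto `argparse`. A sketch, assuming the behaviors described in the table (not the actual `main.py`):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Multi-perspective RCA for Azure DevOps bugs")
    parser.add_argument("work_item_id", nargs="?", type=int,
                        help="Azure DevOps work item ID to analyze")
    parser.add_argument("--model", "-m",
                        help="Force a single model for all agents")
    parser.add_argument("--runs", "-r", type=int, default=0,
                        help="Number of multi-run passes (0 = multi-agent mode)")
    parser.add_argument("--models", nargs="+",
                        help="Space-separated list of models for cross-model voting")
    parser.add_argument("--list-models", action="store_true",
                        help="List all available models and exit")
    return parser

# Example: cross-model vote between two models
args = build_parser().parse_args(["596528", "--models", "gpt-4o", "claude-sonnet-4"])
```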
The GitHub Copilot endpoint supports the following models:
| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o3-mini, o4-mini |
| Anthropic | claude-sonnet-4, claude-opus-4, claude-3.5-sonnet |
| Google | gemini-2.5-pro |
Reasoning models (o3-mini, o4-mini) are automatically handled — temperature and system role are disabled.
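One common way to implement that special-casing when building OpenAI-compatible chat requests (a sketch under assumed request shapes; the tool's actual logic in `rca_agent.py`/`config.py` may differ):

```python
REASONING_MODELS = {"o3-mini", "o4-mini"}

def build_request(model: str, system_prompt: str, user_prompt: str,
                  temperature: float = 0.7) -> dict:
    # Reasoning models reject `temperature` and the `system` role, so fold
    # the system prompt into the user message and omit temperature.
    if model in REASONING_MODELS:
        return {
            "model": model,
            "messages": [
                {"role": "user", "content": f"{system_prompt}\n\n{user_prompt}"},
            ],
        }
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
    }
```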
The tool outputs a ranked comparison table, consensus analysis, and the winning RCA:
```
============================================================
RCA Candidates — Comparison
============================================================
Rank  Agent/Model              Score  S C E U Sig  Verified  Status
───── ──────────────────────   ─────  ─ ─ ─ ─ ───  ────────  ──────────
1     Flow (gpt-4o)            11/14  3 3 2 1 2    YES       ★ WINNER
2     Signal (claude-sonnet-4) 10/14  3 2 2 1 2    YES
3     Data (gemini-2.5-pro)     9/14  2 2 2 1 2    no

============================================================
Consensus & Winner Selection
============================================================
Winner Selection:
  Consensus size: 2 (strong)
  Consensus score: 15.5 (= size×2 + avg_score + verified×2)
  Reasoning: gpt-4o and claude-sonnet-4 agree on the root cause...

============================================================
Final RCA — Winner: gpt-4o
Score: 11/14 | Verified: YES | Consensus: 2/3
============================================================
Root Cause:
...

Memory: STORED (passed triple filter: consensus + score + verified)
```
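The "triple filter" mentioned in the output can be expressed as a simple gate. The threshold values here are illustrative assumptions, not the tool's actual defaults:

```python
def should_store(consensus_size: int, score: int, verified: bool,
                 min_consensus: int = 2, min_score: int = 10) -> bool:
    # All three conditions must hold before an RCA is written to memory:
    # consensus backing, a sufficient quality score, and verification.
    return consensus_size >= min_consensus and score >= min_score and verified
```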
```
bug-agent-demo/
├── main.py               # Entry point & CLI
├── fetch_bug.py          # Azure DevOps REST API + Excel attachment parsing
├── preprocess.py         # HTML cleaning, signal extraction, bug classification
├── signal_extractor.py   # Keyword/pattern signals + table signal extraction
├── context_loader.py     # Extract verifiable context (stack traces, files, lines)
├── rca_agent.py          # Single-run & multi-run RCA via OpenAI-compatible API
├── multi_agent.py        # Multi-agent RCA (3 perspective agents)
├── rca_evaluator.py      # 5-dimension RCA quality scoring
├── rca_verifier.py       # Verification against context + signals
├── consensus.py          # LLM-based semantic consensus with weighted scoring
├── memory_store.py       # RCA memory storage & similarity lookup
├── config.py             # Configuration, model registry, validation
├── requirements.txt      # Python dependencies
├── .env.example          # Template for secrets
├── prompts/
│   ├── rca_prompt.txt    # RCA system prompt
│   ├── agent_data.txt    # Data-focused agent prompt
│   ├── agent_flow.txt    # Flow-focused agent prompt
│   └── agent_signal.txt  # Signal-focused agent prompt
└── data/
    ├── bug_*.json        # Fetched bugs (created at runtime)
    └── rca_memory/       # Stored RCA results
```