Agent-aware code quality system for multi-agent codebases.
In 2026, code is written by fleets of AI agents. Arbiter knows who wrote each line — human or AI — and scores quality accordingly.
| Feature | Traditional Tools | Arbiter |
|---|---|---|
| Agent attribution | None | First-class: tracks Claude, Codex, Gemini, Copilot, humans |
| Per-commit scoring | Repo-wide only | Scores each commit's changed files individually |
| Diff analysis | N/A | Score only what changed in a PR/branch |
| Transparency | Opaque score | Every score decomposes into lint + security + complexity |
| Agent-specific gates | N/A | Different quality thresholds per agent trust tier |
| Tool integration | Proprietary | Wraps tools you already trust: ruff, Bandit, radon, vulture |
| Dashboard | SaaS login | Single HTML file with per-agent timelines, commit feed, fleet view |
| Dependencies | Heavy | Analysis tools only; core is stdlib Python |
```bash
git clone https://github.com/hummbl-dev/arbiter.git
cd arbiter

# Install (makes `arbiter` command available)
pip install ".[analyzers]"

# Quick score (no persistence)
arbiter score /path/to/your/repo

# Full analysis with per-commit agent attribution
arbiter analyze /path/to/your/repo

# Score only files changed since main
arbiter diff /path/to/your/repo --base main

# Agent leaderboard
arbiter agents

# Start dashboard
arbiter serve --port 8080
# Open http://localhost:8080
```

Run without installing:

```bash
PYTHONPATH=src python -m arbiter score /path/to/your/repo
```

Docker:

```bash
docker build -t arbiter .
docker run -p 8080:8080 -v /path/to/repo:/repo:ro arbiter
```

```
Git Repo ──→ [Git Historian] ──→ [Analyzer Runner] ──→ [Scoring Engine] ──→ [SQLite Store]
                  │                     │                     │                    │
          agent attribution      tool invocation       weighted rubric        trend data
          (Co-Authored-By,       (ruff, radon,         (lint 35%,                 │
           email matching)        vulture, bandit)      security 30%,             ├──→ REST API
                                                        complexity 35%)           └──→ Dashboard

                                 ┌─────────────┐
                                 │Diff Analyzer│ ←── v0.2: scores only changed files per commit/branch
                                 └─────────────┘
```
Every commit is scored against only the files it changed, not the entire repo. This makes the agent leaderboard meaningful — a commit that touches 1 clean file scores differently than one that touches 10 messy files.
arbiter diff scores only files changed since a base branch. Ideal for CI/PR quality gates — fast, scoped, actionable.
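A minimal sketch of the diff-scoping idea. The function name and the exact git invocation are assumptions for illustration; Arbiter's real implementation may differ:

```python
import subprocess

def changed_files(repo: str, base: str = "main") -> list[str]:
    """List files changed on the current branch relative to `base`.

    Hypothetical helper: `base...HEAD` diffs against the merge base,
    so only this branch's own changes are scored.
    """
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f]
```

Scoring only this list, rather than the whole tree, is what keeps a PR gate fast and scoped to the author's actual changes.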
Arbiter identifies which agent authored each commit:
- Co-Authored-By trailer — `Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>`
- Author email — maps `noreply@anthropic.com` → claude, `codex@openai.com` → codex
- Default — "human" if no agent pattern matches
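The three rules above can be sketched in a few lines. The pattern table and email map below are illustrative stand-ins, not Arbiter's actual internals:

```python
import re

# Illustrative data only; Arbiter loads these from agents.yml.
CO_AUTHOR_PATTERNS = {
    "claude": re.compile(r"Claude\s+(Opus|Sonnet|Haiku)"),
}
EMAIL_MAP = {
    "noreply@anthropic.com": "claude",
    "codex@openai.com": "codex",
}
TRAILER = re.compile(
    r"^Co-Authored-By:\s*(?P<name>.+?)\s*<(?P<email>[^>]+)>", re.MULTILINE
)

def attribute(commit_message: str, author_email: str) -> str:
    # 1. A Co-Authored-By trailer wins if it matches a known agent.
    for m in TRAILER.finditer(commit_message):
        for agent, pattern in CO_AUTHOR_PATTERNS.items():
            if pattern.search(m.group("name")):
                return agent
        if m.group("email") in EMAIL_MAP:
            return EMAIL_MAP[m.group("email")]
    # 2. Otherwise, fall back to the author's email address.
    if author_email in EMAIL_MAP:
        return EMAIL_MAP[author_email]
    # 3. Default: human.
    return "human"
```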
Configure in `agents.yml`:

```yaml
agents:
  - name: claude
    emails: [noreply@anthropic.com]
    co_author_patterns: ["Claude\\s+(Opus|Sonnet|Haiku)"]
    trust_tier: verified
    quality_threshold: 70.0
  - name: gemini
    trust_tier: probation
    quality_threshold: 80.0  # Higher bar for probationary agents
```

| Analyzer | Tool | What It Finds |
|---|---|---|
| Lint | ruff | Style violations, import errors, bugbear patterns |
| Complexity | radon | Cyclomatic complexity (grade A-F per function) |
| Security | bandit | Hardcoded secrets, shell injection, dangerous patterns |
| Dead Code | vulture | Unused functions, imports, variables |
| Duplication | AST hash | Near-duplicate function bodies |
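The AST-hash duplication check can be approximated as below. Note this simplified sketch hashes the raw `ast.dump`, which is identifier-sensitive; true *near*-duplicate detection presumably normalizes names first, so treat this as the idea, not Arbiter's implementation:

```python
import ast
import hashlib

def function_fingerprints(source: str) -> dict[str, str]:
    """Map each function name to a hash of its body's AST dump.

    Functions with identical bodies get identical fingerprints,
    flagging copy-pasted code.
    """
    fingerprints = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # Dump only the body, so the function's own name doesn't
            # affect the hash. Line numbers are excluded by default.
            dump = ast.dump(ast.Module(body=node.body, type_ignores=[]))
            fingerprints[node.name] = hashlib.sha256(dump.encode()).hexdigest()[:12]
    return fingerprints
```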
Deterministic. Same code → same score. Always.
```
Overall = Lint (35%) + Security (30%) + Complexity (35%)

Penalty points by severity:
  CRITICAL: 50 | HIGH: 20 | MEDIUM: 5 | LOW: 1

Score = 100 - (total_penalty / LOC) * normalization_factor

Grades: A (90+) | B (80+) | C (70+) | D (60+) | F (<60)
```
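A hedged sketch of this rubric in Python, with a worked example. The penalty table and grade cutoffs come from the rubric above; the `normalization_factor` is not specified here, so the value below is an assumption:

```python
# Penalties and cutoffs from the rubric above; NORMALIZATION is an
# assumed value, not Arbiter's documented factor.
PENALTY = {"CRITICAL": 50, "HIGH": 20, "MEDIUM": 5, "LOW": 1}
NORMALIZATION = 100  # assumption

def score(findings: list[str], loc: int) -> float:
    """Score a file from its finding severities and line count."""
    total_penalty = sum(PENALTY[sev] for sev in findings)
    return max(0.0, 100.0 - (total_penalty / loc) * NORMALIZATION)

def grade(s: float) -> str:
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if s >= cutoff:
            return letter
    return "F"

# Example: 1 HIGH + 2 LOW = 22 penalty points in 1,000 LOC
# → 100 - (22 / 1000) * 100 = 97.8 → grade A
```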
Single HTML file with Chart.js. No build step, no React, no npm.
- Score Card — Big number + breakdown bars
- Agent Leaderboard — Who writes the best code? Color-coded by agent
- Per-Agent Quality Timeline — Score over time per agent (not just repo-wide)
- Commit Feed — Recent commits with agent, score, changes, timestamp
- Hotspot Files — Ranked by finding count
- Fleet View — Multi-repo quality grid with color-coded scores
- Tabbed UI — Overview, Commits, Fleet tabs
```
GET /api/score                  Current repo score
GET /api/agents                 Agent leaderboard
GET /api/agents/{name}/trend    Per-agent quality over time
GET /api/trend?days=30          Quality over time
GET /api/worst?limit=20         Worst files
GET /api/commits                Recent commits with scores
GET /api/commits/{hash}         Detail for one commit
GET /api/fleet                  Fleet report (multi-repo)
GET /api/health                 System health
```
```
arbiter analyze <repo>                      # Full analysis + per-commit scoring + persist
arbiter score <repo> [--json] [--exclude]   # Quick score (no persist)
arbiter diff <repo> [--base main] [--json]  # Score only changed files vs base branch
arbiter agents                              # Agent leaderboard
arbiter trend [--days 30]                   # Quality trend
arbiter worst [--limit 20]                  # Worst files
arbiter commits [--agent claude]            # Recent commits
arbiter audit-fleet <directory>             # Audit all repos in a directory
arbiter fleet-report                        # Fleet quality summary
arbiter triage                              # Auto-classify repos: green/yellow/red/archive
arbiter fix <repo> [--dry-run]              # Auto-fix ruff findings + before/after score
arbiter serve [--port 8080]                 # API + dashboard
```

Run the test suite:

```bash
pip install ".[test]"
PYTHONPATH=src python -m pytest tests/ -v
# 78 tests, <7 seconds
```

Requirements:

- Python 3.11+
- git (for historian)
- Optional: ruff, radon, vulture, bandit (for full analysis)
- Docker (for containerized deployment)
This repo is part of the HUMMBL cognitive AI architecture. Related repos:
| Repo | Purpose |
|---|---|
| hummbl-governance | Governance primitives that Arbiter scores repos against |
| base120 | Deterministic cognitive framework: 120 mental models across 6 transformations |
| mcp-server | Model Context Protocol server for Base120 integration |
| agentic-patterns | Stdlib-only safety patterns for agentic AI systems |
| governed-iac-reference | Reference architecture for governed infrastructure-as-code |
Learn more at hummbl.io.
MIT — see LICENSE.
Built by HUMMBL LLC from production experience coordinating Claude, Codex, Gemini, and human engineers on a 6,000+ test codebase.