Rubric Generation Pipeline

A generator-validator loop for producing high-quality synthetic rubrics for RL training.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              PIPELINE FLOW                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐          │
│  │ GENERATE │ ──▶ │  GATE 1  │ ──▶ │  GATE 2  │ ──▶ │  GATE 3  │          │
│  │  Opus    │     │ (Auto)   │     │  Opus    │     │  Opus    │          │
│  │(high tmp)│     │40-60% rej│     │40-60%pass│     │60-80%pass│          │
│  └──────────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘          │
│                        │                │                 │                │
│                        ▼                ▼                 ▼                │
│                   ┌─────────┐     ┌───────────┐     ┌───────────┐         │
│                   │ FAILED  │     │ BORDERLINE│     │ VALIDATED │         │
│                   └─────────┘     └─────┬─────┘     └───────────┘         │
│                                         │                                  │
│                                         ▼                                  │
│                                   ┌───────────┐                           │
│                                   │  REFINE   │ ◀── both_pass failures    │
│                                   │   Opus    │                           │
│                                   └─────┬─────┘                           │
│                                         │                                  │
│                                         ▼                                  │
│                                   ┌───────────┐                           │
│                                   │ RECOVERED │ ──▶ Gates 2-3 again       │
│                                   └───────────┘                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Goal: Generate exactly 6 validated rubrics per example with diversity-aware generation

Repo Workflow, Components, and End-to-End Flow

This repo is set up as a generator → multi-gate validator → refinement loop because rubric quality is harder to verify than to draft. We bias toward high recall early (generate many candidates) and high precision late (strict gates), then recover borderline cases instead of discarding useful signal.

How the Repo Works (At a Glance)

flowchart TD
  A[Inputs: preference pairs<br/>data/examples/] --> B[Generator<br/>src/agents/generator.py]
  B --> C[Gate 1: Structural<br/>src/agents/structural.py]
  C -->|pass| D[Gate 2: Semantic<br/>src/agents/semantic.py]
  C -->|fail| Cx["Rejected pool<br/>(in memory)"]
  D -->|pass| E[Gate 3: Discrimination<br/>src/agents/discriminator.py]
  D -->|borderline| Dx["Borderline pool<br/>(in memory)"]
  D -->|fail| Dy["Rejected pool<br/>(in memory)"]
  E -->|pass| F["Validated rubrics<br/>data/validated/"]
  E -->|fail| Ex["Gate 3 failures<br/>(in memory)"]
  Dx --> H[Refiner<br/>src/agents/refiner.py]
  Ex --> H
  H --> D
  H --> E
  F --> M[Metrics<br/>data/metrics/]
  E --> M
  P[pipeline.log] -.-> M

Components and Responsibilities

flowchart LR
  subgraph Entrypoint["run.py"]
    EP[CLI with subcommands:<br/>run, metrics, check-imports, clean]
  end
  subgraph Agents["src/agents/"]
    G1[generator.py - Candidate generation]
    V1[structural.py - Gate 1]
    V2[semantic.py - Gate 2]
    V3[discriminator.py - Gate 3]
    R1[refiner.py - Refinement loop]
    O1[orchestrator.py - Pipeline orchestration]
  end
  subgraph Core["src/"]
    C1[client.py - AnthropicClient]
    C2[config.py - PipelineConfig]
    C3[models.py - Data models]
  end
  subgraph Data["data/"]
    D1[examples/]
    D3[validated/]
    D4[metrics/]
  end
  D1 --> EP
  EP --> O1
  O1 --> G1
  O1 --> V1
  O1 --> V2
  O1 --> V3
  O1 --> R1
  V3 --> D3
  EP --> D4

End-to-End Workflow (Step-by-Step)

sequenceDiagram
  autonumber
  participant U as User
  participant P as run.py
  participant O as Orchestrator
  participant G as Generator
  participant S as Gate 1 (Structural)
  participant M as Gate 2 (Semantic)
  participant D as Gate 3 (Discrimination)
  participant R as Refiner
  participant Out as Output

  U->>P: Run pipeline on examples
  P->>O: Initialize PipelineOrchestrator
  O->>G: Generate N candidate rubrics
  G-->>O: candidates kept in memory
  O->>S: Structural checks (fast, deterministic)
  S-->>O: pass/fail + routing
  O->>M: Semantic checks (Claude Opus)
  M-->>O: pass/fail + borderline
  O->>D: Discrimination checks (Claude Opus)
  D-->>O: pass/fail + both-pass cases
  O->>R: Refine borderline and both-pass failures
  R-->>O: improved candidates
  O->>M: Re-check semantics
  O->>D: Re-check discrimination
  D-->>Out: Save validated rubrics + metrics

Why It's Set Up This Way

  • Generate wide, filter hard: High-quality rubrics are rare; generating many candidates increases odds of capturing strong discriminators.
  • Cheap checks first: Structural validation is deterministic and fast, preventing wasted LLM calls on malformed rubrics.
  • Semantic before discrimination: A rubric must be clear and verifiable before we test whether it separates preferred vs. rejected.
  • Discrimination is the core signal: The final gate ensures rubrics reflect the actual preference delta, which is what RL training needs.
  • Refinement saves value: Borderline candidates often become strong after targeted edits, reducing waste without lowering standards.

Agent Handoff (Comprehensive)

Use this section to orient a coding agent quickly and accurately.

Source of Truth

  • Primary entrypoint: run.py (CLI with subcommands: run, metrics, check-imports, clean).
  • Pipeline orchestration: src/agents/orchestrator.py handles the full loop, concurrency, and I/O.
  • API client: src/client.py (AnthropicClient) — async Anthropic SDK with retry and structured output.
  • Configuration: src/config.py (PipelineConfig) — all constants in a frozen dataclass.

What Actually Runs (Code Map)

  • run.py: CLI entrypoint with subcommands.
  • src/agents/orchestrator.py: Pipeline orchestration, concurrency, and I/O.
  • src/agents/generator.py: Candidate generation (Claude Opus 4.6).
  • src/agents/structural.py: Deterministic structural checks (Gate 1).
  • src/agents/semantic.py: Semantic quality gate (Claude Opus 4.6).
  • src/agents/discriminator.py: Discrimination gate (Claude Opus 4.6).
  • src/agents/refiner.py: Refinement loop (Claude Opus 4.6).
  • src/client.py: Anthropic API client with retry, structured output, and concurrency control.
  • src/config.py: All pipeline constants (PipelineConfig frozen dataclass).
  • src/models.py: Pydantic data models.

Data Flow (Actual Outputs)

  • Inputs: data/examples/*.json
  • Outputs:
    • data/validated/<example_id>.json (one file per example containing an array of 6 rubrics)
    • data/metrics/<example_id>.json and data/metrics/summary.json
    • pipeline.log (log file in repo root)
  • Not persisted by default: intermediate candidates, gate pass/fail pools, and refinement queues are kept in memory.
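
Everything else lives only in memory for the duration of a run, so checking progress reduces to probing the documented output paths. The following is a minimal sketch using only the locations listed above; it is not part of the repo's CLI, and the default data/ root is an assumption (it changes with --output).

from pathlib import Path

def output_status(example_id: str, data_dir: Path = Path("data")) -> dict:
    # Report which persisted outputs exist for one example.
    return {
        "validated": (data_dir / "validated" / f"{example_id}.json").exists(),
        "metrics": (data_dir / "metrics" / f"{example_id}.json").exists(),
    }

print(output_status("pref-0000"))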

Key Behaviors and Gotchas

  • Normalization: afm_response is auto-mapped to rejected_response; metadata.rubric_criteria becomes rubric_hints.
  • Deduping: Candidates are deduped by rubric text before gates.
  • Gate order: Gate 1 → Gate 2 → Gate 3; borderline + both-pass failures are refined.
  • Randomization: Gate 3 randomizes response order to avoid position bias.
  • Diversity-aware generation: After each Gate 3 round, the pipeline analyzes category coverage and targets underrepresented failure categories in subsequent generation rounds.
  • Forced fill: If an example is still short of 6 rubrics after the maximum number of attempts, the pipeline fills the remainder from the best-scoring failures or from fallback templates/hints.
  • Concurrency knobs: --parallel, --gate2-concurrency, --gate3-concurrency (CLI args) + PARALLEL_EXAMPLES, GATE2_CONCURRENCY, GATE3_CONCURRENCY (env vars) + defaults in src/config.py.
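
The concurrency values can come from three places. The sketch below assumes a CLI flag overrides the matching environment variable, which in turn overrides the PipelineConfig default; the actual resolution order is decided in run.py and src/config.py, so treat this as illustrative.

import os

def resolve_concurrency(cli_value, env_var: str, default: int) -> int:
    # Assumed precedence: CLI flag > environment variable > config default.
    if cli_value is not None:                # e.g. --gate2-concurrency 30
        return int(cli_value)
    env_value = os.environ.get(env_var)      # e.g. GATE2_CONCURRENCY
    if env_value is not None:
        return int(env_value)
    return default                           # e.g. PipelineConfig.gate2_concurrency

gate2_limit = resolve_concurrency(None, "GATE2_CONCURRENCY", 30)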

Model and Configuration

All LLM-backed stages (generation, Gates 2-3, and refinement) use Claude Opus 4.6 (claude-opus-4-6), configured in src/config.py:

@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    model: str = "anthropic/claude-opus-4-6"
    # ... temperature/token settings per stage

How to Run (CLI)

uv run python run.py run --examples data/examples/pref-0000.json
uv run python run.py run --examples data/examples/ --parallel 12
uv run python run.py metrics
uv run python run.py check-imports
uv run python run.py clean

Quick Start

1. Setup

# Navigate to project
cd rubric_generation

# Install dependencies
uv sync

# Set your API key
export LB_API_KEY='your-labelbox-key'

# Or copy the example env file
cp .env.example .env
# Edit .env with your Labelbox key

2. Prepare Your Data

Place your examples in data/examples/:

{
  "id": "pref-0000",
  "prompt": "User's original prompt/conversation",
  "preferred_response": "The better response",
  "afm_response": "The worse response (rejected)",
  "category_hint": "scheduling",
  "metadata": {
    "rubric_criteria": [
      "Hint 1 from human annotators",
      "Hint 2 from human annotators"
    ]
  }
}

Note: The pipeline accepts either afm_response or rejected_response as the field name for the rejected response.
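
A minimal sketch of that normalization, mirroring the field mapping described in "Key Behaviors and Gotchas" (afm_response becomes rejected_response, metadata.rubric_criteria becomes rubric_hints). The actual handling lives in src/models.py and src/agents/orchestrator.py and may differ in detail:

import json
from pathlib import Path

def normalize_example(raw: dict) -> dict:
    example = dict(raw)
    # afm_response is accepted as an alias for rejected_response.
    if "rejected_response" not in example and "afm_response" in example:
        example["rejected_response"] = example.pop("afm_response")
    # metadata.rubric_criteria is surfaced as rubric_hints.
    example.setdefault("rubric_hints", example.get("metadata", {}).get("rubric_criteria", []))
    return example

raw = json.loads(Path("data/examples/pref-0000.json").read_text())
print(sorted(normalize_example(raw)))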

3. Run the Pipeline

# Single example
uv run python run.py run --examples data/examples/pref-0000.json

# All examples in directory
uv run python run.py run --examples data/examples/

# Custom output directory
uv run python run.py run --examples data/examples/ --output results/

High-throughput options (recommended for large datasets):

# Shard across multiple terminals/machines
uv run python run.py run --examples data/examples/ --shard 0/4

# Resume mode (default on) skips completed examples
uv run python run.py run --examples data/examples/ --resume

# Override concurrency
uv run python run.py run --examples data/examples/ --parallel 12 --gate2-concurrency 30 --gate3-concurrency 12

4. Check Results

# Validated rubrics
ls data/validated/

# Pipeline metrics
uv run python run.py metrics

# Metrics for a specific example
uv run python run.py metrics pref-0010

Directory Structure

rubric_generation/
├── run.py                  # CLI entrypoint (subcommands: run, metrics, check-imports, clean)
├── pyproject.toml          # Dependencies (openai, pydantic, python-dotenv, tqdm)
├── .env.example            # API key template
│
├── src/
│   ├── client.py           # AnthropicClient (async, retry, structured output)
│   ├── config.py           # PipelineConfig (frozen dataclass)
│   ├── models.py           # Pydantic data models
│   ├── agents/
│   │   ├── orchestrator.py # Pipeline orchestration
│   │   ├── generator.py    # Candidate generation
│   │   ├── structural.py   # Gate 1: Structural validation
│   │   ├── semantic.py     # Gate 2: Semantic validation
│   │   ├── discriminator.py# Gate 3: Discrimination validation
│   │   ├── refiner.py      # Refinement loop
│   │   └── base.py         # Base agent class
│   └── utils/
│       └── json_parser.py  # JSON extraction from LLM output
│
├── data/
│   ├── examples/           # Input: (prompt, preferred, rejected) tuples
│   ├── validated/          # OUTPUT: Final validated rubrics
│   └── metrics/            # Pipeline metrics
│
├── RUBRIC_STYLE_GUIDE.md   # Mined rubric principles (input to generator)
├── pipeline.log            # Run logs (appended)
└── README.md               # This file

The Three Gates and the Refinement Loop

Gate 1: Structural Validation

  • Model: None (deterministic checks; no LLM calls)
  • Target: Reject 40-60%
  • Checks: Format, atomicity, self-containment, actionable language, anti-patterns
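
For intuition, a Gate 1-style check might look like the sketch below. The real deterministic rules are in src/agents/structural.py; the specific thresholds, the bracketed-prefix check, and the anti-pattern list here are assumptions.

import re

ANTI_PATTERNS = ("good", "appropriate", "etc.")   # vague wording (assumed list)

def passes_gate1(rubric_text: str) -> bool:
    text = rubric_text.strip()
    if not (text.startswith("[") and "]" in text):    # expected "[Objective] ..." style prefix
        return False
    if text.count(" and ") > 1:                       # crude atomicity check: one criterion per rubric
        return False
    if re.search(r"\b(as mentioned above|the previous answer)\b", text, re.I):
        return False                                  # self-containment: no external references
    if any(p in text.lower() for p in ANTI_PATTERNS): # anti-patterns: subjective or vague terms
        return False
    return True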

Gate 2: Semantic Validation

  • Model: Claude Opus 4.6
  • Target: Pass 40-60% of Gate 1 survivors
  • Checks: Quality checklist (self-containment, specificity 2-4, unambiguity, verifiability, meaningfulness)

Gate 3: Discrimination Validation

  • Model: Claude Opus 4.6
  • Target: Pass 60-80% of Gate 2 survivors
  • Checks: Does the rubric correctly distinguish preferred from rejected?
  • This is the most important gate
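
A sketch of the order randomization this gate uses (see "Key Behaviors and Gotchas"). The prompt construction and verdict schema live in src/agents/discriminator.py; the judge callable and its "A"/"B"/"both"/"neither" return values here are hypothetical stand-ins for the Claude call.

import random

def discrimination_check(rubric: str, preferred: str, rejected: str, judge) -> dict:
    # Shuffle which response is shown as A vs. B to avoid position bias.
    pair = [("preferred", preferred), ("rejected", rejected)]
    random.shuffle(pair)
    labels = {"A": pair[0][0], "B": pair[1][0]}
    verdict = judge(rubric=rubric, response_a=pair[0][1], response_b=pair[1][1])
    if verdict in ("A", "B"):
        return {"passed": labels[verdict] == "preferred", "both_pass": False}
    # "both" feeds the refinement loop; "neither" is a plain failure.
    return {"passed": False, "both_pass": verdict == "both"}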

Refinement Loop

  • Model: Claude Opus 4.6
  • Input: Borderline candidates + "both pass" failures
  • Max cycles: 2 per candidate
  • Strategies: Increase specificity, decrease specificity, fix ambiguity, strengthen discrimination
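
A rough sketch of how a refinement strategy could be chosen from a gate outcome. The strategy names come from the list above; the failure_reason values and the selection rules are assumptions, so defer to src/agents/refiner.py.

MAX_REFINEMENT_CYCLES = 2

def pick_strategy(failure_reason: str) -> str:
    if failure_reason == "both_pass":        # rubric too easy: both responses satisfy it
        return "strengthen_discrimination"
    if failure_reason == "ambiguous":
        return "fix_ambiguity"
    if failure_reason == "too_broad":
        return "increase_specificity"
    if failure_reason == "too_narrow":
        return "decrease_specificity"
    return "strengthen_discrimination"       # default: push harder on the preference delta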

Key Metrics to Monitor

Metric                        Target    Alert if
Gate 1 rejection rate         40-60%    <30% (too clean) or >70% (quality issue)
Gate 2 pass rate              40-60%    <30% (semantic issues)
Gate 3 pass rate              60-80%    <50% (discrimination problems)
Inverted rate                 <5%       >5% (fundamental generator issue)
Refinement success            20-40%    <15% (refinement prompts need work)
Avg discrimination strength   >3.0      <2.5 (weak rubrics)
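
The thresholds in the table translate directly into alert rules. The sketch below assumes a flat dict of rates; the real field names in data/metrics/summary.json may differ, so check an actual summary file before wiring this up.

def metric_alerts(m: dict) -> list:
    alerts = []
    if m["gate1_rejection_rate"] < 0.30:
        alerts.append("Gate 1 rejection <30%: candidates look too clean")
    elif m["gate1_rejection_rate"] > 0.70:
        alerts.append("Gate 1 rejection >70%: generation quality issue")
    if m["gate2_pass_rate"] < 0.30:
        alerts.append("Gate 2 pass rate <30%: semantic issues")
    if m["gate3_pass_rate"] < 0.50:
        alerts.append("Gate 3 pass rate <50%: discrimination problems")
    if m["inverted_rate"] > 0.05:
        alerts.append("Inverted rate >5%: fundamental generator issue")
    if m["refinement_success"] < 0.15:
        alerts.append("Refinement success <15%: refinement prompts need work")
    if m["avg_discrimination_strength"] < 2.5:
        alerts.append("Avg discrimination strength <2.5: weak rubrics")
    return alerts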

Configuration

All pipeline constants are in src/config.py (PipelineConfig frozen dataclass):

@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    max_total_candidates: int = 300
    max_no_new_rounds: int = 3
    max_seconds_per_example: int = 240

    parallel_examples: int = 12
    gate2_concurrency: int = 30
    gate3_concurrency: int = 12

    model: str = "anthropic/claude-opus-4-6"

    generator_temperature: float = 0.8
    gate2_temperature: float = 0.2
    gate3_temperature: float = 0.1
    refinement_temperature: float = 0.6

    failure_categories: tuple[str, ...] = (
        "instruction-retention",
        "inference-memory",
        "version-editing",
        "self-coherence",
    )
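
Because the config is a frozen dataclass, overrides are made by copying rather than mutating. A minimal usage sketch, assuming the package is importable as src from the repo root:

from dataclasses import replace
from src.config import PipelineConfig

cfg = PipelineConfig()                    # all defaults shown above
print(cfg.target_rubrics, cfg.model)      # 6 anthropic/claude-opus-4-6

# frozen=True means attribute assignment raises FrozenInstanceError;
# create an adjusted copy instead.
small_run = replace(cfg, parallel_examples=4, initial_candidates=12)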

Troubleshooting

High "both_pass" rate

  • Many rubrics are satisfied by both responses, meaning generation isn't targeting the actual contrast between preferred and rejected
  • Solution: Add more specific constraint prompts in the generator

High inverted rate

  • Generator fundamentally misunderstands what makes preferred better
  • Solution: Re-analyze examples, check for ambiguous preferences

Low refinement success

  • Refinement strategies aren't effective
  • Solution: Add more targeted strategies, reduce max cycles, or accept a lower recovery rate

Output Format

The pipeline generates exactly 6 validated rubrics per example in data/validated/<example_id>.json:

data/validated/
├── pref-0000.json
├── pref-0001.json
└── ...

Each file contains an array of 6 rubrics:

[
  {
    "id": "rubric-uuid",
    "example_id": "pref-0000",
    "rubric_text": "[Objective] The response should...",
    "abstraction_level": "category",
    "category": "instruction-retention",
    "gate3_result": {
      "discrimination_strength": 4,
      "high_value": true,
      "discrimination_type": "clean",
      "passed": true
    }
  },
  // ... 5 more rubrics
]

Integration with RL Training

For RL training:

  • Use abstraction_level: "category" rubrics for generalization
  • Prioritize high_value: true rubrics (discriminate on subtle differences)
  • Filter by discrimination_strength >= 3 for clean gradients
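
Applied to one validated file, those three criteria look like the sketch below (treating "prioritize high_value" as a hard filter for simplicity). Paths and field names follow the output format shown above.

import json
from pathlib import Path

def rl_ready_rubrics(path: Path) -> list:
    rubrics = json.loads(path.read_text())
    return [
        r for r in rubrics
        if r.get("abstraction_level") == "category"
        and r.get("gate3_result", {}).get("high_value")
        and r.get("gate3_result", {}).get("discrimination_strength", 0) >= 3
    ]

selected = rl_ready_rubrics(Path("data/validated/pref-0000.json"))
print(f"{len(selected)} rubrics selected for RL training")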

References

  • RUBRIC_STYLE_GUIDE.md - Mined rubric principles
  • src/config.py - All pipeline constants
  • src/client.py - API client configuration
