Rubric Generation Pipeline

A generator-validator loop for producing high-quality synthetic rubrics for RL training.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              PIPELINE FLOW                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐          │
│  │ GENERATE │ ──▶ │  GATE 1  │ ──▶ │  GATE 2  │ ──▶ │  GATE 3  │          │
│  │  Opus    │     │ (Auto)   │     │  Opus    │     │  Opus    │          │
│  │(high tmp)│     │40-60% rej│     │40-60%pass│     │60-80%pass│          │
│  └──────────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘          │
│                        │                │                 │                │
│                        ▼                ▼                 ▼                │
│                   ┌─────────┐     ┌───────────┐     ┌───────────┐         │
│                   │ FAILED  │     │ BORDERLINE│     │ VALIDATED │         │
│                   └─────────┘     └─────┬─────┘     └───────────┘         │
│                                         │                                  │
│                                         ▼                                  │
│                                   ┌───────────┐                           │
│                                   │  REFINE   │ ◀── both_pass failures    │
│                                   │   Opus    │                           │
│                                   └─────┬─────┘                           │
│                                         │                                  │
│                                         ▼                                  │
│                                   ┌───────────┐                           │
│                                   │ RECOVERED │ ──▶ Gates 2-3 again       │
│                                   └───────────┘                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Goal: Generate exactly 6 validated rubrics per example with diversity-aware generation

Repo Workflow, Components, and End-to-End Flow

This repo is set up as a generator → multi-gate validator → refinement loop because rubric quality is harder to verify than to draft. We bias toward high recall early (generate many candidates) and high precision late (strict gates), then recover borderline cases instead of discarding useful signal.

How the Repo Works (At a Glance)

flowchart TD
  A[Inputs: preference pairs<br/>data/examples/] --> B[Generator<br/>src/agents/generator.py]
  B --> C[Gate 1: Structural<br/>src/agents/structural.py]
  C -->|pass| D[Gate 2: Semantic<br/>src/agents/semantic.py]
  C -->|fail| Cx["Rejected pool<br/>(in memory)"]
  D -->|pass| E[Gate 3: Discrimination<br/>src/agents/discriminator.py]
  D -->|borderline| Dx["Borderline pool<br/>(in memory)"]
  D -->|fail| Dy["Rejected pool<br/>(in memory)"]
  E -->|pass| F["Validated rubrics<br/>data/validated/"]
  E -->|fail| Ex["Gate 3 failures<br/>(in memory)"]
  Dx --> H[Refiner<br/>src/agents/refiner.py]
  Ex --> H
  H --> D
  H --> E
  F --> M[Metrics<br/>data/metrics/]
  E --> M
  P[pipeline.log] -.-> M

Components and Responsibilities

flowchart LR
  subgraph Entrypoint["run.py"]
    EP[CLI with subcommands:<br/>run, metrics, check-imports, clean]
  end
  subgraph Agents["src/agents/"]
    G1[generator.py - Candidate generation]
    V1[structural.py - Gate 1]
    V2[semantic.py - Gate 2]
    V3[discriminator.py - Gate 3]
    R1[refiner.py - Refinement loop]
    O1[orchestrator.py - Pipeline orchestration]
  end
  subgraph Core["src/"]
    C1[client.py - AnthropicClient]
    C2[config.py - PipelineConfig]
    C3[models.py - Data models]
  end
  subgraph Data["data/"]
    D1[examples/]
    D3[validated/]
    D4[metrics/]
  end
  D1 --> EP
  EP --> O1
  O1 --> G1
  O1 --> V1
  O1 --> V2
  O1 --> V3
  O1 --> R1
  V3 --> D3
  EP --> D4

End-to-End Workflow (Step-by-Step)

sequenceDiagram
  autonumber
  participant U as User
  participant P as run.py
  participant O as Orchestrator
  participant G as Generator
  participant S as Gate 1 (Structural)
  participant M as Gate 2 (Semantic)
  participant D as Gate 3 (Discrimination)
  participant R as Refiner
  participant Out as Output

  U->>P: Run pipeline on examples
  P->>O: Initialize PipelineOrchestrator
  O->>G: Generate N candidate rubrics
  G-->>O: candidates kept in memory
  O->>S: Structural checks (fast, deterministic)
  S-->>O: pass/fail + routing
  O->>M: Semantic checks (Claude Opus)
  M-->>O: pass/fail + borderline
  O->>D: Discrimination checks (Claude Opus)
  D-->>O: pass/fail + both-pass cases
  O->>R: Refine borderline and both-pass failures
  R-->>O: improved candidates
  O->>M: Re-check semantics
  O->>D: Re-check discrimination
  D-->>Out: Save validated rubrics + metrics

Why It's Set Up This Way

  • Generate wide, filter hard: High-quality rubrics are rare; generating many candidates increases odds of capturing strong discriminators.
  • Cheap checks first: Structural validation is deterministic and fast, preventing wasted LLM calls on malformed rubrics.
  • Semantic before discrimination: A rubric must be clear and verifiable before we test whether it separates preferred vs. rejected.
  • Discrimination is the core signal: The final gate ensures rubrics reflect the actual preference delta, which is what RL training needs.
  • Refinement saves value: Borderline candidates often become strong after targeted edits, reducing waste without lowering standards.

Agent Handoff (Comprehensive)

Use this section to orient a coding agent quickly and accurately.

Source of Truth

  • Primary entrypoint: run.py (CLI with subcommands: run, metrics, check-imports, clean).
  • Pipeline orchestration: src/agents/orchestrator.py handles the full loop, concurrency, and I/O.
  • API client: src/client.py (AnthropicClient) — async Anthropic SDK with retry and structured output.
  • Configuration: src/config.py (PipelineConfig) — all constants in a frozen dataclass.

What Actually Runs (Code Map)

  • run.py: CLI entrypoint with subcommands.
  • src/agents/orchestrator.py: Pipeline orchestration, concurrency, and I/O.
  • src/agents/generator.py: Candidate generation (Claude Opus 4.6).
  • src/agents/structural.py: Deterministic structural checks (Gate 1).
  • src/agents/semantic.py: Semantic quality gate (Claude Opus 4.6).
  • src/agents/discriminator.py: Discrimination gate (Claude Opus 4.6).
  • src/agents/refiner.py: Refinement loop (Claude Opus 4.6).
  • src/client.py: Anthropic API client with retry, structured output, and concurrency control.
  • src/config.py: All pipeline constants (PipelineConfig frozen dataclass).
  • src/models.py: Pydantic data models.

Data Flow (Actual Outputs)

  • Inputs: data/examples/*.json
  • Outputs:
    • data/validated/<example_id>.json (one file per example containing an array of 6 rubrics)
    • data/metrics/<example_id>.json and data/metrics/summary.json
    • pipeline.log (log file in repo root)
  • Not persisted by default: intermediate candidates, gate pass/fail pools, and refinement queues are kept in memory.
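
Everything else lives only in memory for the duration of a run, so checking progress reduces to probing the documented output paths. The following is a minimal sketch using only the locations listed above; it is not part of the repo's CLI, and the default data/ root is an assumption (it changes with --output).

from pathlib import Path

def output_status(example_id: str, data_dir: Path = Path("data")) -> dict:
    # Report which persisted outputs exist for one example.
    return {
        "validated": (data_dir / "validated" / f"{example_id}.json").exists(),
        "metrics": (data_dir / "metrics" / f"{example_id}.json").exists(),
    }

print(output_status("pref-0000"))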

Key Behaviors and Gotchas

  • Normalization: afm_response is auto-mapped to rejected_response; metadata.rubric_criteria becomes rubric_hints.
  • Deduping: Candidates are deduped by rubric text before gates.
  • Gate order: Gate 1 → Gate 2 → Gate 3; borderline + both-pass failures are refined.
  • Randomization: Gate 3 randomizes response order to avoid position bias.
  • Diversity-aware generation: After each Gate 3 round, the pipeline analyzes category coverage and targets underrepresented failure categories in subsequent generation rounds.
  • Forced fill: If an example is still short of 6 rubrics after the maximum number of attempts, the pipeline fills the remainder from the best-scoring failures or from fallback templates/hints.
  • Concurrency knobs: --parallel, --gate2-concurrency, --gate3-concurrency (CLI args) + PARALLEL_EXAMPLES, GATE2_CONCURRENCY, GATE3_CONCURRENCY (env vars) + defaults in src/config.py.
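
The concurrency values can come from three places. The sketch below assumes a CLI flag overrides the matching environment variable, which in turn overrides the PipelineConfig default; the actual resolution order is decided in run.py and src/config.py, so treat this as illustrative.

import os

def resolve_concurrency(cli_value, env_var: str, default: int) -> int:
    # Assumed precedence: CLI flag > environment variable > config default.
    if cli_value is not None:                # e.g. --gate2-concurrency 30
        return int(cli_value)
    env_value = os.environ.get(env_var)      # e.g. GATE2_CONCURRENCY
    if env_value is not None:
        return int(env_value)
    return default                           # e.g. PipelineConfig.gate2_concurrency

gate2_limit = resolve_concurrency(None, "GATE2_CONCURRENCY", 30)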

Model and Configuration

All LLM-backed stages (generation, Gates 2-3, and refinement) use Claude Opus 4.6 (claude-opus-4-6), configured in src/config.py:

@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    model: str = "anthropic/claude-opus-4-6"
    # ... temperature/token settings per stage

How to Run (CLI)

uv run python run.py run --examples data/examples/pref-0000.json
uv run python run.py run --examples data/examples/ --parallel 12
uv run python run.py metrics
uv run python run.py check-imports
uv run python run.py clean

Quick Start

1. Setup

# Navigate to project
cd rubric_generation

# Install dependencies
uv sync

# Set your API key
export LB_API_KEY='your-labelbox-key'

# Or copy the example env file
cp .env.example .env
# Edit .env with your Labelbox key

2. Prepare Your Data

Place your examples in data/examples/:

{
  "id": "pref-0000",
  "prompt": "User's original prompt/conversation",
  "preferred_response": "The better response",
  "afm_response": "The worse response (rejected)",
  "category_hint": "scheduling",
  "metadata": {
    "rubric_criteria": [
      "Hint 1 from human annotators",
      "Hint 2 from human annotators"
    ]
  }
}

Note: The pipeline accepts either afm_response or rejected_response as the field name for the rejected response.
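
A minimal sketch of that normalization, mirroring the field mapping described in "Key Behaviors and Gotchas" (afm_response becomes rejected_response, metadata.rubric_criteria becomes rubric_hints). The actual handling lives in src/models.py and src/agents/orchestrator.py and may differ in detail:

import json
from pathlib import Path

def normalize_example(raw: dict) -> dict:
    example = dict(raw)
    # afm_response is accepted as an alias for rejected_response.
    if "rejected_response" not in example and "afm_response" in example:
        example["rejected_response"] = example.pop("afm_response")
    # metadata.rubric_criteria is surfaced as rubric_hints.
    example.setdefault("rubric_hints", example.get("metadata", {}).get("rubric_criteria", []))
    return example

raw = json.loads(Path("data/examples/pref-0000.json").read_text())
print(sorted(normalize_example(raw)))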

3. Run the Pipeline

# Single example
uv run python run.py run --examples data/examples/pref-0000.json

# All examples in directory
uv run python run.py run --examples data/examples/

# Custom output directory
uv run python run.py run --examples data/examples/ --output results/

High-throughput options (recommended for large datasets):

# Shard across multiple terminals/machines
uv run python run.py run --examples data/examples/ --shard 0/4

# Resume mode (default on) skips completed examples
uv run python run.py run --examples data/examples/ --resume

# Override concurrency
uv run python run.py run --examples data/examples/ --parallel 12 --gate2-concurrency 30 --gate3-concurrency 12

4. Check Results

# Validated rubrics
ls data/validated/

# Pipeline metrics
uv run python run.py metrics

# Metrics for a specific example
uv run python run.py metrics pref-0010

Directory Structure

rubric_generation/
├── run.py                  # CLI entrypoint (subcommands: run, metrics, check-imports, clean)
├── pyproject.toml          # Dependencies (openai, pydantic, python-dotenv, tqdm)
├── .env.example            # API key template
│
├── src/
│   ├── client.py           # AnthropicClient (async, retry, structured output)
│   ├── config.py           # PipelineConfig (frozen dataclass)
│   ├── models.py           # Pydantic data models
│   ├── agents/
│   │   ├── orchestrator.py # Pipeline orchestration
│   │   ├── generator.py    # Candidate generation
│   │   ├── structural.py   # Gate 1: Structural validation
│   │   ├── semantic.py     # Gate 2: Semantic validation
│   │   ├── discriminator.py# Gate 3: Discrimination validation
│   │   ├── refiner.py      # Refinement loop
│   │   └── base.py         # Base agent class
│   └── utils/
│       └── json_parser.py  # JSON extraction from LLM output
│
├── data/
│   ├── examples/           # Input: (prompt, preferred, rejected) tuples
│   ├── validated/          # OUTPUT: Final validated rubrics
│   └── metrics/            # Pipeline metrics
│
├── RUBRIC_STYLE_GUIDE.md   # Mined rubric principles (input to generator)
├── pipeline.log            # Run logs (appended)
└── README.md               # This file

The Three Gates and the Refinement Loop

Gate 1: Structural Validation

  • Model: None (deterministic checks; no LLM calls)
  • Target: Reject 40-60%
  • Checks: Format, atomicity, self-containment, actionable language, anti-patterns
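
For intuition, a Gate 1-style check might look like the sketch below. The real deterministic rules are in src/agents/structural.py; the specific thresholds, the bracketed-prefix check, and the anti-pattern list here are assumptions.

import re

ANTI_PATTERNS = ("good", "appropriate", "etc.")   # vague wording (assumed list)

def passes_gate1(rubric_text: str) -> bool:
    text = rubric_text.strip()
    if not (text.startswith("[") and "]" in text):    # expected "[Objective] ..." style prefix
        return False
    if text.count(" and ") > 1:                       # crude atomicity check: one criterion per rubric
        return False
    if re.search(r"\b(as mentioned above|the previous answer)\b", text, re.I):
        return False                                  # self-containment: no external references
    if any(p in text.lower() for p in ANTI_PATTERNS): # anti-patterns: subjective or vague terms
        return False
    return True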

Gate 2: Semantic Validation

  • Model: Claude Opus 4.6
  • Target: Pass 40-60% of Gate 1 survivors
  • Checks: Quality checklist (self-containment, specificity 2-4, unambiguity, verifiability, meaningfulness)

Gate 3: Discrimination Validation

  • Model: Claude Opus 4.6
  • Target: Pass 60-80% of Gate 2 survivors
  • Checks: Does the rubric correctly distinguish preferred from rejected?
  • This is the most important gate
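
A sketch of the order randomization this gate uses (see "Key Behaviors and Gotchas"). The prompt construction and verdict schema live in src/agents/discriminator.py; the judge callable and its "A"/"B"/"both"/"neither" return values here are hypothetical stand-ins for the Claude call.

import random

def discrimination_check(rubric: str, preferred: str, rejected: str, judge) -> dict:
    # Shuffle which response is shown as A vs. B to avoid position bias.
    pair = [("preferred", preferred), ("rejected", rejected)]
    random.shuffle(pair)
    labels = {"A": pair[0][0], "B": pair[1][0]}
    verdict = judge(rubric=rubric, response_a=pair[0][1], response_b=pair[1][1])
    if verdict in ("A", "B"):
        return {"passed": labels[verdict] == "preferred", "both_pass": False}
    # "both" feeds the refinement loop; "neither" is a plain failure.
    return {"passed": False, "both_pass": verdict == "both"}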

Refinement Loop

  • Model: Claude Opus 4.6
  • Input: Borderline candidates + "both pass" failures
  • Max cycles: 2 per candidate
  • Strategies: Increase specificity, decrease specificity, fix ambiguity, strengthen discrimination
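
A rough sketch of how a refinement strategy could be chosen from a gate outcome. The strategy names come from the list above; the failure_reason values and the selection rules are assumptions, so defer to src/agents/refiner.py.

MAX_REFINEMENT_CYCLES = 2

def pick_strategy(failure_reason: str) -> str:
    if failure_reason == "both_pass":        # rubric too easy: both responses satisfy it
        return "strengthen_discrimination"
    if failure_reason == "ambiguous":
        return "fix_ambiguity"
    if failure_reason == "too_broad":
        return "increase_specificity"
    if failure_reason == "too_narrow":
        return "decrease_specificity"
    return "strengthen_discrimination"       # default: push harder on the preference delta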

Key Metrics to Monitor

Metric                        Target    Alert if
Gate 1 rejection rate         40-60%    <30% (too clean) or >70% (quality issue)
Gate 2 pass rate              40-60%    <30% (semantic issues)
Gate 3 pass rate              60-80%    <50% (discrimination problems)
Inverted rate                 <5%       >5% (fundamental generator issue)
Refinement success            20-40%    <15% (refinement prompts need work)
Avg discrimination strength   >3.0      <2.5 (weak rubrics)
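
The thresholds in the table translate directly into alert rules. The sketch below assumes a flat dict of rates; the real field names in data/metrics/summary.json may differ, so check an actual summary file before wiring this up.

def metric_alerts(m: dict) -> list:
    alerts = []
    if m["gate1_rejection_rate"] < 0.30:
        alerts.append("Gate 1 rejection <30%: candidates look too clean")
    elif m["gate1_rejection_rate"] > 0.70:
        alerts.append("Gate 1 rejection >70%: generation quality issue")
    if m["gate2_pass_rate"] < 0.30:
        alerts.append("Gate 2 pass rate <30%: semantic issues")
    if m["gate3_pass_rate"] < 0.50:
        alerts.append("Gate 3 pass rate <50%: discrimination problems")
    if m["inverted_rate"] > 0.05:
        alerts.append("Inverted rate >5%: fundamental generator issue")
    if m["refinement_success"] < 0.15:
        alerts.append("Refinement success <15%: refinement prompts need work")
    if m["avg_discrimination_strength"] < 2.5:
        alerts.append("Avg discrimination strength <2.5: weak rubrics")
    return alerts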

Configuration

All pipeline constants are in src/config.py (PipelineConfig frozen dataclass):

@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    max_total_candidates: int = 300
    max_no_new_rounds: int = 3
    max_seconds_per_example: int = 240

    parallel_examples: int = 12
    gate2_concurrency: int = 30
    gate3_concurrency: int = 12

    model: str = "anthropic/claude-opus-4-6"

    generator_temperature: float = 0.8
    gate2_temperature: float = 0.2
    gate3_temperature: float = 0.1
    refinement_temperature: float = 0.6

    failure_categories: tuple[str, ...] = (
        "instruction-retention",
        "inference-memory",
        "version-editing",
        "self-coherence",
    )
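
Because the config is a frozen dataclass, overrides are made by copying rather than mutating. A minimal usage sketch, assuming the package is importable as src from the repo root:

from dataclasses import replace
from src.config import PipelineConfig

cfg = PipelineConfig()                    # all defaults shown above
print(cfg.target_rubrics, cfg.model)      # 6 anthropic/claude-opus-4-6

# frozen=True means attribute assignment raises FrozenInstanceError;
# create an adjusted copy instead.
small_run = replace(cfg, parallel_examples=4, initial_candidates=12)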

Troubleshooting

High "both_pass" rate

  • Many rubrics are satisfied by both responses, meaning generation isn't targeting the actual contrast between preferred and rejected
  • Solution: Add more specific constraint prompts in the generator

High inverted rate

  • Generator fundamentally misunderstands what makes preferred better
  • Solution: Re-analyze examples, check for ambiguous preferences

Low refinement success

  • Refinement strategies aren't effective
  • Solution: Add more targeted strategies, reduce max cycles, or accept a lower recovery rate

Output Format

The pipeline generates exactly 6 validated rubrics per example in data/validated/<example_id>.json:

data/validated/
├── pref-0000.json
├── pref-0001.json
└── ...

Each file contains an array of 6 rubrics:

[
  {
    "id": "rubric-uuid",
    "example_id": "pref-0000",
    "rubric_text": "[Objective] The response should...",
    "abstraction_level": "category",
    "category": "instruction-retention",
    "gate3_result": {
      "discrimination_strength": 4,
      "high_value": true,
      "discrimination_type": "clean",
      "passed": true
    }
  },
  // ... 5 more rubrics
]

Integration with RL Training

For RL training:

  • Use abstraction_level: "category" rubrics for generalization
  • Prioritize high_value: true rubrics (discriminate on subtle differences)
  • Filter by discrimination_strength >= 3 for clean gradients
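
Applied to one validated file, those three criteria look like the sketch below (treating "prioritize high_value" as a hard filter for simplicity). Paths and field names follow the output format shown above.

import json
from pathlib import Path

def rl_ready_rubrics(path: Path) -> list:
    rubrics = json.loads(path.read_text())
    return [
        r for r in rubrics
        if r.get("abstraction_level") == "category"
        and r.get("gate3_result", {}).get("high_value")
        and r.get("gate3_result", {}).get("discrimination_strength", 0) >= 3
    ]

selected = rl_ready_rubrics(Path("data/validated/pref-0000.json"))
print(f"{len(selected)} rubrics selected for RL training")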

References

  • RUBRIC_STYLE_GUIDE.md - Mined rubric principles
  • src/config.py - All pipeline constants
  • src/client.py - API client configuration
