A generator-validator loop for producing high-quality synthetic rubrics for RL training.
PIPELINE FLOW

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ GENERATE │ ──▶ │  GATE 1  │ ──▶ │  GATE 2  │ ──▶ │  GATE 3  │
│   Opus   │     │  (Auto)  │     │   Opus   │     │   Opus   │
│(high tmp)│     │40-60% rej│     │40-60%pass│     │60-80%pass│
└──────────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘
                      │                │                │
                      ▼                ▼                ▼
                 ┌─────────┐     ┌───────────┐    ┌───────────┐
                 │ FAILED  │     │ BORDERLINE│    │ VALIDATED │
                 └─────────┘     └─────┬─────┘    └───────────┘
                                       │
                                       ▼
                                 ┌───────────┐
                                 │  REFINE   │ ◀── both_pass failures
                                 │   Opus    │
                                 └─────┬─────┘
                                       │
                                       ▼
                                 ┌───────────┐
                                 │ RECOVERED │ ──▶ Gates 2-3 again
                                 └───────────┘
Goal: Generate exactly 6 validated rubrics per example with diversity-aware generation
This repo is set up as a generator → multi-gate validator → refinement loop because rubric quality is harder to verify than to draft. We bias toward high recall early (generate many candidates) and high precision late (strict gates), then recover borderline cases instead of discarding useful signal.
flowchart TD
A[Inputs: preference pairs<br/>data/examples/] --> B[Generator<br/>src/agents/generator.py]
B --> C[Gate 1: Structural<br/>src/agents/structural.py]
C -->|pass| D[Gate 2: Semantic<br/>src/agents/semantic.py]
C -->|fail| Cx["Rejected pool<br/>(in memory)"]
D -->|pass| E[Gate 3: Discrimination<br/>src/agents/discriminator.py]
D -->|borderline| Dx["Borderline pool<br/>(in memory)"]
D -->|fail| Dy["Rejected pool<br/>(in memory)"]
E -->|pass| F["Validated rubrics<br/>data/validated/"]
E -->|fail| Ex["Gate 3 failures<br/>(in memory)"]
Dx --> H[Refiner<br/>src/agents/refiner.py]
Ex --> H
H --> D
H --> E
F --> M[Metrics<br/>data/metrics/]
E --> M
P[pipeline.log] -.-> M
flowchart LR
subgraph Entrypoint["run.py"]
EP[CLI with subcommands:<br/>run, metrics, check-imports, clean]
end
subgraph Agents["src/agents/"]
G1[generator.py - Candidate generation]
V1[structural.py - Gate 1]
V2[semantic.py - Gate 2]
V3[discriminator.py - Gate 3]
R1[refiner.py - Refinement loop]
O1[orchestrator.py - Pipeline orchestration]
end
subgraph Core["src/"]
C1[client.py - AnthropicClient]
C2[config.py - PipelineConfig]
C3[models.py - Data models]
end
subgraph Data["data/"]
D1[examples/]
D3[validated/]
D4[metrics/]
end
D1 --> EP
EP --> O1
O1 --> G1
O1 --> V1
O1 --> V2
O1 --> V3
O1 --> R1
V3 --> D3
EP --> D4
sequenceDiagram
autonumber
participant U as User
participant P as run.py
participant O as Orchestrator
participant G as Generator
participant S as Gate 1 (Structural)
participant M as Gate 2 (Semantic)
participant D as Gate 3 (Discrimination)
participant R as Refiner
participant Out as Output
U->>P: Run pipeline on examples
P->>O: Initialize PipelineOrchestrator
O->>G: Generate N candidate rubrics
G-->>O: candidates kept in memory
O->>S: Structural checks (fast, deterministic)
S-->>O: pass/fail + routing
O->>M: Semantic checks (Claude Opus)
M-->>O: pass/fail + borderline
O->>D: Discrimination checks (Claude Opus)
D-->>O: pass/fail + both-pass cases
O->>R: Refine borderline and both-pass failures
R-->>O: improved candidates
O->>M: Re-check semantics
O->>D: Re-check discrimination
D-->>Out: Save validated rubrics + metrics
- Generate wide, filter hard: High-quality rubrics are rare; generating many candidates increases odds of capturing strong discriminators.
- Cheap checks first: Structural validation is deterministic and fast, preventing wasted LLM calls on malformed rubrics.
- Semantic before discrimination: A rubric must be clear and verifiable before we test whether it separates preferred vs. rejected.
- Discrimination is the core signal: The final gate ensures rubrics reflect the actual preference delta, which is what RL training needs.
- Refinement saves value: Borderline candidates often become strong after targeted edits, reducing waste without lowering standards.
Use this section to orient a coding agent quickly and accurately.
- Primary entrypoint: `run.py` (CLI with subcommands: `run`, `metrics`, `check-imports`, `clean`).
- Pipeline orchestration: `src/agents/orchestrator.py` handles the full loop, concurrency, and I/O.
- API client: `src/client.py` (`AnthropicClient`), an async Anthropic SDK wrapper with retry and structured output.
- Configuration: `src/config.py` (`PipelineConfig`), all constants in a frozen dataclass.

- `run.py`: CLI entrypoint with subcommands.
- `src/agents/orchestrator.py`: Pipeline orchestration, concurrency, and I/O.
- `src/agents/generator.py`: Candidate generation (Claude Opus 4.6).
- `src/agents/structural.py`: Deterministic structural checks (Gate 1).
- `src/agents/semantic.py`: Semantic quality gate (Claude Opus 4.6).
- `src/agents/discriminator.py`: Discrimination gate (Claude Opus 4.6).
- `src/agents/refiner.py`: Refinement loop (Claude Opus 4.6).
- `src/client.py`: Anthropic API client with retry, structured output, and concurrency control.
- `src/config.py`: All pipeline constants (`PipelineConfig` frozen dataclass).
- `src/models.py`: Pydantic data models.
- Inputs: `data/examples/*.json`
- Outputs:
  - `data/validated/<example_id>.json` (single file containing an array of 6 rubrics per example)
  - `data/metrics/<example_id>.json` and `data/metrics/summary.json`
  - `pipeline.log` (log file in repo root)
- Not persisted by default: intermediate candidates, gate pass/fail pools, and refinement queues are kept in memory.
- Normalization: `afm_response` is auto-mapped to `rejected_response`; `metadata.rubric_criteria` becomes `rubric_hints`.
- Deduping: Candidates are deduped by rubric text before gates.
- Gate order: Gate 1 → Gate 2 → Gate 3; borderline + both-pass failures are refined.
- Randomization: Gate 3 randomizes response order to avoid position bias.
- Diversity-aware generation: After each Gate 3 round, the pipeline analyzes category coverage and targets underrepresented failure categories in subsequent generation rounds.
- Forced fill: If still short after max attempts, the pipeline force-fills using best failures or fallback templates/hints to reach 6 rubrics.
- Concurrency knobs: `--parallel`, `--gate2-concurrency`, `--gate3-concurrency` (CLI args) + `PARALLEL_EXAMPLES`, `GATE2_CONCURRENCY`, `GATE3_CONCURRENCY` (env vars) + defaults in `src/config.py`; a resolution sketch follows below.
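For concreteness, a minimal sketch of how these knobs could be reconciled, assuming CLI flags override env vars, which override the dataclass defaults (the helper name and the precedence order are assumptions, not the repo's actual code):

```python
import os
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PipelineConfig:
    parallel_examples: int = 12
    gate2_concurrency: int = 30
    gate3_concurrency: int = 12

def resolve_parallelism(cfg: PipelineConfig, cli_parallel: int | None = None) -> PipelineConfig:
    # Assumed precedence: --parallel flag > PARALLEL_EXAMPLES env var > config default.
    env_value = os.getenv("PARALLEL_EXAMPLES")
    if cli_parallel is not None:
        parallel = cli_parallel
    elif env_value:
        parallel = int(env_value)
    else:
        parallel = cfg.parallel_examples
    return replace(cfg, parallel_examples=parallel)
```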
All pipeline stages use Claude Opus 4.6 (claude-opus-4-6), configured in src/config.py:
@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    model: str = "anthropic/claude-opus-4-6"
    # ... temperature/token settings per stage

CLI usage:

uv run python run.py run --examples data/examples/pref-0000.json
uv run python run.py run --examples data/examples/ --parallel 12
uv run python run.py metrics
uv run python run.py check-imports
uv run python run.py clean

# Navigate to project
cd rubric_generation
# Install dependencies
uv sync
# Set your API key
export LB_API_KEY='your-labelbox-key'
# Or copy the example env file
cp .env.example .env
# Edit .env with your Labelbox key

Place your examples in `data/examples/`:
{
  "id": "pref-0000",
  "prompt": "User's original prompt/conversation",
  "preferred_response": "The better response",
  "afm_response": "The worse response (rejected)",
  "category_hint": "scheduling",
  "metadata": {
    "rubric_criteria": [
      "Hint 1 from human annotators",
      "Hint 2 from human annotators"
    ]
  }
}

Note: The pipeline accepts both `afm_response` and `rejected_response` fields.
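A minimal loader sketch that applies the normalization described above (`load_example` is a hypothetical helper, not part of `src/`):

```python
import json
from pathlib import Path

def load_example(path: Path) -> dict:
    """Load one preference example and normalize its field names."""
    raw = json.loads(path.read_text())
    return {
        "id": raw["id"],
        "prompt": raw["prompt"],
        "preferred_response": raw["preferred_response"],
        # Either field name is accepted for the worse response.
        "rejected_response": raw.get("rejected_response") or raw.get("afm_response"),
        "category_hint": raw.get("category_hint"),
        # metadata.rubric_criteria becomes rubric_hints for the generator.
        "rubric_hints": raw.get("metadata", {}).get("rubric_criteria", []),
    }

# e.g. load_example(Path("data/examples/pref-0000.json"))
```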
# Single example
uv run python run.py run --examples data/examples/pref-0000.json
# All examples in directory
uv run python run.py run --examples data/examples/
# Custom output directory
uv run python run.py run --examples data/examples/ --output results/

High-throughput options (recommended for large datasets):
# Shard across multiple terminals/machines
uv run python run.py run --examples data/examples/ --shard 0/4
# Resume mode (default on) skips completed examples
uv run python run.py run --examples data/examples/ --resume
# Override concurrency
uv run python run.py run --examples data/examples/ --parallel 12 --gate2-concurrency 30 --gate3-concurrency 12
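Roughly, sharding and resume behave like the sketch below (assumed semantics; the real selection logic lives in the orchestrator):

```python
from pathlib import Path

def select_examples(example_paths: list[Path], shard: str = "0/1", resume: bool = True,
                    validated_dir: Path = Path("data/validated")) -> list[Path]:
    # --shard INDEX/TOTAL: each worker takes every TOTAL-th example, offset by INDEX.
    index, total = (int(part) for part in shard.split("/"))
    selected = [p for i, p in enumerate(sorted(example_paths)) if i % total == index]
    if resume:
        # Resume mode: skip examples whose validated output file already exists.
        selected = [p for p in selected if not (validated_dir / p.name).exists()]
    return selected
```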
# Validated rubrics
ls data/validated/
# Pipeline metrics
uv run python run.py metrics
# Metrics for a specific example
uv run python run.py metrics pref-0010

rubric_generation/
├── run.py # CLI entrypoint (subcommands: run, metrics, check-imports, clean)
├── pyproject.toml # Dependencies (openai, pydantic, python-dotenv, tqdm)
├── .env.example # API key template
│
├── src/
│ ├── client.py # AnthropicClient (async, retry, structured output)
│ ├── config.py # PipelineConfig (frozen dataclass)
│ ├── models.py # Pydantic data models
│ ├── agents/
│ │ ├── orchestrator.py # Pipeline orchestration
│ │ ├── generator.py # Candidate generation
│ │ ├── structural.py # Gate 1: Structural validation
│ │ ├── semantic.py # Gate 2: Semantic validation
│ │ ├── discriminator.py # Gate 3: Discrimination validation
│ │ ├── refiner.py # Refinement loop
│ │ └── base.py # Base agent class
│ └── utils/
│ └── json_parser.py # JSON extraction from LLM output
│
├── data/
│ ├── examples/ # Input: (prompt, preferred, rejected) tuples
│ ├── validated/ # OUTPUT: Final validated rubrics
│ └── metrics/ # Pipeline metrics
│
├── RUBRIC_STYLE_GUIDE.md # Mined rubric principles (input to generator)
├── pipeline.log # Run logs (appended)
└── README.md # This file
Gate 1: Structural (automated)

- Model: Automated (no LLM calls)
- Target: Reject 40-60%
- Checks: Format, atomicity, self-containment, actionable language, anti-patterns
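To give a flavor of Gate 1, here is an illustrative deterministic check; the regex and anti-pattern list are assumptions standing in for the real rules in `src/agents/structural.py`:

```python
import re

# Assumed anti-pattern phrases; the real list lives in the structural validator.
ANTI_PATTERNS = ("better than", "the preferred response", "response a")

def passes_gate1(rubric_text: str) -> bool:
    # Format: a bracketed tag followed by an actionable criterion,
    # e.g. "[Objective] The response should ...".
    if not re.match(r"^\[[^\]]+\]\s+The response", rubric_text):
        return False
    # Atomicity: one criterion per rubric, not several clauses chained together.
    if rubric_text.count(";") > 1 or " and also " in rubric_text:
        return False
    # Self-containment / anti-patterns: no comparative phrasing that references the pair.
    lowered = rubric_text.lower()
    return not any(pattern in lowered for pattern in ANTI_PATTERNS)
```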
Gate 2: Semantic

- Model: Claude Opus 4.6
- Target: Pass 40-60% of Gate 1 survivors
- Checks: Quality checklist (self-containment, specificity 2-4, unambiguity, verifiability, meaningfulness)
Gate 3: Discrimination

- Model: Claude Opus 4.6
- Target: Pass 60-80% of Gate 2 survivors
- Checks: Does the rubric correctly distinguish the preferred response from the rejected one?
- This is the most important gate
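The position-bias mitigation mentioned earlier can be pictured as follows (assumed prompt assembly, not the actual `src/agents/discriminator.py`):

```python
import random

def build_gate3_prompt(rubric: str, preferred: str, rejected: str) -> tuple[str, bool]:
    # Randomize which response appears first so the judge cannot exploit position.
    swapped = random.random() < 0.5
    first, second = (rejected, preferred) if swapped else (preferred, rejected)
    prompt = (
        f"Rubric: {rubric}\n\n"
        f"Response 1:\n{first}\n\nResponse 2:\n{second}\n\n"
        "Which response better satisfies the rubric: 1, 2, or both?"
    )
    # The caller uses `swapped` to map the verdict back to preferred/rejected.
    return prompt, swapped
```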
Refiner

- Model: Claude Opus 4.6
- Input: Borderline candidates + "both pass" failures
- Max cycles: 2 per candidate
- Strategies: Increase specificity, decrease specificity, fix ambiguity, strengthen discrimination
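A rough sketch of how a strategy might be chosen from gate feedback (the field values are assumptions; the actual routing is in `src/agents/refiner.py`):

```python
MAX_REFINEMENT_CYCLES = 2  # per the description above

def pick_strategy(gate2_issue: str | None, gate3_outcome: str | None) -> str:
    if gate3_outcome == "both_pass":
        return "strengthen discrimination"  # rubric is satisfied by both responses
    if gate2_issue == "ambiguous":
        return "fix ambiguity"
    if gate2_issue == "too_specific":
        return "decrease specificity"
    return "increase specificity"
```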
| Metric | Target | Alert If |
|---|---|---|
| Gate 1 rejection rate | 40-60% | <30% (too clean) or >70% (quality issue) |
| Gate 2 pass rate | 40-60% | <30% (semantic issues) |
| Gate 3 pass rate | 60-80% | <50% (discrimination problems) |
| Inverted rate | <5% | >5% (fundamental generator issue) |
| Refinement success | 20-40% | <15% (refinement prompts need work) |
| Avg discrimination strength | >3.0 | <2.5 (weak rubrics) |
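The headline rates above could be computed from per-gate counters along these lines (counter names are assumptions, not the repo's metrics schema):

```python
def summarize(counts: dict[str, int], strengths: list[int]) -> dict[str, float]:
    def rate(num: str, den: str) -> float:
        return counts[num] / max(counts[den], 1)

    return {
        "gate1_rejection_rate": rate("gate1_rejected", "generated"),
        "gate2_pass_rate": rate("gate2_passed", "gate1_passed"),
        "gate3_pass_rate": rate("gate3_passed", "gate2_passed"),
        "inverted_rate": rate("inverted", "gate3_evaluated"),
        "refinement_success_rate": rate("refined_recovered", "refined_total"),
        "avg_discrimination_strength": sum(strengths) / max(len(strengths), 1),
    }
```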
All pipeline constants are in src/config.py (PipelineConfig frozen dataclass):
@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    max_total_candidates: int = 300
    max_no_new_rounds: int = 3
    max_seconds_per_example: int = 240
    parallel_examples: int = 12
    gate2_concurrency: int = 30
    gate3_concurrency: int = 12
    model: str = "anthropic/claude-opus-4-6"
    generator_temperature: float = 0.8
    gate2_temperature: float = 0.2
    gate3_temperature: float = 0.1
    refinement_temperature: float = 0.6
    failure_categories: tuple[str, ...] = (
        "instruction-retention",
        "inference-memory",
        "version-editing",
        "self-coherence",
    )
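The `failure_categories` tuple feeds the diversity-aware generation described earlier; a sketch of that bookkeeping (the helper and the per-category target are assumptions):

```python
from collections import Counter

FAILURE_CATEGORIES = (
    "instruction-retention",
    "inference-memory",
    "version-editing",
    "self-coherence",
)

def underrepresented_categories(validated: list[dict], per_category_target: int = 2) -> list[str]:
    # After a Gate 3 round, count validated rubrics per category and return the
    # categories the next generation round should target.
    counts = Counter(r["category"] for r in validated)
    return [c for c in FAILURE_CATEGORIES if counts[c] < per_category_target]
```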
Common failure modes and fixes:

- Generation isn't targeting the actual contrast
  - Solution: Add more specific constraint prompts in the generator
- Generator fundamentally misunderstands what makes the preferred response better
  - Solution: Re-analyze examples, check for ambiguous preferences
- Refinement strategies aren't effective
  - Solution: Add more targeted strategies, reduce max cycles, accept lower recovery
The pipeline generates exactly 6 validated rubrics per example in data/validated/<example_id>.json:
data/validated/
├── pref-0000.json
├── pref-0001.json
└── ...
Each file contains an array of 6 rubrics:
[
  {
    "id": "rubric-uuid",
    "example_id": "pref-0000",
    "rubric_text": "[Objective] The response should...",
    "abstraction_level": "category",
    "category": "instruction-retention",
    "gate3_result": {
      "discrimination_strength": 4,
      "high_value": true,
      "discrimination_type": "clean",
      "passed": true
    }
  },
  // ... 5 more rubrics
]

For RL training:
- Use `abstraction_level: "category"` rubrics for generalization
- Prioritize `high_value: true` rubrics (discriminate on subtle differences)
- Filter by `discrimination_strength >= 3` for clean gradients
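Putting those three rules together, a selection sketch (field names come from the output schema above; the helper itself is hypothetical):

```python
import json
from pathlib import Path

def select_training_rubrics(path: Path) -> list[dict]:
    rubrics = json.loads(path.read_text())
    keep = [
        r for r in rubrics
        if r["abstraction_level"] == "category"
        and r["gate3_result"]["discrimination_strength"] >= 3
    ]
    # Put high-value rubrics (subtle-difference discriminators) first.
    return sorted(keep, key=lambda r: not r["gate3_result"]["high_value"])

# e.g. select_training_rubrics(Path("data/validated/pref-0000.json"))
```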
- RUBRIC_STYLE_GUIDE.md - Mined rubric principles
- src/config.py - All pipeline constants
- src/client.py - API client configuration