OracleRubric class Experimental #1018

Open
Jgmedina95 wants to merge 6 commits into PrimeIntellect-ai:main from
Jgmedina95:pr/oracle-rubric-to-main

Conversation


@Jgmedina95 Jgmedina95 commented Mar 13, 2026

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

OracleRubric + Solubility Expert: What Changed

Summary

This update introduces a new OracleRubric class for scoring with external backends (instead of an LLM judge), and adds a concrete environment example, solubility_expert, that uses Rowan's solubility workflow API through rowan-python.

Why This Was Added

JudgeRubric is great when scoring is done by a judge model. For many tasks, scoring is better handled by:

  • a domain API,
  • a simulator,
  • a model server,
  • or any custom backend.

OracleRubric provides the same ergonomic pattern as JudgeRubric, but for backend-oracle scoring: whenever you need an external neural network, simulator, or model to score, grade, or simulate generated outputs.

OracleRubric Design

Location:

  • verifiers/rubrics/experimental/oracle_rubric.py

Compatibility shim:

  • verifiers/rubrics/oracle_rubric.py

Core API

rubric = vf.OracleRubric(
    oracle=my_backend,      # backend client or callable
    oracle_fn=call_backend, # optional adapter for backend invocation
)
rubric.add_reward_func(my_reward_func)

Key Behavior

  • oracle can be a callable backend or an object with .predict(...).
  • oracle_fn is optional and receives context (prompt, completion, answer, state, parsed response) plus the backend object. If omitted, the rubric calls the oracle directly.
  • Reward functions receive an injected oracle callable and can do:
    • result = await oracle(prompt, completion, answer, state)
  • Oracle measurements are cached per rollout (cache_measurements=True by default), so multiple reward funcs can reuse the same backend result.
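The behavior above can be sketched without the library. This is a minimal, self-contained illustration of the injection-plus-caching pattern: `my_backend`, `exact_match_reward`, and `make_cached_oracle` are hypothetical stand-ins, not the actual OracleRubric internals.

```python
import asyncio

# Hypothetical backend: any callable (or object with .predict) can play
# the oracle role. Here it just scores the completion against the answer.
def my_backend(prompt, completion, answer, state):
    return {"score": 1.0 if answer in completion else 0.0}

# A reward function in the OracleRubric style: it receives an injected
# `oracle` callable and awaits the backend result.
async def exact_match_reward(prompt, completion, answer, state, oracle):
    result = await oracle(prompt, completion, answer, state)
    return result["score"]

# Sketch of per-rollout caching: the first reward func triggers the
# backend call; later reward funcs in the same rollout reuse the result.
def make_cached_oracle(backend, state):
    async def oracle(prompt, completion, answer, st):
        if "oracle_result" not in state:
            state["oracle_result"] = backend(prompt, completion, answer, st)
        return state["oracle_result"]
    return oracle

state = {}
oracle = make_cached_oracle(my_backend, state)
reward = asyncio.run(
    exact_match_reward("p", "the answer is 42", "42", state, oracle)
)
# The backend result is now memoized in `state` for other reward funcs.
```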

Parallels to JudgeRubric

| Concept | JudgeRubric | OracleRubric |
| --- | --- | --- |
| Main goal | Score with judge LLM | Score with external oracle backend |
| Constructor backend arg | judge_client / model settings | oracle backend object/callable |
| Injected callable in reward funcs | judge(...) | oracle(...) |
| Parser usage | Parse completion before judge call | Parse completion before oracle call |
| State caching | Caches judge responses | Caches oracle results |
| Reward registration style | add_reward_func(...) | add_reward_func(...) |

The intended developer experience is intentionally parallel so users can switch scoring backends without changing rubric patterns.

Solubility Expert Example

Location:

  • environments/solubility_expert/solubility_expert.py
  • environments/solubility_expert/README.md

What It Demonstrates

  • A realistic OracleRubric usage in chemistry/SMILES editing.
  • A backend client (SolubilityPredictClient) that supports:
    • mock mode (offline),
    • Rowan API mode (real external call).
  • Directional reward scoring (higher / lower) based on oracle-returned solubility.
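The directional scoring idea can be sketched in a few lines. The rule below (reward 1.0 when solubility moves the requested way, else 0.0) is an illustrative guess at the shape of the reward, not the exact environment code.

```python
def directional_reward(start_solubility: float,
                       edited_solubility: float,
                       direction: str) -> float:
    """Reward an edit that moves solubility in the requested direction.

    `direction` is "higher" or "lower". A binary reward keeps the sketch
    simple; the real environment may scale by the size of the delta.
    """
    delta = edited_solubility - start_solubility
    if direction == "lower":
        delta = -delta
    return 1.0 if delta > 0 else 0.0
```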

Rowan API Integration

When use_rowan_api=True, the environment:

  1. reads API key from rowan_key (or ROWAN_API_KEY),
  2. calls:
    • rowan.submit_solubility_workflow(...)
  3. waits for completion via:
    • .wait_for_result().fetch_latest(in_place=True)
  4. extracts a solubility value from the returned workflow data,
  5. returns payload fields used by scoring:
    • edited_solubility
    • valid_predict_call
    • workflow_uuid
    • workflow_status
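Steps 4 and 5 (extracting the value and building the scoring payload) can be sketched as a pure function. The payload keys come from the list above; the shape of `workflow_data` (a dict with `data`, `uuid`, `status` fields) is an assumption for illustration, not the actual Rowan response schema.

```python
def build_score_payload(workflow_data: dict) -> dict:
    """Assemble the payload fields used by scoring from a completed
    solubility workflow snapshot (field layout assumed for illustration)."""
    solubility = workflow_data.get("data", {}).get("solubility")
    return {
        "edited_solubility": solubility,
        "valid_predict_call": solubility is not None,
        "workflow_uuid": workflow_data.get("uuid"),
        "workflow_status": workflow_data.get("status"),
    }
```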

How to Run the Example

Install dependency:

pip install rowan-python

Set key:

export rowan_key="<your_key>"

Run:

prime eval run solubility_expert --env-args '{"use_rowan_api": true}' -n 1

NOTE: the solubility example over the Rowan API is very slow, roughly 2 minutes per call. I mainly used it because it was one of the first APIs I found in the wild that fit this pattern; running the predictor locally would make it much faster.

Practical Takeaways

  • Use JudgeRubric when scoring should come from an LLM judge.
  • Use OracleRubric when scoring should come from a domain backend/API.
  • Keep reward logic in add_reward_func(...) functions and call injected oracle(...) directly for a consistent rubric authoring pattern.

Note

Medium Risk
Adds a new scoring primitive (OracleRubric) that executes arbitrary backend/oracle calls during rollout scoring and introduces an example env that can block on external Rowan API workflows, increasing runtime/failure-surface compared to pure in-process rubrics.

Overview
Introduces experimental OracleRubric, a JudgeRubric-style rubric that scores via an external oracle (callable, .predict(...) client, or custom oracle_fn) and caches oracle measurements per rollout to avoid duplicate backend calls.

Adds a new solubility_expert example environment that combines a similarity-based reward with an oracle-based directional solubility reward, with a mock offline predictor by default and an optional Rowan submit_solubility_workflow(...) integration (API key validation + configurable solvents/temps/credits).

Exports OracleRubric from verifiers/__init__.py, adds an experimental rubric README, and includes new unit tests covering initialization, oracle_fn wiring, answer-dict scoring, and within-rollout caching behavior.

Written by Cursor Bugbot for commit 8ca5f56. This will update automatically on new commits.

@Jgmedina95 Jgmedina95 changed the title Pr/oracle rubric to main OracleRubric class Experimental Mar 13, 2026

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


system_prompt=system_prompt,
parser=similarity_rubric.parser,
rubric=rubric,
)


Missing environments/README.md update for new environment

Low Severity

A new environment solubility_expert was added to the environments/ folder but environments/README.md was not updated. The README does not list solubility_expert in any section, nor does it mention OracleRubric as a pattern. The project rule requires that any PR adding an environment must update the environments README to list it under the appropriate category and update the "What to look at for each pattern" section if applicable.


Triggered by project rule: BugBot Instructions

"info": {},
},
]
)


Mock oracle and dataset start_solubility values are inconsistent

Medium Severity

The hardcoded start_solubility values in the dataset are calibrated for Rowan's real API, not the mock estimate_solubility function. In mock mode (the default), estimate_solubility("CCO") returns ~0.40 but start_solubility is 0.61; estimate_solubility("c1ccccc1") returns ~0.14 but start_solubility is 0.70; estimate_solubility("CC(=O)N") returns ~0.49 but start_solubility is −0.42. Since directional scoring computes delta = edited_solubility − start_solubility, mock mode produces meaningless rewards — e.g., any edit to CC(=O)N automatically scores 1.0 because the mock always returns [0, 1] while the baseline is negative.

Additional Locations (1)

@Jgmedina95 Jgmedina95 marked this pull request as draft March 13, 2026 18:35
@Jgmedina95 Jgmedina95 marked this pull request as ready for review March 15, 2026 00:20