OracleRubric class Experimental #1018
Jgmedina95 wants to merge 6 commits into PrimeIntellect-ai:main
Conversation
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```python
    system_prompt=system_prompt,
    parser=similarity_rubric.parser,
    rubric=rubric,
)
```
Missing environments/README.md update for new environment
Low Severity
A new environment solubility_expert was added to the environments/ folder but environments/README.md was not updated. The README does not list solubility_expert in any section, nor does it mention OracleRubric as a pattern. The project rule requires that any PR adding an environment must update the environments README to list it under the appropriate category and update the "What to look at for each pattern" section if applicable.
Triggered by project rule: BugBot Instructions
```python
        "info": {},
    },
]
)
```
Mock oracle and dataset start_solubility values are inconsistent
Medium Severity
The hardcoded start_solubility values in the dataset are calibrated for Rowan's real API, not the mock estimate_solubility function. In mock mode (the default), estimate_solubility("CCO") returns ~0.40 but start_solubility is 0.61; estimate_solubility("c1ccccc1") returns ~0.14 but start_solubility is 0.70; estimate_solubility("CC(=O)N") returns ~0.49 but start_solubility is −0.42. Since directional scoring computes delta = edited_solubility − start_solubility, mock mode produces meaningless rewards — e.g., any edit to CC(=O)N automatically scores 1.0 because the mock always returns [0, 1] while the baseline is negative.
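The arithmetic behind this issue can be reproduced in a few lines (a standalone sketch; the mock values are the approximate figures quoted in the comment above, not output from the real `estimate_solubility` function):

```python
# Approximate mock-mode outputs quoted in the review comment.
mock_solubility = {"CCO": 0.40, "c1ccccc1": 0.14, "CC(=O)N": 0.49}

# Hardcoded dataset baselines, calibrated for Rowan's real API.
start_solubility = {"CCO": 0.61, "c1ccccc1": 0.70, "CC(=O)N": -0.42}

for smiles in mock_solubility:
    # Directional scoring: delta = edited_solubility - start_solubility.
    delta = mock_solubility[smiles] - start_solubility[smiles]
    print(smiles, round(delta, 2))

# For CC(=O)N the mock always returns a value in [0, 1] while the baseline
# is -0.42, so delta is always positive and any "increase solubility" edit
# trivially scores 1.0.
```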


Description
Type of Change
Testing
Ran `uv run pytest` locally.
Checklist
Additional Notes
OracleRubric + Solubility Expert: What Changed
Summary
This update introduces a new `OracleRubric` class for scoring with external backends (instead of an LLM judge), and adds a concrete environment example, `solubility_expert`, that uses Rowan's solubility workflow API through `rowan-python`.
Why This Was Added
`JudgeRubric` is great when scoring is done by a judge model. For many tasks, though, scoring is better handled by an external backend. `OracleRubric` provides the same ergonomic pattern as `JudgeRubric`, but for backend-oracle scoring: whenever you need an external neural network, model, or simulator for scoring/grading generated outputs.
OracleRubric Design
Location: `verifiers/rubrics/experimental/oracle_rubric.py`
Compatibility shim: `verifiers/rubrics/oracle_rubric.py`
Core API
Key Behavior
- `oracle` can be a callable backend or an object with `.predict(...)`.
- `oracle_fn` is optional and receives context (`prompt`, `completion`, `answer`, `state`, parsed `response`) plus the backend object. If not provided, the rubric calls the oracle directly.
- Reward functions receive the injected `oracle` callable and can do: `result = await oracle(prompt, completion, answer, state)`.
- Measurements are cached per rollout (`cache_measurements=True` by default), so multiple reward funcs can reuse the same backend result.

Parallels to JudgeRubric
| JudgeRubric | OracleRubric |
| --- | --- |
| `judge_client` / model settings | `oracle` backend object/callable |
| `judge(...)` | `oracle(...)` |
| `add_reward_func(...)` | `add_reward_func(...)` |

The intended developer experience is intentionally parallel so users can switch scoring backends without changing rubric patterns.
Solubility Expert Example
Location:
- `environments/solubility_expert/solubility_expert.py`
- `environments/solubility_expert/README.md`

What It Demonstrates
- `OracleRubric` usage in chemistry/SMILES editing.
- A predict-style client (`SolubilityPredictClient`) that supports both a mock offline mode and the real Rowan API.
- Directional scoring (`higher`/`lower`) based on oracle-returned solubility.

Rowan API Integration
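The directional reward can be sketched as follows (a minimal binary version; the environment's actual thresholds and reward shaping may differ, and `directional_reward` is a hypothetical name):

```python
def directional_reward(edited_solubility: float,
                       start_solubility: float,
                       direction: str) -> float:
    """Score 1.0 when the edit moves solubility the requested way
    ('higher' or 'lower'), 0.0 otherwise."""
    delta = edited_solubility - start_solubility
    if direction == "higher":
        return 1.0 if delta > 0 else 0.0
    return 1.0 if delta < 0 else 0.0

print(directional_reward(0.90, 0.61, "higher"))  # 1.0: solubility went up
print(directional_reward(0.40, 0.61, "higher"))  # 0.0: solubility went down
```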
When `use_rowan_api=True`, the environment:
- validates `rowan_key` (or `ROWAN_API_KEY`),
- submits `rowan.submit_solubility_workflow(...)`,
- waits on the workflow via `.wait_for_result()` / `.fetch_latest(in_place=True)`,
- records `edited_solubility`, `valid_predict_call`, `workflow_uuid`, and `workflow_status`.

How to Run the Example
Install dependency:
Set key:
Run:
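The command blocks for these steps were stripped from the page; a plausible reconstruction follows. The dependency name and `ROWAN_API_KEY` come from the PR text, but the `vf-install`/`vf-eval` invocations are assumptions about the verifiers CLI, not commands quoted from the PR:

```shell
# Install the Rowan client dependency (package name from the PR text).
uv pip install rowan-python

# Set the API key the environment reads (ROWAN_API_KEY, per the PR).
export ROWAN_API_KEY="your-key-here"

# Run the example environment (hypothetical invocation).
uv run vf-install solubility_expert
uv run vf-eval solubility_expert
```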
NOTE: the solubility example and the Rowan API are very slow (roughly 2 minutes per call). I mainly used it because it was one of the first APIs I found in the wild that fit this use case, but running the predictor locally would definitely make it faster.
Practical Takeaways
- Use `JudgeRubric` when scoring should come from an LLM judge.
- Use `OracleRubric` when scoring should come from a domain backend/API.
- Write `add_reward_func(...)` functions and call the injected `oracle(...)` directly for a consistent rubric authoring pattern.

Note
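The authoring pattern above can be sketched with a hypothetical reward function that awaits the injected oracle (the function name, the `answer` dict layout, and passing the oracle as a parameter are illustrative assumptions; only the `await oracle(prompt, completion, answer, state)` call shape comes from the PR text):

```python
import asyncio

async def solubility_improved(prompt, completion, answer, state, oracle):
    """Hypothetical reward func: award 1.0 if the edited molecule's
    oracle-predicted solubility beats the starting value."""
    edited = await oracle(prompt, completion, answer, state)
    # `answer` is assumed here to carry the starting solubility.
    return 1.0 if edited > answer["start_solubility"] else 0.0

async def fake_oracle(prompt, completion, answer, state):
    return 0.9  # stand-in backend prediction

score = asyncio.run(
    solubility_improved("p", "CCO", {"start_solubility": 0.61}, {}, fake_oracle)
)
print(score)  # 1.0
```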
Medium Risk
Adds a new scoring primitive (`OracleRubric`) that executes arbitrary backend/oracle calls during rollout scoring, and introduces an example env that can block on external Rowan API workflows, increasing the runtime/failure surface compared to pure in-process rubrics.
Overview
Introduces experimental `OracleRubric`, a `JudgeRubric`-style rubric that scores via an external oracle (callable, `.predict(...)` client, or custom `oracle_fn`) and caches oracle measurements per rollout to avoid duplicate backend calls.

Adds a new `solubility_expert` example environment that combines a similarity-based reward with an oracle-based directional solubility reward, with a mock offline predictor by default and an optional Rowan `submit_solubility_workflow(...)` integration (API key validation + configurable solvents/temps/credits).

Exports `OracleRubric` from `verifiers/__init__.py`, adds an `experimental` rubric README, and includes new unit tests covering initialization, `oracle_fn` wiring, answer-dict scoring, and within-rollout caching behavior.

Written by Cursor Bugbot for commit 8ca5f56. This will update automatically on new commits.