Issue 131: MLflow integration #132

Open

davramov wants to merge 8 commits into als-computing:main from davramov:mlflow-integration

Conversation


davramov commented Apr 6, 2026

New files

orchestration/mlflow.py — Core MLflow integration module with:

  • ModelCheckpointInfo dataclass holding checkpoint paths and inference params

  • _is_mlflow_reachable() — fast health check (2s timeout) to short-circuit before MLflow's built-in 30s retry loop fires on a down server

  • get_mlflow_client() — constructs a client from config.mlflow["tracking_uri"]

  • get_checkpoint_info() — resolves a model's production alias from the registry; returns None gracefully on any failure, allowing callers to fall back to config defaults

  • register_checkpoint() — registers a model version with nersc_path, alcf_path, and inference params as version-level tags; list/dict values are JSON-encoded automatically

  • log_segmentation_metrics() — logs timing and job params as an (optionally nested) MLflow run after SLURM job completion; the run is logged from the orchestration server, not the compute node

orchestration/flows/bl832/register_mlflow.py

  • One-time registration script for sam3-petiole and dinov3-petiole, storing all model-coupled params (checkpoint paths, conda env, scripts dir, batch size, prompts, confidence, overlap) in the registry.
  • Includes retrieve_mlflow_params_test() to validate the resolution chain end-to-end, with separate checks confirming MLflow-coupled params are overridden and SLURM allocation params are unchanged.
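The tag-encoding rule mentioned for register_checkpoint() (list/dict values JSON-encoded automatically) might look like the sketch below. The helper name `encode_version_tags` and the example parameter values are hypothetical; only the encoding behavior is from the PR text.

```python
# Illustrative sketch of JSON-encoding list/dict params into MLflow
# version-level tags; helper name and sample values are hypothetical.
import json


def encode_version_tags(params: dict) -> dict:
    """Flatten params into string tags; JSON-encode lists and dicts."""
    tags = {}
    for key, value in params.items():
        if isinstance(value, (list, dict)):
            tags[key] = json.dumps(value)
        else:
            tags[key] = str(value)
    return tags


tags = encode_version_tags({
    "nersc_path": "/path/to/checkpoint.pt",  # hypothetical path
    "batch_size": 8,
    "prompts": ["petiole", "leaf"],
    "confidence": 0.35,
})
```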

Modified files

orchestration/flows/bl832/config.py

  • Added self.mlflow = self.config["mlflow"]["local"] to Config832._beam_specific_config(), exposing the MLflow environment dict (containing tracking_uri and registry_uri)

config.yml — Added a multi-environment mlflow section with four named environments. Config832 defaults to local; switching environments requires only changing the key in _beam_specific_config:

mlflow:
    local:     # http://localhost:5001
    dev:       # http://mlflow-dev.computing.als.lbl.gov
    staging:   # https://mlflow-staging.computing.als.lbl.gov
    prod:      # https://mlflow.computing.als.lbl.gov
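The environment selection above amounts to a single dict lookup. A minimal sketch, assuming a config structure like the YAML shown (the registry_uri values and the MLFLOW_ENV constant are illustrative assumptions, not the actual Config832 internals):

```python
# Hypothetical illustration of environment selection in _beam_specific_config();
# the config dict contents and MLFLOW_ENV name are assumptions.
config = {
    "mlflow": {
        "local": {"tracking_uri": "http://localhost:5001",
                  "registry_uri": "http://localhost:5001"},
        "prod": {"tracking_uri": "https://mlflow.computing.als.lbl.gov",
                 "registry_uri": "https://mlflow.computing.als.lbl.gov"},
    }
}

MLFLOW_ENV = "local"  # change this one key to switch environments
mlflow_cfg = config["mlflow"][MLFLOW_ENV]
```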

orchestration/flows/bl832/nersc.py — Refactored _load_job_options() to implement a three-layer resolution:

  1. Config YAML (base defaults — SLURM allocation, infrastructure)
  2. MLflow Registry (model-coupled params — checkpoints, inference hyperparams, envs, scripts) — skipped silently if mlflow_model_name is not passed or server is unreachable
  3. Prefect Variable (hot overrides — wins over everything, no redeploy needed)

segmentation_sam3() and segmentation_dinov3() call _load_job_options with mlflow_model_name and mlflow_checkpoint_key to activate the MLflow layer.

davramov added 6 commits April 6, 2026 13:07
… model checkpoint, see if server is reachable, get client, retrieve checkpoint info, log run metrics)
…including inference parameter settings). Also includes a query to check that values are pulled from the server as expected.
@davramov davramov requested a review from xiaoyachong April 6, 2026 20:37