Issue 131: MLflow integration #132

Open

davramov wants to merge 8 commits into als-computing:main from davramov:mlflow-integration

Conversation


davramov commented Apr 6, 2026

New files

orchestration/mlflow.py — Core MLflow integration module with:

  • ModelCheckpointInfo dataclass holding checkpoint paths and inference params

  • _is_mlflow_reachable() — fast health check (2s timeout) to short-circuit before MLflow's built-in 30s retry loop fires on a down server

  • get_mlflow_client() — constructs a client from config.mlflow["tracking_uri"]

  • get_checkpoint_info() — resolves a model's production alias from the registry; returns None gracefully on any failure, allowing callers to fall back to config defaults

  • register_checkpoint() — registers a model version with nersc_path, alcf_path, and inference params as version-level tags; list/dict values are JSON-encoded automatically

  • log_segmentation_metrics() — logs timing and job params as an (optionally nested) MLflow run after SLURM job completion; the run is logged from the orchestration server, not the compute node

orchestration/flows/bl832/register_mlflow.py

  • One-time registration script for sam3-petiole and dinov3-petiole, storing all model-coupled params (checkpoint paths, conda env, scripts dir, batch size, prompts, confidence, overlap) in the registry.
  • Includes retrieve_mlflow_params_test() to validate the resolution chain end-to-end, with separate checks confirming MLflow-coupled params are overridden and SLURM allocation params are unchanged.
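The tag-encoding rule mentioned for register_checkpoint() (list/dict values JSON-encoded automatically) might look like the sketch below. The helper name `encode_version_tags` and the example parameter values are hypothetical; only the encoding behavior is from the PR text.

```python
# Illustrative sketch of JSON-encoding list/dict params into MLflow
# version-level tags; helper name and sample values are hypothetical.
import json


def encode_version_tags(params: dict) -> dict:
    """Flatten params into string tags; JSON-encode lists and dicts."""
    tags = {}
    for key, value in params.items():
        if isinstance(value, (list, dict)):
            tags[key] = json.dumps(value)
        else:
            tags[key] = str(value)
    return tags


tags = encode_version_tags({
    "nersc_path": "/path/to/checkpoint.pt",  # hypothetical path
    "batch_size": 8,
    "prompts": ["petiole", "leaf"],
    "confidence": 0.35,
})
```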

Modified files

orchestration/flows/bl832/config.py

  • Added self.mlflow = self.config["mlflow"]["local"] to Config832._beam_specific_config(), exposing the MLflow environment dict (containing tracking_uri and registry_uri)

config.yml — Added a multi-environment mlflow section with four named environments. Config832 defaults to local; switching environments requires only changing the key in _beam_specific_config:

mlflow:
    local:     # http://localhost:5001
    dev:       # http://mlflow-dev.computing.als.lbl.gov
    staging:   # https://mlflow-staging.computing.als.lbl.gov
    prod:      # https://mlflow.computing.als.lbl.gov
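The environment selection above amounts to a single dict lookup. A minimal sketch, assuming a config structure like the YAML shown (the registry_uri values and the MLFLOW_ENV constant are illustrative assumptions, not the actual Config832 internals):

```python
# Hypothetical illustration of environment selection in _beam_specific_config();
# the config dict contents and MLFLOW_ENV name are assumptions.
config = {
    "mlflow": {
        "local": {"tracking_uri": "http://localhost:5001",
                  "registry_uri": "http://localhost:5001"},
        "prod": {"tracking_uri": "https://mlflow.computing.als.lbl.gov",
                 "registry_uri": "https://mlflow.computing.als.lbl.gov"},
    }
}

MLFLOW_ENV = "local"  # change this one key to switch environments
mlflow_cfg = config["mlflow"][MLFLOW_ENV]
```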

orchestration/flows/bl832/nersc.py — Refactored _load_job_options() to implement a three-layer resolution:

  1. Config YAML (base defaults — SLURM allocation, infrastructure)
  2. MLflow Registry (model-coupled params — checkpoints, inference hyperparams, envs, scripts) — skipped silently if mlflow_model_name is not passed or server is unreachable
  3. Prefect Variable (hot overrides — wins over everything, no redeploy needed)

segmentation_sam3() and segmentation_dinov3() call _load_job_options with mlflow_model_name and mlflow_checkpoint_key to activate the MLflow layer.

davramov added 6 commits April 6, 2026 13:07
… model checkpoint, see if server is reachable, get client, retrieve checkpoint info, log run metrics)
…including inference parameter settings). Also includes a query to check that values are pulled from the server as expected.
@davramov davramov requested a review from xiaoyachong April 6, 2026 20:37