New files
- `orchestration/mlflow.py` — Core MLflow integration module with:
  - `ModelCheckpointInfo` — dataclass holding checkpoint paths and inference params
  - `_is_mlflow_reachable()` — fast health check (2s timeout) to short-circuit before MLflow's built-in 30s retry loop fires on a down server
  - `get_mlflow_client()` — constructs a client from `config.mlflow["tracking_uri"]`
  - `get_checkpoint_info()` — resolves a model's production alias from the registry; returns `None` gracefully on any failure, allowing callers to fall back to config defaults
  - `register_checkpoint()` — registers a model version with `nersc_path`, `alcf_path`, and inference params as version-level tags; list/dict values are JSON-encoded automatically
  - `log_segmentation_metrics()` — logs timing and job params as an (optionally nested) MLflow run after SLURM job completion, on the orchestration server rather than the compute node
- `orchestration/flows/bl832/register_mlflow.py` — registers `sam3-petiole` and `dinov3-petiole`, storing all model-coupled params (checkpoint paths, conda env, scripts dir, batch size, prompts, confidence, overlap) in the registry. Adds `retrieve_mlflow_params_test()` to validate the resolution chain end-to-end, with separate checks confirming that MLflow-coupled params are overridden and SLURM allocation params are unchanged.

Modified files
- `orchestration/flows/bl832/config.py` — Added `self.mlflow = self.config["mlflow"]["local"]` to `Config832._beam_specific_config()`, exposing the MLflow environment dict (containing `tracking_uri` and `registry_uri`)
- `config.yml` — Added a multi-environment `mlflow` section with four named environments
- `orchestration/flows/bl832/nersc.py` — Refactored `_load_job_options()` to implement a three-layer resolution; `segmentation_sam3()` and `segmentation_dinov3()` call `_load_job_options()` with `mlflow_model_name` and `mlflow_checkpoint_key` to activate the MLflow layer