Tomography reconstruction multinode optimization#111
Tomography reconstruction multinode optimization#111davramov merged 72 commits intoals-computing:mainfrom
Conversation
…NERSC). Also switching from podman to shifter for better overhead performance
…ction codes in scripts/perlmutter/* and scripts/polaris/*, since they are a linting nightmare but work
…/nersc.py to set the number of nodes to use for reconstruction
…e multinode reconstruction flow
…w for the dispatcher
73eb2d4 to
81dd6c9
Compare
|
When reviewing, you can ignore the files in |
xiaoyachong
left a comment
There was a problem hiding this comment.
Thanks for your PR. I’ve left a few comments, mainly regarding the generalization of the code. Different projects may need to call different inference pipelines for various ML models, rather than relying on hardcoded logic. If this requires substantial changes, we can address it in a future PR—especially as we gain more insight into handling different model inference workflows when working on Harry’s projects.
Also, I was wondering whether we want to generate a Zarr file for the segmentation results for visualization.
orchestration/flows/bl832/nersc.py
Outdated
| } | ||
|
|
||
| flow_name = f"delete {location}: {Path(tiff_file_path).name}" | ||
| def segmentation_dino( |
There was a problem hiding this comment.
Similar to segmentation_sam3(), we may need to make model loading and inference more flexible to support different projects.
| else: | ||
| return False | ||
|
|
||
| def combine_segmentations( |
There was a problem hiding this comment.
Same as the other two segmentation functions, we could make this more flexible—for example, by reading seg_scripts_dir from a config file.
Also, I wonder if it would be beneficial to move the segmentation and combination code into a separate segmentation.py file, as nersc.py is getting quite long. In segmentation.py, we could load different models depending on the project—for example, for Synaps, we load SAM3 and DINO3 and then combine them, while for Harry’s project, we might load a U-Net model instead.
There was a problem hiding this comment.
I agree, this script (and alcf.py) will become quite large fairly quickly as we add more methods. Not for this PR, but the next one where we integrate Harry's code, I think we can split it up so ALCF/NERSC flows live in their own subdirectories:
flows/bl832/
nersc/
__init__.py # re-exports NERSCTomographyHPCController
controller.py # class definition + reconstruct, build_multi_resolution
segmentation.py # segmentation_sam3, segmentation_dino, combine_segmentations
streaming.py # streaming mixin stuff
alcf/
__init__.py
controller.py
segmentation.py
| logger.error(f"Failed to transfer TIFFs to data832: {e}") | ||
| data832_tiff_transfer_success = False | ||
|
|
||
| # ── STEP 3: SAM3 / DINOv3 ────────────────────────── |
There was a problem hiding this comment.
This step (SAM3 / DINOv3) is currently hardcoded, but the workflow may vary across different projects. For example, other projects may load/save U-Net features. Could we make this more flexible (perhaps in a future PR)?
There was a problem hiding this comment.
I agree, we should address this in another PR. I'll create a new issue that captures this and other larger changes that seem out of scope for getting this PR merged
| @@ -0,0 +1,1755 @@ | |||
| from __future__ import print_function | |||
There was a problem hiding this comment.
I wonder if we need to explain some reconstruction details in our draft manuscript.
There was a problem hiding this comment.
It seems Prady will add those details in his paper.
…roughout. Setting defaults to 4 nodes so we can always run without a reservation
…ge_recon_segment_flow to nersc_petiole_segment_flow
…method for using defaults from config.yml vs prefect variable overrides (if defaults=False), updated controller methods to use _load_job_options()
xiaoyachong
left a comment
There was a problem hiding this comment.
Hi David, thanks for your great work! I’ve left a few comments—there don’t seem to be any major changes needed, just some minor issues.
| else: | ||
| return False | ||
|
|
||
| def combine_segmentations( |
orchestration/flows/bl832/nersc.py
Outdated
|
|
||
|
|
||
| @task(name="nersc_segmentation_dino_task") | ||
| def nersc_segmentation_dino_task( |
There was a problem hiding this comment.
I noticed that on NERSC there are segmentation_dino and nersc_segmentation_dino_task, while on ALCF there are _segmentation_dino_wrapper, segmentation_dino, and alcf_segmentation_dino_task. Could we make these naming conventions consistent across both environments?
There was a problem hiding this comment.
The _segmentation_dino_wrapper is necessary to submit to Globus Compute, but not needed for SFAPI. Otherwise I'll make sure it is consistent when I continue working on the ALCF implementation.
| @@ -0,0 +1,1755 @@ | |||
| from __future__ import print_function | |||
There was a problem hiding this comment.
It seems Prady will add those details in his paper.
This PR adds the
reconstruct_multinode()method for tomography reconstruction inorchestration/flows/bl832/nersc.py.num_nodeswhen calling reconstruction, and use the correct QOS.Additionally, this PR improves the
cancel_sfapi_job.pyscript, includes the reconstruction/multiresolution scripts used on Perlmutter.