
Tomography reconstruction multinode optimization #111

Merged
davramov merged 72 commits into als-computing:main from davramov:recon-optimization
Apr 6, 2026

Conversation

@davramov
Contributor

This PR adds the reconstruct_multinode() method for tomography reconstruction in orchestration/flows/bl832/nersc.py.

  • Adds the ability to request num_nodes when calling reconstruction, and uses the correct QOS.
  • Uses Shifter instead of Podman to handle containers.
  • By partitioning sinograms across multiple CPU nodes, I achieved near-linear speedup. For 8 nodes, there was close to a 7x speedup in reconstruction performance (not including overhead).
  • From here, I found the next main bottleneck was podman-hpc, which pulls the microct image on every job (~90 seconds). By switching to Shifter with a pre-cached image, container startup dropped to ~2-3 seconds. The remaining overhead (~1 minute) is due to SFAPI and queuing on Perlmutter (not much we can improve here).
  • The sweet spot seems to be 4 CPU nodes in the realtime queue using Shifter, bringing the total time down from ~10 minutes to ~2 minutes. This balances quick pickup by the realtime queue with the linear performance boost. Scaling beyond this requires the regular, demand, or premium queues, whose longer wait times offset the reconstruction speedup (maybe we can ask Bjoern nicely for more nodes in the realtime queue).
  • For fun, I ran one test using 128 nodes, and while recon was fast (~30 seconds), the wait in the queue was close to 30 minutes.
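The near-linear speedup above comes from splitting the sinogram rows into contiguous chunks, one per node. A minimal sketch of that partitioning (a hypothetical helper for illustration, not the PR's actual code; the parameter values below are made up):

```python
def partition_sinograms(n_sinograms: int, n_nodes: int) -> list[range]:
    """Split sinogram indices into near-equal contiguous chunks, one per node."""
    base, extra = divmod(n_sinograms, n_nodes)
    chunks, start = [], 0
    for node in range(n_nodes):
        # The first `extra` nodes each take one additional sinogram.
        size = base + (1 if node < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

# Each node would then reconstruct only its own chunk, e.g. passing
# chunk.start / chunk.stop to the reconstruction script as row bounds.
chunks = partition_sinograms(2160, 8)
```

With equal-sized chunks and independent nodes, total reconstruction time scales roughly as 1/N, which matches the ~7x observed on 8 nodes once fixed overhead is excluded.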

Additionally, this PR improves the cancel_sfapi_job.py script and includes the reconstruction/multiresolution scripts used on Perlmutter.

@davramov davramov mentioned this pull request Jan 17, 2026
@davramov davramov mentioned this pull request Feb 2, 2026
@davramov davramov requested a review from xiaoyachong March 13, 2026 20:47
@davramov
Contributor Author

When reviewing, you can ignore the files in /scripts/perlmutter/; I added these for completeness.

Contributor

@xiaoyachong xiaoyachong left a comment


Thanks for your PR. I’ve left a few comments, mainly regarding the generalization of the code. Different projects may need to call different inference pipelines for various ML models, rather than relying on hardcoded logic. If this requires substantial changes, we can address it in a future PR—especially as we gain more insight into handling different model inference workflows when working on Harry’s projects.

Also, I was wondering whether we want to generate a Zarr file for the segmentation results for visualization.

}

flow_name = f"delete {location}: {Path(tiff_file_path).name}"
def segmentation_dino(
Contributor


Similar to segmentation_sam3(), we may need to make model loading and inference more flexible to support different projects.

else:
return False

def combine_segmentations(
Contributor


Same as the other two segmentation functions, we could make this more flexible—for example, by reading seg_scripts_dir from a config file.

Also, I wonder if it would be beneficial to move the segmentation and combination code into a separate segmentation.py file, as nersc.py is getting quite long. In segmentation.py, we could load different models depending on the project—for example, for Synaps, we load SAM3 and DINO3 and then combine them, while for Harry’s project, we might load a U-Net model instead.
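A minimal sketch of the config-driven idea above: map each project to its segmentation pipeline instead of hardcoding the SAM3/DINO logic. The names (SEGMENTATION_PIPELINES, segmentation_steps, the project keys) are illustrative assumptions, not from the PR:

```python
# Hypothetical per-project config; in practice this could be loaded
# from a YAML/TOML file alongside seg_scripts_dir.
SEGMENTATION_PIPELINES = {
    "synaps": ["sam3", "dino", "combine"],
    "harry": ["unet"],
}

def segmentation_steps(project: str) -> list[str]:
    """Look up which segmentation steps to run for a given project."""
    try:
        return SEGMENTATION_PIPELINES[project]
    except KeyError:
        raise ValueError(f"No segmentation pipeline configured for {project!r}")
```

The flow would then iterate over the returned steps rather than calling segmentation_sam3() and segmentation_dino() directly.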

Contributor Author


I agree: this script (and alcf.py) will become quite large fairly quickly as we add more methods. Not in this PR, but in the next one, where we integrate Harry's code, I think we can split it up so the ALCF/NERSC flows live in their own subdirectories:

flows/bl832/
    nersc/
        __init__.py          # re-exports NERSCTomographyHPCController
        controller.py        # class definition + reconstruct, build_multi_resolution
        segmentation.py      # segmentation_sam3, segmentation_dino, combine_segmentations
        streaming.py         # streaming mixin stuff
    alcf/
        __init__.py
        controller.py
        segmentation.py

Contributor


Sounds great

logger.error(f"Failed to transfer TIFFs to data832: {e}")
data832_tiff_transfer_success = False

# ── STEP 3: SAM3 / DINOv3 ──────────────────────────
Contributor


This step (SAM3 / DINOv3) is currently hardcoded, but the workflow may vary across different projects. For example, other projects may load/save U-Net features. Could we make this more flexible (perhaps in a future PR)?

Contributor Author


I agree, we should address this in another PR. I'll create a new issue that captures this and the other larger changes that seem out of scope for getting this PR merged.

@@ -0,0 +1,1755 @@
from __future__ import print_function
Contributor


I wonder if we need to explain some reconstruction details in our draft manuscript.

Contributor


It seems Prady will add those details in his paper.

Contributor

@xiaoyachong xiaoyachong left a comment


Hi David, thanks for your great work! I’ve left a few comments—there don’t seem to be any major changes needed, just some minor issues.



@task(name="nersc_segmentation_dino_task")
def nersc_segmentation_dino_task(
Contributor


I noticed that on NERSC there are segmentation_dino and nersc_segmentation_dino_task, while on ALCF there are _segmentation_dino_wrapper, segmentation_dino, and alcf_segmentation_dino_task. Could we make these naming conventions consistent across both environments?

Contributor Author


The _segmentation_dino_wrapper is necessary for submitting to Globus Compute, but not needed for SFAPI. Otherwise, I'll make sure the naming is consistent when I continue working on the ALCF implementation.

@davramov davramov merged commit 5238893 into als-computing:main Apr 6, 2026
1 check passed