
Tomography reconstruction multinode optimization #111

Merged
davramov merged 72 commits into als-computing:main from davramov:recon-optimization
Apr 6, 2026

Conversation

@davramov
Contributor

This PR adds the reconstruct_multinode() method for tomography reconstruction in orchestration/flows/bl832/nersc.py.

  • Adds the ability to request num_nodes when calling reconstruction, and uses the correct QOS.
  • Uses Shifter instead of Podman to handle containers.
  • By partitioning sinograms across multiple CPU nodes, I achieved near-linear speedup. For 8 nodes, there was close to a 7x speedup in reconstruction performance (not including overhead).
  • From here, I found the next main bottleneck was podman-hpc, which pulls the microct image on every job (~90 seconds). By switching to Shifter with a pre-cached image, container startup dropped to ~2-3 seconds. The remaining overhead (~1 minute) is due to SFAPI and queuing on Perlmutter (not much we can improve here).
  • The sweet spot seems to be 4 CPU nodes in the realtime queue using Shifter, bringing the total time down from ~10 minutes to ~2 minutes. This balances quick pickup by the realtime queue with the linear performance boost. Scaling beyond this requires the regular, demand, or premium queues, whose longer wait times offset the reconstruction speedup (maybe we can ask Bjoern nicely for more nodes in the realtime queue).
  • For fun, I ran one test using 128 nodes, and while recon was fast (~30 seconds), the wait in the queue was close to 30 minutes.
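The near-linear speedup above comes from splitting the sinogram rows into contiguous chunks, one per node. A minimal sketch of that partitioning (a hypothetical helper for illustration, not the PR's actual code; the parameter values below are made up):

```python
def partition_sinograms(n_sinograms: int, n_nodes: int) -> list[range]:
    """Split sinogram indices into near-equal contiguous chunks, one per node."""
    base, extra = divmod(n_sinograms, n_nodes)
    chunks, start = [], 0
    for node in range(n_nodes):
        # The first `extra` nodes each take one additional sinogram.
        size = base + (1 if node < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

# Each node would then reconstruct only its own chunk, e.g. passing
# chunk.start / chunk.stop to the reconstruction script as row bounds.
chunks = partition_sinograms(2160, 8)
```

With equal-sized chunks and independent nodes, total reconstruction time scales roughly as 1/N, which matches the ~7x observed on 8 nodes once fixed overhead is excluded.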

Additionally, this PR improves the cancel_sfapi_job.py script and includes the reconstruction/multiresolution scripts used on Perlmutter.

@davramov davramov mentioned this pull request Jan 17, 2026
@davramov davramov mentioned this pull request Feb 2, 2026
@davramov davramov requested a review from xiaoyachong March 13, 2026 20:47
@davramov
Contributor Author

When reviewing, you can ignore the files in /scripts/perlmutter/; I added these for completeness.

Contributor

@xiaoyachong xiaoyachong left a comment


Thanks for your PR. I’ve left a few comments, mainly regarding the generalization of the code. Different projects may need to call different inference pipelines for various ML models, rather than relying on hardcoded logic. If this requires substantial changes, we can address it in a future PR—especially as we gain more insight into handling different model inference workflows when working on Harry’s projects.

Also, I was wondering whether we want to generate a Zarr file for the segmentation results for visualization.

}

flow_name = f"delete {location}: {Path(tiff_file_path).name}"
def segmentation_dino(
Contributor


Similar to segmentation_sam3(), we may need to make model loading and inference more flexible to support different projects.

else:
return False

def combine_segmentations(
Contributor


Same as the other two segmentation functions, we could make this more flexible—for example, by reading seg_scripts_dir from a config file.

Also, I wonder if it would be beneficial to move the segmentation and combination code into a separate segmentation.py file, as nersc.py is getting quite long. In segmentation.py, we could load different models depending on the project—for example, for Synaps, we load SAM3 and DINO3 and then combine them, while for Harry’s project, we might load a U-Net model instead.
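A minimal sketch of the config-driven idea above: map each project to its segmentation pipeline instead of hardcoding the SAM3/DINO logic. The names (SEGMENTATION_PIPELINES, segmentation_steps, the project keys) are illustrative assumptions, not from the PR:

```python
# Hypothetical per-project config; in practice this could be loaded
# from a YAML/TOML file alongside seg_scripts_dir.
SEGMENTATION_PIPELINES = {
    "synaps": ["sam3", "dino", "combine"],
    "harry": ["unet"],
}

def segmentation_steps(project: str) -> list[str]:
    """Look up which segmentation steps to run for a given project."""
    try:
        return SEGMENTATION_PIPELINES[project]
    except KeyError:
        raise ValueError(f"No segmentation pipeline configured for {project!r}")
```

The flow would then iterate over the returned steps rather than calling segmentation_sam3() and segmentation_dino() directly.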

Contributor Author


I agree: this script (and alcf.py) will become quite large fairly quickly as we add more methods. Not in this PR, but in the next one, where we integrate Harry's code, I think we can split it up so the ALCF/NERSC flows live in their own subdirectories:

flows/bl832/
    nersc/
        __init__.py          # re-exports NERSCTomographyHPCController
        controller.py        # class definition + reconstruct, build_multi_resolution
        segmentation.py      # segmentation_sam3, segmentation_dino, combine_segmentations
        streaming.py         # streaming mixin stuff
    alcf/
        __init__.py
        controller.py
        segmentation.py

Contributor


Sounds great

logger.error(f"Failed to transfer TIFFs to data832: {e}")
data832_tiff_transfer_success = False

# ── STEP 3: SAM3 / DINOv3 ──────────────────────────
Contributor


This step (SAM3 / DINOv3) is currently hardcoded, but the workflow may vary across different projects. For example, other projects may load/save U-Net features. Could we make this more flexible (perhaps in a future PR)?

Contributor Author


I agree, we should address this in another PR. I'll create a new issue that captures this and the other larger changes that seem out of scope for getting this PR merged.

@@ -0,0 +1,1755 @@
from __future__ import print_function
Contributor


I wonder if we need to explain some reconstruction details in our draft manuscript.

Contributor


It seems Prady will add those details in his paper.

Contributor

@xiaoyachong xiaoyachong left a comment


Hi David, thanks for your great work! I’ve left a few comments—there don’t seem to be any major changes needed, just some minor issues.



@task(name="nersc_segmentation_dino_task")
def nersc_segmentation_dino_task(
Contributor


I noticed that on NERSC there are segmentation_dino and nersc_segmentation_dino_task, while on ALCF there are _segmentation_dino_wrapper, segmentation_dino, and alcf_segmentation_dino_task. Could we make these naming conventions consistent across both environments?

Contributor Author


The _segmentation_dino_wrapper is necessary for submitting to Globus Compute, but not needed for SFAPI. Otherwise, I'll make sure the naming is consistent when I continue working on the ALCF implementation.

@davramov davramov merged commit 5238893 into als-computing:main Apr 6, 2026
1 check passed