Speedup Warmup on ROCm by michaelmckinsey1 · Pull Request #24 · LBANN/ScaFFold

michaelmckinsey1 · 2026-03-09T19:56:39Z

Summary

Set the environment variables from this PR (see the file changes tab) to significantly speed up MIOpen warmup. If you hit an error of the form "No suitable algorithm was found to execute the required convolution", at least one of the options cannot be set for your configuration.

This error can be worked around by increasing the sharding, for example by using 8 shards (2,2,2) at scale 8, but this requires #27.
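To make the sharding arithmetic concrete, here is a rough sketch of the per-shard edge length. It assumes a cubic volume whose edge is 2^scale voxels and an even split along each axis; both are assumptions made for illustration, not something this PR states, and `per_shard_side` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: edge length of one shard, ASSUMING the full volume is a
# cube with edge 2^scale and is split evenly along each axis.
# Usage: per_shard_side <scale> <shards_per_dimension>
per_shard_side() {
    local scale=$1 per_dim=$2
    echo $(( (1 << scale) / per_dim ))
}

per_shard_side 8 2   # scale 8 with (2,2,2) sharding: prints 128
per_shard_side 7 1   # scale 7 with a single shard:   prints 128
```

Under that assumption, (2,2,2) sharding at scale 8 gives each rank the same local shape as an unsharded scale 7 run.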

Details

List of options: https://rocm.docs.amd.com/projects/MIOpen/en/develop/reference/env_variables.html#algorithm-control

export MIOPEN_DEBUG_CONV_DIRECT=0

Eric M. from AMD suggested using this variable to disable direct convolution benchmarking (ScaFFold does not use direct convolutions anyway). This speeds up our warmup period from 700s to 10s at scale 7. However, with this variable set and num_shards=2, I got the error:
MIOpen Error: /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/ocl/convolutionocl.cpp:1125: No suitable algorithm was found to execute the required convolution
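Since this failure only shows up at run time, one option is to try the fast setting first and fall back when the log contains the error. The sketch below is not part of this PR: `miopen_no_algorithm_error` and `run_with_miopen_fallback` are hypothetical helper names, and the wrapped command is whatever launches training.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given log file contains the MIOpen
# "no suitable algorithm" failure quoted above.
miopen_no_algorithm_error() {
    grep -q "No suitable algorithm was found to execute the required convolution" "$1"
}

# Hypothetical wrapper: run a command with MIOPEN_DEBUG_CONV_DIRECT=0 and,
# if it fails with the error above, rerun it once without that variable.
run_with_miopen_fallback() {
    local log
    log=$(mktemp)
    if MIOPEN_DEBUG_CONV_DIRECT=0 "$@" >"$log" 2>&1; then
        cat "$log"; rm -f "$log"
        return 0
    fi
    if miopen_no_algorithm_error "$log"; then
        echo "MIOPEN_DEBUG_CONV_DIRECT=0 left no usable algorithm; retrying without it" >&2
        rm -f "$log"
        # Subshell keeps the unset from leaking into the caller's environment.
        ( unset MIOPEN_DEBUG_CONV_DIRECT; "$@" )
        return $?
    fi
    # Some other failure: surface the log and report the error.
    cat "$log" >&2; rm -f "$log"
    return 1
}
```

Usage would look like `run_with_miopen_fallback <training command and arguments>`. The retry pays the slower warmup, so this only helps when the failure appears early in the run.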

export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0

I tested this option: it only speeds up benchmarking by 25%, but it causes no error. It disables kernels like naive_conv_ab_nonpacked_fwd_ndhwc_half_double_half.kd

export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0

This has a much lower performance impact. It disables kernels like naive_conv_ab_nonpacked_wrw_ndhwc_half_double_half.kd

export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0

This is likely the problematic one, but it also accounts for ~70% of the warmup. It disables naive_conv_ab_nonpacked_bwd_ndhwc_half_double_half.kd. It works at scales below 8. At scale 8 with 2 shards, which should be enough for MI300A, it fails with the "No suitable algorithm was found to execute the required convolution" error. With 8 shards at scale 8 (which should match the per-shard shape of scale 7 with 1 shard), it works and, in conjunction with the other options, reduces warmup by 20x.
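The observations above could be folded into a job script roughly as follows. This is a sketch, not the contents of the .job files in this PR: `miopen_warmup_exports` is a hypothetical helper, and treating more than 8 shards at scale 8 as safe is an untested assumption (only 2 and 8 shards are reported here). Scales above 8 are left conservative because they are not covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the MIOpen settings to apply for a given run.
# Usage: miopen_warmup_exports <scale> <num_shards>
miopen_warmup_exports() {
    local scale=$1 shards=$2
    # FWD and WRW naive kernels: no errors are reported for these options.
    echo "export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0"
    echo "export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0"
    # BWD naive kernel: reported to work below scale 8 and at scale 8 with
    # 8 shards; reported to fail at scale 8 with 2 shards.
    # ASSUMPTION: shard counts above 8 at scale 8 behave like 8 shards.
    if [ "$scale" -lt 8 ] || { [ "$scale" -eq 8 ] && [ "$shards" -ge 8 ]; }; then
        echo "export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0"
    fi
}

# Example: apply the settings for scale 8 with 8 shards in the current shell.
eval "$(miopen_warmup_exports 8 8)"
```

Printing the export lines rather than exporting directly keeps the decision easy to inspect in a job log before it is applied.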

michaelmckinsey1 added 2 commits March 9, 2026 12:52

Update scaffold-tuolumne.job

32b2fd9

Update scaffold-tuolumne-torchpypi.job

73c6d12

michaelmckinsey1 self-assigned this Mar 9, 2026

michaelmckinsey1 marked this pull request as draft March 10, 2026 00:37

michaelmckinsey1 added 4 commits March 13, 2026 10:41

Update scaffold-tuolumne.job

3a8f5e0

Update scaffold-tuolumne-torchpypi.job

2cc5932

Update scaffold-tuolumne-torchpypi.job

69aa4ff

Update scaffold-tuolumne.job

263b4a9

michaelmckinsey1 changed the title ~~Speedup Warmup on Elcap~~ Speedup Warmup on ROCm Mar 21, 2026

michaelmckinsey1 linked an issue Mar 21, 2026 that may be closed by this pull request

Slow performance on ROCm system #18

Open

michaelmckinsey1 wants to merge 6 commits into LBANN:main from michaelmckinsey1:michaelmckinsey1-patch-3
