Speedup Warmup on ROCm#24

Draft
michaelmckinsey1 wants to merge 6 commits into LBANN:main from michaelmckinsey1:michaelmckinsey1-patch-3

Collaborator

michaelmckinsey1 commented Mar 9, 2026

Summary

Set the environment variables from this PR (see the file changes tab) to significantly speed up MIOpen warmup. You may encounter an error of the form "No suitable algorithm was found to execute the required convolution", which means at least one of the options cannot be set for that configuration.

It is possible to work around this error by increasing the sharding, for example by using 8 shards (2,2,2) at scale 8, but this requires #27.
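As a rough sketch of how a launch script could cope with that error, the helper below runs a command with MIOPEN_DEBUG_CONV_DIRECT=0 and, if the error text shows up in the output, reruns it with the variable unset. The function name run_with_miopen_fallback is hypothetical and not part of this PR. The sketch assumes the MIOpen error is written to stdout or stderr and that rerunning the whole command is acceptable.

```shell
# Hypothetical helper (not part of this PR): try the fast-warmup setting first,
# then fall back to the default if MIOpen reports that no algorithm is available.
run_with_miopen_fallback() {
    log=$(mktemp)
    # First attempt: disable direct convolution benchmarking for this command only.
    if MIOPEN_DEBUG_CONV_DIRECT=0 "$@" >"$log" 2>&1; then
        status=0
    else
        status=$?
    fi
    # Assumption: the error message appears verbatim in the captured output.
    if grep -q "No suitable algorithm was found" "$log"; then
        echo "MIOpen rejected the setting; retrying with it unset" >&2
        if ( unset MIOPEN_DEBUG_CONV_DIRECT; "$@" ) >"$log" 2>&1; then
            status=0
        else
            status=$?
        fi
    fi
    cat "$log"
    rm -f "$log"
    return "$status"
}
```

Usage would look like `run_with_miopen_fallback <your training command>`. Note that a fallback run pays the full warmup cost, so this only helps when the failure shows up early.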

Details

List of options: https://rocm.docs.amd.com/projects/MIOpen/en/develop/reference/env_variables.html#algorithm-control

  • export MIOPEN_DEBUG_CONV_DIRECT=0

Eric M. from AMD suggested using this variable to disable direct convolution benchmarking (ScaFFold does not use direct convolutions anyway). This speeds up our warmup period from 700s to 10s at scale 7. However, with this set and num_shards=2, I got the error:
MIOpen Error: /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/ocl/convolutionocl.cpp:1125: No suitable algorithm was found to execute the required convolution

  • export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0

I tested this option, and it only speeds up benchmarking by 25%, but there is no error. It disables kernels like naive_conv_ab_nonpacked_fwd_ndhwc_half_double_half.kd

  • export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0

This has a much lower performance impact. It disables kernels like naive_conv_ab_nonpacked_wrw_ndhwc_half_double_half.kd

  • export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0

This is likely the problematic one, but it also accounts for ~70% of the warmup. It disables naive_conv_ab_nonpacked_bwd_ndhwc_half_double_half.kd. It works at scales below 8. At scale 8 with 2 shards, which should be enough for MI300A, this option fails with the "No suitable algorithm was found to execute the required convolution" error. However, with 8 shards at scale 8 (which should match the shape of scale 7 with 1 shard), this option works and, in conjunction with the other options, reduces warmup by 20x.
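The observations above could be folded into a launch script roughly as follows. This is only a sketch: the function name set_miopen_warmup_env and its scale and num_shards arguments are hypothetical, and the conditions simply encode the cases reported in this PR (other scales and shard counts are untested). MIOPEN_DEBUG_CONV_DIRECT=0 is deliberately left out because it produced the error with num_shards=2.

```shell
# Hypothetical helper (not part of this PR): choose MIOpen warmup settings
# from the configurations tested above. Usage: set_miopen_warmup_env SCALE NUM_SHARDS
set_miopen_warmup_env() {
    local scale
    local num_shards
    scale="$1"
    num_shards="$2"

    # FWD was reported to run without error; WRW is not flagged as problematic above.
    export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
    export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0

    # BWD worked below scale 8 and at scale 8 with 8 shards,
    # but failed at scale 8 with 2 shards. Anything else is left at the default.
    if [ "$scale" -lt 8 ] || { [ "$scale" -eq 8 ] && [ "$num_shards" -ge 8 ]; }; then
        export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
    else
        unset MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD
    fi
}
```

For example, `set_miopen_warmup_env 8 2` would leave the BWD variable unset, while `set_miopen_warmup_env 8 8` would export all three.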

michaelmckinsey1 self-assigned this Mar 9, 2026
michaelmckinsey1 marked this pull request as draft March 10, 2026 00:37
michaelmckinsey1 changed the title from "Speedup Warmup on Elcap" to "Speedup Warmup on ROCm" Mar 21, 2026
michaelmckinsey1 linked an issue Mar 21, 2026 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Slow performance on ROCm system