Summary
Set the environment variables from this PR (see the file changes tab) to significantly speed up MIOpen warmup. You may encounter an error of the form
No suitable algorithm was found to execute the required convolution
If so, at least one of the options cannot be set for your configuration.
You can work around this error by increasing the sharding, for example by using 8 shards (2,2,2) at scale 8, but this requires #27.
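As a rough illustration of why extra sharding can help, the sketch below computes the per-shard extent along one dimension. It assumes that a scale-s volume has side length 2^s and that a (2,2,2) sharding splits each dimension in two; both are assumptions made for illustration, and `per_shard_extent` is a hypothetical helper, not part of ScaFFold.

```shell
# Hypothetical helper: extent of one shard along a single dimension.
# Assumption: a scale-s volume has side length 2^s.
# Usage: per_shard_extent SCALE SHARDS_PER_DIM
per_shard_extent() {
    scale=$1
    shards_per_dim=$2
    echo $(( (1 << scale) / shards_per_dim ))
}

per_shard_extent 8 2   # scale 8 with (2,2,2) sharding
per_shard_extent 7 1   # scale 7 with a single shard
```

Under these assumptions both calls print the same extent, so each shard at scale 8 would present MIOpen with the same convolution shapes as an unsharded scale 7 run.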
Details
List of options: https://rocm.docs.amd.com/projects/MIOpen/en/develop/reference/env_variables.html#algorithm-control
export MIOPEN_DEBUG_CONV_DIRECT=0
Eric M. from AMD suggested using this variable to disable direct convolution benchmarking (ScaFFold does not use direct convolutions anyway). This speeds up our warmup period from 700s to 10s at scale 7. However, with this setting and num_shards=2 I got the error
MIOpen Error: /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/ocl/convolutionocl.cpp:1125: No suitable algorithm was found to execute the required convolution

I tested this option, and it only speeds up benchmarking by 25%, but there is no error. It disables kernels like
naive_conv_ab_nonpacked_fwd_ndhwc_half_double_half.kd

This option has a much lower performance impact. It disables kernels like
naive_conv_ab_nonpacked_wrw_ndhwc_half_double_half.kd

This option is likely the problematic one, but it also accounts for ~70% of the warmup. It disables kernels like
naive_conv_ab_nonpacked_bwd_ndhwc_half_double_half.kd
It works at scales below 8. At scale 8 with 2 shards, which should be enough for MI300A, it fails with the "No suitable algorithm was found to execute the required convolution" error. However, with 8 shards at scale 8 (which should match the shape of scale 7 with 1 shard), it works and reduces warmup by 20x in conjunction with the other options.
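Because the error depends on scale and shard count, one cautious way to use MIOPEN_DEBUG_CONV_DIRECT=0 is to try it first and fall back to MIOpen defaults only when the error appears. The following is a minimal sketch of that idea, not something this PR implements: `run_with_miopen_fallback` is a hypothetical helper, and it assumes the launched job exits non-zero when MIOpen reports the error.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ScaFFold or MIOpen): run a command with
# direct convolution disabled, and retry with MIOpen defaults if the
# "No suitable algorithm" error shows up in its output.
run_with_miopen_fallback() {
    local log status
    log="$(mktemp)"

    # First attempt: the variable applies to this one command only.
    env MIOPEN_DEBUG_CONV_DIRECT=0 "$@" >"$log" 2>&1
    status=$?

    if [ "$status" -ne 0 ] && grep -q "No suitable algorithm was found" "$log"; then
        echo "MIOPEN_DEBUG_CONV_DIRECT=0 failed; retrying with MIOpen defaults" >&2
        # Subshell so the unset does not leak into the caller's environment.
        ( unset MIOPEN_DEBUG_CONV_DIRECT; "$@" )
        status=$?
    else
        # Success, or an unrelated failure: show the captured output as-is.
        cat "$log"
    fi

    rm -f "$log"
    return "$status"
}
```

Call it as `run_with_miopen_fallback` followed by your usual launch command and its arguments. The retry repeats the full warmup, so this only pays off when the fast path usually succeeds.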