Improve handling of the NVTE_CUDA_ARCHS#2665
Conversation
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Greptile SummaryThis PR improves handling of the
Confidence Score: 4/5
Important Files Changed
Flowchartflowchart TD
A[CMAKE_CUDA_ARCHITECTURES input] --> B{Arch 100 found?}
B -->|Yes| C["Keep 100 in CMAKE_CUDA_ARCHITECTURES\nAppend 100a to NVTE_SPECIFIC_ARCHS\n(+103a if CUDA≥12.9)"]
B -->|No| D{Arch 101 found?}
C --> D
D -->|Yes| E["Keep 101 in CMAKE_CUDA_ARCHITECTURES\nAppend 101a to NVTE_SPECIFIC_ARCHS"]
D -->|No| F{Arch 110 found?}
E --> F
F -->|Yes| G{CMake ≥ 4.0.2?}
G -->|Yes| H["Remove 110, add 110f\nto CMAKE_CUDA_ARCHITECTURES"]
G -->|No| I["Remove 110\nAdd 110 to NVTE_GENERIC_ARCHS\nAdd 110f to NVTE_SPECIFIC_ARCHS"]
F -->|No| J{Arch 120 found?}
H --> J
I --> J
J -->|Yes| K["Remove 120 from CMAKE_CUDA_ARCHITECTURES"]
K --> L{CMake ≥ 4.0.2?}
L -->|Yes| M{CUDA ≥ 12.9?}
L -->|No| N{CUDA ≥ 12.9?}
M -->|Yes| O["Add 120f to CMAKE_CUDA_ARCHITECTURES"]
M -->|No| P["Add 120 to NVTE_GENERIC_ARCHS\nAdd 120a to NVTE_SPECIFIC_ARCHS"]
N -->|Yes| Q["Add 120 to NVTE_GENERIC_ARCHS\nAdd 120f to NVTE_SPECIFIC_ARCHS"]
N -->|No| R["Add 120 to NVTE_GENERIC_ARCHS\nAdd 120a to NVTE_SPECIFIC_ARCHS"]
J -->|No| S{CMake < 4.0.2 AND\nCMAKE_CUDA_ARCHITECTURES empty?}
O --> S
P --> S
Q --> S
R --> S
S -->|Yes| T["Warning + set CMAKE_CUDA_ARCHITECTURES=75"]
S -->|No| U["Set NVTE_ARCH_SPECIFIC_TARGETS=TRUE\nDefine NVTE_HAS_ARCH_SPECIFIC_TARGETS=1\non arch-specific sources"]
T --> U
Last reviewed commit: 1d89a57 |
| @@ -54,31 +52,57 @@ endif() | |||
| # Check for architecture 101 (if we see this we are in toolkit <= 12.9) | |||
| list(FIND CMAKE_CUDA_ARCHITECTURES "101" arch_101_index) | |||
| if(NOT arch_101_index EQUAL -1) | |||
| list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "101") | |||
| list(APPEND NVTE_GENERIC_ARCHS "101") | |||
| list(APPEND NVTE_SPECIFIC_ARCHS "101a") | |||
| endif() | |||
There was a problem hiding this comment.
Arch 100/101 not removed
When "100" / "101" are present in CMAKE_CUDA_ARCHITECTURES, this block only appends 100a/101a to NVTE_SPECIFIC_ARCHS but never removes the base arch from CMAKE_CUDA_ARCHITECTURES nor adds it to NVTE_GENERIC_ARCHS. As a result, the build will still compile all sources for sm_100 / sm_101 (via CMAKE_CUDA_ARCHITECTURES) while also compiling arch-specific sources for sm_100a / sm_101a (via --generate-code), which defeats the “generic + specific” split and can cause unexpected extra build work / incorrect targeting. This differs from the 110/120 handling below, which removes the base arch and routes it into the explicit lists.
There was a problem hiding this comment.
@greptileai That is because arches 110/120 are routed to the "f" (like 120f) variant where the entire family is covered, whereas in the case of 100, it routes to "a" variants (like 100a) - this is intentional behavior, but in order to support different Arches in the same family, we need to still compile for the regular 100 then.
| if(CMAKE_VERSION VERSION_LESS 4.0.2) | ||
| if(NOT CMAKE_CUDA_ARCHITECTURES) | ||
| message(WARNING | ||
| "CMAKE_CUDA_ARCHITECTURES is empty after replacing arch-specific targets. " | ||
| "Please upgrade to CMake 4.0.2+ for native 'f' architecture support. " | ||
| "Adding sm_75 target in addition to the specified target to avoid configuration " | ||
| "errors - this will result in longer build time, but does not affect correctness.") | ||
| set(CMAKE_CUDA_ARCHITECTURES 75) | ||
| endif() |
There was a problem hiding this comment.
Fallback wipes user archs
On CMake < 4.0.2, if the arch list becomes empty you set(CMAKE_CUDA_ARCHITECTURES 75) (dropping the user’s originally requested arch entirely). Given the warning text says “add sm_75 target in addition to the specified target”, this looks like a functional mismatch: the fallback should preserve the originally selected arch(es) and append 75, not replace everything with 75.
There was a problem hiding this comment.
@greptileai If the list is empty, then setting is the same as appending.
|
/te-ci |
| set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) | ||
|
|
||
| # cuDNN frontend API |
There was a problem hiding this comment.
Always-on arch-specific define
set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) makes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 get added to all transformer_engine_cuda_arch_specific_sources (lines ~247-255) even when NVTE_SPECIFIC_ARCHS is empty (e.g., user selects only 70/80/89/90). That flips the behavior in util/ptx.cuh to skip the compile-time static_assert that catches “compiled generic but using arch/family-specific features”, so those mistakes will now silently pass in exactly the builds where no arch-specific codegen flags are being injected for these sources.
|
/te-ci |
| endif() | ||
| endif() | ||
|
|
||
| set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) |
There was a problem hiding this comment.
Unconditional flag disables compile-time safety checks. NVTE_ARCH_SPECIFIC_TARGETS is set to TRUE even when NVTE_SPECIFIC_ARCHS is empty (e.g., user builds only sm_70/80/89/90). This causes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 to be defined for all arch-specific sources (lines 249-258), which disables the static_assert in ptx.cuh lines 34-38 and 56-60. The compile-time check that catches misuse of arch/family-specific features in generic builds is now always off, even when no arch-specific code generation is happening.
| set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) | |
| if(NVTE_SPECIFIC_ARCHS) | |
| set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) | |
| endif() |
|
/te-ci |
Description
Improve handling of the NVTE_CUDA_ARCHS env variable:
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: