41 commits
e7c1509
Added GHA CI workflow
leo-amd Feb 19, 2026
70231ba
Change target branch
leo-amd Feb 19, 2026
29a8e0b
Update naming
leo-amd Feb 19, 2026
7bcfc34
ci: trigger actions
leo-amd Feb 19, 2026
a19f56c
Move the file
leo-amd Feb 19, 2026
43a1d6d
Setup python env
leo-amd Feb 20, 2026
79963ba
Use containers
leo-amd Feb 20, 2026
48d9e9c
These k8s runners don't support native containers, therefore I am run…
leo-amd Feb 20, 2026
25126ef
Typo
leo-amd Feb 20, 2026
c307e20
Fix git dubious ownership
leo-amd Feb 23, 2026
bdc31f0
Git fixes
leo-amd Feb 23, 2026
b85b9ca
Typo
leo-amd Feb 23, 2026
9464a90
Cmake change
leo-amd Feb 23, 2026
73fdcc9
requirements.txt fix
leo-amd Feb 23, 2026
2616627
Clone in container
leo-amd Feb 23, 2026
d368e9e
Resolve latest PyTorch main SHA
leo-amd Feb 23, 2026
143384a
Rewrite from scratch
leo-amd Feb 23, 2026
a5e15d3
Set rocm
leo-amd Feb 23, 2026
7e59143
Add sanity check
leo-amd Feb 23, 2026
e606389
set -euxo pipefail
leo-amd Feb 23, 2026
7649d7d
typo
leo-amd Feb 23, 2026
c2de6a2
Rewritten
leo-amd Feb 23, 2026
215d531
Fix tests
leo-amd Feb 23, 2026
46c2da3
Set large timeout for tests
leo-amd Feb 24, 2026
03f3d2c
Split the steps
leo-amd Feb 24, 2026
43fedde
Implement discussed features
leo-amd Feb 25, 2026
2181dd5
Fix tests
leo-amd Feb 25, 2026
8c056d5
Fix tests more
leo-amd Feb 25, 2026
fbba6e0
Try tests
leo-amd Feb 25, 2026
c2ad547
Removed the HIP_VISIBLE_DEVICES code
leo-amd Feb 25, 2026
b8cb6bd
Lock the RCCL context
leo-amd Feb 25, 2026
1dcf247
Force CPU to wait for the GPUs, and we need to force all GPUs to wai…
leo-amd Feb 26, 2026
7726b0c
Revert
leo-amd Feb 26, 2026
0eec57f
Resolve comments
leo-amd Feb 27, 2026
d64c451
Housekeeping
leo-amd Feb 27, 2026
2fbaa90
Run CI
leo-amd Mar 9, 2026
2958b4f
Propagate import errors
leo-amd Mar 9, 2026
82f689a
Extension tests fix
leo-amd Mar 9, 2026
ee7b996
Apply launch bounds unconditionally
leo-amd Mar 9, 2026
0ec6f25
Define USE_ROCM during JIT compilation
leo-amd Mar 10, 2026
b93977f
Revert some changes
leo-amd Mar 10, 2026
180 changes: 180 additions & 0 deletions .github/workflows/rocm-ci.yml
@@ -0,0 +1,180 @@
name: Apex ROCm CI

on:
  pull_request:
    types: [opened, synchronize, ready_for_review]
    branches:
      - master
      - release/1.8.0
      - release/1.9.0
      - release/1.10.0
  workflow_dispatch:
    inputs:
      apex_gitref:
        description: 'Apex branch or commit SHA to build'
        required: false
        default: 'master'
        type: string
      docker_image:
        description: 'Docker image to use'
        required: false
        default: 'rocm/pytorch:latest'
        type: string
      run_extension:
        description: 'Run Extension Import tests'
        required: false
        default: true
        type: boolean
      run_l0:
        description: 'Run L0 tests'
        required: false
        default: true
        type: boolean
      run_contrib:
        description: 'Run Contrib tests'
        required: false
        default: true
        type: boolean
      run_halo:
        description: 'Run Peer Halo Exchange tests'
        required: false
        default: true
        type: boolean
      run_syncbn:
        description: 'Run Distributed Synced BatchNorm tests'
        required: false
        default: true
        type: boolean

env:
  DOCKER_IMAGE: ${{ inputs.docker_image || 'rocm/pytorch:latest' }}

jobs:
  build:
    name: Build Apex Wheel
    runs-on: build-only-apex
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          # Uses the specified ref on manual runs; defaults to the PR/push context otherwise
          ref: ${{ github.event_name == 'workflow_dispatch' && inputs.apex_gitref || '' }}
          submodules: recursive

      - name: Build Apex Wheel in Docker
        run: |
          docker run --rm -v ${{ github.workspace }}:/workspace -w /workspace ${{ env.DOCKER_IMAGE }} bash -c "
            pip install --upgrade pip
            pip install build ninja wheel packaging

            python3 -m build --wheel --no-isolation -C--build-option=--cpp_ext -C--build-option=--cuda_ext
[Collaborator comment] @amd-sriram I wonder if we should just make the --cpp_ext and --cuda_ext flags the default, so that the Apex build command is as simple as that of other packages.
            chown -R $(id -u):$(id -g) dist/
          "

      - name: Upload Wheel Artifact
        uses: actions/upload-artifact@v4
        with:
          name: apex-wheel
          path: dist/*.whl
          retention-days: 7

  test:
    name: Run Unit Tests
    timeout-minutes: 720
    runs-on: linux-apex-mi325-8
[Collaborator comment] @leo-amd Again, this label looks like it's a specific runner label for Apex, which seems unnecessary? E.g. something like linux-mi325-8 should be good enough?

[Author reply] We use that kind of labeling across all repositories, mostly for monitoring and usage reporting.
    needs: build
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event_name == 'workflow_dispatch' && inputs.apex_gitref || '' }}
          submodules: recursive

      - name: Download Wheel Artifact
        uses: actions/download-artifact@v4
        with:
          name: apex-wheel
          path: dist/

      - name: Start Background Docker Container
        run: |
          docker run -d --name apex-test-container \
            --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host \
            -e OMP_NUM_THREADS=8 \
            -e TORCH_NCCL_ASYNC_ERROR_HANDLING=1 \
            -e NCCL_DEBUG=WARN \
            -v ${{ github.workspace }}:/workspace -w /workspace \
            ${{ env.DOCKER_IMAGE }} sleep infinity

      - name: Install Dependencies and Built Wheel
        run: |
          docker exec apex-test-container bash -c "
            set -e
            pip install expecttest onnxscript
[Collaborator comment] @amd-sriram Can these be added to some requirements.txt or requirements-dev.txt file of Apex?
            pip install dist/apex-*.whl
          "

      - name: Run Extension Import tests
        if: ${{ github.event_name != 'workflow_dispatch' || inputs.run_extension }}
        run: |
          docker exec apex-test-container bash -c "
            set -eo pipefail
            cd tests
            python3 test_extension_import.py 2>&1 | tee ../extension_import_results.log
          "

      - name: Run L0 tests
        if: ${{ always() && (github.event_name != 'workflow_dispatch' || inputs.run_l0) }}
        run: |
          docker exec apex-test-container bash -c "
            set -eo pipefail
            cd tests/L0
            sh run_rocm.sh 2>&1 | tee ../../L0_results.log
          "

      - name: Run Contrib tests
        if: ${{ (success() || failure()) && (github.event_name != 'workflow_dispatch' || inputs.run_contrib) }}
        run: |
          docker exec apex-test-container bash -c "
            set -eo pipefail
            cd apex/contrib/test
            python3 run_rocm_extensions.py 2>&1 | tee ../../../contrib_results.log
          "

      - name: Run Peer Halo Exchange tests
        if: ${{ (success() || failure()) && (github.event_name != 'workflow_dispatch' || inputs.run_halo) }}
        run: |
          docker exec apex-test-container bash -c "
            set -eo pipefail
            export HSA_FORCE_FINE_GRAIN_PCIE=1
[Collaborator comment] @amd-sriram RCCL team (Wenkai) told me a while back that HSA_FORCE_FINE_GRAIN_PCIE is not needed anymore: "since ROCm 5.7, no longer required". Can you please confirm whether that's true on ROCm 7.2, and remove it from here (and any Apex documentation) if so?
            export HSA_ENABLE_SDMA=0
[Collaborator comment] @amd-sriram Is this a hack/workaround?
            torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py 2>&1 | tee halo_results.log
          "

      - name: Run Distributed Synced BatchNorm tests
        if: ${{ (success() || failure()) && (github.event_name != 'workflow_dispatch' || inputs.run_syncbn) }}
        run: |
          docker exec apex-test-container bash -c "
            set -eo pipefail
            cd tests/distributed/synced_batchnorm
            sh unit_test.sh 2>&1 | tee ../../../syncbn_results.log
          "

      - name: Fix Artifact Permissions
        if: always()
        run: |
          docker exec apex-test-container bash -c "chown -R $(id -u):$(id -g) *.log"

      - name: Cleanup Background Container
        if: always()
        run: docker rm -f apex-test-container

      - name: Upload Test Logs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-logs
          path: |
            *.log
          retention-days: 14
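An illustrative sketch (not part of the workflow itself) of why the test steps above set `set -eo pipefail` before piping through `tee`: without pipefail, a pipeline's exit status is that of its last command, so `tee` would report success and mask a failing test run from the CI step.

```python
import subprocess

# Without pipefail, the pipeline exits with tee's status (0), hiding the failure.
masked = subprocess.run(["bash", "-c", "false | tee /dev/null"]).returncode

# With pipefail, the failing left-hand command's status propagates to the step.
propagated = subprocess.run(
    ["bash", "-c", "set -o pipefail; false | tee /dev/null"]
).returncode

print(masked, propagated)  # → 0 1
```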
41 changes: 27 additions & 14 deletions tests/test_extension_import.py
@@ -101,30 +101,36 @@ def get_environment(self):
"""
# Get current environment and ensure CUDA/PyTorch libraries are available
env = os.environ.copy()

# Add common CUDA library paths

ld_library_path = env.get('LD_LIBRARY_PATH', '')
cuda_paths = [
'/usr/local/cuda/lib64',
'/usr/local/cuda/lib',
'/opt/conda/lib',
'/usr/lib/x86_64-linux-gnu'
]

extra_paths = []

# Add PyTorch library path
try:
import torch
torch_lib_path = os.path.join(os.path.dirname(torch.__file__), 'lib')
if os.path.exists(torch_lib_path):
cuda_paths.append(torch_lib_path)
extra_paths.append(torch_lib_path)
except ImportError:
pass


# Add ROCm library path if present
rocm_path = os.environ.get('ROCM_PATH', '/opt/rocm')
rocm_lib = os.path.join(rocm_path, 'lib')
if os.path.exists(rocm_lib):
extra_paths.append(rocm_lib)

# Add common CUDA library paths (only those that exist)
for path in ['/usr/local/cuda/lib64', '/usr/local/cuda/lib',
'/opt/conda/lib', '/usr/lib/x86_64-linux-gnu']:
if os.path.isdir(path):
extra_paths.append(path)

# Update LD_LIBRARY_PATH
if ld_library_path:
env['LD_LIBRARY_PATH'] = ':'.join(cuda_paths) + ':' + ld_library_path
env['LD_LIBRARY_PATH'] = ':'.join(extra_paths) + ':' + ld_library_path
else:
env['LD_LIBRARY_PATH'] = ':'.join(cuda_paths)
env['LD_LIBRARY_PATH'] = ':'.join(extra_paths)
return env


@@ -229,7 +235,14 @@ def test_extensions_import(self):
             error_display = error_message[:17] + "..." if len(error_message) > 20 else error_message
             print(f"{extension:<30} {success:<10} {error_display:<20}")
         print("-" * 60)
 
+        # Fail the test if any extensions failed to import
+        failed_extensions = [ext for ext, success, _ in results if not success]
+        self.assertEqual(
+            len(failed_extensions), 0,
+            f"{len(failed_extensions)} extension(s) failed to import: {', '.join(failed_extensions)}"
+        )
+
 
 if __name__ == '__main__':
     unittest.main()
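A standalone sketch of the summary-and-assert pattern the diff above adds, using a hypothetical results list of `(extension, success, error_message)` tuples (the extension names are invented for illustration):

```python
# Hypothetical per-extension results, mirroring the shape used in the test.
results = [
    ("fast_multihead_attn", True, ""),
    ("fused_layer_norm", False, "ImportError: shared object not found"),
]

# Collect failures and build the same style of failure message.
failed = [ext for ext, ok, _ in results if not ok]
message = f"{len(failed)} extension(s) failed to import: {', '.join(failed)}"
print(message)  # → 1 extension(s) failed to import: fused_layer_norm
```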