Fix bus error or segfault from roi_align with large batchsize #9441
zy1git wants to merge 7 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9441
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit b0004a1 with merge base 6285457. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
test/test_ops.py (Outdated)

```python
output_bytes = n_rois * channels * pooled_h * pooled_w * 4  # float32
if output_bytes > 9 * 1024**3:
    pytest.skip("Test requires ~9 GB of memory")
```
all these values are statically defined. This if block is either always True or always False.
Thanks for pointing this out. I agree. I removed that part in the new commit.
test/test_ops.py (Outdated)

```python
    x = torch.rand(num_imgs, channels, height, width, dtype=torch.float32, device=device)
    rois = torch.zeros(n_rois, 5, dtype=torch.float32, device=device)
except RuntimeError:
    pytest.skip("Not enough memory to allocate test tensors")
```
Please verify tests aren't being skipped on the CI. If they pass, remove the try/except; if they don't, we'll have to consider other strategies to test this.
Thanks for the comment. I verified that the tests are not being skipped on the CI and my devserver. I removed the try/except in the new commit.
```diff
 template <typename T>
 void roi_align_backward_kernel_impl(
-    int nthreads,
+    int64_t nthreads,
```
Can you explain why nthreads needs to be int64_t? It should never need to be that large? If it's for integer comparison to not warn, we could just cast?
Thanks for the question. nthreads in the backward kernel is grad.numel(), which equals n_rois × channels × pooled_h × pooled_w.
I changed nthreads back to int in both the .cpp and .cu files and ran the tests (only the forward kernel test, not the backward kernel test, due to the large memory requirement). The CPU test passed, but the CUDA test failed with all-zero output. The reason is that the CPU forward kernel doesn't use nthreads — it loops over n_rois separately, and the overflow is handled by the int64_t changes to index_n, index_n_c, and index. The CUDA forward kernel uses a flat loop with nthreads as the bound, so truncating to int caused nthreads to wrap to a negative value, making the loop condition immediately false and skipping all output computation — resulting in all-zero output.
The CPU backward kernel does use nthreads in the same flat-loop pattern as CUDA (for (int64_t index = 0; index < nthreads; ...)) and receives the same overflowing value via grad.numel(), so it needs int64_t for the same reason.
A backward-specific test would require large memory (output + grad_output + grad_input), which might be impractical for CI. Do we want to add one with a memory skip guard, or is the current forward-only test sufficient?
There was a problem hiding this comment.
As for "If it's for integer comparison to not warn, we could just cast?", I think it isn't a warning issue. The problem is that the value could be actually large and gets truncated at the call site before the function body runs. In the author's reproducing example (batch_size=172, default 1000 proposals per image), nthreads is 172,000 × 256 × 7 × 7 = 2,157,568,000 > INT_MAX.
I added the backward kernel test in the latest commit. If I change int64_t nthreads back to int nthreads, the CPU backward test fails with all-zero gradients because nthreads gets truncated to a negative value and the loop never executes.
nthreads doesn't mean "number of threads" in this code; it means "total number of output elements to process", which can be very large.
Feel free to let me know if you have any questions.
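The wrap-around can be reproduced in a few lines of pure Python (a sketch: the `to_int32` helper simulating C's two's-complement narrowing is hypothetical, and the sizes are taken from the reproducing example above):

```python
# Output-element count from the reproducing example:
# 172,000 ROIs x 256 channels x 7 x 7 pooled output.
n_rois, channels, pooled_h, pooled_w = 172_000, 256, 7, 7
nthreads = n_rois * channels * pooled_h * pooled_w  # 2,157,568,000 > INT_MAX

def to_int32(x: int) -> int:
    # Simulate the implicit int64_t -> int narrowing at the call site
    # (two's-complement wrap-around).
    return (x + 2**31) % 2**32 - 2**31

truncated = to_int32(nthreads)  # -2,137,399,296: negative after the wrap

# A flat loop `for (int index = 0; index < nthreads; ...)` with the
# truncated bound executes zero iterations, so the output stays all zeros.
iterations = sum(1 for _ in range(max(truncated, 0)))  # 0
```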
Thanks. For my own ref:
nthreads is the name for output_size in the cuda kernel:
vision/torchvision/csrc/ops/cuda/roi_align_kernel.cu, lines 378–379 (at d7400a3)
where output size was already inferred as uint64 since size() returns uint64:
vision/torchvision/csrc/ops/cuda/roi_align_kernel.cu, lines 354–362 (at d7400a3)
nthreads was previously implicitly cast to int32 when passed to roi_align_forward_kernel_impl causing an overflow. Now, with the change made to the function signature, it's properly kept as int64.
```python
@pytest.mark.parametrize("device", cpu_and_cuda())
def test_roi_align_large_index(self, device):
    """Regression test for https://github.com/pytorch/vision/issues/8206"""
```
Suggested change:

```diff
-    """Regression test for https://github.com/pytorch/vision/issues/8206"""
+    """Non-regression test for https://github.com/pytorch/vision/issues/8206"""
```
```python
# Forward kernel test
assert result.shape == (n_rois, channels, pooled_h, pooled_w)
assert result.abs().sum() > 0, "roi_align returned all zeros — likely an index overflow bug"
```
Here and below, don't specify anything beyond the assert, pytest is already good at showing the right thing. Here, the message doesn't help that much either.
Suggested change:

```diff
-assert result.abs().sum() > 0, "roi_align returned all zeros — likely an index overflow bug"
+assert result.abs().sum() > 0
```
```cpp
pc.pos1 = static_cast<int64_t>(y_low) * width + x_low;
pc.pos2 = static_cast<int64_t>(y_low) * width + x_high;
pc.pos3 = static_cast<int64_t>(y_high) * width + x_low;
pc.pos4 = static_cast<int64_t>(y_high) * width + x_high;
```
Can you double check that these casts are needed?
claude says: y_low and y_high are pixel coordinates bounded by height, and width is the image width. y_low * width + x_low is at most height * width, which is the number of pixels in a single channel of a single image. That's not going to overflow int.
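As a quick numeric sanity check of that bound (a sketch; the 8K feature-map size is an assumed worst case, not something from the PR):

```python
INT_MAX = 2**31 - 1

# y_low/y_high are bounded by height and x_low/x_high by width, so
# y * width + x is at most height * width - 1: the pixel count of one
# channel of one image.
height, width = 7680, 4320  # assumed worst case: an 8K-sized plane
max_pos = (height - 1) * width + (width - 1)  # 33,177,599
assert max_pos < height * width <= INT_MAX  # ~33 M, far below 2**31
```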
Please check every single other change in this PR.
```python
), f"Expected execution to take < {execution_time_ms_threshold} ms, actually took {execution_time_ms} ms"

@pytest.mark.parametrize("device", cpu_and_cuda())
def test_roi_align_large_index(self, device):
```
This is actually taking a super long time on CPU:

```
2026-03-20T08:49:14.6040945Z 260.23s call test/test_ops.py::TestRoIAlign::test_roi_align_large_index[cpu]
```

On GPU it seems OK. Let's just skip the test on CPU, indicating it takes too long.
Summary
Bug: roi_align in torchvision crashes with a bus error/segfault on CPU or returns silently wrong (all-zero) results on CUDA when the total number of output elements exceeds INT_MAX (~2.1 billion). This is caused by 32-bit int overflow in index arithmetic within the C++ and CUDA kernels.

Root Cause: The kernels use int for composite index calculations like n × channels × pooled_width × pooled_height and pointer offsets like (roi_batch_ind × channels + c) × height × width. When these products exceed 2,147,483,647, the int wraps to a negative value, causing out-of-bounds memory access.
Example: FasterRCNN with batch_size=172 generates ~172,000 ROIs. The output index reaches 171,999 × 256 × 7 × 7 = 2,157,555,456 > INT_MAX, which matches the reporter's observed threshold exactly.
Fix: Promoted int to int64_t for all index, offset, and stride variables in the relevant files.
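To illustrate why the promotion matters, the flat output-index arithmetic can be mirrored in Python (a sketch with hypothetical names; `trunc32` simulates storing the index in a C `int`):

```python
def trunc32(x: int) -> int:
    # Simulate two's-complement truncation of a value stored in a 32-bit int.
    return (x + 2**31) % 2**32 - 2**31

# Flat index of the LAST output element in the reproducing example:
# index = ((n * channels + c) * pooled_h + ph) * pooled_w + pw
channels, pooled_h, pooled_w = 256, 7, 7
n, c, ph, pw = 171_999, 255, 6, 6
index64 = ((n * channels + c) * pooled_h + ph) * pooled_w + pw  # 2,157,567,999

# int64_t keeps the index positive; a 32-bit int wraps it negative,
# turning the subsequent pointer offset into an out-of-bounds access.
index32 = trunc32(index64)  # -2,137,399,297
```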
Test Plan

New overflow regression test:

```shell
pytest test/test_ops.py::TestRoIAlign::test_roi_align_large_index -v
```

Existing tests — verify no regressions:

```shell
pytest test/test_ops.py::TestRoIAlign -v
```

Fixes #8206