Optimize CPU deform_conv2d forward pass with parallel im2col#9442
developer0hye wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9442
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 137d2b7 with merge base 8a5946e. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Three changes to the CPU deformable convolution forward kernel:

1. Replace `at::zeros` with `at::empty` for the `columns` and `out_buf` buffers. The `deformable_im2col_kernel` writes every element of the `columns` buffer, and `out_buf` is fully written by `addmm_`, so zero-initialization is wasted work.
2. Use `addmm_` with `beta=0` instead of the default `beta=1`. This avoids accumulating into uninitialized memory while preserving in-place operation (no extra allocation, unlike `at::mm`).
3. Parallelize `deformable_im2col_kernel` with `at::parallel_for`. The im2col loop was the only single-threaded phase in the forward pass (GEMM is already parallelized by BLAS). Each loop iteration writes to a non-overlapping region of the `columns` buffer, so parallelization is safe.

Benchmark results on Apple M2 (CPU, float32):

| Config    | Before (ms) | After (ms) | Change |
|-----------|-------------|------------|--------|
| small-b1  | 9.76        | 2.44       | -75%   |
| small-b8  | 91.77       | 33.88      | -63%   |
| medium-b1 | 216.70      | 75.80      | -65%   |
| medium-b8 | 1152.09     | 650.00     | -44%   |
| large-b1  | 348.86      | 302.70     | -13%   |
| large-b4  | 1342.75     | 1289.96    | -4%    |

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
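As a side note on change 2 above, the `beta=0` point is easy to see from the semantics of `addmm_`: it computes `out = beta*out + alpha*(m1 @ m2)` in place, so with `beta=0` the prior contents of `out` are never read and the buffer may come from `at::empty`. A toy stand-in in plain Python (not the ATen implementation; `addmm_` here is a hypothetical pure-Python model of the op):

```python
import math

def addmm_(out, m1, m2, beta=1.0, alpha=1.0):
    """Toy model of ATen's addmm_: out = beta*out + alpha*(m1 @ m2), in place."""
    n, k = len(m1), len(m1[0])
    m = len(m2[0])
    for i in range(n):
        for j in range(m):
            acc = sum(m1[i][p] * m2[p][j] for p in range(k))
            # With beta == 0 the old value of out[i][j] is never read, so out
            # may start uninitialized (modeled here as NaN) without polluting
            # the result. With the default beta == 1, NaN would propagate.
            old = beta * out[i][j] if beta != 0.0 else 0.0
            out[i][j] = old + alpha * acc

# "Uninitialized" buffer, standing in for what at::empty returns:
out = [[math.nan, math.nan], [math.nan, math.nan]]
m1 = [[1.0, 2.0], [3.0, 4.0]]
m2 = [[5.0, 6.0], [7.0, 8.0]]
addmm_(out, m1, m2, beta=0.0)
print(out)  # [[19.0, 22.0], [43.0, 50.0]]
```

This mirrors why the PR can drop the zero-fill: `beta=0` makes the previous buffer contents irrelevant.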
Force-pushed from e653cad to 8a89fb8.
@developer0hye Hi, thanks a lot for this PR! May I ask what's the motivation for optimizing the CPU path for deform_conv2d? It's almost always used on GPU. Is there a specific application in your use case?
@zy1git Great question!
On a personal note, I've been working on in-browser ML inference — things like humanblur and bgremover. These are built with Candle + WASM rather than PyTorch, so admittedly a different stack, but the experience taught me that CPU-side efficiency matters more than you'd expect — not every user has GPU acceleration available, even in a browser. That mindset carried over here: if a CPU path exists and there's a straightforward way to make it 3x faster, it's worth doing.
Summary

The CPU `deform_conv2d` forward pass spends 89–97% of its time in the `deformable_im2col_kernel` (confirmed via `torch.profiler`), yet this kernel runs entirely single-threaded. GEMM (`addmm_`) accounts for only 3–10% and is already parallelized by BLAS.

This PR introduces three changes to `torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp` that together yield a 2.5–3.3x end-to-end speedup on the forward pass:

1. Parallelize `deformable_im2col_kernel` with `at::parallel_for`. Each loop iteration writes to a non-overlapping region of the columns buffer (the write offset is uniquely determined by `(in_c, out_b, out_y, out_x)`), so parallelization is safe with no synchronization needed. Results are bit-for-bit identical regardless of thread count.
2. Replace `at::zeros` with `at::empty` for the `columns` buffer. `deformable_im2col_kernel` writes every element of this buffer (`n_in_channels × kH × kW × parallel_imgs × out_h × out_w` elements total), so zero-initialization is wasted work.
3. Replace `at::zeros` with `at::empty` for `out_buf` and use `addmm_` with `beta=0`. Each `out_buf[b][g]` is written exactly once per `(batch_block, weight_group)` pair. Using `beta=0` skips the accumulation of uninitialized values while preserving in-place operation (unlike `at::mm`, which allocates a new tensor).

Benchmark
All measurements use `time.perf_counter()`, 10 warmup + 100 timed iterations, reporting the median.

- Hardware: Apple M2, `torch.get_num_threads() = 4`
- Dtype: float32, with mask (DCNv2 mode)
- Config format: `s{spatial}-b{batch}`, e.g. `s32-b4` = 64 in/out channels, 3×3 kernel, stride 1, padding 1, 32×32 spatial, batch 4. `s64-*` uses 256 in/out channels.

Profiler breakdown (baseline, s32-b1)
Benchmark script
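The full benchmark script is collapsed in the PR; as a hedged sketch only, the timing protocol described above (10 warmup iterations, 100 timed iterations, median of `time.perf_counter()` deltas) might look like the following, where `fn` stands in for the actual `torchvision.ops.deform_conv2d` call (a placeholder, not the author's script):

```python
import time
import statistics

def bench(fn, warmup=10, iters=100):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches, thread pools, JIT paths
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Trivial stand-in workload; in the PR's benchmark, fn would invoke
# deform_conv2d on the s{spatial}-b{batch} configs listed above.
ms = bench(lambda: sum(range(10_000)))
print(f"{ms:.3f} ms")
```

The median (rather than the mean) is used so occasional scheduler hiccups don't skew the reported numbers.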
Numerical correctness

Output is bit-for-bit identical between 1-thread and 8-thread execution (`torch.equal` returns `True`). Each thread operates on a disjoint slice of the columns buffer, so floating-point evaluation order is unchanged.

All existing `TestDeformConv` tests pass (forward, backward, scripting, opcheck).

Related
- `deform_conv2d` kernels are sequential and don't utilize multicore resources

cc @NicolasHug
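As a footnote to the numerical-correctness claim above: when every worker writes only to its own disjoint slice of a shared buffer, the result is independent of the worker count. A miniature Python analogue (illustrative only — the real kernel partitions over `(in_c, out_b, out_y, out_x)` in C++ via `at::parallel_for`; the names below are made up for this sketch):

```python
from concurrent.futures import ThreadPoolExecutor

def fill_slice(buf, start, end):
    # Each task writes a disjoint half-open range [start, end) of buf;
    # no two tasks ever touch the same index, so no locks are needed and
    # per-element results don't depend on scheduling order.
    for i in range(start, end):
        buf[i] = i * i

def parallel_fill(n, num_threads):
    buf = [None] * n
    chunk = (n + num_threads - 1) // num_threads  # ceil-divide into ranges
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for t in range(num_threads):
            pool.submit(fill_slice, buf, t * chunk, min((t + 1) * chunk, n))
    return buf  # pool shutdown above waits for all tasks

# Identical output regardless of thread count, mirroring the bit-for-bit
# equality the PR verifies with torch.equal.
assert parallel_fill(1000, 1) == parallel_fill(1000, 8)
```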