
feat(hesai): add CUDA-accelerated point cloud decoder #421

Open
k1832 wants to merge 7 commits into tier4:main from k1832:feat/core-cuda-decode

Conversation

@k1832 commented Mar 19, 2026

PR Type

  • New Feature

Related Links

Description

Add a GPU-accelerated decode path for Hesai LiDAR sensors using CUDA. The feature is:

  • Compile-time opt-in: Build with -DBUILD_CUDA=ON. If the CUDA toolkit is not found, the build silently falls back to CPU-only.
  • Runtime opt-in: Set the NEBULA_USE_CUDA=1 environment variable. When unset, the existing CPU path is used with zero overhead.
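For context, a minimal sketch of how such a runtime environment-variable gate can look (the helper name is hypothetical; only the `NEBULA_USE_CUDA` variable comes from this PR):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper illustrating the runtime opt-in described above:
// the GPU decode path is taken only when NEBULA_USE_CUDA is set to "1".
inline bool cuda_requested_via_env()
{
  const char * value = std::getenv("NEBULA_USE_CUDA");
  return value != nullptr && std::strcmp(value, "1") == 0;
}
```

A check like this runs once at decoder construction, so the per-packet cost when disabled is effectively zero.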

What it does

  • Processes an entire scan in a single batched CUDA kernel launch (launch_decode_hesai_scan_batch)
  • Uses pre-computed angle lookup tables (azimuth/elevation) uploaded to GPU once at initialization
  • Supports calibration-based and correction-based angle correctors
  • Currently validated on OT128 (Pandar128E4X) sensor
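To illustrate the lookup-table idea from the list above: trigonometric values are precomputed once on the host and (in this PR) uploaded to the GPU at initialization, so the kernel avoids per-point trigonometry. The sketch below is illustrative only; names and table resolution are assumptions, not the PR's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative host-side LUT construction (names/resolution hypothetical).
// The PR's decoder uploads its azimuth/elevation tables once at init.
struct AngleLut
{
  std::vector<float> sin_table;
  std::vector<float> cos_table;
};

inline AngleLut build_angle_lut(std::size_t steps = 36000)  // 0.01-degree resolution
{
  AngleLut lut{std::vector<float>(steps), std::vector<float>(steps)};
  for (std::size_t i = 0; i < steps; ++i) {
    const double rad = (static_cast<double>(i) * 360.0 / steps) * M_PI / 180.0;
    lut.sin_table[i] = static_cast<float>(std::sin(rad));
    lut.cos_table[i] = static_cast<float>(std::cos(rad));
  }
  return lut;
}
```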

Files changed

| File | Change |
| --- | --- |
| hesai_cuda_kernels.cu | New CUDA kernel for batched point cloud decoding |
| hesai_cuda_decoder.hpp | HesaiScanDecoderCuda RAII class: GPU buffer management, angle LUT, device memory |
| hesai_decoder.hpp | Integration: GPU scan buffer, flush, result conversion |
| hesai_sensor.hpp | Expose max_scan_buffer_points() for GPU buffer sizing |
| angle_corrector_*.hpp | Expose angle LUT data for GPU upload |
| cuda_compat.hpp | NEBULA_HOST_DEVICE / NEBULA_DEVICE macros for host/device code sharing |
| nebula_hesai_decoders/CMakeLists.txt | CUDA library target, toolkit detection |
| nebula_hesai/CMakeLists.txt | CUDA decoder test target |
| hesai_cuda_decoder_test.cpp | 5 GPU-vs-CPU equivalence tests |
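To illustrate the cuda_compat.hpp entry: the usual pattern is a macro that expands to CUDA qualifiers under nvcc and to nothing under a plain host compiler, so shared math helpers build in both contexts. A sketch (the actual macro definitions in the PR may differ in detail):

```cpp
#include <cassert>
#include <cmath>

// Sketch of the host/device annotation pattern behind NEBULA_HOST_DEVICE.
// Under nvcc (__CUDACC__) the macro adds CUDA qualifiers; under a plain
// host compiler it expands to nothing.
#ifdef __CUDACC__
#define NEBULA_HOST_DEVICE __host__ __device__
#define NEBULA_DEVICE __device__
#else
#define NEBULA_HOST_DEVICE
#define NEBULA_DEVICE
#endif

// One of the shared helpers mentioned in the review (deg2rad);
// on the host this compiles as an ordinary inline function.
NEBULA_HOST_DEVICE inline float deg2rad(float deg)
{
  return deg * (static_cast<float>(M_PI) / 180.0f);
}
```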

Known limitations

  • GPU kernel does not set return_type field (always return_id)

Performance

Measured using profiling_runner.bash (3 runs × 20s each, Pandar128E4X ~72k pts/scan, RTX 5080, CUDA 12.4):

./scripts/profiling_runner.bash cpu-baseline \
    --sensor-model Pandar128E4X --rosbag-path <ot128_rosbag> -n 3 -t 20

NEBULA_USE_CUDA=1 ./scripts/profiling_runner.bash gpu-cuda \
    --sensor-model Pandar128E4X --rosbag-path <ot128_rosbag> -n 3 -t 20

./scripts/plot_times.py cpu-baseline gpu-cuda --metrics decode
| Configuration | Median | P5 | P95 |
| --- | --- | --- | --- |
| CPU Baseline | 6.80 ms | 6.62 ms | 7.28 ms |
| GPU (this PR) | 2.48 ms | 2.41 ms | 12.77 ms |

The GPU results show a bimodal distribution: ~57% of scans decode in ~2.4 ms (~2.8x faster than CPU), while the remaining ~43% take ~10.5 ms. The slow path is dominated by the bulk D2H copy in process_gpu_results(), which copies the entire sparse output buffer back to host memory. The follow-up PR (zero-copy via cuda_blackboard) eliminates this D2H copy by keeping points on the GPU, which should remove the slow-path bottleneck.
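As a reading aid for the percentile figures: with a bimodal mixture of ~57% fast and ~43% slow scans, the median lands in the fast mode while the P95 lands in the slow mode, which matches the table. A toy nearest-rank percentile sketch (the sample sizes below are synthetic, modeled on the percentages above):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy nearest-rank percentile over a latency sample (synthetic data,
// not the PR's measurement code).
inline double percentile(std::vector<double> samples, double p)
{
  std::sort(samples.begin(), samples.end());
  const auto idx = static_cast<std::size_t>(p * (samples.size() - 1));
  return samples[idx];
}
```

For example, over 57 samples at 2.4 ms and 43 samples at 10.5 ms, the 0.50 percentile is 2.4 ms while the 0.95 percentile is 10.5 ms.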

Review Procedure

Build (with CUDA)

colcon build --packages-up-to nebula_hesai \
  --cmake-args -DBUILD_CUDA=ON -DBUILD_TESTING=ON

Requires NVIDIA CUDA Toolkit (tested with CUDA 12.x). If the toolkit is not found, the build succeeds but CUDA support is silently disabled.

Running with CUDA enabled

The GPU decode path is gated by a runtime environment variable:

# Enable GPU decoding
export NEBULA_USE_CUDA=1

# Launch the driver node as usual — it will log "CUDA decoder initialized successfully" on startup
ros2 launch nebula nebula_launch.py sensor_model:=Pandar128E4X ...

# To disable (default), unset the variable
unset NEBULA_USE_CUDA

Test

# Run all tests (132 existing + 5 new CUDA tests)
source install/setup.bash
colcon test --packages-select nebula_hesai --ctest-args -V

# Or run CUDA tests only
./build/nebula_hesai/hesai_cuda_decoder_test_main

Test results

[==========] Running 5 tests from 1 test suite.
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence
[       OK ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence (21778 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty
[       OK ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty (388 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuFieldValidity
[       OK ] HesaiCudaDecoderTest.OT128_GpuFieldValidity (378 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts
[       OK ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts (369 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_IntensityExactMatch
[       OK ] HesaiCudaDecoderTest.OT128_IntensityExactMatch (17217 ms)
[  PASSED  ] 5 tests.

# Full suite
Summary: 137 tests, 0 errors, 0 failures, 0 skipped

Remarks

  • When CUDA is not compiled in (BUILD_CUDA=OFF), the 5 CUDA tests are compiled but skip at runtime via GTEST_SKIP(), so they do not break CPU-only CI.
  • GPU uses CPU-authoritative scan cutting via scan_state flags. Point counts are identical between CPU and GPU (verified by kMaxPointCountDiff = 0 in tests). Coordinate differences are sub-millimetre (< 0.1 mm) due to floating-point hardware rounding.
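The coordinate comparison in the second bullet amounts to a per-point check like the following sketch (the Point struct and constant name here are illustrative; the PR's actual test constants are kMaxPointCountDiff and kXyzTolerance):

```cpp
#include <cassert>
#include <cmath>

// Sketch of the per-point equivalence criterion described above:
// coordinates must match within 0.1 mm. Names are illustrative.
struct Point
{
  float x, y, z;
};

constexpr float kXyzToleranceMeters = 0.0001f;  // 0.1 mm

inline bool points_match(const Point & cpu, const Point & gpu)
{
  return std::fabs(cpu.x - gpu.x) <= kXyzToleranceMeters &&
         std::fabs(cpu.y - gpu.y) <= kXyzToleranceMeters &&
         std::fabs(cpu.z - gpu.z) <= kXyzToleranceMeters;
}
```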

Pre-Review Checklist for the PR Author

PR Author should check the checkboxes below when creating the PR.

  • Assign PR to reviewer

Checklist for the PR Reviewer

Reviewers should check the checkboxes below before approval.

  • Commits are properly organized and messages are according to the guideline
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

PR Author should check the checkboxes below before merging.

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.

@k1832 force-pushed the feat/core-cuda-decode branch from 580316f to cd2b0e8 on March 23, 2026 at 01:32
k1832 added 2 commits March 23, 2026 12:21
Add a GPU decode path for Hesai LiDAR sensors, gated behind compile-time
BUILD_CUDA=ON and runtime NEBULA_USE_CUDA=1 environment variable.

The implementation includes:
- CUDA kernel for batched point cloud decoding (hesai_cuda_kernels.cu)
- Angle LUT upload and GPU scan buffer management in hesai_decoder.hpp
- GPU-vs-CPU equivalence tests for OT128 (Pandar128E4X) sensor

The GPU path processes an entire scan in a single kernel launch, using
pre-computed angle lookup tables and a sparse output buffer. When CUDA
is not available or NEBULA_USE_CUDA is unset, the existing CPU path is
used with zero overhead.

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
- Copyright year 2024 -> 2026 for new files
- Replace deprecated find_package(CUDA) with find_package(CUDAToolkit)
- Remove --expt-relaxed-constexpr flag (not needed)
- Remove unused per-packet kernel and launcher (dead code)
- Batch launcher returns bool; caller logs via NEBULA_LOG_STREAM
- Reorder CudaNebulaPoint fields for better memory packing
- Remove redundant is_multi_frame member; use n_frames > 1
- Make HesaiCudaDecoder destructor virtual
- Add int32_t range guarantee comment in angle corrector

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
@k1832 force-pushed the feat/core-cuda-decode branch from cd2b0e8 to 508175b on March 23, 2026 at 03:21
@codecov

codecov bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.37%. Comparing base (baf4f92) to head (305c097).
⚠️ Report is 9 commits behind head on main.

❌ Your patch check has failed because the patch coverage (83.33%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #421      +/-   ##
==========================================
+ Coverage   48.35%   48.37%   +0.02%     
==========================================
  Files         156      157       +1     
  Lines       12996    13007      +11     
  Branches     6900     6906       +6     
==========================================
+ Hits         6284     6292       +8     
- Misses       5325     5327       +2     
- Partials     1387     1388       +1     
| Flag | Coverage | Δ |
| --- | --- | --- |
| nebula_core_common | 55.27% <84.61%> | (?) |
| nebula_core_decoders | 55.27% <84.61%> | (?) |
| nebula_hesai | 55.27% <84.61%> | (?) |
| nebula_hesai_decoders | 55.27% <84.61%> | (?) |

Flags with carried forward coverage won't be shown.


Replace .points access with direct iteration over PointCloud<T>
(which now extends std::vector<T> instead of pcl::PointCloud).

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
@k1832 force-pushed the feat/core-cuda-decode branch from 62ab94c to 09658fe on March 23, 2026 at 03:55
pre-commit-ci bot and others added 2 commits March 23, 2026 03:56
- Add missing #include <string> in hesai_decoder.hpp
- Add missing #include <limits> in hesai_cuda_decoder_test.cpp
- Fix readability/braces warning for ifdef-guarded else block

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
@k1832 k1832 marked this pull request as ready for review March 23, 2026 04:05
Collaborator

@mojomex mojomex left a comment


Thank you for the awesome PR!

I think the core of it is already good, and it's quite readable and cleanly implemented. I've left some comments here and there about what would make our maintenance life easier moving forward.

About the scan cutting differences, I think you could fairly easily integrate the CPU scan cutting state (which is still available, even when CUDA is enabled in your implementation) to make CUDA behave exactly like the CPU version.

I haven't yet reviewed every detail completely, but there's enough feedback for now I think.

Also, I'd really appreciate if you could post some performance metrics, e.g. using Nebula's benchmark scripts.

@k1832
Author

k1832 commented Apr 2, 2026

Thank you for the review! All inline comments have been addressed.

Re performance: I added a comparison to the PR description (CPU 6.08ms vs GPU 0.31ms, ~20x speedup on RTX 5080). The numbers were collected using a non-standard profiling setup. Let me look into reproducing them with profiling_runner.bash / plot_times.py and I will follow up with that.

@drwnz drwnz requested a review from manato April 2, 2026 07:41
@mojomex
Collaborator

mojomex commented Apr 2, 2026

@k1832 Damn! 20x is insane. Amazing work 👏

Collaborator

@mojomex mojomex left a comment


@k1832 Thank you for addressing the changes!
It seems like they're not pushed yet though, please check.

- Tighten test tolerances: kMaxPointCountDiff=0, kMinMatchRatio=1.0,
  kXyzTolerance=0.1mm (GPU uses CPU-authoritative scan_state)
- Create cuda_compat.hpp with NEBULA_HOST_DEVICE macros; annotate
  deg2rad, rad2deg, angle_is_between, normalize_angle
- Remove dead get_cuda_raw_angles() and get_frame_angle_info()
- Refactor GPU resources into HesaiScanDecoderCuda RAII class
  (constructor allocates, destructor frees, no manual cleanup)
- Remove CUDA_KERNELS_EXIST file-existence check from CMakeLists
- Simplify HesaiDecoder: single unique_ptr replaces ~100 lines of
  scattered CUDA members and manual destructor
@k1832
Author

k1832 commented Apr 3, 2026

Ran the benchmark using profiling_runner.bash as suggested (3 runs × 20s, Pandar128E4X on RTX 5080).

| Configuration | Median | P5 | P95 |
| --- | --- | --- | --- |
| CPU Baseline | 6.80 ms | 6.62 ms | 7.28 ms |
| GPU (this PR) | 2.48 ms | 2.41 ms | 12.77 ms |

The GPU results are bimodal: ~57% of scans at ~2.4 ms, ~43% at ~10.5 ms. The slow scans are bottlenecked by the bulk D2H sparse buffer copy in process_gpu_results().

Sorry for the confusion with the earlier 20x number — that was measured with CUDA events around only the H2D + kernel + count readback, which excluded the packet staging overhead and the bulk D2H result copy. That was not an apples-to-apples comparison with the CPU path. The honest end-to-end comparison via profiling_runner.bash shows ~2.8x on the fast path (6.8 ms → 2.4 ms).

The follow-up PR (zero-copy via cuda_blackboard) eliminates the D2H copy by keeping points on the GPU, which should remove the slow-path bottleneck.

@k1832
Author

k1832 commented Apr 3, 2026

@mojomex The PR description is updated too with the performance measurement result. Please take a look again 🙏

Collaborator

@manato manato left a comment


@k1832
Thank you very much for the PR!

Overall, it is nicely looking and well organized!
I left some comments that I noticed. I'd appreciate it if you could consider them 🙏

In terms of speed-up, I guess the measurement results using a rosbag and the actual sensor would differ slightly: with the actual sensor, decoding and receiving are pipelined, while with a rosbag a whole scan's worth of packets (i.e., accumulated packets) is processed at once.
By any chance, do you perhaps have any measurement data using the actual sensors?

Collaborator

@mojomex mojomex left a comment


Getting closer! I left some CUDA performance and architecture comments, please check.


std::array<DecodeFrame, 2> frame_buffers_{initialize_frame(), initialize_frame()};

#ifdef NEBULA_CUDA_ENABLED
Collaborator


I'd still like to separate out all the CUDA stuff from this file.

Logically, we could split this as follows:

  • HesaiDecoder handles:
    • packet bytes->struct conversion
    • functional safety
    • packet loss diagnostics
    • scan cutting
    • performance counters/timers
  • HesaiDecoderImplCpu:
    • unpack(const packet&)
    • accumulation in std::array<DecodeFrame, 2>
  • HesaiDecoderImplCuda:
    • unpack(const packet&)
    • accumulation and HtoD, DtoH

Or in other words, have the unpack function + impl-specific state/buffers in two different classes and dependency-inject one of them into HesaiDecoder.
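The proposed split could be sketched roughly as follows. This is a sketch of the suggestion, not the PR's current code; all names besides HesaiDecoder are illustrative:

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Sketch of the suggested architecture: the decoder keeps shared logic
// (scan cutting, diagnostics, counters) and dependency-injects one of
// two unpack implementations. All names here are illustrative.
struct Packet
{
  // raw packet bytes elided
};

class DecoderImpl
{
public:
  virtual ~DecoderImpl() = default;
  virtual void unpack(const Packet & packet) = 0;
};

class DecoderImplCpu : public DecoderImpl
{
public:
  int unpacked_packets = 0;
  void unpack(const Packet &) override { ++unpacked_packets; }  // decode + accumulate on CPU
};

class HesaiDecoderSketch
{
public:
  explicit HesaiDecoderSketch(std::unique_ptr<DecoderImpl> impl) : impl_(std::move(impl)) {}

  // Shared per-packet logic (diagnostics, scan cutting) would live here,
  // around the delegated unpack call.
  void on_packet(const Packet & p) { impl_->unpack(p); }

private:
  std::unique_ptr<DecoderImpl> impl_;
};
```

A CUDA-backed implementation would provide the same `unpack` interface but stage packets for batched H2D/D2H transfers instead.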

Author


Agreed this would be cleaner. This is a larger refactor — will address in a follow-up.

@mojomex
Collaborator

mojomex commented Apr 3, 2026

(Screenshot: nsys timeline, captured 2026-04-04)

Here's my nsys run as well (sorry @manato for the collision 😭). As @manato and I commented, the cudaMemcpys are on the default stream, and indeed the DtoH copies in the decode phase (not just init phase) are on the default stream, causing interference with other workloads.

@drwnz
Collaborator

drwnz commented Apr 6, 2026

Without having looked at the code in detail yet, my main concern is about the actual latency with a real sensor.
Because testing on rosbags means you have all the packets at once and can therefore parallelize decoding of a whole scan, the speedup you observed is expected. However, with a real sensor, the packets arrive one by one and are processed as they arrive. Therefore, waiting for a batch or a whole scan of packets may actually increase the pipeline latency in some conditions. Batching the packet processing could result in quite complex sensor-dependent tuning.
@k1832 was this considered in your implementation?

@mojomex
Collaborator

mojomex commented Apr 6, 2026

@drwnz To track this, we should probably do what autoware_pointcloud_preprocessor does and publish a pipeline_latency after we published our point clouds (pipeline_latency = publish_time - pointcloud_message_header_stamp).

What do you think?
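The suggested metric is straightforward to compute at publish time; a sketch with std::chrono standing in for ROS time types (the function name is illustrative):

```cpp
#include <cassert>
#include <chrono>

// Sketch of the pipeline_latency metric suggested above:
// publish_time minus the point cloud message's header stamp, in ms.
// std::chrono is used here in place of ROS time types for illustration.
using Clock = std::chrono::steady_clock;

inline double pipeline_latency_ms(Clock::time_point header_stamp, Clock::time_point publish_time)
{
  return std::chrono::duration<double, std::milli>(publish_time - header_stamp).count();
}
```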

- Use cudaMemcpyAsync + dedicated stream everywhere (angle LUT upload,
  count readback, result D2H copy) to avoid blocking default stream
- Simplify dual-return filter return for n_returns==2 path
- Remove unused #include <tuple> from angle_corrector_correction_based
- Remove unnecessary target_include_directories for CUDAToolkit
- Save/restore NEBULA_USE_CUDA env in test instead of unconditional unset
- Remove redundant comment about unique_ptr cleanup
@k1832
Author

k1832 commented Apr 7, 2026

@drwnz (CC: @mojomex )

my main concern is about the actual latency with a real sensor.

Good point. Here's my analysis of the latency difference:

If my understanding is correct, both CPU and GPU paths publish the full scan's point cloud at once when the scan boundary is detected, not per-packet. The difference is how much work remains at that moment:

  • CPU: points decoded incrementally as packets arrived. At boundary, just publish. Near-instant.
  • GPU: packets only staged to host memory. At boundary, full flush (H2D + kernel + D2H) before publishing.

So the GPU path adds flush latency at the scan boundary. The profiling results show this flush takes 2.4ms on some scans and 10.5ms on others (bimodal pattern), where the slow path is dominated by the bulk D2H sparse buffer copy. This bimodal pattern is likely an artifact of the short test rosbag being looped. The follow-up PR (zero-copy via cuda_blackboard) eliminates this D2H copy entirely, which should remove the slow path and make the distribution uniform.

The primary benefit of PR1 is CPU offload during the scan period. This is fully opt-in — build with -DBUILD_CUDA=OFF and no CUDA code is compiled at all. Even with CUDA compiled in, omitting export NEBULA_USE_CUDA=1 skips GPU initialization, and the only runtime cost is a single null pointer check per packet.

Again, it's a great discussion point. I'd appreciate your team's perspective, so let's keep this discussion going here.

4 participants