
feat(hesai): add CUDA-accelerated point cloud decoder #421

Open
k1832 wants to merge 7 commits into tier4:main from k1832:feat/core-cuda-decode

Conversation

@k1832 commented Mar 19, 2026

PR Type

  • New Feature

Related Links

Description

Add a GPU-accelerated decode path for Hesai LiDAR sensors using CUDA. The feature is:

  • Compile-time opt-in: Build with -DBUILD_CUDA=ON. If the CUDA toolkit is not found, the build silently falls back to CPU-only.
  • Runtime opt-in: Set the NEBULA_USE_CUDA=1 environment variable. When unset, the existing CPU path is used with zero overhead.
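For context, a minimal sketch of how such a runtime environment-variable gate can look (the helper name is hypothetical; only the `NEBULA_USE_CUDA` variable comes from this PR):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper illustrating the runtime opt-in described above:
// the GPU decode path is taken only when NEBULA_USE_CUDA is set to "1".
inline bool cuda_requested_via_env()
{
  const char * value = std::getenv("NEBULA_USE_CUDA");
  return value != nullptr && std::strcmp(value, "1") == 0;
}
```

A check like this runs once at decoder construction, so the per-packet cost when disabled is effectively zero.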

What it does

  • Processes an entire scan in a single batched CUDA kernel launch (launch_decode_hesai_scan_batch)
  • Uses pre-computed angle lookup tables (azimuth/elevation) uploaded to GPU once at initialization
  • Supports calibration-based and correction-based angle correctors
  • Currently validated on OT128 (Pandar128E4X) sensor
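To illustrate the lookup-table idea from the list above: trigonometric values are precomputed once on the host and (in this PR) uploaded to the GPU at initialization, so the kernel avoids per-point trigonometry. The sketch below is illustrative only; names and table resolution are assumptions, not the PR's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative host-side LUT construction (names/resolution hypothetical).
// The PR's decoder uploads its azimuth/elevation tables once at init.
struct AngleLut
{
  std::vector<float> sin_table;
  std::vector<float> cos_table;
};

inline AngleLut build_angle_lut(std::size_t steps = 36000)  // 0.01-degree resolution
{
  AngleLut lut{std::vector<float>(steps), std::vector<float>(steps)};
  for (std::size_t i = 0; i < steps; ++i) {
    const double rad = (static_cast<double>(i) * 360.0 / steps) * M_PI / 180.0;
    lut.sin_table[i] = static_cast<float>(std::sin(rad));
    lut.cos_table[i] = static_cast<float>(std::cos(rad));
  }
  return lut;
}
```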

Files changed

| File | Change |
| --- | --- |
| hesai_cuda_kernels.cu | New CUDA kernel for batched point cloud decoding |
| hesai_cuda_decoder.hpp | HesaiScanDecoderCuda RAII class: GPU buffer management, angle LUT, device memory |
| hesai_decoder.hpp | Integration: GPU scan buffer, flush, result conversion |
| hesai_sensor.hpp | Expose max_scan_buffer_points() for GPU buffer sizing |
| angle_corrector_*.hpp | Expose angle LUT data for GPU upload |
| cuda_compat.hpp | NEBULA_HOST_DEVICE / NEBULA_DEVICE macros for host/device code sharing |
| nebula_hesai_decoders/CMakeLists.txt | CUDA library target, toolkit detection |
| nebula_hesai/CMakeLists.txt | CUDA decoder test target |
| hesai_cuda_decoder_test.cpp | 5 GPU-vs-CPU equivalence tests |
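To illustrate the cuda_compat.hpp entry: the usual pattern is a macro that expands to CUDA qualifiers under nvcc and to nothing under a plain host compiler, so shared math helpers build in both contexts. A sketch (the actual macro definitions in the PR may differ in detail):

```cpp
#include <cassert>
#include <cmath>

// Sketch of the host/device annotation pattern behind NEBULA_HOST_DEVICE.
// Under nvcc (__CUDACC__) the macro adds CUDA qualifiers; under a plain
// host compiler it expands to nothing.
#ifdef __CUDACC__
#define NEBULA_HOST_DEVICE __host__ __device__
#define NEBULA_DEVICE __device__
#else
#define NEBULA_HOST_DEVICE
#define NEBULA_DEVICE
#endif

// One of the shared helpers mentioned in the review (deg2rad);
// on the host this compiles as an ordinary inline function.
NEBULA_HOST_DEVICE inline float deg2rad(float deg)
{
  return deg * (static_cast<float>(M_PI) / 180.0f);
}
```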

Known limitations

  • GPU kernel does not set return_type field (always return_id)

Performance

Measured using profiling_runner.bash (3 runs × 20s each, Pandar128E4X ~72k pts/scan, RTX 5080, CUDA 12.4):

./scripts/profiling_runner.bash cpu-baseline \
    --sensor-model Pandar128E4X --rosbag-path <ot128_rosbag> -n 3 -t 20

NEBULA_USE_CUDA=1 ./scripts/profiling_runner.bash gpu-cuda \
    --sensor-model Pandar128E4X --rosbag-path <ot128_rosbag> -n 3 -t 20

./scripts/plot_times.py cpu-baseline gpu-cuda --metrics decode
| Configuration | Median | P5 | P95 |
| --- | --- | --- | --- |
| CPU Baseline | 6.80 ms | 6.62 ms | 7.28 ms |
| GPU (this PR) | 2.48 ms | 2.41 ms | 12.77 ms |

The GPU results show a bimodal distribution: ~57% of scans decode in ~2.4 ms (~2.8x faster than CPU), while the remaining ~43% take ~10.5 ms. The slow path is dominated by the bulk D2H copy in process_gpu_results(), which copies the entire sparse output buffer back to host memory. The follow-up PR (zero-copy via cuda_blackboard) eliminates this D2H copy by keeping points on the GPU, which should remove the slow-path bottleneck.
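As a reading aid for the percentile figures: with a bimodal mixture of ~57% fast and ~43% slow scans, the median lands in the fast mode while the P95 lands in the slow mode, which matches the table. A toy nearest-rank percentile sketch (the sample sizes below are synthetic, modeled on the percentages above):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy nearest-rank percentile over a latency sample (synthetic data,
// not the PR's measurement code).
inline double percentile(std::vector<double> samples, double p)
{
  std::sort(samples.begin(), samples.end());
  const auto idx = static_cast<std::size_t>(p * (samples.size() - 1));
  return samples[idx];
}
```

For example, over 57 samples at 2.4 ms and 43 samples at 10.5 ms, the 0.50 percentile is 2.4 ms while the 0.95 percentile is 10.5 ms.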

Review Procedure

Build (with CUDA)

colcon build --packages-up-to nebula_hesai \
  --cmake-args -DBUILD_CUDA=ON -DBUILD_TESTING=ON

Requires NVIDIA CUDA Toolkit (tested with CUDA 12.x). If the toolkit is not found, the build succeeds but CUDA support is silently disabled.

Running with CUDA enabled

The GPU decode path is gated by a runtime environment variable:

# Enable GPU decoding
export NEBULA_USE_CUDA=1

# Launch the driver node as usual — it will log "CUDA decoder initialized successfully" on startup
ros2 launch nebula nebula_launch.py sensor_model:=Pandar128E4X ...

# To disable (default), unset the variable
unset NEBULA_USE_CUDA

Test

# Run all tests (132 existing + 5 new CUDA tests)
source install/setup.bash
colcon test --packages-select nebula_hesai --ctest-args -V

# Or run CUDA tests only
./build/nebula_hesai/hesai_cuda_decoder_test_main

Test results

[==========] Running 5 tests from 1 test suite.
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence
[       OK ] HesaiCudaDecoderTest.OT128_GpuVsCpuEquivalence (21778 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty
[       OK ] HesaiCudaDecoderTest.OT128_GpuOutputNonEmpty (388 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_GpuFieldValidity
[       OK ] HesaiCudaDecoderTest.OT128_GpuFieldValidity (378 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts
[       OK ] HesaiCudaDecoderTest.OT128_BoundaryScanPointCounts (369 ms)
[ RUN      ] HesaiCudaDecoderTest.OT128_IntensityExactMatch
[       OK ] HesaiCudaDecoderTest.OT128_IntensityExactMatch (17217 ms)
[  PASSED  ] 5 tests.

# Full suite
Summary: 137 tests, 0 errors, 0 failures, 0 skipped

Remarks

  • When CUDA is not compiled in (BUILD_CUDA=OFF), the 5 CUDA tests are compiled but skip at runtime via GTEST_SKIP(), so they do not break CPU-only CI.
  • GPU uses CPU-authoritative scan cutting via scan_state flags. Point counts are identical between CPU and GPU (verified by kMaxPointCountDiff = 0 in tests). Coordinate differences are sub-millimetre (< 0.1 mm) due to floating-point hardware rounding.
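The coordinate comparison in the second bullet amounts to a per-point check like the following sketch (the Point struct and constant name here are illustrative; the PR's actual test constants are kMaxPointCountDiff and kXyzTolerance):

```cpp
#include <cassert>
#include <cmath>

// Sketch of the per-point equivalence criterion described above:
// coordinates must match within 0.1 mm. Names are illustrative.
struct Point
{
  float x, y, z;
};

constexpr float kXyzToleranceMeters = 0.0001f;  // 0.1 mm

inline bool points_match(const Point & cpu, const Point & gpu)
{
  return std::fabs(cpu.x - gpu.x) <= kXyzToleranceMeters &&
         std::fabs(cpu.y - gpu.y) <= kXyzToleranceMeters &&
         std::fabs(cpu.z - gpu.z) <= kXyzToleranceMeters;
}
```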

Pre-Review Checklist for the PR Author

PR Author should check the checkboxes below when creating the PR.

  • Assign PR to reviewer

Checklist for the PR Reviewer

Reviewers should check the checkboxes below before approval.

  • Commits are properly organized and messages are according to the guideline
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

PR Author should check the checkboxes below before merging.

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.

@k1832 force-pushed the feat/core-cuda-decode branch from 580316f to cd2b0e8 on March 23, 2026 at 01:32
k1832 added 2 commits March 23, 2026 12:21
Add a GPU decode path for Hesai LiDAR sensors, gated behind compile-time
BUILD_CUDA=ON and runtime NEBULA_USE_CUDA=1 environment variable.

The implementation includes:
- CUDA kernel for batched point cloud decoding (hesai_cuda_kernels.cu)
- Angle LUT upload and GPU scan buffer management in hesai_decoder.hpp
- GPU-vs-CPU equivalence tests for OT128 (Pandar128E4X) sensor

The GPU path processes an entire scan in a single kernel launch, using
pre-computed angle lookup tables and a sparse output buffer. When CUDA
is not available or NEBULA_USE_CUDA is unset, the existing CPU path is
used with zero overhead.

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
- Copyright year 2024 -> 2026 for new files
- Replace deprecated find_package(CUDA) with find_package(CUDAToolkit)
- Remove --expt-relaxed-constexpr flag (not needed)
- Remove unused per-packet kernel and launcher (dead code)
- Batch launcher returns bool; caller logs via NEBULA_LOG_STREAM
- Reorder CudaNebulaPoint fields for better memory packing
- Remove redundant is_multi_frame member; use n_frames > 1
- Make HesaiCudaDecoder destructor virtual
- Add int32_t range guarantee comment in angle corrector

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
@k1832 force-pushed the feat/core-cuda-decode branch from cd2b0e8 to 508175b on March 23, 2026 at 03:21
@codecov

codecov bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.37%. Comparing base (baf4f92) to head (305c097).
⚠️ Report is 9 commits behind head on main.

❌ Your patch check has failed because the patch coverage (83.33%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #421      +/-   ##
==========================================
+ Coverage   48.35%   48.37%   +0.02%     
==========================================
  Files         156      157       +1     
  Lines       12996    13007      +11     
  Branches     6900     6906       +6     
==========================================
+ Hits         6284     6292       +8     
- Misses       5325     5327       +2     
- Partials     1387     1388       +1     
| Flag | Coverage | Δ |
| --- | --- | --- |
| nebula_core_common | 55.27% <84.61%> | (?) |
| nebula_core_decoders | 55.27% <84.61%> | (?) |
| nebula_hesai | 55.27% <84.61%> | (?) |
| nebula_hesai_decoders | 55.27% <84.61%> | (?) |

Flags with carried forward coverage won't be shown.


Replace .points access with direct iteration over PointCloud<T>
(which now extends std::vector<T> instead of pcl::PointCloud).

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
@k1832 force-pushed the feat/core-cuda-decode branch from 62ab94c to 09658fe on March 23, 2026 at 03:55
pre-commit-ci bot and others added 2 commits March 23, 2026 03:56
- Add missing #include <string> in hesai_decoder.hpp
- Add missing #include <limits> in hesai_cuda_decoder_test.cpp
- Fix readability/braces warning for ifdef-guarded else block

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
@k1832 k1832 marked this pull request as ready for review March 23, 2026 04:05
Collaborator

@mojomex mojomex left a comment


Thank you for the awesome PR!

I think the core of it is already good, and it's quite readable and cleanly implemented. I've left some comments here and there about what would make our maintenance life easier moving forward.

About the scan cutting differences, I think you could fairly easily integrate the CPU scan cutting state (which is still available, even when CUDA is enabled in your implementation) to make CUDA behave exactly like the CPU version.

I haven't yet reviewed every detail completely, but there's enough feedback for now I think.

Also, I'd really appreciate if you could post some performance metrics, e.g. using Nebula's benchmark scripts.

@k1832
Author

k1832 commented Apr 2, 2026

Thank you for the review! All inline comments have been addressed.

Re performance: I added a comparison to the PR description (CPU 6.08ms vs GPU 0.31ms, ~20x speedup on RTX 5080). The numbers were collected using a non-standard profiling setup. Let me look into reproducing them with profiling_runner.bash / plot_times.py and I will follow up with that.

@drwnz drwnz requested a review from manato April 2, 2026 07:41
@mojomex
Collaborator

mojomex commented Apr 2, 2026

@k1832 Damn! 20x is insane. Amazing work 👏

Collaborator

@mojomex mojomex left a comment


@k1832 Thank you for addressing the changes!
It seems like they're not pushed yet though, please check.

- Tighten test tolerances: kMaxPointCountDiff=0, kMinMatchRatio=1.0,
  kXyzTolerance=0.1mm (GPU uses CPU-authoritative scan_state)
- Create cuda_compat.hpp with NEBULA_HOST_DEVICE macros; annotate
  deg2rad, rad2deg, angle_is_between, normalize_angle
- Remove dead get_cuda_raw_angles() and get_frame_angle_info()
- Refactor GPU resources into HesaiScanDecoderCuda RAII class
  (constructor allocates, destructor frees, no manual cleanup)
- Remove CUDA_KERNELS_EXIST file-existence check from CMakeLists
- Simplify HesaiDecoder: single unique_ptr replaces ~100 lines of
  scattered CUDA members and manual destructor
@k1832
Author

k1832 commented Apr 3, 2026

Ran the benchmark using profiling_runner.bash as suggested (3 runs × 20s, Pandar128E4X on RTX 5080).

| Configuration | Median | P5 | P95 |
| --- | --- | --- | --- |
| CPU Baseline | 6.80 ms | 6.62 ms | 7.28 ms |
| GPU (this PR) | 2.48 ms | 2.41 ms | 12.77 ms |

The GPU results are bimodal: ~57% of scans at ~2.4 ms, ~43% at ~10.5 ms. The slow scans are bottlenecked by the bulk D2H sparse buffer copy in process_gpu_results().

Sorry for the confusion with the earlier 20x number — that was measured with CUDA events around only the H2D + kernel + count readback, which excluded the packet staging overhead and the bulk D2H result copy. That was not an apples-to-apples comparison with the CPU path. The honest end-to-end comparison via profiling_runner.bash shows ~2.8x on the fast path (6.8 ms → 2.4 ms).

The follow-up PR (zero-copy via cuda_blackboard) eliminates the D2H copy by keeping points on the GPU, which should remove the slow-path bottleneck.

@k1832
Author

k1832 commented Apr 3, 2026

@mojomex The PR description is updated too with the performance measurement result. Please take a look again 🙏

Collaborator

@manato manato left a comment


@k1832
Thank you very much for the PR!

Overall, it is nicely looking and well organized!
I left some comments that I noticed. I'd appreciate it if you could consider them 🙏

In terms of speed-up, I guess the measurement results using a rosbag and the actual sensor would differ slightly: with the actual sensor, decoding and receiving are pipelined, while with a rosbag a whole scan's worth of packets (i.e., accumulated packets) is processed at once.
By any chance, do you perhaps have any measurement data using the actual sensors?

Collaborator

@mojomex mojomex left a comment


Getting closer! I left some CUDA performance and architecture comments, please check.


std::array<DecodeFrame, 2> frame_buffers_{initialize_frame(), initialize_frame()};

#ifdef NEBULA_CUDA_ENABLED
Collaborator


I'd still like to separate out all the CUDA stuff from this file.

Logically, we could split this as follows:

  • HesaiDecoder handles:
    • packet bytes->struct conversion
    • functional safety
    • packet loss diagnostics
    • scan cutting
    • performance counters/timers
  • HesaiDecoderImplCpu:
    • unpack(const packet&)
    • accumulation in std::array<DecodeFrame, 2>
  • HesaiDecoderImplCuda:
    • unpack(const packet&)
    • accumulation and HtoD, DtoH

Or in other words, have the unpack function + impl-specific state/buffers in two different classes and dependency-inject one of them into HesaiDecoder.
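The proposed split could be sketched roughly as follows. This is a sketch of the suggestion, not the PR's current code; all names besides HesaiDecoder are illustrative:

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Sketch of the suggested architecture: the decoder keeps shared logic
// (scan cutting, diagnostics, counters) and dependency-injects one of
// two unpack implementations. All names here are illustrative.
struct Packet
{
  // raw packet bytes elided
};

class DecoderImpl
{
public:
  virtual ~DecoderImpl() = default;
  virtual void unpack(const Packet & packet) = 0;
};

class DecoderImplCpu : public DecoderImpl
{
public:
  int unpacked_packets = 0;
  void unpack(const Packet &) override { ++unpacked_packets; }  // decode + accumulate on CPU
};

class HesaiDecoderSketch
{
public:
  explicit HesaiDecoderSketch(std::unique_ptr<DecoderImpl> impl) : impl_(std::move(impl)) {}

  // Shared per-packet logic (diagnostics, scan cutting) would live here,
  // around the delegated unpack call.
  void on_packet(const Packet & p) { impl_->unpack(p); }

private:
  std::unique_ptr<DecoderImpl> impl_;
};
```

A CUDA-backed implementation would provide the same `unpack` interface but stage packets for batched H2D/D2H transfers instead.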

Author


Agreed this would be cleaner. This is a larger refactor — will address in a follow-up.

@mojomex
Collaborator

mojomex commented Apr 3, 2026

(Screenshot: nsys timeline, captured 2026-04-04)

Here's my nsys run as well (sorry @manato for the collision 😭). As @manato and I commented, the cudaMemcpys are on the default stream, and indeed the DtoH copies in the decode phase (not just init phase) are on the default stream, causing interference with other workloads.

@drwnz
Collaborator

drwnz commented Apr 6, 2026

Without having looked at the code in detail yet, my main concern is about the actual latency with a real sensor.
Because testing on rosbags means you have all the packets at once and can therefore parallelize decoding of a whole scan, the speedup you observed is expected. However, with a real sensor, the packets arrive one by one and are processed as they arrive. Therefore, waiting for a batch or a whole scan of packets may actually increase the pipeline latency in some conditions. Batching the packet processing could result in quite complex sensor-dependent tuning.
@k1832 was this considered in your implementation?

@mojomex
Collaborator

mojomex commented Apr 6, 2026

@drwnz To track this, we should probably do what autoware_pointcloud_preprocessor does and publish a pipeline_latency after we published our point clouds (pipeline_latency = publish_time - pointcloud_message_header_stamp).

What do you think?
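The suggested metric is straightforward to compute at publish time; a sketch with std::chrono standing in for ROS time types (the function name is illustrative):

```cpp
#include <cassert>
#include <chrono>

// Sketch of the pipeline_latency metric suggested above:
// publish_time minus the point cloud message's header stamp, in ms.
// std::chrono is used here in place of ROS time types for illustration.
using Clock = std::chrono::steady_clock;

inline double pipeline_latency_ms(Clock::time_point header_stamp, Clock::time_point publish_time)
{
  return std::chrono::duration<double, std::milli>(publish_time - header_stamp).count();
}
```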

- Use cudaMemcpyAsync + dedicated stream everywhere (angle LUT upload,
  count readback, result D2H copy) to avoid blocking default stream
- Simplify dual-return filter return for n_returns==2 path
- Remove unused #include <tuple> from angle_corrector_correction_based
- Remove unnecessary target_include_directories for CUDAToolkit
- Save/restore NEBULA_USE_CUDA env in test instead of unconditional unset
- Remove redundant comment about unique_ptr cleanup
@k1832
Author

k1832 commented Apr 7, 2026

@drwnz (CC: @mojomex )

my main concern is about the actual latency with a real sensor.

Good point. Here's my analysis of the latency difference:

If my understanding is correct, both CPU and GPU paths publish the full scan's point cloud at once when the scan boundary is detected, not per-packet. The difference is how much work remains at that moment:

  • CPU: points decoded incrementally as packets arrived. At boundary, just publish. Near-instant.
  • GPU: packets only staged to host memory. At boundary, full flush (H2D + kernel + D2H) before publishing.

So the GPU path adds flush latency at the scan boundary. The profiling results show this flush takes 2.4ms on some scans and 10.5ms on others (bimodal pattern), where the slow path is dominated by the bulk D2H sparse buffer copy. This bimodal pattern is likely an artifact of the short test rosbag being looped. The follow-up PR (zero-copy via cuda_blackboard) eliminates this D2H copy entirely, which should remove the slow path and make the distribution uniform.

The primary benefit of PR1 is CPU offload during the scan period. This is fully opt-in — build with -DBUILD_CUDA=OFF and no CUDA code is compiled at all. Even with CUDA compiled in, omitting export NEBULA_USE_CUDA=1 skips GPU initialization, and the only runtime cost is a single null pointer check per packet.

Again, it's a great discussion point. I'd appreciate your team's perspective, so let's keep this discussion going here.

4 participants