feat(hesai): add CUDA-accelerated point cloud decoder #421
k1832 wants to merge 7 commits into tier4:main from
Conversation
Force-pushed 580316f to cd2b0e8
Add a GPU decode path for Hesai LiDAR sensors, gated behind the compile-time BUILD_CUDA=ON option and the runtime NEBULA_USE_CUDA=1 environment variable. The implementation includes:
- CUDA kernel for batched point cloud decoding (hesai_cuda_kernels.cu)
- Angle LUT upload and GPU scan buffer management in hesai_decoder.hpp
- GPU-vs-CPU equivalence tests for the OT128 (Pandar128E4X) sensor

The GPU path processes an entire scan in a single kernel launch, using pre-computed angle lookup tables and a sparse output buffer. When CUDA is not available or NEBULA_USE_CUDA is unset, the existing CPU path is used with zero overhead.

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
- Copyright year 2024 -> 2026 for new files
- Replace deprecated find_package(CUDA) with find_package(CUDAToolkit)
- Remove --expt-relaxed-constexpr flag (not needed)
- Remove unused per-packet kernel and launcher (dead code)
- Batch launcher returns bool; caller logs via NEBULA_LOG_STREAM
- Reorder CudaNebulaPoint fields for better memory packing
- Remove redundant is_multi_frame member; use n_frames > 1
- Make HesaiCudaDecoder destructor virtual
- Add int32_t range guarantee comment in angle corrector

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Force-pushed cd2b0e8 to 508175b
Codecov Report ❌

Your patch check has failed because the patch coverage (83.33%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

@@ Coverage Diff @@
## main #421 +/- ##
==========================================
+ Coverage 48.35% 48.37% +0.02%
==========================================
Files 156 157 +1
Lines 12996 13007 +11
Branches 6900 6906 +6
==========================================
+ Hits 6284 6292 +8
- Misses 5325 5327 +2
- Partials 1387 1388 +1
Replace .points access with direct iteration over PointCloud<T> (which now extends std::vector<T> instead of pcl::PointCloud). Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Force-pushed 62ab94c to 09658fe
- Add missing #include <string> in hesai_decoder.hpp
- Add missing #include <limits> in hesai_cuda_decoder_test.cpp
- Fix readability/braces warning for ifdef-guarded else block

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
mojomex left a comment
Thank you for the awesome PR!
I think the core of it is already good, and it's quite readable and cleanly implemented. I've left some comments here and there about what would make our maintenance life easier moving forward.
About the scan cutting differences, I think you could fairly easily integrate the CPU scan cutting state (which is still available, even when CUDA is enabled in your implementation) to make CUDA behave exactly like the CPU version.
I haven't yet reviewed every detail completely, but there's enough feedback for now I think.
Also, I'd really appreciate if you could post some performance metrics, e.g. using Nebula's benchmark scripts.
Resolved review threads:
- src/nebula_hesai/nebula_hesai/tests/hesai_cuda_decoder_test.cpp (×2)
- src/nebula_hesai/nebula_hesai_decoders/src/cuda/hesai_cuda_kernels.cu (×2)
- src/nebula_hesai/nebula_hesai_decoders/include/nebula_hesai_decoders/decoders/hesai_decoder.hpp (×4)
- ..._hesai_decoders/include/nebula_hesai_decoders/decoders/angle_corrector_calibration_based.hpp
Thank you for the review! All inline comments have been addressed.

Re performance: I added a comparison to the PR description (CPU 6.08 ms vs GPU 0.31 ms, ~20x speedup on RTX 5080). The numbers were collected using a non-standard profiling setup. Let me look into reproducing them with
@k1832 Damn! 20x is insane. Amazing work 👏
- Tighten test tolerances: kMaxPointCountDiff=0, kMinMatchRatio=1.0, kXyzTolerance=0.1mm (GPU uses CPU-authoritative scan_state)
- Create cuda_compat.hpp with NEBULA_HOST_DEVICE macros; annotate deg2rad, rad2deg, angle_is_between, normalize_angle
- Remove dead get_cuda_raw_angles() and get_frame_angle_info()
- Refactor GPU resources into HesaiScanDecoderCuda RAII class (constructor allocates, destructor frees, no manual cleanup)
- Remove CUDA_KERNELS_EXIST file-existence check from CMakeLists
- Simplify HesaiDecoder: single unique_ptr replaces ~100 lines of scattered CUDA members and manual destructor
Ran the benchmark using
The GPU results are bimodal: ~57% of scans at ~2.4 ms, ~43% at ~10.5 ms. The slow scans are bottlenecked by the bulk D2H sparse buffer copy in

Sorry for the confusion with the earlier 20x number — that was measured with CUDA events around only the H2D + kernel + count readback, which excluded the packet staging overhead and the bulk D2H result copy. That was not an apples-to-apples comparison with the CPU path. The honest end-to-end comparison via

The follow-up PR (zero-copy via cuda_blackboard) eliminates the D2H copy by keeping points on the GPU, which should remove the slow-path bottleneck.
@mojomex The PR description is updated too with the performance measurement result. Please take a look again 🙏 |
manato left a comment
@k1832
Thank you very much for the PR!
Overall, it looks nice and is well organized!
I left some comments that I noticed. I'd appreciate it if you could consider them 🙏
In terms of speed-up, I suspect the measurement results using a rosbag and using the actual sensor would differ slightly: with the actual sensor, receiving and decoding are pipelined, while with a rosbag a whole scan's worth of accumulated packets is processed at once.
By any chance, do you perhaps have any measurement data using the actual sensors?
Resolved review threads:
- ...nebula_hesai/nebula_hesai_decoders/include/nebula_hesai_decoders/cuda/hesai_cuda_decoder.hpp (×4)
- src/nebula_hesai/nebula_hesai_decoders/include/nebula_hesai_decoders/decoders/hesai_decoder.hpp
- src/nebula_hesai/nebula_hesai_decoders/src/cuda/hesai_cuda_kernels.cu
mojomex left a comment
Getting closer! I left some CUDA performance and architecture comments, please check.
Commented code context:

std::array<DecodeFrame, 2> frame_buffers_{initialize_frame(), initialize_frame()};
#ifdef NEBULA_CUDA_ENABLED
I'd still like to separate out all the CUDA stuff from this file.
Logically, we could split this as follows:
- HesaiDecoder handles:
  - packet bytes -> struct conversion
  - functional safety
  - packet loss diagnostics
  - scan cutting
  - performance counters/timers
- HesaiDecoderImplCpu:
  - unpack(const packet&)
  - accumulation in std::array<DecodeFrame, 2>
- HesaiDecoderImplCuda:
  - unpack(const packet&)
  - accumulation and HtoD, DtoH
Or in other words, have the unpack function + impl-specific state/buffers in two different classes and dependency-inject one of them into HesaiDecoder.
Agreed this would be cleaner. This is a larger refactor — will address in a follow-up.
Resolved review threads:
- src/nebula_hesai/nebula_hesai_decoders/include/nebula_hesai_decoders/decoders/hesai_decoder.hpp
- ...a_hesai_decoders/include/nebula_hesai_decoders/decoders/angle_corrector_correction_based.hpp
- src/nebula_hesai/nebula_hesai/tests/hesai_cuda_decoder_test.cpp
- ...nebula_hesai/nebula_hesai_decoders/include/nebula_hesai_decoders/cuda/hesai_cuda_decoder.hpp
Without having looked at the code in detail yet, my main concern is about the actual latency with a real sensor.

@drwnz To track this, we should probably do what
What do you think?
- Use cudaMemcpyAsync + dedicated stream everywhere (angle LUT upload, count readback, result D2H copy) to avoid blocking the default stream
- Simplify dual-return filter return for the n_returns==2 path
- Remove unused #include <tuple> from angle_corrector_correction_based
- Remove unnecessary target_include_directories for CUDAToolkit
- Save/restore NEBULA_USE_CUDA env in test instead of unconditional unset
- Remove redundant comment about unique_ptr cleanup
Good point. Here's my analysis of the latency difference.

If my understanding is correct, both CPU and GPU paths publish the full scan's point cloud at once when the scan boundary is detected, not per-packet. The difference is how much work remains at that moment:
So the GPU path adds flush latency at the scan boundary. The profiling results show this flush takes 2.4 ms on some scans and 10.5 ms on others (bimodal pattern), where the slow path is dominated by the bulk D2H sparse buffer copy.

This bimodal pattern is likely an artifact of the short test rosbag being looped. The follow-up PR (zero-copy via cuda_blackboard) eliminates this D2H copy entirely, which should remove the slow path and make the distribution uniform.

The primary benefit of PR1 is CPU offload during the scan period. This is fully opt-in — build with

Again, it's a great discussion point. I'd appreciate your team's perspective; let's keep this discussion going here.

PR Type
Related Links
Description
Add a GPU-accelerated decode path for Hesai LiDAR sensors using CUDA. The feature is:
- Opt-in at build time via -DBUILD_CUDA=ON. When the CUDA toolkit is not found, the build silently falls back to CPU-only.
- Opt-in at runtime via the NEBULA_USE_CUDA=1 environment variable. When unset, the existing CPU path is used with zero overhead.

What it does
- Processes an entire scan in a single batched kernel launch (launch_decode_hesai_scan_batch)

Files changed
- hesai_cuda_kernels.cu
- hesai_cuda_decoder.hpp: HesaiScanDecoderCuda RAII class for GPU buffer management, angle LUT, device memory
- hesai_decoder.hpp
- hesai_sensor.hpp: max_scan_buffer_points() for GPU buffer sizing
- angle_corrector_*.hpp
- cuda_compat.hpp: NEBULA_HOST_DEVICE/NEBULA_DEVICE macros for host/device code sharing
- nebula_hesai_decoders/CMakeLists.txt
- nebula_hesai/CMakeLists.txt
- hesai_cuda_decoder_test.cpp

Known limitations
- return_type field (always return_id)

Performance
Measured using profiling_runner.bash (3 runs × 20 s each, Pandar128E4X ~72k pts/scan, RTX 5080, CUDA 12.4):

./scripts/profiling_runner.bash cpu-baseline \
  --sensor-model Pandar128E4X --rosbag-path <ot128_rosbag> -n 3 -t 20
NEBULA_USE_CUDA=1 ./scripts/profiling_runner.bash gpu-cuda \
  --sensor-model Pandar128E4X --rosbag-path <ot128_rosbag> -n 3 -t 20
./scripts/plot_times.py cpu-baseline gpu-cuda --metrics decode

The GPU results show a bimodal distribution: ~57% of scans decode in ~2.4 ms (~2.8x faster than CPU), while the remaining ~43% take ~10.5 ms. The slow path is dominated by the bulk D2H copy in process_gpu_results(), which copies the entire sparse output buffer back to host memory. The follow-up PR (zero-copy via cuda_blackboard) eliminates this D2H copy by keeping points on the GPU, which should remove the slow-path bottleneck.

Review Procedure
Build (with CUDA)
Requires NVIDIA CUDA Toolkit (tested with CUDA 12.x). If the toolkit is not found, the build succeeds but CUDA support is silently disabled.
Running with CUDA enabled
The GPU decode path is gated by a runtime environment variable:
Test
Test results
Remarks
- When CUDA is disabled (BUILD_CUDA=OFF), the 5 CUDA tests are compiled but skip at runtime via GTEST_SKIP(), so they do not break CPU-only CI.
- The GPU path uses the CPU-authoritative scan_state flags. Point counts are identical between CPU and GPU (verified by kMaxPointCountDiff = 0 in tests). Coordinate differences are sub-millimetre (< 0.1 mm) due to floating-point hardware rounding.
PR Author should check the checkboxes below when creating the PR.
Checklist for the PR Reviewer
Reviewers should check the checkboxes below before approval.
Post-Review Checklist for the PR Author
PR Author should check the checkboxes below before merging.
CI Checks