fix(udp): detect silent SO_RCVBUF clamping by kernel #424

Draft
k1832 wants to merge 3 commits into tier4:main from k1832:fix/detect-rcvbuf-clamping

@k1832 k1832 commented Mar 23, 2026

PR Type

  • Bug Fix

Related Links

  • Linux SO_RCVBUF documentation: man 7 socket

Description

setsockopt(SO_RCVBUF) silently clamps the buffer size to net.core.rmem_max without returning an error. On default Ubuntu installs, rmem_max is ~212 KB, so Nebula's configured 5.4 MB buffer request (udp_socket_receive_buffer_size_bytes: 5400000) gets capped to 212 KB (reported by getsockopt as 425984 bytes, since the kernel doubles the stored value). That is roughly 25x smaller than intended, with no warning or error.

This behavior is kernel-version-dependent:

| Scenario (5.4 MB req, 212 KB rmem_max) | Kernel 5.17 | Kernel 6.8 |
|---|---|---|
| setsockopt result | FAILS (existing handling catches it) | OK (silent clamp, no error) |
| This PR needed? | Redundant (defense-in-depth) | Yes (only detection method) |

On kernel 6.8 (Ubuntu 24.04), setsockopt always succeeds regardless of rmem_max, making getsockopt readback the only way to detect clamping.

This can cause packet drops on machines where rmem_max has not been raised (e.g. test machines, standalone Nebula setups, or fresh installs without the agnocast DDS config).
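Whether a given machine is at risk can be checked up front by comparing net.core.rmem_max with the requested size. The following is a minimal sketch, assuming Linux exposes the value at /proc/sys/net/core/rmem_max. The helper names `read_rmem_max` and `rmem_max_sufficient` are hypothetical and not part of Nebula, and "sufficient" is taken to mean rmem_max >= requested, matching the sysctl value suggested in this PR's error message.

```python
from pathlib import Path
from typing import Optional

# Assumed Linux-only location of the sysctl value.
RMEM_MAX_PATH = Path("/proc/sys/net/core/rmem_max")


def read_rmem_max(path: Path = RMEM_MAX_PATH) -> Optional[int]:
    """Return net.core.rmem_max in bytes, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def rmem_max_sufficient(requested: int, rmem_max: int) -> bool:
    """True if the request does not exceed rmem_max (hypothetical criterion)."""
    return requested <= rmem_max


if __name__ == "__main__":
    requested = 5_400_000  # buffer size from the PR description
    limit = read_rmem_max()
    if limit is None:
        print("could not read net.core.rmem_max")
    elif rmem_max_sufficient(requested, limit):
        print(f"rmem_max={limit} is enough for {requested} bytes")
    else:
        print(f"rmem_max={limit} is too small; try: "
              f"sudo sysctl -w net.core.rmem_max={requested}")
```

Returning None instead of raising keeps the check usable on systems where the /proc entry is missing or unreadable.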

Fix

After setsockopt(SO_RCVBUF), call getsockopt(SO_RCVBUF) to read back the actual value the kernel granted. If it is significantly less than requested (>4 KB difference), throw a SocketError with a clear message and the exact sysctl command to fix it:

SO_RCVBUF was clamped by the kernel: requested 5400000 bytes, got 425984 bytes.
Increase net.core.rmem_max: sudo sysctl -w net.core.rmem_max=5400000

The 4 KB threshold avoids false positives when requesting values near INT32_MAX, where the kernel's internal doubling of SO_RCVBUF causes minor rounding (e.g. requesting 2147483647, getting 2147483646).
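The readback check can be sketched in Python as follows. This is an illustration of the idea, not the C++ code in udp.hpp: the names `is_clamped` and `set_rcvbuf_checked` are hypothetical, and OSError stands in for Nebula's SocketError.

```python
import socket

TOLERANCE_BYTES = 4096  # the ">4 KB difference" threshold described above


def is_clamped(requested: int, actual: int,
               tolerance: int = TOLERANCE_BYTES) -> bool:
    """True if the granted size falls short of the request by more than tolerance."""
    return requested - actual > tolerance


def set_rcvbuf_checked(sock: socket.socket, requested: int) -> int:
    """Set SO_RCVBUF, read it back, and raise if the kernel clamped it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    actual = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    if is_clamped(requested, actual):
        # OSError is a stand-in for the SocketError thrown by the real fix.
        raise OSError(
            f"SO_RCVBUF was clamped by the kernel: requested {requested} bytes, "
            f"got {actual} bytes. Increase net.core.rmem_max: "
            f"sudo sysctl -w net.core.rmem_max={requested}"
        )
    return actual
```

A caller would create a UDP socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and pass it to `set_rcvbuf_checked` together with the desired size. The pure `is_clamped` helper is kept separate so that the threshold logic can be tested without touching kernel settings.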

Review Procedure

  1. Read the diff in udp.hpp:set_socket_buffer_size() (the fix) and test_udp.cpp (the test)
  2. Verify the logic: getsockopt returns the kernel's doubled value, which should be >= the requested value unless clamped
  3. Existing TestBufferResize test (requests rmem_max = INT32_MAX) passes with the 4 KB threshold

Remarks

  • TestBufferClampingDetected added: on systems with high rmem_max (like CI), it verifies 5.4 MB succeeds; on default systems (~212 KB rmem_max), it verifies SocketError is thrown
  • The codecov/patch/nebula_core_hw_interfaces check fails because CI has a high rmem_max, so the throw SocketError branch is never executed. This path can only be covered on a machine with default rmem_max (~212 KB).
  • The caller at hesai_hw_interface.cpp:231-238 already wraps set_socket_buffer_size() in a try/catch that converts SocketError to a runtime_error with a helpful message

Pre-Review Checklist for the PR Author

  • Assign PR to reviewer

Checklist for the PR Reviewer

  • Commits are properly organized and messages are according to the guideline
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.
  • codecov/patch/nebula_core_hw_interfaces: Expected to fail — the error-throwing branch requires rmem_max < 5.4 MB to execute, which is not the case in CI.

k1832 and others added 2 commits March 23, 2026 18:15
setsockopt(SO_RCVBUF) silently clamps the buffer size to
net.core.rmem_max without returning an error. On default Ubuntu
installs, rmem_max is ~212 KB, so Nebula's 5.4 MB request gets
capped to 212 KB (about 25x smaller) with no indication.

Add a getsockopt(SO_RCVBUF) check after setting. If the kernel
granted significantly less than requested (>4 KB difference),
throw a SocketError with the exact sysctl command to fix it.

The 4 KB threshold avoids false positives when requesting values
near INT32_MAX, where the kernel's internal doubling causes minor
rounding.

codecov bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.36%. Comparing base (baf4f92) to head (37e2980).

❌ Your patch check has failed because the patch coverage (62.50%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #424      +/-   ##
==========================================
+ Coverage   48.34%   48.36%   +0.01%     
==========================================
  Files         156      156              
  Lines       12996    13012      +16     
  Branches     6900     6912      +12     
==========================================
+ Hits         6283     6293      +10     
- Misses       5326     5327       +1     
- Partials     1387     1392       +5     
| Flag | Coverage Δ |
|---|---|
| nebula_core_hw_interfaces | 69.23% <62.50%> (?) |

Add TestBufferClampingDetected that validates the getsockopt readback
check added in the previous commit:

- If rmem_max >= 5.4 MB (properly configured): verifies the 5.4 MB
  buffer request succeeds without throwing.
- If rmem_max < 5.4 MB (default Ubuntu ~212 KB): verifies that
  SocketError is thrown with a helpful sysctl message.

Both branches test real behavior depending on the machine's kernel
configuration.
Collaborator

@mojomex mojomex left a comment

Which system are you on? On my kernel 5.17 machine I cannot get your exception to be thrown. Instead, setsockopt fails directly, causing e.g. the Hesai HW interface to throw an error instead.

Could you paste the output of the below script on your system?
test-rmem-max.sh

Mine is:

My Output
=== Testing rmem_max=100000 ===
net.core.rmem_max = 100000
[rmem_max=100000 buffer=rmem_max] expected=pass
[rmem_max=100000 buffer=rmem_max] matched: pass
[rmem_max=100000 buffer=2*rmem_max] expected=pass
[rmem_max=100000 buffer=2*rmem_max] matched: pass
[rmem_max=100000 buffer=2*rmem_max+1] expected=fail
[rmem_max=100000 buffer=2*rmem_max+1] unexpected result: expected=fail actual=pass
[rmem_max=100000 buffer=2*rmem_max+576] expected=fail
[rmem_max=100000 buffer=2*rmem_max+576] unexpected result: expected=fail actual=pass
[rmem_max=100000 buffer=2*rmem_max+5000] expected=fail
[hesai_ros_wrapper_node-1]   what():  Could not set socket receive buffer size to 205000. Try increasing net.core.rmem_max.
[rmem_max=100000 buffer=2*rmem_max+5000] matched: fail
=== Testing rmem_max=1000000 ===
net.core.rmem_max = 1000000
[rmem_max=1000000 buffer=rmem_max] expected=pass
[rmem_max=1000000 buffer=rmem_max] matched: pass
[rmem_max=1000000 buffer=2*rmem_max] expected=pass
[rmem_max=1000000 buffer=2*rmem_max] matched: pass
[rmem_max=1000000 buffer=2*rmem_max+1] expected=fail
[rmem_max=1000000 buffer=2*rmem_max+1] unexpected result: expected=fail actual=pass
[rmem_max=1000000 buffer=2*rmem_max+576] expected=fail
[rmem_max=1000000 buffer=2*rmem_max+576] unexpected result: expected=fail actual=pass
[rmem_max=1000000 buffer=2*rmem_max+5000] expected=fail
[hesai_ros_wrapper_node-1]   what():  Could not set socket receive buffer size to 2005000. Try increasing net.core.rmem_max.
[rmem_max=1000000 buffer=2*rmem_max+5000] matched: fail
=== Testing rmem_max=1000000000 ===
net.core.rmem_max = 1000000000
[rmem_max=1000000000 buffer=rmem_max] expected=pass
[rmem_max=1000000000 buffer=rmem_max] matched: pass
[rmem_max=1000000000 buffer=2*rmem_max] expected=pass
[rmem_max=1000000000 buffer=2*rmem_max] matched: pass
[rmem_max=1000000000 buffer=2*rmem_max+1] expected=fail
[rmem_max=1000000000 buffer=2*rmem_max+1] unexpected result: expected=fail actual=pass
[rmem_max=1000000000 buffer=2*rmem_max+576] expected=fail
[rmem_max=1000000000 buffer=2*rmem_max+576] unexpected result: expected=fail actual=pass
[rmem_max=1000000000 buffer=2*rmem_max+5000] expected=fail
[hesai_ros_wrapper_node-1]   what():  Could not set socket receive buffer size to 2000005000. Try increasing net.core.rmem_max.
[rmem_max=1000000000 buffer=2*rmem_max+5000] matched: fail

Author

k1832 commented Mar 27, 2026

Thanks for the detailed test script! I ran it on my kernel (6.8.0-79-generic) and got very different results from yours (5.17):

Full output (kernel 6.8.0-79-generic)
=== rmem_max=100000 ===
  [buf=rmem_max] req=100000  setsockopt=OK  actual=200000  clamped=False
  [buf=2x] req=200000  setsockopt=OK  actual=200000  clamped=False
  [buf=2x+1] req=200001  setsockopt=OK  actual=200000  clamped=True
  [buf=2x+5000] req=205000  setsockopt=OK  actual=200000  clamped=True
  [buf=5.4MB(nebula)] req=5400000  setsockopt=OK  actual=200000  clamped=True

=== rmem_max=212992 ===
  [buf=rmem_max] req=212992  setsockopt=OK  actual=425984  clamped=False
  [buf=2x] req=425984  setsockopt=OK  actual=425984  clamped=False
  [buf=2x+1] req=425985  setsockopt=OK  actual=425984  clamped=True
  [buf=2x+5000] req=430984  setsockopt=OK  actual=425984  clamped=True
  [buf=5.4MB(nebula)] req=5400000  setsockopt=OK  actual=425984  clamped=True

=== rmem_max=2700000 ===
  [buf=rmem_max] req=2700000  setsockopt=OK  actual=5400000  clamped=False
  [buf=2x] req=5400000  setsockopt=OK  actual=5400000  clamped=False
  [buf=2x+1] req=5400001  setsockopt=OK  actual=5400000  clamped=True
  [buf=2x+5000] req=5405000  setsockopt=OK  actual=5400000  clamped=True
  [buf=5.4MB(nebula)] req=5400000  setsockopt=OK  actual=5400000  clamped=False

=== rmem_max=5400000 ===
  [buf=rmem_max] req=5400000  setsockopt=OK  actual=10800000  clamped=False
  [buf=2x] req=10800000  setsockopt=OK  actual=10800000  clamped=False
  [buf=2x+1] req=10800001  setsockopt=OK  actual=10800000  clamped=True
  [buf=2x+5000] req=10805000  setsockopt=OK  actual=10800000  clamped=True
  [buf=5.4MB(nebula)] req=5400000  setsockopt=OK  actual=10800000  clamped=False

On kernel 6.8, setsockopt never fails: it always succeeds and silently clamps. Requesting 5.4 MB with rmem_max=212992 (~212 KB) returns OK with actual=425,984, which is about 12.7x smaller than requested, with no error.

So the behavior is kernel-version-dependent:

| Scenario (5.4 MB req, 212 KB rmem_max) | Kernel 5.17 (yours) | Kernel 6.8 (mine) |
|---|---|---|
| setsockopt result | FAILS | OK (silent clamp) |
| Existing error handling catches it? | Yes | No |
| This PR's getsockopt check catches it? | Yes (redundant) | Yes (only detection) |

The getsockopt readback is the only way to detect the clamping on kernel 6.8. The test script is at the bottom if you'd like to reproduce.
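The numbers in the kernel 6.8 output above are consistent with a simple model: the kernel caps the request at rmem_max and then doubles it. The sketch below encodes that model. It is an assumption inferred from the output and from man 7 socket, not a copy of kernel code; the function names are hypothetical, the INT_MAX // 2 cap is included only so the model also reproduces the 2147483647 -> 2147483646 example from the PR description, and the kernel's lower bound on buffer size is ignored.

```python
INT_MAX = 2**31 - 1


def modeled_rcvbuf(requested: int, rmem_max: int) -> int:
    """Value getsockopt(SO_RCVBUF) is expected to report under this model."""
    # Cap the request at rmem_max (and at INT_MAX // 2 so doubling fits an int).
    capped = min(requested, rmem_max, INT_MAX // 2)
    # The kernel stores, and reports, twice the capped value.
    return 2 * capped


def strictly_clamped(requested: int, rmem_max: int) -> bool:
    """Mirror of the script's `actual < buf` column (no tolerance applied)."""
    return modeled_rcvbuf(requested, rmem_max) < requested


if __name__ == "__main__":
    for rmem_max in (100_000, 212_992, 2_700_000, 5_400_000):
        actual = modeled_rcvbuf(5_400_000, rmem_max)
        print(f"rmem_max={rmem_max} req=5400000 actual={actual} "
              f"clamped={strictly_clamped(5_400_000, rmem_max)}")
```

Under this model a request is only reported as smaller than asked once it exceeds 2 * rmem_max, which is why the `buf=2x` rows show `clamped=False` and the `buf=2x+1` rows show `clamped=True`.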

Test script
#!/bin/bash
# sudo bash test-rmem-clamping.sh
set -euo pipefail
[[ $EUID -ne 0 ]] && echo "Run with sudo" && exit 1
ORIG=$(sysctl -n net.core.rmem_max)
trap "sysctl -w net.core.rmem_max=$ORIG > /dev/null; echo Restored rmem_max=$ORIG" EXIT

test_setsockopt() {
    python3 - "$2" "$3" <<'PY'
import socket, sys
buf, label = int(sys.argv[1]), sys.argv[2]
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
    actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print(f"  [{label}] req={buf}  setsockopt=OK  actual={actual}  clamped={actual < buf}")
except OSError as e:
    print(f"  [{label}] req={buf}  setsockopt=FAILED: {e}")
finally:
    s.close()
PY
}

for rmem in 100000 212992 2700000 5400000; do
    echo "=== rmem_max=$rmem ==="
    sysctl -w net.core.rmem_max=$rmem > /dev/null
    test_setsockopt $rmem $rmem "buf=rmem_max"
    test_setsockopt $rmem $((rmem*2)) "buf=2x"
    test_setsockopt $rmem $((rmem*2+1)) "buf=2x+1"
    test_setsockopt $rmem $((rmem*2+5000)) "buf=2x+5000"
    test_setsockopt $rmem 5400000 "buf=5.4MB(nebula)"
    echo
done

Collaborator

mojomex commented Mar 31, 2026

@k1832 Huh, indeed, with your script I was able to reproduce it on all my machines.
I've double checked the Nebula code but couldn't quite pinpoint why the results are different. In any case, I think your approach is right, let's definitely do it!

Collaborator

@mojomex mojomex left a comment

Logic looks good! Let's structure the unit tests similarly to how you structured your test script in your comment, i.e. test cases relative to rmem_max.

Comment on lines +107 to +109
constexpr size_t nebula_buf_size = 5400000;

if (rmem_max >= nebula_buf_size) {
Collaborator

Let's test the same cases on all systems, regardless of actual rmem_max:

  1. read rmem_max
  2. test set_socket_buffer_size(rmem_max) (should succeed)
  3. test set_socket_buffer_size(2*rmem_max) (should succeed)
  4. test set_socket_buffer_size(2*rmem_max+1) (should fail)
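The cases listed above can be generated relative to whatever rmem_max the test machine reports. The sketch below is in Python for illustration only; the real test lives in test_udp.cpp. `Case`, `cases_for`, and `passes_readback` are hypothetical names, and `passes_readback` assumes the "cap at rmem_max, then double" behaviour seen in the kernel 6.8 output rather than calling into a real socket.

```python
from typing import List, NamedTuple


class Case(NamedTuple):
    label: str
    requested: int
    should_succeed: bool


def cases_for(rmem_max: int) -> List[Case]:
    """Build the three review cases relative to the machine's rmem_max."""
    return [
        Case("rmem_max", rmem_max, True),
        Case("2*rmem_max", 2 * rmem_max, True),
        Case("2*rmem_max+1", 2 * rmem_max + 1, False),
    ]


def passes_readback(requested: int, rmem_max: int, tolerance: int = 0) -> bool:
    """Modeled readback check: accept if the shortfall is within tolerance."""
    actual = 2 * min(requested, rmem_max)  # assumed kernel behaviour
    return requested - actual <= tolerance


if __name__ == "__main__":
    rmem_max = 212_992  # example value; a real test would read it from the system
    for case in cases_for(rmem_max):
        ok = passes_readback(case.requested, rmem_max)
        print(f"{case.label}: requested={case.requested} ok={ok} "
              f"expected={case.should_succeed}")
```

With a strict comparison (tolerance 0) all three cases behave as listed. Under the same model, a 4096-byte tolerance would let the 2*rmem_max+1 request through, and the smallest request that is rejected becomes 2*rmem_max+4097, so the fourth case may need to account for the threshold.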

if (!rmem_max_maybe.has_value()) GTEST_SKIP() << rmem_max_maybe.error();
size_t rmem_max = rmem_max_maybe.value();

// Nebula's configured buffer size for Pandar128E4X
Collaborator

The UDP code is part of nebula_core and as such is vendor-independent. Let's remove the mention of Pandar128E4X here.
