fix(udp): detect silent SO_RCVBUF clamping by kernel #424
k1832 wants to merge 3 commits into tier4:main from
setsockopt(SO_RCVBUF) silently clamps the buffer size to net.core.rmem_max without returning an error. On default Ubuntu installs, rmem_max is ~212 KB, so Nebula's 5.4 MB request gets capped to 212 KB (roughly 25x smaller) with no indication. Add a getsockopt(SO_RCVBUF) check after setting. If the kernel granted significantly less than requested (>4 KB difference), throw a SocketError with the exact sysctl command to fix it. The 4 KB threshold avoids false positives when requesting values near INT32_MAX, where the kernel's internal doubling causes minor rounding.
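The readback check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in udp.hpp: `set_rcvbuf_checked`, `is_clamped`, and `k_clamp_tolerance_bytes` are hypothetical names, `std::runtime_error` stands in for the PR's `SocketError`, and the wording of the error message is assumed.

```cpp
#include <sys/socket.h>

#include <cassert>
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical constant mirroring the 4 KB threshold from the commit message.
constexpr std::size_t k_clamp_tolerance_bytes = 4096;

// True if the kernel granted significantly less than was requested.
inline bool is_clamped(std::size_t requested, std::size_t granted)
{
  return granted + k_clamp_tolerance_bytes < requested;
}

// Hypothetical stand-in for set_socket_buffer_size(): set SO_RCVBUF, then read
// it back. Throws std::runtime_error where the PR throws SocketError.
inline std::size_t set_rcvbuf_checked(int fd, std::size_t requested)
{
  assert(requested <= static_cast<std::size_t>(INT_MAX));
  int value = static_cast<int>(requested);
  if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &value, sizeof(value)) != 0) {
    throw std::runtime_error("setsockopt(SO_RCVBUF) failed");
  }

  // Read back what the kernel actually granted.
  int granted = 0;
  socklen_t len = sizeof(granted);
  if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0) {
    throw std::runtime_error("getsockopt(SO_RCVBUF) failed");
  }

  const std::size_t granted_bytes = static_cast<std::size_t>(granted);
  if (is_clamped(requested, granted_bytes)) {
    // Assumed message format; the PR's exact wording is not shown here.
    throw std::runtime_error(
      "Requested receive buffer of " + std::to_string(requested) + " bytes, kernel granted " +
      std::to_string(granted_bytes) +
      ". Try: sudo sysctl -w net.core.rmem_max=" + std::to_string(requested));
  }
  return granted_bytes;
}
```

Keeping the comparison in a separate pure function makes the threshold testable without a socket: `is_clamped(2147483647, 2147483646)` is false, so the INT32_MAX rounding case does not trip the check, while a 5.4 MB request answered with a few hundred KB does.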
Codecov Report
❌ Your patch check has failed because the patch coverage (62.50%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #424 +/- ##
==========================================
+ Coverage 48.34% 48.36% +0.01%
==========================================
Files 156 156
Lines 12996 13012 +16
Branches 6900 6912 +12
==========================================
+ Hits 6283 6293 +10
- Misses 5326 5327 +1
- Partials 1387 1392 +5
Add TestBufferClampingDetected that validates the getsockopt readback check added in the previous commit:
- If rmem_max >= 5.4 MB (properly configured): verifies the 5.4 MB buffer request succeeds without throwing.
- If rmem_max < 5.4 MB (default Ubuntu ~212 KB): verifies that SocketError is thrown with a helpful sysctl message.
Both branches test real behavior depending on the machine's kernel configuration.
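Both branches depend on knowing the machine's rmem_max at test time. A minimal sketch of reading it from procfs, assuming the standard `/proc/sys/net/core/rmem_max` location; `parse_rmem_max`, `read_rmem_max`, and `selects_error_branch` are hypothetical helpers, not the ones used in test_udp.cpp.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Parse a decimal byte count from a stream. Leaves `out` untouched on failure.
inline bool parse_rmem_max(std::istream & in, std::size_t & out)
{
  unsigned long long value = 0;
  if (!(in >> value)) return false;
  out = static_cast<std::size_t>(value);
  return true;
}

// Read net.core.rmem_max via procfs. Returns false if the file cannot be
// opened or parsed (e.g. on non-Linux hosts), so a test can skip itself.
inline bool read_rmem_max(
  std::size_t & out, const std::string & path = "/proc/sys/net/core/rmem_max")
{
  std::ifstream file(path);
  if (!file.is_open()) return false;
  return parse_rmem_max(file, out);
}

// Branch selection exactly as the commit message states it:
// rmem_max below the requested size means the error branch is expected.
inline bool selects_error_branch(std::size_t rmem_max, std::size_t requested)
{
  return rmem_max < requested;
}
```

A test would call `read_rmem_max`, skip when it returns false, and otherwise pick the branch with `selects_error_branch(rmem_max, 5400000)`.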
Which system are you on? On my kernel 5.17 machine I cannot get your exception to be thrown. Instead, setsockopt fails directly, causing e.g. the Hesai HW interface to throw an error instead.
Could you paste the output of the below script on your system?
test-rmem-max.sh
Mine is:
My Output
=== Testing rmem_max=100000 ===
net.core.rmem_max = 100000
[rmem_max=100000 buffer=rmem_max] expected=pass
[rmem_max=100000 buffer=rmem_max] matched: pass
[rmem_max=100000 buffer=2*rmem_max] expected=pass
[rmem_max=100000 buffer=2*rmem_max] matched: pass
[rmem_max=100000 buffer=2*rmem_max+1] expected=fail
[rmem_max=100000 buffer=2*rmem_max+1] unexpected result: expected=fail actual=pass
[rmem_max=100000 buffer=2*rmem_max+576] expected=fail
[rmem_max=100000 buffer=2*rmem_max+576] unexpected result: expected=fail actual=pass
[rmem_max=100000 buffer=2*rmem_max+5000] expected=fail
[hesai_ros_wrapper_node-1] what(): Could not set socket receive buffer size to 205000. Try increasing net.core.rmem_max.
[rmem_max=100000 buffer=2*rmem_max+5000] matched: fail
=== Testing rmem_max=1000000 ===
net.core.rmem_max = 1000000
[rmem_max=1000000 buffer=rmem_max] expected=pass
[rmem_max=1000000 buffer=rmem_max] matched: pass
[rmem_max=1000000 buffer=2*rmem_max] expected=pass
[rmem_max=1000000 buffer=2*rmem_max] matched: pass
[rmem_max=1000000 buffer=2*rmem_max+1] expected=fail
[rmem_max=1000000 buffer=2*rmem_max+1] unexpected result: expected=fail actual=pass
[rmem_max=1000000 buffer=2*rmem_max+576] expected=fail
[rmem_max=1000000 buffer=2*rmem_max+576] unexpected result: expected=fail actual=pass
[rmem_max=1000000 buffer=2*rmem_max+5000] expected=fail
[hesai_ros_wrapper_node-1] what(): Could not set socket receive buffer size to 2005000. Try increasing net.core.rmem_max.
[rmem_max=1000000 buffer=2*rmem_max+5000] matched: fail
=== Testing rmem_max=1000000000 ===
net.core.rmem_max = 1000000000
[rmem_max=1000000000 buffer=rmem_max] expected=pass
[rmem_max=1000000000 buffer=rmem_max] matched: pass
[rmem_max=1000000000 buffer=2*rmem_max] expected=pass
[rmem_max=1000000000 buffer=2*rmem_max] matched: pass
[rmem_max=1000000000 buffer=2*rmem_max+1] expected=fail
[rmem_max=1000000000 buffer=2*rmem_max+1] unexpected result: expected=fail actual=pass
[rmem_max=1000000000 buffer=2*rmem_max+576] expected=fail
[rmem_max=1000000000 buffer=2*rmem_max+576] unexpected result: expected=fail actual=pass
[rmem_max=1000000000 buffer=2*rmem_max+5000] expected=fail
[hesai_ros_wrapper_node-1] what(): Could not set socket receive buffer size to 2000005000. Try increasing net.core.rmem_max.
[rmem_max=1000000000 buffer=2*rmem_max+5000] matched: fail
Thanks for the detailed test script! I ran it on my kernel (6.8.0-79-generic) and got very different results from yours (5.17):
Full output (kernel 6.8.0-79-generic)
On kernel 6.8, setsockopt always succeeds regardless of rmem_max. So the behavior is kernel-version-dependent:
Test script
#!/bin/bash
# sudo bash test-rmem-clamping.sh
set -euo pipefail
[[ $EUID -ne 0 ]] && echo "Run with sudo" && exit 1
# Remember the current rmem_max and restore it when the script exits.
ORIG=$(sysctl -n net.core.rmem_max)
trap "sysctl -w net.core.rmem_max=$ORIG > /dev/null; echo Restored rmem_max=$ORIG" EXIT
# Args: <rmem_max (unused)> <requested buffer size> <label>
test_setsockopt() {
  python3 - "$2" "$3" <<'PY'
import socket, sys
buf, label = int(sys.argv[1]), sys.argv[2]
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
    # Read back the size the kernel actually granted.
    actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print(f" [{label}] req={buf} OK actual={actual} clamped={actual < buf}")
except OSError as e:
    print(f" [{label}] req={buf} FAILED: {e}")
finally:
    s.close()
PY
}
for rmem in 100000 212992 2700000 5400000; do
  echo "=== rmem_max=$rmem ==="
  sysctl -w net.core.rmem_max=$rmem > /dev/null
  test_setsockopt $rmem $rmem "buf=rmem_max"
  test_setsockopt $rmem $((rmem*2)) "buf=2x"
  test_setsockopt $rmem $((rmem*2+1)) "buf=2x+1"
  test_setsockopt $rmem $((rmem*2+5000)) "buf=2x+5000"
  test_setsockopt $rmem 5400000 "buf=5.4MB(nebula)"
  echo
done
@k1832 Huh, indeed, with your script I was able to reproduce it on all my machines.
mojomex left a comment
Logic looks good! Let's structure the unit tests similarly to how you structured your test script in your comment, i.e. test cases relative to rmem_max.
constexpr size_t nebula_buf_size = 5400000;

if (rmem_max >= nebula_buf_size) {
Let's test the same cases on all systems, regardless of actual rmem_max:
- read rmem_max
- test set_socket_buffer_size(rmem_max) (should succeed)
- test set_socket_buffer_size(2*rmem_max) (should succeed)
- test set_socket_buffer_size(2*rmem_max+1) (should fail)
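The three cases above can be generated from whatever rmem_max the machine reports. The sketch below assumes the Linux behavior documented in `man 7 socket` (the requested value is capped at rmem_max and then doubled) and ignores the kernel's minimum buffer size. `BufferCase`, `modelled_grant`, and `make_cases` are hypothetical names, and the tolerance parameter stands for the readback threshold.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <vector>

struct BufferCase
{
  std::size_t request_bytes;
  bool expect_success;
};

// Assumed model of the kernel: cap the request at rmem_max (and INT_MAX / 2),
// then double it. The kernel's lower bound on the buffer size is ignored.
inline std::size_t modelled_grant(std::size_t request, std::size_t rmem_max)
{
  const std::size_t capped =
    std::min({request, rmem_max, static_cast<std::size_t>(INT_MAX / 2)});
  return 2 * capped;
}

// Build the reviewer's three cases relative to rmem_max. A case is expected to
// succeed when the modelled grant is within `tolerance` bytes of the request.
inline std::vector<BufferCase> make_cases(std::size_t rmem_max, std::size_t tolerance)
{
  const std::size_t requests[] = {rmem_max, 2 * rmem_max, 2 * rmem_max + 1};
  std::vector<BufferCase> cases;
  for (std::size_t request : requests) {
    const bool ok = modelled_grant(request, rmem_max) + tolerance >= request;
    cases.push_back({request, ok});
  }
  return cases;
}
```

With a tolerance of 0 the model yields succeed, succeed, fail, as listed. With the 4 KB tolerance from the PR, the model predicts that `2*rmem_max+1` is accepted as well, which matches the `2*rmem_max+1` and `2*rmem_max+576` lines reporting `actual=pass` in the output log earlier in this thread. A failing case under that threshold would need to exceed `2*rmem_max` by more than 4096 bytes.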
if (!rmem_max_maybe.has_value()) GTEST_SKIP() << rmem_max_maybe.error();
size_t rmem_max = rmem_max_maybe.value();

// Nebula's configured buffer size for Pandar128E4X
The UDP code is part of nebula_core and as such is vendor-independent. Let's remove the mention of Pandar128E4X here.
PR Type
Related Links
- SO_RCVBUF documentation: man 7 socket
Description
setsockopt(SO_RCVBUF) silently clamps the buffer size to net.core.rmem_max without returning an error. On default Ubuntu installs, rmem_max is ~212 KB, so Nebula's configured 5.4 MB buffer request (udp_socket_receive_buffer_size_bytes: 5400000) gets capped to 212 KB, roughly 25x smaller than intended, with no warning or error.
This behavior is kernel-version-dependent:
[Table with a "setsockopt result" column; rows not preserved]
On kernel 6.8 (Ubuntu 24.04), setsockopt always succeeds regardless of rmem_max, making getsockopt readback the only way to detect clamping.
This can cause packet drops on machines where rmem_max has not been raised (e.g. test machines, standalone Nebula setups, or fresh installs without the agnocast DDS config).
Fix
After setsockopt(SO_RCVBUF), call getsockopt(SO_RCVBUF) to read back the actual value the kernel granted. If it is significantly less than requested (>4 KB difference), throw a SocketError with a clear message and the exact sysctl command to fix it.
The 4 KB threshold avoids false positives when requesting values near INT32_MAX, where the kernel's internal doubling of SO_RCVBUF causes minor rounding (e.g. requesting 2147483647, getting 2147483646).
Review Procedure
- udp.hpp: set_socket_buffer_size() (the fix) and test_udp.cpp (the test)
- getsockopt returns the kernel's doubled value, which should be >= the requested value unless clamped
- TestBufferResize test (requests rmem_max=INT32_MAX) passes with the 4 KB threshold
Remarks
- TestBufferClampingDetected added: on systems with high rmem_max (like CI), it verifies 5.4 MB succeeds; on default systems (~212 KB rmem_max), it verifies SocketError is thrown
- codecov/patch/nebula_core_hw_interfaces check fails because CI has a high rmem_max, so the throw SocketError branch is never executed. This path can only be covered on a machine with default rmem_max (~212 KB).
- hesai_hw_interface.cpp:231-238 already wraps set_socket_buffer_size() in a try/catch that converts SocketError to a runtime_error with a helpful message
Pre-Review Checklist for the PR Author
Checklist for the PR Reviewer
Post-Review Checklist for the PR Author
CI Checks
rmem_max < 5.4 MB to execute, which is not the case in CI.