net: macb: manage BNA error and prevent RX lockup on GEM#7
Open
yseraf wants to merge 1 commit intolinux4microchip:linux-6.12-mchpfrom
Open
net: macb: manage BNA error and prevent RX lockup on GEM#7yseraf wants to merge 1 commit intolinux4microchip:linux-6.12-mchpfrom
yseraf wants to merge 1 commit intolinux4microchip:linux-6.12-mchpfrom
Conversation
This patch addresses a controller freeze occurring under high incoming traffic loads on GEM-based platforms like the ATSAMA5D41. When the DMA exhausted all available buffers, it triggered a Buffer Not Available (BNA) condition in the Receive Status Register (RSR). On this hardware, clearing the BNA flag without first resolving the underlying resource shortage caused the controller to enter a permanent blocked state. The recovery logic introduced here forces a slot release by incrementing the consumer index (rx_tail), acknowledges the error in RSR, and performs a descriptor refill before allowing NAPI to continue. To isolate this critical repair phase from asynchronous bus errors, the HRESP interrupt is transiently masked. This GEM-specific implementation is separated from the legacy MACB polling to avoid regressions on older architectures. Signed-off-by: Yannick Serafini <yannick.serafini@wifx.net>
Author
|
The bug and this patch can be easily tested using Packet Sender application with its UDP Traffic Generator utils : Without patch, the Ethernet controller gets stuck almost instantly (in few dozen of seconds). Note: the standard Ethernet controller behaviour and performance like bandwidth is not affected by the patch, it just makes the ATSAMA5Dxx a true network compatible CPU. |
cristibirsan
pushed a commit
that referenced
this pull request
Feb 12, 2026
commit a53e356 upstream. While testing rpmsg-char interface it was noticed that duplicate sysfs entries are getting created and below warning is noticed. Reason for this is that we are leaking rpmsg device pointer, setting it null without actually unregistering device. Any further attempts to unregister fail because rpdev is NULL, resulting in a leak. Fix this by unregistering rpmsg device before removing its reference from rpmsg channel. sysfs: cannot create duplicate filename '/devices/platform/soc@0/3700000.remot eproc/remoteproc/remoteproc1/3700000.remoteproc:glink-edge/3700000.remoteproc: glink-edge.adsp_apps.-1.-1' [ 114.115347] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.16.0-rc4 #7 PREEMPT [ 114.115355] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT) [ 114.115358] Workqueue: events qcom_glink_work [ 114.115371] Call trace:8 [ 114.115374] show_stack+0x18/0x24 (C) [ 114.115382] dump_stack_lvl+0x60/0x80 [ 114.115388] dump_stack+0x18/0x24 [ 114.115393] sysfs_warn_dup+0x64/0x80 [ 114.115402] sysfs_create_dir_ns+0xf4/0x120 [ 114.115409] kobject_add_internal+0x98/0x260 [ 114.115416] kobject_add+0x9c/0x108 [ 114.115421] device_add+0xc4/0x7a0 [ 114.115429] rpmsg_register_device+0x5c/0xb0 [ 114.115434] qcom_glink_work+0x4bc/0x820 [ 114.115438] process_one_work+0x148/0x284 [ 114.115446] worker_thread+0x2c4/0x3e0 [ 114.115452] kthread+0x12c/0x204 [ 114.115457] ret_from_fork+0x10/0x20 [ 114.115464] kobject: kobject_add_internal failed for 3700000.remoteproc: glink-edge.adsp_apps.-1.-1 with -EEXIST, don't try to register things with the same name in the same directory. [ 114.250045] rpmsg 3700000.remoteproc:glink-edge.adsp_apps.-1.-1: device_add failed: -17 Fixes: 835764d ("rpmsg: glink: Move the common glink protocol implementation to glink_native.c") Cc: Stable@vger.kernel.org Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@oss.qualcomm.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250822100043.2604794-2-srinivas.kandagatla@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cristibirsan
pushed a commit
that referenced
this pull request
Feb 12, 2026
[ Upstream commit 2a71a1a ] skbuff_fclone_cache was created without defining a usercopy region, [1] unlike skbuff_head_cache which properly whitelists the cb[] field. [2] This causes a usercopy BUG() when CONFIG_HARDENED_USERCOPY is enabled and the kernel attempts to copy sk_buff.cb data to userspace via sock_recv_errqueue() -> put_cmsg(). The crash occurs when: 1. TCP allocates an skb using alloc_skb_fclone() (from skbuff_fclone_cache) [1] 2. The skb is cloned via skb_clone() using the pre-allocated fclone [3] 3. The cloned skb is queued to sk_error_queue for timestamp reporting 4. Userspace reads the error queue via recvmsg(MSG_ERRQUEUE) 5. sock_recv_errqueue() calls put_cmsg() to copy serr->ee from skb->cb [4] 6. __check_heap_object() fails because skbuff_fclone_cache has no usercopy whitelist [5] When cloned skbs allocated from skbuff_fclone_cache are used in the socket error queue, accessing the sock_exterr_skb structure in skb->cb via put_cmsg() triggers a usercopy hardening violation: [ 5.379589] usercopy: Kernel memory exposure attempt detected from SLUB object 'skbuff_fclone_cache' (offset 296, size 16)! [ 5.382796] kernel BUG at mm/usercopy.c:102! [ 5.383923] Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI [ 5.384903] CPU: 1 UID: 0 PID: 138 Comm: poc_put_cmsg Not tainted 6.12.57 #7 [ 5.384903] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 5.384903] RIP: 0010:usercopy_abort+0x6c/0x80 [ 5.384903] Code: 1a 86 51 48 c7 c2 40 15 1a 86 41 52 48 c7 c7 c0 15 1a 86 48 0f 45 d6 48 c7 c6 80 15 1a 86 48 89 c1 49 0f 45 f3 e8 84 27 88 ff <0f> 0b 490 [ 5.384903] RSP: 0018:ffffc900006f77a8 EFLAGS: 00010246 [ 5.384903] RAX: 000000000000006f RBX: ffff88800f0ad2a8 RCX: 1ffffffff0f72e74 [ 5.384903] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffff87b973a0 [ 5.384903] RBP: 0000000000000010 R08: 0000000000000000 R09: fffffbfff0f72e74 [ 5.384903] R10: 0000000000000003 R11: 79706f6372657375 R12: 0000000000000001 [ 5.384903] R13: ffff88800f0ad2b8 R14: ffffea00003c2b40 R15: ffffea00003c2b00 [ 5.384903] FS: 0000000011bc4380(0000) GS:ffff8880bf100000(0000) knlGS:0000000000000000 [ 5.384903] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5.384903] CR2: 000056aa3b8e5fe4 CR3: 000000000ea26004 CR4: 0000000000770ef0 [ 5.384903] PKRU: 55555554 [ 5.384903] Call Trace: [ 5.384903] <TASK> [ 5.384903] __check_heap_object+0x9a/0xd0 [ 5.384903] __check_object_size+0x46c/0x690 [ 5.384903] put_cmsg+0x129/0x5e0 [ 5.384903] sock_recv_errqueue+0x22f/0x380 [ 5.384903] tls_sw_recvmsg+0x7ed/0x1960 [ 5.384903] ? srso_alias_return_thunk+0x5/0xfbef5 [ 5.384903] ? schedule+0x6d/0x270 [ 5.384903] ? srso_alias_return_thunk+0x5/0xfbef5 [ 5.384903] ? mutex_unlock+0x81/0xd0 [ 5.384903] ? __pfx_mutex_unlock+0x10/0x10 [ 5.384903] ? __pfx_tls_sw_recvmsg+0x10/0x10 [ 5.384903] ? _raw_spin_lock_irqsave+0x8f/0xf0 [ 5.384903] ? _raw_read_unlock_irqrestore+0x20/0x40 [ 5.384903] ? srso_alias_return_thunk+0x5/0xfbef5 The crash offset 296 corresponds to skb2->cb within skbuff_fclones: - sizeof(struct sk_buff) = 232 - offsetof(struct sk_buff, cb) = 40 - offset of skb2.cb in fclones = 232 + 40 = 272 - crash offset 296 = 272 + 24 (inside sock_exterr_skb.ee) This patch uses a local stack variable as a bounce buffer to avoid the hardened usercopy check failure. [1] https://elixir.bootlin.com/linux/v6.12.62/source/net/ipv4/tcp.c#L885 [2] https://elixir.bootlin.com/linux/v6.12.62/source/net/core/skbuff.c#L5104 [3] https://elixir.bootlin.com/linux/v6.12.62/source/net/core/skbuff.c#L5566 [4] https://elixir.bootlin.com/linux/v6.12.62/source/net/core/skbuff.c#L5491 [5] https://elixir.bootlin.com/linux/v6.12.62/source/mm/slub.c#L5719 Fixes: 6d07d1c ("usercopy: Restrict non-usercopy caches to size 0") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251223203534.1392218-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

This patch addresses a controller freeze occurring under high incoming traffic loads on GEM-based platforms like the ATSAMA5D41. When the DMA exhausted all available buffers, it triggered a Buffer Not Available (BNA) condition in the Receive Status Register (RSR).
On this hardware, clearing the BNA flag without first resolving the underlying resource shortage caused the controller to enter a permanent blocked state. The recovery logic introduced here forces a slot release by incrementing the consumer index (rx_tail), acknowledges the error in RSR, and performs a descriptor refill before allowing NAPI to continue. To isolate this critical repair phase from asynchronous bus errors, the HRESP interrupt is transiently masked. This GEM-specific implementation is separated from the legacy MACB polling to avoid regressions on older architectures.