-
Notifications
You must be signed in to change notification settings - Fork 773
New ConnMonitor to Track "Stalled" Connection State for SSH Conns #2846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sawka
commented
Feb 9, 2026
- ConnMonitor to track stalled connections
- New Stalled Overlay to show feedback when we think a connection is stalled
- New Icon in ConnButton to show stalled connections
- Callbacks in domain socket and PTYs to track activity
WalkthroughAdds a concurrency-aware ConnMonitor for SSH connections (activity timestamps, input signaling, keep-alive loop, health states) and wires it into connection lifecycle and status derivation. Propagates an optional readCallback through domain-socket RPC paths (RunWshRpcOverListener → handleDomainSocketClient → AdaptStreamToMsgCh → StreamToLines). Extends wshrpc types with ConnStatus fields and a NotifySystemResumeCommand RPC plus server and client call sites. Frontend: new connhealthstatus handling, stalled overlay UI, and NotifySystemResumeCommand on system resume. Other changes: GetConn → MaybeGetConn call-site swaps, terminal reset sequence update, minor CSS and type additions. Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
frontend/types/gotypes.d.ts (1)
788-799:⚠️ Potential issue | 🟡 MinorFix typo in Go struct JSON tag:
lastactivitybeforerstalledtimeshould belastactivitybeforestalledtime.Line 798 contains
lastactivitybeforerstalledtimewith an extrarbeforestalled. The typo originates in the Go struct's JSON tag inpkg/wshrpc/wshrpctypes.go:456, where it readsjson:"lastactivitybeforerstalledtime,omitempty"and should bejson:"lastactivitybeforestalledtime,omitempty". The TypeScript file correctly reflects the Go source, but the Go source needs correction.
🤖 Fix all issues with AI agents
In `@frontend/app/block/connstatusoverlay.tsx`:
- Around line 223-224: Remove the hardcoded debug assignment that forces the
stalled overlay: delete the `showStalled = true;` statement so the boolean
`showStalled` computed from `connStatus.status == "connected"` and
`connStatus.connhealthstatus == "stalled"` is used; while here you’re editing
`connstatusoverlay.tsx`, also consider tightening the comparison to `===` for
`connStatus.status`/`connStatus.connhealthstatus` in the `showStalled`
initialization to avoid accidental type-coercion bugs.
In `@pkg/remote/conncontroller/connmonitor.go`:
- Around line 92-109: The code in ConnMonitor.SendKeepAlive reads conn.Client
without holding conn.lock, risking a nil dereference if
Close()/closeInternal_withlifecyclelock clears the client concurrently; fix by
taking the connection's lock (or calling the existing GetClient() helper that
acquires the lock) to snapshot the client into a local variable before launching
the goroutine, return early if the snapshot is nil, and then use that local
client inside the goroutine (still ensuring clearKeepAliveInFlight and
UpdateLastActivityTime are deferred/called as before) so no concurrent mutation
of conn.Client can cause a panic.
In `@pkg/wshrpc/wshrpctypes.go`:
- Line 456: Update the JSON tag for the struct field
LastActivityBeforeStalledTime to the correct key "lastactivitybeforestalledtime"
(remove the stray "r") so the backend emits the proper property; then update the
corresponding frontend definitions (frontend/types/gotypes.d.ts entry around
line ~798) and usage in connstatusoverlay.tsx to match the corrected key and run
any API/client tests to ensure compatibility.
🧹 Nitpick comments (3)
frontend/app/block/connstatusoverlay.tsx (1)
46-109:handleDisconnectis duplicated betweenStalledOverlayandConnStatusOverlay.Lines 58–61 and 149–152 are identical. Consider extracting to a shared helper or passing the callback from the parent.
pkg/remote/conncontroller/connmonitor.go (1)
16-26: Consider unexportingKeepAliveInFlight— it's only used within this file.
KeepAliveInFlightis a synchronization-sensitive field accessed exclusively throughsetKeepAliveInFlight,clearKeepAliveInFlight, andgetTimeSinceKeepAlive. Making it unexported (keepAliveInFlight) prevents accidental unsynchronized access from other files in the package.pkg/remote/conncontroller/conncontroller.go (1)
1028-1045: Monitor is never closed — verify this is intentional for connection reuse.
MakeConnMonitorstarts a background goroutine, butSSHConn.Close()/closeInternal_withlifecyclelock()never callsMonitor.Close(). SinceSSHConnobjects are reused across reconnections (stored inclientControllerMap), this is likely intentional — the monitor should persist. However, the monitor will continue callingcheckConnectionwhile disconnected, which will attemptSendKeepAliveon a nil client. With the race fix suggested inconnmonitor.go, this safely returns an error, but you might consider resetting health state or pausing the monitor on disconnect.
| let showStalled = connStatus.status == "connected" && connStatus.connhealthstatus == "stalled"; | ||
| showStalled = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: showStalled = true is hardcoded debug code — stalled overlay will always appear.
Line 224 unconditionally sets showStalled = true, overriding the actual health status check on line 223. This means every connected SSH connection will display the stalled overlay regardless of connection health. This looks like leftover development/debug code that must be removed before merging.
Proposed fix
let showStalled = connStatus.status == "connected" && connStatus.connhealthstatus == "stalled";
- showStalled = true;
if (!showWshError && !showStalled && (isLayoutMode || connStatus.status == "connected" || connModalOpen)) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let showStalled = connStatus.status == "connected" && connStatus.connhealthstatus == "stalled"; | |
| showStalled = true; | |
| let showStalled = connStatus.status == "connected" && connStatus.connhealthstatus == "stalled"; |
🤖 Prompt for AI Agents
In `@frontend/app/block/connstatusoverlay.tsx` around lines 223 - 224, Remove the
hardcoded debug assignment that forces the stalled overlay: delete the
`showStalled = true;` statement so the boolean `showStalled` computed from
`connStatus.status == "connected"` and `connStatus.connhealthstatus ==
"stalled"` is used; while here you’re editing `connstatusoverlay.tsx`, also
consider tightening the comparison to `===` for
`connStatus.status`/`connStatus.connhealthstatus` in the `showStalled`
initialization to avoid accidental type-coercion bugs.
pkg/wshrpc/wshrpctypes.go
Outdated
| WshError string `json:"wsherror,omitempty"` | ||
| NoWshReason string `json:"nowshreason,omitempty"` | ||
| WshVersion string `json:"wshversion,omitempty"` | ||
| LastActivityBeforeStalledTime int64 `json:"lastactivitybeforerstalledtime,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the file at line 456 to confirm the typo
echo "=== File content at line 456 ==="
sed -n '456p' pkg/wshrpc/wshrpctypes.go
# Find all references to the typo'd JSON key across the codebase
echo -e "\n=== References to typo'd key (lastactivitybeforerstalledtime) ==="
rg -n "lastactivitybeforerstalledtime" --type-list | head -1
# Search broadly for both versions
echo -e "\n=== References to typo'd version ==="
rg -in "lastactivitybeforerstalledtime"
echo -e "\n=== References to correct version ==="
rg -in "lastactivitybeforestalledtime"
# Check the struct context
echo -e "\n=== Struct context around line 456 ==="
sed -n '450,465p' pkg/wshrpc/wshrpctypes.goRepository: wavetermdev/waveterm
Length of output: 2487
Fix typo in JSON tag: lastactivitybeforerstalledtime → lastactivitybeforestalledtime.
Line 456 has a stray r in the JSON key. The tag reads "lastactivitybeforerstalledtime" but should be "lastactivitybeforestalledtime". This typo is already baked into the frontend type definitions (frontend/types/gotypes.d.ts:798) and is being used in connstatusoverlay.tsx. Fixing it requires coordinating backend and frontend changes now before the API solidifies further.
Proposed fix
- LastActivityBeforeStalledTime int64 `json:"lastactivitybeforerstalledtime,omitempty"`
+ LastActivityBeforeStalledTime int64 `json:"lastactivitybeforestalledtime,omitempty"`📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| LastActivityBeforeStalledTime int64 `json:"lastactivitybeforerstalledtime,omitempty"` | |
| LastActivityBeforeStalledTime int64 `json:"lastactivitybeforestalledtime,omitempty"` |
🤖 Prompt for AI Agents
In `@pkg/wshrpc/wshrpctypes.go` at line 456, Update the JSON tag for the struct
field LastActivityBeforeStalledTime to the correct key
"lastactivitybeforestalledtime" (remove the stray "r") so the backend emits the
proper property; then update the corresponding frontend definitions
(frontend/types/gotypes.d.ts entry around line ~798) and usage in
connstatusoverlay.tsx to match the corrected key and run any API/client tests to
ensure compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/util/utilfn/streamtolines.go (1)
59-72:⚠️ Potential issue | 🟡 Minor
readCallbackis skipped on the final read that returns data alongsideio.EOF.When
input.Readreturns bothn > 0anderr == io.EOF,streamToLines_processBufprocesses the data (line 64), but the early return on line 66 meansreadCallbackis never invoked for that last chunk. For the ConnMonitor activity-tracking use case this is likely inconsequential (the connection is closing anyway), but worth noting.
🤖 Fix all issues with AI agents
In `@pkg/remote/conncontroller/conncontroller.go`:
- Around line 1033-1041: Reset the SSHConn health state at the start of each
reconnect by setting conn.ConnHealthStatus = ConnHealthStatus_Good in the
reconnection path (e.g., at the top of Connect() or connectInternal()) so stale
Degraded/Stalled values from previous sessions do not carry over; locate the
SSHConn struct/instance created earlier and the Monitor (MakeConnMonitor /
Monitor) usage to ensure the reset happens before any Monitor or session logic
runs and before returning/continuing the connection attempt (this ensures
clientControllerMap-held SSHConn instances start each reconnect with a clean
health state).
- Line 296: The call to wshutil.RunWshRpcOverListener(listener,
conn.Monitor.UpdateLastActivityTime) uses conn.Monitor without a nil check;
either add a defensive guard (if conn.Monitor != nil {
wshutil.RunWshRpcOverListener(listener, conn.Monitor.UpdateLastActivityTime) }
else { /* handle or log unexpected nil */ }) or add a clear comment above this
use explaining why conn.Monitor is guaranteed non-nil (point to
getConnInternal() which initializes Monitor) and assert that invariant; update
the code path where RunWshRpcOverListener is invoked (referencing conn and
RunWshRpcOverListener) so the nil case is explicitly handled or documented to
match the existing nil check pattern used elsewhere (e.g., the conn.Monitor !=
nil check at other call sites).
In `@pkg/remote/conncontroller/connmonitor.go`:
- Around line 149-159: The inner select handling cm.inputNotifyCh incorrectly
uses break which only exits the select and allows the subsequent
cm.Conn.SetConnHealthStatus(ConnHealthStatus_Degraded) and cm.checkConnection()
to run; change the control flow so that when cm.LastActivityTime.Load() >=
inputTime you skip the degraded logic and resume the outer loop (replace the
break with continue in the branch inside the select), ensuring cm.ctx.Done()
handling remains unchanged and references to cm.inputNotifyCh,
cm.LastActivityTime.Load(), cm.Conn.SetConnHealthStatus and cm.checkConnection
are preserved.
In `@pkg/wshrpc/wshserver/wshserver.go`:
- Around line 758-761: In WshServer.NotifySystemResumeCommand, iterate over the
server's active connections in clientControllerMap and for each connection
retrieve its ConnMonitor (check for nil), call
ConnMonitor.UpdateLastActivityTime() and clear the keep-alive flag by setting
ConnMonitor.KeepAliveInFlight = false; ensure you handle any concurrency
protections around clientControllerMap (use the existing map mutex or accessor)
so iterations are safe and skip or log controllers without a monitor.
🧹 Nitpick comments (3)
frontend/app/block/connstatusoverlay.tsx (1)
58-61: DuplicatehandleDisconnectcallbacks inStalledOverlayandConnStatusOverlay.Both components define identical
handleDisconnectcallbacks. Consider extracting this into a shared helper or passing it as a prop from the parent to reduce duplication.Also applies to: 149-152
pkg/remote/conncontroller/connmonitor.go (2)
16-26: Consider unexporting fields that are only accessed within the package.
Conn,LastActivityTime,LastInputTime,KeepAliveSentTime, andKeepAliveInFlightare exported but only consumed withinconncontroller. Unexporting them would reduce the public API surface and clarify the intended encapsulation.
28-39: The monitor goroutine starts unconditionally, even for connections inInitstate.
MakeConnMonitoris called fromgetConnInternalwhen theSSHConnis first created (statusInit), before any SSH connection exists. ThekeepAliveMonitorgoroutine will run and callcheckConnectionevery 5 seconds, butLastActivityTimewill be 0 so it early-returns. This is harmless but worth noting — the goroutine idles until the connection is established. Consider starting the monitor only when the connection reachesConnectedstate, or document the current behavior.
| func (ws *WshServer) NotifySystemResumeCommand(ctx context.Context) error { | ||
| log.Printf("NotifySystemResumeCommand called\n") | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
rg -n "SystemResume\|NotifyResume\|systemresume" --type=go -C3Repository: wavetermdev/waveterm
Length of output: 46
🏁 Script executed:
rg -in "resume" --type=go -C2 | head -100Repository: wavetermdev/waveterm
Length of output: 1385
🏁 Script executed:
rg -n "NotifySystemResumeCommand" --type=goRepository: wavetermdev/waveterm
Length of output: 574
🏁 Script executed:
rg -in "notifysystemresume" --type-list | head -20Repository: wavetermdev/waveterm
Length of output: 453
🏁 Script executed:
fd -type f -name "*.ts" -o -name "*.tsx" | head -20Repository: wavetermdev/waveterm
Length of output: 234
🏁 Script executed:
rg "notifysystemresume" -iRepository: wavetermdev/waveterm
Length of output: 1165
🏁 Script executed:
fd -e ts -e tsx | head -30Repository: wavetermdev/waveterm
Length of output: 833
🏁 Script executed:
rg -n "NotifySystemResumeCommand" emain/emain.ts -B10 -A5Repository: wavetermdev/waveterm
Length of output: 821
🏁 Script executed:
rg -n "ConnMonitor\|Monitor.*Stall\|NotifyResume\|resume.*monitor" --type=go -C2Repository: wavetermdev/waveterm
Length of output: 46
🏁 Script executed:
rg "Monitor\|stall" -i frontend/app --type=ts --type=tsx -C2 | head -80Repository: wavetermdev/waveterm
Length of output: 91
🏁 Script executed:
rg "Monitor|stall" -i frontend/app -C2 | head -100Repository: wavetermdev/waveterm
Length of output: 7548
🏁 Script executed:
rg "timeout|keepalive|stall" -i pkg/wshrpc --type=go -C2 | head -80Repository: wavetermdev/waveterm
Length of output: 5350
🏁 Script executed:
rg "heartbeat|ping|keepalive|alive" -i pkg/wshrpc --type=go | head -50Repository: wavetermdev/waveterm
Length of output: 279
🏁 Script executed:
fd -name "*conncontroller*" --type=fRepository: wavetermdev/waveterm
Length of output: 294
🏁 Script executed:
rg "KeepAliveSentTime" --type=go -B3 -A3Repository: wavetermdev/waveterm
Length of output: 3102
🏁 Script executed:
rg "ping|keep.?alive" -i --type=go -C2 | head -100Repository: wavetermdev/waveterm
Length of output: 4564
🏁 Script executed:
rg -n "type ConnMonitor struct" --type=go -A20Repository: wavetermdev/waveterm
Length of output: 1578
🏁 Script executed:
rg -n "func.*ConnMonitor.*Reset\|NotifyResume\|ResetKeep" --type=goRepository: wavetermdev/waveterm
Length of output: 46
🏁 Script executed:
rg -n "func.*GetConnections\|AllConnections\|ConnMap\|Monitors" --type=go pkg/remote/conncontroller | head -30Repository: wavetermdev/waveterm
Length of output: 46
🏁 Script executed:
rg -n "KeepAliveSentTime.Store\|LastActivityTime.Store" --type=go -B2 -A2Repository: wavetermdev/waveterm
Length of output: 46
🏁 Script executed:
cat -n pkg/remote/conncontroller/connmonitor.go | head -100Repository: wavetermdev/waveterm
Length of output: 2884
🏁 Script executed:
cat -n pkg/remote/conncontroller/conncontroller.go | head -150Repository: wavetermdev/waveterm
Length of output: 5640
🏁 Script executed:
rg -n "LastActivityTime.Store\|NotifyInput\|UpdateLastActivityTime" --type=go pkg/remote/conncontrollerRepository: wavetermdev/waveterm
Length of output: 46
Missing logic to reset connection monitors on system resume
The backend handler should iterate through active connections (via clientControllerMap) and reset stale keep-alive timers by calling Monitor.UpdateLastActivityTime() and clearing KeepAliveInFlight on each ConnMonitor. After system sleep, the timers are out of sync with actual activity, which can cause the connection health monitor to incorrectly mark connections as stalled.
The ConnMonitor struct and UpdateLastActivityTime() method already exist in pkg/remote/conncontroller/connmonitor.go; you only need to iterate through all connections and invoke these methods when the system resumes.
🤖 Prompt for AI Agents
In `@pkg/wshrpc/wshserver/wshserver.go` around lines 758 - 761, In
WshServer.NotifySystemResumeCommand, iterate over the server's active
connections in clientControllerMap and for each connection retrieve its
ConnMonitor (check for nil), call ConnMonitor.UpdateLastActivityTime() and clear
the keep-alive flag by setting ConnMonitor.KeepAliveInFlight = false; ensure you
handle any concurrency protections around clientControllerMap (use the existing
map mutex or accessor) so iterations are safe and skip or log controllers
without a monitor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pkg/remote/conncontroller/connmonitor.go`:
- Around line 110-129: In SendKeepAlive, avoid calling UpdateLastActivityTime
when the keepalive request failed: check the err returned from
client.SendRequest and only call cm.UpdateLastActivityTime() (or otherwise mark
health Good) when err == nil; leave the existing
setKeepAliveInFlight/clearKeepAliveInFlight handling and the panichandler defer
intact so only the success path updates last-activity/health for ConnMonitor and
failures do not reset the connection to "Good".
🧹 Nitpick comments (3)
frontend/app/block/connstatusoverlay.tsx (2)
56-56: Consider initializingelapsedTimefrom the timestamp instead of"".On first render,
elapsedTimeis""until the effect fires (after paint). This means users may briefly see "Connection to X is stalled" without the elapsed-time parenthetical, then it pops in. You could initialize state eagerly to avoid the flash:Suggested initialization
- const [elapsedTime, setElapsedTime] = React.useState<string>(""); + const [elapsedTime, setElapsedTime] = React.useState<string>(() => { + if (!connStatus.lastactivitybeforestalledtime) return ""; + return formatElapsedTime(Date.now() - connStatus.lastactivitybeforestalledtime); + });
81-105: Inline Tailwind classes are quite long — consider extracting to a CSS class.The overlay
divon line 83 has a very long class string mixing utilities and CSS variables. This is a style preference, but for readability and maintainability, extracting it into a named CSS class (like the existingconnstatus-overlaypattern used on line 230) would be more consistent with the rest of this file.pkg/remote/conncontroller/connmonitor.go (1)
20-31: Consider unexporting lock-guarded fields to prevent unguarded access.
KeepAliveInFlightis protected bycm.lock(viasetKeepAliveInFlight/clearKeepAliveInFlight/getTimeSinceKeepAlive), but being exported allows external callers to read/write it without acquiring the lock. Lowercasing it would enforce correct access at compile time.Suggested change
- KeepAliveInFlight bool + keepAliveInFlight boolAnd update the three internal references accordingly.