Skip to content

Conversation

@sawka
Copy link
Member

@sawka sawka commented Feb 9, 2026

  • ConnMonitor to track stalled connections
  • New Stalled Overlay to show feedback when we think a connection is stalled
  • New Icon in ConnButton to show stalled connections
  • Callbacks in domain socket and PTYs to track activity

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 9, 2026

Walkthrough

Adds a concurrency-aware ConnMonitor for SSH connections (activity timestamps, input signaling, keep-alive loop, health states) and wires it into connection lifecycle and status derivation. Propagates an optional readCallback through domain-socket RPC paths (RunWshRpcOverListener → handleDomainSocketClient → AdaptStreamToMsgCh → StreamToLines). Extends wshrpc types with ConnStatus fields and a NotifySystemResumeCommand RPC plus server and client call sites. Frontend: new connhealthstatus handling, stalled overlay UI, and NotifySystemResumeCommand on system resume. Other changes: GetConn → MaybeGetConn call-site swaps, terminal reset sequence update, minor CSS and type additions.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.81% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a ConnMonitor component to track stalled connection state for SSH connections, which is the primary objective of this PR.
Description check ✅ Passed The description is directly related to the changeset and outlines the key components added: ConnMonitor, Stalled Overlay UI, ConnButton icon, and activity callbacks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sawka/conn-keepalives

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
pkg/remote/conncontroller/connmonitor.go (2)

20-31: KeepAliveInFlight is exported but requires lock protection — consider unexporting.

KeepAliveInFlight is accessed exclusively under cm.lock internally (setKeepAliveInFlight, clearKeepAliveInFlight, getTimeSinceKeepAlive), but being exported allows external callers to read/write it without the lock, introducing a data race risk. If there's no external consumer, make it unexported.

Suggested change
-	KeepAliveInFlight bool
+	keepAliveInFlight bool

And update the three internal references accordingly.


110-131: SendKeepAlive always returns nil — the error return is misleading.

Since the actual send runs in a goroutine and errors are logged there, this method can never return a non-nil error. The return signature suggests callers should handle errors, but there's nothing to handle. Consider changing the signature to SendKeepAlive() (no return) for clarity.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
frontend/types/gotypes.d.ts (1)

788-799: ⚠️ Potential issue | 🟡 Minor

Fix typo in Go struct JSON tag: lastactivitybeforerstalledtime should be lastactivitybeforestalledtime.

Line 798 contains lastactivitybeforerstalledtime with an extra r before stalled. The typo originates in the Go struct's JSON tag in pkg/wshrpc/wshrpctypes.go:456, where it reads json:"lastactivitybeforerstalledtime,omitempty" and should be json:"lastactivitybeforestalledtime,omitempty". The TypeScript file correctly reflects the Go source, but the Go source needs correction.

🤖 Fix all issues with AI agents
In `@frontend/app/block/connstatusoverlay.tsx`:
- Around line 223-224: Remove the hardcoded debug assignment that forces the
stalled overlay: delete the `showStalled = true;` statement so the boolean
`showStalled` computed from `connStatus.status == "connected"` and
`connStatus.connhealthstatus == "stalled"` is used; while here you’re editing
`connstatusoverlay.tsx`, also consider tightening the comparison to `===` for
`connStatus.status`/`connStatus.connhealthstatus` in the `showStalled`
initialization to avoid accidental type-coercion bugs.

In `@pkg/remote/conncontroller/connmonitor.go`:
- Around line 92-109: The code in ConnMonitor.SendKeepAlive reads conn.Client
without holding conn.lock, risking a nil dereference if
Close()/closeInternal_withlifecyclelock clears the client concurrently; fix by
taking the connection's lock (or calling the existing GetClient() helper that
acquires the lock) to snapshot the client into a local variable before launching
the goroutine, return early if the snapshot is nil, and then use that local
client inside the goroutine (still ensuring clearKeepAliveInFlight and
UpdateLastActivityTime are deferred/called as before) so no concurrent mutation
of conn.Client can cause a panic.

In `@pkg/wshrpc/wshrpctypes.go`:
- Line 456: Update the JSON tag for the struct field
LastActivityBeforeStalledTime to the correct key "lastactivitybeforestalledtime"
(remove the stray "r") so the backend emits the proper property; then update the
corresponding frontend definitions (frontend/types/gotypes.d.ts entry around
line ~798) and usage in connstatusoverlay.tsx to match the corrected key and run
any API/client tests to ensure compatibility.
🧹 Nitpick comments (3)
frontend/app/block/connstatusoverlay.tsx (1)

46-109: handleDisconnect is duplicated between StalledOverlay and ConnStatusOverlay.

Lines 58–61 and 149–152 are identical. Consider extracting to a shared helper or passing the callback from the parent.

pkg/remote/conncontroller/connmonitor.go (1)

16-26: Consider unexporting KeepAliveInFlight — it's only used within this file.

KeepAliveInFlight is a synchronization-sensitive field accessed exclusively through setKeepAliveInFlight, clearKeepAliveInFlight, and getTimeSinceKeepAlive. Making it unexported (keepAliveInFlight) prevents accidental unsynchronized access from other files in the package.

pkg/remote/conncontroller/conncontroller.go (1)

1028-1045: Monitor is never closed — verify this is intentional for connection reuse.

MakeConnMonitor starts a background goroutine, but SSHConn.Close() / closeInternal_withlifecyclelock() never calls Monitor.Close(). Since SSHConn objects are reused across reconnections (stored in clientControllerMap), this is likely intentional — the monitor should persist. However, the monitor will continue calling checkConnection while disconnected, which will attempt SendKeepAlive on a nil client. With the race fix suggested in connmonitor.go, this safely returns an error, but you might consider resetting health state or pausing the monitor on disconnect.

Comment on lines 223 to 224
let showStalled = connStatus.status == "connected" && connStatus.connhealthstatus == "stalled";
showStalled = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: showStalled = true is hardcoded debug code — stalled overlay will always appear.

Line 224 unconditionally sets showStalled = true, overriding the actual health status check on line 223. This means every connected SSH connection will display the stalled overlay regardless of connection health. This looks like leftover development/debug code that must be removed before merging.

Proposed fix
         let showStalled = connStatus.status == "connected" && connStatus.connhealthstatus == "stalled";
-        showStalled = true;
         if (!showWshError && !showStalled && (isLayoutMode || connStatus.status == "connected" || connModalOpen)) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let showStalled = connStatus.status == "connected" && connStatus.connhealthstatus == "stalled";
showStalled = true;
let showStalled = connStatus.status == "connected" && connStatus.connhealthstatus == "stalled";
🤖 Prompt for AI Agents
In `@frontend/app/block/connstatusoverlay.tsx` around lines 223 - 224, Remove the
hardcoded debug assignment that forces the stalled overlay: delete the
`showStalled = true;` statement so the boolean `showStalled` computed from
`connStatus.status == "connected"` and `connStatus.connhealthstatus ==
"stalled"` is used; while here you’re editing `connstatusoverlay.tsx`, also
consider tightening the comparison to `===` for
`connStatus.status`/`connStatus.connhealthstatus` in the `showStalled`
initialization to avoid accidental type-coercion bugs.

WshError string `json:"wsherror,omitempty"`
NoWshReason string `json:"nowshreason,omitempty"`
WshVersion string `json:"wshversion,omitempty"`
LastActivityBeforeStalledTime int64 `json:"lastactivitybeforerstalledtime,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the file at line 456 to confirm the typo
echo "=== File content at line 456 ==="
sed -n '456p' pkg/wshrpc/wshrpctypes.go

# Find all references to the typo'd JSON key across the codebase
echo -e "\n=== References to typo'd key (lastactivitybeforerstalledtime) ==="
rg -n "lastactivitybeforerstalledtime" --type-list | head -1

# Search broadly for both versions
echo -e "\n=== References to typo'd version ==="
rg -in "lastactivitybeforerstalledtime"

echo -e "\n=== References to correct version ==="
rg -in "lastactivitybeforestalledtime"

# Check the struct context
echo -e "\n=== Struct context around line 456 ==="
sed -n '450,465p' pkg/wshrpc/wshrpctypes.go

Repository: wavetermdev/waveterm

Length of output: 2487


Fix typo in JSON tag: lastactivitybeforerstalledtimelastactivitybeforestalledtime.

Line 456 has a stray r in the JSON key. The tag reads "lastactivitybeforerstalledtime" but should be "lastactivitybeforestalledtime". This typo is already baked into the frontend type definitions (frontend/types/gotypes.d.ts:798) and is being used in connstatusoverlay.tsx. Fixing it requires coordinating backend and frontend changes now before the API solidifies further.

Proposed fix
-	LastActivityBeforeStalledTime int64  `json:"lastactivitybeforerstalledtime,omitempty"`
+	LastActivityBeforeStalledTime int64  `json:"lastactivitybeforestalledtime,omitempty"`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
LastActivityBeforeStalledTime int64 `json:"lastactivitybeforerstalledtime,omitempty"`
LastActivityBeforeStalledTime int64 `json:"lastactivitybeforestalledtime,omitempty"`
🤖 Prompt for AI Agents
In `@pkg/wshrpc/wshrpctypes.go` at line 456, Update the JSON tag for the struct
field LastActivityBeforeStalledTime to the correct key
"lastactivitybeforestalledtime" (remove the stray "r") so the backend emits the
proper property; then update the corresponding frontend definitions
(frontend/types/gotypes.d.ts entry around line ~798) and usage in
connstatusoverlay.tsx to match the corrected key and run any API/client tests to
ensure compatibility.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/util/utilfn/streamtolines.go (1)

59-72: ⚠️ Potential issue | 🟡 Minor

readCallback is skipped on the final read that returns data alongside io.EOF.

When input.Read returns both n > 0 and err == io.EOF, streamToLines_processBuf processes the data (line 64), but the early return on line 66 means readCallback is never invoked for that last chunk. For the ConnMonitor activity-tracking use case this is likely inconsequential (the connection is closing anyway), but worth noting.

🤖 Fix all issues with AI agents
In `@pkg/remote/conncontroller/conncontroller.go`:
- Around line 1033-1041: Reset the SSHConn health state at the start of each
reconnect by setting conn.ConnHealthStatus = ConnHealthStatus_Good in the
reconnection path (e.g., at the top of Connect() or connectInternal()) so stale
Degraded/Stalled values from previous sessions do not carry over; locate the
SSHConn struct/instance created earlier and the Monitor (MakeConnMonitor /
Monitor) usage to ensure the reset happens before any Monitor or session logic
runs and before returning/continuing the connection attempt (this ensures
clientControllerMap-held SSHConn instances start each reconnect with a clean
health state).
- Line 296: The call to wshutil.RunWshRpcOverListener(listener,
conn.Monitor.UpdateLastActivityTime) uses conn.Monitor without a nil check;
either add a defensive guard (if conn.Monitor != nil {
wshutil.RunWshRpcOverListener(listener, conn.Monitor.UpdateLastActivityTime) }
else { /* handle or log unexpected nil */ }) or add a clear comment above this
use explaining why conn.Monitor is guaranteed non-nil (point to
getConnInternal() which initializes Monitor) and assert that invariant; update
the code path where RunWshRpcOverListener is invoked (referencing conn and
RunWshRpcOverListener) so the nil case is explicitly handled or documented to
match the existing nil check pattern used elsewhere (e.g., the conn.Monitor !=
nil check at other call sites).

In `@pkg/remote/conncontroller/connmonitor.go`:
- Around line 149-159: The inner select handling cm.inputNotifyCh incorrectly
uses break which only exits the select and allows the subsequent
cm.Conn.SetConnHealthStatus(ConnHealthStatus_Degraded) and cm.checkConnection()
to run; change the control flow so that when cm.LastActivityTime.Load() >=
inputTime you skip the degraded logic and resume the outer loop (replace the
break with continue in the branch inside the select), ensuring cm.ctx.Done()
handling remains unchanged and references to cm.inputNotifyCh,
cm.LastActivityTime.Load(), cm.Conn.SetConnHealthStatus and cm.checkConnection
are preserved.

In `@pkg/wshrpc/wshserver/wshserver.go`:
- Around line 758-761: In WshServer.NotifySystemResumeCommand, iterate over the
server's active connections in clientControllerMap and for each connection
retrieve its ConnMonitor (check for nil), call
ConnMonitor.UpdateLastActivityTime() and clear the keep-alive flag by setting
ConnMonitor.KeepAliveInFlight = false; ensure you handle any concurrency
protections around clientControllerMap (use the existing map mutex or accessor)
so iterations are safe and skip or log controllers without a monitor.
🧹 Nitpick comments (3)
frontend/app/block/connstatusoverlay.tsx (1)

58-61: Duplicate handleDisconnect callbacks in StalledOverlay and ConnStatusOverlay.

Both components define identical handleDisconnect callbacks. Consider extracting this into a shared helper or passing it as a prop from the parent to reduce duplication.

Also applies to: 149-152

pkg/remote/conncontroller/connmonitor.go (2)

16-26: Consider unexporting fields that are only accessed within the package.

Conn, LastActivityTime, LastInputTime, KeepAliveSentTime, and KeepAliveInFlight are exported but only consumed within conncontroller. Unexporting them would reduce the public API surface and clarify the intended encapsulation.


28-39: The monitor goroutine starts unconditionally, even for connections in Init state.

MakeConnMonitor is called from getConnInternal when the SSHConn is first created (status Init), before any SSH connection exists. The keepAliveMonitor goroutine will run and call checkConnection every 5 seconds, but LastActivityTime will be 0 so it early-returns. This is harmless but worth noting — the goroutine idles until the connection is established. Consider starting the monitor only when the connection reaches Connected state, or document the current behavior.

Comment on lines +758 to +761
func (ws *WshServer) NotifySystemResumeCommand(ctx context.Context) error {
log.Printf("NotifySystemResumeCommand called\n")
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n "SystemResume\|NotifyResume\|systemresume" --type=go -C3

Repository: wavetermdev/waveterm

Length of output: 46


🏁 Script executed:

rg -in "resume" --type=go -C2 | head -100

Repository: wavetermdev/waveterm

Length of output: 1385


🏁 Script executed:

rg -n "NotifySystemResumeCommand" --type=go

Repository: wavetermdev/waveterm

Length of output: 574


🏁 Script executed:

rg -in "notifysystemresume" --type-list | head -20

Repository: wavetermdev/waveterm

Length of output: 453


🏁 Script executed:

fd -type f -name "*.ts" -o -name "*.tsx" | head -20

Repository: wavetermdev/waveterm

Length of output: 234


🏁 Script executed:

rg "notifysystemresume" -i

Repository: wavetermdev/waveterm

Length of output: 1165


🏁 Script executed:

fd -e ts -e tsx | head -30

Repository: wavetermdev/waveterm

Length of output: 833


🏁 Script executed:

rg -n "NotifySystemResumeCommand" emain/emain.ts -B10 -A5

Repository: wavetermdev/waveterm

Length of output: 821


🏁 Script executed:

rg -n "ConnMonitor\|Monitor.*Stall\|NotifyResume\|resume.*monitor" --type=go -C2

Repository: wavetermdev/waveterm

Length of output: 46


🏁 Script executed:

rg "Monitor\|stall" -i frontend/app --type=ts --type=tsx -C2 | head -80

Repository: wavetermdev/waveterm

Length of output: 91


🏁 Script executed:

rg "Monitor|stall" -i frontend/app -C2 | head -100

Repository: wavetermdev/waveterm

Length of output: 7548


🏁 Script executed:

rg "timeout|keepalive|stall" -i pkg/wshrpc --type=go -C2 | head -80

Repository: wavetermdev/waveterm

Length of output: 5350


🏁 Script executed:

rg "heartbeat|ping|keepalive|alive" -i pkg/wshrpc --type=go | head -50

Repository: wavetermdev/waveterm

Length of output: 279


🏁 Script executed:

fd -name "*conncontroller*" --type=f

Repository: wavetermdev/waveterm

Length of output: 294


🏁 Script executed:

rg "KeepAliveSentTime" --type=go -B3 -A3

Repository: wavetermdev/waveterm

Length of output: 3102


🏁 Script executed:

rg "ping|keep.?alive" -i --type=go -C2 | head -100

Repository: wavetermdev/waveterm

Length of output: 4564


🏁 Script executed:

rg -n "type ConnMonitor struct" --type=go -A20

Repository: wavetermdev/waveterm

Length of output: 1578


🏁 Script executed:

rg -n "func.*ConnMonitor.*Reset\|NotifyResume\|ResetKeep" --type=go

Repository: wavetermdev/waveterm

Length of output: 46


🏁 Script executed:

rg -n "func.*GetConnections\|AllConnections\|ConnMap\|Monitors" --type=go pkg/remote/conncontroller | head -30

Repository: wavetermdev/waveterm

Length of output: 46


🏁 Script executed:

rg -n "KeepAliveSentTime.Store\|LastActivityTime.Store" --type=go -B2 -A2

Repository: wavetermdev/waveterm

Length of output: 46


🏁 Script executed:

cat -n pkg/remote/conncontroller/connmonitor.go | head -100

Repository: wavetermdev/waveterm

Length of output: 2884


🏁 Script executed:

cat -n pkg/remote/conncontroller/conncontroller.go | head -150

Repository: wavetermdev/waveterm

Length of output: 5640


🏁 Script executed:

rg -n "LastActivityTime.Store\|NotifyInput\|UpdateLastActivityTime" --type=go pkg/remote/conncontroller

Repository: wavetermdev/waveterm

Length of output: 46


Missing logic to reset connection monitors on system resume

The backend handler should iterate through active connections (via clientControllerMap) and reset stale keep-alive timers by calling Monitor.UpdateLastActivityTime() and clearing KeepAliveInFlight on each ConnMonitor. After system sleep, the timers are out of sync with actual activity, which can cause the connection health monitor to incorrectly mark connections as stalled.

The ConnMonitor struct and UpdateLastActivityTime() method already exist in pkg/remote/conncontroller/connmonitor.go; you only need to iterate through all connections and invoke these methods when the system resumes.

🤖 Prompt for AI Agents
In `@pkg/wshrpc/wshserver/wshserver.go` around lines 758 - 761, In
WshServer.NotifySystemResumeCommand, iterate over the server's active
connections in clientControllerMap and for each connection retrieve its
ConnMonitor (check for nil), call ConnMonitor.UpdateLastActivityTime() and clear
the keep-alive flag by setting ConnMonitor.KeepAliveInFlight = false; ensure you
handle any concurrency protections around clientControllerMap (use the existing
map mutex or accessor) so iterations are safe and skip or log controllers
without a monitor.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pkg/remote/conncontroller/connmonitor.go`:
- Around line 110-129: In SendKeepAlive, avoid calling UpdateLastActivityTime
when the keepalive request failed: check the err returned from
client.SendRequest and only call cm.UpdateLastActivityTime() (or otherwise mark
health Good) when err == nil; leave the existing
setKeepAliveInFlight/clearKeepAliveInFlight handling and the panichandler defer
intact so only the success path updates last-activity/health for ConnMonitor and
failures do not reset the connection to "Good".
🧹 Nitpick comments (3)
frontend/app/block/connstatusoverlay.tsx (2)

56-56: Consider initializing elapsedTime from the timestamp instead of "".

On first render, elapsedTime is "" until the effect fires (after paint). This means users may briefly see "Connection to X is stalled" without the elapsed-time parenthetical, then it pops in. You could initialize state eagerly to avoid the flash:

Suggested initialization
-        const [elapsedTime, setElapsedTime] = React.useState<string>("");
+        const [elapsedTime, setElapsedTime] = React.useState<string>(() => {
+            if (!connStatus.lastactivitybeforestalledtime) return "";
+            return formatElapsedTime(Date.now() - connStatus.lastactivitybeforestalledtime);
+        });

81-105: Inline Tailwind classes are quite long — consider extracting to a CSS class.

The overlay div on line 83 has a very long class string mixing utilities and CSS variables. This is a style preference, but for readability and maintainability, extracting it into a named CSS class (like the existing connstatus-overlay pattern used on line 230) would be more consistent with the rest of this file.

pkg/remote/conncontroller/connmonitor.go (1)

20-31: Consider unexporting lock-guarded fields to prevent unguarded access.

KeepAliveInFlight is protected by cm.lock (via setKeepAliveInFlight/clearKeepAliveInFlight/getTimeSinceKeepAlive), but being exported allows external callers to read/write it without acquiring the lock. Lowercasing it would enforce correct access at compile time.

Suggested change
-	KeepAliveInFlight bool
+	keepAliveInFlight bool

And update the three internal references accordingly.

@sawka sawka merged commit 458fcc2 into main Feb 10, 2026
7 checks passed
@sawka sawka deleted the sawka/conn-keepalives branch February 10, 2026 01:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant