Skip to content

Redesign MRP kick: abort exchange and restart fresh#3399

Open
Apollon77 wants to merge 7 commits intomainfrom
fix/mrp-kick-pending-race
Open

Redesign MRP kick: abort exchange and restart fresh#3399
Apollon77 wants to merge 7 commits intomainfrom
fix/mrp-kick-pending-race

Conversation

@Apollon77
Copy link
Collaborator

@Apollon77 Apollon77 commented Mar 12, 2026

Summary

Replaces the MRP kick mechanism with a cleaner architectural approach:

  • Old behavior: Kick short-circuited the MRP retransmission timer on the current exchange, which was fragile due to a race condition where kicks could arrive while a retransmission send was in-flight (timer stopped, isRunning === false).
  • New behavior: Kick aborts the current CASE handshake exchange via a localAbort signal and lets connect() loop immediately to a fresh exchange with a clean MRP backoff (~2.25s first retry). DNS-SD address discovery or an explicit peer.kick() now takes effect within seconds instead of waiting out a potentially many-minute backoff.

Key changes

  • MessageExchange: Removed exchange.kick() method and the #kick/#kickPending fields entirely — kick logic now lives at the PeerConnection layer
  • PeerConnection: Added KickOrigin type ("discover" | "connect"), lastRestartAt shared state, and localAbort-based restart logic in attemptOnce()
  • PeerTimingParameters: Added mrpKickRestartIntervalDiscover (30 min) and mrpKickRestartIntervalConnect (10 min) rate-limit thresholds
  • Peer: Updated QuietObservable<[KickOrigin]> typing and ConnectionProcess interface to thread origin through

Rate limiting (two layers)

  1. QuietObservable (3s debounce, skipSuppressedEmits: true): First-line defense against rapid-fire mDNS churn
  2. Per-origin lastRestartAt threshold: Shared across all concurrent address attempts so at most one restart fires per kick

Design notes

  • localAbort wraps addressAbort (not overallAbort) — firing it aborts only the current pair() call, not the entire address retry loop
  • lastRestartAt is scoped to the PeerConnection call, shared across concurrent address attempts to prevent burst restarts
  • If a kick fires just as pair() succeeds, overallAbort() has already fired, making localAbort() harmless

Test plan

  • 21 new unit tests in PeerConnectionKickTest.ts covering:
    • PeerTimingParameters defaults and overrides
    • KickOrigin threading through QuietObservable
    • Rate-limiting logic (thresholds, shared state, cross-origin)
    • localAbort restart pattern (abort propagation, loop restart, double-abort safety)
    • Observable type contract (QuietObservable assignability, use() disposal)
  • Full protocol test suite passes (607/607, ESM + CJS)
  • Type-check clean, linter clean, formatter clean

🤖 Generated with Claude Code

When DNS-SD resolves a peer address, PeerConnection calls exchange.kick()
to short-circuit the MRP backoff timer and retransmit immediately.  The
kick checks whether the retransmission timer is currently running, and if
not, does nothing.

There is a race in Node.js's timer phase: the retransmission timer and the
DNS-SD notify setTimeout(0) both expire at roughly the same time.  The
retransmission timer fires first, stopping itself and starting an async
channel.send().  The notify fires next, the kick arrives, sees the timer
is not running (channel.send() is still in flight), and is silently
dropped.  When channel.send() completes, #initializeResubmission starts a
fresh 2-minute backoff — ignoring that a kick was requested.

Fix by adding a #kickPending flag.  When kick() fires and the timer is not
running, we set the flag instead of doing nothing.  #initializeResubmission
checks the flag after channel.send() completes and, if set, starts the
next timer with zero delay so the retransmission fires immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Apollon77 Apollon77 requested a review from lauckhart as a code owner March 12, 2026 18:33
Copilot AI review requested due to automatic review settings March 12, 2026 18:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race in MRP retransmission “kick” handling within MessageExchange so that a kick arriving while a retransmission send is in-flight is not silently dropped, and instead forces the next resubmission to occur immediately.

Changes:

  • Add #kickPending to remember kicks that arrive when the retransmission timer isn’t running (send in-flight).
  • Clear #kickPending on message boundary (send() finally) and consume it in #initializeResubmission() to start the next timer with zero delay.

You can also share your feedback on Copilot code review. Take the survey.

@Apollon77 Apollon77 marked this pull request as draft March 13, 2026 10:48
Apollon77 and others added 2 commits March 13, 2026 13:43
…circuiting timer

Instead of calling exchange.kick() to short-circuit the MRP retransmission timer,
a kick now aborts the current CASE handshake exchange via a local abort signal and
lets connect() loop immediately to a fresh exchange with a clean MRP backoff (~2.25s
first retry). This means a DNS-SD address discovery or an explicit peer.kick() takes
effect within seconds instead of waiting out the current (potentially many-minute)
MRP backoff interval.

Two KickOrigin categories ("discover" and "connect") control separate rate limits
via PeerTimingParameters:
- "discover" (DNS-SD triggered): 30 min minimum between restarts
- "connect" (user/explicit): 10 min minimum between restarts

The QuietObservable on the kicker still provides a 3-second debounce as first-line
defense against rapid mDNS churn before the per-origin rate limit is consulted.

The exchange.kick() method and its supporting #kick field in MessageExchange are
removed entirely — kick logic now lives at the PeerConnection layer via localAbort.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Apollon77 Apollon77 changed the title Fix MRP kick race: honour kick that arrives while retransmission is in flight Redesign MRP kick: abort exchange and restart fresh Mar 13, 2026
@Apollon77 Apollon77 marked this pull request as ready for review March 13, 2026 12:57
@Apollon77 Apollon77 marked this pull request as draft March 13, 2026 14:07
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


You can also share your feedback on Copilot code review. Take the survey.

Apollon77 and others added 2 commits March 13, 2026 16:59
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@Apollon77 Apollon77 marked this pull request as ready for review March 13, 2026 16:57
* allowing the new address to be used immediately. This threshold prevents mDNS churn from
* restarting the handshake too frequently.
*/
mrpKickRestartIntervalDiscover: Duration;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting pretty esoteric :)

Trying to think how we could make these more comprehensible, maybe something like:

retryBackoffResetWindow: {
    connect?: Duration,
    addressChange?: Duration,
}

...but I acknowledge this is no more likely to be understood :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants