Skip to content

fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols#718

Merged
johntmyers merged 1 commit intomainfrom
fix/websocket-101-upgrade-relay-v2
Apr 2, 2026
Merged

fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols#718
johntmyers merged 1 commit intomainfrom
fix/websocket-101-upgrade-relay-v2

Conversation

@johntmyers
Copy link
Copy Markdown
Collaborator

Summary

  • Detect HTTP 101 Switching Protocols in the L7 relay and switch to raw bidirectional TCP instead of re-entering the HTTP parsing loop
  • Validate client sent Upgrade + Connection: Upgrade headers before accepting 101 (rejects unsolicited upgrades from non-compliant upstream servers)
  • Replace bool return type with RelayOutcome enum across the L7Provider trait and all relay functions for cleaner upgrade signaling

Related Issue

Changes

  • crates/openshell-sandbox/src/l7/provider.rs: Add RelayOutcome enum (Reusable/Consumed/Upgraded { overflow }). Update L7Provider::relay trait return type from Result<bool> to Result<RelayOutcome>.
  • crates/openshell-sandbox/src/l7/rest.rs: Detect 101 in relay_response() before the generic 1xx/bodiless handler. Return RelayOutcome directly (no intermediate tuple). Add client_requested_upgrade() validation in relay_http_request_with_resolver(). Update relay_response_to_client wrapper to return RelayOutcome.
  • crates/openshell-sandbox/src/l7/relay.rs: Add shared handle_upgrade() helper. Update relay_rest() and relay_passthrough_with_credentials() to match on RelayOutcome. Add l7_decision = "allow_upgrade" audit log annotation.
  • crates/openshell-sandbox/tests/websocket_upgrade.rs: Integration test spinning up a real WebSocket echo server, upgrading through L7Provider::relay, and exchanging a text frame bidirectionally.
  • crates/openshell-sandbox/Cargo.toml: Add tokio-tungstenite and futures dev-dependencies for integration tests.

Improvements over #683

Area #683 This PR
Return type (bool, u16, Vec<u8>) tuple RelayOutcome enum directly from relay_response()
Trait consistency L7Provider::relay still returned bool Trait updated to Result<RelayOutcome>
Unsolicited 101 No validation Validates client sent Upgrade + Connection: Upgrade headers
Code duplication Duplicate upgrade handling in two relay paths Shared handle_upgrade() helper
Audit logging No upgrade-specific annotation l7_decision = "allow_upgrade" for observability
Dead code relay_response_to_client silently discarded overflow Updated to return RelayOutcome
Integration test Unit test only (relay_response in isolation) Full WebSocket echo test through L7Provider::relay

Security Notes

  • Post-upgrade L7 bypass is inherent and documented: After copy_bidirectional, no further L7 inspection occurs. The initial HTTP request is policy-checked before the upstream ever sees it.
  • Unsolicited 101 rejection: If the upstream sends 101 but the client never requested an upgrade, the connection is killed (Consumed). Prevents a non-compliant upstream from triggering a policy bypass.
  • Overflow buffer is bounded: ~17 KiB max (header read buffer + one read call), immediately forwarded, then dropped.

Testing

  • mise run pre-commit passes (format fixed)
  • All 405 existing unit tests pass
  • 8 new unit tests: 101 overflow capture, 101 no overflow, unsolicited 101 rejection, accepted 101 with upgrade headers, 4 client_requested_upgrade header parsing tests
  • 2 new integration tests: WebSocket upgrade through L7 relay with echo exchange, normal HTTP regression test
  • cargo clippy clean (no new warnings)

Checklist

  • Follows Conventional Commits
  • Architecture docs updated (not applicable — no new components)
  • Tests added for new behavior

Detect 101 Switching Protocols in relay_response() and switch to raw
bidirectional TCP relay instead of re-entering the HTTP parsing loop.

Previously, is_bodiless_response() treated 101 as a generic 1xx
informational response, forwarding only the headers and returning to
the HTTP parsing loop. After a 101, subsequent bytes are upgraded
protocol frames (e.g. WebSocket), not HTTP — causing the relay to
block or silently drop all post-upgrade traffic.

Changes:
- Add RelayOutcome enum (Reusable/Consumed/Upgraded) replacing bool
  return type across L7Provider::relay trait and all relay functions
- Detect 101 before generic 1xx handler in relay_response(), capture
  overflow bytes, return RelayOutcome::Upgraded
- Validate client sent Upgrade + Connection: Upgrade headers before
  accepting 101 (rejects unsolicited upgrades from non-compliant
  upstream servers)
- Extract shared handle_upgrade() helper used by both relay_rest()
  and relay_passthrough_with_credentials()
- Add l7_decision=allow_upgrade audit log annotation for upgrades
- Add unit tests for 101 overflow capture, unsolicited 101 rejection,
  and client_requested_upgrade header validation
- Add integration test: WebSocket echo through L7Provider::relay

Fixes: #652
@davidpeden3
Copy link
Copy Markdown

davidpeden3 commented Apr 1, 2026

tested pr #718 on docker desktop + wsl2 (windows 11, amd64, openshell v0.0.16). clean before/after results confirm the fix works.

setup:

  • icingdeath: windows 11, docker desktop 29.3.1, wsl2 kernel 6.6.87.2
  • openclaw sandbox connecting to gateway at MY_GATEWAY_IP:18789
  • test: socat proxy tunnel inside sandbox netns → raw python websocket upgrade + frame exchange
  • policy: openclaw_gateway allowing socat → MY_GATEWAY_IP:18789

before (stock binary, v0.0.16):
binary md5: 6ed90711e33ec487abf39e32dea43a60

[1/4] connecting to socat tunnel at 127.0.0.1:18790...
[2/4] sending websocket upgrade request...
    upgrade response: http/1.1 101 switching protocols
[3/4] sending websocket text frame ('hello')...
[4/4] waiting for websocket echo response...
[fail] timeout — websocket frames not relayed after 101 (issue #652 confirmed)

after (pr #718 fix/websocket-101-upgrade-relay-v2):
binary md5: 99be46a66d97a8012fed6e36d16a1693

[1/4] connecting to socat tunnel at 127.0.0.1:18790...
[2/4] sending websocket upgrade request...
    upgrade response: http/1.1 101 switching protocols
[3/4] sending websocket text frame ('hello')...
[4/4] waiting for websocket echo response...
    received: opcode=8 length=2 payload='...'
[pass] websocket frame received

proxy logs (after):

debug openshell_sandbox::proxy: evaluate_opa_tcp total: 3ms host=MY_GATEWAY_IP port=18789
debug openshell_sandbox::proxy: handle_tcp_connection dns_resolve_and_tcp_connect: 4ms host=MY_GATEWAY_IP
debug openshell_sandbox::l7::rest: relay_response framing
debug openshell_sandbox::l7::rest: 101 switching protocols — signaling protocol upgrade

the 101 is correctly detected in relay_response(), relayoutcome::upgraded is returned, and handle_upgrade() switches to copy_bidirectional for raw frame relay. the gateway responded with a close frame, confirming end-to-end bidirectional websocket communication through the proxy.

note on pr #720 interaction:

we also merged feat/ocsf-log-integration (#720) into the same build to test the new ocsf structured logging alongside the websocket fix. there's a minor merge conflict: handle_upgrade() in relay.rs uses info!() (from #718), but #720 removes info from the tracing imports in that file in favor of ocsf events. we resolved it by converting the handle_upgrade() log to a networkactivitybuilder ocsf event at severityid::informational (preserving the intended log level from #718).

depending on merge order:

either way it's a one-liner, just flagging it so it doesn't surprise anyone.

@mjamiv
Copy link
Copy Markdown

mjamiv commented Apr 1, 2026

Tested this against a live Slack Socket Mode deployment running OpenClaw inside an OpenShell sandbox. Different setup than @davidpeden3's socat echo test — this exercises the full application path including bidirectional file transfers.

Environment

  • Ubuntu 22.04, Hetzner, x86_64
  • OpenShell 0.0.16 cluster container (unchanged)
  • Sandbox binary built from this branch, swapped in via overlay remount
  • OpenClaw 2026.3.31 with Slack Socket Mode (@slack/socket-mode + ws)
  • All traffic through CONNECT proxy at 10.200.0.1:3128

How we tested

We run two sandboxes under the same gateway — agent (production, stock 0.0.16 binary) and agent2 (test, this PR). Side-by-side, no risk to production.

  1. Built openshell-sandbox from this branch
  2. Swapped binary in agent2 pod only (remount overlay rw, cp, restart pod)
  3. Confirmed Slack connects with our existing tls: skip workaround (baseline)
  4. Removed tls: skip from the slack_websocket policy — on 0.0.16 this kills the connection
  5. Restarted gateway to force fresh WebSocket through the L7 path

Results

Everything works:

  • Slack Socket Mode connects without tls: skip — L7 proxy handles 101 upgrade correctly
  • Inbound/outbound messages normal
  • Inbound file processing (image sent in Slack → downloaded through proxy → vision model) — works
  • Outbound file delivery (image generated → uploaded to Slack via *.slack-files.com) — works
  • No frame corruption, socket hang ups, or connection drops during testing
  • Production sandbox on 0.0.16 completely unaffected throughout

The working policy (for anyone else testing)

slack_websocket:
  name: slack_websocket
  endpoints:
    - host: wss-primary.slack.com
      port: 443
      # tls: skip not needed with this fix
      enforcement: enforce
      access: full
    - host: wss-backup.slack.com
      port: 443
      enforcement: enforce
      access: full
  binaries:
    - { path: /usr/bin/node }
    - { path: /usr/bin/curl }

Bottom line

This fixes #652 for us. On 0.0.16 the same config fails because is_bodiless_response() treats 101 as generic 1xx and re-enters the HTTP parser. With this PR, the transition to raw bidirectional TCP works cleanly. We can drop our tls: skip split-policy workaround once this ships.

Nice work @johntmyers — appreciate the quick turnaround. We didn't test the #720 interaction — only learned about it from @davidpeden3's comment. Our build was this branch only, no other PRs merged in.

@johntmyers johntmyers merged commit c6f3087 into main Apr 2, 2026
16 checks passed
@johntmyers johntmyers deleted the fix/websocket-101-upgrade-relay-v2 branch April 2, 2026 04:39
johntmyers added a commit that referenced this pull request Apr 2, 2026
PR #718 added two log calls for WebSocket upgrade handling:

- 101 Switching Protocols info → NetworkActivity with Upgrade activity.
  This is a significant state change (L7 enforcement drops to raw relay).

- Unsolicited 101 without client Upgrade header → DetectionFinding with
  High severity. A non-compliant upstream sending 101 without a client
  Upgrade request could be attempting to bypass L7 inspection.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Egress proxy fails to relay WebSocket frames after successful HTTP CONNECT + 101 Switching Protocols

4 participants