consomme: return SERVFAIL for rate-limited UDP DNS queries#3196
benhillis wants to merge 1 commit into microsoft:main
Pull request overview
This PR addresses a reliability bug in consomme’s DNS resolver where UDP DNS queries could be silently dropped once the pending-request limit is reached, by returning a synthetic SERVFAIL to the guest instead of letting the query time out.
Changes:
- Check the submit_query() return value in submit_udp_query().
- When rate-limited, enqueue a synthetic SERVFAIL response back through the UDP response channel.
submit_udp_query() was silently discarding the return value of submit_query(), so when the pending-request limit (256) was reached the query was dropped without any response. The guest resolver would then wait for a timeout before retrying (often falling back to TCP), causing intermittent DNS resolution failures under load.

Send a synthetic SERVFAIL response instead so the guest gets a timely negative answer and can retry immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
99ed6a2 to 2777be7
    if !self.submit_query(request, sender.clone()) {
        // Rate-limited: send a SERVFAIL so the guest gets a timely
        // negative response instead of waiting for a timeout.
        //
        // Increment the counter so the decrement in
        // `poll_udp_response()` stays balanced.
        self.pending_requests += 1;
        sender.send(DnsResponse {
pending_requests is incremented for every rate-limited UDP query, which means it can grow beyond max_pending_requests under sustained load (even though no backend query was submitted). Because submit_query() only accepts when pending_requests < max_pending_requests, this can keep the resolver in a prolonged “rate-limited” state and effectively turns queued synthetic responses into unbounded backpressure; combined with the unbounded mpsc queue this can amplify memory/CPU usage under a UDP flood. Consider separating accounting for backend in-flight queries vs queued synthetic responses (e.g., add a flag/source on DnsResponse so poll_udp_response() only decrements for backend-submitted queries, or use a separate channel/counter with its own cap).
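The separation suggested above could look roughly like the following. This is a minimal, self-contained model, not the consomme code: the `ResponseSource` enum, the `queued_synthetic` counter and its cap, and the simplified `Resolver` and `DnsResponse` shapes are all assumptions made for illustration. Only the names `submit_query`, `submit_udp_query`, `poll_udp_response`, `pending_requests`, and `max_pending_requests` come from the PR.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical tag recording where a queued response came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResponseSource {
    Backend,
    Synthetic,
}

/// Simplified stand-in for the real response type (the flow is omitted).
struct DnsResponse {
    source: ResponseSource,
    response_data: Vec<u8>,
}

/// Simplified stand-in for the resolver state.
struct Resolver {
    pending_requests: usize,
    max_pending_requests: usize,
    queued_synthetic: usize,     // assumption: separate counter
    max_queued_synthetic: usize, // assumption: separate cap
    sender: Sender<DnsResponse>,
    receiver: Receiver<DnsResponse>,
}

impl Resolver {
    fn new(max_pending_requests: usize, max_queued_synthetic: usize) -> Self {
        let (sender, receiver) = channel();
        Self {
            pending_requests: 0,
            max_pending_requests,
            queued_synthetic: 0,
            max_queued_synthetic,
            sender,
            receiver,
        }
    }

    /// Models the backend submission: accepts only below the limit.
    fn submit_query(&mut self) -> bool {
        if self.pending_requests >= self.max_pending_requests {
            return false;
        }
        self.pending_requests += 1;
        true
    }

    /// Returns false only when the query had to be dropped because the
    /// synthetic-response queue is also full.
    fn submit_udp_query(&mut self, servfail: Vec<u8>) -> bool {
        if self.submit_query() {
            return true;
        }
        if self.queued_synthetic >= self.max_queued_synthetic {
            return false;
        }
        // Rate-limited: account for the synthetic reply separately so
        // `pending_requests` never exceeds `max_pending_requests`.
        self.queued_synthetic += 1;
        let _ = self.sender.send(DnsResponse {
            source: ResponseSource::Synthetic,
            response_data: servfail,
        });
        true
    }

    /// Models a backend query finishing and queueing its answer.
    fn complete_backend(&mut self, response_data: Vec<u8>) {
        let _ = self.sender.send(DnsResponse {
            source: ResponseSource::Backend,
            response_data,
        });
    }

    /// Decrements only the counter that matches the response's source.
    fn poll_udp_response(&mut self) -> Option<DnsResponse> {
        let response = self.receiver.try_recv().ok()?;
        match response.source {
            ResponseSource::Backend => {
                self.pending_requests = self.pending_requests.saturating_sub(1)
            }
            ResponseSource::Synthetic => {
                self.queued_synthetic = self.queued_synthetic.saturating_sub(1)
            }
        }
        Some(response)
    }
}
```

In this model a query that arrives while both limits are exhausted is still dropped, which bounds memory under a flood at the cost of bringing back the timeout for that one query. Whether that trade-off is acceptable is a design decision for the PR author.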
    sender.send(DnsResponse {
        flow: request.flow.clone(),
        response_data: build_servfail_response(request.dns_query),
    });
build_servfail_response() copies the entire remainder of the query (query[12..]) into the response while hard-coding ANCOUNT/NSCOUNT/ARCOUNT to 0. For queries that include an EDNS OPT record or any additional section, this produces an internally inconsistent DNS message (trailing bytes that aren’t accounted for by the header counts), which some resolvers treat as malformed. Since this path now triggers on rate-limiting (potentially frequently), it would be safer to copy only the question section (based on QDCOUNT and parsing QNAME/QTYPE/QCLASS) and omit any additional records, or to set ARCOUNT consistently if you intend to echo OPT.
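One way to copy only the question section is sketched below. The function name `build_servfail_question_only` is hypothetical, and its signature (a byte slice in, an `Option<Vec<u8>>` out) is an assumption, not the signature of the existing `build_servfail_response()`. Setting the RA bit in the reply is also an assumption about the desired behavior. The header layout and RCODE 2 (SERVFAIL) follow RFC 1035.

```rust
/// Hypothetical helper: builds a SERVFAIL reply containing only the DNS
/// header and the question section of `query`. Returns `None` if the
/// query is truncated or its question section cannot be parsed.
fn build_servfail_question_only(query: &[u8]) -> Option<Vec<u8>> {
    const HEADER_LEN: usize = 12;
    if query.len() < HEADER_LEN {
        return None;
    }

    // Walk QDCOUNT questions to find where the question section ends.
    let qdcount = u16::from_be_bytes([query[4], query[5]]);
    let mut pos = HEADER_LEN;
    for _ in 0..qdcount {
        // QNAME: a sequence of length-prefixed labels ending in a zero byte.
        loop {
            let len = *query.get(pos)? as usize;
            pos += 1;
            if len == 0 {
                break;
            }
            // Compression pointers and reserved label types are rejected
            // here for simplicity.
            if len & 0xC0 != 0 {
                return None;
            }
            pos += len;
        }
        // QTYPE (2 bytes) + QCLASS (2 bytes).
        pos += 4;
        if pos > query.len() {
            return None;
        }
    }

    let mut response = Vec::with_capacity(pos);
    response.extend_from_slice(&query[0..2]); // transaction ID
    // QR=1, keep the query's opcode (0x78) and RD bit (0x01); clear AA/TC.
    response.push(0x80 | (query[2] & 0x79));
    // RA=1 (assumption), RCODE=2 (SERVFAIL).
    response.push(0x80 | 0x02);
    response.extend_from_slice(&query[4..6]); // QDCOUNT as received
    response.extend_from_slice(&[0u8; 6]); // ANCOUNT, NSCOUNT, ARCOUNT = 0
    response.extend_from_slice(&query[HEADER_LEN..pos]); // question only
    Some(response)
}
```

Because the copy stops at the end of the last question, an EDNS OPT record or any other trailing record in the query is left out, so the zeroed section counts match the bytes that follow the header.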
Bug: submit_udp_query() in consomme's DNS resolver silently discards queries when the pending-request limit (256) is reached. The return value of submit_query() (which returns false when rate-limited) is ignored, so the query vanishes: no SERVFAIL, no error, nothing. The guest's resolver waits for a response that never comes, eventually times out, and may fall back to TCP DNS (which can also be affected under sustained load).

Symptoms: Intermittent DNS resolution failures, especially under load. Tests exercising both UDP and TCP DNS via dig and dig +tcp can fail flakily.

Fix: When submit_query() returns false, send a synthetic SERVFAIL response through the UDP channel so the guest gets a timely negative answer and can retry immediately instead of waiting for a timeout.