consomme: return SERVFAIL for rate-limited UDP DNS queries #3196

Open
benhillis wants to merge 1 commit into microsoft:main from benhillis:fix-dns-udp-silent-drop

Conversation

@benhillis
Member

Bug: submit_udp_query() in consomme's DNS resolver silently discards queries when the pending-request limit (256) is reached. The return value of submit_query() (which returns false when rate-limited) is ignored, so the query vanishes: no SERVFAIL, no error, nothing. The guest's resolver waits for a response that never comes, eventually times out, and may fall back to TCP DNS (which can also be affected under sustained load).

Symptoms: Intermittent DNS resolution failures, especially under load. Tests exercising both UDP and TCP DNS via dig and dig +tcp can fail flakily.

Fix: When submit_query() returns false, send a synthetic SERVFAIL response through the UDP channel so the guest gets a timely negative answer and can retry immediately instead of waiting for a timeout.
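
For readers following along, the control flow described above can be sketched roughly as follows. This is a simplified stand-in, not the code from the PR: the `Resolver`, `DnsRequest`, and `DnsResponse` definitions, the `u32` flow type, and the `servfail_header` helper are all assumptions made for illustration, and the sketch leaves out the `pending_requests` bookkeeping that the real change performs on the rate-limited path.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical stand-ins for the resolver's types. Field names follow the
/// excerpt quoted in the review; the real definitions may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct DnsResponse {
    pub flow: u32,
    pub response_data: Vec<u8>,
}

pub struct DnsRequest<'a> {
    pub flow: u32,
    pub dns_query: &'a [u8],
}

pub struct Resolver {
    pending_requests: usize,
    max_pending_requests: usize,
}

impl Resolver {
    pub fn new(max_pending_requests: usize) -> Self {
        Self { pending_requests: 0, max_pending_requests }
    }

    /// Returns false when the pending-request limit has been reached.
    fn submit_query(&mut self, _request: &DnsRequest<'_>, _sender: Sender<DnsResponse>) -> bool {
        if self.pending_requests >= self.max_pending_requests {
            return false;
        }
        self.pending_requests += 1;
        // A real backend would resolve the query and send the answer later.
        true
    }

    /// Checks the return value instead of ignoring it.
    pub fn submit_udp_query(&mut self, request: &DnsRequest<'_>, sender: &Sender<DnsResponse>) {
        if !self.submit_query(request, sender.clone()) {
            // Rate-limited: answer right away so the guest does not wait
            // for its own timeout.
            let _ = sender.send(DnsResponse {
                flow: request.flow,
                response_data: servfail_header(request.dns_query),
            });
        }
    }
}

/// Hypothetical helper: a header-only SERVFAIL that echoes the query ID.
pub fn servfail_header(query: &[u8]) -> Vec<u8> {
    if query.len() < 12 {
        return Vec::new();
    }
    let mut header = vec![0u8; 12];
    header[0..2].copy_from_slice(&query[0..2]); // transaction ID
    header[2] = 0x80 | (query[2] & 0x79); // QR=1, keep opcode and RD
    header[3] = 0x80 | 0x02; // RA=1, RCODE=2 (SERVFAIL)
    header
}
```

With a limit of 1, the first call to `submit_udp_query` is accepted and produces nothing on the channel yet, while the second call immediately yields a response whose RCODE is 2.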

Copilot AI review requested due to automatic review settings April 5, 2026 19:21
@benhillis benhillis requested a review from a team as a code owner April 5, 2026 19:21
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses a reliability bug in consomme’s DNS resolver where UDP DNS queries could be silently dropped once the pending-request limit is reached, by returning a synthetic SERVFAIL to the guest instead of letting the query time out.

Changes:

  • Check the submit_query() return value in submit_udp_query().
  • When rate-limited, enqueue a synthetic SERVFAIL response back through the UDP response channel.

submit_udp_query() was silently discarding the return value of
submit_query(), so when the pending-request limit (256) was reached
the query was dropped without any response. The guest resolver would
then wait for a timeout before retrying (often falling back to TCP),
causing intermittent DNS resolution failures under load.

Send a synthetic SERVFAIL response instead so the guest gets a timely
negative answer and can retry immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines +163 to +170
if !self.submit_query(request, sender.clone()) {
    // Rate-limited: send a SERVFAIL so the guest gets a timely
    // negative response instead of waiting for a timeout.
    //
    // Increment the counter so the decrement in
    // `poll_udp_response()` stays balanced.
    self.pending_requests += 1;
    sender.send(DnsResponse {

Copilot AI Apr 6, 2026

pending_requests is incremented for every rate-limited UDP query, which means it can grow beyond max_pending_requests under sustained load (even though no backend query was submitted). Because submit_query() only accepts when pending_requests < max_pending_requests, this can keep the resolver in a prolonged “rate-limited” state and effectively turns queued synthetic responses into unbounded backpressure; combined with the unbounded mpsc queue this can amplify memory/CPU usage under a UDP flood. Consider separating accounting for backend in-flight queries vs queued synthetic responses (e.g., add a flag/source on DnsResponse so poll_udp_response() only decrements for backend-submitted queries, or use a separate channel/counter with its own cap).

Copilot uses AI. Check for mistakes.
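
One way to read the suggestion in the comment above is sketched below: tag each queued response with its origin, decrement the in-flight counter only for backend responses, and cap synthetic responses separately. Everything here is hypothetical (the `ResponseSource` enum, the `queued_synthetic` counter, the `VecDeque` standing in for the channel, and all method names other than `poll_udp_response`); it illustrates the accounting idea only and is not the resolver's actual structure.

```rust
use std::collections::VecDeque;

/// Hypothetical tag recording where a queued response came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResponseSource {
    Backend,
    Synthetic,
}

#[derive(Debug)]
pub struct DnsResponse {
    pub source: ResponseSource,
    pub response_data: Vec<u8>,
}

pub struct Resolver {
    pending_requests: usize,     // backend queries in flight
    max_pending_requests: usize,
    queued_synthetic: usize,     // synthetic responses not yet polled
    max_queued_synthetic: usize, // assumed separate cap
    responses: VecDeque<DnsResponse>,
}

impl Resolver {
    pub fn new(max_pending_requests: usize, max_queued_synthetic: usize) -> Self {
        Self {
            pending_requests: 0,
            max_pending_requests,
            queued_synthetic: 0,
            max_queued_synthetic,
            responses: VecDeque::new(),
        }
    }

    /// Accepts a backend query only while under the in-flight limit.
    pub fn submit_query(&mut self) -> bool {
        if self.pending_requests >= self.max_pending_requests {
            return false;
        }
        self.pending_requests += 1;
        true
    }

    /// Simulates the backend finishing a previously accepted query.
    pub fn complete_backend(&mut self, response_data: Vec<u8>) {
        self.responses.push_back(DnsResponse { source: ResponseSource::Backend, response_data });
    }

    /// Queues a synthetic response without touching `pending_requests`.
    /// Returns false (the response is dropped) once the synthetic cap is hit.
    pub fn queue_synthetic(&mut self, response_data: Vec<u8>) -> bool {
        if self.queued_synthetic >= self.max_queued_synthetic {
            return false;
        }
        self.queued_synthetic += 1;
        self.responses.push_back(DnsResponse { source: ResponseSource::Synthetic, response_data });
        true
    }

    /// Decrements whichever counter matches the response's origin.
    pub fn poll_udp_response(&mut self) -> Option<DnsResponse> {
        let response = self.responses.pop_front()?;
        match response.source {
            ResponseSource::Backend => {
                self.pending_requests = self.pending_requests.saturating_sub(1)
            }
            ResponseSource::Synthetic => {
                self.queued_synthetic = self.queued_synthetic.saturating_sub(1)
            }
        }
        Some(response)
    }

    pub fn pending_requests(&self) -> usize {
        self.pending_requests
    }
}
```

The trade-off in this sketch is that a flood beyond the synthetic cap is dropped silently again, but memory stays bounded and `pending_requests` never exceeds `max_pending_requests`.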
Comment on lines +170 to +173
sender.send(DnsResponse {
    flow: request.flow.clone(),
    response_data: build_servfail_response(request.dns_query),
});

Copilot AI Apr 6, 2026

build_servfail_response() copies the entire remainder of the query (query[12..]) into the response while hard-coding ANCOUNT/NSCOUNT/ARCOUNT to 0. For queries that include an EDNS OPT record or any additional section, this produces an internally inconsistent DNS message (trailing bytes that aren’t accounted for by the header counts), which some resolvers treat as malformed. Since this path now triggers on rate-limiting (potentially frequently), it would be safer to copy only the question section (based on QDCOUNT and parsing QNAME/QTYPE/QCLASS) and omit any additional records, or to set ARCOUNT consistently if you intend to echo OPT.
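
The first option mentioned above (copy only the question section) might look something like the following sketch. The function names `question_end` and `build_servfail_question_only` are made up for illustration and are not the names used in consomme; the sketch also assumes that compression pointers never appear in a question's QNAME and treats them as malformed, which a production parser may want to handle differently.

```rust
/// Returns the offset just past the question section, or None if the
/// message is truncated or uses a label form this sketch does not handle.
fn question_end(query: &[u8]) -> Option<usize> {
    let qdcount = u16::from_be_bytes([*query.get(4)?, *query.get(5)?]);
    let mut pos = 12; // the fixed DNS header is 12 bytes
    for _ in 0..qdcount {
        // Walk the QNAME labels until the zero-length root label.
        loop {
            let len = *query.get(pos)? as usize;
            pos += 1;
            if len == 0 {
                break;
            }
            if len & 0xC0 != 0 {
                // Compression pointer or reserved label type: treated as
                // malformed here (an assumption of this sketch).
                return None;
            }
            pos += len;
        }
        pos += 4; // QTYPE (2 bytes) + QCLASS (2 bytes)
    }
    if pos <= query.len() {
        Some(pos)
    } else {
        None
    }
}

/// Builds a SERVFAIL that echoes the ID and question section and omits
/// any answer, authority, or additional records (including EDNS OPT).
pub fn build_servfail_question_only(query: &[u8]) -> Option<Vec<u8>> {
    if query.len() < 12 {
        return None;
    }
    let end = question_end(query)?;
    let mut response = Vec::with_capacity(end);
    response.extend_from_slice(&query[0..2]); // transaction ID
    response.push(0x80 | (query[2] & 0x79)); // QR=1, keep opcode and RD
    response.push(0x80 | 0x02); // RA=1, RCODE=2 (SERVFAIL)
    response.extend_from_slice(&query[4..6]); // QDCOUNT unchanged
    response.extend_from_slice(&[0u8; 6]); // ANCOUNT, NSCOUNT, ARCOUNT = 0
    response.extend_from_slice(&query[12..end]); // question section only
    Some(response)
}
```

Because the response ends where the question section ends, the header counts and the body stay consistent even when the query carried an OPT record. Returning `None` for unparseable input leaves the caller to decide whether to drop the query or fall back to a header-only reply.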

@benhillis benhillis added the bug (Something isn't working) label Apr 6, 2026
Contributor

@damanm24 damanm24 left a comment

lgtm

Labels

bug Something isn't working

3 participants