Description
What happened?
This webhook has a bug: only one ACME DNS challenge can exist at a time across multiple subdomains, although it is perfectly valid and expected to have several.
As a result, this webhook cannot reliably support a cluster with multiple different subdomains: certificate renewal fails at random. The contention grows with the number of subdomains, since it becomes more likely that two of them request certificates close enough together to contend for the same challenge record. Note that the HTTP-01 challenge is not always a usable alternative, as it requires the actual HTTP endpoints to be publicly available. The DNS-01 challenge allows getting a certificate for an endpoint that is not publicly available.
How can we reproduce this?
Here's what happens:
- Alice requests a certificate for alice.example.com using the DNS-01 challenge.
- stackit-cert-manager-webhook creates a TXT record at _acme_challenge.example.com with a challenge for alice.example.com.
- Slightly later, Bob requests a certificate for bob.example.com using the DNS-01 challenge.
- stackit-cert-manager-webhook attempts to create a TXT record at _acme_challenge.example.com.
What should happen at this point:
- stackit-cert-manager-webhook creates a second, separate TXT record at _acme_challenge.example.com with a challenge for bob.example.com.
- Both Alice and Bob get a new certificate for their subdomains.
What actually happens instead (this is the bug):
- stackit-cert-manager-webhook finds that a set of records already exists (since Alice's record is already there), and just updates Alice's existing TXT record with a new TTL. Bob's request is otherwise completely ignored, and no challenge is created for Bob's domain.
- Only Alice gets a new certificate for her subdomain. Bob doesn't.
- Every time Bob requests a new certificate, stackit-cert-manager-webhook just updates Alice's ACME challenge with a new TTL instead. So Bob remains without a new certificate until after a cleanup of Alice's challenge occurs.
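To illustrate the expected behaviour described above, here is a minimal sketch of an append-instead-of-overwrite step. It is not the webhook's actual code: `RecordSet`, `presentChallenge` and the in-memory `zone` map are hypothetical stand-ins for the real STACKIT DNS client, and the keys are placeholder strings.

```go
package main

import "fmt"

// RecordSet is a hypothetical stand-in for the provider's TXT record set.
// It is NOT the STACKIT DNS API type.
type RecordSet struct {
	Name    string
	TTL     int
	Records []string
}

// presentChallenge adds key to the record set for name. It creates the set if
// it does not exist yet and leaves any challenge values already in it untouched.
func presentChallenge(zone map[string]*RecordSet, name, key string, ttl int) {
	rs, ok := zone[name]
	if !ok {
		zone[name] = &RecordSet{Name: name, TTL: ttl, Records: []string{key}}
		return
	}
	for _, existing := range rs.Records {
		if existing == key {
			return // already present, so repeated calls are harmless
		}
	}
	// Append instead of overwriting, so concurrent challenges coexist.
	rs.Records = append(rs.Records, key)
}

func main() {
	zone := map[string]*RecordSet{}
	name := "_acme_challenge.example.com" // record name as written in this issue
	presentChallenge(zone, name, "alice-key", 60)
	presentChallenge(zone, name, "bob-key", 60)
	fmt.Println(zone[name].Records) // [alice-key bob-key]
}
```

With this shape, Bob's request adds a second value next to Alice's rather than only refreshing the TTL of the existing one.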
Search
- I did search for other open and closed issues before opening this.
Code of Conduct
- I agree to follow this project's Code of Conduct
Additional context
Note that Let's Encrypt in particular expects multiple TXT records.
You can have multiple TXT records in place for the same name.
There's also this old cert-manager discussion where this was explicitly allowed and other implementations were updated.
There seems to be a related bug in CleanUp(): the cert-manager specification quoted in the comments states that only the requested challenge key should be deleted, but the implementation just throws out the entire record set for a specific domain. Both code paths seem to rely on a hidden assumption that only one ACME challenge can exist at a time, whereas multiple concurrent challenges are both allowed and expected.
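As a rough sketch of what a key-scoped cleanup could look like (again not the webhook's real code), the hypothetical helper `removeChallenge` below drops only the requested key and reports whether the set is now empty. A caller could use that flag to decide whether to delete the whole record set.

```go
package main

import "fmt"

// removeChallenge returns the records without key, plus a flag telling the
// caller whether nothing is left. Hypothetical helper, for illustration only.
func removeChallenge(records []string, key string) (remaining []string, empty bool) {
	remaining = make([]string, 0, len(records))
	for _, r := range records {
		if r != key {
			remaining = append(remaining, r) // keep other pending challenges
		}
	}
	return remaining, len(remaining) == 0
}

func main() {
	records := []string{"alice-key", "bob-key"}

	// Cleaning up Alice's challenge must leave Bob's in place.
	remaining, empty := removeChallenge(records, "alice-key")
	fmt.Println(remaining, empty) // [bob-key] false

	// Only once the last key is removed can the whole set be deleted.
	remaining, empty = removeChallenge(remaining, "bob-key")
	fmt.Println(remaining, empty) // [] true
}
```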