Let's Encrypt DNS challenge method fails when using slave autoconfiguration

Hi,

I am encountering an issue generating and renewing Let's Encrypt certificates that only manifests when using slave DNS servers with Virtualmin (as in the "DNS Slave Auto-configuration" document found at https://www.virtualmin.com/documentation/dns/slave-configuration).

Right now I have a cluster of 4 production Web servers, one of which runs Virtualmin and acts as the primary BIND DNS server, and the other 3 run Webmin and act as slave DNS servers. All three slaves are set up in the Webmin Servers Index module and they are also set up as cluster slave servers in the BIND DNS module. They are all working properly, responding to requests for records, etc.; no issues exist with the master or slave zones and when I edit a DNS record in Virtualmin it is quickly deployed to all the slaves without issue. Also, all slaves can request transfers from the other slaves and the master without issue as well.

But it's not all sunshine and roses, as Let's Encrypt certificates fail to be generated or renewed after I set up slave DNS servers, but the output gives no indication as to what could be wrong. For reference, I have a development/staging server that runs Virtualmin and BIND, and does not have any slave servers, and Let's Encrypt works just fine by way of the DNS challenge type on that server.

All servers are running CentOS Linux 7.7, Virtualmin 6.08 Pro (or Webmin 1.932 in the case of the DNS slaves), and the most recent versions of BIND and Certbot available for my platform. Here is the output from one of my renewal attempts earlier this evening:

Requesting SSL certificate for je-marketingsolutions.com www.je-marketingsolutions.com je-digitalmarketing.com www.je-digitalmarketing.com ..
.. failed : DNS-based validation failed :
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for www.je-digitalmarketing.com
Running manual-auth-hook command: /etc/webmin/webmin/letsencrypt-dns.pl
Waiting for verification...
Challenge failed for domain www.je-digitalmarketing.com
dns-01 challenge for www.je-digitalmarketing.com
Cleaning up challenges
Running manual-cleanup-hook command: /etc/webmin/webmin/letsencrypt-cleanup.pl
Some challenges have failed.
IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: www.je-digitalmarketing.com
   Type: dns
   Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.je-digitalmarketing.com

Any assistance in this matter would be greatly appreciated. I've set Virtualmin to automatically attempt to renew SSL certificates one month before they are set to expire (so 2 months into the 3-month Let's Encrypt certificate lifetime), so my sites are in no immediate danger of becoming insecure, but there does exist the possibility of launching a new site that can now no longer be secured because of this issue.

Status: 
Active

Comments

I bet the problem is that the Let's Encrypt service is querying your slave DNS servers, but sometimes replication of the validation DNS record is delayed. Try editing the file /etc/webmin/webmin/config and adding the line letsencrypt_dns_wait=60 at the end, and see if that helps.
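For anyone who prefers to script that change, here is a minimal sketch. It assumes the file is plain key=value lines, and the helper name set_config_value is made up for illustration; it is not part of Webmin. It replaces an existing letsencrypt_dns_wait line if one is present and appends one otherwise.

```python
def set_config_value(text: str, key: str, value: str) -> str:
    """Return config text with key=value set.

    An existing "key=..." line is replaced in place; otherwise the
    setting is appended at the end. Other lines are left untouched.
    """
    lines = text.splitlines()
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        # Compare only the part before the first "=" so values are ignored.
        if line.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            break
    else:
        # No existing line for this key was found: add it at the end.
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

To use it, read /etc/webmin/webmin/config, pass the contents through set_config_value(contents, "letsencrypt_dns_wait", "60"), and write the result back (after taking a backup of the file).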

When I request an SSL certificate with Let's Encrypt, it shows the following error:

Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for dev-cafe.thecrystalpos.com
dns-01 challenge for www.dev-cafe.thecrystalpos.com
Hook command "/etc/webmin/webmin/letsencrypt-dns.pl" returned error code 1
Error output from letsencrypt-dns.pl:
Undefined subroutine &main::restart_zone called at /usr/share/webmin/webmin/letsencrypt-dns.pl line 47.

Hook command "/etc/webmin/webmin/letsencrypt-dns.pl" returned error code 1
Error output from letsencrypt-dns.pl:
Undefined subroutine &main::restart_zone called at /usr/share/webmin/webmin/letsencrypt-dns.pl line 47.

Waiting for verification...
Cleaning up challenges
Hook command "/etc/webmin/webmin/letsencrypt-cleanup.pl" returned error code 1
Error output from letsencrypt-cleanup.pl:
Undefined subroutine &main::restart_zone called at /usr/share/webmin/webmin/letsencrypt-cleanup.pl line 38.

Hook command "/etc/webmin/webmin/letsencrypt-cleanup.pl" returned error code 1
Error output from letsencrypt-cleanup.pl:
Undefined subroutine &main::restart_zone called at /usr/share/webmin/webmin/letsencrypt-cleanup.pl line 38.

Failed authorization procedure. dev-cafe.thecrystalpos.com (dns-01): urn:ietf:params:acme:error:dns :: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.dev-cafe.thecrystalpos.com, www.dev-cafe.thecrystalpos.com (dns-01): urn:ietf:params:acme:error:dns :: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.dev-cafe.thecrystalpos.com
IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: dev-cafe.thecrystalpos.com
   Type: None
   Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.dev-cafe.thecrystalpos.com

   Domain: www.dev-cafe.thecrystalpos.com
   Type: None
   Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.dev-cafe.thecrystalpos.com

OK, it looks like you hit a different bug. Try editing the file /usr/share/webmin/webmin/letsencrypt-dns.pl and changing &restart_zone to &bind8::restart_zone.
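The substitution can also be done programmatically. The sketch below is only an illustration of that one text change; the function name qualify_restart_zone is hypothetical. Since the error output above shows the same undefined subroutine in letsencrypt-cleanup.pl, the same change may be needed there, though that is an assumption rather than something confirmed in this thread.

```python
import re


def qualify_restart_zone(source: str) -> str:
    """Rewrite bare &restart_zone calls as &bind8::restart_zone.

    Running it twice is harmless: an already-qualified call reads
    "&bind8::restart_zone", which does not contain "&restart_zone".
    """
    return re.sub(r"&restart_zone\b", "&bind8::restart_zone", source)
```

Read the script's contents, pass them through qualify_restart_zone, and write the result back after saving a copy of the original file.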

Haha @JamieCameron, you want me to program it. Can you fix this issue and release the fix in the next version of Webmin?

As far as I can see, DNS validation currently does not work. Requests are succeeding via file-based validation, so it is not a big problem yet. However, the wildcard domain does not get renewed automatically, so I renew it manually from the command line.

I ran into the same error and after following the direction to add &bind8:: to the call, I am receiving a large number of invalid error responses like this:

2019-11-29 15:58:34,226:DEBUG:acme.client:Storing nonce: 0001k858gu66CbwyzczWJnmlccoP2NNWOGn3YdEOgO6Y3cE
2019-11-29 15:58:34,227:DEBUG:acme.client:JWS payload:
b''
2019-11-29 15:58:34,229:DEBUG:acme.client:Sending POST request to https://acme-v02.api.letsencrypt.org/acme/authz-v3/1463873457:
{
  "signature": "U2LdVRa5RMICUz9ytM5ATaSrK6UHVUW1WVm-4iMxuMERprz9fm93rWZQ4u-Fdp6tvo4VDiezjVdHnGOQ8EyNKAZdzy_7SpQh9VuEkETJq2cSZB6F5xvH8vhV_mP767f-6vWSKS-I-1UP-pykD7qzE7pa8AWAyVX-GRgoCWOf5ULoR1Ue2QLrfLtv85j_mZbJnG32nLVfShB2PC32lvvDPbnwR7Nmh7OtjYimrVMgZABnCQffUn21aCkTAs_Kn9R8Yx1sXNokMu3w9WUwNAqBJFogQsL6xL35JBTI01DgGaA8OSlFvEomhuYXM5dmLoINp_Vq6AAPhFkWw3B4Kk-CPg",
  "payload": "",
  "protected": "eyJub25jZSI6ICIwMDAxazg1OGd1NjZDYnd5emN6V0pubWxjY29QMk5OV09HbjNZZEVPZ082WTNjRSIsICJ1cmwiOiAiaHR0cHM6Ly9hY21lLXYwMi5hcGkubGV0c2VuY3J5cHQub3JnL2FjbWUvYXV0aHotdjMvMTQ2Mzg3MzQ1NyIsICJhbGciOiAiUlMyNTYiLCAia2lkIjogImh0dHBzOi8vYWNtZS12MDEuYXBpLmxldHNlbmNyeXB0Lm9yZy9hY21lL3JlZy8xNjIyNDAzMyJ9"
}
2019-11-29 15:58:34,275:DEBUG:requests.packages.urllib3.connectionpool:https://acme-v02.api.letsencrypt.org:443 "POST /acme/authz-v3/1463873457 HTTP/1.1" 200 989
2019-11-29 15:58:34,276:DEBUG:acme.client:Received response:
HTTP 200
Server: nginx
Date: Fri, 29 Nov 2019 15:58:34 GMT
Content-Type: application/json
Content-Length: 989
Connection: keep-alive
Boulder-Requester: 16224033
Cache-Control: public, max-age=0, no-cache
Link: <https://acme-v02.api.letsencrypt.org/directory>;rel="index"
Replay-Nonce: 00021inveITR26vFxNO6Brme6BZPJHhUw1xyu34MH_P7pwo
X-Frame-Options: DENY
Strict-Transport-Security: max-age=604800

{
  "identifier": {
    "type": "dns",
    "value": "smtp.mydomain.com"
  },
  "status": "invalid",
  "expires": "2019-12-06T15:51:20Z",
  "challenges": [
    {
      "type": "http-01",
      "status": "invalid",
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/1463873457/W20Eaw",
      "token": "RUTxRB2Is21fl7Yd7Qvs4dF1WzUbk-llZIU31WCkZbg"
    },
    {
      "type": "dns-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:dns",
        "detail": "DNS problem: NXDOMAIN looking up TXT for _acme-challenge.smtp.mydomain.com",
        "status": 400
      },
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/1463873457/Urx8Wg",
      "token": "RUTxRB2Is21fl7Yd7Qvs4dF1WzUbk-llZIU31WCkZbg"
    },
    {
      "type": "tls-alpn-01",
      "status": "invalid",
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/1463873457/K7VhiA",
      "token": "RUTxRB2Is21fl7Yd7Qvs4dF1WzUbk-llZIU31WCkZbg"
    }
  ]
}

Yes, the next release of Webmin (due out soon) will include this fix - so you can just wait.

So returning to the original issue for a bit, I added the wait parameter to /etc/webmin/webmin/config (it was actually already there but set to 10, so I increased it to 60). This only partially solved the problem: Virtualmin still failed to renew certificates on the first attempt, but a little while later it retried and renewed successfully. So Virtualmin now e-mails me a failure message and then, a little while later, tries again automatically and the renewal succeeds.

Is there a way to make Virtualmin renew the certificate successfully the first time/try, without failing first?

The first time it failed, did you still get the exact same message about NXDOMAIN looking up TXT for _acme-challenge.dev-cafe.thecrystalpos.com?

As an example of what I am seeing now, even after changing the Let's Encrypt DNS wait time in /etc/webmin/webmin/config from 10 seconds (the existing value) to 60 seconds (the new value), I am still getting errors like this when Virtualmin goes to renew a certificate, several days after making the change and restarting Webmin:

An error occurred requesting a new certificate for sicilyspizzaeaston.com, www.sicilyspizzaeaston.com from Let's Encrypt :

DNS-based validation failed :
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for sicilyspizzaeaston.com
dns-01 challenge for www.sicilyspizzaeaston.com
Running manual-auth-hook command: /etc/webmin/webmin/letsencrypt-dns.pl
Running manual-auth-hook command: /etc/webmin/webmin/letsencrypt-dns.pl
Waiting for verification...
Challenge failed for domain sicilyspizzaeaston.com
Challenge failed for domain www.sicilyspizzaeaston.com
dns-01 challenge for sicilyspizzaeaston.com
dns-01 challenge for www.sicilyspizzaeaston.com
Cleaning up challenges
Running manual-cleanup-hook command: /etc/webmin/webmin/letsencrypt-cleanup.pl
Running manual-cleanup-hook command: /etc/webmin/webmin/letsencrypt-cleanup.pl
Some challenges have failed.
IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: sicilyspizzaeaston.com
   Type: dns
   Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.sicilyspizzaeaston.com

   Domain: www.sicilyspizzaeaston.com
   Type: dns
   Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.sicilyspizzaeaston.com

I have confirmed that the domain is using Virtualmin's name servers (the primary and the slaves I configured through Virtualmin) and that all DNS servers are functioning properly and accepting requests.

Do you control the DNS servers that are secondaries for this domain? If so, on the master system, is BIND set up to notify the secondaries when a record is changed?

Yes. I control all four name servers (one master and three slaves), and all slave servers were set up using the DNS Slave Auto-Configuration guide in the Virtualmin docs. I would assume BIND is set up as described on the master system but I haven't changed any settings other than following Virtualmin's official slave DNS configuration guide. I did however do a DNS test where I added a record to a zone on the master (Virtualmin > Server Configuration > DNS Records) and then did a DNS lookup on one of the slaves and the record was present on that slave. I don't remember how long I waited in between adding the record and performing the test though (this was a few weeks ago when trying to resolve a different issue).

Just to verify that notifications for new records are being received by slave systems, can you check if anything gets logged by bind or named to /var/log/messages on all the slaves when you try to request a Let's Encrypt cert?

Hmm...I ran the command 'cat /var/log/messages | grep "sicilyspizzaeaston.com"' on the master and the slave servers after re-attempting to request the Let's Encrypt certificate for the domain but no results were returned on any of the four servers.

I don't suppose there is any firewall blocking UDP or TCP ports 53 from the master to slave systems?

No, we unfortunately don't have a firewall in place right now because when I set up this new cluster of systems I realized FirewallD is not supported on the Linode-provided kernel they are currently running.

If you add a DNS record on the master system (using the DNS Records page in Virtualmin), does it get immediately replicated to the slaves?

I just added a test DNS record to the domain on Virtualmin and less than 3 minutes later ran the following commands, trying to retrieve the A record I just added across all four of my name servers (the master and three slaves, with ns1 being the master):

lmerrill@AirshockMBP ~ % host -t a test-1.sicilyspizzaeaston.com ns1.jemediacorp.com
Using domain server:
Name: ns1.jemediacorp.com
Address: 45.79.158.84#53
Aliases:

test-1.sicilyspizzaeaston.com has address 45.79.158.84
lmerrill@AirshockMBP ~ % host -t a test-1.sicilyspizzaeaston.com ns2.jemediacorp.com
Using domain server:
Name: ns2.jemediacorp.com
Address: 45.56.99.222#53
Aliases:

test-1.sicilyspizzaeaston.com has address 45.79.158.84
lmerrill@AirshockMBP ~ % host -t a test-1.sicilyspizzaeaston.com ns3.jemediacorp.com
Using domain server:
Name: ns3.jemediacorp.com
Address: 45.79.131.13#53
Aliases:

test-1.sicilyspizzaeaston.com has address 45.79.158.84
lmerrill@AirshockMBP ~ % host -t a test-1.sicilyspizzaeaston.com ns4.jemediacorp.com
Using domain server:
Name: ns4.jemediacorp.com
Address: 66.228.34.214#53
Aliases:

test-1.sicilyspizzaeaston.com has address 45.79.158.84

Edit: All commands were run on my local MacBook Pro connected to my apartment's wireless network. They were not run on any server.

Can you re-try that and see if the record appears within 1 minute on all slaves?

The test was successful. I added the DNS record test-2.sicilyspizzaeaston.com at 9:30 AM EST this morning and then immediately went into Terminal.app on my Mac and ran the host command with all 4 name servers as I did yesterday. The command also ran at 9:30 AM, just after I added the record, and here are the results:

Using domain server:
Name: ns1.jemediacorp.com
Address: 45.79.158.84#53
Aliases:

test-2.sicilyspizzaeaston.com has address 45.56.99.222
Using domain server:
Name: ns2.jemediacorp.com
Address: 45.56.99.222#53
Aliases:

test-2.sicilyspizzaeaston.com has address 45.56.99.222
Using domain server:
Name: ns3.jemediacorp.com
Address: 45.79.131.13#53
Aliases:

test-2.sicilyspizzaeaston.com has address 45.56.99.222
Using domain server:
Name: ns4.jemediacorp.com
Address: 66.228.34.214#53
Aliases:

test-2.sicilyspizzaeaston.com has address 45.56.99.222

Ok, so during a Let's Encrypt cert request, does a record like _acme-challenge.sicilyspizzaeaston.com ever appear on master or slave systems?

I'm not quite sure how I would monitor that, since those records get added and removed quickly by Virtualmin, correct? What would you suggest as the best way to spot the record before it gets removed when the request fails?
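One way to catch a short-lived record is to poll every name server in a loop while the certificate request is running. The following is a rough sketch, not a Virtualmin feature: it assumes the dig utility is installed on the machine running it, and the function names dig_txt and watch_txt are invented for this example. The lookup function is passed in as a parameter so the polling logic can be tried without touching the network.

```python
import subprocess
import time
from typing import Callable, Dict, List


def dig_txt(name: str, server: str) -> List[str]:
    """Ask one name server for TXT records via `dig +short`.

    Returns an empty list when nothing is returned or dig cannot be run.
    """
    try:
        result = subprocess.run(
            ["dig", "+short", "+time=2", "+tries=1", "TXT", name, f"@{server}"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    # Skip blank lines and dig's ";;" diagnostic lines; drop surrounding quotes.
    return [
        line.strip().strip('"')
        for line in result.stdout.splitlines()
        if line.strip() and not line.startswith(";")
    ]


def watch_txt(
    name: str,
    servers: List[str],
    lookup: Callable[[str, str], List[str]] = dig_txt,
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Poll each server and record the attempt (1-based) on which it
    first returned a TXT record. Servers that never answered are absent."""
    first_seen: Dict[str, int] = {}
    for attempt in range(1, attempts + 1):
        for server in servers:
            if server not in first_seen and lookup(name, server):
                first_seen[server] = attempt
        if len(first_seen) == len(servers):
            break  # every server has served the record at least once
        sleep(delay)
    return first_seen
```

Started just before requesting the certificate, a call such as watch_txt("_acme-challenge.sicilyspizzaeaston.com", ["ns1.jemediacorp.com", "ns2.jemediacorp.com", "ns3.jemediacorp.com", "ns4.jemediacorp.com"]) would show whether the record reaches the slaves at all: a slave missing from the result never served it during the polling window.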

OK, so as time goes by I'm learning more and more about this issue. Virtualmin tried a few times over the course of today to renew a certificate for another domain that we host and control the DNS for, but failed every time with the same errors I've been seeing. However I just got an e-mail from Virtualmin saying a certificate for that domain was successfully requested and installed from Let's Encrypt. Same domain that was failing earlier, and we've made no changes to our Virtualmin server today. So I'm not sure why it would be failing one minute but succeeding the next, it's very odd.

It feels like the problem is propagation of changed records to the slave servers, which means that the check will randomly succeed if the Let's Encrypt service happens to query the DNS master.
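A quick way to test that theory is to compare the SOA serial each name server reports for the zone while a request is in progress: a slave that reports an older serial than the master has not yet picked up the latest change. Below is a small sketch under two assumptions: that the zone's serial is incremented when the challenge record is added, and that the input lines come from `dig +short SOA <zone> @<server>`. The function names are made up for illustration, and serial wrap-around is ignored.

```python
from typing import Dict, List


def soa_serial(dig_short_soa: str) -> int:
    """Extract the serial from one line of `dig +short SOA` output.

    The line has the form: mname rname serial refresh retry expire minimum
    """
    fields = dig_short_soa.split()
    if len(fields) < 3:
        raise ValueError(f"not an SOA line: {dig_short_soa!r}")
    return int(fields[2])


def lagging_servers(serials: Dict[str, int], master: str) -> List[str]:
    """Return the servers whose serial is lower than the master's."""
    master_serial = serials[master]
    return sorted(
        server for server, serial in serials.items()
        if server != master and serial < master_serial
    )
```

If lagging_servers returns an empty list while validation still fails, stale zone data on the slaves is less likely to be the cause.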

That makes sense. So you're saying if Let's Encrypt queries the slaves instead of the master it will fail? How can we resolve that issue? I mean we know that when I add a DNS record manually, I can look it up on all the slave servers, as demonstrated in our test a few days ago.

Regarding comment #22 - you can use the letsencrypt_dns_wait=60 setting mentioned above to increase the time that the DNS records are kept around for.

I already used that setting, I set it in /etc/webmin/webmin/config as instructed (see my above comment) and I am still having issues. Virtualmin attempted to renew a certificate for one of our domains this evening and of course the DNS challenge method failed, but searching the logs for the acme-challenge DNS record produced no results on either the master or any of the slaves.

Any update from the Virtualmin guys on this? We are starting to launch new Websites for some of our new clients and are having problems issuing SSL certificates for those sites because of this ongoing problem. Even with the Let's Encrypt DNS wait time set to 60 in /etc/webmin/webmin/config certificate issuance only succeeds every once in a while, probably when the master DNS server is queried by the Let's Encrypt client, and even though manually adding a DNS record in Virtualmin causes it to propagate to the slaves immediately, the same cannot be said concretely for when it's done automatically via Let's Encrypt.

Submitted by Ilia on Thu, 12/12/2019 - 03:12

Jamie has to finish a new feature for the ACL module, and the new release of Webmin/Usermin should be coming within 5-7 days, or even sooner.

Sorry for the inconvenience.

This one is complex because it seems specific to your system. So I'd have to log in myself to debug what's going on here.

I was doing some more testing this evening and discovered something very interesting! The issue appears to be down to the fact that Virtualmin only copies newly-added DNS records to slave servers if they are added manually from within the GUI, not automatically via Let's Encrypt. I tested this by starting a certificate renewal request, then using the 60-second wait time to go into Virtualmin > Server Configuration > DNS Records, clicking on the _acme-challenge TXT record that had just been created, then hitting Save. If done before the 60 seconds expired, the record would be copied to the slave servers in time for the renewal request to complete successfully.

So in other words, when requesting a certificate renewal, Virtualmin does not copy the generated DNS record to the slave servers at all, but when adding a DNS record manually via the GUI it does.

Jamie, I can still provide you with login information to my system if you'd like, but unless my system is configured in some super weird way, this may be a Virtualmin-specific issue after all; does the Let's Encrypt renewal routine skip calling whatever code that copies DNS records to slave servers?

Any update on this?

If the records aren't being replicated just for Let's Encrypt, that does sound like a bug. I can't reproduce this, so access to your system would be really useful to fix it.

Hi Jamie,

You're certainly welcome to access my system to take a look at this potential bug. I just logged into Virtualmin and it says that remote logins are already activated, so I must have turned them on for an earlier issue or something. The IP address of the system is 45.79.158.84. Let me know if you need any other info to log in. Thanks!

Jamie, I'm not sure if this is related to this bug, but I also just ran into this:

Error registering: Account creation on ACMEv1 is disabled. Please upgrade your ACME client to a version that supports ACMEv2 / RFC 8555

Thanks, I'm in. Which domain were you having trouble for most recently?

Ok I found the cause - there was a misconfiguration in the BIND module in Webmin on your system, which due to a bug was silently ignored and caused the DNS change to not be picked up. I've corrected your config, and will include a fix in the next Webmin release.

Thanks for your prompt repair of this issue, Jamie. Do you mind if I ask what option in Webmin's BIND module was configured incorrectly? What did you set the correct config value to?

Hi Jamie, it seems every time we think the issue is fixed something else pops up LOL. After you said last night that you found and fixed the Virtualmin bug that was causing this issue, Virtualmin tried to auto-renew one of our certificates this morning (for a different domain than the one I gave you yesterday) and DNS-based validation failed once again with pretty much the same errors as before. Remote logins to my system are still active if you want to log back in and take a look at things again. Here's the output from Virtualmin:

An error occurred requesting a new certificate for bendersdaylightdonuts.com, www.bendersdaylightdonuts.com from Let's Encrypt :
DNS-based validation failed :
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for www.bendersdaylightdonuts.com
Running manual-auth-hook command: /etc/webmin/webmin/letsencrypt-dns.pl
Waiting for verification...
Challenge failed for domain www.bendersdaylightdonuts.com
dns-01 challenge for www.bendersdaylightdonuts.com
Cleaning up challenges
Running manual-cleanup-hook command: /etc/webmin/webmin/letsencrypt-cleanup.pl
Some challenges have failed.
IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: www.bendersdaylightdonuts.com
   Type: dns
   Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.bendersdaylightdonuts.com

It was trying to use the rndc config file /etc/rndc.conf, which didn't exist on your system.

Update certbot

yum install socat certbot
certbot register

Jamie,

I am continuing to get failed certificate renewal e-mails from Virtualmin as if the bug you fixed had come back or something. Is there something I should check on my system? I posted an example of the e-mail in an earlier reply from the other day. It almost seems like the records are not being copied again because every few hours Virtualmin will succeed in renewing a certificate for a domain that failed before, making me think that Let's Encrypt is hitting the primary DNS server during those times and the slave servers at all other times.

Submitted by Ilia on Sun, 12/29/2019 - 03:48

You need to update to the latest Webmin 1.940. It will be available on the repos pretty soon. You could also update it using the command below:

yum update http://download.webmin.com/download/yum/webmin-1.940-2.noarch.rpm