Let's Encrypt DNS challenge method fails when using slave autoconfiguration

Hi,

I am encountering an issue generating and renewing Let's Encrypt certificates that manifests only when using slave DNS servers with Virtualmin (as in the "DNS Slave Auto-configuration" document found at https://www.virtualmin.com/documentation/dns/slave-configuration).

Right now I have a cluster of 4 production Web servers, one of which runs Virtualmin and acts as the primary BIND DNS server, and the other 3 run Webmin and act as slave DNS servers. All three slaves are set up in the Webmin Servers Index module and they are also set up as cluster slave servers in the BIND DNS module. They are all working properly, responding to requests for records, etc.; no issues exist with the master or slave zones and when I edit a DNS record in Virtualmin it is quickly deployed to all the slaves without issue. Also, all slaves can request transfers from the other slaves and the master without issue as well.

But it's not all sunshine and roses, as Let's Encrypt certificates fail to be generated or renewed after I set up slave DNS servers, but the output gives no indication as to what could be wrong. For reference, I have a development/staging server that runs Virtualmin and BIND, and does not have any slave servers, and Let's Encrypt works just fine by way of the DNS challenge type on that server.

All servers are running CentOS Linux 7.7, Virtualmin 6.08 Pro (or Webmin 1.932 in the case of the DNS slaves), and the most recent versions of BIND and Certbot available for my platform. Here is the output from one of my renewal attempts earlier this evening:

Requesting SSL certificate for je-marketingsolutions.com www.je-marketingsolutions.com je-digitalmarketing.com www.je-digitalmarketing.com ..
.. failed : DNS-based validation failed :
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for www.je-digitalmarketing.com
Running manual-auth-hook command: /etc/webmin/webmin/letsencrypt-dns.pl
Waiting for verification...
Challenge failed for domain www.je-digitalmarketing.com
dns-01 challenge for www.je-digitalmarketing.com
Cleaning up challenges
Running manual-cleanup-hook command: /etc/webmin/webmin/letsencrypt-cleanup.pl
Some challenges have failed.
IMPORTANT NOTES:
- The following errors were reported by the server:

Domain: www.je-digitalmarketing.com
Type: dns
Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.je-digitalmarketing.com

Any assistance in this matter would be greatly appreciated. I've set Virtualmin to automatically attempt to renew SSL certificates one month before they are set to expire (so 2 months into the 3-month Let's Encrypt certificate lifetime), so my sites are in no immediate danger of becoming insecure, but there does exist the possibility of launching a new site that can now no longer be secured because of this issue.

Status: Active

Comments

I bet the problem is that the Let's Encrypt service is querying your slave DNS servers, but sometimes replication of the validation DNS record is delayed. Try editing the file /etc/webmin/webmin/config and adding the line letsencrypt_dns_wait=60 at the end, and see if that helps.
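A minimal sketch of making that change from a script rather than by hand, assuming /etc/webmin/webmin/config is a plain list of key=value lines. The helper name set_config_value is hypothetical and not part of Webmin. It replaces the key if it is already present and appends it otherwise, so the file never ends up with two conflicting entries.

```python
def set_config_value(text: str, key: str, value: str) -> str:
    """Return config text with key=value set, replacing an existing entry or appending one."""
    lines = text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Compare only the part before the first "=" so values containing "=" are safe.
        if line.split("=", 1)[0] == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

To apply it, read the config file, pass its contents through set_config_value(contents, "letsencrypt_dns_wait", "60"), and write the result back as root. Keeping a backup copy of the original file first is a sensible precaution.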

When I request an SSL certificate with Let's Encrypt, it shows the following error:

Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for dev-cafe.thecrystalpos.com
dns-01 challenge for www.dev-cafe.thecrystalpos.com
Hook command "/etc/webmin/webmin/letsencrypt-dns.pl" returned error code 1
Error output from letsencrypt-dns.pl:
Undefined subroutine &main::restart_zone called at /usr/share/webmin/webmin/letsencrypt-dns.pl line 47.

Hook command "/etc/webmin/webmin/letsencrypt-dns.pl" returned error code 1
Error output from letsencrypt-dns.pl:
Undefined subroutine &main::restart_zone called at /usr/share/webmin/webmin/letsencrypt-dns.pl line 47.

Waiting for verification...
Cleaning up challenges
Hook command "/etc/webmin/webmin/letsencrypt-cleanup.pl" returned error code 1
Error output from letsencrypt-cleanup.pl:
Undefined subroutine &main::restart_zone called at /usr/share/webmin/webmin/letsencrypt-cleanup.pl line 38.

Hook command "/etc/webmin/webmin/letsencrypt-cleanup.pl" returned error code 1
Error output from letsencrypt-cleanup.pl:
Undefined subroutine &main::restart_zone called at /usr/share/webmin/webmin/letsencrypt-cleanup.pl line 38.

Failed authorization procedure. dev-cafe.thecrystalpos.com (dns-01): urn:ietf:params:acme:error:dns :: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.dev-cafe.thecrystalpos.com, www.dev-cafe.thecrystalpos.com (dns-01): urn:ietf:params:acme:error:dns :: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.dev-cafe.thecrystalpos.com
IMPORTANT NOTES:
- The following errors were reported by the server:

Domain: dev-cafe.thecrystalpos.com
Type: None
Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.dev-cafe.thecrystalpos.com

Domain: www.dev-cafe.thecrystalpos.com
Type: None
Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.dev-cafe.thecrystalpos.com

OK, it looks like you hit a different bug. Try editing the file /usr/share/webmin/webmin/letsencrypt-dns.pl and changing &restart_zone to &bind8::restart_zone.
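For anyone who would rather script that edit, here is a rough sketch of the substitution as a text transformation. The function name qualify_restart_zone is hypothetical, and nothing here is taken from the Webmin source. The error output above reports the same undefined subroutine in letsencrypt-cleanup.pl, so that file may need the same change, although that is an inference from the log rather than something confirmed in this thread.

```python
import re

# Matches the unqualified call only. "&bind8::restart_zone" does not contain
# the text "&restart_zone", so running this twice is harmless.
_UNQUALIFIED_CALL = re.compile(r"&restart_zone\b")


def qualify_restart_zone(perl_source: str) -> str:
    """Rewrite &restart_zone to &bind8::restart_zone in Perl source text."""
    return _UNQUALIFIED_CALL.sub("&bind8::restart_zone", perl_source)
```

Back up the script before overwriting it. A Webmin upgrade that ships the fix will replace the file anyway.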

Haha @JamieCameron, you want me to program it. Can you fix this issue and release it in the next version of Webmin?

As far as I can see, DNS validation currently does not work. Requests are succeeding via file-based validation, so it is not yet a big problem. Still, the wildcard domain does not get renewed automatically, so I renew it manually via the command line.

I ran into the same error and after following the direction to add &bind8:: to the call, I am receiving a large number of invalid error responses like this:

2019-11-29 15:58:34,226:DEBUG:acme.client:Storing nonce: 0001k858gu66CbwyzczWJnmlccoP2NNWOGn3YdEOgO6Y3cE
2019-11-29 15:58:34,227:DEBUG:acme.client:JWS payload:
b''
2019-11-29 15:58:34,229:DEBUG:acme.client:Sending POST request to https://acme-v02.api.letsencrypt.org/acme/authz-v3/1463873457:
{
  "signature": "U2LdVRa5RMICUz9ytM5ATaSrK6UHVUW1WVm-4iMxuMERprz9fm93rWZQ4u-Fdp6tvo4VDiezjVdHnGOQ8EyNKAZdzy_7SpQh9VuEkETJq2cSZB6F5xvH8vhV_mP767f-6vWSKS-I-1UP-pykD7qzE7pa8AWAyVX-GRgoCWOf5ULoR1Ue2QLrfLtv85j_mZbJnG32nLVfShB2PC32lvvDPbnwR7Nmh7OtjYimrVMgZABnCQffUn21aCkTAs_Kn9R8Yx1sXNokMu3w9WUwNAqBJFogQsL6xL35JBTI01DgGaA8OSlFvEomhuYXM5dmLoINp_Vq6AAPhFkWw3B4Kk-CPg",
  "payload": "",
  "protected": "eyJub25jZSI6ICIwMDAxazg1OGd1NjZDYnd5emN6V0pubWxjY29QMk5OV09HbjNZZEVPZ082WTNjRSIsICJ1cmwiOiAiaHR0cHM6Ly9hY21lLXYwMi5hcGkubGV0c2VuY3J5cHQub3JnL2FjbWUvYXV0aHotdjMvMTQ2Mzg3MzQ1NyIsICJhbGciOiAiUlMyNTYiLCAia2lkIjogImh0dHBzOi8vYWNtZS12MDEuYXBpLmxldHNlbmNyeXB0Lm9yZy9hY21lL3JlZy8xNjIyNDAzMyJ9"
}
2019-11-29 15:58:34,275:DEBUG:requests.packages.urllib3.connectionpool:https://acme-v02.api.letsencrypt.org:443 "POST /acme/authz-v3/1463873457 HTTP/1.1" 200 989
2019-11-29 15:58:34,276:DEBUG:acme.client:Received response:
HTTP 200
Server: nginx
Date: Fri, 29 Nov 2019 15:58:34 GMT
Content-Type: application/json
Content-Length: 989
Connection: keep-alive
Boulder-Requester: 16224033
Cache-Control: public, max-age=0, no-cache
Link: <https://acme-v02.api.letsencrypt.org/directory>;rel="index"
Replay-Nonce: 00021inveITR26vFxNO6Brme6BZPJHhUw1xyu34MH_P7pwo
X-Frame-Options: DENY
Strict-Transport-Security: max-age=604800

{
  "identifier": {
    "type": "dns",
    "value": "smtp.mydomain.com"
  },
  "status": "invalid",
  "expires": "2019-12-06T15:51:20Z",
  "challenges": [
    {
      "type": "http-01",
      "status": "invalid",
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/1463873457/W20Eaw",
      "token": "RUTxRB2Is21fl7Yd7Qvs4dF1WzUbk-llZIU31WCkZbg"
    },
    {
      "type": "dns-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:dns",
        "detail": "DNS problem: NXDOMAIN looking up TXT for _acme-challenge.smtp.mydomain.com",
        "status": 400
      },
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/1463873457/Urx8Wg",
      "token": "RUTxRB2Is21fl7Yd7Qvs4dF1WzUbk-llZIU31WCkZbg"
    },
    {
      "type": "tls-alpn-01",
      "status": "invalid",
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/1463873457/K7VhiA",
      "token": "RUTxRB2Is21fl7Yd7Qvs4dF1WzUbk-llZIU31WCkZbg"
    }
  ]
}

Yes, the next release of Webmin (due out soon) will include this fix - so you can just wait.

So returning to the original issue for a bit, I added the wait parameter to /etc/webmin/webmin/config (it was actually already there but set to 10 so I increased it to 60). This solved the problem in the sense that Virtualmin still failed to renew certificates, but then a little while later it retried and it renewed successfully. So Virtualmin e-mails me now with a failure message and then a little while later tries again automatically and the renewal is successful.

Is there a way to make Virtualmin renew the certificate successfully the first time/try, without failing first?


The first time it failed, did you still get the exact same message about NXDOMAIN looking up TXT for _acme-challenge.dev-cafe.thecrystalpos.com?

As an example of what I am seeing now, even after changing the Let's Encrypt DNS wait time in /etc/webmin/webmin/config from 10 seconds (the existing value) to 60 seconds (the new value), I am still getting errors like this when Virtualmin goes to renew a certificate, several days after making the change and restarting Webmin:

An error occurred requesting a new certificate for sicilyspizzaeaston.com, www.sicilyspizzaeaston.com from Let's Encrypt :

DNS-based validation failed :
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Starting new HTTPS connection (1): acme-v02.api.letsencrypt.org
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for sicilyspizzaeaston.com
dns-01 challenge for www.sicilyspizzaeaston.com
Running manual-auth-hook command: /etc/webmin/webmin/letsencrypt-dns.pl
Running manual-auth-hook command: /etc/webmin/webmin/letsencrypt-dns.pl
Waiting for verification...
Challenge failed for domain sicilyspizzaeaston.com
Challenge failed for domain www.sicilyspizzaeaston.com
dns-01 challenge for sicilyspizzaeaston.com
dns-01 challenge for www.sicilyspizzaeaston.com
Cleaning up challenges
Running manual-cleanup-hook command: /etc/webmin/webmin/letsencrypt-cleanup.pl
Running manual-cleanup-hook command: /etc/webmin/webmin/letsencrypt-cleanup.pl
Some challenges have failed.
IMPORTANT NOTES:
- The following errors were reported by the server:

Domain: sicilyspizzaeaston.com
Type: dns
Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.sicilyspizzaeaston.com

Domain: www.sicilyspizzaeaston.com
Type: dns
Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.www.sicilyspizzaeaston.com

I have confirmed that the domain is using Virtualmin's name servers (the primary and the slaves I configured through Virtualmin) and that all DNS servers are functioning properly and accepting requests.

Do you control the DNS servers that are secondaries for this domain? If so, on the master system, is BIND set up to notify the secondaries when a record is changed?

Yes. I control all four name servers (one master and three slaves), and all slave servers were set up using the DNS Slave Auto-Configuration guide in the Virtualmin docs. I would assume BIND is set up as described on the master system but I haven't changed any settings other than following Virtualmin's official slave DNS configuration guide. I did however do a DNS test where I added a record to a zone on the master (Virtualmin > Server Configuration > DNS Records) and then did a DNS lookup on one of the slaves and the record was present on that slave. I don't remember how long I waited in between adding the record and performing the test though (this was a few weeks ago when trying to resolve a different issue).

Just to verify that notifications for new records are being received by slave systems, can you check if anything gets logged by bind or named to /var/log/messages on all the slaves when you try to request a Let's Encrypt cert?

Hmm...I ran the command 'cat /var/log/messages | grep "sicilyspizzaeaston.com"' on the master and the slave servers after re-attempting to request the Let's Encrypt certificate for the domain but no results were returned on any of the four servers.

I don't suppose there is any firewall blocking UDP or TCP ports 53 from the master to slave systems?

No, we unfortunately don't have a firewall in place right now because when I set up this new cluster of systems I realized FirewallD is not supported on the Linode-provided kernel they are currently running.

If you add a DNS record on the master system (using the DNS Records page in Virtualmin), does it get immediately replicated to the slaves?

I just added a test DNS record to the domain on Virtualmin and less than 3 minutes later ran the following commands, trying to retrieve the A record I just added across all four of my name servers (the master and three slaves, with ns1 being the master):

lmerrill@AirshockMBP ~ % host -t a test-1.sicilyspizzaeaston.com ns1.jemediacorp.com
Using domain server:
Name: ns1.jemediacorp.com
Address: 45.79.158.84#53
Aliases:

test-1.sicilyspizzaeaston.com has address 45.79.158.84

lmerrill@AirshockMBP ~ % host -t a test-1.sicilyspizzaeaston.com ns2.jemediacorp.com
Using domain server:
Name: ns2.jemediacorp.com
Address: 45.56.99.222#53
Aliases:

test-1.sicilyspizzaeaston.com has address 45.79.158.84

lmerrill@AirshockMBP ~ % host -t a test-1.sicilyspizzaeaston.com ns3.jemediacorp.com
Using domain server:
Name: ns3.jemediacorp.com
Address: 45.79.131.13#53
Aliases:

test-1.sicilyspizzaeaston.com has address 45.79.158.84

lmerrill@AirshockMBP ~ % host -t a test-1.sicilyspizzaeaston.com ns4.jemediacorp.com
Using domain server:
Name: ns4.jemediacorp.com
Address: 66.228.34.214#53
Aliases:

test-1.sicilyspizzaeaston.com has address 45.79.158.84

Edit: All commands were run on my local MacBook Pro connected to my apartment's wireless network. They were not run on any server.

Can you re-try that and see if the record appears within 1 minute on all slaves?

The test was successful. I added the DNS record test-2.sicilyspizzaeaston.com at 9:30 AM EST this morning and then immediately went into Terminal.app on my Mac and ran the host command with all 4 name servers as I did yesterday. The command also ran at 9:30 AM, just after I added the record, and here are the results:

Using domain server:
Name: ns1.jemediacorp.com
Address: 45.79.158.84#53
Aliases:

test-2.sicilyspizzaeaston.com has address 45.56.99.222

Using domain server:
Name: ns2.jemediacorp.com
Address: 45.56.99.222#53
Aliases:

test-2.sicilyspizzaeaston.com has address 45.56.99.222

Using domain server:
Name: ns3.jemediacorp.com
Address: 45.79.131.13#53
Aliases:

test-2.sicilyspizzaeaston.com has address 45.56.99.222

Using domain server:
Name: ns4.jemediacorp.com
Address: 66.228.34.214#53
Aliases:

test-2.sicilyspizzaeaston.com has address 45.56.99.222

Ok, so during a Let's Encrypt cert request, does a record like _acme-challenge.sicilyspizzaeaston.com ever appear on master or slave systems?

I'm not quite sure how I would monitor that, since those records get added and removed quickly by Virtualmin, correct? What would you suggest as the best way to spot the record before it gets removed when the request fails?
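One possible way to catch such a short-lived record is to poll every name server once a second while the certificate request runs and print a timestamped result for each. The sketch below assumes the dig utility is installed on the machine running it. The function names are hypothetical, and the script only observes the servers without changing anything on them.

```python
import subprocess
import time
from typing import List


def dig_txt_command(name: str, server: str) -> List[str]:
    """Build a dig invocation that asks one specific server for TXT records."""
    return ["dig", "+short", "+time=2", "+tries=1", "TXT", name, f"@{server}"]


def parse_txt_answer(output: str) -> List[str]:
    """Extract TXT values from `dig +short` output, skipping blank and ';;' diagnostic lines."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        if line and not line.startswith(";"):
            values.append(line.strip('"'))
    return values


def poll_txt(name: str, servers: List[str], attempts: int = 120, interval: float = 1.0) -> None:
    """Query each server repeatedly and print what it returns for the TXT record."""
    for _ in range(attempts):
        for server in servers:
            result = subprocess.run(
                dig_txt_command(name, server),
                capture_output=True, text=True, timeout=10,
            )
            values = parse_txt_answer(result.stdout)
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} {server}: {values if values else 'no TXT record'}")
        time.sleep(interval)
```

Starting poll_txt("_acme-challenge.sicilyspizzaeaston.com", ["ns1.jemediacorp.com", "ns2.jemediacorp.com", "ns3.jemediacorp.com", "ns4.jemediacorp.com"]) just before requesting the certificate should show whether the record ever appears on the master, and whether the slaves pick it up before it is cleaned up again.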

OK, so as time goes by I'm learning more and more about this issue. Virtualmin tried a few times over the course of today to renew a certificate for another domain that we host and control the DNS for, but failed every time with the same errors I've been seeing. However I just got an e-mail from Virtualmin saying a certificate for that domain was successfully requested and installed from Let's Encrypt. Same domain that was failing earlier, and we've made no changes to our Virtualmin server today. So I'm not sure why it would be failing one minute but succeeding the next, it's very odd.

It feels like the problem is propagation of changed records to the slave servers, which means that the check will randomly succeed if the Let's Encrypt service happens to query the DNS master.
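A rough way to test the propagation theory is to compare the zone's SOA serial on the master and on each slave while a request is in progress. This sketch assumes the serial is incremented when the validation record is added, which has not been confirmed in this thread, and that the SOA line comes from a command such as dig +short SOA against each server. Both function names are hypothetical.

```python
from typing import Dict, List


def parse_soa_serial(dig_short_soa: str) -> int:
    """Return the serial from one line of `dig +short SOA` output.

    The line has seven fields: mname, rname, serial, refresh, retry, expire, minimum.
    """
    fields = dig_short_soa.split()
    if len(fields) < 7:
        raise ValueError(f"not a complete SOA line: {dig_short_soa!r}")
    return int(fields[2])


def out_of_sync_servers(serials: Dict[str, int], master: str) -> List[str]:
    """List servers whose serial differs from the master's, in sorted order."""
    expected = serials[master]
    return sorted(name for name, serial in serials.items()
                  if name != master and serial != expected)
```

If a slave shows up in out_of_sync_servers during the validation window, that slave would still answer NXDOMAIN for the challenge record, which would match the intermittent failures described here.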

That makes sense. So you're saying if Let's Encrypt queries the slaves instead of the master it will fail? How can we resolve that issue? I mean we know that when I add a DNS record manually, I can look it up on all the slave servers, as demonstrated in our test a few days ago.

Regarding comment #22 - you can use the letsencrypt_dns_wait=60 setting mentioned above to increase the time that the DNS records are kept around for.

I already used that setting, I set it in /etc/webmin/webmin/config as instructed (see my above comment) and I am still having issues. Virtualmin attempted to renew a certificate for one of our domains this evening and of course the DNS challenge method failed, but searching the logs for the acme-challenge DNS record produced no results on either the master or any of the slaves.