Nameservers won't work after server restart on fresh Vmin install

Hello,

After waiting a few months to see how things would turn out with all the updates (website, Vmin, etc.), I decided to give Vmin another try, and I immediately ran into a problem with named. I took one of my VPSes, did a fresh install of CentOS 7, updated the OS, and proceeded with the Vmin installation. Once that was done I added one domain so I could test, and after a server reboot intoDNS reported problems with the nameservers:

WARNING: One or more of your nameservers did not return any of your NS records.

ERROR: One or more of your nameservers did not respond:
The ones that did not respond are:
37.247.xxx.xxx

ERROR: Looks like you have less than 2 nameservers. According to RFC2182 section 5 you must have at least 3 nameservers, and no more than 7. Having 2 nameservers is also ok by me.

You should already know that your NS records at your nameservers are missing, so here it is again:
ns1.mydomain.com.
ns2.mydomain.com.

No valid SOA record came back!

Oh well, I did not detect any MX records so you probably don't have any and if you know you should have then they may be missing at your nameservers!

A simple restart or reload of named solves the problem, but I can't find the cause. The log file doesn't show any problem:

30-May-2016 22:08:25.194 general: info: shutting down
30-May-2016 22:08:25.194 general: notice: stopping command channel on 127.0.0.1#953
30-May-2016 22:08:25.194 general: notice: stopping command channel on ::1#953
30-May-2016 22:08:25.194 network: info: no longer listening on ::#53
30-May-2016 22:08:25.194 network: info: no longer listening on 127.0.0.1#53
30-May-2016 22:08:25.194 network: info: no longer listening on 37.247.xxx.xxx#53
30-May-2016 22:08:25.199 general: notice: exiting
30-May-2016 22:08:38.960 general: info: managed-keys-zone: loaded serial 2
30-May-2016 22:08:38.961 general: info: zone 0.in-addr.arpa/IN: loaded serial 0
30-May-2016 22:08:38.961 general: info: zone 1.0.0.127.in-addr.arpa/IN: loaded serial 0
30-May-2016 22:08:38.962 general: info: zone 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa/IN: loaded serial 0
30-May-2016 22:08:38.962 general: info: zone localhost.localdomain/IN: loaded serial 0
30-May-2016 22:08:38.962 general: info: zone localhost/IN: loaded serial 0
30-May-2016 22:08:38.963 general: info: zone mydomain.com/IN: loaded serial 2016053002
30-May-2016 22:08:38.963 general: notice: all zones loaded
30-May-2016 22:08:38.963 general: notice: running
30-May-2016 22:08:38.968 notify: info: zone mydomain.com/IN: sending notifies (serial 2016053002)

I reinstalled my VPS 4 times to be sure it's not just a random bug popping up during installation, but the problem was present on every try.

CentOS 7 on the VPS is fully updated, as are Vmin/Wmin.

Status: Closed (fixed)

Comments

Howdy -- hmm, that's an odd issue you're seeing there, and we haven't received any other reports similar to that.

In your setup there, are ns1.mydomain.com and ns2.mydomain.com both pointing to the same IP address, 37.247.xxx.xxx?

It does look like BIND is listening on 37.247.xxx.xxx port 53.

One thing I'd suggest is to verify the time and date on your server. If they are off even a little, BIND may not respond to requests due to new security options.

Submitted by Diabolico on Mon, 05/30/2016 - 17:25

->"In your setup there, are ns1.mydomain.com and ns2.mydomain.com both pointing to the same IP address, 37.247.xxx.xxx?"

Correct, same IP for both.

->"One thing I'd suggest is to verify the time and date on your server. If they are off even a little, BIND may not respond to requests due to new security options."

I remember that early in CentOS 7 there was a bug with imjournal and rsyslog.service, but it only affected the OS's ability to write to log files. The only thing I changed was the time zone. Any suggestion as to which time could be off? Judging by the logs, the time is correct. The only thing that comes to mind is to apply the same fix that was used for imjournal and rsyslog.service, but I'm not sure how that would help with named.

Submitted by Diabolico on Mon, 05/30/2016 - 20:32

I found the problem: named was starting before the network had finished configuring and the IP address had been assigned. After your post, Eric, I started thinking, and it seemed strange that the old imjournal bug would cause problems with named. It might affect writing the log files, but I doubt it could cause this problem. So I started digging and found a similar report, which gave me the idea that the nameservers could fail if the IP address is not assigned before named starts.
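
One way to confirm this on an affected server might be to check, right after boot, whether anything is bound to port 53 on the public IP. This is only a minimal sketch, assuming `ss` from iproute2 is installed; `listens_on_53` is a made-up helper name and 192.0.2.10 is a placeholder address, not the real one.

```shell
#!/bin/sh
# Sketch: report whether a given address is bound on port 53.
# listens_on_53 is a hypothetical helper; it reads `ss -lntu` output on stdin
# and succeeds only if some field is exactly "<address>:53".
listens_on_53() {
    awk -v target="$1:53" '
        { for (i = 1; i <= NF; i++) if ($i == target) found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Typical use on the server, with the public IP in place of the
# placeholder 192.0.2.10:
#   ss -lntu | listens_on_53 192.0.2.10 && echo "bound" || echo "not bound"
```

If the check reports "not bound" after a reboot but "bound" after restarting named, that would be consistent with named having started before the address was assigned.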

Solution:

- File: /usr/lib/systemd/system/named.service
- Originally I had:
[Unit]
Description=Berkeley Internet Name Domain (DNS)
Wants=nss-lookup.target
Wants=named-setup-rndc.service
Before=nss-lookup.target
After=network.target
After=named-setup-rndc.service
...

- Add "After=network-online.target" below "After=network.target":

[Unit]
Description=Berkeley Internet Name Domain (DNS)
Wants=nss-lookup.target
Wants=named-setup-rndc.service
Before=nss-lookup.target
After=network.target
After=network-online.target
After=named-setup-rndc.service
...

- Run "systemctl daemon-reload" and then "service named restart".
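
A note for anyone applying this: files under /usr/lib/systemd/system belong to the package and can be overwritten by updates, so a drop-in under /etc/systemd/system/named.service.d/ is probably the safer place for the change. The following is a minimal sketch, not a tested fix for this exact setup. The helper name `write_named_dropin` and the file name `wait-online.conf` are made up for illustration. `Wants=` is added because `After=` only orders units and does not by itself pull network-online.target into the boot.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that delays named until the network is online.
# write_named_dropin is a hypothetical helper; the optional argument overrides
# the drop-in directory (default: the standard location for named.service).
write_named_dropin() {
    dropin_dir="${1:-/etc/systemd/system/named.service.d}"
    mkdir -p "$dropin_dir" || return 1
    # Wants= pulls network-online.target into the boot transaction;
    # After= orders named behind it.
    cat > "$dropin_dir/wait-online.conf" <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
}

# Typical use on the server (as root):
#   write_named_dropin
#   systemctl daemon-reload
#   systemctl restart named
```

network-online.target only delays anything if a wait-online service for the network stack in use is enabled (for example NetworkManager-wait-online.service when NetworkManager manages the interface), so that is worth checking if the drop-in seems to have no effect.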

I found only one report about this bug, and it says the problem is random. I'm not 100% sure about that: I tried 4 times with a clean install, and hitting it 4 out of 4 times doesn't look random. It could be connected to the OpenVZ templates, or to the hosting company and their network. I'm not sure, but something must be the trigger.

I'm posting my findings here in case someone else encounters this bug, as the solution was not easy to find.

Thanks for letting us know how you fixed this!

I had been trying to reproduce that issue, but so far wasn't able to.

Perhaps networking is taking longer to start up in some situations though.

I'll review whether that's something we want to modify by default for future installs. That kind of change is generally something we'd prefer to see made upstream, though, as networking would certainly need to be online by the time BIND starts.

But I'll see what Joe and Jamie have to say :-)

Do you happen to have a link to the bug report you saw regarding this?

FYI, the order of these boot actions is determined by the distribution-provided packages - we don't customize it in the Virtualmin install.

Submitted by Diabolico on Mon, 05/30/2016 - 22:08

->"Thanks for letting us know how you fixed this!"

No problem.

->"I had been trying to reproduce that issue, but so far wasn't able to."

I had never even heard of this situation, let alone seen it. In the bug report the person said it's random, but I suspect something must trigger it. Maybe it's similar to the situation where many CentOS 7 templates on OpenVZ were bugged: after a restart you could no longer connect because the network was down and unable to start.

->"Perhaps networking is taking longer to start up in some situations though."

I don't think so: it's a fresh install on an SSD VPS that is not oversold or overloaded.

->"Do you happen to have a link to the bug report you saw regarding this?"

Yes: https://bugs.centos.org/view.php?id=10575

->"FYI, the order of these boot actions is determined by the distribution-provided packages - we don't customize it in the Virtualmin install."

Now that I've found what and where the problem was, I agree, but at the beginning it looked like a Vmin bug.

Submitted by Diabolico on Tue, 05/31/2016 - 11:44

Status: Active » Fixed