BIND exiting and shutting down for no apparent reason - random RCODE REFUSED

23 posts / 0 new
Last post
#1 Wed, 09/25/2013 - 10:12
calderwood

BIND exiting and shutting down for no apparent reason - random RCODE REFUSED

I have been seeing random BIND shut down and restart on Virtualmin GPL install. The shut down is always preceded by some error but random domain or IP entry. these IP addresses and domains are not on my server. The code message is often "error (unexpected RCODE REFUSED) resolving to some domain/IP"

If anyone can point me in the correct direction for finding the cause. I am assuming there is a bad setting somewhere. I am running CentOS6 with Virtualmin GPL

A sample log of a shutdown

Sep 25 06:01:06 ip-68-178-130-21 named[20231]: error (unexpected RCODE REFUSED) resolving '49.142.218.222.in-addr.arpa/PTR/IN': 202.103.225.70#53
Sep 25 06:01:07 ip-68-178-130-21 named[20231]: error (unexpected RCODE REFUSED) resolving '49.142.218.222.in-addr.arpa/PTR/IN': 202.103.224.69#53
Sep 25 06:01:09 ip-68-178-130-21 named[20231]: error (unexpected RCODE REFUSED) resolving '49.142.218.222.in-addr.arpa/PTR/IN': 202.103.224.69#53
Sep 25 06:01:09 ip-68-178-130-21 named[20231]: error (unexpected RCODE REFUSED) resolving '49.142.218.222.in-addr.arpa/PTR/IN': 202.103.225.70#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: shutting down
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: stopping command channel on 127.0.0.1#953
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: stopping command channel on ::1#953
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on ::#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on 127.0.0.1#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on 68.178.130.21#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on 68.178.131.108#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: no longer listening on 50.62.34.152#53
Sep 25 06:02:17 ip-68-178-130-21 named[20231]: exiting
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: starting BIND 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6.3 -u named
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: built with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var' '--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static' '--disable-openssl-version-check' '--with-dlz-ldap=yes' '--with-dlz-postgres=yes' '--with-dlz-mysql=yes' '--with-dlz-filesystem=yes' '--with-gssapi=yes' '--disable-isc-spnego' '--with-docbook-xsl=/usr/share/sgml/docbook/xsl-stylesheets' '--enable-fixed-rrset' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS= -DDIG_SIGCHASE'
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: ----------------------------------------------------
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: BIND 9 is maintained by Internet Systems Consortium,
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: Inc. (ISC), a non-profit 501(c)(3) public-benefit
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: corporation.  Support and training for BIND 9 are
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: available at https://www.isc.org/support
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: ----------------------------------------------------
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: adjusted limit on open files from 1024 to 1048576
Sep 25 06:07:46 ip-68-178-130-21 named[29722]: found 16 CPUs, using 16 worker threads

Another BIND Shutdown was this one - these IP addresses and domains are not on my server:

Sep 25 05:46:25 ip-68-178-130-21 named[22235]: client 122.136.196.116#63897: query (cache) 'aa3247.com/A/IN' denied
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: client 122.136.196.116#3209: query (cache) 'aa3247.com/A/IN' denied
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: shutting down
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: stopping command channel on 127.0.0.1#953
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: stopping command channel on ::1#953
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on ::#53
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on 127.0.0.1#53
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on 68.178.130.21#53
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on 68.178.131.108#53
Sep 25 05:46:25 ip-68-178-130-21 named[22235]: no longer listening on 50.62.34.152#53
Sep 25 05:46:27 ip-68-178-130-21 named[22235]: exiting
Wed, 09/25/2013 - 12:07
andreychek

Howdy,

Are you by chance using an OpenVZ-based VPS?

If so, can you paste in the contents of your /proc/user_beancounters file?

-Eric

Wed, 09/25/2013 - 12:19 (Reply to #2)
calderwood

Yes it is.

lockedpages                     0                    0  9223372036854775807  9223372036854775807                    0
            privvmpages                799349               840633              1572864              1572864                    0
            shmpages                     1206                 1206  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            numproc                       167                  167                32567                32567                    0
            physpages                  638774               678198  9223372036854775807  9223372036854775807                    0
            vmguarpages                     0                    0               786432  9223372036854775807                    0
            oomguarpages               638775               678199  9223372036854775807  9223372036854775807                    0
            numtcpsock                     48                   51  9223372036854775807  9223372036854775807                    0
            numflock                      171                  172  9223372036854775807  9223372036854775807                    0
            numpty                          1                    1                  255                  255                    0
            numsiginfo                      6                    7                 1024                 1024                    0
            tcpsndbuf                 1465696              1520640  9223372036854775807  9223372036854775807                    0
            tcprcvbuf                  786432               835584  9223372036854775807  9223372036854775807                    0
            othersockbuf               282656               565536  9223372036854775807  9223372036854775807                    0
            dgramrcvbuf                     0                 9344  9223372036854775807  9223372036854775807                    0
            numothersock                  201                  207  9223372036854775807  9223372036854775807                    0
            dcachesize                2355922              2400994  9223372036854775807  9223372036854775807                    0
            numfile                     18951                18960  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            numiptent                      14                   14  9223372036854775807  9223372036854775807                    0
Wed, 09/25/2013 - 12:17
calderwood

This time BIND just went down twice in a row but with yet another error code:

Sep 25 09:32:38 ip-68-178-130-21 named[29722]: error (unexpected RCODE SERVFAIL) resolving '228.195.58.213.in-addr.arpa/PTR/IN': 2001:67c:e0::6#53
Sep 25 09:35:16 ip-68-178-130-21 php-cgi: PHP Warning:  Module 'mcrypt' already loaded in Unknown on line 0
Sep 25 09:35:19 ip-68-178-130-21 named[29722]: client 122.136.196.116#29758: query (cache) 'aa3247.com/A/IN' denied
Sep 25 09:35:33 ip-68-178-130-21 named[29722]: error (unexpected RCODE REFUSED) resolving 'ns.bta.net.cn/A/IN': 202.106.196.233#53
Sep 25 09:35:33 ip-68-178-130-21 named[29722]: error (unexpected RCODE REFUSED) resolving 'ns.bta.net.cn/AAAA/IN': 202.106.196.233#53
Sep 25 09:35:33 ip-68-178-130-21 named[29722]: error (unexpected RCODE REFUSED) resolving 'ns2.bta.net.cn/A/IN': 202.106.196.233#53
Sep 25 09:35:33 ip-68-178-130-21 named[29722]: error (unexpected RCODE REFUSED) resolving 'ns2.bta.net.cn/AAAA/IN': 202.106.196.233#53
Sep 25 09:35:56 ip-68-178-130-21 named[29722]: error (connection refused) resolving '130.36.99.91.psbl.surriel.com/A/IN': 82.94.250.75#53
Sep 25 09:37:41 ip-68-178-130-21 named[29722]: error (unexpected RCODE SERVFAIL) resolving '228.195.58.213.in-addr.arpa/PTR/IN': 193.0.9.6#53
Sep 25 09:38:21 ip-68-178-130-21 named[29722]: shutting down
Sep 25 09:38:21 ip-68-178-130-21 named[29722]: stopping command channel on 127.0.0.1#953
Sep 25 09:38:21 ip-68-178-130-21 named[29722]: stopping command channel on ::1#953
Sep 25 09:38:21 ip-68-178-130-21 named[29722]: no longer listening on ::#53
Sep 25 09:38:21 ip-68-178-130-21 named[29722]: no longer listening on 127.0.0.1#53
Sep 25 09:38:21 ip-68-178-130-21 named[29722]: no longer listening on 68.178.130.21#53
Sep 25 09:38:21 ip-68-178-130-21 named[29722]: no longer listening on 68.178.131.108#53
Sep 25 09:38:21 ip-68-178-130-21 named[29722]: no longer listening on 50.62.34.152#53
Sep 25 09:38:22 ip-68-178-130-21 named[29722]: exiting
Wed, 09/25/2013 - 13:49
Locutus

I don't think the error message and the exiting is directly related. In your first example, there's over a minute between these events. Can you check how much time passed between each of these incidences?

I'd also think that BIND is hitting some kind of resource limit. You might check the syslog for further errors, maybe OOM messages. Short of a programming bug, it should not simply exit like that. You might check with "tcpdump udp port 53" what the load of your BIND is.

Wed, 09/25/2013 - 15:17 (Reply to #5)
calderwood

Looks like about 20 items per second - not sure what is normal.

This happens about 4 times a day, but at no particular time. I've looked thru the logs, but the only consistency I see is right before BIND closes, there is always an error message before such as:

error (unexpected RCODE REFUSED)
error (unexpected RCODE SERVFAIL)
error (unexpected RCODE REFUSED)

Sometimes there can be 30-50 error messages within a few minutes all on the same PID with different IPs

Wed, 09/25/2013 - 15:17 (Reply to #6)
calderwood

Looks like about 20 items per second - not sure what is normal.

This happens about 4 times a day, but at no particular time. I've looked thru the logs, but the only consistency I see is right before BIND closes, there is always an error message before such as:

error (unexpected RCODE REFUSED)
error (unexpected RCODE SERVFAIL)
error (unexpected RCODE REFUSED)

Sometimes there can be 30-50 error messages within a few minutes all on the same PID with different IPs

Wed, 09/25/2013 - 15:18 (Reply to #7)
calderwood

Looks like about 20 items per second - not sure what is normal.

This happens about 4 times a day, but at no particular time. I've looked thru the logs, but the only consistency I see is right before BIND closes, there is always an error message before such as:

error (unexpected RCODE REFUSED)
error (unexpected RCODE SERVFAIL)
error (unexpected RCODE REFUSED)

Sometimes there can be 30-50 error messages within a few minutes all on the same PID with different IPs

Wed, 09/25/2013 - 16:50
Locutus

The resolver errors you see are reverse lookups, probably triggered by some service on your machine that wants to know the hostname for an IP address that is trying to connect (e.g. SSHD and Postfix can do that). You might want to check your syslog, auth-log or mail-log for an unusually high amount of connection attempts / login failures, which indicate dictionary attacks.

Still, BIND should not simply exit from something like this. It's meant to process a considerable amount of requests per second.

Did you check the syslog or kernel log for additional errors like OOMs? Otherwise, I'm fairly certain that it's a problem with the virtualization software in use for your server. What's your used and free memory? You might want to use the tool "atop" to record and playback performance data, that should give you an idea if something has a memory spike or similar.

Wed, 09/25/2013 - 18:35 (Reply to #9)
calderwood

No OOMs and nothing in the kernel logs.

However I did find spamassasin set to lookup DNS test mode (not sure that would matter) and did fine an email account that is trying to login every 3 seconds with an authentication failure in saslauthd

The only thing I have seen pulling a lot of CPU is clamav.

It's not a busy server - only 100 virtual servers.

Wed, 09/25/2013 - 22:11
andreychek

It's not uncommon that we see unusual issues and service that die unexpectedly on OpenVZ-based systems... but typically when that occurs, the "failcnt" field of the user_beancounters file shows some non-0 values in it.

On your system, they all appear to be "0".

How much RAM is in your server, and what is the current usage?

You can determine that by running this command:

free -m

Thu, 09/26/2013 - 08:35 (Reply to #11)
calderwood

The system has 3GB ram with 6GB burstable

]# free -m
             total       used       free     shared    buffers     cached
Mem:          3072       1708       1363          0          0          0
-/+ buffers/cache:       1708       1363
Swap:            0          0          0
Thu, 09/26/2013 - 03:43
Locutus

I can only repeat what I wrote last time, and what Eric said too now: We need to know what your RAM situation is. free and atop are recommended to provide that information, and to check historical memory data when BIND crashes.

Thu, 09/26/2013 - 09:21
Locutus

This is really odd. Memory should be plenty, and no beancounter violation.

Two things I can think of right now: Intermittent memory problems. To find out about those, install "atop". Or the BIND version you're using (being an RC) might be buggy. Ubuntu officially uses 9.8.1, yours is 9.8.2-RC1.

You could try turning on BIND's own logging, including debug logging. Or you might want to check if there's an earlier, non-RC version of BIND available. Are you using third-party repositories for your CentOS? Those often contain versions of software that make trouble when used with Virtualmin.

Thu, 09/26/2013 - 09:47 (Reply to #14)
calderwood

The Centos iss a clean standard install and Virtualmin is the standard. It is a real problem as the server closes down about 6 times a day. Stopped twice again this morning. I searched the logs, and the same RCODE error then a BIND exit. I noticed that there is a bunch of messages all relating to the same PID. Don't know if it means anything.

Close at 7.03AM

Sep 26 07:03:38 ip-68-178-130-21 named[4010]: error (unexpected RCODE REFUSED) resolving '16.136.119.140.in-addr.arpa/PTR/IN': 2001:288:2001::8c77:36e#53
Sep 26 07:03:40 ip-68-178-130-21 php-cgi: PHP Warning:  Module 'mcrypt' already loaded in Unknown on line 0
Sep 26 07:03:40 ip-68-178-130-21 php-cgi: PHP Warning:  Module 'mcrypt' already loaded in Unknown on line 0
Sep 26 07:04:01 ip-68-178-130-21 php-cgi: PHP Warning:  Module 'mcrypt' already loaded in Unknown on line 0
Sep 26 07:04:06 ip-68-178-130-21 php-cgi: PHP Warning:  Module 'mcrypt' already loaded in Unknown on line 0
Sep 26 07:04:39 ip-68-178-130-21 named[4010]: error (unexpected RCODE SERVFAIL) resolving '65.37.114.42.in-addr.arpa/PTR/IN': 210.245.31.10#53
Sep 26 07:04:49 ip-68-178-130-21 named[4010]: client 122.136.196.116#45453: query (cache) 'aa3247.com/A/IN' denied
Sep 26 07:05:46 ip-68-178-130-21 php-cgi: PHP Warning:  Module 'mcrypt' already loaded in Unknown on line 0
Sep 26 07:06:41 ip-68-178-130-21 named[4010]: client 122.136.196.116#61009: query (cache) 'aa3247.com/A/IN' denied
Sep 26 07:07:17 ip-68-178-130-21 named[4010]: client 122.136.196.116#45732: query (cache) 'aa3247.com/A/IN' denied
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: shutting down
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: stopping command channel on 127.0.0.1#953
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: stopping command channel on ::1#953
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: no longer listening on ::#53
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: no longer listening on 127.0.0.1#53
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: no longer listening on 68.178.130.21#53
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: no longer listening on 68.178.131.108#53
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: no longer listening on 50.62.34.152#53
Sep 26 07:07:41 ip-68-178-130-21 named[4010]: exiting
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: starting BIND 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6.3 -u named
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: built with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var' '--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static' '--disable-openssl-version-check' '--with-dlz-ldap=yes' '--with-dlz-postgres=yes' '--with-dlz-mysql=yes' '--with-dlz-filesystem=yes' '--with-gssapi=yes' '--disable-isc-spnego' '--with-docbook-xsl=/usr/share/sgml/docbook/xsl-stylesheets' '--enable-fixed-rrset' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS= -DDIG_SIGCHASE'
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: ----------------------------------------------------
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: BIND 9 is maintained by Internet Systems Consortium,
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: Inc. (ISC), a non-profit 501(c)(3) public-benefit
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: corporation.  Support and training for BIND 9 are
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: available at https://www.isc.org/support
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: ----------------------------------------------------
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: adjusted limit on open files from 1024 to 1048576
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: found 16 CPUs, using 16 worker threads
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: using up to 4096 sockets
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: loading configuration from '/etc/named.conf'
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: reading built-in trusted keys from file '/etc/named.iscdlv.key'
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: using default UDP/IPv4 port range: [1024, 65535]
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: using default UDP/IPv6 port range: [1024, 65535]
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: listening on IPv6 interfaces, port 53
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: listening on IPv4 interface lo, 127.0.0.1#53
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: listening on IPv4 interface eth0, 68.178.130.21#53
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: listening on IPv4 interface eth0:1, 68.178.131.108#53
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: listening on IPv4 interface eth0:2, 50.62.34.152#53
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: generating session key for dynamic DNS
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: sizing zone task pool based on 95 zones
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: set up managed keys zone for view _default, file '/var/named/dynamic/managed-keys.bind'
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: Warning: 'empty-zones-enable/disable-empty-zone' not set: disabling RFC 1918 empty zones
Sep 26 07:12:50 ip-68-178-130-21 named[7851]: automatic empty zone: 127.IN-ADDR.ARPA

Close at 9:22AM

Sep 26 09:21:14 ip-68-178-130-21 named[26623]: error (unexpected RCODE SERVFAIL) resolving '226.128.136.183.in-addr.arpa/PTR/IN': 202.96.103.36#53
Sep 26 09:21:28 ip-68-178-130-21 php-cgi: PHP Warning:  Module 'mcrypt' already loaded in Unknown on line 0
Sep 26 09:21:52 ip-68-178-130-21 php-cgi: PHP Warning:  Module 'mcrypt' already loaded in Unknown on line 0
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: received control channel command 'stop'
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: shutting down: flushing changes
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: stopping command channel on 127.0.0.1#953
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: stopping command channel on ::1#953
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: no longer listening on ::#53
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: no longer listening on 127.0.0.1#53
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: no longer listening on 68.178.130.21#53
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: no longer listening on 68.178.131.108#53
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: no longer listening on 50.62.34.152#53
Sep 26 09:22:30 ip-68-178-130-21 named[26623]: exiting
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: starting BIND 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6.3 -u named
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: built with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var' '--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static' '--disable-openssl-version-check' '--with-dlz-ldap=yes' '--with-dlz-postgres=yes' '--with-dlz-mysql=yes' '--with-dlz-filesystem=yes' '--with-gssapi=yes' '--disable-isc-spnego' '--with-docbook-xsl=/usr/share/sgml/docbook/xsl-stylesheets' '--enable-fixed-rrset' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS= -DDIG_SIGCHASE'
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: ----------------------------------------------------
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: BIND 9 is maintained by Internet Systems Consortium,
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: Inc. (ISC), a non-profit 501(c)(3) public-benefit
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: corporation.  Support and training for BIND 9 are
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: available at https://www.isc.org/support
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: ----------------------------------------------------
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: adjusted limit on open files from 1024 to 1048576
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: found 16 CPUs, using 16 worker threads
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: using up to 4096 sockets
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: loading configuration from '/etc/named.conf'
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: reading built-in trusted keys from file '/etc/named.iscdlv.key'
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: using default UDP/IPv4 port range: [1024, 65535]
Sep 26 09:22:32 ip-68-178-130-21 named[3617]: using default UDP/IPv6 port range: [1024, 65535]
Thu, 09/26/2013 - 10:30 (Reply to #15)
Locutus

So what about my suggestions, like atop, BIND debug log and downgrading to a stable non-RC version? I'm afraid I can't help you if you don't reply to my suggestions.

Thu, 09/26/2013 - 11:06 (Reply to #16)
calderwood

The BIND debug is already on, but it does not seem to be writing to the file. I will be willing to downgrade but the current stable version is 9.8.6 and I have installed 9.8.2-0.17.rc1.el6.3 which is not the latest. I do not have atop installed, but checked it out this morning and will install.

Thanks for the advice

Thu, 09/26/2013 - 09:52
calderwood

One setting question for BIND.

Do full recursive lookups for clients? It is set to Yes

Should it be NO?

Thu, 09/26/2013 - 11:28
calderwood

I noticed that my ram is at capacity.

#free -m
             total       used       free     shared    buffers     cached
Mem:          3072       3072          0          0          0          0
-/+ buffers/cache:       3072          0
Swap:            0          0          0
Thu, 09/26/2013 - 12:46
Locutus

You'd definitely want to limit recursive queries to your local server, and not allow it for any user on the internet. That's done with the "allow-recursion" directive, or there's also an edit box in Webmin's BIND module for it. So you should set "Do fully recursive lookups" to "Default", and in "Addresses and Topology", set "Allow recursive queries from" to "Listed..." and enter in the box

localnets localhost

As for the RAM filling up, did that happen when your BIND crashed? You really need ATOP, to have historical data about processes. It could be a script going haywire and spawning many processes, or BIND itself using too much RAM. But you won't find that out without ATOP.

When you have it installed, go to /etc/init.d/atop and modify the line "DARGS=", the number at the end is the interval in seconds that atop uses to store resource data. You might want to lower the default of 10 minutes to e.g. one minute, for your debugging purposes.

Thu, 09/26/2013 - 12:59 (Reply to #20)
calderwood

Thanks. Have changed settings and am installing atop. Let's see what happens.

Sun, 09/29/2013 - 12:59
calderwood

I installed atop and was able to track down the problem.

Seemed that the dkim-mil filter and SASLAUTHD were out of control and the dkim file was around 1GB and was using 100% CPU. Seemed to have spawned an entire city of processes (150+). Still not sure why. I stopped the service uninstalled and re-installed both programs and so far, so good.

The other program that was using 100% of CPU for 20 minutes at a time was bw.pl (bandwidth monitoring) which I didn't really need so I turned that service off.

Sun, 09/29/2013 - 15:58
Locutus

Good to hear you found the issue! You see, atop really helps in such cases. Also for me it has helped greatly in the past when nailing down processes running rampant. :)