Beancounter FAILCNT issue - server stops

#1 Wed, 06/22/2016 - 08:35
calderwood

My Virtualmin GPL server stops randomly at various times of the day and night. I'm not seeing anything in the logs. Apache does NOT stop, but all other services do, including named, webmin, saslauth, etc. I'm running CentOS 6.6 and Virtualmin 5.03. I have 4GB of RAM, but the only errors I find are failcnt for privvmpages in beancounters. From what I've read this is related to a lack of RAM, but the numbers don't support that. What am I missing?

Version: 2.5
       uid  resource                     held              maxheld              barrier                limit              failcnt
10086860:  kmemsize                 43551524             45789054  9223372036854775807  9223372036854775807                    0
            lockedpages                     0                    0  9223372036854775807  9223372036854775807                    0
            privvmpages                535592               582425              2097152              2097152                   61
            shmpages                     1206                 1206  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            numproc                       134                  138                32567                32567                    0
            physpages                  292724               334227  9223372036854775807  9223372036854775807                    0
            vmguarpages                     0                    0              1048576  9223372036854775807                    0
            oomguarpages               292725               334228  9223372036854775807  9223372036854775807                    0
            numtcpsock                     42                   44  9223372036854775807  9223372036854775807                    0
            numflock                      242                  245  9223372036854775807  9223372036854775807                    0
            numpty                          1                    1                  255                  255                    0
            numsiginfo                      2                    3                 1024                 1024                    0
            tcpsndbuf                 1844960              1880032  9223372036854775807  9223372036854775807                    0
            tcprcvbuf                  696896               729664  9223372036854775807  9223372036854775807                    0
            othersockbuf               316096              1524640  9223372036854775807  9223372036854775807                    0
            dgramrcvbuf                     0                32096  9223372036854775807  9223372036854775807                    0
            numothersock                  193                  222  9223372036854775807  9223372036854775807                    0
            dcachesize                2841694              2972215  9223372036854775807  9223372036854775807                    0
            numfile                     13820                14459  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            numiptent                      61                   61  9223372036854775807  9223372036854775807                    0
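
To put the page counts above into more familiar units, here is a minimal Python sketch that parses beancounter rows and converts pages to MiB. It assumes a 4 KiB page size (worth confirming with `getconf PAGESIZE`), and the function names are made up for illustration. On a live container the text would come from /proc/user_beancounters, which normally needs root to read.

```python
PAGE_SIZE = 4096  # bytes per page: an assumption, verify with `getconf PAGESIZE`
FIELDS = ("held", "maxheld", "barrier", "limit", "failcnt")

# Three rows copied from the output posted above.
SAMPLE = """\
10086860:  kmemsize     43551524  45789054  9223372036854775807  9223372036854775807  0
            privvmpages    535592    582425  2097152  2097152  61
            physpages      292724    334227  9223372036854775807  9223372036854775807  0
"""

def parse_beancounters(text):
    """Return {resource: {held, maxheld, barrier, limit, failcnt}} from beancounter text."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # The first row of each container carries a leading "uid:" column; drop it.
        if parts and parts[0].endswith(":"):
            parts = parts[1:]
        # Data rows are a resource name followed by five integers; skip headers.
        if len(parts) != 6 or not all(p.isdigit() for p in parts[1:]):
            continue
        counters[parts[0]] = dict(zip(FIELDS, map(int, parts[1:])))
    return counters

def pages_to_mib(pages):
    """Convert a page count to MiB using the assumed page size."""
    return pages * PAGE_SIZE / (1024 * 1024)

counters = parse_beancounters(SAMPLE)

# List every resource that has hit its barrier or limit at least once.
for name, row in counters.items():
    if row["failcnt"]:
        print(name, "failcnt:", row["failcnt"])

pv = counters["privvmpages"]
for field in ("held", "maxheld", "barrier"):
    print("privvmpages %-8s %.0f MiB" % (field, pages_to_mib(pv[field])))
```

With the numbers posted above and 4 KiB pages, privvmpages held works out to roughly 2092 MiB, maxheld to roughly 2275 MiB, and the barrier to 8192 MiB.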
Wed, 06/22/2016 - 13:56
Diabolico

Unless you have 15+ WP websites, or just one with code leaking everywhere, I would blame the host. Still, you didn't provide much info, so it's hard to say. Before anything else: if you have WP or any other CMS with nulled/hacked themes or plugins, you can stop right now, because you have found the reason for what is happening: you got hacked.

If this is not your case, then:

  1. First, check the Apache and MySQL logs. It may also be worth taking a look at the messages and secure logs. Use a free service to monitor your server at a 1-minute interval; uptimedoctor.com is good, but there are many others. Just make sure the interval is 1 minute. It will show you precisely when your server went down and make it easier to check the log files.

  2. Google mysqltuner, save it to your server (/root) and execute it. This will show you what state your DB is in, with suggestions for what to change. If you are not a sysadmin, I would strongly suggest googling the results before making any changes. If you are using InnoDB instead of MyISAM, then before any changes please read this: http://stackoverflow.com/questions/3927690/howto-clean-a-mysql-innodb-st...

  3. If you have WP, pay special attention to the Apache log and see how many direct hits you got on the login page or xmlrpc.php (especially the second one). If your login page is directly accessible, you must block this with .htaccess (less complicated) or by changing the name and location of your login file and admin folder (much more complicated and prone to errors). As for xmlrpc.php, if you don't use Jetpack it's pretty safe to block all access to this file.
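
For step 3, a rough Python sketch of counting hits on wp-login.php and xmlrpc.php per client address could look like this. It assumes the Apache common/combined log format; the sample lines and addresses are made up for illustration, and `count_hits` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the start of a common/combined-format access log line:
# client address, identity, user, [timestamp], then the quoted request line.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)')

# Endpoints to watch, as named in step 3.
TARGETS = ("/wp-login.php", "/xmlrpc.php")

def count_hits(lines):
    """Count requests per (client, endpoint) for the watched WordPress files."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        client, _method, path = match.groups()
        path = path.split("?", 1)[0]  # ignore any query string
        for target in TARGETS:
            if path.endswith(target):
                hits[(client, target)] += 1
                break
    return hits

# Made-up sample lines (documentation-range addresses), not taken from the server.
SAMPLE_LOG = [
    '192.0.2.10 - - [22/Jun/2016:05:40:01 +0000] "POST /xmlrpc.php HTTP/1.1" 200 370 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [22/Jun/2016:05:40:02 +0000] "POST /xmlrpc.php HTTP/1.1" 200 370 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [22/Jun/2016:05:40:03 +0000] "POST /xmlrpc.php HTTP/1.1" 200 370 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [22/Jun/2016:05:41:10 +0000] "POST /wp-login.php HTTP/1.1" 200 1500 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [22/Jun/2016:05:41:12 +0000] "GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

hits = count_hits(SAMPLE_LOG)
for (client, target), count in hits.most_common():
    print(client, target, count)
```

To run it against a real log, pass an open file instead of the sample list, e.g. `count_hits(open(path))`, where `path` is wherever that domain's access log lives on your system.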

This is what I first thought after reading your post. Can you SSH to your server, check your memory, disk and CPU usage, and post it here? The more information you post, the easier it will be for someone to check and maybe give you some direction on what to do. Don't forget to say what you have on that server because, as I said earlier, in the case of a CMS it could be some theme or plugin with bad code consuming all your memory.

For the sake of your health, before you change anything please make a local copy of your websites and all files you intend to modify.

- I often come to the conclusion that my brain has too many tabs open. -
Failing at desktop publishing & graphic design since 1994.

Wed, 06/22/2016 - 14:37 (Reply to #2)
calderwood

I run about 80 virtual servers and I'd guess that 80% have WordPress, Joomla or an eCommerce CMS. System is CentOS Linux 6.6, Webmin 1.801, Virtualmin 5.03, with 4GB guaranteed RAM. I do monitor the server and know exactly what time the services stop, but there is nothing in the logs for messages, Webmin, MySQL (and others) that gives me a clue. Here are some of the log items; you can see from the messages that the server stops at 5:46 and starts at 5:50:

Jun 22 05:42:14 ip-68-178-130-21 clamd[19509]: SelfCheck: Database status OK.
Jun 22 05:42:30 ip-68-178-130-21 named[9904]: client 12.161.74.244#42513: query (cache) '143.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:42:30 ip-68-178-130-21 named[9904]: client 12.161.74.244#42513: query (cache) '143.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:43:26 ip-68-178-130-21 named[9904]: client 12.161.74.244#35101: query (cache) '146.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:43:26 ip-68-178-130-21 named[9904]: client 12.161.74.244#35101: query (cache) '146.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:44:23 ip-68-178-130-21 named[9904]: client 12.161.74.244#22515: query (cache) '147.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:44:23 ip-68-178-130-21 named[9904]: client 12.161.74.244#22515: query (cache) '147.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:45:19 ip-68-178-130-21 named[9904]: client 12.161.74.244#5546: query (cache) '148.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:45:19 ip-68-178-130-21 named[9904]: client 12.161.74.244#5546: query (cache) '148.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:46:15 ip-68-178-130-21 named[9904]: client 12.161.74.244#46301: query (cache) '149.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:46:15 ip-68-178-130-21 named[9904]: client 12.161.74.244#46301: query (cache) '149.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:50:03 ip-68-178-130-21 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Jun 22 05:50:03 ip-68-178-130-21 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="27953" x-info="http://www.rsyslog.com"] start
Jun 22 05:50:09 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 27860 due to rate-limiting
Jun 22 05:50:10 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 38 messages from pid 27860 due to rate-limiting
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: clamd daemon 0.99.1 (OS: linux-gnu, ARCH: x86_64, CPU: x86_64)
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Running as user clam (UID 497, GID 498)
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Log file size limited to 4294967295 bytes.
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Reading databases from /var/lib/clamav
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Not loading PUA signatures.
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Bytecode: Security mode set to "TrustSigned".
Jun 22 05:52:25 ip-68-178-130-21 clamd[28630]: Loaded 4543230 signatures.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: TCP: Bound to [127.0.0.1]:3310
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: TCP: Setting connection queue length to 30
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: LOCAL: Removing stale socket file /var/run/clamav/clamd.sock
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: LOCAL: Unix socket file /var/run/clamav/clamd.sock
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: LOCAL: Setting connection queue length to 30
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: Global size limit set to 104857600 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: File size limit set to 26214400 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: Recursion level limit set to 16.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: Files limit set to 10000.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxEmbeddedPE limit set to 10485760 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxHTMLNormalize limit set to 10485760 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxHTMLNoTags limit set to 2097152 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxScriptNormalize limit set to 5242880 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxZipTypeRcg limit set to 1048576 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxPartitions limit set to 50.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxIconsPE limit set to 100.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxRecHWP3 limit set to 16.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: PCREMatchLimit limit set to 10000.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: PCRERecMatchLimit limit set to 5000.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: PCREMaxFileSize limit set to 26214400.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Archive support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Algorithmic detection enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Portable Executable support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: ELF support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Detection of broken executables enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Mail files support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: OLE2 support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: PDF support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: SWF support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: HTML support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: XMLDOCS support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: HWP3 support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Heuristic: precedence enabled
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Self checking every 600 seconds.
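
A gap like the one in the log above can also be located automatically. The following is a small Python sketch, not a finished tool: it assumes the classic syslog timestamp prefix (e.g. `Jun 22 05:46:15`), an English locale for month names, and a year supplied by the caller because syslog lines do not carry one. `find_gaps` and the 2-minute threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=2), year=2016):
    """Return (last_seen, next_seen) pairs where consecutive log lines are
    further apart than `threshold`."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # The first 15 characters hold the "Mon DD HH:MM:SS" timestamp.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps

# A few lines taken from the log excerpt above (messages shortened).
SAMPLE_LOG = [
    "Jun 22 05:45:19 ip-68-178-130-21 named[9904]: client 12.161.74.244#5546: query (cache) denied",
    "Jun 22 05:46:15 ip-68-178-130-21 named[9904]: client 12.161.74.244#46301: query (cache) denied",
    "Jun 22 05:50:03 ip-68-178-130-21 kernel: imklog 5.8.10, log source = /proc/kmsg started.",
    "Jun 22 05:50:09 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages",
]

gaps = find_gaps(SAMPLE_LOG)
for before, after in gaps:
    print("no log lines between", before, "and", after)
```

On the sample lines this flags the silence between 05:46:15 and 05:50:03, which lines up with the rsyslogd restart.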

The other messages I've been seeing are:

Jun 20 06:04:32 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 15492 due to rate-limiting
Jun 20 06:04:35 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 62 messages from pid 15492 due to rate-limiting
Jun 20 09:59:56 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 15492 due to rate-limiting
Jun 20 10:02:37 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 40 messages from pid 15492 due to rate-limiting
Jun 20 12:47:36 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 23766 due to rate-limiting
Jun 20 12:47:39 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 23826 due to rate-limiting
Jun 20 12:47:45 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 410 messages from pid 23826 due to rate-limiting
Jun 20 12:51:56 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 24509 due to rate-limiting
Jun 20 12:52:00 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 24562 due to rate-limiting
Jun 20 12:52:06 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 437 messages from pid 24562 due to rate-limiting

David Calderwood - Euro-Pacific Digital Media

Wed, 06/22/2016 - 12:47
andreychek

Howdy,

OpenVZ can be a bit rough when it comes to resource limits.

It does look like you're running into memory limits there... with OpenVZ, if you are using what they call burstable (non-guaranteed) RAM, it's possible that a process could be killed off if another Virtual Machine on the same host requires RAM.

To resolve that, you'd either need to increase how much RAM is allocated to your Virtual Machine, or free up some more RAM. Or ask your provider for all guaranteed RAM.

-Eric

Wed, 06/22/2016 - 13:45 (Reply to #4)
calderwood

I have 4GB guaranteed with another 4GB burst, but I will check with the provider that I am getting 4/4.

             total       used       free     shared    buffers     cached
Mem:          4096       2119       1976          0          0          0
-/+ buffers/cache:       2119       1976
Swap:            0          0          0

Wed, 06/22/2016 - 14:18
Diabolico

Burst RAM is just optional and should be used for rare cases where you need a little more than your guaranteed RAM, and for a short amount of time. Counting on burstable RAM as something free and available 24/7 will not do any good. If you need more than 4GB then you should buy it.

Dedicated RAM: RAM you are guaranteed access to at all times.
Burst RAM: RAM you can access if no one else is using it.

To make it simple, burst RAM is actually RAM "stolen" or "borrowed" (whichever you like to call it) from other accounts. If your host says they will guarantee 4/4, you should immediately leave and find a better one, because I'm pretty sure they are overselling that server like hell. If the host can guarantee all 4GB of burst RAM all the time, then why not offer you 8GB without burst RAM for the same price?

Wed, 06/22/2016 - 14:41
calderwood

Sorry, I mis-typed. I only have 4GB RAM guaranteed, so yes, I only have 4GB. There is also a burst of 4GB, not guaranteed. But I have not been hitting 4; average usage is 2-3GB, so I don't think it's a memory issue.

Wed, 06/22/2016 - 23:18 (Reply to #7)
coderinthebox

You can do this in the meantime:

Increase the number of messages allowed and the time interval before rate limiting kicks in for rsyslog. To do this, locate rsyslog.conf and/or rsyslog.early.conf (usually in /etc) and add the following lines:

$SystemLogRateLimitInterval 10
$SystemLogRateLimitBurst 500

This allows 500 messages within a span of 10 seconds before rate limiting starts.

You may have been the victim of a brute-force attack, and with 80 virtual hosts that is 80x the amount of logs.

The first thing you can do is move SSH to a port higher than 1500, since most bots don't attempt to scan those ports. In practice this solves a lot of issues. For named, disable recursion (you seem to have it enabled) or only allow it from a limited set of IP addresses/hosts.

Visit me at coderinthebox.com

Thu, 06/23/2016 - 15:13 (Reply to #8)
calderwood

Thanks for the response. I've added the rate-limiting lines and I'll see how that goes. I'll also move the SSH port; always meant to do that anyway. I already had BIND set to "Allow recursive queries from localhost". I'm wondering if I don't have it set correctly? I have in named.conf:

options {
        allow-recursion {
                localnets;
                localhost;
        };
        // ... other options ...
};

Wed, 06/22/2016 - 15:29
Diabolico

Jun 22 05:46:15 ip-68-178-130-21 named[9904]: client 12.161.74.244#46301: query (cache) '149.126.10.10.in-addr.arpa/PTR/IN' denied

Check named.conf; this message usually appears when named is missing some directives. While you are there, check whether you disabled recursion.

Jun 22 05:50:09 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 27860 due to rate-limiting

If I'm not mistaken, this is rsyslog limiting messages in your log. That means something must be flooding your logs, as rate limiting kicks in when a process sends more than 100-200 messages in just a few seconds. A good start would be to find the process behind each PID; then it should be much easier to find the problem. From the logs you posted here, rsyslog dropped close to 1000 lines in about seven hours, and the lack of any useful information in the logs could be because rsyslog is dropping it.
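
To see which PIDs are responsible, the "lost N messages" lines can be totalled per PID. This is a minimal Python sketch that assumes the exact message wording quoted in this thread; `dropped_per_pid` is a hypothetical helper name.

```python
import re
from collections import Counter

# Wording as it appears in the rsyslog lines quoted in this thread.
LOST_RE = re.compile(r"imuxsock lost (\d+) messages from pid (\d+)")

def dropped_per_pid(lines):
    """Sum the number of dropped messages reported for each PID."""
    dropped = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            count, pid = int(match.group(1)), int(match.group(2))
            dropped[pid] += count
    return dropped

# The four "lost" lines from the Jun 20 excerpt posted earlier.
SAMPLE_LOG = [
    "Jun 20 06:04:35 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 62 messages from pid 15492 due to rate-limiting",
    "Jun 20 10:02:37 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 40 messages from pid 15492 due to rate-limiting",
    "Jun 20 12:47:45 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 410 messages from pid 23826 due to rate-limiting",
    "Jun 20 12:52:06 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 437 messages from pid 24562 due to rate-limiting",
]

dropped = dropped_per_pid(SAMPLE_LOG)
for pid, count in dropped.most_common():
    print("pid", pid, "dropped", count)
print("total dropped:", sum(dropped.values()))
```

While a noisy process is still running, something like `ps -p <pid> -o pid,comm,args` should show what it is; short-lived processes may already be gone by the time you look.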

Last but not least, 80 accounts of which 80% run a CMS is, I would say, too much for this server. If everything is configured perfectly so you can fully utilize your server resources, and the majority of the CMS installs are optimized and have low traffic, then it could be possible to run all this without problems, but I'm not sure this is your case.

Sat, 06/25/2016 - 11:14
calderwood

I tried to make the recursive queries more strict, but it stopped all outgoing mail. I have in named.conf:

options {
        allow-recursion {
                localnets;
                localhost;
        };
        // ... other options ...
};

If I set "Do full recursive lookups for clients?" to No (BIND > Miscellaneous Options) while "Send outgoing mail via host" is set to [dedrelay.secureserver.net] (Postfix > General Options; my relay is required for all mail), all outgoing mail stops. If I set "Do full recursive lookups for clients?" back to the default, mail sends again.

What are the setup options for "Map for allowed addresses for relaying" in Postfix SMTP Server Options?

And to top it all off, the server totally stopped at 6:03 AM today. The last log entry was at 12:23 AM.

Sat, 06/25/2016 - 14:37
Diabolico

For named.conf you can try this:

acl "trusted" {
        localhost;
        localnets;
};

options {

        // ............. some stuff .................

        version "unknown";
        allow-transfer { trusted; };
        allow-recursion { trusted; };
        allow-query-cache { trusted; };
        // "recursion no" turns recursion off for everyone, including the
        // "trusted" ACL; allow-recursion only has an effect with "recursion yes".
        recursion no;
        additional-from-cache no;
        allow-query { any; };

        // ............. some more stuff .................

};

logging {
        channel default_debug {
                file "/var/named/data/named.run";
                severity dynamic;
                print-category yes;
                print-severity yes;
                print-time yes;
        };
};

It's missing part of the code, but this should be enough to give you some idea of what to do. For the rest, it is best to read about named.conf and fine-tune depending on what you need; e.g. if you have a master/slave setup you will have to make some minor modifications.

Whatever you do, remember to disable recursion or limit it to specific IP(s), or your server could be used for DDoS amplification attacks, and that's pretty bad.
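
One way to check from the outside whether the server still offers recursion is to send it a query with the "recursion desired" bit set and look at the "recursion available" (RA) bit in the reply. The Python sketch below uses only the standard library. The function names are made up for illustration, and the interpretation is an assumption: a reply with RA cleared suggests recursion is not offered to the address you are querying from. Run `probe` from a machine that is NOT in your trusted ACL.

```python
import socket
import struct

def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record with the RD bit set."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def recursion_available(response):
    """Return True if the RA bit (0x0080) is set in a DNS response header."""
    if len(response) < 12:
        raise ValueError("truncated DNS header")
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & 0x0080)

def probe(server, name="example.com", timeout=3.0):
    """Send one UDP query to `server` and report whether it offers recursion to us."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        response, _ = sock.recvfrom(512)
    return recursion_available(response)
```

Usage would be along the lines of `probe("your.server.ip")`, where the address is a placeholder for your own name server; a result of `True` from an untrusted address means the configuration deserves another look.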

Sat, 06/25/2016 - 19:49
calderwood

Thanks. I'll give this a go and see where I end.
