Load average spikes when collectinfo.pl runs

Hi guys,

My production server periodically has load average spikes of up to 100-200. These last for about 3-4 minutes, and then the system either goes completely south and requires a reboot, or it settles back down to its norm of about 1.

I notice that these events always start about 20-25 minutes past the hour and that collectinfo.pl is always in the "run" state from a ps taken during this time.

So I guess I'm looking to know a few things:

1) Are there any known issues with collectinfo.pl that could cause it to spin out of control and/or spawn processes erroneously?

2) What would be the impact if I turned it off?

I'm looking to find the root cause, since this has devastating effects on my customers' ecommerce sites. Their customers start experiencing timeouts and the phones start ringing.

Please help if you can.

Status: 
Active

Comments

collectinfo.pl runs pretty often, usually once every 5 minutes or so. So it isn't surprising to see it running, and it would likely take even longer to finish when the system is loaded.

You could try running top to see what is using the most CPU time when this happens. Also, while top is running, hit M to sort by memory and see what is using the most RAM. High load can often be caused by a process using too much memory.

Unfortunately, I usually can't get in once the load has spiked. I do capture some ps output, which is basically the same as top. An example of what I capture is included below.

What is the impact of turning off collectinfo.pl?

The version I'm looking at in cron runs at 21 minutes after every hour, not every 5 minutes. It's the only constant (other than the http/php-cgi activity) that I see across these "events".

Here is the ps output captured during today's "event" (using "ps -eflFH r"). Note that almost everything is in a "disk wait" state:

F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY        TIME CMD
4 D root     31841 31823  0  78   0 - 20587 sync_p 24276   1 11:21 ?          0:02 /usr/bin/perl /usr/libexec/webmin/virtual-server/collectinfo.pl
0 D ezom     31763 31761  0  76   0 - 43996 sync_p  4792   1 11:20 ?          0:00 /usr/bin/php -f ./launchDispatch.php
0 D root     31732 31730  0  78   0 - 12553 sync_b 15040   1 11:20 ?          0:01 /usr/bin/perl /usr/libexec/webmin/status/monitor.pl
4 D tackle   32417 31107  0  76   0 - 44997 sync_p 19132   1 11:25 ?          0:00 /usr/bin/php-cgi
4 D tackle   32393 31107  0  76   0 - 44997 sync_p 19236   1 11:25 ?          0:00 /usr/bin/php-cgi
4 R tackle   32286 31107  0  78   0 - 44324 -      14376   1 11:23 ?          0:00 /usr/bin/php-cgi
4 D tackle   32273 31107  0  76   0 - 45970 sync_p 17652   1 11:23 ?          0:00 /usr/bin/php-cgi
4 D tackle   32263 31107  0  76   0 - 44324 sync_p 12992   1 11:23 ?          0:00 /usr/bin/php-cgi
4 D tackle   32257 31107  0  76   0 - 44324 sync_p 12856   1 11:23 ?          0:00 /usr/bin/php-cgi
4 D tackle   32239 31107  0  76   0 - 45323 sync_p 15120   1 11:23 ?          0:00 /usr/bin/php-cgi
4 D tackle   32213 31107  0  78   0 - 45323 sync_p 15092   1 11:23 ?          0:00 /usr/bin/php-cgi
4 R tackle   32208 31107  0  76   0 - 45579 -      15736   1 11:23 ?          0:00 /usr/bin/php-cgi
4 R tackle   32197 31107  0  78   0 - 44324 -      13136   1 11:23 ?          0:00 /usr/bin/php-cgi
4 D cork     31738 31107  0  76   0 - 47446 sync_p 15408   1 11:20 ?          0:04 /usr/bin/php-cgi
4 D 509      30066 31107  0  76   0 - 60552 sync_p 11880   1 10:42 ?          0:01 /usr/bin/php-cgi
4 D rmt      19978 31107  0  76   0 - 47719 sync_p 20620   1 06:20 ?          0:04 /usr/bin/php-cgi
4 D 509      15504 31107  0  76   0 - 61069 sync_p  8164   1 04:25 ?          0:32 /usr/bin/php-cgi
5 D apache   32242 23962  0  76   0 - 74720 sync_p 11484   0 11:23 ?          0:00 /usr/sbin/httpd
5 D apache   32240 23962  0  76   0 - 74720 sync_p 11068   1 11:23 ?          0:00 /usr/sbin/httpd
5 D apache   32180 23962  0  76   0 - 74785 sync_p 10572   1 11:22 ?          0:00 /usr/sbin/httpd
5 D apache   32164 23962  0  76   0 - 74720 sync_p 11608   1 11:22 ?          0:00 /usr/sbin/httpd
5 D apache   32159 23962  0  76   0 - 74720 sync_p 11616   1 11:22 ?          0:00 /usr/sbin/httpd
5 D apache   32044 23962  0  76   0 - 74785 sync_p 11088   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   32043 23962  0  76   0 - 74720 sync_p 10992   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   32032 23962  0  76   0 - 74785 sync_p 10592   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   32022 23962  0  76   0 - 74720 sync_p 12196   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   32013 23962  0  76   0 - 74720 sync_p 11120   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   32004 23962  0  76   0 - 74720 sync_p 11300   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   32003 23962  0  77   0 - 74720 sync_p 10924   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31993 23962  0  76   0 - 74720 sync_p 11148   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31991 23962  0  76   0 - 74785 sync_p 11064   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31988 23962  0  76   0 - 92825 -       4436   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31983 23962  0  76   0 - 74785 sync_p 10588   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31982 23962  0  76   0 - 74785 sync_p 10724   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31966 23962  0  76   0 - 74785 sync_p 10684   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31962 23962  0  76   0 - 74785 sync_p 10804   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31960 23962  0  76   0 - 74785 sync_p 10788   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31958 23962  0  76   0 - 74785 sync_p 10736   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31956 23962  0  76   0 - 74785 sync_p 10892   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31952 23962  0  76   0 - 92825 -       4396   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31943 23962  0  76   0 - 92825 -       5032   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31937 23962  0  76   0 - 74785 sync_p 10768   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31935 23962  0  76   0 - 74785 sync_p 10780   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31921 23962  0  76   0 - 92825 -       4296   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31906 23962  0  76   0 - 92825 -       5224   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31898 23962  0  76   0 - 74785 sync_p 10720   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31897 23962  0  76   0 - 74785 sync_p 11272   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31896 23962  0  76   0 - 74720 sync_p 11584   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31895 23962  0  76   0 - 74785 sync_p 10844   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31882 23962  0  76   0 - 92825 -       4600   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31881 23962  0  76   0 - 92825 sync_b  4648   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31857 23962  0  76   0 - 92825 -       4416   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31856 23962  0  78   0 - 92825 -       4080   0 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31840 23962  0  76   0 - 92825 -       4404   1 11:21 ?          0:00 /usr/sbin/httpd
5 D apache   31807 23962  0  78   0 - 92825 -       4100   1 11:20 ?          0:00 /usr/sbin/httpd
5 D apache   31792 23962  0  76   0 - 92825 sync_p  3528   1 11:20 ?          0:00 /usr/sbin/httpd
5 D apache   31717 23962  0  76   0 - 92825 -       4616   1 11:19 ?          0:00 /usr/sbin/httpd
5 D apache   31497 23962  0  76   0 - 74720 sync_b 11224   1 11:13 ?          0:00 /usr/sbin/httpd
5 D apache   30571 23962  0  76   0 - 74850 sync_p 11680   0 10:55 ?          0:00 /usr/sbin/httpd
5 D apache   27407 23962  0  76   0 - 74850 sync_b 12064   1 09:33 ?          0:02 /usr/sbin/httpd
5 D apache   27058 23962  0  76   0 - 74850 sync_b 11448   1 09:22 ?          0:02 /usr/sbin/httpd
5 D apache   20992 23962  0  76   0 - 92825 stext   5340   1 06:49 ?          0:03 /usr/sbin/httpd
4 R root     32570  3810  4  78   0 - 16424 -       1004   0 11:28 pts/0      0:00 ps -eflFH r
4 D postfix  32479  2776  0  76   0 - 14167 sync_b  2992   1 11:26 ?          0:00 smtpd -n smtp -t inet -u -o smtpd_sasl_auth_enable yes -o smtp_bind_address 98.129.216.127
4 D postfix  32467  2776  0  76   0 - 13617 sync_p  2344   0 11:26 ?          0:00 cleanup -z -t unix -u
5 D root     17795     1  0  76   0 - 21839 sync_p  2748   1 Jun26 ?          3:21 /usr/libexec/webmin/virtual-server/lookup-domain-daemon.pl

Lovely, it formatted just fine in the edit window...
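
In case it helps anyone reading a capture like the one above, here is a rough Python sketch that tallies the S (state) and WCHAN columns, so it's easier to see how many processes are in disk wait ("D") and what they are waiting on. It assumes the 17-column layout printed by "ps -eflFH" as shown in this post; the function name summarize_ps and the SAMPLE string are just made up for illustration, and the three sample rows are copied from the capture.

```python
from collections import Counter

# Hypothetical helper, not part of Webmin/Virtualmin. It assumes the
# 17-column layout of "ps -eflFH":
# F S UID PID PPID C PRI NI ADDR SZ WCHAN RSS PSR STIME TTY TIME CMD
def summarize_ps(text):
    """Tally process states, wait channels, and executables from ps output."""
    states, wchans, commands = Counter(), Counter(), Counter()
    for line in text.splitlines():
        # Split into at most 17 fields so the command line stays in one piece.
        fields = line.split(None, 16)
        # Skip blank or short lines and the header row (first column is "F").
        if len(fields) < 17 or fields[0] == "F":
            continue
        states[fields[1]] += 1                    # S column: D, R, S, ...
        wchans[fields[10]] += 1                   # WCHAN column: sync_p, ...
        commands[fields[16].split()[0]] += 1      # executable path only
    return states, wchans, commands


# Three rows copied from the capture above, plus the header.
SAMPLE = """\
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY        TIME CMD
4 D root     31841 31823  0  78   0 - 20587 sync_p 24276   1 11:21 ?          0:02 /usr/bin/perl /usr/libexec/webmin/virtual-server/collectinfo.pl
4 R tackle   32286 31107  0  78   0 - 44324 -      14376   1 11:23 ?          0:00 /usr/bin/php-cgi
5 D apache   32242 23962  0  76   0 - 74720 sync_p 11484   0 11:23 ?          0:00 /usr/sbin/httpd
"""

states, wchans, commands = summarize_ps(SAMPLE)
print("states:", dict(states))
print("wait channels:", dict(wchans))
print("commands:", dict(commands))
```

Feeding it the full capture instead of SAMPLE gives the same three tallies for every process listed, which makes it easier to compare one "event" against another.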

Turning off collectinfo.pl will break Virtualmin's system statistics graphs, and will make the System Information page slower to load. However, everything else will still run fine.

You might try SSHing in first, running top, then waiting for the problem to occur.
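
If keeping an SSH session open isn't practical, the same idea can be left running in the background. Below is a minimal Python sketch, not anything shipped with Virtualmin: it polls the load average and, once the 1-minute figure crosses a threshold, appends a ps listing to a log file. The names load_exceeds, format_snapshot and watch, the log path, the threshold and the polling interval are all assumptions chosen for illustration. It needs Python 3.7 or later for capture_output.

```python
import os
import subprocess
import time

# Columns chosen for illustration; all are standard procps format specifiers.
PS_COMMAND = ["ps", "-eo", "state,pid,user,wchan:20,etime,args"]


def load_exceeds(threshold, loadavg=None):
    """Return True if the 1-minute load average is at or above threshold."""
    if loadavg is None:
        loadavg = os.getloadavg()
    return loadavg[0] >= threshold


def format_snapshot(timestamp, loadavg, ps_output):
    """Build one log record: a header line followed by the ps listing."""
    header = "=== %s load %.2f %.2f %.2f ===\n" % (
        timestamp, loadavg[0], loadavg[1], loadavg[2])
    return header + ps_output.rstrip() + "\n"


def watch(log_path="/root/load-watch.log", threshold=5.0, interval=10):
    """Poll forever; log a ps snapshot whenever the load is high.

    log_path, threshold and interval are hypothetical defaults; adjust them.
    """
    while True:
        loadavg = os.getloadavg()
        if load_exceeds(threshold, loadavg):
            result = subprocess.run(PS_COMMAND, capture_output=True, text=True)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            with open(log_path, "a") as log:
                log.write(format_snapshot(stamp, loadavg, result.stdout))
        time.sleep(interval)
```

Calling watch() from a script started before 20 past the hour (for example inside screen or with nohup) should record the first processes to pile up. One caveat: if the disk itself is what everything is waiting on, the logger's own writes may stall too, so the earliest snapshots are likely to be the most useful ones.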