High System CPU Load Average

#1 Fri, 08/29/2014 - 10:57
JamesSimpson

Hi All,

I am totally puzzled at the moment as to what Virtualmin is doing. After recently updating everything to the latest versions, I am getting the following CPU load averages and constant alerts from CSF.

CPU load averages 9.45 (1 min) 9.32 (5 mins) 9.77 (15 mins)

Running top via ssh I get the following

Processes: 175 total, 2 running, 4 stuck, 169 sleeping, 944 threads    16:54:15
Load Avg: 1.16, 1.13, 1.13  CPU usage: 3.74% user, 2.72% sys, 93.53% idle
SharedLibs: 14M resident, 14M data, 0B linkedit.
MemRegions: 55177 total, 917M resident, 48M private, 345M shared.
PhysMem: 2845M used (1000M wired), 4237M unused.
VM: 447G vsize, 1073M framework vsize, 11607078(0) swapins, 14171139(0) swapouts
Networks: packets: 14989373/17G in, 10427533/1423M out.
Disks: 2651509/109G read, 2162583/222G written.

PID    COMMAND      %CPU TIME     #TH  #WQ  #PORT #MREGS MEM    RPRVT  PURG
19094  mdworker     0.0  00:00.03 3    0    52    67     2196K  1340K  0B
19093  mdworker     0.0  00:00.03 3    0    52    69     3084K  2228K  0B
19092  syncdefaults 0.0  00:00.28 6    2    88    82     5132K  3952K  0B
19091  mdworker     0.0  00:00.06 3    0    52    69     5164K  4256K  0B
19089  top          9.3  00:14.13 1/1  0    26    41     2204K  1972K  0B
19086  bash         0.0  00:00.00 1    0    19    31     616K   448K   0B
19085  login        0.0  00:00.01 2    0    30    52     1168K  840K   0B
19078  TextEdit     0.0  00:00.27 5    2    170   184    13M    6556K  20K
19070  CVMCompiler  0.0  00:00.73 2    1    32    80     24M    24M    12K
19067  Terminal     24.0 00:03.02 13   7    179   212    20M+   15M+   80K
19057  com.apple.We 0.0  00:02.84 14   2    183   331    28M    25M    36K
19055  netbiosd     0.0  00:00.07 2    1    42    53     1888K  1484K  0B
19049  com.apple.iC 0.0  00:00.24 4    0    82    82     3892K  3112K  0B
19040  rpcsvchost   0.0  00:00.02 16   1    44    82     1428K  1092K  0B

I'm not sure where Virtualmin is pulling those averages from, or what is causing them. At first I thought my server had been hacked and was sending out spam, but there is nothing in the mail queue.

Anyone got any ideas? Restarting my server gets it back down to the usual average of 0.3 for a day or two, then it starts to build back up.

I got an alert for a 5-minute load average of 11.4 about an hour ago. The websites aren't getting any more hits than usual, so it can't be that...

Fri, 08/29/2014 - 12:02
andreychek

Howdy,

Hmm, the output above appears to be from an Apple computer, not a Linux server that would be running Virtualmin. Is that process information from the correct system?

-Eric

Fri, 08/29/2014 - 13:43
JamesSimpson

Oops, you are correct. That's what I get for posting in haste. That said, I cannot connect to the server by SSH: it asks me for a login, I enter my password, and then it just stays blank :S

Fri, 08/29/2014 - 13:48
JamesSimpson

At this moment in time, it's now running at 11.4

CPU load averages: 11.30 (1 mins) , 11.25 (5 mins) , 11.22 (15 mins)
CPU type: Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz , 4 cores

21916 jamessimpson 3.0 % /usr/bin/php-cgi
22225 jamessimpson 3.0 % /usr/bin/php-cgi
21915 jamessimpson 2.0 % /usr/bin/php-cgi
23138 root 1.2 % /usr/libexec/webmin/proc/index_cpu.cgi
1772 mysql 0.5 % /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-e ...
19 root 0.4 % [events/0]
14555 drivingroads 0.4 % /usr/bin/php-cgi
14797 drivingroads 0.4 % /usr/bin/php-cgi
6827 bojotoolstore 0.3 % /usr/bin/php-cgi
7484 bojotoolstore 0.3 % /usr/bin/php-cgi
15398 drivingroads 0.3 % /usr/bin/php-cgi
18444 bojotoolstore 0.2 % /usr/bin/php-cgi
22486 apache 0.2 % /usr/sbin/httpd
78 root 0.1 % [kipmi0]
23139 root 0.1 % /usr/bin/perl /usr/libexec/webmin/miniserv.pl /etc/webmin/miniserv.conf
1 root 0.0 % /sbin/init

Fri, 08/29/2014 - 14:47
andreychek

Howdy,

Well, there are a number of PHP-related processes there... it's possible that means one or more of your sites is seeing an influx of traffic.

However, what is the output of these commands:

free -m
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -15

Also, can you run the command "ps auxw", and attach that output as a text file?

-Eric

Fri, 08/29/2014 - 14:53
JamesSimpson

That's the thing: I cannot get onto SSH at the moment. It lets me log in but then won't let me type anything.

It has happened before, and I had to restart the server to get access again, which would mean things run normally again for a day or two.

Fri, 08/29/2014 - 15:29
JamesSimpson

Finally managed to connect

Top:

top - 21:26:57 up 4 days, 21:58, 12 users,  load average: 21.79, 20.18, 17.46
Tasks: 256 total,   1 running, 248 sleeping,   0 stopped,   7 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16321220k total, 15633400k used,   687820k free,   390020k buffers
Swap:  2097144k total,     7880k used,  2089264k free, 11586296k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
   19 root      20   0     0    0    0 D  0.7  0.0  34:16.18 events/0          
   61 root      39  19     0    0    0 S  0.3  0.0   0:20.72 khugepaged        
5119 root      20   0  153m  15m 1668 S  0.3  0.1   0:34.30 lfd               
    1 root      20   0 19356 1476 1232 S  0.0  0.0   0:00.62 init              
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.05 kthreadd          
    3 root      RT   0     0    0    0 S  0.0  0.0   0:02.98 migration/0       
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.69 ksoftirqd/0       
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0       
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.59 watchdog/0        
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.64 migration/1       
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1       
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.58 ksoftirqd/1       
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.38 watchdog/1        
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.39 migration/2       
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2       
   13 root      20   0     0    0    0 S  0.0  0.0   0:01.15 ksoftirqd/2       
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.35 watchdog/2

Fri, 08/29/2014 - 15:58
JamesSimpson

And now SSH is frozen again, and I cannot get past successful authentication

Sat, 08/30/2014 - 03:47
Locutus

In your latest "top" output, there seem to be no processes using any considerable CPU power, yet your system load is excessively high. This could indicate that the system is waiting a great deal for other resources (RAM, HDD, network) to become available. Might indicate an overload there or hardware issues.

Also I noticed 12 users logged on, and 7 zombie processes. Those might be hanging sessions of your failed attempts to log on via SSH, but you might want to check those out, using the commands "w" and "last".

I also recommend the tool "atop" over "top", since it displays more information like disk, memory, swap and network usage, and records historical data, for later review. atop shows zombie processes with a "Z" in the state column.

You might have to hard-reboot the server if you can't reliably get in via SSH anymore. A system load of 20 will most likely prevent you from doing any serious work on the server.

When you can get in again, you might want to review the system and kernel logs, and install atop.
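
One way to look for the zombie and stuck processes mentioned above is to list processes by state. This is only a minimal sketch: it assumes the procps version of ps found on most Linux systems, and filter_states is a made-up helper name, not a standard tool. It keeps only processes whose state starts with Z (zombie) or D (uninterruptible sleep, usually waiting on I/O).

```shell
# filter_states: hypothetical helper, reads "PID STAT COMMAND" lines on stdin
# and prints only those whose state begins with D or Z.
filter_states() {
    awk '$2 ~ /^[DZ]/'
}

# Usage on a Linux host (assumes procps ps):
#   ps -eo pid,stat,comm | filter_states
```

The header line printed by ps is dropped automatically, because its second column ("STAT") does not start with D or Z.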

Sat, 08/30/2014 - 05:14
JamesSimpson

Right, I have had to restart the server, as last night it got up to a 40.1 CPU average. After restarting this morning I am able to get back in over SSH.

Output from atop
atop

ATOP - JSServer01 2014/08/30 11:05:26 --------- 10s elapsed
PRC | sys 0.14s | user 1.49s | #proc 182 | #zombie 0 | #exit 5 |
CPU | sys 2% | user 15% | irq 0% | idle 378% | wait 5% |
cpu | sys 1% | user 11% | irq 0% | idle 83% | cpu000 w 5% |
cpu | sys 0% | user 4% | irq 0% | idle 96% | cpu002 w 0% |
cpu | sys 0% | user 0% | irq 0% | idle 99% | cpu001 w 0% |
cpu | sys 0% | user 0% | irq 0% | idle 100% | cpu003 w 0% |
CPL | avg1 0.17 | avg5 0.39 | avg15 0.36 | csw 5269 | intr 2754 |
MEM | tot 15.6G | free 12.7G | cache 811.7M | buff 86.2M | slab 353.2M |
SWP | tot 2.0G | free 2.0G | | vmcom 2.7G | vmlim 9.8G |
LVM | Group00-root | busy 5% | read 10 | write 192 | avio 2.62 ms |
DSK | sda | busy 5% | read 10 | write 71 | avio 6.53 ms |
NET | transport | tcpi 38 | tcpo 37 | udpi 0 | udpo 0 |
NET | network | ipi 47 | ipo 37 | ipfrw 0 | deliv 38 |
NET | em1 0% | pcki 66 | pcko 37 | si 4 Kbps | so 24 Kbps |
NET | lo ---- | pcki 10 | pcko 10 | si 0 Kbps | so 0 Kbps |

PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/5
2168 0.02s 0.82s 0K 0K 0K 8K -- - S 8% php-cgi
2383 0.01s 0.30s 0K 0K 0K 0K -- - S 3% php-cgi
1866 0.03s 0.27s 0K 0K 36K 100K -- - S 3% mysqld
2224 0.01s 0.04s 75780K 20K 48K 88K -- - S 1% httpd
4131 0.01s 0.04s 0K 0K - - NE 0 E 1%
78 0.03s 0.00s 0K 0K 0K 0K -- - S 0% kipmi0

It is showing normal usage now, so not sure what the hell is going on after a day or two.

While installing atop I did get a warning:
There are unfinished transactions remaining. You might consider running yum-complete-transaction first to finish them.

So I ran that too, and it looks as if I cannot install what is required
yum-complete-transaction
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.melbourne.co.uk
* epel: mirror.bytemark.co.uk
* extras: mirror.bytemark.co.uk
* updates: mirrors.ukfast.co.uk
Checking for new repos for mirrors
There are 1 outstanding transactions to complete. Finishing the most recent one
The remaining transaction had 10 elements left to run
--> Running transaction check
---> Package automake.noarch 0:1.11.1-4.el6 will be installed
---> Package cloog-ppl.x86_64 0:0.15.7-1.2.el6 will be installed
---> Package cpp.x86_64 0:4.4.7-4.el6 will be installed
---> Package gcc.x86_64 0:4.4.7-4.el6 will be installed
---> Package gcc-c++.x86_64 0:4.4.7-4.el6 will be installed
---> Package libgomp.x86_64 0:4.4.7-4.el6 will be installed
---> Package libstdc++-devel.x86_64 0:4.4.7-4.el6 will be installed
---> Package mpfr.x86_64 0:2.4.1-6.el6 will be installed
---> Package php-devel.x86_64 0:5.3.3-27.el6_5 will be installed
--> Processing Dependency: php(x86-64) = 5.3.3-27.el6_5 for package: php-devel-5.3.3-27.el6_5.x86_64
---> Package ppl.x86_64 0:0.10.2-11.el6 will be installed
--> Finished Dependency Resolution
Error: Package: php-devel-5.3.3-27.el6_5.x86_64 (updates)
Requires: php(x86-64) = 5.3.3-27.el6_5
Installed: php-5.3.3-27.el6_5.1.x86_64 (@updates)
php(x86-64) = 5.3.3-27.el6_5.1
Available: php-5.3.3-26.el6.x86_64 (base)
php(x86-64) = 5.3.3-26.el6
Available: php-5.3.3-27.el6_5.x86_64 (updates)
php(x86-64) = 5.3.3-27.el6_5
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

Running free -m now (kind of pointless, as it is back to normal now)

free -m
             total       used       free     shared    buffers     cached
Mem:         15938       2921      13017          0         88        816
-/+ buffers/cache:       2016      13922
Swap:         2047          0       2047

And the netstat


netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -15
19
4 127.0.0.1
2 81.156.223.142
1 servers)
1 Address
1 90.206.201.8
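
The "servers)" and "Address" entries in that output are the two netstat header lines leaking through the pipeline, and the blank entry is most likely IPv6-style addresses, whose first ":"-separated field is empty. A variant that skips the headers and unwraps IPv4-mapped addresses might look like the sketch below. count_remote_ips is a hypothetical helper name, and it assumes the usual Linux netstat column layout with the foreign address in column 5.

```shell
# count_remote_ips: hypothetical helper, reads `netstat -ntu` output on stdin
# and prints the 15 remote addresses with the most connections.
count_remote_ips() {
    awk '$1 ~ /^(tcp|udp)/ {
        addr = $5
        sub(/:[0-9*]+$/, "", addr)   # strip the trailing :port
        sub(/^::ffff:/, "", addr)     # unwrap IPv4-mapped IPv6 addresses
        print addr
    }' | sort | uniq -c | sort -nr | head -15
}

# Usage:
#   netstat -ntu | count_remote_ips
```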

Sat, 08/30/2014 - 05:30
JamesSimpson

Hmm, I think I may have found the issue.

I seem to have thousands of these in the messages log:

Aug 30 05:05:14 JSServer01 named[29765]: client 127.0.0.1#45585: query (cache) '131.205.13.211.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#43407: query (cache) '29.193.26.103.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#41691: query (cache) '241.150.174.195.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#37403: query (cache) '166.109.97.211.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#58532: query (cache) '241.150.174.195.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#44044: query (cache) '102.120.149.107.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#37691: query (cache) '91.34.135.174.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#57784: query (cache) '219.106.153.184.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#40505: query (cache) '204.5.106.41.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#35974: query (cache) '91.34.135.174.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#35621: query (cache) '53.79.234.212.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#44718: query (cache) '102.120.149.107.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#52370: query (cache) '53.79.234.212.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:17 JSServer01 named[29765]: client 127.0.0.1#42438: query (cache) '177.10.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:17 JSServer01 named[29765]: client 127.0.0.1#41674: query (cache) '202.209.241.61.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:18 JSServer01 named[29765]: client 127.0.0.1#56260: query (cache) '124.10.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:19 JSServer01 named[29765]: client 127.0.0.1#48054: query (cache) '166.109.97.211.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:22 JSServer01 named[29765]: client 127.0.0.1#49980: query (cache) '188.17.82.36.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#49930: query (cache) '204.5.106.41.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#57424: query (cache) '188.17.82.36.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#57964: query (cache) '120.107.255.193.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#35676: query (cache) '124.10.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#35009: query (cache) '101.95.101.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:24 JSServer01 named[29765]: client 127.0.0.1#47569: query (cache) '120.107.255.193.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:24 JSServer01 named[29765]: client 127.0.0.1#39782: query (cache) '227.58.73.203.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:24 JSServer01 named[29765]: client 127.0.0.1#50507: query (cache) '101.95.101.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:24 JSServer01 named[29765]: client 127.0.0.1#41356: query (cache) '156.12.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:25 JSServer01 named[29765]: client 127.0.0.1#43907: query (cache) '227.58.73.203.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:25 JSServer01 named[29765]: client 127.0.0.1#50367: query (cache) '179.107.160.163.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:25 JSServer01 named[29765]: client 127.0.0.1#58792: query (cache) '179.107.160.163.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:25 JSServer01 named[29765]: client 127.0.0.1#45449: query (cache) '182.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#35984: query (cache) '19.96.95.23.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#42738: query (cache) '19.96.95.23.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#57701: query (cache) '187.92.95.23.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#33209: query (cache) '77.113.182.192.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#51364: query (cache) '240.9.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#56060: query (cache) '240.9.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:27 JSServer01 named[29765]: client 127.0.0.1#54580: query (cache) '238.210.34.89.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:27 JSServer01 named[29765]: client 127.0.0.1#34927: query (cache) '187.92.95.23.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:27 JSServer01 named[29765]: client 127.0.0.1#54763: query (cache) '170.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:28 JSServer01 named[29765]: client 127.0.0.1#51508: query (cache) '170.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:28 JSServer01 named[29765]: client 127.0.0.1#34891: query (cache) '77.113.182.192.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:29 JSServer01 named[29765]: client 127.0.0.1#37835: query (cache) '181.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:29 JSServer01 named[29765]: client 127.0.0.1#47091: query (cache) '156.12.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:31 JSServer01 named[29765]: client 127.0.0.1#47907: query (cache) '167.13.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:31 JSServer01 named[29765]: client 127.0.0.1#42951: query (cache) '167.13.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:31 JSServer01 named[29765]: client 127.0.0.1#37369: query (cache) '223.59.200.220.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#54876: query (cache) '187.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#56875: query (cache) '187.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#56911: query (cache) '182.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#37661: query (cache) '171.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#35656: query (cache) '220.59.200.220.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#42569: query (cache) '33.114.193.123.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:33 JSServer01 named[29765]: client 127.0.0.1#40194: query (cache) '33.114.193.123.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:33 JSServer01 named[29765]: client 127.0.0.1#43916: query (cache) '181.233.15.199.in-addr.arpa/PTR/IN' denied

Sat, 08/30/2014 - 07:36
Locutus

Okay, Eric might be able to say more about the error you get when trying to finish package updates; I'm not familiar enough with CentOS (I'm assuming you're using that, or another distro that uses "yum").

Did this issue start just after you installed updates? Or did it happen before that?

Note that the 40 is not the CPU usage, but the system load. CPU usage is usually expressed as the percentage of time the CPU spends handling processes. In your case, that'd be a maximum of 400%, or 100% for each core.

System load, on the other hand, tells you how many processes are, on average, either running or waiting to run, measured over the last 1, 5 and 15 minutes. On Linux this also counts processes that are stuck waiting for other resources, e.g. a process in uninterruptible sleep (state "D" in top) waiting for the HDD. With your 4-core CPU, a load of up to 4 is acceptable and "normal" if the system is very heavily used.

So a load of 40 means that 40 processes are ready to do something but can't, because resources are lacking. It's to be expected that the system is nearly unresponsive then. In your case, that's probably not CPU power (since your top output showed that the CPU was mostly idle), but something else.
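
To relate a load average to the number of cores, a rough check could look like the sketch below. load_per_core is a hypothetical helper name, and the "more than 1.0 per core" threshold is only the rule of thumb described above, not a hard limit.

```shell
# load_per_core: hypothetical helper.
#   $1 = load average (e.g. the 1-minute figure), $2 = number of CPU cores
# Prints the load per core and a rough verdict.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        verdict = (ratio > 1) ? "overloaded" : "ok"
        printf "%.2f %s\n", ratio, verdict
    }'
}

# Usage on a Linux host (reads the 1-minute average and the core count):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the figures from the top output earlier in this thread, load_per_core 21.79 4 reports about 5.45 per core, i.e. well past what four cores can keep up with.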

A good candidate is the HDD, in case there's hardware trouble with it. What kind of HDD setup do you have in the server? Single disk? Software/hardware RAID? You might want to use the command smartctl to review the HDDs' status values.

Since this only happens after a while, you might want to observe it for a bit and note if the system load goes up. You can review historical atop data by running atop -r /var/log/atop.log. When the load goes up, note if the disk is overloaded ("DSK % busy" is a good indicator), also check which processes use what amount of memory, disk, network etc. You can sort the output of atop accordingly and switch to different screens. Press "?" for a help screen.
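
If you save the atop header lines as text (as in the outputs pasted in this thread), a small filter can pick out the busy devices. This is just a sketch under that assumption: busy_disks is a made-up helper name, the 70% default is an arbitrary threshold, and it only understands the "DSK | name | busy NN% | ..." layout shown here.

```shell
# busy_disks: hypothetical helper, reads pasted atop header lines on stdin
# and prints every DSK/LVM device whose busy percentage is at or above
# the threshold given as $1 (default 70).
busy_disks() {
    threshold="${1:-70}"
    awk -F'|' -v limit="$threshold" '
        $1 ~ /^(DSK|LVM)/ {
            name = $2; busy = $3
            gsub(/ /, "", name)          # trim the device name
            gsub(/[^0-9]/, "", busy)     # keep only the digits of "busy NN%"
            if (busy + 0 >= limit + 0) print name, busy "%"
        }'
}

# Usage (atop_screen.txt is a hypothetical file holding pasted atop output):
#   busy_disks 70 < atop_screen.txt
```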

Also don't forget to check last to see what those 12 logins were during your last problem phase! It shows you all logins with username and IP address. Pay attention to any entries with unexpected users/IP addresses there!

Sat, 08/30/2014 - 09:07
JamesSimpson

I checked the last logins and I can confirm they are all mine.

It also looks like my server may have been under a DDoS attack, maybe?

Sat, 08/30/2014 - 09:16
JamesSimpson

I am seeing a lot of these in the messages log

Aug 29 19:51:52 JSServer01 named[29765]: client 127.0.0.1#11277: query (cache) 'gmx.net/NS/IN' denied
Aug 29 19:51:52 JSServer01 named[29765]: client 127.0.0.1#11277: query (cache) 'cingular.com/NS/IN' denied
Aug 29 19:51:52 JSServer01 named[29765]: client 127.0.0.1#11277: query (cache) 'sourceforge.net/NS/IN' denied
Aug 29 19:50:18 JSServer01 named[29765]: client 127.0.0.1#52864: query (cache) 'intel.com/NS/IN' denied
Aug 29 19:50:18 JSServer01 named[29765]: client 127.0.0.1#52864: query (cache) 'msn.com/NS/IN' denied
Aug 29 19:50:18 JSServer01 named[29765]: client 127.0.0.1#52864: query (cache) 'comcast.net/NS/IN' denied

And then what looks like a DoS attack?

Aug 30 01:11:41 JSServer01 kernel: Firewall: *TCP_OUT Blocked* IN= OUT=em1 SRC=149.255.100.109 DST=69.46.36.10 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=25880 DF PROTO=TCP SPT=50786 DPT=9050 WINDOW=14600 RES=0x00 SYN URGP=0 UID=508 GID=503
Aug 30 01:11:41 JSServer01 named[29765]: client 127.0.0.1#44437: query (cache) '187.88.217.189.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:41 JSServer01 named[29765]: client 127.0.0.1#46883: query (cache) '187.88.217.189.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:41 JSServer01 named[29765]: client 127.0.0.1#53390: query (cache) '225.222.197.69.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:42 JSServer01 named[29765]: client 127.0.0.1#38526: query (cache) '252.55.186.210.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:42 JSServer01 kernel: Firewall: *TCP_OUT Blocked* IN= OUT=em1 SRC=149.255.100.109 DST=69.46.36.10 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=25881 DF PROTO=TCP SPT=50786 DPT=9050 WINDOW=14600 RES=0x00 SYN URGP=0 UID=508 GID=503
Aug 30 01:11:42 JSServer01 named[29765]: client 127.0.0.1#56360: query (cache) '94.158.55.50.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:42 JSServer01 named[29765]: client 127.0.0.1#33568: query (cache) '34.137.46.77.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:43 JSServer01 named[29765]: client 127.0.0.1#55732: query (cache) '190.243.45.70.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:43 JSServer01 named[29765]: client 127.0.0.1#57461: query (cache) '120.141.93.216.in-addr.arpa/PTR/IN' denied

Sat, 08/30/2014 - 10:38
JamesSimpson

Locutus, I run updates all the time to keep the server up to date, but around a week ago there were quite a few updates, which I ran, and then I enabled greylisting as I was starting to see a lot of spam emails coming through.

After that, I then started to get CSF alerts of high load averages, and then it seemed to get worse.

I am running a Dell PowerEdge R210, which comes with a Dell RAID card and two 1TB hard drives set up in RAID 1.

In Virtualmin, it only shows the RAID volume (SCSI device A Drive size 953.31 GB - Make and model Dell VIRTUAL DISK).

I have another machine which is running quite happily without the same issues, but that one runs a software RAID across two disks, and there I am able to query the RAID and the disks. With this machine, I've never been able to query the RAID, as I don't think there are any proper Dell drivers for the RAID card under Linux.

The RAID card is a Dell SAS 6/iR Adapter.

Sat, 08/30/2014 - 10:54
JamesSimpson

Hi Guys,

It started building up again, so I ran atop -r and this is the output:

ATOP - JSServer01 2014/08/30 15:02:04 --------- 4h25m53s elapsed
PRC | sys 94.89s | user 19m30s | #proc 184 | #zombie 0 | #exit 0 |
CPU | sys 1% | user 19% | irq 0% | idle 371% | wait 9% |
cpu | sys 1% | user 9% | irq 0% | idle 82% | cpu000 w 8% |
cpu | sys 0% | user 5% | irq 0% | idle 94% | cpu002 w 1% |
cpu | sys 0% | user 3% | irq 0% | idle 97% | cpu001 w 0% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu003 w 0% |
CPL | avg1 0.27 | avg5 0.29 | avg15 0.27 | csw 5643189 | intr 6191011 |
MEM | tot 15.6G | free 11.8G | cache 1.4G | buff 232.5M | slab 406.8M |
SWP | tot 2.0G | free 2.0G | | vmcom 2.8G | vmlim 9.8G |
LVM | Group00-root | busy 10% | read 158419 | write 785040 | avio 1.76 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | avio 2.57 ms |
DSK | sda | busy 10% | read 112136 | write 262769 | avio 4.43 ms |
NET | transport | tcpi 534967 | tcpo 484902 | udpi 13309 | udpo 13651 |
NET | network | ipi 555500 | ipo 516192 | ipfrw 0 | deliv 548501 |
NET | em1 0% | pcki 492572 | pcko 649938 | si 36 Kbps | so 409 Kbps |
NET | lo ---- | pcki 101110 | pcko 101110 | si 13 Kbps | so 13 Kbps |
Window has been resized...
PID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/17
11352 1 6.33s 4m13s 310.1M 102.7M 324K 10720K N- - S 0 2% php-cgi
11353 1 5.98s 3m59s 310.1M 102.7M 124K 12436K N- - S 2 2% php-cgi
14890 1 7.86s 3m47s 286.6M 81180K 0K 11336K N- - S 1 1% php-cgi
1866 16 20.63s 1m45s 863.0M 63104K 81144K 1.0G N- - S 3 1% mysqld
6279 1 4.30s 79.64s 311.8M 104.1M 2692K 70844K N- - D 2 1% php-cgi
6992 1 2.12s 64.37s 278.9M 77108K 164K 4K N- - S 0 0% php-cgi
10698 1 2.92s 57.18s 301.1M 95656K 572K 52416K N- - S 1 0% php-cgi
6242 1 1.36s 39.79s 285.7M 78572K 80K 4K N- - S 0 0% php-cgi
6993 1 1.10s 33.55s 272.9M 66768K 220K 4K N- - S 2 0% php-cgi
78 1 21.30s 0.00s 0K 0K 0K 0K N- - S 3 0% kipmi0
6600 1 0.51s 17.45s 264.5M 63392K 176K 164K N- - S 0 0% php-cgi

Sat, 08/30/2014 - 14:24
JamesSimpson

I think I have figured it out. It's something to do with BIND. I think I've been going through DDoS attacks for some strange reason.

I have just added this to named.conf:

acl "trusted" {
        My server ip address
        My server ip address 2
        My secondary DNS server IP address
        localhost;
        localnets;
};

options {
        listen-on port 53 {
                any;
        };
        listen-on-v6 port 53 {
                any;
        };
        directory "/var/named";
        dump-file "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
        allow-query { trusted; };
        allow-transfer { trusted; };
        allow-recursion { trusted; };
        allow-query-cache { trusted; };
        recursion no;

        dnssec-enable yes;
        dnssec-validation yes;
        dnssec-lookaside auto;

        /* Path to ISC DLV key */
        bindkeys-file "/etc/named.iscdlv.key";

        managed-keys-directory "/var/named/dynamic";
        also-notify {
        };
};

I now see a lot of these kinds of warnings in my log file:

Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.198.26#53: query 'dansimpson.net/SPF/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.198.26#53: query 'dansimpson.net/SPF/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.198.26#53: query 'ns2.j5huh.net/A/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.198.26#53: query 'ns1.j5huh.net/A/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.192.25#21267: query 'ns1.j5huh.com/A/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.192.25#20384: query 'ns1.j5huh.com/A/IN' denied

I am assuming these are the remains of a DNS attack?
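
To see who is actually sending the denied queries, the log lines can be grouped by client address and record type. The following is a minimal sketch that only handles the line format shown in this thread; denied_summary is a hypothetical helper name, and /var/log/messages is assumed to be where named logs on this system.

```shell
# denied_summary: hypothetical helper, reads syslog lines on stdin and counts
# denied named queries per client IP and record type (A, PTR, NS, SPF, ...).
denied_summary() {
    awk '/named\[[0-9]+\]: client / && $NF == "denied" {
        for (i = 1; i <= NF; i++)
            if ($i == "client") { client = $(i + 1); break }
        sub(/#.*/, "", client)               # drop the "#port:" suffix
        n = split($(NF - 1), parts, "/")      # quoted name/TYPE/CLASS field
        print client, parts[n - 1]
    }' | sort | uniq -c | sort -nr
}

# Usage (log path is an assumption):
#   denied_summary < /var/log/messages
```

A long list dominated by 127.0.0.1 would point at something on the server itself doing the lookups, while many different remote addresses would look more like outside traffic.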

Sun, 08/31/2014 - 06:17
JamesSimpson

Well, adding those DNS settings broke my websites, as I couldn't access them. I have since tightened the firewall to block multiple queries, which seems to have worked.

Mon, 09/01/2014 - 14:18
JamesSimpson

Does this give any clues? LVM and DSK are flashing red?

ATOP - JSServer01 2014/09/01 13:08:44 --------- 2m54s elapsed
PRC | sys 5.84s | user 2.64s | #proc 138 | #zombie 0 | #exit 0 |
CPU | sys 8% | user 7% | irq 0% | idle 307% | wait 78% |
cpu | sys 4% | user 2% | irq 0% | idle 25% | cpu000 w 69% |
cpu | sys 2% | user 4% | irq 0% | idle 88% | cpu001 w 5% |
cpu | sys 1% | user 1% | irq 0% | idle 96% | cpu002 w 2% |
cpu | sys 0% | user 0% | irq 0% | idle 97% | cpu003 w 2% |
CPL | avg1 1.38 | avg5 0.58 | avg15 0.21 | csw 248036 | intr 226145 |
MEM | tot 15.6G | free 14.2G | cache 501.0M | buff 14.9M | slab 334.3M |
SWP | tot 2.0G | free 2.0G | | vmcom 868.7M | vmlim 9.8G |
LVM | Group00-root | busy 78% | read 109666 | write 2872 | avio 1.21 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | avio 1.09 ms |
DSK | sda | busy 79% | read 65008 | write 1376 | avio 2.07 ms |
NET | transport | tcpi 24 | tcpo 24 | udpi 75 | udpo 102 |
NET | network | ipi 120 | ipo 135 | ipfrw 0 | deliv 102 |
NET | em1 0% | pcki 182 | pcko 85 | si 0 Kbps | so 0 Kbps |
NET | lo ---- | pcki 33 | pcko 33 | si 0 Kbps | so 0 Kbps |
*** system and process activity since boot ***
PID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/16
158 1 4.22s 0.98s 36096K 1368K 276K 16K N- - S 1 3% plymouthd
2158 1 0.04s 1.33s 239.1M 52280K 2804K 4K N- - S 0 1% spamd
1 1 0.56s 0.02s 19356K 1524K 409.7M 6968K N- - S 0 0% init
34 1 0.54s 0.00s 0K 0K 0K 0K N- - S 0 0% kblockd/0
78 1 0.32s 0.00s 0K 0K 0K 0K N- - S 3 0% kipmi0
437 1 0.01s 0.15s 10648K 756K 9268K 0K N- - S 2 0% udevd
2182 1 0.04s 0.01s 154.2M 13520K 11332K 7712K N- - S 3 0% postgrey
1843 2 0.01s 0.04s 37812K 4184K 1556K 4K N- - S 0 0% hald
2260 1 0.00s 0.04s 81296K 3408K 520K 8K N- - S 3 0% master

Mon, 09/01/2014 - 14:18
JamesSimpson

And this was from yesterday, when it started to build up again
ATOP - JSServer01 2014/08/31 00:00:01 --------- 6h17m12s elapsed
PRC | sys 3m58s | user 25m22s | #proc 201 | #zombie 0 | #exit 1 |
CPU | sys 2% | user 15% | irq 0% | idle 374% | wait 9% |
cpu | sys 0% | user 8% | irq 0% | idle 84% | cpu000 w 8% |
cpu | sys 0% | user 4% | irq 0% | idle 95% | cpu002 w 1% |
cpu | sys 0% | user 2% | irq 0% | idle 97% | cpu001 w 0% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu003 w 0% |
CPL | avg1 0.13 | avg5 0.16 | avg15 0.14 | csw 9523149 | intr 8306300 |
MEM | tot 15.6G | free 12.0G | cache 1.2G | buff 255.2M | slab 192.2M |
SWP | tot 2.0G | free 2.0G | | vmcom 3.1G | vmlim 9.8G |
LVM | Group00-root | busy 10% | read 158124 | write 917942 | avio 2.18 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | avio 0.88 ms |
DSK | sda | busy 10% | read 119043 | write 345707 | avio 5.04 ms |
NET | transport | tcpi 539048 | tcpo 506361 | udpi 43734 | udpo 44075 |
NET | network | ipi 598771 | ipo 564033 | ipfrw 0 | deliv 583078 |
NET | em1 0% | pcki 514411 | pcko 678076 | si 19 Kbps | so 301 Kbps |
NET | lo ---- | pcki 131997 | pcko 131997 | si 13 Kbps | so 13 Kbps |
*** system and process activity since boot ***
PID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/23
12952 1 9.41s 6m04s 303.8M 97396K 296K 18116K N- - S 0 2% php-cgi
12239 1 8.58s 5m44s 292.0M 86864K 684K 16916K N- - S 0 2% php-cgi
13618 1 7.58s 5m12s 310.1M 102.7M 32K 13992K N- - S 0 1% php-cgi
1772 15 26.10s 2m08s 798.8M 66204K 214.2M 1.2G N- - S 0 1% mysqld
78 1 2m13s 0.00s 0K 0K 0K 0K N- - S 3 1% kipmi0
6474 1 3.62s 95.06s 286.2M 78744K 53280K 12952K N- - S 0 0% php-cgi
3119 1 3.16s 84.38s 287.4M 80580K 105.0M 8660K N- - S 0 0% php-cgi
2571 33 4.86s 42.56s 2.6G 181.8M 155.9M 13296K N- - S 1 0% dsm_om_connsvc
20531 1 2.01s 27.72s 275.4M 69604K 476K 47256K N- - S 0 0% php-cgi

Tue, 09/02/2014 - 07:48
Locutus

You posted the system activity since boot; you should also watch the ongoing activity. You can change the update interval with the i key, and trigger a manual update with t.

It seems like the HDD is under constant high load. You can sort the process list by disk usage with shift-d and switch to disk details with d, to find out which process(es) are using the disk so much.
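
Outside of atop, a rough per-process view of disk reads can also be pulled from /proc. The sketch below assumes a kernel with per-task I/O accounting (which the RDDSK/WRDSK columns in atop suggest is available here) and needs root to read other users' processes; top_disk_readers is a made-up helper name. Note that read_bytes is cumulative since each process started, not a current rate.

```shell
# top_disk_readers: hypothetical helper, prints "read_bytes pid command" for
# the 10 processes that have read the most from disk since they started.
#   $1 = proc filesystem root (defaults to /proc)
top_disk_readers() {
    proc_root="${1:-/proc}"
    for io in "$proc_root"/[0-9]*/io; do
        [ -r "$io" ] || continue             # skip unreadable or vanished entries
        dir="${io%/io}"
        pid="${dir##*/}"
        bytes=$(awk '$1 == "read_bytes:" { print $2 }' "$io" 2>/dev/null)
        name=$(cat "$dir/comm" 2>/dev/null)  # may be missing on older kernels
        printf '%s %s %s\n' "${bytes:-0}" "$pid" "${name:-?}"
    done | sort -rn | head -10
}

# Usage (as root):
#   top_disk_readers
```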

Mon, 09/08/2014 - 13:52
JamesSimpson

Right, I have to restart the server roughly every other day to get it back to normal, just so I can even log in over SSH.

These logs are from the 5th; they show high LVM and DSK busy percentages.

ATOP - JSServer01 2014/09/05 13:29:42 --------- 3m22s elapsed
PRC | sys 6.72s | user 2.89s | #proc 141 | #trun 1 | #tslpi 161 | #tslpu 3 | #zombie 0 | clones 2157 | | #exit 0 |
CPU | sys 7% | user 6% | irq 0% | idle 304% | wait 82% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 4% | user 2% | irq 0% | idle 19% | cpu000 w 75% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 3% | irq 0% | idle 93% | cpu001 w 4% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 3% | user 1% | irq 0% | idle 95% | cpu002 w 2% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 1% | user 0% | irq 0% | idle 97% | cpu003 w 2% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
CPL | avg1 1.17 | avg5 0.53 | | avg15 0.20 | | csw 256516 | intr 253915 | | | numcpu 4 |
MEM | tot 15.6G | free 14.2G | cache 489.2M | dirty 1.2M | buff 13.6M | slab 343.7M | | | | |
SWP | tot 2.0G | free 2.0G | | | | | | | vmcom 864.7M | vmlim 9.8G |
LVM | Group00-root | busy 82% | read 112338 | write 2805 | KiB/r 7 | KiB/w 4 | MBr/s 4.32 | MBw/s 0.05 | avq 4.86 | avio 1.45 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | KiB/r 4 | KiB/w 0 | MBr/s 0.01 | MBw/s 0.00 | avq 3.27 | avio 0.93 ms |
DSK | sda | busy 83% | read 67273 | write 1386 | KiB/r 13 | KiB/w 8 | MBr/s 4.46 | MBw/s 0.05 | avq 2.46 | avio 2.44 ms |
NET | transport | tcpi 28 | tcpo 27 | udpi 93 | udpo 145 | tcpao 2 | tcppo 1 | tcprs 1 | tcpie 0 | udpip 0 |
NET | network | ipi 151 | ipo 183 | ipfrw 0 | deliv 138 | | | | icmpi 17 | icmpo 9 |
NET | em1 0% | pcki 141 | pcko 122 | si 0 Kbps | so 0 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | lo ---- | pcki 33 | pcko 33 | si 0 Kbps | so 0 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
*** system and process activity since boot ***
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR DSK CMD 1/8
1 - root root 1 0.55s 0.03s 19232K 1516K 428.7M 8236K N- - S 0 54% init
1038 - root root 1 0.01s 0.01s 108.0M 1804K 339.9M 1008K N- - S 0 42% rc
1923 - mysql mysql 11 0.01s 0.02s 477.5M 23128K 9304K 92K N- - S 1 1% mysqld
434 - root root 1 0.01s 0.17s 10760K 876K 9148K 0K N- - S 0 1% udevd
1973 - root root 1 0.03s 1.35s 239.1M 52280K 2804K 4K N- - S 0 0% spamd
1661 - haldaemo haldaemo 2 0.02s 0.03s 37824K 4200K 1560K 4K N- - S 0 0% hald
1323 - named named 7 0.02s 0.02s 382.6M 17392K 1468K 16K N- - S 0 0% named
346 - root root 1 0.00s 0.00s 0K 0K 0K 1128K N- - S 1 0% jbd2/dm-0-8
2072 - postfix postfix 1 0.00s 0.00s 81584K 3940K 1064K 0K N- - S 3 0% trivial-rewrit
1284 - root root 4 0.00s 0.00s 243.3M 1612K 416K 172K N- - S 0 0% rsyslogd
2073 - postfix postfix 1 0.00s 0.00s 81580K 3612K 572K 0K N- - S 0 0% smtp
2062 - root root 1 0.00s 0.03s 81296K 3408K 520K 8K N- - S 1 0% master
2106 - root root 1 0.01s 0.01s 269.3M 28532K 516K 4K N- - D 1 0% httpd
1662 - root root 1 0.00s 0.00s 20328K 1156K 520K 0K N- - S 0 0% hald-runner
2117 - root root 1 0.01s 0.00s 17532K 5252K 500K 4K N- - R 2 0% atop
1764 - root root 1 0.00s 0.00s 107.7M 1460K 368K 0K N- - S 2 0% mysqld_safe
2071 - postfix postfix 1 0.00s 0.00s 81520K 3504K 336K 0K N- - S 3 0% qmgr
157 - root root 1 5.11s 1.21s 36096K 1372K 276K 12K N- - S 1 0% plymouthd

It looks like init and rc are causing issues?

Mon, 09/08/2014 - 13:53
JamesSimpson

And this is the day before

ATOP - JSServer01 2014/09/02 00:00:01 --------- 10h54m10s elapsed
PRC | sys 6m30s | user 22m41s | #proc 229 | #trun 3 | #tslpi 468 | #tslpu 0 | #zombie 1 | clones 48622 | | #exit 5 |
CPU | sys 2% | user 17% | irq 0% | idle 369% | wait 12% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 8% | irq 0% | idle 80% | cpu000 w 11% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 1% | user 5% | irq 0% | idle 94% | cpu002 w 1% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 3% | irq 0% | idle 97% | cpu001 w 0% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu003 w 0% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
CPL | avg1 0.21 | avg5 0.23 | | avg15 0.26 | | csw 17736354 | intr 16086e3 | | | numcpu 4 |
MEM | tot 15.6G | free 10.4G | cache 2.2G | dirty 1.5M | buff 273.5M | slab 521.9M | | | | |
SWP | tot 2.0G | free 2.0G | | | | | | | vmcom 4.7G | vmlim 9.8G |
LVM | Group00-root | busy 14% | read 248593 | write 1841e3 | KiB/r 15 | KiB/w 3 | MBr/s 0.09 | MBw/s 0.18 | avq 7.79 | avio 2.57 ms |
LVM | Group00-swap | busy 0% | read 477 | write 0 | KiB/r 4 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avq 1.74 | avio 1.86 ms |
DSK | sda | busy 14% | read 185218 | write 758680 | KiB/r 20 | KiB/w 9 | MBr/s 0.10 | MBw/s 0.18 | avq 1.76 | avio 5.68 ms |
NET | transport | tcpi 1188535 | tcpo 1085220 | udpi 31117 | udpo 31510 | tcpao 12781 | tcppo 28282 | tcprs 37102 | tcpie 0 | udpip 0 |
NET | network | ipi 1239374 | ipo 1153826 | ipfrw 0 | deliv 1220e3 | | | | icmpi 280 | icmpo 142 |
NET | em1 0% | pcki 1189111 | pcko 1707309 | si 26 Kbps | so 463 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | lo ---- | pcki 107796 | pcko 107796 | si 12 Kbps | so 12 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
*** system and process activity since boot ***
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR DSK CMD 1/13
2108 - mysql mysql 18 55.63s 4m35s 1.1G 68400K 213.8M 3.0G N- - S 3 36% mysqld
2310 - root apache 1 0.34s 0.18s 218.7M 7228K 1.1G 300.1M N- - S 1 15% httpd
1 - root root 1 0.56s 0.03s 19356K 1548K 782.4M 27124K N- - S 3 9% init
2260 - root root 1 0.87s 0.17s 81296K 3408K 611.1M 67524K N- - S 3 7% master
347 - root root 1 11.44s 0.00s 0K 0K 0K 634.0M N- - S 3 7% jbd2/dm-0-8
3179 - root root 1 0.40s 1.01s 86620K 15836K 149.0M 458.3M N- - S 2 7% miniserv.pl
2302 - root root 1 2.02s 0.39s 414.0M 38388K 246.7M 296.8M N- - S 3 6% httpd
3013 - root root 8 0.00s 0.00s 690.8M 6292K 121.1M 155.9M N- - S 3 3% dsm_om_shrsvcd
2949 - root root 1 0.00s 0.00s 131.9M 712K 151.6M 14536K N- - S 2 2% dsm_om_connsvc
9091 - drivingr drivingr 1 6.12s 93.42s 302.7M 97156K 724K 112.2M N- - S 0 1% php-cgi
17087 - drivingr drivingr 1 4.00s 63.19s 310.4M 102.6M 1472K 90044K N- - S 2 1% php-cgi
20851 - drivingr drivingr 1 4.47s 50.94s 310.1M 102.2M 10436K 78680K N- - S 0 1% php-cgi
2182 - postgrey postgrey 1 0.20s 1.05s 154.2M 13904K 12784K 55040K N- - S 2 1% postgrey
5853 - bojotool bojotool 1 2.66s 62.79s 279.2M 77548K 46700K 9264K N- - S 0 1% php-cgi
398 - root root 1 1.53s 0.00s 0K 0K 448K 49824K N- - S 2 1% flush-253:0
5858 - bojotool bojotool 1 2.50s 63.19s 285.7M 78408K 26052K 10380K N- - S 0 0% php-cgi
2321 - root root 1 0.18s 0.08s 114.5M 1268K 20052K 16180K N- - S 3 0% crond
1466 - root root 4 0.50s 0.47s 243.3M 1772K 1144K 27104K N- - S 0 0% rsyslogd

Mon, 09/08/2014 - 18:10
JamesSimpson

The load has now shot up to over 1, and the DSK line is flashing in atop.

ATOP - JSServer01 2014/09/09 00:06:50 --------- 10s elapsed
PRC | sys 0.37s | user 2.16s | #proc 238 | #trun 2 | #tslpi 492 | #tslpu 1 | #zombie 0 | clones 5 | | #exit 1 |
CPU | sys 3% | user 22% | irq 0% | idle 278% | wait 96% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 2% | user 13% | irq 0% | idle 1% | cpu000 w 84% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 1% | user 4% | irq 0% | idle 96% | cpu003 w 0% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 1% | user 3% | irq 0% | idle 95% | cpu001 w 2% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 2% | irq 0% | idle 87% | cpu002 w 11% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
CPL | avg1 1.37 | avg5 1.17 | | avg15 0.66 | | csw 9608 | intr 13480 | | | numcpu 4 |
MEM | tot 15.6G | free 4.7G | cache 7.8G | dirty 5.3M | buff 347.1M | slab 597.7M | | | | |
SWP | tot 2.0G | free 2.0G | | | | | | | vmcom 4.8G | vmlim 9.8G |
LVM | Group00-root | busy 98% | read 1433 | write 958 | KiB/r 4 | KiB/w 3 | MBr/s 0.57 | MBw/s 0.37 | avq 1.71 | avio 4.08 ms |
DSK | sda | busy 98% | read 1433 | write 88 | KiB/r 4 | KiB/w 43 | MBr/s 0.57 | MBw/s 0.37 | avq 1.12 | avio 6.41 ms |
NET | transport | tcpi 614 | tcpo 386 | udpi 17 | udpo 17 | tcpao 6 | tcppo 8 | tcprs 15 | tcpie 0 | udpip 0 |
NET | network | ipi 632 | ipo 418 | ipfrw 0 | deliv 631 | | | | icmpi 0 | icmpo 0 |
NET | em1 0% | pcki 657 | pcko 630 | si 59 Kbps | so 647 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | lo ---- | pcki 48 | pcko 48 | si 8 Kbps | so 8 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR DSK CMD 1/4
27735 - root root 1 0.13s 0.12s 0K 0K 5784K 0K -- - D 2 65% tar
346 - root root 1 0.01s 0.00s 0K 0K 0K 1188K -- - S 2 13% jbd2/dm-0-8
3163 - root root 1 0.00s 0.00s 0K 0K 516K 8K -- - S 0 6% miniserv.pl
23261 - drivingr drivingr 1 0.03s 0.49s 10240K 9924K 0K 492K -- - S 0 6% php-cgi
18385 - drivingr drivingr 1 0.04s 0.92s -8704K -8476K 0K 316K -- - S 0 4% php-cgi
27737 - root root 1 0.00s 0.00s 0K 0K 0K 304K -- - S 0 3% cat
12883 - mysql mysql 15 0.01s 0.02s 0K 0K 0K 180K -- - S 3 2% mysqld
1323 - root root 4 0.00s 0.00s 0K 0K 0K 24K -- - S 0 0% rsyslogd
18286 - apache apache 4 0.00s 0.03s 0K 96K 0K 12K -- - S 1 0% httpd
27791 - postfix postfix 1 0.00s 0.00s 82252K 4640K 0K 12K N- - S 2 0% cleanup
18258 - apache apache 5 0.02s 0.08s 0K 0K 0K 8K -- - S 1 0% httpd
18265 - apache apache 5 0.00s 0.00s 0K 0K 0K 8K -- - S 2 0% httpd
21194 - apache apache 5 0.00s 0.00s 0K 0K 0K 8K -- - S 1 0% httpd
715 - root root 1 0.00s 0.00s 0K 0K 0K 8K -- - S 2 0% flush-253:0
23173 - apache apache 5 0.00s 0.01s 0K 0K 0K 4K -- - S 1 0% httpd

Tue, 09/16/2014 - 04:01
Locutus

You again posted the "System activity since boot". You might want to observe the ongoing activity (press t to trigger a manual update of the screen) while the HDD is under high load, and check there which processes use the disk the most and how much.
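
When comparing pasted process lists, it may help to convert the RDDSK and WRDSK columns to plain byte counts and sort on them. The following is just a sketch under a few assumptions: that the K/M/G suffixes atop prints are powers of 1024, that command names contain no spaces, and that the header line matches the rows. The helper names `atop_size_to_bytes` and `rank_by_disk_io` are invented for this example.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def atop_size_to_bytes(field):
    """Convert an atop size field such as '428.7M' or '8236K' to bytes.

    Assumes binary units (1K = 1024 bytes).
    """
    if field and field[-1] in UNITS:
        return int(float(field[:-1]) * UNITS[field[-1]])
    return int(float(field))  # no suffix: treat the value as plain bytes


def rank_by_disk_io(header, rows):
    """Sort process rows by RDDSK + WRDSK, largest first.

    `header` is the column line (PID TID ... RDDSK WRDSK ... CMD) and
    `rows` are the process lines below it, all whitespace-separated.
    Returns a list of (total_bytes, command) tuples.
    """
    cols = header.split()
    rd, wr, cmd = cols.index("RDDSK"), cols.index("WRDSK"), cols.index("CMD")
    ranked = []
    for row in rows:
        fields = row.split()
        total = atop_size_to_bytes(fields[rd]) + atop_size_to_bytes(fields[wr])
        ranked.append((total, fields[cmd]))
    return sorted(ranked, reverse=True)
```

Feeding it the header and process rows of an interval sample (one without the "since boot" banner) gives the disk users for that interval only; with a since-boot sample, the totals include everything read during startup.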

Sat, 09/27/2014 - 10:45
JamesSimpson

Well, I think I've managed to trace down what is causing the disk issues: two processes, init and rc.

Now what would be causing this?

ATOP - JSServer01 2014/09/25 12:51:08 --------- 4m36s elapsed
PRC | sys 8.92s | user 3.42s | #proc 138 | #trun 1 | #tslpi 158 | #tslpu 3 | #zombie 0 | clones 2275 | #exit 0 |
CPU | sys 6% | user 5% | irq 0% | idle 303% | wait 85% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 3% | user 2% | irq 0% | idle 16% | cpu000 w 79% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 3% | user 1% | irq 0% | idle 95% | cpu002 w 1% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 2% | irq 0% | idle 93% | cpu001 w 4% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 0% | irq 0% | idle 98% | cpu003 w 1% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
CPL | avg1 1.21 | avg5 0.67 | avg15 0.27 | | | csw 299648 | intr 335944 | | numcpu 4 |
MEM | tot 15.6G | free 14.1G | cache 519.1M | dirty 8.7M | buff 15.6M | slab 373.4M | | | |
SWP | tot 2.0G | free 2.0G | | | | | | vmcom 899.3M | vmlim 9.8G |
LVM | Group00-root | busy 86% | read 121974 | write 3222 | KiB/r 7 | KiB/w 3 | MBr/s 3.37 | MBw/s 0.05 | avio 1.89 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | KiB/r 4 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.93 ms |
DSK | sda | busy 86% | read 79345 | write 1608 | KiB/r 12 | KiB/w 8 | MBr/s 3.47 | MBw/s 0.05 | avio 2.94 ms |
NET | transport | tcpi 16 | tcpo 16 | udpi 178 | udpo 210 | tcpao 0 | tcppo 0 | tcprs 0 | udpip 0 |
NET | network | ipi 205 | ipo 236 | ipfrw 0 | deliv 197 | | | icmpi 3 | icmpo 9 |
NET | em1 0% | pcki 225 | pcko 180 | si 2 Kbps | so 0 Kbps | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | lo ---- | pcki 29 | pcko 29 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpi 0 | drpo 0 |
*** system and process activity since boot ***
PID TID RDDSK WRDSK WCANCL DSK CMD 1/28
1 - 462.4M 7388K 8K 54% init
1110 - 345.5M 992K 472K 40% rc
2016 - 26432K 432K 0K 3% mysqld
2108 - 3456K 8436K 4K 1% postgrey
439 - 9272K 0K 0K 1% udevd