High System CPU Load Average

#1 Fri, 08/29/2014 - 10:57
JamesSimpson

Hi All,

I am totally puzzled as to what Virtualmin is doing. After recently updating everything to the latest versions, I am getting the following CPU load averages and constant alerts from CSF.

CPU load averages 9.45 (1 min) 9.32 (5 mins) 9.77 (15 mins)

Running top via SSH, I get the following:

Processes: 175 total, 2 running, 4 stuck, 169 sleeping, 944 threads    16:54:15
Load Avg: 1.16, 1.13, 1.13  CPU usage: 3.74% user, 2.72% sys, 93.53% idle
SharedLibs: 14M resident, 14M data, 0B linkedit.
MemRegions: 55177 total, 917M resident, 48M private, 345M shared.
PhysMem: 2845M used (1000M wired), 4237M unused.
VM: 447G vsize, 1073M framework vsize, 11607078(0) swapins, 14171139(0) swapouts
Networks: packets: 14989373/17G in, 10427533/1423M out.
Disks: 2651509/109G read, 2162583/222G written.

PID    COMMAND      %CPU TIME     #TH  #WQ  #PORT #MREGS MEM    RPRVT  PURG
19094  mdworker     0.0  00:00.03 3    0    52    67     2196K  1340K  0B
19093  mdworker     0.0  00:00.03 3    0    52    69     3084K  2228K  0B
19092  syncdefaults 0.0  00:00.28 6    2    88    82     5132K  3952K  0B
19091  mdworker     0.0  00:00.06 3    0    52    69     5164K  4256K  0B
19089  top          9.3  00:14.13 1/1  0    26    41     2204K  1972K  0B
19086  bash         0.0  00:00.00 1    0    19    31     616K   448K   0B
19085  login        0.0  00:00.01 2    0    30    52     1168K  840K   0B
19078  TextEdit     0.0  00:00.27 5    2    170   184    13M    6556K  20K
19070  CVMCompiler  0.0  00:00.73 2    1    32    80     24M    24M    12K
19067  Terminal     24.0 00:03.02 13   7    179   212    20M+   15M+   80K
19057  com.apple.We 0.0  00:02.84 14   2    183   331    28M    25M    36K
19055  netbiosd     0.0  00:00.07 2    1    42    53     1888K  1484K  0B
19049  com.apple.iC 0.0  00:00.24 4    0    82    82     3892K  3112K  0B
19040  rpcsvchost   0.0  00:00.02 16   1    44    82     1428K  1092K  0B

I'm not sure where Virtualmin is pulling those averages from, or what is causing them. At first I thought my server had been hacked and was sending out spam, but there is nothing in the mail queue.

Has anyone got any ideas? Restarting my server gets it back down to the usual average of 0.3 for a day or two, then it starts to build back up.

I got an alert for a 5 min load average of 11.4 about an hour ago. The websites aren't getting any more hits than usual, so it can't be that...

Fri, 08/29/2014 - 12:02
andreychek

Howdy,

Hmm, the output above appears to be from an Apple computer, not a Linux server that would be running Virtualmin. Is that process information from the correct system?

-Eric

Fri, 08/29/2014 - 13:43
JamesSimpson

Oops, you are correct, that's what I get for posting in haste. That said, I cannot connect to the server by SSH: it asks me for a login, I enter my password, and then it just stays blank :S

Fri, 08/29/2014 - 13:48
JamesSimpson

At this moment in time, it's now running at 11.4

CPU load averages: 11.30 (1 min), 11.25 (5 mins), 11.22 (15 mins)
CPU type: Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz, 4 cores

21916 jamessimpson 3.0 % /usr/bin/php-cgi
22225 jamessimpson 3.0 % /usr/bin/php-cgi
21915 jamessimpson 2.0 % /usr/bin/php-cgi
23138 root 1.2 % /usr/libexec/webmin/proc/index_cpu.cgi
1772 mysql 0.5 % /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-e ...
19 root 0.4 % [events/0]
14555 drivingroads 0.4 % /usr/bin/php-cgi
14797 drivingroads 0.4 % /usr/bin/php-cgi
6827 bojotoolstore 0.3 % /usr/bin/php-cgi
7484 bojotoolstore 0.3 % /usr/bin/php-cgi
15398 drivingroads 0.3 % /usr/bin/php-cgi
18444 bojotoolstore 0.2 % /usr/bin/php-cgi
22486 apache 0.2 % /usr/sbin/httpd
78 root 0.1 % [kipmi0]
23139 root 0.1 % /usr/bin/perl /usr/libexec/webmin/miniserv.pl /etc/webmin/miniserv.conf
1 root 0.0 % /sbin/init
Fri, 08/29/2014 - 14:47
andreychek

Howdy,

Well, there are a number of PHP-related processes there... it's possible that means one or more of your sites is seeing an influx of traffic.

However, what is the output of these commands:

free -m
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -15

Also, can you run the command "ps auxw", and attach that output as a text file?

-Eric

Fri, 08/29/2014 - 14:53
JamesSimpson

That's the thing, I cannot get in via SSH at the moment. It lets me log in but then won't let me type anything.

It has happened before, and I had to restart the server to get access again, which would mean I would be back to normal processes for a day or two.

Fri, 08/29/2014 - 15:29
JamesSimpson

Finally managed to connect

Top:

top - 21:26:57 up 4 days, 21:58, 12 users,  load average: 21.79, 20.18, 17.46
Tasks: 256 total,   1 running, 248 sleeping,   0 stopped,   7 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16321220k total, 15633400k used,   687820k free,   390020k buffers
Swap:  2097144k total,     7880k used,  2089264k free, 11586296k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
   19 root      20   0     0    0    0 D  0.7  0.0  34:16.18 events/0          
   61 root      39  19     0    0    0 S  0.3  0.0   0:20.72 khugepaged        
5119 root      20   0  153m  15m 1668 S  0.3  0.1   0:34.30 lfd               
    1 root      20   0 19356 1476 1232 S  0.0  0.0   0:00.62 init              
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.05 kthreadd          
    3 root      RT   0     0    0    0 S  0.0  0.0   0:02.98 migration/0       
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.69 ksoftirqd/0       
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0       
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.59 watchdog/0        
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.64 migration/1       
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1       
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.58 ksoftirqd/1       
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.38 watchdog/1        
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.39 migration/2       
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2       
   13 root      20   0     0    0    0 S  0.0  0.0   0:01.15 ksoftirqd/2       
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.35 watchdog/2        
Fri, 08/29/2014 - 15:58
JamesSimpson

And now SSH is frozen again, and I cannot get past successful authentication

Sat, 08/30/2014 - 03:47
Locutus

In your latest "top" output, there seem to be no processes using any considerable CPU power, yet your system load is excessively high. This could indicate that the system is spending a great deal of time waiting for other resources (RAM, HDD, network) to become available, which might point to an overload there or to hardware issues.

Also I noticed 12 users logged on, and 7 zombie processes. Those might be hanging sessions of your failed attempts to log on via SSH, but you might want to check those out, using the commands "w" and "last".

I also recommend the tool "atop" over "top", since it displays more information like disk, memory, swap and network usage, and records historical data, for later review. atop shows zombie processes with a "Z" in the state column.

You might have to hard-reboot the server if you can't reliably get in via SSH anymore. A system load of 20 will most likely prevent you from doing any serious work on the server.

When you can get in again, you might want to review the system and kernel logs, and install atop.
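
On Linux, the load average counts tasks in uninterruptible sleep (state "D", usually waiting on disk or other I/O) as well as runnable ones, which is how the load can be high while the CPU is almost idle. A minimal sketch for picking such tasks out of ps output follows. It assumes a procps-style ps that accepts "-eo state,pid,comm"; the helper names blocked_tasks and read_ps are made up for illustration.

```python
import subprocess

def blocked_tasks(ps_output):
    """Return (pid, command) pairs for tasks whose state starts with 'D'."""
    tasks = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # state, pid, command
        # The header row ("S PID COMMAND") is skipped because "PID" is not a number.
        if len(fields) == 3 and fields[0].startswith("D") and fields[1].isdigit():
            tasks.append((int(fields[1]), fields[2].strip()))
    return tasks

def read_ps():
    """Run ps on the live system (assumption: procps-style ps is installed)."""
    result = subprocess.run(["ps", "-eo", "state,pid,comm"],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Illustrative input only, shaped like 'ps -eo state,pid,comm' output.
sample = "S   PID COMMAND\nD    19 events/0\nS     1 init\n"
print(blocked_tasks(sample))
```

On the server itself, blocked_tasks(read_ps()) would list whatever is stuck at that moment. A task that stays in state D across several checks, such as the events/0 kernel thread in the top output above, would be worth a closer look.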

Sat, 08/30/2014 - 05:14
JamesSimpson

Right, I have had to restart the server, as last night the load average got up to 40.1. After restarting this morning I am able to get back in via SSH.

Output from atop
atop

ATOP - JSServer01 2014/08/30 11:05:26 --------- 10s elapsed
PRC | sys 0.14s | user 1.49s | #proc 182 | #zombie 0 | #exit 5 |
CPU | sys 2% | user 15% | irq 0% | idle 378% | wait 5% |
cpu | sys 1% | user 11% | irq 0% | idle 83% | cpu000 w 5% |
cpu | sys 0% | user 4% | irq 0% | idle 96% | cpu002 w 0% |
cpu | sys 0% | user 0% | irq 0% | idle 99% | cpu001 w 0% |
cpu | sys 0% | user 0% | irq 0% | idle 100% | cpu003 w 0% |
CPL | avg1 0.17 | avg5 0.39 | avg15 0.36 | csw 5269 | intr 2754 |
MEM | tot 15.6G | free 12.7G | cache 811.7M | buff 86.2M | slab 353.2M |
SWP | tot 2.0G | free 2.0G | | vmcom 2.7G | vmlim 9.8G |
LVM | Group00-root | busy 5% | read 10 | write 192 | avio 2.62 ms |
DSK | sda | busy 5% | read 10 | write 71 | avio 6.53 ms |
NET | transport | tcpi 38 | tcpo 37 | udpi 0 | udpo 0 |
NET | network | ipi 47 | ipo 37 | ipfrw 0 | deliv 38 |
NET | em1 0% | pcki 66 | pcko 37 | si 4 Kbps | so 24 Kbps |
NET | lo ---- | pcki 10 | pcko 10 | si 0 Kbps | so 0 Kbps |

PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/5
2168 0.02s 0.82s 0K 0K 0K 8K -- - S 8% php-cgi
2383 0.01s 0.30s 0K 0K 0K 0K -- - S 3% php-cgi
1866 0.03s 0.27s 0K 0K 36K 100K -- - S 3% mysqld
2224 0.01s 0.04s 75780K 20K 48K 88K -- - S 1% httpd
4131 0.01s 0.04s 0K 0K - - NE 0 E 1%
78 0.03s 0.00s 0K 0K 0K 0K -- - S 0% kipmi0

It is showing normal usage now, so I'm not sure what the hell is going on after a day or two.

While installing atop I did get a warning:
There are unfinished transactions remaining. You might consider running yum-complete-transaction first to finish them.

So I ran that too, and it looks as if it cannot install what is required:
yum-complete-transaction
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.melbourne.co.uk
* epel: mirror.bytemark.co.uk
* extras: mirror.bytemark.co.uk
* updates: mirrors.ukfast.co.uk
Checking for new repos for mirrors
There are 1 outstanding transactions to complete. Finishing the most recent one
The remaining transaction had 10 elements left to run
--> Running transaction check
---> Package automake.noarch 0:1.11.1-4.el6 will be installed
---> Package cloog-ppl.x86_64 0:0.15.7-1.2.el6 will be installed
---> Package cpp.x86_64 0:4.4.7-4.el6 will be installed
---> Package gcc.x86_64 0:4.4.7-4.el6 will be installed
---> Package gcc-c++.x86_64 0:4.4.7-4.el6 will be installed
---> Package libgomp.x86_64 0:4.4.7-4.el6 will be installed
---> Package libstdc++-devel.x86_64 0:4.4.7-4.el6 will be installed
---> Package mpfr.x86_64 0:2.4.1-6.el6 will be installed
---> Package php-devel.x86_64 0:5.3.3-27.el6_5 will be installed
--> Processing Dependency: php(x86-64) = 5.3.3-27.el6_5 for package: php-devel-5.3.3-27.el6_5.x86_64
---> Package ppl.x86_64 0:0.10.2-11.el6 will be installed
--> Finished Dependency Resolution
Error: Package: php-devel-5.3.3-27.el6_5.x86_64 (updates)
Requires: php(x86-64) = 5.3.3-27.el6_5
Installed: php-5.3.3-27.el6_5.1.x86_64 (@updates)
php(x86-64) = 5.3.3-27.el6_5.1
Available: php-5.3.3-26.el6.x86_64 (base)
php(x86-64) = 5.3.3-26.el6
Available: php-5.3.3-27.el6_5.x86_64 (updates)
php(x86-64) = 5.3.3-27.el6_5
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

Running free -m now (kind of pointless, as it is back to normal now):

free -m
total used free shared buffers cached
Mem: 15938 2921 13017 0 88 816
-/+ buffers/cache: 2016 13922
Swap: 2047 0 2047M

And the netstat


netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -15
19
4 127.0.0.1
2 81.156.223.142
1 servers)
1 Address
1 90.206.201.8

Sat, 08/30/2014 - 05:30
JamesSimpson

Hmm, I think I may have found the issue.

I seem to have thousands of these in the messages log:

Aug 30 05:05:14 JSServer01 named[29765]: client 127.0.0.1#45585: query (cache) '131.205.13.211.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#43407: query (cache) '29.193.26.103.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#41691: query (cache) '241.150.174.195.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#37403: query (cache) '166.109.97.211.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#58532: query (cache) '241.150.174.195.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#44044: query (cache) '102.120.149.107.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#37691: query (cache) '91.34.135.174.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#57784: query (cache) '219.106.153.184.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#40505: query (cache) '204.5.106.41.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#35974: query (cache) '91.34.135.174.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#35621: query (cache) '53.79.234.212.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#44718: query (cache) '102.120.149.107.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:16 JSServer01 named[29765]: client 127.0.0.1#52370: query (cache) '53.79.234.212.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:17 JSServer01 named[29765]: client 127.0.0.1#42438: query (cache) '177.10.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:17 JSServer01 named[29765]: client 127.0.0.1#41674: query (cache) '202.209.241.61.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:18 JSServer01 named[29765]: client 127.0.0.1#56260: query (cache) '124.10.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:19 JSServer01 named[29765]: client 127.0.0.1#48054: query (cache) '166.109.97.211.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:22 JSServer01 named[29765]: client 127.0.0.1#49980: query (cache) '188.17.82.36.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#49930: query (cache) '204.5.106.41.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#57424: query (cache) '188.17.82.36.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#57964: query (cache) '120.107.255.193.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#35676: query (cache) '124.10.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:23 JSServer01 named[29765]: client 127.0.0.1#35009: query (cache) '101.95.101.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:24 JSServer01 named[29765]: client 127.0.0.1#47569: query (cache) '120.107.255.193.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:24 JSServer01 named[29765]: client 127.0.0.1#39782: query (cache) '227.58.73.203.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:24 JSServer01 named[29765]: client 127.0.0.1#50507: query (cache) '101.95.101.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:24 JSServer01 named[29765]: client 127.0.0.1#41356: query (cache) '156.12.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:25 JSServer01 named[29765]: client 127.0.0.1#43907: query (cache) '227.58.73.203.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:25 JSServer01 named[29765]: client 127.0.0.1#50367: query (cache) '179.107.160.163.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:25 JSServer01 named[29765]: client 127.0.0.1#58792: query (cache) '179.107.160.163.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:25 JSServer01 named[29765]: client 127.0.0.1#45449: query (cache) '182.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#35984: query (cache) '19.96.95.23.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#42738: query (cache) '19.96.95.23.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#57701: query (cache) '187.92.95.23.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#33209: query (cache) '77.113.182.192.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#51364: query (cache) '240.9.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:26 JSServer01 named[29765]: client 127.0.0.1#56060: query (cache) '240.9.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:27 JSServer01 named[29765]: client 127.0.0.1#54580: query (cache) '238.210.34.89.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:27 JSServer01 named[29765]: client 127.0.0.1#34927: query (cache) '187.92.95.23.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:27 JSServer01 named[29765]: client 127.0.0.1#54763: query (cache) '170.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:28 JSServer01 named[29765]: client 127.0.0.1#51508: query (cache) '170.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:28 JSServer01 named[29765]: client 127.0.0.1#34891: query (cache) '77.113.182.192.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:29 JSServer01 named[29765]: client 127.0.0.1#37835: query (cache) '181.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:29 JSServer01 named[29765]: client 127.0.0.1#47091: query (cache) '156.12.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:31 JSServer01 named[29765]: client 127.0.0.1#47907: query (cache) '167.13.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:31 JSServer01 named[29765]: client 127.0.0.1#42951: query (cache) '167.13.244.162.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:31 JSServer01 named[29765]: client 127.0.0.1#37369: query (cache) '223.59.200.220.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#54876: query (cache) '187.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#56875: query (cache) '187.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#56911: query (cache) '182.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#37661: query (cache) '171.233.15.199.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#35656: query (cache) '220.59.200.220.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:32 JSServer01 named[29765]: client 127.0.0.1#42569: query (cache) '33.114.193.123.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:33 JSServer01 named[29765]: client 127.0.0.1#40194: query (cache) '33.114.193.123.in-addr.arpa/PTR/IN' denied
Aug 30 05:05:33 JSServer01 named[29765]: client 127.0.0.1#43916: query (cache) '181.233.15.199.in-addr.arpa/PTR/IN' denied
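
With thousands of such lines, a tally by client and record type is easier to read than the raw log. Below is a minimal sketch that assumes the line format shown above; the pattern DENIED and the function tally_denied are hypothetical names. Note that every client in this excerpt is 127.0.0.1, meaning the denied lookups were issued by something on the server itself.

```python
import re
from collections import Counter

# Matches named's "query ... denied" lines, with or without "(cache)".
DENIED = re.compile(r"client (\S+?)#\d+: query (?:\(cache\) )?'([^']+)' denied")

def tally_denied(log_lines):
    """Count denied queries per client address and per record type."""
    by_client = Counter()
    by_type = Counter()
    for line in log_lines:
        match = DENIED.search(line)
        if not match:
            continue  # not a denied-query line (e.g. a firewall entry)
        client, question = match.groups()
        name, rtype, rclass = question.rsplit("/", 2)  # e.g. name/PTR/IN
        by_client[client] += 1
        by_type[rtype] += 1
    return by_client, by_type

# Two lines copied from the log excerpt above.
lines = [
    "Aug 30 05:05:14 JSServer01 named[29765]: client 127.0.0.1#45585: query (cache) '131.205.13.211.in-addr.arpa/PTR/IN' denied",
    "Aug 30 05:05:15 JSServer01 named[29765]: client 127.0.0.1#43407: query (cache) '29.193.26.103.in-addr.arpa/PTR/IN' denied",
]
clients, types = tally_denied(lines)
print(clients.most_common(5))
print(types.most_common(5))
```

To run it over the real log, pass an open file object (for example the messages log mentioned above) instead of the list.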
Sat, 08/30/2014 - 07:36
Locutus

Okay, Eric might be able to say more about the error you get when trying to finish package updates; I'm not familiar enough with CentOS (I'm assuming you're using that, or another distro that uses "yum").

Did this issue start just after you installed updates? Or did it happen before that?

Note that the 40 is not the CPU usage, but the system load. CPU usage is usually expressed as the percentage of time the CPU spends handling processes. In your case, the maximum would be 400%, or 100% for each core.

System load, on the other hand, basically tells you how many processes on average are running or ready to run, averaged over a time window (usually 1 minute, 5 minutes and 15 minutes). In addition to CPU, this also takes other required resources into account, e.g. when a process has to wait for HDD availability. With your 4-core CPU, a load of up to 4 is acceptable and "normal" if the system is very heavily used.

So a load of 40 means that 40 processes are ready to do something but can't, because resources are lacking. It's to be expected that the system is nearly unresponsive then. In your case, that's probably not CPU power (since your top output showed that the CPU was mostly idle), but something else.
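
The rule of thumb above (load compared with the number of cores) can be sketched in a few lines. This is only an illustration: the function names are made up, and the threshold of 1.0 per core is the rough guideline from this post, not a hard limit.

```python
import os

def load_per_core(load, cores):
    """Load average divided by the number of cores."""
    return load / cores

def is_overloaded(load, cores):
    """True when more tasks are competing than there are cores."""
    return load_per_core(load, cores) > 1.0

def current_status():
    """Read live values; os.getloadavg() is only available on Unix-like systems."""
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    return five_min, cores, is_overloaded(five_min, cores)

# Figures quoted in this thread: a load of 40.1 versus the usual 0.3, on 4 cores.
print(load_per_core(40.1, 4))
print(is_overloaded(40.1, 4))
print(is_overloaded(0.3, 4))
```

For the numbers in this thread, a load of 40.1 on 4 cores works out to roughly ten tasks per core, while the usual 0.3 is far below one.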

A good candidate is the HDD, in case there's hardware trouble with it. What kind of HDD setup do you have in the server? Single disk? Software/hardware RAID? You might want to use the command smartctl to review the HDDs' status values.

Since this only happens after a while, you might want to observe it for a bit and note if the system load goes up. You can review historical atop data by running atop -r /var/log/atop.log. When the load goes up, note if the disk is overloaded ("DSK % busy" is a good indicator), also check which processes use what amount of memory, disk, network etc. You can sort the output of atop accordingly and switch to different screens. Press "?" for a help screen.
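
When comparing several pasted atop samples, the "busy" figures can also be pulled out programmatically. The following is a rough sketch that assumes pipe-separated text like the atop output earlier in this thread; the function busy_percent is a hypothetical helper, not part of atop.

```python
import re

def busy_percent(atop_lines, labels=("DSK", "LVM")):
    """Map device name -> busy % for DSK/LVM rows in atop's text output."""
    result = {}
    for line in atop_lines:
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) < 3 or cells[0] not in labels:
            continue  # not a disk or logical-volume row
        for cell in cells[2:]:
            match = re.fullmatch(r"busy\s+(\d+)%", cell)
            if match:
                result[cells[1]] = int(match.group(1))
                break
    return result

# Rows copied from the atop output posted earlier in the thread.
rows = [
    "CPU | sys 2% | user 15% | irq 0% | idle 378% | wait 5% |",
    "LVM | Group00-root | busy 5% | read 10 | write 192 | avio 2.62 ms |",
    "DSK | sda | busy 5% | read 10 | write 71 | avio 6.53 ms |",
]
print(busy_percent(rows))
```

A device that stays near 100% busy while the CPU rows show mostly idle and wait time would fit the disk-bottleneck theory.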

Also don't forget to check last to see what those 12 logins were during your last problem phase! It shows you all logins with username and IP address. Pay attention to any entries with unexpected users/IP addresses there!

Sat, 08/30/2014 - 09:07
JamesSimpson

I checked the last logins and I can confirm they are all mine.

It also looks like my server may have been hit by a DDoS attack, maybe?

Sat, 08/30/2014 - 09:16
JamesSimpson

I am seeing a lot of these in the messages log

Aug 29 19:51:52 JSServer01 named[29765]: client 127.0.0.1#11277: query (cache) 'gmx.net/NS/IN' denied
Aug 29 19:51:52 JSServer01 named[29765]: client 127.0.0.1#11277: query (cache) 'cingular.com/NS/IN' denied
Aug 29 19:51:52 JSServer01 named[29765]: client 127.0.0.1#11277: query (cache) 'sourceforge.net/NS/IN' denied
Aug 29 19:50:18 JSServer01 named[29765]: client 127.0.0.1#52864: query (cache) 'intel.com/NS/IN' denied
Aug 29 19:50:18 JSServer01 named[29765]: client 127.0.0.1#52864: query (cache) 'msn.com/NS/IN' denied
Aug 29 19:50:18 JSServer01 named[29765]: client 127.0.0.1#52864: query (cache) 'comcast.net/NS/IN' denied

And then what looks like a DoS attack?

Aug 30 01:11:41 JSServer01 kernel: Firewall: *TCP_OUT Blocked* IN= OUT=em1 SRC=149.255.100.109 DST=69.46.36.10 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=25880 DF PROTO=TCP SPT=50786 DPT=9050 WINDOW=14600 RES=0x00 SYN URGP=0 UID=508 GID=503
Aug 30 01:11:41 JSServer01 named[29765]: client 127.0.0.1#44437: query (cache) '187.88.217.189.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:41 JSServer01 named[29765]: client 127.0.0.1#46883: query (cache) '187.88.217.189.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:41 JSServer01 named[29765]: client 127.0.0.1#53390: query (cache) '225.222.197.69.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:42 JSServer01 named[29765]: client 127.0.0.1#38526: query (cache) '252.55.186.210.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:42 JSServer01 kernel: Firewall: *TCP_OUT Blocked* IN= OUT=em1 SRC=149.255.100.109 DST=69.46.36.10 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=25881 DF PROTO=TCP SPT=50786 DPT=9050 WINDOW=14600 RES=0x00 SYN URGP=0 UID=508 GID=503
Aug 30 01:11:42 JSServer01 named[29765]: client 127.0.0.1#56360: query (cache) '94.158.55.50.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:42 JSServer01 named[29765]: client 127.0.0.1#33568: query (cache) '34.137.46.77.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:43 JSServer01 named[29765]: client 127.0.0.1#55732: query (cache) '190.243.45.70.in-addr.arpa/PTR/IN' denied
Aug 30 01:11:43 JSServer01 named[29765]: client 127.0.0.1#57461: query (cache) '120.141.93.216.in-addr.arpa/PTR/IN' denied
Sat, 08/30/2014 - 10:38
JamesSimpson

Locutus, I run updates all the time to keep the server up to date, but around a week ago there were quite a few updates which I ran, and then I enabled greylisting as I was starting to see a lot of spam emails coming through.

After that, I then started to get CSF alerts of high load averages, and then it seemed to get worse.

I am running a Dell PowerEdge R210, which comes with a Dell RAID card and two 1TB hard drives set up in RAID 1.

In Virtualmin, it only shows the RAID (SCSI device A Drive size 953.31 GB - Make and model Dell VIRTUAL DISK)

I have another machine which is running quite happily without the same issues, but that one runs software RAID across two disks, and I am able to query the RAID and the disks. With this machine I've never been able to query the RAID, as I don't think there are any proper Dell drivers for the RAID card under Linux.

The RAID card is a Dell SAS 6/iR Adapter

Sat, 08/30/2014 - 10:54
JamesSimpson

Hi Guys,

It started building up again, so I ran atop -r and this is the output:

ATOP - JSServer01 2014/08/30 15:02:04 --------- 4h25m53s elapsed
PRC | sys 94.89s | user 19m30s | #proc 184 | #zombie 0 | #exit 0 |
CPU | sys 1% | user 19% | irq 0% | idle 371% | wait 9% |
cpu | sys 1% | user 9% | irq 0% | idle 82% | cpu000 w 8% |
cpu | sys 0% | user 5% | irq 0% | idle 94% | cpu002 w 1% |
cpu | sys 0% | user 3% | irq 0% | idle 97% | cpu001 w 0% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu003 w 0% |
CPL | avg1 0.27 | avg5 0.29 | avg15 0.27 | csw 5643189 | intr 6191011 |
MEM | tot 15.6G | free 11.8G | cache 1.4G | buff 232.5M | slab 406.8M |
SWP | tot 2.0G | free 2.0G | | vmcom 2.8G | vmlim 9.8G |
LVM | Group00-root | busy 10% | read 158419 | write 785040 | avio 1.76 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | avio 2.57 ms |
DSK | sda | busy 10% | read 112136 | write 262769 | avio 4.43 ms |
NET | transport | tcpi 534967 | tcpo 484902 | udpi 13309 | udpo 13651 |
NET | network | ipi 555500 | ipo 516192 | ipfrw 0 | deliv 548501 |
NET | em1 0% | pcki 492572 | pcko 649938 | si 36 Kbps | so 409 Kbps |
NET | lo ---- | pcki 101110 | pcko 101110 | si 13 Kbps | so 13 Kbps |
Window has been resized...
PID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/17
11352 1 6.33s 4m13s 310.1M 102.7M 324K 10720K N- - S 0 2% php-cgi
11353 1 5.98s 3m59s 310.1M 102.7M 124K 12436K N- - S 2 2% php-cgi
14890 1 7.86s 3m47s 286.6M 81180K 0K 11336K N- - S 1 1% php-cgi
1866 16 20.63s 1m45s 863.0M 63104K 81144K 1.0G N- - S 3 1% mysqld
6279 1 4.30s 79.64s 311.8M 104.1M 2692K 70844K N- - D 2 1% php-cgi
6992 1 2.12s 64.37s 278.9M 77108K 164K 4K N- - S 0 0% php-cgi
10698 1 2.92s 57.18s 301.1M 95656K 572K 52416K N- - S 1 0% php-cgi
6242 1 1.36s 39.79s 285.7M 78572K 80K 4K N- - S 0 0% php-cgi
6993 1 1.10s 33.55s 272.9M 66768K 220K 4K N- - S 2 0% php-cgi
78 1 21.30s 0.00s 0K 0K 0K 0K N- - S 3 0% kipmi0
6600 1 0.51s 17.45s 264.5M 63392K 176K 164K N- - S 0 0% php-cgi

Sat, 08/30/2014 - 14:24
JamesSimpson

I think I have figured it out - it's something to do with BIND. I think I've been going through DDoS attacks for some strange reason.

I have just added this to named.conf:

acl "trusted" {
        My server ip address
        My server ip address 2
        My secondary DNS server IP address
        localhost;
        localnets;
};

options {
        listen-on port 53 {
                any;
        };
        listen-on-v6 port 53 {
                any;
        };
        directory "/var/named";
        dump-file "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
        allow-query { trusted; };
        allow-transfer { trusted; };
        allow-recursion { trusted; };
        allow-query-cache { trusted; };
        recursion no;

        dnssec-enable yes;
        dnssec-validation yes;
        dnssec-lookaside auto;

        /* Path to ISC DLV key */
        bindkeys-file "/etc/named.iscdlv.key";

        managed-keys-directory "/var/named/dynamic";
        also-notify {
        };
};

I now see a lot of these kinds of warnings in my log file:

Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.198.26#53: query 'dansimpson.net/SPF/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.198.26#53: query 'dansimpson.net/SPF/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.198.26#53: query 'ns2.j5huh.net/A/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.198.26#53: query 'ns1.j5huh.net/A/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.192.25#21267: query 'ns1.j5huh.com/A/IN' denied
Aug 30 20:21:20 JSServer01 named[12935]: client 80.241.192.25#20384: query 'ns1.j5huh.com/A/IN' denied

Which I am assuming are the remains of a DNS attack?

Sun, 08/31/2014 - 06:17
JamesSimpson

Well, adding those DNS settings broke my websites, as I couldn't access them. I have, however, tightened the firewall to block multiple queries, which seems to have worked.

Mon, 09/01/2014 - 14:18
JamesSimpson

Does this give any clues? LVM and DSK are flashing red?

ATOP - JSServer01 2014/09/01 13:08:44 --------- 2m54s elapsed
PRC | sys 5.84s | user 2.64s | #proc 138 | #zombie 0 | #exit 0 |
CPU | sys 8% | user 7% | irq 0% | idle 307% | wait 78% |
cpu | sys 4% | user 2% | irq 0% | idle 25% | cpu000 w 69% |
cpu | sys 2% | user 4% | irq 0% | idle 88% | cpu001 w 5% |
cpu | sys 1% | user 1% | irq 0% | idle 96% | cpu002 w 2% |
cpu | sys 0% | user 0% | irq 0% | idle 97% | cpu003 w 2% |
CPL | avg1 1.38 | avg5 0.58 | avg15 0.21 | csw 248036 | intr 226145 |
MEM | tot 15.6G | free 14.2G | cache 501.0M | buff 14.9M | slab 334.3M |
SWP | tot 2.0G | free 2.0G | | vmcom 868.7M | vmlim 9.8G |
LVM | Group00-root | busy 78% | read 109666 | write 2872 | avio 1.21 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | avio 1.09 ms |
DSK | sda | busy 79% | read 65008 | write 1376 | avio 2.07 ms |
NET | transport | tcpi 24 | tcpo 24 | udpi 75 | udpo 102 |
NET | network | ipi 120 | ipo 135 | ipfrw 0 | deliv 102 |
NET | em1 0% | pcki 182 | pcko 85 | si 0 Kbps | so 0 Kbps |
NET | lo ---- | pcki 33 | pcko 33 | si 0 Kbps | so 0 Kbps |
*** system and process activity since boot ***
PID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/16
158 1 4.22s 0.98s 36096K 1368K 276K 16K N- - S 1 3% plymouthd
2158 1 0.04s 1.33s 239.1M 52280K 2804K 4K N- - S 0 1% spamd
1 1 0.56s 0.02s 19356K 1524K 409.7M 6968K N- - S 0 0% init
34 1 0.54s 0.00s 0K 0K 0K 0K N- - S 0 0% kblockd/0
78 1 0.32s 0.00s 0K 0K 0K 0K N- - S 3 0% kipmi0
437 1 0.01s 0.15s 10648K 756K 9268K 0K N- - S 2 0% udevd
2182 1 0.04s 0.01s 154.2M 13520K 11332K 7712K N- - S 3 0% postgrey
1843 2 0.01s 0.04s 37812K 4184K 1556K 4K N- - S 0 0% hald
2260 1 0.00s 0.04s 81296K 3408K 520K 8K N- - S 3 0% master

Mon, 09/01/2014 - 14:18
JamesSimpson

And this was from yesterday, when it started to build up again
ATOP - JSServer01 2014/08/31 00:00:01 --------- 6h17m12s elapsed
PRC | sys 3m58s | user 25m22s | #proc 201 | #zombie 0 | #exit 1 |
CPU | sys 2% | user 15% | irq 0% | idle 374% | wait 9% |
cpu | sys 0% | user 8% | irq 0% | idle 84% | cpu000 w 8% |
cpu | sys 0% | user 4% | irq 0% | idle 95% | cpu002 w 1% |
cpu | sys 0% | user 2% | irq 0% | idle 97% | cpu001 w 0% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu003 w 0% |
CPL | avg1 0.13 | avg5 0.16 | avg15 0.14 | csw 9523149 | intr 8306300 |
MEM | tot 15.6G | free 12.0G | cache 1.2G | buff 255.2M | slab 192.2M |
SWP | tot 2.0G | free 2.0G | | vmcom 3.1G | vmlim 9.8G |
LVM | Group00-root | busy 10% | read 158124 | write 917942 | avio 2.18 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | avio 0.88 ms |
DSK | sda | busy 10% | read 119043 | write 345707 | avio 5.04 ms |
NET | transport | tcpi 539048 | tcpo 506361 | udpi 43734 | udpo 44075 |
NET | network | ipi 598771 | ipo 564033 | ipfrw 0 | deliv 583078 |
NET | em1 0% | pcki 514411 | pcko 678076 | si 19 Kbps | so 301 Kbps |
NET | lo ---- | pcki 131997 | pcko 131997 | si 13 Kbps | so 13 Kbps |
*** system and process activity since boot ***
PID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/23
12952 1 9.41s 6m04s 303.8M 97396K 296K 18116K N- - S 0 2% php-cgi
12239 1 8.58s 5m44s 292.0M 86864K 684K 16916K N- - S 0 2% php-cgi
13618 1 7.58s 5m12s 310.1M 102.7M 32K 13992K N- - S 0 1% php-cgi
1772 15 26.10s 2m08s 798.8M 66204K 214.2M 1.2G N- - S 0 1% mysqld
78 1 2m13s 0.00s 0K 0K 0K 0K N- - S 3 1% kipmi0
6474 1 3.62s 95.06s 286.2M 78744K 53280K 12952K N- - S 0 0% php-cgi
3119 1 3.16s 84.38s 287.4M 80580K 105.0M 8660K N- - S 0 0% php-cgi
2571 33 4.86s 42.56s 2.6G 181.8M 155.9M 13296K N- - S 1 0% dsm_om_connsvc
20531 1 2.01s 27.72s 275.4M 69604K 476K 47256K N- - S 0 0% php-cgi

Tue, 09/02/2014 - 07:48
Locutus

You posted the system activity since boot; you should also watch the ongoing activity. You can change the update interval with the i key. With t you can trigger a manual update.

It seems like the HDD is under constant high load. You can sort the process list by disk usage with shift-d and switch to disk details with d, to find out which process(es) are using the disk so much.

Mon, 09/08/2014 - 13:52
JamesSimpson

Right, I have to restart the server about every other day to get the process load back to normal, just so I can even log in over SSH.

These logs are from the 5th and show high LVM and DSK busy figures.

ATOP - JSServer01 2014/09/05 13:29:42 --------- 3m22s elapsed
PRC | sys 6.72s | user 2.89s | #proc 141 | #trun 1 | #tslpi 161 | #tslpu 3 | #zombie 0 | clones 2157 | | #exit 0 |
CPU | sys 7% | user 6% | irq 0% | idle 304% | wait 82% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 4% | user 2% | irq 0% | idle 19% | cpu000 w 75% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 3% | irq 0% | idle 93% | cpu001 w 4% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 3% | user 1% | irq 0% | idle 95% | cpu002 w 2% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 1% | user 0% | irq 0% | idle 97% | cpu003 w 2% | | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
CPL | avg1 1.17 | avg5 0.53 | | avg15 0.20 | | csw 256516 | intr 253915 | | | numcpu 4 |
MEM | tot 15.6G | free 14.2G | cache 489.2M | dirty 1.2M | buff 13.6M | slab 343.7M | | | | |
SWP | tot 2.0G | free 2.0G | | | | | | | vmcom 864.7M | vmlim 9.8G |
LVM | Group00-root | busy 82% | read 112338 | write 2805 | KiB/r 7 | KiB/w 4 | MBr/s 4.32 | MBw/s 0.05 | avq 4.86 | avio 1.45 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | KiB/r 4 | KiB/w 0 | MBr/s 0.01 | MBw/s 0.00 | avq 3.27 | avio 0.93 ms |
DSK | sda | busy 83% | read 67273 | write 1386 | KiB/r 13 | KiB/w 8 | MBr/s 4.46 | MBw/s 0.05 | avq 2.46 | avio 2.44 ms |
NET | transport | tcpi 28 | tcpo 27 | udpi 93 | udpo 145 | tcpao 2 | tcppo 1 | tcprs 1 | tcpie 0 | udpip 0 |
NET | network | ipi 151 | ipo 183 | ipfrw 0 | deliv 138 | | | | icmpi 17 | icmpo 9 |
NET | em1 0% | pcki 141 | pcko 122 | si 0 Kbps | so 0 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | lo ---- | pcki 33 | pcko 33 | si 0 Kbps | so 0 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
*** system and process activity since boot ***
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR DSK CMD 1/8
1 - root root 1 0.55s 0.03s 19232K 1516K 428.7M 8236K N- - S 0 54% init
1038 - root root 1 0.01s 0.01s 108.0M 1804K 339.9M 1008K N- - S 0 42% rc
1923 - mysql mysql 11 0.01s 0.02s 477.5M 23128K 9304K 92K N- - S 1 1% mysqld
434 - root root 1 0.01s 0.17s 10760K 876K 9148K 0K N- - S 0 1% udevd
1973 - root root 1 0.03s 1.35s 239.1M 52280K 2804K 4K N- - S 0 0% spamd
1661 - haldaemo haldaemo 2 0.02s 0.03s 37824K 4200K 1560K 4K N- - S 0 0% hald
1323 - named named 7 0.02s 0.02s 382.6M 17392K 1468K 16K N- - S 0 0% named
346 - root root 1 0.00s 0.00s 0K 0K 0K 1128K N- - S 1 0% jbd2/dm-0-8
2072 - postfix postfix 1 0.00s 0.00s 81584K 3940K 1064K 0K N- - S 3 0% trivial-rewrit
1284 - root root 4 0.00s 0.00s 243.3M 1612K 416K 172K N- - S 0 0% rsyslogd
2073 - postfix postfix 1 0.00s 0.00s 81580K 3612K 572K 0K N- - S 0 0% smtp
2062 - root root 1 0.00s 0.03s 81296K 3408K 520K 8K N- - S 1 0% master
2106 - root root 1 0.01s 0.01s 269.3M 28532K 516K 4K N- - D 1 0% httpd
1662 - root root 1 0.00s 0.00s 20328K 1156K 520K 0K N- - S 0 0% hald-runner
2117 - root root 1 0.01s 0.00s 17532K 5252K 500K 4K N- - R 2 0% atop
1764 - root root 1 0.00s 0.00s 107.7M 1460K 368K 0K N- - S 2 0% mysqld_safe
2071 - postfix postfix 1 0.00s 0.00s 81520K 3504K 336K 0K N- - S 3 0% qmgr
157 - root root 1 5.11s 1.21s 36096K 1372K 276K 12K N- - S 1 0% plymouthd

It looks like init and rc are causing issues?

Mon, 09/08/2014 - 13:53
JamesSimpson

And this is from a few days before, on the 2nd:

ATOP - JSServer01                                2014/09/02  00:00:01                                ---------                                  10h54m10s elapsed
PRC | sys    6m30s  | user  22m41s | #proc    229  | #trun  3 |  #tslpi   468 | #tslpu     0  | #zombie    1 | clones 48622  |              |  #exit      5 |
CPU | sys   2%  | user     17% | irq       0%  | idle    369% |  wait     12% |               | steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
cpu | sys   0%  | user  8% | irq       0%  | idle     80% |  cpu000 w 11% |               | steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
cpu | sys   1%  | user  5% | irq       0%  | idle     94% |  cpu002 w  1% |               | steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
cpu | sys   0%  | user  3% | irq       0%  | idle     97% |  cpu001 w  0% |               | steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
cpu | sys   0%  | user  2% | irq       0%  | idle     98% |  cpu003 w  0% |               | steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
CPL | avg1    0.21  | avg5    0.23 |               | avg15   0.26 |               | csw 17736354  | intr 16086e3 |               |              |  numcpu     4 |
MEM | tot    15.6G  | free   10.4G | cache   2.2G  | dirty   1.5M |  buff  273.5M | slab  521.9M  |              |               |              |       |
SWP | tot     2.0G  | free    2.0G |               |              |               |               |              |               | vmcom   4.7G |  vmlim   9.8G |
LVM | Group00-root  | busy     14% | read  248593  | write 1841e3 |  KiB/r     15 | KiB/w      3  | MBr/s   0.09 | MBw/s   0.18  | avq     7.79 |  avio 2.57 ms |
LVM | Group00-swap  | busy  0% | read     477  | write  0 |  KiB/r  4 | KiB/w      0  | MBr/s   0.00 | MBw/s   0.00  | avq     1.74 |  avio 1.86 ms |
DSK |          sda  | busy     14% | read  185218  | write 758680 |  KiB/r     20 | KiB/w      9  | MBr/s   0.10 | MBw/s   0.18  | avq     1.76 |  avio 5.68 ms |
NET | transport     | tcpi 1188535 | tcpo 1085220  | udpi   31117 |  udpo   31510 | tcpao  12781  | tcppo  28282 | tcprs  37102  | tcpie      0 |  udpip      0 |
NET | network       | ipi  1239374 | ipo  1153826  | ipfrw  0 |  deliv 1220e3 |               |              |               | icmpi    280 |  icmpo    142 |
NET | em1   0%  | pcki 1189111 | pcko 1707309  | si   26 Kbps |  so  463 Kbps | coll       0  | erri       0 | erro       0  | drpi       0 |  drpo       0 |
NET | lo      ----  | pcki  107796 | pcko  107796  | si   12 Kbps |  so   12 Kbps | coll       0  | erri       0 | erro       0  | drpi       0 |  drpo       0 |
                                                          *** system and process activity since boot ***
  PID      TID    RUID        EUID         THR     SYSCPU     USRCPU     VGROW     RGROW     RDDSK     WRDSK    ST    EXC    S    CPUNR     DSK    CMD       1/13
 2108        -    mysql       mysql         18     55.63s      4m35s      1.1G    68400K    213.8M  3.0G    N-  -    S        3     36%    mysqld
 2310        -    root        apache         1      0.34s      0.18s    218.7M     7228K      1.1G    300.1M    N-  -    S        1     15%    httpd
    1        -    root        root           1      0.56s      0.03s    19356K     1548K    782.4M    27124K    N-  -    S        3      9%    init
 2260        -    root        root           1      0.87s      0.17s    81296K     3408K    611.1M    67524K    N-  -    S        3      7%    master
  347        -    root        root           1     11.44s      0.00s        0K        0K        0K    634.0M    N-  -    S        3      7%    jbd2/dm-0-8
 3179        -    root        root           1      0.40s      1.01s    86620K    15836K    149.0M    458.3M    N-  -    S        2      7%    miniserv.pl
 2302        -    root        root           1      2.02s      0.39s    414.0M    38388K    246.7M    296.8M    N-  -    S        3      6%    httpd
 3013        -    root        root           8      0.00s      0.00s    690.8M     6292K    121.1M    155.9M    N-  -    S        3      3%    dsm_om_shrsvcd
 2949        -    root        root           1      0.00s      0.00s    131.9M      712K    151.6M    14536K    N-  -    S        2      2%    dsm_om_connsvc
 9091        -    drivingr    drivingr       1      6.12s     93.42s    302.7M    97156K      724K    112.2M    N-  -    S        0      1%    php-cgi
17087        -    drivingr    drivingr       1      4.00s     63.19s    310.4M    102.6M     1472K    90044K    N-  -    S        2      1%    php-cgi
20851        -    drivingr    drivingr       1      4.47s     50.94s    310.1M    102.2M    10436K    78680K    N-  -    S        0      1%    php-cgi
 2182        -    postgrey    postgrey       1      0.20s      1.05s    154.2M    13904K    12784K    55040K    N-  -    S        2      1%    postgrey
 5853        -    bojotool    bojotool       1      2.66s     62.79s    279.2M    77548K    46700K     9264K    N-  -    S        0      1%    php-cgi
  398        -    root        root           1      1.53s      0.00s        0K        0K      448K    49824K    N-  -    S        2      1%    flush-253:0
 5858        -    bojotool    bojotool       1      2.50s     63.19s    285.7M    78408K    26052K    10380K    N-  -    S        0      0%    php-cgi
 2321        -    root        root           1      0.18s      0.08s    114.5M     1268K    20052K    16180K    N-  -    S        3      0%    crond
 1466        -    root        root           4      0.50s      0.47s    243.3M     1772K     1144K    27104K    N-  -    S        0      0%    rsyslogd
Mon, 09/08/2014 - 18:10
JamesSimpson

The load has now shot up to over 1, and the DSK line is flashing in atop.

ATOP - JSServer01                                   2014/09/09  00:06:50                                   ---------                                   10s elapsed
PRC | sys    0.37s  | user   2.16s |  #proc    238 | #trun  2  | #tslpi   492 |  #tslpu     1 |  #zombie    0 | clones     5  |              |  #exit      1 |
CPU | sys       3%  | user     22% |  irq   0% | idle    278%  | wait     96% |               |  steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
cpu | sys   2%  | user     13% |  irq   0% | idle      1%  | cpu000 w 84% |               |  steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
cpu | sys   1%  | user      4% |  irq   0% | idle     96%  | cpu003 w  0% |               |  steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
cpu | sys   1%  | user      3% |  irq   0% | idle     95%  | cpu001 w  2% |               |  steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
cpu | sys   0%  | user  2% |  irq   0% | idle     87%  | cpu002 w 11% |               |  steal     0% | guest     0%  | curf 3.06GHz |  curscal   ?% |
CPL | avg1    1.37  | avg5    1.17 |               | avg15   0.66  |              |  csw     9608 |  intr   13480 |               |              |  numcpu     4 |
MEM | tot    15.6G  | free    4.7G |  cache   7.8G | dirty   5.3M  | buff  347.1M |  slab  597.7M |               |               |              |       |
SWP | tot     2.0G  | free    2.0G |               |               |              |               |               |               | vmcom   4.8G |  vmlim   9.8G |
LVM | Group00-root  | busy     98% |  read    1433 | write    958  | KiB/r      4 |  KiB/w  3 |  MBr/s   0.57 | MBw/s   0.37  | avq     1.71 |  avio 4.08 ms |
DSK |          sda  | busy     98% |  read    1433 | write     88  | KiB/r      4 |  KiB/w     43 |  MBr/s   0.57 | MBw/s   0.37  | avq     1.12 |  avio 6.41 ms |
NET | transport     | tcpi     614 |  tcpo     386 | udpi      17  | udpo      17 |  tcpao      6 |  tcppo      8 | tcprs     15  | tcpie      0 |  udpip      0 |
NET | network       | ipi      632 |  ipo      418 | ipfrw      0  | deliv    631 |               |               |               | icmpi      0 |  icmpo      0 |
NET | em1       0%  | pcki     657 |  pcko     630 | si   59 Kbps  | so  647 Kbps |  coll       0 |  erri       0 | erro       0  | drpi       0 |  drpo       0 |
NET | lo      ----  | pcki  48 |  pcko  48 | si    8 Kbps  | so    8 Kbps |  coll   0 |  erri   0 | erro       0  | drpi       0 |  drpo       0 |
 
  PID      TID    RUID        EUID         THR     SYSCPU     USRCPU     VGROW     RGROW     RDDSK     WRDSK    ST    EXC    S    CPUNR     DSK     CMD        1/4
27735        -    root        root           1      0.13s      0.12s        0K        0K     5784K        0K    --  -    D        2     65%     tar
  346        -    root        root           1      0.01s      0.00s        0K        0K        0K     1188K    --  -    S        2     13%     jbd2/dm-0-8
 3163        -    root        root           1      0.00s      0.00s        0K        0K      516K        8K    --  -    S        0      6%     miniserv.pl
23261        -    drivingr    drivingr       1      0.03s      0.49s    10240K     9924K        0K  492K    --  -    S        0      6%     php-cgi
18385        -    drivingr    drivingr       1      0.04s      0.92s    -8704K    -8476K        0K      316K    --  -    S        0      4%     php-cgi
27737        -    root        root           1      0.00s      0.00s        0K        0K        0K      304K    --  -    S        0      3%     cat
12883        -    mysql       mysql         15      0.01s      0.02s        0K        0K        0K      180K    --  -    S        3      2%     mysqld
 1323        -    root        root           4      0.00s      0.00s        0K        0K        0K       24K    --  -    S        0      0%     rsyslogd
18286        -    apache      apache         4      0.00s      0.03s        0K       96K        0K   12K    --  -    S        1      0%     httpd
27791        -    postfix     postfix        1      0.00s      0.00s    82252K     4640K        0K       12K    N-  -    S        2      0%     cleanup
18258        -    apache      apache         5      0.02s      0.08s        0K        0K        0K        8K    --  -    S        1      0%     httpd
18265        -    apache      apache         5      0.00s      0.00s        0K        0K        0K    8K    --  -    S        2      0%     httpd
21194        -    apache      apache         5      0.00s      0.00s        0K        0K        0K        8K    --  -    S        1      0%     httpd
  715        -    root        root           1      0.00s      0.00s        0K        0K        0K        8K    --  -    S        2      0%     flush-253:0
23173        -    apache      apache         5      0.00s      0.01s        0K        0K        0K        4K    --  -    S        1      0%     httpd
Tue, 09/16/2014 - 04:01
Locutus

You again posted the "system activity since boot". You might want to observe the ongoing activity (press t to trigger a manual update of the screen) while the HDD is under high load, and check there which processes use the disk the most and how much.
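To illustrate why the interval matters: the kernel's disk counters are cumulative, so a "since boot" figure averages over the whole uptime, while the difference between two snapshots shows what is happening now. Below is a minimal sketch, assuming the standard Linux /proc/diskstats layout in which the 13th field is the time in milliseconds the device spent with I/O in flight. The function names are hypothetical, and the result should be comparable to, though not necessarily identical with, the busy figure atop prints.

```python
import time

def parse_diskstats(text):
    """Return {device: io_ticks_ms} from the contents of /proc/diskstats."""
    ticks = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 13:
            # fields[2] is the device name, fields[12] the ms spent doing I/O
            ticks[fields[2]] = int(fields[12])
    return ticks

def busy_percent(before, after, elapsed_s):
    """Share of the interval (in percent) each device had I/O in flight."""
    result = {}
    for dev, end in after.items():
        if dev in before and elapsed_s > 0:
            result[dev] = 100.0 * (end - before[dev]) / (elapsed_s * 1000.0)
    return result

def read_diskstats(path="/proc/diskstats"):
    """Take one snapshot of the cumulative counters."""
    with open(path) as f:
        return parse_diskstats(f.read())

def measure(interval_s=10):
    """Take two snapshots interval_s seconds apart and return busy percentages."""
    first = read_diskstats()
    time.sleep(interval_s)
    second = read_diskstats()
    return busy_percent(first, second, interval_s)
```

Calling measure(10) on the server while the slowdown is happening would give a per-device figure for just those ten seconds, which is the kind of ongoing view meant above.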

Sat, 09/27/2014 - 10:45
JamesSimpson

Well, I think I've managed to track down what is causing the disk issues: two processes, init and rc.

Now what would be causing this?

ATOP - JSServer01 2014/09/25 12:51:08 --------- 4m36s elapsed
PRC | sys 8.92s | user 3.42s | #proc 138 | #trun 1 | #tslpi 158 | #tslpu 3 | #zombie 0 | clones 2275 | #exit 0 |
CPU | sys 6% | user 5% | irq 0% | idle 303% | wait 85% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 3% | user 2% | irq 0% | idle 16% | cpu000 w 79% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 3% | user 1% | irq 0% | idle 95% | cpu002 w 1% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 2% | irq 0% | idle 93% | cpu001 w 4% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
cpu | sys 0% | user 0% | irq 0% | idle 98% | cpu003 w 1% | steal 0% | guest 0% | curf 3.06GHz | curscal ?% |
CPL | avg1 1.21 | avg5 0.67 | avg15 0.27 | | | csw 299648 | intr 335944 | | numcpu 4 |
MEM | tot 15.6G | free 14.1G | cache 519.1M | dirty 8.7M | buff 15.6M | slab 373.4M | | | |
SWP | tot 2.0G | free 2.0G | | | | | | vmcom 899.3M | vmlim 9.8G |
LVM | Group00-root | busy 86% | read 121974 | write 3222 | KiB/r 7 | KiB/w 3 | MBr/s 3.37 | MBw/s 0.05 | avio 1.89 ms |
LVM | Group00-swap | busy 0% | read 322 | write 0 | KiB/r 4 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.93 ms |
DSK | sda | busy 86% | read 79345 | write 1608 | KiB/r 12 | KiB/w 8 | MBr/s 3.47 | MBw/s 0.05 | avio 2.94 ms |
NET | transport | tcpi 16 | tcpo 16 | udpi 178 | udpo 210 | tcpao 0 | tcppo 0 | tcprs 0 | udpip 0 |
NET | network | ipi 205 | ipo 236 | ipfrw 0 | deliv 197 | | | icmpi 3 | icmpo 9 |
NET | em1 0% | pcki 225 | pcko 180 | si 2 Kbps | so 0 Kbps | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | lo ---- | pcki 29 | pcko 29 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpi 0 | drpo 0 |
*** system and process activity since boot ***
PID TID RDDSK WRDSK WCANCL DSK CMD 1/28
1 - 462.4M 7388K 8K 54% init
1110 - 345.5M 992K 472K 40% rc
2016 - 26432K 432K 0K 3% mysqld
2108 - 3456K 8436K 4K 1% postgrey
439 - 9272K 0K 0K 1% udevd
