Problems with high loads and SPAM being sent - I think

#1 Fri, 12/07/2012 - 16:30
nothingless

So, I have a fairly new server running Virtualmin, and everything was fine until a few days ago. The server load has gone insane, and there are always loads of postfix processes running, sometimes hundreds. I've also received bounced emails that were sent from addresses on my server that don't exist, hence my thinking this may be SPAM related.

I ran uptime:

 23:10:02 up  7:00,  1 user,  load average: 183.23, 180.38, 166.24

mailq | tail -1
-- 512 Kbytes in 168 Requests.

top (a few minutes ago there was a lot more postfix stuff in it)

11357 root     20   0 15468 1664  896 R  0.3  0.0   0:00.24 top
    1 root     20   0 19272 1476 1192 S  0.0  0.0   0:01.62 init
    2 root     20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root     20   0     0    0    0 S  0.0  0.0   0:00.69 ksoftirqd/0
    4 root     20   0     0    0    0 S  0.0  0.0   0:00.36 kworker/0:0
    5 root     20   0     0    0    0 S  0.0  0.0   0:00.02 kworker/u:0
    6 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    7 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root     20   0     0    0    0 S  0.0  0.0   0:00.76 ksoftirqd/1
   11 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root     20   0     0    0    0 S  0.0  0.0   0:00.86 ksoftirqd/2
   14 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   15 root     20   0     0    0    0 S  0.0  0.0   0:00.24 kworker/3:0
   16 root     20   0     0    0    0 S  0.0  0.0   0:00.62 ksoftirqd/3
   17 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/4
   18 root     20   0     0    0    0 S  0.0  0.0   0:00.07 kworker/4:0
   19 root     20   0     0    0    0 S  0.0  0.0   0:00.10 ksoftirqd/4
   20 root     RT   0     0    0    0 S  0.0  0.0  95:48.99 migration/5
   21 root     20   0     0    0    0 S  0.0  0.0   0:00.10 kworker/5:0
   22 root     20   0     0    0    0 S  0.0  0.0   0:00.30 ksoftirqd/5
   23 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/6
   25 root     20   0     0    0    0 S  0.0  0.0   0:00.16 ksoftirqd/6
   26 root     RT   0     0    0    0 S  0.0  0.0 183:33.31 migration/7
   28 root     20   0     0    0    0 S  0.0  0.0   0:00.10 ksoftirqd/7
   29 root      0 -20     0    0    0 S  0.0  0.0   0:00.00 cpuset
   30 root      0 -20     0    0    0 S  0.0  0.0   0:00.00 khelper
   31 root     20   0     0    0    0 S  0.0  0.0   0:00.00 kdevtmpfs
   32 root      0 -20     0    0    0 S  0.0  0.0   0:00.00 netns
  425 root     20   0     0    0    0 S  0.0  0.0   0:00.02 sync_supers
  427 root     20   0     0    0    0 S  0.0  0.0   0:00.00 bdi-default
  428 root      0 -20     0    0    0 S  0.0  0.0   0:00.00 kintegrityd
  430 root      0 -20     0    0    0 S  0.0  0.0   0:00.00 kblockd
  514 postfix  20   0 79484 4184 2864 D  0.0  0.0   0:00.00 cleanup
  527 postfix  20   0 96392 5372 3804 S  0.0  0.0   0:00.00 smtpd

Any idea on how I can further troubleshoot and fix this issue? Some help would be much appreciated! The server runs CentOS 6 with 32GB of RAM.

Fri, 12/07/2012 - 16:32
nothingless

System Information > Running Processes: 653

see here: http://pastie.org/5496096

Fri, 12/07/2012 - 16:33
nothingless

I tried turning on DKIM but that failed with: Failed to save DKIM settings : Failed to lock file /etc/postfix/main.cf after 5 minutes. Last error was :

Not sure it would have made a difference though!

Fri, 12/07/2012 - 18:00
andreychek

Howdy,

What output do you receive if you run the command "mailq | tail -1"?

I suspect that'll show a high number of email in your mail queue... if so, you'd need to figure out what is generating the spam.

It's likely either a legitimate user whose desktop got a virus, or someone who broke into a web app on your server and is using it to send spam.

You can figure all that out from the message headers of the email in your mail queue... you can view that in Webmin -> Servers -> Postfix -> Mail Queue.

-Eric

Fri, 12/07/2012 - 18:13
Locutus

@Eric: He did run mailq in his first post, and got 168 mails in the queue. :)

There's a great number of php-cgi processes running as user "disneysc", so that might be the one being flooded. I'd start checking any sites that run as that user. A good software to scan for web-based malware is "Linux Malware Detect".

Also I'd stop Postfix right away until you figure this out, before your server gets put onto too many blacklists for spamming.

Fri, 12/07/2012 - 18:43
nothingless

Hello! Thanks for the replies, guys!

Yes, I ran the mailq command, it's showing this at the moment:

mailq | tail -1
-- 596 Kbytes in 225 Requests.

I think it's probably important to note that this is a personal server, it only has around 20 of my own personal websites on it. No other users have access. Most of the websites have one info@ forwarding address setup, and one domain uses a single POP3 account, and that's all. On average, I get maybe 10 emails daily from all these sites combined, so even though 225 maybe doesn't seem like a lot, it's far more than I estimate it should be!

The disneysc user is actually the largest site on the server - it runs on Wordpress and gets over 1 million page views a month. I use caching, including APC and plugins for WP, but I thought the number of php-cgi processes were probably just due to the site being busy?

I'm trying to get into the Mail Queue to look at the headers but the load is so high it's not loading for me:

CPU load averages       246.73 (1 min) 245.45 (5 mins) 242.56 (15 mins)

Crazy, right??

Running processes shows:

CPU load averages: 247.10 (1 mins) , 246.31 (5 mins) , 243.48 (15 mins)
CPU type: Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz , 8 cores

ID    Owner  CPU     Command
20    root   90.0 %  [migration/5]
26    root   32.7 %  [migration/7]
3244  mysql  0.6 %   /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-e ...

and everything else at 0%, around 600 or so again.

I've managed to stop Postfix via the Virtualmin interface, but still no luck with the mail queue. Is there anything else I can do, or some other way I can look at the queue? Virtualmin itself seems responsive enough; it's just the Webmin -> Servers -> Postfix page that won't load, so I can't click through to the queue. :/

Fri, 12/07/2012 - 22:24
andreychek

Howdy,

Sorry, missed your mailq output in your original post.

It's possible that Postfix not running is preventing Virtualmin from being able to display the Postfix screen... but you can accomplish all that from the command line too.

If it's a spam problem, the output of "mailq" will likely show the same user over and over.

If you can figure out which user is recurring (usually as the sender), take note of one of the message queue IDs associated with an email being sent from them.

Then, run:

postcat -q MESSAGE_QUEUE_ID

Scroll down to where the message headers show up -- and there you should be able to get some insight into how the email is being generated.
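
Spotting the recurring sender by eye gets tedious with a few hundred queued messages. As a rough sketch only, the Python below tallies senders from saved "mailq" output and remembers one queue ID per sender to pass to postcat -q. It assumes the traditional mailq column layout; the function name and the sample addresses are made up for illustration.

```python
import re
from collections import Counter

# A queue entry starts at column 0 with the queue ID, optionally followed by
# "*" (active) or "!" (on hold), then size, arrival time and sender.
ENTRY = re.compile(
    r"^([0-9A-Za-z]+)[*!]?\s+(\d+)\s+\w{3}\s+\w{3}\s+\d+\s+[\d:]+\s+(\S+)"
)


def senders_in_queue(mailq_text):
    """Return (Counter of senders, dict mapping sender -> first queue ID)."""
    counts = Counter()
    example_id = {}
    for line in mailq_text.splitlines():
        match = ENTRY.match(line)
        if match:
            queue_id, _size, sender = match.groups()
            counts[sender] += 1
            example_id.setdefault(sender, queue_id)
    return counts, example_id


# Made-up sample in the usual mailq layout (all addresses are placeholders).
SAMPLE = """\
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
4F2A61C0E3*    2310 Fri Dec  7 22:58:11  MAILER-DAEMON
                                         someone@example.org

9B7D31C0F1     1875 Fri Dec  7 22:59:40  MAILER-DAEMON
(connect to mx.example.net[192.0.2.10]:25: Connection timed out)
                                         other@example.net

1C3E51C102     4120 Fri Dec  7 23:01:05  info@example.com
                                         friend@example.org

-- 8 Kbytes in 3 Requests.
"""

counts, example_id = senders_in_queue(SAMPLE)
for sender, total in counts.most_common():
    print(total, sender, "-> postcat -q", example_id[sender])
```

To use it on a real queue, save the output of "mailq" to a file and pass the file's contents to senders_in_queue in place of SAMPLE.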

If you'd like a hand interpreting the email headers, feel free to post them here.

-Eric

Sat, 12/08/2012 - 04:58
nothingless
Sat, 12/08/2012 - 08:33
Locutus

The example you posted looks more like a non-deliverable reply being sent out in reply to a spam that was sent TO your system, i.e. the user "pamltup@nothing-less.net".

The postqueue excerpt looks like all those mails it's trying to deliver are NDRs.

My impression from that info is that your server is under a massive incoming spam attack, to non-existent addresses, and that it is trying to return NDRs to all of them.

My suggestion would be to prevent Postfix from sending NDRs for the time being (I'd need to look up how to do that, as I haven't tried it before), and then clear the queue of everything that's coming from your MAILER-DAEMON.
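
For the queue-clearing part, here is a hedged Python sketch that only picks out the queue IDs whose sender is MAILER-DAEMON from saved "mailq" output. It assumes the traditional mailq column layout; the function name and sample data are illustrative, and it deletes nothing by itself.

```python
import re

# Queue ID (optionally flagged with "*" or "!"), size, arrival time, sender.
ENTRY = re.compile(
    r"^([0-9A-Za-z]+)[*!]?\s+\d+\s+\w{3}\s+\w{3}\s+\d+\s+[\d:]+\s+(\S+)"
)


def bounce_queue_ids(mailq_text, sender="MAILER-DAEMON"):
    """Return the queue IDs of all entries sent by `sender`, in queue order."""
    ids = []
    for line in mailq_text.splitlines():
        match = ENTRY.match(line)
        if match and match.group(2) == sender:
            ids.append(match.group(1))
    return ids


# Made-up sample (placeholder addresses), two bounces and one normal mail.
SAMPLE = """\
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
4F2A61C0E3     2310 Sat Dec  8 08:12:44  MAILER-DAEMON
                                         someone@example.org

1C3E51C102     4120 Sat Dec  8 08:13:02  info@example.com
                                         friend@example.org

9B7D31C0F1!    1875 Sat Dec  8 08:14:19  MAILER-DAEMON
                                         other@example.net

-- 8 Kbytes in 3 Requests.
"""

ids = bounce_queue_ids(SAMPLE)
print("\n".join(ids))
```

The printed list is meant to be reviewed first. Once you're happy with it, postsuper -d - reads queue IDs from standard input, one per line, and deletes those messages.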

But: This NDR thing should not be responsible for the MASSIVE system load of over 200 you're seeing. Postfix should surely be able to handle a few hundred mails in the queue without overloading the system that massively. Maybe a web script on your system is being abused in an attempt to send spam to local addresses. You might want to install "atop" and see which process uses the most CPU.

Also install that Linux Malware Detect I mentioned and have it scan your web directories. Shut down Apache if required while doing so, if the system load doesn't decrease.

Sat, 12/08/2012 - 12:21
nothingless

Thanks so much for your reply! I've turned off NDRs for now using the instructions found here:

http://www.linuxquestions.org/questions/linux-server-73/disable-ndr-on-p... - seemed to work OK, restarted Postfix after I applied the changes.

I have now also installed Linux Malware Detect and am running a scan on the nothing-less domain, it might take a while but hopefully that will throw something up. Any ideas for how else I might investigate the crazy loads? Anything else I could do to optimize the server?

Sat, 12/08/2012 - 12:39
nothingless

Hmm, it didn't find anything:

maldet --scan-all /home/nothing
Linux Malware Detect v1.4.1
            (C) 2002-2011, R-fx Networks <proj@r-fx.org>
            (C) 2011, Ryan MacDonald <ryan@r-fx.org>
inotifywait (C) 2007, Rohan McGovern <rohan@mcgovern.id.au>
This program may be freely redistributed under the terms of the GNU GPL v2

maldet(5569): {scan} signatures loaded: 10427 (8559 MD5 / 1868 HEX)
maldet(5569): {scan} building file list for /home/nothing, this might take awhile...
maldet(5569): {scan} file list completed, found 18710 files...
maldet(5569): {scan} found ClamAV clamscan binary, using as scanner engine...
maldet(5569): {scan} scan of /home/nothing (18710 files) in progress...

maldet(5569): {scan} scan completed on /home/nothing: files 18710, malware hits 0, cleaned hits 0
maldet(5569): {scan} scan report saved, to view run: maldet --report 120812-1909.5569

Tried it on the disneysc one as well and nothing found. I'll go through the rest of my 15+ sites just in case too though.

I tried installing atop and it seems to be working in spite of a little error that popped up when I first ran it. Here's what it output; does anything there catch your eye as being wrong or worth looking into?

http://pastie.org/5499331

Sat, 12/08/2012 - 13:20
Locutus

Okay, if LMD doesn't turn up anything, that's a good first step towards making sure your site isn't compromised. You might check the Apache log for an unusually high number of accesses to some page.
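
One way to do that check, sketched in Python under the assumption that the access log uses Apache's common/combined format: count hits per requested path and per client address, then look at the top entries. The function name and the sample lines are made up.

```python
import re
from collections import Counter

# client ident user [timestamp] "METHOD path protocol" status ...
REQUEST = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3})')


def top_requests(log_lines, limit=10):
    """Return (most requested paths, busiest clients) as lists of (key, hits)."""
    paths = Counter()
    clients = Counter()
    for line in log_lines:
        match = REQUEST.match(line)
        if match:
            client, _method, path, _status = match.groups()
            paths[path] += 1
            clients[client] += 1
    return paths.most_common(limit), clients.most_common(limit)


# Made-up sample lines; the addresses come from documentation-only ranges.
SAMPLE = [
    '192.0.2.7 - - [08/Dec/2012:13:05:01 +0000] "POST /wp-comments-post.php HTTP/1.1" 200 512',
    '192.0.2.7 - - [08/Dec/2012:13:05:02 +0000] "POST /wp-comments-post.php HTTP/1.1" 200 512',
    '198.51.100.4 - - [08/Dec/2012:13:05:03 +0000] "GET / HTTP/1.1" 200 10240',
    '192.0.2.7 - - [08/Dec/2012:13:05:04 +0000] "POST /wp-comments-post.php HTTP/1.1" 200 512',
]

top_paths, top_clients = top_requests(SAMPLE)
for path, hits in top_paths:
    print(hits, path)
for client, hits in top_clients:
    print(hits, client)
```

For a real log, pass an open file object instead of SAMPLE; the function only needs something that yields lines.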

As for atop, nothing unusual right now. You might wanna watch it if CPL goes up again, and see which processes have high CPU, or if there is high disk I/O or disk wait.

Sat, 12/08/2012 - 17:33
nothingless

Hmm. I think the load must have crept up while I was away from the computer, as my server crashed again. Anything I can check for now that I'm rebooting it? Which Apache log can I check?

Sat, 12/08/2012 - 19:02
nothingless

OK, the loads have crept up a little again so I checked atop again:

 01:51:17 up  1:27,  1 user,  load average: 14.08, 8.17, 5.82

Results here: http://pastie.org/5500478

Does that show anything strange?

Sun, 12/09/2012 - 04:08
Locutus

It's always a bit difficult to give ideas about a dynamic process like system load creeping up by looking at a snapshot of performance data. :) Three things from here:

The disk performance stats indicate that you're using software RAID, which looks quite idle in your snapshot. The physical disks though show high and constant activity. It's possible that your RAID is performing a check or rebuild or something, you can check this with cat /proc/mdstat. You might want to post its output here.
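
As an illustration of what to look for in that output, this Python sketch reports, per md array, whether a resync, recovery, check or reshape is in progress. The sample text is made up in the usual /proc/mdstat layout, and the function name is hypothetical.

```python
import re

ARRAY = re.compile(r"^(md\d+)\s*:")
ACTIVITY = re.compile(r"(resync|recovery|check|reshape)\s*=\s*([\d.]+)%")


def raid_activity(mdstat_text):
    """Map each md array to (activity, percent done), or None when idle."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY.match(line)
        if array:
            current = array.group(1)
            result[current] = None
            continue
        activity = ACTIVITY.search(line)
        if activity and current is not None:
            result[current] = (activity.group(1), float(activity.group(2)))
    return result


# Made-up sample: md1 is resyncing, md0 is idle.
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      1952333632 blocks [2/2] [UU]
      [=>...................]  resync =  5.9% (115311232/1952333632) finish=212.4min speed=144120K/sec

md0 : active raid1 sdb1[1] sda1[0]
      20970432 blocks [2/2] [UU]

unused devices: <none>
"""

status = raid_activity(SAMPLE)
for name, state in status.items():
    print(name, "idle" if state is None else "%s at %.1f%%" % state)
```

On the server itself you would feed it the contents of /proc/mdstat instead of SAMPLE.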

atop writes historical performance data to files named /var/log/atop*.log. You can look at these with atop -r filename. Then you can navigate through the 10-minute interval snapshots using t (forward) and Shift-T (back).

I can offer to take a look at your system, if you trust me sufficiently to give me root login. If you'd like that, just post an instant messenger screen name here and I'll contact you there.

Sun, 12/09/2012 - 05:04
nothingless

I would be so happy if you could take a look at my system! I don't use IM, but my email address is fivebyfive@gmail.com - if you give me a shout on there I'll email it through. Thank you!

Sun, 12/09/2012 - 09:12
Locutus

Sent a mail, waiting for reply. :)

Sun, 12/09/2012 - 11:59
nothingless

Thanks, info sent! :)

Sun, 12/09/2012 - 12:21
Locutus

Okay, here's what I found out so far:

The RAID arrays seem to be okay, no rebuild in progress or anything. /dev/md1 and its physical disks are very busy, with very high write rates caused by the php-cgi processes of users "adriana" and "erin".

You might want to check what kind of web pages those have running and why they produce so many disk writes. The Apache access logs suggest that those pages get very many hits.

Unfortunately, your version of atop does not write log files and is missing some process audit functions. Might have to do with the hoster you use.

You might want to install the tool "apachetop". I could not find it with Yum, and unfortunately I'm not too familiar with yum-based distros.

Sun, 12/09/2012 - 12:22
nothingless

Yeah, they're 2 very busy Wordpress sites. All my sites are WP, and I'm using W3 Total Cache with Page Disk Caching turned on. Could that cause this? Do you have any tips for how I might better optimize or even upgrade my server to cope with these kinds of write rates?

Sun, 12/09/2012 - 12:27
Locutus

Users "disneysc", "halloween", "fashions" and "nikki" now also exhibit very high write rates. You might wanna check their web pages too.

Sun, 12/09/2012 - 12:31
Locutus

Well, your box has 32 GB of memory, so why not install a PHP cache on several levels, like an instruction cache (xcache), and "memcached", which Wordpress should support?

If you have very fast HDDs that can cope with those high write rates, e.g. RAID-0 arrays with multiple disks, or SSDs, it is okay to use extensive disk caching, but since you're experiencing problems with system load going through the roof, you should try memory caches instead of disk caches.

This does not solve the spam issue, but that's another thing. Finding out where those come from is separate from the high load issue. Maybe let's first try to fix the load issue.

Sun, 12/09/2012 - 12:36
Locutus

I can see CPU Wait values going up a lot, which implies that your processes are bogged down by waiting for probably Disk I/O. Another indication that your disk caches are excessive.
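
The wait figure can also be derived without atop. As a sketch, assuming the standard Linux /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal, ...), the Python below computes the share of CPU time spent in iowait between two samples of the aggregate "cpu" line. The sample numbers are invented for the arithmetic.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent waiting on I/O between samples."""
    first, second = cpu_times(before), cpu_times(after)
    deltas = [new - old for old, new in zip(first, second)]
    # Sum only user..steal (first 8 fields); guest time is already counted
    # inside user and nice.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


# Invented samples: 1000 ticks elapsed in total, 650 of them in iowait.
BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
AFTER = "cpu  1100 0 550 8200 1150 0 0 0 0 0"

wait = iowait_percent(BEFORE, AFTER)
print("iowait: %.1f%%" % wait)
```

On a live system, read the first line of /proc/stat twice a few seconds apart and pass both lines in. A value that stays high while user and system time stay low points at the disks rather than the CPU.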

The Apache logs seem to indicate that the Wordpress sites in question host photos or other images? It's not really advisable to cache those on disk. That's basically like copying them on every request. Caching is useful for compiled PHP code or dynamic pages that contain mostly text and don't change very often. Try disabling any disk caching for starters.

Sun, 12/09/2012 - 12:36
nothingless

OK - so you think I should try turning off disk caches? I've installed APC a few weeks ago, does that work with memcache or xcache too, do you know?

Sun, 12/09/2012 - 12:39
Locutus

Please reload; I edited my previous post at the same time you sent yours.

Also: Do those Wordpress sites have any kind of contact form or other way of sending you emails? That might be what is (ab)used for sending the spam to you.

You might try disabling any contact forms while you test that theory.

Sun, 12/09/2012 - 12:39
nothingless

OK, I've turned off page disk caching for 'adriana' and changed it to APC instead - where would I go to see if that's making a difference?

Sun, 12/09/2012 - 12:41
Locutus

Also I'm not sure if Wordpress is the right software to host a bunch of high-traffic photo sites, performance-wise. There's probably better suited software for that, though I don't know off the bat what you could try.

About disk cache: You should turn it off for ALL your WP sites, and then take a look at atop to see whether the disk activity dies down and CPL goes down too.

Sun, 12/09/2012 - 12:43
nothingless

Ah ok sorry, I've reloaded now. Yes, they are all image galleries - 95% of the content is images. OK, I'll try disabling page disk caching altogether, is that better than switching it to APC? And I'll take down the forms too just to see if that helps. Fingers crossed!

Sun, 12/09/2012 - 12:44
Locutus

Also, take a look at the "cache" value in the MEM row. That indicates how much of your physical memory is used for file system caches. As you can see, it's at about 2.3 GB already, which means that probably most of the photos that users request are in the memory cache.

So it's possible you don't even need xcache or memcached (considering you have an 8-core CPU which is bored out of its mind for the most part), but using those 32 GB RAM as memory cache (which Linux does automatically) might suffice.

Sun, 12/09/2012 - 12:47
Locutus

Yes, you don't need ANY page disk cache if you host mostly galleries. As you can see, it's actually counter-productive. Your system spends most of its time writing to the cache, and the processes have to wait for the disk to write that cache.

Reading from the disk cache won't bring an advantage anyway, since that 8-core CPU and the 32 GB memory, when properly used as file system cache, deliver the gallery pages much quicker than reading the stuff from your disk cache.

Mon, 12/10/2012 - 05:29
nothingless

Well, I have to say I think you've hit the nail on the head!! I switched off all my disk caching last night, and when I checked my server again this morning, it was 1) still online!! and 2) the loads were all around 1.5 which is SO much more reasonable than 200! :-)

Thank you so much for your help - I can do a little further tweaking now, and I might try installing xcache or memcache also just to speed things up, but it really seems like the disk writing speed was the bottleneck, and without that, the server seems happy enough. Wasn't a SPAM issue after all then! Thanks again for all your help!!

Mon, 12/10/2012 - 09:28
nothingless

OK - so here's a quick update! I installed and configured Memcache, alongside APC which was already installed. I made sure all of my sites are using W3 Total Cache as a Wordpress plugin, with both page caching and minify set to APC, and object & database caching set to Memcache, so nothing is writing caches to disk. The sites that had WP Super Cache installed have been switched to W3 Total Cache, with WP Super Cache deleted completely. This seems to have helped a lot; my loads are much lower and the server is less prone to crashing.

I've been checking atop, and although things look better, the rows for DSK are still in the red often enough, with busy listed around 80-90% most of the time, only sometimes falling to 60% or less, and sometimes spiking to 100% (when that happens, all my sites immediately start throwing 500 Internal Server Errors). Where should I look next to see which users are writing to disk so much, or how could I troubleshoot this next?

Mon, 12/10/2012 - 10:12
Locutus

The list below the generic statistics shows which process is doing what. When you press g to switch to "generic data", then shift+d to order the list by disk activity, you should be able to see who's the culprit.

Mon, 12/10/2012 - 10:21
nothingless

Hmm, it seems to be the 'apache' process owned by root that does a lot of writing? Could you take a look and confirm I'm reading it right?

Mon, 12/10/2012 - 11:12
Locutus

I've been watching atop for a few minutes now, but disk write rates are well below 1 MB/s, with an occasional "spike" of 2 MB/s caused by MySQL. Nothing anywhere near the red zone.

Just now a process named "bw.pl", possibly triggered by cron, had a quick CPU spike and read a few hundred MBs.

Nothing out of the ordinary so far.

Mon, 12/10/2012 - 11:25
nothingless

Ok thank you! I actually made a little tweak to my APC settings after my last post, changing the apc.mmap_file_mask to /dev/zero as that was supposed to lower disk activity, and it really has, everything looks pretty good at the moment. Thanks again for all your help - I'll keep monitoring it over the next few days to see how I go!

Mon, 12/10/2012 - 11:29
Locutus

Rogerroger, and you're welcome! :)

(Oh, and don't mind the little Trojan I left somewhere on your server. ;) )

Mon, 12/10/2012 - 11:30
nothingless

LOL! Thanks, I always enjoy little surprises like that. ;)

Tue, 12/11/2012 - 15:30
nothingless

Back with another quick question!! My server's going so far so good:

uptime
 22:15:47 up 1 day, 6:09, 2 users, load average: 0.14, 0.29, 0.35

I've been monitoring things via atop, which all seems good, the disk writes are all around 20-30%. However, because I'm now actually using more of my RAM, I've seen that creep up and up over the past 2 days. At the moment, the MEM row looks like this:

MEM |  tot    31.3G  | free  521.1M  |  cache  22.6G  |  dirty   7.9M |  buff    3.0G  |  slab    1.5G  |               |                |

The value for free has gone from 25+ GB to 500MB over the past 48 hours. Obviously I want to use all of the RAM I have, but is this something to worry about? Do I need to make some changes to make sure the cache doesn't fill up completely, or, will Linux/APC/Memcache all be smart enough to automatically clear some of the cache when it needs more space?

Tue, 12/11/2012 - 18:14
Locutus

Nothing to worry about there. RAM used as cache IS basically free and can be reassigned to applications as soon as they require it. Linux follows the philosophy "free ram is wasted ram" and uses as much of it as cache as it can get. :)
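
To put a number on that, here is a rough Python sketch that adds up free memory, buffers and page cache from /proc/meminfo-style text. The sum is only an estimate of what the kernel could hand back to applications, since not every cached page can be dropped. The function names are hypothetical and the sample uses illustrative round numbers, not the exact figures from the MEM row above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def reclaimable_estimate_kb(meminfo):
    """Free memory plus buffers plus page cache, in kB (a rough estimate)."""
    return meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]


# Illustrative round numbers for a box with roughly 32 GB of RAM.
SAMPLE = """\
MemTotal:       32000000 kB
MemFree:          500000 kB
Buffers:         3000000 kB
Cached:         22000000 kB
"""

info = parse_meminfo(SAMPLE)
estimate = reclaimable_estimate_kb(info)
print("free: %d kB, free + buffers + cache: %d kB" % (info["MemFree"], estimate))
```

With these sample values, only 500000 kB is strictly free, yet about 25500000 kB is free or reclaimable, which is why a small "free" figure on its own is not a warning sign.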