mailserver slow occasionally

Topic locked

#1 Mon, 03/17/2014 - 05:01

adlsrl

Hi,
I have a public mail server that has recently become very slow at delivering mail locally from time to time.
This is my Procmail configuration. Could something be misconfigured there?

LOGFILE=/var/log/procmail.log
TRAP=/etc/webmin/virtual-server/procmail-logger.pl
:0wi
VIRTUALMIN=|/etc/webmin/virtual-server/lookup-domain.pl $LOGNAME
EXITCODE=$?
:0
* ?/usr/bin/test "$EXITCODE" = "73"
/dev/null
:0
* ?/usr/bin/test "$VIRTUALMIN" != ""
{
INCLUDERC=/etc/webmin/virtual-server/procmail/$VIRTUALMIN
}
DEFAULT=/var/spool/mail/$LOGNAME
ORGMAIL=/var/spool/mail/$LOGNAME
DROPPRIVS=yes
:0
$DEFAULT
:0
* ^X-Spam-Status: Yes
/dev/null

This command takes about 30 seconds to produce its output:
mailq | tail -1
Total requests: 55

Other data that may be useful:

uptime
10:39:58 up 1 day, 20:51, 2 users, load average: 19.47, 13.06, 9.04

free -m
             total       used       free     shared    buffers     cached
Mem:          3042       2947         95          0         47       2149
-/+ buffers/cache:        749       2292
Swap:         4094          0       4094

Thanks to anyone who can help.

Thanks a lot.

#2 Mon, 03/17/2014 - 06:47

Locutus

Your system load is at an incredibly high value (19.47). You might want to install the tool "atop" and use it to check which process is using so many resources. It shows overloaded resources (CPU, RAM, disk I/O, network and so on) in red. Sort the list of processes accordingly to see which one uses the most resources.

Based on that result we can further look for solutions.

#3 Mon, 03/17/2014 - 06:53

adlsrl

Ok I will try and let you know. Thanks.

The server is OK now, and the output is:

uptime 12:33:33 up 1 day, 22:45, 0 users, load average: 5.44, 4.33, 5.88

#4 Mon, 03/17/2014 - 07:05

Locutus

Okay, so the load is going down. You might need to catch this "in the act" with atop to see which process is causing it.

Might also help to check the logs in /var/log to see if there's anything out of the ordinary at the time when the issue started.

#5 Mon, 03/17/2014 - 10:17

andreychek

We're sorry for the rude comments posted here. Please feel free to continue posting about the issues you're having with your server; we're happy to help :-)

Another thing you may wish to try, in addition to running atop (or top), is to see how many emails are in your mail queue. You can do that by running this command:

mailq | tail -1
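
If it helps to catch the slowdown as it happens, a small script along these lines could record the queue size and the load once a minute. This is only a sketch: it assumes the sendmail-style "Total requests: N" summary line shown earlier in this thread and an English-locale uptime, and the function names and the log path /tmp/mailq-load.log are made up for illustration.

```shell
#!/bin/sh
# Extract the count from a sendmail-style summary line such as "Total requests: 55".
queue_count() {
    printf '%s\n' "$1" | awk '/Total requests:/ { print $NF }'
}

# Extract the 1-minute load average from one line of `uptime` output.
load_1min() {
    printf '%s\n' "$1" | sed -n 's/.*load average: *\([0-9.]*\).*/\1/p'
}

# Print one timestamped sample: queue size and 1-minute load.
log_sample() {
    printf '%s queue=%s load1=%s\n' "$(date '+%F %T')" \
        "$(queue_count "$(mailq | tail -1)")" \
        "$(load_1min "$(uptime)")"
}

# Example (hypothetical log path): sample every 60 seconds until interrupted.
# while true; do log_sample >> /tmp/mailq-load.log; sleep 60; done
```

Comparing the logged timestamps with the entries in /var/log may then show what else was happening when the load started to climb.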

#6 Tue, 03/18/2014 - 04:45

adlsrl

Hi. I checked the server this morning with atop. I have attached a screenshot of the server while it was slow, at 10:14 am. The mail queue had 76 messages. About ten minutes later it looked like the 10:25 screenshot: the queue was at 69 but no longer growing, and new email was being delivered quickly. What do you think? Thanks a lot.

#7 Tue, 03/18/2014 - 06:00

Locutus

Okay, so it looks like your high system load is caused by disk I/O overload. The screenshot shows a rather low read and write rate though, so this should not overload the disk. Of course, since this is only a snapshot, you might want to watch the DSK lines in atop closely when the problem occurs.

Is this a physical disk, or a virtual system, or some kind of RAID or LVM or so? You might want to check the disk, maybe it has a hardware issue. You might want to test the disk throughput and see if that triggers the problem.

Please sort the process list in atop by disk usage using shift-D.

You can simulate intense disk activity using these:

dd if=/dev/zero of=/tmp/BIGFILE bs=1M count=10000
dd if=/tmp/BIGFILE of=/dev/null bs=1M count=10000

That will write 10 GB of zeroes to /tmp/BIGFILE and then read them back to /dev/null. Open a second shell and watch atop while doing that.

#8 Tue, 03/18/2014 - 11:57

adlsrl

Hi.

This is a vm running under VMware.
On 6th of January a blackout caused an unexpected shutdown of the host.
The array on the host checked itself and corrected some errors, then the VM started correctly.
How can I check the disk while the system is running? This is a production mail server.
The VM has 2 hard disks: one for data (sda) and one for backup (sdb).

Disk /dev/sda: 536.8 GB, 536870912000 bytes

255 heads, 63 sectors/track, 65270 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          25      200781   83  Linux
/dev/sda2              26         547     4192965   82  Linux swap / Solaris
/dev/sda3             548       65270   519887497+  83  Linux

Disk /dev/sdb: 107.3 GB, 107374182400 bytes

255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       13055   104856576   83  Linux

I tried the commands you suggested; here are the results. I used a 1 GB file to avoid stalling the system.

dd if=/dev/zero of=/tmp/BIGFILE bs=1M count=1000

1000+0 records in
1000+0 records out
1048576000 bytes (1,0 GB) copied, 32,54 seconds, 32,2 MB/s

dd if=/tmp/BIGFILE of=/dev/null bs=1M count=1000

1000+0 records in
1000+0 records out
1048576000 bytes (1,0 GB) copied, 28,9007 seconds, 36,3 MB/s

I can only attach a screenshot for the first command because I was not quick enough to capture the second one.

Thanks a lot.

#9 Tue, 03/18/2014 - 15:08

Locutus

The read and write rates are rather low for a modern HDD; under VMware they should be over 100 MB/s. Are you the operator of the host system?

If not, you might want to ask them if this low R/W rate is normal, or if they have some resource restrictions in place, or if there's any after-effect of that power failure and array rebuild.

This can have a number of causes though. Without further information about the host system and without witnessing the process myself, it's hard to guess. It'd be important to watch the system while the issue is occurring, and to see how disk I/O behaves and which processes cause it over a longer time.

#10 Wed, 03/19/2014 - 09:21

sgrayban

If you run top instead of atop, you should see what is eating up your I/O on the very first line. The default for top is to sort by CPU usage, which should catch the program that's hogging I/O.

#11 Thu, 03/20/2014 - 07:07

adlsrl

Hi.

The Linux VM is running under Windows Server 2008 and VMware Server 2, not ESX. I have a RAID 1 array of two 500 GB SATA II drives.
Another VM runs on the same server (a CentOS web server) and has no problems.

Copying a file on the Windows host system runs at 35 MB/s to 50 MB/s, depending on system load.

Could some cron jobs be causing the problem?

crontab -l
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /etc/webmin/status/monitor.pl
0 0 * * * /etc/webmin/security-updates/update.pl
0 1 * * * /etc/webmin/virtual-server/backup.pl --id 129908147120979
10 10 * * * /etc/webmin/package-updates/update.pl
@monthly ls -S -hl /var/spool/mail | mail support@adlgroup.it,tecno@adlgroup.it -s STATISTICHE_DIMENSIONE_FILE_MAIL_ADLHOST65
# * 5 * * 2 /etc/webmin/cron/range.pl 20-2-2012 22-2-2012 reboot
00 4 * * * /etc/webmin/usermin/update.pl
0 22 * * * /etc/webmin/backup-config/backup.pl 129908278022106

The attachment shows the other cron jobs.

Any idea?

Thanks.

#12 Thu, 03/20/2014 - 07:50

Locutus

Okay, this might not really help you, but I'd strongly suggest using different virtualization software. VMware Server 2 is discontinued and has not been updated in at least 5 years. Even before that, VMware recommended it only for testing and development, not for production use.

Using VMware Server 2 on top of a full-blown Windows server with software RAID can quite likely lead to the problems you're seeing.

If at all possible, I'd suggest migrating to the free ESXi, and possibly use a hardware RAID controller (because ESXi does not support software RAID). Alternatively you can set up software RAID within your virtual machines. While you're at it, you can take a look at the SMART values of your HDDs and perform a long SMART self-test.

Another alternative might be Hyper-V under Windows Server 2012, if you are for some reason forced to use a Windows server as the host system. You need quite a powerful machine for that though.

I'm quite sure your performance issues will go away then.

My personal opinion is: unless the host HAS to perform other functions like Active Directory, you should use a bare-metal hypervisor like ESXi for best performance.

Your cron jobs should not be the reason for the issues, those run only at specific times. You can repeat the "dd" test a few times to check if the disk throughput is so low all the time.
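
To repeat the write test at different times of day, a loop along these lines might help. It is a sketch that assumes GNU dd (for conv=fdatasync, which flushes the data to disk before dd reports its rate, so the page cache does not inflate the figure). The function names and the test file path are illustrative, and the parser only expects the "... copied, <time>, <rate> <unit>" summary line that dd printed earlier in this thread.

```shell
#!/bin/sh
# Read dd's output on stdin and print the rate from its summary line,
# e.g. "... copied, 32,54 seconds, 32,2 MB/s" -> "32.2 MB/s".
# A decimal comma (as printed in some locales) is turned into a dot.
dd_rate() {
    awk '/copied/ { v = $(NF-1); sub(/,/, ".", v); print v, $NF }'
}

# Write COUNT MiB of zeroes to FILE, RUNS times in a row, printing a
# timestamped rate after each run. The test file is deleted afterwards.
repeat_write_test() {
    file=$1; count=$2; runs=$3
    i=0
    while [ "$i" -lt "$runs" ]; do
        rate=$(dd if=/dev/zero of="$file" bs=1M count="$count" conv=fdatasync 2>&1 | dd_rate)
        printf '%s %s\n' "$(date '+%F %T')" "$rate"
        i=$((i + 1))
    done
    rm -f "$file"
}

# Example: three 1 GB runs, as in the earlier test.
# repeat_write_test /tmp/BIGFILE 1000 3
```

If the rate stays low on every run, the problem is likely constant; if it only drops at certain times, that points to contention with something else on the host.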

#13 Thu, 03/20/2014 - 09:18

adlsrl

I know that the server is old, but it has been running for many years without problems and we can't change it now.

Are there any other things I can check?

#14 Thu, 03/20/2014 - 09:29

adlsrl

In the attachment you can see the "crond" process using a lot of disk. This is why I asked you about the cron tasks. Thanks.

#15 Thu, 03/20/2014 - 09:37

adlsrl

I see that while the host seems to be OK (figure host.jpg), the VM shows 100% disk in use (vm.jpg).

That seems strange.

The host can reach up to 100 MB/s.

#16 Thu, 03/20/2014 - 10:00

Locutus

In your atop screenshot, crond has 54% disk usage, but that's only a relative value. It uses 54% of the total disk I/O, because no other process is using more.

It's striking though, again, that the rather low usage of about 1 MB/s causes a disk load of 34%.

So it's not a specific problem of cron jobs, but a general problem that even low disk I/O causes a high load. It just so happened that cron at that point caused that I/O.
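
As a rough back-of-the-envelope check of why that is striking, one can extrapolate the throughput at which the disk would be fully busy. This sketch assumes, simplistically, that disk busy time scales linearly with throughput; the function name is made up for illustration.

```shell
#!/bin/sh
# Estimate the throughput (MB/s) at which the disk would be 100% busy,
# given an observed rate in MB/s and the busy percentage seen at that rate.
# Assumes busy time scales linearly with throughput (a simplification).
saturation_mb_s() {
    LC_ALL=C awk -v r="$1" -v p="$2" 'BEGIN { printf "%.1f\n", r * 100 / p }'
}

# Figures from the atop screenshot: about 1 MB/s at 34% disk load.
saturation_mb_s 1 34
```

With those figures the estimate comes out at about 2.9 MB/s, far below the 32 MB/s that dd measured for sequential writes. Small scattered writes are always more expensive than one streaming write, so the two numbers are not directly comparable, but the gap supports the idea that the disk path is struggling in general.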

You could check if the virtual disk file has a high level of fragmentation or something like that.

I'm sorry, but otherwise I don't really have any other ideas besides recommending ESXi for production purposes instead of VMware Server 2. I'm not familiar with Server 2 (last time I used it for experiments was about 6 years ago), and this is too complex a setup to successfully guess via the forum. :)
