lookup-domain-daemon.pl uses 100% of the CPU

Hello,

After a recent batch of package updates on my Debian 6.0 amd64 system, I (and at least one forum member) have run into a problem with 'lookup-domain-daemon.pl': the service now uses 100% of our CPU.

One of the packages updated on both systems was MySQL (client, common, server, server_core): v5.1.61. But looking at the content of the 'lookup-domain-daemon.pl' file, I doubt that MySQL is the cause of the problem. On my system, the following packages were also updated:

  • firmware-linux-free_2.6.32-41squeeze2_all.deb,
  • linux-base_2.6.32-41squeeze2_all.deb,
  • linux-image-2.6.32-5-amd64_2.6.32-41squeeze2_amd64.deb,
  • linux-libc-dev_2.6.32-41squeeze2_amd64.deb.

On my particular system, I have no more than 4 processes linked to 'lookup-domain-daemon.pl' and 3 of them are using 100% of my CPU (I have 2 x Intel Xeon E5620).

root@sd-28802:~# ps aux | grep  "lookup"
root     13199  0.0  0.1  76472 47280 ?        Ss   03:55   0:00 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root     14321  100  0.2  88608 53528 ?        R    03:58   0:50 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root     14326  100  0.2  88608 53528 ?        R    03:58   0:50 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root     14336  101  0.2  88608 53528 ?        R    03:58   0:49 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl

The file '/var/webmin/lookup-domain-daemon.pid' indicates process #13199 as the initial one.

The file '/var/webmin/lookup-domain-daemon.log' shows no error messages and contains normal output ("[Mon Apr 16 04:00:03 2012] user=root NOUSER", for example), which suggests that the daemon is still working even while using 100% of the CPU.

After executing "/etc/init.d/lookup-domain stop" and waiting approximately 1 minute, the 4 processes are gone from the "ps aux" output.

I am wondering why this specific daemon has started misbehaving.

If I can be of any help to figure out the root of the problem, I'm ready to help.

Tristan CHARBONNIER

PS: Topic on the forum: https://www.virtualmin.com/node/21885

Status: 
Closed (works as designed)

Comments

Which Virtualmin version are you running there? We recently released version 3.91, which includes a fix for an issue like this.

For me the problem, on Debian 6 64 bit, kernel 3, (reported in forum post https://www.virtualmin.com/node/21885) started with the upgrade to Virtualmin 3.91.

Incidentally, I also found that the command 'service lookup-domain stop' stops the process only after a delay. You can of course stop it immediately with kill -9 [pid number].

I also use v3.91-gpl, which was updated at the same time (I forgot to mention it at first, sorry).

Jamie, there's another fellow who posted a similar issue in the forums who sees high CPU usage with lookup-domain and is using Virtualmin 3.91 as well.

Probably referring to my thread, linked above.

Unless this apparent bug can be resolved fairly quickly, it would be useful to have brief advice on whether it is safe to downgrade Virtualmin to 3.90, and on the easiest method for doing so, since the resource consumption of lookup-domain-d under 3.90 was acceptable on my setup.

John - when lookup-domain is using high CPU like this, can you try running:

strace -o /tmp/strace.txt -p XXX

where XXX is the PID of the high CPU process. Let it run for 10 seconds, hit ctrl-c, and then email me the strace.txt file at jcameron@virtualmin.com
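Because the busy process may only live for a minute or two, it can help to pick the PID automatically instead of copying it by hand. The following is only a rough sketch, not part of Virtualmin: the function names busiest_pid and trace_busiest are made up for illustration, and it assumes the procps 'ps' and coreutils 'timeout' commands are available.

```shell
# Print the PID of the matching process with the highest %CPU.
# Reads "pid pcpu args" lines on stdin, e.g. from: ps -eo pid=,pcpu=,args=
# Processes showing 0.0 %CPU are never selected.
busiest_pid() {
    pattern="$1"
    awk -v pat="$pattern" '
        index($0, pat) && $2 + 0 > max { max = $2 + 0; pid = $1 }
        END { if (pid != "") print pid }'
}

# Hypothetical helper: attach strace to the busiest daemon process for
# 10 seconds and write the trace to /tmp/strace.txt.
trace_busiest() {
    pid=$(ps -eo pid=,pcpu=,args= | busiest_pid lookup-domain-daemon.pl)
    if [ -n "$pid" ]; then
        timeout 10 strace -o /tmp/strace.txt -p "$pid"
    else
        echo "no busy lookup-domain-daemon.pl process found" >&2
        return 1
    fi
}
```

Running trace_busiest as root right after the CPU spike appears should give strace a chance to attach before the process exits.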

A short-term mitigation for this issue is to just stop the lookup-domain-daemon.pl process. Virtualmin will fall back to an alternate method of looking up users, which uses more CPU for each email but won't run into this bug.

Thanks for the tip about stopping the process and still running Spamassassin.

I tried running strace and ltrace on the process pid (which anyway is not persistent), but can get no output either to file or screen. Sorry!

So does the process that is using 100% of CPU only run for a short time before exiting?

Correct. And restarts with new pid, I see sometimes one, sometimes two running. Maybe a minute or two each. I can find nothing in messages or dmesg log about it.

So in Virtualmin 3.91, the lookup-domain-daemon server was changed to run a separate process for each incoming message, rather than processing them in series. This shouldn't make any difference unless your system is low on memory, or you get huge spikes of email.

Roughly how many messages does your system get each hour?

Less than 200 per day including spam in the three accounts for which spamassassin is now enabled. Less than double that for accounts with alias forwarding (no actual mailbox), plus spam hitting the server for accounts with no mail set up.

After the problem started I disabled spam protection for all but three domains, with a total of 5 email addresses. Since then, if I start the lookup-domain service, within five minutes it is showing one or two processes using 100% CPU. The chances are that in that time one or two emails have come in. Ten arriving in five minutes would be unusual.

You might want to check the log file /var/log/procmail.log to see how many messages your system is actually processing per day - it's possible there is a local mail loop that is creating more email than you expect.
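One rough way to get a per-day count from that log is sketched below. It assumes the log contains standard procmail abstract lines of the form "From user@example.com  Mon Apr 16 04:00:03 2012"; if your procmail.log uses a different layout, the awk pattern will need adjusting. The function name count_messages_per_day is just an illustration.

```shell
# Count logged messages per day in a procmail-style log.
# Assumption: each message is logged with a line such as
#   From user@example.com  Mon Apr 16 04:00:03 2012
# Output: one "YEAR-MONTH-DAY COUNT" line per day.
count_messages_per_day() {
    logfile="${1:-/var/log/procmail.log}"
    awk '$1 == "From" && NF >= 7 { count[$7 "-" $4 "-" $5]++ }
         END { for (day in count) print day, count[day] }' "$logfile" | sort
}
```

Calling count_messages_per_day with no argument reads /var/log/procmail.log. A daily count far above what you expect would point to a mail loop or a misbehaving cron job.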

You are right.
1. an uninstalled Aegir was trying to write to a deleted mailbox (some kind of php-based cron job)
2. an awstats cron job was trying to write warning messages to a mailbox already full of warning messages at www-data@servername.com, and was writing instead to /var/www/Maildir/new

It never occurred to me to check that www-data mailbox. Anyway, awstats is not useful for me because it reports only one unique visitor when behind a reverse proxy, so I disabled it, then disabled its associated cron jobs.

I am now seeing spamassassin consuming 70% CPU, but only briefly. lookup-domain.d is consuming 3.1% of RAM on 1.5GB, so about 48M. So all OK.

Thanks for your help.

Cool, that will explain it. I will mark this bug as closed.