Failure of ClamAV causes mail loss by delivering to /dev/null

I'm not sure whether this is exactly the same as https://www.virtualmin.com/node/2941. It doesn't seem to be directly related to a timeout, but a program failure is causing the same result: mail is lost.

In this case an unknown issue caused clam or the wrapper to be killed. Even so, the filter shouldn't default to marking the message as a virus and sending it to /dev/null; it should perhaps quarantine the message instead, especially since there's no bounce, no warning (other than in procmail.log after the fact), and no way to recover the mail. For now I've disabled virus scanning as a workaround.

from procmail.log:

procmail: Program failure (-15) of "/etc/webmin/virtual-server/clam-wrapper.pl"
Time:* From:* To:* User:* Size:* Dest:/dev/null Mode:Virus
sh: line 1: 1203 Killed 
Status: Active

Comments

That's odd, as the reason we have that clam-wrapper.pl script in the config is to prevent exactly this failure mode.

Is there anything in the logs indicating why it is failing, or why the clamdscan command that it runs is failing?

The system ran out of memory, and oom-killer ran and killed clam. I had the dashboard open at the time, and the load average jumped to over 50 before it stopped responding. From the logs it looks like clam grew to several GB before it consumed all the available memory. It isn't a low-memory install: there's swap, and it's a low-use testing system. So far I haven't found any cause in the logs, or really anything abnormal besides running out of memory.

Wow - was there a particular email that triggered this?

The OOM killer just kills the largest process, which may not be the one that caused the lack of RAM if something else caused hundreds of other small processes to be spawned.

I went through the rest of the old procmail logs and found it had happened once before. The corresponding messages log shows that oom-killer killed clam both times. One of the emails was a short message from a colleague and the other was an order confirmation from a major retailer, so neither was big or atypical and neither should have set off clam. The process list dumped to the logs after running out of memory doesn't show anything but the expected processes: not hundreds of anything, and nothing else using any significant memory.

Looking at the original message, it looks like the script clam-wrapper.pl was killed, which is unusual as it's a very small Perl script that exists only to exit cleanly if the clamscan command dies.

Could something else have killed it, other than a lack of memory?

No, I don't think something else is killing clam. The system is a stock Virtualmin install. I wanted to wait a few days before responding to see if anything else happened, but with virus scanning disabled the system hasn't run out of memory. There have been no changes to the system other than disabling virus scanning, and no changes to usage.

I've done some testing and I think I understand what's happening, just not the why yet:

When clam itself gets terminated, clam-wrapper.pl works as intended and returns an exit code of 0 to procmail. When clam-wrapper.pl itself is terminated, an exit code of 143 (SIGTERM) or 137 (SIGKILL) is passed to procmail. The procmail recipe interprets any non-zero return as a virus found and sends the message to /dev/null.
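
For readers wondering where 143 and 137 come from, here is a minimal Python sketch of the usual shell convention, in which a process killed by signal N is reported as exit code 128 + N. The `sleep` child is only a hypothetical stand-in for clam-wrapper.pl, and the helper names are made up for illustration; this is not how procmail or the wrapper is implemented.

```python
import signal
import subprocess


def run_and_signal(sig):
    """Start a throwaway child, kill it with `sig`, and return its status.

    Python reports a child killed by signal N as returncode -N.
    """
    # Hypothetical stand-in for clam-wrapper.pl: a child that only sleeps.
    child = subprocess.Popen(["sleep", "30"])
    child.send_signal(sig)
    return child.wait()


def shell_exit_code(returncode):
    """Translate Python's -N into the 128 + N form that a shell reports."""
    return 128 - returncode if returncode < 0 else returncode


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGKILL):
        status = run_and_signal(sig)
        print(sig.name, status, shell_exit_code(status))
```

The "-15" in the procmail.log line above looks like the same signal number (15, SIGTERM) written in the negative form, although that reading of procmail's log format is an assumption.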

You're right that if clam-wrapper.pl is terminated, procmail will drop the message. What I'm wondering is why it would be terminated when it's a fairly small process that doesn't use much RAM.
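
As a rough illustration of the safer behaviour requested in the original report, the following Python sketch treats only the scanner's "virus found" status as a virus and routes every other failure somewhere recoverable. It assumes the scanner follows the exit codes documented for clamscan (0 = clean, 1 = virus found, 2 = error). The `classify` function and the "quarantine" action are hypothetical and are not part of Virtualmin or procmail.

```python
CLEAN = 0        # scanner ran and found nothing
VIRUS_FOUND = 1  # scanner ran and detected a virus


def classify(status):
    """Map a scanner or wrapper exit status to a delivery decision.

    Only an explicit VIRUS_FOUND is treated as a virus. Scanner errors and
    signal deaths (for example 143, 137, or a negative signal number) are
    quarantined so the mail can still be recovered.
    """
    if status == CLEAN:
        return "deliver"
    if status == VIRUS_FOUND:
        return "virus"
    return "quarantine"


if __name__ == "__main__":
    for status in (0, 1, 2, 137, 143, -15):
        print(status, classify(status))
```

With this mapping, a wrapper killed by the OOM killer would lead to a quarantined message rather than one delivered to /dev/null.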