Postfix stops accepting mails after one of last virtualmin upgrades on servers with multiple IP addresses with SSL (bad master.cf config file upgrade for smtps)

Howdy team,

This is a bug report for a nasty bug that had severe consequences for us on top of those special times, as we couldn't detect the issue despite monitoring our servers with Virtualmin and with external probes:

I had this Webmin/Virtualmin upgrade bug on two servers now, resulting in loss of incomming email messages for days. The first one was around 2 weeks back, and the second one had the same issue this week, so as it had really severe consequences, I'm reporting it here:

The /etc/postfix/master.cf file automatic upgrade to adjust SMTPS certificates happens automatically during a version upgrade, but it makes the file buggy, resulting in Postfix not working anymore without any error message, resulting in days of refused mails, hanging then on secondary servers.

Here the change that the last upgrade(s) made (i have /etc under git): Take a look at the last line added automatically by the upgrade:

diff --git a/postfix/master.cf b/postfix/master.cf
index 31ea370..10b95f9 100644
--- a/postfix/master.cf
+++ b/postfix/master.cf
@@ -19,7 +19,7 @@
 #  -o smtpd_sasl_auth_enable=yes
 #  -o smtpd_client_restrictions=permit_sasl_authenticated,reject
 #  -o milter_macro_daemon_name=ORIGINATING
-smtps  inet    n       -       -       -       -       smtpd -o syslog_name=postfix/smtps -o smtpd_tls_wrappermode=yes -o smtpd_sasl_auth_enable=yes -o smtpd_client_restrictions=permit_sasl_authenticated,reject -o milter_macro_daemon_name=ORIGINATING
+80.88.152.233:smtps    inet    n       -       -       -       -       smtpd -o syslog_name=postfix/smtps -o smtpd_tls_wrappermode=yes -o smtpd_sasl_auth_enable=yes -o smtpd_client_restrictions=permit_sasl_authenticated,reject -o milter_macro_daemon_name=ORIGINATING
 #628       inet  n       -       -       -       -       qmqpd
 pickup    fifo  n       -       -       60      1       pickup
 cleanup   unix  n       -       -       -       0       cleanup
@@ -115,3 +115,6 @@ smtp-125660303823614        unix    -       -       -       -       -       smtp -o smtp_bind_address=80.88.158.166
 80.88.158.162:submission       inet    n       -       -       -       -       smtpd -o smtpd_sasl_auth_enable=yes -o smtpd_tls_cert_file=/home/ns1/ssl.cert -o smtpd_tls_key_file=/home/ns1/ssl.key -o smtpd_tls_CAfile=/home/ns1/ssl.ca -o smtpd_tls_security_level=may -o myhostname=ns1.myotherdomain.com
 80.88.158.166:submission       inet    n       -       -       -       -       smtpd -o smtpd_sasl_auth_enable=yes -o smtpd_tls_cert_file=/home/mydomain/ssl.cert -o smtpd_tls_key_file=/home/mydomain/ssl.key -o smtpd_tls_CAfile=/home/mydomain/ssl.ca -o myhostname=mail.mydomain.com
 
+80.88.158.162:smtps    inet    n       -       -       -       -       smtpd -o syslog_name=postfix/smtps -o smtpd_tls_wrappermode=yes -o smtpd_sasl_auth_enable=yes -o smtpd_client_restrictions=permit_sasl_authenticated,reject -o milter_macro_daemon_name=ORIGINATING -o smtpd_tls_cert_file=/home/ns1/ssl.cert -o smtpd_tls_key_file=/home/ns1/ssl.key -o smtpd_tls_CAfile=/home/ns1/ssl.ca -o smtpd_tls_security_level=may -o myhostname=ns1.myotherdomain.com
+127.0.0.1:smtps        inet    n       -       -       -       -       smtpd -o syslog_name=postfix/smtps -o smtpd_tls_wrappermode=yes -o smtpd_sasl_auth_enable=yes -o smtpd_client_restrictions=permit_sasl_authenticated,reject -o milter_macro_daemon_name=ORIGINATING
+smtps  inet    n       -       -       -       1       postscreen -o smtpd_tls_wrappermode=yes

It results in Postfix starting and running (looks like AT the NEXT reboot only, as it looks like it's not restarted on webmin/virtualmin upgrade), but NOT accepting some mails, but not all, but not accepting them especially on the localhost, which is dramatic for e.g. a ticket system or registration mails. And what's even more dramatic that it shows running fine, and no warnings, errors or notifications are visible.

However, by looking at the systemctl status postfix.service, this last line is visible:

Nov 12 22:03:42 mu33.cbpolis.com postfix/master[28593]: fatal: bind 0.0.0.0 port 465: Address already in use

Commenting out the last line that got added by error (?) makes the problem gone.

What was the purpose of that last line ? What should it have been ?

Status: 
Active
Virtualmin version: 
6.13
Webmin version: 
1.960

Comments

Ilia's picture
Submitted by Ilia on Fri, 11/13/2020 - 04:02

Status:
Active
»
Closed (works as designed)

Hi,

What was the purpose of that last line ? What should it have been ?

I would like to assure you that Virtualmin doesn't add any mentioned lines. It's rather a coincidence.

Status:
Closed (works as designed)
»
Active

Hello Ilia, Thanks for your reply, but I am 100.00% certain that I didn't add those lines. And 100% certain that I don't have any other management tool than Virtualmin on our servers.

And I'm 99.999% certain that either Virtualmin itself (or a program that Virtualmin manages directly) added the mentioned lines automatically during or shortly after the recent update of Virtualmin that added support for individual SSL certificates per IP address for SMTP. Except last line, all other lines added seem correctly added to support that new Virtualmin feature.

And looking more closely, the last line is not really added, but changes the first line that is removed in the diff. So I would say, that Virtualmin (or a program it manages directly) attempted to adapt that existing line to the new SMTP SSL framework it uses.

So imho, it does not "work as designed", if "as designed" means not touching that file, or adding that last line that makes Postfix run but not handle mails. That's why I am re-opening this.

That said, please note that I am not accusing you here, just trying to help (you can check my other bug reports to see that I am here to help you fix bugs and improve Virtualmin).

I am now more than fully aware of this bug and have checked Postfix and that file on allof our few servers, so for me it's not an issue anymore. However, I have reported this issue in the interest of you guys and other customers with multiple IPs that might run into that issue too. So please feel free to handle it like you think is best for you and your users. :-) Cheers!

Ilia's picture
Submitted by Ilia on Fri, 11/13/2020 - 09:17

Thank you.

How did the config look exactly before that change (I assume you clicked Copy to Postfix button from SSL Configuration page?

The pasted part is the "git diff", so shows exactly the changes.

No, I didn't click "Copy to Postfix", nor added any other domain to that server. Only thing that happens automatically regarding SSL certificates is the automatic updates from Virtualmin to the Let's Encrypt certificates.

So the file change was made by Virtualmin either at the upgrade or shortly after during an automated Let's Encrypt renewal, entirely managed by Virtualmin.

When I search in Webmin action log for actions in November that changed file /etc/postfix/master.cf I have following result:

 Logged actions between 01/Jan/2020 and 13/Nov/2020 ...

Action      Module      User        Client Address      Date        Time   
Ran post-install script     Virtualmin Virtual Servers  -   -   2020/11/05  17:32
Ran post-install script     Virtualmin Virtual Servers  -   -   2020/11/05  17:31

However, that file change is unfortunately not logged there.

And Let's encrypt certificate renewals don't log anything.

Finally to be complete, I changed our DMARC BIND record in one of the domains end of October.

But the issue only appeared this week on the 2nd server after a server reboot (and same on the other server a few weeks back, after a reboot), when postfix started but didn't work du to this warning of port already used due to the last line (a systemctl restart postfix.service does NOT return an error, that error is only visible with a systemctl status postfix.service).

Do you need the complete exact master.cf file of before the boggus change visible above in my git diff ?

Ilia's picture
Submitted by Ilia on Sat, 11/14/2020 - 10:59

Do you need the complete exact master.cf file of before the boggus change visible above in my git diff ?

Yes, please.

I have attached to this bug ticket the /etc/postfix/master.cf file that worked before the Virtualmin upgrade and reboot (I have only anonymized consistently the IP addresses and domain-names in it).

I had the same experience.

I did a number of updates on Nov 15, 2020, including an update to webmin-1.960-1.noarch.

On Nov 17 postfix did a "refresh" right after a Let's Encrypt certificate renewal. Postfix failed to restart. Attempts to startup thereafter failed.

When I realized that Postfix was not running, I found error messages like:

postfix/master[5195]: fatal: bind 0.0.0.0 port 465: Address already in use

When I looked at my postfix master.cf I found several lines at the bottom. I do not know how many of them were there before. I know this last one was not:

smtps inet n - n - 1 postscreen -o smtpd_tls_wrappermode=yes

When I commented that line out I was able to restart postfix.