Restoring MySQL and email users for 2 domains causes a suexec settings mismatch on all other existing domains

I had a major outage of our main sites today: they all returned error 500 (FCGID mode).

I don't know exactly when this happened, but it most probably happened when I did a last-minute partial backup and restore, just before switching servers, to quickly synchronize databases and email (and it didn't synchronize the email folders!). The backup and restore covered:

  • mysql data
  • Email users

for just 2 virtual servers out of 15...

For most of the virtual servers, the Apache suexec configuration no longer matched the correct user IDs AND group IDs:

E.g. this part:

<VirtualHost 80.11.22.33:80>
    SuexecUserGroup "#1036" "#1015"

should have been:

<VirtualHost 80.11.22.33:80>
    SuexecUserGroup "#1010" "#1003"

So suddenly most of the 15 virtual servers stopped working (13 of them, actually: not the 2 that got imported, but all the others!).

Also, this discrepancy wasn't caught by the "Check config" checks.
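A mismatch of this kind can be looked for with a short script. The following is only a minimal sketch, not Virtualmin's own check: it assumes that each `<VirtualHost>` block contains a `DocumentRoot` directive and that the document root is owned by the domain's Unix user and group. The names `find_mismatches`, `stat_owner` and `owner_of` are made up for this example.

```python
import os
import re

# One <VirtualHost ...> ... </VirtualHost> block: address and body.
VHOST_RE = re.compile(
    r'<VirtualHost\s+([^>]+?)\s*>(.*?)</VirtualHost>',
    re.IGNORECASE | re.DOTALL,
)
# Numeric form only, e.g.: SuexecUserGroup "#1036" "#1015"
SUEXEC_RE = re.compile(
    r'^\s*SuexecUserGroup\s+"?#(\d+)"?\s+"?#(\d+)"?\s*$',
    re.IGNORECASE | re.MULTILINE,
)
DOCROOT_RE = re.compile(
    r'^\s*DocumentRoot\s+"?([^"\n]+?)"?\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def stat_owner(path):
    """Return the (uid, gid) that owns a path on this system."""
    st = os.stat(path)
    return st.st_uid, st.st_gid


def find_mismatches(conf_text, owner_of=stat_owner):
    """Compare each vhost's numeric SuexecUserGroup with its docroot owner.

    Assumption: the document root is owned by the domain's user and group.
    Vhosts without a numeric SuexecUserGroup or a DocumentRoot are skipped.
    Returns a list of (address, docroot, configured_ids, actual_ids).
    """
    problems = []
    for addr, body in VHOST_RE.findall(conf_text):
        suexec = SUEXEC_RE.search(body)
        docroot = DOCROOT_RE.search(body)
        if not suexec or not docroot:
            continue
        configured = (int(suexec.group(1)), int(suexec.group(2)))
        actual = owner_of(docroot.group(1))
        if configured != actual:
            problems.append((addr, docroot.group(1), configured, actual))
    return problems
```

To use it, read the Apache configuration file into a string and pass it to `find_mismatches`; the `owner_of` parameter exists so the comparison can be tried out without touching the real filesystem.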

Status: Active

Comments

So are you sure that the backup and restore caused this? If you just restored MySQL DBs and email users, Apache settings shouldn't have been touched at all ..

You can detect these kinds of issues with Virtualmin's validation feature, under Limits and Validation on the left menu.

As I said, I'm not 100% sure. I checked all domains carefully on the new host while preparing it, using its DNS server directly. Then I restored just the latest databases and email users (the latter on the wrong assumption that it would ALSO back up and restore the mailboxes). After that I changed the IP in /etc/network/interfaces, shut down the old server, rebooted the new one, changed the IP addresses in Virtualmin Pro, and tried the domains, which returned error 500.

So within all of this the only thing that could have modified userid/groupid of users, without changing the Apache settings would be the restore of database and Email users above, imho.

But as I said, I might be wrong, as I had various other issues to deal with for the main domains (like HTTP working fine on port 80 but HTTPS not working on port 443, and some other networking-related troubles). But none of them should have renumbered users or groups.

Really strange.

Anyway, I have now manually edited the Apache files, and all domains are working again.

Btw, while at error 500, your default setting for server headers in Apache displays all revision details and all installed Apache modules, which is not recommended security-wise. My preferred default setting is "Product only". I change that systematically, so no bug here, just a default setting that isn't great security-wise (though of course great debug-wise).
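For reference, the header detail level is controlled in Apache by the `ServerTokens` directive, whose default is `Full`; the "Product only" choice presumably corresponds to `ServerTokens Prod` (also spelled `ProductOnly`). The sketch below checks a configuration text for that setting. It makes two simplifying assumptions: the last `ServerTokens` line in the text is the effective one, and included files are not followed. The function names are invented for this example.

```python
import re

# Uncommented ServerTokens lines only; "# ServerTokens Prod" is ignored.
SERVER_TOKENS_RE = re.compile(
    r'^[ \t]*ServerTokens[ \t]+(\S+)',
    re.IGNORECASE | re.MULTILINE,
)


def server_tokens_setting(conf_text):
    """Return the last ServerTokens value found, or Apache's default 'Full'."""
    values = SERVER_TOKENS_RE.findall(conf_text)
    return values[-1] if values else "Full"


def is_product_only(conf_text):
    """True if the configuration text sets ServerTokens to Prod/ProductOnly."""
    return server_tokens_setting(conf_text).lower() in ("prod", "productonly")
```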

So is it possible that the Apache settings were wrong before you did the final restore of databases and mailboxes? I could believe that there is a Virtualmin bug that causes the Apache restore to get the wrong UID and GID in the case where they are re-allocated on the new system ..

Yes, UIDs and GIDs were reallocated on the new system, as we started with a global mass restore from a single TAR file (see my other bug report for that).

Just before the final partial restore, the sites on the new system worked fine.

The strange thing is that the partial restore did not cover all sites, and did not include the web server settings at all. It covered only the MySQL databases and email/FTP users for 2 domains.

And it's all the other domains, not the 2 transferred ones, that got bad UIDs / GIDs in the Apache conf files.

Thanks for the "Check virtual servers" hint. E.g. I get there:

Apache website : SuExec user is set to #1035, but the virtual server's UID is 1031

But there is no way to quickly fix it from there.

The only fix unfortunately is to manually edit httpd.conf to set the correct UIDs as mentioned in the validation report.
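If many virtual hosts are affected, the manual edit could be scripted. This is only a rough sketch under stated assumptions: the `fixes` mapping from wrong to correct (UID, GID) pairs is compiled by hand (for example from the validation report and /etc/passwd), only the numeric `"#uid" "#gid"` form is handled, and the function `rewrite_suexec` is made up for this example. It returns new text and never writes to any file.

```python
import re

# Numeric form only, e.g.: SuexecUserGroup "#1036" "#1015"
SUEXEC_LINE_RE = re.compile(
    r'^([ \t]*SuexecUserGroup[ \t]+)"?#(\d+)"?([ \t]+)"?#(\d+)"?',
    re.IGNORECASE | re.MULTILINE,
)


def rewrite_suexec(conf_text, fixes):
    """Replace wrong SuexecUserGroup IDs using a hand-made mapping.

    fixes maps a wrong (uid, gid) pair to the correct (uid, gid) pair.
    Lines whose pair is not in the mapping are left untouched.
    Returns (new_text, list of (old_pair, new_pair) actually changed).
    """
    changed = []

    def repl(match):
        old = (int(match.group(2)), int(match.group(4)))
        new = fixes.get(old)
        if new is None:
            return match.group(0)
        changed.append((old, new))
        return '%s"#%d"%s"#%d"' % (match.group(1), new[0], match.group(3), new[1])

    return SUEXEC_LINE_RE.sub(repl, conf_text), changed
```

With the pair from the original report, the mapping would be `{(1036, 1015): (1010, 1003)}`. It seems safest to apply the result to a copy of the configuration and run `apachectl configtest` before reloading Apache.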

I'm at a loss as to how this could have been broken by the second restore though, as like you said it only touched a few things. Perhaps the first restore was the one that got it wrong?

I doubt the first restore got it wrong, as I checked each virtual server using the DNS of the new server, which had a different IP address. That DNS was disconnected from our DNS cluster, so it didn't update the official addresses during the tests.

So unless my local FreeBSD/OS X system didn't clear its caches on the DNS change, which I doubt, as I saw the virtual servers of the new server (dynamic content such as user counts, and no online users, indicated that it was not the old, still-running server), I don't see what else either messed up the Apache settings OR recreated the users and groups.

My first suspect is still the partial restore of email/FTP users, which, as written in the log, recreates the users (and maybe assigns new UIDs/GIDs, but probably doesn't update the Apache settings?).

The mailbox user restore does re-create users, but only mailboxes .. not the Unix user for the domain owner.

Can you check if the UID for the domain's Unix user was changed by the second restore?
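One way to answer this without repeating the restore on a live server is to compare two copies of /etc/passwd. The sketch below assumes that a copy from before the second restore is still available somewhere (for example in a system backup), which may not be the case here; the function names are invented for this example.

```python
def parse_passwd(text):
    """Map username -> (uid, gid) from /etc/passwd-style text."""
    ids = {}
    for line in text.splitlines():
        fields = line.split(":")
        # Expected layout: name:password:uid:gid:gecos:home:shell
        if len(fields) >= 4 and fields[2].isdigit() and fields[3].isdigit():
            ids[fields[0]] = (int(fields[2]), int(fields[3]))
    return ids


def changed_ids(before_text, after_text):
    """Return users present in both snapshots whose UID or GID differs.

    The result maps username -> ((old_uid, old_gid), (new_uid, new_gid)).
    """
    before = parse_passwd(before_text)
    after = parse_passwd(after_text)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
```

An empty result for the domain owners would suggest that the Unix users were not renumbered and that the Apache side changed instead.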

I don't have a spare server for trials right now, and don't want to risk another outage on a live server.

How large is your backup file? I could try a restore on one of my test systems ..

Unfortunately, it's many gigabytes, and also our local data privacy protection laws don't allow us to send user data out.

That's a pity .. there isn't really much I can do to debug this then, unless you can create a small test domain and reproduce the problem by backing up and restoring that.