About the downtime on 10/27

#1 Fri, 10/28/2016 - 05:11
Joe

Howdy all,

I just wanted to fill folks in on the downtime our primary web server experienced yesterday.

Our colo performed unscheduled maintenance on the routers that serve our primary web server. That caused a brief outage (less than an hour, as I understand it), but once the maintenance finished, our server didn't regain network connectivity. Our colo changed ownership a couple of years ago, and a few months ago they took their old support portal out of service, so we had a bit of confusion over how to get support, which added some time to the process. They were also experiencing a higher than normal support volume that day (possibly due to the network outage), so it took several hours to get access to a KVM so we could see why our server wasn't responding.

Long story short, we were offline for about five hours. software.virtualmin.com was not affected directly, but license checks are performed against the database on virtualmin.com, so only the GPL repos would have functioned fully (there is caching of that data on software.virtualmin.com, so I suspect some requests would have worked, but not all).

This gave us a good reason to check our backups (which are good, hooray!). I'll be spending some time analyzing the logs and testing to see if I can figure out why the network just stopped responding and never recovered, as that seems likely to be a problem that'll bite us again in the future.

Oddly, we also had a (brief) outage of software.virtualmin.com the day before; it was an unrelated issue (disk failure in that case). So, it's been a bad couple of days for our servers, and for my sense of calm. At least our backup and redundancy plans are proving themselves to be working as well as can be expected, given the limits of our budget, network, and time: we lost no data in either outage.

So, take this as a recommendation to check your backups and RAID or LVM arrays before you need them! Also, use SMART to check the health of your disks now and then, preferably automatically. That's one precaution we did not have in place, and I suspect it would have warned us about the oncoming failure on the system that hosts software.virtualmin.com. We'll be setting that up across all of our Cloudmin host systems, to help ensure we get some warning before the next disk failure.
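For anyone who wants a starting point for automating the SMART check, here's a minimal sketch in Python. It is not what we run. It assumes smartmontools is installed, that `smartctl -H` is run with enough privileges to query the disk, and that the health line looks like what smartctl typically prints for ATA or SCSI drives. The device list and the function names are hypothetical placeholders; adjust them for your own systems.

```python
import subprocess
from typing import Optional


def parse_smart_health(output: str) -> Optional[bool]:
    """Return True if the disk reports healthy, False if not, None if no verdict was found."""
    for line in output.splitlines():
        # ATA/SATA drives: "SMART overall-health self-assessment test result: PASSED"
        if "overall-health self-assessment test result" in line:
            return line.split(":", 1)[1].strip() == "PASSED"
        # SCSI/SAS drives: "SMART Health Status: OK"
        if line.startswith("SMART Health Status"):
            return line.split(":", 1)[1].strip() == "OK"
    return None


def check_disk(device: str) -> Optional[bool]:
    """Run `smartctl -H` on one device and parse its health verdict."""
    try:
        result = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True,
            text=True,
            check=False,  # smartctl uses non-zero exit codes to flag problems
        )
    except OSError:
        # smartctl is missing or not executable; we cannot say anything about the disk
        return None
    return parse_smart_health(result.stdout)


def main() -> None:
    # Hypothetical device list; replace with the disks on your own host.
    for device in ["/dev/sda", "/dev/sdb"]:
        status = check_disk(device)
        if status is None:
            print(f"{device}: no SMART verdict (missing smartctl, permissions, or unsupported disk)")
        elif status:
            print(f"{device}: healthy")
        else:
            print(f"{device}: FAILING, replace this disk soon")


if __name__ == "__main__":
    main()
```

Run from cron, something like this could mail you when a disk stops reporting healthy. The smartd daemon that ships with smartmontools is the more usual way to get continuous monitoring, so treat the script as an illustration of the idea rather than a replacement for it.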

Thanks for your patience during the outage.

Cheers,

Joe