We are running CentOS 6.2 on our web server. This is a request for your best thoughts on Disaster Recovery...
Excuse this long post, but I am copying an internal memo here to show where I currently stand in researching disaster recovery options. We recently had a file system corruption. The hard drive was OK, but the tech at GoGrid (the hosting service for our dedicated server in SF) had to run fsck on the box over and over again for 10 hours... it kept choking. We thought we were doomed when I finally got a phone call (I was asleep in Hawaii): "FSCK finally completed! I got a prompt!" Half asleep, I told him, "OK... enter 'webmin start' then 'httpd start'. Now... try www.himalayanacademy.com... are the sites back up?" Yes!
We were lucky... but it was too close to the edge. I set up a new daily backup of the /home directory and the MySQL databases (on drive 2 of the box) by FTP to our server here in Hawaii, using the Vicom FTP client (I don't know rsync), which has excellent intelligent mirror options, and that is working. The only thing is that I don't have any backup of the OS and application configuration layer: httpd.conf, ssh keys, /opt and other things on drive 1.
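To make the gap concrete, here is a minimal sketch of one way to capture that configuration layer: bundle the files into a dated tar archive and drop it somewhere under /home, so the existing daily FTP mirror carries it off the box. This is only an illustration under assumptions. The paths in CONFIG_PATHS are guesses based on the items named above (the real locations on the server may differ), and the function name and the destination directory /home/backup/config are hypothetical. It would need to run as root to read the ssh keys, and the archive then contains private keys, so it should be protected accordingly.

```python
import os
import tarfile
import time
from pathlib import Path

# Hypothetical list based on the items named in the post; adjust to the
# real locations on the server (these exact paths are an assumption).
CONFIG_PATHS = ["/etc/httpd/conf/httpd.conf", "/etc/ssh", "/opt"]


def archive_configs(paths, dest_dir, stamp=None):
    """Write a gzip-compressed tar of every path that exists.

    Returns (archive_path, skipped), where skipped lists the paths
    that were not found and therefore not archived.
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"config-{stamp}.tar.gz"
    skipped = []
    with tarfile.open(archive, "w:gz") as tar:
        for p in paths:
            src = Path(p)
            if src.exists():
                # Store paths relative to / so a restore can be unpacked
                # into a scratch directory and inspected first.
                tar.add(str(src), arcname=str(src).lstrip("/"))
            else:
                skipped.append(p)
    # The archive may hold private keys: restrict it to the owner.
    os.chmod(archive, 0o600)
    return archive, skipped


# Example call (hypothetical destination, inside the tree that is
# already mirrored every day):
# archive, skipped = archive_configs(CONFIG_PATHS, "/home/backup/config")
```

Run from a daily cron job shortly before the FTP mirror starts, something like this would put a copy of the drive 1 configuration next to the data that is already being backed up. It does not replace a full system image, and old archives would need to be pruned by hand or by a separate job.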
If you have time to read the following and offer ideas... I would appreciate some direction from a "strategy" point of view. I spent hours talking with Amazon Web Services and frankly, in the end, walked away from them because the complexity of their framework and the cost/time to be in a "fail-safe" ready state were way too high... and they don't offer actual engineering support... just "consulting" support. I'm looking into other options now.
Note: "Varuna" is the name of our in-house OS X server, with terabytes of space, that is backed up to another hard drive daily and sent to Iron Mountain every month. "PK" is our business team, who look at the web services differently than I do. For our team it is about blogs, media, free publications -- downtime forgivable. For them it is about financial transactions -- downtime not so forgivable.
---------- INTERNAL MEMO TO OUR TEAM (nothing particularly proprietary here) ----------
I've been doing a lot of work and research on securing our web server and have reached a point where we need to make some new decisions... we can talk about it when everyone gets home.
See key questions to think about in bold, large type.
Keep this memo for future reference. I would like to schedule an initial meeting to review this in person with the PK when they are free... let me know when that might be.
Note this discussion is to cover a projected future event that may never happen. So the first question is:
1) How much time and money do we want to spend working on, testing and maintaining fail-safe systems that may never be used, or may only be used once in a blue moon, and who is responsible?
We recently came very close to the edge with a 10-hour downtime and were on the precipice of a file system failure that would have required setting up a new server. So this made me look into this more closely.
I spent quite a bit of time recently doing a major upgrade to our web backup systems -- a new FTP client on Varuna (setup, testing, scanning logs, etc.) that reliably backs up our web data every day on a schedule, including the MySQL databases.
We do not have backups of the OS and low-level application layer configuration files (/opt, /monetra/, httpd.conf, ssh, etc.). But we have good documentation for these, enough to set up a new server and restore to a near "before the event" state with very little loss of mission-critical data. It would just take time -- our current RTO (Recovery Time Objective, see below) is somewhere in the vicinity of 3-5 days.
For a non-profit ("we give away everything for free...") -- TAKA, SSC/Academy site, Hinduism Today, Himalayan Academy, etc. -- this RTO (3-5 days) is acceptable. We just tell the world, "Our server died. We will be back up in a few days."
I have set up new web servers 8-10 times in the past (since 1995) and I have no problem taking responsibility for recovery if the acceptable RTO is 3-5 days. If that is not an acceptable RTO then we need to talk...
So, the second question is:
2) Is a web server RTO of 3-5 days acceptable to everyone?
No need to reply by email. The options range from a "zero fault tolerance" setup (RTO: back up in less than an hour) to the current RTO of 3-5 days, and each scenario differs in cost and in the time needed to administer the fail-safe framework on an ongoing basis... best discussed once at a meeting.
The world of Disaster Recovery (hard disk failure, data center blows up) has its own jargon.
Two important terms:
1) Recovery Time Objective (RTO) -- how long are you content to wait until the web services are back up? -- hopefully equals how long it will really take you to get web services back up.
2) Recovery Point Objective (RPO) -- the point in time you need to be able to get back to, with transactions recoverable. It is typically an issue where, e.g., POINT A is initiating transactions by sending data to POINT B, then POINT B dies but POINT A continues, and transactions are lost. We may not have this problem. In our initial discussions we focus only on RTO.
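As a rough illustration of how the two terms turn into numbers, the sketch below compares what a backup schedule can deliver against a stated objective. It is only an assumption-laden example: the function names are made up for illustration, the one-hour transfer time is a placeholder, and the only figures taken from this memo are the daily backup and the 3-5 day and 1 hour recovery times.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration):
    """Upper bound on lost data if the server dies just before the next
    backup finishes: one full interval plus the time the copy takes."""
    return backup_interval + backup_duration


def meets_objective(actual, objective):
    """True when what we can actually deliver is within the objective."""
    return actual <= objective


# Daily backup as described in this memo; the 1-hour transfer time is
# a placeholder assumption, not a measured value.
loss = worst_case_data_loss(timedelta(days=1), timedelta(hours=1))

# Recovery time: up to 5 days to rebuild by hand today, compared with
# the 3-5 day objective and with a 1-hour objective.
rebuild_time = timedelta(days=5)
ok_for_free_sites = meets_objective(rebuild_time, timedelta(days=5))
ok_for_one_hour = meets_objective(rebuild_time, timedelta(hours=1))

print("worst-case data loss:", loss)
print("meets 3-5 day RTO:", ok_for_free_sites)
print("meets 1 hour RTO:", ok_for_one_hour)
```

The point of the arithmetic is that a once-a-day backup can lose up to about a day of changes no matter how quickly the server itself is rebuilt, so the recovery point and the recovery time have to be decided separately.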
My research is ongoing. Moving our web sites to a cloud service may be a good solution. We might get to a simple, easily implemented RTO of 1 hour without a lot of time, fuss or ongoing admin... this is the ideal.
But the annual fee for web services might go up substantially. I will have definitive numbers when we get together.