Virtualmin defer backups until mountpoint available [#27889]

Submitted by aitte on Thu, 06/06/2013 - 07:54

We use a mounted NFS share for backups for the highest efficiency. Why? Because both the FTP and SSH/SFTP modes first create a temporary tar file on the local filesystem of the Virtualmin server and then upload it to the backup server. This means that any backup involves a lot of work: "read data to be backed up + compress + write to tmp file, then read tmp file + write to destination server". By mounting the backup storage as a local folder instead, we cut out all of that extra work and write straight to the destination.

So far so good. The very serious issue:

/mnt/backups on the Virtualmin system is a folder that's actually mounting an NFS share.
Sometimes, the backup server might be down for maintenance, thus disabling the NFS share, meaning that /mnt/backups goes back to being just a regular folder on the / filesystem (of the Virtualmin server) again.
If Virtualmin tries to perform a backup while the NFS share is unmounted, it will 1) think that all old backups are deleted since the /mnt/backups folder is now "empty", and 2) begin writing the new backup to the folder on the local / filesystem.
This is very, very bad! Not only does it risk filling up the local filesystem with backup files, but more seriously, the next time /mnt/backups mounts the NFS filesystem again, the old NFS contents of /mnt/backups will reappear, and any new backups that Virtualmin made while the NFS share was down will "vanish", since they sit on the local filesystem underneath the mount.

Idea for a solution:

A checkbox in the "Backup Schedule" dialog of Virtualmin, where instead of typing "Backup destinations: Local file or directory = /mnt/backups", you choose "Backup destinations: Mounted filesystem = /mnt/backups".
In this mode, before doing any backups, Virtualmin first runs the "mount" command to ensure that /mnt/backups has some filesystem mounted on it. If it doesn't, the backup is deferred until later, the Virtualmin admin is emailed an error report saying that the backup server is down, and a cron job is scheduled to retry later.
It retries either infinitely or a set number of times, possibly with staggered intervals like first trying in 5 minutes, then if that fails wait another 20 minutes, then if that fails wait another 40 minutes, then 60 minutes, etc. This is the same strategy used by Postfix when trying to deliver emails to non-responding servers, and is very efficient, by first assuming that the server might be down briefly and therefore trying tight intervals, but then gradually slowing down when it becomes clear that the downtime will be long.
Note: The /mnt/backups location of course has to be mounted using "autofs" for this to work, so that it reconnects automatically. Otherwise it will stay unmounted even after the server has come back online, and the deferred backup will never succeed.
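
The mount check described above could be sketched roughly like this. It reads a mounts table in the /proc/mounts format rather than parsing the output of "mount". The function name nfs_mounted_at is hypothetical, and the filesystem-type test is an extra assumption, meant to avoid counting a possible automounter placeholder entry at the same path as a real NFS mount. Paths containing spaces are not handled.

```shell
# Succeeds only if an NFS filesystem is mounted exactly at the given path.
# Arguments: path to check, optional mounts table (defaults to /proc/mounts).
# Table columns are: device, mount point, filesystem type, options, ...
nfs_mounted_at() {
  local path="$1" table="${2:-/proc/mounts}"
  awk -v p="$path" '
    $2 == p && $3 ~ /^nfs/ { found = 1 }   # matches nfs and nfs4
    END { exit !found }                    # exit 0 only when a match was seen
  ' "$table"
}
```

For example, `nfs_mounted_at /mnt/backups || echo "backup server is down"` would report the failure case.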

The deferred processing is the most difficult part of this idea. Running the "mount" command to validate the mountpoint before proceeding and on failure deferring+emailing admin is easy - ensuring that the mountpoint is set up with "autofs" so that it'll actually be coming back online is easy - but actually rewriting how Virtualmin backups are scheduled to allow for deferred processing is tougher, mainly because of the need to avoid pileups in case "the next scheduled backup is hit while another earlier backup is still in the deferred queue". For that, it'd have to be more clever and only perform one of the deferred backups when it comes to each clashing scenario.

What are your thoughts? Any other ways this setup could be improved?

Status:

Active

Comments

Submitted by JamieCameron on Thu, 06/06/2013 - 11:38 Comment #1

We aren't likely to implement special handling for mount points any time soon - it makes too many assumptions about how a customer's system is set up.

However, you can define a command that gets run before a scheduled backup. This could be a shell script that does a mount, waits for a mount point to become available, and optionally exits with a non-zero status if the mount isn't working.

Submitted by aitte on Thu, 06/06/2013 - 13:54 Comment #2

"optionally exits with a non-zero status if the mount isn't working." - FANTASTIC! That is good enough. It'll be simple to write a shell script that checks if something is mounted there, or otherwise return 1 to abort the backup. Heck, it's even possible to send an email to the admin from that shell script in case of an error.

That solves absolutely everything apart from retries via deferred processing, but it's okay. I'd rather miss a few backups if the backup server is down at that exact time, than risk having backups written to the local filesystem again. :-) Too many people nearly fill their VMs as it is, so backups can't go to the local filesystem.
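
For illustration, a minimal sketch of such a pre-backup check with an email on failure. The names check_backup_mount, send_report and ADMIN_EMAIL are hypothetical, and it assumes that the util-linux "mountpoint" command and a mailx-style "mail" command are installed.

```shell
# Hypothetical settings; adjust to the local setup.
MOUNTPOINT="${MOUNTPOINT:-/mnt/backups}"
ADMIN_EMAIL="${ADMIN_EMAIL:-root@localhost}"

# Sends the text on stdin as an email with the given subject.
# Assumes a mailx-style "mail -s SUBJECT ADDRESS"; swap in any equivalent.
send_report() {
  mail -s "$1" "$ADMIN_EMAIL"
}

# Returns 0 if something is mounted on $MOUNTPOINT, otherwise emails the
# admin and returns 1 so that the backup can be aborted.
check_backup_mount() {
  if mountpoint -q "$MOUNTPOINT"; then
    return 0
  fi
  echo "$(uname -n): nothing is mounted on $MOUNTPOINT, backup aborted" |
    send_report "Backup destination unavailable"
  return 1
}
```

Used as the pre-backup command, the script would finish with `check_backup_mount || exit 1` so that Virtualmin sees the non-zero status.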

I wish there was a way to retry backups later, but hey I'll take the lesser of 2 evils.

Edit: Actually, come to think of it, what do you think of a special case such as "if pre-backup script returns 31337, defer the backup; if 0, perform backup, for any other value skip backup".
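
One practical caveat for a scheme like this: a process exit status is a single byte, so only 0 to 255 are usable, and "exit 31337" would reach the caller as 105 (31337 mod 256). A minimal sketch of the three-way decision, assuming a hypothetical defer status of 75 (borrowed from EX_TEMPFAIL in sysexits.h; Virtualmin attaches no such meaning to it) and a hypothetical helper named decide_action:

```shell
# Assumption: 75 (EX_TEMPFAIL in sysexits.h) stands in for "defer".
DEFER_STATUS=75

# Maps the pre-backup script's exit status to an action name.
decide_action() {
  case "$1" in
    0)               echo run ;;    # pre-command succeeded: back up now
    "$DEFER_STATUS") echo defer ;;  # pre-command asked for a later retry
    *)               echo skip ;;   # any other status: skip this backup
  esac
}
```

For example, `decide_action 75` prints "defer", while `decide_action 1` prints "skip".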

A scriptable process/abort/defer-queue like that would be a very powerful feature. But also one that not many users are likely to notice, except the power users. So I know your work is better put in other places...

Still, a queue might be as simple as something like:

loop through all scheduled backup jobs
   if time_for_this_job_to_run
     if NOT this_job_already_in_defer_queue (avoids the pileup scenario where a job runs again while an earlier copy of it is still deferred)
       if script_says_ok
           do backup now
       elseif script_says_defer_this_job
           add this job to defer-queue
       else
           skip this backup completely

With that setup, all that's needed then is a separate cronjob which scans the defer-queue and if it's time to retry a job, it re-executes it (including pre-command) and makes a new decision. Etc etc etc... Until it either exceeds maximum retries or succeeds.
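
As a rough sketch only (nothing like this exists in Virtualmin), the queue bookkeeping could be kept in a directory with one file per deferred job, holding its retry count. The names DEFER_DIR, MAX_RETRIES, is_deferred and handle_decision, the queue location, and the decision words "ok" and "defer" are all assumptions made for illustration.

```shell
# Hypothetical defer queue: one file per deferred job, holding its retry count.
DEFER_DIR="${DEFER_DIR:-/var/tmp/backup-defer-queue}"  # assumed location
MAX_RETRIES="${MAX_RETRIES:-5}"                        # assumed limit

# True if the job is already waiting in the queue. The scheduled run should
# then leave it alone, which avoids the pile-up case.
is_deferred() {
  [ -f "$DEFER_DIR/$1" ]
}

# Applies the pre-command's decision (ok, defer, or anything else) to a job
# and prints the resulting action: run, defer, give-up, or skip.
handle_decision() {
  local job="$1" decision="$2" count=0
  case "$decision" in
    ok)
      rm -f "$DEFER_DIR/$job"
      echo run
      ;;
    defer)
      mkdir -p "$DEFER_DIR"
      if [ -f "$DEFER_DIR/$job" ]; then
        count=$(cat "$DEFER_DIR/$job")
      fi
      count=$((count + 1))
      if [ "$count" -gt "$MAX_RETRIES" ]; then
        rm -f "$DEFER_DIR/$job"      # retries exhausted: drop the job
        echo give-up
      else
        echo "$count" > "$DEFER_DIR/$job"
        echo defer
      fi
      ;;
    *)
      rm -f "$DEFER_DIR/$job"
      echo skip
      ;;
  esac
}
```

The scheduled run would call handle_decision only when `is_deferred "$job"` is false, and the separate retry cronjob would call it for every job found in the queue.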

Submitted by JamieCameron on Thu, 06/06/2013 - 15:01 Comment #3

We don't really have a mechanism to defer a backup, since they are just scheduled by cron (or a cron-like system in Webmin). An alternative would be for your script to loop and sleep for up to X minutes to wait for the mount point to become available, and only exit when it is ready.

Submitted by aitte on Thu, 06/06/2013 - 15:18 Comment #4

Wow, that's actually a very good idea. As long as I retry for less time than it takes for the next scheduled backup job to start, there will never be a clash either. Heck, it's enough to retry for half an hour or so in my case.

You have fantastic ideas. This makes network backups very reliable.

Submitted by aitte on Thu, 06/06/2013 - 17:01 Comment #5

As a courtesy, here's the final script I created. It uses a staggered approach, waiting 30 seconds after the first failure, then 60 seconds, then 90 seconds, then 120 seconds, etc. It stops when it succeeds or when it has been trying for 1800 seconds in total (30 minutes).

I put it in /usr/local/libexec/is_backupserver_online.sh and use it as the pre-backup command in Virtualmin.

#!/bin/bash

# waits up to 30 minutes for the mountpoint to become available
MOUNTPOINT=/mnt/backups
MAX_WAIT_TIME=1800

# ...
ATTEMPTS=0
TOTAL_WAITED=0
until [ $TOTAL_WAITED -gt $MAX_WAIT_TIME ]; do
  (( ATTEMPTS++ ))
  
  # try to access the mountpoint (to trigger automount)
  ls "$MOUNTPOINT" > /dev/null 2>&1
  
  # check for successful mount
  if mountpoint -q "$MOUNTPOINT"; then
    # tell virtualmin to proceed with the backup
    exit 0 # success
  fi
 
  # calculate wait time (30 seconds longer after each attempt)
  let "WAIT_TIME = 30 * $ATTEMPTS"
  
  # make sure that won't put us over the max_wait_time
  let "TOTAL_WAITED_OVERSHOOT = $MAX_WAIT_TIME - ($TOTAL_WAITED + $WAIT_TIME)"
  if [ $TOTAL_WAITED_OVERSHOOT -lt 0 ]; then
    # note that this will give us a wait-time of 0 when we've reached the end
    let "WAIT_TIME += $TOTAL_WAITED_OVERSHOOT"
    # if we ended up with a ridiculously small final wait-time, it's not worth even doing it
    # this is how we avoid one last rapid retry, as well as how we break out of the loop at the end
    if [ $WAIT_TIME -lt 10 ]; then
      break # exit the loop
    fi
  fi
  
  # wait before next retry
  sleep $WAIT_TIME
  let "TOTAL_WAITED += $WAIT_TIME"
done

# optional: email an admin here, warning that the backup server was unreachable

# tell virtualmin to abort the current backup attempt
exit 1 # failure
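
The script's retry arithmetic (30 seconds multiplied by the attempt number, with the last wait trimmed to fit the time budget) can be checked without actually sleeping. The sketch below repeats the same calculation and prints each wait instead; print_wait_schedule is a hypothetical helper and not part of the script above.

```shell
# Dry run of the retry arithmetic: prints each wait instead of sleeping.
# Arguments: maximum total wait in seconds, step in seconds per attempt.
print_wait_schedule() {
  local max_wait="$1" step="$2"
  local attempts=0 total=0 delay remaining
  until [ "$total" -gt "$max_wait" ]; do
    attempts=$((attempts + 1))
    delay=$((step * attempts))
    remaining=$((max_wait - (total + delay)))
    if [ "$remaining" -lt 0 ]; then
      delay=$((delay + remaining))   # trim the wait to fit the budget
      if [ "$delay" -lt 10 ]; then
        break                        # too short to be worth another retry
      fi
    fi
    echo "$delay"
    total=$((total + delay))
  done
}
```

Calling `print_wait_schedule 1800 30` prints 30, 60, 90, 120, 150, 180, 210, 240, 270, 300 and a final trimmed 150: eleven waits adding up to exactly 1800 seconds, which means the mountpoint is checked twelve times before the script gives up.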

Submitted by JamieCameron on Thu, 06/06/2013 - 18:33 Comment #6

Cool, that looks reasonable.