KVM Guest Snapshot Won't Revert After Changes

I've been trying to get snapshots of my guests working without any luck. The GUI says the guest has been reverted, but when I log into the guest all of my changes are still present. It's as if the guest never reverted.

Host System Info:
- Cloudmin v9.1
- CentOS 7.2.1511
- KVM virtualization (installed from the CentOS repos)
- VMs stored in a Volume Group on the host
- KVM image is kvm-64-centos7.0-base, downloaded from Cloudmin

Steps taken:
- Create a KVM virtual machine using the Cloudmin image: 1024 MB RAM, 10 GB disk.
- Create a snapshot using the Cloudmin GUI (change to the guest system in the menu -> Resources -> Disk Snapshots). Snapshot ID is "testsnapshot1", sized at 20% of the 10 GB disk; the displayed snapshot size is 2 GB.
- SSH into the guest and update the OS using yum.
- Create a few test files in /root to check whether they are removed after the snapshot is reverted.
- Reboot the guest from the GUI.
- Revert the snapshot using the Cloudmin GUI (change to the guest system in the menu -> Resources -> Disk Snapshots). Before the revert, maximum usage on the snapshot shows about 73%. (A rough LVM-level equivalent of the snapshot and revert steps is sketched after this list.)
- When I SSH into the guest after the revert, all of the packages are still updated and the test files are still in /root. I was expecting that after a snapshot revert, all the updates and added files on the guest would be lost.
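For reference, my understanding (an assumption on my part, since I haven't confirmed exactly what Cloudmin runs under the hood) is that the snapshot and revert steps above correspond roughly to the following LVM commands, with placeholder names for the volume group and the guest's disk logical volume:

# create a 2 GB copy-on-write snapshot of the guest's disk LV (placeholder names)
lvcreate -s -L 2G -n testsnapshot1 /dev/MyVG/guest_disk_0

# "revert" = merge the snapshot back into its origin LV
lvconvert --merge /dev/MyVG/testsnapshot1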

Status: 
Active

Comments

Yes, reverting a snapshot should bring the VM back to the state it was in when the snapshot was taken.

Does this VM have only a single disk?

Yes, the guest has only one disk. It's 10 GB, formatted as ext3.

Ok .. and when you click the Revert button, does the snapshot disappear from the list?

Yes, the GUI reports it has successfully reverted the snapshot, and it disappears from the list. I don't know if this helps, but when I try to make another snapshot after the revert I get this error message:

Creating snapshot of system test1.cloudmin.cloud ..
.. failed : Snapshots of an origin that has a merging snapshot is not supported

/var/log/messages doesn't have any errors about the revert.

I did find out that if I revert the guest's snapshot and reboot the host, the guest machine completes its snapshot revert and all the changes are gone.

When you do this reversion to the snapshot, is the VM running or shut down?

I've tried it with the VM running (then rebooting it) and with the VM powered off. The outcome is the same either way (no snapshot revert).

Ok, I think the issue is that the snapshot rollback takes time to complete, but Cloudmin doesn't wait for it. And during that time, the VM's disks still appear as they were originally.

If you start a rollback, wait 30 minutes, and then boot the VM, is it rolled back as expected?
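In the meantime, running something like this on the host should show whether an LVM merge is still pending (going from memory of the lvs output, so double-check):

# list all LVs, including hidden/internal ones
lvs -a

While a merge is pending, the first character of the Attr column should be an upper-case O on the origin LV (origin with a merging snapshot) and S on the snapshot itself; once the merge finishes, the snapshot LV disappears from the list.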

I'm testing your suggestion now. I went back and dug through /var/log/messages after a few reboots and found some log entries that might help. I took out everything unrelated to LVM and added the dates/times when the reboots occurred:

Oct 18 11:49:27 8041rf1 lvm[28766]: dm_task_run failed, errno = 6, No such device or address
Oct 18 11:49:27 8041rf1 lvm[28766]: VG--Virtual_Machines-test1.cloudmin.com_0_testsnapshot1_snap disappeared, detaching
Oct 18 11:49:27 8041rf1 lvm[28766]: No longer monitoring snapshot VG--Virtual_Machines-test1.cloudmin.com_0_testsnapshot1_snap
Oct 18 13:34:19 8041rf1 lvm[28766]: Monitoring snapshot VG--Virtual_Machines-snap1.cloudmin.com_0_snap1333_snap
Oct 18 13:36:39 8041rf1 lvm[28766]: Snapshot VG--Virtual_Machines-snap1.cloudmin.com_0_snap1333_snap is now 83% full.
Oct 18 13:36:49 8041rf1 lvm[28766]: Snapshot VG--Virtual_Machines-snap1.cloudmin.com_0_snap1333_snap is now 95% full.
Oct 18 13:38:54 8041rf1 lvm[28766]: dm_task_run failed, errno = 6, No such device or address
Oct 18 13:38:54 8041rf1 lvm[28766]: VG--Virtual_Machines-snap1.cloudmin.com_0_snap1333_snap disappeared, detaching
Oct 18 13:38:54 8041rf1 lvm[28766]: No longer monitoring snapshot VG--Virtual_Machines-snap1.cloudmin.com_0_snap1333_snap
Oct 18 13:41:35 8041rf1 lvm[28766]: Monitoring snapshot VG--Virtual_Machines-snap2.cloudmin.com_0_snap2--1341_snap
Oct 18 14:04:15 8041rf1 lvm[28766]: dm_task_run failed, errno = 6, No such device or address
Oct 18 14:04:15 8041rf1 lvm[28766]: VG--Virtual_Machines-snap2.cloudmin.com_0_snap2--1341_snap disappeared, detaching
Oct 18 14:04:15 8041rf1 lvm[28766]: No longer monitoring snapshot VG--Virtual_Machines-snap2.cloudmin.com_0_snap2--1341_snap
Oct 18 14:16:51 8041rf1 lvm[28766]: Monitoring snapshot VG--Virtual_Machines-snap1.cloudmin.com_0_mysnapshot1416_snap
Oct 18 14:22:22 8041rf1 lvmpolld: W: #011LVPOLL: PID 30561: STDERR: '  snap1_cloudmin_com_img: Failed query for merging percentage. Aborting merge.'
Oct 18 14:22:22 8041rf1 lvmpolld[30558]: LVMPOLLD: lvm2 cmd (PID 30561) failed (retcode: 5)
Oct 18 14:24:42 8041rf1 lvmpolld: W: #011LVPOLL: PID 30932: STDERR: '  sandbox_cloudmin_com_img: Failed query for merging percentage. Aborting merge.'
Oct 18 14:24:42 8041rf1 lvmpolld[30929]: LVMPOLLD: lvm2 cmd (PID 30932) failed (retcode: 5)
Oct 18 14:24:42 8041rf1 lvmpolld: W: #011LVPOLL: PID 30934: STDERR: '  snap1_cloudmin_com_img: Failed query for merging percentage. Aborting merge.'
Oct 18 14:24:42 8041rf1 lvmpolld[30929]: LVMPOLLD: lvm2 cmd (PID 30934) failed (retcode: 5)
--------------Tue Oct 18 14:30 REBOOT-------------------
Oct 18 14:30:12 8041rf1 lvm: 2 logical volume(s) in volume group "centos_8041rf1" monitored
Oct 18 14:30:12 8041rf1 lvm: 2 logical volume(s) in volume group "centos_8041rf1" now active
Oct 18 14:30:14 8041rf1 lvm: 3 logical volume(s) in volume group "VG-Virtual_Machines" now active
Oct 18 14:30:14 8041rf1 lvm: Background polling started for 2 logical volume(s) in volume group "VG-Virtual_Machines"
Oct 18 14:30:29 8041rf1 lvmpolld: W: #011LVPOLL: PID 834: STDERR: '  WARNING: This metadata update is NOT backed up'
Oct 18 14:30:59 8041rf1 lvmpolld: W: #011LVPOLL: PID 832: STDERR: '  WARNING: This metadata update is NOT backed up'
Oct 19 20:46:03 8041rf1 lvm[6310]: Monitoring snapshot VG--Virtual_Machines-snaptest1.cloudmin.com_0_SnapshotAt2045_snap
--------------Wed Oct 19 21:04 REBOOT-------------------
Oct 19 21:04:50 8041rf1 lvm: 2 logical volume(s) in volume group "centos_8041rf1" monitored
Oct 19 21:04:51 8041rf1 lvm: 2 logical volume(s) in volume group "VG-Virtual_Machines" now active
Oct 19 21:04:51 8041rf1 lvm: Background polling started for 1 logical volume(s) in volume group "VG-Virtual_Machines"
Oct 19 21:04:52 8041rf1 lvm: 2 logical volume(s) in volume group "centos_8041rf1" now active
Oct 19 21:05:06 8041rf1 lvmpolld: W: #011LVPOLL: PID 706: STDERR: '  WARNING: This metadata update is NOT backed up'
Oct 21 13:36:50 8041rf1 lvm[9463]: Monitoring snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap
Oct 21 13:40:30 8041rf1 lvm[9463]: Snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap is now 80% full.
Oct 21 13:41:20 8041rf1 lvm[9463]: Snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap is now 85% full.
Oct 21 13:42:10 8041rf1 lvm[9463]: Snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap is now 91% full.
Oct 21 13:53:40 8041rf1 lvm[9463]: Snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap is now 95% full.

The guest was shut down when I did the revert, and I left it off for about an hour. I powered the guest up and all the changes were still there. After a reboot of the host, all the guest changes had been reverted.

Those messages, like this one:

Oct 18 14:24:42 8041rf1 lvmpolld: W: #011LVPOLL: PID 30934: STDERR: ' snap1_cloudmin_com_img: Failed query for merging percentage. Aborting merge.'

look very suspicious.

Can you post logs from the time the snapshot was created?

Jamie, sorry for the delay. The only items in the log file from the snapshot creation are the last five entries in the block I posted above:

Oct 21 13:36:50 8041rf1 lvm[9463]: Monitoring snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap
Oct 21 13:40:30 8041rf1 lvm[9463]: Snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap is now 80% full.
Oct 21 13:41:20 8041rf1 lvm[9463]: Snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap is now 85% full.
Oct 21 13:42:10 8041rf1 lvm[9463]: Snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap is now 91% full.
Oct 21 13:53:40 8041rf1 lvm[9463]: Snapshot VG--Virtual_Machines-st1.cloudmin.com_0_ss10212016--1336_snap is now 95% full.

The snapshot was created on Oct 21 at 13:36:50. I logged into the guest and updated the OS using yum. The updates completed successfully and the snapshot ended up about 96% used. I shut down the guest, reverted it, and waited about an hour. When I started the guest back up, the OS was still updated (the revert hadn't occurred). I shut down the guest and rebooted the host. When the host finished its reboot and KVM had come back up, I logged back into the guest and the OS was back to its un-updated state (the revert had occurred). I didn't see any further information in /var/log/messages about the revert or any LVM events. Thanks for your help on this, I know it's a tricky issue. Please let me know if there's any more information I can provide or if anything needs clarification.

Is the guest's filesystem perhaps mounted on the host system, or being accessed by some other process? That would prevent the snapshot from being fully reverted.

No. The filesystem is not being used by anything else. Not sure if this helps, but we use the "Create Disk Images in: LVM Volume Group" option in the "KVM Host Settings" menu, and all our guests are in individual Logical Volumes in one large Volume Group.

Not to jump in, but I get the same thing: the revert fails. I never tried rebooting the main host. My virtual machine storage is on the same machine as the host, but on SSD drives dedicated to LVM only. There are several separate Logical Volumes on the SSD array, which were created by Cloudmin; each VM normally has a main disk and a swap disk.
No response needed, just thought I'd describe my storage in case it helps the thread.

EDIT: quote from LVM instructions "Before changing the state of the logical volume to that of it's snapshot, you need to have unmounted that logical volume if you want the change to happen immediately. Alternatively, if you can't unmount, you can carry out the command but the changes will not take effect until you do a reboot."

So the original disk needs to be unmounted in order to revert to the snapshot, which means merging the two. If you want to go back to the original LV, you'd delete the snapshot. But it seems the LV is not being released during a reboot of the VM.
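From those instructions, I think the manual procedure is roughly this (device names are just placeholders):

# make sure nothing is using the origin LV: VM shut down, not mounted on the host
umount /dev/vg/img          # only if it was actually mounted somewhere

# merge the snapshot back into the origin - this is the "revert"
lvconvert --merge /dev/vg/img_snap

And if the origin is still open when the merge is requested, my understanding is that LVM just defers the merge until the next time the LV is activated, which would match the revert only completing after a reboot.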

Yeah, an LV that is in use (either mounted or attached to a running VM) will prevent the snapshot from being reverted. However, shutting down the VM should resolve that .... unless some other process on the system is accessing the LVs?

With the virtual machine shut off completely, I tried mounting the VM's LVM image into a folder. It failed, saying it's already mounted. I tried unmounting it with umount; it says it's not mounted.
Something is preventing /dev/vg/img from being released, so the snapshot revert never completes. I also tried unmounting /dev/mapper/vg-image, /dev/mapper/dm-#, and ../dm-#. I can't find what has the LVM image locked as busy.

Try running the command fuser /dev/vg/img

That should tell you what process is using the device, if any.
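If fuser shows nothing, a couple of other checks might help narrow down what's holding the volume (adjust the device names to match your setup):

# list any open files on the device
lsof /dev/vg/img

# ask device-mapper how many openers the volume has ("Open count" in the output)
dmsetup info vg-img          # the /dev/mapper name of the same LV

An Open count above 0 while the VM is shut down would mean something on the host still has the volume held open.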

I've completely started over with LVM: formatted and recreated the volume group, then had Cloudmin create an Ubuntu 14.04 instance. That works fine and shows up in the LV manager. Then:
- Created a snapshot.
- Ran apt-get update - snapshot showed 20% usage.
- Ran apt-get upgrade - snapshot showed 95% usage.
- Deleted the snapshot via Cloudmin Resources and rebooted. apt-get update still shows up to date with no new updates. I thought it should show new updates again (correction: this is the correct result - see the note below).
- Created a new snapshot.
- Ran apt-get dist-upgrade - snapshot showed 70% usage.
- "Reverted" via the Cloudmin Resources snapshot tool and rebooted. apt-get dist-upgrade still shows everything up to date.
- Tried creating another snapshot - it shows "Creating snapshot of system lvmtest.hostinglz.com .. .. failed : Snapshots of an origin that has a merging snapshot is not supported"

So deleting the snapshot works as expected, but reverting a snapshot never completes and gives the above error.

sudo fuser /dev/vg1/img - shows nothing using it.

sudo fuser /dev/vg1/img.snap - shows nothing using it.

Boot the VM back up and fuser shows the process ID in use.

After some reading: with LVM2 snapshots, the original image is the one that gets modified (the original bits are copied to the snapshot before being overwritten on the original image). So deleting a snapshot "commits" all changes, because they have already happened on the original image and deleting the snapshot just throws away the backed-up bits. That fits with what I saw in my test above.

Reverting to an LVM2 snapshot tells it to copy the old bits from the snapshot back onto the original image, essentially rolling it back to the time the snapshot was taken. But it won't complete. I'm going to try running a snapshot and revert via the LVM manager and see what happens.
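For the command-line attempt, the plan is roughly this, using the same names as in the fuser commands above:

# take a snapshot of the guest's disk LV
lvcreate -s -L 2G -n img.snap /dev/vg1/img

# ... make some changes inside the guest, then shut it down ...

# merge the snapshot back into the origin (the "revert")
lvconvert --merge /dev/vg1/img.snap

# check whether the merge actually happened - the snapshot should be gone
lvs -a vg1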

EDIT: After several revert attempts on a Cloudmin image, the revert won't complete via the command line, via the VM's Resources page, or via the Webmin LVM manager, so I don't think it's a Webmin/Cloudmin issue. Something is preventing the image from completing the revert when the VM is rebooted, and fuser shows no process using it while the VM is shut down.

If even reverting a snapshot from the command line doesn't work, there isn't much Cloudmin can do, unfortunately :-( You may have to bring this up with the LVM or Debian developers.
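One thing that might be worth trying before escalating, though: since a merge that can't start immediately seems to be deferred until the LV is next activated, deactivating and re-activating the origin LV (with the VM fully shut down and nothing else using it) might let the merge complete without rebooting the whole host. Something like:

# with the VM shut down and nothing using the LV
lvchange -an /dev/vg1/img
lvchange -ay /dev/vg1/img

# the snapshot should disappear once the merge has run
lvs -a vg1

I haven't verified that on this exact setup, so treat it as a guess.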