Loss of connectivity upon KVM guest restart/shutdown

Hi,

I noticed that for half a minute or so I lose connectivity to the host (webmin) when I restart or shutdown a guest.
This might be related to the following error I picked up from "Boot messages":
"(qemu) /etc/qemu-ifdown: could not launch network script"

Status: 
Active

Comments

Odd, I've never seen this happen on any of my KVM hosts.

Do you lose SSH access too?

Hm, I hate when this happens. I can no longer reproduce the issue. Seems to have been a one-time hiccup. Well, I'm glad it's working without interruptions (I checked both the reboot and shutdown procedures).
I still get the error "(qemu) /etc/qemu-ifdown: could not launch network script" though.

Ok, please re-open this if you see it again.

Hi,

I managed to consistently reproduce this. What I did:
1. Set virtio as the default for both network and disk Cloudmin-wide.
2. Create a CentOS guest and wait for it to fail (probably has to do with virtio; I will figure it out and open a separate issue).
3. Shut down (check "Force immediate termination, instead of Unix shutdown?").
4. No connectivity to the machine for a minute or so; existing SSH sessions get reset.

"Boot messages" still reports:
"(qemu) /etc/qemu-ifdown: could not launch network script"
Not sure if this is serious, or related to the problem above, but it's all I can see that's suspicious so far.

In step 4, do you mean there was no connectivity to the host?

Yes, sorry for not being more clear.

Pings to host don't work either so it's complete disconnection.

That is very odd .. I can't see how this could happen.

Cloudmin doesn't do anything special network-wise when a VM is shut down. All KVM should do is disconnect the host's tapN interface from the bridge br0.
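For reference, those /etc/qemu-ifup and /etc/qemu-ifdown scripts (QEMU calls them with the tap interface name as the first argument) are normally just thin wrappers around attaching and detaching the tap device. A minimal sketch, assuming the bridge is named br0 as in this setup:

```shell
#!/bin/sh
# Minimal sketch of /etc/qemu-ifup; QEMU passes the tap device name as $1.
# (The matching /etc/qemu-ifdown would undo this:
#   brctl delif br0 "$1"; ip link set "$1" down)
ip link set "$1" up
brctl addif br0 "$1"
```

The "could not launch network script" message typically just means the script was missing or not executable; the kernel removes the tap device anyway when the QEMU process exits, so on shutdown it is usually harmless.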

Are you doing anything unusual with routing or ARP on your network?

Jamie,

I am not doing anything outside Cloudmin. Let me try this on a separate box in a different data centre and see if it still happens.

Hmm. I tried it in a different data centre and I cannot reproduce this. Must be something specific to the network setup. I guess you can close this issue.

Ok .. if you figure out something cloudmin is doing wrong with the KVM network setup, please let me know..

Automatically closed -- issue fixed for 2 weeks with no activity.

Status:
Closed (fixed)
»
Needs review

So I hate to bring this ticket back up, but I think I have fallen into this strange problem.

I basically have the main cloudmin server, a physical dell poweredge machine with presently two guests.

I have shut down one of the guests to figure out how to add RAM to it. When I did this, the main box loses gateway connectivity. So locally I'm still fine; SSH does take time to connect (to the main box) since it can't make an internet connection. When I bring this guest back up, it takes a couple of minutes from what I've seen, but the main box also regains connectivity and I can ping outside again.

While this was happening, my other guest machine had no problems connecting to the internet. So basically my main OS hosting Cloudmin loses internet, the online guest has no issue with internet, while the other guest is offline. I just re-did it to ensure I'm not losing my mind, and it happened again.

Looking at the ip config, nothing looks odd.

Let me know if I can provide anything to help figure this out!

Update: I tried shutting down the guest from its console to see its behavior, logged in over SSH to the main server and tried to ping, and it wasn't able to ping; SSH took time to open up. However, this time after the guest had fully shut down (I can only assume), the main server can ping and is open to the internet.

Guy

That is very odd! If you SSH into the host system and run netstat -rn before and after the connectivity loss, does it show the same routes? In particular, does the default route disappear?

I do not have the net-tools suite, because this base CentOS 7.5 install came with the newer iproute2 suite. I looked up the new commands to try to get the results you're looking for. I'm not sure if I'm capturing everything you need, but I'm pasting the logs as well, which I think you might find interesting...
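As a quick reference for anyone else making the same net-tools to iproute2 transition, the rough command mappings are (my own summary, not exhaustive):

```shell
# net-tools command    iproute2 equivalent
# -------------------  ----------------------------------------
# netstat -rn          ip route     (routing table)
# netstat -tn          ss -tn       (established TCP sockets)
# ifconfig             ip addr      ("ip a" for short)
# arp -n               ip neigh     (ARP/neighbour cache)
```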

this is when the guest was already offline and the main host can ping...

[root@virtual /]# ss -n | grep tcp
tcp ESTAB 0 0 192.168.10.10:22 192.168.10.14:64094
tcp ESTAB 0 0 192.168.10.10:2049 192.168.10.20:679
tcp ESTAB 0 0 192.168.10.10:22 192.168.10.14:50713
tcp ESTAB 0 0 192.168.10.10:2049 192.168.10.14:63365
tcp ESTAB 0 0 192.168.10.10:2049 192.168.10.14:60899
tcp ESTAB 0 0 192.168.10.10:55372 192.168.10.4:445
[root@virtual /]# ip route
default via 192.168.10.1 dev br0
169.254.0.0/16 dev em1 scope link metric 1002
169.254.0.0/16 dev br0 scope link metric 1004
192.168.10.0/24 dev br0 proto kernel scope link src 192.168.10.10

this is when the guest was starting up (which I triggered from Cloudmin)

[root@virtual /]# ss -n | grep tcp
tcp ESTAB 0 0 127.0.0.1:10001 127.0.0.1:56210
tcp ESTAB 0 0 192.168.10.10:22 192.168.10.14:64094
tcp ESTAB 0 0 192.168.10.10:2049 192.168.10.20:679
tcp ESTAB 0 0 127.0.0.1:56210 127.0.0.1:10001
tcp ESTAB 0 0 192.168.10.10:22 192.168.10.14:50713
tcp ESTAB 0 0 192.168.10.10:2049 192.168.10.14:63365
tcp ESTAB 0 0 192.168.10.10:2049 192.168.10.14:60899
tcp ESTAB 0 0 192.168.10.10:10000 192.168.10.14:57153
tcp ESTAB 0 0 192.168.10.10:55372 192.168.10.4:445

[root@virtual /]# ip route
default via 192.168.10.1 dev br0
169.254.0.0/16 dev em1 scope link metric 1002
169.254.0.0/16 dev br0 scope link metric 1004
192.168.10.0/24 dev br0 proto kernel scope link src 192.168.10.10

journalctl logs while the guest was starting up and the main box could not ping outside:

Nov 17 14:35:10 virtual kernel: br0: port 3(tap1) entered blocking state
Nov 17 14:35:10 virtual kernel: br0: port 3(tap1) entered disabled state
Nov 17 14:35:10 virtual kernel: device tap1 entered promiscuous mode
Nov 17 14:35:10 virtual kernel: br0: port 3(tap1) entered blocking state
Nov 17 14:35:10 virtual kernel: br0: port 3(tap1) entered listening state
Nov 17 14:35:10 virtual kvm[11170]: 2 guests now active
Nov 17 14:35:12 virtual kernel: br0: port 3(tap1) entered learning state
Nov 17 14:35:14 virtual kernel: br0: port 3(tap1) entered forwarding state
Nov 17 14:35:14 virtual kernel: br0: topology change detected, propagating
Nov 17 14:35:14 virtual kernel: kvm [11147]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff
Nov 17 14:35:37 virtual kernel: EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
Nov 17 14:35:37 virtual kernel: EXT4-fs (loop0): recovery complete
Nov 17 14:35:37 virtual kernel: EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)

let me know if there is anything else I can provide

Is there any difference in the output from ip route when ping is failing vs. when it's working?

Also, does the VM have a different ethernet address from the host system? (It should.)
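One thing worth noting from the journalctl output above: the bridge walks the new tap port through the spanning-tree states (blocking -> listening -> learning -> forwarding), and with STP enabled a port spends the forward delay (15 seconds by default) in each of the listening and learning states before it passes traffic; the "topology change detected" message can also temporarily disturb the bridge's learned MAC entries. Whether that fully explains the outage here is unclear, but on a standalone host bridge with no redundant links, STP can usually be switched off and the delay zeroed. A sketch, assuming the bridge is br0 as in this thread:

```shell
# Inspect the current bridge settings (sysfs values are in 1/100 s,
# so 1500 means 15 seconds):
cat /sys/class/net/br0/bridge/stp_state      # 1 = STP on, 0 = off
cat /sys/class/net/br0/bridge/forward_delay

# On a standalone host bridge with no loop risk, turn STP off and
# zero the forward delay so new tap ports forward immediately:
brctl stp br0 off
brctl setfd br0 0
# iproute2 equivalent:
#   ip link set br0 type bridge stp_state 0 forward_delay 0
```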

This is the result of ip route while the system is fine:

[root@virtual ~]# ip route
default via 192.168.10.1 dev br0
169.254.0.0/16 dev em1 scope link metric 1002
169.254.0.0/16 dev br0 scope link metric 1004
192.168.10.0/24 dev br0 proto kernel scope link src 192.168.10.10

this is after I request a reboot for one of the guests through the main box using Cloudmin:

[root@virtual ~]# ip route
default via 192.168.10.1 dev br0
169.254.0.0/16 dev em1 scope link metric 1002
169.254.0.0/16 dev br0 scope link metric 1004
192.168.10.0/24 dev br0 proto kernel scope link src 192.168.10.10

While the guest was rebooting I could ping internal machines (192.168.10.x), but I could not ping the gateway itself (192.168.10.1), nor could I ping anything outside like google.com or yahoo.com.

As soon as the guest is done rebooting, everything is OK. It came back up fairly fast, but connectivity still went down.

Yes, all guests have their own IPs, and the main physical host has its own separate one as well.

When doing the pings on the main box (aka Cloudmin), the ping would hang. Not "timed out" or "not reachable", it would just hang. It took time, but the ping request gave this result after hanging:

[root@virtual tmp]# ping google.com
ping: google.com: Name or service not known

Again, all this while the guest is rebooting. And the Cloudmin UI is super slow... barely functioning.

The guest is back up, and the main box (Cloudmin) still isn't able to reach the internet. Nothing in ip route has changed.

Keep in mind this cloudmin is inside my LAN.

Another note: after this last guest reboot, Cloudmin still cannot ping out, and it has been several minutes. The Cloudmin UI is not responding, or down to a crawl.

I'm not sure if this is relevant, but doing 'ip a' I just noticed that there are adapters, or links, named tap0, tap1 and tap2. I do not recall seeing tap1 and tap2 before?

here is a paste

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether e0:db:55:03:46:a4 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e2db:55ff:fe03:46a4/64 scope link
       valid_lft forever preferred_lft forever
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP group default qlen 1000
    link/ether e0:db:55:03:46:a6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e2db:55ff:fe03:46a6/64 scope link
       valid_lft forever preferred_lft forever
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 92:e9:e3:05:f7:d1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.10/24 brd 192.168.10.255 scope global dynamic br0
       valid_lft 47744sec preferred_lft 47744sec
    inet6 fe80::e2db:55ff:fe03:46a6/64 scope link
       valid_lft forever preferred_lft forever
11: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UNKNOWN group default qlen 1000
    link/ether b2:4e:a6:50:60:15 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b04e:a6ff:fe50:6015/64 scope link
       valid_lft forever preferred_lft forever
12: tap1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UNKNOWN group default qlen 1000
    link/ether 92:e9:e3:05:f7:d1 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::90e9:e3ff:fe05:f7d1/64 scope link
       valid_lft forever preferred_lft forever
17: tap2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UNKNOWN group default qlen 1000
    link/ether ce:7d:8b:f3:d7:f3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::cc7d:8bff:fef3:d7f3/64 scope link
       valid_lft forever preferred_lft forever

I now attempted to restart the network and this is what happens - NetworkManager is inactive

[root@virtual tmp]# systemctl restart network
Job for network.service failed because the control process exited with error code. See "systemctl status network.service" and "journalctl -xe" for details.
[root@virtual tmp]# systemctl status network
● network.service - LSB: Bring up/down networking
   Loaded: loaded (/etc/rc.d/init.d/network; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2018-11-26 17:44:40 EST; 11s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 8916 ExecStart=/etc/rc.d/init.d/network start (code=exited, status=1/FAILURE)

Nov 26 17:44:40 virtual network[8916]: RTNETLINK answers: File exists
Nov 26 17:44:40 virtual network[8916]: RTNETLINK answers: File exists
Nov 26 17:44:40 virtual network[8916]: RTNETLINK answers: File exists
Nov 26 17:44:40 virtual network[8916]: RTNETLINK answers: File exists
Nov 26 17:44:40 virtual network[8916]: RTNETLINK answers: File exists
Nov 26 17:44:40 virtual network[8916]: RTNETLINK answers: File exists
Nov 26 17:44:40 virtual systemd[1]: network.service: control process exited, code=exited status=1
Nov 26 17:44:40 virtual systemd[1]: Failed to start LSB: Bring up/down networking.
Nov 26 17:44:40 virtual systemd[1]: Unit network.service entered failed state.
Nov 26 17:44:40 virtual systemd[1]: network.service failed.

I can still ping my guests, my macbook, but not gateway 192.168.10.1 or the net.

If I ssh into any of its guests, they can ping google just fine. I have no idea what's going on. lol

It's back up now. How I noticed: the Cloudmin UI in Chrome was fast and responsive again, and I tested a ping to Google... OK.

'ip a' - everything is still as above. Maybe I didn't see those taps before!?

sorry for the step by step long text.

That is very unusual - and actually I've seen something similar where rebooting two VMs in close succession can cause host networking to go offline.

Are you just rebooting one VM when this happens, or multiple?

Hi Jamie

No, this was just one guest doing a reboot, either through Cloudmin -> System State -> Reboot System, or by going into the guest and running shutdown -h now.

I just realized that the taps I see in 'ip a' each represent a guest.

I also just noticed another problem on this system, with spinning up a custom image. I can use the images already there in Cloudmin Settings -> New System Images, and it is showing my local images. But when I try to use them in New System -> Create new KVM instance, a "working system" will not let me choose my local images under 'Cloudmin system image'; an "empty system" lets me choose one, but then it does not boot, showing this boot message: "QEMU waiting for connection on: tcp:127.0.0.1:40005,server". And you can see that Cloudmin assigns a good LAN IP.

I posted it here https://www.virtualmin.com/node/59600

I also tried following a couple of your leads on this problem, but nothing panned out. I have an Arch Linux image and a YunoHost image, and I cannot get either of them to work. I can't see this being a bad install on the system, since the install went so well, but I'm starting to doubt this Cloudmin is working well. I haven't found any logs that could lead me anywhere either.

CentOS Linux 7 (Core) - CentOS Linux 7.5.1804
Webmin version 1.901
Cloudmin version 9.3.kvm Pro

Do you see the same problem when using one of the Cloudmin-provided images to create a new VM, like CentOS or Debian?

Yup, I've mentioned that above ;). I have spun up the latest Ubuntu and CentOS from the Cloudmin-provided images.

Honestly, Cloudmin going offline when I reboot a guest isn't as important to me as the ability to import and use local images, because this Cloudmin is on my LAN and nothing outside depends on it.

c .


I have an old blade here which has cloudmin. Since all these problems do not seem common I'll spark this up and see if I have the same symptoms.

sorry, my baby boy got a hold of my keyboard. Please delete the duplicates if you can.

I have just tried to create a guest with a local Arch Linux image on an older physical server using Cloudmin. It is still the exact same problem when trying to use a local image: it does not work.

Surely I cannot be the first to have this problem? Two physical servers with Cloudmin installs, neither able to install local images?

This machine does not seem to have the problem of losing internet access while a guest is rebooting. I can ping just fine...

Is this a bug? I would need to know before I go too far implementing cloudmin.

Status:
Needs review
»
Active

Are there any ideas on this? It is really preventing me from using any OS that is not in the Cloudmin list...

I've run into something similar, I think. It's a KVM issue, not a 'min' issue: the host loses its connection when a KVM guest is rebooted. It seems to happen with KVM on Ubuntu. The issue seems to be that upon a KVM network restart it attempts to find or update the MAC address of the host for some reason. The fix was to add a line to the interfaces file stating which MAC the host should use. My bridge interface is br0, and the MAC address is from my eth0, which is bridged to br0. Put in your own interface's MAC address; it will be listed in ifconfig. Add the following to /etc/network/interfaces. This is on Ubuntu 16.04.

post-up ip link set br0 address 11:11:11:11:11:11
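For context on why this can work: a Linux bridge by default adopts the lowest MAC address among its attached ports, so when a guest's tap device with a lower MAC is added to or removed from br0, the bridge's own MAC can change, and upstream devices keep ARP entries for the old address until they expire. (Notably, in the 'ip a' output earlier in this thread, br0 and tap1 share the same MAC, 92:e9:e3:05:f7:d1.) A fuller interfaces stanza might look like the sketch below; eth0 and the MAC address are placeholders, so substitute your own NIC and its address:

```shell
# /etc/network/interfaces sketch (Debian/Ubuntu style); eth0 and the
# MAC address below are placeholders for your own bridged NIC.
auto br0
iface br0 inet dhcp
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0
    # Pin the bridge MAC so it no longer changes when KVM adds or
    # removes tap devices on guest start/stop:
    post-up ip link set br0 address aa:bb:cc:dd:ee:ff
```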

Thanks for your comments. This topic basically turned into another, bigger problem: I am not able to create guests from a local or downloaded image. I can create a new KVM instance using any of the pre-loaded Cloudmin system images, but I cannot choose any of the custom downloaded ones I have, like Arch Linux or YunoHost.

I will work with your suggestion about the master OS going temporarily offline.