Reinstall Cloudmin Master

We are thinking of doing a complete reinstall of our Cloudmin master because it keeps hanging intermittently. What we are wondering is which configuration files we need to back up beforehand, so that we can get the virtual machines running again as quickly as possible.

Regards, Jakob

Status: 
Active

Comments

Howdy -- sorry to hear that your server is locking up!

Are you hosting your Virtual Machines (VM's) on your Cloudmin master server?

Or are your VM's on another server?

Also, which type of VM are you using? Xen? KVM? Or another?

Thanks!

It may be better for us to debug the underlying problem, as a re-install is unlikely to help and may cause Cloudmin to lose track of your VMs. Can you tell us more about these hangs?

Hi,

Yes, we have four VM's running on it now, and we plan to run at least one more. All of them are KVM.

/J

Hi,

That would be great. The problem we are having with the server is that it runs fine for a day or two, sometimes a week, and then all VMs stop responding. After a reboot everything is fine again until the next hang. We have looked at the logs but can't find the cause of the problem. What information or logs do you need to debug it?

Regards, Jakob

When the VM's stop responding -- is the host server still responsive? Or is the host unavailable as well?

Also, is there any chance we could log into your host and take a peek at some of your logs and related info?

It's hard to say what exactly the issue is, though we'd be interested in the messages, syslog, and kern.log files -- as well as your current dmesg output, and whether you're using the most recent kernel version available to your distribution.
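For anyone gathering the same files, here is a minimal sketch of bundling them into one archive to attach to a ticket. It assumes the logs live under /var/log (typical, but not confirmed for this host); the function name bundle_logs and the output path are hypothetical.

```python
import tarfile
from pathlib import Path

# Log file names mentioned above; which ones exist depends on the distribution.
LOG_NAMES = ("messages", "syslog", "kern.log")


def bundle_logs(log_dir, out_path, names=LOG_NAMES):
    """Pack whichever of the named log files exist into a gzip tarball.

    Returns the list of file names that were actually added.
    """
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for name in names:
            path = Path(log_dir) / name
            if path.is_file():
                # Store under the bare file name so the archive unpacks flat.
                tar.add(path, arcname=name)
                added.append(name)
    return added
```

Run as root, a call such as bundle_logs("/var/log", "/tmp/host-logs.tar.gz") would collect the files. The dmesg output and the kernel version (uname -r) are not covered by this sketch and would still need to be captured separately.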

So you're welcome to provide us with that info if you prefer, but it may be simpler if we logged in to take a look.

Also, if you know roughly when this issue last occurred, that would help us know where to look in the logs.

Thanks!

That would be great, how do we proceed?

We found this in the kernel log after a reboot, could it have something to do with the problem?

"Mar 17 12:32:03 bfg kernel: [173304.969115] irq 19: nobody cared (try booting with the "irqpoll" option)

Mar 17 12:32:03 bfg kernel: [173304.969120] Pid: 0, comm: swapper Not tainted 2.6.32-39-server #86-Ubuntu

Mar 17 12:32:03 bfg kernel: [173304.969122] Call Trace:

Mar 17 12:32:03 bfg kernel: [173304.969124] [] __report_bad_irq+0x2b/0xa0

Mar 17 12:32:03 bfg kernel: [173304.969133] [] note_interrupt+0x18c/0x1d0

Mar 17 12:32:03 bfg kernel: [173304.969136] [] handle_fasteoi_irq+0xdd/0x100

Mar 17 12:32:03 bfg kernel: [173304.969140] [] handle_irq+0x22/0x30

Mar 17 12:32:03 bfg kernel: [173304.969145] [] do_IRQ+0x6c/0xf0

Mar 17 12:32:03 bfg kernel: [173304.969147] [] ret_from_intr+0x0/0x11

Mar 17 12:32:03 bfg kernel: [173304.969149] [] ? finish_task_switch+0x59/0xe0

Mar 17 12:32:03 bfg kernel: [173304.969155] [] ? finish_task_switch+0x50/0xe0

Mar 17 12:32:03 bfg kernel: [173304.969159] [] ? thread_return+0x48/0x41f

Mar 17 12:32:03 bfg kernel: [173304.969163] [] ? cpu_idle+0xeb/0x110

Mar 17 12:32:03 bfg kernel: [173304.969167] [] ? rest_init+0x77/0x80

Mar 17 12:32:03 bfg kernel: [173304.969171] [] ? start_kernel+0x36d/0x376

Mar 17 12:32:03 bfg kernel: [173304.969174] [] ? x86_64_start_reservations+0x125/0x129

Mar 17 12:32:03 bfg kernel: [173304.969178] [] ? x86_64_start_kernel+0xfa/0x109

Mar 17 12:32:03 bfg kernel: [173304.969180] handlers:

Mar 17 12:32:03 bfg kernel: [173304.969181] [] (pdc_interrupt+0x0/0x2d0 [sata_promise])

Mar 17 12:32:03 bfg kernel: [173304.969193] Disabling IRQ #19"
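A trace like the one above can be spotted in a long kern.log with a small filter. The following is only a sketch: the regular expressions are written against the message wording shown in this excerpt, and the function name summarize_irq_trouble is hypothetical.

```python
import re

# Patterns based on the wording in the excerpt above.
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
DISABLED = re.compile(r"Disabling IRQ #(\d+)")
# Handler lines look like "(pdc_interrupt+0x0/0x2d0 [sata_promise])".
HANDLER = re.compile(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\)")


def summarize_irq_trouble(lines):
    """Collect unhandled and disabled IRQ numbers and the handler modules named."""
    summary = {"unhandled": set(), "disabled": set(), "modules": set()}
    for line in lines:
        if (m := NOBODY_CARED.search(line)):
            summary["unhandled"].add(int(m.group(1)))
        elif (m := DISABLED.search(line)):
            summary["disabled"].add(int(m.group(1)))
        elif (m := HANDLER.search(line)):
            # group(2) is the kernel module that registered the handler.
            summary["modules"].add(m.group(2))
    return summary
```

Fed the lines quoted above, this would report IRQ 19 as both unhandled and disabled, with sata_promise as the module that registered the handler, which points at the Promise SATA driver or card as a suspect.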

Update: Error seems to have disappeared after we replaced the SATA-card.

Yeah it's certainly possible that something was awry with the SATA card.

It's up to you how we proceed then -- if you'd like us to take a look at your logs, you can either enable Remote Support using the Virtualmin Support module, or you can email your login details to eric@virtualmin.com.

If you do that, be sure to let us know when the last problem occurred, so we know where to look in the logs.

Or, if you'd like to see if this new SATA card fixes the issue, we can hold off to see if it happens again.

We will let it run and see, if it hangs again we will let you know.

Thanks! Jakob

So, after about one week the server is behaving strangely (slow) again. I have attached the log files you wanted. Tell me if there is any other info you need.

Kernel is 2.6.32.40-server

Regards, Jakob

Well, I see a few unusual issues related to what I think are some video drivers, though I'm not sure if that's related to the slowness you're seeing now.

What output do you see if you run the command "uptime" on the host -- are you seeing a high load at the moment?

If so, what processes do you see consuming a lot of resources when you run the command "top"?

Also, if you're still seeing this problem, we'd also be happy to log into your host and take a look around a bit.

Output from uptime

user@bfg:~$ uptime
11:12:30 up 17:06, 2 users, load average: 70.39, 49.91, 43.15

Okay, so that load there appears to be the issue -- something is hammering your server. And we just need to figure out what :-)

If you run the command "top", do any processes stand out to you?

Also, are you able to review the "uptime" output of your individual VPS's?

If one VPS were having significant load issues, that could potentially affect the host server like you're seeing.
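To compare the host and each VPS consistently, the load figures can be pulled out of the uptime output and related to the number of CPU cores. This is a rough sketch: the function names are hypothetical, the core count is not stated in this thread, and the threshold of 1.0 per core is a common rule of thumb rather than a hard limit.

```python
import re


def parse_load_averages(uptime_line):
    """Return the 1, 5 and 15 minute load averages from one line of uptime output."""
    m = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if m is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(x) for x in m.groups())


def is_overloaded(load, cpu_count, threshold=1.0):
    """True if the load per CPU core exceeds the threshold (assumed rule of thumb)."""
    return load / cpu_count > threshold


# Example with the host output quoted earlier; 8 cores is an assumed value.
one_min, five_min, fifteen_min = parse_load_averages(
    "11:12:30 up 17:06, 2 users, load average: 70.39, 49.91, 43.15"
)
print(one_min, is_overloaded(one_min, cpu_count=8))
```

Running the same check inside each VPS makes it easier to see whether one guest stands out or whether the load is only visible on the host.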

Ok, we received the same IRQ error yesterday, so we replaced the other SATA card and updated the motherboard BIOS. It's running fine right now with a load of 2-5 on the master, which seems normal. The VPSes have a load of 0.3 to 4 depending on their workload, so that seems OK too.

We'll let you know next week if everything is ok or if the problem shows up again. I think the problem is the Promise SATA-card.

/Håkan