Error reading response length from fastrpc.cgi : Connection reset by peer

A physical host we are managing using Cloudmin Pro is repeatedly reporting the status as "Webmin down", and we can't figure out why.

Last status: Webmin down (Last changed at 14/Aug/2020 10:45)
Detailed status error: Error reading response length from fastrpc.cgi : Connection reset by peer

Change time        Old status   New status   Changed by
14/Aug/2020 10:45  Webmin       Webmin down  Monitoring
14/Aug/2020 10:40  Webmin down  Webmin       Monitoring
14/Aug/2020 10:35  Webmin       Webmin down  Monitoring
14/Aug/2020 10:30  Webmin down  Webmin       Monitoring
14/Aug/2020 10:25  Webmin       Webmin down  Monitoring
14/Aug/2020 10:20  Webmin down  Webmin       Monitoring
...

We can't find any errors logged on the physical machine and Webmin loads up fine when we access it in a browser. CPU load is low, there is free RAM, and no other hosts on the same network are exhibiting the same problem.

Can you please advise how we might debug what's going on here?

Thanks

Chris

Status: Active

Comments

Is there any firewall that could be blocking ports 10000 - 10100 between the Cloudmin master and the host system, or the master and a VM?
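One way to test this from the Cloudmin master is a quick TCP connect check against the port range mentioned above. The following is only a minimal sketch: the function name `check_ports` and the host name in the usage comment are made up for illustration, and it is not part of Webmin or Cloudmin.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Try a TCP connect to each port; map port -> None on success, else the error text."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the host and applies the timeout to the connect
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = None
        except OSError as exc:
            results[port] = str(exc) or type(exc).__name__
    return results


# Hypothetical usage, run on the Cloudmin master:
#   for port, err in check_ports("managed-host.example.com", range(10000, 10101)).items():
#       print(port, "open" if err is None else err)
```

A "connection refused" on a port where nothing is currently listening is not by itself a sign of filtering. A timeout is more suggestive of a firewall silently dropping packets.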

No firewall rules that could be causing this from what I can see

This is an intermittent problem - if Webmin is restarted, Cloudmin then reports a 'Webmin' status for a period of time before falling back into the same intermittent pattern.

Our Nagios monitoring system monitors the status of Cloudmin managed servers/VMs, and reports if any are not in Webmin, SSH or Alive status, and this is what our Nagios log looks like for the past few hours:

August 17, 2020 11:00

Service Warning[2020-08-17 11:35:18] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;HARD;2;WARNING - redacted-hostname.localdomain (Webmin Down)

Service Warning[2020-08-17 11:25:12] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;SOFT;1;WARNING - redacted-hostname.localdomain (Webmin Down)

Service Ok[2020-08-17 11:05:06] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;OK;SOFT;2;OK - No systems down

August 17, 2020 10:00

Service Warning[2020-08-17 10:54:59] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;SOFT;1;WARNING - redacted-hostname.localdomain (Webmin Down)

Service Ok[2020-08-17 10:44:53] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;OK;HARD;2;OK - No systems down

Service Warning[2020-08-17 10:34:47] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;HARD;2;WARNING - redacted-hostname.localdomain (Webmin Down)

Service Warning[2020-08-17 10:24:41] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;SOFT;1;WARNING - redacted-hostname.localdomain (Webmin Down)

Service Ok[2020-08-17 10:14:35] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;OK;HARD;2;OK - No systems down

Service Warning[2020-08-17 10:04:29] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;HARD;2;WARNING - redacted-hostname.localdomain (Webmin Down)

And this is from Cloudmin's Status change history:

17/Aug/2020 13:15 Webmin Webmin down Monitoring

17/Aug/2020 13:00 Webmin down Webmin Monitoring

17/Aug/2020 12:25 Webmin Webmin down Monitoring

17/Aug/2020 12:20 Webmin down Webmin Monitoring

17/Aug/2020 12:15 Webmin Webmin down Monitoring

17/Aug/2020 12:10 Webmin down Webmin Monitoring

17/Aug/2020 12:05 Webmin Webmin down Monitoring

17/Aug/2020 12:00 Webmin down Webmin Monitoring

17/Aug/2020 11:45 Webmin Webmin down Monitoring

17/Aug/2020 11:40 Webmin down Webmin Monitoring

This is happening almost 24/7.
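To line the flapping up against other logs, the status change history can be turned into a list of "Webmin down" periods. This is a rough sketch that assumes the history is pasted as plain text lines in exactly the format shown above. The regular expression and the function name `down_periods` are illustrative, not anything Cloudmin provides.

```python
import re
from datetime import datetime

# Matches e.g. "17/Aug/2020 13:15 Webmin Webmin down Monitoring"
LINE_RE = re.compile(
    r"^(\d{2}/\w{3}/\d{4} \d{2}:\d{2}) (Webmin(?: down)?) (Webmin(?: down)?) Monitoring$"
)


def down_periods(history_lines):
    """Return (start, end) datetime pairs for each completed 'Webmin down' period."""
    events = []
    for line in history_lines:
        m = LINE_RE.match(line.strip())
        if m:
            when = datetime.strptime(m.group(1), "%d/%b/%Y %H:%M")
            events.append((when, m.group(3)))  # keep only the new status
    events.sort()  # the history is listed newest first; process oldest first

    periods, start = [], None
    for when, new_status in events:
        if new_status == "Webmin down":
            if start is None:
                start = when
        elif start is not None:
            periods.append((start, when))
            start = None
    return periods
```

Each pair gives the time Cloudmin marked the host down and the time it next marked it up, which can then be compared with Webmin's own logs or a packet capture from the same window.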

How loaded is the remote Webmin system when this happens? We got another report recently of a user seeing very high system load when Cloudmin was doing a status check.

Load avg is consistently < 1

We also bumped RAM on Xen dom0 from 1GB to 2GB recently to see if that made any difference, even though we weren't seeing any out of memory errors logged. It hasn't helped.

It's a very new server with NVMe SSD, fast CPUs and light load. We can safely rule out network issues as lots of other servers on the same network are working just fine. We're also not experiencing the same issue with Webmin running on Xen VMs on this server, so that rules out an issue with the NICs on the server.

Would it be possible to do a packet capture (with tcpdump) when this happens? I'd be interested to see what connections were being made, or at least attempted.
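As a starting point, something like the following could be used to build a capture command limited to traffic between the two machines on the Webmin port range. It is only a sketch: the helper name `tcpdump_command`, the output file name, and the 192.0.2.10 address are placeholders, and `-i any` assumes a Linux host. The command would need to run as root on the managed host, with the Cloudmin master's address as the peer (or the other way round).

```python
import shlex


def tcpdump_command(peer_ip, out_file="fastrpc.pcap", iface="any", ports=(10000, 10100)):
    """Build a tcpdump argv capturing TCP traffic to/from peer_ip on the given port range."""
    # pcap filter: only packets involving the peer, on TCP ports in the range
    bpf = f"host {peer_ip} and tcp portrange {ports[0]}-{ports[1]}"
    # -n: no name resolution, -s 0: full packets, -w: write raw packets to a file
    return ["tcpdump", "-n", "-i", iface, "-s", "0", "-w", out_file, bpf]


# Print a shell-quoted command line that can be copied to the host
print(shlex.join(tcpdump_command("192.0.2.10")))
```

Leaving the capture running across at least one "Webmin" to "Webmin down" transition should show whether the connection is reset by the host itself or by something in between.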