Server locks and controlpanel inaccessible after update 5.03

Based on forum post: https://www.virtualmin.com/comment/757228 (I hope I'm right I have to create this issue in order to request your premium support, I bought a "issue" but don't know how to activate it)

Debian 6 (Squeeze) Version 5.03 Webmin 1.801

Everything should be up-to-date.

I can't login the controlpanel (https://domain.tld:10000). - With authentic theme: the bar at the side loads, but the main frame keeps grey and 'loading' - With the default theme (when I remove authentic theme from the /etc/webmin/config) it keeps on loading and after a long time, I get error "Failed to lock file /etc/group after 5 minutes. Last error was :"

I found many lockfiles (like /etc/passwd.lock) and they keep on appearing when I delete them.

Some processes, like: - /usr/share/webmin/authentic-theme/index.cgi - something.../domain-list.pl - /usr/share/webmin/virtual-server/collectinfo.pl

Keep on running at 100% CPU and using much memory (like 10GB).

Everything worked well (also after reboots etc.) before I installed 5.03.

Files: 
Status: 
Active

Comments

Howdy -- yup, this is the right place to ask your question after purchasing a support incident!

You do seem to be experiencing some unusual issues there.

Let's start here -- can you share the output of these two commands (you provided the one in your forum post, but I'd like to make sure that's here as well):

find /etc -name "*.lock"
ps auxwf

Thanks!

The /etc/group.lock file should contain the PID of the process holding the lock, which you can find with the ps auxwww command.

Yeah let us know what PID is contained within one or two of those .lock files, and we can help review your process list to see what's going on there.

root     19523 97.0  1.3 178020 110048 ?       R    22:07   0:16 /usr/share/webmin/authentic-theme/index.cgi
root     19772  0.0  0.0   6068   712 pts/0    S+   22:07   0:00 grep 19523

I think it should be the authentic-theme-index file in this case. One thing I don't understand is that I commented the theme line in /etc/webmin/config, so I thought it wouldn't use the authentic theme this way anymore. But I guess it doesn't work that way. (if I uncomment the line and restart the webmin service, the same process is locking the /etc/group)

When I peek at the locked files myself, I don't see different looking rows or something like that. (but those are big files, so it's possible I have overlooked or didn't recognize the problem)

To manually disable the theme, you also need to edit /etc/webmin/miniserv.conf and comment out/edit the preroot= line.

Hi and thank you for your response.

I could enter the controlpanel after commenting out that line. Sadly it did still hang when I opened Virtualmin from there. (Servers -> Virtualmin Virtual Servers)

I see a different process hanging now (it also did before, but I haven't reported it so far), so I checked the process-id in /etc/passwd.lock and it's indeed locking the files this time:

ps auxwww | grep 11957
root     11957 97.7 27.4 2335824 2253756 ?     R    08:55  37:23 /usr/share/webmin/virtual-server/collectinfo.pl
root     23872  0.0  0.0   6072   732 pts/0    S+   09:33   0:00 grep 11957
zaidsoft's picture
Submitted by zaidsoft on Sat, 05/28/2016 - 05:12

we are also facing the same issue since last update. I changed the theme manually to "mscstyle3" and was able to get into webmin panel. The moment I try using any feature (like "virtual servers" module) that has anything to do with "users (or virtualmin domains)" it gets stuck with 100% CPU. Same behavior on all cli commands like backup/delete-domain etc.

I ran strace and found that it gets into infinite loop and keeps emitting at strace output like:

munmap(0x7fb3b5ad7000, 4096)            = 0
open("/etc/shadow", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG, st_size=43586, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b5ad7000
read(4, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
read(4, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
read(4, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
read(4, "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"..., 4096) = 4096
close(4)           

After lot of time most commands print "Failed to lock file /etc/passwd after 5 minutes. Last error was :".

Deleting lock files in /etc has no effect.

zaidsoft, thanks for the info!

tfhelp had paid for a support incident here though, so we're going to work with him first on this one. You're welcome to keep monitoring it though so long as he doesn't end up needing to mark it as private.

tfhelp, it sounds like we might need to log in to figure out what's going on there. Would that be possible?

If so, could you email login details to Jamie and I, at jamie@virtualmin.com and eric@virtualmin.com?

We'll need to log into both SSH and Virtualmin as a user with root privileges.

Thank you for your help so far. I've send you the details about the server by mail. Please let me know when I can do something to help you.

P.S. I've shutted down webmin, because otherwise the server crashes every now and then. I think you now this better than I do, but to be sure, you can start it using /etc/init.d/webmin start)

Ok, I found the issue here - your system had an empty nameless domain in Virtualmin's DB with ID 138053978626166 which was triggering a bug that caused Virtualmin to spend a long time trying to fill in missing fields in the domain object. I deleted that, and all appears to be good now.

Also, the next release will handling this situation better.

Reviewing Virtualmin on your server, it does appear to be working now that Jamie removed the problem in the config file.

It's currently setup to use the Virtualmin Framed Theme, but feel free to change that back to Authentic if you like.

Can you verify that things are working as expected?

If so, then you are welcome to close up port 22 on your server again.

Also, note that there shouldn't have been any downtime on your server, other than Webmin itself. We didn't need to reboot, or restart any other services.

The key was the files in /etc/webmin/virtual-server/domains. In there was a config file with an empty value after "dom=". The "dom=" parameter should always have a domain name. How that got there, we're not quite sure; but Jamie just checked some code into git which should prevent that problem in the future.

zaidsoft's picture
Submitted by zaidsoft on Sat, 05/28/2016 - 22:19

I checked our server again (after new comments in this thread) and lots of files in /etc/webmin/virtual-server/domains have subdom= in them.

Can you kindly point me if that can cause this issue. What purpose that variable/property serve? Is empty string a valid value.

I also find many other fields with empty values like ipfollow=, db_postgres=

I there some documentation somewhere that can authoritatively specify the format of these config files (so one can check the actual config data against it)?

Since he paid for this ticket, we'll need to wait to assist other folks until the original problem is confirmed as solved.

However, what we can offer in the meantime is that I'd look for a blank dom= line -- the other lines being blank should be okay.

Hi and good morning. First of all: thank you for your help. The controlpanel works again and I don't see enormous processes blocking (besides the backup-process at this time, but that's usual ;) ).

Thank you again and I hope the problem doesn't return. I'll ask my collegue whether he knows how this strange-looking domain came here, but as I read your comment, you already know it's from a known bug. So that's not important anymore.

EDIT: unrelated (API-interface seems to be changed, needs a &multiline=-parameter now, but it's in your documentation so that's normal) :)

Great, glad to hear that's working!

Since this is due to a reproducible bug, we're going to give you an additional support request to use in the future. If you run into an issue that you'd like a hand with, feel free to just open a support request and we'll gladly help out. Just mention this request when you bring it up, as we don't have a formal way of applying an issue to your account, and I might forget if it's way in the future :-)

Also, I'll talk to Jamie about that API issue you mentioned, as we try not to change the API much in order to avoid breaking applications. I imagine if that changed it was necessary for some reason.

Since it sounds like things are working for you now we're going to open this up to other folks who are experiencing a similar issue.

zaidsoft, I've asked Jamie for any other ideas he has on issues that could cause this problem. But were you able to find a blank "dom=" line?

Which API command were you having trouble with?

I've had the same problem /read: nightmare/ with webmin/virtualmin hanging on login after update from 1.800 to 1.801.

Sorry for the long comment, but I hope this helps to find a possible bug. My system is Debian 8.4

After reading your issue and the hint by andreychek in https://www.virtualmin.com/comment/757326#comment-757326 I had an idea what I should have to search for. strace was showing an infinite loop on this almost empty domain file:

~# ls -lrt /etc/webmin/virtual-server/domains/
...
-rwx------ 1 root root 1090 Mai 28 10:03 146442259017391
-rwx------ 1 root root   70 Mai 28 10:42 14644249412916
-rwx------ 1 root root 2622 Mai 28 15:36 14644225791739
~# cat  /etc/webmin/virtual-server/domains/14644249412916
lastsave=1464424941
creator=root
created=1464424941
id=14644249412916

After removing this file and killing/restarting all webmin processes I got access to webmin/virtualmin as usual, at least in chrome. After deactivating some addons in firefox got it back there, too - but this could be unrelated.

~# date -d @1464424941
Sa 28. Mai 10:42:21 CEST 2016

Using etckeeper (!sic) I can follow each change, git log shows for the debian packages:

Date:   Thu May 26 08:08:19 2016 +0200     Package changes:
    -webmin 1.795
    +webmin 1.800
 
...
 
Date:   Sat May 28 10:55:01 2016 +0200
    saving uncommitted changes in /etc prior to apt run
 
Date:   Sat May 28 10:58:49 2016 +0200   Package changes:
    -libgd3 2.1.0-5+deb8u1
    +libgd3 2.1.0-5+deb8u3
    -webmin 1.800
 
Date:   Sat May 28 11:04:17 2016 +0200    Package changes:
    +webmin 1.801

After this time, I couldn't get any access to webmin/virtualmin via browser. Running virtualmin on command line did work.

So, which files have been in directory /etc/webmin/virtual-server/domains/ and what have I done yesterday:

146441957124332 (new virtual server) 28/May/2016 09:13
1464419583243321 (new alias server)  28/May/2016 09:13
# both deleted
146442231230534 (new virtual server) 28/May/2016 09:58
# got deleted
14644225791739  (new virtual server) 28/May/2016 10:03
146442259017391 (new alias server) 28/May/2016 10:03

At 10:42 there was no creation of virtual server, so there shouldn't be any file "14644249412916"?

Looking at "Webmin Action Log" I've found the following entry in Action Detail Log:

ip   xx.165.98.yy
id  1464424941.2916.0
module  virtual-server
user    root
script  save_spf.cgi
object  -
type    domain
time    1464424941
sid     30166d0044b8f5e1545dc07079e01917
action  spf

So may be this helps you to find this/another related bug ...

zaidsoft's picture
Submitted by zaidsoft on Mon, 05/30/2016 - 01:48

Our issue is resolved. I found a couple of virtual-server definition files in /etc/webmin/virtual-server/domains having no "dom=" property at all! Not even blank value! Deleting these files and restarting webmin fixes the infinite loop (100% CPU) issue and we can login and use the control panel as before.

contents of one of the files is:

created=1353336369
backup_mail_folders=Maildir
creator=root
lastsave=1353336369
id=13533363696766

We never created these domain files explicitly. It somehow appeared/got-created and virtualmin releases up till now were ignoring it.

I think this is a bug and should be fixed in next release. Whenever virtualmin does not find valid "dom=" in virtual-server definition files it goes into some kind of infinite loop. It should simply ignore these definitions and move on. Please forward to Jamie so it gets fixed in next release.

We're glad you got it working!

Jamie has already added in code for the next release to handle that issue. Reviewing his code, it looks like it's aimed at preventing such a Virtual Server from being added in the first place.