One of my Virtualmin scheduled backups to Rackspace appeared to complete successfully (every domain's .tar.gz file was present), but it did not save the log, send the results email, or delete the old backups. The server has just over 700 sites (mainly aliases).
I have tracked this down to a combination of two bugs in the "delete old backups" phase:
First, the rs_http_single_call function in /usr/share/webmin/virtual-server/rs-lib.pl opens an HTTP connection to the Rackspace server and issues an API call. On success the connection is closed and the response is returned. On error, however, the function returns an error code without closing the connection, which is left in the CLOSE_WAIT state. There are 5 places where this function can return without closing the connection.
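To illustrate the kind of fix I mean, here is a minimal sketch in Python rather than Webmin's Perl. It is not the code from rs-lib.pl: single_call and conn_factory are hypothetical names, and the only point is that a single cleanup block covers every return path, including the error ones.

```python
import http.client


def single_call(conn_factory, method, path, headers=None):
    """Issue one request and always close the connection, even on error.

    Returns (ok, payload): the response body on success, or an error
    string on failure. conn_factory is a hypothetical callable returning
    an object with the http.client.HTTPConnection interface.
    """
    conn = conn_factory()
    try:
        conn.request(method, path, headers=headers or {})
        resp = conn.getresponse()
        body = resp.read()
        if resp.status >= 400:
            # Early return on error: the finally block still closes the socket.
            return False, f"HTTP {resp.status} {resp.reason}"
        return True, body
    except (OSError, http.client.HTTPException) as exc:
        return False, f"request failed: {exc}"
    finally:
        # Runs on every exit path, so no socket is left in CLOSE_WAIT.
        conn.close()
```

In the Perl function the equivalent would presumably be to close the socket before each of the early returns, or to route them all through one exit point.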
Second, the purge_domain_backups function in /usr/share/webmin/virtual-server/backups-lib.pl loops through all files in the Rackspace container and calls rs_stat_object for each one. If the file is a .gz file it is deleted, along with its .dom and .info files. The next loop iteration then reaches the .dom file (which was just deleted), so rs_stat_object correctly gets a 404. rs_http_single_call treats that 404 as an error and leaves the connection open as described above. Could this function be refactored to move the rs_stat_object call into the next conditional block, which is not executed for .dom and .info files?
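The refactoring I am suggesting could look roughly like the sketch below. Again this is Python, not the actual Perl from backups-lib.pl: purge_old_backups, stat_object, delete_object and is_expired are hypothetical stand-ins, and the assumption that companion files are named by appending .dom and .info to the backup name is mine.

```python
def purge_old_backups(names, stat_object, delete_object, is_expired):
    """Delete expired backups together with their .dom/.info companions.

    names, stat_object, delete_object and is_expired are hypothetical
    stand-ins for the container listing and the Rackspace helper calls.
    """
    deleted = []
    for name in names:
        # Skip companion files before making any API call: they are removed
        # together with their backup, so a stat on them would only yield 404.
        if name.endswith((".dom", ".info")):
            continue
        info = stat_object(name)
        if info is None or not is_expired(info):
            continue
        # Assumed naming: companions are the backup name plus a suffix.
        for target in (name, name + ".dom", name + ".info"):
            delete_object(target)
            deleted.append(target)
    return deleted
```

With this ordering the stat call is never issued for a file that an earlier iteration has already deleted, so the 404 path is not hit during a normal purge.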
After deleting about 170 files the process has over 1000 connections in CLOSE_WAIT. Each of these still holds an open file descriptor, so the OS then refuses to let the process open any more files, and the process crashes and exits silently. Original forum thread here.
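For anyone wanting to check whether a process is approaching the same limit, a small helper along these lines may be useful. It is a sketch that assumes Linux with /proc mounted; fd_usage is a hypothetical name, and the soft limit it reports is that of the calling process, not necessarily that of the process being inspected.

```python
import os
import resource


def fd_usage(pid="self"):
    """Return (open_fds, soft_limit) on Linux.

    Counts the entries in /proc/<pid>/fd, so it only works where procfs
    is mounted. The soft limit is read for the calling process.
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"{used} descriptors open, soft limit {limit}")
```

Passing the PID of the backup process instead of "self" (as root or the owning user) would show the descriptor count climbing as the purge runs.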