Server crashed, out of memory, where to start looking?

17 posts / 0 new
Last post
#1 Fri, 07/31/2009 - 19:30
pcm2a

Server crashed, out of memory, where to start looking?

Hey all. I have been running Virtualmin 100% trouble free for about a month now. This is running on a Ubuntu VMWare guest OS with a 2.2ghz CPU and 1 gig (out of 4gigs) of ram set aside for it.

Yesterday I set up 4 jobs. A weekly full backup, a daily incremental backup, a weekly full ftp backup, and a weekly incremental ftp backup. I ran the two full jobs and set the incremental to run at 0 and 2 am. Both jobs ran fine and emailed me the results.

Around 11:30 my whole server has crashed due to an out of memory problem. I have looked at the /var/log/kern.log and I see that the problem started around 8:30 am. This is what some of the log looks like:

Jul 31 08:27:12 kernel: [127533.320750] apache2 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
Jul 31 08:29:01 kernel: [127533.320906] Pid: 29284, comm: apache2 Tainted: GF       2.6.24-23-server #1
Jul 31 08:30:34 kernel: [127533.320926]
Jul 31 08:30:36 kernel: [127533.320954] Call Trace:
Jul 31 08:31:17 kernel: [127533.321756]  [oom_kill_process+0xf6/0x110] oom_kill_process+0xf6/0x110
Jul 31 08:34:24 kernel: [127533.321777]  [out_of_memory+0x15e/0x270] out_of_memory+0x15e/0x270
Jul 31 08:34:26 kernel: [127533.321785]  [__alloc_pages+0x399/0x3d0] __alloc_pages+0x399/0x3d0
Jul 31 08:34:26 kernel: [127533.321795]  [__do_page_cache_readahead+0xe0/0x210] __do_page_cache_readahead+0xe0/0x210
Jul 31 08:34:26 kernel: [127533.321810]  [filemap_fault+0x28e/0x390] filemap_fault+0x28e/0x390
Jul 31 08:34:26 kernel: [127533.321817]  [__do_fault+0x6c/0x440] __do_fault+0x6c/0x440
Jul 31 08:34:26 kernel: [127533.321826]  [handle_mm_fault+0x1c1/0x800] handle_mm_fault+0x1c1/0x800
Jul 31 08:34:26 kernel: [127533.321864]  [do_page_fault+0x1ab/0x840] do_page_fault+0x1ab/0x840
Jul 31 08:34:26 kernel: [127533.321875]  [thread_return+0x3a/0x59a] thread_return+0x3a/0x59a
Jul 31 08:34:26 kernel: [127533.321888]  [error_exit+0x0/0x51] error_exit+0x0/0x51
Jul 31 08:34:26 kernel: [127533.321921]
Jul 31 08:34:26 kernel: [127533.321954] Mem-info:
Jul 31 08:34:26 kernel: [127533.321987] Node 0 DMA per-cpu:
Jul 31 08:34:26 kernel: [127533.322038] CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Jul 31 08:34:26 kernel: [127533.322066] Node 0 DMA32 per-cpu:
Jul 31 08:34:26 kernel: [127533.322069] CPU    0: Hot: hi:  186, btch:  31 usd: 133   Cold: hi:   62, btch:  15 usd:  14
Jul 31 08:34:26 kernel: [127533.322119] Active:233447 inactive:987 dirty:0 writeback:0 unstable:0
Jul 31 08:34:26 kernel: [127533.322120]  free:1474 slab:4439 mapped:8 pagetables:9530 bounce:0
Jul 31 08:34:26 kernel: [127533.322146] Node 0 DMA free:1944kB min:40kB low:48kB high:60kB active:0kB inactive:0kB present:10484kB pages_scanned:0 all_unreclaimable? yes
Jul 31 08:34:26 kernel: [127533.322186] lowmem_reserve[]: 0 994 994 994

What other logs can I look at to try to trace this problem? The only reason I have my eye on the backup process is because it is the only single thing I have changed, and I did it yesterday.

I'm going to leave them enabled for tonight to see if things bomb out tomorrow again.

Thanks for any pointers.

Fri, 07/31/2009 - 20:48
pcm2a

Found the answer to this second post, so clearing it out.

Fri, 07/31/2009 - 20:59
andreychek

Well, it's unclear as to why this happened, but the above suggests that Apache was killed by the oom-killer process.

Do you have any swap setup? You'd probably want a half decent amount of that available on your box.

You could also setup a cron job to email you a "ps auxw" list throughout the night so you can have a decent idea of what's going on.

-Eric

Fri, 07/31/2009 - 21:08
pcm2a

I'll set that cron job of ps -auxw for sure, great idea. I know there was some default swap stuff that I set up when I installed Ubuntu server, but I do not remembe the exact size I set up. How can I check that to give you the exact details?

Thanks for the quick response by the way!

Fri, 07/31/2009 - 21:16
andreychek

Sure... what does the output of the command "free" show?

-Eric

Fri, 07/31/2009 - 23:01 (Reply to #5)
pcm2a

Previously free showed 1 gig of actual memory and almost 1 gig of memory in use. There was about 45 megs of available memory. At first I was alarmed by this, it is what my deleted post above was about. However, I think that the memory must not be accurately reported for virtual machines.

This is free ran on my dedicated server that I am trying to move away from: total used free shared buffers cached Mem: 1026528 1006692 19836 0 168212 408104

See how it says 1 gig used and 19 megs free? Obviously that is not being reported correctly since that server has been online and running great for a very long time.

With that said, my vmware server access says that 81megs is being used which seems more likely.

I also increased the guest os from 1 gig of ram to 2 gigs of ram: total used free shared buffers cached Mem: 2062920 762672 1300248 0 27064 176292

How can I see my current swap filesystem information?

Fri, 07/31/2009 - 23:12
andreychek

Well, the "free" command normally shows a few lines of output, and includes regular memory as well as swap. This is an example:

Mem:       1474776    1402460      72316          0      26400     455260
-/+ buffers/cache:     920800     553976
Swap:      2047992         68    2047924

In the case of the above, it's showing 1402460 of 1474776 as being used; except, the second line clarifies this by subtracting the buffers and cache from that, which aren't actually in use by any processes running on the system.

So actually, there's 920800 of regular RAM available.

The last line is how much swap is available.

If you don't have a line mentioning swap -- can you paste in the contents of your /etc/fstab file?

-Eric

Sat, 08/01/2009 - 00:23
pcm2a

You're right, I have the other lines I just wasn't thinking to paste them in here

             total       used       free     shared    buffers     cached
Mem:       2062920     894988    1167932          0      44028     244868
-/+ buffers/cache:     606092    1456828
Swap:            0          0          0

At the time I'm pasting this in here the VMWare console shows around 163mb of ram in use.

Also, here is the fstab file:

# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
# /dev/sda1
UUID=08c85ab4-91f6-4b4e-8a78-d73bf4f2bb81  /  ext3  grpquota,errors=remount-ro,usrquota,relatime,rw  0  1
# /dev/sda5
UUID=8175cf43-1ce5-4df9-89c5-67d4a2e4f628 none            swap    sw              0       0
/dev/scd0       /media/cdrom0   udf,iso9660 user,noauto,exec,utf8 0       0
/dev/fd0        /media/floppy0  auto    rw,user,noauto,exec,utf8 0       0
# Beginning of the block added by the VMware software
.host:/                 /mnt/hgfs               vmhgfs  defaults,ttl=5     0 0
# End of the block added by the VMware software
#Mount backup archive
//192.168.10.25/Backups /backup/archive cifs auto,username=######,password=#####,workgroup=workgroup,rw 0 0
Sat, 08/01/2009 - 11:44
pcm2a

Things are still up and running. Maybe we will never know or maybe it has to do with me increasing the VM memory to 2 gigs. Currently VMConsole says 122mb is in use.

The free command is reporting this:

             total       used       free     shared    buffers     cached
Mem:       2062920    1698448     364472          0      81020     978792
-/+ buffers/cache:     638636    1424284
Swap:            0          0          0

I'm not exactly sure how to do the calculations on these free numbers to get the real amount.

Nothing abnormal to report in /var/log/syslog. Nothing abnormal in my ps -auxw.

I have my script running every 15 minutes and spitting the ps -auxw into a folder dated today and pruning files more than 2 days old. I'll leave it running indefinitely until I am sure this isn't going to happen again.

Sat, 08/01/2009 - 11:47
andreychek

Well, it does appear as if you don't have any swap enabled; so any sort of spike in usage that used up your available RAM could have caused the problem you had.

If you had created a swap partition, I don't see anything in your fstab that's enabling it.

I would recommend having swap, as spikes can happen; and it's probably just a matter of time before your 2GB there runs out :-)

-Eric

Sat, 08/01/2009 - 14:11
pcm2a

I'll run gparted and take a screen shot of what the partitions look like. I'm sure I created one....I hope.

Sat, 08/01/2009 - 15:16
pcm2a

Ok, here is a link to a gparted screen shot: http://img31.imagefra.me/img/img31/2/8/1/pcm2a/f_jr8q6wtm_873f2a2.jpg

/dev/hda1, ext3, 77.91gig, 3,83 used
/dev/hda2 extended 2.08gig
    /dev/hda5, linux-swap, 2.08 gig.

This was set up automatically during the install of Ubuntu server. How do I get Ubuntu to start using this swap space?

Also here is a fdisk -l:

sudo fdisk -l
[sudo] password for pcm2a:
 
Disk /dev/sda: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x0008af06
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1       10171    81698526   83  Linux
/dev/sda2           10172       10443     2184840    5  Extended
/dev/sda5           10172       10443     2184808+  82  Linux swap / Solaris
Sat, 08/01/2009 - 15:25
pcm2a

Ok how about this one. I replaced the UUID lines in my /etc/fstab with /dev/sda1 and /dev/sda5 and rebooted.

#UUID=08c85ab4-91f6-4b4e-8a78-d73bf4f2bb81  /  ext3  grpquota,errors=remount-ro,usrquota,relatime,rw  0  1
/dev/sda1  /  ext3  grpquota,errors=remount-ro,usrquota,relatime,rw  0  1
# /dev/sda5
#UUID=8175cf43-1ce5-4df9-89c5-67d4a2e4f628 none            swap    sw              0       0
/dev/sda5 none            swap    sw              0       0
/dev/scd0       /media/cdrom0   udf,iso9660 user,noauto,exec,utf8 0       0

Now look when I run free:

             total       used       free     shared    buffers     cached
Mem:       2062920     454648    1608272          0       4996     149740
-/+ buffers/cache:     299912    1763008
Swap:      2184800          0    2184800

I think that's turned the swap on right? Should I increase my swap partition to 4 gigs since my current ram is at 2 gigs?

Sat, 08/01/2009 - 16:03
andreychek

Ahh, nice work, that did indeed enable about 2GB of swap.

Should you bump it to 4GB? Well, I'm inclined to think you're okay with your current setup, but I guess you just need to keep an eye out for how much RAM/swap you're using.

Have a good one,

-Eric

Sun, 08/02/2009 - 10:24
pcm2a

Thanks for all the help. I went ahead and repartitioned the swap to 4 gigs to make it double the ram amount. At some point will I see the swap memory being used with the "free" program?

             total       used       free     shared    buffers     cached
Mem:       2062920    1598724     464196          0      97696     880596
-/+ buffers/cache:     620432    1442488
Swap:      4281280          0    4281280

Why does it report 1.59gigs of memory used when there is barely anything actively running on the box? Certainly nothing taking up 1.59 gigs of ram. Is this just 1.59 gigs of ram that has been used in the past, but isn't active right now? VMware reports around 150mb of ram being actively used.

Maybe I'm supposed to be subtracting something from the "buffers" line?

Sun, 08/02/2009 - 12:34
ronald
ronald's picture

Why does it report 1.59gigs of memory used It isn't using it. It is holding it in cache, so the programs that used it is available immediately from the RAM. Other programs can use that RAM if needed.

As long as the RAM is sufficient, it won't use the swap. It is mainly used as burstable RAM. Temporarily used if 2 GB isn't enough.

When you have so many things going on, that your box needs to use swap for longer periods of time, then it is wise to add more RAM to the box.

Sat, 08/08/2009 - 03:40
tuaris

I had a similar issue on a similar configuration around the same time, very weird. My problem was a couple of items. First I only had 512 MB of RAM. Secondly, I found that the hard drives I was running the Virtual Machine off of were creating a serious bottleneck. I would run out of memory, and disk swapping would become so slow that the machine would come to a halt, and sometimes crash.

I had both AWStats and Webalizer on the machine for each domain, It turns out those stats where slowing things down and using up a lot of ram when they run. They were also causing some problems with duplicate emails due to the fact that Postfix would time out and resend the message to Amavisd. By adding more RAM and moving the VM to better performance disk and interface, I was able to solve the issue.

Virtual machines will do best with at least 250 to 300MB/s of read/write throughput. Avoid RAID 5 whenever possible. Use a RAID 1 or 0+1 setup instead.

Topic locked