VPS crashes with the dreaded oom message....

Discussion in 'Server Configuration and Hosting' started by craigiri, Mar 26, 2014.

  1. craigiri

    craigiri Well-Known Member

    My 1G Debian VPS (running XF and WP - a light traffic site with a small db) crashes once every couple of weeks. It seems to be arbitrary, as load, traffic, and CPU use are never heavy. Basically it hangs and shows an incredibly heavy load until mysql is shut down manually and then restarted. But it's not (IMHO) mysql running out of memory - just that the whole system seems to panic and overload.

    The OOM message shows that Apache gets it.....
    "apache2 invoked oom-killer"

    This happened before I doubled the RAM - and I also use caches for the wordpress and flush RAM once in a while. In other words, I never see the system eating up available memory - the error seems to come out of nowhere....

    Looking around the web, I saw some folks mention that it could be a fault of Linux stock settings - that they are not overcommitting:
    http://www.hskupin.info/2010/06/17/how-to-fix-the-oom-killer-crashe-under-linux/comment-page-1/

    Anyone else have experience with this syndrome? It seems that if it were default behavior in Linux, it would be happening to lots of people!
     
  2. MattW

    MattW Well-Known Member

    Do you have any active monitoring on the VPS, so you can see how RAM is being used over time?

    I'd suggest sticking your VPS on here while it's still in free open beta: https://nodequery.com/

    Then you can at least monitor its usage over time, and see if it does tie in with anything:

    upload_2014-3-26_8-46-31.png

    EDIT: Since things are getting killed off with OOM errors, I also take it you either don't have any swap or are using all your swap up.
     
  3. Tracy Perry

    Tracy Perry Well-Known Member

    Nice find. I have munin-node installed on all my VPS's but I'm going to give this a try also. (y)
     
  4. Slavik

    Slavik XenForo Moderator Staff Member

    Found your problem! ;)
     
  5. Tracy Perry

    Tracy Perry Well-Known Member

    Really? :p
    These show some down time due to being recently created on my ProxMox server (which, by the way, uses Debian as its core :whistle: ) and being rebooted during testing. The one that has been up for 30 days was rebooted after I installed CSF and wanted to make sure that I had not completely locked myself out upon a reboot of the server (which I had done before). Before that it had been up for 138 days. Most of the VPS's are using OpenLiteSpeed now for the httpd server.

    screenshot.png screenshot1.png
     
  6. Slavik

    Slavik XenForo Moderator Staff Member

    I'll get my coat :)

    Though in my defence, one of my CentOS servers is coming up on 5 years uptime I think, I had one that was longer but a drive failure meant it had to be powered off for a replacement :(
     
  7. Tracy Perry

    Tracy Perry Well-Known Member

    Just as long as you take your boots along also. :D
    It might get deep when you go talking OS's! Food Fight!!!

    Usually the only time I reboot is when a kernel update comes out. I just can't leave 'em well enough alone. Also, instead of doing a snapshot of my VPS's when I back them up I do a stop of the VPS (it's only down for about 20 seconds) but it reflects as the server being shut down - which it is in fact.
     
  8. craigiri

    craigiri Well-Known Member

    Memory is plenty free - swap is there but not used very much.....

    Only programs are the usual - mostly apache and mysql......

    As I said, it doesn't seem to be anything in the normal course of affairs - I even flush the caches regularly. Memory seems to be using about 1/2 the ram and no traffic or cron or anything else seems to be happening at the time it crashes.

    Here is a top screen shot - as you can see, it's not really a busy server. Based on stats, it's usually not being called on at all or perhaps delivering one page - two at the most.

    Screen Shot 2014-03-26 at 9.38.02 AM.png
     
  9. WSWD

    WSWD Well-Known Member

    What are the actual logs showing? Not the OOM message, but the actual VPS logs.
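
    If it helps with finding them, here is a minimal sketch for pulling the OOM-related lines out of a kernel log. It assumes a stock Debian rsyslog setup where kernel messages end up in /var/log/kern.log; that path and the function name oom_events are my assumptions, so pass a different file if your VPS logs elsewhere.

```shell
# List OOM-killer events recorded in a kernel log file.
# Assumption: kernel messages are written to /var/log/kern.log (stock Debian
# rsyslog layout); pass another path as the first argument if that differs.
oom_events() {
    log_file="${1:-/var/log/kern.log}"
    # Match both the "invoked oom-killer" line and "Out of memory" kill lines.
    grep -iE 'invoked oom-killer|out of memory' "$log_file"
}
```

    Running oom_events with no argument reads the default log; the lines just before each match are usually the interesting part.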
     
  10. craigiri

    craigiri Well-Known Member

    I'm not sure where those logs are......if they are the messages, they show almost nothing until the oom, then all kinds of things afterward (stats, memory stuff)...but that's after the panic.

    I installed the monitor suggested above and will see if it reports anything on the next crash (or before).
     
  11. WSWD

    WSWD Well-Known Member


    Yes...that's more or less the logs I was referring to. You won't always find something in there, unfortunately.

    I honestly don't think this is all that mysterious, if your hearth.com forum is hosted there. You only have 1/2 gig of free RAM, and your forum seems reasonably busy. Apache is what's being killed off, which usually means that is the resource that is using the most memory (processes are usually killed off in order of resource usage, when you start getting OOM errors).

    If your hearth.com forum isn't hosted there, I'm honestly at a loss as to what might be causing this, outside of a minor "DDoS attack" of sorts, or bots visiting your site en masse.
     
  12. craigiri

    craigiri Well-Known Member

    Hearth.com is on a dedicated server and has never crashed....

    This is the drone site which receives about 250-400 page views per hour at max. Load is rarely above .1 - yes, that's point 1, and CPU use of the single VPS slice is usually 0-15%. Total data is quite small in the db's.

    If it were bots, I'd suspect that google or my apache logs or even XF might show it?

    So, it looks like we are stumped for now......maybe that server monitoring service will show something....I'll report back. It looks like it happens about every two weeks, although not on any particular schedule.
     
  13. WSWD

    WSWD Well-Known Member


    Yeah, that doesn't really make sense then. I could see it happening if hearth.com was there, but if it's just the drone site, that shouldn't really cause that. Bots would likely show up in the logs. Are there any known issues perhaps with the kernel you're using? The only time I have ever experienced something like this (where the logs really show nothing) is on a VPS server, and one particular version of the kernel did not like a piece of hardware we had installed. Would throw OOM errors and then kernel panic. This was on a machine that had almost 200GB of free RAM. It was obviously never coming anywhere close to running out of RAM, which is why the OOM errors really threw us.
     
  14. craigiri

    craigiri Well-Known Member

    This is probably the situation....
    How did you narrow down the problem?
     
  15. WSWD

    WSWD Well-Known Member

    We actually didn't until afterward. There were absolutely no indications of anything crazy, nothing in the logs, etc. except for the entries saying the machine had run out of memory, which we know was not true. We even hired a couple external admin teams to take a look at things with fresh eyes, and they didn't find anything more than what we did. We went with the kernel update simply because it was time to update, and that did the trick. After that, we built a similar test system with identical hardware, put on the same kernel, and it started having the same problems.

    Take all that for what it's worth, because you're on a VPS, and it might be something COMPLETELY different. But you might just want to try updating the kernel and see what happens.
     
  16. Tracy Perry

    Tracy Perry Well-Known Member

    When they finally go paid, I will probably have to go ahead and pony up the money. :whistle:
    screenshot.png
     
  17. p4guru

    p4guru Well-Known Member

    Nah the OOM problem can happen on CentOS as well. Surprised no one on here has ever encountered it, considering a lot of folks have the server's swap allocation incorrectly set.

    @craigiri are you on 32bit or 64bit OS ? post the output of your meminfo command

    Code:
    cat /proc/meminfo
    
    Check your CommitLimit and Committed_AS values. If you have a lot of memory, it won't all be used if your CommitLimit is below your physically available memory size. Given your screenshot top output I'm guesstimating your CommitLimit is around ~1418138 KB
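
    A quick way to compare the two values is to subtract one from the other. This is only a sketch: the function name commit_headroom_kb is made up for illustration, and it assumes the usual "Name: value kB" layout of /proc/meminfo.

```shell
# Report how far Committed_AS is below CommitLimit, in kB.
# Assumption: the input follows the usual /proc/meminfo "Name: value kB" layout.
commit_headroom_kb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used = $2 }
         END { print limit - used }' "$meminfo"
}
```

    A small or negative number means the system has promised out about as much memory as the limit allows.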

    also check sar memory %commit

    Code:
    sar -r
    
    Solution = set your swap file size to at least 2x your physically installed memory size ;) So in craigiri's case set swap at least to 2GB in size.

    Basically, system wide memory usage (taking into account overcommitted memory) is hitting your CommitLimit and triggering the OOM process, which looks for the biggest memory using app, which is Apache, and kills it. You need to properly tune both system and MySQL/Apache to work under your existing server memory and CommitLimit so you don't trigger the OOM process.
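
    For anyone nervous about touching swap, here is a rough sketch that only prints the usual swap file commands so they can be reviewed first. The path /swapfile and the function name swap_plan are hypothetical, and it assumes your VPS lets the guest manage its own swap, which some container-based plans do not.

```shell
# Print (not run) the commands for adding a swap file of the given size in MB.
# Assumptions: /swapfile is a hypothetical path, the VPS allows guest-managed
# swap, and each printed command is reviewed before being run as root.
swap_plan() {
    size_mb="$1"
    path="${2:-/swapfile}"
    echo "dd if=/dev/zero of=$path bs=1M count=$size_mb"   # allocate the file
    echo "chmod 600 $path"                                 # root-only access
    echo "mkswap $path"                                    # format as swap
    echo "swapon $path"                                    # enable immediately
    echo "echo '$path none swap sw 0 0' >> /etc/fstab"     # keep after reboot
}
```

    For the 2GB mentioned above that would be swap_plan 2048. Nothing is changed on the system until the printed commands are run by hand.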
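
    To see where a CommitLimit figure comes from, here is a small sketch of the arithmetic: RAM times vm.overcommit_ratio (default 50) plus swap, worked out in pages. It assumes 4 kB pages, no reserved huge pages, and that vm.overcommit_kbytes is unset; the function name commit_limit_kb is made up. As far as I understand the kernel docs, the limit is only strictly enforced when vm.overcommit_memory is set to 2, so check that setting too.

```shell
# Estimate CommitLimit in kB from RAM and swap sizes:
#   CommitLimit = RAM * overcommit_ratio / 100 + swap   (computed in pages)
# Assumptions: 4 kB pages, no huge pages reserved, overcommit_kbytes unset.
commit_limit_kb() {
    mem_kb="$1"
    swap_kb="$2"
    ratio="${3:-50}"   # 50 is the kernel's default vm.overcommit_ratio
    echo $(( (mem_kb / 4 * ratio / 100 + swap_kb / 4) * 4 ))
}
```

    With the MemTotal and SwapTotal figures posted further down this thread, commit_limit_kb 1034412 901112 prints 1418316, which matches the CommitLimit line in that same output. It also shows why more swap raises the limit.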

    I've worked with multi cluster large forum setups and even with 256GB of memory, you can hit OOM if your system isn't properly configured :)
     
    Last edited: Mar 27, 2014
  18. craigiri

    craigiri Well-Known Member

    Looks like 64 bit - monitoring, as I suspected, shows relatively light use....about a total of 1.2 G in and out in 24 hours, only using 1/3 or so of the RAM. System and CPU use seem to be about 10% or less taken over a long period.

    I noticed it has the mkswap commands, etc......wonder if this is something I should attempt to increase or if it's too easy to screw something up!

    MemTotal: 1034412 kB
    MemFree: 621540 kB
    Buffers: 66028 kB
    Cached: 124028 kB
    SwapCached: 88 kB
    Active: 232756 kB
    Inactive: 156972 kB
    Active(anon): 138960 kB
    Inactive(anon): 91600 kB
    Active(file): 93796 kB
    Inactive(file): 65372 kB
    Unevictable: 0 kB
    Mlocked: 0 kB
    HighTotal: 143304 kB
    HighFree: 312 kB
    LowTotal: 891108 kB
    LowFree: 621228 kB
    SwapTotal: 901112 kB
    SwapFree: 901024 kB
    Dirty: 44 kB
    Writeback: 0 kB
    AnonPages: 199588 kB
    Mapped: 46860 kB
    Shmem: 30884 kB
    Slab: 15432 kB
    SReclaimable: 11172 kB
    SUnreclaim: 4260 kB
    KernelStack: 816 kB
    PageTables: 1732 kB
    NFS_Unstable: 0 kB
    Bounce: 0 kB
    WritebackTmp: 0 kB
    CommitLimit: 1418316 kB
    Committed_AS: 664480 kB
    VmallocTotal: 122880 kB
    VmallocUsed: 6168 kB
    VmallocChunk: 109124 kB
    HardwareCorrupted: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 4096 kB
    DirectMap4k: 16376 kB
    DirectMap4M: 888832 kB
    Screen Shot 2014-03-27 at 8.24.47 AM.png
     
  19. craigiri

    craigiri Well-Known Member

    BTW, that monitoring service may look cool, but I find it fairly useless....unless I am missing something, it keeps no archives longer than a few hours. In other words, you can't find out what happened even yesterday....or the tendencies over time.....

    You'd think it would have many more settings - maybe after beta it will. If it did everything in one place, such as seeing if the server was down and sending alerts at "x" (RAM use, disk use, etc.), it would be great.
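
    In the meantime, a longer history can be kept by hand. This is a minimal sketch meant to be run from cron; the function name mem_sample and the log path in the note below are hypothetical, and it assumes the usual "Name: value kB" layout of /proc/meminfo.

```shell
# Print one timestamped memory sample; append it to a log file from cron.
# Assumption: the input follows the usual /proc/meminfo "Name: value kB" layout.
mem_sample() {
    meminfo="${1:-/proc/meminfo}"
    awk -v ts="$(date '+%Y-%m-%d %H:%M:%S')" '
        /^(MemFree|SwapFree|Committed_AS):/ { v[$1] = $2 }
        END {
            printf "%s MemFree=%s SwapFree=%s Committed_AS=%s\n",
                   ts, v["MemFree:"], v["SwapFree:"], v["Committed_AS:"]
        }
    ' "$meminfo"
}
```

    Something like mem_sample >> /var/log/mem_history.log every few minutes would leave a record of what memory looked like in the run-up to the next crash.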
     
  20. Tracy Perry

    Tracy Perry Well-Known Member

    What I like about it is it is a more compact version of my munin install. If I want to pull up that much data, I'll hit my munin status page. For down and dirty it's handy.
     
