Grid Infrastructure on RHEL5 x86 story / OOM explained

Have you ever installed and used Oracle Grid Infrastructure on a RHEL 5.x 32-bit server? Have you ever heard of the abbreviation OOM?
Let me answer these questions first. Well, no, I hadn't, until a while ago.
The background is as follows: we had some Oracle Applications running on RHEL 4 32-bit with 32 GB of RAM, and it was decided to upgrade and migrate to RHEL 5 32-bit with the same amount of memory. As part of the project, we also wanted to try out ACFS (ASM Cluster File System), now part of Oracle Cloud File System after the rebranding in January 2011, as shared storage for the application filesystem; it comes with the Grid Infrastructure 11gR2 installation. So far, so good. The servers were delivered, and it was time to get the job done …

Basically, it was just a simple two-node cluster installation with ACFS created and shared between the nodes. Moreover, the ACFS resource was also exported to some other servers over NFS. After the installation it all looked cool and simple (and even working :)). We put our applications on ACFS and started testing. To my surprise, when I arrived at the office the next morning, the applications were down (hmm, someone might have stopped them for some reason, I thought). No problem, let's start them up and continue working. The next morning an even bigger surprise was waiting for me: one node had been evicted and the server had rebooted. That looked a lot more serious, so I started investigating. It didn't take long to find the cause; one look in the clusterware alert.log and another in the server's /var/log/messages brought me to something I had never seen before.

kernel: Out of memory: Killed process 24735, UID 1200, (oracle).
kernel: [Oracle ADVM] The ASM instance terminated unexpectedly. All ADVM volumes will be taken offline.  You must close all applications using these volumes and unmount the file systems.  After restarting the instance, you may need to re-enable the volumes for use.
syslogd 1.4.1: restart.
kernel: klogd 1.4.1, log source = /proc/kmsg started.

So it was clearly shown that something had killed the cluster heartbeat process on one of the nodes, which caused the restart.
Looking deeper into the log file, I found something more:


kernel: oracle invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
kernel:  [<c045af33>] out_of_memory+0x72/0x1a3
kernel:  [<c045c49a>] __alloc_pages+0x24e/0x2cf
kernel:  [<c045c540>] __get_free_pages+0x25/0x31
kernel:  [<c04a8814>] proc_file_read+0x74/0x224
kernel:  [<c044a993>] audit_syscall_entry+0x18f/0x1b9
kernel:  [<c04a87a0>] proc_file_read+0x0/0x224
kernel:  [<c0476330>] vfs_read+0x9f/0x141
kernel:  [<c04767b6>] sys_read+0x3c/0x63
kernel:  [<c0404f4b>] syscall_call+0x7/0xb

My first reaction was: why? Out of memory with no consumption at all? And what is this oom-killer?
It turned out that there is a lot of information about it on the internet, and even MOS has a note on it [Doc ID 452000.1].
So OOM, the Out-Of-Memory killer, is an OS feature enabled by default (and it cannot be switched off): a self-protection mechanism employed by the Linux kernel under severe memory pressure. If the kernel cannot find memory to allocate when it is needed, it puts in-use user data pages on the swap-out queue, to be swapped out. If the virtual memory (VM) subsystem can neither allocate memory nor swap out in-use pages, the OOM killer may begin killing userspace processes. It sacrifices one or more processes in order to free up memory for the system when all else fails. (A quick way to see how the kernel scores candidate victims is sketched after the list below.)

The behavior of OOM killer in principle is as follows:

  • Lose the minimum amount of work done
  • Recover as much memory as it can
  • Do not kill anything that is not by itself using a lot of memory
  • Kill the minimum number of processes (ideally one)
  • Try to kill the process the user expects it to kill
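On a 2.6.18 kernel you can peek at how the OOM killer ranks its candidate victims through /proc, and you can exclude a critical process (an ASM or clusterware daemon, for example) from being killed by setting its oom_adj to -17, the documented OOM-disable value. A minimal sketch; use the exclusion with care, since it only shifts the pressure onto other processes:

# List the ten processes with the highest oom_score (the killer's preferred victims)
for dir in /proc/[0-9]*; do
    [ -r "$dir/oom_score" ] || continue
    printf "%10s %s\n" "$(cat "$dir/oom_score" 2>/dev/null)" "$(tr '\0' ' ' < "$dir/cmdline")"
done | sort -rn | head -10

# Exclude a single critical process from OOM killing (replace <pid> with a real PID)
# echo -17 > /proc/<pid>/oom_adj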

Sounds very thoughtful, doesn't it? But these were not my symptoms at all: first of all, there was no load or memory consumption on the servers at all. Secondly, it was not only killing heavy processes, but rather all kinds of processes on a seemingly random basis. Here are some more examples from the log file:


# cat /var/log/messages.2 | grep oom
Oct 24 05:11:13 kernel: FNDLIBR invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:26 kernel: oracle invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:35 kernel: oraagent.bin invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:47 kernel: nfsd invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:50 kernel: FNDLIBR invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:58 kernel: oraagent.bin invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:51:00 kernel: FNDLIBR invoked oom-killer: gfp_mask=0x200d0, order=0, oomkilladj=0
Oct 25 04:51:06 kernel: ps invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 26 19:37:46 kernel: sendmail invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0

As you can see, even the “ps” command and “sendmail” get killed when the OOM killer wants it. A cruel thing, as you will notice. Sometimes it simply killed ASM/ACFS processes, which led to a filesystem dismount and input/output errors on every server where it was mounted. In our case that was critical and a real disaster, because our application software sat exactly on ACFS. I even found a little analogy that can be used in this case to explain the OOM logic.
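(Before the analogy, a practical note: every time the OOM killer took out ASM, the ADVM volumes went offline and had to be brought back by hand once ASM was running again. A rough sketch of those steps, assuming the 11.2 command set; the diskgroup, volume, device and mount point names are made up for illustration:)

# As the grid owner: check the volume state and re-enable it once ASM is back up
asmcmd volinfo -G DATA APPSVOL
asmcmd volenable -G DATA APPSVOL

# As root: remount the ACFS filesystem manually...
mount -t acfs /dev/asm/appsvol-123 /u01/apps

# ...or let the clusterware filesystem resource do it
srvctl start filesystem -d /dev/asm/appsvol-123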

A little analogy:

[“An aircraft company discovered that it was cheaper to fly its planes  with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions, however, the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.)  A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if,  for example, the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted?  Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused…”]

I hope that story helps you understand OOM better, but let's get back to the initial topic. We had never faced anything similar on RHEL4. Is it RHEL5 and OOM, or maybe Grid Infrastructure, that is causing all these problems? Digging deeper into the causes of the killer event led me to the situation where the kernel runs out of LowMem (32-bit architectures only), MOS [Doc ID 452326.1]. This seemed to match my case exactly: we do have 32 GB of RAM, so “LowFree” in /proc/meminfo is very low while HighFree is much higher, and the OOM killer will take action in that situation. Here is the output from my server (note the HighFree and LowFree values):


# cat /proc/meminfo
MemTotal:     37436164 kB
MemFree:      33526272 kB
Buffers:        136076 kB
Cached:        1900048 kB
SwapCached:      16084 kB
Active:        3046452 kB
Inactive:       395872 kB
HighTotal:    36821180 kB
HighFree:     33449312 kB
LowTotal:       614984 kB
LowFree:         76960 kB
SwapTotal:    16383992 kB
SwapFree:     16315860 kB
Dirty:            1940 kB
Writeback:           8 kB
AnonPages:     1400696 kB
Mapped:         223416 kB
Slab:           373888 kB
PageTables:      29984 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  35102072 kB
Committed_AS:  6765332 kB
VmallocTotal:   116728 kB
VmallocUsed:     31844 kB
VmallocChunk:    84264 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

However, it appears that even on a healthy system you might see low LowFree values, and that does not necessarily mean a LowMem shortage. For example, a system with 2 GB of memory and the hugemem kernel:

MemTotal:      2061160 kB
MemFree:         10228 kB
Buffers:        119840 kB
Cached:        1307784 kB
Active:         587724 kB
Inactive:      1236924 kB
...
LowTotal:      2061160 kB
LowFree:         10228 kB

Here the system seems to be short of memory, but we can see that the buffers are high (and they can be released if needed), along with 1.24 GB of cached pages, of which 1.17 GB are inactive and can also be released if needed. Whether that actually happens depends on the workload.
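So before blaming LowMem, it is worth watching LowFree alongside what is reclaimable. A minimal sketch of a loop one could keep running on the nodes (nothing RHEL-specific beyond /proc/meminfo; the 60-second interval is arbitrary):

# Print LowFree next to the reclaimable pools, so a genuinely shrinking LowMem
# stands out from normal page-cache usage.
while true; do
    date '+%F %T'
    egrep 'LowTotal|LowFree|Buffers|^Cached|Inactive' /proc/meminfo
    echo
    sleep 60
done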

So if OOM was already introduced back in RHEL3, why do we face it on RHEL5 and not on RHEL4? Let's look at the kernel and memory situation in both versions (info taken from http://support.bull.com/ols/product/system/linux/redhat/help/kbf/g/inst/PrKB11417):

RHEL 5 situation:

The original i686 only had 32 bits with which to address the memory in a machine. Because 2^32 == 4GB, the original i686 can only address up to 4GB of memory. Some time later, Intel introduced the PAE extensions for i686. This means that although the processor still only runs 32-bit code, it can use up to 36 bits to address memory (they added more bits later on, but that's not super-relevant). So newer i686 processors could address 2^36 bytes of memory, or 64GB. However, the Linux kernel has limitations that make it essentially impossible to run with 64GB of memory reliably. Therefore, while running the Linux kernel on an i686 PAE machine, you can really only use up to about 16GB of memory reliably. To work around that limitation, RHEL4 (and RHEL3) had a special mode called “hugemem”. Among other things, it allowed the kernel to reliably address all 64GB of memory. The problem is that the hugemem patches were never accepted upstream. Because of that, and because 64-bit machines were so prevalent by the time RHEL5 was released, Red Hat decided to drop the hugemem variant from RHEL5. So in the end we have:

RHEL4 kernels:

  • i686 – no PAE, no hugemem patches, can address up to 4GB memory
  • i686-smp – PAE, no hugemem patches, can reliably run with around 16GB
  • i686-hugemem – PAE, hugemem patches, can reliably run with 64GB

RHEL5 kernels:

  • i686 – no PAE, no hugemem patches, can address up to 4GB of memory
  • i686-PAE – PAE, no hugemem patches, can reliably run with around 16GB

In summary, if customers need to use more than 16GB of memory, the absolute best suggestion is to use RHEL5 x86_64, which suffers from none of these limitations.
A secondary idea is to use RHEL4 i686 hugemem, but as RHEL4 gets closer to end of life, this becomes less and less of a good idea. Another important thing to mention here is that on RHEL5 x86 the kernel has only about 1GB of address space (LowMem) to work with (it is explained in detail here: http://kerneltrap.org/node/2450), whereas the RHEL4 hugemem kernel could use 4GB, thus supporting up to 64GB: https://lwn.net/Articles/39283/. By the way, the answer from both Oracle and Red Hat is also to go for 64-bit. But I did not believe there was no other way than moving to 64-bit. Meanwhile, the servers were bouncing every day and the OOM killer was having fun on them, so I started looking for a way to solve this on a 32-bit system.

Workaround

First of all, we tried applying the latest kernel version, but with no luck. Then I started playing with the kernel vm settings (I tried a lot of settings, values and variations; I will describe only the most important ones). I started off with lower_zone_protection (replaced by lowmem_reserve_ratio in RHEL5), which in theory should help protect against LowMem shortage (google it if you want to know more details about the setting). By default it is set to the following value:


# cat /proc/sys/vm/lowmem_reserve_ratio
256     256     32

I tried increasing the last value to 250: I added "vm.lowmem_reserve_ratio = 256 256 250" to /etc/sysctl.conf and ran sysctl -p. Again, the fix did not help. I then followed up by tweaking vm.min_free_kbytes and set it as follows:


echo "131072" > /proc/sys/vm/min_free_kbytes

This tells the kernel to try to keep 128 MB of RAM free at all times (a sketch for persisting both vm changes follows the list below).

It’s useful in two main cases:

  • Swap-less machines, where you don’t want incoming network traffic to overwhelm the kernel and force an OOM before it has time to flush any buffers.
  • x86 machines, for a similar reason: the x86 architecture only allows DMA transfers below approximately 900MB of RAM, so you can end up with the bizarre situation of an OOM error with tons of RAM free.
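Both of these runtime tweaks (the lowmem_reserve_ratio change above and this min_free_kbytes setting) are lost on reboot, so they are worth persisting. A minimal sketch using the standard sysctl mechanism, with the values used in this post:

# Append the settings to /etc/sysctl.conf (values are the ones experimented with here)
cat >> /etc/sysctl.conf <<'EOF'
vm.lowmem_reserve_ratio = 256 256 250
vm.min_free_kbytes = 131072
EOF

# Load and verify the new settings
sysctl -p
sysctl vm.lowmem_reserve_ratio vm.min_free_kbytes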

So it looks like it might be useful in our case as well. Let's compare the meminfo output after adding this setting:


# cat /proc/meminfo  | grep Low
LowTotal:       713504 kB
LowFree:        196500 kB

LowFree is now more than twice what it was before the tweaking.

After that, I started waiting for the morning and for OOM (with my fingers crossed, of course). Surprisingly, the OOM killer did not show up that night. I waited a few more days, and everything remained fine, with no OOM under the same circumstances. I did not stop at that point and also took a look at the problem from a Grid Infrastructure point of view. Basically, I wanted to review the bug fixes in the latest PSUs for the clusterware (initially we had version 11.2.0.2) and look for memory-related issues. Among the bugs fixed by the 11.2.0.2 Grid Infrastructure PSU 3 (Patch 12419353), I found some interesting ones (a way to check whether such fixes are already installed is sketched after the list):

  • 10065216 – VIRTUAL MEMORY USAGE OF ORAROOTAGENT IS BIG(1321MB) AND NOT DECREASING
  • 10168006 – ORAAGENT PROCESS MEMORY GROWTH PERIODICALLY.
  • 10015210 – OCTSSD LEAK MEMORY 1.7M HR ON PE MASTER DURING 23 HOURS RUNNI
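Before applying a PSU it is worth checking whether any of these fixes are already present in the Grid home; opatch can list the bugs fixed by the installed patches. A sketch, assuming an 11.2-style Grid home path and an OPatch version that supports the -bugs_fixed option:

# Run as the Grid Infrastructure owner; the home path is just an example
GRID_HOME=/u01/app/11.2.0/grid

# List installed patches with the bugs they fix, then look for the ones above
$GRID_HOME/OPatch/opatch lsinventory -bugs_fixed -oh $GRID_HOME | grep -E '10065216|10168006|10015210'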

I decided to apply the latest PSU, which was PSU 3 at that moment (by now it would be worth applying 11.2.0.3 as well). Since then I have never seen OOM on those servers. Maybe part of the fault was also on the Clusterware side, but rest assured that Oracle will not be eager to solve any problems while you are running all of this on 32-bit. It is obvious that because of these memory limitations in RHEL5 we will have to consider upgrading to 64-bit, but it was still a great experience to solve this on 32-bit and to learn a lot of new things along the way. I hope that someone who reads this will find it useful too.

The real solution

Note: there is no real magic number for the values I tuned; it all depends on the server specification. Sometimes a change helps, but sometimes it can even lead to performance degradation and other problems. Remember that this is an architectural memory limitation we are trying to bypass. The real solution will always be moving to x86_64.


4 thoughts on “Grid Infrastructure on RHEL5 x86 story / OOM explained”

  1. Thanks, bro.
    Hit exactly the same situation with 2.6.18-194.el5xen i686 and 24Gb RAM. Don't ask me why the customer wants exactly 32-bit.
    Hope the workaround will help.

  2. Hi Vladis,
    Thanks, I am glad you found this article useful. You can contact me directly by email if you need some more hints. But I would still suggest trying to convince the customer to go for x86_64; otherwise, there is no guarantee of system stability on 32-bit. I don't see any major reason why Oracle still supports products like the Database or Grid Infrastructure on 32-bit…

  3. Hi Andrejs,

    There is also another possible solution, i.e. monit, with which we can manage memory and CPU utilization.
    What do you think, will it help?

    Thanks-
    Shrikant

  4. Hi Shrikant,

    I am not sure I follow your suggestion. Even if you monitor the resources, you will still have to free them up by eliminating some of the top consumers. Or, even worse, OOM will still kick in even after that (I have seen similar situations). The best solution is to migrate to the x86_64 architecture.

    Regards,
    Andrejs.
