Have you ever installed and used Oracle Grid Infrastructure on a RHEL 5.x 32-bit server? Have you ever heard of an abbreviation such as OOM?
Let me answer these questions first: well, no, I hadn’t – until a short while ago.
The background is as follows – we had some Oracle Applications running on RHEL 4 32-bit with 32 GB RAM, when it was decided to upgrade and migrate to RHEL 5 32-bit with the same amount of memory. As part of it, we also wanted to try out ACFS (ASM Cluster File System) – actually a part of Oracle Cloud File System now, after the rebranding in January 2011 – as shared storage for the application filesystem; it comes together with the Grid Infrastructure 11gR2 installation. So far, so good. The servers were delivered, and it was time to get the job done …
Basically, it was just a simple two-node cluster installation with ACFS created and shared between the nodes. Moreover, the ACFS resource was also exported to some other servers over NFS. After the installation it all looked cool and simple (and even working). We put our applications on ACFS and started testing. To my surprise, when I arrived at the office the next morning, the applications were down (hmm, someone might have stopped them for some reason – I thought). No problem, let’s start them up and continue working. The next morning an even bigger surprise was waiting for me – one node had been evicted and the server had rebooted. That looked a lot more serious to me, so I started investigating. It didn’t take long to find the cause: one look in the clusterware alert.log and another in the server’s /var/log/messages brought me to something I’d never seen before.
kernel: Out of memory: Killed process 24735, UID 1200, (oracle).
kernel: [Oracle ADVM] The ASM instance terminated unexpectedly. All ADVM volumes will be taken offline. You must close all applications using these volumes and unmount the file systems. After restarting the instance, you may need to re-enable the volumes for use.
syslogd 1.4.1: restart.
kernel: klogd 1.4.1, log source = /proc/kmsg started.
So it was clearly shown that something had killed a cluster heartbeat process on one of the nodes, which caused the restart.
Looking deeper in the log file I’ve found something more:
kernel: oracle invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
kernel: [<c045af33>] out_of_memory+0x72/0x1a3
kernel: [<c045c49a>] __alloc_pages+0x24e/0x2cf
kernel: [<c045c540>] __get_free_pages+0x25/0x31
kernel: [<c04a8814>] proc_file_read+0x74/0x224
kernel: [<c044a993>] audit_syscall_entry+0x18f/0x1b9
kernel: [<c04a87a0>] proc_file_read+0x0/0x224
kernel: [<c0476330>] vfs_read+0x9f/0x141
kernel: [<c04767b6>] sys_read+0x3c/0x63
kernel: [<c0404f4b>] syscall_call+0x7/0xb
My first reaction was like why? (out-of-memory with no consumption at all?) And what’s that oom-killer?
It appeared that there is a lot of information about this on the internet, and even MoS has it [Doc ID 452000.1].
So the OOM or Out-Of-Memory killer is an OS feature enabled by default (and it cannot be switched off) – a self-protection mechanism employed by the Linux kernel when under severe memory pressure. If the kernel cannot find memory to allocate when it’s needed, it puts in-use user data pages on the swap-out queue, to be swapped out. If the Virtual Memory (VM) subsystem can neither allocate memory nor swap out in-use memory, the Out-of-Memory killer may begin killing userspace processes. It will sacrifice one or more processes in order to free up memory for the system when all else fails.
The behavior of OOM killer in principle is as follows:
- Lose the minimum amount of work done
- Recover as much memory as it can
- Do not kill any process that is not itself using a lot of memory
- Kill the minimum amount of processes (one)
- Try to kill the process the user expects to kill
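These heuristics can be nudged from userspace: on kernels of this era, writing -17 (OOM_DISABLE) to /proc/&lt;pid&gt;/oom_adj exempts a process from being selected as a victim. A minimal sketch of my own – the proc-root parameter exists only so the function can be tried against a fake tree, and protecting the ASM pmon this way is my example, not something prescribed by Oracle:

```shell
# Exempt one process from the OOM killer by writing -17 (OOM_DISABLE)
# to its oom_adj file. Needs root on a live system.
protect_from_oom() {
    # $1 = pid, $2 = proc root (defaults to /proc); returns 1 if not writable
    root=${2:-/proc}
    adj="$root/$1/oom_adj"
    [ -w "$adj" ] || return 1
    echo -17 > "$adj"
}

# Live use (as root), e.g. for the ASM pmon process:
# protect_from_oom "$(pgrep -f asm_pmon | head -n 1)"
```

Note this only moves the problem around – with LowMem exhausted, the killer will simply pick a different victim.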
Sounds very thoughtful, doesn’t it? But these were not my symptoms at all. First of all, there was no load or memory consumption on the servers. Secondly, it was not killing only some heavy processes, but rather seemingly random ones. Here are some more examples from the log file:
# cat /var/log/messages.2 | grep oom
Oct 24 05:11:13 kernel: FNDLIBR invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:26 kernel: oracle invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:35 kernel: oraagent.bin invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:47 kernel: nfsd invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:50 kernel: FNDLIBR invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:50:58 kernel: oraagent.bin invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 25 04:51:00 kernel: FNDLIBR invoked oom-killer: gfp_mask=0x200d0, order=0, oomkilladj=0
Oct 25 04:51:06 kernel: ps invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Oct 26 19:37:46 kernel: sendmail invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
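A quick way to get an overview of which processes trigger the killer most often is a small pipeline – a sketch of my own, relying on the log path and message format shown in the excerpt above:

```shell
# Count "invoked oom-killer" events per process name, most frequent first.
oom_summary() {
    grep 'invoked oom-killer' \
        | awk '{for (i = 1; i < NF; i++) if ($(i+1) == "invoked") print $i}' \
        | sort | uniq -c | sort -rn
}

# On a live box:
# oom_summary < /var/log/messages.2
```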
As you can see, even the “ps” command and “sendmail” get caught up when OOM wants it – a cruel thing, as you may notice. Sometimes it simply killed the ASM/ACFS processes, which led to the filesystem being dismounted and to input/output errors on every server where it was mounted. In our case that was a real disaster, because our application software sat exactly on ACFS. I’ve even thought of an analogy that can be used to explain the OOM logic.
A little analogy:
["An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions, however, the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if, for example, the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused..."]
I would be glad if that story helps you understand OOM better, but let’s get back to the initial topic. We’d never faced anything similar on RHEL4. Is it RHEL5 and OOM, or maybe Grid Infrastructure, that is causing all these problems? Looking deeper into the causes of the killer event led me to the situation where the kernel runs out of LowMem (32-bit architectures only), MoS [Doc ID 452326.1]. This seemed to match my case: we have 32 GB of RAM, so “LowFree” in /proc/meminfo is very low while HighFree is much higher, and the OOM killer will take action in that situation. Here is the output from my server (note the HighFree and LowFree values):
# cat /proc/meminfo
MemTotal: 37436164 kB
MemFree: 33526272 kB
Buffers: 136076 kB
Cached: 1900048 kB
SwapCached: 16084 kB
Active: 3046452 kB
Inactive: 395872 kB
HighTotal: 36821180 kB
HighFree: 33449312 kB
LowTotal: 614984 kB
LowFree: 76960 kB
SwapTotal: 16383992 kB
SwapFree: 16315860 kB
Dirty: 1940 kB
Writeback: 8 kB
AnonPages: 1400696 kB
Mapped: 223416 kB
Slab: 373888 kB
PageTables: 29984 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 35102072 kB
Committed_AS: 6765332 kB
VmallocTotal: 116728 kB
VmallocUsed: 31844 kB
VmallocChunk: 84264 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
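Since LowFree is the number to watch on these boxes, a tiny watchdog helper can flag a shortage before the killer does. This is my own sketch (the threshold and message format are arbitrary choices of mine); it reads /proc/meminfo-style text on stdin:

```shell
# Warn when LowFree drops below a threshold given in kB.
check_lowmem() {
    # $1 = warning threshold in kB; reads /proc/meminfo-style text on stdin
    free_kb=$(awk '/^LowFree:/ {print $2}')
    if [ "$free_kb" -lt "$1" ]; then
        echo "WARNING: LowFree=${free_kb} kB below ${1} kB"
    else
        echo "OK: LowFree=${free_kb} kB"
    fi
}

# Typical use on a live system, e.g. from cron:
# check_lowmem 51200 < /proc/meminfo
```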
However, it appears that even on a healthy system you might see low LowFree values, and that does not necessarily mean a LowMem shortage. For example, a system with 2GB of memory and the hugemem kernel:
MemTotal: 2061160 kB
MemFree: 10228 kB
Buffers: 119840 kB
Cached: 1307784 kB
Active: 587724 kB
Inactive: 1236924 kB
...
LowTotal: 2061160 kB
LowFree: 10228 kB
Here the system seems to be short of memory, but we can see that buffers are high (and they can be released if needed), along with 1.24 GB of cached pages, of which 1.17 GB are inactive and can also be released if needed. Whether they really can be depends on the workload.
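That arithmetic can be made concrete: the memory that is effectively available is MemFree plus the reclaimable Buffers and Cached pages. A throwaway helper of mine:

```shell
# Sum MemFree + Buffers + Cached (in kB) from /proc/meminfo-style input,
# i.e. the memory the kernel could free without touching swap.
effective_free_kb() {
    awk '/^(MemFree|Buffers|Cached):/ {sum += $2} END {print sum}'
}

# On a live box:
# effective_free_kb < /proc/meminfo
```

For the 2GB hugemem example above this gives roughly 1.4 GB effectively free, despite MemFree showing only 10 MB.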
So if OOM was already introduced in RHEL3, why do we face it on RHEL5 and not on RHEL4? Let’s look at the kernel and memory situation in both versions (info taken from http://support.bull.com/ols/product/system/linux/redhat/help/kbf/g/inst/PrKB11417):
The original i686 only had 32 bits with which to address the memory in a machine. Because 2^32 == 4GB, the original i686 can only address up to 4GB of memory. Some time later, Intel introduced the PAE extensions for i686. This means that although the processor still only runs 32-bit code, it can use up to 36 bits to address memory (they added more bits later on, but that’s not super-relevant). So a newer i686 processor could address 2^36 bytes of memory, or 64GB. However, the Linux kernel has limitations that make it essentially impossible to run with 64GB of memory reliably. Therefore, while running the Linux kernel on an i686 PAE machine, you can really only use up to about 16GB of memory reliably. To work around that limitation, RHEL4 (and RHEL3) had a mode called “hugemem”. Among other things, it allowed the kernel to reliably address all 64GB of memory. The problem is that the hugemem patches were never accepted upstream. Because of that, and because 64-bit machines were so prevalent by the time RHEL5 was released, Red Hat decided to drop the hugemem variant from RHEL5. So in the end we have:
In RHEL 4 (and RHEL 3):
- i686 – no PAE, no hugemem patches, can address up to 4GB of memory
- i686-smp – PAE, no hugemem patches, can reliably run with around 16GB
- i686-hugemem – PAE, hugemem patches, can reliably run with 64GB

In RHEL 5:
- i686 – no PAE, no hugemem patches, can address up to 4GB of memory
- i686-PAE – PAE, no hugemem patches, can reliably run with around 16GB
In summary, if customers need to use > 16GB of memory, the absolute best suggestion is to use RHEL5 x86_64, which suffers from none of these limitations.
A secondary idea is to use RHEL4 i686 hugemem, but as RHEL4 gets closer to end-of-life, this becomes less and less of a good idea. Another important thing to mention is that the standard RHEL5 x86 kernel has only about 1GB of address space for the kernel itself – it is explained in detail here – http://kerneltrap.org/node/2450, whereas the RHEL4 hugemem kernel gave the kernel a full 4GB, thus supporting up to 64GB – https://lwn.net/Articles/39283/. By the way, the answer from both Oracle and Red Hat is also to go for 64-bit. But I didn’t believe there was no other way than moving to 64-bit. Meanwhile, the servers were bouncing every day and the OOM killer was having fun on them. I started looking for a way to solve this on a 32-bit system.
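On a live box you can tell which of these variants you are actually running from the kernel release string. A toy helper of my own – the classification just restates the list above, and the release strings in the test comments follow the usual RHEL naming convention:

```shell
# Classify a 32-bit RHEL kernel release string into the variants above,
# e.g. "2.6.9-89.ELhugemem" (RHEL4) or "2.6.18-238.el5PAE" (RHEL5).
kernel_variant() {
    case "$1" in
        *hugemem*)   echo "hugemem (PAE + hugemem patches, ~64GB)" ;;
        *PAE*|*smp*) echo "PAE (no hugemem, ~16GB reliable)" ;;
        *)           echo "plain i686 (up to 4GB)" ;;
    esac
}

# Live use:
# kernel_variant "$(uname -r)"
```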
First of all, we tried applying the latest kernel version – no luck. Then I started playing with the vm kernel settings (I tried a lot of settings/values/variations; I’ll describe only the most important ones). I started off with lower_zone_protection (replaced by lowmem_reserve_ratio in RHEL5), which in theory should help protect against a LowMem shortage (google it if you want more details about the setting). By default it is set to the following value:
# cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32
I tried increasing the last value to 250 by adding “vm.lowmem_reserve_ratio = 256 256 250” to /etc/sysctl.conf (and running sysctl -p). Again, the fix did not help. I then followed up with vm.min_free_kbytes tweaking and set it as follows:
echo "131072" > /proc/sys/vm/min_free_kbytes
This tells the kernel to try and keep 128MB of RAM free at all times.
It’s useful in two main cases:
- Swap-less machines, where you don’t want incoming network traffic to overwhelm the kernel and force an OOM before it has time to flush any buffers.
- x86 machines, for the same reason: the x86 architecture only allows DMA transfers below approximately 900MB of RAM. So you can end up with the bizarre situation of an OOM error with tons of RAM free.
So it looks like it might be useful in our case too. Let’s compare the meminfo output after adding this setting:
# cat /proc/meminfo | grep Low
LowTotal: 713504 kB
LowFree: 196500 kB
That is more than twice the value from before the tweaking.
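Since the echo above only changes the running kernel, the value also needs to go into /etc/sysctl.conf to survive a reboot – the same way the lowmem_reserve_ratio change was persisted earlier. A sketch (run as root):

```shell
# Persist the min_free_kbytes tuning across reboots (run as root).
echo "vm.min_free_kbytes = 131072" >> /etc/sysctl.conf
sysctl -p                               # reload sysctl.conf
cat /proc/sys/vm/min_free_kbytes        # verify the running value
```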
After that, I started waiting for the morning and for OOM (with my fingers crossed, of course). Surprisingly, the OOM killer did not show up that night. I waited a few more days, and everything remained fine, without OOM, under the same circumstances. I didn’t stop at that point and also took a look at the problem from the Grid Infrastructure point of view. Basically, I wanted to review the bug fixes in the latest PSUs for the clusterware (initially we had the base 11.2.0.2 version) and look for memory issues. Among the bugs fixed by Grid Infrastructure PSU 3 (Patch 12419353), I found some interesting ones:
- 10065216 – VIRTUAL MEMORY USAGE OF ORAROOTAGENT IS BIG(1321MB) AND NOT DECREASING
- 10168006 – ORAAGENT PROCESS MEMORY GROWTH PERIODICALLY.
- 10015210 – OCTSSD LEAKS MEMORY 1.7M/HR ON PE MASTER DURING 23 HOURS RUNNING
I decided to apply PSU 3, the latest at that moment (by now it would be worth applying the most recent PSU as well). Since then I have never met OOM on those servers again. Maybe part of the fault was also on the clusterware side, but rest assured Oracle will not solve any such problems while you’re running on 32-bit. It is obvious that because of these memory limitations in RHEL5 we’ll have to consider upgrading to 64-bit, but it was still a great experience to solve this on 32-bit and to learn a lot of new things along the way. I hope that someone who reads this will find it useful too.
The real solution
Note: there is no real magic number for the values I’ve tuned – it all depends on the server specification. One time it might help, but another time it may even lead to performance degradation and other problems. Remember, this is an architectural memory limitation we are trying to bypass; the real solution will always be moving to x86_64.