Posted in Monitoring, Troubleshooting | Posted on 16-05-2013|
In this blog post, I will describe two memory-related issues that we have recently experienced on our 190-node Apache Hadoop cluster at Spotify.
We noticed that some nodes were suddenly marked dead by both the NameNode and the JobTracker. Although we could ping them, we were unable to ssh into them, which often suggests a really heavy load on these machines. When looking at Ganglia graphs, we discovered that all the nodes marked dead had one issue in common – heavy swapping (in the case of Apache Hadoop, practice shows that heavy swapping of a JVM process usually means some kind of unresponsiveness or even “death”).
Surprisingly, only servers newly added to the cluster were swapping, while the “old” servers were running perfectly fine. Such an observation almost always means some type of server misconfiguration, and this is one of the reasons why it is good to know which nodes were added to the cluster, and when ;)
At Spotify, we use Debian 6.0.7 (squeeze), which ships with the following default values for swap-related settings:

swap space size = 10GB
vm.swappiness = 60
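If you want to check what a node is actually configured with, these values can be read straight from /proc on any Linux box (your output will of course differ from the Squeeze defaults above):

```shell
# Inspect the swap configuration on a node (Linux; values differ per
# machine – the Debian Squeeze defaults above were 10GB and 60)
cat /proc/sys/vm/swappiness      # kernel's eagerness to swap (0..100)
cat /proc/swaps                  # configured swap areas and their sizes
grep -i '^Swap' /proc/meminfo    # swap total/free/cached, in kB
```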
This issue shows that the values above are not ideal, since they allowed some of our servers to swap. Let’s leave them aside for a moment in order to ask one key question – why were these nodes actually swapping?
Obviously, the answer is that the physical memory on these nodes was exhausted by some memory-hungry processes. Normally it is not so easy to quickly identify these processes; however, we got a tip from one of our data engineers, who had submitted a really memory-intensive job at that time.
This memory-intensive job spawned 2000 map tasks, which is around ~10 tasks per node on our 190-node cluster at Spotify. In order to optimize the sort and shuffle phase and avoid spilling to disk in the map phase, this user increased the memory assigned to each task.
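The back-of-the-envelope math is simple; the per-task heap and per-node RAM figures below are hypothetical (the exact values are not given here), but the shape of the problem is the same:

```shell
# Rough sanity check: does the per-node task memory fit in RAM?
TASKS=2000; NODES=190
TASKS_PER_NODE=$(( (TASKS + NODES - 1) / NODES ))   # ceil(2000/190) = 11
HEAP_MB=2048    # assumed -Xmx per task (hypothetical)
RAM_MB=16384    # assumed physical RAM per node (hypothetical)
NEED_MB=$(( TASKS_PER_NODE * HEAP_MB ))
echo "${TASKS_PER_NODE} tasks/node * ${HEAP_MB}MB = ${NEED_MB}MB needed, ${RAM_MB}MB available"
```

With these assumed numbers the tasks alone demand more memory than the node has, before counting the DataNode, TaskTracker and OS – so the kernel starts swapping.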
To prevent issues like this, swapping simply must be avoided. I was not sure whether the swap space size should be set to 0 (instead of the default 10GB) or to some really small value. I asked Cloudera instructors for recommendations (thanks a lot!), and they are as follows:
- The swap space size should be set to some small value (say, 500MB). Having a small amount of swap helps you avoid the kernel OOM killer (if possible), which is started by the OS under desperately low memory conditions to kill some process(es) for the benefit of the rest of the system. In the worst case, the OOM killer can take out a Hadoop process (e.g. TaskTracker or DataNode) and/or ones that it depends on. Setting the swap size to zero is not recommended, because then whenever the OS runs out of memory, it will start the OOM killer straight away. On the other hand, having a little swap space will not cause a panic – a slightly slower server is better than a server that is down or dysfunctional (i.e. with some Hadoop process killed). Ideally, Hadoop nodes should never swap. Obviously, aggressive monitoring is needed to detect when and why machines are swapping, and if this happens, to try to eliminate this behavior.
- vm.swappiness, a kernel parameter that controls the kernel’s tendency to swap application data from memory to disk, should be set to some small value like 0 or 5 to instruct the kernel to avoid swapping whenever possible. However, even if you set it to 0, swapping might still happen – and we saw it at Spotify ;)
- jvm.child.opts should be marked final, so that developers are not able to override it. Obviously, this has an advantage (they cannot increase it) and a disadvantage (they cannot decrease it). Anyway, if the limit is set to some reasonable number (say, 1GB), good developers should adapt to it. What if some developers really need to use more? Usually this exposes some scalability issue (e.g. buffering a reducer’s values in RAM) that can be eliminated by e.g. using the secondary sort technique, an additional MapReduce job, or writing the data to an external file (look at slides from the Cloudera Developer course), or maybe Cassandra/HBase.
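For reference, in Hadoop 1.x the per-task JVM options are usually controlled by the mapred.child.java.opts property, and “marking it final” is done in mapred-site.xml. A sketch of what that could look like (the heap size here is an assumption – adapt it to your cluster):

```xml
<!-- mapred-site.xml on the cluster nodes (hypothetical heap value) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
  <!-- "final" prevents job submitters from overriding this value -->
  <final>true</final>
</property>
```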
I have mentioned that we quickly identified this memory-heavy MapReduce job because our data engineer told us that he had submitted such a job at that time. It is a bit annoying when somebody submits such a badly-configured job, because it causes interruptions and other jobs/users experience problems. In this case that user was … me ;) so I was able to troubleshoot this issue quickly, and in the end I learnt more than I had initially aimed to ;)
This issue proved to me that it is a good idea, from time to time, to submit a really heavy, ugly, memory-intensive and destructive job just to test the behavior of the cluster ;)
Another issue that we experienced was related to TaskTrackers, which were marked dead by the JobTracker from time to time. We noticed that the TaskTracker processes had suddenly stopped running on these servers for some reason.
A quick troubleshooting session on one of the servers showed that the TaskTracker process had been killed with the -9 signal. This was diagnosed from the following entries in the kernel log:
2013-05-02T19:40:10.151+00:00 calc110.c.lon.spotify.net kernel: [531957.938770] Out of memory: kill process 32339 (java) score 818032 or a child
2013-05-02T19:40:10.151+00:00 calc110.c.lon.spotify.net kernel: [531957.938897] Killed process 32339 (java)
Who could the killer be, and why? There is probably only one answer. The kernel out-of-memory (OOM) killer starts murdering processes according to their “badness” score. It looks like the OOM killer took out a Hadoop process (in this case the TaskTracker). You can read how the “badness” score is calculated here, but in the case of “traditional” Hadoop slave servers, the TaskTracker usually becomes the prime candidate to be killed, because together with its child processes (the JVMs running map and reduce tasks, and potentially external scripts invoking the map and reduce functions, if Hadoop Streaming is used), it consumes a lot of memory.
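You can actually look at the score the kernel currently assigns to any process; here the shell’s own PID ($$) is used purely as an example:

```shell
# Every Linux process exposes its current "badness" score; the higher
# the score, the more likely the OOM killer is to pick it.
cat /proc/$$/oom_score
# Long-lived daemons (e.g. a DataNode) can be made less attractive to
# the OOM killer by lowering the adjustment value: /proc/<pid>/oom_adj
# on older kernels, /proc/<pid>/oom_score_adj on newer ones.
cat /proc/$$/oom_score_adj
```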
It also means that, once again, we were using too much memory. Why did it happen? Do we not know the math needed to correctly calculate how much RAM we actually need to run all the map and reduce tasks on a server?
Actually, at Spotify we do know the math, but that day we had introduced a change to our cluster configuration: we decreased the number of tasks per TaskTracker and gave more memory to each task. This change was propagated by Puppet to every client machine (so jobs were submitted with higher memory expectations) and to the majority of the Hadoop nodes – but unfortunately not to all of them. The nodes where the change was not successfully propagated were still running the old, higher number of map and reduce tasks, each with the new, higher memory setting, and they became the ones where the OOM killer murdered the TaskTrackers.
How did we know that a Puppet propagation failure was the root cause here? Simply because an attempt to run the Puppet agent on such a server produced the following notice:
kawaa@calc149:~$ sudo sppuppet agent -t
Notice: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists)
When I removed the lock (an empty file, by the way) and ran the Puppet agent again, it worked properly and the issue was successfully solved ;)
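The fix above can be sketched as a few lines of shell. The real stale lock was /var/lib/puppet/state/agent_catalog_run.lock; a temporary file stands in for it here so the snippet is safe to run anywhere:

```shell
# Clear a stale Puppet agent lock – but only if no agent is running.
LOCK=$(mktemp)                          # stand-in for the stale lock file
if [ -e "$LOCK" ] && ! pgrep -x puppet >/dev/null 2>&1; then
    rm -f "$LOCK"                       # safe only when no agent is active
fi
# ...then re-run the agent, e.g.: sudo puppet agent -t
```

Checking for a running agent first matters: the lock exists precisely to stop two concurrent runs, so deleting it blindly while an agent is mid-run would defeat its purpose.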