Posted in Monitoring | Posted on 06-10-2013
A couple months ago, we got an email from Chris:
The Hadoop cluster has been a bit slow the past few days and I noticed that the bottleneck seems to be coming from the map tasks. We have separate map and reduce task capacities and it continuously looks like the mapper slots are all taken while there’s a surplus of open reduce slots. Is there any reason that we can’t open any of the free reduce slots to map tasks?
In this blog post, I will describe two memory-related issues that we recently experienced on our 190-node Apache Hadoop cluster at Spotify.
We noticed that some nodes were suddenly marked dead by both the NameNode and the JobTracker. Although we could ping them, we were unable to ssh into them, which often suggests a very heavy load on the machines. Looking at the Ganglia graphs, we discovered that all the nodes marked dead had one thing in common: heavy swapping. (In Apache Hadoop, experience shows that heavy swapping of a JVM process usually means some kind of unresponsiveness, or even “death”.)
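For readers without Ganglia graphs at hand, a rough check of swap usage on a single Linux node can be sketched as below. This is only an illustration, not the tooling we used: the function names and the 0.5 threshold are hypothetical, and swap occupancy is a crude proxy for swapping activity (swap-in/swap-out rates are a better signal).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap space in use (0.0 if no swap is configured)."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total


def is_swapping_heavily(meminfo, threshold=0.5):
    """Hypothetical heuristic: flag a node whose swap is at least `threshold` full."""
    return swap_used_fraction(meminfo) >= threshold
```

On a node, the functions would be fed the contents of `/proc/meminfo`, for example `is_swapping_heavily(parse_meminfo(open("/proc/meminfo").read()))`. A suitable threshold depends on the workload and on how much swap is configured.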