Posted in Monitoring | Posted on 06-10-2013
A couple of months ago, we got an email from Chris:
The Hadoop cluster has been a bit slow the past few days and I noticed that the bottleneck seems to be coming from the map tasks. We have separate map and reduce task capacities and it continuously looks like the mapper slots are all taken while there’s a surplus of open reduce slots. Is there any reason that we can’t open any of the free reduce slots to map tasks?
Mysterious Mass Murder
This is one of the most bloodcurdling stories (and one of my favorites) that we have recently seen in our 190-square-meter Hadoopland. In a nutshell, some jobs were running surprisingly long because thousands of their tasks were constantly being killed, for unknown reasons, by someone (or something).
For example, a photo taken by our detectives shows a job that had been running for 12 hours and 20 minutes and had spawned around 13,000 tasks up to that moment. However, only 4,118 map tasks had finished successfully, while 8,708 had been killed (!) and, surprisingly, only 1 task had failed (?), obviously spreading panic in Hadoopland.
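To put those numbers in perspective, here is a quick back-of-the-envelope check. It only uses the three counts quoted above; whether the job page counted tasks or task attempts is not stated, so treat the result as a rough share rather than an exact statistic.

```python
# Counts quoted in the post for the long-running job.
succeeded = 4118
killed = 8708
failed = 1

# Sum of the three counts; the post rounds this to "around 13,000".
total = succeeded + killed + failed

# Share of the counted tasks that ended up killed.
killed_share = killed / total

if __name__ == "__main__":
    print(f"total counted: {total}")
    print(f"killed share: {killed_share:.1%}")
```

In other words, roughly two out of every three counted tasks were killed rather than completed.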
A couple of weeks ago, we got a JIRA ticket complaining about the JobTracker being super slow (while it used to be super snappy most of the time). Obviously, in such a situation, developers and analysts are a bit annoyed, because it becomes difficult to submit MapReduce jobs and track their status (the side effect, however, is time for an unplanned coffee break, which is not so bad ;)). Anyway, we are also a bit ashamed and sad, because we aim for a perfect Hadoop cluster and no unplanned coffee-break interruptions.
In this blog post, I will describe two memory-related issues that we have recently experienced on our 190-node Apache Hadoop cluster at Spotify.
We noticed that some nodes were suddenly marked dead by both the NameNode and the JobTracker. Although we could ping them, we were unable to ssh into them, which often suggests a really heavy load on these machines. Looking at the Ganglia graphs, we discovered that all the nodes marked dead had one issue in common: heavy swapping (in the case of Apache Hadoop, practice shows that heavy swapping of a JVM process usually means some kind of unresponsiveness or even “death”).
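As an illustration of how such a condition could be spotted on a single node, here is a minimal sketch that parses `/proc/meminfo`-style text and reports how much of the swap space is in use. This is not the tooling we used (we simply looked at the Ganglia graphs); the function names, the sample values and the 50% threshold are all assumptions made up for this example. Swap occupancy is also only a rough proxy: sustained swap-in/swap-out rates (for example the `si`/`so` columns of `vmstat`) are a better signal of active swapping.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap space in use (0.0 if no swap is configured)."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return (total - free) / total


def is_swapping_heavily(meminfo, threshold=0.5):
    """Hypothetical rule of thumb: flag a node once swap use reaches the threshold."""
    return swap_used_fraction(meminfo) >= threshold


# Made-up sample values, not measurements from our cluster.
SAMPLE_MEMINFO = """\
MemTotal:       32869148 kB
MemFree:          201344 kB
SwapTotal:       8388604 kB
SwapFree:        1048576 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE_MEMINFO)
    print(round(swap_used_fraction(info), 3), is_swapping_heavily(info))
```

On a real Linux node, the same functions could be fed the contents of `/proc/meminfo` instead of the sample string.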
Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of dozens of machine-related metrics such as CPU, memory, storage and network usage. You can see Ganglia in action at UC Berkeley Grid.
Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since both Hadoop and HBase have built-in support for publishing their metrics to Ganglia. With Ganglia you can easily see the number of bytes written by a particular HDFS datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, the time spent in garbage collection, and many, many others.
Basic Ganglia overview
Ganglia consists of three components:
- Ganglia monitoring daemon (gmond) – a daemon which needs to run on every single node that is monitored. It collects local monitoring metrics and announces them, and (if configured) receives and aggregates metrics sent to it from other gmonds (and even from itself).
- Ganglia meta daemon (gmetad) – a daemon that periodically polls one or more data sources (a data source can be a gmond or another gmetad) to receive and aggregate the current metrics. The aggregated results are stored in a database and can be exported as XML to other clients, for example the web frontend.
- Ganglia PHP web frontend – it retrieves the combined metrics from the meta daemon and displays them in the form of nice, dynamic HTML pages containing various real-time graphs.
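To make the XML export a bit more concrete, here is a minimal sketch of how a client could turn such a dump into a per-host dictionary of metrics. The sample document below is hand-written for illustration (the host names and values are made up, and a real dump carries many more attributes and metrics); the helper name `metrics_by_host` is likewise just an assumption for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed sample in the general shape of a Ganglia XML dump.
SAMPLE_XML = """\
<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="hadoop" LOCALTIME="1381000000">
    <HOST NAME="node1.example.com" IP="10.0.0.1" REPORTED="1381000000">
      <METRIC NAME="swap_total" VAL="8388604" TYPE="float" UNITS="KB"/>
      <METRIC NAME="swap_free" VAL="1048576" TYPE="float" UNITS="KB"/>
    </HOST>
    <HOST NAME="node2.example.com" IP="10.0.0.2" REPORTED="1381000000">
      <METRIC NAME="swap_total" VAL="8388604" TYPE="float" UNITS="KB"/>
      <METRIC NAME="swap_free" VAL="8388604" TYPE="float" UNITS="KB"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def metrics_by_host(xml_text):
    """Return {host name: {metric name: raw value string}} from a Ganglia XML dump."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.iter("METRIC")
        }
    return result


if __name__ == "__main__":
    for name, metrics in sorted(metrics_by_host(SAMPLE_XML).items()):
        used_kb = float(metrics["swap_total"]) - float(metrics["swap_free"])
        print(name, "swap used (KB):", used_kb)
```

In a typical default setup, gmond serves a dump like this to any client that opens a TCP connection to it (commonly on port 8649), so the same parsing could be applied to live data read from a socket; check your own gmond configuration before relying on that port.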
If you want to learn more about gmond, gmetad and the web frontend, a very good description is available on Ganglia’s Wikipedia page. I hope that the following picture (showing an example configuration) helps to explain the idea: