A couple months ago, we got an email from Chris:
The Hadoop cluster has been a bit slow the past few days and I noticed that the bottleneck seems to be coming from the map tasks. We have separate map and reduce task capacities and it continuously looks like the mapper slots are all taken while there’s a surplus of open reduce slots. Is there any reason that we can’t open any of the free reduce slots to map tasks?
The bottleneck comming from the map tasks
The situation described by Chris is quite common on “classical” Hadoop cluster (by “classical”, I mean a cluster without YARN): all map slots are taken (and we still want more!), while many reduce slots are free (or vice versa). In a result, the cluster resources are not fully utilized, so that the jobs at whole are running slower. One example is bellow:
This becomes even more visible when a chart show fairly flat-topped utilization of map slots:
Apache Hadoop MRv1 resource utilization
Everything because in MRv1, we have to divide our resources into map and reduce slots and hard-code them in the TaskTracker’s configuration (
Obviously, the situation could be alleviated, if we had simply task slots (e.g.
In general, I think that it is difficult to find ideal hard-coded values for the maximum number of map and reduce tasks running on the cluster, because the workload may change very often. If you look at the chart bellow, you will see that in some moment we are bottlenecked by map tasks, but in some other moments we are bottlenecked by reduce tasks.
Tuning map and reduce tasks distribution
Usually, good starting point is to 60% map tasks and 40% reduce tasks (or 70% map and 30% reduce tasks). At Spotify, we simply started with 60%-40% following the this recommendation. However, in the meantime, we were collecting statistics about our tasks (how many and how long they are running) from Ganglia and JobTracker’s logs. After aggregating statistics from the last couple of months, we discovered that our current usage is something around 68%-32%, according to the charts bellow:
Where I took data from
As mentioned earlier, I took data from Ganglia and JobTracker’s logs. You can find how to setup Ganglia for your Hadoop (and HBase) clusters in my previous post titled Ganglia configuration for a small Hadoop cluster and some troubleshooting.
Additionally, you can also use simple JobTracker’s logs that can be enabled in hadoop-log4j.properties file:
hadoop.mapreduce.jobsummary.logger=INFO,JSA hadoop.mapreduce.jobsummary.log.file=hadoop-mapreduce.jobsummary.log hadoop.mapreduce.jobsummary.log.maxfilesize=256MB hadoop.mapreduce.jobsummary.log.maxbackupindex=20
Then in your hadoop log directory, you will find hadoop-mapreduce.jobsummary.log that contains content usefull for many cluster analysis:
13/03/01 22:32:58 INFO mapred.JobInProgress$JobSummary: jobId=job_201303012227_0001,submitTime=1362177138946,launchTime=1362177141965,firstMapTaskLaunchTime=1362177147207,firstReduceTaskLaunchTime=1362177165121,firstJobSetupTaskLaunchTime=1362177141968,firstJobCleanupTaskLaunchTime=1362177173792,finishTime=1362177178517,numMaps=8,numSlotsPerMap=1,numReduces=2,numSlotsPerReduce=1,user=kawaa,queue=default,status=SUCCEEDED,mapSlotSeconds=36,reduceSlotsSeconds=17,clusterMapCapacity=2,clusterReduceCapacity=2 13/03/01 22:35:29 INFO mapred.JobInProgress$JobSummary: jobId=job_201303012227_0003,submitTime=1362177310049,launchTime=1362177312838,firstMapTaskLaunchTime=1362177316497,firstJobSetupTaskLaunchTime=1362177313141,firstJobCleanupTaskLaunchTime=1362177326372,finishTime=1362177329731,numMaps=8,numSlotsPerMap=1,numReduces=2,numSlotsPerReduce=1,user=kawaa,queue=default,status=KILLED,mapSlotSeconds=21,reduceSlotsSeconds=3,clusterMapCapacity=2,clusterReduceCapacity=2