Posted in Monitoring, Troubleshooting | Posted on 23-08-2013|
Mysterious Mass Murder
This is one of the most bloodcurling (and my favorites) stories, that we have recently seen in our 190-square-meter Hadoopland. In a nutshell, some jobs were surprisingly running extremely long, because thousands of their tasks were constantly being killed for some unknown reasons by someone (or something).
For example, a photo, taken by our detectives, shows a job running for 12hrs:20min that spawned around 13,000 tasks until that moment. However (only) 4,118 of map tasks had finished successfully, while 8,708 were killed (!) and … surprisingly only 1 task failed (?) – obviously spreading panic in the Hadoopland.