A typical day of a data engineer at Spotify revolves around Hadoop and music. However, after some time of simultaneously developing MapReduce jobs, maintaining a large cluster and listening to the perfect music for every moment, something surprising might happen…!
Well, after a while, a data engineer starts discovering Hadoop (and its related concepts) in the lyrics of many popular songs. How can Coldplay, the Black Eyed Peas, Michael Jackson or Justin Timberlake sing about Hadoop?
Maybe it is some kind of illness? Definitely! A doctor could call it “inlusio elephans” ;)
Mysterious Mass Murder
This is one of the most bloodcurdling stories (and one of my favorites) that we have recently seen in our 190-square-meter Hadoopland. In a nutshell, some jobs were surprisingly running extremely long, because thousands of their tasks were constantly being killed for some unknown reason by someone (or something).
For example, a photo taken by our detectives shows a job that had been running for 12 hours and 20 minutes and had spawned around 13,000 tasks up to that moment. However, only 4,118 map tasks had finished successfully, while 8,708 were killed (!) and … surprisingly, only 1 task failed (?) – obviously spreading panic in the Hadoopland.
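A quick sanity check on the counts from that photo (a minimal sketch; the numbers are taken from the paragraph above) shows that killed tasks, not failures, account for almost the entire job:

```python
# Task outcome counts observed for the 12h20m job described above.
succeeded = 4_118
killed = 8_708
failed = 1

total = succeeded + killed + failed
print(f"total tasks spawned so far: {total}")    # 12827, i.e. "around 13,000"
print(f"killed fraction: {killed / total:.1%}")  # roughly two thirds of all tasks
```

In other words, about two thirds of all task attempts were killed outright, which is why the job dragged on for half a day despite almost no real failures.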
I am extremely happy to say that my proposal was accepted for Strata Conference + Hadoop World 2013 ;) The title of my presentation is Hadoop Adventures At Spotify, and I will talk about the most interesting problems that we have seen on our fast-growing Hadoop cluster.
The “official” abstract of the presentation is below:
Operating a small-size Hadoop cluster is a calm walk in a forest, while working with a big-size Hadoop cluster is a big adventure in a real jungle. The bigger the elephant is, the more love and care it demands, and we have discovered this the hard way.
In this talk, we will take you on a trip into the Hadoop jungle at Spotify to show the most interesting, exciting and surprising places we have visited while growing fast from a 60-node to a 690-node Hadoop cluster. We will expose our JIRA tickets, real graphs, statistics, even excerpts from our dialogues. We will also share the mistakes that we made and describe the fixes that finally domesticated this love-demanding yellow elephant and its friends.
More details at the Strata website. Hopefully, see you in NYC in October 2013!
Posted in Troubleshooting | Posted on 26-05-2013
A couple of months ago, one of our data analysts kept running into trouble whenever he wanted to run more resource-intensive Hive queries. Surprisingly, his queries were valid and syntactically correct, and ran successfully on small data, but they just failed on larger datasets. On the other hand, other users were able to run the same queries successfully on the same large datasets. Obviously, it sounds like some permissions problem; however, the user had the right HDFS and Hive permissions.
A couple of weeks ago, we got a JIRA ticket complaining about the JobTracker being super slow (while it used to be super snappy most of the time). Obviously, in such a situation, developers and analysts are a bit annoyed because it results in difficulties submitting and tracking the status of MapReduce jobs (however, the side effect is having time for an unplanned coffee break, which should not be so bad ;)). Anyway, we are also a bit ashamed and sad, because we aim for a perfect Hadoop cluster and no unplanned coffee-break interruptions.