A typical day of a data engineer at Spotify revolves around Hadoop and music. However, after some time of simultaneously developing MapReduce jobs, maintaining a large cluster and listening to the perfect music for every moment, something surprising might happen…!
Well, after some time, a data engineer starts discovering Hadoop (and its related concepts) in the lyrics of many popular songs. How can Coldplay, Black Eyed Peas, Michael Jackson or Justin Timberlake sing about Hadoop?
Maybe it is some kind of illness? Definitely! A doctor could call it “inlusio elephans” ;)
Curious about the Hadoop playlist? Listen and learn – each song will simply become an excuse to talk about Hadoop’s internals, tips and the issues that we have experienced.
Feel free to add songs where you hear anything related to Hadoop to the playlist!
- Nine Million Bicycles by Katie Melua
There are nine million bicycles in Beijing
That’s a fact
It’s a thing we can’t deny
We are twelve billion light years from the edge
That’s a guess
There are six billion people in the world
More or less
The Hadoop cluster at Spotify is growing to 690 nodes!
That’s a fact
We will see more than 4 disk failures each day in 2016
That’s a guess
We run 6,500 jobs per day (192,000 per month) on the cluster
More or less
Our cluster is growing faster and faster.
February 2012 => 60 nodes
September 2012 => 120 nodes
January 2013 => 190 nodes
September 2013 => 328 nodes
November 2013 => 690 nodes
Lesson learned – the bigger Hadoop gets (and the less free disk space it has), the more love and care it demands, and we have learned this the hard (and exciting) way!
- Murder On The Dancefloor by Sophie Ellis-Bextor
It’s murder on the dancefloor
But you’d better not kill the groove, DJ
Gonna burn this god damn house right down
Oh I know, I know, I know, I know, I know, I know, I know
About your kind
And so and so and so and so and so and so and so
I’ll have to play
Where is the dancefloor, who is the murderer, what is the groove? Are there any victims?
If you like bloodcurdling stories, here is one that we have recently seen in our Hadoopland. In a nutshell, some jobs were surprisingly running extremely long, because thousands of their tasks were constantly being murdered for unknown reasons by someone (or something). It obviously spread panic across the Hadoopland, so our detectives started to investigate these crimes. In the end, it turned out that the murderer is FairScheduler, the groove is JobTracker (who fortunately has not been murdered yet ;)), while the victims are tasks that belong to resource-intensive Hive queries.
Read more about Mysterious Mass Murder In The Hadoopland.
- Meet Me Halfway by The Black Eyed Peas
And all those things we use to use to use to do
Hey girl wuz up, wuz up, wuz up, wuz up
Meet me halfway, right at the borderline
That’s where I’m gonna wait, for you
I’ll be lookin out, night n’day
Took my heart to the limit, and this is where I’ll stay
I can’t go any further than this
In the past, we got a JIRA ticket complaining about JobTracker being super slow (while it used to be super snappy most of the time). Obviously everybody was annoyed by the unresponsive JobTracker web interface.
To attack this problem, we decided to limit the number of tasks per job using the mapred.jobtracker.maxtasks.per.job property. We had a small dilemma when we needed to specify the right value here, and it resulted in the following real-life conversation:
Sven: What value for mapred.jobtracker.maxtasks.per.job should we use?
Gustav: Maybe 50K
Sven: Why not 30K?
Gustav: Because we might be running large production jobs and it is safer to start with a high value and eventually decrease it later.
Sven: Hmmm? Sounds good…! But maybe 40K, not 50K? ;)
Gustav: OK. Let’s meet in the middle…
The approach above is a great example of “guesstimation”, where negotiation skills are necessary. The dialogue really did happen, but quickly afterwards one more question was asked – should a real data-driven company make such a decision based on a guess, or on data? There can be only one answer!
To find data, we looked at the JobTracker logs. We found that our largest production job creates 22,609 tasks (and this number is increasing, because our input datasets grow each day). The jobs that create more than 22,609 tasks are ad-hoc jobs (in our case, mostly Hive queries). Surprisingly or not, these large ad-hoc jobs usually fail (e.g. due to a parsing error or an out-of-memory exception) or are killed manually by an impatient user.
Finally, we set mapred.jobtracker.maxtasks.per.job to 24,000 tasks.
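For illustration, the setting might look like this in mapred-site.xml (the property name and value come from our story; the surrounding XML is just the standard Hadoop configuration format):

```xml
<property>
  <name>mapred.jobtracker.maxtasks.per.job</name>
  <value>24000</value>
  <description>Jobs that try to create more tasks than this are failed
  by the JobTracker before they can slow it down.</description>
</property>
```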
Read more about JobTracker slowness, guesstimation and a data-driven answer.
- I’m Your Puppet by James & Bobby Purify
I’m your puppet, I’m your puppet
I’m just a toy, just a funny boy
That makes you laugh when you’re blue
I’ll be wonderful, do just what I’m told
I’ll do anything for you
I’m your puppet, I’m your puppet
Just pull them little strings and I’ll sing you a song, I’m your puppet
Make me do right or make me do wrong, I’m your puppet.
Mm, treat me good and I’ll do anything
I’m just a puppet and you hold my string, I’m your puppet
Spotify is a highly puppetized company. We use Puppet everywhere it is possible. Our Hadoop cluster is also puppetized.
Although Puppet is really great and we like it, we had one interesting issue when Puppet, a funny boy, did a funny thing for us! ;)
The issue was related to a couple of TaskTrackers that were constantly being killed with the -9 signal for some reason. It turned out that the kernel out-of-memory killer had started murdering processes according to their “badness” score and took out the TaskTracker daemon. You can read how the “badness” score is calculated here, but in the case of a “classical” Hadoop slave node (with MRv1), the TaskTracker usually becomes the prime candidate to be killed, because together with its child processes (the JVMs running map and reduce tasks, and potentially external scripts invoking map and reduce functions, if Hadoop Streaming is used), it consumes the biggest chunk of memory.
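As a small (Linux-only) sketch of what the kernel exposes: every process publishes its current “badness” score under /proc, so you can inspect which daemons the OOM killer would go after first. This is generic Linux behaviour, not anything Hadoop-specific:

```python
# Read the kernel's OOM "badness" score for a process (Linux only).
# The higher the score, the more likely the OOM killer is to pick that process.
import os

def oom_score(pid):
    """Return the OOM badness score for `pid` from /proc/<pid>/oom_score."""
    with open('/proc/%d/oom_score' % pid) as f:
        return int(f.read())

# Score of the current process; on a busy MRv1 slave node, the TaskTracker
# and its child JVMs would typically score highest.
print(oom_score(os.getpid()))
```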
It also meant that we were using too much memory. Why? That day we had introduced a change to our cluster configuration: we decreased the number of tasks per TaskTracker in order to give more memory to each task. This change was propagated by Puppet to every client machine (so that jobs were submitted with higher memory expectations) and to the majority of the Hadoop nodes, but unfortunately not all of them. The nodes where the change was not propagated successfully were still running more map and reduce tasks, now with higher memory requirements, and became the ones where the OOM killer murdered TaskTrackers.
How did we know that a puppetization failure was the root cause? Simply because an attempt to run the Puppet agent on such a server produced the following notice:
```
kawaa@calc149:~$ sudo sppuppet agent -t
Notice: Run of Puppet configuration client already in progress; skipping
(/var/lib/puppet/state/agent_catalog_run.lock exists)
```
When we removed the lock and ran the Puppet agent again, it worked properly and the issue was solved! ;)
- 42 by Coldplay
Those who are dead are not dead
They’re just living in my head
Time is so short and I’m sure
There must be something more
This song is related to an exciting issue, with a happy ending, that we experienced on the cluster. I will describe it in my next blog post!
- No Guns To Town by Natty King
Don’t take your guns to town, son
Leave your guns at home, Bill
Don’t take your guns to town.
Luigi is our home-grown, open-sourced scheduler. It handles dependency resolution, workflow management, visualization, failure handling, command line integration, and much more. Please watch the video and browse the code.
At Spotify, Luigi schedules thousands of jobs per day. Foursquare and Bitly are also using and contributing to it. You can use it instead of Apache Oozie or Azkaban, and people do – this picture was taken by a Luigi fan who does NOT work at Spotify ;)
- Dead And Gone – feat. Justin Timberlake by T.I.
Oh hey, I’ve been travelin’ on this road too long
Just tryin’ to find my way back home
But the old me’s dead and gone
Dead and gone
Well, rather “Gone and dead”, or to be more precise: “Disk gone and DataNode dead”.
With the default DataNode settings, when even a single volume storing HDFS block replicas fails, the whole DataNode commits suicide and is dead.
```
$ ls /disk/hd12/
ls: reading directory /disk/hd12/: Input/output error

org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes -
current valid volumes: 11, volumes configured: 12, volumes failed: 1,
Volume failures tolerated: 0
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
```
If you have servers with multiple disks, you can consider increasing dfs.datanode.failed.volumes.tolerated to avoid expensive block replication when a disk fails (and enable monitoring to detect failed disks). At Spotify, we have servers with 12 disks, and we set it to 2 (we may set it to something higher in the future).
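In hdfs-site.xml this might look as follows (property name and value from our setup; the XML wrapper is the standard Hadoop configuration format):

```xml
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
  <description>The DataNode keeps running until more than this many
  volumes have failed, instead of shutting down on the first failure.</description>
</property>
```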
Currently, we have 3,420 (relatively new) disks in the Hadoop cluster (growing to 8,280 disks in November 2013), and we saw three disk failures last week. It was exciting for me to see things breaking for the first time, so I sent this email to my colleagues in the middle of the night:
We also see that some disks are 30% slower than the fastest ones when testing the performance of reading from the middle of the disk with hdparm.
- Yellow by Coldplay
- Lovefool by The Cardigans
- Smooth Criminal by Michael Jackson
- Hakuna Matata – From Disney’s “The Lion King”
- Accidentally in love by Counting Crows
And it was all yellow
Oh yeah your skin and bones
Turn into something beautiful
Hadoop is beautiful! One of its beauties (and the one that impressed me so much when I saw Hadoop for the first time) is the abstraction that it offers to developers, so that they can implement … beautiful code.
This abstraction was originally proposed by Jeffrey Dean and Sanjay Ghemawat (both at Google), and their academic paper became a design guide for implementing Apache Hadoop. Please let me quote a part of it:
“Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.
Programs (…) are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.”
Source: MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat.
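The abstraction the paper describes can be sketched in a few lines of Python. This is a toy, single-machine stand-in, not Hadoop itself: the point is that the user writes only the map and reduce functions, while the “library” does the grouping and orchestration (and in real MapReduce also the parallelization, data distribution and fault tolerance):

```python
# Toy word-count in the MapReduce style: map_fn and reduce_fn are the only
# user-supplied pieces; map_reduce plays the role of the framework.
from collections import defaultdict

def map_fn(doc):
    # Emit (key, value) pairs for each word in a document.
    for word in doc.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Combine all values emitted for one key.
    return (word, sum(counts))

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in inputs:                       # "map" phase
        for key, value in map_fn(doc):
            groups[key].append(value)        # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))  # "reduce"

print(map_reduce(["hadoop is beautiful", "hadoop is yellow"], map_fn, reduce_fn))
# → {'beautiful': 1, 'hadoop': 2, 'is': 2, 'yellow': 1}
```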
Dear, I fear we’re facing a problem
Just say that you need me
I can’t care ’bout anything but you…
Lately I have desperately pondered,
Spent my nights awake and I wonder
What I could have done in another way
To make you stay
Reason will not lead to solution
I will end up lost in confusion
I don’t care if you really care
As long as you don’t go
At Spotify we are using Cassandra heavily (hundreds of servers, tens of clusters, terabytes of data – e.g. all playlists are stored in Cassandra). However, we also maintain a small HBase cluster for Socorro (and we are considering deploying another one for hRaven soon).
If you administer an HBase cluster, you have probably heard about the “Juliet Pause” scenario, which is nicely described in Todd Lipcon’s blog post Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers:
“HBase relies on Apache ZooKeeper to track cluster membership and liveness. If a server pauses for a significant amount of time, it will be unable to send heartbeat ping messages to the ZooKeeper quorum, and the rest of the servers will presume that the server has died. This causes the master to initiate certain recovery processes to account for the presumed-dead server. When the server comes out of its pause, it will find all of its leases revoked, and commit suicide. The HBase development team has affectionately dubbed this scenario a Juliet Pause — the master (Romeo) presumes the region server (Juliet) is dead when it’s really just sleeping, and thus takes some drastic action (recovery). When the server wakes up, it sees that a great mistake has been made and takes its own life. Makes for a good play, but a pretty awful failure scenario!”
Source: Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers
Yes, a pretty awful failure scenario!
So She Ran Into The Bedroom
She Was Struck Down, It Was
Annie Are You Ok
So, Annie Are You Ok
Are You Ok, Annie
Annie Are You Ok
The funny command ruok (“Are you OK?”) checks whether a ZooKeeper daemon is running in a non-error state. ZooKeeper responds with imok if it is OK; otherwise it does not respond at all.
```
$ echo ruok | nc 127.0.0.1 2181
imok
```
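Since ruok is just four bytes over a plain TCP connection, it is easy to wrap in a health check. A minimal sketch (host and port are assumptions matching the nc example above):

```python
# Send ZooKeeper's four-letter "ruok" command over TCP and expect "imok" back.
import socket

def zk_is_ok(host='127.0.0.1', port=2181, timeout=2.0):
    """Return True if a ZooKeeper server at host:port answers ruok with imok."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b'ruok')
            return s.recv(4) == b'imok'
    except OSError:
        # Connection refused or timed out: no healthy ZooKeeper there.
        return False
```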
Hakuna Matata! What a wonderful phrase
Hakuna Matata! Ain’t no passing craze
It means no worries for the rest of your days
It’s our problem-free philosophy
“Hakuna MapData! What a wonderful phrase!” ;)
“Hakuna MapData” is the motto of my private blog, where I share my experiences, thoughts and observations related to both practical and non-practical (but still fun!) uses of the Apache Hadoop ecosystem (and try to add some bytes of humor to it) ;)
Think about it everytime
I think about it
Can’t stop thinking about it
How much longer will it take to cure this?
Just to cure it cause I can’t ignore it if its love (love)
We’re accidentally in love
Accidentally in love
Love …I’m in love
The Hadoop cluster at Spotify is called Shrek (this is why you see Shrek’s icon on our Jira tickets) and yes … I am in love with it.
I can’t stop thinking about it, and I do not know how much longer it will take to cure this ;) In my case, everything started … accidentally, 3.5 years ago.