Posted in Failures, Troubleshooting | Posted on 21-12-2013
At Spotify, we have a company-wide culture of celebrating successes and … failures. Because we want to iterate fast, we realize that failures can happen. On the other hand, we cannot afford to make the same mistake more than once. One way of preventing that is sharing our failures, mistakes and learnings across the company.
Today, however, I would like to share my failures … outside of the company ;) While my failures relate to my recent work with an Apache Hadoop cluster, I think that the lessons I have learned are generic enough that many people can benefit from them.
A typical Hadoop problem that can have many different causes
We saw many DataNodes being marked dead because they were not able to send any heartbeat messages to the NameNode, saying “Hi! I am still alive!”, for longer than 10 minutes. Surprisingly, the DataNode processes themselves were still running on those servers.
By design, when a DataNode is marked dead, a costly block replication process is started. The network usage increases, because the alive DataNodes start recreating the missing replicas, and the map tasks scheduled on a server that lost a DataNode can only process non-local blocks. What’s more, scheduled tasks and jobs can even fail, or not be started at all, due to a BlockMissingException if we lose access to all replicas of a given block.
Get a bit of breathing room until the issue is permanently fixed
Since we did not know what the root cause was, we came up with a relief idea: temporarily increase the datanode-liveness interval from 10 to 20 minutes to get some breathing room, hoping that at least the costly block replication process would be avoided for those DataNodes which quickly re-join the cluster.
Wouter: Maybe we should up the dfs.namenode.heartbeat.recheck-interval?
Adam: What value could be OK? 15, maybe 20 minutes?
Wouter: 20 maybe? It will give us a bit more breathing room.
Adam: Yes, the replication will not start so quickly.
The following patch (which increases the heartbeat recheck interval from 300 to 600 seconds) was quickly written by me, then reviewed and deployed by someone from SRE.
According to the formula:
2 * dfs.namenode.heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
our new datanode-liveness interval should equal 2 * 600 sec + 10 * 3 sec = 20 min 30 sec.
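The arithmetic above can be double-checked with a quick snippet (a minimal sketch; the variable names simply mirror the two properties in the formula):

```python
# Datanode-liveness interval, per the formula:
#   2 * dfs.namenode.heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
heartbeat_recheck_interval_sec = 600  # our patched value (was 300)
heartbeat_interval_sec = 3            # dfs.heartbeat.interval

liveness_sec = 2 * heartbeat_recheck_interval_sec + 10 * heartbeat_interval_sec
minutes, seconds = divmod(liveness_sec, 60)
print(f"{minutes} min {seconds} sec")  # 20 min 30 sec
```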
After that, we came up with one more idea. Let’s do other (important) stuff after deploying this change … and not measure its impact!
A new issue that exposes some weird behaviour
And this is the moment when a new story starts.
A couple of days later, on a Saturday morning, Anders (who was on-call that day) raised an incident ticket saying that the cluster became unstable after around 1 hour of uptime.
Sadly, it was true. Each time, around 1 hour after restarting the NameNode, the majority (or all) of the DataNodes became marked dead, making the cluster completely unstable and nonfunctional.
To solve this problem, we first tried to answer the question: “Why does this happen exactly one hour after restarting the NameNode?”
Our first idea was based on our theoretical knowledge of Apache Hadoop: “When a DataNode initially starts up, as well as every hour thereafter, it sends a block report to the NameNode containing information about all block replicas that it currently has.”. You can read about this in Hadoop books, blog posts, even see it in the code!
Maybe the block report storm coming from the DataNodes exactly after the first hour
- heavily overloads the NameNode?
- and/or starts heavy GC that freezes the NameNode?
For us, it sounded like a pretty awful, but possible, scenario!
Making an iteration bigger to solve more problems in fewer iterations!
To avoid the storm of block reports, we decided to use a configuration setting called dfs.blockreport.initialDelay.
We increased the initial delay to one hour, so that the block reports coming from the DataNodes would not all be sent at the same time, hopefully reducing the load on the NameNode.
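A change along these lines would go into hdfs-site.xml (a sketch; the exact value shown is an assumption — one hour expressed in seconds, over which each DataNode picks a random initial delay):

```xml
<!-- hdfs-site.xml: spread initial block reports over a random
     delay of 0..3600 seconds after DataNode startup -->
<property>
  <name>dfs.blockreport.initialDelay</name>
  <value>3600</value>
</property>
```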
To avoid any possible heavy GC on the NameNode, we decided to increase the heap of the NameNode and tweak the sizes of the young and old generations.
Unfortunately, changing these settings requires restarting the NameNode, and now a restart would take more than one hour, because the block reports are sent with a random delay of between 0 and 60 minutes. Even worse, it was a sunny Sunday afternoon… There could be only one knee-jerk reaction: let’s deploy all these changes to the NameNode at the same time, so that we solve more problems in fewer iterations!
A couple hours later…
We were very happy, because the cluster had been running fine. It looked like the issue was successfully fixed! ;)
Wouter: Yeah, looks like we’re pretty good
Adam: Good timing, one evening before Monday
Johannes: Yeah, maybe I’ll get two or three hours of weekend!
Obviously, we added a comment to the JIRA ticket with our findings:
A couple days later…
We were very unhappy, because the cluster had not been running fine. It looked like the issue was not successfully fixed – we still saw many DataNodes marked dead regularly! :(
A shocking and shameful finding
We started investigating this issue again, but everybody worked rather independently (for no particular reason).
My first finding was really shocking to me! When looking more deeply into the code, I discovered that the
The mail of shame
Due to this mistake, we spent half of the weekend troubleshooting the Hadoop cluster… and I felt guilty. :(
I had a dilemma whether to silently change this interval (it happened during my first months at Spotify, so I was a bit afraid for my reputation), or to send an email of apology to my colleagues explaining the mistake.
Finally, I sent “a mail of shame”. Today, I think that I should have also sent flowers.
Thankfully, I got a reply that was not so bad.
Thanks to this reply, I still felt like an important member of the team!
Smile…. Tomorrow will be worse
“Those (DataNodes) who are dead and not dead were still living in my head”, so I decided to put all my efforts, blood, sweat and tears to solve this issue!
My next finding was also shocking and kind of demotivating, and I sent “a mail of shame” again:
This means that we did not have any block report storm coming after 1 hour (it might come after 6 hours)! With this in mind, increasing the initial block report delay was definitely not the right fix that weekend (guess who proposed that?).
This means that it was increasing the heap of the NameNode and tweaking the sizes of the young and old generations that helped us that weekend (solving the problem of the cluster being unstable around 1 hour after restarting the NameNode). Submitting multiple changes to the cluster at once (almost always) means that you do not know which one is the right fix!
Too many lessons learned!
We did not fix the problem permanently at that point in time, because we were still losing a couple of DataNodes every now and then, but we had learned a lot. Probably too many lessons!
- Measure the impact of each single change (if you do not have the time/energy to do that, do not deploy the change)
- Never make bulk changes to the cluster (otherwise you will not know which one is the right fix)
- Double-check a configuration parameter description (description can be different in different places)
- Check your default values (they might be overridden/optimized by newer versions of Hadoop)
- Question (almost) everything (even if your colleague submits a patch that you discussed together 5 minutes ago)
- Troubleshoot the cluster together, both interactively and non-interactively
- Share the knowledge, even if you make a mistake (this includes conference talks and blog posts! ;))
- Support your team-mates, even if they fail (because they are friends, and they will be doubly motivated to fix the issue)
Amat victoria curam
Amat victoria curam – Victory loves preparation.
And I was doubly motivated to fix this issue with missing DataNodes. I started learning more about troubleshooting tools for Java applications, e.g. jstack.
The blocked and waiting threads were waiting for the same lock, to finish operations related to reading a block, creating a block replica, initializing a replica recovery process and more:
- org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock
- org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw
- org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary
- org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.getDfsUsed
- org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery
- org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock
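The pattern above — many threads serialized on one coarse-grained lock, starving the heartbeat — can be illustrated with a small, self-contained Python sketch (an analogy only, not DataNode code; the function names are made up):

```python
import threading
import time

# A single coarse-grained lock guarding all "dataset" operations,
# analogous to the shared lock visible in the thread dump above.
dataset_lock = threading.Lock()
heartbeat_delays = []

def slow_disk_operation(duration):
    # Hypothetical stand-in for a slow lock-holding operation,
    # e.g. scanning disks to compute used space.
    with dataset_lock:
        time.sleep(duration)

def send_heartbeat():
    # The heartbeat cannot proceed until the same lock is free.
    start = time.monotonic()
    with dataset_lock:
        heartbeat_delays.append(time.monotonic() - start)

slow = threading.Thread(target=slow_disk_operation, args=(0.5,))
slow.start()
time.sleep(0.1)  # make sure the slow operation grabs the lock first
hb = threading.Thread(target=send_heartbeat)
hb.start()
slow.join(); hb.join()

# The heartbeat was delayed for roughly the remaining lock-hold time.
print(f"heartbeat delayed by ~{heartbeat_delays[0]:.2f}s")
```

If the lock is held long enough (slow disks, huge block counts), the heartbeat can be delayed past the liveness interval and the DataNode gets marked dead while its process is still running.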
Eureka! The root cause was identified.
Unfortunately, we did not have enough hands to implement a patch that fixes this problem (though we definitely should have), so we just wrote a simple watcher script that periodically checks whether the DataNode is both running and marked dead by the NameNode, and restarts it if so.
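Such a watcher could look roughly like this (a minimal sketch, not our actual script; the commands, the service name and the `hdfs dfsadmin -report` output parsing are all assumptions to adapt to your environment):

```python
import subprocess

def is_marked_dead(dfsadmin_report, hostname):
    """Very naive check: does this host appear after the 'Dead datanodes'
    header in `hdfs dfsadmin -report` output? (output-format assumption)"""
    dead_section = dfsadmin_report.split("Dead datanodes")[-1]
    return hostname in dead_section

def should_restart(process_running, marked_dead):
    # Restart only in the weird case: the daemon runs,
    # yet the NameNode considers it dead.
    return process_running and marked_dead

def watch_once(hostname):
    # Assumed commands; run this periodically, e.g. from cron.
    running = subprocess.call(["pgrep", "-f", "DataNode"]) == 0
    report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                            capture_output=True, text=True).stdout
    if should_restart(running, is_marked_dead(report, hostname)):
        subprocess.call(["service", "hadoop-hdfs-datanode", "restart"])
```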
Today, I think that we should have prioritized contributing to core Apache Hadoop higher. We were probably one of the first companies to run into this issue, because when we first saw it, there was only one JIRA ticket open for it; a couple of months later, two more tickets were created and the problem was finally solved.
Two additional lessons learned
There are two additional important lessons that we learned from this issue. First, apart from mastering our Hadoop knowledge, we should learn as much as possible about all the available troubleshooting and monitoring tools and their functionality. The second lesson is that we should prioritize contributing to core Apache Hadoop higher (at least sending a post to the Hadoop mailing list describing the problem, or raising a well-documented JIRA ticket, and maybe even implementing a patch).