Hakuna MapData! » Blog Archive » Celebrate failure(s) – a real-world Hadoop example (HDFS issues)

Celebrate failure(s) – a real-world Hadoop example (HDFS issues)

| Posted in Failures, Troubleshooting |

At Spotify, we have a company-wide culture of celebrating successes and … failures. Because we want to iterate fast, we do realize that failures can happen. On the other hand, we can not afford to make the same mistake more than once. One way of preventing from that is sharing our failures, mistakes and learning across the company.

Today however, I would like to share my failures … outside of the company ;) While my failures relate to my recent work with Apache Hadoop cluster, I think that the lessons that I have learned are generic enough, so that many people can benefit from them.

A typical Hadoop problem that might be caused by many different reasons

We saw many DataNodes being marked dead, because they were not able to send any heartbeat messages to the NameNode, saying “Hi! I am still alive!” for longer than 10 minutes. Surprisingly, the $ps command showed us the DataNode daemons existed on their servers.

By design, when a DataNode is marked dead, a costly block replication process is started. The network usage increases, because alive DataNodes start recreating missing replicas and the map tasks, scheduled on a server that lost a DataNode, can only process non-local blocks. What’s more, scheduled tasks and jobs can even fail or be not started at all due to the BlockMissingException, if we lose access to all replicas of a given block.

The $ps command showed us that DataNode daemon existed on their servers, but the $jstack -F command did not show us anything interesting. Surprisingly, we noticed that some (but not all) of the dead DataNodes automatically re-join the cluster after 15-30 minutes. Maybe they were blocked by something for some reason?

Get a bit of breathing room until the issue is permanently fixed

Since we did not know what was the root cause, we came up with a relief idea to temporarily increase the datanode-liveness interval from 10 to 20 minutes to get some breathing room, hoping that at least a costly block replication process will be avoided for these DataNodes which quickly re-join the cluster.

Wouter: Maybe we should up the dfs.namenode.heartbeat.recheck-interval?
Adam: What value could be OK? 15, maybe 20 minutes?
Wouter: 20 maybe? It will give us a bit more breathing room.
Adam: Yes, the replication will not start so quickly.

A following patch (that increases the heartbeat recheck interval from 300 to 600 seconds) was quickly written by me, reviewed and deployed by someone from SRE.

According to the formula:
2 * dfs.namenode.heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
our new datanode-liveness interval should equal to: 2 * 600 sec + 10 * 3 sec = 20min:30sec.

After that, we came up with one more idea. Let’s do other (important) stuff after deploying this change … and do not measure its impact!

A new issue that exposes some weird behaviour

And this is a moment, when a new story starts.

A couple of days later, on Saturday morning, Anders (who was on-call that day) raised an incident ticket saying that a cluster becomes unstable after around 1 hour of uptime.

Sadly, it was true. Each time, around 1 hour after restarting NameNode, the majority (or all) of DataNodes become marked dead making a cluster completely unstable and nonfunctional.

To solve this problem, we first tried to answer the question: “Why does this happen exactly one hour after restarting the NameNode?”

Our first idea was based on our theoretical knowledge of Apache Hadoop: “When a DataNode initially starts up, as well as every hour thereafter, it sends a block report to the NameNode containing information about all block replicas that it currently has.”. You can read about this in Hadoop books, blog posts, even see it in the code!

Maybe, the block report storm coming from the DataNodes exactly after the first hour

  • heavily overloads the NameNode?
  • and/or starts heavy GC that freezes the NameNode?

For us, it sounded like a pretty awful, but a possible scenario!

Making an iteration bigger to solve more problems in fewer iterations!

To avoid the storm of block reports, we decided to use a configuration setting called dfs.blockreport.initialDelay. With this setting, when a DataNode starts, it waits a random number of seconds between 0 and dfs.blockreport.initialDelay before sending the first block report. After the first block report is sent, next block reports will be sent every hour thereafter.

We increased the initial delay to one hour, so that all block reports coming from DataNodes will not be sent at the same time, hopefully, reducing the load on the NameNode.

To avoid any possible heavy GC on the NameNode, we decided to increase the heap of the NameNode and tweak the sizes of the young and old generations.

Unfortunately, changing these settings requires restarting the NameNode and now a restart will take more than one hour, because the block reports are sent with a random delay between 0 and 60 minutes. Even worse, it is a sunny Sunday afternoon… The unconditioned reaction can be only one – let’s deploy all this changes to the NameNode at the same time, so that we will solve more problems in fewer iterations!

A couple hours later…

We were very happy, because the cluster had been running fine. It looked that the issue was successfully fixed! ;)

Wouter: Yeah, looks like we’re pretty good
Adam: Good timing, one evening before Monday
Johannes: Yeah, maybe I’ll get two or three hours of weekend!

Obviously, we added a comment to the JIRA ticket with our findings:

A couple days later…

We were very unhappy, because the cluster had not been running fine. It looked that the issue was not successfully fixed – we still saw many DataNodes marked dead regularly! :(

A shocking and ashaming finding

We started investigating this issue again, but everybody worked rather independently (no particular reason why).

My first finding was a really shocking for me! When looking more deeply in the code, I discovered that the dfs.namenode.heartbeat.recheck.interval should be specified in milliseconds (not in seconds, as I thought). This means that the formula gave us only 2 * 600 milsec + 10 * 3 sec = 31.2sec (instead of 20 minutes)! This explains why the majority of DataNodes became marked dead so quickly and so often the last weekend. Due to a higher load and/or heavier garbage collection, the NameNode was freezing for longer than half a minute, causing the DataNodes to lose connectivity and became marked dead!

The mail of shame

Due to this mistake, we spent a half of weekend troubleshooting Hadoop cluster… and I felt guilty. :(

I had a dilemma whether to silently change this interval (it happened during my first months at Spotify, so I was a bit afraid of my reputation), or send an email with apologies to my colleagues explaining this mistake.

Finally, I sent “a mail of shame”. Today I think that I should also sent flowers.

Thankfully, I got a reply that was not so bad.

Thanks to this reply, I still felt an important member of the team!

Smile…. Tomorrow will be worse

“Those (DataNodes) who are dead and not dead were still living in my head”, so I decided to put all my efforts, blood, sweat and tears to solve this issue!

My next finding was also shocking and kind of demotivating, and I sent “a mail of shame” again:

This means that we did not have any block report storm coming after 1 hour! (it might come after 6 hours). With this in mind, increasing the initial block report delay was definitely not a right fix that weekend (guess, who proposed that?).

This means that increasing the heap of NameNode and tweaking the size of the young and old generations helped us that weekend (to solve a problem of the cluster being unstable around 1 hours after restarting the NameNode). Submitting multiple changes to the cluster at once, (almost always) means that you do not know which is the right one!

Too many lessons learned!

We did not fix the problem permanently at that point of time, because we were still losing a couple of DataNodes every now and then, but we had learned a lot. Probably, there are too many lessons learned!

  • Measure the impact of each single change (if you do not have time/energy to do that, do not deploy a change)
  • Never make bulk changes to the cluster (you will know which one is a right fix)
  • Double-check a configuration parameter description (description can be different in different places)
  • Check your default values (they might be overridden/optimized by newer versions of Hadoop)
  • Question (almost) everything (even if your colleague submits a patch that you discussed together 5 minutes ago)
  • Troubleshoot the cluster together interactively and non-interactively
  • Share the knowledge, even if you make a mistake (this includes conference talks and blog posts! ;))
  • Give a support to your team-mates, even if they fail (because they are friends and they will be double-motivated to fix the issue)

Amat victoria curam

Amat victoria curam – Victory favors those who try.

And I was double motivated to fix this issue with missing DataNodes. I start learning more about troubleshooting tools for Java applications e.g. jstack, jmap. Actually, I was a good decision!

sudo jstack -F datanode-pid did not show us anything interesting,
sudo -u hdfs jstack datanode-pid did show us something really interesting i.e. 462 threads were blocked/waiting, and only 19 threads are runnable.

The blocked and waiting threads were waiting for the same lock, to finish operations related to reading a block, creating a block replica, initializing a replica recovery process and more:


The lock was held by other thread that run into the deadlock. This deadlock is described in HDFS-3655 (duplicated by HDFS-5016 and related to HDFS-4851).

Eureka! The root cause is identified.

Unfortunately, we had not enough hands to implement a patch that fixes this problem (but we definitely should), and we just wrote a simple watcher script that periodically checks if the DataNode is both running and marked dead by the NameNode and restarts it, if true.

Today, I think that we should prioritize the contribution to core Apache Hadoop higher. We were probably one of the first companies that run into this issue, because when we saw it first time, it was only one JIRA ticket open for that and after a couple of months two more tickets were created and the problem was finally solved.

Two additional lessons learned

There are two additional important lessons that we learned from this issue. First, apart from mastering the Hadoop knowledge, we should learn as much as possible about all available troubleshooting and monitoring tools and their functionality. The second lessons is that we should prioritize the contribution to core Apache Hadoop higher (at least sending a post to Hadoop mailing list telling about the problem or raising a well-documented JIRA ticket, and maybe even implement a patch).

No related posts found.

VN:F [1.9.20_1166]

Rate this post!

Rating: 5.0/5 (7 votes cast)
Celebrate failure(s) – a real-world Hadoop example (HDFS issues), 5.0 out of 5 based on 7 ratings



Comments are closed.