Hakuna MapData! » hadoop

Agile migration of a single-node cluster from MRv1 to YARN

| Posted in Tutorials |


I am happy to say that my blog post “Agile migration of a single-node cluster from MRv1 to YARN” has been published by IBM developerWorks. Please find the abstract of the article below:

Although Hadoop vendors such as Cloudera and Hortonworks provide excellent and detailed documentation for installing YARN, they follow an all-or-nothing approach. With this approach, you perform almost all of the migration steps first, then you start the cluster and verify that it is correctly migrated. If the migration fails, you review the migration steps to determine where the misconfiguration was made. Because the migration to YARN is a complex and error-prone process, it can be challenging to troubleshoot an almost-migrated cluster.

In contrast, this article describes how to use an agile approach with quick and frequent iterations. In the first iteration, you install only the necessary components and start the YARN cluster to verify that it runs applications successfully. In the following iterations, you extend the cluster’s functionality and optimize the most important configuration settings. The goal is to have a working YARN cluster that can process users’ applications after each iteration. With this approach, administrators can pause the migration after any iteration and resume it later at a convenient time.

Read more at IBM developerWorks.

Introduction To YARN

| Posted in Reading |


I am happy to say that my blog post “Introduction To YARN” has been published by IBM developerWorks. Please find the abstract of the article below:

Apache Hadoop is currently one of the most popular tools for big data processing. It has been successfully deployed in production by many companies for several years. Though Hadoop is considered a reliable, scalable, and cost-effective solution, it is constantly being improved by a large community of developers. As a result, the 2.0 version offers several revolutionary features, including YARN, HDFS Federation, and a highly available NameNode, which make the Hadoop cluster much more efficient, powerful, and reliable. In this article, learn about the advantages YARN provides over the previous version of the distributed processing layer in Hadoop.

Read more at IBM developerWorks.

“Hadoop Operations Powered By … Hadoop” accepted for Hadoop Summit 2014 in Amsterdam! ;)

| Posted in Community, Presentations, Troubleshooting |


I am extremely happy to say that my proposal was accepted for Hadoop Summit 2014 in Amsterdam ;) The title of my presentation is “Hadoop operations powered by … Hadoop”. I will talk about the various metrics, logs and files that Hadoop generates, and how to analyze them … using Hadoop (plus open-source tools and simple scripts) to learn more about Hadoop and avoid guesstimates!

Two memory-related issues on the Apache Hadoop cluster (memory swapping and the OOM killer)

| Posted in Monitoring, Troubleshooting |


In this blog post, I will describe two memory-related issues that we recently experienced on our 190-node Apache Hadoop cluster at Spotify.

Hadoop Unreachable Nodes Jira Ticket

We noticed that some nodes were suddenly marked dead by both the NameNode and the JobTracker. Although we could ping them, we were unable to ssh into them, which often suggests a really heavy load on these machines. Looking at the Ganglia graphs, we discovered that all the nodes marked dead had one issue in common: heavy swapping. (With Apache Hadoop, experience shows that heavy swapping of a JVM process usually leads to some kind of unresponsiveness or even the “death” of the process.)

Servers swapping
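
For readers without Ganglia at hand, swap occupancy on a single Linux node can be estimated from the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`. The snippet below is only a minimal sketch of that idea, not the tooling we used: the function names, the sample values and the 0.5 threshold are all hypothetical, and swap occupancy is just a proxy (the `pswpin`/`pswpout` counters in `/proc/vmstat` reflect actual swapping activity more directly).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap space in use (0.0 if no swap is configured)."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return (total - free) / total


def is_swapping_heavily(meminfo, threshold=0.5):
    """Hypothetical rule of thumb: flag a node once swap use reaches the threshold."""
    return swap_used_fraction(meminfo) >= threshold


# Made-up sample values for illustration; on a real node, read the text with
# open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:       32877060 kB
MemFree:          213456 kB
SwapTotal:       8388604 kB
SwapFree:        2097151 kB
"""

info = parse_meminfo(SAMPLE)
print(swap_used_fraction(info), is_swapping_heavily(info))
```

A check like this could run periodically on each node and feed an alert, so that heavy swap use is noticed before the node stops responding to the NameNode and the JobTracker.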