Hakuna MapData! » hdfs

Avoiding The Mess In The Hadoop Cluster (Part 1)

| Posted in Tips, Uncategorized |


I am excited to say that the blog post I co-authored, Avoiding The Mess In The Hadoop Cluster (Part 1), has been published by GetInData and the Apache Software Foundation.

In the first part of this blog series, we describe possible open-source solutions for data cataloguing, data discovery and process scheduling such as Apache Hive, HCatalog and Apache Falcon.

If interested, please read more at Avoiding The Mess In The Hadoop Cluster (Part 1).

Celebrate failure(s) – a real-world Hadoop example (HDFS issues)

| Posted in Failures, Troubleshooting |


At Spotify, we have a company-wide culture of celebrating successes and … failures. Because we want to iterate fast, we accept that failures can happen. On the other hand, we cannot afford to make the same mistake more than once. One way of preventing that is to share our failures, mistakes and lessons learned across the company.

Today, however, I would like to share my failures … outside of the company ;) While my failures relate to my recent work with an Apache Hadoop cluster, I think that the lessons I have learned are generic enough that many people can benefit from them.

Hadoop adventures at Spotify (slides from my talk at Strata + Hadoop World 2013)

| Posted in Presentations |


I am very happy to present the slides from my presentation at Strata + Hadoop World 2013.

The presentation is titled “Hadoop adventures at Spotify” and I am simply talking about five real-world Hadoop issues that either broke our cluster at Spotify or made it very unstable. Each story comes from our JIRA dashboard and is based on facts! ;) To make it even more engaging, I am showing real graphs, numbers, and even our emails and conversations. For each story, I am sharing the mistakes that we made and describing the lessons that we learned.

This also includes a mistake that I made and do not like to talk about, but today I will share it as well ;)

Hadoop Playlist at Spotify

| Posted in Tips, Troubleshooting |


A typical day of a data engineer at Spotify revolves around Hadoop and music. However, after some time of simultaneously developing MapReduce jobs, maintaining a large cluster and listening to the perfect music for every moment, something surprising might happen…!


Well, after some time, a data engineer starts discovering Hadoop (and its related concepts) in the lyrics of many popular songs. How can Coldplay, Black Eyed Peas, Michael Jackson or Justin Timberlake sing about Hadoop?

Maybe it is some kind of illness? Definitely! A doctor could call it “inlusio elephans” ;)

A user having surprising troubles running more resource-intensive Hive queries

| Posted in Troubleshooting |


The problem

A couple of months ago, one of our data analysts kept running into trouble whenever he wanted to run more resource-intensive Hive queries. Surprisingly, his queries were valid and syntactically correct, and they ran successfully on small data, but they failed on larger datasets. On the other hand, other users were able to run the same queries successfully on the same large datasets. Obviously, it sounded like a permissions problem; however, the user had the right HDFS and Hive permissions.