Hakuna MapData! » apache hbase

Hadoop Playlist at Spotify

| Posted in Tips, Troubleshooting |


A typical day of a data engineer at Spotify revolves around Hadoop and music. However, after some time of simultaneously developing MapReduce jobs, maintaining a large cluster and listening to the perfect music for every moment, something surprising might happen…!


Well, after some time, a data engineer starts discovering Hadoop (and its related concepts) in the lyrics of many popular songs. How can Coldplay, Black Eyed Peas, Michael Jackson or Justin Timberlake sing about Hadoop?

Maybe it is some kind of illness? Definitely! A doctor could call it “inlusio elephans” ;)

What I did when assigning META region in HBase cluster took ages…

| Posted in Troubleshooting |


Today, I ran into a problem: the META region could not be assigned to a regionserver when starting an HBase cluster. The ROOT region was assigned correctly and quickly (in a matter of seconds), whereas the META region stayed permanently in the transition state. Unfortunately, this made the HBase cluster inaccessible.

Since I had never encountered such a situation before, I did not know which solution would be best.

Recreating HBase table without violating region starting keys

| Posted in Doubts Resolving, Troubleshooting |


Today, I was using completebulkload to load a large amount of data into HBase. To keep this process well-balanced over the entire cluster (and thus faster), I was loading data into a table with pre-created regions. Since my bulkloading process failed midway a couple of times due to some misconfiguration, I needed to truncate the table each time.

I have noticed that a command like

hbase(main):017:0> truncate 'table'

will disable, drop and recreate the table with the same name and settings (number of column families, compression, TTL, block size etc.), but it does not maintain the region boundaries.
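
One way to get the boundaries back is to recreate the table with explicit split points. The sketch below only builds the text of an HBase shell `create` statement with the `SPLITS` option. The helper `build_create_command`, the column family name `cf` and the start keys are hypothetical, and it assumes the region start keys are simple printable strings that were noted down before the table was dropped. It does not carry over family settings such as compression or TTL.

```python
def build_create_command(table, families, start_keys):
    """Build an HBase shell `create` statement that pre-splits the table.

    The first region always starts at the empty key, so empty start keys
    are skipped: only the remaining keys act as split points.
    """
    split_points = sorted(key for key in start_keys if key)
    family_args = ", ".join("'{}'".format(name) for name in families)
    splits = ", ".join("'{}'".format(key) for key in split_points)
    return "create '{}', {}, SPLITS => [{}]".format(table, family_args, splits)


# Hypothetical start keys, saved before the table was dropped.
saved_start_keys = ["", "g", "n", "t"]
command = build_create_command("table", ["cf"], saved_start_keys)
print(command)
```

The printed statement can be pasted into the HBase shell after disabling and dropping the table manually, instead of running truncate.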

Ganglia configuration for a small Hadoop cluster and some troubleshooting

| Posted in Monitoring, Software, Troubleshooting |


Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of dozens of machine-related metrics such as CPU, memory, storage and network usage. You can see Ganglia in action at UC Berkeley Grid.

Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you can easily see the number of bytes written by a particular HDFS datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, the time spent in garbage collection and many, many others.

Basic Ganglia overview

Ganglia consists of three components:

  • Ganglia monitoring daemon (gmond) – a daemon which needs to run on every single node that is monitored. It collects local monitoring metrics and announces them, and (if configured) receives and aggregates metrics sent to it from other gmonds (and even from itself).
  • Ganglia meta daemon (gmetad) – a daemon that periodically polls one or more data sources (a data source can be a gmond or another gmetad) to receive and aggregate the current metrics. The aggregated results are stored in a database and can be exported as XML to other clients – for example, the web frontend.
  • Ganglia PHP web frontend – it retrieves the combined metrics from the meta daemon and displays them in the form of nice, dynamic HTML pages containing various real-time graphs.
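
To make the XML export more concrete, here is a minimal sketch of how a client could read per-host metrics out of such a document. The sample below is hand-written and heavily simplified (real output carries many more attributes and metrics), and the host names, metric names and the helper `metric_by_host` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; not captured from a real cluster.
SAMPLE_XML = """
<GANGLIA_XML>
  <CLUSTER NAME="hadoop">
    <HOST NAME="node1">
      <METRIC NAME="cpu_user" VAL="12.5"/>
      <METRIC NAME="mem_free" VAL="2048"/>
    </HOST>
    <HOST NAME="node2">
      <METRIC NAME="cpu_user" VAL="7.5"/>
      <METRIC NAME="mem_free" VAL="1024"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def metric_by_host(xml_text, metric_name):
    """Return {host name: value} for one metric across all hosts."""
    root = ET.fromstring(xml_text.strip())
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = float(metric.get("VAL"))
    return values


cpu = metric_by_host(SAMPLE_XML, "cpu_user")
# A simple cluster-wide aggregate, similar in spirit to what gmetad computes.
average_cpu = sum(cpu.values()) / len(cpu)
print(cpu, average_cpu)
```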

If you want to learn more about gmond, gmetad and the web frontend, a very good description is available at Ganglia’s Wikipedia page. I hope that the following picture (showing an example configuration) helps to convey the idea:

Pigitos in action – Reading HBase column family content in a real-world application

| Posted in Programming |


In this post I will demonstrate how you can use Pigitos (a library that contains tiny, but highly useful UDFs for Apache Pig) to implement a “friends you may have” feature (inspired by “people you may know”) for a simple real-world application.

Problem definition

Assume that we are launching a social website called “CloseFriendBook” and we have to design a basic HBase table that stores information about the users and their close friends.

Our access pattern is either:

  • read a user profile (information like first name, email, age), or
  • read the full list of friends of a given user (theoretically, a user may have an unlimited number of close friends, but in reality they have no more than tens or hundreds of them).
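
As a rough illustration of these two access patterns, the sketch below models one HBase row per user as a nested Python dictionary. Everything in it is an assumption made for the example: the column family names `info` and `friends`, the idea of storing each friend as a column qualifier, the sample user and the two helper functions.

```python
# One row per user, modeled as {row key: {column family: {qualifier: value}}}.
# Family names and the sample data are assumptions for illustration only.
table = {
    "alice": {
        "info": {"first_name": "Alice", "email": "alice@example.com", "age": "30"},
        # Each friend is a column qualifier, so the list can grow freely.
        "friends": {"bob": "", "carol": ""},
    },
}


def read_profile(table, user):
    """First access pattern: read only the profile columns of a user."""
    return dict(table[user]["info"])


def read_friends(table, user):
    """Second access pattern: read the full list of friends of a user."""
    return sorted(table[user]["friends"])


print(read_profile(table, "alice"))
print(read_friends(table, "alice"))
```

Keeping profile data and friends in separate column families means that each access pattern touches only the family it needs.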