Hakuna MapData! » apache pig

Football zero, Apache Pig hero – the essence of hundreds of posts from the Apache Pig user mailing list

| Posted in Community |


I am a big fan of football and I really like reading football news. Last week, however, I definitely overdid it (because Poland played against England in a 2014 World Cup qualifying match). Fortunately, I realized that this is not the best way to spend my time, and today I decided that my next 7 days will be different. Instead, I will read posts from the Apache Pig user mailing list!

The idea is simply to read a post from the mailing list any time I feel like reading about football. It means Football zero, Apache Pig hero for me this week ;)

Pigitos in action – Reading HBase column family content in a real-world application

| Posted in Programming |


In this post I will demonstrate how you can use Pigitos (a library that contains tiny, but highly useful UDFs for Apache Pig) to implement a “friends you may have” feature (inspired by “people you may know”) for a simple real-world application.

Problem definition

Assume that we are launching a social website called “CloseFriendBook” and we have to design a basic HBase table that stores information about the users and their close friends.

Our access pattern is either:

  • read a user profile (information like first name, email, age), or
  • read the full list of friends of a given user (theoretically, a user may have an unlimited number of close friends, but in reality they have no more than tens or hundreds of them).

Pigitos – MapKeysToBag, MapSize and more UDFs to manipulate maps in Apache Pig

| Posted in Programming |


I have created a project called Pigitos, which is a set of tiny, but highly useful Java UDFs for Apache Pig.

Currently, Pigitos contains a couple of UDFs that support working with maps. It provides UDFs to calculate the size of a map and to get the map’s keys (or values, or key/value pairs) as a bag. Such UDFs are very useful when working with dynamically created column qualifiers (that hold some meaningful information that you want to process) in Apache HBase tables.
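To make the map operations above more concrete, here is a minimal Python sketch of what "size of a map" and "keys, values, or pairs as a bag" mean. It is only an illustration and not the Pigitos code (which is written in Java): the function names, the sample row, and the choice to model a bag as a list of tuples are all my assumptions.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for one HBase column family loaded as a map:
# column qualifier -> cell value. The qualifiers (friend names) are created
# dynamically, so they carry information themselves.
friends: Dict[str, Any] = {
    "alice": "2012-01-05",
    "bob": "2012-03-17",
    "carol": "2012-06-30",
}


def map_size(m: Dict[str, Any]) -> int:
    """Number of entries in the map."""
    return len(m)


def map_keys_to_bag(m: Dict[str, Any]) -> List[Tuple[str]]:
    """Keys as a 'bag', modeled here as a list of one-field tuples."""
    return [(key,) for key in m]


def map_values_to_bag(m: Dict[str, Any]) -> List[Tuple[Any]]:
    """Values as a bag of one-field tuples."""
    return [(value,) for value in m.values()]


def map_entries_to_bag(m: Dict[str, Any]) -> List[Tuple[str, Any]]:
    """Key/value pairs as a bag of two-field tuples."""
    return list(m.items())


if __name__ == "__main__":
    print(map_size(friends))
    print(map_keys_to_bag(friends))
```

Once the qualifiers are available as a bag, they can be flattened and processed like any other relation, which is the point of having such UDFs for tables with dynamic qualifiers.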

Apache Pig at The 3rd Meeting of WHUG

| Posted in Community, Presentations, Programming |


I am very happy to share my slides from the presentation about Apache Pig that I gave at the 3rd meeting of the Warsaw Hadoop User Group about 2 weeks ago. The slides cover the motivation for Apache Pig, Pig Latin statements, simple Pig Latin code that processes a Last.fm data set, and many, many pictures. I hope you will find it interesting (please refresh this page if the slides do not appear).

Process a million songs to find exotic ones with Apache Pig

| Posted in Programming |


Recently I found an interesting dataset called the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability and loudness, as well as the artist’s name, popularity, location (a latitude and longitude pair) and many others. There are no music files included, but the links to MP3 previews of the songs at 7digital.com can be easily constructed from the data.

The dataset consists of 339 tab-separated text files. Each file contains about 3000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218 GB, processing it on one machine may take a very long time.
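As a quick sanity check of these numbers, here is a back-of-the-envelope calculation in Python. It uses only the figures quoted above; the variable names are mine, I assume decimal units (1 GB = 1,000,000 KB), and "about 3000 songs" is treated as exactly 3000.

```python
# Figures quoted in the text above.
NUM_FILES = 339
SONGS_PER_FILE = 3000   # "about 3000", treated as exact for the estimate
TOTAL_SIZE_GB = 218     # "around 218 GB"

# Approximate number of songs in the whole dataset.
total_songs = NUM_FILES * SONGS_PER_FILE

# Average size of one input file, in GB.
gb_per_file = TOTAL_SIZE_GB / NUM_FILES

# Average size of one song record (one line of text), in KB.
kb_per_song = TOTAL_SIZE_GB * 1_000_000 / total_songs

if __name__ == "__main__":
    print(f"songs in total:    {total_songs}")
    print(f"avg file size:     {gb_per_file:.2f} GB")
    print(f"avg size per song: {kb_per_song:.0f} KB")
```

The total comes out at roughly a million songs, as the dataset's name promises, and each line of text is a record of a couple of hundred kilobytes on average.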

A much more interesting and efficient approach is to use multiple machines and process the songs in parallel, taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily, by typing a couple of simple commands) or even automatically using Cloudera Manager Free Edition. Both CDH and Cloudera Manager are freely downloadable from the Cloudera website. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lamere of how to use it and pay as little as $1, and here is my presentation about Elastic MapReduce given at the 2nd meeting of the Warsaw Hadoop User Group).

Problem definition

I came up with the idea of processing this dataset to find “exotic” (but still popular) songs. By an exotic song, I simply mean a song recorded by an artist who lives in some foreign country, far away from other artists. The general goal is to discover a couple of fancy and folk songs that are associated with the culture of some country. A funny example could be the song “Koko Koko Euro Spoko” by Jarzębina, which was chosen by Poles to be the official song of the Polish national football team during UEFA EURO 2012 ;)
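One possible way to turn "far away from other artists" into a number is to measure the great-circle distance between artists' latitude/longitude pairs. The Python sketch below uses the haversine formula and the mean distance to all other artists as an "exoticness" score. Treat it as an assumption for illustration only: the scoring rule, the function names, and the sample coordinates are mine and not necessarily what the final solution uses.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

Location = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Location, b: Location) -> float:
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def mean_distance_to_others(artist: Location, others: List[Location]) -> float:
    """Hypothetical 'exoticness' score: average distance to the other artists."""
    if not others:
        return 0.0
    return sum(haversine_km(artist, other) for other in others) / len(others)


if __name__ == "__main__":
    warsaw = (52.23, 21.01)   # approximate coordinates
    london = (51.51, -0.13)
    print(round(haversine_km(warsaw, london)), "km between Warsaw and London")
```

An artist with a high score lives far from most other artists in the dataset. Combining this score with a popularity threshold would match the "exotic, but still popular" idea described above.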