Posted by Adam Kawa | Posted in Programming | Posted on 03-08-2012
In this post I will demonstrate how you can use Pigitos (a library that contains tiny, but highly useful UDFs for Apache Pig) to implement “friends you may have” feature (inspired by “people you may know”) for a simple real-world application.
Assume that we are launching a social website called “CloseFriendBook” and we have to design a basic HBase table that stores information about the users and their close friends.
Our access pattern is either:
read a user profile (information like first name, email, age), or
read the full list of friends of a given user (theoretically, a user may have unlimited number of close friends, but in reality it has no more than tens or hundreds of them).
Posted by Adam Kawa | Posted in Programming | Posted on 29-06-2012
Recently I have found an interesting dataset, called Million Song Dataset (MSD) which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, loudness as well as artist’s name, popularity, localization (latitude and longitude pair) and many others. There are no music files included here, but the links to MP3 songs’ previews at 7digital.com can be easily constructed from the data.
The dataset consists of 339 tab-separated text files. Each file contains about 3000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218 GB, processing it using one machine may take very long time.
Definitely, much more interesting and efficient approach is to use multiple machines and process the songs in parallel fashion by taking advantage of open-source tools from Apache Hadoop Ecosystem (e.g. Apache Pig). If you have some own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop) which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple comands) or even automatically using Cloudera Manager Free Edition. Both CDH and Cloudera Manager are freely downloadable from the Cloudera website. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description writen by Paul Lemere how to use it and pay as low as $1, and here is my presentation about Elastic MapReduce given at the 2nd meeting of Warsaw Hadoop User Group).
I came up with the idea to process this dataset to find “exotic” (but still popular) songs. By an exotic songs, I simply mean a song which is recorded by an artist who lives in some foreign country, far away from other artists. The general goal is to discover a couple of fancy and folk songs which are associated with the culture of some country. A funny example could be the song “Koko Koko Euro Spoko” by Jarzębina which was chosen by Poles to be the official song of Polish national football team during UEFA EURO 2012 ;)