Hakuna MapData! » milion song dataset

How to build Hadoop cluster on Amazon Elastic MapReduce using Karashpere Studio for EMR

| Posted in Presentations, Software |


Here is my video (15 minute long, in Polish), that shows how to create Hadoop cluster on Amazon Elastic MapReduce and use Karashpere Studio for EMR (a plugin for Eclipse). It demonstrates how to rent 10 of EC2 small instances to run exemplary calculation that process ~220GB of data in less then one hour, what costs $1.25.

Process a million songs to find exotic ones with Apache Pig

| Posted in Programming |


Recently I have found an interesting dataset, called Million Song Dataset (MSD) which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, loudness as well as artist’s name, popularity, localization (latitude and longitude pair) and many others. There are no music files included here, but the links to MP3 songs’ previews at 7digital.com can be easily constructed from the data.

The dataset consists of 339 tab-separated text files. Each file contains about 3000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218 GB, processing it using one machine may take very long time.

Definitely, much more interesting and efficient approach is to use multiple machines and process the songs in parallel fashion by taking advantage of open-source tools from Apache Hadoop Ecosystem (e.g. Apache Pig). If you have some own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop) which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple comands) or even automatically using Cloudera Manager Free Edition. Both CDH and Cloudera Manager are freely downloadable from the Cloudera website. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description writen by Paul Lemere how to use it and pay as low as $1, and here is my presentation about Elastic MapReduce given at the 2nd meeting of Warsaw Hadoop User Group).

Problem definition

I came up with the idea to process this dataset to find “exotic” (but still popular) songs. By an exotic songs, I simply mean a song which is recorded by an artist who lives in some foreign country, far away from other artists. The general goal is to discover a couple of fancy and folk songs which are associated with the culture of some country. A funny example could be the song “Koko Koko Euro Spoko” by Jarzębina which was chosen by Poles to be the official song of Polish national football team during UEFA EURO 2012 ;)