Today, I was using completebulkload to load a large amount of data into HBase. To keep this process well-balanced over the entire cluster (and thus faster), I was loading the data into a table with pre-created regions. Since my bulk-loading process failed in the middle a couple of times due to some misconfiguration, I needed to truncate the table each time.
I have noticed that a command like
hbase(main):017:0> truncate 'table'
will disable, drop and recreate the table with the same name and settings (number of column families, compression, TTL, block size, etc.), but it does not preserve the region boundaries.
I am very happy to share the slides from the presentation about Apache Pig that I gave at the 3rd meeting of the Warsaw Hadoop User Group about 2 weeks ago. They cover the motivation for Apache Pig, basic PigLatin statements, simple PigLatin code that processes the Last.fm data set, and many, many pictures. I hope you will find them interesting.
Posted by Adam Kawa | Posted in Programming | Posted on 29-06-2012
Recently I found an interesting dataset called the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability and loudness, as well as the artist’s name, popularity, location (latitude and longitude pair) and many others. There are no music files included, but links to MP3 song previews at 7digital.com can be easily constructed from the data.
The dataset consists of 339 tab-separated text files. Each file contains about 3000 songs, and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data adds up to around 218 GB, processing it on one machine may take a very long time.
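Since each song is one tab-separated line, reading the files one record at a time is straightforward. Here is a minimal sketch in Python; the function name `iter_song_records` and the sample values are made up for illustration, and the column order of the real files is not assumed here.

```python
import io


def iter_song_records(lines):
    """Yield one list of fields per non-empty, tab-separated line.

    `lines` can be any iterable of strings, e.g. an open file object,
    so the whole file never has to fit in memory.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        yield line.split("\t")


# Hypothetical two-song sample; the real files have many more columns.
sample = io.StringIO("song-1\tSome Title\t120.5\nsong-2\tOther Title\t98.0\n")
records = list(iter_song_records(sample))
```

With a real file you would pass the open file object instead of the `StringIO` sample and pick the columns you need by position.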
A much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily, by typing a couple of simple commands) or even automatically using Cloudera Manager Free Edition. Both CDH and Cloudera Manager are freely downloadable from the Cloudera website. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lamere of how to use it and pay as little as $1, and here is my presentation about Elastic MapReduce given at the 2nd meeting of the Warsaw Hadoop User Group).
I came up with the idea to process this dataset to find “exotic” (but still popular) songs. By an exotic song, I simply mean a song recorded by an artist who lives in some foreign country, far away from other artists. The general goal is to discover a couple of fancy folk songs that are associated with the culture of some country. A funny example could be the song “Koko Koko Euro Spoko” by Jarzębina, which was chosen by Poles to be the official song of the Polish national football team during UEFA EURO 2012 ;)
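To make "far away from other artists" a bit more concrete, here is a rough sketch in Python of one way to measure it from the latitude and longitude pairs: the great-circle (haversine) distance, averaged over the other artists. This is only an illustration under my own assumptions, not necessarily the measure used in the actual processing; the helper names and the mean Earth radius constant are chosen for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_distance_to_others(artist, others):
    """Average distance from one artist's (lat, lon) to a list of other (lat, lon) pairs."""
    if not others:
        return 0.0
    total = sum(haversine_km(artist[0], artist[1], lat, lon) for lat, lon in others)
    return total / len(others)
```

An artist with a large average distance to everyone else would be a candidate for "exotic" in this sense; on the full dataset, the pairwise computation would of course have to be expressed as a parallel job rather than a loop on one machine.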