Recently, I have been refactoring one of our Hive scripts. Because, I introduced significant changes, a query is resource-intensive (process almost two terabytes of data) and … I wanted to iterate fast, I decided to test it locally.
To make it easier, I implemented Beetest – a super simple utility that helps you to test your Apache Hive scripts locally without any Java knowledge.
Posted in Programming | Posted on 03-08-2012
In this post I will demonstrate how you can use Pigitos (a library that contains tiny, but highly useful UDFs for Apache Pig) to implement “friends you may have” feature (inspired by “people you may know”) for a simple real-world application.
Assume that we are launching a social website called “CloseFriendBook” and we have to design a basic HBase table that stores information about the users and their close friends.
Our access pattern is either:
- read a user profile (information like first name, email, age), or
- read the full list of friends of a given user (theoretically, a user may have unlimited number of close friends, but in reality it has no more than tens or hundreds of them).
Posted in Programming | Posted on 02-08-2012
I would like to share some basic tricks to use with Apache HBase shell that I have learned by reading HBase: The Definitive Guide by L.George, HBase in Action by N.Dimiduk and A.Khurana and taking part in Cloudera Training for Apache HBase.
In this post, I will create HBase table, populate it with sample data and scan it. Each step will demonstrate a different technique to achieve the goal.
Create the User table
You can pipe commands to the hbase shell and easily create the table using one single command (without the need to explicitely launch the HBase shell first).
$ echo "create 'user', 'info'" | hbase shell
create 'user', 'info'
0 row(s) in 1.7610 seconds
Posted in Programming | Posted on 30-07-2012
I have already created a project called Pigitos which is a set of tiny, but highly useful Java UDFs for Apache Pig.
Currently, Pigitos contains a couple of UDFs that support working with maps. It provides UDFs to calculate the size of the map and get map’s keys (or values, or key/value pairs) as a bag. Such UDFs are very useful when working with dynamically created column qualifiers (that hold some meaningful information that you want to process) in Apache HBase tables.
I am very happy to share my slides from the presentation about Apache Pig that I gave at the 3rd meeting of Warsaw Hadoop User Group about 2 weeks ago. There is basic info about motivation for Apache Pig, PigLatin statements, simple PigLatin code that processes Last.fm data set and many many pictures. Hope you will find it interesting (please refresh this page if slides do not appear).