Pigitos in action – Reading HBase column family content in a real-world application

In this post I will demonstrate how you can use Pigitos (a library of tiny but highly useful UDFs for Apache Pig) to implement a “friends you may have” feature (inspired by “people you may know”) for a simple real-world application.

Problem definition

Assume that we are launching a social website called “CloseFriendBook” and we have to design a basic HBase table that stores information about the users and their close friends.

Our access pattern is either:

  • read a user profile (information like first name, email, age), or
  • read the full list of friends of a given user (theoretically, a user may have an unlimited number of close friends, but in practice there are no more than tens or hundreds of them).

Schema design

We can take advantage of a “flat-wide” table schema and store all information related to a given user in one row (but in two separate column families: info and friend). Last but not least, the rowkey can simply be the user’s unique username.

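This leads us to the following table schema: the rowkey is the username, the info column family holds the profile fields (such as first name, email, age), and the friend column family holds one column per friend, with the friend’s username as the column qualifier.

(An interesting presentation by Evan Liu explains how to design an HBase table schema, in contrast to a classical RDBMS. The user-friends schema is described on slides 12-14.)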
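
As a rough illustration of this layout (plain Python, not HBase client code), one row can be pictured as a rowkey mapped to two column families, each of which maps column qualifiers to values. The profile values and the helper names `read_profile` and `read_friends` are made up for this sketch; the friend entries mirror part of the sample data used in the quick example later in this post.

```python
# One HBase row per user, modelled as nested dicts:
# rowkey -> column family -> column qualifier -> value.
# The profile values below are invented for illustration.
user_table = {
    "username1": {
        "info": {"first_name": "Ann", "email": "ann@example.com", "age": "30"},
        # In the friend family the qualifier itself carries data:
        # it is the friend's username.
        "friend": {"username2": "childhood", "username3": "childhood"},
    },
}


def read_profile(table, username):
    """Access pattern 1: read the fixed, known-in-advance info columns."""
    return table[username]["info"]


def read_friends(table, username):
    """Access pattern 2: list friends by reading every qualifier in 'friend'."""
    return sorted(table[username]["friend"])


print(read_profile(user_table, "username1")["email"])
print(read_friends(user_table, "username1"))
```

Both access patterns touch a single row, which is the point of keeping everything about a user under one rowkey.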

Apache Pig and Apache HBase integration

Obviously, we can use Apache Pig to process data that resides in Apache HBase tables. It is relatively easy to read data from or store data in HBase tables: with the HBaseStorage class (an HBase implementation of LoadFunc and StoreFunc), you can do it in a single Pig Latin statement. However, things get a little more complicated if your HBase table has dynamically created column qualifiers that hold meaningful information which you want to process as well (as ours does).

To read the friendship information, we could use the following Pig Latin statement:

User = LOAD 'hbase://user' 
  USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
     'friend:*', '-loadKey true'
  ) AS (username:bytearray, friendMap:map[]);

So far, so good. But imagine that you want to access data from friendMap. This map consists of key/value pairs, and in order to access such a pair you need to know the exact key. Unfortunately, in our case the keys (i.e. the usernames of friends) are created dynamically at run-time, so it is impossible to know them when writing the script. Simply put, we cannot write a statement like this:

UserOneFriend = FOREACH User 
  GENERATE username, friendMap#'What_should_I_put_here?';

In fact, it seems that Pig has no built-in functionality to easily get the full list of key/value pairs (or just keys or values) from a map. There are UDFs like TOBAG or TOTUPLE, but they do not take a map as an input parameter. What we need is a set of simple UDFs that take a map as input and produce a bag that contains all keys (or values, or key/value tuples) from that map (e.g. MapKeysToBag, MapValuesToBag, MapEntriesToBag). Having these UDFs, we may FLATTEN such a bag and generate a relation that contains the unnested keys or values extracted from the map, e.g.:

UserFriend = FOREACH User
  GENERATE username, FLATTEN(MapKeysToBag(friendMap)) AS friendUsername;
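
To make the intended behaviour concrete, here is a small Python sketch (not Pig, and not the Pigitos source) of what extracting the keys into a bag and then applying FLATTEN amounts to: each input row is repeated once per key, and a row whose map is empty yields no output rows at all. The function names are hypothetical stand-ins, and a bag is modelled as a list of one-field tuples.

```python
def map_keys_to_bag(m):
    """Stand-in for a keys-to-bag UDF: a bag is modelled as a list of 1-tuples."""
    return [(key,) for key in m]


def foreach_flatten(rows):
    """Mimics: FOREACH rows GENERATE username, FLATTEN(bag) AS friendUsername."""
    out = []
    for username, friend_map in rows:
        # FLATTEN on a bag emits one output row per tuple in the bag,
        # so an empty bag contributes nothing.
        for (friend,) in map_keys_to_bag(friend_map):
            out.append((username, friend))
    return out


rows = [
    ("username1", {"username2": "childhood", "username3": "childhood"}),
    ("username4", {}),  # no friends: disappears after FLATTEN
]
print(foreach_flatten(rows))
```

Note that a real Pig bag is unordered, so the order of the output rows should not be relied upon.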

Using Pigitos

Naturally, we would use MapKeysToBag() in our “CloseFriendBook” application to implement the “friends you may have” feature (i.e. for a given user, find all users who are friends of his/her friends, but are not (at least not yet) friends of that user). What might such a Pig Latin script look like?

-- Register Pigitos plus the HBase, ZooKeeper and Guava jars
register Pigitos-1.0-SNAPSHOT.jar
register /usr/lib/zookeeper/zookeeper-3.4.3-cdh4.0.1.jar
register /usr/lib/hbase/hbase.jar
register /usr/lib/hbase/lib/guava-11.0.2.jar
DEFINE MapKeysToBag pl.ceon.research.pigitos.pig.udf.MapKeysToBag();
-- Load every column of the given family as a map (qualifier -> value)
User = LOAD 'hbase://$table'
  USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('$colfam:*', '-loadKey true')
  AS (username: chararray, friendMap: map[]);
-- One (username, friendname) row per friendship
UserFriend = FOREACH User
  GENERATE username, FLATTEN(MapKeysToBag(friendMap)) AS friendname;
-- Self-join through a copy of the relation: friends of my friends
UserFriendCopy = FOREACH UserFriend GENERATE *;
UserFriendFriend = JOIN UserFriend BY friendname, UserFriendCopy BY username;
UserMayBeFriend = FOREACH UserFriendFriend
  GENERATE UserFriend::username, UserFriendCopy::friendname AS mayBeFriendname;
-- Keep only the candidate pairs that are not existing friendships
UserGroup = COGROUP UserFriend BY (username, friendname), 
  UserMayBeFriend BY (username, mayBeFriendname);
UserNewFriendGroup = FILTER UserGroup BY COUNT(UserFriend) == 0;
UserNewFriend = FOREACH UserNewFriendGroup GENERATE FLATTEN(UserMayBeFriend);
STORE UserNewFriend INTO '$output';

Quick example

I have tested this example on a Hadoop/HBase cluster with CDH4.0.1 installed (e.g. hbase-0.92.1, pig-0.9.2; see CDH Packaging Information). If you want to run this simple application yourself, please follow these steps:

1. Sample data preparation

$ hbase shell
hbase(main):001:0> create 'user', 'info', 'friend'
hbase(main):002:0> put 'user', 'username1', 'friend:username2', 'childhood'
hbase(main):003:0> put 'user', 'username1', 'friend:username3', 'childhood'
hbase(main):004:0> put 'user', 'username2', 'friend:username3', 'childhood'
hbase(main):005:0> put 'user', 'username2', 'friend:username4', 'childhood'
hbase(main):006:0> put 'user', 'username3', 'friend:username5', 'childhood'
hbase(main):007:0> put 'user', 'username1', 'friend:username5', 'childhood'

2. Running the script

$ wget https://github.com/kawaa/Pigitos/raw/master/Pigitos-1.0-SNAPSHOT.jar
$ pig -p table=user -p colfam=friend -p output=fymk friends_you_may_know.pig

3. Checking the results

$ hadoop fs -cat fymk/part-r-00000
username1	username4
username2	username5
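
To see why the script yields exactly these two rows, the same steps can be traced in plain Python on the six sample friendships. This is only a sketch of the logic (the function name `friends_you_may_have` is made up here); it does not touch HBase or Pig.

```python
from collections import defaultdict

# (username, friendname) pairs, i.e. the rows that the six `put` commands
# produce once the friend map has been flattened.
user_friend = [
    ("username1", "username2"), ("username1", "username3"),
    ("username2", "username3"), ("username2", "username4"),
    ("username3", "username5"), ("username1", "username5"),
]


def friends_you_may_have(pairs):
    # Index the second copy of the relation by username (the JOIN key).
    friends_of = defaultdict(list)
    for user, friend in pairs:
        friends_of[user].append(friend)

    # JOIN UserFriend BY friendname, UserFriendCopy BY username:
    # pair each user with the friends of each of their friends.
    may_be = [
        (user, fof)
        for user, friend in pairs
        for fof in friends_of.get(friend, [])
    ]

    # COGROUP + FILTER COUNT(UserFriend) == 0:
    # keep only candidates that are not already direct friends.
    existing = set(pairs)
    return [pair for pair in may_be if pair not in existing]


print(sorted(friends_you_may_have(user_friend)))
```

The join produces four candidates, (username1, username3), (username1, username4), (username1, username5) and (username2, username5); two of them are already friendships, which leaves the two rows shown above. As far as I can tell, a candidate reachable through several mutual friends would appear once per path, both in this sketch and in the Pig script, because neither removes duplicates.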

More about Pigitos

The ready-to-use jar can be downloaded from the Pigitos GitHub repo. At the time of writing this post, it contains the following UDFs:

  • MapSize – takes a map and returns the number of entries in the map
  • MapKeysToBag – takes a map and produces a bag that contains all keys from that map
  • MapValuesToBag – takes a map and produces a bag that contains all values from that map
  • MapEntriesToBag – takes a map and produces a bag that contains tuples, where each tuple consists of two fields: key and value (each tuple corresponds to one key/value pair from the map)
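
The descriptions above can be paraphrased in Python as follows. This is an approximation of the described behaviour, not the actual Java implementation in Pigitos; a Pig map is modelled as a dict and a bag as a list of tuples.

```python
def map_size(m):
    """Number of entries in the map."""
    return len(m)


def map_keys_to_bag(m):
    """Bag of one-field tuples, one per key."""
    return [(key,) for key in m]


def map_values_to_bag(m):
    """Bag of one-field tuples, one per value."""
    return [(value,) for value in m.values()]


def map_entries_to_bag(m):
    """Bag of two-field (key, value) tuples, one per entry."""
    return list(m.items())


friend_map = {"username2": "childhood", "username3": "childhood"}
print(map_size(friend_map))
print(map_entries_to_bag(friend_map))
```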
