Posted in Programming | Posted on 03-08-2012
In this post I will demonstrate how you can use Pigitos (a library that contains tiny but highly useful UDFs for Apache Pig) to implement a “friends you may have” feature (inspired by “people you may know”) for a simple real-world application.
Our access pattern is either:
- read a user profile (information like first name, email, age), or
- read the full list of friends of a given user (theoretically, a user may have an unlimited number of close friends, but in reality there are no more than tens or hundreds of them).
We can take advantage of a “flat-wide” table schema and store all information related to a given user in one row (but in two separate column families: info and friend). Last but not least, the rowkey can simply be the user’s unique username.
This leads us to the following table schema:
(Here is an interesting presentation by Evan Liu about how to design an HBase table schema, in contrast to a classical RDBMS. The user-friends schema is described on slides 12-14.)
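Since a schema like this is easiest to grasp by example, here is a minimal Python sketch that pictures one row of the table as nested dictionaries. It only models the layout described above; the sample values and the helper names `read_profile` and `read_friends` are illustrative assumptions, not part of HBase or of the original application.

```python
# Illustrative model of the flat-wide "user" table:
# rowkey -> column family -> column qualifier -> value.
# All values below are made-up sample data.
user_table = {
    "username1": {
        "info": {"firstname": "John", "email": "john@example.com", "age": "30"},
        # In the "friend" family each column qualifier is a friend's username;
        # the cell value can hold a note about the friendship.
        "friend": {"username2": "childhood", "username3": "childhood"},
    },
}


def read_profile(table, username):
    """First access pattern: read the profile of one user."""
    return table[username]["info"]


def read_friends(table, username):
    """Second access pattern: read the full friend list of one user."""
    return sorted(table[username]["friend"])


print(read_profile(user_table, "username1")["email"])
print(read_friends(user_table, "username1"))
```

Both access patterns touch a single row and a single column family, which is the point of keeping everything about a user under one rowkey.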
Apache Pig and Apache HBase integration
Obviously, we can use Apache Pig to process data that resides in Apache HBase tables. It is relatively easy to read data from or store data in HBase tables if you use the HBaseStorage class (an HBase implementation of Pig's load and store functions).
To read the friendship information, we could use the following Pig Latin statement:
User = LOAD 'hbase://user' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'friend:*', '-loadKey true' ) AS (username:bytearray, friendMap:map[]);
So far, so good. But imagine that you want to access data from friendMap. This map consists of key/value pairs, and in order to access such a pair you need to know the exact key. Unfortunately, in our case the keys (i.e. usernames of friends) are created dynamically at run-time, so it is impossible to know them when writing the script. Put simply, we cannot write an instruction like this:
UserOneFriend = FOREACH User GENERATE username, friendMap#'What_should_I_put_here?';
In fact, it seems that Pig has no built-in functionality to easily get the full list of key/value pairs (or just keys or values) from a map. There are UDFs like TOBAG or TOTUPLE, but they do not take a map as an input parameter. What we need is a set of simple UDFs that take a map as input and produce a bag that contains all keys (or values, or key/value pairs) from that map. With such a UDF, we could write:
UserFriend = FOREACH User GENERATE username, FLATTEN(MapKeysToBag(friendMap)) AS friendUsername;
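To make the effect of that statement concrete, here is a rough Python analogue of applying a keys-to-bag function followed by FLATTEN. It is only a sketch of the semantics, not how Pig or Pigitos implements them; `map_keys_to_bag` and `flatten_friends` are hypothetical names, and a Pig bag is modelled as a list of one-field tuples.

```python
def map_keys_to_bag(m):
    """Return the keys of a map as a 'bag': a list of one-field tuples."""
    return [(key,) for key in sorted(m)]  # sorted only to keep output stable


def flatten_friends(users):
    """Mimic FOREACH ... GENERATE username, FLATTEN(bag): one row per key."""
    rows = []
    for username, friend_map in users:
        for (friend_username,) in map_keys_to_bag(friend_map):
            rows.append((username, friend_username))
    return rows


users = [
    ("username1", {"username2": "childhood", "username3": "childhood"}),
    ("username2", {}),  # an empty bag contributes no rows after flattening
]
print(flatten_friends(users))
# [('username1', 'username2'), ('username1', 'username3')]
```

The key point is that no friend username has to be known in advance: the keys are discovered from the map itself.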
Naturally, we would use MapKeysToBag from Pigitos. The complete script (friends_you_may_know.pig) looks as follows:
register Pigitos-1.0-SNAPSHOT.jar
register /usr/lib/zookeeper/zookeeper-3.4.3-cdh4.0.1.jar
register /usr/lib/hbase/hbase.jar
register /usr/lib/hbase/lib/guava-11.0.2.jar

DEFINE MapKeysToBag pl.ceon.research.pigitos.pig.udf.MapKeysToBag();

-- load each user's row key together with the whole friend column family as a map
User = LOAD 'hbase://$table'
  USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('$colfam:*', '-loadKey true')
  AS (username: chararray, friendMap: map[]);

-- one (username, friendname) row per friendship
UserFriend = FOREACH User GENERATE username, FLATTEN(MapKeysToBag(friendMap)) AS friendname;
UserFriendCopy = FOREACH UserFriend GENERATE *;

-- friends of friends
UserFriendFriend = JOIN UserFriend BY friendname, UserFriendCopy BY username;
UserMayBeFriend = FOREACH UserFriendFriend GENERATE UserFriend::username, UserFriendCopy::friendname AS mayBeFriendname;

-- keep only candidates who are not friends already
UserGroup = COGROUP UserFriend BY (username, friendname), UserMayBeFriend BY (username, mayBeFriendname);
UserNewFriendGroup = FILTER UserGroup BY COUNT(UserFriend) == 0;
UserNewFriend = FOREACH UserNewFriendGroup GENERATE FLATTEN(UserMayBeFriend);

STORE UserNewFriend INTO '$output';
I have tested this example on a Hadoop/HBase cluster with CDH4.0.1 installed (e.g. hbase-0.92.1, pig-0.9.2, see CDH Packaging Information). If you want to run this simple application, please follow these steps:
1. Sample data preparation
$ hbase shell
hbase(main):001:0> create 'user', 'info', 'friend'
hbase(main):002:0> put 'user', 'username1', 'friend:username2', 'childhood'
hbase(main):003:0> put 'user', 'username1', 'friend:username3', 'childhood'
hbase(main):004:0> put 'user', 'username2', 'friend:username3', 'childhood'
hbase(main):005:0> put 'user', 'username2', 'friend:username4', 'childhood'
hbase(main):006:0> put 'user', 'username3', 'friend:username5', 'childhood'
hbase(main):007:0> put 'user', 'username1', 'friend:username5', 'childhood'
2. Running the script
$ wget https://github.com/kawaa/Pigitos/raw/master/Pigitos-1.0-SNAPSHOT.jar
$ pig -p table=user -p colfam=friend -p output=fymk friends_you_may_know.pig
3. Checking the results
$ hadoop fs -cat fymk/part-r-00000
username1 username4
username2 username5
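To see why exactly these two pairs come out, the script's logic can be traced in plain Python on the same sample friendships. This is only a sketch of the JOIN and COGROUP/FILTER steps under the assumption that friendships are stored in one direction, as in the sample data; the function name `friends_you_may_have` is made up for this illustration.

```python
from collections import defaultdict

# Same friendships as the sample data: (username, friendname) pairs.
user_friend = [
    ("username1", "username2"),
    ("username1", "username3"),
    ("username2", "username3"),
    ("username2", "username4"),
    ("username3", "username5"),
    ("username1", "username5"),
]


def friends_you_may_have(pairs):
    """Return friends of friends that are not already direct friends."""
    # Index the "copy" relation by username, which is the JOIN key.
    friends_of = defaultdict(list)
    for username, friendname in pairs:
        friends_of[username].append(friendname)

    # JOIN UserFriend BY friendname, UserFriendCopy BY username
    may_be_friend = [
        (username, candidate)
        for username, friendname in pairs
        for candidate in friends_of.get(friendname, [])
    ]

    # COGROUP + FILTER COUNT(UserFriend) == 0: keep only candidates
    # that have no matching existing friendship.
    existing = set(pairs)
    return [pair for pair in may_be_friend if pair not in existing]


print(friends_you_may_have(user_friend))
# [('username1', 'username4'), ('username2', 'username5')]
```

Like the Pig script, this sketch does not deduplicate candidates reached through several common friends, and it would suggest users to themselves if friendships were stored in both directions.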
More about Pigitos
The ready-to-use jar can be downloaded from the Pigitos GitHub repo. At the time of writing this post, it contains the following UDFs:
- MapSize: takes a map and returns the number of entries in the map
- MapKeysToBag: takes a map and produces a bag that contains all keys from that map
- MapValuesToBag: takes a map and produces a bag that contains all values from that map
- MapEntriesToBag: takes a map and produces a bag that contains tuples, where each tuple consists of two fields: key and value (each tuple corresponds to one key/value pair from the map)
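As a quick reference, the behaviour described in this list can be approximated in Python as follows. These functions are hypothetical stand-ins that only mirror the descriptions above (they are not the Pigitos Java code), with a bag modelled as a list of tuples.

```python
def map_size(m):
    """Number of entries in the map."""
    return len(m)


def map_keys_to_bag(m):
    """Bag of one-field tuples, one per key."""
    return [(key,) for key in m]


def map_values_to_bag(m):
    """Bag of one-field tuples, one per value."""
    return [(value,) for value in m.values()]


def map_entries_to_bag(m):
    """Bag of two-field tuples: (key, value) for each entry."""
    return [(key, value) for key, value in m.items()]


friend_map = {"username2": "childhood", "username3": "school"}
print(map_size(friend_map))            # 2
print(map_keys_to_bag(friend_map))     # [('username2',), ('username3',)]
print(map_values_to_bag(friend_map))   # [('childhood',), ('school',)]
print(map_entries_to_bag(friend_map))  # [('username2', 'childhood'), ('username3', 'school')]
```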