Pig HBase lookups

Pig can read data from and write data to HBase, as I described here. We can also use a Pig UDF to work with data in HBase, for example to retrieve some values for a given key. There is one difficulty though: ZooKeeper, which HBase clients connect through, limits the number of concurrent connections, and if our application exceeds that limit, the whole job will simply fail with the message:

To prevent that, you can adjust the connection limit with the ZooKeeper parameter maxClientCnxns, or enforce a smaller number of tasks running concurrently.

 
Here is a simple Pig UDF that does lookups in HBase. It:

  • connects to HBase
  • retrieves the names of the table’s columns in order to create the output tuple
  • fetches the row from HBase for the key passed to the LookUp function
  • appends the fetched data to the processed tuple

In this setup one connection is established for each running mapper, so with enough mappers running concurrently the job may still go beyond the configured connection limit.
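One way to keep the count at one connection per mapper, rather than one per processed record, is to create the connection lazily and reuse it. The following is only a sketch of that idea in plain Java: FakeConnection is a hypothetical stand-in for a real HBase connection (so the example runs without HBase on the classpath), and all class and method names here are assumptions, not the code from this post.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedConnectionSketch {

    /** Hypothetical stand-in for an HBase connection; counts how many were opened. */
    static final class FakeConnection {
        static final AtomicInteger OPENED = new AtomicInteger();

        FakeConnection() {
            OPENED.incrementAndGet();
        }
    }

    /** Initialization-on-demand holder: the JVM creates CONNECTION once, on first access. */
    private static final class Holder {
        static final FakeConnection CONNECTION = new FakeConnection();
    }

    /** Returns the single connection shared by all calls in this JVM (one mapper task). */
    static FakeConnection connection() {
        return Holder.CONNECTION;
    }

    public static void main(String[] args) {
        // Simulate many records being processed by the same task.
        for (int i = 0; i < 1000; i++) {
            connection();
        }
        System.out.println("connections opened: " + FakeConnection.OPENED.get());
        // prints "connections opened: 1"
    }
}
```

This does not lower the number of connections across the cluster; that still grows with the number of concurrently running tasks.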

 

HOW TO

I used CDH 5.8.4 with Kerberos.

If you work with Maven, there are 4 libraries that need to be imported (or more for more sophisticated code):

 

Pig Code

In order to run a Pig UDF we obviously need some Pig code. It can be run from the Grunt shell or with Oozie. Let’s assume that we want to retrieve some people’s PESEL numbers based on last name, country and id number. First we need to define the LookUp function using its constructor, which establishes the connection to HBase. Then we can call the function like any other Pig UDF.

 

Java Code

Here I created 3 classes: one for handling HBase communication, a second for the UDF logic, and a third that builds the lookup key used to retrieve the data. Each Pig record being processed is combined with the data from the HBase table and a LOOKUP_MATCH flag specifying whether the data was found. I assumed that each HBase row key has the following pattern: country_last_name_id_number. Also, each column whose name ends with “number” has the int datatype, and the rest have the chararray type.
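The column typing rule and the LOOKUP_MATCH flag can be illustrated without Pig or HBase. This is a minimal sketch under several assumptions: the HBase row is represented as a map of already decoded strings, the flag is a boolean appended last, and the column phone_number as well as all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LookupRowSketch {

    /** Pig type name for a column: names ending with "number" are int, the rest chararray. */
    static String pigTypeFor(String columnName) {
        return columnName.endsWith("number") ? "int" : "chararray";
    }

    /** Converts a decoded cell value according to the column naming rule. */
    static Object convert(String columnName, String rawValue) {
        if (rawValue == null) {
            return null;
        }
        if ("int".equals(pigTypeFor(columnName))) {
            return Integer.valueOf(rawValue);
        }
        return rawValue;
    }

    /** Appends the looked-up columns and the LOOKUP_MATCH flag to a copy of the input record. */
    static List<Object> combine(List<Object> input, List<String> columns, Map<String, String> hbaseRow) {
        List<Object> out = new ArrayList<>(input);
        boolean found = hbaseRow != null && !hbaseRow.isEmpty();
        for (String column : columns) {
            out.add(found ? convert(column, hbaseRow.get(column)) : null);
        }
        out.add(found); // LOOKUP_MATCH (assumed to be a boolean here)
        return out;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("pesel", "phone_number"); // phone_number is hypothetical
        List<Object> record = Arrays.asList("Kowalski", "PL", 123);

        Map<String, String> row = new LinkedHashMap<>();
        row.put("pesel", "00000000000");
        row.put("phone_number", "123456789");

        System.out.println(combine(record, columns, row));
        // prints [Kowalski, PL, 123, 00000000000, 123456789, true]
        System.out.println(combine(record, columns, null));
        // prints [Kowalski, PL, 123, null, null, false]
    }
}
```

When no row is found, the looked-up columns are filled with nulls so every output record keeps the same number of fields.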

 
Class for handling HBase connection:

 
LookUp key class – for lookup key creation:
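A minimal sketch of what such a key class might look like, assuming the row key pattern country_last_name_id_number described above. The class name, the field names, the UTF-8 encoding and the choice to keep the id number as a String are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Objects;

final class LookupKeySketch {

    private static final String SEPARATOR = "_";

    private final String country;
    private final String lastName;
    private final String idNumber; // kept as a String so leading zeros survive (assumption)

    LookupKeySketch(String country, String lastName, String idNumber) {
        this.country = Objects.requireNonNull(country, "country");
        this.lastName = Objects.requireNonNull(lastName, "lastName");
        this.idNumber = Objects.requireNonNull(idNumber, "idNumber");
    }

    /** Builds the row key in the form country_last_name_id_number. */
    String toRowKey() {
        return country + SEPARATOR + lastName + SEPARATOR + idNumber;
    }

    /** Row key as bytes, the form in which HBase addresses rows. */
    byte[] toBytes() {
        return toRowKey().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        LookupKeySketch key = new LookupKeySketch("PL", "Kowalski", "123");
        System.out.println(key.toRowKey()); // prints PL_Kowalski_123
    }
}
```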

 

Class with Pig UDF logic:
