
HBase + Pig + Oozie


Although HBase is mostly used for lookups, bulk reads and writes are sometimes needed as well, and Pig is a convenient way to do them. Here's how to get Pig and HBase talking to each other.

HOW TO

I used CDH 5.8.4 with Kerberos.

1. HBase and Pig

To connect Pig to HBase you need two things:

  • hbase-server.jar – registered in the Pig script
  • hbase-site.xml – either placed under /etc/hbase/conf or added to the Pig classpath

For both loading and storing data use org.apache.pig.backend.hadoop.hbase.HBaseStorage.

 

Loading data with Pig

Specify the HBase columns you want to load (as column_family:column_name) or whole column families (column_family:*), separated by spaces.
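A minimal sketch of such a load could look like this. The jar path, the table name (users) and the column families (info, stats) are placeholders made up for illustration, and the -loadKey option is only needed if you also want the row key back as the first field:

```pig
-- Placeholder path; point it at the hbase-server.jar of your installation.
REGISTER /path/to/hbase-server.jar;

-- 'users', 'info' and 'stats' are hypothetical table and column family names.
-- The column list asks for two single columns plus the whole 'stats' family;
-- '-loadKey true' additionally returns the row key as the first field.
users = LOAD 'hbase://users'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:name info:email stats:*',
            '-loadKey true')
        AS (id:chararray, name:chararray, email:chararray, stats:map[]);

DUMP users;
```

A whole column family arrives as a Pig map keyed by column name, which is why stats is declared as map[] in the schema.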

 

Storing data with Pig

Before storing the data, make sure that the HBase table exists. You need to specify the HBase columns the data should be written to. The first field of your Pig relation is automatically used as the HBase row key, so the column list covers only the remaining fields.
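As a rough illustration, a store might be written as follows. The input file (users.tsv), the table name (users) and the column family (info) are assumptions for the example, and the table with that family has to exist beforehand:

```pig
-- Placeholder path; point it at the hbase-server.jar of your installation.
REGISTER /path/to/hbase-server.jar;

-- Hypothetical tab-separated input with three fields per line.
raw = LOAD 'users.tsv' USING PigStorage('\t')
      AS (id:chararray, name:chararray, email:chararray);

-- 'id' (the first field) becomes the row key and is not listed;
-- 'name' and 'email' go to info:name and info:email respectively.
STORE raw INTO 'hbase://users'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:email');
```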

 

2. ... and Oozie

To create an Oozie workflow manipulating HBase data with Pig you need to:

  • copy hbase-server.jar into the Oozie lib folder (it should be picked up automatically if the folder is set up), or copy it to HDFS and attach it with the action's file property

  • attach hbase-site.xml with the action's file property. It should contain at least the following properties:

    Note: Instead of attaching hbase-site.xml, you may define the mentioned properties in the action's configuration section.

  • specify HBase credentials:

    and assign them to the Pig action with the cred="hbase" attribute.
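Put together, a workflow along these lines could serve as a starting point. It is a sketch under assumptions, not the exact configuration used here: the workflow and action names, the script name (load_store.pig), the ${appPath} variable and the schema version are made up for illustration, and the credential type "hbase" assumes that the HBase credential provider is enabled in your Oozie installation.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="pig-hbase-wf">
  <credentials>
    <!-- Credential name referenced by cred="hbase" on the action below.
         The type assumes Oozie's HBase credential provider is configured. -->
    <credential name="hbase" type="hbase"/>
  </credentials>

  <start to="pig-hbase"/>

  <action name="pig-hbase" cred="hbase">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- Hypothetical script name -->
      <script>load_store.pig</script>
      <!-- Ship the HBase configuration and jar with the action;
           ${appPath} is a placeholder for the HDFS application directory -->
      <file>${appPath}/hbase-site.xml#hbase-site.xml</file>
      <file>${appPath}/hbase-server.jar#hbase-server.jar</file>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Pig action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

If the jar is shipped this way, the Pig script can register it by its local name (REGISTER hbase-server.jar;). If it sits in the Oozie lib folder instead, the second file element can be dropped.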
