HBase + Pig + Oozie
Although HBase is mostly used for lookups, sometimes you need to perform bulk reads and writes. Doing that through Pig is very convenient. Here's how to set up communication between Pig and HBase.
HOW TO
I used CDH 5.8.4 with Kerberos.
1. HBase and Pig
To establish Pig and HBase communication you need two things:
- hbase-server.jar – registered in the Pig script
- hbase-site.xml – either placed under /etc/hbase/conf or added to the Pig classpath (PIG_CLASSPATH)
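As a minimal sketch of the first requirement, the jar can be registered at the top of the Pig script. The path below is an assumption (a typical CDH parcel location); adjust it to wherever hbase-server.jar lives on your cluster.

```pig
-- Assumed location of the jar on a CDH parcel install; adjust to your cluster.
REGISTER /opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar;
```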
For both loading and storing data use org.apache.pig.backend.hadoop.hbase.HBaseStorage.
Loading data with Pig
Specify the HBase columns you want to load (as column_family:column_name) or whole column families (column_family:*), separated by spaces. With the -loadKey true option the row key is returned as the first field.
data = LOAD 'hbase://${hbase_namespace}:${hbase_table}'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'key:col1 key:col2 metadata:col1 metadata:col2 metadata:dates',
        '-loadKey true')
    AS (
        record_key:chararray,
        key_col1:chararray,
        key_col2:chararray,
        metadata_col1:chararray,
        metadata_col2:chararray,
        metadata_dates:bag{}
    );
Storing data with Pig
Before storing the data, make sure that the HBase table exists. You need to specify the HBase columns into which the data should be stored. The first field of your Pig relation is automatically treated as the HBase row key, so it is not listed among the columns.
STORE result INTO 'hbase://${hbase_namespace}:${hbase_table}'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'metadata:col1 metadata:col2 metadata:dates');
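To show how loading and storing fit together, here is a rough end-to-end sketch of a bulk read, a simple transformation, and a bulk write. The target_table parameter, the filter, and the UPPER transformation are made up for illustration; only HBaseStorage and its -loadKey option come from the steps above.

```pig
-- Assumes hbase-server.jar has already been registered earlier in the script.

-- Bulk read: the row key comes first because of '-loadKey true'.
src = LOAD 'hbase://${hbase_namespace}:${hbase_table}'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'metadata:col1 metadata:col2',
        '-loadKey true')
    AS (
        record_key:chararray,
        metadata_col1:chararray,
        metadata_col2:chararray
    );

-- Illustrative transformation: drop rows without col1 and upper-case it.
non_empty = FILTER src BY metadata_col1 IS NOT NULL;
result = FOREACH non_empty GENERATE
    record_key,
    UPPER(metadata_col1) AS metadata_col1,
    metadata_col2;

-- Bulk write: record_key (the first field) becomes the row key.
-- target_table is a hypothetical parameter; the table must already exist.
STORE result INTO 'hbase://${hbase_namespace}:${target_table}'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'metadata:col1 metadata:col2');
```

The column list passed to HBaseStorage in the STORE statement has one entry fewer than the relation has fields, because the row key is taken from the first field.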
2. and Oozie
To create an Oozie workflow manipulating HBase data with Pig you need to:
- copy hbase-server.jar into the Oozie lib folder (it should be picked up automatically if the folder is set up), or copy it to HDFS and attach it with the action's file property
- attach hbase-site.xml with the action's file property. It should contain at least the following properties:
<configuration>
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>${hive_metastore_principal}</value>
  </property>
  <property>
    <name>hbase.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>${zookeeper_quorum}</value>
  </property>
  <property>
    <name>hbase.master.kerberos.principal</name>
    <value>${hbase_master_principal}</value>
  </property>
  <property>
    <name>hbase.regionserver.kerberos.principal</name>
    <value>${hbase_region_server_principal}</value>
  </property>
</configuration>
Note: Instead of attaching hbase-site.xml, you can define these properties in the action's configuration section.
- specify HBase credentials:
<credential name="hbase" type="hbase"></credential>
and set them up for the Pig action with the cred="hbase" attribute.
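Putting the steps above together, a workflow definition might look roughly like the sketch below. The workflow and action names, the script name (hbase_job.pig), the HDFS location of the jar, and the jobTracker/nameNode parameters are assumptions for illustration, not values from my setup.

```xml
<workflow-app name="pig-hbase-wf" xmlns="uri:oozie:workflow:0.5">
  <!-- HBase credentials, referenced by the action through cred="hbase" -->
  <credentials>
    <credential name="hbase" type="hbase"></credential>
  </credentials>

  <start to="pig-hbase"/>

  <action name="pig-hbase" cred="hbase">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- hypothetical script name -->
      <script>hbase_job.pig</script>
      <param>hbase_namespace=${hbase_namespace}</param>
      <param>hbase_table=${hbase_table}</param>
      <!-- assumed paths, relative to the workflow application directory -->
      <file>jars/hbase-server.jar#hbase-server.jar</file>
      <file>hbase-site.xml#hbase-site.xml</file>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Pig action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

If hbase-server.jar sits in the workflow's lib folder instead, the first file element can be dropped.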