Hive table to manipulate HBase data

Hive offers a convenient way to manipulate data stored in HBase. Not only does it provide SQL capabilities, it can also be easily incorporated into workflow processing.

HOW TO

I used CDH 5.8.4 with Kerberos.


To establish a Hive-HBase connection, you need to create a Hive table that points to an HBase table. You can do that with the org.apache.hadoop.hive.hbase.HBaseStorageHandler storage handler.

By default, Hive looks for an HBase table with the same name as the Hive table. If you want the names to differ, specify the mapping with the hbase.table.name parameter. If you want to allow inserts into the HBase table, set the hbase.mapred.output.outputtable parameter as well.

hbase.columns.mapping defines the mapping between Hive and HBase columns (columns are not identified automatically). For each Hive column you need to specify the corresponding HBase column. If a specified HBase column doesn’t exist, Hive treats it as a column of NULL values, which you can use as usual. Your HBase table may have more columns than the Hive table; the unmapped ones are simply not visible to Hive. Specify the columns as a comma-separated list (with no spaces, unless you want a space in a column name) in the following way:

  • :key – for HBase key
  • column_family:column_name – for other columns

Note: Hive needs to have a column mapped to the HBase row key.
You can also specify the storage type by appending # and a type to a column mapping; currently the type can only be string (the default) or binary.
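A minimal sketch of such a table definition is shown here. It assumes a hypothetical existing HBase table named customers with a column family info; the Hive table name hive_customers and all column names are illustrative and not taken from the original setup. The #b suffix requests binary storage for the numeric column, while the other columns use the default string storage.

```sql
-- Hive table pointing to an existing HBase table (all names are hypothetical).
CREATE EXTERNAL TABLE hive_customers (
  id   STRING,   -- mapped to the HBase row key
  name STRING,   -- mapped to info:name, stored as a string (default)
  age  INT       -- mapped to info:age, stored in binary form (#b)
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- One entry per Hive column, in order, separated by commas with no spaces.
  'hbase.columns.mapping' = ':key,info:name,info:age#b'
)
TBLPROPERTIES (
  -- Needed because the Hive and HBase table names differ.
  'hbase.table.name' = 'customers',
  -- Allows inserts from Hive into the HBase table.
  'hbase.mapred.output.outputtable' = 'customers'
);
```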
 
To create a new, empty HBase table, use CREATE TABLE instead of CREATE EXTERNAL TABLE.
 
Now that you have a Hive table, you can run queries that manipulate the HBase data.
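For illustration, reading and writing might look like the sketch below. It assumes the hypothetical hive_customers table (columns id, name, age) is mapped to HBase, and that staging_customers is an ordinary Hive table with matching columns; neither name comes from the original post.

```sql
-- Read from HBase through the Hive table.
SELECT id, name, age
FROM hive_customers
WHERE id = 'c001';

-- Write to HBase: each selected row becomes a put on the mapped HBase table.
INSERT INTO TABLE hive_customers
SELECT id, name, age
FROM staging_customers;
```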

Note: If loading your data takes too long, you may consider running set hive.hbase.wal.enabled=false. However, disabling the write-ahead log reportedly means you may lose some data if HBase fails.
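As a sketch of where that setting would go: it is set for the Hive session, so it would be issued before the load statement in the same session. The table names are again hypothetical.

```sql
-- Trade durability for load speed: skip the HBase write-ahead log in this session.
SET hive.hbase.wal.enabled=false;

INSERT INTO TABLE hive_customers
SELECT id, name, age
FROM staging_customers;
```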


Note: Hive can access only the most recent version of the HBase data (the one with the latest timestamp); there is no way to retrieve older versions.
 

If a Hive column is of type map, you can map it to a whole HBase column family.
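A sketch of such a mapping, assuming a hypothetical HBase table customers with a column family named attrs: the family is referenced with a trailing colon and no column name, and each HBase column qualifier in that family becomes a key of the Hive map.

```sql
-- Map an entire column family to one Hive MAP column (all names are hypothetical).
CREATE EXTERNAL TABLE hive_customer_attrs (
  id    STRING,
  attrs MAP<STRING,STRING>   -- map keys are the HBase column qualifiers
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,attrs:'
)
TBLPROPERTIES (
  'hbase.table.name' = 'customers'
);

-- Look up a single qualifier from the family.
SELECT id, attrs['email']
FROM hive_customer_attrs;
```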

Because HBase doesn’t allow duplicate row keys, inserting records with a key that already exists overwrites the stored values.

 

With Oozie

To run a Hive-HBase action through Oozie, just use the Hive action with Hive credentials.
