Hive table to manipulate HBase data
Hive gives a nice option to manipulate data stored in HBase. Not only does it provide SQL capabilities, but it can also be easily incorporated into workflow processing.
HOW TO
I used CDH 5.8.4 with Kerberos.
In order to establish a Hive-HBase connection, you need to create a Hive table pointing to an HBase one. You can achieve that by using org.apache.hadoop.hive.hbase.HBaseStorageHandler.
CREATE EXTERNAL TABLE hive_table(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:value")
TBLPROPERTIES ("hbase.table.name" = "hbase_table",
               "hbase.mapred.output.outputtable" = "hbase_table");
By default Hive will look for an HBase table with the same name as the Hive table. If you want the names to differ, you need to specify the mapping with the hbase.table.name parameter. If you want to allow inserts into the HBase table, you need to set the hbase.mapred.output.outputtable parameter as well.
hbase.columns.mapping defines the mapping between Hive and HBase columns (columns won't be matched automatically). For each Hive column you need to specify the corresponding HBase column. If the specified HBase column doesn't exist, Hive will treat it as a column of nulls, which you can use normally. Your HBase table may have more columns than the Hive table; those not mapped simply won't be visible to Hive. Specify the columns separated by commas (and no spaces, unless you want a space in your column name) as follows:
- :key – for HBase key
- column_family:column_name – for other columns
Note: Hive needs to have a column mapped to the HBase row key.
You can also specify the storage type with a # suffix, which currently can only be string (the default) or binary.
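For illustration, a mapping entry with the binary type suffix might look like this (a sketch; the table and column names hbase_counts and cf1:count are hypothetical, not from the setup above):

```sql
-- cf1:count is read/written as a binary-encoded bigint rather than a string.
CREATE EXTERNAL TABLE hive_counts(key int, count bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:count#binary")
TBLPROPERTIES ("hbase.table.name" = "hbase_counts");
```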
To create a new, empty HBase table, use CREATE TABLE instead of CREATE EXTERNAL TABLE.
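A minimal sketch of such a managed table (the names hive_managed and new_hbase_table are illustrative): dropping the EXTERNAL keyword makes Hive create, and own, the underlying HBase table.

```sql
-- Hive creates the HBase table new_hbase_table itself;
-- dropping hive_managed will also delete it.
CREATE TABLE hive_managed(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:value")
TBLPROPERTIES ("hbase.table.name" = "new_hbase_table");
```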
Now that you have a Hive table, you can execute the queries manipulating HBase data:
INSERT OVERWRITE TABLE hive_table
SELECT * FROM other_hive_table WHERE foo=111;
Note: If loading your data takes too long, you may consider setting hive.hbase.wal.enabled=false. However, it's stated that you may then lose some data in case of an HBase failure.
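In a Hive session or script this could look like the following (a sketch, reusing the table names from the earlier example):

```sql
-- Skip the HBase write-ahead log: faster bulk loads,
-- but data written during an HBase failure may be lost.
SET hive.hbase.wal.enabled=false;

INSERT OVERWRITE TABLE hive_table
SELECT * FROM other_hive_table WHERE foo=111;
```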
Note: Hive only has access to the most recent HBase data (the cells with the latest timestamp); there's no way to retrieve older versions.
If your Hive column is of type map, you may link it to a whole HBase column family:
CREATE TABLE hive_table (value map<string,string>, row_key int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:,:key");

INSERT OVERWRITE TABLE hive_table
SELECT map(data, name), id
FROM other_hive_table
WHERE id=98 OR id=100;
As HBase doesn't allow duplicate row keys, if you insert records with the same key, the values will be overwritten.
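To illustrate with hypothetical values: both statements below target the same row key, so the second insert replaces what the first one wrote.

```sql
-- Both rows map to HBase row key 1; the second insert
-- overwrites the value stored by the first.
INSERT INTO TABLE hive_table SELECT map('a', 'first'), 1;
INSERT INTO TABLE hive_table SELECT map('a', 'second'), 1;
```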
With Oozie
In order to run Hive-HBase action through Oozie, just use the Hive action with Hive credentials:
<workflow-app name="hive_hbase" xmlns="uri:oozie:workflow:0.4">
    <global>
        <job-xml>job-conf.xml</job-xml>
    </global>
    <credentials>
        <credential name="hive2" type="hive2">
            <property>
                <name>hive2.jdbc.url</name>
                <value>${hive2_jdbc_url}</value>
            </property>
            <property>
                <name>hive2.server.principal</name>
                <value>${hive2_server_principal}</value>
            </property>
        </credential>
    </credentials>
    <start to="hive_hbase"/>
    <action name="hive_hbase" cred="hive2">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.action.sharelib.for.hive</name>
                    <value>hive2</value>
                </property>
                <property>
                    <name>oozie.launcher.action.main.class</name>
                    <value>org.apache.oozie.action.hadoop.Hive2Main</value>
                </property>
            </configuration>
            <script>hive_hbase_query.sql</script>
            ....
        </hive>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>