HBase + Sqoop + Oozie
Sqoop can be used to import data from the relational database into HBase. Although exporting data from HBase is not natively supported you can still manage it by putting Hive and HCatalog between HBase and Sqoop. Here’s how to do both importing and exporting with Oozie in Kerberised environment.
Contents
HOW TO
I used CDH 5.8.4 with Kerberos.
1. Import data to HBase with Sqoop and Oozie
HBase table doesn’t need to be prepared upfront. You may ask Sqoop to create it for you by specifying some of Sqoop arguments:
- hbase-create-table
- hbase-row-key pointing to column which will be used as HBase row key. If none is specified Sqoop will try to identify one by using primary key column, if it exists.
- column-family defining HBase column family. Only one can be defined though. All imported columns will be put into this column family.
To create an Oozie workflow Sqooping data into HBase you need to:
- specify HBase credentials in the workflow:
12<credential name="hbase" type="hbase"></credential>
- define HBase config parameters by uploading to HDFS and attaching hbase-site.xml as “job-xml” configuration file. It needs to be in the workflow global job config section, otherwise Oozie won’t be able to connect to HBase. hbase-site.xml needs to contain at least following properties:
123456789101112131415161718<configuration><property><name>hbase.security.authentication</name><value>kerberos</value></property><property><name>hbase.zookeeper.quorum</name><value>${zookeeper_quorum}</value></property><property><name>hbase.master.kerberos.principal</name><value>${hbase_master_principal}</value></property><property><name>hbase.regionserver.kerberos.principal</name><value>${hbase_region_server_principal}</value></property></configuration>
Note: Instead of attaching hbase-site.xml you may define mentioned properties in the workflow global config section. You may add some additional Oozie configurations in this section as well.
- attach following jar files with action file property:
- jdbc connection jar i.e. sqljdbc4.jar if you connect to SQLServer
- hbase-client.jar
- hbase-protocol.jar
- hbase-hadoop-compat.jar
Note: Instead of adding those jars to the job, you may put them into oozie libs folder. Oozie will then pick them up automatically. To set up such folder you should specify in Oozie job.properties 2 parameters: oozie.libpath pointing to folder with jars and oozie.use.system.libpath=true.
Here’s a complete Oozie workflow for importing data from SQLServer database to HBase:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 |
<workflow-app name="import_to_hbase" xmlns="uri:oozie:workflow:0.4"> <global> <job-xml>job-config.xml</job-xml> <configuration> <property> <name>hbase.security.authentication</name> <value>kerberos</value> </property> <property> <name>hbase.zookeeper.quorum</name> <value>${zookeeper_quorum}</value> </property> <property> <name>hbase.master.kerberos.principal</name> <value>${hbase_master_principal}</value> </property> <property> <name>hbase.regionserver.kerberos.principal</name> <value>${hbase_region_server_principal}</value> </property> </configuration> </global> <credentials> <credential name="hbase" type="hbase"> </credential> </credentials> <start to="import_data"/> <action name="import_data" cred="hbase"> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <arg>import</arg> <arg>--connect</arg> <arg>jdbc:sqlserver://${sql_server_ip};database=${sql_database}</arg> <arg>--username</arg> <arg>${sql_user_name}</arg> <arg>--password-file</arg> <arg>${dir_to_password_file}</arg> <arg>--table</arg> <arg>${sql_table_name}</arg> <arg>--hbase-table</arg> <arg>${hbase_table_name}</arg> <arg>--column-family</arg> <arg>${hbase_column_family}</arg> <arg>--hbase-row-key</arg> <arg>${row_key_column}</arg> <arg>--hbase-create-table</arg> <arg>--m</arg> <arg>1</arg> <arg>--</arg> <arg>--schema</arg> <arg>${sql_table_schema}</arg> <file>sqljdbc4.jar</file> <file>hbase-client.jar</file> <file>hbase-protocol.jar</file> <file>hbase-hadoop-compat.jar</file> </sqoop> <ok to="end"/> <error to="kill"/> </action> <kill name="kill"> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> |
2. Export data from HBase with Sqoop, Hive and Oozie
Exporting data from HBase with Sqoop is not supported. To workaround this you may create a Hive table on top of HBase table and then export the data using Sqoop and HCatalog.
In order to make it work you first need to create a Hive table mapped to the one in HBase. Your Hive and HBase tables may have different column names (although then mapping needs to be specified with mapping parameter – for details check Hive-HBase post). Then you need to create a table on database side prior to loading. This table needs to have the same column names as the Hive table as it’s required by HCatalog.
To make it work through Oozie you need to:
- specify HCatalog credentials with metastore uri and principal properties
12345678910<credential name="hcat" type="hcat"><property><name>hcat.metastore.uri</name><value>${hcat_metastore_uri}</value></property><property><name>hcat.metastore.principal</name><value>${hcat_metastore_principal}</value></property></credential>
- specify HBase credentials
12<credential name="hbase" type="hbase"></credential>
- attach hive-site.xml with file property. It should contain following properties:
123456789101112<property><name>hive.metastore.uris</name><value>${hive_metastore_uri}</value></property><property><name>hive.metastore.sasl.enabled</name><value>true</value></property><property><name>hive.metastore.kerberos.principal</name><value>${hive_metastore_principal}</value></property>
Note: There’s no way to put those as workflow configuration properties.
- define HBase config parameters by uploading to HDFS and attaching hbase-site.xml as “job-xml” configuration file. It needs to be in the workflow global job config section, otherwise Oozie won’t be able to connect to HBase. hbase-site.xml needs to contain at least following properties:
123456789101112131415161718<configuration><property><name>hbase.security.authentication</name><value>kerberos</value></property><property><name>hbase.zookeeper.quorum</name><value>${zookeeper_quorum}</value></property><property><name>hbase.master.kerberos.principal</name><value>${hbase_master_principal}</value></property><property><name>hbase.regionserver.kerberos.principal</name><value>${hbase_region_server_principal}</value></property></configuration>
Note: Instead of attaching hbase-site.xml you may define mentioned properties in the workflow global config section. You may add some additional Oozie configurations in this section as well.l.
- add skip-dist-cache property to avoid error:
1ERROR org.apache.sqoop.tool.ImportTool - Imported Failed: Can not create a Path from an empty string
- attach following jar files with action file property:
- jdbc connection jar i.e. sqljdbc4.jar if you connect to SQLServer
- hbase-client.jar
- hbase-protocol.jar
- hbase-hadoop-compat.jar
Note: Instead of adding those jars to the job, you may put them into oozie libs folder. Oozie will then pick them up automatically. To set up such folder you should specify in Oozie job.properties 2 parameters: oozie.libpath pointing to folder with jars and oozie.use.system.libpath=true.
Here’s a complete Oozie workflow for exporting data from HBase to SQLServer database using HCatalog:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 |
<workflow-app name="export_from_hbase" xmlns="uri:oozie:workflow:0.4"> <global> <job-xml>job-config.xml</job-xml> <configuration> <property> <name>hbase.security.authentication</name> <value>kerberos</value> </property> <property> <name>hbase.zookeeper.quorum</name> <value>${zookeeper_quorum}</value> </property> <property> <name>hbase.master.kerberos.principal</name> <value>${hbase_master_principal}</value> </property> <property> <name>hbase.regionserver.kerberos.principal</name> <value>${hbase_region_server_principal}</value> </property> </configuration> </global> <credentials> <credential name="hcat" type="hcat"> <property> <name>hcat.metastore.uri</name> <value>${hcat_metastore_uri}</value> </property> <property> <name>hcat.metastore.principal</name> <value>${hcat_metastore_principal}</value> </property> </credential> <credential name="hbase" type="hbase"> </credential> </credentials> <start to="export_data"/> <action name="export_data" cred="hcat,hbase"> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <arg>export</arg> <arg>--connect</arg> <arg>jdbc:sqlserver://${sql_server_ip};database=${sql_database}</arg> <arg>--username</arg> <arg>${sql_user_name}</arg> <arg>--password-file</arg> <arg>${dir_to_password_file}</arg> <arg>--table</arg> <arg>${sql_table_name}</arg> <arg>--hcatalog-database</arg> <arg>${hive_database}</arg> <arg>--hcatalog-table</arg> <arg>${hive_table_name}</arg> <arg>--m</arg> <arg>1</arg> <arg>--skip-dist-cache</arg> <arg>--</arg> <arg>--schema</arg> <arg>${sql_table_schema}</arg> <file>sqljdbc4.jar</file> <file>hive-hbase-handler.jar</file> <file>hbase-client.jar</file> <file>hbase-protocol.jar</file> <file>hbase-hadoop-compat.jar</file> <file>hive-site.xml</file> </sqoop> <ok to="end"/> <error to="kill"/> </action> <kill name="kill"> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> |
HBase table always holds a row key, which then also needs to be present in the Hive table. To omit that during the export use Sqoop export columns parameter specifying set of columns to be exported into the database.