Skip to content

Sqoop with HCatalog and Oozie

Sqoop with HCatalog and Oozie

Sqoop may use HCatalog to import and export data directly into/from Hive tables. It uses HCatalog to read table’s structure, data formats, partitions and then imports/exports data appropriately. It’s very useful combination for efficient data move, but requires matching column names on both sides. Here’s how to make Sqoop with HCatalog work through Oozie.

Contents

HOW TO

I used CDH 5.8.4 with Kerberos.

 

Import

In order to import data from relational database into Hive metastore with Sqoop using HCatalog first you need to create a table, i.e. with Hive. Table can be internal or external, partitioned or not, but the column names need to match those on the database side:

Note: By design columns in database table must match the ones in created table! It’s because HCatalog does the column matching by names. If some columns won’t match you’ll get error message:

 

Note: Original database column names are converted to lowercase and ‘-‘ characters are replaced with “_” by Sqoop. This has to be taken into consideration while creating Hive table.

 

Note: Currently Parquet format is not supported for Sqoop with HCatalog – there’s an issue pending.

 

Alternatively to creating a Hive table, you may choose to delegate Sqoop to create it for you with ‐‐create-hcatalog-table option.

 
If you want to do the import with Oozie use the Sqoop import action with:

  • HCatalog credentials
  • ‐‐hcatalog-database and ‐‐hcatalog-table options
  • jdbc jar file
  • hive-site.xml file

Your hive-site.xml should contain at least following properties:

  • Add also ‐‐skip-dist-cache option to avoid the error:

Note: Don’t specify ‐‐target-dir option in this case. Data will be stored in the location to which table is pointing (default is the metastore).

 

Note: ‐‐hive-overwrite option is not supported with HCatalog. As well as those specifying format:
‐‐as-sequencefile
‐‐as-avrodatafile
‐‐as-parquetfile

 

Note: Supported options are:
‐‐map-column-hive – to map a database column to HCatalog column type
‐‐hive-partition-key and ‐‐hive-partition-value – for specifying partition key and value, but here only one static partition is allowed in contrast to HCatalog partitioning options!
 
Here’s how an example Oozie workflow may look like:

 
Writing into specific partitions

If your table is partitioned and you want to import data only into selected partitions, you can do that with ‐‐hcatalog-partition-keys and ‐‐hcatalog-partition-values options and proper conditions in query where clause. However remember that imported data doesn’t overwrite existing partitions, it gets appended.
You can specify multiple partition levels but you can’t write into more than one specified partition at the time.

Supported combinations:

  • Multilevel partitioning:

2 levels of partitions in the table: partition1, partition2
‐‐hcatalog-partition-keys: partition1, partition2
‐‐hcatalog-partition-values: value1, value2
Import condition: where partition1 = value1 and partition2= value2

  • Specifying one level partition for 2 levels partitioned table:

2 levels of partitions in the table: partition1, partition2
‐‐hcatalog-partition-keys: partition1
‐‐hcatalog-partition-values: value1
Import condition: where partition1 = value1

Not supported combinations:

  • Specifying 2 values for 1 partition:

1 level of partition in the table: partition1
‐‐hcatalog-partition-keys: partition1
‐‐hcatalog-partition-values: value1, value2
Import condition: where partition1 = value1 or parition1 = value2

Although you may achieve same result by just specifying the table and adding where clause to the query:
‐‐hcatalog-table: table_name
Import condition: where partition1 = value1 or parition1 = value2

Here’s how an example Oozie workflow may look like:

 

Export

Before doing Sqoop export make sure that the table is present on the database side.

Note: By design columns in database table must match the ones in created table! It’s because HCatalog does the column matching by names. If some columns won’t match you’ll get error message:

 

Note: Original database column names are converted to lowercase and ‘-‘ characters are replaced with “_” by Sqoop. This has to be taken into consideration while creating Hive table.

For the Sqoop export action with HCatalog in Oozie, in addition to normal Oozie Sqoop action stuff, you need to add:

  • HCatalog credentials
  • jdbc jar file
  • hive-site.xml file

Your hive-site.xml should contain at least following properties:

  • Add also ‐‐skip-dist-cache option to avoid the error:
 

An example Oozie workflow may look like this:

Leave a Reply

Your email address will not be published. Required fields are marked *