Pig with HCatalog + Oozie

HCatalog lets Pig read from and write to tables registered in the Hive metastore. Pig picks up the table structure from the metastore at run time, which makes the data easier to work with. Here’s how to make Pig work with HCatalog and how to run such jobs through Oozie.

HOW TO

I used CDH 5.8.4 with Kerberos.

1. Pig with HCatalog

Pig does not pick up the HCatalog jars automatically. To make them visible you can either:

  • start Pig with the -useHCatalog flag (pig -useHCatalog)
  • or add the URI of your Hive metastore to the PIG_OPTS variable (for example as -Dhive.metastore.uris=thrift://<host>:<port>) and add the following jars to PIG_CLASSPATH:
    • hcatalog-core*.jar
    • hive-hcatalog-pig-adapter*.jar
    • hive-metastore-*.jar
    • libthrift-*.jar
    • hive-exec-*.jar
    • libfb303-*.jar
    • jdo2-api-*.jar
    • slf4j-api-*.jar

That should be enough to use HCatalog with Pig.

Data loading

To load the data use the HCatLoader:
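A minimal sketch of such a load, assuming a hypothetical Hive table mydb.web_logs (the table name and the alias are placeholders, not taken from the original setup). The loader class is given with its full package name as used by the Hive-bundled HCatalog:

```pig
-- Hypothetical table: replace 'mydb.web_logs' with your own database.table.
logs = LOAD 'mydb.web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Show the schema that Pig received from the Hive metastore.
DESCRIBE logs;
```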

Note: The table name must be passed as a single-quoted string, for example 'mydb.mytable'.

The data schema is provided to Pig automatically (although you may specify it yourself if you want to).
HCatLoader itself takes no option to limit the data it loads. If you only want a specific partition, add a FILTER on the partition column right after the LOAD. Pig is smart enough to read just that partition.
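As a sketch of that pattern, assume the hypothetical table mydb.web_logs is partitioned by a string column dt (both names are illustrative only):

```pig
logs = LOAD 'mydb.web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Filter on the partition column directly after the LOAD,
-- so that only the matching partition has to be read.
one_day = FILTER logs BY dt == '20170301';

DUMP one_day;
```

The filter has to reference the partition column and come straight after the load; a filter on a regular column still works, but it does not reduce the amount of data read.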

 

Data storing

To store data, use HCatStorer. You can either fill the whole table or just a single partition.

Note: The table must already exist before you store data into it.

Writing to a non-partitioned table and to a partitioned table looks the same:
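A minimal sketch, assuming a hypothetical target table mydb.web_logs_copy that already exists and has the same columns as the source (table names are placeholders):

```pig
logs = LOAD 'mydb.web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- No arguments to HCatStorer: for a partitioned target table the
-- partition columns must be present in the relation being stored.
STORE logs INTO 'mydb.web_logs_copy' USING org.apache.hive.hcatalog.pig.HCatStorer();
```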

Partitions are determined automatically from the values of the partition columns in the data.

Note: Any data already in the table before the store will be overwritten.

Note: A partition column can’t contain null values!

If you want to overwrite a specific partition or add a new one, pass the partition as an HCatStorer parameter. The parameter must be in single quotes. In this case the data you store must not contain the partition columns, because their values come from the parameter.
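A sketch of storing into one named partition, assuming the hypothetical tables from before, a partition column dt, and two regular columns user_id and url (all of these names are illustrative):

```pig
logs = LOAD 'mydb.web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();
one_day = FILTER logs BY dt == '20170301';

-- Project away the partition column: its value is supplied
-- by the HCatStorer argument instead of by the data.
no_part = FOREACH one_day GENERATE user_id, url;

STORE no_part INTO 'mydb.web_logs_copy'
    USING org.apache.hive.hcatalog.pig.HCatStorer('dt=20170301');
```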

In a table with multiple partition levels you can also specify only the first level. All lower levels are then generated dynamically from the data.
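For illustration, assume a hypothetical table mydb.events_by_month partitioned by year and then month, and a source table mydb.events_raw that holds the columns user_id, url, year and month (none of these names come from the original setup):

```pig
events = LOAD 'mydb.events_raw' USING org.apache.hive.hcatalog.pig.HCatLoader();
events_2017 = FILTER events BY year == '2017';

-- Keep 'month' (resolved dynamically) and drop 'year',
-- which is fixed by the HCatStorer argument.
out = FOREACH events_2017 GENERATE user_id, url, month;

STORE out INTO 'mydb.events_by_month'
    USING org.apache.hive.hcatalog.pig.HCatStorer('year=2017');
```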

Note: Once a partition has been created, Pig will not remove it. You can either overwrite it with Pig or drop it using Hive.

2. and Oozie

To create an Oozie Pig action that uses HCatalog, you need to:

  • define an Oozie credential of type hcat with the hcat.metastore.uri and hcat.metastore.principal properties, and reference it from the Pig action with cred="hcat"
  • attach hive-site.xml to the job, or explicitly set the following parameters in the action's configuration section:
    • hive.metastore.uris
    • hive.metastore.kerberos.principal
  • set the property oozie.action.sharelib.for.pig=pig,hcatalog

Here’s how it looks all together:
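The following workflow is a sketch under stated assumptions, not the original configuration. The workflow and action names, script.pig, and the ${jobTracker}, ${nameNode}, ${metastoreUri} and ${metastorePrincipal} parameters are placeholders expected to be defined in job.properties. It uses the variant with explicit metastore properties instead of an attached hive-site.xml:

```xml
<workflow-app name="pig-hcat-example" xmlns="uri:oozie:workflow:0.5">
    <!-- Credential of type hcat, so the action can talk to a Kerberized metastore -->
    <credentials>
        <credential name="hcat" type="hcat">
            <property>
                <name>hcat.metastore.uri</name>
                <value>${metastoreUri}</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value>${metastorePrincipal}</value>
            </property>
        </credential>
    </credentials>

    <start to="pig-hcat"/>

    <!-- cred="hcat" refers to the credential defined above -->
    <action name="pig-hcat" cred="hcat">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- Put both the Pig and the HCatalog sharelib on the classpath -->
                <property>
                    <name>oozie.action.sharelib.for.pig</name>
                    <value>pig,hcatalog</value>
                </property>
                <property>
                    <name>hive.metastore.uris</name>
                    <value>${metastoreUri}</value>
                </property>
                <property>
                    <name>hive.metastore.kerberos.principal</name>
                    <value>${metastorePrincipal}</value>
                </property>
            </configuration>
            <!-- Placeholder script name; the file is expected in the workflow directory -->
            <script>script.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pig action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The sharelib property is shown in the action configuration here; it can also be set in job.properties if every Pig action in the workflow needs HCatalog.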
