Pig with HCatalog + Oozie
HCatalog lets Pig read from and write to Hive tables through the Hive metastore. Pig picks up the table structure dynamically, which makes data manipulation easier. Here's how to make Pig work with HCatalog and how to run such jobs through Oozie.
HOW TO
I used CDH 5.8.4 with Kerberos.
1. Pig with HCatalog
Pig does not automatically pick up HCatalog jars. To make them visible you can either:
- use a flag in the Pig command
```shell
pig -useHCatalog
```
- or add the URI of your Hive metastore to the PIG_OPTS variable and add the following jars to PIG_CLASSPATH:
- hcatalog-core*.jar
- hive-hcatalog-pig-adapter*.jar
- hive-metastore-*.jar
- libthrift-*.jar
- hive-exec-*.jar
- libfb303-*.jar
- jdo2-api-*.jar
- slf4j-api-*.jar
```shell
export PIG_OPTS=-Dhive.metastore.uris=thrift://<hostname>:<port>
```
That should be enough to use HCatalog with Pig.
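For the second option, the setup might look like the sketch below. The jar locations are assumptions — they vary by distribution (on CDH they typically sit under the parcels directory), so adjust the paths to your cluster:

```shell
# Assumed install locations -- adjust to your distribution's layout.
HCAT_HOME=/usr/lib/hive-hcatalog
HIVE_HOME=/usr/lib/hive

# Make the HCatalog and Hive jars visible to Pig.
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-core-*.jar:\
$HCAT_HOME/share/hcatalog/hive-hcatalog-pig-adapter-*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:\
$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:\
$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*.jar:\
$HIVE_HOME/lib/slf4j-api-*.jar

# Point Pig at the Hive metastore.
export PIG_OPTS=-Dhive.metastore.uris=thrift://<hostname>:<port>
```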
Data loading
To load the data use the HCatLoader:
```pig
data = LOAD '${database}.${table_name}' USING org.apache.hive.hcatalog.pig.HCatLoader();
```
Note: The table name needs to be in single quotes.
The data schema is provided to Pig automatically (although you may specify it explicitly if you want to).
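As a quick check, you can ask Pig to print the schema that HCatalog handed over (the database and table names below are placeholders):

```pig
-- Load a Hive table through HCatalog; the schema comes from the metastore.
data = LOAD 'mydb.mytable' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Prints the schema derived from the Hive table definition.
DESCRIBE data;
```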
There's no way to limit the data loaded through HCatLoader's arguments. If you're interested in loading just a specific partition, add a FILTER statement right after the load; Pig is smart enough to push the filter down and read only that partition.
```pig
A = LOAD '${database}.${table_name}' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = FILTER A BY datestamp == '20170401';
```
Data storing
For storing the data use HCatStorer. You can fill either the whole table or just a single partition.
Note: The table needs to exist prior to storing.
Writing to non-partitioned and partitioned table looks the same:
```pig
STORE final_data INTO '${database}.${table_name}' USING org.apache.hive.hcatalog.pig.HCatStorer();
```
Partitions will be automatically determined.
Note: Any data present in the table before storing will get overwritten.
Note: Partitioning column can’t have null values!
If you want to overwrite a specific partition or add a new one, pass it as an argument to HCatStorer. The argument needs to be in single quotes ''. Remember that in this case the data itself can't contain the partition columns.
```pig
STORE final_data INTO '${database}.${table_name}' USING org.apache.hive.hcatalog.pig.HCatStorer('partition1=aa,partition2=bb');
```
In a multi-level partitioned table you can also specify just the first level; all remaining levels of partitions will then be generated dynamically.
Note: Remember that once a partition is created Pig won’t make it disappear. You can either overwrite it with Pig or drop it using Hive.
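For illustration, assuming a hypothetical table partitioned by datestamp and then country, you could pin the date and let the country sub-partitions be created dynamically (table and column names here are made up):

```pig
-- Store into a fixed datestamp partition; 'country' sub-partitions are
-- determined dynamically from the data, which must still contain the
-- 'country' column but not the 'datestamp' column.
STORE final_data INTO 'mydb.mytable'
    USING org.apache.hive.hcatalog.pig.HCatStorer('datestamp=20170401');
```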
2. Pig with HCatalog in Oozie
To create a Pig action in Oozie that uses HCatalog you need to:
- specify Oozie hcat credentials with the Hive metastore URI and principal properties:

```xml
<credentials>
    <credential name="hcat" type="hcat">
        <property>
            <name>hcat.metastore.uri</name>
            <value>${hcat_metastore_uri}</value>
        </property>
        <property>
            <name>hcat.metastore.principal</name>
            <value>${hcat_metastore_principal}</value>
        </property>
    </credential>
</credentials>
```
- attach hive-site.xml to the job or explicitly specify the following parameters in the action's configuration section:
- hive.metastore.uris
- hive.metastore.kerberos.principal
- specify parameter: oozie.action.sharelib.for.pig = pig,hcatalog
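The `hcat_metastore_uri` and `hcat_metastore_principal` variables referenced in the credentials are typically supplied through the job properties file. A sketch of what that might contain — every host name and path below is a placeholder, not a value from this setup:

```
# job.properties -- host names and paths are placeholders.
nameNode=hdfs://namenode.example.com:8020
jobTracker=yarnRM
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/me/pig-hcat-wf

# Referenced by the credentials section of the workflow.
hcat_metastore_uri=thrift://metastore.example.com:9083
hcat_metastore_principal=hive/_HOST@EXAMPLE.COM
```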
Here’s how it looks all together:
```xml
<workflow-app name="hcat action" xmlns="uri:oozie:workflow:0.4">
    <global>
        <job-xml>job-conf.xml</job-xml>
    </global>
    <credentials>
        <credential name="hcat" type="hcat">
            <property>
                <name>hcat.metastore.uri</name>
                <value>${hcat_metastore_uri}</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value>${hcat_metastore_principal}</value>
            </property>
        </credential>
    </credentials>
    <start to="pig action"/>
    <action name="pig action" cred="hcat">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.action.sharelib.for.pig</name>
                    <value>pig,hcatalog</value>
                </property>
            </configuration>
            <script>pig_script.pig</script>
            ...
            <file>hive-site.xml</file>
        </pig>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```