Pig Java UDF

We can define Pig UDFs in several languages: Java, Jython, JavaScript, Ruby, Groovy and Python. Java currently has the most extensive support, so I’ll stick to it in this post.
 
When it comes to Pig Java UDFs, we can distinguish a few types, depending on the kind of operation we want to perform. But let’s start with the common part. Once the UDF is ready, we need to make its jar available to the Pig script with a REGISTER statement.

Remember to include the package name when calling the UDF, or give it an alias with a DEFINE statement.

 

HOW TO

Let’s take a look at the UDF types by walking through different use cases:

 

1. One-parameter function, working on a single row

Such a function takes one value and replaces it with another value.

To define such a UDF, you need to extend the EvalFunc class and implement the logic in its exec method. The type parameter in the angle brackets after EvalFunc (for example EvalFunc<String>) specifies the data type the function returns.


Example:
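A minimal sketch of such a UDF, assuming a hypothetical class `ToUpper` packaged in a hypothetical `myudfs.jar`. The `Tuple`, `EvalFunc` and `SketchTuple` types at the top are small stand-ins of my own, present only so that the sketch compiles without Pig on the classpath. In a real UDF you would delete them and import `org.apache.pig.EvalFunc` and `org.apache.pig.data.Tuple` instead.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// --- Stand-ins (assumption: only here so the sketch compiles without Pig). ---
// In a real UDF, delete this part and import org.apache.pig.EvalFunc and
// org.apache.pig.data.Tuple instead.
interface Tuple {
    int size();
    Object get(int fieldNum) throws IOException;
}

abstract class EvalFunc<T> {
    public abstract T exec(Tuple input) throws IOException;
}

// Hypothetical helper for building input tuples outside of Pig.
final class SketchTuple implements Tuple {
    private final List<Object> fields;
    SketchTuple(Object... values) { this.fields = Arrays.asList(values); }
    @Override public int size() { return fields.size(); }
    @Override public Object get(int fieldNum) { return fields.get(fieldNum); }
}

// --- The UDF itself. ---
// Possible use in a Pig script (jar, package and relation names are made up):
//   REGISTER myudfs.jar;
//   DEFINE ToUpper com.example.pig.ToUpper();
//   upper = FOREACH users GENERATE ToUpper(name);
class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Pig passes the call arguments wrapped in a tuple, so a
        // single-argument call arrives as a one-field tuple.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // a null result becomes a null field in the output
        }
        return input.get(0).toString().toUpperCase(Locale.ROOT);
    }
}
```

Returning null for missing input, rather than throwing, keeps a single bad record from failing the whole job; whether that is the right trade-off depends on your data.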

 

2. Function working on a set of columns or the whole tuple, for a single row

This function can replace several values with new ones computed from all of them together.

In the Pig script, pass all the columns you need as arguments to the UDF. Wrapping the call in FLATTEN un-nests the returned tuple, so that its fields become ordinary columns. As in the previous case, you need to extend EvalFunc (this time with Tuple as the return type) and implement the exec method.


Example:
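A minimal sketch, assuming a hypothetical UDF `NameParts` that takes two columns and returns a two-field tuple. As before, `Tuple`, `EvalFunc` and `SketchTuple` are stand-ins of my own so that the code compiles without Pig; with Pig on the classpath you would import the real classes and build the result with `TupleFactory`.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// --- Stand-ins (assumption: only here so the sketch compiles without Pig). ---
interface Tuple {
    int size();
    Object get(int fieldNum) throws IOException;
}

abstract class EvalFunc<T> {
    public abstract T exec(Tuple input) throws IOException;
}

// Hypothetical helper for building tuples outside of Pig.
final class SketchTuple implements Tuple {
    private final List<Object> fields;
    SketchTuple(Object... values) { this.fields = Arrays.asList(values); }
    @Override public int size() { return fields.size(); }
    @Override public Object get(int fieldNum) { return fields.get(fieldNum); }
}

// --- The UDF itself. ---
// Possible use in a Pig script (relation and column names are made up):
//   out = FOREACH users GENERATE id,
//         FLATTEN(NameParts(first_name, last_name)) AS (full_name, initials);
class NameParts extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        // Both arguments arrive as fields 0 and 1 of the input tuple.
        if (input == null || input.size() < 2
                || input.get(0) == null || input.get(1) == null) {
            return null;
        }
        String first = input.get(0).toString();
        String last = input.get(1).toString();
        if (first.isEmpty() || last.isEmpty()) {
            return null;
        }
        String fullName = first + " " + last;
        String initials = "" + first.charAt(0) + last.charAt(0);
        // With Pig on the classpath this would be
        // TupleFactory.getInstance().newTuple(Arrays.asList(fullName, initials)).
        return new SketchTuple(fullName, initials);
    }
}
```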

 

3. Function working on a set of Tuples – DataBag

In this case we are talking about an aggregating function.

To create an aggregating UDF working on a set of tuples, we actually need to implement a set of functions. This follows from parallel MapReduce processing: a separate function is needed for each of the map, combine and reduce steps.
For this purpose we need to create a class that implements the Algebraic interface and defines three classes derived from EvalFunc: Initial (for the map phase), Intermed (for the combine phase) and Final (for the reduce phase). Each of these classes needs to implement its own exec method. Additionally, the Algebraic interface has three methods to implement: getInitial(), getIntermed() and getFinal(), each returning the name of the corresponding class.

 

Here is an example derived from the Pig documentation:
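The sketch below follows the shape of the `COUNT` function that the Pig UDF manual uses to illustrate `Algebraic`; it is a simplified illustration, not a copy of that listing. The class name `RowCount` is made up, and `Tuple`, `DataBag`, `EvalFunc`, `Algebraic`, `SketchTuple` and `SketchBag` are stand-ins of my own so that the code compiles without Pig on the classpath.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// --- Stand-ins (assumption: only here so the sketch compiles without Pig). ---
// The real types live in org.apache.pig and org.apache.pig.data.
interface Tuple {
    int size();
    Object get(int fieldNum) throws IOException;
}

interface DataBag extends Iterable<Tuple> {
    long size();
}

abstract class EvalFunc<T> {
    public abstract T exec(Tuple input) throws IOException;
}

interface Algebraic {
    String getInitial();
    String getIntermed();
    String getFinal();
}

final class SketchTuple implements Tuple {
    private final List<Object> fields;
    SketchTuple(Object... values) { this.fields = Arrays.asList(values); }
    @Override public int size() { return fields.size(); }
    @Override public Object get(int fieldNum) { return fields.get(fieldNum); }
}

final class SketchBag implements DataBag {
    private final List<Tuple> tuples;
    SketchBag(Tuple... tuples) { this.tuples = Arrays.asList(tuples); }
    @Override public long size() { return tuples.size(); }
    @Override public Iterator<Tuple> iterator() { return tuples.iterator(); }
}

// --- The UDF itself. ---
class RowCount extends EvalFunc<Long> implements Algebraic {
    // Plain path, used when the function is not run through the combiner.
    @Override public Long exec(Tuple input) throws IOException { return count(input); }

    // Pig looks the phase classes up by name.
    @Override public String getInitial() { return Initial.class.getName(); }
    @Override public String getIntermed() { return Intermed.class.getName(); }
    @Override public String getFinal() { return Final.class.getName(); }

    // Map phase: turn the bag it is handed into a partial count.
    public static class Initial extends EvalFunc<Tuple> {
        @Override public Tuple exec(Tuple input) throws IOException {
            return new SketchTuple(count(input));
        }
    }

    // Combine phase: add up partial counts into a new partial count.
    public static class Intermed extends EvalFunc<Tuple> {
        @Override public Tuple exec(Tuple input) throws IOException {
            return new SketchTuple(sum(input));
        }
    }

    // Reduce phase: add up the remaining partial counts into the result.
    public static class Final extends EvalFunc<Long> {
        @Override public Long exec(Tuple input) throws IOException { return sum(input); }
    }

    // Number of tuples in the bag stored in field 0.
    static Long count(Tuple input) throws IOException {
        return ((DataBag) input.get(0)).size();
    }

    // Sum of the partial counts stored in field 0 of each tuple in the bag.
    static Long sum(Tuple input) throws IOException {
        long total = 0;
        for (Tuple partial : (DataBag) input.get(0)) {
            total += (Long) partial.get(0);
        }
        return total;
    }
}
```

The important property is that Intermed consumes and produces the same shape of data (a tuple holding a partial count), so it can be applied any number of times before Final runs.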

 

4. Function working on a small set of Tuples – DataBag

This is a function that applies some custom logic combining values from just a few rows.

If you want to apply custom logic to a set of tuples (for example, all tuples sharing a given key field), you need to ensure that they all go to the same reducer. The simplest way to do that is with the GROUP operator. You can also control the order in which the grouped tuples reach the UDF by applying ORDER to the bag inside a nested FOREACH.


Note: The groups you create shouldn’t be too big; otherwise some of your reducers may be overwhelmed or run for a very long time.
To process such an ordered set of tuples, your UDF needs to extend EvalFunc and implement the exec method.


Example:
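A minimal sketch, assuming a hypothetical UDF `MaxGap` that receives a bag of timestamps already sorted in the script and returns the largest gap between neighbours. `Tuple`, `DataBag`, `EvalFunc`, `SketchTuple` and `SketchBag` are again stand-ins of my own so that the code compiles without Pig on the classpath.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// --- Stand-ins (assumption: only here so the sketch compiles without Pig). ---
interface Tuple {
    int size();
    Object get(int fieldNum) throws IOException;
}

interface DataBag extends Iterable<Tuple> {
    long size();
}

abstract class EvalFunc<T> {
    public abstract T exec(Tuple input) throws IOException;
}

final class SketchTuple implements Tuple {
    private final List<Object> fields;
    SketchTuple(Object... values) { this.fields = Arrays.asList(values); }
    @Override public int size() { return fields.size(); }
    @Override public Object get(int fieldNum) { return fields.get(fieldNum); }
}

final class SketchBag implements DataBag {
    private final List<Tuple> tuples;
    SketchBag(Tuple... tuples) { this.tuples = Arrays.asList(tuples); }
    @Override public long size() { return tuples.size(); }
    @Override public Iterator<Tuple> iterator() { return tuples.iterator(); }
}

// --- The UDF itself. ---
// Possible use in a Pig script (relation and column names are made up):
//   by_user = GROUP events BY user_id;
//   gaps = FOREACH by_user {
//       sorted = ORDER events BY ts;
//       GENERATE group AS user_id, MaxGap(sorted.ts);
//   };
class MaxGap extends EvalFunc<Long> {
    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag bag = (DataBag) input.get(0); // the grouped, ordered tuples
        Long previous = null;
        long maxGap = 0;
        for (Tuple row : bag) {
            Object value = row.get(0);
            if (value == null) {
                continue; // skip rows without a timestamp
            }
            long ts = ((Number) value).longValue();
            if (previous != null) {
                maxGap = Math.max(maxGap, ts - previous);
            }
            previous = ts;
        }
        if (previous == null) {
            return null; // empty bag, or no usable timestamps
        }
        return maxGap;
    }
}
```

The UDF relies on the script having sorted the bag; it does not sort again, which keeps it a single pass over the group.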

 

5. Filtering function

This function can be used to filter out records when applied at the record level. When applied at the DataBag level, it checks whether the DataBag matches a given condition.

To create a filtering function, you need to extend the FilterFunc class and implement an exec method that returns a Boolean.


Example:
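A minimal record-level sketch, assuming a hypothetical filter `IsAdult` with a made-up threshold of 18. `Tuple`, `EvalFunc`, `FilterFunc` and `SketchTuple` are stand-ins of my own so that the code compiles without Pig on the classpath; the real `org.apache.pig.FilterFunc` is likewise an `EvalFunc<Boolean>`.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// --- Stand-ins (assumption: only here so the sketch compiles without Pig). ---
interface Tuple {
    int size();
    Object get(int fieldNum) throws IOException;
}

abstract class EvalFunc<T> {
    public abstract T exec(Tuple input) throws IOException;
}

abstract class FilterFunc extends EvalFunc<Boolean> {
}

final class SketchTuple implements Tuple {
    private final List<Object> fields;
    SketchTuple(Object... values) { this.fields = Arrays.asList(values); }
    @Override public int size() { return fields.size(); }
    @Override public Object get(int fieldNum) { return fields.get(fieldNum); }
}

// --- The UDF itself. ---
// Possible use in a Pig script (relation and column names are made up):
//   adults = FILTER users BY IsAdult(age);
class IsAdult extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // Records with a missing age are filtered out rather than failing the job.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        return ((Number) input.get(0)).intValue() >= 18;
    }
}
```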
