Recently I was working with Pig (the Apache one, not the animal 😉 ) and needed to implement some complex logic. Instead of struggling to write it in Pig, I decided to write a UDF (User Defined Function). I was also too lazy to copy and paste a lot of boilerplate code to write the UDF in Java, so I decided to write it in Python. Long-time readers might know that ever since I learned Python (around 7 years ago), I have been a huge fan.
In the end, I found that writing UDFs in Python is much easier than writing them in Java. I thought of writing about it here so that it can act as a starting point for people who also want to write their own UDFs using Python.
Python vs Jython
Before we start, one thing to keep in mind is that even though we write our code in Python, Pig internally executes it using Jython. Most of the time there is no difference, but it is good to remember: Jython implements Python 2 on the JVM, so C extension modules such as NumPy cannot be imported.
Python code
On the Python side, all we need to do to expose a Python function as a UDF is to add a decorator to it.
Let’s say we have the following Python function that returns the length of the argument that is passed to it.
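A minimal sketch of such a function. The name `get_length` is a hypothetical placeholder chosen for illustration.

```python
def get_length(data):
    # Return the number of characters (or items) in the argument
    return len(data)
```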
All we need to do to expose this function as a UDF is add the @outputSchema decorator, which tells Pig the name and type of the value the function returns. So the code becomes
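A minimal sketch of the decorated version. The function name `get_length` and the schema string `length:int` are illustrative assumptions. The `try`/`except` shim is also an addition: Pig is expected to supply `outputSchema` when the file is registered with Jython, and the fallback exists only so the file can be imported and tried out with a plain Python interpreter.

```python
# Assumption: Pig supplies outputSchema when this file is registered USING jython.
# The fallback below is only used when the file runs outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema  # remember the schema string on the function
            return func
        return decorator


@outputSchema("length:int")  # hypothetical schema: a single int field named "length"
def get_length(data):
    # Return the number of characters (or items) in the argument
    return len(data)
```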
When data is passed from Pig to Python, it may arrive as a bytearray (Pig's default type for fields without a declared type). Most of the time this isn't a problem. When it is, we can convert the value into a proper string before we consume it. So the final code would look like this
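A minimal sketch of the final version with the conversion added. As before, `get_length` and `length:int` are hypothetical names, and the `outputSchema` fallback is only there so the file runs outside Pig. Decoding as UTF-8 is an assumption about how the input is encoded; adjust it if your data uses a different encoding.

```python
# Assumption: Pig supplies outputSchema when this file is registered USING jython.
# The fallback below is only used when the file runs outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("length:int")  # hypothetical schema: a single int field named "length"
def get_length(data):
    # Assumption: a bytearray argument holds UTF-8 encoded text
    if isinstance(data, bytearray):
        data = data.decode("utf-8")
    return len(data)
```

Decoding first means the function counts characters rather than raw bytes, which matters as soon as the input contains non-ASCII text.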
Pig code
On the Pig side, we need to do two things:
- Register the UDF
- Call the UDF 😉
Register the UDF
As I said in the beginning, Pig internally uses Jython to run Python code. So we first need to register our Python file using the REGISTER statement. We can just say REGISTER 'udf.py' USING jython as pyudf;
Call UDF
Once we register the UDF using the REGISTER statement, we can then call the UDF function using the alias that we created.
Here is the complete code on the Pig side.
And believe me, that’s all you need to do to write Pig UDFs using Python. No more unneeded Java classes, boilerplate code or JAR creation process 🙂
It is heaven… Thanks for the post.
Good explanation. Please post more on Python usage in Pig or MapReduce.
Nice to know that you like it 🙂
Well explained @muthu!!! Can you provide a more advanced example using a Python UDF for Pig?
Hello Vijay,
Nice to know that you liked it 🙂
Will write more about it sometime.
Do you have to register each time? Or can you persist the UDF?
Can we call the same Python function more than once in a Pig statement?
e.g. a = FOREACH B GENERATE myudf.some_func(something,something),myudf.some_func(otherthing,otherthing);