After I figured out how to use Python to create Pig UDFs, I got interested in Jython and wanted to play around with it. So I installed it on my Mac through Homebrew by executing the following command.
brew install jython
Everything installed properly, and I was able to run Jython after adding the following lines to my [bashrc](https://github.com/sudar/dotfiles)
But there was one small annoyance. Every time I started Jython, I got the following strange warning.
expr: syntax error
After a couple of web searches, I stumbled upon an email thread in the Jython-users mailing list. Basically, it was due to a bug in the bash script that is used to start Jython. I opened the /usr/local/Cellar/jython/2.5.3/libexec/bin/jython file and changed the following line, which fixed the error.
if expr "$link" : '/' > /dev/null; then
to
if expr "$link" : '[/]' > /dev/null; then
I am exploring Jython more and will keep you guys updated if I find something interesting.
Well, I wanted to have the geekiest synonym for “We just had a baby” as the title for this post (similar to my wedding card). Since I am currently typing this from a hospital, sleep-deprived, while constantly shifting my attention between my recovering wife and my newborn child, this is the best I could come up with.
In the last 24 hours, I have experienced extreme cases of all 8 emotions specified in Robert Plutchik’s theory, but in spite of that these have been the best 24 hours of my entire life so far.
So guys, join me in welcoming our son to this beautiful world. By the way, all credit for the creation of this new process goes to my wife who did all the real work
More details and pics coming soon, after (if) I catch up with some sleep
I was reading the Wikipedia article on the Foundation series and realized that the books were not published in chronological order, which raised the question: in which order should I read the books, chronological order or published order?
It’s been quite some time since I read anything substantial, so I am planning to take a “reading break” and dedicate my free time over the next couple of months to reading.
Since I have a couple of months for this, I don’t want to read short stories or a single novel. Instead, I want to read a series so that once the characters are introduced, they continue to be present for most of the series. This is an important factor for me, since it motivates me to complete all the books in the series. This is one of the reasons why I was able to complete the entire Sherlock Holmes series in less than a week.
I am not very particular about the genre, but I would prefer either Fantasy or Sci-Fi. The series that immediately came to my mind were Harry Potter and The Lord of the Rings. I posted about it this morning on Twitter and asked for recommendations.
I am planning to spend a couple of months reading fantasy/fiction. What is your recommendation? Harry Potter or Lord of the Rings
In one of my projects, we had a huge number of Pig scripts which dealt with data from a single source. The schema for this common data source is quite complex and changes every few months. Since this schema was present in all the Pig files, whenever it changed, it was a real pain to update every Pig script.
I was looking for a way to move the schema into a separate Pig file and then include it in all the other Pig scripts, like how you import a class in Java, instead of copy-pasting it into every Pig file.
After some quick web searches, I found that from Pig 0.9 onwards this feature is indeed available in Pig itself, as part of its support for macros. All you need to do is add the following line to your Pig script at the point where you want the file to be included.
You can either give a relative path in the above line, or set a search path that Pig should look in when including scripts. If you want to set the search path, you can do something like this.
set pig.import.search.path '/usr/local/pig,/grid/pig'; import 'external-file.pig';
Now my Pig scripts are organized properly. Hope this helps you as well.
When I wrote about using Python to write UDFs for Pig, I mentioned that Pig would internally use Jython to parse the code, but that 99% of the time this shouldn’t be an issue. Well, I hit the other 1% recently.
I had a small piece of Python code that used the built-in json module to parse JSON data. I converted it into a UDF, and when I tried to call it from Pig I got a “module not found” exception. After some quick checks, I found that the latest stable version of Jython is 2.5.x, while the json module was only added in Python 2.6.
After some web searches, I came across jyson through a blog post about using JSON in Jython. jyson is a Java implementation of a JSON codec for Jython 2.5 which can be used as a drop-in replacement for Python’s built-in json module.
I downloaded the jyson jar and added it to Pig’s pig.additional.jars property (passed as -Dpig.additional.jars). In the Python code, I changed the import statement to import com.xhaus.jyson.JysonCodec as json. After that everything started to work again.
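As a small variation on the import change described above, here is a sketch that tries the built-in module first and only falls back to jyson when it is missing. This fallback pattern and the parse_record helper are my own illustration, not part of the original UDF, and it assumes the jyson jar is already on Pig's classpath when running under Jython 2.5.

```python
try:
    # Works on CPython 2.6+ and on any Jython that ships the json module.
    import json
except ImportError:
    # Jython 2.5 (as used by Pig here): fall back to the jyson codec.
    # Assumes the jyson jar has been made available to Pig.
    import com.xhaus.jyson.JysonCodec as json


def parse_record(raw):
    # Hypothetical helper: decode one JSON document into Python objects.
    return json.loads(raw)
```

With this in place the same file can be tested locally with regular Python and still be registered as a UDF in Pig.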
Recently I was working with Pig (the Apache one, not the animal) and needed to implement some complex logic. Instead of struggling to write it in Pig, I decided to write a UDF (User Defined Function). Also, I was too lazy to copy-paste a lot of boilerplate code to write the UDF in Java, so I decided to write it in Python. Long-time readers might know that ever since I learned Python (around 7 years ago), I have been a huge fan.
In the end, I found that it was much easier to write UDFs in Python than in Java. I thought of writing about it here so that it can act as a starting point for people who also want to write their own UDFs using Python.
Python vs Jython
Well, before we start, one thing we have to keep in mind is that even though we will be writing our code in Python, Pig will internally execute it using Jython. 99% of the time there will not be any difference, but it is good to keep that in mind.
First, on the Python side, all we need to do to expose a Python function as a UDF is to add a decorator to it.
Let’s say we have the following Python function that returns the length of the argument that is passed to it.
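A minimal sketch of such a function could look like this (the name get_length is just an assumption for illustration):

```python
def get_length(data):
    # Return the length of whatever argument is passed in.
    return len(data)
```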
All we need to expose this function as a UDF is to add the @outputSchema decorator. So the code becomes
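Here is a sketch of the decorated version, using the same hypothetical get_length name and an assumed schema of length:int. Pig makes outputSchema available when it runs the script through Jython; the small fallback at the top is only my own stand-in so that the file can also be loaded in plain Python for testing.

```python
# Pig provides outputSchema when it runs this file through Jython.
# The fallback below is only a stand-in for running outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            # Remember the declared schema on the function object.
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("length:int")
def get_length(data):
    # Return the length of whatever argument is passed in.
    return len(data)
```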
When data is passed from Pig to Python, it is passed as a bytearray. Most of the time this isn't a problem, but when it is, we can just convert it into a proper string before we consume it. So the final code would look like this
On the Pig side, we should do two things.
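A sketch of that final version, again with the hypothetical get_length name and an assumed length:int schema. The str() call is the conversion the paragraph above describes; how a Pig bytearray turns into a string depends on Jython, so treat this as a starting point rather than a definitive implementation. The fallback decorator is only a stand-in for running outside Pig.

```python
# Pig provides outputSchema when it runs this file through Jython.
# The fallback below is only a stand-in for running outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("length:int")
def get_length(data):
    # Convert the incoming value to a string before measuring it.
    return len(str(data))
```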
Register the UDF
Call the UDF
Register the UDF
As I said in the beginning, Pig will internally use Jython to parse the Python code. So we first need to register our Python file using the REGISTER statement. We can just say REGISTER 'udf.py' USING jython AS pyudf;
Once we have registered the UDF using the REGISTER statement, we can call the UDF through the alias that we created.
Here is the complete code on the Pig side.
And believe me, that’s all you need to do to write Pig UDFs using Python. No more unneeded Java classes, boilerplate code or Jar creation process.