Sudar's Blog - Night Dreaming (by Sudar)


Including external Pig files into Pig Latin scripts

Published Sep 19, 2013 | In Hadoop/Pig

In one of my projects, we had a huge number of Pig scripts which dealt with data from a single source. The schema for this common data source is quite complex and changes every few months. Since this schema was present in all the Pig files, whenever it changed it was a real pain to update all the Pig scripts.

I was looking for a way to separate out the schema into a separate Pig file and then include it in all other Pig scripts, like how you import a class in Java, instead of copy pasting it into all Pig files.

After some quick web searches, I found that from Pig 0.9 onwards this feature is available in Pig itself, as part of its macro support. All you need to do is add the following line to your Pig script at the point where you want the other file to be included.

import 'other-file.pig';

You can either give a relative path in the above line or set a search path from which Pig should pick up the imported scripts. If you want to set the search path, you can do something like this.

set pig.import.search.path '/usr/local/pig,/grid/pig';
import 'external-file.pig';

Now my Pig scripts are organized properly. Hope this helps you as well 🙂

Posted in Hadoop/Pig | Tagged Import, Macro, Pig

Parsing JSON in Pig UDF written in Python

Published Sep 17, 2013 | In Hadoop/Pig, Python

When I wrote about using Python to write UDF functions for Pig, I mentioned that Pig would internally be using Jython to run the code, but that 99% of the time this shouldn’t be an issue. But I hit the other 1% recently 🙂

I had a small piece of Python code that used the built-in json module to parse JSON data. I converted that into a UDF function and when I tried to call it from Pig, I was getting a “module not found” exception. After some quick checks, I found that the latest stable version of Jython is 2.5.x, while the json module was only added to Python's standard library in 2.6.

After some web searches, I came across jyson through a blog post about using JSON in Jython. jyson is a Java implementation of a JSON codec for Jython 2.5 which can be used as a drop-in replacement for Python’s built-in json module.

I downloaded the jyson jar and added it to Pig’s pig.additional.jars property (passed on the command line as -Dpig.additional.jars). In the Python code, I changed the import statement to import com.xhaus.jyson.JysonCodec as json. After that everything started to work again 🙂
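If you want the same file to work both under plain Python (for quick local testing) and under Pig's Jython, one option is to fall back to jyson only when the built-in module is missing. This is just a minimal sketch: get_field is a made-up helper for illustration, and it assumes the jyson jar is on the classpath when running under Jython 2.5.

```python
# Prefer the standard json module; fall back to jyson on Jython 2.5,
# where json does not exist. Both expose loads() and dumps().
try:
    import json
except ImportError:
    import com.xhaus.jyson.JysonCodec as json


def get_field(raw, key):
    """Parse a JSON object from a string and return the value stored under key.

    get_field is a hypothetical helper, not part of Pig or jyson.
    """
    return json.loads(raw)[key]
```

Calling get_field('{"name": "pig"}', "name") then behaves the same way in both environments.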

Posted in Hadoop/Pig, Python | Tagged JSON, Jython, Pig, Python, UDF

Install OpenCV in Mac using Homebrew

Published Sep 13, 2013 | In Apple/Mac/iDevices

Recently, I switched to a new Mac Retina (one of the numerous perks of working at Yahoo 😉 ) and wanted to install OpenCV in it using Homebrew.

I ran the usual brew install opencv command and got a No available formula for opencv error message. I have personally run this command before to install OpenCV and was wondering what went wrong.

After a couple of web searches and ‘hair tearing’, I found that the opencv formula has moved to homebrew-science and I had to do the following.

brew tap homebrew/science

brew install opencv

Finally everything worked. So if you are facing the same problem, you now know what to do (and can possibly save some hair as well) 🙂

Oh, by the way, if you own a Mac and don’t know what Homebrew is, then stop everything and install it right away. You can thank me later 😉

Posted in Apple/Mac/iDevices | Tagged brew, Homebrew, Mac, OpenCV

Writing Pig UDF functions using Python

Published Sep 12, 2013 | In Hadoop/Pig, Python

Recently I was working with Pig (the Apache one, not the animal 😉 ) and needed to implement some complex logic. Instead of struggling to write it in Pig, I decided to write a UDF (User Defined Function). Also, I was too lazy to copy paste a lot of boilerplate code to write the UDF in Java and decided to write it in Python. Long time readers might know that ever since I learned Python (around 7 years ago), I have been a huge fan.

In the end, I found that it was very easy to write UDFs using Python, when compared with writing them in Java. I thought of writing about it here so that it can act as a starting point for people who also want to write their own UDFs using Python.

Python vs Jython

Well, before we start, one thing that we have to keep in mind is that, even though we would be writing our code in Python, Pig will internally execute the code using Jython. 99% of time there will not be any difference, but it is good to keep that in mind.

Python code

First, on the Python side, all we need to do to expose a Python function as a UDF is to add a decorator to it.

Let’s say we have a Python function, get_length, that returns the length of the argument that is passed to it.

All we need to expose this function as a UDF is to add the @outputSchema decorator, which tells Pig the name and type of the value that the function returns.

When data is passed from Pig to Python, it is passed as a bytearray. Most of the time this shouldn’t be a problem, but when it is, we can just convert it into a proper string before we consume it. So the final code would look like this

@outputSchema("num:long")
def get_length(data):
    str_data = ''.join([chr(x) for x in data])
    return len(str_data)
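
One thing that helps while developing is being able to run the UDF file with plain Python before handing it to Pig. When Pig registers the script through Jython, the outputSchema decorator is already available, but plain Python does not know about it. The sketch below defines a stand-in only when the name is missing. The stand-in is my own assumption for local testing and is not part of Pig.

```python
# When Pig registers this file through Jython, outputSchema already exists.
# Under plain Python it does not, so define a no-op stand-in (an assumption
# for local testing only) that just records the schema string.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("num:long")
def get_length(data):
    # Convert the raw bytes handed over by Pig into a proper string first.
    str_data = ''.join([chr(x) for x in data])
    return len(str_data)
```

With this in place, get_length(bytearray(b"hello")) can be tried out in a normal Python shell, and the same file can still be registered from Pig.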

Pig code

In the Pig side, we should do two things.

Register the UDF
Call the UDF 😉

Register the UDF

As I said in the beginning, Pig internally uses Jython to run Python code. So we first need to register our Python file using the REGISTER statement. We can just say REGISTER 'udf.py' USING jython AS pyudf;

Call UDF

Once we register the UDF using the REGISTER statement, we can then call the UDF function using the alias that we created.

Here is the complete code in the Pig side.

REGISTER 'udf.py' USING jython AS pyudf;

A = LOAD 'data.txt' USING PigStorage();
B = FOREACH A GENERATE $0, pyudf.get_length($0);

DUMP B;

And believe me, that’s all you need to do to write Pig UDF functions using Python. No more unneeded Java classes, boilerplate code or Jar creation process 🙂

Posted in Hadoop/Pig, Python | Tagged Pig, Python, UDF

Hack 101 at IIT Kanpur for HackU

Published Aug 23, 2013 | In Events/Conferences

I am currently in IIT – Kanpur to conduct Yahoo! HackU (Hackday for University), as I am part of the Tech crew that is conducting the event. This is similar to my previous HackU events.

During the event, I gave a talk titled “Hack 101”, which basically explains what a hack is, what HackU is and how to participate.

Posted in Events/Conferences | Tagged Hack, HackU, slides, Yahoo

Make your hack standout using PureCSS

Published Jul 13, 2013 | In Events/Conferences

I am currently in Hyderabad, attending Yahoo’s 6th Open hackday and I just finished my talk titled “Make your hack standout using PureCSS”.

PureCSS

PureCSS is a set of small, responsive CSS modules that you can use in every web project, and it is one of the latest offerings from Yahoo.

Do check it out, it is really awesome 🙂

Posted in Events/Conferences | Tagged HackDay, PureCSS, slides

Rollback a commit in git, while maintaining history

Published Jul 8, 2013 | In Unix/Server Stuff

I use git for pretty much all my projects these days. Whenever I do a commit, I make sure that it is atomic and has a proper commit message. One of the main reasons why I keep the commits atomic is that I can roll back a commit if needed.

Recently, I faced a situation in one of my projects where the feature I had added a couple of commits back wasn’t working properly, and I decided to completely get rid of it. Since I follow semantic versioning, I wanted to make sure that I maintain the different releases of my project properly.

After a bit of digging, I found the exact command in git to do that. Since the command was not that intuitive, I thought of documenting it here, so that I know where to look when I face the situation again 😉

Problem

Let me explain the problem properly. Let’s assume that I have the following list of commits in my history.

A -> B -> C -> D -> E

The head is at E now. The feature that I was talking about was introduced in commit C. Now I want to add a new commit after E that rolls back the changes I made in commit C, while still maintaining the history. In short, I want the commits to look like

A -> B -> C -> D -> E -> C'

where C' is the opposite of C

Solution

I could have manually removed the changes I made in commit C. But being a fan of fancy one-liners, I was looking for a better solution and found that there is a command in git which can do this for you.

All you need to do is to execute the following command.

git revert 'SHA_of_C'

It’s really fascinating to see, how the developers of git have thought of these different cases 🙂

Posted in Unix/Server Stuff | Tagged git, one liner

Sum up all values of a column in a text file

Published Jul 4, 2013 | In Unix/Server Stuff

Recently, I had to sum up all the integers in a column of a text file (similar to how you do it in Excel). After some digging, I came up with an awk one-liner to do it.

Following my tradition of documenting one liners, I am going to document this one as well 🙂

Input and output data

I was using a super simplified version of my input data, with a few whitespace-separated columns on each line.

I wanted to find the sum of all values present in the 3rd column. So in this case, the output that I was expecting was 145

Command

Here is the awk one liner, which does this.

Explanation

The awk script is executed for each line, and the first part of the command adds the value in the 3rd column to a variable s, which accumulates the sum.

When the end of file is reached, the second part of the command is executed, which just prints the value of the variable.

Field separator

If the columns are separated by a comma or by any other non-whitespace character, then you just have to specify it by adding FS=',' to the above command.
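
Going by the explanation above, the command would be something along the lines of awk '{ s += $3 } END { print s }' data.txt, with FS=',' added just before the file name for comma-separated input (data.txt is only a placeholder name). For comparison, here is a rough Python sketch of the same accumulate-then-report logic. The function name sum_column and the sample rows are made up for illustration and are not the data I was working with.

```python
def sum_column(lines, column=3, sep=None):
    """Sum the integers found in the given 1-based column of each line.

    sep=None splits on runs of whitespace, like awk's default field
    separator; pass sep=',' for comma-separated input.
    """
    total = 0
    for line in lines:
        fields = line.split(sep)
        if len(fields) < column:
            continue  # skip blank or short lines
        total += int(fields[column - 1])
    return total


# Hypothetical sample rows, only to show the call.
rows = ["a x 10", "b y 20", "c z 15"]
print(sum_column(rows))
```

The per-line loop plays the role of the first part of the awk command, and the final print corresponds to the END block.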

The more I dig deeper into awk, the more I like it and it is really fascinating to see how much you can do with this tool.

I learned a lot about awk and hopefully this teaches something to you as well 😉

Posted in Unix/Server Stuff | Tagged awk, command line, one liner

How to encourage contribution in open source projects

Published Jun 24, 2013 | In Random/Personal

Last month, I talked about why you should open source your pet project, and also explained, how I started maintaining the Arduino Makefile project.

After that, the past month has been really busy. Instead of me explaining how busy it has been, I guess the following picture would give a better idea.

There were 70+ commits that got into the project and a huge number of critical bugs got fixed. If you look very closely, you will also find that it’s not just me who has been doing this 🙂

I always believed that github encouraged contribution to projects, but in the last one month I also learned a couple of new lessons on how you could encourage more participation and contribution to your open source projects.

Below are some of those learnings. Do take them with a pinch of salt though 😉

Let people know that they can contribute

This might seem silly, but it is really important to let people know that your project is open for contribution. I have spent countless hours fixing bugs and sending pull requests to projects on github, only to learn that the repo owner is not interested in contributions 🙂 And it is even worse when they just decide to be silent about it, without even leaving a small comment.

So let people know upfront whether you will be accepting contributions or not. You don’t have to write a huge paragraph about it. Just a single line which says “contributions are welcome” in the readme file should do it. I generally add some variation of the following in my projects.

All contributions (even documentation) are welcome 🙂 Open a pull request and I would be happy to merge them.

Let people know how to contribute

The next thing to do is to let people know how to contribute. The following are things which you might have to let them know.

Which coding style they should follow
Tabs vs space or 2 vs 4 character indent or to have a semi-colon or not 😉
Whether they should open a pull request or send you a patch/diff
Whether all pull requests should contain tests and/or documentation
How they should word the commit message. I try to follow this guide by Tim Pope
Whether you need them to have multiple commits or one single commit

You could choose your own process, but keep in mind that the simpler the process the easier it is to follow it. I generally follow a very simple process and reword or change the commit message if needed.

One other thing, you might have to keep in mind is that, not everyone is comfortable with git/github. In those cases, I point them to the guide that I wrote to help people contribute to github projects.

Track feature requests and TODO’s

Sometimes people want to just get their hands dirty by contributing a simple fix or feature before they become comfortable. In order to help people get started I track all feature requests and TODO’s in github issues. I also use TODO comments in the code and if applicable also link it to corresponding github issue. I also add the following line in the readme file.

If you are looking for ideas to work on, then check out the following TODO items or the issue tracker.

In addition to helping people who want to contribute, this also helps me to keep track of issues and features for the projects. If applicable also try to create milestones and assign issues and feature requests to corresponding milestones. This will also be useful for users of your project, to know when to expect a feature to be implemented.

Respond to pull requests

If someone sends a pull request or reports an issue, try to respond to it. When someone creates an issue or pull request, they have spent their free time on it and you should respect that fact.

Sometimes you may not be interested in accepting the pull request. In those cases, at least let them know that you can’t accept it and also tell them the reason why.

Be polite

Eventually, you will receive a pull request which is just plain stupid. Even in those cases try to be very polite.

When people are just getting started, they may miss something, make mistakes, or not understand something which looks trivial to you. But if you embarrass them or are hostile, it is not going to help them. They might just stop contributing, not just to your project but to any project. Be polite and encouraging, especially if they are just getting started, and you might make someone’s day a little brighter 🙂

Also leave a simple “Thank you” comment after you merge someone’s pull request. It is a little mark of respect that you show for someone who has taken an effort to fix your code.

Follow a proper versioning policy

Any project which will be used by more than one person needs proper versioning. These days I have started to follow Semantic Versioning for most of my projects. It is very simple to use and very effective.

Show people how to do it, by doing it

Whenever I want to contribute to a project, I try to get a sense of how things are done by seeing the commit history. So if you want people to follow a certain process or style when they contribute, then you should follow it in the first place.

Hide your time machine 😉

.. and don’t forget to hide your time machine 😉

Sometimes a little humor in your comment on an issue might make it easier for you to communicate with someone on the other side of the globe.

Happy coding 🙂

Posted in Random/Personal | Tagged github, Open Source

Easy Retweet plugin now supports Google Analytics tracking

Published Jun 11, 2013 | In Plugin Releases

I recently released a couple of updates to my Easy Retweet WordPress Plugin.

For those who don’t know, Easy Retweet WordPress Plugin allows you to add Twitter tweet or bit.ly buttons to your WordPress posts.

You can choose to add these buttons using any one of the following ways

Automatic way – Just configure the button in the settings screen
Using shortcodes
Using template functions

Google Analytics tracking

One of the new features that I have added to the Plugin is the ability to add Google Analytics tracking to links that are tweeted using the Tweet buttons.

You can add Google Analytics tracking by including the following in the url that gets tweeted.

utm-campaign
utm-source
utm-medium
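
To give an idea of what a tagged url looks like, here is a small Python sketch that appends campaign parameters to a link. Note that Google Analytics itself reads these from the query string with underscores (utm_source, utm_medium, utm_campaign). The function name add_utm and the sample values are made up for illustration, and this is not the Plugin's actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_utm(url, source, medium, campaign):
    """Return url with Google Analytics campaign parameters appended.

    add_utm is a hypothetical helper; existing query parameters are kept.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Made-up values, only to show the call.
print(add_utm("http://example.com/post?x=1", "twitter", "social", "easy-retweet"))
```

Any visit that arrives through such a link can then be attributed to that source, medium and campaign in the Analytics reports.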

Translations

In addition to the new features, I have also added translations for the following languages

Danish
Irish
Hindi
Romanian

Mandatory update

The current version of the Plugin is 3.0.1 and it includes both new features and bug fixes. So it is a mandatory update. You can install it from your WordPress admin or download it from the Plugin homepage.

Posted in Plugin Releases | Tagged Easy Retweet, Plugin, tweet, WordPress