In pretty much every Pig script you write, you have to specify at least two locations: the input and the output. If you use multiple inputs or register multiple jars for UDFs, that number is bound to increase.
I run most of my Pig scripts through a shell script and I was looking for a way to pass in these locations at runtime instead of hard-coding them in the Pig script. After a bit of research, I found that Pig can accept command-line parameters and that there are in fact multiple ways to pass them. I thought of documenting them here so that I know where to look when I need to 🙂
Parameter Placeholder
First, we need to create a placeholder inside the Pig script for the parameter that needs to be replaced. Let’s say you have the following line in your Pig script where you are loading an input file.
input_data = LOAD '/data/input/20130326';
In the above statement, if you want to replace the date part dynamically, then you have to create a placeholder for it.
input_data = LOAD '/data/input/$date';
Individual Parameters
To pass individual parameters to the Pig script, we can use the -param option while invoking the Pig script. So the syntax would be
pig -param date=20130326 -f myfile.pig
If you want to pass two parameters, then you can add one more -param option.
pig -param date=20130326 -param date2=20130426 -f myfile.pig
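Since I invoke Pig from a shell script anyway, the -param options can be assembled there instead of being typed by hand. The following is only a minimal sketch: the function name pig_cmd is my own placeholder, and it just prints the command line so that you can inspect it before running it. It assumes the values contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the pig command line for one run.
# Usage: pig_cmd SCRIPT NAME=VALUE [NAME=VALUE ...]
pig_cmd() {
    script="$1"
    shift
    cmd="pig"
    # Each remaining argument becomes its own -param option.
    for pair in "$@"; do
        cmd="$cmd -param $pair"
    done
    printf '%s -f %s\n' "$cmd" "$script"
}

# Example: use today's date for $date.
# pig_cmd myfile.pig "date=$(date +%Y%m%d)"
```

Printing the command first makes it easy to log exactly what was run.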
Param File
If there are a lot of parameters that need to be passed, or if we need a more flexible way to do it, then we can place all of them in a single file and pass the file name using the -param_file option.
The param file uses a simple ini-style format where every line contains a param name and its value. We can add comments using the # character.
date=20130326
date2=20130426
We can pass the param file using the following syntax
pig -param_file myfile.ini -f myfile.pig
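If the values change from run to run, the wrapper shell script can generate the param file as well. This is a rough sketch under my own assumptions: write_param_file is a made-up helper name, and the two parameters are simply the ones used in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: write a param file for one run.
# Usage: write_param_file FILE DATE DATE2
write_param_file() {
    cat > "$1" <<EOF
# Parameters for myfile.pig
date=$2
date2=$3
EOF
}

# Example:
# write_param_file myfile.ini 20130326 20130426
# pig -param_file myfile.ini -f myfile.pig
```

The generated file also works as a record of the values a particular run used.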
Default Statement
We can also assign a default value to a parameter inside the Pig script using the %default statement, like below
%default date '20130326'
Processing Order
One good thing about parameter substitution in Pig is that you can pass in a value for the same parameter using multiple options simultaneously. Pig resolves them in the following order, from lowest to highest precedence.
- The %default statement takes the lowest precedence.
- The values passed using -param_file take the next precedence.
  - If there are multiple entries for the same param in a file, then the one that comes later takes precedence.
  - If there are multiple param files, then the files that are specified later take precedence.
- The values passed using the -param option take precedence over both of the above.
  - If multiple values are specified for the same param, then the one specified later takes precedence.
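The rule above boils down to "the last value set wins" once the sources are lined up from lowest to highest precedence. Here is a small model of that rule in shell. It is purely illustrative and not how Pig implements it; resolve_param is a name I made up.

```shell
#!/bin/sh
# Illustrative model only, not Pig's actual code: pick the winning value
# for one parameter. Arguments are candidate values listed from lowest
# to highest precedence (%default, then param files in order, then
# -param options in order). An empty argument means "not set there".
resolve_param() {
    winner=""
    for candidate in "$@"; do
        # A later non-empty candidate overrides the earlier ones.
        if [ -n "$candidate" ]; then
            winner="$candidate"
        fi
    done
    printf '%s\n' "$winner"
}

# %default says 20130101, the param file says 20130201,
# and -param says 20130326:
resolve_param 20130101 20130201 20130326   # prints 20130326
```

If only the %default value is present, for example resolve_param 20130101 "" "", the default is what the script ends up with.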
Debugging
Sometimes, the precedence can be a little confusing, especially if you have multiple files and multiple params. Pig provides a -debug option to debug these kinds of scenarios. If you invoke Pig with this option, then it will generate a file with the extension .substituted in the current directory with the placeholders replaced with the actual values.
What I use?
I follow this convention while passing params in Pig and it has worked nicely for me so far.
I specify a default value using the %default statement and then pass actual values using the -param_file option. If I am in a hurry and just want to test something locally, then I use the -param option, but generally I try to put them in a separate ini file so that I can check in the options as well.