Recently, while formatting some data files for further processing, I had to remove duplicate lines from a file based on a particular field. After trying out the cut and grep commands, I was finally able to solve it with a very concise awk command/script.
The command was concise, yet packed with so much information that it helped me learn more about the awk scripting language. I thought of writing about it here so that it is useful for others, and also so that I know where to find it when I need it next.
Feel free to use it in whatever way you want, if it solves your problem as well.
Input and output data
Let me first explain the input data I had and the output that I was expecting.
Consider a file which has the following lines. Each line has four fields.
Now assume that we want to remove duplicate lines by comparing only the second field. We want the output to look like this.
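The original sample lines are not reproduced here, so the following input is a hypothetical stand-in: four fields per line, with the second field (u01, u02, ...) as the one we compare.

```
Input:
a1 u01 x y
a2 u02 x y
a3 u01 x y
a4 u03 x y

Expected output (first occurrence of each second field kept):
a1 u01 x y
a2 u02 x y
a4 u03 x y
```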
Get ready for the surprise. The actual command is just this.
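Reconstructed from the explanation that follows (the x[$2]++ array lookup negated with !), the one-liner is `awk '!x[$2]++' file`. A self-contained demo with hypothetical four-field data:

```shell
# Keep only the first line seen for each distinct second field.
printf '%s\n' 'a1 u01 x y' 'a2 u02 x y' 'a3 u01 x y' 'a4 u03 x y' |
awk '!x[$2]++'
```

The second field of the third line (u01) has already been seen, so only that line is dropped.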
awk script execution and printing
The awk script is executed once for each line; if the result is true, the line is printed, and if it is false, the line is not printed.
The awk language supports associative arrays, similar to the ones found in PHP. The script x[$2]++ fills up an associative array. The key used here is $2, which refers to the second field, and x is the variable name; you can use any name for it.
The array is populated for every line. This is how the array looks after each line.
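Since the original sample lines aren't shown, here is a hypothetical trace: the awk body below prints the count stored in x for each line's second field before incrementing it (the +0 forces a numeric 0 for keys not yet in the array).

```shell
printf '%s\n' 'a1 u01 x y' 'a2 u02 x y' 'a3 u01 x y' |
awk '{ print "x[" $2 "] =", x[$2]+0; x[$2]++ }'
```

The first two lookups yield 0 (key not seen yet); the repeated u01 yields 1.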
The ! operator results in a boolean evaluation which determines whether a particular line should be passed on to the output (printed) or not.
When the field is not yet present in the array, the lookup yields zero, which is false. The ! (not) operator turns that into true, so the line is passed on to the output (printed). When a duplicate is found, the array returns a non-zero count, which is true, but ! turns it into false, and that line is not passed on to the output.
The expanded version of the above command would be
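A minimal sketch of that expanded form, with the same logic spelled out over hypothetical data:

```shell
# Same behaviour as !x[$2]++, written out as an explicit if/print.
printf '%s\n' 'a1 u01 x y' 'a2 u02 x y' 'a3 u01 x y' |
awk '{ if (!x[$2]++) { print $0 } }'
```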
But what is the fun in using the expanded version?
In the input file that I had, the fields were separated by whitespace, so I didn’t have to specify the field separator. But if you are using a non-whitespace field separator, you can specify it by adding FS="," to the above command.
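For example, with a hypothetical comma-separated file, FS can be set as a command-line assignment placed before the file name:

```shell
# Create a small hypothetical CSV file.
cat > data.csv <<'EOF'
a1,u01,x
a2,u02,x
a3,u01,x
EOF
# The FS="," assignment takes effect before data.csv is read.
awk '!x[$2]++' FS="," data.csv
```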
This one-liner actually taught me that awk supports a full programming language that can be used to create scripts, and it also increased my understanding of the way the awk command works. Hopefully it teaches you something as well.
I know that this is already a concise version, but if you think that this can be improved, then do let me know.
During my recent FTP adventures, I also found that some shared hosting sites give you an FTP username with the ‘@’ symbol in it. That is fine as long as you use a GUI client to connect to FTP. But if you try using the command line or Finder on a Mac, you will have issues, since the ‘@’ symbol is also used to separate the username from the host.
After some research I found that the ‘@’ symbol in the username can be replaced with ‘+’ when specifying it on the command line. I tested it with both wput and Finder on a Mac, and it worked perfectly in both.
So remember: the next time you try to connect to an FTP server from the command line and you have a ‘@’ symbol in the username, replace it with the ‘+’ symbol. Happy FTP’ing!
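A hedged sketch of the substitution, with hypothetical credentials and host; the rewritten username then slots into the usual ftp:// URL without confusing the host separator:

```shell
# Hypothetical FTP username that contains '@':
user='sudar@example.com'
# Replace the '@' in the username with '+' for command-line use:
cli_user=$(printf '%s' "$user" | tr '@' '+')
# The only remaining '@' now separates the username from the host.
printf 'ftp://%s@ftp.host.com/\n' "$cli_user"
```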
Recently I had to transfer an entire folder, with a lot of sub-folders, to an FTP server. I know that there are a lot of FTP GUI tools available that can do it, but I wanted to do it from the command line so that I could script it.
I searched for the solution and came across an excellent tool called wput, which does exactly that very easily. It is very similar to wget, but instead of downloading the content, it allows you to upload it.
I installed it using apt-get and was trying to upload the entire directory. It was at this point that I realized I wanted to exclude all the .svn folders.
I again started searching for an answer. I even posted about it on Stack Overflow, but couldn’t find a solution. I then went over the man page of wput, and hidden inside was this gem, which allowed you to decide which files to include in or exclude from the directory.
I thought of posting it here, so that it is useful for others and also I know where to find it when I need it next time.
So all you need is just one line. If you have not installed wput before, then install it using one of the following commands, based on your operating system.
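The exact wput man-page option isn't quoted above, so as an alternative sketch (a plainly different technique, not the man-page gem itself) you can build the file list with find, pruning .svn directories, and hand that list to wput. The demo below creates a tiny hypothetical tree and shows the list that would be uploaded; the wput line with hypothetical credentials is commented out.

```shell
# Typical install commands (pick the one for your distro):
#   sudo apt-get install wput    # Debian/Ubuntu
#   sudo yum install wput        # RHEL/CentOS/Fedora

# Hypothetical tree containing a .svn folder:
mkdir -p demo/sub/.svn
echo a > demo/file1.txt
echo b > demo/sub/file2.txt
echo c > demo/sub/.svn/entries

# Prune .svn directories; print every remaining regular file.
find demo -name .svn -prune -o -type f -print | sort

# The same list could then be uploaded (hypothetical server details):
# wput $(find demo -name .svn -prune -o -type f -print) ftp://user:pass@ftp.example.com/dir/
```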
Recently, I wanted to play around with some stuff which is available only in PHP 5.3.x (more about it later in a separate blog post), and so I was looking for a way to install it on my Ubuntu server, where this blog is running.
After poking around a bit, I found that Karmic Ubuntu hasn’t upgraded to PHP 5.3.x yet and that the only way to do it is to compile PHP from source. Even though I am pretty comfortable doing that, I didn’t want to, because it makes it very difficult to upgrade at a later point in time.
I continued my research and then found that it is in fact possible to install PHP 5.3.x through apt-get or aptitude. I thought of documenting it here, so that it would be useful for others who want to do the same thing.
Adding dotdeb to the source list
First you should add the dotdeb repository to your apt source list. Add the following two lines to your /etc/apt/sources.list file.
sudo vim /etc/apt/sources.list
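The repository lines themselves are not reproduced above; the two lines below follow the form dotdeb historically published, but treat the URLs as an assumption and verify them against dotdeb's own instructions before use.

```
deb http://packages.dotdeb.org stable all
deb-src http://packages.dotdeb.org stable all
```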
Adding dotdeb keys to keyring
Dotdeb packages are GPG-signed. Issue the following commands to add the keys to the keyring.
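A hedged sketch of the key import, following the procedure dotdeb historically documented (verify the URL and key before trusting it):

```shell
# Fetch dotdeb's signing key and add it to apt's keyring.
wget http://www.dotdeb.org/dotdeb.gpg
cat dotdeb.gpg | sudo apt-key add -
```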
Install PHP5 packages
Then issue the following command to retrieve the updated package list. I am using aptitude here; you can use apt-get as well.
sudo aptitude update
sudo aptitude upgrade
And then you can install PHP5 packages (and modules) using the normal install command.
sudo aptitude install php5 libapache2-mod-php5
Installing php5-dev package
The above method will install all php5-* packages, but php5-dev has some dependency issues with libtool packages. In order to solve that you have to manually install libtool v1.5.26. To do that use the following commands.
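The commands themselves are not reproduced above; a sketch of a standard source build of libtool 1.5.26 would look like this (the GNU mirror URL is an assumption, so verify it first):

```shell
wget http://ftp.gnu.org/gnu/libtool/libtool-1.5.26.tar.gz
tar xzf libtool-1.5.26.tar.gz
cd libtool-1.5.26
./configure
make
sudo make install
```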
Now it’s time to enjoy the new features that are available in PHP 5.3.
Both Slicehost and Linode are good, but compared with Linode, Slicehost was slightly costlier. I realized it after reading the comparison done by David. I bought an account at Linode for testing and was quite happy with it. But I was too lazy to move all my sites, since it involved some work.
If you follow the articles at Slicehost, you may notice that PickledOnion uses nano as the default editor. I somehow like vi more than nano (not ready for a debate) and was looking for a way to make vi the default editor. After some googling, I found how to do it.
I am writing it down here so that all I have to remember is that I just need to search my blog if I need to do it again in future.
Okay, the command you have to use is the following (I am assuming that you are not logged in as root, which is the recommended approach).
sudo update-alternatives --config editor
Then press the number corresponding to the editor which you want to use. Below is a screenshot of how it looked on my slice.
I must confess that I am a stats freak. If you are a long-time reader of my blog, then you would have known that by now yourself. This explains why I want to preserve my Apache log files in spite of using a variety of stat services like Google Analytics, WordPress stats, StatCounter, and Performancing Metrics (before it was closed).
The default Apache configuration preserves the log files only for the last 10 days, but I wanted to archive these files permanently. After some searching on Google I came across an excellent program called cronolog. Cronolog is a simple filter program which writes each log entry to a separate log file, named after a filename template that you specify. You can use a variety of parameters, like the current date and time, to define the filename template.
First we have to install cronolog, either by using aptitude or by downloading it from its download page. Then you have to change the log file path in the virtual host file. (In Ubuntu Gutsy, the virtual host files are located in /etc/apache2/sites-enabled.) I am using the following format for this blog:

# Custom log file locations
ErrorLog "|/usr/sbin/cronolog /path/to/logs/%Y/%m/%Y-%m-%d-sudarmuthu.com-error.log"
CustomLog "|/usr/sbin/cronolog /path/to/logs/%Y/%m/%Y-%m-%d-sudarmuthu.com-access.log" combined
which will store my log files in separate folders for each year and each month, like the hierarchy below:

/2007/12/2007-12-01-sudarmuthu.com-access.log

You can use a variety of modifiers in the filename, and I have documented some of them in the table below. You can get more information from its documentation.
%p: the locale’s AM or PM indicator
%S: second (00..61, which allows for leap seconds)
%X: the locale’s time representation (e.g. “15:12:47”)
%Z: time zone (e.g. GMT), or nothing if the time zone cannot be determined
%a: the locale’s abbreviated weekday name (e.g. Sun..Sat)
%A: the locale’s full weekday name (e.g. Sunday..Saturday)
%b: the locale’s abbreviated month name (e.g. Jan..Dec)
%B: the locale’s full month name (e.g. January..December)
%c: the locale’s date and time (e.g. “Sun Dec 15 14:12:47 GMT 1996”)
%d: day of month (01..31)
%j: day of year (001..366)
%m: month (01..12)
%U: week of the year with Sunday as the first day of the week (00..53, where week 1 is the week containing the first Sunday of the year)
%W: week of the year with Monday as the first day of the week (00..53, where week 1 is the week containing the first Monday of the year)
%w: day of week (0..6, where 0 corresponds to Sunday)
%x: the locale’s date representation (e.g. today in Britain: “15/12/96”)
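Since cronolog uses strftime-style specifiers, and date(1) accepts the same ones, you can preview what a given template will expand to before wiring it into Apache:

```shell
# Preview today's expansion of the log-file template used above.
date +"%Y/%m/%Y-%m-%d-sudarmuthu.com-access.log"
```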