Remove duplicate lines based on a field

Recently, while formatting some data files for further processing, I had to remove duplicate lines from a file based on a particular field. After trying out the cut and grep commands, I was finally able to solve it with a very concise awk command/script.

The command is so concise, yet packed with so much information, that it helped me learn more about the awk scripting language. I thought of writing about it here so that it is useful for others, and also so that I know where to look for it when I need it 🙂

Feel free to use it in whatever way you want, if it solves your problem as well.

Input and output data

Let me first explain the input data I had and the output that I was expecting.

Consider a file which has the following lines. Each line has four fields.
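Here is a made-up file of that shape (the data is purely illustrative), with name, designation, age and city as the four whitespace-separated fields:

    Alice Developer 32 Chennai
    Harry Manager 45 Mumbai
    Bob Manager 49 Delhi
    Carol Developer 28 Pune
    Dave Tester 35 Bangalore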

Now assume that we want to remove duplicate lines by comparing only the second field. We want the output to look like this.
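For the illustrative file above, that means:

    Alice Developer 32 Chennai
    Harry Manager 45 Mumbai
    Dave Tester 35 Bangalore

Bob's and Carol's lines are dropped because their second field (Manager and Developer) has already appeared on an earlier line.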

Command

Get ready for the surprise. The actual command is just this.
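(In the line below, input.txt stands in for whatever file you have.)

    awk '!x[$2]++' input.txt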

Explanation

awk script execution and printing

The awk expression is evaluated for each line of the input. If the result is true, the line is printed (that is awk's default action); if the result is false, the line is not printed.
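As a quick (and admittedly silly) illustration of this rule, the first command below prints every line, because the expression 1 is always true, while the second prints nothing:

    awk '1' input.txt
    awk '0' input.txt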

Associative arrays

The awk language supports associative arrays, similar to the ones found in PHP. The expression x[$2]++ fills up an associative array. The key used here is $2, which refers to the second field, and x is the name of the array. You can use any name for it.
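As a side note, you can see the same kind of array at work in this small sketch (separate from the dedup command), which counts how many times each value of the second field appears:

    awk '{ x[$2]++ } END { for (key in x) print key, x[key] }' input.txt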

The array is populated as each line is read. This is how the array would look after each line.
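With the made-up file from above, the array grows like this:

    After line 1: x["Developer"] = 1
    After line 2: x["Manager"]   = 1
    After line 3: x["Manager"]   = 2
    After line 4: x["Developer"] = 2
    After line 5: x["Tester"]    = 1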

Conditional evaluation

The ! operator performs a boolean negation, and the result of that negation determines whether a particular line is passed on to the output (printed) or not.

When a field value is seen for the first time, x[$2] is zero, which is false. The ! (not) operator negates it to true, so the line is passed on to the output (printed). Since ++ is a post-increment, the count is incremented only after this test, which is why the first occurrence always gets through. When a duplicate is found, x[$2] already holds a non-zero count, which is true, but the ! converts it to false and that line is not passed on to the output.

The expanded version of the above command would be
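One possible expanded form, which spells out the implicit print, would be:

    awk '{ if (!x[$2]++) { print $0 } }' input.txt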

But what is the fun in using the expanded version? 😉

Field separator

In the input file that I had, the fields were separated by whitespace, so I didn’t have to specify the field separator. But if your file uses a non-whitespace field separator, such as a comma, then you can specify it by adding FS="," to the above command.
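For example, for a comma-separated file, either of these should work (input.csv is again just a placeholder; -F is the equivalent command-line option):

    awk '!x[$2]++' FS="," input.csv
    awk -F, '!x[$2]++' input.csv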

This one-liner actually taught me that awk is a full programming language that can be used to create scripts, and it also increased my understanding of the way the awk command works. Hopefully it teaches you something as well 🙂

I know that this is already a concise version, but if you think that this can be improved, then do let me know.

2 Comments so far

  • Anonymous says:

    The below cmd gives the same result

    cat filename | sort -u

  • mytechskill says:

    Hi Sudar,

    Your link is quite helpful for me. I have a situation where, instead of removing the later duplicate records, I want to keep the last one and remove all the previous records.
    For example, you have removed

    Bob Manager 49 -> Manager field is duplicate

    but I want to remove the first one not the last one.
    Harry Manager 45 -> Manager field is duplicate

    Can you please help me to resolve it.

    Much thanks in Advance…!!!
