uniq
The uniq command is a useful Linux tool for finding unique lines of text in a file or in the standard output of another command. It filters out adjacent repeated lines, leaving a single copy of each line in its output. This is helpful when you’re working with large text files and need to quickly identify distinct lines and eliminate duplicates to streamline your text-processing tasks.
The aim of this guide is to help you discover the different applications and choices offered by the uniq command, with the goal of incorporating it into your daily toolset.
Basic Syntax of the uniq Command
There are two main ways to use the uniq command. You can either run it directly on a file or run it on the stdout of another command through a pipe.
Interacting Directly With a File
The syntax for using uniq directly with a file is simple:
uniq <optional-arguments> <filename>
This runs the uniq command with any of your chosen arguments on a specified file. These arguments allow you to change the behavior and the output of the uniq command. That output is then sent to stdout, which means you can manipulate the output further with other text-processing commands if you want to.
Interacting With stdout
If you have a file that you want to process using other commands before you run the uniq command on the output, your syntax might look like this:
cat <filename> | grep -oP '<regular-expression>' | sort | uniq
Linux (and other Unix-like operating systems) supports piping, which takes the output of one command and passes it as input to the next command in the pipeline.
The previous code snippet does the following:
- cat is used to display the contents of a file to stdout.
- That output is sent to grep, which is a command that can extract certain parts of the output based on a pattern (a regular expression in this case).
- The extracted lines that matched the pattern are sent to sort, which by default sorts the output alphabetically.
- Finally, the uniq command removes all duplicate lines from the output and displays the result to the user.
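To make the pipeline concrete, here is a minimal sketch. The file access.log and its contents are hypothetical, and grep -oE (extended regular expressions) stands in for -oP so the example does not depend on PCRE support:

```shell
# Hypothetical sample data: one request per line, starting with an IP address.
printf '%s\n' \
  '10.0.0.2 GET /index.html' \
  '10.0.0.1 GET /about.html' \
  '10.0.0.2 GET /contact.html' \
  '10.0.0.1 GET /index.html' > access.log

# Extract only the IP addresses, sort them so duplicates become adjacent,
# then let uniq collapse each run of identical lines into one.
cat access.log | grep -oE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort | uniq
# Prints:
# 10.0.0.1
# 10.0.0.2
```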
This example may look advanced to you, but in the next section, you’ll review a few more code samples so you can become more comfortable processing and analyzing data using the uniq command as well as other Linux commands.
Using the uniq Command on a File
Now that you understand the basic syntax of the uniq command, try using uniq on a file on your system.
If you want to follow along, create a file called lyrics.txt somewhere on your system and add the lyrics to one of your favorite songs. The example in this guide uses the lyrics for the song “Happy Birthday”:
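One way to create such a file from the shell is sketched below. The name in the third line (“reader”) is a placeholder chosen for this example:

```shell
# Write a four-line sample file; "reader" is a placeholder name.
cat > lyrics.txt <<'EOF'
Happy birthday to you
Happy birthday to you
Happy birthday dear reader
Happy birthday to you
EOF

# Confirm the contents.
cat lyrics.txt
```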
Removing Duplicate Lines
To remove duplicate lines, run the uniq command on your lyrics.txt file to see the output:
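As a sketch, assuming lyrics.txt holds the four-line sample recreated below (the name “reader” is a placeholder), the command and its result look like this:

```shell
# Recreate the assumed sample file so the example is self-contained.
printf '%s\n' 'Happy birthday to you' 'Happy birthday to you' \
  'Happy birthday dear reader' 'Happy birthday to you' > lyrics.txt

# Collapse adjacent duplicate lines.
uniq lyrics.txt
# Prints:
# Happy birthday to you
# Happy birthday dear reader
# Happy birthday to you
```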
The uniq command removes any duplicate lines that follow one after the other. In this example, even though the fourth line is the same as the first and second lines, it’s not removed from the output because the fourth line is not a duplicate of the preceding line.
If you don’t specify any optional arguments for the command, this is the default mode of operation for the uniq command.
Counting Duplicate Lines
You can also count the number of duplicate lines in a file by providing the -c argument to your command:
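A minimal sketch, again assuming the four-line lyrics.txt sample with a placeholder name in the third line:

```shell
# Recreate the assumed sample file so the example is self-contained.
printf '%s\n' 'Happy birthday to you' 'Happy birthday to you' \
  'Happy birthday dear reader' 'Happy birthday to you' > lyrics.txt

# Prefix each output line with the number of adjacent occurrences.
uniq -c lyrics.txt
# Prints (leading padding trimmed; it varies by implementation):
# 2 Happy birthday to you
# 1 Happy birthday dear reader
# 1 Happy birthday to you
```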
As you can see, the count isn’t perfect. The first and last lines of the output are duplicates of one another, but they are counted separately because uniq only compares each line with the line immediately before it.
Sorting Text Before Counting
You can sort the file before using uniq to count the unique lines. For instance, take a look at what the sort command does to the lyrics.txt file:
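A quick sketch with the same assumed four-line sample (the name “reader” is a placeholder):

```shell
# Recreate the assumed sample file so the example is self-contained.
printf '%s\n' 'Happy birthday to you' 'Happy birthday to you' \
  'Happy birthday dear reader' 'Happy birthday to you' > lyrics.txt

# Sort the lines so that identical lines end up next to each other.
sort lyrics.txt
# Prints:
# Happy birthday dear reader
# Happy birthday to you
# Happy birthday to you
# Happy birthday to you
```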
sort has many other options, including numeric or alphabetical sorting in ascending or descending order, but for this example the default sorting method works just fine because the file contains only alphabetic text. When lines contain white space or other special characters, their order follows the collation rules of your locale.
You can process the results from sort by piping the results of that command to the uniq command:
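Sketched with the same assumed sample file (placeholder name in the third line), the combined pipeline looks like this:

```shell
# Recreate the assumed sample file so the example is self-contained.
printf '%s\n' 'Happy birthday to you' 'Happy birthday to you' \
  'Happy birthday dear reader' 'Happy birthday to you' > lyrics.txt

# Sort first so duplicates are adjacent, then count them.
sort lyrics.txt | uniq -c
# Prints (leading padding trimmed):
# 1 Happy birthday dear reader
# 3 Happy birthday to you
```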
This time around, uniq is correctly counting the number of duplicate lines because the lines have already been sorted.
Running the uniq Command with Arguments and Flags
So far, you’ve been introduced to the -c argument for the uniq command. However, the uniq command is capable of a lot more.
As with almost any command in Linux, you can pass the --help option to find out what arguments are available for a particular command (some commands also accept -h, but GNU uniq does not):
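For reference, a minimal sketch of the invocation, assuming the GNU coreutils version of uniq that ships with most Linux distributions:

```shell
# Assumes GNU coreutils; other implementations (BSD, BusyBox) may not
# support --help, in which case `man uniq` is the fallback.
uniq --help
```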
Don’t be alarmed at this wall of text. This guide goes through a few advanced examples with you and uses some of the other available arguments to show them in action.
Advanced Techniques to Use the uniq Command
If you have a more structured data file, like a CSV, you can also use uniq to count duplicate values of a specific field. The example used here is pretty specific for the purpose of this guide, but you can create your own example to experiment with.
Suppose you have a CSV file that tracks meteorite landings around the world with the following format: day, time, country:
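A file in that shape can be sketched as follows. The file name meteorites.csv and every row in it are hypothetical values invented for illustration:

```shell
# Hypothetical data: a header row plus six landings.
# Every data row has exactly 17 characters before the country field.
cat > meteorites.csv <<'EOF'
day,time,country
2023-01-04,08:15,Canada
2023-01-09,21:40,Brazil
2023-02-11,03:05,Brazil
2023-02-20,17:30,Canada
2023-03-02,12:00,Norway
2023-03-15,06:45,Canada
EOF

cat meteorites.csv
```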
You may want to see how many times each country has experienced a meteorite strike since you’ve started tracking them.
Skipping a Few Characters
The rows in this file have a fixed layout, so you can use the --skip-chars=N option to skip the first N characters of each line and compare only the values of the country field. This is only possible because the file has a rigid structure and every row has exactly the same number of characters preceding the country field:
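As a sketch, assume a hypothetical meteorites.csv in which each data row has 17 characters (a 10-character date, a comma, a 5-character time, and a comma) before the country, so N is 17:

```shell
# Hypothetical sample data, recreated so the example is self-contained.
printf '%s\n' 'day,time,country' \
  '2023-01-04,08:15,Canada' '2023-01-09,21:40,Brazil' \
  '2023-02-11,03:05,Brazil' '2023-02-20,17:30,Canada' \
  '2023-03-02,12:00,Norway' '2023-03-15,06:45,Canada' > meteorites.csv

# Skip the first 17 characters of every line before comparing.
# (--skip-chars is the GNU long form of the -s option.)
uniq -c --skip-chars=17 meteorites.csv
# Prints (leading padding trimmed):
# 1 day,time,country
# 1 2023-01-04,08:15,Canada
# 2 2023-01-09,21:40,Brazil
# 1 2023-02-20,17:30,Canada
# 1 2023-03-02,12:00,Norway
# 1 2023-03-15,06:45,Canada
```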
Unfortunately, you can see that it didn’t work as expected. It did count duplicates correctly where the country values were adjacent, so you know the --skip-chars option is working. This means that you’re running into the same problem as before: uniq is only counting a duplicate if the values being compared are adjacent to each other.
At this point, you’re probably realizing that the uniq command has some limitations. It’s highly unlikely that you can use the uniq command on its own without first modifying or extracting specific bits of data that you want to count with uniq. You’re going to need to use other commands to help you achieve that.
Sort the file and then count the output with uniq again:
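A sketch of that pipeline, using a hypothetical meteorites.csv whose data rows have 17 characters before the country field:

```shell
# Hypothetical sample data, recreated so the example is self-contained.
printf '%s\n' 'day,time,country' \
  '2023-01-04,08:15,Canada' '2023-01-09,21:40,Brazil' \
  '2023-02-11,03:05,Brazil' '2023-02-20,17:30,Canada' \
  '2023-03-02,12:00,Norway' '2023-03-15,06:45,Canada' > meteorites.csv

# Sort on the third comma-separated field, then count while ignoring
# the first 17 characters of each line.
cat meteorites.csv | sort -t, -k3 | uniq -c --skip-chars=17
# Brazil (2), Canada (3) and Norway (1) are now counted correctly, but the
# header row day,time,country is sorted in with the data and counted too.
# Where the header lands in the output depends on your locale.
```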
This time, you’re telling sort that you’re using commas for your value delimiter (-t,) and that you want to sort on the third column (-k3).
Now, the uniq command can properly check for duplicates because the sorted column makes sure that all duplicate values are adjacent to one another. However, the output isn’t very clean. You can fix that by using tail instead of cat and displaying everything except the CSV header line:
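Sketched against a hypothetical meteorites.csv (17 characters before the country field in every data row), tail -n +2 starts the output at the second line and so drops the header:

```shell
# Hypothetical sample data, recreated so the example is self-contained.
printf '%s\n' 'day,time,country' \
  '2023-01-04,08:15,Canada' '2023-01-09,21:40,Brazil' \
  '2023-02-11,03:05,Brazil' '2023-02-20,17:30,Canada' \
  '2023-03-02,12:00,Norway' '2023-03-15,06:45,Canada' > meteorites.csv

# Drop the header, sort on the country field, then count.
tail -n +2 meteorites.csv | sort -t, -k3 | uniq -c --skip-chars=17
# Prints (leading padding trimmed):
# 2 2023-01-09,21:40,Brazil
# 3 2023-01-04,08:15,Canada
# 1 2023-03-02,12:00,Norway
```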
Skipping a Few Fields
You can also tell uniq to skip a number of leading fields before comparing lines. However, the uniq command expects fields to be separated by blanks (spaces or tabs). That means you have to replace the commas with spaces after sorting the output:
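A minimal sketch of this approach, assuming a hypothetical meteorites.csv and using --skip-fields=2 (the GNU long form of -f 2):

```shell
# Hypothetical sample data, recreated so the example is self-contained.
printf '%s\n' 'day,time,country' \
  '2023-01-04,08:15,Canada' '2023-01-09,21:40,Brazil' \
  '2023-02-11,03:05,Brazil' '2023-02-20,17:30,Canada' \
  '2023-03-02,12:00,Norway' '2023-03-15,06:45,Canada' > meteorites.csv

# Drop the header, sort on the country field, turn commas into spaces,
# then count while skipping the first two blank-separated fields.
tail -n +2 meteorites.csv | sort -t, -k3 | sed 's/,/ /g' | uniq -c --skip-fields=2
# Prints (leading padding trimmed):
# 2 2023-01-09 21:40 Brazil
# 3 2023-01-04 08:15 Canada
# 1 2023-03-02 12:00 Norway
```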
As before, you’re first using the tail command to display everything except the first line, then sorting the third column with the sort command.
Then the sed command replaces all commas with a space character, and the uniq command can look for duplicates in the third column because you specified that you want to skip the first two fields when you’re comparing values.
Implementing a Cleaner Output
Suppose you only want to display the country names and their distinct value counts. You’re going to need the help of the cut command. The cut command in Linux allows you to extract specific sections or columns of text from files or command output:
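One possible pipeline is sketched below, assuming a hypothetical meteorites.csv. Here cut extracts the country column first, so no fields or characters need to be skipped:

```shell
# Hypothetical sample data, recreated so the example is self-contained.
printf '%s\n' 'day,time,country' \
  '2023-01-04,08:15,Canada' '2023-01-09,21:40,Brazil' \
  '2023-02-11,03:05,Brazil' '2023-02-20,17:30,Canada' \
  '2023-03-02,12:00,Norway' '2023-03-15,06:45,Canada' > meteorites.csv

# Drop the header, keep only the third comma-separated field,
# sort the country names, and count each one.
tail -n +2 meteorites.csv | cut -d, -f3 | sort | uniq -c
# Prints (leading padding trimmed):
# 2 Brazil
# 3 Canada
# 1 Norway
```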
Now you can even sort the count column to display the results in descending order:
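A sketch of the extended pipeline, again assuming a hypothetical meteorites.csv:

```shell
# Hypothetical sample data, recreated so the example is self-contained.
printf '%s\n' 'day,time,country' \
  '2023-01-04,08:15,Canada' '2023-01-09,21:40,Brazil' \
  '2023-02-11,03:05,Brazil' '2023-02-20,17:30,Canada' \
  '2023-03-02,12:00,Norway' '2023-03-15,06:45,Canada' > meteorites.csv

# Count each country, then order the result by count, largest first.
tail -n +2 meteorites.csv | cut -d, -f3 | sort | uniq -c | sort -nr
# Prints (leading padding trimmed):
# 3 Canada
# 2 Brazil
# 1 Norway
```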
The -nr option for the sort command instructs sort to sort using numeric values in reverse order.
Alternatives to the uniq Command
As you may have noticed, there are many ways to accomplish the same thing in Linux. You replaced cat with tail to remove the first line of the file and inserted cut into your pipeline to only grab specific fields from the file.
In that same spirit, there are many other commands in the Linux ecosystem that can achieve what the uniq command achieves in more or less the same fashion.
awk
awk is basically a mini-scripting language that is commonly used for manipulating data and producing usable output.
It’s a lot more complex to use than uniq and requires you to learn a separate scripting language. However, it doesn’t suffer from the same limitations that uniq does. For example, awk does not need a file to be sorted first.
Let’s use awk with our sample CSV file:
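The command can be sketched as follows, assuming a hypothetical meteorites.csv. Placing the print loop in an END block is an assumption of this sketch; it makes the loop run once, after all lines have been read:

```shell
# Hypothetical sample data, recreated so the example is self-contained.
printf '%s\n' 'day,time,country' \
  '2023-01-04,08:15,Canada' '2023-01-09,21:40,Brazil' \
  '2023-02-11,03:05,Brazil' '2023-02-20,17:30,Canada' \
  '2023-03-02,12:00,Norway' '2023-03-15,06:45,Canada' > meteorites.csv

# Count occurrences of each value in column 3, skipping the header line,
# then print every country with its count once the input is exhausted.
awk -F, 'NR!=1{A[$3]++}END{for(i in A)print i,A[i]}' meteorites.csv
# Prints these three lines; awk does not guarantee their order:
# Brazil 2
# Canada 3
# Norway 1
```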
The -F, option specifies that your file is comma-delimited. Inside the script, NR!=1 tells awk to ignore the first line, and {A[$3]++} tells awk to count how many times each value in the third column ($3) occurs, storing the counts in an array called A. Finally, it prints every key in A with its count using a for loop (i.e., {for(i in A)print i,A[i]}).
The biggest change here is that you used only the awk command to achieve this, and you didn’t need to sort the file at all. However, for novice Linux users, acquiring proficiency in the awk scripting language may pose a more significant challenge.
sort
You can also use the sort command to remove duplicates:
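A sketch of one such command, assuming a hypothetical meteorites.csv and keeping tail to drop the header row:

```shell
# Hypothetical sample data, recreated so the example is self-contained.
printf '%s\n' 'day,time,country' \
  '2023-01-04,08:15,Canada' '2023-01-09,21:40,Brazil' \
  '2023-02-11,03:05,Brazil' '2023-02-20,17:30,Canada' \
  '2023-03-02,12:00,Norway' '2023-03-15,06:45,Canada' > meteorites.csv

# Sort on the country field and keep only one row per distinct country.
tail -n +2 meteorites.csv | sort -t, -k3 -u
# Prints three rows: one each for Brazil, Canada and Norway.
```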
Here, you’ve added the -u flag to sort. It now only displays the first occurrence of each distinct value in the specified column (i.e., -k3).
Again, you can see that there are differences between sort and uniq. While sort correctly removes the duplicates, it can’t display a count of the distinct values.
Conclusion
In this tutorial, you’ve learned how to use uniq to count distinct values in a file, and the importance of sorting and cleaning the file beforehand. You’ve also seen that uniq works best on well-structured, clean inputs, and that numerous other Linux commands can help you tailor your output.
For further exploration, refer to the additional resources linked in this guide and the Text Processing Commands page from The Linux Documentation Project.
And after you’ve mastered uniq and are ready for more Linux tools, why not give Earthly a try? It’s a fantastic build tool that works great at the command line.
Keep experimenting and overcoming challenges that come your way!