How to remove duplicate lines inside a text file?

247 votes
12 answers
365366 views
A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (useless in my case, as the file is a CSV-like data table).

What I need is to remove all the repetitions while (preferably, but this can be sacrificed for a significant performance boost) maintaining the original sequence order. In the result each line is to be unique. If there were 100 equal lines (usually the duplicates are spread across the file and won't be neighbours), only one of them is to be left.

I have written a program in Scala (consider it Java if you don't know Scala) to implement this. But maybe there are faster C-written native tools able to do this?

UPDATE: the awk '!seen[$0]++' filename solution seemed to work just fine as long as the files were around 2 GiB or smaller, but now that I need to clean up an 8 GiB file it doesn't work any more. It seems to take forever on a Mac with 4 GiB of RAM, and a 64-bit Windows 7 PC with 4 GiB of RAM and 6 GiB of swap just runs out of memory. And I don't feel enthusiastic about trying it on Linux with 4 GiB of RAM given this experience.
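For reference, a minimal sketch of the two standard approaches (the file name input.txt is a placeholder). The awk one-liner from the question keeps the first occurrence of each line and so preserves order, but it holds every unique line in an in-memory hash, which explains the failure on the 8 GiB file. If order can be sacrificed, as the question allows, sort -u is an alternative: sort(1) spills to temporary files on disk rather than keeping everything in RAM.

```shell
# Sample input with scattered duplicates (hypothetical file name).
printf 'a\nb\na\nc\nb\n' > input.txt

# Order-preserving dedup: seen[$0]++ is 0 (false) the first time a
# line appears, so !seen[$0]++ prints only first occurrences.
# Memory use grows with the number of unique lines.
awk '!seen[$0]++' input.txt
# a
# b
# c

# Order-destroying dedup: external merge sort with bounded memory,
# then unique output. Works on files far larger than RAM.
sort -u input.txt
# a
# b
# c
```

Note that sort -u compares whole lines; the output here happens to match the awk output only because the first occurrences were already in sorted order.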
Asked by Ivan (18358 rep)
Jan 27, 2012, 03:34 PM
Last activity: Aug 30, 2024, 01:12 AM