How to get unique occurrence of words from a very large file?
-1
votes
1
answer
1066
views
I have been asked to write a word frequency analysis program using
Unix shell scripts, with the following requirements:
- Input is a text file with one word per line
- Input words are drawn from the Compact Oxford English Dictionary New Edition
- Character encoding is UTF-8
- Input file is 1 Pebibyte (PiB) in length
- Output is of the format “Word occurred N times”
One way I know to begin is:
cat filename | xargs -n1 | sort | uniq -c > newfilename
What is the most efficient way to do this, given the size of the input?
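One note on the pipeline above: since the input already has one word per line, the `xargs -n1` stage is unnecessary (and `cat` can be dropped too). More importantly, `sort` on a 1 PiB file forces an enormous external merge sort on disk. Because the words come from a fixed dictionary, the number of *unique* words is small (a few hundred thousand at most), so a single streaming pass that counts into an in-memory hash avoids sorting the data entirely. A minimal sketch with `awk` (the filename is a placeholder; assumes any POSIX awk):

```shell
#!/bin/sh
# One pass over the input: count each line (= one word) in an awk
# associative array, then print counts in the required format.
# Memory use is proportional to the number of unique words, not the
# 1 PiB input size.
awk '
  { count[$0]++ }
  END { for (w in count) printf "%s occurred %d times\n", w, count[w] }
' "$1"
```

Setting `LC_ALL=C` can speed up text processing further by skipping locale-aware collation, though with UTF-8 input you should verify the words are only compared byte-for-byte (which is all exact counting needs). For parallelism, the file can be split into chunks, counted independently, and the per-chunk counts merged by summing.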
Asked by Pratik Barjatiya
(23 rep)
Dec 29, 2017, 06:27 AM
Last activity: Apr 15, 2025, 03:40 PM