GNU sort command does not sort words of different lengths with common prefixes correctly when using field delimiter

10 votes
3 answers
2175 views
The GNU sort command is not sorting words of different lengths with common prefixes correctly for me, but only when using a field delimiter to sort on one of multiple fields. Here is the correct, expected sort behavior without using field delimiters:
$ cat /tmp/test0
b
c
ant
a
bcd
bc
cn

$ sort /tmp/test0
a
ant
b
bc
bcd
c
cn
Note that, for all words with a common string prefix, the shorter word sorts before the longer word. E.g. a is before ant, b is before bc is before bcd, etc. This is the accepted, standard way that English strings are sorted, e.g. in a dictionary. However, this sorting behavior changes when you are attempting to sort tabular data (such as a CSV file), and sorting on one of the columns. Here's what that looks like:
$ cat /tmp/test1
b,foo
c,bar
ant,baz
a,foo
bcd,ty
bc,pe
cn,cn

$ sort /tmp/test1 -t, -k1
a,foo
ant,baz
bcd,ty
bc,pe
b,foo
c,bar
cn,cn
Note that the words with a common prefix of a and c are still handled correctly, but those with a common prefix of b are not: bcd sorts before bc, which sorts before b, all of which is incorrect! This behavior is stable; you always get the same result. I'm seeing the same issue on a much larger CSV file, where the misordering looks arbitrary but is reproducible from run to run. I've tried various flags to sort and none of them corrects this behavior; -d and -s in particular don't help. This is GNU coreutils 9.4 sort, for what it's worth. So, is this just a bug in the sort command? Am I somehow using it incorrectly? Is there anything better I can do to dictionary-sort the CSV by the words in the first column?
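One possible explanation, offered as an assumption rather than a confirmed diagnosis: with -k1 and no end field, GNU sort's key runs from field 1 to the end of the line, so the comma and the second column take part in the comparison. Many UTF-8 locales also give punctuation little or no weight when collating, which would make b,foo compare roughly like bfoo and bcd,ty like bcdty. A minimal sketch of a workaround under that assumption limits the key to the first field with -k1,1 and forces byte-wise comparison with LC_ALL=C. The scratch file from mktemp and the variable names are illustrative only.

```shell
# Recreate the sample data from the question in a scratch file.
tmp=$(mktemp)
printf '%s\n' 'b,foo' 'c,bar' 'ant,baz' 'a,foo' 'bcd,ty' 'bc,pe' 'cn,cn' > "$tmp"

# -t,      use a comma as the field separator
# -k1,1    start AND end the sort key at field 1 (plain -k1 runs to end of line)
# LC_ALL=C compare bytes instead of using the current locale's collation rules
sorted=$(LC_ALL=C sort -t, -k1,1 "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

With this, b,foo comes before bc,pe, which comes before bcd,ty. Restricting the key with -k1,1 may be enough on its own in the original locale; LC_ALL=C is included here only to make the result independent of whichever locale happens to be active.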
Asked by Ben McIlwain (353 rep)
May 24, 2024, 04:21 PM
Last activity: May 24, 2024, 05:53 PM