GNU sort command does not sort words of different lengths with common prefixes correctly when using field delimiter
The GNU sort command is not sorting words of different lengths with common prefixes correctly for me, but only when using a field delimiter to sort on one of multiple fields.
Here is the correct, expected sort behavior without using field delimiters:
$ cat /tmp/test0
b
c
ant
a
bcd
bc
cn
$ sort /tmp/test0
a
ant
b
bc
bcd
c
cn
Note that, for all words with a common string prefix, the shorter word sorts before the longer word. E.g. `a` is before `ant`, `b` is before `bc`, which is before `bcd`, etc. This is the accepted, standard way that English strings are sorted, e.g. in a dictionary.
However, this sorting behavior changes when sorting tabular data (such as a CSV file) on one of its columns. Here's what that looks like:
$ cat /tmp/test1
b,foo
c,bar
ant,baz
a,foo
bcd,ty
bc,pe
cn,cn
$ sort /tmp/test1 -t, -k1
a,foo
ant,baz
bcd,ty
bc,pe
b,foo
c,bar
cn,cn
Note that the words with a common prefix of `a` and `c` are still being handled correctly, but strings with a common prefix of `b` are not: `bcd` sorts before `bc`, which sorts before `b`, all of which is incorrect! This behavior is stable; you always get the same result. I'm experiencing the same issue on a much larger CSV file, where the misorderings look arbitrary but are identical on every run.
I've tried various flags to `sort` and none of them correct this behavior; `-d` and `-s` don't work. This is on GNU coreutils 9.4 `sort`, for what it's worth.
So, is this just a bug with the `sort` command? Am I somehow using it incorrectly? Is there anything better I can do that will dictionary-sort the CSV by the words in the first column?
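One possible explanation, offered as a sketch rather than a confirmed diagnosis: `-k1` with no end position defines a key that starts at field 1 and runs to the end of the line, and many UTF-8 locales give punctuation such as the comma little or no weight when collating, so `bcd,ty` may end up compared roughly like `bcdty`. Under those assumptions, limiting the key to the first field with `-k1,1` and, optionally, forcing byte-wise comparison with `LC_ALL=C` should produce the expected order. The temporary file below is a hypothetical stand-in for `/tmp/test1`.

```shell
# Recreate the sample data from the question in a temporary file
# (hypothetical stand-in for /tmp/test1).
tmpfile=$(mktemp)
printf '%s\n' 'b,foo' 'c,bar' 'ant,baz' 'a,foo' 'bcd,ty' 'bc,pe' 'cn,cn' > "$tmpfile"

# -t,     use the comma as the field separator
# -k1,1   the key starts AND ends at field 1 (plain -k1 runs to end of line)
# LC_ALL=C  compare bytes instead of applying locale collation rules
result=$(LC_ALL=C sort -t, -k1,1 "$tmpfile")
printf '%s\n' "$result"

rm -f "$tmpfile"
```

With the key restricted to the first field, `b`, `bc`, and `bcd` are compared on their own, so the shorter word should sort first. Note that `LC_ALL=C` also changes how uppercase and non-ASCII characters are ordered, so it may not suit every data set; `-k1,1` alone may already be enough.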
Asked by Ben McIlwain (353 rep)
May 24, 2024, 04:21 PM
Last activity: May 24, 2024, 05:53 PM