
Fast version of paste

4 votes
3 answers
964 views
`paste` is a brilliant tool, but it is dead slow: I get around 50 MB/s on my server when running:

```
paste -d, file1 file2 ... file10000 | pv >/dev/null
```

`paste` is using 100% CPU according to `top`, so it is not limited by, say, a slow disk.

Looking at the source code, it is probably slow because it uses `getc`:

```c
while (chr != EOF)
  {
    sometodo = true;
    if (chr == line_delim)
      break;
    xputchar (chr);
    chr = getc (fileptr[i]);
    err = errno;
  }
```

Is there another tool that does the same, but is faster? Maybe by reading 4k-64k blocks at a time? Maybe by using vector instructions to find the newlines in parallel instead of looking at a single byte at a time? Maybe by using `awk` or similar?

The input files are UTF-8 and so big that they do not fit in RAM, so reading *everything* into memory is not an option.

Edit: thanasisp suggests running jobs in parallel. That improves throughput slightly, but it is still an order of magnitude slower than pure `pv`:

```
# Baseline
$ pv file* | head -c 10G >/dev/null
10.0GiB 0:00:11 [ 897MiB/s] [>                    ]  3%

# Paste all files at once
$ paste -d, file* | pv | head -c 1G >/dev/null
1.00GiB 0:00:21 [48.5MiB/s] [                     ]

# Paste 11% at a time in parallel, and finally paste these
$ paste -d, ... | pv | head -c 1G >/dev/null
1.00GiB 0:00:14 [69.2MiB/s] [                     ]
```

`top` still shows that it is the outer `paste` that is the bottleneck.

I tested whether increasing the buffer makes a difference:

```
$ stdbuf -i8191 -o8191 paste -d, ... | pv | head -c 1G >/dev/null
1.00GiB 0:00:12 [80.8MiB/s] [                     ]
```

This increased throughput by 10%. Increasing the buffer further gave no improvement; this is likely hardware dependent (i.e. it may be due to the size of the level 1 CPU cache). The tests are run on a RAM disk to avoid limitations related to the disk subsystem.
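To make the block-/line-at-a-time idea above concrete, here is the kind of thing I have in mind. It is only an untested, hypothetical sketch (the name `paste2.c` and everything in it is mine, not part of any existing tool): it handles exactly two input files, joins them with a comma, and skips `paste`'s corner cases such as more than two inputs, files of different lengths, or a missing trailing newline. The point is that `getline` pulls whole lines out of a large stdio buffer (and, as far as I can tell, glibc scans that buffer with `memchr`, which is vectorized) instead of fetching one byte per call like the `getc` loop quoted above.

```c
/* paste2.c -- hypothetical minimal sketch of "paste -d, file1 file2"
 * reading whole lines with getline() instead of byte-by-byte getc().
 * Illustration only; not a drop-in replacement for paste. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  if (argc != 3)
    {
      fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
      return 1;
    }

  FILE *a = fopen(argv[1], "r");
  FILE *b = fopen(argv[2], "r");
  if (!a || !b)
    {
      perror("fopen");
      return 1;
    }

  /* Large stdio buffers so the kernel sees big read()/write() calls. */
  setvbuf(a, NULL, _IOFBF, 1 << 16);
  setvbuf(b, NULL, _IOFBF, 1 << 16);
  setvbuf(stdout, NULL, _IOFBF, 1 << 16);

  char *la = NULL, *lb = NULL;
  size_t ca = 0, cb = 0;
  ssize_t na, nb;

  /* One line from each file per iteration, joined with ','. */
  while ((na = getline(&la, &ca, a)) != -1
         && (nb = getline(&lb, &cb, b)) != -1)
    {
      if (na > 0 && la[na - 1] == '\n')   /* drop file1's newline */
        na--;
      fwrite(la, 1, (size_t) na, stdout);
      putchar(',');
      fwrite(lb, 1, (size_t) nb, stdout); /* keeps file2's newline */
    }

  free(la);
  free(lb);
  fclose(a);
  fclose(b);
  return 0;
}
```

Something like `cc -O2 -o paste2 paste2.c && ./paste2 file1 file2 | pv >/dev/null` would show whether line-at-a-time reading alone is enough, or whether it really takes raw block reads plus a vectorized newline search to get near `pv` speed.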
Asked by Ole Tange (37348 rep)
Nov 23, 2020, 11:46 PM
Last activity: Jan 14, 2025, 02:37 PM