
comm for n files

3 votes
2 answers
109 views
I am looking for comm's functionality for n files, i.e. more than two. man comm reads:
COMM(1)

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.

       With no options, produce three-column output.
       Column one contains lines unique to FILE1,
       column two contains lines unique to FILE2,
       and column three contains lines common to both files.
A first, non-optimized approach in bash, with a different output format, to illustrate the idea:
user@host MINGW64 dir
$ ls
abc  ac  ad  bca  bcd

user@host MINGW64 dir
$ tail -n +1 *
==> abc  ac  ad  bca  bcd &2 echo -en "${entry}\t"
   7   │     for file in "$@"; do
   8   │         foundentry=$(grep "$entry" "$file")
   9   │         echo -en "${foundentry}\t"
  10   │     done
  11   │     echo -en "\n"
  12   │ done
───────┴───────────────────────────────────────────────────────────────────────

user@host MINGW64 dir
$ time otherdir/ncomm.sh *
all     abc     ac      ad      bca     bcd
a       a       a       a       a
b       b                       b       b
c       c       c               c       c
d                       d               d

real    0m12.921s
user    0m0.579s
sys     0m4.586s

user@host MINGW64 dir
$
This displays column headers (to stderr), then a first column "all" containing every entry found in any of the files, sorted, followed by one column per file from the parameter list, each showing that file's entries in the matching rows. Because grep is invoked once for every cell outside the first column and the first row, this is really slow.

As with comm, this output is only suitable for short lines/entries like ids. A more concise version could output an x or similar for each found entry in columns 2+. This should work on Git for Windows' MSYS2 and on RHEL.

**How can this be achieved in a more performant manner?**
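One possible direction, sketched here under assumptions rather than as a tested replacement: read all files in a single awk pass and build the table in memory, so that no process is spawned per cell. The function name `ncomm` is reused for illustration only. Unlike the grep call above, which matches the entry as a pattern anywhere in a line, this sketch assumes that entries are whole lines and compares them exactly. It also assumes that the combined input fits in memory and that no file name contains `=` (awk would treat such an argument as a variable assignment).

```shell
# ncomm FILE...: print one row per distinct line found in any FILE.
# Column 1 is the line itself; each following column repeats the line
# if the corresponding FILE contains it and is empty otherwise.
ncomm() {
    # Header row goes to stderr, as in the original script.
    { printf 'all'; printf '\t%s' "$@"; printf '\n'; } >&2

    awk '
        # Remember which file each line occurred in, and every distinct line.
        { seen[FILENAME, $0] = 1; all[$0] = 1 }

        END {
            for (line in all) {
                row = line
                # ARGV[1] .. ARGV[ARGC-1] are the file names in argument order.
                for (i = 1; i < ARGC; i++)
                    row = row "\t" (((ARGV[i], line) in seen) ? line : "")
                print row
            }
        }
    ' "$@" | LC_ALL=C sort    # awk iterates arrays in no defined order, so sort afterwards
}
```

It would be called as `ncomm *` in the directory above. Only POSIX awk features and `sort` are used, so it should run with the awk shipped by both Git for Windows and RHEL, although that has not been verified here. Printing `"x"` instead of `line` in the inner loop gives the more concise variant mentioned above.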
Asked by Julia (31 rep)
Jul 29, 2021, 04:17 AM
Last activity: Jan 31, 2022, 02:48 PM