Removing duplicate values based on two columns

I have a file that I would like to filter for duplicate rows, based on columns 1 and 6:

    ID,sample,NAME,reference,app_name,appession_id,workflow,execution_status,status,date_created
    1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022

The final output should look like this:

    ID,sample,NAME,reference,app_name,appession_id,workflow,execution_status,status,date_created
    1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022

So far, this is what I have tried:

    awk '!a[$1 $6]++ { print ;}' input.csv > output.csv

but I end up with:

    ID,sample,NAME,reference,app_name,appession_id,workflow,execution_status,status,date_created
    1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
    2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022

Any suggestions would be helpful. Thank you.
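
One thing worth checking in the attempt above: without `-F,`, awk splits fields on whitespace, so for these rows `$6` is empty and `$1` is everything up to the first space, not the `ID` column. In addition, `$1 $6` concatenates the two values with nothing between them, so distinct pairs such as `1` + `23` and `12` + `3` would produce the same key. A minimal sketch of a comma-aware version follows. It assumes fields never contain quoted commas; the file names `sample_input.csv` and `sample_output.csv` and the array name `seen` are made up for illustration, and the sample rows are a shortened copy of the data above.

```shell
# Recreate a shortened copy of the sample data (hypothetical file name).
cat > sample_input.csv <<'EOF'
ID,sample,NAME,reference,app_name,appession_id,workflow,execution_status,status,date_created
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
1,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
2,ABC,XYZ,DOP,2022-08-18 13:31:09Z,28997974,same,Complete,PASS,18/08/2022
EOF

# -F, makes awk split on commas, so $1 is ID and $6 is appession_id.
# seen[$1, $6] joins the two fields with SUBSEP, which keeps keys unambiguous.
# The pattern is true only the first time a key is seen; the default action prints the line.
awk -F, '!seen[$1, $6]++' sample_input.csv > sample_output.csv

cat sample_output.csv
```

With this sample, the header and one row each for `ID` 1 and 2 are kept. If the real file still shows extra rows, the values in columns 1 or 6 probably differ in some invisible way (stray spaces, for example), which `cat -A input.csv` on GNU systems can help reveal.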
                        

Asked by nbn (113 rep)

Oct 14, 2022, 03:59 PM
Last activity: Oct 17, 2022, 07:58 AM
