Grouping rows by categories avoiding repetition

2 votes
5 answers
352 views
I have a tab-separated file with two columns on a Linux machine. The first column contains names; the second contains comma-separated GO IDs (always of the format GO: followed by seven digits). I need to end up with one row per name, with each GO ID listed only once, merging the repeated rows and discarding duplicate IDs. From this:
Pr_g33687.t1    GO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625
Pr_g33687.t1    GO:0003735,GO:0009129,GO:0006412
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565
Into this:
Pr_g33687.t1    GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625,GO:0009129
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605,GO:0006355,GO:0043565
I would appreciate your help. Thank you. RS
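For reference, the transformation described above could be sketched in Python roughly as follows. This is only one possible approach, not a definitive solution: the function name `merge_go_ids` and the `SAMPLE` list are made up for illustration, and it assumes names contain no whitespace and that GO IDs should stay in the order they are first seen for each name.

```python
def merge_go_ids(lines):
    """Merge rows by name, keeping each GO ID once in first-seen order."""
    merged = {}  # name -> dict used as an insertion-ordered set of GO IDs
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace (tab or spaces).
        name, ids = line.split(None, 1)
        seen = merged.setdefault(name, {})
        for go_id in ids.split(","):
            go_id = go_id.strip()
            if go_id:
                seen[go_id] = None  # re-adding an existing key keeps its position
    return [name + "\t" + ",".join(ids) for name, ids in merged.items()]


# Sample rows copied from the question (hypothetical variable name).
SAMPLE = [
    "Pr_g33687.t1\tGO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625",
    "Pr_g33687.t1\tGO:0003735,GO:0009129,GO:0006412",
    "Pr_g15244.t1\tGO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605",
    "Pr_g15244.t1\tGO:0003700,GO:0006355,GO:0043565",
    "Pr_g15244.t1\tGO:0003700,GO:0006355,GO:0043565",
]

for row in merge_go_ids(SAMPLE):
    print(row)
```

To run it on a real file, an open file handle can be passed in place of `SAMPLE`, since the function only iterates over lines. Names are emitted in the order they first appear in the input.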
Asked by rseg (35 rep)
Jun 26, 2024, 02:04 PM
Last activity: Jul 7, 2024, 04:03 AM