Grouping rows by categories avoiding repetition
2 votes, 5 answers, 352 views
I have a tab-separated file with two columns on a Linux machine. The first column contains names; the second contains comma-separated GO IDs (always of the format "GO:" followed by seven digits). I need to keep each name on a single row with each of its GO IDs listed only once, discarding duplicate IDs and merging repeated rows for the same name.
From this
Pr_g33687.t1 GO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625
Pr_g33687.t1 GO:0003735,GO:0009129,GO:0006412
Pr_g15244.t1 GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605
Pr_g15244.t1 GO:0003700,GO:0006355,GO:0043565
Pr_g15244.t1 GO:0003700,GO:0006355,GO:0043565
Into this
Pr_g33687.t1 GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625,GO:0009129
Pr_g15244.t1 GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605,GO:0006355,GO:0043565
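One way to get from the first listing to the second is sketched below in Python. This is only an illustration under stated assumptions, not a definitive solution: the function name `merge_go_terms` is made up for this example, the columns are assumed to be separated by a single tab, and the output order is assumed to be first-seen order for both names and GO IDs, which is what the desired output above shows.

```python
def merge_go_terms(lines):
    """Collapse rows sharing a name into one row with de-duplicated GO IDs.

    Assumes each line is "<name>\t<comma-separated GO IDs>".
    Names and IDs are kept in the order they are first seen.
    """
    merged = {}  # name -> dict used as an ordered set of GO IDs
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        name, _, ids = line.partition("\t")
        seen = merged.setdefault(name, {})
        for go_id in ids.split(","):
            go_id = go_id.strip()
            if go_id:
                seen[go_id] = None  # dict keys preserve insertion order
    return [f"{name}\t{','.join(ids)}" for name, ids in merged.items()]


# Demonstration on the rows from the question (tab-separated).
sample = [
    "Pr_g33687.t1\tGO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625",
    "Pr_g33687.t1\tGO:0003735,GO:0009129,GO:0006412",
    "Pr_g15244.t1\tGO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605",
    "Pr_g15244.t1\tGO:0003700,GO:0006355,GO:0043565",
    "Pr_g15244.t1\tGO:0003700,GO:0006355,GO:0043565",
]
for row in merge_go_terms(sample):
    print(row)
```

To run it on a real file, the function can be handed an open file object instead of the sample list, for example `merge_go_terms(open("input.tsv"))`, where `input.tsv` is a placeholder name. The whole table is held in memory, which should be fine for files of moderate size.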
I would appreciate your help. Thank you.
Asked by rseg
(35 rep)
Jun 26, 2024, 02:04 PM
Last activity: Jul 7, 2024, 04:03 AM