Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

8 votes

3 answers

15162 views

extract lines that match a list of words in another file

I have file 1, which has these lines:

    ATM 1434.972183
    BMPR2 10762.78192
    BMPR2 10762.78192
    BMPR2 1469.14535
    BMPR2 1469.14535
    BMPR2 1738.479639
    BMS1 4907.841667
    BMS1 4907.841667
    BMS1 880.4532628
    BMS1 880.4532628
    BMS1P17 1249.75
    BMS1P17 1249.75
    BMS1P17 1606.821429
    BMS1P17 1606.821429
    BMS1P17 1666.333333
    BMS1P17 1666.333333
    BMS1P17 2108.460317
    BMS1P17 2108

And file 2 has a list of words:

    ATM
    BMS1

So, the output will be like this:

    ATM 1434.972183
    BMS1 4907.841667
    BMS1 4907.841667
    BMS1 880.4532628
    BMS1 880.4532628

I know it's really a duplicate question, but I have tried all kinds of grep, sed, and awk commands. They may work for you on this tiny example, but I have a very large file (> 1M lines) and none of the previous approaches helped.

They return only part of the lines containing those words, although there are other words in file 2 that match lines in file 1.
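
A minimal sketch of one possible approach, assuming each word must equal the first whitespace-separated field exactly (so `BMS1` does not pick up `BMS1P17`). It also strips trailing carriage returns, on the guess that Windows line endings are why earlier attempts matched only some lines. The function name `filter_by_first_field` is made up for illustration.

```shell
# Hypothetical helper: print the lines of the data file whose first field
# appears in the word list. Usage: filter_by_first_field file2 file1
filter_by_first_field() {
  awk '
    { sub(/\r$/, "") }                     # drop a trailing carriage return, if any
    NR == FNR { if (NF) want[$1]; next }   # first file: remember each word
    $1 in want                             # second file: print exact matches
  ' "$1" "$2"
}
```

The word list is held in memory and the large file is read once, so the size of file 1 should not matter much.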


                                

LamaMo (223 rep)

Jul 25, 2018, 06:59 PM • Last activity: Aug 6, 2025, 07:52 PM

2 votes

2 answers

1073 views

Converting column data to matrix

file-format bioinformatics

I am trying to create a matrix of plant traits and plant species. There are 2,912,746 rows in the data and 3 columns. There are different numbers of traits for each species, and not every species has every trait. The data format is tab delimited.

Current format--

      Species   Trait      Value
      Species_1 SLA        4
      Species_1 Photopath  C3
      Species_1 Mycorrhiza AMF
      Species_2 SLA        3 
      Species_2 Growth     10


Desired format--
 
              SLA Photopath Mycorrhiza Growth
    Species_1 4   C3        AMF
    Species_2 3                        10

Any help with this would be OH SO appreciated. It has been a quite the challenge, and I'm not sure where to begin.

Thank you!!!!

~Mark Anthony
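
A rough sketch of a long-to-wide pivot in awk, assuming the input really is tab-delimited with one header row and that all values fit in memory. The function name `long_to_wide` is hypothetical, and the first header cell is written as `Species` rather than left blank.

```shell
# Hypothetical helper: pivot "Species<TAB>Trait<TAB>Value" rows into a matrix.
long_to_wide() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 { next }                                   # skip the header row
    {
      if (!($1 in sp_seen)) { sp_seen[$1]; species[++ns] = $1 }
      if (!($2 in tr_seen)) { tr_seen[$2]; traits[++nt] = $2 }
      val[$1, $2] = $3                                 # one cell of the matrix
    }
    END {
      line = "Species"
      for (j = 1; j <= nt; j++) line = line OFS traits[j]
      print line
      for (i = 1; i <= ns; i++) {
        line = species[i]
        for (j = 1; j <= nt; j++) line = line OFS val[species[i], traits[j]]
        print line                                     # missing traits stay empty
      }
    }
  ' "$@"
}
```

Species and traits appear in the order they are first seen in the input.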
                                

Mark Anthony (21 rep)

Feb 7, 2016, 06:51 PM • Last activity: Apr 20, 2025, 03:17 PM

1 votes

3 answers

84 views

edit all the values in a specific column based on row numbers range

text-processing awk bioinformatics

I have a PDB file (coordinates of atoms in a protein) on a Linux machine:

    ATOM      1   N  GLY A   1       0.535  51.766   5.682  1.00  0.00              
    ATOM      2  CA  GLY A   1      -0.712  50.962   5.596  1.00  0.00              
    ATOM      3   C  GLY A   1      -1.243  50.872   4.179  1.00  0.00              
    ATOM      4   O  GLY A   1      -1.313  51.888   3.492  1.00  0.00              
    ATOM      5   N  GLN A   2      -1.600  49.664   3.737  1.00  0.00              
    ATOM      6  CA  GLN A   2      -2.221  49.468   2.423  1.00  0.00              
    ATOM      7   C  GLN A   2      -3.542  48.719   2.507  1.00  0.00              
    ATOM      8   O  GLN A   2      -3.722  47.844   3.356  1.00  0.00              
    ATOM      9  CB  GLN A   2      -1.280  48.738   1.468  1.00  0.00              
    ATOM     10  CG  GLN A   2      -0.976  47.294   1.830  1.00  0.00              
    ....     ..  ..   .. .   .       ....   ....     ....   ....  ....
    TER   SPLIT LINE FOR INTERNAL USE ONLY
    ATOM      1  O5'  G  A   1     -44.412  97.503  31.177  1.00  0.00              
    ATOM      2  C5'  G  A   1     -45.447  96.803  31.882  1.00  0.00              
    ATOM      3  C4'  G  A   1     -45.225  95.295  31.894  1.00  0.00              
    ATOM      4  O4'  G  A   1     -46.441  94.578  31.654  1.00  0.00              
    ATOM      5  C3'  G  A   1     -44.328  94.850  30.748  1.00  0.00              
    ATOM      6  O3'  G  A   1     -42.943  94.877  31.129  1.00  0.00              
    ATOM      7  C2'  G  A   1     -44.804  93.425  30.542  1.00  0.00              
    ATOM      8  O2'  G  A   1     -44.163  92.592  31.466  1.00  0.00              
    ATOM      9  C1'  G  A   1     -46.304  93.444  30.772  1.00  0.00              
    ATOM     10  N9   G  A   1     -46.965  93.699  29.495  1.00  0.00
    ....     ..  ..   .  .   .     .......  ......   .....  ....   ...

The TER record explicitly marks the end of a particular amino acid chain.
I want to use awk to change the chain ID in the 5th column, so that the chain following each TER record gets the correct new ID.

Expected Output:

    ATOM      1   N  GLY A   1       0.535  51.766   5.682  1.00  0.00              
    ATOM      2  CA  GLY A   1      -0.712  50.962   5.596  1.00  0.00              
    ATOM      3   C  GLY A   1      -1.243  50.872   4.179  1.00  0.00              
    ATOM      4   O  GLY A   1      -1.313  51.888   3.492  1.00  0.00              
    ATOM      5   N  GLN A   2      -1.600  49.664   3.737  1.00  0.00              
    ATOM      6  CA  GLN A   2      -2.221  49.468   2.423  1.00  0.00              
    ATOM      7   C  GLN A   2      -3.542  48.719   2.507  1.00  0.00              
    ATOM      8   O  GLN A   2      -3.722  47.844   3.356  1.00  0.00              
    ATOM      9  CB  GLN A   2      -1.280  48.738   1.468  1.00  0.00              
    ATOM     10  CG  GLN A   2      -0.976  47.294   1.830  1.00  0.00                 
    TER   SPLIT LINE FOR INTERNAL USE ONLY
    ATOM      1  O5'  G  B   1     -44.412  97.503  31.177  1.00  0.00              
    ATOM      2  C5'  G  B   1     -45.447  96.803  31.882  1.00  0.00              
    ATOM      3  C4'  G  B   1     -45.225  95.295  31.894  1.00  0.00              
    ATOM      4  O4'  G  B   1     -46.441  94.578  31.654  1.00  0.00              
    ATOM      5  C3'  G  B   1     -44.328  94.850  30.748  1.00  0.00              
    ATOM      6  O3'  G  B   1     -42.943  94.877  31.129  1.00  0.00              
    ATOM      7  C2'  G  B   1     -44.804  93.425  30.542  1.00  0.00              
    ATOM      8  O2'  G  B   1     -44.163  92.592  31.466  1.00  0.00              
    ATOM      9  C1'  G  B   1     -46.304  93.444  30.772  1.00  0.00              
    ATOM     10  N9   G  B   1     -46.965  93.699  29.495  1.00  0.00  

   
The original spacing must be preserved exactly; the following arrangement would be wrong:

    ATOM   3674  CD1 PHE A 460       2.350  79.471  35.466  1.00  0.00              
    ATOM   3675  CD2 PHE A 460       1.037  81.443  35.196  1.00  0.00              
    ATOM   3676  CE1 PHE A 460       2.425  79.321  34.080  1.00  0.00              
    ATOM   3677  CE2 PHE A 460       1.108  81.298  33.805  1.00  0.00              
    ATOM   3678  CZ  PHE A 460       1.805  80.232  33.250  1.00  0.00              
    TER SPLIT LINE FOR B USE ONLY
    ATOM 1 O5' G B 1 -44.412 97.503 31.177 1.00 0.00
    ATOM 2 C5' G B 1 -45.447 96.803 31.882 1.00 0.00
    ATOM 3 C4' G B 1 -45.225 95.295 31.894 1.00 0.00
    ATOM 4 O4' G B 1 -46.441 94.578 31.654 1.00 0.00
    ATOM 5 C3' G B 1 -44.328 94.850 30.748 1.00 0.00


In addition, the file ends with this:

    TER
    ENDMDL

There is a blank line at the end of the file, which needs to be left as it is.
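
One way this could be sketched, assuming the chain ID always sits in character column 22 (as it does in the sample lines) and that there are at most 26 chains. Editing by character position rather than by field leaves the spacing untouched. The function name `relabel_chains` is made up for illustration.

```shell
# Hypothetical helper: give each chain the next letter after every TER record.
relabel_chains() {
  awk '
    BEGIN { ids = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; n = 1 }
    /^TER/ { n++; print; next }        # a TER record ends the current chain
    /^(ATOM|HETATM)/ {
      # overwrite only character 22; everything else is copied verbatim
      $0 = substr($0, 1, 21) substr(ids, n, 1) substr($0, 23)
    }
    { print }                          # other lines, blank ones included, pass through
  ' "$@"
}
```

Usage would be along the lines of `relabel_chains in.pdb > out.pdb`, where both file names are placeholders.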
                                

Paolo Lorenzini (444 rep)

Apr 17, 2025, 04:09 PM • Last activity: Apr 18, 2025, 08:22 AM

3 votes

1 answers

702 views

how to pass environment variables to singularity exec

bash environment-variables bioinformatics singularity

I have a Bash pipeline which at one point runs a Singularity container with *singularity exec* as follows:

    singularity exec --bind `pwd`:/folder --bind $d:/results .sif  -i /folder/.fastq -v /results//.vcf -r /folder/.fna -s  -j 24 -t 24 -o /results/

Since I'm running multiple experiments at once with an array, I'm redefining the experiments with an environment variable that I wish to add to the command; it works for all the steps of the pipeline, but Singularity seems unable to *see* the variable(s) I'm defining in my script. Does anyone have any advice on how to do so? I looked around a bit, but `--env` does not seem to be working. Thanks in advance!

Matteo (209 rep)

Apr 13, 2025, 12:51 PM • Last activity: Apr 13, 2025, 06:48 PM

1 votes

4 answers

178 views

Find lines in Vim that start one way and that don't end in another way

regular-expression vim bioinformatics

I'm trying to use Vim to find, via `/`, lines that start and end in specific ways. In particular, I'd be looking for lines that start *with* the character `>` and *without* the string `RNA` at the very end. For example, I would want to find this line

    >NM_001010867.4 Homo sapiens iron-sulfur cluster assembly factor IBA57 (IBA57),transcript variant 1, mRNA; nuclear gene for mitochondrial product

in a search, but not find this line

    >NR_107042.1 Homo sapiens microRNA 8075 (MIR8075), microRNA

I've looked hard for a solution but haven't been able to find one. Any help would be greatly appreciated.

Mark Pauley (61 rep)

Dec 14, 2024, 11:58 PM • Last activity: Feb 28, 2025, 09:56 PM

2 votes

4 answers

2061 views

Removing special characters from a fasta file

text-processing awk sed bioinformatics

I recently linearized a fasta file using awk. The output is perfect. However, there is a caret (^) in my sequence. I want to remove this caret. Below is my attempt; any assistance is highly appreciated.

>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP^MMEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ^MFFETRPEDLNPPKEEHIGKKKSGNDPTSVDPM
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP^MLRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE^MDFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEK
>P3
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS^MNAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKAKELGIEICLMYVEIE^MKGESVQEELLKGLDNKNPKIIVACIETLRKALS

I tried using:

$ sed '/s: ^// seq2.fa>seq3.fa

The code above is giving me an error of

:e expression #1,char7: unkown command: '/'

Any assistance is appreciated, thanks.
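
For what it's worth, the attempt seems to fail because the script begins with `/` rather than `s`: sed reads `/s: ^/` as an address and then finds another `/` where it expects a command (and the closing quote is missing). A small sketch of one alternative, assuming carets should be removed from sequence lines only and header lines left alone; `strip_carets` is a hypothetical name.

```shell
# Hypothetical helper: delete every literal caret on non-header lines.
strip_carets() {
  sed '/^>/! s/\^//g' "$@"
}
```

With the file names from the question this would be `strip_carets seq2.fa > seq3.fa`.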

thole (33 rep)

Dec 29, 2022, 02:41 AM • Last activity: Feb 27, 2025, 07:33 PM

4 votes

3 answers

230 views

Add columns from variable number of files to base file

shell-script text-processing awk columns bioinformatics

I'm dealing with a series of bed files, which look like this:

chr1 100 110 0.5
chr1 150 175 0.2
chr1 200 300 1.5

With the columns being chromosome, start, end, score. I have multiple different files with different scores in each one, and I'd like to combine them like this:

> cat a.bed
chr1 100 110 0.5
chr1 150 175 0.2
chr1 200 300 1.5

> cat b.bed
chr1 100 110 0.4
chr1 150 175 0.7
chr1 200 300 0.9

> cat c.bed
chr1 100 110 1.5
chr1 150 175 1.2
chr1 200 300 0.1

> cat combined.bed
chr1 100 110 0.5 0.4 1.5
chr1 150 175 0.2 0.7 1.2
chr1 200 300 1.5 0.9 0.1

All the score columns (last column of the file) are added to a single file. I found [this answer](https://unix.stackexchange.com/a/167290/387150), which can combine a column from one additional file into an existing file, but I would like a command which can add a *variable* number of columns together. So if I have 10 bed files to combine, I'd like a command that can process them all together and create a single file with 10 score columns.

Each file should have the same number of lines, and each entry should have the same coordinates in all the files, so there should be no conflicts there. However, there can be a lot of entries in each of the files (generally 100K or more), so I'd like to avoid processing each one multiple times. Is there a way to handle this cleanly? This will be in a script, so it does not need to be a one-liner.
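
A sketch of one way to do this in a single awk pass, assuming (as stated above) that every file has the same lines in the same order, and that holding one file's worth of rows in memory is acceptable. The function name `combine_scores` is hypothetical.

```shell
# Hypothetical helper: keep the first file whole, then append the last
# column of every further file. Each file is read exactly once.
combine_scores() {
  awk '
    FNR == 1 { nfile++ }                          # a new input file starts
    nfile == 1 { row[FNR] = $0; n = FNR; next }   # base file: keep full lines
    { row[FNR] = row[FNR] " " $NF }               # other files: append the score
    END { for (i = 1; i <= n; i++) print row[i] }
  ' "$@"
}
```

It takes any number of files, for example `combine_scores a.bed b.bed c.bed > combined.bed`.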

Whitehot (245 rep)

Feb 24, 2025, 03:00 PM • Last activity: Feb 25, 2025, 10:43 PM

1 votes

1 answers

917 views

sorting a file on column 1 lexically, then numerically on column 2 and 3

sort bioinformatics

I want to sort my file shown below:

    chr17	84938	85187	1	100	1
    chr12	86723	87265	2	100	1
    chr12	87368	87556	11	100	1
    chr12	87704	87880	10	100	1
    chr12	88018	88256	3	75	1
    chr12	88018	88569	1	25	1
    chr17	88171	69528	1	100	2
    chr12	88393	88569	6	100	1
    chr12	88750	88859	3	100	1
    chr12	88772	88859	3	100	1
    chr12	89019	89674	7	100	1
    chr12	89828	90586	1	100	1
    chr12	90656	90795	3	100	1
    chr17	93459	92763	1	100	2
    chr17	96901	69528	4	100	2
    chr17	100273	99697	1	100	2
    chr16	101557	97558	13	100	2
    chr16	103475	101646	8	100	2
    chr16	104059	105458	18	100	1
    chr16	105550	105776	19	100	1
    chr16	105883	106538	17	100	1
    chr16	106614	107085	20	100	1
    chr18	107887	109384	1	100	1
    chr16	108971	108759	2	100	2

First I want to sort on column 1, then on column 2, and then on column 3 (all in ascending order).

I did that in Microsoft Excel and got this result:

    chr12	86723	87265	2	100	1
    chr12	87368	87556	11	100	1
    chr12	87704	87880	10	100	1
    chr12	88018	88256	3	75	1
    chr12	88018	88569	1	25	1
    chr12	88393	88569	6	100	1
    chr12	88750	88859	3	100	1
    chr12	88772	88859	3	100	1
    chr12	89019	89674	7	100	1
    chr12	89828	90586	1	100	1
    chr12	90656	90795	3	100	1
    chr16	101557	97558	13	100	2
    chr16	103475	101646	8	100	2
    chr16	104059	105458	18	100	1
    chr16	105550	105776	19	100	1
    chr16	105883	106538	17	100	1
    chr16	106614	107085	20	100	1
    chr16	108971	108759	2	100	2
    chr17	84938	85187	1	100	1
    chr17	88171	69528	1	100	2
    chr17	93459	92763	1	100	2
    chr17	96901	69528	4	100	2
    chr17	100273	99697	1	100	2
    chr18	107887	109384	1	100	1

On the Unix command line I used this command:

    sort -k 1,1 -nk2 -nk3 file.txt

It gave me:

    chr17	84938	85187	1	100	1
    chr12	86723	87265	2	100	1
    chr12	87368	87556	11	100	1
    chr12	87704	87880	10	100	1
    chr12	88018	88256	3	75	1
    chr12	88018	88569	1	25	1
    chr17	88171	69528	1	100	2
    chr12	88393	88569	6	100	1
    chr12	88750	88859	3	100	1
    chr12	88772	88859	3	100	1
    chr12	89019	89674	7	100	1
    chr12	89828	90586	1	100	1
    chr12	90656	90795	3	100	1
    chr17	93459	92763	1	100	2
    chr17	96901	69528	4	100	2
    chr17	100273	99697	1	100	2
    chr16	101557	97558	13	100	2
    chr16	103475	101646	8	100	2
    chr16	104059	105458	18	100	1
    chr16	105550	105776	19	100	1
    chr16	105883	106538	17	100	1
    chr16	106614	107085	20	100	1
    chr18	107887	109384	1	100	1
    chr16	108971	108759	2	100	2

What can I do here to get output similar to Excel?
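
As far as I can tell, in `sort -k 1,1 -nk2 -nk3` the `-n` is parsed as a global option, so the first key is compared numerically as well; `chr12` and `chr17` both evaluate to 0 and the order falls through to column 2. A sketch that attaches the numeric modifier to keys 2 and 3 only, assuming tab-separated input; `sort_bed` is a hypothetical name.

```shell
# Hypothetical helper: column 1 as text, columns 2 and 3 as numbers.
sort_bed() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n "$@"
}
```

Each `-k` is limited to a single field (`2,2` rather than `2`), so a key never runs on to the end of the line.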
                                

user3138373 (2589 rep)

Sep 8, 2016, 10:27 PM • Last activity: Nov 14, 2024, 08:33 AM

6 votes

3 answers

954 views

bash script quoting frustration

shell-script quoting bioinformatics

This problem is driving me crazy. From the command prompt I can enter this command and it works as expected (records where the INFO/RegionType tag contains the value Core are emitted in the output file):

bcftools filter -i "INFO/RegionType='Core'" -Oz -o temp.core.vcf.gz temp2.vcf.gz

But when I try to put the command in a bash script, things go awry. This seems to be due to the single and double quotes. I tried various permutations of escaping quotes with no success. Finally, I tried storing the filter string in a variable and referencing the variable. From my script:

filter_string=INFO/RegionType="'Core'"
bcftools filter -i \""$filter_string"\" -Oz -o temp.core.vcf.gz temp2.vcf.gz

This completes without an error, but the command is not interpreting the filter correctly, as evidenced by the fact that no records are included in the output file, whereas the command given directly from the command prompt yields 3124 records in the output file. What might be going on here?

A minimal example of the input file that works with bcftools and reproduces the issue (note that all whitespace in the lines starting with chr1 needs to be single tabs, not spaces):

##fileformat=VCFv4.2
##FILTER=
##INFO=
##INFO=
##FORMAT=
##contig=
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1
chr1	103385791	.	C	G	1182.77	PASS	AC=1;	GT	0/1
chr1	103471456	.	CCAT	C	2848.73	PASS	AC=2;RegionType=Core	GT	1/1
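
A small demonstration of what the two quoting styles actually hand to a program. `show_args` is a made-up stand-in for bcftools that only prints its arguments; the guess here is that the escaped `\"` characters reach bcftools as literal double quotes and become part of the filter expression.

```shell
# Hypothetical stand-in for bcftools: print each argument in brackets,
# so the exact strings the program would receive are visible.
show_args() { printf '[%s]\n' "$@"; }

filter_string="INFO/RegionType='Core'"

show_args "$filter_string"       # prints [INFO/RegionType='Core']
show_args \""$filter_string"\"   # prints ["INFO/RegionType='Core'"]
```

If that is indeed the cause, the script line would simply be `bcftools filter -i "$filter_string" -Oz -o temp.core.vcf.gz temp2.vcf.gz`, with no escaped quotes around the variable.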

mcrepeau (69 rep)

Aug 21, 2024, 08:17 PM • Last activity: Aug 23, 2024, 05:17 AM

2 votes

5 answers

352 views

Grouping rows by categories avoiding repetition

text-processing bioinformatics

I have a tab-separated file with two columns on a Linux machine. The first column contains names; the second column contains GO IDs (these are always of the format `GO:` followed by seven digits) separated by commas. What I need is one row per name, with each GO ID listed only once, discarding repeated IDs and repeated rows. From this

Pr_g33687.t1    GO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625
Pr_g33687.t1    GO:0003735,GO:0009129,GO:0006412
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565

Into this

Pr_g33687.t1    GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625,GO:0009129
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605,GO:0006355,GO:0043565

I would appreciate your help. Thank you. RS
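
A possible awk sketch, assuming the columns really are separated by a tab and that names and IDs should stay in first-seen order. The function name `merge_go_ids` is hypothetical.

```shell
# Hypothetical helper: one output row per name, each GO ID listed once.
merge_go_ids() {
  awk -F'\t' -v OFS='\t' '
    {
      if (!known[$1]++) order[++n] = $1      # remember first-seen order of names
      m = split($2, go, ",")
      for (i = 1; i <= m; i++)
        if (!seen[$1, go[i]]++)              # keep each ID once per name
          ids[$1] = (ids[$1] == "" ? go[i] : ids[$1] "," go[i])
    }
    END { for (i = 1; i <= n; i++) print order[i], ids[order[i]] }
  ' "$@"
}
```

Rows for the same name do not have to be adjacent in the input.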

rseg (35 rep)

Jun 26, 2024, 02:04 PM • Last activity: Jul 7, 2024, 04:03 AM

3 votes

3 answers

4334 views

Removing number after decimal point

text-processing awk bioinformatics

I have an input file with these fields:

    ENST00000456328.2	1657	1350.015	0	0

I am trying to use awk to remove the number after the decimal point and print the rest as it is:

    awk -F[.] '{print $1"\t"$2"\t"$3}{next;}'

But it doesn't work, as it gives an output like this:

    ENST00000456328	2	1657	1350	015	0	0

Can someone help?

Regards.
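
With `-F[.]` every dot becomes a field separator, which is why the pieces after each dot show up as extra columns. A sketch that keeps tab as the separator and trims a trailing `.digits` from each field instead, assuming that is the intended result for both the ID and the numeric columns; `drop_decimals` is a hypothetical name.

```shell
# Hypothetical helper: remove a trailing ".<digits>" from every tab-separated field.
drop_decimals() {
  awk -F'\t' -v OFS='\t' '
    { for (i = 1; i <= NF; i++) sub(/\.[0-9]+$/, "", $i); print }
  ' "$@"
}
```

Note that this truncates rather than rounds.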

user1738234 (87 rep)

Jan 3, 2020, 06:29 PM • Last activity: Apr 21, 2024, 12:39 AM

2 votes

5 answers

1439 views

How to fetch fasta sequences corresponding to header lines in another file

text-processing bioinformatics

I have a file of header lines (`file1`) and another file of sequences in fasta format (`file2`). I want to extract the fasta sequences from `file2` whose header line is present in `file1`. Example:

* `file1`:

    >sp|B7UM99|TIR_ECO27
    >sp|P06616|ERA_ECOLI

* `file2`:

    >sp|B7UM99|TIR_ECO27
    MPIGNLGNNVNGNHLIPPAPPLPSQTDGAA
    RGGTGHLISSTGALGSRSLFSPLRNSMADS
    VDSRDIPGLPTNPSRLAAATSETCLLGGFE
    VLHDKGPLDILNTQIGPSAFRVEVQADGTH
    ......
    >sp|P06616|ERA_ECOLI
    MSIDKSYCGFIAIVGRPNVGKSTLLNKLL
    GQKISITSRKAQTTRHRIVGIHTEGAYQAIY
    VDTPGLHMEEKRAINRLMNKAASSSIGDVE
    LVIFVVEGTRWTPDDEMVLNKLREGKAPVI
    ............
    >sp|P0AD68|HUMAN
    MKAAAKTQKPKRQEEHANFISWRFALLCGC
    ILLALAFLLGRVAWLQVISPDMLVKEGDMR
    SLRVQQVSTSRGMITDRSGRPLAVSVPVKA
    IWADPKEVHDAGGISVGDRWKALANALNIP
    .............

* Desired output:

    >sp|B7UM99|TIR_ECO27
    MPIGNLGNNVNGNHLIPPAPPLPSQTDGAA
    RGGTGHLISSTGALGSRSLFSPLRNSMADS
    VDSRDIPGLPTNPSRLAAATSETCLLGGFE
    VLHDKGPLDILNTQIGPSAFRVEVQADGTH
    ......
    >sp|P06616|ERA_ECOLI
    MSIDKSYCGFIAIVGRPNVGKSTLLNKLL
    GQKISITSRKAQTTRHRIVGIHTEGAYQAIY
    VDTPGLHMEEKRAINRLMNKAASSSIGDVE
    LVIFVVEGTRWTPDDEMVLNKLREGKAPVI
    ............
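
A short awk sketch of one way to do this, assuming a header in `file2` must match a line of `file1` exactly and in full. The function name `fetch_fasta` is made up for illustration.

```shell
# Hypothetical helper. Usage: fetch_fasta file1 file2
fetch_fasta() {
  awk '
    NR == FNR { want[$0]; next }       # first file: wanted header lines
    /^>/ { keep = ($0 in want) }       # each header switches printing on or off
    keep                               # print headers and sequence lines while on
  ' "$1" "$2"
}
```

Multi-line sequences are handled because the flag stays set until the next header line.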

Manoj Kumar (41 rep)

May 9, 2019, 08:25 PM • Last activity: Apr 14, 2024, 06:48 AM

1 votes

5 answers

132 views

sed command to replace a word within a line following a pattern

linux sed regular-expression bioinformatics

I'm working with a file that looks like the following, containing over 50,000 lines of gene IDs followed by their sequence:

    gene_A:3342234 CTCTTTCTTTTACGCCT
    gene_A:1244-5205 CTCTTTCTTTTACGCCT
    gene_A:1838438 CTCTTTCTTTTACGCCT
    gene_B:1848584 CTCTTTCTTTTACGCCT
    gene_B:1029-4920 CTCTTTCTTTTACGCCT
    gene_C:3849029 CTCTTTCTTTTACGCCT

They all have the gene ID, followed by a colon, and then a reference number of 7-9 digits (some include dashes).

I want to replace the gene IDs with their actual names, for example geneA and geneB, whilst keeping the information that follows them. Desired output:

    geneA CTCTTTCTTTTACGCCT
    geneA CTCTTTCTTTTACGCCT
    geneA CTCTTTCTTTTACGCCT
    geneB CTCTTTCTTTTACGCCT
    geneB CTCTTTCTTTTACGCCT
    geneB CTCTTTCTTTTACGCCT


This is my first time using sed, so I'm really not quite sure where to even start. I know how to replace all lines containing gene_A with 's/gene_A.*/geneA/' but I'm not sure how to preserve the information following the gene IDs.
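
A minimal sed sketch, assuming the real name can be derived from the ID simply by dropping the underscore (`gene_A` becomes `geneA`); if the names differ from the IDs, a lookup table would be needed instead. The capture group keeps the letter, and the pattern stops at the space, so the sequence is left alone. `rename_genes` is a hypothetical name.

```shell
# Hypothetical helper: turn "gene_X:<digits and dashes>" into "geneX".
rename_genes() {
  sed -E 's/^gene_([A-Za-z0-9]+):[0-9-]+/gene\1/' "$@"
}
```

The difference from `s/gene_A.*/geneA/` is that `.*` swallows the rest of the line, while `[0-9-]+` only covers the reference number.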
                                

bryophyta (11 rep)

Feb 3, 2024, 09:34 PM • Last activity: Apr 4, 2024, 03:49 PM

1 votes

1 answers

173 views

Histogram with FASTA file

text-processing bioinformatics plotting

I am new to Linux.
I have a FASTA file that looks something like this:
~~~
>scaffold1
AAGACATAATATTTTGGAGGAATTAAAAAATTTAAGATGTATTTTATTATACATGTATTTTATTTATAACATAAATAAACATCCCAAGGAAAAGCAGTAGCT
>scaffold2
AAGGATAAGTGTAGCTGAGGAATTAAAAAATTTAACAAATAAAATTAATGCAATTTATTTTTTCAAATAAAAATACACGGAGAAAAATAATTTGTAAATTTT
~~~
etc. This goes on to about 5000+ scaffolds.

I want to make a histogram with the scaffold lengths.   
I read about Biopython etc. but I don't know anything about installing these programs. Is there a way to get a histogram with just the Linux commands (terminal) or with R?
Thank you
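
A rough terminal-only sketch: one function computes the length of each scaffold, a second one counts how many lengths fall into bins of a chosen width. Both function names and the bin width are illustrative assumptions; the resulting table could also be fed to R or a plotting tool.

```shell
# Hypothetical helper: print "<scaffold name> <length>" for each record.
scaffold_lengths() {
  awk '
    /^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
    { len += length($0) }              # sequences may span several lines
    END { if (name != "") print name, len }
  ' "$@"
}

# Hypothetical helper: read "name length" lines on stdin and print
# "<bin start> <count>" for bins of width $1.
length_histogram() {
  awk -v w="$1" '
    { count[int($2 / w) * w]++ }
    END { for (b in count) print b, count[b] }
  ' | sort -n
}
```

For example, `scaffold_lengths scaffolds.fa | length_histogram 100` would count scaffolds per 100 bp bin, where `scaffolds.fa` is a placeholder file name.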

                                

Max Mustermann (21 rep)

Jul 18, 2019, 11:59 AM • Last activity: Apr 4, 2024, 11:16 AM

1 votes

4 answers

381 views

replace header in a file with list of lines in another file

text-processing bioinformatics

I have a fasta file containing ~28000 sequences. I want to replace the headers of these sequences with the lines listed in another file.
Example:

File 1:

    sp|B7UM99|TIR_ECO27
    MPIGNLGNNVNGNHLIPPAPP.....
    sp|P0ACF8|HNS_ECOLI
    MSEALKILNNIRTLRAQ........
    sp|P24232|HMP_ECOLI
    MLDAQTIATVKATIPLLVET..........

File 2:

    sp|B7UM99|TIR_ECO27OS=Escherichia coli
    sp|P0ACF8|HNS_ECOLI=Human
    sp|P24232|HMP_ECOLI=Flavohemoprotein

Desired Output:

    sp|B7UM99|TIR_ECO27OS=Escherichia coli
    MPIGNLGNNVNGNHLIPPAPP.....
    sp|P0ACF8|HNS_ECOLI=Human
    MSEALKILNNIRTLRAQ........
    sp|P24232|HMP_ECOLI=Flavohemoprotein
    MLDAQTIATVKATIPLLVET..........


                                

Manoj Kumar (41 rep)

May 9, 2019, 03:58 PM • Last activity: Apr 2, 2024, 11:08 AM

5 votes

6 answers

304 views

subset columns from the 1st file using column names in 2nd file

text-processing bioinformatics filter

I have two text files. The 1st file is a tab-delimited file which looks like this:

    chrom	pos	ref	alt	a1	a2	a3	a4
    10	12345	C	T	aa	bb	cc	dd
    10	12345	C	T	aa	bb	cc	dd
    10	12345	C	T	aa	bb	cc	dd
    10	12345	C	T	aa	bb	cc	dd
    10	12345	C	T	aa	bb	cc	dd
    10	12345	C	T	aa	bb	cc	dd

2nd file looks like this:

    a1
    a4

I want to extract those columns in the 1st file which are present in the 2nd file along with first 4 columns of the first file. So in the above case, the output will look like this:

    chrom	pos	ref	alt	a1	a4
    10	12345	C	T	aa	dd
    10	12345	C	T	aa	dd
    10	12345	C	T	aa	dd
    10	12345	C	T	aa	dd
    10	12345	C	T	aa	dd
    10	12345	C	T	aa	dd

I want to do this in shell. How can I do this?
I have a bigger file than shown here, so I have many columns in 1st file

    cut -f 1-4,$(grep -Fwf file2.txt <(head -1 file1.txt)) file1.txt
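
A possible awk sketch, assuming the first line of the 1st file is the header, names in the 2nd file match header cells exactly, and the first four columns are always kept. Column positions are worked out once from the header and reused for every row. `subset_columns` is a hypothetical name.

```shell
# Hypothetical helper. Usage: subset_columns file2.txt file1.txt
subset_columns() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { want[$1]; next }       # names file: wanted column names
    FNR == 1 {                         # header row: decide which columns to keep
      for (i = 1; i <= NF; i++)
        if (i <= 4 || ($i in want)) keep[++k] = i
    }
    {
      line = $(keep[1])
      for (j = 2; j <= k; j++) line = line OFS $(keep[j])
      print line
    }
  ' "$1" "$2"
}
```

Kept columns come out in the order they have in the 1st file, not the order listed in the 2nd.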
                                

user3138373 (2589 rep)

Mar 22, 2024, 04:01 AM • Last activity: Mar 24, 2024, 03:25 AM

0 votes

3 answers

823 views

how to print file name and total number of fasta sequences?

shell-script python bioinformatics

I have fasta files named test.fasta, pas.fasta, and cel.fasta, as shown below:

    test.fasta
    >tile
    ATGTC
    >259
    TGAT

    pas.fasta
    >ta
    ATGCT

    cel.fasta
    >787
    TGTAG
    >yog
    TGTAT
    >In
    NNTAG

I need to print the file name and the total number of fasta sequences as shown below,

    test,2
    pas,1
    cel,3

I tried the following command, but it did not serve my purpose:

    grep ">" test.fasta | wc -l && ls test.fasta

Please help me to do the same.

Thanks in advance.
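
A small shell sketch, assuming every sequence starts with a `>` header line and the files end in `.fasta`. `grep -c` does the counting, so `wc` is not needed. The function name `count_fasta` is made up for illustration.

```shell
# Hypothetical helper: print "<name without .fasta>,<number of headers>" per file.
count_fasta() {
  for f in "$@"; do
    n=$(grep -c '^>' "$f")             # count lines beginning with ">"
    printf '%s,%s\n' "$(basename "$f" .fasta)" "$n"
  done
}
```

It could be called as `count_fasta test.fasta pas.fasta cel.fasta` or with `*.fasta`.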
                                

Kumar (129 rep)

Sep 5, 2021, 01:44 PM • Last activity: Mar 16, 2024, 09:12 AM

2 votes

5 answers

1498 views

Replace new lines with spaces using awk

text-processing awk bioinformatics

I have a text file that I generated of all files in a directory. I'd like to use this file as input into a script that I have, but I need the text file to be formatted in a particular way to be parsed correctly. Currently the text file, which is a list of file names, is formatted like so:

A1_R1.fastq.gz
A1_R2.fastq.gz
A2_R1.fastq.gz
A2_R2.fastq.gz
A3_R1.fastq.gz
A3_R2.fastq.gz

I need the paired reads (files with the same name, but different RN values) for each sample to be on the same line, separated by a tab:

A1_R1.fastq.gz A1_R2.fastq.gz
A2_R1.fastq.gz A2_R2.fastq.gz
A3_R1.fastq.gz A3_R2.fastq.gz

Since I have >1000 entries, I am hoping for a method using awk or something similar to modify the file, but I don't have much experience with awk.
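
If the list is sorted so that the two mates of each sample sit on consecutive lines (an assumption), `paste` may be enough without awk: it reads two lines at a time and joins them with a tab. `pair_reads` is a hypothetical name.

```shell
# Hypothetical helper: join every two consecutive lines with a tab.
pair_reads() {
  paste - - < "$1"
}
```

An awk equivalent would be `awk 'NR % 2 { printf "%s\t", $0; next } 1'`, under the same assumption about line order.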

lovelyrubbish (29 rep)

Feb 26, 2024, 01:43 AM • Last activity: Feb 28, 2024, 03:43 PM

0 votes

3 answers

8061 views

How to extract sequence lines from FASTQ file?

text-processing sed bioinformatics

I have a FASTQ-formatted Illumina sequence file like this:

    @ERR009148.2485 IL26_1382:7:1:224:616 length=36
    ATCACATGCTCCTTGTTCTGCAGCTTGGTGCGGATG
    +ERR009148.2485 IL26_1382:7:1:224:616 length=36
    >>>>>>>>>>>>>>>>>>>>>>5>>->>*
    @ERR009148.2486 IL26_1382:7:1:914:59 length=36
    AAAGAAGTAAAATAAGAAGGCAATGCTTGTGGAAGG
    +ERR009148.2486 IL26_1382:7:1:914:59 length=36
    .>>74::1>174151/7152313,3&003,00&2%2
    @ERR009148.2487 IL26_1382:7:1:251:589 length=36
    GCCATAAACACCCCAGCACCACATTCATCAGAAGGG
    +ERR009148.2487 IL26_1382:7:1:251:589 length=36
    >>>>>>>>>>>>>>>>>>>>>>8>>>>>>>>7
    @ERR009148.2488 IL26_1382:7:1:911:194 length=36
    ATTGAGGTGGAGTAGATTAGGCGTAGGTAGAAGTAG
    +ERR009148.2488 IL26_1382:7:1:911:194 length=36
    >>=>>>>>>>=;>7>==<<7;=67=/57/57

I need to extract only the raw sequences from each record. What sed command can be used for that?

Expected output:

    ATCACATGCTCCTTGTTCTGCAGCTTGGTGCGGATG
    AAAGAAGTAAAATAAGAAGGCAATGCTTGTGGAAGG
    GCCATAAACACCCCAGCACCACATTCATCAGAAGGG
    ATTGAGGTGGAGTAGATTAGGCGTAGGTAGAAGTAG
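
A sketch that assumes every record is exactly four lines, so the sequence is always line 2 of each group. It uses awk for portability; the function name `fastq_sequences` is hypothetical.

```shell
# Hypothetical helper: print the 2nd line of every 4-line FASTQ record.
# With GNU sed, the same selection can be written as: sed -n '2~4p'
fastq_sequences() {
  awk 'NR % 4 == 2' "$@"
}
```

Matching on line position avoids trouble with quality lines that happen to begin with `@` or `+`.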
                                

iiii (11 rep)

Oct 13, 2017, 05:40 AM • Last activity: Feb 21, 2024, 04:17 AM

1 votes

5 answers

98 views

How to count word from a column when consecutive cells are equal in a different column using shell script!

bash shell-script bioinformatics

I'm trying to count the number of `C_R` and `S_R` in column 9 when consecutive cells in column 2, column 3, and column 1 are the same. The file is in bed format (tab-separated format). The original file is huge and column 1 defines chromosome number. The first few lines of the file look like this,

    chr1	10200	10300	8	10000	10214	100	214	S_R
    chr1	10200	10300	8	10009	10233	100	224	S_R
    chr1	10200	10300	8	10014	10220	100	206	S_R
    chr1	10200	10300	8	10045	10215	100	170	S_R
    chr1	10200	10300	8	10068	10209	100	141	S_R
    chr1	10200	10300	8	10074	10300	100	226	C_R
    chr1	10200	10300	8	10182	10283	100	101	S_R
    chr1	10200	10300	8	10182	10387	100	205	C_R
    chr1	10300	10400	4	10182	10387	100	205	S_R
    chr1	10300	10400	4	10331	10467	100	136	S_R
    chr1	10300	10400	4	10346	10461	100	115	S_R
    chr1	10300	10400	4	10352	10468	100	116	S_R
    chr1	10400	10500	3	10331	10467	100	136	S_R
    chr1	10400	10500	3	10346	10461	100	115	S_R
    chr1	10400	10500	3	10352	10468	100	116	S_R
    chr1	11000	11100	2	11024	11163	100	139	S_R
    chr1	11000	11100	2	11024	11188	100	164	S_R
    chr1	11100	11200	3	11024	11163	100	139	S_R
    chr1	11100	11200	3	11024	11188	100	164	S_R
    chr1	11100	11200	3	11127	11296	100	169	S_R
    chr1	11200	11300	1	11127	11296	100	169	S_R
    chr1	11400	11500	2	11412	11561	100	149	S_R
    chr1	11400	11500	2	11457	11608	100	151	S_R
    chr1	11500	11600	3	11412	11561	100	149	S_R
    chr1	11500	11600	3	11457	11608	100	151	C_R
    chr1	11500	11600	3	11574	11744	100	170	S_R
    chr1	11600	11700	3	11457	11608	100	151	S_R
    chr1	11600	11700	3	11574	11744	100	170	C_R
    chr1	11600	11700	3	11640	11815	100	175	S_R
    chr1	11700	11800	4	11574	11744	100	170	S_R
    chr1	11700	11800	4	11640	11815	100	175	C_R
    chr1	11700	11800	4	11784	11963	100	179	S_R
    chr1	11700	11800	4	11791	11936	100	145	S_R

In the table above, columns 1, 2, and 3 are the same in the first 8 rows, so the tentative output file would look like:

    chr1    10200   10300   2   6
    chr1    10300   10400   0   4
    chr1    10400   10500   0   3
    chr1    11000   11100   0   2
    chr1    11100   11200   0   3
    chr1    11200   11300   0   1
    chr1    11400   11500   0   2
    chr1    11500   11600   1   2
    chr1    11600   11700   1   2
    chr1    11700   11800   1   3

In the output file, col 4 is the C_R count and col 5 is the S_R count.
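
A streaming awk sketch, assuming the file is tab-separated and that rows with the same first three columns are always consecutive, so only the current group has to be kept in memory. The function name `count_by_window` is made up for illustration.

```shell
# Hypothetical helper: per run of identical columns 1-3, count C_R and S_R in column 9.
count_by_window() {
  awk -F'\t' -v OFS='\t' '
    { key = $1 OFS $2 OFS $3 }
    NR > 1 && key != prev { print prev, c + 0, s + 0; c = s = 0 }   # group ended
    { prev = key }
    $9 == "C_R" { c++ }
    $9 == "S_R" { s++ }
    END { if (NR > 0) print prev, c + 0, s + 0 }                    # last group
  ' "$@"
}
```

Because nothing is stored beyond the running counts, the size of the original file should not be a problem.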
                                

Debajyoti Kabiraj (251 rep)

Oct 8, 2023, 06:03 AM • Last activity: Feb 10, 2024, 02:44 AM
