Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

8 votes
3 answers
15162 views
extract lines that match a list of words in another file
I have file 1, which has these lines:
ATM 1434.972183
BMPR2 10762.78192
BMPR2 10762.78192
BMPR2 1469.14535
BMPR2 1469.14535
BMPR2 1738.479639
BMS1 4907.841667
BMS1 4907.841667
BMS1 880.4532628
BMS1 880.4532628
BMS1P17 1249.75
BMS1P17 1249.75
BMS1P17 1606.821429
BMS1P17 1606.821429
BMS1P17 1666.333333
BMS1P17 1666.333333
BMS1P17 2108.460317
BMS1P17 2108
And file 2 has a list of words:
ATM
BMS1
So the output should look like this:
ATM 1434.972183
BMS1 4907.841667
BMS1 4907.841667
BMS1 880.4532628
BMS1 880.4532628
I know this is really a duplicate question, but I have tried all kinds of grep, sed and awk commands. They may work on this tiny example, but I have a very large file (> 1M lines), and none of the previous approaches helped: they return only part of the lines containing those words, even though other words in file 2 also match lines in file 1.
LamaMo (223 rep)
Jul 25, 2018, 06:59 PM • Last activity: Aug 6, 2025, 07:52 PM
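A minimal sketch of one possible approach, assuming the word to match is always the first whitespace-separated field and must match it exactly (so `BMS1` does not also select `BMS1P17`). The function name `match_first_field` is made up for illustration.

```shell
# Hypothetical helper: print the lines of the data file ($2) whose first
# field is exactly one of the words listed in the word file ($1).
match_first_field() {
    # While reading the first file (NR == FNR), store each word as a key.
    # For the second file, print a line only if its first field is a key.
    awk 'NR == FNR { want[$1]; next } $1 in want' "$1" "$2"
}
```

It could be called as `match_first_field file2 file1 > output`. Each file is read once, and only the word list is kept in memory, which should suit a file with more than a million lines.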
2 votes
2 answers
1073 views
Converting column data to matrix
I am trying to create a matrix of plant traits and plant species. There are 2,912,746 rows in the data and 3 columns. There are different numbers of traits for each species, and not every species has every trait. The data format is tab delimited.
Current format:
Species    Trait       Value
Species_1  SLA         4
Species_1  Photopath   C3
Species_1  Mycorrhiza  AMF
Species_2  SLA         3
Species_2  Growth      10
Desired format:
           SLA  Photopath  Mycorrhiza  Growth
Species_1  4    C3         AMF
Species_2  3                           10
Any help with this would be OH SO appreciated. It has been quite the challenge, and I'm not sure where to begin. Thank you!!!! ~Mark Anthony
Mark Anthony (21 rep)
Feb 7, 2016, 06:51 PM • Last activity: Apr 20, 2025, 03:17 PM
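A rough sketch of one way to pivot the data, assuming the input is tab-delimited with a single header line and that the whole table fits in memory. The function name `long_to_wide` and the `Species` label in the output header are illustrative choices, not part of the question. Traits a species lacks are left as empty cells.

```shell
# Hypothetical helper: turn "Species<TAB>Trait<TAB>Value" rows into one
# row per species with one column per trait.
long_to_wide() {
    awk -F'\t' -v OFS='\t' '
    NR == 1 { next }                                   # skip the input header line
    {
        if (!($1 in sp)) { sp[$1]; species[++ns] = $1 }   # first-seen order of species
        if (!($2 in tr)) { tr[$2]; traits[++nt] = $2 }    # first-seen order of traits
        value[$1, $2] = $3
    }
    END {
        line = "Species"
        for (j = 1; j <= nt; j++) line = line OFS traits[j]
        print line
        for (i = 1; i <= ns; i++) {
            line = species[i]
            for (j = 1; j <= nt; j++) line = line OFS value[species[i], traits[j]]
            print line
        }
    }
    ' "$@"
}
```

If the same species and trait pair appears more than once, this sketch keeps the last value seen.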
1 votes
3 answers
84 views
edit all the values in a specific column based on row numbers range
I have a PDB file (coordinates of atoms in a protein) on a Linux machine: ATOM 1 N GLY A 1 0.535 51.766 5.682 1.00 0.00 ATOM 2 CA GLY A 1 -0.712 50.962 5.596 1.00 0.00 ATOM 3 C GLY A 1 -1.243 50.872 4.179 1.00 0.00 ATOM 4 O GLY A 1 -1.313 51.888 3.492 1.00 0.00 ATOM 5 N GLN A 2 -1.600 49.664 3.737 1.00 0.00 ATOM 6 CA GLN A 2 -2.221 49.468 2.423 1.00 0.00 ATOM 7 C GLN A 2 -3.542 48.719 2.507 1.00 0.00 ATOM 8 O GLN A 2 -3.722 47.844 3.356 1.00 0.00 ATOM 9 CB GLN A 2 -1.280 48.738 1.468 1.00 0.00 ATOM 10 CG GLN A 2 -0.976 47.294 1.830 1.00 0.00 .... .. .. .. . . .... .... .... .... .... TER SPLIT LINE FOR INTERNAL USE ONLY ATOM 1 O5' G A 1 -44.412 97.503 31.177 1.00 0.00 ATOM 2 C5' G A 1 -45.447 96.803 31.882 1.00 0.00 ATOM 3 C4' G A 1 -45.225 95.295 31.894 1.00 0.00 ATOM 4 O4' G A 1 -46.441 94.578 31.654 1.00 0.00 ATOM 5 C3' G A 1 -44.328 94.850 30.748 1.00 0.00 ATOM 6 O3' G A 1 -42.943 94.877 31.129 1.00 0.00 ATOM 7 C2' G A 1 -44.804 93.425 30.542 1.00 0.00 ATOM 8 O2' G A 1 -44.163 92.592 31.466 1.00 0.00 ATOM 9 C1' G A 1 -46.304 93.444 30.772 1.00 0.00 ATOM 10 N9 G A 1 -46.965 93.699 29.495 1.00 0.00 .... .. .. . . . ....... ...... ..... .... ... The TER record explicitly marks the end of a particular amino acid chain. I want to change the chain ID of the protein at the 5th column by awk to assign the correct ID to the new chain after TER. 
Expected Output: ATOM 1 N GLY A 1 0.535 51.766 5.682 1.00 0.00 ATOM 2 CA GLY A 1 -0.712 50.962 5.596 1.00 0.00 ATOM 3 C GLY A 1 -1.243 50.872 4.179 1.00 0.00 ATOM 4 O GLY A 1 -1.313 51.888 3.492 1.00 0.00 ATOM 5 N GLN A 2 -1.600 49.664 3.737 1.00 0.00 ATOM 6 CA GLN A 2 -2.221 49.468 2.423 1.00 0.00 ATOM 7 C GLN A 2 -3.542 48.719 2.507 1.00 0.00 ATOM 8 O GLN A 2 -3.722 47.844 3.356 1.00 0.00 ATOM 9 CB GLN A 2 -1.280 48.738 1.468 1.00 0.00 ATOM 10 CG GLN A 2 -0.976 47.294 1.830 1.00 0.00 TER SPLIT LINE FOR INTERNAL USE ONLY ATOM 1 O5' G B 1 -44.412 97.503 31.177 1.00 0.00 ATOM 2 C5' G B 1 -45.447 96.803 31.882 1.00 0.00 ATOM 3 C4' G B 1 -45.225 95.295 31.894 1.00 0.00 ATOM 4 O4' G B 1 -46.441 94.578 31.654 1.00 0.00 ATOM 5 C3' G B 1 -44.328 94.850 30.748 1.00 0.00 ATOM 6 O3' G B 1 -42.943 94.877 31.129 1.00 0.00 ATOM 7 C2' G B 1 -44.804 93.425 30.542 1.00 0.00 ATOM 8 O2' G B 1 -44.163 92.592 31.466 1.00 0.00 ATOM 9 C1' G B 1 -46.304 93.444 30.772 1.00 0.00 ATOM 10 N9 G B 1 -46.965 93.699 29.495 1.00 0.00 Everything needs to be separated with the same spaces, this following arrangement would be wrong: ATOM 3674 CD1 PHE A 460 2.350 79.471 35.466 1.00 0.00 ATOM 3675 CD2 PHE A 460 1.037 81.443 35.196 1.00 0.00 ATOM 3676 CE1 PHE A 460 2.425 79.321 34.080 1.00 0.00 ATOM 3677 CE2 PHE A 460 1.108 81.298 33.805 1.00 0.00 ATOM 3678 CZ PHE A 460 1.805 80.232 33.250 1.00 0.00 TER SPLIT LINE FOR B USE ONLY ATOM 1 O5' G B 1 -44.412 97.503 31.177 1.00 0.00 ATOM 2 C5' G B 1 -45.447 96.803 31.882 1.00 0.00 ATOM 3 C4' G B 1 -45.225 95.295 31.894 1.00 0.00 ATOM 4 O4' G B 1 -46.441 94.578 31.654 1.00 0.00 ATOM 5 C3' G B 1 -44.328 94.850 30.748 1.00 0.00 In addition, the file ends with this: TER ENDMDL There is a blank line at the end of the file which needs to be left as it is
Paolo Lorenzini (444 rep)
Apr 17, 2025, 04:09 PM • Last activity: Apr 18, 2025, 08:22 AM
3 votes
1 answers
702 views
how to pass environment variables to singularity exec
I have a BASH pipeline which at one point runs a Singularity container with *singularity exec* as follows:
singularity exec --bind `pwd`:/folder --bind $d:/results .sif  -i /folder/.fastq -v /results//.vcf -r /folder/.fna -s  -j 24 -t 24 -o /results/
Since I'm running multiple experiments at once with an array, I'm redefining the experiments with an environment variable that I wish to add to the command; it works for all the steps of the pipeline, but Singularity seems unable to *see* the variable(s) I'm defining in my script. Does anyone have any advice on how to do so? I looked around a bit, but --env does not seem to be working. Thanks in advance!
Matteo (209 rep)
Apr 13, 2025, 12:51 PM • Last activity: Apr 13, 2025, 06:48 PM
1 votes
4 answers
178 views
Find lines in Vim that start one way and that don't end in another way
I'm trying to use Vim to find, via /, lines that start and end in specific ways. In particular, I'd be looking for lines that start *with* the character > and *without* the string RNA at the very end. For example, I would want to find this line
>NM_001010867.4 Homo sapiens iron-sulfur cluster assembly factor IBA57 (IBA57),transcript variant 1, mRNA; nuclear gene for mitochondrial product
in a search, but not find this line
>NR_107042.1 Homo sapiens microRNA 8075 (MIR8075), microRNA
I've looked hard for a solution but haven't been able to find one. Any help would be greatly appreciated.
Mark Pauley (61 rep)
Dec 14, 2024, 11:58 PM • Last activity: Feb 28, 2025, 09:56 PM
2 votes
4 answers
2061 views
Removing special characters from a fasta file
I recently linearized a fasta file using awk. The output is perfect. However, there is a caret (^) in my sequence that I want to remove. Below is my attempt; any assistance is highly appreciated.
>P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP^MMEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ^MFFETRPEDLNPPKEEHIGKKKSGNDPTSVDPM
>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP^MLRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE^MDFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEK
>P3
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS^MNAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKAKELGIEICLMYVEIE^MKGESVQEELLKGLDNKNPKIIVACIETLRKALS
I tried using:
$ sed '/s: ^// seq2.fa>seq3.fa
The code above gives me this error:
sed: -e expression #1, char 7: unknown command: `/'
Any assistance is appreciated, thanks.
thole (33 rep)
Dec 29, 2022, 02:41 AM • Last activity: Feb 27, 2025, 07:33 PM
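A minimal sketch of a corrected command, assuming the carets should be removed from sequence lines only and header lines (starting with `>`) should be left alone. The function name `strip_carets` is hypothetical. In a sed substitution the `s` comes first, and a literal caret is written as `\^`.

```shell
# Hypothetical helper: delete every "^" from the sequence lines of a FASTA file.
strip_carets() {
    # /^>/!  applies the command to lines that do NOT start with ">".
    # s/\^//g replaces each literal caret with nothing, everywhere on the line.
    sed '/^>/!s/\^//g' "$@"
}
```

With the file names from the question, this could be run as `strip_carets seq2.fa > seq3.fa`.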
4 votes
3 answers
230 views
Add columns from variable number of files to base file
I'm dealing with a series of bed files, which look like this:
chr1 100 110 0.5
chr1 150 175 0.2
chr1 200 300 1.5
With the columns being chromosome, start, end, score. I have multiple different files with different scores in each one, and I'd like to combine them like this:
> cat a.bed
chr1 100 110 0.5
chr1 150 175 0.2
chr1 200 300 1.5

> cat b.bed
chr1 100 110 0.4
chr1 150 175 0.7
chr1 200 300 0.9

> cat c.bed
chr1 100 110 1.5
chr1 150 175 1.2
chr1 200 300 0.1

> cat combined.bed
chr1 100 110 0.5 0.4 1.5
chr1 150 175 0.2 0.7 1.2
chr1 200 300 1.5 0.9 0.1
All the score columns (last column of the file) are added to a single file. I found [this answer](https://unix.stackexchange.com/a/167290/387150) , which can combine a column from one additional file into an existing file, but I would like a command which can add a *variable* number of columns together. So if I have 10 bed files to combine, I'd like a command that can process them all together and create a single file with 10 score columns. Each file should have the same number of lines, and each entry should have the same coordinates in all the files, so there should be no conflicts there. However there can be a lot of entries in each of the files (100K or more generally), so I'd like to avoid processing each one multiple times. Is there a way to handle this cleanly? This will be in a script so no need to be a one liner.
Whitehot (245 rep)
Feb 24, 2025, 03:00 PM • Last activity: Feb 25, 2025, 10:43 PM
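One possible sketch, assuming every input file has exactly four whitespace-separated columns and the same rows in the same order, as the question states. The function name `combine_scores` is made up for illustration. `paste` reads each file once and joins matching lines, so no file is processed repeatedly.

```shell
# Hypothetical helper: keep the coordinates from the first file and append
# the score column (4th column) of every file given as an argument.
combine_scores() {
    # paste puts line N of all files side by side, separated by tabs.
    # awk then prints columns 1-3 once, followed by every 4th column.
    paste "$@" | awk '{
        printf "%s %s %s", $1, $2, $3
        for (i = 4; i <= NF; i += 4) printf " %s", $i
        print ""
    }'
}
```

It accepts any number of files, for example `combine_scores a.bed b.bed c.bed > combined.bed`. The sketch does not verify that the coordinates actually agree between files.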
1 votes
1 answers
917 views
sorting a file on column 1 lexically, then numerically on column 2 and 3
I want to sort my file shown below:
chr17 84938 85187 1 100 1
chr12 86723 87265 2 100 1
chr12 87368 87556 11 100 1
chr12 87704 87880 10 100 1
chr12 88018 88256 3 75 1
chr12 88018 88569 1 25 1
chr17 88171 69528 1 100 2
chr12 88393 88569 6 100 1
chr12 88750 88859 3 100 1
chr12 88772 88859 3 100 1
chr12 89019 89674 7 100 1
chr12 89828 90586 1 100 1
chr12 90656 90795 3 100 1
chr17 93459 92763 1 100 2
chr17 96901 69528 4 100 2
chr17 100273 99697 1 100 2
chr16 101557 97558 13 100 2
chr16 103475 101646 8 100 2
chr16 104059 105458 18 100 1
chr16 105550 105776 19 100 1
chr16 105883 106538 17 100 1
chr16 106614 107085 20 100 1
chr18 107887 109384 1 100 1
chr16 108971 108759 2 100 2
First I want to sort on column 1, then column 2, and then column 3 (all in ascending order). I did that in Microsoft Excel and got this result:
chr12 86723 87265 2 100 1
chr12 87368 87556 11 100 1
chr12 87704 87880 10 100 1
chr12 88018 88256 3 75 1
chr12 88018 88569 1 25 1
chr12 88393 88569 6 100 1
chr12 88750 88859 3 100 1
chr12 88772 88859 3 100 1
chr12 89019 89674 7 100 1
chr12 89828 90586 1 100 1
chr12 90656 90795 3 100 1
chr16 101557 97558 13 100 2
chr16 103475 101646 8 100 2
chr16 104059 105458 18 100 1
chr16 105550 105776 19 100 1
chr16 105883 106538 17 100 1
chr16 106614 107085 20 100 1
chr16 108971 108759 2 100 2
chr17 84938 85187 1 100 1
chr17 88171 69528 1 100 2
chr17 93459 92763 1 100 2
chr17 96901 69528 4 100 2
chr17 100273 99697 1 100 2
chr18 107887 109384 1 100 1
On the Unix command line I used this command:
sort -k 1,1 -nk2 -nk3 file.txt
It gave me:
chr17 84938 85187 1 100 1
chr12 86723 87265 2 100 1
chr12 87368 87556 11 100 1
chr12 87704 87880 10 100 1
chr12 88018 88256 3 75 1
chr12 88018 88569 1 25 1
chr17 88171 69528 1 100 2
chr12 88393 88569 6 100 1
chr12 88750 88859 3 100 1
chr12 88772 88859 3 100 1
chr12 89019 89674 7 100 1
chr12 89828 90586 1 100 1
chr12 90656 90795 3 100 1
chr17 93459 92763 1 100 2
chr17 96901 69528 4 100 2
chr17 100273 99697 1 100 2
chr16 101557 97558 13 100 2
chr16 103475 101646 8 100 2
chr16 104059 105458 18 100 1
chr16 105550 105776 19 100 1
chr16 105883 106538 17 100 1
chr16 106614 107085 20 100 1
chr18 107887 109384 1 100 1
chr16 108971 108759 2 100 2
What can I do here to get output similar to Excel?
user3138373 (2589 rep)
Sep 8, 2016, 10:27 PM • Last activity: Nov 14, 2024, 08:33 AM
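A sketch of a command that should reproduce the Excel ordering for this sample, assuming whitespace-separated columns. The function name `sort_bed` is hypothetical. A likely cause of the result in the question is that a standalone `-n` acts as a global option and so also applies to the first key, where every `chr...` value counts as 0; attaching `n` to individual keys avoids that.

```shell
# Hypothetical helper: sort by column 1 as text, then columns 2 and 3 as numbers.
sort_bed() {
    # -k1,1  : the first key is column 1 only, compared as a plain string
    # -k2,2n : the second key is column 2 only, compared numerically
    # -k3,3n : the third key is column 3 only, compared numerically
    # LC_ALL=C makes the string comparison byte-wise and locale independent.
    LC_ALL=C sort -k1,1 -k2,2n -k3,3n "$@"
}
```

For example: `sort_bed file.txt > sorted.txt`. Note that plain string order places `chr10` before `chr2`; that does not matter for the chromosomes shown here.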
6 votes
3 answers
954 views
bash script quoting frustration
This problem is driving me crazy. From the command prompt I can enter this command and it works as expected (records where the INFO/RegionType tag contains the value Core are emitted in the output file):
bcftools filter -i "INFO/RegionType='Core'" -Oz -o temp.core.vcf.gz temp2.vcf.gz
But when I try to put the command in a bash script things go awry. This seems to be due to the single and double quotes? I tried various permutations of escaping quotes with no success. Finally I tried the approach described here: to store the filter string in a variable and reference the variable. From my script:
filter_string=INFO/RegionType="'Core'"
bcftools filter -i \""$filter_string"\" -Oz -o temp.core.vcf.gz temp2.vcf.gz
This completes without an error but the command is not interpreting the filter correctly as evidenced by the fact that no records are included in the output file, whereas the command given directly from the command prompt yields 3124 records in the output file. What might be going on here? ---- A minimal example of the input file that works with bcftools and reproduces the issue (note that all whitespace in the lines starting with chr1 needs to be single tabs, not spaces):
##fileformat=VCFv4.2
##FILTER=
##INFO=
##INFO=
##FORMAT=
##contig=
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1
chr1	103385791	.	C	G	1182.77	PASS	AC=1;	GT	0/1
chr1	103471456	.	CCAT	C	2848.73	PASS	AC=2;RegionType=Core	GT	1/1
mcrepeau (69 rep)
Aug 21, 2024, 08:17 PM • Last activity: Aug 23, 2024, 05:17 AM
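A small sketch that may help show what goes wrong, using a stand-in function instead of bcftools so the arguments can be inspected. `show_args` is a hypothetical helper, not part of bcftools. The escaped `\"` in the script adds literal double-quote characters to the filter expression, which the interactive command never contained.

```shell
# Hypothetical stand-in for bcftools: print each received argument in <...>.
show_args() { printf '<%s>\n' "$@"; }

# Double quotes around the whole value keep the single quotes as data.
filter_string="INFO/RegionType='Core'"

# Plain double quotes around the expansion: one argument, no extra characters.
show_args -i "$filter_string"
# prints <-i> then <INFO/RegionType='Core'>

# As in the script from the question: \" adds literal double quotes.
show_args -i \""$filter_string"\"
# prints <-i> then <"INFO/RegionType='Core'">
```

Under that reading, the script line would become `bcftools filter -i "$filter_string" -Oz -o temp.core.vcf.gz temp2.vcf.gz`.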
2 votes
5 answers
352 views
Grouping rows by categories avoiding repetition
I have a tab-separated file with two columns on a Linux machine. The first column contains names, the second column contains GO IDs (these are always of the format GO: followed by seven digits) separated by commas. What I need to do is to keep one name with one unique GO ID in each row only, discarding repetitiveness and multiple entries. From this
Pr_g33687.t1    GO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625
Pr_g33687.t1    GO:0003735,GO:0009129,GO:0006412
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565
Into this
Pr_g33687.t1    GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625,GO:0009129
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605,GO:0006355,GO:0043565
I would appreciate your help. Thank you. RS
rseg (35 rep)
Jun 26, 2024, 02:04 PM • Last activity: Jul 7, 2024, 04:03 AM
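A minimal sketch of one approach, assuming the two columns are separated by a single tab and the GO IDs by commas. The function name `merge_go_ids` is made up for illustration. Names and IDs are printed in the order they first appear, which matches the desired output in the question.

```shell
# Hypothetical helper: one output row per name, listing each GO ID once.
merge_go_ids() {
    awk -F'\t' '
    {
        if (!($1 in named)) { named[$1]; order[++count] = $1 }   # first-seen order of names
        n = split($2, ids, ",")
        for (i = 1; i <= n; i++)
            if (!seen[$1, ids[i]]++)                             # skip IDs already listed for this name
                merged[$1] = (merged[$1] == "" ? ids[i] : merged[$1] "," ids[i])
    }
    END { for (j = 1; j <= count; j++) print order[j] "\t" merged[order[j]] }
    ' "$@"
}
```

Rows for the same name do not need to be adjacent in the input.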
3 votes
3 answers
4334 views
Removing number after decimal point
I have an input file with these fields:
ENST00000456328.2 1657 1350.015 0 0
I am trying to use awk to remove the number after the decimal point and print the rest as it is:
awk -F[.] '{print $1"\t"$2"\t"$3}{next;}'
But it doesn't work; it gives output like this:
ENST00000456328 2 1657 1350 015 0 0
Can someone help? Regards.
user1738234 (87 rep)
Jan 3, 2020, 06:29 PM • Last activity: Apr 21, 2024, 12:39 AM
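A possible sketch, assuming the goal is to truncate (not round) every value, that is, to drop each dot together with the digits after it in all fields. The function name `drop_decimals` is hypothetical.

```shell
# Hypothetical helper: remove ".digits" wherever it occurs on a line.
drop_decimals() {
    # \.            matches a literal dot
    # [0-9][0-9]*   matches one or more digits after it
    sed 's/\.[0-9][0-9]*//g' "$@"
}
```

If only some columns should be changed, the same substitution could be applied per field in awk instead.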
2 votes
5 answers
1439 views
How to fetch fasta sequences corresponding to header lines in another file
I have a file of lines of headers (file1) and another file is sequences in fasta format (file2). I want to grep fasta sequences if a header line from file1 is present in file2. Example: * file1:
>sp|B7UM99|TIR_ECO27
    >sp|P06616|ERA_ECOLI
* file2:
>sp|B7UM99|TIR_ECO27
    MPIGNLGNNVNGNHLIPPAPPLPSQTDGAA
    RGGTGHLISSTGALGSRSLFSPLRNSMADS
    VDSRDIPGLPTNPSRLAAATSETCLLGGFE
    VLHDKGPLDILNTQIGPSAFRVEVQADGTH
    ......
    >sp|P06616|ERA_ECOLI
    MSIDKSYCGFIAIVGRPNVGKSTLLNKLL
    GQKISITSRKAQTTRHRIVGIHTEGAYQAIY
    VDTPGLHMEEKRAINRLMNKAASSSIGDVE
    LVIFVVEGTRWTPDDEMVLNKLREGKAPVI
    ............
    >sp|P0AD68|HUMAN
    MKAAAKTQKPKRQEEHANFISWRFALLCGC
    ILLALAFLLGRVAWLQVISPDMLVKEGDMR
    SLRVQQVSTSRGMITDRSGRPLAVSVPVKA
    IWADPKEVHDAGGISVGDRWKALANALNIP
    .............
* Desired output
>sp|B7UM99|TIR_ECO27
    MPIGNLGNNVNGNHLIPPAPPLPSQTDGAA
    RGGTGHLISSTGALGSRSLFSPLRNSMADS
    VDSRDIPGLPTNPSRLAAATSETCLLGGFE
    VLHDKGPLDILNTQIGPSAFRVEVQADGTH
    ......
    >sp|P06616|ERA_ECOLI
    MSIDKSYCGFIAIVGRPNVGKSTLLNKLL
    GQKISITSRKAQTTRHRIVGIHTEGAYQAIY
    VDTPGLHMEEKRAINRLMNKAASSSIGDVE
    LVIFVVEGTRWTPDDEMVLNKLREGKAPVI
    ............
Manoj Kumar (41 rep)
May 9, 2019, 08:25 PM • Last activity: Apr 14, 2024, 06:48 AM
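A minimal sketch of one way to do this with awk, assuming header lines start with `>` in the first column and that a header in file1 matches the corresponding header in file2 character for character. The function name `extract_fasta` is made up for illustration.

```shell
# Hypothetical helper: $1 is the file of wanted headers, $2 is the FASTA file.
extract_fasta() {
    awk '
    NR == FNR { want[$0]; next }        # first file: remember every header line
    /^>/      { keep = ($0 in want) }   # FASTA header: decide whether to print this record
    keep                                # print header and sequence lines of wanted records
    ' "$1" "$2"
}
```

For the example above, this would be called as `extract_fasta file1 file2`. Leading spaces or trailing description text in the headers would need to be stripped first for the exact match to work.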
1 votes
5 answers
132 views
sed command to replace a word within a line following a pattern
I'm working with a file that looks like the following, containing over 50,000 lines of gene IDs followed by their sequence:
gene_A:3342234 CTCTTTCTTTTACGCCT
gene_A:1244-5205 CTCTTTCTTTTACGCCT
gene_A:1838438 CTCTTTCTTTTACGCCT
gene_B:1848584 CTCTTTCTTTTACGCCT
gene_B:1029-4920 CTCTTTCTTTTACGCCT
gene_C:3849029 CTCTTTCTTTTACGCCT
They all have the gene ID, followed by a colon, and then a reference number of 7-9 digits (some include dashes). I want to replace the gene IDs with their actual names, for example geneA and geneB, whilst keeping the information that follows them. Desired output:
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneA CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
geneB CTCTTTCTTTTACGCCT
This is my first time using sed, so I'm really not quite sure where to even start. I know how to replace all lines containing gene_A with 's/gene_A.*/geneA/' but I'm not sure how to preserve the information following the gene IDs.
bryophyta (11 rep)
Feb 3, 2024, 09:34 PM • Last activity: Apr 4, 2024, 03:49 PM
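A sketch of a substitution that keeps the rest of the line, assuming the "actual name" is simply the ID without the underscore (as in the desired output) and that the reference number consists only of digits and dashes. The function name `rename_genes` is hypothetical.

```shell
# Hypothetical helper: turn "gene_A:1244-5205 SEQ" into "geneA SEQ".
rename_genes() {
    # \( ... \) captures the letters/digits after "gene_"; \1 reuses them.
    # :[0-9-]*  matches the colon and the reference number, which is dropped.
    # Text after the reference number is not matched, so it is left untouched.
    sed 's/^gene_\([A-Za-z0-9]*\):[0-9-]*/gene\1/' "$@"
}
```

Because the pattern stops at the end of the reference number instead of using `.*`, the sequence that follows is preserved.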
1 votes
1 answers
173 views
Histogram with FASTA file
I am new to Linux. I have a FASTA file that looks something like this:
>scaffold1
AAGACATAATATTTTGGAGGAATTAAAAAATTTAAGATGTATTTTATTATACATGTATTTTATTTATAACATAAATAAACATCCCAAGGAAAAGCAGTAGCT
>scaffold2
AAGGATAAGTGTAGCTGAGGAATTAAAAAATTTAACAAATAAAATTAATGCAATTTATTTTTTCAAATAAAAATACACGGAGAAAAATAATTTGTAAATTTT
etc. This goes on to about 5000+ scaffolds. I want to make a histogram with the scaffold lengths. I read about Biopython etc. but I don't know anything about installing these programs. Is there a way to get a histogram with just the Linux commands (terminal) or with R? Thank you
Max Mustermann (21 rep)
Jul 18, 2019, 11:59 AM • Last activity: Apr 4, 2024, 11:16 AM
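A rough sketch using only awk and sort, assuming a standard FASTA layout in which each record starts with a `>` line and the sequence may span one or more lines. The function names `scaffold_lengths` and `length_histogram` and the bin width are illustrative. The second function prints a table of bin start and count rather than a drawn plot.

```shell
# Hypothetical helper: print "name<TAB>length" for every record.
scaffold_lengths() {
    awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($0, 2); len = 0; next }
         { len += length($0) }                    # add up all sequence lines of the record
    END  { if (name != "") print name "\t" len }
    ' "$@"
}

# Hypothetical helper: count lengths (last column on stdin) in bins of $1 bases.
length_histogram() {
    awk -v bin="$1" '{ b = int($NF / bin) * bin; n[b]++ }
        END { for (b in n) print b "\t" n[b] }' | sort -n
}
```

For instance, `scaffold_lengths scaffolds.fa | length_histogram 100` would count scaffolds per 100-base bin, where `scaffolds.fa` is a placeholder file name. The two-column output of `scaffold_lengths` could also be loaded into R for plotting.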
1 votes
4 answers
381 views
replace header in a file with list of lines in another file
I have a fasta file contained ~28000 sequence. I want to replace header of these sequences by a list of lines in another file. Example: File 1: sp|B7UM99|TIR_ECO27 MPIGNLGNNVNGNHLIPPAPP..... sp|P0ACF8|HNS_ECOLI MSEALKILNNIRTLRAQ........ sp|P24232|HMP_ECOLI MLDAQTIATVKATIPLLVET.......... File 2: sp|B7UM99|TIR_ECO27OS=Escherichia coli sp|P0ACF8|HNS_ECOLI=Human sp|P24232|HMP_ECOLI=Flavohemoprotein Desired Output: sp|B7UM99|TIR_ECO27OS=Escherichia coli MPIGNLGNNVNGNHLIPPAPP..... sp|P0ACF8|HNS_ECOLI=Human MSEALKILNNIRTLRAQ........ sp|P24232|HMP_ECOLI=Flavohemoprotein MLDAQTIATVKATIPLLVET..........
Manoj Kumar (41 rep)
May 9, 2019, 03:58 PM • Last activity: Apr 2, 2024, 11:08 AM
5 votes
6 answers
304 views
subset columns from the 1st file using column names in 2nd file
I have two text files. The 1st file is a tab-delimited file which looks like this:
chrom pos ref alt a1 a2 a3 a4
10 12345 C T aa bb cc dd
10 12345 C T aa bb cc dd
10 12345 C T aa bb cc dd
10 12345 C T aa bb cc dd
10 12345 C T aa bb cc dd
10 12345 C T aa bb cc dd
The 2nd file looks like this:
a1
a4
I want to extract those columns in the 1st file which are present in the 2nd file, along with the first 4 columns of the first file. So in the above case, the output will look like this:
chrom pos ref alt a1 a4
10 12345 C T aa dd
10 12345 C T aa dd
10 12345 C T aa dd
10 12345 C T aa dd
10 12345 C T aa dd
10 12345 C T aa dd
I want to do this in shell. How can I do this? I have a bigger file than shown here, so I have many columns in the 1st file.
cut -f 1-4,$(grep -Fwf file2.txt <(head -1 file1.txt)) file1.txt
user3138373 (2589 rep)
Mar 22, 2024, 04:01 AM • Last activity: Mar 24, 2024, 03:25 AM
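A minimal awk sketch, assuming the table is tab-delimited with column names on its first line and that the name list has one column name per line. The function name `subset_columns` is hypothetical. The first four columns are always kept.

```shell
# Hypothetical helper: $1 is the list of wanted column names, $2 is the table.
subset_columns() {
    awk -F'\t' -v OFS='\t' '
    NR == FNR { want[$1]; next }         # first file: remember the wanted names
    FNR == 1  {                          # header row: pick the column positions to keep
        for (i = 1; i <= NF; i++)
            if (i <= 4 || ($i in want)) keep[++n] = i
    }
    {
        # Print the kept columns of every row, header included.
        for (j = 1; j <= n; j++) printf "%s%s", $(keep[j]), (j < n ? OFS : ORS)
    }
    ' "$1" "$2"
}
```

With the file names from the question, the call would be `subset_columns file2.txt file1.txt`. Columns come out in their original order, not in the order listed in the second file.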
0 votes
3 answers
823 views
how to print file name and total number of fasta sequences?
I have fasta files named test.fasta, pas.fasta and cel.fasta, as shown below.
test.fasta
>tile
ATGTC
>259
TGAT
pas.fasta
>ta
ATGCT
cel.fasta
>787
TGTAG
>yog
TGTAT
>In
NNTAG
I need to print the file name and the total number of fasta sequences, as shown below:
test,2
pas,1
cel,3
I have used the following command, but it failed to serve my purpose:
grep ">" test.fasta | wc -l && ls test.fasta
Please help me to do the same. Thanks in advance.
Kumar (129 rep)
Sep 5, 2021, 01:44 PM • Last activity: Mar 16, 2024, 09:12 AM
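A short sketch of one way to get that output, assuming every sequence has exactly one header line starting with `>` and the files end in `.fasta`. The function name `count_fasta_records` is made up for illustration.

```shell
# Hypothetical helper: print "name,count" for each FASTA file given.
count_fasta_records() {
    for f in "$@"; do
        name=${f##*/}                      # strip any directory part
        # grep -c counts the lines that start with ">" (one per sequence);
        # ${name%.fasta} removes the file extension.
        printf '%s,%s\n' "${name%.fasta}" "$(grep -c '^>' "$f")"
    done
}
```

For the three files above: `count_fasta_records test.fasta pas.fasta cel.fasta`.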
2 votes
5 answers
1498 views
Replace new lines with spaces using awk
I have a text file that I generated of all files in a directory. I'd like to use this file as input into a script that I have, but I need the text file to be formatted in a particular way to be parsed correctly. Currently the text file, which is a list of file names, is formatted like so:
A1_R1.fastq.gz
A1_R2.fastq.gz
A2_R1.fastq.gz
A2_R2.fastq.gz
A3_R1.fastq.gz
A3_R2.fastq.gz
I need the paired reads (files with the same name, but different RN values) for each sample to be on the same line, separated by a tab:
A1_R1.fastq.gz A1_R2.fastq.gz
A2_R1.fastq.gz A2_R2.fastq.gz
A3_R1.fastq.gz A3_R2.fastq.gz
Since I have >1000 entries, I am hoping for a method using awk or something similar to modify the file, but I don't have much experience with awk.
lovelyrubbish (29 rep)
Feb 26, 2024, 01:43 AM • Last activity: Feb 28, 2024, 03:43 PM
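A minimal awk sketch, assuming the list is already ordered so that the two reads of each sample sit on consecutive lines (R1 first, then R2). The function name `pair_lines` is hypothetical.

```shell
# Hypothetical helper: join every two consecutive lines with a tab.
pair_lines() {
    # Odd-numbered lines are followed by a tab, even-numbered lines by a newline.
    awk '{ printf "%s%s", $0, (NR % 2 ? "\t" : "\n") }' "$@"
}
```

The standard `paste - - < file` does the same job without awk. Neither variant checks that the two names on a line really belong to the same sample.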
0 votes
3 answers
8061 views
How to extract sequence lines from FASTQ file?
I have FASTQ formatted Illumina sequence file like this: @ERR009148.2485 IL26_1382:7:1:224:616 length=36 ATCACATGCTCCTTGTTCTGCAGCTTGGTGCGGATG +ERR009148.2485 IL26_1382:7:1:224:616 length=36 >>>>>>>>>>>>>>>>>>>>>>5>>->>* @ERR009148.2486 IL26_1382:7:1:914:59 length=36 AAAGAAGTAAAATAAGAAGGCAATGCTTGTGGAAGG +ERR009148.2486 IL26_1382:7:1:914:59 length=36 .>>74::1>174151/7152313,3&003,00&2%2 @ERR009148.2487 IL26_1382:7:1:251:589 length=36 GCCATAAACACCCCAGCACCACATTCATCAGAAGGG +ERR009148.2487 IL26_1382:7:1:251:589 length=36 >>>>>>>>>>>>>>>>>>>>>>8>>>>>>>>7 @ERR009148.2488 IL26_1382:7:1:911:194 length=36 ATTGAGGTGGAGTAGATTAGGCGTAGGTAGAAGTAG +ERR009148.2488 IL26_1382:7:1:911:194 length=36 >>=>>>>>>>=;>7>==<<7;=67=/57/57 I need to extract only the raw sequences from each record. What sed command can be used for that? Expected output: ATCACATGCTCCTTGTTCTGCAGCTTGGTGCGGATG AAAGAAGTAAAATAAGAAGGCAATGCTTGTGGAAGG GCCATAAACACCCCAGCACCACATTCATCAGAAGGG ATTGAGGTGGAGTAGATTAGGCGTAGGTAGAAGTAG
iiii (11 rep)
Oct 13, 2017, 05:40 AM • Last activity: Feb 21, 2024, 04:17 AM
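A sketch of a sed command for this, assuming every record occupies exactly four lines (header, sequence, `+` line, quality), so the sequence is always the second line of each group. The function name `fastq_sequences` is hypothetical.

```shell
# Hypothetical helper: print only the sequence line of each 4-line FASTQ record.
fastq_sequences() {
    # -n suppresses automatic printing. Within each cycle:
    #   n   replace the header line with the sequence line
    #   p   print the sequence line
    #   n;n read past the "+" line and the quality line
    sed -n 'n;p;n;n' "$@"
}
```

An awk equivalent would be `awk 'NR % 4 == 2'`. Both rely on line position only, so they would give wrong results for FASTQ files with wrapped sequences.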
1 votes
5 answers
98 views
How to count word from a column when consecutive cells are equal in a different column using shell script!
I'm trying to count the number of C_R and S_R in column 9 when consecutive cells in column 1, column 2, and column 3 are the same. The file is in bed format (tab-separated format). The original file is huge and column 1 defines the chromosome number. The first few lines of the file look like this:
chr1 10200 10300 8 10000 10214 100 214 S_R
chr1 10200 10300 8 10009 10233 100 224 S_R
chr1 10200 10300 8 10014 10220 100 206 S_R
chr1 10200 10300 8 10045 10215 100 170 S_R
chr1 10200 10300 8 10068 10209 100 141 S_R
chr1 10200 10300 8 10074 10300 100 226 C_R
chr1 10200 10300 8 10182 10283 100 101 S_R
chr1 10200 10300 8 10182 10387 100 205 C_R
chr1 10300 10400 4 10182 10387 100 205 S_R
chr1 10300 10400 4 10331 10467 100 136 S_R
chr1 10300 10400 4 10346 10461 100 115 S_R
chr1 10300 10400 4 10352 10468 100 116 S_R
chr1 10400 10500 3 10331 10467 100 136 S_R
chr1 10400 10500 3 10346 10461 100 115 S_R
chr1 10400 10500 3 10352 10468 100 116 S_R
chr1 11000 11100 2 11024 11163 100 139 S_R
chr1 11000 11100 2 11024 11188 100 164 S_R
chr1 11100 11200 3 11024 11163 100 139 S_R
chr1 11100 11200 3 11024 11188 100 164 S_R
chr1 11100 11200 3 11127 11296 100 169 S_R
chr1 11200 11300 1 11127 11296 100 169 S_R
chr1 11400 11500 2 11412 11561 100 149 S_R
chr1 11400 11500 2 11457 11608 100 151 S_R
chr1 11500 11600 3 11412 11561 100 149 S_R
chr1 11500 11600 3 11457 11608 100 151 C_R
chr1 11500 11600 3 11574 11744 100 170 S_R
chr1 11600 11700 3 11457 11608 100 151 S_R
chr1 11600 11700 3 11574 11744 100 170 C_R
chr1 11600 11700 3 11640 11815 100 175 S_R
chr1 11700 11800 4 11574 11744 100 170 S_R
chr1 11700 11800 4 11640 11815 100 175 C_R
chr1 11700 11800 4 11784 11963 100 179 S_R
chr1 11700 11800 4 11791 11936 100 145 S_R
In the table above, columns 1, 2, and 3 are the same in the first 8 rows, so the tentative output file would look like this:
chr1 10200 10300 2 6
chr1 10300 10400 0 4
chr1 10400 10500 0 3
chr1 11000 11100 0 2
chr1 11100 11200 0 3
chr1 11200 11300 0 1
chr1 11400 11500 0 2
chr1 11500 11600 1 2
chr1 11600 11700 1 2
chr1 11700 11800 1 3
In the output file, column 4 is the C_R count and column 5 is the S_R count.
Debajyoti Kabiraj (251 rep)
Oct 8, 2023, 06:03 AM • Last activity: Feb 10, 2024, 02:44 AM
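A minimal awk sketch, assuming rows with identical columns 1 to 3 are always adjacent (as in the sample) and that column 9 holds either `C_R` or `S_R`. The function name `count_read_types` is made up for illustration. Only the counters for the current run are held in memory, so the size of the file should not matter.

```shell
# Hypothetical helper: for each run of rows sharing columns 1-3, print those
# columns followed by the number of C_R rows and the number of S_R rows.
count_read_types() {
    awk -v OFS='\t' '
    function emit() { if (key != "") print key, c, s }
    {
        k = $1 OFS $2 OFS $3
        if (k != key) { emit(); key = k; c = 0; s = 0 }   # a new run of identical columns 1-3
        if ($9 == "C_R") c++
        else if ($9 == "S_R") s++
    }
    END { emit() }                                         # print the final run
    ' "$@"
}
```

The output is tab-separated, with the C_R count in column 4 and the S_R count in column 5.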
Showing page 1 of 20 total questions