How to efficiently split up a large text file without splitting multiline records?
9 votes · 7 answers · 5827 views
I have a big text file (~50Gb when gz'ed). The file contains `4*N` lines or `N` records; that is, every record consists of 4 lines. I would like to split this file into 4 smaller files, each sized roughly 25% of the input file. How can I split up the file at the record boundary?
A naive approach would be `zcat file | wc -l` to get the line count, divide that number by 4 and then use `split -l <count> file`. However, this goes over the file twice and the line count is extremely slow (36 mins). Is there a better way?
This question comes close but is not what I am looking for; its accepted answer also does a line count.
**EDIT:**
The file contains sequencing data in fastq format. Two records look like this (anonymized):
@NxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxGCGA+ATAGAGAG
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTTTATGTTTTTAATTAATTCTGTTTCCTCAGATTGATGATGAAGTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+
AAAAA#FFFFFFFFFFFFAFFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
**EDIT2:**
`zcat file > /dev/null` takes 31mins.
**EDIT3:**
Only the first line starts with `@`. None of the others ever will. See here. Records need to stay in order. It's not ok to add anything to the resulting files.
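Given those constraints (4-line records, only the header line starting with `@`, order preserved, nothing added), a quick sanity check on the output of whatever split method is used could look like this; the `chunk_*` glob is just an assumed output naming scheme:

```
# Each chunk should begin with a record header and contain a multiple of
# 4 lines; anything else means a record was split across files.
for f in chunk_*; do
    head -n 1 "$f" | grep -q '^@' || echo "$f: does not start with '@'"
    lines=$(wc -l < "$f")
    (( lines % 4 == 0 )) || echo "$f: $lines lines, not a multiple of 4"
done
```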
Asked by Rolf
(932 rep)
Jun 16, 2015, 07:55 AM
Last activity: Nov 27, 2023, 07:55 AM