How to efficiently split up a large text file without splitting multiline records?

9 votes
7 answers
5827 views
I have a big text file (~50Gb when gz'ed). The file contains `4*N` lines or `N` records; that is, every record consists of 4 lines. I would like to split this file into 4 smaller files, each sized roughly 25% of the input file. How can I split up the file at the record boundary?

A naive approach would be `zcat file | wc -l` to get the line count, divide that number by 4 and then use `split -l` on the file (a rough sketch of this appears below, after the edits). However, this goes over the file twice and the line count is extremely slow (36mins). Is there a better way?

This comes close but is not what I am looking for. The accepted answer also does a line count.

**EDIT:** The file contains sequencing data in fastq format. Two records look like this (anonymized):

```
@NxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxGCGA+ATAGAGAG
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTTTATGTTTTTAATTAATTCTGTTTCCTCAGATTGATGATGAAGTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+
AAAAA#FFFFFFFFFFFFAFFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
```

**EDIT2:** `zcat file > /dev/null` takes 31mins.

**EDIT3:** Only the first line starts with `@`. None of the other lines ever will. See here. Records need to stay in order. It's not ok to add anything to the resulting files.
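For reference, here is a minimal sketch of the naive two-pass approach described above. It assumes GNU coreutils `split`, a hypothetical input name `file.fastq.gz`, and an output prefix `part_`; the rounding keeps every output file a multiple of 4 lines so no record is ever split:

```
# First pass: count the lines (this is the slow ~36-minute step).
lines=$(zcat file.fastq.gz | wc -l)

# A quarter of the records, rounded up, expressed as a line count
# (multiple of 4), so split never cuts a 4-line record in half.
records=$(( lines / 4 ))
per_file=$(( (records + 3) / 4 * 4 ))

# Second pass: decompress again and split at record boundaries
# into part_aa, part_ab, part_ac, part_ad.
zcat file.fastq.gz | split -l "$per_file" - part_
```

This is exactly the double pass over the file that the question asks to avoid.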
Asked by Rolf (932 rep)
Jun 16, 2015, 07:55 AM
Last activity: Nov 27, 2023, 07:55 AM