How to efficiently split up a large text file without splitting multiline records?
9 votes · 7 answers · 5827 views
I have a big text file (~50Gb when gz'ed). The file contains `4*N` lines or `N` records; that is, every record consists of 4 lines. I would like to split this file into 4 smaller files, each sized roughly 25% of the input file. How can I split up the file at the record boundary?
A naive approach would be `zcat file | wc -l` to get the line count, divide that number by 4 and then use `split -l <count> file`. However, this goes over the file twice and the line count is extremely slow (36 mins). Is there a better way?
This question comes close but is not what I am looking for; its accepted answer also does a line count.
**EDIT:**
The file contains sequencing data in fastq format. Two records look like this (anonymized):
@NxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxGCGA+ATAGAGAG
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTTTATGTTTTTAATTAATTCTGTTTCCTCAGATTGATGATGAAGTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+
AAAAA#FFFFFFFFFFFFAFFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
**EDIT2:**
`zcat file > /dev/null` takes 31mins.
**EDIT3:**
Only the first line starts with `@`. None of the others ever will. See here. Records need to stay in order. It's not ok to add anything to the resulting files.
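Given those constraints (4-line records, only the header line starting with `@`, order preserved, nothing added), a quick sanity check on the output of whatever split method is used could look like this; the `chunk_*` glob is just an assumed output naming scheme:

```
# Each chunk should begin with a record header and contain a multiple of
# 4 lines; anything else means a record was split across files.
for f in chunk_*; do
    head -n 1 "$f" | grep -q '^@' || echo "$f: does not start with '@'"
    lines=$(wc -l < "$f")
    (( lines % 4 == 0 )) || echo "$f: $lines lines, not a multiple of 4"
done
```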
Asked by Rolf
(932 rep)
Jun 16, 2015, 07:55 AM
Last activity: Nov 27, 2023, 07:55 AM