Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
39
votes
4
answers
88654
views
Replace text quickly in very large file
I have a 25 GB text file that needs a string replaced on only a few lines. I can use `sed` successfully, but it takes a really long time to run.

sed -i 's|old text|new text|g' gigantic_file.sql

Is there a quicker way to do this?
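One way to shave some time off, sketched below on the assumption that GNU sed is in use: anchor the substitution to lines that actually match, so sed skips the replacement work on every other line (the file is still rewritten in full because of `-i`).

```
# Only attempt the substitution on lines containing the pattern.
sed -i '/old text/ s/old text/new text/g' gigantic_file.sql
```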
eisaacson
(491 rep)
Jan 14, 2016, 07:14 PM
• Last activity: Jul 4, 2025, 08:19 AM
2
votes
4
answers
4927
views
EXT4 for very large (>1GB) files : increase block size, use block clusters, or both?
I'd like to format a 12 TB HDD (not SSD) **with EXT4**, in order to store large video files (each file being at least 1 GiB in size).
I am working with an x86-64 (a.k.a. x64 or amd64) processor.
There's of course the `-T largefile4` option of `mkfs.ext4`, but are there other optimizations that can be done?
In particular, I wonder:
- Should I increase the block size to its max (64K, `-b 65536`)?
- OR should I use block clusters, and set the cluster size to its max (256M, `-C 268435456`)?
- OR should I do both?
What would be the best parameters in terms of both disk space and performance optimization?
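For reference, a minimal sketch of a bigalloc-style invocation; the device name and the 256 KiB cluster size are placeholders, not a benchmarked recommendation:

```
# Keep the default 4 KiB block size (ext4 block sizes above the CPU page size
# are generally not mountable on x86-64) and use bigalloc clusters instead.
mkfs.ext4 -T largefile4 -O bigalloc -C 262144 -m 0 /dev/sdX1
```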
ChennyStar
(1969 rep)
Jan 12, 2024, 05:56 AM
• Last activity: Dec 15, 2024, 12:45 PM
2
votes
5
answers
1555
views
How to compare huge files with progress information
In a Unix command line context I would like to compare two truly huge files (around 1TB each), preferably with a progress indicator.
I have tried `diff` and `cmp`, and they both crashed the system (macOS Mojave), let alone giving me a progress bar.
What's the best way to compare these very large files?
### Additional Details:
1. I just want to check that they are identical.
2. `cmp` crashed the system in a way that the system restarted by itself. :-( Maybe the system ran out of memory?
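One commonly suggested pattern, assuming `pv` is installed (e.g. via Homebrew) and a shell with process substitution: stream both files through `pv` for the progress bars and let `cmp` do the byte-for-byte comparison. File names are placeholders.

```
# -c keeps the two progress bars from clobbering each other, -N labels them.
cmp <(pv -cN first bigfile1.img) <(pv -cN second bigfile2.img) \
  && echo "files are identical"
```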
halloleo
(649 rep)
Apr 20, 2022, 02:26 AM
• Last activity: Oct 10, 2024, 11:46 PM
0
votes
3
answers
8950
views
How to copy files from linux to windows using winscp from a folder which contains millions of files
I need to copy files from a Linux machine to a Windows machine, where the only port that can be open is SSH (22).
I can connect to the Linux machine using WinSCP, but the problem is that once I try to navigate to the desired folder, WinSCP gets stuck, since the folder contains millions of files.
Basically I don't really care which files I copy, and I would be glad to find a solution which enables me to just copy the latest 200 files.
Any ideas?
I've tried using `ls -f | less` but that did not do the trick.
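A rough sketch of one way to sidestep the listing problem from the Linux side (paths are placeholders, and it assumes file names without newlines): bundle the 200 most recently modified files into one archive and copy just that archive over with WinSCP. `ls -t` still has to stat every entry, so it will take a while, but it runs entirely server-side.

```
# Run on the Linux machine:
cd /path/to/huge_dir
ls -t | head -n 200 | tar -czf /tmp/latest200.tar.gz -T -
```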
Ittai
(101 rep)
Sep 6, 2011, 06:54 AM
• Last activity: Sep 8, 2024, 04:40 PM
247
votes
12
answers
365206
views
How to remove duplicate lines inside a text file?
A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (useless in my case, as the file is a CSV-like data table).
What I need is to remove all the repetitions while (preferably, but this can be sacrificed for a significant performance boost) maintaining the original sequence order. In the result each line is to be unique. If there were 100 equal lines (usually the duplicates are spread across the file and won't be neighbours), only one of the kind is to be left.
I have written a program in Scala (consider it Java if you don't know about Scala) to implement this. But maybe there are faster C-written native tools able to do this?
UPDATE: the `awk '!seen[$0]++' filename` solution seemed to work just fine for me as long as the files were around 2 GiB or smaller, but now that I have to clean up an 8 GiB file it doesn't work any more. It seems to take forever on a Mac with 4 GiB RAM, and a 64-bit Windows 7 PC with 4 GiB RAM and 6 GiB swap just runs out of memory. And I don't feel enthusiastic about trying it on Linux with 4 GiB RAM given this experience.
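For files that no longer fit the awk approach in RAM, one fallback (at the cost of losing the original line order) is GNU `sort`, which spills to temporary files on disk instead of keeping everything in memory:

```
# -u drops duplicates, -S caps the in-memory buffer, -T points at a temp dir
# with enough free space for the intermediate runs.
sort -u -S 2G -T /path/to/tmp big.csv > big.dedup.csv
```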
Ivan
(18358 rep)
Jan 27, 2012, 03:34 PM
• Last activity: Aug 30, 2024, 01:12 AM
2
votes
1
answers
1731
views
How to encrypt an 8TB disk with Veracrypt with hidden partition?
When I try to encrypt an 8 TB USB drive (Seagate Expansion Drive) with Veracrypt to create a hidden Veracrypt volume, I receive this error:
> Error: The hidden volume to be created is larger than 2 TB (2048 GB).
>
> Possible solutions:
>
> - Create a container/partition smaller than 2 TB.
> - Use a drive with 4096-byte sectors to be able to create partition/device-hosted hidden volumes up to 16 TB in size.
I'm new to Veracrypt and Linux. If I understood it correctly, to have an 8 TB hidden partition I need to format the drive so that it'll have *"4096-byte sectors"*. I'm not finding this option in GParted.
Hence, my question is: *How can I, in Linux Mint, format the drive to have 4096-byte sectors, in order for me to install a hidden Veracrypt partition?*
Steps taken to reproduce the problem:
1. Launch Veracrypt and choose "Create a volume within a partition/drive"
2. In "Volume Type" choose "Hidden Veracrypt Volume"
3. When in "Outer Volume Format", click format. The error message I quoted will then be displayed
I'm using the latest version of Veracrypt and Linux Mint.
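Before reformatting anything, it may be worth checking what sector sizes the drive/USB bridge actually reports, since the logical sector size is a property of the device rather than something GParted sets (device name is a placeholder):

```
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX
```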
flen
(161 rep)
Oct 13, 2019, 07:00 PM
• Last activity: Aug 19, 2024, 01:22 PM
16
votes
5
answers
11430
views
How do I read the last lines of a huge log file?
I have a log file that is 55 GB in size.
I tried:
cat logfile.log | tail
But this approach takes a lot of time. Is there any way to read huge files faster or any other approach?
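For a regular file, `tail` seeks to the end on its own, so piping the whole 55 GB through `cat` is the slow part; for example:

```
tail -n 100 logfile.log    # print the last 100 lines without reading the whole file
tail -f logfile.log        # keep following new lines as they are appended
```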
Yi Qiang Ji
(162 rep)
Feb 20, 2024, 03:52 PM
• Last activity: Jul 6, 2024, 01:22 PM
6
votes
4
answers
1392
views
Delete huge directory that causes all commands to hang
How do I delete this large directory?
stat session/
File: ‘session/’
Size: 321540096 Blocks: 628040 IO Block: 4096 directory
Device: 903h/2307d Inode: 11149319 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-09-29 14:34:40.910894275 +0200
Modify: 2022-09-29 14:35:09.598400050 +0200
Change: 2022-09-29 14:35:09.598400050 +0200
Birth: -
**Note that the size of directory (not the content, but the directory entry itself) is over 300MB.**
Number of inodes is over 11 million.
**The directory has no subdirectories, only large number of files.**
None of the usual commands work. I have tried these:
- `rsync -a --delete empty_dir/ session/`
- `rm -rf session`
- `find . -type f --delete`
If I run `ls -f1` inside, it hangs.
If I run `mv -- * ../.tmp_to_delete` inside, it hangs.
If I run `du` inside, it hangs.
At the moment the `rsync --delete` has been running for two days, reading at a rate of up to 7 MB/s, and I see no change in the stat output for the directory.
I assume the large size of the directory is the problem.
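One detail worth checking in the attempts above: GNU find spells its delete action `-delete` (single dash). With it, entries are unlinked as they are read, without building a full listing first; a sketch:

```
find session/ -type f -delete
```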
Bojan Hrnkas
(200 rep)
Sep 29, 2022, 12:56 PM
• Last activity: Jun 18, 2024, 02:11 AM
4
votes
3
answers
3234
views
How to determine size of tar archive without creating it?
I'm archiving a few directories every night to LTO-7 tape with about 100 or so large (2GB) files in each of them.
As a check that the data has been written correctly, I'm verifying that the number of bytes reported written is the same as what should have been written.
I'm first looking at the size of the archive by doing a tar dry-run:
tar -cP --warning=no-file-changed $OLDEST_DIR | wc -c
Then I'm creating the archive with:
tar -cvf /dev/nst0 --warning=no-file-changed --totals $OLDEST_DIR
If the filesizes match, then I delete the original file.
The problem is that the dry-run has to read the entire contents of the files and can take several hours. Ideally, it should use the reported filesizes, apply the necessary padding / aligning, and report back the size rather than thrashing the disk for hours.
Using `du -s` or similar doesn't work because the sizes don't quite match (filesystems treat a directory as 4096 bytes, tar treats it as 0 bytes, for example).
Alternatively, is there a better way of checking that the file has been correctly written? I can't trust tar's return code, since I'm ignoring certain warnings (to handle some sort of bug with tar/mdraid).
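A rough size estimate along these lines, assuming GNU tar defaults (512-byte member headers, content padded to 512 bytes, 10 KiB record rounding) and ignoring the extra extension headers needed for very long path names, could be sketched as:

```
find "$OLDEST_DIR" -type f -printf '%s\n' |
  awk '{ total += 512 + int(($1 + 511) / 512) * 512 }    # header + padded content
       END { total += 1024;                              # end-of-archive blocks
             r = 10240;                                  # default record size
             printf "%d\n", int((total + r - 1) / r) * r }'
```

It only reads metadata, so it finishes in seconds rather than hours, but it is an estimate, not a byte-exact prediction.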
nippoo
(161 rep)
Sep 7, 2016, 02:33 PM
• Last activity: Jun 12, 2024, 02:12 PM
134
votes
14
answers
34711
views
Replace string in a huge (70GB), one line, text file
I have a huge (70GB), **one line**, text file and I want to replace a string (token) in it.
I want to replace the token ` `, with another dummy token (glove issue).
I tried `sed`:

sed 's///g' corpus.txt.new

but the output file `corpus.txt.new` has zero bytes!
I also tried using perl:

perl -pe 's///g' corpus.txt.new

but I got an out-of-memory error.
For smaller files, both of the above commands work.
How can I replace a string in such a file?
This is a related question, but none of the answers worked for me.
**Edit**: What about splitting the file into chunks of 10 GB (or whatever) each, applying `sed` on each one of them, and then merging them with `cat`? Does that make sense? Is there a more elegant solution?
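If the tokens in the corpus are separated by spaces (an assumption; `TOKEN` and `DUMMY` below are placeholders for the real strings), one way to avoid ever holding the 70 GB line in memory is to make the record separator a space, so awk processes one token at a time:

```
# Each space-separated token becomes its own record, so memory use stays tiny.
awk 'BEGIN { RS = ORS = " " } $0 == "TOKEN" { $0 = "DUMMY" } 1' corpus.txt > corpus.txt.new
```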
Christos Baziotis
(1467 rep)
Dec 29, 2017, 02:58 PM
• Last activity: Apr 7, 2024, 02:06 PM
2
votes
1
answers
226
views
Use GNU parallel with very long lines
I have a very large SQL dumpfile (30GB) that I need to edit (do some find/replace) before loading back into the database.
Besides having a large size, the file also contains very long lines. Except for the first 40 and last 12 lines, all other lines have lengths of ~1 MB. These lines are all INSERT INTO commands that all look alike:
cat bigdumpfile.sql | cut -c-100

INSERT INTO `table1` VALUES (951068,1407592,0.0267,0.0509,0.121),(285
INSERT INTO `table1` VALUES (238317,1407664,0.008,0.0063,0.1286),(241
INSERT INTO `table1` VALUES (938922,1407739,0.0053,0.0024,0.031),(226
INSERT INTO `table1` VALUES (44678,1407886,0.0028,0.0028,0.0333),(234
INSERT INTO `table1` VALUES (910412,1407961,0.001,0.0014,0),(911017,1
INSERT INTO `table1` VALUES (903890,1408050,0.0066,0.01,0.0287),(9095
INSERT INTO `table1` VALUES (257090,1408136,0.0023,0.0037,0.0196),(56
INSERT INTO `table1` VALUES (593367,1408237,0.0066,0.0117,0.0286),(95
INSERT INTO `table1` VALUES (870488,1408339,0.0131,0.009,0.0135),(870
INSERT INTO `table1` VALUES (282798,1408414,0.0015,0.014,0.014),(2830
...
Parallel ends with an error on long lines:
parallel -a bigdumpfile.sql -k sed -i.bak 's/table1/newtable/'
parallel: Error: Command line too long (1018952 >= 63543) at input 0: INSERT INTO `table1...
Because all lines are similar, and I only need the find/replace to happen at the beginning of the line, I've followed the advice [in this similar question here](https://unix.stackexchange.com/questions/642939/use-gnu-parallel-when-file-has-a-single-long-line) with a nice suggestion to use `--recstart` and `--recend`. However, these are not working:

parallel -a bigdumpfile.sql -k --recstart 'INSERT' --recend 'VALUES' sed -i.bak 's/table/newtable/'
parallel: Error: Command line too long (1018952 >= 63543) at input 0: INSERT INTO `table1...

I tried a number of variations using `--block` but could not get it working. I am a GNU parallel newbie, and I'm doing something wrong or just missing something obvious. Any help appreciated. Thanks!
This is using GNU parallel 20240122.
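A sketch of the usual workaround: with `--pipepart` the records are fed to `sed` on its standard input instead of being appended to the command line, so the ~1 MB lines never hit the argument-length limit. Because the data arrives on stdin, `-i.bak` no longer applies and the result is redirected to a new file instead:

```
parallel -a bigdumpfile.sql --pipepart --block 100M -k \
  "sed 's/table1/newtable/'" > newdumpfile.sql
```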
fernan
(23 rep)
Feb 24, 2024, 03:25 PM
• Last activity: Feb 27, 2024, 07:21 AM
11
votes
3
answers
11842
views
Basic sed command on large one-line file: couldn't re-allocate memory
I have a 250 MB text file, all in one line.
In this file I want to replace `a` characters with `b` characters:

sed -e "s/a/b/g" < one-line-250-mb.txt

It fails with:

sed: couldn't re-allocate memory

It seems to me that this kind of task could be performed inline without allocating much memory.
Is there a better tool for the job, or a better way to use `sed`?
---
GNU sed version 4.2.1
Ubuntu 12.04.2 LTS
1 GB RAM
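For a plain single-character substitution like this, `tr` works as a pure stream filter and never needs the whole line in memory:

```
tr 'a' 'b' < one-line-250-mb.txt > out.txt
```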
Nicolas Raoul
(8465 rep)
Dec 19, 2013, 03:31 AM
• Last activity: Dec 2, 2023, 02:44 PM
53
votes
4
answers
81815
views
Diffing two big text files
I have two big files (6GB each). They are unsorted, with linefeeds (`\n`) as separators. How can I diff them? It should take under 24h.
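One approach that usually fits comfortably under 24 hours: sort both files on disk first, then compare the sorted copies, e.g. with `comm` (which requires sorted input). File names, buffer size and temp directory below are placeholders.

```
sort -S 2G -T /tmp big1.txt > big1.sorted
sort -S 2G -T /tmp big2.txt > big2.sorted
comm -3 big1.sorted big2.sorted > lines-that-differ.txt   # lines unique to either file
```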
Jonas Lejon
(719 rep)
Sep 16, 2010, 10:50 AM
• Last activity: Aug 31, 2023, 06:28 AM
198
votes
8
answers
378493
views
cat line X to line Y on a huge file
Say I have a huge text file (>2GB) and I just want to `cat` the lines `X` to `Y` (e.g. 57890000 to 57890010).
From what I understand I can do this by piping `head` into `tail` or vice versa, i.e.

head -A /path/to/file | tail -B

or alternatively

tail -C /path/to/file | head -D

where `A`, `B`, `C` and `D` can be computed from the number of lines in the file, `X` and `Y`.
But there are two problems with this approach:
1. You have to compute `A`, `B`, `C` and `D`.
2. The commands could `pipe` to each other **many more** lines than I am interested in reading (e.g. if I am reading just a few lines in the middle of a huge file).
Is there a way to have the shell just work with and output the lines I want? (while providing only `X` and `Y`)?
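For reference, `sed` can print just the requested range and quit as soon as it passes line `Y`, so nothing beyond that point is read; for the example range:

```
sed -n '57890000,57890010p; 57890010q' /path/to/file
```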
Amelio Vazquez-Reina
(42851 rep)
Sep 6, 2012, 10:38 PM
• Last activity: Aug 6, 2023, 09:33 AM
9
votes
1
answers
7593
views
Is there bdiff (1) in Linux?
There is a `bdiff(1)` command in Solaris, which allows you to `diff(1)` files bigger than your RAM size (documentation).
Is there something like that in Linux? I tried googling, but I can't find which package has `bdiff` in Ubuntu.
AntonioK
(1213 rep)
May 27, 2013, 11:26 AM
• Last activity: Aug 4, 2023, 08:31 AM
9
votes
5
answers
6403
views
how to find offset of one binary file inside another?
I have two binary files.
One is a few hundred kilobytes and the other is a few gigabytes.
I want to know whether the whole, smaller file is contained within the larger one, and if so, what the offset is from the start of the larger file.
I am interested only in exact matches, i.e. whether the whole file is contained within the other.
Both files are binary.
Is there any existing tool/one-liner that does that?
Cyryl Płotnicki
(191 rep)
May 31, 2012, 10:05 AM
• Last activity: Mar 30, 2023, 08:31 PM
1
votes
1
answers
539
views
Transferring very large dataset from cluster to a storage server
We have to move a set of very large data (in petabytes) from an HPC cluster to a storage server. We have a high-capacity communication link between the devices. However, the bottleneck seems to be a fast transfer tool that can be parallelized for individual files (because the individual files are each in the terabytes).
In this regard, I am looking for a tool that does not require admin rights and is still considerably faster than scp or rsync. If there is any tool that can be installed locally without admin rights, that will also be useful. I came across this link (https://unix.stackexchange.com/questions/227951/what-is-the-fastest-way-to-send-massive-amounts-of-data-between-two-computers), which mentions the netcat approach, but we couldn't make it work.
For information, we are trying to copy relatively few files of very large size (and not many, many small files).
Appreciate your time and help.
Ikram Ullah
(113 rep)
Mar 8, 2023, 02:45 PM
• Last activity: Mar 17, 2023, 01:50 PM
66
votes
11
answers
34227
views
Is there a way to modify a file in-place?
I have a fairly large file (35Gb), and I would like to filter this file in situ (i.e. I don't have enough disk space for another file), specifically I want to grep and ignore some patterns — is there a way to do this without using another file?
Let's say I want to filter out all the lines containing `foo:` for example...
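One well-known (and risky) shell idiom for this, quoted as a sketch rather than a recommendation: open the file for reading, unlink it, then write the filtered result to a new file under the same name. The old blocks are only freed when the command finishes, so there still has to be room for the filtered copy, and an interruption loses the data.

```
{ rm file; grep -v foo > file; } < file
```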
Nim
(993 rep)
Apr 11, 2011, 09:53 AM
• Last activity: Dec 22, 2022, 07:20 PM
9
votes
2
answers
2902
views
Is rsync --append able to resume an interrupted copy process without reading all the copied data?
I need to copy one very large file (3TB) on the same machine from one external drive to another. This might take (because of low bandwidth) many days.
So I want to be prepared when I have to interrupt the copying and resume it after, say, a restart.
From what I've read I can use `rsync --append` for this (with rsync version > 3). Two questions about the `--append` flag here:
1. Do I use `rsync --append` for *all* invocations? (For the first invocation, when *no* interrupted copy on the destination drive yet exists, and for the subsequent invocations, when there *is* an interrupted copy at the destination.)
2. Does `rsync --append` resume the copying process on subsequent invocations _without_ reading all the already-copied data? (In other words: does rsync mimic a `dd`-style seek-and-read operation?)
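For what it's worth, a sketch of the invocation under discussion (paths are placeholders). Per the rsync manual, plain `--append` trusts the bytes already present and only transfers the tail, while `--append-verify` additionally re-reads and checksums the existing part:

```
rsync --append --progress /mnt/source/huge.img /mnt/backup/huge.img
```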
halloleo
(649 rep)
Sep 7, 2022, 01:22 PM
• Last activity: Sep 7, 2022, 10:36 PM
1
votes
1
answers
1552
views
"Cannot reallocate" when create file?
I am trying to create an Excel sheet based on multiple files under a directory. I read the files line by line and append to the final Excel sheet.
I tried this shell script on small files and it worked 100%, but when I try it on the needed files (85 MB each) I get this error:
(dsadm@DEVDS) /EDWH/XML/Must # XML.sh csv excel_outputfilename
./XML.sh: line 41: fallocate: command not found
./XML.sh: xmalloc: cannot allocate 172035663 bytes (0 bytes allocated)
./XML.sh: xrealloc: cannot reallocate 86013568 bytes (0 bytes allocated)
./XML.sh: xrealloc: cannot reallocate 86021888 bytes (0 bytes allocated)
Note:
- The `csv` parameter is the file extension
- My OS and version: Unix AIX 7.1
Here's the script:
#!/usr/bin/bash
#Files Extension#
Ext=$1
#OutPut File Name without extension ex: TEST#
OutPutFileName=$2.xls
function XMLHeader ()
{
echo "
"
}
function SheetHeader ()
{
echo "
"
}
function SheetFooter ()
{
echo "
"
}
function XMLFooter ()
{
echo ""
}
####################################################################################
cd /EDWH/Samir/XML/Must;
fallocate -l 1G $OutPutFileName
XMLHeader > $OutPutFileName;
# loop on the exists files to build Worksheet per each file
for Vfile in $(ls | grep .$Ext);
do
echo "" >> $OutPutFileName
### loop to write the Row
VarRow=`cat $Vfile`
for Row in $(echo $VarRow )
do
echo "" >> $OutPutFileName
### loop to write the cells
VarCell=`echo $VarRow`
for Cell in $(echo $VarCell | sed "s/,/ /g")
do
echo "$Cell" >> $OutPutFileName
done
echo "" >> $OutPutFileName
done
echo "" >> $OutPutFileName
done
echo "" >> $OutPutFileName
####################################################################################
exit;
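The xmalloc/xrealloc failures most likely come from loading each 85 MB file into a single shell variable and word-splitting it; a sketch of the usual fix is to stream the file line by line instead (variable names follow the script above):

```
# Instead of VarRow=`cat $Vfile` and the for-loop over its words:
while IFS= read -r Row; do
    printf '%s\n' "$Row" >> "$OutPutFileName"
done < "$Vfile"
```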
Ahmed Samir
(23 rep)
Nov 19, 2015, 09:59 AM
• Last activity: Jul 23, 2022, 06:02 AM
Showing page 1 of 20 total questions