Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
39
votes
4
answers
88654
views
Replace text quickly in very large file
I have a 25 GB text file that needs a string replaced on only a few lines. I can use `sed` successfully, but it takes a really long time to run.

sed -i 's|old text|new text|g' gigantic_file.sql

Is there a quicker way to do this?
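One way to shave some time off, sketched below on the assumption that GNU sed is in use: anchor the substitution to lines that actually match, so sed skips the replacement work on every other line (the file is still rewritten in full because of `-i`).

```
# Only attempt the substitution on lines containing the pattern.
sed -i '/old text/ s/old text/new text/g' gigantic_file.sql
```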
eisaacson
(491 rep)
Jan 14, 2016, 07:14 PM
• Last activity: Jul 4, 2025, 08:19 AM
2
votes
4
answers
4927
views
EXT4 for very large (>1GB) files : increase block size, use block clusters, or both?
I'd like to format a 12 TB HDD (not SSD) **with EXT4**, in order to store large video files (each file being at least 1 GiB in size).
I am working with an x86-64 (a.k.a. x64 or amd64) processor.
There's of course the `-T largefile4` option of `mkfs.ext4`, but are there other optimizations that can be done?
In particular, I wonder:
- Should I increase the block size to its max (64K, `-b 65536`)?
- OR should I use block clusters, and set the cluster size to its max (256M, `-C 268435456`)?
- OR should I do both?
What would be the best parameters in terms of both disk space and performance optimization?
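For reference, a minimal sketch of a bigalloc-style invocation; the device name and the 256 KiB cluster size are placeholders, not a benchmarked recommendation:

```
# Keep the default 4 KiB block size (ext4 block sizes above the CPU page size
# are generally not mountable on x86-64) and use bigalloc clusters instead.
mkfs.ext4 -T largefile4 -O bigalloc -C 262144 -m 0 /dev/sdX1
```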
ChennyStar
(1969 rep)
Jan 12, 2024, 05:56 AM
• Last activity: Dec 15, 2024, 12:45 PM
2
votes
5
answers
1555
views
How to compare huge files with progress information
In a Unix command line context I would like to compare two truly huge files (around 1TB each), preferably with a progress indicator.
I have tried `diff` and `cmp`, and they both crashed the system (macOS Mojave), let alone giving me a progress bar.
What's the best way to compare these very large files?
### Additional Details:
1. I just want to check that they are identical.
2. `cmp` crashed the system in a way that the system restarted by itself. :-( Maybe the system ran out of memory?
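One commonly suggested pattern, assuming `pv` is installed (e.g. via Homebrew) and a shell with process substitution: stream both files through `pv` for the progress bars and let `cmp` do the byte-for-byte comparison. File names are placeholders.

```
# -c keeps the two progress bars from clobbering each other, -N labels them.
cmp <(pv -cN first bigfile1.img) <(pv -cN second bigfile2.img) \
  && echo "files are identical"
```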
halloleo
(649 rep)
Apr 20, 2022, 02:26 AM
• Last activity: Oct 10, 2024, 11:46 PM
0
votes
3
answers
8950
views
How to copy files from linux to windows using winscp from a folder which contains millions of files
I need to copy files from a Linux machine to a Windows machine, where the only port that can be open is SSH (22).
I can connect to the Linux machine using WinSCP, but the problem is that once I try to navigate to the desired folder, WinSCP gets stuck, since the folder contains millions of files.
Basically I don't really care which files I copy, and I would be glad to find a solution which enables me to just copy the latest 200 files.
Any ideas?
I've tried using `ls -f | less` but that did not do the trick.
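A rough sketch of one way to sidestep the listing problem from the Linux side (paths are placeholders, and it assumes file names without newlines): bundle the 200 most recently modified files into one archive and copy just that archive over with WinSCP. `ls -t` still has to stat every entry, so it will take a while, but it runs entirely server-side.

```
# Run on the Linux machine:
cd /path/to/huge_dir
ls -t | head -n 200 | tar -czf /tmp/latest200.tar.gz -T -
```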
Ittai
(101 rep)
Sep 6, 2011, 06:54 AM
• Last activity: Sep 8, 2024, 04:40 PM
247
votes
12
answers
365206
views
How to remove duplicate lines inside a text file?
A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (useless in my case, as the file is a CSV-like data table).
What I need is to remove all the repetitions while (preferably, but this can be sacrificed for a significant performance boost) maintaining the original sequence order. In the result each line is to be unique. If there were 100 equal lines (usually the duplicates are spread across the file and won't be neighbours), only one of the kind is to be left.
I have written a program in Scala (consider it Java if you don't know about Scala) to implement this. But maybe there are faster C-written native tools able to do this?
UPDATE: the `awk '!seen[$0]++' filename` solution seemed to work just fine for me as long as the files were around 2 GiB or smaller, but now that I have to clean up an 8 GiB file it doesn't work any more. It seems to take forever on a Mac with 4 GiB RAM, and a 64-bit Windows 7 PC with 4 GiB RAM and 6 GiB swap just runs out of memory. And I don't feel enthusiastic about trying it on Linux with 4 GiB RAM given this experience.
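For files that no longer fit the awk approach in RAM, one fallback (at the cost of losing the original line order) is GNU `sort`, which spills to temporary files on disk instead of keeping everything in memory:

```
# -u drops duplicates, -S caps the in-memory buffer, -T points at a temp dir
# with enough free space for the intermediate runs.
sort -u -S 2G -T /path/to/tmp big.csv > big.dedup.csv
```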
Ivan
(18358 rep)
Jan 27, 2012, 03:34 PM
• Last activity: Aug 30, 2024, 01:12 AM
2
votes
1
answers
1731
views
How to encrypt an 8TB disk with Veracrypt with hidden partition?
When I try to encrypt an 8 TB USB drive (Seagate Expansion Drive) with Veracrypt to create a hidden Veracrypt volume, I receive this error:
> Error: The hidden volume to be created is larger than 2 TB (2048 GB).
>
> Possible solutions:
>
> - Create a container/partition smaller than 2 TB.
> - Use a drive with 4096-byte sectors to be able to create partition/device-hosted hidden volumes up to 16 TB in size.
I'm new to Veracrypt and Linux. If I understood it correctly, to have an 8 TB hidden partition I need to format the drive so that it'll have *"4096-byte sectors"*. I'm not finding this option in GParted.
Hence, my question is: *How can I, in Linux Mint, format the drive to have 4096-byte sectors, in order for me to install a hidden Veracrypt partition?*
Steps taken to reproduce the problem:
1. Launch Veracrypt and choose "Create a volume within a partition/drive"
2. In "Volume Type" choose "Hidden Veracrypt Volume"
3. When in "Outer Volume Format", click format. The error message I quoted will then be displayed
I'm using the latest version of Veracrypt and Linux Mint.
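Before reformatting anything, it may be worth checking what sector sizes the drive/USB bridge actually reports, since the logical sector size is a property of the device rather than something GParted sets (device name is a placeholder):

```
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX
```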
flen
(161 rep)
Oct 13, 2019, 07:00 PM
• Last activity: Aug 19, 2024, 01:22 PM
16
votes
5
answers
11430
views
How do I read the last lines of a huge log file?
I have a log file that is 55 GB in size.
I tried:
cat logfile.log | tail
But this approach takes a lot of time. Is there any way to read huge files faster or any other approach?
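For a regular file, `tail` seeks to the end on its own, so piping the whole 55 GB through `cat` is the slow part; for example:

```
tail -n 100 logfile.log    # print the last 100 lines without reading the whole file
tail -f logfile.log        # keep following new lines as they are appended
```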
Yi Qiang Ji
(162 rep)
Feb 20, 2024, 03:52 PM
• Last activity: Jul 6, 2024, 01:22 PM
6
votes
4
answers
1392
views
Delete huge directory that causes all commands to hang
How do I delete this large directory?
stat session/
File: ‘session/’
Size: 321540096 Blocks: 628040 IO Block: 4096 directory
Device: 903h/2307d Inode: 11149319 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-09-29 14:34:40.910894275 +0200
Modify: 2022-09-29 14:35:09.598400050 +0200
Change: 2022-09-29 14:35:09.598400050 +0200
Birth: -
**Note that the size of directory (not the content, but the directory entry itself) is over 300MB.**
Number of inodes is over 11 million.
**The directory has no subdirectories, only large number of files.**
None of the usual commands work. I have tried these:
- `rsync -a --delete empty_dir/ session/`
- `rm -rf session`
- `find . -type f --delete`
If I run `ls -f1` inside, it hangs.
If I run `mv -- * ../.tmp_to_delete` inside, it hangs.
If I run `du` inside, it hangs.
At the moment the `rsync --delete` has been running for two days, reading at a rate of up to 7 MB/s, and I see no change in the stat output for the directory.
I assume the large size of the directory is the problem.
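One detail worth checking in the attempts above: GNU find spells its delete action `-delete` (single dash). With it, entries are unlinked as they are read, without building a full listing first; a sketch:

```
find session/ -type f -delete
```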
Bojan Hrnkas
(200 rep)
Sep 29, 2022, 12:56 PM
• Last activity: Jun 18, 2024, 02:11 AM
4
votes
3
answers
3234
views
How to determine size of tar archive without creating it?
I'm archiving a few directories every night to LTO-7 tape with about 100 or so large (2GB) files in each of them.
As a check that the data has been written correctly, I'm verifying that the number of bytes reported written is the same as what should have been written.
I'm first looking at the size of the archive by doing a tar dry-run:
tar -cP --warning=no-file-changed $OLDEST_DIR | wc -c
Then I'm creating the archive with:
tar -cvf /dev/nst0 --warning=no-file-changed --totals $OLDEST_DIR
If the filesizes match, then I delete the original file.
The problem is that the dry-run has to read the entire contents of the files and can take several hours. Ideally, it should use the reported filesizes, apply the necessary padding / aligning, and report back the size rather than thrashing the disk for hours.
Using `du -s` or similar doesn't work because the sizes don't quite match (filesystems treat a directory as 4096 bytes, tar treats it as 0 bytes, for example).
Alternatively, is there a better way of checking that the file has been correctly written? I can't trust tar's return code, since I'm ignoring certain warnings (to handle some sort of bug with tar/mdraid).
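A rough size estimate along these lines, assuming GNU tar defaults (512-byte member headers, content padded to 512 bytes, 10 KiB record rounding) and ignoring the extra extension headers needed for very long path names, could be sketched as:

```
find "$OLDEST_DIR" -type f -printf '%s\n' |
  awk '{ total += 512 + int(($1 + 511) / 512) * 512 }    # header + padded content
       END { total += 1024;                              # end-of-archive blocks
             r = 10240;                                  # default record size
             printf "%d\n", int((total + r - 1) / r) * r }'
```

It only reads metadata, so it finishes in seconds rather than hours, but it is an estimate, not a byte-exact prediction.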
nippoo
(161 rep)
Sep 7, 2016, 02:33 PM
• Last activity: Jun 12, 2024, 02:12 PM
134
votes
14
answers
34711
views
Replace string in a huge (70GB), one line, text file
I have a huge (70GB), **one line**, text file and I want to replace a string (token) in it.
I want to replace the token ` `, with another dummy token (glove issue).
I tried `sed`:

sed 's///g' corpus.txt.new

but the output file `corpus.txt.new` has zero bytes!
I also tried using perl:

perl -pe 's///g' corpus.txt.new

but I got an out-of-memory error.
For smaller files, both of the above commands work.
How can I replace a string in such a file?
This is a related question, but none of the answers worked for me.
**Edit**: What about splitting the file into chunks of 10 GB (or whatever) each, applying `sed` on each one of them, and then merging them with `cat`? Does that make sense? Is there a more elegant solution?
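If the tokens in the corpus are separated by spaces (an assumption; `TOKEN` and `DUMMY` below are placeholders for the real strings), one way to avoid ever holding the 70 GB line in memory is to make the record separator a space, so awk processes one token at a time:

```
# Each space-separated token becomes its own record, so memory use stays tiny.
awk 'BEGIN { RS = ORS = " " } $0 == "TOKEN" { $0 = "DUMMY" } 1' corpus.txt > corpus.txt.new
```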
Christos Baziotis
(1467 rep)
Dec 29, 2017, 02:58 PM
• Last activity: Apr 7, 2024, 02:06 PM
2
votes
1
answers
226
views
Use GNU parallel with very long lines
I have a very large SQL dumpfile (30GB) that I need to edit (do some find/replace) before loading back into the database.
Besides having a large size, the file also contains very long lines. Except for the first 40 and last 12 lines, all other lines have lengths of ~1 MB. These lines are all INSERT INTO commands that all look alike:
cat bigdumpfile.sql | cut -c-100

INSERT INTO `table1` VALUES (951068,1407592,0.0267,0.0509,0.121),(285
INSERT INTO `table1` VALUES (238317,1407664,0.008,0.0063,0.1286),(241
INSERT INTO `table1` VALUES (938922,1407739,0.0053,0.0024,0.031),(226
INSERT INTO `table1` VALUES (44678,1407886,0.0028,0.0028,0.0333),(234
INSERT INTO `table1` VALUES (910412,1407961,0.001,0.0014,0),(911017,1
INSERT INTO `table1` VALUES (903890,1408050,0.0066,0.01,0.0287),(9095
INSERT INTO `table1` VALUES (257090,1408136,0.0023,0.0037,0.0196),(56
INSERT INTO `table1` VALUES (593367,1408237,0.0066,0.0117,0.0286),(95
INSERT INTO `table1` VALUES (870488,1408339,0.0131,0.009,0.0135),(870
INSERT INTO `table1` VALUES (282798,1408414,0.0015,0.014,0.014),(2830
...
Parallel ends with an error on long lines:
parallel -a bigdumpfile.sql -k sed -i.bak 's/table1/newtable/'
parallel: Error: Command line too long (1018952 >= 63543) at input 0: INSERT INTO `table1...
Because all lines are similar, and I only need the find/replace to happen at the beginning of the line, I've followed the advice [in this similar question here](https://unix.stackexchange.com/questions/642939/use-gnu-parallel-when-file-has-a-single-long-line) with a nice suggestion to use `--recstart` and `--recend`. However, these are not working:

parallel -a bigdumpfile.sql -k --recstart 'INSERT' --recend 'VALUES' sed -i.bak 's/table/newtable/'
parallel: Error: Command line too long (1018952 >= 63543) at input 0: INSERT INTO `table1...

I tried a number of variations using `--block` but could not get it working. I am a GNU parallel newbie, and I'm doing something wrong or just missing something obvious. Any help appreciated. Thanks!
This is using GNU parallel 20240122.
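A sketch of the usual workaround: with `--pipepart` the records are fed to `sed` on its standard input instead of being appended to the command line, so the ~1 MB lines never hit the argument-length limit. Because the data arrives on stdin, `-i.bak` no longer applies and the result is redirected to a new file instead:

```
parallel -a bigdumpfile.sql --pipepart --block 100M -k \
  "sed 's/table1/newtable/'" > newdumpfile.sql
```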
fernan
(23 rep)
Feb 24, 2024, 03:25 PM
• Last activity: Feb 27, 2024, 07:21 AM
11
votes
3
answers
11842
views
Basic sed command on large one-line file: couldn't re-allocate memory
I have a 250 MB text file, all in one line.
In this file I want to replace `a` characters with `b` characters:

sed -e "s/a/b/g" < one-line-250-mb.txt

It fails with:

sed: couldn't re-allocate memory

It seems to me that this kind of task could be performed inline without allocating much memory.
Is there a better tool for the job, or a better way to use `sed`?
---
GNU sed version 4.2.1
Ubuntu 12.04.2 LTS
1 GB RAM
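For a plain single-character substitution like this, `tr` works as a pure stream filter and never needs the whole line in memory:

```
tr 'a' 'b' < one-line-250-mb.txt > out.txt
```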
Nicolas Raoul
(8465 rep)
Dec 19, 2013, 03:31 AM
• Last activity: Dec 2, 2023, 02:44 PM
53
votes
4
answers
81815
views
Diffing two big text files
I have two big files (6GB each). They are unsorted, with linefeeds (`\n`) as separators. How can I diff them? It should take under 24h.
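One approach that usually fits comfortably under 24 hours: sort both files on disk first, then compare the sorted copies, e.g. with `comm` (which requires sorted input). File names, buffer size and temp directory below are placeholders.

```
sort -S 2G -T /tmp big1.txt > big1.sorted
sort -S 2G -T /tmp big2.txt > big2.sorted
comm -3 big1.sorted big2.sorted > lines-that-differ.txt   # lines unique to either file
```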
Jonas Lejon
(719 rep)
Sep 16, 2010, 10:50 AM
• Last activity: Aug 31, 2023, 06:28 AM
198
votes
8
answers
378493
views
cat line X to line Y on a huge file
Say I have a huge text file (>2GB) and I just want to `cat` the lines `X` to `Y` (e.g. 57890000 to 57890010).
From what I understand I can do this by piping `head` into `tail` or vice versa, i.e.

head -A /path/to/file | tail -B

or alternatively

tail -C /path/to/file | head -D

where `A`, `B`, `C` and `D` can be computed from the number of lines in the file, `X` and `Y`.
But there are two problems with this approach:
1. You have to compute `A`, `B`, `C` and `D`.
2. The commands could `pipe` to each other **many more** lines than I am interested in reading (e.g. if I am reading just a few lines in the middle of a huge file).
Is there a way to have the shell just work with and output the lines I want? (while providing only `X` and `Y`)?
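For reference, `sed` can print just the requested range and quit as soon as it passes line `Y`, so nothing beyond that point is read; for the example range:

```
sed -n '57890000,57890010p; 57890010q' /path/to/file
```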
Amelio Vazquez-Reina
(42851 rep)
Sep 6, 2012, 10:38 PM
• Last activity: Aug 6, 2023, 09:33 AM
9
votes
1
answers
7593
views
Is there bdiff (1) in Linux?
There is a `bdiff(1)` command in Solaris, which allows you to `diff(1)` files bigger than your RAM size (documentation).
Is there something like that in Linux? I tried googling, but I can't find which package has `bdiff` in Ubuntu.
AntonioK
(1213 rep)
May 27, 2013, 11:26 AM
• Last activity: Aug 4, 2023, 08:31 AM
9
votes
5
answers
6403
views
how to find offset of one binary file inside another?
I have two binary files.
One is a few hundred kilobytes and the other is a few gigabytes.
I want to know whether the whole, smaller file is contained within the larger one, and if so, what the offset is from the start of the larger file.
I am interested only in exact matches, i.e. whether the whole file is contained within the other.
Both files are binary.
Is there any existing tool/one-liner that does that?
Cyryl Płotnicki
(191 rep)
May 31, 2012, 10:05 AM
• Last activity: Mar 30, 2023, 08:31 PM
1
votes
1
answers
539
views
Transferring very large dataset from cluster to a storage server
We have to move a set of very large data (in petabytes) from an HPC cluster to a storage server. We have a high-capacity communication link between the devices. However, the bottleneck seems to be a fast transfer tool that can be parallelized for individual files (because the individual files are each in the terabytes).
In this regard, I am looking for a tool that does not require admin rights and is still considerably faster than scp or rsync. If there is any tool that can be installed locally without admin rights, that will also be useful. I came across this link (https://unix.stackexchange.com/questions/227951/what-is-the-fastest-way-to-send-massive-amounts-of-data-between-two-computers), which mentions the netcat approach, but we couldn't make it work.
For information, we are trying to copy relatively few files of very large size (and not many, many small files).
Appreciate your time and help.
Ikram Ullah
(113 rep)
Mar 8, 2023, 02:45 PM
• Last activity: Mar 17, 2023, 01:50 PM
66
votes
11
answers
34227
views
Is there a way to modify a file in-place?
I have a fairly large file (35Gb), and I would like to filter this file in situ (i.e. I don't have enough disk space for another file), specifically I want to grep and ignore some patterns — is there a way to do this without using another file?
Let's say I want to filter out all the lines containing `foo:` for example...
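One well-known (and risky) shell idiom for this, quoted as a sketch rather than a recommendation: open the file for reading, unlink it, then write the filtered result to a new file under the same name. The old blocks are only freed when the command finishes, so there still has to be room for the filtered copy, and an interruption loses the data.

```
{ rm file; grep -v foo > file; } < file
```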
Nim
(993 rep)
Apr 11, 2011, 09:53 AM
• Last activity: Dec 22, 2022, 07:20 PM
9
votes
2
answers
2902
views
Is rsync --append able to resume an interrupted copy process without reading all the copied data?
I need to copy one very large file (3TB) on the same machine from one external drive to another. This might take (because of low bandwidth) many days.
So I want to be prepared when I have to interrupt the copying and resume it after, say, a restart.
From what I've read I can use `rsync --append` for this (with rsync version > 3). Two questions about the `--append` flag here:
1. Do I use `rsync --append` for *all* invocations? (For the first invocation, when *no* interrupted copy on the destination drive yet exists, and for the subsequent invocations, when there *is* an interrupted copy at the destination.)
2. Does `rsync --append` resume the copying process on subsequent invocations _without_ reading all the already-copied data? (In other words: does rsync mimic a `dd`-style seek-and-read operation?)
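For what it's worth, a sketch of the invocation under discussion (paths are placeholders). Per the rsync manual, plain `--append` trusts the bytes already present and only transfers the tail, while `--append-verify` additionally re-reads and checksums the existing part:

```
rsync --append --progress /mnt/source/huge.img /mnt/backup/huge.img
```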
halloleo
(649 rep)
Sep 7, 2022, 01:22 PM
• Last activity: Sep 7, 2022, 10:36 PM
1
votes
1
answers
1552
views
"Cannot reallocate" when create file?
I am trying to create an Excel sheet based on multiple files under a directory. I read the files line by line and append to the final Excel sheet.
I tried this shell script on small files and it worked 100%, but when I try it on the needed files (85 MB each) I get this error:
(dsadm@DEVDS) /EDWH/XML/Must # XML.sh csv excel_outputfilename
./XML.sh: line 41: fallocate: command not found
./XML.sh: xmalloc: cannot allocate 172035663 bytes (0 bytes allocated)
./XML.sh: xrealloc: cannot reallocate 86013568 bytes (0 bytes allocated)
./XML.sh: xrealloc: cannot reallocate 86021888 bytes (0 bytes allocated)
Note:
- The `csv` parameter is the file extension
- My OS and version: Unix AIX 7.1
Here's the script:
#!/usr/bin/bash
#Files Extension#
Ext=$1
#OutPut File Name without extension ex: TEST#
OutPutFileName=$2.xls
function XMLHeader ()
{
echo "
"
}
function SheetHeader ()
{
echo "
"
}
function SheetFooter ()
{
echo "
"
}
function XMLFooter ()
{
echo ""
}
####################################################################################
cd /EDWH/Samir/XML/Must;
fallocate -l 1G $OutPutFileName
XMLHeader > $OutPutFileName;
# loop on the exists files to build Worksheet per each file
for Vfile in $(ls | grep .$Ext);
do
echo "" >> $OutPutFileName
### loop to write the Row
VarRow=`cat $Vfile`
for Row in $(echo $VarRow )
do
echo "" >> $OutPutFileName
### loop to write the cells
VarCell=`echo $VarRow`
for Cell in $(echo $VarCell | sed "s/,/ /g")
do
echo "$Cell" >> $OutPutFileName
done
echo "" >> $OutPutFileName
done
echo "" >> $OutPutFileName
done
echo "" >> $OutPutFileName
####################################################################################
exit;
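The xmalloc/xrealloc failures most likely come from loading each 85 MB file into a single shell variable and word-splitting it; a sketch of the usual fix is to stream the file line by line instead (variable names follow the script above):

```
# Instead of VarRow=`cat $Vfile` and the for-loop over its words:
while IFS= read -r Row; do
    printf '%s\n' "$Row" >> "$OutPutFileName"
done < "$Vfile"
```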
Ahmed Samir
(23 rep)
Nov 19, 2015, 09:59 AM
• Last activity: Jul 23, 2022, 06:02 AM
Showing page 1 of 20 total questions