Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
22
votes
5
answers
44095
views
Compress a large number of large files fast
I have about 200 GB of log data generated daily, distributed among about 150 different log files.
I have a script that moves the files to a temporary location and does a tar-bz2 on the temporary directory.
I get good results as 200 GB logs are compressed to about 12-15 GB.
The problem is that it takes forever to compress the files. The cron job runs at 2:30 AM daily and continues to run till 5:00-6:00 PM.
Is there a way to improve the speed of the compression and complete the job faster? Any ideas?
Don't worry about other processes; the location where the compression happens is on a NAS, and I can mount the NAS on a dedicated VM and run the compression script from there.
Here is the output of top for reference:
top - 15:53:50 up 1093 days, 6:36, 1 user, load average: 1.00, 1.05, 1.07
Tasks: 101 total, 3 running, 98 sleeping, 0 stopped, 0 zombie
Cpu(s): 25.1%us, 0.7%sy, 0.0%ni, 74.1%id, 0.0%wa, 0.0%hi, 0.1%si, 0.1%st
Mem: 8388608k total, 8334844k used, 53764k free, 9800k buffers
Swap: 12550136k total, 488k used, 12549648k free, 4936168k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7086 appmon 18 0 13256 7880 440 R 96.7 0.1 791:16.83 bzip2
7085 appmon 18 0 19452 1148 856 S 0.0 0.0 1:45.41 tar cjvf /nwk_storelogs/compressed_logs/compressed_logs_2016_30_04.tar.bz2 /nwk_storelogs/temp/ASPEN-GC-32459:nkp-aspn-1014.log /nwk_stor
30756 appmon 15 0 85952 1944 1000 S 0.0 0.0 0:00.00 sshd: appmon@pts/0
30757 appmon 15 0 64884 1816 1032 S 0.0 0.0 0:00.01 -tcsh
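The top output shows a single bzip2 process pinned near 100% of one core while the machine sits about 74% idle, so the job is single-threaded. A hedged sketch of one remedy (pbzip2 and the thread count are assumptions; the paths reuse the question's):

```
# Compress on all cores instead of one; pbzip2 emits ordinary .bz2 output.
tar -C /nwk_storelogs/temp -cf - . | pbzip2 -p8 -c > /nwk_storelogs/compressed_logs/compressed_logs.tar.bz2

# Equivalent, using GNU tar's --use-compress-program (-I) option:
tar -I pbzip2 -cf /nwk_storelogs/compressed_logs/compressed_logs.tar.bz2 -C /nwk_storelogs/temp .
```

Trading some ratio for speed with a faster codec (pigz for gzip, or zstd) is the other common lever.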
anu
(362 rep)
May 4, 2016, 11:00 PM
• Last activity: Jun 17, 2025, 09:12 AM
1
votes
1
answers
102
views
Parallel processing of single huge .bz2 or .gz file
I would like to use GNU Parallel to process a huge .gz or .bz2 file.
I know I can do:
bzcat huge.bz2 | parallel --pipe ...
But it would be nice if there was a way, similar to `--pipe-part`, that could read multiple parts of the file in parallel. One option is to decompress the file:
bzcat huge.bz2 > huge
parallel --pipe-part -a huge ...
but huge.bz2 is huge, and I would much rather decompress it multiple times than store it uncompressed.
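For reference, a crude way to approximate that behaviour today (a sketch; the slice count is an assumption, slices cut mid-line so real use would need to realign on record boundaries, and `wc -l` stands in for the actual processing). It pays the decompression cost once per slice, as the question anticipates:

```
size=$(bzcat huge.bz2 | wc -c)   # one full pass to learn the decompressed size
n=4                              # number of parallel slices (assumption)
per=$(( (size + n - 1) / n ))
for i in $(seq 0 $((n - 1))); do
    # each job re-decompresses the whole stream and keeps only its byte range
    bzcat huge.bz2 | tail -c +$(( i * per + 1 )) | head -c "$per" | wc -l &
done
wait
```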
Ole Tange
(37348 rep)
Mar 28, 2025, 11:58 AM
• Last activity: Mar 29, 2025, 10:33 AM
247
votes
4
answers
99373
views
Why are tar archive formats switching to xz compression to replace bzip2 and what about gzip?
More and more tar archives use the xz format, based on LZMA2, for compression instead of the traditional bzip2 (bz2) compression. In fact, *kernel.org* made a late "*Good-bye bzip2*" announcement on 27th Dec. 2013, indicating kernel sources would from this point on be released in both tar.gz and tar.xz format - and on the main page of the website, what's directly offered is the tar.xz.
Are there any specific reasons explaining why this is happening, and what is the relevance of gzip in this context?
user44370
Jan 6, 2014, 06:39 PM
• Last activity: Jan 21, 2025, 02:31 PM
0
votes
1
answers
65
views
Is it possible to compress a tar ball with gzip/bzip2/xz after tar ball file has been created?
If we create a tar ball file by giving the following command
tar -cvf Docs.tar $HOME/Documents/*
then, after the tar ball has been created, is it possible to use gzip or bzip2 or xz or some other compression utility to compress the tar file?
I know that we can pass the option `--bzip2`, `--xz`, or `--gzip` along with `-cvf` while creating the tar, but what if that was not done, and the compression is to be applied only after the tar has been created? Is it possible? If yes, then how?
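It is possible: tar's `--gzip`/`--bzip2`/`--xz` options are only front-ends for the standalone compressors, so they can be run on the finished archive:

```
gzip  Docs.tar    # produces Docs.tar.gz
# or
bzip2 Docs.tar    # produces Docs.tar.bz2
# or
xz    Docs.tar    # produces Docs.tar.xz
```

Each command replaces Docs.tar with the compressed file; add -k to keep the uncompressed original as well.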
KDM
(116 rep)
Oct 2, 2024, 01:50 PM
• Last activity: Oct 2, 2024, 02:15 PM
0
votes
0
answers
140
views
vmlinuz to vmlinux ERROR
$ file vmlinuz
vmlinuz: Linux kernel x86 boot executable bzImage, version 4.14.244 (root@d0ea4514eda5) #1 SMP Thu Aug 31 01:23:02 PDT 2023, RO-rootFS, swap_dev 0x3, Normal VGA
I tried to use extract_vmlinux and vmlinux-to-elf to extract vmlinux from vmlinuz, but they report the following errors, respectively:
$ vmlinux-to-elf vmlinuz vmlinux
Traceback (most recent call last):
File "/usr/local/bin/vmlinux-to-elf", line 63, in
ElfSymbolizer(
File "/usr/local/lib/python3.8/dist-packages/vmlinux_to_elf/elf_symbolizer.py", line 44, in __init__
kallsyms_finder = KallsymsFinder(file_contents, bit_size)
File "/usr/local/lib/python3.8/dist-packages/vmlinux_to_elf/kallsyms_finder.py", line 177, in __init__
self.find_linux_kernel_version()
File "/usr/local/lib/python3.8/dist-packages/vmlinux_to_elf/kallsyms_finder.py", line 225, in find_linux_kernel_version
raise ValueError('No version string found in this kernel')
ValueError: No version string found in this kernel
$ ./extract_vmlinux vmlinuz > vmlinux
extract_vmlinux: Cannot find vmlinux.
Then I tried manual extraction:
$ od -A d -t x1 vmlinuz | grep 'fd 37 7a 58 5a 00'
3254032 fd 37 7a 58 5a 00 44 65 73 74 69 6e 61 74 69 6f
$ dd if=vmlinuz of=vmlinuz_unxz bs=1 skip=3254032
116928+0 records in
116928+0 records out
116928 bytes (117 kB, 114 KiB) copied, 1.10249 s, 106 kB/s
$ xz -d vmlinuz_unxz
xz: vmlinuz_unxz: Compressed data is corrupt
What went wrong? Any suggestions to extract vmlinux?
Thank you!
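One hedged avenue (binwalk is an assumption, not something the question tried): the grep hit above is immediately followed by ASCII text ("Destinatio..."), so it may be a false positive rather than a real xz stream header. A signature scanner that validates what it finds can help:

```
binwalk vmlinuz      # list embedded compression signatures with their offsets
binwalk -e vmlinuz   # carve candidates into _vmlinuz.extracted/ for inspection
```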
pipik
(1 rep)
Aug 16, 2024, 02:14 AM
• Last activity: Aug 16, 2024, 02:31 AM
1
votes
1
answers
1932
views
How to extract all uncorrupted file from bzip2 compression?
I am trying to uncompress a bzip2 file (~55 GB) with the command
tar -jxvf file.tar.bz2
However, I found that the decompression process gets stuck at a certain file and, after waiting a long duration, gives the error message shown below without decompressing the other files.
bzip2: Compressed file ends unexpectedly;
perhaps it is corrupted? *Possible* reason follows.
bzip2: Inappropriate ioctl for device
Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
The last file where decompression gets stuck happens to be a tar file. Is it possible to skip this tar file and continue extracting the other files, given that I'm not interested in it?
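A sketch following the hint in the error text itself (bzip2recover ships with the bzip2 package; --ignore-zeros is GNU tar):

```
# split the damaged archive into independent per-block .bz2 pieces
bzip2recover file.tar.bz2            # writes rec00001file.tar.bz2, ...
# decompress the surviving pieces as one stream and let tar read past gaps
cat rec*file.tar.bz2 | bzip2 -dc | tar -xv --ignore-zeros
```

Files whose data fell inside the damaged blocks will still be lost or truncated, but everything in intact blocks should extract.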
Raghvender
(11 rep)
Aug 15, 2022, 02:38 PM
• Last activity: Aug 15, 2022, 03:47 PM
1
votes
0
answers
725
views
Compressing small files with gzip makes its size smaller than with bzip2, why?
I have a question about a little thing that caught my eye regarding compression with gzip and bzip2.
If I understood correctly, bzip2 requires more processing power but compresses files smaller and more efficiently than gzip.
When I tried to compress a 6 MB file with bzip2, its size got smaller than with gzip, as I expected.
But when I tried to compress a file with a size of 5 bytes with bzip2, its size got larger than with gzip.
bzip2 -k 5_bytes_file
du -b 5_bytes_file.bz2
result: 42 bytes
gzip -k 5_bytes_file
du -b 5_bytes_file.gz
result: 38 bytes
Why is this happening? Am I doing something wrong?
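Nothing is wrong: both formats pay a fixed header/trailer cost that dominates tiny inputs, and bzip2's framing is larger (its block-sorting also needs a reasonable amount of data before it wins). A quick way to see the fixed overhead (file names here are illustrative):

```
printf 'hello' > tiny            # 5-byte input
bzip2 -k tiny && gzip -k tiny    # -k keeps the original
stat -c '%n %s' tiny tiny.bz2 tiny.gz   # compare on-disk sizes
```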
Karuch
(11 rep)
Aug 9, 2022, 09:27 AM
• Last activity: Aug 9, 2022, 09:27 AM
2
votes
1
answers
3155
views
php build error: please reinstall BZip2 distribution
I tried to build PHP v8.0.0 from its source, but after running `./configure` it says:
...
checking for BZip2... not found
configure: error: please reinstall BZip2 distribution
But I have `bzip2` installed already. How do I fix that?
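configure is looking for bzip2's development headers (bzlib.h), which most distributions package separately from the bzip2 binary. A likely fix (package names vary by distro):

```
sudo apt-get install libbz2-dev    # Debian/Ubuntu
sudo dnf install bzip2-devel       # Fedora/RHEL
```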
Ar Rakin
(189 rep)
Jul 17, 2021, 05:28 AM
• Last activity: Mar 5, 2022, 04:56 PM
0
votes
2
answers
397
views
using bzip gzip zip in bash
#!/bin/bash
# check if the user supplied a file name
if [ $# -gt 0 ]; then
    # check if the file exists in the current directory
    if [ -f "$1" ]; then
        # check if the file is readable in the current directory
        if [ -r "$1" ]; then
            echo "File:$1"
            echo "$(wc -c <"$1")"
            # Note the following two lines
            comp=$(bzip2 -k $1)
            echo "$(wc -c <"$comp")"
        else
            echo "$1 is unreadable"
            exit
        fi
    else
        echo "$1 does not exist"
        exit
    fi
fi
Currently my problem is that I can compress the `$1` file into a `$1.bz2` file with bzip2, but when I try to capture the size of the compressed file, my code says no such file exists.
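A minimal fix, sketched: `bzip2 -k` writes its result to `$1.bz2` and prints nothing to stdout, so `comp=$(bzip2 -k $1)` captures an empty string. Build the output name instead of capturing output:

```
bzip2 -k -- "$1"          # creates $1.bz2, keeps $1
comp="$1.bz2"
echo "$(wc -c <"$comp")"
```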
alan
(1 rep)
Sep 17, 2021, 12:06 AM
• Last activity: Sep 21, 2021, 07:11 AM
0
votes
0
answers
249
views
Compression vs. redundancy: do they cancel each other out?
Does it make sense to compress a tarball (or any kind of file, really), e.g., using `gzip` or `bzip2`, while at the same time creating redundancy files for it, e.g., a `par2` file?
The context is that I am reasoning about how to best backup my personal files.
My #1 priority is to avoid data loss due to bitrot; hence, the par2 files.
Compression would be nice but is not my main concern.
The reason I doubt the combined application of a compression algorithm and an erasure-code algorithm is that the former works by eliminating redundancy (thereby creating smaller files), while the latter works by adding redundancy (thereby adding the capability to perform data recovery operations).
Don't they cancel each other out?
Is this a reasonable assumption or am I missing something here?
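They don't cancel out: compression removes the *uncontrolled* redundancy inside the data, and par2 then adds back a small amount of *structured* redundancy designed for recovery, so the two compose. A sketch of the combined pipeline (file names are illustrative; par2cmdline syntax):

```
tar -czf backup.tar.gz ~/Documents
par2 create -r10 backup.tar.gz    # add ~10% recovery data
par2 verify backup.tar.gz.par2    # later: detect bitrot
par2 repair backup.tar.gz.par2    # and repair it if found
```

If anything, compressed data makes the par2 layer more valuable, since a single flipped bit can corrupt everything after it in the stream.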
pygumby
(111 rep)
May 23, 2021, 10:37 PM
• Last activity: May 23, 2021, 11:53 PM
1
votes
2
answers
1979
views
Check tar file for errors
Is there any way to see if there is a problem with the `.tar.bz2` file? As you can see, I can get a list of files, but neither `xjvf` nor `xzvf` works in this case.
$ tar tf pytorch.20210702.tar.bz2 | head -n 5
pytorch/
pytorch/BUILD.bazel
pytorch/requirements-flake8.txt
pytorch/NOTICE
pytorch/WORKSPACE
$ tar xjvf pytorch.20210702.tar.bz2
bzip2: (stdin) is not a bzip2 file.
tar: Child returned status 2
tar: Error is not recoverable: exiting now
$ tar xzvf pytorch.20210702.tar.bz2
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
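Since GNU tar auto-detects compression when reading, `tar tf` succeeding while both explicit decompressors fail suggests the file is not actually compressed, despite its name. A first diagnostic step (standard tools only):

```
file pytorch.20210702.tar.bz2      # report what the data really is
# if it turns out to be a plain, uncompressed tar:
tar -tvf pytorch.20210702.tar.bz2 > /dev/null && echo "tar readable to the end"
```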
mahmood
(1271 rep)
Mar 17, 2021, 09:30 AM
• Last activity: Mar 17, 2021, 10:00 AM
14
votes
3
answers
37329
views
bunzip2 to a different directory
Say I have a file `foo.tbz2` in a directory. I want to extract the `tar` file from the archive, but to a different directory. It seems like `bunzip2` will only extract the archive to the same directory as the archive.
This works, but I'm wondering if there is a better way:
cd /another/directory
bunzip2 -k /original/directory/foo.tbz2
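Two common alternatives (both standard options):

```
# decompress to stdout and choose the destination explicitly
bunzip2 -c /original/directory/foo.tbz2 > /another/directory/foo.tar
# or skip the intermediate tar file entirely and extract in one step
tar -xjf /original/directory/foo.tbz2 -C /another/directory
```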
longneck
(430 rep)
Aug 15, 2012, 08:11 PM
• Last activity: May 28, 2020, 09:37 AM
13
votes
2
answers
20136
views
How to check/test .tar.bz archives?
I've been using tar with its "--use-compress-prog=pbzip2" option to archive my files and compress them with pbzip2, producing an "*.tar.bz" archive.
Afterwards I checked the resulting file with pbzip2's "-t" switch, and it passed the test. However, to my great surprise, I got "file incomplete" or other integrity errors when trying to extract the archive!
Is it because there might be something wrong with the tar file, but not when it was compressed by pbzip2? If so, is there a way to check the tar file itself? If not, what other problem might this be? Also, are there ways to recover data from tar files with errors?
I am afraid I might have already lost some important data through this process...
The point is, I would like to know a method to test the integrity of my archives after they are created.
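A sketch of a post-creation check that exercises both layers (the archive name is illustrative): verify the bzip2 stream, walk the tar structure to its end, and optionally compare the contents against the source tree:

```
pbzip2 -t archive.tar.bz                                   # compression layer
pbzip2 -dc archive.tar.bz | tar -t > /dev/null && echo OK  # tar structure
tar --diff --use-compress-prog=pbzip2 -f archive.tar.bz    # contents vs. disk
```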
hpy
(4597 rep)
Apr 19, 2012, 02:19 PM
• Last activity: May 12, 2020, 02:54 AM
2
votes
1
answers
9260
views
tar (child): : Cannot open: Is a directory
I know that's a pretty dumb question, but I didn't find this precise question on the internet.
I tried to `tar -cvjf` all the contents of a directory (`/*`) and directly redirect that to a file (`> file`), but the error message in the title occurs. I am compressing both files and directories here.
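The likely cause (inferred from the error text): with `f` in the option cluster, tar takes the next argument as the archive name, so the first directory matched by the glob is being treated as the output file; no shell redirect is needed. A sketch of the corrected invocation:

```
# wrong: the first expansion of /* becomes the "archive" argument
#   tar -cvjf /* > file
# right: name the archive immediately after the options
tar -cvjf file.tar.bz2 /*
```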
wxi
(189 rep)
Mar 24, 2020, 12:08 PM
• Last activity: Mar 24, 2020, 12:42 PM
8
votes
1
answers
2167
views
Is there a compression tool with an arbitrarily large dictionary?
I am looking for a compression tool with an arbitrarily large dictionary (and "block size"). Let me explain by way of examples.
First let us create 32 MB of random data and then concatenate it to itself to make a file of twice the length, 64 MB.
head -c32M /dev/urandom > test32.bin
cat test32.bin test32.bin > test64.bin
Of course test32.bin is not compressible because it is random, but the first half of test64.bin is the same as the second half, so it should be compressible by roughly 50%.
First let's try some standard tools. test64.bin is of size exactly 67108864.
- gzip -9. Compressed size 67119133.
- bzip2 -9. Compressed size 67409123. (A really big overhead!)
- xz -7. Compressed size 67112252.
- xz -8. Compressed size 33561724.
- zstd --ultra -22. Compressed size 33558039.
We learn from this that gzip and bzip2 can never compress this file. However, with a big enough dictionary, xz and zstd can compress the file, and in that case zstd does the best job.
However, now try:
head -c150M /dev/urandom > test150.bin
cat test150.bin test150.bin > test300.bin
test300.bin is of size exactly 314572800. Let's try the best compression algorithms again at their highest settings.
- xz -9. Compressed size 314588440
- zstd --ultra -22. Compressed size 314580017
In this case neither tool can compress the file.
> Is there a tool that has an arbitrarily large dictionary size so it
> can compress a file such as test300.bin?
---
Thanks to the comment and answer, it turns out both zstd and xz can do it. You need zstd version 1.4.x, however.
- zstd --long=28. Compressed size 157306814
- xz -9 --lzma2=dict=150MiB. Compressed size 157317764.
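For reference, the two resolutions as complete commands (assumptions: zstd ≥ 1.4 for long-range matching, and xz's `preset=9,dict=...` syntax to enlarge only the dictionary while keeping level-9 settings):

```
zstd --long=28 test300.bin                       # -> test300.bin.zst
zstd -d --long=28 test300.bin.zst                # the large window must be allowed again on decompression
xz -k --lzma2=preset=9,dict=150MiB test300.bin   # -> test300.bin.xz
```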
Simd
(371 rep)
Jan 18, 2020, 09:50 PM
• Last activity: Jan 19, 2020, 06:11 PM
7
votes
2
answers
6719
views
bzip2: Check file's decompressed size without actually decompressing it
I have a big `bzip2`-compressed file and I need to check its decompressed size without actually decompressing it (similar to `gzip -l file.gz` or `xz -l file.xz`). How can this be done using `bzip2`?
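Unlike gzip and xz, the bzip2 format stores no uncompressed-size field, so some decompression work is unavoidable; streaming into `wc` at least avoids writing the output to disk:

```
bzcat file.bz2 | wc -c    # decompressed size in bytes, nothing written out
```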
manifestor
(2563 rep)
Oct 12, 2019, 11:22 AM
• Last activity: Oct 12, 2019, 04:34 PM
2
votes
3
answers
990
views
Difference between .bz2 and .tar.bz2 files
I am supposed to find out whether a file is a .bz2 or a .tar.bz2 (without using the file's extension) and decompress it accordingly. I used the `file` command, but it gives the same result for both .bz2 and .tar.bz2. Please suggest a way to distinguish .bz2 from .tar.bz2 files.
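A sketch: `file` only reports the outer bzip2 layer, so peek at the decompressed stream instead, reading just enough for `file` to classify it:

```
if bzcat archive.bz2 2>/dev/null | head -c 512 | file - | grep -q 'tar archive'
then
    echo "bzip2-compressed tar (.tar.bz2)"
else
    echo "plain bzip2 (.bz2)"
fi
```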
charan priyatham
(83 rep)
Sep 15, 2019, 09:17 PM
• Last activity: Sep 16, 2019, 09:07 AM
0
votes
2
answers
236
views
Copying files, verifying and then zipping with a shell script
I am looking to create a script (a Linux shell or Python script) that can do the following things for me, including verifying the files copied from a folder.
I have two folders:
FolderA has 300 .xls files. This folder is missing some files which are currently in FolderB.
FolderB has 500 .xls files.
I want to copy a select 100 files from FolderB to FolderA. Then I want the script to verify that all the files now residing in FolderA (400 after copying the 100 files from B) also exist in FolderB.
Then I want the script to compress each of these files separately into its own bzip2 file. Basically, there will be 400 bzip2 files (one per Excel file) in the end when the process is completed.
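A rough shell sketch under assumed names (FolderA, FolderB, and a list file to_copy.txt naming the selected files are all assumptions):

```
#!/bin/bash
# copy the selected files, never overwriting what FolderA already has
xargs -a to_copy.txt -I{} cp -n "FolderB/{}" FolderA/

# verify: every file now in FolderA must match its counterpart in FolderB
for f in FolderA/*.xls; do
    cmp -s "$f" "FolderB/$(basename "$f")" || echo "MISMATCH OR MISSING: $f"
done

# compress each file into its own .bz2, keeping the original (-k)
bzip2 -k FolderA/*.xls
```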
mywayz
(61 rep)
Jul 18, 2019, 08:22 PM
• Last activity: Aug 6, 2019, 01:31 PM
8
votes
1
answers
1017
views
Can files compressed with bzip2 be relied upon to be deterministic (reproducible)?
I am trying to determine if there are any potential issues using `bzip2` to compress files that need to be 100% reproducible. Specifically: can metadata (name/inode, lastmod date, etc.) or anything else cause identical file contents to **produce a different checksum** on the resulting `.bz2` archive?
As an example, gzip is not deterministic by default unless `-n` is used.
My crude tests so far suggest that bzip2 does indeed consistently produce identical files given identical input data (regardless of metadata, platform, filesystem, etc), but it would be nice to have more than anecdotal evidence.
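A small experiment supporting this: the bzip2 stream format, unlike gzip's, has no fields for a file name or timestamp, so there is nothing metadata-shaped to leak in:

```
printf 'same content\n' > a.txt
printf 'same content\n' > b.txt
touch -d '2001-01-01' b.txt     # different name and mtime, same bytes
bzip2 -k a.txt b.txt
sha256sum a.txt.bz2 b.txt.bz2   # identical digests expected
```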
Jonathan Cross
(258 rep)
Jul 22, 2019, 12:14 PM
• Last activity: Jul 22, 2019, 01:52 PM
2
votes
2
answers
6330
views
BZIP2 multiple files without losing original files
I want to bzip2 about 1000 files. However, I am tasked not to remove the old files, leaving both the original and its bz2 file in the same folder. What is the quickest way to do this?
Just to rephrase my question: suppose I have file1.txt, file2.txt, file3.txt, ..., file1000.txt; I would need their bz2 versions in the same folder without removing the originals.
How to achieve this?
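The -k (keep) flag does exactly this, and bzip2 accepts many files per invocation; a sketch, with an optional parallel variant (GNU xargs assumed):

```
bzip2 -k file*.txt    # leaves file1.txt and file1.txt.bz2 side by side
# parallel variant, one bzip2 per core:
printf '%s\0' file*.txt | xargs -0 -P "$(nproc)" -n 1 bzip2 -k
```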
mywayz
(61 rep)
Jun 28, 2019, 12:55 AM
• Last activity: Jun 29, 2019, 07:47 AM
Showing page 1 of 20