
How to extract and delete contents of a zip archive simultaneously?

3 votes
2 answers
1843 views
I want to download and extract a large zip archive (>180 GB) containing multiple small files of a single file format onto an SSD, but I don't have enough storage for both the zip archive and the extracted contents. I know that it is possible to extract and delete individual files from an archive using the `unzip` and `zip` commands, as mentioned in the answers [here](https://unix.stackexchange.com/questions/14120/extract-only-a-specific-file-from-a-zipped-archive-to-a-given-directory) and [here](https://superuser.com/questions/600385/remove-single-file-from-zip-archive-on-linux). I could also get the names of all the files in the archive using `unzip -l`, store the results in an array as mentioned [here](https://www.baeldung.com/linux/reading-output-into-array), filter out the unnecessary values using the method given [here](https://stackoverflow.com/questions/9762413/bash-array-leave-elements-containing-string), and iterate over them in Bash as mentioned [here](https://www.cyberciti.biz/faq/bash-for-loop-array/). So the final logic would look something like this:

1. List the zip file's contents using `unzip -l` and store the filenames in a Bash array, using a regular expression to match the single file extension present in the archive.
2. Iterate over the array of filenames and successively extract and delete individual files using `unzip -j ... -d` and `zip -d`.

How feasible is this method in terms of time required, logic complexity, and computational resources? I am worried about the efficiency of extracting and deleting single files, especially with such a large archive. If you have any feedback or comments about this approach, I would love to hear them. Thank you all in advance for your help.

**Edit 1:** It seems this question has become a bit popular. In case anyone is interested, here is a Bash script following the logic outlined above, with batching of the extraction and deletion to reduce the number of separate `unzip` and `zip` invocations. I have used DICOM files in this example, but it would work for any other file type, or for any files whose names can be matched by a regular expression. Here is the code:
#!/bin/bash

# Check if a zip file is provided as an argument
if [ -z "$1" ]; then
  echo "Usage: $0 "
  exit 1
fi

zipfile=$1

# List the contents of the zip file and store .dcm files in an array
mapfile -t dcm_files < <(unzip -Z1 "$zipfile" | grep '\.dcm$')

# Define the batch size
batch_size=10000
total_files=${#dcm_files[@]}

# Process files in batches
for ((i=0; i<total_files; i+=batch_size)); do
  # Slice out the next batch of filenames
  batch=("${dcm_files[@]:i:batch_size}")
  echo "Processing files $((i+1)) to $((i+${#batch[@]})) of $total_files"

  # Extract the batch (junking internal paths) into the current directory
  unzip -j "$zipfile" "${batch[@]}" -d .

  # Delete the extracted entries from the archive to free up disk space
  zip -d "$zipfile" "${batch[@]}"
done
The file would have to be saved with a name like `inplace_extractor.sh` (i.e. with a `.sh` extension) and marked as executable. If the script and the archive are in the same folder and the archive is named `archive.zip`, the script would be run as `./inplace_extractor.sh archive.zip`. Feel free to adjust the batch size or the regular expression, or to account for any subfolders in your archive (a rough sketch of that variant follows below). I tried this with my large archive and the performance was absolutely abysmal while the free disk space rapidly shrank, likely because `zip -d` rewrites the entire archive for every batch of deletions, so I would still recommend going with the approaches suggested in the other answers.
Asked by Kumaresh Balaji Sundararajan (51 rep)
Nov 12, 2023, 12:24 PM
Last activity: Oct 5, 2024, 11:37 AM