Could a 163-byte size difference between the tgz archives of two different 590 MB directories be caused just by file metadata? Or does the file data differ?
While trying to remove duplicate MRI readings, I am tarring and compressing (**tgz**) each top-level directory, which holds a mix of executables, PDF, text, DLL, and data files in a proprietary format (the files sometimes have different "last modified" dates). I treat any tgz files with the same byte size as duplicate MRI scans. I am comparing 10+ MRI datasets that have been stored in a variety of compression formats on local and cloud drives for 10+ years. I'm sure some are duplicates.
> tar cfa mri01.tgz MRI01
> tar cfa mri02.tgz MRI02
For datasets that, before tgz compression, are typically about 615 MB with about 135 files, I sometimes see no size difference between the tgz files, sometimes a difference of only 150 bytes or so, and sometimes a significant difference. I don't know what to make of this.
Could a 150-byte size difference between the tgz files of such large datasets be caused just by file metadata, like the "last modified" date? Or does such a tiny size difference indicate that these are different MRI scans? Is there a better way to detect duplicates of this type of data?
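To make the "better way" question concrete, here is a minimal sketch of a content-only comparison. It is an illustration, not a tested solution for this data. The tar format stores each member's modification time, owner, and mode in its header, so archive size can shift even when the file bytes are identical. Hashing only the relative paths and file contents avoids that. The function name `dir_fingerprint` is hypothetical, and the sketch assumes that duplicate datasets use the same relative file names inside each top-level directory.

```python
import hashlib
from pathlib import Path


def dir_fingerprint(root: str) -> str:
    """Return a SHA-256 digest over relative paths and file bytes under root.

    Timestamps, owners, and permissions are deliberately ignored, so two
    directories with identical contents give the same digest.
    """
    root_path = Path(root)
    # Sort relative paths so traversal order does not affect the digest.
    rel_paths = sorted(
        p.relative_to(root_path).as_posix()
        for p in root_path.rglob("*")
        if p.is_file()
    )
    digest = hashlib.sha256()
    for rel in rel_paths:
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")  # separator between the name and the data
        with (root_path / rel).open("rb") as fh:
            # Read in 1 MiB chunks to keep memory use flat on large files.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

With this sketch, `dir_fingerprint("MRI01") == dir_fingerprint("MRI02")` would flag the two directories as content-identical regardless of their "last modified" dates. If copies were renamed over the years, dropping the path from the hash and comparing the sorted per-file digests would be one possible variation.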
Asked by smithknown34
(11 rep)
Aug 27, 2023, 08:24 PM
Last activity: Jan 25, 2024, 02:56 AM