Is there deduplicating software able to deal with partially deduplicated structures?
I started using rdfind to deduplicate my resources and found an interesting flaw: when I try to deduplicate files that are already partially linked, rdfind does not fully consolidate them, but only merges one filename at a time.
> **EDIT**: by "partially linked" I mean a situation in which the same data is duplicated across multiple files (inodes), each with multiple filenames
> (hardlinks) pointing to it. See the example below for clarification.
> You can encounter such a situation if you have copies of data scattered
> across multiple directories, but instead of deduplicating the whole
> filesystem in one go, you first deduplicate individual
> subtrees, and only later try to consolidate all of them together.
Let's assume we have three already-hardlinked files, A, B and C, sharing inode X, plus three files D, E and F with the same contents, but sharing inode Y:
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 C
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 D
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 E
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 F
> du --si -sc *
107k A
107k D
213k total
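For reference, this starting layout can be recreated from scratch with standard tools. This is only a sketch: the payload bytes and the file names are arbitrary stand-ins for the real data, and `stat`/`ls -li` output assumes GNU coreutils on Linux.

```shell
# Recreate the two hardlink groups in a scratch directory.
set -e
cd "$(mktemp -d)"
printf 'same payload\n' > A    # creates inode X
ln A B; ln A C                 # A, B, C now share inode X (link count 3)
cp A D                         # D gets a fresh inode Y with identical bytes
ln D E; ln D F                 # D, E, F share inode Y
ls -li                         # two groups of three names each
```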
I originally ran into this while trying to repair some issues with the backintime backup software, but a similar situation arises if files A, B and C reside in a directory tree that has already been deduplicated, while D, E and F reside in another tree that was deduplicated separately.
For the sake of demonstration and clarity, I placed all the files in a single directory. Now I run rdfind on this directory:
> rdfind -makehardlinks true .
The outcome is:
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 C
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 D
823106595 -rw-r--r-- 2 jasio jasio 104079 04-17 10:10 E
823106595 -rw-r--r-- 2 jasio jasio 104079 04-17 10:10 F
i.e. A, B, C and D now point to inode X, while E and F still use inode Y. This is merely a slight reorganisation and does not really help with overall disk usage:
> du --si -sc *
107k A
107k E
4,1k results.txt
218k total
Meanwhile the expected (and optimal) result would be to make all the files point to the same inode X, thus deleting inode Y and its associated data:
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 C
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 D
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 E
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 F
> du --si -sc *
107k A
4,1k results.txt
111k total
However, achieving this requires running rdfind several times in a row with the same parameters, which can be quite time-consuming on larger data sets.
Is there a deduplicator out there that is free from this flaw, i.e. one that reaches the final result in a single pass?
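(In the meantime, a crude stopgap is a shell pipeline that groups names by a content checksum and force-links every name in a group to one keeper, which collapses both inode groups in a single pass. This is only a sketch under strong assumptions: GNU coreutils, a single flat directory, whitespace-free filenames, and trusting the checksum where a real deduplicator would compare bytes. The demo setup at the top mirrors the A..F example.)

```shell
# One-pass consolidation sketch (not rdfind). Demo setup first:
set -e
cd "$(mktemp -d)"
printf 'same payload\n' > A; ln A B; ln A C   # inode X: A, B, C
cp A D; ln D E; ln D F                        # inode Y: D, E, F

find . -maxdepth 1 -type f |
  xargs md5sum |                     # "HASH  ./name" for every file
  sort |                             # identical hashes become adjacent
  awk '$1 == prev { print keep, $2 }           # emit "keeper duplicate" pairs
       $1 != prev { prev = $1; keep = $2 }' |
  while read -r keep dup; do
      # Skip pairs that already share an inode, relink the rest.
      [ "$keep" -ef "$dup" ] || ln -f "$keep" "$dup"
  done

ls -li    # all six names now share one inode (link count 6)
```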
Asked by Jasio
(634 rep)
Apr 17, 2025, 08:45 AM
Last activity: Apr 17, 2025, 06:26 PM