Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
0
votes
0
answers
37
views
Finding and removing duplicate files: Integrating a unix script with an application that opens a file for viewing?
Does anyone know of an open-source application, or perhaps how to integrate an existing find-dupes script with some OS tool (emacs, vim, Finder.app, mc, etc.), to open a file for viewing before marking that file for deletion? Or does anyone know of open-source, Unix-based console tools that manage and delete duplicate files? I'm using macOS.
I'm asking here because it's Unix-related and I'm hoping to leverage some existing dupes scripts if necessary.
I created the following script to find duplicate files and produce output with
#rm duplicate_filename
lines in groups, such that a line can be uncommented and the generated script invoked to delete duplicates. I'm wondering whether such scripts could be used with other applications. I'm hoping to ask here before reinventing this.
#!/usr/bin/env ksh
# Emit a script (on stdout) with one commented-out "rm" per duplicate file,
# grouped by content hash; uncomment lines and run it to delete duplicates.
FINDOUT=$(mktemp)
( find "${@}" -type f -print0 | xargs -0 shasum ) > "$FINDOUT"
HASHES=$(awk '{ print $1 }' "$FINDOUT" | sort | uniq -d)
DUPELIST=$(mktemp)
DUPES=$(mktemp)
for h in $HASHES; do
    if grep "^$h " "$FINDOUT" >> "$DUPES"; then
        echo "" >> "$DUPELIST"
        cat "$DUPES" >> "$DUPELIST"
    fi
    rm "$DUPES"
done
RMSCRIPT=$(mktemp)
# shasum prints "<hash>  <path>" (two spaces), so the path starts at field 3
cut -d " " -f 3- "$DUPELIST" > "$RMSCRIPT"
echo "#!/usr/bin/env ksh"
echo "# Duplicate files, remove comments and invoke script"
while read -r l; do
    if [ "$l" != "" ]; then
        echo "#rm \"$l\""
    else
        echo "$l"
    fi
done < "$RMSCRIPT"
rm "$FINDOUT" "$RMSCRIPT" "$DUPELIST"
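For the integration part, a minimal sketch of an interactive reviewer is shown below. It assumes the `#rm "path"` line format produced by the script above and a macOS `open -W` viewer with a fallback to `less`; both the viewer choice and the list format are assumptions rather than a fixed interface.
#!/usr/bin/env ksh
# Hypothetical reviewer: read the generated "#rm ..." lines, preview each
# file, and ask before deleting it.
grep '^#rm ' "${1:?usage: review <generated-script>}" |
sed 's/^#rm "\(.*\)"$/\1/' |
while read -r f; do
    [ -e "$f" ] || continue
    if command -v open >/dev/null 2>&1; then
        open -W "$f"        # macOS: wait until the viewer is closed
    else
        less "$f"
    fi
    printf 'Delete %s? [y/N] ' "$f"
    read -r answer </dev/tty
    [ "$answer" = y ] && rm -- "$f"
done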
atod
(155 rep)
Jun 21, 2025, 03:54 AM
• Last activity: Jun 21, 2025, 04:49 AM
0
votes
2
answers
81
views
How to programmatically deduplicate files into hard links while maintaining the time stamps of the containing directories?
Continuing https://unix.stackexchange.com/a/22822 , how can one deduplicate files, given as a list, into hard links while maintaining the timestamps of their directories? Unfortunately, hardlink changes the time stamps:
$ mkdir d1
$ mkdir d2
$ mkdir d3
$ echo "content" > d1/f1
$ echo "content" > d2/f2
$ echo "content" > d3/f3
$ ls -la --full-time d1 d2 d3
d1:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:26:18.624828807 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:26:07.397001442 +0200 ..
-rw-r--r-- 1 username username 8 2025-04-23 17:26:18.624828807 +0200 f1
d2:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:26:26.016715230 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:26:07.397001442 +0200 ..
-rw-r--r-- 1 username username 8 2025-04-23 17:26:26.016715230 +0200 f2
d3:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:26:29.296664852 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:26:07.397001442 +0200 ..
-rw-r--r-- 1 username username 8 2025-04-23 17:26:29.296664852 +0200 f3
$ hardlink -v -c -M -O -y memcmp d1/f1 d2/f2 d3/f3
Linking /tmp/d1/f1 to /tmp/d2/f2 (-8 B)
Linking /tmp/d1/f1 to /tmp/d3/f3 (-8 B)
Mode: real
Method: memcmp
Files: 3
Linked: 2 files
Compared: 0 xattrs
Compared: 2 files
Saved: 16 B
Duration: 0.000165 seconds
$ ls -la --full-time d1 d2 d3
d1:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:26:18.624828807 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:27:19.631893228 +0200 ..
-rw-r--r-- 3 username username 8 2025-04-23 17:26:18.624828807 +0200 f1
d2:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:28:45.922576280 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:27:19.631893228 +0200 ..
-rw-r--r-- 3 username username 8 2025-04-23 17:26:18.624828807 +0200 f2
d3:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:28:45.922576280 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:27:19.631893228 +0200 ..
-rw-r--r-- 3 username username 8 2025-04-23 17:26:18.624828807 +0200 f3
As we can see, two files have been replaced with hard links, which is good.
However, the time stamps of d2 and d3 have been updated. That is NOT what we want.
Ideally, we'd like to have a command that takes the list of files produced by
find /media/my_NTFS_drive -type f -size $(ls -la -- original_file| cut -d' ' -f5)c -exec cmp -s original_file {} \; -exec ls -t {} + 2>/dev/null
and converts them into hard links to original_file. If the time stamps of the hard-linked files have to be made identical, set them to the oldest among the time stamps of original_file and its copies. The time stamps of the directories containing original_file and its copies have to be retained. Clearly, all this has to be automated. (No question we could do it with manual inspection and touch. From a user's viewpoint, it could be just another switch to hardlink. As the task seems rather standard, my hope is that someone has already written a standalone program, perhaps even a shell script.)
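One hedged workaround sketch (not the built-in switch asked for here): record each containing directory's mtime before the hardlink run and put it back afterwards. It assumes GNU stat/touch and the util-linux hardlink used above; file timestamps are left alone, only directory mtimes are restored.
#!/bin/sh
# Sketch: run hardlink over the given files, then restore the mtimes of
# their parent directories. Assumes GNU coreutils (stat -c, touch -d).
set -e
tmp=$(mktemp)
for f in "$@"; do
    dirname -- "$f"
done | sort -u | while IFS= read -r d; do
    printf '%s\t%s\n' "$(stat -c %y "$d")" "$d"
done > "$tmp"

hardlink -c "$@"                 # same tool as above; flags vary by version

tab=$(printf '\t')
while IFS=$tab read -r mtime d; do
    touch -d "$mtime" -- "$d"
done < "$tmp"
rm -f "$tmp"
Setting the linked files themselves to the oldest timestamp in each group could be bolted on in the same record-then-touch fashion, but is left out of the sketch.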
AlMa1r
(1 rep)
Apr 23, 2025, 03:38 PM
• Last activity: Apr 23, 2025, 06:49 PM
1
votes
1
answers
49
views
Is there a deduplicating software able to deal with partially deduplicated structures?
I started using rdfind to deduplicate my resources, and found an interesting flaw: when I try to deduplicate files which are already partially linked, rdfind does not fully consolidate them, but only merges one filename at a time.
> **EDIT**: by "partially linked" I meant a situation where the same data is copied in multiple files (inodes), each one with multiple filenames
> (hardlinks) pointing to it. See the example below for clarification.
> You can encounter such a situation if you have copies of data scattered
> across multiple directories, but instead of deduplicating the whole
> filesystem in one big swoop, you first deduplicate individual
> subtrees, and finally try to consolidate all of them together using deduplication.
Let's assume that we have three already hardlinked files, A, B, and C, sharing an inode X. Then we have three files D, E and F having the same contents, but sharing an inode Y:
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 C
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 D
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 E
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 F
> du --si -sc *
107k A
107k D
213k total
Originally I found this while trying to repair some issues with the backintime backup software, but a similar situation could arise if the files A, B and C resided in a directory tree which had already been deduplicated, while D, E and F resided in another directory tree, which had been deduplicated separately.
For the sake of demonstration and clarity, I placed all the files in a single directory. Now I run rdfind on this directory:
> rdfind -makehardlinks .
the outcome is:
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 C
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 D
823106595 -rw-r--r-- 2 jasio jasio 104079 04-17 10:10 E
823106595 -rw-r--r-- 2 jasio jasio 104079 04-17 10:10 F
i.e. A, B, C and D now point to the inode X, while E and F still use the inode Y. This is merely a slight reorganisation and does not really help with overall disk usage:
> du --si -sc *
107k A
107k E
4,1k results.txt
218k total
Meanwhile the expected (and optimal) result would be for all the files to point to the same inode X, freeing inode Y and the associated data:
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 C
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 D
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 E
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 F
> du --si -sc *
107k A
4,1k results.txt
111k total
However, achieving this requires running rdfind several times in a row with the same parameters, which can be quite time-consuming on larger data sets.
Is there a deduplicator out there which is free from this flaw, i.e. one that reaches the final result in a single pass?
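As a hedged stop-gap rather than a different tool, the rerun can at least be automated until it converges: count the distinct inodes among regular files and stop once a pass no longer reduces that number. This sketch assumes GNU find's -printf and rdfind's documented `-makehardlinks true` form.
#!/bin/sh
# Sketch: rerun rdfind until no further hard-link consolidation happens.
dir=${1:-.}
prev=-1
while :; do
    cur=$(find "$dir" -type f ! -name results.txt -printf '%i\n' | sort -u | wc -l)
    [ "$cur" -eq "$prev" ] && break
    prev=$cur
    rdfind -makehardlinks true "$dir" >/dev/null
done
echo "converged with $cur distinct inode(s)"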
Jasio
(634 rep)
Apr 17, 2025, 08:45 AM
• Last activity: Apr 17, 2025, 06:26 PM
12
votes
4
answers
3581
views
Are there any deduplication scripts that use btrfs CoW as dedup?
Looking for deduplication tools on Linux, there are plenty; see e.g. this wiki page.
Almost all of these scripts do only detection, either printing the duplicate file names or removing duplicate files by hardlinking them to a single copy.
With the rise of btrfs there is another option: creating a CoW (copy-on-write) copy of a file (like
cp --reflink=always
). I have not found any tool that does this; is anyone aware of a tool that does?
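A minimal sketch of what such a tool could do, assuming fdupes for the detection step: for every group of identical files it keeps the first one and swaps each remaining copy for a `cp --reflink=always` clone. This only handles whole-file duplicates and breaks on filenames containing newlines; block-level sharing is a different problem.
#!/bin/sh
# Sketch: replace whole-file duplicates on btrfs with CoW (reflink) copies.
# fdupes prints groups of identical files separated by blank lines.
fdupes -r "${1:-.}" | while IFS= read -r f; do
    if [ -z "$f" ]; then
        keep=                              # blank line: end of a group
    elif [ -z "$keep" ]; then
        keep=$f                            # keep the first file of the group
    else
        cp --reflink=always --preserve=all -- "$keep" "$f.reflink.tmp" &&
            mv -- "$f.reflink.tmp" "$f"    # swap the duplicate for a CoW copy
    fi
done
Since this question was asked, offline dedupers such as duperemove (mentioned further down this page) have appeared that do much the same thing at extent level through the kernel's dedupe ioctl.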
Peter Smit
(1184 rep)
Nov 8, 2012, 02:46 PM
• Last activity: Mar 19, 2025, 10:13 AM
4
votes
1
answers
3139
views
Is there a way to consolidate (deduplicate) btrfs?
I have a btrfs volume, which I create regular snapshots of. The snapshots are rotated, the oldest being one year old. As a consequence, deleting large files may not actually free up the space for a year after the deletion.
About a year ago I copied the partition to a bigger drive but still kept the old one around.
Now the new drive has become corrupted, so that the only way to get the data out is
btrfs-restore
. As far as I know, the data on the new drive should still fit on the old, smaller drive, and files do not really change much (at most, some new ones get added or some deleted, but the overhead from a year’s worth of snapshots should not be large). So I decided to restore the data onto the old drive.
However, the restored data filled up the old drive much more quickly than I expected. I suspect this has to do with the implementation of btrfs:
* Create a large file.
* Create a snapshot of the volume. Space usage will not change because both files (the original one and the one in the snapshot) refer to the same extent on the disk for their payload. Modifying one of the two files would, however, increase space usage due to the copy-on-write nature of btrfs.
* Overwrite the large file with identical content. I suspect space usage increases by the size of the file because btrfs does not realize the content has not changed; it rewrites the file's blocks with identical content, leaving two identical copies in two separate sets of blocks.
Does btrfs offer a mechanism to revert this by finding files which are “genetically related” (i.e. descended from the same file by copying it and/or snapshotting the subvolume on which it resides), identical in content but stored in separate sets of blocks, and turning them back into reflinks so space can be freed up?
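If an out-of-band deduplicator is acceptable, a hedged sketch with duperemove (the tool is also mentioned elsewhere on this page) looks roughly like this; it hashes extents and asks the kernel to re-share identical ones via the dedupe ioctl. Paths and the hash-file location are placeholders, and flags may differ between duperemove versions.
# Sketch: offline extent deduplication on a mounted btrfs volume.
duperemove -d -r --hashfile=/var/tmp/dedupe.hash \
    /mnt/volume /mnt/volume/.snapshots
# -d actually submits the dedupe requests; without it only a report is printed.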
user149408
(1515 rep)
Jul 26, 2022, 05:23 PM
• Last activity: Feb 3, 2025, 01:52 PM
0
votes
1
answers
65
views
Deduplication tool which is able to only compare two directories against each other?
I tried rdfind and jdupes, and if I specify two directories for them, they both match not only files in one directory against the other directory, but also files within each given directory against each other. I would like to exclude results that reside in a single directory only, and it would also be faster to avoid scanning for them. But I couldn't find an option in either tool that allows this.
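Failing a built-in option, a hedged sketch of the cross-directory-only comparison itself: hash each tree separately and join on the checksum, so duplicates confined to a single directory never show up (bash process substitution, GNU coreutils; paths containing spaces will not survive the awk step).
#!/usr/bin/env bash
# Sketch: report only duplicates that appear in BOTH directories.
hash_tree() {
    find "$1" -type f -exec sha256sum {} + | sort -k1,1
}
join -j 1 <(hash_tree "$1") <(hash_tree "$2") |
    awk '{ print "duplicate:", $2, "<->", $3 }'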
bodqhrohro
(386 rep)
Dec 15, 2024, 07:40 AM
• Last activity: Dec 15, 2024, 03:54 PM
10
votes
2
answers
1881
views
Make tar (or other) archive, with data block-aligned like in original files for better block-level deduplication?
How can one generate a tar file so that the contents of the tarred files are block-aligned as in the original files, so one could benefit from block-level deduplication ( https://unix.stackexchange.com/a/208847/9689 )?
(Am I correct that there is nothing intrinsic to the tar format that prevents us from getting such a benefit? Otherwise, if not tar, is there perhaps another archiver that has such a feature built in?)
P.S. I mean an uncompressed tar (not tar+gz or the like); the question asks for some trick that allows aligning the files at the block level.
As far as I recall, tar was designed for use with tape machines, so maybe adding some extra padding for alignment is possible and easy within the file format?
I hope there might even be a tool for it ;). As far as I recall, tar files can be concatenated, so maybe there is a trick for adding filler to achieve alignment.
Grzegorz Wierzowiecki
(14740 rep)
Apr 16, 2016, 03:52 PM
• Last activity: Oct 12, 2024, 03:14 PM
7
votes
2
answers
2376
views
Does ZFS deduplicate across datasets or only inside a single dataset?
Does ZFS deduplicate across datasets or only inside a single dataset? In other words, if I have two nearly identical volumes, will they be deduplicated?
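A rough way to check it empirically (a sketch, assuming a scratch pool called tank and root access): write the same data into two dedup-enabled datasets and look at the pool-wide dedup ratio.
zfs create -o dedup=on tank/ds1
zfs create -o dedup=on tank/ds2
dd if=/dev/urandom of=/tank/ds1/blob bs=1M count=256
cp /tank/ds1/blob /tank/ds2/blob   # identical content, different dataset
sync
zpool get dedupratio tank          # a ratio near 2.00x means the DDT spans datasets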
Maja Piechotka
(16936 rep)
Jul 28, 2017, 06:31 AM
• Last activity: Aug 1, 2024, 12:53 PM
6
votes
3
answers
4566
views
Finding duplicate files with same filename AND exact same size
I have a huge songs folder with a messy structure and files duplicated across multiple folders.
I need a recommendation for a tool or a script that can find and remove duplicates based on two simple criteria:
1. Exact same file name
2. Exact same file size
In this case, song.mp3 with a file size of 1234 bytes is stored in both /songs/album1 and /songs/albumz. The tool/script should keep only one of the copies.
I have tried czkawka on Fedora, but it can search by either filename or file size, not *both* combined.
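A hedged sketch of the matching step itself, using GNU find to group files by basename plus size and print only the groups with more than one member (review the output before deleting anything; names containing tabs or newlines will confuse it):
#!/bin/sh
# Sketch: list files sharing both basename and size under /songs.
find /songs -type f -printf '%f %s\t%p\n' | sort | awk -F '\t' '
    { count[$1]++; paths[$1] = paths[$1] ORS "  " $2 }
    END {
        for (k in count)
            if (count[k] > 1) print "== " k " ==" paths[k]
    }
'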
Electrifyings
(63 rep)
Nov 21, 2021, 04:22 AM
• Last activity: Apr 4, 2024, 01:02 PM
0
votes
4
answers
160
views
Filtering duplicates with AWK differing by timestamp
Given the list of files ordered by timestamp shown below, I am seeking to retrieve the last occurrence of each file (the one at the bottom of each group).
For example, this input:
archive-daily/document-sell-report-2022-07-12-23-21-02.html
archive-daily/document-sell-report-2022-07-13-23-15-34.html
archive-daily/document-loan-report-2022-07-18-05-12-16.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-deb-report-2022-07-18-13-17-40.html
archive-daily/document-deb-report-2022-07-18-10-04-21.html
would produce something like:
archive-daily/document-sell-report-2022-07-13-23-15-34.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-deb-report-2022-07-18-10-04-21.html
Can I use awk or any other command to achieve this? Thanks in advance.
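A hedged awk sketch: strip the trailing timestamp to form a key, let later lines overwrite earlier ones, and print each key's last line in first-seen order. It assumes every name ends in "-YYYY-MM-DD-hh-mm-ss.html" and that the list sits in a file (files.txt here is a placeholder).
awk '
{
    key = $0
    sub(/-[0-9][0-9][0-9][0-9](-[0-9][0-9])+\.html$/, "", key)  # drop timestamp
    if (!(key in last)) order[++n] = key    # remember first-seen key order
    last[key] = $0                          # later occurrences overwrite
}
END { for (i = 1; i <= n; i++) print last[order[i]] }
' files.txt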
Luciano
(3 rep)
Jul 18, 2022, 09:46 PM
• Last activity: Feb 12, 2024, 10:01 AM
-1
votes
2
answers
132
views
Batch rename of files that share the same prefix
I have a list of files on my server with a prefix that I want to de-dupe. These are completely different generated files.
The generated names seem to follow the pattern
{Title} - {yyyy-MM-dd}_{random} - {Description}.ts
For example:
Camera Recording - 2023-08-11_14 - Front Deck.ts
Camera Recording - 2023-08-11_14 - Back Deck.ts
Camera Recording - 2023-08-16_27 - Front Deck.ts
Camera Recording - 2023-08-16_36 - Front Deck.ts
Camera Recording - 2023-08-17_56 - Front Deck.ts
I need to be able to run a script that identifies duplicate prefixes among the following filenames and then changes the numeric {random} part after the date:
Camera Recording - 2023-08-11_14 - Front Deck.ts
Camera Recording - 2023-08-11_14 - Back Deck.ts
so that they become the following; the {random} part needs to be replaced with a different, unused value (other than 14):
Camera Recording - 2023-08-11_14 - Front Deck.ts
Camera Recording - 2023-08-11_68 - Back Deck.ts
Any advice on how to achieve this with a Linux shell script?
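A hedged bash sketch along those lines: parse out the date and the {random} number, and when a (date, random) pair has been seen before, rename the later file with a fresh unused number. The exact name pattern and the 2-digit random range are assumptions; it prints the mv commands instead of running them.
#!/usr/bin/env bash
# Sketch: re-randomise the "_NN" part when two files collide on "{date}_{NN}".
shopt -s nullglob
declare -A seen
re='^(.* - ([0-9]{4}-[0-9]{2}-[0-9]{2})_)([0-9]+)( - .*\.ts)$'
for f in *.ts; do
    [[ $f =~ $re ]] || continue
    head=${BASH_REMATCH[1]} date=${BASH_REMATCH[2]}
    num=${BASH_REMATCH[3]}  rest=${BASH_REMATCH[4]}
    if [[ -n ${seen[${date}_${num}]} ]]; then
        new=$((RANDOM % 90 + 10))                       # assumed 2-digit range
        while [[ -n ${seen[${date}_${new}]} ]]; do new=$((RANDOM % 90 + 10)); done
        echo mv -- "$f" "${head}${new}${rest}"          # drop "echo" to rename
        seen[${date}_${new}]=1
    else
        seen[${date}_${num}]=1
    fi
done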
mysterio21_troy
(11 rep)
Jan 8, 2024, 08:50 PM
• Last activity: Feb 8, 2024, 10:09 AM
0
votes
1
answers
61
views
Pop OS with a deduplication filesystem
I'm moving a friend's development machine to Linux (PopOS), permanently. Don't worry guys, he dual-booted and he's ready for the tux.
The problem is his drive. It's a 256GB SSD, and he is moving from a rusty 512GB HDD almost full of projects, where most of the space used comes from vendored shared libraries (npm hell, composer, to name a few).
Since package managers download a library and then copy it into the project, I think it would be convenient to use a filesystem that deals with deduplication.
I'm trying to find out whether there is a filesystem that deduplicates files on Linux, like [Microsoft "Dev Drive"](https://learn.microsoft.com/en-us/windows/dev-drive/), that can (hopefully, but not necessarily) also work as a boot drive.
DarkGhostHunter
(101 rep)
Jan 19, 2024, 07:32 PM
• Last activity: Jan 19, 2024, 07:41 PM
0
votes
0
answers
52
views
20+ backup directories, I'd like to dedupe all files to 1 "master directory"
As the title suggests, I have inherited a file structure where there are about 30 "complete or partial backups" of a fileserver full of text files. This obviously makes no sense, and I'd like to run a dedupe on this that produces one "master directory" which contains all the unique files from all the backups. (At which point I can then delete all the backups, and not actually LOSE anything).
Yes, I realize file CHANGES are an issue, and in that case I'd like to keep the most recent file.
I've looked through rdupes, jdupes, and robinhood, and haven't really seen options that do what I need. Did I miss something? Is there a better tool that I haven't seen?
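A hedged sketch of the merge itself, assuming the backups sit side by side under one directory (paths are placeholders): rsync -u skips files that are already newer in the destination, so folding every backup into one master tree keeps the most recent copy of each relative path.
#!/bin/sh
# Sketch: fold every backup into one master tree, newest version wins.
master=/srv/master
mkdir -p "$master"
for b in /srv/backups/*/; do
    rsync -a -u "$b" "$master/"
done
Identically named paths collapse this way; files whose content is duplicated under different paths remain, so a dedupe pass (jdupes, rdfind, etc.) over the master directory may still be worthwhile afterwards.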
Frank Rizzo
(1 rep)
Jan 2, 2024, 04:06 AM
• Last activity: Jan 3, 2024, 10:28 PM
2
votes
2
answers
2923
views
Deduplicating Files while moving them to XFS
I've got a folder on a non-reflink-capable file system (ext4) which I know contains many files with identical blocks in them.
I'd like to move/copy that directory to an XFS file system whilst simultaneously deduplicating it. (I.e. if a block of a copied file is already present in a different file, I'd like to not actually copy it, but to make a second block reference in the new file point to the existing one.)
One option would of course be to first copy all the files over to the XFS filesystem, run duperemove on them there, and thus remove the duplicates after the fact. Small problem: this might get time-intensive, as the target filesystem isn't as quick on random accesses.
Therefore, I'd prefer it if the process that copies the files over already takes care of telling the kernel that, hey, this block is a duplicate of that other block that's already there.
Is such a thing possible?
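To the best of my knowledge there is no standard copy tool that does this, but a hedged sketch for the whole-file case looks like the following: copy each file once, and when a later source file hashes the same, create a reflink clone of the already-copied file on the XFS side instead of copying the data again. Source and destination paths are placeholders, and partially identical files would still need a duperemove pass afterwards.
#!/usr/bin/env bash
# Sketch: copy ext4 -> XFS, turning exact whole-file duplicates into reflinks.
src=/mnt/ext4/data dst=/mnt/xfs/data
declare -A first_copy
cd "$src" || exit 1
while IFS= read -r -d '' f; do
    sum=$(sha256sum "$f" | cut -d' ' -f1)
    mkdir -p "$dst/$(dirname "$f")"
    if [[ -n ${first_copy[$sum]} ]]; then
        # content already on the target: clone it, no data is copied
        cp --reflink=always --preserve=all -- "${first_copy[$sum]}" "$dst/$f"
    else
        cp --preserve=all -- "$f" "$dst/$f"
        first_copy[$sum]=$dst/$f
    fi
done < <(find . -type f -print0)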
Marcus Müller
(47107 rep)
May 1, 2021, 12:15 PM
• Last activity: Dec 19, 2023, 04:02 PM
2
votes
2
answers
3158
views
How to get deduplication on an Ext4 partition used by Debian, Ubuntu and Linux Mint?
Ext4 doesn't support deduplication, unlike e.g. btrfs, bcachefs and ZFS, which offer deduplication as standard.
How can I get deduplication support for Ext4?
Alfred.37
(129 rep)
Oct 9, 2022, 08:50 AM
• Last activity: Dec 1, 2023, 09:15 PM
5
votes
6
answers
961
views
Keep unique values (comma separated) from each column
I have a .tsv (tab-separated columns) file on a Linux system with the following columns, which contain different types of values (strings, numbers) separated by commas:
col1 col2
. NS,NS,NS,true,true
. 12,12,12,13
1,1,1,2 door,door,1,1
I would like to keep only the unique values in each field (unfortunately I tried but couldn't manage it). This would be the output:
col1 col2
. NS,true
. 12,13
1,2 door,1
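A hedged awk sketch (file.tsv is a placeholder name): split every tab-separated field on commas, keep only the first occurrence of each value, and reassemble the field.
awk 'BEGIN { FS = OFS = "\t" }
{
    for (i = 1; i <= NF; i++) {
        n = split($i, v, ",")
        split("", seen)                       # reset the per-field lookup table
        out = ""
        for (j = 1; j <= n; j++)
            if (!seen[v[j]]++)
                out = (out == "" ? v[j] : out "," v[j])
        $i = out
    }
    print
}' file.tsv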
df_v
(51 rep)
Oct 20, 2023, 01:48 PM
• Last activity: Oct 23, 2023, 08:46 AM
0
votes
2
answers
627
views
Standalone Fileserver with deduplication wanted
**Situation:**
I want to reinstall a homelab server (Windows OS) as a Linux-based server.
**Server** | Purpose: Backup System (mostly offline)
I currently have an HP ProLiant MicroServer N54L:
Turion II Neo N54L, 2.2 GHz, 4 GB RAM
https://geizhals.at/a688459.html
**Setup**
6 physical disks (5 HDD, 1 SSD) pooled into a JBOD Storage Space (15.6 TiB)
1 LUN, formatted NTFS
Files are shared via Windows shares (SMB/CIFS)
No special NTFS permissions (since it is just me)
Windows Server 2012 R2 (soon EOL)
Deduplication enabled, which saved almost 4.5 TiB of data
mode = general purpose file server
**Clients**
Clients are mostly Windows, perhaps some Linux in the near future.
They access the server via SMB/CIFS and RDP (for management).
Yeah, the server is slow, but its only purpose is archival: it is mostly turned off, and I sometimes access the data (single user, no parallel access needed). It works OK as it is now.
**Goal**\
Since I want to go for Linux a lot more and Server 2012 R2 is end-of-life, I want to reinstall the system with GNU/Linux, providing the same functionality on the same hardware.
Whenever I read about deduplication, it is always ZFS or Btrfs, but with LOTS of RAM needed, or OpenMediaVault with BorgBackup... but then the clients also need BorgBackup (and the clients will still be Windows).
What would be the nearest equivalent Linux setup?
David
(1 rep)
Aug 9, 2023, 04:36 PM
• Last activity: Aug 10, 2023, 12:53 PM
1
votes
1
answers
77
views
See if any of a number of zip files contains any of the original files in a directory structure
I have a pretty hard problem here.
I have a photo library with a lot of photos in it in various folders.
I then started using Google Photos for my photos, I put those originals into Google Photos, and used it for 5+ years.
Now I want to move away from Google Photos. I have done a Google Takeout of all my photos, and downloaded all the Zip files, ~1.5TB worth of them (150 x ~10GB files).
Now I want to keep my original directory structure, and delete all the files that are duplicated in Google Photos. After this operation, I basically want to have two directories left over each with unique files in them. I can then merge this by hand later.
I have started extracting all the files, and then I will run rmlint to detect duplicates and purge them from the Google Drive data. The problem is that I don't have enough space to manoeuvre all of this around, so I have to extract, say, 30 archives, run rmlint, purge, extract another 30, run rmlint again, purge, and so on. This rescans my original files over and over, and it's going to take a really long time. I already use the --xattr flag for rmlint to try to speed up subsequent runs; see the appendix for the full rmlint command.
How can I do this WITHOUT having to first extract all the archives? Is there a way to just use the file checksums stored in the zip files and compare against those?
Thanks!
Appendix
rmlint \
--xattr \
-o sh:rmlint-photos.sh \
-o json:rmlint-photos.json \
--progress \
--match-basename \
--keep-all-tagged \
--must-match-tagged \
"/mnt/f/GoogleTakeout/" \
// \
"/mnt/e/My Documents/Pictures/" \
Albert
(171 rep)
Jul 28, 2023, 01:25 AM
• Last activity: Jul 28, 2023, 08:24 AM
209
votes
20
answers
78136
views
Is there an easy way to replace duplicate files with hardlinks?
I'm looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hard links to the files in the other directory.
Here's the situation: this is a file server on which multiple people store audio files, each user having their own folder. Sometimes multiple people have copies of the exact same audio files. Right now, these are duplicates. I'd like to turn them into hard links, to save hard drive space.
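Dedicated tools (rdfind -makehardlinks true, jdupes, util-linux hardlink; several appear elsewhere on this page) handle this with more safety checks, but a minimal sketch of the idea, with placeholder paths and assuming no filenames contain quotes or newlines, is:
#!/bin/sh
# Sketch: generate (for review) the ln commands that replace duplicates
# across two users' folders with hard links to the first copy seen.
find /srv/audio/user1 /srv/audio/user2 -type f -exec sha256sum {} + |
    sort -k1,1 |
    awk '
        $1 == prev { printf "ln -f -- \"%s\" \"%s\"\n", keep, substr($0, 67) }
        $1 != prev { prev = $1; keep = substr($0, 67) }
    ' > relink.sh
# inspect relink.sh, then run it with: sh relink.sh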
Josh
(8728 rep)
Oct 12, 2010, 07:23 PM
• Last activity: Jun 7, 2023, 03:16 PM
-1
votes
2
answers
224
views
Join files together without using space in filesystem
I want to join (concatenate) two files on Linux without using additional space in the filesystem. Can I do this?
A + B = AB
The file AB would reuse the sectors or fragments of A and B already on the filesystem. Is it possible to do this?
Could I use gparted to have AB recognised as a new file without copying the two files (which is a slow process)?
ArtEze
(137 rep)
Apr 24, 2023, 12:03 AM
• Last activity: Apr 24, 2023, 07:03 AM
Showing page 1 of 20 total questions