Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
0
votes
0
answers
37
views
Finding and removing duplicate files: Integrating a unix script with an application that opens a file for viewing?
Does anyone know of an open-source application, or perhaps how to integrate an existing find-dupes script with some OS tool (emacs, vim, Finder.app, mc, etc.), to open a file for viewing before marking that file for deletion? Or does anyone know of open-source, Unix-based console tools that manage and delete duplicate files? I'm using macOS.
I'm asking here because it's Unix-related and I'm hoping to leverage some existing dupes scripts if necessary.
I created the following script to find duplicate files and produce output with
#rm duplicate_filename
lines in groups, such that a line can be uncommented and the generated script invoked to delete duplicates. I'm wondering whether such scripts could be used with other applications. I'm hoping to ask here before reinventing this.
#!/usr/bin/env ksh
# Emit a script (on stdout) with one commented-out "rm" per duplicate file,
# grouped by content hash; uncomment lines and run it to delete duplicates.
FINDOUT=$(mktemp)
( find "${@}" -type f -print0 | xargs -0 shasum ) > "$FINDOUT"
HASHES=$(awk '{ print $1 }' "$FINDOUT" | sort | uniq -d)
DUPELIST=$(mktemp)
DUPES=$(mktemp)
for h in $HASHES; do
    if grep "^$h " "$FINDOUT" >> "$DUPES"; then
        echo "" >> "$DUPELIST"
        cat "$DUPES" >> "$DUPELIST"
    fi
    rm "$DUPES"
done
RMSCRIPT=$(mktemp)
# shasum prints "<hash>  <path>" (two spaces), so the path starts at field 3
cut -d " " -f 3- "$DUPELIST" > "$RMSCRIPT"
echo "#!/usr/bin/env ksh"
echo "# Duplicate files, remove comments and invoke script"
while read -r l; do
    if [ "$l" != "" ]; then
        echo "#rm \"$l\""
    else
        echo "$l"
    fi
done < "$RMSCRIPT"
rm "$FINDOUT" "$RMSCRIPT" "$DUPELIST"
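For the integration part, a minimal sketch of an interactive reviewer is shown below. It assumes the `#rm "path"` line format produced by the script above and a macOS `open -W` viewer with a fallback to `less`; both the viewer choice and the list format are assumptions rather than a fixed interface.
#!/usr/bin/env ksh
# Hypothetical reviewer: read the generated "#rm ..." lines, preview each
# file, and ask before deleting it.
grep '^#rm ' "${1:?usage: review <generated-script>}" |
sed 's/^#rm "\(.*\)"$/\1/' |
while read -r f; do
    [ -e "$f" ] || continue
    if command -v open >/dev/null 2>&1; then
        open -W "$f"        # macOS: wait until the viewer is closed
    else
        less "$f"
    fi
    printf 'Delete %s? [y/N] ' "$f"
    read -r answer </dev/tty
    [ "$answer" = y ] && rm -- "$f"
done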
atod
(155 rep)
Jun 21, 2025, 03:54 AM
• Last activity: Jun 21, 2025, 04:49 AM
0
votes
2
answers
81
views
How to programmatically deduplicate files into hard links while maintaining the time stamps of the containing directories?
Continuing https://unix.stackexchange.com/a/22822 , how can one deduplicate files, given as a list, into hard links while maintaining the timestamps of their directories? Unfortunately, hardlink changes the time stamps:
$ mkdir d1
$ mkdir d2
$ mkdir d3
$ echo "content" > d1/f1
$ echo "content" > d2/f2
$ echo "content" > d3/f3
$ ls -la --full-time d1 d2 d3
d1:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:26:18.624828807 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:26:07.397001442 +0200 ..
-rw-r--r-- 1 username username 8 2025-04-23 17:26:18.624828807 +0200 f1
d2:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:26:26.016715230 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:26:07.397001442 +0200 ..
-rw-r--r-- 1 username username 8 2025-04-23 17:26:26.016715230 +0200 f2
d3:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:26:29.296664852 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:26:07.397001442 +0200 ..
-rw-r--r-- 1 username username 8 2025-04-23 17:26:29.296664852 +0200 f3
$ hardlink -v -c -M -O -y memcmp d1/f1 d2/f2 d3/f3
Linking /tmp/d1/f1 to /tmp/d2/f2 (-8 B)
Linking /tmp/d1/f1 to /tmp/d3/f3 (-8 B)
Mode: real
Method: memcmp
Files: 3
Linked: 2 files
Compared: 0 xattrs
Compared: 2 files
Saved: 16 B
Duration: 0.000165 seconds
$ ls -la --full-time d1 d2 d3
d1:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:26:18.624828807 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:27:19.631893228 +0200 ..
-rw-r--r-- 3 username username 8 2025-04-23 17:26:18.624828807 +0200 f1
d2:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:28:45.922576280 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:27:19.631893228 +0200 ..
-rw-r--r-- 3 username username 8 2025-04-23 17:26:18.624828807 +0200 f2
d3:
total 4
drwxr-xr-x 2 username username 60 2025-04-23 17:28:45.922576280 +0200 .
drwxrwxrwt 29 root root 820 2025-04-23 17:27:19.631893228 +0200 ..
-rw-r--r-- 3 username username 8 2025-04-23 17:26:18.624828807 +0200 f3
As we can see, two files have been replaced with hard links, which is good.
However, the time stamps of d2 and d3 have been updated. That is NOT what we want.
Ideally, we'd like to have a command that takes the list of files produced by
find /media/my_NTFS_drive -type f -size $(ls -la -- original_file| cut -d' ' -f5)c -exec cmp -s original_file {} \; -exec ls -t {} + 2>/dev/null
and converts them into hard links to original_file. If the time stamps of the hard-linked files have to be made identical, set them to the oldest among the time stamps of original_file and its copies. The time stamps of the directories containing original_file and its copies have to be retained. Clearly, all this has to be automated. (No question we could do it with manual inspection and touch. From a user's viewpoint, it could be just another switch to hardlink. As the task seems rather standard, my hope is that someone has already written a standalone program, perhaps even a shell script.)
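One hedged workaround sketch (not the built-in switch asked for here): record each containing directory's mtime before the hardlink run and put it back afterwards. It assumes GNU stat/touch and the util-linux hardlink used above; file timestamps are left alone, only directory mtimes are restored.
#!/bin/sh
# Sketch: run hardlink over the given files, then restore the mtimes of
# their parent directories. Assumes GNU coreutils (stat -c, touch -d).
set -e
tmp=$(mktemp)
for f in "$@"; do
    dirname -- "$f"
done | sort -u | while IFS= read -r d; do
    printf '%s\t%s\n' "$(stat -c %y "$d")" "$d"
done > "$tmp"

hardlink -c "$@"                 # same tool as above; flags vary by version

tab=$(printf '\t')
while IFS=$tab read -r mtime d; do
    touch -d "$mtime" -- "$d"
done < "$tmp"
rm -f "$tmp"
Setting the linked files themselves to the oldest timestamp in each group could be bolted on in the same record-then-touch fashion, but is left out of the sketch.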
AlMa1r
(1 rep)
Apr 23, 2025, 03:38 PM
• Last activity: Apr 23, 2025, 06:49 PM
1
votes
1
answers
49
views
Is there a deduplicating software able to deal with partially deduplicated structures?
I started using rdfind to deduplicate my resources, and found an interesting flaw: when I try to deduplicate files which are already partially linked, rdfind does not fully consolidate them, but only merges one filename at a time.
> **EDIT**: by "partially linked" I meant a situation where the same data is copied in multiple files (inodes), each one with multiple filenames
> (hardlinks) pointing to it. See the example below for clarification.
> You can encounter such a situation if you have copies of data scattered
> across multiple directories, but instead of deduplicating the whole
> filesystem in one big swoop, you first deduplicate individual
> subtrees, and finally try to consolidate all of them together using deduplication.
Let's assume that we have three already hardlinked files, A, B, and C, sharing an inode X. Then we have three files D, E and F having the same contents, but sharing an inode Y:
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 C
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 D
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 E
823106595 -rw-r--r-- 3 jasio jasio 104079 04-17 10:10 F
> du --si -sc *
107k A
107k D
213k total
Originally I found this while trying to repair some issues with the backintime backup software, but a similar situation could arise if the files A, B and C resided in a directory tree which had already been deduplicated, while D, E and F resided in another directory tree, which had been deduplicated separately.
For the sake of demonstration and clarity, I placed all the files in a single directory. Now I run rdfind on this directory:
> rdfind -makehardlinks .
the outcome is:
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 C
823106592 -rw-r--r-- 4 jasio jasio 104079 04-17 10:10 D
823106595 -rw-r--r-- 2 jasio jasio 104079 04-17 10:10 E
823106595 -rw-r--r-- 2 jasio jasio 104079 04-17 10:10 F
i.e. A, B, C and D now point to the inode X, while E and F still use the inode Y. This is merely a slight reorganisation and does not really help with overall disk usage:
> du --si -sc *
107k A
107k E
4,1k results.txt
218k total
Meanwhile the expected (and optimal) result would be for all the files to point to the same inode X, freeing inode Y and the associated data:
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 A
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 B
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 C
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 D
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 E
823106592 -rw-r--r-- 6 jasio jasio 104079 04-17 10:10 F
> du --si -sc *
107k A
4,1k results.txt
111k total
However, achieving this requires running rdfind several times in a row with the same parameters, which can be quite time-consuming on larger data sets.
Is there a deduplicator out there which is free from this flaw, i.e. one that reaches the final result in a single pass?
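As a hedged stop-gap rather than a different tool, the rerun can at least be automated until it converges: count the distinct inodes among regular files and stop once a pass no longer reduces that number. This sketch assumes GNU find's -printf and rdfind's documented `-makehardlinks true` form.
#!/bin/sh
# Sketch: rerun rdfind until no further hard-link consolidation happens.
dir=${1:-.}
prev=-1
while :; do
    cur=$(find "$dir" -type f ! -name results.txt -printf '%i\n' | sort -u | wc -l)
    [ "$cur" -eq "$prev" ] && break
    prev=$cur
    rdfind -makehardlinks true "$dir" >/dev/null
done
echo "converged with $cur distinct inode(s)"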
Jasio
(634 rep)
Apr 17, 2025, 08:45 AM
• Last activity: Apr 17, 2025, 06:26 PM
12
votes
4
answers
3581
views
Are there any deduplication scripts that use btrfs CoW as dedup?
Looking for deduplication tools on Linux, there are plenty; see e.g. this wiki page.
Almost all of these scripts do only detection, either printing the duplicate file names or removing duplicate files by hardlinking them to a single copy.
With the rise of btrfs there is another option: creating a CoW (copy-on-write) copy of a file (like
cp --reflink=always
). I have not found any tool that does this; is anyone aware of a tool that does?
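A minimal sketch of what such a tool could do, assuming fdupes for the detection step: for every group of identical files it keeps the first one and swaps each remaining copy for a `cp --reflink=always` clone. This only handles whole-file duplicates and breaks on filenames containing newlines; block-level sharing is a different problem.
#!/bin/sh
# Sketch: replace whole-file duplicates on btrfs with CoW (reflink) copies.
# fdupes prints groups of identical files separated by blank lines.
fdupes -r "${1:-.}" | while IFS= read -r f; do
    if [ -z "$f" ]; then
        keep=                              # blank line: end of a group
    elif [ -z "$keep" ]; then
        keep=$f                            # keep the first file of the group
    else
        cp --reflink=always --preserve=all -- "$keep" "$f.reflink.tmp" &&
            mv -- "$f.reflink.tmp" "$f"    # swap the duplicate for a CoW copy
    fi
done
Since this question was asked, offline dedupers such as duperemove (mentioned further down this page) have appeared that do much the same thing at extent level through the kernel's dedupe ioctl.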
Peter Smit
(1184 rep)
Nov 8, 2012, 02:46 PM
• Last activity: Mar 19, 2025, 10:13 AM
4
votes
1
answers
3139
views
Is there a way to consolidate (deduplicate) btrfs?
I have a btrfs volume, which I create regular snapshots of. The snapshots are rotated, the oldest being one year old. As a consequence, deleting large files may not actually free up the space for a year after the deletion.
About a year ago I copied the partition to a bigger drive but still kept the old one around.
Now the new drive has become corrupted, so that the only way to get the data out is
btrfs-restore
. As far as I know, the data on the new drive should still fit on the old, smaller drive, and files do not really change much (at most, some new ones get added or some deleted, but the overhead from a year’s worth of snapshots should not be large). So I decided to restore the data onto the old drive.
However, the restored data filled up the old drive much more quickly than I expected. I suspect this has to do with the implementation of btrfs:
* Create a large file.
* Create a snapshot of the volume. Space usage will not change because both files (the original one and the one in the snapshot) refer to the same extent on the disk for their payload. Modifying one of the two files would, however, increase space usage due to the copy-on-write nature of btrfs.
* Overwrite the large file with identical content. I suspect space usage increases by the size of the file because btrfs does not realize the content has not changed; it rewrites the file's blocks with identical content, leaving two identical copies in two separate sets of blocks.
Does btrfs offer a mechanism to revert this by finding files which are “genetically related” (i.e. descended from the same file by copying it and/or snapshotting the subvolume on which it resides), identical in content but stored in separate sets of blocks, and turning them back into reflinks so space can be freed up?
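If an out-of-band deduplicator is acceptable, a hedged sketch with duperemove (the tool is also mentioned elsewhere on this page) looks roughly like this; it hashes extents and asks the kernel to re-share identical ones via the dedupe ioctl. Paths and the hash-file location are placeholders, and flags may differ between duperemove versions.
# Sketch: offline extent deduplication on a mounted btrfs volume.
duperemove -d -r --hashfile=/var/tmp/dedupe.hash \
    /mnt/volume /mnt/volume/.snapshots
# -d actually submits the dedupe requests; without it only a report is printed.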
user149408
(1515 rep)
Jul 26, 2022, 05:23 PM
• Last activity: Feb 3, 2025, 01:52 PM
0
votes
1
answers
65
views
Deduplication tool which is able to only compare two directories against each other?
I tried rdfind and jdupes, and if I specify two directories for them, they both match not only files in one directory against the other directory, but also files within each given directory against each other. I would like to exclude results that reside in a single directory only, and it would also be faster to avoid scanning for them. But I couldn't find an option in either tool that allows this.
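Failing a built-in option, a hedged sketch of the cross-directory-only comparison itself: hash each tree separately and join on the checksum, so duplicates confined to a single directory never show up (bash process substitution, GNU coreutils; paths containing spaces will not survive the awk step).
#!/usr/bin/env bash
# Sketch: report only duplicates that appear in BOTH directories.
hash_tree() {
    find "$1" -type f -exec sha256sum {} + | sort -k1,1
}
join -j 1 <(hash_tree "$1") <(hash_tree "$2") |
    awk '{ print "duplicate:", $2, "<->", $3 }'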
bodqhrohro
(386 rep)
Dec 15, 2024, 07:40 AM
• Last activity: Dec 15, 2024, 03:54 PM
10
votes
2
answers
1881
views
Make tar (or other) archive, with data block-aligned like in original files for better block-level deduplication?
How can one generate a tar file so that the contents of the tarred files are block-aligned as in the original files, so one could benefit from block-level deduplication ( https://unix.stackexchange.com/a/208847/9689 )?
(Am I correct that there is nothing intrinsic to the tar format that prevents us from getting such a benefit? Otherwise, if not tar, is there perhaps another archiver that has such a feature built in?)
P.S. I mean an uncompressed tar (not tar+gz or the like); the question asks for some trick that allows aligning the files at the block level.
As far as I recall, tar was designed for use with tape machines, so maybe adding some extra padding for alignment is possible and easy within the file format?
I hope there might even be a tool for it ;). As far as I recall, tar files can be concatenated, so maybe there is a trick for adding filler to achieve alignment.
Grzegorz Wierzowiecki
(14740 rep)
Apr 16, 2016, 03:52 PM
• Last activity: Oct 12, 2024, 03:14 PM
7
votes
2
answers
2376
views
Does ZFS deduplicate across datasets or only inside a single dataset?
Does ZFS deduplicate across datasets or only inside a single dataset? In other words, if I have two nearly identical volumes, will they be deduplicated?
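A rough way to check it empirically (a sketch, assuming a scratch pool called tank and root access): write the same data into two dedup-enabled datasets and look at the pool-wide dedup ratio.
zfs create -o dedup=on tank/ds1
zfs create -o dedup=on tank/ds2
dd if=/dev/urandom of=/tank/ds1/blob bs=1M count=256
cp /tank/ds1/blob /tank/ds2/blob   # identical content, different dataset
sync
zpool get dedupratio tank          # a ratio near 2.00x means the DDT spans datasets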
Maja Piechotka
(16936 rep)
Jul 28, 2017, 06:31 AM
• Last activity: Aug 1, 2024, 12:53 PM
6
votes
3
answers
4566
views
Finding duplicate files with same filename AND exact same size
I have a huge songs folder with a messy structure and files duplicated across multiple folders.
I need a recommendation for a tool or a script that can find and remove duplicates based on two simple criteria:
1. Exact same file name
2. Exact same file size
In this case, song.mp3 with a file size of 1234 bytes is stored in both /songs/album1 and /songs/albumz. The tool/script should keep only one of the copies.
I have tried czkawka on Fedora, but it can search by either filename or file size, not *both* combined.
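A hedged sketch of the matching step itself, using GNU find to group files by basename plus size and print only the groups with more than one member (review the output before deleting anything; names containing tabs or newlines will confuse it):
#!/bin/sh
# Sketch: list files sharing both basename and size under /songs.
find /songs -type f -printf '%f %s\t%p\n' | sort | awk -F '\t' '
    { count[$1]++; paths[$1] = paths[$1] ORS "  " $2 }
    END {
        for (k in count)
            if (count[k] > 1) print "== " k " ==" paths[k]
    }
'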
Electrifyings
(63 rep)
Nov 21, 2021, 04:22 AM
• Last activity: Apr 4, 2024, 01:02 PM
0
votes
4
answers
160
views
Filtering duplicates with AWK differing by timestamp
Given the list of files ordered by timestamp shown below, I am seeking to retrieve the last occurrence of each file (the one at the bottom of each group).
For example, this input:
archive-daily/document-sell-report-2022-07-12-23-21-02.html
archive-daily/document-sell-report-2022-07-13-23-15-34.html
archive-daily/document-loan-report-2022-07-18-05-12-16.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-deb-report-2022-07-18-13-17-40.html
archive-daily/document-deb-report-2022-07-18-10-04-21.html
would produce something like:
archive-daily/document-sell-report-2022-07-13-23-15-34.html
archive-daily/document-loan-report-2022-07-18-17-07-26.html
archive-daily/document-deb-report-2022-07-18-10-04-21.html
Can I use awk or any other command to achieve this? Thanks in advance.
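A hedged awk sketch: strip the trailing timestamp to form a key, let later lines overwrite earlier ones, and print each key's last line in first-seen order. It assumes every name ends in "-YYYY-MM-DD-hh-mm-ss.html" and that the list sits in a file (files.txt here is a placeholder).
awk '
{
    key = $0
    sub(/-[0-9][0-9][0-9][0-9](-[0-9][0-9])+\.html$/, "", key)  # drop timestamp
    if (!(key in last)) order[++n] = key    # remember first-seen key order
    last[key] = $0                          # later occurrences overwrite
}
END { for (i = 1; i <= n; i++) print last[order[i]] }
' files.txt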
Luciano
(3 rep)
Jul 18, 2022, 09:46 PM
• Last activity: Feb 12, 2024, 10:01 AM
-1
votes
2
answers
132
views
Batch rename of files that share the same prefix
I have a list of files on my server with a prefix that I want to de-dupe. These are completely different generated files.
The generated names seem to follow the pattern
{Title} - {yyyy-MM-dd}_{random} - {Description}.ts
For example:
Camera Recording - 2023-08-11_14 - Front Deck.ts
Camera Recording - 2023-08-11_14 - Back Deck.ts
Camera Recording - 2023-08-16_27 - Front Deck.ts
Camera Recording - 2023-08-16_36 - Front Deck.ts
Camera Recording - 2023-08-17_56 - Front Deck.ts
I need to be able to run a script that identifies duplicate prefixes among the following filenames and then changes the numeric {random} part after the date:
Camera Recording - 2023-08-11_14 - Front Deck.ts
Camera Recording - 2023-08-11_14 - Back Deck.ts
so that they become the following; the {random} part needs to be replaced with a different, unused value (other than 14):
Camera Recording - 2023-08-11_14 - Front Deck.ts
Camera Recording - 2023-08-11_68 - Back Deck.ts
Any advice on how to achieve this with a Linux shell script?
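A hedged bash sketch along those lines: parse out the date and the {random} number, and when a (date, random) pair has been seen before, rename the later file with a fresh unused number. The exact name pattern and the 2-digit random range are assumptions; it prints the mv commands instead of running them.
#!/usr/bin/env bash
# Sketch: re-randomise the "_NN" part when two files collide on "{date}_{NN}".
shopt -s nullglob
declare -A seen
re='^(.* - ([0-9]{4}-[0-9]{2}-[0-9]{2})_)([0-9]+)( - .*\.ts)$'
for f in *.ts; do
    [[ $f =~ $re ]] || continue
    head=${BASH_REMATCH[1]} date=${BASH_REMATCH[2]}
    num=${BASH_REMATCH[3]}  rest=${BASH_REMATCH[4]}
    if [[ -n ${seen[${date}_${num}]} ]]; then
        new=$((RANDOM % 90 + 10))                       # assumed 2-digit range
        while [[ -n ${seen[${date}_${new}]} ]]; do new=$((RANDOM % 90 + 10)); done
        echo mv -- "$f" "${head}${new}${rest}"          # drop "echo" to rename
        seen[${date}_${new}]=1
    else
        seen[${date}_${num}]=1
    fi
done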
mysterio21_troy
(11 rep)
Jan 8, 2024, 08:50 PM
• Last activity: Feb 8, 2024, 10:09 AM
0
votes
1
answers
61
views
Pop OS with a deduplication filesystem
I'm moving a friend's development machine to Linux (PopOS), permanently. Don't worry guys, he dual-booted and he's ready for the tux.
The problem is his drive. It's a 256GB SSD, and he is moving from a rusty 512GB HDD almost full of projects, where most of the space used comes from vendored shared libraries (npm hell, composer, to name a few).
Since package managers download a library and then copy it into the project, I think it would be convenient to use a filesystem that deals with deduplication.
I'm trying to find out whether there is a filesystem that deduplicates files on Linux, like [Microsoft "Dev Drive"](https://learn.microsoft.com/en-us/windows/dev-drive/), that can (hopefully, but not necessarily) also work as a boot drive.
DarkGhostHunter
(101 rep)
Jan 19, 2024, 07:32 PM
• Last activity: Jan 19, 2024, 07:41 PM
0
votes
0
answers
52
views
20+ backup directories, I'd like to dedupe all files to 1 "master directory"
As the title suggests, I have inherited a file structure where there are about 30 "complete or partial backups" of a fileserver full of text files. This obviously makes no sense, and I'd like to run a dedupe on this that produces one "master directory" which contains all the unique files from all the backups. (At which point I can then delete all the backups, and not actually LOSE anything).
Yes, I realize file CHANGES are an issue, and in that case I'd like to keep the most recent file.
I've looked through rdupes, jdupes, and robinhood, and haven't really seen options that do what I need. Did I miss something? Is there a better tool that I haven't seen?
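A hedged sketch of the merge itself, assuming the backups sit side by side under one directory (paths are placeholders): rsync -u skips files that are already newer in the destination, so folding every backup into one master tree keeps the most recent copy of each relative path.
#!/bin/sh
# Sketch: fold every backup into one master tree, newest version wins.
master=/srv/master
mkdir -p "$master"
for b in /srv/backups/*/; do
    rsync -a -u "$b" "$master/"
done
Identically named paths collapse this way; files whose content is duplicated under different paths remain, so a dedupe pass (jdupes, rdfind, etc.) over the master directory may still be worthwhile afterwards.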
Frank Rizzo
(1 rep)
Jan 2, 2024, 04:06 AM
• Last activity: Jan 3, 2024, 10:28 PM
2
votes
2
answers
2923
views
Deduplicating Files while moving them to XFS
I've got a folder on a non-reflink-capable file system (ext4) which I know contains many files with identical blocks in them.
I'd like to move/copy that directory to an XFS file system whilst simultaneously deduplicating it. (I.e. if a block of a copied file is already present in a different file, I'd like to not actually copy it, but to make a second block reference in the new file point to the existing one.)
One option would of course be to first copy all the files over to the XFS filesystem, run duperemove on them there, and thus remove the duplicates after the fact. Small problem: this might get time-intensive, as the target filesystem isn't as quick on random accesses.
Therefore, I'd prefer it if the process that copies the files over already takes care of telling the kernel that, hey, this block is a duplicate of that other block that's already there.
Is such a thing possible?
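To the best of my knowledge there is no standard copy tool that does this, but a hedged sketch for the whole-file case looks like the following: copy each file once, and when a later source file hashes the same, create a reflink clone of the already-copied file on the XFS side instead of copying the data again. Source and destination paths are placeholders, and partially identical files would still need a duperemove pass afterwards.
#!/usr/bin/env bash
# Sketch: copy ext4 -> XFS, turning exact whole-file duplicates into reflinks.
src=/mnt/ext4/data dst=/mnt/xfs/data
declare -A first_copy
cd "$src" || exit 1
while IFS= read -r -d '' f; do
    sum=$(sha256sum "$f" | cut -d' ' -f1)
    mkdir -p "$dst/$(dirname "$f")"
    if [[ -n ${first_copy[$sum]} ]]; then
        # content already on the target: clone it, no data is copied
        cp --reflink=always --preserve=all -- "${first_copy[$sum]}" "$dst/$f"
    else
        cp --preserve=all -- "$f" "$dst/$f"
        first_copy[$sum]=$dst/$f
    fi
done < <(find . -type f -print0)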
Marcus Müller
(47107 rep)
May 1, 2021, 12:15 PM
• Last activity: Dec 19, 2023, 04:02 PM
2
votes
2
answers
3158
views
How to get deduplication on an Ext4 partition used by Debian, Ubuntu and Linux Mint?
Ext4 doesn't support deduplication, unlike e.g. btrfs, bcachefs and ZFS, which offer deduplication as standard.
How can I get deduplication support for Ext4?
Alfred.37
(129 rep)
Oct 9, 2022, 08:50 AM
• Last activity: Dec 1, 2023, 09:15 PM
5
votes
6
answers
961
views
Keep unique values (comma separated) from each column
I have a .tsv (tab-separated columns) file on a Linux system with the following columns, which contain different types of values (strings, numbers) separated by commas:
col1 col2
. NS,NS,NS,true,true
. 12,12,12,13
1,1,1,2 door,door,1,1
I would like to keep only the unique values in each field (unfortunately I tried but couldn't manage it). This would be the output:
col1 col2
. NS,true
. 12,13
1,2 door,1
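A hedged awk sketch (file.tsv is a placeholder name): split every tab-separated field on commas, keep only the first occurrence of each value, and reassemble the field.
awk 'BEGIN { FS = OFS = "\t" }
{
    for (i = 1; i <= NF; i++) {
        n = split($i, v, ",")
        split("", seen)                       # reset the per-field lookup table
        out = ""
        for (j = 1; j <= n; j++)
            if (!seen[v[j]]++)
                out = (out == "" ? v[j] : out "," v[j])
        $i = out
    }
    print
}' file.tsv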
df_v
(51 rep)
Oct 20, 2023, 01:48 PM
• Last activity: Oct 23, 2023, 08:46 AM
0
votes
2
answers
627
views
Standalone Fileserver with deduplication wanted
**Situation:**
I want to reinstall a homelab server (Windows OS) as a Linux-based server.
**Server** | Purpose: Backup System (mostly offline)
I currently have an HP ProLiant MicroServer N54L:
Turion II Neo N54L, 2.2 GHz, 4 GB RAM
https://geizhals.at/a688459.html
**Setup**
6 physical disks (5 HDD, 1 SSD) pooled into a JBOD Storage Space (15.6 TiB)
1 LUN, formatted NTFS
Files are shared via Windows shares (SMB/CIFS)
No special NTFS permissions (since it is just me)
Windows Server 2012 R2 (soon EOL)
Deduplication enabled, which saved almost 4.5 TiB of data
mode = general purpose file server
**Clients**
Clients are mostly Windows, perhaps some Linux in the near future.
They access the server via SMB/CIFS and RDP (for management).
Yeah, the server is slow, but its only purpose is archival: it is mostly turned off, and I sometimes access the data (single user, no parallel access needed). It works OK as it is now.
**Goal**\
Since I want to go for Linux a lot more and Server 2012 R2 is end-of-life, I want to reinstall the system with GNU/Linux, providing the same functionality on the same hardware.
Whenever I read about deduplication, it is always ZFS or Btrfs, but with LOTS of RAM needed, or OpenMediaVault with BorgBackup... but then the clients also need BorgBackup (and the clients will still be Windows).
What would be the nearest equivalent Linux setup?
David
(1 rep)
Aug 9, 2023, 04:36 PM
• Last activity: Aug 10, 2023, 12:53 PM
1
votes
1
answers
77
views
See if any of a number of zip files contains any of the original files in a directory structure
I have a pretty hard problem here.
I have a photo library with a lot of photos in it in various folders.
I then started using Google Photos for my photos, I put those originals into Google Photos, and used it for 5+ years.
Now I want to move away from Google Photos. I have done a Google Takeout of all my photos, and downloaded all the Zip files, ~1.5TB worth of them (150 x ~10GB files).
Now I want to keep my original directory structure, and delete all the files that are duplicated in Google Photos. After this operation, I basically want to have two directories left over each with unique files in them. I can then merge this by hand later.
I have started extracting all the files, and then I will run rmlint to detect duplicates and purge them from the Google Drive data. The problem is that I don't have enough space to manoeuvre all of this around, so I have to extract, say, 30 archives, run rmlint, purge, extract another 30, run rmlint again, purge, and so on. This rescans my original files over and over, and it's going to take a really long time. I already use the --xattr flag for rmlint to try to speed up subsequent runs; see the appendix for the full rmlint command.
How can I do this WITHOUT having to first extract all the archives? Is there a way to just use the file checksums stored in the zip files and compare against those?
Thanks!
Appendix
rmlint \
--xattr \
-o sh:rmlint-photos.sh \
-o json:rmlint-photos.json \
--progress \
--match-basename \
--keep-all-tagged \
--must-match-tagged \
"/mnt/f/GoogleTakeout/" \
// \
"/mnt/e/My Documents/Pictures/" \
Albert
(171 rep)
Jul 28, 2023, 01:25 AM
• Last activity: Jul 28, 2023, 08:24 AM
209
votes
20
answers
78136
views
Is there an easy way to replace duplicate files with hardlinks?
I'm looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hard links to the files in the other directory.
Here's the situation: this is a file server on which multiple people store audio files, each user having their own folder. Sometimes multiple people have copies of the exact same audio files. Right now, these are duplicates. I'd like to turn them into hard links, to save hard drive space.
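Dedicated tools (rdfind -makehardlinks true, jdupes, util-linux hardlink; several appear elsewhere on this page) handle this with more safety checks, but a minimal sketch of the idea, with placeholder paths and assuming no filenames contain quotes or newlines, is:
#!/bin/sh
# Sketch: generate (for review) the ln commands that replace duplicates
# across two users' folders with hard links to the first copy seen.
find /srv/audio/user1 /srv/audio/user2 -type f -exec sha256sum {} + |
    sort -k1,1 |
    awk '
        $1 == prev { printf "ln -f -- \"%s\" \"%s\"\n", keep, substr($0, 67) }
        $1 != prev { prev = $1; keep = substr($0, 67) }
    ' > relink.sh
# inspect relink.sh, then run it with: sh relink.sh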
Josh
(8728 rep)
Oct 12, 2010, 07:23 PM
• Last activity: Jun 7, 2023, 03:16 PM
-1
votes
2
answers
224
views
Join files together without using space in filesystem
I want to join (concatenate) two files on Linux without using additional space in the filesystem. Can I do this?
A + B = AB
The file AB would reuse the sectors or fragments of A and B already on the filesystem. Is it possible to do this?
Could I use gparted to have AB recognised as a new file without copying the two files (which is a slow process)?
ArtEze
(137 rep)
Apr 24, 2023, 12:03 AM
• Last activity: Apr 24, 2023, 07:03 AM
Showing page 1 of 20 total questions