Wanted: Duplicate file detection/elimination for PCs


Bunch

So for over a decade I had a little 1.3 terabyte NAS (a ReadyNAS NV+) that I jammed files on, and it worked like a champ. My PC had a tiny 120 GB SSD for the OS and the things that wouldn't go on the NAS.

It failed. So I installed three 2 terabyte drives in my PC, and in the span of a year, between bad organization, additional video ripping, and (I suspect) the DriveThruRPG app downloading duplicates of things already stored elsewhere, I am now close to running out of space.

I need something that will reliably delete duplicate files and has a good UI.

Any suggestions?
 
If you're not above a bit of scripting you could compute MD5 hashes of the files and then de-dup based on the hashes.

This is a command line tool that will do it. If you wrap it in a script that walks the file system and computes hashes of all the files you could then do something with the output to identify duplicate files.


Alternatively, if you want to install the Linux subsystem for Windows 10 (WSL), it comes with an MD5 hash calculator called md5sum.
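
As a rough sketch of that idea (assuming GNU coreutils, which is what WSL gives you), something like this just lists groups of files with matching MD5 hashes without deleting anything - an MD5 hex digest is 32 characters, so uniq compares only the hash part of each line:

Code:
# list duplicate files in groups, with a blank line between groups
find /mnt/c -type f -exec md5sum '{}' \; | sort | uniq -w32 --all-repeated=separate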

Note also that some block storage (SAN) and file storage (NAS) systems actually do this behind the scenes and only store one unique copy of any given hashed file.
 
I thought about that, but I also figured it was a big enough problem that someone would have built a whole business around it and have solutions for all the common edge cases. If I can't find something that folks recommend I'll do just that.
 
Product-wise, it's mostly done at the storage engine level. Most SAN or NAS equipment does it behind the scenes, at least most of the enterprisey stuff does. You can hash a disk block in hardware pretty quickly with an FPGA or ASIC, so quite often it's done at the disk block level.

I'm pretty sure there is software about that does what you want at a user level, though.
 
Have you considered rsync?
I have, but I'm not an rsync master by any means. It's probably been 15-20 years since I used it. Will it detect duplicate files in different locations?
 
I managed to get banned from an Australian Debian mirror by having rsync set up wrong - it tried to mirror the whole site.
 
I have, but I'm not an rsync master by any means. It's probably been 15-20 years since I used it. Will it detect duplicate files in different locations?
I assume so, if the files have the same name. If you have folder A and folder B of content, create folder C and rsync the files from A and B into C. You could also merge all the files together and write a bash script that checks for files with the same name and deletes the duplicates.
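
Something like this, as a rough sketch (the folder names are just placeholders):

Code:
# merge A and B into C; files from B whose names already exist in C get skipped
mkdir -p C
rsync -av A/ C/
rsync -av --ignore-existing B/ C/

Bear in mind that only collapses files with identical names, not identical content.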
 
Same name isn't always guaranteed, but it should be for a good majority of the files. The one place I could see it being a big issue is files I saved via direct download versus files downloaded through the DriveThruRPG app, which I believe saves them with a version number. Could be wrong there.
 
I managed to get banned from an Australian Debian mirror by having rsync set up wrong - it tried to mirror the whole site.
Before leaving for work one day I set up rsync to back up my home folder... to my home folder. By the time I got home recursive copies had filled my entire HD and there was no space to create a login record. Had to fix that by booting from a live CD and removing the backup file.
 
There's a tool, ViceVersa, that I use some for dedupe. Its primary purpose is to sync two directories, but of course a deduped pair of directories has no overlap at all... I did also at one time have a tool that would search the whole hard drive for duplicate files. The issue is that it's an O(N^2) problem if you compare every file against every other...
 
Which is a problem if you are a digital hoarder...
 
Here we go - a 5-line shell script that does it. This will work from the Linux subsystem on Windows 10 (you can download and install it from the Windows app store). Bear in mind that calculating the hashes is rather computationally intensive and this is single-threaded, so it will take a while to run. Capture the output into a script that you can run to do the actual dirty work. You can also trawl through the script output to see what the dups were.

Code:
find /mnt/c -type f -exec md5sum '{}' \; | sort | sed 's/\/mnt\/c/c:/g' | awk \
'BEGIN {FS="  "; hash=""; ct=1;} \
hash == $1 {ct++;}
hash != $1 {ct=1; hash=$1;}
ct>1 {printf "del \"%s\"\n", $2;}' | sed 's/\//\\/g'

It will output a script that deletes the second or later occurrence of any file with a given hash. Because the list is sorted, identical hashes end up next to each other, so it's more efficient than an O(N^2) pairwise comparison and only calculates the hash once per file.
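
For example, if you save the pipeline above as dedup.sh (the file names here are just placeholders):

Code:
# capture the generated delete commands for review
bash dedup.sh > cleanup.bat
# read through cleanup.bat before running it from a Windows command prompt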

Also, make sure you have a backup before you do any automated tidyup on your file systems.

Edit: changed in-line text to code block to fix the issue noted below.
 
I love a good bash script.
 
I'm not sure whether I would characterise it as a good bash script, but I think this would probably run on sh and ksh as well, and pretty much any version of sed and awk.
 
Hmm, just ran this script from a Linux VM... How would you modify it to show all the versions of a file? I'd like to know which is the original. Maybe use a different command for the "original". Also, it seemed to find a bunch of things that aren't files, but maybe that's a failure of translation of find via the VM interface to the Windows Documents folder.
 
I suspect find uses the file type per stat(2), which is a fairly *nix-specific concept based on inodes. How it reports mounted Windows file systems would be a function of how they got mapped.

Add the following line to the awk script

ct==1 {printf "rem \"%s\"\n", $2;}

Note that it expects the Windows volumes to be mounted under /mnt/c or similar. For Cygwin you would need to change /mnt to /cygdrive. For Gnu Win32 you would want to remove the path swizzling entirely.
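
For example, under Cygwin the first sed in the pipeline would become something like this:

Code:
# Cygwin mounts Windows drives under /cygdrive rather than /mnt
sed 's/\/cygdrive\/c/c:/g'

The full script, with the extra rem line added: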

Code:
find /mnt/c/bmf3 -type f -exec md5sum '{}' \; | sort | sed 's/\/mnt\/c/c:/g' | awk \
'BEGIN {FS="  "; hash=""; ct=1;} \
hash == $1 {ct++;}
hash != $1 {ct=1; hash=$1;}
ct>1 {printf "del \"%s\"\n", $2;}
ct==1 {printf "rem \"%s\" (%s)\n", $2, $1;}' | sed 's/\//\\/g'
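
As a sketch of the output: if c:\bmf3 held four copies of the same file named 1, 2, 3 and 1 2 3 (made-up names; the hash shown is md5sum's value for a file containing "foo"), the generated script would be roughly:

Code:
rem "c:\bmf3\1" (d3b07384d113edec49eaa6238ad5ff00)
del "c:\bmf3\1 2 3"
del "c:\bmf3\2"
del "c:\bmf3\3"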
 
Oh, the troublesome items may be files with spaces in the name...
 
Yep, just verified with a small directory of 4 files containing "foo\n": "1", "2", "3", and "1 2 3"... It finds "1" as the original AND finds "1" as a dupe. One of those "1"s is supposed to be "1 2 3"...
 
And if I wasn't lazy I'd fix it myself... I do have the O'Reilly "sed & awk" book...
 
Seems to work fine with filenames containing spaces when I did it. If there's something funky it's probably in the behaviour of find. FS is hard-coded to two spaces in the awk script because that's the separator between the hash and the filename in the output from md5sum. You might be able to work around it by swizzling the output of md5sum with cut to change the field separator.

I think I see what's happening now. The assignment FS="  " actually has two spaces in the script, but it renders as one space on the web page here. Changing it to a code block fixes this. If you cut and pasted from the page, the field separator would have had just one space, which would misinterpret a filename like '1 2 3' as '1'.
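
For reference, this is the shape of line the awk script is parsing - note the two spaces between the hash and the filename (the hash is md5sum's value for a file containing "foo"):

Code:
$ echo foo > "1 2 3"
$ md5sum "1 2 3"
d3b07384d113edec49eaa6238ad5ff00  1 2 3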
 
Yep, fixed the two spaces and now it looks right. Looking forward to seeing all the dupes I have...

Thanks
 