Wanted: Duplicate file detection/elimination for PCs


Bunch

So for over a decade I had a little 1.3 terabyte NAS (a ReadyNAS NV+) that I jammed files on, and it worked like a champ. My PC had a tiny 120 GB SSD for the OS and the things that wouldn't go on the NAS.

It failed. So I installed three 2 terabyte drives in my PC, and in the span of a year, between bad organization, additional video ripping, and (I suspect) the DriveThruRPG app downloading duplicates of things already stored elsewhere, I am now close to running out of space.

I need something that will reliably delete duplicate files and has a good UI.

Any suggestions?
 
If you're not above a bit of scripting you could compute MD5 hashes of the files and then de-dup based on the hashes.

This is a command line tool that will do it. If you wrap it in a script that walks the file system and computes hashes of all the files you could then do something with the output to identify duplicate files.


Alternatively, if you want to install the Linux subsystem for Windows 10 (WSL), it comes with an MD5 hash calculator called md5sum.
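
As a rough sketch of that idea (assuming GNU coreutils, which is what WSL gives you), something like this just lists groups of files with matching MD5 hashes without deleting anything - an MD5 hex digest is 32 characters, so uniq compares only the hash part of each line:

Code:
# list duplicate files in groups, with a blank line between groups
find /mnt/c -type f -exec md5sum '{}' \; | sort | uniq -w32 --all-repeated=separate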

Note also that some block storage (SAN) and file storage (NAS) systems actually do this behind the scenes and only store one unique copy of any given hashed file.
 
I thought about that, but I also figured it was a big enough problem that someone would have built a whole business around it and have solutions for all the common edge cases. If I can't find something that folks recommend I'll do just that.
 
Product-wise, it's mostly done at the storage engine level. Most SAN or NAS equipment does it behind the scenes, at least most of the enterprisey stuff does. You can hash a disk block in hardware pretty quickly with an FPGA or ASIC, so quite often it's done at the disk block level.

I'm pretty sure there is software about that does what you want at a user level, though.
 
Have you considered rsync?
I have, but I'm not an rsync master by any means. It's probably been 15-20 years since I used it. Will it detect duplicate files in different locations?
 
I managed to get banned from an Australian Debian mirror by having rsync set up wrong - it tried to mirror the whole site.
 
I have, but I'm not an rsync master by any means. It's probably been 15-20 years since I used it. Will it detect duplicate files in different locations?
I assume so, if the files have the same name. If you have folder A and folder B of content, create folder C and rsync the files from A and B into C. You could also merge all the files together and write a bash script that checks for files with the same name and deletes the duplicates.
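
Something like this, as a rough sketch (the folder names are just placeholders):

Code:
# merge A and B into C; files from B whose names already exist in C get skipped
mkdir -p C
rsync -av A/ C/
rsync -av --ignore-existing B/ C/

Bear in mind that only collapses files with identical names, not identical content.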
 
Same name isn't always guaranteed, but it should be for a good majority of the files. The one place I could see it being a big issue is files I saved via direct download versus files downloaded through the DriveThruRPG app, which I believe saves them with a version number. Could be wrong there.
 
I managed to get banned from an Australian Debian mirror by having rsync set up wrong - it tried to mirror the whole site.
Before leaving for work one day I set up rsync to back up my home folder... to my home folder. By the time I got home recursive copies had filled my entire HD and there was no space to create a login record. Had to fix that by booting from a live CD and removing the backup file.
 
There's a tool, ViceVersa, that I use some for dedupe. Its primary purpose is to sync two directories, but of course a deduped pair of directories has no overlap at all... I did also at one time have a tool that would search the whole hard drive for duplicate files. The issue is that it's an O(N^2) problem if you compare every file against every other...
 
Which is a problem if you are a digital hoarder...
 
Here we go - a 5-line shell script that does it. This will work from the Linux subsystem on Windows 10 (you can download and install it from the Windows app store). Bear in mind that calculating the hashes is rather computationally intensive and this is single-threaded, so it will take a while to run. Capture the output into a script that you can run to do the actual dirty work. You can also trawl through the script output to see what the dups were.

Code:
find /mnt/c -type f -exec md5sum '{}' \; | sort | sed 's/\/mnt\/c/c:/g' | awk \
'BEGIN {FS="  "; hash=""; ct=1;} \
hash == $1 {ct++;}
hash != $1 {ct=1; hash=$1;}
ct>1 {printf "del \"%s\"\n", $2;}' | sed 's/\//\\/g'

It will output a script that deletes the second or later occurrence of any file with a given hash. Because the list is sorted, identical hashes end up next to each other, so it's more efficient than an O(N^2) pairwise comparison and only calculates the hash once per file.
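
For example, if you save the pipeline above as dedup.sh (the file names here are just placeholders):

Code:
# capture the generated delete commands for review
bash dedup.sh > cleanup.bat
# read through cleanup.bat before running it from a Windows command prompt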

Also, make sure you have a backup before you do any automated tidyup on your file systems.

Edit: changed in-line text to code block to fix the issue noted below.
 
I love a good bash script.
 
I'm not sure whether I would characterise it as a good bash script, but I think this would probably run on sh and ksh as well, and pretty much any version of sed and awk.
 
Hmm, just ran this script from a Linux VM... How would you modify it to show all the versions of a file? I'd like to know which is the original. Maybe use a different command for the "original". Also, it seemed to find a bunch of things that aren't files, but maybe that's a failure of translation of find via the VM interface to the Windows Documents folder.
 
I suspect find uses the file type per stat(2), which is a fairly *nix-specific concept based on inodes. How it reports mounted Windows file systems would be a function of how they got mapped.

Add the following line to the awk script

ct==1 {printf "rem \"%s\"\n", $2;}

Note that it expects the Windows volumes to be mounted under /mnt/c or similar. For Cygwin you would need to change /mnt to /cygdrive. For Gnu Win32 you would want to remove the path swizzling entirely.
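
For example, under Cygwin the first sed in the pipeline would become something like this:

Code:
# Cygwin mounts Windows drives under /cygdrive rather than /mnt
sed 's/\/cygdrive\/c/c:/g'

The full script, with the extra rem line added: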

Code:
find /mnt/c/bmf3 -type f -exec md5sum '{}' \; | sort | sed 's/\/mnt\/c/c:/g' | awk \
'BEGIN {FS="  "; hash=""; ct=1;} \
hash == $1 {ct++;}
hash != $1 {ct=1; hash=$1;}
ct>1 {printf "del \"%s\"\n", $2;}
ct==1 {printf "rem \"%s\" (%s)\n", $2, $1;}' | sed 's/\//\\/g'
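
As a sketch of the output: if c:\bmf3 held four copies of the same file named 1, 2, 3 and 1 2 3 (made-up names; the hash shown is md5sum's value for a file containing "foo"), the generated script would be roughly:

Code:
rem "c:\bmf3\1" (d3b07384d113edec49eaa6238ad5ff00)
del "c:\bmf3\1 2 3"
del "c:\bmf3\2"
del "c:\bmf3\3"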
 
Oh, the troublesome items may be files with spaces in the name...
 
Yep, just verified with a small directory of 4 files containing "foo\n": "1", "2", "3", and "1 2 3"... It finds "1" as the original AND finds "1" as a dupe. One of those "1"s is supposed to be "1 2 3"...
 
And if I wasn't lazy I'd fix it myself... I do have the O'Reilly "sed & awk" book...
 
Seems to work fine with filenames containing spaces when I did it. If there's something funky it's probably in the behaviour of find. FS is hard-coded to two spaces in the awk script because that's the separator between the hash and the filename in the output from md5sum. You might be able to work around it by swizzling the output of md5sum with cut to change the field separator.

I think I see what's happening now. The assignment FS="  " actually has two spaces in the script, but it renders as one space on the web page here. Changing it to a code block fixes this. If you cut and pasted from the page, the field separator would have had just one space, which would misinterpret a filename like '1 2 3' as '1'.
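
For reference, this is the shape of line the awk script is parsing - note the two spaces between the hash and the filename (the hash is md5sum's value for a file containing "foo"):

Code:
$ echo foo > "1 2 3"
$ md5sum "1 2 3"
d3b07384d113edec49eaa6238ad5ff00  1 2 3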
 
Yep, fixed the two spaces and now it looks right. Looking forward to seeing all the dupes I have...

Thanks
 