Ability to detect duplicate folders #4

Open
Boscop opened this issue Jul 4, 2017 · 2 comments

@Boscop
Boscop commented Jul 4, 2017

If all files of a folder have dupes in another folder, the output gets very verbose, and it's not obvious from looking at it that a whole folder is duplicated. It would be very helpful if fddf could summarize that case as folder dupes (or as one folder being a subset of another).
My primary use case is figuring out which files I can/should delete. If I could decide at the level of folders, that would reduce the time it takes to sort through all the dupes.
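
To make the request concrete, here is a rough sketch of how per-file duplicate groups could be rolled up into a folder-level summary. It is not fddf code: `covered_folders`, its parameters, and the "direct children only" rule are all assumptions made for illustration. It reports a pair `(a, b)` when every scanned file directly inside folder `a` has a duplicate directly inside folder `b`, i.e. `a` is a subset of `b`.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::path::{Path, PathBuf};

/// Hypothetical helper: returns `(a, b)` pairs where every scanned file
/// directly inside folder `a` has a duplicate directly inside folder `b`.
/// `all_files` is every scanned file; `groups` are the duplicate groups.
fn covered_folders(all_files: &[PathBuf], groups: &[Vec<PathBuf>]) -> Vec<(PathBuf, PathBuf)> {
    // Number of scanned files per parent folder (direct children only).
    let mut files_in: BTreeMap<&Path, usize> = BTreeMap::new();
    for f in all_files {
        if let Some(parent) = f.parent() {
            *files_in.entry(parent).or_insert(0) += 1;
        }
    }

    // covered[(a, b)] = files in `a` that have at least one duplicate in `b`.
    let mut covered: BTreeMap<(&Path, &Path), BTreeSet<&Path>> = BTreeMap::new();
    for group in groups {
        for x in group {
            for y in group {
                if let (Some(a), Some(b)) = (x.parent(), y.parent()) {
                    if a != b {
                        covered.entry((a, b)).or_default().insert(x.as_path());
                    }
                }
            }
        }
    }

    // Keep only pairs where *all* files of `a` are covered by `b`.
    covered
        .into_iter()
        .filter(|((a, _), files)| files_in.get(a) == Some(&files.len()))
        .map(|((a, b), _)| (a.to_path_buf(), b.to_path_buf()))
        .collect()
}
```

If both `(a, b)` and `(b, a)` are reported, the two folders have the same direct file contents; if only `(a, b)` is reported, `a` could be deleted without losing any file that is not also in `b`.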

Btw, here's a result I got. It took 12 mins and consumed 70 MB RAM on Win 8.1 64-bit. Most files in that folder are small (<100 KB, and the larger ones aren't much larger):

Overall results:
    16963 groups of duplicate files
    32744 files are duplicates
    1.2 GiB of space taken by dupliates

@birkenfeld
Owner

Good idea! I'll consider this for the next version.

As for the results, the timing depends almost entirely on I/O speed if the files aren't hot (i.e. not already in the OS file cache). A second run should be faster, although that depends on OS caching behavior, which I don't know very well for Windows.

@Boscop
Author

Boscop commented May 5, 2021

I'm still very interested in this feature :)
The way it could work is that each folder gets a hash computed from the hashes of its contents (files and subfolders).
You could then detect duplicate folders by storing the folder hashes in a HashMap<Hash, PathBuf> (pseudo code) and iterating over all folders: if a folder's hash already exists in the HashMap under a different path, it's a duplicate folder.
(Or HashMap<Hash, Vec<PathBuf>> to aggregate all duplicate folders for each hash.)
This would only find exact dups, which would be enough for my use case (deduplicating backed-up folders from years of unorganized manual backups).
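
A minimal sketch of that idea, using only the standard library, might look like the following. This is not how fddf works internally; `hash_dir` and `duplicate_dirs` are hypothetical names, and `DefaultHasher` is just a stand-in for whatever content hash a real tool would use (it is not collision-resistant). Including entry names in the folder hash is also an assumption; hashing only the child hashes would treat renamed files as equal.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical: hashes `dir` bottom-up and records every folder's hash in
/// `index`, i.e. the HashMap<Hash, Vec<PathBuf>> variant described above.
fn hash_dir(dir: &Path, index: &mut HashMap<u64, Vec<PathBuf>>) -> io::Result<u64> {
    let mut entries = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        let file_type = entry.file_type()?;
        let child_hash = if file_type.is_dir() {
            hash_dir(&path, index)?
        } else if file_type.is_file() {
            // Stand-in content hash: reads the whole file into memory.
            let mut h = DefaultHasher::new();
            fs::read(&path)?.hash(&mut h);
            h.finish()
        } else {
            continue; // this sketch skips symlinks and special files
        };
        entries.push((entry.file_name(), file_type.is_dir(), child_hash));
    }
    // Sort so the result does not depend on directory iteration order.
    entries.sort();
    let mut h = DefaultHasher::new();
    entries.hash(&mut h);
    let dir_hash = h.finish();
    index.entry(dir_hash).or_default().push(dir.to_path_buf());
    Ok(dir_hash)
}

/// Returns groups of folders under `root` that share the same folder hash.
fn duplicate_dirs(root: &Path) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut index = HashMap::new();
    hash_dir(root, &mut index)?;
    let mut groups: Vec<Vec<PathBuf>> =
        index.into_values().filter(|g| g.len() > 1).collect();
    for g in &mut groups {
        g.sort();
    }
    groups.sort();
    Ok(groups)
}
```

Two caveats with this sketch: all empty folders hash the same and so get reported as one group, and when two folders are duplicates their matching subfolders are reported as separate groups too. A real implementation would probably want to report only the topmost duplicate folders.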
For detecting almost-dups, it would be better to compare each folder with other folders of the same name.
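
As a rough illustration of the same-name comparison, the sketch below groups folders by their final path component and scores each pair by the Jaccard similarity of their file-hash sets. Jaccard similarity is my choice of measure here, not something fddf provides, and `near_duplicates`, the `Folder` alias, and the `threshold` parameter are hypothetical. It assumes the set of content hashes for each folder has already been computed.

```rust
use std::collections::{BTreeMap, HashSet};
use std::ffi::OsStr;
use std::path::PathBuf;

/// A folder path plus the content hashes of the files it contains
/// (assumed to be computed elsewhere).
type Folder = (PathBuf, HashSet<u64>);

/// Jaccard similarity of two hash sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0; // two empty folders count as identical
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Compares only folders that share the same name and returns the pairs
/// whose similarity is at least `threshold`, with their score.
fn near_duplicates(folders: &[Folder], threshold: f64) -> Vec<(PathBuf, PathBuf, f64)> {
    let mut by_name: BTreeMap<&OsStr, Vec<&Folder>> = BTreeMap::new();
    for folder in folders {
        if let Some(name) = folder.0.file_name() {
            by_name.entry(name).or_default().push(folder);
        }
    }

    let mut out = Vec::new();
    for group in by_name.values() {
        for (i, a) in group.iter().enumerate() {
            for b in &group[i + 1..] {
                let score = jaccard(&a.1, &b.1);
                if score >= threshold {
                    out.push((a.0.clone(), b.0.clone(), score));
                }
            }
        }
    }
    out
}
```

Restricting the comparison to same-name folders keeps the number of pairs small, at the cost of missing near-duplicates that were renamed between backups.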
(Another approach would be using algorithms for graph similarity / subgraph matching.)
