Find file duplicates

Library Kata “Find file duplicates”

Develop a library to spot file duplicates in a file system directory tree.

The contract of the library should be this:

First the method Compile_candidates() has to be called. It visits all files in the directory tree passed in and compares them roughly. The default comparison is by filename and file size. If selected, only the size is used. Files looking the same according to the comparison mode are returned as candidate duplicates.
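The first pass could be sketched roughly as follows. This is only an illustration in Python, not the kata's prescribed interface: the snake_case name `compile_candidates`, the `CompareMode` enum, and the return shape (a list of path groups) are assumptions made for the example.

```python
import os
from collections import defaultdict
from enum import Enum


class CompareMode(Enum):
    """Hypothetical comparison modes; the names are assumptions."""
    SIZE_AND_NAME = 1  # default: same filename and same file size
    SIZE = 2           # same file size only


def compile_candidates(root, mode=CompareMode.SIZE_AND_NAME):
    """Group files below root that look alike under the chosen mode."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanished or cannot be read
            key = (name, size) if mode is CompareMode.SIZE_AND_NAME else size
            groups[key].append(path)
    # Only groups with at least two files are candidate duplicates.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

No file content is read in this pass, so it stays cheap even on large trees.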

Only in a second pass are candidate files examined more closely by comparing their MD5 hashes [1]. Candidates are returned as real duplicates only if their hashes are equal.
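A minimal sketch of the second pass, again in Python. The function names `md5_of_file` and `check_candidates` are hypothetical, and the input is assumed to be a list of path groups, one group per set of candidate duplicates.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_candidates(candidates):
    """Keep only those files within each candidate group whose hashes match."""
    duplicates = []
    for group in candidates:
        by_hash = defaultdict(list)
        for path in group:
            by_hash[md5_of_file(path)].append(path)
        # A candidate group may split into several groups of real duplicates.
        duplicates.extend(paths for paths in by_hash.values() if len(paths) > 1)
    return duplicates
```

Reading in chunks keeps memory use flat for large files. Hashing only within a candidate group means files that already differ in size are never compared.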

Variation #1

Add some way to watch the progress of both methods.
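One possible approach is a progress callback that either method invokes after each file it has handled. The sketch below is an assumption, not part of the kata: the helper `report_progress` and the callback signature `(done, total, item)` are invented for illustration.

```python
def report_progress(items, on_progress):
    """Yield each item; call on_progress(done, total, item) once it is handled.

    The callback fires when the consumer asks for the next item, that is,
    after the loop body for the current item has finished.
    """
    items = list(items)  # materialize so that the total is known up front
    total = len(items)
    for done, item in enumerate(items, start=1):
        yield item
        on_progress(done, total, item)
```

A method would wrap its file loop, for example `for path in report_progress(paths, callback): ...`, and the caller decides whether the callback prints a line, updates a progress bar, or does nothing. For the directory walk the total is not known in advance, so that pass would either collect the paths first or report only a running count.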


[1] Wikipedia, MD5