Find file duplicates

Library Kata “Find file duplicates”

Develop a library to spot file duplicates in a file system directory tree.

The contract of the library should be this:

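interface ICheckForDuplicates {
	// First pass: rough comparison by file size and, by default, file name.
	IEnumerable<IDuplicates> Compile_candidates(string folderpath);
	IEnumerable<IDuplicates> Compile_candidates(string folderpath, CompareModes mode);

	// Second pass: confirm candidates by comparing MD5 hashes.
	IEnumerable<IDuplicates> Check_candidates(IEnumerable<IDuplicates> candidates);
}

interface IDuplicates {
	// Paths of the files considered duplicates of each other.
	IEnumerable<string> Filepath {get;}
}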

enum CompareModes {
	Size_and_name, // default
	Size
}

Call Compile_candidates() first. It visits every file in the directory tree below the given folder and compares the files roughly: by default on file name and file size, or on file size only if CompareModes.Size is selected. Files that look the same under the chosen comparison mode are returned as candidate duplicates.
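
The rough first pass could be sketched as a grouping over a recursive file listing. This is only an illustration, not the kata's solution: the class CandidateSketch and the method Compile are hypothetical names, the result is simplified to lists of paths instead of IDuplicates, and file names are compared case-sensitively by assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

enum CompareModes { Size_and_name, Size }

// Hypothetical helper, not part of the kata's contract.
static class CandidateSketch
{
    // Groups all files below folderpath by a rough key (size, or size plus name)
    // and keeps only the groups that contain more than one file.
    public static List<List<string>> Compile(
        string folderpath,
        CompareModes mode = CompareModes.Size_and_name)
    {
        return Directory
            .EnumerateFiles(folderpath, "*", SearchOption.AllDirectories)
            .Select(path => new FileInfo(path))
            // In Size mode the name part of the key is left empty,
            // so only the length decides which group a file lands in.
            .GroupBy(info => (info.Length,
                              Name: mode == CompareModes.Size ? "" : info.Name))
            .Where(group => group.Count() > 1)
            .Select(group => group.Select(info => info.FullName).ToList())
            .ToList();
    }
}
```

Each inner list is one set of candidate duplicates. A file whose size (and name, in the default mode) is unique never shows up, so the second pass does not have to hash it.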

In a second pass, Check_candidates() examines the candidate files more closely by comparing their MD5 hashes [1]. Only candidates whose hashes are equal are returned as real duplicates.
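
A minimal sketch of the hash comparison, assuming one candidate group is given as a plain list of file paths. HashCheckSketch, Md5Of and Check are hypothetical names chosen for illustration. The hashing itself uses the .NET class System.Security.Cryptography.MD5.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, not part of the kata's contract.
static class HashCheckSketch
{
    // Returns the MD5 hash of the file's content as a hex string.
    public static string Md5Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }

    // Splits one candidate group into groups of files with equal hashes.
    // Groups that end up with a single file are not duplicates and are dropped.
    public static List<List<string>> Check(IEnumerable<string> candidate)
    {
        return candidate
            .GroupBy(path => Md5Of(path))
            .Where(group => group.Count() > 1)
            .Select(group => group.ToList())
            .ToList();
    }
}
```

One candidate group can yield several groups of real duplicates, for example when four files share a size but fall into two pairs with different content.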

Variation #1

Add some way to watch the progress of both methods.
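
One possible shape for this, sketched under assumptions: progress is reported through the .NET interface IProgress<T>, once per processed file. ProgressSketch, ForEachWithProgress and SyncProgress are hypothetical names. The built-in Progress<T> class could be passed instead, but it may invoke its callback asynchronously, so a small synchronous reporter is used here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reporter that invokes the callback synchronously on the calling thread.
sealed class SyncProgress<T> : IProgress<T>
{
    private readonly Action<T> _onReport;
    public SyncProgress(Action<T> onReport) { _onReport = onReport; }
    public void Report(T value) { _onReport(value); }
}

// Hypothetical helper, not part of the kata's contract.
static class ProgressSketch
{
    // Applies 'work' to each file path and reports the path once it is done.
    // Passing null for 'progress' switches reporting off.
    public static void ForEachWithProgress(
        IEnumerable<string> filepaths,
        Action<string> work,
        IProgress<string> progress)
    {
        foreach (var path in filepaths)
        {
            work(path);
            progress?.Report(path);
        }
    }
}
```

A caller might pass new SyncProgress<string>(Console.WriteLine) to print each file as it is finished. Both passes could accept such an optional parameter without changing what they return.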

Resources

[1] Wikipedia, MD5, https://en.wikipedia.org/wiki/MD5