
Re: Using MD5/SHA1-style hashes for document

daemon@ATHENA.MIT.EDU (Phil Karn)
Tue Nov 3 20:24:07 1998

Date: Tue, 3 Nov 1998 16:01:09 -0800 (PST)
From: Phil Karn <karn@qualcomm.com>
To: ant@notatla.demon.co.uk
CC: cryptography@c2.net, karn@qualcomm.com
In-reply-to: <199810302230.WAA03304@notatla.demon.co.uk> (message from
	Antonomasia on Fri, 30 Oct 1998 22:30:25 GMT)

>Take disk files as an example.  Hashing files (ignoring the name)
>would be a saner way to discover whether you have duplicate files on your
>disk than to compare every file with every other.
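
In outline, that hash-every-file approach looks something like this
(a Python sketch purely for illustration; the names are mine, not
code from any real utility):

    import hashlib
    import os
    from collections import defaultdict

    def md5_of_file(path, chunk_size=1 << 20):
        """Hash a file's contents in chunks so big files don't have
        to fit in memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates_by_hashing(root):
        """Group every regular file under root by its MD5 digest."""
        groups = defaultdict(list)
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    groups[md5_of_file(path)].append(path)
        # Any digest shared by more than one path marks a duplicate set.
        return [paths for paths in groups.values() if len(paths) > 1]

Note that this reads every file in full just to compute its digest.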

I actually played with this many years ago when I wrote a utility
to traverse a UNIX file system looking for duplicate files and to link
the duplicates together to recover disk space.

My first program produced an MD5 hash of every file and sorted the hash
results, just as you suggest. That worked, but I later came up with a
better (faster) algorithm that doesn't involve hashing. I simply
quicksort the list of files using a comparison function that, as a
side effect, links duplicate files together. Then I rescan the sorted
list comparing adjacent entries to mop up any remaining duplicates.

The vast majority of duplicates are found and deleted by the comparisons
in the quicksort phase; relatively few are found by the subsequent
rescan. I also applied the obvious performance enhancements, such as
caching the results of fstat() operations in memory.

This algorithm was considerably faster than hashing because a typical
filesystem has many files with unique sizes, and some of these files
can be quite big. Once the comparison function discovers that two
files have different sizes, there's no need to go any further. You
don't need to hash them or even compare their contents. This is a big
win when
you have lots of large files.
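
Put together, the sort-based scheme looks roughly like this (again a
Python sketch, purely illustrative: the names are mine, and it only
reports duplicate pairs rather than linking them together the way my
utility did):

    import os
    from functools import cmp_to_key

    def walk_regular_files(root):
        """Collect (path, size) for every regular, non-symlink file.
        Recording the size once up front plays the role of caching
        the fstat() results in memory."""
        files = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    files.append((path, os.path.getsize(path)))
        return files

    def compare_contents(path_a, path_b, chunk_size=1 << 20):
        """memcmp-style ordering of two files' contents: -1, 0 or 1."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                ca, cb = fa.read(chunk_size), fb.read(chunk_size)
                if ca != cb:
                    return -1 if ca < cb else 1
                if not ca:          # both hit EOF without differing
                    return 0

    def find_duplicates_by_sorting(root):
        files = walk_regular_files(root)
        duplicates = set()          # unordered pairs found identical

        def compare(a, b):
            # Cheap test first: files of different sizes cannot be
            # identical, so order by size without ever opening them.
            if a[1] != b[1]:
                return -1 if a[1] < b[1] else 1
            # Same size: order by content.  Equal content means a
            # duplicate, recorded as a side effect of the sort.
            result = compare_contents(a[0], b[0])
            if result == 0:
                duplicates.add((min(a[0], b[0]), max(a[0], b[0])))
            return result

        files.sort(key=cmp_to_key(compare))

        # Mop-up pass: sorting by (size, content) leaves identical
        # files adjacent, so scanning neighbouring entries pairs up
        # any duplicate the sort's own comparisons happened to miss.
        for (p1, s1), (p2, s2) in zip(files, files[1:]):
            if s1 == s2 and compare_contents(p1, p2) == 0:
                duplicates.add((min(p1, p2), max(p1, p2)))
        return duplicates

The only code that reads file contents is compare_contents(), and it
only ever runs on two files whose sizes already match.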

Phil


