[3574] in cryptography@c2.net mail archive


Re: Using MD5/SHA1-style hashes for document

daemon@ATHENA.MIT.EDU (David R. Conrad)
Sat Oct 31 00:07:52 1998

Date: Fri, 30 Oct 1998 19:32:33 -0500 (EST)
From: "David R. Conrad" <drc@adni.net>
To: cryptography@c2.net
In-Reply-To: <1998Oct29.182100.1250.967567@otismtp.ott.oti.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 29 Oct 1998, Brian de Alwis OTT wrote:

> Any thoughts about using a document's hash (MD5, SHA1, etc) as a unique 
> document identifier or storage index? The chance of having two random 
> documents hash to the same value is very small (1 in 10^19 according to the 
> RSA data-sheets) and seems acceptable. It means that you'll only ever have 
> one instance of anything in your database, regardless of its title, which is 
> very good if you could be sticking in multiple copies of big things.

10^19 is about 2^64, so I assume that figure reflects a birthday attack on
a 128-bit hash such as MD4 or MD5.  If you use SHA-1 (and you should), the
hash is 160 bits and the figure becomes about 1 in 10^24 (2^80).  It seems
to me, though, that the real problem is going to be the real-world details.
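
To put rough numbers on the birthday figures above, here is a minimal
sketch in Python.  It uses the standard birthday approximation and assumes
the hash output behaves like a uniformly random value; the function names
are my own, chosen for illustration.

```python
import math


def collision_probability(num_docs: int, hash_bits: int) -> float:
    """Approximate chance that any two of num_docs random inputs collide.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)),
    assuming hash outputs are uniformly distributed.
    """
    exponent = -num_docs * (num_docs - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


def docs_for_even_odds(hash_bits: int) -> float:
    """Roughly how many random documents give a 50% chance of a collision."""
    return math.sqrt(2 * math.log(2) * 2.0 ** hash_bits)


if __name__ == "__main__":
    for bits in (128, 160):
        print(bits, "bits:",
              "p(collision among 1e9 docs) =", collision_probability(10**9, bits),
              "| docs for even odds ~", docs_for_even_odds(bits))
```

The even-odds count lands a little above 2^64 for a 128-bit hash and a
little above 2^80 for a 160-bit one, which is where the 10^19 and 10^24
figures come from.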

If someone submits a document, and later someone else submits a new
version with a comma changed to a semicolon, or just with the whitespace
reformatted, the hash, and thus the document id, will be completely
different.  Is that what you want?
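
A small illustration of that point, assuming the document id is simply the
hex SHA-1 digest of the UTF-8 text (doc_id and the sample sentences are
made up for this sketch):

```python
import hashlib


def doc_id(text: str) -> str:
    """Hypothetical document id: hex SHA-1 digest of the UTF-8 encoded text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


original = "Hashes are short, fixed-size fingerprints."
revised = "Hashes are short; fixed-size fingerprints."  # comma -> semicolon

id_a = doc_id(original)
id_b = doc_id(revised)

# Count how many of the 40 hex digits differ between the two ids.
differing = sum(a != b for a, b in zip(id_a, id_b))

if __name__ == "__main__":
    print(id_a)
    print(id_b)
    print("hex digits that differ:", differing, "of", len(id_a))
```

The two ids share nothing useful, so the database would treat the revision
as an unrelated document.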

If not, you also have to think about converting documents to some
canonical form before hashing, and perhaps doing diffs of documents or
using some other method to calculate the "edit distance" between them in
order to detect related, revised, and reformatted items.
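
As a rough sketch of both ideas: the canonical form here only collapses
whitespace, which is an assumption on my part (a real system would have to
decide what counts as "the same document"), and difflib's similarity ratio
stands in for a true edit distance.  All names are illustrative.

```python
import difflib
import hashlib


def canonicalize(text: str) -> str:
    """Collapse every run of whitespace to one space and trim the ends."""
    return " ".join(text.split())


def canonical_id(text: str) -> str:
    """Hex SHA-1 digest of the canonical form, so reformatting keeps the id."""
    return hashlib.sha1(canonicalize(text).encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib; a stand-in for edit distance."""
    matcher = difflib.SequenceMatcher(None, canonicalize(a), canonicalize(b))
    return matcher.ratio()


if __name__ == "__main__":
    first = "The quick brown fox,\n  jumps over the lazy dog."
    reflowed = "The quick brown fox, jumps over\nthe lazy dog."
    edited = "The quick brown fox; jumps over the lazy dog."

    print(canonical_id(first) == canonical_id(reflowed))  # same after reflow
    print(canonical_id(first) == canonical_id(edited))    # differs after edit
    print(similarity(first, edited))                      # close to 1.0
```

Reformatted copies then share an id, while a high similarity score between
two different ids flags a probable revision for further checking.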

(Note: if you wanted to decrease the odds of a match still further you
could concatenate the results of several hashes (say, SHA-1 and HAVAL, for
example), but that wouldn't seem to be necessary.)
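
For what it's worth, the concatenation is simple to sketch.  Python's
hashlib has no HAVAL, so SHA-256 is used below purely as a stand-in for the
second hash; combined_id is a made-up name.

```python
import hashlib


def combined_id(data: bytes) -> str:
    """Concatenate two independent digests of the same data.

    SHA-256 is only a stand-in for HAVAL, which hashlib does not provide.
    """
    first = hashlib.sha1(data).hexdigest()     # 160 bits -> 40 hex digits
    second = hashlib.sha256(data).hexdigest()  # 256 bits -> 64 hex digits
    return first + second


if __name__ == "__main__":
    print(combined_id(b"example document"))
```

Two documents would then have to collide under both hashes at once to
share an id.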

David R. Conrad <drc@adni.net>

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 5.0i for non-commercial use
Charset: noconv

iQA/AwUBNjparYPOYu8Zk+GuEQLNbwCgspTCwYKIxybuMW9irxb9M7H1vQgAn1qc
LNtAp1yDoUFYTz0qRbNc+QC7
=+Rtg
-----END PGP SIGNATURE-----

