[3573] in cryptography@c2.net mail archive


Re: Using MD5/SHA1-style hashes for document

daemon@ATHENA.MIT.EDU (Antonomasia)
Fri Oct 30 23:53:30 1998

Date: Fri, 30 Oct 1998 22:30:25 GMT
From: Antonomasia <ant@notatla.demon.co.uk>
To: cryptography@c2.net

Brian de Alwis writes:

> Any thoughts about using a document's hash (MD5, SHA1, etc) as a unique
> document identifier or storage index? ...
>                                        It means that you'll only ever have
> one instance of anything in your database, regardless of its title, which is
> very good if you could be sticking in multiple copies of big things.

> Is this an accepted practice? Are there any gotchas I should be aware of?


A checksum makes a good shortcut when comparing long documents.
Once you've determined that a proposed-incoming doc has a sum matching
a sum already in store, only then do you bother with a full comparison of
the docs.  Given the rarity of collisions this is likely to be a
major time saver.  Using a cryptographic hash rather than a CRC or similar
makes it hard for people to deliberately submit different docs with the
same checksum.
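
As a rough illustration of the check described above, here is a minimal
Python sketch.  The DocStore class and its add() method are hypothetical
names made up for this example, the store is an in-memory dict, and
SHA-256 is an assumed choice of hash; none of that comes from the
original question.

```python
import hashlib


class DocStore:
    """Toy in-memory store (hypothetical): digest -> distinct bodies with that digest."""

    def __init__(self):
        self._by_digest = {}

    def add(self, body: bytes) -> bool:
        """Store body unless an identical copy is already held.

        Returns True if the body was newly stored, False if it was a duplicate.
        """
        digest = hashlib.sha256(body).digest()  # assumed hash choice
        bucket = self._by_digest.setdefault(digest, [])
        # A matching digest is only a hint: confirm with a full comparison.
        for existing in bucket:
            if existing == body:
                return False
        # Either a new digest or a genuine collision: keep both bodies.
        bucket.append(body)
        return True
```

The full comparison only runs when a digest is already present, so the
common case of a genuinely new doc costs one hash and one lookup.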

To avoid the "and if it does you're clobbered" outcome should a collision
ever occur, I'd give everything a unique serial number rather than using
the hash as an index.

When a proposed-incoming doc is found to be already stored you can point
the new serial number to the old one to save on the storage, as you seem
to be planning.  Assuming alterations to existing stored docs are prohibited
you still have to think about how expiry dates are derived from the
sources of a multiply-sourced doc.  Or maybe expiry is out too.
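
A minimal sketch of the serial-number scheme, assuming an in-memory
store.  SerialStore, add(), get() and stored_copies() are hypothetical
names for illustration, SHA-256 is an assumed hash, and expiry is left
out entirely.

```python
import hashlib
import itertools


class SerialStore:
    """Toy store (hypothetical): every submission gets its own serial number."""

    def __init__(self):
        self._next_serial = itertools.count(1)
        self._canonical = {}   # serial -> serial that actually holds the body
        self._bodies = {}      # canonical serial -> body bytes
        self._by_digest = {}   # digest -> canonical serials with that digest

    def add(self, body: bytes) -> int:
        """Register a submission and return its new, unique serial number."""
        serial = next(self._next_serial)
        digest = hashlib.sha256(body).digest()  # assumed hash choice
        candidates = self._by_digest.setdefault(digest, [])
        for old in candidates:
            # Digest match is a hint only; confirm byte for byte.
            if self._bodies[old] == body:
                self._canonical[serial] = old  # point new serial at old copy
                return serial
        self._bodies[serial] = body
        self._canonical[serial] = serial
        candidates.append(serial)
        return serial

    def get(self, serial: int) -> bytes:
        """Fetch the body for any serial, shared or not."""
        return self._bodies[self._canonical[serial]]

    def stored_copies(self) -> int:
        """Number of bodies physically held."""
        return len(self._bodies)
```

Because the serial number is the index, two different docs that happened
to share a digest would simply occupy two serials and neither would
overwrite the other.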

It's also often nice to have the document identifier included in the title
page.  This gets hard if the id is a hash of the whole doc (including the
title page).  I assume from your "regardless of its title" that you are
only planning to hash document bodies.
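
One way to picture hashing the body only, so that the id can still be
printed on the title page.  This is a sketch under assumptions: the
"Document ID:" line, the blank-line separator between title page and
body, and the function names are all invented for the example, and
SHA-256 is an assumed hash.

```python
import hashlib

ID_PREFIX = "Document ID: "  # hypothetical title-page field


def make_document(title: str, body: str) -> str:
    """Build a document whose title page carries the hash of the body only."""
    doc_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
    title_page = f"{title}\n{ID_PREFIX}{doc_id}"
    # A blank line separates the title page from the body (assumed layout;
    # the title itself must not contain a blank line).
    return title_page + "\n\n" + body


def verify_document(document: str) -> bool:
    """Check that the id on the title page matches the hash of the body."""
    title_page, sep, body = document.partition("\n\n")
    _, found, stated_id = title_page.rpartition(ID_PREFIX)
    if not sep or not found:
        return False
    return stated_id.strip() == hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Since the id is computed before the title page is assembled, putting it
on the title page does not change the value being hashed.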

Take disk files as an example.  Hashing files (ignoring the name)
would be a saner way to discover whether you have duplicate files on your
disk than to compare every file with every other.
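
The disk-file case can be sketched like this in Python.  The function
names are hypothetical, SHA-256 and the 64 KiB read size are assumed
choices, and symbolic links are skipped for simplicity.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Hash a file's contents (not its name), reading in chunks."""
    h = hashlib.sha256()  # assumed hash choice
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups of paths under root whose contents share a digest."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only regular files; skip symlinks to avoid counting one file twice.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Each file is read once, instead of being compared against every other
file.  The groups returned are candidates: a byte-for-byte comparison
within each group would confirm them before anything is deleted.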


--
##############################################################
# Antonomasia   ant@notatla.demon.co.uk                      #
# See http://www.notatla.demon.co.uk/                        #
##############################################################
