[r6rs-discuss] Hash function return value constraints
lord at emf.net
Fri Apr 3 18:39:14 EDT 2009
On Fri, 2009-04-03 at 13:34 -0700, Brian Harvey wrote:
> Okay, I'm confused,
Yay! Something useful for me to do!
> A hash value is an index into an array.
The hell it is. That's BS. It's just a
string of bits that hopefully has some
useful statistical relationships to the
stuff being hashed.
Each bit of a hash function *ideally* (never
quite, in practice) splits the input - the
domain - into two halves. Is the thing
I'm hashing A or is it B? That's bit 0.
Is the thing I'm hashing C or D? That's
bit 1. Etc. We then demand a lot of
useful statistical properties of the choices
of values for variables A, B, C, D, etc.
Ideally, half of lists hashed have a 0 at
bit 0 (the others having 1 at bit 0).
We ideally hope that all the naturally
occurring hashed object we encounter
have distinct hashes.
We hope that the system of hashing generates
as few bits of hash value as possible, though
this ideal is subordinate to the uniqueness
The "positive" restriction of R6 has a practical
implication. Namely, for many plausible implementations
of Scheme, it essentially says "give up one bit
of hash values or else take a huge performance
> If the
> hash value doesn't fit in a fixnum, then the array doesn't fit in memory,\
As Per pointed out in an indirect reply,
modulo is your friend.
> or even in the virtual address space (by a factor of 4 on a 32-bit machine).
> What am I missing here?
The "perfect" way to hash is to compute a hash value
with enough bits (well enough computed) that all
objects hashed in a given run of the program get a
unique hash value.
Hashing algorithms don't, except in special cases,
promise to be perfect.
Instead, common algorithms (typical of a Scheme
implementation) try to be approximately perfect.
Not everything gets a unique hash value but collisions
A given hash algorithm has a "natural" performance
characteristic on real hardware. The relevant
example: it is easy to give a hashing algorithm
that can give 20 bits or 21 bits or 22 bits of hash
value, depending on how some constant is defined.
In real life, the 20 bit version and 21 bit versions
will run equally fast on a 32 bit CPU. Moreover,
let's (reasonably) assume that the 21 bit version
returns "close to ideal" hash values - that's
normally how these things go. So, we should set
that constant - the number of bits returned - to
the highest number that runs "just as fast as 20 bits"
on our 32 bit CPU.
The r6 sign limitation steals one bit from any
implementation that wants to limit hash values to
More information about the r6rs-discuss