[Toybox] tsort regrets (really: hashing)

Ray Gardner raygard at gmail.com
Sun Nov 5 19:22:53 PST 2023


On 11/3/23 12:36:30, Rob Landley wrote:
> > As to the hash code, the only part I've taken from Python's
> > implementation is the probe sequence
>
> So... NOT the "linked lists descending from each hash bucket" approach
> then.
>
> > probe = (probe * 5 + 1 + (perturb >>= 5)) & hashmask
> > where hashmask is 2**n-1 (for table size 2**n) and perturb is
> > initially the entire original key hash.
>
> I had a class on this in college 20 years ago, and literally have not used
> it since, because "the hash table can entirely fill up" and "once the hash
> table is more than half full everything starts colliding just from
> displacements", and "picking the right hash algorithm is now _extra_
> magic"...

The chained hash table (linked lists from buckets) can grow indefinitely
but will slow down as the chains get longer, unless there's code to
rebuild it with more buckets.
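
In rough C terms, a chained table looks something like this (a sketch with
made-up names, not toybox code):

  // Each bucket heads a linked list; lookups walk the chain, so things
  // slow down as load grows unless the table is rebuilt with more buckets.
  #include <stdlib.h>
  #include <string.h>

  struct chain_node {
    struct chain_node *next;
    char *key;
  };

  struct chain_table {
    struct chain_node **buckets;
    size_t nbuckets, nentries;
  };

  // Find key, or insert it at the head of its chain if not present.
  struct chain_node *chain_find_or_add(struct chain_table *t,
                                       unsigned long hash, char *key)
  {
    struct chain_node **bp = t->buckets + hash % t->nbuckets, *n;

    for (n = *bp; n; n = n->next)
      if (!strcmp(n->key, key)) return n;
    n = malloc(sizeof(*n));
    n->key = key;
    n->next = *bp;
    *bp = n;
    t->nentries++;    // caller rebuilds when chains get too long
    return n;
  }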

The other approach, "open addressing" (named in 1957), makes an initial
probe and then subsequent probes as needed until an open slot is found. The
table size is either prime or a power of 2. Power of 2 is nice because you
can do the initial probe by masking the original hash, 'and'ing it with
(2**n-1). Another advantage is that when growing it you can just double it,
while growing a prime-size table means finding a suitable larger prime
(though now that I think of it, maybe a smallish table of primes would do
fine, and allow growing by smaller increments. I'll have to try that...)
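
As a sketch (made-up names, not my actual tsort code):

  // The initial probe into a power-of-2 table is just a mask; growing the
  // table is just doubling the size (still 2**n), no hunting for a prime.
  unsigned long initial_probe(unsigned long hash, unsigned long tablesize)
  {
    return hash & (tablesize - 1);   // tablesize must be 2**n
  }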

The probe sequence can be j, j+1 mod m, j+2 mod m,... (for table size m)
but that's terrible. In 1968 Guy de Balbine came up with double hashing,
where the sequence is j+k mod m, j+2k mod m, ... where k is some different
hash of the same key, but must also be non-zero and relatively prime to m.
For prime table size, that's anything from 1 to m-1. For a power-of-two
table, that's any odd number from 1 to m-1. (That's the main reason, I
think, that table sizes are always prime or 2**n.)
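
Something like this, for a power-of-2 table (a sketch with made-up names):

  // Double hashing: h1 picks the first slot, h2 (forced odd) is the stride.
  // An odd stride is relatively prime to m = 2**n, so all m slots get
  // visited before the sequence repeats.
  unsigned long probe_double(unsigned long h1, unsigned long h2,
                             unsigned long m, unsigned long step)
  {
    unsigned long k = h2 | 1;              // any odd stride works
    return (h1 + step * k) & (m - 1);      // step = 0, 1, 2, ...
  }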

> > This is brilliant and probably due to Tim Peters
>
> "probably"? (I'm happy to cite a primary source...)

I tried for an hour or two to find a definitive statement that Tim came up
with the perturb idea. The long comment in the Python source says he
tested various shifts for perturb. A commit by Tim
https://github.com/python/cpython/commit/81673fb5ca25e17a1c1b806c06b046669a566a88
is the best evidence I have that
j = (j * 5 + 1 + (perturb >>= 5)) mod m
is all Tim's. Lots of good info there too.

> > and already copied in
> > other hash table implementations (including vim), so I doubt anyone
> > will mind using the same probe sequence.
>
> And this is better than a constant prime number displacement because
> you're guaranteed to hit every entry in 2*sizeof(table) checks instead of
> some multiple with the prime, and also adjacent displacement chains are
> less likely to connect up?

The j = (j * 5 + 1) mod m part alone (without perturb) will hit all the
slots, as will any constant odd (not necessarily prime) stride.
Whether the constant is 1 or 37 or ... it will still be real slow.
The j*5+1 recurrence works but the 5 is not unique. For a 2**n table size,
j*a+c works if c is odd and (a-1) is a multiple of 4. (Hull-Dobell
theorem, in Knuth vol. 2.) So j*33+1 or j*17+3...
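
A quick throwaway check of that claim (my names, nothing from toybox):

  // Hull-Dobell check for m = 2**n: with c odd and (a-1) a multiple of 4,
  // j = (j*a + c) mod m visits every slot exactly once before repeating.
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
    unsigned m = 1 << 16, a = 33, c = 1, j = 0, i;
    char *seen = calloc(m, 1);

    for (i = 0; i < m; i++) {
      if (seen[j]) { printf("cycle shorter than m!\n"); return 1; }
      seen[j] = 1;
      j = (j * a + c) & (m - 1);
    }
    printf("a=%u c=%u covers all %u slots\n", a, c, m);
    free(seen);
    return 0;
  }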

I ran some tests...

[Digression: My tsort counts all the input symbols, uses that count to set
the table size. Worst case is all symbols unique, so I need the table to
be at least 1 larger than that. But generally the symbols will appear at
least twice in real input, as predecessors and successors, so that table
will be less than half full. Nearly any decent probe scheme will be fast.
I tweaked the sizing code to:
  for (k = 2; k < (nsym < 128 ? 128 : nsym + nsym / 128); k *= 2);
Now 8323580 unique symbols will make a table of 2**23 which will be .992
full, which should stress the probe logic to the max. (8323582 will double
the table size so 8323580 is the max for a table of 2**23.)]

So I made a file of 8323580 unique symbols of 1 to 12 alphanumeric
characters each, and another file with 8323582 so it "unstressed" a 2**24
table.

Load %: .992       .496
With 8323580    8323582     probe sequence
    ========    =======     ==============
       18.10sec    4.26     j * 5 + 1 + (perturb >>= 5) (per Python)
       18.38       5.06     j * 33 + 1 ^ (perturb >>= 5)
       15.98       4.52     j * 33 + 1 ^ (perturb >>= 4)
     1m35.90       6.09     j * 5 + 1
     1m39.20       4.56     j * 33 + 1
     1m37.67       4.48     j * 17 + 3
        8.30       8.04     j + ((hash >> 8) | 1)
  didn't try    2m51.72     j + ((hash >> 23) | 1)
        9.06       6.01     j + ((hash * stringlength) | 1)
  didn't try    3m15.25     j + 37 (constant stride)
  didn't try   83m27.71     j + 1 (classic linear probe, constant stride)

With perturb shifting, the j*a+1 variants are about the same, whether a is 5
or 33, the perturb shift is 5 or 4, or the perturb is mixed in with + or ^.
Note that j*5+1, j*33+1, and j*17+3 (no perturb) are about equal, and bad
with a nearly full table.
Using the original hash shifted 1 byte right works pretty effectively as a
second hash (de Balbine, Knuth's algo D); hash >> 23 was not good.
Similarly, I tried hash*strlen(s) as a secondary hash and got pretty good
results as well. The constant stride of 37 did badly and the linear probe
(stride 1) was just terrible. (Yet linear probing is supposed to be decent
up to 50% full, per Knuth? Maybe need to investigate further.)

> With >>5 you need a table size over a million to more than 4 nonlinear
> displacements in a given chain. It's not THAT much protection. Still seems
> input data dependent. A hash table without a good hash function goes
> septic quick, and "good" is contextual. (...)

The point of the Python approach was to work well with a "bad" hash
function, as Python's is bad from the usual viewpoint. E.g. integers are
their own hash: h(1) is 1, h(2) is 2, etc. So the best approach for Python
may not be best with a different hash function. I'm using D. Bernstein's
djb2a (djb2 modified). (BTW it looks like pending/diff.c tried to use that,
but there's a transposition in the initial constant: they have 5831, it
should be 5381.)
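
For reference, djb2a is just:

  // djb2a string hash: Bernstein's djb2 with xor instead of add.
  // Note the initial constant is 5381, not 5831.
  unsigned long djb2a(char *s)
  {
    unsigned long h = 5381;

    while (*s) h = (h * 33) ^ (unsigned char)*s++;
    return h;
  }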

> I didn't make it all the way through the 200+ lines of description at the
> top, not because it's hard to read but because they thought it NEEDED that
> much explanation.
>
> Remember my gripe at the end of
> https://landley.net/notes-2023.html#12-09-2023 where I was complaining
> that I'd felt the need to write a long explanation of the code as a
> comment in the source file? That meant (to me) that the code was still
> wrong.

See https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01701.html
for a different opinion on that, and on that comment in particular.

> Supporting deletion in-place in the linked list approach was reasonably
> straightforward. I remember that deleting in the table walk version was
> fiddly but I don't remember the solution.
> [...]
> Hang on, I think I asked the grad student about it. You could either read
> _backwards_ through displacements (not feasible when your displacements
> aren't constant) or having whiteouts as in a "there was something here so
> keep reading" marker commemorating removed entries so the table could
> never actually go back to a clean state. (Because with a variable stride
> even reading through a full-to-end whiteout list with no hits doesn't mean
> you can REMOVE the whiteouts, because you essentially have cross-linked
> lists. You have to garbage collect your table, and generally you allocate
> a BIGGER table and re-hash every entry into the new one to get a clean
> table once you've hit a certain pathological size, and I went "I'll let
> the library do this for me"...)

Yes, that's pretty much it. Mark the deleted entries, rebuild a bigger
table when needed. Most systems expand when the table is about 60%-80%
full, including deleted slots.
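
Deletion with markers looks roughly like this (a sketch with made-up names,
using the same probe recurrence as above, not toybox code):

  // A deleted slot gets a tombstone so later lookups keep probing past it;
  // tombstones only go away when the table is rebuilt, which is why most
  // implementations rebuild once live + deleted slots pass ~60%-80% of m.
  #include <string.h>

  #define TOMBSTONE ((char *)-1)

  int table_remove(char **slot, unsigned long m, unsigned long hash,
                   char *key)
  {
    unsigned long j = hash & (m - 1), perturb = hash;

    while (slot[j]) {                       // stop at a truly empty slot
      if (slot[j] != TOMBSTONE && !strcmp(slot[j], key)) {
        slot[j] = TOMBSTONE;                // mark, don't empty: later keys
        return 1;                           // may have probed through here
      }
      perturb >>= 5;
      j = (j * 5 + 1 + perturb) & (m - 1);
    }
    return 0;                               // not present
  }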

> I'm happy to do a proper version in lib/ if we decide we need this. As I
> said, tar could also use this and I'm pretty sure I've got a couple more
> in the todo heap. (It's just "tsort go fast" that seemed a strange entry
> point to this round of optimization...)

It's just kind of hard to get a decent generalized hash table system that
handles expansion and deletions without having a lot of code. My
forthcoming awk hash table uses the code from my expanding sequential
lists and my dynamic ref-counted strings, so I doubt it's general enough
to be useful in your lib.

> Hmmm... a 1024 entry table that garbage collects to a 32k entry table
> which garbage collects to a million table entry... That adds 5 bits each
> time so the perturb thingy gets 1 extra stride meaning the collisions get
> re-jiggered and at least the first 2 garbage collections would be
> negligible latency spikes (and if you gc from a million that's on you).

Not sure if you're getting what the perturb thingy is. It's just the full
original hash; some of its upper bits get mixed into the stride at each
step, but it eventually shifts down to 0, so the remaining steps are just
the j*5+1 mod m part. As shown in my table above, that does pretty well for
a half-full table and badly for a full table, but Python never lets the
table get past about 2/3 full.
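
In loop form the idea is roughly this (a sketch, not Python's actual C):

  // perturb starts as the full hash; each step shifts it down and mixes it
  // into the stride, so after enough steps perturb is 0 and the recurrence
  // degenerates to plain j = j*5 + 1 mod m.
  unsigned long find_slot(char **slot, unsigned long m, unsigned long hash)
  {
    unsigned long j = hash & (m - 1), perturb = hash;

    while (slot[j]) {             // probe until an empty slot (or, in a
                                  // real lookup, a matching key) is found
      perturb >>= 5;
      j = (j * 5 + 1 + perturb) & (m - 1);
    }
    return j;
  }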

Some other interesting links: This guy
https://www.andreinc.net/2021/11/08/a-tale-of-java-hash-tables
did a benchmark of Java's chained hashing HashMap against a bunch of
different open-addressing versions he implemented, and the original
HashMap came out on top. I need to look into that. Then the Hacker News
guys commented on it (https://news.ycombinator.com/item?id=29319151) and I
think mostly preferred open-addressing.

There's a good comparison of hashing algos (not the table handling, but
the string hashing) at
https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed.

Ray

