[Toybox] tsort regrets (really: hashing)

Rob Landley rob at landley.net
Sun Nov 5 21:54:14 PST 2023


On 11/5/23 21:22, Ray Gardner wrote:
> On 11/3/23 12:36:30, Rob Landley wrote:
>> > As to the hash code, the only part I've taken from Python's
>> > implementation is the probe sequence
>>
>> So... NOT the "linked lists descending from each hash bucket" approach
>> then.
>>
>> > probe = (probe * 5 + 1 + (perturb >>= 5)) & hashmask
>> > where hashmask is 2**n-1 (for table size 2**n) and perturb is
>> > initially the entire original key hash.
>>
>> I had a class on this in college 20 years ago, and literally have not used
>> it since, because "the hash table can entirely fill up" and "once the hash
>> table is more than half full everything starts colliding just from
>> displacements", and "picking the right hash algorithm is now _extra_
>> magic"...
> 
> The chained hash table (linked lists from buckets) can grow indefinitely
> but will slow down as the chains get longer, unless there's code to
> rebuild it with more buckets.

I.E. both approaches benefit from garbage collection style repacking, but one
degrades gracefully without it and the other locks up. (The non-list version
mostly seems to be about saving the extra pointer and memory fragmentation.)
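For what it's worth, the probe sequence quoted above is small enough to sketch in a few lines of C. This is just an illustration of the formula Ray quoted (power-of-two table, hashmask = size-1, perturb starts as the full original hash so the high bits eventually get mixed in); the function name is made up:

```c
#include <stddef.h>

/* One step of the CPython-style probe sequence quoted above:
   probe = (probe*5 + 1 + (perturb >>= 5)) & hashmask
   perturb starts as the full key hash, so high bits of the hash
   eventually influence the probe before perturb decays to zero. */
static size_t next_probe(size_t probe, size_t *perturb, size_t hashmask)
{
    *perturb >>= 5;
    probe = probe * 5 + 1 + *perturb;
    return probe & hashmask;
}
```

Once perturb hits zero this degenerates to the plain probe = probe*5+1 cycle, which (mod a power of two) still visits every slot.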

I didn't mention "I poked at this way back when" to claim nontrivial expertise,
I was trying to say that my aversion to the area was not based entirely on
ignorance, and you're not convincing me I was wrong.

> The other approach, "open addressing" (named in 1957) does the thing with
> an initial probe and then subsequent probes as needed until an open slot
> is found. The table size is either prime or a power of 2. Power of 2 is
> nice because you can do the initial probe by masking the original hash by
> 'and'ing with (2**n-1). An advantage is when growing it, you can double
> it, while growing a prime-size table needs finding a suitable larger prime
> (though now that I think of it, maybe a smallish table of primes would do
> fine, and allow growing by smaller increments. I'll have to try that...)
> 
> The probe sequence can be j, j+1 mod m, j+2 mod m,... (for table size m)
> but that's terrible. In 1968 Guy de Balbine came up with double hashing,
> where the sequence is j+k mod m, j+2k mod m, ... where k is some different
> hash of the same key, but must also be non-zero and relatively prime to m.
> For prime table size, that's anything from 1 to m-1. For a power-of-two
> table, that's any odd number 1 to m-1. (The main reason I think that table
> size is always prime or 2**n).

See, it's needing to know this many different things about the space that drove
me away from it.
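To spell out the double hashing Ray describes (a sketch, not anybody's actual implementation): for a power-of-two table you can force the stride odd, which makes it relatively prime to m so the probe sequence visits every slot. Names here are illustrative, and h1/h2 stand in for the two independent hashes of the key:

```c
#include <stddef.h>

/* i-th probe slot under double hashing: j + i*k mod m, where j = h1,
   k derives from h2. For table size m = 2**n, OR-ing in the low bit
   makes k odd, hence coprime to m, so all m slots get visited. */
static size_t dh_slot(size_t h1, size_t h2, size_t i, size_t m)
{
    size_t k = h2 | 1;              /* any odd stride is coprime to 2**n */
    return (h1 + i * k) & (m - 1);  /* assumes m is a power of two */
}
```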

I never did a survey of compression algorithms to see which was best for which
use, either. I'm aware of arj and arc and zoo and so on, but "which is better"
is NOT MY PROBLEM. Once a selection was externally made (because tar.gz and
tar.bz2 and tar.xz are out there on websites), I'd dig in as much as necessary.
But "is this an approach worth taking" is a SEPARATE QUESTION.

Nor did I dig deeply into the dozens of different kinds of sort algorithms. I
implemented a heap sort in college because that seemed the most general purpose
one, and then got yelled at when I tried to use it by People With Opinions.

I wrote the kernel's "Red Black Trees" documentation because somebody asked me
to, and despite that I never DID dig up an explanation of HOW you rebalance the
darn things. (Even the documentation I wrote DOES NOT SAY because I DID NOT
KNOW, and yet people kept complimenting me on it...) I eventually dug through
the kernel code and worked out at least how THEY were doing it (unwrapping
nested macros if I recall; I don't remember the answer but I _do_ remember the
slog), and have an "arbys.c" that could become a lib/rbtree.c if I ever need a
"dictionary" data structure (thus avoiding the hash table question entirely)...
but I have yet to wind up needing it. Even the recent tar thing, I told people
to poke me if the performance/scalability was insufficient and I'm still waiting
to hear back:

https://github.com/landley/toybox/issues/456#issuecomment-1742021534

In theory shell sort is O(n log n) and takes _advantage_ of partial sorting
instead of being penalized by it like quicksort, so that SOUNDS like a good
general purpose sort algorithm. But between it not being a stable sort and the
whole gap nonsense, where wikipedia[citation needed] even says "The question of
deciding which gap sequence to use is difficult." right at the start of
https://en.wikipedia.org/wiki/Shellsort#Gap_sequences, I just noped right out
of there.

It's not that I _can't_ get up to speed on these sorts of things, it's that I
never saw the appeal. If a domain-specific problem needs domain-specific tuning
then I wait to have that specific domain ready with test data. If there's a
consensus on generic infrastructure, I'll leverage it. If there ISN'T a clear
consensus, and it's been 80 years since the Harvard Mark I went online, I do not
expect to be the one to pull the sword from the stone and clear up this area of
confusion once and for all.

Posix has quicksort in libc so I can just call that, and the fact posix has
quicksort means "there is a consensus" and it may not be GREAT but I can rely on
them to have already done all the arguing, and just get on with it. I can also
do a bubble sort in like 3 lines of C which is fine if I know I'm never having
more than a couple hundred things in the array, and between the two of them
that's WELL beyond 80/20.
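Just to illustrate the two (nothing here is toybox code, it's the generic textbook versions): the libc call is one comparator away, and the bubble sort really is only a few lines:

```c
#include <stdlib.h>

/* Comparator for libc qsort(), the "consensus" sort posix hands you. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);  /* avoids the overflow that "x - y" risks */
}

/* The few-line bubble sort, fine when n stays in the low hundreds. */
static void bubble(int *a, int n)
{
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++)
            if (a[j] > a[j + 1]) {
                int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
            }
}
```

Called as qsort(array, n, sizeof(int), cmp_int); and bubble(array, n); respectively.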

My lack of familiarity with numerous sorting algorithms was actually the problem
I had with the bzip2 compression side, because it does a fallback through several
types of sorts I'd never heard of, and I never understood WHY it was doing the
ones it was doing. I put it on the todo list and never got back around to it...

http://lists.busybox.net/pipermail/busybox/2003-November/043969.html

And if tsort really does need a hash table on top of knuth's algorithm to
provide "acceptable" performance, toybox has no business implementing it and
should leave that to other packages.

>> > This is brilliant and probably due to Tim Peters
>>
>> "probably"? (I'm happy to cite a primary source...)
> 
> I tried for an hour or two to find a definitive statement that Tim came up
> with the perturb idea. The long comment in the Python source says he
> tested various shifts for perturb. A commit by Tim
> https://github.com/python/cpython/commit/81673fb5ca25e17a1c1b806c06b046669a566a88
> is the best evidence I have that
> j = (j * 5 + 1 + (perturb >>= 5)) mod m
> is all Tim's. Lots of good info there too.

I was assuming this Tim was somebody in the 1960s like knuth and friends,
wrestling with these problems back before I learned to type.

I believe the example the grad student gave in 1993 to avoid everything chaining
together immediately was some largeish initial offset (a quarter of the table
size, I think?) that then got divided by 2 with 1 added each iteration, which
meant it was usually an odd number that collapsed to 1 at the end. But that was
just a blackboard exercise. (With actual chalk, old school...)

> Using the original hash shifted 1 byte right works pretty effectively as a
> second hash (de Balbine, Knuth's algo D); hash >> 23 was not good.
> Similarly, I tried hash*strlen(s) as a secondary hash and got pretty good
> results as well. The constant stride of 37 did badly and the linear probe
> (stride 1) was just terrible. (Yet linear probing is supposed to be decent
> up to 50% full, per Knuth? Maybe need to investigate further.)

You are convincing me tsort does not belong in toybox.

>> With >>5 you need a table size over a million to get more than 4 nonlinear
>> displacements in a given chain. It's not THAT much protection. Still seems
>> input data dependent. A hash table without a good hash function goes
>> septic quick, and "good" is contextual. (...)
> 
> The point of the Python approach was to work well with a "bad" hash
> function, as Python's is bad from the usual viewpoint. E.g. integers are
> their own hash: h(1) is 1, h(2) is 2 etc. So the best approach for Python
> may not be best with a different hash function. I'm using D. Bernstein's
> djb2a (djb2 modified). (BTW looks like the pending/diff.c tried to use that
> but there's a transposition in the initial constant: they have 5831,
> should be 5381.)

My diff rewrite was in a separate tree:

  https://github.com/landley/toybox/commit/42ac5b2334bd

And it ground to a halt because people were arguing with me about it until it
stopped being any fun at all to work on.

I have no clue what the one in pending is doing.
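For reference, djb2a (the XOR variant of Bernstein's hash that Ray mentions) is tiny, and the magic seed is 5381, not the 5831 transposition he spotted in pending/diff.c:

```c
#include <stdint.h>

/* djb2a: Bernstein's string hash, XOR variant. Seed is 5381 (the
   transposed 5831 produces a different, unintended hash function). */
static uint32_t djb2a(const char *s)
{
    uint32_t h = 5381;
    while (*s) h = (h * 33) ^ (unsigned char)*s++;
    return h;
}
```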

>> I didn't make it all the way through the 200+ lines of description at the
>> top, not because it's hard to read but because they thought it NEEDED that
>> much explanation.
>>
>> Remember my gripe at the end of
>> https://landley.net/notes-2023.html#12-09-2023 where I was complaining
>> that I'd felt the need to write a long explanation of the code as a
>> comment in the source file? That meant (to me) that the code was still
>> wrong.
> 
> See https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01701.html
> for a different opinion on that, and on that comment in particular.

No.

> Yes, that's pretty much it. Mark the deleted entries, rebuild a bigger
> table when needed. Most systems expand when the table is about 60%-80%
> full, including deleted slots.

So in order to keep tsort.c in toybox, I need to implement garbage collection or
it's not good enough.
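The mark-and-rebuild scheme Ray describes boils down to something like the following sketch (all names hypothetical, rebuild itself not shown): deleted slots keep a sentinel so existing probe chains stay intact, and the table gets rebuilt once live plus deleted entries pass some threshold, here 70%:

```c
#include <stddef.h>

#define TOMBSTONE ((void *)-1)  /* sentinel marking a deleted slot */

struct htab {
    void **slots;
    size_t size, used, deleted;  /* size is a power of two */
};

/* Delete by tombstoning: lookups must skip TOMBSTONE but keep probing,
   otherwise entries displaced past this slot become unreachable. */
static void ht_delete(struct htab *t, size_t slot)
{
    t->slots[slot] = TOMBSTONE;
    t->used--;
    t->deleted++;
}

/* Occupancy including tombstones at 70% triggers the rebuild (which is
   the "garbage collection": reinsert live entries into a fresh table,
   dropping the tombstones). */
static int ht_needs_rebuild(struct htab *t)
{
    return 10 * (t->used + t->deleted) >= 7 * t->size;
}
```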

> It's just kind of hard to get a decent generalized hash table system that
> handles expansion and deletions without having a lot of code.

80 years. Grace Hopper is dead. Dennis Ritchie is dead. Multiple lifetimes of
people smarter than me. And it's still "kind of hard to get a decent", but isn't
in libc for me to just use.

I am not wading into the Bog of Eternal Stench.

> Some other interesting links: This guy

Tsort is not worth this.

> There's a good comparison of hashing algos (not the table handling, but
> the string hashing) at

I am not interested in building domain expertise in algorithms I need to form
and defend an _opinion_ about.

Rob

P.S. Yes I checked if posix/libc just had one I could use, like with qsort(),
but man 3 hash says "This page documents interfaces provided in glibc up until
version 2.1. Since version 2.2, glibc no longer provides these interfaces.
Probably, you are looking for the APIs provided by the libdb library instead."
and I GUARANTEE you I am not looking for any separate gnu libdb, and as for the
berkeley one here's Oratroll buying it https://lwn.net/Articles/171888/ and
here's the inevitable license fsckery https://lwn.net/Articles/557820/ and I
haven't checked back in since, not that I'd use an external dependency for the
same reason the project's not linking against curses and implements its own hash
functions and gzip. But it would have been NICE if libc provided one...

