[Toybox] tsort regrets (really: hashing)

enh enh at google.com
Mon Nov 6 07:39:59 PST 2023


On Sun, Nov 5, 2023 at 9:50 PM Rob Landley <rob at landley.net> wrote:
>
> On 11/5/23 21:22, Ray Gardner wrote:
> > On 11/3/23 12:36:30, Rob Landley wrote:
> >> > As to the hash code, the only part I've taken from Python's
> >> > implementation is the probe sequence
> >>
> >> So... NOT the "linked lists descending from each hash bucket" approach
> >> then.
> >>
> >> > probe = (probe * 5 + 1 + (perturb >>= 5)) & hashmask
> >> > where hashmask is 2**n-1 (for table size 2**n) and perturb is
> >> > initially the entire original key hash.
> >>
> >> I had a class on this in college 20 years ago, and literally have not used
> >> it since, because "the hash table can entirely fill up" and "once the hash
> >> table is more than half full everything starts colliding just from
> >> displacements", and "picking the right hash algorithm is now _extra_
> >> magic"...
> >
> > The chained hash table (linked lists from buckets) can grow indefinitely
> > but will slow down as the chains get longer, unless there's code to
> > rebuild it with more buckets.
>
> I.E. both approaches benefit from garbage collection style repacking, but one
> degrades gracefully without it and the other locks up. (The non-list version
> mostly seems to be about saving the extra pointer and memory fragmentation.)
>
> I didn't mention "I poked at this way back when" to claim nontrivial expertise,
> I was trying to say that my aversion to the area was not based entirely
> on ignorance, and you're not convincing me I was wrong.
>
> > The other approach, "open addressing" (named in 1957) does the thing with
> > an initial probe and then subsequent probes as needed until an open slot
> > is found. The table size is either prime or a power of 2. Power of 2 is
> > nice because you can do the initial probe by masking the original hash by
> > 'and'ing with (2**n-1). An advantage is when growing it, you can double
> > it, while growing a prime-size table needs finding a suitable larger prime
> > (though now that I think of it, maybe a smallish table of primes would do
> > fine, and allow growing by smaller increments. I'll have to try that...)
> >
> > The probe sequence can be j, j+1 mod m, j+2 mod m,... (for table size m)
> > but that's terrible. In 1968 Guy de Balbine came up with double hashing,
> > where the sequence is j+k mod m, j+2k mod m, ... where k is some different
> > hash of the same key, but must also be non-zero and relatively prime to m.
> > For prime table size, that's anything from 1 to m-1. For a power-of-two
> > table, that's any odd number 1 to m-1. (That's the main reason, I think,
> > that table size is always prime or 2**n.)
>
> See, it's needing to know this many different things about the space that drove
> me away from it.
>
> I never did a survey of compression algorithms to see which was best for which
> use, either. I'm aware of arj and arc and zoo and so on, but "which is better"
> is NOT MY PROBLEM. Once a selection was externally made (because tar.gz and
> tar.bz2 and tar.xz are out there on websites), I'd dig in as much as necessary.
> But "is this an approach worth taking" is a SEPARATE QUESTION.
>
> Nor did I dig deeply into the dozens of different kinds of sort algorithms. I
> implemented a heap sort in college because that seemed the most general purpose
> one, and then got yelled at when I tried to use it by People With Opinions.
>
> I wrote the kernel's "Red Black Trees" documentation because somebody asked me
> to, and despite that I never DID dig up an explanation of HOW you rebalance the
> darn things. (Even the documentation I wrote DOES NOT SAY because I DID NOT
> KNOW, and yet people kept complimenting me on it...) I eventually dug through
> the kernel code and worked out at least how THEY were doing it (unwrapping
> nested macros if I recall; I don't remember the answer but I _do_ remember the
> slog), and have an "arbys.c" that could become a lib/rbtree.c if I ever need a
> "dictionary" data structure (thus avoiding the hash table question entirely)...
> but I have yet to wind up needing it. Even the recent tar thing, I told people
> to poke me if the performance/scalability was insufficient and I'm still waiting
> to hear back:
>
> https://github.com/landley/toybox/issues/456#issuecomment-1742021534
>
> In theory shell sort is O(n log n), taking _advantage_ of partial sorting
> instead of being penalized by it like quicksort, so that SOUNDS like a good
> general purpose sort algorithm. But between it not being a stable sort, and the
> whole gap nonsense where wikipedia[citation needed] even says "The question of
> deciding which gap sequence to use is difficult." right at the start of
> https://en.wikipedia.org/wiki/Shellsort#Gap_sequences I just noped right out
> of there.
>
> It's not that I _can't_ get up to speed on these sort of things, it's that I
> never saw the appeal. If a domain-specific problem needs domain-specific tuning
> then I wait to have that specific domain ready with test data. If there's a
> consensus on generic infrastructure, I'll leverage it. If there ISN'T a clear
> consensus, and it's been 80 years since the Harvard Mark I went online, I do not
> expect to be the one to pull the sword from the stone and clear up this area of
> confusion once and for all.
>
> Posix has quicksort in libc so I can just call that, and the fact posix has
> quicksort means "there is a consensus" and it may not be GREAT but I can rely on
> them to have already done all the arguing, and just get on with it. I can also
> do a bubble sort in like 3 lines of C which is fine if I know I'm never having
> more than a couple hundred things in the array, and between the two of them
> that's WELL beyond 80/20.
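(For what it's worth, the bubble sort mentioned above really is about three
lines; a sketch:

```c
// A bubble sort really is about three lines of C -- fine when the array is
// known to stay at a couple hundred elements, quadratic beyond that.
static void bubble(int *a, int n)
{
  for (int i = 0; i < n-1; i++)
    for (int j = 0; j < n-1-i; j++)
      if (a[j] > a[j+1]) { int t = a[j]; a[j] = a[j+1]; a[j+1] = t; }
}
```

Each pass bubbles the largest remaining element to the end, so the inner loop
shrinks by one each time.)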
>
> My lack of familiarity with numerous sorting algorithms was actually the problem
> I had with bzip2 compression side because it does a fallback through several
> types of sorts I'd never heard of, and I never understood WHY it was doing the
> ones it was doing. I put it on the todo list and never got back around to it...
>
> http://lists.busybox.net/pipermail/busybox/2003-November/043969.html
>
> And if tsort really does need a hash table on top of Knuth's algorithm to
> provide "acceptable" performance, toybox has no business implementing it and
> should leave that to other packages.
>
> >> > This is brilliant and probably due to Tim Peters
> >>
> >> "probably"? (I'm happy to cite a primary source...)
> >
> > I tried for an hour or two to find a definitive statement that Tim came up
> > with the perturb idea. The long comment in the Python source says he
> > tested various shifts for perturb. A commit by Tim
> > https://github.com/python/cpython/commit/81673fb5ca25e17a1c1b806c06b046669a566a88
> > is the best evidence I have that
> > j = (j * 5 + 1 + (perturb >>= 5)) mod m
> > is all Tim's. Lots of good info there too.
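For concreteness, here is what that probe sequence might look like in C over a
table of string keys (a sketch; `find_slot` and the table layout are
illustrative, not from CPython or toybox):

```c
#include <stddef.h>
#include <string.h>

// CPython-style open addressing: table size is 2**n, the first probe is
// the hash masked down, and each collision mixes in 5 more high bits of
// the hash via perturb. Once perturb reaches 0 this degenerates to
// i = 5*i + 1 mod 2**n, which cycles through every slot, so the loop
// terminates whenever the table still has an empty slot.
static size_t find_slot(char **table, size_t mask, size_t hash, char *key)
{
  size_t perturb = hash, i = hash & mask;

  while (table[i] && strcmp(table[i], key)) {
    perturb >>= 5;
    i = (i*5 + 1 + perturb) & mask;
  }
  return i;
}
```

Returns the matching slot on a hit, or the empty slot where the key would be
inserted.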
>
> I was assuming this Tim was somebody in the 1960s like Knuth and friends,
> wrestling with these problems back before I learned to type.
>
> I believe the grad student's example back in 1993 of a way to avoid everything
> chaining together immediately was some largeish initial offset (quarter of the
> table size I think?) that was then divided by 2 with 1 added each iteration,
> which meant it was usually an odd number that collapsed to 1 at the end. But
> that was just a
> blackboard exercise. (With actual chalk, old school...)
>
> > Using the original hash shifted 1 byte right works pretty effectively as a
> > second hash (de Balbine, Knuth's algo D); hash >> 23 was not good.
> > Similarly, I tried hash*strlen(s) as a secondary hash and got pretty good
> > results as well. The constant stride of 37 did badly and the linear probe
> > (stride 1) was just terrible. (Yet linear probing is supposed to be decent
> > up to 50% full, per Knuth? Maybe need to investigate further.)
>
> You are convincing me tsort does not belong in toybox.
>
> >> With >>5 you need a table size over a million to get more than 4 nonlinear
> >> displacements in a given chain. It's not THAT much protection. Still seems
> >> input data dependent. A hash table without a good hash function goes
> >> septic quick, and "good" is contextual. (...)
> >
> > The point of the Python approach was to work well with a "bad" hash
> > function, as Python's is bad from the usual viewpoint. E.g. integers are
> > their own hash: h(1) is 1, h(2) is 2 etc. So the best approach for Python
> > may not be best with a different hash function. I'm using D. Bernstein's
> > djb2a (djb2 modified). (BTW looks like the pending/diff.c tried to use that
> > but there's a transposition in the initial constant: they have 5831,
> > should be 5381.)
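For reference, djb2a (the xor variant) with the correct starting constant looks
like:

```c
#include <stdint.h>

// djb2a: Bernstein's string hash, the xor variant of djb2. The magic
// starting value is 5381 (the 5831 in pending/diff.c is a transposition
// of this).
static uint32_t djb2a(char *s)
{
  uint32_t hash = 5381;

  while (*s) hash = (hash*33) ^ (unsigned char)*s++;
  return hash;
}
```

(The original djb2 adds the byte instead of xor-ing it; both multiply by 33.)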
>
> My diff rewrite was in a separate tree:
>
>   https://github.com/landley/toybox/commit/42ac5b2334bd
>
> And it ground to a halt because people were arguing with me about it until it
> stopped being any fun at all to work on.
>
> I have no clue what the one in pending is doing.
>
> >> I didn't make it all the way through the 200+ lines of description at the
> >> top, not because it's hard to read but because they thought it NEEDED that
> >> much explanation.
> >>
> >> Remember my gripe at the end of
> >> https://landley.net/notes-2023.html#12-09-2023 where I was complaining
> >> that I'd felt the need to write a long explanation of the code as a
> >> comment in the source file? That meant (to me) that the code was still
> >> wrong.
> >
> > See https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01701.html
> > for a different opinion on that, and on that comment in particular.
>
> No.
>
> > Yes, that's pretty much it. Mark the deleted entries, rebuild a bigger
> > table when needed. Most systems expand when the table is about 60%-80%
> > full, including deleted slots.
>
> So in order to keep tsort.c in toybox, I need to implement garbage collection or
> it's not good enough.
>
> > It's just kind of hard to get a decent generalized hash table system that
> > handles expansion and deletions without having a lot of code.
>
> 80 years. Grace Hopper is dead. Dennis Ritchie is dead. Multiple lifetimes of
> people smarter than me. And it's still "kind of hard to get a decent", but isn't
> in libc for me to just use.
>
> I am not wading into the Bog of Eternal Stench.
>
> > Some other interesting links: This guy
>
> Tsort is not worth this.
>
> > There's a good comparison of hashing algos (not the table handling, but
> > the string hashing) at
>
> I am not interested in building domain expertise in algorithms I need to form
> and defend an _opinion_ about.
>
> Rob
>
> P.S. Yes I checked if posix/libc just had one I could use, like with qsort(),
> but man 3 hash says "This page documents interfaces provided in glibc up until
> version 2.1. Since version 2.2, glibc no longer provides these interfaces.

musl has the POSIX <search.h>, so I assume glibc still does too?
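(For reference, the <search.h> interfaces look like this in use; a sketch, with
a made-up key/data pair. The table is process-global with no deletion and no
growth, which is much of why they rarely get used:

```c
#include <search.h>

// POSIX hcreate()/hsearch(): one global fixed-capacity hash table.
// Returns the data stored under "answer", or NULL on failure.
static char *search_demo(void)
{
  ENTRY *found;
  char *data;

  if (!hcreate(64)) return 0;  // capacity hint, can't grow later
  hsearch((ENTRY){ .key = "answer", .data = "42" }, ENTER);
  found = hsearch((ENTRY){ .key = "answer" }, FIND);
  data = found ? found->data : 0;
  hdestroy();  // frees the table, not the keys/data
  return data;
}
```

glibc does also have reentrant hcreate_r/hsearch_r extensions that allow more
than one table, but still no deletion.)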

> Probably, you are looking for the APIs provided by the libdb library instead."
> and I GUARANTEE you I am not looking for any separate gnu libdb, and as for the
> berkeley one here's Oratroll buying it https://lwn.net/Articles/171888/ and
> here's the inevitable license fsckery https://lwn.net/Articles/557820/ and I
> haven't checked back in since, not that I'd use an external dependency for the
> same reason the project's not linking against curses and implements its own hash
> functions and gzip. But it would have been NICE if libc provided one...
> _______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net


More information about the Toybox mailing list