[Toybox] tsort regrets (really: hashing)
enh
enh at google.com
Tue Nov 7 11:30:21 PST 2023
On Tue, Nov 7, 2023 at 11:19 AM Rob Landley <rob at landley.net> wrote:
>
> On 11/6/23 09:39, enh wrote:
> >> P.S. Yes I checked if posix/libc just had one I could use, like with qsort(),
> >> but man 3 hash says "This page documents interfaces provided in glibc up until
> >> version 2.1. Since version 2.2, glibc no longer provides these interfaces.
> >
> > musl has the POSIX <search.h>, so i assume glibc still does too?
>
> That API has no walk() function to return all entries in the table,
> hsearch(ENTER) returns null if the table is "full" (no resize so I'd have to do
> my own gc without access to a walk()
that's not true for the FreeBSD implementation that we reuse in bionic.
nor for musl.
> and only the non-posix _r extensions let
> you have old and new tables in play at once so you pretty much have to keep a
> linked linked list _and_ the table), can't pass a cleanup function to
> destroy_r() to free the entries, for the tar.c use case I'd have to xmprintf()
> the dev/ino pairs to produce the string "key" needed to store them in the table
> when some xor and shifts seem like the obvious hash function...
>
> It's a nice option to have, but didn't fit the use cases I'd compared it with.
> But I should probably read Rich's implementation of it because he has a history
> of doing "the sane thing"... his string hash is while (*p) h = 31*h + *p++;
> which looks fine. His stride is for (i = hash, j=1;; i += j++) which checks
> hash&size, (hash+1)&size, (hash+1+2)&size, (hash+1+2+3)&size thus retaining
> reasonable L1 cache locality without chaining into someone else's overflow (very
> nice). Yeah, this looks like it makes sense. Still no delete or walk. (Easy way
> to delete uses whiteouts, I tend to (void *)1 which _really_ seems to annoy the
> C++ compiler developers...)
>
> Huh, Rich _does_ resize() under the covers in __hsearch_r(). Pity I can't rely
> on that happening in an arbitrary libc implementation...
i suspect you probably can, unless glibc is broken?
> *shrug* I _can_ do this but was trying to hold the optimization off until we'd
> accumulated test loads inconvenienced by poor performance, so we can confirm the
> changes actually helped real world use. (Optimization in search of a load is a
> variant of infrastructure in search of a user...)
yeah, a "good enough" hash table is very little code anyway, and
until/unless you're worried about (a) data center-sized workloads or
(b) "this is a fundamental primitive in our language exported to
arbitrary user code" (the python case), i don't think you need to
worry about any of the things you've mentioned in this thread. (other
than the possibility that glibc might actually require you to specify
a _max_ size because it doesn't resize; but i assume musl would have
kept the same behavior in that case?)
> Lemme finish redoing crypt() first...
>
> Rob
More information about the Toybox
mailing list