[Toybox] tsort regrets

Rob Landley rob at landley.net
Fri Nov 3 12:36:30 PDT 2023


On 11/2/23 18:00, Ray Gardner wrote:
> Rob, feel free to post your reply on the list or your blog or wherever.
> 
> As to the hash code, the only part I've taken from Python's
> implementation is the probe sequence

So... NOT the "linked lists descending from each hash bucket" approach then.

> probe = (probe * 5 + 1 + (perturb >>= 5)) & hashmask
> where hashmask is 2**n-1 (for table size 2**n) and perturb is
> initially the entire original key hash.

I had a class on this in college 20 years ago, and literally have not used it
since, because "the hash table can entirely fill up" and "once the hash table is
more than half full everything starts colliding just from displacements", and
"picking the right hash algorithm is now _extra_ magic"...

> This is brilliant and probably due to Tim Peters

"probably"? (I'm happy to cite a primary source...)

> and already copied in
> other hash table implementations (including vim), so I doubt anyone
> will mind using the same probe sequence.

And this is better than a constant prime number displacement because you're
guaranteed to hit every entry in 2*sizeof(table) checks instead of some multiple
with the prime, and also adjacent displacement chains are less likely to connect up?

With >>5 you need a table size over a million to get more than 4 nonlinear
displacements in a given chain. It's not THAT much protection. Still seems input
data dependent. A hash table without a good hash function goes septic quick, and
"good" is contextual. (This is an area of expertise I've seen other people
acquire, and wrestle with, over the course of many years. As in consecutive
years on _their_ part, and the answer still wasn't obvious to them. But then I
tend to professionally hang out at the "large and complex system went boing,
throw money at the problem" end of the force where we don't get the easy
problems. When it isn't a technical nightmare, it's political. Not always, but
often enough. So I may have gotten a skewed view of the space...)

> The perturb can eventually go
> to 0, and then it's just i = (5 * i + 1) mod 2**n which will hit every
> slot. Just have to make sure the table never gets completely full to
> prevent looping forever looking for an open slot.
> 
> BTW if you looked at the Python code,

I didn't make it all the way through the 200+ lines of description at the top,
not because it's hard to read but because they thought it NEEDED that much
explanation.

Remember my gripe at the end of https://landley.net/notes-2023.html#12-09-2023
where I was complaining that I'd felt the need to write a long explanation of
the code as a comment in the source file? That meant (to me) that the code was
still wrong.

> you've seen "(see any text on
> random-number generation for proof)" on the "will hit every slot"
> claim. I glanced at Knuth vol. 2 about that, and it's 3 pages of
> serious math, so I'll just take their word for that.
> 
> If you're considering generalizing it, I should mention that my awk
> implementation uses the same idea but also starts the tables (one for
> each "associative array" variable) small (8 slots) and doubles them as
> needed, similar to Python's approach. But that needs re-hashing when
> resizing and supports deletion, so quite a bit more than what's in the
> tsort. I made the tsort hash implementation kinda minimal in size but
> adequate performance.

Supporting deletion in-place in the linked list approach was reasonably
straightforward. I remember that deleting in the table walk version was fiddly
but I don't remember the solution.

(I'm not sure the professor actually covered it? Governor Witless got "elected"
and cut Rutgers' budget in a way that resulted in the administration blanket
denying tenure to everybody who was up for it that year, which included the
entire computer science department because it had only budded off the physics
department X years before and thus the faculty they'd hired all had the same
seniority, so my professor put out his resume and got a job in Aruba almost
immediately, and taught one more class where he explained one of his graduate
students would be taking over and he'd be on a beach with a mai-tai by then.)

Hang on, I think I asked the grad student about it. You could either read
_backwards_ through displacements (not feasible when your displacements aren't
constant) or keep whiteouts, meaning a "there was something here so keep reading"
marker commemorating removed entries, so the table can never actually go back
to a clean state. (Because with a variable stride even reading through a
full-to-end whiteout list with no hits doesn't mean you can REMOVE the
whiteouts, because you essentially have cross-linked lists. You have to garbage
collect your table, and generally you allocate a BIGGER table and re-hash every
entry into the new one to get a clean table once you've hit a certain
pathological size, and I went "I'll let the library do this for me"...)

I'm happy to do a proper version in lib/ if we decide we need this. As I said,
tar could also use this and I'm pretty sure I've got a couple more in the todo
heap. (It's just "tsort go fast" that seemed a strange entry point to this round
of optimization...)

Hmmm... a 1024 entry table that garbage collects to a 32k entry table which
garbage collects to a million entry table... That adds 5 bits each time so the
perturb thingy gets 1 extra stride meaning the collisions get re-jiggered and at
least the first 2 garbage collections would be negligible latency spikes (and if
you gc from a million that's on you). And on embedded 32 bit systems those first
two are 4k and 128k table overhead so even on nommu that's probably manageable.
(That readfd() stuff is defaulting to a 4k buffer when it can't size the input
stream, it's not ideal but I'm ok wasting that much space on something a given
command should only have 2 or 3 of tops.)

> The remark about seekable input is not for toybox; I used your tsort
> code with readfd() which can handle input from terminal, pipe etc. and
> grows the buffer as needed. For my "standalone" (non-toybox) tsort I
> need to do that myself and I haven't tried that yet. BTW I hope I'm
> correct that readfd() always makes the buffer larger than the input,

It should. When it's allocating its own buffer it allocates an extra byte and
puts a NUL terminator in it.

  if (!*plen) {
    if ((len = fdlength(fd))>0) *plen = len;
...
  if (!ibuf) buf = xmalloc(len+1);
...
  } else buf[len] = 0;

> at least when starting with plen==0, because it depends on being able
> to safely plug a NUL byte after the last string, which might not have
> a newline to replace. I looked at readfd() and I *think* it works that
> way but not positive.

It was intended to. There's a NUL at the end because the file itself probably
wasn't NUL terminated and "treat the whole thing we read as a big string" is the
use case for half the files under /proc and /sys.

I _think_ the "realloc previous size by 1.5 times" loop will always leave one
byte at the end for the NUL? A second set of eyeballs on that wouldn't be bad,
my head's still fuzzy. (Week two! This is ridiculous. But ibuprofen, "compare
with zyrtec", and a BUNCH of caffeine is keeping me... something like upright,
when I'm not taking random unscheduled naps.)

> One other thing: In an earlier post to the list, I pondered that I'd
> need to do something about duplicate input pairs to tsort. Knuth's
> approach doesn't need to be concerned about that, the duplicates sorta
> come out in the wash.

      // Slight mod: if self-reference relation, do not record it.
      // (If it's self-reference, the symbol is already in the table now.)
      // Note that this links the most recent into the head of the successor
      // list, reversing their order in the input. If the symbols table kept a
      // tail pointer, they could be kept in the list in original order.
      if (strcmp(pred, succ)) {

That would be "the wash". :)

Rob
