[Toybox] tsort

Ray Gardner raygard at gmail.com
Sat Oct 14 13:32:52 PDT 2023


> The topic at hand is whether I should redo it based on a different
> algorithm for better scalability up in the hundreds of thousands of
> entries range.  (Answer: probably yes, the numbers seem convincing. I have
> to hack through a lot of math-speak from a book first published in 1968 to
> understand the suggested algorithm though.)

I'm not sure you should spend more time on tsort for now; maybe e.g. toysh
or diff are more pressing needs, since you do have a working tsort.

I can send you pics or pdf of the relevant pages of TAOCP if you don't
have them, but I'll try to give an outline.

First I'll note that Knuth's algo makes assumptions you'd need some
preprocessing to deal with, and that's the harder part. His version
assumes the pairs are all drawn from consecutive integers [1..n] and no
dups. Not sure without more study what his does with self pairs (x, x).
Also, he reuses an integer field as a link, which was nice when memory was
much scarcer than it is now.

Call the parts of each pair fore and aft. Set up an array of nodes (struct
{count, link} A) A[1..n] where each fore goes in A[fore]. The node is the
head of a linked list of corresponding aft values for that fore.

As each (fore, aft) pair is added to the structure, a node (struct {aft,
link}) is inserted at the front of the A[fore] list and A[aft].count is
incremented. So the count at A[fore] indicates how many times that fore
has appeared as an aft in some pair.

Now A[fore].count == 0 means this fore value is "independent" of any aft
values. There must be at least one such, or there are loop(s).

Here, the algo scans once through the A.count fields and links the zero
values together into a queue, reusing the count fields as links.

Next, take a node off the queue and emit its index, i.e. if A[x].count ==
0 then emit x. Next go through its linked list of aft values A[x].link and
for each one, decrement the corresponding count, so if A[x].link is p,
then set A[p->aft].count -= 1, p = p->link, etc. And if that count is now
zero, insert it into the queue.

Repeat until the queue is empty. Knuth keeps a count N and decrements with
each fore value emitted. If it's zero at the end, it's all good, otherwise
there's a cycle. He has an exercise (with an answer) for printing a loop
if there is one.

In thinking of how I'd do it, I assume I need to map the input symbols to
integers, so probably a hash table. I'd need to remove dups from the input
as I build the array of linked lists, so maybe another hash table.
Alternatively you could search each linked list as you add an element and
remove dups there, but that would be N^2 for each list and could be for
the entire input if it's pathological: (1,2) (1,3) (1,4)...  Also, I think
I can use one hash table for both the separate fore/aft symbols (mapped to
integers) and for pairs (for dup removal). But it could be a large table.


More information about the Toybox mailing list