[Toybox] tsort

Ray Gardner raygard at gmail.com
Mon Oct 16 17:09:00 PDT 2023


In the blog, Rob said
> I MOSTLY don't care because I don't know of a use case that big (he
> didn't send the test data, nor say how he created it)

I did say
> I can post here about how I tested, if you're interested.

No one said they were interested, so this will probably be my last attempt
to contribute to the tsort discussion unless further interest is
indicated. But now I will say how...

I needed to create large inputs without cycles. The most obvious way to me
was to make pairs (pred, succ) such that pred <= succ for each pair, so no
cycle is possible. I did this:

# gen_tsort_input.awk -- generate test input for tsort
# Generate n pairs of numbers: left right
# such that 1 <= left <= right < limit
BEGIN {
    srand()
    limit = 10000
    n = 1000000
    for (i = 0; i < n; i++) {
        left = 1 + int(rand() * (limit - 1))
        #right = int(rand() * (limit - left)) + left
        right = 1 + int(rand() * (limit - 1))
        if (left > right) {t = left; left = right; right = t}
        print left, right
    }
}

I tried some variations that created different distributions of right for
each left, but that didn't make much difference in the test results.

I truncated the output at 100k, 200k, and 400k lines and tested those as
well as the full 1m lines.

I wrote a checker to validate the output, shown below.

Today I noticed a topological sort in _The AWK Programming Language_ by
Aho, Weinberger, and Kernighan, so I tested that as well. Slightly
modified to generate output one element per line:

# tsort - topological sort of a graph
#   input:  predecessor-successor pairs
#   output: linear order, predecessors first
    { if (!($1 in pcnt))
          pcnt[$1] = 0           # put $1 in pcnt
      pcnt[$2]++                 # count predecessors of $2
      slist[$1, ++scnt[$1]] = $2 # add $2 to successors of $1
    }
END { for (node in pcnt) {
          nodecnt++
          if (pcnt[node] == 0)   # if it has no predecessors
              q[++back] = node   # queue node
      }
      for (front = 1; front <= back; front++) {
          printf("%s\n", node = q[front])
          for (i = 1; i <= scnt[node]; i++)
              if (--pcnt[slist[node, i]] == 0)
                  # queue s if it has no more predecessors
                  q[++back] = slist[node, i]
      }
      if (back != nodecnt)
          print "\nerror: input contains a cycle"
    }

I had to modify my test data generator to avoid self-pairs (x, x) by
changing a line:
        do {right = 1 + int(rand() * (limit - 1))} while (left == right)

Now on 100k, 200k, 400k lines I get times of:

gawk:        0.662s 1.278s 2.515s
busybox awk: 3.051s 4.597s 7.680s
my awk:      0.902s 1.820s 3.434s

All the results checked OK with the checker below.

# checkts.awk -- check tsort output against input
#
# Usage: checkts.awk tsort_infile tsort_outfile
#
# Assumes no cycle (loop) in input
# ensure every input symbol appears in output
# ensure every output symbol appears only once
# does NOT check that every output symbol appears in input
# ensure left of each input pair precedes or equals right in output
BEGIN {
    tsin = ARGV[1]
    tsout = ARGV[2]
    print tsin, tsout
    checkts(tsin, tsout)
}
function readfile(fn, a,   s)
{
    while ((getline < fn) > 0) s = s "  " $0
    close(fn)
    gsub("^ *| *$", "", s)  # drop leading and trailing spaces
    return split(s, a, "  *")  # split symbols into array a
}
function checkts(tsin, tsout)
{
    n = readfile(tsin, a)
    if (n % 2) {print "unclosed pair"; return}
    k = readfile(tsout, v)
    # Invert the output array
    for (i in v) {
        if (v[i] in vv) {print "dup in output:", v[i]; return}
        vv[v[i]] = int(i)
    }
    for (i = 1; i <= n; i += 2) {
        left = a["" i]; right = a["" (i+1)]
        if (! (left in vv)) {print "not in output:", left; return}
        if (! (right in vv)) {print "not in output:", right; return}
        if (vv[left] > vv[right]) {print "not sorted: " vv[left], vv[right]}
        if (vv[left] > vv[right]) {print "not sorted: " left " > "
right; return}
    }
    print "ok"
}


More information about the Toybox mailing list