[Toybox] tsort
Ray Gardner
raygard at gmail.com
Mon Oct 16 17:09:00 PDT 2023
In the blog, Rob said
> I MOSTLY don't care because I don't know of a use case that big (he
> didn't send the test data, nor say how he created it)
I did say
> I can post here about how I tested, if you're interested.
No one said they were interested, so this will probably be my last attempt
to contribute to the tsort discussion unless further interest is
indicated. But now I will say how...
I needed to create large inputs without cycles. The most obvious way to me
was to make pairs (pred, succ) such that pred <= succ for each pair, so no
cycle is possible. I did this:
# gen_tsort_input.awk -- generate test input for tsort
# Generate n pairs of numbers: left right
# such that 1 <= left <= right < limit
BEGIN {
srand()
limit = 10000
n = 1000000
for (i = 0; i < n; i++) {
left = 1 + int(rand() * (limit - 1))
#right = int(rand() * (limit - left)) + left
right = 1 + int(rand() * (limit - 1))
if (left > right) {t = left; left = right; right = t}
print left, right
}
}
I tried some variations that created different distributions of right for
each left, but that didn't make much difference in the test results.
I truncated the output at 100k, 200k, and 400k lines and tested those as
well as the full 1m lines.
I wrote a checker to validate the output, shown below.
Today I noticed a topological sort in _The AWK Programming Language_ by
Aho, Weinberger, and Kernighan, so I tested that as well. Slightly
modified to generate output one element per line:
# tsort - topological sort of a graph
# input: predecessor-successor pairs
# output: linear order, predecessors first
{ if (!($1 in pcnt))
pcnt[$1] = 0 # put $1 in pcnt
pcnt[$2]++ # count predecessors of $2
slist[$1, ++scnt[$1]] = $2 # add $2 to successors of $1
}
END { for (node in pcnt) {
nodecnt++
if (pcnt[node] == 0) # if it has no predecessors
q[++back] = node # queue node
}
for (front = 1; front <= back; front++) {
printf("%s\n", node = q[front])
for (i = 1; i <= scnt[node]; i++)
if (--pcnt[slist[node, i]] == 0)
# queue s if it has no more predecessors
q[++back] = slist[node, i]
}
if (back != nodecnt)
print "\nerror: input contains a cycle"
}
I had to modify my test data generator to avoid self-pairs (x, x) by
changing a line:
do {right = 1 + int(rand() * (limit - 1))} while (left == right)
Now on 100k, 200k, 400k lines I get times of:
gawk: 0.662s 1.278s 2.515s
busybox awk: 3.051s 4.597s 7.680s
my awk: 0.902s 1.820s 3.434s
All the results checked OK with the checker below.
# checkts.awk -- check tsort output against input
#
# Usage: checkts.awk tsort_infile tsort_outfile
#
# Assumes no cycle (loop) in input
# ensure every input symbol appears in output
# ensure every output symbol appears only once
# does NOT check that every output symbol appears in input
# ensure left of each input pair precedes or equals right in output
BEGIN {
tsin = ARGV[1]
tsout = ARGV[2]
print tsin, tsout
checkts(tsin, tsout)
}
function readfile(fn, a, s)
{
while ((getline < fn) > 0) s = s " " $0
close(fn)
gsub("^ *| *$", "", s) # drop leading and trailing spaces
return split(s, a, " *") # split symbols into array a
}
function checkts(tsin, tsout)
{
n = readfile(tsin, a)
if (n % 2) {print "unclosed pair"; return}
k = readfile(tsout, v)
# Invert the output array
for (i in v) {
if (v[i] in vv) {print "dup in output:", v[i]; return}
vv[v[i]] = int(i)
}
for (i = 1; i <= n; i += 2) {
left = a["" i]; right = a["" (i+1)]
if (! (left in vv)) {print "not in output:", left; return}
if (! (right in vv)) {print "not in output:", right; return}
if (vv[left] > vv[right]) {print "not sorted: " vv[left], vv[right]}
if (vv[left] > vv[right]) {print "not sorted: " left " > "
right; return}
}
print "ok"
}
More information about the Toybox
mailing list