[Toybox] awk (Re: ps down, top to go)

Rob Landley rob at landley.net
Sun May 8 18:05:32 PDT 2016


On 05/08/2016 01:06 PM, Andy Chu wrote:
> On Fri, May 6, 2016 at 9:11 PM, Rob Landley <rob at landley.net> wrote:
>> (The end in sight for _busybox_ in my own use cases is next up on my
>> todo list. Really not looking forward to implementing awk, but it's
>> gotta be done...)
> 
> 
> I'm curious what research you've done on awk?

Well, way back when I tried to make sense of busybox's awk
implementation, which is around 3000 lines of C.

More recently, I read about half the posix awk description and dug up a
copy of the original "The AWK Programming Language" book by Aho,
Kernighan, and Weinberger from 1988, which I've read the introduction of.

> From my research, it seems like a significantly easier problem than the
> shell.

Yeah, probably. The shell is actually about 30 commands integrated
together, several of which are literal commands and several of which are
implicit, like "expand this environment variable", which could be $RANDOM
or $PPID or $SECONDS, which actually invokes a function but you still need
to support ${#SECONDS} or ${RANDOM:1:3}. Some of them literally are
external commands, where [ is "test" and : is "true", and "help" I already
did, and I already did ulimit. Some are like $(( )), which is kinda like
expr but not quite. "trap" is its own thing, "read" is actually fairly elaborate,
command history navigation (and "history expansion" which I've never
personally used)... Job control is a whole subsystem (and pipes and
redirection are integrated into that; you suspend a _pipeline_ not a
process, and kill needs job control integration when run from toysh).
$PWD != abspath although it looks like getcwd() returns what we need
there, but I need to adjust cd to strip directory entries instead of
actually traversing the filesystem .. (but only for _leading_ .. in the
path, I think? Need to test). There's all that loop and test logic,
shell functions and alias, pushd/popd/dirs, I have NEVER understood what
the "getopts" command is for but need to try again, don't get me STARTED
on the dozens of different things "set" does let alone "set -o"...
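
To illustrate the "strip directory entries instead of actually traversing the filesystem" idea for cd, here is a minimal sketch in C. The function name strip_dotdot is hypothetical and is not toybox code. It assumes an absolute path and collapses every ".." lexically, so it does not settle the open question above about whether only leading ".." entries should be treated this way, and it ignores symlinks entirely.

```c
#include <string.h>

// Hypothetical helper (not toybox code): lexically normalize an absolute
// path in place. Empty and "." components are dropped, and ".." removes
// the previous component without ever touching the filesystem.
static char *strip_dotdot(char *path)
{
  char *in = path, *out = path;

  if (*path != '/') return path;  // this sketch only handles absolute paths
  while (*in) {
    size_t len;

    while (*in == '/') in++;      // skip runs of slashes
    len = strcspn(in, "/");       // length of the next component
    if (!len) break;
    if (len == 1 && in[0] == '.') {
      // "." means "stay here": drop it
    } else if (len == 2 && in[0] == '.' && in[1] == '.') {
      // back up over the previous component, stopping at the root
      while (out > path && *--out != '/');
    } else {
      *out++ = '/';
      memmove(out, in, len);      // out never passes in, so this is safe
      out += len;
    }
    in += len;
  }
  if (out == path) *out++ = '/';  // everything was stripped: result is "/"
  *out = 0;

  return path;
}
```

With this sketch, a writable buffer holding "/usr/lib/../bin" becomes "/usr/bin", and "/.." becomes "/". The result is what a logical $PWD would show, which can differ from what getcwd() reports when a component is a symlink.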

> Without interactive parsing and a completion system, it's
> probably 2-3x simpler, and if you account for that, it's probably 5x
> simpler.

A) I need to do _both_,

B) The shell I use extensively on a regularish basis. Awk I just pipe
data into '{print $5}' and that's literally all I ever use it for.
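
For reference, the core of that one use case is just default field splitting. Below is a minimal sketch in C of pulling out the Nth blank-separated field of a line. The name get_field and its signature are hypothetical, not anything from toybox or an existing awk, and the sketch ignores FS, RS, and everything else a real awk has to handle.

```c
#include <string.h>

// Hypothetical helper (not toybox code): copy the Nth (1-based) field of
// line into out, splitting on runs of space, tab, and newline the way
// awk's default field splitting does. Returns the number of bytes copied,
// or 0 (with out set to "") when the line has fewer than n fields.
static size_t get_field(const char *line, int n, char *out, size_t outlen)
{
  const char *blank = " \t\n";
  size_t len = 0;

  if (!outlen) return 0;
  *out = 0;
  while (n-- > 0) {
    line += len;                  // step past the previous field
    line += strspn(line, blank);  // skip the separators before this one
    len = strcspn(line, blank);   // measure this field
    if (!len) return 0;           // ran out of fields
  }
  if (len >= outlen) len = outlen-1;  // truncate to fit the buffer
  memcpy(out, line, len);
  out[len] = 0;

  return len;
}
```

Calling get_field(line, 5, buf, sizeof(buf)) on each input line and printing buf followed by a newline would approximate '{print $5}', including the empty output line awk produces when a record has fewer than five fields.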

> One thing that I didn't realize is that Ubuntu and Debian use mawk
> instead of gawk as their default awk.  So I assume all their package
> building scripts run with mawk?  That's good because mawk is a lot
> smaller than gawk.

Everything I've tried works ok with busybox awk. Back when I maintained
busybox there was a responsive awk developer who would fix stuff if I made
puppy eyes at them about a specific test case, and once I got it to
support all the Linux From Scratch packages, that turned out to be
everything anybody ever actually used, as far as I've noticed since.

> And I think Aboriginal Linux runs with busybox awk?  That's also good
> because busybox awk is much smaller than mawk!
> 
> I took a peek at 4 implementations:
> 
> - gawk - GPLv3 - 66 K lines + 14K lines of extensions.  Yacc grammar.
> (This has a C extension interface, profiler and debugger, a somewhat
> ugly networking library built-in, etc.)
> 
> - mawk (updated 2015) - GPLv2 - 21K lines.  Yacc grammar.  (It's
> supposed to be fast because it's based on a byte-code interpreter
> rather than walking a tree?)
> 
> - busybox awk - GPLv2 - ~3300 lines in editors/awk.c, though it's not
> clear to me how much library code is used.  It includes xregex.h
> although also uses libc regexec().  Hand-written parser.
> 
> - Kernighan Awk (updated 2012) - 8K lines.  Lucent BSD? license.  Yacc grammar.
> 
> (Some of the line counts may be a bit off because I didn't really
> tease out the source parse.y file vs the generated .c and .h files)
> 
> All of them use Yacc except busybox, which isn't that surprising
> because I heard Kernighan say that Yacc was foundational in developing
> awk.  They designed the language with it.

I have that youtube video bookmarked on my phone. (The "Computerphile"
channel interview with Kernighan, if I recall...)

> Busybox awk is impressively small.  I thought you said there was a lot
> of hairy awk in binutils or something, so I'm guessing that all runs
> under busybox awk?

It didn't when I started looking, but by the end of my maintainership
I'd run out of test cases that broke it, yes.

> I'm guessing it's not possible for toybox to borrow code from it
> because of the license,

Correct.

> but I wonder about the Lucent license.

I don't: we use a public domain equivalent license, and that one isn't.

> The lexer is 582 lines of clean looking C code (it's Kernighan, so I guess
> we all know his style :) ), which is not insignificant!

I'm not adding yacc as a build dependency.
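
As a rough illustration of what parsing without a generator looks like (the approach busybox awk takes with its hand-written parser), here is a minimal recursive descent sketch in C. It is not taken from any awk implementation. The function names are hypothetical, it only evaluates + - * / and parentheses on doubles, and it does no error reporting.

```c
#include <ctype.h>
#include <stdlib.h>

// Hypothetical sketch (not from any awk implementation): each grammar rule
// is one C function, so no parser generator is needed at build time.
static double parse_expr(const char **s);

static void skip_blanks(const char **s)
{
  while (isspace((unsigned char)**s)) ++*s;
}

// factor: NUMBER | '(' expr ')' | '-' factor
static double parse_factor(const char **s)
{
  double val;
  char *end;

  skip_blanks(s);
  if (**s == '(') {
    ++*s;
    val = parse_expr(s);
    skip_blanks(s);
    if (**s == ')') ++*s;   // a missing ')' is silently tolerated here
    return val;
  }
  if (**s == '-') {
    ++*s;
    return -parse_factor(s);
  }
  val = strtod(*s, &end);   // yields 0 without advancing if no number
  *s = end;

  return val;
}

// term: factor (('*' | '/') factor)*
static double parse_term(const char **s)
{
  double val = parse_factor(s);

  for (;;) {
    skip_blanks(s);
    if (**s == '*') { ++*s; val *= parse_factor(s); }
    else if (**s == '/') { ++*s; val /= parse_factor(s); }
    else return val;
  }
}

// expr: term (('+' | '-') term)*
static double parse_expr(const char **s)
{
  double val = parse_term(s);

  for (;;) {
    skip_blanks(s);
    if (**s == '+') { ++*s; val += parse_term(s); }
    else if (**s == '-') { ++*s; val -= parse_term(s); }
    else return val;
  }
}
```

Pointing a const char * at "1 + 2 * (3 - 1)" and passing its address to parse_expr() gives 5, with operator precedence falling out of which function calls which. A real awk grammar is far larger, but the structure scales the same way.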

> Andy

Rob

