[Toybox] awk seen in the wild

Andy Chu andychup at gmail.com
Wed Jul 20 22:25:52 PDT 2016


>> However, I did a bunch of research and hacking on Kernighan's Awk.  I
>> was trying to morph it into a "proper" modern language.
>
> Another one?
>
> why?

Because if you're going to rule out Python/Perl/etc. on a minimal
Unix, which I mostly agree with, then you still need a decent
scripting language at that level of abstraction.  Awk is by far the
closest out of any "classic Unix" language.

> Presumably you wouldn't remove anything significant from the base
> language, since that would break compatibility with existing awk
> scripts, so your reaction to awk was "how could I fork this to make it
> bigger"?

The overall system would be smaller if you expanded awk (add
readlink(), etc.) and then wrote the core utilities in it.

But as mentioned in my previous message, through my hacking I
determined that both Awk and Make have bad semantics (while shell has
good semantics).  So my current idea is to have "extreme
compatibility" for my shell, but add some awk and make features to it,
so you don't have to remember 3 different syntaxes for loops,
conditionals, and function calls.

In other words, the awk and make parts aren't compatible with actual
awk and make -- they just share the same architecture (row-wise
streaming of data and data-oriented parallel builds).  But the shell
part is compatible.
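
For readers who have not used awk, here is a minimal sketch of the
"row-wise streaming" architecture mentioned above, written in Python
purely for illustration.  The names run_rules, rules, and data are
hypothetical and come from neither awk nor the shell project discussed
here.  Each input line is split into fields, and every (pattern,
action) pair is tried against it, roughly what
awk '$1 == "error" { print NR ": " $2 }' does.

```python
def run_rules(lines, rules, sep=None):
    """Stream records one at a time and apply (pattern, action) pairs to each.

    Hypothetical helper: 'pattern' and 'action' are callables taking the
    record number and the list of fields, loosely mirroring awk's NR and $1..$N.
    """
    for nr, line in enumerate(lines, start=1):
        fields = line.split(sep)  # sep=None splits on runs of whitespace
        for pattern, action in rules:
            if pattern(nr, fields):
                result = action(nr, fields)
                if result is not None:
                    yield result


# One rule: for records whose first field is "error", emit "NR: second field".
rules = [
    (lambda nr, f: len(f) >= 2 and f[0] == "error",
     lambda nr, f: f"{nr}: {f[1]}"),
]

data = ["ok boot", "error disk", "error net"]
print(list(run_rules(data, rules)))  # ['2: disk', '3: net']
```

Because run_rules is a generator, it only ever holds one record in
memory, which is the property that makes this model work on large
inputs.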

To remind everyone, the basic beef is that Unix has a good
architecture, but horrible syntax.  And there are too many languages.
Nobody younger than me learns awk or make anymore, or even shell.
They just feel old.  The average engineer at Google who started their
career in the last 10 years doesn't know any of those things.

The xkcd somewhat applies, but is mitigated by 2 things:

1) You can replace an entrenched technology/language if you make your
new thing a superset of the old thing, i.e. retaining a high degree of
compatibility.

After my research on ksh, it's clear that this is how bash gained
popularity.  ksh was the most popular implementation at the time of
the POSIX standard, and was probably the biggest influence on the
standard.  bash was playing catch up -- it aggressively implemented
POSIX *and* the non-POSIX parts of ksh.   So eventually people ported
their ksh scripts to bash.

It's basically embrace-and-extend in the open source world... there's
a reason that Microsoft used that strategy -- it works.  You implement
something bug-for-bug and then you extend it with useful features.

2) awk has a much smaller user base than shell.  You do see big awk
scripts, but you see MANY more big shell scripts.  And there are more
shell scripts altogether.

awk and make are also at least 5x smaller and 5x easier to implement
than the shell (if you look at bash/zsh vs GNU awk/make, as well as
other implementations).

If you can manage to fold some awk functionality into shell, then you
could possibly decrease the total number of languages (at least in a
given system).  As I said, nobody needs 3 different syntaxes for
loops, function calls, and expressions.  (And you cannot avoid them,
at least if you are looking at real systems...)

> The lua thing fell apart trying to write mount, ifconfig, netcat,
> losetup, nsenter, ionice, chroot, swapon, setsid, insmod, taskset,
> dmesg... The language just didn't have the bindings.

Sure, but you can find bindings or write them yourself.  That's the
whole point of Lua!

>> I think you mentioned you were looking for an awk test suite.  Well
>> there it is -- there are hundreds or thousands of test cases,
>> including for the regex language.
>
> Which is provided by libc.

Kernighan Awk has its own regex implementation "b.c" in 958 lines, and
there is an argument to keep it.  It uses the Thompson linear-time
NFA/DFA algorithm rather than exponential backtracking.  See the note
here:

https://github.com/andychu/bwk

Coincidentally StackOverflow was down today for a related reason...
matching regexes on user input can blow up CPU on your servers:
https://news.ycombinator.com/item?id=12131909 (I linked to my bwk repo
there).  And I have seen this bug before elsewhere.

It matters what algorithm you use, and awk/sed/grep are all used with
big input data and (I think?) big regexes.  I use them on gigabytes of
text.  It probably doesn't matter for bash [[, because there you are
just matching a short string against a regex.
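
To make the algorithm point concrete, here is a minimal sketch of the
state-set simulation idea behind the linear-time approach.  It is not
the code in b.c, and it assumes a deliberately tiny regex dialect:
literal characters, '.', and the postfix operators ?, * and +, with no
grouping, alternation, or escapes.  The function names parse, closure,
and full_match are hypothetical.  The matcher tracks every position it
could be at in the pattern simultaneously, so the work per input
character is bounded by the pattern length and nothing is ever retried.

```python
def parse(pattern):
    """Split a pattern into (char, quantifier) items; quantifier is '', '?', '*' or '+'."""
    items = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        quant = ''
        if i + 1 < len(pattern) and pattern[i + 1] in ('?', '*', '+'):
            quant = pattern[i + 1]
            i += 1
        items.append((ch, quant))
        i += 1
    return items


def closure(states, items):
    """Add every state reachable by skipping optional items ('?' or '*')."""
    result = set()
    for s in states:
        result.add(s)
        while s < len(items) and items[s][1] in ('?', '*'):
            s += 1
            result.add(s)
    return result


def full_match(pattern, text):
    """Return True if the whole text matches, using a set of live states."""
    items = parse(pattern)
    states = closure({0}, items)
    for c in text:
        nxt = set()
        for s in states:
            if s == len(items):
                continue  # accepting state has no outgoing transitions
            ch, quant = items[s]
            if ch == '.' or ch == c:
                if quant in ('*', '+'):
                    nxt.add(s)      # loop: the item may match again
                if quant != '*':
                    nxt.add(s + 1)  # advance ('*' advances via closure)
        states = closure(nxt, items)
        if not states:
            return False
    return len(items) in states


n = 25
print(full_match('a?' * n + 'a' * n, 'a' * n))  # True
print(full_match('ab*c', 'abbbc'))              # True
print(full_match('ab*c', 'abx'))                # False
```

The first pattern (n copies of "a?" followed by n copies of "a",
matched against n copies of "a") is the classic worst case for a
backtracking matcher, which may try on the order of 2^n paths before
succeeding.  The state set here never grows beyond the pattern length
plus one.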

I think GNU awk/sed/grep all have their own regex implementation and
don't use libc, but I could be wrong (?).  I thought busybox had some
of its own regex support too.

Andy

