[Toybox] awk seen in the wild

Thu Jul 21 03:57:34 PDT 2016

On 07/21/2016 12:25 AM, Andy Chu wrote:
>>> However, I did a bunch of research and hacking on Kernighan's Awk.  I
>>> was trying to morph it into a "proper" modern language.
>>
>> Another one?
>>
>> why?
> 
> Because if you're going to rule out Python/Perl/etc. on a minimal
> Unix, which I mostly agree with, then you still need a decent
> scripting language at that level of abstraction.  Awk is by far the
> closest out of any "classic Unix" language.

I'm fairly certain awk wasn't intended to be turing complete, but I'd
have to dig into the computerphile youtube interviews with Kernighan to
try to find his actual quote.

>> Presumably you wouldn't remove anything significant from the base
>> language, since that would break compatibility with existing awk
>> scripts, so your reaction to awk was "how could I fork this to make it
>> bigger"?
> 
> The overall system would be smaller if you expanded awk (add
> readlink(), etc.) and then wrote the core utilities in it.

You are aware that perl started life as an attempt to combine awk, sed,
and shell into a single tool, right?

Is this really a model you want to emulate? (I say this as someone who
happily used C++ back when it was "C with classes" before templates went
into the language. Doesn't mean I think another "C with classes" fork
would be worth doing, which is why I never tried to learn objective C.)

> But as mentioned in my previous message, through my hacking I
> determined that both Awk and Make have bad semantics (while shell has
> good semantics).  So my current idea is to have "extreme
> compatibility" for my shell, but add some awk and make features to it,
> so you don't have to remember 3 different syntaxes for loops,
> conditionals, and function calls.

So instead of awk+sed+sh you're doing awk+make+sh.

> In other words, the awk and make parts aren't compatible with actual
> awk and make -- they just share the same architecture (row-wise
> streaming of data and data-oriented parallel builds.)  But the shell
> part is compatible.

You're creating yet another programming language (because clearly we
haven't got enough of those yet) and if a system needs a script written
in your tool they'll install yet another language alongside ruby and lua
and python and so on.

People write programs in a language. If it's not a compiled language,
then the runtime of that language becomes a runtime dependency of that
thing. If it is compiled, it's still a build-time dependency, and
knowledge of it becomes a maintenance dependency. It can only be
modified/extended/ported/fixed by people who understand that language.

Everybody who creates a new language dreams that their language will
cause a net simplification of the world by displacing OTHER languages
and causing them to die out, but just about the only time this has EVER
happened was C and it took several decades to do it, and by that time C
itself was under embrace-and-extend attack by C++ (which still hasn't
got any sort of boundaries to stop its endless feature creep), not to
mention also-rans like objective C, or the modern crop of go/rust/swift
that is each sure it will become the new C and somehow get all the
kernels and other language runtimes rewritten in it.

I ran "apropos interpreter" on my ubuntu 14.04 netbook (reasonably
stock, I try not to install extra build dependencies on it) and it found
erb (ruby), perl, dash, and tclsh. This DIDN'T pull up bash or python,
which I know are on here, so clearly isn't a complete list.

Your new thing, if _wildly_ successful, would displace none of those.
Python 3 can't even displace python 2.

> To remind everyone, the basic beef is that Unix has a good
> architecture, but horrible syntax. And there are too many languages.
> Nobody younger than me learns awk or make anymore.

I didn't learn make until I had to (after graduating from college).

> Or even shell. It just feels old.

I sat through a couple decades being vastly outnumbered by MCSEs doing
visual basic. They also pointed and laughed at shell scripts. Guess
which outlived which?

> The average engineer at Google doesn't know any of
> those things if they started their career in the last 10 years.

If their job is making chrome render faster on windows, I'm not
surprised? You guys do web infrastructure through self driving cars on
systems that got installed for you. You're writing apps in Python and C
and such, doing cluster load balancing and AI research into semantic web
analysis. Banging on the OS is not their domain expertise.

(Do you know how hard it is to google for android _system_ information?
When you try to find information about Android programming, it's apps in
java all the way down. Clearly this means there IS no part of the system
written in C, it's just java, as javaos demonstrated the feasibility of
20 years ago...)

Also, from 2005 to 2012 Guido van Rossum worked at Google, during which
Google made a big deal about writing stuff in Python. Then Guido left
for Dropbox and Google stopped talking about python so much (at least
externally), but Ken Thompson and Rob Pike took over providing an
in-house language (Go) for Google to have Invented Here. (Apple's is
Swift, apparently because Objective C was so totally overshadowed by C++
they wanted to try again.)

As I said, a few years ago a friend worked for a company called Basho
that did everything in Haskell. A decade back I worked a contract at a
company that did everything in Perl 4. (Yes, significantly post-y2k!)
The culture of a specific company, even Google, isn't necessarily
representative of the world at large. "We use this set of technologies
here" != "this is the universally important set of technologies".

Years ago I noticed a corollary to Moore's Law, which is that 50% of
what you know about software becomes obsolete every 18 months. The nice
thing about unix is it's mostly been the same 50% cycling out over and
over for many decades now.

I've learned lots of things that only lasted 5 years, and decided to
just wait for others to go away. I waited out AOL. I'm currently waiting
out Facebook. It took a LONG TIME to wait out Windows but at this point
I no longer have to care about it. I can wait out systemd.

I'm not sure what parts of Android's infrastructure will cycle out in my
lifetime and which new generations will embrace, but "it builds under
itself" is necessary for the long-term health of any platform. Right now
it builds under a semi-posix environment that's enumerable and can be
bounded, and I'm trying to transplant that.

> The xkcd somewhat applies, but is mitigated by 2 things:
> 
> 1) You can replace an entrenched technology/language if you make your
> new thing a superset of the old thing, i.e. retaining a high degree of
> compatibility.

The way C++ included the whole of C and was therefore clearly superior
to C, you mean? Or the way perl combined awk and sed and shell?

Who was it who said any problem can be solved by adding another layer of
indirection, except too many layers of indirection?

I'm not sure "let's use a bigger tool to solve the same problems" is a
rallying cry I'm really comfortable getting behind. A couple core ideas
of unix were small tools connected by pipes (communicating via mostly
textual interfaces because humans can read that), and "do one thing and
do it well".

That's part of the reason it's survived so well: it's made from
decoupled parts you can individually swap out. (more->less and so on.)
The point of a shell script is to easily call lots of external commands
which are not, strictly speaking, part of the shell. Yes busybox and
toybox blur those lines, but not execing itself again out of the $PATH
is _mostly_ a performance hack, I.E. entirely optional and can be disabled.

There are a set of shell builtin commands that can't be implemented
externally (cd/exit/export/read are process-local because they modify
process attributes like environment variables and cwd), but when the
kernel guys gave me the tools to do "ulimit" as a standalone command, I did.
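
To illustrate why cd has to be a builtin, here is a minimal Python
sketch (not toybox code; the function names are made up for this
example, and it assumes a Unix system with "sh" in the $PATH). A child
process that changes directory only changes its own cwd, so the parent
never sees it:

```python
import os
import subprocess
import tempfile


def cwd_after_child_cd(target):
    """Run `cd target` in a child shell; return parent's cwd before/after."""
    before = os.getcwd()
    # The child changes its own working directory, then exits.
    subprocess.run(["sh", "-c", 'cd "$1"', "sh", target], check=True)
    return before, os.getcwd()


def cwd_after_own_chdir(target):
    """Change directory in this process, which is what a builtin does."""
    os.chdir(target)
    return os.getcwd()


if __name__ == "__main__":
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        before, after = cwd_after_child_cd(tmp)
        print("external cd moved the parent:", before != after)
        print("in-process chdir landed in:", cwd_after_own_chdir(tmp))
        os.chdir(start)  # leave the temp dir before it is removed
```

The same reasoning applies to exit, export, and read: each one modifies
state that belongs to the calling process.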

> After my research on ksh, it's clear that this is how bash gained
> popularity.  ksh was the most popular implementation at the time of
> the POSIX standard, and was probably the biggest influence on the
> standard.  bash was playing catch up -- it aggressively implemented
> POSIX *and* the non-POSIX parts of ksh.   So eventually people ported
> their ksh scripts to bash.

Yes and no.

Bash was the first program Linux ever ran. Linus created the Linux
kernel by extending his boot-from-floppy terminal program to handle bash
system calls so he didn't have to keep rebooting into minix to
list/rename/move/delete files and directories. (He had a tiny hard drive
and was constantly clearing off space to download more files from usenet
via the university microvax).

The initial release of bash was June 8, 1989, meaning when Linus posted
the first Linux announcement on August 25, 1991 bash was 2 years old.

Bash did not become the default shell of solaris (tcsh), freebsd (also
tcsh), or aix (korn). As far as I can tell, its popularity was largely
driven by the fact that it was the default shell of every single Linux
installation for fifteen years, until Ubuntu decided in 2006 that its
init scripts ran too slow. No really, that's the reason they gave for
the switch:

  https://wiki.ubuntu.com/DashAsBinSh

Even then diversifying away from bash was gradual; even Debian (which
Ubuntu was basically carrying at that point, hiring full-time people to
work on what was otherwise a badly struggling distro) took 3 years to
bow to Ubuntu's will here:

  https://lwn.net/Articles/343924/

> It's basically embrace-and-extend in the open source world... there's
> a reason that Microsoft used that strategy -- it works. You implement
> something bug-for-bug and then you extend it with useful features.

No, their strategy was bundling. There were two entire antitrust trials
about this, and a quote about a ham sandwich. From windows coming free
with every copy of DOS (and per-motherboard licensing so you couldn't
buy the hardware without getting their OS) to Office (can't use their
spreadsheet without their word processor and powerpoint) to making their
browser un-removable from their OS (which still didn't let them change
HTML much, no matter how they tried).

They've embraced and extended all sorts of stuff people utterly ignored
or which got traction and then lost it again, from the Zune embracing
and extending MP3, Microsoft's "J" language (then C# when they got sued)
embracing and extending Java... Outlook tried to embrace and extend
email but their SUCCESS in that area was bundling calendaring with email
so you needed one to use the other (both tied to an exchange server
using protocols they tried to make very hard to reverse engineer).

Bundling can't really be forced in the open source world, the closest
you get is de-facto standards, such as bash and gcc were for linux. As
Linux succeeded, bash succeeded, and got installed on other systems
because people wanted their familiar Linux environment there too. Then
ubuntu bundled dash instead (because stupid) and pushed the other way,
but it still took a while even with Ubuntu having the same 50%
workstation market share that Red Hat gave up a few years earlier...

> 2) awk has a much smaller user base than shell.  You do see big awk
> scripts, but you see MANY more big shell scripts.  And there are more
> shell scripts altogether.
> 
> awk and make are also at least 5x smaller and 5x easier to implement
> than the shell (if you look at bash/zsh vs GNU awk/make, as well as
> other implementations)

Which is why I plan to implement awk and make as their own commands?

> If you can manage to fold some awk functionality into shell, then you
> could possibly decrease the total number of languages (at least in a
> given system).

No, seriously, this is how Larry Wall created perl:

http://www.shlomifish.org/lecture/Perl/Newbies/lecture1/intro/history.html

That way lies the Emperor of all Cosmos having a drunken bender and
making you push a ball around Japanese living rooms to surprisingly
catchy theme music in order to create replacement stars.

> As I said, nobody needs 3 different syntaxes for
> loops, function calls, and expressions.  (And you cannot avoid them,
> at least if you are looking at real systems...)

Those who do not know the history of loop syntaxes are doomed to repeat.

>> The lua thing fell apart trying to write mount, ifconfig, netcat,
>> losetup, nsenter, ionice, chroot, swapon, setsid, insmod, taskset,
>> dmesg... The language just didn't have the bindings.
> 
> Sure but you can find bindings or write them yourself.  That's the
> whole point of Lua!

If I have to write code in C and cross-compile it to every supported
target in order to bootstrap a system, what am I bothering with Lua for?
It's just another unnecessary prerequisite package, I might as well just
write the whole thing in C. (So I did.)

(And no, "You can install 7 externally maintained prerequisite packages
instead" is not an improvement.)

>>> I think you mentioned you were looking for an awk test suite.  Well
>>> there it is -- there are hundreds or thousands of test cases,
>>> including for the regex language.
>>
>> Which is provided by libc.
> 
> Kernighan Awk has its own regex implementation "b.c" in 958 lines, and
> there is an argument to keep it.  It uses the Thompson linear-time
> NFA/DFA algorithm rather than exponential backtracking.

If musl or bionic should have a better regex implementation, fine. But I see
no need to reinvent this particular wheel. (And I've reinvented a lot of
wheels. In fact I wrote my own regex engine for OS/2 feature install in
1996, although that used glob syntax rather than regex syntax because I
did way more DOS than Unix back then. But I have NOT written one for
toybox. Libc exists and posix says it should have this.)

> See the note here:
> 
> https://github.com/andychu/bwk
> 
> Coincidentally StackOverflow was down today for a related reason...
> matching regexes on user input can blow up CPU on your servers:
> https://news.ycombinator.com/item?id=12131909 (I linked to my bwk repo
> there).  And I have seen this bug before elsewhere.

I collated identical * runs even in my old glob implementation because
otherwise it was obvious N^X complexity dealing with them. I assume Rich
has decent stuff in musl and bionic can rip anything they haven't
already got from there. If not, I'm aware of a couple smart guys
maintaining those things who can handle this so it's not _my_ problem. :)
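
For anyone who hasn't hit the repeated-star blowup, here is a rough
Python sketch of the idea. This is not the old OS/2 code, just a
deliberately naive recursive matcher (supporting only * and ?) with a
call counter, plus a hypothetical collapse_stars() helper that merges
adjacent stars before matching:

```python
def collapse_stars(pattern):
    """Merge each run of adjacent '*' into one; '**' matches what '*' does."""
    out = []
    for ch in pattern:
        if ch == "*" and out and out[-1] == "*":
            continue
        out.append(ch)
    return "".join(out)


def naive_match(pattern, text, stats):
    """Backtracking glob match for '*' and '?'; counts calls in stats."""
    stats["calls"] += 1
    if not pattern:
        return not text
    if pattern[0] == "*":
        # Try every possible length for this star, shortest first.
        return any(naive_match(pattern[1:], text[i:], stats)
                   for i in range(len(text) + 1))
    if text and (pattern[0] == "?" or pattern[0] == text[0]):
        return naive_match(pattern[1:], text[1:], stats)
    return False


def count_calls(pattern, text):
    """Return (matched, number of recursive calls) for one match attempt."""
    stats = {"calls": 0}
    return naive_match(pattern, text, stats), stats["calls"]


if __name__ == "__main__":
    text = "a" * 12           # never matches: there is no 'b' in it
    raw = "*****b"
    print("raw:      ", count_calls(raw, text))
    print("collapsed:", count_calls(collapse_stars(raw), text))
```

Both runs give the same answer, but the uncollapsed pattern retries
every way of splitting the text between the stars, so the call count
grows quickly with each extra star.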

> It matters what algorithm you use, and awk/sed/grep are all used with
> big input data and (I think?) big regexes.  I use them on gigabytes of
> text.  It probably doesn't matter for bash [[, because there you are
> just matching a short string against a regex.

Don't assume what your inputs look like. Modern Linux removed the 128k
environment space limitation almost a decade ago (commit b6a2fea3931
which went into 2.6.22 released July 2007) and it was never there for
local shell variables anyway.

So there's no reason shell can't read X and "$X" =~ blah and churn
through as big an input as anything else.

> I think GNU awk/sed/grep all have their own regex implementation and
> don't use libc, but I could be wrong (?).

Gnu bash 2.x has its own malloc implementation, which is why I had to
say --without-bash-malloc in configure. (Dunno what more current bash is
doing, I haven't built any of the gplv3 versions from source.)

The epic "not invented here" of the gnu/dammit brigades is kind of
impressive. Me, I've consistently said if your libc is broken fix your libc.

(I'm also aware of the performance hacks their grep does with block
reads instead of line reads, and then backing up to find line context
after the match. In theory I could do that with libc's regex stuff too.
In practice, I haven't gone there. My big todo item is making it work
with embedded NUL bytes, which is delaying my current round of grep
replumbing...)

(So you can "grep string /bin/blah" out of executables, of course. Has
to find the string after the NUL byte. No, I can't use libc's regex
engine to cross null bytes, but I can't cross newlines either.)
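
The "match first, then back up to find the line" trick can be sketched
like this. It's a simplified Python illustration, not gnu grep's or
toybox's code: it searches one in-memory buffer for a fixed byte string
(assumed to contain no newline) rather than reading blocks and running
a regex, and find_lines() is a name made up for the example:

```python
def find_lines(data, needle):
    """Yield (offset, line) for each line of data containing needle."""
    pos = 0
    while True:
        hit = data.find(needle, pos)
        if hit < 0:
            return
        # Back up to the start of the line containing the hit.
        start = data.rfind(b"\n", 0, hit) + 1
        # Scan forward to the end of that line (or end of buffer).
        end = data.find(b"\n", hit)
        if end < 0:
            end = len(data)
        yield start, data[start:end]
        # Resume after this line so each line is reported once.
        pos = end + 1


if __name__ == "__main__":
    sample = b"alpha\nbe\x00ta needle here\ngamma needle\n"
    for offset, line in find_lines(sample, b"needle"):
        print(offset, line)
```

Searching bytes means an embedded NUL is just another byte here. A real
block-reading version also has to cope with lines that straddle two
blocks, which this sketch sidesteps by holding everything in memory.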

> I thought busybox had some of its own regex support too.

Not that I recall, but it's been 10 years. The sed I wrote for busybox
way back when used libc's regex though. (There was a wrapper but I
believe it was just the standard "turn regcomp() failure into exit()"
sort of thing.)

> Andy

Rob