[Toybox] Would someone please explain what bash is doing here?

Rob Landley rob at landley.net
Sun May 17 04:11:18 PDT 2020


I had a reply window open to this when my laptop battery died, and thunderbird
doesn't store unfinished messages the way kmail and vi and chrome do...

Anyway, I was reminded of this thread by:

  $ IFS=x; ABC=cxd; for i in +($ABC); do echo =$i=; done
  =+(c=
  =d)=
  $ bash -c 'IFS=x; ABC=cxd; for i in +($ABC); do echo =$i=; done'
  bash: -c: line 0: syntax error near unexpected token `('
  bash: -c: line 0: `IFS=x; ABC=cxd; for i in +($ABC); do echo =$i=; done'
  $ readlink -f /proc/$$/exe
  /bin/bash

(I tried inserting shopt -o extglob; and shopt +o extglob; at the start of the
-c and it didn't change anything?)
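
As far as I can tell, two things are going on there: the usual spelling is
shopt -s/-u extglob (shopt -o only covers the set -o option names), and
extglob is a parse-time option, so flipping it at the start of the -c string
can't rescue a line that was already rejected as one parse unit. (Interactive
shells usually get extglob switched on by the bash_completion scripts, which
would explain the difference.) Enabling it at invocation, before any parsing
happens, should make -c match the interactive behavior:

  $ bash -O extglob -c 'IFS=x; ABC=cxd; for i in +($ABC); do echo =$i=; done'
  =+(c=
  =d)=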

Yes, I'm still trying to work out wildcard parsing behavior. :)

Here's a second attempt at replying to the rest, sorry if it's duplicative but I
don't remember what I already said because of the email that wasn't sent, and
because of the emails that were. :)

On 5/11/20 3:55 PM, Chet Ramey wrote:
> On 5/10/20 7:24 PM, Rob Landley wrote:
> 
>>>>   $ echo \
>>>>   > $LINENO
>>>>   2
>>>>
>>>>   $ echo $LINENO \
>>>>   $LINENO
>>>>   1 1
>>>
>>> Let's look at these two. This is one of the consequences of using a parser
>>> generator, which builds commands from the bottom up (This Is The House That
>>> yacc Built).
>>
>> I've never understood the point of yacc. I maintained a tinycc fork for 3 years
>> because it was a compiler built in a way that made sense to me.
> 
> It saves you having to write a lot of code if you have a good grammar to
> work from. One of the greatest achievements of the original POSIX working
> group was to create a grammar for the shell language that was close to
> being implementable with a generator.

Generated code exists and has its costs whether or not you had to manually write
it. There's a classic paper/blog rant about that (specifically its impact on the
Java language):

  http://steve-yegge.blogspot.com/2007/12/codes-worst-enemy.html

I maintained a tinycc fork for 3 years (https://landley.net/hg/tinycc) after
Fabrice Bellard abandoned it because its design made sense to me, and the result
was a 100k self-contained compiler binary that built a working linux kernel in a
single-digit number of seconds on 15-year-old hardware
(https://bellard.org/tcc/tccboot.html), and it did this by _not_ using yacc. I
keep hoping somebody will steal the https://landley.net/code/qcc idea and do it
so I don't have to.

Back before Eric Raymond went crazy, he and I were working together on a paper
he wanted to call "why C++ is not my favorite language", which was about local
peaks in language design space, the difference between static and dynamic
languages, and the downsides of the no man's land in between the two local peaks.

After we stopped being able to work together, I did a 3-part writeup of my
understanding of what the paper would have been about:

  https://landley.net/notes-2011.html#16-03-2011
  https://landley.net/notes-2011.html#19-03-2011
  https://landley.net/notes-2011.html#20-03-2011

Anyway, the _point_ of that is "scripting languages are a thing", and the proper
tool for the proper job. To me code generation means "you should probably be
using a scripting language for this".

>> The tricky bit is "echo hello; if" does not print hello before prompting for the
>> next line, 
> 
> Yes. You're parsing a command list, because that's what the `;' introduces,
> and the rhs of that list isn't complete. A "complete_command" can only be
> terminated by a newline list or EOF.

Hence my line-based parsing function that returns whether or not it needs
another line to finish the current thought.
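
To make that concrete, the "hello" doesn't come out until the if statement is
complete:

  $ echo hello; if
  > true; then echo there; fi
  hello
  there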

>> and "while read i; echo $i; done" resolves $i differently every way,
> 
> You mean every time through the loop?

Yup.
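
I.E. the loop body gets re-expanded on every iteration:

  $ printf '%s\n' one two | while read i; do echo =$i=; done
  =one=
  =two=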

>> which said to _me_ that the parsing order of operations is
>>
>> A) keep parsing lines of data until you do NOT need another line.
>>
>> B) then do what the lines say to do.
> 
> Roughly, if you mean "complete commands have been resolved with a proper
> terminator" and "execute the commands you just parsed."

The execution can be deferred arbitrarily long (you may be defining a function)
or never (the else case of an if statement that was true), but yeah.

>> I have parse_word() that finds the end of the next word (returning NULL if we
>> need another line to finish a quote, with trailing \ counting as a quote), and
>> parse_line() that adds words to struct sh_arg {int c; char **v;} in a linked
>> list of struct sh_pipeline in a struct sh_function, and when parse_line() does
>> NOT return a request for another line, the caller can run_function() and then
>> free_function().
> 
> Have you found that structure enough to handle, say, if-then-elif-else-fi
> and the various flavors of the `for' command?

Yup. I had to rewrite it about 5 times as I kept finding new cases I hadn't
accounted for, but eventually got there. Starting somewhere around:

  https://landley.net/notes-2019.html#02-06-2019

And going through... pretty much the rest of that month?

Sigh. I have "<span id=programming>" tags in my blog and I've been MEANING to
upgrade my old python rss generator to produce multiple rss feeds (so instead of
just rss.xml there's rss-programming.xml and so on).

But looking at it, I got really lazy about tagging at times because nothing ever
tested it. June 20 and 21 of 2019 say span id=programming when they should be
span id=energy. And I haven't wrapped anything with id=politics in forever (I
just left those entries unwrapped so they're NOT id=programming...)

(Yes, I've always had a PLAN to let people read specific TOPICS out of my blog
without having to deal with the rest of it. I just never got around to it, and
somebody telling me what to do when I'm already stressed is not the way to get
me to do it.)

>>> and goes back for more input. The
>>> lexer sees there are no tokens left on its current input line, notes that
>>> line ends in backslash and reads another line, incrementing the line
>>> number, throws away the newline because the previous line ended in
>>> backslash,
>>
>> I _was_ throwing away the newline, but I stopped because of this. Now I'm
>> keeping it but treating it as whitespace like spaces and tabs, but that's wrong:
> 
> It is wrong; it needs to be removed completely.

Yup, I already added that test:

  shxpect 'line continuation2' I$'echo ABC\\\n' E'> ' I$'DEF\n' O$'ABCDEF\n'

I.E. 'echo ABC\' as the shell's first line of input, wait for the > prompt,
input 'DEF' as the next line, and the output should be 'ABCDEF' all on one line.

(Of course I implemented my own expect-style plumbing in shell. How else do you
test line continuations?)

>>> and returns $LINENO. The parser finally has enough input to
>>> reduce to a simple command, and builds one, with the line number set to 2.
>> Ok, so line number is _ending_ line, not starting line. (Continuations increment
>> the LINENO counter _before_ recording it for the whole span.)
> 
> Not necessarily the ending line; a simple command can span an arbitrary
> number of lines, but $LINENO gets set from whatever line the lexer was on
> when the parser recognized what it had as a simple command. It can then
> continue reading words in that simple command until it gets the unescaped
> newline (or `;', or `&', or any of the command separators) to terminate it.

My parse_line() function digests a line into a linked list of struct sh_pipeline
which contains argc/argv pairs (one for the command and one more for each HERE
document encountered), and an integer "type" field that's 0 for normal
executable statement, 1 for start of block (if/while...), 2 for gearshift
(then/do), and 3 for end of block (fi/done), plus a few others like 'f' for
function definition and 's' for the argument line in a for loop that gets
expanded but isn't runnable the way an if statement's argument is...

Anyway, that structure needs an "int lineno" added that gets snapshotted from the
global TT.lineno, and what I've learned from all this is that it gets snapshotted at the
end when we close out the sh_pipeline and start the next one, not at the
beginning when it's allocated. (That's the observation that makes the behavior
make sense now.)

P.S. All the wildcard plumbing I've been adding has bumped me up to 2866 lines
of sh.c. I'm trying to finish under 3500 lines total. Luckily there's debug code
and scaffolding that gets removed at the end...

> If you want to look ahead far enough, you can save the line number if you
> don't read a reserved word, peek at the next few characters to see if you
> get '()', and build a simple command using the saved line number. Yacc/
> Bison don't let you look that far ahead.

Last time I looked up youtube clips for the princess bride "once his HEAD is in
range HIT IT WITH THE ROCK", and winnie the pooh "bear of very little brain",
but I haven't got the spoons to do that again.

tl;dr: _I_ don't want to, no. That sounds way too much like work.

>>>>>>>> I currently have no IDEA what "sh --help" should look like when I'm done, 
>>>>>>>
>>>>>>> I'm pretty sure bash --help complies with whatever GNU coding standards
>>>>>>> cover that option.
>>>>>>
>>>>>> Currently 2/3 of bash --help lists the longopts, one per line, without saying
>>>>>> what they do. So yeah, that sounds like the GNU coding standards.
>>>
>>> Oh, please. It doesn't describe what each single-character option does,
>>> either. That's a job for a man page or texinfo manual.
>>
>> Then why are they one per line?
> 
> Because it's not worth the effort to space them across the screen.

I polish documentation fairly obsessively. My users greatly outnumber me.

>>>> Except you've got some parsing subtlety in there I don't, namely:
>>>>
>>>>   $ bash -hc 'echo $0' --norc
>>>>   --norc
>>>>
>>>>   $ bash -h --norc -c 'echo $0'
>>>>   bash: --: invalid option
>>>
>>> "Bash also  interprets  a number of multi-character options.  These op-
>>>  tions must appear on the command line before the  single-character  op-
>>>  tions to be recognized."
>>>
>>> Bash has always behaved this way, back to the pre-release alpha and beta
>>> versions, and I've never been inclined to change it.
>>
>> Indeed. Unfortunately for _my_ code to do that it would have to get
>> significantly bigger, because I'd need to stop using the generic command line
>> option parsing and pass them through to sh_main() to do it myself there. (Or add
>> intrusive special cases to the generic parsing codepath.)
> 
> You can probably get away with it as long as that option parsing code stops
> at the first word that doesn't begin with `-'.

That's literally one character (a "^" at the start of the option string in the
middle argument of the NEWTOY() macro).

Although it takes a second character to make -c do it, which I think I also need.
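
(Sanity check that the quoted rule works the other way around:

  $ bash --norc -hc 'echo $0'
  bash

Long options first, then the short ones; with nothing after the -c string, $0
falls back to the shell name.)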

>> Documenting this as a deviance from <strike>posix</strike> the bash man page
>> seems the better call in this instance. 
> 
> Documenting what as a deviation? POSIX doesn't do long options; you can do
> whatever you like with them.

My shell standard isn't posix; the standard I'm trying to implement is the bash
man page. Posix can go hang here until Jorg Schilling dies of old age as far as
I'm concerned.

>> Wheee.
>>
>> In my case "how options work in all the other toybox commands" puts a heavy
>> weight on one side of the scales. (Not insurmountable, but even the exceptions
>> should have patterns.)
> 
> The Bourne shell option parsing long predates modernities like getopt(), so
> the basic rule is "scan for words starting with `-' or `+', parse them as
> binary flag options, handling `--' in some reasonable way to end option
> parsing, then grab what you need from the argument list (the command for
> -c), and use everything else to set the positional parameters. Oh, and use
> the same code for `set', so you have to reject the options that are only
> valid at invocation.

Even back under busybox I rewrote their mount command something like 5 times
before I was happy with it. As for toybox, I REALLY should have called this
project "dorodango"...

The reason for all these regression tests is so I know when I've broken
backwards compatibility, and can fix it. (Back on busybox I instead just ran
gobs of real world data through it by using the busybox command line utilities
to run an automated Linux From Scratch build, ala the "build control images" tab
on the left of https://landley.net/aboriginal/about.html . My "I broke backwards
compatibility" was usually some variant of autoconf making a different config
decision halfway through compiling ncurses or perl or something, so the diff
from last time had deviations I needed to fix...)

>> sed and sort had both but treated the man page as an afterthought. Many of their
>> gnu extensions were ONLY documented in the info page when I was writing new
>> implementations for busybox back in the day. (No idea about now, haven't looked
>> recently. 
> 
> OK. The bash man page and texinfo manual have the same content.

Oh good.

>> These days I handle that sort of thing by waiting for somebody to
>> complain. That way I only add missing features somebody somewhere actually _uses_.)
> 
> It has to be a lot more than one person.

Yeah, but if I'm on the fence about it to begin with it only takes one person to
confirm "yeah, that's actually used".

Also, Elliott speaks for the Android userbase. They ship a billion devices
annually. When he tells me he needs a thing, it carries some weight. (We argue
about how and where, but "what" is generally a foregone conclusion.)

>> For toysh, I've taken a hybrid approach. I'm _reading_ every man page corner
>> case and trying to evaluate it: for example /dev/fd can be a filesystem symlink
>> to /proc/self/fd so isn't toysh's problem, but I'm making <(blah) resolve to
>> /proc/self/fd/%d so it doesn't _require_ you to have the symlink. 
> 
> Yeah, you only have to worry about linux.

Yes and no. There's bsd and macos support now, and post-1.0 I might put some of
my own effort into expanding the BSD side of it. (MacOS is a cross between
"BSD's downstream" and "a weird proprietary mega-corporation that can sign a
check if it wants me to care", but Elliott has users who build AOSP on MacOS and
borrows a laptop to do fixups there every six months or so, and sends me patches.)

Ok, back up: my old aboriginal linux project (linked above) was my attempt to
create the simplest Linux system that could rebuild itself under itself from
source, then build Linux From Scratch under the result. I got it down to 7
packages (busybox, uclibc, linux, gcc, binutils, make, bash). There's newer
stuff with fewer packages (current tip of tree is scripts/mkroot.sh in toybox,
which is 250 lines of bash building a linux system that boots to a shell prompt
under qemu for a dozen different hardware architectures) but that's a tangent.

The first thing aboriginal linux did was build an "airlock step" to isolate the
new system from idiosyncrasies in the host, loosely modeled on the old Linux
From Scratch chapter 5 "temporary system" you would chroot into to build the
real system under:

http://archive.linuxfromscratch.org/lfs-museum/5.0/LFS-BOOK-5.0-HTML/chapter05/introduction.html

My builds run entirely as a normal user (so no chroot, because no root access)
which meant my airlock step populated a directory with a host busybox and
symlinks to the compiler binaries it needed from the host, so it could point the
$PATH at just that ONE directory for the rest of the build, which meant package
./configure steps were less likely to find things like python installed and make
so many stupid config decisions while cross compiling.
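
The mechanics are just a few lines of shell. Here's a from-memory sketch, not
the actual aboriginal build code (the $AIRLOCK path and the tool list are made
up for illustration):

  AIRLOCK=/tmp/airlock
  mkdir -p "$AIRLOCK" || exit 1
  busybox --install -s "$AIRLOCK"          # symlink the busybox applets
  for i in gcc ld as ar nm strip           # borrow just these host binaries
  do
    ln -sf "$(type -p "$i")" "$AIRLOCK/$i" || exit 1
  done
  export PATH="$AIRLOCK"                   # the ONE directory in the $PATH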

(I once wrote a document that I WANTED to call "Why cross compiling sucks" but
that wasn't "professional": http://landley.net/writing/docs/cross-compiling.html )

I also did a giant 260 slide presentation about all of this back in the day:
https://speakerdeck.com/landley/developing-for-non-x86-targets-using-qemu

The android guys call their version of this concept a "hermetic build", meaning
the build is hermetically sealed and provides all its own prerequisites. The way
they do it isn't via building known binaries on the local system to go through
an airlock step, instead they provide their own prebuilt binaries and run the
build using those. The toybox 0.8.1 release notes had some links about that
right near the start:

  http://landley.net/toybox/#21-05-2019

Android doesn't care about supporting AOSP builds on FreeBSD, but they _do_ care
about supporting it on macos:

  https://source.android.com/setup/build/initializing

And that means shipping toybox binaries for mac as part of the AOSP hermetic
build. Which need to work, hence "make macos_defconfig" and the GSED= plumbing
in scripts/ and so on.

So technically, I have to worry about Linux, Android, and Mac. (And Android is
built with llvm, uses bionic as its libc, and is sprayed down with enough
selinux rules to be its own beast in places...)

>>> I abandoned the -o namespace to POSIX a
>>> long time ago, and there is still an effort to standardize pipefail as
>>> `-o pipefail', so I'm leaving it there. I originally made it a -o option so
>>> we could try and standardize it, and that reasoning is still relevant.
>>
>> It's a pity posix is moribund.
> 
> It's not dead, just slow.

Give him time.

> https://www.austingroupbugs.net/view.php?id=789
> 
> So we started talking about this in some official proposed way in 2013,
> continued sporadically until 2018, decided on some official text to add
> to the standard in September, 2018, and it will be in the next major
> revision of Posix, issue 8.

There's going to be an issue 8? Posix has been replacing issue 7 in place doing
the "continuous integration" thing (grab random snapshot du jour from the
website and call it good, no two people ever experience quite the same version)
for 12 years now.

Toybox has quarterly releases because I found Martin Michlmayr's "Release
Management in Large Free Software Projects" talk compelling:

  https://www.youtube.com/watch?v=IKsQsxubuAA

  ABSTRACT: Time based releases are made according to a specific time interval,
  instead of making a release when a particular functionality or set of features
  have been implemented. This talk argues that time based release management
  acts as an effective coordination mechanism in large volunteer projects and
  shows examples from seven projects that have moved to time based releases:
  Debian, GCC, GNOME, Linux, OpenOffice, Plone, and X.org.

(I ranted a lot more here last time in the email that got lost. Probably a good
thing.)

>> I mentioned the fact pipefail had been picked up
>> by multiple other shells in the toybox talk I did 7 years ago:
> 
> Bash wasn't the first shell to have it. I didn't add it until 2003, after
> a discussion with David Korn (the aforementioned Posix standardization
> effort).

I only started using bash in 1998. :)

>>>> ----------
>>>> Usage: sh [--LONG] [-ilrsD] [-abefhkmnptuvxBCHP] [-c CMD] [-O OPT] [SCRIPT] ...
> 
> You're missing `-o option'.

When I wrote that, I hadn't figured out that "set" and "shopt" were different
commands with different namespaces.

>>>> Do you really need to document --help in the --help text? 
>>>
>>> Why not? It's one of the valid long options.
>>
>> Lack of space. I was trying to squeeze one less line out of the output. :)
> 
> Believe me, if it were not there, someone would complain about its absence.
> There are a few things that remain undocumented, and I get crap about them
> as regular as clockwork.

Of course you know your userbase better than I do, but toybox's policy on
"compatibility features" like patch -g where we accept the option and ignore it
(sometimes because "this is specifying something that's our default behavior
anyway") is to NOT document them. Our help text is intended for somebody
figuring out what command line to feed to toybox to make it do a thing.

Also, all toybox commands (except ones like true that explicitly disable it)
support --help and --version, so they're documented in "help toybox" rather than
in the individual commands. (Also - as a synonym for stdin, that -- stops option
parsing, and so on.)

So we have some precedent for NOT documenting certain things. And "you had to
type --help to see this" would go in that bucket for me. :)

There's a "bang for the byte" principle that also applies to documentation. End
users only have so much bandwidth to read it.

Of course toybox cheats: if they really want to know every possible thing, the
implementation should be small and simple enough to easily dig through. Our
mount.c is 408 lines long, patch.c is 483 lines, find.c is 710 lines... And
find.c isn't _really_ that big: 67 lines of find.c are // comments, the first 63
lines are the big /* header block comment */ with the help text in it, and there
are 107 blank lines, for a total of less than 500 lines of actual command
implementation...

*shrug* That's the theory, anyway...

>>>> The bash man page does
>>>> not include the string "--debug" (it has --debugger but not --debug), 
>>>
>>> It's just shorthand for the benefit of bashdb.
>>
>>   $ help bashdb
>>   bash: help: no help topics match `bashdb'.  Try `help help' or `man -k bashdb'
>>   or `info bashdb'.
>>   $ man -k bashdb
>>   bashdb: nothing appropriate.
>>   $ man bash | grep bashdb
>>   $
>>
>> google... huh, it's a sourceforge package.
> 
> It's the source of most of the bash debugging support. Rocky started out
> distributing patches, but I folded most of the features into the mainline
> source. It's a nifty little bit of work.

When helping port Linux to the hexagon processor back in 2010, I stuck print
statements into the uclibc dynamic loader to debug its relocation of its own
symbols. (It was a macro that expanded to a write syscall.) I could not call any
functions or use any globals, and had to put the data in stack char arrays
initialized by assigning one element at a time. (I made a macro to convert
integers to hex so I could output them.)

When I added memory tests between uboot's DRAM init and relocating itself from
flash to DRAM, I stuck print statements into uboot that were a bit like
https://balau82.wordpress.com/2010/02/28/hello-world-for-bare-metal-arm-using-qemu/
except the loop had an if (FLAG&*register) to check the "can write next byte out
now" bit (which QEMU doesn't need but real hardware does), and the tricky part
is that because of the relocation all the string constants were linked at the
address they would be relocated TO, not where they were in flash, so I had to
work out an UNRELOCATE constant to subtract from the string constants, ala
emit("string"-UNRELOCATE). (The functions were PIE that would -o binary into a
static blob or some such, but it couldn't find rodata without serious handholding.)

I stick printk(KERN_ERR) into the kernel all the time.

I started getting into VHDL when I figured out you could stick print statements
(um, "alert"?) into the simulator builds.

I don't think I'm the target audience for this feature, is what I'm saying. I
started out on a commodore 64 (38911 basic bytes free) and it left a MARK.

>> I'm not sure how I'd have divined the existence of the sourceforge package from
>> the --debug option in the help output (which didn't make an obvious behavior
>> difference when I tried it), but I often miss things...
> 
> The debugging support exists independently of bashdb, and can be used
> without it. Bashdb is just the biggest customer, and the origin of the
> features. The `debugger profile' in the documentation is the bashdb
> driver script. Try running `bash --debugger' sometime, you might like it.
> Assuming, of course, your vendor has installed bashdb in the right place.

Ok. Good luck with it.

>> -D	Display all the $"translatable" strings in a script.
>>
>> Oh right, I remember reading about $"" existing and going "that's weird, out of
>> scope" and moving on. Because I did _not_ understand how:
>>
>>        A double-quoted string preceded by a dollar sign ($"string") will cause
>>        the string to be translated according to the current  locale.   If  the
>>        current  locale  is  C  or  POSIX,  the dollar sign is ignored.  If the
>>        string is translated and replaced, the replacement is double-quoted.
>>
>> was supposed to work. (HOW is it translated? Bash calls out to
>> translate.google.com to convert english to japanese? Is there a thing humans can
>> do to supply an external translation file? Is it just converting dates and
>> currency markers and number grouping commas?)
> 
> You have a message catalog, install the right files, and use the gnu
> gettext infrastructure to get translated versions of the strings you
> mark. It's a shell script way of doing what bash does internally for its
> own messages. Very little-used.

Possibly very little-used because the documentation assumes you already know
what it is, how to use it, and why you'd want to? Which I didn't, but then I'm
VERY good at not understanding things, and I break everything, so not
necessarily representative...

  If your current locale setting has an appropriate gettext database installed,
  $"strings" get looked up and replaced with translated versions, otherwise
  they act like normal double quoted strings. Also, "bash -D SCRIPT" will show
  you all the $"" strings in a SCRIPT so translators can make a gettext database
  for a new $LANG.
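
The translator side is the standard gettext workflow. Something like this
should work (an untested sketch: the "hello" domain name and paths are made up,
the exact -D output format is from memory, and the fr_FR.UTF-8 locale has to
actually be installed):

  $ echo 'echo $"Hello"' > hello.sh
  $ bash -D hello.sh           # lists the translatable strings, runs nothing
  "Hello"
  $ mkdir -p fr/LC_MESSAGES
  $ printf 'msgid "Hello"\nmsgstr "Bonjour"\n' > hello.po
  $ msgfmt -o fr/LC_MESSAGES/hello.mo hello.po
  $ TEXTDOMAIN=hello TEXTDOMAINDIR="$PWD" LANG=fr_FR.UTF-8 bash hello.sh
  Bonjour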

>> Ah, gettext. That would explain why I don't know about it. I always used
>> http://penma.de/code/gettext-stub/ in my automated Linux From Scratch test
>> builds because it's one of those gnu-isms like info and libtool.
> 
> It will be in Posix one day.

I plead the third. (It's the one about quartering troops, it doesn't get enough
attention so I do what I can.)

> Chet

Rob


