[Toybox] Would someone please explain what bash is doing here?

Rob Landley rob at landley.net
Sun May 10 16:24:37 PDT 2020


On 5/10/20 12:13 PM, Chet Ramey wrote:
>> Somebody on patreon. (I also stopped patreoning.)
> 
> I read the comment. It's too bad that donations come with those kinds of
> strings attached.
Eh, the new part is just me not handling things well right now. I'm at stress
capacity and cannot currently handle additional sources of stress, even when
they're "normal" stress.

I rewatched Neil Gaiman's "make good art" speech
(https://www.youtube.com/watch?v=ikAb-NYkseI and he's apparently having his own
quarantine stress issues right now), which helped a bit. I gotta plug my ears
and go "lalala" and follow Sam Vimes' advice to "do the job that's in front of
you"...

>>   $ echo \
>>   > $LINENO
>>   2
>>
>>   $ echo $LINENO \
>>   $LINENO
>>   1 1
> 
> Let's look at these two. This is one of the consequences of using a parser
> generator, which builds commands from the bottom up (This Is The House That
> yacc Built).

I've never understood the point of yacc. I maintained a tinycc fork for 3 years
because it was a compiler built in a way that made sense to me.

> You set $LINENO at execution time based on a saved line number in the
> command struct, and that line number gets set when the parser knows that
> it's parsing a simple command and begins to construct a struct describing
> it for later execution.
> 
> In all cases, the shell reads a line at a time from wherever it's reading
> input. In the first case, it reads
> 
> "echo \"
> 
> and starts handing tokens to the parser. After getting `echo', the parser
> doesn't have enough input to determine whether or not the word begins a
> simple command or a function definition,

The tricky bit is that "echo hello; if" does not print hello before prompting
for the next line, and "while read i; do echo $i; done" resolves $i differently
each time through the loop, which said to _me_ that the parsing order of
operations is

A) keep parsing lines of data until you do NOT need another line.

B) then do what the lines say to do.
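
Both halves of that are observable from outside (a minimal sketch, plain POSIX
shell): the loop body is parsed once up front, but $i is expanded fresh each
time the pre-parsed body runs.

```shell
# The loop body is parsed before any iteration runs; $i is expanded at
# execution time, so each pass through the pre-parsed body sees a
# different value.
loop_out=$(printf 'one\ntwo\n' | while read i; do echo "got: $i"; done)
echo "$loop_out"
```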

I have parse_word() that finds the end of the next word (returning NULL if we
need another line to finish a quote, with a trailing \ counting as a quote),
and parse_line() that adds words to struct sh_arg {int c; char **v;} in a
linked list of struct sh_pipeline in a struct sh_function. When parse_line()
does NOT return a request for another line, the caller can run_function() and
then free_function().
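
As a concrete example of "need another line to finish a quote" (a sketch using
sh -c to hand the parser the finished two-line script all at once): an open
double quote at end of line means more input is required, and given both lines
the text reduces to a single command whose argument spans the newline.

```shell
# The open quote on line 1 forces a continuation; the two lines parse
# as ONE echo whose argument contains the embedded newline.
two_lines=$(sh -c 'echo "hello
world"')
echo "$two_lines"
```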

parse_line() gets the list of argv/argc pairs, each of which ends with the
control characters that terminated the command, which are:

      // Flow control characters that end pipeline segments
      s = end + anystart(end, (char *[]){";;&", ";;", ";&", ";", "||",
        "|&", "|", "&&", "&", "(", ")", 0});

Since parse_line() _also_ has to handle the do/for/in stuff to see when _those_
line continuations are required, it annotates each command line with a "type"
(0 for executable statement, 1 for start of block ala "if", 2 for block
gearshift ala "then", 3 for end of block ala "fi", and then a few other types
like 'f' for function and 's' for the non-executable loop payload bit, i.e. the
STUFF part of "for i in STUFF STUFF STUFF").

Then run_function() takes this pre-parsed stuff and traverses the block
structure, which has to match up or parse_line() would have barfed on it while
working out line continuations.
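
That matching-up is also externally visible (a sketch, any POSIX shell): a
block whose keywords don't pair up is rejected while the continuations are
being worked out, before anything executes at all.

```shell
# "done" without a matching "do" is a parse-time error: the echo never
# runs, the shell just reports a syntax error and exits nonzero.
mismatch=$(sh -c 'if false; then echo hi; done' 2>/dev/null || echo "parse error")
echo "$mismatch"
```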

I don't try to evaluate variable contents until after all that's done, which
means right NOW my $LINENO isn't particularly accurate. (I need to add a lineno
field to struct sh_pipeline next to the ->type initialized from the global
counter, but haven't done it yet. Hence asking these questions...)

> and goes back for more input. The
> lexer sees there are no tokens left on its current input line, notes that
> line ends in backslash and reads another line, incrementing the line
> number, throws away the newline because the previous line ended in
> backslash,

I _was_ throwing away the newline, but I stopped because of this. Now I'm
keeping it but treating it as whitespace like spaces and tabs, but that's wrong:

  $ echo ABC\
  > DEF
  ABCDEF

Sigh, test added. Back to throwing it out...
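
The rule as I now understand it (a sketch; any POSIX shell): outside quotes the
backslash-newline pair is deleted outright, joining the fragments, while inside
single quotes both characters survive as literals.

```shell
# Unquoted: backslash-newline is removed entirely, joining the words.
joined=$(echo ABC\
DEF)
echo "$joined"    # ABCDEF

# Single-quoted: the backslash AND the newline are both kept literally.
kept=$(echo 'ABC\
DEF')
echo "$kept"
```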

> and returns $LINENO. The parser finally has enough input to
> reduce to a simple command, and builds one, with the line number set to 2.
Ok, so line number is _ending_ line, not starting line. (Continuations increment
the LINENO counter _before_ recording it for the whole span.)

I think that's the answer to my question. :)
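
Double-checking that against a script file (a sketch, assuming bash is
installed; the mktemp temp file is just scaffolding): the continuation bumps
the counter before the command struct is built, so $LINENO reports the line
where the command ends.

```shell
# Line 1 of the script is "echo \" and line 2 is "$LINENO"; the
# backslash-newline increments the counter before LINENO is recorded.
f=$(mktemp)
printf 'echo \\\n$LINENO\n' > "$f"
line=$(bash "$f")
rm -f "$f"
echo "$line"   # 2
```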

> In the second case, the lexer reads the complete line "echo $LINENO \"
> and starts handing back the tokens. The parser has enough tokens to reduce
> to a simple command before it goes back for more input to complete it,
> and the lexer processes the backslash-newline. The line number is set to 1,
> or at least not incremented, when the parser begins to build the simple
> command struct.
> 
> There's some variation in this area, by the way, but everyone agrees on
> the basics: the line number gets incremented when you process the
> backslash-newline. If you use a recursive-descent parser you have a little
> more flexibility with this case and several others.

On multiple occasions I've fallen into the trap of "./sh TEST" cursor up strip
the ./ off and "sh TEST" is behaving INSANELY how could bash possibly be getting
that wrong... oh, /bin/sh is the Defective Annoying SHell in devuan. (Which I
haven't fixed locally because I don't want to require other people to do that, I
need to make sure that every script points to #!/bin/bash instead of #!/bin/sh.)

>>>>>> I currently have no IDEA what "sh --help" should look like when I'm done, 
>>>>>
>>>>> I'm pretty sure bash --help complies with whatever GNU coding standards
>>>>> cover that option.
>>>>
>>>> Currently 2/3 of bash --help lists the longopts, one per line, without saying
>>>> what they do. So yeah, that sounds like the GNU coding standards.
> 
> Oh, please. It doesn't describe what each single-character option does,
> either. That's a job for a man page or texinfo manual.

Then why are they one per line?

(I understand "not spending the space", my objection was to spending the screen
real estate and not _using_ it...)

>> something like:
>>
>> ---
>>
>> Usage: bash [-ilrsDabefhkmnptuvxBCHP] [-c COMMAND] [-O SHOPT] [SCRIPT_FILE] ...
>>
>> Long options:
>> 	--debug --debugger --dump-po-strings --dump-strings --help --init-file
>> 	--login --noediting --noprofile --norc --posix --rcfile --restricted
>> 	--verbose --version
>>
>> For -O SHOPT list 'bash -c "help set"', for more information 'bash -c help'
>> or visit https://www.gnu.org/software/bash or run "man 1 bash".
> 
> That's certainly an acceptable way to present it.

This is the kind of exercise I go through to come up with toybox help text, I
don't expect bash to change I'm just saying this is how I use the space.

>> ---
>>
>> Except you've got some parsing subtlety in there I don't, namely:
>>
>>   $ bash -hc 'echo $0' --norc
>>   --norc
>>
>>   $ bash -h --norc -c 'echo $0'
>>   bash: --: invalid option
> 
> "Bash also  interprets  a number of multi-character options.  These op-
>  tions must appear on the command line before the  single-character  op-
>  tions to be recognized."
> 
> Bash has always behaved this way, back to the pre-release alpha and beta
> versions, and I've never been inclined to change it.

Indeed. Unfortunately for _my_ code to do that it would have to get
significantly bigger, because I'd need to stop using the generic command line
option parsing and pass them through to sh_main() to do it myself there. (Or add
intrusive special cases to the generic parsing codepath.)

Documenting this as a deviation from <strike>posix</strike> the bash man page
seems the better call in this instance. If -c is a "stop option parsing early"
marker, then command lines accepted by bash should also be accepted by toysh,
and that's close enough.
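
For reference, the ordering rule is easy to demonstrate (assuming bash): a long
option that comes after the single-character cluster isn't an option at all
anymore, it's just a positional argument, here $0 of the -c script.

```shell
# Multi-character options are only recognized BEFORE the
# single-character ones; --norc after -hc lands in $0 instead.
arg0=$(bash -hc 'echo $0' --norc)
echo "$arg0"   # --norc
```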
>> And some of this is just never going to parse the same way:
>>
>>   $ bash -cs 'echo $0'
>>   bash
> 
> This is ambiguous, but not in the way you expect. The thing that differs
> between shells is whether or not they read input from stdin (because of
> the -s option) after executing the `echo $0'. POSIX specifies them as
> separate cases, so nobody should expect anything in particular when they
> are combined. The ash-derived shells start reading from standard input,
> bash and the ksh-like shells exit after executing the echo, and yash
> rejects the option combination entirely.

Wheee.

In my case "how options work in all the other toybox commands" puts a heavy
weight on one side of the scales. (Not insurmountable, but even the exceptions
should have patterns.)

>> But again, you have to conform to the gnu style guidelines, which I thought
>> means you'd have a texinfo page instead of a man page?
> 
> I have both.

sed and sort had both but treated the man page as an afterthought. Many of their
gnu extensions were ONLY documented in the info page when I was writing new
implementations for busybox back in the day. (No idea about now, haven't looked
recently. These days I handle that sort of thing by waiting for somebody to
complain. That way I only add missing features somebody somewhere actually _uses_.)

For toysh, I've taken a hybrid approach. I'm _reading_ every man page corner
case and trying to evaluate it: for example /dev/fd can be a filesystem symlink
to /proc/self/fd so isn't toysh's problem, but I'm making <(blah) resolve to
/proc/self/fd/%d so it doesn't _require_ you to have the symlink. (Yeah,
/dev/tcp/ still needs special-case code in the shell; it's on the TODO list.)

>> Also, I dunno why -O blah
>> is a separate namespace from "bash --pipefail", 
> 
> I assume you mean `-o pipefail'.

I had not, when I wrote that, noticed that "set" and "shopt" were different
commands. Which means "bash -H" is documented under "help set", "bash -c" is
documented under "man bash", and -O (the shopt options) has its own namespace.
Hence:

  $ bash -O pipefail
  bash: pipefail: invalid shell option name

vs:

  $ bash -o pipefail
  landley at driftwood:~/toybox/toybox$ false | echo hello
  hello
  landley at driftwood:~/toybox/toybox$ echo $?
  1
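
In other words (a sketch, assuming bash): with pipefail set, the pipeline's
exit status becomes the rightmost nonzero status of any member, instead of just
whatever the final command returned.

```shell
# Without pipefail, "false | echo" succeeds because echo does; with it,
# the failing false poisons the pipeline's overall exit status.
status=$(bash -c 'set -o pipefail; false | echo hello >/dev/null; echo $?')
echo "$status"   # 1
```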

> I abandoned the -o namespace to POSIX a
> long time ago, and there is still an effort to standardize pipefail as
> `-o pipefail', so I'm leaving it there. I originally made it a -o option so
> we could try and standardize it, and that reasoning is still relevant.

It's a pity posix is moribund. I mentioned the fact pipefail had been picked up
by multiple other shells in the toybox talk I did 7 years ago:

  https://landley.net/talks/celf-2013.txt

But I've used bash as my main command line since 1998 and I'm still learning
weird corner cases as I do this. (And that's having sat down to try to read all
6000 lines of the bash man page on more than one occasion, and having printed
out the "advanced bash scripting guide" in a three ring binder back in 2007 and
brought it on my daily bus commute for a while.)

>> ----------
>> Usage: sh [--LONG] [-ilrsD] [-abefhkmnptuvxBCHP] [-c CMD] [-O OPT] [SCRIPT] ...
>>
>> -c	Run CMD then exit (with -s continue reading from stdin)
> 
> You can, of course, do anything you want with this and remain POSIX
> conformant.
> 
> 
>> Do you really need to document --help in the --help text? 
> 
> Why not? It's one of the valid long options.

Lack of space. I was trying to squeeze one less line out of the output. :)

In the specific case of toybox, every command (except false, test, and true)
supports --help and --version, so that's in "help toybox" rather than in the
individual commands anyway.

Ok, sed doesn't use the generic --help and --version parsing, it has
TOYFLAG_NOHELP and then handles it itself, but that's so it can pull the:

https://unix.stackexchange.com/questions/16350/which-sed-version-is-not-gnu-sed-4-0

trick, because:

http://lists.busybox.net/pipermail/busybox/2004-January/044642.html

(Except toybox says "this is not gnu sed _9.0_" because time marches on... :)

> The bash man page does
>> not include the string "--debug" (it has --debugger but not --debug), 
> 
> It's just shorthand for the benefit of bashdb.

  $ help bashdb
  bash: help: no help topics match `bashdb'.  Try `help help' or `man -k bashdb'
  or `info bashdb'.
  $ man -k bashdb
  bashdb: nothing appropriate.
  $ man bash | grep bashdb
  $

google... huh, it's a sourceforge package.

I'm not sure how I'd have divined the existence of the sourceforge package from
the --debug option in the help output (which didn't make an obvious behavior
difference when I tried it), but I often miss things...

>> --dump-strings is -D which again:
>>
>> $ bash --dump-strings
>> bash-4.4$ help
>> bash-4.4$ echo hello> bash-4.4$ exit
>> bash-4.4$ break
>> bash-4.4$ stop
>> bash-4.4$ ^C
>> bash-4.4$ ^C
>> bash-4.4$ ^C
>> bash-4.4$
> 
> What point are you trying to make here? There aren't any translatable
> strings using the $"" notation to write to standard output,

I was trying to figure out, from the help text and the man page, what was being
documented.

> and the
> documentation for -D clearly says it implies -n. Should it not print the
> prompt? Should it scold the user for running it in interactive mode? I
> will admit that it's never been used as widely as I thought it might be,
> but it works as advertised.

I couldn't figure out from the man page what it was supposed to do or how to use
it. (Could easily be a "me" problem, of course.) Trying again...

Ah, you're saying it prints a list of the strings _IN_A_SCRIPT_. I thought it
meant there were strings built into bash or something. It's an option to find
all the translatable strings in a script, where translatable is indicated via
$"" around the string, so:

-D	Display all the $"translatable" strings in a script.

Oh right, I remember reading about $"" existing and going "that's weird, out of
scope" and moving on. Because I did _not_ understand how:

       A double-quoted string preceded by a dollar sign ($"string") will cause
       the string to be translated according to the current  locale.   If  the
       current  locale  is  C  or  POSIX,  the dollar sign is ignored.  If the
       string is translated and replaced, the replacement is double-quoted.

was supposed to work. (HOW is it translated? Bash calls out to
translate.google.com to convert english to japanese? Is there a thing humans can
do to supply an external translation file? Is it just converting dates and
currency markers and number grouping commas?)

Searching for other occurrences of "translat" in the bash man page...

       LC_MESSAGES
              This variable determines the locale used  to  translate  double-
              quoted strings preceded by a $.

used to translate by what? Google for LC_MESSAGES and the first hit is:

https://www.gnu.org/software/gettext/manual/html_node/Locale-Environment-Variables.html

Ah, gettext. That would explain why I don't know about it. I always used
http://penma.de/code/gettext-stub/ in my automated Linux From Scratch test
builds because it's one of those gnu-isms like info and libtool.

       A double-quoted string preceded by a dollar sign ($"string") gets
       translated via the current locale (see LC_MESSAGES) when a gettext
       database is present. Otherwise treated as a double quoted "string".
       Use "bash -D SCRIPT" to see all translatable $"strings" in a SCRIPT.
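
A quick sketch of using it (assuming bash; the mktemp temp file is
illustrative): -D implies -n, so the script is parsed but not run, and each
$"" string gets written to stdout.

```shell
# Only the $"translatable" string should appear in the dump; the plain
# echo argument is not translatable and is skipped.
f=$(mktemp)
printf 'echo $"hello world"\necho plain\n' > "$f"
strings_out=$(bash -D "$f")
rm -f "$f"
echo "$strings_out"
```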

>> P.S. --posix isn't -p, that's "privileged" mode which is not the same as
>> restricted mode and I'm walking away from the keyboard for a bit now.
> 
> Yeah, -p was already used when I implemented posix mode, so I went with
> `-o posix'. `--posix' is just more notational shorthand.
> 
>>
>> P.P.S. the man page has --init-file but the --help output doesn't.
> 
> Incorrect.
> 
> $ ./bash --help | grep init
> 	--init-file

Must have accidentally deleted it shuffling things around, sorry. (See "stepping
away from the keyboard", above.)

Sigh, this email is too long. I should go back to blogging.

> Chet

Rob

