[Toybox] Would someone please explain what bash is doing here?

Wed May 27 11:22:42 PDT 2020

On Sun, May 17, 2020 at 4:04 AM Rob Landley <rob at landley.net> wrote:
>
> I had a reply window open to this when my laptop battery died, and thunderbird
> doesn't store unfinished messages like kmail and vi and chrome...
>
> Anyway, I was reminded of this thread by:
>
>   $ IFS=x; ABC=cxd; for i in +($ABC); do echo =$i=; done
>   =+(c=
>   =d)=
>   $ bash -c 'IFS=x; ABC=cxd; for i in +($ABC); do echo =$i=; done'
>   bash: -c: line 0: syntax error near unexpected token `('
>   bash: -c: line 0: `IFS=x; ABC=cxd; for i in +($ABC); do echo =$i=; done'
>   $ readlink -f /proc/$$/exe
>   /bin/bash
>
> (I tried inserting shopt -o extglob; and shopt +o extglob; at the start of the
> -c and it didn't change anything?)
>
> Yes, I'm still trying to work out wildcard parsing behavior. :)
>
> Here's a second attempt at replying to the rest, sorry if it's duplicative but I
> don't remember what I already said because of the email that wasn't sent, and
> because of the emails that were. :)
>
> On 5/11/20 3:55 PM, Chet Ramey wrote:
> > On 5/10/20 7:24 PM, Rob Landley wrote:
> >
> >>>>   $ echo \
> >>>>   > $LINENO
> >>>>   2
> >>>>
> >>>>   $ echo $LINENO \
> >>>>   $LINENO
> >>>>   1 1
> >>>
> >>> Let's look at these two. This is one of the consequences of using a parser
> >>> generator, which builds commands from the bottom up (This Is The House That
> >>> yacc Built).
> >>
> >> I've never understood the point of yacc. I maintained a tinycc fork for 3 years
> >> because it was a compiler built in a way that made sense to me.
> >
> > It saves you having to write a lot of code if you have a good grammar to
> > work from. One of the greatest achievements of the original POSIX working
> > group was to create a grammar for the shell language that was close to
> > being implementable with a generator.
>
> Generated code exists and has its costs whether or not you had to manually write
> it. There's a classic paper/blog rant about that (specifically its impact on the
> Java language):
>
>   http://steve-yegge.blogspot.com/2007/12/codes-worst-enemy.html
>
> I maintained a tinycc fork for 3 years (https://landley.net/hg/tinycc) after
> Fabrice Bellard abandoned it because its design made sense to me, and the result
> was a 100k self-contained compiler binary that built a working linux kernel in a
> single digit number of seconds on 15 year old hardware
> (https://bellard.org/tcc/tccboot.html), and it did this by _not_ using yacc. I
> keep hoping somebody will steal the https://landley.net/code/qcc idea and do it
> so I don't have to.
>
> Back before Eric Raymond went crazy, he and I were working together on a paper
> he wanted to call "why C++ is not my favorite language", which was about local
> peaks in language design space, the difference between static and dynamic
> languages, and the downsides of the no man's land in between the two local peaks.
>
> After we stopped being able to work together, I did a 3 part writeup of my
> understanding of what the paper would have been about:
>
>   https://landley.net/notes-2011.html#16-03-2011
>   https://landley.net/notes-2011.html#19-03-2011
>   https://landley.net/notes-2011.html#20-03-2011
>
> Anyway, the _point_ of that is "scripting languages are a thing", and the proper
> tool for the proper job. To me code generation means "you should probably be
> using a scripting language for this".
>
> >> The tricky bit is "echo hello; if" does not print hello before prompting for the
> >> next line,
> >
> > Yes. You're parsing a command list, because that's what the `;' introduces,
> > and the rhs of that list isn't complete. A "complete_command" can only be
> > terminated by a newline list or EOF.
>
> Hence my line based parsing function that returns whether or not it needs
> another line to finish the current thought.
>
> >> and "while read i; echo $i; done" resolves $i differently every way,
> >
> > You mean every time through the loop?
>
> Yup.
>
> >> which said to _me_ that the parsing order of operations is
> >>
> >> A) keep parsing lines of data until you do NOT need another line.
> >>
> >> B) then do what the lines say to do.
> >
> > Roughly, if you mean "complete commands have been resolved with a proper
> > terminator" and "execute the commands you just parsed."
>
> The execution can be deferred arbitrarily long (you may be defining a function)
> or never (the else case of an if statement that was true), but yeah.
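
A rough shell sketch of the "keep reading lines until the thought is complete, then run it" loop discussed above. This is not toysh's parse_line(): the name run_when_complete is made up for illustration, and it cheats by asking `sh -n` whether the text collected so far parses. That means it cannot tell "incomplete" from "malformed" (a genuine syntax error just keeps accumulating), and it does not handle trailing-backslash continuations or here documents.

```shell
# Hypothetical helper: read stdin a line at a time and run each command
# only once the accumulated text parses as complete.
run_when_complete() {
  buf=''
  while IFS= read -r line; do
    # Append the new line (and its newline) to the pending text.
    buf="${buf}${line}
"
    # sh -n parses without executing; a failure is taken to mean
    # "need another line" (an assumption, see the caveats above).
    if sh -n -c "$buf" 2>/dev/null; then
      sh -c "$buf" </dev/null
      buf=''
    fi
  done
}

# "echo hello; if" alone is incomplete, so hello is not printed until
# the line containing "fi" has also been read.
printf '%s\n' 'echo hello; if' 'true; then echo yes; fi' | run_when_complete
```

Execution here happens per completed chunk; a real shell would hand the parsed structure to something like run_function() instead of re-parsing the text.
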
>
> >> I have parse_word() that finds the end of the next word (returning NULL if we
> >> need another line to finish a quote, with trailing \ counting as a quote), and
> >> parse_line() that adds words to struct sh_arg {int c; char **v;} in a linked
> >> list of struct sh_pipeline in a struct sh_function, and when parse_line() does
> >> NOT return a request for another line, the caller can run_function() and then
> >> free_function().
> >
> > Have you found that structure enough to handle, say, if-then-elif-else-fi
> > and the various flavors of the `for' command?
>
> Yup. I had to rewrite it about 5 times as I kept finding new cases I hadn't
> accounted for, but eventually got there. Starting somewhere around:
>
>   https://landley.net/notes-2019.html#02-06-2019
>
> And going through... pretty much the rest of that month?
>
> Sigh. I have "<span id=programming>" tags in my blog and I've been MEANING to
> upgrade my old python rss generator to produce multiple rss feeds (so instead of
> just rss.xml there's rss-programming.xml and so on).
>
> But looking at it, I got really lazy about tagging at times because nothing ever
> tested it. June 20 and 21 of 2019 say span id=programming when they should be
> span id=energy. And I haven't wrapped anything with id=politics in forever (I
> just left those entries unwrapped so they're NOT id=programming...)
>
> (Yes, I've always had a PLAN to let people read specific TOPICS out of my blog
> without having to deal with the rest of it. I just never got around to it, and
> somebody telling me what to do when I'm already stressed is not the way to get
> me to do it.)
>
> >>> and goes back for more input. The
> >>> lexer sees there are no tokens left on its current input line, notes that
> >>> line ends in backslash and reads another line, incrementing the line
> >>> number, throws away the newline because the previous line ended in
> >>> backslash,
> >>
> >> I _was_ throwing away the newline, but I stopped because of this. Now I'm
> >> keeping it but treating it as whitespace like spaces and tabs, but that's wrong:
> >
> > It is wrong; it needs to be removed completely.
>
> Yup, I already added that test:
>
>   shxpect 'line continuation2' I$'echo ABC\\\n' E'> ' I$'DEF\n' O$'ABCDEF\n'
>
> I.E. 'echo ABC\' as the shell's first line of input, wait for the > prompt,
> input 'DEF' as the next line, and the output should be 'ABCDEF' all on one line.
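
The same check can be approximated non-interactively, without the prompt handshake that shxpect provides. A minimal sketch, assuming a POSIX sh is on the $PATH; the variable names are illustrative only:

```shell
# Two input lines: 'echo ABC\' and 'DEF'. In the printf format, \\ becomes
# a single backslash and \n becomes a newline.
script=$(printf 'echo ABC\\\nDEF\n')

# The shell deletes the backslash-newline pair before tokenizing, so the
# two halves fuse into the single word ABCDEF.
joined=$(sh -c "$script")
echo "$joined"
```

If the newline were instead kept and treated as whitespace, the output would be "ABC DEF", which is the wrong behavior described above.
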
>
> (Of course I implemented my own expect-style plumbing in shell. How else do you
> test line continuations?)
>
> >>> and returns $LINENO. The parser finally has enough input to
> >>> reduce to a simple command, and builds one, with the line number set to 2.
> >> Ok, so line number is _ending_ line, not starting line. (Continuations increment
> >> the LINENO counter _before_ recording it for the whole span.)
> >
> > Not necessarily the ending line; a simple command can span an arbitrary
> > number of lines, but $LINENO gets set from whatever line the lexer was on
> > when the parser recognized what it had as a simple command. It can then
> > continue reading words in that simple command until it gets the unescaped
> > newline (or `;', or `&', or any of the command separators) to terminate it.
>
> My parse_line() function digests a line into a linked list of struct sh_pipeline
> which contains argc/argv pairs (one for the command and one more for each HERE
> document encountered), and an integer "type" field that's 0 for normal
> executable statement, 1 for start of block (if/while...), 2 for gearshift
> (then/do), and 3 for end of block (fi/done), plus a few others like 'f' for
> function definition and 's' for the argument line in a for loop that gets
> expanded but isn't runnable the way an if statement's argument is...
>
> Anyway, that structure needs an "int lineno" added that gets snapshot from the
> global TT.lineno, and what I've learned from all this is it gets snapshot at the
> end when we close out the sh_pipeline and start the next one, not at the
> beginning when it's allocated. (That's the observation that makes the behavior
> make sense now.)
>
> P.S. All the wildcard plumbing I've been adding has bumped me up to 2866 lines
> of sh.c. I'm trying to finish under 3500 lines total. Luckily there's debug code
> and scaffolding that gets removed at the end...
>
> > If you want to look ahead far enough, you can save the line number if you
> > don't read a reserved word, peek at the next few characters to see if you
> > get '()', and build a simple command using the saved line number. Yacc/
> > Bison don't let you look that far ahead.
>
> Last time I looked up youtube clips for the princess bride "once his HEAD is in
> range HIT IT WITH THE ROCK", and winnie the pooh "bear of very little brain",
> but I haven't got the spoons to do that again.
>
> tl;dr: _I_ don't want to, no. That sounds way too much like work.
>
> >>>>>>>> I currently have no IDEA what "sh --help" should look like when I'm done,
> >>>>>>>
> >>>>>>> I'm pretty sure bash --help complies with whatever GNU coding standards
> >>>>>>> cover that option.
> >>>>>>
> >>>>>> Currently 2/3 of bash --help lists the longopts, one per line, without saying
> >>>>>> what they do. So yeah, that sounds like the GNU coding standards.
> >>>
> >>> Oh, please. It doesn't describe what each single-character option does,
> >>> either. That's a job for a man page or texinfo manual.
> >>
> >> Then why are they one per line?
> >
> > Because it's not worth the effort to space them across the screen.
>
> I polish documentation fairly obsessively. My users greatly outnumber me.
>
> >>>> Except you've got some parsing subtlety in there I don't, namely:
> >>>>
> >>>>   $ bash -hc 'echo $0' --norc
> >>>>   --norc
> >>>>
> >>>>   $ bash -h --norc -c 'echo $0'
> >>>>   bash: --: invalid option
> >>>
> >>> "Bash also  interprets  a number of multi-character options.  These op-
> >>>  tions must appear on the command line before the  single-character  op-
> >>>  tions to be recognized."
> >>>
> >>> Bash has always behaved this way, back to the pre-release alpha and beta
> >>> versions, and I've never been inclined to change it.
> >>
> >> Indeed. Unfortunately for _my_ code to do that it would have to get
> >> significantly bigger, because I'd need to stop using the generic command line
> >> option parsing and pass them through to sh_main() to do it myself there. (Or add
> >> intrusive special cases to the generic parsing codepath.)
> >
> > You can probably get away with it as long as that option parsing code stops
> > at the first word that doesn't begin with `-'.
>
> That's literally one character ("^" at the start of the option string in the
> middle argument of the NEWTOY() macro.)
>
> Although it's a second to make -c do it, which I think I also need.
>
> >> Documenting this as a deviance from <strike>posix</strike> the bash man page
> >> seems the better call in this instance.
> >
> > Documenting what as a deviation? POSIX doesn't do long options; you can do
> > whatever you like with them.
>
> My shell standard isn't posix, the standard I'm trying to implement is the bash
> man page. Posix can go hang here until Jorg Schilling dies of old age as far as
> I'm concerned.
>
> >> Wheee.
> >>
> >> In my case "how options work in all the other toybox commands" puts a heavy
> >> weight on one side of the scales. (Not insurmountable, but even the exceptions
> >> should have patterns.)
> >
> > The Bourne shell option parsing long predates modernities like getopt(), so
> > the basic rule is "scan for words starting with `-' or `+', parse them as
> > binary flag options, handling `--' in some reasonable way to end option
> > parsing, then grab what you need from the argument list (the command for
> > -c), and use everything else to set the positional parameters. Oh, and use
> > the same code for `set', so you have to reject the options that are only
> > valid at invocation.
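
The scanning rule in the paragraph above can be sketched in a few lines of shell. This is only an illustration of the rule as stated, not bash's or toybox's code: scan_args and its output format are hypothetical, long options are simply collected with the other flag words, and nothing is validated.

```shell
# Hypothetical helper: classify an argument list the Bourne-shell way.
scan_args() {
  flags='' cmd='' want_cmd=''
  while [ "$#" -gt 0 ]; do
    case $1 in
      --)     shift; break ;;                # explicit end of options
      --?*)   flags="$flags $1"; shift ;;    # long option, kept as a word
      [-+]?*) flags="$flags $1"              # word of single-letter flags
              case $1 in -*c*) want_cmd=1 ;; esac
              shift ;;
      *)      break ;;                       # first non-option word stops the scan
    esac
  done
  # -c takes the first remaining argument as the command string.
  if [ -n "$want_cmd" ] && [ "$#" -gt 0 ]; then
    cmd=$1
    shift
  fi
  # Whatever is left would become $0 and the positional parameters.
  printf 'flags:%s\ncmd: %s\nargs: %s\n' "$flags" "$cmd" "$*"
}

scan_args -hc 'echo $0' --norc
```

With the arguments from the earlier `bash -hc 'echo $0' --norc` example, the scan stops at the command string, so --norc is left over as an ordinary argument rather than being parsed as an option.
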
>
> Even back under busybox I rewrote their mount command something like 5 times
> before I was happy with it. As for toybox, I REALLY should have called this
> project "dorodango"...
>
> The reason for all these regression tests is so I know when I've broken
> backwards compatibility, and can fix it. (Back on busybox I instead just ran
> gobs of real world data through it by using the busybox command line utilities
> to run an automated Linux From Scratch build, ala the "build control images" tab
> on the left of https://landley.net/aboriginal/about.html . My "I broke backwards
> compatibility" was usually some variant of autoconf making a different config
> decision halfway through compiling ncurses or perl or something, so the diff
> from last time had deviations I needed to fix...)
>
> >> sed and sort had both but treated the man page as an afterthought. Many of their
> >> gnu extensions were ONLY documented in the info page when I was writing new
> >> implementations for busybox back in the day. (No idea about now, haven't looked
> >> recently.
> >
> > OK. The bash man page and texinfo manual have the same content.
>
> Oh good.
>
> >> These days I handle that sort of thing by waiting for somebody to
> >> complain. That way I only add missing features somebody somewhere actually _uses_.)
> >
> > It has to be a lot more than one person.
>
> Yeah, but if I'm on the fence about it to begin with it only takes one person to
> confirm "yeah, that's actually used".
>
> Also, Elliott speaks for the Android userbase. They ship a billion devices
> annually. When he tells me he needs a thing, it carries some weight. (We argue
> about how and where, but "what" is generally a foregone conclusion.)

(i don't think i can claim to speak for the billions of users. the
thousands at OEMs/SoC vendors maybe :-) )

> >> For toysh, I've taken a hybrid approach. I'm _reading_ every man page corner
> >> case and trying to evaluate it: for example /dev/fd can be a filesystem symlink
> >> to /proc/self/fd so isn't toysh's problem, but I'm making <(blah) resolve to
> >> /proc/self/fd/%d so it doesn't _require_ you to have the symlink.
> >
> > Yeah, you only have to worry about linux.
>
> Yes and no. There's bsd and macos support now, and post-1.0 I might put some of
> my own effort into expanding the BSD side of it. (MacOS is a cross between
> "BSD's downstream" and "a weird proprietary mega-corporation that can sign a
> check if it wants me to care", but Elliott has users who build AOSP on MacOS and
> borrows a laptop to do fixups there every six months or so, and sends me patches.)

(i now have a mac laptop sitting right next to me all the time, so i
can check the mac build any time i think to. just checked. broken
again. i'll send a patch :-) --- but feel free to ping me any time you
need a mac question answered.)

> Ok, back up: my old aboriginal linux project (linked above) was my attempt to
> create the simplest Linux system that could rebuild itself under itself from
> source, then build Linux From Scratch under the result. I got it down to 7
> packages (busybox, uclibc, linux, gcc, binutils, make, bash). There's newer
> stuff with fewer packages (current tip of tree is scripts/mkroot.sh in toybox,
> which is 250 lines of bash building a linux system that boots to a shell prompt
> under qemu for a dozen different hardware architectures) but that's a tangent.
>
> The first thing aboriginal linux did was build an "airlock step" to isolate the
> new system from idiosyncrasies in the host, loosely modeled on the old Linux
> From Scratch chapter 5 "temporary system" you would chroot into to build the
> real system under:
>
> http://archive.linuxfromscratch.org/lfs-museum/5.0/LFS-BOOK-5.0-HTML/chapter05/introduction.html
>
> My builds run entirely as a normal user (so no chroot, because no root access)
> which means my airlock step populated a directory with a host busybox and
> symlinks to the compiler binaries it needed from the host, so it could point the
> $PATH at just that ONE directory for the rest of the build, which meant package
> ./configure steps were less likely to find things like python installed and make
> so many stupid config decisions while cross compiling.
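
A minimal sketch of the airlock idea as described above, assuming mktemp -d is available. It is a simplification: the real step built a host busybox, whereas this only symlinks existing host tools, and make_airlock is a made-up name.

```shell
# Hypothetical helper: create a directory holding symlinks to the named
# host tools and print its path. Tools not found on the host are skipped.
make_airlock() {
  dir=$(mktemp -d) || return 1
  for tool in "$@"; do
    target=$(command -v "$tool") || continue
    # Only link real files; builtins and functions print a bare name.
    case $target in
      /*) ln -s "$target" "$dir/$tool" ;;
    esac
  done
  printf '%s\n' "$dir"
}

airlock=$(make_airlock sh ls cat)

# With $PATH pointing at just that one directory, only the linked tools
# are visible to whatever runs inside.
PATH="$airlock" sh -c 'command -v ls'
```

Anything not linked into the directory (python, for instance) is then invisible to ./configure probes that search the $PATH.
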
>
> (I once wrote a document that I WANTED to call "Why cross compiling sucks" but
> that wasn't "professional": http://landley.net/writing/docs/cross-compiling.html )
>
> I also did a giant 260 slide presentation about all of this back in the day:
> https://speakerdeck.com/landley/developing-for-non-x86-targets-using-qemu
>
> The android guys call their version of this concept a "hermetic build", meaning
> the build is hermetically sealed and provides all its own prerequisites. The way
> they do it isn't via building known binaries on the local system to go through
> an airlock step, instead they provide their own prebuilt binaries and run the
> build using those.

(though _those_ binaries themselves must come from a build on a build
server, not just from someone's laptop.)

> The toybox 0.8.1 release notes had some links about that
> right near the start:
>
>   http://landley.net/toybox/#21-05-2019
>
> Android doesn't care about supporting AOSP builds on FreeBSD, but they _do_ care
> about supporting it on macos:
>
>   https://source.android.com/setup/build/initializing
>
> And that means shipping toybox binaries for mac as part of the AOSP hermetic
> build. Which need to work, hence "make macos_defconfig" and the GSED= plumbing
> in scripts/ and so on.
>
> So technically, I have to worry about Linux, Android, and Mac. (And Android is
> built with llvm, uses bionic as its libc, and is sprayed down with enough
> selinux rules to be its own beast in places...)
>
> >>> I abandoned the -o namespace to POSIX a
> >>> long time ago, and there is still an effort to standardize pipefail as
> >>> `-o pipefail', so I'm leaving it there. I originally made it a -o option so
> >>> we could try and standardize it, and that reasoning is still relevant.
> >>
> >> It's a pity posix is moribund.
> >
> > It's not dead, just slow.
>
> Give him time.
>
> > https://www.austingroupbugs.net/view.php?id=789
> >
> > So we started talking about this in some official proposed way in 2013,
> > continued sporadically until 2018, decided on some official text to add
> > to the standard in September, 2018, and it will be in the next major
> > revision of Posix, issue 8.
>
> There's going to be an issue 8? Posix has been replacing issue 7 in place doing
> the "continuous integration" thing (grab random snapshot du jour from the
> website and call it good, no two people ever experience quite the same version)
> for 12 years now.

(i sent you a mail about that recently. TL;DR: you're already doing
everything they're adding/changing.)

> Toybox has quarterly releases because I found Martin Michlmayr's "release
> management in Large Free Software Projects" talk compelling:
>
>   https://www.youtube.com/watch?v=IKsQsxubuAA
>
>   ABSTRACT: Time based releases are made according to a specific time interval,
>   instead of making a release when a particular functionality or set of features
>   have been implemented. This talk argues that time based release management
>   acts as an effective coordination mechanism in large volunteer projects and
>   shows examples from seven projects that have moved to time based releases:
>   Debian, GCC, GNOME, Linux, OpenOffice, Plone, and X.org.
>
> (I ranted a lot more here last time in the email that got lost. Probably a good
> thing.)
>
> >  I mentioned the fact pipefail had been picked up
> >> by multiple other shells in the toybox talk I did 7 years ago:
> >
> > Bash wasn't the first shell to have it. I didn't add it until 2003, after
> > a discussion with David Korn (the aforementioned Posix standardization
> > effort).
>
> I only started using bash in 1998. :)
>
> >>>> ----------
> >>>> Usage: sh [--LONG] [-ilrsD] [-abefhkmnptuvxBCHP] [-c CMD] [-O OPT] [SCRIPT] ...
> >
> > You're missing `-o option'.
>
> When I wrote that, I hadn't figured out that "set" and "shopt" were different
> commands with different namespaces.
>
> >>>> Do you really need to document --help in the --help text?
> >>>
> >>> Why not? It's one of the valid long options.
> >>
> >> Lack of space. I was trying to squeeze one less line out of the output. :)
> >
> > Believe me, if it were not there, someone would complain about its absence.
> > There are a few things that remain undocumented, and I get crap about them
> > as regular as clockwork.
>
> Of course you know your userbase better than I do, but toybox's policy on
> "compatibility features" like patch -g where we accept the option and ignore it
> (sometimes because "this is specifying something that's our default behavior
> anyway") is to NOT document them. Our help text is intended for somebody
> figuring out what command line to feed to toybox to make it do a thing.
>
> Also, all toybox commands (except ones like true that explicitly disable it)
> support --help and --version, so they're documented in "help toybox" rather than
> in the individual commands. (Also - as a synonym for stdin, that -- stops option
> parsing, and so on.)
>
> So we have some precedent for NOT documenting certain things. And "you had to
> type --help to see this" would go in that bucket for me. :)
>
> There's a "bang for the byte" principle that also applies to documentation. End
> users only have so much bandwidth to read it.
>
> Of course toybox cheats: if they really want to know every possible thing, the
> implementation should be small and simple enough to easily dig through. Our
> mount.c is 408 lines long, patch.c is 483 lines, find.c is 710 lines... And
> find.c isn't _really_ that big: 67 lines of find.c are // comments, the first 63
> lines are the big /* header block comment */ with the help text in it, and there
> are 107 blank lines, for a total of less than 500 lines of actual command
> implementation...
>
> *shrug* That's the theory, anyway...
>
> >>> The bash man page does
> >>>> not include the string "--debug" (it has --debugger but not --debug),
> >>>
> >>> It's just shorthand for the benefit of bashdb.
> >>
> >>   $ help bashdb
> >>   bash: help: no help topics match `bashdb'.  Try `help help' or `man -k bashdb'
> >>   or `info bashdb'.
> >>   $ man -k bashdb
> >>   bashdb: nothing appropriate.
> >>   $ man bash | grep bashdb
> >>   $
> >>
> >> google... huh, it's a sourceforge package.
> >
> > It's the source of most of the bash debugging support. Rocky started out
> > distributing patches, but I folded most of the features into the mainline
> > source. It's a nifty little bit of work.
>
> When helping port Linux to the hexagon processor back in 2010, I stuck print
> statements into the uclibc dynamic loader to debug its relocation of its own
> symbols. (It was a macro that expanded to a write syscall.) I could not call any
> functions or use any globals, and had to put the data in stack char arrays
> initialized by assigning element at a time. (I made a macro to convert integers
> to hex I could output.)
>
> When I added memory tests between uboot's DRAM init and relocating itself from
> flash to DRAM, I stuck print statements into uboot that were a bit like
> https://balau82.wordpress.com/2010/02/28/hello-world-for-bare-metal-arm-using-qemu/
> except the loop had an if (FLAG&*register) to check the "can write next byte out
> now" (which QEMU doesn't need but real hardware does) and the tricky part is
> that because of the relocation all the string constants were linked at the
> address they would be relocated TO not where they were in flash, so I had to
> work out an UNRELOCATE constant to subtract from the string constants, ala
> emit("string"-UNRELOCATE). (The functions were PIE that would -o binary into a
> static blob or some such, but it couldn't find rodata without serious handholding.)
>
> I stick printk(KERN_ERR) into the kernel all the time.
>
> I started getting into VHDL when I figured out you could stick print statements
> (um, "alert"?) into the simulator builds.
>
> I don't think I'm the target audience for this feature, is what I'm saying. I
> started out on a commodore 64 (38911 basic bytes free) and it left a MARK.
>
> >> I'm not sure how I'd have divined the existence of the sourceforge package from
> >> the --debug option in the help output (which didn't make an obvious behavior
> >> difference when I tried it), but I often miss things...
> >
> > The debugging support exists independently of bashdb, and can be used
> > without it. Bashdb is just the biggest customer, and the origin of the
> > features. The `debugger profile' in the documentation is the bashdb
> > driver script. Try running `bash --debugger' sometime, you might like it.
> > Assuming, of course, your vendor has installed bashdb in the right place.
>
> Ok. Good luck with it.
>
> >> -D   Display all the $"translatable" strings in a script.
> >>
> >> Oh right, I remember reading about $"" existing and going "that's weird, out of
> >> scope" and moving on. Because I did _not_ understand how:
> >>
> >>        A double-quoted string preceded by a dollar sign ($"string") will cause
> >>        the string to be translated according to the current  locale.   If  the
> >>        current  locale  is  C  or  POSIX,  the dollar sign is ignored.  If the
> >>        string is translated and replaced, the replacement is double-quoted.
> >>
> >> was supposed to work. (HOW is it translated? Bash calls out to
> >> translate.google.com to convert english to japanese? Is there a thing humans can
> >> do to supply an external translation file? Is it just converting dates and
> >> currency markers and number grouping commas?)
> >
> > You have a message catalog, install the right files, and use the gnu
> > gettext infrastructure to get translated versions of the strings you
> > mark. It's a shell script way of doing what bash does internally for its
> > own messages. Very little-used.
>
> Possibly very little-used because the documentation assumes you already know
> what it is, how to use it, and why you'd want to? Which I didn't, but then I'm
> VERY good at not understanding things, and I break everything, so not
> necessarily representative...
>
>   If your current locale setting has an appropriate gettext database installed,
>   $"strings" get looked up and replaced with translated versions, otherwise
>   they act like normal double quoted strings. Also, "bash -D SCRIPT" will show
>   you all the $"" strings in a SCRIPT so translators can make a gettext database
>   for a new $LANG.
>
> >> Ah, gettext. That would explain why I don't know about it. I always used
> >> http://penma.de/code/gettext-stub/ in my automated Linux From Scratch test
> >> builds because it's one of those gnu-isms like info and libtool.
> >
> > It will be in Posix one day.
>
> I plead the third. (It's the one about quartering troops, it doesn't get enough
> attention so I do what I can.)
>
> > Chet
>
> Rob