[Toybox] [PATCH] sh: pass "\" to the later app

Rob Landley rob at landley.net
Thu Jun 8 19:31:46 PDT 2023


On 6/5/23 18:08, Chet Ramey wrote:
>> But escaping a _newline_ is funny in that it glues lines together instead of
>> creating a command line argument out of the result, which means it has to be
>> special cased and obviously I'm special casing it wrong, but the special case
>> has multiple nonobvious features.
> 
> I guess. There are two cases: in double quotes, when the backslash-newline
> is preserved,

$ printf "abc\
> def\n"
abcdef
$ basename "abc\
> def" | hd
00000000  61 62 63 64 65 66 0a                              |abcdef.|
00000007

Define "preserved".

> and unquoted, where it's removed. Single quotes obviously
> preserve and aren't worth mentioning.

In arguments double quote removes it:

  $ echo abc\
  > def
  abcdef
  $ echo "abc\
  > def"
  abcdef
  $ echo 'abc\
  > def'
  abc\
  def

In here documents, double quote does NOT remove it:

  $ cat<<EOF
  > ab\
  > c
  > EOF
  abc
  $ cat<<"EOF"
  > ab\
  > c
  > EOF
  ab\
  c
  $ cat<<'EOF'
  > ab\
  > c
  > EOF
  ab\
  c

Confirmed I'm testing bash:

  $ ls -l /proc/$$/exe
  lrwxrwxrwx 1 landley landley 0 Jun  8 18:33 /proc/25568/exe -> /bin/bash
  $ bash --version
  GNU bash, version 5.0.3(1)-release (x86_64-pc-linux-gnu)

I also tried to ask questions about how long a HERE document lasts, and:

  $ bash -c $'cat<<0;echo hello\nabc\n0'
  abc
  hello
  $ bash -c $'cat<<"";echo X\n\necho Z'
  X
  Z
  $ echo -n 'cat<<EOF' > one
  $ echo -n $'potato\nEOF' > two
  $ bash -c '. one;. two'
  one: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `EOF')
  two: line 1: potato: command not found
  two: line 2: EOF: command not found

(Trying to get "matching EOF vs newline" test cases in both directions turns out
to be difficult...)

I'm also vaguely curious how one WOULD terminate this one:

  $ bash -c $'cat<<\'\n\''
  bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `
  ')
  $ bash -c $'cat<<\'\n\'\n\n\n'
  bash: line 3: warning: here-document at line 1 delimited by end-of-file (wanted `
  ')



  $

Also, -s doesn't work as advertised in the man page?

       -s        If  the -s option is present, or if no arguments remain after
                 option processing, then commands are read from  the  standard
                 input.   This  option  allows the positional parameters to be
                 set when invoking an interactive shell or when reading  input
                 through a pipe.

  $ echo echo also | bash -s -c 'echo hello'
  hello
  $ echo echo also | bash -c -s 'echo hello'
  hello
  $ echo echo also | bash -c -s -s -s -s 'echo hello'
  hello

But I may have already asked about that one a while back. (I need to reread my
notes...)

>> I think part of it is that my tokenizer removes whitespace between tokens, and
>> you're not doing that until later? 
> 
> No, the tokenizer produces a stream of tokens. Unquoted whitespace doesn't
> matter.

Yes, but does a backslash newline count as quoted whitespace? Backslash
ordinarily quotes, and there's "" which is a quoted nothing but creates an
argument. So this is a new category: a quoted nothing that does NOT create an
argument. (I think I'm handling it properly now, but it was a new thing in my
if/else staircase.)
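
A minimal sketch of that distinction, assuming bash. The count helper is hypothetical and only reports how many arguments it received:

```shell
#!/bin/bash
# count: hypothetical helper that prints the number of arguments it was given.
count() { echo $#; }

# "" is a quoted nothing, but it still becomes an (empty) argument.
with_empty=$(count a "" b)

# Backslash-newline between words vanishes without creating an argument.
with_continuation=$(count a \
b)

# Backslash-newline inside a word glues the two halves into one argument.
glued=$(count a\
b)

echo "$with_empty $with_continuation $glued"
```

So the empty quotes count as a word, while the escaped newline only ever joins lines.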

>> (You're doing more passes over the data than
>> I am, my code tries to do all the work each pass can do so it's not repeating
>> itself. I had a problem that variable expansion and redirect are the same pass
>> in my code, and different passes in yours, which leads to me being unable to
>> produce quite the same error messages you do in a couple places...)
> 
> POSIX says you do them in separate steps.

Good to know.

Alas, posix says a lot of things, it would be nice if more of them were current
and relevant. I printed it all out and read the whole thing on a series of bus
rides into work when I first sat down to write a new shell for busybox back in
2006. I've had a vague todo to read the new one whenever Issue 8 finally comes
out, but it's been "real soon now" for... how long? (Posix-2008 came out 15
years ago.) The Linux Standard Base got eaten by the Linux Foundation, which is
the same kind of 501c6 as the Tobacco Institute and Microsoft's "don't copy that
floppy" sock puppet were, so of course it's long dead and the "linux device
list" from http://lanana.org/ is 404 and has been for many years. Michael
Kerrisk retired over the pandemic and handed off to a new guy (Alejandro
Colomar) who doesn't even maintain a web copy of the man pages. I yearn for
meaningful standards that aren't swiss-cheese and what _is_ there is "bypassed
like a christmas tree captain, don't give me too many bumps"...

Mostly I'm collecting test cases I need to pass. I know where I am with a test
case...

>> In general, line continuation priority isn't always obvious to me until I've
>> determined it experimentally:
> 
> You go off and collect here-document bodies as soon as you get a newline
> token after seeing the operator-delimiter pair.

It does seem to take priority over everything, yes.

> We had a pretty good
> argument about this on the austin-group list.

I've been subscribed forever, and even dialed in to a few of their conference
calls, but I've generally found arguments there mostly just peter out without
resolution:

  https://mail-archive.com/austin-group-l@opengroup.org/msg09569.html

I've never been good at the politics side of things...

>> I'm trying to have tests for everything, but there are a number of corner cases...
>> 
>>>> Which is annoyingly magic because:
>>>>
>>>>     $ bash << 'EOF'
>>>>     > echo abc\
>>>>     > EOF
>>>>     abc
>>>
>>> So think about this in two pieces: what the here-document does to generate
>>> the input to the shell, and what the shell does with it.
>> 
>> The way I'd done it is the HERE document doesn't generate input, the funky
>> redirect _requests_ additional input, which is all basically the line
>> continuation logic where it can't proceed to the "can we actually run this now"
>> logic because it hasn't yet got a complete thought. I keep calling
>> parse_line() with the next line of input until it returns zero, at which point
>> it can call run_line() on the accumulated data structure it got parsed into.
> 
> There are two parts: reading the body of the here-document, and processing
> it as part of performing redirections.
> 
> Reading the body is simple. You read lines, until you get a line that
> consists solely of the here-document delimiter. You do backslash-newline
> processing (or not) during this phase. It's a completely lexical operation,
> since the entire here-document is a single word, but it's weird because
> you have to save the operator and delimiter until you get a newline and can
> go off and collect the body.
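
The body-collection step described above can be sketched roughly as follows, assuming bash. The collect_body name and its two parameters are made up for illustration. This is not how bash or toybox implement it, just the "read lines until the delimiter, optionally gluing backslash-newline" idea in script form:

```shell
#!/bin/bash
# collect_body DELIM QUOTED: read stdin up to a line consisting solely of DELIM.
# QUOTED=1 behaves like <<"EOF" (no backslash-newline processing),
# QUOTED=0 behaves like <<EOF (backslash-newline glues lines together).
collect_body() {
  local delim=$1 quoted=$2 line tail pending= body=
  while IFS= read -r line || [ -n "$line" ]; do
    line=$pending$line
    pending=
    [ "$line" = "$delim" ] && break
    # Trailing run of backslashes: an odd count means the newline is escaped.
    tail=${line##*[!\\]}
    if [ "$quoted" = 0 ] && (( ${#tail} % 2 )); then
      pending=${line%\\}      # drop the backslash, keep the text for the next line
      continue
    fi
    body+=$line$'\n'
  done
  printf '%s' "$body$pending"
}
```

Something like `printf 'ab\\\nc\nEOF\n' | collect_body EOF 0` would then glue the first two lines, while passing 1 as the second argument leaves them alone.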

I've been treating each line as a single word. I'm trying to make stuff work on
nommu systems, which suffer from memory fragmentation very easily, so handling
stuff in smaller chunks where possible is an advantage. But yeah, I've gotta
handle line continuations in HERE document context. I've got them working ok in
other input contexts (it's basically a variant of quoting):

  $ ./toybox sh
  $ echo "${PATH//:/
  > }"
  /home/landley/bin
  /usr/local/bin
  /usr/bin
  /bin
  /usr/local/games
  /usr/games

Although apparently I need a test case for another one of those "$@" silently
becomes "$*" things:

  $ bash -c $'cat<<EOF\n"$@"\nEOF' one two three
  "two three"

(I think I have a control flag for it already...)

> You still have to expand it (or not) and pass it to the command on standard
> input or the designated file descriptor. That's where you have to do the
> `generating' part.

I copied the trick of writing to a deleted temp file and then seeking back to
the start, which isn't ideal but gives you seekable input, and is thus easily
distinguishable from other approaches. Downside: you need a writeable /tmp or
similar, which can fill up and then you've got an error occurring in an awkward
place...
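
A rough shell-level sketch of the deleted temp file trick, assuming bash and mktemp. A script can't lseek(), so this version opens the file twice (one descriptor to write, one to read) instead of seeking back to the start. The feed_via_tmpfile name and the use of descriptors 3 and 4 are arbitrary choices for illustration:

```shell
#!/bin/bash
# feed_via_tmpfile TEXT CMD...: run CMD with TEXT on stdin via an unlinked file.
feed_via_tmpfile() {
  local text=$1 tmp status
  shift
  tmp=$(mktemp) || return 1
  # Open twice, then unlink: the data survives until both descriptors
  # are closed, but no name is left behind in the filesystem.
  exec 3>"$tmp" 4<"$tmp"
  rm -f "$tmp"
  # A full filesystem shows up here, as a failed write.
  printf '%s' "$text" >&3 || { exec 3>&- 4<&-; return 1; }
  exec 3>&-
  "$@" <&4
  status=$?
  exec 4<&-
  return $status
}

feed_via_tmpfile $'abc\ndef\n' cat
```

The command sees a regular file on stdin rather than a pipe, which is what makes the input seekable.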

Alas, you have to generate the contents at command execution time because
variables resolved in it can change in a loop and/or function call, which is why
I need to retain the list of input lines. (Which for me is an array of arrays of
arguments because you can have an unlimited number of HERE documents attached to
each command, and another batch at the end of each flow control block...)
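
A small illustration of why the body has to be kept around and expanded at execution time, assuming bash. The show function is just a wrapper for the example:

```shell
#!/bin/bash
# The here document body is expanded each time the redirection is performed,
# so the same stored lines produce different text on every loop iteration.
show() {
  local i
  for i in 1 2 3; do
    cat <<EOF
i=$i
EOF
  done
}

show
```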

>>> So the shell is supplied input on file descriptor 0 that consists of a
>>> single line (which ends with a newline):
>>>
>>> echo abc\
>> 
>> That was the intent, yes.
>> 
>>> which the shell reads. Since nothing is quoted, the backslash-newline gets
>>> removed, the shell reads EOF and delimits the token and command, and echo
>>> gets "abc" as its argument.
>> 
>> I thought that "there's a newline at the end of the line, which the \ is
>> escaping" was relevant, but apparently that's only true for -c.
> 
> I'm saying that the behavior should be consistent whether the shell is
> processing -c command or not. I think we agree on that.

Agreed.

> That behavior should be: if there is an unquoted backslash-newline pair,
> it should be removed.

Single or double quote?

> If there isn't, a trailing backslash before EOF
> should be preserved. Different shells have different behaviors, and
> different versions of echo have different bugs with backslash processing,
> but I think this is correct.

Echo isn't processing any of these backslashes. Both bash and toybox echo need
-e to care about backslashes in their arguments. (Again, posix-2008 says
"implementations shall not support any options", which seems widely ignored.)

>> Except when I have a file that doesn't end with a newline, a trailing \ on the
>> last line is removed. That was one of the later tests.
> 
> Yeah, I think that's wrong. If bash does it, bash is wrong, too.

I pine for a complete, reliable, and current standards document. I remember the
days of googling for corner cases of how mount is supposed to work and getting
back things I wrote. (Which was still better than the current Google behavior of
refusing to return anything older than a few months.)

>> (Given
>> how the shell gratuitously strips trailing newlines from "$BLAH" and such, 
> 
> It doesn't, you know.
> 
> $ ./bash ./x3
> before
> abc

I was probably thinking of:

> $ echo "$(echo $'abc\n\n\n')"
> abc
> $

I spent an afternoon once trying to come up with some combination of quoting
that would actually preserve what it returned. Things like:

  $ X=`cat <(echo $'abc\n\n\n\n')`
  $ echo "$X"
  abc
  $
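
For what it's worth, the usual workaround is not quoting at all but a sentinel: append a non-newline character inside the substitution so the newlines are no longer trailing, then strip it afterwards. A minimal sketch, assuming bash:

```shell
#!/bin/bash
# Command substitution strips every trailing newline, so emit a sentinel
# character after the real output and remove it once the value is captured.
X=$(printf 'abc\n\n\n'; printf x)
X=${X%x}

# Show the bytes actually stored in X.
printf '%s' "$X" | od -c
```

The catch is that the substitution's exit status now comes from the sentinel command rather than the one you cared about.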

> after
> $ cat x3
> BLAH=$'abc
> 
> '
> 
> echo before
> echo "$BLAH"
> echo after
> 

I should definitely have a test like that in tests/sh.test somewhere...

>> Yup, which is what led up to the next tests:
>> 
>>>>
>>>> So...
>>>>
>>>>     $ echo -n 'echo abc\' | bash
>>>>     abc
>>>>     $ echo -n 'echo abc\' > blah
>>>>     $ bash ./blah
>>>>     abc
>>>
>>> This looks inconsistent at first glance, I'll take a look.
> 
> See above.
> 
> There's genuine disagreement between shells here. The ash-based shells
> (dash, the BSD sh, gwsh) preserve the backslash. Bash through bash-5.2,
> yash, mksh, ksh93 all remove it.

See "pining for standards", above.
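
A hedged sketch of how that disagreement could be probed locally, assuming bash as the driver. The probe function name is made up, and the shell list is just whichever of those shells happen to be installed. No expected output is given because it depends on the shell and version:

```shell
#!/bin/bash
# probe SHELL: feed "echo abc\" (no trailing newline) to SHELL on stdin
# and pass through whatever it writes to stdout.
probe() {
  printf 'echo abc\\' | "$1" 2>/dev/null
}

for sh in bash dash mksh yash ksh93; do
  command -v "$sh" >/dev/null 2>&1 || continue   # skip shells not installed
  printf '%-6s' "$sh"
  probe "$sh" | od -An -c                        # dump the exact bytes
done
```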

(The downside to using bash as a standard is when I ask you about corner cases,
half the time you fix things. Not a downside for YOU, but I'm left with a moving
target. https://threeplusone.com/quotes/pratchett/ .)

>> Which is where I got confused, yes. If -c doesn't end with a newline, then the \
>> persists, but when stdin or file input don't end with a newline, the trailing
>> backslash is still removed even when it's the last byte of the input and thus
>> has nothing to escape.
> 
> Yes, you've convinced me this is a bug.
> 
> Maybe it's worth an austin-group interpretation request,

Paging Edvard Munch, please report to the ADR booth.

> but I doubt it:
> POSIX sh input files are required to be text files, which are composed of
> lines, which are required to end with a newline. The behavior with non-
> text files is unspecified.

I remember long and long ago, getting busybox sed to properly handle last lines
that didn't have a newline on them, but also putting the newline BACK when you
were processing multiple files AND had a match in a later file. It was retcon
logic: you don't output the newline when you match/output a last line that
didn't have one, but then later you may have to add a leading newline when you
output ANOTHER line after that. (I don't think posix was available to plebeians
like me yet. Certainly not without paying hundreds of dollars I didn't have...)

That experience was probably best summarized by this patch:

https://git.busybox.net/busybox/commit/?id=c06f568ddaaa

I can make it correct, or I can make it work. I'm not always good enough to do both.

> Chet

Rob

