[Toybox] [PATCH] sh: pass "\" to the later app

Chet Ramey chet.ramey at case.edu
Fri Jun 9 13:23:02 PDT 2023


On 6/8/23 10:31 PM, Rob Landley wrote:
> On 6/5/23 18:08, Chet Ramey wrote:
>>> But escaping a _newline_ is funny in that it glues lines together instead of
>>> creating a command line argument out of the result, which means it has to be
>>> special cased and obviously I'm special casing it wrong, but the special case
>>> has multiple nonobvious features.
>>
>> I guess. There are two cases: in double quotes, when the backslash-newline
>> is preserved,
> 
> $ printf "abc\
>> def\n"
> abcdef
> $ basename "abc\
>> def" | hd
> 00000000  61 62 63 64 65 66 0a                              |abcdef.|
> 00000007
> 
> Define "preserved".

You got me. You're right; I had it backwards.

"The <backslash> shall retain its special meaning as an escape character 
(see Escape Character (Backslash)) only when followed by one of the 
following characters when considered special:

     $   `   "   \   <newline>"

So the backslash-newline gets removed, but, say, a \" only has the
backslash removed.
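
A minimal sketch of that rule, for anyone who wants to poke at it. The
function names are made up for illustration, and printf '%s\n' is used so
that no echo implementation gets a chance to interpret backslashes itself:

```shell
# Double quotes: the backslash-newline pair is removed entirely.
dq_continuation() {
  printf '%s\n' "abc\
def"
}

# Double quotes: \" loses only the backslash; the quote character stays.
dq_quote() {
  printf '%s\n' "say \"hi\""
}

# Double quotes: a backslash before an ordinary character is not special,
# so both characters are kept.
dq_ordinary() {
  printf '%s\n' "a\bc"
}

# Single quotes: nothing is special; backslash and newline are both kept.
sq_continuation() {
  printf '%s\n' 'abc\
def'
}

dq_continuation   # abcdef
dq_quote          # say "hi"
dq_ordinary       # a\bc
sq_continuation   # abc\ and def on two separate lines
```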


> In here documents, double quote does NOT remove it:

Quoting the here-document delimiter has the expected effect. The body is
considered to be in double quotes if the delimiter is *not* quoted, and
basically in single quotes if it is ("the here-document lines are not
expanded").
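
As a rough illustration of the difference (the variable and function names
here are invented for the example, not taken from any shell's test suite):

```shell
name=world

# Unquoted delimiter: the body behaves as if double-quoted, so $name is
# expanded and the backslash-newline pair is removed.
unquoted_body() {
  cat <<EOF
hello $name ab\
c
EOF
}

# Quoted delimiter: the body lines are not expanded, so $name and the
# backslash-newline pair come through literally.
quoted_body() {
  cat <<"EOF"
hello $name ab\
c
EOF
}

unquoted_body   # hello world abc
quoted_body     # hello $name ab\ and c on two separate lines
```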

The next POSIX version goes into a lot more detail on how here-documents
are read and processed.


>    $ cat<<"EOF"
>    > ab\
>    > c
>    > EOF
>    ab\
>    c


> I also tried to ask questions about how long a HERE document lasts, and:

What does `lasts' mean? How the body is delimited, or something else?

> 
>    $ bash -c $'cat<<0;echo hello\nabc\n0'
>    abc
>    hello

POSIX specifies that "the end of a command_string operand (see sh) shall be
treated as a <newline> character."

>    $ bash -c $'cat<<"";echo X\n\necho Z'
>    X
>    Z

This is dodgy behavior to rely on: a null delimiter is matched by the next
blank line, since that's technically "a line containing only the delimiter
and a <newline>, with no <blank> characters in between."

>    $ echo -n 'cat<<EOF' > one
>    $ echo -n $'potato\nEOF' > two
>    $ bash -c '. one;. two'
>    one: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `EOF')
>    two: line 1: potato: command not found
>    two: line 2: EOF: command not found

I don't think it's reasonable to expect a word, which is what the here-
document body is, to persist across `.' boundaries, since the contents of a
`.' script are (depending on how you parse them) either a `program' or a
`compound_list'.

> I'm also vaguely curious how one WOULD terminate this one:
> 
>    $ bash -c $'cat<<\'\n\''
>    bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `
>    ')

You can't. A newline here-document delimiter can never be matched, and
only EOF will terminate the here-document. Some shells (e.g., yash) treat
this as a fatal syntax error, but most treat it like bash does. I
considered printing a warning for a delimiter containing a newline, but
decided not to.


> Also, -s doesn't work as advertised in the man page?
> 
>         -s        If  the -s option is present, or if no arguments remain after
>                   option processing, then commands are read from  the  standard
>                   input.   This  option  allows the positional parameters to be
>                   set when invoking an interactive shell or when reading  input
>                   through a pipe.
> 
>    $ echo echo also | bash -s -c 'echo hello'
>    hello
>    $ echo echo also | bash -c -s 'echo hello'
>    hello
>    $ echo echo also | bash -c -s -s -s -s 'echo hello'
>    hello

-c has higher priority than -s, and you can only use one. It's unspecified
behavior; POSIX doesn't allow those options to be used together. Some ash-
based shells (e.g., dash) execute the command string and then start an
interactive shell, but I don't think that's a great idea.

> Yes, but does a backslash newline count as quoted whitespace?

No. In places where the backslash acts as escape character, the backslash-
newline pair is removed from the input stream.

> Backslash
> ordinarily quotes, and there's "" which is a quoted nothing but creates an
> argument. So this is a new category: a quoted nothing that does NOT create an
> argument. 

It's removed from the input stream before tokenization. It doesn't even
delimit a token.
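
A small sketch of what "doesn't even delimit a token" means in practice.
count_args and the other function names are hypothetical helpers for the
demonstration:

```shell
# Print the number of arguments received.
count_args() { echo "$#"; }

# The pair is removed before tokenization, so this is the single word "echo".
split_word() {
  ec\
ho joined
}

# No whitespace around the pair: "a" and "b" are glued into one word, "ab".
glued() {
  count_args a\
b
}

# The space before the backslash is what separates the two words here;
# the backslash-newline itself contributes nothing.
separated() {
  count_args a \
b
}

# By contrast, "" is a quoted nothing that does create an argument.
empty_quotes() {
  count_args a "" b
}

split_word     # joined
glued          # 1
separated      # 2
empty_quotes   # 3
```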

>> POSIX says you do them in separate steps.
> 
> Good to know.

It's always said this. (And bash has always performed steps 3 and 4 in
reverse order, but ...)

> 
> Alas, posix says a lot of things, it would be nice if more of them were current
> and relevant. I printed it all out and read the whole thing on a series of bus
> rides into work when I first sat down to write a new shell for busybox back in
> 2006. I've had a vague todo to read the new one whenever Issue 8 finally comes
> out, but it's been "real soon now" for... how long? (Posix-2008 came out 15
> years ago.)

The current edition is from 2018. The next one is in its third draft, then
it has to go through the whole IEEE process, but it may get through
balloting by the end of the year. The standard is always evolving (things
are still being added and deprecated/removed) and being clarified.
The expanded text describing here-documents is a good example.


>>> In general, line continuation priority isn't always obvious to me until I've
>>> determined it experimentally:
>>
>> You go off and collect here-document bodies as soon as you get a newline
>> token after seeing the operator-delimiter pair.
> 
> It does seem to take priority over everything, yes.

It was the `where is the next NEWLINE token' and `does a here-document body
have to appear in the same command substitution as the delimiter' that
sparked disagreement.

> 
>> We had a pretty good
>> argument about this on the austin-group list.
> 
> I've been subscribed forever, and even dialed in to a few of their conference
> calls, but I've generally found arguments there mostly just peter out without
> resolution:

The only way to guarantee a resolution is to file an interpretation
request, which has to be acted on. Otherwise, unless we get to some kind
of consensus on the list, shells keep doing their thing.

> 
>    https://mail-archive.com/austin-group-l@opengroup.org/msg09569.html

This was resolved, and the accepted text is in the link:

https://austingroupbugs.net/view.php?id=267#c5990

> 
> I've never been good at the politics side of things...

Me either.

[back to here-documents]

> I've been treating each line as a single word. 

That's fine as far as it goes, but you eventually have to expand the lines
and handle line continuations (body not quoted). And you have to handle the
continuations before you check for the delimiter, so constructs like

cat <<EOF
abcde
next\
EOF

don't delimit the here-document, and constructs like

cat <<EOF
abcde
EO\
F

are commonly accepted, if officially unspecified, even though they make
ash-based shells fall over dead.
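
To make the first construct concrete, here is a sketch that adds a second
EOF line so the here-document actually ends. The function name is made up,
and only the specified case (continuation handled before the delimiter
check) is shown, not the unspecified EO\<newline>F one:

```shell
# The "next\" line is joined with the following "EOF" line before the
# delimiter check, so the body contains "nextEOF" and only the second
# EOF line terminates the here-document.
continued_before_delimiter() {
  cat <<EOF
abcde
next\
EOF
EOF
}

continued_before_delimiter
# abcde
# nextEOF
```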

> I'm trying to make stuff work on
> nommu systems, which suffer from memory fragmentation very easily, so handling
> stuff in smaller chunks where possible is an advantage. But yeah, I've gotta
> handle line continuations in HERE document context. I've got them working ok in
> other input contexts (it's basically a variant of quoting):

You might consider just discarding them in the lexer, since they have to be
removed from the input before you determine tokenization.

> 
>    $ ./toybox sh
>    $ echo "${PATH//:/
>    > }"
>    /home/landley/bin
>    /usr/local/bin
>    /usr/bin
>    /bin
>    /usr/local/games
>    /usr/games
> 
> Although apparently I need a test case for another one of those "$@" silently
> becomes "$*" things:
> 
>    $ bash -c $'cat<<EOF\n"$@"\nEOF' one two three
>    "two three"
> 
> (I think I have a control flag for it already...)

Since the here-document bodies do not undergo word splitting or quote
removal, you have to leave the double-quotes there and the positional
parameters are not split.
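
The same effect can be seen from inside a function, where the arguments
are the positional parameters. This is a sketch assuming the default IFS;
show_params is an invented name:

```shell
# Inside an unquoted here-document body "$@" is expanded, but there is no
# word splitting and no quote removal: the parameters come out joined by
# a space with the literal double quotes still around them.
show_params() {
  cat <<EOF
"$@"
EOF
}

show_params two three   # "two three"
```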

> 
>> You still have to expand it (or not) and pass it to the command on standard
>> input or the designated file descriptor. That's where you have to do the
>> `generating' part.
> 
> I copied the trick of writing to a deleted temp file and then seeking back to
> the start, which isn't ideal but gives you seekable input, and is thus easily
> distinguishable from other approaches. Downside: you need a writeable /tmp or
> similar, which can fill up and then you've got an error occurring in an awkward
> place...

Bash-5.1 switched to using pipes for the here-document if the document size
is smaller than the pipe buffer size (and hence won't block), keeping the
temporary file for documents larger than that.

That caused a rather large blowup, especially with people who assumed
that here-document bodies would always be seekable, even though POSIX
explicitly warns against making that assumption:

https://lists.gnu.org/archive/html/bug-bash/2022-04/msg00051.html

This was after people got up in arms about bash using temp files for here-
documents and here-strings in the first place:

https://lists.gnu.org/archive/html/bug-bash/2019-03/msg00073.html

> 
> Alas, you have to generate the contents at command execution time 

Quite true.

> because
> variables resolved in it can change in a loop and/or function call, which is why
> I need to retain the list of input lines. (Which for me is an array of arrays of
> arguments because you can have an unlimited number of HERE documents attached to
> each command, and another batch at the end of each flow control block...)

It's technically a list of redirections, and you can indeed have multiple
redirections associated with a command. Bash stores the document as a
single word (string), since it's going to be treated as a word.

>> I'm saying that the behavior should be consistent whether the shell is
>> processing -c command or not. I think we agree on that.
> 
> Agreed.
> 
>> That behavior should be: if there is an unquoted backslash-newline pair,
>> it should be removed.
> 
> Single or double quote?

Single quotes: preserved. Double quotes: removed when special. For
instance, the double quotes around a command substitution don't make the
characters in the command substitution quoted. That's the `special' part.
There's also the case of double quotes around the `new' word expansions

${parameter[#]#word}
${parameter[%]%word}

> 
>> If there isn't, a trailing backslash before EOF
>> should be preserved. Different shells have different behaviors, and
>> different versions of echo have different bugs with backslash processing,
>> but I think this is correct.
> 
> Echo isn't processing any of these backslashes. Both bash and toybox echo need
> -e to care about backslashes in their arguments. (Again, posix-2008 says
> "implementations shall not support any options", which seems widely ignored.)

They're not options, per se, according to POSIX. It handles -n as an
initial operand that results in implementation-defined behavior. The next
edition extends that treatment to -e/-E.

Other shells have versions of echo that perform backslash expansion
unconditionally, as POSIX (XSI) requires. They have various bugs or quirks.

When bash is in posix mode and has the xpg_echo option enabled, it behaves
as POSIX specifies for XSI implementations, so it's more than theoretical.
I have to confess, though, that the only time I've ever run bash that way
was to run the Open Group test suite.

> 
>>> Except when I have a file that doesn't end with a newline, a trailing \ on the
>>> last line is removed. That was one of the later tests.
>>
>> Yeah, I think that's wrong. If bash does it, bash is wrong, too.
> 
> I pine for a complete, reliable, and current standards document.

This is completely unspecified behavior. POSIX shell scripts are text
files, which consist of lines, and lines end with newlines. It's up to each
shell implementor to decide how to handle it. You can push for an extension
to that, but I would not hold my breath.


> I was probably thinking of:
> 
>> $ echo "$(echo $'abc\n\n\n')"
>> abc
>> $

Yes, that's the case where they're removed.

> 
> I spent an afternoon once trying to come up with some combination of quoting
> that would actually preserve what it returned. Things like:
> 
>    $ X=`cat <(echo $'abc\n\n\n\n')`
>    $ echo "$X"
>    abc
>    $

The old "end the command substitution with `echo .' and remove one
character from the end of the result" trick works, but any command
substitution is always going to remove trailing newlines, quoted or not.
We had a pretty good argument about that trick on the POSIX list, too, but
most of the objections are theoretical.
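
For reference, a sketch of that trick. The function names are invented for
the example; ${#var} is used to count characters so the result doesn't
depend on how the newlines get printed:

```shell
# Plain command substitution: all trailing newlines are stripped.
plain_length() {
  out=$(printf 'abc\n\n\n')
  echo "${#out}"
}

# The trick: end the substitution with "echo .", so the only trailing
# newline is the one echo adds. That one is stripped, leaving the "."
# as a sentinel, which is then removed with ${out%.}.
preserved_length() {
  out=$(printf 'abc\n\n\n'; echo .)
  out=${out%.}
  echo "${#out}"
}

plain_length       # 3
preserved_length   # 6
```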


>>> Yup, which is what led up to the next tests:
>>>
>>>>>
>>>>> So...
>>>>>
>>>>>      $ echo -n 'echo abc\' | bash
>>>>>      abc
>>>>>      $ echo -n 'echo abc\' > blah
>>>>>      $ bash ./blah
>>>>>      abc
>>>>
>>>> This looks inconsistent at first glance, I'll take a look.
>>
>> See above.
>>
>> There's genuine disagreement between shells here. The ash-based shells
>> (dash, the BSD sh, gwsh) preserve the backslash. Bash through bash-5.2,
>> yash, mksh, ksh93 all remove it.
> 
> See "pining for standards", above.

File an interpretation request. I'm going to do what I think makes sense.
Be prepared to have it rejected, though.

> (The downside to using bash as a standard is when I ask you about corner cases,
> half the time you fix things. Not a downside for YOU, but I'm left with a moving
> target. https://threeplusone.com/quotes/pratchett/ .)

We talked about this before. Pick a fixed target (e.g., bash-5.1) and write
to that, then move forward if you like.

> 
>>> Which is where I got confused, yes. If -c doesn't end with a newline, then the \
>>> persists, but when stdin or file input don't end with a newline, the trailing
>>> backslash is still removed even when it's the last byte of the input and
>>> thus has nothing to escape.
>>
>> Yes, you've convinced me this is a bug.
>>
>> Maybe it's worth an austin-group interpretation request,
> 
> Paging Edvard Munch, please report to the ADR booth.

You have only yourself to blame. ;-)

> I can make it correct, or I can make it work. I'm not always good enough to do both.

"Make it work, make it work right, make it work fast. In that order."

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet at case.edu    http://tiswww.cwru.edu/~chet/


