[Toybox] [PATCH] sh: pass "\" to the later app

Chet Ramey chet.ramey at case.edu
Mon Jun 5 16:08:24 PDT 2023


On 6/5/23 1:04 AM, Rob Landley wrote:
> On 6/1/23 10:20, Chet Ramey wrote:
>> On 5/29/23 12:39 PM, Rob Landley wrote:
>>
>>> But I'm still left with this divergence:
>>>
>>>     $ ./sh -c 'echo abc\'
>>>     abc
>>>     $ bash -c 'echo abc\'
>>>     abc\
>>
>> The backslash doesn't escape anything, EOF delimits the token and command,
>> and the backslash remains in place for echo to process (or not).
> 
> To me this is all part of line continuation logic. My tokenizer is returning
> "needs another line to continue" as part of quote processing, and backslash is
> basically a single character quote, which yours is doing too:
> 
>    $ echo \  | wc -c
>    2
>    $ echo | wc -c
>    1
> 
> But escaping a _newline_ is funny in that it glues lines together instead of
> creating a command line argument out of the result, which means it has to be
> special cased and obviously I'm special casing it wrong, but the special case
> has multiple nonobvious features.

I guess. There are two cases: in double quotes, when the backslash-newline
is preserved, and unquoted, where it's removed. Single quotes obviously
preserve and aren't worth mentioning.

> 
> I think part of it is that my tokenizer removes whitespace between tokens, and
> you're not doing that until later? 

No, the tokenizer produces a stream of tokens. Unquoted whitespace doesn't
matter.

> (You're doing more passes over the data than
> I am, my code tries to do all the work each pass can do so it's not repeating
> itself. I had a problem that variable expansion and redirect are the same pass
> in my code, and different passes in yours, which leads to me being unable to
> produce quite the same error messages you do in a couple places...)

POSIX says you do them in separate steps.

> In general, line continuation priority isn't always obvious to me until I've
> determined it experimentally:

You go off and collect here-document bodies as soon as you get a newline
token after seeing the operator-delimiter pair. We had a pretty good
argument about this on the austin-group list.


> I'm trying to have tests for everything, but there are a number of corner cases...
> 
>>> Which is annoyingly magic because:
>>>
>>>     $ bash << 'EOF'
>>>     > echo abc\
>>>     > EOF
>>>     abc
>>
>> So think about this in two pieces: what the here-document does to generate
>> the input to the shell, and what the shell does with it.
> 
> The way I'd done it is the HERE document doesn't generate input, the funky
> redirect _requests_ additional input, which is all basically the line
> continuation logic where it can't proceed to the "can we actually run this now"
> logic because it hasn't yet got a complete thought. I keep calling
> parse_line() with the next line of input until it returns zero, at which point
> it can call run_line() on the accumulated data structure it got parsed into.

There are two parts: reading the body of the here-document, and processing
it as part of performing redirections.

Reading the body is simple. You read lines, until you get a line that
consists solely of the here-document delimiter. You do backslash-newline
processing (or not) during this phase. It's a completely lexical operation,
since the entire here-document is a single word, but it's weird because
you have to save the operator and delimiter until you get a newline and can
go off and collect the body.
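
As a rough sketch of that reading loop, here is the quoted-delimiter case
(as in << 'EOF'), where no backslash-newline processing happens. The
function name collect_heredoc_body is hypothetical and exists only for this
illustration; it is not how bash or toybox implement the step. The
unquoted-delimiter case, where backslash-newline pairs are joined while
reading, is not modeled here.

```shell
# Hypothetical helper, for illustration only.
# Reads lines from stdin and prints them until one line consists solely
# of the delimiter given as $1. Returns 0 if the delimiter was found and
# 1 if input ended first.
collect_heredoc_body() {
    delim=$1
    # IFS= and -r keep leading whitespace and backslashes untouched,
    # which matches the "quoted delimiter" behavior.
    while IFS= read -r line; do
        if [ "$line" = "$delim" ]; then
            return 0
        fi
        printf '%s\n' "$line"
    done
    return 1
}
```

For example, `printf 'one\ntwo\\\nEOF\nafter\n' | collect_heredoc_body EOF`
prints the two body lines, keeps the trailing backslash on the second, and
stops before "after".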

You still have to expand it (or not) and pass it to the command on standard
input or the designated file descriptor. That's where you have to do the
`generating' part.


>> So the shell is supplied input on file descriptor 0 that consists of a
>> single line (which ends with a newline):
>>
>> echo abc\
> 
> That was the intent, yes.
> 
>> which the shell reads. Since nothing is quoted, the backslash-newline gets
>> removed, the shell reads EOF and delimits the token and command, and echo
>> gets "abc" as its argument.
> 
> I thought that "there's a newline at the end of the line, which the \ is
> escaping" was relevant, but apparently that's only true for -c.

I'm saying that the behavior should be consistent whether the shell is
processing -c command or not. I think we agree on that.

That behavior should be: if there is an unquoted backslash-newline pair,
it should be removed. If there isn't, a trailing backslash before EOF
should be preserved. Different shells have different behaviors, and
different versions of echo have different bugs with backslash processing,
but I think this is correct.
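
One way to check a given shell for consistency is to push the exact same
bytes through all three input paths. The helper below is only a sketch:
try_inputs is a hypothetical name, and it assumes mktemp is available. It
adds no newline of its own, so the caller decides whether the script text
ends with one.

```shell
# Hypothetical helper, for illustration only.
# $1 is the shell to test, $2 is the exact script text to feed it.
try_inputs() {
    shell=$1
    script=$2
    tmp=$(mktemp) || return 1
    # printf '%s' writes the text as-is, without appending a newline.
    printf '%s' "$script" > "$tmp"
    printf '%s' '-c:    '; "$shell" -c "$script"
    printf '%s' 'stdin: '; "$shell" < "$tmp"
    printf '%s' 'file:  '; "$shell" "$tmp"
    rm -f "$tmp"
}
```

Calling it with a script that ends in backslash-newline should show the
same result on all three paths. Calling it as `try_inputs bash 'echo abc\'`
(no trailing newline) is the case where shells currently disagree.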

> 
>>> And also:
>>>
>>>     $ echo 'echo abc\' > blah
>>>     $ cat blah
>>>     echo abc\
>>>     $ bash ./blah
>>>     abc
>>
>> Same thing, the file ends with a backslash-newline that gets removed, EOF
>> delimits the token and command, echo gets "abc" and does the expected
>> thing.
> 
> File input and stdin were behaving the same, but -c wasn't. Hence me going "is
> it the newline?" later on...
> 
>>> So... do I special case -c here or what?
>>
>> What's the special case? EOF (or EOS, really) always delimits tokens when
>> you're using -c command. Just the same as if you had a file that didn't
>> end with a newline.
> 
> Except when I have a file that doesn't end with a newline, a trailing \ on the
> last line is removed. That was one of the later tests.

Yeah, I think that's wrong. If bash does it, bash is wrong, too.

> (Given
> how the shell gratuitously strips trailing newlines from "$BLAH" and such, 

It doesn't, you know.

$ ./bash ./x3
before
abc


after
$ cat x3
BLAH=$'abc

'

echo before
echo "$BLAH"
echo after
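
The place where trailing newlines do get stripped is command substitution,
not variable expansion. A minimal sketch of the difference, with variable
names chosen only for this illustration:

```shell
# A variable assigned directly keeps its trailing newlines.
BLAH='abc

'
# Command substitution removes every trailing newline from the output.
SUB=$(printf 'abc\n\n')

# ${#VAR} is the length of the value in characters.
echo "direct assignment:    ${#BLAH}"   # 5: abc plus two newlines
echo "command substitution: ${#SUB}"    # 3: abc only
```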


> Yup, which is what led up to the next tests:
> 
>>>
>>> So...
>>>
>>>     $ echo -n 'echo abc\' | bash
>>>     abc
>>>     $ echo -n 'echo abc\' > blah
>>>     $ bash ./blah
>>>     abc
>>
>> This looks inconsistent at first glance, I'll take a look.

See above.

There's genuine disagreement between shells here. The ash-based shells
(dash, the BSD sh, gwsh) preserve the backslash. Bash through bash-5.2,
yash, mksh, ksh93 all remove it.

> Which is where I got confused, yes. If -c doesn't end with a newline, then the \
> persists, but when stdin or file input don't end with a newline, the trailing
> backslash is still removed even when it's the last byte of the input and thus
> has nothing to escape.

Yes, you've convinced me this is a bug.

Maybe it's worth an austin-group interpretation request, but I doubt it:
POSIX sh input files are required to be text files, which are composed of
lines, which are required to end with a newline. The behavior with non-
text files is unspecified.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet at case.edu    http://tiswww.cwru.edu/~chet/


