[Toybox] [PATCH] sh: pass "\" to the later app

Chet Ramey chet.ramey at case.edu
Mon Jun 12 17:40:42 PDT 2023


On 6/12/23 5:23 PM, Rob Landley wrote:
> On 6/9/23 15:23, Chet Ramey wrote:
>> On 6/8/23 10:31 PM, Rob Landley wrote:
>>> On 6/5/23 18:08, Chet Ramey wrote:
>> You got me. You're right; I had it backwards.
> 
> I'm not trying to gotcha anybody, I'm just trying to understand what the right
> thing to implement is. I find this entire area surprisingly confusing...

No gotcha here, I was wrong and acknowledge it.

> 
>> "The <backslash> shall retain its special meaning as an escape character
> 
> The word pair "shall retain" is not in the bash man page so I'm guessing...
> Posix? 

The man page says "retains." I don't do the standard-speak "shall" stuff.


> and they have a list of "special built-in utilities" that does NOT include cd
> (that's listed in normal utilities: how would one go about implementing that
> outside of the shell, do you think?)

That's not what a special builtin means. alias, fg/bg/jobs, getopts, read,
and wait are all regular builtins, and they can't be implemented outside
the shell either.

Special builtins are defined that way because of their effect:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_14

It's really a useless concept, by the way.

> Anyway, I found the third "shall retain" in V3_chap02, and... it's wrong?

No.

> 
>> (see Escape Character (Backslash)) only when followed by one of the
>> following characters when considered special:
>>
>>       $   `   "   \   <newline>"
>>
>> So the backslash-newline gets removed, but, say, a \" only has the
>> backslash removed.
> 
> Because when you put a backslash in front of another char:
> 
>    $ echo \x
>    x
>    $ basename \x
>    x

The text I quoted was from the Double Quotes section. The additional
reference to (Escape Character...) gives it away.

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_02_03
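
A minimal sketch of the double-quote rule quoted above, for anyone who wants to try it. The variable names are made up for illustration; the behavior shown is what the quoted POSIX text describes.

```shell
# Inside double quotes, a backslash is an escape only before $ ` " \ or newline.
dollar="a\$b"        # backslash removed: a$b
quote="a\"b"         # backslash removed: a"b
other="a\xb"         # x is not special here, so the backslash stays: a\xb
joined="a\
b"                   # backslash-newline removed entirely: ab
unquoted=a\xb        # outside quotes the backslash quotes x and is removed: axb

printf '%s\n' "$dollar" "$quote" "$other" "$joined" "$unquoted"
```

The last assignment is the unquoted case from the `echo \x` example, included only for contrast.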


> And my approach of handling HERE document lines one at a time probably came from
> "...the lines in the here-document are not expanded. If word is unquoted, all
> lines of the here-document are subjected to  parameter  expansion, command
> substitution, and arithmetic expansion, the character sequence \<newline> is
> ignored, and \ must be used to quote the characters \, $, and `."
> 
> Except... \<newline> is ignored when the EOF _is_ quoted? It glues lines
> together when it's not quoted? (It's late and I'm not sure I'm reading this
> clearly. Need test cases...)

When the EOF is not quoted, the here-document body is essentially double-
quoted.

"In this case, the <backslash> in the input behaves as the <backslash> 
inside double-quotes (see Double-Quotes)." (POSIX again.)

The backslash-newline gets removed just like it does in double quotes.

When the EOF is quoted, the here-document body is essentially single-quoted
(not exactly, but you get the idea). The backslash-newline gets preserved.
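
A small test case for the two here-document modes described above. This is only a sketch; the variable names and the body text are invented for the example.

```shell
# Unquoted delimiter: the body is treated much like double-quoted text.
unquoted=$(cat <<EOF
one \
two \$x
EOF
)

# Quoted delimiter: the body is taken literally.
quoted=$(cat <<'EOF'
one \
two \$x
EOF
)

printf '[%s]\n' "$unquoted" "$quoted"
```

With the unquoted delimiter the two body lines are glued into `one two $x`; with the quoted delimiter both the backslash-newline and the `\$` come through unchanged.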

> 
>> The next POSIX version goes into a lot more detail on how here-documents
>> are read and processed.
> 
> Here's hoping spending a few more words to explain it will wind up being an
> improvement...

It's not bad, actually. Too much to cut and paste here.


>> What does `lasts' mean? How the body is delimited, or something else?
> 
> Things like continuing past the end of a "source" file and so on. (Data can come
> from -c, from stdin, from source, from eval, through $() or <()...)
> 
> The colon was an attempt to indicate that examples of what I tried were forthcoming.

OK.

> 
>>>
>>>     $ bash -c $'cat<<0;echo hello\nabc\n0'
>>>     abc
>>>     hello
>>
>> POSIX specifies that "the end of a command_string operand (see sh) shall be
>> treated as a <newline> character."
> 
> Which says the trailing \ should vanish for -c, but the bug report this all
> started with was that it hadn't, and that broke somebody's thing.

No. The text I quoted is from the section on here-documents, since we're
talking about here-documents. That text is actually from the updated
current draft.


>>>     $ bash -c $'cat<<"";echo X\n\necho Z'
>>>     X
>>>     Z
>>
>> This is dodgy behavior to rely on: a null delimiter is matched by the next
>> blank line, since that's technically "a line containing only the delimiter
>> and a <newline>, with no <blank> characters in between."
> 
> I'm trying to match what bash does, which means figuring _out_ what bash does. I
> respect posix, but I expect to diverge from it a lot because so much of what I'm
> trying to be compatible with already does. :(

I understand, but what I wrote explains what bash currently does. It's just
not a good idea for anyone to rely on the behavior of a null here-document
delimiter. I can't imagine anyone does.

> 
>>>     $ echo -n 'cat<<EOF' > one
>>>     $ echo -n $'potato\nEOF' > two
>>>     $ bash -c '. one;. two'
>>>     one: line 1: warning: here-document at line 1 delimited by end-of-file (wanted
>>> `EOF')
>>>     two: line 1: potato: command not found
>>>     two: line 2: EOF: command not found
>>
>> I don't think it's reasonable to expect a word, which is what the here-
>> document body is, to persist across `.' boundaries, since the contents of a
>> `.' script are (depending on how you parse them) either a `program' or a
>> `compound_list'.
> 
> I'm basically abusing function contexts, because that's what I attach local
> variables to, and $LINENO resets but persists in the same way as local vars:

I don't think that makes any sense. You can use `return' in a `.' script as
a special case, but dot doesn't make local variables work. LINENO gets
reset because you're using a new input source, and reverts to its previous
value when you go back to the previous input source. LINENO's not good for
much more than error messages, and it's good to have the current input
source and the current line number match up. It's not a local variable, per se.
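
One way to watch LINENO follow the input source, as a rough sketch for bash. The temporary file is a hypothetical stand-in for a `.' script, and the outer values depend on how the snippet itself is run, so they are printed rather than predicted.

```shell
# Hypothetical demo: a temporary file stands in for a `.' script.
tmp=$(mktemp)
printf '%s\n' 'inner=$LINENO' > "$tmp"

before=$LINENO
. "$tmp"            # new input source: numbering restarts inside the file
after=$LINENO
rm -f "$tmp"

printf 'inside the sourced file: %s\n' "$inner"
printf 'outer input source: %s, then %s\n' "$before" "$after"
```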


> 
>    $ bash -c $'echo $LINENO;. <(echo echo \\$LINENO);echo $LINENO'
>    0
>    1
>    0

The real question is what value LINENO should have when using -c command,
even though it's only defined for a script or function.


> Basically any time I call back into do_source() it stacks a new function
> context, but the ones with a NULL pointer for the function name behave slightly
> differently so things like:

So the NULL function name tells you which aspects of function behavior to
ignore?


>>> Yes, but does a backslash newline count as quoted whitespace?
>>
>> No. In places where the backslash acts as escape character, the backslash-
>> newline pair is removed from the input stream.
>>
>>    Backslash
>>> ordinarily quotes, and there's "" which is a quoted nothing but creates an
>>> argument. So this is a new category: a quoted nothing that does NOT create an
>>> argument.
>>
>> It's removed from the input stream before tokenization. It doesn't even
>> delimit a token.
> 
> I'm not tokenizing HERE documents 

OK, I guess we're back to here-documents again.

You don't have to tokenize a here-document body, but you do have to check
for and handle line continuations when they matter.


> so I'm not sure _how_ you'd remove it from input before tokenization, since
> resolving quotes is part of tokenization...?

I do it as part of the routine that fetches the next character. If you're
in a context where the backslash-newline is going to be removed, the
tokenizer never sees it.
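
The effect of removing backslash-newline before the tokenizer sees it can be observed from the shell itself. A small sketch, where `count` is a hypothetical helper that reports how many arguments it received:

```shell
# Hypothetical helper: report how many arguments it received.
count() { echo "$#"; }

split_word=$(ec\
ho hello)                   # the two halves rejoin into the single word "echo"

one_arg=$(count a\
b)                          # one argument, "ab": no token boundary was created
empty_arg=$(count a "" b)   # three arguments: "" is a quoted nothing that still counts
escaped_space=$(count a\ b) # one argument: here the backslash quotes the space

printf '%s\n' "$split_word" "$one_arg" "$empty_arg" "$escaped_space"
```

This only shows the visible result, not how any particular shell implements the character-fetching routine.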


> Mostly I'm reading the bash man page, pondering many years of
> writing and editing bash scripts, and doing LOTS of tests...

And pointing out places where the man page isn't clear or doesn't describe
the shell's behavior, which I appreciate.

>> The current edition is from 2018.
> 
> Except they said 2008 was the last feature release and everything since is
> bugfix-only, and nothing is supposed to introduce, deprecate, or significantly
> change anything's semantics.

That's clearly not true. If things are specified incorrectly, or don't
reflect what the majority of shell implementations actually do, they're
going to change.

If new needs arise, things are going to get added (e.g., gettext, or the
ferocious arguments over strlcpy, etc.).

> That's why it's still "Issue 7". The new stuff is
> all queued up for Issue 8, which has been coming soon now since the early Obama
> administration.

Oh, I was there.

> 
> They published SUSv2 in 1997 (https://pubs.opengroup.org/onlinepubs/7990989775/), SUSv3 in
> 2001 (https://pubs.opengroup.org/onlinepubs/009695399/), SUSv4 in 2008
> (https://pubs.opengroup.org/onlinepubs/9699919799.2008edition/) and SUSv5 isn't
> out yet. A 4 year gap, a 6 year gap, and now 15 years and counting...

I don't think you get to count the interim (20xx edition) releases as not
existing. They may not satisfy your exact definition of an update, but
updates they are. You have to include them as updates, since those updates
incorporate defect resolutions, which possibly modify required behavior.

You're on the list; there's plenty of "let's try and figure out what they
were thinking in 1992 and then go from there." The only thing that does is
satisfy people who want all existing implementations to be non-conforming
on some minor point. You have to look at what the actual implementations do
and make the standard match those. Otherwise, how would users know what to
expect?

In the end, you have to balance stability and responsiveness. Like everyone
who does software releases.

> I'm following what the bash in my devuan install does. 

Ok?


>> The only way to guarantee a resolution is to file an interpretation
>> request, which has to be acted on. Otherwise, unless we get to some kind
>> of consensus on the list, shells keep doing their thing.
> 
> As I said, I respect the work the posix guys are doing, but it's not the
> standard I'm implementing the shell against. 

Second-hand, it is. If there's something in bash that isn't posix-
conformant, and there's not an extremely good reason to keep it non-
conformant, I'm going to make it conform. If you want to keep pace with
bash, that requirement will eventually hit your code base.

> dash left a bad enough taste in my mouth that "a posix-only shell" seems
> counterproductive.)

No one really wants `a posix-only shell' for anything but testing. It's
just too limited. Even dash isn't `posix-only.'

> 
>>>     https://mail-archive.com/austin-group-l@opengroup.org/msg09569.html
>>
>> This was resolved, and the accepted text is in the link:
>>
>> https://austingroupbugs.net/view.php?id=267#c5990
> 
> Let's see... a lot more micro-managing of when things are unspecified, carving
> out space for the DOS C: drive for some reason...

Be that as it may, you can't say there's no resolution.

(And it's not the C: drive, it's identifiers reserved for labels for a
goto-like target for break and continue.)


> That's another reason I'm reluctant to start threads with you: it's very easy to
> talk shop with somebody who knows MORE about this than I do, but I have a bit of
> homework still to do before approaching the teacher about a lot of this stuff. :)

I wish you were not so reluctant. Look at how many things you've discovered
that I decided were bugs based on our discussions.


>> Bash-5.1 switched to using pipes for the here-document if the document size
>> is smaller than the pipe buffer size (and hence won't block), keeping the
>> temporary file for documents larger than that.
> 
> I hate having multiple codepaths to do the same thing without a good reason.

I understand. You'd be surprised (or not) at how vocal people were about
"hitting the file system" and the horrible security consequences that has.
In this case, adding the pipe was a couple of dozen lines of code.

> 
> But doing pipes here seems like a microoptimization?

It's not the only consideration. Here-documents before any writable file
systems are mounted, for instance. Look at the dueling requirements from
my message and your own experience. IYKYK.



>> Single quotes: preserved. Double quotes: removed when special. For
>> instance, the double quotes around a command substitution don't make the
>> characters in the command substitution quoted.
> 
> Quotes around $() retain whitespace that would otherwise get IFS'd. 

Correct, but that's behavior that affects how the output of the command
substitution is treated, not how the substitution itself is parsed or
executed.
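
A quick sketch of that distinction. The outer double quotes change what happens to the output, not how the command inside is parsed. `count` is again a hypothetical argument-counting helper.

```shell
count() { echo "$#"; }

unquoted=$(count $(printf 'a   b\n'))    # output is split on IFS: 2 fields
quoted=$(count "$(printf 'a   b\n')")    # output stays one field: 1
kept="$(printf 'a   b\n')"               # inner whitespace survives

printf '%s %s [%s]\n' "$unquoted" "$quoted" "$kept"
```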

> And command
> substitution quoting contexts NEST:

Sure, in the sense that each command substitution has its own quoting
context that starts out as `unquoted'. The original Bourne shell
implementation of stuff like that (and stuff like double quotes inside
${...}) was just horribly broken, and ksh88 only fixed a couple of cases,
so it took POSIX a while to reconcile all that.
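
A short sketch of what "starts out as unquoted" looks like in practice in a current shell. The strings are arbitrary.

```shell
# Each $( ... ) starts a fresh, unquoted context, so the inner double
# quotes do not close the outer ones.
nested="$(echo "outer $(echo "inner  text")")"

# Inside the substitution, unquoted words are still split as usual
# before echo sees them, even though the whole thing is double-quoted.
collapsed="$(echo one    two)"

printf '[%s]\n' "$nested" "$collapsed"
```

The first result keeps the double space because it was quoted inside the innermost substitution; the second loses its extra spaces because nothing inside the substitution quoted them.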

> 
>    echo -n "$(echo "hello $(eval $'echo -\\\ne \'world\\n \'')")"
>    hello world
> 
> When I can't puzzle through it I just run lots of tests against all the corner
> cases I can think of and try to retcon a general rule from the results...

The problem with doing that is that there were so many special cases in the
original Bourne shell, Korn felt he had to preserve a lot of them for
backwards compatibility. By the time POSIX came along, there were so many
that special cases snuck into the standard. Sometimes trying to puzzle out
a general rule works, sometimes it doesn't.

> 
>> That's the `special' part.
>> There's also the case of double quotes around the `new' word expansions
>>
>> ${parameter[#]#word}
>> ${parameter[%]%word}
> 
> This part I don't know about, it looks like that's the prefix/suffix removal syntax?

Yes. `new' is relative, of course -- they were in ksh88. They just weren't
in the SVR4 sh. But they're special, in that double quoting those 
expansions doesn't mean the patterns are quoted, and you have an
independent quoting context after the operator.
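
A minimal sketch of that independent quoting context, using prefix removal. The variable names and values are made up for the example.

```shell
path=/usr/local/bin
star='*/x'

pattern="${path##*/}"     # * is still a pattern inside the outer quotes: bin
literal="${path##"*"/}"   # quoting * after the operator makes it literal: no match
matched="${star#"*"/}"    # the literal "*/" prefix does exist here: x

printf '%s\n' "$pattern" "$literal" "$matched"
```

This is the same mechanism behind the common `"${var#"$prefix"}"` idiom for stripping a literal prefix.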

> I note that I have yet to open the can of worms that is bash array variables,
> although I've reserved plumbing for them in like five different places. (This is
> mostly because I have not historically used them much, and thus don't have a
> good handle on how to test it. But multiple people have said that's the biggest
> feature they're looking forward to...)
> 
> (And "$@" is kind of array variable-ish already...)

Kind of, but it's not sparse. Support for very large sparse arrays is one
thing that informs your implementation.

> 
> I remember being deeply confused by ${X@Q} when I was first trying to implement
> it, but it seems to have switched to a much cleaner $'' syntax since? 

The @Q transformation has preferred $'...' since I introduced the
parameter transformations in bash-4.4. I'm not sure when you were looking
at it?
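
For reference, a small sketch of what the @Q transformation produces. This needs bash 4.4 or later, since ${var@Q} is a bash extension; the variable names are illustrative only.

```shell
plain='hello world'
tricky=$'line one\nline two'

quoted_plain=${plain@Q}     # 'hello world'
quoted_tricky=${tricky@Q}   # $'line one\nline two'

# The quoted form is meant to be reusable as shell input.
eval "roundtrip=$quoted_tricky"

printf '%s\n' "$quoted_plain" "$quoted_tricky"
```

Plain text comes back in ordinary single quotes; the $'...' form appears when the value contains something like a newline that single quotes would not show readably.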


>>> Echo isn't processing any of these backslashes. Both bash and toybox echo need
>>> -e to care about backslashes in their arguments. (Again, posix-2008 says
>>> "implementations shall not support any options", which seems widely ignored.)
>>
>> They're not options, per se, according to POSIX. It handles -n as an
>> initial operand that results in implementation-defined behavior. The next
>> edition extends that treatment to -e/-E.
> 
> An "initial operand", not an argument.

That's the same thing. There are no options to POSIX echo. Everything is
an operand. If a command has options, POSIX specifies them as options, and
it doesn't do that for echo.


> Right. So they're going from "wrong" to "wrong" then:
> 
>    $ echo -n -e 'hey\nthere'
>    hey
>    there$

Yeah, echo is a lost cause. Too many incompatible implementations, too much
existing code. That's why everything non-trivial (like the above) is
implementation-defined. POSIX recommends that everyone use printf.
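
A sketch of the printf equivalents, for cases where echo's behavior would otherwise vary between implementations. Variable names are illustrative.

```shell
dash_n=$(printf '%s\n' '-n')              # the data is an operand, never an option
escapes=$(printf '%s\n' 'hey\nthere')     # %s prints the backslash literally
expanded=$(printf '%b\n' 'hey\nthere')    # %b interprets escapes, like echo -e

printf '%s\n' "$dash_n" "$escapes" "$expanded"
```

Keeping the data out of the format string is what makes this predictable: the format decides whether escapes are interpreted, not the data.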

> Maybe posix should eventually break down and admit this is a thing? "ls . -l"
> has to work, but "ssh user@server -t ls -l" really really REALLY needs that
> second -l going to ls not ssh.

Why do you think they don't acknowledge this today? Options only exist as
such if they come before the first non-option argument. Options have to
begin with `-'. So in your example, the -t isn't an option to ssh; it's
ssh that breaks the POSIX guidelines by accepting it as an option instead
of a command name. In the POSIX world, the -t is not an option -- it's an
operand that ssh happens to treat like an option. If POSIX were to take
ssh under consideration for standardization, it would probably make it an
application requirement that ssh options appear before operands (since
that's an existing utility syntax guideline). If you really want to go
hardcore, require that the application (user) supply a `--' before the
remote command and its arguments if you want to use it in this way.
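
The guideline can be seen in getopts, which stops at the first operand. This is only a sketch: `parse` is a hypothetical command that accepts -l and -t, not a model of what ssh actually does.

```shell
# Hypothetical command that accepts only the options -l and -t.
parse() {
  OPTIND=1
  flags=
  while getopts lt opt; do
    flags="$flags$opt"
  done
  shift $((OPTIND - 1))
  echo "options=[$flags] operands=[$*]"
}

before=$(parse -t host ls -l)     # -t precedes the first operand, so it is an option
after=$(parse host -t ls -l)      # parsing stops at "host"; -t and -l are operands
dashdash=$(parse -- -l)           # "--" ends the options, so -l is an operand

printf '%s\n' "$before" "$after" "$dashdash"
```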

> And yes, my echo parses initial -- the same way
> every other command that parses any arguments does, 

OK. Bash doesn't. POSIX doesn't.

> Is it more important for the toybox
> commands to be consistent with each OTHER, or for them to be consistent with
> other implementations? Navigating conflicting wossnames. It's a thing.

Right. Is internal or external consistency more important? You get to make
that call.


> But I'm trying to implement whatever it
> is bash is doing.

I get it. But you should have some idea of my motivation about where things
come from.


> For the shell though, I plan to document my deviations from BASH. Because that's
> my standard.

Sure, I appreciate that. You should pick a particular version for your
compatibility target, because then you can refer to bash and its
documentation of how it differs from POSIX.


> I have not committed to implementing 100% of what bash does. It's beyond 80/20,
> but whether it's 2 iterations of 80/20 (96%) or 3 (99.2%) I dunno yet. Somewhere
> between, probably... (https://landley.net/notes-2021.html#30-09-2021 was
> promising 3 but it's not an exact science.)

That is, of course, your call.

> 
> Implementing -p mode might be in the "two iterations" part. 

Behavior differences when invoked setuid? Or `-o posix'?

> Actually passing the
> posix test suite (where does one even GET that? Do you have to pay for it?)

I asked for it. I have a limited-use, limited-time license. I'm probably in
a different position vis a vis The Open Group than you are, though.


>> This is completely unspecified behavior.
> 
> The standard is not complete, yes.

A different interpretation. There's plenty of unspecified and
implementation-defined behavior.

> 
>> POSIX shell scripts are text
>> files, which consist of lines, and lines end with newlines. It's up to each
>> shell implementor to decide how to handle it. You can push for an extension
>> to that, but I would not hold my breath.
> 
> I don't care what posix says, I care what bash does.

I understand that. I'm telling you what might happen if you went for an
interpretation request.

> so I'm just doing
> the best I can and waiting for people to complain.

Sometimes that's the best you can do.


>> The old "end the command substitution with `echo .' and remove one
>> character from the end of the result" trick works,
> 
> Ooh, good trick. I hadn't thought of that.

That's an honest-to-god idiom.

> 
>> but any command
>> substitution is always going to remove trailing newlines, quoted or not.
>> We had a pretty good argument about that trick on the POSIX list, too, but
>> most of the objections are theoretical.
> 
> I'm not trying to make it work on solaris or AIX. And only a subset works on BSD
> or MacOS...

It's a locale thing, not necessarily an OS thing. It works in UTF-8
encodings, so it's basically universal enough.
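
The idiom, as a minimal sketch. The variable names are illustrative; the sentinel is the `.' written by the trailing echo.

```shell
stripped=$(printf 'text\n\n')           # both trailing newlines are removed
kept=$(printf 'text\n\n'; echo .)       # the "." shields them from removal
kept=${kept%.}                          # drop the sentinel character again

printf '%s %s\n' "${#stripped}" "${#kept}"
```

The lengths come out as 4 and 6: the second value still carries both newlines.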

> 
>>>> There's genuine disagreement between shells here. The ash-based shells
>>>> (dash, the BSD sh, gwsh) preserve the backslash. Bash through bash-5.2,
>>>> yash, mksh, ksh93 all remove it.
>>>
>>> See "pining for standards", above.
>>
>> File an interpretation request. I'm going to do what I think makes sense.
>> Be prepared to have it rejected, though.
> 
> I'm treating bash as my standard here, not posix.

Sure, then you get to be pissed off when I decide it's a bug and change it.

If you're pining for a standard, file an interpretation request against
something that claims to be one. Or stick a stake in the ground and say
bash-x.y is your standard. But "bash" is going to be a moving target.

> 
> I would like there to BE a standard, but do not believe Posix can ever become it

Maybe not, but there's no other alternative. There's not going to be a lot
of momentum among other shell authors to "do it like bash does." For a
long time, the opposite was true, especially among the austin-group
members.


> I've engaged with the posix guys when there was obvious missing info, ala:
> 
> https://landley.net/notes-2014.html#02-12-2014
> 
> But one of my big pet peeves about them is they haven't got an actual stable web
> archive so discussions that happen on the mailing list are NOT A RELIABLE RECORD
> of anything.

They never have been. Mantis is the only official record.

I don't know whether or not the austingroup-bugs list has a reliable
archive. I keep my own of things I'm concerned about. But I've been at it
longer than you have.

>>> (The downside to using bash as a standard is when I ask you about corner cases,
>>> half the time you fix things. Not a downside for YOU, but I'm left with a moving
>>> target. https://threeplusone.com/quotes/pratchett/ .)
>>
>> We talked about this before. Pick a fixed target (e.g., bash-5.1) and write
>> to that, then move forward if you like.
> 
> I did, it was bash-2.05b and I had to move forward to run "emerge".

What are you using now?


> P.S. I'd really hoped I could get a reasonable shell in 3500 lines. The version
> I checked in today is 4762 lines. Not exactly dunning-kruger, but there's always
> a certain amount of learning on the job. If I knew what I was doing I'd be done.

It took Bourne over 5000, but with a considerably less functional C library.

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet at case.edu    http://tiswww.cwru.edu/~chet/


