[Toybox] And again.

Tue Sep 1 23:16:59 PDT 2020

On 9/1/20 9:19 AM, Chet Ramey wrote:
> On 8/28/20 2:28 AM, Rob Landley wrote:
>> I'm trying hard not to bother you anymore, but I think the bash man page is
>> wrong. It says:
>>
>>        case word in [ [(] pattern [ | pattern ] ... ) list ;; ] ... esac
>>               A case command first expands word, and tries to match it against
>>               each pattern in turn, using the same matching rules as for path‐
>>               name  expansion  (see  Pathname  Expansion  below).  The word is
>>               expanded using tilde expansion, parameter  and  variable  expan‐
>>               sion,  arithmetic  expansion, command substitution, process sub‐
>>               stitution and quote removal.  Each pattern examined is  expanded
>>               using  tilde expansion, parameter and variable expansion, arith‐
>>               metic expansion, command substitution, and process substitution.
>>
>> And I have questions:
>>
>> 1) Bash DOES remove quotes from the pattern, it has to because splitting is
>> disabled so spaces and $IFS can get inserted:
> 
> It doesn't perform quote removal, and Posix says it should not.

Define "quote removal"?

  $ A="a b"; case $A in "a b") echo hello "$A"; esac
  hello a b
  $ A="a b"; case $A in a b) echo hello "$A"; esac
  bash: syntax error near unexpected token `b'
  $ A="a b"; case "$A" in "a b") echo hello; esac
  hello

The quotes between the in and the ) change the behavior in a way that seems an
awful lot LIKE quotes are being parsed and thus removed?

I was reading quote removal as "you can use quotes here, and they will be
understood rather than treated as literals". That seems to be the case...?

> What it
> does do is make sure that the quote characters arrange to quote parts of
> the pattern appropriately so that special matching characters match
> themselves. The shell has to remember which parts of the pattern were
> quoted,

It has to remember which parts were active (I.E. unquoted/unescaped) and which
parts weren't active when it hit each wildcard, yes. Which is largely the same
test condition as IFS splitting being able to occur there, so IFS active and
wildcard active can share most of their logic.

I wrote a collect_wildcards() function that assembles a deck of active wildcard
locations for the wildcard expansion pass to replace later:

  https://github.com/landley/toybox/blob/fcba64ecad07/toys/pending/sh.c#L848

Which is called for unquoted characters added to the output:

  https://github.com/landley/toybox/blob/fcba64ecad07/toys/pending/sh.c#L958
  (And also lines 1236 and 1249.)

That part seems to be working already, although I need to finish the consumer
side to properly test it.
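To illustrate the idea (this is NOT the collect_wildcards() code linked above,
just a sketch in shell with a hypothetical name and deliberately simplified
quoting rules), here is a scanner that walks a word one character at a time,
tracks whether it is inside single quotes, double quotes, or after a
backslash, and records the offsets of the wildcards that are still active:

```sh
# active_wildcards WORD: print 0-based offsets of unquoted * ? [ in WORD.
# Assumption: a backslash always protects the next character, and $ or `
# expansions are ignored entirely.
active_wildcards() {
    aw_rest=$1 aw_pos=0 aw_quote= aw_out=
    while [ -n "$aw_rest" ]; do
        aw_tail=${aw_rest#?}          # everything after the first char
        aw_c=${aw_rest%"$aw_tail"}    # the first char itself
        aw_rest=$aw_tail
        if [ "$aw_quote" = "'" ]; then
            # Inside single quotes only a closing ' means anything.
            if [ "$aw_c" = "'" ]; then aw_quote=; fi
        elif [ "$aw_c" = '\' ]; then
            # Skip the escaped character: it is never active.
            aw_rest=${aw_rest#?}
            aw_pos=$((aw_pos + 1))
        elif [ "$aw_c" = "$aw_quote" ]; then
            aw_quote=                 # closing double quote
        elif [ -z "$aw_quote" ]; then
            case $aw_c in
                "'"|'"') aw_quote=$aw_c ;;
                '*'|'?'|'[') aw_out="$aw_out $aw_pos" ;;
            esac
        fi
        aw_pos=$((aw_pos + 1))
    done
    echo "${aw_out# }"
}

active_wildcards 'a*b"*"c\?d['   # prints "1 10"
```

Only the first * and the trailing [ are reported. The quoted * and the escaped
? are skipped, which is the same test that decides whether IFS splitting could
happen at that position.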

> and make sure that those quoted characters get passed to the
> matcher (which may or may not be fnmatch()) in whatever way the matcher
> requires. That usually means prefixing them with a backslash, but then
> you get into what happens with quoted characters inside bracket
> expressions. The word expansions still happen how they're supposed to.

Since I can't expect libc to understand +() I'm writing my own glob() function
which consumes the string and the deck as input. (Which also needs partial match
support to make a/b*/c*/d work as it traverses down into only specific
subdirectories...)

> There was a ferocious argument about this a couple of years ago, and there
> are still arguments about how to specify quoting in shell pattern matching.

This is domain expertise I'm missing. I never even bothered to use case/select
in my shell scripts before this because an if/else staircase works about as
easily...

> If you were to perform quote removal on the patterns, you'd need something
> like
> 
>   case "$x" in \\*) echo 'literal asterisk' ;; esac
>
> to match an asterisk.

Except that logically says that \\ is a literal backslash, and then * isn't
escaped and is thus active? (That seems to be performing quote removal _twice_?)
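For reference, a small sketch of what existing POSIX shells do with the two
spellings today (which_arm is a made-up name). A single backslash is enough to
make the asterisk literal, and a doubled backslash reads as a literal backslash
followed by an active wildcard:

```sh
# which_arm WORD: report which case arm WORD lands in.
which_arm() {
    case $1 in
        \*)  echo 'literal asterisk' ;;
        \\*) echo 'backslash then anything' ;;
        *)   echo 'something else' ;;
    esac
}

which_arm '*'      # prints "literal asterisk"
which_arm '\foo'   # prints "backslash then anything"
which_arm 'foo'    # prints "something else"
```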

>> 2) process substitution? Really? Under what circumstances does:
>>
>>   case <(potato) in $PATTERN) echo hello;; esac
>>
>> trigger usefully? 
> 
> If people want to do dumb shit, people are going to do dumb shit. One
> could use this to determine whether bash uses /dev/fd or named pipes for
> process substitution, but you shouldn't really have to care.

"dumb shit", "you shouldn't really have to care".

I'll wait for somebody to complain.

>> My code is treating <() as a form of redirection, so it's handled by
>> expand_redir() rather than expand_arg_nobrace(), and moving it is problematic
>> because only one context has the filehandle tracking (I.E. recording what to
>> close again afterwards).
> 
> That's probably going to come back and bite you, since process substitution
> is a word expansion.

It might, but if you glue anything to the beginning or the end it's not a valid
filename anymore? (Modulo reaching into somebody else's chroot except we
dynamically allocated this file out of _our_ host /dev?)

Other than this, the only other example I can think of is telling the kernel:

  KCONFIG_ALLCONFIG=<(cat file) make allnoconfig

Which has to be in the same process, because:

  $ X=<(echo hello); cat $X
  cat: /dev/fd/63: No such file or directory

Which implies maybe I should special case it in variable assignment? (But is
currently in my "wait for a user to show up with a real world use case" bucket?)

>> 3) $(case a in a); echo hello; esac) is not something my parsing can handle (it
>> just counts quotes and parentheses, it's not parsing flow control statements
>> inside what is essentially a nested quoting context), and I would have to rip
>> out SO much stuff to make it do so I think I'm ok with not supporting that for
>> the moment. And technically you can do $(case a in (a); echo hello; esac) and
>> that does work.
> 
> This will certainly come back and bite you.

Yeah, I have a TODO for it. I know how to fix it, it's just... really annoying.
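A sketch of why pure parenthesis counting trips here. The helper name
naive_body is hypothetical and it ignores quotes entirely, so it is much
cruder than the real parser. It returns what a counter would treat as the body
of a $() given the text that follows the opening "$(":

```sh
# naive_body TEXT: return TEXT up to the ")" that brings the depth to zero.
naive_body() {
    nb_rest=$1 nb_depth=1 nb_out=
    while [ -n "$nb_rest" ]; do
        nb_tail=${nb_rest#?}
        nb_c=${nb_rest%"$nb_tail"}    # first character of what is left
        nb_rest=$nb_tail
        case $nb_c in
            '(') nb_depth=$((nb_depth + 1)) ;;
            ')') nb_depth=$((nb_depth - 1))
                 [ "$nb_depth" -eq 0 ] && break ;;
        esac
        nb_out=$nb_out$nb_c
    done
    printf '%s\n' "$nb_out"
}

naive_body 'case a in a) echo hello;; esac)'
# prints "case a in a"

naive_body 'case a in (a) echo hello;; esac)'
# prints "case a in (a) echo hello;; esac"
```

The unbalanced ")" that ends the case pattern closes the substitution early,
while the "(a)" spelling keeps the count balanced, which is why that form
already works.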

> Bash used to do it this way, but Posix says that any arbitrary shell script
> can appear in $() command substitution, so I got a bunch of bug reports.
>
> I ended up having to write special-case code for this, because bison/yacc
> can't easily handle calling the parser recursively.

Mine can, I just have to shuffle stuff around. The result shouldn't even be
significantly bigger, it's really an object lifetime thing. 90% likely what I'll
do is parse it recursively for tokenizing and then discard the results and keep
it as a string like I am now once I've figured out how long that string is.
(Right now flow control statements are _statements_, they don't branch into a
new tree in the middle of an argument to a statement. The $() contents get
passed to the "run subshell" logic and the output read into a string, and that
takes a string to run about the same way "exec" does.)

One design issue I have is that because I'm supporting nommu, I have to marshal
data to exec()ed child processes to implement things like subshells and
backgrounding. Rather than try to pass data structures across process
contexts, I'm just serializing everything back into strings and sending them
through a pipe on a high filehandle. A magic environment variable named
"@%d,%d" (with the pid and ppid of the child) and no = after it tells the
child, if it exists, that it needs to receive the marshalled data on fd 254,
which must fstat() S_ISFIFO(). (Trying to make the attack surface at least
mildly annoying.) Anyway, this means I need to care about serializing stuff
BACK to strings a bit more than average. :P
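As a rough illustration of the "serialize back to strings" half only (this is
not the toybox wire format, and quote_word is a hypothetical helper), here is
one common way to turn an already-expanded word back into text that a shell
will re-parse to the same value, by single-quoting it and escaping any
embedded single quotes:

```sh
# quote_word WORD: print WORD wrapped in single quotes, with each embedded '
# rewritten as '\'' so the result re-parses to exactly WORD.
quote_word() {
    qw_rest=$1 qw_out=
    while [ -n "$qw_rest" ]; do
        case $qw_rest in
            *\'*)
                qw_head=${qw_rest%%\'*}   # text before the first '
                qw_rest=${qw_rest#*\'}    # text after the first '
                qw_out=$qw_out$qw_head"'\\''"
                ;;
            *)
                qw_out=$qw_out$qw_rest
                qw_rest=
                ;;
        esac
    done
    printf "'%s'" "$qw_out"
}

# Round trip: serialize two words into one line, then parse them back.
line="$(quote_word "it's a b") $(quote_word '*')"
eval "set -- $line"
printf '%s\n' "$#" "$1" "$2"   # prints 2, then it's a b, then *
```

The spaces and the * survive the round trip without being split or globbed,
which is the property a child needs when it re-reads marshalled strings.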

*shrug* Different design...

Rob