[Toybox] Confused by bash trap handler return value.

Wed Feb 12 14:37:16 PST 2025

On 2/12/25 15:24, Chet Ramey wrote:
>>> Testing that trap actions preserve $? is not hard. This had better not
>>> echo 1.
>>>
>>> trap 'echo WINCH ; false' WINCH
>>> (exit 41)       # force $? to something neither 0 nor 1
>>> kill -WINCH $$
>>
>> At which point $? is now the return code of the kill command, 
>> overwriting the 41.
> 
> It will be 0, unless this script was invoked with SIGWINCH ignored, in
> which case all bets are off because the trap action never gets run.

All bets aren't off, kill is a command, commands exit with a return 
code, kill is exiting with zero. If the trap handler doesn't run the 
return code is zero.

>>> # add this if you're concerned that kill will finish before the SIGWINCH
>>> # arrives, since it will cause multiple system calls
>>> (exit 42)
>>
>> If the trap handler was processed right after kill but before (exit) 
>> this overwrites the $? from kill _or_ leaked by the signal handler.
> 
> That's what the comment means. Comment the subshell command out. Seriously,
> trap handlers never set $? that survives beyond their execution. You should
> be able to verify that with static code analysis.

I know what it _should_ do, the question is how do I confirm that a 
given implementation _did_ do it and didn't have a regression?

It's a feature. It should have a test in the regression test suite. How 
do I make a test that shows what I think it's showing?

>>> echo $?
>>
>> So seeing "42" here doesn't prove the signal handler preserved the 0 
>> from kill.
> 
> The goal is to force the handler to run if it hasn't due to delayed signal
> delivery. However you do that is ok.

If the handler DID change the return code, and you run something that 
overwrites the return code the handler set, you don't get to see that 
the handler changed the return code.

How is the testing what it says it's testing? If the behavior changed so 
the handler DID set the return code, the test's output wouldn't change.

> Probably (exit $?) would be better.

That's a NOP? Ok, to force the handler to run, sure...

> I just don't think you need it, and you don't need the subshell command.

I don't think you need it either.

>>> The trap won't get run until after the kill returns; trap actions are
>>> not run asynchronously.
>>
>> But the kill returns before the (exit 42) runs.
> 
> And as long as the signal is delivered by then, it should be fine.

So... a kernel other than Linux might delay signal delivery arbitrarily 
long? Either we have a process signaling itself, or we a process 
signaling a waiting parent that will be unblocked by receiving a SIGCHLD 
sent from the same child process _after_ it sends the signal from kill. 
(Even if they were delivered out of order, which Linux won't do, the 
first signal has to be queued for delivery before the second so can't 
NOT be pending when the parent unblocks.)

> I
> can't see any reason that it wouldn't be, since the kernel is posting
> the signal to the same process and bash is single-threaded. This is the
> theoeretical race condition I referred to.

I've read far too much kernel code over the years, but a lot of it was 
2.2 or 2.4 or 2.6 and I sometimes get tripped up on "they changed this 
over the years". But they're unlikely to loosen the semantics without me 
hearing gripes and Linus chewing out a dev for breaking stuff.

> In normal operation, the kill builtin calls kill(2), sends the signal,
> the signal arrives, the trap signal handler notes the pending trap,
> the kill builtin returns, the shell sets $?, and the trap action runs.

Same at my end, yes. Did the trap overwrite the return code from kill.

I first thought the trap handler was always _zeroing_ the rc rather than 
leaving it unchanged, and if the return code from kill is zero I can't 
confirm it's not doing that. It shouldn't, and "this should never happen 
so you don't need to test it" may apply, but I tend to add any test to 
the test suite I needed to run to understand the behavior of what I was 
implementing, and "type out a command, go to another terminal, send kill 
signal from that other terminal, hit enter in the first terminal, echo 
$? in first terminal" is hard to replicate in a shell script.

I mean, I could cheat and do something like:

(printf "command\ncommand\ncommand\n'; sleep 1; kill thingy&; sleep 1; 
printf "command\ncommand\ncommand\n") | bash -

But... ew?

I could also say "just because I was confused about how it worked and 
had to run a command line test MYSELF to understand it, and then I 
implemented code specifically to make that test pass, doesn't mean I 
need to add that test to the test suite". Except... those are exactly 
the tests I try to make sure are in the regression test suite.

>> I implemented a check for signals between each command, meaning my 
>> code checks for pending signals before launching the subshell (and 
>> inserts a synthetic "eval" on the sh_function call stack for each 
>> one). Are you saying it should NOT do that?
> 
> No.
> 
>>
>> (Also, I left the signal handler blocked and have the return from the 
>> "trap" unblock it, so it defers handling a second instance of the same 
>> signal until the trap handler for the first signal returns. You can 
>> have different signals interrupt each other, but only uniquely. 
> 
> I allow recursive trap invocations, with a warning for non-release
> versions of the shell. I got bug reports (or feature requests) when I
> did it your way (sorry, I forget the exact details).

I allow _other_ traps, just not the _same_ trap to recurse. Because 
that's crazy. (That said they do get queued up and should fire after the 
first trap handler returns. They're masked, not ignored.)

How all this interacts with SA_RESTART is a can of worms I have not 
looked deeply into yet. (At least "read" being interrupted or not is 
easier to test...)

> Some scripts change the trap action in the trap handler, resend the signal
> to themselves, and expect it to have effect immediately, even in a
> complicated shell function run in a trap action.

I already special case the trap handler getting reset while the 
appropriate trap is running (to free the trap string), I could have 
resetting the trap unmask the appropriate signal. (It's not the SAME 
handler running again, so the potential starvation issue is the same as 
a recursive function call. Pilot error.)

Thanks for the heads up. (Now I need to test THAT...)

>>> I'm going to change it for POSIX mode, since that's compatible with what
>>> other POSIX shells have implemented. Bash default mode will stay the 
>>> same.
>>
>> I plead the 5th on moving targets. 
> 
> We've talked about this before.

I didn't say it. I complained about _not_ saying it, but I didn't 
actually say it.

>> (Doesn't help me here, I'd have to do a bash version check before 
>> adding -p.
> 
> To what?

I meant that having my test do "$SH -p" on the trap test that calls a 
potentially builtin kill still gives me the return 0 behavior from kill 
on the bash currently installed in devuan detinue. So you changing the 
behavior in future doesn't mean my test can rely on it until 
https://landley.net/toybox/faq.html#support_horizon expires.

> I think I'll stick with running
>> $(which kill).)
> 
> You should probably run type instead of which.

The android guys currently run my test suite using mksh, and I'm never 
entirely sure what that supports.

I'm trying to get toysh to the point where it can run my test suite (and 
I can thus ask them to switch over), but it's got a ways to go yet.

(And run my "mkroot" system builder, which is implemented in ~500 lines 
of bash by the way: 
https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh produces 
https://landley.net/bin/mkroot/latest/ . Ideally the tests should run in 
there, so they can run as root under qemu with a known kernel configured 
in a known way so I can properly test insmod and ps and so on. It's 
already the command shell within those tiny systems though.)

Thanks,

Rob