[Toybox] [PATCH] timeout: fix exit status for sneaky subprocesses.

Mon Sep 16 01:34:56 PDT 2019

On 9/15/19 6:54 PM, enh wrote:
> On Fri, Sep 13, 2019 at 7:59 PM Rob Landley <rob at landley.net> wrote:

I should just randomly cut a release with whatever I've got in the tree. This
one ain't gonna quiesce, is it?

Ugh, where did I leave off on this... I found a small simple change to the code,
which I made and was happy with... and half the tests fail now.

Yeah, sounds about right. I decided to wait for my cold to go away before
circling back to it, which it mostly has now except for the part about it being
3am...

>> This looks like it will always return 124 even when it did SIGKILL because you
>> then overwrite the result?
> 
> yeah, they even mention the SIGKILL special case to the special case
> to the special case in the man page:
> 
>        If the command times out, and --preserve-status is not set,  then  exit
>        with  status  124.   Otherwise, exit with the status of COMMAND.  If no
>        signal is specified, send the TERM signal upon timeout.  The TERM  sig‐
>        nal kills any process that does not block or catch that signal.  It may
>        be necessary to use the KILL (9) signal, since this  signal  cannot  be
>        caught, in which case the exit status is 128+9 rather than 124.

Right, if we sent the kill signal exit with 128+KILL. Except it doesn't do it
automatically, you have to specify -k.

If a command catches term and then exits with a normal exit code, we still exit
with status "we sent kill".

>> There's a race condition where the command can exit for non-signal reasons
>> before the signal is delivered (especially on SMP).
> 
> if you're that close to the timeout, i think calling it a timeout is
> arguably _more_ fair than letting you get away with it :-)

I want deterministic behavior in nondeterministic situations. I'm aware I can't
get it, but I'm still annoyed about it.

However, the determinism here is "did timeout send a signal or not". If it did,
we override its result with the signal we sent unless --preserve says not to. If
we didn't, we say what it exited with. That's determinish. (It's at least
atomic, because we're the ones calling wait4 and we know what we did last syscall.)

> huh. other than timeout, it looks like we don't have tests for any of
> these, so (given TEST_HOST=1 passes) my money's on timeout atm.
> 
> i've added "write missing tests for exit statuses and fix the bugs" to
> my to-do list...

I'm serious I could spend literally months on the test suite. But nobody wants
to fund that.

>> I have a cold, none of this is making sense right now. Right, tests:
>>
>>   $ timeout -s 10 1 cat
>>   $ echo $?
>>   124
>>   $ timeout -s 10 --preserve-status 1 cat
>>   $ echo $?
>>   138
>>   $ timeout -s 10 -k 2 --preserve-status 1 dd
>>   0+0 records in
>>   0+0 records out
>>   0 bytes copied, 0.99923 s, 0.0 kB/s
>>   Killed
>>   $ echo $?
>>   137
>>   $ timeout -s 10 -k 2 1 dd
>>   0+0 records in
>>   0+0 records out
>>   0 bytes copied, 0.999119 s, 0.0 kB/s
>>   Killed
>>   $ echo $?
>>   137
>>   $ (sleep 2; killall -SIGBUS dd) & timeout -s 10 1 dd
>>   [1] 17701
>>   0+0 records in
>>   0+0 records out
>>   0 bytes copied, 0.999222 s, 0.0 kB/s
>>   $ echo $?
>>   124
>>   [1]+  Done                    ( sleep 2; killall -SIGBUS dd )
>>   $ (sleep 2; killall -SIGBUS dd) & timeout -s 10 --preserve-status 1 dd
>>   [1] 17780
>>   0+0 records in
>>   0+0 records out
>>   0 bytes copied, 0.998975 s, 0.0 kB/s
>>   $ echo $?
>>   135
>>   [1]+  Done                    ( sleep 2; killall -SIGBUS dd )
>>   $ (sleep 2) | timeout -s 10 1 dd
>>   0+0 records in
>>   0+0 records out
>>   0 bytes copied, 0.999297 s, 0.0 kB/s
>>   0+0 records in
>>   0+0 records out
>>   0 bytes copied, 1.999 s, 0.0 kB/s
>>   $ echo $?
>>   124
>>   $ (sleep 2) | timeout -s 10 --preserve-status 1 dd
>>   0+0 records in
>>   0+0 records out
>>   0 bytes copied, 0.999241 s, 0.0 kB/s
>>   0+0 records in
>>   0+0 records out
>>   0 bytes copied, 1.99905 s, 0.0 kB/s
>>   $ echo $?
>>   0

I should make sure the test suite has versions of all that, by the way...

>> SIGBUS is 7, 128+7 = 135 so that's right. It looks like --preserve-status
>> unconditionally overrides our thing, so we return the signal we SENT in the
>> absence of --preserve-status.
>>
>> So, if they catch the signal but then die for other reasons, the exit code
>> indicates the signal that was sent unless we --preserve-status...
> 
> yeah, i got that wrong until i added the tests.

The tests fail to pass TEST_HOST on devuan becauase loop.sh calls /bin/sh which
is the Defective Annoying SHell which says "trap: SIGTERM: bad trap" because
nobody should ever use dash for anything under any circumstances. Changing it to
/bin/bash makes it work, but I dunno if that screws up your test environment.

Working on toysh, it's got a ways to go yet...

>> Rob

Rob