[Toybox] weird timeout flake

Rob Landley rob at landley.net
Wed Oct 4 00:15:58 PDT 2023


On 10/3/23 23:43, Colin Cross wrote:
> I caught it running in strace.  There's some differences in signal
> setup in the forked child, the child exits with 126 in both the good
> and bad cases, but the parent exits with 127 after wait4 returns
> ECHILD in the bad case.

Beat me to it. (Serves me right for being AFK for a few hours...)

> The bad case, child exits with 126, parent exits with 127:
> rt_sigaction(SIGCHLD, {sa_handler=0x5fda0c26e4, sa_mask=[],
> sa_flags=SA_NOCLDSTOP}, NULL, 8) = 0

I thought the child would remain a zombie if the parent did this. Instead it
exits the rest of the way.

> clone(child_stack=NULL,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x7eadf59190) = 9378
> strace: Process 9378 attached
> [pid  9378] prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, 0x7eacc52000,
> 16384, "stack_and_tls:main") = 0
> [pid  9378] rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[],
> sa_flags=0}, NULL, 8) = 0
> [pid  9378] setpgid(0, 0)               = 0
> [pid  9378] execve("/", ["/"], 0x7feb7bad98 /* 26 vars */) = -1 EACCES
> (Permission denied)
> [pid  9378] write(2, "timeout: ", 9timeout: )    = 9
> [pid  9378] write(2, "exec /", 6exec /)       = 6
> [pid  9378] write(2, ": Permission denied", 19: Permission denied) = 19
> [pid  9378] write(2, "\n", 1
> )           = 1
> [pid  9378] exit_group(126)             = ?
> [pid  9378] +++ exited with 126 +++
> --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=9378,
> si_uid=2000, si_status=126, si_utime=0, si_stime=0} ---

Sigchild was delivered, but SA_NOCLDSTOP is apparently equivalent to SIG_IGNORE?

> rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, NULL, 8) = 0
> wait4(0, 0xe000007feb7baae0, 0, NULL)   = -1 ECHILD (No child processes)

And then we re-enable SIGCHLD and call wait() but the child exited, and thus it
went boing.

I need to resequence the signal handling. I remember it got tangled up for -i
support. Did I blog about this? Commit ee00e1be0f5b was July 20, 2022... Yup:

<p>The reason the sed s///x test hangs with "make tests" but not "make test_sed"
is the toybox timeout command is used for the first, and the debian timeout
is used out of the $PATH for the second. Which is why I didn't notice it
when I checked in the timeout -i support, because I was testing individual
commands and "make tests" tests the toybox commands in combination, so finds
more weird corner cases.</p>

<p>The problem here is a race condition: if SIGCHLD from the child comes in
before timeout is ready for it, we miss it and then the full timeout elapses
before the command moves on.</p>

<p>I set up the SIGCHLD handler AFTER the call to xpopen() because I didn't
want the child process to inherit weird signal handler shenanigans. I'm a bit
gun-shy there after the <a href=notes-2011.html#24-08-2011>long</a>
<a href=notes-2011.html#02-09-2011>ago</a>
<a href=notes-2011.html#03-09-2011>multi</a>
<a href=notes-2011.html#04-09-2011>day</a>
<a href=notes-2011.html#05-09-2011>saga</a> of trying to debug a blocked
SIGALRM inherited from PID 1 due to an old bash bug, although that was
the child inheriting the signal _mask_ not the signal handlers, so I
don't ACTUALLY have to use the xpopen_setup() child callback to reset
SIGCHLD (since the child process can't have any children of its own before
calling exec). And yes of course I have to set the signal handler before
forking the child to avoid a race window with the child exiting. (Even
without SMP, it could get scheduled first.)</p>

<p>Darn it, there's still a race condition after that's moved, because the
SIGCHLD can come in before we record the fd we're reading from into the poll
structure, so the signal handler can't close the fd, and there's no other way
to tell poll() to exit early EXCEPT closing the fd? Trying NOT to pass
the fd into poll always has at least the race of making the poll() function
call (copying the fd into the function's local variable as an argument value)
and THEN having the signal happen before the actual poll system call starts.</p>

<p>I cut the gordian knot with siglongjmp() from the SIGCHLD handler
so the loop _doesn't_ exit and just goes "receiving SIGCHLD means timeout
can exit" because we only spawn one child process; if that has grandkids
those would reparent to PID 1 instead of us so getting the signal means
we can exit now. BUT while this is good enough for a standalone timeout,
it's not good enough for a bash builtin that doesn't fork: xpopen_both()
can create a pipe but not return it yet before the signal handler
goes, leaking filehandles. I may have to manually pipe() and VFORK()
myself instead of calling letting xpopen() do it, because pipe(TT.pipes)
is atomic: either the pipes got written where the signal handler can see
them, or they weren't created yet. The kernel either backs out any halfway
states before returning to userspace, or else doesn't check for signals until
the system call is about to return to userspace anyway.</p>

<p>Anyway, that should fix up the "timeout 10 does_not_exit" problem
Elliott spotted, and also the test suite slowdown I noticed.</p>

Rob


More information about the Toybox mailing list