[Toybox] weird timeout flake

Rob Landley rob at landley.net
Wed Oct 4 00:05:32 PDT 2023


On 10/3/23 20:18, enh wrote:
>> heh, no "cheetah" is the lunch target. pixel pro 7 was the marketing
>> name. "recent arm64 device running a hwasan build" was what i was
>> trying to say.

Bionic arm64 current android system with hwsan, reproduces in 2% of runs.

>> it's been running just that test under strace for a couple of hours
>> now without finding anything, and the host hasn't found anything
>> either. ccross suggested changing the test so it _doesn't_ throw away
>> stderr and see if that shows us anything, so i'll give up on the
>> strace runs now and switch over to that instead. (i'm going to give up
>> on the host too, since it's run orders of magnitude more cycles and
>> isn't obviously achieving anything beyond making the room warm.)
> 
> well, that's unfortunate:

New information tracking down a bug is never unfortunate. We're bisecting
reality. The bug doesn't move, we just get to slice off bits of the problem
space as "that wasn't it".

Learning something we didn't know is EXACTLY what we want.

> -- timeout
> FAIL: timeout can't execute
> echo -ne '' | timeout .1 / 2>&1 ; echo $?
> --- expected 2023-10-04 00:19:17.192008926 +0000
> +++ actual 2023-10-04 00:19:17.304008926 +0000
> @@ -1,2 +1,2 @@
>  timeout: exec /: Permission denied
> -126
> +127
> 
> wtf?

Permission denied is the string for EACCES which means the syscall from the
kernel is returning the expected errno. So something ELSE is providing 127,
let's grep toybox lib for 127 and... here we go:

// Wait for child process to exit, then return adjusted exit code.
int xwaitpid(pid_t pid)
{
  int status = 127<<8;

  while (-1 == waitpid(pid, &status, 0) && errno == EINTR) errno = 0;

  return WIFEXITED(status) ? WEXITSTATUS(status) : WTERMSIG(status)+128;
}

And musl has wifexited as:

#define WTERMSIG(s) ((s) & 0x7f)
#define WIFEXITED(s) (!WTERMSIG(s))

Nothing <<8 is going to have 0x7f so and WEXITSTATUS() is basically just >>8
again so if waitpid didn't set status it returns 127.

So _that_ can return 127 if waitpid() returns -1 with an errno other than EINTR,
such as ESRCH because the process already exited. And how's it gonna do that?
Because timeout.c is going xsignal_flags(SIGCHLD, handler, SA_NOCLDSTOP); which
happened in commit ee00e1be0f5b "Fix race condition with SIGCHLD being delivered
before timeout was ready."

So it looks like that MOVED the bug rather than fixing it, and it's possible I
need to figure out how to fix it in the morning because I'm running out of brain
for the evening.

Problem: child exits before parent is ready for it. (Which I never see testing
the vfork() path. :)

Supposed fix: tell it not to send a signal, but still let the parent wait() on
it and get return info.

Observed behavior: if the parent isn't already in wait() the child acts like the
parent's doing SIG_IGNORE on SIGCHLD and zombie-kills itself, and the wait()
gets 127 through a "no such PID" error path.

Actual fix: TBD, likely to involve hitting it with a rock repeatedly until it's
the right shape.

Rob


More information about the Toybox mailing list