[Toybox] weird timeout flake

enh enh at google.com
Tue Oct 3 14:07:38 PDT 2023


On Tue, Oct 3, 2023 at 1:44 PM Rob Landley <rob at landley.net> wrote:
>
> On 10/3/23 13:38, enh wrote:
> >> Trying that by hand on devuan (using coreutils' timeout):
> >>
> >>   $ timeout .1 /
> >>   timeout: failed to run command ‘/’: Permission denied
> >>   $ echo $?
> >>   126
> >>
> >> From the bash man page:
> >>
> >>   If a command is not found, the child process created to execute it  re‐
> >>   turns  a  status  of 127.  If a command is found but is not executable,
> >>   the return status is 126.
> >>
> >> I'm not sure how you can "file not found" the root directory? (Selinux?
> >> Filehandle exhaustion? Even chmod 000 should return EACCES not ENOENT.)
> >>
> >> The relevant code is xwrap.c line 233:
> >>
> >>   execvp(argv[0], argv);
> >>
> >>   toys.exitval = 126+(errno == ENOENT);
> >
> > +Colin Cross who just saw this too.
>
> So why is execvp("/", {"/", 0}); returning ENOENT? It's saying it cannot _find_
> the root directory, not that it can't execute a directory. Hmmm...
>
> The execvp man page says:
>
>        If  the specified filename includes a slash character, then PATH is ig‐
>        nored, and the file at the specified pathname is executed.
>
> Which seems like it would moot:
>
>        If permission is denied for a file (the attempted execve(2) failed with
>        the  error EACCES), these functions will continue searching the rest of
>        the search path.  If no other file is found, however, they will  return
>        with errno set to EACCES.
>
> Which would still be returning something other than ENOENT anyway.
>
> Hmmm...
>
>        If  the  header  of  a  file  isn't recognized (the attempted execve(2)
>        failed with the error ENOEXEC), these functions will execute the  shell
>        (/bin/sh)  with  the  path of the file as its first argument.  (If this
>        attempt fails, no further searching is done.)
>
> I don't THINK that's a likely fallback path here? Although /bin/sh not found
> might explain it. But that would be deterministically reproducible and you're
> having an intermittent issue, right?

correct. ccross tells me it's ~2% of all runs in CI.

locally, i commented out all the other tests, and just ran this
repeatedly on a device, and it did repro after about half an hour.
i've kicked off a similar test on the host, and i've kicked off the
device again but with strace in the mix (which hopefully doesn't slow
things down enough to make the problem disappear!).
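
A minimal sketch of that kind of repeat-until-it-flakes loop, not the
actual test that was run. It forks, tries to exec "/", and maps the
failure to an exit status with the same 126+(errno == ENOENT)
expression as the xwrap.c line quoted above. The names exec_root_once,
count_unexpected, and EXEC_ROOT_MAIN are hypothetical, and the default
run count is arbitrary.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, try to exec "/", and return the child's exit
 * status (126 or 127 by the xwrap.c convention quoted in the thread).
 * Returns -1 if fork/wait fails or the child did not exit normally. */
int exec_root_once(void)
{
  pid_t pid = fork();
  int status;

  if (pid < 0) return -1;
  if (!pid) {
    char *argv[] = {"/", 0};

    execvp(argv[0], argv);
    /* Only reached when the exec failed. */
    _exit(126 + (errno == ENOENT));
  }
  if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;

  return WEXITSTATUS(status);
}

/* Hypothetical helper: repeat the exec `runs` times, log every status that
 * is not 126, and return how many such runs there were. */
int count_unexpected(int runs)
{
  int i, bad = 0;

  for (i = 0; i < runs; i++) {
    int rc = exec_root_once();

    if (rc != 126) {
      bad++;
      fprintf(stderr, "run %d: status %d\n", i, rc);
    }
  }

  return bad;
}

#ifdef EXEC_ROOT_MAIN
/* Standalone entry point, built only with -DEXEC_ROOT_MAIN. */
int main(int argc, char *argv[])
{
  int runs = argc > 1 ? atoi(argv[1]) : 100000;

  printf("%d of %d runs were not 126\n", count_unexpected(runs), runs);

  return 0;
}
#endif
```

Built with -DEXEC_ROOT_MAIN this runs standalone, and running it under
strace -f would show the errno from each execve() next to any
unexpected status.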

> $ cat > potato.c << EOF
> #include <unistd.h>
> int main(int argc, char *argv[]) { execvp("/", (char *[]){"/", 0}); }
> EOF
> $ gcc potato.c
> $ strace ./a.out
> execve("/", ["/"], 0x7ffebd0880d8 /* 36 vars */) = -1 EACCES (Permission denied)
> $ sudo strace ./a.out
> execve("/", ["/"], 0x7ffc32654e08 /* 16 vars */) = -1 EACCES (Permission denied)
> $ ls -ld /
> drwxr-xr-x 24 root root 4096 Feb  9  2023 /
>
> In general root doesn't care about permission bits, and there's no /bin/sh
> follow-up to the syscall failure here. Tried again with bionic and there were a
> couple extra mprotect() calls on the way out but still no /bin/sh fallback...
>
> So the question here is does the kernel have a weird intermittent codepath, or
> does bionic+selinux have a weird intermittent codepath?

(yeah, that's why i'm trying on the host now.)

> Let's see: in the vanilla kernel source fs/exec.c has SYSCALL_DEFINE3(execve)
> which does return do_execve(getname(filename), argv, envp); which wraps
> do_execveat_common() on line 1888 of the same file.
>
> A quick cheat grepping for EACCES shows two uses in this file, one in
> SYSCALL_DEFINE1(uselib) which I just BOGGLE at because how are shared libraries
> THE KERNEL'S PROBLEM... but I really doubt we get there here. No, the NORMAL
> codepath (which we're apparently not reaching) is do_open_execat(int fd, struct
> filename *name, int flags) which says no, may_open() already checked and this is
> just a race condition check, and it's common plumbing in another file that
> returns this error code. Alright, cheat failed, back to drilling.
>
> Back to do_execveat_common(): filename was not a NULL pointer or similar.
> UCOUNT_RLIMIT_NPROC would return -EAGAIN. What error code might alloc_bprm()
> return? It's on line 1512 of this same file and it is understandably ENOMEM.
> count() can return EFAULT, E2BIG, and ERESTARTNOHAND. (Huh, launching a process
> with an argv of { NULL } has a kernel workaround with shaking finger of shame in
> the log? Did not know that.)

(yeah, surprisingly that "broke userspace" in the mild sense of "we
had tests" that made sure the _dynamic linker_ didn't crash in that
circumstance. and of course, we only had that test because it'd
happened in real life. but, yeah, i'm pretty happy with "don't do
that".)

> bprm_stack_limits() can set E2BIG.
> copy_string_kernel() and copy_strings() can both EFAULT or E2BIG.
>
> And now we're on to bprm_execve(), which I can drill through after lunch...
>
> Rob
>
> P.S. I note that 127 is to me an ACCEPTABLE failure return code for this since
> attempting to run the root directory is shenanigans in the first place. From
> toybox's perspective, it's possible the test is being unnecessarily specific
> here. But it would be nice to understand what's going on; pursuing this we may
> learn something about your system setup...

when i wrote the test i was trying to make sure we tested _all_ the
paths through....

...but there's a bug _somewhere_ for this to be non-deterministic.
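
For reference, the two paths the test tells apart come down to the
xwrap.c expression quoted above. A minimal sketch of that mapping as a
standalone function follows. The name exec_failure_status is
hypothetical and is not a toybox function; only the expression itself
is from the thread.

```c
#include <errno.h>

/* Hypothetical helper: map the errno left behind by a failed exec to a
 * shell-style exit status. ENOENT ("not found") becomes 127; every other
 * failure (EACCES, ENOEXEC, ...) becomes 126 ("found but not executable"). */
int exec_failure_status(int err)
{
  return 126 + (err == ENOENT);
}
```

Under this mapping the flaky 127 means execvp() left errno set to
ENOENT, where the strace runs quoted above show EACCES.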

> P.P.S. If this is (char)-1 getting returned in the wrong place by some obscure
> codepath... I'm not gonna be _that_ surprised, to tell the truth. Disappointed,
> but not really surprised.
>
> P.P.P.S. I assume this was seen on 64 bit arm android of a current flavor
> running the test suite through vendor_init or some such?

this was a hwasan build on cheetah. (a) because that's what happens to
be on my desk, but (b) because ccross' example was arm64 hwasan too.

