[Toybox] weird timeout flake
Rob Landley
rob at landley.net
Tue Oct 3 13:48:05 PDT 2023
On 10/3/23 13:38, enh wrote:
>> Trying that by hand on devuan (using coreutils' timeout):
>>
>> $ timeout .1 /
>> timeout: failed to run command ‘/’: Permission denied
>> $ echo $?
>> 126
>>
>> From the bash man page:
>>
>> If a command is not found, the child process created to execute it re‐
>> turns a status of 127. If a command is found but is not executable,
>> the return status is 126.
>>
>> I'm not sure how you can "file not found" the root directory? (Selinux?
>> Filehandle exhaustion? Even chmod 000 should return EACCES not ENOENT.)
>>
>> The relevant code is xwrap.c line 233:
>>
>> execvp(argv[0], argv);
>>
>> toys.exitval = 126+(errno == ENOENT);
>
> +Colin Cross who just saw this too.
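
For reference, the exit status convention in the quoted xwrap.c line can be reproduced outside toybox with a small fork/exec wrapper. This is only a sketch: run_status() is a hypothetical helper name, not toybox's actual code, and it assumes a POSIX system.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not toybox's code): run argv in a child process and
 * return its exit status, using the 126/127 convention quoted above. */
int run_status(char *const argv[])
{
  int status;
  pid_t pid = fork();

  if (pid < 0) return -1;
  if (!pid) {
    execvp(argv[0], argv);
    /* Only reached when exec failed: 127 for ENOENT, 126 for anything else. */
    _exit(126 + (errno == ENOENT));
  }
  if (waitpid(pid, &status, 0) != pid) return -1;

  /* Report death by signal the way a shell would: 128 + signal number. */
  return WIFEXITED(status) ? WEXITSTATUS(status) : 128 + WTERMSIG(status);
}
```

Calling run_status() on an argv of {"/", 0} would be expected to give 126 on a system behaving like the devuan transcript, so a 127 from the same call means the child really did see ENOENT.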

So why is execvp("/", {"/", 0}); returning ENOENT? It's saying it cannot _find_
the root directory, not that it can't execute a directory. Hmmm...

The execvp man page says:

  If the specified filename includes a slash character, then PATH is
  ignored, and the file at the specified pathname is executed.

Which seems like it would moot this part:

  If permission is denied for a file (the attempted execve(2) failed with
  the error EACCES), these functions will continue searching the rest of
  the search path. If no other file is found, however, they will return
  with errno set to EACCES.

Which would still be returning something other than ENOENT anyway.

Hmmm...

  If the header of a file isn't recognized (the attempted execve(2)
  failed with the error ENOEXEC), these functions will execute the shell
  (/bin/sh) with the path of the file as its first argument. (If this
  attempt fails, no further searching is done.)

I don't THINK that's a likely fallback path here? Although /bin/sh not found
might explain it. But that would be deterministically reproducible and you're
having an intermittent issue, right?

$ cat > potato.c << EOF
#include <unistd.h>
int main(int argc, char *argv[]) { execvp("/", (char *[]){"/", 0}); }
EOF
$ gcc potato.c
$ strace ./a.out
execve("/", ["/"], 0x7ffebd0880d8 /* 36 vars */) = -1 EACCES (Permission denied)
$ sudo strace ./a.out
execve("/", ["/"], 0x7ffc32654e08 /* 16 vars */) = -1 EACCES (Permission denied)
$ ls -ld /
drwxr-xr-x 24 root root 4096 Feb 9 2023 /

In general root doesn't care about permission bits, and there's no /bin/sh
follow-up to the syscall failure here. Tried again with bionic and there were a
couple extra mprotect() calls on the way out but still no /bin/sh fallback...
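
Where strace isn't handy (on a device, say), the same check can be made from inside the process by reading errno after the exec call returns. A minimal sketch, assuming a POSIX libc; exec_errno() is a made-up name for illustration:

```c
#include <errno.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical probe: attempt the exec and return errno if the call comes
 * back. A successful exec replaces the process, so this function only ever
 * returns on failure. search_path selects execvp() over plain execve(). */
int exec_errno(char *const argv[], int search_path)
{
  errno = 0;
  if (search_path) execvp(argv[0], argv);
  else execve(argv[0], argv, environ);

  return errno;
}
```

Comparing the execve() and execvp() results for the same argv separates what the syscall reported from anything libc's wrapper did afterwards. If both say EACCES for "/" on a machine that still flakes with 127, the ENOENT is presumably coming from somewhere other than this call.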

So the question here is does the kernel have a weird intermittent codepath, or
does bionic+selinux have a weird intermittent codepath?

Let's see: in the vanilla kernel source, fs/exec.c has SYSCALL_DEFINE3(execve),
which does return do_execve(getname(filename), argv, envp); and that wraps
do_execveat_common() on line 1888 of the same file.

A quick cheat grepping for EACCES shows two uses in this file, one in
SYSCALL_DEFINE1(uselib) which I just BOGGLE at because how are shared libraries
THE KERNEL'S PROBLEM... but I really doubt we get there here. No, the NORMAL
codepath (which we're apparently not reaching) is do_open_execat(int fd, struct
filename *name, int flags) which says no, may_open() already checked and this is
just a race condition check, and it's common plumbing in another file that
returns this error code. Alright, cheat failed, back to drilling.

Back to do_execveat_common(): filename was not a NULL pointer or similar.
UCOUNT_RLIMIT_NPROC would return -EAGAIN. What error code might alloc_bprm()
return? It's on line 1512 of this same file, and it is understandably ENOMEM.
count() can return EFAULT, E2BIG, and ERESTARTNOHAND. (Huh, launching a process
with an argv of { NULL } has a kernel workaround with shaking finger of shame in
the log? Did not know that.) bprm_stack_limits() can set E2BIG.
copy_string_kernel() and copy_strings() can both EFAULT or E2BIG.
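
The E2BIG outcome from the argument-handling stage is easy to provoke from userspace, which shows that errors from this part of the path surface as themselves rather than as ENOENT. A sketch under stated assumptions: Linux, an existing /bin/sh, and a hypothetical helper name oversized_arg_errno(). The 8 MiB size used below is an arbitrary value chosen to be far over the kernel's argument limits.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: try to exec /bin/sh with one argument of len bytes
 * and return the errno from the failed execve(). If the exec unexpectedly
 * succeeds, the process is replaced by a shell that exits with status 1. */
int oversized_arg_errno(size_t len)
{
  char *big = malloc(len + 1);
  int err;

  if (!big) return ENOMEM;
  memset(big, 'x', len);
  big[len] = 0;

  char *argv[] = {"/bin/sh", "-c", "exit 1", big, 0};
  execve(argv[0], argv, environ);
  err = errno;          /* save errno before free() can disturb it */
  free(big);

  return err;
}
```

With len set to 8 MiB (8u << 20), the call would be expected to return E2BIG.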

And now we're on to bprm_execve(), which I can drill through after lunch...

Rob

P.S. I note that 127 is to me an ACCEPTABLE failure return code for this since
attempting to run the root directory is shenanigans in the first place. From
toybox's perspective, it's possible the test is being unnecessarily specific
here. But it would be nice to understand what's going on; pursuing this, we may
learn something about your system setup...

P.P.S. If this is (char)-1 getting returned in the wrong place by some obscure
codepath... I'm not gonna be _that_ surprised, to tell the truth. Disappointed,
but not really surprised.

P.P.P.S. I assume this was seen on 64-bit arm android of a current flavor
running the test suite through vendor_init or some such?