[Toybox] [toybox] 0.7.0: scripts/make.sh: line 270: wait: pid XYZ is not a child of this shell (#24)

Nicolas Boichat drinkcat at chromium.org
Fri Feb 26 20:52:59 PST 2016


Hi!

On Sat, Feb 27, 2016 at 2:12 AM, Rob Landley <rob at landley.net> wrote:
> On 02/26/2016 12:31 AM, Nicolas Boichat wrote:
>> On Fri, Feb 26, 2016 at 1:53 PM, Rob Landley <notifications at github.com
>>     On 02/25/2016 01:31 AM, drinkcat wrote:
>>     > We use toybox-0.7.0 as part of the Chromium OS project,
>
> P.S. Yay!
>
>>     > and sometimes
>>     > hit an issue when building it on our automated builders (see this
>>     issue
>>     > <https://bugs.chromium.org/p/chromium/issues/detail?id=584542>):
>>     >
>>     > |toybox-0.7.0: armv7a-cros-linux-gnueabi-gcc -O2 -O2 -pipe
>>     -march=armv7-a
>>     > -mtune=cortex-a15 -mfpu=neon -mfloat-abi=hard -g -fno-exceptions
>>     > -fno-unwind-tables -fno-asynchronous-unwind-tables -clang-syntax
>>     > -funsigned-char -Wno-string-plus-int -I . -Os -ffunction-sections
>>     > -fdata-sections -fno-asynchronous-unwind-tables
>>     -fno-strict-aliasing -c
>>     > toys/posix/tail.c -o generated/obj/tail.o toybox-0.7.0:
>>     scripts/make.sh:
>>     > line 270: wait: pid 8477 is not a child of this shell toybox-0.7.0:
>>
>>     Hmmm... PID wrap, maybe?
>>
>> That's what we were wondering about... The builder is building a lot of
>> other packages at the same time, including Chromium, so it's not
>> unlikely that the PID space is saturated... Also, the builder retries
>> after the first failure, and the second try always works (probably when
>> the builder is less busy...)
>
> Possibly the OS is killing zombies if it wants to reuse that PID before
> the zombie is reaped? (Which would be a horrible heuristic because
> process exit could happen after a long runtime but right before a new fork.)
>
> Or maybe it's doing so if it there _are_ no more free PIDs, instead of
> fork failing?
>
> In either case, moving to $! wouldn't fix it. But that also wouldn't
> explain why only bash was seeing the problem...

Actually, it might, see POSIX shell specification of "wait"
(http://pubs.opengroup.org/onlinepubs/009695399/utilities/wait.html,
alright, arguably, it's in the "informative" part ,-)):
"""
Historical implementations of interactive shells have discarded the
exit status of terminated background processes before each shell
prompt. Therefore, the status of background processes was usually lost
unless it terminated while wait was waiting for it. This could be a
serious problem when a job that was expected to run for a long time
actually terminated quickly with a syntax or initialization error
because the exit status returned was usually zero if the requested
process ID was not found.
[...]

The shell is allowed to discard the status of any process if it
determines that the application cannot get the process ID for that
process from the shell. It is also required to remember only
{CHILD_MAX} number of processes in this way. Since the only way to get
the process ID from the shell is by using the '!' shell parameter, the
shell is allowed to discard the status of an asynchronous list if "$!"
was not referenced before another asynchronous list was started. (This
means that the shell only has to keep the status of the last
asynchronous list started if the application did not reference "$!".
If the implementation of the shell is smart enough to determine that a
reference to "$!" was not saved anywhere that the application can
retrieve it later, it can use this information to trim the list of
saved information. Note also that a successful call to wait with no
operands discards the exit status of all asynchronous lists.)
"""

There was no reference to $! in the previous version of the code, so
it might be that bash is "smart enough", and trims the exit status
from its list...

(BTW, AFAICT, CHILD_MAX is a large number, 8192, so it cannot be an issue here)

> It's an interesting bug and I'd be interested in tracking it down if I
> was willing to get sucked into debugging GPLv3 bash. (GPLv2 bash I spent
> days tracking down weirdness, ala:
>
> The initial problem:
>   http://landley.net/notes-2011.html#24-08-2011
>
> Mentioned in passing:
>   http://landley.net/notes-2011.html#26-08-2011
>   http://landley.net/notes-2011.html#28-08-2011
>
> Deep dig:
>   http://landley.net/notes-2011.html#02-09-2011
>   http://landley.net/notes-2011.html#03-09-2011
>   http://landley.net/notes-2011.html#04-09-2011
>
> And finally finding it:
>   http://landley.net/notes-2011.html#05-09-2011
>
> Yes, that's me happily digging through libc, kernel, and back into a
> userspace program to find a problem.

Ah well, that's how "interesting" bugs are ,-)))

> But if a GPLv3 program is involved,
> "it's broken, let's replace it".
>
>>     > Looking at the code (|script/make.sh|), we are wondering about
>>     your use
>>     > of |$(jobs -rp)|. Wouldn't it be more correct to add jobs to PENDING
>>     > using |$!| right after you launch the job (|do_loudly|)?
>>
>>     If you think that'll help, I'm happy to give it a try, sure.
>>
>>
>> I have a commit ready here, that appears to fix the problem:
>> https://github.com/drinkcat/toybox/commit/4c705620d73e3e9c12a3be54dc5d2efda939241a
>
> I pushed a change last night based on your $! suggestion, did that fix
> it? (Your patch is using ${%%} to filter, which is interesting. I
> couldn't make ${//} work right but maybe that could replace my sed
> invocation?

Sure! Feel free to steal the idea ,-) (Note that you have to be a bit
more careful when constructing the "PENDING" list: it cannot start
with a space anymore)

You could also get rid of "wc" by using a counter variable instead.

> Trying to get the number of execs in the dispatch/monitoring
> cycle down as small as possible. Then again once it can build under a
> toybox shell then it's just a fork() and not an exec, which is cheaper.
> Eh, worry about it later...)

Your patch seems totally reasonable. I've tested it by running 8
builds with it, and it always worked (vanilla 0.7.0 failed once in 3
tests).

Any chance you could release a 0.7.1 version with it? This would make
it easier for us to test it at larger scale (without having to carry a
local patch).

>> It's a little less aggressive at parallelizing, as it always waits for
>> the first PID if PENDING is full (instead of refreshing the PENDING list
>> every time)...
>
> So's the one I did last night. I should poke around on my 8-way machine
> and see how it's doing keeping the cpus busy...

Well, actually, you could switch back to `jobs -rp` to find the number
of jobs... Just do not use it to update PENDING (well, if the $!
theory is right, you might be able to, but there would be no gain...)

Thanks!

Best,

Nicolas

 1456548779.0


More information about the Toybox mailing list