[Toybox] one last find thing...

Mon Jun 17 15:12:37 PDT 2019

On Mon, Jun 17, 2019 at 3:00 PM Rob Landley <rob at landley.net> wrote:
>
> On 6/17/19 2:49 PM, enh wrote:
> > On Sat, Jun 15, 2019 at 4:30 PM Rob Landley <rob at landley.net> wrote:
> >>
> >> On 6/14/19 3:57 PM, enh wrote:
> >>> (i haven't had time to investigate, and i don't have any useful test
> >>> case other than "some timezone testing fails to run on emulators in
> >>> the cloud, in a way that gives me no useful failure", but i'm getting
> >>
> >> Does it _reliably_ fail to run?
> >
> > seems like it. i've kicked off another build "just in case" every day,
> > but it looks like the same failure (modulo the fact that i don't have
> > any real detail), and _other_ changes are going through fine.
>
> If it fails reliably, we can start test-reverting bits of it. I'd start with the
> O_PATH (to eliminate it, if nothing else).
>
> Also, did you try a build with the commit before this one just to confirm this
> is what did it?

yeah, the success of that was what prompted me to call this probably
genuine and send the email :-)

> Sigh. It's a pity I can't see what the actual failure is. (When you say timezone
> testing, do you mean toybox's date.test...?)

no, it's "x86 CtsHostTzDataTests". afaict that's hundreds of lines of
Java to do what could be done in a couple of lines of shell, if only
we supported the latter.

> >>> increasingly convinced that the DIRTREE_STATELESS patch does break
> >>> something, and it's not just an infrastructure issue... i wouldn't
> >>> normally send such a useless bug report, but i've failed to get to
> >>> this in 3 days, and i'm not likely to for at least 3 more at this
> >>> point, so i thought i'd at least mention it...)
> >>
> >> This isn't going to break anything, is it?
> >>
> >> -      openat(dirtree_parentfd(new), new->name, O_CLOEXEC), flags);
> >> +      openat(dirtree_parentfd(new), new->name, O_PATH|O_CLOEXEC), flags);
> >
> > (one thing that occurred to me over the weekend is that anywhere we
> > use O_PATH might break macOS, since there is no O_PATH there.
>
> #ifdef __APPLE__
> #define O_PATH 0
> #endif

i actually meant behavioral breaks. specifically xabspath breaks the
usual "you have to pick one of O_RDONLY/O_RDWR/O_WRONLY" rule.

i'm not worried about this right now, but it's something to think
about if we're going to seriously support macOS.
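
to make the concern concrete, here's a rough sketch (not toybox code;
open_handle() is a made-up name, and testing for the macro rather than
for __APPLE__ is my assumption). if O_PATH falls back to 0, the call
has to name an access mode itself, and it quietly turns into a plain
read-only open, which needs read permission on the target where a real
O_PATH open does not:

```c
#define _GNU_SOURCE   /* glibc only exposes O_PATH when this is defined */
#include <fcntl.h>
#include <unistd.h>

/* Fallback for systems with no O_PATH. Assumption: keyed on the macro
 * being absent rather than on __APPLE__ as in the quoted suggestion. */
#ifndef O_PATH
#define O_PATH 0
#endif

/* Hypothetical helper: open name relative to dirfd purely as a handle.
 * O_RDONLY is spelled out so the fallback case still specifies an
 * access mode; where O_PATH is real, Linux ignores the access mode. */
int open_handle(int dirfd, const char *name)
{
  return openat(dirfd, name, O_RDONLY|O_PATH|O_CLOEXEC);
}
```

so open_handle(AT_FDCWD, ".") gives a usable fd either way, but the two
variants are not guaranteed to agree on files the caller can't read.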

> > but the
> > failures in question are on Android. [the builds in question don't
> > contain a new host prebuilt.])
>
> So the host prebuilt is the same, you rebuild toybox from source, and then a
> test it runs afterwards fails?

yes, this is only rebuilding the device toybox. and then the
should-be-unrelated test fails. next time i'm actually at a real
computer i'll try to find out exactly what commands this test is
running on the device.

> Is the test the only thing that fails? (Or does the build stop there?)

afaict it's only this. (the tests all run after the build is done.)

> >> Moving struct st earlier within struct dirtree could reveal an existing bug, but
> >> the bug itself would be elsewhere.
> >>
> >> If strcpy(s, "") with only a single byte allocated to s[] wrote past the end of
> >> it, we'd have bigger problems...
> >>
> >> I'm not spotting what else could be the culprit? (And with a _timezone_
> >> test...?)
> >
> > i don't think that's relevant. it's just a test that (afaict) runs on
> > the host and calls commands on the device via adb. (don't ask... i
> > can't defend the "host-side tests" stuff because aiui it's
> > indefensible.)
>
> I try never to criticize the user workload. It has seniority, I CANNOT break it.
> (Unless they're really the only user and I ask nicely.)
>
> I added a workaround to toybox sed for an outright _bug_ in the perl package
> build (where whoever wrote the regex didn't understand how ranges work so
> created a NOP but I was erroring out on the invalid construct).
>
> Can this sideload test be extracted from the larger build? Maybe I can get an
> image I can replace toybox in and try the test again to see what's going on?

afaict it's only affecting x86? which is presumably another clue. but
i'm not certain that's true yet because i haven't had time to confirm
that everything still works on arm/arm64.

> >> My approach would be to revert bits of it (go back to the xzalloc()
> >> etc, which is really an attempt to speed up top with less memory churn although
> >> I should break down and bench what that's spending its time on...)
> >>
> >> But if I can't reproduce the failure, I can't bisect it. Hmmm.
> >
> > yeah, when i get time i'll try the bisection. (unfortunately it's a
> > multi-hour thing for me. but at least that's better than nothing.)
>
> Obviously I can just revert the patch, but that doesn't explain what's happening.
>
> >> Rob
>
> Rob