[Toybox] Horrible microsoft github tests spamming me.

Rob Landley rob at landley.net
Tue Aug 4 20:49:24 PDT 2020



On 8/4/20 10:38 AM, enh wrote:
> On Mon, Aug 3, 2020 at 11:41 PM Rob Landley <rob at landley.net> wrote:
>>
>> I got email about https://github.com/landley/toybox/runs/940149373 in which one
>> of the cpio tests spuriously failed. I cannot cut and paste the failure because
>> microsoft github's crammed so much javascript into the reporting page that it
>> doesn't work.
>>
>> The last time cpio.c changed was April, the last time tests/cpio.test changed
>> was May, and the last time lib/* or any of the scripts/test plumbing changed
>> was June.
>>
>> The test failed for non-obvious reasons which look like a shell race condition
>> with | or something? The test is doing a dd to grab a specific byte offset out
>> of the file, and the chunk it's looking at starts 8 bytes too early to get a
>> match with what it expects. Is this a dd problem? Is the file longer, with
>> spurious crap inserted earlier in it? Did the container's | insert extra data?
>> Who knows, I haven't a clue how to dig any of the build artifacts out of this
>> mess, but I got email about it.
> 
> yeah, this is where i've been for years with the "found by Android CI"
> bugs. one thing i can say is that i'm _not_ seeing this on Android's
> CI. (an obvious difference there is that we use toybox dd!)

I tend to have thing | thing rather than checkpointing through temporary files
so I don't have to clean up the temp files afterward. What I really need to do
is make the tests self-cleaning between invocations (which means running each
test rather than sourcing it, which means exporting the variables that need
exporting, which isn't a big deal, it's just a todo item a bit down on the
list)...
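
Something like this, to make the tradeoff concrete (made-up filenames and a
placeholder make_input, not an actual excerpt from tests/cpio.test):

  # Pipeline style: nothing to clean up, but when it fails there's nothing
  # left on disk to look at:
  make_input | cpio -o -H newc | cmp - expected.cpio

  # Checkpoint style: the intermediate files survive a failure for postmortem,
  # at the cost of cleaning them up between runs:
  make_input > input.list &&
  cpio -o -H newc < input.list > actual.cpio &&
  cmp actual.cpio expected.cpio
  rm -f input.list actual.cpio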

But "thing > file && thing < file" does let you examine the intermediate parts
after a failure better. I just don't assume suprious failures on the part of
_my_ infrastructure (outside of pending). I do, sadly, assume spurious failures
on the part of ubuntu crap because I've hit several over the years (and reported
some upstream). The entire constellation of SA_RESTART issues where SIGSTOP and
friends cause short reads on pipes, for example, and a zero length read is NOT
necessarily EOF... I try to get that stuff right, and when there's a suprious
only-happened-once thing like this I'll examine the code to see if I can figure
out how it happened...
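
The pipe short-read behavior is easy to demonstrate with dd itself, since dd
does one read() per input block by default (a sketch, not the actual failing
test):

  # May print anything up to 65536 depending on how the writer gets scheduled.
  # (GNU dd has iflag=fullblock to retry until the block fills; I'm not
  # assuming toybox dd does.)
  yes | dd bs=64k count=1 2>/dev/null | wc -c

  # Grabbing a chunk at a byte offset out of a regular file (made-up numbers,
  # not the real test's offsets) avoids that race, and with bs=1 a short read
  # can't lose data anyway:
  dd if=actual.cpio bs=1 skip=512 count=110 2>/dev/null | od -c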

But this isn't my code, this is ubuntu's shell and ubuntu's dd running the test
in microsoft github's container, and like 80% of the exposed surface area here
isn't my stuff, which makes the likelihood of finding the bug maybe 20%...

> anyway, the first time i see a failure i just take a mental note to
> prime myself for future occurrences and assume cosmic ray/failing
> hardware/bad kernel unless i see it again.

If it happens on my laptop I will drop EVERYTHING and reproduce/explain it.
Which sometimes takes days. It's one of those "I dropped a glass, there's
fragments everywhere, DO NOT MOVE, DO NOT STEP ON ANYTHING" situations.

If it happens in a random vm like this? 4 times out of 5 it's the VM.

> the nice thing about CI though is that it doesn't take long before
> it's run the tests more often than all the humans put together.

Oh sure. I seriously want to expand test coverage, but it's far down the todo list.

>> I thought I was not going to get these emails, and am sad. And to add insult,
>> the github email says it's from "Rob Landley", which is very much not the case.
> 
> annoyingly, i _want_ to get them, but don't. (and don't know how to
> sign up for them either :-( )

You should have access to fiddle with it. I dunno how github works under the surface...

Rob


