[Toybox] Release 0.8.10

Rob Landley rob at landley.net
Mon Jul 31 18:54:29 PDT 2023


On 7/31/23 09:31, enh wrote:
> seems like this release is in a pretty bad state?

The tests passed locally! All of 'em! With glibc and musl! Sigh...

> github CI is failing for both
> linux and macOS... linux seems to have some tar failures

Yeah, it's those darn sparse failures again. On ext4 writing a sparse file
behaves deterministically-ish, but butterfly-effect-fs not so much.

Admittedly it's only user visible if you _ask_ for it, and I'm kind of tempted
to teach tar that "--sparse means any sufficient run of zeroes becomes #*%(#&
sparse whether or not the filesystem knows about it". Where "sufficient" would
logically be 512-byte-aligned 512-byte blocks, because that's how tar thinks.
(It's a space savings. I don't THINK there's a maximum number of sparse extents?
I've even got a realloc every 512 entries in the existing loop! And a note to
figure out how to test that properly. Don't ask me what gnu/dammit does with a
multiblock sparse table, it _probably_ works?)

*shrug* If nothing else it would eliminate the filesystem dependency...
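For the record, the scan I'm describing is nothing exotic: walk the file data
in 512-byte tar blocks, call every all-zero block a hole, and collect the rest
into extents. A minimal sketch (hypothetical helper, not the code in tar.c,
error handling elided):

#include <stdlib.h>
#include <string.h>

// One data extent: file offset and length of bytes tar has to store.
struct extent { long long start, len; };

// Hypothetical sketch (not the actual toybox loop): walk a buffer in 512-byte
// tar blocks, treat all-zero blocks as holes, and append the runs of data to
// a realloc()ed extent array. Caller starts with *ext = NULL and count = 0;
// returns the new extent count.
int scan_sparse(char *buf, long long len, struct extent **ext, int count)
{
  static const char zero[512];
  long long ii, chunk;
  int indata = 0;

  for (ii = 0; ii < len; ii += chunk) {
    chunk = (len-ii < 512) ? len-ii : 512;
    // A full block of zeroes is a hole; a short trailing block is data.
    if (chunk == 512 && !memcmp(buf+ii, zero, 512)) {
      indata = 0;
      continue;
    }
    if (indata) (*ext)[count-1].len += chunk;
    else {
      // Grow the array every 512 entries, like the existing loop.
      if (!(count&511)) *ext = realloc(*ext, (count+512)*sizeof(**ext));
      (*ext)[count].start = ii;
      (*ext)[count++].len = chunk;
      indata = 1;
    }
  }

  return count;
}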

> FAIL: tar sparse without overflow
> echo -ne '' | tar c --owner root --group sys --mtime @1234567890 --sparse fweep
> | SUM 3
> --- expected 2023-07-29 01:27:20.471064281 +0000
> +++ actual 2023-07-29 01:27:20.475064343 +0000
> @@ -1 +1 @@
> -50dc56c3c7eed163f0f37c0cfc2562852a612ad0
> +4b0cf135987e8330d8a62433471faddccfabac75

In order for this to be happening the sparse test I added at the start has to
pass, but then the larger saving-of-sparseness does not match the file we just
created on the previous line.

I.e. the microsoft github behavior has to be INCONSISTENT within the same run
to trigger this. Wheee...

> and a mv failure
> 
> FAIL: mv random file to new file
> echo -ne '' | mv file1 file2 && [ ! -e file1 -a -f file2 ] && stat -c %s file2
> --- expected 2023-07-29 01:27:12.202935388 +0000
> +++ actual 2023-07-29 01:27:12.206935451 +0000
> @@ -1 +1 @@
> -5243392
> +10752
> 
> though the latter might be flaky?

It's another sparse file:

dd if=/dev/zero of=file1 seek=10k count=1 >/dev/null 2>&1

And this one is even WEIRDER because stat %s is "total size in bytes", which
toybox gets from stat->st_size, which man 2 stat says is:

               off_t     st_size;        /* Total size, in bytes */

Which is NOT the st_blocks field; this is the ls -l size.
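
(For anyone following along, the difference is just two fields in the same
struct stat; hypothetical standalone demo, not toybox code:)

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
  struct stat st;

  if (argc != 2 || stat(argv[1], &st)) return 1;
  // st_size is the logical length that ls -l and stat -c %s report;
  // st_blocks counts 512-byte blocks actually allocated, which is the
  // number sparseness changes.
  printf("st_size=%lld allocated=%lld\n",
    (long long)st.st_size, 512LL*st.st_blocks);

  return 0;
}

Run that on the file1 above and st_size should come back 5243392 no matter how
many blocks the filesystem actually bothered to allocate.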

I thought maybe my new dd went weird but I just ran "ASAN=1 make clean defconfig
toybox tests" here _again_ and it passed? Short reads/writes should be
compensated for (it multiplies up a total count of bytes and then tries to shovel
them in ifs and ofs sized chunks, but the _total_ still has to add up).
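
(What "compensated for" means in practice is just the usual loop shape; a
rough sketch, hypothetical and much simplified from what dd.c actually does:)

#include <unistd.h>

// Copy "total" bytes from ifd to ofd in chunks of at most "bs" bytes.
// Short reads and short writes only shrink the running counts, so the
// overall byte total still has to add up before the loop exits.
static int copy_bytes(int ifd, int ofd, long long total, long bs, char *buf)
{
  while (total > 0) {
    long long len = read(ifd, buf, (total < bs) ? total : bs);
    char *pp = buf;

    if (len <= 0) return -1;   // error or premature EOF
    total -= len;
    while (len > 0) {
      long out = write(ofd, pp, len);

      if (out <= 0) return -1;
      pp += out;
      len -= out;
    }
  }

  return 0;
}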

10752 is 10*1024+512, meaning the seek was done in BYTES not blocks? (Seeking
10k blocks of 512 bytes = 5242880, plus the one block written = 5243392; treat
10k as bytes and you get 10240+512 = 10752.) But that's not what my dd is doing:

$ ./dd if=/dev/zero of=file1 seek=10k count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.001 s, 500 K/s
$ ls -l file1
-rw-r--r-- 1 landley landley 5243392 Jul 31 20:19 file1

And looking at the code the only way it _would_ do that is...

Oh. Duh. I moved TT.iflag from globals (where it's guaranteed zeroed) to a local
variable (which needs manual zeroing), and of course despite gcc regularly
producing "is provably never used uninitialized" warnings, it does NOT warn for
ACTUALLY UNINITIALIZED VARIABLES when you pass their address to an inlineable
function in the same translation unit.

Yup, my bad. Bug in the new dd introduced by my cleanup last week. Uninitialized
variable hallucinating iflag=skip_bytes. Commit a88bd5e9a0c5.

(Grumble grumble false sense of security. And you'd think ASAN would
randomize/poison that sort of thing. Memset new stack pages with 0x55 or some
such...)
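
For the curious, the shape of the bug was roughly this (hypothetical names,
not the actual dd.c, and no promises about what any particular gcc version
warns on; it said nothing about the real one):

#include <stdio.h>
#include <string.h>

struct dd_flags { int skip_bytes; };

// Static and trivially inlineable in the same translation unit, and it
// only ever sets bits, never zeroes the struct first.
static void parse_iflag(struct dd_flags *flags, char *arg)
{
  if (arg && !strcmp(arg, "skip_bytes")) flags->skip_bytes = 1;
}

int main(int argc, char *argv[])
{
  struct dd_flags flags;   // was a zeroed global, now stack garbage

  parse_iflag(&flags, argv[1]);
  // With no iflag= argument this reads whatever the stack had in it,
  // which is how the new dd hallucinated iflag=skip_bytes.
  printf("skip_bytes=%d\n", flags.skip_bytes);

  return 0;
}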

> i tried to reproduce locally, but hit a few different asan failures. linux has
> an asan failure in blkid (patch sent already) and macOS dies in the id 0 test:

The downside of testing glibc+gcc's asan in debian. (Still can't use the NDK one
because it only ships asan.so not asan.a.)

> looks like a classic "the passwd functions don't guarantee the lifetime of a
> result past another call to a passwd function" issue?

Try commit 3d86ee9eaec1?

Which works on both glibc and musl, with ASAN on the glibc build. When I
enable ASAN on the musl build the cross compiler goes "x86_64-linux-musl-cc:
fatal error: cannot read spec file 'libsanitizer.spec': No such file or
directory", so that's nice...

> linux also dies in the sed timeout test; that seems to be a pathological case
> for asan because increasing the timeout to 60s also didn't pass. (though
> weirdly, that test is fine -- finishing almost instantly, just like non-asan
> -- on macOS.

Didn't see it on debian's gcc+glibc ASAN, but most likely that has fewer checks.

> not sure whether that's a bsd/glibc difference or a linux-only asan
> bug. the latter seems less likely, but i'll mention it to the asan folks anyway...)

I remind you of:

commit c0dca293c1301a6315684703706597db07a8dbe1
Author: Rob Landley <rob at landley.net>
Date:   Sat Jun 27 03:14:49 2020 -0500

    The bionic/clang asan plumbing slows the test down >10x, so expand timeout.

That test is ping-ponging between a bunch of different segments (the source
buffer, the destination buffer, the parsed regex struct, the stack, global
variables, the toybox text segment, and glibc's library text segment) and it's
entirely possible whatever virtual TLB setup ASAN does to catch weirdness is
getting thrashed. Worse now than when the 20 second timeout was enough...

Meanwhile, without ASAN, wrapping date +%s.%N around the test says it takes
about a quarter of a second on my 10 year old laptop:

1690850299.108588387
PASS: sed megabyte s/x/y/g (20 sec timeout)
1690850299.342395058

A reasonable chunk of which is the shell test plumbing. (Just two consecutive
"date +%s.%N; date +%s.%N" calls from the shell are .007 seconds apart on this
machine, a nontrivial chunk of that 250 milliseconds. I think the original 10
second timeout was to make it reliably pass on my 66MHz Turtle board.)

I can't think of a fix here other than disabling the test...

> i *couldn't* reproduce the tar failures locally (the failure at the top was copy
> & pasted from github ci results), so maybe another lurking fs dependency?

A fs dependency which commit d69584882ff8 worked around for five minutes.

Rob

P.S. Part of my drive to run tests in the known mkroot environment is that
although I do want to catch all the real world weirdness, it's no good for
REGRESSION testing when the "distro du jour" running the test suite for the
first time has failures due to assumptions IN THE TESTS rather than anything
toybox did wrong per se. Writing robust portable tests that actually SAY
anything is REALLY HARD. There's a zillion tests I haven't even tried to write
yet because "I could get this to pass once" is NOT something you put in the
test suite.

