[Toybox] ASan freaks out when using tsort in multicall binaries

Rob Landley rob at landley.net
Fri Oct 6 21:04:32 PDT 2023


On 10/6/23 16:33, Oliver Webb wrote:
>> Hadn't seen that one, I'm aware of a sparse file issue on some filesystems.
>> (That hit on microsoft github.)
> 
> My home directory is ecryptfs,

Yup, that would explain it.

> Testing on my /tmp directory (ext4) makes the errors
> go away for both du and tar.

config ECRYPT_FS
        tristate "eCrypt filesystem layer support"
        depends on KEYS && CRYPTO && (ENCRYPTED_KEYS || ENCRYPTED_KEYS=n)
        select CRYPTO_ECB
        select CRYPTO_CBC
        select CRYPTO_MD5
        help
          Encrypted filesystem that operates on the VFS layer.  See
          <file:Documentation/filesystems/ecryptfs.rst> to learn more about
          eCryptfs.  Userspace components are required and can be
          obtained from <http://ecryptfs.sf.net>.

Sourceforge. Lovely. And that website redirects to a page that lists a google+
page, and says the ecryptfs-utils source is in launchpad/bazaar. No obvious way
to get a tarball, but I can create a snap pack from the web page? Last commit to
https://bazaar.launchpad.net/~ecryptfs/ecryptfs/trunk/files says it was 6 years ago.

To quote the whale, "I'm quite dizzy with anticipation. Or is it the wind?"

> The specific test that fails with tar is "tar create long->long".

Is it the "touch" that fails, or tar? Because the test is doing:

# 255 bytes, longest VFS name
LONG=0123456789abcdef0123456789abcdef
LONG=$LONG$LONG$LONG$LONG$LONG$LONG$LONG$LONG
LONG=${LONG:1:255}

# 4+96=100 (biggest short name), 4+97=101 (shortest long name)
touch dir/${LONG:1:96} dir/${LONG:1:97}
testing "create long fname" "$TAR dir/${LONG:1:97} dir/${LONG:1:96} | SUM 3" \
  "d70018505fa5df19ae73498cfc74d0281601e42e\n" "" ""

And what I was trying to test was the border condition of the tar internals
where it switches over to an adjunct record to record an overlength field that
won't fit in the structure, and it sounds like what's failing is the
filesystem's ability to have two adjacent directory entries with names of length
96 and 97 that differ only by that final character. Except I didn't add a check
for failure to the "touch" because it wasn't supposed to be part of the test, I
just naively assumed that would portably work...
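
A minimal sketch of checking that setup step on its own, assuming a POSIX shell
and a mktemp that honors TMPDIR (point TMPDIR at the filesystem under test). The
WORK, NAME96, NAME97 and CREATED names are made up for this illustration and are
not in tests/tar.test, and the cut calls stand in for the bash substrings
${LONG:1:96} and ${LONG:1:97}:

```shell
# Hypothetical standalone check, not part of tests/tar.test.
LONG=0123456789abcdef0123456789abcdef
LONG=$LONG$LONG$LONG$LONG$LONG$LONG$LONG$LONG

# Same characters as ${LONG:1:96} and ${LONG:1:97}, without the bash extension.
NAME96=$(printf %s "$LONG" | cut -c2-97)
NAME97=$(printf %s "$LONG" | cut -c2-98)

# Scratch directory on whatever filesystem $TMPDIR (or /tmp) lives on.
WORK=$(mktemp -d)
mkdir "$WORK/dir"

# Unlike the test, report a failing touch instead of ignoring it.
touch "$WORK/dir/$NAME96" "$WORK/dir/$NAME97" || echo "touch failed" >&2

# Count how many of the two names actually exist afterwards.
CREATED=0
for f in "$WORK/dir/$NAME96" "$WORK/dir/$NAME97"
do
  if [ -e "$f" ]; then CREATED=$((CREATED+1)); fi
done
echo "created $CREATED of 2 names"

rm -rf "$WORK"
```

If CREATED comes back as less than 2, the filesystem lost a name before tar ever
got to run.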

> Oh, another one I forgot to mention is "truncate sparse" fails on ecryptfs as well, but works on ext4

Do you mean the tests/truncate.test entry:

testing "is sparse" "truncate -s 1g freep && [ $(stat -c %b freep) -le 8 ] &&
  echo okay" "okay\n" "" ""

Which is doing a "truncate -s 1g file" and then asking stat whether the file with
literally no contents used no more than 8 512-byte blocks of storage?

The test is trying to ask "did the command create a sparse file", and the
failure seems to be "the filesystem cannot store a file sparsely", or at least
takes more than 4k to store literally no data.

I did not predict that failure mode from a filesystem merged into the mainline
kernel.
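
A rough sketch of running that check by hand, assuming a truncate and a stat that
accept the same -s and -c options the test uses, and a mktemp that honors TMPDIR.
WORK, SIZE and BLOCKS are names invented here, and the size is spelled out in
bytes only to sidestep suffix differences between implementations:

```shell
# Hypothetical standalone check, not the tests/truncate.test entry itself.
WORK=$(mktemp -d)

# 1 GiB apparent size with no data written.
truncate -s $((1024*1024*1024)) "$WORK/freep"

SIZE=$(stat -c %s "$WORK/freep")    # apparent size in bytes
BLOCKS=$(stat -c %b "$WORK/freep")  # allocated blocks, as in the test

# The test passes at 8 blocks or fewer: 8 * 512 = 4096 bytes of storage.
if [ "$BLOCKS" -le 8 ]
then echo "sparse: $BLOCKS blocks for $SIZE bytes"
else echo "not sparse: $BLOCKS blocks for $SIZE bytes"
fi

rm -rf "$WORK"
```

Printing the block count rather than just "okay" shows how far over the limit a
given filesystem lands.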

>> > sed fails the performance test even though it can process a megabyte of data in less than 20s,
>>
>>
>> On what hardware?
> 
> A laptop with 4GB of RAM and about 2.5 Gigahertz of processing power with 2 cores (Intel Celeron).
> This doesn't seem like a hardware speed issue
> 
> (seq 160000 generates about a MB of data so I used that in this test instead of the 20 doublings sed.test does)
> $ time ( seq 160000 | toybox sed "s/./y/g" > /dev/null )
> 
> real    0m0.282s
> user    0m0.235s
> sys     0m0.049s

It's not generating a megabyte of random data, it's generating a megabyte of the
same character, and then asking sed to search-and-replace one byte at a
time in that string a million times. The search-and-replace is s/x/y/ meaning
each x gets replaced with y. The output of "seq" does not contain any "x"
characters, so the search and replace will trigger zero times instead of
triggering a million times.
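
A small sketch of that difference, counting the bytes an s/x/y/ pattern could
match in each input. SEQ_X and ALL_X are names invented for this illustration,
and the counting uses tr and wc rather than sed so it says nothing about sed's
speed:

```shell
# Count the "x" bytes in the seq output: nothing for s/x/y/ to match.
SEQ_X=$(seq 160000 | tr -cd x | wc -c)

# Count the "x" bytes in a megabyte of x: one candidate match per byte.
ALL_X=$(dd if=/dev/zero bs=65536 count=16 2>/dev/null | tr '\0' x | tr -cd x | wc -c)

echo "seq input: $SEQ_X candidate matches"
echo "x input:   $ALL_X candidate matches"
```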

If you want more efficient generation of the test string I could instead do:

  dd if=/dev/zero bs=65536 count=16 | tr '\0' x

For a definition of "efficient" that calls an external program to marshall the
same amount of data through multiple kernel pipe buffers rather than staying
process-local and just thrashing the heap a bit. (My ten year old laptop has 3
megs of L2 cache so it probably all stays in cache, the three-process monty
version is gonna do page table shenanigans across four different contexts, uses
more SMP but quite possibly bounces data out to DRAM? Dunno.)
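
For comparison, a sketch of both ways of producing the string, assuming a POSIX
shell. The doubling loop is a reconstruction of the "20 doublings" idea with
made-up variable names (X, i, DOUBLED, PIPED), not a copy of what sed.test does:

```shell
# Process-local version: start with one "x" and double it 20 times.
X=x
i=0
while [ "$i" -lt 20 ]
do
  X=$X$X
  i=$((i+1))
done
DOUBLED=${#X}    # length of the string held in the shell variable

# Pipeline version from the text: dd feeds zero bytes to tr through a pipe.
PIPED=$(dd if=/dev/zero bs=65536 count=16 2>/dev/null | tr '\0' x | wc -c)

echo "doubling: $DOUBLED bytes, pipeline: $PIPED bytes"
```

Both come out at 2^20 bytes, so the choice between them is only about where the
work happens, not about what sed gets fed.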

Optimization is often non-obvious these days. I mostly try to do the simple
thing and stay out of the way of whatever clever stuff other people did, and
then fix it if the result is obviously unpleasant.

>> I haven't seen this (what distro/compiler/libc/filesystem are you testing on),
> 
> Linux Mint 21.1 (Which is essentially Ubuntu 22.04 with some irrelevant changes)/GCC 11.4.0
> /glibc 2.35/ecryptfs and ext4 (Both experience the same mkpasswd errors)

Sounds like I need to install mint in KVM with ecryptfs... Oh hey, they've got
an xfce version. Shouldn't be too hard to navigate...

> Huh, I just tested with make test_mkpasswd and it worked, Another one like tsort where it triggers
> ASAN _only_ when in a multicall binary.

Is it triggering ASAN in mkpasswd or in a different toybox command out of the
$PATH? (Yay reproduction sequence, but WHAT did it reproduce?)
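
One way to narrow that down is to see which file each command name in the $PATH
actually resolves to. This is only a sketch: which_binary is a hypothetical
helper, the list of commands is a guess at what the test touches, and it assumes
a readlink that supports -f:

```shell
# Hypothetical helper: print the real file behind a command name.
which_binary() {
  found=$(command -v "$1") || return 1
  case "$found" in
    /*) readlink -f "$found" ;;   # follow symlinks to the actual executable
    *)  echo "$found" ;;          # shell builtin, function or alias
  esac
}

for cmd in mkpasswd tsort sed
do
  echo "$cmd -> $(which_binary "$cmd" || echo 'not in $PATH')"
done
```

Any name that resolves to the multicall toybox binary is a candidate for being
the thing that actually crashed.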

>> but I mentioned I just redid the lib/password.c plumbing and need to re-audit
>> that list of commands before next release.
> 
> Here's the error message ASAN sends:
> 
> AddressSanitizer:DEADLYSIGNAL
> =================================================================
> ==15453==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7ffd90ffb280 sp 0x7ffd90ffb1a8 T0)
> ==15453==Hint: pc points to the zero page.
> ==15453==The signal is caused by a READ memory access.
> ==15453==Hint: address points to the zero page.
>     #0 0x0  (<unknown module>)
> 
> AddressSanitizer can not provide additional info.
> SUMMARY: AddressSanitizer: SEGV (<unknown module>)
> ==15453==ABORTING
> 
> (ASAN catching reading from a null pointer and "SEGV"-ing is different from the kernel catching one
> and sending a SIGSEGV for a reason I don't know)

The _program_counter_ was zero. It called a null function pointer. And did not
give a stack trace. And that doesn't say what executable did the dumping, or any
context that would say what test it was trying to run that might let me go look
at the script to guess.

>> > To my surprise, Every test from tsort failed, along with some messages from
>> > a "AddressSanitizer".
>>
>>
>> Sigh, I moved the initializations between the two nested loops and the local
>> variable declarations enough times I apparently dropped the plen initialization.
>>
>> Try commit 47946f241a4e.
> 
> Works perfectly, thanks

Yay.

One down, a half-dozen to go, sounds like...

Rob

