[Toybox] strlower() bug

enh enh at google.com
Wed May 22 07:30:18 PDT 2024


On Tue, May 14, 2024 at 2:58 PM Rob Landley <rob at landley.net> wrote:
>
>
>
> On 5/14/24 12:12, enh wrote:
> > On Tue, May 14, 2024 at 1:04 PM Rob Landley <rob at landley.net> wrote:
> >>
> >> On 5/14/24 07:10, enh wrote:
> >> > macOS tests seem to be broken since this commit?
> >> >
> >> > FAIL: find strlower edge case
> >> > echo -ne '' | touch aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ; find . -iname aaaaaȺȺȺȺȺȺȺȺȺ
> >> > --- expected 2024-05-10 17:32:56.000000000 +0000
> >> > +++ actual 2024-05-10 17:32:56.000000000 +0000
> >> > @@ -1 +0,0 @@
> >> > -./aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ
> >>
> >> Sigh. Apple's handling of utf8/unicode continues to be... "a challenge".
> >>
> >> When I run "make test_find" standalone, it gives me:
> >>
> >> scripts/runtest.sh: line 219: syntax error near unexpected token `;'
> >> scripts/runtest.sh: line 219: `      R) LEN=0; B=1; ;&'
> >>
> >> Because bash 3.2 from 2007 doesn't understand ;&
> >
> > yeah, nor does mksh. it hasn't caused me any problems though; i've
> > been ignoring it for years now.
> >
> >> And THEN it goes:
> >>
> >> touch: out of range or illegal time specification: YYYY-MM-DDThh:mm:SS[.frac][tz]
> >> touch: out of range or illegal time specification: YYYY-MM-DDThh:mm:SS[.frac][tz]
> >> FAIL: find newerat
> >> echo -ne '' | find dir -type f -newerat @12345
> >> --- expected    2024-05-14 11:16:40.000000000 -0500
> >> +++ actual      2024-05-14 11:16:40.000000000 -0500
> >> @@ -1 +0,0 @@
> >> -dir/two
> >>
> >> Which is a different error that DOESN'T happen with the global tests, because
> >> those are using toybox touch rather than homebrew's $TOUCH. But it works on
> >> debian. Let's see:
> >>
> >> $ touch --version
> >> touch: illegal option -- -
> >> usage: touch [-A [-][[hh]mm]SS] [-achm] [-r file] [-t [[CC]YY]MMDDhhmm[.SS]]
> >>        [-d YYYY-MM-DDThh:mm:SS[.frac][tz]] file ...
> >>
> >> Thank you, gnu project. I'm gonna assume this is _also_ from 2007. (I made
> >> scripts/prereq/build.sh for a REASON...)
> >
> > no, i think this is a BSD touch.
> >
> > yeah, that looks very like the FreeBSD touch's usage:
> >
> > static void
> > usage(const char *myname)
> > {
> >         fprintf(stderr, "usage: %s [-A [-][[hh]mm]SS] [-achm] [-r file] "
> >                 "[-t [[CC]YY]MMDDhhmm[.SS]]\n"
> >                 "       [-d YYYY-MM-DDThh:mm:SS[.frac][tz]] "
> >                 "file ...\n", myname);
> >         exit(1);
> > }
> >
> >
> >> Then when I run "make clean macos_defconfig tests" I get:
> >>
> >> Undefined symbols for architecture arm64:
> >>   "_iconv", referenced from:
> >>       _do_iconv in iconv.o
> >>      (maybe you meant: _iconv_main)
> >>   "_iconv_open", referenced from:
> >>       _iconv_main in iconv.o
> >> ld: symbol(s) not found for architecture arm64
> >>
> >> Because the Makefile has:
> >>
> >> tests: ASAN=1
> >> tests: toybox
> >>         scripts/test.sh
> >>
> >> And ASAN apparently breaks on homebrew's toolchain but not debian's toolchain.
> >> Why does it break there but not on Linux...
> >>
> >> probe cc -Wall -Wundef -Werror=implicit-function-declaration
> >> -Wno-char-subscripts -Wno-pointer-sign -funsigned-char
> >> -Wno-deprecated-declarations -Wno-string-plus-int -Wno-invalid-source-encoding
> >> -fsanitize=address -O1 -g -fno-omit-frame-pointer -fno-optimize-sibling-calls
> >> -xc -o /dev/null -
> >> error: cannot parse the debug map for '/dev/null': The file was not recognized
> >> as a valid object file
> >> clang: error: dsymutil command failed with exit code 1 (use -v to see invocation)
> >>
> >> Because it tries to read back the -o output we discarded, and fails when it
> >> can't do so, thus all library probes fail and it tries to build with no
> >> libraries. But only when ASAN is enabled, because ASAN uses -o as INPUT. Bravo.
> >>
> >> None of this is the actual unicode failure, this is just ambient macos...
>
> FAIL: find strlower edge case
> echo -ne '' | touch aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ; find . -iname aaaaaȺȺȺȺȺȺȺȺȺ
> --- expected    2024-05-14 13:32:19.000000000 -0500
> +++ actual      2024-05-14 13:32:19.000000000 -0500
> @@ -1 +0,0 @@
> -./aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ
> make: *** [tests] Error 1
> cfarm104 (homebrew):toybox landley$ ls generated/testdir/testdir/
> aaaaa?????????
> $ LC_ALL=en_US.UTF-8 ls generated/testdir/testdir
> aaaaa?????????
> $ generated/testdir/ls generated/testdir/testdir
> aaaaa\342\261\245\342\261\245\342\261\245\342\261\245\342\261\245\342\261\245\342\261\245\342\261\245\342\261\245
> $ echo -./aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ
> -./aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ
> $ generated/testdir/ls -N generated/testdir/testdir
> aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ
> cfarm104 (homebrew):toybox landley$ generated/testdir/ls -N
> generated/testdir/testdir
> aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ
> cfarm104 (homebrew):toybox landley$ ls -N generated/testdir/testdir
> ls: invalid option -- N
> usage: ls [-@ABCFGHILOPRSTUWabcdefghiklmnopqrstuvwxy1%,] [--color=when] [-D
> format] [file ...]
>
> Why is toybox ls escaping by default here but not on Linux? Hmmm, it's gotta be
> this call in crunch_qb():
>
>     // scrute the inscrutable, eff the ineffable, print the unprintable
>     else if ((len = wcrtomb(buf, wc, 0) ) == -1) len = 1;
>
> Once again, I wist for stable/portable unicode functions in lib/unicode.c. I
> know why I haven't GOT them (mostly), but this is just ridiculous. (They don't
> have to be GREAT, but NOT THAT...)
>
> (There's only 100k code points and MOSTLY I'm doing tests that return ONE BIT
> answers. I'm aware it's a trap, but DUDE...)
>
> Anyway, STILL not the actual issue at hand, the issue is that:
>
> cfarm104 (homebrew):toybox landley$ generated/testdir/find
> generated/testdir/testdir -iname aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ
> generated/testdir/testdir/aaaaaⱥⱥⱥⱥⱥⱥⱥⱥⱥ
> cfarm104 (homebrew):toybox landley$ generated/testdir/find
> generated/testdir/testdir -iname aaaaaȺȺȺȺȺȺȺȺȺ
> cfarm104 (homebrew):toybox landley$
>
> The upper case string is not converting into the lower case string. Ok, let's
> stick a +dprintf(2, "%d->%d\n", c, towlower(c)); into strlower() and it says
> "570->58" which... is a colon? Hmmm, prepending LC_ALL=en_US.UTF-8 did not
> change that.
>
> It looks like macos towlower() refuses to return expanding unicode characters.
> Possibly to avoid exactly the kind of bug this fixed, in exchange for corrupting
> the data.

yeah, i don't know whether it's on purpose or a bug, but that does
seem to be the case... i tested with another Latin Extended-B
character whose uppercase and lowercase forms are both in the same
block (and thus have the same utf8 encoding length), and macOS
towlower() does work for that.

hmm, actually maybe it's just that their Unicode data is out of date?
it looks like they don't know about Latin Extended-C at all? a code
point like U+2c62 that gets _smaller_ (because its lowercase form is
in the IPA Extensions block) doesn't work either.
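
to make the length change concrete, here's a minimal sketch in C (not toybox code; `utf8_len()` and `case_delta()` are hypothetical helpers made up for illustration) that computes how many utf8 bytes a code point needs and how that changes across the two case mappings above. the U+023A -> U+2C65 and U+2C62 -> U+026B pairs are the standard Unicode case mappings, not anything macOS returns:

```c
#include <assert.h>
#include <stdio.h>

// Bytes needed to encode a code point as utf8, or 0 if it is out of range.
// Hypothetical helper for illustration; not a toybox function.
static int utf8_len(unsigned cp)
{
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  if (cp < 0x110000) return 4;

  return 0;
}

// Change in encoded length when "upper" is replaced by "lower".
// Positive means the string grows, negative means it shrinks.
static int case_delta(unsigned upper, unsigned lower)
{
  int before = utf8_len(upper), after = utf8_len(lower);

  // Both code points must be encodable for the delta to mean anything.
  assert(before && after);
  printf("U+%04X (%d bytes) -> U+%04X (%d bytes): %+d\n",
    upper, before, lower, after, after - before);

  return after - before;
}
```

`case_delta(0x23A, 0x2C65)` gives +1 (the character in the failing test grows from 2 bytes to 3) and `case_delta(0x2C62, 0x26B)` gives -1, so a buffer sized from the input string is only safe if the conversion never grows.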

i did try looking in FreeBSD, but i've never understood how this stuff
works there. i'm guessing from the fact i've never found them that the
implementations are all generated at build time, subtly enough that my
attempts to grep for the generators fail.

hmm... looking at Apple's online FreeBSD code, it looks like they have
very different (presumably older) FreeBSD code
[https://opensource.apple.com/source/Libc/Libc-320.1.3/locale/FreeBSD/tolower.c.auto.html],
and the footer of the data file that code reads implies that they're
using data from Unicode 3.2 (released in 2002, which would make sense
given the 2002 BSD copyright date in the tolower.c source):
```
/usr/share/locale/UTF-8$ xxd LC_CTYPE | tail
00016530: 4004 2800 4004 2800 4004 2800 4004 2800  @.(.@.(.@.(.@.(.
00016540: 4004 2800 4004 2800 8004 2800 0004 2800  @.(.@.(...(...(.
00016550: 0004 2800 0004 2800 8004 2800 8004 2800  ..(...(...(...(.
00016560: 8004 2800 4004 2800 4004 2800 4004 2800  ..(.@.(.@.(.@.(.
00016570: 4004 2800 4004 2800 4004 2800 4004 2800  @.(.@.(.@.(.@.(.
00016580: 4004 2800 4004 2800 4004 2800 4004 2800  @.(.@.(.@.(.@.(.
00016590: 8004 2800 8004 2800 4004 2800 4004 2800  ..(...(.@.(.@.(.
000165a0: 4004 2800 4004 2800 8004 2800 8004 2800  @.(.@.(...(...(.
000165b0: 8004 2800 556e 6963 6f64 6520 332e 3220  ..(.Unicode 3.2
000165c0: 4368 6172 6163 7465 7220 5479 7065 7300  Character Types.
/usr/share/locale/UTF-8$
```

so, yeah, i don't think there was anything clever or mysterious going
on here --- macOS is just using Unicode data from 22 years ago. (which
is an amusing real-world example of why i keep saying "you probably
don't want to get into the business of redistributing Unicode data; it
changes every year" :-) )
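
if stubbing out the test turns out to be the answer, one option is to probe the libc at runtime rather than checking for macOS by name. a rough sketch, assuming the caller has already called setlocale() and that a towlower() with current tables maps U+023A to U+2C65 and U+2C62 to U+026B (`lowercases_to()` and `libc_knows_latin_ext_c()` are names made up here, not existing toybox functions):

```c
#include <wchar.h>
#include <wctype.h>

// 1 if the C library's towlower() maps "upper" to "lower" in the
// current locale, 0 otherwise. Hypothetical helper for illustration.
static int lowercases_to(wint_t upper, wint_t lower)
{
  return towlower(upper) == lower;
}

// 1 if the libc case tables handle the two mappings discussed in this
// thread: one that grows in utf8 and one that shrinks.
static int libc_knows_latin_ext_c(void)
{
  return lowercases_to(0x23A, 0x2C65) && lowercases_to(0x2C62, 0x26B);
}
```

a test harness could print the result of `libc_knows_latin_ext_c()` and skip the "find strlower edge case" test when it is 0. that only detects stale tables; it doesn't work around them.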

> I don't know how to fix this other than stubbing out the test on macos, or
> adding lib/unicode.c. (I _really_ want to find an 80/20 there. I'm aware I have
> failed at least three previous attempts, and am 2/3 of the way to clearing off
> my laptop so I can install the new OS version and put the big ram sticks back so
> NOW IS NOT THE TIME, but still...)
>
> Rob

