[Toybox] The cut -C test is failing because bionic's wcwidth() doesn't match glibc or musl.

enh enh at google.com
Wed Oct 20 14:35:50 PDT 2021


On Tue, Oct 19, 2021 at 11:56 PM Rob Landley <rob at landley.net> wrote:

> On 10/19/21 10:57 AM, enh wrote:
> > are you sure? remember that if a test is currently checked in [and
> Android uses
> > the toy; my test runner reports but ignores failures for tests where
> readlink
> > says it's not actually toybox], that means that the tests pass on
> Android.
> >
> > there's even a CTS test:
> > ```
> > TEST(wchar, wcwidth_non_spacing_and_enclosing_marks_and_format) {
> >   if (!have_dl()) return;
> >
> >   EXPECT_EQ(0, wcwidth(0x0300)); // Combining grave.
> >   EXPECT_EQ(0, wcwidth(0x20dd)); // Combining enclosing circle.
> >   EXPECT_EQ(0, wcwidth(0x00ad)); // Soft hyphen (SHY).
> >   EXPECT_EQ(0, wcwidth(0x200b)); // Zero width space.
> > }
> > ```
> >
> > my guess is that you're using a statically-linked binary?
>
> Yup. I can't test dynamic bionic on devuan, bionic isn't installed in /lib.
>

if you don't mind creating a /system directory (or symlink) in /, you can
just run like that. there's a script in AOSP bionic that sets up all the
links you need to run the bionic tests on the host.


> (And when I _tried_ to create a chroot as a normal user, bionic segfaulted
> before calling main() because /dev/null wasn't there. Did that ever get
> fixed?
> Hard to run an init linked against bionic when that's the case.


have you tried? i'm pretty sure there's a special case for pid 1, for
exactly this reason :-)


> The kernel guys
> never merged my https://lkml.org/lkml/2017/9/13/651 stuff. I gave 'em
> three
> strikes and moved on. I have a todo item to teach cpio to hallucinate /dev
> nodes...)
>
> > bionic doesn't have a
> > "static libdl", so when it tries to dlopen() icu4c to handle an i18n
> question,
>
> ... why would it need to do that?
>

because the last thing the world needs is _multiple_ copies of all the i18n
data. icu is the source of truth for that on Android. (we've had bugs in
the past in this area that made us clean up our act.)


> > that'll fail and in most cases bionic will fall back to "what do i know
> about
> > ASCII?" but otherwise report failure. (that's what the first line of the
> test is
> > checking too --- "if we're the static version of the tests, skip this
> test
> > because this isn't available".)
>
> It's not skipping them for me, but I'm still using android-ndk-r21d so
> maybe the
> skipping logic is getting either NDK weirdness or old version weirdness?
>
> But back to "why does this need to to dlopen() anything, I that that was a
> uniquely glibc failing?"
>
> Sigh. I wrote my own utf8 conversion stuff, I _can_ do my own width
> detection
> logic if libc isn't reliable.
>
> Hmmm, according to https://twitter.com/doctorow/status/1449072198069587969
> the
> unicode version of rot13 is rot8000, which implies ballpark of 16k glyphs.
> Seems
> tractable. Let's see, wikipedia[citation needed] says 144k code points (of
> a
> million and change possible).
>
> I remember Rich being quite proud of his compression scheme for unicode
> evaluation, where's that...
>
> int wcwidth(wchar_t wc)
> {
>         if (wc < 0xffU)
>                 return (wc+1 & 0x7f) >= 0x21 ? 1 : wc ? -1 : 0;
>         if ((wc & 0xfffeffffU) < 0xfffe) {
>                 if ((table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1)
>                         return 0;
>                 if ((wtable[wtable[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1)
>                         return 2;
>                 return 1;
>         }
>         if ((wc & 0xfffe) == 0xfffe)
>                 return -1;
>         if (wc-0x20000U < 0x20000)
>                 return 2;
>         if (wc == 0xe0001 || wc-0xe0020U < 0x5f || wc-0xe0100 < 0xef)
>                 return 0;
>         return 1;
> }
>
> Oh goddess, what is this doing... The first 256 entries are peeled off by
> the
> first if() statement and taken care of by two ? : that honestly might as
> well
> just pull that later return 0 clause up to the top and if (!wc || ...) it
> all in
> one gulp. (Maybe he's micro-optimizing for performance?)
>
> Then the if (wc < 0xfffe) stanza is a bit silly because we have a HARD
> LIMIT of
> 0x10ffff entries in unicode so there should be an if (wc>0x10ffff) return
> -1 and
> then this should probably be & 0x1effff...
>
> The if (table[]) and if (wtable[]) bits are identical and initially
> confusing
> because of the if (table[table[]) construct, but since we already peeled
> off the
> first 256 wc values that table[wc>>8] part will never fetch entry zero, and
> nothing in the first 512 entries (of _either_ table) is <16, and 16*32 is
> 512 so
> the start of the table seems to be _just_ indexes to later in the table.
> (And if
> you feed it values >1ffff the guarding if() around it drops through, so
> this
> looks like it's actually two disjoint tables stored together.)
>
> And then it's just cleanups for specific cases. 90% of the cleverness here
> is in
> the two tables. table[] is initialized from #include "nonspacing.h" which
> is a
> big array of numbers, and wtable[] is similarly initialized from #include
> "wide.h". Which are both PAINFUL.
>
> Those tables are just under 2k each, and could SERIOUSLY be compressed. For
> example, the first 512 bytes of the first table are all "16" except for a
> sparse
> set of sequential values running from 18 to 59, so it could be initialized
> from
> a bitmask. Let's see, quick and dirty bash to generate said bitmask from
> the table:
>
>   X=0; Y=0; for i in $(cat src/ctype/nonspacing.h | tr , ' ' | xargs); do
>     [ $i -ne 16 ] && let 'Y|=(1<<(31&X))'; [ $((31&++X)) -eq 0 ] &&
>       printf '%x\n' $Y && Y=0; [ $X == 512 ] && break
>   done
>
> And then the C would be something like (untested):
>
>   unsigned bits[] = {0x3f89fff8, 0x13001, 0, 0, 0, 0xf40, 0, 0xc8000000,
>     0x430402, 0, 0, 0x8000, 0, 0, 0x60000, 0}, i, j;
>
>   memset(table, 16, 512);
>   for (i = 0, j = 18; i<512; i++) if (bits[i/32]&(1<<(i&31)) table[i] =
> j++;
>
> And then you go:
>
>   memset(table+512, 0, sizeof(table)-512);
>   memset(table+512+32, 255, 46);
>
> And the rest of it could be filled by traversing another sparse bit array I
> can't be bothered to scrape together right now but it would have... um...
> (for i
> in $(cat src/ctype/nonspacing.h | tr , ' ' | tail -n +24); do [ $i -ne 0 ]
> &&
> echo $i; done | xargs | wc -w)
>
> It would have 249 entries, and the whole thing looks like a dozen lines of
> C to
> init each table and then the function to use the table, putting it within
> range
> of merging if Rich is ok with me using his tables under 0bsd. And then I
> wouldn't _have_ to depend on libc to do this.
>
> Not sure if I _should_, but I _can_. (It was nice to leave this to libc.
> Then it
> wasn't my problem to update it every time Microsoft wrote another check to
> the
> unicode committee. Both glibc and musl can do this when statically linked.
> Sigh.)
>

i'll only send a patch so i can disable this again on Android even if you
_do_ do it, so i don't think it's a great use of your limited time :-)

hang on, let me look out the symlink script... here you go:

https://android.googlesource.com/platform/bionic/+/master/build/run-on-host.sh

that should set you up with everything you need (and both 32- and 64-bit
too, if you're chasing something 32-bit-only).


> Rob
>
> P.S. Rich's other table has some 17s mixed in the 16s which... I think it
> moves
> in runs of 8? Very small bitmap if so? It would be so much easier to work
> out
> the alignment if he'd wordwrapped his tables to a consistent number of
> entries
> per line, but no. Eh, runs of 4, 54 bits total. Plus two isolated
> weirdos.) And
> most of the nonzero values in the latter part are 255, so traverse a
> bitmap of
> THOSE and there's not much left to initialize afterwards. Yeah, a dozen
> lines
> per table of initialization is looking doable, within range of sticking it
> in
> portability.c...
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20211020/787a7cbf/attachment-0001.htm>


More information about the Toybox mailing list