[Toybox] The cut -C test is failing because bionic's wcwidth() doesn't match glibc or musl.

Rob Landley rob at landley.net
Tue Oct 19 23:56:21 PDT 2021


On 10/19/21 10:57 AM, enh wrote:
> are you sure? remember that if a test is currently checked in [and Android uses
> the toy; my test runner reports but ignores failures for tests where readlink
> says it's not actually toybox], that means that the tests pass on Android.
> 
> there's even a CTS test:
> ```
> TEST(wchar, wcwidth_non_spacing_and_enclosing_marks_and_format) {
>   if (!have_dl()) return;
> 
>   EXPECT_EQ(0, wcwidth(0x0300)); // Combining grave.
>   EXPECT_EQ(0, wcwidth(0x20dd)); // Combining enclosing circle.
>   EXPECT_EQ(0, wcwidth(0x00ad)); // Soft hyphen (SHY).
>   EXPECT_EQ(0, wcwidth(0x200b)); // Zero width space.
> }
> ```
> 
> my guess is that you're using a statically-linked binary?

Yup. I can't test dynamic bionic on devuan, bionic isn't installed in /lib.

(And when I _tried_ to create a chroot as a normal user, bionic segfaulted
before calling main() because /dev/null wasn't there. Did that ever get fixed?
Hard to run an init linked against bionic when that's the case. The kernel guys
never merged my https://lkml.org/lkml/2017/9/13/651 stuff. I gave 'em three
strikes and moved on. I have a todo item to teach cpio to hallucinate /dev nodes...)

> bionic doesn't have a
> "static libdl", so when it tries to dlopen() icu4c to handle an i18n question,

... why would it need to do that?

> that'll fail and in most cases bionic will fall back to "what do i know about
> ASCII?" but otherwise report failure. (that's what the first line of the test is
> checking too --- "if we're the static version of the tests, skip this test
> because this isn't available".)

It's not skipping them for me, but I'm still using android-ndk-r21d so maybe the
skipping logic is getting either NDK weirdness or old version weirdness?

But back to "why does this need to to dlopen() anything, I that that was a
uniquely glibc failing?"

Sigh. I wrote my own utf8 conversion stuff, I _can_ do my own width detection
logic if libc isn't reliable.

Hmmm, according to https://twitter.com/doctorow/status/1449072198069587969 the
unicode version of rot13 is rot8000, which implies ballpark of 16k glyphs. Seems
tractable. Let's see, wikipedia[citation needed] says 144k code points (of a
million and change possible).

I remember Rich being quite proud of his compression scheme for unicode
evaluation, where's that...

int wcwidth(wchar_t wc)
{
        if (wc < 0xffU)
                return (wc+1 & 0x7f) >= 0x21 ? 1 : wc ? -1 : 0;
        if ((wc & 0xfffeffffU) < 0xfffe) {
                if ((table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1)
                        return 0;
                if ((wtable[wtable[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1)
                        return 2;
                return 1;
        }
        if ((wc & 0xfffe) == 0xfffe)
                return -1;
        if (wc-0x20000U < 0x20000)
                return 2;
        if (wc == 0xe0001 || wc-0xe0020U < 0x5f || wc-0xe0100 < 0xef)
                return 0;
        return 1;
}

Oh goddess, what is this doing... The first 256 entries are peeled off by the
first if() statement and taken care of by two ? : that honestly might as well
just pull that later return 0 clause up to the top and if (!wc || ...) it all in
one gulp. (Maybe he's micro-optimizing for performance?)

Then the if (wc < 0xfffe) stanza is a bit silly because we have a HARD LIMIT of
0x10ffff entries in unicode so there should be an if (wc>0x10ffff) return -1 and
then this should probably be & 0x1effff...

The if (table[]) and if (wtable[]) bits are identical and initially confusing
because of the if (table[table[]) construct, but since we already peeled off the
first 256 wc values that table[wc>>8] part will never fetch entry zero, and
nothing in the first 512 entries (of _either_ table) is <16, and 16*32 is 512 so
the start of the table seems to be _just_ indexes to later in the table. (And if
you feed it values >1ffff the guarding if() around it drops through, so this
looks like it's actually two disjoint tables stored together.)

And then it's just cleanups for specific cases. 90% of the cleverness here is in
the two tables. table[] is initialized from #include "nonspacing.h" which is a
big array of numbers, and wtable[] is similarly initialized from #include
"wide.h". Which are both PAINFUL.

Those tables are just under 2k each, and could SERIOUSLY be compressed. For
example, the first 512 bytes of the first table are all "16" except for a sparse
set of sequential values running from 18 to 59, so it could be initialized from
a bitmask. Let's see, quick and dirty bash to generate said bitmask from the table:

  X=0; Y=0; for i in $(cat src/ctype/nonspacing.h | tr , ' ' | xargs); do
    [ $i -ne 16 ] && let 'Y|=(1<<(31&X))'; [ $((31&++X)) -eq 0 ] &&
      printf '%x\n' $Y && Y=0; [ $X == 512 ] && break
  done

And then the C would be something like (untested):

  unsigned bits[] = {0x3f89fff8, 0x13001, 0, 0, 0, 0xf40, 0, 0xc8000000,
    0x430402, 0, 0, 0x8000, 0, 0, 0x60000, 0}, i, j;

  memset(table, 16, 512);
  for (i = 0, j = 18; i<512; i++) if (bits[i/32]&(1<<(i&31)) table[i] = j++;

And then you go:

  memset(table+512, 0, sizeof(table)-512);
  memset(table+512+32, 255, 46);

And the rest of it could be filled by traversing another sparse bit array I
can't be bothered to scrape together right now but it would have... um... (for i
in $(cat src/ctype/nonspacing.h | tr , ' ' | tail -n +24); do [ $i -ne 0 ] &&
echo $i; done | xargs | wc -w)

It would have 249 entries, and the whole thing looks like a dozen lines of C to
init each table and then the function to use the table, putting it within range
of merging if Rich is ok with me using his tables under 0bsd. And then I
wouldn't _have_ to depend on libc to do this.

Not sure if I _should_, but I _can_. (It was nice to leave this to libc. Then it
wasn't my problem to update it every time Microsoft wrote another check to the
unicode committee. Both glibc and musl can do this when statically linked. Sigh.)

Rob

P.S. Rich's other table has some 17s mixed in the 16s which... I think it moves
in runs of 8? Very small bitmap if so? It would be so much easier to work out
the alignment if he'd wordwrapped his tables to a consistent number of entries
per line, but no. Eh, runs of 4, 54 bits total. Plus two isolated weirdos.) And
most of the nonzero values in the latter part are 255, so traverse a bitmap of
THOSE and there's not much left to initialize afterwards. Yeah, a dozen lines
per table of initialization is looking doable, within range of sticking it in
portability.c...



More information about the Toybox mailing list