[Toybox] Test suite and UTF8.

Rob Landley rob at landley.net
Fri Jan 15 12:28:41 PST 2016


Izabera on freenode sent in a bug report, which boils down to sort.c
predating the FLAG_x macros: when I converted it I apparently screwed
some stuff up. (I do some strange things with the macros, but if
(flags*FLAG_f) blah(); CAN'T work...)
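
To spell out why that can't work, here's a minimal sketch. The bit values
are made up for illustration (toybox generates the real FLAG_x macros),
and the function names are hypothetical, not anything in sort.c:

```c
// Hypothetical bit values for illustration only.
#define FLAG_f (1 << 0)
#define FLAG_r (1 << 1)

// Broken test: a product is nonzero whenever ANY flag bit is set,
// so -r alone would also switch on case folding.
int fold_case_broken(unsigned flags)
{
  return (flags * FLAG_f) != 0;
}

// Intended test: bitwise AND isolates the -f bit.
int fold_case(unsigned flags)
{
  return (flags & FLAG_f) != 0;
}
```

With only -r set, fold_case_broken() reports true and fold_case()
reports false, which is the kind of misbehavior the multiplication
causes.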

The other issue this raised is making -f work with utf8, which Izabera
suggested _could_ work if instead of a for() loop calling toupper() sort
instead called strcmp or strcasecmp based on FLAG_f. Good idea, so let's
do that.
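
Roughly, the idea looks like the sketch below. This is not the actual
sort.c code: compare_keys() and the FLAG_f value are stand-ins. Also
note the assumption baked in here: whether strcasecmp() folds anything
beyond ASCII depends on the libc and the current locale, so this only
_could_ handle utf8, it isn't guaranteed to.

```c
#include <string.h>   // strcmp
#include <strings.h>  // strcasecmp (POSIX)

#define FLAG_f 1  // hypothetical value, for illustration only

// Compare two keys, ignoring case only when -f was given.
// Hands the whole string to libc instead of looping over toupper().
int compare_keys(const char *a, const char *b, unsigned flags)
{
  return (flags & FLAG_f) ? strcasecmp(a, b) : strcmp(a, b);
}
```

Without -f, "B" sorts before "a" (byte value 66 vs 97). With -f the two
are compared as "b" and "a", so the order flips.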

But updating tests/sort.test to show this bug got fixed (which is always
a good idea: if you submit a bug, submit a test that used to fail
because of the bug and now passes after the fix)... is tricky. The
problem is the GNU/dammit sort changes its behavior based on LC_BLAH
internationalization, and we're matching the LC_ALL=c behavior, which is
what source package builds need. (Sort in ASCII order, so all uppercase
letters come before all lowercase letters. This is why "toybox ls" on
the toybox source puts Config.in and LICENSE and README at the top,
while Ubuntu's ls mixes them in with the rest.)
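
The ASCII ordering described above is what you get from a plain
bytewise comparison. A small sketch using qsort() and strcmp() (the
helper names are made up, this isn't how toybox ls is written):

```c
#include <stdlib.h>  // qsort
#include <string.h>  // strcmp

// qsort callback: compare two strings byte by byte, no locale involved.
static int bytewise(const void *a, const void *b)
{
  return strcmp(*(const char *const *)a, *(const char *const *)b);
}

// Sort an array of names in plain ASCII order.
void sort_names(const char **names, size_t count)
{
  qsort(names, count, sizeof(*names), bytewise);
}
```

Given main.c, README, Config.in, lib, and LICENSE, this produces
Config.in, LICENSE, README, lib, main.c: every uppercase letter has a
smaller byte value than every lowercase one.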

I want UTF8 awareness when comparing case insensitively, but ASCII sort
order otherwise. (Yeah, judgement call, but toybox has always drawn the
line at "UTF8 support yes, full internationalization of currencies and
dates and help text no".)

In theory, this means I set LC_ALL=c in scripts/test.sh the same way I
do in scripts/make.sh, but I don't want to accidentally disable UTF8
support in the host version I'm testing against. Internationalization is
a can of worms generally requiring external files to look things up in a
per-country database, which makes it out of scope for toybox. But the
majority of the planet doesn't speak English, and now that there's one
format for international text, supporting it is obviously the right
thing to do. (Yes, several countries still use historical encodings that
got there first, but they're out of scope.)

Anyway, that's why I set LC_COLLATE=c in scripts/test.sh instead of
LC_ALL=c. I _think_ I want LC_CTYPE=UTF8 and the rest (LC_COLLATE,
LC_MONETARY, LC_NUMERIC, LC_TIME, LC_MESSAGES, at least according to
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html)
set to C. (I sort of want LC_ALL_EXCEPT_THIS_BIT, but the POSIX page
doesn't actually define LC_ALL as an environment variable, so I dunno
whether more specific ones override it if set.)
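
For what it's worth, the Environment Variables chapter (chapter 8) of
that same POSIX edition does describe a precedence: LC_ALL wins if set
and non-empty, then the category's own variable, then LANG. So the
specific ones do not override LC_ALL. A sketch of that lookup, where
effective_locale() is a made-up helper and not a libc function:

```c
#include <stdlib.h>  // getenv

// Return the locale name that applies to one category (for example
// "LC_COLLATE"), following the POSIX precedence: LC_ALL, then the
// category's own variable, then LANG. Empty values count as unset.
// Returns NULL when none is set, meaning the implementation default.
const char *effective_locale(const char *category)
{
  const char *names[] = {"LC_ALL", category, "LANG"};

  for (size_t i = 0; i < sizeof(names) / sizeof(*names); i++) {
    const char *value = getenv(names[i]);

    if (value && *value) return value;
  }

  return NULL;
}
```

Under that rule, setting LC_COLLATE only takes effect while LC_ALL is
unset or empty, and a system that defines nothing but LANG resolves
every category to the LANG value.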

Oddly, my Ubuntu system doesn't define LC_ anything, just:

  LANG=en_US.UTF-8
  LANGUAGE=en_US:en

Which is another tangent entirely. (See "full internationalization: can
of worms.")

Anyway, lemme know if I broke sort,

Rob


