[Toybox] [PATCH] lib/lib human_readable_long fix utf-8 LC_NUMERIC

Rob Landley rob at landley.net
Fri Sep 11 18:02:28 PDT 2020


On 9/11/20 2:27 PM, enh wrote:
>     I try to have strerror() display the error codes (but still think it's a missed
>     opportunity that the "C" locale doesn't output EPERM and friends as the actual
>     strings), and keep my error message vocabulary small and simple. I also try to
>     preserve and display utf8 input for usernames and filenames and such.
> 
> 
> glibc 2.32 actually added new functions for 9 -> "KILL" and 1 -> "EINVAL". i
> plan on adding those to bionic too, if only so i can add the moreutils "errno"
> to toybox, which i've found useful at times. (one of these days i'll write a
> static analyzer to catch people adding new %d errno printfs to the code base. do
> they not know about strerror()? do they not know that errno values aren't
> constant across architectures?)

Adding an "errno" to toybox seems reasonably straightforward, given that:

$ ~/android/android-ndk-r21b/toolchains/llvm/prebuilt/linux-x86_64/bin/x86_64-linux-android-cc \
    -dM -E - <<< '#include <errno.h>' | sed -n 's/^#define E/E/p' | sort -k2,2n

seems to work fine. :)
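
For illustration, the lookup an "errno" command needs could be sketched in C
roughly like this. The names errno_names, errno_to_name() and name_to_errno()
are made up for this example, and only a handful of codes are listed; a real
version would presumably generate the table from preprocessor output like the
above so it stays correct per architecture.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

// Hypothetical table: pairs each errno value with its symbolic name.
// #x stringifies the macro name before it expands to a number.
struct errno_name { int num; const char *name; };
#define E(x) { x, #x }
static const struct errno_name errno_names[] = {
  E(EPERM), E(ENOENT), E(EINTR), E(EIO), E(EACCES), E(EEXIST), E(EINVAL),
  E(ENOSPC)
};
#undef E

#define ERRNO_COUNT (sizeof(errno_names) / sizeof(*errno_names))

// Number to name, or NULL when the value isn't in the table.
static const char *errno_to_name(int num)
{
  size_t i;

  for (i = 0; i < ERRNO_COUNT; i++)
    if (errno_names[i].num == num) return errno_names[i].name;

  return NULL;
}

// Name to number, or -1 when the name isn't in the table.
static int name_to_errno(const char *name)
{
  size_t i;

  for (i = 0; i < ERRNO_COUNT; i++)
    if (!strcmp(errno_names[i].name, name)) return errno_names[i].num;

  return -1;
}
```

Because the values come from <errno.h> at build time, nothing in the table
hardcodes a number, which is the point about errno values differing between
architectures.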

>     Beyond that, I've stayed away from internationalization up until now, and if
>     your response is "kill it with fire" I can revert it.
> 
> that would be my choice, certainly.

Which brings us back to megabytes on an embedded system that only _has_ 16 megs
of ram being insufficient granularity, and reserving 10 digits of space and
guaranteeing we'll never use more than 5 seeming silly.

Eh, I'll try to work something out.

> if you weren't already convinced by my examples, a couple more common ones: when
> you add i18n to _output_ people (not unreasonably) expect it for _input_ too,
> which is a nightmare you don't want to get into. also the answer to "will the
> kernel ever localize the content of files in /proc?" means we're not doing our
> actual intended users (human or machine) any favors here. it's easier to learn
> when things are consistently wrong (like me dealing with the victorian^WUS use
> of Fahrenheit and the twelve-hour clock

"Freedom units."

> --- since they're _consistently_ wrong i
> know to be on the lookout, and i can cope). [for an example of the confusion
> that comes from inconsistency that you're already familiar with: which way round
> do you write a Korean or Japanese name in English?]

The instructions on filling out a form in No Time for Sergeants (1958) went
"Last name first, first name, middle name, last" so he wrote his last name, then
his first name, then his first name again, then his middle name, then his last name.

>     > if you're serious about localization, you're going to need
>     > icu4c anyway, which isn't scared to embrace all the diversity that's
>     > actually out there (rather than the tiny subset that the POSIX folks could
>     > imagine, which doesn't even stretch to the need for the genitive case in
>     > dates, to pick one random fairly mainstream example).
> 
>     Nope. Not going there.
> 
> nor should you. "you are not an app", so real people never need to see you.
> developers and sysadmins do, often from machines in random
> locations/locales/timezones, and they (and their scripts) are better served by
> consistency.

In the disability world they have a thing "competing access needs" which is an
attempt to formalize "you can't make everybody happy simultaneously".

Still good to figure out the largest subset, which can be darn non-obvious...

>     I vaguely intend to have toysh command line editing handle right-to-left mode
>     due to a completionist streak, 
> 
> to me that's different. that's more in the bucket of "full UTF-8 support", which
> is clearly a good thing. _someone_ is going to have to deal with Arabic
> filenames at some point (and they won't necessarily be able to read them).
> thanks to confusion about uppercasing/lowercasing Turkish dotted/dotless 'i's i
> see rather more Turkish input than you'd expect from someone who doesn't speak
> Turkish and has never been there.

My "wait for somebody to complain" policy is partly "am I serving enough of the
population that this isn't a real issue" and partly "someone with domain
expertise and a test case shows up to walk me through it". Deferring the
decision about whether or not to do something isn't the same as deciding up
front not to do it.

But in this case: no I'm not doing date formats, and should probably remove
CONFIG_I18N entirely because the only thing it's currently being used for is to
guard some utf8 support, which is mostly unconditional since I did my own
utf8towc() in lib.c. ("This can config out of libc" was a uClibc thing.)
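
To show what "unconditional utf8 support without libc" amounts to, here is a
minimal decoder sketch. It is NOT the utf8towc() in lib.c: the name
utf8_decode() and its return convention (bytes consumed, -1 for invalid, -2 for
truncated) are assumptions made up for this example.

```c
#include <stddef.h>

// Hypothetical decoder: stores one code point in *wc and returns the number
// of bytes consumed, -1 for an invalid sequence, -2 for a truncated one.
static int utf8_decode(unsigned *wc, const char *str, size_t len)
{
  const unsigned char *s = (const unsigned char *)str;
  unsigned first, result, min;
  int extra, i;

  if (!len) return -2;
  first = s[0];
  if (first < 0x80) {
    *wc = first;
    return 1;
  }

  // The lead byte gives the sequence length and the smallest code point
  // that is allowed to use that length (anything smaller is overlong).
  if ((first & 0xe0) == 0xc0) { extra = 1; min = 0x80; result = first & 0x1f; }
  else if ((first & 0xf0) == 0xe0) { extra = 2; min = 0x800; result = first & 0x0f; }
  else if ((first & 0xf8) == 0xf0) { extra = 3; min = 0x10000; result = first & 0x07; }
  else return -1;

  // Each continuation byte must look like 10xxxxxx and adds six bits.
  for (i = 1; i <= extra; i++) {
    if ((size_t)i >= len) return -2;
    if ((s[i] & 0xc0) != 0x80) return -1;
    result = (result << 6) | (s[i] & 0x3f);
  }

  // Reject overlong encodings, surrogates, and values past U+10FFFF.
  if (result < min || result > 0x10ffff) return -1;
  if (result >= 0xd800 && result <= 0xdfff) return -1;

  *wc = result;
  return extra + 1;
}
```

Fed the two bytes 0xc4 0xb0 this yields U+0130, the Turkish dotted capital I,
and none of it depends on the locale or on libc's multibyte functions.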

> chinese-numbering-influenced countries sometimes count in ten-thousands rather
> than thousands, and indian numbering can let you have something 12,34,56,789 (2s
> and 3s in the same number).

I noticed that on the wikipedia[citation needed] page. India's a billion people.
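
A rough sketch of why one hardwired "comma every three digits" rule doesn't
cover these cases: the group sizes have to be data. The function group_digits()
below is hypothetical. Its grouping argument lists group sizes from the right
with the last entry repeating, loosely modeled on (and simpler than) the
grouping field that localeconv() reports.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: insert sep into a string of digits. grouping holds
// group sizes starting from the right, and the final size repeats, so
// "\3" is Western thousands, "\3\2" is Indian, "\4" is ten-thousands.
// Returns 0 on success, -1 on empty input, bad grouping, or short buffer.
static int group_digits(const char *digits, const char *grouping, char sep,
  char *out, size_t outlen)
{
  size_t n = strlen(digits), seps = 0, left, pos;
  const char *g = grouping;
  int size, count = 0;

  if (!n || *grouping <= 0) return -1;

  // First pass: count how many separators will be needed.
  for (left = n, size = *g; left > (size_t)size; seps++) {
    left -= size;
    if (g[1] > 0) size = *++g;
  }
  if (n + seps + 1 > outlen) return -1;

  // Second pass: fill the output buffer from the right.
  pos = n + seps;
  out[pos] = 0;
  g = grouping;
  size = *g;
  for (left = n; left; count++) {
    if (count == size) {
      out[--pos] = sep;
      count = 0;
      if (g[1] > 0) size = *++g;
    }
    out[--pos] = digits[--left];
  }

  return 0;
}
```

With "123456789" as input, grouping "\3\2" produces 12,34,56,789 and grouping
"\4" produces 1,2345,6789, which is about where "you need icu4c" starts looking
like the honest answer.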

> #include <if you really care, you need icu4c because humans are really really weird>
>  
>     If "consistently show megabytes for systems > X gigabytes" vs "consistently show
>     kilobytes for systems < X gigabytes" is good enough, even when the resulting
>     numbers are long, I'm happy to rip the comma support back out.
> 
> personally, that's my preferred solution. (and what i _think_ current procps top
> is doing, though i don't have enough systems to be sure, and i'm not sure what
> to think about your results.)

There's the answer then.

(Still 10 spaces and using 5. Grumble grumble...)
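
A sketch of what that rule might look like, under assumptions: the names
mem_unit_shift() and format_mem() are invented here, and the cutoff is passed
in as a parameter because the "X gigabytes" above was never pinned down.

```c
#include <stdio.h>

// Hypothetical helper: choose ONE unit for the whole display from total RAM,
// so every field stays comparable. threshold_bytes stands in for the
// unspecified "X gigabytes" cutoff.
static unsigned mem_unit_shift(unsigned long long total_bytes,
  unsigned long long threshold_bytes)
{
  return (total_bytes >= threshold_bytes) ? 20 : 10; // megabytes : kilobytes
}

// Format a byte count in the chosen unit: plain digits plus a suffix, no
// separators and no locale lookups. Returns what snprintf() returns.
static int format_mem(char *buf, size_t len, unsigned long long bytes,
  unsigned shift)
{
  return snprintf(buf, len, "%llu%c", bytes >> shift,
    (shift == 20) ? 'M' : 'K');
}
```

With a 1 gigabyte cutoff, a 16 megabyte board shows its total as 16384K and an
8 gigabyte machine shows 8192M: consistent within each system, at the cost of
longer numbers.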

>     > no "real people" should ever need to look at this, but machines and developers
>     > will, and every bit of localization hurts the real audience.
> 
>     Yes and no. There's a lot of developers out there who don't speak english,
>     certainly not as their first language. I don't want to unnecessarily exclude
>     them.
> 
> they're going to have more trouble with --help output than they are here. and
> like i said, i'm pretty sure that "C/POSIX number formats" is something you
> need to learn pretty early on. no-one likes a for loop condition like x <=
> 70,2 after all, and the kernel's never going to localize :-)

I'm attempting to minimize the ethnocentrism, but can't eliminate it.
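
For anyone who hasn't hit the "70,2" problem yet, a small illustration of what
the C/POSIX format does with it. The wrapper name parse_c_number() is made up,
and the sketch assumes the program never calls setlocale(), so strtod() runs in
the default "C" locale.

```c
#include <stdlib.h>

// Hypothetical wrapper around strtod(): returns the parsed value and points
// *rest at whatever text was left unparsed. In the "C" locale the only radix
// character is '.', so parsing stops at a comma.
static double parse_c_number(const char *text, const char **rest)
{
  char *end;
  double value = strtod(text, &end);

  *rest = end;

  return value;
}
```

Given "70,2" this returns 70 and leaves ",2" behind, while "70.2" is consumed
whole. Tools and scripts parsing each other's output depend on that staying
true.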

>     > at least 15'936.2 would be a valid C++14 identifier (and i'm assuming will
>     > make it into C2x) :-)
> 
>     That's the opposite of helping.
> 
> sorry, just winding you up. probably not the best time for it!

I was mostly going "the alternative to commas is half-assing this, you ok with
that?" and you were.

Rob

