[Toybox] [PATCH] lib/lib human_readable_long fix utf-8 LC_NUMERIC

Fri Sep 11 12:27:55 PDT 2020

On Wed, Sep 9, 2020 at 11:15 PM Rob Landley <rob at landley.net> wrote:

> On 9/9/20 7:19 PM, enh via Toybox wrote:
> > don't apps need libc localization? not really. the POSIX localization
> > functionality is so anaemic that it's really not useful even for "major
> > minority" languages.
>
> I try to have strerror() display the error codes (but still think it's a
> missed opportunity that the "C" locale doesn't output EPERM and friends as
> the actual strings), and keep my error message vocabulary small and simple.
> I also try to preserve and display utf8 input for usernames and filenames
> and such.
>

glibc 2.32 actually added new functions for 9 -> "KILL" and 1 -> "EPERM".
i plan on adding those to bionic too, if only so i can add the moreutils
"errno" to toybox, which i've found useful at times. (one of these days
i'll write a static analyzer to catch people adding new %d errno printfs to
the code base. do they not know about strerror()? do they not know that
errno values aren't constant across architectures?)
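
For illustration, a minimal portable sketch of printing an errno value by its symbolic name rather than as a raw number. The helper name errno_name and the short list of codes are assumptions made for this example, not anything in toybox or bionic; on glibc 2.32 or later, strerrorname_np() and sigabbrev_np() (declared in <string.h> under _GNU_SOURCE) do the lookups mentioned above.

```c
#include <errno.h>
#include <stddef.h>

// Hypothetical helper: map an errno value to its symbolic name.
// Only a handful of codes are listed here; a real table would cover them all.
// Using the macros (not literal numbers) keeps this correct on architectures
// where the numeric values differ.
const char *errno_name(int err)
{
  switch (err) {
#define E(x) case x: return #x;
    E(EPERM) E(ENOENT) E(EINTR) E(EIO) E(EACCES) E(EINVAL) E(ENOMEM)
#undef E
    default: return NULL; // unknown to this (deliberately short) table
  }
}
```

A caller would fall back to printing the number only when this returns NULL.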

> Beyond that, I've stayed away from internationalization up until now, and
> if your response is "kill it with fire" I can revert it.
>

that would be my choice, certainly.

if you weren't already convinced by my examples, a couple more common ones:
when you add i18n to _output_ people (not unreasonably) expect it for
_input_ too, which is a nightmare you don't want to get into. also the
answer to "will the kernel ever localize the content of files in /proc?"
means we're not doing our actual intended users (human or machine) any
favors here. it's easier to learn when things are consistently wrong (like
me dealing with the victorian^WUS use of Fahrenheit and the twelve-hour
clock --- since they're _consistently_ wrong i know to be on the lookout,
and i can cope). [for an example of the confusion that comes from
inconsistency that you're already familiar with: which way round do you
write a Korean or Japanese name in English?]

> > if you're serious about localization, you're going to need
> > icu4c anyway, which isn't scared to embrace all the diversity that's
> > actually out there (rather than the tiny subset that the POSIX folks
> > could imagine, which doesn't even stretch to the need for the genitive
> > case in dates, to pick one random fairly mainstream example).
>
> Nope. Not going there.
>

nor should you. "you are not an app", so real people never need to see you.
developers and sysadmins do, often from machines in random
locations/locales/timezones, and they (and their scripts) are better served
by consistency.

> I vaguely intend to have toysh command line editing handle right-to-left
> mode due to a completionist streak,

to me that's different. that's more in the bucket of "full UTF-8 support",
which is clearly a good thing. _someone_ is going to have to deal with
Arabic filenames at some point (and they won't necessarily be able to read
them). thanks to confusion about uppercasing/lowercasing Turkish
dotted/dotless 'i's i see rather more Turkish input than you'd expect from
someone who doesn't speak Turkish and has never been there.
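
As a sketch of the usual defensive pattern behind that Turkish 'i' confusion: fixed keywords (option names, config keys) can be case-folded with ASCII-only rules, so the result cannot change with LC_CTYPE. In a Turkish locale the locale-aware functions can map 'i' to dotted capital İ and 'I' to dotless ı, which breaks comparisons like "TITLE" vs "title". The helper names below are assumptions for this example, not existing toybox functions.

```c
// Hypothetical helper: lowercase A-Z only, leaving every other byte
// (including the bytes of multi-byte UTF-8 sequences) untouched.
int ascii_tolower(int c)
{
  return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

// Hypothetical helper: case-insensitive compare using ASCII rules only.
// Returns <0, 0, or >0 like strcmp().
int ascii_strcasecmp(const char *a, const char *b)
{
  int x, y;

  do {
    x = ascii_tolower((unsigned char)*a++);
    y = ascii_tolower((unsigned char)*b++);
  } while (x && x == y);

  return x - y;
}
```

This only covers keywords the program itself defines; user-supplied text such as filenames is passed through unmodified.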

> and back when I was planning on implementing vi
> by vertically stacking the line editing plumbing (hence "linestack.c") I
> was gonna make sure that did it properly too. But now there's a vi there
> that I have nothing to do with which shares no infrastructure with
> anything else, so I guess that part's not my problem anymore.
>
> But that's all utf8 and unicode stuff. I haven't got a clue what the
> strings it includes MEAN.
>

yeah, exactly.

> > luckily, i've also been able to neuter Android's libc so none of this
> > will affect Android whichever way toybox goes[1]. but i still think it's
> > a bad idea.
>
> I wouldn't have volunteered to do it myself, I'm being presented with
> complaints and attempting to find the least bad way to resolve them. :)
>
> "This is too many digits for humans to handle" is why adding commas to
> numbers was invented. It was the obvious solution. And then somebody
> complained that using commas is parochial, so I added the periods which
> should cover well over 90% of the planet's population. (China uses
> 1,000.0 about everybody.
>

the question you need to ask yourself about a lot of those people is:
do they group in 2s or 3s or 4s, or a mix, or even a mix within the same
number? chinese-numbering-influenced countries sometimes count in
ten-thousands rather than thousands, and indian numbering can give you
something like 12,34,56,789 (2s and 3s in the same number).
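
To make the grouping question concrete, here is a rough sketch that takes the group sizes as parameters instead of hard-wiring groups of three. The function name group_digits and its first/rest parameters are assumptions for this example; it is not how toybox's human_readable_long works, and real locale data (as in icu4c) is richer than two numbers.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: write n into out with separator sep, where the
// rightmost group has `first` digits and every group after it has `rest`.
// Returns 0 on success, -1 if out (len bytes) is too small.
int group_digits(unsigned long long n, int first, int rest, char sep,
                 char *out, size_t len)
{
  char digits[32], tmp[64];
  int ndig = snprintf(digits, sizeof(digits), "%llu", n);
  int i, pos = sizeof(tmp), count = 0, group = first;

  // Build the result right to left in tmp.
  tmp[--pos] = 0;
  for (i = ndig-1; i >= 0; i--) {
    if (count == group) {
      tmp[--pos] = sep;
      count = 0;
      group = rest; // after the first group, switch to the repeating size
    }
    tmp[--pos] = digits[i];
    count++;
  }

  if (sizeof(tmp)-pos > len) return -1;
  memcpy(out, tmp+pos, sizeof(tmp)-pos); // includes the terminating NUL

  return 0;
}
```

For 123456789, sizes (3,3) give 123,456,789, sizes (3,2) give the indian-style 12,34,56,789, and sizes (4,4) give the ten-thousands style 1,2345,6789. Which one a given reader expects is exactly the part this sketch cannot answer.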

#include <if you really care, you need icu4c because humans are really
really weird>

> If "consistently show megabytes for systems > X gigabytes" vs
> "consistently show kilobytes for systems < X gigabytes" is good enough,
> even when the resulting numbers are long, I'm happy to rip the comma
> support back out.
>

personally, that's my preferred solution. (and what i _think_ current
procps top is doing, though i don't have enough systems to be sure, and
i'm not sure what to think about your results.)
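
A minimal sketch of that "one unit per system" idea, assuming the unit is chosen once from total memory and then applied to every field. The names pick_mem_unit and scale_mem and the threshold parameter are hypothetical; this is not taken from procps or toybox.

```c
// Hypothetical helper: choose one unit for every memory field based only on
// total system memory, so the unit never varies between fields or refreshes.
// threshold_kb is an assumed cutoff supplied by the caller.
char pick_mem_unit(unsigned long long total_kb, unsigned long long threshold_kb)
{
  return total_kb > threshold_kb ? 'M' : 'K';
}

// Hypothetical helper: convert a kilobyte count to the chosen unit.
// Plain integer division, no separators, so the output stays easy to parse.
unsigned long long scale_mem(unsigned long long kb, char unit)
{
  return unit == 'M' ? kb/1024 : kb;
}
```

Because the unit depends only on the machine, a script parsing `top -b` output on that machine sees the same unit every time.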

> > no "real people" should ever need to look at this, but machines and
> > developers will, and every bit of localization hurts the real audience.
>
> Yes and no. There's a lot of developers out there who don't speak english,
> certainly not as their first language. I don't want to unnecessarily
> exclude them.
>

they're going to have more trouble with --help output than they are here.
and like i said, i'm pretty sure that "C/POSIX number formats" is
something you need to learn pretty early on. no-one likes a for loop
condition like x <= 70,2 after all, and the kernel's never going to
localize :-)

> > at least 15'936.2 would be a valid C++14 literal (and i'm assuming
> > will make it into C2x) :-)
>
> That's the opposite of helping.
>

sorry, just winding you up. probably not the best time for it!

> > ___
> > 1. strictly, the fact that you're doing your own insertion of ','
> > separators might hurt me (in the `top -b` case), but i'll worry about
> > that if i notice it actually break any parsing. i know that's included
> > in Android's standard bugreports, but i _don't_ know that anyone's
> > parsing it.
>
> If the units weren't constant before then their parsing was iffy at best.
> Now at least the units should be constant on a given system.
>
> Rob