[Toybox] [PATCH] lib/lib human_readable_long fix utf-8 LC_NUMERIC

Wed Sep 9 22:15:13 PDT 2020

On Thu, Sep 10, 2020 at 3:20 AM enh <enh at google.com> wrote:
>
> if you've ever wondered why the same person (me) worked so hard to ensure that OEMs couldn't remove locale data from icu4c but also personally removed all the localization from the core Java libraries and libc...
>
> i'd always been a strong proponent of localization, but one of the first things i did on Android was to remove this sort of "low-level localization" where i found it. i was finding that bugs were getting less attention than they should because developers didn't know what to do with (say) a Turkish error message. automated bug report clustering was failing to realize that (say) `Datei oder Verzeichnis nicht gefunden` and `그런 파일이나 디렉터리가 없습니다` and `No such file or directory` are the same. or scripts failing to parse output because they've been trained on en_US.

Yes, googling problems based on error messages is a lot easier when
errors are in English.

>
> for *apps* -- anything that real people interact with directly -- localization is massively important. but, at least after working on Android, i came to believe that it's a _mistake_ and actively harmful for development tools. the fact that i've had to (say) help a native Russian speaker fix a bug where `x = 70,2` was valid but very much not what they meant only _strengthens_ this belief for me --- if you're going to work on this stuff, you're going to have to learn the C/POSIX locale sooner or later.

I'm ok with the C/POSIX locale. It does not have thousands separators,
so there is no confusion. But I think forcing en_US, on the other hand,
is not ok.
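
To make the C/POSIX point concrete, here is a minimal sketch using only
standard C calls (localeconv, setlocale, strtod). The helper names
numeric_locale_is_plain and parse_in_c_locale are made up for this mail and
are not toybox functions.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the active LC_NUMERIC uses "." as the
 * decimal point and has an empty thousands separator, which is what the
 * C standard specifies for the "C" locale. */
int numeric_locale_is_plain(void)
{
    struct lconv *lc = localeconv();

    return strcmp(lc->decimal_point, ".") == 0 && lc->thousands_sep[0] == '\0';
}

/* Hypothetical helper: parse text with strtod() under the "C" numeric
 * locale and return how many characters were consumed. Note that it
 * switches LC_NUMERIC to "C" and leaves it that way. */
size_t parse_in_c_locale(const char *text, double *out)
{
    char *end;

    setlocale(LC_NUMERIC, "C");
    *out = strtod(text, &end);

    return (size_t)(end - text);
}
```

With this, parsing "70,2" in the C locale consumes only "70" and stops at the
comma, so a caller that compares the consumed length against strlen() can
reject the input instead of silently reading something else.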

>
> see also: why ISO-8601 is the one true date format.
>
> don't apps need libc localization? not really. the POSIX localization functionality is so anaemic that it's really not useful even for "major minority" languages. if you're serious about localization, you're going to need icu4c anyway, which isn't scared to embrace all the diversity that's actually out there (rather than the tiny subset that the POSIX folks could imagine, which doesn't even stretch to the need for the genitive case in dates, to pick one random fairly mainstream example).
>
> Luckily, i've also been able to neuter Android's libc so none of this will affect Android whichever way toybox goes[1]. but i still think it's a bad idea. no "real people" should ever need to look at this, but machines and developers will, and every bit of localization hurts the real audience.
>
> at least 15'936.2 would be a valid C++14 literal (and i'm assuming will make it into C2x) :-)

And Rust has underscores (15_936.2) to add to the confusion.

>
> ___
> 1. strictly, the fact that you're doing your own insertion of ',' separators might hurt me (in the `top -b` case), but i'll worry about that if i notice it actually breaks any parsing. i know that's included in Android's standard bugreports, but i _don't_ know that anyone's parsing it.
>
> On Wed, Sep 9, 2020 at 10:37 AM Jarno Mäkipää <jmakip87 at gmail.com> wrote:
>>
>> Apparently LC_NUMERIC thousands_sep can be NARROW NO-BREAK SPACE
>>
>> There might be a cleaner fix than this, but copying just one char out of
>> thousands_sep spat out:
>>
>>
>>   Mem:   15�36M total,    4�92M used,   11�44M free,      674M buffers
>>  Swap:    2�47M total,        0M used,    2�47M free,    1�97M cached
>>
>>
>> After the patch:
>>   Mem: 15 936M total,  4 658M used, 11 277M free,      677M buffers
>>  Swap:  2 047M total,        0M used,  2 047M free,  1 675M cached
>>
>>
>> -Jarno
>> _______________________________________________
>> Toybox mailing list
>> Toybox at lists.landley.net
>> http://lists.landley.net/listinfo.cgi/toybox-landley.net
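
For anyone following along, the underlying issue is that NARROW NO-BREAK
SPACE (U+202F) is three bytes in UTF-8 (0xE2 0x80 0xAF), so copying a single
char out of thousands_sep emits only the first byte of the sequence. Below is
a rough sketch of the idea of copying the whole separator string. It is not
the actual patch, and group_digits is a name invented here for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the toybox patch): write value into out with sep
 * inserted between groups of three digits. sep is copied as a whole string,
 * so a multi-byte UTF-8 separator stays intact. Returns the length in bytes,
 * or -1 if out is too small. */
int group_digits(char *out, size_t outlen, unsigned long long value,
                 const char *sep)
{
    char digits[32];
    int ndigits = snprintf(digits, sizeof(digits), "%llu", value);
    size_t seplen = strlen(sep);
    size_t pos = 0;

    for (int i = 0; i < ndigits; i++) {
        /* Insert a separator when the number of digits still to be written
         * is a multiple of three, except before the first digit. */
        if (i > 0 && (ndigits - i) % 3 == 0) {
            if (pos + seplen >= outlen) return -1;
            memcpy(out + pos, sep, seplen);
            pos += seplen;
        }
        if (pos + 1 >= outlen) return -1;
        out[pos++] = digits[i];
    }
    out[pos] = '\0';

    return (int)pos;
}
```

In real use the separator would presumably come from
localeconv()->thousands_sep after setlocale(LC_ALL, ""). Since the result
length is in bytes rather than display columns, any column padding has to
account for the difference.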