[Toybox] [PATCH] lib/lib human_readable_long fix utf-8 LC_NUMERIC

enh enh at google.com
Wed Sep 9 17:19:52 PDT 2020


if you've ever wondered why the same person (me) worked so hard to ensure
that OEMs couldn't remove locale data from icu4c but also personally
removed all the localization from the core Java libraries and libc...

i'd always been a strong proponent of localization, but one of the first
things i did on Android was to remove this sort of "low-level localization"
where i found it. i was finding that bugs were getting less attention than
they should because developers didn't know what to do with (say) a Turkish
error message. automated bug report clustering was failing to realize that
(say) `Datei oder Verzeichnis nicht gefunden` and `그런 파일이나 디렉터리가 없습니다` and
`No such file or directory` are the same. and scripts were failing to parse
output because they'd only ever been written against en_US output.

for *apps* -- anything that real people interact with directly --
localization is massively important. but, at least after working on
Android, i came to believe that it's a _mistake_ and actively harmful for
development tools. the fact that i've had to (say) help a native Russian
speaker fix a bug where `x = 70,2` was valid but very much not what they
meant only _strengthens_ this belief for me --- if you're going to work on
this stuff, you're going to have to learn the C/POSIX locale sooner or
later.
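
a minimal sketch of the decimal-comma trap, assuming C and the C/POSIX locale that every C program starts in until it calls setlocale(). the original bug isn't described in enough detail to reproduce, so this only shows the closest libc analogue: strtod() happily accepts `70,2`, but stops at the comma. the helper names `parse_lenient` and `parse_strict` are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Lenient: returns whatever prefix strtod() could parse and silently
 * ignores anything left over. In the C/POSIX locale only '.' is a radix
 * character, so "70,2" yields 70. */
static double parse_lenient(const char *text)
{
  assert(text);
  return strtod(text, NULL);
}

/* Strict: returns 0 only if the entire string was consumed as a number,
 * -1 if nothing was parsed or if trailing characters remain. */
static int parse_strict(const char *text, double *value)
{
  char *end;

  assert(text && value);
  *value = strtod(text, &end);
  return (end == text || *end) ? -1 : 0;
}
```

with these, `parse_lenient("70,2")` quietly returns 70, while `parse_strict("70,2", &v)` reports the leftover `,2` as an error, which is usually what a development tool wants.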

see also: why ISO-8601 is the one true date format.

don't apps need libc localization? not really. the POSIX localization
functionality is so anaemic that it's really not useful even for "major
minority" languages. if you're serious about localization, you're going to
need icu4c anyway, which isn't scared to embrace all the diversity that's
actually out there (rather than the tiny subset that the POSIX folks could
imagine, which doesn't even stretch to the need for the genitive case in
dates, to pick one random fairly mainstream example).

luckily, i've also been able to neuter Android's libc so none of this will
affect Android whichever way toybox goes[1]. but i still think it's a
bad idea. no "real people" should ever need to look at this, but machines
and developers will, and every bit of localization hurts the real audience.

at least 15'936.2 would be a valid C++14 numeric literal (and i'm assuming
digit separators will make it into C2x) :-)

___
1. strictly, the fact that you're doing your own insertion of ','
separators might hurt me (in the `top -b` case), but i'll worry about that
if i notice it actually break any parsing. i know that's included in
Android's standard bugreports, but i _don't_ know that anyone's parsing it.

On Wed, Sep 9, 2020 at 10:37 AM Jarno Mäkipää <jmakip87 at gmail.com> wrote:

> Apparently LC_NUMERIC thousands_sep can be NARROW NO-BREAK SPACE
>
> There might be a cleaner fix than this, but copying just one char out of
> thousands_sep spat out
>
>
>   Mem:   15�36M total,    4�92M used,   11�44M free,      674M buffers
>  Swap:    2�47M total,        0M used,    2�47M free,    1�97M cached
>
>
> after patch
>   Mem: 15 936M total,  4 658M used, 11 277M free,      677M buffers
>  Swap:  2 047M total,        0M used,  2 047M free,  1 675M cached
>
>
> -Jarno
>
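
for reference, a minimal sketch of why copying a single char breaks here. this is not the actual patch, and `group_digits` and `NNBSP` are hypothetical names. localeconv()->thousands_sep is a `char *` string, and in a UTF-8 locale NARROW NO-BREAK SPACE (U+202F) is the three bytes E2 80 AF, so taking only the first byte emits an invalid sequence. copying the whole string avoids that:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* UTF-8 encoding of U+202F NARROW NO-BREAK SPACE (three bytes). */
#define NNBSP "\xe2\x80\xaf"

/* Write value into buf with sep inserted between groups of three digits.
 * sep may be a multi-byte string. Returns the number of bytes written
 * (excluding the terminating NUL), or -1 if buf is too small. */
static int group_digits(char *buf, size_t size, unsigned long long value,
                        const char *sep)
{
  char digits[32];
  size_t seplen, out = 0;
  int len, i;

  assert(buf && sep);
  seplen = strlen(sep);
  len = snprintf(digits, sizeof(digits), "%llu", value);

  for (i = 0; i < len; i++) {
    /* A separator goes before any digit (other than the first) that has a
     * multiple of three digits remaining, itself included. */
    if (i && !((len - i) % 3)) {
      if (out + seplen >= size) return -1;
      memcpy(buf + out, sep, seplen);  /* whole sequence, not just sep[0] */
      out += seplen;
    }
    if (out + 1 >= size) return -1;
    buf[out++] = digits[i];
  }
  buf[out] = 0;

  return (int)out;
}
```

a caller that wants the locale's separator could pass `localeconv()->thousands_sep` after `setlocale(LC_NUMERIC, "")`. passing a fixed `","` instead keeps the output locale-independent, which is the behavior argued for above.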