[Toybox] [PATCH] Add support for 1024 as well as 1000 to human_readable.

Fri Sep 4 16:24:49 PDT 2015

On 09/04/2015 10:43 AM, James McMechan wrote:
>> From: enh at google.com
...
>> HN_B Use `B' (bytes) as prefix if the original result
>> does not have a prefix.
> 
> Is it just me or do you find this weird also, if you have an explicit prefix setting why not use it...
> If you don't want to use it why is it there in the first place?

Why is the _caller_ not appending B when they printf() the result? The
space is before the units but the B isn't, and this is a string that
gets put into a buffer and then used by something else. Further editing
is kinda _normal_...
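
Rough sketch of what I mean by the caller doing it. (append_unit() is a
name I just made up for illustration, it's not anything in toybox, and
the "1.5 M" input is just an assumed example of what the formatting
function might leave in its buffer.)

```c
#include <stdio.h>

// Made-up helper, not toybox code: "scaled" is whatever the formatting
// function left in its buffer (say "1.5 M"), and the caller decides at
// print time whether a trailing B belongs on it.
int append_unit(char *out, size_t len, const char *scaled, int want_b)
{
  return snprintf(out, len, "%s%s", scaled, want_b ? "B" : "");
}
```

The point being the decision lives at the call site, next to the
printf() that already knows what's being printed.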

>> HN_DIVISOR_1000 Divide number with 1000 instead of 1024.
> 
> Yep, I think network speeds are measured in SI units for example
> I could live with 1024 units everywhere esp. if we also used the IEC prefixes

I object to the word "kibibyte" on general principles, and disks are
also sold in decimal sizes (for historical marketing reasons).

(Of course "512 gigs" is mixing decimal and binary when you _do_ use
binary gigs, since the 512 is decimal and all. But let's be honest,
"kibibytes" is a stupid name, all else is details for me.)

>> HN_IEC_PREFIXES Use the IEE/IEC notion of prefixes (Ki, Mi,

Mebibytes. *shudder*

Huh, I thought the i was the second character in "binary", but this
implies it's "IEC"? Or possibly IEE? Or maybe the i from "mebi" which is
back to "binary" again...

>> Gi...). This flag has no effect when
>> HN_DIVISOR_1000 is also specified.
> 
> Err yes, but it is not that it has no effect but that if you are using 1000s there should not be the 'i'

The B is already a separate flag from the 1024. If the caller wants to
append the unicode character for "clown nose" to the returned string,
that's not really human_readable()'s business.

> For my two cents I would suggest we go for IEC prefixes by default, yes they are so-so
> but there is a standard and it does make things noticeably clearer, might as well do it right instead
> of the usual customary ComSci notation where it is Notoriously ambiguous

The function is called human_readable().

You want to default to binary units.

What exactly is our goal here again?

(Keeping the thundering hordes of android users happy. Right. Trying not
to get emotionally invested in an aesthetic decision which hasn't _got_
a right answer and just needs to be consistent. That said, if I can help
kill the term "mebibytes" it is worth MUCH EFFORT on my part...)

>> in the entire tree, there's only one use of HN_GETSCALE
>> (/usr/bin/procstat), and it doesn't look like that's actually
>> necessary).
>>
>> HN_DECIMAL and HN_NOSPACE are used a lot: ls, df, du, and so on. HN_B
> 
> I did not have a HN_DECIMAL since I expect 0-9 to have a decimal point for a second
> digit of precision, the range is to 999 anyway so it will not use more characters.
> 
>> is used less, but in df, du, and vmstat. HN_DIVISOR_1000 is only
>> really used in df (it's also used once each in "edquota" and
>> "camcontrol").
> 
> I would have no problem with df using units 1024 instead and displaying IEC Units

Disks are sold in decimal measurements. People are going to ask why your
horribly inefficient file format is eating so much of their disk space.

(What, did they stop doing that with flash? I'd be surprised if they did...)

>> HN_IEC_PREFIXES isn't used at all. not even a test.
> 
> Yeah, I have noticed for myself, following the standard and even making it the default
> so that you know what everything is in would be good, alas somewhat incompatible
> with custom, but are scripts using -h and then parsing it... something is likely that dumb.
> But it would be nice to actually do the right thing.

Nothing extending the usage of the word "gibibytes" is the right thing.

>> so until we find a place where we want to turn off HN_DECIMAL, we're
>> good. (that's a harder thing to grep for, but i couldn't find an
>> instance in FreeBSD.)
> 
> I would hope not, I would regard it as a useless loss of precision.
> 9.9 will fit in the same space as 999 just fine.

human_readable() _IS_ a useless loss of precision. That's what it's _for_.

And the units advance by kilobytes so 9.9 and 999 are not rephrasings of
each other. 999k and 1.0M can be from a rounding perspective, but "loss
of precision" is the reason rounding _exists_...
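
To make the rounding concrete, here's a minimal sketch. It is NOT the
toybox human_readable(): hr_sketch() and its arguments are made up for
illustration, it takes a buffer length (which the real function
doesn't), and it assumes the divisor is either 1000 or 1024.

```c
#include <assert.h>
#include <stdio.h>

// Sketch only, NOT the toybox human_readable(). hr_sketch() and its
// arguments are made up for illustration. units is 1000 or 1024.
// Returns what snprintf() returns; output is at most 4 characters plus NUL.
int hr_sketch(char *buf, size_t len, unsigned long long num, unsigned units)
{
  const char *prefix = "kMGTPE";
  unsigned long long rem = 0, tenths;
  int i = 0;

  assert(units == 1000 || units == 1024);

  // Scale down until three digits are enough. Only the last remainder is
  // kept, so anything below that was already thrown away.
  while (num >= 1000 && i < 6) {
    rem = num % units;
    num /= units;
    i++;
  }

  // Small enough to print exactly, no prefix.
  if (!i) return snprintf(buf, len, "%llu", num);

  if (num < 10) {
    // One decimal digit, rounded to nearest.
    tenths = (rem*10 + units/2)/units;
    if (tenths == 10) {
      num++;
      tenths = 0;
    }
    if (num < 10)
      return snprintf(buf, len, "%llu.%llu%c", num, tenths, prefix[i-1]);
  } else if (rem*2 >= units) num++;

  // 999.5k and up rounds to 1000k, which is spelled 1.0M.
  if (num == 1000 && i < 6) return snprintf(buf, len, "1.0%c", prefix[i]);

  return snprintf(buf, len, "%llu%c", num, prefix[i-1]);
}
```

With units 1000, an input of 999000 comes out as 999k and 999999 comes
out as 1.0M, which is the rounding case above. The low digits are gone
either way.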

>>> If this behaves differently on big or little endian, your compiler is at
>>> fault. And long long should be 64 bit on 32 bit or 64 bit systems, due
>>> to LP64. (There's no spec requiring long long _not_ be 128 bit, which is
>>> a bit creepy, but nobody's actually done that yet that I'm aware of. I
>>> should probably use uint64_t but the name is horrid and PRI_U64 stuff in
>>> printf is just awkward, and it's a typedef not a real type the way
>>> "int", "long", and "long long" are...)
> 
> I have developed paranoia over BE/LE & 32/64 over the years, subtle assumptions about
> size or byte ordering can creep in and break things.

Oh sure. But I've been doing Aboriginal Linux in various forms since
1999 and started caring about cross compiling it in 2005, so I'm fairly
familiar with where the sharp edges are by now.

> One I can remember was in the ext2 code
> they had a bit map in LE order but accessed it using longs rather than bytes so it had to have
> the byteswap even though the code using bytes was just as simple and completely agnostic
> about wordsize and BE/LE.

Not my code. :)

(That said, my code's currently back on the todo heap because I have to
read about ext4. Although really if it can upconvert on the fly maybe I
should just genext2fs an ext2, stamp an ext3 journal on it, and let the
filesystem driver handle the rest...)

> I could argue that long should be 128 bit on 64 bit computers

Then there would be no 64 bit integer type.

char = 8 bit
short = 16 bit
int = 32 bit
long = 64 bit (on 64 bit)
long long = 64 bit on both 32 and 64 bit (de-facto).

The uint99_t stuff are typedefs that have to resolve to an underlying
integer type.
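
If you want the compiler to check that list for you, something like
this works as a sketch. It assumes a C11 compiler for static_assert,
and long_long_bits() is a made-up helper name. The "long is pointer
sized" line is the LP64 (and 32 bit ILP32) assumption, so it's expected
to fail on a platform that doesn't follow it.

```c
#include <assert.h>   // static_assert (C11)
#include <limits.h>   // CHAR_BIT
#include <stdint.h>   // uint64_t

// Compile-time checks for the sizes listed above. A platform where one of
// these is false simply won't build.
static_assert(CHAR_BIT == 8, "char is 8 bits");
static_assert(sizeof(short) == 2, "short is 16 bits");
static_assert(sizeof(int) == 4, "int is 32 bits");
static_assert(sizeof(long) == sizeof(void *), "long is pointer sized");
static_assert(sizeof(long long) == 8, "long long is 64 bits");

// The typedef has to land on an underlying type of the same size.
static_assert(sizeof(uint64_t) == sizeof(long long), "uint64_t matches long long");

// Runtime version of the same question, for printing or testing.
int long_long_bits(void)
{
  return (int)(sizeof(long long) * CHAR_BIT);
}
```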

> but LP64 was a hack to work
> around poorly written software, long long /should/ be 256 bits :) not merely 128 bit.

You know how people went to great lengths to avoid using uint64_t on 32
bit machines because it introduced libgcc_s.so calls and sucked in
_deeply_ crappy code to do FOIL multiplies and divides from high school
algebra?

You're saying "64 bit should have this problem too".

Bignum libraries exist. A 256 bit integer type doesn't let you do
cryptography or implement standards-compliant BC without using them.

(Heck, Posix and LSB are hacks to work around poorly written software.
Kinda both's reason d' et cetera.)

> Yes, uint64_t is a bit of a mess, but if the compiler puts some other size in there I would
> feel fully justified in bitching about it.

It would be a standards violation.

> int, long and long long are compiler dependent and can
> be whatever they desire and are per-arch,

LP64 says what int and long should be, and specifies at least a minimum
size for long long. Linux, BSD, and MacOS X depend on LP64. As does
toybox (in design.html I believe).

> so I try to use it where I want a particular size.

Good for you...?

> For example int was the size to store pointers in,
> as it was the machine word per K & R explicitly stated store pointer in int.
> now it is long, or better yet void *.

Ah, the days when char could be 18 bits because some machines were just
crazy and we hadn't weeded out the weak hardware designs yet.

That went away.

> I did find a couple of uint128_t references on my system.

gcc of course added a __int128 compiler extension which is two 64 bit
integers glued together just like 32 bit mode. How you printf() them is
left as an exercise to the reader apparently?
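
For the record, the exercise isn't hard, you just do the decimal
conversion yourself. A sketch, assuming gcc or clang on a 64 bit target
where the unsigned __int128 extension exists (u128_to_str() is a made-up
name, not a libc function):

```c
// Sketch assuming the gcc/clang unsigned __int128 extension is available.
// u128_to_str() is a made-up name. buf needs room for the 39 digits of
// 2^128-1 plus the NUL terminator. Digits are filled in from the end, and
// the return value points at the first digit inside buf.
char *u128_to_str(unsigned __int128 val, char buf[40])
{
  char *s = buf + 39;

  *s = 0;
  do *--s = '0' + (int)(val % 10);
  while (val /= 10);

  return s;
}
```

Then it's printf("%s", u128_to_str(val, buf)) like any other string.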

I'm not going there. I did a sizeof(long long) on every aboriginal linux
target to check what the size actually _was_, but as far as I know the
limited number of units here are the first thing that might actually
care about the size being larger. (Because it could overflow the string
buffer allocation since we're not passing in a length. 64 bit input
won't produce more than ~6 bytes of output depending on flags.)

>>>> You can also set flags to drop the space between number and prefix or use the ubuntu 0..1023 style
>>>> also you can request the limited range 0..999, 1.0 k-999 k style in either SI or IEC
>>>
>>> Yes, but why would we want to?
> 
> Strict conformance to the standard? avoiding the 9999->9.8Ki transition.

The first I heard of this standard was when you mentioned it. Ubuntu
clearly wasn't doing it.

>>>> This is pure integer, I could open code the printf also as it can only have 4 digits maximum at the moment.
>>>> If you want I could make it autosizing rather than just one decimal between 0.1..9.9
>>>> Also if any of the symbols are defined to 0 the capability will drop out.
>>>> Perhaps I should make it default to IEC "Ki" style? getting it right vs bug compatibility.
>>>>
>>>> I made a testing command e.g. toybox_human_readable_test to allow me to test it.
>>>
>>> I had toys/examples/test_human_readable.c which I thought I'd checked in
>>> a couple weeks ago but apparently forgot to "git add".
> 
> I was thinking maybe it needs a better name, outputting info for humans would be nice
> to be able to do from the shell, so it could be actually used in production.

It defaults to "n" in defconfig. It's a testing command. That's why it
has "test" in the name and lives in the "examples" directory.

This is beyond infrastructure in search of a user, you're letting
infrastructure suggest a use case. "If all you have is a hammer,
everything looks like a nail." Nobody's _asked_ for this.

>>> (If you git add a file, git diff shows no differences, mercurial diff
>>> shows it diffed against /dev/null. I'm STILL getting used to the weird
>>> little behavioral divergences.)
>>>
>>>> I hope this is interesting.
>>>
>>> It's very interesting and I'm keeping it around in case it's needed. I'm
>>> just trying to figure out if the extra flags are something any command
>>> is actually going to use. (And that's an Elliott question more than a me
>>> question, I never use -h and it's not in posix or LSB.)
> 
> Odd, it has been in common usage for years, but I guess it was just whatever
> people felt a human would like to see rather than one of the standards.

It's got a dozen flags because everybody who implemented this did it
differently because the machine readable scriptable version is just to
print out the actual NUMBER, thus the aesthetic cleanup is (or at least
should be) just that.

Bringing an international standards body into a purely aesthetic
decision is weird. ANSI vs ISO tea was a _joke_.

(Ok, maybe the aesthetic output has mutated into functional due to
screen scrapers, which is what Elliott was implying by scripts depending
on -h output. In which case either rigorously copying the historical
mistakes or breaking them really loudly is called for. Adding a
standards body to that sort of mess gives me a headache long before we
get into any sort of details.)

Rob
