[Toybox] sort -n -M patch

Rob Landley rob at landley.net
Thu Sep 29 12:43:19 PDT 2016


On 09/29/2016 04:52 AM, isabella parakiss wrote:
> On 9/28/16, Rob Landley <rob at landley.net> wrote:
>> I'm looking at
>> https://github.com/landley/toybox/pull/19/commits/6d6252a8f39e3f813cd79dc96ebae61c6507717c
>> which does 2 things:
>>
>> 1) Hardwire in english month names. Not sure why, since strptime is posix:
>>
>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/strptime.html
>>
>> And having libc do it gives us locale-specific translation for free? (I
>> don't go out of my way for it, but I don't see why I should turn it down
>> if libc already did the work?)
>>
> 
> https://gist.github.com/izabera/d069eb12fa5d640564378a2f6163dbe8

You just linked to 70 lines of code with no explanation of what it's
supposed to do, and the comments are people running it three times and
getting two different output formats.

No idea what -2572 is supposed to mean. Then there are t1, t2, and t3
which have no explanation. t2-t3 is "4.x" with no units, t4-t3 is "1.x"
with no units. (Seconds? Iterations? How much input data was fed to this
thing?)

Given that this looks benchmarky, I'm going to assume your complaint
here is "performance" and leave it at that.

> calling strptime in a tight loop is absolutely great code

So you're complaining about performance (specifically speed of
execution? Not memory usage or how much the feature implementation
bloats the executable size?) of date sorts then? (This is the first
you've mentioned what your objection actually was. Your initial
submission did not specify WHY you were changing the code.)

I'm assuming it's doing the date sorts properly, getting the correct
result? I have yet to actually encounter a script _doing_ a date sort
out in the wild, so couldn't tell you how common it is or how
performance-critical. You haven't provided a real-world example of a
script using it and how the performance was unacceptable.

I just implemented it because it was a feature other implementations
had. It's hard for me to care about its performance until somebody shows
me an actual user. Given that it's NOT in posix, I lean more towards
removing it than trying to "fix" it.

I also dunno if your example producing the random output that means
nothing to me was linked against glibc or musl or bionic, and if that
makes a difference...

> and you should feel proud of yourself as an engineer for writing it.

I do, thanks.

Ten years before I wrote toybox's sort, I wrote busybox's sort. The code
in busybox git still reads:

                case FLAG_M: {
                        struct tm thyme;
                        int dx;
                        char *xx, *yy;

                        xx = strptime(x, "%b", &thyme);
                        dx = thyme.tm_mon;
                        yy = strptime(y, "%b", &thyme);


And according to git annotate the only change to it since I committed it
on January 24 2005 was whitespace. (Denys redid the whitespace of the
whole file in 2006.)

So in the decade since I did strptime for busybox, with the larger
busybox userbase pounding on it, nobody's cared to change it there. You
are the first, and you haven't got a use case. You're optimizing for the
sake of optimizing, trading off size and complexity for speed when
http://landley.net/toybox/design.html starts with "simple, small, fast,
and full featured" in that order. Explicitly ranking them as priorities.

I'm aware it could be faster if written in hand-coded assembly, if you
want that http://asm.sourceforge.net/asmutils.html exists.

I'm also happy to reshuffle the priorities based on real users' real
world needs. But you haven't got a use case! Your _second_ attempt at
explaining was a microbenchmark.

> if you want sort -M to have locale-specific support (which... basically
> nothing else in toybox has?

Did you miss all the work to handle UTF8? TOYFLAG_LOCALE, crunch_str()
and so on?

> zero gettext support and all error messages are "bad xyz"),

I'm trying to do the 20% of the work that gets me 80% of the results. I
don't use GNU/GNU/GNU/gettext for the same reason I don't use
GNU/GNU/GNU/autoconf.

I explained why the error messages are intentionally simple in the
"Error messages and internationalization" section of
http://landley.net/toybox/design.html

I'm not translating the command names into other langauges, so there's
ALREADY a line to be drawn about what is and isn't translated. But I
expect a significant number of people who do NOT have english as their
primary language to have to deal with these error messages, and I don't
want to force people to learn a hundred words when we could get by with
a dozen words.

My point is I did think about this, and I do care. You may disagree with
my conclusions, and argue about where I draw the 80/20 line. But arguing
"you didn't do everything therefore you can't do anything" is a silly
argument (and implies busybox should never have existed, let alone
toybox). What you're doing is convincing me you have no idea what I'm
trying to accomplish at _all_.

> the obvious solution is to precompute the names of the
> months according to the current locale

Which is set at runtime, not at compile time?

I'm not sure I've even hooked UP the internationalization stuff for sort
yet, but I was leaving the option open because if somebody started using
it they'd probably want it to work in their language, and there are a
lot more speakers of mandarin and hindi on the planet than english
speakers. (I'm waiting for actual users to present me with a real-world
use case to reason about instead of ivory-tower maybes.)

But one of the reasons this feature is so SILLY is that the month names
do vary by locale, and the man page says:

       -M, --month-sort
              compare (unknown) < 'JAN' < ... < 'DEC'

With no mention of this kind of issue. (Given that this is a GNU
extension, it's entirely possible they didn't think of it, and then
nobody ever used it. But if somebody DOES say "I'm trying to sort emails
headers from taiwan" or some such, I care about making it work for them.
And I need their failing example to _test_ against.)

Also I assume it's case insensitive and handles full month names (prefix
matching) but I need to add TESTS for this:

  echo -e "feb\nmar\njan" | sort -M
  echo -e "february\nmarch\njanuary" | sort -M

> and then use a very similar loop
> to what that patch does.

The size and complexity of strptime are sunk costs (in libc). The libc
function is always there and provides a simple interface for me to use.
Using strptime makes _my_ code small and simple, and means I have less
to test because if libc is broken that's what needs fixing.

>> 2) replace 2 lines of floating point comparison code with 18 lines of
>> floating point comparison code.
>>
>> I'm aware that this offers unlimited precision (in a way I'd have to
>> write a lot of corner case tests for; this patch comes with zero new
>> testcases), I'm just not sure why we need that? Posix sort -n says to
>> support locale-specific thousands separators, so this can't be for
>> standards compliance.
>>
> 
> no, but it's what gnu/freebsd/openbsd/etc sort do

You tested this in freebsd and openbsd? First time you mentioned it...

> and i was assuming that
> compatibility with the rest of the world was somehow high priority,

Often I do something simple and wait for users to show me what actually
broke. Given that I did this in busybox 10 years ago and nobody
complained to THEM in all that time, I'm pretty sure this is close enough.

You're not complaining because you needed this feature and toybox broke
your script. You're complaining because toybox differs from gnu and
therefore toybox is wrong QED. This is the ONLY justification you have
provided so far. No reference to posix or LSB, no showing me an existing
user out on the web that didn't work...

> and also the only reason for having -g, which isn't standard and basically a
> duplicate of -n as of now.

If you looked at my code, you'd see that -g has defined behavior for
comparing NAN and infinity. (Which I think I got out of the "info" page
many moons ago, because I wanted to know what the difference was so I
dug into it farther than I usually would to resolve the question.)

If somebody runs a script with -g, it gets a reasonable answer instead
of immediately barfing. As I said, I did it that way for busybox 10
years ago and nobody seems to have complained yet. I often do the simple
thing and wait for somebody to complain.

Until now, nobody had complained. And you're not pointing at a use case,
you seem to be saying "it should work like this because it should work
like this", and "it could be faster therefore it must be faster, no
matter how much bigger and more complicated that makes it".

Rob



More information about the Toybox mailing list