[Toybox] PM: code style, was: Re: New Subscriber
Rob Landley
rob at landley.net
Wed Feb 8 09:59:43 PST 2012
On 02/08/2012 02:00 AM, Frank Bergmann wrote:
> Hi.
>
> On Tue, Feb 07, 2012 at 07:26:22PM -0600, Rob Landley wrote:
>> But in this case, I plan to continue using the printf() family because
>> doing without it wouldn't actually simplify the code, I'd just wind up
>> writing something just as bad.
>
> In this case you may want more control over how printf is used. I made
> a clone of toybox and busybox and issued one command to compare:
>
> $ grep -r setvbuf busybox|wc -l
> 9
> $ grep -r setvbuf toybox-cloned-1328632507|wc -l
> 0
I have no idea what that means. I pulled up the man page on setvbuf and
it says to me "micromanaging the behavior of stdio", which there could
be a reason for but it wouldn't be my first choice. However, a
defconfig busybox is 348 commands last time I checked, and defconfig
toybox is 45. It's quite possible some of those busybox commands
legitimately need to do setvbuf, dunno.
(You have to define boundaries for your project. Beyond this point is
Somebody Else's Problem. I'd rather not have "libc's stdio sucks" on my
side of that line just now. Ask Dmitriy. :)
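(For anyone curious what that micromanaging looks like, here's a minimal
sketch; the 4k buffer and full-buffering mode are arbitrary choices for
illustration, not something toybox does:)

#include <stdio.h>

int main(void)
{
  static char buf[4096];

  // Replace stdout's default buffering (line buffered on a tty) with
  // full buffering into our own buffer. Has to happen before any other
  // operation on the stream.
  if (setvbuf(stdout, buf, _IOFBF, sizeof(buf))) perror("setvbuf");

  printf("hello\n");  // sits in buf until it fills, we flush, or exit
  fflush(stdout);

  return 0;
}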
>> C and string handling are not a good mix. String handling is the thing
>> C is weakest at, because string handling actually turns out to be a hard
>> problem to get right at the hardware level.
>
> We all know that C actually doesn't know anything about "strings". ;-)
> When writing "bigger" software it *may* be worth implementing strings as
> a class in C (er... a struct) like Wietse Venema does. But that means
> rewriting all the code.
And it means teaching all your developers the new semantics as a
condition of contributing to your program, which is a big cost.
I've got enough of that already, and I'm explicitly leveraging the large
pool of existing knowledge of things like Kconfig syntax so the actual
cost isn't quite as high as it could be.
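(For context, a minimal sketch of the kind of counted-string struct being
described; the names here are hypothetical, not Venema's actual API:)

#include <stdlib.h>
#include <string.h>

// Hypothetical counted string: the length travels with the data.
struct cstr {
  size_t len;
  char *data;
};

// Every operation now has to go through helpers like this one.
struct cstr *cstr_new(char *s)
{
  struct cstr *c = malloc(sizeof(*c));

  if (!c) return 0;
  c->len = strlen(s);
  c->data = strdup(s);

  return c;
}

The cost above is that every contributor has to learn and consistently
use those helpers instead of just passing char * around.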
I want to make the code easy to _read_, which means I've got to be good
enough to make it look easy. I regularly fail at this, but am trying...
P.S. According to Joel Spolsky, the history of the C++ Standard Template
Library (and C++ template support in general) can be explained by a long,
protracted fight with "making strings work". Every change is due to yet
another way they found their String class du jour was broken. It's
still broken.
>> Yeah, but after 40 years of being grandfathered in, it's still useful
>> enough to stick around.
>
> Yeah, but 99% of software out in the wild breaks the basic rules of KISS
> and YAGNI and introduces (sometimes many) bugs and holes in the process.
Sturgeon's Law is universal.
>> I run things under strace rather a lot, but even if it boiled down to
>> doing a for() loop around write(1, &char, 1) I'd tell you to fix your
>> libc rather than change what I was doing.
>
> OK, this self-written printf-implementation you are stracing is not very
> well optimized. ;-)
lib/lib.c get_line(), which I have a pending question about and hope to
answer this evening. (No time right now...)
>> I'm not that interested in micromanaging something that most likely
>> stays in L1 cache either way.
>
> Hmmm... if your code fits in the L1 cache right before doing a sysenter,
> then won't the cache be dirty when sysenter is called?
What do you mean "dirty"?
The Linux kernel has a hugepage TLB entry covering the whole kernel
executable code, which is permanently there, which is one reason going
into the kernel is so cheap on Linux: no page table tree walk to switch
between kernel/user space, thanks to the pinned kernel TLB entries.
(Yeah, that leaves less for userspace but it's a good tradeoff.) Yeah,
you flip permission bits in those entries but that's cheap. Note that
doing this is why you have the 2G/2G split on 32 bit systems: the
kernel's virtual address space is always mapped when you're in user
space, it's just inaccessible permission-wise. So you can run out of
virtual address space (or have to go to "high memory" which the kernel
has to create a temporary mapping for in order to access it), but it
makes system calls really cheap.
The actual cache lines faulted in are another matter, but if you make a
lot of syscalls at least the entry point tends to stay in L1, and beyond
that it's no different than function calls.
> I never measured applications for this "problem", mostly for these
> reasons:
> - it's hard to measure the L1 cache running an OS with many tasks and
> not many cores
> - loops that already fit in L1 cache and don't call code "outside" run
> so fast that you can't measure them
CPU cycle counter. Run it on a quiescent system and see how many cycles
it took. (The phrase to google for is "linux microbenchmark" or
something like that.)
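(As a concrete example, a minimal cycle-counting sketch using the x86
rdtsc instruction via gcc inline assembly; assumes a quiescent machine,
a stable counter, and that you pin to one core:)

#include <stdio.h>

// Read the x86 timestamp counter (roughly, cycles since reset).
static unsigned long long rdtsc(void)
{
  unsigned int lo, hi;

  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));

  return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
  unsigned long long start = rdtsc();

  // ... code under test goes here ...

  printf("%llu cycles\n", rdtsc() - start);

  return 0;
}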
There was an awful lot of discussion of this when the P4 came out
because it turned out the pentium 4 SUCKED, in a lot of non-obvious
ways. Also the "to prefetch or not to prefetch" wars, which lwn.net
covered.
> - reducing the number of syscalls brought the most speedup, other changes
> were only ambiguously measurable
> - too many syscalls which can't be reduced make the advantage of the cache
> vanish (mostly showing big i/o waits in top on the core the application
> runs on)
> - in some cases further optimization didn't make any sense (due to
> e.g. network latencies)
> As you wrote: You'll have to measure it (all). Until then you must keep
> caches in mind.
You have to understand what the system is doing. You also have to
realize that different hardware works in different ways.
However "system calls" are an unavoidable hot path everywhere, and have
been optimized within an inch of their lives. If you can reduce them by
an order of magnitude maybe you've got something, but if you're fighting
to reduce them 20% it's probably not worth complicating your code.
>> That's the worst you've seen?
>
> Yes, I didn't expect it on rrdtool. But after stracing it I understand
> why it actually runs slow even on big hosts with many rrd-updates, even
> though fadvise/madvise should catch these cases.
>
>> Never run strace on gcc. Certainly not
>
> gcc is one of the tools I never *wanted* to strace because I already
> expected a nightmare (other tools are e.g. php). ;-)
The horror is indescribable. But I tried in my blog...
>> See lib/lib.h
>
> Clone done. Are patches submitted to the list?
Yup. :)
> My first make threw the nasty "dereferencing type-punned pointer will
> break strict-aliasing rules". In sort.c you use TT.lines as char* and not
> char**.
Sigh. I suspected some compilers would do that. The one I'm using here
isn't warning about it, but the hexagon target would actually _break_
when you do this.
I hate that error. There is nothing WRONG with type-punning a pointer,
your darn optimizer is being too clever for its own good, and if there's
a -fstoptryingtodothat I would happily add it to CFLAGS and move on with
my life. (I don't care about the performance change, IT'S VALID C!)
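(For anyone who hasn't hit this warning: here's a minimal illustration
of the kind of cast that triggers it, not the actual sort.c code, plus
the memcpy workaround the optimizer accepts. gcc's -fno-strict-aliasing
is more or less the -fstoptryingtodothat in question; the Linux kernel
builds with it.)

#include <string.h>

unsigned as_bits(float f)
{
  // The pattern gcc warns about at -O2:
  //   return *(unsigned *)&f;  // "dereferencing type-punned pointer..."
  // Copying the bytes says the same thing without breaking the aliasing
  // rules the optimizer relies on:
  unsigned u;

  memcpy(&u, &f, sizeof(u));

  return u;
}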
>> command line stuff. It's only a win if you never have to do it more
>> than once.
>
> That's why I often used small internal output buffers and the nasty
> stpcpy.
What's nasty? It's in POSIX 2008.
>> Minimal system bootstrapping is theoretically four things:
>
> BTW - I've read that pivot_root doesn't have a high priority on your
> TODO list. It's very easy to implement because Linux offers a syscall.
The internal implementation of that syscall is disgusting, because it
has to examine and potentially modify the state of every process on the
system.
Basically, I need to patch the kernel to make chroot() actually do
something different, and integrate the patch upstream in order to
obsolete switch_root():
http://landley.net/notes-2011.html#02-06-2011
Note that adjusting the process-local mount tree wasn't possible until
A) there was a process-local mount tree, B) --bind mounts had been
invented so you can split a mount point.
But once you _have_ got it, reference counting should automatically
unmount orphaned filesystems, and a reparent_to_init() variant can move
kernel threads into initramfs. (Which is pretty much where they should
_always_ be anyway, that's why we've got initramfs.)
> Older
> glibc didn't offer a wrapper but this is also easy to check (and to
> implement if necessary).
The one and only test along those lines I've implemented so far (just
added last week) adds a CONFIG symbol for a capability, and makes other
things depend on that symbol. So "unshare" drops out when the compile
probe fails to find the relevant constants in the header.
I'm pretty happy doing that for pivot_root as well: if your libc hasn't
got the syscall, fix your libc.
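(A minimal sketch of that fallback, calling the syscall directly when
libc doesn't provide a wrapper; error handling trimmed and untested:)

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

// If libc doesn't wrap pivot_root(), the raw syscall is one line.
static int do_pivot_root(char *new_root, char *put_old)
{
  return syscall(SYS_pivot_root, new_root, put_old);
}

int main(int argc, char *argv[])
{
  if (argc != 3) {
    fprintf(stderr, "usage: pivot_root new_root put_old\n");
    return 1;
  }
  if (do_pivot_root(argv[1], argv[2])) {
    perror("pivot_root");
    return 1;
  }

  return 0;
}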
> I want to write it as my first toy-code if no one
> else is working on it.
> Next thing could be mount even though it's not that easy.
I already wrote the busybox mount command, which had BUCKETS of corner
case behavior I care about getting right. My old implementation wasn't
a clean from-scratch rewrite (instead I replaced all the existing code
in about three passes) so I can't just port it and have to redo it. But
I have plans for that one and would like to do it myself.
>> Or you could do toys/cat.c using the global toybuf[4096] which is part
>> of the bss and only gets its page faulted in if it's actually dirtied.
>> (Modulo alignment considerations I haven't bothered about.)
>
> I've read your docs, but now I've also done the clone and read some code. :-)
>
>> As I think I said in design.html, I'm relying on c99, posix-2008, and
>> LP64. (If I wasn't clear enough there tell me and I'll go fix it.)
>
> No, it is clear. It was just a big bunch of docs. I don't know POSIX
> 2008 yet, only 2001. I think there is much more "deprecated".
I'm more interested in the new capabilities that got added, and as long
as I'm implementing standardized command line utilities I might as well
implement (a defined subset of) the most recent standard.
Note the common thing about all three of those? Available free on
the web. If it's not available free on the web, IT ISN'T A STANDARD.
>> the first commands I wrote for toybox. (Actually I started it for
>> busybox but left that project before it was finished, so never submitted
>> it.)
>
> Maybe they will backport some day. ;-)
Eh, I tried to push stuff upstream into busybox back when I mothballed
toybox. Got a bit of it up, but:
http://landley.net/notes-2010.html#11-03-2010
(The "chirping crickets" paragraph.)
I met Denys in person at CELF in April and _explained_ to him how toybox
puts everything in one darn file and busybox needed you to touch five
files, and he sort of got it, and this eventually resulted in:
http://lists.busybox.net/pipermail/busybox/2010-May/072386.html
But I really like my syntax better, and "pick up that shirt" is not the
same as "clean your room" when they can't tell it's dirty...
I got toybox's "patch" implementation upstream because Aboriginal Linux
_needed_ that (the one in busybox was a joke), but my enthusiasm beyond
that gradually waned again:
http://landley.net/notes-2010.html#05-01-2010
http://landley.net/notes-2011.html#08-06-2011
The problem is busybox became a tool lots of people depend on, and those
guys happily monkey-patch it to get it to work for them, and they don't
care if it's clean, they care if it _works_, and Denys has sort of gone
over to their side rather than fighting for "clean code" as the primary
objective. He's let simplicity become about job 4, after features and
speed and executable size.
I consider fast and small to mostly be _functions_ of the simplest
implementation you can manage, and treat complexity as a cost you spend
to get features, with some features not being worth the cost. (Yes, this
means I wind up pushing back on the userbase a bit, but as Alan Cox
said, "a maintainer's job is to say no".)
>> Never heard of it. I've got a strlcpy() but everybody does since
>> strncpy() isn't guaranteed to null terminate the output.)
>
> Better you forget this nasty thing. The man-page says that it is a
> GNU-extension and that it maybe goes back to the old msdos times...
>
>> Huh, apparently it's not an extension, it's in SUSv4:
>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/stpcpy.html
>
> er... maybe this is an April Fools' Day joke
Nope, it makes sense. The implementation is trivial:
char *stpcpy(char *to, char *from)
{
  // Copy including the NUL terminator, return a pointer to that NUL.
  while ((*to = *from++)) to++;

  return to;
}
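(And the reason it's handy for the small output buffers mentioned above:
each call returns the new end of string, so you can chain appends without
rescanning the way repeated strcat() does. A throwaway example, nothing
from toybox:)

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

int main(void)
{
  char buf[256], *s = buf;

  // Each stpcpy() picks up exactly where the previous one left off.
  s = stpcpy(s, "mount -t ");
  s = stpcpy(s, "ext4");
  s = stpcpy(s, " /dev/sda1 /mnt");

  printf("%s (%d bytes)\n", buf, (int)(s-buf));

  return 0;
}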
>> That's really cool. Thanks. I wonder where that's needed in lib/* or
>
> Don't forget the name "msdos" in its history. ;-)
Up through DOS 3 Paul Allen was in charge so it wasn't so bad. Keep in
mind DOS was about the best you could do on the original IBM PC with
16-64k of ram. (That's kilobytes, not megabytes. Paul Allen wanted to
do xenix but it just didn't have the memory.)
Paul Allen came down with Hodgkin's lymphoma and quit the company in
1983, leaving Gates and his lapdog Ballmer in charge, at which point the
techies were reporting to marketing. But before that there was actual
technical merit in the company, viewed through the context they were
operating in.
This is also the reason DOS 4, 5, and 6 were essentially interchangeable
with 3. No new useful innovation came out of Microsoft for the next 20
years with exactly one exception:
http://blogs.msdn.com/larryosterman/archive/2005/02/02/365635.aspx
Did I mention computer history is a hobby of mine?
http://landley.net/history/mirror
>> Did you read http://landley.net/toybox/design.html yet? Do I need to
>> fluff that out a bit?
>
> Sorry, it was a bunch of docs. First I read the stuff easily linked and
> then the urls you posted.
I have been working on toybox for a number of years, so there's an awful
lot of context. The code that's there is what I managed to distill all
the design work _down_ to.
>> all the same darn thing just like ANSI/ISO C is the same standard
>> approved by two standards bodies...)
>
> Yes, we should be glad there is not a DIN standard yet. ;-)
I can presumably avoid caring. (Just because a standard body emits
something doesn't mean I have to pay attention. SUSv4, C99, and LP64
are standards I _like_.)
>> Trust me: I know how to profile stuff, and how to understand the
>
> I do. Even before I read your opinion about dumb programmers I already
> tried to think that way.
>
> My experiences are mainly the results of writing some monitoring tools
> which sometimes can cause i/o wait, or writing a fast fgrep where I
> measured that the size of the buffer is a great killer if you want to
> speed it up. :-)
Proper batch sizes have been one of the main performance knobs for 60
years. There was a marvelous talk Maddog gave at LinuxWorld Expo in
2001 about speeding up an old reel-to-reel tape backup job from taking
all day to being done in minutes (and from requiring many tapes to
fitting on one) by changing single byte writes into kilobyte block
writes, because for each write the tapes would spin up to speed (leaving
a gap), write a start of transaction, write the data byte, write an end
of transaction, and then spin to a stop (more gap). Writing a kilobyte
at a time this didn't dominate, writing a byte at a time it did.
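(The same knob is still there today; a throwaway sketch of the difference
between byte-at-a-time and block writes to the same fd, error handling
mostly omitted for brevity:)

#include <unistd.h>

// One write() syscall per byte: the modern echo of the tape gap problem.
// (Return values ignored here to keep the bad example short.)
void slow_copy(int fd, char *buf, int len)
{
  int i;

  for (i = 0; i < len; i++) write(fd, buf+i, 1);
}

// One write() per block, looping only for short writes: same data, a
// tiny fraction of the syscall overhead.
void fast_copy(int fd, char *buf, int len)
{
  while (len > 0) {
    ssize_t n = write(fd, buf, len);

    if (n <= 0) break;
    buf += n;
    len -= n;
  }
}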
But most optimization is a moving target. Precalculating trigonometry
tables for video games (so you could point your ship any direction in
360 degrees) was great... until the processors started getting a CPU
cache and the calculation that stayed in cache was faster than the table
lookup that had to fault in a cache line. Then the cache grew large
enough that the whole table fit in cache and the table lookup was once
again faster! Until machines got floating point coprocessors, and then
the table was slower again...
My response? Do the simple thing. That particular ping-pong cycle I
noticed was 15 years ago; that's 10 iterations of Moore's Law, i.e. 10
DOUBLINGS of processor speed, so everything they were optimizing for is
utterly irrelevant and people have since implemented an emulator IN
FLASH that runs the old binaries...
Fabrice Bellard (the creator of tinycc and qemu) wrote an i386 emulator,
booting Linux, ENTIRELY IN JAVASCRIPT:
http://bellard.org/jslinux
On a modern PC, it's reasonably snappy. (It would look snappier if he'd
compiled his kernel with EARLY_PRINTK, but oh well. Not his area.)
Back up and think about that for a moment. What is the _goal_ here?
>> Which is why they changed it so gettimeofday() can just read an atomic
>> variable out of the vsyscall page:
>
> Yes, I know this. But even though gettimeofday is not a "young" call,
> there are many, many projects which didn't notice.
> I still use it in some of my tools, but only once, not many times, and
> certainly not many times a second.
I'm writing new code. I care about "given the current context I'm
writing it in, what's the best way to write it now?" I don't really
care about optimizing for obsolete stuff or trying to predict what
changes will happen in future. I'm trying to write it so it can be
easily _changed_ in future, by keeping it simple and easy to read/modify.
You said that printf() violated an early unix maxim, but there's another
one: "When in doubt, use brute force". Implement, _then_ optimize.
Rob