[Toybox] Weird stdout buffering effects

Rob Landley rob at landley.net
Sun Oct 22 01:12:47 PDT 2023


On 10/21/23 20:10, Ray Gardner wrote:
> I am working on a little something, and I found that output from my toybuf
> command redirected to a file is exceedingly slow. But if piped to cat, it
> runs about as fast as expected.

Linux pipe buffers consolidate transactions.

$ cat /proc/sys/fs/pipe-max-size
1048576

https://utcc.utoronto.ca/~cks/space/blog/unix/BufferedPipes

If you set O_DIRECT on the pipe (see man 2 pipe; you can flip it on an existing
pipe with fcntl(fd, F_SETFL, O_DIRECT)) then it operates in "packet mode" so the far
end gets the same transaction sizes from read() that the writer sent into the pipe.

Which is important for the shell! Because otherwise stuff like:

$ { echo one; echo two; } | { read i; echo shell got $i; echo head got; head -n 1; }
shell got one
head got
two

Can't do the handoff. (For a seekable file you can lseek() the fd back to
ftello()'s position to undo the FILE * internal buffer's readahead, but for a
nonseekable file you can't put data BACK into the input, so you have to not read
extra in the first place.)

> So I messed around with the
> TOYFLAG_LINEBUF and with setvbuf() modes.
> 
> I put a line at the top of my program so it does only this:
> 
> for (i = 0; i < 1024 * 512; i++) xprintf("%s\n", "asdfasdf"); return;
> 
> Then I tried putting just in front of it one of these:
> 
> setvbuf(stdout, 0, _IOLBF, 0);
> setvbuf(stdout, 0, _IOFBF, 1024);
> 
> Can anyone explain these odd timings?

With line buffering, encountering \n in the output stream flushes the FILE *
buffer, so you're making a write() syscall every 9 bytes of output
("asdfasdf\n"). With an explicit buffer size of 1024 you're making a write()
syscall every 1024 bytes of output. (Meaning if you printf("It worked\n") and
then do something else for a while without exiting or manually flushing,
whoever's watching won't get notified it worked.)

The 9 byte version is making over 100 times as many write() syscalls, each of which has a
kernel entry and (on modern systems that worry about heartbleep and melthammer
and so on) a page table swap. Cutting edge processors can change the page tables
WITHOUT flushing TLB... which I think is one of the things that opened the darn
sidechannel information leaks in the first place and they probably have some
nonsense explicitly disabling that. I watched them argue about it but didn't
take notes because it's chip-specific in ways you'd usually do a microcode
update for but "not in this instance, captain"...

As Scotty said about the Excelsior, the more they overhaul the plumbing the
easier it is to stop up the drain. Racehorses die if you give them water at the
wrong temperature, Indy 500 cars are even finickier and have a nasty habit of
dousing pit crews with flaming methanol (which puts out light outside the human
visible range when it burns so you CAN'T SEE IT in daylight, so people have
caught fire and nobody around them could TELL for a while... As workplace safety
issues go, "methanol fires" is a good one to avoid.)

Sigh. Jon "maddog" Hall had a lovely anecdote about a program near the start of
his career running on a minicomputer saving output to reel-to-reel tape, and how
a transaction took many hours to run and ate a half-dozen tapes, so somebody had
to stay there and change tapes to let it complete. And he looked at the code and
found it was doing single byte writes of its output, and each time the tape
drive had to spin the tape up to speed (skipping some blank tape to do this),
write out the single byte surrounded by "here's some data" and "that was the
data" handshaking (bigger than the byte), and then spin the tape drive back down
to a stop (skipping more blank tape). So even though memory was precious back
then he added something like a 64 byte output buffer so it wrote the data in
blocks, and the many-hour job suddenly took a few minutes and its output fit on
one tape.

I believe said anecdote was in one of his talks at LinuxWorld Expo 2001, which I
bought the audiocassettes of and should try to digitize someday since I can't
find them online...

I keep quoting "There are 2 hard problems in computer science: cache
invalidation, naming things, and off-by-1 errors." which is apparently from Leon
Bambrick. For posix systems, anyway. Over in the smp/threading/network space
Mathias Verraes offers "There are only two hard problems in distributed systems:
2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery"

Anyway, the "cache invalidation" line is all this buffering nonsense. If you
don't buffer, it's slow. If you do buffer, you've opened a can of worms and have
to collate it. The buffer is never the right size, the buffer is never flushed
at the right time, information can get lost or duplicated going into and coming
out of the buffer...

I have this nasty habit of either wanting a black box that works and is NOT MY
PROBLEM, or to do it myself. Any black box I have to FIDDLE WITH is black magic,
and that's a skillset I prefer to admire from afar:

https://vimeo.com/74965870

> with TOYFLAG_LINEBUF and:
> setvbuf(stdout, 0, _IOLBF, 0);
> real    0m13.281s
> user    0m0.611s
> sys     0m2.486s
> 
> with TOYFLAG_LINEBUF and:
> setvbuf(stdout, 0, _IOFBF, 1024);
> real    0m0.087s
> user    0m0.045s
> sys     0m0.004s
...
> Seems setvbuf(..., _IOLBF,...) alone is horribly slow, but with
> TOYFLAG_LINEBUF it runs about the same as no setvbuf() runs with or
> without TOYFLAG_LINEBUF. Using setvbuf(..., _IOFBF,...) without
> TOYFLAG_LINEBUF takes twice as long, but with TOYFLAG_LINEBUF it runs way
> faster (< 1 sec).
>
> These results were in Linux Mint. I got similar results with WSL Ubuntu,
> but more exaggerated (bad times were worse).
> 
> Again, this is with output redirected to a file. Redirecting to /dev/null,
> or piping through cat or to a crc program that reads from stdin, they all
> run quite fast.
> 
> What gives?

Sigh. Since I _have_ accumulated a lot of black magic knowledge over the years
(I just don't like to RELY on it, or inflict the need on other people)... you're
using a variant of gnome terminal, aren't you?

Gnome terminal (and even quite distant derivatives that use its libraries)
implement an unbuffered blocking tty device. As in writes to gnome terminal
block until the recipient consumes them, which includes drawing and publishing
the relevant screen update. Meaning "make V=1 -j $(nproc)" running under gnome
terminal often runs significantly slower than the EXACT SAME BUILD piped through
tee, because that inserts a pipe buffer the unread data can accumulate into.
(The fact that tee blocks until all its outputs have consumed the written data
doesn't mean that the pipe INTO tee can't buffer the data).

That's a very small part of the reason I did
https://static.lwn.net/2002/0117/a/blueberry.php3 years ago which was one of
those "don't ask questions, post errors" things that inspired the kernel guys to
redo the kernel build process so instead of spamming all the gcc command lines
to stdout it gave the modern summary, except they implemented it in make instead
of python which is PROBABLY less bad but you'd need an old priest and a young
priest to get a reliable answer on that one.

(Alas, people fixing this these days tend to invoke threads, but... eh, better
than NOT fixing it.)

If you're wondering why so many of my build scripts pipe the output through tee
even when they don't necessarily have to... old habit.

Rob
