[Toybox] [PATCH] Implement mv -n / cp -n (no clobber).

Sun Mar 27 11:25:25 PDT 2016

>   rm /usr/bin/toybox
>   cp toybox /usr/bin/toybox #fails because cp is dangling symlink
>
> It's not a race condition, it's a "you can keep running a binary after
> it's deleted but can't launch new instances" problem.

OK I see now.  Why not just:

mv /usr/bin/toybox /tmp/cp
/tmp/cp toybox /usr/bin/toybox

If that's really the only reason to use --remove-destination vs rm &&
cp, then it does seem superfluous (typical GNU bloat like cat -v).

> Until recently I had CP_MORE so you could configure a cp with only the
> posix options, but one of the philosophical differences I've developed
> since leaving toybox is all that extra configuration granularity is
> nuts.

FWIW I agree -- configuring support for flags *within* a command is
very fine grained and I doubt most people would use it.

> So I threw out CP_MORE as a bad idea, and almost all commands just have
> the "include it or not" option now. There are a few global options, but
> not many, and I may even eliminate some of those (I18N: the world has
> utf8 now, deal with it).

I agree utf-8 is the right choice... The expr.c code from mksh has a
bunch of multibyte character support at the end, which makes you
appreciate the simplicity of utf-8:

https://github.com/MirBSD/mksh/blob/master/expr.c

The bash maintainer seems to talk with some regret about supporting
multibyte characters: http://aosabook.org/en/bash.html

> The lfs-bootstrap.hdc build image (which I'm 2/3 done updating to 7.8, I
> really need to get back to that) does a "find / -xdev | cpio" trick to
> copy the root filesystem into a subdirectory under /home and then chroot
> into that, so your builds are internally persistent but run in a
> disposable environment.
>
> All this predates vanilla containers, I should probably add some
> namespace stuff to it but haven't gotten around to it yet...

I'll have to look at Aboriginal again... but for builds, don't you
just need chroots rather than full fledged containers?  (i.e. you
don't really care about network namespaces, etc.)

Oh one interesting thing I just found out is that you can use user
namespaces to fake root (compare with Debian's LD_PRELOAD fakeroot
solution)
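
A rough sketch of the user namespace trick, assuming util-linux
unshare(1) with its -r (--map-root-user) option and a kernel that
allows unprivileged user namespaces.  The function name fake_root_uid
is made up for illustration, and on systems where user namespaces are
disabled it just reports that:

```shell
# Print the uid seen inside a new user namespace in which the current
# user is mapped to root.  Prints "unavailable" if unshare is missing
# or the kernel refuses to create the namespace.
fake_root_uid() {
  if unshare -r true 2>/dev/null; then
    unshare -r id -u
  else
    echo "unavailable"
  fi
}

fake_root_uid
```

Nothing here is setuid; the "root" only exists inside the namespace.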

Last year, I was using this setuid root executable
(https://git.gnome.org/browse/linux-user-chroot/commit/), which is a
nice primitive for reproducible builds (i.e. not running lots of stuff
as root just because you need to chroot).

And I see in their README they are pointing to a Bazel (google build
system) tool that has an option to fake root with user namespaces.
Although I'm not sure you want to make that executable setuid root.

> A) Elaborate on "oddly conflates" please? I saw it as 'ensure this path
> is there'.
>
> B) [ ! -d "$DIR" ] && mkdir "$DIR"

It says this right in the help:

 -p, --parents     no error if existing, make parent directories as needed

I guess you can think of the two things as related, but it's easy to
imagine situations where you only want to create a direct descendant
and it's OK if it exists.

B) has a race condition whereas checking errno doesn't, and mkdir $DIR
|| true has the problem that it would ignore other errors.
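
Shell can't see errno directly, so the closest approximation I know
of is to act first and inspect afterwards.  A minimal sketch (the
name ensure_dir is hypothetical, not something in toybox or
coreutils):

```shell
# ensure_dir DIR: create DIR itself (never its parents).  Succeed if it
# already exists as a directory; report every other failure.
ensure_dir() {
  # Try the mkdir first and look at the result afterwards, so there is
  # no window between a test and the action it guards.
  err=$(mkdir -- "$1" 2>&1) && return 0
  # mkdir failed: that is fine only if a directory is already there.
  [ -d "$1" ] && return 0
  # Anything else (missing parent, permissions, plain file in the way)
  # is still an error.
  printf '%s\n' "$err" >&2
  return 1
}

d=$(mktemp -d)
ensure_dir "$d/a"                      # creates it
ensure_dir "$d/a"                      # already there, still succeeds
ensure_dir "$d/x/y" 2>/dev/null || echo "parent missing: still an error"
rm -rf "$d"
```

That gives the "no error if existing" half of -p without the "make
parent directories" half, and without swallowing unrelated errors.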

>> # likewise rm can't be run twice; the second time it will fail because
>> the file doesn't exist.  --force conflates the behavior of ignoring
>> missing arguments with not prompting for non-writable files
>
> -f means "make sure this file is not there".

The help also describes the two different things it does:

-f, --force           ignore nonexistent files and arguments, never prompt

The first behavior makes it idempotent... the second disables the
check when writing over read-only files, which is unrelated to
idempotency (and yes I get that you're modifying the directory and not
the file, but that's the behavior rm already has)
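
The idempotency half is easy to see in a scratch directory.  This
only demonstrates the "ignore nonexistent files" behavior, not the
prompting one, and the variable names are just for illustration:

```shell
# Throwaway directory so the demo touches nothing real.
d=$(mktemp -d)
: > "$d/file"

first=0;  rm "$d/file" || first=$?                 # file exists
second=0; rm "$d/file" 2>/dev/null || second=$?    # file is already gone
forced=0; rm -f "$d/file" || forced=$?             # gone, but -f doesn't care

echo "rm: first=$first second=$second   rm -f on missing file: $forced"
rmdir "$d"
```

The plain rm succeeds once and then fails; rm -f can be repeated as
often as you like.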

>> # behavior depends on whether bar is an existing directory, -T /
>> --no-target-directory fixes this I believe
>> $ cp foo bar
>
> I do a lot of "cp/mv/rsync fromdir/. todir/." just to make this sort of
> behavior shut up, but it's what posix said to do.

What does this do?  It doesn't seem to do quite what -T does:

$ ls
bar  foo  # empty dirs
$ mv foo/. bar/.
mv: cannot move ‘foo/.’ to ‘bar/./.’: Device or resource busy
$ mv -T foo bar  # now foo is moved over the empty dir bar
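
For comparison, here is the cp case spelled out, assuming a cp that
supports -T (GNU coreutils does).  Running the same command twice
gives a different tree without -T and the same tree with it:

```shell
d=$(mktemp -d)
mkdir "$d/foo"
: > "$d/foo/file"

cp -r "$d/foo" "$d/bar"     # bar doesn't exist yet: bar is a copy of foo
cp -r "$d/foo" "$d/bar"     # bar exists now: the copy lands in bar/foo

cp -rT "$d/foo" "$d/baz"    # baz doesn't exist yet: baz is a copy of foo
cp -rT "$d/foo" "$d/baz"    # baz exists now: still treated as the target

plain=$(ls "$d/bar" | tr '\n' ' ')
with_T=$(ls "$d/baz" | tr '\n' ' ')
echo "cp -r twice:  $plain"
echo "cp -rT twice: $with_T"
rm -rf "$d"
```

After the second plain cp -r, bar contains both file and a nested
foo; with -T, baz contains only file.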

> Yes and no. I've seen a lot of people try to "fix" unix and go off into
> the weeds of MacOS X or GoboLinux. Any time a course of action can be
> refuted by an XKCD strip, I try to pay attention. In this case:
>
> https://xkcd.com/927/
>
> Unix has survived almost half a century now for a _reason_. A corollary
> to Moore's Law I noticed years ago is that 50% of what you know is
> obsolete every 18 months. The great thing about unix is it's mostly the
> same 50% cycling out over and over.

Definitely agreed -- but that's why I'm not creating an alternative,
but starting with existing behavior and adding to it.  That's one of
the reasons I am interested in toybox... to puzzle through all the
details of existing practice and standards where relevant, to make
sure I'm not inventing something worse :)

The motivation for the idempotency is a long story... but suffice to say
that people are not really using Unix itself for distributed systems.
They are building non-composable abstractions ON TOP of Unix as the
node OS (new languages and data formats -- Chef/Puppet being an
example; Hadoop/HDFS; and tons of Google internal stuff).  AWS is a
distributed operating system; Google has a few distributed operating
systems as well.  It's still the early days and I think they are
missing some lessons from Unix.

Sure I could just go change coreutils and bash ... I've been puzzling
through the bash source code and considering that.

If one of your goals is to support Debian, I think you should be
really *happy* that they went through all the trouble of porting their
shell scripts to dash, because that means all the shell scripts use
some common subset of bash and dash.  Portability means that the
scripts don't go into all the dark corners of each particular
implementation.

bash is like 175K+ lines of code, and if you wanted to support all of
it, I think you would end up with at least 50K LOC in the shell...
which is almost the size of everything in toybox to date.  If on the
other hand you want a reasonable and compatible shell, rather than an
"extremely compatible" shell, it would probably be a lot less code...
hopefully less than 20K LOC (busybox ash is 13K LOC IIRC, but it's
probably too bare)

> I've come to despise declarative languages. In college I took a language
> survey course that covered prolog, and the first prolog program I wrote
> locked the prolog interpreter into a CPU-eating loop for an hour, in
> about 5 lines. The professor looked at it for a bit, and then basically
> said to write a prolog program that DIDN'T do that, I had to understand
> how the prolog interpreter was implemented. And this has pretty much
> been my experience with declarative languages ever since, ESPECIALLY make.

This is a long conversation, but I think you need an "escape hatch"
for declarative ones, and make has one -- the shell.  If you don't
have an escape hatch, you end up with tortured programs that work
around the straitjacket of the declarative language.  (But this is
not really related to what I was suggesting with idempotency; this is
more of a semantic overload of "declarative")

Unfortunately GNU make's solution was not to rely on the escape hatch
of the shell, but to implement a tortured shell within Make (it has
looping, conditionals, functions, variables, string library functions,
etc. -- an entirely separate Turing complete language)

Make's abstraction of lazy computation is useful (although it needs to
be updated to support directory trees and stuff like that).  But most
people are breaking the model and using it for "actions" -- as
mentioned, the arguments to make should be *data* on the file system,
and not actions; otherwise you're using it for the wrong job and
semantics are confused (e.g. .PHONY pretty much tells you it's a hack)

> I do this kind of thing ALL THE TIME. I have a black belt in "sticking
> printfs into things" because I BREAK DEBUGGERS. (I'm quite fond of
> strace, though, largely because it's survived everything I've thrown at
> it and is basically sticking a printf into the syscall entry for me so I
> don't have to run the code under User Mode Linux anymore, where yes I
> literally did that.)

I think the problem is that you expect things to actually work!  :)  A
lot of programmers have high expectations of software; users generally
have low expectations.

http://blog.regehr.org/archives/861 -- "How have software bugs trained
us? The core lesson that most of us have learned is to stay in the
well-tested regime and stay out of corner cases. Specifically, we will
... "

Another hacker who has the same experience:
http://zedshaw.com/2015/07/08/i-can-kill-any-computer/

I was definitely like this until I learned to stop changing defaults.
Nobody tests anything but the default configuration.  Want to switch
window managers in Ubuntu?  Nope, I got subtle drawing bugs related to
my video card.  As penance for my lowered expectations, I try to work
on quality software...

> Oh, and when $(commands) produce NUL bytes in the output, different
> shells do different things with them. (Bash edits them out but retains
> the data afterwards.)
>
> I was apparently pretty deep into this stuff in mid-2006:

Yeah hence my warning about trying to be too compatible with bash ...
Reading the aosabook bash article and referring to the source code
opened my eyes a lot.  sh does have a POSIX grammar, but it's not
entirely useful, as he points out, and I see what he means when he
says that using yacc was a mistake (top-down parsing fits the shell
more than bottom-up).
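
Going back to the NUL byte point quoted above, this is a quick way to
see what a given shell does with it.  I'm deliberately not stating
the expected output, since that is exactly the part that varies
between shells:

```shell
# Capture output that contains a NUL byte.  The braces are there so
# that any warning the shell itself prints about the NUL is discarded.
{ x=$(printf 'a\0b'); } 2>/dev/null

# Show how many characters survived and what they are.
echo "length: ${#x}"
printf '%s' "$x" | od -c
```

Run it under bash, dash, mksh and so on and compare.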

On the other hand, writing a shell parser and lexer by hand is a
nightmare too (at least if you care about bugs, which most people seem
not to).  I'm experimenting with using 're2c' for my shell lexer,
which seems promising.

Reading the commit logs of bash is interesting... all of its features
seem to be highly coupled.  There are lots of lists like this where
one feature is compared against lots of other features:
http://git.savannah.gnu.org/cgit/bash.git/tree/RBASH .  The test
matrix would be insane.

>> and
>> toybox/busybox are obvious complements to a shell.  Though it's
>> interesting that busybox has two shells and toybox has zero, I think
>> my design space is a little different in that I want it to be sh/bash
>> compatible but also have significant new functionality.)
>
> Other than "loop", what are you missing?

At a high level, I would say:

1) People keep saying to avoid shell scripts for serious "software
engineering" and distributed systems.  I know a lot of the corner
cases and a lot of people don't, so that could be a defensible
position.  You can imagine a shell and set of tools that were a lot
more robust (e.g. pedantically correct quoting is hard and looks ugly,
but also more than that)

2) Related: being able to teach shell to novices with a straight face.
Shell really could be an ideal first computing language, and it was
for many years.  Python or even JavaScript is more favored now
(probably rightly).  But honestly shell has an advantage in that to
*DO* anything, you need to talk to a specific operating system, and
Python and JavaScript have this barrier of portability.  But the bar
has been raised in terms of usability -- e.g. memorizing all these
single letter flag names is not really something people are up to.

3) Security features for distributed systems ... sh is obviously not
designed for untrusted input (including what's on the file system).
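
To make the quoting point in 1) concrete, here is about the smallest
example I can think of, assuming the default IFS (the helper
count_args and the file name are made up):

```shell
# Report how many arguments the function received.
count_args() { echo "$#"; }

f="my file name.txt"

unquoted=$(count_args $f)      # word splitting breaks the name apart
quoted=$(count_args "$f")      # one argument, as intended

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

The unquoted expansion passes three arguments and the quoted one
passes one, and nothing warns you about the difference.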

I could get into a lot of details but I guess my first task is to come
up with something "reasonably" compatible with sh/bash, but with a
code structure that's extensible.

FWIW toybox code is definitely way cleaner than bash, though I
wouldn't necessarily call it extensible.  You seem to figure out the
exact set of semantics you want, and then find some *global* minimum
in terms of implementation complexity, which may make it harder to add
big features in the future (you would have to explode and compress
everything again).  I suppose that is why this silly -n patch requires
recalling everything else about cp/mv, like --remove-destination :)
But I definitely learned something from this style even though I'm not
sure I would use it for most projects!

Andy
