[Toybox] [PATCH] An implementation of the 'csplit' command

enh enh at google.com
Tue Sep 12 12:55:28 PDT 2023


On Tue, Sep 12, 2023 at 12:36 PM Rob Landley <rob at landley.net> wrote:
>
> On 9/11/23 23:56, Oliver Webb via Toybox wrote:
> > I have made an implementation of the 'csplit' command in about 160 lines of code.
>
> You have TOYFLAG_MAYFORK on this command. Sigh, explaining the lib/toyflags.h
> values is one of the tutorial videos I need to make.
>
> Forking is the default behavior for launching new commands in toybox.
> TOYFLAG_NOFORK and TOYFLAG_MAYFORK are for the toybox shell (sh.c). The first
> indicates a shell builtin that can only run within the shell's process (like
> "cd", since forking a child process, calling chdir() in the child, and having
> the child exit doesn't actually change the parent's getcwd() value). NOFORK
> commands don't show up in the command list output by running "toybox", but they
> do show up in the command list you get by running "help" with no arguments in
> the shell.
>
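
(A minimal sketch of the "cd" point above, not toybox code: the function name is made up for illustration. A forked child calls chdir("/") and exits, and the parent compares its own getcwd() before and after.)

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: returns 1 if the parent's working directory is
// unchanged after a forked child calls chdir("/"), 0 if it changed,
// -1 on error.
int parent_cwd_survives_child_chdir(void)
{
  char before[4096], after[4096];
  int status;
  pid_t pid;

  if (!getcwd(before, sizeof(before))) return -1;

  pid = fork();
  if (pid < 0) return -1;
  // Child: this chdir() only affects the child's own process.
  if (!pid) _exit(chdir("/") ? 1 : 0);

  if (waitpid(pid, &status, 0) != pid) return -1;
  if (!WIFEXITED(status) || WEXITSTATUS(status)) return -1;

  if (!getcwd(after, sizeof(after))) return -1;

  return !strcmp(before, after);
}
```

Calling parent_cwd_survives_child_chdir() from any directory should return 1, which is why a "cd" that forks is useless to the shell.
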
> The second (MAYFORK) indicates a command that _can_ run standalone, and thus
> shows up in the "toybox" list so the installer creates a symlink for it in the
> search $PATH, but when it runs from toysh it acts like NOFORK and is a function
> call made by the current process (and eventually returns back to the shell so
> the shell's PID can go on and do more shell things afterwards). This allows the
> command to access the shell's data structures, and thus perform additional
> functions such as setting environment variables in the shell (printf %n), or
> accessing the job control list (kill %1).
>
> Since both NOFORK and MAYFORK commands can be run from within the shell, they
> have to scrupulously clean up after themselves. When they call xexit() and
> friends (which includes things like perror_exit() and stuff like xmalloc() that
> can call it) they longjmp() back to toysh instead of exiting, which means
> resources like filehandles and heap allocations and any mmap() it does may have
> to live in the GLOBALS() block, and it may need a sigatexit() handler to free
> that stuff out of GLOBALS so long-running shells (or shell scripts) don't
> accumulate leaked debris from builtins that exited abnormally.
>
> (Note: lib/lib.c has sigatexit() instead of libc's atexit() because WHEN we
> longjmp() back to the shell, we need to first call our own atexit() handlers and
> then remove them from the list. The libc ones don't let you call them and remove
> them from the list libc maintains without exiting. Auditing everything for
> leaks, including all the NOFORK and MAYFORK commands, is a big todo item in the
> shell work I need to dive into at some point...)
>
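
(A rough sketch of the idea described above, assuming nothing about how lib/lib.c actually does it. Every name here is hypothetical. It keeps its own handler list so the handlers can be run and removed without the process exiting, then longjmp()s back to the caller that stands in for the shell.)

```c
#include <setjmp.h>
#include <stdlib.h>

// All names below are hypothetical illustrations, not toybox's API.
struct exit_handler {
  struct exit_handler *next;
  void (*fn)(void);
};

static struct exit_handler *handlers;
static jmp_buf *rebound;     // set while a builtin runs in-process
static int last_exit_code;
int cleanups_run;            // counts cleanup calls, for demonstration

// Register a cleanup function on our own list instead of libc's.
void add_exit_handler(void (*fn)(void))
{
  struct exit_handler *h = malloc(sizeof(*h));

  if (!h) abort();
  h->fn = fn;
  h->next = handlers;
  handlers = h;
}

// Exit replacement: run each handler once, removing it from the list,
// then return to the shell if one is waiting, else really exit.
void sketch_exit(int code)
{
  while (handlers) {
    struct exit_handler *h = handlers;

    handlers = h->next;
    h->fn();
    free(h);
  }
  if (!rebound) exit(code);
  last_exit_code = code;
  longjmp(*rebound, 1);
}

static void count_cleanup(void)
{
  cleanups_run++;
}

// A fake builtin that registers a cleanup and then bails out early.
void fake_builtin(void)
{
  add_exit_handler(count_cleanup);
  sketch_exit(3);
}

// Run a builtin as a function call and return the exit code it asked for.
int run_builtin_in_process(void (*builtin)(void))
{
  jmp_buf jb;

  last_exit_code = 0;
  rebound = &jb;
  if (!setjmp(jb)) builtin();
  rebound = 0;

  return last_exit_code;
}
```

With this, run_builtin_in_process(fake_builtin) hands back 3 to the caller instead of killing the process, and the cleanup runs exactly once per call because the handler is unlinked before the jump.
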
> I dunno why csplit would want MAYFORK here. A normal command can just xexit()
> and let the kernel close filehandles and free memory when the process exits. I
> note that 95% of the overhead of fork/exec is the exec part, not the fork part,
> so "fork and call toy_find("blah")->toy_main()" is still pretty cheap. (On
> systems with an MMU, anyway. It's all copy on write. I'm aware Rich Felker
> disagrees, but he's always using threads for everything, and threads have
> _always_ combined badly with fork(). I suspect he's setting up some gratuitous
> thread plumbing by default that he thought was free, and suddenly he noticed
> he's penalized fork(), and now he's blaming fork(). But I haven't looked deeply
> into the details of what he's mad about, because I dowanna. But, you know, the
> linux-kernel guys would have NOTICED if fork() was slow. As would everybody else
> everywhere.)
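
(For illustration, "fork and call the command's main function" without any exec might look roughly like the sketch below. example_main() is a hypothetical stand-in for whatever toy_find() would hand back, and fork_and_call() is not a toybox function.)

```c
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a command's main function.
int example_main(void)
{
  return 7;
}

// Fork, run the command function directly in the child (no exec), and
// return its exit status to the parent, or -1 on error.
int fork_and_call(int (*cmd_main)(void))
{
  int status;
  pid_t pid = fork();

  if (pid < 0) return -1;
  // Child: the kernel reclaims its memory and filehandles on _exit().
  if (!pid) _exit(cmd_main());
  if (waitpid(pid, &status, 0) != pid) return -1;

  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child can leak whatever it likes because the process goes away, which is the contrast with the NOFORK/MAYFORK case.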

(i doubt it's him so much as people using musl in large programs. but
the issues with fork() on large modern hardware running large modern
programs are well known.
https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
is a good recent summary, but USENIX has been talking about this stuff
for at least 20 years. macOS implements posix_spawn() as a syscall.
linux still seems to be on the clone() and close_range() path of
hacks.)
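
(for comparison, a minimal sketch of launching a child with posix_spawnp() instead of an explicit fork()/exec() pair. spawn_and_wait() is a made-up helper name; the posix_spawnp() and waitpid() calls are the standard POSIX ones.)

```c
#include <stddef.h>
#include <spawn.h>
#include <sys/wait.h>

extern char **environ;

// Hypothetical helper: launch argv[0] (searched for in $PATH), wait for
// it, and return its exit status, or -1 on error.
int spawn_and_wait(char *const argv[])
{
  int status;
  pid_t pid;

  // No file actions and no spawn attributes: inherit everything.
  if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ)) return -1;
  if (waitpid(pid, &status, 0) != pid) return -1;

  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

the caller never duplicates its own address space here, which is the property the fork() critics care about; how the libc implements it underneath varies by platform.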

> > The implementation is mostly POSIX compliant, but it is missing a few things
>
> Missing stuff out of posix is pretty normal, they specify a lot of nonsense. My
> patch implementation is missing various posix options like -b and -e, and
> not only has nobody complained, but I submitted my patch implementation to
> busybox in 2010 and _they_ haven't bothered to implement those options since either.
>
> > It works as a Read-Eval-Print loop, where it prints to a file that changes based on context. So doing negative offsets would require it to print lines it hasn't accumulated yet.
>
> Yeah, grep -A -B -C does that sort of ring buffer nonsense with lines it _may_
> need depending on later stuff. It's a fiddly pain.
>
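
(A small sketch of the ring buffer idea, assuming a fixed window of trailing lines. This is not how grep.c or the csplit patch does it, and the struct and function names are invented for the example. It remembers the last few lines so a later match can reach back by a negative offset.)

```c
#include <stdlib.h>
#include <string.h>

#define RING_SIZE 4   // how many trailing lines to remember (arbitrary)

// Hypothetical structure: zero-initialize before first use.
struct line_ring {
  char *line[RING_SIZE];
  unsigned long count;   // total lines pushed so far
};

// Remember a copy of the line, evicting the oldest one when full.
void ring_push(struct line_ring *r, const char *s)
{
  char **slot = &r->line[r->count % RING_SIZE];
  size_t len = strlen(s) + 1;

  free(*slot);
  *slot = malloc(len);
  if (!*slot) abort();
  memcpy(*slot, s, len);
  r->count++;
}

// Return the line "back" lines before the most recent one (0 = most
// recent), or NULL if it was never seen or has already been evicted.
const char *ring_back(struct line_ring *r, unsigned long back)
{
  if (back >= RING_SIZE || back >= r->count) return NULL;

  return r->line[(r->count - 1 - back) % RING_SIZE];
}
```

The catch for csplit is that lines still sitting in the ring can't be written to the current output file yet, since a later negative offset might move them into the next one.
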
> > The other main one is the fact it doesn't do "[LINE] {[NUMBER]}" cleanly yet.
>
> I applied what you sent verbatim and haven't started cleaning anything up yet,
> if you have more work to do I'm not actually familiar with csplit. (Never used
> it, still need to come up to speed...)
>
> > It also includes the GNU extension "{*}" argument
> >
> > The other breaks from POSIX are mostly insignificant, like the fact that it
> > doesn't check locale environment variables, or that it uses "%lu" for file
> > size instead of "%d".
>
> Nothing in toybox checks the locale environment variables (outside of UTF-8
> enablement for the fontmetrics stuff in main.c, and we usually _set_ the
> variables when we do that).
>
> And posix has been just plain wrong about int-vs-long printf variables since the
> general switch to 64 bits in 2005. It's coming up on 20 years since then, so
> possibly Issue 8 will finally fix that? Or maybe that's just when they finally
> noticed they're obsolete and the NEXT release would fix it? Wake me when they
> restore "tar" and deprecate "pax"...
>
> Rob
> _______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net

