[Toybox] tar --null

Rob Landley rob at landley.net
Fri Jul 15 09:41:29 PDT 2022


On 7/14/22 18:53, enh wrote:
> On Wed, Jul 13, 2022 at 11:58 PM Rob Landley <rob at landley.net
> <mailto:rob at landley.net>> wrote:
> 
>     On 7/12/22 19:13, enh via Toybox wrote:
>     > so.. --transform works (though it confused people that it's not in the --help
>     > output; they didn't know that there's a --xform synonym),
> 
>     Well, it doesn't FULLY work yet: those trailing type indicators aren't in yet. I
>     have a todo thingy for it and think I know how I want to handle it now (sed
>     --tartransform enabling extra s///X trailing selectors, making the result NULL
>     terminated, disabling NULL output repacement so it doesn't get out of sync, and
>     stripping a file type indicator character from the start of each input line; yes
>     this is the correct use for a --longopt, internal nonsense one program passes to
>     another which no human should ever type...)
> 
> yeah, no worries --- sufficient unto the day are the bugs that are currently
> getting in people's way :-)

Yeah but August 6 is 3 months from the previous release and I'd like to do that
on a more regular schedule (modulo maybe slipping a bit to sync up with kernel
releases for mkroot), meaning I want to finish this properly soonish. :)

I have a half dozen open cans of worms right now... dd, sh, mkroot walkthrough,
diff, tar --transform, a redo of lib/passwd.c and everything depending on it,
and in file.c:

+ * TODO: XZ, JPEG size, dpkg.deb, rpm, mp3, odt, mp4, iso
+ * MBR boot sector (partition X: startsector %d, %d sectors;)
+ * word (.docx: Word 2007+), excel

Trying to close tabs for a release. :)

>     And the streaming approach does not care at ALL about the size of the _file_.
>     You can feed it gigabytes of input and for the matching sections it's going
>     read/compare/discard cycling through three trailing lines of context so it can
>     start a new hunk when it sees a difference...
> 
>     I strongly suspect there are pathological inputs that break this, or else why
>     would all these guys have done the complicated "read the whole file in at once"
>     approach in the first place? But I can't find what those inputs ARE. My real
>     problem here is a shortage of good TEST CASES...
> 
> (i know nothing useful about diff, other than where to find the pdf of the
> original paper :-) )

Neither did I at the start of this mess. Alas, I've read that paper through
twice and have not managed to turn their explanation into pseudo-code, and
that's WITH an implementation in front of me. (No guarantee that implementation
is what the paper describes, of course...)

There are two main schools of programming: the burned out ex-mathematicians and
ascended plumbers. I am firmly in the plumber camp, water wheels driving
clockwork make sense to me. I got a math minor for the same reason an acrophobic
person would go skydiving (I refused to be beaten by it), and can usually
translate from mathematician-speak to plumber-speak, but this one's being stroppy.

(I've met plenty of math-side programmers and generally pair program with them
quite well. They're always disgusted when I pile up overlapping heuristics to
cover the entire input space, but I'm very good at getting them unblocked when
they're blocked because I've basically never NOT got a way to do whatever it is.
I have a black belt in disgusting solutions, I can show you the Wrong Way To Do
It whatever it is, and then we clean it up from there.

I can pull apart what the current code is doing but there's four different
algorithms in play, and I do NOT understand why they didn't just do the simple
thing... which is sort of a fifth? Except I think it's a subset of the N^2 one,
except it's N^2 on hunk size not file size if you do it right, which means "big
enough to hurt" should almost never come up?

Stream forward until you hit a diff, and then accumulate lines from each file
one at a time scanning BACKWARDS in the other file to find matching lines (where
does new last line of file 2 match in the list-since-difference of file1), and
when you find -U *2 lines of match you've ended the hunk. Flush what you've seen
(keeping the usual three lines of starting context) and move forward again as
matched. This usually leaves unconsumed lines in the other file (sometimes ALL
of what we've loaded from one file is unconsumed, that happens when you add or
remove a single line in isolation for example) but you just need to feed those
back in as "new" lines to the search algorithm...

Yeah it's an N^2 search algorithm but what's the biggest hunk you've ever seen,
200 lines? 1000? Modern hardware doing N^2 search over 1000 lines isn't going to
break stride. The INPUT FILE size doesn't matter, except as a theoretical bound
on the upper size of the hunk if you diff two completely unrelated files, but
optimizing for that case seems silly?

Anyway, that's the rathole I'm going down there. If it doesn't work I go back to
the other thing, but this SHOULD be tiny and simple and basically an inverse of
the patch.c I wrote way back when. AND it means you don't have to special case
FIFO input and stuff because it doesn't have to be seekable, you're doing
xgetline() in a loop...

>     > but in the meantime
>     > the kernel build script now uses --null with
>     >
>     -T: https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
>     <https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964>
>     >
>     <https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
>     <https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964>>
>     >
>     > i think implementing that is a sub-line change here...
>     >
>     >   for (;TT.T; TT.T = TT.T->next) do_lines(xopenro(TT.T->arg), '\n', do_XT);
> 
>     What does this... ah, another null termination variant.
> 
>     The man page says "subsequent". Affecting _subsequent_ -T. So technically, you
>     could have:
> 
>       tar -T file-with-newlines.txt --null file-with-nulls.txt
> 
>     Which lib/args.c fundamentally can't cope with because all command line
>     arguments get processed before the command's main gets called. There's a similar
>     issue in sed.c:
> 
>       https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009
>     <https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009>
> 
>     > ...but i'm assuming you'd also want to add --no-null (last one wins),
> 
>     Yeah, modulo it still won't help the sequencing issue above. 
> 
> in a sense, it makes it worse --- they explicitly document that you can have
> stuff like `--null -T nulls --no-null -T newlines`. and yet it's only the -T
> file. not the -X file. (i mean, i get YAGNI, but there's an argument for
> orthogonality, and personally i'd put "--null should work with all file lists"
> over "i should be able to alternate between nulls and newlines if i want".)

Sigh. I do know HOW to make that work, it's just... really ugly:

Have --null and --no-null be two separate flags. If EITHER is set, iterate
through toys.argv (which is the raw unparsed command line saved for exactly this
sort of nonsense) to find the "--null" and "--no-null" entries in there, and the
-T entries (which are single dash then any "T" in the resulting string) so you
can count through the collected arg_list and annotate them appropriately.

Note: this approach will be thrown off if you have a file argument named --null
or -T and I'm willing to call that pilot error in this instance. (Although
--null -T --no-null would actually do what you want, it would just screw up
SUBSEQUENT arguments, if any. And the failure wouldn't be a segfault, it would
be a -T entry parsed in the wrong terminator mode, which is not the end of the
world.)

>     But as with
>     "truncate -s 8G" not working on 32 bits (because # parses into a long), there
>     are some corner cases in the plumbing that I have TODO items about but the fix
>     is uglier than the problem until somebody comes to me with a real world bug it
>     causes...
> 
> i'm interpreting this as a "will accept patch for just --null [that affects all
> -Ts, not just following ones], since that's all anyone wants right now"...

I was about to ask if you wanted to write it or if I should do it. :)

> sent!

And apparently eaten by gmail's spam filter? Yes, that's where it was. (Thanks
gmail, you are EATING MESSAGES ACTUALLY COMING FROM GOOGLE.COM. Again.)

But now that I've gone "well here's the 80/20 solution to handling mode shifts",
I'm tempted to code that up instead. Lemme see if I get to it this weekend, if
not I owe you this applied before monday.

Thanks,

Rob



More information about the Toybox mailing list