[Toybox] tar --null

enh enh at google.com
Fri Jul 15 19:19:27 PDT 2022


On Fri, Jul 15, 2022 at 9:34 AM Rob Landley <rob at landley.net> wrote:

> On 7/14/22 18:53, enh wrote:
> > On Wed, Jul 13, 2022 at 11:58 PM Rob Landley <rob at landley.net
> > <mailto:rob at landley.net>> wrote:
> >
> >     On 7/12/22 19:13, enh via Toybox wrote:
> >     > so.. --transform works (though it confused people that it's not in
> the --help
> >     > output; they didn't know that there's a --xform synonym),
> >
> >     Well, it doesn't FULLY work yet: those trailing type indicators
> aren't in yet. I
> >     have a todo thingy for it and think I know how I want to handle it
> now (sed
> >     --tartransform enabling extra s///X trailing selectors, making the
> result NULL
> >     terminated, disabling NULL output repacement so it doesn't get out
> of sync, and
> >     stripping a file type indicator character from the start of each
> input line; yes
> >     this is the correct use for a --longopt, internal nonsense one
> program passes to
> >     another which no human should ever type...)
> >
> > yeah, no worries --- sufficient unto the day are the bugs that are
> currently
> > getting in people's way :-)
>
> Yeah but August 6 is 3 months from the previous release and I'd like to do
> that
> on a more regular schedule (modulo maybe slipping a bit to sync up with
> kernel
> releases for mkroot), meaning I want to finish this properly soonish. :)
>
> I have a half dozen open cans of worms right now... dd, sh, mkroot
> walkthrough,
> diff, tar --transform, a redo of lib/passwd.c and everything depending on
> it,
> and in file.c:
>
> + * TODO: XZ, JPEG size, dpkg.deb, rpm, mp3, odt, mp4, iso
> + * MBR boot sector (partition X: startsector %d, %d sectors;)
> + * word (.docx: Word 2007+), excel
>

you shouldn't do those yourself --- you should make each of those a
separate bug on github with a "help wanted" or "starter project" label, and
then next time you have someone asking "hey, is there something i can look
at?", you have stuff ready and waiting...

(not that you can 100% trust me not to do some of those when i've had a
week when i didn't get to write even a line of code and i'm looking for
something to do. but i'm trying to _stop_ doing all the easy little pieces
myself at work for similar reasons!)


> Trying to close tabs for a release. :)
>
> >     And the streaming approach does not care at ALL about the size of
> the _file_.
> >     You can feed it gigabytes of input and for the matching sections
> it's going
> >     read/compare/discard cycling through three trailing lines of context
> so it can
> >     start a new hunk when it sees a difference...
> >
> >     I strongly suspect there are pathological inputs that break this, or
> else why
> >     would all these guys have done the complicated "read the whole file
> in at once"
> >     approach in the first place? But I can't find what those inputs ARE.
> My real
> >     problem here is a shortage of good TEST CASES...
> >
> > (i know nothing useful about diff, other than where to find the pdf of
> the
> > original paper :-) )
>
> Neither did I at the start of this mess. Alas, I've read that paper through
> twice and have not managed to turn their explanation into pseudo-code, and
> that's WITH an implementation in front of me. (No guarantee that
> implementation
> is what the paper describes, of course...)
>
> There are two main schools of programming: the burned out
> ex-mathematicians and
> ascended plumbers. I am firmly in the plumber camp, water wheels driving
> clockwork make sense to me. I got a math minor for the same reason an
> acrophobic
> person would go skydiving (I refused to be beaten by it), and can usually
> translate from mathematician-speak to plumber-speak, but this one's being
> stroppy.
>
> (I've met plenty of math-side programmers and generally pair program with
> them
> quite well. They're always disgusted when I pile up overlapping heuristics
> to
> cover the entire input space, but I'm very good at getting them unblocked
> when
> they're blocked because I've basically never NOT got a way to do whatever
> it is.
> I have a black belt in disgusting solutions, I can show you the Wrong Way
> To Do
> It whatever it is, and then we clean it up from there.
>
> I can pull apart what the current code is doing but there's four different
> algorithms in play, and I do NOT understand why they didn't just do the
> simple
> thing... which is sort of a fifth? Except I think it's a subset of the N^2
> one,
> except it's N^2 on hunk size not file size if you do it right, which means
> "big
> enough to hurt" should almost never come up?
>
> Stream forward until you hit a diff, and then accumulate lines from each
> file
> one at a time scanning BACKWARDS in the other file to find matching lines
> (where
> does new last line of file 2 match in the list-since-difference of file1),
> and
> when you find -U *2 lines of match you've ended the hunk. Flush what
> you've seen
> (keeping the usual three lines of starting context) and move forward again
> as
> matched. This usually leaves unconsumed lines in the other file (sometimes
> ALL
> of what we've loaded from one file is unconsumed, that happens when you
> add or
> remove a single line in isolation for example) but you just need to feed
> those
> back in as "new" lines to the search algorithm...
>
> Yeah it's an N^2 search algorithm but what's the biggest hunk you've ever
> seen,
> 200 lines? 1000? Modern hardware doing N^2 search over 1000 lines isn't
> going to
> break stride. The INPUT FILE size doesn't matter, except as a theoretical
> bound
> on the upper size of the hunk if you diff two completely unrelated files,
> but
> optimizing for that case seems silly?
>

aye, though -- like you -- i assume that's the kind of pathological case
they were thinking of. (because although it never happens "for real", it
happens interactively, and that's probably when people are most sensitive
to speed.) i don't remember seeing a single hunk more than tens of lines
(except the other pathological case of "new file").


> Anyway, that's the rathole I'm going down there. If it doesn't work I go
> back to
> the other thing, but this SHOULD be tiny and simple and basically an
> inverse of
> the patch.c I wrote way back when. AND it means you don't have to special
> case
> FIFO input and stuff because it doesn't have to be seekable, you're doing
> xgetline() in a loop...
>
> >     > but in the meantime
> >     > the kernel build script now uses --null with
> >     >
> >     -T:
> https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
> >     <
> https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
> >
> >     >
> >     <
> https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
> >     <
> https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
> >>
> >     >
> >     > i think implementing that is a sub-line change here...
> >     >
> >     >   for (;TT.T; TT.T = TT.T->next) do_lines(xopenro(TT.T->arg),
> '\n', do_XT);
> >
> >     What does this... ah, another null termination variant.
> >
> >     The man page says "subsequent". Affecting _subsequent_ -T. So
> technically, you
> >     could have:
> >
> >       tar -T file-with-newlines.txt --null file-with-nulls.txt
> >
> >     Which lib/args.c fundamentally can't cope with because all command
> line
> >     arguments get processed before the command's main gets called.
> There's a similar
> >     issue in sed.c:
> >
> >
> https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009
> >     <https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009
> >
> >
> >     > ...but i'm assuming you'd also want to add --no-null (last one
> wins),
> >
> >     Yeah, modulo it still won't help the sequencing issue above.
> >
> > in a sense, it makes it worse --- they explicitly document that you can
> have
> > stuff like `--null -T nulls --no-null -T newlines`. and yet it's only
> the -T
> > file. not the -X file. (i mean, i get YAGNI, but there's an argument for
> > orthogonality, and personally i'd put "--null should work with all file
> lists"
> > over "i should be able to alternate between nulls and newlines if i
> want".)
>
> Sigh. I do know HOW to make that work, it's just... really ugly:
>
> Have --null and --no-null be two separate flags. If EITHER is set, iterate
> through toys.argv (which is the raw unparsed command line saved for
> exactly this
> sort of nonsense) to find the "--null" and "--no-null" entries in there,
> and the
> -T entries (which are single dash then any "T" in the resulting string) so
> you
> can count through the collected arg_list and annotate them appropriately.
>
> Note: this approach will be thrown off if you have a file argument named
> --null
> or -T and I'm willing to call that pilot error in this instance. (Although
> --null -T --no-null would actually do what you want, it would just screw up
> SUBSEQUENT arguments, if any. And the failure wouldn't be a segfault, it
> would
> be a -T entry parsed in the wrong terminator mode, which is not the end of
> the
> world.)
>
> >     But as with
> >     "truncate -s 8G" not working on 32 bits (because # parses into a
> long), there
> >     are some corner cases in the plumbing that I have TODO items about
> but the fix
> >     is uglier than the problem until somebody comes to me with a real
> world bug it
> >     causes...
> >
> > i'm interpreting this as a "will accept patch for just --null [that
> affects all
> > -Ts, not just following ones], since that's all anyone wants right
> now"...
>
> I was about to ask if you wanted to write it or if I should do it. :)
>
> > sent!
>
> And apparently eaten by gmail's spam filter? Yes, that's where it was.
> (Thanks
> gmail, you are EATING MESSAGES ACTUALLY COMING FROM GOOGLE.COM. Again.)
>
> But now that I've gone "well here's the 80/20 solution to handling mode
> shifts",
> I'm tempted to code that up instead. Lemme see if I get to it this
> weekend, if
> not I owe you this applied before monday.
>

sgtm. i've been trying to stop committing things on fridays, so monday's
the earliest i'd be giving the kernel folks a new prebuilt anyway :-)


> Thanks,
>
> Rob
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20220715/605de8bc/attachment.htm>


More information about the Toybox mailing list