[Toybox] tar --null

enh enh at google.com
Thu Jul 14 16:53:30 PDT 2022


On Wed, Jul 13, 2022 at 11:58 PM Rob Landley <rob at landley.net> wrote:

> On 7/12/22 19:13, enh via Toybox wrote:
> > so.. --transform works (though it confused people that it's not in the
> --help
> > output; they didn't know that there's a --xform synonym),
>
> Well, it doesn't FULLY work yet: those trailing type indicators aren't in
> yet. I
> have a todo thingy for it and think I know how I want to handle it now (sed
> --tartransform enabling extra s///X trailing selectors, making the result
> NULL
> terminated, disabling NULL output repacement so it doesn't get out of
> sync, and
> stripping a file type indicator character from the start of each input
> line; yes
> this is the correct use for a --longopt, internal nonsense one program
> passes to
> another which no human should ever type...)
>

yeah, no worries --- sufficient unto the day are the bugs that are
currently getting in people's way :-)


> But I've been distracted trying to wrap my head around diff, and am
> currently
> writing a toy diff that works like the reverse of my "patch" where it
> DOESN'T
> load the whole file in but just finds hunks in a streaming manner. Probably
> terrible results but I'm curious. (Where are good diff TEST inputs, that's
> what
> I wanna know. Currently I'm diffing linux-2.6.12 and linux-2.6.13 to see
> what
> fun corner cases come out of that.
>
> I actually strongly suspect a streaming approach would work reasonably
> well for
> unified diffs because A) patch hunks can't occur out of order (a move is
> always
> an insert and a delete), B) we only need to remember 3 trailing "match"
> lines to
> start a hunk and can END a hunk whenever we've seen 6 trailing "match"
> lines.
>
> Yes that's 6 not 3: patch can't handle overlapping hunks, so if there are
> fewer
> than 6 match lines we can't break the hunk and guarantee non-overlapping
> end/start context. This is why a hunk can have up to head+tail-1 many
> consecutive match lines in the middle of it, because:
>
> @@ count @@
>  start one
>  start two
>  start three
> +add one
> -zap one
>  unchanged one
>  unchanged two
>  unchanged three
>  unchanged four
>  unchanged five
> +add two
> -zap two
>  end one
>  end two
>  end three
>
> Would otherwise be:
>
> @@ count @@
>  start one
>  start two
>  start three
> +add one
> -zap one
>  unchanged one
>  unchanged two
>  unchanged three
> @@ count @@
>  unchanged three
>  unchanged four
>  unchanged five
> +add two
> +zap two
>  end one
>  end two
>  end three
>
> And that writes three to the output and then backs UP to read it in again,
> which
> patch can't do. (I tried it to see if I needed to implement it in mine.)
>
> The trick with a "streaming" diff is of course determining what's a real
> match
> and what's a spurious match, which is what all this "find longest run of
> match
> lines" algorithmic stuff is for. It's not to make the diff CORRECT, it's
> to make
> it small/clean in nasty corner cases, which C is full of largely due to
> blank
> lines and runs of:
>
>           }
>         }
>       }
>     }
>   }
>
> But UNIFIED diff needing six consecutive lines of match to end a hunk
> might make
> that less horrible? The naive approach is N^2 on the size of the hunk, but
> individual diff hunks are seldom more than a hundred lines.
>
> And the streaming approach does not care at ALL about the size of the
> _file_.
> You can feed it gigabytes of input and for the matching sections it's going
> read/compare/discard cycling through three trailing lines of context so it
> can
> start a new hunk when it sees a difference...
>
> I strongly suspect there are pathological inputs that break this, or else
> why
> would all these guys have done the complicated "read the whole file in at
> once"
> approach in the first place? But I can't find what those inputs ARE. My
> real
> problem here is a shortage of good TEST CASES...
>

(i know nothing useful about diff, other than where to find the pdf of the
original paper :-) )


> > but in the meantime
> > the kernel build script now uses --null with
> > -T:
> https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
> > <
> https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
> >
> >
> > i think implementing that is a sub-line change here...
> >
> >   for (;TT.T; TT.T = TT.T->next) do_lines(xopenro(TT.T->arg), '\n',
> do_XT);
>
> What does this... ah, another null termination variant.
>
> The man page says "subsequent". Affecting _subsequent_ -T. So technically,
> you
> could have:
>
>   tar -T file-with-newlines.txt --null file-with-nulls.txt
>
> Which lib/args.c fundamentally can't cope with because all command line
> arguments get processed before the command's main gets called. There's a
> similar
> issue in sed.c:
>
>   https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009
>
> > ...but i'm assuming you'd also want to add --no-null (last one wins),
>
> Yeah, modulo it still won't help the sequencing issue above.


in a sense, it makes it worse --- they explicitly document that you can
have stuff like `--null -T nulls --no-null -T newlines`. and yet it's only
the -T file. not the -X file. (i mean, i get YAGNI, but there's an argument
for orthogonality, and personally i'd put "--null should work with all file
lists" over "i should be able to alternate between nulls and newlines if i
want".)


> But as with
> "truncate -s 8G" not working on 32 bits (because # parses into a long),
> there
> are some corner cases in the plumbing that I have TODO items about but the
> fix
> is uglier than the problem until somebody comes to me with a real world
> bug it
> causes...
>

i'm interpreting this as a "will accept patch for just --null [that affects
all -Ts, not just following ones], since that's all anyone wants right
now"... sent!


> > and i'm not sure how to write that in the USE_ line?
>
> I've been thinking about --no-longopt as general plumbing for a while.
> It's on
> the todo list, because it comes up a lot:
>
>   $ zgrep '[-]no-' /usr/share/man/man?/*.gz | wc -l
>   787
>
> Although it hasn't been worth it yet because so far most of the --no-blah
> longopts in toybox (grep '(no-' toys/*/*.c ) haven't had a corresponding
> non-no
> version they toggle. They're mostly "--no-fork", "--no-dereference",
> "--no-symlinks" and so on. Even tar has --no-recursion. The exceptions are
> both
> in pwgen.c (--capitalize and --numerals have --no- variants).
>
> But in the meantime, when there's a short alias then the [-abc] logic will
> switch off the previous one. (Which is what pwgen is doing with
> [-cA][-n0].)
>
> When there aren't short aliases, here's a TRULY HORRIBLE trick to add them:
>
>   NEWTOY(blah, "\376(null)\375(no-null)[-\376\375]", BLAH);
>
> Octal escapes for high ascii. (Yes I added support in mkflags.c a while
> back. :)
>  Rob
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20220714/c276f1e6/attachment-0001.htm>


More information about the Toybox mailing list