<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 13, 2022 at 11:58 PM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 7/12/22 19:13, enh via Toybox wrote:<br>

> so.. --transform works (though it confused people that it's not in the --help<br>

> output; they didn't know that there's a --xform synonym),<br>

<br>

Well, it doesn't FULLY work yet: those trailing type indicators aren't in yet. I<br>

have a todo thingy for it and think I know how I want to handle it now (sed<br>

--tartransform enabling extra s///X trailing selectors, making the result NULL<br>

terminated, disabling NULL output repacement so it doesn't get out of sync, and<br>

stripping a file type indicator character from the start of each input line; yes<br>

this is the correct use for a --longopt, internal nonsense one program passes to<br>

another which no human should ever type...)<br></blockquote><div><br></div><div>yeah, no worries --- sufficient unto the day are the bugs that are currently getting in people's way :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

But I've been distracted trying to wrap my head around diff, and am currently<br>

writing a toy diff that works like the reverse of my "patch" where it DOESN'T<br>

load the whole file in but just finds hunks in a streaming manner. Probably<br>

terrible results but I'm curious. (Where are good diff TEST inputs, that's what<br>

I wanna know. Currently I'm diffing linux-2.6.12 and linux-2.6.13 to see what<br>

fun corner cases come out of that.<br>

<br>

I actually strongly suspect a streaming approach would work reasonably well for<br>

unified diffs because A) patch hunks can't occur out of order (a move is always<br>

an insert and a delete), B) we only need to remember 3 trailing "match" lines to<br>

start a hunk and can END a hunk whenever we've seen 6 trailing "match" lines.<br>

<br>

Yes that's 6 not 3: patch can't handle overlapping hunks, so if there are fewer<br>

than 6 match lines we can't break the hunk and guarantee non-overlapping<br>

end/start context. This is why a hunk can have up to head+tail-1 many<br>

consecutive match lines in the middle of it, because:<br>

<br>

@@ count @@<br>

 start one<br>

 start two<br>

 start three<br>

+add one<br>

-zap one<br>

 unchanged one<br>

 unchanged two<br>

 unchanged three<br>

 unchanged four<br>

 unchanged five<br>

+add two<br>

-zap two<br>

 end one<br>

 end two<br>

 end three<br>

<br>

Would otherwise be:<br>

<br>

@@ count @@<br>

 start one<br>

 start two<br>

 start three<br>

+add one<br>

-zap one<br>

 unchanged one<br>

 unchanged two<br>

 unchanged three<br>

@@ count @@<br>

 unchanged three<br>

 unchanged four<br>

 unchanged five<br>

+add two<br>

+zap two<br>

 end one<br>

 end two<br>

 end three<br>

<br>

And that writes three to the output and then backs UP to read it in again, which<br>

patch can't do. (I tried it to see if I needed to implement it in mine.)<br>

<br>

The trick with a "streaming" diff is of course determining what's a real match<br>

and what's a spurious match, which is what all this "find longest run of match<br>

lines" algorithmic stuff is for. It's not to make the diff CORRECT, it's to make<br>

it small/clean in nasty corner cases, which C is full of largely due to blank<br>

lines and runs of:<br>

<br>

          }<br>

        }<br>

      }<br>

    }<br>

  }<br>

<br>

But UNIFIED diff needing six consecutive lines of match to end a hunk might make<br>

that less horrible? The naive approach is N^2 on the size of the hunk, but<br>

individual diff hunks are seldom more than a hundred lines.<br>

<br>

And the streaming approach does not care at ALL about the size of the _file_.<br>

You can feed it gigabytes of input and for the matching sections it's going<br>

read/compare/discard cycling through three trailing lines of context so it can<br>

start a new hunk when it sees a difference...<br>

<br>

I strongly suspect there are pathological inputs that break this, or else why<br>

would all these guys have done the complicated "read the whole file in at once"<br>

approach in the first place? But I can't find what those inputs ARE. My real<br>

problem here is a shortage of good TEST CASES...<br></blockquote><div><br></div><div>(i know nothing useful about diff, other than where to find the pdf of the original paper :-) )</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> but in the meantime<br>

> the kernel build script now uses --null with<br>

> -T: <a href="https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964" rel="noreferrer" target="_blank">https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964</a><br>

> <<a href="https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964" rel="noreferrer" target="_blank">https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964</a>><br>

> <br>

> i think implementing that is a sub-line change here...<br>

> <br>

>   for (;TT.T; TT.T = TT.T->next) do_lines(xopenro(TT.T->arg), '\n', do_XT);<br>

<br>

What does this... ah, another null termination variant.<br>

<br>

The man page says "subsequent". Affecting _subsequent_ -T. So technically, you<br>

could have:<br>

<br>

  tar -T file-with-newlines.txt --null file-with-nulls.txt<br>

<br>

Which lib/args.c fundamentally can't cope with because all command line<br>

arguments get processed before the command's main gets called. There's a similar<br>

issue in sed.c:<br>

<br>

  <a href="https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009" rel="noreferrer" target="_blank">https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009</a><br>

<br>

> ...but i'm assuming you'd also want to add --no-null (last one wins),<br>

<br>

Yeah, modulo it still won't help the sequencing issue above. </blockquote><div><br></div><div>in a sense, it makes it worse --- they explicitly document that you can have stuff like `--null -T nulls --no-null -T newlines`. and yet it's only the -T file. not the -X file. (i mean, i get YAGNI, but there's an argument for orthogonality, and personally i'd put "--null should work with all file lists" over "i should be able to alternate between nulls and newlines if i want".)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">But as with<br>

"truncate -s 8G" not working on 32 bits (because # parses into a long), there<br>

are some corner cases in the plumbing that I have TODO items about but the fix<br>

is uglier than the problem until somebody comes to me with a real world bug it<br>

causes...<br></blockquote><div><br></div><div>i'm interpreting this as a "will accept patch for just --null [that affects all -Ts, not just following ones], since that's all anyone wants right now"... sent!</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> and i'm not sure how to write that in the USE_ line?<br>

<br>

I've been thinking about --no-longopt as general plumbing for a while. It's on<br>

the todo list, because it comes up a lot:<br>

<br>

  $ zgrep '[-]no-' /usr/share/man/man?/*.gz | wc -l<br>

  787<br>

<br>

Although it hasn't been worth it yet because so far most of the --no-blah<br>

longopts in toybox (grep '(no-' toys/*/*.c ) haven't had a corresponding non-no<br>

version they toggle. They're mostly "--no-fork", "--no-dereference",<br>

"--no-symlinks" and so on. Even tar has --no-recursion. The exceptions are both<br>

in pwgen.c (--capitalize and --numerals have --no- variants).<br>

<br>

But in the meantime, when there's a short alias then the [-abc] logic will<br>

switch off the previous one. (Which is what pwgen is doing with [-cA][-n0].)<br>

<br>

When there aren't short aliases, here's a TRULY HORRIBLE trick to add them:<br>

<br>

  NEWTOY(blah, "\376(null)\375(no-null)[-\376\375]", BLAH);<br>

<br>

Octal escapes for high ascii. (Yes I added support in mkflags.c a while back. :)<br>

 Rob<br>

</blockquote></div></div>