[Toybox] tar --null

Thu Jul 14 00:05:01 PDT 2022

On 7/12/22 19:13, enh via Toybox wrote:
> so.. --transform works (though it confused people that it's not in the --help
> output; they didn't know that there's a --xform synonym),

Well, it doesn't FULLY work yet: those trailing type indicators aren't in yet. I
have a todo thingy for it and think I know how I want to handle it now (sed
--tartransform enabling extra s///X trailing selectors, making the result NULL
terminated, disabling NULL output repacement so it doesn't get out of sync, and
stripping a file type indicator character from the start of each input line; yes
this is the correct use for a --longopt, internal nonsense one program passes to
another which no human should ever type...)

But I've been distracted trying to wrap my head around diff, and am currently
writing a toy diff that works like the reverse of my "patch" where it DOESN'T
load the whole file in but just finds hunks in a streaming manner. Probably
terrible results but I'm curious. (Where are good diff TEST inputs, that's what
I wanna know. Currently I'm diffing linux-2.6.12 and linux-2.6.13 to see what
fun corner cases come out of that.

I actually strongly suspect a streaming approach would work reasonably well for
unified diffs because A) patch hunks can't occur out of order (a move is always
an insert and a delete), B) we only need to remember 3 trailing "match" lines to
start a hunk and can END a hunk whenever we've seen 6 trailing "match" lines.

Yes that's 6 not 3: patch can't handle overlapping hunks, so if there are fewer
than 6 match lines we can't break the hunk and guarantee non-overlapping
end/start context. This is why a hunk can have up to head+tail-1 many
consecutive match lines in the middle of it, because:

@@ count @@
 start one
 start two
 start three
+add one
-zap one
 unchanged one
 unchanged two
 unchanged three
 unchanged four
 unchanged five
+add two
-zap two
 end one
 end two
 end three

Would otherwise be:

@@ count @@
 start one
 start two
 start three
+add one
-zap one
 unchanged one
 unchanged two
 unchanged three
@@ count @@
 unchanged three
 unchanged four
 unchanged five
+add two
+zap two
 end one
 end two
 end three

And that writes three to the output and then backs UP to read it in again, which
patch can't do. (I tried it to see if I needed to implement it in mine.)

The trick with a "streaming" diff is of course determining what's a real match
and what's a spurious match, which is what all this "find longest run of match
lines" algorithmic stuff is for. It's not to make the diff CORRECT, it's to make
it small/clean in nasty corner cases, which C is full of largely due to blank
lines and runs of:

          }
        }
      }
    }
  }

But UNIFIED diff needing six consecutive lines of match to end a hunk might make
that less horrible? The naive approach is N^2 on the size of the hunk, but
individual diff hunks are seldom more than a hundred lines.

And the streaming approach does not care at ALL about the size of the _file_.
You can feed it gigabytes of input and for the matching sections it's going
read/compare/discard cycling through three trailing lines of context so it can
start a new hunk when it sees a difference...

I strongly suspect there are pathological inputs that break this, or else why
would all these guys have done the complicated "read the whole file in at once"
approach in the first place? But I can't find what those inputs ARE. My real
problem here is a shortage of good TEST CASES...

> but in the meantime
> the kernel build script now uses --null with
> -T: https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964
> <https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/build.sh;l=964>
> 
> i think implementing that is a sub-line change here...
> 
>   for (;TT.T; TT.T = TT.T->next) do_lines(xopenro(TT.T->arg), '\n', do_XT);

What does this... ah, another null termination variant.

The man page says "subsequent". Affecting _subsequent_ -T. So technically, you
could have:

  tar -T file-with-newlines.txt --null file-with-nulls.txt

Which lib/args.c fundamentally can't cope with because all command line
arguments get processed before the command's main gets called. There's a similar
issue in sed.c:

  https://github.com/landley/toybox/blob/0.8.7/toys/posix/sed.c#L1009

> ...but i'm assuming you'd also want to add --no-null (last one wins),

Yeah, modulo it still won't help the sequencing issue above. But as with
"truncate -s 8G" not working on 32 bits (because # parses into a long), there
are some corner cases in the plumbing that I have TODO items about but the fix
is uglier than the problem until somebody comes to me with a real world bug it
causes...

> and i'm not sure how to write that in the USE_ line?

I've been thinking about --no-longopt as general plumbing for a while. It's on
the todo list, because it comes up a lot:

  $ zgrep '[-]no-' /usr/share/man/man?/*.gz | wc -l
  787

Although it hasn't been worth it yet because so far most of the --no-blah
longopts in toybox (grep '(no-' toys/*/*.c ) haven't had a corresponding non-no
version they toggle. They're mostly "--no-fork", "--no-dereference",
"--no-symlinks" and so on. Even tar has --no-recursion. The exceptions are both
in pwgen.c (--capitalize and --numerals have --no- variants).

But in the meantime, when there's a short alias then the [-abc] logic will
switch off the previous one. (Which is what pwgen is doing with [-cA][-n0].)

When there aren't short aliases, here's a TRULY HORRIBLE trick to add them:

  NEWTOY(blah, "\376(null)\375(no-null)[-\376\375]", BLAH);

Octal escapes for high ascii. (Yes I added support in mkflags.c a while back. :)
 Rob