[Toybox] New diff.c

Wed Aug 31 13:53:51 PDT 2022

On 8/13/22 17:36, Ray Gardner wrote:
> Rob, I don't know if you've done much work on diff.c since your Aug. 8
> post that had attachment.c, but I tried that program and it apparently
> gets into an infinite loop with a certain pair of files. I can send
> them to you or attach them to a post here if you like.

Sorry, I missed this getting stuck in the moderation queue. (Just went through
to whitelist Weizhao and pass through his two emails, and saw I'd missed two
other old ones in the ocean of spam. I think I already dealt with the BSD issue,
and while I think I already replied to this one too, it's a good excuse for an
update.)

I pushed the diff work to a temporary branch at
https://github.com/landley/toybox/commits/branch ala
https://github.com/landley/toybox/blob/branch/toys/pending/diff.c which is not
load bearing yet but getting there, so people can at least see it.

It passes your test case now (diffs the two files without segfaulting, producing
reasonable hunks), although the output still isn't quite right in a couple places:

---- dif1.c	2022-08-26 00:54:19.827964685 -0500
-+++ dif2.c	2022-08-26 00:54:41.231964278 -0500
-@@ -8,30 +8,15 @@
+diff dif1.c dif2.c
+--- dif1.c
++++ dif2.c
+@@ -8,29 +8,14 @@

I need to add the timestamp info to the header output, and only output the
"diff" line when comparing directory contents (and possibly --from-file and
--to-file). Plus THAT ONE HUNK (and none of the others!) is off by one in the
lengths and I should figure out why. (The emitted lines match but the @@ header
line doesn't, which can't be right...)

@@ -17,8 +18,8 @@
  }

 -#if 00
--void putqln(LINE *pln, DFILE *file)
--{
+ void putqln(LINE *pln, DFILE *file)
+ {
...
- void putqln(LINE *pln, DFILE *file)
- {
-       if ( ! pln->lneof ) {
+-void putqln(LINE *pln, DFILE *file)
+-{

That's an interesting corner case: the three lines that end the hunk also occur
verbatim in order earlier in the removed section, so matching them earlier and
emitting them as context lines at the start, then ending the hunk with three
subtraction lines is technically the same transformation (even the same number
of matched lines)... but it violates the expectation that a unified diff hunk
starts and ends with context lines when not at the start or end of the file.
(The find_hunks() did delineate the hunk with appropriate context lines, but
dump_hunk() is printing the hunk via a simple greedy algorithm that takes
earliest matches, and this is an example of the two disagreeing.)

That's where I left off last night. I need to decide if I should just do a
heuristic to peel off up to -U many context lines from the end so the "greedy"
algorithm doesn't get the chance to eat them earlier than it should, or if I
should teach dump_hunk() to symmetrically parse from start and end and stop in
the middle. (It's getting the STARTING context lines right because greedy match
grabbing does the right thing there. If it started at the end it would get the
end right. Maybe it should alternate?)

I already have to split "identify context lines to keep" and the "output result"
into two passes to properly implement -B, because -B (suppress hunks that only
change blank lines) is kind of annoying:

$ diff -Bu <(echo $'one\n\ntwo\n\nthree') <(echo $'one\n\ntwo\nthree')
$ diff -Bu <(echo $'one\ntwo\n\nthree') <(echo $'one\n\ntwo\nthree')

--- /dev/fd/63
+++ /dev/fd/62
@@ -1,4 +1,4 @@
 one
-two

+two
 three

That's ALMOST the same diff, and in both cases the change is addition/removal of
blank lines. In the first one all the changes are identified as being in blank
lines so it suppresses outputting the hunk. But in the second one debian's diff
(not mine, the host system!) decided that "two" is what moved instead of the
blank line before/after it, and thus the hunk gets output even though
identifying the blank line as what moved would produce a smaller patch even
without -B, and would also suppress it with -B.

I haven't seen _any_ of the various diff algorithms pay attention to THAT yet. :)

Anyway, breaking it into two passes means I'm no longer printing the lines as I
go so identifying them out of order isn't a big deal either. But the heuristic
might still be smaller...?

*shrug* It's corner cases all the way down, this stuff. Last year when Elliott
gave me his todo list I went:

>>   diff (--line-format, -I)
>
> If I keep the current algorithm instead of switching to Bram Cohen's patience
> diff, it's only a few days of cleanup. (Implementing patience diff shouldn't
> be that hard either, but I haven't sat down and scoped it because my todo list
> runneth over.)

And I turned out to be profoundly wrong in that time estimate. Way more of a
learning curve than I expected. Sorry it's taken so long...

Rob