[Toybox] diff.c

Sat Aug 20 03:57:18 PDT 2022

On 8/19/22 21:08, Ray Gardner wrote:
> Finally got subscribed here, thanks.

Yay!

> As I mentioned somewhere, I've been working on a diff.c that uses a
> longest common subsequence (LCS) algorithm. I've attached a version
> here. It handles only -u for now, only two files, and I think it does
> it correctly. But it's at least a proof of concept.

I've got an algorithm working, the part I've been reimplementing now (since the
block copied help text has convinced me not to use any of the previous
submission for potential licensing reasons) is all the directory traversal
stuff. I've read what posix had to say, and made it through a chunk of the diff
man page, and I'm adding --from-file support and so on that the previous one
didn't have. I'm also trying to fill out diff.tests to something meaningful.

My big hangup with this part was trying to figure out whether or not to use
lib/dirtree.c which was one of those "the two options are close enough there's
no clear winner" problems. (In academia the fighting is so vicious because the
stakes are so small.) The problem is dirtree isn't designed to traverse two
directories in parallel, and _making_ it do so wasn't really an improvement over
just writing a new one. (The reason cp/mv could get away with it is they don't
care what's in the target directory unless it collides, so they never do a
readdir() on dest. But diff has to provide "only in" messages, which means
actually doing a readdir() on dest, plus the stable output order requiring a
sort of the directory contents).

Doing new infrastructure that does its own readdir meant also reimplementing
things like dirtree_path(), which offended my aesthetic sensibilities but I got
over it eventually. :)

> This uses my modification of Kuo and Cross's modification of Hunt and
> Szymanski's LCS algorithm. I have read your (Rob's) posts regarding
> grokking how LCS algos work and the opacity of the technical papers. I
> am working on a blog post or two where I will try to explain the algos
> in a somewhat informal but convincing manner (I hope).

I posted my diff.c when I got an algorithm working. The endless loop was because
the hunk detection logic wasn't handling EOF correctly, a one line fix in
find_hunks():

     // matching lines. Hunks can't overlap so first line of new hunk can't be
     // within old hunk so shorter matching run than end+start won't break hunk.
     while (!h1->done || !h2->done) {
-      if (!h1->done && h1->len<h2->len) ii = back_search(start, h1, h2);
+      if (!h1->done && (h2->done || h1->len<h2->len))
+        ii = back_search(start, h1, h2);
       else ii = back_search(start, h2, h1);
       if (!ii) continue;

Ok, 2 lines because it went over 80 chars so I split it. :)

> But pending that, I thought I'd get this out to y'all to take a look
> at. It's standalone, not integrated into toybox. Just gcc diff.c may
> work.

So I tried to look at four diff algorithms before the one I posted (all of which
required loading in the entire file at once instead of a streaming approach that
only kept the current hunk in memory), and now you've done a fifth along the
same lines and want me to look at it.

Let me get diff.c finished and promoted (which may include support for things
like --ifdef) and then maybe I'll revisit this for diff -d support? (I still
haven't looked at pathological inputs that would drive my diff algorithm nuts; I
have a linux-fullhist git repository going back to 0.0.1 and I want to have my
algorithm diff every version in "git log" all the way back to make sure that
whatever output it produces run through my "patch" recreates the second tree.
That's my criteria to consider the new implementation load-bearing. Modulo I'm
not sure how "binary diff" works in here, and yes the linux git repo has some
binary files in it, )

Rob