[Toybox] diff.c

Ray Gardner raygard at gmail.com
Sun Aug 21 09:11:53 PDT 2022


I guess I'll skip writing up an explanation of these algos then; I
thought you were looking for a less "mathematical" explanation ("why
can nobody explain what the old stuff is doing without pretending it's
math instead of an algorithm?" "Doug McIlroy's old diff paper from
1976 is still written in math-ese. It SEEMS to be describing very
simple concepts but it's trying to explain them as if this is
calculus, which it is not." etc.) But that's moot if you're going with
what appears to be the old "look ahead a little for matching lines to
re-sync on" heuristic diff method.

Anyway, I applied your 1-or-2 line patch to the Aug. 8 attachment.c
you posted and it still chokes on the two files I sent you (dif_old2.c
and dif_new.c). So maybe you've made some other significant changes to
that code, if it works for you? It appeared to be stuck in dump_hunk()
but I didn't look any deeper.



On Sat, Aug 20, 2022 at 4:49 AM Rob Landley <rob at landley.net> wrote:
>
> On 8/19/22 21:08, Ray Gardner wrote:
> > Finally got subscribed here, thanks.
>
> Yay!
>
> > As I mentioned somewhere, I've been working on a diff.c that uses a
> > longest common subsequence (LCS) algorithm. I've attached a version
> > here. It handles only -u for now, only two files, and I think it does
> > it correctly. But it's at least a proof of concept.
>
> I've got an algorithm working, the part I've been reimplementing now (since the
> block copied help text has convinced me not to use any of the previous
> submission for potential licensing reasons) is all the directory traversal
> stuff. I've read what posix had to say, and made it through a chunk of the diff
> man page, and I'm adding --from-file support and so on that the previous one
> didn't have. I'm also trying to fill out diff.tests to something meaningful.
>
> My big hangup with this part was trying to figure out whether or not to use
> lib/dirtree.c which was one of those "the two options are close enough there's
> no clear winner" problems. (In academia the fighting is so vicious because the
> stakes are so small.) The problem is dirtree isn't designed to traverse two
> directories in parallel, and _making_ it do so wasn't really an improvement over
> just writing a new one. (The reason cp/mv could get away with it is they don't
> care what's in the target directory unless it collides, so they never do a
> readdir() on dest. But diff has to provide "only in" messages, which means
> actually doing a readdir() on dest, plus the stable output order requiring a
> sort of the directory contents).
>
> Doing new infrastructure that does its own readdir meant also reimplementing
> things like dirtree_path(), which offended my aesthetic sensibilities but I got
> over it eventually. :)
>
> > This uses my modification of Kuo and Cross's modification of Hunt and
> > Szymanski's LCS algorithm. I have read your (Rob's) posts regarding
> > grokking how LCS algos work and the opacity of the technical papers. I
> > am working on a blog post or two where I will try to explain the algos
> > in a somewhat informal but convincing manner (I hope).
>
> I posted my diff.c when I got an algorithm working. The endless loop was because
> the hunk detection logic wasn't handling EOF correctly, a one line fix in
> find_hunks():
>
>      // matching lines. Hunks can't overlap so first line of new hunk can't be
>      // within old hunk so shorter matching run than end+start won't break hunk.
>      while (!h1->done || !h2->done) {
> -      if (!h1->done && h1->len<h2->len) ii = back_search(start, h1, h2);
> +      if (!h1->done && (h2->done || h1->len<h2->len))
> +        ii = back_search(start, h1, h2);
>        else ii = back_search(start, h2, h1);
>        if (!ii) continue;
>
> Ok, 2 lines because it went over 80 chars so I split it. :)
>
> > But pending that, I thought I'd get this out to y'all to take a look
> > at. It's standalone, not integrated into toybox. Just gcc diff.c may
> > work.
>
> So I tried to look at four diff algorithms before the one I posted (all of which
> required loading in the entire file at once instead of a streaming approach that
> only kept the current hunk in memory), and now you've done a fifth along the
> same lines and want me to look at it.
>
> Let me get diff.c finished and promoted (which may include support for things
> like --ifdef) and then maybe I'll revisit this for diff -d support? (I still
> haven't looked at pathological inputs that would drive my diff algorithm nuts; I
> have a linux-fullhist git repository going back to 0.0.1 and I want to have my
> algorithm diff every version in "git log" all the way back to make sure that
> whatever output it produces run through my "patch" recreates the second tree.
> That's my criteria to consider the new implementation load-bearing. Modulo I'm
> not sure how "binary diff" works in here, and yes the linux git repo has some
> binary files in it, )
>
> Rob


More information about the Toybox mailing list