[Toybox] diff.c

Ray Gardner raygard at gmail.com
Fri Aug 26 15:03:30 PDT 2022


On Wed, Aug 24, 2022 at 2:26 AM Rob Landley <rob at landley.net> wrote:
>
> On 8/22/22 10:39, enh via Toybox wrote:
> > On Sun, Aug 21, 2022 at 9:12 AM Ray Gardner <raygard at gmail.com
> > <mailto:raygard at gmail.com>> wrote:
> >
> >     I guess I'll skip writing up an explanation of these algos then; I
> >     thought you were looking for a less "mathematical" explanation ("why
> >     can nobody explain what the old stuff is doing without pretending it's
> >     math instead of an algorithm?" "Doug McIlroy's old diff paper from
> >     1976 is still written in math-ese. It SEEMS to be describing very
> >     simple concepts but it's trying to explain them as if this is
> >     calculus, which it is not." etc.) But that's moot if you're going with
> >     what appears to be the old "look ahead a little for matching lines to
> >     re-sync on" heuristic diff method.
> >
> > (fwiw, i think such a document would be a useful thing to leave lying around on
> > the internet ... someone will want it sooner or later :-) )
>
> I'm also interested. I've just been a bit distracted because my air conditioner
> broke over the weekend (soonest somebody can come fix it is thursday. It's
> August in Texas...) and because my diff code isn't quite ready to go back and
> run this new test through yet, and I wanted to include that in my reply.

Thanks for the encouragement, guys. Check it out. I've posted it at
https://www.raygard.net/2022/08/26/diff-LCS-Hunt-Szymanski-Kuo-Cross/
along with the code. C code is
https://github.com/raygard/lcs_diff_demo/blob/main/src/c/diff.c ;
there is also Python code in that repo, including a version of
Hunt-McIlroy that may satisfy Rob's complaint "Alas, I've read that
paper through twice and have not managed to turn their explanation
into pseudo-code, and that's WITH an implementation in front of me",
if you find that Python is acceptable as executable pseudocode, as
I've heard it said. The code is written very closely to the paper's
appendix. But, if you read the post linked above, you'll see that it's
probably better to use Kuo/Cross (or my mod of that) instead.

I did find a bug in the diff.c I posted earlier and I think this
version gets everything right. I expanded it a bit to include
Hunt/Szymanski but that's just for demo, it's not as good as the
Kuo/Cross implementations also included. (Compile it with HS, KC, or
KCMOD defined to get the different algos.)

I do this for my own enjoyment but also hoping someone will benefit
from it. I wrote the earlier C code I posted to the list in the hope
that it would help Rob. He's decided to go his own way, so that was
probably many wasted hours. But if you want to get, e.g. a unified
diff with -U 0 (no context lines), I have it. I think that may be
harder with Rob's approach. I will also bet that my diff.c will be
faster on large-ish files and produce better diffs but that remains to
be tested. And I've also back-tested it by patch-ing the files with
the diff output, including context lines of 0, 1, 2, 3, 4.

So, nobody reads my blog ("glob") at least not yet. I spent many many
hours developing the world's fastest qsort() implementations, but I
have no way to get anyone who can get them used to look at it.

Now I've written what I hope is a good explanation of a couple of LCS
algos. I sure hope you'll read it and give me feedback. If you don't,
I may never speak to you again :-).

Ray

Ray


More information about the Toybox mailing list