[Toybox] diff.c

Mon Aug 22 08:39:39 PDT 2022

On Sun, Aug 21, 2022 at 9:12 AM Ray Gardner <raygard at gmail.com> wrote:

> I guess I'll skip writing up an explanation of these algos then; I
> thought you were looking for a less "mathematical" explanation ("why
> can nobody explain what the old stuff is doing without pretending it's
> math instead of an algorithm?" "Doug McIlroy's old diff paper from
> 1976 is still written in math-ese. It SEEMS to be describing very
> simple concepts but it's trying to explain them as if this is
> calculus, which it is not." etc.) But that's moot if you're going with
> what appears to be the old "look ahead a little for matching lines to
> re-sync on" heuristic diff method.
>

(fwiw, i think such a document would be a useful thing to leave lying
around on the internet ... someone will want it sooner or later :-) )

> Anyway, I applied your 1-or-2 line patch to the Aug. 8 attachment.c
> you posted and it still chokes on the two files I sent you (dif_old2.c
> and dif_new.c). So maybe you've made some other significant changes to
> that code, if it works for you? It appeared to be stuck in dump_hunk()
> but I didn't look any deeper.
>
>
>
> On Sat, Aug 20, 2022 at 4:49 AM Rob Landley <rob at landley.net> wrote:
> >
> > On 8/19/22 21:08, Ray Gardner wrote:
> > > Finally got subscribed here, thanks.
> >
> > Yay!
> >
> > > As I mentioned somewhere, I've been working on a diff.c that uses a
> > > longest common subsequence (LCS) algorithm. I've attached a version
> > > here. It handles only -u for now, only two files, and I think it does
> > > it correctly. But it's at least a proof of concept.
> >
> > I've got an algorithm working, the part I've been reimplementing now
> (since the
> > block copied help text has convinced me not to use any of the previous
> > submission for potential licensing reasons) is all the directory
> traversal
> > stuff. I've read what posix had to say, and made it through a chunk of
> the diff
> > man page, and I'm adding --from-file support and so on that the previous
> one
> > didn't have. I'm also trying to fill out diff.tests to something
> meaningful.
> >
> > My big hangup with this part was trying to figure out whether or not to
> use
> > lib/dirtree.c which was one of those "the two options are close enough
> there's
> > no clear winner" problems. (In academia the fighting is so vicious
> because the
> > stakes are so small.) The problem is dirtree isn't designed to traverse
> two
> > directories in parallel, and _making_ it do so wasn't really an
> improvement over
> > just writing a new one. (The reason cp/mv could get away with it is they
> don't
> > care what's in the target directory unless it collides, so they never do
> a
> > readdir() on dest. But diff has to provide "only in" messages, which
> means
> > actually doing a readdir() on dest, plus the stable output order
> requiring a
> > sort of the directory contents).
> >
> > Doing new infrastructure that does its own readdir meant also
> reimplementing
> > things like dirtree_path(), which offended my aesthetic sensibilities
> but I got
> > over it eventually. :)
> >
> > > This uses my modification of Kuo and Cross's modification of Hunt and
> > > Szymanski's LCS algorithm. I have read your (Rob's) posts regarding
> > > grokking how LCS algos work and the opacity of the technical papers. I
> > > am working on a blog post or two where I will try to explain the algos
> > > in a somewhat informal but convincing manner (I hope).
> >
> > I posted my diff.c when I got an algorithm working. The endless loop was
> because
> > the hunk detection logic wasn't handling EOF correctly, a one line fix in
> > find_hunks():
> >
> >      // matching lines. Hunks can't overlap so first line of new hunk
> can't be
> >      // within old hunk so shorter matching run than end+start won't
> break hunk.
> >      while (!h1->done || !h2->done) {
> > -      if (!h1->done && h1->len<h2->len) ii = back_search(start, h1, h2);
> > +      if (!h1->done && (h2->done || h1->len<h2->len))
> > +        ii = back_search(start, h1, h2);
> >        else ii = back_search(start, h2, h1);
> >        if (!ii) continue;
> >
> > Ok, 2 lines because it went over 80 chars so I split it. :)
> >
> > > But pending that, I thought I'd get this out to y'all to take a look
> > > at. It's standalone, not integrated into toybox. Just gcc diff.c may
> > > work.
> >
> > So I tried to look at four diff algorithms before the one I posted (all
> of which
> > required loading in the entire file at once instead of a streaming
> approach that
> > only kept the current hunk in memory), and now you've done a fifth along
> the
> > same lines and want me to look at it.
> >
> > Let me get diff.c finished and promoted (which may include support for
> things
> > like --ifdef) and then maybe I'll revisit this for diff -d support? (I
> still
> > haven't looked at pathological inputs that would drive my diff algorithm
> nuts; I
> > have a linux-fullhist git repository going back to 0.0.1 and I want to
> have my
> > algorithm diff every version in "git log" all the way back to make sure
> that
> > whatever output it produces run through my "patch" recreates the second
> tree.
> > That's my criteria to consider the new implementation load-bearing.
> Modulo I'm
> > not sure how "binary diff" works in here, and yes the linux git repo has
> some
> > binary files in it, )
> >
> > Rob
> _______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20220822/4ad58e31/attachment.htm>