[Toybox] diff algo blog post 2022-08-27

Ray Gardner raygard at gmail.com
Mon Sep 19 16:45:07 PDT 2022


Rob, first, let me say thank you for at least looking at my post. Sorry I
did not see your August 27 blog post sooner; I fell behind, and I assumed
any response to my writeup would appear on the list.

I am wondering if you read my post all the way through, and whether it
gave you an understanding of the actual LCS algorithm. Recall that I wrote
it for you specifically, though not only for you, because you (and
Elliott) encouraged me to.

> Ray Gardner wrote up his own explanation of the classic diff algorithm,
> and I'm trying to read it to see if it makes more sense to me than the
> other writeups did.

What other writeups did you look at that try to explain any LCS
algorithm? The only one I have found is Thomas Guest's
https://wordaligned.org/articles/longest-common-subsequence
which (after much preliminary material) explains Hirschberg's linear-space
LCS. That's good on space, but it is always N^2 in time, so not practical.
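
For reference, here is a minimal sketch (mine, not from Guest's article)
of the classic N^2 dynamic-programming LCS, computing only the length,
one row at a time. Getting just the length in linear space is easy, as
below; Hirschberg's trick is recovering the subsequence itself in linear
space, still at N^2 time.

    # Classic N^2 dynamic-programming LCS length, keeping one row at a time.
    def lcs_length(a, b):
        prev = [0] * (len(b) + 1)
        for x in a:
            cur = [0]
            for j, y in enumerate(b):
                cur.append(prev[j] + 1 if x == y else max(cur[j], prev[j + 1]))
            prev = cur
        return prev[-1]

    # lcs_length("ABCBDAB", "BDCABA") == 4 (e.g. "BCAB")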

> The algorithm is trying to collect/mark all the lines that will be kept
> (leading space instead of leading - or +). I got that part. We want to
> keep the most possible unchanged lines, but there may be more than one
> way to do that and some are less legible than others.
>
> Saying that a subsequence is zero or more symbols and that it's
> different than a common subsequence didn't really help my understanding,
> because it's spending time defining terminology in a way different than
> what someone who has used the output of diff and the input of patch for
> many years already calls these things, instead of bringing the mountain
> to McDonald's as it were.

It's just reiterating the way [[longest] common] subsequence is often
defined. Anyway, the subsequence isn't "zero or more symbols"; it's
what's left after _removing_ zero or more symbols from the original
sequence. I thought defining a few terms would be a good idea.
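
To make that concrete (my example, not from the post): removing B and D
from ABCDE leaves ACE, so ACE is a subsequence of ABCDE; AEC is not,
because the surviving symbols must keep their original order.

    # A subsequence is what remains after removing zero or more symbols;
    # the survivors keep their order but need not be contiguous.
    def is_subsequence(sub, seq):
        it = iter(seq)
        return all(c in it for c in sub)  # each match consumes the iterator

    assert is_subsequence("ACE", "ABCDE")      # remove B and D
    assert not is_subsequence("AEC", "ABCDE")  # order matters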

> > The paper by Hunt and McIlroy describes HM better than I can
>
> Then why...?

Why...? Because the Hunt/McIlroy paper isn't as easy to understand as
the Hunt/Szymanski paper, and I am trying to explain the basic idea in a
less formal way than that paper does, but well enough for the reader to
(I hope) understand it and be convinced it works. And the subsequent
modification by Kuo/Cross, and my own mod of that, generate exactly the
same results as H/M (as I said in the post), so I suspect Kuo/Cross is
essentially the same as H/M, which is what you were trying to follow.

> Sigh. He KNOWS I already found and read that paper, and I said it did
> not explain to me "why are they doing it that way?" (I can see WHAT
> they're doing. That was never the problem. Some PDP-11 programmers back
> in the 1970's decided loading the entire file into memory at once and
> hashing every line was the best approach: Why did they do that?

But they didn't load the entire file; they read and hashed each line
without keeping the whole file in memory. The H/M paper said this was to
allow diffing files larger than memory. After finding the LCS of the
hashes, they read the files back while generating the output, tossing out
false matches (the "jackpots") as they go.

> What are
> the actual advantages of this approach, what are the downsides, what are
> the corner cases and pathological inputs, what are the design tradeoffs
> and alternatives? The McIlroy paper SAID that doing it right is N^2 and
> the approach they've taken is cheaper but imperfect, but didn't really
> go into HOW. Sorting has quick sort and heap sort and shell sort and so
> on, with explanations of the tradeoffs and videos showing them in
> action, some by dance troupes. I am not finding anything like that
> here.)

Actually, any LCS algorithm (with one exception) has a worst-case time of
at least N^2. The H/M, H/S, and K/C methods all perform better when not
too many lines are repeated in both files. Their time is dominated by R^2,
where R is the number of cases where a line in one file matches a line in
the other; the repeated searching of the "match lists" causes this.
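
Concretely (my illustration, not code from any of the papers): the match
lists record, for each line of A, every position in B holding an equal
line, and R is the total size of those lists.

    from collections import defaultdict

    # For each line of A, the positions in B with an equal line (or hash).
    def match_lists(a, b):
        where = defaultdict(list)
        for j, line in enumerate(b):
            where[line].append(j)
        return [where[line] for line in a]

    # R = sum(len(m) for m in match_lists(a, b)). Heavily repeated lines
    # (blank lines, lone braces) inflate R, and the running time with it.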

> Asking me to read a paper which literally says I should just read the
> other paper again, presumably over and over until I "get it"... Great.

I wasn't asking you to re-read H/M; I was hoping you'd read my post and
understand the H/S algorithm and its K/C improvement. I had the mistaken
impression that you wanted to understand how a practical LCS algorithm
works, without the math-ish formalities, which I tried to avoid. No
lemmas or theorems in my writeup!

> Let's look at his python code and see if it's illuminating:
>
> # Here if returning actual LCS elements:
> # Omit step 9. It weeds out jackpots, where the LCS contains matches
> # between values that hashed the same but are not actually equal
> # due to hash collisions. Instead just return LCS elements from B.
>
> If a "jackpot" is a spurious hash collision, why not say spurious hash
> collision instead of inventing more terminology?

I'd guess McIlroy invented that term to have a shorthand way of
referring to it. It's not all hash collisions, just those between lines
the algorithm wrongly identified as matching. Early versions of diff
reported jackpots so McIlroy could know how badly they affected the
output.

> Is omitting step 9 what
> weeds out "jackpots", or would step 9 weed out jackpots and we're not
> bothering to do that? (If so wouldn't NOT checking for hash collisions
> have consequences? If our hunk says lines are equal that aren't actually
> equal, that seems problematic...)

My Python implementation omits weeding out jackpots because it does not
assume the input sequences are hashes that need it. With the amount of
memory available these days, I just run the LCS on the entire original
lines. This is intended to be a faithful implementation of the actual LCS
part of H/M, not of the entire original 1970s Unix diff program. But I
matched the code step for step to the description in the H/M paper's
appendix, so you'd have a working "executable pseudo-code" of H/M.

> Sigh, I can sit and chew through this too, but it doesn't seem much more
> productive than the first five such writeups? There have been at least
> two variants since "patience diff" all built on top of the same base,
> and git had performance issues due to changing hash algorithms, and
> everybody keeps tweaking diff but nobody has bothered to explain WHY the
> unquestioned base they're all building on is like that.

What two variants, Rob? I've read about histogram diff, but what else? I
have not found an explanation of histogram with enough detail to
implement it, but I'd like to try sometime.

BTW, I'd like to know where to find the other five writeups that explain
any LCS algorithm in a straightforward way. (I can assure you that some
algorithms are way, way more abstruse than H/M; see this dilly:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.306&rep=rep1&type=pdf
Note that they call it "practical", then look at the pseudocode on pages
5, 6, and 14. I don't pretend to understand a word of it.)

> The real problem is I've reached the point where I'm tired of this
> problem space and don't really WANT to implement a "diff" at this point.
> It is not fun. Especially since a while back I threw up my hands and
> went "I'll write a new one from scratch that I _do_ undersand!" and did
> most of that work and people keep going "how stupid are you not to
> understand the old one, here, it's easy, stop what you're doing and look
> at what I wrote.... which literally says "read the math paper from the
> 1970's again".

I never said you should re-read any 1970s paper. I am trying to explain
one (not H/M, but H/S -- very closely related!) in a less formal way. I
never said "how stupid" ... did someone else? Sorry if anything I said
came across that way. This started (for me) with your saying "Alas, I've
read that paper through twice and have not managed to turn their
explanation into pseudo-code, and that's WITH an implementation in front
of me. (No guarantee that implementation is what the paper describes, of
course...)" So I tried to explain the underlying algorithm, and also to
give you pseudo-pseudocode (actual Python) for H/M.

> Long ago I wrote a "patch" implementation that did not work like other
> implementations, and it turned out to be ok. I wanted to explore doing
> that in reverse for diff, but I didn't get to because I got an external
> submission dictating how it should go (which wasn't even patience diff,
> which I'd asked about at the time), and my questions about it are met
> with incredulity. No "why" question has received a "why" answer so far,
> and I am tired.

Your main "why" seems to be why base diff on LCS instead of a more
brute-force "look ahead for lines that match to resync." (Correct me if
I misunderstand.) You read what McIlroy wrote about that: "The simplest
idea—go through both files line by line until they disagree, then search
forward somehow in both until a matching pair of lines is encountered,
and continue similarly—reduces the problem to implementing the
‘somehow’, which doesn’t help much. [...] An extremely simple heuristic
for the ‘somehow’, which works well when there are relatively few
differences between files and relatively few duplications of lines within
one file, has been used by Johnson and others[1, 11]: Upon encountering a
difference, compare the kth line ahead in each file with the k lines
following the mismatch in the other for k = 1,2,... until a match is
found. On more difficult problems, the method can missynchronize badly.
To keep a lid on time and space, k is customarily limited, with the
result that longer changed passages defeat resynchronization."
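
In code, my reading of that heuristic looks something like the sketch
below (the idea only, not Johnson's actual method; the cap plays the role
of the customary limit on k that McIlroy mentions).

    # After a mismatch at a[i] / b[j], compare the kth line ahead in each
    # file with the k lines following the mismatch in the other, for
    # k = 1, 2, ... until a match is found or the cap runs out.
    def resync(a, i, b, j, cap=100):
        for k in range(1, cap + 1):
            if i + k < len(a) and a[i + k] in b[j:j + k]:
                return i + k, j + b[j:j + k].index(a[i + k])
            if j + k < len(b) and b[j + k] in a[i:i + k]:
                return i + a[i:i + k].index(b[j + k]), j + k
        return None  # longer changed passages defeat resynchronization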

The test case files I sent you (diff_old2.c and diff_new.c) are from a
diff program I wrote that uses the brute-force technique, and it does
work well in most cases. Someone once gave me a case where it went badly
off track; I don't have those files anymore, sorry. I will be interested
to see whether you find any such cases when you do more extensive testing
on your new diff. I could probably create a case that gives your diff
fits, but it would be artificial. I know there are cases that will give
any known algorithm trouble.

