[Toybox] diff algo blog post 2022-08-27

Rob Landley rob at landley.net
Mon Sep 19 22:59:26 PDT 2022


On 9/19/22 18:45, Ray Gardner wrote:
> Rob, first, let me say thank you for at least looking at my post. Sorry I
> did not see your August 27 blog post sooner. I fell behind, I thought
> any response to my writeup would be in the list.

I try to write blog entries regularly, but they're quick, dashed-off things
without all the right HTML tags, full of [LINK find url to thingy] placeholders,
and with sentences trailing off unfinished as I get distracted.

I go through and edit and upload them in batches later. Making them seem
coherent takes far longer than initially writing them did. :)

> I am wondering if you read my post all the way through, and did it give
> you an understanding of the actual LCS algo. Recall that I wrote it for you,
> specifically though not only, because you (and Elliott) encouraged me to.

I still have the tab open, but I haven't had the stomach to circle back around
to diff.c recently. (Sigh, I should get it over with.)

>> Ray Gardner wrote up his own explanation of the classic diff algorithm,
>> and I'm trying to read it to see if it makes more sense to me than the
>> other writeups did.
> 
> What other writeups did you look at, that try to explain any LCS
> algorithms?

Other than the original toys/pending/diff.c implementation? And the paper I
emailed Doug McIlroy about, since the Bell Labs website everybody was linking to
had gone down and wasn't in archive.org? (Doug beat Elliott to the updated URL
by about 20 minutes.) And the wikipedia explanation of the algorithm
(https://en.wikipedia.org/wiki/Diff#Algorithm)?

I've closed a lot of the tabs, but here are a few that haven't cycled out of
about:history yet:

https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/
https://blog.jcoglan.com/2017/09/19/the-patience-diff-algorithm/
https://www.quora.com/How-does-Bram-Cohens-patience-algorithm-compare-relative-to-myers-minimal-and-histogram-options-for-gits-diff-function
https://bramcohen.livejournal.com/73318.html
https://tiarkrompf.github.io/notes/?/diff-algorithm/aside2
https://tiarkrompf.github.io/notes/?/diff-algorithm/

There were more...

> The only one I have found is Thomas Guest's
> https://wordaligned.org/articles/longest-common-subsequence
> which (after much prelim) explains Hirschberg's linear space LCS, which
> is good on space but always N^2 time so not practical.

Nope, hadn't seen that one.

I eventually broke down and looked at the 15 year old busybox implementation I
used to maintain and the explanation at
https://git.busybox.net/busybox/tree/coreutils/diff.c?h=f86a5ba510ef#n839 made a
reasonable amount of sense. (Knew I'd seen it explained SOMEWHERE.) But that
still reads both entire input files in one go and hashes everything, so I'd have
to care about choice of hashing algorithm and potential collisions.

I didn't check if the ancient busybox implementation seeks around and re-reads
lines multiple times, but back when I was wrestling with the submitted
toys/pending/diff.c I was trying to figure out if I needed to use long for
offset and length or could get away with int for length (so those two were just
12 bytes/line on 64 bits), because a quick check of the likely average length of
lines in the linux kernel source puts it at around 25 bytes:

$ X=0; for i in */*.c; do while read j; do ((X+=${#j})); done < $i; done; echo $X
13993089
$ wc */*.c | tail -n 1
  571208  1877113 15297205 total
$ python -c 'print 13993089.0/571208'
24.4973617316

Meaning it was actually pretty easy for the metadata we were tracking _about_
the lines to get bigger than just keeping the lines.
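The size arithmetic is easy to sanity check with python's struct module (the
field layout here is hypothetical, just illustrating the long-offset-plus-int-length
option from above against the measured average line length):

```python
import struct

# "=" means standard sizes with no padding, i.e. the 64-bit case discussed
# above: q is an 8 byte long, i is a 4 byte int.
offset_long_len_int = struct.calcsize("=qi")   # long offset + int length
offset_long_len_long = struct.calcsize("=qq")  # long for both fields

# Average kernel .c line length from the measurement above.
avg_line = 13993089 / 571208

print(offset_long_len_int, offset_long_len_long, round(avg_line, 1))
# → 12 16 24.5
```

So even the smaller layout is already half the size of an average line, and
going to long for both fields gets you to two thirds.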

And THAT still didn't answer "why do you need the whole file" when diff hunks
can't occur out of order. Once you output a hunk you can't go back, and you can
tell when a hunk ends by an appropriate number of matching lines.
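A quick (hypothetical) python sketch of that rule: given a per-line
matched/differs sequence (which a real diff has to discover, this just assumes
it), a hunk is closed, and everything before it can be forgotten, as soon as
more than 2*context consecutive matching lines go by:

```python
def hunk_spans(flags, context=3):
    """Return (start, end) line spans of unified diff hunks, where flags[i]
    is True if line i matches and False if it differs. Once more than
    2*context matching lines follow the last difference, the open hunk is
    flushed and nothing before that point ever needs revisiting."""
    spans = []
    start = None       # first differing line of the currently open hunk
    last_diff = None   # most recent differing line
    for i, same in enumerate(flags):
        if not same:
            if start is None:
                start = i
            last_diff = i
        elif start is not None and i - last_diff > 2 * context:
            # Hunk closed: emit it with context lines on each side.
            spans.append((max(0, start - context),
                          min(len(flags), last_diff + 1 + context)))
            start = last_diff = None
    if start is not None:  # trailing hunk at end of input
        spans.append((max(0, start - context),
                      min(len(flags), last_diff + 1 + context)))
    return spans

# One changed line surrounded by matches: a single small hunk.
print(hunk_spans([True] * 5 + [False] + [True] * 20))
# → [(2, 9)]
```

Note that two changes separated by 2*context or fewer matching lines merge
into one hunk, which is exactly the unified diff behavior.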

The implementation I did at
https://github.com/landley/toybox/blob/branch/toys/pending/diff.c was tailored
specifically to unified diffs and worked by reading a line at a time from each
input, discarding what it could as soon as it could (so immediately matching
lines weren't retained except enough history to start the current hunk). It
identified and retained one diff hunk at a time, so scaled with size of hunk,
not size of file. (Sure that wasn't efficient if the entire file is one big
hunk, but at that point what is?) It didn't have to make special provision for
nonseekable input a la <(command) because it read a line at a time and retained
in memory whatever it hadn't yet matched off and disposed of. There was no hash
algorithm, just a comparison function which handled all the case insensitivity
and whitespace skipping consistently in one place.
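For what it's worth, the "one comparison function, no hashes" idea looks
something like this (a python sketch, not the actual C code, and the names are
made up; the point is that the -i/-w style options live in one comparison
instead of in a hash, so there are no collisions to reason about):

```python
def lines_equal(a, b, ignore_case=False, ignore_all_space=False):
    """Compare two lines the way the diff options ask for, in one place.
    ignore_all_space behaves like diff -w, ignore_case like diff -i."""
    if ignore_all_space:
        # Collapse away all whitespace before comparing.
        a = "".join(a.split())
        b = "".join(b.split())
    if ignore_case:
        a = a.lower()
        b = b.lower()
    return a == b

print(lines_equal("Hello\t World ", "helloworld",
                  ignore_case=True, ignore_all_space=True))
# → True
```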

That approach made sense to me, but I eventually admitted to myself that people
will never stop objecting to me using anything other than the one true diff
algorithm, so I stopped working on it.

Rob

