[Toybox] histogram diff

Rob Landley rob at landley.net
Fri Jan 31 12:06:50 PST 2025


On 1/31/25 11:18, Ray Gardner wrote:
> On Fri, Jan 31, 2025 at 7:19 AM Rob Landley <rob at landley.net> wrote:
>>
>> On 1/30/25 17:59, Ray Gardner wrote:
>>>> [patience diff] may have predated the "histogram" algorithm which is
>>>> surprisingly hard to google for...
>>>> ...
>>>> I was really hoping I could implement just ONE algorithm and call it good.
>>>> Right now it looks like "histogram" would be that one (it's an improvement
>> I just started work on my diff branch again after
>> https://github.com/landley/toybox/issues/489#issuecomment-2588079658
>> with the goal of replacing code I don't understand with code I do
>> understand.
>> ...
>> If I throw away what I've done and replace the code I understand with 
>> code I don't understand, it will go behind "man.c" on the todo list.
> 
> As I quoted, you indicated you wanted a histogram diff.

I did. For many years. Heck, view source on 
https://landley.net/notes-2008.html and the commented out todo list at 
the top has an entry about it. (I was following his livejournal when he 
posted it, that's how I found out about it.)

But I also wanted a simple streaming diff that worked like the opposite 
of "patch", doing one pass scanning along two files and zipping them 
together without having to load both simultaneously into memory or 
requiring lseek().

When I sat down to try to implement patience from the old livejournal 
entry... it was very much not that. And what it did do was 
computationally expensive in multiple ways, and needed a fallback 
algorithm anyway.

All these diff algorithms have pathological edge cases, and when 
fiddling with them I get distracted by what the failure modes would be 
and what inputs would trigger them. Max memory use, max cpu time, hunks 
being way bigger than they need to be, hunks being unreadable salad when 
the actual change isn't...

Back when I was shoveling through this the first time I was wondering if 
there's some sort of diff test corpus out there to compare algorithms 
against, but "implementing diff in multiple ways" doesn't seem to be a 
large enough community for such a thing. (I'd really hoped Bram Cohen 
had test inputs showing conventional diff vs his new algorithm, but when 
I went to look I couldn't find them. Maybe there were some in 2007 and 
they fell off the net?)

I was never worried about producing something that works on a simple 
test input, I was worried about figuring out what inputs would break 
what I'd written, in what ways. Where "break" isn't always "wrong 
answer", but "expensive to compute" or "results are unnecessarily hard 
for a human to read".

> I put a pretty
> detailed explanation at the end of my submission (can see it in the
> patch) and more, including pseudocode, in my blog post
> https://raygard.net/2025/01/28/how-histogram-diff-works/ .

Ooh, very nice. I have the tab open and have read through the first 
quarter of it and need more caffeine and a nap before tackling this.

> It's (I hope)
> easier to understand than Hunt-McIlroy or Myers, as it's not based on
> any CS theory but on an empirical method that just works pretty well.
> Well enough that Torvalds prefers it for patch submissions
> (https://lkml.org/lkml/2023/5/7/206).
> 
>> What is "hunky-dory code" in this context?
> 
> Just a reference to your function names hunky() and dory() and their
> friends in your hunk detection (etc.) code.

Ah, ok. Not a bad name for it, I just hadn't connected the dots. (I have 
terrible naming hygiene during development, especially while sleep 
deprived. I went through later to remove everything named after Roger 
Zelazny's "nine princes in amber" series from patch.c and all the 
beatles references in ps.c, but that's a cleanup pass at the end.)

The core of that algorithm is finding the end of the next hunk, either 
the next X matching lines or EOF on either input. The rest of it boils 
down to counting consecutive pairs of matching lines (cheap) or figuring 
out what to emit once you've delineated a hunk (tricksy but not expensive).

The brute force way of finding the next matching line pair is N^2 
computationally expensive in the size of the current hunk: just read a 
line from one of the inputs and scan back over the previously seen lines 
of the other to see if it matches. If so, can I read MORE lines from 
that input source to get X consecutive matches at that point in the 
other input source? (At which point I've ended the hunk, and generally 
one of the inputs has read ahead and has some buffered lines.) Alternate 
which input you're reading a new line from because we dunno which one 
did an insert and which one did a deletion.
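A minimal sketch of that brute force resync search (made-up helper names, not the actual toybox code), assuming the so-far-unmatched lines of each input are buffered in arrays:

```c
#include <string.h>

// Brute force resync: find a position where at least "want" consecutive
// lines of a[] match consecutive lines of b[]. O(N^2) in the size of the
// current hunk. On success, returns 1 and the offsets of the match.
static int find_resync(char **a, int alen, char **b, int blen, int want,
                       int *ai, int *bi)
{
  int i, j, run;

  for (i = 0; i < alen; i++) for (j = 0; j < blen; j++) {
    for (run = 0; i+run < alen && j+run < blen; run++)
      if (strcmp(a[i+run], b[j+run])) break;
    if (run >= want) {
      *ai = i;
      *bi = j;

      return 1;
    }
  }

  return 0;
}
```

The real loop alternates reading a fresh line from each input instead of buffering everything up front, but the comparison structure is the same.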

That said, individual hunks larger than a couple hundred lines are 
uncommon. There IS pathological input that would make this slow and use 
a lot of memory (basically diff -u <(seq 1 1000000) <(seq 1000000 -1 1) 
) but it's still only N^2 on number of input lines slow (not any sort of 
X^N shenanigans, it should finish even given a gigabyte of differing 
input on each end, just not conveniently). And its pathological memory 
use case is the base memory use of the other algorithms. And I COULD 
apply hashing to speed the search up (modulo -i and -b and -I would have 
to apply to the hash too in order to be of any benefit, don't ask me how 
-I regex is supposed to hash, "cut out the match and replace it with an 
empty string" isn't actually what the man page describes -I as 
doing...). But I have yet to find a real world input that would actually 
benefit from it rather than net lose from the overhead, and NOT doing it 
is simpler.

The positive tradeoff is it doesn't care about the size of the _file_ 
because the matching parts are almost free. Comparing two matching lines 
and discarding them when not currently in a differing hunk is O(1) 
("yup, they still match up, keep going"), and everything before the 
current hunk has already been output/discarded so makes no difference to 
processing the current hunk. And that SHOULD be optimizing for the 
common case, I think?
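That cheap in-sync scan might look something like this (hypothetical names again; the real code streams lines rather than indexing arrays):

```c
#include <string.h>

// While not inside a differing hunk, each matched line pair costs O(1):
// compare, emit/discard, advance both inputs. Returns how many leading
// lines still match up.
static int skip_common(char **a, int alen, char **b, int blen)
{
  int n = 0;

  while (n < alen && n < blen && !strcmp(a[n], b[n])) n++;

  return n;
}
```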

Note that the X needed to end the current hunk is actually 2x+1, where x 
is the usual start/end context, because if your hunks have 3 context 
lines but your files have 
"change, 5 same lines, another change" then that's one hunk with 5 
consecutive unchanged in the middle. No really:

$ diff -u <(printf '%s\n' a b c d e f g h i j) <(printf '%s\n' a j c d e 
f g h k j)
--- /dev/fd/63	2025-01-31 13:13:37.357744367 -0600
+++ /dev/fd/62	2025-01-31 13:13:37.357744367 -0600
@@ -1,10 +1,10 @@
  a
-b
+j
  c
  d
  e
  f
  g
  h
-i
+k
  j
landley at driftwood:~/toybox/clean2$ diff -u <(printf '%s\n' a b c d e f g 
h i i j) <(printf '%s\n' a j c d e f g h i k j)
--- /dev/fd/63	2025-01-31 13:13:54.946009279 -0600
+++ /dev/fd/62	2025-01-31 13:13:54.950009339 -0600
@@ -1,5 +1,5 @@
  a
-b
+j
  c
  d
  e
@@ -7,5 +7,5 @@
  g
  h
  i
-i
+k
  j

This is for two reasons: 1) ending a hunk and starting a new hunk is a 
longer output than just letting it run, 2) patch doesn't reprocess its 
output and thus can't apply overlapping hunks. (It also needs hunks to 
occur in order. Strangely that's not just a limitation of MY patch, 
other patch programs consider hunks out of order within the same file 
illegal. Concatenating patches works because they close and re-open the 
file, which starts over at the beginning. That's part of the reason I 
did mine my way, because as long as that limitation's there anyway...)

Yes, the 2x case being treated as contiguous is a weird edge case, I 
would have thought 2x was where it would break (the point where patch 
doesn't have to reprocess output of a previous hunk to apply the next 
patch), but it's 2x+1 (because that's where output size is equivalent 
with the @@ resync line) which means a hunk can't be considered "done" 
until we've seen 2x+1 matching trailing lines (or EOF on at least one 
input).
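The output-size arithmetic behind that, as a sketch (my reading of the above, with x=3 matching the examples): one merged hunk carries the M common middle lines as context, while splitting emits x trailing plus x leading context lines plus an extra @@ header, so splitting stops costing more at M = 2x+1:

```c
// One merged hunk emits all "matching" middle lines as context; splitting
// into two hunks emits 2*context lines plus one extra @@ header line.
// End the hunk once splitting is no longer the longer output.
static int should_split(int matching, int context)
{
  return matching >= 2*context + 1;
}
```

With context 3 that gives the cutover the two diff runs above show: 6 common lines stay one hunk, 7 become two.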

Once you've loaded and terminated a hunk, you know how many leading and 
trailing common lines the resulting hunk should output (leading common 
lines naturally match up, but I mentioned needing to treat TRAILING 
common lines specially because they don't automatically zip back 
together all the time, although I have yet to find my test case of 
where they didn't) and if they don't it sends the wrong signal to patch 
(which COUNTS leading/trailing match lines to detect "must match SOF/EOF").

That was the pending bug I needed to fix back when I mothballed this 
because I got too many "why are you ignoring code that works to go off 
and putter with something irrelevant, shame on you wasting our time" 
pokes and walked away for a while. The challenging part wasn't fixing 
the bug, the problem was delineating the edge cases and coming up with 
all the tests for them. Because you can mix SOF and EOF into there too, 
and it should handle them all right, and the hard part isn't handling 
them right it's making sure you've tested each one so you can PROVE it's 
handling them right. (I suspect I blogged about this at the time...)

And then you have a pile of lines in between that need to be output with 
+, space, and -. That's the part where I thought patience/histogram 
might come in (and potentially the constraint of operating WITHIN a 
known hunk might cancel out the computational expense), but doing 
patience _within_ already delineated hunks turned out not to be what it 
was designed for at all.

Note that one thing this algorithm (fairly fundamentally) cannot do is 
detect REORDERED hunks. (I miss the OS/2 tool that did that in 1996. It 
was really nice to use. Color coded output, drew little diagonal lines 
between the left side and right side versions when it detected 
reorders... Alas, IBM proprietary. Part of my formative "don't get 
addicted to proprietary crap, this too shall pass and then you'll have 
withdrawal symptoms when it's gone extinct", LONG before its logical 
continuation "cloud is bad, do not cloud".)

But diff doesn't have a syntax for specially marking reordered hunks 
(not unified diff, anyway); it treats them as a deletion plus an insertion. 
And patch couldn't handle it either because of the 
not-reprocessing-output limitation. (Git added some file level move 
indications, but even they don't have a syntax to move text within a file.)

So if what you're looking for is "here's the set of transforms to turn A 
into B", this should reliably give you a workable answer. (Modulo 
fixable bugs.) Will it produce the most human-readable possible answer? 
Not sure. When the next hunk starts and ends seems to have fairly 
deterministic right answers, it's the + and - decisions within a hunk 
that vary. (I think.) And "what were you thinking" should be easy for a 
human to work through with this algorithm. If somebody comes to me with 
a good "This was just X, you got confused" test input maybe I can tweak 
the +/- output emitter to understand what the human saw that the 
algorithm didn't, but I need the tests first.

(Don't ask me how diff3 works into any of this, I've never used it but I 
know git does under the covers so will have to care at some point. And I 
have yet to open the can of worms of marking changes WITHIN lines either.)

> I might not have bothered if I'd known you were into diff again, after
> your posts about how you were sick of it.

Sorry. I _was_ sick of it (not the code, the "stop what you're doing and 
listen to me tell what you should be doing instead"), but somebody needs 
a new feature because kernel, and shortest path to that was beating the 
behavior out of the codebase I theoretically understand.

> I thought you were focused on sh.c.

I am (well, ok, the top of stack is currently kexec for jcore and some 
boot rom rewrites and SOC documentation, but top of stack for TOYBOX is 
sh.c), but real world users coming to me with use cases is always a 
priority boost.

"I need X and can't figure out how to make your tool do it" is the 
highest feedback there is. That's the part I _can't_ know already. Even 
when the answer's just documentation, that's still something that was 
missing.

> Anyway, my code and blog is there if you decide you want to try
> it, see how it works and find out what histogram diff really does.
> I don't expect you will.

I'm definitely reading the blog entry. Thanks for writing it.

> BTW you're concerned that the current pending diff.c may have
> plagiarized code from toybox.

busybox.

(I should have called the new one dorodango. I should have called mkroot 
dorodango. If I actually get an automated LFS bootstrapping project 
working on the level of what aboriginal linux used to do I probably WILL 
call it dorodango. Keep polishing the ball of mud until it shines...)

> I looked at them a bit, and I didn't see a
> lot in common, but the read_tok() function you called out in your issue
> comment, and a related nameless enum of flags or states, bear a strong
> resemblance to read_token() and an enum in busybox diff.c.

Eh, it's not a strong objection, I just got a bit gun-shy after catching 
a source very obviously (obvious to me, the ex-busybox maintainer) doing 
that a few years back.

This was not a straight copy, my cleanups tend to be significant 
rewrites anyway (I'm not entirely sure of the provenance ala ifconfig in 
cleanup.html), and I would LOVE to give Bradley a PR black eye if he 
tried to use busybox to sue toybox. (The new SCO.) But I'm not gonna 
invite it either. Basic hygiene.

Mostly, the problem was cleaning up the old code was several times more 
work (for me) than just writing a new one, and when I broke the old code 
I didn't understand WHY it was broken, so everything was endless 
debugging of code I was trying to reverse engineer well enough to replace.

> Ray

Rob

P.S. Sorry if I'm reading "stop what you're doing, do this instead, I 
command it" into things that DO NOT MEAN THAT. My motivation has 
recovered a LOT recently but is still sadly fragile, and that's a me 
problem. I spent far too much of last year curled up in a ball waiting 
for the meteor to hit, but now that the worst has basically happened and 
we're back in The Time Of The Circular Firing Squad, I've gotten on with 
it again. (The "loss aversion" part is over: we're fscked. Moving on...)

(Selling a house, moving cross country, my wife graduating and getting a 
full time job so I'm househusbanding now, the industry I'm usually 
employed in having the same kind of layoffs the dot-com crash and 
mortgage crisis caused while smiling and claiming everything's fine (I'm 
ok for now but not all my friends are), getting covid AGAIN, and seeing 
america's version of brexit coming 6 months ahead of time and REALLY 
HOPING I WAS WRONG... all that stacked poorly. Not your fault, not 
Oliver's fault, I feel terrible about being unable to properly harness 
your enthusiasm.)

