[Toybox] histogram diff
Rob Landley
rob at landley.net
Fri Jan 31 12:06:50 PST 2025
On 1/31/25 11:18, Ray Gardner wrote:
> On Fri, Jan 31, 2025 at 7:19 AM Rob Landley <rob at landley.net> wrote:
>>
>> On 1/30/25 17:59, Ray Gardner wrote:
>>>> [patience diff] may have predated the "histogram" algorithm which is
>>>> surprisingly hard to google for...
>>>> ...
>>>> I was really hoping I could implement just ONE algorithm and call it good.
>>>> Right now it looks like "histogram" would be that one (it's an improvement
>> I just started work on my diff branch again after
>> https://github.com/landley/toybox/issues/489#issuecomment-2588079658
>> with the goal of replacing code I don't understand with code I do
>> understand.
>> ...
>> If I throw what I've done and replace the code I understand with code I
>> don't understand, it will go behind "man.c" on the todo list.
>
> As I quoted, you indicated you wanted a histogram diff.
I did. For many years. Heck, view source on
https://landley.net/notes-2008.html and the commented out todo list at
the top has an entry about it. (I was following his livejournal when he
posted it, that's how I found out about it.)
But I also wanted a simple streaming diff that worked like the opposite
of "patch", doing one pass scanning along two files and zipping them
together without having to load both simultaneously into memory or
requiring lseek().
When I sat down to try to implement patience from the old livejournal
entry... it was very much not that. And what it did do was
computationally expensive in multiple ways, and needed a fallback
algorithm anyway.
All these diff algorithms have pathological edge cases, and when
fiddling with them I get distracted by what the failure modes would be
and what inputs would trigger them. Max memory use, max cpu time, hunks
being way bigger than they need to be, hunks being unreadable salad when
the actual change isn't...
Back when I was shoveling through this the first time I was wondering if
there's some sort of diff test corpus out there to compare algorithms
against, but "implementing diff in multiple ways" doesn't seem to be a
large enough community for such a thing. (I'd really hoped Bram Cohen
had test inputs showing conventional diff vs his new algorithm, but when
I went to look I couldn't find them. Maybe there were some in 2007 and
they fell off the net?)
I was never worried about producing something that works on a simple
test input, I was worried about figuring out what inputs would break
what I'd written, in what ways. Where "break" isn't always "wrong
answer", but "expensive to compute" or "results are unnecessarily hard
for a human to read".
> I put a pretty
> detailed explanation at the end of my submission (can see it in the
> patch) and more, including pseudocode, in my blog post
> https://raygard.net/2025/01/28/how-histogram-diff-works/ .
Ooh, very's nice. I have the tab open and have read through the first
quarter of it and need more caffeine and a nap before tackling this.
> It's (I hope)
> easier to understand than Hunt-McIlroy or Myers, as it's not based on
> any CS theory but on an empirical method that just works pretty well.
> Well enough that Torvalds prefers it for patch submissions
> (https://lkml.org/lkml/2023/5/7/206).
>
>> What is "hunky-dory code" in this context?
>
> Just a reference to your function names hunky() and dory() and their
> friends in your hunk detection (etc.) code.
Ah, ok. Not a bad name for it, I just hadn't connected the dots. (I have
terrible naming hygiene during development, especially while sleep
deprived. I went through later to remove everything named after Roger
Zelazny's "nine princes in amber" series from patch.c and all the
beatles references in ps.c, but that's a cleanup pass at the end.)
The core of that algorithm is finding the end of the next hunk, either
the next X matching lines or EOF on either input. The rest of it boils
down to counting consecutive pairs of matching lines (cheap) or figuring
out what to emit once you've delineated a hunk (tricksy but not expensive).
The brute force way of finding the next matching line pair is N^2
computationally expensive in the size of the current hunk: just read a
line from one of the inputs and scan back over the previously seen lines
of the other to see if it matches. If so, can I read MORE lines from
that input source to get X consecutive matches at that point in the
other input source? (At which point I've ended the hunk, and generally
one of the inputs has read ahead and has some buffered lines.) Alternate
which input you're reading a new line from because we dunno which one
did an insert and which one did a deletion.
That said, individual hunks larger than a couple hundred lines are
uncommon. There IS pathological input that would make this slow and use
a lot of memory (basically diff -u <(seq 1 1000000) <(seq 1000000 -1 1)
) but it's still only N^2 on number of input lines slow (not any sort of
X^N shenanigans, it should finish even given a gigabyte of differing
input on each end, just not conveniently). And its pathological memory
use case is the base memory use of the other algorithms. And I COULD
apply hashing to speed the search up (modulo -i and -b and -I would have
to apply to the hash too in order to be of any benefit, don't ask me how
-I regex is supposed to hash, "cut out the match and replace it with an
empty string" isn't actually what the man page describes -I as
doing...). But I have yet to find a real world input that would actually
benefit from it rather than net lose from the overhead, and NOT doing it
is simplier.
The positive tradeoff is it doesn't care about the size of the _file_
because the matching parts are almost free. Comparing two matching lines
and discarding them when not currently in a differing hunk is O(1)
("yup, they still match up, keep going"), and everything before the
current hunk has already been output/discarded so makes no difference to
processing the current hunk. And that's SHOULD be optimizing for the
common case, I think?
Note that X to end current hunk is actually 2x+1 the usual start/end
context because if your hunks have 3 context lines but your files have
"change, 5 same lines, another change" then that's one hunk with 5
consecutive unchanged in the middle. No really:
$ diff -u <(printf '%s\n' a b c d e f g h i j) <(printf '%s\n' a j c d e
f g h k j)
--- /dev/fd/63 2025-01-31 13:13:37.357744367 -0600
+++ /dev/fd/62 2025-01-31 13:13:37.357744367 -0600
@@ -1,10 +1,10 @@
a
-b
+j
c
d
e
f
g
h
-i
+k
j
landley at driftwood:~/toybox/clean2$ diff -u <(printf '%s\n' a b c d e f g
h i i j) <(printf '%s\n' a j c d e f g h i k j)
--- /dev/fd/63 2025-01-31 13:13:54.946009279 -0600
+++ /dev/fd/62 2025-01-31 13:13:54.950009339 -0600
@@ -1,5 +1,5 @@
a
-b
+j
c
d
e
@@ -7,5 +7,5 @@
g
h
i
-i
+k
j
This is for two reasons: 1) ending a hunk and starting a new hunk is a
longer output than just letting it run, 2) patch doesn't reprocess its
output and thus can't apply overlapping hunks. (It also needs hunks to
occur in order. Strangely that's not just a limitation of MY patch,
other patch programs consider hunks out of order within the same file
illegal. Concatenating patches works because they close and re-open the
file, which starts over at the beginning. That's part of the reason I
did mine my way, because as long as that limitation's there anyway...)
Yes, the 2x case being treated as contiguous is a weird edge case, I
would have thought 2x was where it would break (the point where patch
doesn't have to reprocess output of a previous hunk to apply the next
patch), but it's 2x+1 (because that's where output size is equivalent
with the @@ resync line) which means a hunk can't be considered "done"
until we've seen 2x+1 matching trailing lines (or EOF on at least one
input).
Once you've loaded and terminated a hunk, you know how many leading and
trailing common lines the resulting hunk should output (leading common
lines naturally match up, but I mentioned needing to treat TRAILING
common lines specially because they don't automatically zip back
together all the time, although I have yet to to find my test case of
where they didn't) and if they don't it sends the wrong signal to patch
(which COUNTS leading/trailing match lines to detect "must match SOF/EOF").
That was the pending bug I needed to fix back when I mothballed this
because I got too many "why are you ignoring code that works to go off
and putter with something irrelevant, shame on you wasting our time"
pokes and walked away for a while. The challenging part wasn't fixing
the bug, the problem was delineating the edge cases and coming up with
all the tests for them. Because you can mix SOF and EOF into there too,
and it should handle them all right, and the hard part isn't handling
them right it's making sure you've tested each one so you can PROVE it's
handling them right. I suspect I blogged about this at the time...)
And then you have a pile of lines in between that need to be output with
+, space, and -. That's the part where I thought patience/histogram
might come in (and potentially the constraint of operating WITHIN a
known hunk might cancel out the computational expense), but doing
patience _within_ already delineated hunks turned out not to be what it
was designed for at all.
Note that one thing this algorithm (fairly fundamentally) cannot do is
detect REORDERED hunks. (I miss the OS/2 tool that did that in 1996. It
was really nice to use. Color coded output, drew little diagonal lines
between the left side and right side versions when it detected
reorders... Alas, IBM propritetary. Part of my formative "don't get
addicted to proprietary crap, this too shall pass and then you'll have
withdrawl symptoms when it's gone extinct", LONG before its logical
continuation "cloud is bad, do not cloud".)
But diff doesn't have a syntax for specially marking reordered hunks
(not unified diff, anyway), it treats it as a deletion and an insertion.
And patch couldn't handle it either because of the
not-reprocessing-output limitation. (Git added some file level move
indications, but even they don't have a syntax to move text within a file.)
So if what you're looking for is "here's the set of transforms to turn A
into B", this should reliably give you a workable answer. (Modulo
fixable bugs.) Will it produce the most human-readable possible answer?
Not sure. When the next hunk starts and ends seems to have fairly
deterministic right answers, it's the + and - decisions within a hunk
that vary. (I think.) And "what were you thinking" should be easy for a
human to work through with this algorithm. If somebody comes to me with
a good "This was just X, you got confused" test input maybe I can tweak
the +/- output emitter to understand what the human saw that the
algorithm didn't, but I need the tests first.
(Don't ask me how diff3 works into any of this, I've never used it but I
know git does under the covers so will have to care at some point. And I
have yet to open the can of worms of marking changes WITHIN lines either.)
> I might not have bothered if I'd known you were into diff again, after
> your posts about how you were sick of it.
Sorry. I _was_ sick of it (not the code, the "stop what you're doing and
listen to me tell what you should be doing instead"), but somebody needs
a new feature because kernel, and shortest path to that was beating the
behavior out of the codebase I theoretically understand.
> I thought you were focused on sh.c.
I am (well, ok, the top of stack is currently kexec for jcore and some
boot rom rewrites and SOC documentation, but top of stack for TOYBOX is
sh.c), but real world users coming to me with use cases is always a
priority boost.
"I need X and can't figure out how to make your tool do it" is the
highest feedback there is. That's the part I _can't_ know already. Even
when the answer's just documentation, that's still something that was
missing.
> Anyway, my code and blog is there if you decide you want to try
> it, see how it works and find out what histogram diff really does.
> I don't expect you will.
I'm definitely reading the blog entry. Thanks for writing it.
> BTW you're concerned that the current pending diff.c may have
> plagiarized code from toybox.
busybox.
(I should have called the new one dorodango. I should have called mkroot
dorodango. If I actually get an automated LFS bootstrapping project
working on the level of what aboriginal linux used to do I probably WILL
call it dorodango. Keep polishing the ball of mud until it shines...)
> I looked at them a bit, and I didn't see a
> lot in common, but the read_tok() function you called out in your issue
> comment, and a related nameless enum of flags or states, bear a strong
> resemblance to read_token() and an enum in busybox diff.c.
Eh, it's not a strong objection, I just got a bit gun-shy after catching
a source very obviously (obvious to me, the ex-busybox maintainer) doing
that a few years back.
This was not a straight copy, my cleanups tend to be significant
rewrites anyway (I'm not entirely sure of the provenance ala ifconfig in
cleanup.html), and I would LOVE to give Bradley a PR black eye if he
tried to use busybox to sue toybox. (The new SCO.) But I'm not gonna
invite it either. Basic hygiene.
Mostly, the problem was cleaning up the old code was several times more
work (for me) than just writing a new one, and when I broke the old code
I didn't understand WHY it was broken so everything was endless
debugging of code I was trying to reverse engineer how to replace it.
> Ray
Rob
P.S. Sorry if I'm reading "stop what you're doing, do this instead, I
command it" into things that DO NOT MEAN THAT. My motivation has
recovered a LOT recently but is still sadly fragile, and that's a me
problem. I spent far too much of last year curled up in a ball waiting
for the meteor to hit, but now that the worst has basically happened and
we're back in The Time Of The Circular Firing Squad, I've gotten on with
it again. (The "loss aversion" part is over: we're fscked. Moving on...)
(Selling a house, moving cross country, my wife graduating and getting a
full time job so I'm househusbanding now, the industry I'm usually
employed in having the same kind of layoffs the dot-com crash and
mortgage crisis caused while smiling and claiming everything's fine (I'm
ok for now but not all my friends are), getting covid AGAIN, and seeing
america's version of brexit coming 6 months ahead of time and REALLY
HOPING I WAS WRONG... all that stacked poorly. Not your fault, not
Oliver's fault, I feel terrible about being unable to properly harness
your enthusiasm.)
More information about the Toybox
mailing list