[Toybox] histogram diff

Ray Gardner raygard at gmail.com
Fri Jan 31 09:18:35 PST 2025


On Fri, Jan 31, 2025 at 7:19 AM Rob Landley <rob at landley.net> wrote:
>
> On 1/30/25 17:59, Ray Gardner wrote:
> >> [patience diff] may have predated the "histogram" algorithm which is
> >> surprisingly hard to google for...
> >> ...
> >> I was really hoping I could implement just ONE algorithm and call it good.
> >> Right now it looks like "histogram" would be that one (it's an improvement
> I just started work on my diff branch again after
> https://github.com/landley/toybox/issues/489#issuecomment-2588079658
> with the goal of replacing code I don't understand with code I do
> understand.
> ...
> If I throw what I've done and replace the code I understand with code I
> don't understand, it will go behind "man.c" on the todo list.

As I quoted, you indicated you wanted a histogram diff. I put a pretty
detailed explanation at the end of my submission (can see it in the
patch) and more, including pseudocode, in my blog post
https://raygard.net/2025/01/28/how-histogram-diff-works/ . It's (I hope)
easier to understand than Hunt-McIlroy or Myers, as it's not based on
any CS theory but on an empirical method that just works pretty well.
Well enough that Torvalds prefers it for patch submissions
(https://lkml.org/lkml/2023/5/7/206).

> What is "hunky-dory code" in this context?

Just a reference to your function names hunky() and dory() and their
friends in your hunk detection (etc.) code.

I might not have bothered if I'd known you were into diff again, after
your posts about how you were sick of it. I thought you were focused on
sh.c. Anyway, my code and blog is there if you decide you want to try
it, see how it works and find out what histogram diff really does.
I don't expect you will.

BTW you're concerned that the current pending diff.c may have
plagiarized code from toybox. I looked at them a bit, and I didn't see a
lot in common, but the read_tok() function you called out in your issue
comment, and a related nameless enum of flags or states, bear a strong
resemblance to read_token() and an enum in busybox diff.c.

Ray


More information about the Toybox mailing list