[Toybox] New diff.c

Mon Aug 8 08:30:14 PDT 2022

On Mon, Aug 8, 2022 at 1:07 AM Rob Landley <rob at landley.net> wrote:
>
> After wrestling with diff on and off for weeks (I can see WHAT it's doing but
> not WHY) I broke down and wrote a new one (attached, very preliminary). Dunno
> yet if I'll use it instead of the old one, but I understand how this one _works_
> which still isn't true for the conventional algorithms despite much squinting
> and reading old papers and people's blog posts about them.
>
> This diff is streaming, like my patch implementation: it accumulates changed
> lines for the current hunk and frees them again when it's done with them. Thus
> its memory and CPU consumption scale with the size of the _hunk_, not the size
> of the file. Between hunks it's only retaining as much trailing context as it
> needs to start a new hunk if the next line differs. (Sometimes it's working
> through a backlog of lines it preloaded looking for the end of the previous
> hunk, but it's freeing lines as it goes between hunks when that's the case,
> working DOWN from a memory consumption high water mark.)
>
> The overall flow control is diff_main() calls find_hunks() to find hunks, which
> calls dump_hunk() to display each hunk as a series of +, -, and context lines.
> All line comparisons are done by diffcmp() which takes all the flags into
> account (whitespace, case sensitivity, stripping newlines). Lines are read by
> get_lines() and freed by trim_lines().
>
> struct half_hunk contains all the info we currently know about one of the input
> files: its name, its FILE *, the array of lines we've read from it so far but
> not freed yet, what line number we're up to, etc. There are generally two
> instances of this: h1 is the minus half and h2 is the plus half.
>
> Details: diff_main() currently opens two input files (using hunky("filname"))
> and calls find_hunks() on them. It's a stub: I still need to add the directory
> traversal logic back, it's not setting a return code...
>
> find_hunks() traverses two input files to find differences. Between hunks it
> calls get_line() on each half_hunk to append one line (or detect EOF). If the
> new lines match according to diffcmp() it keeps the last -U many context lines
> (to start a new hunk with as necessary) and frees the rest via trim_hunk(). If
> both files end at the same time during this mode, it exits the loop (which will
> free both half_hunk structures and their contents).
>
> If the search loop in find_hunks() doesn't get a pair of matching lines
> (possibly because one file ended before the other) it alternates loading one
> line from each input (that hasn't ended yet) and calling back_search() to see if
> this new line ends the hunk. (If so, the shorter half-hunk will unget lines so
> it ends in the appropriate spot, and future get_line() calls will add them back
> as necessary.) If both inputs end, that also ends an in-progress hunk.
>
> back_search() looks for 2 times -U many matching lines from one half_hunk
> somewhere in the other because that's what it takes to break a unified diff
> hunk. Since hunks occur in order and can't overlap, that means you can't END a
> hunk until you can start the next one, or have hit the end of the file. Yes this
> means with -U 3 you can have up to 5 consecutive matching lines in the middle of
> a hunk, try it:
>
> $ diff -u <(echo axbxcxdxexfxgxhxi | tr x '\n') <(echo axBxcxdxexfxgxHxi | tr x
> '\n')
> --- /dev/fd/63  2022-08-08 01:35:42.463922199 -0500
> +++ /dev/fd/62  2022-08-08 01:35:42.463922199 -0500
> @@ -1,9 +1,9 @@
>  a
> -b
> +B
>  c
>  d
>  e
>  f
>  g
> -h
> +H
>  i
>
> find_hunks() alternates adding a new line to the end of each half_hunk and
> calling back_search on the other half_hunk because we don't know ahead of time
> whether we're net adding or removing lines, and thus which half_hunk will be
> shorter. This is why we need the unget plumbing using the "saved" member of
> struct half_hunk, because we're likely to overshoot on one of them and must save
> those extra lines after this hunk until they're needed later.
>
> Once find_hunks() has delineated a hunk it calls dump_hunk() to display the hunk
> as +, -, and context lines. Since lines are displayed in order and context lines
> match in both hunks, this means identifying context lines. This part sucks and
> is darn near AI complete to do properly: what I did was:
>
> 1) glue both half hunks together and qsort() them, identifying and tagging
> matches. (And yes "match" requires a matching line in the OTHER half_hunk, not
> just repeated in this half_hunk.)
>
> 2) Advance forward in both half_hunks simultaneously looking for the next marked
> line in each. Marked lines MAY be matched as context lines, but it doesn't mean
> they will. Once we have the next pair or marked lines, simultaneously search
> forward checking the OTHER half_hunk for a marked line that diffcmp() says
> matches this one. This means we keep the closer match in either direction. If we
> hit the end of both half_hunks before finding a match for either line, we unmark
> them both as matched lines and start over. If we hit the end of both files, then
> set "next context line" to the length of the hunk.
>
> 3) Emit - lines from h1 up to the next context line. Emit + lines from h2 up to
> the next context line. If we're not at the end of the hunk, emit the next
> context line from h1 and skip it in both h1 and h2. Loop back to the start of
> (2) above.
>
> Yes I can come up with a pathological input that breaks this. Probably the "if
> we hit the end of both half_hunks" above should be cleverer, hitting the end of
> just ONE half_hunk means skip that marked line and start its search over,
> traversing as far as the other side's searched before going back to alternating...
>
> The trick to making dump_hunk() work is an extra layer of indirection. Each
> half_hunk stores its lines as an array of char pointers using the members char
> **str and unsigned len. dump_hunk() collates both half_hunks into char ***big
> which is an array of pointers to each line in that hunk's str array, and then
> calls qsort() on that to find duplicate pairs. The extra layer of indirection
> lets us:
>
> A) figure out which hunk the string came from after the fact (is it >= h1->str
> and <= h2->str+h1->len) so matches only count if it's got an entry in both
> half_hunks.
>
> B) Mark the matches by modifying each string pointer in the half_hunk that has a
> match, by setting its bottom bit which is otherwise guaranteed to be zero by
> malloc() returning word-aligned memory. We do this by dereferencing the char ***
> enough times to incrementing the relevant char * pointer by one.)
>
> The display loop at the end calls a squish() function that unmarks the line
> (checks the bottom bit and decrements the pointer if it's set). We need to do
> this before we can free() the line.
>
> Anyway, the attached is still buggy and incomplete, but it gives the general
> idea. If this ACTUALLY pending code this would be at least my third checkin of
> the new plumbing, but since people are building and using stuff out of pending I
> can't replace it until the new stuff fully works, which means you don't get to
> see all the development history. So here's a checkpoint since I've been kinda
> quiet otherwise. :)

no history is always a mistake that ends up being regretted... just
check in as diff2, ndiff (new), tdiff (toybox), rdiff (rob) and then
rename or delete when you make your final decision? seems like the
best of all worlds?

> I need to add the above 5 context line test to the test suite, and add a test
> with 5 context lines at the end of the file (not enough to trigger back_search()
> but more than -U trailing context), and... so many tests needed for this to be
> load bearing.
>
> My REAL test is probably doing a "git log" of my fullhist kernel tree back to
> 0.0.1 and having two parallel directories one version apart so this diff can
> compare each commit with the next commit. I don't expect the resulting diffs to
> be identical to what git produces, but my "patch" should be able to apply the
> resulting diff so that "git reset $COMMIT" followed by "git diff" says there are
> no differences. And my patches shouldn't be significantly LONGER than git's
> patches in line count, although I dunno about that last part...
>
> Rob
>
> P.S. Seriously, proper context line detection is painful. I had a whole "find
> longest span of consecutive matches" approach I ripped out to do this one
> instead, because there was ALWAYS a pathological input I could construct that
> made it give the wrong answer. This one can do that too but at least it's CHEAP,
> assuming qsort() isn't pathological on the hunk..._______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net