[Toybox] More than you really wanted to know about patch.

Mon Jan 14 09:24:06 PST 2019

(i actually thought the question was more about the workflow, in which
case an answer would look more like the "Exporting a patch"/"Importing
patches" sections of https://git.wiki.kernel.org/index.php/QuickStart
...)

On Sun, Jan 13, 2019 at 9:30 PM Rob Landley <rob at landley.net> wrote:
>
> On 1/13/19 3:57 PM, scsijon wrote:
> > Any chance of a two or three page "Introduction to Creating and Understanding
> > Patches for Dummies" for those of us who either don't know how to build one, or
> > like me, have, "but don't really know what i'm doing".
> >
> > When you can make time of course, i'd really like to understand more of what the
> > group is doing with patches submitted rather than only a little.
> >
> > Please, with pure honey on crumpets.
> Patches are reasonably straightforward, if somewhat reverse engineered historically.
>
> Back in the 1980's somebody invented diff -u ("unified diff format") as a more
> human readable alternative to the <old >new lines format you get without the -u,
> and then Larry Wall whipped up a program to reverse the process and use saved
> diff -u output to modify a file (which was mind-blowing at the time). As far as
> I can tell the format wasn't really meant for that, and was made to work with
> heuristics and hitting it with a rock, but Larry _did_ go on to invent Perl...
>
> A patch is a series of "hunks", describing a range of lines in the "old" version
> and the corresponding range in the "new" version. Patches have 6 different types
> of lines, each starting with one of "+++ ", "--- ", "@@ ", " ", "+", or "-".
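
A minimal sketch of telling those six line types apart by prefix. The labels and the `classify` name are made up for illustration. It is context-free, so inside a hunk body a removed line whose text begins with "-- " would be mislabelled as a header; a real patch program avoids that by counting lines from the @@ header.

```python
# Prefixes are checked longest-first so "+++ " wins over "+".
# Labels are hypothetical names chosen for this sketch.
PREFIXES = [
    ("+++ ", "new-file header"),
    ("--- ", "old-file header"),
    ("@@ ", "hunk header"),
    (" ", "context"),
    ("+", "added"),
    ("-", "removed"),
]


def classify(line):
    """Return a label for one patch line, or None if no prefix matches."""
    for prefix, label in PREFIXES:
        if line.startswith(prefix):
            return label
    return None


sample = ["--- oldfile", "+++ newfile", "@@ -1,2 +1,2 @@", " same", "-old", "+new"]
for text in sample:
    print(repr(text), "->", classify(text))
```
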
>
> The first 2 (the --- and +++ lines) are control lines that come at the start and
> indicate we're working on a new file. They indicate the old file name and the
> new file name for the changed files. If you "diff -u oldfile newfile" you get a
> hunk starting with:
>
>   --- oldfile
>   +++ newfile
>   @@ -oldstart,oldlines +newstart,newlines @@ comment
>   and so on
>
> Those first two lines are --- or +++, one space, and the filename.
>
> Unfortunately, the original unified diff format then followed each filename with
> a tab character and the timestamp of the file (in "yyyy-mm-dd hh:mm:ss tzoff"
> format), which means if you have a tab character in the filename you can't patch
> them. These days this datestamp is optional, and most patches don't have them
> anymore. (I have a todo item to make toybox patch work backwards from the end of
> the line and peel off only a properly formatted tab+date entry and leave it
> alone otherwise, but right now it just stops at the first tab. Which is not a
> space or newline, and thus almost never occurs in filenames and nobody's
> complained yet (because if you tab in the windows gui it switches focus so
> windows people can't trivially create this breakage and then wine for us to
> "support" it)...  Still, lemme do a quick commit to make that suck _slightly_
> less by at least requiring the next character to be a digit in order to match
> the date and strip it off. It still doesn't handle filenames with a newline in
> them, but... how would you?)
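
The "tab followed by a digit" rule described above could be sketched like this. It is not toybox's actual code; `split_name` is a hypothetical helper, and treating the first tab-plus-digit as the start of the datestamp is an assumption.

```python
def split_name(header_rest):
    """Split the text after '--- ' or '+++ ' into (filename, datestamp).

    Assumption for this sketch: a tab immediately followed by a digit starts
    the datestamp. Any other tab is treated as part of the filename.
    """
    tab = header_rest.find("\t")
    while tab != -1:
        if header_rest[tab + 1:tab + 2].isdigit():
            return header_rest[:tab], header_rest[tab + 1:]
        tab = header_rest.find("\t", tab + 1)
    return header_rest, None


print(split_name("file.c\t2019-01-13 21:30:00 -0800"))
print(split_name("odd\tname.c"))
```

A filename that itself contains a tab followed by a digit would still be cut short, which is the same kind of limitation the paragraph above admits to.
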
>
> If this (now optional) date was the unix epoch (midnight, january 1, 1970, which
> timezone adjustments often moved to December 31, 1969), it indicated we were
> comparing against a nonexistent file. The more modern way to say this is to use
> the special filename /dev/null. So if you want patch to create a file, what you
> do is "diff -u /dev/null newfile", and if you want it to delete a file, "diff -u
> oldfile /dev/null". (Otherwise it leaves a zero length file when you remove
> all the lines, or expects an empty file to already be there when adding with no
> context lines.)
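
To see what such create/delete patches look like without real files, here is a small sketch using Python's standard `difflib` as a stand-in for "diff -u". The file names and contents are invented, and difflib is not the diff program itself, though it writes the same unified format.

```python
import difflib

new_lines = ["hello\n", "world\n"]

# Creating a file: the old side is empty and named /dev/null.
create = list(difflib.unified_diff([], new_lines, "/dev/null", "newfile"))

# Deleting a file: the new side is empty and named /dev/null.
delete = list(difflib.unified_diff(new_lines, [], "oldfile", "/dev/null"))

print("".join(create))
print("".join(delete))
```

The side that does not exist gets the range "0,0" in the @@ line, and the hunk body has no context lines at all.
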
>
> The other fun thing is when you diff 2 files, the files need to have different
> names. How do you know which one you're applying the patch to? Historically, it
> tried both names and used whichever one worked... but if you happen to have a
> file with your tempname lying around in the directory you're applying the patch
> _to_ (which happens a lot when you habitually use the same tempfile name), the
> hunk may try to apply to the wrong file. (There were certain horrible heuristics
> I don't remember that tried to work out what you _meant_ to do, which didn't
> really help and I don't think I implemented them?)
>
> And these days files have paths. As the switch from CVS to SVN (let alone git)
> taught us: individual standalone files aren't very interesting, you're almost
> always operating on a _tree_ of files.
>
> So generally what you do _now_ (and what tools like svn or mercurial or git
> pretend to do behind the scenes) is back up one directory, have two full trees
> (the vanilla project and your modified version), and "diff -ruN" the two
> subdirectories: -r is recursive, -u is unified format instead of the old < and >
> version, and -N says pretend to compare new or removed files against /dev/null
> so the diff says to add or remove them properly. That's why tools like svn or
> mercurial or git will create diffs that start like:
>
>   --- a/path/to/file
>   +++ b/path/to/file
>
> Except... now you've got an extra level of directory you don't want, so you have
> to back up _out_ of your project's tree to apply the patch and it's STILL
> guessing which name you mean.
>
> So what you do is create the diffs like that, then use the "-p 1" option when
> applying them, which says "peel off one layer of directory when parsing the
> filenames". That removes the a/ and b/ from the paths, and the rest should be
> identical so it's no longer ambiguous and it doesn't matter if you use the +++
> or the --- line as the file to apply the patch to. (No, -p1 doesn't apply to the
> magic name /dev/null, absolute paths aren't modified, only relative ones. Also,
> you can say "-p0" to disable the above "certain horrible heuristics" on pathless
> filenames and just literally use the filenames in the patch, but that doesn't
> come up much these days. Creating a diff between two trees and applying it
> within the top level of the tree via "patch -p1" is nearly universal now. That's
> the format "git format-patch -1 $HASH" and "git am file.patch" are using, for
> example.)
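
A rough sketch of the -p stripping as described above. `strip_components` is a hypothetical name, and leaving absolute paths untouched follows the description in this email rather than any particular patch implementation.

```python
def strip_components(name, count):
    """Drop `count` leading directories from a relative path in a patch header.

    Following the description above, absolute paths (including the magic
    /dev/null) are returned unchanged.
    """
    if name.startswith("/"):
        return name
    parts = name.split("/")
    return "/".join(parts[count:])


for header_name in ("a/path/to/file", "b/path/to/file", "/dev/null"):
    print(header_name, "->", strip_components(header_name, 1))
```

With -p1 both the a/ and b/ names collapse to the same path, which is why it stops mattering which of the two header lines is used.
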
>
> Ok, so all that's indicating what file hunks apply to, then you get to actual
> hunks describing what changes to make within the file.  Each hunk starts with an
> @@ line, with 4 numbers, like so:
>
>   @@ -start,len +start,len @@ comment
>
> Each "start" is the (decimal) line number in that file the hunk starts applying
> at, and the "len" is the (decimal) number of lines described in that file. These
> numbers measure the body of the hunk, which comes next.
>
> (The "comment" part can be anything, and doesn't even have to be there. It's
> ignored. Modern language-aware diff -u variants stick which C function you're
> modifying in there, which is nice for humans but not used by patch that I know
> of. The simple crappy heuristic there is "last unindented line", which can find
> goto labels: ...)
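
Parsing the @@ line could be sketched with a regular expression. One detail not mentioned above: diff tools omit the ",len" part when the length is 1, so this sketch defaults a missing length to 1. `parse_hunk_header` is a hypothetical name.

```python
import re

# "@@ -start[,len] +start[,len] @@ optional comment"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@(.*)$")


def parse_hunk_header(line):
    """Return (old_start, old_len, new_start, new_len, comment)."""
    m = HUNK_RE.match(line)
    if m is None:
        raise ValueError("not a hunk header: %r" % line)
    old_start, old_len, new_start, new_len, comment = m.groups()
    return (
        int(old_start),
        int(old_len) if old_len is not None else 1,  # omitted len means 1
        int(new_start),
        int(new_len) if new_len is not None else 1,
        comment.strip(),
    )


print(parse_hunk_header("@@ -10,7 +10,8 @@ int main(void)"))
print(parse_hunk_header("@@ -0,0 +1 @@"))
```
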
>
> Each line of the rest of the body of that hunk starts with one of three characters:
>
> 1) + meaning this line is only in the new version (it was added).
> 2) - meaning this line is only in the old version (it was removed).
> 3) " " (space) = this line is the same in both (it's context for the changes).
>
> The context lines plus + lines need to add up to the "len" in the + part of the
> @@ line, and the context lines plus - lines need to add up to the len in the -
> part. (The start is more or less a comment, used to indicate how far off it
> applies at if the hunk moved but otherwise not really mattering as far as I can
> tell. Well toybox doesn't use it.)
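
The counting rule above is easy to check mechanically. A minimal sketch, with `hunk_lengths` as a made-up helper name and an invented hunk body:

```python
def hunk_lengths(body):
    """Return (old_len, new_len) implied by the lines of one hunk body."""
    old_len = sum(1 for line in body if line[:1] in (" ", "-"))
    new_len = sum(1 for line in body if line[:1] in (" ", "+"))
    return old_len, new_len


body = [" a", " b", " c", "-old", "+new", "+extra", " d", " e", " f"]
# 6 context + 1 removed = 7 old lines, 6 context + 2 added = 8 new lines,
# so a matching header would read "@@ -start,7 +start,8 @@".
print(hunk_lengths(body))
```
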
>
> Note: if your code is tab indented, it still needs a space (ascii 32) at the
> start of it to be a context line, then it's binary identical for the contents
> (so tabs or spaces as appropriate). This causes some editors to flip out about
> mixing tabs and spaces, but the distinction is functional here.
>
> Patch opens files when it sees +++ --- line pairs, reads in the next @@ hunk and
> the appropriate number of lines after it (with the right number of context
> lines, additions, and removals for what the @@ line counts said), and then
> searches in the file for a place where the appropriate context lines and removed
> lines appear in the right order (removed lines are matched just like context, if
> they're not there in the file the hunk doesn't apply), then replaces it with the
> set of context lines and added lines the hunk says should go there instead.
> (Note that if you patch -R then it's the + lines being removed and the - lines
> being added, "reversing" the patch.)
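
The search-and-replace step described above, as a simplified sketch. It is not how any real patch program is structured: it works on an in-memory list of lines, has no fuzz or offset handling, and `apply_hunk` is a hypothetical name.

```python
def apply_hunk(lines, body, reverse=False):
    """Apply one hunk body to a list of lines and return the new list."""
    if reverse:
        # Reversing a patch just swaps the roles of "+" and "-" lines.
        swap = {"+": "-", "-": "+"}
        body = [swap.get(line[:1], line[:1]) + line[1:] for line in body]
    # What must be found: context plus removed lines, in order.
    old = [line[1:] for line in body if line[:1] in (" ", "-")]
    # What replaces it: context plus added lines, in order.
    new = [line[1:] for line in body if line[:1] in (" ", "+")]
    for pos in range(len(lines) - len(old) + 1):
        if lines[pos:pos + len(old)] == old:
            return lines[:pos] + new + lines[pos + len(old):]
    raise ValueError("hunk does not apply")


original = ["a", "b", "c", "blah", "d", "e", "f"]
body = [" a", " b", " c", "-blah", "+BLAH", " d", " e", " f"]
patched = apply_hunk(original, body)
print(patched)
print(apply_hunk(patched, body, reverse=True))
```
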
>
> Each hunk generally starts with 3 leading context lines, and ends with 3 trailing
> context lines, which generally provides enough context to uniquely identify
> where to apply the hunk even if you're just adding a single line (that's the
> pathological case of providing no other corroborating information). The
> exception is when your hunk applies at the start or end of the file: then
> there aren't enough context lines, and may not be _any_ if you're right at the
> end or beginning of the file.
>
> The hunk also has interstitial context lines as appropriate (between the
> additions and removals, which also have to match or the hunk won't apply), but
> not more than 6 (leading + trailing context line count) or it'd split into 2
> hunks. (This _does_ mean you can have 4 context lines in a row though.)
>
> What IS important is that you have the same number of leading context lines as
> trailing context lines, unless you're at the start/end of a file. If you don't,
> it's not a valid hunk and patch barfs on the corrupted patch. And the number of
> leading/trailing context lines not being the same means the patch program will
> try to MATCH the start/end of the file (whichever one's got truncated context),
> and fail if it can't (hunk does not apply, context is wrong).
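
One way to express that leading/trailing comparison in code. This is only a sketch of the rule as described above; the `anchor` name and its return strings are invented for illustration.

```python
def context_counts(body):
    """Count the context lines at the start and at the end of a hunk body."""
    leading = 0
    for line in body:
        if not line.startswith(" "):
            break
        leading += 1
    trailing = 0
    for line in reversed(body):
        if not line.startswith(" "):
            break
        trailing += 1
    return leading, trailing


def anchor(body):
    """Say which end of the file the hunk must match, or None if neither."""
    leading, trailing = context_counts(body)
    if leading < trailing:
        return "start of file"  # truncated leading context
    if trailing < leading:
        return "end of file"    # truncated trailing context
    return None


print(anchor([" a", "-x", "+y", " b", " c", " d"]))
print(anchor([" a", " b", " c", "+z"]))
print(anchor([" a", " b", " c", "-x", " d", " e", " f"]))
```
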
>
> You can have as many hunks as you want within a file, i.e. as many @@ lines after
> a given --- +++ pair, but the hunks must apply in order, and this INCLUDES the
> context lines. A line that's been "seen" as a trailing context line won't match
> against the leading context of the next hunk.
>
> Because of this, you sometimes need 3 or more interstitial context lines in a
> row in the _middle_ of a hunk (between + and - lines), if that's how your
> changes work out. A number of consecutive context lines matching the leading
> context does NOT end the hunk, only consuming the line counts from the @@ line
> does that. And then you figure out if leading/trailing context counts match
> (indicating the need to match start/end of file) _after_ that. (If you really
> want to back up and modify an earlier part of the file, you need a new --- +++
> pair to flush and reopen the file, so it can start over searching at the beginning.)
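
The in-order rule can be illustrated with a search cursor that only ever moves forward. Again this is a toy sketch over an in-memory list, not real patch internals, and `apply_hunks` is a hypothetical name.

```python
def apply_hunks(lines, hunks):
    """Apply hunk bodies in order; each search resumes after the previous hunk."""
    out, cursor = list(lines), 0
    for body in hunks:
        old = [line[1:] for line in body if line[:1] in (" ", "-")]
        new = [line[1:] for line in body if line[:1] in (" ", "+")]
        for pos in range(cursor, len(out) - len(old) + 1):
            if out[pos:pos + len(old)] == old:
                out[pos:pos + len(old)] = new
                # Lines already "seen", trailing context included, are never
                # searched again for later hunks.
                cursor = pos + len(new)
                break
        else:
            raise ValueError("hunk does not apply")
    return out


lines = ["one", "two", "three", "four"]
first = ["-one", "+ONE"]
second = ["-three", "+THREE"]
print(apply_hunks(lines, [first, second]))
```

Passing the same two hunks in the opposite order raises ValueError here, because the search for "one" starts after "three" has already been consumed.
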
>
> Oh, I know I said the start numbers in the @@ line were only used for warnings,
> but you CAN use them to sanity check the leading context number if you want to.
> (Since if you're forcing a match with the beginning of the file, the hunk had
> better start at line 1 in that file or something is wrong.) Doesn't help with
> end of file though.
>
> So you wind up with:
>
> --- filename
> +++ filename
> @@ -start,len +start,len @@
>  context
>  context
>  context
> -blah
> +blah
>  context
>  context
>  context
> @@ -start,len +start,len @@
> ...
>
> Oh, the - lines usually come before the + lines when they change the same line,
> but I don't think that's actually required? The entire context is matched before
> applying the hunk anyway. And note that you don't skip what you've already
> looked at when a hunk didn't apply, you go down ONE line and try matching again.
> If your context lines are all blank, you can skip right past the start of where
> this hunk applies otherwise. I hit and fixed that bug years ago in toybox. :)
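
A small sketch of how that kind of skipping bug can arise. This is one plausible reconstruction, not the actual toybox code, and both function names are invented. `find_from` retries one line further down after a failure, while `find_skipping` resumes past the lines it already compared.

```python
def find_from(lines, pattern, start=0):
    """Return the first index >= start where pattern matches, else -1."""
    for pos in range(start, len(lines) - len(pattern) + 1):
        if lines[pos:pos + len(pattern)] == pattern:
            return pos
    return -1


def find_skipping(lines, pattern, start=0):
    """Buggy variant: after a partial match, jump past the compared lines."""
    pos = start
    while pos + len(pattern) <= len(lines):
        matched = 0
        while matched < len(pattern) and lines[pos + matched] == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            return pos
        pos += max(matched, 1)
    return -1


# Two blank context lines followed by the line being changed.
file_lines = ["", "", "", "x"]
wanted = ["", "", "x"]
print(find_from(file_lines, wanted))
print(find_skipping(file_lines, wanted))
```

With blank context lines the partial match at line 0 consumes two lines, so the skipping version jumps over the real match that starts at index 1.
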
>
> And of course all this is before git added a "rename" syntax that looks like:
>
>   https://lwn.net/Articles/244448/
>
> And has copy and delete variants that allow it to be much less verbose (avoids
> including the body of the matched file(s)).
>
> It's on the todo list... :)
>
> Rob
>
> P.S. You asked.
>
> > ps and i'm looking forward to the next mkroot, I miss Aboriginal!
>
> Alas, I just landed back in Milwaukee to do another round of $DAYJOB because
> neither toybox nor mkroot pay the bills. (I'm very grateful to the
> https://patreon.com/landley subscribers, and it's great encouragement, but my
> mortgage alone is like 25 times what that brings in. Nobody with a significant
> budget wants to fund this work, and keeping the lights on gets scheduled higher
> than things that don't. But I can presumably cut a mkroot release with the 4.20
> kernel right after I do a toybox release at the end of the month. All 4.20 broke
> that I've noticed so far was adding sha256 as a hard requirement to the s390x
> build, and I can add that to the toybox airlock install passthroughs for the
> moment...)
>
> (I had a huge todo list for my month off... and wound up going limp for most of
> it. I was doing ok until the battery in this old laptop completely died (as in
> unplug = instant off, so suspend is useless and I lose all open windows every
> time I move it). And alas I did NOT get Devuan working on the new System76 laptop
> I ordered a few months back (binary wifi firmware tantrum in the installer), and
> what they preinstalled on it has systemd, and given a choice between "system
> with no battery" and "system with systemd" it's no contest. But I did get the
> new She-Ra and Hilda watched, and the first season of The Good Place, so that's
> something...)
>
> Still Rob
> _______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net