[Toybox] More than you really wanted to know about patch.
scsijon
scsijon at lamiaworks.com.au
Mon Jan 14 15:11:52 PST 2019
No enh, it's perfect for what I want. Something for 'us dummies' to
understand what's happening.
I hadn't thought about Exporting/Importing!
I did forget (oops) to ask Rob for permission to spread it otherwise
though. I want to add it to both Puppy's and T2's mailing lists, if he
doesn't mind.
regards
scsijon
On 15/01/19 04:24, enh wrote:
> (i actually thought the question was more about the workflow, in which
> case an answer would look more like the "Exporting a patch"/"Importing
> patches" sections of https://git.wiki.kernel.org/index.php/QuickStart
> ...)
>
> On Sun, Jan 13, 2019 at 9:30 PM Rob Landley <rob at landley.net> wrote:
>>
>> On 1/13/19 3:57 PM, scsijon wrote:
>>> Any chance of a two or three page "Introduction to Creating and Understanding
>>> Patches for Dummies" for those of us who either don't know how to build one, or
>>> like me, have, "but don't really know what i'm doing".
>>>
>>> When you can make time of course, i'd really like to understand more of what the
>>> group is doing with patches submitted rather than only a little.
>>>
>>> Please, with pure honey on crumpets.
>> Patches are reasonably straightforward, if somewhat reverse engineered historically.
>>
>> Back in the 1980's somebody invented diff -u ("unified diff format") as a more
>> human readable alternative to the <old >new lines format you get without the -u,
>> and then Larry Wall whipped up a program to reverse the process and use saved
>> diff -u output to modify a file (which was mind-blowing at the time). As far as
>> I can tell the format wasn't really meant for that, and was made to work with
>> heuristics and hitting it with a rock, but Larry _did_ go on to invent Perl...
>>
>> A patch is a series of "hunks", describing a range of lines in the "old" version
>> and the corresponding range in the "new" version. Patches have 6 different types
>> of lines, each starting with one of "+++ ", "--- ", "@@ ", " ", "+", or "-".
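>>
A minimal sketch of the six line types listed above, assuming the hypothetical function name classify() and made-up labels for each kind. The longer control prefixes are tested first because "--- " also begins with "-".

```python
def classify(line: str) -> str:
    """Return which of the six kinds of patch line this is (labels are illustrative)."""
    kinds = (
        ("+++ ", "new-file"),     # control line naming the new file
        ("--- ", "old-file"),     # control line naming the old file
        ("@@ ", "hunk-header"),   # start of a hunk
        (" ", "context"),         # unchanged line
        ("+", "added"),           # only in the new version
        ("-", "removed"),         # only in the old version
    )
    for prefix, kind in kinds:
        if line.startswith(prefix):
            return kind
    return "other"                # anything else (commit text, "diff" lines, ...)


for sample in ("--- oldfile", "+++ newfile", "@@ -1,3 +1,4 @@", " same", "+new", "-old"):
    print(classify(sample), repr(sample))
```

This only looks at prefixes. A removed line whose text happens to begin with "-- " would be misread as a header here, so a real implementation would more likely rely on the line counts from the @@ line while inside a hunk.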
>>
>> The first 2 (the --- and +++ lines) are control lines that come at the start and
>> indicate we're working on a new file. They indicate the old file name and the
>> new file name for the changed files. If you "diff -u oldfile newfile" you get a
>> hunk starting with:
>>
>> --- oldfile
>> +++ newfile
>> @@ -oldstart,oldlines +newstart,newlines @@ comment
>> and so on
>>
>> Those first two lines are --- or +++, one space, and the filename.
>>
>> Unfortunately, the original unified diff format then followed each filename with
>> a tab character and the timestamp of the file (in "yyyy-mm-dd hh:mm:ss tzoff"
>> format), which means if you have a tab character in the filename you can't patch
>> them. These days this datestamp is optional, and most patches don't have them
>> anymore. (I have a todo item to make toybox patch work backwards from the end of
>> the line and peel off only a properly formatted tab+date entry and leave it
>> alone otherwise, but right now it just stops at the first tab. Which is not a
>> space or newline, and thus almost never occurs in filenames and nobody's
>> complained yet (because if you tab in the windows gui it switches focus so
>> windows people can't trivially create this breakage and then whine for us to
>> "support" it)... Still, lemme do a quick commit to make that suck _slightly_
>> less by at least requiring the next character to be a digit in order to match
>> the date and strip it off. It still doesn't handle filenames with a newline in
>> them, but... how would you?)
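>>
The "stop at the first tab, but only if a digit follows" rule described above could look roughly like this. It is a sketch with a hypothetical function name, not the toybox code.

```python
def header_filename(line: str) -> str:
    """Extract the filename from a '--- ' or '+++ ' control line."""
    rest = line[4:].rstrip("\n")          # drop the 4-character "--- " / "+++ " prefix
    tab = rest.find("\t")
    # Only treat the first tab as the date separator when a digit follows it,
    # so a filename containing a tab followed by a non-digit survives intact.
    if tab != -1 and rest[tab + 1:tab + 2].isdigit():
        return rest[:tab]
    return rest


print(header_filename("--- oldfile\t2019-01-13 21:30:00 -0600"))
print(header_filename("+++ newfile"))
```

A filename with a tab followed by a digit would still be cut short, which is the case the "work backwards from the end of the line" todo item is about.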
>>
>> If this (now optional) date was the unix epoch (midnight, january 1, 1970, which
>> timezone adjustments often moved to December 31, 1969), it indicated we were
>> comparing against a nonexistent file. The more modern way to say this is to use
>> the special filename /dev/null. So if you want patch to create a file, what you
>> do is "diff -u /dev/null newfile", and if you want it to delete a file, "diff -u
>> oldfile /dev/null". (Otherwise it leaves a zero length file when you remove
>> all the lines, or expects an empty file to already be there when adding with no
>> context lines.)
>>
>> The other fun thing is when you diff 2 files, the files need to have different
>> names. How do you know which one you're applying the patch to? Historically, it
>> tried both names and used whichever one worked... but if you happen to have a
>> file with your tempname lying around in the directory you're applying the patch
>> _to_ (which happens a lot when you habitually use the same tempfile name), the
>> hunk may try to apply to the wrong file. (There were certain horrible heuristics
>> I don't remember that tried to work out what you _meant_ to do, which didn't
>> really help and I don't think I implemented them?)
>>
>> And these days files have paths. As the switch from CVS to SVN (let alone git)
>> taught us: individual standalone files aren't very interesting, you're almost
>> always operating on a _tree_ of files.
>>
>> So generally what you do _now_ (and what tools like svn or mercurial or git
>> pretend to do behind the scenes) is back up one directory, have two full trees
>> (the vanilla project and your modified version), and "diff -ruN" the two
>> subdirectories: -r is recursive, -u is unified format instead of the old < and >
>> version, and -N says pretend to compare new or removed files against /dev/null
>> so the diff says to add or remove them properly. That's why tools like svn or
>> mercurial or git will create diffs that start like:
>>
>> --- a/path/to/file
>> +++ b/path/to/file
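>>
For experimenting without two real trees, Python's standard difflib module can produce the same kind of output. This small sketch uses made-up file contents and prints the --- line for the old side, the +++ line for the new side, then one hunk.

```python
import difflib

# Made-up contents of the "vanilla" and "modified" copies of one file.
old = ["one\n", "two\n", "three\n"]
new = ["one\n", "2\n", "three\n"]

patch_lines = list(difflib.unified_diff(
    old, new, fromfile="a/path/to/file", tofile="b/path/to/file"))

print("".join(patch_lines), end="")
```

No dates are passed in, so the header lines carry no tab+timestamp suffix.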
>>
>> Except... now you've got an extra level of directory you don't want, so you have
>> to back up _out_ of your project's tree to apply the patch and it's STILL
>> guessing which name you mean.
>>
>> So what you do is create the diffs like that, then use the "-p 1" option when
>> applying them, which says "peel off one layer of directory when parsing the
>> filenames". That removes the a/ and b/ from the paths, and the rest should be
>> identical so it's no longer ambiguous and it doesn't matter if you use the +++
>> or the --- line as the file to apply the patch to. (No, -p1 doesn't apply to the
>> magic name /dev/null, absolute paths aren't modified, only relative ones. Also,
>> you can say "-p0" to disable the above "certain horrible heuristics" on pathless
>> filenames and just literally use the filenames in the patch, but that doesn't
>> come up much these days. Creating a diff between two trees and applying it
>> within the top level of the tree via "patch -p1" is nearly universal now. That's
>> the format "git format-patch -1 $HASH" and "git am file.patch" are using, for
>> example.)
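>>
A rough sketch of what "-p 1" does to a filename, following the description above (absolute paths such as /dev/null are left alone). The function name is hypothetical, and the handling of a path with too few components is an assumption of this sketch, not documented behaviour.

```python
def strip_components(name: str, count: int) -> str:
    """Peel `count` leading directory levels off a relative path."""
    if name.startswith("/"):
        return name                      # absolute paths (e.g. /dev/null) are not modified
    parts = name.split("/")
    if count >= len(parts):
        return parts[-1]                 # assumption: keep just the final component
    return "/".join(parts[count:])


# Both names collapse to the same path, so the choice between --- and +++ stops mattering.
print(strip_components("a/path/to/file", 1))
print(strip_components("b/path/to/file", 1))
print(strip_components("/dev/null", 1))
```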
>>
>> Ok, so all that's indicating what file hunks apply to, then you get to actual
>> hunks describing what changes to make within the file. Each hunk starts with an
>> @@ line, with 4 numbers, like so:
>>
>> @@ -start,len +start,len @@ comment
>>
>> Each "start" is the (decimal) line number in that file the hunk starts applying
>> at, and the "len" is the (decimal) number of lines described in that file. These
>> numbers measure the body of the hunk, which comes next.
>>
>> (The "comment" part can be anything, and doesn't even have to be there. It's
>> ignored. Modern language-aware diff -u variants stick which C function you're
>> modifying in there, which is nice for humans but not used by patch that I know
>> of. The simple crappy heuristic there is "last unindented line", which can find
>> goto labels: ...)
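>>
Parsing the @@ line can be sketched with a regular expression. The names here are hypothetical. One detail that is not in the text above: GNU diff omits ",len" when a range is exactly one line long, so this sketch defaults a missing length to 1.

```python
import re

# "@@ -start,len +start,len @@ comment", where each ",len" may be absent.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@(.*)$")


def parse_hunk_header(line: str):
    """Return (old_start, old_len, new_start, new_len, comment)."""
    m = HUNK_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError("not a hunk header: %r" % line)
    old_start, old_len, new_start, new_len, comment = m.groups()
    return (
        int(old_start),
        int(old_len) if old_len is not None else 1,   # missing length means 1
        int(new_start),
        int(new_len) if new_len is not None else 1,
        comment.strip(),                              # free-form, ignored by patch
    )


print(parse_hunk_header("@@ -10,7 +10,8 @@ int main(void)"))
```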
>>
>> Each line of the rest of the body of that hunk starts with one of three characters:
>>
>> 1) + meaning this line is only in the new version (it was added).
>> 2) - meaning this line is only in the old version (it was removed).
>> 3) " " (space) = this line is the same in both (it's context for the changes).
>>
>> The context lines plus + lines need to add up to the "len" in the + part of the
>> @@ line, and the context lines plus - lines need to add up to the len in the -
>> part. (The start is more or less a comment, used to indicate how far off it
>> applies at if the hunk moved but otherwise not really mattering as far as I can
>> tell. Well toybox doesn't use it.)
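>>
The counting rule above is easy to check mechanically. This sketch, with a hypothetical helper name, recomputes both lengths from a hunk body so they can be compared with the numbers on the @@ line.

```python
def hunk_counts(body):
    """Return (old_len, new_len) implied by a list of hunk body lines."""
    old_len = sum(1 for line in body if line[:1] in (" ", "-"))   # context + removed
    new_len = sum(1 for line in body if line[:1] in (" ", "+"))   # context + added
    return old_len, new_len


body = [" context", " context", " context",
        "-blah",
        "+blah", "+more",
        " context", " context", " context"]
print(hunk_counts(body))   # should agree with "@@ -start,7 +start,8 @@"
```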
>>
>> Note: if your code is tab indented, it still needs a space (ascii 32) at the
>> start of it to be a context line, then it's binary identical for the contents
>> (so tabs or spaces as appropriate). This causes some editors to flip out about
>> mixing tabs and spaces, but the distinction is functional here.
>>
>> Patch opens files when it sees --- +++ line pairs, reads in the next @@ hunk and
>> the appropriate number of lines after it (with the right number of context
>> lines, additions, and removals for what the @@ line counts said), and then
>> searches in the file for a place where the appropriate context lines and removed
>> lines appear in the right order (removed lines are matched just like context, if
>> they're not there in the file the hunk doesn't apply), then replaces it with the
>> set of context lines and added lines the hunk says should go there instead.
>> (Note that if you patch -R then it's the + lines being removed and the - lines
>> being added, "reversing" the patch.)
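>>
The search-and-replace step described above can be sketched as follows. It is a simplified illustration with hypothetical names, working on lists of lines without trailing newlines. It ignores the start numbers and does no fuzzy matching.

```python
def apply_hunk(lines, body, start=0, reverse=False):
    """Apply one hunk body to `lines`, searching from index `start`.

    Returns (new_lines, next_start); raises ValueError if the hunk doesn't apply.
    """
    # When reversing, the roles of "+" and "-" lines swap.
    minus, plus = ("+", "-") if reverse else ("-", "+")
    old = [l[1:] for l in body if l[0] in (" ", minus)]   # what must be in the file
    new = [l[1:] for l in body if l[0] in (" ", plus)]    # what replaces it
    # Try each candidate position in turn, moving down ONE line at a time.
    for pos in range(start, len(lines) - len(old) + 1):
        if lines[pos:pos + len(old)] == old:
            return lines[:pos] + new + lines[pos + len(old):], pos + len(new)
    raise ValueError("hunk does not apply")


text = ["a", "b", "c", "d", "e"]
body = [" b", "-c", "+C", " d"]
patched, nxt = apply_hunk(text, body)
print(patched, nxt)
back, _ = apply_hunk(patched, body, reverse=True)
print(back)
```

Returning next_start lets a caller apply later hunks in order, so a line already consumed as trailing context is not matched again.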
>>
>> Each hunk generally starts with 3 leading context lines, and ends with 3 trailing
>> context lines, which generally provides enough context to uniquely identify
>> where to apply the hunk even if you're just adding a single line (that's the
>> pathological case of providing no other corroborating information). The
>> exception is when your hunk applies at the start or end of the file: then
>> there aren't enough context lines, and may not be _any_ if you're right at the
>> end or beginning of the file.
>>
>> The hunk also has interstitial context lines as appropriate (between the
>> additions and removals, which also have to match or the hunk won't apply), but
>> not more than 6 (leading + trailing context line count) or it'd split into 2
>> hunks. (This _does_ mean you can have 4 context lines in a row though.)
>>
>> What IS important is that you have the same number of leading context lines as
>> trailing context lines, unless you're at the start/end of a file. If they don't
>> match, it's not a valid hunk and patch barfs on the corrupted patch. And the number of
>> leading/trailing context lines not being the same means the patch program will
>> try to MATCH the start/end of the file (whichever one's got truncated context),
>> and fail if it can't (hunk does not apply, context is wrong).
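>>
One way to read the leading/trailing rule above, as a sketch with a hypothetical function name: count the context lines at each end of the hunk body, and if one side is shorter, that side has to sit at the corresponding edge of the file.

```python
def context_anchor(body):
    """Return "start", "end", or None depending on which side has truncated context."""
    lead = 0
    for line in body:                 # count context lines before the first change
        if not line.startswith(" "):
            break
        lead += 1
    trail = 0
    for line in reversed(body):       # count context lines after the last change
        if not line.startswith(" "):
            break
        trail += 1
    if lead < trail:
        return "start"                # truncated leading context: must match start of file
    if trail < lead:
        return "end"                  # truncated trailing context: must match end of file
    return None                       # balanced: may match anywhere


print(context_anchor([" a", " b", " c", "-x", "+y", " d", " e", " f"]))
print(context_anchor([" a", "-x", "+y", " d", " e", " f"]))
print(context_anchor([" a", " b", " c", "+y"]))
```

This has to run after the whole hunk body has been read using the counts from the @@ line, since a run of context lines in the middle does not end the hunk.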
>>
>> You can have as many hunks as you want within a file, i.e. as many @@ lines after
>> a given --- +++ pair, but the hunks must apply in order, and this INCLUDES the
>> context lines. A line that's been "seen" as a trailing context line won't match
>> against the leading context of the next hunk.
>>
>> Because of this, you sometimes need 3 or more interstitial context lines in a
>> row in the _middle_ of a hunk (between + and - lines), if that's how your
>> changes work out. A number of consecutive context lines matching the leading
>> context does NOT end the hunk, only consuming the line counts from the @@ line
>> does that. And then you figure out if leading/trailing context counts match
>> (indicating the need to match start/end of file) _after_ that. (If you really
>> want to back up and modify an earlier part of the file, you need a new --- +++
>> pair to flush and reopen the file, so it can start over searching at the beginning.)
>>
>> Oh, I know I said the start numbers in the @@ line were only used for warnings,
>> but you CAN use them to sanity check the leading context number if you want to.
>> (Since if you're forcing a match with the beginning of the file, it had better
>> start at line 1 in that file, or 0 for an empty file, or something is wrong.)
>> Doesn't help with end of file
>> though.
>>
>> So you wind up with:
>>
>> --- filename
>> +++ filename
>> @@ -start,len +start,len @@
>> context
>> context
>> context
>> -blah
>> +blah
>> context
>> context
>> context
>> @@ -start,len +start,len @@
>> ...
>>
>> Oh, the - lines usually come before the + lines when they replace the same line,
>> but I don't think that's actually required? The entire context is matched before
>> applying the hunk anyway. And note that you don't skip what you've already
>> looked at when a hunk didn't apply, you go down ONE line and try matching again.
>> Otherwise, if your context lines are all blank, you can skip past the spot where
>> the hunk actually applies. I hit and fixed that bug years ago in toybox. :)
>>
>> And of course all this is before git added a "rename" syntax that looks like:
>>
>> https://lwn.net/Articles/244448/
>>
>> And has copy and delete variants that allow it to be much less verbose (avoids
>> including the body of the matched file(s)).
>>
>> It's on the todo list... :)
>>
>> Rob
>>
>> P.S. You asked.
>>
>>> ps and i'm looking forward to the next mkroot, I miss Aboriginal!
>>
>> Alas, I just landed back in Milwaukee to do another round of $DAYJOB because
>> neither toybox nor mkroot pay the bills. (I'm very grateful to the
>> https://patreon.com/landley subscribers, and it's great encouragement, but my
>> mortgage alone is like 25 times what that brings in. Nobody with a significant
>> budget wants to fund this work, and keeping the lights on gets scheduled higher
>> than things that don't. But I can presumably cut a mkroot release with the 4.20
>> kernel right after I do a toybox release at the end of the month. All 4.20 broke
>> that I've noticed so far was adding sha256 as a hard requirement to the s390x
>> build, and I can add that to the toybox airlock install passthroughs for the
>> moment...)
>>
>> (I had a huge todo list for my month off... and wound up going limp for most of
>> it. I was doing ok until the battery in this old laptop completely died (as in
>> unplug = instant off, so suspend is useless and I lose all open windows every
>> time I move it). And alas I did NOT get Devuan working on the new System76 laptop
>> I ordered a few months back (binary wifi firmware tantrum in the installer), and
>> what they preinstalled on it has systemd, and given a choice between "system
>> with no battery" and "system with systemd" it's no contest. But I did get the
>> new She-Ra and Hilda watched, and the first season of The Good Place, so that's
>> something...)
>>
>> Still Rob
>> _______________________________________________
>> Toybox mailing list
>> Toybox at lists.landley.net
>> http://lists.landley.net/listinfo.cgi/toybox-landley.net
>