[Toybox] git.c progress

Tue Jan 3 07:04:35 PST 2023

On 1/2/23 16:20, Moritz Christian Weber wrote:
> Hi,
> It seem I CCed the mailing list accidentally ;-).

Um, sorry. That was me. (The code is now checked in, so I wanted to ask about it
here. I edited a subsection of relevant comments from the emails about the code
as checkin comments. Shoulda asked first, my bad.)

> The git toy is most likely not
> the easiest toy to start learning C coming from memory managed languages, but
> the 90MB of dependencies of git give an incredible motivation to do so :-).

You did a bunch of good work, I'm still trying to understand it. (With things
like the three sha1 fields in the struct, I can see what the code is doing, but
not _why_.)

I'm happy to do more cleanups on it myself, but... you've built a lot of domain
expertise in git that I don't have yet. All of my changes so far were cosmetic...

> Thank for reviewing read_index(). I will include xmalloc() and check what I am
> actually pointing to. Read_index is currently a dead and used function. I
> scaffolded it to get an idea of deserialization in C. It will be more useful
> once git clone serializes the index correct.

>From a code style perspective I tend to use the unix style read(fd, buf, len)
functions instead of the ANSI C style fread(buf, size, also_size, FILE)
functions, unless it's using getline() and friends. Partly personal preference,
partly eliminating unnecessary layers of infrastructure (fread has to be calling
read under the covers), partly because of the old saying "the two hard problems
in computer science are naming things, cache invalidation, and off by one
errors" and just not want wanting to mentally track the buffer inside each FILE
object. (Elliott's and my long discussion about TOYFLAG_LINEBUF and setvbuf()
only applies to FILE objects.)

ANSI FILE * was invented to read lines of text from input, which is a hard
problem because you don't know where the in-band signaling delimiter (the
newline character) will occur. Reading a byte at a time from the operating
system is _very_slow_, but reading a larger block of data almost guarantees
you'll overshoot and be left over with extra input you need to re-use next time.
The FILE structure is a place to put that buffer. When you know how many bytes
of data you need to fetch, the unix read() layer is easier.

Another code style thing: I tend to make a small number of large allocations and
subdivide them, rather than having multiple small allocations stitched together,
because the one-big-one is easier to free() later. If my setup is:

  blah = xmalloc(sizeof(struct blah)+sizeof(thingy)+37);
  blah->thingy = (void *)(blah+1);
  blah->field37 = (void *(blah->thingy+1);

Then my cleanup is just:

  free(blah);

Because object lifetime rules are one of the big scalability headaches as any
project grows, and I find it best to sort them out early. That's also why toybox
has a page sized (4k) "toybuf" global available as scratch space for each
command (and an equivalent "libbuf" for use in lib/* and such without stomping a
command's toybuf).

(Remember, pointers work like arrays so increments move in sizeof(*obj), so
"move past this object" is always +1 when the type is right. And then a "void *"
typecast tells pointer logic to autoconvert because void * autoconverts to any
other pointer type, so I'm lazy and just let the compiler do it for me. The
parentheses around the addition say to increment by the original type, and THEN
smash the conversion to whatever type we're assigning. I'm glossing over
alignment here, but if the compiler adds padding to the struct then
sizeof(struct) would include the padding. Remember, it's gotta make arrays of
this type work already. :)

(There's also a bunch of tiny things about cache locality and heap allocation
granularity I'm not going into right now, but they lean towards the
one-big-allocation instead of chasing pointers in a tree. The reason for a
pointer chasing approach usualy boils down to dynamically adding/removing
members being way easier there.)

In your structure setup you're doing:

  i->crc = malloc(sizeof(unsigned));
  i->offset = malloc(sizeof(unsigned));
  i->offset64 = malloc(sizeof(long long));

And I'm going... are those going to be arrays that get resized via xrealloc()
later? Because otherwise, why not just have the field instead of a pointer to
the field? You're declaring:

  unsigned *crc, *offset;
  long long *offset64; //not supported yet

On 64 bit systems those first two are 8 byte fields (64 bit pointer is 8 bytes),
so if they're only storing one value you can just have them be "unsigned crc,
offset;" and store the value directly in them, and then if you need a pointer to
them just do &i->crc and you've saved a step?

> Regarding the different kind of SHA fields: 1) sha1 field is a dynamic array
> storing the values of the "git index hash table". I did not invent this.

I didn't think you did, I'm just trying to figure out what it means. :)

I'm _guessing_ the [20] is because an sha1sum is 20 bytes long...

> This
> struct shall represent the actually documented git idx format. The git "hash
> table" reads the first char of the sha1 hash (FanOutTable(FOT)) and then binary
> searches the bucket for the requested hash.

Ok.

> 2) packsha1 field is the sha1 hash
> of the pack file which the idx struct/file is created for

Ah, so it's one struct instance per pack file. (I was trying to figure out the
granularity, you didn't seem to have a tree of these structures...)

> 3) idxsha1 is the sha1
> hash of the overall struct including the packsha1. The last two sha1 field can
> only be calculated correct, when git clone works correct. So I dont care much
> about this hashes right now. git uses them most likely to detect bit flips in
> downloaded backs and calculated indexes.

Yup. The PC world stopped paying for ECC memory years ago, and phones don't seem
to have ever used it, which is sad because the probability of a bit flip goes up
linearly with the number of bits of memory, mulitiplied by the amount of time
the bits have to retain that memory (I.E. how long the thing stays on).

We've gone from "4 megabytes with error correcting codes, powered off every
night" to "multiple gigabytes without, last rebooted a month ago". Software
suspend keeps the memory powered up while the device is "off". I mean yay, DRAM
refresh covers a multitude of sins but the amount of current in the capacitor
goes down each die shrink so your margin for that saving you gets smaller. Meh,
I'm still not convinced we ever fixed rowhammer so what do I know...

Sorry, tangent. Bit flip checking: check.

> The git pack format V2 is documented here:
> https://github.com/git/git/blob/master/Documentation/gitformat-pack.txt#L266

Ooh, cool!

Imma go read that. Back once I have enough of a frame of reference to hold an
intelligent conversation with you on this. :)

> (like was broken due to renamings in the git/git repo. Fixed the links over the
> weekend.)
> 
> Best regards,
> Moritz
> 
> P.S.: It also corrected git fetch over the weekend, so that I read the actually
> master head from the remote repo and persists it into HEAD, so that I can get
> rid of the static version tag hashes in the code.

If you send me a new version on top of your previous one without any of my
changes, I can translate it and apply the update on this end.

Thanks,

Rob