[Toybox] grep and empty regexes

Tue Jul 30 10:16:09 PDT 2019

On Tue, Jul 30, 2019 at 4:45 AM Rob Landley <rob at landley.net> wrote:
>
> On 7/29/19 1:59 PM, enh wrote:
> >> Do you have a catalog of what the differences _are_? (Empty regex and leading +
> >> so far?)
> >
> > no, i've got no idea. seems like there are bits and bobs in various
> > GNU docs (about the tools themselves and libc), but i didn't find
> > anything detailed enough to include either of these.
> > https://www.regular-expressions.info/gnu.html is horrifically garish,
> > but does seem to be a good condensation of the GNU info i found from
> > other sources.
>
> Sounds like "wait for people to complain, then whack-a-mole it" is a reasonable
> option for the moment.
>
> (You also have the ability to tweak bionic's regex plumbing, although the
> workaround for _this_ issue is a 2 line fix, so...)

yeah, though for something like the regex code i worry about the old
n+1 joke about standards.

(it would be easier if there was _one_ BSD to talk to, but we use the
NetBSD regex code while Apple uses the FreeBSD regex code...)

> > would you like me to move the workaround into xregcomp instead? or do
> > you want to wait until we see someone need this in sed or wherever
> > first?
> What would a sed test case look like?

(i didn't realize that an empty regex means "repeat the previous one".)

> If it doesn't break anything, moving it into xregcomp seems like the right thing
> to do.

all the tests (not just the grep tests) still pass, so i'll send you the patch.
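
for anyone following along, here's a rough sketch of the shape of that workaround. it is not the actual patch: `regcomp_allow_empty` and `line_matches` are made-up names for illustration, and it assumes the problem is simply that BSD-derived regcomp() implementations reject "" where glibc accepts it, so the empty pattern gets swapped for an ERE that matches the empty string.

```c
#include <regex.h>
#include <stddef.h>

// Hypothetical wrapper, not toybox's actual xregcomp(): compile `pattern`,
// substituting an equivalent ERE when the pattern is empty (assumption: the
// underlying regcomp() may refuse to compile "").
int regcomp_allow_empty(regex_t *preg, const char *pattern, int cflags)
{
  if (!*pattern) {
    // As an ERE, "()" matches the empty string, and so matches every line.
    pattern = "()";
    cflags |= REG_EXTENDED;
  }

  return regcomp(preg, pattern, cflags);
}

// Returns 1 if `line` matches `pattern`, 0 if not, -1 if compilation failed.
int line_matches(const char *pattern, const char *line)
{
  regex_t re;
  int rc;

  if (regcomp_allow_empty(&re, pattern, REG_NOSUB)) return -1;
  rc = regexec(&re, line, 0, NULL, 0);
  regfree(&re);

  return rc == 0;
}
```

doing this in one shared wrapper rather than in grep means every caller gets the same behavior, which is the argument for moving it into xregcomp.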

> >> I'm glad you're giving the test suite some attention, but you're polishing
> >> what's _there_ and my concerns are more about what _isn't_. Longer term I need
> >> to A) write a gazillion more tests (based on a close re-reading of the spec, the
> >> relevant man page or RFC, and the source code), B) get mkroot to the point I can
> >> run more sheep across the minefield and catch issues with real world data.
> >
> > oh, yeah, *coverage* is a huge blind spot for us at the moment, and
> > something i want to look at. but first i wanted to get as many of the
> > tests running in presubmit as possible, to maximize my
> > pain^W^W^Wminimize the number of bugs that make it through to folks
> > who're just trying to build AOSP/just trying to use a device. not
> > every OEM is as keen on the idea of a reproducible hermetic build as
> > you might expect, so giving them fewer stones to throw seems like a
> > good idea :-)
>
> Indeed.
>
> > i'm down to just blkid and du failures on a local taimen device now...
>
> Is that building for the taimen device, or building on the taimen device?

for :-)

> > it's time to admit to myself i'm not likely to implement the ntfs
> > LABEL support any time soon, and at least send you a patch that fixes
> > all the other issues. (sent separately.)
>
> Looking at the ntfs image in tests/files/blkid:
>
> $ /sbin/blkid ntfs.img
> ntfs.img: LABEL="myntfs" UUID="6EE1BF3808608585" TYPE="ntfs"
> $ hd -s 0x4d80 -n 16 ntfs.img
> 00004d80  6d 00 79 00 6e 00 74 00  66 00 73 00 00 00 00 00  |m.y.n.t.f.s.....|
>
> But it's repeated at 3ffd80 and there's a "Volume" before it that smells a
> little like it's the second member of a linked list of structures? I only have
> the _one_ NTFS file. I
>
> (As a teenager, I reverse engineered a _lot_ of game save formats on the C64 and
> DOS; not so much on the amiga because the system I had came with zero
> development tools and was a read-only game machine except for the word
> processor. The first nontrivial program I ever wrote was a commodore 64 disk
> sector hex editor, and I lost the source to the first version when I used it on
> its own disk and it had an off by one error that corrupted the root directory. I
> was... 11?)
>
> Do you have a lot of NTFS disk labels you need identified? (I.E. is this a use
> case you'd actually use, or just a completeness thing?)

i have exactly one NTFS disk image --- this one in the test suite. if
we had the opposite of `toyonly` i'd be tempted to just `toyonly` the
output without the label and `nontoyonly` the output with the label.
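
(the bytes in that hexdump are just the label as UTF-16LE, two bytes per character, low byte first. a minimal sketch of turning them back into "myntfs", assuming we only care about ASCII labels; `utf16le_to_ascii` is a hypothetical helper, not something in toybox.)

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: copy a UTF-16LE string into `out` as ASCII, replacing
// any code unit outside 7-bit ASCII with '?'. Stops at a NUL code unit, at
// the end of the input, or when `out` is full. Always NUL-terminates `out`
// (when outlen > 0) and returns the number of characters written.
size_t utf16le_to_ascii(const uint8_t *in, size_t inlen, char *out,
                        size_t outlen)
{
  size_t i, n = 0;

  if (!outlen) return 0;
  for (i = 0; i + 1 < inlen && n + 1 < outlen; i += 2) {
    // Low byte first, then high byte.
    unsigned c = in[i] | (in[i + 1] << 8);

    if (!c) break;
    out[n++] = (c < 0x80) ? (char)c : '?';
  }
  out[n] = 0;

  return n;
}
```

decoding is the easy part; finding where the label lives in the image is what makes this awkward.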

> Sigh, lemme check https://en.wikipedia.org/wiki/NTFS to see what I'm supposed to
> do. Read the boot sector, 8 bytes at 0x30 times 1 byte at 0x0D is the LBA sector
> offset (presumably still 512 byte) of the start of the master file table, then
> segment #3 in there is $VOLUME data which includes a $VOLUME_NAME record...
>
> Looks like you have to chase some sort of tree structure to reliably find the
> volume ID. If that's added to blkid it's as a special case function doing that,
> it's not gonna fit in the table even conceptually.

yeah, exactly. a huge pain unless you're actually getting into the
NTFS business.
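
for the record, the first step you describe (boot sector to start of the master file table) is the cheap part. a sketch, assuming the usual boot sector layout: bytes per sector as 2 bytes at 0x0B, sectors per cluster as 1 byte at 0x0D, and the MFT's starting cluster as 8 bytes at 0x30, all little-endian. `peek_le` and `ntfs_mft_offset` are hypothetical names, and this ignores volumes that encode very large cluster sizes differently. walking the MFT records to the $VOLUME_NAME attribute is the part that doesn't fit in a table, and it isn't attempted here.

```c
#include <stdint.h>
#include <string.h>

// Read a little-endian unsigned integer of `len` bytes starting at `p`.
static uint64_t peek_le(const uint8_t *p, int len)
{
  uint64_t v = 0;

  while (len--) v = (v << 8) | p[len];

  return v;
}

// Hypothetical helper: given the first 512 bytes of a volume, return the
// byte offset of the start of the master file table, or 0 if the "NTFS"
// OEM ID at offset 3 is missing.
uint64_t ntfs_mft_offset(const uint8_t *boot)
{
  uint64_t bytes_per_sector, sectors_per_cluster, mft_cluster;

  if (memcmp(boot + 3, "NTFS    ", 8)) return 0;

  bytes_per_sector = peek_le(boot + 0x0b, 2);
  sectors_per_cluster = peek_le(boot + 0x0d, 1);
  mft_cluster = peek_le(boot + 0x30, 8);

  // cluster number * sectors per cluster = sector offset; * sector size
  // turns that into a byte offset.
  return mft_cluster * sectors_per_cluster * bytes_per_sector;
}
```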

> > as for du versus the extra space used for extended attributes, i'm
> > still not sure what to do about that...
>
> My plan was running the tests in a known environment. I'm working on getting
> mkroot to where I can make an ext2 image (or maybe vfat?), loopback mount it,
> and run df against that so I get consistent results without a different
> filesystem changing the results.
>
> It should also run in a container, but half the time what you're debugging is
> what's different about your system that's giving "different but not wrong"
> results. Needs a reference implementation to regression test in so you can see
> _what_ is different...

the likelihood that we have legitimate differences (as is actually the
case with du at least) is why i'm trying to resist the urge to just
ignore failures, especially on the "skip the whole tool" level.

> Rob