[Toybox] [PATCH] file, tar: basic zstd awareness.

Wed Dec 11 03:27:33 PST 2024

On 12/10/24 12:37, enh wrote:
> On Sun, Dec 8, 2024 at 12:51 AM Rob Landley <rob at landley.net> wrote:
> 
>> On 12/7/24 18:39, enh wrote:
>>> On Sat, Dec 7, 2024, 18:25 Rob Landley <rob at landley.net> wrote:
>>>
>>>> On 12/6/24 13:57, enh wrote:
>>>>> We're seeing ever more zstd-compressed files in the wild, so even though
>>>>> toybox can't compress/decompress zstd without an external helper, it
>>>>> still seems useful to integrate with any that happens to be on the
>>>>> system.
>>>>
>>>> No short option for zstd, even though every other explicit archive
>>>> format has one?
>>>>
>>>
>>> technically there are a couple of other compression options that are
>>> longopt only,
>>
>> In gnu/gnu.
>>
>>> such as --lzma (but i haven't added those here because i've
>>> yet to see them used).
>>>
>>> this probably made sense when it was added in 2019, and it wasn't clear how
>>> popular zstd was going to become. (especially in comparison to the other
>>> options we don't have.)
>>>
>>> though tbh, zstd seems more popular in non-tar contexts ... i had to ask
>>> the internet what the long and short extensions were!
>>
>> Imma hijack -Z. I'm aware in debian that's "compress" but we've never
>> supported that format, which was patented in the 1980s causing it to be
>> completely replaced by gzip except for some old legacy archives you can
>> "compress -d file.Z | tar x" if you like.
>>
> 
> yeah, sounds reasonable.
> 
> coincidentally i saw https://www.phoronix.com/news/Linux-EFI-Zboot-Gzip-Zstd
> "Linux EFI Zboot Abandoning "Compression Library Museum", Focusing On Gzip
> & Zstd" which made me laugh, given that that had been my reaction to the
> other formats that gnu tar supports (and has single-letter options for!)
> that toybox tar doesn't (and almost certainly shouldn't) like lzip and lzop.
> presumably characters from a children's show in a language i don't speak?

Way back when pkzip 2.0 came out there was arj and pak and zoo and
several others, and I was never entirely sure what the under-the-covers
differences were (especially since the archive and the compression are
two different formats). I also remember that zip itself supported a
bunch of legacy formats, hence the Nancy Button: "Unzip, expand,
explode, what pervert came up with this" in "the little calligraphic
button catalogue on the prairie" circa 1984. (I think that was the first
one I got at that Dr. Who convention; "Don't crush that dwarf, hand me
the calligraphic button catalogue" was later...)

I blogged about there being a similar group of compression formats 
(supported in the linux kernel's zimage and initramfs expanders) and 
having no idea which would "win", and winding up with xz because txz was 
the format kernel tarballs were available in and I found a public domain 
expander program.

I don't know what the difference between xz and zstd is, I've mostly 
avoided technology that comes from faceboot because zuckerberg and thiel 
somehow manage to be worse than gates and ballmer.

>> (I just like there to BE a short option, and another obvious contender
>> isn't presenting itself. Plus I haven't got an obvious way to test this
>> anyway.)
> 
> yeah, i just tested manually. it did occur to me that the test shell script
> could check to see whether there's a zstd(1) binary on the path, and skip
> any zstd tests if not?

At some point I need to categorize the skips. Not sure how yet, there's 
a missing design idea.

But "gnu/command never passed this", "musl never passed this", "busybox 
doesn't pass this", "bionic never passed this", "old glibc passes this 
but new one has version skew"...

I want more granularity out of skipped but dunno what the annotation(s) 
should be. Maybe "skip strings" added to the end of the line as a 
parenthetical? With a VERBOSE=why added to VERBOSE=allfailnopassquietspam.

As I said, missing design work...
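
For what it's worth, the "is there a zstd(1) on the path" check quoted above is the same question any caller of an external helper has to answer. A minimal sketch of it in C follows, assuming a hypothetical helper called on_path() (not a toybox function) that walks a colon-separated search path and tests each candidate with access(). It is deliberately rough: access() with X_OK also succeeds on searchable directories, and a trailing empty path entry is ignored.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper, not toybox code: returns 1 if something executable
// called name exists in one of the colon-separated directories in path,
// 0 otherwise.
int on_path(const char *name, const char *path)
{
  char buf[4096];

  while (path && *path) {
    size_t len = strcspn(path, ":");

    // An empty entry between colons means the current directory.
    if (!len) snprintf(buf, sizeof(buf), "./%s", name);
    else snprintf(buf, sizeof(buf), "%.*s/%s", (int)len, path, name);
    if (!access(buf, X_OK)) return 1;

    // Step over this entry and the colon after it, if any.
    path += len;
    if (*path == ':') path++;
  }

  return 0;
}
```

A caller would pass getenv("PATH") as the second argument and skip the zstd cases when on_path("zstd", ...) returns 0.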

> (and there's really no excuse for me not adding a file(1) test beyond "we
> don't have tests for _most_ of the recognized formats", though "this is
> just a constant prefix match" is a slightly better excuse.)

I'm always up for adding more tests, but I haven't been trying to do so 
piecemeal because it doesn't save work for an eventual "trying to be 
systemic" pass where you go line by line through the source and relevant 
standards and write a test for every decision.
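
To make the "constant prefix match" mentioned above concrete, here is a minimal sketch of what such a check could look like. The magic number comes from the zstd format document linked below (0xFD2FB528, stored little-endian); the function name looks_like_zstd() is an assumption for illustration, not what file(1) in toybox actually calls it.

```c
#include <stddef.h>
#include <string.h>

// Zstandard frames start with the magic number 0xFD2FB528, stored
// little-endian, so the first four bytes on disk are 28 b5 2f fd.
static const unsigned char zstd_magic[4] = {0x28, 0xb5, 0x2f, 0xfd};

// Hypothetical helper: returns 1 if buf starts with the zstd frame
// magic, 0 otherwise (including when fewer than 4 bytes are available).
int looks_like_zstd(const unsigned char *buf, size_t len)
{
  return len >= sizeof(zstd_magic)
    && !memcmp(buf, zstd_magic, sizeof(zstd_magic));
}
```

A test for it only needs four bytes of input, which is part of why a prefix match is cheap to cover.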

>> https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md

I note that I have yet to see zstd tarballs in the wild. Not one of the 
kernel formats, not one of the linux from scratch formats... 
Implementing "zip" is higher on my priority list, which means finishing
the deflate compression side, which means answering the dictionary reset
question. (Although if I don't care about producing binary equivalent
tarballs, "every X bytes" is fine. Maybe every 250k? The problem with
calculating a non-default huffman tree is you need to read the data 
before compressing it to count the symbol frequency, so what's the input 
buffer size...)
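
The buffering problem above can be sketched as a counting pass over one fixed-size block at a time. This is only an illustration under stated assumptions: BLOCK_SIZE is the "every 250k" guess from the paragraph, count_block() is a made-up name, and it counts raw bytes, whereas a real deflate encoder would count literal/length and distance symbols after the LZ77 matching step.

```c
#include <stddef.h>
#include <string.h>

// Assumed block size: the "every 250k" guess, not a tuned value.
#define BLOCK_SIZE 250000

// Count how often each byte value occurs in the next block of input.
// Returns the number of bytes consumed (at most BLOCK_SIZE), so the
// caller can build a per-block Huffman table and then advance.
size_t count_block(const unsigned char *in, size_t len, unsigned freq[256])
{
  size_t i, n = len < BLOCK_SIZE ? len : BLOCK_SIZE;

  memset(freq, 0, 256 * sizeof(unsigned));
  for (i = 0; i < n; i++) freq[in[i]]++;

  return n;
}
```

The point is that the whole block has to be in memory before the first output bit can be written, so the block size is also the minimum input buffer size.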

Rob