[Toybox] cut is nuts.

Rob Landley rob at landley.net
Sun Sep 4 11:45:50 PDT 2016



On 09/04/2016 06:44 AM, Felix Janda wrote:
> Rob Landley wrote:
>> I expected:
>>
>>   $ echo abcdefghijklmnopqrstuvwxyz | cut -b 10-20,5-15
>>   jklmnopqrs efghijklmn
>>   $ echo one two three | cut -d " " -f 1,3,1
>>   one three one
>>   $ echo one two three | cut -d " " -f 3,2,1
>>   three two one
>>
>> But what it's doing is:
>>
>>   $ echo abcdefghijklmnopqrstuvwxyz | cut -b 10-20,5-15
>>   efghijklmnopqrst
>>   $ echo one two three | cut -d " " -f 1,3,1
>>   one three
>>   $ echo one two three | cut -d " " -f 3,2,1
>>   one two three
>>
>> Sigh... so turning this into an awk replacement isn't as neat a fit as I
>> thought. Why is it doing any of this? Honestly couldn't tell you.
>> (Conservation of bytes?)
> 
> Mathematically, the list that cut accepts stands for a *set* of
> integers specifying the bytes or fields to be cut out of the input.

Oh, you just sort the list and merge overlapping ranges; it's trivial to
do. Being REQUIRED to do it makes the tool way less useful, though.
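
A minimal sketch of that sort-and-merge step, in Python rather than
toybox's C. It assumes the list has already been parsed into inclusive
1-based (start, end) pairs; merge_ranges is a made-up name for
illustration, not taken from any cut implementation.

```python
def merge_ranges(ranges):
    """Sort inclusive (start, end) pairs and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or is adjacent to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_ranges([(10, 20), (5, 15)]))        # [(5, 20)]
print(merge_ranges([(1, 1), (3, 3), (1, 1)]))   # [(1, 1), (3, 3)]
```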

> The set is formed by taking the *union* over all intervals (ranges) and
> one-element-sets from the list. Bytes or fields are then output from
> the set in order.

Translation: the people who created this thing hadn't shaken off the
physical constraints of taking scissors and glue to printouts, where
there's only one physical instance of each word on the page.
(Computers were new back then, and their ability to endlessly copy data
hadn't quite sunk in.)

> So, e.g.
> 
> 10-20,5-15   --->     {10,...,20} u {5,...,15} = {5,...,20}
> 1,3,1        --->     {1} u {3} u {1} = {1,3}
> 3,2,1        --->     {3} u {2} u {1} = {1,2,3}
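
The set interpretation above can be checked mechanically. This is a
Python sketch under the same assumption of pre-parsed inclusive 1-based
(start, end) pairs; cut_as_set is a hypothetical helper, not how any
real cut is written.

```python
def cut_as_set(items, ranges):
    """Select 1-based inclusive ranges as a set: each position once, in input order."""
    wanted = set()
    for start, end in ranges:
        wanted.update(range(start, end + 1))
    # Walk the input in its own order; the order of the list is ignored.
    return [item for pos, item in enumerate(items, start=1) if pos in wanted]


fields = "one two three".split(" ")
print(" ".join(cut_as_set(fields, [(1, 1), (3, 3), (1, 1)])))   # one three
print(" ".join(cut_as_set(fields, [(3, 3), (2, 2), (1, 1)])))   # one two three
print("".join(cut_as_set("abcdefghijklmnopqrstuvwxyz", [(10, 20), (5, 15)])))
# efghijklmnopqrst
```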

I know what it's doing. I'm horrified that it does this by default, and
that nobody's bothered to implement a "-D don't be stupid" option to
make it STOP.

I'm adding the -D to disable deduplication.
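
Roughly what selection without deduplication looks like, as a Python
sketch rather than the actual toybox code: walk the list in the order
given and emit each element every time it is named. cut_in_order is a
hypothetical helper, and it again assumes inclusive 1-based
(start, end) pairs.

```python
def cut_in_order(items, ranges):
    """Select 1-based inclusive ranges in the order listed, keeping repeats."""
    out = []
    for start, end in ranges:
        # No sorting and no merging: each range is emitted as written.
        out.extend(items[start - 1:end])
    return out


fields = "one two three".split(" ")
print(" ".join(cut_in_order(fields, [(1, 1), (3, 3), (1, 1)])))   # one three one
print(" ".join(cut_in_order(fields, [(3, 3), (2, 2), (1, 1)])))   # three two one
```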

> This is how I interpret the following from POSIX:
> 
>     The elements in list can be repeated, can overlap, and can be
>     specified in any order, but the bytes, characters, or fields
>     selected shall be written in the order of the input data. If an
>     element appears in the selection list more than once, it shall be
>     written exactly once.
> 
> So cut doesn't look like a good replacement for awk when fields might
> be needed multiple times, or need to be reordered.

Using a bigger hammer.

> Felix

Rob


