[Toybox] cut is nuts.

isabella parakiss izaberina at gmail.com
Tue Sep 6 11:51:10 PDT 2016


On 9/4/16, Rob Landley <rob at landley.net> wrote:
>
>
> On 09/04/2016 06:44 AM, Felix Janda wrote:
>> Rob Landley wrote:
>>> I expected:
>>>
>>>   $ echo abcdefghijklmnopqrstuvwxyz | cut -b 10-20,5-15
>>>   jklmnopqrs efghijklmn
>>>   $ echo one two three | cut -d " " -f 1,3,1
>>>   one three one
>>>   $ echo one two three | cut -d " " -f 3,2,1
>>>   three two one
>>>
>>> But what it's doing is:
>>>
>>>   $ echo abcdefghijklmnopqrstuvwxyz | cut -b 10-20,5-15
>>>   efghijklmnopqrst
>>>   $ echo one two three | cut -d " " -f 1,3,1
>>>   one three
>>>   $ echo one two three | cut -d " " -f 3,2,1
>>>   one two three
>>>
>>> Sigh... so turning this into an awk replacement isn't as neat a fit as I
>>> thought. Why is it doing any of this? Honestly couldn't tell you.
>>> (Conservation of bytes?)
>>
>> Mathematically, the list that cut accepts stands for a *set* of
>> integers specifying the bytes or fields to be cut out of the input.
>
> Oh you just sort the thing and merge overlapping ranges, it's trivial to
> do. It makes the tool way less useful to NEED to do it, though.
>
>> The set is formed by taking the *union* over all intervals (ranges) and
>> one-element-sets from the list. Bytes or fields are then output from
>> the set in order.
>
> Translation: the people who created this thing hadn't shaken the mindset
> of physical constraints if you take scissors and glue to printouts,
> where there's only one physical instance of each word on the printout.
> (Computers were new back then, their ability to endlessly copy data
> hadn't quite sunk in.)
>
>> So, e.g.
>>
>> 10-20,5-15   --->     {10,...,20} u {5,...,15} = {5,...,20}
>> 1,3,1        --->     {1} u {3} u {1} = {1,3}
>> 3,2,1        --->     {3} u {2} u {1} = {1,2,3}
>
> I know what it's doing, I'm horrified it's doing it as its default
> action, and that nobody's bothered to implement a "-D don't be stupid"
> to make it STOP doing it.
>
> I'm adding the -D to disable deduplication.
>
>> This is how I interpret the following from POSIX:
>>
>>     The elements in list can be repeated, can overlap, and can be
>>     specified in any order, but the bytes, characters, or fields
>>     selected shall be written in the order of the input data. If an
>>     element appears in the selection list more than once, it shall be
>>     written exactly once.
>>
>> So cut doesn't look like a good replacement for awk when fields might
>> be needed multiple times, or need to be reordered.
>
> Using a bigger hammer.
>
>> Felix
>
> Rob

From the rationale section of the POSIX cut specification:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cut.html

The list option-arguments are historically used to select the portions of the
line to be written, but do not affect the order of the data. For example:

echo abcdefghi | cut -c6,2,4-7,1

yields "abdefg".
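
To make the selection rule concrete, here is a minimal Python sketch of the behaviour described above. It is not the toybox or GNU implementation; the helper names parse_list and posix_cut_chars are made up for illustration. The list is expanded into a set of 1-based positions, and the line is then walked in input order, so duplicates and ordering in the list have no effect.

```python
def parse_list(spec, length):
    """Expand a cut-style list such as "6,2,4-7,1" into 1-based positions.

    Hypothetical helper: handles N, N-M, N- and -M, clamped to `length`.
    Positions are returned in the order the list names them.
    """
    positions = []
    for item in spec.split(","):
        if "-" in item:
            lo, hi = item.split("-", 1)
            start = int(lo) if lo else 1
            end = int(hi) if hi else length
        else:
            start = end = int(item)
        positions.extend(range(start, min(end, length) + 1))
    return positions


def posix_cut_chars(line, spec):
    """Select characters the way the rationale describes: as a set, in input order."""
    selected = set(parse_list(spec, len(line)))  # the union drops repeats and ordering
    return "".join(ch for i, ch in enumerate(line, start=1) if i in selected)


print(posix_cut_chars("abcdefghi", "6,2,4-7,1"))
print(posix_cut_chars("abcdefghijklmnopqrstuvwxyz", "10-20,5-15"))
```

The first call reproduces the "abdefg" example from the rationale; the second reproduces the merged 5-20 range that Rob observed.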

A proposal to enhance cut with the following option:

-o   Preserve the selected field order. When this option is specified, each
byte, character, or field (or ranges of such) shall be written in the order
specified by the list option-argument, even if this requires multiple outputs
of the same bytes, characters, or fields.

was rejected because this type of enhancement is outside the scope of the IEEE
P1003.2b draft standard.
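
For comparison, here is a rough Python sketch of what an order-preserving mode like the rejected -o (or the -D mentioned in the quoted thread) might do for fields. This is an assumption about the intended behaviour, not code from any real cut; the names expand and ordered_cut_fields are hypothetical. Fields are written in list order, and a field named twice is written twice.

```python
def expand(spec, count):
    """Expand a cut-style list into 1-based indexes, keeping list order and repeats.

    Hypothetical helper: handles N, N-M, N- and -M, clamped to `count`.
    """
    indexes = []
    for item in spec.split(","):
        if "-" in item:
            lo, hi = item.split("-", 1)
            start = int(lo) if lo else 1
            end = int(hi) if hi else count
        else:
            start = end = int(item)
        indexes.extend(range(start, min(end, count) + 1))
    return indexes


def ordered_cut_fields(line, spec, delim=" "):
    """Write fields in the order the list gives them, duplicates included."""
    fields = line.split(delim)
    return delim.join(fields[i - 1] for i in expand(spec, len(fields)))


print(ordered_cut_fields("one two three", "3,2,1"))
print(ordered_cut_fields("one two three", "1,3,1"))
```

With this interpretation the two field examples from the quoted thread come out as "three two one" and "one three one", which is what Rob expected.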


---
xoxo iza


