[Toybox] [NEW TOY] iconv

Rob Landley rob at landley.net
Sun Apr 13 20:24:54 PDT 2014


On 04/13/14 17:18, Felix Janda wrote:
> Rob Landley wrote:
>> On 04/13/14 04:37, Felix Janda wrote:
>>> Isaac Dunham wrote:
>>> [..]
>>>> locale and iconv were already triaged. 
>>> [..]
...
>> Really the only interesting errno case from iconv is illegal sequence.
>> The rest just say "ran out of input" or "ran out of output" which is
>> what you expect from a conversion that's not at the end of the file yet.
>> (Ok, truncated sequence is a synonym for illegal sequence if we're not
>> at the end of the buffer, which we can special case as at the _start_ of
>> the buffer with the memmove logic.)
> 
> You mean "if we're at the end of the buffer"?

No, if we are at the end of the buffer, truncated sequence isn't an
error. It means the buffer ran out before the sequence did. But if we're
_not_ at the end of the buffer, it means the sequence itself is bad.

However, if we just zap the parts we handled, do the memmove to the
front, refill the buffer, and then have the error _again_ that means the
truncated sequence is invalid, not a problem with running out of data.

(And that means we don't have to care how long the truncated sequence
is, so we don't care how far from the end of the buffer still counts as
retrying instead of skipping.)
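
A minimal sketch of that retry logic, not the code that was checked in.
The names convert_chunked and CHUNK are hypothetical, the input comes
from memory instead of a file so the refill is a memcpy(), and a full
output buffer is treated as a hard failure where a real tool would flush
and continue. Bad bytes are passed through unconditionally here.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 8 /* tiny on purpose so multibyte sequences straddle refills */

/* Hypothetical helper: convert src[0..srclen) through cd into dst.
 * Returns the number of bytes written, or -1 if dst fills up or iconv()
 * reports something unexpected. */
long convert_chunked(iconv_t cd, const char *src, size_t srclen,
                     char *dst, size_t dstcap)
{
  char buf[CHUNK], *out = dst;
  size_t have = 0, pos = 0, outleft = dstcap;

  for (;;) {
    /* Refill: append new input after whatever was left over last time. */
    size_t n = srclen - pos;

    if (n > CHUNK - have) n = CHUNK - have;
    memcpy(buf + have, src + pos, n);
    pos += n;
    have += n;
    if (!have) break;

    char *in = buf;
    size_t inleft = have;

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
      if (errno == EILSEQ || (errno == EINVAL && in == buf)) {
        /* Illegal sequence, or a "truncated" one that is still truncated
         * at the start of a freshly refilled buffer: pass one byte through. */
        if (!outleft) return -1;
        *out++ = *in++;
        outleft--;
        inleft--;
      } else if (errno != EINVAL) return -1; /* E2BIG or unexpected */
      /* Plain EINVAL: the tail is incomplete, keep it for the next round. */
    }

    /* Zap what was handled and slide the remainder to the front. */
    memmove(buf, in, inleft);
    have = inleft;
  }

  /* Flush any shift state for stateful encodings. */
  if (iconv(cd, NULL, NULL, &out, &outleft) == (size_t)-1) return -1;

  return (long)(dstcap - outleft);
}
```

Because the refill always tops the buffer up, EINVAL with in == buf can
only mean the buffer is full or the input has ended, so the length of the
truncated sequence never has to be measured.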

>> Hmmm... we should probably pass illegal sequence bytes through. (Pass
>> 'em through.) Except check if output buffer is full before doing that.
>> (Don't have to check inleft nonzero because if iconv() returns illegal
>> sequence but used up all the input buffer, that's a libc bug.)
> 
> Right...

I think the -c flag controls whether or not to pass them through,
although posix is going "and we refuse to specify the behavior here at
all because Microsoft paid us money not to".

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html
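
A small sketch of the pass-through versus omit choice, assuming -c means
"omit the invalid bytes" and that without it they are copied to the
output unchanged. The helper name step_over_bad_byte and its omit
parameter are made up for illustration. It checks for a full output
buffer before copying, as suggested in the quoted text.

```c
#include <stddef.h>

/* Hypothetical helper: step past one offending input byte after iconv()
 * reported an illegal sequence. With omit set (the assumed -c behavior)
 * the byte is dropped, otherwise it is copied to the output unchanged.
 * Returns 0 on success, -1 if the output buffer has no room. */
int step_over_bad_byte(char **in, size_t *inleft,
                       char **out, size_t *outleft, int omit)
{
  if (!*inleft) return 0; /* nothing to step over */

  if (!omit) {
    if (!*outleft) return -1; /* caller must flush output and retry */
    *(*out)++ = **in;
    --*outleft;
  }
  ++*in;
  --*inleft;

  return 0;
}
```

The pointers use the same char ** and size_t * shape that iconv() itself
takes, so the helper can be called right after a failed conversion with
the same four variables.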

>> memmove() with length 0 isn't an error, is it? Ok.
>>
>> Where would I get a test file to convert? I just ran a text file through
>> it and confirmed it's not making any changes to it, but that doesn't
>> mean much. :)
> 
> More interesting would be roundtrip encoding some files.

Except that means "cat" would pass. Not really a test that instills a
lot of confidence in me...

> For testing, I just used an uim tarball[1] with some eucjp encoded files.
> The cleaned up version still seems to work properly.

We can echo -e some snippets. Basically if we convert between utf-8 and
whatever it is Windows uses (latin pi) for, like, Japanese or Korean or
something, we'll have shown it Did A Thing. We're not trying to test the
libc implementation of iconv, just show that we're feeding data into it.
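
A known-answer check along those lines could look roughly like this.
convert_once is a hypothetical helper, and UTF-16LE is used as the target
only because most libcs ship it by default. A Japanese or Korean codepage
would match the text better, but whether one is available depends on the
libc.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert one in-memory snippet in a single iconv()
 * call. Returns the number of output bytes, or -1 on any failure. */
long convert_once(const char *to, const char *from,
                  const char *src, size_t srclen, char *dst, size_t dstcap)
{
  iconv_t cd = iconv_open(to, from);
  char *in = (char *)src, *out = dst; /* iconv() wants a non-const char ** */
  size_t inleft = srclen, outleft = dstcap, rc;

  if (cd == (iconv_t)-1) return -1;
  rc = iconv(cd, &in, &inleft, &out, &outleft);
  iconv_close(cd);

  return rc == (size_t)-1 ? -1 : (long)(dstcap - outleft);
}
```

Feeding it the three UTF-8 bytes 'h' 0xc3 0xa9 and comparing the result
against the four bytes 'h' 0x00 0xe9 0x00 is enough to show data went
through the converter and changed, which is something "cat" cannot pass.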

> Even more interesting would be a file with some illegal sequences. I didn't
> test that at all.

The failure paths are always the most interesting thing to test. And the
most often overlooked...

We'd also want to test retry across 2k boundaries on both input and
output if we were being serious. _And_ test a file that exactly filled
up the input and another that exactly filled the output buffer when the
file ended.

But again, since I dunno what success looks like, I'll wait for somebody
who does to complain. :)

>> (Sorry, rewrote it a bit more than I expected to. Checking in now...)
> 
> Error handling looks more sensible. Have you considered that iconv_open()
> might also fail because of insufficient memory?

I looked at doing perror_exit(0) but EINVAL is "Invalid argument" which
isn't necessarily enough to figure out what went wrong. As for other
failure causes:

A) we still fail at the same place, we just provide an inaccurate error
message as to why.

B) ENOMEM is really only on a nommu system. With mmu, the malloc happens
fine (allocating virtual address space) and then the failure occurs at
page fault time when the memory is dirtied and the copy on write mapping
of the zero page gets written to and thus copied. (Ok, there's no longer
a literal "zero page" because its reference counter tended to wrap on
big systems and because said counter turned into a giant SMP contention
point. But conceptually the same thing going on, they just special cased
it in the page fault handler.) The failure mode of that is the OOM
killer triggers and whacks a randomish process.

(Yes, crazy people can enable strict overcommit accounting to make a
system with mmu act like nommu, and then they get to keep the pieces.
Doing so does not improve any aspect of a modern system. Maybe if it was
designed from the ground up to expect that, but we've got a couple
decades of "not that" to overcome the same way windows has decades of
being a single user system baked into its design assumptions. Containers
at least localize the damage.)
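
For illustration, one way to keep the error message accurate in both
cases is to special-case EINVAL and leave everything else to strerror().
This is only a sketch: open_converter and its message text are
hypothetical, and it assumes the libc reports an unsupported conversion
as EINVAL.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a converter and say why it failed.
 * Returns 0 on success, otherwise the errno from iconv_open(). */
int open_converter(const char *to, const char *from, iconv_t *cd)
{
  int err;

  *cd = iconv_open(to, from);
  if (*cd != (iconv_t)-1) return 0;

  err = errno; /* save it, fprintf() may clobber errno */
  if (err == EINVAL)
    fprintf(stderr, "iconv: can't convert from '%s' to '%s'\n", from, to);
  else
    fprintf(stderr, "iconv: %s\n", strerror(err)); /* ENOMEM and friends */

  return err;
}
```

Either way the failure happens at the same place. The only difference is
whether the message names the charset pair or repeats a bare "Invalid
argument".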

>> P.S. Posix iconv has several more command line options. -c is easy and
>> -s is NOP for us, but I dunno how to do -l.
> 
> glibc's doesn't have them. So I guessed that they are not much used.
> Now I see that libiconv has them.

When glibc and posix disagree, posix can potentially win. I'll probably
do the extra 2 posix options on general principles, and fluff out the
help text before promoting it.

But now, it's bedtime. Trying to get on an even earlier morning schedule...

Rob


