[Toybox] [NEW TOY] iconv

Felix Janda felix.janda at posteo.de
Fri Apr 18 09:56:25 PDT 2014


Rob Landley wrote:
> On 04/13/14 17:18, Felix Janda wrote:
> > Rob Landley wrote:
[..]
> >> Really the only interesting errno case from iconv is illegal sequence.
> >> The rest just say "ran out of input" or "ran out of output" which is
> >> what you expect from a conversion that's not at the end of the file yet.
> >> (Ok, truncated sequence is a synonym for illegal sequence if we're not
> >> at the end of the buffer, which we can special case as at the _start_ of
> >> the buffer with the memmove logic.)
> > 
> > You mean "if we're at the end of the buffer"?
> 
> No, if we are at the end of the buffer, truncated sequence isn't an
> error. It means the buffer ran out before the sequence did. But if we're
> _not_ at the end of the buffer, it means the

Ah, I was confusing "end of buffer" and "end of file".

> However, if we just zap the parts we handled, do the memmove to the
> front, refill the buffer, and then have the error _again_ that means the
> truncated sequence is invalid, not a problem with running out of data.
> 
> (And that means we don't have to care how long the truncated sequence
> is, so we don't care how far from the end of the buffer still counts as
> retrying instead of skipping.)
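
The refill-and-retry idea described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the toybox code: convert_chunked() and CHUNK are made-up names, the input comes from memory rather than read(), invalid bytes are simply dropped (the -c behavior), and CHUNK is assumed to be at least as long as the longest multibyte sequence.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16  /* tiny on purpose so refills happen often; the thread talks about 2k */

/* Hypothetical helper, not toybox's actual code. Converts src[0..srclen)
 * through cd into dst, feeding iconv() at most CHUNK bytes at a time the
 * way a read() loop would. Invalid input bytes are dropped.
 * Returns the number of bytes written to dst, or (size_t)-1 on failure. */
size_t convert_chunked(iconv_t cd, const char *src, size_t srclen,
                       char *dst, size_t dstlen)
{
  char buf[CHUNK], *out = dst;
  size_t have = 0, outleft = dstlen;

  for (;;) {
    /* Refill: any leftover bytes already sit at the front of buf. */
    size_t want = CHUNK - have;

    if (want > srclen) want = srclen;
    memcpy(buf + have, src, want);
    src += want;
    srclen -= want;
    have += want;
    if (!have) break;  /* end of input and nothing pending */

    char *in = buf;
    size_t inleft = have;

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
      if (errno == EILSEQ || (errno == EINVAL && in == buf)) {
        /* Illegal sequence, or a "truncated" one that is still truncated
         * at the start of a freshly refilled buffer: skip one byte. */
        in++;
        inleft--;
      } else if (errno != EINVAL) return (size_t)-1;  /* E2BIG etc. */
      /* Otherwise EINVAL just means buf ended mid-sequence: retry after refill. */
    }

    /* Zap the part we handled and move the rest to the front. */
    memmove(buf, in, inleft);
    have = inleft;
  }

  /* Flush any shift state (a no-op for stateless encodings). */
  if (iconv(cd, NULL, NULL, &out, &outleft) == (size_t)-1) return (size_t)-1;

  return dstlen - outleft;
}
```

Because the retry only gives up when the error repeats at the very start of the buffer, the code never needs to know how long the truncated sequence is.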
> 
> >> Hmmm... we should probably pass illegal sequence bytes through. (Pass
> >> 'em through.) Except check if output buffer is full before doing that.
> >> (Don't have to check inleft nonzero because if iconv() returns illegal
> >> sequence but used up all the input buffer, that's a libc bug.)
> > 
> > Right...
> 
> I think the -c flag controls whether or not to pass them through,
> although posix is going "and we refuse to specify the behavior here at
> all because Microsoft paid us money not to".
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html
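
For the illegal-sequence case, a minimal sketch of the "pass it through unless -c" idea might look like the helper below. The name skip_illegal_byte() and the drop parameter are hypothetical. Dropping the byte matches how POSIX describes -c; passing it through without -c is only the suggestion made in this thread, since POSIX leaves that case open.

```c
#include <stddef.h>

/* Hypothetical helper: step past one illegal input byte after iconv()
 * failed with EILSEQ. If drop is zero the byte is copied to the output
 * unchanged, otherwise it is discarded (-c). Returns -1 without touching
 * anything when the output buffer is full, so the caller can flush the
 * output and call again. Assumes *inleft is nonzero. */
int skip_illegal_byte(char **in, size_t *inleft, char **out, size_t *outleft,
                      int drop)
{
  if (!drop) {
    if (!*outleft) return -1;  /* no room: flush output first */
    *(*out)++ = **in;
    --*outleft;
  }
  ++*in;
  --*inleft;

  return 0;
}
```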
>
> >> Where would I get a test file to convert? I just ran a text file through
> >> it and confirmed it's not making any changes to it, but that doesn't
> >> mean much. :)
> > 
> > More interesting would be roundtrip encoding some files.
> 
> Except that means "cat" would pass. Not really a test that instills a
> lot of confidence in me...
> 
> > For testing, I just used an uim tarball[1] with some eucjp encoded files.
> > The cleaned up version still seems to work properly.
> 
> We can echo -e some snippets. Basically if we convert between utf-8 and
> whatever it is windows uses (latin pi) for like japan or korean or
> something, we'll have shown it Did A Thing. We're not trying to test the
> libc implementation of iconv, just show that we're feeding data into it.
>
> > Even more interesting would be a file with some illegal sequences. I didn't
> > test that at all.
> 
> The failure paths are always the most interesting thing to test. And the
> most often overlooked...
> 
> We'd also want to test retry across 2k boundaries on both input and
> output if we were being serious. _And_ test a file that exactly filled
> up the input and another that exactly filled the output buffer when the
> file ended.
> 
> But again, since I dunno what success looks like, I'll wait for somebody
> who does to complain. :)

I think the simplest thing would be to translate between iso-8859-1 and
utf-8. I've attached a simple test.

> > Error handling looks more sensible. Have you considered that iconv_open()
> > might also fail because of insufficient memory?
> 
> I looked at doing perror_exit(0) but EINVAL is "Invalid argument" which
> isn't necessarily enough to figure out what went wrong. As for other
> failure causes:
[..]

Ok thanks, I see why you changed the error message.
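
To illustrate the point about EINVAL, here is a small sketch of reporting iconv_open() failures with more context than perror() gives. open_or_explain() and its message wording are hypothetical, not what toybox prints.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a converter and, on failure, write an
 * explanation into msg. EINVAL from iconv_open() means the requested
 * conversion is not supported, so name both encodings instead of
 * printing a bare "Invalid argument". Returns (iconv_t)-1 on failure. */
iconv_t open_or_explain(const char *to, const char *from,
                        char *msg, size_t len)
{
  iconv_t cd = iconv_open(to, from);

  if (cd == (iconv_t)-1) {
    if (errno == EINVAL)
      snprintf(msg, len, "unsupported conversion from '%s' to '%s'", from, to);
    else snprintf(msg, len, "iconv_open: %s", strerror(errno));
  }

  return cd;
}
```

Other errno values (such as ENOMEM) fall back to the usual strerror() text, which is already descriptive enough.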

> >> P.S. Posix iconv has several more command line options. -c is easy and
> >> -s is NOP for us, but I dunno how to do -l.
> > 
> > glibc's iconv doesn't have them. So I guessed that they are not much used.
> > Now I see that libiconv has them.
> 
> When glibc and posix disagree, posix can potentially win. I'll probably
> do the extra 2 posix options on general principles, and fluff out the
> help text before promoting it.

Yeah, -c and -s look sensible.

Felix
-------------- next part --------------
#!/bin/bash

[ -f testing.sh ] && . testing.sh

#testing "name" "command" "result" "infile" "stdin"

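# 0xEF is "ï" in ISO-8859-1; its UTF-8 encoding is the two bytes 0xC3 0xAF.
iso=$(printf '\357')
utf=$(printf '\303\257') # "ï"
# Each file is "a" followed by 4096 copies of the character, so both files
# are larger than the 2k buffers discussed above.
printf a > iso
printf a > utf
for i in $(seq 4096)
do
  printf %s "$iso" >> iso
  printf %s "$utf" >> utf
done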

testing "iconv" "iconv -f iso-8859-1 iso" "$(cat utf)" "" ""
testing "iconv -c" "iconv -c -f utf-8 iso" "a" "" ""

rm iso utf

