[r6rs-discuss] Stateful codecs and inefficient transcoding

Per Bothner per at bothner.com
Tue Oct 31 01:26:46 EST 2006


William D Clinger wrote:
> By "default transcoder", I meant the transcoder that
> an implementation uses for procedures like get-char
> when no explicit transcoder argument is given.
> According to draft R6RS section 15.3.5, paragraph 1,
> the default transcoder *must* be "UTF-8 with a
> platform-specific end-of-line convention."

Sorry - I wasn't sure what you meant by "default transcoder".
In other environments the "default transcoder" is the one the
system picks for you *depending on your implicit locale*.

> If the default transcoder were for Latin-1 with the lf
> eol-style, then the implementation of get-char would
> be *exactly* the same as the implementation of read-char
> in the current development version of Larceny.

While I can't speak about Larceny, I don't think that is quite true
in general, since you do need an extra test or indirection
that you wouldn't otherwise need to handle the possibility
of a non-default transcoder.  But we can agree one can make
this trivially small, for example by only doing the extra
test or indirection when re-filling a large buffer.
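
To make that concrete, here is a rough sketch (in Python rather than
Scheme, purely for illustration) of a reader whose per-character path
never looks at the transcoder; the codec is consulted only when the
buffer runs dry.  The names TextInput, get_char and _refill_and_get are
made up for this example and are not taken from Larceny or the draft.

```python
import codecs
import io


class TextInput:
    """Character reader whose per-character fast path ignores the codec."""

    def __init__(self, raw, encoding="utf-8", chunk_size=4096):
        self._raw = raw                  # binary source with a .read(n) method
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._chunk_size = chunk_size
        self._chars = ""                 # characters already decoded
        self._pos = 0                    # index of the next character to return

    def get_char(self):
        # Fast path: no transcoder test, just an index into the buffer.
        if self._pos < len(self._chars):
            ch = self._chars[self._pos]
            self._pos += 1
            return ch
        return self._refill_and_get()

    def _refill_and_get(self):
        # Slow path: the only place that touches the transcoder.
        while True:
            data = self._raw.read(self._chunk_size)
            self._chars = self._decoder.decode(data, final=not data)
            self._pos = 0
            if self._chars:
                self._pos = 1
                return self._chars[0]
            if not data:
                return None              # end of input


reader = TextInput(io.BytesIO("h\u00e9llo".encode("latin-1")), encoding="latin-1")
first = reader.get_char()
```

With a buffer of a few kilobytes the cost of the indirection is spread
over thousands of characters, which is the sense in which it can be
made trivially small.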

>>>     * how character decoding could call iconv, with about
>>>       the same performance as in C, in programs that do
>>>       not need to follow a get-char with a get-u8
>> But how is the implementation supposed to know this, except
>> when something like get-string-all is called?
> 
> What I am suggesting is that programs that want to
> use iconv should read the input as binary and then
> call iconv without relying on the transcoders and
> textual i/o of the draft R6RS.

But the issue isn't "programs that want to use iconv".
It's somebody wanting to write "hello world" in an
environment where their files use a different encoding
than UTF-8.  Perhaps the world is moving to UTF-8, but there
are still a lot of legacy systems.

Having the default transcoder always be UTF-8, rather
than depending on the current locale, is I think wrong.
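
As an illustration of what "depending on the current locale" could
mean, here is a small sketch in Python (not Scheme).  The helpers
default_encoding and open_output_file are hypothetical names for this
example; locale.getpreferredencoding and codecs.lookup are the real
standard-library calls.  The fallback to UTF-8 is my assumption, not
something the draft specifies.

```python
import codecs
import locale


def default_encoding():
    """Pick the default text encoding from the user's locale."""
    name = locale.getpreferredencoding(False)
    try:
        # Normalize the name and check that a codec actually exists for it.
        return codecs.lookup(name).name
    except LookupError:
        # Assumption: fall back to UTF-8 when the locale names no usable codec.
        return "utf-8"


def open_output_file(path):
    """Open a text file for writing using the locale's encoding."""
    return open(path, "w", encoding=default_encoding())
```

A "hello world" program written against open_output_file then produces
a file that the user's other tools can read without any explicit
mention of encodings.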

> I don't think it is possible to design an i/o system
> that satisfies both of the following requirements:
> 
>   * arbitrary mixing of binary and textual i/o
>   * efficient support for all possible transcoders

I agree.  But I think the second is as important as
the first, or more so.

> The i/o system of the draft R6RS isn't intended to
> support all possible transcoders.  Its intent is to
> support efficient mixing of binary and textual i/o
> down to the byte level (but no lower) for a small
> set of stateless transcoders that don't require much
> lookahead.
> 
> For stateful transcoders, and for transcoders that
> require a lot of lookahead, I think the right thing
> to do is to compose something like the draft's binary
> i/o with something like iconv.

That's ok, as long as you're willing to write off all
casual use of R6RS on systems where UTF-8 is not the
default encoding for files.
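
For reference, the composition being suggested (binary i/o plus a
separate stateful decoder) looks roughly like the sketch below.  It is
in Python, with the standard codecs module standing in for iconv, and
ISO-2022-JP chosen only as an example of a stateful encoding; the
function name decode_stream is made up for this example.

```python
import codecs
import io


def decode_stream(raw, encoding, chunk_size=4096):
    """Read bytes from a binary source and decode them incrementally."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    while True:
        data = raw.read(chunk_size)
        # The decoder keeps shift state and partial sequences between calls.
        pieces.append(decoder.decode(data, final=not data))
        if not data:
            break
    return "".join(pieces)


text = "Scheme \u65e5\u672c\u8a9e"
payload = text.encode("iso2022_jp")
decoded = decode_stream(io.BytesIO(payload), "iso2022_jp", chunk_size=3)
```

This works, but note that every program has to name the encoding
itself; nothing here helps the casual user whose files are simply not
UTF-8.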

Perhaps there aren't very many such systems, and this is
a concern that is dwindling.  But I'm curious: How many people
on this list run Scheme on systems where the default character
encoding is not UTF-8 (or ASCII)?  Would you find it acceptable
that (open-output-file "foo.txt") creates a file you can't read
without understanding about character encodings?

>> Those are the only encodings that R6RS requires.  But presumably
>> an implementation can provide others.  I don't see anything that
>> mentions a restriction on mixing byte/char input if a non-standard
>> codec is used, though perhaps I missed it.
> 
> The draft R6RS places no restrictions on mixing
> binary and textual input.  That means there is an
> implied restriction on encodings and efficiency:
> Implementations can provide any stateless codecs
> they want in addition to those listed, but those
> implementations must figure out how to mix binary
> with textual i/o for all codecs they supply, and
> programmers must live with any inefficiencies that
> result.
> 
> That may well be unreasonable, but it is what the
> draft implies.

I agree that is what the draft implies.  I think it needs to be
changed.
-- 
	--Per Bothner
per at bothner.com   http://per.bothner.com/


