[r6rs-discuss] Stateful codecs and inefficient transcoding

John Cowan cowan at ccil.org
Sat Nov 4 13:26:22 EST 2006


William D Clinger scripsit:

>  *  If t1 and t2 are transcoders, then their composition
>     is defined by describing their composition in both
>     the input and output directions.  In the input
>     direction, their composition is t1input followed by
>     t2output followed by t2input.  For output, their
>     composition is t2output followed by t2input followed
>     by t1output.

This is very confusing.  Let t1 be the binary transcoder, and let t2
be the {utf-8-codec, lf, raise} transcoder.  Differences other than
encoding drop out.  On input, t1input maps each byte to a character in
the Latin-1 repertoire.  t2output maps each of these characters to a
one-byte or two-byte sequence; t2input then maps those sequences back
to the same characters.  Where does full UTF-8 decoding get done?  I
would expect rather that the sequence is t1input followed by t1output
(the identity) followed by t2input.
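
The two readings can be traced in a short Python sketch.  Python's
latin-1 codec stands in for the binary transcoder and its utf-8 codec
for t2; that mapping onto Python codecs is an assumption made purely
for illustration, and the sample text is arbitrary.

```python
# Illustration only: Python's latin-1 codec plays the binary transcoder (t1)
# and its utf-8 codec plays t2. The sample text is arbitrary.
text = "\u00e9\u20ac"                  # e-acute and the euro sign
raw = text.encode("utf-8")             # external bytes of a UTF-8 file

# Composition as quoted: t1input, then t2output, then t2input.
chars = raw.decode("latin-1")          # t1input: one Latin-1 char per byte
rebytes = chars.encode("utf-8")        # t2output: one or two bytes per char
quoted = rebytes.decode("utf-8")       # t2input: undoes t2output

# Expected composition: t1input, then t1output (identity), then t2input.
expected = raw.decode("latin-1").encode("latin-1").decode("utf-8")

print(quoted == chars)     # True: the t2 round trip cancels out
print(quoted == text)      # False: the file's bytes were never UTF-8 decoded
print(expected == text)    # True
```

Under these assumptions the quoted order leaves the reader with the
Latin-1 view of the bytes, while the alternative order recovers the
original text.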

> The rationale for this definition of composition is that
> it adds a new layer of transcoding onto the existing layer,
> instead of replacing the existing layer.  That allows
> some of the weirder, file-at-a-time transcodings.

Other than composing the binary transcoder with an ordinary one, what are
the actual use cases for this?  I don't know what you mean by "file at
a time"; transcoders require varying amounts of state from none (UTF-8)
to one bit (UTF-16) to a few bytes (full ISO 2022), but I know of no
encodings in which a byte can retroactively change the interpretation
of bytes that have already been transcoded.
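
As a rough illustration of how little state is involved, here is a
sketch using the incremental decoders in Python's standard codecs
module.  The helper name decode_in_chunks and the sample strings are
made up for this example; Python's codecs are only a stand-in for the
transcoders under discussion.

```python
import codecs

def decode_in_chunks(encoding, data, size=1):
    """Feed data to an incremental decoder, size bytes at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for i in range(0, len(data), size):
        # Text already appended to pieces is never revisited.
        pieces.append(decoder.decode(data[i:i + size]))
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)

# UTF-16: after the byte order mark, the decoder only has to remember
# which of two byte orders is in effect.
le = codecs.getincrementaldecoder("utf-16")()
le.decode(b"\xff\xfe")               # little-endian BOM, yields no text
be = codecs.getincrementaldecoder("utf-16")()
be.decode(b"\xfe\xff")               # big-endian BOM, yields no text
same_bytes = b"\x00A"
le_text = le.decode(same_bytes)      # read as U+4100
be_text = be.decode(same_bytes)      # read as U+0041

# ISO-2022-JP: escape sequences switch character sets, so the decoder
# carries the current set between chunks.
sample = "abc \u3042\u3044 xyz"
encoded = sample.encode("iso2022_jp")
streamed = decode_in_chunks("iso2022_jp", encoded)
print(streamed == sample)
```

In both cases the decoder moves strictly forward: later bytes depend on
the state left by earlier ones, but text that has already been emitted
is not rewritten.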

> In practice, t1 will usually be the binary transcoder, and t2output
> followed by t2input will be the identity restricted to some subset of
> the Unicode characters.

Note, however, that there are transcoders which accept certain characters
for output that they never generate on input; they conflate multiple
Unicode characters into the same external encoding.
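
A toy example of such a conflating transcoder, in Python.  The one-byte
encoding below is hypothetical, invented for this sketch (Latin-1 plus
one extra output mapping), and the names toy_encode and toy_decode do
not correspond to any real codec.

```python
# Hypothetical one-byte encoding, invented for this example: Latin-1 plus
# one extra output mapping. It does not correspond to any real codec.
MICRO_SIGN = "\u00b5"        # MICRO SIGN
GREEK_MU = "\u03bc"          # GREEK SMALL LETTER MU

def toy_encode(text):
    """Output direction: accepts GREEK_MU but writes it as byte 0xB5."""
    out = bytearray()
    for ch in text:
        if ch == GREEK_MU:
            out.append(0xB5)
        elif ord(ch) < 0x100:
            out.append(ord(ch))
        else:
            raise ValueError(f"cannot encode {ch!r}")
    return bytes(out)

def toy_decode(data):
    """Input direction: byte 0xB5 always comes back as MICRO_SIGN."""
    return data.decode("latin-1")

original = "5 " + GREEK_MU + "m"
round_trip = toy_decode(toy_encode(original))
print(round_trip == original)                            # False
print(round_trip == "5 " + MICRO_SIGN + "m")             # True
print(toy_decode(toy_encode(round_trip)) == round_trip)  # True
```

Here output followed by input is the identity only on the characters
the input side can produce; GREEK SMALL LETTER MU is accepted for
output but comes back as a different character.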

-- 
Business before pleasure, if not too bloomering long before.
        --Nicholas van Rijn
                John Cowan <cowan at ccil.org>
                    http://www.ccil.org/~cowan


