[r6rs-discuss] Stateful codecs and inefficient transcoding
cowan at ccil.org
Tue Oct 31 02:02:04 EST 2006
William D Clinger scripsit:
> Although the draft R6RS does not have your hypothetical utf-16-codec
> that relies on an initial BOM to select the endianness,
I will be filing a formal comment objecting to this. The standard
encoding of Unicode files (that is, files which may contain any Unicode
character) on Windows systems is UTF-16; neither UTF-16LE nor UTF-16BE
is customarily used there. In addition, UTF-16 is one of the two
encodings (the other being UTF-8) which all XML processors are required
to be able to read.
> the implementation could peek at the first two bytes, decide whether
> to use UTF-16BE or UTF-16LE, and could install one of those two as
> the transcoder associated with the port.
That's not quite right. The implementation must peek
at the first two bytes, and if:
1) they are FE FF, they must be consumed and UTF-16BE installed;
2) they are FF FE, they must be consumed and UTF-16LE installed;
3) otherwise, the environment must be interrogated to see if
UTF-16 is by default in little-endian order, and if so,
UTF-16LE must be installed without consuming the two bytes;
4) otherwise, UTF-16BE must be installed without consuming the
two bytes.
The point here is that neither UTF-16LE nor UTF-16BE encodings are
permitted to use a BOM; if a U+FEFF character appears, it is the
substantive character ZERO WIDTH NO-BREAK SPACE. In the UTF-16
encoding, U+FEFF is a BOM at the beginning of a file but a ZWNBSP
anywhere else.
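
The four cases above can be sketched in Python. This is only an
illustration, not R6RS code: the names select_utf16_codec and
decode_utf16 are hypothetical, and using sys.byteorder as the
"environment default" is an assumption (a real implementation might
consult the platform or locale instead).

```python
import sys

BOM_BE = b"\xfe\xff"
BOM_LE = b"\xff\xfe"


def select_utf16_codec(head, default_little_endian=None):
    """Return (codec_name, bytes_to_consume) for a UTF-16 stream.

    `head` holds at least the first two bytes of the stream.
    `default_little_endian` stands in for "interrogating the environment";
    when omitted, the host byte order is used (an assumption).
    """
    if default_little_endian is None:
        default_little_endian = sys.byteorder == "little"
    first_two = head[:2]
    if first_two == BOM_BE:
        return "utf-16-be", 2      # case 1: consume the BOM
    if first_two == BOM_LE:
        return "utf-16-le", 2      # case 2: consume the BOM
    if default_little_endian:
        return "utf-16-le", 0      # case 3: nothing consumed
    return "utf-16-be", 0          # case 4: nothing consumed


def decode_utf16(data, default_little_endian=None):
    """Decode a complete UTF-16 byte string using the rules above."""
    codec, skip = select_utf16_codec(data, default_little_endian)
    # The fixed-endian codecs never strip U+FEFF, so any later
    # occurrence survives as an ordinary character.
    return data[skip:].decode(codec)
```

Only the leading BOM is consumed. Decoding the same bytes directly
with the fixed-endian "utf-16-le" codec keeps the initial U+FEFF as a
character, which is the distinction drawn above.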
> I don't like it much myself, but not for the two reasons you gave.
> For one thing, I harbor a prejudice against stateful encodings;
Be that as it may, they are prominent in both pre-Unicode and Unicode
> Furthermore I am told that some important file formats,
> e.g. XML, use several different textual encodings.
You are told wrongly.
XML *documents* may comprise multiple files (external entities), but
each file is in one and only one encoding, indicated thus:
1) Files in UTF-16 MUST begin with a BOM and MAY follow the BOM
with an internal encoding declaration.
2) Files in UTF-8 MAY begin with a BOM which MAY be followed by an
internal encoding declaration.
3) Files in other Unicode encodings MAY begin with a BOM and MUST
be followed by an internal encoding declaration.
4) Files in non-Unicode encodings MUST begin with an internal
encoding declaration.
Note that because ASCII encoding is a subset of UTF-8 encoding, ASCII
files do not require an internal encoding declaration.
In order to read the internal encoding declaration, XML processors must
read each file at the byte level. They then have a choice between
switching to character reading or restarting at the beginning with
character reading. An XML file MUST NOT contain bytes that are not
permitted in the character encoding of the file.
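
A rough Python sketch of the byte-level step, taking the "restart at
the beginning" option. The names sniff_xml_encoding and decode_xml are
hypothetical, and the sketch is deliberately simplified: it recognizes
BOMs and an ASCII-compatible encoding declaration only, so BOM-less
UTF-32 and EBCDIC files are not handled. It returns Python codec names,
and a declared encoding that Python does not know will raise
LookupError at decode time.

```python
import re

# Matches an XML declaration at the very start and captures the encoding name.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head):
    """Guess a Python codec name from the first bytes of an XML file."""
    # UTF-32 BOMs are checked first because FF FE 00 00 also starts with FF FE.
    if head.startswith((b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00")):
        return "utf-32"
    if head.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"            # Python's codec consumes the BOM itself
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"         # UTF-8 with a BOM; the codec strips it
    m = _DECL.match(head)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"                 # no BOM, no declaration (covers ASCII)


def decode_xml(data):
    """Sniff at the byte level, then restart and decode from the beginning."""
    encoding = sniff_xml_encoding(data[:256])
    # Strict decoding rejects bytes that are not permitted in the encoding.
    return data.decode(encoding)
```

Strict decoding raises UnicodeDecodeError on bytes that are invalid in
the detected encoding, which loosely mirrors the MUST NOT above.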
John Cowan    cowan at ccil.org    http://www.ccil.org/~cowan
Well, I have news for our current leaders and the leaders of tomorrow:
the Bill of Rights is not a frivolous luxury, in force only during times
of peace and prosperity.  We don't just push it to the side when the
going gets tough.  --Molly Ivins