[r6rs-discuss] BOM at start of ports
cowan at ccil.org
Wed Dec 5 08:31:01 EST 2007
Abdulaziz Ghuloum scripsit:
> I can't find this in the spec.
This is a Unicode question rather than an R6RS question.
> For textual ports obtained using open-file-input-port,
> open-bytevector-input-port, and transcoded-port, is the first call
> to get-char/peek-char supposed to recognize a BOM if it exists in
> the beginning of the port buffer, or should a BOM, if one exists,
> be decoded as a regular character?
If the encoding is UTF-16, the process MUST recognize a BOM if one is
present and use it to set the endianness of what follows: the BOM MUST NOT be
returned to the caller. If no BOM is present, the process SHOULD use a
local convention if there is one (this mostly means that Windows UTF-16
files are typically little-endian), and if not, SHOULD assume big-endian.
The same is true of the UTF-32 encoding if you choose to support it.
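As a rough illustration of the UTF-16 rule above, here is a minimal Python
sketch. The function name decode_utf16 and its "default" parameter are
hypothetical, invented for this example; they are not part of R6RS or of any
Scheme implementation. The "default" argument stands in for whatever local
convention applies when no BOM is present.

```python
def decode_utf16(data: bytes, default: str = "big") -> str:
    """Decode bytes labelled plain 'UTF-16' (hypothetical helper).

    A leading BOM selects the byte order and is consumed, so it is
    never returned to the caller. Without a BOM, `default` plays the
    role of the local convention; big-endian is the fallback.
    """
    if data[:2] == b"\xfe\xff":        # U+FEFF serialized big-endian
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":        # U+FEFF serialized little-endian
        return data[2:].decode("utf-16-le")
    # No BOM: use the local convention if one was given, else big-endian.
    codec = "utf-16-le" if default == "little" else "utf-16-be"
    return data.decode(codec)
```

With this sketch, decode_utf16(b"\xff\xfeA\x00") yields "A" with the BOM
consumed, and decode_utf16(b"\x00A") also yields "A" via the big-endian
fallback.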
In the encodings UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE, there are
no BOMs; the endianness is specified by the encoding name, and any U+FEFF
characters MUST be returned as such.
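Python's explicit-endianness codecs happen to behave this way, which makes
for a quick check. This only illustrates the rule; it says nothing about how
an R6RS implementation must be structured. The helper name
decode_explicit is hypothetical.

```python
def decode_explicit(data: bytes, encoding: str) -> str:
    """Decode with an encoding whose name already fixes the byte order.

    No BOM processing is done: a leading U+FEFF is ordinary data
    and is handed back to the caller unchanged.
    """
    allowed = {"utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"}
    if encoding not in allowed:
        raise ValueError("expected an explicit-endianness encoding")
    return data.decode(encoding)


# U+FEFF followed by "A", serialized as UTF-16BE and as UTF-32BE.
kept_16 = decode_explicit(b"\xfe\xff\x00A", "utf-16-be")
kept_32 = decode_explicit(b"\x00\x00\xfe\xff\x00\x00\x00A", "utf-32-be")
```

In both cases the result is the two-character string U+FEFF, "A": the
leading U+FEFF survives as a regular character.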
In the UTF-8 encoding, things are less clear-cut. A process SHOULD
discard any BOM that is present. There are no endianness considerations
for UTF-8, so the BOM is serving primarily as a signature.
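A minimal Python sketch of the UTF-8 case, under the assumption that the
signature is simply dropped when it appears at the very start of the data.
The names UTF8_BOM and decode_utf8 are hypothetical, chosen for this example.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, discarding a leading BOM (signature) if present.

    Only a BOM at the very start is treated as a signature; a U+FEFF
    appearing later in the data is returned as a regular character.
    """
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return data.decode("utf-8")
```

Here decode_utf8(b"\xef\xbb\xbfA") gives "A", while a U+FEFF in the middle
of the input is left alone.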
(See RFC 2119 for the meanings of MUST, SHOULD, and MAY.)
> A related question is about the endianness of the data read when
> using the (utf-16-codec) in a transcoder that's passed to any of the
> procedures listed above. Should the BOM, if one exists, be used to
> determine the endianness of the data in the port?
John Cowan cowan at ccil.org http://www.ccil.org/~cowan
Historians aren't constantly confronted with people who carry on
self-confidently about the rule against adultery in the sixth amendment to
the Declamation of Independence, as written by Benjamin Hamilton. Computer
scientists aren't always having to correct people who make bold assertions
about the value of Objectivist Programming, as examplified in the HCNL
entities stored in Relaxational Databases. --Mark Liberman