jason.orendorff at gmail.com
Fri Mar 23 12:34:57 EDT 2007
On 3/22/07, Alexander Kjeldaas <alexander.kjeldaas at gmail.com> wrote:
> Python is *definitively* not utf16. Python can be compiled to use
> utf8, utf16 or utf32/ucs4.
UTF-16 or UTF-32. Not UTF-8.
I'll ask around and see if the Python folks think this has been good,
bad, or indifferent. My impression was that it's considered to have
been a mistake, but I could be wrong.
My thoughts on this topic actually come largely from Python's
experience in this arena.
> Python does not have a character type,
> avoiding the issue of whether there should be O(1) access to
Um, this is a misunderstanding of how Python works. Python
provides O(1) access to code units, so for example on a
"ucs2" build (the default):
>>> s = u'\U00012345'
On a "ucs4" build the same code gives different answers. No one
exactly likes this in the Python camp, and I don't think we want this
for Scheme. If R6RS exposes code units, it should either
standardize on a representation everyone can live with, or
set the code unit API aside in a separate library, maybe
(r6rs string-code-units), so people won't naively trip over it.
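To make the code unit / code point difference concrete, here is a minimal sketch. It assumes a Python 3 interpreter, where indexing a str counts code points, and it simulates the UTF-16 view by encoding the string. The helper name utf16_code_units is made up for illustration; it is not part of any Python API.

```python
import struct

def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

s = "\U00012345"
units = utf16_code_units(s)

print(len(s))                   # 1: one code point
print(len(units))               # 2: a surrogate pair in UTF-16
print([hex(u) for u in units])  # ['0xd808', '0xdf45']
```

The two lengths disagree for any character outside the Basic Multilingual Plane, which is the kind of build-dependent answer described above.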
> According to Guido van Rossum, python 3000 might use all
> three internal representations at the same time.
Well, it's possible. I think he mentioned it at PyCon. I'll gladly
bet it doesn't change: too much work, and it would either complicate
the Python C API (one of Python's major strings--er, strengths) or
hurt performance, or both.
I'll ask about this too.
> Neither Xerces-C nor ICU specifies their internal representation as
> part of the interface AFAIK. On the other hand, since they deal
> with encodings they support lots of them.
"String is represented by 'XMLCh*' which is a pointer to unsigned
16 bit type holding utf-16 values, null terminated."
"In ICU, a Unicode string consists of 16-bit Unicode code units.
A Unicode character may be stored with either one code unit
(the most common case) or with a matched pair of special
code units ("surrogates"). The data type for code units is UChar.
"Indexes and offsets into and lengths of strings always count
code units, not code points."
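As a rough illustration of what counting code units rather than code points means for a caller, the sketch below walks a sequence of 16-bit code units and counts code points by treating a matched surrogate pair as one character. The function name count_code_points is hypothetical, and the code is not taken from ICU or Xerces-C.

```python
def count_code_points(units):
    """Count code points in a sequence of UTF-16 code units (ints)."""
    count = 0
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one code point.
        # Anything else, including an unpaired surrogate, counts singly.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count

# 'A' followed by U+12345: three code units, two code points.
print(count_code_points([0x0041, 0xD808, 0xDF45]))
```

Getting a code point count, or the n-th code point, takes a linear scan like this one, while access by code unit index stays O(1).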
Regarding the rest of your comments: your experience and mine
obviously differ. I wonder if you have profiled a system using both
UTF-16 and UTF-32 strings. I have not.
I think the rate-determining step is probably neither unaligned
accesses nor processor cache but how much copying and transcoding
you're forced to do. UTF-16 is a significant win in that regard.