[r6rs-discuss] unicode (re comment #134)
lord at emf.net
Sun Dec 17 18:41:21 EST 2006
Marcin 'Qrczak' Kowalczyk wrote:
> IMHO the two best choices for the Unicode-based notion of a character
> in a programming language are:
> * code points: 0..#x10FFFF
> * Unicode scalar values: 0..#x10FFFF excluding #xD800..#xDFFF
> They are simple to understand in the context of Unicode, atomic,
> easy to store in strings, and easy to exchange with other languages
> with Unicode-based strings.
I agree, more or less. I would add "a superset of code points"
to the mix.
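To make the two candidate definitions concrete, here is a rough sketch in
Python, chosen only because its str type happens to be a sequence of code
points. The helper names are made up for illustration; only the numeric
ranges (0..#x10FFFF, surrogates #xD800..#xDFFF) come from Unicode.

```python
# Hypothetical helper names; only the numeric ranges come from Unicode.
MAX_CODE_POINT = 0x10FFFF
SURROGATES = range(0xD800, 0xE000)  # #xD800..#xDFFF inclusive


def is_code_point(n: int) -> bool:
    """Any integer in 0..#x10FFFF, surrogates included."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n: int) -> bool:
    """A code point that is not a surrogate."""
    return is_code_point(n) and n not in SURROGATES


def utf8_encodable(n: int) -> bool:
    """True if Python's strict UTF-8 codec accepts the one-character string."""
    if not is_code_point(n):
        return False
    try:
        chr(n).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

A lone surrogate such as #xD800 passes is_code_point but fails both
is_scalar_value and utf8_encodable. That is the practical gap between the
two definitions: a string type built on code points can hold values that
the standard encoding forms refuse to carry.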
The issue raised in comment #134 is that, as written, the draft
*requires* that char be a "Unicode scalar value". An implementation
is not allowed to make char the same as code points, or a superset of
code points (at least if integer->char is to behave in the expected
> Thomas Lord <lord at emf.net> writes:
>> *Allowing* unpaired surrogates does not *require* that
>> unpaired surrogates be supported.
> Permitting variation hinders portability, it leads to programs which
> run correctly only on a subset of implementations.
>> Tom> For every natural number (integers greater than or equal to 0)
>> Tom> there exists a distinct CHAR value. The set of all such values
>> Tom> is called "simple characters".
>> John> Whatever for?
>> So that the abstract model of character values is mathematically
>> simple and so that it is a good model for communications generally.
> It's indeed simpler, but it's worse for communication, not better,
> because all the rest of the world uses Unicode within its limits.
>> It keeps the communications model intact. An N-bit wide port,
>> in this model, conveys characters 0..2^N-1.
> The range of Unicode code points is not a power of 2, so this can't
> express the code point port nor Unicode scalar value port.
> The current computing world doesn't use N-bit wide ports for arbitrary
> values of N. Almost all interchange formats are based on sequences of
> bytes, and there is no obvious mapping between an N-bit port and a
> byte port (there are several choices), so different pieces of software
> using N-bit ports can't necessarily communicate directly with each
> other, even for an agreed N.
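The arithmetic behind both objections can be checked with a short Python
sketch. The function names pack_be and pack_le are hypothetical, and the
two byte layouts shown are only two of the several possible mappings from
an N-bit value to bytes, not anything either proposal specifies.

```python
MAX_CODE_POINT = 0x10FFFF

# Smallest N whose range 0..2^N-1 covers every code point.
n_bits = MAX_CODE_POINT.bit_length()
# Values an N-bit port could convey that lie past #x10FFFF.
unused = (1 << n_bits) - 1 - MAX_CODE_POINT
# One bit fewer is too narrow to reach #x10FFFF.
too_narrow = (1 << (n_bits - 1)) - 1 < MAX_CODE_POINT


def pack_be(value: int, width_bits: int) -> bytes:
    """One possible mapping: pad to whole bytes, most significant byte first."""
    return value.to_bytes((width_bits + 7) // 8, "big")


def pack_le(value: int, width_bits: int) -> bytes:
    """Another possible mapping: same padding, least significant byte first."""
    return value.to_bytes((width_bits + 7) // 8, "little")
```

Here n_bits is 21, so a 21-bit port conveys #xF0000 values beyond the last
code point, while a 20-bit port falls short. And pack_be(0x1F600, 21) and
pack_le(0x1F600, 21) already disagree on the byte sequence, before
considering layouts that pack values without padding to byte boundaries.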
>> You shorted yourself, then, by not getting to the topic of combining
>> sequence characters. Remember that, in addition to a simple
>> character for every integer, I'm also suggesting that all tuples
>> of simple characters are themselves characters.
> This requires an API to look inside characters, char->integer is no
> longer sufficient.
> This complicates exchange with the rest of the world, because almost
> all Unicode-based strings in various languages consist of some atomic
> units: either Unicode scalar values, or code points, or code units of
> some encoding form (UTF-8/16/32).
> I would design this differently: don't use a separate character type
> at all, identify them with strings of length 1. But it would break
> Scheme tradition (which goes back to Lisp tradition), so I'm not
> proposing this for Scheme.
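As an illustration of the atomic units listed above, the following Python
sketch counts the same text in each of them. The function name unit_counts
is made up for this example. It assumes the input contains no unpaired
surrogates, since Python's strict codecs would refuse to encode those.

```python
def unit_counts(s: str) -> dict:
    """Length of the same text measured in different atomic units.

    Assumes s holds no unpaired surrogates; the strict codecs used here
    raise UnicodeEncodeError otherwise.
    """
    return {
        "code points": len(s),
        "UTF-8 code units": len(s.encode("utf-8")),
        "UTF-16 code units": len(s.encode("utf-16-le")) // 2,
        "UTF-32 code units": len(s.encode("utf-32-le")) // 4,
    }


# "e" + COMBINING ACUTE ACCENT + one code point outside the BMP.
sample = "e\u0301\U0001F600"
counts = unit_counts(sample)
```

For this sample the counts are 3 code points, 7 UTF-8 code units, 4 UTF-16
code units and 3 UTF-32 code units. Python also happens to take the design
mentioned in the last paragraph: it has no separate character type, and
sample[0] is simply a string of length 1.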