[r6rs-discuss] unicode (re comment #134)

Thomas Lord lord at emf.net
Sun Dec 17 18:41:21 EST 2006


Marcin 'Qrczak' Kowalczyk wrote:
> IMHO the two best choices for the Unicode-based notion of a character
> in a programming language are:
>
> * code points: 0..#x10FFFF
> * Unicode scalar values: 0..#x10FFFF excluding #xD800..#xDFFF
>
> They are simple to understand in the context of Unicode, atomic,
> easy to store in strings, and easy to exchange with other languages
> with Unicode-based strings.
>   

I agree, more or less.  I would add "a superset of code points"
to the mix.

The issue raised in comment #134 is that, as written, the draft
*requires* that a char be a Unicode scalar value.  An implementation
is therefore not allowed to make char the same as code points, or a
superset of code points (at least not if integer->char is to behave
in the expected way).
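
To make the distinction concrete, here is a minimal sketch in Python
(not Scheme, and not taken from the draft).  The helper names are
hypothetical and exist only for illustration.  It contrasts an
integer->char that accepts only Unicode scalar values with one that
accepts every code point, surrogates included.

```python
# Illustrative sketch only: these helper names are made up for this
# example and do not come from the R6RS draft or any library.

MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_code_point(n: int) -> bool:
    """True if n lies in the Unicode code space 0..#x10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n: int) -> bool:
    """True if n is a code point outside the surrogate range."""
    return is_code_point(n) and not (SURROGATE_FIRST <= n <= SURROGATE_LAST)


def integer_to_char_scalar(n: int) -> str:
    """integer->char where char means 'Unicode scalar value'."""
    if not is_scalar_value(n):
        raise ValueError(f"not a Unicode scalar value: {n:#x}")
    return chr(n)


def integer_to_char_code_point(n: int) -> str:
    """integer->char where char means 'code point' (surrogates allowed)."""
    if not is_code_point(n):
        raise ValueError(f"not a Unicode code point: {n:#x}")
    return chr(n)
```

Under the first definition an unpaired surrogate such as #xD800 is
rejected; under the second it is an ordinary char.  An implementation
that wanted the second behavior could not get it while conforming to
the scalar-value wording.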

-t

> Thomas Lord <lord at emf.net> writes:
>
>   
>> *Allowing* unpaired surrogates does not *require* that
>> unpaired surrogates be supported.
>>     
>
> Permitting variation hinders portability, it leads to programs which
> run correctly only on a subset of implementations.
>
>   
>> Tom>  For every natural number (integers greater than or equal to 0)
>> Tom>  there exists a distinct CHAR value.  The set of all such values
>> Tom>  is called "simple characters".
>>
>> John> Whatever for?
>>
>> So that the abstract model of character values is mathematically
>> simple and so that it is a good model for communications generally.
>>     
>
> It's indeed simpler, but it's worse for communication, not better,
> because all the rest of the world uses Unicode within its limits.
>
>   
>> It keeps the communications model intact.  An N-bit wide port,
>> in this model, conveys characters 0..2^N-1.
>>     
>
> The range of Unicode code points is not a power of 2, so this can't
> express the code point port nor Unicode scalar value port.
>
> The current computing world doesn't use N-bit wide ports for arbitrary
> values of N. Almost all interchange formats are based on sequences of
> bytes, and there is no obvious mapping between an N-bit port and a
> byte port (there are several choices), so different pieces of software
> using N-bit ports can't necessarily communicate directly with each
> other, even for an agreed N.
>
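
The two objections quoted above can be checked with a little
arithmetic.  The following is a rough Python sketch; the function
names are hypothetical, and the choice of a 16-bit port in the second
half is just an example width.

```python
# Illustrative sketch only: names and the 16-bit example are assumptions.

CODE_POINT_COUNT = 0x110000  # number of code points 0..#x10FFFF


def is_power_of_two(n: int) -> bool:
    """True if n is an exact power of two."""
    return n > 0 and (n & (n - 1)) == 0


def min_port_width(count: int) -> int:
    """Smallest N such that an N-bit port (values 0..2^N-1) covers count values."""
    return (count - 1).bit_length()


def to_bytes_both_ways(value: int, width_bits: int = 16):
    """Two of the possible mappings of one N-bit value onto a byte port."""
    n_bytes = (width_bits + 7) // 8
    big = value.to_bytes(n_bytes, "big")        # most significant byte first
    little = value.to_bytes(n_bytes, "little")  # least significant byte first
    return big, little


width = min_port_width(CODE_POINT_COUNT)
unused = 2 ** width - CODE_POINT_COUNT
```

Here width comes out as 21, and a 21-bit port would carry 983040
values that are not code points at all.  Likewise, the 16-bit value
#x0041 can reach a byte port as 00 41 or as 41 00, so two programs
that agree on N still have to agree on the byte mapping.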
>   
>> You shorted yourself, then, by not getting to the topic of combining
>> sequence characters. Remember that, in addition to a simple
>> character for every integer, I'm also suggesting that all tuples
>> of simple characters are themselves characters.
>>     
>
> This requires an API to look inside characters, char->integer is no
> longer sufficient.
>
> This complicates exchange with the rest of the world, because almost
> all Unicode-based strings in various languages consist of some atomic
> units: either Unicode scalar values, or code points, or code units of
> some encoding form (UTF-8/16/32).
>
> I would design this differently: don't use a separate character type
> at all, identify them with strings of length 1. But it would break
> Scheme tradition (which goes back to Lisp tradition), so I'm not
> proposing this for Scheme.
>
>   
