[r6rs-discuss] [ANN] scheme-reports.org
Brian Mastenbrook
brian at mastenbrook.net
Mon Aug 24 21:04:31 EDT 2009
On Aug 24, 2009, at 5:38 PM, Ray Dillinger wrote:
> On Mon, 2009-08-24 at 16:39 -0400, John Cowan wrote:
>
>> As you know, I'd like to see characters flushed from Scheme and all
>> other languages. That's not practical, though, given the high
>> barriers to removing IEEE Scheme features from small Scheme.
>
> I agree in principle; characters in Unicode do not behave in the
> well-ordered ways that made the distinction between characters and
> strings seem useful in IEEE Scheme. There was an unspoken
> assumption that we were talking exclusively about environments
> with ASCII-like encodings, which has turned out recently to be
> false.
>
> It would be better to abandon the idea of characters as separate
> from strings. What is a character, after all? It's a string of
> length one. And what consistent semantics are provided by our
> character-specific functions that aren't visibly redundant with
> the semantics of string functions? Approximately none. So yeah,
> there's a point here to be made about characters being a fundamentally
> flawed notion in the presence of Unicode environments.
>
> In practice, I don't know if we can do this. It would break
> so much existing Scheme code.
After thinking about this for a while, I'm convinced that there is
value to having a tagged type to represent individual code points. I
believe that the language (or is that "the language, working group
2"?) should provide a range of facilities for working with strings or
text, suitable for uses ranging from writing new encoders and decoders
to interactive editing and display functions that work with text at
the grapheme cluster level. At the highest
level, the notion of a code point as something which stands alone
seems a bit silly, but at the lowest level I believe it makes sense.
It is the smallest unit of text which survives a round trip through
encoding and decoding unchanged, which means that it is for all
practical purposes
indivisible. (I don't think half of a surrogate pair counts as a
proper division of a code point, and it's actually a rather dangerous
thing to have lying around.) It is logically distinguishable from an
integer; while every code point can be uniquely mapped to an integer,
not every integer can be mapped to a code point, and the operations
defined on integers don't make sense on code points.
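
To make the distinction from integers concrete, here is a minimal
sketch of such a tagged type. It is written in Python purely for
illustration, not in Scheme, and the names CodePoint and is_code_point
are hypothetical. It assumes that "code point" here means a Unicode
scalar value, so lone surrogates are rejected along with integers
outside the code space, and it deliberately defines no arithmetic.

```python
from dataclasses import dataclass

MAX_CODE_POINT = 0x10FFFF          # top of the Unicode code space
SURROGATES = range(0xD800, 0xE000)  # halves of UTF-16 surrogate pairs


@dataclass(frozen=True)
class CodePoint:
    """Hypothetical tagged wrapper around one Unicode scalar value."""

    value: int

    def __post_init__(self):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(self.value, bool) or not isinstance(self.value, int):
            raise TypeError("code point value must be an integer")
        # Not every integer maps to a code point.
        if not 0 <= self.value <= MAX_CODE_POINT:
            raise ValueError("integer is outside the Unicode code space")
        # Assumption: half of a surrogate pair is not a valid value.
        if self.value in SURROGATES:
            raise ValueError("lone surrogate is not a scalar value")


def is_code_point(obj):
    """Disjoint predicate: true only for CodePoint, never for a bare int."""
    return isinstance(obj, CodePoint)
```

Under these assumptions CodePoint(0x41) is accepted, CodePoint(0xD800)
and CodePoint(0x110000) are rejected, and CodePoint(0x41) + 1 is a type
error because addition was never defined on the type.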
I'm also not convinced by the argument that a string of length one
removes the need for a separate tagged representation for the units of
which the string is composed. The most primitive facility provided by
any decoder or encoder is a mapping between code points and sequences
of bytes; when working at that level, I'd prefer to have a type with a
disjoint predicate representing the well-defined input type I am
receiving.
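
As an illustration of that lowest level, the following sketch maps a
single code point to its UTF-8 byte sequence and back. It is Python
rather than Scheme, the names CodePoint, encode_utf8 and decode_utf8
are hypothetical, and UTF-8 is only one example encoding. The point is
that the encoder accepts the tagged type and nothing else, so a bare
integer or a one-element string is rejected at the boundary.

```python
class CodePoint:
    """Hypothetical tagged type, disjoint from both int and str."""

    __slots__ = ("value",)

    def __init__(self, value):
        # Assumption: only Unicode scalar values are allowed, so
        # out-of-range integers and lone surrogates are rejected.
        if not 0 <= value <= 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        self.value = value


def encode_utf8(cp):
    """Map one CodePoint to its UTF-8 byte sequence."""
    if not isinstance(cp, CodePoint):
        raise TypeError("encode_utf8 expects a CodePoint")
    v = cp.value
    if v < 0x80:       # 1 byte: 0xxxxxxx
        return bytes([v])
    if v < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
    if v < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (v >> 12),
                      0x80 | ((v >> 6) & 0x3F),
                      0x80 | (v & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (v >> 18),
                  0x80 | ((v >> 12) & 0x3F),
                  0x80 | ((v >> 6) & 0x3F),
                  0x80 | (v & 0x3F)])


def decode_utf8(data):
    """Map a byte sequence holding exactly one code point to a CodePoint."""
    # Python's strict decoder rejects overlong forms and surrogates.
    text = bytes(data).decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected exactly one code point")
    return CodePoint(ord(text))
```

With these definitions, decode_utf8(encode_utf8(cp)).value equals
cp.value for any valid cp, while encode_utf8(0x41) raises a type error
because a bare integer is not a CodePoint.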
--
Brian Mastenbrook
brian at mastenbrook.net
http://brian.mastenbrook.net/