[r6rs-discuss] [ANN] scheme-reports.org
cowan at ccil.org
Mon Aug 24 21:37:33 EDT 2009
Brian Mastenbrook scripsit:
> After thinking about this for a while, I'm convinced that there is
> value to having a tagged type to represent individual code points. I
> believe that the facilities provided by the language (or is that "the
> language, working group 2"?)

I call it "Thing Two".

> It is the smallest unit of text which is idempotent under encoding and
> decoding, which means that it is for all practical purposes
> indivisible. (I don't think half of a surrogate pair counts as a
> proper division of a code point, and it's actually a rather dangerous
> thing to have lying around.)

I find this convincing. Codepoints are the appropriate smallest unit.

> It is logically distinguishable from an integer; while every code
> point can be uniquely mapped to an integer, not every integer can be
> mapped to a code point, and the operations defined on integers don't
> make sense on code points.
> I'm also not convinced by the argument that a string of length one
> removes the need for a separate tagged representation for the units of
> which the string is composed. The most primitive facility provided by
> any decoder or encoder is a mapping between code points and sequences
> of bytes; when working at that level, I'd prefer to have a type with a
> disjoint predicate representing the well-defined input type I am

I provide another intuition pump. Back in the Very Old Days, when symbols
were the only kind of strings Lisp had, people did string work with EXPLODE
and IMPLODE, mapping symbols to and from a list of the characters in
the symbol's print name. Those characters were themselves symbols,
not a distinct datatype. That worked fine.
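
For readers who have not seen EXPLODE and IMPLODE, here is a rough sketch of
the same idea in Python rather than Lisp. Python happens to have no separate
character type, so each element of the exploded list is itself a length-one
string. The helper names explode and implode are hypothetical, chosen to
mirror the old operations; they are not a real library API.

```python
# Hypothetical helpers named after the old Lisp operations (an assumption
# for illustration, not an existing library).

def explode(name):
    """Split a name into its code points, each a length-one string."""
    return list(name)

def implode(chars):
    """Rebuild the name from a list of length-one strings."""
    return "".join(chars)

parts = explode("car")
print(parts)            # ['c', 'a', 'r']
print(type(parts[0]))   # <class 'str'>: same type as the whole name
print(implode(parts))   # car
```

The pieces and the whole share one datatype, which is the property the
symbol-based approach had.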

The argument from encoding seems irrelevant to me. One can do a *better*
job of encoding if handed whole strings: the string "a\x0301;" can be
intelligently encoded into ISO 8859-1 as the bytevector #vu8(#xE1),
whereas the individual character #\x301 can't be encoded in 8859-1 at all.
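
The encoding point can be checked with a small Python sketch (Python rather
than Scheme, purely for illustration). It assumes that "intelligently
encoded" means composing the string to Normalization Form C before encoding;
the helper name encode_latin1 is hypothetical.

```python
import unicodedata

def encode_latin1(text):
    """Compose to NFC first (assumed strategy), then encode as ISO 8859-1."""
    return unicodedata.normalize("NFC", text).encode("latin-1")

# The whole string: 'a' followed by U+0301 COMBINING ACUTE ACCENT.
print(encode_latin1("a\u0301"))   # b'\xe1'

# The combining character on its own has nothing to compose with.
try:
    encode_latin1("\u0301")
except UnicodeEncodeError:
    print("U+0301 alone has no ISO 8859-1 encoding")
```

An encoder that only ever sees one code point at a time cannot perform the
composition step, which is why it does worse than one handed the whole string.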

Likewise, since integer->char isn't total, there's no real reason why
a version of char->integer that accepts single-codepoint strings should
be either. As I pointed out in my last posting, this says nothing about
the underlying implementation, which may well represent single-codepoint

Business before pleasure, if not too bloomering long before.
--Nicholas van Rijn
John Cowan <cowan at ccil.org>