[r6rs-discuss] Re: [Formal] formal comment (ports, characters,
William D Clinger
will at ccs.neu.edu
Tue Mar 20 00:08:16 EDT 2007
I am posting this as an individual member of the Scheme
community. I am not speaking for the R6RS editors, and
this message should not be confused with the editors'
eventual formal response.
> Or when the abstraction leaks, as string-ref does for UTF-8 and UTF-16.
I don't understand what you mean by saying "the abstraction
leaks" for string-ref and/or UTF-8 and UTF-16, particularly
since the draft R6RS does not tell implementations to use
UTF-8 or UTF-16 or not to use UTF-8 or UTF-16.
> you think that being able to write string-find portably & efficiently is
Yes. With the current draft R6RS, that can be done only if
implementors have enough brains to provide O(1) amortized
time for string-ref. Implementors can accomplish that by
any one of dozens of plausible strategies. The simplest
strategy is to use UTF-32, and the more complex strategies
use a mixture of representations, some of which may use
I don't intend to teach a seminar here on implementation
strategies for O(1) string-ref, but I'll describe just one
simple strategy that achieves O(1) time for both string-ref
and string-set! while using only a little more space than
UTF-8. The basic idea is to represent every string by an
opaque, sealed record whose fields include a vector of
bytevectors. Each of those bytevectors except the last is the
UTF-8 encoding of exactly 100 characters; the last one
contains between 0 and 100 characters, inclusive, and
contains 0 characters iff the length of the entire string
Implementation of O(1) string-ref and string-set! for that
representation is left as an exercise for readers who
understand big-oh notation.
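For readers who would rather see the exercise worked, here is
a minimal sketch of the representation described above. It is
written in Python rather than Scheme, and it is only an
illustration, not anything taken from the draft R6RS or from
an actual implementation. The names ChunkedString, ref, set,
and to_str are hypothetical, and a Python list of bytes objects
stands in for the vector of bytevectors.

```python
CHUNK = 100  # characters per chunk, matching the description above


def _char_width(lead: int) -> int:
    """Byte length of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def _byte_offset(chunk: bytes, n: int) -> int:
    """Byte offset of the n-th character in a chunk.

    n is always below CHUNK, so this loop does a bounded amount of work.
    """
    offset = 0
    for _ in range(n):
        offset += _char_width(chunk[offset])
    return offset


class ChunkedString:
    """Hypothetical fixed-length string stored as UTF-8 chunks."""

    def __init__(self, text: str) -> None:
        self._length = len(text)
        # Every chunk except the last encodes exactly CHUNK characters.
        # An empty string is stored as a single empty chunk.
        self._chunks = [
            text[i:i + CHUNK].encode("utf-8")
            for i in range(0, len(text), CHUNK)
        ] or [b""]

    def __len__(self) -> int:
        return self._length

    def ref(self, k: int) -> str:
        """Analogue of string-ref: return the character at index k."""
        if not 0 <= k < self._length:
            raise IndexError(k)
        chunk = self._chunks[k // CHUNK]
        start = _byte_offset(chunk, k % CHUNK)
        end = start + _char_width(chunk[start])
        return chunk[start:end].decode("utf-8")

    def set(self, k: int, ch: str) -> None:
        """Analogue of string-set!: replace the character at index k."""
        if len(ch) != 1:
            raise ValueError("expected exactly one character")
        if not 0 <= k < self._length:
            raise IndexError(k)
        idx = k // CHUNK
        chunk = self._chunks[idx]
        start = _byte_offset(chunk, k % CHUNK)
        end = start + _char_width(chunk[start])
        # Rebuild only this chunk; the other chunks are untouched.
        self._chunks[idx] = chunk[:start] + ch.encode("utf-8") + chunk[end:]

    def to_str(self) -> str:
        """Decode the whole string (linear time, for inspection only)."""
        return b"".join(self._chunks).decode("utf-8")
```

Both ref and set touch a single chunk of at most 100 characters
(at most 400 bytes of UTF-8), so their cost is bounded by a
constant that does not depend on the length of the string.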
I don't expect any implementations to use a representation
as bad as the one I described above. That was just to show
that achieving O(1) time for string-ref and string-set! is
child's play compared to some of the other stuff mandated
by the current draft R6RS.
I do think most implementors have enough brains to provide
efficient O(1) amortized time string-ref, but I could be
wrong about that. Programmers who are paranoid about the
performance of string-ref can convert their strings to
bytevectors in whatever byte-level representation they
prefer, and hope that bytevector-ref is O(1).
To make it easier to write representation-specific
algorithms in Scheme, someone could write a SRFI that
provides conversions between R6RS strings and bytevectors
that represent text using UTF-8, UTF-16, or UTF-32,
and provides an appropriate set of operations for each
of those bytevector representations. I don't think this
SRFI needs to be part of the R6RS, since a portable
reference implementation would solve the portability
problem. Folding that SRFI into the R6RS wouldn't
make it run any faster.
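As a rough illustration of what such representation-specific
operations might look like, here is a small sketch in Python
(again not Scheme, and not a proposal for the SRFI's actual
interface). The function names utf8_find and utf32_ref are
hypothetical. The sketch relies only on Python's built-in
str.encode and bytes.find, and assumes its inputs are valid
UTF-8 or UTF-32 little-endian byte strings.

```python
def utf8_find(haystack: bytes, needle: bytes) -> int:
    """Character index of the first match of needle in haystack, or -1.

    Both arguments are assumed to be valid UTF-8. Because UTF-8 is
    self-synchronizing, a byte-level match of one valid UTF-8 string
    inside another always starts on a character boundary.
    """
    pos = haystack.find(needle)
    if pos < 0:
        return -1
    # Convert the byte offset to a character index by counting the
    # bytes before the match that are not continuation bytes (10xxxxxx).
    return sum(1 for b in haystack[:pos] if b & 0xC0 != 0x80)


def utf32_ref(buf: bytes, k: int) -> str:
    """Character at index k of a UTF-32 little-endian byte string."""
    if not 0 <= 4 * k < len(buf):
        raise IndexError(k)
    # Every character occupies exactly four bytes, so indexing is direct.
    return buf[4 * k:4 * k + 4].decode("utf-32-le")


text = "na\u00efve caf\u00e9"
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")
index = utf8_find(as_utf8, "caf\u00e9".encode("utf-8"))
```

The search runs on bytes without decoding the haystack, and the
result is still a character index that can be used with the
original string.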
> The one other consideration is the use of external libraries. Unicode is a
> very big standard, and parts of it (like collation) are very complicated.
> You really do not want to be writing your own implementation of the
> Unicode Collation Algorithm.
> Windows and Mac are both UTF-16. Java and .NET are both UTF-16. IBM's
> ICU--an excellent open source, cross-platform, cross-language [C, C++,
> Java] internationalization library--is UTF-16 (with increasing UTF-8
> support). Linux (and, I believe, Solaris) are UCS-4.
> You left out one popular encoding, UCS-2.
> On Linux, for example, UTF-8 is increasingly the default system
> encoding--but Linux's wide-chars are UCS-4. Many of libc's string
> operations--eg, strcoll--will work directly on UTF-8 strings; others first
> require conversion to UCS-4.
> These days UTF-8 is the overwhelming favorite for transmitting and storing
> text, and is the assumed default of almost any new standard.
Summarizing: No single encoding is going to solve the
problem of interfacing with external libraries (which,
by the way, is a problem the draft R6RS does not even
attempt to address).
Conclusion: The R6RS should not mandate any particular
encoding or representation of strings.
The current draft doesn't.