[r6rs-discuss] Why Unicode matters
bear at sonic.net
Tue Mar 3 04:51:59 EST 2009
On Wed, 2009-02-18 at 18:48 -0500, John Cowan wrote:
> As the R6RS process's chief Unicode hound, I'd like to say a word or
> two about why I think Unicode matters. There are at least three kinds
> of reasons.
> 1) If a process must deal with text, it should be designed from the
> ground up to deal with text in a universal encoding, converting to local
> encodings only when required to interface with surrounding systems. It's
> been estimated that building in Unicode adds perhaps 20% to development
> cost, whereas retrofitting it adds about 100%.
I think that 20% is too high a cost to bear when one is working on
code that one already knows will *not* be deployed widely. A typical
use for a Scheme program is a one-time format conversion of a large
database. Once it's done, the program gets erased. Why should I
spend an extra 20% effort when I know the target database does not
and will not use Unicode?
> That's an "industrial"
> motive to support Unicode, and although the (rnrs unicode (6)) library
> doesn't come close to providing all that's needed for practical work,
> it does provide a useful core.
The useful core should be *available,* true. It should be possible
to write Scheme programs without using it. It would be good, IMO, to
define a set of functions that a character library must provide, so
that it's possible for an implementation to define multiple character
libraries and for users to choose one to include.
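As a sketch of what such a required interface might look like (the
library name and the idea of pluggable charset libraries are my own
illustration, not anything in R6RS), an implementation could put its
native character operations behind a fixed set of exports, which R6RS
library syntax already permits by re-exporting imported bindings:

```scheme
;; Sketch only: one shape a required character-library interface
;; might take.  The name (charset native) is hypothetical; the
;; exported procedures are the standard ones from (rnrs base (6))
;; and (rnrs unicode (6)).
(library (charset native)
  (export char->integer integer->char
          char=? char<? char-upcase char-downcase
          char-alphabetic? char-numeric? char-whitespace?)
  (import (rnrs base (6))
          (rnrs unicode (6))))
```

A user program would then import (charset native), or some
(charset ebcdic) alternative, and the rest of its code would be
indifferent to which repertoire lies underneath.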
> 2) Scheme requires that there exist in the application domain strings
> which are constructed as sequences of characters. (I think that's a
> mistake: I'd rather have strings as primitives and understand characters
> to be a finite subset of short strings.) Having the significance and
> interpretation of characters differ from one implementation to the next
> is a needless kind of variation: in practice it means that portable
> programs must be confined to ASCII data.
One does not address needless variation by imposing needless uniformity.
It is *important* to the language that it should be possible to make
a useful conforming implementation using the local environment's native
character set. In some cases that's Unicode, so it's worthwhile for the
standard to speak about what a Unicode implementation ought to look
like. But in some cases it's not. There are still companies publishing
phone books that use IBM mainframes with an EBCDIC encoding; requiring
Unicode semantics for all the functions in the Scheme language makes a
conforming implementation utterly useless in their environments. So it's
worthwhile for the standard to allow other character sets too.
> Breaking the historical link
> between characters and octets is something that should be done in the
> core whether or not anything else about Unicode is supported.
Supporting octets (as opposed to characters) is something that has
needed doing for a long time. Proper support for octets should have
allowed higher-level concepts like characters and strings to be moved
out of the core and into libraries, where people could define
libraries to support various character sets and programmers
could choose which library - Unicode or other - to load.
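R6RS does in fact provide the lower layer already: (rnrs bytevectors
(6)) gives raw octets, and utf8->string is one decoding among the
possible many. A small sketch of the two layers:

```scheme
;; Octets live in a bytevector; a decoding procedure turns them
;; into a string.  utf8->string is the standard R6RS decoder for
;; UTF-8; a different library could supply a different one.
(import (rnrs base (6))
        (rnrs bytevectors (6)))

(define raw (make-bytevector 3))
(bytevector-u8-set! raw 0 #x73)   ; octet for #\s
(bytevector-u8-set! raw 1 #x63)   ; octet for #\c
(bytevector-u8-set! raw 2 #x68)   ; octet for #\h

(utf8->string raw)                ; => "sch"
```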
> 3) But most deeply, I believe, is the fact that Scheme programmers are
> themselves dealing with text when they write their programs, and if the
> repertoire of characters allowed in a program is non-universal, the result
> is an unfair disadvantaging of people who use another repertoire natively.
This argument is political, not technical.
In practice I agree with it; but I don't agree that Unicode is a
truly universal encoding. Unicode is a good choice, but I don't
see this as a situation where there has to be a single choice.
Scheme syntax _requires_ the parentheses, a very few other
punctuation characters, Arabic digits for its numeric syntax, and
Latin letters to spell the names initially bound to its defined
procedures. Although the standard may and probably ought to
recommend Unicode semantics and define what a conforming Unicode
implementation looks like, I do not believe that the standard
should require more of the character set than that it have the
characters Scheme syntax requires.
It should be possible to implement a conforming Scheme system
using any encoding where the encoding has the characters
that Scheme's syntax requires. The standard should certainly
*allow* conforming Unicode implementations (R5RS didn't, and R6RS
was supposed to fix that) but it should not require Unicode
semantics in environments where Unicode isn't the local machine's
native character set.
More than that, it should be possible to load one's *choice* of
character encoding libraries, written in Scheme, now that we
have binary octets in the core of the language.
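For instance, a Latin-1 decoder is small enough to write portably in
Scheme on top of octets (the name latin-1->string is my own; every
procedure used is standard R6RS):

```scheme
;; Sketch of a user-written encoding library over octets.
;; Latin-1 is the easy case: each octet 0-255 maps directly to the
;; Unicode code point of the same value.
(import (rnrs base (6))
        (rnrs bytevectors (6)))

(define (latin-1->string bv)
  (let ((n (bytevector-length bv)))
    (let loop ((i 0) (chars '()))
      (if (= i n)
          (list->string (reverse chars))
          (loop (+ i 1)
                (cons (integer->char (bytevector-u8-ref bv i))
                      chars))))))
```

An implementation shipping only this library, plus the octet core,
would serve the one-shot conversion jobs described above without
carrying the full Unicode tables.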