[r6rs-discuss] Strings as codepoint-vectors: bad

Jason Orendorff jason.orendorff at gmail.com
Thu Mar 15 11:13:15 EDT 2007

I think people who favor strings-as-codepoint-vectors must think that the
codepoint is a good level of abstraction for text.  Really it's not.

   One or more Unicode characters may make up what the user thinks of
   as a character or basic unit of the language.  To avoid ambiguity
   with the computer use of the term character, this is called a
   grapheme cluster.  For example, `G' + acute-accent is a grapheme
   cluster: it is thought of as a single character by users, yet is
   actually represented by two Unicode code points.

   -- Unicode Standard Annex #29

In Java, C#, and in all likelihood Python 3.0, strings are immutable
sequences of 16-bit values (UTF-16 code units).  Surrogate pairs get
no special treatment: indexing, length, and slicing all count code
units, and a non-BMP character simply occupies two of them.
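A minimal Java sketch of what that means in practice (the musical
G clef, U+1D11E, is just an arbitrary non-BMP character):

```java
public class CodeUnits {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so in
        // UTF-16 it is stored as the surrogate pair D834 DD1E.
        String clef = "\uD834\uDD1E";

        // length() counts 16-bit code units, not codepoints.
        System.out.println(clef.length());                         // 2
        // codePointCount() reveals the single underlying codepoint.
        System.out.println(clef.codePointCount(0, clef.length())); // 1
    }
}
```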

This is a good design.

Treating a string as a sequence of Unicode codepoints has few
real-world use cases.  For ordinary text-munging, we use higher-level
functions such as (string-append), (string-find), (string-replace),
(string-starts-with?), and so on.  In other words, the objects we want
to use when working with strings are... substrings.  Note that all
these useful functions can be implemented "naively" in terms of UTF-16
code units and they'll work just fine, even on surrogate pairs.
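A sketch of why the naive implementations are safe, again in Java
(the G clef character is arbitrary; any non-BMP codepoint would do):

```java
public class NaiveSearch {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";             // U+1D11E, a surrogate pair
        String s = clef + "-clef";

        // indexOf/startsWith/replace scan 16-bit code units naively,
        // yet they still match whole characters: high and low
        // surrogates occupy code-unit ranges disjoint from ordinary
        // BMP characters, so a match can never start or end in the
        // middle of a pair.
        System.out.println(s.startsWith(clef));   // true
        System.out.println(s.indexOf("-clef"));   // 2 (a code-unit index)
        System.out.println(s.replace(clef, "G")); // "G-clef"
    }
}
```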

The only use cases I know of for codepoint sequences are to implement
Unicode algorithms, like laying out bidirectional text.  Here UTF-16
is no real burden compared to the sheer complexity of the task at
hand. (See http://unicode.org/reports/tr9/ for example.)

By contrast, passing a UTF-16 string to some external function is an
extremely common and important use case.  It's especially important on
Windows and for anything that targets the JVM or CLR.

I think people who favor strings-as-codepoint-vectors must also think
that breaking a surrogate pair is really bad.  But even with a
codepoint-centric view of text you can unwittingly break a grapheme
cluster, which amounts to the same sort of bug--it can lead to garbled
text--and which is probably much *more* common in practice.  I never
hear anyone complain about that.
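A sketch of that failure mode, assuming a decomposed accented letter
('e' followed by U+0301 COMBINING ACUTE ACCENT):

```java
public class BrokenCluster {
    public static void main(String[] args) {
        // "Ge\u0301nial" spells "Génial" with a decomposed é:
        // two codepoints, one grapheme cluster.
        String s = "Ge\u0301nial";

        // Slicing between the base letter and its combining accent is
        // perfectly legal at the codepoint level -- no surrogate pair
        // is broken -- but it still garbles the text: the prefix loses
        // the accent and the suffix begins with an orphaned accent.
        String prefix = s.substring(0, 2);  // "Ge", accent gone
        String suffix = s.substring(2);     // starts with U+0301
        System.out.println(prefix);
        System.out.println(suffix);
    }
}
```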

Making strings vectors of 16-bit values is simple, familiar,
speed-efficient, memory-efficient, easy to implement, and convenient
for programmers.
