agj at alum.mit.edu
Sun Mar 25 12:48:01 EDT 2007
| From: "Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl>
| Date: Sun, 25 Mar 2007 12:46:49 +0200
| On Sat, 24-03-2007, at 13:31 -0400, MichaelL at frogware.com wrote:
| > Summary
| > "This document attempts to make the case that it is advantageous to use
| > UTF-16 (or 16-bit Unicode strings) for text processing..."
| IMHO this is one of the worst mistakes Unicode is trying to make.
| It convinces people that they should not worry about characters above
| U+FFFF just because they are very rare. UTF-16 combines the worst
| aspects of UTF-8 and UTF-32.
| If size is important and variable width of the representation of a code
| point is acceptable, then UTF-8 is usually a better choice. If O(1)
| indexing by code points is important, then UTF-32 is better. Nobody
| wants to process texts in terms of UTF-16 code units. Nobody wants to
| have surrogate processing sprinkled around the code, and thus if one
| accepts an API which extracts variable width characters, then the API
| could as well deal with UTF-8, which is better for interoperability.
| UTF-16 makes no sense.
There also seems to be a hidden assumption in some posts that
character alignment can only be recovered if a string is scanned from
the beginning. This is not the case.
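Character alignment can be recovered from any octet within a UTF-8
encoded string. The octet which begins a code point can never be
mistaken for a continuation octet: continuation octets always have
#b10 as their two most significant bits, and leading octets never do.
Scanning backward from an arbitrary octet therefore reaches the start
of the enclosing code point after at most three steps.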
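
To make the alignment rule concrete, here is a minimal sketch in Python.
The function name align_to_code_point is my own invention, and the code
assumes the input is valid UTF-8 and that i is a valid index into it.

```python
def align_to_code_point(data: bytes, i: int) -> int:
    """Return the index of the first octet of the code point containing data[i].

    Assumes `data` is valid UTF-8 and 0 <= i < len(data).
    """
    # Continuation octets look like 0b10xxxxxx, so mask the top two bits
    # and step backward until we reach an octet that is not a continuation.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


# "a" is 1 octet, "é" is 2, "€" is 3, "😀" is 4.
sample = "aé€😀".encode("utf-8")
```

In valid UTF-8 the loop runs at most three times, because no code point
is encoded in more than four octets.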
There are algorithms (like binary search) which access a string at
approximate locations. Because realigning to a code point boundary
takes a bounded number of steps, the asymptotic running time of such
algorithms is not affected by using strings encoded in UTF-8.
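
As an illustration only, here is a sketch of a binary search that probes
a UTF-8 buffer at byte midpoints and realigns each probe to a code point
boundary. The setting is hypothetical: I assume the buffer is valid
UTF-8 whose code points are in strictly ascending order, and the names
find_code_point, _start and _length are made up for this example.

```python
def _start(data: bytes, i: int) -> int:
    # Step back over continuation octets (0b10xxxxxx) to the leading octet.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


def _length(lead: int) -> int:
    # Encoded length implied by a leading octet of valid UTF-8.
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    return 4


def find_code_point(data: bytes, target: str) -> int:
    """Return the byte offset of `target` in `data`, or -1 if absent.

    Assumes `data` is valid UTF-8 with strictly ascending code points
    and `target` is a single character.
    """
    lo, hi = 0, len(data)  # both always sit on code point boundaries
    while lo < hi:
        mid = _start(data, (lo + hi) // 2)  # realign the approximate probe
        end = mid + _length(data[mid])
        current = ord(data[mid:end].decode("utf-8"))
        if current == ord(target):
            return mid
        if current < ord(target):
            lo = end
        else:
            hi = mid
    return -1


table = "aeé€😀".encode("utf-8")
```

Each probe does a constant amount of extra work for the realignment, so
the search still takes a logarithmic number of probes in the length of
the buffer.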