jason.orendorff at gmail.com
Tue Mar 27 09:01:12 EDT 2007
Jon Wilson wrote:
> Jason Orendorff wrote:
> > And most (but not all) Unicode string implementations use UTF-16.
> > Among languages and libraries that are very widely used, the majority
> > Xerces-C, and on and on. (The few counterexamples use UTF-8: glib,
> > expat. And expat can be compiled to use UTF-16.)
> If this is true, then I would expect to find relatively little mention
> of UTF-8 compared to UTF-16 on the internet. However, the google test
> turns up *1,040,000* for *utf-16* versus *173,000,000* for *utf-8*.
> Now, of course I realize that this is a particularly crude technique for
> determining the relative popularity of UTF-8 and UTF-16, but even a very
> crude technique does not cause this much of a discrepancy. 173 : 1 is
> quite a steep ratio.
By this reckoning, UTF-8 is more popular than Unicode, which only
gets 39,000,000 hits. Actually, according to Google, UTF-8 is more
popular than Jesus.
Incidentally, if you don't adjust for cluefulness, UTF-16 is more often
called "Unicode". Dreadful but true, especially in the Windows and
Java worlds. Bottom line: nobody thinks about this stuff except
language designers and highly clueful library designers.
> The Internet Engineering Task Force (IETF) requires all Internet
> protocols to identify the encoding used for character data with UTF-8 as
> at least one supported encoding.
As a *transmission* format, UTF-8 is much more common than UTF-16,
for good reasons--but nowhere near as common as, say, Latin-1. In
other words, when doing I/O, a transcoding step is usually necessary.
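
To make the transcoding step concrete, here is a minimal Python sketch.
It is an illustration only, not code from any of the libraries named
above: the transcode() helper and the sample string are made up, and
using UTF-16 as the in-memory form is an assumption chosen to match
the discussion.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the src encoding and re-encode them as dst."""
    return data.decode(src).encode(dst)

# Hypothetical input: the text "cafe" with an acute accent, arriving as Latin-1.
latin1_input = "caf\u00e9".encode("latin-1")

# Transcode on the way in to the assumed internal form (UTF-16, little-endian).
internal = transcode(latin1_input, "latin-1", "utf-16-le")

# Transcode again on the way out, this time to UTF-8 for transmission.
utf8_output = transcode(internal, "utf-16-le", "utf-8")

# The same four characters take a different number of bytes in each encoding.
print(len(latin1_input), len(internal), len(utf8_output))
```

The point is only that the bytes on the wire and the bytes in memory
rarely share an encoding, so a conversion happens at the I/O boundary
whichever internal representation is chosen.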