Uniconundrum

I really want to get the Unicode support in Smile right. I really do.

The unfortunate thing is that it seems there’s no consensus on what “right” looks like. Some people argue that UTF-8 is the right answer, some argue that UTF-16 is the right answer, some argue that UCS-4 is the right answer, and if you ask a hundred people, you get a hundred different answers as to what’s right.

Y’all are complicatin’ my life ’cuz ya can’t agree on nothin’, I tells ya what.

I’ve already built a lot of the data tables necessary to process UCS-4 properly. But folks make a good argument that storing strings as UTF-16 (which I’d intended to do) isn’t going to fly. UTF-16 still has surrogate pairs, which are really not a lot better than the multibyte sequences of UTF-8, just far rarer for most languages. And UTF-16 and UCS-4 both come at a cost: every text file has to be translated to and from them, ballooning memory and computation time for a lot of programs for no real benefit.
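To put rough numbers on that trade-off, here is a small Python sketch (Python is only used for illustration, and the sample strings are my own). It counts how many code units the same text needs in each encoding; the helper name `code_units` is made up for this example.

```python
# Count the code units each encoding needs for the same text.
# Escapes are used so the samples don't depend on how this file was saved.
samples = {
    "ascii": "hello",
    "accented": "caf\u00e9",      # "café" with a precomposed é
    "emoji": "\U0001F600",        # a code point outside the 16-bit range
}

def code_units(text):
    """Return (UTF-8 bytes, UTF-16 units, UCS-4 units) for text."""
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le")) // 2   # 2 bytes per unit
    ucs4 = len(text.encode("utf-32-le")) // 4    # 4 bytes per unit
    return utf8, utf16, ucs4

for name, text in samples.items():
    print(name, code_units(text))
```

The emoji is the interesting row: one code point, but two UTF-16 units (a surrogate pair) and four UTF-8 bytes. Only UCS-4 keeps it at one unit, and it pays four bytes per character for plain ASCII to do so.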

Perl solved this by directly implementing support for UTF-8. Every “character” in Perl is still an 8-bit byte, but case changes and case-insensitivity are implemented by converting UTF-8 bytes into UCS-4 for case stuff. I like that, but it means that substring indexing can slice out a part of a code point, which is bad.
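The slicing problem is easy to reproduce. This is a minimal Python sketch, not how Perl or Smile actually do it; `is_valid_utf8` and `safe_prefix` are hypothetical helpers I made up to show one way of backing a cut off to a code point boundary.

```python
text = "na\u00efve"              # "naïve"
raw = text.encode("utf-8")       # 6 bytes for 5 code points

def is_valid_utf8(data):
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def safe_prefix(data, n):
    """Return data[:n], backing n up so it never lands mid-sequence.

    UTF-8 continuation bytes all look like 10xxxxxx, so if the byte at
    the cut point is one of those, the cut is inside a code point.
    """
    while 0 < n < len(data) and (data[n] & 0xC0) == 0x80:
        n -= 1
    return data[:n]

print(is_valid_utf8(raw[:3]))    # naive byte slice, cuts the ï in half
print(safe_prefix(raw, 3))       # backs up to the boundary before ï
```

The check is cheap because UTF-8 is self-synchronizing: you never have to scan from the start of the string to find a boundary, just a few bytes backward.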

But then even in UCS-4, you can still end up slicing “characters” in half, since you can have decomposed forms for characters like “é”, and a cut at a random index might cut off the accent and just leave you with “e”. It gets even worse in a language like Korean where you might end up cutting off part of a decomposed Hangul syllable.
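Here is that same failure at the code point level, sketched in Python with the standard `unicodedata` module. The sample characters are my own choice; the point is only that indexing by code point does not protect you from cutting a visible character apart.

```python
import unicodedata

# A precomposed é, then its decomposed form: 'e' plus a combining acute.
composed = "\u00e9"
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))
print(decomposed[:1])            # the accent is gone, only 'e' is left

# The Hangul syllable U+D55C decomposes into three jamo.
hangul = "\ud55c"
jamo = unicodedata.normalize("NFD", hangul)
print(len(jamo))

# Cutting off the last jamo and recomposing yields a different syllable.
truncated = unicodedata.normalize("NFC", jamo[:2])
print(truncated == hangul)
```

So "one array slot per character" is not something any of the encodings really gives you. It only moves the problem up a level, from bytes to code points to combining sequences.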

Java and .NET went for UTF-16, as did Python in its “narrow” builds. That isn’t perfect, but it plays nice with a lot of languages, at a cost in encoding/decoding time, and notably, it’s compatible with the Windows OS’s native implementation of UTF-16 under the hood. That causes a bit of trouble on something like Linux, where everything has to get translated back and forth, but it’s an improvement over no support at all.

What a mess.

My inclination, I think, is to go with UTF-8 for strings, and to have really complex implementations of the casing and diacritical stuff that work with UTF-8 directly. This is what Perl does, and it seems to work for them. Right now, I have Smile implemented using UTF-16, but that seems like the worst of all worlds: It’s neither a “pure” form like UCS-4, nor a “clean” multibyte form like UTF-8. The reasoning works like this:

  • Programmers in most languages are used to bytes equalling characters. Less cognitive friction is involved if this is true.
  • I/O for most text files can be handled as straight read/writes, instead of having to convert every character on the fly. This is a major performance boost for most text-crunching programs.
  • Ruby uses bytes for strings too, and Ruby was designed by a Japanese man for whom character encodings are a much more critical issue. If it’s good enough for people that need a writing system with thousands of characters in it, it should be good enough for my needs too.
  • You don’t incur the overhead of non-ASCII if you don’t need it. In the US and Europe, 95% of the text you encounter is just ASCII, so adding overhead to the 95% to support the 5% is bad for performance.
  • Mid-string indexing that could break UTF-8 characters isn’t that common in most applications; it’s far more typical to use whole strings, or at most break at known (usually ASCII) marks like newlines and commas, and you can do that with UTF-8, without decoding it into UCS-4.
  • A pair of to-UCS-4 and from-UCS-4 routines can handle the rare cases where you really do need to do indexing in the middle of a string, and you can then operate on an array of 32-bit integers as necessary.
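The last two points can be sketched together. This is Python for illustration only, and the function names (`split_fields`, `utf8_to_ucs4`, `ucs4_to_utf8`) are hypothetical, not Smile's actual routines. Splitting on an ASCII mark is safe on raw UTF-8 because every byte of a multibyte sequence has its high bit set, so a comma or newline byte can never appear inside one.

```python
def split_fields(data, sep=b","):
    """Split raw UTF-8 bytes on an ASCII separator, without decoding."""
    return data.split(sep)

def utf8_to_ucs4(data):
    """Decode UTF-8 bytes into a list of integer code points."""
    return [ord(ch) for ch in data.decode("utf-8")]

def ucs4_to_utf8(code_points):
    """Encode a list of integer code points back into UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")

# "Zoë,東京,naïve" as raw bytes, the way it would come off disk.
line = "Zo\u00eb,\u6771\u4eac,na\u00efve".encode("utf-8")

# The common case: break at commas and never look inside the fields.
fields = split_fields(line)
names = [f.decode("utf-8") for f in fields]

# The rare case: index into the middle of one field by code point.
cps = utf8_to_ucs4(fields[2])
middle = cps[1:4]                # the three code points of "aïv"
back = ucs4_to_utf8(middle)
print(names, middle, back)
```

Only the field that needs mid-string indexing pays for the round trip through 32-bit integers. Everything else stays as the bytes that were read.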

It’s not quite the backpedal it might seem. Smile still groks Unicode, just in a different way that will hopefully make it more future-proof. I still need a lot of the work I already did to support Unicode: I still need all those data tables. I just need to change String back to the 8-bit thing it used to be, and then include cleverer implementations of case-conversion, case-folding, normalization, and so on that can operate on UTF-8 directly. I do still need to include routines to convert to/from certain regional encodings like ISO Latin-1, but I can do those on 8-bit strings without needing a UCS-4 or UTF-16 internal representation.
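As a rough idea of what “operate on UTF-8 directly” could look like, here is a Python sketch under some loud assumptions: `upper_utf8` and `latin1_to_utf8` are names I invented, and Python's built-in `str.upper()` stands in for the Unicode data tables a real implementation would consult. It is a shape, not Smile's implementation.

```python
def seq_len(lead):
    """Length of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def upper_utf8(data):
    """Uppercase UTF-8 bytes, one sequence at a time."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # ASCII fast path: no decoding, no table lookup.
            out.append(lead - 32 if 0x61 <= lead <= 0x7A else lead)
            i += 1
        else:
            n = seq_len(lead)
            ch = data[i:i + n].decode("utf-8")
            # Assumption: str.upper() stands in for the case tables.
            out += ch.upper().encode("utf-8")
            i += n
    return bytes(out)

def latin1_to_utf8(data):
    """Convert ISO Latin-1 bytes to UTF-8 using only byte arithmetic."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)
        else:
            out.append(0xC0 | (b >> 6))     # lead byte
            out.append(0x80 | (b & 0x3F))   # continuation byte
    return bytes(out)

print(upper_utf8("stra\u00dfe".encode("utf-8")))
print(latin1_to_utf8(b"caf\xe9"))
```

Two things worth noticing: pure-ASCII text never leaves the fast path, and the output can be a different length from the input (the ß in the first example uppercases to “SS”), so case conversion can't be done in place.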

If UTF-8 is good enough for Perl and Ruby and this very vocal guy whose native language has vastly different requirements from English, I think that makes it good enough for me too.

(Have a disagreement? I’d be happy to hear your opinion, as long as you can keep it to 140 characters or less.)